Machine learning for the diagnosis of pulmonary hypertension

Objective This paper aims to investigate whether machine learning (ML) can be used to predict the type of pulmonary hypertension (PH), pre-capillary or post-capillary, from echocardiographic data.
Methods Two hundred and seventy-five patients with PH who underwent both echocardiography and right heart catheterization were included in the study. Mean pulmonary artery pressure and pulmonary artery wedge pressure measured by right heart catheterization were used as the criteria for distinguishing pre-capillary PH from post-capillary PH. Thirteen echocardiographic indicators were used to predict whether the PH was pre-capillary or post-capillary. Nine ML models were used to make predictions. Accuracy was used as the primary reference standard, and the performance of each classification model was assessed in conjunction with area under the curve (AUC), specificity (Sp), sensitivity (Se), Positive Prediction Value (PPV), Negative Prediction Value (NPV), Positive Likelihood Ratio (PLR), Negative Likelihood Ratio (NLR) and other measures.
Results The classification results of the nine ML models show that ML can effectively distinguish pre-capillary PH from post-capillary PH. LogitBoost performed best among the nine ML models (ACC=0.87, Recall=0.83, F1 score=0.85, AUC=0.87, Se=0.90, NPV=0.88, PPV=0.87, PLR=8.61 and NLR=0.18, AUC=0.83); it showed good results in the identification of both pre-capillary PH (ACC=0.83, Recall=0.87, F-score=0.85) and post-capillary PH (ACC=0.90, Recall=0.88, F-score=0.89). Decision Tree (ACC=0.75, Recall=0.77, F1 score=0.78, AUC=0.75, Se=0.72, NPV=0.78, PPV=0.77, PLR=3.66 and NLR=0.29, AUC=0.79) performed worst, and the accuracy of the other seven models was greater than 0.82.
Conclusion The classification results of the nine ML models in this paper indicate that ML methods can effectively distinguish pre-capillary PH from post-capillary PH using echocardiographic data. Unlike diagnosis based on right heart catheterization, ML methods can make this distinction non-invasively.


Introduction
Pulmonary hypertension (PH) is a pathophysiological disorder defined as a mean pulmonary artery pressure (PAPm) ≥25 mm Hg assessed by right heart catheterization (RHC) [1]. According to the latest European Society of Cardiology (ESC) and European Respiratory Society (ERS) guidelines, PH can be divided into two subgroups, pre-capillary and post-capillary PH, which differ hemodynamically in having a pulmonary artery wedge pressure (PAWP) ≤15 or >15 mm Hg, respectively [2]. Left heart disease is the primary etiology of post-capillary PH, which is treated differently from pre-capillary PH [3]. So far, RHC has been the gold standard for distinguishing pre-capillary PH from post-capillary PH. However, RHC is invasive and expensive, requires multiple indicators to be in a standard state, and may cause complications. Thus, a new non-invasive method needs to be explored. In this paper, 15 clinical parameters (including echocardiographic data and RHC data) from 275 patients were used for machine learning (ML).
Accurate classification of PH is not only beneficial to individual patients but also important for medicine. In clinical practice, manual diagnosis is time-intensive and may require a lot of information such as clinical test scores, laboratory findings, and informed reporter reports. The efficiency and accuracy of the diagnosis depend on the professionalism of the doctor, which makes it a daunting task in areas with poor medical conditions. ML is an advanced computing technology that can improve the analysis of medical data and automatically make diagnostic decisions [4].
The echocardiographic data from 275 patients were analyzed by various ML algorithms, including decision tree learning (Decision Tree), an instance-based algorithm (K-nearest neighbor, KNN), kernel-based algorithms (Support Vector Machine (SVM), Linear Discriminant Analysis (LDA)), ensemble algorithms (Random Forest, AdaBoost, LogitBoost), and a regression algorithm (Logistic Regression). This paper aims to explore whether ML can effectively discriminate between pre-capillary PH and post-capillary PH, to compare the performance of different ML algorithms, and then to establish a specific ML model that discriminates between the two types of PH effectively.

Data Collection
RHC and echocardiography data were collected from 275 patients with suspected PH at the First Affiliated Hospital of Nanjing Medical University between April 2013 and March 2018. According to the hemodynamic parameters in the RHC data, patients were classified into two groups: class 1 with PAPm ≥25 mm Hg and PAWP ≤15 mm Hg (pre-capillary PH) and class 2 with PAPm ≥25 mm Hg and PAWP >15 mm Hg (post-capillary PH). The baseline characteristics of the echocardiographic parameters in the study population are shown in table 1. Informed consent was obtained from all patients, and the study was approved by the First Affiliated Hospital of Nanjing Medical University Review Board. Figure 1 shows the workflow of our methods. The original data were cleaned according to medical domain knowledge, and modeling and prediction analysis were then carried out on the cleaned data using 10-fold cross-validation, which splits the data into 10 folds and uses each fold in turn as the test set and the remaining folds as the training set. Afterward, accuracy, precision, sensitivity, specificity, area under the receiver operating characteristic (ROC) curve (AUC), F1-score and other evaluation measures were used to analyze the performance of the models.
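The grouping rule above can be written as a small function. This is only a minimal sketch under stated assumptions: the function name `classify_ph`, the constant names and the example readings are hypothetical and not taken from the study; only the thresholds (PAPm ≥25 mm Hg, PAWP ≤15 or >15 mm Hg) come from the text.

```python
from typing import Optional

PAPM_THRESHOLD = 25.0  # mm Hg; PH requires PAPm >= 25
PAWP_THRESHOLD = 15.0  # mm Hg; PAWP <= 15 means pre-capillary PH


def classify_ph(papm: float, pawp: float) -> Optional[int]:
    """Return 1 (pre-capillary PH), 2 (post-capillary PH) or None (PAPm below the PH threshold)."""
    if papm < PAPM_THRESHOLD:
        return None  # does not meet the hemodynamic definition of PH
    return 1 if pawp <= PAWP_THRESHOLD else 2


# Hypothetical RHC readings (PAPm, PAWP) in mm Hg, for illustration only.
examples = [(32.0, 10.0), (40.0, 22.0), (18.0, 9.0)]
labels = [classify_ph(papm, pawp) for papm, pawp in examples]
```

These labels serve as the target variable; the RHC values that define them are not used as input features, since the aim is to predict the class from echocardiographic data alone.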

Construction of the diagnostic models
Nine ML models were selected for training and the fitting performance of each model was compared. To avoid overfitting, cross-validation was used to assess the fit of each predictive model, and independent test data were used to evaluate its generalization performance.

Model Selection
Logistic regression (LR) is a discriminative model: trained with the cross-entropy loss, it uses an activation function derived from the exponential family to form a binary classifier. SVM is a supervised learning model with associated learning algorithms for classification and regression analysis. LDA is a generalization of Fisher's linear discriminant; in ML, a linear discriminant model can be used as a linear classifier. KNN is a discriminative, non-parametric method for classification or regression [5]. As one of the most common predictive modeling methods in ML, the decision tree (as a predictive model) goes from observations about an item (represented in the branches) to conclusions about the item's target value (represented in the leaves). Considering that ensemble models tend to give better predictions and more stable models, this study chose four ensemble models based on two ensemble strategies (boosting and bagging). AdaBoost is the best-known boosting algorithm. By default it uses a CART (Classification And Regression Trees) tree as the weak classifier, and it achieves high classification accuracy. LogitBoost is also a boosting algorithm. The original paper [6] cast the AdaBoost algorithm into a statistical framework: if AdaBoost is considered as a generalized additive model and the cost function of logistic regression is then applied, the LogitBoost algorithm can be derived. Gradient boosting is an ML technique for regression and classification problems that produces a predictive model in the form of an ensemble of weak prediction models (usually decision trees), and it can be interpreted as an optimization algorithm on a suitable cost function [7].
Figure 1. The workflow of PH data processing in our method to develop and validate the diagnostic model
Random forest (RF) is an ensemble learning method for classification, regression and other tasks that constructs multiple decision trees during training and outputs the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees [8,9].
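The relationship between AdaBoost and LogitBoost mentioned above can be written out briefly. The following is a sketch in the notation commonly used for the additive-model view of boosting, not a formula taken from this study; the symbols F, f_m, M and p are assumptions introduced here for illustration, with class labels coded as y in {-1, +1}.

```latex
% Additive model built from M weak learners f_m
\[ F(x) = \sum_{m=1}^{M} f_m(x) \]

% AdaBoost: stagewise minimization of the exponential loss
\[ J_{\mathrm{exp}}(F) = \mathbb{E}\left[ e^{-y F(x)} \right] \]

% LogitBoost: Newton steps on the logistic loss of the same additive model
\[ J_{\mathrm{logit}}(F) = \mathbb{E}\left[ \log\left( 1 + e^{-2 y F(x)} \right) \right],
   \qquad
   p(x) = \frac{e^{F(x)}}{e^{F(x)} + e^{-F(x)}} \]

% Both losses have the same population minimizer (half the log-odds)
\[ F^{*}(x) = \frac{1}{2} \log \frac{P(y = 1 \mid x)}{P(y = -1 \mid x)} \]
```

Because the logistic loss grows only linearly for badly misclassified samples, whereas the exponential loss grows exponentially, LogitBoost is generally regarded as less sensitive to noisy labels than AdaBoost.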

Cross Validation
As shown in figure 2, all data are divided into two parts: training data (80 %) and test data (20 %). The training data are randomly split into ten folds; in each round, nine folds are used for training and the remaining fold is used for testing. Each round yields a corresponding accuracy or error rate, and the average over the 10 rounds is used as an estimate of the accuracy of the algorithm. In general, 10-fold cross-validation is repeated several times and the mean of the results is taken as the estimate of the algorithm's accuracy.
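The procedure described above can be sketched in plain Python. This is an illustrative sketch, not the code used in the study: the helper names (`k_fold_indices`, `cross_val_accuracy`), the `fit`/`predict` callables and the majority-class baseline are assumptions made for the example, and the demo data are synthetic.

```python
import random
from typing import Any, Callable, List, Sequence, Tuple


def k_fold_indices(n_samples: int, k: int = 10, seed: int = 0) -> List[List[int]]:
    """Shuffle the indices 0..n_samples-1 and deal them into k folds of near-equal size."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_val_accuracy(
    X: Sequence[Any],
    y: Sequence[Any],
    fit: Callable[[list, list], Any],
    predict: Callable[[Any, Any], Any],
    k: int = 10,
    seed: int = 0,
) -> Tuple[float, List[float]]:
    """Train on k-1 folds, test on the held-out fold, and average the k accuracies."""
    scores = []
    for held_out in k_fold_indices(len(X), k, seed):
        held = set(held_out)
        train_idx = [i for i in range(len(X)) if i not in held]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        correct = sum(predict(model, X[i]) == y[i] for i in held_out)
        scores.append(correct / len(held_out))
    return sum(scores) / len(scores), scores


# Hypothetical baseline: always predict the most frequent training label.
def fit_majority(X_train: list, y_train: list) -> Any:
    return max(set(y_train), key=y_train.count)


def predict_majority(model: Any, x: Any) -> Any:
    return model


# Synthetic stand-in for the 220 training patients (108 class 1, 112 class 2).
X_demo = [[float(i)] for i in range(220)]
y_demo = [1 if i < 108 else 2 for i in range(220)]
mean_acc, fold_acc = cross_val_accuracy(X_demo, y_demo, fit_majority, predict_majority)
```

Repeating the call with different `seed` values and averaging the resulting means corresponds to the repeated 10-fold cross-validation mentioned in the text.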

Model Training and Evaluation
The dataset was randomly split into training (80 %) and testing (20 %) sub-datasets. Of the 275 patients, 220 (108 pre-capillary PH, 112 post-capillary PH) were used as training data, and the remaining 55 (31 pre-capillary PH, 24 post-capillary PH) were used as test data to verify the generalization performance of the models. The nine models were then each trained and used for prediction. Stratified cross-validation was used to validate these models; accuracy was used to evaluate the performance of the different models on the training data, and the confusion matrix was used to show the classification results. After the models were applied to the test data, accuracy was also used to evaluate their generalization performance.
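A random 80/20 split followed by a class-balanced assignment of the training samples to folds could look like the sketch below. It is illustrative only: the function names, the seed and the round-robin stratification scheme are assumptions, not the procedure actually used by the authors; the sample counts (275, 220, 55, 108, 112) are taken from the text.

```python
import random
from collections import Counter, defaultdict
from typing import Dict, Hashable, List, Sequence, Tuple


def train_test_split_indices(
    n_samples: int, test_fraction: float = 0.2, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Randomly split indices into a training part and a test part."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


def stratified_folds(labels: Sequence[Hashable], k: int = 10, seed: int = 0) -> List[List[int]]:
    """Deal the samples of each class round-robin into k folds so class ratios are preserved."""
    rng = random.Random(seed)
    by_class: Dict[Hashable, List[int]] = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds: List[List[int]] = [[] for _ in range(k)]
    offset = 0  # continue where the previous class stopped, to keep fold sizes even
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        for j, idx in enumerate(members):
            folds[(offset + j) % k].append(idx)
        offset += len(members)
    return folds


# 275 patients -> 220 for training, 55 for testing.
train_idx, test_idx = train_test_split_indices(275, test_fraction=0.2, seed=42)

# Training labels as reported: 108 pre-capillary (1) and 112 post-capillary (2).
train_labels = [1] * 108 + [2] * 112
folds = stratified_folds(train_labels, k=10, seed=42)
per_fold_counts = [Counter(train_labels[i] for i in fold) for fold in folds]
```

With these counts every fold holds 22 patients, of which 10 or 11 are pre-capillary and 11 or 12 are post-capillary, so each validation fold reflects the class ratio of the training set.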
To measure model performance more accurately, ten evaluation criteria were used in this paper to compare the models from multiple aspects. The following measures were used to evaluate the models comprehensively:
$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$ (a)
$\mathrm{Precision} = \frac{TP}{TP + FP}$ (b)
$\mathrm{Recall} = \frac{TP}{TP + FN}$ (c)
$F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$ (d)
TP, FP, TN and FN represent true positives, false positives, true negatives and false negatives, respectively. Accuracy (a) is the most common evaluation measure; it is calculated as the number of correctly predicted samples divided by the total number of samples. The higher the accuracy, the better the classification method. Precision (b) is the number of true positives divided by the number of samples predicted as positive. Recall (c) is a measure of coverage and is also called sensitivity. F-measure is the weighted harmonic mean of precision and recall; with the weight set to 1 it becomes the F1 score (d), which was used in this paper, and the higher it is, the more effective the method. Positive Prediction Value (PPV) is the proportion of truly "ill" cases (true positives) among all positive cases detected by the screening test, and reflects the probability that a subject with a positive test result has the target disease. Negative Prediction Value (NPV) is the proportion of true negatives among all samples predicted as negative, and reflects the probability that a subject with a negative test result does not have the target disease. The Positive Likelihood Ratio (PLR) and Negative Likelihood Ratio (NLR) combine the advantages of Sensitivity (Se), Specificity (Sp), PPV and NPV. They can be used to make predictions based on the presence or absence of certain alarm symptoms in patients, are not affected by the prevalence of the disease in the tested population, and are used in a variety of clinical settings.
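The confusion-matrix measures described above can be computed with a few lines of Python. The sketch uses the standard textbook definitions (with PLR = Se / (1 - Sp) and NLR = (1 - Se) / Sp); the function name `diagnostic_metrics` and the example counts are hypothetical and do not come from the study's tables. It assumes that no denominator is zero.

```python
from typing import Dict


def diagnostic_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Compute common diagnostic measures from confusion-matrix counts.

    Assumes every denominator is non-zero (both classes present and predicted).
    """
    se = tp / (tp + fn)   # sensitivity, identical to recall
    sp = tn / (tn + fp)   # specificity
    ppv = tp / (tp + fp)  # positive predictive value, identical to precision
    npv = tn / (tn + fn)  # negative predictive value
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": ppv,
        "recall": se,
        "f1": 2 * ppv * se / (ppv + se),
        "sensitivity": se,
        "specificity": sp,
        "ppv": ppv,
        "npv": npv,
        "plr": se / (1 - sp),   # positive likelihood ratio
        "nlr": (1 - se) / sp,   # negative likelihood ratio
    }


# Hypothetical counts, for illustration only.
metrics = diagnostic_metrics(tp=8, fp=2, tn=6, fn=4)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Note that precision and PPV, and likewise recall and sensitivity, are the same quantities under different names, so the listed measures are not all independent.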

Results
Accuracy, recall and seven other evaluation indicators were used to evaluate the nine models (Table 2). Table 4 shows the classification report for each model. Comparing the precision, recall and F-score for pre-capillary PH and post-capillary PH shows that the nine ML methods perform consistently across the two classes. LDA had the highest F-score for both pre-capillary PH and post-capillary PH, followed by LogitBoost.
To compare the performance of the models more intuitively, Figure 3 shows the ROC curve and the AUC of each model on the training data under 10-fold cross-validation, with 95 % confidence intervals. The AUC summarizes the overall performance of a trained classifier and is often used to judge the validity of a model's class predictions. It can be seen that SVM performs best (AUC=0.91), followed by LDA (AUC=0.90) and LogitBoost (AUC=0.89), and that Decision Tree performs worst (AUC=0.75). The ROC curves and AUC values of the models show that the nine machine learning methods can effectively classify pre-capillary and post-capillary PH. Figure 4 shows the ROC curve and the AUC of each model on the test data. For all nine ML algorithms except the decision tree, the AUC on the test set is greater than 0.8.
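For readers unfamiliar with the AUC, the sketch below computes it directly from its probabilistic interpretation: the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The function name `roc_auc`, the 0/1 label coding and the example scores are assumptions for illustration; this is not the code behind Figures 3 and 4.

```python
from typing import List, Sequence


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC via pairwise comparison of positive (1) and negative (0) scores."""
    pos: List[float] = [s for s, t in zip(scores, y_true) if t == 1]
    neg: List[float] = [s for s, t in zip(scores, y_true) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present to compute the AUC")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Hypothetical classifier scores for four samples.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

The pairwise loop is quadratic in the number of samples, which is acceptable for a cohort of a few hundred patients; an AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation.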

Discussion
PH is a pathophysiological disorder that can be divided into different types with different hemodynamic characteristics. RHC is the clinical gold standard for distinguishing the PH subgroups. Because RHC is an invasive procedure and cannot be widely used in clinical practice, an easier and more convenient method is needed. To provide a pre-screening step before RHC, clinical data of 275 patients were analyzed by selected ML methods, and a model was established. Based on the application characteristics of each model, nine models were selected for modeling and analysis, and their performance was evaluated by accuracy, precision, recall, F-measure and other measures.
There are many studies on the diagnosis of PH by echocardiographic parameters, but few on the differential diagnosis of pre-capillary and post-capillary PH [10,11]. This paper combined ML methods with experimental testing of multiple models, and the conclusions reached through the various assessments can be well explained from a medical perspective. The overall evaluation results indicate that echocardiographic data can be used to separate pre-capillary PH from post-capillary PH, which is significant for clinical practice. With our model, diagnosis can be made and treatment selected more rapidly and conveniently.
However, because of the limited number of data samples and features, less flexible approaches can achieve better results than the more popular ML algorithms. The data were tested in a relatively small sample and in a single medical center. In the future, more data samples and features will be added, and the prediction model will be improved to distinguish pre-capillary from post-capillary PH with echocardiographic parameters instead of RHC parameters, which will be more convenient and non-invasive.

Conclusions
The results of this study indicate that ML models can effectively distinguish pre-capillary PH from post-capillary PH on the basis of echocardiographic data, which can assist doctors in diagnosis. Moreover, by comparing nine ML methods on the classification of pre-capillary and post-capillary PH, an ML algorithm that can accurately classify the two types was identified. Among the nine classification models, the LogitBoost algorithm performed best and Decision Tree performed worst. Linear SVM, KNN and Random Forest models also showed good classification accuracy.