Abstract
Background: The use of pulmonary function tests is primarily based on expert opinion and international guidelines. Current interpretation strategies use predefined cutoffs to describe a typical pattern. Objectives: We aimed to explore the predicted disease outcome based on the American Thoracic Society/European Respiratory Society (ATS/ERS) interpreting strategy. Subsequently, we investigated whether an unbiased machine learning framework integrating lung function with clinical variables may provide alternative decision trees resulting in a more accurate diagnosis. Methods: Our study included data from 968 subjects admitted for the first time to a pulmonary practice. The final clinical diagnosis was based on the combination of complete pulmonary function with the investigations that were decided at the physician's discretion. Clinical diagnoses were separated into 10 different groups and validated by an expert panel. Results: The ATS/ERS algorithm resulted in a correct diagnostic label in 38% of the subjects. Chronic obstructive pulmonary disease (COPD) was detected with an acceptable accuracy (74%), whereas all other diseases were poorly identified. The new data-based decision tree improved the general accuracy to 68% after 10-fold cross-validation when detecting the most common lung diseases, with a significantly higher positive predictive value and sensitivity for COPD, asthma, interstitial lung disease, and neuromuscular disorder (83/78, 66/82, 52/59, and 100/54%, respectively). Conclusions: Our data show that the current algorithms for lung function interpretation can be improved by a computer-based choice of lung function and clinical variables and their decision-making thresholds.
Introduction
Pulmonary function tests (PFTs) are the prime tool of respiratory physicians. In clinical medicine, PFTs are often used to evaluate respiratory symptoms, to diagnose diseases, and to assess functionality and preoperative risk [1]. After taking the patient's history, the PFT is often combined with blood analysis, lung imaging, and more specific tests (such as skin prick tests, bronchial challenge, exercise testing, bronchoscopy with biopsies, bronchoalveolar lavage, etc. [2,3,4]) to come to a final diagnosis [5]. When considering the value of PFTs alone, the Belgian Pulmonary Function Study (BPFS) clearly showed that spirometry, resistance, lung volume, and diffusing capacity significantly and independently contributed to the diagnostic workup [6]. When combined with clinical history, PFTs had an accuracy of 77% in predicting the diagnosis, which may indicate that these tests (when correctly interpreted) have a high potential for an appropriate diagnostic labeling.
The use of PFTs is primarily based on expert opinion and international guidelines predominantly dealing with asthma, chronic obstructive pulmonary disease (COPD), and lung fibrosis [7,8]. In 2005, an American Thoracic Society/European Respiratory Society (ATS/ERS) task force designed a simplified algorithm to assess lung function in clinical practice [9]. It involves the recognition of typical patterns with strictly defined cutoffs for abnormality (obstructive, restrictive, or mixed pattern) and grades severity over time. Various algorithms have incorporated this strategy, and most PFT equipment now has built-in software that can generate a preliminary interpretation [10,11,12]. Unfortunately, these algorithms are not well established; they remain rather descriptive, do not deal with atypical patterns, and do not rely on clinical characteristics for diagnostic suggestion. In daily practice, these shortcomings are tackled by an expert clinical reading, which may explain why none of these automated protocols has found its way into clinical routine. Interestingly, the opposite has happened with automated protocols for electrocardiography (ECG) recordings [13,14].
In this study, we aimed to explore whether the interpretation of PFTs for respiratory disease labeling can be automated. By developing a computer algorithm based on the ATS/ERS interpreting strategy, we checked the predicted disease outcome in a large population. Subsequently, we investigated whether a machine learning framework may provide new decision algorithms that result in a more accurate diagnosis.
Subjects and Methods
Study Population
The study included the data of 968 subjects from the BPFS, a prospective cohort study that enrolled a clinical population-based sample of all successive undiagnosed patients admitted for the first time to one of the 33 participating Belgian hospitals due to respiratory symptoms [6]. The study was performed in the periods from June 6 to September 12, 2011, and from January 16 to June 12, 2012. Briefly, all enrolled subjects were Caucasians between 18 and 75 years old who had performed a complete PFT at cohort entry (including post-bronchodilator spirometry, whole-body plethysmography for lung volume and airway resistance, and diffusing capacity). To establish a respiratory disease diagnosis, all necessary additional tests including imaging, ECG, and other PFTs were performed at the physician's discretion. The final diagnosis for each subject was subsequently validated by Belgian local focus groups (groups of 20-25 pulmonologists) who jointly evaluated all test outcomes. The study population's baseline characteristics are shown in Table 1, covering healthy controls as well as a wide range of respiratory diseases that may present with a disturbed PFT with a specific pattern: asthma, COPD, other obstructive diseases (including bronchiolitis, bronchiectasis, and cystic fibrosis), upper airway obstruction, obesity, interstitial lung disease (ILD), systemic sclerosis or vascular disease, cardiac failure, and hyperventilation. Because of similar lung function disturbances, neuromuscular disorder (NMD) combined patients with chest wall or pleural disease, lung resection, or true neuromuscular disease. Other diagnoses (lung cancer, rhinosinusitis, etc.) were excluded from the analysis. The study protocol was approved by the Ethics Committee of the University Hospital in Leuven, which acted as the leading Ethics Committee, and by all the Ethics Committees of the participating hospitals, which acted as subsidiary Ethics Committees.
All included patients provided informed consent. The BPFS design can be found on www.clinicaltrials.gov (NCT01297881).
Pulmonary Function Tests
All PFTs were performed according to the ATS/ERS criteria [15] using standardized equipment (Masterscreen Jaeger; CareFusion, Germany). Spirometry data are post-bronchodilator measures and are expressed, along with plethysmography measurements of airway resistance and lung volumes, as percent predicted of normal reference values [16,17]. Diffusing capacity for carbon monoxide (DLCO) was measured by the single-breath gas transfer method, using a standardized carbon monoxide and helium gas mixture, and expressed as percent predicted of reference values [18].
PFT Interpretation Algorithm
To interpret the PFTs, the international guidelines of the ATS and the ERS were applied [9]. An important modification was made for the cutoff of FEV1/VC, where we applied a post-bronchodilator FEV1/FVC fixed ratio of 0.7, in accordance with the GOLD and our local clinical recommendations for the diagnosis of obstructive airways disease [8] (Fig. 1).
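The pattern recognition step described above can be sketched in a few lines of Python. This is a minimal illustration, not the study's implementation (which was written in MATLAB and follows Fig. 1): the function names `classify_pattern` and `suggest_label` are hypothetical, the branch order follows the commonly described ATS/ERS flow (ratio first, then VC, then TLC, then DLCO), and a mixed pattern is left without a disease label because the text does not specify one. All measurements are assumed to be expressed in the same units as their lower limit of normal (LLN).

```python
from typing import Optional


def classify_pattern(fev1_fvc: float, vc: float, vc_lln: float,
                     tlc: float, tlc_lln: float,
                     ratio_cutoff: float = 0.7) -> str:
    """Return 'normal', 'obstructive', 'restrictive', or 'mixed'.

    Uses the fixed post-bronchodilator FEV1/FVC cutoff of 0.7 described
    in the text instead of an LLN for the ratio.
    """
    obstructed = fev1_fvc < ratio_cutoff
    # Assumption: TLC only matters once VC is already below its LLN.
    restricted = vc < vc_lln and tlc < tlc_lln
    if obstructed:
        return "mixed" if restricted else "obstructive"
    return "restrictive" if restricted else "normal"


def suggest_label(pattern: str, dlco: float, dlco_lln: float) -> Optional[str]:
    """Split each pattern on DLCO relative to its LLN."""
    low_dlco = dlco < dlco_lln
    if pattern == "normal":
        return "pulmonary vascular disease" if low_dlco else "healthy"
    if pattern == "obstructive":
        return "COPD" if low_dlco else "asthma"
    if pattern == "restrictive":
        return "ILD" if low_dlco else "NMD"
    return None  # mixed pattern: no single label in this sketch


# Hypothetical subject: ratio 0.62, VC and TLC above their LLN, low DLCO.
pattern = classify_pattern(0.62, vc=3.1, vc_lln=2.8, tlc=6.0, tlc_lln=4.9)
print(pattern, suggest_label(pattern, dlco=55.0, dlco_lln=70.0))
```

Keeping the pattern step separate from the DLCO step mirrors the two-stage reading described in the text, so each stage can be checked on its own.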
Computer Algorithm and Validation
The development of a computer algorithm to automatically interpret all PFTs was performed in MATLAB 8.3 (The MathWorks, Natick, MA, USA). Using the Statistics and Machine Learning Toolbox, a new decision tree was developed based on lung function data (included measurements: FEV1, FVC, FEV1/FVC, PEF, FEF25, FEF50, FEF75, FEF25-75, RAW, sGAW, VC, RV, TGV, TLC, DLCO, KCO), combined with the patient characteristics age, pack-years, CAT score, gender, and BMI. The best split criterion was determined using maximal deviance reduction [19]. To avoid “overfitting” of the model and to get a better sense of the predictive accuracy of a decision tree, a stratified 10-fold cross-validation was performed [20]. Briefly, the training data were randomly split into 10 equal segments, and 10 new trees were trained on 9 segments and validated on the data from the segment not included in training. This method gives a better estimation of the predictive accuracy of the produced decision tree, since it tests and improves new trees on new data.
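The two ingredients named above, a deviance-based split criterion and stratified 10-fold cross-validation, can be illustrated with a short Python sketch. It is not the MATLAB code used in the study: the helper names are hypothetical, the node deviance is taken here as the cross-entropy of the class proportions (one common formulation), and the sample values are synthetic.

```python
import math
import random
from collections import Counter, defaultdict


def deviance(labels):
    """Cross-entropy impurity of a node: -sum(p_k * log2(p_k))."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_split(values, labels):
    """Return (threshold, reduction) for the best 'value < threshold' split."""
    n = len(labels)
    parent = deviance(labels)
    best = (None, 0.0)
    cuts = sorted(set(values))
    for lo, hi in zip(cuts, cuts[1:]):
        thr = (lo + hi) / 2.0  # midpoint between neighbouring observed values
        left = [y for x, y in zip(values, labels) if x < thr]
        right = [y for x, y in zip(values, labels) if x >= thr]
        children = (len(left) * deviance(left) + len(right) * deviance(right)) / n
        if parent - children > best[1]:
            best = (thr, parent - children)
    return best


def stratified_folds(labels, k=10, seed=0):
    """Assign every sample to one of k folds with similar class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    fold_of = [0] * len(labels)
    offset = 0
    for idxs in by_class.values():
        rng.shuffle(idxs)
        for j, i in enumerate(idxs):
            fold_of[i] = (j + offset) % k  # deal samples out round-robin
        offset += len(idxs)
    return fold_of


# Synthetic DLCO values (% predicted) and labels, for illustration only.
dlco = [40, 55, 60, 85, 90, 95]
diagnosis = ["COPD", "COPD", "COPD", "asthma", "asthma", "asthma"]
print(best_split(dlco, diagnosis))
```

In a full cross-validation loop, a tree would be trained on the samples of nine folds and scored on the held-out fold, repeating once per fold and averaging the ten accuracies.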
Results
Abnormalities and Diagnosis
Applying the ATS/ERS algorithm revealed that in a real clinical population sample, the most prevalent lung function pattern was the healthy one (60%) followed by an obstructive pattern (36%), whereas restrictive and mixed patterns were very uncommon (4% together). Only 5% of the truly healthy subjects had an FEV1/FVC ratio <0.7, yet only 25% of all subjects with a normal pattern were truly healthy (Fig. 2). As expected, the majority of healthy patterns were found in patients with asthma, who are known to have normal pulmonary function in stable conditions. Of the 222 COPD patients, 197 had an obstructive pattern according to the modified ATS/ERS rules; 25 had an FEV1/FVC ratio >0.7 and may have been labeled as COPD patients by the expert panel because of hyperinflation, high resistance, low DLCO, or emphysema on CT. An obstructive pattern was also retrieved in 30% of the asthma subjects. A purely restrictive pattern was rarely seen in COPD (only 1 patient), and although some patients with ILD, NMD, or obesity presented within this restrictive subgroup, the majority were found under the healthy label. When further applying the ATS/ERS algorithm, the lower limit of normal (LLN) for DLCO was used as a cutoff to split the predicted disease patterns (Fig. 3). It helped in the differentiation of asthma from COPD within the obstructive group, but was not selective enough to identify pulmonary vascular disease in the group with normal pattern or ILD in the group with restrictive pattern. Despite the fact that the largest fraction of patients with real ILD and NMD were found with a normal lung function pattern, once categorized within the restrictive group, DLCO was able to correctly classify 5 of the 6 ILD patients. Unfortunately, DLCO was not able to identify NMD within the restrictive labels, as 10 patients out of 13 were misdiagnosed with ILD. In the complete cohort, 38% of subjects were correctly classified. 
When the cohort was restricted to diseases that can be labeled by the ATS/ERS algorithm (healthy, asthma, COPD, pulmonary vascular disease [PV], ILD, and NMD; n = 810), accuracy increased to 46%. The algorithm achieved 24% sensitivity and 94% specificity for the diagnosis of asthma, 73% sensitivity and 94% specificity for COPD, and 13% sensitivity and 98% specificity for ILD. A detailed comparison of correct and incorrect classifications is presented in Figure 4a.
The Decision Tree Suggests the Final Diagnosis
As the computer development of new decision trees is very much affected by disease prevalence, we only included data from the most common lung diseases (asthma, COPD, ILD, NMD) and added the healthy individuals. PV was only present in 9 cases (<1%) and was therefore not included in the analysis. Applying the modified ATS/ERS tree to this group (n = 801) of most common diseases resulted in a similar accuracy of 46% (365/801) for correct diagnosis. We then developed a computer-based algorithm to define upfront a 100% specific lung function pattern for NMD (Fig. 5a), as its prevalence in the BPFS was still too low to accurately discriminate within the entire population. Fourteen out of 26 NMD subjects were a priori selected as having a unique lung function pattern of NMD. Next, all lung function data and a selected set of clinical parameters (age, pack-years, CAT score, gender, and BMI) of the remaining population (asthma, COPD, ILD, and healthy subjects; n = 784) were subjected to a machine learning framework to develop a decision algorithm. This tree, visualized in Figure 5b, resulted in 74% accuracy on the training data, which decreased to 68% after 10-fold cross-validation. Most interestingly, the decision tree started with a diffusing capacity cutoff of 70% predicted as first discriminator, followed by an FEV1/FVC ratio cutoff around 70% predicted. On the lower levels of the tree, pack-years, age, PEF, and TLC were all found to offer significant discriminative power to the tree. A confusion matrix on the decision algorithm that combined the a priori identification of NMD with the tree for the more prevalent diseases on all 801 subjects is shown in Figure 4b. It demonstrates that the new algorithm was able to recognize COPD with a high positive predictive value (PPV = 83%) and sensitivity (true positive rate [TPR] = 78%). The proposed algorithm was also much better at predicting the presence of asthma (PPV = 66%, TPR = 82%) and ILD (PPV = 52%, TPR = 59%).
For the NMD pattern, a TPR of 54% and a PPV of 100% were reached. The decision rules were also highly specific, exceeding 90% specificity for every disease. Defining a clear healthy pattern gave the least favorable results, as it is often confused with non-obstructed asthma.
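The per-disease figures quoted above (PPV, TPR, specificity) are one-versus-rest summaries of a confusion matrix. The following Python sketch shows how they can be derived; the function name is hypothetical, and the two-class matrix is a simplified reconstruction from the counts given in the text (14 of 26 NMD subjects selected a priori among 801 subjects, with no false positives assumed), not the matrix of Figure 4b.

```python
def one_vs_rest_metrics(confusion, classes, target):
    """confusion[i][j] = subjects with true class i predicted as class j."""
    t = classes.index(target)
    total = sum(sum(row) for row in confusion)
    tp = confusion[t][t]
    fn = sum(confusion[t]) - tp                 # target subjects that were missed
    fp = sum(row[t] for row in confusion) - tp  # other subjects labeled as target
    tn = total - tp - fn - fp
    return {
        "tpr": tp / (tp + fn),          # sensitivity
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
    }


# Simplified NMD-versus-rest matrix: rows are true classes, columns predictions.
classes = ["NMD", "other"]
confusion = [
    [14, 12],   # 26 NMD subjects, 14 recognized by the a priori pattern
    [0, 775],   # 775 remaining subjects, none labeled NMD (assumption)
]
metrics = one_vs_rest_metrics(confusion, classes, "NMD")
print({name: round(value, 2) for name, value in metrics.items()})
```

With these counts the TPR is 14/26, about 54%, and the PPV is 100%, consistent with the values reported for the NMD pattern.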
Discussion
In this study, we developed a computer program for the automated data-driven interpretation of PFTs. When based on the international guidelines, the developed program was able to provide a correct diagnosis in only 38% of the studied population. Additionally, major mistakes were found in the discrimination of asthma from COPD and healthy individuals, and in the accurate labeling of restrictive diseases such as ILD and NMD. When developing an unbiased decision tree based on data mining programs that included not only lung function variables but also clinical patient characteristics, the accuracy increased significantly to 68%.
Automation in clinical practice is gaining pace. Apart from automated ECG interpretation, we are witnessing similar automations in detecting mammogram abnormalities, multiple sclerosis lesions on CT scans, or interpretation of laboratory tests [21,22,23]. Automated interpretation of PFTs has not yet become a clinical reality, due to the lack of accurate diagnostic guidelines to label lung disorders based on specific pulmonary function patterns. The weaknesses of current interpretation strategies, which have been criticized by several experts in the past, are clearly demonstrated by our results [24,25,26]. One of the reasons for these weaknesses may be found in the choice of thresholds and parameters, where authors were motivated by simplicity rather than pure evidence. For example, in the ATS/ERS decision tree, a DLCO below the LLN is used to differentiate between ILD and NMD. Both diseases may present with reduced alveolar ventilation (VA) leading to reduced gas transfer, but in contrast to ILD and interstitial involvement, the reduction in VA will be the main cause of reduced DLCO in NMD. KCO normalizes diffusing capacity for VA in case of chest wall disorders or NMD and is therefore a more reliable marker for the further distinction of ILD from NMD [27,28]. When analyzing our data using the LLN of KCO as a threshold, none of the NMD patients had low KCO, whereas 63% of the ILD subjects did. Further analysis identified KCO as the best differentiator between NMD and ILD, as a threshold at 85% of predicted values correctly discriminated 88% of all NMD and ILD subjects. Another weakness of the ATS/ERS decision tree lies in the definition of normal healthy lung function patterns, which is solely based on an FEV1/FVC ratio, a VC, and a DLCO above the LLN. Many lung diseases may still present with large disturbances of other parameters, together with changes in these key parameters that each remain statistically within normal limits but in combination clearly suggest disease.
By developing a new unbiased decision tree, we demonstrate that careful data-based modelling improves the accuracy of decision-making. Interestingly, the computer integrates clinical variables in the decision process, which is exactly what clinicians would do when reading and interpreting lung function data. For instance, the different pathways that are given by the computer are all tracks that make clinical sense, with computer-based cutoffs and thresholds that are different from the lower or upper limits that are normally used. Being data-driven, our decision tree follows clinical reasoning, but also takes into account statistical chances for a certain outcome. For example, normal DLCO will never lead to a diagnosis of COPD, but rather to one of obstructive asthma. It does not mean that COPD with normal DLCO does not exist, yet the chance is low and therefore accepted as a mistake. Overall, the decision tree designed by the computer is approaching the accuracy of the expert panel, which reached 80% accuracy for healthy, asthma, ILD, NMD, and COPD based on the combination of 4 tests with clinical history [6].
Although the current algorithms are improving lung function interpretation, there are still many mistakes with asthma and its differentiation from COPD and healthy subjects. This problem is certainly related to the fact that asthma may appear with a completely healthy and normal lung function on the one hand, and with an irreversible obstruction mimicking COPD on the other hand. We think that further developments for automated interpretation strategies should focus on lung functions that have an a priori disturbed pattern. How this disturbance should be defined is still unclear, but it is obvious that we should be less restrictive than what the current ATS/ERS rules are suggesting. One challenging approach may be the inclusion of new, more selective parameters of PFTs to improve the differentiation between these diseases [29,30]. Another problem for the development of automated interpretation of lung function is its inherent dependency on the prevalence of the diseases comprised in the dataset. For instance, when developing a tree with no a priori exclusion of NMD, the PPV was only 42%, with only 50% of these patients being correctly labeled despite a very specific disease pattern in a majority of these rare cases. We solved this problem by an upfront selection of the pattern on which there is no doubt and found that this combined approach improved the overall accuracy from 64 to 68%.
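The dependency of PPV on prevalence follows directly from Bayes' rule and can be made concrete with a small Python sketch. The function name is hypothetical, and the sensitivity and specificity values below are illustrative assumptions rather than results from the study; only the NMD prevalence (26 of 801 subjects) is taken from the text.

```python
def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value from sensitivity, specificity, and prevalence."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


# Same hypothetical classifier (50% sensitivity, 98% specificity) applied
# to a rare disease and to a common one.
rare = ppv(0.50, 0.98, 26 / 801)   # NMD-like prevalence from the text
common = ppv(0.50, 0.98, 0.25)     # hypothetical common disease
print(f"rare: {rare:.2f}, common: {common:.2f}")
```

Under these assumed values the same rule yields a markedly lower PPV for the rare disease, which illustrates why a pattern with very high specificity had to be selected upfront for NMD.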
An alternative approach for diagnostic labeling is to switch from easily interpretable algorithms such as decision trees to hardly or not interpretable black box algorithms such as neural networks, support vector machines, and others, which are used in most decision support systems nowadays [31,32]. Undoubtedly, these approaches will be beneficial in terms of diagnostic correctness [33,34,35]. An exploratory analysis of neural networks in our dataset already increased the overall accuracy to 82% after 10-fold cross-validation [33]. However, they are dataset specific and therefore will only be applicable once we have much larger datasets. Moreover, they will hardly be accepted by clinicians if their logic cannot be understood.
Finally, and inherent to any automated interpretation of PFTs, there is the need for sufficient quality of the performed tests. Different studies have shown that poorly performed tests increase the risk of misinterpretation and misdiagnosis [36,37]. Within-test acceptability and between-test reproducibility should be achieved by adequate coaching from pulmonary function laboratory personnel. To this end, many guidelines are available to assist with optimal quality control [15,38,39].
To conclude, using the simple decision algorithms to determine abnormal patterns of disease and subsequently predict specific respiratory diseases is not accurate. Our data show that the current algorithms can be improved by a computer-based choice of key decision variables and their decision-making thresholds. This technology may lead to the development of computer-aided decision support systems for the accurate interpretation of PFTs.
Acknowledgments
We thank all pulmonologists, pulmonary function technicians, patients, and hospitals participating in the study for providing data. We thank the Belgian Society of Pneumology for their financial and moral support. Finally, we thank Mr. Vasileios Exadaktylos and Mr. Geert Celis for technical support. This work was supported by an Astra Zeneca Chair 2013-2015. W. Janssens is supported by the Flemish Research Foundation (FWO).
BPFS Investigators
R. Vanherreweghe (Algemeen Stedelijk Ziekenhuis, Aalst); R. Deman, S. Deryke, B. Ghesyens, M. Haerens, S. Maddens (AZ Groeninge, Kortrijk); W. Bultynck, W. Temmerman, L. Van Zandweghe (AZ Sint Blasius, Dendermonde); R. De Pauw, C. Depuydt, C. Haenebalcke, V. Ringoet, D. Van Renterghem (AZ Sint Jan, Brugge); A. Carlier, D. Colle, C. De Cock, J. Lamont (AZ Maria Middelares, Gent); P. Brancaleone (CH Jolimont-Lobbes); M. Vander Stappen (CHR Haute Senne); R. Peché, P. Pierard, P. Quarré, A. Van Meerhaeghe (CHU, Charleroi); D. Cataldo, J.L. Corhay, B. Duysinx, V. Heinen, M. Naldi, D. Nguyen, L. Renaud (CHU, Liège); M. Cotils, P. Duchatelet, A. Gocmen, C. Lenclud (Clinique Louis Caty, Baudour); P. Lebrun, J. Noel (Clinique Saint-Pierre, Ottignies); A. Frémault (Grand Hôpital de Charleroi); P. Bertrand, B. Bouckaert, I. Demedts, P. Demuynck (Heilig Hart Ziekenhuis, Roeselare); E. Frans, A. Heremans, T. Lauwerier, J. Roelandts (Imelda, Bonheiden); G. Bral, I. Declercq, I. Malysse (Jan Ypermanziekenhuis, Ieper); V. Van Damme (St. Andries Ziekenhuis, Tielt); Dr. Martinot (St. Elisabeth, Namur); P. Vandenbrande (St. Maarten Duffel, Mechelen); M. Bruyneel, I. Muylle, V. Ninane (St. Pierre, Bruxelles); P. Collard, G. Liistro, B. Mwenge Gimbada, T. Pieters, C. Pilette, F. Pirson, D. Rodenstein (UCL, Bruxelles); L. Delaunois, E. Marchand, O. Vandenplas (UCL, Mont-Godinne); W. De Backer, P. Germonpre, A. Janssens, A. Vrints (UZA, Antwerpen); M. Decramer, W. Janssens, P. Van Bleyenbergh, G. Verleden, W. Wuyts (UZ Gasthuisberg, Leuven); G. Brusselle, E. Derom, G. Joos, K. Tournoy, K. Vermaelen (UZ, Gent); P. Alexander, C. Gysbrechts (Ziekenhuis Ronse); L. Bedert, Dr. Bomans, Dr. De Beukelaar, E. De Droogh, D. Galdermans, Dr. Ingelbrecht, Dr. Lefebure, H. Slabbynck, Dr. Van Mulders, Dr. Van Schaardenburg, Dr. Verbuyst (ZNA, Antwerpen); M. Daenen (ZOL Genk).