Published 31 Dec 2020

Deep Learning-Based A.I. Software for Chest X-Ray Analysis to Detect Microbiologically-Confirmed Tuberculosis: A Prospective Study of Diagnostic Accuracy

Authors: Gamuchirai Tavaziva (1), Arman Majidulla (2), Ahsana Nazish (3), Syed K Abidi (1), Saima Saeed (2), D. Menzies (1), A. Benedetti (1), A N Khan (2) and Faiz Ahmad Khan (1)



Background: Advances in artificial intelligence-based image recognition, particularly deep learning methods, have been leveraged to develop computer programs that analyze chest X-rays (CXR) to detect pulmonary tuberculosis (PTB) in place of human readers. However, there are few data on the diagnostic accuracy of commercially-available deep learning-based CXR analysis software.

Objective: Primary: To estimate the diagnostic accuracy of two commercially-available deep learning-based CXR analysis software programs for the detection of microbiologically-confirmed PTB. Secondary: To identify patient characteristics that modify the diagnostic accuracy of these programs.

Methods: We enrolled adults presenting with symptoms of PTB at a hospital in Karachi, Pakistan. For all participants, we performed a digital CXR and asked them to submit 2 sputum samples for smear and liquid TB culture (MGIT) and 1 for nucleic acid amplification testing (NAAT, Xpert MTB/Rif). We analyzed each CXR with CAD4TB (v6) and qXR (v2). These programs analyze a CXR and output an “abnormality score” on a 100-point scale. A “threshold score” must be selected as the cutoff that differentiates normal from abnormal CXR. For each program, we first identified the threshold score that had a sensitivity of 0.90 when pooling all participants together (“overall”). Using this threshold score, we calculated the overall specificity, and the sensitivity and specificity in subgroups defined by sex, age, sputum smear, diabetes, tobacco smoking, and history of prior TB. We repeated this for the threshold score with an overall sensitivity of 0.95. We used chi-square tests to compare accuracy between subgroup strata.

Results: Of 2370 eligible participants, we excluded 95 with missing sputum data, 7 with cultures that were contaminated or grew nontuberculous mycobacteria (NTM), and 2 with missing CXR. Amongst 2267 included participants, 278 (12.3%) were diagnosed with culture- or NAAT-confirmed PTB. CAD4TB specificity was 0.76 (95% CI: 0.74-0.78) and 0.66 (95% CI: 0.64-0.68) at sensitivities of 0.90 and 0.95, respectively. qXR specificity was 0.78 (95% CI: 0.76-0.80) and 0.73 (95% CI: 0.71-0.75) at sensitivities of 0.90 and 0.95, respectively. With CAD4TB, sensitivity was lower and specificity higher in women than in men. For both programs, sensitivity was lower for smear-negative TB, and specificity was lower in the older age category and in people with prior TB.

Discussion: In people seeking care for PTB symptoms, deep learning-based CXR analysis achieved moderate specificity even at high sensitivity, but diagnostic accuracy was modified by age, smear status, and prior TB history. Gender modified the accuracy of CAD4TB, but not of qXR.
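The threshold-selection procedure described in the Methods can be illustrated with a minimal Python sketch. It is not the authors' analysis code, and several details are assumptions: a CXR is treated as abnormal when its score is at or above the cutoff, the cutoff chosen is the highest one whose sensitivity is at least the target, the confidence interval uses the Wilson method (the abstract does not say which method was used), and the chi-square test is Pearson's test on a 2x2 table without continuity correction. All function names and the toy data are hypothetical.

```python
import math


def threshold_for_sensitivity(scores, labels, target):
    """Highest cutoff whose sensitivity is at least `target`.

    Assumption: a CXR is called abnormal when score >= cutoff.
    `labels` are True for microbiologically confirmed PTB.
    """
    positives = sorted((s for s, y in zip(scores, labels) if y), reverse=True)
    if not positives:
        raise ValueError("no reference-positive participants")
    # Number of positives that must be flagged to reach the target.
    needed = max(1, math.ceil(target * len(positives) - 1e-9))
    return positives[needed - 1]


def sens_spec(scores, labels, cutoff):
    """Sensitivity and specificity at a given cutoff."""
    tp = sum(1 for s, y in zip(scores, labels) if y and s >= cutoff)
    fn = sum(1 for s, y in zip(scores, labels) if y and s < cutoff)
    tn = sum(1 for s, y in zip(scores, labels) if not y and s < cutoff)
    fp = sum(1 for s, y in zip(scores, labels) if not y and s >= cutoff)
    return tp / (tp + fn), tn / (tn + fp)


def wilson_ci(successes, n, z=1.96):
    """Wilson score interval for a proportion (assumed CI method)."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (1 df, no continuity correction) for [[a, b], [c, d]].

    For example, rows are two subgroup strata and columns are
    correct vs. incorrect classifications.
    """
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p_value = math.erfc(math.sqrt(chi2 / 2))  # survival function for 1 df
    return chi2, p_value


# Toy data (invented for illustration, not study data).
toy_scores = [95, 90, 85, 80, 75, 70, 65, 60, 55, 20,   # confirmed PTB
              10, 15, 30, 40, 50, 58, 62, 5]            # no PTB
toy_labels = [True] * 10 + [False] * 8

cutoff = threshold_for_sensitivity(toy_scores, toy_labels, target=0.90)
sensitivity, specificity = sens_spec(toy_scores, toy_labels, cutoff)
print(cutoff, sensitivity, specificity)
```

To mirror the subgroup analysis, the overall cutoff would be held fixed and `sens_spec` applied to each stratum separately, with `chi_square_2x2` comparing the resulting counts between strata.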




1. McGill International TB Centre, Research Institute of the McGill University Health Centre, Montreal, QC, Canada
2. Interactive Research & Development, Karachi, Pakistan
3. Indus Hospital, Karachi, Pakistan
