breast cancer; machine learning; patient stratification; recurrence prediction; secondary use; structured data; unstructured data; Oncology; Cancer Research
Abstract :
[en] Recurrence is a critical aspect of breast cancer (BC) that is inexorably tied to mortality. Reuse of healthcare data through Machine Learning (ML) algorithms offers great opportunities to improve the stratification of patients at risk of cancer recurrence. We hypothesized that combining features from structured and unstructured sources would provide better prediction results for 5-year cancer recurrence than either source alone. We collected and preprocessed clinical data from a cohort of BC patients, resulting in 823 valid subjects for analysis. We derived three sets of features: structured information, features from free text, and a combination of both. We evaluated the performance of five ML algorithms to predict 5-year cancer recurrence and selected the best-performing one to test our hypothesis. The XGB (eXtreme Gradient Boosting) model yielded the best performance among the five evaluated algorithms, with precision = 0.900, recall = 0.907, F1-score = 0.897, and area under the receiver operating characteristic curve (AUROC) = 0.807. The best prediction results were achieved with the structured dataset, followed by the unstructured dataset, while the combined dataset achieved the poorest performance. ML algorithms for BC recurrence prediction are valuable tools to improve patient risk stratification, help with post-cancer monitoring, and plan more effective follow-up. Structured data provides the best results when fed to ML algorithms. However, an approach based on natural language processing offers comparable results while potentially requiring less mapping effort.
Disciplines :
Computer science
Author, co-author :
González-Castro, Lorena ; School of Telecommunication Engineering, University of Vigo, 36310 Vigo, Spain
Chávez, Marcela ; Department of Information System Management, Centre Hospitalier Universitaire de Liège, 4000 Liège, Belgium
Duflot, Patrick ; Centre Hospitalier Universitaire de Liège - CHU > Secteur Appui méthodologique aux Projets GSI et Planification (APP)
Bleret, Valérie ; Université de Liège - ULiège > Département des sciences cliniques
Martin, Alistair G ; Science Department, Symptoma GmbH, 1030 Vienna, Austria
Zobel, Marc ; Science Department, Symptoma GmbH, 1030 Vienna, Austria
Nateqi, Jama ; Science Department, Symptoma GmbH, 1030 Vienna, Austria ; Department of Internal Medicine, Paracelsus Medical University, 5020 Salzburg, Austria
Lin, Simon ; Science Department, Symptoma GmbH, 1030 Vienna, Austria ; Department of Internal Medicine, Paracelsus Medical University, 5020 Salzburg, Austria
Pazos-Arias, José J ; atlanTTic Research Center, Department of Telematics Engineering, University of Vigo, 36310 Vigo, Spain
Del Fiol, Guilherme ; Department of Biomedical Informatics, University of Utah School of Medicine, Salt Lake City, UT 84108, USA
López-Nores, Martín ; atlanTTic Research Center, Department of Telematics Engineering, University of Vigo, 36310 Vigo, Spain
Language :
English
Title :
Machine Learning Algorithms to Predict Breast Cancer Recurrence Using Structured and Unstructured Sources from Electronic Health Records.
Alternative titles :
[fr] Algorithmes de Machine Learning pour prédire la récidive du cancer du sein à l’aide de sources structurées et non structurées provenant de dossiers médicaux électroniques.
Patients-centered SurvivorShIp care plan after Cancer treatments based on Big Data and Artificial Intelligence technologies
Funders :
EU - European Union
Funding text :
Part of this work was supported by the European Union’s Horizon 2020 research and innovation program under Grant Agreement No. 875406. The authors from the University of Vigo received support from the European Regional Development Fund (ERDF) and the Galician Regional Government under an agreement to fund the atlanTTic Research Center for Telecommunication Technologies.