automatic scoring; visual scoring; scoring variability; large datasets
Abstract :
[en] Sleep studies face new challenges in terms of data, objectives and metrics. This requires reappraising the adequacy of existing analysis methods, including scoring methods. Visual and automatic sleep scoring of healthy individuals were compared in terms of reliability (i.e., accuracy and stability) to find a scoring method capable of giving access to the actual data variability without adding exogenous variability. A first dataset (DS1, four recordings) scored by six experts plus an autoscoring algorithm was used to characterize inter-scoring variability. A second dataset (DS2, 88 recordings) scored a few weeks later was used to explore intra-expert variability. Percentage agreements and Conger's kappa were derived from epoch-by-epoch comparisons on pairwise and consensus scorings. On DS1 the number of epochs of agreement decreased when the number of experts increased, ranging from 86% (pairwise) to 69% (all experts). Adding autoscoring to visual scorings changed the kappa value from 0.81 to 0.79. Agreement between expert consensus and autoscoring was 93%. On DS2 the hypothesis of intra-expert variability was supported by a systematic decrease in kappa scores between autoscoring used as reference and each single expert between datasets (0.75–0.70). Although visual scoring induces inter- and intra-expert variability, autoscoring methods can cope with intra-scorer variability, making them a sensible option to reduce exogenous variability and give access to the endogenous variability in the data.
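The agreement measures named in the abstract can be illustrated with a minimal sketch. It assumes complete data (every scorer labels every epoch) and uses the standard formulation of Conger's kappa as mean pairwise observed agreement corrected by mean pairwise chance agreement computed from each scorer's own stage proportions. The function names and the stage labels are illustrative assumptions, not taken from the study, and this is not the authors' implementation.

```python
from collections import Counter
from itertools import combinations


def percent_agreement(a, b):
    """Fraction of epochs on which two scorings assign the same stage."""
    if len(a) != len(b):
        raise ValueError("scorings must cover the same epochs")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def all_scorer_agreement(scorings):
    """Fraction of epochs on which every scorer assigns the same stage."""
    n_epochs = len(scorings[0])
    return sum(len(set(stages)) == 1 for stages in zip(*scorings)) / n_epochs


def conger_kappa(scorings):
    """Conger's multi-rater kappa for complete data (hypothetical helper).

    scorings: list of equal-length stage sequences, one per scorer.
    With exactly two scorers this reduces to Cohen's kappa.
    """
    n_epochs = len(scorings[0])
    pairs = list(combinations(range(len(scorings)), 2))

    # Observed agreement: mean pairwise epoch-by-epoch agreement.
    p_obs = sum(
        percent_agreement(scorings[i], scorings[j]) for i, j in pairs
    ) / len(pairs)

    # Each scorer's own proportion of epochs assigned to each stage.
    marginals = [
        {stage: count / n_epochs for stage, count in Counter(s).items()}
        for s in scorings
    ]

    # Chance agreement: mean over scorer pairs of sum_k p_ik * p_jk.
    p_exp = sum(
        sum(p * marginals[j].get(stage, 0.0) for stage, p in marginals[i].items())
        for i, j in pairs
    ) / len(pairs)

    if p_exp == 1.0:
        # Every scorer used a single identical stage: kappa is undefined.
        return float("nan")
    return (p_obs - p_exp) / (1.0 - p_exp)


# Toy example with made-up labels (W = wake, N = non-REM, R = REM).
expert_1 = ["W", "W", "N", "N", "N", "R", "R", "N", "W", "W"]
expert_2 = ["W", "N", "N", "N", "N", "R", "R", "N", "W", "W"]
autoscore = ["W", "W", "N", "N", "R", "R", "R", "N", "N", "W"]

scorings = [expert_1, expert_2, autoscore]
print("pairwise agreement (experts):", percent_agreement(expert_1, expert_2))
print("all-scorer agreement:", all_scorer_agreement(scorings))
print("Conger's kappa:", conger_kappa(scorings))
```

Because an epoch counts toward all-scorer agreement only when every scoring matches, that figure can never exceed the lowest pairwise agreement, which is consistent with the drop from pairwise to all-expert agreement reported for DS1.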
Disciplines :
Neurosciences & behavior
Electrical & electronics engineering
Human health sciences: Multidisciplinary, general & others
Author, co-author :
Berthomier, Christian ✱; PHYSIP, Paris, France
Muto, Vincenzo ✱; Université de Liège - ULiège > CRC In vivo Imaging-Sleep and chronobiology
Schmidt, Christina ; Université de Liège - ULiège > CRC In vivo Imaging-Sleep and chronobiology
Vandewalle, Gilles ; Université de Liège - ULiège > CRC In vivo Imaging-Sleep and chronobiology
Jaspar, Mathieu ; Université de Liège - ULiège > Département de Psychologie > Ergonomie et intervention au travail
References :
Anderer, P., Gruber, G., Parapatics, S., Woertz, M., Miazhynskaia, T., Klosch, G., … Dorffner, G. (2005). An E-health solution for automatic sleep classification according to Rechtschaffen and Kales: Validation study of the Somnolyzer 24 x 7 utilizing the siesta database. Neuropsychobiology, 51, 115–133.
Anderer, P., Moreau, A., Woertz, M., Ross, M., Gruber, G., Parapatics, S., … Dorffner, G. (2010). Computer-assisted sleep classification according to the standard of the American academy of sleep medicine: Validation study of the AASM version of the Somnolyzer 24 x 7. Neuropsychobiology, 62, 250–264.
Berthomier, C., Drouot, X., Herman-Stoïca, M., Berthomier, P., Prado, J., Bokar-Thire, D., … d'Ortho, M.-P. (2007). Automatic analysis of single-channel sleep EEG: Validation in healthy individuals. Sleep, 30, 1587–1595. https://doi.org/10.1093/sleep/30.11.1587
Castro, L. S., Poyares, D., Leger, D., Bittencourt, L., & Tufik, S. (2013). Objective prevalence of insomnia in the Sao Paulo, Brazil epidemiologic sleep study. Annals of Neurology, 74, 537–546.
Chediak, A., Esparis, B., Isaacson, R., Cruz, L. D. L., Ramirez, J., Rodriguez, J. F., … Abreu, A. (2006). How many polysomnograms must sleep fellows score before becoming proficient at scoring sleep? Journal of Clinical Sleep Medicine, 2, 427–430. https://doi.org/10.5664/jcsm.26659
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37–46.
Collop, N. A. (2002). Scoring variability between polysomnography technologists in different sleep laboratories. Sleep Medicine, 3, 43–47. https://doi.org/10.1016/S1389-9457(01)00115-0
Conger, A. J. (1980). Integration and generalization of kappas for multiple raters. Psychological Bulletin, 88, 322–328. https://doi.org/10.1037/0033-2909.88.2.322
Danker-Hopfe, H., Anderer, P., Zeitlhofer, J., Boeck, M., Dorn, H., Gruber, G., … Dorffner, G. (2009). Interrater reliability for sleep scoring according to the Rechtschaffen & Kales and the new AASM standard. Journal of Sleep Research, 18, 74–84.
De Zambotti, M., Godino, J. G., Baker, F. C., Cheung, J., Patrick, K., & Colrain, I. M. (2016). The boom in wearable technology: Cause for alarm or just what is needed to better understand sleep? Sleep, 39, 1761–1762. https://doi.org/10.5665/sleep.6108
Dean, D. A., Goldberger, A. L., Mueller, R., Kim, M., Rueschman, M., Mobley, D., … Redline, S. (2016). Scaling up scientific discovery in sleep medicine: The national sleep research resource. Sleep, 39, 1151–1164. https://doi.org/10.5665/sleep.5774
Fiorillo, L., Puiatti, A., Papandrea, M., Ratti, P.-L., Favaro, P., Roth, C., … Faraci, F. D. (2019). Automated sleep scoring: A review of the latest approaches. Sleep Medicine Reviews, 48, 101204.
Grigg-Damberger, M. M. (2012). The AASM scoring manual four years later. Journal of Clinical Sleep Medicine, 8, 323–332. https://doi.org/10.5664/jcsm.1928
Gwet, K. L. (2012). Handbook of inter-rater reliability: The definitive guide to measuring the extent of agreement among multiple raters, 3rd ed. Gaithersburg, MD: Advanced Analytics Press.
Himanen, S. L., & Hasan, J. (2000). Limitations of Rechtschaffen and Kales. Sleep Medicine Reviews, 4, 149–167. https://doi.org/10.1053/smrv.1999.0086
Iber, C., Ancoli-Israel, S., Chesson, A. L., Jr., & Quan, S. F. (2007). The AASM Manual for the scoring of sleep and associated events: Rules, terminology and technical specifications. Westchester, Illinois: American Academy of Sleep Medicine.
Kaplan, R. F., Wang, Y., Loparo, K. A., Kelly, M. R., & Bootzin, R. R. (2014). Performance evaluation of an automated single-channel sleep-wake detection algorithm. Nature and Science of Sleep, 6, 113–122.
Koupparis, A. M., Kokkinos, V., & Kostopoulos, G. K. (2014). Semi-automatic sleep EEG scoring based on the hypnospectrogram. Journal of Neuroscience Methods, 221, 189–195. https://doi.org/10.1016/j.jneumeth.2013.10.010
Ktonas, P. Y., & Smith, J. R. (1976). Semi-automatic analysis of rapid eye movement (REM) patterns: A software package. Computers and Biomedical Research, an International Journal, 9, 109–124. https://doi.org/10.1016/0010-4809(76)90034-3
Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33, 159–174. https://doi.org/10.2307/2529310
Magalang, U. J., Chen, N.-H., Cistulli, P. A., Fedson, A. C., Gíslason, T., Hillman, D., … Pack, A. I. (2013). Agreement in the scoring of respiratory events and sleep among international sleep centers. Sleep, 36, 591–596. https://doi.org/10.5665/sleep.2552
Malhotra, A., Younes, M., Kuna, S. T., Benca, R., Kushida, C. A., Walsh, J., … Pien, G. W. (2013). Performance of an automated polysomnography scoring system versus computer-assisted manual scoring. Sleep, 36, 573–582. https://doi.org/10.5665/sleep.2548
Morgenthaler, T. I., Deriy, L., Heald, J. L., & Thomas, S. M. (2016). The evolution of the AASM clinical practice guidelines: Another step forward. Journal of Clinical Sleep Medicine, 12, 129–135. https://doi.org/10.5664/jcsm.5412
Penzel, T., Zhang, X., & Fietze, I. (2013). Inter-scorer reliability between sleep centers can teach us what to improve in the scoring rules. Journal of Clinical Sleep Medicine, 9, 89–91. https://doi.org/10.5664/jcsm.2352
Pittman, S. D., MacDonald, M. M., Fogel, R. B., Malhotra, A., Todros, K., Levy, B., … White, D. P. (2004). Assessment of automated scoring of polysomnographic recordings in a population with suspected sleep-disordered breathing. Sleep, 27, 1394–1403. https://doi.org/10.1093/sleep/27.7.1394
Popovic, D., Khoo, M., & Westbrook, P. (2014). Automatic scoring of sleep stages and cortical arousals using two electrodes on the forehead: Validation in healthy adults. Journal of Sleep Research, 23, 211–221. https://doi.org/10.1111/jsr.12105
Redline, S., Amin, R., Beebe, D., Chervin, R. D., Garetz, S. L., Giordani, B., … Ellenberg, S. (2011). The Childhood Adenotonsillectomy Trial (CHAT): Rationale, design, and challenges of a randomized controlled trial evaluating a standard surgical procedure in a pediatric population. Sleep, 34, 1509–1517. https://doi.org/10.5665/sleep.1388
Redline, S., Dean, D. 3rd, & Sanders, M. H. (2013). Entering the era of "big data": Getting our metrics right. Sleep, 36, 465–469. https://doi.org/10.5665/sleep.2524
Redline, S., Schluchter, M. D., Larkin, E. K., & Tishler, P. V. (2003). Predictors of longitudinal change in sleep-disordered breathing in a nonclinic population. Sleep, 26, 703–709. https://doi.org/10.1093/sleep/26.6.703
Rosenberg, R. S., & Van Hout, S. (2013). The American Academy of Sleep Medicine inter-scorer reliability program: Sleep stage scoring. Journal of Clinical Sleep Medicine, 9, 81–87. https://doi.org/10.5664/jcsm.2350
Schulz, H. (2008). Rethinking sleep analysis. Journal of Clinical Sleep Medicine, 4, 99–103. https://doi.org/10.5664/jcsm.27124
Silber, M. H., Ancoli-Israel, S., Bonnet, M. H., Chokroverty, S., Grigg-Damberger, M. M., Hirshkowitz, M., … Iber, C. (2007). The visual scoring of sleep in adults. Journal of Clinical Sleep Medicine, 3, 121–131. https://doi.org/10.5664/jcsm.26814
Stephansen, J. B., Olesen, A. N., Olsen, M., Ambati, A., Leary, E. B., Moore, H. E., … Mignot, E. (2018). Neural network analysis of sleep stages enables efficient diagnosis of narcolepsy. Nature Communications, 9, 5229. https://doi.org/10.1038/s41467-018-07229-3
Sun, H., Jia, J., Goparaju, B., Huang, G.-B., Sourina, O., Bianchi, M. T., & Westover, M. B. (2017). Large-scale automated sleep staging. Sleep, 40(10). https://doi.org/10.1093/sleep/zsx139
Van Dongen, H. P., Vitellaro, K. M., & Dinges, D. F. (2005). Individual differences in adult human sleep and wakefulness: Leitmotif for a research agenda. Sleep, 28, 479–496. https://doi.org/10.1093/sleep/28.4.479
Virkkala, J., Hasan, J., Varri, A., Himanen, S. L., & Muller, K. (2007). Automatic sleep stage classification using two-channel electro-oculography. Journal of Neuroscience Methods, 166, 109–115. https://doi.org/10.1016/j.jneumeth.2007.06.016
Wang, Y., Loparo, K. A., Kelly, M. R., & Kaplan, R. F. (2015). Evaluation of an automated single-channel sleep staging algorithm. Nature and Science of Sleep, 7, 101–111. https://doi.org/10.2147/NSS.S77888
Whitney, C. W., Gottlieb, D. J., Redline, S., Norman, R. G., Dodge, R. R., Shahar, E., … Nieto, F. J. (1998). Reliability of scoring respiratory disturbance indices and sleep staging. Sleep, 21, 749–757. https://doi.org/10.1093/sleep/21.7.749
Younes, M., Raneri, J., & Hanly, P. (2016). Staging sleep in polysomnograms: Analysis of inter-scorer variability. Journal of Clinical Sleep Medicine, 12, 885–894. https://doi.org/10.5664/jcsm.5894
Zhang, X., Dong, X., Kantelhardt, J. W., Li, J., Zhao, L., Garcia, C., … Han, F. (2015). Process and outcome for international reliability in sleep scoring. Sleep and Breathing, 19, 191–195. https://doi.org/10.1007/s11325-014-0990-0