Abstract
Deep Neural Networks (DNNs) often demonstrate remarkable performance when evaluated on the test dataset used during model creation. However, their ability to generalize effectively when deployed is crucial, especially in critical applications. One approach to assessing the generalization capability of a DNN model is to evaluate its performance on replicated test datasets, which are created by closely following the methodology and procedures used to generate the original test dataset. Our investigation focuses on the performance degradation of pre-trained DNN models in multi-class classification tasks when evaluated on these replicated datasets, a degradation that has not been fully explained by generalization shortcomings or dataset disparities. To address this, we introduce a new evaluation framework that leverages the uncertainty estimates produced by the models under study. The framework is designed to isolate the impact of variations in the evaluated test datasets and to assess DNNs based on the consistency of the confidence they assign to their predictions. By employing this framework, we can determine whether an observed performance drop is primarily caused by model inadequacy or by other factors. We applied our framework to analyze 564 pre-trained DNN models across the CIFAR-10 and ImageNet benchmarks, along with their replicated versions. Contrary to common assumptions about model inadequacy, our results indicate a substantial reduction in the performance gap between the original and replicated datasets when model uncertainty is taken into account. This suggests a previously unrecognized adaptability of models to minor dataset variations. Our findings emphasize the importance of understanding dataset intricacies and adopting more nuanced evaluation methods when assessing DNN model performance. This research contributes to the development of more robust and reliable DNN models, especially in critical applications where generalization performance is of utmost importance. The code to reproduce our experiments will be available at https://github.com/esla/Reassessing_DNN_Accuracy.
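The abstract does not spell out how the uncertainty-aware comparison is carried out, so the following is only a minimal sketch of one way such an evaluation could look: it compares accuracy on an original and a replicated test set after restricting both to predictions whose softmax confidence exceeds a threshold. All names here (`logits_orig`, `labels_repl`, `accuracy_gap_by_confidence`, the threshold values) are hypothetical illustrations, not taken from the paper or its repository.

```python
# Hedged sketch (not the authors' exact method): measure how the
# original-vs-replicated accuracy gap changes when only predictions
# above a softmax-confidence threshold are kept.
import numpy as np


def softmax(logits):
    # Numerically stable softmax over the class dimension.
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def confident_accuracy(logits, labels, threshold):
    """Accuracy restricted to predictions whose max softmax probability
    is at least `threshold`; returns (accuracy, coverage)."""
    probs = softmax(logits)
    conf = probs.max(axis=1)
    preds = probs.argmax(axis=1)
    keep = conf >= threshold
    if keep.sum() == 0:
        return float("nan"), 0.0
    acc = (preds[keep] == labels[keep]).mean()
    return float(acc), float(keep.mean())


def accuracy_gap_by_confidence(logits_orig, labels_orig,
                               logits_repl, labels_repl,
                               thresholds=(0.0, 0.5, 0.7, 0.9)):
    """For each threshold, report accuracy on the original test set,
    accuracy on the replicated test set, their gap, and the fraction
    of examples retained on each set."""
    rows = []
    for t in thresholds:
        acc_o, cov_o = confident_accuracy(logits_orig, labels_orig, t)
        acc_r, cov_r = confident_accuracy(logits_repl, labels_repl, t)
        rows.append((t, acc_o, acc_r, acc_o - acc_r, cov_o, cov_r))
    return rows
```

Under these assumptions, a gap that shrinks as the confidence threshold rises would be consistent with the reported finding that the original-versus-replicated performance gap narrows once model uncertainty is accounted for, rather than reflecting model inadequacy alone.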