The ultimate goal of a one-class classifier such as the “rigorous” soft independent modeling of class analogy (SIMCA) is to predict, with a given confidence probability, the conformity of future objects with a reference class. However, the SIMCA model as currently implemented often suffers from undercoverage: its observed sensitivity often falls far below the desired theoretical confidence probability, which undermines its intended use as a predictive tool. The most frequently reported remedy in the literature is to increase the nominal confidence probability until the desired sensitivity is obtained in cross-validation. This article proposes an alternative strategy, based on statistical prediction intervals, to properly overcome the undercoverage issue. The strategy uses the concept of predictive distributions sensu stricto to construct statistical prediction regions for the metrics. First, a procedure based on goodness-of-fit criteria selects the best-fitting family of probability models for each metric, or for a monotonic transformation of it, among several plausible candidate families of right-skewed probability distributions for positive random variables, including the gamma and lognormal families. Second, assuming the best-fitting distribution, a generalized linear model is fitted to the data of each metric using Bayesian methods. This makes it convenient to estimate the uncertainty about the parameters of the selected distribution. Propagating this uncertainty to the best-fitting probability model of the metric yields its posterior predictive distribution, which is then used to set its critical limit. Overall, evaluation of the proposed approach on a variety of real datasets shows that it yields unbiased and more accurate sensitivities than existing methods that are not based on predictive densities. It can even yield better specificities than the strategy that attempts to improve the sensitivities of existing methods by “optimizing” the type I error, especially with small sample sizes.
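The difference between a plug-in critical limit and a posterior predictive one can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the lognormal family has already been selected for a metric, replaces the Bayesian generalized linear model with the standard noninformative-prior normal model on the log scale, and uses simulated training values. The function names `plugin_limit` and `predictive_limit` and all data are hypothetical.

```python
import math
import random
import statistics


def plugin_limit(values, confidence=0.95):
    """Lognormal quantile with the estimated parameters treated as exact."""
    logs = [math.log(v) for v in values]
    mu = statistics.fmean(logs)
    sd = statistics.stdev(logs)
    return math.exp(statistics.NormalDist(mu, sd).inv_cdf(confidence))


def predictive_limit(values, confidence=0.95, draws=20000, seed=1):
    """Quantile of simulated posterior predictive draws.

    Assumption: log(metric) is normal with the noninformative prior
    p(mu, sigma^2) proportional to 1 / sigma^2.
    """
    rng = random.Random(seed)
    logs = [math.log(v) for v in values]
    n = len(logs)
    mean = statistics.fmean(logs)
    var = statistics.variance(logs)
    sims = []
    for _ in range(draws):
        # sigma^2 | data: scaled inverse chi-square with n - 1 degrees of freedom
        chi2 = rng.gammavariate((n - 1) / 2.0, 2.0)
        sigma2 = (n - 1) * var / chi2
        # mu | sigma^2, data: Normal(mean, sigma^2 / n)
        mu = rng.gauss(mean, math.sqrt(sigma2 / n))
        # future log-metric given the drawn parameters
        sims.append(rng.gauss(mu, math.sqrt(sigma2)))
    sims.sort()
    idx = min(draws - 1, math.ceil(confidence * draws) - 1)
    return math.exp(sims[idx])


# Hypothetical training metric values (simulated, small sample of 15 objects)
data_rng = random.Random(7)
train = [data_rng.lognormvariate(0.0, 0.5) for _ in range(15)]

limit_plugin = plugin_limit(train)
limit_predictive = predictive_limit(train)
print(f"plug-in limit:    {limit_plugin:.3f}")
print(f"predictive limit: {limit_predictive:.3f}")
```

With few training objects the predictive limit is wider than the plug-in limit, because it also carries the uncertainty about the estimated parameters. This is the mechanism by which a prediction-region approach can counter undercoverage.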
Research Center/Unit :
CIRM - Centre Interdisciplinaire de Recherche sur le Médicament - ULiège
Disciplines :
Pharmacy, pharmacology & toxicology
Author, co-author :
Avohou, Tonakpon Hermane ; Université de Liège - ULiège > Département de pharmacie > Chimie analytique
Sacre, Pierre-Yves ; Université de Liège - ULiège > Département de pharmacie > Chimie analytique
Hamla, Sabrina ; Université de Liège - ULiège > Unités de recherche interfacultaires > Centre Interdisciplinaire de Recherche sur le Médicament (CIRM)
Lebrun, Pierre ; Université de Liège - ULiège > Département de pharmacie > Chimie analytique
Hubert, Philippe ; Université de Liège - ULiège > Département de pharmacie > Chimie analytique
Ziemons, Eric ; Université de Liège - ULiège > Département de pharmacie > Chimie analytique
Language :
English
Title :
Optimizing the soft independent modeling of class analogy (SIMCA) using statistical prediction regions