Abstract :
1 Introduction
Class-modelling methods are an important class of multivariate semi-supervised classification models recommended for authentication and verification problems [1]. Conceptually, these methods aim to predict the conformance of future unknown objects to a given reference class, the target class, based on classification rules built from a set of objects belonging exclusively to that class [1]. They are widely used in analytical chemistry to verify the conformity of food and drug samples to references using spectroscopic data such as NIR and Raman spectra. The most widely used class-modelling method in chemometrics is undoubtedly Soft Independent Modelling of Class Analogy (SIMCA), which, in a nutshell, constructs a confidence interval that serves as the acceptance region for PCA-based distances of the objects in the training set [1].
This work aims to introduce to the chemometric community a novel approach to class-modelling based on the emerging concept of functional depth or outlierness statistics. In addition, the proposed method is fully predictive: it accounts for the uncertainty about the model parameters, which makes it robust. Its performance and advantages are discussed and compared with those of the SIMCA model.
2 Theory
Functional depth and outlierness statistics are an emerging field of nonparametric statistics. In a nutshell, functional depths (vs. outlierness measures) are mappings that assign to each spectrum a centrality (vs. outlierness) value, generally positive, which captures its relative behaviour (magnitude and shape) w.r.t. all the other spectra of the training dataset: central, and thus typical, spectra have high depth (vs. low outlierness) values, while outlying spectra have low depth (vs. high outlierness) values [2]. Two such depth (vs. outlierness) metrics were used to measure the centrality (vs. outlierness) of each spectrum of the calibration set, namely:
(1) the Adjusted Band Depth (ABD), a centrality metric that accounts for the average of the maximal deviation (distance) of a spectrum w.r.t. the bands delimited by pairs of spectra; and
(2) the Infimal Area, an outlierness metric that accounts for the proportion of points where a spectrum is extreme w.r.t. the other spectra [2].
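To make the idea of a band-based depth concrete, the sketch below computes the classical modified band depth, a simpler relative of the two metrics above, and not the ABD or the Infimal Area themselves. For each spectrum it averages, over all pairs of spectra, the proportion of grid points at which the spectrum lies inside the band delimited by the pair. The function name and the toy spectra are hypothetical and serve only as an illustration.

```python
import math
from itertools import combinations


def modified_band_depth(curves):
    """Modified band depth (bands from pairs of curves) of every curve.

    curves: list of equal-length lists, one per spectrum, on a common grid.
    Returns one value in [0, 1] per curve; higher means more central.
    Note: this is a stand-in depth, not the ABD used in the abstract.
    """
    n_points = len(curves[0])
    pairs = list(combinations(range(len(curves)), 2))
    depths = []
    for x in curves:
        inside = 0
        for j, k in pairs:
            for t in range(n_points):
                lo = min(curves[j][t], curves[k][t])
                hi = max(curves[j][t], curves[k][t])
                if lo <= x[t] <= hi:  # x lies inside the band at grid point t
                    inside += 1
        depths.append(inside / (len(pairs) * n_points))
    return depths


# Hypothetical toy "spectra": one Gaussian peak with different baseline offsets.
grid = [i / 20 for i in range(21)]
peak = [math.exp(-((t - 0.5) ** 2) / 0.02) for t in grid]
offsets = [0.8, 0.9, 1.0, 1.1, 5.0]  # the last curve is a magnitude outlier
spectra = [[p + c for p in peak] for c in offsets]

depths = modified_band_depth(spectra)
most_central = depths.index(max(depths))
```

In this toy set the middle curve receives the highest depth, while the two extreme curves (the lowest offset and the magnitude outlier) share the lowest value, because this particular depth depends only on the pointwise ordering of the curves.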
3 Material and methods
The proposed classification method based on the depth (vs. outlierness) metrics proceeds as follows:
Step 01: the calibration dataset is built.
Step 02: the depth (vs. outlierness) value of each spectrum of the training dataset is computed, reducing the data matrix to a one-dimensional vector of positive values.
Step 03: the depth (vs. outlierness) values are modelled using a Bayesian generalized linear model to derive their predictive distribution, i.e. the probability distribution of a future value that accounts for all the model uncertainties; a one-sided 95% lower (vs. upper) prediction interval is then computed from the predictive distribution, and its bound is used as the limit of the acceptance region.
Step 04: the depth (vs. outlierness) of each new test spectrum is computed and compared to the acceptance limit to decide on its conformity.
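Steps 03 and 04 can be sketched as follows for the depth case. The abstract does not state the family or link of the Bayesian generalized linear model, so this sketch swaps in a much simpler assumption: the log-depths follow a normal distribution with a noninformative prior, and the posterior predictive distribution is approximated by Monte Carlo simulation. The function names and the training depth values are hypothetical.

```python
import math
import random
import statistics


def lower_predictive_limit(depths, level=0.95, n_draws=20000, seed=0):
    """One-sided lower prediction limit for the depth of a future spectrum.

    ASSUMPTION (not from the abstract): log(depth) ~ Normal(mu, sigma^2)
    with prior p(mu, sigma^2) proportional to 1 / sigma^2.
    """
    rng = random.Random(seed)
    y = [math.log(d) for d in depths]
    n = len(y)
    ybar = statistics.mean(y)
    s2 = statistics.variance(y)  # sample variance, n - 1 denominator
    draws = []
    for _ in range(n_draws):
        # sigma^2 | y ~ (n - 1) s^2 / chi2(n - 1); chi2(k) = Gamma(k / 2, scale 2)
        sigma2 = (n - 1) * s2 / rng.gammavariate((n - 1) / 2, 2.0)
        # mu | sigma^2, y ~ Normal(ybar, sigma^2 / n)
        mu = rng.gauss(ybar, math.sqrt(sigma2 / n))
        # future observation drawn given the sampled parameters
        draws.append(rng.gauss(mu, math.sqrt(sigma2)))
    draws.sort()
    # empirical (1 - level) quantile of the predictive draws, back-transformed
    return math.exp(draws[int((1 - level) * n_draws)])


def is_conform(depth, limit):
    """Step 04: accept a spectrum whose depth is at or above the limit."""
    return depth >= limit


# Hypothetical depth values of ten training spectra (output of Step 02).
train_depths = [0.36, 0.37, 0.38, 0.39, 0.40, 0.41, 0.42, 0.43, 0.44, 0.45]
limit = lower_predictive_limit(train_depths)
decisions = [is_conform(d, limit) for d in (0.41, 0.20)]
```

Because the parameters are sampled from their posterior before each future value is drawn, the resulting limit is wider than a plug-in quantile would be, which is the sense in which a predictive limit accounts for parameter uncertainty. For an outlierness metric the same logic would apply with an upper limit and the inequality reversed.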
This method was applied to the identification of two types of drug products: first, 13 paracetamol formulations, and second, 11 ibuprofen formulations. The data consisted of 20 Raman spectra measured on each batch of each formulation, with 2 to 4 batches per formulation. The classification performances were compared to those of data-driven SIMCA (DD-SIMCA), using confusion matrices derived from true positives and false positives. The DD-SIMCA parameters, namely the number of components and the confidence level, were optimized for each formulation to obtain an effective true positive rate close to 95%.
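As an illustration of how such a comparison can be summarised, the sketch below derives the confusion counts and the true and false positive rates from accept/reject decisions. The function name and the decisions are hypothetical and do not reproduce the results of this study.

```python
def confusion_summary(accepted, is_target):
    """Confusion counts and rates for a one-class (accept/reject) decision.

    accepted:  True where the model accepted the spectrum as conforming.
    is_target: True where the spectrum truly belongs to the target class.
    """
    tp = fn = fp = tn = 0
    for acc, tgt in zip(accepted, is_target):
        if tgt and acc:
            tp += 1  # target spectrum correctly accepted
        elif tgt:
            fn += 1  # target spectrum wrongly rejected
        elif acc:
            fp += 1  # foreign spectrum wrongly accepted
        else:
            tn += 1  # foreign spectrum correctly rejected
    return {
        "TP": tp, "FN": fn, "FP": fp, "TN": tn,
        "TPR": tp / (tp + fn),
        "FPR": fp / (fp + tn),
    }


# Hypothetical decisions: 20 target-class spectra, then 10 from other formulations.
is_target = [True] * 20 + [False] * 10
accepted = [True] * 19 + [False] + [True] + [False] * 9
summary = confusion_summary(accepted, is_target)
```

Comparing the observed true positive rate with the nominal level of the acceptance limit (95% here) gives a direct check of how well a model is calibrated.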
4 Results and discussion
The analysis of the confusion matrices for each type of product showed that the classification performances of the proposed methods are generally similar to those of the SIMCA model. However, the newly proposed approaches offer several advantages. First, their true positive rates are generally close to, and consistent with, the nominal coverage of the prediction interval, whereas for the SIMCA model the nominal confidence level may be very far from the observed rate. Second, there is no parameter to optimize, contrary to SIMCA.
5 Conclusion
A new class-modelling method based on the concept of functional depth and Bayesian modelling is proposed. It is promising, as it is competitive with existing methods in terms of classification performance while being easier to set up, since it has no parameter to optimize.
6 References
[1] P. Oliveri. Class-Modelling in Food Analytical Chemistry: Development, Sampling, Optimisation and Validation Issues - A tutorial. Anal. Chim. Acta 2017, 982, 9-19.
[2] S. Nagy. Statistical depth for functional. Thesis, Charles University and KU Leuven, 2016.