Abstract:
Cluster analysis is used to group similar objects into a given
number of clusters. Several algorithms are available to construct these clusters. This
poster focuses on two particular cases of the generalized k-means algorithm: the
classical k-means procedure and the k-medoids algorithm. The data of interest are
assumed to come from an underlying population consisting of a mixture of two groups. Among
the outputs of these clustering techniques, a classification rule is provided to assign
each object to one of the clusters. When classification is the main objective of the statistical
analysis, performance is often measured by means of an error rate. Two types of error rates
can be computed: a theoretical one and an empirical one. The first can be written as
ER(F, Fm) where F is the distribution of the training sample used to set up the classification
rule and Fm (model distribution) is the distribution under which the quality of the rule is
assessed (via a test sample). The empirical error rate corresponds to ER(F, F), meaning that
the classification rule is tested on the same sample as the one used to set up the rule.
If the data contain outliers, the classification rule may be corrupted. Even if
it is evaluated at the model distribution, the theoretical error rate may then be contaminated,
while the effect of contamination on the empirical error rate is twofold: both the rule and the
test sample are contaminated. To measure the robustness of classification based on clustering,
influence functions have been computed, both for the theoretical and the empirical error rates.
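The two error rates and the effect of outliers described above can be illustrated numerically. The following is only a minimal univariate sketch, not the procedure used in the poster: the helper names (`two_means_1d`, `error_rate`, `draw_mixture`), the min/max initialisation, the sample sizes, the model parameters, the outlier location (20) and the label given to the outliers are all assumptions made for illustration.

```python
import random
from statistics import NormalDist


def two_means_1d(xs, iters=100):
    """Lloyd's algorithm with k=2 on the real line; returns (low, high) centers."""
    c1, c2 = min(xs), max(xs)  # simple deterministic start (an assumption)
    for _ in range(iters):
        cut = (c1 + c2) / 2
        left = [x for x in xs if x < cut]
        right = [x for x in xs if x >= cut]
        if not left or not right:
            break
        n1, n2 = sum(left) / len(left), sum(right) / len(right)
        if (n1, n2) == (c1, c2):  # centers unchanged: converged
            break
        c1, c2 = n1, n2
    return c1, c2


def error_rate(xs, labels, centers):
    """Share of points whose nearest center disagrees with their true group."""
    cut = sum(centers) / 2  # nearest-center rule reduces to a midpoint cut in 1-D
    return sum((x >= cut) != (y == 1) for x, y in zip(xs, labels)) / len(xs)


def draw_mixture(n, mu1, mu2, sigma, rng):
    """n draws from 0.5 N(mu1, sigma^2) + 0.5 N(mu2, sigma^2), with group labels."""
    labels = [rng.randrange(2) for _ in range(n)]
    return [rng.gauss(mu2 if y else mu1, sigma) for y in labels], labels


rng = random.Random(0)
train_x, train_y = draw_mixture(2000, -1.0, 1.0, 1.0, rng)
test_x, test_y = draw_mixture(2000, -1.0, 1.0, 1.0, rng)

clean_rule = two_means_1d(train_x)
empirical_er = error_rate(train_x, train_y, clean_rule)  # estimates ER(F, F)
test_er = error_rate(test_x, test_y, clean_rule)         # estimates ER(F, Fm)
optimal_er = NormalDist().cdf(-1.0)  # error of the midpoint cut at 0 under this model

# Add 100 outliers (about 5% of the sample) at 20 to the training data only.
dirty_x = train_x + [20.0] * 100
dirty_y = train_y + [1] * 100  # outliers arbitrarily labelled as group 1 (assumption)
dirty_rule = two_means_1d(dirty_x)
dirty_test_er = error_rate(test_x, test_y, dirty_rule)         # only the rule is contaminated
dirty_empirical_er = error_rate(dirty_x, dirty_y, dirty_rule)  # rule and sample contaminated

print(empirical_er, test_er, optimal_er, dirty_test_er, dirty_empirical_er)
```

In this sketch the outliers capture one of the two centers, so the contaminated rule assigns nearly every clean observation to the same cluster and the error rate on the clean test sample deteriorates sharply.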
When using the theoretical error rate, results similar to those derived by Croux et al.
(2008) and Croux et al. (2008) in discriminant analysis were observed. More specifically, under
optimality (which happens when the model distribution is FN = 0.5 N(μ1, Σ) + 0.5 N(μ2, Σ),
Qiu and Tamhane 2007), the contaminated error rate can never be smaller than the optimal
value, resulting in a first order influence function identically equal to 0. Second order influence
functions would then need to be computed, which is left for future research. When
optimality does not hold, the first order influence function of the theoretical error rate no
longer vanishes and shows that contamination may improve the error rate achieved under
the non-optimal model. Similar computations have been performed for the empirical error
rate, as the poster will show.
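For reference, the quantities discussed in this paragraph can be written out using the standard definitions from the robustness literature. The notation below (the contaminated distribution F_ε, the point mass Δ_x at x, and the symbols IF and IF2) is assumed here for illustration and is not taken from the poster itself.

```latex
\[
F_\varepsilon = (1-\varepsilon)\,F_m + \varepsilon\,\Delta_x ,
\]
\[
\mathrm{IF}(x;\mathrm{ER},F_m)
  = \left.\frac{\partial}{\partial\varepsilon}\,
    \mathrm{ER}(F_\varepsilon,F_m)\right|_{\varepsilon=0},
\qquad
\mathrm{IF2}(x;\mathrm{ER},F_m)
  = \left.\frac{\partial^{2}}{\partial\varepsilon^{2}}\,
    \mathrm{ER}(F_\varepsilon,F_m)\right|_{\varepsilon=0},
\]
\[
\mathrm{ER}(F_\varepsilon,F_m) \approx \mathrm{ER}(F_m,F_m)
  + \varepsilon\,\mathrm{IF}(x;\mathrm{ER},F_m)
  + \frac{\varepsilon^{2}}{2}\,\mathrm{IF2}(x;\mathrm{ER},F_m).
\]
```

Read this way, a first order influence function identically equal to 0 leaves the second order term as the leading description of how a small amount of contamination at x changes the error rate.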
The first and, when required, second order influence functions of the theoretical and empirical
error rates are useful in their own right to compare the robustness of the 2-means
and 2-medoids classification procedures. They also have other applications. For example, they
may be used to derive diagnostic tools to detect observations having an unduly large
influence on the error rate. Also, under optimality, the second order influence function of the
theoretical error rate can yield asymptotic relative classification efficiencies.