Influence function of the error rate of classification based on clustering

Ruwet, Christel; Haesbroeck, Gentiane

Unpublished conference/Abstract (Scientific congresses and symposiums)

2009 • Third annual doctoral workshop of the Graduate School STAT-ACTU

Permalink
https://hdl.handle.net/2268/27531

Files

Full Text

No document available.

Annexes

Presentation19_05_09.pdf

Publisher postprint (1.64 MB)

Details

Abstract :

Cluster analysis is used to group similar objects into a given number of clusters, and several algorithms are available to construct these clusters. This talk focuses on two particular cases of the generalized k-means algorithm: the classical k-means procedure and the k-medoids algorithm. The data of interest are assumed to come from an underlying population consisting of a mixture of two groups. Among their outputs, these clustering techniques provide a classification rule that assigns each object to one of the clusters. When classification is the main objective of the statistical analysis, performance is often measured by means of an error rate. Two types of error rate can be computed: a theoretical one and an empirical one. The theoretical error rate can be written as ER(F, Fm), where F is the distribution of the training sample used to set up the classification rule and Fm (the model distribution) is the distribution under which the quality of the rule is assessed (via a test sample). The empirical error rate corresponds to ER(F, F), meaning that the classification rule is tested on the same sample as the one used to set it up. This talk presents the results concerning the theoretical error rate.

If the data contain outliers, the classification rule may be corrupted, and the theoretical error rate may then be contaminated even though it is evaluated at the model distribution. To measure the robustness of classification based on clustering, influence functions have been computed. The results are similar to those derived by Croux et al. (2008) and Croux et al. (2008) in discriminant analysis. More specifically, under optimality (which happens when the model distribution is FN = 0.5 N(μ1, σ) + 0.5 N(μ2, σ); Qiu and Tamhane 2007), the contaminated error rate can never be smaller than the optimal value, so the first order influence function is identically equal to 0 and second order influence functions need to be computed. When optimality does not hold, the first order influence function of the theoretical error rate no longer vanishes and shows that contamination may improve the error rate achieved under the non-optimal model. The first and, when required, second order influence functions of the theoretical error rate are useful in their own right for comparing the robustness of the 2-means and 2-medoids classification procedures. They also have other applications. For example, they may be used to derive diagnostic tools for detecting observations with an unduly large influence on the error rate. Also, under optimality, the second order influence function of the theoretical error rate can yield asymptotic relative classification efficiencies.
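The optimality statement in the abstract can be illustrated numerically. The sketch below is not the authors' computation: it assumes a univariate setting, reads σ as a standard deviation, and uses hypothetical values (μ1 = -1, μ2 = 1, σ = 1, a point-mass contamination at x = 8). It fits a plain 1-D 2-means rule on a clean and on a contaminated training sample, then evaluates each rule's theoretical error rate in closed form under the balanced mixture model.

```python
import random
from statistics import NormalDist, fmean

# Hypothetical model parameters (assumptions, not taken from the talk).
MU1, MU2, SIGMA = -1.0, 1.0, 1.0


def two_means_cutoff(xs, iters=100):
    """1-D Lloyd iterations with k = 2; returns the midpoint of the two centers."""
    c1, c2 = min(xs), max(xs)  # simple initialization at the sample extremes
    for _ in range(iters):
        cut = (c1 + c2) / 2
        left = [x for x in xs if x <= cut]
        right = [x for x in xs if x > cut]
        n1, n2 = fmean(left), fmean(right)
        if (n1, n2) == (c1, c2):  # centers stopped moving
            break
        c1, c2 = n1, n2
    return (c1 + c2) / 2


def theoretical_error_rate(cut):
    """Error rate of the rule 'x <= cut -> group 1' under 0.5 N(MU1, SIGMA) + 0.5 N(MU2, SIGMA)."""
    g1, g2 = NormalDist(MU1, SIGMA), NormalDist(MU2, SIGMA)
    # Group 1 points falling above the cut, plus group 2 points falling below it.
    return 0.5 * (1 - g1.cdf(cut)) + 0.5 * g2.cdf(cut)


def sample(n, rng):
    """Draw n observations from the balanced two-group mixture."""
    return [rng.gauss(MU1 if rng.random() < 0.5 else MU2, SIGMA) for _ in range(n)]


rng = random.Random(0)
clean = sample(2000, rng)
contaminated = clean + [8.0] * 100  # roughly 5% of outliers placed at x = 8

optimal_er = theoretical_error_rate((MU1 + MU2) / 2)
er_clean = theoretical_error_rate(two_means_cutoff(clean))
er_contaminated = theoretical_error_rate(two_means_cutoff(contaminated))

print(f"optimal ER:            {optimal_er:.4f}")
print(f"ER, clean rule:        {er_clean:.4f}")
print(f"ER, contaminated rule: {er_contaminated:.4f}")
```

Under this balanced, equal-variance model the error rate is minimized at the midpoint cutoff (μ1 + μ2)/2, so any cutoff produced by a training sample, contaminated or not, yields an error rate at least as large as the optimal value. This is the property that forces the first order influence function to vanish.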

Disciplines :

Mathematics

Author, co-author :

Ruwet, Christel ; Université de Liège - ULiège > Département de mathématique > Statistique (aspects théoriques)

Haesbroeck, Gentiane ; Université de Liège - ULiège > Département de mathématique > Statistique (aspects théoriques)

Language :

English

Title :

Influence function of the error rate of classification based on clustering

Publication date :

19 May 2009

Event name :

Third annual doctoral workshop of the Graduate School STAT-ACTU

Event place :

Liège, Belgium

Event date :

19 May 2009

Available on ORBi :

since 04 November 2009

Statistics

Number of views

217 (25 by ULiège)

Number of downloads

70 (4 by ULiège)
