Detection of influential observations on the error rate based on the generalized k-means clustering procedure

Ruwet, Christel; Haesbroeck, Gentiane

No full text

Unpublished conference/Abstract (Scientific congresses and symposiums)

2009 • 17th Annual meeting of the Belgian Statistical Society

Permalink
https://hdl.handle.net/2268/27529

Files

Full Text

No document available.

Annexes

PresentationSBS2009-RUWET.pdf

Publisher postprint (1.72 MB)

Details

Abstract :

Cluster analysis groups similar objects into a given number of clusters, and several algorithms are available to construct these clusters. This talk focuses on the generalized k-means algorithm, with the data of interest assumed to come from an underlying population consisting of a mixture of two groups. Among the outputs of this clustering technique is a classification rule that assigns each object to one of the clusters.

When classification is the main objective of the statistical analysis, performance is often measured by an error rate ER(F; Fm), where F is the distribution of the training sample used to set up the classification rule and Fm (the model distribution) is the distribution under which the quality of the rule is assessed (via a test sample). Under contamination, the distribution F of the training sample has to be replaced by a contaminated one, F(eps) say, where eps is the fraction of contamination. In that case, the error rate is corrupted because it relies on a contaminated rule, while the test sample may still be considered as distributed according to the model distribution.

To measure the robustness of classification based on this clustering procedure, influence functions of the error rate may be computed. The idea has already been exploited by Croux et al. (2008) and Croux et al. (2008) in the context of linear and logistic discrimination. In this setup, the contaminated distribution takes the form F(eps) = (1 - eps) Fm + eps Dx, where Dx is the Dirac distribution putting all its mass at x. After studying the influence function of the error rate of the generalized k-means procedure, which depends on the influence functions of the generalized k-means centers derived by Garcia-Escudero and Gordaliza (1999), a diagnostic tool based on its value will be presented. The aim is to detect observations in the training sample that can be influential for the error rate.
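The contamination setup described in the abstract can be illustrated numerically. The following is only a minimal sketch, not the authors' method: it assumes a one-dimensional balanced mixture of N(-mu, 1) and N(mu, 1) as the model distribution, uses classical 2-means (squared-distance penalty) as a special case of the generalized k-means procedure, and replaces the influence function by a finite-difference quotient (ER(F(eps); Fm) - ER(F; Fm)) / eps computed on a training sample. All function names and parameter values are hypothetical.

```python
import math
import numpy as np


def normal_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def two_means_1d(points, weights, n_iter=100):
    """Weighted Lloyd iterations for k = 2 in one dimension.

    Assumption: classical squared-distance penalty, not a general penalty.
    Returns the two centers in increasing order.
    """
    centers = np.array([points.min(), points.max()], dtype=float)
    for _ in range(n_iter):
        cutoff = centers.mean()
        left = points < cutoff
        new = np.array([
            np.average(points[left], weights=weights[left]),
            np.average(points[~left], weights=weights[~left]),
        ])
        if np.allclose(new, centers):
            break
        centers = new
    return centers


def error_rate(centers, mu):
    """Error rate under the assumed model 0.5 N(-mu, 1) + 0.5 N(mu, 1).

    The rule assigns an observation to the first group when it lies below
    the midpoint of the two centers.
    """
    c = centers.mean()
    return 0.5 * (1.0 - normal_cdf(c + mu)) + 0.5 * normal_cdf(c - mu)


def empirical_influence(sample, x, mu, eps=0.01):
    """Finite-difference stand-in for the influence of x on the error rate.

    The training distribution is (1 - eps) * F_n + eps * Dirac(x), where F_n
    is the empirical distribution of the sample; the test distribution stays
    equal to the model distribution.
    """
    sample = np.asarray(sample, dtype=float)
    n = len(sample)
    base_w = np.full(n, 1.0 / n)
    er_clean = error_rate(two_means_1d(sample, base_w), mu)
    pts = np.append(sample, x)
    w = np.append((1.0 - eps) * base_w, eps)
    er_cont = error_rate(two_means_1d(pts, w), mu)
    return (er_cont - er_clean) / eps


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    mu, n = 2.0, 2000
    in_first_group = rng.random(n) < 0.5
    sample = np.where(in_first_group, -mu, mu) + rng.standard_normal(n)
    for x in (0.0, 3.0, 10.0):
        print(x, empirical_influence(sample, x, mu))
```

Evaluating `empirical_influence` at each training observation gives a crude version of the kind of diagnostic the abstract mentions: observations with large absolute values are candidates for being influential on the error rate. The choice of eps and the use of a sample-based quotient instead of the analytic influence function are simplifications made for this sketch.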

Disciplines :

Mathematics

Author, co-author :

Ruwet, Christel ; Université de Liège - ULiège > Département de mathématique > Statistique (aspects théoriques)

Haesbroeck, Gentiane ; Université de Liège - ULiège > Département de mathématique > Statistique (aspects théoriques)

Language :

English

Title :

Detection of influential observations on the error rate based on the generalized k-means clustering procedure

Publication date :

14 October 2009

Event name :

17th Annual meeting of the Belgian Statistical Society

Event place :

Lommel, Belgium

Event date :

14 October 2009 to 16 October 2009

Available on ORBi :

since 04 November 2009
