Impact of contamination on empirical and theoretical error

Ruwet, Christel; Haesbroeck, Gentiane

Unpublished conference/Abstract (Scientific congresses and symposiums)

2009 • ICORS09 International Conference on Robust Statistics

Permalink
https://hdl.handle.net/2268/27530

Files

Full Text

No document available.

Annexes

Icors09CRuwet.pdf

Publisher postprint (1.01 MB)

Details

Keywords :

Generalized k-means; Influence function; Error rate

Abstract :

[en] Classification analysis groups similar objects into a given number of groups by means of a classification rule. Many classification procedures are available: linear discrimination, logistic discrimination, etc. The focus of this poster is on classification resulting from a clustering analysis. Indeed, among the outputs of classical clustering techniques, a classification rule is provided in order to classify the objects into one of the clusters. More precisely, let F denote the underlying distribution and assume that the generalized k-means algorithm with a penalty function is used to construct the k clusters C1(F), ..., Ck(F) with centers T1(F), ..., Tk(F). When k true groups are believed to exist in the data, classification might be the main objective of the statistical analysis. The performance of a particular classification technique can be measured by means of an error rate. Depending on the availability of data, two types of error rates may be computed: a theoretical one and a more empirical one. In the first case, the rule is estimated on a training sample with distribution F, while the classification performance is evaluated on a test sample distributed according to a model distribution of interest, Fm say. In the second case, the same data are used to set up the rule and to evaluate its performance. Under contamination, the distribution F of the training sample has to be replaced by a contaminated one, F(eps) say (where eps is the fraction of contamination). In that case, the theoretical error rate is corrupted since it relies on a contaminated rule, but it may still be evaluated on a test sample distributed according to the model distribution. The empirical error rate is affected twice: via the rule and also via the sample used to evaluate the classification performance. To measure the robustness of classification based on clustering, influence functions of the error rate may be computed.
The idea has already been exploited by Croux et al. (2008) and Croux et al. (2008) in the context of linear and logistic discrimination. In the computation of influence functions, the contaminated distribution takes the form F(eps) = (1 − eps) Fm + eps Dx, where Dx is the Dirac distribution putting all its mass at x. It is interesting to note that the impact of the point mass at x may be positive, i.e. it may decrease the error rate, when the data at hand are used to evaluate the error.
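
The distinction between the two error rates under the contaminated distribution F(eps) can be sketched numerically. The Python snippet below is only an illustration under stated assumptions, not the authors' procedure: it assumes a one-dimensional model Fm that is an equal mixture of N(-2, 1) and N(2, 1), uses plain 2-means (squared-distance penalty) in place of the generalized k-means, matches the cluster with the smaller center to group 0, and treats the contaminating point x as a member of group 1. All function names are hypothetical.

```python
import numpy as np

def sample_model(n, rng):
    """Draw n points from the assumed model Fm: 0.5*N(-2,1) + 0.5*N(2,1)."""
    groups = rng.integers(0, 2, size=n)
    values = rng.normal(loc=np.where(groups == 1, 2.0, -2.0), scale=1.0)
    return values, groups

def contaminate(values, groups, eps, x, x_group):
    """Mimic F(eps) = (1 - eps) Fm + eps Dx by replacing a fraction eps of points with x."""
    values, groups = values.copy(), groups.copy()
    n_out = int(round(eps * len(values)))
    values[:n_out] = x
    groups[:n_out] = x_group  # assumption: x is counted as a member of x_group
    return values, groups

def two_means_1d(sample, n_iter=100):
    """Lloyd's algorithm for k = 2 in one dimension; returns the sorted centers."""
    centers = np.array([sample.min(), sample.max()], dtype=float)
    for _ in range(n_iter):
        # Assign each point to its nearest center, then recompute the centers.
        labels = (np.abs(sample - centers[1]) < np.abs(sample - centers[0])).astype(int)
        new_centers = np.array([sample[labels == j].mean() for j in (0, 1)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return np.sort(centers)

def error_rate(centers, values, groups):
    """Share of points whose nearest center does not match their true group."""
    predicted = (np.abs(values - centers[1]) < np.abs(values - centers[0])).astype(int)
    return float(np.mean(predicted != groups))

rng = np.random.default_rng(0)
train_x, train_g = sample_model(2000, rng)   # training sample
test_x, test_g = sample_model(20000, rng)    # clean test sample from Fm

results = {}
for eps in (0.0, 0.1):
    cx, cg = contaminate(train_x, train_g, eps, x=10.0, x_group=1)
    centers = two_means_1d(cx)
    results[eps] = {
        # Empirical: rule and evaluation both use the (contaminated) training data.
        "empirical": error_rate(centers, cx, cg),
        # Theoretical: contaminated rule, evaluated on the clean test sample.
        "theoretical": error_rate(centers, test_x, test_g),
    }
    print(eps, results[eps])
```

In this setup the theoretical error rate is affected only through the estimated centers, whereas the empirical error rate also counts the contaminating points themselves in the evaluation, which is why the two can move by different amounts. A finite-sample analogue of the influence function could be obtained by repeating the comparison for small eps and dividing the change in error rate by eps.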

Disciplines :

Mathematics

Author, co-author :

Ruwet, Christel ; Université de Liège - ULiège > Département de mathématique > Statistique (aspects théoriques)

Haesbroeck, Gentiane ; Université de Liège - ULiège > Département de mathématique > Statistique (aspects théoriques)

Language :

English

Title :

Impact of contamination on empirical and theoretical error

Publication date :

18 June 2009

Event name :

ICORS09 International Conference on Robust Statistics

Event place :

Parma, Italy

Event date :

14 June 2009 to 19 June 2009

Audience :

International

Available on ORBi :

since 04 November 2009

Statistics

Number of views

182 (25 by ULiège)

Number of downloads

34 (4 by ULiège)
