Unpublished conference/Abstract (Scientific congresses and symposiums)
Impact of contamination on empirical and theoretical error
Ruwet, Christel; Haesbroeck, Gentiane
2009, ICORS09 International Conference on Robust Statistics
 

Files


Full Text
No document available.
Annexes
Icors09CRuwet.pdf
Publisher postprint (1.01 MB)





Details



Keywords :
Generalized k-means; Influence function; Error rate
Abstract :
[en] Classification analysis groups similar objects into a given number of groups by means of a classification rule. Many classification procedures are available: linear discrimination, logistic discrimination, etc. This poster focuses on classification resulting from a clustering analysis. Indeed, the outputs of classical clustering techniques include a classification rule that assigns each object to one of the clusters. More precisely, let F denote the underlying distribution and assume that the generalized k-means algorithm with a penalty function is used to construct the k clusters C1(F), ..., Ck(F) with centers T1(F), ..., Tk(F). When k true groups are believed to exist in the data, classification may be the main objective of the statistical analysis. The performance of a particular classification technique can be measured by means of an error rate. Depending on the availability of data, two types of error rate may be computed: a theoretical one and a more empirical one. In the first case, the rule is estimated on a training sample with distribution F, while classification performance is evaluated on a test sample distributed according to a model distribution of interest, Fm say. In the second case, the same data are used both to set up the rule and to evaluate its performance. Under contamination, the distribution F of the training sample has to be replaced by a contaminated one, F(eps) say, where eps is the fraction of contamination. In that case, the theoretical error rate is corrupted because it relies on a contaminated rule, although the test sample may still be distributed according to the model distribution. The empirical error rate is affected twice: via the rule and via the sample used to evaluate the classification performance. To measure the robustness of classification based on clustering, influence functions of the error rate may be computed.
This idea has already been exploited by Croux et al. (2008) and Croux et al. (2008) in the context of linear and logistic discrimination. In the computation of influence functions, the contaminated distribution takes the form F(eps) = (1 − eps) Fm + eps Dx, where Dx is the Dirac distribution putting all its mass at x. It is interesting to note that the impact of the point mass at x may be positive, i.e. it may decrease the error rate, when the data at hand are used to evaluate the error.
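The distinction between the two error rates under F(eps) can be illustrated with a small simulation. The following is only a minimal sketch under stated assumptions, not the computation presented in the poster: it assumes a one-dimensional model Fm = 0.5 N(-2, 1) + 0.5 N(2, 1), the classical squared-distance penalty (so the generalized k-means reduces to ordinary 2-means), and a hypothetical convention that the contaminating point x is labelled as belonging to the second group. All function names are illustrative. A finite eps is used, so the output shows the effect of contamination on the error rates rather than an influence function, which would be the derivative at eps = 0.

```python
import random


def sample_model(n, rng):
    """Draw n labelled points (value, group) from the assumed model
    Fm = 0.5*N(-2, 1) + 0.5*N(2, 1)."""
    data = []
    for _ in range(n):
        g = rng.randrange(2)
        data.append((rng.gauss(-2.0 if g == 0 else 2.0, 1.0), g))
    return data


def contaminate(data, eps, x, outlier_group):
    """Mimic F(eps) = (1 - eps) Fm + eps Dx: replace a fraction eps of the
    sample by copies of x. The label given to x is an assumption."""
    m = round(eps * len(data))
    return data[: len(data) - m] + [(x, outlier_group)] * m


def two_means(values, iters=100):
    """Lloyd iterations for 2-means in 1D, started from the min and max."""
    c0, c1 = min(values), max(values)
    for _ in range(iters):
        cut = (c0 + c1) / 2
        low = [v for v in values if v <= cut]
        high = [v for v in values if v > cut]
        n0 = sum(low) / len(low) if low else c0
        n1 = sum(high) / len(high) if high else c1
        if (n0, n1) == (c0, c1):
            break
        c0, c1 = n0, n1
    return c0, c1


def error_rate(centers, data):
    """Share of points whose nearest center disagrees with their group
    (lower center is matched to group 0, upper center to group 1)."""
    cut = sum(centers) / 2
    wrong = sum((v > cut) != (g == 1) for v, g in data)
    return wrong / len(data)


def run(eps, x=50.0, seed=0):
    """Return (theoretical, empirical) error rates for contamination eps."""
    rng = random.Random(seed)
    train = contaminate(sample_model(2000, rng), eps, x, outlier_group=1)
    test = sample_model(5000, rng)  # clean test sample drawn from Fm
    centers = two_means([v for v, _ in train])
    return error_rate(centers, test), error_rate(centers, train)


for eps in (0.0, 0.2):
    theoretical, empirical = run(eps)
    print(f"eps={eps}: theoretical={theoretical:.3f}, empirical={empirical:.3f}")
```

In this sketch the theoretical error rate changes only through the contaminated centers, since the test sample is still drawn from Fm, whereas the empirical error rate also depends on how the contaminating point itself is classified.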
Disciplines :
Mathematics
Author, co-author :
Ruwet, Christel ;  Université de Liège - ULiège > Département de mathématique > Statistique (aspects théoriques)
Haesbroeck, Gentiane ;  Université de Liège - ULiège > Département de mathématique > Statistique (aspects théoriques)
Language :
English
Title :
Impact of contamination on empirical and theoretical error
Publication date :
18 June 2009
Event name :
ICORS09 International Conference on Robust Statistics
Event place :
Parma, Italy
Event date :
14 June 2009 to 19 June 2009
Audience :
International
Available on ORBi :
since 04 November 2009

