Logistic regression is frequently used for classifying observations into two groups. Unfortunately, there are often outlying observations in a data set, and these might affect the estimated model and the associated classification error rate. In this paper, the authors study the effect of observations in the training sample on the error rate by deriving influence functions. They obtain a general expression for the influence function of the error rate, and they compute it for the maximum likelihood estimator as well as for several robust logistic discrimination procedures. Besides being of interest in their own right, the influence functions are also used to derive asymptotic classification efficiencies of different logistic discrimination rules. The authors also show how influential points can be detected by means of a diagnostic plot based on the values of the influence function.
Disciplines :
Mathematics
Author, co-author :
Croux, Christophe
Haesbroeck, Gentiane ; Université de Liège - ULiège > Département de mathématique > Statistique (aspects théoriques)
Joossens, Kristel
Language :
English
Title :
Logistic discrimination using robust estimators: an influence function approach