Course notes (Learning materials)
Multivariate statistical analysis
Albert, Adelin
2018
 

Files


Full Text
Multivariate statistical analysis (A. Albert, 2018).pdf
Author postprint (1.64 MB)

All documents in ORBi are protected by a user license.

Details



Keywords :
statistics; multivariate
Abstract :
[en] Multivariate statistics covers all methods that deal with the analysis of observations made on several variables simultaneously. It therefore relies heavily on vectors and, more generally, on matrix calculus. Most problems and applications encountered in practice involve at least two variables and are therefore multidimensional. The best-known example in the literature is the Fisher iris dataset, which consists of measurements of four variables (sepal length, sepal width, petal length and petal width) made on 50 Iris setosa, 50 Iris versicolor and 50 Iris virginica flowers. Remarkably, this dataset, which is reproduced in the Annexes, can be used to illustrate most of the multivariate statistical methods described in this book. Multivariate statistics requires at least some basic knowledge of univariate statistics, which can be found in any elementary or advanced textbook. Multivariate statistical methods were mainly developed during the 20th century, but their intensive use has only been made possible by the advent of computers. Indeed, as the number of variables to be studied grows, statistical calculations become longer, more complicated and eventually intractable by hand. Today computers can solve complex multivariate statistical problems in a few seconds. Moreover, the advent of user-friendly statistical packages such as SAS, SPSS, Stata and R has made multivariate methods available to a large community of researchers. Nonetheless, although statistical packages have become cheaper and easier to use, using them appropriately in practice requires some knowledge of what these multivariate methods can and cannot do. Providing this knowledge was the primary objective when preparing these lecture notes for students and researchers. This book is structured in 8 chapters.
Chapter 1 is a rapid overview of the basic notions of matrix calculus needed to present, characterize and solve multivariate statistical problems. In Chapter 2, we briefly introduce the concepts of population, sample, variable, type of variable (quantitative, qualitative and binary), and data. Then we define the n × p data matrix, which results from observing p variables (columns) on n subjects or objects (rows). The representation of multivariate observations by different techniques will also be addressed. In Chapter 3 we will extend the classical concepts of mean and variance to the multivariate setting. In particular, we introduce the variance-covariance matrix and the correlation matrix. The Mahalanobis distance between two points (observations) of the p-dimensional space, which generalizes the classical Euclidean distance by accounting for the associations between variables, will also be defined. Chapter 4 is concerned with the first important multivariate statistical method, namely principal component analysis (PCA). This method can provide a two-dimensional graphical representation of data points of the multidimensional space with a minimal loss of information. In other words, it gives us a photograph of the data matrix. In this chapter we also briefly mention the more recent biplot method, which provides in a similar way a graphical representation of the variables studied. Chapter 5 looks at the relation and correlation between a “dependent” quantitative variable and a set of so-called “independent” or “explanatory” variables. This is the method of multiple regression (MR) and correlation. In many statistics books, this method is seen as a univariate method in the sense that the explanatory variables are considered as factors of an experiment which are fixed by the investigator (factorial design) and only the dependent variable is observed.
We decided to consider it as a multivariate statistical method because it requires matrix calculus but also because it becomes a multivariate problem when all the variables, dependent and explanatory, are observed simultaneously, as is often the case. In Chapter 6 we turn to one of the most widely used methods of multivariate statistics. Logistic regression (LR) studies the association between a binary dependent variable (occurrence or non-occurrence of an event) and a vector of variables. The notion of the odds ratio (OR) will be introduced. We shall also give a short description of ordinal logistic regression (OLR), which extends logistic regression to the case where the dependent variable is no longer binary but ordinal (ordered categories). Chapter 7 is dedicated to survival analysis. Basically, the methods described here concern the study of the association between a lifetime or time-to-event variable and a set of covariates. This requires the definition of the survival curve and its Kaplan-Meier estimation, of the hazard function and of the hazard ratio (HR). The chapter will also focus on the celebrated Cox “proportional hazards (PH)” regression model. This method, which has a wide range of applications, is one of the most cited in the literature. Finally, Chapter 8 handles one of the oldest multivariate statistical problems, namely discriminant analysis (DA). We describe how two or several populations can be discriminated (separated) based on a vector of variables. This can be done by the Fisher linear discriminant function or its multiple group extension. We also show how “canonical discriminant analysis” allows the populations to be displayed on a two-dimensional plane, as in principal component analysis. Interestingly, discriminant analysis can also be seen as the method of classifying a multivariate observation (subject or object) of unknown origin among two or several populations with a minimal risk of error. This approach will be briefly described.
The Annexes reproduce three datasets which will be used to illustrate the multivariate statistical methods described in the text: (1) Fisher iris dataset, (2) Severe head injury patients dataset, and (3) Rectal cancer patients dataset. They also contain four of the most frequently used statistical tables, namely the Gaussian (Normal), Chi-squared, Student t and Snedecor F distributions.
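As a minimal illustration of the Mahalanobis distance mentioned for Chapter 3, the sketch below compares two pairs of points that are equally far apart in the Euclidean sense but differ once the correlation between the variables is taken into account. The function name `mahalanobis`, the covariance matrix and the points are hypothetical choices made for illustration; they are not taken from the lecture notes or their datasets.

```python
import numpy as np

def mahalanobis(x, y, cov):
    """Mahalanobis distance between x and y for a given covariance matrix."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    # Solve cov @ z = diff instead of inverting cov explicitly.
    z = np.linalg.solve(cov, diff)
    return float(np.sqrt(diff @ z))

# Hypothetical covariance matrix: unit variances, correlation 0.5.
cov = np.array([[1.0, 0.5],
                [0.5, 1.0]])
origin = np.zeros(2)

# Both points lie at the same Euclidean distance from the origin.
along = np.array([1.0, 1.0])     # follows the positive correlation
against = np.array([1.0, -1.0])  # runs against the correlation

d_along = mahalanobis(along, origin, cov)
d_against = mahalanobis(against, origin, cov)
d_euclid = mahalanobis(along, origin, np.eye(2))  # identity covariance

print(d_along, d_against, d_euclid)
```

With an identity covariance matrix the function reduces to the Euclidean distance, and the point that runs against the correlation ends up farther away than the one that follows it.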
Disciplines :
Mathematics
Author, co-author :
Albert, Adelin  ;  Université de Liège - ULiège > Département des sciences de la santé publique
Language :
English
Title :
Multivariate statistical analysis
Publication date :
2018
Institution :
ULiège - Université de Liège [Médecine], Liège, Belgium
Available on ORBi :
since 31 October 2025

Statistics


Number of views
7 (0 by ULiège)
Number of downloads
43 (0 by ULiège)
