Abstract:
Background
Cluster analysis is one of the most widely used methods for finding structure in a dataset. It divides multi-dimensional data into homogeneous groups such that subjects within each group have similar properties. However, missing values are an unavoidable part of multi-dimensional data. Although several methods now make missing data easy to handle, clustering approaches must account for how the missing values are managed.
Objective
Multiple imputation is a simple but powerful method for handling missing data. However, applying it raises several challenges for clustering. The objective of this research was to introduce an efficient framework for applying cluster analysis to incomplete datasets using multiple imputation. Using simulated scenarios inspired by real data, the proposed method addresses some limitations in the statistical literature in order to find highly discriminating clusters.
Method
In the first step of multiple imputation, m imputed datasets were generated. Variable reduction methods and cluster analysis strategies were then applied to the imputed datasets, and a cluster assignment was calculated for each imputed dataset. To combine these assignments, a finite mixture of multivariate multinomial distributions was proposed to estimate the number of clusters; the final cluster was assigned to each observation by maximum likelihood estimation via the EM algorithm.
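The combination step can be illustrated with a minimal sketch. It assumes that the cluster labels obtained on the m imputed datasets are stored as an n × m integer array, and it fits a finite mixture of independent multinomial distributions (a latent class model) by EM using NumPy only. The function name `pool_cluster_labels`, the random initialisation, the fixed number of iterations and the use of BIC to compare candidate numbers of clusters are assumptions made for illustration, not details taken from the study.

```python
import numpy as np

def pool_cluster_labels(labels, n_clusters, n_iter=200, seed=0, eps=1e-9):
    """Pool per-imputation cluster labels with a multinomial mixture (EM).

    labels: (n, m) integer array; labels[i, j] is the cluster of subject i
            in imputed dataset j (label codes may differ between datasets).
    Returns the final cluster per subject and the BIC of the fitted mixture.
    Hypothetical helper: names and defaults are illustrative assumptions.
    """
    labels = np.asarray(labels)
    n, m = labels.shape
    n_cat = int(labels.max()) + 1
    onehot = np.eye(n_cat)[labels]                      # shape (n, m, n_cat)

    rng = np.random.default_rng(seed)
    resp = rng.dirichlet(np.ones(n_clusters), size=n)   # random soft start

    for _ in range(n_iter):
        # M-step: mixing weights and per-dataset category probabilities
        weights = resp.mean(axis=0)
        theta = np.einsum("ik,ijc->kjc", resp, onehot) + eps
        theta /= theta.sum(axis=2, keepdims=True)

        # E-step: posterior probability of each final cluster per subject
        log_joint = np.einsum("ijc,kjc->ik", onehot, np.log(theta))
        log_joint += np.log(weights + eps)
        top = log_joint.max(axis=1, keepdims=True)
        log_norm = top[:, 0] + np.log(np.exp(log_joint - top).sum(axis=1))
        resp = np.exp(log_joint - log_norm[:, None])

    loglik = log_norm.sum()
    n_params = (n_clusters - 1) + n_clusters * m * (n_cat - 1)
    bic = -2.0 * loglik + n_params * np.log(n)
    return resp.argmax(axis=1), bic
```

Under these assumptions, the number of clusters could be chosen by calling `pool_cluster_labels(labels, k)` for several values of k and keeping the one with the lowest BIC. Because the category probabilities are estimated separately for each imputed dataset, the sketch does not require the label codes to be aligned across imputations.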
Results
Motivated by real datasets, 178 subjects with mixed continuous and categorical variables and two known clusters were simulated, using normal and multivariate mixture distributions, respectively. Several scenarios were defined with different percentages of missingness (e.g. 25%, 50%, 75%) and of overlap between the two known clusters (e.g. 30%, 45%, 65%). In addition, different imputation, variable reduction and clustering methods were compared. The results showed that the proposed method achieved higher discrimination and matching than the other methods. The best approach, based on multiple imputation, variable reduction and the proposed combination method, was then applied to real data from the Pneumology Department of the University Hospital of Liege, with the aim of identifying clinical phenotypes among adults suffering from chronic obstructive pulmonary disease.
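A scenario of this kind could be set up roughly as follows. This is only a sketch: the number of variables, the cluster means, the category probabilities and the missingness mechanism (values deleted completely at random, cell by cell) are assumptions, since the abstract does not specify them. The `shift` parameter is a hypothetical stand-in for the degree of separation between the two clusters and is not calibrated to the overlap percentages quoted above.

```python
import numpy as np

def simulate_two_clusters(n=178, shift=1.5, seed=0):
    """Simulate n subjects from two known clusters (illustrative settings)."""
    rng = np.random.default_rng(seed)
    cluster = rng.integers(0, 2, size=n)                # known cluster labels
    # Three continuous variables: normal, with mean shifted in cluster 1
    cont = rng.normal(loc=cluster[:, None] * shift, scale=1.0, size=(n, 3))
    # One categorical variable with cluster-specific category probabilities
    probs = np.array([[0.7, 0.2, 0.1],
                      [0.1, 0.2, 0.7]])
    cat = np.array([rng.choice(3, p=probs[c]) for c in cluster], dtype=float)
    return np.column_stack([cont, cat]), cluster

def add_mcar_missingness(data, rate, seed=0):
    """Set roughly `rate` of all cells to NaN, completely at random (assumed)."""
    rng = np.random.default_rng(seed)
    out = np.asarray(data, dtype=float).copy()
    out[rng.random(out.shape) < rate] = np.nan
    return out

# Example scenario: 50% of cells missing
data, cluster = simulate_two_clusters()
incomplete = add_mcar_missingness(data, rate=0.5)
```

Each incomplete dataset produced this way would then be passed through imputation, variable reduction and clustering, and the resulting labels compared with `cluster`.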
Conclusions
Based on a large simulation study, the proposed method yielded the best discrimination and the highest agreement between the final clustering result and the known clustering in the simulated datasets.