Cluster Analysis in Incomplete Data

Nekoee Zahraei, Halehsadat; Louis, Renaud; Donneau, Anne-Françoise

No full text

Unpublished conference/Abstract (Scientific congresses and symposiums)

2020 • 41st Annual Conference of the International Society for Clinical Biostatistics

Permalink
https://hdl.handle.net/2268/252748

Details

Abstract :

[en] Background: Cluster analysis is one of the most widely used methods for finding structure in a dataset. It divides multi-dimensional data into homogeneous groups such that subjects in the same group have similar properties. Missing values, however, are an unavoidable part of multi-dimensional data. Although several methods now handle missing data easily, clustering approaches still have to account for how the missing values were handled.

Objective: Multiple imputation is a simple but powerful method in this field, but applying it before clustering raises several challenges. The objective of this research was to introduce an efficient framework for applying cluster analysis to incomplete datasets by using multiple imputation. Using simulated scenarios inspired by real data, the proposed method addresses some limitations in the statistical literature on finding highly discriminating clusters.

Method: In the first step of multiple imputation, m imputed datasets were generated. Variable reduction methods and cluster analysis strategies were then applied to the imputed datasets. Finally, a cluster assignment was calculated for each imputed dataset. For that purpose, a finite mixture of multivariate multinomial distributions was proposed to estimate the number of clusters; the final cluster result was assigned to observations by maximum likelihood, solved via the EM algorithm.

Results: Motivated by real datasets, 178 subjects with mixed continuous and categorical variables and two known clusters were generated from normal and multivariate mixture distributions, respectively. Several scenarios were defined for different percentages of missingness (e.g. 25%, 50%, 75%) and of overlap between the two known clusters (e.g. 30%, 45%, 65%). In addition, different imputation, variable reduction and clustering methods were compared. The results showed that the proposed method had high discrimination and matching compared with other methods. The best method, based on multiple imputation, variable reduction and the proposed combination method, was then applied to real data from the Pneumology Department of the University Hospital of Liège, with the aim of identifying clinical phenotypes among adults suffering from chronic obstructive pulmonary disease.

Conclusions: Based on a large simulation study, the proposed method yielded the best discrimination, with the highest matching between the final clustering result and the known clustering in the simulated dataset.
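The impute, cluster, then combine workflow described in the Method can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation and rests on several assumptions: random draws from the observed values of each variable stand in for a proper multiple imputation model, a two-cluster k-means stands in for the finite mixture model fitted by EM, the per-imputation assignments are combined by majority vote (the abstract does not specify its combination rule), and only continuous variables are handled. All function names and the toy data are hypothetical.

```python
import random
from collections import Counter


def impute_once(rows, rng):
    """Fill each None with a random draw from the observed values of its column.

    Assumption: a crude stand-in for one draw of a multiple imputation model.
    """
    observed = [[v for v in col if v is not None] for col in zip(*rows)]
    return [[v if v is not None else rng.choice(observed[j])
             for j, v in enumerate(row)] for row in rows]


def kmeans2(rows, iters=20):
    """Two-cluster k-means (assumption: used here instead of a mixture model).

    Seeded deterministically with the rows of smallest and largest coordinate sum.
    """
    centers = [list(min(rows, key=sum)), list(max(rows, key=sum))]
    labels = [0] * len(rows)
    for _ in range(iters):
        # Assign every row to its nearest center (squared Euclidean distance).
        labels = [min((0, 1), key=lambda c: sum((a - b) ** 2
                                                for a, b in zip(r, centers[c])))
                  for r in rows]
        # Move each center to the mean of its members.
        for c in (0, 1):
            members = [r for r, lab in zip(rows, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def align(labels, reference):
    """Flip 0/1 labels when that agrees better with the reference run."""
    agree = sum(a == b for a, b in zip(labels, reference))
    return labels if agree >= len(labels) - agree else [1 - lab for lab in labels]


def cluster_with_mi(rows, m=5, seed=0):
    """Cluster m imputed copies of the data and combine them by majority vote."""
    rng = random.Random(seed)
    runs = []
    for _ in range(m):
        labels = kmeans2(impute_once(rows, rng))
        runs.append(align(labels, runs[0]) if runs else labels)
    # One vote per imputed dataset for each subject; use an odd m to avoid ties.
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*runs)]


# Toy data (hypothetical): two well-separated groups, None marks a missing value.
demo = [
    [0.1, 0.2, None], [0.0, None, 0.3], [0.2, 0.1, 0.0], [None, 0.3, 0.1],
    [5.9, 6.1, None], [6.0, None, 5.8], [6.2, 6.0, 6.1], [None, 5.9, 6.0],
]
print(cluster_with_mi(demo))  # [0, 0, 0, 0, 1, 1, 1, 1]
```

The alignment step is included because cluster labels are arbitrary, so the same partition can come back with labels swapped from one imputed dataset to the next; any combination rule has to resolve that before votes are counted.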

Disciplines :

Public health, health care sciences & services

Author, co-author :

Nekoee Zahraei, Halehsadat ; Université de Liège - ULiège > Département des sciences de la santé publique > Biostatistique

Louis, Renaud ; Université de Liège - ULiège > Département des sciences cliniques > Pneumologie - Allergologie

Donneau, Anne-Françoise ; Université de Liège - ULiège > Département des sciences de la santé publique > Biostatistique

Language :

English

Title :

Cluster Analysis in Incomplete Data

Publication date :

August 2020

Event name :

41st Annual Conference of the International Society for Clinical Biostatistics

Event place :

Krakow, Poland

Event date :

23-08-2020 to 27-08-2020

Audience :

International

Available on ORBi :

since 18 November 2020
