Abstract:
Multi-view data offer advantages over single-view data for characterizing individuals, which is crucial in precision medicine for personalized prevention, diagnosis, or treatment follow-up. Here, we develop a network-guided multi-view clustering framework named netMUG to identify actionable subgroups of individuals. This pipeline first adopts sparse multiple canonical correlation analysis to select multi-view features, possibly informed by extraneous data, which are then used to construct individual-specific networks (ISNs). Finally, the individual subtypes are automatically derived by hierarchical clustering on these network representations. We applied netMUG to a dataset containing genomic data and facial images to obtain BMI-informed multi-view strata and showed how it could be used for a refined obesity characterization. Benchmark analysis of netMUG on synthetic data with known strata of individuals indicated its superior performance compared with both baseline and benchmark methods for multi-view clustering. In addition, the real-data analysis revealed subgroups strongly linked to BMI, as well as genetic and facial determinants of these subgroups. netMUG provides a powerful strategy, exploiting individual-specific networks to identify meaningful and actionable strata. Moreover, the implementation is easy to generalize to accommodate heterogeneous data sources or highlight data structures.
AUTHOR SUMMARY: In recent years, it has become increasingly feasible to collect data from multiple modalities in various fields, which calls for novel methods that exploit the consensus among different data types. As exemplified in systems biology or epistasis analyses, the interactions between features may contain more information than the features themselves, thereby necessitating the use of feature networks. Furthermore, in real-life scenarios, subjects, such as patients or individuals, may originate from diverse populations, which underscores the importance of subtyping or clustering these subjects to account for their heterogeneity. In this study, we present a novel pipeline for selecting the most relevant features from multiple data types, constructing a feature network for each subject, and obtaining a subgrouping of samples informed by a phenotype of interest. We validated our method on synthetic data and demonstrated its superiority over several state-of-the-art multi-view clustering approaches. Additionally, we applied our method to a real-life, large-scale dataset of genomic data and facial images, where it effectively identified a meaningful BMI subtyping that complemented existing BMI categories and offered new biological insights. Our proposed method has wide applicability to complex multi-view or multi-omics datasets for tasks such as disease subtyping or personalized medicine.
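
The last two steps of the pipeline summarized above (one feature network per subject, then hierarchical clustering of those networks) can be sketched as follows. This is a minimal illustration under stated assumptions, not the netMUG implementation: it assumes the multi-view features have already been selected and concatenated into one matrix, it builds each individual's network by a leave-one-out perturbation of a Pearson correlation matrix (one common way to construct ISNs, chosen here for simplicity), and it cuts a Ward dendrogram at a user-supplied number of clusters instead of deriving that number automatically. The function names `individual_specific_networks` and `cluster_individuals` are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage


def individual_specific_networks(X):
    """Return one edge-weight vector per individual (one row of X each).

    Assumption: leave-one-out construction on Pearson correlations.
    The edge weights of individual i are
        n * corr(all samples) - (n - 1) * corr(all samples except i).
    """
    n, p = X.shape
    upper = np.triu_indices(p, k=1)  # keep each feature pair exactly once
    full = np.corrcoef(X, rowvar=False)[upper]
    isn = np.empty((n, full.size))
    for i in range(n):
        without_i = np.delete(X, i, axis=0)
        loo = np.corrcoef(without_i, rowvar=False)[upper]
        isn[i] = n * full - (n - 1) * loo
    return isn


def cluster_individuals(isn, n_clusters):
    """Ward hierarchical clustering on the per-individual edge vectors.

    Assumption: the number of clusters is given by the caller.
    """
    tree = linkage(isn, method="ward")
    return fcluster(tree, t=n_clusters, criterion="maxclust")


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(40, 5))  # 40 individuals, 5 pre-selected features
    isn = individual_specific_networks(X)
    labels = cluster_individuals(isn, n_clusters=3)
    print(isn.shape, np.bincount(labels)[1:])
```

Representing each network by its upper-triangular edge weights keeps the clustering input a plain samples-by-edges matrix, so any distance or linkage supported by SciPy could be substituted for the Ward linkage used here.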