Abstract :
Introduction
Regional connectivity-based parcellation (CBP) aims to find biologically meaningful parcels or subregions. This is achieved by clustering the voxels in a region of interest (ROI) based on their connectivity profiles. Using a large resting-state fMRI (rs-fMRI) sample, we show that deviant connectivity profiles substantially influence group-based clustering results. Such outliers can arise for various reasons, and we investigated one possible reason in high-dimensional data: differences in intrinsic dimensionality.
Methods
The right (R) insula ROI (Fig. 2C), a region that has been the subject of repeated CBP analyses [1], was defined using the Harvard-Oxford Atlas [2]. rs-fMRI data from 408 healthy unrelated subjects (2 mm isotropic, TR=0.72 s, age 22-37, 205 males) from the Human Connectome Project [3] were included. FIX-denoised data were preprocessed with SPM8 [4] using unified segmentation [5], smoothed with a 5 mm FWHM kernel, cleaned of WM and CSF signal by regression, and band-pass filtered (0.01-0.08 Hz). Correlations between the time series of each ROI voxel and all brain gray-matter voxels were computed and Fisher Z-transformed, yielding an ROI-to-whole-brain connectivity matrix per subject. k-means clustering (with k from 2 to 5) was performed on each connectivity matrix. To identify outliers, the nearest-neighbor subject of each subject was identified using the Euclidean distance between connectivity matrices. The resulting vector of nearest-neighbor distances, d, was Z-scored (Fig. 1A). k-means (k=2) clustering of d revealed a separation around 0, providing a conservative threshold (Fig. 1B). Two further thresholds were chosen: 1.69 (.95 left tail area on a standard normal distribution) and a liberal 2.5. Group parcellations for each k were calculated using hierarchical clustering with average linkage and Hamming distance after excluding outliers based on these thresholds. The adjusted Rand index (ARI) between the k-means clustering results of all pairs of subjects was computed, retaining the highest value per subject as a similarity vector a (Fig. 1C). Lastly, principal component analysis was performed on the connectivity matrices, noting the number of components retaining 95% of the variance. Correlating the number of components with d reveals whether there is a relationship between the intrinsic dimensionality of the connectivity matrices and their distances to one another.
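The outlier-identification step described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the function names `nearest_neighbor_zscores` and `flag_outliers` are hypothetical, the connectivity matrices are assumed to be stacked into one array of shape (subjects, ROI voxels, target voxels), and the Z-score is assumed to use the population standard deviation.

```python
import numpy as np


def nearest_neighbor_zscores(conn):
    """Z-scored Euclidean distance from each subject to its nearest neighbor.

    conn: array of shape (n_subjects, n_roi_voxels, n_target_voxels)
          holding one connectivity matrix per subject (assumed layout).
    """
    n_subjects = conn.shape[0]
    flat = conn.reshape(n_subjects, -1)  # one long vector per subject

    # Pairwise squared Euclidean distances via the Gram matrix.
    sq_norms = np.sum(flat ** 2, axis=1)
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (flat @ flat.T)
    np.maximum(d2, 0.0, out=d2)  # guard against tiny negative round-off
    dist = np.sqrt(d2)

    # A subject must not count as its own nearest neighbor.
    np.fill_diagonal(dist, np.inf)
    d = dist.min(axis=1)

    # Z-score across subjects (population standard deviation: an assumption).
    return (d - d.mean()) / d.std()


def flag_outliers(z, threshold):
    """Boolean mask of subjects whose Z-scored distance exceeds the threshold."""
    return z > threshold


# Synthetic demonstration: 20 random "subjects", one of them shifted away.
rng = np.random.default_rng(0)
demo_conn = rng.normal(size=(20, 5, 30))
demo_conn[0] += 5.0
demo_z = nearest_neighbor_zscores(demo_conn)
for thr in (0.0, 1.69, 2.5):  # the three thresholds named in the text
    print(thr, int(flag_outliers(demo_z, thr).sum()))
```

With real data the full array may not fit in memory for a whole-brain target space; in that case the pairwise distances would have to be accumulated subject pair by subject pair.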
Results
Applying the thresholds of 0, 1.69, and 2.5 removed 134, 32, and 14 subjects, respectively. When correlating the distances d (Fig. 1A) with the similarity vector a (Fig. 1C), we found correlations of -.38, -.41, -.49, and -.53 for k=2, 3, 4, and 5, respectively. This result suggests that outliers cluster differently, so including them in a group-level consensus might be detrimental. Accordingly, differences were found between group-level parcellations (Fig. 2A). For instance, when comparing the group parcellation obtained after outlier removal at the liberal 2.5 threshold (Fig. 2D, column two) with the group parcellation without outlier removal (Fig. 2D, column one), there was only an 81% overlap (ARI=.55) for k=3 (ARI=.67 and .71 for k=4 and 5, respectively). Further comparisons are illustrated in Figure 2D. The distances d were related to the number of principal components retaining 95% of the variance, with a correlation of -.79 (Fig. 2B). That is, if intrinsic dimensionality was low for a subject, the associated connectivity matrix tended to be more distant from the rest of the sample (Fig. 2D).
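The intrinsic-dimensionality measure used here (the number of principal components retaining 95% of the variance) can be illustrated with a short NumPy sketch. It is an illustration under stated assumptions rather than the pipeline used in the study: the function name `n_components_for_variance` is hypothetical, and rows of the connectivity matrix (ROI voxels) are assumed to be the observations, with target voxels as features.

```python
import numpy as np


def n_components_for_variance(matrix, target=0.95):
    """Smallest number of principal components whose cumulative explained
    variance reaches `target`.

    matrix: 2-D array, rows treated as observations (assumed: ROI voxels)
            and columns as features (assumed: target voxels).
    """
    centered = matrix - matrix.mean(axis=0, keepdims=True)
    # Singular values of the centered data give the component variances.
    s = np.linalg.svd(centered, compute_uv=False)
    explained = s ** 2 / np.sum(s ** 2)
    cumulative = np.cumsum(explained)
    # First index where the cumulative ratio reaches the target, as a count.
    k = int(np.searchsorted(cumulative, target)) + 1
    return min(k, len(s))


# Synthetic demonstration: a rank-2 matrix versus unstructured noise.
rng = np.random.default_rng(1)
low_rank = rng.normal(size=(50, 2)) @ rng.normal(size=(2, 40))
noise = rng.normal(size=(50, 40))
print(n_components_for_variance(low_rank), n_components_for_variance(noise))
```

Given one component count per subject and the Z-scored distance vector d, their relationship could then be summarized with `np.corrcoef(counts, d)[0, 1]`, which returns the Pearson correlation coefficient.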
Conclusion
The differences between clusterings highlight the influence of outliers. While assessment of the group-level parcellations reveals that clustering results were relatively stable across thresholds for k=2 (Fig. 2D), ample evidence suggests that the R insula contains more than 2 clusters [6,7,8]. As linkage algorithms in hierarchical clustering as well as k-means clustering are sensitive to outliers [9], it is important to remove them using a proper identification threshold. In the future, we will focus on the automatic identification of parameters that lead to biologically meaningful parcellations.