Abstract:
Introduction
Flow cytometry (FCM) is widely used in research and clinical practice to characterise
complex cell populations, generating high-dimensional single-cell data across numerous
markers. Despite technological advances, manual gating remains the standard approach for
annotating cell populations, even though the process is time-consuming and operator-dependent.
Deep generative models offer the potential to perform classification and discovery tasks
simultaneously, improving efficiency and consistency.
Methods
Model
MARVIN is a semi-supervised deep generative model for cytometry analysis. Its architecture
is structured around the biological assumption that the immune system consists of mixtures
of cell populations. This assumption constrains its latent space to reflect the population
structure, enabling biologically interpretable representations. MARVIN can perform multiple
tasks: classification of known populations, discovery of novel or rare subpopulations, and
exploration of immune system dynamics.
Dataset and Experiments
The dataset comprises 5,480,065 cells from three patients without active disease and
10,222 malignant lymphoblastic cells from four additional patients. In total, the dataset
includes 12 annotated cell populations profiled with 8 markers. All measurements were
transformed and standardised using an auto-logicle transformation.
Classification task:
We trained the model on a large dataset combining labelled cells from one patient with
unlabelled cells from the others.
Cell-discovery tasks:
Healthy and pathological cells were merged, and two analyses were conducted:
(i) Subpopulation discovery: the number of clusters in the latent space was increased and
malignant cells were masked during training, to evaluate whether MARVIN isolates
pathological cells into additional clusters.
(ii) Anomaly detection: a previously unseen cell population was provided to the model
without adding new clusters, and reconstruction error was used to assess its
dissimilarity from the learned populations.
Results
Classification task:
For patient 2, accuracy, F1 score and balanced accuracy were 99.21%, 94.83% and 96.83%,
respectively; for patient 3, they were 75.88%, 78.41% and 92.26%, respectively.
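The three metrics reported above weight errors differently, which matters when population sizes are very unequal. The following is a minimal sketch of how they can be computed from per-cell labels. It assumes a macro-averaged F1 score (the averaging scheme is not stated here), and the function name `classification_metrics` and the toy labels are illustrative only, not data or code from the study.

```python
def classification_metrics(y_true, y_pred):
    """Return (accuracy, macro F1, balanced accuracy) for two label sequences."""
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)

    recalls, f1_scores = [], []
    for label in sorted(set(y_true)):
        tp = sum(t == label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        recall = tp / (tp + fn)  # tp + fn > 0 because the label occurs in y_true
        precision = tp / (tp + fp) if tp + fp else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        recalls.append(recall)
        f1_scores.append(f1)

    # Assumption: F1 is averaged over populations with equal weight (macro F1).
    macro_f1 = sum(f1_scores) / len(f1_scores)
    # Balanced accuracy is the unweighted mean of per-population recall.
    balanced_accuracy = sum(recalls) / len(recalls)
    return accuracy, macro_f1, balanced_accuracy


# Toy labels (not data from the study): 8 abundant "T" cells, 2 rare "B" cells.
y_true = ["T"] * 8 + ["B"] * 2
y_pred = ["T"] * 8 + ["T", "B"]
accuracy, macro_f1, balanced_accuracy = classification_metrics(y_true, y_pred)
print(f"accuracy={accuracy:.3f} macro_F1={macro_f1:.3f} "
      f"balanced={balanced_accuracy:.3f}")
```

Accuracy weights every cell equally, whereas balanced accuracy weights every population equally, so the two can diverge in either direction depending on whether errors concentrate in abundant or rare populations.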
Discovery/anomaly detection:
MARVIN successfully highlighted rare pathological populations (<0.1%). Through cluster
expansion, it identified new pathological populations as distinct from healthy cells. It grouped
two small MRD populations (MRD2 and MRD4) into the same cluster while still detecting
subtle differences, and it mapped the blast groups of patients 1 and 3 into separate clusters. MARVIN
detected and correctly assigned 99.2% of leukemic cells to new clusters. Using reconstruction
error, MARVIN identified all pathological populations as previously unseen and suitable for
further characterisation.
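The reconstruction-error criterion can be illustrated with a small sketch: cells whose reconstruction error exceeds a threshold calibrated on known populations are flagged as previously unseen. Everything below is an assumption made for illustration. A nearest-centroid lookup stands in for MARVIN's decoder, the threshold is set at the 99th percentile of errors on known cells, and the data are synthetic two-marker cells; none of this is taken from the actual model.

```python
import random


def mean_squared_error(a, b):
    """Mean squared difference between two marker vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def reconstruct(cell, centroids):
    """Hypothetical stand-in for a decoder: return the closest learned centroid."""
    return min(centroids, key=lambda c: mean_squared_error(cell, c))


def reconstruction_errors(cells, centroids):
    return [mean_squared_error(cell, reconstruct(cell, centroids))
            for cell in cells]


def percentile_threshold(errors, q=0.99):
    """Error value at quantile q of the reference errors (assumed cut-off)."""
    ordered = sorted(errors)
    return ordered[min(len(ordered) - 1, int(q * len(ordered)))]


rng = random.Random(0)
centroids = [(0.0, 0.0), (5.0, 5.0)]  # two synthetic "learned" populations

# Synthetic known cells scattered around each learned centroid.
known = []
for centre in centroids:
    for _ in range(500):
        known.append(tuple(rng.gauss(m, 0.5) for m in centre))

# Synthetic unseen population located far from both centroids.
unseen = [(rng.gauss(10.0, 0.5), rng.gauss(-5.0, 0.5)) for _ in range(50)]

known_errors = reconstruction_errors(known, centroids)
threshold = percentile_threshold(known_errors)

unseen_errors = reconstruction_errors(unseen, centroids)
flagged_fraction = sum(e > threshold for e in unseen_errors) / len(unseen)
known_flagged_fraction = sum(e > threshold for e in known_errors) / len(known)
print(f"unseen flagged: {flagged_fraction:.2%}, "
      f"known flagged: {known_flagged_fraction:.2%}")
```

The percentile cut-off trades sensitivity against false alarms on known populations; in practice it would need to be chosen and validated on held-out data.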
Conclusion
MARVIN is a semi-supervised generative model grounded in biological assumptions for
FCM data. It can be trained on routinely standardised datasets and applied across
instruments, supporting broad laboratory implementation. MARVIN achieves high
classification accuracy and detects novel populations through expanded clustering and
reconstruction-loss evaluation. Ongoing work focuses on biological refinement to improve
rare-population clustering and on applying MARVIN to study MRD dynamics in acute leukemia.