Abstract:
[en] Plant viruses form a vast and phylogenetically diverse group of pathogens that threaten global agricultural productivity. To take advantage of the recent expansion of high-throughput sequencing, we collected, curated, and clustered more than 150,000 plant viral proteins, including coat proteins, movement proteins, RNA silencing suppressors, RNA-dependent RNA polymerases (RdRp), and polyprotein complexes. For each protein, we generated about 1,000 physicochemical and structural features using, among other tools, the ProtLearn and Bio2Byte algorithms. In parallel, we built a database of more than 21,000 plant host-virus relationships, covering 3,820 virus species and 4,223 plant species, through large-scale data gathering and data mining. This database was reviewed by experts from multiple partner laboratories. We then designed a machine learning pipeline called Holistic AutoML-driven Robust pipeline optimization tool for Applied Multi-Omics (HARAMO).
HARAMO was applied to the curated databases to identify key physicochemical signatures of proteins (amino acid composition and properties, backbone dynamics, early folding regions, properties of secondary structures, and intrinsic disorder) involved in virus-host specificity in plants. Our protein-based approach predicted more than 1,500 host plants with Matthews correlation coefficient (MCC) scores ranging from 80% to 99%, depending on the input viral protein and the target plant; the MCC, expressed here as a percentage, can be read as approximately a robust balanced accuracy. Overall, our integrative framework offers a robust protein-based host prediction tool for elucidating complex virus–host interactions. It opens new perspectives for studying the host range of virus species and for guiding both fundamental and applied research. In particular, the HARAMO results raise new experimental questions about the interactions between a virus and its plant host. In addition, knowing the putative host range of a virus is a significant asset for epidemiological studies, as it provides critical insights for monitoring and managing viral spread. We are currently developing a user-friendly dashboard based on our database, framework, and prediction models, which will be available soon. This dashboard will guide future experimental research aimed at biological validation with partner laboratories.
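As an illustration of why the MCC is reported rather than plain accuracy, the minimal sketch below computes both for a single binary "is this plant a host?" classifier. It is not part of HARAMO: the function names `confusion_counts`, `mcc`, and `accuracy` are hypothetical, and the labels are invented for illustration, not taken from the study.

```python
import math

def confusion_counts(y_true, y_pred):
    """Count (tp, tn, fp, fn) for binary labels (1 = host, 0 = non-host)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient, in [-1, 1]; 0.0 if undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # common convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denom

def accuracy(tp, tn, fp, fn):
    """Fraction of correct predictions."""
    return (tp + tn) / (tp + tn + fp + fn)

# Invented, imbalanced example: 5 hosts among 100 plants,
# and a trivial classifier that always answers "non-host".
y_true = [1] * 5 + [0] * 95
y_pred = [0] * 100

counts = confusion_counts(y_true, y_pred)
print(f"accuracy = {accuracy(*counts):.0%}")
print(f"MCC      = {mcc(*counts):.0%}")
```

On this invented example the trivial classifier is correct on 95 of 100 plants, yet its MCC is zero because it never identifies a host. This is the sense in which the MCC behaves like a robust balanced accuracy on imbalanced host data.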