Identifying orthology relationships among sequences is fundamental in phylogenomics:
these relationships are essential for understanding evolution, the diversity of life, and
ancestry among organisms. To build alignments of orthologous sequences, phylogenomic
pipelines often start with an all-vs-all similarity search followed by clustering with
an algorithm such as OrthoFinder [Emms and Kelly (2015) Genome Biol 16:157]. For this
approach to be as accurate as possible, high-quality proteomes are needed, but they are
available for only a small subset of living organisms.
Therefore, phylogenomic analyses at a large taxonomic scale require enriching
preexisting orthologous groups with transcriptomic or genomic data, and hence
robust tools for identifying orthologues from heterogeneous sequence data. To
this end, we have developed a novel tool, "Forty-Two", along the lines of HaMStR
[Ebersberger et al. (2009) BMC Evol Biol 9:157], whose aim is to add (and optionally
align) sequences to thousands of preexisting multiple sequence alignments (MSAs)
while controlling for orthology relationships and potentially contaminating
sequences. "Forty-Two" uses advanced heuristics based on a multiple Best Reciprocal
Hit (multi-BRH) strategy against reference proteomes to distinguish orthologous from
paralogous sequences among homologues. It is fully functional and has already
been used in two high-profile phylogenomic manuscripts (under review) dealing with
the animal tree of life. Here, we present the principles and algorithms underlying
"Forty-Two", as well as the results of an extensive test suite of its features, in order
to support its public release.
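
The multi-BRH idea can be illustrated with a minimal sketch. This is not the Forty-Two implementation: the function name, the data layout, and the decision rule (every reference proteome in which the candidate has a hit must agree) are assumptions made for illustration only. The sketch supposes that the sequences already in an alignment have been searched against each reference proteome to define the expected best hits, and that a candidate homologue is kept as an orthologue only if its own best hit in each reference proteome falls within that expected set.

```python
from typing import Dict, Set


def is_orthologue(candidate_best_hits: Dict[str, str],
                  expected_hits: Dict[str, Set[str]]) -> bool:
    """Hypothetical multi-BRH check (illustrative, not Forty-Two's code).

    candidate_best_hits: reference proteome -> best hit of the candidate.
    expected_hits: reference proteome -> best hits of the alignment's
                   own sequences in that proteome.
    """
    informative = 0
    for proteome, expected in expected_hits.items():
        best = candidate_best_hits.get(proteome)
        if best is None:
            continue  # no hit in this proteome: it gives no evidence
        if best not in expected:
            return False  # best hit is another protein: likely a paralogue
        informative += 1
    # Assumption: at least one reference proteome must confirm the candidate.
    return informative > 0


# Made-up identifiers, for illustration only.
expected_hits = {
    "Homo sapiens": {"HSA_0001"},
    "Saccharomyces cerevisiae": {"SCE_0042"},
}
orthologue = {"Homo sapiens": "HSA_0001", "Saccharomyces cerevisiae": "SCE_0042"}
paralogue = {"Homo sapiens": "HSA_0002", "Saccharomyces cerevisiae": "SCE_0042"}

print(is_orthologue(orthologue, expected_hits))  # True
print(is_orthologue(paralogue, expected_hits))   # False
```

Requiring agreement across several reference proteomes, rather than a single one, is what makes the check "multiple": one discordant best hit is enough to reject the candidate in this sketch.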
Disciplines:
Biochemistry, biophysics & molecular biology
Author, co-author:
Baurain, Denis; Université de Liège - ULiège > Département des sciences de la vie > Phylogénomique des eucaryotes