ToRQuEMaDA: tool for retrieving queried Eubacteria, metadata and dereplicating assemblies

Léonard, Raphaël; Leleu, Marie; Van Vlierberghe, Mick; Cornet, Luc; Kerff, Frédéric; Baurain, Denis

doi:10.7717/peerj.11348

Article (Scientific journals)

2021 • In PeerJ, 9, p. 11348

Peer Reviewed verified by ORBi

Permalink
https://hdl.handle.net/2268/259995

Details

Keywords :

Dereplication; Prokaryotes; Genome quality; Genome selection; Alignment-free methods; Phylogenomics; NCBI RefSeq; Singularity; Metagenomics

Abstract :

ToRQuEMaDA (TQMD) is a tool for high-performance computing clusters that downloads and stores prokaryotic genomes and produces lists of dereplicated genomes. It was developed to cope with the ever-growing number of prokaryotic genomes and their uneven taxonomic distribution. It relies on word-based alignment-free methods (k-mers), an iterative single-linkage approach and a divide-and-conquer strategy to remain both efficient and scalable. We studied the performance of TQMD by assessing the influence of its parameters and heuristics on the clustering outcome. We further compared TQMD to two other dereplication tools (dRep and Assembly-Dereplicator). Our results showed that, unlike these two tools, TQMD is primarily optimized to dereplicate at higher taxonomic levels (phylum/class), although it also works at lower taxonomic levels (species/strain), as they do. TQMD is available from source and as a Singularity container at https://bitbucket.org/phylogeno/tqmd.
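
The general idea behind k-mer-based single-linkage dereplication can be illustrated with a minimal sketch. This is not TQMD's code: the function names, the choice of the Jaccard distance, the default values of `k` and `threshold`, and the rule for picking a representative are all assumptions made for illustration, and the iterative and divide-and-conquer aspects mentioned in the abstract are left out.

```python
from itertools import combinations


def kmers(seq, k):
    """Return the set of all words of length k found in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard_distance(a, b):
    """Jaccard distance between two k-mer sets (0 = identical, 1 = disjoint)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def single_linkage_clusters(genomes, k, threshold):
    """Group genomes so that any pair closer than `threshold` shares a cluster.

    `genomes` maps a name to its sequence. Clusters are built with a
    union-find structure, which yields single-linkage groups.
    """
    names = list(genomes)
    profiles = {name: kmers(genomes[name], k) for name in names}
    parent = {name: name for name in names}

    def find(name):
        # Follow parent links to the cluster root, compressing the path.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in combinations(names, 2):
        if jaccard_distance(profiles[a], profiles[b]) <= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values())


def dereplicate(genomes, k=4, threshold=0.5):
    """Keep one representative per cluster.

    Hypothetical rule: the alphabetically first name is kept. A real tool
    would more likely rank cluster members by assembly quality.
    """
    return [members[0] for members in single_linkage_clusters(genomes, k, threshold)]


if __name__ == "__main__":
    toy = {
        "g1": "ACGTACGTACGT",
        "g2": "ACGTACGTACGA",
        "g3": "TTTTGGGGCCCC",
    }
    print(dereplicate(toy))
```

With the toy input, `g1` and `g2` share most of their 4-mers and fall into one cluster, whereas `g3` shares none and stays alone. Comparing all pairs scales quadratically with the number of genomes, which is the kind of cost a divide-and-conquer strategy is meant to contain.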

Disciplines :

Microbiology
Genetics & genetic processes
Biochemistry, biophysics & molecular biology

Author, co-author :

Léonard, Raphaël ; Université de Liège - ULiège > InBioS

Leleu, Marie ; Université de Liège - ULiège > InBioS

Van Vlierberghe, Mick ; Université de Liège - ULiège > InBioS

Cornet, Luc ; Université de Liège - ULiège > Département des sciences de la vie > Phylogénomique des eucaryotes

Kerff, Frédéric ; Université de Liège - ULiège > Département des sciences de la vie > Centre d'ingénierie des protéines

Baurain, Denis ; Université de Liège - ULiège > Département des sciences de la vie > Phylogénomique des eucaryotes

Language :

English

Title :

ToRQuEMaDA: tool for retrieving queried Eubacteria, metadata and dereplicating assemblies

Publication date :

05 May 2021

Journal title :

PeerJ

eISSN :

2167-8359

Publisher :

PeerJ, United States - California

Volume :

9

Pages :

e11348

Peer reviewed :

Peer Reviewed verified by ORBi

Additional URL :

https://peerj.com/articles/11348/

Funders :

FRIA - Fonds pour la Formation à la Recherche dans l'Industrie et dans l'Agriculture
BELSPO - Politique scientifique fédérale
ANR - Agence Nationale de la Recherche
F.R.S.-FNRS - Fonds de la Recherche Scientifique

Available on ORBi :

since 17 May 2021
