TQMD is a tool for high-performance computing clusters that downloads and stores prokaryotic genomes and produces dereplicated lists of them. It was developed to cope with the ever-growing number of prokaryotic genomes and their uneven taxonomic distribution. To remain both efficient and scalable, it combines word-based alignment-free methods (k-mers), an iterative single-linkage approach and a divide-and-conquer strategy. We studied the performance of TQMD by examining the influence of its parameters and heuristics on the clustering outcome. We further compared TQMD to two other dereplication tools (dRep and Assembly-Dereplicator). Our results showed that, unlike these two tools, TQMD is primarily optimized to dereplicate at higher taxonomic levels (phylum/class), but that, like them, it also works at lower taxonomic levels (species/strain). TQMD is available from source and as a Singularity container at https://bitbucket.org/phylogeno/tqmd.
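As an illustration only, the three ingredients named above (k-mer comparison, single-linkage grouping and divide-and-conquer batching) can be sketched in a few lines of Python. This is not TQMD's code. The Jaccard distance, the threshold, the batch size, the "keep the longest sequence" rule and every function name below are assumptions made for the example.

```python
from itertools import combinations


def kmers(seq, k):
    """Return the set of all words of length k found in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard_distance(a, b):
    """1 - |A & B| / |A | B| for two k-mer sets (assumed distance measure)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def single_linkage(names, profiles, threshold):
    """Cluster names so that any pair at or below threshold ends up linked,
    directly or through a chain of such pairs (union-find)."""
    parent = {n: n for n in names}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path compression
            n = parent[n]
        return n

    for a, b in combinations(names, 2):
        if jaccard_distance(profiles[a], profiles[b]) <= threshold:
            parent[find(b)] = find(a)

    clusters = {}
    for n in names:
        clusters.setdefault(find(n), []).append(n)
    return list(clusters.values())


def one_round(names, genomes, profiles, threshold, batch_size):
    """Cluster each batch separately and keep one representative per cluster."""
    survivors = []
    for start in range(0, len(names), batch_size):
        batch = names[start:start + batch_size]
        for cluster in single_linkage(batch, profiles, threshold):
            # Hypothetical selection rule: keep the longest sequence.
            survivors.append(max(cluster, key=lambda n: len(genomes[n])))
    return survivors


def dereplicate(genomes, k=4, threshold=0.3, batch_size=2):
    """Repeat batch-wise rounds until the list fits in one batch or stops
    shrinking, then run one last round over everything that is left."""
    profiles = {name: kmers(seq, k) for name, seq in genomes.items()}
    current = sorted(genomes)
    while len(current) > batch_size:
        survivors = one_round(current, genomes, profiles, threshold, batch_size)
        if len(survivors) == len(current):
            break
        current = survivors
    return one_round(current, genomes, profiles, threshold, max(len(current), 1))


toy_genomes = {
    "g1": "ACGTACGTACGT",
    "g2": "ACGTACGTACGA",
    "g3": "TTTTGGGGCCCC",
    "g4": "TTTTGGGGCCCA",
}
representatives = dereplicate(toy_genomes)
```

On these toy sequences, g1/g2 and g3/g4 form two pairs of near-identical strings, so one representative of each pair is kept. Batching limits the number of pairwise comparisons per round, at the cost of needing several rounds before distant batches are compared with each other.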
FRIA - Fonds pour la Formation à la Recherche dans l'Industrie et dans l'Agriculture; BELSPO - Politique scientifique fédérale; ANR - Agence Nationale de la Recherche; F.R.S.-FNRS - Fonds de la Recherche Scientifique