Article (Scientific journals)
Contamination in Reference Sequence Databases: Time for Divide-and-Rule Tactics.
Lupo, Valérian; Van Vlierberghe, Mick; Vanderschuren, Hervé et al.
2021, in Frontiers in Microbiology, 12, p. 755101
Peer Reviewed verified by ORBi
 

Details



Keywords :
NCBI RefSeq; assembly; contamination; databases; genomes; phylogenomics; sequencing
Abstract :
Contaminating sequences in public genome databases are a pervasive issue with potentially far-reaching consequences. This problem has attracted much attention in the recent literature, and many different tools are now available to detect contaminants. Although these methods are based on diverse algorithms that can sometimes produce widely different estimates of the contamination level, the majority of genomic studies rely on a single detection method, which carries a risk of systematic error. In this work, we used two orthogonal methods to assess the level of contamination among National Center for Biotechnology Information Reference Sequence Database (RefSeq) bacterial genomes. First, we applied the most popular solution, CheckM, which is based on gene markers. We then complemented this approach with a genome-wide method, termed Physeter, which now implements a k-folds algorithm to avoid inaccurate detection due to potential contamination of the reference database. We demonstrate that CheckM cannot currently be applied to all available genomes and bacterial groups. While it performed well on the majority of RefSeq genomes, it produced dubious results for 12,326 organisms. Among those, Physeter identified 239 contaminated genomes that had been missed by CheckM. In conclusion, we emphasize the importance of using multiple detection methods, and we provide an upgrade of our own detection tool, Physeter, which minimizes incorrect contamination estimates in the context of unavoidably contaminated reference databases.
Disciplines :
Biochemistry, biophysics & molecular biology
Microbiology
Genetics & genetic processes
Author, co-author :
Lupo, Valérian; Université de Liège - ULiège > Département des sciences de la vie > Phylogénomique des eucaryotes
Van Vlierberghe, Mick; Université de Liège - ULiège > Département des sciences de la vie > Phylogénomique des eucaryotes
Vanderschuren, Hervé; Université de Liège - ULiège > Département GxABT > Plant Sciences
Kerff, Frédéric; Université de Liège - ULiège > Département des sciences de la vie > Centre d'ingénierie des protéines
Baurain, Denis; Université de Liège - ULiège > Département des sciences de la vie > Phylogénomique des eucaryotes
Cornet, Luc; Université de Liège - ULiège > Département des sciences de la vie > Phylogénomique des eucaryotes
Language :
English
Title :
Contamination in Reference Sequence Databases: Time for Divide-and-Rule Tactics.
Publication date :
2021
Journal title :
Frontiers in Microbiology
eISSN :
1664-302X
Publisher :
Frontiers, Lausanne, Switzerland
Volume :
12
Pages :
755101
Peer reviewed :
Peer Reviewed verified by ORBi
Tags :
CÉCI : Consortium des Équipements de Calcul Intensif
Funders :
FRIA - Fonds pour la Formation à la Recherche dans l'Industrie et dans l'Agriculture [BE]
F.R.S.-FNRS - Fonds de la Recherche Scientifique [BE]
CÉCI - Consortium des Équipements de Calcul Intensif [BE]
Commentary :
Copyright © 2021 Lupo, Van Vlierberghe, Vanderschuren, Kerff, Baurain and Cornet.
Available on ORBi :
since 18 January 2022

