OBJECTIVES: Complex algae are photosynthetic organisms resulting from eukaryote-to-eukaryote endosymbiotic-like interactions. However, the specific lineages and mechanisms involved are still debated, which is why large-scale phylogenomic studies are needed. Whereas available proteomes cover only a limited diversity of complex algae, the transcriptomes of the Marine Microbial Eukaryote Transcriptome Sequencing Project (MMETSP) are a valuable resource for phylogenomic analyses, owing to their broad and rich taxonomic sampling, especially of photosynthetic species. Unfortunately, this sampling is unbalanced and sometimes highly redundant. Moreover, we observed contaminated sequences in some samples. In such a context, both tree inference and tree readability are impaired. The aim of the data processing reported here is therefore to release a unique set of clean and non-redundant transcriptomes, produced through an original protocol featuring decontamination, pooling and dereplication steps. DATA DESCRIPTION: We submitted 678 MMETSP re-assembly samples to our parallel consolidation pipeline. After the systematic removal of the most contaminated samples (186), we combined 423 samples into 110 consolidated transcriptomes. This approach resulted in a total of 224 high-quality transcriptomes that are easy to use and suitable for computing less contaminated, less redundant and more balanced phylogenies.
Baurain, Denis ; Université de Liège - ULiège > Département des sciences de la vie > Phylogénomique des eucaryotes
Language :
English
Title :
Decontamination, pooling and dereplication of the 678 samples of the Marine Microbial Eukaryote Transcriptome Sequencing Project.
Publication date :
2021
Journal title :
BMC Research Notes
eISSN :
1756-0500
Publisher :
BioMed Central, London, United Kingdom
Volume :
14
Issue :
1
Pages :
306
Peer reviewed :
Peer Reviewed verified by ORBi
Funders :
FRIA - Fonds pour la Formation à la Recherche dans l'Industrie et dans l'Agriculture [BE]; F.R.S.-FNRS - Fonds de la Recherche Scientifique [BE]; ULiège - Université de Liège [BE]