This note benchmarks six different contamination detection programs on the same data. It covers the majority of the algorithms relying on a reference database. We further provide Singularity [63] definition files to help with the installation of these programs and ensure reproducibility of the reported results. These files are accessible through the Figshare repository https://doi.org/10.6084/m9.figshare.17705558.v1.
$ wget XXXX # Python script for chimeric genome creation
$ wget https://figshare.com/ndownloader/files/32405687 -O Cornet-Baurain-contams.tgz
$ tar -xzf Cornet-Baurain-contams.tgz
Cornet-2022-GBIO-Figshare/
├── contams.def # Singularity def file for programs part 1 (all programs except EukCC)
├── eukcc.def # Singularity def file for programs part 2 (EukCC)
├── Chimeric-genomes.py # Python script for chimeric genome creation
├── contam-labels.idl # idl file used for Physeter and Forty-Two (from https://bitbucket.org/phylogeno/42-ribo-msas/)
├── life-tqmd-of73.dmnd # DIAMOND blast database for Physeter (from https://doi.org/10.3389/fmicb.2021.755101)
├── life-tqmd-of73.gca # List of GCA numbers for Physeter database (needed to enable the k-fold mode)
├── taxdump-20211206 # NCBI Taxonomy dump used across the study
└── README
Information on how to use Singularity containers can be found here:
We used Singularity v3.8.0 and containers were bound to working directories. To facilitate the reading of this note, Singularity-related commands were not transcribed; only commands driving the programs themselves are reported here.
Two chimeric genomes were created to systematically test genomic contamination detection programs, one including only bacterial sequences and one combining bacterial and eukaryotic sequences.
For the bacterial chimera, two uncontaminated bacterial genomes were selected as references and a chimeric genome mixing sequences from these two genomes was created using a custom script (see above). Both redundant and non-redundant contaminations were simulated, based on “exchanges” of core-gene coding sequences (CDS) after orthologous gene inference.
Briefly, Prodigal v2.6.3 (https://github.com/hyattpd/Prodigal) was used to predict proteins on the two uncontaminated genomes. inst-abbr-ids.pl v0.212670 (https://metacpan.org/dist/Bio-MUST-Core) was then used to format the names of the sequences. OrthoFinder v2.5.4 (https://github.com/davidemms/OrthoFinder) was used on these protein files to infer orthologous groups. Groups with only one sequence from each genome (i.e., single-copy core genes) were randomly sampled either to replace (non-redundant contamination) or supplement (redundant contamination) the corresponding CDS from one genome (here named the master genome) with sequences from the second genome (here named the sub-genome).
Two genomes were selected based on the results of our contamination analyses of NCBI RefSeq [45-46]. The selected genomes had to show no contamination according to CheckM [48] and to comprise at most two contigs (e.g., one chromosome and one plasmid). Here, we used a Firmicutes (GCF_000003645.1) (https://www.ncbi.nlm.nih.gov/assembly/GCF_000003645.1/) as the master genome and a (Gamma)Proteobacteria (GCF_000007145.1) (https://www.ncbi.nlm.nih.gov/assembly/GCF_000007145.1/) as the sub-genome.
$ wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/003/645/GCF_000003645.1_ASM364v1/GCF_000003645.1_ASM364v1_genomic.fna.gz
$ wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/007/145/GCF_000007145.1_ASM714v1/GCF_000007145.1_ASM714v1_genomic.fna.gz
$ gunzip GCF_000003645.1_ASM364v1_genomic.fna.gz
$ gunzip GCF_000007145.1_ASM714v1_genomic.fna.gz
Prodigal was used to predict the proteins, then definition lines were abbreviated with inst-abbr-ids.pl.
$ prodigal \
-i GCF_000003645.1_ASM364v1_genomic.fna \
-o GCF_000003645.1.out -a GCF_000003645.1.faa -d GCF_000003645.1.genes.fna
$ prodigal \
-i GCF_000007145.1_ASM714v1_genomic.fna \
-o GCF_000007145.1.out -a GCF_000007145.1.faa -d GCF_000007145.1.genes.fna
$ inst-abbr-ids.pl GCF_000003645.1.faa --id-regex=:DEF --id-prefix=GCF_000003645.1
$ inst-abbr-ids.pl GCF_000007145.1.faa --id-regex=:DEF --id-prefix=GCF_000007145.1
OrthoFinder was used to infer the orthologous groups.
$ mkdir OF-indir
$ cp *.faa OF-indir/
$ orthofinder -t 10 -a 10 -f OF-indir/
$ mkdir OG
$ mv OF-indir/OrthoFinder/Results_*/Orthogroup_Sequences/*.fa OG/
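The single-copy core genes mentioned above can be recognised directly from the orthogroup FASTA files. The following Python lines are only a minimal sketch of that filter, not the code of Chimeric-genomes.py: they assume that every sequence identifier starts with the genome prefix given to inst-abbr-ids.pl, and the function names, the separator and the toy identifiers are hypothetical.

```python
from collections import Counter

def read_fasta_ids(lines):
    """Collect sequence identifiers (first word after '>') from FASTA lines."""
    return [line[1:].split()[0] for line in lines if line.startswith(">")]

def is_single_copy(seq_ids, genome_prefixes):
    """True if the orthogroup holds exactly one sequence from each genome."""
    counts = Counter()
    for seq_id in seq_ids:
        for prefix in genome_prefixes:
            if seq_id.startswith(prefix):
                counts[prefix] += 1
                break
    return (len(seq_ids) == len(genome_prefixes)
            and all(counts[prefix] == 1 for prefix in genome_prefixes))

PREFIXES = ["GCF_000003645.1", "GCF_000007145.1"]

# Toy orthogroups (hypothetical identifiers and sequences, for illustration only).
og_single = [">GCF_000003645.1|seq1", "MKT", ">GCF_000007145.1|seq9", "MKS"]
og_paralogs = [">GCF_000003645.1|seq2", "MAA", ">GCF_000003645.1|seq3", "MAG",
               ">GCF_000007145.1|seq7", "MAT"]

print(is_single_copy(read_fasta_ids(og_single), PREFIXES))    # True
print(is_single_copy(read_fasta_ids(og_paralogs), PREFIXES))  # False
```

Only orthogroups passing such a test are candidates for the replacement and duplication steps described next.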
We designed a Python script to automatically create the chimeric genomes: Chimeric-genomes.py. It takes as input the protein files for the master and sub-genomes, then replacement and addition of genes in the master genome are carried out based on single-copy core genes. Redundant and non-redundant contamination levels were both set at 5% of the number of proteins. These percentages correspond to the threshold values proposed in [64].
$ ./Chimeric-genomes.py GCF_000003645.1.faa GCF_000007145.1.faa --replacement=5 --redundancy=5
Master genome: GCF_000003645.1 contains 5382 genes.
Sub genome: GCF_000007145.1 contains 4292 genes.
269 master genes replaced by genes from sub genome: GCF_000007145.1
269 master genes duplicated by genes from sub genome: GCF_000007145.1
Master genome represents 88.1064928717 of the chimeric genome length
Sub genome represents 11.8935071283 of the chimeric genome length
$ ali2fasta.pl chimeric.ali
$ mv chimeric.fasta chimeric-bact.fasta
The chimeric genome chimeric-bact.fasta had 88.11% of Firmicutes CDS and 11.89% of (Gamma)Proteobacteria CDS (in terms of sequence length), the contaminating CDS being equally partitioned between redundant and non-redundant contaminations.
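To make the two contamination modes concrete, here is a minimal Python sketch of the sampling logic under stated assumptions. It is not the code of Chimeric-genomes.py: the function name, its arguments and the toy genomes are hypothetical, and we assume that the replaced and the duplicated genes are drawn from disjoint sets of single-copy pairs. The only figure taken from the log above is that 5% of 5382 master genes gives 269 genes per mode.

```python
import random

def build_chimera(master, sub, pairs, replacement_pct, redundancy_pct, seed=42):
    """master/sub map sequence ids to sequences; pairs lists (master_id, sub_id)
    single-copy orthologues. Returns the chimera and the two sample sizes."""
    rng = random.Random(seed)
    n_replace = int(len(master) * replacement_pct / 100)
    n_add = int(len(master) * redundancy_pct / 100)
    sampled = rng.sample(pairs, n_replace + n_add)
    chimera = dict(master)
    # Non-redundant contamination: the master gene is swapped for its orthologue.
    for master_id, sub_id in sampled[:n_replace]:
        del chimera[master_id]
        chimera[sub_id] = sub[sub_id]
    # Redundant contamination: the orthologue is added next to the master gene.
    for _master_id, sub_id in sampled[n_replace:]:
        chimera[sub_id] = sub[sub_id]
    return chimera, n_replace, n_add

# Toy genomes (hypothetical): 200 master genes, 150 sub genes, 60 single-copy pairs.
master = {f"M{i}": "ATG" * 100 for i in range(200)}
sub = {f"S{i}": "ATG" * 100 for i in range(150)}
pairs = [(f"M{i}", f"S{i}") for i in range(60)]

chimera, n_replace, n_add = build_chimera(master, sub, pairs, 5, 5)
sub_length = sum(len(seq) for name, seq in chimera.items() if name.startswith("S"))
sub_share = 100 * sub_length / sum(len(seq) for seq in chimera.values())
print(n_replace, n_add, len(chimera), round(sub_share, 2))
```

With sequences of equal length, the sub-genome share follows directly from the gene counts; in the real chimera the share is computed on sequence lengths, which explains why it departs from 10%.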
In order to demonstrate inter-domain contamination detection, we also created a eukaryotic genome including bacterial sequences. To this end, an uncontaminated eukaryotic genome was selected based on the analyses of [40]: Ustilago maydis (GCF_000328475.2) (https://www.ncbi.nlm.nih.gov/assembly/GCF_000328475.2/), a fungus belonging to Basidiomycota. This genome was then concatenated to the chimeric bacterial genome described above.
$ wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/328/475/GCF_000328475.2_Umaydis521_2.0/GCF_000328475.2_Umaydis521_2.0_genomic.fna.gz
$ gunzip GCF_000328475.2_Umaydis521_2.0_genomic.fna.gz
$ cat chimeric-bact.fasta GCF_000328475.2_Umaydis521_2.0_genomic.fna > chimeric-bact-euk.fasta
Overall, this chimeric eukaryotic genome is composed of sequences belonging to Opisthokonta (80.1%), Firmicutes (16.1%) and (Gamma)Proteobacteria (3.8%).
The five genomes (two reference bacteria, one chimeric bacterium, one reference eukaryote and one chimera between the eukaryote and the two bacteria) were analyzed with the six programs. Running times are reported for 20 CPU cores (AMD EPYC 7742 at 2.3 GHz), except for Kraken2, which was run on older hardware (Intel Xeon E5-2640 v4 at 2.4 GHz) due to issues with the NCBI Taxonomy (see below).
# Command log for CheckM v1.1.3
$ mkdir chimeric
$ mv *.fasta chimeric/
$ checkm lineage_wf -t 20 -x fasta chimeric runa > checkm.result
CheckM took 9 minutes and 40 seconds.
# Results of CheckM
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Bin Id Marker lineage # genomes # markers # marker sets 0 1 2 3 4 5+ Completeness Contamination Strain heterogeneity
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Gammaproteobacteria_GCF_000007145.1 c__Gammaproteobacteria (UID4202) 67 481 276 1 480 0 0 0 0 99.64 0.00 0.00
Firmicutes_GCF_000003645.1 g__Bacillus (UID902) 44 1171 324 7 1163 1 0 0 0 99.07 0.02 0.00
chimeric_bacterial g__Bacillus (UID902) 44 1171 324 104 1047 19 1 0 0 91.79 1.65 0.00
Eukaryote_GCF_000328475.2 k__Bacteria (UID203) 5449 104 58 38 34 20 12 0 0 48.96 16.85 0.00
chimeric_eukaryotic k__Bacteria (UID203) 5449 101 56 2 12 36 33 13 5 98.05 122.41 0.00
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
The output shows that virtually no redundant contamination is found by CheckM in the two reference bacterial genomes (0.00% and 0.02%). CheckM finds 1.65% of redundant contamination in the chimeric bacterial genome. This low level of contamination is surprising because the redundancy of the chimera is based on single-copy core genes, which have a higher chance of being picked up by CheckM. CheckM finds 16.85% of redundant contamination in the reference eukaryotic genome, while labelling it as a bacterium. CheckM finds 122.41% of redundant contamination in the chimeric eukaryotic genome (including bacteria), while also labelling it as a bacterium. The labelling of eukaryotes as bacteria is not surprising because CheckM was designed to work on bacteria only and can thus mistake mitochondria for bacteria, especially when (contaminating) Proteobacteria sequences are also present. Nevertheless, this demonstrates that the results of a naive usage of CheckM on complex samples, composed of a mixture of eukaryotes and bacteria, can be difficult to interpret.
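The marker copy-number columns (0 to 5+) of the table above give an intuition of where these percentages come from. The Python sketch below recomputes naive, per-marker estimates for the chimeric bacterial genome. It is a deliberate simplification and an assumption on our side: CheckM itself aggregates markers into collocated marker sets (324 sets here), which is why it reports 91.79% and 1.65% rather than the values obtained below. The function name is hypothetical.

```python
def naive_estimates(copy_counts):
    """copy_counts maps a copy number to the number of markers found that many times.
    Returns (completeness, contamination) in percent, ignoring marker sets."""
    total = sum(copy_counts.values())
    present = total - copy_counts.get(0, 0)
    # Every copy beyond the first one counts as redundant.
    extra = sum((copies - 1) * n for copies, n in copy_counts.items() if copies > 1)
    return 100 * present / total, 100 * extra / total

# Counts copied from the chimeric_bacterial row of the CheckM table.
chimeric_bacterial = {0: 104, 1: 1047, 2: 19, 3: 1, 4: 0, 5: 0}

completeness, contamination = naive_estimates(chimeric_bacterial)
print(f"{completeness:.2f} {contamination:.2f}")  # 91.12 1.79
```

Only 20 of the 1171 markers are found in more than one copy, which illustrates how few of the duplicated genes overlap with the marker genes of the g__Bacillus lineage.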
# Command log for EukCC v2.1.0
$ wget http://ftp.ebi.ac.uk/pub/databases/metagenomics/eukcc/eukcc2_db_ver_1.1.tar.gz
$ tar -xzf eukcc2_db_ver_1.1.tar.gz
$ eukcc folder --out EUKCC --threads 20 chimeric/ --suffix .fasta --db eukcc2_db_ver_1.1
EukCC took 11 minutes and 12 seconds.
# Results of EukCC
bin completeness contamination
Gammaproteobacteria_GCF_000007145.1 0.0 0.0
Eukaryote_GCF_000328475.2 100.0 0.0
chimeric_eukaryotic 100.0 14.8
EukCC logically does not work on the two Firmicutes genomes (including the chimeric one), since it was designed for eukaryotes. In contrast, the (Gamma)Proteobacteria genome is processed by EukCC, which might be due to the (Alpha)Proteobacterial origin of mitochondria. However, both completeness and contamination values are null (0%), which indicates that this genome is not suitable for EukCC. The reference eukaryotic genome is complete and uncontaminated, as expected. The chimeric eukaryotic genome shows a contamination level of 14.8%. This is higher than expected, considering that EukCC should only detect eukaryotic contaminants, whereas we introduced bacterial sequences.
# Command log for 42 v0.213470
# setup
# download NCBI Taxonomy (also useful for Physeter below)
$ setup-taxdir.pl --taxdir=taxdump
# download ribosomal protein databases (in FASTA format)
$ git clone https://bitbucket.org/phylogeno/42-ribo-msas
# build BLAST databases for BRH reference organisms
$ cd 42-ribo-msas/ref_orgs/life/
$ for REFORG in *.fasta; do makeblastdb -in $REFORG \
-dbtype prot -out `basename $REFORG .fasta` -parse_seqids; done
$ cd ../../../
# build BLAST databases for genomes to analyse
$ cd genomes/
$ for GENOME in *.fasta; do makeblastdb -in $GENOME \
-dbtype nucl -out `basename $GENOME .fasta` -parse_seqids; done
$ cat > mapper-genomes.idm
# Gammaproteobacteria_GCF_000007145.1 GCF_000007145.1_ASM714v1_genomic
# Firmicutes_GCF_000003645.1 GCF_000003645.1_ASM364v1_genomic
# chimeric_bacterial chimeric-bact
# Eukaryote_GCF_000328475.2 GCF_000328475.2_Umaydis521_2.0_genomic
# chimeric_eukaryotic chimeric-bact-euk
# ^D
# build 42 YAML configuration file (using newer 'life' dataset)
$ yaml-generator-42.pl --run_mode=metagenomic \
--out_suffix=-42 --queries 42-ribo-msas/queries/queries-life-fast.idl \
--evalue=1e-3 --homologues_seg=yes --max_target_seqs=50 --templates_seg=no \
--bank_dir genomes --bank_suffix=.nsq --bank_mapper genomes/mapper-genomes.idm --code=1 \
--ref_brh=on --ref_bank_dir 42-ribo-msas/ref_orgs/life --ref_bank_suffix=.psq \
--ref_bank_mapper 42-ribo-msas/ref_orgs/life/mapper-life-fast.idm \
--ref_org_mul=0.33 --ref_score_mul=0.99 \
--trim_homologues=on --trim_max_shift=5000 --trim_extra_margin=15 \
--merge_orthologues=off --aligner_mode=off \
--tax_reports=on --tax_dir taxdump \
--tax_min_score=80 --tax_score_mul=0.95 --tax_min_ident=0 --tax_min_len=0 --tax_max_hits=10 \
--tol_check=off
# run 42
$ forty-two.pl --config=config-42.yaml --outdir=reports \
--verbosity=1 --threads=20 42-ribo-msas/MSAs/life/*.ali
# analyse 42 results
$ debrief-42.pl --indir=reports/ --in=-42 --taxdir=taxdump/ \
--seq_labeling=42-ribo-msas/labelers/seq-labels.idl \
--contam_labeling=42-ribo-msas/labelers/contam-labels.idl
Forty-Two took 5 minutes and 1 second.
# Results of 42
$ cut -f1-5,7-11 per-genome-42.stats | column -t
bank tested_genes added_ali clean_ali contam_ali added_seq clean_seq contam_seq unclass_contam_seq unknown_seq
Gammaproteobacteria_GCF_000007145.1 98 52 52 0 52 52 0 0 0
Firmicutes_GCF_000003645.1 98 47 47 0 47 47 0 0 0
chimeric_bacterial 98 48 42 40 82 42 40 0 0
Eukaryote_GCF_000328475.2 98 58 47 6 67 51 6 0 10
chimeric_eukaryotic 98 84 47 48 148 51 87 0 10
$ tail -n+2 chimeric_eukaryotic-42.tsv | sort -rnk1
51 SELF
40 cellular organisms; Bacteria; Terrabacteria group; Firmicutes; Bacilli; Bacillales; Bacillaceae; Bacillus; Bacillus cereus group
34 cellular organisms; Bacteria; Proteobacteria; Gammaproteobacteria; Xanthomonadales; Xanthomonadaceae; Xanthomonas
10 unknown
4 cellular organisms; Bacteria; Proteobacteria; Gammaproteobacteria; Xanthomonadales; Xanthomonadaceae; Xanthomonas; Xanthomonas campestris
2 cellular organisms; Bacteria; Proteobacteria
1 cellular organisms; Bacteria; Terrabacteria group; Firmicutes; Clostridia; Eubacteriales; Peptostreptococcaceae; Filifactor; Filifactor alocis; Filifactor alocis ATCC 35896
1 cellular organisms; Bacteria; Terrabacteria group; Firmicutes; Bacilli; Bacillales; Bacillaceae; Bacillus
1 cellular organisms; Bacteria; Proteobacteria; Gammaproteobacteria; Xanthomonadales; Xanthomonadaceae; Xanthomonas; Xanthomonas campestris; Xanthomonas campestris pv. campestris
1 cellular organisms; Bacteria; Proteobacteria; Gammaproteobacteria; Xanthomonadales; Xanthomonadaceae
1 cellular organisms; Bacteria; Proteobacteria; Alphaproteobacteria; Rickettsiales; Rickettsiaceae; Rickettsieae; Rickettsia
1 cellular organisms; Bacteria; Proteobacteria; Alphaproteobacteria; Rhodospirillales
1 cellular organisms; Bacteria; Proteobacteria; Alphaproteobacteria
0 unclassified
On the bacterial genomes, whether reference or chimeric, the inter-domain ‘life’ dataset composed of 98 ribosomal proteins performs exactly as the smaller prokaryotic (RiboDB) dataset composed of 90 proteins (data not shown). Both reference genomes are clean (0 contam_ali and 0 contam_seq). They are also rather complete, with 52 and 47 recovered single-copy ribosomal proteins (added_ali/added_seq) in the (Gamma)Proteobacteria and the Firmicutes, respectively. Forty-Two finds many contaminants in the chimeric bacterial genome (40 contam_ali/contam_seq). This might be explained by an enrichment in ribosomal proteins among the single-copy core genes used to simulate the contamination. As the inter-domain dataset is still new, its performance is lower on the eukaryotic genomes in comparison to the well-tested eukaryotic dataset composed of 78 ribosomal proteins [40]. Indeed, some genuine eukaryotic ribosomal proteins are missed due to less sensitive BRH tests (47 vs 73 clean_ali, data not shown), whereas 6 “bacterial” homologues (contam_ali/contam_seq) are detected (and 10 remain tagged as unknown_seq); these mostly correspond to mitochondrial ribosomal proteins. Nevertheless, Forty-Two clearly reports that the chimeric eukaryotic genome is highly contaminated (48 contam_ali and 87 contam_seq). Moreover, the taxonomic analysis of the recovered ribosomal proteins globally matches the expectations, with SELF being Ustilago, and the other lineages corresponding to (Gamma)Proteobacteria (Xanthomonas) and Firmicutes (Bacillus), which are indeed the two bacterial genomes introduced in the fungal genome. Mitochondrial proteins appear clearly as (Alpha)Proteobacteria (and probably also as unknown entries).
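The lineage report of the chimeric eukaryotic genome can be condensed into a few broad groups to check that it adds up to the per-genome statistics. The Python sketch below does so with the counts of chimeric_eukaryotic-42.tsv shown above; lineages are truncated at the class level for brevity, and the function name and the group keywords are our own (hypothetical) choices, not part of Forty-Two.

```python
def tally(rows, groups):
    """Sum counts per group; the first group whose keyword occurs in the lineage wins."""
    totals = {name: 0 for name in groups}
    totals["other"] = 0
    for count, lineage in rows:
        for name, keyword in groups.items():
            if keyword in lineage:
                totals[name] += count
                break
        else:
            totals["other"] += count
    return totals

# Counts from chimeric_eukaryotic-42.tsv (lineages truncated at the class level).
rows = [
    (51, "SELF"),
    (40, "Bacteria; Terrabacteria group; Firmicutes; Bacilli"),
    (34, "Bacteria; Proteobacteria; Gammaproteobacteria"),
    (10, "unknown"),
    (4, "Bacteria; Proteobacteria; Gammaproteobacteria"),
    (2, "Bacteria; Proteobacteria"),
    (1, "Bacteria; Terrabacteria group; Firmicutes; Clostridia"),
    (1, "Bacteria; Terrabacteria group; Firmicutes; Bacilli"),
    (1, "Bacteria; Proteobacteria; Gammaproteobacteria"),
    (1, "Bacteria; Proteobacteria; Gammaproteobacteria"),
    (1, "Bacteria; Proteobacteria; Alphaproteobacteria"),
    (1, "Bacteria; Proteobacteria; Alphaproteobacteria"),
    (1, "Bacteria; Proteobacteria; Alphaproteobacteria"),
    (0, "unclassified"),
]

# Order matters: the two classes are tested before the phylum Proteobacteria.
groups = {
    "self": "SELF",
    "unknown": "unknown",
    "unclassified": "unclassified",
    "Firmicutes": "Firmicutes",
    "Gammaproteobacteria": "Gammaproteobacteria",
    "Alphaproteobacteria": "Alphaproteobacteria",
    "Proteobacteria (other)": "Proteobacteria",
}

totals = tally(rows, groups)
contaminants = sum(totals[name] for name in
                   ("Firmicutes", "Gammaproteobacteria", "Alphaproteobacteria",
                    "Proteobacteria (other)"))
print(totals)
print(contaminants, sum(totals.values()))  # 87 148
```

The 87 non-SELF, non-unknown sequences correspond to the contam_seq value and the grand total of 148 to the added_seq value reported for chimeric_eukaryotic.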
# Command log for GUNC v1.0.5
$ gunc download_db guncDB
$ gunc run --db guncDB/gunc_db_progenomes2.1.dmnd \
--input_fasta chimeric/GCF_000007145.1_ASM714v1_genomic.fasta --threads 20 \
--out_dir GUNC/GCF_000007145.1
$ gunc run --db guncDB/gunc_db_progenomes2.1.dmnd \
--input_fasta chimeric/GCF_000003645.1_ASM364v1_genomic.fasta --threads 20 \
--out_dir GUNC/GCF_000003645.1
$ gunc run --db guncDB/gunc_db_progenomes2.1.dmnd \
--input_fasta chimeric/chimeric-bact.fasta --threads 20 \
--out_dir GUNC/chimeric-bact
$ gunc run --db guncDB/gunc_db_progenomes2.1.dmnd \
--input_fasta chimeric/GCF_000328475.2_Umaydis521_2.0_genomic.fasta --threads 20 \
--out_dir GUNC/GCF_000328475.2
$ gunc run --db guncDB/gunc_db_progenomes2.1.dmnd \
--input_fasta chimeric/chimeric-bact-euk.fasta --threads 20 \
--out_dir GUNC/chimeric-bact-euk
GUNC took 35 minutes and 10 seconds, most of which was spent on the eukaryotic genomes; each bacterial genome required less than 3 minutes.
# Results of GUNC
genome n_genes_called n_genes_mapped n_contigs taxonomic_level proportion_genes_retained_in_major_clades genes_retained_index clade_separation_score contamination_portion n_effective_surplus_clades mean_hit_identity reference_representation_score pass.GUNC
Gammaproteobacteria_GCF_000007145.1 4245 4198 1 kingdom 1.0 0.99 0 0.0 0.0 0.98 0.97 True
Firmicutes_GCF_000003645.1 5412 5307 1 kingdom 1.0 0.98 0.0 0.0 0.0 0.93 0.91 True
chimeric_bacterial 5376 5293 5291 phylum 1.0 0.98 1.0 0.1 0.23 0.94 0.92 False
Eukaryote_GCF_000328475.2 11033 3117 25 kingdom 1.0 0.28 0.0 0.12 0.27 0.38 0.11 True
chimeric_eukaryotic 16409 8411 5317 kingdom 1.0 0.51 0.0 0.04 0.09 0.73 0.38 True
GUNC detects no contamination in the two reference bacterial genomes: the contamination scores are all at 0, while the reference_representation_score is high (0.97 and 0.91), indicating that both genomes map confidently to the database. The proportion of genes used is accordingly high (98% or more). The scores for the chimeric bacterial genome indicate that it also maps correctly to the GUNC database, with a reference_representation_score of 0.92 and 98% of genes used. The contamination_portion is 10%, which closely matches our simulation (11.89%). For the reference eukaryotic genome, few genes are used (only 28%) and its representation in the database is low as well (reference_representation_score of 0.11). This indicates that GUNC is not suitable for the evaluation of this genome, which was expected as GUNC was designed for bacteria. The delineation is more ambiguous for the eukaryotic genome contaminated by bacteria, since 51% of the genes are used, with a reference_representation_score of 0.38, and the genome nevertheless passes the GUNC test (pass.GUNC is True). As above, this shows that a naive usage of such a detection program can be a source of errors.
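The reading of the GUNC table made above can be summarised as a small decision rule: look at the reference_representation_score before trusting pass.GUNC. The Python sketch below encodes this reading for the five genomes. The function name and, above all, the cut-off of 0.5 are hypothetical choices made for illustration; they are not values taken from this note.

```python
def interpret_gunc(row, min_rrs=0.5):
    """Return a cautious reading of one GUNC result row.
    min_rrs is a hypothetical cut-off on reference_representation_score."""
    if row["reference_representation_score"] < min_rrs:
        return "poorly represented in the database, interpret with caution"
    if not row["pass.GUNC"]:
        return "likely chimeric"
    return "no chimerism detected"

# Values copied from the GUNC table above.
results = {
    "Gammaproteobacteria_GCF_000007145.1": {"reference_representation_score": 0.97, "pass.GUNC": True},
    "Firmicutes_GCF_000003645.1": {"reference_representation_score": 0.91, "pass.GUNC": True},
    "chimeric_bacterial": {"reference_representation_score": 0.92, "pass.GUNC": False},
    "Eukaryote_GCF_000328475.2": {"reference_representation_score": 0.11, "pass.GUNC": True},
    "chimeric_eukaryotic": {"reference_representation_score": 0.38, "pass.GUNC": True},
}

readings = {genome: interpret_gunc(row) for genome, row in results.items()}
for genome, reading in readings.items():
    print(f"{genome}: {reading}")
```

With such a rule, both eukaryotic genomes are set aside instead of being reported as clean, which is the behaviour one would want on complex samples.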
Physeter runs on BLAST reports. Here we used DIAMOND blastx [28] to generate them.
The database used in this note is the inter-domain database constructed to work on complex samples [3]: see https://doi.org/10.3389/fmicb.2021.755101.
# Command log for Physeter v0.213470 (diamond v2.0.4.142)
$ mkdir temp
# Gammaproteobacteria_GCF_000007145.1
# mv GCF_000007145.1_ASM714v1_genomic.fna GCF_000007145.1.fasta
$ diamond blastx -d life-tqmd-of73.dmnd -q GCF_000007145.1.fasta \
-o GCF_000007145.1.blastx -t temp -k 50 -e 1e-10 -f tab -p 20
$ physeter.pl GCF_000007145.1.blastx --fasta-dir=./ --outfile=contam_GCF_000007145.1.report \
--taxdir=taxdump-20211206/ --taxon-list=contam-labels.idl
# Firmicutes_GCF_000003645.1
# cp GCF_000003645.1_ASM364v1_genomic.fasta GCF_000003645.1.fasta
$ diamond blastx -d life-tqmd-of73.dmnd -q GCF_000003645.1.fasta \
-o GCF_000003645.1.blastx -t temp -k 50 -e 1e-10 -f tab -p 20
$ physeter.pl GCF_000003645.1.blastx --fasta-dir=./ --outfile=contam_GCF_000003645.1.report \
--taxdir=taxdump-20211206/ --taxon-list=contam-labels.idl
# chimeric_bacterial
$ diamond blastx -d life-tqmd-of73.dmnd -q chimeric-bact.fasta \
-o chimeric-bact.blastx -t temp -k 50 -e 1e-10 -f tab -p 20
$ physeter.pl chimeric-bact.blastx --fasta-dir=./ --outfile=contam_chim.report \
--taxdir=taxdump-20211206/ --taxon-list=contam-labels.idl \
--auto-detect --tax-min-hits=1 --tax-max-hits=50 --kfold=life-tqmd-of73.gca
# Eukaryote_GCF_000328475.2
# mv GCF_000328475.2_Umaydis521_2.0_genomic.fasta GCF_000328475.2.fasta
$ diamond blastx -d life-tqmd-of73.dmnd -q GCF_000328475.2.fasta \
-o GCF_000328475.2.blastx -t temp -k 50 -e 1e-10 -f tab -p 20
$ physeter.pl GCF_000328475.2.blastx --fasta-dir=./ --outfile=contam_GCF_000328475.2.report \
--taxdir=taxdump-20211206/ --taxon-list=contam-labels.idl \
--exp-tax=Opisthokonta
# chimeric_eukaryotic
$ diamond blastx -d life-tqmd-of73.dmnd -q chimeric-bact-euk.fasta \
-o chimeric-bact-euk.blastx -t temp -k 50 -e 1e-10 -f tab -p 20
$ physeter.pl chimeric-bact-euk.blastx --fasta-dir=./ --outfile=contam_chimeukal-lca.report \
--taxdir=taxdump-20211206/ --taxon-list=contam-labels.idl \
--auto-detect --tax-min-hits=1 --tax-max-hits=50
$ physeter.pl chim-bact-euk-split.blastx --fasta-dir=./ --outfile=contam_chimeukal-lca.report \
--taxdir=taxdump-20211206/ --taxon-list=contam-labels.idl \
--auto-detect --tax-min-hits=1 --tax-max-hits=50 --kfold=life-tqmd-of73.gca
Altogether, DIAMOND blastx and Physeter took 1 hour, 22 minutes and 1 second.
# Results of Physeter
# Gammaproteobacteria_GCF_000007145.1 with default options
Gammaproteobacteria_GCF_000007145.1 Proteobacteria 100.00 0.00 0.00 0.00 1.00
# Firmicutes_GCF_000003645.1 with default options
Firmicutes_GCF_000003645.1 Firmicutes 100.00 0.00 0.00 0.00 1.00
# chimeric bacterial, kfold and auto-detection
chimeric_bacterial Firmicutes 21.79 25.77 11.77 40.67 Proteobacteria=9.88,Terrabacteria group=4.16,Euryarchaeota=2.92,Opisthokonta=1.89,Synergistetes=1.67,Thermotogae=1.02,TACK group=0.83,PVC group=0.52,Spirochaetes=0.50,Viridiplantae=0.50,FCB group=0.31,Stramenopiles=0.31,Amoebozoa=0.26,Alveolata=0.22,Aquificae=0.22,Cryptophyceae=0.17,Haptista=0.11,Rhodophyta=0.11,Rhizaria=0.07,Glaucocystophyceae=0.06,Euglenozoa=0.04 2.90
chimeric-bacterial Firmicutes 21.94 25.27 12.21 40.58 Proteobacteria=9.10,Terrabacteria group=3.66,Euryarchaeota=2.89,Opisthokonta=2.02,Synergistetes=1.59,Thermotogae=1.05,TACK group=0.93,Viridiplantae=0.65,PVC group=0.50,Spirochaetes=0.48,Thermodesulfobacteria=0.44,Amoebozoa=0.35,FCB group=0.35,Stramenopiles=0.31,Alveolata=0.24,Aquificae=0.19,Cryptophyceae=0.19,Haptista=0.11,Rhodophyta=0.11,Rhizaria=0.06,Glaucocystophyceae=0.04,Euglenozoa=0.02 2.89
chimeric_bacterial Firmicutes 21.52 25.98 11.86 40.64 Proteobacteria=9.88,Terrabacteria group=3.53,Euryarchaeota=3.26,Opisthokonta=1.98,Synergistetes=1.67,Thermotogae=0.98,TACK group=0.91,Viridiplantae=0.52,PVC group=0.48,Thermodesulfobacteria=0.46,Spirochaetes=0.44,Amoebozoa=0.35,FCB group=0.35,Stramenopiles=0.31,Alveolata=0.20,Cryptophyceae=0.19,Aquificae=0.17,Haptista=0.11,Rhodophyta=0.07,Rhizaria=0.06,Glaucocystophyceae=0.04,Euglenozoa=0.02 2.84
chimeric_bacterial Firmicutes 22.85 24.83 11.71 40.61 Proteobacteria=9.94,Terrabacteria group=3.96,Euryarchaeota=3.05,Synergistetes=1.65,TACK group=1.02,Thermotogae=0.96,Opisthokonta=0.59,Viridiplantae=0.57,Spirochaetes=0.48,Amoebozoa=0.41,PVC group=0.41,Thermodesulfobacteria=0.41,FCB group=0.30,Stramenopiles=0.30,Aquificae=0.19,Cryptophyceae=0.17,Haptista=0.13,Alveolata=0.09,Glaucocystophyceae=0.07,Rhodophyta=0.07,Rhizaria=0.06,Euglenozoa=0.02 2.98
chimeric_bacterial Firmicutes 22.55 25.07 11.82 40.56 Proteobacteria=9.99,Terrabacteria group=3.39,Euryarchaeota=2.79,Opisthokonta=1.81,Synergistetes=1.61,Thermotogae=1.00,TACK group=0.81,Viridiplantae=0.52,PVC group=0.46,Spirochaetes=0.46,Thermodesulfobacteria=0.41,FCB group=0.33,Amoebozoa=0.31,Alveolata=0.24,Stramenopiles=0.24,Aquificae=0.17,Cryptophyceae=0.15,Haptista=0.11,Rhodophyta=0.11,Rhizaria=0.09,Glaucocystophyceae=0.04,Euglenozoa=0.02 2.94
chimeric_bacterial Firmicutes 22.33 25.42 11.95 40.30 Proteobacteria=9.90,Terrabacteria group=3.68,Euryarchaeota=2.81,Opisthokonta=1.96,Synergistetes=1.61,Thermotogae=0.94,TACK group=0.81,Viridiplantae=0.56,PVC group=0.48,Spirochaetes=0.44,Thermodesulfobacteria=0.41,FCB group=0.37,Stramenopiles=0.35,Amoebozoa=0.31,Alveolata=0.20,Aquificae=0.17,Cryptophyceae=0.17,Haptista=0.11,Rhodophyta=0.06,Euglenozoa=0.04,Glaucocystophyceae=0.04 2.92
chimeric_bacterial Firmicutes 22.33 25.37 11.54 40.76 Proteobacteria=9.75,Terrabacteria group=4.26,Euryarchaeota=3.02,Opisthokonta=1.89,Synergistetes=1.05,TACK group=0.83,Thermotogae=0.83,Viridiplantae=0.61,PVC group=0.52,Thermodesulfobacteria=0.48,FCB group=0.43,Spirochaetes=0.39,Stramenopiles=0.33,Amoebozoa=0.28,Alveolata=0.22,Cryptophyceae=0.17,Haptista=0.11,Rhodophyta=0.11,Rhizaria=0.06,Glaucocystophyceae=0.04 2.97
chimeric_bacterial Firmicutes 22.70 25.05 11.88 40.37 Proteobacteria=9.42,Terrabacteria group=3.70,Euryarchaeota=2.96,Opisthokonta=1.94,Synergistetes=1.74,TACK group=0.85,Thermotogae=0.81,Viridiplantae=0.54,PVC group=0.48,Thermodesulfobacteria=0.43,Amoebozoa=0.39,FCB group=0.39,Stramenopiles=0.33,Alveolata=0.22,Aquificae=0.17,Cryptophyceae=0.17,Spirochaetes=0.17,Haptista=0.11,Rhodophyta=0.11,Rhizaria=0.06,Euglenozoa=0.04,Glaucocystophyceae=0.04 2.87
chimeric_bacterial Firmicutes 21.83 25.53 12.04 40.60 Proteobacteria=9.25,Terrabacteria group=3.81,Euryarchaeota=3.20,Opisthokonta=2.04,Synergistetes=1.61,Thermotogae=0.98,TACK group=0.81,Viridiplantae=0.56,PVC group=0.52,Spirochaetes=0.46,Thermodesulfobacteria=0.44,FCB group=0.41,Stramenopiles=0.30,Alveolata=0.26,Cryptophyceae=0.22,Amoebozoa=0.19,Aquificae=0.19,Haptista=0.11,Rhodophyta=0.09,Rhizaria=0.06,Euglenozoa=0.04 2.88
chimeric_bacterial Firmicutes 22.37 25.20 11.91 40.52 Proteobacteria=9.45,Terrabacteria group=3.98,Euryarchaeota=3.20,Opisthokonta=2.00,Synergistetes=1.68,TACK group=0.93,Thermotogae=0.63,Spirochaetes=0.48,FCB group=0.46,Thermodesulfobacteria=0.46,Stramenopiles=0.37,Viridiplantae=0.37,Amoebozoa=0.31,Alveolata=0.24,Aquificae=0.22,PVC group=0.13,Rhodophyta=0.13,Rhizaria=0.06,Glaucocystophyceae=0.04,Cryptophyceae=0.02,Euglenozoa=0.02,Haptista=0.02 2.92
# Eukaryote_GCF_000328475.2 with Opisthokonta as an expected taxon
Eukaryote_GCF_000328475.2 Opisthokonta 100.00 0.00 0.00 0.00 1.00
# chimeric eukaryotic genome, auto-detection, min 2 hits
chimeric_eukaryotic Firmicutes 22.09 26.09 11.67 40.15 Proteobacteria=9.78,Terrabacteria group=3.68,Euryarchaeota=3.22,Opisthokonta=2.37,Synergistetes=1.64,Thermotogae=0.94,TACK group=0.79,Viridiplantae=0.63,Thermodesulfobacteria=0.50,PVC group=0.48,Spirochaetes=0.39,FCB group=0.35,Stramenopiles=0.33,Amoebozoa=0.31,Alveolata=0.22,Cryptophyceae=0.17,Haptista=0.11,Rhodophyta=0.09,Glaucocystophyceae=0.04,Rhizaria=0.04,Euglenozoa=0.02 2.87
# chimeric eukaryotic genome, after splitting, kfold and auto-detection
chimeric_eukaryotic Opisthokonta 56.63 2.94 0.84 39.59 Firmicutes=1.42,Proteobacteria=0.62,Euryarchaeota=0.20,Actinobacteria=0.13,Synergistetes=0.10,Terrabacteria group=0.10,Thermotogae=0.06,TACK group=0.06,Viridiplantae=0.03,PVC group=0.03,Spirochaetes=0.03,Thermodesulfobacteria=0.03,Amoebozoa=0.02,Bacteroidetes=0.02,Stramenopiles=0.02,Alveolata=0.02,Aquificae=0.01,Rhodophyta=0.01,Haptista=0.01,Rhizaria=0.00,Cryptophyceae=0.00,Euglenozoa=0.00,Glaucocystophyceae=0.00 1.16
chimeric_eukaryotic Opisthokonta 56.62 2.94 0.85 39.59 Firmicutes=1.41,Proteobacteria=0.60,Euryarchaeota=0.20,Actinobacteria=0.12,Terrabacteria group=0.12,Synergistetes=0.10,Thermotogae=0.06,TACK group=0.06,Viridiplantae=0.04,PVC group=0.03,Spirochaetes=0.03,Thermodesulfobacteria=0.03,Bacteroidetes=0.03,Stramenopiles=0.02,Amoebozoa=0.02,Alveolata=0.02,Cryptophyceae=0.01,Aquificae=0.01,Rhodophyta=0.01,Haptista=0.01,Euglenozoa=0.00,Glaucocystophyceae=0.00,Rhizaria=0.00 1.16
chimeric_eukaryotic Opisthokonta 56.54 2.99 0.86 39.61 Firmicutes=1.43,Proteobacteria=0.63,Euryarchaeota=0.20,Actinobacteria=0.14,Terrabacteria group=0.13,Synergistetes=0.09,Thermotogae=0.07,TACK group=0.06,Viridiplantae=0.04,PVC group=0.03,Spirochaetes=0.03,Thermodesulfobacteria=0.03,Bacteroidetes=0.02,Stramenopiles=0.02,Amoebozoa=0.02,Alveolata=0.01,Cryptophyceae=0.01,Aquificae=0.01,Haptista=0.01,Rhodophyta=0.01,Euglenozoa=0.00,Glaucocystophyceae=0.00,Rhizaria=0.00 1.16
chimeric_eukaryotic Opisthokonta 56.62 2.99 0.79 39.60 Firmicutes=1.46,Proteobacteria=0.64,Euryarchaeota=0.18,Actinobacteria=0.14,Terrabacteria group=0.14,Synergistetes=0.09,TACK group=0.06,Thermotogae=0.05,Viridiplantae=0.04,PVC group=0.03,Bacteroidetes=0.03,Amoebozoa=0.02,Stramenopiles=0.02,Aquificae=0.02,Alveolata=0.01,Cryptophyceae=0.01,Haptista=0.01,Spirochaetes=0.01,Euglenozoa=0.01,Rhodophyta=0.01,Glaucocystophyceae=0.00,Rhizaria=0.00 1.16
chimeric_eukaryotic Opisthokonta 56.63 2.92 0.86 39.59 Firmicutes=1.45,Proteobacteria=0.56,Euryarchaeota=0.20,Actinobacteria=0.12,Terrabacteria group=0.11,Synergistetes=0.11,Thermotogae=0.06,TACK group=0.05,PVC group=0.03,Spirochaetes=0.03,Thermodesulfobacteria=0.03,Amoebozoa=0.02,Bacteroidetes=0.02,Viridiplantae=0.02,Stramenopiles=0.02,Alveolata=0.02,Cryptophyceae=0.01,Aquificae=0.01,Haptista=0.01,Rhodophyta=0.00,Rhizaria=0.00,Euglenozoa=0.00,Glaucocystophyceae=0.00 1.16
chimeric_eukaryotic Opisthokonta 56.62 2.96 0.82 39.60 Firmicutes=1.45,Proteobacteria=0.64,Euryarchaeota=0.18,Actinobacteria=0.12,Terrabacteria group=0.12,Synergistetes=0.10,TACK group=0.06,Thermotogae=0.06,Viridiplantae=0.04,PVC group=0.03,Spirochaetes=0.03,Thermodesulfobacteria=0.03,Stramenopiles=0.02,Bacteroidetes=0.02,Amoebozoa=0.02,Alveolata=0.02,Cryptophyceae=0.01,Aquificae=0.01,Haptista=0.01,Rhodophyta=0.01,Rhizaria=0.00,Euglenozoa=0.00,Glaucocystophyceae=0.00 1.16
chimeric_eukaryotic Opisthokonta 56.63 2.94 0.82 39.61 Firmicutes=1.44,Proteobacteria=0.62,Euryarchaeota=0.16,Actinobacteria=0.14,Terrabacteria group=0.13,Synergistetes=0.11,TACK group=0.06,Thermotogae=0.04,Viridiplantae=0.04,Spirochaetes=0.03,Thermodesulfobacteria=0.03,Stramenopiles=0.03,Bacteroidetes=0.02,PVC group=0.02,Amoebozoa=0.02,Alveolata=0.02,Aquificae=0.01,Cryptophyceae=0.01,Rhodophyta=0.01,Euglenozoa=0.00,Rhizaria=0.00,Glaucocystophyceae=0.00,Haptista=0.00 1.17
chimeric_eukaryotic Opisthokonta 56.63 2.89 0.86 39.62 Firmicutes=1.35,Proteobacteria=0.64,Euryarchaeota=0.21,Actinobacteria=0.13,Terrabacteria group=0.12,Synergistetes=0.10,Thermotogae=0.07,TACK group=0.06,Spirochaetes=0.03,Viridiplantae=0.03,Thermodesulfobacteria=0.03,Bacteroidetes=0.03,Stramenopiles=0.02,Alveolata=0.02,Amoebozoa=0.01,Aquificae=0.01,Cryptophyceae=0.01,PVC group=0.01,Haptista=0.01,Rhodophyta=0.01,Glaucocystophyceae=0.00,Euglenozoa=0.00,Rhizaria=0.00 1.15
chimeric_eukaryotic Opisthokonta 9.95 3.18 2.10 84.77 Firmicutes=1.45,Proteobacteria=0.63,Euryarchaeota=0.19,Actinobacteria=0.13,Terrabacteria group=0.12,Synergistetes=0.11,Stramenopiles=0.07,Thermotogae=0.07,Viridiplantae=0.06,Amoebozoa=0.06,TACK group=0.05,PVC group=0.04,Spirochaetes=0.03,Thermodesulfobacteria=0.03,Rhodophyta=0.03,Rhizaria=0.02,Haptista=0.02,Alveolata=0.02,Bacteroidetes=0.02,Cryptophyceae=0.01,Euglenozoa=0.01,Glaucocystophyceae=0.01 2.62
chimeric_eukaryotic Opisthokonta 56.63 2.97 0.82 39.58 Firmicutes=1.45,Proteobacteria=0.64,Euryarchaeota=0.21,Actinobacteria=0.11,Synergistetes=0.10,Terrabacteria group=0.10,Thermotogae=0.06,TACK group=0.05,Viridiplantae=0.03,PVC group=0.03,Thermodesulfobacteria=0.03,Bacteroidetes=0.02,Spirochaetes=0.02,Amoebozoa=0.02,Stramenopiles=0.02,Alveolata=0.02,Cryptophyceae=0.01,Aquificae=0.01,Haptista=0.01,Rhodophyta=0.01,Rhizaria=0.00,Euglenozoa=0.00
Physeter identifies no contamination in the three reference genomes in auto-detection mode, as expected since it can work in an inter-domain setting. On the chimeric bacterial genome, Physeter correctly identifies the main organism (as Firmicutes) and Proteobacteria as the main contaminant, at nearly 10%, as expected. However, Physeter overestimates the contamination level by detecting other contaminants that are supposedly not present in the genome. When the bacterial and eukaryotic genomes are concatenated, Physeter detects the Firmicutes as the main organism. This is a logical result since we used CDS as sequences for bacteria, which represent thousands of individual sequences compared to the 37 scaffolds of the genuine eukaryotic genome. Indeed, Physeter reports its estimates in terms of sequence numbers (whether CDS, genome contigs or scaffolds, reads, etc.). When the eukaryotic genome is split into pseudo-reads, as recommended in Cornet et al. (2018) [22] and Lupo et al. (2021) [3], Physeter correctly identifies the eukaryotic genome as the main organism.
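Because the estimates are counted per sequence, the splitting step determines how much weight the eukaryotic genome receives. The sketch below illustrates one possible way of cutting a genome into consecutive, non-overlapping pseudo-reads. It is not the script used in this study: the function names, the `_partN` naming scheme and the window length are assumptions made for illustration only.

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (identifier, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            # Keep only the first word of the header as identifier.
            name, chunks = (header[0] if header else ""), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def split_into_pseudo_reads(
    records: Iterable[Tuple[str, str]], length: int
) -> Iterator[Tuple[str, str]]:
    """Cut each sequence into consecutive windows of at most `length` bases.

    `length` is a hypothetical parameter; the value used in the study
    is not stated in this note.
    """
    if length <= 0:
        raise ValueError("length must be a positive integer")
    for name, seq in records:
        for i, start in enumerate(range(0, len(seq), length), start=1):
            yield f"{name}_part{i}", seq[start:start + length]
```

Applied to an open FASTA file handle, `split_into_pseudo_reads(read_fasta(handle), length)` yields records that can be written back as `>name` / sequence pairs. The last window of each scaffold may be shorter than `length`.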
At the time of writing, we were not able to install the Kraken2 database, probably because of a recent modification in the architecture of the NCBI Taxonomy. Since Kraken (and Kraken2) have been maintained over the last ten years, we have no doubt that the current issue will be addressed by the authors. Meanwhile, we used the nt database installed (a couple of weeks earlier) on our older computing cluster. Therefore, the running time reported here should be interpreted as an upper bound only.
# Command log for Kraken2 v2.0.8-beta
$ kraken2 --use-names --db /media/vol1/databases/kraken2-nt/ \
GCF_000007145.1_ASM714v1_genomic.fasta --threads 40 \
--report GCF_000007145.1_ASM714v1_genomic.report > GCF_000007145.1_ASM714v1_genomic.kraken
$ kraken2 --use-names --db /media/vol1/databases/kraken2-nt/ \
GCF_000003645.1_ASM364v1_genomic.fasta --threads 40 \
--report GCF_000003645.1_ASM364v1_genomic.report > GCF_000003645.1_ASM364v1_genomic.kraken
$ kraken2 --use-names --db /media/vol1/databases/kraken2-nt/ \
chimeric-bact.fasta --threads 40 \
--report chimeric-bact.report > chimeric-bact.kraken
$ kraken2 --use-names --db /media/vol1/databases/kraken2-nt/ \
GCF_000328475.2_Umaydis521_2.0_genomic.fasta --threads 40 \
--report GCF_000328475.2_Umaydis521_2.0_genomic.report > GCF_000328475.2_Umaydis521_2.0_genomic.kraken
$ kraken2 --use-names --db /media/vol1/databases/kraken2-nt/ \
GCF_000328475.2-split.fasta --threads 40 \
--report GCF_000328475.2-split.report > GCF_000328475.2-split.kraken
$ kraken2 --use-names --db /media/vol1/databases/kraken2-nt/ \
chimeric-bact-euk-split.fasta --threads 40 \
--report chimeric-bact-euk-split.report > chimeric-bact-euk-split.kraken
Kraken2 on the nt database took 5 hours, 51 minutes and 12 seconds.
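The six invocations in the command log differ only in the input file, so they could be generated from a single template. The following is a minimal sketch of such a wrapper, assuming `kraken2` is available on the `PATH`. The helper names `kraken2_command` and `run_kraken2` are hypothetical; the flags are those shown in the command log.

```python
import subprocess
from pathlib import Path
from typing import List, Tuple


def kraken2_command(db: str, fasta: str, threads: int = 40) -> Tuple[List[str], str]:
    """Build the argument list for one Kraken2 run.

    Returns the command and the path of the per-sequence output file,
    which the command log obtains by redirecting standard output.
    """
    stem = Path(fasta).with_suffix("")  # drop the final ".fasta" only
    cmd = [
        "kraken2", "--use-names",
        "--db", str(db),
        str(fasta),
        "--threads", str(threads),
        "--report", f"{stem}.report",
    ]
    return cmd, f"{stem}.kraken"


def run_kraken2(db: str, fasta: str, threads: int = 40) -> None:
    """Run Kraken2 and write its standard output to <stem>.kraken."""
    cmd, out_path = kraken2_command(db, fasta, threads)
    with open(out_path, "w") as handle:
        # check=True raises CalledProcessError if Kraken2 exits with an error.
        subprocess.run(cmd, stdout=handle, check=True)
```

Calling `run_kraken2` in a loop over the six FASTA files would reproduce the log; separating command construction from execution makes the generated command easy to inspect before launching a run of several hours.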
# Gammaproteobacteria_GCF_000007145.1
100.00 1 0 R 1 root
100.00 1 0 R1 131567 cellular organisms
100.00 1 0 D 2 Bacteria
100.00 1 0 P 1224 Proteobacteria
100.00 1 0 C 1236 Gammaproteobacteria
100.00 1 0 O 135614 Xanthomonadales
100.00 1 0 F 32033 Xanthomonadaceae
100.00 1 0 G 338 Xanthomonas
100.00 1 0 S 339 Xanthomonas campestris
100.00 1 0 S1 340 Xanthomonas campestris pv. campestris
100.00 1 1 S2 190485 Xanthomonas campestris pv. campestris str. ATCC 33913
# Firmicutes_GCF_000003645.1
100.00 1 0 R 1 root
100.00 1 0 R1 131567 cellular organisms
100.00 1 0 D 2 Bacteria
100.00 1 0 D1 1783272 Terrabacteria group
100.00 1 0 P 1239 Firmicutes
100.00 1 0 C 91061 Bacilli
100.00 1 0 O 1385 Bacillales
100.00 1 0 F 186817 Bacillaceae
100.00 1 0 G 1386 Bacillus
100.00 1 0 G1 86661 Bacillus cereus group
100.00 1 1 S 2026186 Bacillus paranthracis
# chimeric_bacterial
0.20 11 11 U 0 unclassified
99.80 5394 26 R 1 root
99.30 5367 6 R1 131567 cellular organisms
99.04 5353 7 D 2 Bacteria
88.86 4803 2 D1 1783272 Terrabacteria group
88.83 4801 0 P 1239 Firmicutes
88.81 4800 0 C 91061 Bacilli
88.81 4800 0 O 1385 Bacillales
88.68 4793 437 F 186817 Bacillaceae
80.52 4352 364 G 1386 Bacillus
73.45 3970 1576 G1 86661 Bacillus cereus group
21.44 1159 1159 S 2026186 Bacillus paranthracis
11.40 616 505 S 1396 Bacillus cereus
0.78 42 42 S1 361100 Bacillus cereus Q1
0.43 23 23 S1 405535 Bacillus cereus AH820
0.43 23 0 S1 1179100 Bacillus cereus biovar anthracis
0.43 23 23 S2 637380 Bacillus cereus biovar anthracis str. CI
...
10.05 543 0 P 1224 Proteobacteria
10.05 543 0 C 1236 Gammaproteobacteria
9.95 538 0 O 135614 Xanthomonadales
9.95 538 0 F 32033 Xanthomonadaceae
9.95 538 5 G 338 Xanthomonas
9.86 533 230 S 339 Xanthomonas campestris
5.57 301 280 S1 340 Xanthomonas campestris pv. campestris
0.31 17 17 S2 190485 Xanthomonas campestris pv. campestris str. ATCC 33913
0.04 2 2 S2 1358004 Xanthomonas campestris pv. campestris str. CN03
0.02 1 1 S2 1281283 Xanthomonas campestris pv. campestris str. CN15
0.02 1 1 S2 1358018 Xanthomonas campestris pv. campestris str. CN13
0.04 2 0 S1 359385 Xanthomonas campestris pv. raphani
0.04 2 2 S2 990315 Xanthomonas campestris pv. raphani 756C
0.07 4 0 O 91347 Enterobacterales
0.06 3 0 F 543 Enterobacteriaceae
0.06 3 0 F1 2890311 Klebsiella/Raoultella group
0.06 3 0 G 570 Klebsiella
0.04 2 2 S 2058152 Klebsiella grimontii
0.02 1 1 S 548 Klebsiella aerogenes
...
# Eukaryote_GCF_000328475.2
100.00 27 0 R 1 root
100.00 27 0 R1 131567 cellular organisms
100.00 27 0 D 2759 Eukaryota
100.00 27 0 D1 33154 Opisthokonta
100.00 27 0 K 4751 Fungi
100.00 27 0 K1 451864 Dikarya
100.00 27 0 P 5204 Basidiomycota
100.00 27 0 P1 452284 Ustilaginomycotina
100.00 27 0 C 5257 Ustilaginomycetes
100.00 27 0 O 5267 Ustilaginales
100.00 27 0 F 5268 Ustilaginaceae
100.00 27 0 G 5269 Ustilago
100.00 27 0 S 5270 Ustilago maydis
100.00 27 27 S1 237631 Ustilago maydis 521
# Eukaryote_GCF_000328475.2, with split in pseudo-reads
24.92 19597 19597 U 0 unclassified
75.08 59045 18 R 1 root
75.02 59001 260 R1 131567 cellular organisms
74.21 58363 121 D 2759 Eukaryota
73.78 58021 135 D1 33154 Opisthokonta
71.19 55988 2 K 4751 Fungi
71.18 55981 19 K1 451864 Dikarya
70.98 55821 5 P 5204 Basidiomycota
70.96 55802 5 P1 452284 Ustilaginomycotina
70.94 55791 0 C 5257 Ustilaginomycetes
70.94 55791 0 O 5267 Ustilaginales
70.94 55791 109 F 5268 Ustilaginaceae
70.42 55376 3 G 5269 Ustilago
70.35 55328 3044 S 5270 Ustilago maydis
66.48 52280 52280 S1 237631 Ustilago maydis 521
0.01 4 4 S1 559306 Ustilago maydis FB1
0.05 43 43 S 307758 Ustilago bromivora
0.00 1 1 S 120017 Ustilago hordei
0.00 1 1 S 185366 Ustilago esculenta
0.22 170 21 G 63265 Sporisorium
0.09 70 44 S 72558 Sporisorium reilianum
0.02 15 15 S1 999809 Sporisorium reilianum SRZ2
0.01 11 11 S1 72559 Sporisorium reilianum f. sp. reilianum
0.09 69 69 S 49012 Sporisorium scitamineum
0.01 10 10 S 280036 Sporisorium graminicola
...
# chimeric_eukaryotic
23.33 19608 19608 U 0 unclassified
76.67 64439 44 R 1 root
76.59 64368 266 R1 131567 cellular organisms
69.45 58371 121 D 2759 Eukaryota
69.04 58028 135 D1 33154 Opisthokonta
66.62 55988 2 K 4751 Fungi
66.61 55981 19 K1 451864 Dikarya
66.42 55821 5 P 5204 Basidiomycota
66.39 55802 5 P1 452284 Ustilaginomycotina
66.38 55791 0 C 5257 Ustilaginomycetes
66.38 55791 0 O 5267 Ustilaginales
66.38 55791 109 F 5268 Ustilaginaceae
65.89 55376 3 G 5269 Ustilago
65.83 55328 3044 S 5270 Ustilago maydis
62.20 52280 52280 S1 237631 Ustilago maydis 521
0.00 4 4 S1 559306 Ustilago maydis FB1
0.05 43 43 S 307758 Ustilago bromivora
0.00 1 1 S 120017 Ustilago hordei
0.00 1 1 S 185366 Ustilago esculenta
0.20 170 21 G 63265 Sporisorium
0.08 70 44 S 72558 Sporisorium reilianum
0.02 15 15 S1 999809 Sporisorium reilianum SRZ2
...
6.81 5724 25 D 2 Bacteria
5.85 4917 3 D1 1783272 Terrabacteria group
5.77 4852 0 P 1239 Firmicutes
5.75 4834 0 C 91061 Bacilli
5.73 4819 0 O 1385 Bacillales
5.71 4799 437 F 186817 Bacillaceae
5.18 4354 364 G 1386 Bacillus
4.72 3970 1576 G1 86661 Bacillus cereus group
1.38 1159 1159 S 2026186 Bacillus paranthracis
0.73 616 505 S 1396 Bacillus cereus
0.05 42 42 S1 361100 Bacillus cereus Q1
0.03 23 23 S1 405535 Bacillus cereus AH820
0.03 23 0 S1 1179100 Bacillus cereus biovar anthracis
0.03 23 23 S2 637380 Bacillus cereus biovar anthracis str. CI
...
0.86 727 8 P 1224 Proteobacteria
0.75 634 3 C 1236 Gammaproteobacteria
0.66 557 0 O 135614 Xanthomonadales
0.66 555 1 F 32033 Xanthomonadaceae
0.65 543 7 G 338 Xanthomonas
0.63 533 230 S 339 Xanthomonas campestris
0.36 301 280 S1 340 Xanthomonas campestris pv. campestris
0.02 17 17 S2 190485 Xanthomonas campestris pv. campestris str. ATCC 33913
...
Kraken2 finds no contamination in the reference genomes. As for Physeter, the results are computed based on the number of sequences. When the eukaryotic genome is split into pseudo-reads, the number of unclassified sequences increases and the k-mer affiliations change. Kraken2 correctly identifies the main organism in the chimeric bacterial genome, with 88% of Firmicutes and 10% of Proteobacteria, as expected from the simulated data. Kraken2 also works correctly in our inter-domain setting, even if the proportions of each organism are not well estimated (69.5% of Eukaryota vs 80.1% in the simulation, 6.1% of Firmicutes vs 16.1% in the simulation, 0.86% of Proteobacteria vs 3.8% in the simulation).
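The proportions discussed above are clade-level sequence counts divided by the total number of sequences (classified plus unclassified). As a hedged illustration, the sketch below recomputes such percentages from a Kraken2 report. It assumes the standard tab-separated report layout (percentage, clade count, direct count, rank code, taxid, indented name); the function names are hypothetical and not part of Kraken2.

```python
from typing import Dict, Iterable, List


def parse_kraken2_report(lines: Iterable[str]) -> List[dict]:
    """Parse tab-separated Kraken2 report lines into dictionaries."""
    rows = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue  # skip headers, ellipses and blank lines
        rows.append({
            "percent": float(fields[0]),
            "clade_count": int(fields[1]),   # sequences in this clade
            "direct_count": int(fields[2]),  # sequences assigned to this taxon itself
            "rank": fields[3].strip(),
            "taxid": int(fields[4]),
            "name": fields[5].strip(),       # names are indented in the report
        })
    return rows


def clade_percentages(rows: List[dict], names: Iterable[str]) -> Dict[str, float]:
    """Return the percentage of all sequences falling in each named clade."""
    by_taxid = {row["taxid"]: row["clade_count"] for row in rows}
    # Total = unclassified (taxid 0) + everything under root (taxid 1).
    total = by_taxid.get(0, 0) + by_taxid.get(1, 0)
    if total == 0:
        raise ValueError("report contains no sequences")
    wanted = set(names)
    return {
        row["name"]: 100.0 * row["clade_count"] / total
        for row in rows
        if row["name"] in wanted
    }
```

Because the denominator is a number of sequences rather than a number of bases, the resulting percentages depend on how each genome was cut into sequences (CDS, scaffolds or pseudo-reads), which is one reason they may differ from the simulated proportions.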