QC Statistics
Using CheckM2 on long prokaryotic contigs
The easiest way to filter out long (>300 kb) erroneous prokaryotic contigs is to run CheckM2 on individual contigs.
If a chimeric contig is long enough, it will often have high CheckM2 contamination because it carries duplicated marker genes from multiple organisms.
To extract all contigs of length >= X bp and run CheckM2 on each of them, you can use mylotools:
cd myloasm_results
mylotools extract-contigs --output-folder contigs_dir --min-contig-length X
checkm2 predict --input contigs_dir -x fa -o checkm2_results --threads 40
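Once CheckM2 has finished, the per-contig results can be screened for high contamination. The helper below is a minimal sketch, not part of myloasm or CheckM2. It assumes that CheckM2 wrote a tab-separated report (typically checkm2_results/quality_report.tsv) with a header row containing Name and Contamination columns. The function name high_contamination and the default cutoff of 10% are hypothetical choices for illustration.

```shell
# List contigs whose CheckM2 contamination exceeds a cutoff.
# Usage: high_contamination REPORT_TSV [CUTOFF_PERCENT]
# Assumption: REPORT_TSV is tab-separated with "Name" and "Contamination" columns.
high_contamination() {
    report=$1
    cutoff=${2:-10}   # hypothetical default, not a myloasm or CheckM2 recommendation
    awk -F'\t' -v cutoff="$cutoff" '
        NR == 1 {
            # Locate the columns by header name rather than by position.
            for (i = 1; i <= NF; i++) {
                if ($i == "Name") name_col = i
                if ($i == "Contamination") cont_col = i
            }
            if (!name_col || !cont_col) {
                print "expected Name and Contamination columns" > "/dev/stderr"
                exit 1
            }
            next
        }
        # Print contig name and contamination for rows above the cutoff.
        ($cont_col + 0) > (cutoff + 0) { print $name_col "\t" $cont_col }
    ' "$report"
}
```

For example, high_contamination checkm2_results/quality_report.tsv 10 would print the name and contamination of every contig above 10%. Contigs flagged this way are candidates for manual inspection, not automatic removal.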
Using k-mer multiplicity statistics for duplicates/strain chimeras
Myloasm's FASTA record headers report how often 21-mers are repeated within a contig (the contig's k-mer multiplicity):
>u123123ctg_XXX_mult-1.00 <- fasta record with k-mer multiplicity
In our experience, prokaryotic contigs should almost always have an average k-mer multiplicity near 1.00. If you have a very long contig (> 1 Mbp) with multiplicity > 1.05, it may be a chimera of multiple strains of a species.
For small genomes (e.g. viruses), the expected k-mer multiplicity may deviate from 1.00. However, a small contig with k-mer multiplicity much greater than 1 can be suspicious. A k-mer multiplicity of 2, 3, or another integer value can indicate a perfectly duplicated contig.
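The multiplicity check can be scripted by reading the value from each FASTA header and measuring the contig length from the sequence itself. The function below is a rough sketch under two assumptions: the contig id (the first whitespace-delimited token of the header) contains a _mult-<number> field as in the example record, and the function name flag_high_mult is hypothetical. The defaults mirror the thresholds mentioned in the text (length > 1 Mbp, multiplicity > 1.05).

```shell
# Print "contig_id<TAB>length<TAB>multiplicity" for long, high-multiplicity contigs.
# Usage: flag_high_mult FASTA [MIN_LEN] [MIN_MULT]
# Assumption: the contig id contains a "_mult-<number>" field.
flag_high_mult() {
    fasta=$1
    min_len=${2:-1000000}   # "very long" contig cutoff from the text (1 Mbp)
    min_mult=${3:-1.05}     # multiplicity cutoff from the text
    awk -v min_len="$min_len" -v min_mult="$min_mult" '
        function report() {
            if (id != "" && len > min_len + 0 && mult > min_mult + 0)
                printf "%s\t%d\t%.2f\n", id, len, mult
        }
        /^>/ {
            report()               # flush the previous record
            id = substr($1, 2)     # first header token, without ">"
            len = 0
            mult = 0
            # Extract the number that follows "_mult-" in the contig id.
            if (match(id, /_mult-[0-9]+(\.[0-9]+)?/))
                mult = substr(id, RSTART + 6, RLENGTH - 6) + 0
            next
        }
        # Sequence line: strip whitespace and add to the running length.
        { gsub(/[ \t\r]/, ""); len += length($0) }
        END { report() }
    ' "$fasta"
}
```

Calling the function with a smaller length cutoff and a multiplicity cutoff just under 2 (for example flag_high_mult some_contigs.fa 0 1.9, where the file name is a placeholder) would instead list candidates for perfectly duplicated contigs.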
Notes on circularized contigs
For prokaryotic genomes, low CheckM2 completeness can indicate premature circularization. However:
- we have found that complete organelle genomes (mitochondria, plastids from microeukaryotes) can have non-trivial CheckM2 completeness (> 30% but < 90%)
- secondary chromosomes and genomes with multiple chromosomes can have lower completeness
- some clades of microbes have low CheckM2 completeness scores, even when they're complete
The myloasm tag circular-possibly indicates lower-confidence circular genomes (due to low coverage or assembly graph ambiguity), but these can often be complete genomes, especially if CheckM2 scores are good.