Output directory
myloasm_output
├── assembly_primary.fa <-- CONTIGS
├── final_contig_graph.(gfa/edges) <-- GRAPH
└── OTHER STUFF (secondary files)
The main contig file: assembly_primary.fa
These are the main polished contigs to be used for downstream analysis most of the time.
Contig fasta record label definition
The fasta records looks like this:
>u2658272ctg_len-2275750_circular-possibly_depth-10-9-9_mult-1.00
u2658272ctg
is the identifier and len-X
is the length in nucleotides. There is also information about the circularity, depth of coverage, and repetitiveness of the contig.
Circularity
circular-X
: X is either yes
, no
, or possibly
.
If a contig has a self loop in the assembly graph and
- has average depth > 5x and
- has no other connections (i.e., consists of a single node and a single edge)
then it is yes
.
If it has a self loop, but (1) or (2) fails above, then it is possibly
. Otherwise, it is no
.
Estimated depth of coverage
depth-X1-X2-X3
: This is the estimated depth of coverage.
The three distinct values represent different levels of nucleotide identity thresholds.
X1
is the depth while allowing alignments of approximately > 99% true nucleotide similarity (estimated using myloasm's SNP + k-mer formulation).X2
is the depth for > 99.75% similarity.X3
is the depth for 100% similarity.
K-mer multiplicity
mult-X
: this is the estimated fraction of repetitiveness and useful for quality control.
This is estimated by counting 21-mers across the contig and seeing how often they repeat on average after removing the most frequent and least frequent 10% of k-mers.
- Most complete prokaryotic genomes should have
mult-1.00
. - Sometimes myloasm erroneously concatenates two similar strain genomes together. In this case,
mult
is > 1.0 and indicative of contamination. - Long reads often contain duplicate plasmids. Sometimes, a plasmid is erroneously duplicated or miscircularized, giving
mult
> 1.
The final assembly graph: final_contig_graph.gfa
Each contig in assembly_primary.fa
is a node in the final assembly graph, final_contig_graph.gfa
. This is in modified GFA format and can be visualized with Bandage.
Each *.gfa
file has a corresponding *.edges
file. The edges files gives more detailed information about contig-to-contig overlaps.
Understanding myloasm's GFA assembly graph format
Myloasm's GFA file contains a header starting with H
. Then, there is additional contig + read information that may be useful for manual curation.
Contig information
S u165592ctg LN:i:1034269 DP:f:25.0 DP2:f:24.0 DP3:f:23.0 MEDIAN_DP1:f:35.0
- NOTE:
LN:i
is only an approximate length. This is the length before polishing. DP
,DP2
, andDP3
is the median of allDP/2/3
values across the reads (see below). See here for more information on depth.MEDIAN_DP1
is the median-of-the-median-depth across all reads.DP
is the median-of-the-minimum-depth across all reads. You can ignore the difference and useDP
instead ofMEDIAN_DP1
most of the time.
Read information:
a u165592ctg,0-28710 165592 0 SRRX.2+split0 - 28710 DP1:22,DP2:22,DP3:21,READ_LEN:43601,OL_LEN_NEXT:14808,SNP_SHARE_NEXT:216,SNP_DIFF_NEXT:1
a u165592ctg,34-6529 1421634 28710 SRRX.1+split0+split0 + 6495 DP1:42,DP2:42,DP3:13,READ_LEN:21212,OL_LEN_NEXT:14629,SNP_SHARE_NEXT:129,SNP_DIFF_NEXT:0
a
indicates this is a readu165592ctg,0-28710
indicates that positions 0-28710 come from readSRRX.2+split0
before polishing.165592
is the numeric read id. Myloasm assigns each read as a unique id.0
indicates this read's contribution to the unpolished contig starts at position 0SRRX.2+split0
- Read name. If myloasm thinks a read is chimeric, it may be split into several sub reads. For example,split1
indicates the second chunk after splitting. Unsplit reads may also be denoted as+split0
. There are two rounds of chimera splitting.-/+
- read orientation
The last columns indicate the depth and overlap information of each read. See here for more information on varying depth values.
OL_LEN_NEXT
- number of overlapping bases to the next readSNP_SHARE_NEXT
- number of estimated shared SNPs to the next readSNP_DIFF_NEXT
- number of estimated different SNPs to the next read.