Evaluating collapsed misassembly with asmgene

25 December 2020

Why?

It is usually easy to evaluate the contiguity of a de novo assembly – just compute N50. It is much harder to evaluate the correctness. We typically identify misassemblies by aligning contigs to a reference genome. However, it is tricky to interpret the results. In case of human, there are thousands of structural variations (SVs) between the reference and the sample being assembled. Alignment-based evaluation often mistakes these SVs as misassemblies. For example, QUAST identifies >10,000 “misassemblies” in the T2T assembly when compared to GRCh38. We can’t reliably tell misassemblies from SVs which leads to overestimated misassembly rate. A second problem with reference-based alignment is that most alignment differences come from complex regions such as centromeres and subtelomeres. It fails to evaluate gene regions we are mostly interested in; on the contrary it penalizes an assembly that represents these complex regions better.

How?

Most assembly problems are caused by repetitive or paralogous regions. When an assembler cannot resolve such a region, it either creates an assembly gap or forces through the region with a misassembly. To probe these issues, we can align a multi-copy gene to the assembly and see if it remains multi-copy.

More precisely, we do the following. We align all annotated transcripts to a reference genome and select the longest isoform among overlapping transcripts. For each selected transcript, we record a hit if the transcript is mapped at ≥99% identity over ≥99% of of the transcript length. A transcript is considered to be single-copy (SC) if it has only one hit; otherwise it is considered multi-copy (MC). We do the same for the assembly and then compute the fraction of missing multi-copy gene as

MMC = 1 - |{MCinASM} ∩ {MCinREF}| / |{MCinREF}|

In the ideal case of a perfect assembly, %MMC should be zero. A higher fraction suggests more collapsed assemblies. We can compute percent MMC (%MMC) with paftools.js asmgene from minimap2:

minimap2 -cxsplice:hq -t8 ref.fa cdna.fa > ref.cdna.paf
minimap2 -cxsplice:hq -t8 asm.fa cdna.fa > asm.cdna.paf
paftools.js asmgene [-a] ref.cdna.paf asm.cdna.paf

The output looks like:

H Metric   ref.cdna asm.cdna
X full_sgl 36426    36389
X full_dup 0        18
X frag     0        4
X part50+  0        5
X part10+  0        0
X part10-  0        10
X dup_cnt  1334     1330
X dup_sum  4110     4080

On the line X dup_cnt, 1334 is the number of multi-copy genes in the reference, of which 1330 remain multi-copy in the assembly. %MMC is thus 1-1330/1334=0.3%. Also in this output, 36426 is the number of single-copy genes in the reference, of which 36389 remain single-copy in the assembly and 18 are false duplications. We can similarly compute BUSCO-like metrics but based on the reference.

Collapsed misassemblies in long-read assemblies

The following figure shows the level of collapsed genes in various CHM13 assemblies, taking Ensembl genes as input and the T2T CHM13 as the reference:

Because the reference and the assemblies all come from the same sample, a perfect assembly should have no collapsed genes. Hifiasm and HiCanu are close to that mark. However, some assemblers may miss up to 70% of multi-copy genes. We had a closer look at missing genes. As is expected, they either fall into assembly gaps or leave a misassembly in a long contig. If you want to study a gene family, such assembly problems will ruin your day.

CHM13 is a homozygous cell line. The following figure shows the level of collapsed genes in diploid HG00733 assemblies, again with CHM13 as the reference:

In this figure, even GRCh38 misses 10% multi-copy genes in CHM13. This is background noises caused by between-sample SVs. It is much lower than the level of collapsed misassemblies of many assemblers, demonstrating the effectiveness this metric.

Conclusion

Percent MMC is a new metric to measure the quality of an assembly. It takes minutes to compute, is gene focused and is robust to structural variations in comparison to evaluations based on assembly-to-reference alignment. The obvious downside of %MMC is that it requires a high-quality reference genome and is not applicable to new species, but this is not a concern during the development of assemblers.