# On a reference pan-genome model

Last weekend, I made gfatools and minigraph open to the public. Both repos come with some documentation, but they haven’t explained the background and motivation behind them. This blog post gives a more complete picture.

The primary assembly of GRCh38, our current human reference genome, is largely a concatenation of individual haplotype segments. It aims to model a single human genome and lacks thousands of structural variations (SVs). The missing SVs cause a multitude of problems, which have been documented in many papers. A solution is to construct a pan-genome reference. The question is “how?”.

My answer to that is rGFA. rGFA is a text format and, more importantly, a
data model. It introduces the concept of *stable coordinates*, which are
persistent under sequence split and insertion operations. We can
incrementally “add” a new genome to a graph without breaking the old coordinate
system. At the same time, if we start with a linear reference genome, an
augmented graph naturally inherits the coordinate system from the linear
reference. We can have the benefits of both linear and graphical
representations.
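
To make the idea of stable coordinates concrete, here is a minimal Python sketch. It is not code from minigraph or gfatools. It assumes each segment records the name of the sequence it came from, its offset on that sequence and a rank (0 for the linear reference), mirroring the `SN`, `SO` and `SR` tags as I understand the rGFA format; the `Segment` class, the helper functions and the toy sequences are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    name: str  # segment ID in the graph; may change as the graph grows
    seq: str   # segment sequence
    sn: str    # name of the stable sequence this segment comes from
    so: int    # 0-based offset of the segment on that stable sequence
    sr: int    # rank: 0 for the linear reference, >0 for added genomes

def split(seg, pos, left_name, right_name):
    """Split a segment at an internal position; stable coordinates carry over."""
    if not 0 < pos < len(seg.seq):
        raise ValueError("split position must fall inside the segment")
    left = Segment(left_name, seg.seq[:pos], seg.sn, seg.so, seg.sr)
    right = Segment(right_name, seg.seq[pos:], seg.sn, seg.so + pos, seg.sr)
    return left, right

def locate(segments, sn, pos):
    """Map a stable coordinate (sn, pos) to (segment name, offset in segment)."""
    for s in segments:
        if s.sn == sn and s.so <= pos < s.so + len(s.seq):
            return s.name, pos - s.so
    raise KeyError(f"{sn}:{pos} is not covered by any segment")

def base_at(segments, sn, pos):
    """Return the base at a stable coordinate."""
    name, off = locate(segments, sn, pos)
    return next(s for s in segments if s.name == name).seq[off]

# Start from a toy linear reference held in a single segment.
ref = Segment("s1", "ACGTACGTAC", "chr1", 0, 0)
before = base_at([ref], "chr1", 6)

# Add a genome carrying a 3-bp insertion after chr1 position 4:
# split the reference segment and add the inserted sequence as a new segment.
left, right = split(ref, 4, "s2", "s3")
ins = Segment("s4", "TTT", "sampleA", 4, 1)  # coordinates on the sample contig
graph = [left, right, ins]

# The reference coordinate still resolves to the same base after augmentation.
after = base_at(graph, "chr1", 6)
assert before == after
```

The segment IDs change when `s1` is split into `s2` and `s3`, but `chr1:6` keeps pointing at the same base, which is what lets the augmented graph inherit the coordinate system of the linear reference.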

Minigraph shows that the above is more than just an idea: it works in practice, at least to some extent (it constructs a graph from 15 human assemblies in an hour). Along this line of work, GAF is a first text format to describe sequence-to-graph mapping. The sister repo gfatools implements a few utilities to work with rGFA. These are all connected by design.

There is much more to be done: minigraph has many limitations;
gfatools lacks important functionalities; GAF alone is inadequate and can’t play
the same role as SAM; the starting linear reference genome has a lot of room
for improvement given the advances in sequencing technologies. With the same
data model, there can also be alternative approaches to graph construction
(e.g. via VCF, compact de Bruijn graph, multiple-sequence alignment or all-pair
alignment). Minigraph is more of a *proof-of-concept* starting point. Community
efforts are the only way to build a pan-genome reference that is practical,
accurate, and comprehensive enough to represent genome diversity and ultimately
help us understand genetics better.
