On a reference pan-genome model (Part II)

12 July 2019

I wrote a blog post on a potential reference pan-genome model. I had more thoughts in my mind. I didn’t write about them because they are immature. Nonetheless, a few readers raised questions related to my immature thoughts, so I decide to add this “Part II” as a response. Please note that this and the previous blog posts only represent my own limited view. A consortium will be formed to build a better genome reference. They may come up with a distinct but much better solution than what I am saying here.

The previous post talked about one problem with the current reference genome: it lacks diversity. Another more practical issue is that the coordinate system drastically changes with each major release. To avoid inconsistent coordinate, many projects are still using the older GRCh37 even though GRCh38 has been around for half of a decade. At the same time, GRCh38 is being actively updated with new “patches” arriving regularly. However, these patches are not well integrated into GRCh38. Using them naively will lead to loss of information. Almost no complete tool chains work with them; the few existing patch-aware tools (e.g. BWA-MEM ALT mapping) are only ad hoc hacks. If we keep using patches, few will be benefited from them; if we integrate patches into the primary assembly, everyone will need to remap all the data from time to time. A few readers of the previous blog post asked: how will a graph model help? Here is my vision:

In the short term (say 5 years), we can start from the primary assembly of GRCh38 and build a reference graph integrating large variations in existing patches and other long-read assemblies. We have three possible strategies to work with the new reference:

Those who prefer the current practice can continue to map data to the primary GRCh38, the backbone of the graph. The graph will tell us regions susceptible to artifacts caused by large varitions. These are like blacklisted regions generated by ENCODE.
We can extract novel sequences in the graph and treat them as separate contigs. These contigs are like decoy sequences and will attract false mappings away. We will still use our existing tools (e.g. STAR and Bowtie2) for most analyses. This strategy will supposedly give us cleaner results, but it won’t take full advantage of the graph.
When we have capable graph mappers, we can map data the graph. This will do better than the strategy 2. We will need to project graph mappings to GRCh38. Vg has surjection; I have a complement idea, which I am not detailing here.

All the three strategies use the same coordinate system: GRCh38. The analysis results won’t be the same, but will be close. We won’t need liftOver. I imagine in a foreseeable future, a great majority of the community will use the linear coordinate most of time. The graph elements only come in at certain steps.

When we update the reference genome, we will insert new sequence segments, add deletion links or mark a segment to be “deleted” without hard removal from the graph. These modifications are like the current GRC patches, but they won’t mess up the existing coordinate system. This will allow us to integrate data from different minor versions of the reference. We do need to watch out batch effects due to version changes. I can’t predict how much they matter in comparison to batch effects from other sources. Excluding blacklisted regions may alleviate this issue.

You can already explore the graph world in my vision if you are brave enough. Strategy 1 is what you are currently doing. I haven’t created the blacklist, but I see it can be done by traversing the graph. For Strategy 2, you can run the following command line to get the linearized reference:

git clone https://github.com/lh3/gfatools
cd gfatools && make
wget ftp://ftp.dfci.harvard.edu/pub/hli/minigraph/GRCh38-0.1-14.gfa.gz
./gfatools gfa2fa -s GRCh38-0.1-14.gfa.gz > GRCh38-0.1-14.fa

Here option -s asks for stable FASTA output. The output file GRCh38-0.1-14.fa will include the entire GRCh38 primary assembly and extra 26Mb large SVs or diverged regions extracted from 13 other human assemblies. You can build a STAR/Bowtie2 index and map reads normally. You will see reads mapped to the additional contigs.

Strategy 3 is incomplete, but you can get a peek about how a graph reference may affect your analysis:

git clone https://github.com/lh3/minigraph
cd minigraph && make
./minigraph -x se -t16 /path/to/GRCh38-0.1-14.gfa.gz se-reads.fa > out.gaf

For 150bp Illumina reads, you will see reads bridging GRCh38 and non-GRCh38 segments. There are not a lot of them, but most of them come from regions with weird alignment. Longer reads will involve more non-GRCh38 paths in the alignment.

The graph in this example was automatically built with minigraph. It just gives you a feeling. I envision the construction of the actual reference graph is likely to involve meticulous manual curation. A reference may not be complete, but what is in there has to be extremely accurate.

I don’t have a clear vision for the long term. The GRCh38 coordinate system will have a longer life span in the graph world. We can keep using it for quite some time. Ultimately, we will routinely sequence human genomes to a quality higher than the primary GRCh38 assembly. We will need to rethink the concept of “reference”. I would lose my job if I kept doing the same thing by then.

Anyway, the above is my vision. I am aware of and have thought over other alternatives, but I think a conservative reference model is more likely to be accepted and actually benefit the community. Finally, I reiterate that these ideas are immature and only represent my own view. They are more like food for thought. What will happen may be vastly different.