<p>Heng Li’s blog, by Heng Li (lh3@me.com), http://lh3.github.io</p>
<h2><a href="http://lh3.github.io/2024/03/05/what-high-performance-language-to-learn">What high-performance language to learn?</a> (2024-03-05)</h2>
<p>In the past couple of months, I have been asked several times about what language(s) to learn
if someone wants to write high-performance programs.
This is a sensitive topic that often triggers heated debate
partly because many fast languages share similar features and are comparable in performance.
My general take is that for small research projects only involving a few developers,
the choice of programming languages is personal.
Nevertheless, if you held a gun to my head and forced me to recommend a single high-performance language,
I would probably say <a href="https://www.rust-lang.org/">Rust</a>.
Here are my thoughts.</p>
<h3 id="why-not-python-or-r">Why not Python or R?</h3>
<p>Python and R are the most popular programming languages in computational biology.
However, they are slow unless you can find libraries written in other efficient languages.
How slow? Often <a href="https://benchmarksgame-team.pages.debian.net/benchmarksgame/fastest/python3-gcc.html">~50 times slower</a> than a C/C++ implementation.
This means a one-minute task in C/C++ is turned into one hour and
a one-hour task is turned into two days.
When you have a new method whose bottleneck falls in pure Python/R code,
the inefficiency of Python/R may limit how far you can push it.</p>
<p>Another practical problem, at least with Python, is the deployment of your tools.
To run a Python program, users have to install dependencies on their machines.
This is slow and sometimes problematic when the dependencies of different packages conflict with each other.
With C/C++, it is possible, though it takes effort, to compile portable binaries that do not require users to install dependencies.
Properly distributed C/C++ tools are easier to install and use.</p>
<h3 id="why-rust">Why Rust?</h3>
<p>Rust is a mature memory-safe programming language with little/no compromise on performance.
It is as efficient as C/C++ and almost free of memory-related errors.
Unlike C/C++, Rust comes with its own package manager, which greatly simplifies the reuse of existing libraries;
Rust also makes it easy to create portable executables, improving user experiences.
Rust has been used <a href="https://www.kernel.org/doc/html/next/rust/index.html">for Linux kernel development</a> and endorsed <a href="https://security.googleblog.com/2024/03/secure-by-design-googles-perspective-on.html">by Google</a>, <a href="https://twitter.com/markrussinovich/status/1571995117233504257">by the CTO of Microsoft Azure</a> and
surprisingly, even <a href="https://www.whitehouse.gov/oncd/briefing-room/2024/02/26/memory-safety-statements-of-support/">by the White House</a>.
You can easily find articles on the Internet about how good Rust is.
I will save some words here.</p>
<h3 id="the-caveat">The caveat</h3>
<p>Rust is associated with a steep learning curve.
Its strict memory model makes it challenging to implement certain data structures.
Even implementing a <a href="https://rust-unofficial.github.io/too-many-lists/">doubly linked list</a> or a graph may take a lot more effort than in C/C++.
Furthermore, the Rust memory model brings new concepts that are rarely seen in other common programming languages and take time to digest.
I have written five short programs in Rust but I am still at the bottom of the learning curve.</p>
<p>Nevertheless, I want to note that mastering C from scratch might not be easier.
Although C looks simpler, getting the best out of C without frequent <a href="https://en.wikipedia.org/wiki/Segmentation_fault">segmentation faults</a> may take years of practice.
Modern C++ is a very different language from C.
It is a behemoth with a myriad of modern features and decades of legacy packed in one of the most complex languages.
Mastering C++ takes even more effort.</p>
<h3 id="other-contenders">Other contenders</h3>
<p>Rust is not the only language that matches the performance of C/C++ without memory-related bugs.
If you do not need the best possible performance and feel Rust is difficult to learn, Go may be a decent second choice.
As to other high-performance languages, some have not reached the v1.0 milestone and are not stable;
some have a small community, which will make learning harder;
some, in my humble opinion, are not designed well;
some (e.g. Swift and C#) are popular in industry but are rarely used in our field for historical reasons.
I like playing with these languages, but I wouldn’t recommend them to general computational biologists.</p>
<h3 id="am-i-switching-to-rust">Am I switching to Rust?</h3>
<p>Not in the near future.
I enjoy the freedom of <a href="https://www.stroustrup.com/quotes.html">shooting myself in the foot</a>.</p>
<h2><a href="http://lh3.github.io/2022/10/21/random-open-syncmers">Random open syncmers</a> (2022-10-21)</h2>
<p>A $k$-long sequence $P$ is a ($k$,$s$)-open-<a href="https://peerj.com/articles/10805/">syncmer</a>, $s\le k$, if
$P[1,s]$ is the smallest among all $s$-mers in $P$. Suppose function $\phi$ is
a bijective hash function on $k$-long sequences. $P$ is a random
($k$,$s$)-open-syncmer if $\phi(P)$ is an open syncmer. Because we often map
$k$-mers to integers, $\phi$ can take the form of an <a href="https://gist.github.com/lh3/974ced188be2f90422cc">invertible integer hash
function</a>. In practice, $\phi$ does not have to be a bijection. It can
also map a sequence to an integer of a different length or even operate in the
bit space (see the <a href="https://arxiv.org/abs/2210.08052">miniprot preprint</a>).</p>
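<p>The definition above can be expressed in a few lines of Python. This is a toy sketch, not minimap2’s actual implementation: the hash is the masked Thomas Wang invertible integer hash (in the spirit of the gist linked above), and requiring the first $s$-mer to equal the minimum (ties allowed) is my own choice.</p>

```python
def seq2int(s):
    # 2-bit encode an A/C/G/T string as an integer
    code = {'A': 0, 'C': 1, 'G': 2, 'T': 3}
    x = 0
    for c in s:
        x = x << 2 | code[c]
    return x

def hash64(x, mask):
    # invertible integer hash (Thomas Wang style), masked to the 2k-bit space
    x = (~x + (x << 21)) & mask
    x = x ^ x >> 24
    x = (x + (x << 3) + (x << 8)) & mask
    x = x ^ x >> 14
    x = (x + (x << 2) + (x << 4)) & mask
    x = x ^ x >> 28
    x = (x + (x << 31)) & mask
    return x

def is_open_syncmer(x, k, s):
    # x encodes a k-mer in 2k bits; test if its first s-mer is the smallest s-mer
    smask = (1 << 2 * s) - 1
    smers = [x >> 2 * (k - s - i) & smask for i in range(k - s + 1)]
    return smers[0] == min(smers)

def is_random_open_syncmer(seq, s):
    # apply the invertible hash first, then test the open-syncmer condition
    k = len(seq)
    return is_open_syncmer(hash64(seq2int(seq), (1 << 2 * k) - 1), k, s)
```

Without the hash, <code class="language-plaintext highlighter-rouge">AACGT</code> is a (5,2)-open-syncmer because its first 2-mer <code class="language-plaintext highlighter-rouge">AA</code> encodes to the smallest value; with the hash applied, roughly one in $k-s+1$ $k$-mers becomes a syncmer.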
<p>As overlapping $k$-mers are not independent, the definition of the original open
syncmer often involves one more parameter to improve its quality. Original open
syncmers also do not work well with protein sequences, where amino acid
frequencies vary. With a good hash function, random open syncmers do not have these
problems.</p>
<p>I implemented random open syncmers <a href="https://github.com/lh3/minimap2/blob/c2f07ff2ac8bdc5c6768e63191e614ea9012bd5d/sketch.c#L145-L192">in minimap2</a>. In comparison to
random minimizers of the same density, syncmers lead to better chaining scores
but are more repetitive. This is partly because ($k$,$w$)-minimizers select
$k$-mers from a $(k+w-1)$-long window and thus, to some extent, behave like
slightly longer $k$-mers. Due to this repetitiveness, syncmers slow down
minimap2 chaining a lot, similar to the observation made by <a href="https://academic.oup.com/bioinformatics/article/38/20/4659/6432031">Shaw and Yu
(2022)</a>. I tried a few different syncmer configurations and found
minimizers and syncmers are comparable overall. In practical implementation, it
probably does not matter what strategy to use. Nonetheless, in theoretical
analysis, random open syncmers are the better choice as they are largely
independent of each other under a good hash function.</p>
<h2><a href="http://lh3.github.io/2022/09/28/additional-recommendations-for-creating-command-line-interfaces">A few suggestions for creating command line interfaces</a> (2022-09-28)</h2>
<p><a href="https://en.wikipedia.org/wiki/Command-line_interface">Command-line interface</a>, or CLI in brief, specifies how a user interacts
with a program on the command line. <a href="https://www.doherty.edu.au/people/associate-professor-torsten-seemann">Torsten Seemann</a> wrote a <a href="https://academic.oup.com/gigascience/article/2/1/2047-217X-2-15/2656133">good
article</a> on creating CLI. This blog post adds a few more suggestions.</p>
<h4 id="1-keep-the-backward-compatibility-of-cli-as-much-as-possible">1. Keep the backward compatibility of CLI as much as possible</h4>
<p>Backward compatibility here means users can upgrade and run a tool without
changing the command lines they used in the past. This implies we should not
remove or change the meaning of an existing option. It is ok to add new
options. Backward compatibility, in my opinion, is the most important factor in
CLI design and outweighs all the following points.</p>
<h4 id="2-human-first-command-lines-are-meant-for-a-human-to-type">2. Human-first: command lines are meant for a human to type</h4>
<p>It is important to keep CLI simple such that a human can remember the basic
syntax and type a command line without reading the full manual or looking back
through the bash history. For this goal, the tool should only require
indispensable input (e.g. input files) and it should set sensible default
values good for general use cases.</p>
<p>What makes a good default value has to be decided on a case-by-case basis. I will only
mention one example. A common parameter used by high-performance tools is the
number of parallel threads. In my opinion, a tool should not attempt to use all
available CPUs by default because this default behavior may greatly impact many
users in a cluster environment. Some of my tools use 3 or 4 threads by default.
Defaulting to one thread is perhaps more common.</p>
<p>Command-line tools are also invoked in shell scripts or workflow scripts. In
this case, we do not repeatedly type command lines. An explicit and verbose CLI
may help to reduce typos and is preferred. It is worth considering such use
cases. Nonetheless, the human-first principle is still more important.</p>
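<p>As a minimal sketch of this principle, a hypothetical tool could require only the input file and default to a modest thread count; the option names below are made up for illustration:</p>

```python
import argparse

# Minimal sketch: one indispensable argument, sensible defaults for the rest
p = argparse.ArgumentParser(description='hypothetical example tool')
p.add_argument('input', help='input file (required)')
p.add_argument('-t', '--threads', type=int, default=4,
               help='number of threads [4]; not all available CPUs')
args = p.parse_args(['reads.fq'])  # a user only has to type the input file
```

With this setup, <code class="language-plaintext highlighter-rouge">args.threads</code> is 4 unless the user asks for more, so the tool behaves well on a shared cluster by default.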
<h4 id="3-if-possible-readwrite-a-file-as-a-data-stream">3. If possible, read/write a file as a data stream</h4>
<p>With file streaming, we read or write a file without jumping back and forth
using something like <code class="language-plaintext highlighter-rouge">seek()</code> calls. Streaming is essential to unix pipes and
will make the tool work nicely with others. By convention, it is also preferred
to support a single dash <code class="language-plaintext highlighter-rouge">-</code> for standard input/output, but this is not that
important in unix as we can use <code class="language-plaintext highlighter-rouge">/dev/stdin</code> or <code class="language-plaintext highlighter-rouge">/dev/stdout</code> as long as the
tool supports file streaming.</p>
<p>A corollary is that a tool should not guess input/output file formats from
file extensions, because data streams do not have file extensions. We could use a
<a href="https://en.wikipedia.org/wiki/Named_pipe">named pipe</a>, but that is awkward.</p>
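<p>Supporting the single-dash convention takes only a couple of lines. A Python sketch (the helper name <code class="language-plaintext highlighter-rouge">xopen</code> is my own):</p>

```python
import sys

def xopen(fn, mode='r'):
    # By unix convention, "-" means stdin for reading and stdout for writing;
    # anything else is an ordinary file opened for sequential streaming.
    if fn == '-':
        return sys.stdin if 'r' in mode else sys.stdout
    return open(fn, mode)

# A tool can then stream line by line without ever calling seek():
#   for line in xopen(input_name): process(line)
```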
<h4 id="4-print-useful-information-to-the-standard-error-output">4. Print useful information to the standard error output</h4>
<p>It would be good for a tool to print the version number and the full command
line in use. I often find this helpful when going back to old analyses. I
recommend printing something like “Done!” when the tool finishes. This lets
users know the tool has not crashed in the middle. Printing progress is also
convenient for a long-running job, as users can get a rough estimate of how
long the job will take.</p>
<p>All these messages should be printed to the <a href="https://en.wikipedia.org/wiki/Standard_streams#Standard_error_(stderr)">standard error output</a>,
not to the <a href="https://en.wikipedia.org/wiki/Standard_streams#Standard_output_(stdout)">standard output</a>. The standard output is meant to be used
for piping (see suggestion 3). In addition, in many languages, text written
to the standard output is buffered by default for efficiency and may not be
flushed when the tool is interrupted. The standard error output is usually
unbuffered and is more useful for logging and debugging.</p>
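<p>In Python, for instance, a tiny logging helper following this suggestion might look like the sketch below; the message prefix is my own convention, not a standard:</p>

```python
import sys

def log(msg):
    # Messages go to stderr so that stdout stays clean for piping results
    print(f'[M::example] {msg}', file=sys.stderr)

log('command line: ' + ' '.join(sys.argv))  # record how the tool was invoked
# ... do the real work, writing results to stdout ...
log('Done!')  # tells the user the tool was not killed midway
```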
<h4 id="5-use-a-getopt-compatible-library-to-parse-command-line-options">5. Use a getopt-compatible library to parse command-line options</h4>
<p>This is a minor point. The unix/GNU getopt convention allows both short options
and long options with multiple variations (see <a href="https://nullprogram.com/blog/2020/08/01/">this article</a> for
details). Most unix tools, except gcc, follow this convention. The standard
libraries in many languages also support it. A tool adopting different
behaviors will increase the chance of misuse. Some may argue the unix
convention is confusing, but breaking the convention is worse.</p>
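<p>Python’s standard <code class="language-plaintext highlighter-rouge">getopt</code> module, for example, reproduces the convention, including the context-dependent meaning of <code class="language-plaintext highlighter-rouge">-ab</code>; the option letters here are arbitrary:</p>

```python
import getopt

# With short options "ab", both -a and -b are flags, so -ab means -a -b
opts, args = getopt.gnu_getopt(['-ab', 'in.txt'], 'ab')
assert opts == [('-a', ''), ('-b', '')] and args == ['in.txt']

# With "a:b", -a takes an argument, so -ab means -a b
opts, args = getopt.gnu_getopt(['-ab', 'in.txt'], 'a:b')
assert opts == [('-a', 'b')] and args == ['in.txt']
```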
<h2><a href="http://lh3.github.io/2021/10/10/introducing-dual-assembly">Introducing dual assembly</a> (2021-10-10)</h2>
<p><strong>Definition.</strong> The dual assembly of a diploid sample consists of two sets of contigs with each
set representing one complete haploid genome.
Similar to contigs in a primary assembly,
contigs in a dual assembly may have occasional switches between parental haplotypes.
I called such an assembly a partially phased assembly in an <a href="http://lh3.github.io/2021/04/17/concepts-in-phased-assemblies">earlier post</a> but
decided to coin a new term in our <a href="https://arxiv.org/abs/2109.04785">new hifiasm preprint</a> for clarity.</p>
<p><em>Why dual assembly?</em></p>
<p><strong>A primary or collapsed assembly only represents one haploid genome.</strong>
It is okay and may be preferred if we want to construct the reference genome of a new species.
However, if we want to profile sequence variations in a population,
such an assembly would not work as it randomly misses half of the information in a diploid genome.
It is necessary to recover both haplotypes.</p>
<p><strong>Haplotype-resolved de novo assembly is the ultimate solution to variant calling.</strong>
Genome-In-A-Bottle (GIAB) recently constructed two new variant calling benchmarks, one <a href="https://www.nature.com/articles/s41467-020-18564-9">on HLA</a>
and the other <a href="https://www.biorxiv.org/content/10.1101/2021.06.07.444885v3">on clinically important genes</a>.
In both cases, GIAB used haplotype-resolved assembly as the main source of ground truth
because assembly can reconstruct complex regions longer than the read length.
In such regions, reference-based read mapping often misplaces reads due to the lack of long-range information.
The figure below shows an example where read mapping leaves a gap in
gene <em>GTF2IRD2</em> while trio or Hi-C assemblies can go through both haplotypes.</p>
<p><strong>However, producing a haplotype-resolved assembly requires multiple data types,</strong>
such as parental sequences or Hi-C in addition to long sequence reads.
This increases sequencing costs and is sometimes infeasible, for example,
when we cannot obtain enough DNA or don’t have access to parental samples.
The need for more data is particularly problematic for clinical samples.</p>
<p><strong>A dual assembly can be produced with long reads only.</strong>
It is a weaker version of haplotype-resolved assembly but is almost as powerful for variant calling purposes.
In the figure below, the dual assembly (the top two tracks)
also correctly resolves both haplotypes.
The primary/alternate assembly pair can also be produced with long reads only.
However, the alternate assembly is often too fragmented.
It misses one haplotype in the example below.</p>
<p><strong>I recommend producing a dual assembly for calling structural variations,</strong> if you only have long reads.
As of now, only hifiasm and <a href="https://github.com/cschin/Peregrine">peregrine</a> can produce such assemblies, and only for PacBio HiFi data,
but I expect more assemblers to support such assemblies for more data types.</p>
<p><img src="http://www.liheng.org/images/GTF2IRD2-igv.png" alt="drawing" width="700" /></p>
<h2><a href="http://lh3.github.io/2021/07/06/remapping-an-aligned-bam">Remapping an aligned BAM</a> (2021-07-06)</h2>
<p>This is a short post on how to remap short reads in an aligned BAM using
bwa-mem. My recommendation is (requiring bash)</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>samtools collate -Oun128 in.bam | samtools fastq -OT RG,BC - \
| bwa mem -pt8 -CH <(samtools view -H in.bam|grep ^@RG) ref.fa - \
| samtools sort -@4 -m4g -o out.bam -
</code></pre></div></div>
<p>Here, <code class="language-plaintext highlighter-rouge">samtools collate</code> groups the two reads in a read pair and outputs an
uncompressed BAM stream. <code class="language-plaintext highlighter-rouge">samtools fastq</code> consumes this stream and generates an
interleaved FASTQ. Option <code class="language-plaintext highlighter-rouge">-T RG,BC</code> copies RG and BC tags in the input BAM to
the output FASTQ comment lines. <code class="language-plaintext highlighter-rouge">bwa mem -C</code> then copies these tags to the
output SAM. Option <code class="language-plaintext highlighter-rouge">-H <(...)</code> inserts header <code class="language-plaintext highlighter-rouge">@RG</code> lines. Option <code class="language-plaintext highlighter-rouge">-p</code>
processes an interleaved FASTQ stream. Finally, <code class="language-plaintext highlighter-rouge">samtools sort</code> generates
sorted BAM. I often use <code class="language-plaintext highlighter-rouge">-@4 -m4g</code> for faster sorting. If you have unaligned
BAM (aka uBAM), you can skip the first <code class="language-plaintext highlighter-rouge">collate</code> step.</p>
<p>I wrote about designing command-line interfaces a couple of days ago. This post
exemplifies the power of a proper design: you can chain multiple tools together
to achieve high performance without writing any high-performance code.</p>
<h2><a href="http://lh3.github.io/2021/07/04/designing-command-line-interfaces">Designing a command-line interface</a> (2021-07-04)</h2>
<p>This post is inspired by Vince’s <a href="https://twitter.com/vsbuffalo/status/1411771531407990784">tweet</a>. It describes my thoughts on
the design of command-line interface (CLI). Note that this article doesn’t
necessarily represent the best practices; it just shows my personal
preferences.</p>
<p>First of all, I need to clarify the terminology. For example, in the command
line below</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>rm -f file.txt
</code></pre></div></div>
<p><code class="language-plaintext highlighter-rouge">rm</code> is the <em>command</em>. <code class="language-plaintext highlighter-rouge">file.txt</code> is the <em>command-line argument</em>, which is
required for <code class="language-plaintext highlighter-rouge">rm</code> to function. <code class="language-plaintext highlighter-rouge">-f</code> is the <em>command-line option</em>, which is
optional. Some call <code class="language-plaintext highlighter-rouge">file.txt</code> a positional option, but I think that is
inconsistent because <code class="language-plaintext highlighter-rouge">file.txt</code> is not optional.</p>
<p>In general, I think a good CLI should be intuitive, concise and easy to
remember and to type. Here are some more specific details in my mind when I
design CLI.</p>
<ul>
<li>
<p>I prefer to use command-line arguments and strive to make the default
setting (i.e. without any options) work well. I personally feel it is easier to
remember the positions of arguments than to remember the option letters,
especially when different tools often use the same option letter for distinct
meanings. The positions of arguments tend to be more consistent across tools.
For example, most mainstream aligners take the first argument as the
reference/database and the second as the query. Another benefit of using
arguments is that users can more easily specify multiple similar input sources.</p>
</li>
<li>
<p>I use a command-line argument parser compatible with the getopt behavior.
With getopt, an option with one hyphen like <code class="language-plaintext highlighter-rouge">-a</code> is called a short option and an
option with two hyphens like <code class="language-plaintext highlighter-rouge">--foo</code> is called a long option. If you see <code class="language-plaintext highlighter-rouge">-ab</code>,
it may mean <code class="language-plaintext highlighter-rouge">-a b</code> or <code class="language-plaintext highlighter-rouge">-a -b</code>, depending on the definition of <code class="language-plaintext highlighter-rouge">-a</code>. I know this
looks complicated to new users, but it is the convention the vast majority of
unix tools have adopted for decades. Users will learn the getopt behavior one
way or another anyway. Implementing non-standard behaviors (e.g. parsing
<code class="language-plaintext highlighter-rouge">-ab</code> as a long option, requiring a space after each option or allowing spaces
in an option) is more likely to cause confusion.</p>
</li>
<li>
<p>I try to avoid long options for basic settings because long options are
harder to remember and take longer to type. At least for basic use cases, I
expect a user to type a command line, not to copy-paste an excessively long
command line from his/her notebook. Human-first.</p>
</li>
<li>
<p>Accept the standard input (aka stdin) as much as possible. This will help to
connect different tools together. Note that certain applications may save a lot
of memory or may be greatly simplified if they read an input file more than
once. It is ok not to support stdin in this case.</p>
</li>
<li>
<p>If possible, output the results to the standard output (aka stdout) and
output the error and messaging information to the standard error output
(aka stderr). Nonetheless, it is also preferred to have an <code class="language-plaintext highlighter-rouge">-o file.out</code> option
to output to an ordinary file in case stdout and stderr may get mixed.</p>
</li>
<li>
<p>For a long-running tool, I will make it output a message to stderr when it
ends. For example, minimap2 outputs running time and peak memory to stderr in
the end. If you see this message, you will know minimap2 is not killed by the
system.</p>
</li>
<li>
<p>Keep CLI backwardly compatible such that a pipeline using an earlier version
of the tool can update the tool to the latest version without any code changes.
Backward compatibility is essential to the long-term stability of a
command-line tool.</p>
</li>
</ul>
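<p>One of the bullets above, writing to stdout by default with an <code class="language-plaintext highlighter-rouge">-o</code> fallback, fits in a short sketch; the details are my own illustration, not a prescription:</p>

```python
import sys
import argparse

# Sketch of the "-o, defaulting to stdout" pattern described above
p = argparse.ArgumentParser()
p.add_argument('-o', metavar='FILE', default='-',
               help="output file; '-' for stdout [-]")
args = p.parse_args([])
out = sys.stdout if args.o == '-' else open(args.o, 'w')
print('result line', file=out)  # results are pipe-friendly by default
```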
<p>The above are just general directions for you to think about. Even I
don’t strictly follow them. If interested, you can google “cli best practice”
to see others’ takes on the topic.</p>
<h2><a href="http://lh3.github.io/2021/05/17/an-fm-index-of-400k-sars-cov-2-genomes">An FM-index of 400k SARS-CoV-2 genomes</a> (2021-05-17)</h2>
<p>Leonardo Martins <a href="https://twitter.com/leomrtns/status/1393315682352250881">tweeted</a> that xz can compress 1.4 million SARS-CoV-2
genomes in a 39GB FASTA down to 74MB. That is a very impressive compression
ratio! This reminds me of my earlier work on <a href="https://pubmed.ncbi.nlm.nih.gov/25107872/">FM-index construction</a>.</p>
<p>For an experiment, I downloaded ~400k SARS-CoV-2 genomes from EBI’s <a href="https://www.covid19dataportal.org/">COVID-19
data portal</a> (<a href="https://www.gisaid.org/">GISAID</a> has ~1.5M genomes but imposes
restrictions) and generated an FM-index of these sequences in both strands
with <a href="https://github.com/lh3/ropebwt2">ropebwt2</a></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ropebwt2 -do sars-cov-2.fmd sequences_fasta_2021-05-15.fa.gz
</code></pre></div></div>
<p>The command line took ~30 minutes. The output file <code class="language-plaintext highlighter-rouge">sars-cov-2.fmd</code> is 33MB in
size. It keeps the BWT and the necessary information for backward/forward
search. You can find this file <a href="https://zenodo.org/record/4771285">at Zenodo</a>.</p>
<p>Here are a few things you can do with this file, using
<a href="https://github.com/lh3/fermi2">fermi2</a>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># uncompress the FM-index; forward and reverse strands are interleaved
fermi2 unpack sars-cov-2.fmd | less -S
# count 61-mers occurring 10 times or more using 4 threads (5 sec on my laptop)
fermi2 count -k61 -o10 -t4 sars-cov-2.fmd | less
# count how many times a sequence in FASTA occurs in the FM-index
fermi2 match sars-cov-2.fmd query.fa
# get the count of every 61-mer in query sequences
fermi2 kprof -k61 sars-cov-2.fmd query.fa
# find supermaximal exact matches (SMEMs)
fermi2 match -p sars-cov-2.fmd query.fa
# find SMEMs at least 200bp occurring 1000 times or more, using 8 threads
fermi2 match -pt8 -l200 -n1000 sars-cov-2.fmd
# generate sampled suffix array and output positions
fermi2 sa -t8 sars-cov-2.fmd > sars-cov-2.fmd.sa # slooow
fermi2 match -pm1000 -s sars-cov-2.fmd.sa sars-cov-2.fmd query.fa # can be very sloow
</code></pre></div></div>
<p>Although the original input sequences total 12GB in length (or 24GB if
we consider both strands), all but the last operation take ~33MB of RAM, the
size of the index. That is the advantage of the FM-index and similar indices.</p>
<p>PS: I don’t study SARS-CoV-2 genomes. I did the above for fun only. Let me know
if you feel some of these might be useful to your research and want to learn
more.</p>
<h2><a href="http://lh3.github.io/2021/04/17/concepts-in-phased-assemblies">Concepts in phased assemblies</a> (2021-04-17)</h2>
<p>Concepts in phased assembly:</p>
<ul>
<li>
<p><a href="https://www.genome.gov/genetics-glossary/Contig">Contig</a>: a contiguous sequence in an assembly. A contig does not
contain long stretches of unknown sequences (aka <em>assembly gaps</em>).</p>
</li>
<li>
<p>Scaffold: a sequence consisting of one or multiple contigs connected by
assembly gaps of typically inexact sizes. A scaffold is also called a
<a href="https://en.wiktionary.org/wiki/supercontig">supercontig</a>, though this terminology is rarely used nowadays.</p>
</li>
<li>
<p>Assembly: a set of contigs or scaffolds. In the following, I will say an
assembly is <em>haploid complete</em> or simply <em>complete</em> if it is supposed to
represent a haploid genome in full.</p>
</li>
<li>
<p><a href="https://www.ncbi.nlm.nih.gov/books/NBK44482/">Haplotig</a>: a contig that comes from the same haplotype. In an
unphased assembly, a contig may join alleles from different parental
haplotypes in a diploid or polyploid genome.</p>
</li>
<li>
<p>Switch error: a change from one parental allele to another parental allele on
a contig (see the figure below). This terminology has been used for measuring
reference-based phasing accuracy <a href="https://pubmed.ncbi.nlm.nih.gov/12386835/">for two decades</a>. A haplotig is
supposed to have no switch errors.</p>
</li>
<li>
<p>Yak hamming error: an allele not on the most supported haplotype of a
contig (see the figure below). Its main purpose is to test how close a contig is
to a haplotig. This definition is tricky. The terminology was perhaps first
used by <a href="https://pubmed.ncbi.nlm.nih.gov/29101320/">Porubsky et al (2017)</a> in the context of reference-based
phasing. However, adapting it for contigs is not straightforward. The
<a href="https://github.com/lh3/yak">yak</a> definition is not widely accepted. The hamming error rate is
arguably less important in practice (Richard Durbin, personal communication).</p>
</li>
</ul>
<p>Types of phased assemblies.</p>
<ul>
<li>
<p>Collapsed assembly: a complete assembly with parental alleles randomly
switching in a contig. Most conventional assemblers produce collapsed
assemblies. A collapsed assembly is also called a squashed assembly or a
consensus assembly.</p>
</li>
<li>
<p>Primary assembly: a complete assembly with long stretches of phased blocks.
The concept has been <a href="https://www.ncbi.nlm.nih.gov/grc/help/definitions/">used by GRC</a>. BAC-to-BAC assemblies can all be
regarded as primary assemblies. Falcon-unzip is perhaps the first to produce
such assemblies for whole-genome shotgun data.</p>
</li>
<li>
<p>Alternate assembly: an incomplete assembly consisting of haplotigs in
heterozygous regions. An alternate assembly always accompanies a primary
assembly. It is not useful by itself as it is fragmented and incomplete.</p>
</li>
<li>
<p>Partially phased assembly: sets of complete assemblies with long stretches
of phased blocks, representing an entire diploid/polyploid genome. Peregrine
is perhaps the first to produce such assemblies. I coined this term as
I could not find a proper one in the existing literature. I don’t like
the terminology; if someone has a better name, let me know.</p>
</li>
<li>
<p>Haplotype-resolved assembly: sets of complete assemblies consisting of
haplotigs, representing an entire diploid/polyploid genome. This concept has
been inconsistently used in publications without a clear definition. The
above is my take in our recent papers.</p>
</li>
</ul>
<p>Furthermore, we may have chromosome-scale haplotype-resolved assembly where
haplotigs from the same chromosome are fully phased. For germline genomes,
the highest standard is <a href="https://github.com/nanopore-wgs-consortium/CHM13">telomere-to-telomere</a> assembly where each
chromosome is fully phased and assembled without gaps.</p>
<style> .extable td,th { padding: 4px; } </style>
<table border="1" class="extable">
<tr><th>Asm type </th><th>Complete?</th><th>Haplotig?</th><th>N50 </th><th>SwitchErr</th><th>HammingErr</th></tr>
<tr><td>Collapsed</td><td>Yes </td><td>No </td><td>Long </td><td>Many </td><td>Many </td></tr>
<tr><td>Primary </td><td>Yes </td><td>No </td><td>Long </td><td>Some </td><td>Many </td></tr>
<tr><td>Alternate</td><td>No </td><td>Yes </td><td>Short</td><td>Few </td><td>Few </td></tr>
<tr><td>Partial </td><td>Yes </td><td>No </td><td>Long </td><td>Some </td><td>Many </td></tr>
<tr><td>Resolved </td><td>Yes </td><td>Yes </td><td>Long </td><td>Few </td><td>Few </td></tr>
</table>
<p><img src="http://www.liheng.org/images/asmconcepts/phased-asm-flow.png" alt="drawing" width="500" /></p>
<h2><a href="http://lh3.github.io/2021/03/15/snp-vs-snv">SNP vs SNV</a> (2021-03-15)</h2>
<p>Ian Holmes has a <a href="https://twitter.com/ianholmes/status/1371523573861339141">twitter poll</a> right now on the use of “SNP”
(single-nucleotide polymorphism) versus “SNV” (single-nucleotide variant). I
have been bugged by the two terminologies for years, so I decided to write a
blog post on it. Personally, <strong>I use “SNP” for germline events and “SNV” for
somatic events</strong>, but I understand others think differently. Here are my
thoughts.</p>
<p>The <a href="https://en.wikipedia.org/wiki/Single-nucleotide_polymorphism">wiki page for SNP</a> defines a SNP as a nucleotide change “that is
present in a sufficiently large fraction of the population (e.g. 1% or more)”.
However, <strong>such a frequency-based definition is not actionable in practice</strong>.
Allele frequency varies a lot across populations. Due to genetic drift and
selection, an allele at 5% frequency in Africa may be absent from the rest of
the world. Is this a SNP or not? Furthermore, the observed allele frequency
fluctuates with sampling and the sample size. An allele at 2% frequency in the
1000 Genomes Project (1KG) may become 0.5% in gnomAD. Is this a SNP or not? If
it is impractical to set a frequency threshold, the definition of SNP shouldn’t
require a frequency threshold.</p>
<p><strong>Historically, we have been using “SNP” without a frequency threshold for
decades</strong>. If you search word “SNP” in the <a href="https://www.nature.com/articles/35057062">landmark paper on the Human Genome
Project</a> in 2001, you can find 45 instances. With data produced at
that time, we had little information on frequency but we called observed
substitutions SNPs anyway. Similarly, there are 28 instances of “SNP” in the
<a href="https://www.nature.com/articles/nature15393">final 1KG paper</a>, including one in the abstract. In 1KG, we have
observed many substitutions at &lt;1% frequency but we still called them SNPs.
In these papers, a SNP simply refers to a germline substitution.</p>
<p><strong>“SNV” is a much more recent terminology</strong>. I first saw “SNV” in the <a href="https://pubmed.ncbi.nlm.nih.gov/20130035/">SNVmix
paper</a> in the context of tumor mutation calling (I reviewed it).
That was 2010. According to <a href="https://pubmed.ncbi.nlm.nih.gov/?term=%28"single+nucleotide+variants"%5Btiab%5D+OR+"single+nucleotide+variant"%5Btiab%5D%29+SNV%5Btiab%5D&filter=dates.1000%2F1%2F1-2011&filter=dates.1000%2F1%2F1-1980">a PubMed search</a>, few papers were
published with “SNV” in the abstract before that and early uses of “SNV” mostly
focused on tumor data as well. This includes the popular <a href="https://pubmed.ncbi.nlm.nih.gov/22300766/">VarScan2
paper</a>. People coined “SNV” for somatic mutations because “SNP” had
been reserved for germline events. “SNV” may sound more general than “SNP”, but
concepts in genetics should not be taken literally. What matters more is
historical usage. There are similarly confusing terminologies, such as VNTR and CNV
vs CNA, which I will not explain in detail here.</p>
<p>It is already too late to regulate the use of SNP and SNV. In practice, just
beware that the definition of SNP and SNV may vary between researchers. When
in a conversation you are not sure what SNP/SNV refers to, ask for
clarification.</p>
<p><strong>Postscript:</strong> I personally avoid “SNV” in my work due to its inconsistent
uses in the past. When I want to describe a somatic event, I use “somatic SNV”
or “sSNV” in brief.</p>
Minigraph as a multi-assembly SV caller2021-01-11T00:00:00+00:00http://lh3.github.io/2021/01/11/minigraph-as-a-multi-assembly-sv-caller
<p>Honestly, I didn’t know what <a href="https://github.com/lh3/minigraph">minigraph</a> would be good for when I
was writing the code. When I was writing the <a href="https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02168-z">paper</a>, I pitched
minigraph as a fast caller for structural variations (SVs). However, except
for performance and convenience, minigraph is not that special. In fact, in the
paper, minigraph is not as good as read-based SV callers: because most
assemblies in the paper are not phased, it randomly misses one parental allele.</p>
<p>My exploration took a turn when one anonymous reviewer asked me to check the
<a href="https://en.wikipedia.org/wiki/Lipoprotein(a)">LPA gene</a>. It was not in the graph because the gene was <a href="http://lh3.github.io/2020/12/25/evaluating-assembly-quality-with-asmgene">collapsed or
missed</a> in all input assemblies. Fortunately, I had several phased
<a href="https://github.com/chhylp123/hifiasm">hifiasm</a> assemblies at hand. LPA is there and minigraph generates a
complex subgraph (figure below) far beyond the capability of <a href="https://en.wikipedia.org/wiki/Variant_Call_Format">VCF</a>. Then I
realized what minigraph is truly good for: complex SVs.</p>
<p>With the current SV calling pipelines, we typically map reads or an assembly
against a reference genome, call SVs and then merge pairwise SV calls into a
multi-sample call set. This sounds simple but doesn’t work well for complex events.
First, the position of an SV may be shifted by small variants. We have to
heuristically group nearby events. This is particularly problematic around
<a href="https://en.wikipedia.org/wiki/Variable_number_tandem_repeat">VNTRs</a>. Second, there are nested SVs: for example, an L1 insertion
inside a long segmental duplication. If we only look at reference coordinates,
we wouldn’t be able to easily represent duplications both with and without the L1.</p>
<p>The solution to these problems is multiple sequence alignment
(MSA), which minigraph approximates. MSA naturally alleviates imprecise
breakpoints because it effectively groups similar events first; MSA also fully
represents nested events because, unlike mapping against a reference genome,
it aligns inserted sequences absent from the reference. The following figure shows the subgraphs
around four genes. SVs like these will fail most existing SV callers and can’t
be represented in VCF.</p>
<p><img src="http://www.liheng.org/images/minigraph/examples.jpg" alt="" /></p>
<p>Are there many complex SVs? Not a lot by count. In the left plot
below, all the examples come from the blue and green areas on the “Partial-repeat” bar.
There are only a few hundred of them. However, these complex SVs often reside
in long segmental duplications and affect a much larger fraction of
genomes in comparison to transposon insertions (the “Partial-repeat” bar on the
right plot). Genes at these loci, a few hundred of them, are frequently related
to the immune system (e.g. many HLA/KIR genes) or under rapid evolution in the
primate or human lineage (e.g. AMY* and NBPF* genes). <a href="http://lh3.github.io/2020/12/25/evaluating-assembly-quality-with-asmgene">My last blog
post</a> mentioned that 10% of the genes that have multiple copies in CHM13 are
single-copy in GRCh38. These genes mostly come from the “Partial-repeat” bar,
too. With short reads, we can observe signals of transposon insertions and
copy number changes; with long reads, we can call VNTRs; but only with
multi-assembly callers like minigraph can we access the near-full
spectrum of SVs, with the exception of centromeric repeats.</p>
<p><img src="http://www.liheng.org/images/minigraph/plot.jpg" alt="" /></p>
<p>Minigraph is a fast and powerful multi-assembly SV caller. Although the calling
is graph based, you can ignore the graph structure and focus on SVs only. I
have just added a <a href="https://github.com/lh3/minigraph#callsv">new section in README</a> that explains how to use
minigraph to call SVs. It is worth noting that at complex loci, minigraph
subgraphs, including the examples above, are often suboptimal. Please read the
<a href="https://github.com/lh3/minigraph#limit">Limitations section</a> if you want to explore the minigraph approach.</p>
Evaluating collapsed misassembly with asmgene2020-12-25T00:00:00+00:00http://lh3.github.io/2020/12/25/evaluating-assembly-quality-with-asmgene
<h2 id="why">Why?</h2>
<p>It is usually easy to evaluate the contiguity of a de novo assembly – just
compute N50. It is much harder to evaluate the correctness. We typically
identify misassemblies by aligning contigs to a reference genome. However, it
is tricky to interpret the results. In the case of human, there are thousands of
structural variations (SVs) between the reference and the sample being
assembled. Alignment-based evaluation often mistakes these SVs for misassemblies.
For example, <a href="http://bioinf.spbau.ru/quast">QUAST</a> identifies >10,000 “misassemblies” in the <a href="https://github.com/nanopore-wgs-consortium/CHM13">T2T
assembly</a> when compared to GRCh38. We can’t reliably tell
misassemblies from SVs, which leads to an overestimated misassembly rate. A second problem
with reference-based alignment is that most alignment differences come from
complex regions such as centromeres and subtelomeres. It fails to evaluate the gene
regions we are mostly interested in; on the contrary, it penalizes an assembly
that represents these complex regions better.</p>
<h2 id="how">How?</h2>
<p>Most assembly problems are caused by repetitive or paralogous regions. When an
assembler cannot resolve such a region, it either creates an assembly gap or
forces through the region with a misassembly. To probe these issues, we can
align a multi-copy gene to the assembly and see if it remains multi-copy.</p>
<p>More precisely, we do the following. We align all annotated transcripts to a
reference genome and select the longest isoform among overlapping transcripts.
For each selected transcript, we record a hit if the transcript is mapped at
≥99% identity over ≥99% of the transcript length. A transcript
is considered to be single-copy (SC) if it has only one hit; otherwise it is
considered multi-copy (MC). We do the same for the assembly and then compute
the fraction of missing multi-copy genes as</p>
<blockquote>
<p><strong>MMC</strong> = 1 - |{MCinASM} ∩ {MCinREF}| / |{MCinREF}|</p>
</blockquote>
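<p>As a sketch of this set arithmetic (hedged: toy, made-up gene names for illustration; this is not the actual asmgene implementation):</p>

```python
def mmc(mc_in_asm, mc_in_ref):
    """%MMC = 1 - |{MCinASM} ∩ {MCinREF}| / |{MCinREF}|."""
    mc_in_asm, mc_in_ref = set(mc_in_asm), set(mc_in_ref)
    return 1.0 - len(mc_in_asm & mc_in_ref) / len(mc_in_ref)

# Toy example: of 4 multi-copy (MC) genes in the reference,
# 3 remain MC in the assembly, so %MMC = 25%.
ref_mc = ["geneA", "geneB", "geneC", "geneD"]
asm_mc = ["geneA", "geneB", "geneD", "geneE"]  # geneC collapsed in the assembly
print(f"{mmc(asm_mc, ref_mc):.1%}")  # -> 25.0%
```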
<p>In the ideal case of a perfect
assembly, %MMC should be zero. A higher fraction suggests a more
collapsed assembly. We can compute percent MMC (%MMC) with <code class="language-plaintext highlighter-rouge">paftools.js
asmgene</code> from <a href="https://github.com/lh3/minimap2">minimap2</a>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>minimap2 -cxsplice:hq -t8 ref.fa cdna.fa > ref.cdna.paf
minimap2 -cxsplice:hq -t8 asm.fa cdna.fa > asm.cdna.paf
paftools.js asmgene [-a] ref.cdna.paf asm.cdna.paf
</code></pre></div></div>
<p>The output looks like:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>H Metric ref.cdna asm.cdna
X full_sgl 36426 36389
X full_dup 0 18
X frag 0 4
X part50+ 0 5
X part10+ 0 0
X part10- 0 10
X dup_cnt 1334 1330
X dup_sum 4110 4080
</code></pre></div></div>
<p>On the line <code class="language-plaintext highlighter-rouge">X dup_cnt</code>, 1334 is the number of multi-copy genes in the
reference, of which 1330 remain multi-copy in the assembly. %MMC is thus
1-1330/1334=0.3%. Also in this output, 36426 is the number of single-copy genes
in the reference, of which 36389 remain single-copy in the assembly and 18 are
false duplications. We can similarly compute <a href="https://busco.ezlab.org/">BUSCO</a>-like metrics but
based on the reference.</p>
<h2 id="collapsed-misassemblies-in-long-read-assemblies">Collapsed misassemblies in long-read assemblies</h2>
<p>The following figure shows the level of collapsed genes in <a href="https://github.com/lh3/pubLRasm#chm13-homozygous-human">various CHM13
assemblies</a>, taking <a href="http://ftp.ensembl.org/pub/release-102/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz">Ensembl genes</a> as input and the T2T CHM13 as the reference:</p>
<p><img src="http://www.liheng.org/images/asmgene/CHM13.jpg" alt="" /></p>
<p>Because the reference and the assemblies all come from the same sample,
a perfect assembly should have no collapsed genes. Hifiasm and HiCanu
are close to that mark. However, some assemblers may miss up to 70% of
multi-copy genes. We had a closer look at the missing genes. As expected, they
either fall into assembly gaps or leave a misassembly in a long contig. If you
want to study a gene family, such assembly problems will ruin your day.</p>
<p>CHM13 is a homozygous cell line. The following figure shows the
level of collapsed genes in diploid <a href="https://github.com/lh3/pubLRasm#hg00733-heterozygous-human">HG00733 assemblies</a>, again
with CHM13 as the reference:</p>
<p><img src="http://www.liheng.org/images/asmgene/HG00733.jpg" alt="" /></p>
<p>In this figure, even GRCh38 misses 10% of the multi-copy genes in CHM13. This is
background noise caused by between-sample SVs. It is much lower than the level
of collapsed misassemblies from many assemblers, demonstrating the effectiveness
of this metric.</p>
<h2 id="conclusion">Conclusion</h2>
<p>Percent MMC is a new metric to measure the quality of an assembly. It takes
minutes to compute, is gene-focused, and is robust to structural variations in
comparison to evaluations based on assembly-to-reference alignment. The obvious
downside of %MMC is that it requires a high-quality reference genome and is not
applicable to new species, but this is not a concern during the development of
assemblers.</p>
Base quality scores are essential to short read variant calling2020-05-27T00:00:00+00:00http://lh3.github.io/2020/05/27/base-quality-scores-are-essential-to-short-read-variant-calling
<p>In <a href="http://lh3.github.io/2020/05/25/format-quality-binning-and-file-sizes">an earlier post</a> a few days ago, I said “discarding base quality
dramatically reduces variant calling accuracy”. I didn’t provide evidence. This
certainly doesn’t sound persuasive. In this post, I will show an experiment to
support my claim.</p>
<p>I downloaded high-coverage short reads for sample HG002 <a href="https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/HG002_NA24385_son/NIST_HiSeq_HG002_Homogeneity-10953946/HG002Run01-11419412/">from GIAB
ftp</a>, converted to unsorted FASTQ with <a href="http://www.htslib.org/doc/samtools-collate.html">samtools collate</a>, mapped
them to <a href="https://ftp-trace.ncbi.nlm.nih.gov/1000genomes/ftp/technical/reference/phase2_reference_assembly_sequence/">hs37d5</a> (for compatibility with GIAB) with bwa-mem, called variants with
GATK v4 and compared the calls to the <a href="https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/analysis/NIST_v4.1_SmallVariantDraftBenchmark_12182019/GRCh37/">GIAB truth v4.1</a>. I then
estimated the false negative rate (FNR=1-sensitivity) and false discovery rate
(FDR=1-precision) with <a href="https://github.com/RealTimeGenomics/rtg-tools">RTG’s vcfeval</a>. I optionally applied the hard
filters proposed in <a href="https://pubmed.ncbi.nlm.nih.gov/24974202/">my earlier paper</a>. For “no quality”, I set all base qualities
to Q25, which corresponds to the average empirical error rate of this dataset.</p>
<style> .extable td,th { padding: 4px; } </style>
<table border="1" class="extable">
<tr><th># qual bins</th><th>Filtered</th><th>SNP FNR</th><th>SNP FDR</th><th>INDEL FNR</th><th>INDEL FDR</th></tr>
<tr><td>Lossless </td><td>No </td><td style="text-align:right">0.58%</td><td style="text-align:right">0.46%</td><td style="text-align:right">0.63%</td><td style="text-align:right">0.25%</td></tr>
<tr><td>8 bins </td><td>No </td><td style="text-align:right">0.58%</td><td style="text-align:right">0.45%</td><td style="text-align:right">0.66%</td><td style="text-align:right">0.26%</td></tr>
<tr><td>2 bins </td><td>No </td><td style="text-align:right">0.59%</td><td style="text-align:right">0.95%</td><td style="text-align:right">0.55%</td><td style="text-align:right">0.34%</td></tr>
<tr><td>No quality </td><td>No </td><td style="text-align:right">0.60%</td><td style="text-align:right">6.38%</td><td style="text-align:right">0.64%</td><td style="text-align:right">0.44%</td></tr>
<tr><td>Lossless </td><td>Yes </td><td style="text-align:right">2.54%</td><td style="text-align:right">0.07%</td><td style="text-align:right">2.27%</td><td style="text-align:right">0.06%</td></tr>
<tr><td>8 bins </td><td>Yes </td><td style="text-align:right">2.52%</td><td style="text-align:right">0.07%</td><td style="text-align:right">2.30%</td><td style="text-align:right">0.06%</td></tr>
<tr><td>2 bins </td><td>Yes </td><td style="text-align:right">2.53%</td><td style="text-align:right">0.11%</td><td style="text-align:right">2.24%</td><td style="text-align:right">0.07%</td></tr>
<tr><td>No quality </td><td>Yes </td><td style="text-align:right">2.71%</td><td style="text-align:right">0.20%</td><td style="text-align:right">2.51%</td><td style="text-align:right">0.08%</td></tr>
<tr><td>HiFi; no qual</td><td>No </td><td style="text-align:right">0.80%</td><td style="text-align:right">0.10%</td><td style="text-align:right">1.46%</td><td style="text-align:right">1.29%</td></tr>
</table>
<p>Several comments:</p>
<ul>
<li>
<p>If we completely drop base quality, the SNP FDR becomes 10 times higher.
Most of the additional false calls are due to low ALT allele fractions. Hard
filtering can improve this metric but the resulting SNP FDR is still twice as
high. <strong>Base quality scores are essential to accurate variant calling. For
somatic mutation calling, short reads without base quality are virtually
useless.</strong></p>
</li>
<li>
<p>Using 2 quality bins (i.e. good/bad) gives a dramatic improvement over
no-quality, though the result is not as good as 8-binning.</p>
</li>
<li>
<p>The accuracy of variants called with 8 quality bins is indistinguishable from
the accuracy with the original quality. The file size of the sorted 8-binning
alignment in CRAM is less than a quarter of the size of the original input
in gzip’d FASTQ.</p>
</li>
<li>
<p>I guess using 4 quality bins may achieve the best balance between storage and
accuracy. The GATK team reached this conclusion years ago. I forgot the
exact binning scheme they used, so I am not including an experiment here.</p>
</li>
<li>
<p>The last line in the table evaluates <a href="https://github.com/lh3/dipcall">dipcall</a> variants called
from a HiFi trio-binning assembly. <a href="https://github.com/chhylp123/hifiasm">Hifiasm</a> is the only assembler
to date that can achieve this accuracy.</p>
</li>
</ul>
Format, quality binning and file size2020-05-25T00:00:00+00:00http://lh3.github.io/2020/05/25/format-quality-binning-and-file-sizes
<p>This short post evaluates the effect of format and quality binning on file
sizes. I am taking <a href="https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR2052362">SRR2052362</a> as an example. It gives 4.3-fold
coverage on the human genome. For 2-binning, I turned original quality 20 or
above to 30 and turned original quality below 20 to 10. For 8-binning, I took
the scheme from a <a href="https://www.illumina.com/Documents/products/whitepapers/whitepaper_datacompression.pdf">white paper</a> (PDF) published by Illumina.
Illumina has been using quality binning for more than seven years. In this
experiment, I only retained the original read names. To produce CRAM files, I
mapped the short reads to the GRCh38 primary assembly. The following table
shows the file sizes:</p>
<style> .extable td,th { padding: 4px; } </style>
<table border="1" class="extable">
<tr><th>Format </th><th># qual bins</th><th> Size (GB)</th><th> Change relative to SRA</th></tr>
<tr><td>Sorted CRAM </td><td>2 bins </td><td style="text-align:right">1.187</td><td style="text-align:right">-85%</td></tr>
<tr><td>Unsorted CRAM</td><td>2 bins </td><td style="text-align:right">1.279</td><td style="text-align:right">-84%</td></tr>
<tr><td>Unsorted CRAM</td><td>8 bins </td><td style="text-align:right">2.115</td><td style="text-align:right">-73%</td></tr>
<tr><td>Gzip'd FASTA </td><td>No quality </td><td style="text-align:right">4.172</td><td style="text-align:right">-47%</td></tr>
<tr><td>Unsorted CRAM</td><td>Lossless </td><td style="text-align:right">4.536</td><td style="text-align:right">-43%</td></tr>
<tr><td>Gzip'd FASTQ </td><td>2 bins </td><td style="text-align:right">4.784</td><td style="text-align:right">-40%</td></tr>
<tr><td>SRA </td><td>Lossless </td><td style="text-align:right">7.917</td><td style="text-align:right"> 0%</td></tr>
<tr><td>Gzip'd FASTQ </td><td>Lossless </td><td style="text-align:right">9.210</td><td style="text-align:right">+16%</td></tr>
</table>
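<p>The 2-binning rule described above (original quality ≥20 becomes Q30, below 20 becomes Q10) can be sketched as follows (a minimal illustration assuming Phred+33 encoding, not the exact script I ran):</p>

```python
def bin2(qual):
    """2-binning of a Phred+33 quality string: Q>=20 -> Q30, Q<20 -> Q10."""
    hi, lo = chr(30 + 33), chr(10 + 33)  # '?' (Q30) and '+' (Q10)
    return "".join(hi if ord(c) - 33 >= 20 else lo for c in qual)

# 'I' is Q40, '#' is Q2, '5' is Q20, '!' is Q0
print(bin2("I#5!"))  # -> ?+?+
```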
<p>It is clear that the CRAM format is the winner here and the advantage of CRAM
is more prominent given lower quality resolution. A key question is how much
quality binning affects variant calling. Brad Chapman <a href="https://bcbio.wordpress.com/2013/02/13/the-influence-of-reduced-resolution-quality-scores-on-alignment-and-variant-calling/">concluded</a>
8-binning had little effect on variant calling accuracy. With Crumble, James
Bonfield <a href="https://academic.oup.com/bioinformatics/article/35/2/337/5051198">could get</a> a little higher accuracy with lossy compression.
<a href="https://github.com/lh3/fermikit">FermiKit</a> effectively uses 2-binning and can achieve decent
results. I applied 2-binning to GATK many years ago and observed that 2-binning
barely reduced accuracy. The GATK team at Broad Institute also evaluated
2-binning and 4-binning. They found 4-binning was better than 2-binning and was
as good as original quality. The overall message is that we don’t need full
quality resolution to make accurate variant calls for germline samples.
The effect on tumor samples is more of an open question, though.</p>
<p>It is worth noting that completely discarding base quality dramatically reduces
variant calling accuracy. I have observed this both with FermiKit and with
GATK (I didn’t keep the results unfortunately). This is because low-quality
Illumina sequencing errors are correlated, in that if one low-quality base is
wrong, other low-quality bases tend to be wrong in the same way. Without base
quality, variant callers wouldn’t be able to identify such recurrent errors.</p>
Fast high-level programming languages2020-05-17T00:00:00+00:00http://lh3.github.io/2020/05/17/fast-high-level-programming-languages
<h3 id="background">Background</h3>
<p>Python and R are slow when they can’t rely on functionality or libraries backed
by C/C++. They are inefficient not only for certain algorithm development but
also for common tasks such as FASTQ parsing. Using these languages limits the
reach of biologists. Sometimes you may have a brilliant idea but can’t deliver
a fast implementation only because of the language in use. This can be
frustrating. I have always been searching for a <a href="https://en.wikipedia.org/wiki/High-level_programming_language">high-level language</a>
that is fast and easy to use by biologists. This blog post reports some of my
exploration. It is inconclusive but might still interest you.</p>
<h3 id="design">Design</h3>
<p>Here I am implementing two tasks, FASTQ parsing and interval overlap query, in
several languages including C, Python, Javascript, <a href="http://luajit.org/">LuaJIT</a>,
<a href="https://en.wikipedia.org/wiki/Julia_(programming_language)">Julia</a>, <a href="https://en.wikipedia.org/wiki/Nim_(programming_language)">Nim</a>, and <a href="https://en.wikipedia.org/wiki/Crystal_(programming_language)">Crystal</a>. I am comparing their
performance. I am proficient in C and know Python a little. I have used LuaJIT
and Javascript for a few years. I am equally new to Julia, Nim and Crystal. My
implementations in these languages may not be optimal. Please keep this
important note in mind when reading the results.</p>
<h3 id="results">Results</h3>
<p>The source code and the full table are available at my <a href="https://github.com/lh3/biofast">lh3/biofast</a>
github repo. You can also find the machine setup, the versions of the libraries in
use and some technical notes there. I will only show part of the results here.</p>
<h4 id="fastq-parsing">FASTQ parsing</h4>
<p>The following table shows the CPU time in seconds for parsing a gzip’d FASTQ
(t<sub>gzip</sub>) or a plain FASTQ (t<sub>plain</sub>). We only count the
number of sequences and compute the sum of their lengths. The implementations that seamlessly parse multi-line
FASTA/FASTQ all use an algorithm similar to my <a href="https://github.com/lh3/biofast/blob/master/lib/kseq.h">kseq.h</a> parser in C.</p>
<style> .extable td,th { padding: 4px; } </style>
<table border="1" class="extable">
<tr><th>Language</th><th>Ext. Library</th><th>t<sub>gzip</sub> (s)</th><th>t<sub>plain</sub> (s)</th><th>Comments</th></tr>
<tr><td>Rust </td><td>needletail</td><td style="text-align:right"> 9.3</td><td style="text-align:right"> 0.8</td><td>multi-line fasta/mostly 4-line fastq</td> </tr>
<tr><td>C </td><td> </td><td style="text-align:right"> 9.7</td><td style="text-align:right"> 1.4</td><td>multi-line fasta/fastq</td> </tr>
<tr><td>Crystal </td><td> </td><td style="text-align:right"> 9.7</td><td style="text-align:right"> 1.5</td><td>multi-line fasta/fastq</td> </tr>
<tr><td>Nim </td><td> </td><td style="text-align:right"> 10.5</td><td style="text-align:right"> 2.3</td><td>multi-line fasta/fastq</td> </tr>
<tr><td>Julia </td><td> </td><td style="text-align:right"> 11.2</td><td style="text-align:right"> 2.9</td><td>multi-line fasta/fastq</td> </tr>
<tr><td>Python </td><td>PyFastx </td><td style="text-align:right"> 15.8</td><td style="text-align:right"> 7.3</td><td>C binding</td> </tr>
<tr><td>Javascript</td><td> </td><td style="text-align:right"> 17.5</td><td style="text-align:right"> 9.4</td><td>multi-line fasta/fastq; k8 dialect</td> </tr>
<tr><td>Go </td><td> </td><td style="text-align:right"> 19.1</td><td style="text-align:right"> 2.8</td><td>4-line fastq only</td> </tr>
<tr><td>LuaJIT </td><td> </td><td style="text-align:right"> 28.6</td><td style="text-align:right"> 27.2</td><td>multi-line fasta/fastq</td> </tr>
<tr><td>PyPy </td><td> </td><td style="text-align:right"> 28.9</td><td style="text-align:right"> 14.6</td><td>multi-line fasta/fastq</td> </tr>
<tr><td>Python </td><td>BioPython</td><td style="text-align:right"> 37.9</td><td style="text-align:right"> 18.1</td><td>multi-line fastq; FastqGeneralIterator</td> </tr>
<tr><td>Python </td><td> </td><td style="text-align:right"> 42.7</td><td style="text-align:right"> 19.1</td><td>multi-line fasta/fastq</td> </tr>
<tr><td>Python </td><td>BioPython</td><td style="text-align:right">135.8</td><td style="text-align:right">107.1</td><td>multi-line fastq; SeqIO.parse</td> </tr>
</table>
<p>This benchmark stresses I/O and string processing. I replaced the low-level
I/O of several languages to achieve good performance. The code looks more like
C than a high-level language, but at least these languages give me the power
without resorting to C.</p>
<p>It is worth mentioning the default BioPython FASTQ parser is over 70 times
slower on plain FASTQ and over 10 times slower on gzip’d FASTQ. Running the C
implementation on a human 30X gzip’d FASTQ takes 20 minutes. The default
BioPython parser would take four and a half hours, comparable to bwa-mem2
multi-thread mapping. If you want to parse FASTQ but don’t need other
BioPython functionality, choose <a href="https://github.com/lmdu/pyfastx">PyFastx</a> or mappy.</p>
<h4 id="interval-overlap-query">Interval overlap query</h4>
<p>The following table shows the CPU time in seconds for computing the breadth of
coverage of one interval list against another. There are
two columns each for timing and memory footprint, depending on which list is loaded
into memory.</p>
<table border="1" class="extable">
<tr><th>Language</th><th>t<sub>g2r</sub> (s)</th><th>M<sub>g2r</sub> (Mb)</th><th>t<sub>r2g</sub> (s)</th><th>M<sub>r2g</sub> (Mb)</th></tr>
<tr><td>C </td><td style="text-align:right"> 5.2</td><td style="text-align:right"> 138.4</td><td style="text-align:right"> 10.7</td><td style="text-align:right"> 19.1</td></tr>
<tr><td>Crystal </td><td style="text-align:right"> 8.8</td><td style="text-align:right"> 319.6</td><td style="text-align:right"> 14.8</td><td style="text-align:right"> 40.1</td></tr>
<tr><td>Nim </td><td style="text-align:right"> 16.6</td><td style="text-align:right"> 248.4</td><td style="text-align:right"> 26.0</td><td style="text-align:right"> 34.1</td></tr>
<tr><td>Julia </td><td style="text-align:right"> 25.9</td><td style="text-align:right"> 428.1</td><td style="text-align:right"> 63.0</td><td style="text-align:right">257.0</td></tr>
<tr><td>Go </td><td style="text-align:right"> 34.0</td><td style="text-align:right"> 318.9</td><td style="text-align:right"> 21.8</td><td style="text-align:right"> 47.3</td></tr>
<tr><td>Javascript</td><td style="text-align:right"> 76.4</td><td style="text-align:right">2219.9</td><td style="text-align:right"> 80.0</td><td style="text-align:right">316.8</td></tr>
<tr><td>LuaJIT </td><td style="text-align:right">174.1</td><td style="text-align:right">2668.0</td><td style="text-align:right">217.6</td><td style="text-align:right">364.6</td></tr>
<tr><td>PyPy </td><td style="text-align:right">17332.9</td><td style="text-align:right">1594.3</td><td style="text-align:right">5481.2</td><td style="text-align:right">256.8</td></tr>
<tr><td>Python </td><td style="text-align:right">>33770.4</td><td style="text-align:right">2317.6</td><td style="text-align:right">>20722.0</td><td style="text-align:right">313.7</td></tr>
</table>
<p>The implementation of this algorithm is straightforward. It is mostly about
random access to large arrays. Javascript and LuaJIT are much slower here
because I can’t put objects in an array; I can only put references to objects
in an array.</p>
<h3 id="my-take-on-fast-high-level-languages">My take on fast high-level languages</h3>
<p>The following is subjective and may be controversial, but I need to say it.
Performance is not everything. Some subtle but important details are only
apparent to those who write these programs.</p>
<h4 id="javascript-and-luajit">Javascript and LuaJIT</h4>
<p>These are two similar languages. They are old and were not designed with
<a href="https://en.wikipedia.org/wiki/Just-in-time_compilation">Just-In-Time</a> (JIT) compilation in mind. People later developed JIT
compilers and made them much faster. I like the two languages. They are easy to
use, have few performance pitfalls and are pretty fast. Nonetheless, they are
not the right languages for bioinformatics. If they were, they would have
prevailed years ago.</p>
<h4 id="julia">Julia</h4>
<p>Among the three more modern languages Julia, Nim and Crystal, Julia reached 1.0
first. I think Julia could be a decent replacement for Matlab or R by the
language itself. If you like the experience of Matlab or R, you may like Julia.
It has builtin matrix support, 1-based coordinate system, friendly <a href="https://en.wikipedia.org/wiki/Read-eval-print_loop">REPL</a>
and an emphasis on plotting as well. I heard its differential equation solver might be
the best across all languages.</p>
<p>I don’t see Julia as a good replacement for Python. Julia has a long startup time.
When you use a large package like Bio.jl, Julia may take 30 seconds to compile
the code, longer than the actual running time of your scripts. You may not feel
it is fast in practice. Actually, in my benchmark, Julia is not really as fast
as the other languages, either. Probably my Julia implementations here will get
the most slaps; I have seen quite a few you-are-holding-the-phone-wrong type of
responses from Julia supporters. Also importantly, the Julia developers do not
value backward compatibility. There may be a Python-2-to-3-like transition in
several years if they still hold their views by then. I wouldn’t take the risk.</p>
<h4 id="nim">Nim</h4>
<p>Nim reached its maturity in September 2019. Its syntax is similar to Python on
the surface, which is a plus. It is relatively easy to get decent
performance out of Nim. I have probably spent the least time learning Nim, but I
can write programs in it faster than in Julia.</p>
<p>On the downside, writing Nim programs feels a little like writing Perl in that
I need to pay extra attention to reference vs value semantics. For the second task, my
initial implementation was several times slower than the Javascript one, which
was unexpected. Even in the current program, I still don’t understand why the
performance gets much worse if I change by-reference to by-value in one instance.
Nim supporters advised me to run a profiler. I am not sure biologists would
enjoy that.</p>
<h4 id="crystal">Crystal</h4>
<p>Crystal is a pleasant surprise. On the second benchmark, I got a fast
implementation on my first try. I did take a detour on FASTQ parsing when I
initially tried to use Crystal’s builtin buffered reader, but again I got
C-like performance immediately after I started to manage buffers by myself.</p>
<p>Crystal resembles Ruby a lot. It has very similar syntax, including a
class/<a href="https://en.wikipedia.org/wiki/Mixin">mixin</a> system familiar to modern programmers. Some elementary tutorials
on Ruby are even applicable to Crystal. I think building on top of successful
languages is the right way to design a new language. Julia on the other hand
feels different from most mainstream languages like C++ and Python. Some of its
key features haven’t stood the test of time and may become frequent sources of
bugs and performance traps.</p>
<p>To implement fast programs, we need to care about reference vs value. Crystal
is no different. The good thing about Crystal is that reference and value are
explicit with its class system. Among Julia, Nim and Crystal, I feel most
comfortable with Crystal.</p>
<p>Crystal is not without problems. First, it is hard to install Crystal without
root permission. I am providing portable installation packages at
<a href="https://github.com/lh3/PortableCrystal">lh3/PortableCrystal</a>, which alleviates the issue for now. Second, Crystal is
unstable. Each release introduces multiple breaking changes, so your code written
today may not work later. Nonetheless, my programs seem unaffected by the
breaking changes of the past two years. This has given me some confidence. The
Crystal devs also said 1.0 is coming “<a href="https://crystal-lang.org/2020/03/03/towards-crystal-1.0.html">in the near future</a>”. I will look
forward to that.</p>
<h3 id="conclusions">Conclusions</h3>
<p>A good high-level high-performance programming language would be a blessing to
the field of bioinformatics. It could extend the reach of biologists, shorten
the development time for experienced programmers and save the running time of
numerous Python scripts by many folds. However, no language is good enough in
my opinion. I will see how Crystal turns out. It has potential.</p>
<h3 id="anecdote">Anecdote</h3>
<p>Someone posted this blog post to <a href="https://news.ycombinator.com/item?id=23229657">Hacker News</a>, the <a href="https://www.reddit.com/r/crystal_programming/comments/gm2dps/crystal_in_bioinformatics_comparison_fast/">Crystal
subreddit</a>, and <a href="https://discourse.julialang.org/t/lhe-biofast-benchmark-fastq-parsing-julia-nim-crystal-python/39747">Julia discourse</a>. The reaction from many
Julia supporters was just as I expected. That said, I owe a debt of gratitude to
<a href="https://github.com/bicycle1885">Kenta Sato</a> for improving my Julia implementation. I genuinely
appreciate it.</p>
<p><strong>Update on 2020-05-19:</strong> Added contributed Go implementations. More accurate
timing for fast implementations, measured by <a href="https://github.com/sharkdp/hyperfine">hyperfine</a>.</p>
<p><strong>Update on 2020-05-20:</strong> Added a contributed Rust implementation. Added PyPy.</p>
<p><strong>Update on 2020-05-21:</strong> Faster Nim and Julia with <a href="http://man7.org/linux/man-pages/man3/memchr.3.html">memchr</a>. Faster
Julia by adjusting <a href="https://github.com/lh3/biofast/pull/7">three additional lines</a>. For gzip’d input,
Julia-1.4.1 is slow due to <a href="https://github.com/JuliaPackaging/Yggdrasil/pull/1051">a misconfiguration</a> on the Julia end. The
numbers shown in the table are acquired by forcing Julia to use the system
zlib on CentOS7. Added Python bedcov implementation. It is slow.</p>
<p><strong>Update on 2020-05-23:</strong> Added a faster contributed Rust implementation.</p>
auN: a new metric to measure assembly contiguity2020-04-08T00:00:00+00:00http://lh3.github.io/2020/04/08/a-new-metric-on-assembly-contiguity
<p>Given a de novo assembly, we often measure the “average” contig length by
N50. <a href="https://en.wikipedia.org/wiki/N50,_L50,_and_related_statistics">N50</a> is neither the real average nor median. It is the length of the
contig such that this and longer contigs cover at least 50% of the assembly. A
longer N50 indicates better contiguity. We can similarly define N<em>x</em> such that
contigs no shorter than N<em>x</em> cover <em>x</em>% of the assembly. The N<em>x</em> curve plots
N<em>x</em> as a function of <em>x</em>, where <em>x</em> ranges from 0 to 100.</p>
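<p>As a sketch of this definition in Python (the helper name and the toy contig lengths are mine): sort the contigs in decreasing order and accumulate until <em>x</em>% of the total length is covered:</p>

```python
def nx(lengths, x):
    """Return Nx: the length of the contig at which the cumulative length
    of this and all longer contigs first reaches x% of the assembly size."""
    total = sum(lengths)
    acc = 0
    for contig_len in sorted(lengths, reverse=True):
        acc += contig_len
        if acc * 100 >= total * x:
            return contig_len
    return 0

# Toy assembly with contigs of 8, 5 and 2 (total 15):
print(nx([8, 5, 2], 50))  # 8: the longest contig alone covers 8/15 > 50%
```

Note how dropping the tiny contig of length 2 would change the total and may shift Nx values, which is the discontinuity discussed below.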
<p>In my opinion, there are two problems with N50. First, N50 is not contiguous.
For a good human assembly, contigs of lengths around N50 can differ by several
megabases in length. Discarding tiny contigs may lead to a big jump in N50.
Relatedly, between two assemblies, a more contiguous assembly might happen to
have a smaller N50 just by chance. Second, N50 may not reflect some
improvements to the assembly. If we connect two contigs longer than N50 or
connect two contigs shorter than N50, N50 is not changed; N50 is only improved
if we connect a contig shorter than N50 and a contig longer than N50. If
we, assembler developers, solely target N50, we may be misled by it.</p>
<p>Here is an idea about how to overcome the two issues. N50 is a single point on
the N<em>x</em> curve. The entire N<em>x</em> curve in fact gives us a better sense of
contiguity. The following figure from an <a href="https://nbis.se/">NBIS</a> <a href="https://nbisweden.github.io/workshop-genome_assembly/index">workshop</a> shows a
good example:</p>
<p><img src="http://lh3.github.io/images/NGx_plot.png" alt="" /></p>
<p>Notably, the NG50 (similar to N50) of several assemblers/settings are about the
same around 300kb, but it is clear the black curve achieves better contiguity
– a single contig on that curve covers more than 40% of the assembly.
Intuitively, a better N<em>x</em> curve is “higher”, or has a larger area under the
curve. Then we can take the area under the curve, abbreviated as “auN”, as a
measurement of contiguity. The formula to calculate the area is:</p>
\[{\rm auN}=\sum_i L_i\cdot\frac{L_i}{\sum_j L_j}=\left.\sum_i L_i^2 \middle/ \sum_j L_j\right.\]
<p>where $L_i$ is the length of contig $i$. Although auN is inspired by the N<em>x</em>
curve, its calculation doesn’t actually require sorting contigs by their
lengths, which makes it easy to compute in practice. For several human assemblies
at hand, auN falls between N50 and N40, though this observation doesn’t hold
for other assemblies in general.</p>
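<p>A minimal sketch of the formula in Python (the helper name is mine): auN depends only on the sum of lengths and the sum of squared lengths, so no sorting is needed:</p>

```python
def aun(lengths):
    """auN = sum_i L_i^2 / sum_j L_j, i.e. the area under the Nx curve."""
    return sum(n * n for n in lengths) / sum(lengths)

# For contigs of 8, 5 and 2: (64 + 25 + 4) / 15 = 6.2
print(aun([8, 5, 2]))  # 6.2
```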
<p>auN doesn’t have the two problems with N50. It is more stable and less affected
by big jumps in contig lengths. It considers the entire N<em>x</em> curve. Connecting
two contigs of any lengths will always lead to a larger auN. If we want to
summarize contig contiguity with a single number, auN is a better choice than
N50. Similarly we can define auNG and auNGA. I don’t think auN will be widely
used given the inertia on N50, but it is anyway fun to ponder new metrics.</p>
<p><strong>Update</strong>: Gregory Concepcion pointed out that the <a href="https://www.ncbi.nlm.nih.gov/pubmed/22147368">GAGE benchmark</a> was
using the same metric to evaluate assemblies, though the authors were not
interpreting it as area under the N<em>x</em> curve. Ivan Sovic and Jens-Uwe Ulrich
have independently come up with auN as well.</p>
On a reference pan-genome model (Part II)2019-07-12T00:00:00+00:00http://lh3.github.io/2019/07/12/on-a-reference-pan-genome-model-part-ii
<p>I wrote a <a href="http://lh3.github.io/2019/07/08/on-a-reference-pan-genome-model">blog post</a> on a potential reference pan-genome model. I had
more thoughts in my mind. I didn’t write about them because they are
<em>immature</em>. Nonetheless, a few readers raised questions related to <strong>my
immature thoughts</strong>, so I decided to add this “Part II” as a response. Please
note that this and the previous blog posts <strong>only represent my own limited
view</strong>. A consortium will <a href="https://grants.nih.gov/grants/guide/rfa-files/rfa-hg-19-004.html">be formed</a> to build a better genome reference.
They may come up with a distinct but much better solution than what I am saying
here.</p>
<p>The previous post talked about one problem with the current reference genome:
it lacks diversity. Another, more practical issue is that the coordinate system
changes drastically with each major release. To avoid inconsistent coordinates,
many projects are still using the older GRCh37 even though GRCh38 has been
around for half a decade. At the same time, GRCh38 is <a href="https://www.ncbi.nlm.nih.gov/assembly?term=GRCh38&cmd=DetailsSearch">being actively
updated</a> with new “patches” arriving regularly. However, these patches
are not well integrated into GRCh38. Using them naively will lead to loss of
information. Almost no complete tool chains work with them; the few existing
patch-aware tools (e.g. <a href="https://github.com/lh3/bwa/blob/master/README-alt.md">BWA-MEM ALT mapping</a>) are only ad hoc hacks.
If we keep using patches, few will benefit from them; if we integrate
patches into the primary assembly, everyone will need to remap all the data
from time to time. A few readers of the previous blog post asked: how will a
graph model help? Here is my vision:</p>
<p><strong>In the short term (say 5 years)</strong>, we can start from the primary assembly of
GRCh38 and build a reference graph integrating large variations in existing
patches and other long-read assemblies. We have three possible strategies to
work with the new reference:</p>
<ol>
<li>
<p>Those who prefer the current practice can continue to map data to the
primary GRCh38, the backbone of the graph. The graph will tell us regions
susceptible to artifacts caused by large variations. These are like
<a href="https://sites.google.com/site/anshulkundaje/projects/blacklists">blacklisted regions</a> generated by ENCODE.</p>
</li>
<li>
<p>We can extract novel sequences in the graph and treat them as separate
contigs. These contigs are like decoy sequences and will attract false
mappings away. We will still use our existing tools (e.g. STAR and Bowtie2)
for most analyses. This strategy will supposedly give us cleaner results,
but it won’t take full advantage of the graph.</p>
</li>
<li>
<p>When we have capable graph mappers, we can map data to the graph. This will do
better than strategy 2. We will need to project graph mappings to
GRCh38. <a href="https://github.com/vgteam/vg">Vg</a> has surjection; I have a complementary idea, which I am not
detailing here.</p>
</li>
</ol>
<p>All the three strategies use the same coordinate system: GRCh38. The analysis
results won’t be the same, but will be close. We won’t need liftOver. I imagine
in a foreseeable future, a great majority of the community will use the linear
coordinate most of the time. The graph elements only come in at certain steps.</p>
<p><strong>When we update the reference genome</strong>, we will insert new sequence segments,
add deletion links or mark a segment to be “deleted” without hard removal from
the graph. These modifications are like the current GRC patches, but they won’t
mess up the existing coordinate system. This will allow us to integrate data
from different minor versions of the reference. We do need to watch out for batch
effects due to version changes. I can’t predict how much they matter in
comparison to batch effects from other sources. Excluding blacklisted regions
may alleviate this issue.</p>
<p><strong>You can already explore the graph world in my vision if you are brave
enough.</strong> Strategy 1 is what you are currently doing. I haven’t created the
blacklist, but I see it can be done by traversing the graph. For Strategy 2,
you can run the following command line to get the linearized reference:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git clone https://github.com/lh3/gfatools
cd gfatools && make
wget ftp://ftp.dfci.harvard.edu/pub/hli/minigraph/GRCh38-0.1-14.gfa.gz
./gfatools gfa2fa -s GRCh38-0.1-14.gfa.gz > GRCh38-0.1-14.fa
</code></pre></div></div>
<p>Here option <code class="language-plaintext highlighter-rouge">-s</code> asks for stable FASTA output. The output file
<code class="language-plaintext highlighter-rouge">GRCh38-0.1-14.fa</code> will include the entire GRCh38 primary assembly and an extra
26Mb of large SVs or diverged regions extracted from 13 other human assemblies.
You can build a STAR/Bowtie2 index and map reads normally. You will see reads
mapped to the additional contigs.</p>
<p>Strategy 3 is incomplete, but you can get a peek at how a graph
reference may affect your analysis:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git clone https://github.com/lh3/minigraph
cd minigraph && make
./minigraph -x se -t16 /path/to/GRCh38-0.1-14.gfa.gz se-reads.fa > out.gaf
</code></pre></div></div>
<p>For 150bp Illumina reads, you will see reads bridging GRCh38 and non-GRCh38
segments. There are not a lot of them, but most of them come from regions with
weird alignment. Longer reads will involve more non-GRCh38 paths in the
alignment.</p>
<p>The graph in this example was automatically built with minigraph; it just gives
you a feel for the approach. I envision that the construction of the actual reference graph is
likely to involve meticulous manual curation. A reference may not be complete,
but what is in there has to be extremely accurate.</p>
<p><strong>I don’t have a clear vision for the long term</strong>. The GRCh38 coordinate system
will have a longer life span in the graph world. We can keep using it for quite
some time. Ultimately, we will routinely sequence human genomes to a quality
higher than the primary GRCh38 assembly. We will need to rethink the concept of
“reference”. I would lose my job if I kept doing the same thing by then.</p>
<p>Anyway, the above is my vision. I am aware of and have thought over other
alternatives, but I think a conservative reference model is more likely to be
accepted and actually benefit the community. Finally, I reiterate that <strong>these
ideas are immature and only represent my own view</strong>. They are more like food
for thought. What will happen may be vastly different.</p>
On a reference pan-genome model2019-07-08T00:00:00+00:00http://lh3.github.io/2019/07/08/on-a-reference-pan-genome-model
<p>In the last weekend, I made <a href="https://github.com/lh3/gfatools">gfatools</a> and <a href="https://github.com/lh3/minigraph">minigraph</a>
open to the public. Both repos come with some documentation, but they haven’t
explained the background and motivation behind them. This blog post gives a more
complete picture.</p>
<p>The primary assembly of GRCh38, our current human reference genome, is largely
the concatenation of individual haplotype segments. It aims to model a single
human genome and lacks thousands of structural variations (SVs). These SVs are
causing a multitude of problems which have been documented in many papers. A
solution is to construct a pan-genome reference. The question is “how?”.</p>
<p>My answer to that is <a href="https://github.com/lh3/gfatools/blob/master/doc/rGFA.md">rGFA</a>. rGFA is a text format and more importantly a
data model. It introduces the concept of a <em>stable coordinate</em>, which is
persistent under sequence split and insertion operations. We can
incrementally “add” a new genome to a graph without breaking the old coordinate
system. At the same time, if we start with a linear reference genome, an
augmented graph naturally inherits the coordinate system from the linear
reference. We can have the benefits of both linear and graphical
representations.</p>
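<p>To make the stable-coordinate idea concrete, here is a toy rGFA fragment (segment names and sequences are made up for illustration). Each S-line carries SN (stable sequence name), SO (stable offset) and SR (rank) tags; segments s1 and s3 keep their chr1 coordinates, while s2 is an insertion taken from a non-reference contig at rank 1:</p>

```
S	s1	ACGGTTAACC	SN:Z:chr1	SO:i:0	SR:i:0
S	s2	TTGG	SN:Z:foo	SO:i:5	SR:i:1
S	s3	CCAATT	SN:Z:chr1	SO:i:10	SR:i:0
L	s1	+	s3	+	0M
L	s1	+	s2	+	0M
L	s2	+	s3	+	0M
```

Adding s2 does not disturb the chr1 offsets recorded on s1 and s3, which is why the graph can be augmented incrementally without breaking the old coordinate system.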
<p><a href="https://github.com/lh3/minigraph">Minigraph</a> proves the above is more than just an idea; it is
practically working at least to some extent (constructing a graph from 15 human assemblies in an hour). Along this line of work,
<a href="https://github.com/lh3/gfatools/blob/master/doc/rGFA.md#the-graph-alignment-format-gaf">GAF</a> is the first text format to describe sequence-to-graph mapping.
The sister repo <a href="https://github.com/lh3/gfatools">gfatools</a> implements a few utilities to work with
rGFA. These are all connected by design.</p>
<p>There is much more to be done: minigraph has <a href="https://github.com/lh3/minigraph#limit">many limitations</a>;
gfatools lacks important functionalities; GAF alone is inadequate and can’t play
the same role as SAM; the starting linear reference genome has a lot of room
for improvement given the advances in sequencing technologies. With the same
data model, there can also be alternative approaches to graph construction
(e.g. via VCF, compact de Bruijn graph, multiple-sequence alignment or all-pair
alignment). Minigraph is more of a <em>proof-of-concept</em> starting point. Community
efforts are the only way to build a pan-genome reference that is practical,
accurate, and comprehensive enough to represent genome diversity and ultimately
help us to understand genetics better.</p>
How much does development time matter?2019-05-18T00:00:00+00:00http://lh3.github.io/2019/05/18/how-much-does-developer-time-matter
<p>I often hear developers saying “I do XYZ because it saves my time”. “XYZ” could
be the selection of programming language, the use of 3rd-party libraries or
other choices in programming. When I hear this, my immediate reaction is
always: where does users’ time fit into the equation? Here is how I think about
it.</p>
<p>In <em>my</em> view, the value of a feature is <em>roughly</em> measured by</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>value(feature) = #benefitedUsers * avgUserTimeSaved - devTime
</code></pre></div></div>
<p>It is a balance between users’ time and developers’ time. The feature is more
valuable if it benefits more users and takes less time to implement. This
simple “equation” can be applied to a variety of use cases:</p>
<ul>
<li>
<p>In one extreme, I am the only benefited user for a one-off task. The user
time is usually much less than dev time. The approach taking the least dev
time is preferred.</p>
</li>
<li>
<p>In the other extreme, installation affects all users. For a tool with a large
user base, it is worth every bit of effort to simplify installation, even at
the cost of significant development time.</p>
</li>
<li>
<p>Most practical scenarios are something in between. It can be difficult to
measure how many users may be benefited from a new feature. Experience in
user interaction and familiarity with practical data analysis will play an
important role in making the right decision.</p>
</li>
<li>
<p>A software project can be improved in multiple ways. Rationally, the most
valuable feature should be prioritized.</p>
</li>
<li>
<p>Similarly, a developer or a team of developers may be working on multiple
projects, it is preferred to work on the most valuable feature first.</p>
</li>
</ul>
<p>Of course, in practice, the choice to implement a feature is more often
opinionated than rational. There are also other factors into play. For example,
if a developer gets paid to implement a feature (e.g. in case of LuaJIT), that
feature is likely to get a higher priority.</p>
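<p>As a worked sketch of the “equation” above (the function name and numbers are mine, purely illustrative; all quantities in hours):</p>

```python
def feature_value(n_benefited_users, avg_user_time_saved, dev_time):
    """value(feature) = #benefitedUsers * avgUserTimeSaved - devTime"""
    return n_benefited_users * avg_user_time_saved - dev_time

# One-off personal task: saving 0.2h is not worth 3h of development.
print(feature_value(1, 0.2, 3.0))      # -2.8
# Simplifying installation for a large user base easily pays off.
print(feature_value(10000, 0.25, 80))  # 2420.0
```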
<p>The above measures the value of a feature from the developer point of view.
From the user point of view, development time doesn’t matter. All users want
is a product that is more convenient for them. The next time you argue
development time is important, think about the potential users.</p>
On maintaining bioinformatics software2019-03-11T00:00:00+00:00http://lh3.github.io/2019/03/11/on-maintaining-bioinformatics-software
<p>In 2006, Ruiqiang Li, the now CEO of Novogene, said to me in Chinese: “you
can’t maintain the <a href="http://www.treefam.org/">TreeFam</a> database forever”. Considering TreeFam as
my most significant work at the time, I said “I think I can”. Some of you know
what happened next: one year later, I started to develop tools for next-gen
sequencing data and forgot about TreeFam almost completely. Few can constantly
and single-handedly maintain their own software or databases.</p>
<p>A database has to receive constant updates to stay relevant, but a software
package is much different. Along with TreeFam, I also developed a companion
tool, <a href="https://github.com/Ensembl/treebest">TreeBeST</a> (written in C++ by the way). It is <a href="https://useast.ensembl.org/info/genome/compara/homology_method.html">used by Ensembl
Compara</a> to build gene trees and still serves the community. A few
months ago when I needed to view and compare trees, I <a href="https://sourceforge.net/projects/treesoft/files/treebest/1.9.2/">downloaded</a> the
binary of TreeBeST compiled in 2007. It still works. A software package can
survive without the attention of its developer.</p>
<p>Currently I have <a href="https://github.com/lh3?tab=repositories">74 repos</a> at GitHub. Most of them are my own projects
in a state similar to TreeBeST. They still work for the specific tasks they
were designed for but they rarely receive code changes any more. This is how I
maintain my personal projects: I try <em>not</em> to maintain them. To achieve the
goal, 1) I strive to simplify the user interface to reduce users’ questions. 2)
I make my projects independent of each other and of external libraries to avoid
changes in dependencies failing my own projects. 3) If I perceive significant
changes to an existing code base, I more often duplicate the code and start
fresh (e.g. fermi vs fermi2 vs fermi-lite, and minimap vs minimap2). This way
I can forget about compatibility and freely cut <a href="https://en.wikipedia.org/wiki/Technical_debt">technical debts</a>
without affecting the stability of the previous versions. These strategies have
enabled me to switch projects from time to time without leaving unmanageable
messes behind.</p>
<p>Many fundamental tools in bioinformatics (e.g. BLAST, samtools/htslib and GATK)
have thrived only due to continuous efforts by a stable development team, but
in reality, many more tools are coded in the short term by individual developers
who lack sustainable resources. In the latter case, developing simple, clean
and self-consistent tools without need for maintenance is often the best way
towards maintenance, just as Richard Durbin, my postdoc advisor, <a href="https://www.nature.com/articles/nbt.2721">said in an
interview</a>: “a key thing is that software or a data format does a
clean job correctly, that it works … support isn’t a critical thing, in a
strange way. Rather, it’s the lack of need for support that’s important”.
Think about this when you develop your next coding project.</p>
SAM/BAM/samtools is 10 years old2018-12-21T00:00:00+00:00http://lh3.github.io/2018/12/21/sambamsamtools-is-10-years-old
<p>I wrote a commentary on the SAM/BAM format a while ago. I now publish it as a
blog post, 10 years after I released the <a href="https://sourceforge.net/projects/samtools/files/samtools/0.1.1/">first samtools</a> to SourceForge.
It gives an overview about what has happened in the past 10 years. Here “we”
refers to the samtools dev team and the HTS file format committee, as well as
those who have contributed to the specification or samtools/htslib. I am only
one of them.</p>
<hr />
<p>When the <a href="https://en.wikipedia.org/wiki/1000_Genomes_Project">1000 Genomes Project</a> was launched in early 2008, there were already many
short-read aligners and variant callers. Each of them had its own input or
output format for limited use cases. They did not talk to each other. We had to
implement various format converters to bridge tools, which was awkward and even
sometimes impossible as formats may encode different information. The
fragmented ecosystem hampered the collaboration between the participants of the
project and delayed the development of advanced data analysis algorithms.</p>
<p>In a conference call on October 21, 2008, the 1000 Genomes Project analysis
subgroup decided to take on the issue by unifying a variety of short-read
alignment formats to a <a href="https://en.wikipedia.org/wiki/SAM_(file_format)">Sequence Alignment/Map format</a> or SAM in short. Towards
the end of 2008, the subgroup announced the first SAM specification, detailing
a text-based SAM format and its binary representation, the BAM format. SAM/BAM
quickly replaced all the other short-read alignment formats and became the de
facto standard in the analysis of high-throughput sequence data. Nowadays,
SAM/BAM is not limited to a short-read format any more. It is able to represent
noisy read alignments of millions of bases in length (features 6 and 9 in Table
1), and also finds use as the primary format to store signal data for
IonTorrent and PacBio (feature 3).</p>
<p>One of the most influential decisions on evolving the SAM ecosystem is the
separation of the application programming interface (API) from the command-line
tools. The SAM/BAM format originally came with a reference implementation,
<a href="https://github.com/samtools/samtools">samtools</a>. While samtools provided primitive APIs to parse SAM/BAM files, it
mixed the APIs with applications and did not promise long-term stability, which
made it difficult to interface in other programs. To address this issue, we
created <a href="https://github.com/samtools/htslib">htslib</a> in 2014, a dedicated programming library in C that processes
common data formats used in high-throughput sequencing. This library implements
stable and robust (e.g. features 1 and 11) APIs that other programs can rely on.
It enables efficient access to SAM/BAM in other popular programming languages
such as Python and R and boosts the development of sequence analysis tools.</p>
<p>Htslib is not merely a separation; it also brought numerous improvements to
samtools and third-party programs depending on it. Htslib supports
multi-threading and uses faster compression libraries. In comparison to the
original samtools, it reduces the wall-clock processing time by several folds
on multiple cores, broadly matching the performance of <a href="http://lomereiter.github.io/sambamba/">sambamba</a>. Htslib can
directly access BAM files on remote HTTP/FTP servers or cloud storages such as
DropBox, Google Cloud and Amazon Web Services (feature 8). Users can extract
and visualize alignments in a small region without downloading the entire
dataset; the extracted region can be thousands of times smaller than the full dataset.
Recently, we have pushed this feature further by implementing the <a href="https://www.ncbi.nlm.nih.gov/pubmed/29931085">htsget</a>
protocol (feature 10). This protocol eliminates bulk data recoding and thus
takes tens of times less computing resource than remote BAM access. Htslib
seamlessly supports the <a href="https://en.wikipedia.org/wiki/CRAM_(file_format)">CRAMv3 format</a> (feature 7), a more compact binary
representation of SAM. On high-coverage data, CRAM is typically twice as small
as BAM containing identical information. Htslib, samtools and popular libraries
that include a copy of the htslib source code (e.g. <a href="https://github.com/pysam-developers/pysam">Pysam</a> and <a href="https://bioconductor.org/packages/release/bioc/html/Rsamtools.html">Rsamtools</a>) have
been downloaded over 3 million times in the past 10 years.</p>
<p>Although the SAM format has been revised multiple times in the past, nearly all
changes are backward compatible – tools supporting more recent versions of
the specification seamlessly work with older versions. Meanwhile, the SAM/BAM
format is also largely forward compatible. The very <a href="https://sourceforge.net/projects/samtools/files/samtools/0.1.1/">first version of samtools</a>
still parses the vast majority of short-read SAM/BAM produced today. This
long-term stability prevents the fragmentation of the community and is one of
the most critical features of the format. Many may feel SAM is the same old
format announced 10 years ago. However, without the continuous enhancements to
the format and the ecosystem, many routine tasks, such as structural variation
calling, long read alignment, cloud computing and online genome browsing, would
be made more difficult or even impossible.</p>
<p>The SAM format, along with BAM, CRAM and several other formats that htslib
supports, is now maintained by the <a href="https://www.ga4gh.org/work_stream/large-scale-genomics/">Large Scale Genomics work stream</a> of the <a href="https://www.ga4gh.org/">Global Alliance
for Genomics and Health</a> initiative. These formats and htslib constantly receive
bug fixes, enhancements and new features, and will continue to empower the
analysis of high-throughput sequence data in the coming years.</p>
<table>
<thead>
<tr>
<th style="text-align: right">Feature</th>
<th style="text-align: center">Year first available</th>
<th style="text-align: center">Target</th>
<th style="text-align: left">Feature descriptions</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: right">1</td>
<td style="text-align: center">2009</td>
<td style="text-align: center">BAM</td>
<td style="text-align: left">End-of-file marker to detect truncated files</td>
</tr>
<tr>
<td style="text-align: right">2</td>
<td style="text-align: center">2009</td>
<td style="text-align: center">SAM</td>
<td style="text-align: left">“=”/“X” CIGAR operators to encode sequence matches or mismatches</td>
</tr>
<tr>
<td style="text-align: right">3</td>
<td style="text-align: center">2011</td>
<td style="text-align: center">SAM</td>
<td style="text-align: left">“B” tag type to efficiently store binary arrays</td>
</tr>
<tr>
<td style="text-align: right">4</td>
<td style="text-align: center">2012</td>
<td style="text-align: center">BAM</td>
<td style="text-align: left">CSI index for chromosomes longer than 512Mbp</td>
</tr>
<tr>
<td style="text-align: right">5</td>
<td style="text-align: center">2012</td>
<td style="text-align: center">htslib</td>
<td style="text-align: left">BCFv2 as an efficient binary representation of VCF files</td>
</tr>
<tr>
<td style="text-align: right">6</td>
<td style="text-align: center">2013</td>
<td style="text-align: center">SAM</td>
<td style="text-align: left">Supplementary alignment to encode split alignment</td>
</tr>
<tr>
<td style="text-align: right">7</td>
<td style="text-align: center">2014</td>
<td style="text-align: center">htslib</td>
<td style="text-align: left">CRAMv3 as a more compact binary representation of SAM</td>
</tr>
<tr>
<td style="text-align: right">8</td>
<td style="text-align: center">2015</td>
<td style="text-align: center">htslib</td>
<td style="text-align: left">Direct access to remote files over internet</td>
</tr>
<tr>
<td style="text-align: right">9</td>
<td style="text-align: center">2017</td>
<td style="text-align: center">BAM</td>
<td style="text-align: left">BAM extension to support CIGARs with >65535 operators</td>
</tr>
<tr>
<td style="text-align: right">10</td>
<td style="text-align: center">2017</td>
<td style="text-align: center">htslib</td>
<td style="text-align: left">Htsget protocol for efficient database access</td>
</tr>
<tr>
<td style="text-align: right">11</td>
<td style="text-align: center">2017</td>
<td style="text-align: center">htslib</td>
<td style="text-align: left">CRC error checking to detect internally corrupted BAM files</td>
</tr>
</tbody>
</table>
On the definition of sequence identity2018-11-25T00:00:00+00:00http://lh3.github.io/2018/11/25/on-the-definition-of-sequence-identity
<p>Sequence identity is a way to measure the similarity between two sequences. For sequencing
data, it is often thought of as the opposite of the sequencing error rate. When we say
“the sequence divergence between two species is ABC” or “the sequencing error
rate is XYZ”, we assume everyone knows how to compute identity. In fact, there
is more than one way to compute identity. This blog post discusses a few
definitions and how <a href="https://github.com/lh3/minimap2">minimap2</a> implements them.</p>
<p>I will start with the following example: what is the identity between the two
sequences?</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CCAGTGTGGCCGATACCCCAGGTTGGCACGCATCGTTGCCTTGGTAAGC
CCAGTGTGGCCGATGCCCGTGCTACGCATCGTTGCCTTGGTAAGC
</code></pre></div></div>
<p>The classical way to find identity is to perform alignment first. If match=1,
mismatch=-2, gapOpen=-2 and gapExt=-1, we get the following alignment:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Ref+: 1 CCAGTGTGGCCGATaCCCcagGTtgGC-ACGCATCGTTGCCTTGGTAAGC 49
|||||||||||||| ||| || || ||||||||||||||||||||||
Qry+: 1 CCAGTGTGGCCGATgCCC---GT--GCtACGCATCGTTGCCTTGGTAAGC 45
</code></pre></div></div>
<p>Here we have 43 matching bases, 1 mismatch, 5 deleted bases and 1 inserted base
relative to the first/Ref sequence. The CIGAR is <code class="language-plaintext highlighter-rouge">18M3D2M2D2M1I22M</code>.</p>
<h3 id="gap-excluded-identity">Gap-excluded identity</h3>
<p>With this definition, we exclude all gapped columns from the alignment. The
identity equals “#matches / (#matches + #mismatches)”. In the example above, the
gap-excluded identity is 43/(43+1)=97.7%.</p>
<p>An obvious problem with this definition is that it doesn’t count gaps. However,
it is an often-used definition. We may hear that the chimpanzee and human
genome differ by a couple of percent. Here we are referring to such
gap-excluded identity. The exact sentence in the <a href="https://www.nature.com/articles/nature04072">first chimpanzee genome
paper</a> is “Single-nucleotide substitutions occur at a mean rate of
1.23% between copies of the human and chimpanzee genome”.</p>
<h3 id="blast-identity">BLAST identity</h3>
<p>BLAST identity is defined as the number of matching bases over the number of
alignment columns. In this example, there are 50 columns, so the identity is
43/50=86%. In a SAM file, the number of columns can be calculated by summing
over the lengths of M/I/D CIGAR operators. The number of matching bases equals
the column length minus the NM tag. Here is a Perl one-liner to calculate
BLAST identity:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>perl -ane 'if(/NM:i:(\d+)/){$n=$1;$l=0;$l+=$1 while/(\d+)[MID]/g;print(($l-$n)/$l,"\n")}'
</code></pre></div></div>
<p>where variable <code class="language-plaintext highlighter-rouge">$n</code> is the sum of mismatches and gaps and <code class="language-plaintext highlighter-rouge">$l</code> is the alignment
length. In the <a href="https://github.com/lh3/miniasm/blob/master/PAF.md">PAF</a> format, column 10 divided by column 11 gives the
BLAST identity.</p>
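<p>The same bookkeeping is easy to express outside a one-liner. Below is a Python version of my own, again assuming the SAM convention that NM equals mismatches plus gap bases:</p>

```python
import re

def blast_identity(cigar, nm):
    # alignment columns = total length of M/I/D operations;
    # matching bases = columns - NM (NM = mismatches + gap bases)
    cols = sum(int(l) for l, op in re.findall(r'(\d+)([MID])', cigar))
    return (cols - nm) / cols

# the example alignment: 43 matches over 50 columns
print(blast_identity('18M3D2M2D2M1I22M', 7))  # 0.86
```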
<p>BLAST identity is perhaps the most common definition, but it should be used
with caution when we filter alignments by identity. Suppose we are aligning
a 1000bp query sequence that has a ~300bp ALU insertion in the middle. The
alignment will have a BLAST identity around 70% and is likely to get filtered
out. In evolution, an ALU insertion is created by one event. It should not be
counted as 300 independent differences.</p>
<h3 id="gap-compressed-identity">Gap-compressed identity</h3>
<p>At least for filtering, a better definition of sequence identity is the
following: we count consecutive gaps as one difference. By compressing gaps in
the example above, we are effectively dealing with this alignment:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Ref+: CCAGTGTGGCCGATaCCCcGTtGCTACGCATC-TTGCCTTGGTAAGC
|||||||||||||| ||| || |||||||||| ||||||||||||||
Qry+: CCAGTGTGGCCGATgCCC-GT-GCTACGCATCgTTGCCTTGGTAAGC
</code></pre></div></div>
<p>The identity is 43/(50-2-1)=91.5%. I have been using this definition for
various tasks. The latest minimap2 at github outputs such identity at a new
<code class="language-plaintext highlighter-rouge">de:f</code> tag. There is a Perl one-liner for this as well:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>perl -ane 'if(/NM:i:(\d+)/){$n=$1;$m=$g=$o=0;$m+=$1 while/(\d+)M/g;$g+=$1,++$o while/(\d+)[ID]/g;print(1-($n-$g+$o)/($m+$o),"\n")}'
</code></pre></div></div>
<p>where <code class="language-plaintext highlighter-rouge">$m</code> is the sum of <code class="language-plaintext highlighter-rouge">M</code> operations, <code class="language-plaintext highlighter-rouge">$g</code> the sum of <code class="language-plaintext highlighter-rouge">I</code> and <code class="language-plaintext highlighter-rouge">D</code> operations
and <code class="language-plaintext highlighter-rouge">$o</code> the number of gap opens.</p>
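<p>The Perl one-liner's logic translates to the following Python sketch; the variables mirror the one-liner's <code class="language-plaintext highlighter-rouge">$m</code>, <code class="language-plaintext highlighter-rouge">$g</code> and <code class="language-plaintext highlighter-rouge">$o</code>, and the SAM meaning of NM is assumed:</p>

```python
import re

def gap_compressed_identity(cigar, nm):
    """Gap-compressed identity: each run of gaps counts as one difference.
    Assumes NM = #mismatches + #gap bases, as in SAM."""
    m = g = o = 0
    for l, op in re.findall(r'(\d+)([MID])', cigar):
        if op == 'M':
            m += int(l)       # total M length
        else:
            g += int(l)       # total gap length
            o += 1            # number of gap opens
    # mismatches = nm - g; differences = mismatches + gap opens
    return 1.0 - (nm - g + o) / (m + o)

print(gap_compressed_identity('18M3D2M2D2M1I22M', 7))  # 43/47 ≈ 0.915
```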
<h3 id="effect-of-scoring">Effect of scoring</h3>
<p>Scoring affects alignment and thus sequence identity. For the same pair of
sequences, if we change gapOpen to -4, we end up with a different alignment:
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Ref+: 1 CCAGTGTGGCCGATaCCCcaggtTGgcACGCATCGTTGCCTTGGTAAGC 49
|||||||||||||| ||| || ||||||||||||||||||||||
Qry+: 1 CCAGTGTGGCCGATgCCC----gTGctACGCATCGTTGCCTTGGTAAGC 45
</code></pre></div></div>
<p>The BLAST identity is 83.7% and the gap-compressed identity is 89.1%. <strong>Even if
we stick with one definition, the identity can be different if we change the
scoring.</strong></p>
<h3 id="concluding-remarks">Concluding remarks</h3>
<p>The estimate of sequence identity varies with definitions and alignment
scoring. When you see someone talking about “sequencing error rate” next time,
ask about the definition and scoring in use to make sure that is the error rate
you intend to compare. If you want to estimate error rate or identity on your
own, try the following command line:
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>minimap2 -c ref.fa query.fa \
| perl -ane 'if(/tp:A:P/&&/NM:i:(\d+)/){$n+=$1;$m+=$1 while/(\d+)M/g;$g+=$1,++$o while/(\d+)[ID]/g}END{print(($n-$g+$o)/($m+$o),"\n")}'
</code></pre></div></div>
<p>Remember that for fair comparisons, use the same scoring.</p>
<h2><a href="http://lh3.github.io/2018/11/12/seqtk-code-walkthrough">Seqtk: code walkthrough</a> (2018-11-12)</h2>
<p>On Twitter, <a href="http://zachcp.org/">Zach Charlop-Powers</a> asked me to give a code walkthrough
for <a href="https://github.com/lh3/seqtk">seqtk</a>. This post does that, to a certain extent.</p>
<p>Seqtk is a fairly simple project. It uses two single-header libraries for hash
table and FASTQ parsing, respectively. Its single <code class="language-plaintext highlighter-rouge">.c</code> file consists of mostly
independent components, one for each seqtk command. I will start with the two
single-header libraries.</p>
<h3 id="buffered-stream-reader">Buffered stream reader</h3>
<p>In standard C, the <a href="http://man7.org/linux/man-pages/man2/read.2.html">read()</a> system call is the most basic function to
read from a file. It is usually slow to read data byte by byte. That is why the C
library provides the <a href="http://man7.org/linux/man-pages/man3/fread.3.html">fread()</a> series. fread() reads large chunks of
data with read() into an internal buffer and returns smaller data blocks on
request. It may dramatically reduce the expensive read() system calls and is
mostly the preferred choice.</p>
<p>fread() is efficient. However, it only works with data streams coming from
<a href="https://en.wikipedia.org/wiki/File_descriptor">file descriptors</a>, not from a zlib file handler for example. More recent programming languages provide generic
buffered readers. Take Go’s <a href="https://golang.org/pkg/bufio/">Bufio</a> as an example. It demands a
read()-like function from the user code, and provides a buffered single-byte
reader and an efficient line reader in return. The buffered functionalities are
harder to implement on your own.</p>
<p>The buffered reader in <a href="https://github.com/lh3/seqtk/blob/v1.3/kseq.h">kseq.h</a> predates Go, but the basic idea is similar.
In this file, <a href="https://github.com/lh3/seqtk/blob/v1.3/kseq.h#L94-L144">ks_getuntil()</a> reads up to a delimiter such as a line
break. It moves data with memcpy() and uses a single loop to test for the delimiter. 10
years ago when “kseq.h” was first implemented, <a href="https://zlib.net/">zlib</a> didn’t support
buffered I/O. Line reading with zlib was very slow. “kseq.h” is critical to
the performance of FASTA/Q parsing.</p>
<h3 id="fastaq-parser">FASTA/Q parser</h3>
<p>The <a href="https://github.com/lh3/seqtk/blob/v1.3/kseq.h#L178-L219">parser</a> parses FASTA and FASTQ at the same time. It <a href="https://github.com/lh3/seqtk/blob/v1.3/kseq.h#L183">looks
for</a> ‘@’ or ‘>’ if it hasn’t been read, and then <a href="https://github.com/lh3/seqtk/blob/v1.3/kseq.h#L188">reads</a> name and
comment. To read sequence, the parser first <a href="https://github.com/lh3/seqtk/blob/v1.3/kseq.h#L194">reads the first character</a>
on a line. If the character is ‘+’ or indicates a FASTA/Q header, the parser
stops; if not, it <a href="https://github.com/lh3/seqtk/blob/v1.3/kseq.h#L197">reads the rest of line</a> into the sequence buffer.
If the parser stops at a FASTA/Q header, it returns the sequence as a FASTA
record and <a href="https://github.com/lh3/seqtk/blob/v1.3/kseq.h#L199">indicates</a> the header character has been read, such that the parser
need not look for it for the next record. If the parser stops at ‘+’, it
<a href="https://github.com/lh3/seqtk/blob/v1.3/kseq.h#L212">skips</a> the rest of line and starts to <a href="https://github.com/lh3/seqtk/blob/v1.3/kseq.h#L214">read quality strings</a> line
by line until the quality string is no shorter than the sequence. The parser
returns an error if it reaches the end of file before reading enough quality,
or the quality string turns out to be longer than sequence. Given a
malformed FASTA/Q file, the parser won’t lead to memory violations except
when there is not enough memory.</p>
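<p>The state machine described above can be sketched in a few dozen lines of Python. This is a simplified illustration of the same idea, not a translation of kseq.h; error handling (e.g. a truncated quality string) is omitted:</p>

```python
import io

def read_fastx(fp):
    """Minimal FASTA/Q parser mirroring the kseq.h state machine: find
    '@' or '>', read the name, read sequence lines until the next header
    or '+', then (for FASTQ) read quality until it is at least as long
    as the sequence. A sketch only."""
    last = None  # header line carried over from the previous record
    while True:
        if last is None:  # scan for the next header line
            for line in fp:
                if line[0] in '>@':
                    last = line
                    break
            else:
                return
        name, seq = last[1:].strip().split()[0], []
        last = None
        for line in fp:
            c = line[0]
            if c in '>@+':  # stop at '+' or the next record's header
                last = line
                break
            seq.append(line.strip())
        seq = ''.join(seq)
        if last is None or last[0] != '+':  # FASTA record
            yield name, seq, None
            continue
        last, qual = None, []
        for line in fp:  # accumulate quality until it covers the sequence
            qual.append(line.strip())
            if sum(map(len, qual)) >= len(seq):
                break
        yield name, seq, ''.join(qual)

fa = io.StringIO(">s1\nACGT\nAC\n@s2\nGGTT\n+\nIIII\n")
print(list(read_fastx(fa)))  # [('s1', 'ACGTAC', None), ('s2', 'GGTT', 'IIII')]
```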
<p>A basic tip on fast file parsing: read by line or by chunk, not by byte. Even
with a buffered reader, using fgetc() etc to read every byte is slow. In fact,
it is possible to make the FASTA/Q parser faster by reading chunks of fixed
size, but the current parser is fast enough for typical FASTA/Q.</p>
<h3 id="hash-table">Hash table</h3>
<p>File <a href="https://github.com/lh3/seqtk/blob/v1.3/khash.h">khash.h</a> implements an <a href="https://en.wikipedia.org/wiki/Open_addressing">open-addressing hash table</a>
with power-of-2 capacity and quadratic probing. It uses a 2-bit-per-bucket
<a href="https://github.com/lh3/seqtk/blob/v1.3/khash.h#L165">meta table</a> to indicate whether a bucket is used or deleted. The query
and insertion operations are fairly standard. There are no tricks. Rehashing in
khash is different from other libraries, but that is not an important aspect.</p>
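<p>To make the description concrete, here is a toy Python hash set in the same spirit. It is a deliberately simplified sketch of my own: linear rather than quadratic probing, no rehashing, and the empty/used/deleted flags kept in a plain list instead of a packed 2-bit array:</p>

```python
class ToyHash:
    """Toy open-addressing hash set in the spirit of khash.h."""
    EMPTY, USED, DELETED = 0, 1, 2

    def __init__(self, capacity=16):
        assert capacity & (capacity - 1) == 0  # power-of-2 capacity
        self.keys = [None] * capacity
        self.flags = [self.EMPTY] * capacity   # khash packs 2 bits/bucket
        self.mask = capacity - 1

    def _probe(self, key):
        # masking with (capacity - 1) replaces the modulo operation
        i = hash(key) & self.mask
        while self.flags[i] == self.USED and self.keys[i] != key:
            i = (i + 1) & self.mask  # linear probing for brevity
        return i

    def put(self, key):
        i = self._probe(key)
        self.keys[i], self.flags[i] = key, self.USED

    def get(self, key):
        i = self._probe(key)
        return self.flags[i] == self.USED and self.keys[i] == key

h = ToyHash()
h.put('ACGT')
print(h.get('ACGT'), h.get('GGGG'))  # True False
```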
<p>Both “khash.h” and “kseq.h” heavily depend on C macros. They look ugly.
Unfortunately, in C, that is the only way to achieve the performance of
type-specific code.</p>
<h3 id="seqtk">Seqtk</h3>
<p>The only performance-related trick I can think of in <a href="https://github.com/lh3/seqtk/blob/v1.3/seqtk.c">seqtk.c</a> is the
<a href="https://github.com/lh3/seqtk/blob/v1.3/seqtk.c#L117-L169">tables</a> to map nucleotides to integers. It is commonly used
elsewhere. Another convenience-related trick is <a href="http://man7.org/linux/man-pages/man3/isatty.3.html">isatty()</a>. This
function can <a href="https://github.com/lh3/seqtk/blob/v1.3/seqtk.c#L375">test</a> if there is an incoming stream from the
standard input. Gzip probably uses this function, too.</p>
<p>Seqtk.c also implements a simple 3-column <a href="https://github.com/lh3/seqtk/blob/v1.3/seqtk.c#L52">BED reader</a> and comes with
a <a href="https://en.wikipedia.org/wiki/Mersenne_Twister">Mersenne Twister</a> pseudorandom number generator (PRNG). That PRNG
is a <a href="http://www.pcg-random.org/other-rngs.html">mistake</a>, though seqtk doesn’t need a good PRNG anyway.</p>
<p>The rest of seqtk consists of mostly independent functions, each implementing a
seqtk command. I will briefly explain a couple of them. “trimfq” uses a modified
Mott algorithm (please search text “Mott algorithm” in <a href="https://www.codoncode.com/support/phred.doc.html">phred.doc</a>).
I think this is a much cleaner and more theoretically sound algorithm than most
ad hoc methods in various read trimmers. The “sample” command takes
advantage of <a href="https://en.wikipedia.org/wiki/Reservoir_sampling">reservoir sampling</a>. The core implementation only takes
<a href="https://github.com/lh3/seqtk/blob/v1.3/seqtk.c#L1073-L1074">two lines</a>. You can in fact sample a text file with an awk one liner:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cat file.txt | awk -v k=10 '{y=x++<k?x-1:int(rand()*x);if(y<k)a[y]=$0}END{for(z in a)print a[z]}'
</code></pre></div></div>
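<p>For readers less fluent in awk, the same reservoir trick looks like this in Python (a sketch with a fixed seed for reproducibility):</p>

```python
import random

def reservoir_sample(stream, k, seed=42):
    """Uniformly sample k items from a stream of unknown length,
    replacing a random slot with probability k/(i+1) once full."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)       # fill the reservoir first
        else:
            j = rng.randrange(i + 1)  # slot j is uniform over 0..i
            if j < k:
                sample[j] = item
    return sample

print(len(reservoir_sample(range(10000), 10)))  # 10
```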
<h3 id="concluding-remarks">Concluding remarks</h3>
<p>This is the first time I present a code walkthrough in a blog post. Not sure if
it is helpful or even qualified as a walkthrough…</p>
<h2><a href="http://lh3.github.io/2018/09/28/on-the-mpeg-g-alignment-format">On the MPEG-G alignment format</a> (2018-09-28)</h2>
<p>SAM is a text format that is typically used to store the alignment of
high-throughput sequence reads against a reference genome. BAM is the first
binary representation of SAM designed at the same time. BAM is smaller, faster
to process and has additional features like random access.</p>
<p>BAM is not optimal in terms of compression ratio. By reorganizing binary data
and using more advanced compression techniques, we can make alignments much
more compressible. There have been many attempts to replace BAM with a better
binary format, such as <a href="https://www.ncbi.nlm.nih.gov/pubmed/25357237">DeeZ</a>, <a href="https://www.ncbi.nlm.nih.gov/pubmed/22904078">Quip</a>, <a href="https://www.ncbi.nlm.nih.gov/core/assets/sra/files/csra-fileformat.ppsx">cSRA</a> (PPT),
<a href="https://www.ncbi.nlm.nih.gov/pubmed/29046896">GenComp</a>, <a href="https://www.ncbi.nlm.nih.gov/pubmed/27540265">cSAM</a>, <a href="https://www.ncbi.nlm.nih.gov/pubmed/24260313">Goby</a> and <a href="https://www.ncbi.nlm.nih.gov/pubmed/23533605">samcomp</a> (see
<a href="https://www.mdpi.com/2078-2489/7/4/56">Hosseini et al (2016)</a> for a more thorough review). The team
maintaining the SAM spec finally adopted <a href="https://www.ncbi.nlm.nih.gov/pubmed/21245279">CRAM</a> as the future of
alignment format. CRAM is much smaller than BAM and has a similar feature set.
With the new codec implemented in <a href="https://www.ncbi.nlm.nih.gov/pubmed/24930138">scramble</a>, it is as fast as BAM in
routine data processing.</p>
<p><a href="https://mpeg-g.org">MPEG-G</a> is a new binary format that aims to replace BAM. <a href="https://www.biorxiv.org/content/early/2018/09/27/426353">Its
preprint</a> claims “10x improvement over the BAM format” in the
abstract. However, in the only compression ratio comparison, Figure 3, MPEG-G
is only 6.54x as small, not 10x. In addition, Figure 3 suggests sequences and
qualities are of different sizes in SAM (green vs orange). This could happen
(some reads don’t have qualities), but is very rarely the case in real-world
BAMs. I am also surprised by Figure 3a, where MPEG-G can compress qualities
much more than sequences (green vs orange). On real data produced today,
qualities are harder to compress because they don’t follow a clear pattern. I
suspect the authors are employing lossy compression, possibly with one of the
algorithms developed by <a href="https://github.com/voges">a contributor</a> to MPEG-G. Furthermore, the
usability of a format is more than just compression ratio. Encoding/decoding
has to be performant. The preprint shows no evaluation. James Bonfield, the
developer behind the latest CRAM, <a href="https://datageekdom.blogspot.com/2018/09/mpeg-g-bad.html">has similar concerns</a> with their
previous results.</p>
<p>Much of the above is my speculation. I could be wrong. And it is easy to prove
me wrong: make the data and software available and let the world reproduce
Figure 3. Unfortunately, although the <a href="https://mpeg.chiariglione.org/standards/mpeg-g/genomic-information-representation/study-isoiec-cd-23092-2-coding-genomic">MPEG-G specification</a> is
available, the implementation and the <a href="https://github.com/voges/mpeg-g-gidb">benchmark data</a> are not. This
leads to my following point:</p>
<p>MPEG-G is an open standard endorsed by <a href="https://en.wikipedia.org/wiki/International_Organization_for_Standardization">ISO</a>. However, open doesn’t mean
free. Remember the <a href="https://en.wikipedia.org/wiki/Royalty_payment">royalties</a> imposed by <a href="https://en.wikipedia.org/wiki/H.264/MPEG-4_AVC#Licensing">H.264/MPEG-4 AVC</a>?
MPEG-G may be going down the same route. Key contributors <a href="https://datageekdom.blogspot.com/2018/09/mpeg-g-ugly.html">are applying for
patents</a> and may have financial interest in the format. Before the MPEG-G
authors 1) open source the reference implementation and 2) make the format
<a href="https://en.wikipedia.org/wiki/Royalty-free">royalty-free</a> like <a href="https://en.wikipedia.org/wiki/AV1">AV1</a>, I recommend everyone to use BAM or CRAM.</p>
<p><strong><em>Disclaimer</em></strong>: I was the key contributor to BAM, the format that CRAM and
MPEG-G aim to replace, and I am still a contributor to the SAM/BAM spec and its
reference implementation. I have no competing financial interests in
SAM/BAM/CRAM or its reference implementation htslib.</p>
<h2><a href="http://lh3.github.io/2018/04/02/minimap2-and-the-future-of-bwa">Minimap2 and the future of BWA</a> (2018-04-02)</h2>
<p>My minimap2 paper has been accepted for publication in Bioinformatics. You can
find the latest LaTeX source <a href="http://www.overleaf.com/read/ddwtrgmngxms">at OverLeaf</a> or in the <a href="https://github.com/lh3/minimap2/tree/master/tex">tex
directory of minimap2</a>. I am intentionally delaying the publication
process for personal reasons. It will take a while for you to see the published
version at Bioinformatics. I thought to write this blog post when the paper
comes out, but there have been a few discussions on minimap2 recently, so I
decide to write it now.</p>
<h3 id="why-minimap2">Why minimap2?</h3>
<p>I wrote a <a href="http://lh3.github.io/2014/12/10/bwa-mem-for-long-error-prone-reads">blog post</a> on long-read alignment with bwa-mem several
years ago. In short, bwa-mem was not designed for long reads initially. It
works, but not well. When I was developing <a href="https://github.com/lh3/minimap">minimap</a> for read
overlapping, I realized approximate mapping could achieve comparable accuracy
to bwa-mem at a much faster speed. I didn’t expand minimap into a full-fledged
aligner because (1) I knew base-level alignment was going to be very slow and
(2) bwa-mem still worked fine. However, both reasons became invalid in the
following years.</p>
<p>In early 2017, Nick Loman et al invented a <a href="http://lab.loman.net/2017/03/09/ultrareads-for-nanopore/">new protocol</a> to
sequence nanopore reads of 100kb in length. Bwa-mem failed miserably on such
ultra-long reads – it was not “fine” at all. In addition, not long after I
published minimap, Suzuki and Kasahara released <a href="https://github.com/ocxtal/minialign">minialign</a>. It
implements a banded base-level alignment algorithm that is practical for
long-read alignment and much faster than the alternatives. These events finally
motivated me to develop minimap2.</p>
<h3 id="the-status-of-minimap2">The status of minimap2</h3>
<p>For long reads, <a href="https://github.com/lh3/minimap2">minimap2</a> is a much better mapper than <a href="https://github.com/lh3/bwa">bwa-mem</a> in almost every
aspect: it is >50X faster, more accurate, gives better alignment at long
gaps and works with ultra-long reads that fail bwa-mem. Minimap2 also goes
beyond a typical long-read mapper. It can achieve good full-genome alignment
(see the minimap2 paper, section 3.4) and is used by <a href="http://cab.spbu.ru/software/quast-lg/">QUAST-LG</a>.
Minimap2 can also align high-quality cDNAs and noisy long RNA-seq reads
(section 3.2). PacBio has <a href="https://github.com/PacificBiosciences/IsoSeq_SA3nUP/wiki/%5BBeta%5D-ToFU2:-running-and-installing-ToFU2">started to consider</a> minimap2 in their
Iso-seq pipeline. The feature set and the code base of minimap2 are also fairly
stable. <strong>I see little reason to use bwa-mem for long reads in future</strong>.</p>
<p>The story on short-read alignment is a little complex, though. I did plan to
replace bwa-mem with minimap2 on short-read alignment, too. In the minimap2
paper, I showed that minimap2 is 3X as fast as bwa-mem and achieves comparable
accuracy to bwa-mem on short variant calling (section 3.3). In the final round
of the review, a reviewer still argued that minimap2 wouldn’t work well for
short reads. I didn’t think so at the time given that Illumina Inc. has
independently evaluated minimap2 and observed that minimap2 is highly
competitive. Therefore, I didn’t follow the suggestion of that reviewer.</p>
<p>However, <a href="https://blog.dnanexus.com/author/acarroll/">Andrew Carroll</a> at DNAnexus recently showed me that
minimap2 was slower than bwa-mem on two NovaSeq runs in his hands. Part of the
reason, I guess, is that the two NovaSeq runs have a little higher error rate,
which triggers expensive heuristics in minimap2 more frequently. Furthermore, I
also realize that bwa-mem will be better than minimap2 at Hi-C alignment
because bwa-mem is more sensitive to short matches. In the end, I admit
<strong>minimap2 is not ready to replace bwa-mem all around</strong>. I owe that reviewer an
apology.</p>
<p>Generally, I still think minimap2 is a competitive short-read mapper and I will
use it often in my research projects. However, given that the performance of
minimap2 is not as consistent as bwa-mem for short reads of varying quality,
bwa-mem is still better for production uses, at least before I find a way to
improve minimap2.</p>
<h3 id="the-future-of-bwa">The future of bwa</h3>
<p><strong>Bwa will stay</strong>.
I am thinking to bring some minimap2 features to bwa-mem, such as fast
alignment extension and global alignment. This will make code cleaner and fix a
long-existing bug in bwa-mem: a tiny fraction of base-level alignment is
suboptimal. Nonetheless, implementing these features will not speed up bwa-mem
much because base-level alignment is not the computation bottleneck for short
reads. I am also likely to remove the bwa-sw algorithm and issue a deprecation
warning when the “pacbio” or the “ont2d” presets are used. In the meantime,
several talented developers at Intel Inc. are restructuring bwa-mem for
considerable performance boost at no loss of accuracy. I will work with them.
If this effort works out as hoped, the end product will become bwa-mem2. All
these won’t happen soon, unfortunately.</p>
<h2><a href="http://lh3.github.io/2018/03/27/the-history-the-cigar-x-operator-and-the-md-tag">The history of the MD tag and the CIGAR X operator</a> (2018-03-27)</h2>
<p>In the SAM format, the “X” and “=” CIGAR operators were not part of the
original spec. Nonetheless, they were among the first several features added
after the initial release of the spec. I was resistant to this feature for
several reasons. First, CIGAR describes alignment, but sequence matches and
mismatches are not indispensable properties of alignment. Second, for this
reason, most older alignment formats did not distinguish sequence matches and
mismatches, either. When we convert from other formats to SAM, it is
non-trivial to generate “X” and “=”. Third, an “X” does not tell us the
mismatching base. Its application is limited in practice. In the end, I still
added “X” and “=” to the spec in response to the request of several important
users. However, I have to say I regret the decision.</p>
<p>There was also a fourth reason: before X/=, there was the “MD” tag, which
encodes mismatching bases in addition to positions. “MD” was in the original
spec. The motivation was to reconstruct the reference subsequence in the
alignment. I learned the idea from a variant of the Eland format used
internally at Illumina. However, because at that time Eland didn’t do gapped
alignment, the Illumina version of MD was unable to encode gaps, so I
introduced “^”, representing deleted sequences from the reference.</p>
<p>Something unexpected happened down the road, though. Without “^”, we could
represent adjacent mismatches simply with two letters like “AC”. With “^”,
there was an ambiguity like “^AC” - is it 2bp deletion, or 1bp deletion
followed by a mismatch? To resolve this issue, we changed MD to require a zero
before each mismatch like “^A0C”. It was an oversight.</p>
<p>There is a bigger problem with “MD”: it is too complicated to use. We have to
keep track of MD, CIGAR and query string at the same time to generate the
reference string. I thought to use it a few times, but was stopped by the
complexity. I have never used this tag until very recently, with a lot of
efforts.</p>
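<p>To illustrate the complexity, here is a Python sketch of that reconstruction, written by me for this post. It walks the query sequence, CIGAR and MD in sync; it handles only M/I/D/S operations and assumes a well-formed MD string:</p>

```python
import re

def reconstruct_ref(seq, cigar, md):
    """Rebuild the reference sequence from query SEQ, CIGAR and MD.
    Illustrates why MD is painful: three strings must be walked in sync.
    A sketch, not production code."""
    aligned, q = [], 0
    for l, op in re.findall(r'(\d+)([MIDSH])', cigar):
        l = int(l)
        if op == 'M':                 # bases aligned to the reference
            aligned.append(seq[q:q + l]); q += l
        elif op in 'IS':              # query-only bases, skipped by MD
            q += l
        # 'D': reference-only bases; MD itself supplies them below
    aligned = ''.join(aligned)
    ref, p = [], 0
    for mlen, sub in re.findall(r'(\d+)(\^[A-Z]+|[A-Z])?', md):
        ref.append(aligned[p:p + int(mlen)]); p += int(mlen)
        if sub.startswith('^'):       # deletion: take bases from MD
            ref.append(sub[1:])
        elif sub:                     # mismatch: MD gives the ref base
            ref.append(sub); p += 1
    return ''.join(ref)

print(reconstruct_ref('ACGT', '2M2D2M', '2^TT2'))  # ACTTGT
print(reconstruct_ref('ACGT', '4M', '2A1'))        # ACAT
```

Note that the zero in a string like “2^A0C1” is what keeps the deleted bases and a following mismatch unambiguous, as discussed above.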
<p>This is why in minimap2, I came up with a new custom tag “<a href="https://github.com/lh3/minimap2#cs">cs</a>”. It encodes
CIGAR and both query and target sequence differences, such that we can parse
all information from one string. It greatly simplifies code. “cs” is also
critical to the PAF format that doesn’t store query sequences. I firmly
believe “cs” is “MD” done right.</p>
<p>In reality, though, something better is not necessarily more popular. “cs” came
too late. I even don’t know if it will become a standard tag in SAM. Minimap2
will keep using “cs” anyway as PAF is an important part of minimap2.</p>
<h2><a href="http://lh3.github.io/2017/11/15/on-assembly-de-bruijn-graphs">Immature thoughts on assembly De Bruijn graphs</a> (2017-11-15)</h2>
<p>By <a href="https://en.wikipedia.org/wiki/De_Bruijn_graph">mathematical definition</a>, a <em>k</em>-order (or <em>k</em>-dimensional) De Bruijn
graph, or ${\rm DBG}(k)$ in brief, over the DNA alphabet uses <em>k</em>-mers as vertices. It
has $4^k$ vertices and $4^{k+1}$ edges. DBG(k) has two interesting properties.
First, DBG(k) is the <a href="https://en.wikipedia.org/wiki/Line_graph">line graph</a> of DBG(k-1). Intuitively, this
means an edge in DBG(k-1) uniquely corresponds to a vertex in DBG(k) and that
the edge adjacency of DBG(k-1) is precisely modeled by vertex adjacency of
DBG(k). Second, DBG(k) is both <a href="https://en.wikipedia.org/wiki/Eulerian_path">Eulerian</a> and <a href="https://en.wikipedia.org/wiki/Hamiltonian_path">Hamiltonian</a>.
An Eulerian path in DBG(k-1) corresponds to a Hamiltonian path in DBG(k).</p>
<p>Given a set of sequences <em>S</em>, let $S(k)$ be the set of k-mers present in
sequences in <em>S</em>. We can <a href="https://en.wikipedia.org/wiki/Induced_subgraph">vertex-induce</a> a subgraph from DBG(k) by
keeping vertices in S(k) together with edges connecting vertices in S(k). We
denote this graph by DBGv(k|S). DBGv(k|S) can be regarded as an overlap graph
consisting of k-mers as vertices with (k-1)-mer overlaps.</p>
<p>Alternatively, we can edge-induce a subgraph by keeping edges in $S(k+1)$
together with <a href="https://en.wikipedia.org/wiki/Incidence_(graph)">incident</a> vertices. We denote this graph by DBGe(k|S).
DBGe(k|S) cannot be considered an overlap graph because there may be no
edges between two k-mers even if they have a (k-1)-mer overlap. As a result,
DBGe(k|S) is a subgraph of DBGv(k|S).</p>
<p>DBGv(k|S) is the line graph of DBGe(k-1|S). This property has an important
implication in implementation. One common way to store a DBG is to keep a
collection of <em>k</em>-mers. We traverse the graph by shifting a <em>k</em>-mer and probing
its presence/absence in the collection. Such an algorithm actually implements
both DBGv(k|S) and DBGe(k-1|S) at the same time.</p>
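<p>A small Python sketch of this dual view, traversing a k-mer set by shifting (my own illustration, not from any assembler):</p>

```python
def kmer_set(seqs, k):
    """Collect S(k), the set of k-mers present in the input sequences."""
    return {s[i:i + k] for s in seqs for i in range(len(s) - k + 1)}

def successors(kmers, v):
    """Shift the k-mer v by one base and probe the set. The result can
    be read as the out-neighbors of vertex v in DBGv(k|S), or, taking
    each stored k-mer as an edge between its (k-1)-mer prefix and
    suffix, as the edges of DBGe(k-1|S)."""
    return [v[1:] + b for b in 'ACGT' if v[1:] + b in kmers]

K = kmer_set(['ACGTT', 'ACGTA'], 4)
print(sorted(successors(K, 'ACGT')))  # ['CGTA', 'CGTT']
```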
<p>In summary, the “De Bruijn graph” in “De Bruijn graph based assembler” is not
the De Bruijn graph by mathematical definition. Assembly De Bruijn graphs are
subgraphs. There are two different ways to induce such subgraphs, but in
implementation, they often behave the same. In DBG, are sequences on vertices
or on edges? The correct answer is: depending on how you look at the graph.</p>
<h2><a href="http://lh3.github.io/2017/11/13/which-human-reference-genome-to-use">Which human reference genome to use?</a> (2017-11-13)</h2>
<p>TL;DR: If you map reads to GRCh37 or hg19, use <code class="language-plaintext highlighter-rouge">hs37-1kg</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/human_g1k_v37.fasta.gz
</code></pre></div></div>
<p>If you map to GRCh37 and believe decoy sequences help with better variant calling, use <code class="language-plaintext highlighter-rouge">hs37d5</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz
</code></pre></div></div>
<p>If you map reads to GRCh38 or hg38, use the following:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz
</code></pre></div></div>
<p>There are several other versions of GRCh37/GRCh38. What’s wrong with them? Here
are a collection of potential issues:</p>
<ol>
<li>
<p>Inclusion of ALT contigs. ALT contigs are large variations with very long
flanking sequences nearly identical to the primary human assembly. Most read
mappers will give mapping quality zero to reads mapped in the flanking
sequences. This will reduce the sensitivity of variant calling and many
other analyses. You can resolve this issue with an ALT-aware mapper, but
no mainstream variant callers or other tools can take the advantage of
ALT-aware mapping.</p>
</li>
<li>
<p>Padding ALT contigs with long “N”s. This has the same problem as 1 and
also increases the size of genome unnecessarily. It is worse.</p>
</li>
<li>
<p>Inclusion of multi-placed sequences. In both GRCh37 and GRCh38, the
pseudo-autosomal regions (PARs) of chrX are also placed on to chrY. If you
use a reference genome that contains both copies, you will not be able to
call any variants in PARs with a standard pipeline. In GRCh38, some
alpha satellites are placed multiple times, too. The right solution is to
hard mask PARs on chrY and those extra copies of alpha repeats.</p>
</li>
<li>
<p>Not using the <a href="http://en.wikipedia.org/wiki/Cambridge_Reference_Sequence">rCRS</a> mitochondrial sequence. rCRS is widely used in
population genetics. However, the official GRCh37 comes with a mitochondrial
sequence 2bp longer than rCRS. If you want to analyze mitochondrial
phylogeny, this 2bp insertion will cause troubles. GRCh38 uses rCRS.</p>
</li>
<li>
<p>Converting semi-ambiguous <a href="http://biocorp.ca/IUB.php">IUB codes</a> to “N”. This is a very minor issue,
though. Human chromosomal sequences contain few semi-ambiguous bases.</p>
</li>
<li>
<p>Using accession numbers instead of chromosome names. Do you know
<a href="https://www.ncbi.nlm.nih.gov/nuccore/568336023">CM000663.2</a> corresponds to chr1 in GRCh38?</p>
</li>
<li>
<p>Not including unplaced and unlocalized contigs. This will force reads
originating from these contigs to be mapped to the chromosomal assembly and
lead to false variant calls.</p>
</li>
</ol>
<p>Now we can explain what is wrong with other versions of human reference genomes:</p>
<ul>
<li>hg19/chromFa.tar.gz <a href="http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/">from UCSC</a>: 1, 3, 4 and 5.</li>
<li>hg38/hg38.fa.gz <a href="http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/">from UCSC</a>: 1, 3 and 5.</li>
<li>GCA_000001405.15_GRCh38_genomic.fna.gz <a href="http://www.ncbi.nlm.nih.gov/projects/genome/guide/human/">from NCBI</a>: 1, 3, 5 and 6.</li>
<li>Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz <a href="http://ftp.ensembl.org/pub/current_fasta/homo_sapiens/dna/">from EnsEMBL</a>: 3.</li>
<li>Homo_sapiens.GRCh38.dna.toplevel.fa.gz <a href="http://ftp.ensembl.org/pub/current_fasta/homo_sapiens/dna/">from EnsEMBL</a>: 1, 2 and 3.</li>
</ul>
<p>Using an inappropriate human reference genome is usually not a big deal unless
you study regions affected by the issues. However, 1) other researchers may be
studying these biologically interesting regions and will need to redo
alignment; 2) aggregating data mapped to different versions of the genome will
amplify the problems. It is still preferable to choose the right genome version
if possible.</p>
<p>Well, welcome to bioinformatics!</p>
<h2><a href="http://lh3.github.io/2017/07/24/on-nonvaseq-base-quality">On NovaSeq Base Quality</a> (2017-07-24)</h2>
<h3 id="introduction">Introduction</h3>
<p>Illumina Inc. released <a href="https://www.illumina.com/systems/sequencing-platforms/novaseq.html">NovaSeq</a> earlier this year and provided sample
data <a href="https://basespace.illumina.com/datacentral">at BaseSpace</a> several months later. Different from the HiSeq
series, NovaSeq uses <a href="https://sequencing.qcfail.com/articles/illumina-2-colour-chemistry-can-overcall-high-confidence-g-bases/">2-color chemistry</a>. It has been observed that
the NextSeq series, which also uses 2-color chemistry, <a href="http://seqanswers.com/forums/showthread.php?t=40741">produced data of worse
quality</a>. One naturally wonders if NovaSeq has a similar problem.
This post might give you some hints.</p>
<h3 id="data-and-methods">Data and methods</h3>
<p>I am looking at three whole-genome Illumina runs for human sample NA12878,
with data produced on HiSeq 2500, HiSeq X Ten and NovaSeq, respectively.
The HiSeq X Ten and NovaSeq runs used a PCR-free protocol. Their data are
currently available at BaseSpace. The HiSeq 2500 run did not specify whether
PCR was applied, but based on the low PCR duplicate rate after alignment, I
believe it used a PCR-free protocol as well. The HiSeq 2500 data is also
available, but the link to the data is hidden.</p>
<p>I am focusing on the <em>empirical base quality</em> (emQ) rather than raw base
quality in the original FASTQ file. Raw base quality can be deceiving if it is
not well calibrated, whereas emQ ideally reflects the true base error rate. To
estimate emQ, we traverse the pileup of high-coverage data, exclude obvious
variant sites and count the rest of differences from the reference as
sequencing errors. In implementation, I regarded a site as a potential variant
if at least 35% of high-quality bases at the site are different from the reference.
This treatment is not perfect, but it is easy to implement and often adequate unless we
care about base quality well over Q40 (see also Results).</p>
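<p>The estimation procedure above can be sketched in a few lines. This is a minimal illustration assuming a simplified pileup of (reference base, read bases) tuples, not the actual implementation behind this post:</p>

```python
import math

def emq(n_errors, n_bases):
    """Phred-scaled empirical base quality from an error count."""
    p = max(n_errors, 1) / n_bases  # avoid log(0) when no error is observed
    return -10 * math.log10(p)

def count_errors(pileup, max_nonref_frac=0.35):
    """Walk a pileup of (ref_base, read_bases) tuples; skip sites where
    the non-reference fraction suggests a true variant, and count the
    remaining mismatches as sequencing errors."""
    errs = tot = 0
    for ref, bases in pileup:
        nonref = sum(b != ref for b in bases)
        if bases and nonref / len(bases) >= max_nonref_frac:
            continue  # potential variant site, excluded
        errs += nonref
        tot += len(bases)
    return errs, tot

# Toy pileup: the second site looks like a variant and is skipped.
pileup = [("A", list("AAAAAAAAAC")), ("A", list("CCA"))]
print(count_errors(pileup))  # (1, 10)
```

With real data the pileup would come from a BAM traversal, and the error/total counts would be stratified by cycle, read and base to produce the per-cycle curves in the figure.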
<h3 id="results">Results</h3>
<p>The NovaSeq FASTQs only consist of four possible quality values: 2, 12, 23 and 37.
The overall emQ for HiSeq 2500, HiSeq X Ten and NovaSeq is Q27, Q23 and
Q24, respectively. As is <a href="http://lh3.github.io/2014/11/03/on-hiseq-x10-base-quality">observed before</a>, HiSeq X Ten has a higher
empirical error rate than older HiSeq machines. NovaSeq is slightly better than
HiSeq X Ten but not as good as HiSeq 2500 or the Platinum Genomes data.</p>
<p>The solid black in panel A-C in the following figure (open in a new tab if the
labels are too small) shows emQ at each cycle:</p>
<p><img src="http://lh3lh3.users.sourceforge.net/images/novaseq-qual.png" alt="" /></p>
<p>It is clear that emQ drops with cycles and drops even more on the second read.
The black line is largely determined by the fraction of low-quality bases and
affects the fraction of mappable reads. However, for variant calling,
we usually ignore low-quality bases because they tend to be affected by
systematic errors. In panel A-C, the four solid color lines indicate
empirical base quality for high-quality A/C/G/T bases. Recall that all NovaSeq
Q30 bases have quality 37. The four solid lines in (A) suggest that these Q37
bases are about right overall but become overestimated, in particular towards
the end of read2.</p>
<p>When calling variants from high-coverage samples, we almost always ignore
singleton errors (errors that occur only once at a site). We care more about
how often two or more high-quality errors occur at the same site. In panel A-C,
the four dashed color lines show the emQ if we ignore singleton errors. Both
HiSeq X Ten and NovaSeq look good. Their dashed lines are probably hitting the
limit of this analysis. They would probably go higher if we excluded NA12878
variants in a more sophisticated way. I speculate that the lower coverage of
HiSeq 2500 lowers its dashed lines. As HiSeq 2500 is not my focus here, I have
not dug into the details.</p>
<p>Finally, panel (D) shows the frequency of erroneous base changes. Ideally, we
would like to see four horizontal lines at y=33%: when there is an error, the
error is randomly chosen from the three other types of bases. This is far from
the truth in reality. At the head of read1, few A base errors are true C bases
in the sample (blue line), but at the tail of read2, the trend is reversed.
While this observation is certainly not ideal, it is not that bad as long as
we rarely see two high-quality errors at the same site (dashed lines in panel
A).</p>
<h3 id="discussions-and-conclusions">Discussions and conclusions</h3>
<p>The public NovaSeq data from BaseSpace is broadly comparable to HiSeq X Ten
data in terms of empirical base quality. Like HiSeq X Ten, NovaSeq also
overestimates base quality, but personally I do not see this as a big issue.
For high-coverage data, what is more important is the rate of systematic errors
and other mapping artifacts. Moderate inaccuracy in base quality rarely
matters except in artifactual benchmarks.</p>
<p>It should be noted that data quality varies, sometimes greatly, between runs
and across sequencing facilities. As I have only analyzed one run from each
machine model, this analysis may not generalize to data produced elsewhere.
It is recommended to redo the analysis on your own data.</p>
Bioconda: a capable bio-software package manager2015-12-07T00:00:00+00:00http://lh3.github.io/2015/12/07/bioconda-the-best-package-manager-so-far
<h3 id="getting-started">Getting Started</h3>
<p>Firstly, a few basic concepts. <a href="http://conda.pydata.org/docs/">Conda</a> is a portable package manager
primarily for Python and precompiled binaries. Miniconda is the base system of
conda. It includes a standard python and a few required dependencies such as
readline and sqlite. In conda, a <em>channel</em> contains a set of software
typically managed by the same group. <a href="https://bioconda.github.io">Bioconda</a> is a channel of conda
focusing on bioinformatics software. The following shows how to install and
use conda.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Download the miniconda installation script for Python2
wget https://repo.continuum.io/miniconda/Miniconda-latest-Linux-x86_64.sh
# Install conda; it will ask a few questions, including the installation path
sh Miniconda-latest-Linux-x86_64.sh
. ~/.bashrc # or relogin to get PATH updated
# A few examples
conda info
conda search -c bioconda bwa
conda install -c bioconda bwa
conda list
</code></pre></div></div>
<p>In this example, option <code class="language-plaintext highlighter-rouge">-c bioconda</code> specifies that the package comes from
the bioconda channel; otherwise the default channel is used.</p>
<h3 id="why-conda">Why conda?</h3>
<p>Firstly, conda ships precompiled binaries, not source code. In my experience,
shipping source code has not worked well on “managed” Linux clusters.
Compiling source code is error prone, space wasting and time consuming.</p>
<p>Secondly, conda assumes the users do not have the root permission. Of course
not all software can be installed without the root permission, but most
end-user applications should not require the privilege.</p>
<p>Thirdly, conda is self-contained. It puts all files in one root directory. It
does not taint other system paths. Conda installs its own dependencies. Unless
you want to build new packages, you do not need compilers or an existing
python installation.</p>
<p>Fourthly, conda ships portable binaries. It uses <a href="https://en.wikipedia.org/wiki/Rpath">rpath</a> to make sure
non-default dynamic libraries are loaded from a fixed relative path, not from
the system paths. Rpath solves one of the portability difficulties. The other
is libc, which can be solved by compiling on an old Linux system.</p>
<h3 id="what-does-conda-lack">What does conda lack?</h3>
<p>I could be wrong – it seems to me that conda does not provide a fully
automated system to build on old systems. The package maintainers need to find
a machine/VM/docker image with an old Linux system and have reasonable skills
to create portable packages. I think a missing step is to allow package
maintainers to (either manually or automatically) build tools on a CentOS5 AWS
instance.</p>
<p>Another missing link is to allow maintainers to bypass the conda build
process. For example, if I can build a portable package in my own way, it
would be good to let me add a package without having to go through <code class="language-plaintext highlighter-rouge">conda
build</code>. Conda could provide a script to check the structure of a package and
test it on a CentOS5 machine.</p>
<p>A third problem is documentation. Conda is in fact simple, but its
documentation is complex. It is too verbose for a beginner like me. I need to
read through a lot of pages to understand the basics of conda. As a minor
complaint, I do not like the documentation generator conda is using. I prefer
the entire documentation or a sufficiently long section to be contained in a
single web page such that I can go back and forth easily. The conda
documentation has many separate pages, making navigation quite difficult.
Ok, maybe it is just me.</p>
<p>Due to these problems, I suspect it <em>might</em> not be easy for every tool
developer to contribute to conda. Bioconda is currently maintained by several
experienced developers, but we need more to push it further. Automation is the
key, in my opinion.</p>
<p>(PS: another concern is that conda is a commercial product of <a href="https://www.continuum.io">Continuum
Analytics</a>. What if the company fails to make profit or decides to
discontinue conda? I know conda is an open-source project, but not every
open-source project can grow healthily on its own.)</p>
<h3 id="my-thoughts">My thoughts</h3>
<p>I have tried a few package management systems such as <a href="https://github.com/Homebrew/linuxbrew">Linuxbrew</a> and
<a href="https://wiki.gentoo.org/wiki/Project:Prefix">Gentoo Prefix</a>, and checked <a href="http://www.gnu.org/software/guix/">Guix</a>. Conda is the closest to the
system in my mind and in several ways better. It is promising. I really hope
it can be a success.</p>
<h3 id="appendix-my-summary-of-lomans-survey">Appendix: my summary of Loman’s survey</h3>
<p>Nick Loman conducted a <a href="http://figshare.com/articles/Bioinformatics_infrastructure_and_training_summary/1572287">survey</a>, where part of the questions are about
the difficulties in data analysis and running software. The answers are in
free text. I have read through all of them and classified them into several
categories. In the end, I <a href="https://gist.github.com/lh3/f49eb49168ce8b841958">collected</a> 233 non-duplicate replies that have
answers to the related questions. The leading difficulty is installation
problems (132/233; category 3), followed by insufficient computing resources
(88/233; 7 and 8), lack of interoperability (72/233; category 5) and bad
documentations (69/233; category 6). Not surprisingly, software installation
is less an issue to skilled researchers (27/62 for skill level 8 or above),
but 44% is still a large percentage and the fraction of junior
bioinformaticians is probably larger in the community that has not
participated in the survey.</p>
<p>The software installation problem is real.</p>
A reimplementation of symmetric DUST2015-10-05T00:00:00+00:00http://lh3.github.io/2015/10/05/an-reimplementation-of-symmetric-dust
<p>I have just <a href="https://github.com/lh3/minimap/blob/master/sdust.c">reimplemented</a> the <a href="http://www.ncbi.nlm.nih.gov/pubmed/16796549">symmetric DUST algorithm</a>
(SDUST) for masking low-complexity regions. The program depends on <a href="https://github.com/lh3/minimap/blob/master/kdq.h">kdq.h</a>
(double-ended queue) and <a href="https://github.com/lh3/minimap/blob/master/kvec.h">kvec.h</a> (simple vector); the command line
interface further requires <a href="https://github.com/lh3/minimap/blob/master/kseq.h">kseq.h</a> for FASTA/Q parsing. As I have tried
on human chr11, the output is identical to the output by <a href="http://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/lxr/source/src/app/dustmasker/">NCBI’s
dustmasker</a> except at assembly gaps. The speed is four times as fast.
I have also compared this implementation to <a href="ftp://occams.dfci.harvard.edu/pub/bio/tgi/software/seqclean/">mdust</a>, which is supposed
to be a reimplementation of the original asymmetric DUST. The mdust result
under the same score threshold seems to differ significantly from
SDUST/dustmasker. I haven’t looked into the cause.</p>
<p>I understand the basis of the SDUST algorithm, which is quite elegant, but I
haven’t fully understood all the implementation details. I was just literally
translating the pseudocode in the paper to C, with occasional reference to the
<a href="http://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/lxr/source/src/algo/dustmask/symdust.cpp">dustmasker source code</a>. If you have any problems, please let me know.</p>
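<p>For readers unfamiliar with DUST, the core of the score is simple triplet counting. Below is a hedged Python sketch of the per-window score only (the sum of c*(c-1)/2 over 3-mer counts, normalized by window length); it omits the deque-based window updates and perfect-interval bookkeeping that make the symmetric algorithm efficient and order-independent:</p>

```python
from collections import Counter

def dust_score(seq):
    """DUST-style low-complexity score of one window: sum of c*(c-1)/2
    over 3-mer counts, divided by (#triplets - 1). Higher means lower
    complexity."""
    triplets = [seq[i:i + 3] for i in range(len(seq) - 2)]
    counts = Counter(triplets)
    numer = sum(c * (c - 1) // 2 for c in counts.values())
    return numer / max(len(triplets) - 1, 1)

print(dust_score("ACGTACGTACGT"))  # perfect repeat: high score
print(dust_score("AGCTTGCAATCG"))  # all 3-mers distinct: 0.0
```

A masking tool would slide this window along the sequence and report intervals whose score exceeds a threshold.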
A few comments on GraphMap2015-07-30T00:00:00+00:00http://lh3.github.io/2015/07/30/a-few-comments-on-graphmap
<p><a href="https://github.com/isovic/graphmap">GraphMap</a> is a new long-read mapper initially tuned for error-prone ONT
reads. There are quite a few interesting points methodologically. The following
two comments are mostly about technical and practical aspects. Before you read
the comments, please bear in mind that I am the developer of BWA-MEM. I could
be biased.</p>
<h3 id="comment-1-consensus-quality">Comment 1: consensus quality</h3>
<p>The most striking point I found in the <a href="http://biorxiv.org/content/early/2015/06/10/020719">preprint</a> (and earlier on their
website) is that GraphMap significantly outperforms <a href="http://last.cbrc.jp">LAST</a> on consensus
calling for lambda phage, a tiny genome for which mapping (finding the
approximate position of a read) should hardly be a problem. I do not believe
GraphMap is better on mapping. Then there are two remaining differences: 1)
GraphMap does semi-global alignment while LAST does local alignment; 2)
GraphMap uses an edit-distance based scoring system match=1, mismatch=-1,
gapOpen=0 and gapExt=-1, while LAST uses gapOpen=-1. Factor 1) may matter more
if the authors are keeping all the fragmented local hits. If the authors are
only looking at the best hit, then 2) should be the more important factor. In
the latter case, we should be able to improve LAST consensus by using the same
scoring system. It would be good if the authors try to understand why LAST
is not doing as well.</p>
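<p>To make the difference between the two scoring systems concrete, here is a small global-alignment sketch with Gotoh-style affine gaps. The GraphMap parameters come from the text above; reusing the same scheme with gapOpen=-1 as a stand-in for LAST is my simplification, not LAST's actual parameterization:</p>

```python
NEG = float("-inf")

def align_score(a, b, match=1, mismatch=-1, gap_open=0, gap_ext=-1):
    """Global alignment score with affine gaps (Gotoh). With gap_open=0
    this reduces to the edit-distance-like scoring described above."""
    n, m = len(a), len(b)
    H = [[NEG] * (m + 1) for _ in range(n + 1)]  # best score ending at (i,j)
    E = [[NEG] * (m + 1) for _ in range(n + 1)]  # gap in a (consumes b)
    F = [[NEG] * (m + 1) for _ in range(n + 1)]  # gap in b (consumes a)
    H[0][0] = 0
    for j in range(1, m + 1):
        E[0][j] = H[0][j] = gap_open + gap_ext * j
    for i in range(1, n + 1):
        F[i][0] = H[i][0] = gap_open + gap_ext * i
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            E[i][j] = max(H[i][j-1] + gap_open + gap_ext, E[i][j-1] + gap_ext)
            F[i][j] = max(H[i-1][j] + gap_open + gap_ext, F[i-1][j] + gap_ext)
            s = match if a[i-1] == b[j-1] else mismatch
            H[i][j] = max(H[i-1][j-1] + s, E[i][j], F[i][j])
    return H[n][m]

print(align_score("ACGT", "ACT"))               # edit-distance-like: 2
print(align_score("ACGT", "ACT", gap_open=-1))  # gap opening penalized: 1
```

Penalizing gap opening favors one long gap over many scattered 1bp gaps, which is one plausible source of the consensus difference discussed above.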
<p>Note that BWA-MEM has a bug in alignment. It is unable to perform the simple
edit-distance based alignment. I cannot evaluate the effect of scoring systems
with my own tool. I need to fix that at some point.</p>
<h3 id="comment-2-speed-and-memory">Comment 2: speed and memory</h3>
<p>I have tried GraphMap on 250 human PacBio reads. It took ~3000 CPU seconds to
map them on the prebuilt index (i.e. time on indexing is excluded). The peak
RAM is ~100GB. BWA-MEM took 26 CPU seconds with most of time spent on loading
the BWA index. It seems that GraphMap is not quite ready for whole human genome
mapping yet.</p>
<h3 id="comment-3-issues-with-end-to-end-mapping">Comment 3: issues with end-to-end mapping</h3>
<p>GraphMap forces end-to-end alignment. This is something I am trying to avoid in
BWA-MEM and BWA-SW. The 1000g SV group used to show the BLASR alignment of a
read containing an inversion. BLASR aligned through the inversion. Because the
true alignment is not linear, it leaves many mismatches and gaps in the
inverted region. Similarly, if there is a translocation, forcing end-to-end
alignment will produce wrong signals. On the other hand, the end-to-end
behavior may help accuracy when the tail of a read is error rich and is thus
more likely to be mismapped. Forcing end-to-end certainly helps to improve
speed as the mapper can more quickly filter out bad hits.</p>
<p>I have manually gone through the alignment of 250 reads (results are on
<code class="language-plaintext highlighter-rouge">ftp://hengli-data:lh3data@ftp.broadinstitute.org/pacbio-250</code>). The general
problem with BWA-MEM is that it may break a linear alignment into multiple
parts if there is a long INDEL or a region enriched with errors. This is
usually not a big problem because the aligned bases are still correct.
I found 5 such cases out of these 250 reads. There are a few cases where
GraphMap is likely to be wrong. For example, read <code class="language-plaintext highlighter-rouge">/183/10678_20910</code> probably
has an inversion in the middle. The percent divergence of the three parts
is 19%, 22% and 18%. GraphMap aligns it through, resulting in an alignment with
27% divergence. Something similar seems to happen to <code class="language-plaintext highlighter-rouge">/310/7790_13785</code>. GraphMap
produces an alignment of 35% divergence.</p>
<p>There are also several ambiguous cases. For example, BWA-MEM gives
two hits on different chromosomes for <code class="language-plaintext highlighter-rouge">/401/0_23424</code> with 9% and 11% divergence,
respectively. GraphMap maps it to one place with 19% divergence. I don’t know
which is correct. Nonetheless, I do know the answer in one case. BWA-MEM again
gives two hits for <code class="language-plaintext highlighter-rouge">/618/343_4960</code>, one on chr11 and the other on chr12.
GraphMap puts it on chr11. We can’t tell just from this, but fortunately the
read happens to have another subread <code class="language-plaintext highlighter-rouge">/618/5003_6006</code>, which is mapped to chr12
in one piece, very close to the BWA-MEM hit for <code class="language-plaintext highlighter-rouge">/618/343_4960</code>. BWA-MEM is
likely to be right.</p>
<p>Generally speaking, forcing end-to-end produces better alignment when the true
alignment is linear. In practice, though, telling whether the alignment is
linear is non-trivial. That’s why BWA-MEM produces local hits. As a side note,
the authors claim that BWA-MEM can’t find kb-long INDELs. That is true. BWA-MEM
can’t put such INDELs in one CIGAR. It usually produces two alignments flanking
the long INDELs, which can be identified later with post processing (e.g. by
<code class="language-plaintext highlighter-rouge">htsbox abreak</code>). This is an intentional design choice.</p>
<h3 id="concluding-remarks">Concluding remarks</h3>
<p>All that being said, GraphMap clearly represents an advance in sequence
alignment. Some observations in the manuscript (e.g. comment 1) are very
interesting and worth further investigation. I am sure the speed and memory can
be improved (comment 2) if the authors have such needs. As to the final
comment, when and how to break linear alignment is an unresolved issue.
I wouldn’t mind if the authors are not providing a satisfactory answer.
Overall, if I were a reviewer, I would accept this manuscript for publication.</p>
My thoughts on sharing genotype and phenotype data2015-06-24T00:00:00+00:00http://lh3.github.io/2015/06/24/my-thoughts-on-sharing-genomic-data
<p>Today, Google and Broad Institute (my employer) have <a href="http://googlecloudplatform.blogspot.com/2015/06/Google-Genomics-and-Broad-Institute-Team-Up-to-Tackle-Genomic-Data.html">announced</a> that
they are teaming up to tackle genomic data. One sentence caught my attention:
“Broad Institute has … either sequenced or genotyped the equivalent of more
than 1.4 million biological samples”. Can we get the data?</p>
<h2 id="current-data-sharing-model">Current data sharing model</h2>
<p>In my limited experience, the current data sharing model is largely
trust-and-distribute. Principle investigators (PIs) submit a form stating they
will use controlled data properly. Upon approval, they can download (usually
<a href="https://en.wikipedia.org/wiki/De-identification">de-identified</a>) individual genotypes and phenotypes and analyze
locally.</p>
<p>In my view, this model has two issues. Firstly, it is insecure. The model puts
all the trust on PIs and their PhDs/Postdocs/staffs. If one of them discloses
confidential data, there is a breach. Secondly, it hampers data sharing. PIs
may need to access multiple data sets. Getting approval for all
of them is not trivial. In addition, many more researchers are not qualified to
access data at all. For example, some sensitive data are not allowed to be
transferred out of the border of a country.</p>
<h2 id="data-access-patterns">Data access patterns</h2>
<p>To address the issues, we need to first understand how we use genotype and
phenotype data. It seems to me that we often, though not always, care about
the aggregate of genotypes instead of individual genotypes. We download
genotype/phenotype data only to collect aggregate information with scripting.
For example, we may ask which genes in cases have high loads of rare mutations
or whether a particular SNP has high frequency in cases but low in controls or
some statistics stratified by a phenotype (e.g. BMI or populations). Having
aggregate data available is already very helpful.</p>
<h2 id="a-different-data-sharing-model">A different data sharing model</h2>
<p>So, here is my thought: a better data sharing model is to let users publicly
query the aggregate statistics of their interest while hiding individual-level
data and keeping samples unidentifiable. The server sees all the genotypes and
phenotypes, processes the data and returns the aggregate to users.</p>
<p>How to make sure samples are unidentifiable is not easy. Here are a couple of
ideas. We may disallow query of frequency data among a small number of samples.
With one query, users would not know what sample has a particular allele.
However, depending on the structure of phenotypes, there is a small chance that
users may be able to infer the genotypes of a particular sample by performing
multiple queries. A stronger idea is to cluster samples based on phenotypes and
for each cluster, to use the median values to represent all samples in the
cluster. Users will not get individual-level details beyond a cluster.</p>
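<p>The clustering idea can be illustrated with a short sketch: group samples by a phenotype key, report only per-cluster medians, and refuse clusters below a minimum size. The field names and threshold here are hypothetical:</p>

```python
from statistics import median

def cluster_medians(samples, key, value, min_size=5):
    """Group samples by a phenotype field and return per-cluster medians,
    dropping clusters smaller than min_size so that no small group of
    samples can be singled out from the summary."""
    clusters = {}
    for s in samples:
        clusters.setdefault(s[key], []).append(s[value])
    return {k: median(v) for k, v in clusters.items() if len(v) >= min_size}

# Hypothetical phenotype table: five CEU samples and a lone YRI sample.
samples = [{"pop": "CEU", "bmi": 20 + i} for i in range(5)]
samples.append({"pop": "YRI", "bmi": 25})
print(cluster_medians(samples, "pop", "bmi"))  # the YRI cluster is suppressed
```

In a real server, the minimum cluster size would be a policy decision, and repeated queries with shifting cluster boundaries would still need auditing.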
<h2 id="implementing-the-new-model">Implementing the new model</h2>
<p>To make this new data sharing model something real, we need a highly flexible
and performant genotype server that computes user-defined aggregate on the fly.
It should at least match the expressiveness of SQL queries like:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SELECT v.chrom,v.pos FROM Variant v, Sample s, Genotype g
WHERE v.gene="BRCA1" AND s.pop="CEU" AND g.vid=v.vid AND g.sid=s.sid
GROUP BY g.vid HAVING maf(g.gt)>0.01
</code></pre></div></div>
<p>The key requirement is to have users flexibly defining their own ways to slice
and aggregate data. Precomputing a few summary statistics is a good start, but
is inadequate for complex use cases.</p>
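<p>As a toy example of a user-defined aggregate, here is what a <code class="language-plaintext highlighter-rouge">maf()</code> function like the one in the query above might compute, assuming diploid genotype strings; this is an illustration of the idea, not part of any proposed server:</p>

```python
def maf(genotypes):
    """Minor allele frequency from diploid genotype strings like '0/1'
    or '1|0'; missing alleles ('.') are skipped."""
    alleles = [a for gt in genotypes
               for a in gt.replace("|", "/").split("/") if a != "."]
    alt = sum(a != "0" for a in alleles)
    f = alt / len(alleles)
    return min(f, 1 - f)

print(maf(["0/1", "0/0", "1|1"]))         # 3 ALT alleles out of 6 -> 0.5
print(maf(["0/0", "0/0", "0/1", "0/0"]))  # 1 out of 8 -> 0.125
```

The point is that users define aggregates like this server-side, so only the summary, never the per-sample genotypes, leaves the server.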
<h2 id="implementation-is-not-enough">Implementation is not enough</h2>
<p>The idea of sharing aggregate data publicly will remain an idea unless
authorities such as IRBs approve sharing data this way. It is very hard, near
to impossible, but I do believe that this direction in general will make
data sharing more open, efficient, convenient and secure.</p>
<h2 id="a-final-comment">A final comment</h2>
<p>Many researchers are screaming for more samples, but in fact we already have
probably a couple of million genotyped/sequenced samples worldwide. These
data are just locked up in the attic. It is a pity. I have thoughts, but do not
have practical solutions.</p>
A few hours with docker2015-04-25T00:00:00+00:00http://lh3.github.io/2015/04/25/a-few-hours-with-docker
<h3 id="installing-docker-on-mac">Installing docker on Mac</h3>
<p>With all the buzz around <a href="https://www.docker.com">docker</a>, I finally decided to give it a try.
I first asked Broad sysadmins if there are machines set up for testing docker
applications. They declined my request for security concerns and suggested
<a href="https://kitematic.com">Kitematic</a> for my MacBook. This means I can hardly run sequence
analyses for human. Anyway, I followed their suggestion. Kitematic turns out
to be easy to install. It found my pre-installed <a href="https://www.virtualbox.org">VirtualBox</a>, put
a new Linux VM in it, launched a docker server inside the VM and provided a
<code class="language-plaintext highlighter-rouge">/usr/local/bin/docker</code> on my laptop that talks to the server. When I opened a
terminal from Kitematic (hot key: command-shift-T), I have a fully functional
<code class="language-plaintext highlighter-rouge">docker</code> command. You can in principle launch <code class="language-plaintext highlighter-rouge">docker</code> from other terminals,
but you need to export the right environmental variables.</p>
<h3 id="trying-prebuilt-images">Trying prebuilt images</h3>
<p>I ran the <a href="https://registry.hub.docker.com/_/busybox/">busybox image</a> successfully. I then tried ngseasy as it is
supposed to be easily installed with <code class="language-plaintext highlighter-rouge">make all</code>. When I did that, it started to
download a 600MB image. I frowned - my laptop does not have much disk space -
but decided to wait. After this one, it started to download another 500MB
image. I killed <code class="language-plaintext highlighter-rouge">make all</code> and deleted temporary files and the virtual machine.
A 1.1GB pipeline seems too much for my small experiment (and I don’t know if it
keeps downloading more).</p>
<h3 id="building-my-own-image">Building my own image</h3>
<p>Can I build a small image if I only want to install BWA in it? I asked myself.
I then googled around and found <a href="http://blog.xebia.com/2014/07/04/create-the-smallest-possible-docker-container/">this post</a>. It is still too complex
for my purpose, but does give the answer: I can. With more google searches, I
learned how to build a tiny image: to use statically linked binaries. I have put
up relevant files in <a href="https://github.com/lh3/bwa-docker">lh3/bwa-docker</a> at github. Briefly, to build and use
it locally:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git clone https://github.com/lh3/bwa-docker.git
cd bwa-docker
docker build -t mybwa .
docker run -v `pwd`:/tmp -w /tmp mybwa index MT.fa
cat test.fq | docker run -iv `pwd`:/tmp -w /tmp mybwa mem MT.fa - > test.sam
</code></pre></div></div>
<p>This creates test.sam in the <code class="language-plaintext highlighter-rouge">bwa-docker</code> directory. Yes, docker naturally
reads from stdin and writes to stdout, though perhaps there are more efficient
ways to pipe between docker containers.</p>
<p>With files on github, I can also add <a href="https://registry.hub.docker.com/u/lh3lh3/bwa/">my image</a> to <a href="https://hub.docker.com">Docker Hub</a> by
allowing Docker Hub to access my github account. You can access the image with:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>docker pull lh3lh3/bwa
docker run -v `pwd`:/tmp -w /tmp lh3lh3/bwa index MT.fa
</code></pre></div></div>
<p>Is the above the typical approach to creating images? Definitely not. This way,
docker is no better than statically linked binaries. If you look at other
Dockerfiles (the file used to automatically build a docker image), you will see
the typical approach is to compile executables inside the docker VM. Images
created this way depend on “fat” base images. You have to download a base image
of hundreds of MB in size in the first place. If you have two tools built upon
different fat base images, you probably need to have both bases (is that
correct?).</p>
<h3 id="preliminary-thoughts">Preliminary thoughts</h3>
<p>Docker is a blessing to complex systems such as the old Apache+MySQL+PHP combo,
but is a curse to simple command line tools. For simple tools, it adds multiple
complications (security, kernel version, Dockerfile, large package,
inter-process communication, etc) with little benefit.</p>
<p>Bioinformatics tools are not rocket science. They are supposed to be simple. If
they are not simple, we should encourage better practices rather than live with
the problems and resort to docker. I am particularly against dockerizing
easy-to-compile tools such as velvet and bwa or well packaged tools such as
spades. Another large fraction of tools in C/C++ can be compiled to statically
linked binaries or shipped with necessary dynamic libraries (see sailfish).
While not ideal, these are still better solutions than docker. Docker will be
needed for some tools with complex dependencies, but I predict most of such
tools will be abandoned by users unless they are substantially better than
other competitors, which rarely happens in practice.</p>
<p>PS: the only benefit of dockerizing simple tools is that we can acquire a tool
with <code class="language-plaintext highlighter-rouge">docker pull user/tool</code>, but that is really the benefit of a centralized
repository which we are lacking in our field.</p>
The unary representation of variants2015-02-23T00:00:00+00:00http://lh3.github.io/2015/02/23/the-unary-representation-of-variants
<p>As discussed in my previous post, a major but potentially fixable problem
with the VCF model is that we are allowed to, and sometimes have to, squeeze
multiple alleles into one VCF line. This post gives the solution: the unary
representation. The representation was first conceived by Richard Durbin a
couple of years ago but, in my view, had a few practical issues initially and
thus was never openly presented.</p>
<p>In the following, I will present a <em>toy</em> format named the variant allele
format (VAF), but I should note that this format mainly serves to demonstrate
the logical relationships between objects in the unary representation. I am
<strong><em>NOT</em></strong> proposing a new polished format in this post.</p>
<h3 id="the-toy-vaf-format">The <em>toy</em> VAF format</h3>
<p>Here is an example:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#LS @1000g NA06879 NA12345
#LS @SGDP HGDP12345 HGDP23456
AL chr1_501C>T rs12345 AC=3
GT @1000g:GT 1|0 .|.
GT @SGDP:GT:GL 0|0:0,30,100 1|1:100,30,0
//
UC chr1_601_700uncalled
GT @1000g
GT HGDP12345
//
AL chr1_1002A>T . .
GT NA06879:GT 1/1
GT NA12345:GT 1/chr1_1002_1005del
GT @SGDP:GT 1/0 1/chr1_1002_1005del
//
AL chr1_1002_1005del . .
GT NA12345:GT 1/chr1_1002A>T
GT @1000g:GT * 1/chr1_1002A>T
//
</code></pre></div></div>
<p>where a <code class="language-plaintext highlighter-rouge">#LS</code> header line defines an ordered list of samples. An <code class="language-plaintext highlighter-rouge">AL</code> line
encodes an allele sequence in an unambiguous HGVS-like format (which will be
discussed later) and the following <code class="language-plaintext highlighter-rouge">GT</code> lines give the genotypes of each sample,
where “1” indicates the presence of the ALT allele, “0” the presence of the
reference allele and “*” suggests the sample does not contain the allele (NB:
this is different from missing data “.”; see also discussion point 5 below). If
at a locus, a sample possesses two overlapping ALT alleles, one of the alleles
needs to be encoded in the HGVS-like format on the GT line.</p>
<p>If we use one <code class="language-plaintext highlighter-rouge">#LS</code> line and append the <code class="language-plaintext highlighter-rouge">GT</code> lines to the <code class="language-plaintext highlighter-rouge">AL</code> lines, the format
above will look like VCF. However, conceptually, VAF has two important
differences: 1) in VAF, we can choose to exclude homozygous reference sites (in
VCF, we need to explicitly write 0/0 at these sites), and 2) in VAF, the primary
unit is allele (in VCF, each data line doesn’t have a clear meaning). With these
two properties, merging two VAFs is simply done via line copying: if an allele
in the second VAF is absent in the first, we copy the entire <code class="language-plaintext highlighter-rouge">AL</code> record to the
first file; if present, we only copy <code class="language-plaintext highlighter-rouge">GT</code> lines to the first VAF. Simple merging
makes it trivial to integrate variants from various projects, which is very
complicated and costly with VCF. VAF is also more consistent and simplifies
some optional fields like AC and AF and fields related to variant annotation.</p>
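<p>The line-copying merge can be sketched directly. Assuming each VAF has been parsed into a mapping from allele ID to its <code class="language-plaintext highlighter-rouge">GT</code> lines (a hypothetical in-memory representation, not part of the format), merging is a few lines:</p>

```python
def merge_vaf(vaf1, vaf2):
    """Merge two parsed VAFs: each maps allele ID -> list of GT lines.
    If an allele is absent from the first file, the whole record is
    copied; if present, only the GT lines are appended."""
    merged = {al: list(gts) for al, gts in vaf1.items()}  # copy, don't mutate
    for al, gts in vaf2.items():
        merged.setdefault(al, []).extend(gts)
    return merged

a = {"chr1_501C>T": ["@1000g:GT 1|0 .|."]}
b = {"chr1_501C>T": ["@SGDP:GT:GL 0|0:0,30,100 1|1:100,30,0"],
     "chr1_1002A>T": ["NA06879:GT 1/1"]}
m = merge_vaf(a, b)
print(len(m["chr1_501C>T"]))  # 2: GT lines appended to the shared allele
print("chr1_1002A>T" in m)    # True: the new record is copied over
```

Because the primary unit is the allele, no allele renumbering is needed, which is exactly what makes this merge trivial compared with VCF.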
<h3 id="further-comments-on-vaf">Further comments on VAF</h3>
<ol>
<li>
<p>Atomic vs compound alleles. An atomic allele is the “smallest” allele. A
compound allele consists of multiple atomic alleles. In VAF, each allele should
be atomic.</p>
</li>
<li>
<p>Unambiguous representation of an atomic allele. In HGVS, a C to G SNP could
also be described by a 1bp inversion. This is problematic. We could impose extra
rules such that the same atomic allele is always represented by the same string.</p>
</li>
<li>
<p>At the AL lines, it is probably better to convert <code class="language-plaintext highlighter-rouge">chr1_501C>T</code> to <code class="language-plaintext highlighter-rouge">chr1 500
501 C T</code>, but then we need a way to describe large alleles (e.g. translocations)
and a consistent procedure to convert TAB-delimited allele representation to a
string.</p>
</li>
<li>
<p>My major concern with the initial unary proposal is that retrieving genotypes
requires combining multiple allele lines, which greatly complicates
programming. VAF solves this problem by listing non-reference alleles that are
different from the current allele record. This is only necessary at sites having
two different ALT alleles.</p>
</li>
<li>
<p>An alternative solution to point 4 is to add <code class="language-plaintext highlighter-rouge">RA</code> lines to list alleles
related to the current allele record. We could then use integers in the GT
lines. However, doing this will complicate merging as we would need to renumber
the alleles. Simple line copying won’t work.</p>
</li>
<li>
<p>I think having the special “*” genotype is convenient, but it is not strictly
necessary. If we feel “*” is confusing, we could replace “*” with
“0/chr1_1002A>T” in the example above.</p>
</li>
<li>
<p>For a diploid sample, the same genotype may be presented twice (or four times
for a tetraploid sample). This is a disadvantage.</p>
</li>
<li>
<p>In comparison to VCF, it is harder to get overlapping genotypes containing
different alleles. Another disadvantage. However, I am not sure we need this
operation.</p>
</li>
</ol>
<h3 id="vaf-in-avro">VAF in Avro</h3>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>record Call {
string callSetId; // sample ID
array<union{int,Path}> genotype; // array size == ploidy
array<double> genotypeLikelihood = [];
map<array<string>> info = {}; // other VCF FORMAT fields
}
record Allele {
Path path; // allele position and sequence
array<Call> calls = [];
map<array<string>> info = {}; // other VCF INFO fields
}
</code></pre></div></div>
<p>Below is a slightly different schema, where we put the other related overlapping
alleles in <code class="language-plaintext highlighter-rouge">Allele</code> (see also point 5 in the discussion above).</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>record Call {
string callSetId; // sample ID
array<int> genotype; // array size == ploidy
array<double> genotypeLikelihood = [];
map<array<string>> info = {}; // other VCF FORMAT fields
}
record Allele {
Path path; // allele position and sequence
// other paths overlapping `path` and used in `calls`
array<Path> relatedPaths = [];
array<Call> calls = [];
map<array<string>> info = {}; // other VCF INFO fields
}
</code></pre></div></div>
The problems with the VCF model2015-02-23T00:00:00+00:00http://lh3.github.io/2015/02/23/the-problems-with-the-vcf-model
<p><em>This and the next posts were mostly written on the plane when I felt tired and
did not have Internet connections. The logical flow is not very clear. In
addition, I have to admit that I had not thought the topic through when I was
writing it up. Now, having finished the posts, I have a clearer picture. I will
still put them online as a historical record.</em></p>
<h3 id="edit-based-representation">Edit-based representation</h3>
<p>VCF represents a variant by substituting the reference allele sequence with the
variant allele. Substitution is a type of edit. In fact, HGVS, GVF and most
other variant formats or representations are edit-based. However, edit-based
representations have an intrinsic problem: edits are determined by the alignment
between the variant allele and the reference allele, and alignments are affected
by the scoring system. As a result, one allele sequence could be represented in
multiple ways. Here is an example:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Ref: AAGCTA--CTAG----CT AAGCTA------CTAGCT
Allele: AAGCTAGACTAGGAAGCT or AAGCTAGACTAGGAAGCT
(2 gap opens, 0 mismatch) (1 gap open, 2 mismatches)
</code></pre></div></div>
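<p>To make the dependence on scoring concrete, the sketch below scores the two
alignments above under an affine gap penalty. The penalty values are arbitrary,
chosen only to show that either alignment can come out optimal.</p>

```python
def affine_score(ref, alt, match=1, mismatch=1, gap_open=1, gap_ext=1):
    """Score a pairwise alignment (gaps shown as '-') under an affine
    gap penalty: each gap costs gap_open + length * gap_ext."""
    score, in_gap = 0, False
    for r, a in zip(ref, alt):
        if r == '-' or a == '-':
            if not in_gap:
                score -= gap_open
            score -= gap_ext
            in_gap = True
        else:
            score += match if r == a else -mismatch
            in_gap = False
    return score

alt  = "AAGCTAGACTAGGAAGCT"
ref1 = "AAGCTA--CTAG----CT"   # 2 gap opens, 0 mismatches
ref2 = "AAGCTA------CTAGCT"   # 1 gap open, 2 mismatches
```

<p>Under <code class="language-plaintext highlighter-rouge">mismatch=4, gap_open=2</code> the first alignment scores higher; under
<code class="language-plaintext highlighter-rouge">mismatch=1, gap_open=5</code> the second does. Both are “optimal” for some
reasonable scoring system.</p>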
<p>In this example, the allele sequence is the same, but in the VCF format, it
could be represented in two different ways. This is a serious problem. Some
blame VCF, but that is not entirely fair. We have had this problem for over a
decade; we were simply unaware of it because we had not dealt with so many sites
and samples.</p>
<p>A possible solution to the one-variant-multiple-representation problem is to use
a context-based representation. I will not go into details in this post except
to point out that adopting a context-based representation requires a huge shift
in the modeling of genetic variants and will take time. Those who are interested
in this topic should read the preprint by Benedict et al (2014).</p>
<h3 id="multiple-alleles-per-record">Multiple alleles per record</h3>
<p>In VCF, we frequently <strong>have to</strong> squeeze multiple alleles in one VCF line;
otherwise we will not be able to represent a diploid genotype with two ALT
alleles sometimes. Then what is the rule to combine multiple alleles? The VCF
spec doesn’t specify. <strong>Conventionally</strong>, we merge all overlapping alleles into
one VCF line. Why couldn’t we promote this convention into the spec? It is
because the convention leads to various issues.</p>
<p>Firstly, merging overlapping alleles doesn’t work in practice with long
deletions. So when should we merge and when shouldn’t we? There is no consistent
answer.</p>
<p>Secondly, merging is sensitive to rare variants. Suppose we have two common SNPs
at position 1002 and 1005 among 100,000 samples. The two SNPs will be put on
separate VCF lines as they have no overlaps. However, if the 100,001st sample
has a deletion from position 1001 to 1006, we will have to squash the two SNPs
and the one deletion into one VCF line, but this complicates the annotation and
the analysis of the two SNPs. Such a scenario may happen often given many
samples.</p>
<p>Thirdly, merging is not always possible when data are unphased. Still consider
the example above. If the two SNPs at 1002 and 1005 are unphased, we will not
know how to join them for each sample. (TODO: how does bcftools merge work?)</p>
<p>Fourthly, merging multi-sample VCFs adds many “./.” or “0/0” genotypes. The
resultant VCF is usually much larger than the sum of the inputs. Merging is not
scalable. Strictly speaking, this space inefficiency is caused by the dense
representation of VCF, which I will come back to shortly.</p>
<p>As a consequence of the points above, VCF merging, as required by the
multi-allele-per-record representation, is a complex, expensive, inconsistent and
nondeterministic operation. It effectively creates a boundary between VCFs
produced from different projects and hampers data integration. In addition,
the multi-allele-per-record representation also makes annotation harder
because we annotate individual alleles, not a VCF line. Furthermore, not every
VCF follows the non-overlapping convention. The inconsistencies between VCFs
are frequently troublesome.</p>
<p>The VCF representation also raises a serious theoretical concern: what does each
VCF line stand for? It is in fact this conceptual ambiguity that leads to all
the problems in this section.</p>
<h3 id="dense-representation">Dense representation</h3>
<p>VCF encodes a matrix of genotypes (site-by-sample) with a dense representation
whereby it explicitly gives the genotype of every cell. For many samples, this
matrix is sparse in the sense that the vast majority of cells are “0/0” (if the
VCF is produced by multi-sample calling, including gVCF merging) or “./.” (if
produced by merging). The more samples, the more sites, the more sparse the
matrix, and the more space and computation VCF costs. VCF is not scalable.</p>
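<p>The sparsity argument is easy to see in code. Here is a minimal sketch (not a
proposed format) that stores only non-reference cells and treats everything
else as an implicit “0/0”:</p>

```python
class SparseGT:
    """A genotype matrix that stores only non-reference cells;
    any cell not stored is implicitly the "0/0" genotype."""
    def __init__(self):
        self.cells = {}                    # (allele, sample) -> genotype
    def set(self, allele, sample, gt):
        if gt != "0/0":                    # hom-ref cells take no space
            self.cells[(allele, sample)] = gt
    def get(self, allele, sample):
        return self.cells.get((allele, sample), "0/0")
```

<p>With 100,000 samples and a rare variant carried by ten of them, this layout
stores ten cells where a dense VCF line spells out 100,000.</p>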
<h3 id="conclusion">Conclusion</h3>
<p>I have listed three major problems with the VCF model: edit-based
representation, multiple alleles per record and dense representation. While I am
not sure how to solve the first problem without disruptively trashing our
established practices on variants, a few believe it should be possible to solve
the other two problems. This will be explained in my second post of this sequel.</p>
Correcting Illumina sequencing errors: extended background2015-02-13T00:00:00+00:00http://lh3.github.io/2015/02/13/comments-on-illumina-error-correction
<p>I enjoy writing 2-page Application Notes these days. They take less time to
write, giving me more time to solve other problems. More importantly, I don’t
need to fight to claim significance and novelty, which are subjective most of
the time. The downside of writing short manuscripts is the lack of extensive
discussion. Here are a few things I did not say in my <a href="http://arxiv.org/abs/1502.03744">new error correction
preprint</a>.</p>
<h4 id="the-role-of-error-correction">The role of error correction</h4>
<p>Error correction is used for <em>de novo</em> assembly but for almost nothing else. It
is usually wise not to apply error correction in mapping-based analyses because
mapping looks at the full length of a read, whereas error correction, especially
with k-mer based algorithms, only uses local information and is not as powerful. As
most Illumina data are processed through mapping-based pipelines, error
correction is of limited use. Then why do I care about error correction?</p>
<p>I have long believed that one day we will keep processed data only and throw
away raw sequence reads just as we have trashed images and intensities and
reduced the resolution of base qualities. With more and more high-coverage human
data coming out, a <em>de novo</em> assembly, rather than a gVCF, would be the best
form to keep because 1) like a gVCF, an assembly is much smaller than raw data;
2) unlike VCF, a perfect assembly is a lossless representation of raw reads; 3)
hence a perfect assembly encodes all types of variants, not only SNPs and short
INDELs; 4) unlike VCF, a <em>de novo</em> assembly is not bound to a particular
reference genome; 5) and is free of artifacts caused by the reference; 6)
mapping and variant calling from assembled unitigs is much more efficient and in
some ways better. I have been working on this idea from time to time since 2011
and <a href="http://www.ncbi.nlm.nih.gov/pubmed/22569178">published a paper</a> in 2012, but I keep coming back to error
correction.</p>
<h4 id="my-past-efforts-in-sequencing-error-correction">My past efforts in sequencing error correction</h4>
<p>The 2012 paper was not very successful. Although on calling short variants it
was better than other assembly-based approaches at the time, it was much less
sensitive than mapping-based variant calling. I later realized that this was
mainly caused by the aggressive error corrector: the corrector I implemented for
that paper corrected many heterozygotes away. A couple of years later, I
reimplemented the correction algorithm but made it more conservative.
Assembly-based variant calling started to compete well with mapping-based
calling on both sensitivity and specificity.</p>
<h4 id="development-of-error-correctors-in-2014">Development of error correctors in 2014</h4>
<p>The fermi2 error corrector was developed in late 2013 when there were few
Illumina error correctors fast and lightweight enough to handle whole-genome
human data. This changed in 2014. In May, <a href="https://sourceforge.net/projects/bless-ec/">BLESS</a> was
published with a substantially improved version coming later in October that
uses KMC2 as the k-mer counter. <a href="https://github.com/mourisl/Lighter">Lighter</a> was out in September.
<a href="https://gatb.inria.fr/software/bloocoo/">Bloocoo</a> was on Bioinformatics Advanced access channel in October as
part of GATB. That made three performant error correctors in less than a year.
Fiona and Trowel were also published that year, but they are not capable of
correcting human data. In addition, <a href="http://www.ncbi.nlm.nih.gov/pubmed/25183248">Molnar and Ilie (2014)</a>, the
developers of HiTEC and RACER, for the first time evaluated error correction for
whole-genome human data, though due to timing, they did not include the five
new error correctors mentioned above (the review included an old version of
BLESS, which is slow, single-threaded and does not work with reads of variable
lengths).</p>
<h4 id="the-motivation-of-developing-bfc">The motivation of developing BFC</h4>
<p>Although BLESS, Bloocoo and Lighter are all faster and more lightweight than the
error corrector implemented in fermi2, I had two potential concerns, one
theoretical and the other practical. Firstly, these tools use greedy algorithms
in that they rarely (if at all) revert a correction made at an earlier step even if
doing so would help to correct the rest of the read. The algorithm I implemented in
fermi and fermi2 is theoretically better, I think (for details, see <a href="http://arxiv.org/abs/1502.03744">the
preprint</a>). Secondly, Illumina sequencing occasionally
produces systematic errors. These are recurrent sequencing errors that usually
have low base quality. Being aware of base quality during the k-mer counting
phase would practically help to fix these errors. These concerns together with
the poor performance of the fermi2 corrector motivated me to develop a new error
corrector, BFC. Honestly, I didn’t know how much the two concerns would matter
in practical applications before I started to implement BFC.</p>
<h4 id="the-development-of-bfc">The development of BFC</h4>
<p>The BFC algorithm has been detailed in <a href="http://arxiv.org/abs/1502.03744">the preprint</a>. I will add
something untold.</p>
<p>When I started to evaluate BFC, I was only aware of SGA, Lighter and the old
BLESS. Lighter-1.0.4 was very slow on compressed input. The old BLESS did not work
because the test data have reads of variable lengths, and the new BLESS was so
challenging to build that I initially gave up on it. I was quite happy to see
that BFC was several times faster than these tools.</p>
<p>I was overoptimistic. When I monitored the Lighter run more closely, I found it
was mostly single-threaded. I then tried Lighter on uncompressed input. It was
six times as fast as the run on compressed input. I told the Lighter developers
my findings and suggested solutions. They were very responsive and quickly
improved the performance. The timing reported in the preprint was from the new
Lighter.</p>
<p>Having seen that Lighter could be much faster, I decided to get BLESS compiled. It
was the right decision. The new BLESS works well, especially when it uses
long k-mers. This has inspired further explorations: use of KMC2 and long k-mers
for error correction. I implemented a variant of BFC, BFC-bf, to take KMC2
counts as input and keep the counts in a bloom filter. I had to ignore base
quality during the counting phase as KMC2 is not aware of base quality in the
way I would prefer. The correction accuracy is not as good as BFC’s. Nonetheless,
BFC-bf helps to confirm that the apparently better correction accuracy is not
purely due to the use of base quality during k-mer counting. BFC-bf also makes
it easier to use long k-mers.</p>
<p>The role of k-mer length in error correction is complicated. Since assembling
corrected reads takes much more computing time and my development time, I did
not run <em>de novo</em> assemblies often at the beginning. I assumed that low under- and
over-correction rates should lead to better assemblies. I was wrong again.
At least for fermi2, short k-mers used for correction lead to better assemblies
(for reasons, see the discussions in the preprint). I have also tried to combine
two k-mer lengths but only with limited success.</p>
<p>Here I should admit that a weakness of the manuscript is that I have not run <em>de
novo</em> assemblies for all correctors. This is mostly because I do not have enough
computing resources. Another key reason is that the assembly result varies with the
assembler. It is too much for me to try <em>M</em> assemblers on <em>N</em> correctors. I did
briefly try the fermi2 assembler on <em>E. coli</em> data corrected by a few (not all)
tools. BFC is better, which is a relief but not conclusive. I need to try more
assemblers on more data sets to get a firm view. That is for the future.</p>
The early history of the SAM/BAM format2015-01-27T00:00:00+00:00http://lh3.github.io/2015/01/27/the-early-history-of-the-sambam-format
<p>While I was looking for an ancient email on my old (first) macbook, I noticed
the numerous email exchanges during the early days of the SAM/BAM format. Here
is a brief summary. The ideas below were proposed by various people in the 1000
Genomes Project analysis group.</p>
<ul>
<li>2008-10-21: SAM got its name.</li>
<li>2008-10-22: The first day: fixed columns and optional tags; extended CIGAR and
binning index.</li>
<li>2008-10-24: Compression suggested. RAZF started. Streamability emphasized.</li>
<li>2008-11-01: 2-byte tags; mate positions as fixed columns.</li>
<li>2008-11-03: Adopted the text/binary dual format. RAZF implemented.</li>
<li>2008-11-06: Sequence dictionary. The very first draft of SAM/BAM spec circulated.</li>
<li>2008-11-07: BGZF proposed. BAM got its name.</li>
<li>2008-11-10: Linear index.</li>
<li>2008-11-12: Read group.</li>
<li>2008-11-14: BGZF implemented. BAM on top of RAZF working.</li>
<li>2008-11-18: Combining binning and linear indices.</li>
<li>2008-11-20: sort/merge/pileup/faidx implemented.</li>
<li>2008-11-21: tview prototype working.</li>
<li>2008-12-08: Final draft sent to 1000g. Adopted the MIT license.</li>
<li>2008-12-22: First public release of samtools. It is still working on most BAMs nowadays.</li>
</ul>
BWA-MEM for long error-prone reads2014-12-10T00:00:00+00:00http://lh3.github.io/2014/12/10/bwa-mem-for-long-error-prone-reads
<p>A recent paper published by <a href="http://www.nature.com/nbt/journal/vaop/ncurrent/full/nbt.3103.html">Phil Ashton et al</a> has triggered some
discussions which subsequently moved to my domain: read mapping. Phil then
asked me to clarify how the upcoming bwa-mem works with Oxford Nanopore (ONT)
reads. Here we go.</p>
<p>Although the very first version of bwa-mem worked with PacBio reads (well, it did
not crash), the alignment it produced was too fragmented to be useful. I
initially thought the long exact seeds used by bwa-mem would not be sensitive
enough to the ~15% error rate of PacBio reads, but <a href="http://www.homolog.us/blogs/">Homolog.us</a>
<a href="http://www.homolog.us/blogs/blog/2013/10/28/bwa-mem-good-blasr-aligning-pacbio-reads-part-2/">pointed out</a> that <a href="https://github.com/PacificBiosciences/blasr">BLASR</a> is also using long exact seeds.
I then realized that it is also possible for the bwa-mem algorithm to work with
PacBio data. With more and more interesting PacBio data sets coming out, I
decided to give it a try.</p>
<p>There are two major changes in BWA-MEM to support PacBio data better. Firstly,
we have to use a relaxed scoring matrix such that Smith-Waterman (SW) can give
a positive score on a valid match. In 0.7.9 and 0.7.10, the scoring scheme is:
match=2, mismatch=-5, gapOpen=-2 and gapExt=-1. Secondly, I added a heuristic
to filter initial seeds so as to reduce unsuccessful seed extensions. For PacBio
reads, bwa-mem performs SSE2-based SW in a small window around each seed and then
rejects the seed if the SW score is too small (threshold proportional to option
<code class="language-plaintext highlighter-rouge">-W</code>). This is similar to the X-dropoff heuristic of BLAST. In addition to
these, bwa-mem also implements a gap patching heuristic whereby it tries to
connect two colinear local hits with a global alignment even if the resulting
alignment is not optimal. This heuristic helps the alignment walk through
low-quality regions and thus reduces fragmentation. With these changes, bwa-mem
works well for PacBio data.</p>
<p>ONT reads pose new challenges due to their higher error rate. The initial release
of one-direction (1D) reads has an error rate higher than 30%. The 2D reads are
a little better, but still have more errors than PacBio for now. The PacBio mode
is not quite suitable, which calls for further improvements to bwa-mem.</p>
<p>The ONT-specific changes are relatively simple. Firstly, we use shorter seed
lengths and a more relaxed threshold <code class="language-plaintext highlighter-rouge">-W</code> as a consequence of the higher error rate.
Secondly, we modified the scoring matrix to match=1, mismatch=-1, gapOpen=-1
and gapExt=-1, based on a <a href="http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4226419/">recent paper</a>. This setting turns out to be
better for PacBio, too.</p>
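<p>A back-of-envelope check shows why the scoring change matters: a true
alignment is only recoverable if its expected score is positive. The sketch
below treats every error as a mismatch, a deliberate simplification since many
PacBio/ONT errors are actually indels.</p>

```python
def expected_score(err, match, mismatch):
    """Expected per-column score of a correct alignment at error rate
    `err`, treating all errors as mismatches (a simplification)."""
    return (1 - err) * match - err * mismatch

# bwa-mem's default short-read scoring (match=1, mismatch penalty 4)
# goes negative at 30% ONT error (~ -0.5), while the match=1/mismatch=1
# scheme stays positive (~ 0.4). The PacBio scheme (match=2, penalty 5)
# remains positive at 15% error (~ 0.95).
```

<p>This is why the relaxed matrices make SW alignments through error-prone reads
score positively while the short-read defaults reject them.</p>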
<p>The ONT mode of bwa-mem is largely comparable to <a href="http://last.cbrc.jp">LAST</a>, the mapper
recommended by several groups. Given the same scoring system, the two mappers
generate identical SW scores most of the time. When the scores differ, LAST
tends to be the winner: on this small fraction of alignments with different
scores, bwa-mem is more likely to miss low-quality hits or to fail to extend a
partial alignment to the right place (I need to walk through these examples to
understand why this happens). For bacterial data, bwa-mem and LAST are also
about the same in speed. For human PacBio reads, though, bwa-mem is many times
faster. It is more geared towards human data.</p>
<p>LAST is probably the only mapper that works efficiently and accurately with
query sequences ranging from 100bp to 100Mbp without much parameter tuning. This
is very impressive. As of now, bwa-mem does not work well for queries longer
than ~10Mbp.</p>
<p>To use the ONT mode of bwa-mem, one needs to acquire a recent bwa <a href="https://github.com/lh3/bwa">from
github</a> and run it with <code class="language-plaintext highlighter-rouge">bwa mem -x ont2d ref.fa reads.fq</code>. Its official
release via 0.7.11 is coming soon.</p>
On HiSeq X10 Base Quality2014-11-03T00:00:00+00:00http://lh3.github.io/2014/11/03/on-hiseq-x10-base-quality
<p>Illumina has recently released <a href="https://basespace.illumina.com/datacentral">four lanes of NA12878 data</a> from HiSeq
X10. I was playing with this data set and found that my program had bad accuracy
on two of them. I initially thought the data quality was different, so wrote
some code to investigate the data quality. It turns out that my program was
buggy, but the finding of the HiSeq X10 data quality might be of its own
interest, which I am sharing here.</p>
<h3 id="hiseq-x10-data-quality">HiSeq X10 data quality</h3>
<p>When I looked at the HiSeq X10 alignment in samtools tview, my first impression
is that the error rate is visually higher than in previous Illumina data I have
seen. <strike>This might be caused by the 2-channel system (as opposed to
the previous 4-channel one).</strike> However, a closer look at the base quality suggested I
might be wrong. The average base quality of these HiSeq X10 data is Q37.0 for
reads mapped to chr11. This compares favorably to NA12878 from <a href="http://www.illumina.com/platinumgenomes/">Platinum
Genomes</a> (Q36.4) and the CHM1 data used in my paper (Q34.9).</p>
<p>Mean quality is in effect the geometric mean of the error rate, but what I
observed in tview is the arithmetic mean. Could that be the cause? It is not.
The arithmetic mean of HiSeq X10 data is Q24.4, still better than NA12878
(Q24.2) and CHM1 (Q17.6).</p>
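<p>The two means can be checked numerically. Averaging Phred values corresponds to
the geometric mean of the per-base error probabilities, which is always at least
as optimistic as the arithmetic mean. The quality values below are toy numbers,
not the actual X10 data.</p>

```python
import math

def phred_to_err(q):          # Phred Q -> error probability
    return 10 ** (-q / 10)

def err_to_phred(p):          # error probability -> Phred Q
    return -10 * math.log10(p)

quals = [40, 40, 40, 10]      # toy per-base qualities

geo = sum(quals) / len(quals)                                   # Q32.5
ari = err_to_phred(sum(map(phred_to_err, quals)) / len(quals))  # ~Q16.0
# A few low-quality bases dominate the arithmetic mean of error rates,
# which is what one actually "sees" in a tview alignment.
```

<p>Averaging the Phred values gives Q32.5, while averaging the error rates gives
roughly Q16: a handful of bad bases halves the apparent quality.</p>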
<p>However, I still trust my eyes more. I started to believe the base quality of
HiSeq X10 reads is overestimated (or the older quality is underestimated). <a href="http://maq.sourceforge.net">MAQ</a>
has a subcommand “mapcheck” to estimate the empirical base quality from read
mapping. I don’t have this for BAM, so I implemented one. In this
implementation, if 35% of >=Q20 bases are different from the reference base,
the site is considered to be a variant site and is ignored.</p>
<p>My eyes were right after all. The following table shows the empirical arithmetic
means of different data sets and also stratified by low (<Q20) and high base
quality:</p>
<table>
<thead>
<tr>
<th>Dataset</th>
<th>emQ</th>
<th>% <Q20 bases</th>
<th>emQ <Q20</th>
<th>emQ >=Q20</th>
</tr>
</thead>
<tbody>
<tr>
<td>Platinum Genomes</td>
<td>Q26</td>
<td>2.5%</td>
<td>~Q13</td>
<td>~Q34</td>
</tr>
<tr>
<td>CHM1</td>
<td>Q25</td>
<td>4.3%</td>
<td>~Q12</td>
<td>~Q33</td>
</tr>
<tr>
<td>HiSeq X10 L1</td>
<td>Q23</td>
<td>4.7%</td>
<td>~Q10</td>
<td>~Q30</td>
</tr>
<tr>
<td>HiSeq X10 L3</td>
<td>Q23</td>
<td>4.6%</td>
<td>~Q10</td>
<td>~Q30</td>
</tr>
</tbody>
</table>
<p>We can see that the empirical quality of HiSeq X10 data is clearly lower: the
implied error rate is almost twice as high. This is consistent with my
impression. Note that on the older data, the empirical quality could go higher if
there were no variants. The 35% rule is not good enough.</p>
<p>Another slightly worrying sign in the new data is the systematic compositional
bias. In particular, among erroneous high-quality C bases in FASTQ, 68% should really
be A but only 8% should be G. There were compositional biases in older Illumina data, but not as
bad. Does this affect variant calling? <em>Crude</em> evaluation using <a href="http://genomeinabottle.org">Genome In A
Bottle</a> (GIAB) suggests the variant calls are still decent. Nonetheless,
more careful comparisons are needed to draw a definite conclusion.</p>
<h3 id="base-quality-resolution">Base quality resolution</h3>
<p>Another visible difference of HiSeq X10 data is the reduced resolution of base
qualities. There are only seven distinct quality values, as opposed to nearly
40 in the previous data. It has long been debated whether this would impact
the accuracy of variant calls, with NCBI being one of the first advocates. I once
evaluated the effect on the 1000g low-coverage pilot and a high-coverage
NA12878. As I remember (I have lost the data), 7 or even 4 quality values
worked.</p>
<p>Reduced quality resolution has a positive effect on the size of alignment files.
Typically, for 35X human data, the size of the final BAM file is about 100GB.
The size of these 35X HiSeq X10 data is only 70GB, a 30% reduction. Could we
push further, say 1-bit quality?</p>
<p>I did an experiment. In the aligned BAM, I turned all qualities below Q20 to Q10
and all qualities no less than Q20 to Q30 (based on the table above). I ran GATK-HC on
the original BAM and the quality-reduced BAM. If I compared the exact variant
coordinates, HC called 7,253 unfiltered variants only present in the original
BAM and 15,435 variants only in the quality-reduced BAM. The difference is minor
if we notice that the difference between lanes 1 and 7 is over 110,000. On GIAB,
the two call sets are largely indistinguishable. In all, 1-bit quality does not
obviously reduce the accuracy of variant calls.</p>
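<p>The binning itself is trivial to implement. A sketch over a Sanger/Phred+33
quality string, using the thresholds from the experiment above:</p>

```python
def bin_1bit(qual_str, thresh=20, lo=10, hi=30):
    """Map each base quality below `thresh` to `lo` and the rest to `hi`,
    keeping Phred+33 ASCII encoding. Two levels = one bit per base."""
    return "".join(
        chr((lo if ord(c) - 33 < thresh else hi) + 33) for c in qual_str
    )

# Only two distinct characters remain ('+' for Q10, '?' for Q30),
# which is what lets BAM/CRAM compress the quality stream so well.
```

<p>A quality stream with two symbols is highly compressible, which is where the
40GB BAM and 16GB CRAM figures come from.</p>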
<p>One-bit quality further reduces the BAM size down to 40GB. In the CRAM format,
it is 16GB. This is a significant reduction from typical 100GB BAMs at 35X
coverage.</p>
<p>Although I have not tried, I firmly believe that we cannot discard base quality
altogether. HiSeq produces recurrent errors. These errors are usually correctly
assigned to low base quality. If a variant caller ignores base quality, it is
likely to make calls at these systematic errors. One bit is the minimum.</p>
<p>While I was doing my small experiment, I learned from the GA4GH mailing list
that Illumina is also exploring the possibility of 1-bit quality. I believe this
strategy should be fine for normal samples. I actually think, cautiously, that
1-bit quality may even work for cancer data, but I am not experienced enough to
confirm this.</p>
On the graphical representation of sequences2014-07-25T00:00:00+00:00http://lh3.github.io/2014/07/25/on-the-graphical-representation-of-sequences
<h3 id="introduction">Introduction</h3>
<p>Ever since the advent of the so-called Next-Generation Sequencing (NGS), we have
been thinking about encoding all the population variations in a graph. That
was 2008. Now, six years later, the rapidly growing number of sequenced human
individuals continuously underscores the need for a graphical
representation of the existing sequences, which has led to many publications in
this direction, both in biology and in computer science.</p>
<p>The graph described above is a <em>population graph</em>. It captures variations
between <em>many</em> individuals/strains. A typical use of the graph is to
map the sequences of a new individual to the existing variations, in particular
large variations. Another type of sequence graph is the <em>assembly graph</em>.
It represents ambiguity in the assembly of a <em>single</em> individual. It aims, in
my view, to enable a modular approach to the development of assembly related
algorithms. Population graphs and assembly graphs serve different purposes. Of
course, there is no clear boundary between the two types. The assembly of a
polyploid individual encodes variations between several haplotypes. We can
also assemble sequence reads from many individuals. The resulting graph is both
an assembly graph and a population graph. I usually classify such a graph as a
population graph as its primary goal is to encode variations but not to derive
long contigs.</p>
<p>As a side note, I make a distinction between a population graph and an assembly
graph partly due to the suggestion of taking mapping as a killer application for
my proposed <a href="http://lh3.github.io/2014/07/19/a-proposal-of-the-grapical-fragment-assembly-format/">GFA format</a>. From my point of view, GFA is designed to be a
lightweight assembly format but not for encoding variations from a large number
of individuals. I will come back to this topic later.</p>
<h3 id="population-graphs">Population graphs</h3>
<p>There are two subclasses of population graphs depending on whether they rely on
a reference-guided multi-alignment.</p>
<h4 id="graphs-derived-from-multi-alignment">Graphs derived from multi-alignment</h4>
<p>A graph derived from a multi-alignment, or similarly from a VCF, inherits the
reference coordinates and annotations. It is closer to our current practice and
easier to understand. It is the more popular type of graph in the literature.
To construct a multi-alignment, we need the exact coordinates on the reference
genome. This poses several problems. Firstly, for the same set of sequences,
there can be more than one plausible alignment. I gave the following example
to <a href="http://genomicsandhealth.org">GA4GH</a>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Ref: AAGCTA--CTAG----CT AAGCTA------CTAGCT
Allele: AAGCTAGACTAGGAAGCT or AAGCTAGACTAGGAAGCT
(2 gap opens, 0 mismatch) (1 gap open, 2 mismatches)
</code></pre></div></div>
<p>In the two alignments, the reference sequence and the allele sequence are
exactly the same. However, in an affine gap penalty model, both alignments can
be optimal depending on the scoring system (this is very common in protein
sequence alignment). From this example, we can see that alignment can be
ambiguous and thus the resulting graph can be ambiguous, too, which is not a
nice feature. This is worse when the alignment is constructed from a VCF, where the
actual allele sequence may be broken down into small variants. When we use a
canonical VCF to represent the two alignments, the skeletons will look like:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ref 6 A AGA ref 6 A AGACTAG
ref 10 G GGAAG or ref 7 C G
ref 8 T A
</code></pre></div></div>
<p>Without the haplotype information, we will not know that the two VCFs are derived
from the same allele. This also leads to the second question: when to break
haplotype structure. On one hand, it is costly and unnecessary to keep
arbitrarily long haplotypes. On the other hand, if we always go for minimal
variations, we will have the problem above and, additionally, create new
haplotypes not seen in the data, given that multi-nucleotide changes may arise
from a different mechanism (<a href="http://arxiv.org/abs/1312.1395">Harris and Nielsen, 2014</a>).</p>
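<p>To make the ambiguity concrete, the two alignments above can be scored under
an affine gap penalty. The sketch below uses scoring parameters of my own
choosing (not from any particular aligner); each alignment comes out optimal
under one of the two systems:</p>

```python
def affine_score(ref, alt, match, mis, gap_open, gap_ext):
    """Score a column-wise alignment under an affine gap penalty:
    each gap costs gap_open + gap_ext * length."""
    score, in_gap = 0, False
    for r, a in zip(ref, alt):
        if r == '-' or a == '-':
            score -= gap_ext if in_gap else gap_open + gap_ext
            in_gap = True
        else:
            score += match if r == a else -mis
            in_gap = False
    return score

aln1 = ("AAGCTA--CTAG----CT", "AAGCTAGACTAGGAAGCT")  # 2 gap opens, 0 mismatches
aln2 = ("AAGCTA------CTAGCT", "AAGCTAGACTAGGAAGCT")  # 1 gap open, 2 mismatches

# With a mild mismatch penalty the single long gap wins ...
s1a, s2a = (affine_score(*a, 1, 1, 5, 1) for a in (aln1, aln2))
# ... while a harsher mismatch penalty favors the two short gaps.
s1b, s2b = (affine_score(*a, 1, 3, 4, 1) for a in (aln1, aln2))
```

<p>The first scoring system prefers alignment 2 and the second prefers
alignment 1, so neither alignment is unconditionally optimal.</p>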
<p>Thirdly and more importantly, multi-alignment does not capture structural
variations, in particular novel insertions missing from the reference genome,
and is sensitive to errors in the reference genome and to its version.</p>
<h4 id="graphs-derived-from-assembly-or-context-matching">Graphs derived from assembly or context matching</h4>
<p>Alternatively, a graph can be constructed from a set of sequences without
typical multi-alignment. Such a graph is immune to all the problems caused by
multi-alignment, but has difficulty incorporating the rich information
in the reference genome and annotation.</p>
<p>Personally, I prefer graphs independent of alignments, as they seem
theoretically cleaner. Nonetheless, from a practical point of view, alignment-based
graphs may be more useful in the short term.</p>
<h4 id="review-of-related-works-in-computer-science">Review of related works in computer science</h4>
<p>To the best of my knowledge, <a href="http://www.dcc.uchile.cl/~gnavarro/ps/spire08.3.pdf">Siren et al (2008)</a> was the first attempt
to encode multiple genomes. The basic idea is very simple: to keep individual
genomes in a run-length encoded self-index, such as <a href="http://en.wikipedia.org/wiki/Compressed_suffix_array">CSA</a> or <a href="http://en.wikipedia.org/wiki/FM-index">FM-index</a>.
The authors further polished the idea, added theoretical analyses and finally
published in a journal (<a href="http://www.ncbi.nlm.nih.gov/pubmed/20377446">Makinen et al, 2010</a>). Later <a href="http://www.dcc.uchile.cl/~gnavarro/ps/tcs12.pdf">Kreft and Navarro
(2012)</a> and <a href="http://arxiv.org/abs/1306.4037">Ferrada et al (2012)</a> proposed to use a variation
of <a href="http://en.wikipedia.org/wiki/LZ77_and_LZ78">LZ77</a>, the key algorithm behind GIF, PNG and zlib, to achieve a
similar goal. <a href="http://www.sciencedirect.com/science/article/pii/S0304397513005409">Do et al (2014)</a> also use LZ77, but approach the problem
from a different angle. They use the reference sequence to decompose individual
sequences, an idea originally developed for relative compression (<a href="http://link.springer.com/chapter/10.1007%2F978-3-642-16321-0_20">Kuruppu et
al, 2010</a>). These works do not require a multi-alignment between
sequences or a linear alignment against a reference genome. The following do.</p>
<p><a href="http://link.springer.com/chapter/10.1007%2F978-3-642-14355-7_19">Huang et al (2010)</a> assumed individual genomes share long blocks
of identical sequences and indexed the differences. They devised a data
structure based on FM-index to enable fast pattern matching while retaining the
sample information. <a href="http://www-igm.univ-mlv.fr/~lecroq/articles/spire2013.pdf">Na et al (2013)</a> solved a similar problem with a suffix
array of alignment. A concern with these two papers is that given many genomes,
shared blocks across all genomes may be quite short, which may hurt the
performance. <a href="http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=6470220">Alatabbi et al (2012)</a> attempted to address this issue
with a multi-level q-gram. The role of the reference becomes explicit. A few
others (<a href="http://www.informatik.hu-berlin.de/forschung/gebiete/wbi/research/publications/2012/refcomprsearch.pdf">Wandelt and Leser</a>; <a href="http://www-igm.univ-mlv.fr/~lecroq/articles/icibm2012.pdf">Barton et al, 2013</a>; <a href="http://dl.acm.org/citation.cfm?id=2511216">Yang et al,
2013</a>) followed a similar hashtable-based approach.</p>
<p>Another class of methods, which even predate NGS, encode multi-alignment as a
grammar (<a href="http://arxiv.org/pdf/1110.4493.pdf">Claude and Navarro, 2011</a>; <a href="http://www.dcc.uchile.cl/~gnavarro/ps/spire12.2.pdf">Abeliuk and Navarro,
2012</a>; <a href="http://arxiv.org/pdf/1109.3954v6.pdf">Gagie et al, 2012</a>). For example, we can encode
a biallelic indel as the regular grammar “ACGT(AGTC|AT)AT”. These methods are
frequently built upon other self-indices such as FM-index, LZ and suffix tree.
<a href="http://arxiv.org/abs/1010.2656">Siren et al (2010)</a> (and <a href="http://www.cs.helsinki.fi/u/jltsiren/papers/Siren2014.pdf">a recent update</a>) proposed to
convert a multi-alignment to a directed acyclic graph (DAG) and index the graph
with an extended CSA. It is similar to grammar-based indexing in that it
discards the sample information.</p>
<p>It should be noted that although there are many works on mapping against a
collection of related genomes, only a few provide an implementation, and even
fewer provide one that is practical for many human genomes. In contrast,
works in computational biology focus more on the practical aspect.</p>
<h4 id="review-of-related-works-in-computational-biology">Review of related works in computational biology</h4>
<p><a href="http://www.ncbi.nlm.nih.gov/pubmed/19761611">Schneeberger et al (2009)</a> made the first effort to explicitly model
variations in a population DAG. Similar to <a href="http://link.springer.com/chapter/10.1007%2F978-3-642-14355-7_19">Huang et al (2010)</a>,
it breaks the multi-alignment into entirely conserved regions and polymorphic
regions; but unlike the previous work, which directly keeps differential
sequences, it encodes these sequences in a graph and maps reads against the
graph. While Schneeberger et al provided a solution for a few <em>A. thaliana</em>
genomes, <a href="http://www.ncbi.nlm.nih.gov/pubmed/23813006">Huang et al (2013)</a> provided the first practical
implementation, BWBBLE, for many human genomes. This paper encodes SNPs with
<a href="http://en.wikipedia.org/wiki/Nucleic_acid_notation">IUB codes</a> and INDELs as separate contigs. For a query with SNPs, exact
mapping can be done by searching multiple SA intervals compatible with the
query (e.g. TAAG and YAAK at different positions are both compatible with
TAAG). However, mapping for a query with INDELs, which is harder and of more
interest to me, seems unexplained. The different treatment of SNPs and INDELs
seems a little inconsistent. <a href="http://www.ncbi.nlm.nih.gov/pubmed/25028723">Rahn et al (2014)</a> introduced the Journaled
String Tree (JST) to consistently represent both SNPs and INDELs as edits to
the reference genome. Unlike the previous works, Rahn et al do not
index the graph. Instead, read mapping is achieved by traversing the JST. This
strategy is similar to early Eland and MAQ. <a href="http://biorxiv.org/content/early/2014/07/08/006973">Dilthey et al (2014)</a>, in
my view, is an improvement on Schneeberger et al (2009). It features
cleaner and more extensible graph construction. The connection to the de Bruijn
graph is particularly interesting. I actually think we may simplify graph
construction further.</p>
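<p>The SA-interval compatibility test mentioned above (TAAG matching YAAK) boils
down to checking each query base against the set of bases encoded by the
reference IUB code. A minimal sketch (the function name is mine; the table is
the standard IUB code set):</p>

```python
# Standard IUB/IUPAC nucleotide codes mapped to the bases they encode.
IUB = {'A': 'A', 'C': 'C', 'G': 'G', 'T': 'T',
       'R': 'AG', 'Y': 'CT', 'S': 'CG', 'W': 'AT', 'K': 'GT', 'M': 'AC',
       'B': 'CGT', 'D': 'AGT', 'H': 'ACT', 'V': 'ACG', 'N': 'ACGT'}

def compatible(query, ref):
    """True if every query base is among the bases encoded by the
    corresponding IUB code in the reference."""
    return len(query) == len(ref) and all(q in IUB[r] for q, r in zip(query, ref))
```
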
<p>These works all require a multi-alignment as input. Worrying about the
instability of multi-alignment, <a href="http://arxiv.org/abs/1404.5010">Paten et al (2014)</a> sought a very
different solution. They proposed context mapping to relate genomes. <a href="http://biorxiv.org/content/early/2014/04/06/003954">Marcus et
al (2014)</a> use a graph of MEMs to describe the relationship between
genomes, though it is not for the purpose of mapping.</p>
<p>Another distinct approach to the construction of population graph is to
use assembly graph as a population graph (<a href="http://biorxiv.org/content/early/2014/04/06/003954">Iqbal et al, 2012</a>).</p>
<h3 id="assembly-graphs-and-the-gfa-format">Assembly graphs and the GFA format</h3>
<p>An assembly graph represents assembly ambiguities primarily caused by limited
read length. With PacBio reads, we can assemble most bacteria, model organisms
and humans into megabase-long contigs. For such contigs, we can do analyses
without considering the remaining ambiguity in the graph. Furthermore, we frequently
have other types of data such as physical maps, genetic maps, optical mapping
and Hi-C that provide long-range information to resolve ambiguities, making the
graph aspect of assembly even less important.</p>
<p>Then why do we need assembly graphs? <a href="http://pmelsted.wordpress.com/2014/07/17/dear-assemblers-we-need-to-talk-together/">Pall Melsted gave</a> the answer:
we need a common graph format to enable a modular approach to the development
of assembly algorithms. With a common format, we may be able to write a
scaffolder utilizing the existing relationship between contigs; with a common
format, we may have a generic module for graph simplification which is not so
trivial to implement; with a common format, we may take the best part from each
assembler to get better results; with a common format, we will be able to
develop a small component without writing a new assembler from scratch.
These will accelerate the development of assembly algorithms.</p>
<p>Take the SAM format as an analogy. Before SAM was widely adopted, there were
few generic tools; many mapper developers had to write variant callers because
without variant calling, mapping itself is not of much use. After the adoption
of the SAM format, developers are able to focus on the part they are best at.
We ended up with better mappers, better callers and more small tools. If we had
reached a consensus on the file format three years ago in the Assemblathon
mailing list, the success of SAM might have recurred. It is actually a little
late now, but there may still remain some benefits of having a common format.
That was why I proposed GFA.</p>
<p>GFA is not limited to being an assembly format. It can represent arbitrary
relationships between sequences and is thus, in theory, suitable as a population
graph format. However, I would not take applications on population graphs,
such as graph mapping, as a killer application of GFA. There are many open
questions on population graphs. I cannot design a format for unsolved
questions. GFA aims to be an assembly format only, at least for now.</p>
<h3 id="concluding-remarks">Concluding remarks</h3>
<p>Before I wrote this blog post, I had thought of a short article explaining the
difference between an assembly graph and a population graph. However, after I
started, many questions came into my mind: what is a population graph? what is
the use of it? how is it constructed? what do we expect to get from a
population graph? if we care about mapping, what output do we prefer? how could
we encode rearrangement and novel insertions? what is the current progress? and
how far are we from a practical solution? These questions motivated me to read
related works, which resulted in this extremely lengthy blog post, but in the
end, the answers to many of these questions remain unclear to me. More thoughts
needed…</p>
<p>(A final word. I kept arguing with myself when I wrote this blog post. Although
I have spent a lot of time, I have not paid enough attention to the wording,
clarity and the logic flow of the article. It may be hard to read. I
apologize.)</p>
<h2><a href="http://lh3.github.io/2014/07/23/first-update-on-gfa">First update on GFA</a> (2014-07-23)</h2>
<p>I was out of town for the past few days, so I have not been able to focus on
GFA. Now I am back to work to give the first update on the format based on the
comments from many people, which I appreciate a lot.</p>
<p>In comparison to my initial proposal, the first and the major change is to name
segments instead of the ends of segments. This seems to be the consensus so far.
Secondly, I am thinking of moving the quality field on the “S” line to an
optional tag. Not many assemblers produce quality or per-base read depth.
Thirdly, more people prefer to explicitly encode bubbles as multiple segments,
rather than inlining them in the sequence. I will use explicit bubbles at least
in the initial iteration.</p>
<p>Here is the graph from the previous post in the updated format:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>H VN:Z:1.0
S 1 CGATGCAA
L 1 + 2 + 5M
S 2 TGCAAAGTAC
L 3 + 2 + 0M
S 3 TGCAACGTATAGACTTGTCAC RC:i:4
L 3 + 4 - 1M1D2M1S
S 4 GCATATA
L 4 - 5 + 0M
S 5 CGATGATA
S 6 ATGA
C 5 + 6 + 2 4M
</code></pre></div></div>
<p>A little bit more formally, GFA consists of four types of lines indicated by
the first letter of each line. The format of each line is:
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Line Fixed fields Comments
---------------------------------------------------------------
H N/A Header
S segName,segSeq Segment
L segName1,segOri1,segName2,segOri2,CIGAR Link
C segName1,segOri1,segName2,segOri2,pos,CIGAR Contained
</code></pre></div></div>
<p>Here is a list of predefined tags:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Line Tag Type Comments
-----------------------------------------------------------------------
H VN Z Version number
L/S RC i # reads that support the segment/link
L/S FC i # fragments that support the segment/link
L/C MQ i Mapping quality of the overlap/containment
L/C NM i # mismatches/gaps
S LN i Segment length
</code></pre></div></div>
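<p>As a sanity check on the draft, here is a minimal parser sketch for the fixed
fields and tags above. The function and variable names are mine; real GFA is
tab-delimited, but splitting on any whitespace also handles the space-separated
example above:</p>

```python
NFIXED = {'H': 0, 'S': 2, 'L': 5, 'C': 6}  # fixed fields after the line type

def parse_gfa(text):
    """Parse the draft GFA into a header dict, segments, links and
    containments; SAM-style tags (TAG:TYPE:VALUE) become a dict."""
    header, segs, links, conts = {}, {}, [], []
    for line in text.strip().splitlines():
        f = line.split()
        n = NFIXED.get(f[0])
        if n is None:  # unknown line type: skip
            continue
        tags = {}
        for tag in f[1 + n:]:
            name, typ, val = tag.split(':', 2)
            tags[name] = int(val) if typ == 'i' else val
        if f[0] == 'H':
            header.update(tags)
        elif f[0] == 'S':
            segs[f[1]] = (f[2], tags)
        elif f[0] == 'L':
            links.append((f[1], f[2], f[3], f[4], f[5], tags))
        elif f[0] == 'C':
            conts.append((f[1], f[2], f[3], f[4], int(f[5]), f[6], tags))
    return header, segs, links, conts
```
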
<p>Discussions and open issues:</p>
<ol>
<li>
<p>How to describe complex overlaps with a simple syntax. Currently, GFA uses a
CIGAR, but I think it is a bit overcomplicated.</p>
</li>
<li>
<p>Random access to GFA. I am not quite sure how this is useful in practice,
but it is worth thinking about.</p>
</li>
<li>
<p>Small bubbles. Although I said that a few others and I would prefer to
encode bubbles as explicit segments in the initial iteration, I know a few
would like a better representation.</p>
</li>
<li>
<p>Where to keep the read-to-contig alignment. My preference is to keep them in
a separate BAM file.</p>
</li>
<li>
<p>Where to keep the segment sequences. My preference is to keep them in GFA.
Nonetheless, we still allow a “*” in the sequence field, so that the topology
can be described without the sequence data.</p>
</li>
<li>
<p>“Twin edges”. A link can be represented in two directions. My preference is
to allow both directions. The parser should throw a warning or an error if
the two directions are inconsistent.</p>
</li>
</ol>
<p>In the next step, I will write a standalone parser for GFA and clean up a few
dirty corners meanwhile. I will also try to write a few converters for existing
assembly formats by various assemblers and implement a few basic tools. If you
have any suggestions, please let me know. After all, I am not as experienced in
de novo assembly as most of the readers.</p>
<p>Finally, I should emphasize that the format has not been fixed at all, far from
it. Please keep the comments coming. The discussions so far are very helpful to
me. Thank you!</p>
<h2><a href="http://lh3.github.io/2014/07/20/alternatives-to-psmc">Alternatives to PSMC</a> (2014-07-20)</h2>
<p><a href="https://github.com/lh3/psmc">PSMC</a> is my program to infer the historical effective population size
from a diploid genome. It was <a href="http://www.nature.com/nature/journal/v475/n7357/full/nature10231.html">published in Nature</a> three years ago and
has been cited over 100 times so far. Whenever I see a PSMC plot in a paper, I
feel a moment of joy both as a scientist and as a programmer.</p>
<p>PSMC is okay, but there are now better models and implementations, at least in
theory. <a href="https://github.com/stschiff/msmc">MSMC</a>, which has recently been <a href="http://www.nature.com/ng/journal/vaop/ncurrent/full/ng.3015.html">published in Nature Genetics</a>,
not only extends PSMC to multiple haplotypes, but also improves PSMC
for a diploid genome. It precalculates transition matrices over long runs of
homozygosity and becomes fast enough to perform whole-genome inference without
binning the input like PSMC. More importantly, for a diploid genome, MSMC
implements the PSMC’ model. It is a better approximation to the
coalescent-with-recombination model by allowing non-effective recombinations.
It is able to give a much better estimate of the recombination rate. I was lazy
when I was working on PSMC. I knew PSMC’ was better, but I skipped it because
its derivation is more complicated and because PSMC worked well to infer other
parameters.</p>
<p>Another important tool is <a href="http://sourceforge.net/projects/dical/">dical</a> by Kelly Harris et al. It also uses a
better model and has a <a href="http://arxiv.org/abs/1403.0858">time complexity linear in the number of
states</a>. This is a significant advantage over the PSMC implementation
whose time complexity is quadratic in the number of states. Dical runs much
faster.</p>
<p>You can still use PSMC if you prefer. It comes with a few useful scripts and is
fast enough for most applications. Just be aware that there may be better options
when PSMC does not work for you.</p>
<h2><a href="http://lh3.github.io/2014/07/19/a-proposal-of-the-grapical-fragment-assembly-format">A proposal of the Graphical Fragment Assembly format</a> (2014-07-19)</h2>
<h3 id="introduction">Introduction</h3>
<p>Almost three years ago, there was a lengthy discussion in the Assemblathon
mailing list about a generic format for fragment assembly. The end product is
<a href="http://fastg.sourceforge.net">the FASTG format</a>. In the discussion, I have expressed several major
concerns with the format. The top one is that it is mathematically wrong. Three
years later, FASTG is still not widely used. At this point, <a href="http://www.iscb.org/ismb-mm/media-ismb2014/talks">Adam
Phillippy</a> and <a href="http://pmelsted.wordpress.com/2014/07/17/dear-assemblers-we-need-to-talk-together/">Pall Melsted</a> openly called for a generic
assembly format again. I also feel the pressing need for standardization, so I
decided to give it a try myself. This is the Graphical Fragment Assembly format, or
GFA in abbreviation.</p>
<p>In this post, I will start from the theoretical basis of assembly graphs,
describe the format and finally discuss potential issues with the proposal.</p>
<p>I showed an earlier version of this format to Richard Durbin, Daniel Zerbino and
Benedict Paten last night in Oxford. That version was a variant of FASTA. When I
was formalizing the format in this post, I found FASTA is too crowded and too
limited. Following the suggestion of Daniel, I finally adopted a format similar
to <a href="https://github.com/jts/sga/wiki/ASQG-Format">ASQG</a> and the PSMC output.</p>
<h3 id="theory">Theory</h3>
<p>DNA sequence assembly is often (though not always) represented as a graph.
There are multiple types of graphs including de Bruijn graph, overlap graph,
unitig graph and string graph. They are all <a href="http://en.wikipedia.org/wiki/Bidirected_graph">bidirected graphs</a>. Briefly,
in this graph, each vertex is a sequence and each arc is an overlap. Because
DNA sequences have two strands, an arc may have four directions, representing
the four possible overlaps: forward-forward, forward-reverse, reverse-forward
and reverse-reverse. It should be noted that a k-mer de Bruijn graph is
equivalent to an overlap graph for k-mer reads with (k-1)-mer overlaps.
It is a bidirected graph, too.</p>
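<p>The noted equivalence is easy to see with a toy sketch (the names are mine):
treating each k-mer as a read, the edges of a de Bruijn graph are exactly the
(k-1)-mer overlaps between those reads:</p>

```python
def debruijn_edges(kmers):
    """Edges of a de Bruijn graph over the given k-mers: u links to v
    when u's (k-1)-suffix equals v's (k-1)-prefix, i.e. the two k-mer
    'reads' overlap by k-1 bases."""
    return [(u, v) for u in kmers for v in kmers if u[1:] == v[:-1]]

# The 3-mers of ACGTA chain together by their 2-mer overlaps.
edges = debruijn_edges(['ACG', 'CGT', 'GTA'])
```
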
<p>The critical problem with FASTG is that it puts sequences on arcs/edges. It is
unable to describe a simple topology such as <code class="language-plaintext highlighter-rouge">A->B; C->B; C->D</code> without adding a
dummy node, which breaks the theoretical elegance of assembly graphs. Due to the
historical confusion between vertices and edges, I will avoid using these
terminologies. I will use a <em>segment</em> for a piece of sequence and a <em>link</em> for a
connection between segments.</p>
<h3 id="the-gfa-format">The GFA format</h3>
<p>Although we can describe an assembly graph with bidirected arcs, I find that,
in practice, it is easier and more explicit to describe links between the ends of
segments. <a href="http://en.wikipedia.org/wiki/Eugene_Myers">Gene Myers</a> took a similar approach in his <a href="http://bioinformatics.oxfordjournals.org/content/21/suppl_2/ii79.abstract">string graph
paper</a>. Based on this observation, I <em>uniquely</em> label the 5’-end and
the 3’-end of each segment. The following shows an assembly graph with seven
segments in GFA:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>H VN:Z:1.0
S 1 2 CGATGCAA *
L 2 3 5M
S 3 4 TGCAAAGTAC *
L 3 6 0M
S 5 6 TGCAACGTATAGACTTGTCAC * RC:i:4
L 6 8 1M1D2M1S
S 7 8 GCATATA *
L 7 9 0M
S 9 10 CGATGATA *
S 11 12 ATGA *
C 9 11 2 4M
</code></pre></div></div>
<p>If we name a segment with the two <em>ordered</em> integers, the example above is
equivalent to a bidirected graph <code class="language-plaintext highlighter-rouge">1:2>->3:4; 5:6>->3:4; 5:6>-<7:8<->9:10</code> with
<code class="language-plaintext highlighter-rouge">11:12</code> contained in <code class="language-plaintext highlighter-rouge">9:10</code>. The <code class="language-plaintext highlighter-rouge">H</code> line is the header. An <code class="language-plaintext highlighter-rouge">S</code> line describes a
segment, which consists of the 5’-end label, the 3’-end label, the sequence and
pseudo-quality. An <code class="language-plaintext highlighter-rouge">L</code> line represents a link which consists of the labels of
the two ends and a CIGAR that describes the overlap alignment taking the first
end as the target/upper sequence. The CIGAR can describe symmetric overlaps
(e.g. <code class="language-plaintext highlighter-rouge">5M</code>), assembly gaps (e.g. <code class="language-plaintext highlighter-rouge">10N</code>), gapped overlaps, open-end alignments
(e.g. <code class="language-plaintext highlighter-rouge">1M1D2M1S</code>; a leading <code class="language-plaintext highlighter-rouge">S</code> for clipping on the second sequence and a trailing
<code class="language-plaintext highlighter-rouge">S</code> on the first), or unaligned overlaps (e.g. <code class="language-plaintext highlighter-rouge">5S10I8D2S</code>; no <code class="language-plaintext highlighter-rouge">M</code> operators).
It is related to but different from the CIGAR used in SAM. A <code class="language-plaintext highlighter-rouge">C</code> line represents
a containment, which is only relevant to read-to-read overlaps.</p>
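<p>A small helper for working with these overlap CIGARs. It only tokenizes the
string and tests for the no-<code class="language-plaintext highlighter-rouge">M</code> case described above; the names are mine:</p>

```python
import re

def parse_cigar(cigar):
    """Tokenize an overlap CIGAR into (length, op) pairs."""
    return [(int(n), op) for n, op in re.findall(r'(\d+)([MIDNSP=X])', cigar)]

def is_unaligned(cigar):
    """An unaligned overlap, per the description above, has no M operator."""
    return all(op != 'M' for _, op in parse_cigar(cigar))
```
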
<p>For all lines, additional information is described with tags in a format
identical to SAM. Predefined tags include:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Line Tag Type Meaning
-----------------------------------------------------------------------
H VN Z Version number
H QT A Type of pseudo-quality. Valid values: `Q`, `D` or `K`
S RC i # reads assembled into the segment
L/C MQ i Mapping quality of the overlap/containment
L NM i # mismatches/gaps
S LN i Segment length
</code></pre></div></div>
<h3 id="discussions">Discussions</h3>
<ol>
<li>
<p>If this format cannot encode your assembly, please let me know. Thank you.
Suggestions on making GFA work would be appreciated even more. :-)</p>
</li>
<li>
<p>It is unusual to uniquely label the two ends of a segment. <a href="http://www.bcgsc.ca/platform/bioinfo/software/abyss">ABySS</a>, <a href="https://github.com/jts/sga">SGA</a> and
most other assemblers uniquely label a segment. In my view, end-labeling has a
few advantages: a) it requires fewer operations for reverse-complementing and
unambiguous merging; b) by representing a bidirected arc with <code class="language-plaintext highlighter-rouge">A+,B-</code>, we are
still converting <code class="language-plaintext highlighter-rouge">A</code> to two labels; c) my own assembler only works with
end-labeling. I think it should always be easy to convert the segment-labeling
to the end-labeling but not vice versa. Unless there are strong arguments
against end-labeling, I will keep it.</p>
</li>
<li>
<p>Use a string to label an end. I like integers for efficiency, but don’t
object to strings in principle.</p>
</li>
<li>
<p>I don’t like the CIGAR I proposed. It is too complex. If you can find a
cleaner way to describe all kinds of overlaps and gaps, please let me know.
These complex overlaps are not uncommon in a long-read assembly or for
scaffolding.</p>
</li>
<li>
<p>In FASTG, we can encode a simple “bubble” with <code class="language-plaintext highlighter-rouge">ACGT[C,T]TAGT</code>. Although GFA
can describe this assembly, it needs to add three more segments and four links,
which are quite heavy. One option is to allow such simple bubbles on the <code class="language-plaintext highlighter-rouge">S</code>
line with a specific header tag indicating that the file contains small bubbles.
Is it a good idea? How many assemblers can take advantage of this potential
addition?</p>
</li>
<li>
<p>If we can agree on a format, I can write a parser and a few basic tools such
as flip, unambiguous merge and perhaps more complex operations such as tip
trimming and bubble popping if I have time.</p>
</li>
<li>
<p>Any other suggestions?</p>
</li>
</ol>
<h3 id="update">Update</h3>
<p>I am considering replacing end-labeling with the more common segment-labeling. The
example above will look like (better or worse?):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>H VN:Z:1.0
S 1 CGATGCAA *
L 1 + 2 + 5M
S 2 TGCAAAGTAC *
L 3 + 2 + 0M
S 3 TGCAACGTATAGACTTGTCAC * RC:i:4
L 3 + 4 - 1M1D2M1S
S 4 GCATATA *
L 4 - 5 + 0M
S 5 CGATGATA *
S 6 ATGA *
C 5 + 6 + 2 4M
</code></pre></div></div>
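<p>The conversion from end-labeling to segment-labeling can be sketched as
follows. The helper name and the normalization of a reverse-reverse link to its
twin are my own choices, inferred from comparing the two examples:</p>

```python
# The (5'-end, 3'-end) label pairs of the end-labeled example; segment i+1
# in the segment-labeled version corresponds to the i-th pair.
END_LABELED_SEGS = [(1, 2), (3, 4), (5, 6), (7, 8), (9, 10), (11, 12)]

def convert_link(e1, e2, segs):
    """Convert an end-labeled link (e1, e2) to (name1, ori1, name2, ori2).
    Leaving the first segment via its 3'-end, or entering the second via
    its 5'-end, means the forward strand; a link that is reverse on both
    sides is flipped to its equivalent twin."""
    five = {s[0]: i + 1 for i, s in enumerate(segs)}   # 5'-end label -> name
    three = {s[1]: i + 1 for i, s in enumerate(segs)}  # 3'-end label -> name
    n1, o1 = (three[e1], '+') if e1 in three else (five[e1], '-')
    n2, o2 = (five[e2], '+') if e2 in five else (three[e2], '-')
    if o1 == '-' and o2 == '-':
        n1, o1, n2, o2 = n2, '+', n1, '+'
    return n1, o1, n2, o2
```
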
<h2><a href="http://lh3.github.io/2014/07/13/on-the-trend-of-disk-based-algorithms">On the trend of disk-based algorithms</a> (2014-07-13)</h2>
<p>I am looking for a good k-mer counter to replace the slow k-mer counting phase
in fermi2. A quick survey reveals that except Jellyfish and BFCounter, most
k-mer counters are disk-based, in the sense that they heavily rely on temporary
files to reduce memory.</p>
<p>This is not a good trend in my view. In large institutions and universities
with centralized computing resources, the majority of file systems are accessed
over the network. Network I/O is frequently slow. This becomes even worse in
a production setting when multiple instances of a tool are running at the
same time, all putting heavy stress on I/O. In the end, the practical
performance of disk-based tools may be far from the optimal numbers shown in
the papers.</p>
<p>I would encourage the development of in-memory algorithms. Disks are not really
a good substitute for large memory.</p>
<h2><a href="http://lh3.github.io/2014/07/12/about-static-linking">About static linking</a> (2014-07-12)</h2>
<p>One of the things I hate most about Linux is to compile software. Sometimes it
is a nightmare: lack of root permission, requirement of new gcc, dependencies
on huge or weird libraries, etc. Whenever this happens, I ask myself: why not
just distribute statically linked binaries such that they can run on most Linux
distributions? I knew a few reasons, but only today I took the question a
little more seriously and did a Google search. The following two links are
quite useful: <a href="http://stackoverflow.com/questions/1993390/static-linking-vs-dynamic-linking">static linking vs. dynamic linking</a> and <a href="http://www.akkadia.org/drepper/no_static_linking.html">static linking
considered harmful</a>.</p>
<p>In summary, static linking has the following disadvantages: more likely to be
attacked, not receiving patches to dynamic libraries, more memory-hungry, not
truly static, sometimes not flexible and potentially violating GPL. I buy all
these arguments. However, for tools in bioinformatics, these are not big
concerns because most bioinformatics tools:</p>
<ul>
<li>
<p>are not system utilities and are not security-critical.</p>
</li>
<li>
<p>are only linked to small dynamic libraries. Statically linking the tools will
not cost much memory.</p>
</li>
<li>
<p>do not often use glibc features that have to be dynamically linked.</p>
</li>
<li>
<p>are distributed under a license compatible with LGPL.</p>
</li>
</ul>
<p>Most command-line bioinformatics tools can be statically linked without
problems. And I think we should create a repository for precompiled
bioinformatics tools. This will at least make my life much easier. What about
you?</p>
<h2><a href="http://lh3.github.io/2014/07/07/abreak-evaluating-de-novo-assemblies">Abreak: evaluating de novo assemblies</a> (2014-07-07)</h2>
<p>“Abreak” is a subcommand of <a href="https://github.com/lh3/htsbox">htsbox</a>, which is a little <em>toy</em> forked from the lite branch of <a href="https://github.com/samtools/htslib">htslib</a>. It
takes an assembly-to-reference alignment as input and counts the number of
alignment break points. An earlier version was used in <a href="http://bioinformatics.oxfordjournals.org/content/28/14/1838">my fermi paper</a>
to measure the misassembly rate of human de novo assemblies. A typical output
looks like:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Number of unmapped contigs: 239
Total length of unmapped contigs: 54588
Number of alignments dropped due to excessive overlaps: 0
Mapped contig bases: 2933399461
Mapped N50: 6241
Number of break points: 102146
Number of Q10 break points longer than (0,100,200,500)bp: (28719,7206,4644,3222)
Number of break points after patching gaps short than 500bp: 94298
Number of Q10 break points longer than (0,100,200,500)bp after gap patching: (23326,5320,3369,2194)
</code></pre></div></div>
<p>Here it gives the mapped contig bases, mapped N50 and number of break points
with flanking sequences longer than 0, 100, 200 and 500bp.</p>
<p>Although <a href="http://ccb.jhu.edu/gage_b/">GAGE-B</a> and <a href="http://bioinf.spbau.ru/en/quast">QUAST</a> are more powerful, the use of
MUMmer limits them to small genomes only. In contrast, “abreak” works with any
aligners supporting chimeric alignment. When BWA-SW or BWA-MEM is used to map
contigs, “abreak” can easily and efficiently work with mammal-sized assemblies.</p>
<p><strong>UPDATE on 11/24/2014</strong>: changed links to htslib/htsbox. BTW, the results vary
with mapping parameters. If contigs and references are very close to each
other, it is recommended to use:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>bwa mem -B9 -O16 -E1
</code></pre></div></div>
<p>for mapping. The default works better when the divergence is high.</p>
<h2><a href="http://lh3.github.io/2014/07/05/random-access-to-zlib-compressed-files">Random access to zlib compressed files</a> (2014-07-05)</h2>
<p><a href="http://www.zlib.net">Zlib</a>/<a href="http://www.gzip.org">gzip</a> is probably the most popular library/tool for general
data compression. In zlib, there is an API <code class="language-plaintext highlighter-rouge">gzseek()</code> which places the file
position indicator at a specified offset in the uncompressed file. However,
whenever it gets called, it starts from the beginning of the file and reads
through all the data up to the specified offset. For huge files, this is very
slow.</p>
<p>It is actually possible to achieve faster random access in a generic gzip file.
The <a href="http://www.opensource.apple.com/source/zlib/zlib-22/zlib/examples/zran.c">zran.c</a> in the zlib source code package gives an example
implementation. It works by keeping the 32kB of uncompressed data right before an
access point. With the 32kB data, we can decompress data after the access
point - we do not need to decompress from the beginning. My friend Jue Ruan
found this example and and implemented <a href="https://sourceforge.net/p/maq/code/HEAD/tree/trunk/maqview/zrio.h">zrio</a>, a small library that
keeps the 32kB data in an index file to achieve random access to generic gzip
files. This library is used in <a href="http://maq.sourceforge.net/maqview.shtml">maqview</a>.</p>
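The principle can be illustrated with Python's zlib bindings: decompression can resume mid-stream as long as the 32kB of data preceding the access point is supplied as a preset dictionary. This is a toy sketch of the idea, not zran itself (zran additionally handles access points that fall on arbitrary bit boundaries):

```python
import zlib

# Stand-in for the last 32 kB of uncompressed data before an access point.
window = (b"some earlier stream content " * 2000)[-32768:]
payload = b"data that follows the access point"

# Compress the payload as raw deflate with the window as a preset dictionary,
# mimicking a compressor whose sliding window holds the preceding 32 kB.
co = zlib.compressobj(6, zlib.DEFLATED, -15, zdict=window)
comp = co.compress(payload) + co.flush()

# A fresh decompressor recovers the payload, but only when given the same
# 32 kB dictionary; nothing before the access point needs to be re-read.
do = zlib.decompressobj(-15, zdict=window)
assert do.decompress(comp) == payload
```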
<p>However, keeping 32kB data per access point is quite heavy. To drop this 32kB
dependency, Jue sought a better solution: calling
<code class="language-plaintext highlighter-rouge">deflate(stream,Z_FULL_FLUSH)</code> every 64kB. After <code class="language-plaintext highlighter-rouge">Z_FULL_FLUSH</code>, we can decompress
the following data independent of the previous data – keeping 32kB is not
necessary any more. The resultant compressed stream is still fully compatible
with zlib. Jue implemented this idea in <a href="https://github.com/lh3/samtools-legacy/blob/master/razf.h">RAZF</a>. In addition to this
stream reset, RAZF also writes an index table at the end of the file. Given
an uncompressed offset, we can look up the table to find the nearest access
point preceding the offset to achieve random access. The index is much smaller
and random access is much faster.</p>
<p>The first prototype of <a href="http://samtools.sourceforge.net">BAM</a> was using RAZF. At that time, a major concern
was that RAZF uses low-level zlib APIs that are not exposed in the zlib
bindings of other programming languages. This would limit the adoption of BAM. The size of the
index might also become a concern given >100GB files. In the discussion,
<a href="http://www.well.ox.ac.uk/dr-gerton-lunter">Gerton Lunter</a> directed us to <a href="http://linuxcommand.org/man_pages/dictzip1.html">dictzip</a>, another tool for
random access in gzip-compatible files. Dictzip would not work well for a huge
BAM due to the constraint of the gzip header. However, its key idea –
concatenating small gzip blocks – led Bob Handsaker to design something
better: <a href="http://samtools.github.io/hts-specs/SAMv1.pdf">BGZF</a> (section 4.1).</p>
<p>The key observation Bob made in BGZF is that when we seek the middle of a
compressed file, all we need is a virtual position which is not necessarily the
real position in the uncompressed file. In BGZF, the virtual position is a
tuple <code class="language-plaintext highlighter-rouge">(block_file_position,in_block_offset)</code>, where <code class="language-plaintext highlighter-rouge">block_file_position</code> is
the file position, in the compressed file, of the start of a gzip block and
<code class="language-plaintext highlighter-rouge">in_block_offset</code> is the offset within the uncompressed gzip block. With the
tuple, we can unambiguously pinpoint a byte in the uncompressed file. When we
keep the tuple in an index file, we can jump to the position without looking up
another index. BGZF is smaller than RAZF and easier to implement. It has been
implemented in C, Java, JavaScript and Go. Recently, Petr Danecek has <a href="https://github.com/samtools/htslib/blob/develop/htslib/bgzf.h">extended
BGZF</a> with an extra index file to achieve random access by offset in the
uncompressed file.</p>
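As described in the SAM specification (section 4.1), the tuple is packed into a single 64-bit integer, with the lower 16 bits holding the in-block offset (a BGZF block holds less than 64kB of uncompressed data). A minimal sketch:

```python
def pack_voffset(coffset: int, uoffset: int) -> int:
    """Pack (block_file_position, in_block_offset) into a 64-bit virtual offset."""
    assert 0 <= uoffset < 1 << 16      # a BGZF block holds < 64 kB uncompressed
    return (coffset << 16) | uoffset

def unpack_voffset(voffset: int) -> tuple:
    """Recover the (block_file_position, in_block_offset) tuple."""
    return voffset >> 16, voffset & 0xFFFF

# Virtual offsets compare in file order, which is what index lookups rely on.
assert unpack_voffset(pack_voffset(123456789, 42)) == (123456789, 42)
assert pack_voffset(100, 0) < pack_voffset(100, 1) < pack_voffset(200, 0)
```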
<p>In the analysis of high-throughput sequencing data, BGZF plays a crucial role
in reducing the storage cost while keeping the data easily accessible. It is a
proven technology that scales to terabytes of data.</p>
My blog2014-07-05T00:00:00+00:00http://lh3.github.io/2014/07/05/my-blog
<p>I have a <a href="http://lh3lh3.users.sourceforge.net">homepage</a>. It is using a CSS template heavily modified from a
theme from <a href="http://en.wikipedia.org/wiki/Google_Page_Creator">Google Pages</a>, which has long been discontinued. This
template has been used for the <a href="http://samtools.sourceforge.net">SAMtools website</a> among a few others.
I still like the general look-and-feel of the template, but I do have a few
concerns with my homepage. Firstly, I do not like writing raw HTML, as
opposed to Markdown. Secondly, as a novice web developer, I still have trouble
fine-tuning the CSS template. Thirdly and most importantly, I am unable to
get feedback from readers. While there are services like <a href="https://disqus.com">Disqus</a>, I
am too lazy to learn how to integrate them. After all, I am a scientific
researcher. Due to these concerns, I have not updated my homepage often in the
past few years.</p>
<p>On the other hand, I sometimes feel it is necessary to describe my preliminary
work and express my half-formed thoughts in short articles. Maybe some of them
could be useful to others as well. Blog posts seem like a good way to achieve this
goal. That is why I started this blog. I do not know where it will go or whether
I will update it often, but a little more documentation is better than none, I
believe.</p>