12 April 2024

My blog posts are often motivated by casual conversations with other scientists. This blog post is an example. Someone asked me how I thought of using a compression algorithm for alignment in BWA. The short answer is: I didn’t; I learned from Tak-Wah Lam.

Here is the longer story. On a 1000G conference call in April 2008, as I remember, Tak-Wah gave a short talk on a prototype of SOAP2 and showed that it was faster than MAQ and SOAP by over an order of magnitude. Honestly, I thought it was too good to be real (sorry, Tak-Wah!). Tak-Wah mentioned BWT but did not describe the algorithm. I was curious about how that works and found his BWT-SW paper published earlier that year. I learned the basic of BWT/FM-index from the BWT-SW paper. FM-index only gives exact matches. I came up with the backtracking idea independently to allow mismatches and gaps.

I started to implement BWA in late May 2008. Realizing efficient BWT construction is challenging, I adapted the BWT-SW implementation which remains in BWA. It took me another 2-3 months to get a decent prototype working for single-end reads. That version was over ten times faster than MAQ — Tak-Wah was correct of course. After I released samtools, I came back to BWA. I added the support of paired-end reads, made BWA the first aligner to output SAM and published it in mid 2009. SOAP2 was actually published one month later than BWA, but without SOAP2, there would be no BWA.



blog comments powered by Disqus