28 September 2018

SAM is a text format that is typically used to store the alignment of high-throughput sequence reads against a reference genome. BAM, designed at the same time, is the first binary representation of SAM. BAM is smaller and faster to process, and has additional features such as random access.

BAM is not optimal in terms of compression ratio. By reorganizing binary data and using more advanced compression techniques, we can make alignments much more compressible. There have been many attempts to replace BAM with a better binary format, such as DeeZ, Quip, cSRA (PPT), GenComp, cSAM, Goby and samcomp (see Hosseini et al. (2016) for a more thorough review). The team maintaining the SAM spec eventually adopted CRAM as the future alignment format. CRAM is much smaller than BAM and has a similar feature set. With the new codec implemented in scramble, it is as fast as BAM in routine data processing.

MPEG-G is a new binary format that aims to replace BAM. The abstract of its preprint claims a “10x improvement over the BAM format”. However, in the only compression ratio comparison, Figure 3, the MPEG-G file is only 6.54 times smaller, not 10 times. In addition, Figure 3 suggests that sequences and qualities are of different sizes in SAM (green vs orange). This could happen (some reads don’t have qualities), but it is very rarely the case in real-world BAMs. I am also surprised by Figure 3a, where MPEG-G compresses qualities much more than sequences. On real data produced today, qualities are harder to compress because they don’t follow a clear pattern. I suspect the authors are employing lossy compression, possibly with one of the algorithms developed by a contributor to MPEG-G. Furthermore, the usability of a format is about more than compression ratio. Encoding and decoding also have to be fast, and the preprint shows no evaluation of either. James Bonfield, the developer behind the latest CRAM, has similar concerns about their previous results.
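The point about sequences and qualities having the same size in SAM is easy to check on your own data. The following is only a minimal sketch, not anything taken from the preprint: it assumes uncompressed SAM text as input, and the function name `seq_qual_bytes` is made up for illustration. It sums the lengths of the SEQ and QUAL columns (fields 10 and 11), skipping the `*` placeholder that SAM uses when a field is absent.

```python
import io

def seq_qual_bytes(sam_lines):
    """Sum the characters in the SEQ and QUAL columns over all alignment records.

    `sam_lines` is any iterable of SAM text lines (hypothetical helper,
    written for illustration only).
    """
    seq_total = 0
    qual_total = 0
    for line in sam_lines:
        if line.startswith("@"):       # header lines start with '@'
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:           # not a complete alignment record
            continue
        seq, qual = fields[9], fields[10]
        if seq != "*":                 # '*' means the sequence is not stored
            seq_total += len(seq)
        if qual != "*":                # '*' means the qualities are not stored
            qual_total += len(qual)
    return seq_total, qual_total

# Toy input: the second read has no qualities, so the totals differ.
example = io.StringIO(
    "@HD\tVN:1.6\tSO:coordinate\n"
    "r1\t0\tchr1\t100\t60\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII\n"
    "r2\t0\tchr1\t120\t60\t8M\t*\t0\t0\tTTGGCCAA\t*\n"
)
print(seq_qual_bytes(example))
```

On a real file, one could pass an open file handle holding SAM text. The two totals are equal whenever every read carries qualities, and differ only when some records store `*` in one of the two columns.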

Much of the above is my speculation. I could be wrong. And it is easy to prove me wrong: make the data and software available and let the world reproduce Figure 3. Unfortunately, although the MPEG-G specification is available, the implementation and the benchmark data are not. This leads to my next point:

MPEG-G is an open standard endorsed by ISO. However, open doesn’t mean free. Remember the royalties imposed by H.264/MPEG-4 AVC? MPEG-G may be going down the same route. Key contributors are applying for patents and may have a financial interest in the format. Until the MPEG-G authors 1) open source the reference implementation and 2) make the format royalty-free like AV1, I recommend that everyone use BAM or CRAM.

Disclaimer: I was the key contributor to BAM, the format that CRAM and MPEG-G aim to replace, and I am still a contributor to the SAM/BAM spec and its reference implementation. I have no competing financial interests in SAM/BAM/CRAM or its reference implementation htslib.
