Format, quality binning and file size

25 May 2020

This short post evaluates the effect of format and quality binning on file sizes. I am taking SRR2052362 as an example. It gives 4.3-fold coverage on the human genome. For 2-binning, I set original qualities of 20 or above to 30 and original qualities below 20 to 10. For 8-binning, I took the scheme from a white paper (PDF) published by Illumina. Illumina has been using quality binning for more than seven years. In this experiment, I only retained the original read names. To produce CRAM files, I mapped the short reads to the GRCh38 primary assembly. The following table shows the file sizes:
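
As an illustration of the 2-binning rule described above, here is a minimal Python sketch. It is not the code used for this experiment. It assumes the quality string uses the common Phred+33 FASTQ encoding, and the function name `bin_quality_2` and its parameters are made up for this example.

```python
PHRED_OFFSET = 33  # assumption: Phred+33 (Sanger / Illumina 1.8+) encoding


def bin_quality_2(qual, threshold=20, high=30, low=10):
    """Collapse a FASTQ quality string to two levels.

    Qualities >= threshold become `high`; all others become `low`.
    """
    binned = []
    for ch in qual:
        q = ord(ch) - PHRED_OFFSET           # decode the Phred score
        new_q = high if q >= threshold else low
        binned.append(chr(new_q + PHRED_OFFSET))  # re-encode
    return "".join(binned)


# Qualities 40, 20, 19 and 2 map to 30, 30, 10 and 10.
print(bin_quality_2("I54#"))  # prints ??++
```

After binning, the quality string contains only two distinct symbols, which is why it compresses so much better than the original.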

Format	# qual bins	Size (GB)	Change relative to SRA
Sorted CRAM	2 bins	1.187	-85%
Unsorted CRAM	2 bins	1.279	-84%
Unsorted CRAM	8 bins	2.115	-73%
Gzip'd FASTA	No quality	4.172	-47%
Unsorted CRAM	Lossless	4.536	-43%
Gzip'd FASTQ	2 bins	4.784	-40%
SRA	Lossless	7.917	0%
Gzip'd FASTQ	Lossless	9.210	+16%
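
The last column of the table is each size relative to the SRA file, rounded to the nearest percent. A small sketch that recomputes it from the sizes above (the names `SIZES_GB` and `change_vs_sra` are just for this example):

```python
# Sizes in GB, copied from the table above.
SIZES_GB = {
    ("Sorted CRAM", "2 bins"): 1.187,
    ("Unsorted CRAM", "2 bins"): 1.279,
    ("Unsorted CRAM", "8 bins"): 2.115,
    ("Gzip'd FASTA", "No quality"): 4.172,
    ("Unsorted CRAM", "Lossless"): 4.536,
    ("Gzip'd FASTQ", "2 bins"): 4.784,
    ("SRA", "Lossless"): 7.917,
    ("Gzip'd FASTQ", "Lossless"): 9.210,
}

SRA_GB = SIZES_GB[("SRA", "Lossless")]


def change_vs_sra(size_gb, baseline_gb=SRA_GB):
    """Percent change in size relative to the SRA file, rounded."""
    return round((size_gb / baseline_gb - 1) * 100)


for (fmt, bins), size in SIZES_GB.items():
    print(f"{fmt}\t{bins}\t{size:.3f}\t{change_vs_sra(size):+d}%")
```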

It is clear that the CRAM format is the winner here, and its advantage becomes more prominent at lower quality resolution. A key question is how much quality binning affects variant calling. Brad Chapman concluded that 8-binning had little effect on variant calling accuracy. With Crumble, James Bonfield could get slightly higher accuracy with lossy compression. FermiKit effectively uses 2-binning and can achieve decent results. I applied 2-binning to GATK many years ago and observed that 2-binning barely reduced accuracy. The GATK team at Broad Institute also evaluated 2-binning and 4-binning. They found 4-binning was better than 2-binning and was as good as original quality. The overall message is that we don’t need full quality resolution to make accurate variant calls for germline samples. The effect on tumor samples is more of an open question, though.

It is worth noting that completely discarding base quality dramatically reduces variant calling accuracy. I have observed this both with FermiKit and with GATK (unfortunately, I didn’t keep the results). This is because low-quality Illumina sequencing errors are correlated: if one low-quality base is wrong, other low-quality bases tend to be wrong in the same way. Without base quality, variant callers wouldn’t be able to identify such recurrent errors.