A reimplementation of symmetric DUST

05 October 2015

I have just reimplemented the symmetric DUST algorithm (SDUST) for masking low-complexity regions. The program depends on kdq.h (double-ended queue) and kvec.h (simple vector); the command-line interface further requires kseq.h for FASTA/Q parsing. On human chr11, the output is identical to that of NCBI’s dustmasker except at assembly gaps, and this implementation is four times as fast as dustmasker. I have also compared this implementation to mdust, which is supposed to be a reimplementation of the original asymmetric DUST. Under the same score threshold, the mdust result seems to differ significantly from SDUST/dustmasker. I haven’t looked into the cause.
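
For readers unfamiliar with what the score threshold applies to, here is a minimal sketch in C of the triplet-based DUST score as I understand it from the paper: count each 3-mer in a sequence, sum c*(c-1)/2 over the 3-mer counts, and divide by the number of triplets minus one. This is not code from the implementation itself. The names `base_code` and `dust_score` are hypothetical, and skipping triplets that contain a non-ACGT base is my own assumption for the sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Map a base to 0..3; return -1 for anything else (e.g. N). */
static int base_code(char c)
{
    switch (c) {
    case 'A': case 'a': return 0;
    case 'C': case 'c': return 1;
    case 'G': case 'g': return 2;
    case 'T': case 't': return 3;
    default: return -1;
    }
}

/* Hypothetical helper: DUST-style score of seq[0..len).
 * score = sum over 3-mers t of c_t*(c_t-1)/2, divided by (l-1),
 * where c_t is the count of 3-mer t and l is the number of counted triplets. */
double dust_score(const char *seq, int len)
{
    int counts[64] = {0};
    int n_triplets = 0, sum = 0, i;
    assert(seq != NULL && len >= 0);
    for (i = 0; i + 2 < len; ++i) {
        int a = base_code(seq[i]);
        int b = base_code(seq[i + 1]);
        int c = base_code(seq[i + 2]);
        if (a < 0 || b < 0 || c < 0) continue; /* assumption: skip ambiguous triplets */
        ++counts[(a << 4) | (b << 2) | c];
        ++n_triplets;
    }
    if (n_triplets < 2) return 0.0; /* score is undefined for fewer than two triplets */
    for (i = 0; i < 64; ++i)
        sum += counts[i] * (counts[i] - 1) / 2;
    return (double)sum / (n_triplets - 1);
}
```

For example, a run of ten A's contains eight AAA triplets, so the score is 28/7 = 4.0, whereas ACGTACGTAC has four distinct triplets occurring twice each and scores 4/7, about 0.57. A region would be a masking candidate when its score exceeds the chosen threshold. The actual SDUST algorithm does more than this: it searches within a sliding window for intervals that score above the threshold and are not outscored by any of their sub-intervals, which the sketch does not attempt.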

I understand the basis of the SDUST algorithm, which is quite elegant, but I haven’t fully understood all the implementation details. I simply translated the pseudocode in the paper to C, with occasional reference to the dustmasker source code. If you run into any problems, please let me know.