05 October 2015

I have just reimplemented the symmetric DUST algorithm (SDUST) for masking low-complexity regions. The program depends on kdq.h (double-ended queue) and kvec.h (simple vector); the command-line interface additionally requires kseq.h for FASTA/Q parsing. When I tested it on human chr11, the output was identical to that of NCBI’s dustmasker except at assembly gaps, and it ran four times as fast. I have also compared this implementation to mdust, which is supposed to be a reimplementation of the original asymmetric DUST. Under the same score threshold, the mdust result seems to differ significantly from that of SDUST/dustmasker. I haven’t looked into the cause.

I understand the basis of the SDUST algorithm, which is quite elegant, but I haven’t fully understood all the implementation details. I simply translated the pseudocode in the paper into C quite literally, with occasional reference to the dustmasker source code. If you run into any problems, please let me know.
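For readers unfamiliar with DUST, here is a minimal sketch of the raw complexity score that, as I understand the paper, underlies the algorithm: count each nucleotide triplet t in a sequence, sum c_t(c_t-1)/2 over all triplets, and divide by l-1, where l is the number of triplets. This is not the code from the program described above. The names `dust_score` and `base_code` are hypothetical, and resetting the triplet at non-ACGT bases is an assumption made for illustration. The real SDUST does considerably more, searching for high-scoring intervals within a sliding window, which this sketch does not attempt.

```c
/* Map a base to 0..3; return -1 for anything else (e.g. N).
 * Hypothetical helper, not taken from any existing library. */
static int base_code(char c)
{
	switch (c) {
	case 'A': case 'a': return 0;
	case 'C': case 'c': return 1;
	case 'G': case 'g': return 2;
	case 'T': case 't': return 3;
	default: return -1;
	}
}

/* Raw DUST-style score of seq[0..len): the sum over triplets t of
 * c_t*(c_t-1)/2, divided by (l-1), where l is the number of triplets
 * counted. Assumption: a non-ACGT base breaks the current triplet. */
double dust_score(const char *seq, int len)
{
	int counts[64] = {0};
	int i, l = 0, run = 0, word = 0;
	long sum = 0;
	for (i = 0; i < len; ++i) {
		int b = base_code(seq[i]);
		if (b < 0) { run = 0; word = 0; continue; }
		word = (word << 2 | b) & 63;          /* keep the last three bases */
		if (++run >= 3) { ++counts[word]; ++l; }
	}
	for (i = 0; i < 64; ++i)
		sum += (long)counts[i] * (counts[i] - 1) / 2;
	return l > 1 ? (double)sum / (l - 1) : 0.0;
}
```

A homopolymer such as "AAAAAAAA" contains six copies of one triplet and therefore scores much higher than a mixed sequence of the same length such as "ACGTACGT", which is the behaviour a low-complexity filter relies on.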
