12 November 2018

On Twitter, Zach charlop-powers asked me to give a code walkthrough for seqtk. This post does that, to a certain extent.

Seqtk is a fairly simple project. It uses two single-header libraries for hash table and FASTQ parsing, respectively. Its single .c file consists of mostly independent components, one for each seqtk command. I will start with the two single-header libraries.

Buffered stream reader

In standard C, the read() system call is the most basic function to read from file. It is usually slow to read data byte by byte. That is why the C library provides the fread() series. fread() reads large chunks of data with read() into an internal buffer and returns smaller data blocks on request. It may dramatically reduce the expensive read() system calls and is mostly the preferred choice.

fread() is efficient. However, it only works with data streams coming from file descriptors, not from a zlib file handler for example. More recent programming languages provide generic buffered readers. Take Go’s Bufio as an example. It demands a read() like function from the user code, and provides a buffered single-byte reader and an efficient line reader in return. The buffered functionalities are harder to implement on your own.

The buffered reader in kseq.h predates Go, but the basic idea is similar. In this file, ks_getuntil() reads up to a delimitor such as a line break. It moves data with memcpy() and uses a single loop to test delimitor. 10 years ago when “kseq.h” was first implemented, zlib didn’t support buffered I/O. Line reading with zlib was very slow. “kseq.h” is critical to the performance of FASTA/Q parsing.

FASTA/Q parser

The parser parses FASTA and FASTQ at the same time. It looks for ‘@’ or ‘>’ if it hasn’t been read, and then reads name and comment. To read sequence, the parser first reads the first character on a line. If the character is ‘+’ or indicates a FASTA/Q header, the parser stops; if not, it reads the rest of line into the sequence buffer. If the parser stops at a FASTA/Q header, it returns the sequence as a FASTA record and indicates the header character has been read, such that the parser need not look for it for the next record. If the parser stops at ‘+’, it skips the rest of line and starts to read quality strings line by line until the quality string is no shorter than the sequence. The parser returns an error if it reaches the end of file before reading enough quality, or the quality string turns out to be longer than sequence. Given a malformatted FASTA/Q file, the parser won’t lead to memory violation except when there is not enough memory.

A basic tip on fast file parsing: read by line or by chunk, not by byte. Even with a buffered reader, using fgetc() etc to read every byte is slow. In fact, it is possible to make the FASTA/Q parser faster by reading chunks of fixed size, but the current pasrser is fast enough for typical FASTA/Q.

Hash table

File khash.h implements an open-addressing hash table with power-of-2 capacity and quadratic probing. It uses a 2-bit-per-bucket meta table to indicate whether a bucket is used or deleted. The query and insertion operations are fairly standard. There are no tricks. Rehashing in khash is different from other libraries, but that is not an important aspect.

Both “khash.h” and “kseq.h” heavily depend on C macros. They look ugly. Unfortunately, in C, that is the only way to achieve the performance of type-specific code.

Seqtk

The only performance-related trick I can think of in seqtk.c is the tables to map nucleotides to integers. It is commonly used elsewhere. Another convenience-related trick is isatty(). This function can test if there is an incoming stream from the standard input. Gzip probably uses this function, too.

Seqtk.c also implements a simple 3-column BED reader and comes with a Mersenne Twister pseudorandom number generator (PRNG). That PRNG is a mistake, though seqtk doesn’t need a good PRNG anyway.

The rest of seqtk consists of mostly indepedent functions, each implementing a seqtk command. I will briefly explain a couple of them. “trimfq” uses a modified Mott algorithm (please search text “Mott algorithm” in phred.doc). I think this is a much cleaner and more theoretically sound algorithm than most ad hoc methods in various read trimmers. The “sample” command takes the advantage of reservoir sampling. The core implementation only takes two lines. You can in fact sample a text file with an awk one liner:

cat file.txt | awk -v k=10 '{y=x++<k?x-1:int(rand()*x);if(y<k)a[y]=$0}END{for(z in a)print a[z]}'

Concluding remarks

This is the first time I present a code walkthrough in a blog post. Not sure if it is helpful or even qualified as a walkthrough…



blog comments powered by Disqus