29 March 2024

Historically, the term “pan-genome” (or more often “pangenome” in recent literature) was first coined by Sigaux in 2000 for a purpose distinct from current uses. Victor Tetz’s definition is fairly close to the modern definition except that it is applied to all living organisms. These definitions are rarely cited nowadays.

In microbiome research, a pangenome refers to the non-redundant set of genes in a bacterial species. This definition is commonly attributed to two papers (Medini et al, 2005; Tettelin et al, 2005), both with Rino Rappuoli as the corresponding author. In practice, most widely used tools for bacterial pangenome construction only consider protein-coding genes. The pan-genome wiki page currently focuses on bacterial pangenome analysis, too.

When the concept of pangenome was ported to eukaryotic genomes where protein-coding genes are sparse, pangenome initially referred to the non-redundant set of genomic sequences in a species. This is a good definition in the narrow sense. However, the scope of pangenome has been broadened in recent literature partly because, I guess, there is no better word to describe a set of genomes. In my talks, I often cite the definition in a review paper in 2018 (with slight modifications): a pangenome refers to a collection of well assembled genomes in a clade (typically a species) to be analyzed together. Determining the non-redundant genes or genomic sequences is one type of pangenome analysis but other types of joint analyses such as indexing and storage also count. Notably, we do not consider short reads from the 1000 Genomes Project as pangenome data because these short reads cannot be well assembled.

Like many concepts in biology, the definiton of pangenome somewhat differs between fields and evolves with time. When you see “pangenome” next time, read the context to understand what it really means.

blog comments powered by Disqus