Open Human Genome Library
The Open Human Genome Library (OpenHGL) is a collection of high-quality de novo human assemblies that are publicly available in genomic databases (e.g. NCBI and CNCB) or from individual research papers. It provides consistent naming and uniform formats across datasets, supporting efficient subsequence retrieval and approximate string search.
The dataset currently consists of 579 human genomes with 1.7 trillion base pairs. The full dataset is available on AWS as Open Data. The primary data is also archived at Zenodo.
OpenHGL is available in the S3 bucket s3://openhgl. The fastest way to download
bulk data is to use the AWS command-line interface (aws-cli), for
example, with:
# list all files (there are tens of files in total)
aws s3 ls --no-sign-request --recursive s3://openhgl
# download a small sample file (24.8MB in size)
aws s3 cp --no-sign-request s3://openhgl/misc/mtb/mtb152.tar.gz .
If you are not familiar with aws-cli, you can browse the files,
find their links and download with wget or curl. Alternatively, you can
download primary data from Zenodo. However, due to the limited space
provided by Zenodo, derived files (e.g. the FM-index in the static format) are not
available. Downloading from Zenodo is also much slower than from AWS.
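For use with wget or curl, an S3 path can be rewritten as an HTTPS link. The sketch below assumes that every object in the bucket is served from the same us-east-1 endpoint that appears in the human579 download links in this document; s3_to_https is a hypothetical helper name, not part of aws-cli.

```shell
# Convert an s3://bucket/key path into an HTTPS URL for wget or curl.
# Assumption: all objects use the us-east-1 endpoint pattern seen in the
# human579 download links; s3_to_https is an illustrative helper only.
s3_to_https() {
    bucket_and_key="${1#s3://}"      # strip the s3:// scheme
    bucket="${bucket_and_key%%/*}"   # text before the first slash
    key="${bucket_and_key#*/}"       # text after the first slash
    printf 'https://%s.s3.us-east-1.amazonaws.com/%s\n' "$bucket" "$key"
}

# Print the link for the small sample file
s3_to_https s3://openhgl/misc/mtb/mtb152.tar.gz
```

The printed URL can then be passed to wget or curl -O.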
At present, OpenHGL provides genome sequences in the AGC format and the corresponding FM-index in the ropebwt3 format:
- human579.agc: AGC archive of assembly sequences
- human579.fmd: BWT in the static ropebwt3 format (AWS only)
- human579.fmd.ssa: sampled suffix array (AWS only)
- human579.fmd.len.gz: contig names and lengths
- human579.fmr.gz: BWT sequence in the dynamic ropebwt3 format
- human579.fmd.ssa.gz: sampled suffix array (Zenodo only)
- human579.meta.tsv: metadata including 1) assembly name, 2) the sex chromosome in the assembly, 3) sample name, 4) sample sex, 5) SGDP region code, 6) 1KG population code and 7) country.

It is recommended to download the precompiled AGC binary from its release
page. After copying the agc binary to a directory in your PATH, you can download
and retrieve sequences with
# download AGC archive
wget https://openhgl.s3.us-east-1.amazonaws.com/human/human579/human579.agc
# or with aws-cli
aws s3 cp --no-sign-request s3://openhgl/human/human579/human579.agc .
# list assembly names
agc listset human579.agc
# list contig names in assembly 200125_HG02129.pat
agc listctg human579.agc 200125_HG02129.pat
# retrieve all sequences in assembly 200125_HG02129.pat
agc getctg human579.agc 200125_HG02129.pat > HG02129.pat.fa
# retrieve the first 100bp of contig HG02129#1#CM085853.1
agc getctg human579.agc HG02129#1#CM085853.1:0-99
Importantly, AGC coordinates are 0-based: the first base is at coordinate 0, and
start-end is a closed interval. This differs from common tools like samtools faidx,
which also use closed intervals but put the first base at coordinate 1.
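Because both conventions use closed intervals, converting a samtools-style region to an AGC region only requires subtracting 1 from each end. A minimal sketch follows; faidx_to_agc is a hypothetical helper name, and it assumes plain integer coordinates without thousands separators.

```shell
# Convert a 1-based closed region (samtools faidx style) into a
# 0-based closed region (AGC style).
# faidx_to_agc is an illustrative helper, not part of AGC or samtools.
faidx_to_agc() {
    name="${1%:*}"       # contig name: everything before the last colon
    range="${1##*:}"     # START-END: everything after the last colon
    start="${range%-*}"
    end="${range#*-}"
    printf '%s:%d-%d\n' "$name" "$((start - 1))" "$((end - 1))"
}

# The first 100 bases: 1-100 in samtools terms, 0-99 in AGC terms
faidx_to_agc 'HG02129#1#CM085853.1:1-100'
```

The result can be passed to agc getctg as the query; quote it so the shell leaves the # characters alone.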
Ropebwt3 is required for string search:
# install ropebwt3
git clone https://github.com/lh3/ropebwt3
cd ropebwt3; make # add "omp=0" if you see errors
# download FM-index
wget https://openhgl.s3.us-east-1.amazonaws.com/human/human579/human579.fmd
wget https://openhgl.s3.us-east-1.amazonaws.com/human/human579/human579.fmd.ssa
wget https://openhgl.s3.us-east-1.amazonaws.com/human/human579/human579.fmd.len.gz
# exact match
echo CCAGGACCCCTGTCCAGTGTTAGACAGGAGCATGCAG | ropebwt3 mem -L human579.fmd -
# inexact match
echo CCAGGACCCCTGTCCAGTGTTAGACAGGAGCATGCAG | ropebwt3 sw -eN200 -Lm10 human579.fmd -
The following command lines show more use cases:
# Locate up to 100 exact matches
ropebwt3 mem -t16 -p100 human579.fmd seq.fa.gz > out.bed
# Find non-human sequences/contaminations
ropebwt3 mem -t16 -l101 --gap=10k human579.fmd seq.fastq.gz > out.bed
# Count 101-mers occurring over 20 times per genome on average
ropebwt3 kount -k101 -m 11580 human579.fmd > k101-20.txt
The ropebwt3 paper provides additional examples.
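The -m value in the kount example above is consistent with multiplying the per-genome average (20) by the number of genomes in the index (579). The sketch below makes that arithmetic explicit so the threshold can be adjusted; the variable names are illustrative, and the ropebwt3 command is left commented out because it needs the downloaded index.

```shell
# Derive the kount occurrence threshold from a per-genome average.
n_genomes=579      # genomes in the human579 index
per_genome=20      # desired average occurrences per genome
min_count=$((n_genomes * per_genome))
echo "$min_count"

# Then run, for example:
# ropebwt3 kount -k101 -m "$min_count" human579.fmd > k101-20.txt
```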
| Name | Version | nAsm | Description |
|---|---|---|---|
| CHM13 | 2.0 | 1 | Analysis set with HG002 chrY and rCRS chrM |
| CN1 | 1.0.1 | 2 | Chinese Han |
| KSA001 | 1.1.0 | 2 | Saudi Arabia |
| I002C | 0.7 | 2 | Indian |
| KOREF1 | 2025 | 2 | Korean |
| YAO | 2.0 | 2 | Chinese |
| HPRC | r2-v1.0.1 | 464 | Human Pangenome Reference Consortium |
| APR | v1 | 104 | UAE-based Arab Pangenome Reference |
Criteria in sample selection:
Additional procedure:
Overview of the workflow is shown below:

An assembly name matches the regular expression ([0-9]{6})_([A-Z0-9]+)\.(pri|pat|mat|hap1|hap2).
The leading digits are a unique identifier for the contig set. The alphanumeric
string after the first underscore indicates the sample name. If the assembly of
a sample is updated, the sample name stays the same but the identifier will be
different. The ending code specifies the assembly type:
A contig name matches ([^\s#]+)#[012]#([^\s#]+) where the first field
corresponds to the sample name and the last field to the contig or chromosome
name. The number in the middle indicates the haplotype: 0 for the primary assembly,
1 for paternal or haplotype 1, and 2 for maternal or haplotype 2.
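The two naming patterns can be applied directly with sed to split names into their fields. This is a minimal sketch: parse_asm and parse_ctg are hypothetical helper names, and the patterns are anchored to the whole string, which is an assumption beyond what the text above states.

```shell
# Split an assembly-set name into "identifier sample type".
# Prints nothing if the name does not match the pattern.
parse_asm() {
    printf '%s\n' "$1" |
        sed -nE 's/^([0-9]{6})_([A-Z0-9]+)\.(pri|pat|mat|hap1|hap2)$/\1 \2 \3/p'
}

# Split a contig name into "sample haplotype contig".
# Prints nothing if the name does not match the pattern.
parse_ctg() {
    printf '%s\n' "$1" |
        sed -nE 's/^([^[:space:]#]+)#([012])#([^[:space:]#]+)$/\1 \2 \3/p'
}

parse_asm '200125_HG02129.pat'
parse_ctg 'HG02129#1#CM085853.1'
```

An empty result signals a name that does not follow the convention, which can be useful as a sanity check when scripting over the output of agc listset or agc listctg.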