OpenHGL

Open Human Genome Library

Introduction

The Open Human Genome Library (OpenHGL) is a collection of high-quality de novo human assemblies that are publicly available in genomic databases (e.g. NCBI and CNCB) or from individual research papers. It provides consistent naming and uniform formats across datasets, supporting efficient subsequence retrieval and approximate string search.

The dataset currently consists of 579 human genomes totaling 1.7 trillion base pairs. The full dataset is available on AWS as Open Data. The primary data is also archived at Zenodo.

Downloading OpenHGL Data

OpenHGL is available in the S3 bucket s3://openhgl. The fastest way to download bulk data is with the AWS command-line interface (aws-cli), for example:

# list all files (there are tens of files in total)
aws s3 ls --no-sign-request --recursive s3://openhgl

# download a small sample file (24.8MB in size)
aws s3 cp --no-sign-request s3://openhgl/misc/mtb/mtb152.tar.gz .

If you are not familiar with aws-cli, you can browse the files, find their links and download them with wget or curl. Alternatively, you can download the primary data from Zenodo. However, due to the limited space provided by Zenodo, derived files (e.g. the FM-index in the static format) are not available there. Downloading from Zenodo is also much slower than from AWS.
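As a rough sketch of the wget/curl route, the helper below rewrites an s3://openhgl/... URI into an HTTPS URL. The function name s3_to_https is hypothetical, and the URL pattern is an assumption taken from the human579 links used later in this document; it has not been checked for every object in the bucket.

```shell
# Convert an s3://openhgl/<key> URI into an HTTPS URL.
# ASSUMPTION: every object is served under the same host as the
# human579 links shown later in this document.
s3_to_https() {
    key=${1#s3://openhgl/}    # strip the bucket prefix, keep the object key
    printf 'https://openhgl.s3.us-east-1.amazonaws.com/%s\n' "$key"
}

url=$(s3_to_https s3://openhgl/misc/mtb/mtb152.tar.gz)
echo "$url"
# Then, for example:  curl -O "$url"   or   wget "$url"
```

If a generated URL does not resolve, fall back to the aws-cli commands above.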

Using OpenHGL Data

File description

At present, OpenHGL provides genome sequences in the AGC format and the corresponding FM-index in the ropebwt3 format:

Retrieving genomic sequences

It is recommended to download the precompiled AGC binary from its release page. After copying the agc binary to a directory on your PATH, you can download the archive and retrieve sequences with:

# download AGC archive
wget https://openhgl.s3.us-east-1.amazonaws.com/human/human579/human579.agc
# or with aws-cli
aws s3 cp --no-sign-request s3://openhgl/human/human579/human579.agc .

# list assembly names
agc listset human579.agc

# list contig names in assembly 200125_HG02129.pat
agc listctg human579.agc 200125_HG02129.pat

# retrieve all sequences in assembly 200125_HG02129.pat
agc getctg human579.agc 200125_HG02129.pat > HG02129.pat.fa

# retrieve the first 100bp of contig HG02129#1#CM085853.1
agc getctg human579.agc HG02129#1#CM085853.1:0-99

Importantly, AGC coordinates are 0-based: the first base is at coordinate 0, and start-end is a closed interval. This differs from common tools such as samtools faidx, which also uses closed intervals but places the first base at coordinate 1.
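To make the difference concrete, here is a minimal sketch that converts a closed interval between the two conventions. The function names agc_to_faidx and faidx_to_agc are hypothetical helpers for illustration, not part of AGC or samtools.

```shell
# AGC:            0-based, closed interval [start, end]
# samtools faidx: 1-based, closed interval [start, end]
# Both ends are inclusive in each tool, so both shift by exactly one.
agc_to_faidx() { echo "$(( $1 + 1 ))-$(( $2 + 1 ))"; }
faidx_to_agc() { echo "$(( $1 - 1 ))-$(( $2 - 1 ))"; }

agc_to_faidx 0 99     # the first 100 bp in AGC coordinates; prints 1-100
faidx_to_agc 1 100    # the same region in faidx coordinates; prints 0-99
```

In both conventions the interval length is end - start + 1, so 0-99 and 1-100 each cover 100 bp.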

Finding sequence matches

Ropebwt3 is required for string search:

# install ropebwt3
git clone https://github.com/lh3/ropebwt3
cd ropebwt3; make               # add "omp=0" if you see errors

# download FM-index
wget https://openhgl.s3.us-east-1.amazonaws.com/human/human579/human579.fmd
wget https://openhgl.s3.us-east-1.amazonaws.com/human/human579/human579.fmd.ssa
wget https://openhgl.s3.us-east-1.amazonaws.com/human/human579/human579.fmd.len.gz

# exact match
echo CCAGGACCCCTGTCCAGTGTTAGACAGGAGCATGCAG | ropebwt3 mem -L human579.fmd -

# inexact match
echo CCAGGACCCCTGTCCAGTGTTAGACAGGAGCATGCAG | ropebwt3 sw -eN200 -Lm10 human579.fmd -

More use cases

The following command lines show more use cases:

# Locate up to 100 exact matches
ropebwt3 mem -t16 -p100 human579.fmd seq.fa.gz > out.bed

# Find non-human sequences/contaminations
ropebwt3 mem -t16 -l101 --gap=10k human579.fmd seq.fastq.gz > out.bed

# Count 101-mers occurring over 20 times per genome on average
ropebwt3 kount -k101 -m 11580 human579.fmd > k101-20.txt

The ropebwt3 paper provides additional examples.
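The value 11580 passed to -m in the kount example above appears to be the number of genomes multiplied by the per-genome count (579 × 20). A small sketch of that arithmetic, assuming -m takes a total occurrence count across the whole index, lets you derive the value for other cutoffs:

```shell
n_genomes=579        # number of genomes stated in this document
per_genome=20        # desired average occurrences per genome
# ASSUMPTION: -m is a total count over all genomes in the index.
threshold=$(( n_genomes * per_genome ))
echo "$threshold"    # prints 11580
```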

Data Description

Data sources

Name     Version      nAsm  Description
CHM13    2.0             1  Analysis set with HG002 chrY and rCRS chrM
CN1      1.0.1           2  Chinese Han
KSA001   1.1.0           2  Saudi Arabia
I002C    0.7             2  Indian
KOREF1   2025            2  Korean
YAO      2.0             2  Chinese
HPRC     r2-v1.0.1     464  Human Pangenome Reference Consortium
APR      v1            104  UAE-based Arab Pangenome Reference
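As a quick consistency check, the nAsm column should add up to the 579 genomes quoted in the introduction. The sketch below sums the values copied by hand from the table above:

```shell
# nAsm values in table order: CHM13, CN1, KSA001, I002C, KOREF1, YAO, HPRC, APR
total=0
for n in 1 2 2 2 2 2 464 104; do
    total=$(( total + n ))
done
echo "$total"    # prints 579
```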

Criteria in sample selection:

Additional procedure:

Overview of the workflow is shown below:

Naming convention

An assembly name matches the regular expression ([0-9]{6})_([A-Z0-9]+)\.(pri|pat|mat|hap1|hap2). The leading digits are a unique identifier for the contig set. The alphanumeric string after the first underscore is the sample name. If the assembly of a sample is updated, the sample name stays the same but the identifier changes. The ending code specifies the assembly type:

A contig name matches ([^\s#]+)#[012]#([^\s#]+), where the first field is the sample name and the last field is the contig or chromosome name. The number in the middle indicates the haplotype: 0 for the primary assembly, 1 for paternal or haplotype 1, and 2 for maternal or haplotype 2.
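The two patterns above can be used to split names into their fields. The following is a minimal sketch with sed; the function names parse_assembly and parse_contig are hypothetical. It makes three assumptions that go beyond the text: the patterns are anchored to the whole string, \s is written as the POSIX class [[:space:]] for portability, and the haplotype digit gets its own capture group.

```shell
# Print "<identifier> <sample> <type>" for a name such as 200125_HG02129.pat.
# Prints nothing if the name does not match the convention.
parse_assembly() {
    printf '%s\n' "$1" |
        sed -nE 's/^([0-9]{6})_([A-Z0-9]+)\.(pri|pat|mat|hap1|hap2)$/\1 \2 \3/p'
}

# Print "<sample> <haplotype> <contig>" for a contig name such as
# HG02129#1#CM085853.1. Prints nothing if the name does not match.
parse_contig() {
    printf '%s\n' "$1" |
        sed -nE 's/^([^[:space:]#]+)#([012])#([^[:space:]#]+)$/\1 \2 \3/p'
}

parse_assembly 200125_HG02129.pat        # prints: 200125 HG02129 pat
parse_contig 'HG02129#1#CM085853.1'      # prints: HG02129 1 CM085853.1
```

Quote contig names on the command line when in doubt: the # characters are harmless in the middle of a word, but a name that begins with # would be read by the shell as a comment.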

Known issues

ChangeLogs