miniprot - protein-to-genome alignment with splicing and frameshifts
Synopsis
Description
Options
Indexing options
Chaining options
Alignment options
Input/Output options
Output Format
The GFF3 Format
The PAF Format
Limitations
* Indexing a genome (recommended as indexing can be slow and memory hungry):
    miniprot [-t nThreads] -d ref.mpi ref.fna
* Aligning proteins to a genome:
    miniprot [-t nThreads] ref.mpi protein.faa > output.paf
    miniprot [-t nThreads] ref.fna protein.faa > output.paf
Miniprot aligns protein sequences to a genome allowing potential frameshifts and splicing.
-k INT
    K-mer size for genome-wide indexing [6]
-M INT
    Sample k-mers at a rate 1/2**INT [1]. Increasing this option reduces peak memory but decreases sensitivity.
-L INT
    Minimum ORF length to index [30]
-b INT
    Number of bits per bin [8]. Miniprot splits the genome into non-overlapping bins of 2^8 bp in size.
-d FILE
    Write the index to FILE [].
-S
    Disable splicing. It applies -G1k -J1k -e1k at the same time.
-c NUM
    Ignore k-mers occurring NUM times or more [50k]
-G NUM
    Max intron size [200k]. This option overrides -I.
-I
    Set max intron size to min(max(3.6*sqrt(refLen),10000),300000) where refLen is the total length of the input genome.
-n NUM
    Min number of syncmers in a chain [10]
-m NUM
    Min chaining score [0]
-l INT
    K-mer size for the second round of chaining [5]
-e NUM
    Max extension from chain ends for alignment or the second round of chaining [10k]
-p FLOAT
    Filter out a secondary chain/alignment if its score is below FLOAT fraction of the best chain's score [0.5]
-N NUM
    Retain at most NUM secondary chains/alignments [30]
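The formula used by -I can be evaluated by hand to see which genome sizes hit the lower and upper bounds. The following is a minimal sketch; the function name default_max_intron is hypothetical, and truncating the result to an integer is an assumption rather than documented miniprot behavior.

```python
import math

def default_max_intron(ref_len: int) -> int:
    """Evaluate min(max(3.6*sqrt(refLen), 10000), 300000) from the -I option.

    ref_len is the total length of the input genome in bp.
    Truncation to int is an assumption made for this sketch.
    """
    return int(min(max(3.6 * math.sqrt(ref_len), 10_000), 300_000))

# A 1 Mbp genome falls back to the 10 kbp floor,
# a 100 Mbp genome lands between the two bounds,
# and a 10 Gbp genome is capped at 300 kbp.
for size in (1_000_000, 100_000_000, 10_000_000_000):
    print(size, default_max_intron(size))
```

Under this formula the floor applies to genomes shorter than roughly 7.7 Mbp and the cap to genomes longer than roughly 6.9 Gbp.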
-O INT
    Gap open penalty [11]
-E INT
    Gap extension penalty [1]. A gap of size g costs {-O}+{-E}*g.
-J INT
    Intron open penalty [29]
-F INT
    Penalty for frameshifts or in-frame stop codons [23]
-C FLOAT
    Weight of splicing penalty [1]. Set to 0 to ignore splicing signals.
-B INT
    Bonus score for alignment reaching ends of proteins [5]
-j INT
    Splice model for the target genome: 2=mammal, 1=general, 0=none [1]. The mammal model considers G|GTR...YYYNYAG| as the optimal splicing sequence and penalizes other sequences based on profiles in Sibley et al. (2016). According to Irimia and Roy (2008) and Sheth et al. (2006), the first G in the donor exon and the poly-Y close to the acceptor may not be conserved in some species. The general model takes |GTR...YAG| as the optimal sequence. Both models also consider less frequent splice sites including G|GC...YAG| and |AT...AC|.
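The affine gap cost {-O}+{-E}*g stated above can be tabulated for a few gap sizes. This is an illustrative sketch only; the function name gap_cost is hypothetical, and its defaults are the -O and -E defaults listed above.

```python
def gap_cost(g: int, gap_open: int = 11, gap_ext: int = 1) -> int:
    """Cost of a gap of size g under the affine model {-O} + {-E} * g."""
    if g < 1:
        raise ValueError("gap size must be positive")
    return gap_open + gap_ext * g

# With the default penalties, each extra gap position adds 1 to the cost.
for g in (1, 5, 18):
    print(g, gap_cost(g))
```

With the defaults, a gap of size 18 costs 29, the same as the default intron open penalty -J.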
-t INT
    Number of threads [4]
--gff
    Output in the GFF3 format. ##PAF lines in the output provide detailed alignments.
--gff-only
    Output in the GFF3 format without ##PAF lines.
--aln
    Output the residue alignment in three lines, where ##ATN is the target nucleotide sequence, ##ATA the translated amino acid sequence and ##AQA the query protein sequence. On a ##ATA line, ! denotes a frameshift insertion corresponding to the F CIGAR operator and $ denotes a frameshift substitution corresponding to the G operator.
--max-intron-out NUM
    In the --aln format, if an intron is longer than NUM, only output ceil(NUM/2) basepairs at the donor or the acceptor sites and write the full intron length LEN as ~LEN~ in the middle [200].
-P STR
    Prefix for IDs in GFF3 or GTF [MP]. --gff-delim overrides this option.
--gff-delim CHAR
    Change the ID field in GFF3 to QueryNameCHARHitIndex []. If not specified, the default ID looks like MP000012. This option is only applicable to the GFF3 output format.
--gtf
    Output in the GTF format
-u
    Print unmapped query proteins
--outn NUM
    Output up to min{NUM, -N} alignments per query [1000].
--outs FLOAT
    Output an alignment only if its score is at least FLOAT*bestScore, where bestScore is the best alignment score of the protein [0.99]
--outc FLOAT
    Output an alignment only if FLOAT fraction of the query protein is aligned [0.1]
-K NUM
    Query batch size [2M]
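How --outn, --outs and --outc combine can be pictured with a small filter over the hits of one query. This is a sketch under stated assumptions, not miniprot's code: the Hit record and select_hits function are hypothetical, the aligned fraction is assumed to be (end - start) / length on the protein, and the order in which the three criteria are applied is a guess.

```python
from dataclasses import dataclass

@dataclass
class Hit:
    """Hypothetical record for one alignment of a query protein."""
    score: int    # alignment score
    q_start: int  # protein start coordinate (0-based)
    q_end: int    # protein end coordinate
    q_len: int    # protein length

def select_hits(hits, outn=1000, n_secondary=30, outs=0.99, outc=0.1):
    """Keep hits passing the --outs and --outc thresholds, best first,
    truncated to min(outn, n_secondary) entries."""
    if not hits:
        return []
    best = max(h.score for h in hits)
    kept = [
        h for h in sorted(hits, key=lambda h: h.score, reverse=True)
        if h.score >= outs * best
        and (h.q_end - h.q_start) / h.q_len >= outc
    ]
    return kept[:min(outn, n_secondary)]

hits = [
    Hit(1000, 0, 300, 300),  # best hit
    Hit(995, 0, 300, 300),   # within 99% of the best score
    Hit(980, 0, 300, 300),   # below 99% of the best score
    Hit(1000, 0, 20, 300),   # less than 10% of the protein aligned
]
print([h.score for h in select_hits(hits)])
```

With the default --outs of 0.99, only near-ties with the best alignment are reported; lowering it lets more secondary alignments through.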
Miniprot outputs alignments in the extended Pairwise mApping Format (PAF) by default (see the next subsection). It can also output GFF3 with option --gff. Miniprot may output three features: mRNA, CDS or stop_codon. Here, a stop_codon is only reported if the alignment reaches the C-terminus of the protein and the next codon is a stop codon. Per the GenCode rule, stop_codon is not part of CDS but it is part of mRNA or exon.
Miniprot may output the following attributes in GFF3:
    Attribute   Type  Description
    ID          str   mRNA identifier
    Parent      str   Identifier of the parent feature
    Rank        int   Rank among all hits of the query
    Identity    real  Fraction of exact amino acid matches
    Positive    real  Fraction of positive amino acid matches
    Donor       str   2bp at the donor site if not GT
    Acceptor    str   2bp at the acceptor site if not AG
    Frameshift  int   Number of frameshift events in alignment
    StopCodon   int   Number of in-frame stop codons
    Target      str   Protein coordinate in alignment
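The attributes above appear in the ninth GFF3 column as semicolon-separated key=value pairs. A minimal parsing sketch follows; the function name parse_gff3_attributes and the sample attribute string are made up for illustration, and the type conversions simply follow the Type column of the table.

```python
INT_ATTRS = ("Rank", "Frameshift", "StopCodon")
REAL_ATTRS = ("Identity", "Positive")

def parse_gff3_attributes(column9: str) -> dict:
    """Split a GFF3 attribute column into a dict, converting the
    attributes listed as int or real in the table above."""
    attrs = {}
    for item in column9.strip().split(";"):
        if not item:
            continue  # tolerate a trailing semicolon
        key, _, value = item.partition("=")
        attrs[key] = value
    for key in INT_ATTRS:
        if key in attrs:
            attrs[key] = int(attrs[key])
    for key in REAL_ATTRS:
        if key in attrs:
            attrs[key] = float(attrs[key])
    return attrs

# Hypothetical attribute column of an mRNA line
example = "ID=MP000001;Rank=1;Identity=0.9500;Positive=0.9700;Target=prot1 1 300"
print(parse_gff3_attributes(example))
```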
PAF gives detailed alignments. It is a TAB-delimited text format in which each line consists of at least 12 fields, as described in the following table:
    Col  Type    Description
    1    string  Protein sequence name
    2    int     Protein sequence length
    3    int     Protein start coordinate (0-based)
    4    int     Protein end coordinate (0-based)
    5    char    + for forward strand; - for reverse
    6    string  Contig sequence name
    7    int     Contig sequence length
    8    int     Contig start coordinate on the original strand
    9    int     Contig end coordinate on the original strand
    10   int     Number of matching nucleotides
    11   int     Number of nucleotides in alignment excl. introns
    12   int     Mapping quality (0-255 with 255 for missing)
PAF may optionally have additional fields in the SAM-like typed key-value format. Miniprot may output the following tags:
    Tag  Type  Description
    AS   i     Alignment score from dynamic programming
    ms   i     Alignment score excluding introns
    np   i     Number of amino acid matches with positive scores
    da   i     Distance to the nearest start codon
    do   i     Distance to the nearest stop codon
    cg   Z     Protein CIGAR
    cs   Z     Difference string
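The 12 mandatory columns and the optional TAG:TYPE:VALUE fields can be read with a few lines of code. The sketch below is illustrative only: the function name parse_paf_line, the dictionary keys and the sample line are all made up, and only the i (integer) tag type is converted.

```python
def parse_paf_line(line: str) -> dict:
    """Parse one PAF line into the 12 mandatory fields plus a tag dict."""
    f = line.rstrip("\n").split("\t")
    if len(f) < 12:
        raise ValueError("a PAF line needs at least 12 fields")
    rec = {
        "qname": f[0], "qlen": int(f[1]), "qstart": int(f[2]), "qend": int(f[3]),
        "strand": f[4],
        "tname": f[5], "tlen": int(f[6]), "tstart": int(f[7]), "tend": int(f[8]),
        "nmatch": int(f[9]), "alen": int(f[10]), "mapq": int(f[11]),
        "tags": {},
    }
    for field in f[12:]:
        # Split at most twice because a cs value itself contains ':'
        tag, typ, value = field.split(":", 2)
        rec["tags"][tag] = int(value) if typ == "i" else value
    return rec

# Hypothetical line: a 300 aa protein aligned across one 1100 bp intron
example = "\t".join([
    "prot1", "300", "0", "300", "+", "chr1", "1000000", "5000", "7000",
    "870", "900", "0", "AS:i:1500", "ms:i:1480", "cg:Z:150M1100N150M",
])
rec = parse_paf_line(example)
print(rec["tname"], rec["tstart"], rec["tend"], rec["tags"]["cg"])
```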
A protein CIGAR consists of the following operators:
    Op  Description
    nM  Alignment match. Consuming n*3 nucleotides and n amino acids
    nI  Insertion. Consuming n amino acids
    nD  Deletion. Consuming n*3 nucleotides
    nF  Frameshift deletion. Consuming n nucleotides
    nG  Frameshift match. Consuming n nucleotides and 1 amino acid
    nN  Phase-0 intron. Consuming n nucleotides
    nU  Phase-1 intron. Consuming n nucleotides and 1 amino acid
    nV  Phase-2 intron. Consuming n nucleotides and 1 amino acid
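The operator table translates directly into a routine that counts how many nucleotides and amino acids a protein CIGAR consumes, which can be used to cross-check the coordinates in PAF columns 3-4 and 8-9. This is a minimal sketch; the function name cigar_consumption and the example strings are hypothetical.

```python
import re

def cigar_consumption(cigar: str) -> tuple:
    """Return (nucleotides, amino_acids) consumed by a protein CIGAR,
    following the per-operator rules in the table above."""
    if not re.fullmatch(r"(\d+[MIDFGNUV])+", cigar):
        raise ValueError("malformed protein CIGAR: " + cigar)
    nt = aa = 0
    for count, op in re.findall(r"(\d+)([MIDFGNUV])", cigar):
        n = int(count)
        if op == "M":        # match: n codons and n amino acids
            nt += 3 * n
            aa += n
        elif op == "I":      # insertion: amino acids only
            aa += n
        elif op == "D":      # deletion: n codons on the genome
            nt += 3 * n
        elif op in "FN":     # frameshift deletion or phase-0 intron
            nt += n
        else:                # G, U, V: n nucleotides and 1 amino acid
            nt += n
            aa += 1
    return nt, aa

# Two 150-codon exons separated by a 1100 bp phase-0 intron
print(cigar_consumption("150M1100N150M"))
```

For a consistent PAF record, the first number should equal the contig end minus the contig start, and the second the protein end minus the protein start.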
The cs tag encodes difference sequences. It consists of a series of operations:
    Op  Regex                       Description
    :   [0-9]+                      Number of identical amino acids
    *   [acgtn]+[A-Z*]              Substitution: ref to query
    +   [A-Z]+                      # aa inserted to the reference
    -   [acgtn]+                    # nt deleted from the reference
    ~   [acgtn]{2}[0-9]+[acgtn]{2}  Intron length and splice signal
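The regular expressions in the table are enough to split a cs string into its operations. The following sketch does only that tokenization; the function name parse_cs and the sample string are hypothetical, and the sample was composed by hand rather than taken from real miniprot output.

```python
import re

# One alternative per operator, using the regexes from the table above
CS_TOKEN = re.compile(
    r":(\d+)"                        # identical amino acids
    r"|\*([acgtn]+[A-Z*])"           # substitution: ref to query
    r"|\+([A-Z]+)"                   # amino acids inserted to the reference
    r"|-([acgtn]+)"                  # nucleotides deleted from the reference
    r"|~([acgtn]{2}\d+[acgtn]{2})"   # intron length and splice signal
)

def parse_cs(cs: str) -> list:
    """Split a cs string into (operator, value) pairs."""
    ops = []
    pos = 0
    while pos < len(cs):
        m = CS_TOKEN.match(cs, pos)
        if m is None:
            raise ValueError("unexpected cs content at offset %d" % pos)
        # Exactly one capture group matched; lastindex points at it
        ops.append((m.group(0)[0], m.group(m.lastindex)))
        pos = m.end()
    return ops

# Hand-made example: 20 identities, one substitution, a 2 aa insertion,
# a 3 nt deletion, an 1100 bp GT-AG intron, then 30 identities
print(parse_cs(":20*gctV+KL-aaa~gt1100ag:30"))
```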
* The DP alignment score (the AS tag) is not accurate.
* Need to introduce more heuristics for improved accuracy.
miniprot-0.9 (r223) | miniprot (1) | 9 March 2023 |