macでインフォマティクス

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

cDNA配列をゲノムにアラインメントする GMAP

 

2016年の論文より

RNA-SeqやDNA-SeqのデータセットをゲノムにアライメントするためのプログラムGMAPとGSNAPは、生物学的手法の進歩に伴い、より長いリード、より大量のデータ、新しいタイプの生物学的アッセイを扱うために進化してきた。ゲノム表現では、SIMD(single-instruction multiple-data)命令を用いて配列を比較できる線形ゲノム、SIMD命令を用いて高速アクセスできる圧縮ゲノムハッシュテーブル、40億bp以上の大規模ゲノムへの対応、高速アクセスのための新規データ構造を持つ拡張サフィックスアレイ(ESA)などに改良されてきた。アルゴリズムの改良では、接尾辞配列を用いた greedy match-and-extend アルゴリズム、ゲノムハッシュテーブルを用いたセグメントチェイニング、セグメントハッシュテーブルを用いたdiagonalization、SIMD命令を用いた核酸レベルの動的計画法(Fループ計算を不要にする)などが行われた。プログラムの機能強化には、indelの位置の標準化、あいまいなスプライシングの処理、重複するペアエンドリードのクリッピングとマージ、環状染色体やスキャフォールドへのアラインメントが含まれる。これらのプログラムは、gmapR や HTSeqGenie などの R/Bioconductor パッケージに統合され、パイプラインで使用できるようになり、これらのパイプラインは多くの生命現象の発見を促進している。

 

インストール

Github

git clone https://github.com/juliangehring/GMAP-GSNAP.git
cd GMAP-GSNAP/
./configure
make
make check   (optional)
make install

#conda
mamba install -c bioconda gmap

> gmap

$ gmap

Note: gmap.avx2 does not exist.  For faster speed, may want to compile package on an AVX2 machine GMAP version 2021-08-25 called with args: gmap.sse42 Need to specify the -d, -g, -1, -2, or --cmdline flag Usage: gmap [OPTIONS...] <FASTA files...>, or

       cat <FASTA files...> | gmap [OPTIONS...]

 

Input options (must include -d or -g)

  -D, --dir=directory            Genome directory.  Default (as specified by --with-gmapdb to the configure program) is mambaforge/envs/gemoma/share

  -d, --db=STRING                Genome database.  If argument is '?' (with the quotes), this command lists available databases.

 

  -k, --kmer=INT                 kmer size to use in genome database (allowed values: 16 or less).

                                   If not specified, the program will find the highest available

                                   kmer size in the genome database

  --sampling=INT                 Sampling to use in genome database. If not specified, the program

                                   will find the smallest available sampling value in the genome database within selected k-mer size

  -g, --gseg=filename            User-supplied genomic segment

  -1, --selfalign                Align one sequence against itself in FASTA format via stdin

                                   (Useful for getting protein translation of a nucleotide sequence)

  -2, --pairalign                Align two sequences in FASTA format via stdin, first one being

                                   genomic and second one being cDNA

  --cmdline=STRING,STRING        Align these two sequences provided on the command line,

                                   first one being genomic and second one being cDNA

  -q, --part=INT/INT             Process only the i-th out of every n sequences

                                   e.g., 0/100 or 99/100 (useful for distributing jobs

                                   to a computer farm).

  --input-buffer-size=INT        Size of input buffer (program reads this many sequences

                                   at a time for efficiency) (default 1000)

 

Computation options

  -B, --batch=INT                Batch mode (default = 2)

                                 Mode     Positions       Genome

                                   0      mmap            mmap

                                   1      mmap & preload  mmap

                      (default)    2      mmap & preload  mmap & preload

                                   3      allocate        mmap & preload

                                   4      allocate        allocate

                                   5      allocate        allocate (same as 4)

                           Note: For a single sequence, all data structures use mmap If mmap not available and allocate not chosen, then will use fileio (very slow)

  --use-shared-memory=INT        If 1, then allocated memory is shared among all processes on this node

                                   If 0 (default), then each process has private allocated memory

  --nosplicing                   Turns off splicing (useful for aligning genomic sequences

                                   onto a genome)

  --max-deletionlength=INT       Max length for a deletion (default 100).  Above this size,

                                   a genomic gap will be considered an intron rather than a deletion.

                                   If the genomic gap is less than --max-deletionlength and greater

                                   than --min-intronlength, a known splice site or splice site probabilities

                                   of 0.80 on both sides will be reported as an intron.

  --min-intronlength=INT         Min length for one internal intron (default 9).  Below this size,

                                   a genomic gap will be considered a deletion rather than an intron.

                                   If the genomic gap is less than

--max-deletionlength and greater

                                   than --min-intronlength, a known splice site or splice site probabilities

                                   of 0.80 on both sides will be

reported as an intron.

  --max-intronlength-middle=INT  Max length for one internal intron (default 500000).  Note: for backward compatibility, the -K or

--intronlength flag will set both --max-intronlength-middle and

--max-intronlength-ends. Also see --split-large-introns below.

  --max-intronlength-ends=INT    Max length for first or last intron (default 10000).  Note: for backward  compatibility, the -K or

--intronlength flag will set both  --max-intronlength-middle and --max-intronlength-ends.

  --split-large-introns          Sometimes GMAP will exceed the value for --max-intronlength-middle,  if it finds a good single alignment.  However, you can force GMAP to split such alignments by using this flag

  --trim-end-exons=INT           Trim end exons with fewer than given number of matches

 (in nt, default 12)

  -w, --localsplicedist=INT      Max length for known splice sites at ends of sequence

                                   (default 2000000)

  -L, --totallength=INT          Max total intron length (default 2400000)

  -x, --chimera-margin=INT       Amount of unaligned sequence that triggers

                                   search for the remaining sequence (default 30). Enables alignment of chimeric reads, and may help with some non-chimeric reads.  To turn off, set to zero.

  --no-chimeras                  Turns off finding of chimeras.  Same effect as --chimera-margin=0

  -t, --nthreads=INT             Number of worker threads

  -c, --chrsubset=string         Limit search to given chromosome

  --strand=STRING                Genome strand to try aligning to (plus, minus, or both default)

  -z, --direction=STRING         cDNA direction (sense_force, antisense_force,

                                   sense_filter, antisense_filter,or auto (default))

  --canonical-mode=INT           Reward for canonical and semi-canonical introns

                                   0=low reward, 1=high reward (default), 2=low reward for

                                   high-identity sequences and high reward otherwise

  --cross-species                Use a more sensitive search for canonical splicing, which helps especially

                                   for cross-species alignments and other difficult cases

  --allow-close-indels=INT       Allow an insertion and deletion close to each other

                                   (0=no, 1=yes (default), 2=only for

high-quality alignments)

  --microexon-spliceprob=FLOAT   Allow microexons only if one of the splice site probabilities is greater than this value (default 0.95)

  --indel-open                   In dynamic programming, opening penalty for indel

  --indel-extend                 In dynamic programming, extension penalty for indel

                                   Values for --indel-open and

--indel-extend should be in [-127,-1].

                                   If value is < -127, then will use -127 instead.

                                   If --indel-open and --indel-extend are not specified, values are chose adaptively, based on the differences between the query and reference

  --cmetdir=STRING               Directory for methylcytosine index files (created using cmetindex)

                                   (default is location of genome index files specified using -D, -V, and -d)

  --atoidir=STRING               Directory for A-to-I RNA editing index files (created using atoiindex)

                                   (default is location of genome index files specified using -D, -V, and -d)

  --mode=STRING                  Alignment mode: standard (default), cmet-stranded, cmet-nonstranded,

                                    atoi-stranded, atoi-nonstranded, ttoc-stranded, or ttoc-nonstranded.

                                    Non-standard modes requires you to have previously run the cmetindex

                                    or atoiindex programs (which also cover the ttoc modes) on the genome

  -p, --prunelevel               Pruning level: 0=no pruning (default), 1=poor seqs,

                                   2=repetitive seqs, 3=poor and repetitive

 

Output types

  -S, --summary                  Show summary of alignments only

  -A, --align                    Show alignments

  -3, --continuous               Show alignment in three continuous lines

  -4, --continuous-by-exon       Show alignment in three lines per exon

  -E, --exons=STRING             Print exons ("cdna" or "genomic")

                                   Will also print introns with

"cdna+introns" or

                                   "genomic+introns"

  -P, --protein_dna              Print protein sequence (cDNA)

  -Q, --protein_gen              Print protein sequence (genomic)

  -f, --format=INT               Other format for output (also note the -A and -S options

                                   and other options listed under Output types):

                                   mask_introns,

                                   mask_utr_introns,

                                   psl (or 1) = PSL (BLAT) format,

                                   gff3_gene (or 2) = GFF3 gene format,

                                   gff3_match_cdna (or 3) = GFF3 cDNA_match format,

                                   gff3_match_est (or 4) = GFF3 EST_match format,

                                   splicesites (or 6) = splicesites output (for GSNAP splicing file),

                                   introns = introns output (for GSNAP splicing file),

                                   map_exons (or 7) = IIT FASTA exon map format,

                                   map_ranges (or 8) = IIT FASTA range map format,

                                   coords (or 9) = coords in table format,

                                   sampe = SAM format (setting paired_read bit in flag),

                                   samse = SAM format (without setting paired_read bit),

                                   bedpe = indels and gaps in BEDPE format Output options

  -n, --npaths=INT               Maximum number of paths to show (default 5).  If set to 1, GMAP

                                   will not report chimeric alignments, since those imply

                                   two paths.  If you want a single alignment plus chimeric

                                   alignments, then set this to be 0.

  --suboptimal-score=FLOAT       Report only paths whose score is within this value of the

                                   best path.

                                 If specified between 0.0 and 1.0, then treated as a fraction

                                   of the score of the best alignment (matches minus penalties for

                                   mismatches and indels).  Otherwise, treated as an integer

                                   number to be subtracted from the score of the best alignment.

                                   Default value is 0.50.

  -O, --ordered                  Print output in same order as input (relevant

                                   only if there is more than one worker thread)

  -5, --md5                      Print MD5 checksum for each query sequence

  -o, --chimera-overlap          Overlap to show, if any, at chimera breakpoint

  --failsonly                    Print only failed alignments, those with no results

  --nofails                      Exclude printing of failed alignments

 

  -V, --snpsdir=STRING           Directory for SNPs index files (created using snpindex) (default is location of genome index files specified using -D and -d)

   -v, --use-snps=STRING          Use database containing known SNPs (in <STRING>.iit, built

                                   previously using snpindex) for tolerance to SNPs

  --split-output=STRING          Basename for multiple-file output, separately for nomapping,

                                   uniq, mult, (and chimera, if --chimera-margin is selected)

  --failed-input=STRING          Print completely failed alignments as input FASTA or FASTQ format

                                   to the given file.  If the

--split-output flag is also given, this file

                                   is generated in addition to the output in the .nomapping file.

  --append-output                When --split-output or --failedinput is given, this flag will append output to the existing files.  Otherwise, the default is to create new files.

  --output-buffer-size=INT       Buffer size, in queries, for output thread (default 1000).  When the number

                                   of results to be printed exceeds this size, worker threads wait

                                   until the backlog is cleared

  --translation-code=INT         Genetic code used for translating codons to amino acids and computing CDS

                                   Integer value (default=1) corresponds to an available code at http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi

  --alt-start-codons             Also, use the alternate initiation codons shown in the above Web site

                                   By default, without this option,

only ATG is considered an initiation codon

  -F, --fulllength               Assume full-length protein, starting with Met

  -a, --cdsstart=INT             Translate codons from given

nucleotide (1-based)

  -T, --truncate                 Truncate alignment around full-length

protein, Met to Stop

                                 Implies -F flag.

  -Y, --tolerant                 Translates cDNA with corrections for

frameshifts

 

Options for GFF3 output

  --gff3-add-separators=INT      Whether to add a ### separator after each query sequence

                                   Values: 0 (no), 1 (yes, default)

  --gff3-swap-phase=INT          Whether to swap phase (0 => 0, 1 => 2, 2 => 1) in gff3_gene format

                                   Needed by some analysis programs, but deviates from GFF3 specification

                                   Values: 0 (no, default), 1 (yes)

  --gff3-fasta-annotation=INT    Whether to include annotation from the FASTA header into the GFF3 output

                                   Values: 0 (default): Do not include

                                           1: Wrap all annotation as Annot="<header>"

                                           2: Include key=value pairs, replacing brackets with quotation marks and replacing spaces between key=value pairs with semicolons

  --gff3-cds=STRING              Whether to use cDNA or genomic translation for the CDS coordinates Values: cdna (default), genomic Options for SAM output

  --no-sam-headers               Do not print headers beginning with '@'

  --sam-use-0M                   Insert 0M in CIGAR between adjacent insertions and deletions

                                   Required by Picard, but can cause errors in other tools

  --sam-extended-cigar           Use extended CIGAR format (using X and = symbols instead of M,

                                   to indicate matches and mismatches, respectively

  --force-xs-dir                 For RNA-Seq alignments, disallows XS:A:? when the sense direction

                                   is unclear, and replaces this value arbitrarily with XS:A:+.

                                   May be useful for some programs, such as Cufflinks, that cannot

                                   handle XS:A:?.  However, if you use this flag, the reported value

                                   of XS:A:+ in these cases will not be meaningful.

  --md-lowercase-snp             In MD string, when known SNPs are given by the -v flag,

                                   prints difference nucleotides as lower-case when they,

                                   differ from reference but match a known alternate allele

  --action-if-cigar-error        Action to take if there is a disagreement between CIGAR length and sequence length

                                   Allowed values: ignore, warning (default), noprint, abort

                                   Note that the noprint option does not print the CIGAR string at all if there is an error, so it may break a SAM parser

  --read-group-id=STRING         Value to put into read-group id (RG-ID) field

  --read-group-name=STRING       Value to put into read-group name (RG-SM) field

  --read-group-library=STRING    Value to put into read-group library (RG-LB) field

  --read-group-platform=STRING   Value to put into read-group library

(RG-PL) field Options for quality scores

  --quality-protocol=STRING      Protocol for input quality scores. Allowed values:

                                   illumina (ASCII 64-126) (equivalent to -J 64 -j -31)

                                   sanger   (ASCII 33-126) (equivalent to -J 33 -j 0)

                                 Default is sanger (no quality print shift)

                                 SAM output files should have quality scores in sanger protocol

 

                                 Or you can specify the print shift with this flag:

  -j, --quality-print-shift=INT  Shift FASTQ quality scores by this amount in output

                                   (default is 0 for sanger protocol; to change Illumina input

                                   to Sanger output, select -31) External map file options

  -M, --mapdir=directory         Map directory

  -m, --map=iitfile              Map file.  If argument is '?' (with the quotes),

                                   this lists available map files.

  -e, --mapexons                 Map each exon separately

  -b, --mapboth                  Report hits from both strands of genome

  -u, --flanking=INT             Show flanking hits (default 0)

  --print-comment                Show comment line for each hit

 

Alignment output options

  --nolengths                    No intron lengths in alignment

  --nomargin                     No left margin in GMAP standard output (with the -A flag)

  -I, --invertmode=INT           Mode for alignments to genomic (-) strand:

                                   0=Don't invert the cDNA (default)

                                   1=Invert cDNA and print genomic (-) strand

                                   2=Invert cDNA and print genomic (+) strand

  -i, --introngap=INT            Nucleotides to show on each end of intron (default 3)

  -l, --wraplength=INT           Wrap length for alignment (default 50) Filtering output options

  --min-trimmed-coverage=FLOAT   Do not print alignments with trimmed coverage less

                                   this value (default=0.0, which means no filtering)

                                   Note that chimeric alignments will be output regardless

                                   of this filter

  --min-identity=FLOAT           Do not print alignments with identity less

                                   this value (default=0.0, which means no filtering)

                                   Note that chimeric alignments will be output regardless

                                   of this filter

 

Help options

  --check                        Check compiler assumptions

  --version                      Show version

  --help                         Show this help message

 

 

 

> gmap_build

-k flag not specified, so building main hash table with default 15-mers

-j flag not specified, so building regional hash tables with default 6-mers

 

gmap_build: Builds a gmap database for a genome to be used by GMAP or GSNAP.

Part of GMAP package, version 2021-08-25.

 

Usage: gmap_build [options...] -d <genome> [-c <transcriptome> -T

<transcript_fasta>] <genome_fasta_files>

 

You are free to name <genome> and <transcriptome> as you wish.  You

will use the same names when performing alignments subsequently using

GMAP or GSNAP.

 

Note: If adding a transcriptome to an existing genome, then there is

no need to specify the genome_fasta_files.  This way you can add

transcriptome information to an existing genome database.

 

Options:

    -D, --dir=STRING          Destination directory for installation

(defaults to gmapdb

                                directory specified at configure time)

    -d, --genomedb=STRING     Genome name (required)

 

    -n, --names=STRING        Substitute names for contigs, provided in a file.

        The file can have two formats:

 

        1.  A file with one column per line, with each line

            corresponding to a FASTA file, in the order given to

            gmap_build.  The chromosome name for each FASTA file will

            be replaced with the desired chromosome name in the file.

            Every chromosome in the FASTA must have a corresponding line

            in the file.  This is useful if you want to rename chromosomes

            with a systematic numbering pattern.

 

        2.  A file with two columns per line, separated by white

            space.  In each line, the original FASTA chromosome name

            should be in column 1 and the desired chromosome name

            will be in column 2.

 

            The meaning of file format 2 depends on whether

            --limit-to-names is specified.  If so, the genome build will

            be limited to those chromosomes in this file.  Otherwise,

            all chromosomes in the FASTA file will be included,

            but only those chromosomes in this file will be re-named, which

            provides an easy way to change just a few chromosome names.

 

        This file can be combined with the --sort=names option, in

        which the order of chromosomes is that given in the file.  In

        this case, every chromosome must be listed in the file, and

        for chromosome names that should not be changed, column 2 can

        be blank (or the same as column 1).  The option of a blank

        column 2 is allowed only when specifying --sort=names,

        because otherwise, the program cannot distinguish between a

        1-column and 2-column names file.

 

    -L, --limit-to-names      Determines whether to limit the genome

build to the lines listed

                              in the --names file.  You can limit a

genome build to certain

                              chromosomes with this option, plus a

--names file that either

                              renames chromosomes, or lists the same

names in both columns for

                              the desired chromosomes.

 

    -k, --kmer=INT            k-mer value for genomic index (allowed:

15 or less, default is 15)

    -q INT                    sampling interval for genomoe (allowed:

1-3, default 3)

 

    -s, --sort=STRING         Sort chromosomes using given method:

                                none - use chromosomes as found in

FASTA file(s) (default)

                                alpha - sort chromosomes

alphabetically (chr10 before chr 1)

                                numeric-alpha - chr1, chr1U, chr2,

chrM, chrU, chrX, chrY

                                chrom - chr1, chr2, chrM, chrX, chrY,

chr1U, chrU

                                names - sort chromosomes based on file

provided to --names flag

 

    -g, --gunzip              Files are gzipped, so need to gunzip

each file first

    -E, --fasta-pipe=STRING   Interpret argument as a command, instead

of a list of FASTA files

    -Q, --fastq               Files are in FASTQ format

    -R, --revcomp             Reverse complement all contigs

    -w INT                    Wait (sleep) this many seconds after

each step (default 2)

 

    -o, --circular=STRING     Circular chromosomes (either a list of

chromosomes separated

                                by a comma, or a filename containing

circular chromosomes,

                                one per line).  If you use the --names

feature, then you

                                should use the substitute name of the

chromosome, not the

                                original name, for this option.

(NOTE: This behavior is different

                                from previous versions, and starts

with version 2020-10-20.)

 

    -2, --altscaffold=STRING  File with alt scaffold info, listing

alternate scaffolds,

                                one per line, tab-delimited, with the

following fields:

                                (1) alt_scaf_acc, (2) parent_name, (3)

orientation,

                                (4) alt_scaf_start, (5) alt_scaf_stop,

(6) parent_start, (7) parent_end.

 

    -e, --nmessages=INT       Maximum number of messages (warnings,

contig reports) to report (default 50)

 

 

Options for older genome formats:

    -M, --mdflag=STRING       Use MD file from NCBI for mapping contigs to

                                chromosomal coordinates

 

    -C, --contigs-are-mapped  Find a chromosomal region in each FASTA

header line.

                                Useful for contigs that have been mapped

                                to chromosomal coordinates.  Ignored

if the --mdflag is provided.

 

 

Options for transcriptome-guided alignment:

    -c, --transcriptomedb=STRING    Transcriptome name

 

    -T, --transcripts=FILE    FASTA file containing transcripts

(required if specifying

                                --transcriptomedb)

 

    -t, --nthreads=INT        Number of threads for GMAP alignment of

transcripts to genome

                                (default 8)

 

 

 

実行方法

1,indexing

gmap_build -d indexDB -k 15 genome.fna

 

2、alignment

gmap -d indexDB  -B 5 -t 10 -f 3  transcripts.fasta > output.gff3
  • -B   Batch mode (default = 2)

  • -t    Number of worker threads
  • -f    Other format for output.  gff3_match_cdna (or 3) = GFF3 cDNA_match format

 

 

K-merサイズ(マニュアルより)

kフラグにより、ゲノムインデックスのk-merサイズを12から15の範囲で制御できる。kのデフォルト値は15だが、インデックスを構築するために4GBのRAMが必要になる。もし4GBのRAMがない場合は、-kの値を小さくするか、他のマシンを探す必要がある。以下は、様々なインデックスを構築するために必要なRAM。

12-mer: 64 MB
13-mer: 256 MB
14-mer: 1 GB
15-mer: 4 GB

ゲノムインデックスは圧縮されているため、インデックスの構築後にGMAP/GSNAPプログラムを実行するためのRAM容量は必要ない。例えば、k-merが15の場合のゲノムインデックスは、gammaptrsファイルが64MB、offsetscompファイルが約350MBとなり、4GBよりはるかに小さくなる。gmap_build の -b または --basesize パラメータで圧縮量を制御することができる。デフォルトでは、k-mer sizeの値は15、basesizeの値は12になっている。k-merサイズに別の値を指定すると、basesizeはそのk-merサイズと等しくなるようにデフォルトで設定される。もし、複数のk-mer sizeでゲノムデータベースを構築したい場合は、-kの値を変えてgmap_buildを再実行する。この場合、前回実行したものと同一のファイルのみが上書きされる。その後、GMAPまたはGSNAPの実行時に-kフラグを使用してk-merサイズを選択することができる。

 

 

引用

GMAP and GSNAP for Genomic Sequence Alignment: Enhancements to Speed, Accuracy, and Functionality
Thomas D Wu, Jens Reeder, Michael Lawrence, Gabe Becker, Matthew J Brauer 

Methods Mol Biol. 2016;1418:283-334

 

GMAP: a genomic mapping and alignment program for mRNA and EST sequences
Thomas D Wu 1, Colin K Watanabe

Bioinformatics. 2005 May 1;21(9):1859-75

 

関連