macでインフォマティクス

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

HTS(NGS)の基本ツールのインストール手順

2024/05/10 BamTools追加

 

UbuntuにHTSの基本的なツールをインストールする手順をまとめました。pythonでラップされているようなツールはanacondaに頼るのが早いので除外し、純粋にコンパイル言語で書かれたよく使われているツールを対象としました。

 

dockerイメージとライブラリなどの準備

dockerを使用し、ubuntuのベースイメージをpullして進めます。既にcondaやhomebrewでこれらのツールを導入されている場合、ややこしくなりますので注意してください。例えばcondaでツールを管理されているのでしたら前もってconda deactivateしておくのが理想です(プロンプト左端に"(base)"など出ていないか確認して下さい )。

 

最新のタグのイメージをpull後、ランしてコンパイラや依存するライブラリなどを導入します。

docker pull ubuntu:latest
docker run -itv $PWD:/data ubuntu:latest #その場限りで使うのなら"--rm"を付ける

#最初に依存するライブラリなどを導入する必要がある
apt update
apt install -y make gcc git autoconf automake g++ cmake build-essential
apt install -y zlib1g-dev
apt install -y liblzma-dev
apt install -y libbz2-dev
apt install -y libcurl4-openssl-dev
apt install -y libssl-dev #OpenSSLライブラリ
apt install -y pkg-config
apt install -y libdeflate-dev
apt install -y libncurses5-dev
apt install -y libgsl-dev
apt install -y libperl-dev
apt install -y libxml2-dev
#/srcに移動 *1
cd /srv


=============================================================================
#補足
#1 作業完了後、exitしてcommitしておく。例:
docker ps -a #IDの確認

#2 IDを指定してcommit
docker commit 1c4af84ced55 kazumax/ngsbaseimage

#3 再使用(bwa indexコマンドをホスト側から実行)
docker run --rm -itv $PWD:/root -w /root kazumax/ngsbaseimage bwa index -a is GCA_021272385.1.fna

#or(複数コマンド例)
command1="echo bwa indexing"
command2="bwa index -a is GCA_021272385.1.fna"

docker run --rm -itv $PWD:/root -w /root kazumax/ngsbaseimage bash -c "$command1; $command2"

*1 ここでは/srvにcloneする。ほかの候補にはoptなど(参考)。ユーザーならhome下など。

これで準備OK。/srv下に順番に導入していきます。

 

インストール

1,SAMtoolsとBCFtools、BAMToolsのインストール

Github

https://github.com/samtools/htslib

https://github.com/samtools/samtools

https://github.com/samtools/bcftools

https://github.com/pezmaster31/bamtools

https://github.com/telatin/bamtocov?tab=readme-ov-file

#1 htslibのインストール(samtoolsとbcftoolsのインストールに必須)*2
git clone --recurse-submodules https://github.com/samtools/htslib.git
cd htslib/
autoheader
autoreconf -i
./configure --prefix=/usr/local
make -j10
make install
cd ../

#2 samtoolsのインストール
git clone https://github.com/samtools/samtools.git
cd samtools/
autoheader
autoconf -Wno-syntax
./configure --prefix=/usr/local # *3
make -j10
make install
cd ../
samtools -h

#3 bcftoolsのインストール
git clone https://github.com/samtools/bcftools.git
cd bcftools/
autoheader
autoconf
./configure --enable-libgsl --enable-perl-filters
make -j10
make install
cd ../
bcftools -h

#4 BamToolsのインストール
git clone https://github.com/pezmaster31/bamtools.git
cd bamtools/
mkdir build
cd build
cmake -DCMAKE_INSTALL_PREFIX=/usr/local ..
make -j10
make install
cd ../../
bamtools -h

*2 HTSlibはhtscodecsプロジェクトのいくつかの関数を使用している。clone時、サブモジュールを更新して初期化する。

*3 configure時、デフォルトでは"../htslib"にある HTSlibに対してビルドされる。--with-htslib=DIRを指定することでシステムの別のhtslibと変更可能。

 

2、vcftoolsのインストール

Github

https://github.com/vcftools/vcftools

git clone https://github.com/vcftools/vcftools.git
cd vcftools/
./autogen.sh
./configure --prefix=/usr/local
make -j10
make install
cd ../
vcftools -h

 

3、SRA Toolkitのインストール

Github

https://github.com/ncbi/sra-tools

git clone https://github.com/ncbi/sra-tools.git
cd sra-tools/

#ncbi-vdbが必要
git clone https://github.com/ncbi/ncbi-vdb.git
cd ncbi-vdb
./configure --prefix=/usr/local/ncbi-vdb
make -j20
make install -j20
cd ../

./configure --with-ncbi-vdb=/usr/local/ncbi-vdb --prefix=/usr/local
make -j20
make install
cd ../
fastq-dump -h

 

4、Mapping toolのインストール

Github

https://github.com/BenLangmead/bowtie2

https://github.com/lh3/bwa

https://github.com/bwa-mem2/bwa-mem2

https://github.com/lh3/minimap2

https://github.com/DaehwanKimLab/hisat2

https://github.com/alexdobin/STAR

#bowtie2
git clone https://github.com/BenLangmead/bowtie2.git
cd bowtie2/
make -j20
make install
cd ../
bowtie2 -h

#bwa1
git clone https://github.com/lh3/bwa.git
cd bwa
make -j20
echo 'export PATH=$PATH:$PWD/bwa' >> ~/.bashrc && source ~/.bashrc
#or
cp bwa /usr/local/bin/
cd ../
bwa

#bwa2
git clone --recursive https://github.com/bwa-mem2/bwa-mem2
cd bwa-mem2
make -j20
echo 'export PATH=$PATH:$PWD/bwa-mem2*' >> ~/.bashrc && source ~/.bashrc
#or
cp bwa2* /usr/local/bin/
cd ../
bwa-mem2

#minimap2
git clone https://github.com/lh3/minimap2
cd minimap2
make -j20
echo 'export PATH=$PATH:$PWD/minimap2' >> ~/.bashrc && source ~/.bashrc
#or
cp minimap2 /usr/local/bin/
cd ../
minimap2

#hisat2 (pythonも必要)
git clone https://github.com/DaehwanKimLab/hisat2.git
cd hisat2
make -j20
echo 'export PATH=$PATH:$PWD/hisat2*' >> ~/.bashrc && source ~/.bashrc
cd ../
hisat2 -h

#STAR(STAR/bin/Linux_x86_64_staticにバイナリが配置されているのでx86 64bit版ならパスを通すだけ
git clone https://github.com/alexdobin/STAR.git
cd STAR/bin/Linux_x86_64_static/
echo 'export PATH=$PATH:$PWD' >> ~/.bashrc && source ~/.bashrc
#or
ln -s STAR /usr/local/bin/
cd ../../../
STAR --help

(pseudo-mappingは除外)

 

5、fastqのQC toolのインストール

Github

https://github.com/smithlabcode/falco

#falco (fastQCの高速な代替)
git clone https://github.com/smithlabcode/falco.git
cd falco/
./autogen.sh
./configure CXXFLAGS="-O3 -Wall" --enable-hts
make HAVE_HTSLIB=1 all -j 20
make HAVE_HTSLIB=1 install

pythonで書かれたcutadapt、perlで書かれたTrim Galore(fastQCとcutadaptのラッパー)、javaで書かれたTrimmostic(.jarが配布されている)は除外。依存が多いfastpはcondaを使うのが簡単。)

 

6、配列クラスタリングツールのインストール

Github

https://github.com/weizhongli/cdhit

#cd-hit
git clone https://github.com/weizhongli/cdhit.git
cd cdhit/
#長い配列も扱えるようにビルド
make MAX_SEQ=1000000 -j4
make install
cd ../

 

7、GFF/GTFを操作するツールのインストール

https://github.com/gpertea/gffread

#gffread
git clone https://github.com/gpertea/gffread
cd gffread
make release -j 10
cd ../

 

8、配列操作ツールのインストール

https://github.com/torognes/vsearch

#vsearch
git clone https://github.com/torognes/vsearch.git
cd vsearch
./autogen.sh
./configure CFLAGS="-O3" CXXFLAGS="-O3"
make -j20
make install
cd ../

他は?

 

除外したもの

・ETE toolkit(pythonベース、別に紹介予定

・bedtools(python、bamtoolsなど依存が多い)

・IGVjavaが肥大)

############################################################################################

主要なコマンドのヘルプ

> samtools -h

Program: samtools (Tools for alignments in the SAM format)

Version: 1.20-6-g95f5426 (using htslib 1.20-15-gb204d55c)

 

Usage:   samtools <command> [options]

 

Commands:

  -- Indexing

     dict           create a sequence dictionary file

     faidx          index/extract FASTA

     fqidx          index/extract FASTQ

     index          index alignment

 

  -- Editing

     calmd          recalculate MD/NM tags and '=' bases

     fixmate        fix mate information

     reheader       replace BAM header

     targetcut      cut fosmid regions (for fosmid pool only)

     addreplacerg   adds or replaces RG tags

     markdup        mark duplicates

     ampliconclip   clip oligos from the end of reads

 

  -- File operations

     collate        shuffle and group alignments by name

     cat            concatenate BAMs

     consensus      produce a consensus Pileup/FASTA/FASTQ

     merge          merge sorted alignments

     mpileup        multi-way pileup

     sort           sort alignment file

     split          splits a file by read group

     quickcheck     quickly check if SAM/BAM/CRAM file appears intact

     fastq          converts a BAM to a FASTQ

     fasta          converts a BAM to a FASTA

     import         Converts FASTA or FASTQ files to SAM/BAM/CRAM

     reference      Generates a reference from aligned data

     reset          Reverts aligner changes in reads

 

  -- Statistics

     bedcov         read depth per BED region

     coverage       alignment depth and percent coverage

     depth          compute the depth

     flagstat       simple stats

     idxstats       BAM index stats

     cram-size      list CRAM Content-ID and Data-Series sizes

     phase          phase heterozygotes

     stats          generate stats (former bamcheck)

     ampliconstats  generate amplicon specific stats

 

  -- Viewing

     flags          explain BAM flags

     head           header viewer

     tview          text alignment viewer

     view           SAM<->BAM<->CRAM conversion

     depad          convert padded BAM to unpadded BAM

     samples        list the samples in a set of SAM/BAM/CRAM files

 

  -- Misc

     help [cmd]     display this help message or help for [cmd]

     version        detailed version information

 

> bcftools -h

 

Program: bcftools (Tools for variant calling and manipulating VCFs and BCFs)

License: GNU GPLv3+, due to use of the GNU Scientific Library

Version: 1.20-12-g18439d1d (using htslib 1.20-15-gb204d55c)

 

Usage:   bcftools [--version|--version-only] [--help] <command> <argument>

 

Commands:

 

 -- Indexing

    index        index VCF/BCF files

 

 -- VCF/BCF manipulation

    annotate     annotate and edit VCF/BCF files

    concat       concatenate VCF/BCF files from the same set of samples

    convert      convert VCF/BCF files to different formats and back

    head         view VCF/BCF file headers

    isec         intersections of VCF/BCF files

    merge        merge VCF/BCF files files from non-overlapping sample sets

    norm         left-align and normalize indels

    plugin       user-defined plugins

    query        transform VCF/BCF into user-defined formats

    reheader     modify VCF/BCF header, change sample names

    sort         sort VCF/BCF file

    view         VCF/BCF conversion, view, subset and filter VCF/BCF files

 

 -- VCF/BCF analysis

    call         SNP/indel calling

    consensus    create consensus sequence by applying VCF variants

    cnv          HMM CNV calling

    csq          call variation consequences

    filter       filter VCF/BCF files using fixed thresholds

    gtcheck      check sample concordance, detect sample swaps and contamination

    mpileup      multi-way pileup producing genotype likelihoods

    polysomy     detect number of chromosomal copies

    roh          identify runs of autozygosity (HMM)

    stats        produce VCF/BCF stats

 

 -- Plugins (collection of programs for calling, file manipulation & analysis)

    39 plugins available, run "bcftools plugin -lv" to see a complete list

 

 Most commands accept VCF, bgzipped VCF, and BCF with the file type detected

 automatically even when streaming from a pipe. Indexed VCF and BCF will work

 in all situations. Un-indexed VCF and BCF and streams will work in most but

 not all situations.

 

> vcftools -h

VCFtools (0.1.17)

© Adam Auton and Anthony Marcketta 2009

 

Process Variant Call Format files

 

For a list of options, please go to:

 https://vcftools.github.io/man_latest.html

 

Alternatively, a man page is available, type:

 man vcftools

 

Questions, comments, and suggestions should be emailed to:

 vcftools-help@lists.sourceforge.net

 

> fastq-dump -h

Usage:

  fastq-dump [options] <path> [<path>...]

  fastq-dump [options] <accession>

 

INPUT

  -A|--accession <accession>       Replaces accession derived from <path> in 

                                   filename(s) and deflines (only for single 

                                   table dump) 

  --table <table-name>             Table name within cSRA object, default is 

                                   "SEQUENCE" 

 

PROCESSING

 

Read Splitting                     Sequence data may be used in raw form or

                                     split into individual reads

  --split-spot                     Split spots into individual reads 

 

Full Spot Filters                  Applied to the full spot independently

                                     of --split-spot

  -N|--minSpotId <rowid>           Minimum spot id 

  -X|--maxSpotId <rowid>           Maximum spot id 

  --spot-groups <[list]>           Filter by SPOT_GROUP (member): name[,...] 

  -W|--clip                        Remove adapter sequences from reads 

 

Common Filters                     Applied to spots when --split-spot is not

                                     set, otherwise - to individual reads

  -M|--minReadLen <len>            Filter by sequence length >= <len> 

  -R|--read-filter <[filter]>      Split into files by READ_FILTER value 

                                   optionally filter by value

                                   pass|reject|criteria|redacted 

  -E|--qual-filter                 Filter used in early 1000 Genomes data: no 

                                   sequences starting or ending with >= 10N 

  --qual-filter-1                  Filter used in current 1000 Genomes data 

 

Filters based on alignments        Filters are active when alignment

                                     data are present

  --aligned                        Dump only aligned sequences 

  --unaligned                      Dump only unaligned sequences 

  --aligned-region <name[:from-to]>  Filter by position on genome. Name can 

                                   either be accession.version (ex: 

                                   NC_000001.10) or file specific name (ex: 

                                   "chr1" or "1"). "from" and "to" are 1-based 

                                   coordinates 

  --matepair-distance <from-to|unknown>  Filter by distance between matepairs. 

                                   Use "unknown" to find matepairs split 

                                   between the references. Use from-to to limit 

                                   matepair distance on the same reference 

 

Filters for individual reads       Applied only with --split-spot set

  --skip-technical                 Dump only biological reads 

 

OUTPUT

  -O|--outdir <path>               Output directory, default is working 

                                   directory '.' ) 

  -Z|--stdout                      Output to stdout, all split data become 

                                   joined into single stream 

  --gzip                           Compress output using gzip: deprecated, not 

                                   recommended 

  --bzip2                          Compress output using bzip2: deprecated, 

                                   not recommended 

 

Multiple File Options              Setting these options will produce more

                                     than 1 file, each of which will be suffixed

                                     according to splitting criteria.

  --split-files                    Write reads into separate files. Read 

                                   number will be suffixed to the file name.  

                                   NOTE! The `--split-3` option is recommended. 

                                   In cases where not all spots have the same 

                                   number of reads, this option will produce 

                                   files that WILL CAUSE ERRORS in most programs 

                                   which process split pair fastq files. 

  --split-3                        3-way splitting for mate-pairs. For each 

                                   spot, if there are two biological reads 

                                   satisfying filter conditions, the first is 

                                   placed in the `*_1.fastq` file, and the 

                                   second is placed in the `*_2.fastq` file. If 

                                   there is only one biological read 

                                   satisfying the filter conditions, it is 

                                   placed in the `*.fastq` file.All other 

                                   reads in the spot are ignored. 

  -G|--spot-group                  Split into files by SPOT_GROUP (member name) 

  -R|--read-filter <[filter]>      Split into files by READ_FILTER value 

                                   optionally filter by value

                                   pass|reject|criteria|redacted 

  -T|--group-in-dirs               Split into subdirectories instead of files 

  -K|--keep-empty-files            Do not delete empty files 

 

FORMATTING

 

Sequence

  -C|--dumpcs <[cskey]>            Formats sequence using color space (default 

                                   for SOLiD),"cskey" may be specified for 

                                   translation 

  -B|--dumpbase                    Formats sequence using base space (default 

                                   for other than SOLiD). 

 

Quality

  -Q|--offset <integer>            Offset to use for quality conversion, 

                                   default is 33 

  --fasta <[line width]>           FASTA only, no qualities, optional line 

                                   wrap width (set to zero for no wrapping) 

  --suppress-qual-for-cskey        suppress quality-value for cskey 

 

Defline

  -F|--origfmt                     Defline contains only original sequence name 

  -I|--readids                     Append read id after spot id as 

                                   'accession.spot.readid' on defline 

  --helicos                        Helicos style defline 

  --defline-seq <fmt>              Defline format specification for sequence. 

  --defline-qual <fmt>             Defline format specification for quality. 

                                   <fmt> is string of characters and/or 

                                   variables. The variables can be one of: $ac 

                                   - accession, $si spot id, $sn spot 

                                   name, $sg spot group (barcode), $sl spot 

                                   length in bases, $ri read number, $rn 

                                   read name, $rl read length in bases. '

                                   could be used for an optional output: if 

                                   all vars in yield empty values whole 

                                   group is not printed. Empty value is empty 

                                   string or for numeric variables. Ex: 

                                   @$sn[_$rn]/$ri '_$rn' is omitted if name 

                                   is empty

 

OTHER:

  --ngc <path>                     <path> to ngc file 

  --disable-multithreading         disable multithreading 

  -h|--help                        Output brief explanation of program usage 

  -V|--version                     Display the version of the program 

  -L|--log-level <level>           Logging level as number or enum string One 

                                   of (fatal|sys|int|err|warn|info) or (0-5) 

                                   Current/default is warn 

  -v|--verbose                     Increase the verbosity level of the program 

                                   Use multiple times for more verbosity 

  --ncbi_error_report              Control program execution environment 

                                   report generation (if implemented). One of 

                                   (never|error|always). Default is error 

  --legacy-report                  use legacy style 'Written spots' for tool 

 

> bowtie2 -h

Bowtie 2 version 2.5.3 by Ben Langmead (langmea@cs.jhu.edu, www.cs.jhu.edu/~langmea)

Usage: 

  bowtie2 [options]* -x <bt2-idx> {-1 <m1> -2 <m2> | -U <r> | --interleaved <i> | -b <bam>} [-S <sam>]

 

  <bt2-idx>  Index filename prefix (minus trailing .X.bt2).

             NOTE: Bowtie 1 and Bowtie 2 indexes are not compatible.

  <m1>       Files with #1 mates, paired with files in <m2>.

             Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).

  <m2>       Files with #2 mates, paired with files in <m1>.

             Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).

  <r>        Files with unpaired reads.

             Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).

  <i>        Files with interleaved paired-end FASTQ/FASTA reads

             Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).

  <bam>      Files are unaligned BAM sorted by read name.

  <sam>      File for SAM output (default: stdout)

 

  <m1>, <m2>, <r> can be comma-separated lists (no whitespace) and can be

  specified many times.  E.g. '-U file1.fq,file2.fq -U file3.fq'.

 

Options (defaults in parentheses):

 

 Input:

  -q                 query input files are FASTQ .fq/.fastq (default)

  --tab5             query input files are TAB5 .tab5

  --tab6             query input files are TAB6 .tab6

  --qseq             query input files are in Illumina's qseq format

  -f                 query input files are (multi-)FASTA .fa/.mfa

  -r                 query input files are raw one-sequence-per-line

  -F k:<int>,i:<int> query input files are continuous FASTA where reads

                     are substrings (k-mers) extracted from the FASTA file

                     and aligned at offsets 1, 1+i, 1+2i ... end of reference

  -c                 <m1>, <m2>, <r> are sequences themselves, not files

  -s/--skip <int>    skip the first <int> reads/pairs in the input (none)

  -u/--upto <int>    stop after first <int> reads/pairs (no limit)

  -5/--trim5 <int>   trim <int> bases from 5'/left end of reads (0)

  -3/--trim3 <int>   trim <int> bases from 3'/right end of reads (0)

  --trim-to [3:|5:]<int> trim reads exceeding <int> bases from either 3' or 5' end

                     If the read end is not specified then it defaults to 3 (0)

  --phred33          qualities are Phred+33 (default)

  --phred64          qualities are Phred+64

  --int-quals        qualities encoded as space-delimited integers

 

 Presets:                 Same as:

  For --end-to-end:

   --very-fast            -D 5 -R 1 -N 0 -L 22 -i S,0,2.50

   --fast                 -D 10 -R 2 -N 0 -L 22 -i S,0,2.50

   --sensitive            -D 15 -R 2 -N 0 -L 22 -i S,1,1.15 (default)

   --very-sensitive       -D 20 -R 3 -N 0 -L 20 -i S,1,0.50

 

  For --local:

   --very-fast-local      -D 5 -R 1 -N 0 -L 25 -i S,1,2.00

   --fast-local           -D 10 -R 2 -N 0 -L 22 -i S,1,1.75

   --sensitive-local      -D 15 -R 2 -N 0 -L 20 -i S,1,0.75 (default)

   --very-sensitive-local -D 20 -R 3 -N 0 -L 20 -i S,1,0.50

 

 Alignment:

  -N <int>           max # mismatches in seed alignment; can be 0 or 1 (0)

  -L <int>           length of seed substrings; must be >3, <32 (22)

  -i <func>          interval between seed substrings w/r/t read len (S,1,1.15)

  --n-ceil <func>    func for max # non-A/C/G/Ts permitted in aln (L,0,0.15)

  --dpad <int>       include <int> extra ref chars on sides of DP table (15)

  --gbar <int>       disallow gaps within <int> nucs of read extremes (4)

  --ignore-quals     treat all quality values as 30 on Phred scale (off)

  --nofw             do not align forward (original) version of read (off)

  --norc             do not align reverse-complement version of read (off)

  --no-1mm-upfront   do not allow 1 mismatch alignments before attempting to

                     scan for the optimal seeded alignments

  --end-to-end       entire read must align; no clipping (on)

   OR

  --local            local alignment; ends might be soft clipped (off)

 

 Scoring:

  --ma <int>         match bonus (0 for --end-to-end, 2 for --local) 

  --mp <int>         max penalty for mismatch; lower qual = lower penalty (6)

  --np <int>         penalty for non-A/C/G/Ts in read/ref (1)

  --rdg <int>,<int>  read gap open, extend penalties (5,3)

  --rfg <int>,<int>  reference gap open, extend penalties (5,3)

  --score-min <func> min acceptable alignment score w/r/t read length

                     (G,20,8 for local, L,-0.6,-0.6 for end-to-end)

 

 Reporting:

  (default)          look for multiple alignments, report best, with MAPQ

   OR

  -k <int>           report up to <int> alns per read; MAPQ not meaningful

   OR

  -a/--all           report all alignments; very slow, MAPQ not meaningful

 

 Effort:

  -D <int>           give up extending after <int> failed extends in a row (15)

  -R <int>           for reads w/ repetitive seeds, try <int> sets of seeds (2)

 

 Paired-end:

  -I/--minins <int>  minimum fragment length (0)

  -X/--maxins <int>  maximum fragment length (500)

  --fr/--rf/--ff     -1, -2 mates align fw/rev, rev/fw, fw/fw (--fr)

  --no-mixed         suppress unpaired alignments for paired reads

  --no-discordant    suppress discordant alignments for paired reads

  --dovetail         concordant when mates extend past each other

  --no-contain       not concordant when one mate alignment contains other

  --no-overlap       not concordant when mates overlap at all

 

 BAM:

  --align-paired-reads

                     Bowtie2 will, by default, attempt to align unpaired BAM reads.

                     Use this option to align paired-end reads instead.

  --preserve-tags    Preserve tags from the original BAM record by

                     appending them to the end of the corresponding SAM output.

 

 Output:

  -t/--time          print wall-clock time taken by search phases

  --un <path>        write unpaired reads that didn't align to <path>

  --al <path>        write unpaired reads that aligned at least once to <path>

  --un-conc <path>   write pairs that didn't align concordantly to <path>

  --al-conc <path>   write pairs that aligned concordantly at least once to <path>

    (Note: for --un, --al, --un-conc, or --al-conc, add '-gz' to the option name, e.g.

    --un-gz <path>, to gzip compress output, or add '-bz2' to bzip2 compress output.)

  --quiet            print nothing to stderr except serious errors

  --met-file <path>  send metrics to file at <path> (off)

  --met-stderr       send metrics to stderr (off)

  --met <int>        report internal counters & metrics every <int> secs (1)

  --no-unal          suppress SAM records for unaligned reads

  --no-head          suppress header lines, i.e. lines starting with @

  --no-sq            suppress @SQ header lines

  --rg-id <text>     set read group id, reflected in @RG line and RG:Z: opt field

  --rg <text>        add <text> ("lab:value") to @RG line of SAM header.

                     Note: @RG line only printed when --rg-id is set.

  --omit-sec-seq     put '*' in SEQ and QUAL fields for secondary alignments.

  --sam-no-qname-trunc

                     Suppress standard behavior of truncating readname at first whitespace 

                     at the expense of generating non-standard SAM.

  --xeq              Use '='/'X', instead of 'M,' to specify matches/mismatches in SAM record.

  --soft-clipped-unmapped-tlen

                     Exclude soft-clipped bases when reporting TLEN

  --sam-append-comment

                     Append FASTA/FASTQ comment to SAM record

 

 Performance:

  -p/--threads <int> number of alignment threads to launch (1)

  --reorder          force SAM output order to match order of input reads

  --mm               use memory-mapped I/O for index; many 'bowtie's can share

 

 Other:

  --qc-filter        filter out reads that are bad according to QSEQ filter

  --seed <int>       seed for random number generator (0)

  --non-deterministic

                     seed rand. gen. arbitrarily instead of using read attributes

  --version          print version information and quit

  -h/--help          print this usage message

 

> bwa   

Program: bwa (alignment via Burrows-Wheeler transformation)

Version: 0.7.18-r1243-dirty

Contact: Heng Li <hli@ds.dfci.harvard.edu>

 

Usage:   bwa <command> [options]

 

Command: index         index sequences in the FASTA format

         mem           BWA-MEM algorithm

         fastmap       identify super-maximal exact matches

         pemerge       merge overlapping paired ends (EXPERIMENTAL)

         aln           gapped/ungapped alignment

         samse         generate alignment (single ended)

         sampe         generate alignment (paired ended)

         bwasw         BWA-SW for long queries (DEPRECATED)

 

         shm           manage indices in shared memory

         fa2pac        convert FASTA to PAC format

         pac2bwt       generate BWT from PAC

         pac2bwtgen    alternative algorithm for generating BWT

         bwtupdate     update .bwt to the new format

         bwt2sa        generate SA from BWT and Occ

 

Note: To use BWA, you need to first index the genome with `bwa index'.

      There are three alignment algorithms in BWA: `mem', `bwasw', and

      `aln/samse/sampe'. If you are not sure which to use, try `bwa mem'

      first. Please `man ./bwa.1' for the manual.

 

 

 

> bwa-mem2   

Looking to launch executable "/srv/bwa-mem2/bwa-mem2.avx2", simd = .avx2

Launching executable "/srv/bwa-mem2/bwa-mem2.avx2"

Usage: bwa-mem2 <command> <arguments>

Commands:

  index         create index

  mem           alignment

  version       print version number

 

> minimap2

Usage: minimap2 [options] <target.fa>|<target.idx> [query.fa] [...]

Options:

  Indexing:

    -H           use homopolymer-compressed k-mer (preferrable for PacBio)

    -k INT       k-mer size (no larger than 28) [15]

    -w INT       minimizer window size [10]

    -I NUM       split index for every ~NUM input bases [8G]

    -d FILE      dump index to FILE

  Mapping:

    -f FLOAT     filter out top FLOAT fraction of repetitive minimizers [0.0002]

    -g NUM       stop chain enlongation if there are no minimizers in INT-bp [5000]

    -G NUM       max intron length (effective with -xsplice; changing -r) [200k]

    -F NUM       max fragment length (effective with -xsr or in the fragment mode) [800]

    -r NUM[,NUM] chaining/alignment bandwidth and long-join bandwidth [500,20000]

    -n INT       minimal number of minimizers on a chain [3]

    -m INT       minimal chaining score (matching bases minus log gap penalty) [40]

    -X           skip self and dual mappings (for the all-vs-all mode)

    -p FLOAT     min secondary-to-primary score ratio [0.8]

    -N INT       retain at most INT secondary alignments [5]

  Alignment:

    -A INT       matching score [2]

    -B INT       mismatch penalty (larger value for lower divergence) [4]

    -O INT[,INT] gap open penalty [4,24]

    -E INT[,INT] gap extension penalty; a k-long gap costs min{O1+k*E1,O2+k*E2} [2,1]

    -z INT[,INT] Z-drop score and inversion Z-drop score [400,200]

    -s INT       minimal peak DP alignment score [80]

    -u CHAR      how to find GT-AG. f:transcript strand, b:both strands, n:don't match GT-AG [n]

    -J INT       splice mode. 0: original minimap2 model; 1: miniprot model [1]

  Input/Output:

    -a           output in the SAM format (PAF by default)

    -o FILE      output alignments to FILE [stdout]

    -L           write CIGAR with >65535 ops at the CG tag

    -R STR       SAM read group line in a format like '@RG\tID:foo\tSM:bar'

    -c           output CIGAR in PAF

    --cs[=STR]   output the cs tag; STR is 'short' (if absent) or 'long' [none]

    --ds         output the ds tag, which is an extension to cs

    --MD         output the MD tag

    --eqx        write =/X CIGAR operators

    -Y           use soft clipping for supplementary alignments

    -t INT       number of threads [3]

    -K NUM       minibatch size for mapping [500M]

    --version    show version number

  Preset:

    -x STR       preset (always applied before other options; see minimap2.1 for details) []

                 - lr:hq - accurate long reads (error rate <1%) against a reference genome

                 - splice/splice:hq - spliced alignment for long reads/accurate long reads

                 - asm5/asm10/asm20 - asm-to-ref mapping, for ~0.1/1/5% sequence divergence

                 - sr - short reads against a reference

                 - map-pb/map-hifi/map-ont/map-iclr - CLR/HiFi/Nanopore/ICLR vs reference mapping

                 - ava-pb/ava-ont - PacBio CLR/Nanopore read overlap

 

See `man ./minimap2.1' for detailed description of these and other advanced command-line options.

 

> bwa-mem2   

Looking to launch executable "/usr/local/bin/bwa-mem2.avx2", simd = .avx2

Launching executable "/usr/local/bin/bwa-mem2.avx2"

Usage: bwa-mem2 <command> <arguments>

Commands:

  index         create index

  mem           alignment

  version       print version number

> falco

Usage: falco [OPTIONS] <seqfile1> <seqfile2> ...

 

Options:

  -h, --help               Print this help file and exit  

  -v, --version            Print the version of the program and exit  

  -o, --outdir             Create all output files in the specified 

                           output directory. FALCO-SPECIFIC: If the 

                           directory does not exists, the program will 

                           create it. If this option is not set then 

                           the output file for each sequence file is 

                           created in the same directory as the 

                           sequence file which was processed.  

      --casava             [IGNORED BY FALCO] Files come from raw 

                           casava output. Files in the same sample 

                           group (differing only by the group number) 

                           will be analysed as a set rather than 

                           individually. Sequences with the filter flag 

                           set in the header will be excluded from the 

                           analysis. Files must have the same names 

                           given to them by casava (including being 

                           gzipped and ending with .gz) otherwise they 

                           won't be grouped together correctly.  

      --nano               [IGNORED BY FALCO] Files come from nanopore 

                           sequences and are in fast5 format. In this 

                           mode you can pass in directories to process 

                           and the program will take in all fast5 files 

                           within those directories and produce a 

                           single output file from the sequences found 

                           in all files.  

      --nofilter           [IGNORED BY FALCO] If running with --casava 

                           then don't remove read flagged by casava as 

                           poor quality when performing the QC 

                           analysis.  

      --extract            [ALWAYS ON IN FALCO] If set then the zipped 

                           output file will be uncompressed in the same 

                           directory after it has been created. By 

                           default this option will be set if fastqc is 

                           run in non-interactive mode.  

  -j, --java               [IGNORED BY FALCO] Provides the full path to 

                           the java binary you want to use to launch 

                           fastqc. If not supplied then java is assumed 

                           to be in your path.  

      --noextract          [IGNORED BY FALCO] Do not uncompress the 

                           output file after creating it. You should 

                           set this option if you do not wish to 

                           uncompress the output when running in 

                           non-interactive mode.  

      --nogroup            Disable grouping of bases for reads >50bp. 

                           All reports will show data for every base in 

                           the read. WARNING: When using this option, 

                           your plots may end up a ridiculous size. You 

                           have been warned!  

      --min_length         [NOT YET IMPLEMENTED IN FALCO] Sets an 

                           artificial lower limit on the length of the 

                           sequence to be shown in the report. As long 

                           as you set this to a value greater or equal 

                           to your longest read length then this will 

                           be the sequence length used to create your 

                           read groups. This can be useful for making 

                           directly comaparable statistics from 

                           datasets with somewhat variable read 

                           lengths.  

  -f, --format             Bypasses the normal sequence file format 

                           detection and forces the program to use the 

                           specified format. Validformats are bam, sam, 

                           bam_mapped, sam_mapped, fastq, fq, fastq.gz 

                           or fq.gz.  

  -t, --threads            [NOT YET IMPLEMENTED IN FALCO] Specifies the 

                           number of files which can be processed 

                           simultaneously. Each thread will be 

                           allocated 250MB of memory so you shouldn't 

                           run more threads than your available memory 

                           will cope with, and not more than 6 threads 

                           on a 32 bit machine [1] 

  -c, --contaminants       Specifies a non-default file which contains 

                           the list of contaminants to screen 

                           overrepresented sequences against. The file 

                           must contain sets of named contaminants in 

                           the form name[tab]sequence. Lines prefixed 

                           with a hash will be ignored. Default: 

                           /srv/falco/Configuration/contaminant_list.txt 

  -a, --adapters           Specifies a non-default file which contains 

                           the list of adapter sequences which will be 

                           explicity searched against the library. The 

                           file must contain sets of named adapters in 

                           the form name[tab]sequence. Lines prefixed 

                           with a hash will be ignored. Default: 

                           /srv/falco/Configuration/adapter_list.txt 

  -l, --limits             Specifies a non-default file which contains 

                           a set of criteria which will be used to 

                           determine the warn/error limits for the 

                           various modules. This file can also be used 

                           to selectively remove some modules from the 

                           output all together. The format needs to 

                           mirror the default limits.txt file found in 

                           the Configuration folder. Default: 

                           /srv/falco/Configuration/limits.txt 

  -k, --kmers              [IGNORED BY FALCO AND ALWAYS SET TO 7] 

                           Specifies the length of Kmer to look for in 

                           the Kmer content module. Specified Kmer 

                           length must be between 2 and 10. Default 

                           length is 7 if not specified.  

  -q, --quiet              Supress all progress messages on stdout and 

                           only report errors.  

  -d, --dir                [IGNORED: FALCO DOES NOT CREATE TMP FILES] 

                           Selects a directory to be used for temporary 

                           files written when generating report images. 

                           Defaults to system temp directory if not 

                           specified.  

  -s, -subsample           [Falco only] makes falco faster (but 

                           possibly less accurate) by only processing 

                           reads that are multiple of this value (using 

                           0-based indexing to number reads). [1] 

  -b, -bisulfite           [Falco only] reads are whole genome 

                           bisulfite sequencing, and more Ts and fewer 

                           Cs are therefore expected and will be 

                           accounted for in base content.  

  -r, -reverse-complement  [Falco only] The input is a 

                           reverse-complement. All modules will be 

                           tested by swapping A/T and C/G  

      -skip-data           [Falco only] Do not create FastQC data text 

                           file.  

      -skip-report         [Falco only] Do not create FastQC report 

                           HTML file.  

      -skip-summary        [Falco only] Do not create FastQC summary 

                           file  

  -D, -data-filename       [Falco only] Specify filename for FastQC 

                           data output (TXT). If not specified, it will 

                           be called fastq_data.txt in either the input 

                           file's directory or the one specified in the 

                           --output flag. Only available when running 

                           falco with a single input.  

  -R, -report-filename     [Falco only] Specify filename for FastQC 

                           report output (HTML). If not specified, it 

                           will be called fastq_report.html in either 

                           the input file's directory or the one 

                           specified in the --output flag. Only 

                           available when running falco with a single 

                           input.  

  -S, -summary-filename    [Falco only] Specify filename for the short 

                           summary output (TXT). If not specified, it 

                           will be called fastq_report.html in either 

                           the input file's directory or the one 

                           specified in the --output flag. Only 

                           available when running falco with a single 

                           input.  

  -K, -add-call            [Falco only] add the command call call to 

                           FastQC data output and FastQC report HTML 

                           (this may break the parse of fastqc_data.txt 

                           in programs that are very strict about the 

                           FastQC output format).  

 

Help options:

  -?, -help                print this help message  

      -about               print about message  

> gffread

gffread v0.12.8. Usage:

gffread [-g <genomic_seqs_fasta> | <dir>] [-s <seq_info.fsize>] 

 [-o <outfile>] [-t <trackname>] [-r [<strand>]<chr>:<start>-<end> [-R]]

 [--jmatch <chr>:<start>-<end>] [--no-pseudo] 

 [-CTVNJMKQAFPGUBHZWTOLE] [-w <exons.fa>] [-x <cds.fa>] [-y <tr_cds.fa>]

 [-j ][--ids <IDs.lst> | --nids <IDs.lst>] [--attrs <attr-list>] [-i <maxintron>]

 [--stream] [--bed | --gtf | --tlf] [--table <attrlist>] [--sort-by <ref.lst>]

 [<input_gff>] 

 

 Filter, convert or cluster GFF/GTF/BED records, extract the sequence of

 transcripts (exon or CDS) and more.

 By default (i.e. without -O) only transcripts are processed, discarding any

 other non-transcript features. Default output is a simplified GFF3 with only

 the basic attributes.

 

Options:

 --ids discard records/transcripts if their IDs are not listed in <IDs.lst>

 --nids discard records/transcripts if their IDs are listed in <IDs.lst>

 -i   discard transcripts having an intron larger than <maxintron>

 -l   discard transcripts shorter than <minlen> bases

 -r   only show transcripts overlapping coordinate range <start>..<end>

      (on chromosome/contig <chr>, strand <strand> if provided)

 -R   for -r option, discard all transcripts that are not fully 

      contained within the given range

 --jmatch only output transcripts matching the given junction

 -U   discard single-exon transcripts

 -C   coding only: discard mRNAs that have no CDS features

 --nc non-coding only: discard mRNAs that have CDS features

 --ignore-locus : discard locus features and attributes found in the input

 -A   use the description field from <seq_info.fsize> and add it

      as the value for a 'descr' attribute to the GFF record

 -s   <seq_info.fsize> is a tab-delimited file providing this info

      for each of the mapped sequences:

      <seq-name> <seq-length> <seq-description>

      (useful for -A option with mRNA/EST/protein mappings)

Sorting: (by default, chromosomes are kept in the order they were found)

 --sort-alpha : chromosomes (reference sequences) are sorted alphabetically

 --sort-by : sort the reference sequences by the order in which their

      names are given in the <refseq.lst> file

Misc options: 

 -F   keep all GFF attributes (for non-exon features)

 --keep-exon-attrs : for -F option, do not attempt to reduce redundant

      exon/CDS attributes

 -G   do not keep exon attributes, move them to the transcript feature

      (for GFF3 output)

 --attrs <attr-list> only output the GTF/GFF attributes listed in <attr-list>

    which is a comma delimited list of attribute names to

 --keep-genes : in transcript-only mode (default), also preserve gene records

 --keep-comments: for GFF3 input/output, try to preserve comments

 -O   process other non-transcript GFF records (by default non-transcript

      records are ignored)

 -V   discard any mRNAs with CDS having in-frame stop codons (requires -g)

 -H   for -V option, check and adjust the starting CDS phase

      if the original phase leads to a translation with an 

      in-frame stop codon

 -B   for -V option, single-exon transcripts are also checked on the

      opposite strand (requires -g)

 -P   add transcript level GFF attributes about the coding status of each

      transcript, including partialness or in-frame stop codons (requires -g)

 --add-hasCDS : add a "hasCDS" attribute with value "true" for transcripts

      that have CDS features

 --adj-stop stop codon adjustment: enables -P and performs automatic

      adjustment of the CDS stop coordinate if premature or downstream

 -N   discard multi-exon mRNAs that have any intron with a non-canonical

      splice site consensus (i.e. not GT-AG, GC-AG or AT-AC)

 -J   discard any mRNAs that either lack initial START codon

      or the terminal STOP codon, or have an in-frame stop codon

      (i.e. only print mRNAs with a complete CDS)

 --no-pseudo: filter out records matching the 'pseudo' keyword

 --in-bed: input should be parsed as BED format (automatic if the input

           filename ends with .bed*)

 --in-tlf: input GFF-like one-line-per-transcript format without exon/CDS

           features (see --tlf option below); automatic if the input

           filename ends with .tlf)

 --stream: fast processing of input GFF/BED transcripts as they are received

           ((no sorting, exons must be grouped by transcript in the input data)

Clustering:

 -M/--merge : cluster the input transcripts into loci, discarding

      "redundant" transcripts (those with the same exact introns

      and fully contained or equal boundaries)

 -d <dupinfo> : for -M option, write duplication info to file <dupinfo>

 --cluster-only: same as -M/--merge but without discarding any of the

      "duplicate" transcripts, only create "locus" features

 -K   for -M option: also discard as redundant the shorter, fully contained

       transcripts (intron chains matching a part of the container)

 --cset for -K option, discard single exon transcripts when fully contained

        in an exon of a multi-exon transcript

 -Q   for -M option, no longer require boundary containment when assessing

      redundancy (can be combined with -K); only introns have to match for

      multi-exon transcripts, and >=80% overlap for single-exon transcripts

 -Y   for -M option, enforce -Q but also discard overlapping single-exon 

      transcripts, even on the opposite strand (can be combined with -K)

Output options:

 --force-exons: make sure that the lowest level GFF features are considered

       "exon" features

 --gene2exon: for single-line genes not parenting any transcripts, add an

       exon feature spanning the entire gene (treat it as a transcript)

 --t-adopt:  try to find a parent gene overlapping/containing a transcript

       that does not have any explicit gene Parent

 -D    decode url encoded characters within attributes

 -Z    merge very close exons into a single exon (when intron size<4)

 -g   full path to a multi-fasta file with the genomic sequences

      for all input mappings, OR a directory with single-fasta files

      (one per genomic sequence, with file names matching sequence names)

 -j    output the junctions and the corresponding transcripts

 -u    write a fasta file with the whole span of each transcript

       (including introns)

 -w    write a fasta file with spliced exons for each transcript

 --w-add <N> for the -w option, extract additional <N> bases

       both upstream and downstream of the transcript boundaries

  (this also applies to -u option)

 --w-nocds for -w, disable the output of CDS info in the FASTA file

 -x    write a fasta file with spliced CDS for each GFF transcript

 -y    write a protein fasta file with the translation of CDS for each record

 -W    for -w, -x and -y options, write in the FASTA defline all the exon

       coordinates projected onto the spliced sequence;

 -S    for -y option, use '*' instead of '.' as stop codon translation

 -L    Ensembl GTF to GFF3 conversion, adds version to IDs

 -m    <chr_replace> is a name mapping table for converting reference 

       sequence names, having this 2-column format:

       <original_ref_ID> <new_ref_ID>

 -t    use <trackname> in the 2nd column of each GFF/GTF output line

 -o    write the output records into <outfile> instead of stdout

 -T    main output will be GTF instead of GFF3

 --bed output records in BED format instead of default GFF3

 --tlf output "transcript line format" which is like GFF

       but with exons and CDS related features stored as GFF 

       attributes in the transcript feature line, like this:

         exoncount=N;exons=<exons>;CDSphase=<N>;CDS=<CDScoords> 

       <exons> is a comma-delimited list of exon_start-exon_end coordinates;

       <CDScoords> is CDS_start:CDS_end coordinates or a list like <exons>

 --table output a simple tab delimited format instead of GFF, with columns

       having the values of GFF attributes given in <attrlist>; special

       pseudo-attributes (prefixed by @) are recognized:

       @id, @geneid, @chr, @start, @end, @strand, @track, @numexons, @exons,

       @cds, @introns, @covlen, @cdslen

       If any of -w/-y/-x FASTA output files are enabled, the same fields

       (excluding @id) are appended to the definition line of corresponding

       FASTA records

 -v,-E expose (warn about) duplicate transcript IDs and other potential

       problems with the given GFF/GTF records

> vsearch

vsearch v2.28.1_linux_x86_64, 251.5GB RAM, 128 cores

https://github.com/torognes/vsearch

 

For more help, please enter: vsearch --help

For further details, please consult the manual by entering: man vsearch

 

Selected command examples:

 

vsearch --allpairs_global FILENAME --id 0.5 --alnout FILENAME

vsearch --cluster_size FILENAME --id 0.97 --centroids FILENAME

vsearch --cut FILENAME --cut_pattern G^AATT_C --fastaout FILENAME

vsearch --fastq_chars FILENAME

vsearch --fastq_convert FILENAME --fastqout FILENAME --fastq_ascii 64

vsearch --fastq_eestats FILENAME --output FILENAME

vsearch --fastq_eestats2 FILENAME --output FILENAME

vsearch --fastq_mergepairs FILENAME --reverse FILENAME --fastqout FILENAME

vsearch --fastq_stats FILENAME --log FILENAME

vsearch --fastx_filter FILENAME --fastaout FILENAME --fastq_trunclen 100

vsearch --fastx_getseq FILENAME --label LABEL --fastaout FILENAME

vsearch --fastx_mask FILENAME --fastaout FILENAME

vsearch --fastx_revcomp FILENAME --fastqout FILENAME

vsearch --fastx_subsample FILENAME --fastaout FILENAME --sample_pct 1

vsearch --fastx_uniques FILENAME --output FILENAME

vsearch --makeudb_usearch FILENAME --output FILENAME

vsearch --search_exact FILENAME --db FILENAME --alnout FILENAME

vsearch --sff_convert FILENAME --output FILENAME --sff_clip

vsearch --shuffle FILENAME --output FILENAME

vsearch --sintax FILENAME --db FILENAME --tabbedout FILENAME

vsearch --sortbylength FILENAME --output FILENAME

vsearch --sortbysize FILENAME --output FILENAME

vsearch --uchime_denovo FILENAME --nonchimeras FILENAME

vsearch --uchime_ref FILENAME --db FILENAME --nonchimeras FILENAME

vsearch --usearch_global FILENAME --db FILENAME --id 0.97 --alnout FILENAME

 

Other commands: cluster_fast, cluster_smallmem, cluster_unoise, cut,

                derep_id, derep_fulllength, derep_prefix, derep_smallmem,

                fasta2fastq, fastq_filter, fastq_join, fastx_getseqs,

                fastx_getsubseq, maskfasta, orient, rereplicate, uchime2_denovo,

                uchime3_denovo, udb2fasta, udbinfo, udbstats, version

> bamtools

usage: bamtools [--help] COMMAND [ARGS]

 

Available bamtools commands:

convert         Converts between BAM and a number of other formats

count           Prints number of alignments in BAM file(s)

coverage        Prints coverage statistics from the input BAM file

filter          Filters BAM file(s) by user-specified criteria

header          Prints BAM header information

index           Generates index for BAM file

merge           Merge multiple BAM files into single file

random          Select random alignments from existing BAM file(s), intended more as a testing tool.

resolve         Resolves paired-end reads (marking the IsProperPair flag as needed)

revert          Removes duplicate marks and restores original base qualities

sort            Sorts the BAM file according to some criteria

split           Splits a BAM file on user-specified property, creating a new BAM output file for each value found

stats           Prints some basic statistics from input BAM file(s)

 

See 'bamtools help COMMAND' for more information on a specific command.

引用

<HTSlib>

HTSlib: C library for reading/writing high-throughput sequencing data
James K Bonfield, John Marshall, Petr Danecek, Heng Li, Valeriu Ohan, Andrew Whitwham, Thomas Keane, Robert M Davies
GigaScience, Volume 10, Issue 2, February 2021, giab007

<SAMTools & BCFTools>

Twelve years of SAMtools and BCFtools
Petr Danecek, James K Bonfield, Jennifer Liddle, John Marshall, Valeriu Ohan, Martin O Pollard, Andrew Whitwham, Thomas Keane, Shane A McCarthy, Robert M Davies, Heng Li
GigaScience, Volume 10, Issue 2, February 2021, giab008

SRA Toolkit 

https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=software and the SRA Toolkit Development Team.

<vcftools>

The variant call format and VCFtools

Petr Danecek, Adam Auton, Goncalo Abecasis, Cornelis A. Albers, Eric Banks, Mark A. DePristo, Robert Handsaker, Gerton Lunter, Gabor Marth, Stephen T. Sherry, Gilean McVean, Richard Durbin and 1000 Genomes Project Analysis Group

Bioinformatics. 2011 Aug 1; 27(15): 2156–2158.

<Bowtie2>

Fast gapped-read alignment with Bowtie 2

Ben Langmead1,2 and Steven L Salzberg1,2,3

Nat Methods. 2012 Mar 4; 9(4): 357–359.

<Bwa>

Fast and accurate short read alignment with Burrows-Wheeler transform

Heng Li, Richard Durbin

Bioinformatics. 2009 Jul 15;25(14):1754-60

Fast and accurate long-read alignment with Burrows-Wheeler transform

Heng Li, Richard Durbin

Bioinformatics. 2010 Mar 1;26(5):589-95.

<Bwa2>

Efficient Architecture-Aware Acceleration of BWA-MEM for Multicore Systems. 

Vasimuddin Md, Sanchit Misra, Heng Li, Srinivas Aluru.

IEEE Parallel and Distributed Processing Symposium (IPDPS), 2019. 10.1109/IPDPS.2019.00041

<Minimap2>

Minimap2: pairwise alignment for nucleotide sequences 

Heng Li

Bioinformatics, Volume 34, Issue 18, September 2018, Pages 3094–3100

New strategies to improve minimap2 alignment accuracy 

Heng Li

Bioinformatics, Volume 37, Issue 23, December 2021, Pages 4572–4574

<Hisat2>

HISAT: a fast spliced aligner with low memory requirements.

Kim D, Langmead B, Salzberg SL

Nat Methods. 2015 Apr;12(4):357-60.

<STAR>

STAR: ultrafast universal RNA-seq aligner

Alexander Dobi, Carrie A. Davis, Felix Schlesinger, Jorg Drenkow, Chris Zaleski, Sonali Jha, Philippe Batut, Mark Chaisson, and Thomas R. Gingeras

Bioinformatics. 2013 Jan; 29(1): 15–21.

Mapping RNA-seq Reads with STAR

Alexander Dobin and Thomas R. Gingeras

Curr Protoc Bioinformatics. Author manuscript; available in PMC 2016 Sep 3.

<Falco>

Falco: high-speed FastQC emulation for quality control of sequencing data

Guilherme de Sena Brandine 1, Andrew D Smith

F1000Res. 2019 Nov 7:8:1874

<GffRead>

GFF Utilities: GffRead and GffCompare

Geo Pertea and Mihaela Pertea

Published online 2020 Sep 9. doi: 10.12688/f1000research.23297.2

<Cd-hit>

Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences.

Li W, Godzik A.

Bioinformatics. 2006 Jul 1;22(13):1658-9. Epub 2006 May 26.

<VSEARCH>

VSEARCH: a versatile open source tool for metagenomics
Torbjørn Rognes, Tomáš Flouri, Ben Nichols, Christopher Quince, Frédéric Mahé

PeerJ. 2016; 4: e2584

<BamTools>

BamTools: a C++ API and toolkit for analyzing and managing BAM files
Derek W Barnett, Erik K Garrison, Aaron R Quinlan, Michael P Strömberg, Gabor T Marth

Bioinformatics. 2011 Jun 15;27(12):1691-2.

 

関連

バイオインフォマティクスのツールを再現性よく実行するためのコンテナ仮想化ツール群 BioContainers 

教育にも利用できる、何百ものバイオインフォマティクスツールが入った包括的コンテナ環境 ORCA