2024/05/10 Added BamTools
This post summarizes the steps for installing basic HTS tools on Ubuntu. Tools that are essentially Python wrappers are quickest to install through Anaconda, so they are excluded; the targets here are widely used tools written purely in compiled languages.
Preparing the Docker image and libraries
We will work with Docker, pulling an Ubuntu base image. If you have already installed these tools with conda or Homebrew, things can get confusing, so take care. For example, if you manage your tools with conda, ideally run conda deactivate beforehand (check that "(base)" or the like is not shown at the left edge of your prompt).
After pulling the image with the latest tag, run it and install the compilers and the libraries the tools depend on.
docker pull ubuntu:latest
docker run -itv $PWD:/data ubuntu:latest #add "--rm" if this is a throwaway container
#the required dependencies must be installed first
apt update
apt install -y make gcc git autoconf automake g++ cmake build-essential
apt install -y zlib1g-dev
apt install -y liblzma-dev
apt install -y libbz2-dev
apt install -y libcurl4-openssl-dev
apt install -y libssl-dev #OpenSSL library
apt install -y pkg-config
apt install -y libdeflate-dev
apt install -y libncurses5-dev
apt install -y libgsl-dev
apt install -y libperl-dev
apt install -y libxml2-dev
#move to /srv *1
cd /srv
=============================================================================
#Notes
#1 After finishing work, exit and commit. Example:
docker ps -a #check the container ID
#2 commit by specifying the ID
docker commit 1c4af84ced55 kazumax/ngsbaseimage
#3 reuse (run the bwa index command from the host side)
docker run --rm -itv $PWD:/root -w /root kazumax/ngsbaseimage bwa index -a is GCA_021272385.1.fna
#or (example with multiple commands)
command1="echo bwa indexing"
command2="bwa index -a is GCA_021272385.1.fna"
docker run --rm -itv $PWD:/root -w /root kazumax/ngsbaseimage bash -c "$command1; $command2"
*1 Here we clone into /srv. Other candidates include /opt and the like (for reference). As a regular user, somewhere under home is fine.
That completes the preparation. We will now install the tools, in order, under /srv.
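As a quick sanity check before building anything, you can confirm that the toolchain and the -dev libraries above are visible (a minimal sketch; the reported versions will of course differ):

```shell
# compilers and build tools should now be on PATH
gcc --version | head -n 1
g++ --version | head -n 1
cmake --version | head -n 1
# pkg-config should resolve the development libraries installed above
pkg-config --modversion zlib liblzma libdeflate
```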
Installation
1. Installing SAMtools, BCFtools, and BamTools
https://github.com/samtools/htslib
https://github.com/samtools/samtools
https://github.com/samtools/bcftools
https://github.com/pezmaster31/bamtools
https://github.com/telatin/bamtocov
#1 install htslib (required for building samtools and bcftools) *2
git clone --recurse-submodules https://github.com/samtools/htslib.git
cd htslib/
autoheader
autoreconf -i
./configure --prefix=/usr/local
make -j10
make install
cd ../
#2 install samtools
git clone https://github.com/samtools/samtools.git
cd samtools/
autoheader
autoconf -Wno-syntax
./configure --prefix=/usr/local # *3
make -j10
make install
cd ../
samtools -h
#3 install bcftools
git clone https://github.com/samtools/bcftools.git
cd bcftools/
autoheader
autoconf
./configure --enable-libgsl --enable-perl-filters
make -j10
make install
cd ../
bcftools -h
#4 install BamTools
git clone https://github.com/pezmaster31/bamtools.git
cd bamtools/
mkdir build
cd build
cmake -DCMAKE_INSTALL_PREFIX=/usr/local ..
make -j10
make install
cd ../../
bamtools -h
*2 HTSlib uses several functions from the htscodecs project. When cloning, initialize and update the submodules.
*3 By default, configure builds against the HTSlib found in "../htslib". Specify --with-htslib=DIR to build against a different HTSlib on the system.
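As a functional check of the freshly built tools, here is a minimal sketch of a typical samtools/bcftools workflow. The file names (ref.fa, aln.sam) are hypothetical stand-ins for your own data:

```shell
# sort an alignment and index the result
samtools sort -@ 4 -O bam -o aln.sort.bam aln.sam
samtools index aln.sort.bam
samtools flagstat aln.sort.bam   # quick mapping statistics

# call variants with bcftools (the reference must be faidx-indexed)
samtools faidx ref.fa
bcftools mpileup -f ref.fa aln.sort.bam | bcftools call -mv -Oz -o calls.vcf.gz
bcftools index calls.vcf.gz
bcftools stats calls.vcf.gz | head
```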
2. Installing VCFtools
https://github.com/vcftools/vcftools
git clone https://github.com/vcftools/vcftools.git
cd vcftools/
./autogen.sh
./configure --prefix=/usr/local
make -j10
make install
cd ../
vcftools -h
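A small usage sketch for vcftools; calls.vcf.gz is a hypothetical bgzipped VCF:

```shell
# keep biallelic SNPs with minor allele frequency >= 0.05
vcftools --gzvcf calls.vcf.gz \
         --remove-indels --maf 0.05 \
         --min-alleles 2 --max-alleles 2 \
         --recode --recode-INFO-all --out filtered
# results are written to filtered.recode.vcf
```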
3. Installing SRA Toolkit
https://github.com/ncbi/sra-tools
git clone https://github.com/ncbi/sra-tools.git
cd sra-tools/
#ncbi-vdb is required
git clone https://github.com/ncbi/ncbi-vdb.git
cd ncbi-vdb
./configure --prefix=/usr/local/ncbi-vdb
make -j20
make install -j20
cd ../
./configure --with-ncbi-vdb=/usr/local/ncbi-vdb --prefix=/usr/local
make -j20
make install
cd ../
fastq-dump -h
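The build also provides prefetch and fasterq-dump. A minimal download sketch (SRR000001 is a public accession used here purely as an example):

```shell
# download the .sra object, then convert to FASTQ
prefetch SRR000001
fasterq-dump --split-3 -e 8 -O fastq_out SRR000001
# or with the classic fastq-dump interface
fastq-dump --split-3 SRR000001
```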
4. Installing mapping tools
https://github.com/BenLangmead/bowtie2
https://github.com/bwa-mem2/bwa-mem2
https://github.com/lh3/minimap2
https://github.com/DaehwanKimLab/hisat2
https://github.com/alexdobin/STAR
#bowtie2
git clone https://github.com/BenLangmead/bowtie2.git
cd bowtie2/
make -j20
make install
cd ../
bowtie2 -h
#bwa1
git clone https://github.com/lh3/bwa.git
cd bwa
make -j20
echo "export PATH=\$PATH:$PWD" >> ~/.bashrc && source ~/.bashrc #add this directory (containing the bwa binary) to PATH
#or
cp bwa /usr/local/bin/
cd ../
bwa
#bwa2
git clone --recursive https://github.com/bwa-mem2/bwa-mem2
cd bwa-mem2
make -j20
echo "export PATH=\$PATH:$PWD" >> ~/.bashrc && source ~/.bashrc #add this directory (containing the bwa-mem2 binaries) to PATH
#or
cp bwa-mem2* /usr/local/bin/
cd ../
bwa-mem2
#minimap2
git clone https://github.com/lh3/minimap2
cd minimap2
make -j20
echo "export PATH=\$PATH:$PWD" >> ~/.bashrc && source ~/.bashrc #add this directory (containing the minimap2 binary) to PATH
#or
cp minimap2 /usr/local/bin/
cd ../
minimap2
#hisat2 (python is also required)
git clone https://github.com/DaehwanKimLab/hisat2.git
cd hisat2
make -j20
echo "export PATH=\$PATH:$PWD" >> ~/.bashrc && source ~/.bashrc #add this directory (containing the hisat2 binaries) to PATH
cd ../
hisat2 -h
#STAR (prebuilt binaries are provided in STAR/bin/Linux_x86_64_static, so on 64-bit x86 you only need to add them to PATH)
git clone https://github.com/alexdobin/STAR.git
cd STAR/bin/Linux_x86_64_static/
echo "export PATH=\$PATH:$PWD" >> ~/.bashrc && source ~/.bashrc
#or
ln -s $PWD/STAR /usr/local/bin/
cd ../../../
STAR --help
(pseudo-mapping tools are excluded)
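To check the mappers end to end, here is a minimal sketch that indexes a genome and maps paired-end reads, piping into the samtools installed earlier (ref.fa and reads_1/2.fastq.gz are hypothetical names):

```shell
# bwa-mem2: index, map, sort, index
bwa-mem2 index ref.fa
bwa-mem2 mem -t 8 ref.fa reads_1.fastq.gz reads_2.fastq.gz \
    | samtools sort -@ 4 -o aln.sort.bam -
samtools index aln.sort.bam

# the same mapping with minimap2 and its short-read preset
minimap2 -ax sr -t 8 ref.fa reads_1.fastq.gz reads_2.fastq.gz \
    | samtools sort -@ 4 -o aln.mm2.bam -
```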
5. Installing FASTQ QC tools
https://github.com/smithlabcode/falco
#falco (a fast alternative to FastQC)
git clone https://github.com/smithlabcode/falco.git
cd falco/
./autogen.sh
./configure CXXFLAGS="-O3 -Wall" --enable-hts
make HAVE_HTSLIB=1 all -j 20
make HAVE_HTSLIB=1 install
cd ../
(Excluded: cutadapt (written in Python), Trim Galore (written in Perl; a wrapper around FastQC and cutadapt), and Trimmomatic (written in Java; distributed as a .jar). fastp has many dependencies, so conda is the easiest way to install it.)
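A minimal falco run might look like the following; qc_out and the read file are hypothetical names, and falco creates the output directory if it does not exist:

```shell
# QC a gzipped FASTQ file; reports are written under qc_out/
falco -o qc_out reads_1.fastq.gz
# output includes fastqc_data.txt, summary.txt and an HTML report
# compatible with FastQC's format
```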
6. Installing a sequence clustering tool
https://github.com/weizhongli/cdhit
#cd-hit
git clone https://github.com/weizhongli/cdhit.git
cd cdhit/
#build so that long sequences can also be handled
make MAX_SEQ=1000000 -j4
make install
cd ../
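A usage sketch for cd-hit; per the cd-hit user's guide the word size -n must match the identity cutoff -c, and the input file names here are hypothetical:

```shell
# cluster proteins at 90% identity (-n 5 is the word size for c=0.9)
cd-hit -i proteins.faa -o proteins_nr90 -c 0.9 -n 5 -T 8 -M 0
# nucleotide sequences use cd-hit-est (-n 10 suits c=0.95)
cd-hit-est -i seqs.fna -o seqs_nr95 -c 0.95 -n 10 -T 8 -M 0
```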
7. Installing a GFF/GTF manipulation tool
https://github.com/gpertea/gffread
#gffread
git clone https://github.com/gpertea/gffread
cd gffread
make release -j 10
cd ../
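A usage sketch for gffread, using the option names from its help output; genome.fa and annotation.gff3 are hypothetical names:

```shell
# extract spliced exon, CDS and translated CDS sequences
gffread -g genome.fa -w transcripts.fa -x cds.fa -y proteins.fa annotation.gff3
# convert GFF3 to GTF
gffread annotation.gff3 --gtf -o annotation.gtf
```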
8. Installing a sequence manipulation tool
https://github.com/torognes/vsearch
#vsearch
git clone https://github.com/torognes/vsearch.git
cd vsearch
./autogen.sh
./configure CFLAGS="-O3" CXXFLAGS="-O3"
make -j20
make install
cd ../
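A usage sketch for vsearch, for example dereplicating amplicon reads and then clustering at 97% identity (file names are hypothetical):

```shell
# dereplicate full-length sequences, keeping abundance annotations
vsearch --derep_fulllength seqs.fasta --sizeout --output derep.fasta
# cluster by decreasing abundance at 97% identity
vsearch --cluster_size derep.fasta --id 0.97 \
        --centroids otus.fasta --uc clusters.uc --threads 8
```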
What else?
Excluded tools
・ETE toolkit (Python-based; to be covered separately)
・bedtools (many dependencies, including python and bamtools)
・IGV (Java is heavyweight)
############################################################################################
Help output for the main commands
> samtools -h
Program: samtools (Tools for alignments in the SAM format)
Version: 1.20-6-g95f5426 (using htslib 1.20-15-gb204d55c)
Usage: samtools <command> [options]
Commands:
-- Indexing
dict create a sequence dictionary file
faidx index/extract FASTA
fqidx index/extract FASTQ
index index alignment
-- Editing
calmd recalculate MD/NM tags and '=' bases
fixmate fix mate information
reheader replace BAM header
targetcut cut fosmid regions (for fosmid pool only)
addreplacerg adds or replaces RG tags
markdup mark duplicates
ampliconclip clip oligos from the end of reads
-- File operations
collate shuffle and group alignments by name
cat concatenate BAMs
consensus produce a consensus Pileup/FASTA/FASTQ
merge merge sorted alignments
mpileup multi-way pileup
sort sort alignment file
split splits a file by read group
quickcheck quickly check if SAM/BAM/CRAM file appears intact
fastq converts a BAM to a FASTQ
fasta converts a BAM to a FASTA
import Converts FASTA or FASTQ files to SAM/BAM/CRAM
reference Generates a reference from aligned data
reset Reverts aligner changes in reads
-- Statistics
bedcov read depth per BED region
coverage alignment depth and percent coverage
depth compute the depth
flagstat simple stats
idxstats BAM index stats
cram-size list CRAM Content-ID and Data-Series sizes
phase phase heterozygotes
stats generate stats (former bamcheck)
ampliconstats generate amplicon specific stats
-- Viewing
flags explain BAM flags
head header viewer
tview text alignment viewer
view SAM<->BAM<->CRAM conversion
depad convert padded BAM to unpadded BAM
samples list the samples in a set of SAM/BAM/CRAM files
-- Misc
help [cmd] display this help message or help for [cmd]
version detailed version information
> bcftools -h
Program: bcftools (Tools for variant calling and manipulating VCFs and BCFs)
License: GNU GPLv3+, due to use of the GNU Scientific Library
Version: 1.20-12-g18439d1d (using htslib 1.20-15-gb204d55c)
Usage: bcftools [--version|--version-only] [--help] <command> <argument>
Commands:
-- Indexing
index index VCF/BCF files
-- VCF/BCF manipulation
annotate annotate and edit VCF/BCF files
concat concatenate VCF/BCF files from the same set of samples
convert convert VCF/BCF files to different formats and back
head view VCF/BCF file headers
isec intersections of VCF/BCF files
merge merge VCF/BCF files files from non-overlapping sample sets
norm left-align and normalize indels
plugin user-defined plugins
query transform VCF/BCF into user-defined formats
reheader modify VCF/BCF header, change sample names
sort sort VCF/BCF file
view VCF/BCF conversion, view, subset and filter VCF/BCF files
-- VCF/BCF analysis
call SNP/indel calling
consensus create consensus sequence by applying VCF variants
cnv HMM CNV calling
csq call variation consequences
filter filter VCF/BCF files using fixed thresholds
gtcheck check sample concordance, detect sample swaps and contamination
mpileup multi-way pileup producing genotype likelihoods
polysomy detect number of chromosomal copies
roh identify runs of autozygosity (HMM)
stats produce VCF/BCF stats
-- Plugins (collection of programs for calling, file manipulation & analysis)
39 plugins available, run "bcftools plugin -lv" to see a complete list
Most commands accept VCF, bgzipped VCF, and BCF with the file type detected
automatically even when streaming from a pipe. Indexed VCF and BCF will work
in all situations. Un-indexed VCF and BCF and streams will work in most but
not all situations.
> vcftools -h
VCFtools (0.1.17)
© Adam Auton and Anthony Marcketta 2009
Process Variant Call Format files
For a list of options, please go to:
https://vcftools.github.io/man_latest.html
Alternatively, a man page is available, type:
man vcftools
Questions, comments, and suggestions should be emailed to:
vcftools-help@lists.sourceforge.net
> fastq-dump -h
Usage:
fastq-dump [options] <path> [<path>...]
fastq-dump [options] <accession>
INPUT
-A|--accession <accession> Replaces accession derived from <path> in
filename(s) and deflines (only for single
table dump)
--table <table-name> Table name within cSRA object, default is
"SEQUENCE"
PROCESSING
Read Splitting Sequence data may be used in raw form or
split into individual reads
--split-spot Split spots into individual reads
Full Spot Filters Applied to the full spot independently
of --split-spot
-N|--minSpotId <rowid> Minimum spot id
-X|--maxSpotId <rowid> Maximum spot id
--spot-groups <[list]> Filter by SPOT_GROUP (member): name[,...]
-W|--clip Remove adapter sequences from reads
Common Filters Applied to spots when --split-spot is not
set, otherwise - to individual reads
-M|--minReadLen <len> Filter by sequence length >= <len>
-R|--read-filter <[filter]> Split into files by READ_FILTER value
optionally filter by value:
pass|reject|criteria|redacted
-E|--qual-filter Filter used in early 1000 Genomes data: no
sequences starting or ending with >= 10N
--qual-filter-1 Filter used in current 1000 Genomes data
Filters based on alignments Filters are active when alignment
data are present
--aligned Dump only aligned sequences
--unaligned Dump only unaligned sequences
--aligned-region <name[:from-to]> Filter by position on genome. Name can
either be accession.version (ex:
NC_000001.10) or file specific name (ex:
"chr1" or "1"). "from" and "to" are 1-based
coordinates
--matepair-distance <from-to|unknown> Filter by distance between matepairs.
Use "unknown" to find matepairs split
between the references. Use from-to to limit
matepair distance on the same reference
Filters for individual reads Applied only with --split-spot set
--skip-technical Dump only biological reads
OUTPUT
-O|--outdir <path> Output directory, default is working
directory '.' )
-Z|--stdout Output to stdout, all split data become
joined into single stream
--gzip Compress output using gzip: deprecated, not
recommended
--bzip2 Compress output using bzip2: deprecated,
not recommended
Multiple File Options Setting these options will produce more
than 1 file, each of which will be suffixed
according to splitting criteria.
--split-files Write reads into separate files. Read
number will be suffixed to the file name.
NOTE! The `--split-3` option is recommended.
In cases where not all spots have the same
number of reads, this option will produce
files that WILL CAUSE ERRORS in most programs
which process split pair fastq files.
--split-3 3-way splitting for mate-pairs. For each
spot, if there are two biological reads
satisfying filter conditions, the first is
placed in the `*_1.fastq` file, and the
second is placed in the `*_2.fastq` file. If
there is only one biological read
satisfying the filter conditions, it is
placed in the `*.fastq` file.All other
reads in the spot are ignored.
-G|--spot-group Split into files by SPOT_GROUP (member name)
-R|--read-filter <[filter]> Split into files by READ_FILTER value
optionally filter by value:
pass|reject|criteria|redacted
-T|--group-in-dirs Split into subdirectories instead of files
-K|--keep-empty-files Do not delete empty files
FORMATTING
Sequence
-C|--dumpcs <[cskey]> Formats sequence using color space (default
for SOLiD),"cskey" may be specified for
translation
-B|--dumpbase Formats sequence using base space (default
for other than SOLiD).
Quality
-Q|--offset <integer> Offset to use for quality conversion,
default is 33
--fasta <[line width]> FASTA only, no qualities, optional line
wrap width (set to zero for no wrapping)
--suppress-qual-for-cskey suppress quality-value for cskey
Defline
-F|--origfmt Defline contains only original sequence name
-I|--readids Append read id after spot id as
'accession.spot.readid' on defline
--helicos Helicos style defline
--defline-seq <fmt> Defline format specification for sequence.
--defline-qual <fmt> Defline format specification for quality.
<fmt> is string of characters and/or
variables. The variables can be one of: $ac
- accession, $si spot id, $sn spot
name, $sg spot group (barcode), $sl spot
length in bases, $ri read number, $rn
read name, $rl read length in bases. ''
could be used for an optional output: if
all vars in yield empty values whole
group is not printed. Empty value is empty
string or for numeric variables. Ex:
@$sn[_$rn]/$ri '_$rn' is omitted if name
is empty
OTHER:
--ngc <path> <path> to ngc file
--disable-multithreading disable multithreading
-h|--help Output brief explanation of program usage
-V|--version Display the version of the program
-L|--log-level <level> Logging level as number or enum string One
of (fatal|sys|int|err|warn|info) or (0-5)
Current/default is warn
-v|--verbose Increase the verbosity level of the program
Use multiple times for more verbosity
--ncbi_error_report Control program execution environment
report generation (if implemented). One of
(never|error|always). Default is error
--legacy-report use legacy style 'Written spots' for tool
> bowtie2 -h
Bowtie 2 version 2.5.3 by Ben Langmead (langmea@cs.jhu.edu, www.cs.jhu.edu/~langmea)
Usage:
bowtie2 [options]* -x <bt2-idx> {-1 <m1> -2 <m2> | -U <r> | --interleaved <i> | -b <bam>} [-S <sam>]
<bt2-idx> Index filename prefix (minus trailing .X.bt2).
NOTE: Bowtie 1 and Bowtie 2 indexes are not compatible.
<m1> Files with #1 mates, paired with files in <m2>.
Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).
<m2> Files with #2 mates, paired with files in <m1>.
Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).
<r> Files with unpaired reads.
Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).
<i> Files with interleaved paired-end FASTQ/FASTA reads
Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).
<bam> Files are unaligned BAM sorted by read name.
<sam> File for SAM output (default: stdout)
<m1>, <m2>, <r> can be comma-separated lists (no whitespace) and can be
specified many times. E.g. '-U file1.fq,file2.fq -U file3.fq'.
Options (defaults in parentheses):
Input:
-q query input files are FASTQ .fq/.fastq (default)
--tab5 query input files are TAB5 .tab5
--tab6 query input files are TAB6 .tab6
--qseq query input files are in Illumina's qseq format
-f query input files are (multi-)FASTA .fa/.mfa
-r query input files are raw one-sequence-per-line
-F k:<int>,i:<int> query input files are continuous FASTA where reads
are substrings (k-mers) extracted from the FASTA file
and aligned at offsets 1, 1+i, 1+2i ... end of reference
-c <m1>, <m2>, <r> are sequences themselves, not files
-s/--skip <int> skip the first <int> reads/pairs in the input (none)
-u/--upto <int> stop after first <int> reads/pairs (no limit)
-5/--trim5 <int> trim <int> bases from 5'/left end of reads (0)
-3/--trim3 <int> trim <int> bases from 3'/right end of reads (0)
--trim-to [3:|5:]<int> trim reads exceeding <int> bases from either 3' or 5' end
If the read end is not specified then it defaults to 3 (0)
--phred33 qualities are Phred+33 (default)
--phred64 qualities are Phred+64
--int-quals qualities encoded as space-delimited integers
Presets: Same as:
For --end-to-end:
--very-fast -D 5 -R 1 -N 0 -L 22 -i S,0,2.50
--fast -D 10 -R 2 -N 0 -L 22 -i S,0,2.50
--sensitive -D 15 -R 2 -N 0 -L 22 -i S,1,1.15 (default)
--very-sensitive -D 20 -R 3 -N 0 -L 20 -i S,1,0.50
For --local:
--very-fast-local -D 5 -R 1 -N 0 -L 25 -i S,1,2.00
--fast-local -D 10 -R 2 -N 0 -L 22 -i S,1,1.75
--sensitive-local -D 15 -R 2 -N 0 -L 20 -i S,1,0.75 (default)
--very-sensitive-local -D 20 -R 3 -N 0 -L 20 -i S,1,0.50
Alignment:
-N <int> max # mismatches in seed alignment; can be 0 or 1 (0)
-L <int> length of seed substrings; must be >3, <32 (22)
-i <func> interval between seed substrings w/r/t read len (S,1,1.15)
--n-ceil <func> func for max # non-A/C/G/Ts permitted in aln (L,0,0.15)
--dpad <int> include <int> extra ref chars on sides of DP table (15)
--gbar <int> disallow gaps within <int> nucs of read extremes (4)
--ignore-quals treat all quality values as 30 on Phred scale (off)
--nofw do not align forward (original) version of read (off)
--norc do not align reverse-complement version of read (off)
--no-1mm-upfront do not allow 1 mismatch alignments before attempting to
scan for the optimal seeded alignments
--end-to-end entire read must align; no clipping (on)
OR
--local local alignment; ends might be soft clipped (off)
Scoring:
--ma <int> match bonus (0 for --end-to-end, 2 for --local)
--mp <int> max penalty for mismatch; lower qual = lower penalty (6)
--np <int> penalty for non-A/C/G/Ts in read/ref (1)
--rdg <int>,<int> read gap open, extend penalties (5,3)
--rfg <int>,<int> reference gap open, extend penalties (5,3)
--score-min <func> min acceptable alignment score w/r/t read length
(G,20,8 for local, L,-0.6,-0.6 for end-to-end)
Reporting:
(default) look for multiple alignments, report best, with MAPQ
OR
-k <int> report up to <int> alns per read; MAPQ not meaningful
OR
-a/--all report all alignments; very slow, MAPQ not meaningful
Effort:
-D <int> give up extending after <int> failed extends in a row (15)
-R <int> for reads w/ repetitive seeds, try <int> sets of seeds (2)
Paired-end:
-I/--minins <int> minimum fragment length (0)
-X/--maxins <int> maximum fragment length (500)
--fr/--rf/--ff -1, -2 mates align fw/rev, rev/fw, fw/fw (--fr)
--no-mixed suppress unpaired alignments for paired reads
--no-discordant suppress discordant alignments for paired reads
--dovetail concordant when mates extend past each other
--no-contain not concordant when one mate alignment contains other
--no-overlap not concordant when mates overlap at all
BAM:
--align-paired-reads
Bowtie2 will, by default, attempt to align unpaired BAM reads.
Use this option to align paired-end reads instead.
--preserve-tags Preserve tags from the original BAM record by
appending them to the end of the corresponding SAM output.
Output:
-t/--time print wall-clock time taken by search phases
--un <path> write unpaired reads that didn't align to <path>
--al <path> write unpaired reads that aligned at least once to <path>
--un-conc <path> write pairs that didn't align concordantly to <path>
--al-conc <path> write pairs that aligned concordantly at least once to <path>
(Note: for --un, --al, --un-conc, or --al-conc, add '-gz' to the option name, e.g.
--un-gz <path>, to gzip compress output, or add '-bz2' to bzip2 compress output.)
--quiet print nothing to stderr except serious errors
--met-file <path> send metrics to file at <path> (off)
--met-stderr send metrics to stderr (off)
--met <int> report internal counters & metrics every <int> secs (1)
--no-unal suppress SAM records for unaligned reads
--no-head suppress header lines, i.e. lines starting with @
--no-sq suppress @SQ header lines
--rg-id <text> set read group id, reflected in @RG line and RG:Z: opt field
--rg <text> add <text> ("lab:value") to @RG line of SAM header.
Note: @RG line only printed when --rg-id is set.
--omit-sec-seq put '*' in SEQ and QUAL fields for secondary alignments.
--sam-no-qname-trunc
Suppress standard behavior of truncating readname at first whitespace
at the expense of generating non-standard SAM.
--xeq Use '='/'X', instead of 'M,' to specify matches/mismatches in SAM record.
--soft-clipped-unmapped-tlen
Exclude soft-clipped bases when reporting TLEN
--sam-append-comment
Append FASTA/FASTQ comment to SAM record
Performance:
-p/--threads <int> number of alignment threads to launch (1)
--reorder force SAM output order to match order of input reads
--mm use memory-mapped I/O for index; many 'bowtie's can share
Other:
--qc-filter filter out reads that are bad according to QSEQ filter
--seed <int> seed for random number generator (0)
--non-deterministic
seed rand. gen. arbitrarily instead of using read attributes
--version print version information and quit
-h/--help print this usage message
> bwa
Program: bwa (alignment via Burrows-Wheeler transformation)
Version: 0.7.18-r1243-dirty
Contact: Heng Li <hli@ds.dfci.harvard.edu>
Usage: bwa <command> [options]
Command: index index sequences in the FASTA format
mem BWA-MEM algorithm
fastmap identify super-maximal exact matches
pemerge merge overlapping paired ends (EXPERIMENTAL)
aln gapped/ungapped alignment
samse generate alignment (single ended)
sampe generate alignment (paired ended)
bwasw BWA-SW for long queries (DEPRECATED)
shm manage indices in shared memory
fa2pac convert FASTA to PAC format
pac2bwt generate BWT from PAC
pac2bwtgen alternative algorithm for generating BWT
bwtupdate update .bwt to the new format
bwt2sa generate SA from BWT and Occ
Note: To use BWA, you need to first index the genome with `bwa index'.
There are three alignment algorithms in BWA: `mem', `bwasw', and
`aln/samse/sampe'. If you are not sure which to use, try `bwa mem'
first. Please `man ./bwa.1' for the manual.
> bwa-mem2
Looking to launch executable "/srv/bwa-mem2/bwa-mem2.avx2", simd = .avx2
Launching executable "/srv/bwa-mem2/bwa-mem2.avx2"
Usage: bwa-mem2 <command> <arguments>
Commands:
index create index
mem alignment
version print version number
> minimap2
Usage: minimap2 [options] <target.fa>|<target.idx> [query.fa] [...]
Options:
Indexing:
-H use homopolymer-compressed k-mer (preferrable for PacBio)
-k INT k-mer size (no larger than 28) [15]
-w INT minimizer window size [10]
-I NUM split index for every ~NUM input bases [8G]
-d FILE dump index to FILE
Mapping:
-f FLOAT filter out top FLOAT fraction of repetitive minimizers [0.0002]
-g NUM stop chain enlongation if there are no minimizers in INT-bp [5000]
-G NUM max intron length (effective with -xsplice; changing -r) [200k]
-F NUM max fragment length (effective with -xsr or in the fragment mode) [800]
-r NUM[,NUM] chaining/alignment bandwidth and long-join bandwidth [500,20000]
-n INT minimal number of minimizers on a chain [3]
-m INT minimal chaining score (matching bases minus log gap penalty) [40]
-X skip self and dual mappings (for the all-vs-all mode)
-p FLOAT min secondary-to-primary score ratio [0.8]
-N INT retain at most INT secondary alignments [5]
Alignment:
-A INT matching score [2]
-B INT mismatch penalty (larger value for lower divergence) [4]
-O INT[,INT] gap open penalty [4,24]
-E INT[,INT] gap extension penalty; a k-long gap costs min{O1+k*E1,O2+k*E2} [2,1]
-z INT[,INT] Z-drop score and inversion Z-drop score [400,200]
-s INT minimal peak DP alignment score [80]
-u CHAR how to find GT-AG. f:transcript strand, b:both strands, n:don't match GT-AG [n]
-J INT splice mode. 0: original minimap2 model; 1: miniprot model [1]
Input/Output:
-a output in the SAM format (PAF by default)
-o FILE output alignments to FILE [stdout]
-L write CIGAR with >65535 ops at the CG tag
-R STR SAM read group line in a format like '@RG\tID:foo\tSM:bar'
-c output CIGAR in PAF
--cs[=STR] output the cs tag; STR is 'short' (if absent) or 'long' [none]
--ds output the ds tag, which is an extension to cs
--MD output the MD tag
--eqx write =/X CIGAR operators
-Y use soft clipping for supplementary alignments
-t INT number of threads [3]
-K NUM minibatch size for mapping [500M]
--version show version number
Preset:
-x STR preset (always applied before other options; see minimap2.1 for details) []
- lr:hq - accurate long reads (error rate <1%) against a reference genome
- splice/splice:hq - spliced alignment for long reads/accurate long reads
- asm5/asm10/asm20 - asm-to-ref mapping, for ~0.1/1/5% sequence divergence
- sr - short reads against a reference
- map-pb/map-hifi/map-ont/map-iclr - CLR/HiFi/Nanopore/ICLR vs reference mapping
- ava-pb/ava-ont - PacBio CLR/Nanopore read overlap
See `man ./minimap2.1' for detailed description of these and other advanced command-line options.
> falco
Usage: falco [OPTIONS] <seqfile1> <seqfile2> ...
Options:
-h, --help Print this help file and exit
-v, --version Print the version of the program and exit
-o, --outdir Create all output files in the specified
output directory. FALCO-SPECIFIC: If the
directory does not exists, the program will
create it. If this option is not set then
the output file for each sequence file is
created in the same directory as the
sequence file which was processed.
--casava [IGNORED BY FALCO] Files come from raw
casava output. Files in the same sample
group (differing only by the group number)
will be analysed as a set rather than
individually. Sequences with the filter flag
set in the header will be excluded from the
analysis. Files must have the same names
given to them by casava (including being
gzipped and ending with .gz) otherwise they
won't be grouped together correctly.
--nano [IGNORED BY FALCO] Files come from nanopore
sequences and are in fast5 format. In this
mode you can pass in directories to process
and the program will take in all fast5 files
within those directories and produce a
single output file from the sequences found
in all files.
--nofilter [IGNORED BY FALCO] If running with --casava
then don't remove read flagged by casava as
poor quality when performing the QC
analysis.
--extract [ALWAYS ON IN FALCO] If set then the zipped
output file will be uncompressed in the same
directory after it has been created. By
default this option will be set if fastqc is
run in non-interactive mode.
-j, --java [IGNORED BY FALCO] Provides the full path to
the java binary you want to use to launch
fastqc. If not supplied then java is assumed
to be in your path.
--noextract [IGNORED BY FALCO] Do not uncompress the
output file after creating it. You should
set this option if you do not wish to
uncompress the output when running in
non-interactive mode.
--nogroup Disable grouping of bases for reads >50bp.
All reports will show data for every base in
the read. WARNING: When using this option,
your plots may end up a ridiculous size. You
have been warned!
--min_length [NOT YET IMPLEMENTED IN FALCO] Sets an
artificial lower limit on the length of the
sequence to be shown in the report. As long
as you set this to a value greater or equal
to your longest read length then this will
be the sequence length used to create your
read groups. This can be useful for making
directly comaparable statistics from
datasets with somewhat variable read
lengths.
-f, --format Bypasses the normal sequence file format
detection and forces the program to use the
specified format. Validformats are bam, sam,
bam_mapped, sam_mapped, fastq, fq, fastq.gz
or fq.gz.
-t, --threads [NOT YET IMPLEMENTED IN FALCO] Specifies the
number of files which can be processed
simultaneously. Each thread will be
allocated 250MB of memory so you shouldn't
run more threads than your available memory
will cope with, and not more than 6 threads
on a 32 bit machine [1]
-c, --contaminants Specifies a non-default file which contains
the list of contaminants to screen
overrepresented sequences against. The file
must contain sets of named contaminants in
the form name[tab]sequence. Lines prefixed
with a hash will be ignored. Default:
/srv/falco/Configuration/contaminant_list.txt
-a, --adapters Specifies a non-default file which contains
the list of adapter sequences which will be
explicity searched against the library. The
file must contain sets of named adapters in
the form name[tab]sequence. Lines prefixed
with a hash will be ignored. Default:
/srv/falco/Configuration/adapter_list.txt
-l, --limits Specifies a non-default file which contains
a set of criteria which will be used to
determine the warn/error limits for the
various modules. This file can also be used
to selectively remove some modules from the
output all together. The format needs to
mirror the default limits.txt file found in
the Configuration folder. Default:
/srv/falco/Configuration/limits.txt
-k, --kmers [IGNORED BY FALCO AND ALWAYS SET TO 7]
Specifies the length of Kmer to look for in
the Kmer content module. Specified Kmer
length must be between 2 and 10. Default
length is 7 if not specified.
-q, --quiet Supress all progress messages on stdout and
only report errors.
-d, --dir [IGNORED: FALCO DOES NOT CREATE TMP FILES]
Selects a directory to be used for temporary
files written when generating report images.
Defaults to system temp directory if not
specified.
-s, -subsample [Falco only] makes falco faster (but
possibly less accurate) by only processing
reads that are multiple of this value (using
0-based indexing to number reads). [1]
-b, -bisulfite [Falco only] reads are whole genome
bisulfite sequencing, and more Ts and fewer
Cs are therefore expected and will be
accounted for in base content.
-r, -reverse-complement [Falco only] The input is a
reverse-complement. All modules will be
tested by swapping A/T and C/G
-skip-data [Falco only] Do not create FastQC data text
file.
-skip-report [Falco only] Do not create FastQC report
HTML file.
-skip-summary [Falco only] Do not create FastQC summary
file
-D, -data-filename [Falco only] Specify filename for FastQC
data output (TXT). If not specified, it will
be called fastq_data.txt in either the input
file's directory or the one specified in the
--output flag. Only available when running
falco with a single input.
-R, -report-filename [Falco only] Specify filename for FastQC
report output (HTML). If not specified, it
will be called fastq_report.html in either
the input file's directory or the one
specified in the --output flag. Only
available when running falco with a single
input.
-S, -summary-filename [Falco only] Specify filename for the short
summary output (TXT). If not specified, it
will be called fastq_report.html in either
the input file's directory or the one
specified in the --output flag. Only
available when running falco with a single
input.
-K, -add-call [Falco only] add the command call call to
FastQC data output and FastQC report HTML
(this may break the parse of fastqc_data.txt
in programs that are very strict about the
FastQC output format).
Help options:
-?, -help print this help message
-about print about message
> gffread
gffread v0.12.8. Usage:
gffread [-g <genomic_seqs_fasta> | <dir>] [-s <seq_info.fsize>]
[-o <outfile>] [-t <trackname>] [-r [<strand>]<chr>:<start>-<end> [-R]]
[--jmatch <chr>:<start>-<end>] [--no-pseudo]
[-CTVNJMKQAFPGUBHZWTOLE] [-w <exons.fa>] [-x <cds.fa>] [-y <tr_cds.fa>]
[-j ][--ids <IDs.lst> | --nids <IDs.lst>] [--attrs <attr-list>] [-i <maxintron>]
[--stream] [--bed | --gtf | --tlf] [--table <attrlist>] [--sort-by <ref.lst>]
[<input_gff>]
Filter, convert or cluster GFF/GTF/BED records, extract the sequence of
transcripts (exon or CDS) and more.
By default (i.e. without -O) only transcripts are processed, discarding any
other non-transcript features. Default output is a simplified GFF3 with only
the basic attributes.
Options:
--ids discard records/transcripts if their IDs are not listed in <IDs.lst>
--nids discard records/transcripts if their IDs are listed in <IDs.lst>
-i discard transcripts having an intron larger than <maxintron>
-l discard transcripts shorter than <minlen> bases
-r only show transcripts overlapping coordinate range <start>..<end>
(on chromosome/contig <chr>, strand <strand> if provided)
-R for -r option, discard all transcripts that are not fully
contained within the given range
--jmatch only output transcripts matching the given junction
-U discard single-exon transcripts
-C coding only: discard mRNAs that have no CDS features
--nc non-coding only: discard mRNAs that have CDS features
--ignore-locus : discard locus features and attributes found in the input
-A use the description field from <seq_info.fsize> and add it
as the value for a 'descr' attribute to the GFF record
-s <seq_info.fsize> is a tab-delimited file providing this info
for each of the mapped sequences:
<seq-name> <seq-length> <seq-description>
(useful for -A option with mRNA/EST/protein mappings)
Sorting: (by default, chromosomes are kept in the order they were found)
--sort-alpha : chromosomes (reference sequences) are sorted alphabetically
--sort-by : sort the reference sequences by the order in which their
names are given in the <refseq.lst> file
Misc options:
-F keep all GFF attributes (for non-exon features)
--keep-exon-attrs : for -F option, do not attempt to reduce redundant
exon/CDS attributes
-G do not keep exon attributes, move them to the transcript feature
(for GFF3 output)
--attrs <attr-list> only output the GTF/GFF attributes listed in <attr-list>
which is a comma delimited list of attribute names to keep
--keep-genes : in transcript-only mode (default), also preserve gene records
--keep-comments: for GFF3 input/output, try to preserve comments
-O process other non-transcript GFF records (by default non-transcript
records are ignored)
-V discard any mRNAs with CDS having in-frame stop codons (requires -g)
-H for -V option, check and adjust the starting CDS phase
if the original phase leads to a translation with an
in-frame stop codon
-B for -V option, single-exon transcripts are also checked on the
opposite strand (requires -g)
-P add transcript level GFF attributes about the coding status of each
transcript, including partialness or in-frame stop codons (requires -g)
--add-hasCDS : add a "hasCDS" attribute with value "true" for transcripts
that have CDS features
--adj-stop stop codon adjustment: enables -P and performs automatic
adjustment of the CDS stop coordinate if premature or downstream
-N discard multi-exon mRNAs that have any intron with a non-canonical
splice site consensus (i.e. not GT-AG, GC-AG or AT-AC)
-J discard any mRNAs that either lack initial START codon
or the terminal STOP codon, or have an in-frame stop codon
(i.e. only print mRNAs with a complete CDS)
--no-pseudo: filter out records matching the 'pseudo' keyword
--in-bed: input should be parsed as BED format (automatic if the input
filename ends with .bed*)
--in-tlf: input GFF-like one-line-per-transcript format without exon/CDS
features (see --tlf option below; automatic if the input
filename ends with .tlf)
--stream: fast processing of input GFF/BED transcripts as they are received
(no sorting, exons must be grouped by transcript in the input data)
Clustering:
-M/--merge : cluster the input transcripts into loci, discarding
"redundant" transcripts (those with the same exact introns
and fully contained or equal boundaries)
-d <dupinfo> : for -M option, write duplication info to file <dupinfo>
--cluster-only: same as -M/--merge but without discarding any of the
"duplicate" transcripts, only create "locus" features
-K for -M option: also discard as redundant the shorter, fully contained
transcripts (intron chains matching a part of the container)
--cset for -K option, discard single exon transcripts when fully contained
in an exon of a multi-exon transcript
-Q for -M option, no longer require boundary containment when assessing
redundancy (can be combined with -K); only introns have to match for
multi-exon transcripts, and >=80% overlap for single-exon transcripts
-Y for -M option, enforce -Q but also discard overlapping single-exon
transcripts, even on the opposite strand (can be combined with -K)
Output options:
--force-exons: make sure that the lowest level GFF features are considered
"exon" features
--gene2exon: for single-line genes not parenting any transcripts, add an
exon feature spanning the entire gene (treat it as a transcript)
--t-adopt: try to find a parent gene overlapping/containing a transcript
that does not have any explicit gene Parent
-D decode url encoded characters within attributes
-Z merge very close exons into a single exon (when intron size<4)
-g full path to a multi-fasta file with the genomic sequences
for all input mappings, OR a directory with single-fasta files
(one per genomic sequence, with file names matching sequence names)
-j output the junctions and the corresponding transcripts
-u write a fasta file with the whole span of each transcript
(including introns)
-w write a fasta file with spliced exons for each transcript
--w-add <N> for the -w option, extract additional <N> bases
both upstream and downstream of the transcript boundaries
(this also applies to -u option)
--w-nocds for -w, disable the output of CDS info in the FASTA file
-x write a fasta file with spliced CDS for each GFF transcript
-y write a protein fasta file with the translation of CDS for each record
-W for -w, -x and -y options, write in the FASTA defline all the exon
coordinates projected onto the spliced sequence;
-S for -y option, use '*' instead of '.' as stop codon translation
-L Ensembl GTF to GFF3 conversion, adds version to IDs
-m <chr_replace> is a name mapping table for converting reference
sequence names, having this 2-column format:
<original_ref_ID> <new_ref_ID>
-t use <trackname> in the 2nd column of each GFF/GTF output line
-o write the output records into <outfile> instead of stdout
-T main output will be GTF instead of GFF3
--bed output records in BED format instead of default GFF3
--tlf output "transcript line format" which is like GFF
but with exons and CDS related features stored as GFF
attributes in the transcript feature line, like this:
exoncount=N;exons=<exons>;CDSphase=<N>;CDS=<CDScoords>
<exons> is a comma-delimited list of exon_start-exon_end coordinates;
<CDScoords> is CDS_start:CDS_end coordinates or a list like <exons>
--table output a simple tab delimited format instead of GFF, with columns
having the values of GFF attributes given in <attrlist>; special
pseudo-attributes (prefixed by @) are recognized:
@id, @geneid, @chr, @start, @end, @strand, @track, @numexons, @exons,
@cds, @introns, @covlen, @cdslen
If any of -w/-y/-x FASTA output files are enabled, the same fields
(excluding @id) are appended to the definition line of corresponding
FASTA records
-v,-E expose (warn about) duplicate transcript IDs and other potential
problems with the given GFF/GTF records
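Two common gffread invocations, sketched with placeholder filenames (annotation.gtf, genome.fa):

```shell
# Convert a GTF annotation to the default simplified GFF3 output:
gffread annotation.gtf -o annotation.gff3

# Extract spliced transcript sequences (-w) and protein translations
# of the CDS (-y); -g points at the genome FASTA the annotation maps to:
gffread -w transcripts.fa -y proteins.fa -g genome.fa annotation.gff3
```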
> vsearch
vsearch v2.28.1_linux_x86_64, 251.5GB RAM, 128 cores
https://github.com/torognes/vsearch
For more help, please enter: vsearch --help
For further details, please consult the manual by entering: man vsearch
Selected command examples:
vsearch --allpairs_global FILENAME --id 0.5 --alnout FILENAME
vsearch --cluster_size FILENAME --id 0.97 --centroids FILENAME
vsearch --cut FILENAME --cut_pattern G^AATT_C --fastaout FILENAME
vsearch --fastq_chars FILENAME
vsearch --fastq_convert FILENAME --fastqout FILENAME --fastq_ascii 64
vsearch --fastq_eestats FILENAME --output FILENAME
vsearch --fastq_eestats2 FILENAME --output FILENAME
vsearch --fastq_mergepairs FILENAME --reverse FILENAME --fastqout FILENAME
vsearch --fastq_stats FILENAME --log FILENAME
vsearch --fastx_filter FILENAME --fastaout FILENAME --fastq_trunclen 100
vsearch --fastx_getseq FILENAME --label LABEL --fastaout FILENAME
vsearch --fastx_mask FILENAME --fastaout FILENAME
vsearch --fastx_revcomp FILENAME --fastqout FILENAME
vsearch --fastx_subsample FILENAME --fastaout FILENAME --sample_pct 1
vsearch --fastx_uniques FILENAME --output FILENAME
vsearch --makeudb_usearch FILENAME --output FILENAME
vsearch --search_exact FILENAME --db FILENAME --alnout FILENAME
vsearch --sff_convert FILENAME --output FILENAME --sff_clip
vsearch --shuffle FILENAME --output FILENAME
vsearch --sintax FILENAME --db FILENAME --tabbedout FILENAME
vsearch --sortbylength FILENAME --output FILENAME
vsearch --sortbysize FILENAME --output FILENAME
vsearch --uchime_denovo FILENAME --nonchimeras FILENAME
vsearch --uchime_ref FILENAME --db FILENAME --nonchimeras FILENAME
vsearch --usearch_global FILENAME --db FILENAME --id 0.97 --alnout FILENAME
Other commands: cluster_fast, cluster_smallmem, cluster_unoise, cut,
derep_id, derep_fulllength, derep_prefix, derep_smallmem,
fasta2fastq, fastq_filter, fastq_join, fastx_getseqs,
fastx_getsubseq, maskfasta, orient, rereplicate, uchime2_denovo,
uchime3_denovo, udb2fasta, udbinfo, udbstats, version
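A minimal amplicon-style sketch with vsearch, using placeholder filenames; --sizeout records per-sequence abundances (size=N) that --cluster_size then sorts by:

```shell
# Dereplicate, keeping abundance annotations in the FASTA headers:
vsearch --fastx_uniques seqs.fasta --output uniques.fasta --sizeout

# Cluster the dereplicated sequences at 97% identity, write centroids:
vsearch --cluster_size uniques.fasta --id 0.97 --centroids otus.fasta
```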
> bamtools
usage: bamtools [--help] COMMAND [ARGS]
Available bamtools commands:
convert Converts between BAM and a number of other formats
count Prints number of alignments in BAM file(s)
coverage Prints coverage statistics from the input BAM file
filter Filters BAM file(s) by user-specified criteria
header Prints BAM header information
index Generates index for BAM file
merge Merge multiple BAM files into single file
random Select random alignments from existing BAM file(s), intended more as a testing tool.
resolve Resolves paired-end reads (marking the IsProperPair flag as needed)
revert Removes duplicate marks and restores original base qualities
sort Sorts the BAM file according to some criteria
split Splits a BAM file on user-specified property, creating a new BAM output file for each value found
stats Prints some basic statistics from input BAM file(s)
See 'bamtools help COMMAND' for more information on a specific command.
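A few bamtools one-liners as a sketch; aln.bam is a placeholder for an existing BAM file:

```shell
# Index the BAM, then count its alignments:
bamtools index -in aln.bam
bamtools count -in aln.bam

# Keep only mapped reads with mapping quality >= 30:
bamtools filter -in aln.bam -out filtered.bam -isMapped true -mapQuality ">=30"
```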
References
<HTSlib>
HTSlib: C library for reading/writing high-throughput sequencing data
James K Bonfield, John Marshall, Petr Danecek, Heng Li, Valeriu Ohan, Andrew Whitwham, Thomas Keane, Robert M Davies
GigaScience, Volume 10, Issue 2, February 2021, giab007
<SAMTools & BCFTools>
Twelve years of SAMtools and BCFtools
Petr Danecek, James K Bonfield, Jennifer Liddle, John Marshall, Valeriu Ohan, Martin O Pollard, Andrew Whitwham, Thomas Keane, Shane A McCarthy, Robert M Davies, Heng Li
GigaScience, Volume 10, Issue 2, February 2021, giab008
<SRA Toolkit>
https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=software and the SRA Toolkit Development Team.
<vcftools>
The variant call format and VCFtools
Petr Danecek, Adam Auton, Goncalo Abecasis, Cornelis A. Albers, Eric Banks, Mark A. DePristo, Robert Handsaker, Gerton Lunter, Gabor Marth, Stephen T. Sherry, Gilean McVean, Richard Durbin and 1000 Genomes Project Analysis Group
Bioinformatics. 2011 Aug 1; 27(15): 2156–2158.
<Bowtie2>
Fast gapped-read alignment with Bowtie 2
Ben Langmead and Steven L Salzberg
Nat Methods. 2012 Mar 4; 9(4): 357–359.
<Bwa>
Fast and accurate short read alignment with Burrows-Wheeler transform
Heng Li, Richard Durbin
Bioinformatics. 2009 Jul 15;25(14):1754-60
Fast and accurate long-read alignment with Burrows-Wheeler transform
Heng Li, Richard Durbin
Bioinformatics. 2010 Mar 1;26(5):589-95.
<Bwa2>
Efficient Architecture-Aware Acceleration of BWA-MEM for Multicore Systems.
Vasimuddin Md, Sanchit Misra, Heng Li, Srinivas Aluru.
IEEE Parallel and Distributed Processing Symposium (IPDPS), 2019. 10.1109/IPDPS.2019.00041
<Minimap2>
Minimap2: pairwise alignment for nucleotide sequences
Heng Li
Bioinformatics, Volume 34, Issue 18, September 2018, Pages 3094–3100
New strategies to improve minimap2 alignment accuracy
Heng Li
Bioinformatics, Volume 37, Issue 23, December 2021, Pages 4572–4574
<Hisat2>
HISAT: a fast spliced aligner with low memory requirements.
Kim D, Langmead B, Salzberg SL
Nat Methods. 2015 Apr;12(4):357-60.
<STAR>
STAR: ultrafast universal RNA-seq aligner
Alexander Dobin, Carrie A. Davis, Felix Schlesinger, Jorg Drenkow, Chris Zaleski, Sonali Jha, Philippe Batut, Mark Chaisson, and Thomas R. Gingeras
Bioinformatics. 2013 Jan; 29(1): 15–21.
Mapping RNA-seq Reads with STAR
Alexander Dobin and Thomas R. Gingeras
Curr Protoc Bioinformatics. 2015 Sep 3; 51:11.14.1–11.14.19
<Falco>
Falco: high-speed FastQC emulation for quality control of sequencing data
Guilherme de Sena Brandine, Andrew D Smith
F1000Res. 2019; 8:1874
<GffRead>
GFF Utilities: GffRead and GffCompare
Geo Pertea and Mihaela Pertea
F1000Res. 2020; 9:304. doi: 10.12688/f1000research.23297.2
<Cd-hit>
Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences.
Li W, Godzik A.
Bioinformatics. 2006 Jul 1;22(13):1658-9. Epub 2006 May 26.
<VSEARCH>
VSEARCH: a versatile open source tool for metagenomics
Torbjørn Rognes, Tomáš Flouri, Ben Nichols, Christopher Quince, Frédéric Mahé
PeerJ. 2016; 4: e2584
<BamTools>
BamTools: a C++ API and toolkit for analyzing and managing BAM files
Derek W Barnett, Erik K Garrison, Aaron R Quinlan, Michael P Strömberg, Gabor T Marth
Bioinformatics. 2011 Jun 15;27(12):1691-2.
Related