2024/05/10 Added BamTools
This post summarizes the steps for installing basic HTS tools on Ubuntu. Tools that are essentially Python wrappers are quickest to install through Anaconda, so they are excluded; the targets here are widely used tools written purely in compiled languages.
Preparing the Docker image and libraries
We will work with Docker, pulling an Ubuntu base image. If you have already installed these tools with conda or Homebrew, things can get confusing, so take care. For example, if you manage your tools with conda, ideally run conda deactivate beforehand (check that "(base)" or the like is not shown at the left edge of your prompt).
After pulling the image with the latest tag, run it and install the compilers and the libraries the tools depend on.
docker pull ubuntu:latest
docker run -itv $PWD:/data ubuntu:latest #add "--rm" if this is a throwaway container
#the required dependencies must be installed first
apt update
apt install -y make gcc git autoconf automake g++ cmake build-essential
apt install -y zlib1g-dev
apt install -y liblzma-dev
apt install -y libbz2-dev
apt install -y libcurl4-openssl-dev
apt install -y libssl-dev #OpenSSL library
apt install -y pkg-config
apt install -y libdeflate-dev
apt install -y libncurses5-dev
apt install -y libgsl-dev
apt install -y libperl-dev
apt install -y libxml2-dev
#move to /srv *1
cd /srv
=============================================================================
#Notes
#1 After finishing work, exit and commit. Example:
docker ps -a #check the container ID
#2 commit by specifying the ID
docker commit 1c4af84ced55 kazumax/ngsbaseimage
#3 reuse (run the bwa index command from the host side)
docker run --rm -itv $PWD:/root -w /root kazumax/ngsbaseimage bwa index -a is GCA_021272385.1.fna
#or (example with multiple commands)
command1="echo bwa indexing"
command2="bwa index -a is GCA_021272385.1.fna"
docker run --rm -itv $PWD:/root -w /root kazumax/ngsbaseimage bash -c "$command1; $command2"
*1 Here we clone into /srv. Other candidates include /opt and the like (for reference). As a regular user, somewhere under home is fine.
That completes the preparation. We will now install the tools, in order, under /srv.
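As a quick sanity check before building anything, you can confirm that the toolchain and the -dev libraries above are visible (a minimal sketch; the reported versions will of course differ):

```shell
# compilers and build tools should now be on PATH
gcc --version | head -n 1
g++ --version | head -n 1
cmake --version | head -n 1
# pkg-config should resolve the development libraries installed above
pkg-config --modversion zlib liblzma libdeflate
```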
Installation
1. Installing SAMtools, BCFtools, and BamTools
https://github.com/samtools/htslib
https://github.com/samtools/samtools
https://github.com/samtools/bcftools
https://github.com/pezmaster31/bamtools
https://github.com/telatin/bamtocov
#1 install htslib (required for building samtools and bcftools) *2
git clone --recurse-submodules https://github.com/samtools/htslib.git
cd htslib/
autoheader
autoreconf -i
./configure --prefix=/usr/local
make -j10
make install
cd ../
#2 install samtools
git clone https://github.com/samtools/samtools.git
cd samtools/
autoheader
autoconf -Wno-syntax
./configure --prefix=/usr/local # *3
make -j10
make install
cd ../
samtools -h
#3 install bcftools
git clone https://github.com/samtools/bcftools.git
cd bcftools/
autoheader
autoconf
./configure --enable-libgsl --enable-perl-filters
make -j10
make install
cd ../
bcftools -h
#4 install BamTools
git clone https://github.com/pezmaster31/bamtools.git
cd bamtools/
mkdir build
cd build
cmake -DCMAKE_INSTALL_PREFIX=/usr/local ..
make -j10
make install
cd ../../
bamtools -h
*2 HTSlib uses several functions from the htscodecs project. When cloning, initialize and update the submodules.
*3 By default, configure builds against the HTSlib found in "../htslib". Specify --with-htslib=DIR to build against a different HTSlib on the system.
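As a functional check of the freshly built tools, here is a minimal sketch of a typical samtools/bcftools workflow. The file names (ref.fa, aln.sam) are hypothetical stand-ins for your own data:

```shell
# sort an alignment and index the result
samtools sort -@ 4 -O bam -o aln.sort.bam aln.sam
samtools index aln.sort.bam
samtools flagstat aln.sort.bam   # quick mapping statistics

# call variants with bcftools (the reference must be faidx-indexed)
samtools faidx ref.fa
bcftools mpileup -f ref.fa aln.sort.bam | bcftools call -mv -Oz -o calls.vcf.gz
bcftools index calls.vcf.gz
bcftools stats calls.vcf.gz | head
```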
2. Installing VCFtools
https://github.com/vcftools/vcftools
git clone https://github.com/vcftools/vcftools.git
cd vcftools/
./autogen.sh
./configure --prefix=/usr/local
make -j10
make install
cd ../
vcftools -h
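A small usage sketch for vcftools; calls.vcf.gz is a hypothetical bgzipped VCF:

```shell
# keep biallelic SNPs with minor allele frequency >= 0.05
vcftools --gzvcf calls.vcf.gz \
         --remove-indels --maf 0.05 \
         --min-alleles 2 --max-alleles 2 \
         --recode --recode-INFO-all --out filtered
# results are written to filtered.recode.vcf
```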
3. Installing SRA Toolkit
https://github.com/ncbi/sra-tools
git clone https://github.com/ncbi/sra-tools.git
cd sra-tools/
#ncbi-vdb is required
git clone https://github.com/ncbi/ncbi-vdb.git
cd ncbi-vdb
./configure --prefix=/usr/local/ncbi-vdb
make -j20
make install -j20
cd ../
./configure --with-ncbi-vdb=/usr/local/ncbi-vdb --prefix=/usr/local
make -j20
make install
cd ../
fastq-dump -h
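The build also provides prefetch and fasterq-dump. A minimal download sketch (SRR000001 is a public accession used here purely as an example):

```shell
# download the .sra object, then convert to FASTQ
prefetch SRR000001
fasterq-dump --split-3 -e 8 -O fastq_out SRR000001
# or with the classic fastq-dump interface
fastq-dump --split-3 SRR000001
```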
4. Installing mapping tools
https://github.com/BenLangmead/bowtie2
https://github.com/bwa-mem2/bwa-mem2
https://github.com/lh3/minimap2
https://github.com/DaehwanKimLab/hisat2
https://github.com/alexdobin/STAR
#bowtie2
git clone https://github.com/BenLangmead/bowtie2.git
cd bowtie2/
make -j20
make install
cd ../
bowtie2 -h
#bwa1
git clone https://github.com/lh3/bwa.git
cd bwa
make -j20
echo "export PATH=\$PATH:$PWD" >> ~/.bashrc && source ~/.bashrc #add this directory (containing the bwa binary) to PATH
#or
cp bwa /usr/local/bin/
cd ../
bwa
#bwa2
git clone --recursive https://github.com/bwa-mem2/bwa-mem2
cd bwa-mem2
make -j20
echo "export PATH=\$PATH:$PWD" >> ~/.bashrc && source ~/.bashrc #add this directory (containing the bwa-mem2 binaries) to PATH
#or
cp bwa-mem2* /usr/local/bin/
cd ../
bwa-mem2
#minimap2
git clone https://github.com/lh3/minimap2
cd minimap2
make -j20
echo "export PATH=\$PATH:$PWD" >> ~/.bashrc && source ~/.bashrc #add this directory (containing the minimap2 binary) to PATH
#or
cp minimap2 /usr/local/bin/
cd ../
minimap2
#hisat2 (python is also required)
git clone https://github.com/DaehwanKimLab/hisat2.git
cd hisat2
make -j20
echo "export PATH=\$PATH:$PWD" >> ~/.bashrc && source ~/.bashrc #add this directory (containing the hisat2 binaries) to PATH
cd ../
hisat2 -h
#STAR (prebuilt binaries are provided in STAR/bin/Linux_x86_64_static, so on 64-bit x86 you only need to add them to PATH)
git clone https://github.com/alexdobin/STAR.git
cd STAR/bin/Linux_x86_64_static/
echo "export PATH=\$PATH:$PWD" >> ~/.bashrc && source ~/.bashrc
#or
ln -s $PWD/STAR /usr/local/bin/
cd ../../../
STAR --help
(pseudo-mapping tools are excluded)
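To check the mappers end to end, here is a minimal sketch that indexes a genome and maps paired-end reads, piping into the samtools installed earlier (ref.fa and reads_1/2.fastq.gz are hypothetical names):

```shell
# bwa-mem2: index, map, sort, index
bwa-mem2 index ref.fa
bwa-mem2 mem -t 8 ref.fa reads_1.fastq.gz reads_2.fastq.gz \
    | samtools sort -@ 4 -o aln.sort.bam -
samtools index aln.sort.bam

# the same mapping with minimap2 and its short-read preset
minimap2 -ax sr -t 8 ref.fa reads_1.fastq.gz reads_2.fastq.gz \
    | samtools sort -@ 4 -o aln.mm2.bam -
```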
5. Installing FASTQ QC tools
https://github.com/smithlabcode/falco
#falco (a fast alternative to FastQC)
git clone https://github.com/smithlabcode/falco.git
cd falco/
./autogen.sh
./configure CXXFLAGS="-O3 -Wall" --enable-hts
make HAVE_HTSLIB=1 all -j 20
make HAVE_HTSLIB=1 install
cd ../
(Excluded: cutadapt (written in Python), Trim Galore (written in Perl; a wrapper around FastQC and cutadapt), and Trimmomatic (written in Java; distributed as a .jar). fastp has many dependencies, so conda is the easiest way to install it.)
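A minimal falco run might look like the following; qc_out and the read file are hypothetical names, and falco creates the output directory if it does not exist:

```shell
# QC a gzipped FASTQ file; reports are written under qc_out/
falco -o qc_out reads_1.fastq.gz
# output includes fastqc_data.txt, summary.txt and an HTML report
# compatible with FastQC's format
```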
6. Installing a sequence clustering tool
https://github.com/weizhongli/cdhit
#cd-hit
git clone https://github.com/weizhongli/cdhit.git
cd cdhit/
#build so that long sequences can also be handled
make MAX_SEQ=1000000 -j4
make install
cd ../
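A usage sketch for cd-hit; per the cd-hit user's guide the word size -n must match the identity cutoff -c, and the input file names here are hypothetical:

```shell
# cluster proteins at 90% identity (-n 5 is the word size for c=0.9)
cd-hit -i proteins.faa -o proteins_nr90 -c 0.9 -n 5 -T 8 -M 0
# nucleotide sequences use cd-hit-est (-n 10 suits c=0.95)
cd-hit-est -i seqs.fna -o seqs_nr95 -c 0.95 -n 10 -T 8 -M 0
```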
7. Installing a GFF/GTF manipulation tool
https://github.com/gpertea/gffread
#gffread
git clone https://github.com/gpertea/gffread
cd gffread
make release -j 10
cd ../
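A usage sketch for gffread, using the option names from its help output; genome.fa and annotation.gff3 are hypothetical names:

```shell
# extract spliced exon, CDS and translated CDS sequences
gffread -g genome.fa -w transcripts.fa -x cds.fa -y proteins.fa annotation.gff3
# convert GFF3 to GTF
gffread annotation.gff3 --gtf -o annotation.gtf
```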
8. Installing a sequence manipulation tool
https://github.com/torognes/vsearch
#vsearch
git clone https://github.com/torognes/vsearch.git
cd vsearch
./autogen.sh
./configure CFLAGS="-O3" CXXFLAGS="-O3"
make -j20
make install
cd ../
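A usage sketch for vsearch, for example dereplicating amplicon reads and then clustering at 97% identity (file names are hypothetical):

```shell
# dereplicate full-length sequences, keeping abundance annotations
vsearch --derep_fulllength seqs.fasta --sizeout --output derep.fasta
# cluster by decreasing abundance at 97% identity
vsearch --cluster_size derep.fasta --id 0.97 \
        --centroids otus.fasta --uc clusters.uc --threads 8
```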
What else?
Excluded tools
・ETE toolkit (Python-based; to be covered separately)
・bedtools (many dependencies, including python and bamtools)
・IGV (Java is heavyweight)
############################################################################################
Help output for the main commands
> samtools -h
Program: samtools (Tools for alignments in the SAM format)
Version: 1.20-6-g95f5426 (using htslib 1.20-15-gb204d55c)
Usage: samtools <command> [options]
Commands:
-- Indexing
dict create a sequence dictionary file
faidx index/extract FASTA
fqidx index/extract FASTQ
index index alignment
-- Editing
calmd recalculate MD/NM tags and '=' bases
fixmate fix mate information
reheader replace BAM header
targetcut cut fosmid regions (for fosmid pool only)
addreplacerg adds or replaces RG tags
markdup mark duplicates
ampliconclip clip oligos from the end of reads
-- File operations
collate shuffle and group alignments by name
cat concatenate BAMs
consensus produce a consensus Pileup/FASTA/FASTQ
merge merge sorted alignments
mpileup multi-way pileup
sort sort alignment file
split splits a file by read group
quickcheck quickly check if SAM/BAM/CRAM file appears intact
fastq converts a BAM to a FASTQ
fasta converts a BAM to a FASTA
import Converts FASTA or FASTQ files to SAM/BAM/CRAM
reference Generates a reference from aligned data
reset Reverts aligner changes in reads
-- Statistics
bedcov read depth per BED region
coverage alignment depth and percent coverage
depth compute the depth
flagstat simple stats
idxstats BAM index stats
cram-size list CRAM Content-ID and Data-Series sizes
phase phase heterozygotes
stats generate stats (former bamcheck)
ampliconstats generate amplicon specific stats
-- Viewing
flags explain BAM flags
head header viewer
tview text alignment viewer
view SAM<->BAM<->CRAM conversion
depad convert padded BAM to unpadded BAM
samples list the samples in a set of SAM/BAM/CRAM files
-- Misc
help [cmd] display this help message or help for [cmd]
version detailed version information
> bcftools -h
Program: bcftools (Tools for variant calling and manipulating VCFs and BCFs)
License: GNU GPLv3+, due to use of the GNU Scientific Library
Version: 1.20-12-g18439d1d (using htslib 1.20-15-gb204d55c)
Usage: bcftools [--version|--version-only] [--help] <command> <argument>
Commands:
-- Indexing
index index VCF/BCF files
-- VCF/BCF manipulation
annotate annotate and edit VCF/BCF files
concat concatenate VCF/BCF files from the same set of samples
convert convert VCF/BCF files to different formats and back
head view VCF/BCF file headers
isec intersections of VCF/BCF files
merge merge VCF/BCF files files from non-overlapping sample sets
norm left-align and normalize indels
plugin user-defined plugins
query transform VCF/BCF into user-defined formats
reheader modify VCF/BCF header, change sample names
sort sort VCF/BCF file
view VCF/BCF conversion, view, subset and filter VCF/BCF files
-- VCF/BCF analysis
call SNP/indel calling
consensus create consensus sequence by applying VCF variants
cnv HMM CNV calling
csq call variation consequences
filter filter VCF/BCF files using fixed thresholds
gtcheck check sample concordance, detect sample swaps and contamination
mpileup multi-way pileup producing genotype likelihoods
polysomy detect number of chromosomal copies
roh identify runs of autozygosity (HMM)
stats produce VCF/BCF stats
-- Plugins (collection of programs for calling, file manipulation & analysis)
39 plugins available, run "bcftools plugin -lv" to see a complete list
Most commands accept VCF, bgzipped VCF, and BCF with the file type detected
automatically even when streaming from a pipe. Indexed VCF and BCF will work
in all situations. Un-indexed VCF and BCF and streams will work in most but
not all situations.
> vcftools -h
VCFtools (0.1.17)
© Adam Auton and Anthony Marcketta 2009
Process Variant Call Format files
For a list of options, please go to:
https://vcftools.github.io/man_latest.html
Alternatively, a man page is available, type:
man vcftools
Questions, comments, and suggestions should be emailed to:
vcftools-help@lists.sourceforge.net
> fastq-dump -h
Usage:
fastq-dump [options] <path> [<path>...]
fastq-dump [options] <accession>
INPUT
-A|--accession <accession> Replaces accession derived from <path> in
filename(s) and deflines (only for single
table dump)
--table <table-name> Table name within cSRA object, default is
"SEQUENCE"
PROCESSING
Read Splitting Sequence data may be used in raw form or
split into individual reads
--split-spot Split spots into individual reads
Full Spot Filters Applied to the full spot independently
of --split-spot
-N|--minSpotId <rowid> Minimum spot id
-X|--maxSpotId <rowid> Maximum spot id
--spot-groups <[list]> Filter by SPOT_GROUP (member): name[,...]
-W|--clip Remove adapter sequences from reads
Common Filters Applied to spots when --split-spot is not
set, otherwise - to individual reads
-M|--minReadLen <len> Filter by sequence length >= <len>
-R|--read-filter <[filter]> Split into files by READ_FILTER value
optionally filter by value:
pass|reject|criteria|redacted
-E|--qual-filter Filter used in early 1000 Genomes data: no
sequences starting or ending with >= 10N
--qual-filter-1 Filter used in current 1000 Genomes data
Filters based on alignments Filters are active when alignment
data are present
--aligned Dump only aligned sequences
--unaligned Dump only unaligned sequences
--aligned-region <name[:from-to]> Filter by position on genome. Name can
either be accession.version (ex:
NC_000001.10) or file specific name (ex:
"chr1" or "1"). "from" and "to" are 1-based
coordinates
--matepair-distance <from-to|unknown> Filter by distance between matepairs.
Use "unknown" to find matepairs split
between the references. Use from-to to limit
matepair distance on the same reference
Filters for individual reads Applied only with --split-spot set
--skip-technical Dump only biological reads
OUTPUT
-O|--outdir <path> Output directory, default is working
directory '.' )
-Z|--stdout Output to stdout, all split data become
joined into single stream
--gzip Compress output using gzip: deprecated, not
recommended
--bzip2 Compress output using bzip2: deprecated,
not recommended
Multiple File Options Setting these options will produce more
than 1 file, each of which will be suffixed
according to splitting criteria.
--split-files Write reads into separate files. Read
number will be suffixed to the file name.
NOTE! The `--split-3` option is recommended.
In cases where not all spots have the same
number of reads, this option will produce
files that WILL CAUSE ERRORS in most programs
which process split pair fastq files.
--split-3 3-way splitting for mate-pairs. For each
spot, if there are two biological reads
satisfying filter conditions, the first is
placed in the `*_1.fastq` file, and the
second is placed in the `*_2.fastq` file. If
there is only one biological read
satisfying the filter conditions, it is
placed in the `*.fastq` file.All other
reads in the spot are ignored.
-G|--spot-group Split into files by SPOT_GROUP (member name)
-R|--read-filter <[filter]> Split into files by READ_FILTER value
optionally filter by value:
pass|reject|criteria|redacted
-T|--group-in-dirs Split into subdirectories instead of files
-K|--keep-empty-files Do not delete empty files
FORMATTING
Sequence
-C|--dumpcs <[cskey]> Formats sequence using color space (default
for SOLiD),"cskey" may be specified for
translation
-B|--dumpbase Formats sequence using base space (default
for other than SOLiD).
Quality
-Q|--offset <integer> Offset to use for quality conversion,
default is 33
--fasta <[line width]> FASTA only, no qualities, optional line
wrap width (set to zero for no wrapping)
--suppress-qual-for-cskey suppress quality-value for cskey
Defline
-F|--origfmt Defline contains only original sequence name
-I|--readids Append read id after spot id as
'accession.spot.readid' on defline
--helicos Helicos style defline
--defline-seq <fmt> Defline format specification for sequence.
--defline-qual <fmt> Defline format specification for quality.
<fmt> is string of characters and/or
variables. The variables can be one of: $ac
- accession, $si spot id, $sn spot
name, $sg spot group (barcode), $sl spot
length in bases, $ri read number, $rn
read name, $rl read length in bases. ''
could be used for an optional output: if
all vars in yield empty values whole
group is not printed. Empty value is empty
string or for numeric variables. Ex:
@$sn[_$rn]/$ri '_$rn' is omitted if name
is empty
OTHER:
--ngc <path> <path> to ngc file
--disable-multithreading disable multithreading
-h|--help Output brief explanation of program usage
-V|--version Display the version of the program
-L|--log-level <level> Logging level as number or enum string One
of (fatal|sys|int|err|warn|info) or (0-5)
Current/default is warn
-v|--verbose Increase the verbosity level of the program
Use multiple times for more verbosity
--ncbi_error_report Control program execution environment
report generation (if implemented). One of
(never|error|always). Default is error
--legacy-report use legacy style 'Written spots' for tool
> bowtie2 -h
Bowtie 2 version 2.5.3 by Ben Langmead (langmea@cs.jhu.edu, www.cs.jhu.edu/~langmea)
Usage:
bowtie2 [options]* -x <bt2-idx> {-1 <m1> -2 <m2> | -U <r> | --interleaved <i> | -b <bam>} [-S <sam>]
<bt2-idx> Index filename prefix (minus trailing .X.bt2).
NOTE: Bowtie 1 and Bowtie 2 indexes are not compatible.
<m1> Files with #1 mates, paired with files in <m2>.
Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).
<m2> Files with #2 mates, paired with files in <m1>.
Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).
<r> Files with unpaired reads.
Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).
<i> Files with interleaved paired-end FASTQ/FASTA reads
Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).
<bam> Files are unaligned BAM sorted by read name.
<sam> File for SAM output (default: stdout)
<m1>, <m2>, <r> can be comma-separated lists (no whitespace) and can be
specified many times. E.g. '-U file1.fq,file2.fq -U file3.fq'.
Options (defaults in parentheses):
Input:
-q query input files are FASTQ .fq/.fastq (default)
--tab5 query input files are TAB5 .tab5
--tab6 query input files are TAB6 .tab6
--qseq query input files are in Illumina's qseq format
-f query input files are (multi-)FASTA .fa/.mfa
-r query input files are raw one-sequence-per-line
-F k:<int>,i:<int> query input files are continuous FASTA where reads
are substrings (k-mers) extracted from the FASTA file
and aligned at offsets 1, 1+i, 1+2i ... end of reference
-c <m1>, <m2>, <r> are sequences themselves, not files
-s/--skip <int> skip the first <int> reads/pairs in the input (none)
-u/--upto <int> stop after first <int> reads/pairs (no limit)
-5/--trim5 <int> trim <int> bases from 5'/left end of reads (0)
-3/--trim3 <int> trim <int> bases from 3'/right end of reads (0)
--trim-to [3:|5:]<int> trim reads exceeding <int> bases from either 3' or 5' end
If the read end is not specified then it defaults to 3 (0)
--phred33 qualities are Phred+33 (default)
--phred64 qualities are Phred+64
--int-quals qualities encoded as space-delimited integers
Presets: Same as:
For --end-to-end:
--very-fast -D 5 -R 1 -N 0 -L 22 -i S,0,2.50
--fast -D 10 -R 2 -N 0 -L 22 -i S,0,2.50
--sensitive -D 15 -R 2 -N 0 -L 22 -i S,1,1.15 (default)
--very-sensitive -D 20 -R 3 -N 0 -L 20 -i S,1,0.50
For --local:
--very-fast-local -D 5 -R 1 -N 0 -L 25 -i S,1,2.00
--fast-local -D 10 -R 2 -N 0 -L 22 -i S,1,1.75
--sensitive-local -D 15 -R 2 -N 0 -L 20 -i S,1,0.75 (default)
--very-sensitive-local -D 20 -R 3 -N 0 -L 20 -i S,1,0.50
Alignment:
-N <int> max # mismatches in seed alignment; can be 0 or 1 (0)
-L <int> length of seed substrings; must be >3, <32 (22)
-i <func> interval between seed substrings w/r/t read len (S,1,1.15)
--n-ceil <func> func for max # non-A/C/G/Ts permitted in aln (L,0,0.15)
--dpad <int> include <int> extra ref chars on sides of DP table (15)
--gbar <int> disallow gaps within <int> nucs of read extremes (4)
--ignore-quals treat all quality values as 30 on Phred scale (off)
--nofw do not align forward (original) version of read (off)
--norc do not align reverse-complement version of read (off)
--no-1mm-upfront do not allow 1 mismatch alignments before attempting to
scan for the optimal seeded alignments
--end-to-end entire read must align; no clipping (on)
OR
--local local alignment; ends might be soft clipped (off)
Scoring:
--ma <int> match bonus (0 for --end-to-end, 2 for --local)
--mp <int> max penalty for mismatch; lower qual = lower penalty (6)
--np <int> penalty for non-A/C/G/Ts in read/ref (1)
--rdg <int>,<int> read gap open, extend penalties (5,3)
--rfg <int>,<int> reference gap open, extend penalties (5,3)
--score-min <func> min acceptable alignment score w/r/t read length
(G,20,8 for local, L,-0.6,-0.6 for end-to-end)
Reporting:
(default) look for multiple alignments, report best, with MAPQ
OR
-k <int> report up to <int> alns per read; MAPQ not meaningful
OR
-a/--all report all alignments; very slow, MAPQ not meaningful
Effort:
-D <int> give up extending after <int> failed extends in a row (15)
-R <int> for reads w/ repetitive seeds, try <int> sets of seeds (2)
Paired-end:
-I/--minins <int> minimum fragment length (0)
-X/--maxins <int> maximum fragment length (500)
--fr/--rf/--ff -1, -2 mates align fw/rev, rev/fw, fw/fw (--fr)
--no-mixed suppress unpaired alignments for paired reads
--no-discordant suppress discordant alignments for paired reads
--dovetail concordant when mates extend past each other
--no-contain not concordant when one mate alignment contains other
--no-overlap not concordant when mates overlap at all
BAM:
--align-paired-reads
Bowtie2 will, by default, attempt to align unpaired BAM reads.
Use this option to align paired-end reads instead.
--preserve-tags Preserve tags from the original BAM record by
appending them to the end of the corresponding SAM output.
Output:
-t/--time print wall-clock time taken by search phases
--un <path> write unpaired reads that didn't align to <path>
--al <path> write unpaired reads that aligned at least once to <path>
--un-conc <path> write pairs that didn't align concordantly to <path>
--al-conc <path> write pairs that aligned concordantly at least once to <path>
(Note: for --un, --al, --un-conc, or --al-conc, add '-gz' to the option name, e.g.
--un-gz <path>, to gzip compress output, or add '-bz2' to bzip2 compress output.)
--quiet print nothing to stderr except serious errors
--met-file <path> send metrics to file at <path> (off)
--met-stderr send metrics to stderr (off)
--met <int> report internal counters & metrics every <int> secs (1)
--no-unal suppress SAM records for unaligned reads
--no-head suppress header lines, i.e. lines starting with @
--no-sq suppress @SQ header lines
--rg-id <text> set read group id, reflected in @RG line and RG:Z: opt field
--rg <text> add <text> ("lab:value") to @RG line of SAM header.
Note: @RG line only printed when --rg-id is set.
--omit-sec-seq put '*' in SEQ and QUAL fields for secondary alignments.
--sam-no-qname-trunc
Suppress standard behavior of truncating readname at first whitespace
at the expense of generating non-standard SAM.
--xeq Use '='/'X', instead of 'M,' to specify matches/mismatches in SAM record.
--soft-clipped-unmapped-tlen
Exclude soft-clipped bases when reporting TLEN
--sam-append-comment
Append FASTA/FASTQ comment to SAM record
Performance:
-p/--threads <int> number of alignment threads to launch (1)
--reorder force SAM output order to match order of input reads
--mm use memory-mapped I/O for index; many 'bowtie's can share
Other:
--qc-filter filter out reads that are bad according to QSEQ filter
--seed <int> seed for random number generator (0)
--non-deterministic
seed rand. gen. arbitrarily instead of using read attributes
--version print version information and quit
-h/--help print this usage message
> bwa
Program: bwa (alignment via Burrows-Wheeler transformation)
Version: 0.7.18-r1243-dirty
Contact: Heng Li <hli@ds.dfci.harvard.edu>
Usage: bwa <command> [options]
Command: index index sequences in the FASTA format
mem BWA-MEM algorithm
fastmap identify super-maximal exact matches
pemerge merge overlapping paired ends (EXPERIMENTAL)
aln gapped/ungapped alignment
samse generate alignment (single ended)
sampe generate alignment (paired ended)
bwasw BWA-SW for long queries (DEPRECATED)
shm manage indices in shared memory
fa2pac convert FASTA to PAC format
pac2bwt generate BWT from PAC
pac2bwtgen alternative algorithm for generating BWT
bwtupdate update .bwt to the new format
bwt2sa generate SA from BWT and Occ
Note: To use BWA, you need to first index the genome with `bwa index'.
There are three alignment algorithms in BWA: `mem', `bwasw', and
`aln/samse/sampe'. If you are not sure which to use, try `bwa mem'
first. Please `man ./bwa.1' for the manual.
> bwa-mem2
Looking to launch executable "/srv/bwa-mem2/bwa-mem2.avx2", simd = .avx2
Launching executable "/srv/bwa-mem2/bwa-mem2.avx2"
Usage: bwa-mem2 <command> <arguments>
Commands:
index create index
mem alignment
version print version number
> minimap2
Usage: minimap2 [options] <target.fa>|<target.idx> [query.fa] [...]
Options:
Indexing:
-H use homopolymer-compressed k-mer (preferrable for PacBio)
-k INT k-mer size (no larger than 28) [15]
-w INT minimizer window size [10]
-I NUM split index for every ~NUM input bases [8G]
-d FILE dump index to FILE
Mapping:
-f FLOAT filter out top FLOAT fraction of repetitive minimizers [0.0002]
-g NUM stop chain enlongation if there are no minimizers in INT-bp [5000]
-G NUM max intron length (effective with -xsplice; changing -r) [200k]
-F NUM max fragment length (effective with -xsr or in the fragment mode) [800]
-r NUM[,NUM] chaining/alignment bandwidth and long-join bandwidth [500,20000]
-n INT minimal number of minimizers on a chain [3]
-m INT minimal chaining score (matching bases minus log gap penalty) [40]
-X skip self and dual mappings (for the all-vs-all mode)
-p FLOAT min secondary-to-primary score ratio [0.8]
-N INT retain at most INT secondary alignments [5]
Alignment:
-A INT matching score [2]
-B INT mismatch penalty (larger value for lower divergence) [4]
-O INT[,INT] gap open penalty [4,24]
-E INT[,INT] gap extension penalty; a k-long gap costs min{O1+k*E1,O2+k*E2} [2,1]
-z INT[,INT] Z-drop score and inversion Z-drop score [400,200]
-s INT minimal peak DP alignment score [80]
-u CHAR how to find GT-AG. f:transcript strand, b:both strands, n:don't match GT-AG [n]
-J INT splice mode. 0: original minimap2 model; 1: miniprot model [1]
Input/Output:
-a output in the SAM format (PAF by default)
-o FILE output alignments to FILE [stdout]
-L write CIGAR with >65535 ops at the CG tag
-R STR SAM read group line in a format like '@RG\tID:foo\tSM:bar'
-c output CIGAR in PAF
--cs[=STR] output the cs tag; STR is 'short' (if absent) or 'long' [none]
--ds output the ds tag, which is an extension to cs
--MD output the MD tag
--eqx write =/X CIGAR operators
-Y use soft clipping for supplementary alignments
-t INT number of threads [3]
-K NUM minibatch size for mapping [500M]
--version show version number
Preset:
-x STR preset (always applied before other options; see minimap2.1 for details) []
- lr:hq - accurate long reads (error rate <1%) against a reference genome
- splice/splice:hq - spliced alignment for long reads/accurate long reads
- asm5/asm10/asm20 - asm-to-ref mapping, for ~0.1/1/5% sequence divergence
- sr - short reads against a reference
- map-pb/map-hifi/map-ont/map-iclr - CLR/HiFi/Nanopore/ICLR vs reference mapping
- ava-pb/ava-ont - PacBio CLR/Nanopore read overlap
See `man ./minimap2.1' for detailed description of these and other advanced command-line options.
> falco
Usage: falco [OPTIONS] <seqfile1> <seqfile2> ...
Options:
-h, --help Print this help file and exit
-v, --version Print the version of the program and exit
-o, --outdir Create all output files in the specified
output directory. FALCO-SPECIFIC: If the
directory does not exists, the program will
create it. If this option is not set then
the output file for each sequence file is
created in the same directory as the
sequence file which was processed.
--casava [IGNORED BY FALCO] Files come from raw
casava output. Files in the same sample
group (differing only by the group number)
will be analysed as a set rather than
individually. Sequences with the filter flag
set in the header will be excluded from the
analysis. Files must have the same names
given to them by casava (including being
gzipped and ending with .gz) otherwise they
won't be grouped together correctly.
--nano [IGNORED BY FALCO] Files come from nanopore
sequences and are in fast5 format. In this
mode you can pass in directories to process
and the program will take in all fast5 files
within those directories and produce a
single output file from the sequences found
in all files.
--nofilter [IGNORED BY FALCO] If running with --casava
then don't remove read flagged by casava as
poor quality when performing the QC
analysis.
--extract [ALWAYS ON IN FALCO] If set then the zipped
output file will be uncompressed in the same
directory after it has been created. By
default this option will be set if fastqc is
run in non-interactive mode.
-j, --java [IGNORED BY FALCO] Provides the full path to
the java binary you want to use to launch
fastqc. If not supplied then java is assumed
to be in your path.
--noextract [IGNORED BY FALCO] Do not uncompress the
output file after creating it. You should
set this option if you do not wish to
uncompress the output when running in
non-interactive mode.
--nogroup Disable grouping of bases for reads >50bp.
All reports will show data for every base in
the read. WARNING: When using this option,
your plots may end up a ridiculous size. You
have been warned!
--min_length [NOT YET IMPLEMENTED IN FALCO] Sets an
artificial lower limit on the length of the
sequence to be shown in the report. As long
as you set this to a value greater or equal
to your longest read length then this will
be the sequence length used to create your
read groups. This can be useful for making
directly comaparable statistics from
datasets with somewhat variable read
lengths.
-f, --format Bypasses the normal sequence file format
detection and forces the program to use the
specified format. Validformats are bam, sam,
bam_mapped, sam_mapped, fastq, fq, fastq.gz
or fq.gz.
-t, --threads [NOT YET IMPLEMENTED IN FALCO] Specifies the
number of files which can be processed
simultaneously. Each thread will be
allocated 250MB of memory so you shouldn't
run more threads than your available memory
will cope with, and not more than 6 threads
on a 32 bit machine [1]
-c, --contaminants Specifies a non-default file which contains
the list of contaminants to screen
overrepresented sequences against. The file
must contain sets of named contaminants in
the form name[tab]sequence. Lines prefixed
with a hash will be ignored. Default:
/srv/falco/Configuration/contaminant_list.txt
-a, --adapters Specifies a non-default file which contains
the list of adapter sequences which will be
explicity searched against the library. The
file must contain sets of named adapters in
the form name[tab]sequence. Lines prefixed
with a hash will be ignored. Default:
/srv/falco/Configuration/adapter_list.txt
-l, --limits Specifies a non-default file which contains
a set of criteria which will be used to
determine the warn/error limits for the
various modules. This file can also be used
to selectively remove some modules from the
output all together. The format needs to
mirror the default limits.txt file found in
the Configuration folder. Default:
/srv/falco/Configuration/limits.txt
-k, --kmers [IGNORED BY FALCO AND ALWAYS SET TO 7]
Specifies the length of Kmer to look for in
the Kmer content module. Specified Kmer
length must be between 2 and 10. Default
length is 7 if not specified.
-q, --quiet Supress all progress messages on stdout and
only report errors.
-d, --dir [IGNORED: FALCO DOES NOT CREATE TMP FILES]
Selects a directory to be used for temporary
files written when generating report images.
Defaults to system temp directory if not
specified.
-s, -subsample [Falco only] makes falco faster (but
possibly less accurate) by only processing
reads that are multiple of this value (using
0-based indexing to number reads). [1]
-b, -bisulfite [Falco only] reads are whole genome
bisulfite sequencing, and more Ts and fewer
Cs are therefore expected and will be
accounted for in base content.
-r, -reverse-complement [Falco only] The input is a
reverse-complement. All modules will be
tested by swapping A/T and C/G
-skip-data [Falco only] Do not create FastQC data text
file.
-skip-report [Falco only] Do not create FastQC report
HTML file.
-skip-summary [Falco only] Do not create FastQC summary
file
-D, -data-filename [Falco only] Specify filename for FastQC
data output (TXT). If not specified, it will
be called fastq_data.txt in either the input
file's directory or the one specified in the
--output flag. Only available when running
falco with a single input.
-R, -report-filename [Falco only] Specify filename for FastQC
report output (HTML). If not specified, it
will be called fastq_report.html in either
the input file's directory or the one
specified in the --output flag. Only
available when running falco with a single
input.
-S, -summary-filename [Falco only] Specify filename for the short
summary output (TXT). If not specified, it
will be called fastq_report.html in either
the input file's directory or the one
specified in the --output flag. Only
available when running falco with a single
input.
-K, -add-call [Falco only] add the command call call to
FastQC data output and FastQC report HTML
(this may break the parse of fastqc_data.txt
in programs that are very strict about the
FastQC output format).
Help options:
-?, -help print this help message
-about print about message
> gffread
gffread v0.12.8. Usage:
gffread [-g <genomic_seqs_fasta> | <dir>] [-s <seq_info.fsize>]
[-o <outfile>] [-t <trackname>] [-r [<strand>]<chr>:<start>-<end> [-R]]
[--jmatch <chr>:<start>-<end>] [--no-pseudo]
[-CTVNJMKQAFPGUBHZWTOLE] [-w <exons.fa>] [-x <cds.fa>] [-y <tr_cds.fa>]
[-j ][--ids <IDs.lst> | --nids <IDs.lst>] [--attrs <attr-list>] [-i <maxintron>]
[--stream] [--bed | --gtf | --tlf] [--table <attrlist>] [--sort-by <ref.lst>]
[<input_gff>]
Filter, convert or cluster GFF/GTF/BED records, extract the sequence of
transcripts (exon or CDS) and more.
By default (i.e. without -O) only transcripts are processed, discarding any
other non-transcript features. Default output is a simplified GFF3 with only
the basic attributes.
Options:
--ids discard records/transcripts if their IDs are not listed in <IDs.lst>
--nids discard records/transcripts if their IDs are listed in <IDs.lst>
-i discard transcripts having an intron larger than <maxintron>
-l discard transcripts shorter than <minlen> bases
-r only show transcripts overlapping coordinate range <start>..<end>
(on chromosome/contig <chr>, strand <strand> if provided)
-R for -r option, discard all transcripts that are not fully
contained within the given range
--jmatch only output transcripts matching the given junction
-U discard single-exon transcripts
-C coding only: discard mRNAs that have no CDS features
--nc non-coding only: discard mRNAs that have CDS features
--ignore-locus : discard locus features and attributes found in the input
-A use the description field from <seq_info.fsize> and add it
as the value for a 'descr' attribute to the GFF record
-s <seq_info.fsize> is a tab-delimited file providing this info
for each of the mapped sequences:
<seq-name> <seq-length> <seq-description>
(useful for -A option with mRNA/EST/protein mappings)
Sorting: (by default, chromosomes are kept in the order they were found)
--sort-alpha : chromosomes (reference sequences) are sorted alphabetically
--sort-by : sort the reference sequences by the order in which their
names are given in the <refseq.lst> file
Misc options:
-F keep all GFF attributes (for non-exon features)
--keep-exon-attrs : for -F option, do not attempt to reduce redundant
exon/CDS attributes
-G do not keep exon attributes, move them to the transcript feature
(for GFF3 output)
--attrs <attr-list> only output the GTF/GFF attributes listed in <attr-list>
which is a comma delimited list of attribute names to keep
--keep-genes : in transcript-only mode (default), also preserve gene records
--keep-comments: for GFF3 input/output, try to preserve comments
-O process other non-transcript GFF records (by default non-transcript
records are ignored)
-V discard any mRNAs with CDS having in-frame stop codons (requires -g)
-H for -V option, check and adjust the starting CDS phase
if the original phase leads to a translation with an
in-frame stop codon
-B for -V option, single-exon transcripts are also checked on the
opposite strand (requires -g)
-P add transcript level GFF attributes about the coding status of each
transcript, including partialness or in-frame stop codons (requires -g)
--add-hasCDS : add a "hasCDS" attribute with value "true" for transcripts
that have CDS features
--adj-stop stop codon adjustment: enables -P and performs automatic
adjustment of the CDS stop coordinate if premature or downstream
-N discard multi-exon mRNAs that have any intron with a non-canonical
splice site consensus (i.e. not GT-AG, GC-AG or AT-AC)
-J discard any mRNAs that either lack initial START codon
or the terminal STOP codon, or have an in-frame stop codon
(i.e. only print mRNAs with a complete CDS)
--no-pseudo: filter out records matching the 'pseudo' keyword
--in-bed: input should be parsed as BED format (automatic if the input
filename ends with .bed*)
--in-tlf: input GFF-like one-line-per-transcript format without exon/CDS
features (see --tlf option below; automatic if the input
filename ends with .tlf)
--stream: fast processing of input GFF/BED transcripts as they are received
(no sorting, exons must be grouped by transcript in the input data)
Clustering:
-M/--merge : cluster the input transcripts into loci, discarding
"redundant" transcripts (those with the same exact introns
and fully contained or equal boundaries)
-d <dupinfo> : for -M option, write duplication info to file <dupinfo>
--cluster-only: same as -M/--merge but without discarding any of the
"duplicate" transcripts, only create "locus" features
-K for -M option: also discard as redundant the shorter, fully contained
transcripts (intron chains matching a part of the container)
--cset for -K option, discard single exon transcripts when fully contained
in an exon of a multi-exon transcript
-Q for -M option, no longer require boundary containment when assessing
redundancy (can be combined with -K); only introns have to match for
multi-exon transcripts, and >=80% overlap for single-exon transcripts
-Y for -M option, enforce -Q but also discard overlapping single-exon
transcripts, even on the opposite strand (can be combined with -K)
Output options:
--force-exons: make sure that the lowest level GFF features are considered
"exon" features
--gene2exon: for single-line genes not parenting any transcripts, add an
exon feature spanning the entire gene (treat it as a transcript)
--t-adopt: try to find a parent gene overlapping/containing a transcript
that does not have any explicit gene Parent
-D decode url encoded characters within attributes
-Z merge very close exons into a single exon (when intron size<4)
-g full path to a multi-fasta file with the genomic sequences
for all input mappings, OR a directory with single-fasta files
(one per genomic sequence, with file names matching sequence names)
-j output the junctions and the corresponding transcripts
-u write a fasta file with the whole span of each transcript
(including introns)
-w write a fasta file with spliced exons for each transcript
--w-add <N> for the -w option, extract additional <N> bases
both upstream and downstream of the transcript boundaries
(this also applies to -u option)
--w-nocds for -w, disable the output of CDS info in the FASTA file
-x write a fasta file with spliced CDS for each GFF transcript
-y write a protein fasta file with the translation of CDS for each record
-W for -w, -x and -y options, write in the FASTA defline all the exon
coordinates projected onto the spliced sequence;
-S for -y option, use '*' instead of '.' as stop codon translation
-L Ensembl GTF to GFF3 conversion, adds version to IDs
-m <chr_replace> is a name mapping table for converting reference
sequence names, having this 2-column format:
<original_ref_ID> <new_ref_ID>
-t use <trackname> in the 2nd column of each GFF/GTF output line
-o write the output records into <outfile> instead of stdout
-T main output will be GTF instead of GFF3
--bed output records in BED format instead of default GFF3
--tlf output "transcript line format" which is like GFF
but with exons and CDS related features stored as GFF
attributes in the transcript feature line, like this:
exoncount=N;exons=<exons>;CDSphase=<N>;CDS=<CDScoords>
<exons> is a comma-delimited list of exon_start-exon_end coordinates;
<CDScoords> is CDS_start:CDS_end coordinates or a list like <exons>
--table output a simple tab delimited format instead of GFF, with columns
having the values of GFF attributes given in <attrlist>; special
pseudo-attributes (prefixed by @) are recognized:
@id, @geneid, @chr, @start, @end, @strand, @track, @numexons, @exons,
@cds, @introns, @covlen, @cdslen
If any of -w/-y/-x FASTA output files are enabled, the same fields
(excluding @id) are appended to the definition line of corresponding
FASTA records
-v,-E expose (warn about) duplicate transcript IDs and other potential
problems with the given GFF/GTF records
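Two common gffread invocations, sketched with placeholder filenames (annotation.gtf, genome.fa):

```shell
# Convert a GTF annotation to the default simplified GFF3 output:
gffread annotation.gtf -o annotation.gff3

# Extract spliced transcript sequences (-w) and protein translations
# of the CDS (-y); -g points at the genome FASTA the annotation maps to:
gffread -w transcripts.fa -y proteins.fa -g genome.fa annotation.gff3
```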
> vsearch
vsearch v2.28.1_linux_x86_64, 251.5GB RAM, 128 cores
https://github.com/torognes/vsearch
For more help, please enter: vsearch --help
For further details, please consult the manual by entering: man vsearch
Selected command examples:
vsearch --allpairs_global FILENAME --id 0.5 --alnout FILENAME
vsearch --cluster_size FILENAME --id 0.97 --centroids FILENAME
vsearch --cut FILENAME --cut_pattern G^AATT_C --fastaout FILENAME
vsearch --fastq_chars FILENAME
vsearch --fastq_convert FILENAME --fastqout FILENAME --fastq_ascii 64
vsearch --fastq_eestats FILENAME --output FILENAME
vsearch --fastq_eestats2 FILENAME --output FILENAME
vsearch --fastq_mergepairs FILENAME --reverse FILENAME --fastqout FILENAME
vsearch --fastq_stats FILENAME --log FILENAME
vsearch --fastx_filter FILENAME --fastaout FILENAME --fastq_trunclen 100
vsearch --fastx_getseq FILENAME --label LABEL --fastaout FILENAME
vsearch --fastx_mask FILENAME --fastaout FILENAME
vsearch --fastx_revcomp FILENAME --fastqout FILENAME
vsearch --fastx_subsample FILENAME --fastaout FILENAME --sample_pct 1
vsearch --fastx_uniques FILENAME --output FILENAME
vsearch --makeudb_usearch FILENAME --output FILENAME
vsearch --search_exact FILENAME --db FILENAME --alnout FILENAME
vsearch --sff_convert FILENAME --output FILENAME --sff_clip
vsearch --shuffle FILENAME --output FILENAME
vsearch --sintax FILENAME --db FILENAME --tabbedout FILENAME
vsearch --sortbylength FILENAME --output FILENAME
vsearch --sortbysize FILENAME --output FILENAME
vsearch --uchime_denovo FILENAME --nonchimeras FILENAME
vsearch --uchime_ref FILENAME --db FILENAME --nonchimeras FILENAME
vsearch --usearch_global FILENAME --db FILENAME --id 0.97 --alnout FILENAME
Other commands: cluster_fast, cluster_smallmem, cluster_unoise, cut,
derep_id, derep_fulllength, derep_prefix, derep_smallmem,
fasta2fastq, fastq_filter, fastq_join, fastx_getseqs,
fastx_getsubseq, maskfasta, orient, rereplicate, uchime2_denovo,
uchime3_denovo, udb2fasta, udbinfo, udbstats, version
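A minimal amplicon-style sketch with vsearch, using placeholder filenames; --sizeout records per-sequence abundances (size=N) that --cluster_size then sorts by:

```shell
# Dereplicate, keeping abundance annotations in the FASTA headers:
vsearch --fastx_uniques seqs.fasta --output uniques.fasta --sizeout

# Cluster the dereplicated sequences at 97% identity, write centroids:
vsearch --cluster_size uniques.fasta --id 0.97 --centroids otus.fasta
```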
> bamtools
usage: bamtools [--help] COMMAND [ARGS]
Available bamtools commands:
convert Converts between BAM and a number of other formats
count Prints number of alignments in BAM file(s)
coverage Prints coverage statistics from the input BAM file
filter Filters BAM file(s) by user-specified criteria
header Prints BAM header information
index Generates index for BAM file
merge Merge multiple BAM files into single file
random Select random alignments from existing BAM file(s), intended more as a testing tool.
resolve Resolves paired-end reads (marking the IsProperPair flag as needed)
revert Removes duplicate marks and restores original base qualities
sort Sorts the BAM file according to some criteria
split Splits a BAM file on user-specified property, creating a new BAM output file for each value found
stats Prints some basic statistics from input BAM file(s)
See 'bamtools help COMMAND' for more information on a specific command.
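A few bamtools one-liners as a sketch; aln.bam is a placeholder for an existing BAM file:

```shell
# Index the BAM, then count its alignments:
bamtools index -in aln.bam
bamtools count -in aln.bam

# Keep only mapped reads with mapping quality >= 30:
bamtools filter -in aln.bam -out filtered.bam -isMapped true -mapQuality ">=30"
```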
References
<HTSlib>
HTSlib: C library for reading/writing high-throughput sequencing data
James K Bonfield, John Marshall, Petr Danecek, Heng Li, Valeriu Ohan, Andrew Whitwham, Thomas Keane, Robert M Davies
GigaScience, Volume 10, Issue 2, February 2021, giab007
<SAMTools & BCFTools>
Twelve years of SAMtools and BCFtools
Petr Danecek, James K Bonfield, Jennifer Liddle, John Marshall, Valeriu Ohan, Martin O Pollard, Andrew Whitwham, Thomas Keane, Shane A McCarthy, Robert M Davies, Heng Li
GigaScience, Volume 10, Issue 2, February 2021, giab008
<SRA Toolkit>
https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=software and the SRA Toolkit Development Team.
<vcftools>
The variant call format and VCFtools
Petr Danecek, Adam Auton, Goncalo Abecasis, Cornelis A. Albers, Eric Banks, Mark A. DePristo, Robert Handsaker, Gerton Lunter, Gabor Marth, Stephen T. Sherry, Gilean McVean, Richard Durbin and 1000 Genomes Project Analysis Group
Bioinformatics. 2011 Aug 1; 27(15): 2156–2158.
<Bowtie2>
Fast gapped-read alignment with Bowtie 2
Ben Langmead and Steven L Salzberg
Nat Methods. 2012 Mar 4; 9(4): 357–359.
<Bwa>
Fast and accurate short read alignment with Burrows-Wheeler transform
Heng Li, Richard Durbin
Bioinformatics. 2009 Jul 15;25(14):1754-60
Fast and accurate long-read alignment with Burrows-Wheeler transform
Heng Li, Richard Durbin
Bioinformatics. 2010 Mar 1;26(5):589-95.
<Bwa2>
Efficient Architecture-Aware Acceleration of BWA-MEM for Multicore Systems.
Vasimuddin Md, Sanchit Misra, Heng Li, Srinivas Aluru.
IEEE Parallel and Distributed Processing Symposium (IPDPS), 2019. 10.1109/IPDPS.2019.00041
<Minimap2>
Minimap2: pairwise alignment for nucleotide sequences
Heng Li
Bioinformatics, Volume 34, Issue 18, September 2018, Pages 3094–3100
New strategies to improve minimap2 alignment accuracy
Heng Li
Bioinformatics, Volume 37, Issue 23, December 2021, Pages 4572–4574
<Hisat2>
HISAT: a fast spliced aligner with low memory requirements.
Kim D, Langmead B, Salzberg SL
Nat Methods. 2015 Apr;12(4):357-60.
<STAR>
STAR: ultrafast universal RNA-seq aligner
Alexander Dobin, Carrie A. Davis, Felix Schlesinger, Jorg Drenkow, Chris Zaleski, Sonali Jha, Philippe Batut, Mark Chaisson, and Thomas R. Gingeras
Bioinformatics. 2013 Jan; 29(1): 15–21.
Mapping RNA-seq Reads with STAR
Alexander Dobin and Thomas R. Gingeras
Curr Protoc Bioinformatics. 2015 Sep 3; 51:11.14.1–11.14.19
<Falco>
Falco: high-speed FastQC emulation for quality control of sequencing data
Guilherme de Sena Brandine, Andrew D Smith
F1000Res. 2019; 8:1874
<GffRead>
GFF Utilities: GffRead and GffCompare
Geo Pertea and Mihaela Pertea
F1000Res. 2020; 9:304. doi: 10.12688/f1000research.23297.2
<Cd-hit>
Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences.
Li W, Godzik A.
Bioinformatics. 2006 Jul 1;22(13):1658-9. Epub 2006 May 26.
<VSEARCH>
VSEARCH: a versatile open source tool for metagenomics
Torbjørn Rognes, Tomáš Flouri, Ben Nichols, Christopher Quince, Frédéric Mahé
PeerJ. 2016; 4: e2584
<BamTools>
BamTools: a C++ API and toolkit for analyzing and managing BAM files
Derek W Barnett, Erik K Garrison, Aaron R Quinlan, Michael P Strömberg, Gabor T Marth
Bioinformatics. 2011 Jun 15;27(12):1691-2.
Related