Bowtie 2を使って素早くホスト由来のリードを除く

2021 4/16 タイトル変更、文章修正, 画像差し替え

2022/07/01、09/07 追記

2023/10/17 追記

Bowtie 2の--un-concオプションを使うと、リファレンスに適切にマッピングされなかったペアエンドリード（discordant read pairs）を別出力できる。このオプションを利用することで、ホストゲノムのリードを素早く除外できる。

discordant read pairの詳細

http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml#concordant-pairs-match-pair-expectations-discordant-pairs-dont

インストール

Github

#conda (bioconda link)
mamba install -c bioconda bowtie2 -y

#docker (dockerhub link)
docker pull biocontainers/bowtie2

> bowtie2

$ bowtie2

No index, query, or output file specified!

Bowtie 2 version 2.4.2 by Ben Langmead (langmea@cs.jhu.edu, www.cs.jhu.edu/~langmea)

Usage:

bowtie2 [options]* -x <bt2-idx> {-1 <m1> -2 <m2> | -U <r> | --interleaved <i> | -b <bam>} [-S <sam>]

<bt2-idx> Index filename prefix (minus trailing .X.bt2).

NOTE: Bowtie 1 and Bowtie 2 indexes are not compatible.

<m1> Files with #1 mates, paired with files in <m2>.

Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).

<m2> Files with #2 mates, paired with files in <m1>.

Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).

<r> Files with unpaired reads.

Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).

<i> Files with interleaved paired-end FASTQ/FASTA reads

Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).

<bam> Files are unaligned BAM sorted by read name.

<sam> File for SAM output (default: stdout)

<m1>, <m2>, <r> can be comma-separated lists (no whitespace) and can be

specified many times. E.g. '-U file1.fq,file2.fq -U file3.fq'.

Options (defaults in parentheses):

Input:

-q query input files are FASTQ .fq/.fastq (default)

--tab5 query input files are TAB5 .tab5

--tab6 query input files are TAB6 .tab6

--qseq query input files are in Illumina's qseq format

-f query input files are (multi-)FASTA .fa/.mfa

-r query input files are raw one-sequence-per-line

-F k:<int>,i:<int> query input files are continuous FASTA where reads

are substrings (k-mers) extracted from a FASTA file <s>

and aligned at offsets 1, 1+i, 1+2i ... end of reference

-c <m1>, <m2>, <r> are sequences themselves, not files

-s/--skip <int> skip the first <int> reads/pairs in the input (none)

-u/--upto <int> stop after first <int> reads/pairs (no limit)

-5/--trim5 <int> trim <int> bases from 5'/left end of reads (0)

-3/--trim3 <int> trim <int> bases from 3'/right end of reads (0)

--trim-to [3:|5:]<int> trim reads exceeding <int> bases from either 3' or 5' end

If the read end is not specified then it defaults to 3 (0)

--phred33 qualities are Phred+33 (default)

--phred64 qualities are Phred+64

--int-quals qualities encoded as space-delimited integers

Presets: Same as:

For --end-to-end:

--very-fast -D 5 -R 1 -N 0 -L 22 -i S,0,2.50

--fast -D 10 -R 2 -N 0 -L 22 -i S,0,2.50

--sensitive -D 15 -R 2 -N 0 -L 22 -i S,1,1.15 (default)

--very-sensitive -D 20 -R 3 -N 0 -L 20 -i S,1,0.50

For --local:

--very-fast-local -D 5 -R 1 -N 0 -L 25 -i S,1,2.00

--fast-local -D 10 -R 2 -N 0 -L 22 -i S,1,1.75

--sensitive-local -D 15 -R 2 -N 0 -L 20 -i S,1,0.75 (default)

--very-sensitive-local -D 20 -R 3 -N 0 -L 20 -i S,1,0.50

Alignment:

-N <int> max # mismatches in seed alignment; can be 0 or 1 (0)

-L <int> length of seed substrings; must be >3, <32 (22)

-i <func> interval between seed substrings w/r/t read len (S,1,1.15)

--n-ceil <func> func for max # non-A/C/G/Ts permitted in aln (L,0,0.15)

--dpad <int> include <int> extra ref chars on sides of DP table (15)

--gbar <int> disallow gaps within <int> nucs of read extremes (4)

--ignore-quals treat all quality values as 30 on Phred scale (off)

--nofw do not align forward (original) version of read (off)

--norc do not align reverse-complement version of read (off)

--no-1mm-upfront do not allow 1 mismatch alignments before attempting to

scan for the optimal seeded alignments

--end-to-end entire read must align; no clipping (on)

--local local alignment; ends might be soft clipped (off)

Scoring:

--ma <int> match bonus (0 for --end-to-end, 2 for --local)

--mp <int> max penalty for mismatch; lower qual = lower penalty (6)

--np <int> penalty for non-A/C/G/Ts in read/ref (1)

--rdg <int>,<int> read gap open, extend penalties (5,3)

--rfg <int>,<int> reference gap open, extend penalties (5,3)

--score-min <func> min acceptable alignment score w/r/t read length

(G,20,8 for local, L,-0.6,-0.6 for end-to-end)

Reporting:

(default) look for multiple alignments, report best, with MAPQ

-k <int> report up to <int> alns per read; MAPQ not meaningful

-a/--all report all alignments; very slow, MAPQ not meaningful

Effort:

-D <int> give up extending after <int> failed extends in a row (15)

-R <int> for reads w/ repetitive seeds, try <int> sets of seeds (2)

Paired-end:

-I/--minins <int> minimum fragment length (0)

-X/--maxins <int> maximum fragment length (500)

--fr/--rf/--ff -1, -2 mates align fw/rev, rev/fw, fw/fw (--fr)

--no-mixed suppress unpaired alignments for paired reads

--no-discordant suppress discordant alignments for paired reads

--dovetail concordant when mates extend past each other

--no-contain not concordant when one mate alignment contains other

--no-overlap not concordant when mates overlap at all

BAM:

--align-paired-reads

Bowtie2 will, by default, attempt to align unpaired BAM reads.

Use this option to align paired-end reads instead.

--preserve-tags Preserve tags from the original BAM record by

appending them to the end of the corresponding SAM output.

Output:

-t/--time print wall-clock time taken by search phases

--un <path> write unpaired reads that didn't align to <path>

--al <path> write unpaired reads that aligned at least once to <path>

--un-conc <path> write pairs that didn't align concordantly to <path>

--al-conc <path> write pairs that aligned concordantly at least once to <path>

(Note: for --un, --al, --un-conc, or --al-conc, add '-gz' to the option name, e.g.

--un-gz <path>, to gzip compress output, or add '-bz2' to bzip2 compress output.)

--quiet print nothing to stderr except serious errors

--met-file <path> send metrics to file at <path> (off)

--met-stderr send metrics to stderr (off)

--met <int> report internal counters & metrics every <int> secs (1)

--no-unal suppress SAM records for unaligned reads

--no-head suppress header lines, i.e. lines starting with @

--no-sq suppress @SQ header lines

--rg-id <text> set read group id, reflected in @RG line and RG:Z: opt field

--rg <text> add <text> ("lab:value") to @RG line of SAM header.

Note: @RG line only printed when --rg-id is set.

--omit-sec-seq put '*' in SEQ and QUAL fields for secondary alignments.

--sam-no-qname-trunc

Suppress standard behavior of truncating readname at first whitespace

at the expense of generating non-standard SAM.

--xeq Use '='/'X', instead of 'M,' to specify matches/mismatches in SAM record.

--soft-clipped-unmapped-tlen

Exclude soft-clipped bases when reporting TLEN

--sam-append-comment

Append FASTA/FASTQ comment to SAM record

Performance:

-p/--threads <int> number of alignment threads to launch (1)

--reorder force SAM output order to match order of input reads

--mm use memory-mapped I/O for index; many 'bowtie's can share

Other:

--qc-filter filter out reads that are bad according to QSEQ filter

--seed <int> seed for random number generator (0)

--non-deterministic

seed rand. gen. arbitrarily instead of using read attributes

--version print version information and quit

-h/--help print this usage message

> bowtie2-build

No input sequence or sequence file specified!

Bowtie 2 version 2.5.1 by Ben Langmead (langmea@cs.jhu.edu,

www.cs.jhu.edu/~langmea)

Usage: bowtie2-build [options]* <reference_in> <bt2_index_base>

reference_in comma-separated list of files with ref sequences

bt2_index_base write bt2 data to files with this dir/basename

*** Bowtie 2 indexes will work with Bowtie v1.2.3 and later. ***

Options:

-f reference files are Fasta (default)

-c reference sequences given on cmd line (as

<reference_in>)

--large-index force generated index to be 'large', even if ref

has fewer than 4 billion nucleotides

--debug use the debug binary; slower, assertions enabled

--sanitized use sanitized binary; slower, uses ASan and/or UBSan

--verbose log the issued command

-a/--noauto disable automatic -p/--bmax/--dcv memory-fitting

-p/--packed use packed strings internally; slower, less memory

--bmax <int> max bucket sz for blockwise suffix-array builder

--bmaxdivn <int> max bucket sz as divisor of ref len (default: 4)

--dcv <int> diff-cover period for blockwise (default: 1024)

--nodc disable diff-cover (algorithm becomes quadratic)

-r/--noref don't build .3/.4 index files

-3/--justref just build .3/.4 index files

-o/--offrate <int> SA is sampled every 2^<int> BWT chars (default: 5)

-t/--ftabchars <int> # of chars consumed in initial lookup (default: 10)

--threads <int> # of threads

--seed <int> seed for random number generator

-q/--quiet verbose output (for debugging)

--h/--help print this message and quit

--version print version information and quit

テスト

公共データベースのヒト関連マイクロバイオームシーケンスデータ（ペアエンドのメタゲノムデータ）をダウンロードし、ヒトゲノム関連リードを除いてみる。

１、マッピング前にhg38 のbowtie2 index（FM Index）を作成する必要がある。ここではrefgenieを使って作成済みのindexをダウンロードする。

refgenie pull -g hg38 -a bowtie2_index

#あるいはindexingする
bowtie2-build --threads 20 -f ref.fasta bowtie2_index

２、"--un-conc"をつけて、ペアエンドのfastqをhg38 に対してグローバルアラインメントモードの--sensitive設定（デフォルト）でマッピングする。ペアエンドの向きはデフォルトのFR、インサートサイズの正常値は0-500とした。スレッドは32指定した。

bowtie2 -x <path>/<to>/bowtie2_index -p 32 -1 pair_R1.fq.gz -2 pair_R2.fq.gz --un-conc-gz unmap_R%.fq.gz --fr -I 0 -X 500 -S mapped.sam

#sam(mappedもunmappedも含まれる)が不要なら捨てる
bowtie2 -x bowtie2_index -p 32 -1 pair_R1.fq.gz -2 pair_R2.fq.gz --un-conc-gz unmap_R%.fq.gz --fr -I 0 -X 500 -S /dev/null

#<参照> 通常のペアエンドリードマッピングコマンド例。
#samtoolsにパイプしてbam出力。シングルエンドなら-Uを使う
bowtie2 -p 20 --sensitive -x bowtie2_index -1 R1.fq.gz -2 R2.fq.gz |samtools sort -@ 8 - > map.bam

#他にも、"local alignment"とend-to-end(global) alignment"のプリセットパラメータにも注意。デフォルトはend-to-endの"--sensitive"プリセット。

--un-conc <path> write pairs that didn't align concordantly to <path>
--fr/--rf/--ff -1, -2 mates align fw/rev, rev/fw, fw/fw (--fr)
-I/--minins <int> minimum fragment length (0)
-X/--maxins <int> maximum fragment length (500)
-p/--threads <int> number of alignment threads to launch (1)

"--un-conc <prefix>"フラグを立てることで、samファイルに加えてunmap_R1.fq.gzとunmap_R2.fq.gzが書き出される（bzip2出力なら"--un-conc-bz2"、rawなら"--un-conc"）。unmap_R1.fq.gzとunmap_R2.fq.gzを使って再びhg38にマッピングして視覚化した（下の写真）。写真は上が処理前、下が処理後だが、マッピングされるリードは処理後も一定数見られた。

f:id:kazumaxneo:20210416111929p:plain

fastq内のホストゲノムにマッピングされるリードの割合を計算すると、元のデータは41％で、処理後は15.4%だった。インサートサイズの上限が小さかったかもしれない。

２とは反対に、リファレンスに適切にマッピングされるペアエンドリードだけ書き出す。

bowtie2 -x bowtie2_index -p 32 -1 pair_R1.fq.gz -2 pair_R2.fq.gz --al-conc map_R%.fq.gz --fr -I 0 -X 500 -S /dev/null

--al-conc write pairs that aligned concordantly at least once to <path>

"--un-conc"で書き出されるのは不適切なペアエンドリードなので、インサートサイズがおかしいリードペア、片方または両方がマッピングされないリードペア、向きが適切ではないリードペア（上ではFR以外）、はfastqに書き出される。マイクロバイオーム由来のリードも稀にヒトゲノムにマッピングされる事はあるだろうが、ペアエンドのリードが両方適切にヒトゲノムにマッピングされる可能性は低いため、proper read pairsとdiscordant read pairsを分けるこの方法は有効に働く。
ペアエンドは同期が維持されている。このまま下流解析に進めることができる
この方法では、ヒトの多型部位などリードが適切にマッピングされない部位のリードペアも出力される。ホストゲノムのリードはかなり除外されるが、このようなリードは除去されずに残ることは知っておく（=保守的な方法）。
ライブラリ調整法によっては、インサートサイズ分布はかなりロングテールになることがあり、このようなデータでは、インサートサイズが長いホストゲノムのリードペアも全て異常なリードとして書き出される。回避するには、インサートサイズの正常値maxを増やす（データを見て検討する）。
この方法は構造変異（SV）関連リードを濃縮することにも利用されている。ただし、シグナル/バックグラウンド比が変わるため、SVの方法論によってはおかしくなる点に注意が必要。異常なリードが濃縮されるということは、例えば、スモールゲノムのトランスポゾン導入シークエンシングリードを分析する時、この方法でターゲットゲノムのリードを除き、外来のトランスポゾン関連リードを簡単にエンリッチできるということを意味する（エンリッチ後、最低カバレッジの閾値を少しだけ大きめに指定してde novoアセンブルし、bandage上でトランスポゾン配列をblast検索する）。
"--un-conc"オプションはhisat2にも存在する。
この方法以外にも、sam/bamのFLAG（解説ページ）を使ってdiscordant read pairsを分離することもできる。