2018-05-30

メタゲノムのリードの系統アサインメントを行う Centrifuge

2019 1/17　タイトル修正

2019 4/16 condaインストール

2019 4/19 ダウンロード方法追記

2019 5/9 パラメータ追記

2019 5/13 test追加

2020 4/16 help更新

　アーキアやバクテリアなどの微生物は、土壌や海洋から温泉や深海に至るまで、事実上あらゆる場所で発見されている（Keller and Zengler 2004）。これらの微生物はまた、皮膚や腸管のような人体のさまざまなニッチを含む生きた生き物の上や中にも豊富に存在する（Human Microbiome Project Consortium 2012）。これらの目に見えない生命体は、広範な生物学的機能を果たす。それらは多くの種の生存にとって不可欠であり、地球の生態学的バランスを維持している。わずか数分の1（土壌中1％未満、海洋中でさえも）が単離され、栽培できるにもかかわらず、何百万もの原核生物種が存在する（Schloss and Handelsman 2004）（Amann et al 1995）。メタゲノムシーケンシングとして知られている微生物群集のハイスループットシーケンシングは、培養を必要とせず、したがって、微生物種の生物学的機能およびそれらの可視世界への影響に無数の洞察を提供する可能性がある。

　2004年にRefSeqデータベースには完全な原核生物ゲノムが179個含まれていた。2009年までにゲノムが954に、2015年12月までに4278に増加した。種の複雑なmixtureからの何百万ものリードを含むメタゲノミクス試料の分析は、これらのリードを迅速かつ正確に分類するためのコンパクトでスケーラブルな索引付けスキームを必要とする。現在のメタゲノミクス分類プログラムのほとんどは、分類速度が遅い、indexサイズが大きい、またはその両方に悩まされている。たとえば、Naive Bayes Classifier（NBC）（Rosen et al 2008）やPhymmBL（Brady and Salzberg 2009、2011）などの機械学習ベースのアプローチでは、1分あたり100リードが分類されるが、何百万というリードを処理するには遅すぎる。これとは対照的に、Kraken（Wood and Salzberg 2014）（紹介）プロセスで採用されている疑似アライメントアプローチははるかに上回る速度を持ち、1分間に100万以上のリードを処理できるが、正確なk-merマッチングアルゴリズムのため大きなindexが必要となる。例えば、Krakenの31-merデータベースでは、4278の原核生物ゲノムのために93 GBのメモリ（RAM）が要求され、現在のデスクトップコンピュータよりもかなり多くのメモリが必要なことになる。

　幸運なことに、Bowtie（Langmead et al 2009; Langmead and Salzberg 2012）やBWA（Li and Durbin 2009、2010）などの最新のリードマッピングアルゴリズムは、比較的小さなメモリフットプリントで非常に速くアライメントを行うデータ構造を開発している。 Burrows-Wheeler変換（Burrows and Wheeler 1994）とFerragina-Manzini (FM) index（Ferragina and Manzini 2000）をベースにした、データ構造を効率的に保存できるメタゲノミクス分類器Centrifugeを作成した。

公式ページ

http://ccb.jhu.edu/software/centrifuge/index.shtml

マニュアル

インストール

condaを使ってmacos10.14でテストした。

公式ページ　ダウンロードリンク（右側のリリースにある Mac OS X x86_64 binary）

http://ccb.jhu.edu/software/centrifuge/manual.shtml

解凍した中にプログラム本体が入っている。condaを使っても導入できる。

#bioconda (link) mac, linux
conda create -n centrifuge -y
conda activate centrifuge
conda install -c bioconda -y centrifuge

> centrifuge -h

$ centrifuge -h

Centrifuge version 1.0.4 by the Centrifuge developer team (centrifuge.metagenomics@gmail.com)

Usage:

centrifuge [options]* -x <cf-idx> {-1 <m1> -2 <m2> | -U <r> | --sample-sheet <s> } [-S <filename>] [--report-file <report>]

<cf-idx> Index filename prefix (minus trailing .X.cf).

<m1> Files with #1 mates, paired with files in <m2>.

Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).

<m2> Files with #2 mates, paired with files in <m1>.

Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).

<r> Files with unpaired reads.

Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).

<s> A TSV file where each line represents a sample.

<filename> File for classification output (default: stdout)

<report> File for tabular report output (default: centrifuge_report.tsv)

<m1>, <m2>, <r> can be comma-separated lists (no whitespace) and can be

specified many times. E.g. '-U file1.fq,file2.fq -U file3.fq'.

Options (defaults in parentheses):

Input:

-q query input files are FASTQ .fq/.fastq (default)

--qseq query input files are in Illumina's qseq format

-f query input files are (multi-)FASTA .fa/.mfa

-r query input files are raw one-sequence-per-line

-c <m1>, <m2>, <r> are sequences themselves, not files

-s/--skip <int> skip the first <int> reads/pairs in the input (none)

-u/--upto <int> stop after first <int> reads/pairs (no limit)

-5/--trim5 <int> trim <int> bases from 5'/left end of reads (0)

-3/--trim3 <int> trim <int> bases from 3'/right end of reads (0)

--phred33 qualities are Phred+33 (default)

--phred64 qualities are Phred+64

--int-quals qualities encoded as space-delimited integers

--ignore-quals treat all quality values as 30 on Phred scale (off)

--nofw do not align forward (original) version of read (off)

--norc do not align reverse-complement version of read (off)

Classification:

--min-hitlen <int> minimum length of partial hits (default 22, must be greater than 15)

-k <int> report upto <int> distinct, primary assignments for each read or pair

--host-taxids <taxids> comma-separated list of taxonomic IDs that will be preferred in classification

--exclude-taxids <taxids> comma-separated list of taxonomic IDs that will be excluded in classification

Output:

--out-fmt <str> define output format, either 'tab' or 'sam' (tab)

--tab-fmt-cols <str> columns in tabular format, comma separated

default: readID,seqID,taxID,score,2ndBestScore,hitLength,queryLength,numMatches

-t/--time print wall-clock time taken by search phases

--un <path> write unpaired reads that didn't align to <path>

--al <path> write unpaired reads that aligned at least once to <path>

--un-conc <path> write pairs that didn't align concordantly to <path>

--al-conc <path> write pairs that aligned concordantly at least once to <path>

(Note: for --un, --al, --un-conc, or --al-conc, add '-gz' to the option name, e.g.

--un-gz <path>, to gzip compress output, or add '-bz2' to bzip2 compress output.)

--quiet print nothing to stderr except serious errors

--met-file <path> send metrics to file at <path> (off)

--met-stderr send metrics to stderr (off)

--met <int> report internal counters & metrics every <int> secs (1)

Performance:

-p/--threads <int> number of alignment threads to launch (1)

--mm use memory-mapped I/O for index; many instances can share

Other:

--qc-filter filter out reads that are bad according to QSEQ filter

--seed <int> seed for random number generator (0)

--non-deterministic seed rand. gen. arbitrarily instead of using read attributes

--version print version information and quit

-h/--help print this usage message

> centrifuge-build

$ centrifuge-build

No input sequence or sequence file specified!

Centrifuge version 1.0.4 by Daehwan Kim (infphilo@gmail.com, http://www.ccb.jhu.edu/people/infphilo)

Usage: centrifuge-build [options]* --conversion-table <table file> --taxonomy-tree <taxonomy tree file> <reference_in> <cf_index_base>

reference_in comma-separated list of files with ref sequences

centrifuge_index_base write cf data to files with this dir/basename

Options:

-c reference sequences given on cmd line (as

<reference_in>)

-a/--noauto disable automatic -p/--bmax/--dcv memory-fitting

--bmax <int> max bucket sz for blockwise suffix-array builder

--bmaxdivn <int> max bucket sz as divisor of ref len (default: 4)

--dcv <int> diff-cover period for blockwise (default: 1024)

--nodc disable diff-cover (algorithm becomes quadratic)

-r/--noref don't build .3/.4.bt2 (packed reference) portion

-3/--justref just build .3/.4.bt2 (packed reference) portion

-o/--offrate <int> SA is sampled every 2^offRate BWT chars (default: 5)

-t/--ftabchars <int> # of chars consumed in initial lookup (default: 10)

--conversion-table <file name> a table that converts any id to a taxonomy id

--taxonomy-tree <file name> taxonomy tree

--name-table <file name> names corresponding to taxonomic IDs

--size-table <file name> table of contig (or genome) sizes

--seed <int> seed for random number generator

-q/--quiet verbose output (for debugging)

-p/--threads <int> number of alignment threads to launch (1)

--kmer-count <int> k size for counting the number of distinct k-mer

-h/--help print detailed description of tool and its options

--usage print this usage message

--version print version information and quit

> centrifuge-download

$ centrifuge-download

centrifuge-download [<options>] <database>

ARGUMENT

<database> One of refseq, genbank, contaminants or taxonomy:

- use refseq or genbank for genomic sequences,

- contaminants gets contaminant sequences from UniVec and EmVec,

- taxonomy for taxonomy mappings.

COMMON OPTIONS

-o <directory> Folder to which the files are downloaded. Default: '.'.

-P <# of threads> Number of processes when downloading (uses xargs). Default: '1'

WHEN USING database refseq OR genbank:

-d <domain> What domain to download. One or more of bacteria, viral, archaea, fungi, protozoa, invertebrate, plant, vertebrate_mammalian, vertebrate_other (comma separated).

-a <assembly level> Only download genomes with the specified assembly level. Default: 'Complete Genome'. Use 'Any' for any assembly level.

-c <refseq category> Only download genomes in the specified refseq category. Default: any.

-t <taxids> Only download the specified taxonomy IDs, comma separated. Default: any.

-g <program> Download using program. Options: rsync, curl, wget. Default curl (auto-detected).

-r Download RNA sequences, too.

-u Filter unplaced sequences.

-m Mask low-complexity regions using dustmasker. Default: off.

-l Modify header to include taxonomy ID. Default: off.

-g Download GI map.

-v Verbose mode

> centrifuge-inspect

$ centrifuge-inspect

No index name given!

Centrifuge version 1.0.4 by Daehwan Kim (infphilo@gmail.com, http://www.ccb.jhu.edu/people/infphilo)

Usage: centrifuge-inspect [options]* <cf_base>

<cf_base> cf filename minus trailing .1.cf/.2.cf/.3.cf

By default, prints FASTA records of the indexed nucleotide sequences to

standard out. With -n, just prints names. With -s, just prints a summary of

the index parameters and sequences. With -e, preserves colors if applicable.

Options:

-a/--across <int> Number of characters across in FASTA output (default: 60)

-n/--names Print reference sequence names only

-s/--summary Print summary incl. ref names, lengths, index properties

-e/--bt2-ref Reconstruct reference from .cf (slow, preserves colors)

--conversion-table Print conversion table

--taxonomy-tree Print taxonomy tree

--name-table Print names corresponding to taxonomic IDs

--size-table Print the lengths of the sequences belonging to the same taxonomic ID

-v/--verbose Verbose output (for debugging)

-h/--help print detailed description of tool and its options

--help print this usage message

ラン

初回はデータベースをダウンロードしてビルドする必要がある。次の２コマンドを順番に実行する。

centrifuge-download -o taxonomy taxonomy

#Download all of the complete archaeal, viral, and bacterial genomes from RefSeq.
centrifuge-download -o library -m -d "archaea,bacteria,viral" refseq > seqid2taxid.map

ダウンロードしたファイルを１つにまとめる。

cat library/*/*.fna > input-sequences.fna

#数が多すぎてエラーがでたら、findで検索して、xargsでcatに渡す
cd library/bacteria/
find . -maxdepth 1 -name '*.fna' |xargs cat > ../bacteria.fna
cd library/virus/
find . -maxdepth 1 -name '*.fna' |xargs cat > ../virus.fna
cd library/archaea/
find . -maxdepth 1 -name '*.fna' |xargs cat > ../archaea.fna
cd ../
cat bacteria.fna virus.fna archaea.fna > ../input-sequences.fna
#-maxdepth 3で同時にやってもいいが、出力自身を含めないよう、出力は*fnaにならないようにする。または-maxdepth 3が及ばないパスに出力する。

ビルドする。

centrifuge-build -p 20 --conversion-table seqid2taxid.map --taxonomy-tree taxonomy/nodes.dmp --name-table taxonomy/names.dmp input-sequences.fna abv
#mac os10.12ではビルド中にクラッシュした。

abv.1.cf、abv.2.cf、abv.3.cfができる。ビルド後はabv1~3以外消して問題ない（library/と taxonomy/は消してOK）。上には載せてないが、ヒト、マウスのindexも追加できる。やり方はマニュアル参照（bacteriaのサイズが大きいので、まずvirusなどでテストして動作確認してから、全ファイルのマージを実行する）。

NCBI BLAST's nt をデータベースにするには、以下のコマンドを実行する。

wget ftp://ftp.ncbi.nih.gov/blast/db/FASTA/nt.gz 
gunzip nt.gz && mv -v nt nt.fa

# Get mapping file wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/gi_taxid_nucl.dmp.gz 
gunzip -c gi_taxid_nucl.dmp.gz | sed 's/^/gi|/' > gi_taxid_nucl.map

# build index using 16 cores and a small bucket size, which will require less memory
centrifuge-build -p 16 --bmax 1342177280 --conversion-table gi_taxid_nucl.map --taxonomy-tree taxonomy/nodes.dmp --name-table taxonomy/names.dmp nt.fa nt

build済みindexをHPから直接ダウンロードする。HPにアクセスして

右側のindexのところからダウンロード。

f:id:kazumaxneo:20190419161115j:plain

wget ftp://ftp.ccb.jhu.edu/pub/infphilo/centrifuge/data/p_compressed+h+v.tar.gz
tar -zxvf p_compressed+h+v.tar.gz

p_compressed+h+v1.cf、p_compressed+h+v2.cf、p_compressed+h+v3.cfができる。ラン時は"-x p_compressed+h+v"と指定する。

環境変数CENTRIFUGE_HOMEをCentrifugeのディレクトリに設定。

export CENTRIFUGE_HOME=/home/kazuma/centrifuge-1.0.3-beta

シーケンスデータを指定してランする。データベースは上で作成した名前（abv）になる。1-3を付けずに指定する。

centrifuge -x abv -1 pair1.fq -2 pair2.fq --report-file report.txt -S output -p 12

-p number of alignment threads to launch (1)
--mm use memory-mapped I/O for index; many 'bowtie's can share

SRAのpublicデータを直接指定してランすることもできる。

centrifuge -x abv --sra-acc SRA_accession_number --report-file report.txt -S output -p 12

解析が終わると、output.txtに全リードのhit結果、report.txtに生物種ごとのhit結果（abundance）が出力される。

追記

テストラン

ORCA（紹介）のコンテナでランしてみる。insilicoseq（紹介）で21バクテリアゲノムからリードを存在量が均等になるよう発生させ、Centrifugeで定量、Recentrifuge（紹介）で視覚化する。

# Mock bacteria communityのゲノムデータSRS121011.fasta（リンク）のダウンロード。
curl -O -J -L https://osf.io/thser/download
#SRS121011をテンプレートにペアエンドリードを発生。 存在量比はuniform指定
iss generate --genomes SRS121011.fasta -p 24 --model miseq -o miseq -n 3000000 --abundance uniform

#download pre-build databae
wget ftp://ftp.ccb.jhu.edu/pub/infphilo/centrifuge/data/p_compressed+h+v.tar.gz
tar -zxvf p_compressed+h+v.tar.gz

docker run --rm -it -v$PWD:$HOME -w$HOME bcgsc/orca
> centrifuge -x abv -1 miseq_R1.fastq -2 miseq_R2.fastq -p 40 --report-file sample1_report.txt -S sample.out
> exit

#Recentrifugeでabundanceを視覚化
rcf -f sample.out