2018-05-08

クラスタを自動で決めてビニングする BinSanity

2019 4/25　誤字修正

2019 7/6 インストール追記

　微生物の生態学に関する研究は、微生物の単離と培養が困難であることによるボトルネックを経験することが普通である（論文より　Staley＆Konopka、1985）。実験室環境でほとんどの生物を培養することの困難さのために、代替方法を使ってコミュニティ構造および推定上の機能性を解明することが一般に行われている。そのような方法の1つは、環境中のすべての微生物の集団ゲノム（メタゲノミクス）の配列決定である（Handelsman et al、1998）。メタゲノミクスは、ゲノムの可能性を解明し、経路、代謝および分類学に関する情報を提供し、培養せずに環境の状況に関する推論を可能にする（Meyer et al、2016）。Metagenome-assembled genomes （MAG）にcontigを分類することは、メタゲノムデータを分析する際に直面するハードルの1つである。典型的には、現在のビニングプロトコルでは、サイズ閾値以下のコンティグに対する精度の低下、クラスターの識別におけるヒトの介入の必要性、関連微生物の差別化、または低カバレッジおよび低アベイラビリティ生物の排除を含むいくつかの問題の1つに遭遇する2014; Bowers et al、2015; Imelfort et al、2014）。

　一般的なunsupervisedのビニング法では、関連する配列の推定上のグループ（ビン）を取ってくるために、例えばテトラヌクレオチド頻度（Anantharaman、Breier＆Dick、2016; Pride et al、2003; Tully＆Heidelberg、2016; Tully et al、2014）を使用したりする。codon usage （Chen et al。、2004; Kanayaら、1999）やGC含量（Bohlin et al、2010; Chen et al、2004）および短いオリゴヌクレオチド（k-mer ）（Sandberg et al、2001; Zhou、Olman＆Xu、2008）などののtaxon特異的な性質があるため、これらのフィンガープリントはコンティグの特徴付けおよびクラスター化に用いられてきた。しかしながら、composition単独の利用では、いくつかの理由でビニングプロセス中に偏りをもたらす可能性がある。これには例えば類似のフィンガープリントを有する密接に関連する種、および/または現実を表さないキメラビンを作製してしまう水平移動によって取得された遺伝子などがあげられる（Dick et al、2009）。ビニング中の追加変数としてカバレッジ情報を組み込むことによって、いくつかの方法およびプロトコルが成功した（Alneberg et al、2014; Imelfort et al、2014; Kang et al、2015; Lu et al、2016; Wu et al、2014）。新しいビニングプロトコルの開発は、複雑な環境コミュニティを特徴づけ、現在では達成できない培養ベースの研究レベルでの微生物多様性を調べるために不可欠である。

　BinSanityはクラスタリングアルゴリズムAffinity Propagation （AP）を使用し、コンティグカバレッジを主要な区切り要素として取り入れている。他のクラスタリングアルゴリズムは、compositionデータおよびカバレッジデータを使用して関連するDNAフラグメントを効果的にグループ化することができるが、階層的およびk-meansクラスタリングのような一般的な方法は、最終的なクラスタ数（例えばベイジアン情報基準）を人間が指定する必要がある。複雑な生態系では、コミュニティの多様性のために先験的な数値を割り当てることがますます困難になっている。対照的に、APはクラスター数を決定するにあたり入力を必要としない。代わりに、すべての点が潜在的なクラスター中心として反復的に考慮される。データは、APがさまざまな種類のデータを同様のクラスタリング方法（Chen-Chia et al、2015; Flynn＆Moneypenny、2013; Frey＆Dueck、2007; Fujiwara et al、2015; Gan 2014; Leon、Sumedha＆Weigt、2007; Walter、Fischer＆Buhmann、2007; Zhengdong＆Carreira-Perpinan、2008）と同等に効果的にクラスタリングすることを示しており、時にはより精度の高いことが多い。以前にもクラスタリングコンティグのためAPが実装されていたが（Lin＆Liao、2016）、クラスター化の主要な方法としてsingle copy marker genesおよびテトラヌクレオチド頻度を含んでいた。対照的に、BinSanityは、カバレッジを使用して決定されたクラスタの初期セットを作成することによって、コンティグをビニングするための可能性のある合成ベースのバイアスをバイパスする（一部略）。

　BinSanityを現在公開されている5種類のビニングソフトウェアツールと比較してベンチマークした。著者らは、いくつかの人工の微生物コニュニティを構築し、これらの配列に基づいてin silicoメタゲノムサンプルを作成した。コミュニティは、組成に基づくビニングアルゴリズム、特にclosely relatedな低存在度の生物からなるメタゲノムに関して問題となる可能性のある配列で構成されていた。さらに、著者らはinfant gut microbiome time-series のデータセットを使い、最初にemergent self-organizing maps（ESOM）（Dick et al、2009）を用いて構築された高度にキュレーションされたゲノムビンのセットと比較してBinSanityを介して生成されたクラスターがどうなっているか調べた。調査の結果、BinSanityは、自動化されたプロセスを介してメタゲノミクスデータセットから高品質のゲノムを生成することができ、複雑な微生物群集を理解する能力が向上することがわかった。

図１ワークフロー。論文より転載。

マニュアル

https://github.com/edgraham/BinSanity/wiki/Usage

インストール

mac osx 10.13とubuntu16.0.4のminiconda2.4.0.5環境に導入した。

依存

Python (v2.7.13+)
Numpy (v1.11.1)
SciKit (v0.18)
Biopython (v1.66)
BedTools (v2.17.0)
Pandas (v0.13.1)
FeatureCounts(v1.5.3)
Bowtie2 (v2.2.5)
Samtools (v1.2)
CheckM（紹介）
HMMER
featureCounts (v1.2)

リンク先の指示に従い導入する。公式マニュアル（リンク）も参照してください。

本体　Github

https://github.com/edgraham/BinSanity

sudo pip install BinSanity

#scikit-learnが入らなかったので別途導入
pip install -U scikit-learn

> Binsanity

$ Binsanity

usage: Binsanity -f [/path/to/fasta] -l [fastafile] -c [coverage file]

*****************************************************************************

*********************************BinSanity***********************************

** Binsanity uses Affinity Propagation to split metagenomic contigs into **

** 'bins' using contig coverage values. It takes as input a coverage file **

** and files containing the contigs to be binned, then outputs clusters **

** of contigs in putative bins. **

** **

** NOTE: BinSanity becomes highly memory intensive at 100,000 contigs or **

** above. If you have greater than this number of contigs we recommend **

** trying the beta-version for our script binsanity-lc. **

*****************************************************************************

optional arguments:

-h, --help show this help message and exit

-c INPUTCOVFILE Specify a Coverage File

-f INPUTCONTIGFILES Specify directory

containing your contigs

-p PREFERENCE

Specify a preference (default is -3)

Note: decreasing the preference leads to more lumping,

increasing will lead to more splitting. If your range

of coverages are low you will want to decrease the preference,

if you have 10 or less replicates increasing the preference could

benefit you.

-m MAXITER

Specify a max number of iterations [default is 2000]

-v CONVITER

Specify the convergence iteration number (default is 200)

e.g Number of iterations with no change in the number

of estimated clusters that stops the convergence.

-d DAMP

Specify a damping factor between 0.5 and 1, default is 0.9

-l FASTAFILE

Specify the fasta file containing contigs you want to cluster

-x CONTIGSIZE

Specify the contig size cut-off [Default 1000 bp]

-o OUTPUTDIR

Give a name to the directory BinSanity results will be output in

[Default is 'BINSANITY-RESULTS']

--outPrefix OUTNAME

Sepcify what prefix you want appended to final Bins {optional}

--log LOGFILE

specify a name for the log file [Default is 'binsanity-logfile.txt']

--version show program's version number and exit

None

ラン

テストデータを走らせる。使うのは、Infant Gut Metagenome （pubmed）由来アセンブリデータ: igm.fa（FASTA。ファイル、35.9Mb）。bowtie2でマップングして、contig-coverage-bam.pyとcov-combine.pyでカバレッジを出しパースした結果:Infant_gut_assembly.cov.x100.lognormも準備されている。

ランする。

Binsanity -f . -l igm.fa -p -10 -c Infant_gut_assembly.cov.x100.lognorm

-f Specify directory containing your contigs
-l Specify the fasta file containing contigs you want to cluster
-c Specify a Coverage File
-p Specify a preference (default is -3). Note: decreasing the preference leads to more lumping, increasing will lead to more splitting. If your range of coverages are low you will want to decrease the preference, if you have 10 or less replicates increasing the preference could benefit you.

結果はデフォルトでは”BINSANITY-RESULTS/”に出力される。

f:id:kazumaxneo:20180507164906j:plain

binningされたFASTAが出力されている。一番上はlogファイル。

f:id:kazumaxneo:20180507165057j:plain

テスト

SRAのメタゲノムシーケンスデータを解析してみる。論文

Next generation sequencing data of a defined microbial mock community (pubmed)

で公開されたシーケンスデータを使ってみる。注意深く選択されたメタゲノムのmock (擬似) communityのシーケンスデータの論文で、研究者がベンチマークしたり、新しい手法をテストする目的で構築されている。有機物に富む土壌メタゲノムほど複雑ではないが、論文によれば、Acidobacteria、Actinobacteria、Bacteroidetes、Deinococcus-Thermus、Firmicutes、Alpha-and Gamma-Proteobacteria、Spirochaetes、Thermotogae、VerrucomicrobiaおよびEuryarchaeotaに属する23の細菌および3つの古細菌株からなっていると書かれている。ゲノムサイズ1.8Mbpから6.5Mbp、GC含量28.4~72.7％、リピート含量は0~18.3％、全てゲノムが決定されている（論文　図1、表1）。シミュレーションではなく、実際に一定量の菌からDNAを抽出してディープシーケンスしている（illumina Hiseqとpacbio RSII）。シーケンスのデプスは種によって大きな差がある。（図２　DNA抽出効率の違いやシーケンスバイアス）。

論文のショートリードシーケンスデータ（リンク）とリングリードシーケンスデータ（リンク）をダウンロードしてfastqに変換する。Hiseqでディープシーケンスされていて、fastqのサイズは120GB超ある。

#ショートリード（illumina 31GB）
prefetch --option-file SRR_Acc_List.txt  --max-size 1000000000

#ペアエンドfastqに変換 (展開するとおよそ120GB)
fastq-dump SRR3656745.sra --split-files -O output_dir

#ロングリード(pacbio 100MB)
prefetch --option-file SRR_Acc_List2.txt  --max-size 1000000000
fastq-dump SRR3656745.sra -O output_dir

ショートリードのフルデータをde novo assmeblyするのは厳しいので、30%ランダムサンプリングしてアセンブルする。

seqkit sample -p 0.3 illumina/SRR3656745_1.fastq > SRR3656745_1_30per.fq

seqkit sample -p 0.3 illumina/SRR3656745_2.fastq > SRR3656745_2_30per.fq

#時間がかかるのでここではエラー訂正は実行しない。k-merも55だけとする。
metaspades.py -k 55 -t 40 --only-assembler -1 SRR3656745_1_30per.fq -2 SRR3656745_2_30per.fq -o outdir

#hybridアセンブリ
metaspades.py -k 55 -t 40 --only-assembler -1 SRR3656745_1_30per.fq -2 SRR3656745_2_30per.fq --pacbio SRR3656744.fastq -o outdir_hybrid

ここからはショートリードのアセンブリ結果で説明していく。

ヘッダーが複雑だと失敗するらしいので、アセンブリしたFASTAのヘッダーを連番にリネームする。公式ではsimplify-fasta というスクリプトを使っているが、ここではperlのワンライナーで処理する（たぶんしなくても大丈夫。おまじない）。

perl -ane 'if(/\>/){$a++;print ">name_$a\n"}else{print;}' outdir/scaffolds.fasta > assembly_rename.fa

オリジナルデータをmappingしてbamを作る。bowite2でもよいが、遅いのでここではSTARを使う。

mkdir genome #出力用のディレクトリを作成

#index
STAR --runMode genomeGenerate --genomeDir genome/ --genomeFastaFiles assembly_rename.fa --runThreadN 20

#mapping
STAR --genomeDir genome/ --readFilesIn illumina/SRR3656745_1.fastq illumina/SRR3656745_2.fastq --runThreadN 20 --outSAMtype BAM SortedByCoordinate --outFileNamePrefix mapping

#bamディレクトリに移動してindexをつけておく。
mkdir bam
mv mappingAligned.sortedByCoord.out.bam bam/
samtools index bam/mappingAligned.sortedByCoord.out.bam

Binsanity-profile を使ってカバレッジを計算する。Binsanity-profile はfeature countsを使い、bamの各コンティグのカバレッジを計算する。pipでBinsanityを導入していればBinsanity-profile コマンドもパスが通っている。もし通って無ければBinSanityのレポジトリをcloneして取ってくる。bin/に入っている。

#Binsanity-profileがパスが通ってなければ
git clone https://github.com/edgraham/BinSanity.git
cd BinSanity/bin/

> ./Binsanity-profile #ヘルプ

$ ./Binsanity-profile

usage: Binsanity-profile -i fasta_file -s {sam,bam}_file --id contig_ids.txt -c output_file

***********************************************************************

******************************BinSanity********************************

** **

** Binsanity-profile is used to generate coverage files for **

** input to BinSanity. This uses Featurecounts to generate a **

** a coverage profile and transforms data for input into Binsanity, **

** Binsanity-refine, and Binsanity-wf **

** **

***********************************************************************

optional arguments:

-h, --help show this help message and exit

-i INPUTFASTA Specify fasta file being profiled

-s INPUTMAPLOC

identify location of BAM files

BAM files should be indexed and sorted

--ids INPUTIDS

Identify file containing contig ids

-c OUTCOV

Identify name of output file for coverage information

--transform TRANSFORM

Indicate what type of data transformation you want in the final file [Default:log]:

scale --> Scaled by multiplying by 100 and log transforming

log --> Log transform

None --> Raw Coverage Values

X5 --> Multiplication by 5

X10 --> Multiplication by 10

X100 --> Multiplication by 100

SQR --> Square root

We recommend using a scaled log transformation for initial testing.

Other transformations can be useful on a case by case basis

-T THREADS Specify Number of Threads For Feature Counts [Default: 1]

-o OUTDIRECTORY Specify directory for output files to be deposited [Default: Working Directory]

--version show program's version number and exit

カバレッジファイルを作成するにはcontigのidファイル（FASTAのヘッダーを抽出したもの）も必要になる。idファイル作成はutilsディレクトリの"get-ids"を使う。-xで最小サイズ（bp）を指定することで、小さすぎるコンティグは除外できる。

BinSanity/utils/get-ids -f . -l assembly_rename.fa -x 1000 -o contig_id

-fでFASTAのディレクトリを指定している (ここではカレント指定)。

出力。このリストがカバレッジ分析対象になる。

$ head -n 5 contig_id

name_1

name_2

name_3

name_4

name_5

準備ができたら、Binsanity-profileでカバレッジリスト作成する。さきほど作成したcontig idファイルと、bamファイルのディレクトリを指定する（bowtie2などでマッピングしてsortしたbamを使う）。

Binsanity-profile -i assembly_rename.fa --ids contig_id --transform scale -s bam/ -c output

--transform scale Scaled by multiplying by 100 and log transformed
-i Specify fasta file being profiled
--ids Identify file containing contig ids
-s Identify location of BAM files BAM files should be indexed and sorted
-c Identify name of output file for coverage information

100xしてlog変換したカバレッジが出力される。

$ head output.cov.x100.lognorm

name_2 1.17806802743

name_3 1.15201791709

name_1 1.13838011255

name_6 1.13774417222

name_7 1.15336912817

name_4 1.15184188719

name_5 1.15177317581

name_8 1.15638071781

name_9 1.15594755044

name_55 1.86927209439

最後にbinningを実行する。自動でクラスター数が決定され、binningが実行される。

Binsanity -f . -l assembly_rename.fa -p -10 -c output.cov.x100.lognorm

出力。８つのクラスターが検出された。

f:id:kazumaxneo:20180508204623j:plain

この後、checkMなどのツールにかけて、コンタミネーション、完全度を推定する。

引用

BinSanity: unsupervised clustering of environmental microbial assemblies using coverage and affinity propagation.

Graham ED, Heidelberg JF, Tully BJ

PeerJ. 2017 Mar 8;5:e3035.

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

クラスタを自動で決めてビニングする BinSanity