2019-04-26

メタゲノムの既知および未知バクテリアの存在量を推定single-copy phylogenetic marker genesに基づいて見積もる mOTUs2

2019 4/26 mergeエラー修正及び追記

2019 7/2 インストール追記

2019 8/6 リンク追加

2020 4/18 condaインストール追記

2020 8/24 インストール追記

　微生物は、地球上の生命や環境中の地球化学的プロセスに影響を与える、相互作用する種の複雑な共同体に住んでいる。したがって、それらが形成するコミュニティの構成を正確にプロファイルし、比較することは基本的な関心事である。微生物群集プロファイリングのための最も一般的なアプローチは、リボソームスモールサブユニットRNA遺伝子（すなわち、バクテリアおよびアーキアの16S rRNA gene）からのPCRアンプリコン配列の分類によるものである。強力ではあるが、このアプローチは、例えば、ゲノム当たりの16S rRNA遺伝子コピー数の変動（論文補足図1）、異なる種におけるPCRプライマーの効率の不均一性、ならびにこの遺伝子の異なるサブ領域の使用（ref.3）、さらに、その高レベルの配列保存は、closely relatedな生物の解像を制限する(ref.4)。

　より最近の方法は、ショットガンシークエンシング（メタゲノミクス）によって直接環境DNAをサンプリングすることで、これはそれらバイアスのいくつかを解決する。メタゲノムデータから微生物群集組成を決定するために様々な戦略が導入されてきた。１つのアプローチは、「既知の」種のpublicに入手可能な分類学的にアノテーション付けされたリファレンスゲノム配列を用いてシーケンシングリードを分類することに基づく。得られたリードの存在量分布は、個々の種の相対存在量を導き出すため、ゲノムサイズ（ref.5、6)による正規化を必要とする（論文補足図1）。全ゲノムを使用するのではなく、現在のリファレンスゲノムデータベースの分析に基づいて、クレード特異的であることが判明した遺伝子のリードカバレッジを定量するという方法もある（ref,7）。そのようなマーカー遺伝子がゲノムごとに一度だけ出現する場合、結果として生じるリードカバレッジはコピー数またはゲノム長によって正規化する必要はない。しかし、ゲノム配列に関する以前の知識に依存する方法の欠点は、ゲノム的に特徴付けられていない分類群が考慮されないままであることであり、これは種レベルの分解能で不正確な相対存在量の推定につながる可能性がある（補足図1）。

　このようなリファレンス依存の方法では見逃されている分類群は、まとめて生物学的「ダークマター」（ref.8）と呼ぶことができる。これらには、検出される可能性があるが、標準的な方法および最新のゲノムデータベースを使用して定量化することが依然として困難である生物（本明細書では「未知の」種と呼ばれる）が含まれる。この問題を克服するために、我々は以前に種レベルの解像度で微生物分類群を捕獲し定量化するためのアプローチとして、普遍的、タンパク質コーディング、single copy phylogenetic marker gene (MG)-based operational taxonomic units (mOTUs) を用いるプロファイリングツールを導入した。 mOTUは既知および未知の両方の種からのMGに基づいて構築されており、後者は既存のメタゲノムから抽出されているため、新しい微生物群集をプロファイリングするときに、より高い分類分解能および種のより正確な定量化ができる。

　この論文では、更新された機能拡張プロファイリングツールであるmOTUプロファイルバージョン2（mOTUs2）を紹介する。これは、3100以上のメタゲノムサンプルからのデータを更新されたmOTUデータベースに統合し、ヒト関連および海洋微生物種の表現を大幅に改善する。最先端の方法と比較したmOTU 2の評価は、既知の種と未知の種の両方について感度と定量精度の改善を示していた。我々（本著者ら）は他のアプローチによって見逃された種により、どのようにメタゲノムデータ構成から相対的な存在量の推定をゆがめられるかを説明する。さらに、MOTUは、非ハウスキーピング遺伝子の使用によるアーチファクト定量を回避しながら、微生物群集のメタトランスクリプトームデータのベースライン転写活性を定量することを可能にする。最後に、メタゲノム研究では、特定の微生物群集に共存し、個人と環境のサンプリング場所が異なり、経時的に安定していることが異質の微生物株集団で報告されている（ref.10-12）。メタゲノム一塩基変異プロファイリングのために全ゲノム配列を使用する効率的な代替手段として、mOTUのMGを使用してそのような株集団間の違いが推定できることを示す。

　最初に、25,000以上のシーケンスゲノムの全セット（ref.13）の中で、以前に選択されベンチマークされた40のMGを同定した。（おそらく冗長）配列の種レベルの分類群を得るために、本著者らは96.5％の配列同一性の較正済みカットオフ（ref.4）に基づいて、より多くのものを含む5232の非冗長リファレンスMG-based operational taxonomic units （ref-mOTU）にこれらのゲノムをクラスタ化した。次に、1件のバイオームあたり多数の系統的に処理されたサンプルを要件として含む研究から、3100以上のメタゲノムを集めた（論文補足データ1）。これらは、主要な人体部位（口腔、皮膚、腸および膣ref.14、15）からの1210のサンプル、さまざまな疾病コホート16-21および243の海洋水サンプル22を含むさまざまなヒト腸メタゲノム研究からの追加1693サンプルから構成されている。これらのアセンブリで予測されたMGは、マーカー遺伝子クラスター（MGC）にまとめられた。最後に、ref-mOTUと同じ包含基準（> 5 MG / mOTU）を適用して、MGCをメタゲノムmOTU（meta-mOTU）に共存量ベースでビニングする方法を考案した（論文図1a、メソッド）。メタmOTUのビニング精度を評価するために、ref-mOTUと比較した個々のMGCの分類学的な一貫性（方法）、存在量の変動、有病率およびGC含有量に関して個々のMGCを評価した（補足図2、方法）。全体として、私たちはすべてのカテゴリーで高い一致を見いだした。例えば、種レベルでは、種名の割り当てとMGに基づく配列との間の不一致が知られているにもかかわらず、97％を超えるメタmOTUがそれらの分類学的アノテーションで完全に一致すると予想される（論文補足図2a）。（以下略）

wiki（簡潔に分かりやすく説明されています）

https://github.com/motu-tool/mOTUs_v2/wiki

@alessio_mila getting ready to present the mOTUs2 profiler in the poster session at #SACMDA3 pic.twitter.com/uMNRvEvxBn
— Zeller Group (@ZellerGroup) April 1, 2019

Our new paper on community and population profiling using metagenomics is out now! #mOTUs2 https://t.co/hTmSN0OLYw
— Daniel Mende (@pangenomics) March 4, 2019

#paper Microbial abundance, activity and population genomic profiling with mOTUs2. Track 7700 species using marker genes, including a few thousand unknown branches. With Zeller and Sunagawa groups. https://t.co/K3j7wSpv5K
— Jaime Huerta-Cepas (@jhcepas) March 12, 2019

Website: https://t.co/6uX9rZbTwk
Code: https://t.co/N9KohNpafQ
Paper: https://t.co/i3oV76pPuF
for #mOTUs2
— Shinichi Sunagawa (@SunShinichi) March 5, 2019

インストール

condaで仮想環境を作ってテストした。

依存

Python 3 (or higher)
the Burrow-Wheeler Aligner v0.7.15 or higher (bwa)
SAMtools v1.5 or higher (link)
metaSNV

conda create -n motus2 python=3.7
conda activate motus2
conda install -c bioconda -y bwa samtools metasnv

本体　Github

git clone https://github.com/motu-tool/mOTUs_v2.git
cd mOTUs_v2

#データベースのダウンロードも行われる（1GBほど）
python setup.py
python test.py
export PATH=`pwd`:$PATH

#追記　condaを使って導入するなら (テスト時は失敗した)
#bioconda (link)
conda install -c bioconda -y motus

> python test.py

$ python test.py

------------------------------------------------------------------------------

| TEST MOTUS TOOL |

------------------------------------------------------------------------------

1-- ran setup.py: done

2-- Tools and versions:

- python: correct

- bwa: correct

- samtools: correct

- metaSNV: correct

3-- Taxonomy profiling test:

- Run motus (-v 1, only error messages):

- end motus call

Check resulting profile: correct

パスを通す。

export PATH=`pwd`:$PATH

> motus

$ motus

Program: motus - a tool for marker gene-based OTU (mOTU) profiling

Version: 2.1.1

Reference: Milanese et al. Microbial abundance, activity and population genomic profiling with mOTUs2. Nature Communications (2019). doi: 10.1038/s41467-019-08844-4

Usage: motus <command> [options]

Command:

-- Taxonomic profiling

profile Perform a taxonomic profiling (map_tax + calc_mgc + calc_motu)

merge Merge different taxonomic profiles to create a table

map_tax Map reads to the marker gene database

calc_mgc Aggregate reads from the same marker gene cluster (mgc)

calc_motu From a mgc abundance table, produce the mOTU abundance table

-- SNV calling

map_snv Map reads to create bam/sam file for snv calling

snv_call SNV calling using metaSNV

Type motus <command> to print the help for a specific command

>motus profile

$ motus profile

Usage: motus profile [options]

Input options:

-f FILE[,FILE] input file(s) for reads in forward orientation, fastq formatted

-r FILE[,FILE] input file(s) for reads in reverse orientation, fastq formatted

-s FILE[,FILE] input file(s) for reads without mate, fastq formatted

-n STR sample name

-i FILE[,FILE] provide sam or bam input file(s)

-m FILE provide a mgc reads count file

Output options:

-o FILE output file name [stdout]

-I FILE save the result of bwa in bam format (intermediate step) [None]

-M FILE save the mgc reads count (intermediate step) [None]

-e profile only reference species (ref_mOTUs)

-c print result as counts instead of relative abundances

-p print NCBI id

-u print the full name of the species

-q print the full rank taxonomy

-B print result in BIOM format

-C STR print result in CAMI format (BioBoxes format 0.9.1)

Values: [precision, recall, parenthesis]

-k STR taxonomic level [mOTU]

Values: [kingdom, phylum, class, order, family, genus, mOTU]

Algorithm options:

-g INT number of marker genes cutoff: 1=higher recall, 10=higher precision [3]

-l INT min. length of alignment for the reads (number of nucleotides) [75]

-t INT number of threads [1]

-v INT verbose level: 1=error, 2=warning, 3=message, 4+=debugging [3]

-y STR type of read counts [insert.scaled_counts]

Values: [base.coverage, insert.raw_counts, insert.scaled_counts]

> motus merge

$ motus merge

Usage: motus merge [options]

Input options:

-i STR[,STR] list of files (comma separated)

-d DIR merge all the files in the directory

Output options:

-o FILE output file name [stdout]

-B print result in BIOM format

Algorithm options:

-v INT verbose level: 1=error, 2=warning, 3=message, 4+=debugging [3]

> motus map_tax

$ motus map_tax

Usage: motus map_tax [options]

Input options:

-f FILE[,FILE] input file(s) for reads in forward orientation, fastq formatted

-r FILE[,FILE] input file(s) for reads in reverse orientation, fastq formatted

-s FILE[,FILE] input file(s) for reads without mate, fastq formatted

Output options:

-o FILE output file name [stdout]

-b save the result of bwa in bam format

Algorithm options:

-l INT min. length of alignment for the reads (number of nucleotides) [75]

-t INT number of threads [1]

-v INT verbose level: 1=error, 2=warning, 3=message, 4+=debugging [3]

> motus calc_mgc

$ motus calc_mgc

Usage: motus calc_mgc [options]

Input options:

-n STR sample name

-i FILE[,FILE] provide a sam or bam input file (or list of files)

Output options:

-o FILE output file name [stdout]

Algorithm options:

-l INT min. length of alignment for the reads (number of nucleotides) [75]

-v INT verbose level: 1=error, 2=warning, 3=message, 4+=debugging [3]

-y STR type of read counts [insert.scaled_counts]

Values: [base.coverage, insert.raw_counts, insert.scaled_counts]

> motus calc_motu

$ motus calc_motu

Usage: motus calc_motu [options]

Input options:

-n STR sample name

-i FILE provide the mgc abundance table

Output options:

-o FILE output file name [stdout]

-e profile only reference species (ref_mOTUs)

-B print result in BIOM format

-C STR print result in CAMI format (BioBoxes format 0.9.1)

Values: [precision, recall, parenthesis]

-c print result as counts instead of relative abundances

-p print NCBI id

-u print the full name of the species

-q print the full rank taxonomy

-k STR taxonomic level [mOTU]

Values: [kingdom, phylum, class, order, family, genus, mOTU]

Algorithm options:

-g INT number of marker genes cutoff: 1=higher recall, 10=higher precision [3]

-v INT verbose level: 1=error, 2=warning, 3=message, 4+=debugging [3]

> motus map_snv

$ motus map_snv

Usage: motus map_snv [options]

Input options:

-f FILE[,FILE] input file(s) for reads in forward orientation, fastq formatted

-r FILE[,FILE] input file(s) for reads in reverse orientation, fastq formatted

-s FILE[,FILE] input file(s) for reads without mate, fastq formatted

Output options:

-o FILE output bam file name [stdout]

Algorithm options:

-l INT min. length of alignment for the reads (number of nucleotides) [75]

-t INT number of threads [1]

-v INT verbose level: 1=error, 2=warning, 3=message, 4+=debugging [3]

> motus snv_call

$ motus snv_call

Usage: motus snv_call -d Directory -o Directory [options]

Input options:

-d DIR Call metaSNV on all bam files in the directory. [Mandatory]

-fb FLOAT Coverage breadth: minimal horizontal genome coverage percentage per sample per species. Default=80.0

-fd FLOAT Coverage depth: minimal average vertical genome coverage per sample per species. Default=5.0

-fm INT Minimum number of samples per species. Default=2

-fp FLOAT FILTERING STEP II: Required proportion of informative samples (coverage non-zero) per position. Default=0.50

-fc FLOAT FILTERING STEP II: Minimum coverage per position per sample per species. Default=5.0

-t INT Number of threads. Default=1

Output options:

-o DIR Output directory. Will fail if already exists. [Mandatory]

-K Keep all the directories produced by metaSNV. Default is to remove cov, distances, filtered, snpCaller

Algorithm options:

-v INT Verbose level: 1=error, 2=warning, 3=message, 4+=debugging. Default=3

snv_callを使うにはmetaSNVも必要。詳細はwiki参照。

実行方法

１、taxonomyプロファイリング

ペアエンドfastq（gzip圧縮にも対応）。

motus profile -f pair_1.fq -r pair_2.fq -t 8 -n sample1 > taxonomy_profile.txt

-f input file(s) for reads in forward orientation, fastq formatted
-r input file(s) for reads in reverse orientation, fastq formatted
-n sample name
-t number of threads [1]
-l min. length of alignment for the reads (number of nucleotides) [75]

シングルエンドfastq

motus profile -s single_reads.fq -t 8 -n sample1 > taxonomy_profile.txt

-s input file(s) for reads without mate, fastq formatted

複数 fastqある場合（同じサンプル）

motus profile -t 8 -n sample1 \
 -f for_r_lane1.fq,for_r_lane2.fq \
 -r rev_r_lane1.fq,rev_r_lane2.fq \
 > taxonomy_profile.txt

または各ステップを個別に行う。

motus map_tax -s metagenomic_sample.fastq -o mapped_reads.sam
motus calc_mgc -i mapped_reads.sam -o mgc_ab_table.count
motus calc_motu -i mgc_ab_table.count > taxonomy_profile.txt
rm mapped_reads.sam mgc_ab_table.count

出力

bwaを使ってmOTUsデータベースに対してmappingされる。長さ"-l 100"でフィルタリングされて結果は出力される。bamをexportするならiオプションを使い"-I mapped.bam"のように指定する。

f:id:kazumaxneo:20190425115858j:plain

各speciesのrelative abundanceが出力される。2019年4月現在7730行ある。出力についてはwiki参照（link）。

0以外を確認（mergeしたファイルの場合、1サンプルでも0があれば消すので注意）

awk '!/0.0000000000/' taxonomy_profile.txt

出力変更

"-c"でrelative abundanceからリードカウントに変更

motus profile -f pair_1.fq -r pair_2.fq -t 8 -c > taxonomy_profile.txt

-c print result as counts instead of relative abundances

標準ではspecies出力だが、-kで階級を指定できる。

motus profile -f pair_1.fq -r pair_2.fq -t 8 -k class  > taxonomy_profile.txt

-k taxonomic level [mOTU] Values: [kingdom, phylum, class, order, family, genus, mOTU]

"-p"でNCBI taxonomy idsを出力する。

motus profile -f pair_1.fq -r pair_2.fq -t 8 -p > taxonomy_profile.txt

-p print NCBI id

"-q"でfull rank taxonomyを出力する。-kと併用すれば出力する下限の階級を指定できる。

motus profile -f pair_1.fq -r pair_2.fq -t 8 -k class -q > taxonomy_profile.txt

-k taxonomic level [mOTU] Values: [kingdom, phylum, class, order, family, genus, mOTU]
-q print the full rank taxonomy

"-B"で Biom format v.1.0で出力する（BIOM詳細）。

motus profile -f pair_1.fq -r pair_2.fq -t 8 -B > taxonomy_profile.biom

-B print result in BIOM format.

CAMI (BioBoxes) formatで出力する（wiki）。

motus profile -f pair_1.fq -r pair_2.fq -t 8 -C precision > taxonomy_profile

-C <value> print result in CAMI format (BioBoxes format 0.9.1). Values: [precision, recall, parenthesis]

他にも full species nameを出す"-u"や、メタゲノムのmeta_mOTUsに-1を足してref_mOUTsだけ出力する"-e"があります。wikiで確認して下さい。

２、複数サンプル結果のマージ

まず個別にmotus profileを実行する。

motus profile -f sample1_1.fq -r sample1_2.fq -t 8 -n sample1 > sample1.txt
motus profile -f sample2_1.fq -r sample2_2.fq -t 8 -n sample2 > sample2.txt
motus profile -f sample3_1.fq -r sample3_2.fq -t 8 -n sample3 > sample3.txt

マージする。

motus merge -i sample1.txt,sample2.txt,sample3.txt >  all_sample_profiles.txt

-o output file name [stdout]

マージしてBIOM format（JSON-derived format）出力。

motus merge -i sample1.txt,sample2.txt,sample3.txt -B > all_sample_profiles.biom

-B print result in BIOM format

引用

Microbial abundance, activity and population genomic profiling with mOTUs2

Alessio Milanese, Daniel R Mende, Lucas Paoli, Guillem Salazar, Hans-Joachim Ruscheweyh, Miguelangel Cuenca, Pascal Hingamp, Renato Alves, Paul I Costea, Luis Pedro Coelho, Thomas S. B. Schmidt, Alexandre Almeida, Alex L Mitchell, Robert D. Finn, Jaime Huerta-Cepas, Peer Bork, Georg Zeller, Shinichi Sunagawa

Nature Commun. 2019; 10: 1014

Metagenomic species profiling using universal phylogenetic marker genes
Shinichi Sunagawa, Daniel R Mende, Georg Zeller, Fernando Izquierdo-Carrasco, Simon A Berger, Jens Roat Kultima, Luis Pedro Coelho, Manimozhiyan Arumugam, Julien Tap, Henrik Bjørn Nielsen, Simon Rasmussen, Søren Brunak, Oluf Pedersen, Francisco Guarner, Willem M de Vos, Jun Wang, Junhua Li, Joël Doré, S Dusko Ehrlich, Alexandros Stamatakis & Peer Bork
Nature Methods volume 10, pages 1196–1199 (2013)

関連

2021 8/16

Curr Protoc. 2021 Aug;1(8):e218. doi: 10.1002/cpz1.218.

"mOTUの主なルーチンである、メタゲノム・メタトランスクリプトームコミュニティプロファイリング（基本プロトコル1）とメタゲノムSNVプロファイリング（基本プロトコル2）の2つの基本プロトコルを説明する。さらに、4つのサポートプロトコルを提供する。サポートプロトコル1では、mOTUのインストール方法を説明している。サポートプロトコル2および3では、プロファイリングパイプラインの各ステップをカスタマイズする方法を説明している。これにより、感度の変更や、コミュニティで受け入れられているさまざまな形式で結果を報告するなど、豊富なオプションを導入することができる。サポートプロトコル4では、パラメータ設定の違いがメタゲノムSNVプロファイリング機能に与える影響について説明している。"

https://currentprotocols.onlinelibrary.wiley.com/doi/10.1002/cpz1.218

2019-04-25

教育にも利用できる、何百ものバイオインフォマティクスツールが入った包括的コンテナ環境 ORCA

2019 Bioinformatics docker インフォマティクス解析をサポートするツール education windowsツール benchmark conda

2019 4/25 誤字修正

　効果的なバイオインフォマティクス分析のための適切なプラットフォームを設定することは困難な場合がある。標準のソフトウェアパッケージをインストールするために必要な依存関係とバージョン要件を決定することは、分析を始める前の障壁である。バイオインフォマティクスツールをインストールするには、まず依存関係をインストールする必要がある。依存関係自体にも依存関係がある。再現可能なデータ分析を促進するには、インストールが簡単なソフトウェアが不可欠である（Leprevost et al、2014）。 Dockerは、複雑なパイプラインとワークフローを含むコンピューティング環境を構成し、それを複数のプラットフォームに分散する機能を提供する（Menegidio et al、2017）。最近のいくつかのプロジェクトは、研究のためのバイオインフォマティクスコンテナ環境を提供している。たとえば、BioContainersはBioCondaリポジトリからバイオインフォマティクスソフトウェアへのアクセスを提供し（Grüninget al、2018）、各ソフトウェアパッケージ（Leprevost）のために隔離されたDockerコンテナを作成している。
　BioBoxesも同様の目的を果たす（Belmann et al、2015）。ただし、複数の分離されたコンテナを1つの分析パイプラインに結合することは初心者にとっては難しい場合があり、コンテナやデータ分析パイプラインツールにまだ精通していない初心者には適していない。目標がバイオインフォマティクスを教えることである場合、最初にコンテナを教える必要があるのは望ましくない。このような状況では、単一のコンテナにすべての必要なソフトウェアを提供する包括的なコンテナが適している。 BioLinux（http:// environmentomics.org/bio-linux/）は、物理的なUSBフラッシュドライブまたはDVDにインストールできる包括的なディスクイメージを提供し、関連プロジェクトのCloud BioLinux（Krampis et al。、2012）はクラウドコンピューティングサービスで使用するためのAmazonマシンイメージを提供している。 ORCA、Genomics Research Container Architectureは、Dockerイメージで教育や研究のための包括的なバイオインフォマティクスコマンドラインコンピューティング環境を提供する。これには、何百もの一般的なバイオインフォマティクスツールとその依存関係が含まれている。

　教育現場では、ORCAはDockerを使用してプライベートなコンテナ環境を提供するが、ユーザーがDockerと直接対話する必要はない。ユーザーは、セキュアなシェルユーティリティsshを使用してORCAサーバーにログインする。ユーザーのログインシェルは、ユーザーに対して透過的に、プライベートコンテナ内でインタラクティブシェルを実行するように設定されている。この構成では、ユーザーは自分でDockerを実行したり対話したりすることはない。ユーザーがログアウトしても、コンテナおよびコンテナ内のユーザーのプロセスは実行されたままなので、ユーザーは後でその同じコンテナに戻ってジョブのステータスと結果を確認できる。データは、ユーティリティscp、sftp、またはrsyncのいずれかを使用してコンテナとの間で転送される。 Integrative Genomics Viewer（IGV）などのグラフィカルアプリケーションは、sshを介してX11 プロトコルをトンネリングすることによって使用できる。マルチユーザーサーバーでのORCAの設定については、ORCAのドキュメント（https://github.com/bcgsc/orca#readme）に記載されている。ログインシェルスクリプトを使用して、各ユーザーに個々のコンテナ内にシェルを表示する。個々の設定では、Dockerをインストールし、次にdockerコマンド

$ docker run -it -v HOME:$HOME -w $HOME bcgsc/orca

を実行してORCAをユーザーのワークステーションまたはラップトップにインストールできる。ORCAイメージに含まれるすべてのソフトウェアへのアクセスを提供し、ユーザーのホームディレクトリをマウントしてデータを保存する。 Dockerイメージは、必要に応じてクラウドコンピューティングで使用できるが、ORCAはインストールまたはクラウドコンピューティングに物理メディアを必要としない。
ORCAコンテナイメージの構築は自動化されており、アーキテクチャの概要は論文図1に示されている。コンテナイメージを構築するためのDockerfileスクリプトはGitHubに格納されている。継続的インテグレーションサービスTravisCIは、GitHubへのコミットのたびに、Dockerfileの一般的な間違いや落とし穴を分析するために使用される。コンテナイメージは、Docker Hubの自動ビルド機能を使用して、GitHubに格納されているDockerfileからDocker Hubによってビルドされる。 GitHubで新しいリリースのORCAにタグを付けると、新しいDockerイメージが構築され、テストされ、Docker Hubでタグ付けされる。 ORCAイメージの安定版にはそれぞれバージョン番号が割り当てられているので、ユーザーは以前のバージョンのORCAをDocker Hubから取得して以前の分析を繰り返すことができる。 ORCAイメージの作成およびテスト手順を自動化することで、半日で新しいイメージを展開することができ、Dockerイメージ作成の再現性が保証される。 Dockerイメージは大きいが（17 GB）、1レーンのDNAシーケンスデータ（50 GB以上）より小さい。
　ORCAとともに配布されるソフトウェアパッケージの大部分は、Homebrew（https://brew.sh）パッケージマネージャを使用してインストールされる。 Homebrewは、Linux用のWindows Subsystemを使用して、Linux、macOS、またはWindows上のユーザーのホームディレクトリにソフトウェアをインストールでき、管理者権限を必要としない（Jackman and Birol、2016）。 Homebrewは多くのパッケージ用にプリコンパイルされたバイナリを提供する。これはソースから各ツールを構築する必要性を軽減する。バイナリパッケージはCircleCI上に構築され、各ツールが更新されるたびにBintrayに保存される。イメージがビルドされるときに各パッケージがソースからビルドされると、ORCA Dockerイメージのビルドには何日もかかる。
Homebrewで利用可能なバイオインフォマティクスパッケージはORCAにプレインストールされている。ユーザーは、ORCAにデフォルトで含まれていない他のパッケージをインストールすることができる。 Debianパッケージ管理ツールapt-getとPerl（cpan）、Python（pip）そしてRのための言語特有のパッケージ管理システムが追加のソフトウェアパッケージをインストールするために使われるかもしれない。興味を持った熱心なユーザーは、Brewsci/bio（https://github.com/ brewsci / homebrew-bio）に新しいバイオインフォマティクスツールや既存のツールの新しいバージョンを寄付することができる。これはLinuxとmacOSの両方のためのバイナリパッケージを構築する。これらの貢献したツールはORCAの次のリリースに含まれる予定である。

f:id:kazumaxneo:20190422180445j:plain

The architecture of ORCA.　論文より転載。

ORCA
genOmics Research Container Architecture

http://www.bcgsc.ca/services/orca

Github

Dockerfileの他にSingularityのレシピも用意されています。

現在導入されているツールのstatus一覧

https://bcgsc.github.io/orca/

600以上ある !

リストについてはこちらも参照

https://github.com/bcgsc/orca/blob/master/versions.tsv

使い方

ローカルマシンにpullして使う流れを簡単に紹介する。

A、macを使っている場合

dockerさえインストールしておけば使える。macへのdockerの導入はこちらを参考にしてください。

１、ターミナル（コンソール）を起動し、これから扱うデータがある作業ディレクトリ、またはデータをサブディレクトリとして含むデータのルートディレクトリ（データの一番上の階層）に移動する。

２、コンテナイメージをランする。

#このrunコマンドを打つと、localになければダウンロードしてから立ち上がる。10GBくらいあるので注意する。
#ホストの現在位置をorcaの$HOMEにマウントしてコンテナを立ち上げる。
docker run --rm -it -v$PWD:$HOME -w$HOME bcgsc/orca

#またはホストの$HOMEをorcaの$HOMEにマウントしてコンテナを立ち上げる（*1）
docker run --rm -it -v$HOME:$HOME -w$HOME bcgsc/orca

#またはホスト側からジョブを投げる。ちょっとだけ使うならこれが便利
#samtoolsを使ってbamをsamに変換する。
docker run --rm -it -v$PWD:$HOME -w$HOME bcgsc/orca samtools view -h -@ 8 mapped.bam > mapped.bam

root権限がないなら、上のコマンドの先頭部分はsudo dockerに変更（もちろんpasswordが必要）。-w $HOMEでワーキングディレクトリ（login後のパス）を指定している。-wでカレントを指定しているので、外からジョブを投げる場合も、現在のパスにファイルがあるならフルパスでファイル指定していなくても動作する。exit後にコンテナを消すため"--rm"もつけている。

３、すぐに立ち上がるのであとは使うだけ。説明は以上になる。

もう少しだけ説明しておく。

PATHの通ったディレクトリを確認してみる。

> echo $PATH

$ echo $PATH

/home/linuxbrew/.linuxbrew/bin:/home/linuxbrew/.linuxbrew/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin

/home/linuxbrew/.linuxbrew/bin/
/home/linuxbrew/.linuxbrew/sbin/
/usr/local/sbin/
/usr/local/bin/
/usr/sbin/
/usr/bin/
/sbin/
/bin/

になる。Dockerfileに記述されている通り、docker build時にlinuxbrewで自動インストールしてイメージが作られている。よって/home/linuxbrew/.linuxbrew/bin/にツールのバイナリはあるはず。/home/linuxbrew/.linuxbrew/bin/を確認する。

> ls /home/linuxbrew/.linuxbrew/bin/

$ ls /home/linuxbrew/.linuxbrew/bin/

2to3-3.7 cmake gmap_uncompress multigem sff-dump.2.9.2

ABYSS cmalign gmapindex multiruby sga

ABYSS-P cmbuild gmapl multiruby_setup sga-align

ALE cmcalibrate gmapl.sse42 mummer sga-asqg2dot.pl

AMOScmp cmconvert gmapper mummerplot sga-astat.py

AMOScmp-shortReads cmemit gmapper-cs muscle sga-bam2de.pl

AMOScmp-shortReads-alignmentTrimmed cmetindex gmapper-ls mutDiff sga-beetl-index.pl

以下略

多すぎるので省略したが、/home/linuxbrew/.linuxbrew/bin/にインフォマィクスツール（バイナリ等）が収納されている。

B、windowsを使っている場合

方法は複数存在します。windows10 ProならHyper-Vを有効にしてDocker for windowsを入れれば良いと思いますが、windowsにおけるlinuxコマンドの扱いはまだ流動的で、今後も変更される可能性があり不正確なことは書けないので、ここでは導入手順は省略します。Qiitaなどで最新の情報を確認してください（*2）。

#c:/Usersを/dataにマウントして立ち上げる
docker run --rm -itv c:/Users:/data bcgsc/orca

f:id:kazumaxneo:20190424105701p:plain

windowsでもちゃんと使えます。linuxとdual bootしたり、いきなりクラウドで次世代解析するのはしんどいという方は試してみてはいかがでしょうか？

memo

samtoolsを使ってみる。

> samtools

$ samtools

Program: samtools (Tools for alignments in the SAM format)

Version: 1.9 (using htslib 1.9)

Usage: samtools <command> [options]

以下略

2019年4月現在v1.9が入っている。

パスを確認する。

$ which samtools

/home/linuxbrew/.linuxbrew/bin/samtools

linuxbrew@2a9b84e428e1:/Us

先ほどの/home/linuxbrew/.linuxbrew/bin/になっている。

いくつかツールを見てみる。

>fastqc -v

$ fastqc -v

FastQC v0.11.7

> fastp -v

$ fastp -v

fastp 0.19.3

> blastn

$ blastn

BLAST query/options error: Either a BLAST database or subject sequence(s) must be specified

Please refer to the BLAST+ user manual.

> blat

$ blat

blat - Standalone BLAT v. 36 fast sequence search command line tool

usage:

> nucmer -v

$ nucmer -v

nucmer

NUCmer (NUCleotide MUMmer) version 3.1

> lastal -V

$ lastal -V

lastal 926

> bamtools

$ bamtools

usage: bamtools [--help] COMMAND [ARGS]

以下略

> gtf2bed

$ gtf2bed

Error: No input is specified; please redirect or pipe in formatted data

convert2bed

> sambamba

$ sambamba

sambamba 0.6.6

Usage: sambamba [command] [args...]

以下略

> bwa

$ bwa

Program: bwa (alignment via Burrows-Wheeler transformation)

Version: 0.7.17-r1188

以下略

> vcftools

$ vcftools

VCFtools (0.1.16)

以下略

> bedtools

$ bedtools

bedtools is a powerful toolset for genome arithmetic.

Version: v2.27.1

以下略

> bcftools -v

$ bcftools -v

bcftools 1.9

Using htslib 1.9

以下略

> minimap2

$ minimap2

Usage: minimap2 [options] <target.fa>|<target.idx> [query.fa] [...]

Options:

以下略

> bbmap.sh --version

$ bbmap.sh --version

java -Djava.library.path=/home/linuxbrew/.linuxbrew/Cellar/bbtools/37.77/jni/ -ea -Xmx35786m -cp /home/linuxbrew/.linuxbrew/Cellar/bbtools/37.77/current/ align2.BBMap build=1 overwrite=true fastareadlen=500 --version

BBMap version 37.77

以下略

> gatk

$ gatk

Usage template for all tools (uses --spark-runner LOCAL when used with a Spark tool)

以下略

> snpEff -version

$ snpEff -version

SnpEff 4.3t 2017-11-24

> lumpy

$ lumpy

Program: ********** (v 0.2.13)

Author: Ryan Layer (rl6sf@virginia.edu)

Summary: Find structural variations in various signals.

以下略

> delly

$ delly

**********************************************************************

Program: Delly

以下略

> fermi2

Program: fermi2

Version: r178

Contact: http://hengli.uservoice.com/

以下略

> supermatcher

$ supermatcher

Calculate approximate local pair-wise alignments of larger sequences

Input

> abyss-pe -h

$ abyss-pe -h

Usage: make [options] [target] ...

Options:

-b, -m Ignored for compatibility.

以下略

> SOAPdenovo-63mer

$ SOAPdenovo-63mer

Version 2.04: released on July 13th, 2012

Compile Jul 29 2016 22:54:45

以下略

> spades.py -v

$ spades.py -v

SPAdes v3.12.0

lin

> a5_pipeline.pl

$ a5_pipeline.pl

A5-miseq version 20160825

Usage: a5_pipeline.pl [--begin=1-5] [--end=1-5] [--preprocessed] [--threads=4] [--debug] [--metagenome] <lib_file> <out_base>

以下略

> flye --version

r$ flye --version

2.3.5-release

以下略

> canu --version

$ canu --version

Canu 1.7.1

以下略

> quast

$ quast

QUAST: Quality Assessment Tool for Genome Assemblies

Version: 5.0.0, de6973bb

以下略

> vt

$ vt

Help page on http://statgen.sph.umich.edu/wiki/Vt

Useful tools:

以下略

> freebayes

$ freebayes

usage: freebayes -f [REFERENCE] [OPTIONS] [BAM FILES] >[OUTPUT]

Bayesian haplotype-based polymorphism discovery.

> varscan -h

$ varscan -h

Command not recognized

VarScan v2.4.3

以下略

> nanopolish

$ nanopolish

error: no command provided

usage: nanopolish [command] [options]

以下略

> pilon

$ pilon

Pilon version 1.22 Wed Mar 15 16:38:30 2017 -0400

Usage: pilon --genome genome.fasta [--frags frags.bam] [--jumps jumps.bam] [--unpaired unpaired.bam]

以下略

> jellyfish --version

$ jellyfish --version

jellyfish 2.2.10

> kmc

$ kmc

K-Mer Counter (KMC) ver. 3.1.0 (2018-05-10)

Usage:

以下略

> ntcard --version

$ ntcard --version

ntcard Version 1.0.0

Written by Hamid Mohamadi.

> kmergenie

$ kmergenie

KmerGenie

Usage:

kmergenie <read_file> [options]

以下略

> bowtie2 --version

r$ bowtie2 --version

/home/linuxbrew/.linuxbrew/bin/../Cellar/bowtie2/2.3.4.2/bin/bowtie2-align-s version 2.3.4.2

64-bit

以下略

> hisat2 --version

$ hisat2 --version

/home/linuxbrew/.linuxbrew/bin/../Cellar/hisat2/2.1.0/bin/hisat2-align-s version 2.1.0

64-bit

Built on c15103119586

以下略

> kallisto version

$ kallisto version

kallisto, version 0.44.0

> salmon

$ salmon

Salmon v0.9.1

Usage: salmon -h|--help or

以下略

> tophat -v

$ tophat -v

TopHat v2.1.1

> oases

$ oases

oases - De novo transcriptome assembler for the Velvet package

Version 0.2.08

> metaphlan -v

$ metaphlan -v

MetaPhlAn version 1.7.7 (27 January 2014)

> fasterq-dump

$ fasterq-dump

Usage:

fasterq-dump <path> [options]

以下略

> cd-hit

$ cd-hit

====== CD-HIT version 4.7 (built on Apr 19 2018) ======

> busco -v

$ busco -v

BUSCO 3.0.2

> barrnap --version

$ barrnap --version

barrnap 0.9

> tRNAscan-SE

$ tRNAscan-SE

tRNAscan-SE 1.3.1 (January 2012)

> abricate

$ abricate

SYNOPSIS

Find and collate amplicons in assembled contigs

以下略

> rnammer -v

$ rnammer -v

This rnammer 1.2, running from '/home/linuxbrew/.linuxbrew/opt/rnammer'

> hmmscan

$ hmmscan

Incorrect number of command line arguments.

Usage: hmmscan [-options] <hmmdb> <seqfile>

以下略

> TransDecoder.LongOrfs -v

$ TransDecoder.LongOrfs -v

########################################################################################

# ______ ___ __

# /_ __/______ ____ ___ / _ \___ _______ ___/ /__ ____

# / / / __/ _ `/ _\(_-</ // / -_) __/ _ \/ _ / -_) __/

# /_/ /_/ \_,_/_//_/___/____/\__/\__/\___/\_,_/\__/_/ .LongOrfs

########################################################################################

以下略

> prokka -v

$ prokka -v

prokka 1.13

> augustus

$ augustus

AUGUSTUS (3.3.1) is a gene prediction tool

written by M. Stanke, O. Keller, S. König, L. Gerischer and L. Romoth.

以下略

> art_illumina

$ art_illumina

====================ART====================

ART_Illumina (2008-2016)

Q Version 2.5.8 (June 6, 2016)

Contact: Weichun Huang <whduke@gmail.com>

-------------------------------------------

以下略

> wgsim

$ wgsim

Program: wgsim (short read simulator)

Version: 1.9

以下略

> embossversion

$ embossversion

Report the current EMBOSS version number

> kraken

$ kraken

Need to specify input filenames!

Usage: kraken [options] <filename(s)>

以下略

> mash

$ mash

Mash version 2.0

以下略

> kaiju

$ kaiju

Error: Please specify the location of the nodes.dmp file, using the -t option.

Kaiju 1.5.0

以下略

> centrifuge --version

$ centrifuge --version

/home/linuxbrew/.linuxbrew/bin/../Cellar/centrifuge/1.0.3/bin/centrifuge-class version 1.0.3-beta

64-bit

Built on 776abb41706e

以下略

> orthofinder.py

$ orthofinder.py

> racon

$ racon

[racon::] error: missing input file(s)!

usage: racon [options ...] <sequences> <overlaps> <target sequences>

以下略

> fastANI

$ fastANI

Required option missing: '-o, --output'

linu

> python2 --version

$ python2 --version

Python 2.7.15

> python3 --version

$ python3 --version

Python 3.7.0 (default, Aug 24 2018, 00:54:23)

[GCC 5.4.0 20160609]

> R --version

$ R --version

R version 3.5.1 (2018-07-02) -- "Feather Spray"

Platform: x86_64-pc-linux-gnu (64-bit)

以下略

> java -version

$ java -version

java version "9.0.4"

Java(TM) SE Runtime Environment (build 9.0.4+11)

Java HotSpot(TM) 64-Bit Server VM (build 9.0.4+11, mixed mode)

環境によるランタイムの違い（簡単なベンチマーク）

mac pro2012 、core i7 8700のwindows10 proマシン、xeon E5 2680v4のubuntu18.04マシン間で、E.coliのバリアントコールとde novo assembly関連ジョブのランタイムにどのくらい違いが出るのか調べてみる。

１、準備。コンテナを立ち上げ、SRAに上がっているE.coliのWGSデータを１つダウンロードする。リファレンスはNCBIからfetchする。

#mac
mkdir ~/test && cd test/
docker run --rm -itv $PWD:/data -w /data bcgsc/orca

#bash on windows
mkdir ~/test && cd ~/test/
#ここではユーザー名kazuのkazu/dataに移動。
docker run --rm -itv c:/Users/kazu/test:/data -w /data bcgsc/orca


#variant call
# reference genome download URL
# genome https://www.ncbi.nlm.nih.gov/nuccore/NC_000913.3

#download small dataaset
fasterq-dump SRR6281664 -e 8

リファレンス名はEscherichia_coli_K-12_substr_MG1655.fastaに修正した。この３つのファイルがあるはず。

f:id:kazumaxneo:20190424151228j:plain

２、ヒアドキュメントで以下のテストスクリプトを作成

cat >run.sh <<EOF
#!/bin/sh
#variant call
ref='Escherichia_coli_K-12_substr_MG1655.fasta'
read1='SRR6281664_1.fastq'
read2='SRR6281664_2.fastq'
thread='8'

mkdir QT_reads
#preprocessing
fastp -i \$read1 -I \$read2 -3 -o QT_reads/cleaned_1.fq.gz -O QT_reads/cleaned_2.fq.gz \\
 -h QT_reads/fastp_report.html -j QT_reads/fastp_report.json \\
 -q 15 -n 20 -w \$thread
cread1='QT_reads/cleaned_1.fq.gz'
cread2='QT_reads/cleaned_2.fq.gz'

#mapping(*3)
mkdir temp
cp \$ref temp/
bwa index -a is temp/\$ref > /dev/null
bwa mem -t \$thread temp/\$ref \$cread1 \$cread2 | samtools sort -@ \$thread -O BAM - > mapped.bam && samtools index -@ \$thread mapped.bam
sambamba mpileup -t \$thread mapped.bam --samtools -u -g -f temp/\$ref  | bcftools call -v -m -O z -o mpileup.vcf.gz
bcftools stats -F temp/\$ref -s - mpileup.vcf.gz > mpileup.vcf.gz.stats
plot-vcfstats -p report/ mpileup.vcf.gz.stats

#de novo assembly
mkdir merged
bbmerge.sh in1=\$read1 in2=\$read2 out=merged/merged.fq ihist=merged/ihist.txt

#evaluation of assembly
spades.py -1 \$cread1 -2 \$cread2 --merged merged/merged.fq --careful -t \$thread -o spades
quast spades/scaffolds.fasta -R \$ref -1 \$cread1 -2 \$cread2 -o report/quast_test_output -t \$thread
exit
EOF

run.shができる。

３、ベンチマークする。

#run benchmark
time (sh run.sh)

timeのrealの値を表にまとめた（n=1）。

f:id:kazumaxneo:20190425115731j:plain

thread数を8と16で１回だけテストしたがどちらの条件でもxeonE5 v4のマシンが一番早く終わった。core i7マシンも10年前のxeon搭載mac proと比べると0.6倍以下の時間で処理できた。

感想

　調べた限りGeneralなツールは割となんでも入っています。専門的な解析向けのツールは少なめですが、配布サイズを鑑みて現実的なバランスで調整されていると思います。とは言っても環境によっては10GB以上あるイメージサイズはヘビーです。解析に供する直前ではなく、前もってpullしておくのがよいと思います。

　論文中で言及されていますが、講習会で使うならリソースが潤沢なサーバーにポートを開けて立てて、講習者がsshでリモートログインして使うのが楽でしょうね。講習会後も必要なら後からpullすれば良いわけで。

引用

ORCA: A Comprehensive Bioinformatics Container Environment for Education and Research
Shaun D Jackman Tatyana Mozgacheva Susie Chen Brendan O’Huiginn Lance Bailey Inanc Birol Steven J M Jones
Bioinformatics, Published: 20 April 2019

関連

-v $HOME:$HOME"と指定しているので、ホストの$HOMEが仮想環境の$HOMEにマウントされる（＝>シェアディレクトリになる）。私の場合、ホストの$HOME (私の場合/Users/user)がORCAの$HOME（/home/linuxbrew）にマウントされる。"-w $HOME"指定も同時にしているので、log in後の初期位置は$HOMEになる。

私はこの記事を参考にして10 proにdockerを入れました。

↓もう少し噛み砕いて説明したページ

variant callのstatisticsは以下のチュートリアルを参考にしました。

https://genomics.sschmeier.com/ngs-variantcalling/index.html

2019-04-24

long readのRNA seq向けマッピング評価ツール AlignQC

Nanopore long read Iso-Seq Pacbio F1000Research Hybrid-Seq 結果の視覚化 (visualization) evaluation tool

　Pacific Bioscience（PacBio）が20111年に1分子リアルタイム（SMRT）シーケンス技術を商品化し、第3世代シーケンシング（TGS）が登場した。TGSプラットフォームには大きな技術的違いがある。これは、第2世代シーケンス（SGS）とは異なる。ペアエンド情報を考慮すると、メインのSGSプラットフォームであるイルミナは、各DNAフラグメントから50〜600 bpの情報を提供する。最長SGSリードを生成する454シーケンス（〜700bp）を含め、> 1000bpを提供するSGSプラットフォームはない。したがって、シーケンシング長が短いため、SGSの適用は遺伝子アイソフォームの再構築などの大規模または複雑なゲノムイベントに限定される。 TGSはリード長の長さによってこれらの困難な問題を解決する。

　最も広く使用されているTGSプラットフォーム[PacBioおよびOxford Nanopore Technologies（ONT）]は、１DNA分子から非常に長いヌクレオチド配列を直接捕捉するための新しい生化学/生物物理学的方法を開発した。他の新たなTGSプラットフォーム（Moleculoおよび10X Genomics）は、同じDNA分子からのショートリードのアセンブリによる合成ロングリード（SLR）を生成することに基づいている。ここでは、PacBioとONTのデータ機能とそれらのトランスクリプトーム解析への応用に焦点を当てる。

　PacBioは1DNA分子を捕獲し、Illuminaは増幅されたDNAフラグメントのクローン集団から増強されたシグナルを検出することを除いて、イルミナのシークエンシングと同様の合成によるシークエンシング戦略を採用する。raw PacBioデータのエラー率は13〜15％である。これは、1DNA分子からのS / N比が高くないためである（ref.3)。正確性を高めるために、PacBioプラットフォームはヘアピンアダプターを標的二本鎖DNAの両端に連結することにより環状DNAテンプレートを使う。ポリメラーゼが環状分子を繰り返し横断しそして複製するにつれて、DNA鋳型は連続的な長いリード（CLR）を生成するために複数回シーケンシングされる。 CLRは、アダプター配列を除去することによって複数のリード（「サブリード」）に分割でき、複数のサブリードはより高い精度で循環コンセンサス配列（「CCS」）リードを生成する。 CLRの平均長は> 10kbから最大60kbである。これはポリメラーゼの寿命による( ref.3)。したがって、CCSリードの長さと精度はフラグメントサイズによって異なる。PacBioシークエンシングは、ゲノム（例えば、新規のアセンブリ、構造変異の検出およびハプロタイプ決定）およびトランスクリプトーム（例えば、遺伝子アイソフォームの再構築および新規遺伝子/アイソフォームの発見）の研究に利用されてきた。

　ヌクレオチド取り込みまたはハイブリダイゼーションを利用する他のシーケンシング技術と比較して、ONTは、特徴的な電流を測定することによって天然の一本鎖DNA（ssDNA）分子を直接シーケンシングする。塩基が分子モータータンパク質によってナノポアを貫通するにつれて変化する。ONT MinIONは、PacBio環状DNAテンプレートと同様のヘアピンライブラリー構造を使用する。DNAテンプレートとその相補鎖はヘアピンアダプターによって結合されている。したがって、DNAテンプレートはナノポアを通過し、続いてヘアピンが通過し、最後に補体が通過する。 rawリード、アダプターを取り外すことによって2つの「1D」リード（「テンプレート」と「補数」）に分割できる。 2回の「1D」リードのコンセンサスシーケンスは、より高い精度での「2D」リードである（ref.2）。 PacBioと同様のデータ機能を備えているため、多くの研究者がPacBioが適用されているアプリケーションでONTを利用したりテストしたりしている。

　PacBioとONTプラットフォームはリード長が長いという利点を共有しているが、SGSと比較してシーケンスエラー率が高く、スループットが低いという同じ欠点もある(ref.3、14–16)。高いシーケンシングエラー率は、転写物の正確なシーケンシング、スプライス部位の同定およびSNP callingのような1ヌクレオチド分析に課題を投げかける。低処理量は、遺伝子／アイソフォーム存在量の推定などの定量分析にとって障害となる。 PacBio CCSとONT 2Dのコンセンサス戦略はエラー率を減らすことができるが、対応するリード長は短くなり、スループットは低下する。したがって、TGSデータとSGSデータとを統合するハイブリッドシーケンシング（「Hybrid-Seq」）は、SGSデータの支援によるTGSデータの分析に関連する制限に対処するためのアプローチとして浮上している。たとえば、PacSioまたはONTのロングリードをSGSのショートリードでエラー訂正すると、ロングリードの精度とマッピング性が向上する（ref.17–19)。 Hybrid-Seqは、ゲノム構築とトランスクリプトームの特徴付けに適用でき、全体的な性能と分解能を向上させることができる（ref.11–13,17)。

　PacBioとONTのリード長の長さは、トランスクリプトーム研究、特に遺伝子アイソフォームの同定にとって非常に有益である。ヒトのトランスクリプトーム（ref.20–22)

に加えて、PacBioトランスクリプトシークエンシングプロトコルであるIso-Seqが、非モデル生物および特定の遺伝子/遺伝子ファミリーにおけるトランスクリプトームの複雑さを特徴付けるために広く使用されている（ref.23–31)。対照的に、ONTには標準的な転写産物シークエンシングプロトコルはなく、公に利用可能なパイロット試験はわずかしかない。 Minionを使って、Bolisetty et alはショウジョウバエの4つの遺伝子の非常に高いアイソフォーム多様性を発見した。そして、それは複雑な転写イベントを調査することにおけるONTの有用性を例示する（ref.32)。 Oikonomopoulos et alは Spike-In RNAを用いた92の転写産物の人工混合物を分析することにより、トランスクリプトームの定量におけるONTシーケンシングの安定性も実証した（ref,33）。 PacBioまたはONTを単独で使用するこれらの研究と比較して、Hybrid-Seqは、特にトランスクリプトーム全体の研究において、データサイズの要件を軽減し、出力を向上させることができる。例えば、一連のHybrid-Seq法（IDP、IDP-fusion、IDP-ASE）は、トランスクリプトーム研究をisoformレベル（例えば、遺伝子isoform再構築、融合遺伝子および対立遺伝子フェージング）までより高い感度および精度で改善するために開発された。（一部略）

　この論文では、hESCのcDNAからPacBioとONTのデータを生成した。我々（著者ら）のツールAlignQC（http://www.healthcare.uiowa.edu/labs/au/AlignQC/）を使用して、rawデータ（subreadsと1D “template” reads）、またはコンセンサス（CCSと2D reads）のPacBioとONTデータの包括的な分析と比較を行った。製造業者はサイズ選択を推奨しているので、PacBioシーケンシングをサイズ選択ライブラリで行った。サイズ選択はシーケンシング時には標準的ではなく、そしてONTについては行われなかったので、ONTライブラリーはサイズ選択しなかった。これらの技術は異なるライブラリー調製プロトコルに従うため、シーケンシングプラットフォーム自体が変動性を導入し得るのと同様に、これらの工程を変動性の潜在的な原因として考慮することが重要である。比較には、エラー率とエラーパターン、リード長、マッピング可能性と異常なアライメント、そして最新のシーケンスモデル（PacBio P6-C4とONT R9）と以前のバージョン（C2とR7）の間の技術改善が含まれる。また、PacBioとONT単独の機能を検証して比較し、スパイクイントランスクリプトのゴールドスタンダードセットを研究した。その後、アイソフォームの同定、定量、複雑なトランスクリプトームイベントの発見など、ヒトのトランスクリプトーム解析にロングリードオンリーおよび対応するHybrid-Seqアプローチを適用した。 2つの主要なTGSデータプラットフォームの特性の包括的な評価に加え、この論文は、PacBioとONTの応用とそれに対応するトランスクリプトーム解析のためのHybrid-Seqのガイドとして役立つだろう。

ここではAlignQCの使い方についてのみ簡単に紹介します。

このツールはlong read RNA seq のbam分析用に設計されていますが、short readのRNA seq、DNAのシーケンスのbamでも動作します。

インストール

ubuntu16.04のpython2.7環境でテストした（docker使用。ホストOS macos10.14）。

依存

Report Generation Requirements

R
python 2.7+

Report Viewing Requirements

A Modern Web Browser

本体　Github

#pipで導入可能
pip install AlignQC

#anaconda環境ならcondaで導入可能（テスト時は失敗した）
conda create -n alignqc -c vacation alignqc
source activate alignqc

> alignqc -h

# alignqc -h

usage: alignqc [-h] {analyze,dump,compare}

Version 2.0.5 Review reports about alignments.

positional arguments:

{analyze,dump,compare}

MODE of program to run

optional arguments:

-h, --help show this help message and exit

> alignqc analyze --help

# alignqc analyze --help

usage: alignqc [-h] (-g GENOME | --no_genome)

(-t GTF | --gpd GPD | --no_transcriptome) [-o OUTPUT]

[--portable_output PORTABLE_OUTPUT]

[--output_folder OUTPUT_FOLDER] [--threads THREADS]

[--tempdir TEMPDIR | --specific_tempdir SPECIFIC_TEMPDIR]

[--min_intron_size MIN_INTRON_SIZE]

[--min_aligned_bases MIN_ALIGNED_BASES]

[--max_query_overlap MAX_QUERY_OVERLAP]

[--max_target_overlap MAX_TARGET_OVERLAP]

[--max_query_gap MAX_QUERY_GAP]

[--max_target_gap MAX_TARGET_GAP]

[--required_fractional_improvement REQUIRED_FRACTIONAL_IMPROVEMENT]

[--do_loci] [--min_depth MIN_DEPTH]

[--min_coverage_at_depth MIN_COVERAGE_AT_DEPTH]

[--min_exon_count MIN_EXON_COUNT]

[--locus_downsample LOCUS_DOWNSAMPLE]

[--alignment_error_scale ALIGNMENT_ERROR_SCALE ALIGNMENT_ERROR_SCALE ALIGNMENT_ERROR_SCALE ALIGNMENT_ERROR_SCALE ALIGNMENT_ERROR_SCALE ALIGNMENT_ERROR_SCALE]

[--alignment_error_max_length ALIGNMENT_ERROR_MAX_LENGTH]

[--context_error_scale CONTEXT_ERROR_SCALE CONTEXT_ERROR_SCALE CONTEXT_ERROR_SCALE CONTEXT_ERROR_SCALE CONTEXT_ERROR_SCALE CONTEXT_ERROR_SCALE]

[--context_error_stopping_point CONTEXT_ERROR_STOPPING_POINT]

[--samples_per_xval SAMPLES_PER_XVAL]

[--max_bias_data MAX_BIAS_DATA] [--rscript_path RSCRIPT_PATH]

input

Create an output report

optional arguments:

-h, --help show this help message and exit

Input parameters:

Required BAM file. If reference or annotation is not set, use

--no_reference or --no_annotation respectively to continue.

input INPUT BAM file

-g GENOME, --genome GENOME

Reference Fasta (default: None)

--no_genome No Reference Fasta (default: False)

-t GTF, --gtf GTF Reference transcriptome in GTF format, assumes gene_id

and transcript_id are defining gene and transcript

names (default: None)

--gpd GPD Reference transcriptome in genePred format (default:

None)

--no_transcriptome No annotation is available (default: False)

Output parameters:

At least one output parameter must be set

-o OUTPUT, --output OUTPUT

OUTPUT xhtml with data (default: None)

--portable_output PORTABLE_OUTPUT

OUTPUT file in a small xhtml format (default: None)

--output_folder OUTPUT_FOLDER

OUTPUT folder of all data (default: None)

Performance parameters:

--threads THREADS INT number of threads to run. Default is system cpu

count (default: 1)

Temporary folder parameters:

--tempdir TEMPDIR The temporary directory is made and destroyed here.

(default: /tmp)

--specific_tempdir SPECIFIC_TEMPDIR

This temporary directory will be used, but will remain

after executing. (default: None)

Alignment plot parameters:

--min_intron_size MIN_INTRON_SIZE

minimum intron size when smoothing indels (default:

68)

--min_aligned_bases MIN_ALIGNED_BASES

for analysizing alignment, minimum bases to consider

(default: 50)

--max_query_overlap MAX_QUERY_OVERLAP

for testing gapped alignment advantage (default: 10)

--max_target_overlap MAX_TARGET_OVERLAP

for testing gapped alignment advantage (default: 10)

--max_query_gap MAX_QUERY_GAP

for testing gapped alignment advantge (default: None)

--max_target_gap MAX_TARGET_GAP

for testing gapped alignment advantage (default:

500000)

--required_fractional_improvement REQUIRED_FRACTIONAL_IMPROVEMENT

require gapped alignment to be this much better (in

alignment length) than single alignment to consider

it. (default: 0.2)

Locus parameters:

Optionally produce plots and data regarding clusters of sequences

--do_loci this analysis is time consuming at the moment

(default: False)

--min_depth MIN_DEPTH

require this or more read depth to consider locus

(default: 1.5)

--min_coverage_at_depth MIN_COVERAGE_AT_DEPTH

require at leas this much of the read be covered at

min_depth (default: 0.8)

--min_exon_count MIN_EXON_COUNT

Require at least this many exons in a read to consider

assignment to a locus (default: 2)

--locus_downsample LOCUS_DOWNSAMPLE

Limit how deep to search loci (default: 100)

Alignment error parameters:

--alignment_error_scale ALIGNMENT_ERROR_SCALE ALIGNMENT_ERROR_SCALE ALIGNMENT_ERROR_SCALE ALIGNMENT_ERROR_SCALE ALIGNMENT_ERROR_SCALE ALIGNMENT_ERROR_SCALE

<ins_min> <ins_max> <mismatch_min> <mismatch_max>

<del_min> <del_max> (default: None)

--alignment_error_max_length ALIGNMENT_ERROR_MAX_LENGTH

The maximum number of alignment bases to calculate

error from (default: 1000000)

Context error parameters:

--context_error_scale CONTEXT_ERROR_SCALE CONTEXT_ERROR_SCALE CONTEXT_ERROR_SCALE CONTEXT_ERROR_SCALE CONTEXT_ERROR_SCALE CONTEXT_ERROR_SCALE

<ins_min> <ins_max> <mismatch_min> <mismatch_max>

<del_min> <del_max> (default: None)

--context_error_stopping_point CONTEXT_ERROR_STOPPING_POINT

Sample at least this number of each context (default:

5000)

Rarefraction plot parameters:

--samples_per_xval SAMPLES_PER_XVAL

Bias parameters:

--max_bias_data MAX_BIAS_DATA

Bias does not need too large of a dataset. By default

data will be downsampled for large datasets. (default:

500000)

Path parameters:

--rscript_path RSCRIPT_PATH

The location of the Rscript executable. Default is

installed in path (default: Rscript)

> alignqc dump --help

# alignqc dump --help

usage: alignqc [-h] [-o OUTPUT] [-v] (-l | -e EXTRACT)

[--tempdir TEMPDIR | --specific_tempdir SPECIFIC_TEMPDIR]

xhtml_input

Extract data from xhtml output of alignqc through the command line

positional arguments:

xhtml_input INPUT XHTML FILE

optional arguments:

-h, --help show this help message and exit

-o OUTPUT, --output OUTPUT

OUTPUTFILE or STDOUT if not set (default: -)

-v, --verbose Show all stderr messages (default: False)

-l, --list show available data (default: False)

-e EXTRACT, --extract EXTRACT

dump this data (default: None)

--tempdir TEMPDIR The temporary directory is made and destroyed here.

(default: /tmp)

--specific_tempdir SPECIFIC_TEMPDIR

This temporary directory will be used, but will remain

after executing. (default: None)

> alignqc compare --help

# alignqc compare --help

usage: alignqc [-h] -o OUTPUT [--threads THREADS]

[--tempdir TEMPDIR | --specific_tempdir SPECIFIC_TEMPDIR]

xhtml_inputs [xhtml_inputs ...]

positional arguments:

xhtml_inputs xhtml analysis files

optional arguments:

-h, --help show this help message and exit

-o OUTPUT, --output OUTPUT

OUTPUT directory (default: None)

--threads THREADS INT number of threads to run. Default is system cpu

count (default: 18)

--tempdir TEMPDIR The temporary directory is made and destroyed here.

(default: /tmp)

--specific_tempdir SPECIFIC_TEMPDIR

This temporary directory will be used, but will remain

after executing. (default: None)

実行方法

analyze

リファレンスゲノムがある場合

alignqc analyze long_reads.bam -g reference.fa.gz \
 -t ref_transcriptome.gtf.gz -o output.xhtml

--threads INT number of threads to run. Default is system cpu count (default: 1)
-t Reference transcriptome in GTF format, assumes gene_id and transcript_id are defining gene and transcript names (default: None)
-g Reference Fasta (default: None)
-o OUTPUT xhtml with data (default: None)
--portable_output OUTPUT file in a small xhtml format (default: None)
--output_folder OUTPUT folder of all data (default: None)

出力フォーマットはxhtmlになっているが、拡張子をhtmlにして出力しても開ける。

--output_folderで出力するとディレクトリに関連ファイルが全て出力される。

f:id:kazumaxneo:20190422130950j:plain

xhtmlファイルを開く。

f:id:kazumaxneo:20190422115541j:plain

f:id:kazumaxneo:20190422115557j:plain

f:id:kazumaxneo:20190422115600j:plain

f:id:kazumaxneo:20190422115645j:plain

リファレンスゲノムがない場合

alignqc analysis long_reads.bam --no_genome --no_transcriptome \
 -o output.html

--no_transcriptome No annotation is available (default: False)
--no_genome No Reference Fasta (default: False)

compare

複数結果を比較するstatisticsテーブルを出力する。

alignqc compare -o output_dir --threads 16 sample1/output_report.xhtml sample2/output_report.xhtml <sample3 sample4 ...>

出力txtをexcelで開いた。

f:id:kazumaxneo:20190422133128j:plain

引用

Comprehensive comparison of Pacific Biosciences and Oxford Nanopore Technologies and their applications to transcriptome analysis

Jason L Weirather, Mariateresa de Cesare, Yunhao Wang x, Paolo Piazza, Vittorio Sebastiano, Xiu-Jie Wang, David Buck, Kin Fai Au

F1000 Research

2019年４月現在、査読中になっています。現在version2です。

ショートリードではそこまででもないが、ラージゲノムのlong_reads.bamだとかなりメモリを使うためkillしてしまう。私の環境ではchr１つだけでもメモリが足りず落ちてしまったため、Mtゲノムだけ分析しています。注意して下さい。

関連

2019-04-23

E.coliとKlebsiellaに対応したWGSからのプラスミド検出ツール PlaScope

2018 Microbial Genomics plasmid Virulence Factor drug resistance

　最近、いくつかの研究がin silicoプラスミド予測ツールの有効性を評価している[ref.1、2]。実際、現在、多くのバイオインフォマティクス法が、リードカバレッジ解析（例：PlasmidSPAdes）、k-merベースの分類（例：cBAR、PlasFlow）、レプリコン検出（例：PlasmidFinder）などのさまざまなアプローチが、このようなモバイル因子の検出に使用できる。これらの中には完全に自動化されているものもあり[ref.3–7]、そうでないものもある[ref.8]。そのうちのいくつかは高い感度を達成している：例えば、PlasmidSPAdes（紹介）とcBar（紹介）は42ゲノムのデータセットでそれぞれ0.82と0.76のプラスミド回収を可能にする[ref.1]。一方で、100％に達するPlasmidFinder（紹介）のように、いくつかのツールは非常に高い精度を示す[ref.1]。残念なことに、感度と特異性の間の良いトレードオフを見つけることに成功したものはないので、正しい予測を得るためにユーザーは異なる方法を組み合わせる必要がある。

　付随して、公的データベースにおいてますます多くの配列が利用可能になりつつあり、これらには、多数の一連のコンティグから、完全に環状化されたゲノムおよびプラスミドまで様々なレベルの完全性がある。何人かの研究者はこれらのデータベースをキュレーションするための努力をして、高品質のデータセットを提案した。例えば、Carattoli et alとOrlek et alらは、腸内細菌科のための興味深くそして徹底的なプラスミドデータセットを発表した[ref.4、9]。

　このことを念頭に置いて、本著者らはここでPlaScopeと呼ばれる、ゲノムアセンブリのplasmidomeを評価するためのワークフローを提案する。利用可能なゲノムデータを利用して、プラスミドと染色体のカスタムデータベースを作成した。これらは、もともとメタゲノミクス分類器として開発されたツールCentrifugeソフトウェアの入力として使用され、これはデータベースに対する完全一致に基づいて配列を割り当てることができる[ref.10]。本著者らはそれを他のプラスミド分類子、cBarおよびPlasFlowと比較し、そして特定のknowledgeベースのアプローチで特異性を危うくすることなく様々な大腸菌株のほぼ全てのプラスミドを回収できることを示した。最後に、このアプローチの有用性を2つのデータセットで示した。（一部略）

　PlaScopeのワークフローを図1に示す。まず、ユーザーはペアエンド fastqファイルを入力する必要がある。アセンブリはSPAdes 3.10.1 [ref.11]を使用し、 'careful'オプションとauto k-merサイズのオプションを使用して実行し、コンティグを取得する。その後、Centrifuge [ref.10]のカスタムデータベースによりこれらのコンティグの場所を予測し、それらを3つのクラスに分類した、すなわちプラスミド、染色体、そして未分類。未分類には、（i）両カテゴリーで共有される（すなわち、データベースからのプラスミドおよび染色体配列と一致する）区別できないコンティグ、（ii）データベースにヒットしないコンティグ、および（iii）長さ、ヒット長または定義されたしきい値を下回るカバレッジ、が含まれる。最後に、これら3つのクラスに基づいて結果がソートされ、awkを使用して抽出される。完全なワークフローはgithubのPlaScope.sh（https://github.com/GuilhemRoyer/PlaScope）bash スクリプトを通して利用可能である。または、すべての依存関係をBioCondaを通してインストールし利用できる（conda install plascope）。

　これらのレプリコンは細菌の特定の環境適応において極めて重要であるため、プラスミドの探索は非常に興味深いものとなり得る。それらは種内および種間の多くの遺伝子の交換に関与しており、特に抗生物質耐性および病原性に大きな影響を与えている。しかしながら、プラスミドの特徴付けは長年にわたり面倒な仕事であり、例えば複雑な接合またはエレクトロポレーション操作を必要としていた。全ゲノムシーケンシング技術の出現により、適切なツールが利用可能であるならば、これらの配列へのアクセスは今や潜在的により容易である。多種多様なバクテリアのプラスミドソームを探索するための多くのソフトウェアツールが開発されているが、特異性と感度の点で最良の妥協点を提供することはめったにない。ここでは、本著者らは単一の種または属に焦点を当て、そして、この問題を克服するために利用可能な多くのデータを使用する。本ツールPlaScopeは他の2つの分類器、PlasFlowとcBarと比較して高性能を達成し、そしてそのようなアプローチが毒性または耐性遺伝子の位置を決定するできることを実証する。 PlaScopeは特定のよく知られたバクテリアの分析に非常に役立つかもしれないと私達（本著者ら）は考えている。

The PlaScope workflow. 論文より転載。

インストール

mac10.14のanaconda3環境でテストした（python3.6.8）。

依存

SPAdes 3.10.1 or later if you want to run the assembly (= mode 1) (header of contigs must be the same as in version 3.10.1, e.g >NODE_1_length_506801_cov_117.065)
Centrifuge 1.0.3

本体　Github

#anaconda環境ならcondaで導入可能
conda install -c bioconda -y plascope

> plaScope.sh

$ plaScope.sh

usage: plaScope.sh [OPTIONS] [ARGUMENTS]

General options:

-h, --help display this message and exit

-v, --version display version number and exit

-n, --no-banner don't print beautiful banners

-t number of threads[OPTIONAL] [default : 8]

-o output directory [OPTIONAL] [default : current directory]

--sample Sample name [MANDATORY]

--db_dir path to centrifuge database [MANDATORY]

--db_name centrifuge database name [MANDATORY]

Mode 1: SPAdes assembly + contig classification

-1 forward paired-end reads [MANDATORY]

-2 reverse paired-end reads [MANDATORY]

Mode 2: contig classification of a fasta file (only if you already have your SPAdes assembly!)

--fasta SPAdes assembly fasta file [MANDATORY]

Example mode 1:

plaScope.sh -1 my_reads_1.fastq.gz -2 my_reads_2.fastq.gz -o output_directory --db_dir path/to/DB --db_name chromosome_plasmid_db --sample name_of_my_sample

Example mode 2:

plaScope.sh --fasta my_fastafile.fasta -o output_directory --db_dir path/to/DB --db_name chromosome_plasmid_db --sample name_of_my_sample

Github:

https://github.com/GuilhemRoyer/PlaScope

データベースの準備

Pre-buildされたデータベースをダウンロードする。

https://zenodo.org/record/1311641#.XLswji3APmE

#E.coli
mkdir database && cd database/
wget https://zenodo.org/record/1311641/files/chromosome_plasmid_db.tar.gz?download=1
tar -zxvf chromosome_plasmid_db.tar.gz?download=1

#Klebsiella
mkdir database && cd database/
wget https://zenodo.org/record/1311647/files/Klebsiella_PlaScope.tar.gz?download=1
tar -zxvf Klebsiella_PlaScope.tar.gz?download=1

KlebsiellaならKlebsiella_PlaScope.1.cf、Klebsiella_PlaScope.2.cf、Klebsiella_PlaScope.3.cfができる。ラン時にはKlebsiella_PlaScopeを指定する。

カスタムデータベースの作成方法はGithubにまとめられています。

実行方法

fastqとデータベースを指定する。

plaScope.sh -1 my_reads_1.fastq.gz -2 my_reads_2.fastq.gz \
 -o output_dir --db_dir database_dir --db_name Klebsiella_PlaScope \
 --sample  sample1

引用

PlaScope: a targeted approach to assess the plasmidome from genome assemblies at the species level
G. Royer, J. W. Decousser, C. Branger, M. Dubois, C. Médigue, E. Denamur, D. Vallenet

Microb Genom. 2018 Sep; 4(9)

関連

2019-04-22

スプライシングジャンクションを上手く処理できるエラーの多いロングリードRNA seqのアライナーdeSALT

2019 Preprint RNA seq Pacbio Nanopore long read 高速なツール mapping Alternative splicing splicing variant Iso-Seq Genome Biology

2019 12/17 論文追記

　RNAシークエンシングはトランスクリプトームを特徴付けるための基本的なアプローチとなっている。正確な遺伝子構造を明らかにし、遺伝子/転写産物の発現を定量できる[ref.1-5]、さらにバリアントコーリング[ref.6]、RNA edit/ng解析[ref.7 - 8]、遺伝子融合検出[9] 10]が可能である。しかしながら、限られたリード長やライブラリ調製からのシステマチックなバイアスなどのショートリードシーケンシングの欠点のために、リードを正確にアラインメントし[11]、遺伝子アイソフォームを再構築し[12]、そして転写産物を定量することは依然としてトランスクリプトーム研究のボトルネックとなっている。
　Pacific Biosciences（PacBio）の1分子リアルタイム（SMRT）シークエンシング[ref.13]とOxford Nanopore Technologies（ONT）のナノポアシークエンシング[ref.14]の2種類の画期的なロングリードシークエンシング技術が出現し、これはトランスクリプトーム解析におけるショートリードのボトルネックを回避できる。両方ともはるかに長いリード生成を可能にする、すなわち、平均リード長は10kbを超え、最大リード長は数百kbを超える［ref. 15, 16］。この利点を利用して、全長転写物を単一のリードによってシーケンシングでき、これは遺伝子アイソフォーム再構築の精度を実質的に改善することが期待されている。さらに、シークエンシング手順におけるシステマチックなバイアスも少なく[ref.17]、これは遺伝子/転写産物の発現定量にも有益である。
利点に加えて、PacBioとONTのリードは、ショートリードよりもはるかに高いシーケンスエラー率を持っている。PacBio SMRTシークエンシングでは、rawリード（「サブリード」）のシークエンシングエラー率は約10％-20％である[ref.16]。 ONTナノポアシーケンシングの場合、1Dおよび2D（1D2としても知られる）リードのシーケンシングエラー率は、それぞれ約25％および12％である[ref.18、19]。 PacBio SMRTプラットフォームでは、環状フラグメントを複数回シーケンスしてシーケンスエラーを大幅に減らすことで、目的のリード（ROI）を生成することもできるが、このテクノロジではシーケンスの歩留まりが大幅に低下し、リード長が短くなる。シーケンシングエラーは、RNA seqデータ分析に新たな技術的課題を投げかける。リードアラインメントは最も影響を受けるものである可能性があり、そしてそれは多くの下流の分析にとって基本的であるので、影響はリードアラインメントそれ自体に限定されないかもしれない。
以前の研究[ref.20-22]は、ノイズの多いDNA-seqロングリードアラインメントは重要なタスクであることを実証している。深刻なシークエンシングエラー、潜在的なゲノム変異、長いリード長など、多くの技術的問題をうまく処理する必要がある。RNA seqのロングリードアラインメントについては、アライナは上記の問題以外にも多数のスプライシングイベントに対処しなければならないので、その作業はさらに困難である。これはアライナが多くのスプライシングジャンクション部を正しく認識しそして対応するエクソンに塩基をマッピングするために非常に複雑なスプリットアラインメント（「スプライスアラインメント」とも呼ばれる）を実施する強い能力を有することを必要とする。提案されているDNA配列ロングリードアラインメントアプローチのほとんどは、ゲノム構造変異（SV）を処理するためにスプリットアラインメントを実行する能力を有するが[ref.21-23]、スプライシングジャンクション部ははるかに頻繁に起こり、エクソンの長さははるかに短く分岐している。調整されたアルゴリズムはまだ要求がある。

　BBMap [ref.24]、GMAP [ref.25]、STAR [ref.26]、BLAT [ref.27]およびMinimap2 [28]など、RNA seqロングリードアラインメントを支持するいくつかのアプローチが存在する。これらのアプローチはすべて一般的に使用されているシードアンドエクステンド戦略に基づいており、RNA seqロングリードアライメントの問題に対処するためにさまざまなシーディングおよびエクステンション方法が実装されている。これらのアプローチの全てはスプライシングジャンクション部を取り扱う能力を有する。ただし、これらのアルゴリズムのほとんどは比較的低速である[ref.28]。これは、seed処理ステップでのショートマッチが莫大に出ることと、extensionステップでの時間のかかるローカルアライメントに原因がある。さらに、いくつかのアルゴリズムはまた感度が低い［ref.29］。すなわち、多くのリードがアライメントしていないかまたは部分的にアライメントしているだけである。優れたアルゴリズムはMinimap2（紹介）である。これは、他の最先端のアライナよりも数十倍速いスピードと、同等またはそれ以上の感度を同時に達成ししている。これは主に、よく設計されたminimizer-basedのインデックス作成[ref.30]およびSSEベースのローカルアラインメント方法[ref.31]の恩恵を受けている。
　絶対的に言えば、このタスクの最終的な目標は、すべてのリードについて、すべての塩基を正しくマップすることである。ただし、いくつかの点で、これは依然として最先端のアライナにとって自明ではない可能性がある。一つは比較的短いエクソン、例えばわずか数十bpのエクソンの塩基のアライメントである。深刻なシーケンシングエラーおよび潜在的な変異の状況下では、そのようなショートエクソンからリード部分にシードを見つけることは極めて困難であり、そのためリード部分は通常アライメントされていないか誤ってアライメントされている。他の問題は、スプライシングジャンクション部分付近の塩基を正しくアライメントさせることが困難であるということである。この問題は、ショートRNA seqのリードアラインメントにも存在する。しかし、ノイズの多いロングリードRNAシーケンスのリードのアラインメントではより深刻になる。さらに、シーケンシングエラーの影響を受けて、同じ遺伝子アイソフォームからのリードのアラインメントは通常互いに分岐しており、これも下流の分析に誤解を招く。
　ここでは、ロングトランスクリプトームリードのためのde Bruijnグラフベーススプライスアライナ（deSALT）を提案する。 deSALTは、de Bruijnグラフベースのインデックスに基づく新規なツーパスリードアライメントストラテジーの利点を利用する、高速で正確なRNA seqロングリードアライメントアプローチである。それは、複雑な遺伝子構造および深刻なシーケンシングエラーをうまく処理して、より高感度で正確かつ合意されたアラインメントを生み出す能力を有する。ほとんどのリードについて、deSALTは全長リードに沿ってエクソンとスプライシングジャンクション部を完全に回復する全長アライメントを生成することができる。さらに、deSALTの速度も、最先端のアプローチと同じか、それより速いか同等である。私達(本著者ら)はそれが多くの今後のトランスクリプトーム研究において重要な役割を果たす可能性があると信じている。

An example of the alignments of simulated reads by various aligners.　Preprintより転載。

deSALTに関するツイート

Liu, Liu, Zang, Wang and co present deSALT, for aligning long RNA-seq reads. it takes a two-pass approach, using a graph-based index to match blocks between read and genome, and then relocating short matches between read and detected exons. https://t.co/HVtdirlDt0 pic.twitter.com/tj9PNjoTQm
— Genome Biology (@GenomeBiology) 2019年12月16日

インストール

依存

#素の状態のubuntuに入れるなら
sudo apt update && sudo apt install git make gcc zlib1g zlib1g-dev

本体　Github

git clone https://github.com/ydLiu-HIT/deSALT.git
cd deSALT/src
make

> ./deSALT

# ./deSALT

Program: deSALT (Third generation RNA sequence alignment)

Version: 1.0

Contact: Yadong Liu <hitliuyadong1994@163.com>

Usage: deSALT <command> [options]

Command:

index index reference sequence

aln align long RNA sequence to reference

> deSALT index

# deSALT index

Usage: deSALT index <ref.fa> <index_route>

build deBGA index file with default 22-kmer. You can get more deBGA information from https://github.com/HongzheGuo/deBGA

> deSALT aln

# deSALT aln

[Main] deSALT - De Bruijn graph-based Spliced Aligner for Long Transcriptome reads

Program: de Brijn Graph-based 3rd RNA sequence alignment

Usage: deSALT aln [options] <index_route> <read.fa/fq>

Algorithm options:

-t --thread [INT] Number of threads. [1]

-K --index-kmer [INT] K-mer length of RdBG-index. [22]

-k --seeding-kmer [INT] K-mer length of seeding process (no long than RdBG-index). [15]

-a --local-hash-kmer [INT] K-mer length of local hash process. [8]

-s --seed-step [INT] The interval of seeding. [5]

-B --batch-size [INT] The number of reads to be processed in one loop. [100000]

-n --max-uni-pos [INT] Maximum allowed number of hits per seed. [50]

-l --max-readlen [INT] Maximum allowed read length. [1000000]

-i --min-frag-dis [INT] Maximum allowed distance of two fragment can be connect. [20]

-I --max-intron-len [INT] maximum allowed intron length. [200000]

-c --min-chain-score [INT] minimal skeleton score(match bases minus gap penalty). [20]

-d --strand-diff [INT] The minimal difference of dp score by two strand to make sure the transcript strand. [20]

-g --max-read-gap [INT] Maximum allowed gap in read when chaining. [2000]

-p --secondary-ratio [FLOAT] Min secondary-to-primary score ratio. [0.90]

-e --e-shift [INT] The number of downstream (upstream) exons will be processed when left (right) extension. [5]

-G --gtf [STR] Provided an annotation file for precise intron donor and acceptor sites.

The release of annotation file and reference genome must the same!

-x --read-type [STR] Specifiy the type of reads and set multiple paramters unless overriden.

[null] default parameters.

ccs (PacBio SMRT CCS reads): error rate 1%

clr (PacBio SMRT CLR reads): error rate 15%

ont1d (Oxford Nanopore 1D reads): error rate > 20%

ont2d (Oxford Nanopore 2D reads): error rate > 12%

Scoring options

-O --open-pen [INT(,INT)]

Gap open penealty. [2,32]

-E --ext-pen [INT(,INT)]

Gap extension penalty; a k-long gap costs min{O1+k*E1,O2+k*E2}. [1,0]

-m --match-score [INT] Match score for SW-alginment. [1]

-M --mis-score [INT] Mismatch score for SW-alignment. [2]

-z --zdrop [INT(,INT)]

Z-drop score for splice/non-splice alignment. [400]

-w --band-width [INT] Bandwidth used in chaining and DP-based alignment. [500]

Output options

-N --top-num-aln [INT] Max allowed number of secondary alignment. [4]

-Q --without-qual Don't output base quality in SAM

-f --temp-file-perfix [STR] Route of temporary files after the first-pass alignment. [1pass_anchor]

If you run more than one tgs program in the same time,

you must point out different routes of temporary files for each program!!!

If no, every deSALT program will write temporary data to the same file which

will cause crash of program in 2-pass alignment due to inconsistent temporary data.

-o --output [STR] Output file (SAM format). [./aln.sam]

> ./deBGA

# ./deBGA

Program: deBGA (De bruijn graph nucleotide alignment)

Version: 0.1

Contact: Hongzhe Guo <hzguo@hit.edu.cn>

Usage: deBGA <command> [options]

Command: index index sequences in the FASTA format

aln pair-end and single-end reads seed reduction and alignment based on De bruijn graph

> deBGA index

# deBGA index

Program: de Brijn Graph-based mapping system index building

Version: 0.1

Contact: Hongzhe Guo <hzguo@hit.edu>

Usage: deBGA index [options] reference.fasta <index_route>

Options: -k INT the k-mer length of the vertices of RdBG [20-28]

> deBGA aln

# deBGA aln

Program: de Brijn Graph-based mapping system seed reduction and alignment

Version: 0.1

Contact: Hongzhe Guo <hzguo@hit.edu>

Usage: deBGA aln [options] <index_route> <read pair-end1.fq> [read pair-end2.fq] <result_file.sam>

Options:

-k INT the minimum length of a valid Uni-MEM seed [21-28]

-s INT the number of iterations of re-seeding [4]

-i INT the minimum interval of seeding [5]

-n INT the maximum allowed number of hits per seed [300]

-c NUM the threshold on the edit distance for early stop [0.05]

--cl NUM the adjusted threshold on the edit distance for early stop [0.00]

--local the local alignment option for confident alignment

--local-match NUM the score for a matched base in the local alignment [1]

--local-mismatch NUM the penalty for a mismatched base in the local alignment [4]

--local-gap-open NUM the penalty for a gap open in the local alignment [6]

--local-gap-extension NUM the penalty for gap extension in the local alignment [1]

--stdout (default: not set) output alignments by stdout

-u INT the upper limit of insert size (only for pair-end reads) [700]

-f INT the lower limit of insert size (only for pair-end reads) [300]

-o INT the maximum number of alignment output [20]

-x INT the maximum number of alignment output for anchoring alignment [150]

-l INT the maximum allowed read length [512]

-e INT the budget for single-end alignment [100]

-p INT the number of threads [1]

Please refer to the following link for more detailed information about the options: https://github.com/HIT-Bioinformatics/deBGA

実行方法

1、indexing

deSALT index ref.fa index_dir

output_index_dir/にindexが出力される。

２、mapping

deSALT aln index_dir long_read.fq -t 16 -o aln.sam

-t Number of threads. [1]
-o Output file (SAM format). [./aln.sam]

引用

deSALT: fast and accurate long transcriptomic read alignment with de Bruijn graph-based index
Bo Liu, Yadong Liu, Tianyi Zang, Yadong Wang

bioRxiv preprint first posted online Apr. 17, 2019

追記

deSALT: fast and accurate long transcriptomic read alignment with de Bruijn graph-based index

Bo Liu, Yadong Liu, Junyi Li, Hongzhe Guo, Tianyi Zang, Yadong Wang
Genome Biology volume 20, Article number: 274 (2019)

関連

2019-04-21

machine leraningも併用するmetagenomeのビニングツール Autometa

2019 Nucleic Acids Research Binning (metagenomics) metagenome workflow 結果の視覚化 (visualization) bacteria archaea Marine Metagenomics deep neural network automatic pipeline docker

2019 4/22 誤字修正

2019 5/6 リンク追記

2019 6/23 ランのstep1,2 の説明修正

2019 9/25 step1のフラグの誤り修正

　微生物は、人間を含む地球上のほとんどすべての生物に繋がることが知られており、そこでは微生物は健康、病気および農業に多大な影響を与えると考えられている（ref.1-3）。しかしながら、ほんのわずかな割合の環境微生物しか実験室で容易に培養できないことが長い間知られてきた（ref.4）。したがって、微生物系統樹の大部分は、培養に依存しないシークエンシング（「メタゲノミクス」）を介してのみまだアクセス可能である。初期のメタゲノム解析では、異なる環境内の個々のバクテリア種の相対的存在量を調べることによって群集の系統発生プロファイルに焦点を当てていたが（16S rRNA遺伝子シークエンシングを通じて定量）、これらの環境を形成する機能的寄与および生物レベルの相互作用に関する情報は限られていた（ref.5）。全ゲノム「ショットガン」シーケンシングは、標準的なプライマーでは検出できない非標準のリボソーム RNA遺伝子に関する問題や固有の低分解能など、ハイスループット16S rRNAアンプリコンシーケンシングが直面するいくつかの課題を克服することができる。しかし、メタゲノムコンティグを個々のゲノムを表すクラスターに分類すること（「ビニング」）は、困難な計算問題であり、現在も活発な研究分野である（ref.7,8）。ビニングは、コミュニティ全体の代謝能力に対する個々の微生物の代謝、機能的寄与を理解するための必要なステップである。言い換えれば、メタゲノムのゲノムレベルの解明により、研究者は、代謝機能の解釈を超えて、その場で集合的に複雑な系内の個々の生物の役割を理解することが可能になる。

　ほとんどの環境が主に特徴付けられていない微生物で構成されていることを考え、リファレンスフリーのビニングを達成するためにさまざまなアプローチが取られてきた。例えば、ヌクレオチド組成は、emergent self-organizing maps（ESOM）（ref.9）またはBarnes-Hut stochastic Neighbor Embedding（BH-tSNE）（ref.10、11）でコンティグをグループ化するために使用されてきた。これらのアプローチは、k-merスペクトラムの変動を２次元に減らし、高次元データの視覚化を可能にし、そしてヒト主導のクラスタリングを可能にする。他の取り組みでは、共有ゲノムのコンティグのカバレッジは明確な共分散を示すという仮定の下、複数のサンプルからの情報を活用することに焦点が当てられている。手動（ref.12、13）と自動パイプライン（ref.14、15）の両方がこのアプローチを使用している。ただし、この方法にはいくつかの欠点がある。多くのマルチサンプルプロトコルでは、すべてのサンプルからのリードのアセンブリが必要である。共有ゲノムがクローンでない場合、計算要件が高まり、潜在的にアセンブリ品質が低下する。この問題は「microdiversity」として知られている - Albertsen et alによって認められた問題であり（ref.13 link（有名な論文））、そして最近他でも実証された（ref.16 link）（紹介）。同時アセンブリのためにサンプルをプールすることによって、ユーザーはco-assemblyの効果を悪化させることもあり得る。Co-assembly結果は、1つのサンプルから採取された単一系統、生物または集団のゲノムの代わりに、広く集約されたコンセンサス配列を表している（ref.8,16,17）。そのようなaggregationは、個々の株またはサンプルにのみ見られるpan-genome配列（ref.18）の存在をマスクでき、それは抗生物質耐性を付与する可動要素（ref.19,20）または水平伝播を通して得られる生合成遺伝子クラスターを考慮するとき重要な臨床的およびバイオテクノロジー的意味を有する。システムの根本的な変動性または重複が未知であるさらなる状況があり、少数のパイロットデータセットから情報を抽出したいという要望がある。さらに、本質的により高いシークエンシングコストを招くマルチサンプル比較は、1つのサンプルに固有のゲノムのビニングに必ずしも役立つわけではない（ref.21）。

　今日まで、bioactive small moleculeを作る海洋無脊椎共生生物のゲノムをシーケンシングするための我々の努力は半手動のビニング技術に頼っていた（ref.21、22）。しかし、海綿動物のマイクロバイオームは豊富なバイオテクノロジーの可能性を提供し（ref.23）、何百もの微生物種を含むことができ、海綿動物の組織体積の最大40％を占める（ref.24、25）。他のシステムも挑戦的である。例えば、カブトムシLagria villosaの卵はBurkholderia gladioliのいくつかclosely relatedな株の混合物と関連していることを発見したが、これらのうちのいくつかだけが培養可能で卵を感染から守る抗真菌化合物を生産する。これらのシステムは合理的な手動処理の限界を超え、そしてそのような宿主関連メタゲノムのための既存の自動ビニングパイプラインの不十分な性能が、本著者らの自動化されたスケーラブルなビニングアルゴリズム開発の動機となった。本手法Autometaは、個々のメタゲノムアセンブリからのメタゲノムの複雑さに従ってスケーリングを最大化するために、コンティグの単純化されたサブセット（分類学的にバクテリアまたはアーキアのいずれかに分類されるもの）でクラスタリングを実行する。初期クラスタは、教師付き機械学習アルゴリズムによるその後の分類のためのトレーニングセットとして機能する。我々（著者ら）は、既知の構成要素ゲノムを参照して性能を評価できるシミュレーションの合成メタゲノムを使い、ならびに以前にセミマニュアルビニングによって調べた実際の宿主関連メタゲノムを使用し、Autometaを評価した（ref.21、22）。特にメタゲノムの複雑さが高い場合やホスト関連のデータセットにおいて、AutometaがMaxBin（ref.14）、MetaBAT（ref.27）、MyCC（ref.28）、BusyBee Web（ref.29）と同程度または性能を上回っていることがわかった。さらに、lowest common ancestor（LCA）分析を使用したコンティグレベルの分類学上の分類により、Autometaのパフォーマンスや他のパイプラインのビニングパフォーマンスを向上させることができた。

　Autometaは、コンティグを区別するために配列相同性、カバレッジおよびヌクレオチド組成を使用して、シングルショットガンメタゲノムから新規にゲノムを回収する。このタスクは、以前にバクテリアおよびアーキアにおいて同定されているシングルコピー存在することが知られているマーカー遺伝子の存在によってガイドされる。各マーカーはビンごとに1回しか検出されないため、マーカー遺伝子の存在は、ビンのゲノム完全性および汚染レベルを推定するために使用することができる。シングルコピーマーカーは、以前はMaxBin（ref.14）とMyCC（ref.28）で使用されていたが、ここでは異なるアプローチを取る。 MaxBinでは、期待値最大化（EM）アルゴリズムのためシングルコピーマーカーを使用してクラスター数、平均テトラヌクレオチド頻度、およびカバレッジを初期化する。ビン内の2以上のマーカーの中央値は、EMが収束したかどうかの大まかな尺度として使用される。 MyCCは、どのクラスターをマージまたは分割すべきかを決定するために、 affinity propagationによる1ラウンドのクラスター化の後にシングルコピーマーカーを利用する（ref.28）。これら両方のパイプラインは、真のゲノムが予想される数のマーカーを有するという仮定に悩まされており、そしてMyCCの場合には、この情報はクラスター化段階を導くために使用されない。これとは対照的に、Autometaはシングルコピーマーカーを使用してクラスタリングをガイドし、回復可能なゲノムが必ずしも「完全」であるとは想定していない。環境メタゲノムに見られる微生物は、以前にシーケンシングされたすべての生物とは非常に異なっている可能性があり、真核生物宿主に関連する微生物はしばしば独立した生活に不可欠な機能を失うようなゲノムの分解および減少の過程を経ることが知られている（ref.31,32）。例えば、本著者らは最近、NCBI NRデータベースでヒットした遺伝子はわずか20％で、予想されるバクテリアのシングルコピーマーカーのわずか20％しか検出できないという、既知の配列とは異なるゲノムが減少したバクテリアを同定した。したがって、MaxBin（ref.14）のように、ビンが100％完全に近いことを想定したり、ビンの数を事前計算するためにシングルコピーマーカーを使用したりすることは想定していない。 Autometaで採用されているプロセス全体は、大きく3つの段階から構成されている（論文図1）。

配列相同性に基づいてコンティグをkingdomビンに分離する。
Kingdom固有のコンティグを繰り返しクラスタ化する。
教師あり機械学習により、クラスタ化されていないコンティグをビンに分類する。

　コンティグをKingdomビンに広範に分離することにより、宿主由来または他の真核生物汚染を除去することが可能になり（たとえ宿主ゲノムがリファレンスデータベースに表されていなくても）、バクテリアおよびアーキア由来コンティグの分離が可能になる。遺伝子は、Prodigal（ref.33）を使用して指定された長さのカットオフよりも長いすべてのコンティグで識別される（デフォルトは10,000 bpだが、ここでテストされたすべてのデータセットは3,000 bpカットオフに基づいている）。次いで、翻訳されたコード配列は、DIAMOND（ref.34）を用いてNCBI NRデータベースに対してクエリされる。トップヒットの10％以内のビットスコアを有するヒットのLCAを用いて、NCBI taxonomyデータベース（https://www.ncbi.nlm.nih.gov/taxonomy）に従って各予測タンパク質にtaxonomy IDを割り当てる。遺伝子水平伝播の影響を減らすために、コンティグレベルの分類は、構成要素予測コード配列の修正多数決によって割り当てられる。分類は特異性の低い順に（種、次に属、科、目、綱、門および界）考慮され、タンパク質の大多数がより低い特異性で分類されてしまい答えが得られない場合、コンティグ内のすべてのタンパク質の最小公約数がコンティグ分類として使用される。

（以下略）

f:id:kazumaxneo:20190409101544j:plain

Autometa metagenome workflow.　論文より転載

Our automated metagenomic binning pipeline - Autometa - is now published in @NAR_Open! https://t.co/KbCVOxyigp
— Jason C Kwan (@kwan_lab) March 6, 2019

インストール

オーサーらが準備したdockerイメージを使いテストした。

依存

Third party programs

Databases (these will be automatically downloaded the first time you run make_taxonomy_table.py). NOTE: You will need ~150 GB (at time of writing) to download and process these.

nr (Note: tested with newer NR versions without GI numbers)
taxdump.tar.gz
prot.accession2taxid.gz

Python packages (Note: this list assumes the use of Anaconda Python)

Additionally, if you want to calculate your own contig coverages (rather than trusting the coverage values given by the SPAdes assembler), you will need:

本体　Github

docker image

docker pull jasonkwan/autometa
docker run -itv $PWD:/data/ jasonkwan/autometa
> cd /autometa/pipeline/

> run_autometa.py -h

# run_autometa.py -h

usage: run_autometa.py [-h] -a <assembly.fasta> [-p <int>] [-l <int>]

[-c <float>] [-k <archaea|bacteria>]

[-t <taxonomy.tab>] [-o <dir>] [-r] [-m] [-db <dir>]

[-v <coverage.tab>]

Script to run the Autometa pipeline.

optional arguments:

-h, --help show this help message and exit

-a <assembly.fasta>, --assembly <assembly.fasta>

Path to metagenomic assembly fasta (default: None)

-p <int>, --processors <int>

Number of processors to use (default: 1)

-l <int>, --length_cutoff <int>

Contig length cutoff to consider for binning in bp

(default: 10000)

-c <float>, --completeness_cutoff <float>

Completeness cutoff (in percent) to use for accepting

clusters (default: 20.0)

-k <archaea|bacteria>, --kingdom <archaea|bacteria>

Kingdom to consider (default: bacteria)

-t <taxonomy.tab>, --taxonomy_table <taxonomy.tab>

Path to output of make_taxonomy_table.py (default:

None)

-o <dir>, --output_dir <dir>

Path to directory to store all output files (default:

-r, --ML_recruitment Use ML to further recruit unclassified contigs

(default: False)

-m, --maketaxtable runs make_taxonomy_table.py before performing autometa

binning. Must specify databases directory (-db)

(default: False)

-db <dir>, --db_dir <dir>

Path to directory with taxdump files. If this doesn't

exist, the files will be automatically downloaded

(default: /autometa/databases)

-v <coverage.tab>, --cov_table <coverage.tab>

Path to coverage table made by

calculate_read_coverage.py. If this is not specified

then coverage information will be extracted from

contig names (SPAdes format) (default: None)

Please do not forget to cite us. Thank you for using Autometa!

> calculate_read_coverage.py -h

# calculate_read_coverage.py -h

usage: calculate_read_coverage.py [-h] -a <assembly.fasta> [-p <int>]

[-F [<reads.fastq|reads.fastq.gz> [<reads.fastq|reads.fastq.gz> ...]]]

[-R [<reads.fastq|reads.fastq.gz> [<reads.fastq|reads.fastq.gz> ...]]]

[-S [<reads.fastq|reads.fastq.gz> [<reads.fastq|reads.fastq.gz> ...]]]

[-o <dir>]

Script to tabulate paired-end or single read coverage using Samtools and

Bedtools. Note: it is recommended that you quality filter your *.fastq or

*fastq.gz files prior to calculating read coverage.

optional arguments:

-h, --help show this help message and exit

-a <assembly.fasta>, --assembly <assembly.fasta>

Path to input assembly file

-p <int>, --processors <int>

Number of processors

-F [<reads.fastq|reads.fastq.gz> [<reads.fastq|reads.fastq.gz> ...]], --forward_reads [<reads.fastq|reads.fastq.gz> [<reads.fastq|reads.fastq.gz> ...]]

Paired (*.fastq|*.fastq.gz) forward reads (must be in

same order as reverse list)

-R [<reads.fastq|reads.fastq.gz> [<reads.fastq|reads.fastq.gz> ...]], --reverse_reads [<reads.fastq|reads.fastq.gz> [<reads.fastq|reads.fastq.gz> ...]]

Paired (*.fastq|*.fastq.gz) reverse reads (must be in

same order as forward list)

-S [<reads.fastq|reads.fastq.gz> [<reads.fastq|reads.fastq.gz> ...]], --single_reads [<reads.fastq|reads.fastq.gz> [<reads.fastq|reads.fastq.gz> ...]]

Single (*.fastq|*.fastq.gz) reads

-o <dir>, --output_dir <dir>

Path to output directory

> make_taxonomy_table.py -h

# make_taxonomy_table.py -h

usage: make_taxonomy_table.py [-h] -a <assembly.fasta> [-p <int>] [-db <dir>]

[-udb <user_prot_db>] [-l <int>]

[-v <coverage.tab>] [-o <dir>] [-bgc <dir>] [-s]

[-u]

Script to generate the contig taxonomy table.

optional arguments:

-h, --help show this help message and exit

-a <assembly.fasta>, --assembly <assembly.fasta>

Path to metagenomic assembly fasta

-p <int>, --processors <int>

Number of processors to use.

-db <dir>, --db_dir <dir>

Path to directory with taxdump, protein accessions and

diamond (NR) protein files. If this path does not

exist, will create and download files.

-udb <user_prot_db>, --user_prot_db <user_prot_db>

Replaces the default diamond database (nr.dmnd)

-l <int>, --length_cutoff <int>

Contig length cutoff to consider for binning in bp

-v <coverage.tab>, --cov_table <coverage.tab>

Path to coverage table made by

calculate_read_coverage.py. If this is not specified

then coverage information will be extracted from

contig names (SPAdes format)

-o <dir>, --output_dir <dir>

Path to directory to store output files

-bgc <dir>, --bgcs_dir <dir>

Path to directory of biosynthetic gene clusters. Masks

BGCs

-s, --single_genome Specifies single genome mode

-u, --update Checks/Adds/Updates: nodes.dmp, names.dmp,

accession2taxid, nr.dmnd files within specified

directory.

Output will be directed to recursive_dbscan_output.tab

> ML_recruitment.py -h

# ML_recruitment.py -h

usage: ML_recruitment.py [-h] -t <contig.tab> [-c <column header>] [-p <int>]

[-r] [-C <int>] [-u <unclustered name>] [-n <int>]

[-m <k-mer.tab>] -o <output.tab>

[-k <archaea|bacteria>]

Recruit unclustered (or non-marker) sequences with Machine Learning

classification using clustered sequence as training data. Features to train

with include sequence coverage, composition, and homology. Confidence is

calculated using jackknife cross-validation by randomly subsetting the

training data n number of times.

optional arguments:

-h, --help show this help message and exit

-t <contig.tab>, --contig_tab <contig.tab>

Path to master contig table which includes initial

clusters

-c <column header>, --cluster_column <column header>

Name of column containing initial cluster information

-p <int>, --processors <int>

Number of processors to use

-r, --recursive If specified, will run classification iteratively and

refine traning data after each iteration.

-C <int>, --Confidence_cutoff <int>

Confidence cutoff value to use to keep ML-based

predictions.

-u <unclustered name>, --unclustered_name <unclustered name>

Name of unclustered group in cluster column

-n <int>, --num_iterations <int>

Number of iterations for jackknife cross-validation.

-m <k-mer.tab>, --k_mer_matrix <k-mer.tab>

Path to k-mer_matrix file.

-o <output.tab>, --out_table <output.tab>

Path to create output table with new column for ML-

recruited sequences.

-k <archaea|bacteria>, --kingdom <archaea|bacteria>

Kingdom to consider (archaea|bacteria)

> cluster_process.py -h

# cluster_process.py -h

usage: cluster_process.py [-h] -b <bin.tab> [-c <bin column name>] -f

<assembly.fasta> [-o <dir>] [-k <archaea|bacteria>]

[-t] [-db <dir>]

Script to summarize the assembly characteristics and taxonomy of binned

clusters, as well producing individual cluster fasta files

optional arguments:

-h, --help show this help message and exit

-b <bin.tab>, --bin_table <bin.tab>

path to the output from either run_autometa.py or

ML_recruitment.py

-c <bin column name>, --column <bin column name>

the name of the column to use for binning purposes

-f <assembly.fasta>, --fasta <assembly.fasta>

path to the assembly used to make the bin table

-o <dir>, --output_dir <dir>

path to the directory where output files will go

-k <archaea|bacteria>, --kingdom <archaea|bacteria>

kingdom to consider

-t, --do_taxonomy carry out taxonomic analysis on the clusters (you must

have already run make_taxonomy_table.py)

-db <dir>, --db_dir <dir>

Path to directory with taxdump files

実行方法

Step0 、de novo assembly

メタゲノムのアセンブラはmetaspadesが推奨されている。ラン後にできるscaffolds.fastaを使う。

この先はstep by stepで進めるか、または全ステップをワンライナーで自動実行する。

step by stepの流れ

optional step

Autometaは、metaspadesなどのアセンブルのmulti-fastaのヘッダにあるcoverage情報を使う。"metaspades"以外のメタゲノムアセンブラでアセンブルを行なった場合、まずこのマンドでcoverage情報を取得する。

calculate_read_coverage.py --assembly scaffolds.fasta -p 24 \
 -F reads_R1.fastq.gz -R reads_R2.fastq.gz

-F Paired (*.fastq|*.fastq.gz) forward reads (must be in same order as reverse list)
-R Paired (*.fastq|*.fastq.gz) reverse reads (must be in same order as forward list)
-S Single (*.fastq|*.fastq.gz) reads

coverage.tabが出力される。step1、2で使用する。

Step 1: Split data into kingdom bins [optional]

アセンブル配列のkingdomレベルでのbiinningを行い真核生物のコンタミネーションを除く（例えばホストと相互作用する系から取ったメタゲノムなど）。真核生物のコンタミネーションがないサンプルなら、このステップはスキップしても支障ない。このジョブにはデータベースとしてnrを使う。

実行する。初回ランでデータベースがない場合はまずダウンロードされる（/autometa/databases/）。150GBほどディスクスペースを使うので注意する（*1）。step1はDiamondなどを使ったリソース集約的なジョブなので時間がかかる。CPUも利用できるだけ指定する。optional stepでcoverage情報ファイルを計算している場合、"-v coverage.tab"で指定する。

#metaspades scaffolds
make_taxonomy_table.py -a autometa/test_data/scaffolds.fasta \
 -p 24 -l 3000

#other assembler (step0出力の "--cov_table coverage.tabをつける")
make_taxonomy_table.py -a autometa/test_data/scaffolds.fasta \
 -p 24 -l 3000 --cov_table coverage.tab

-a Path to metagenomic assembly fasta (default: None)
-l Contig length cutoff to consider for binning in bp
-v Path to coverage table made by calculate_read_coverage.py. If this is not specified then coverage information will be extracted from contig names (SPAdes format)
-p Number of processors to use (default: 1)
-o Path to directory to store output files (default: .)

ジョブが終わるとBacteria.fastaとEukaryotes.fasta、Viruses.fasta、その他の関連ファイルが出力される。taxonomy.tabというcontigにtaxonomyがアサインされたファイルも出力される。

taxonomy.tabの中身

f:id:kazumaxneo:20190420180951j:plain

Step 2: Bin bacterial contigs with BH-tSNE and DBSCAN

binningを行うAutometaのコアジョブになる。先ほど作成したbacteria.fastaを使う。このコマンドはbacteriaのkingdomのみで正常動作する。これはautomataにはbacteriaのsingle-copy geneデータベースしか入っていないためで、他のkingdomのfastaファイルではうまく動作しないとされる。binning精度を上げるためstep1で作ったtaxonomy.tabも必要になる。"-v coverage.tab"で指定する。

#metaspades scaffolds
run_autometa.py -a Bacteria.fasta -p 24 -l 3000 -t taxonomy.tab -c 20

##other assembler
run_autometa.py -a Bacteria.fasta -p 24 -l 3000 -t taxonomy.tab -c 20 --cov_table coverage.tab

-a Path to metagenomic assembly fasta (default: None)
-p Number of processors to use (default: 1)
-l Contig length cutoff to consider for binning in bp (default: 10000)
-c Completeness cutoff (in percent) to use for accepting clusters (default: 20.0)
-k <archaea|bacteria> Kingdom to consider (default: bacteria)
-t Path to output of make_taxonomy_table.py (default: None)
-v Path to coverage table made by calculate_read_coverage.py. If this is not specified then coverage information will be extracted from contig names (SPAdes format) (default: None).
-o Path to directory to store all output files (default: .)

各contigに見つかったmarker geneをまとめたファイルBacteria_filtered_marker.tab、binning結果をまとめたtab仕分けファイルrecursive_dbscan_output.tab、各contigの5-mer頻度をまとめたk-mer_matrixなどが出力される。

Step 3: Recruit unclustered contigs to bins through supervised machine learning [optional]

step2のbinning結果をデータベースにsupervisedのmachine learningを行い、クラスタリングされなかったcontigの再binningを試みる。step2で作成したrecursive_dbscan_output.tabとk-mer_matrixを指定する。

ML_recruitment.py --contig_tab recursive_dbscan_output.tab --recursive \
 --k_mer_matrix k-mer_matrix -o ML_recruitment_output.tab

-t Path to master contig table which includes initial clusters
--k_mer_matrix Path to k-mer_matrix file.
-o Path to create output table with new column for ML-recruited sequences.
--recursive If specified, will run classification iteratively and refine traning data after each iteration.

ML_recruitment_output.tabが出力される。

Running all steps in sequence

全ステップの自動実行。--ML_recruitment をつけるとstep3のML_recruitment.pyも含めて実行される。

run_autometa.py -a /autometa/test_data/scaffolds.fasta \
 -p 16 -l 3000 -m --ML_recruitment

-m runs make_taxonomy_table.py before performing autometa binning. Must specify databases directory (-db) (default: False)
--ML_recruitment Use ML to further recruit unclassified contigs (default: False)

Analysis of results

最後に分離したfastaやtaxnomy情報を記載したファイルを出力する。step3のML_recruitment_output.tab、step1のBacteria.fasta、データベースのパスを指定する。

cluster_process.py --bin_table ML_recruitment_output.tab \
 -c ML_expanded_clustering --fasta Bacteria.fasta \
 --do_taxonomy -db /autometa/databases -o cluster_process_output

-c the name of the column to use for binning purposes
-db Path to directory with taxdump files. If this doesn't exist, the files will be automatically downloaded (default: /autometa/databases)
--do_taxonomy carry out taxonomic analysis on the clusters (you must have already run make_taxonomy_table.py)

上記コマンドの実行でcluster_process_output/ができ、その中にbinning（クラスタリング）したfasta、summary table等が出力される。

結果の視覚化 - ggplot2を使う。

library(ggplot2)
data = read.table('ML_recruitment_output.tab', header=TRUE, sep='\t')
ggplot( data, aes( x = bh_tsne_x, y = bh_tsne_y, 
col = ML_expanded_clustering )) + 
geom_point( aes( alpha = 0.5,  size = sqrt(data$length ) / 100 )) +
 guides( color = 'legend', size = 'none', alpha = 'none' ) +
 theme_classic() + xlab('BH-tSNE X') +
 ylab('BH-tSNE Y') + guides( color = 
 guide_legend( title = 'Cluster/bin' ))

f:id:kazumaxneo:20190420211426j:plain

引用

Autometa: automated extraction of microbial genomes from individual shotgun metagenomes
Ian J Miller Evan R Rees Jennifer Ross Izaak Miller Jared Baxa Juan Lopera Robert L Kerby Federico E Rey Jason C Kwan
Nucleic Acids Research, Published: 06 March 2019

メタゲノムのコンタミ除去やメタゲノムのサンプル間比較を行って結果を視覚化する Recentrifuge

2019 PLoS ONE ゲノム比較 (comparative genomics) 結果の視覚化 (visualization) 生物種の推定 (taxonomic profiling) metagenome Binning (metagenomics) 法医学 contamination docker

2019 4/21　タイトル追加

2019 4/21 オーサーのJose Manuel Martíさんのコメント追加

2019 4/23　タイトル修正

2019 4/26　誤字修正

2019 dockerリンク追記

219 5/9 パラメータ追記

20206/13 ツイート追記

2020 6/14 condaインストール追記

　メタゲノミクスによる微生物群集の研究は、環境、臨床、食品、法医学の研究など、さまざまな生物学的分野でより一般的になってきている[ref.1-3]。新しいDNAおよびRNAシークエンシング技術は、シークエンシングされた塩基あたりのコストを劇的に減らすことによってこれらの研究を後押ししている。研究者は、微生物叢の縦方向（空間的または時間的）パターンを解明するために、さまざまなソースおよび時間からの微生物群集に属するシーケンスのセットを分析することができる（モデルの例については図1を参照）。ショットガンメタゲノムシーケンシング（SMS）研究では、研究者は各サンプルから核酸を抽出して精製し、それらをシーケンシングし、バイオインフォマティクスパイプラインを通して配列を分析する（詳細な例については、論文図S2およびS3参照）。ナノポアシーケンシングの発展に伴い、ポータブルで手頃な価格のリアルタイムSMSが現実のものとなっている[ref.4]。

メタゲノミクスにおける汚染

　低微生物バイオマスサンプルの場合、微生物由来の天然DNAはほとんどない。ライブラリーの調製および配列決定法は、その主な原因が汚染である配列を返すだろう[ref.5、6]。追加の工程を必要とするRNAシーケンシングは、さらなるバイアスおよびアーティファクトを導入し［ref.7］、これは、低微生物バイオマス研究の場合、汚染および偽分類群検出の深刻な問題につながる［ref.8］。メタゲノム解析の臨床界では、メタゲノミクスのワークフローにおけるネガティブコントロールの重要性が強調されており、最近では、結果から汚染物質をどのように差し引くかについて根本的な懸念が生じている[ref.9]。

　データサイエンスの観点からは、これは良いSN比を維持することの重要性のほんの一例である[ref.10]。シグナル（固有のDNA / RNA、サンプリングのターゲット）がノイズの大きさ（コンタミネーションやアーティファクトから取得したDNA / RNA）に近づくと、それらを区別するために特定の方法が必要になる。

　汚染配列の大元は、核酸抽出キット（kitome）[ref.11、12]、試薬および希釈剤[ref.13、14]、宿主[ref.15]、およびサンプリング後の環境にまでさかのぼることができるので多様である。汚染が空中浮遊粒子のような異なる起源から生じる場合、現在のサンプルまたはDNA間のクロスオーバーは過去のシーケンスランからのままであるref.17]。これらの供給源からの様々な量のDNAが天然の微生物のDNAと同時にシーケンシングされる。これは、特に微生物のバイオマスが少ない状況では、存在量や範囲などの大きさに深刻なバイアスをもたらす可能性がある[ref.18]。マルチプレックスシーケンシングが単純なインデックス付けを使用している場合、誤ったアサインは容認できるレートを簡単に超える可能性がある[ref.19]。メタゲノムリファレンスデータベースでさえも無視できない量のクロスコンタミネーションがある[ref.15、17、20]。

　kitomeに関して、それは同じ製品の異なるロット内でさえも異なる。例えば、DNeasy PowerSoil Kit（以前のPowerSoil DNA Isolation Kitとして知られている）は、Earth Microbiome ProjectやHuman Microbiome Projectなど、通常大量のDNAを提供しているため、無視できないほどのバックグラウンド汚染をもたらすことがよくある[ref.6]。サンプル中のバイオマスが低いほど、汚染バックグラウンド評価に役立つネガティブコントロールサンプルを集めることがより重要になる。それらがなければ、検体中の固有の微生物叢 - 汚染 - ノイズ - を区別することはほとんど不可能だからである。

　天然のDNAと混入しているDNAが正確に分離されていると仮定すると、サンプル間の信頼できる比較を実行するという問題が残る。一般的に、taxonomic classificationエンジンは、特にこの方法が最小公倍数祖先（LCA）[ref.21]のようなより保守的なアプローチを使用する場合、シーケンスランからのリードを異なるtaxonomicランクにアサインする。 LCAは誤検知のリスクを大幅に軽減するが、通常分類の分類レベルをより具体的なものからより一般的なものへと広げる。分類器がLCA戦略を使用しない場合でも、通常、各リードに特定のスコアまたは信頼レベルがアサインされる。これは、分類の信頼性推定量として下流のアプリケーションで考慮されるべきである。

　これらの困難さに加えて、分類レベルでの分離度が異なるため、非常に異なるDNA収量を持つサンプル、たとえば、低バイオマスサンプルと高バイオマスサンプルを比較することはさらに困難である。この種の問題は、サンプルが同じ程度のDNA収量であっても、全く異なる微生物構造を持っているため、少数派と多数派の微生物がそれらの間で根本的に異なる場合にも起こる[ref.8]。最後に、closely relatedの問題がメタゲノムbioforensic研究および環境サーベイランスで出現し、そこでは特定の分類群のごくわずかな存在を検出し、そして正確さと精度の両方で定量的結果を提供する方法を準備することが不可欠である。

　当初から、環境試料へのSMSの適用は、バクテリア人工染色体（BAC）クローンまたは16S rRNAシーケンシングからは得られない微生物群集の洞察を生物学者に提供した[ref.24、25]。科学界はすぐに比較メタゲノミクスの必要性と課題を強調した[ref.26、27]。最初のメタゲノムデータ解析ツールの1つであるMEGAN[ref.28]は、最初のリリースでサンプルの非常に基本的な比較を提供した。これは、より最近のバージョンでの対話型アプローチで改善された[ref.29]。一般に、メタゲノム分類およびアセンブリソフトウェアは、サンプル間指向よりもイントラ指向的である[ref.30]。いくつかのツールがこのギャップを埋めることを試みた。CoMet[ref.31]はメタゲノムサンプルのコレクションにおける機能的な違いを予測するために多次元尺度法と階層的クラスタリング分析などの異なる方法を組み合わせる比較機能プロファイリングのためのウェブベースのツールである。た

　Taxonomic classificationエンジンの結果の信頼性を高めるためのRecentrifugeのアプローチは、2つの戦略に従う。まず、各ステップで分類のスコアレベルを考慮する。次に、クロスオーバーを含むさまざまな種類の汚染物質を検出して選択的に除去する、強力な汚染物質除去アルゴリズムを使用する。Recentrifugeは次の高性能の分類分類器を直接サポートする：centrifuge[ref.7]、LMAT [ref.21]、CLARK [ref.39]、CLARK-S [ref.40]、およびKraken [ref.41]。他の分類ソフトウェアは一般的なパーサーを通してサポートされている。 Recentrifugeのインタラクティブなインタフェースにより、研究者はスコア付きKronaのようなチャートを使用してこれらの分類結果を分析することができる。生サンプルのプロットに加えて、Recentrifugeは、関心のある分類レベルごとに4つの異なるスコア付きチャートセットを生成する。コントロール減算サンプル、共有分類群（コントロールを含むまたは含まない）、およびサンプルごとの排他的分類群。この一連の分析およびプロットは、メタゲノム研究における複数のサンプルのロバストな比較分析を可能にし、特に低い微生物バイオマス環境または身体部位の場合に有用である。

　Recentrifugeは、特に汚染除去が必須である低微生物バイオマスメタゲノム研究において、頑健な汚染除去および複数サンプルのスコア指向の比較分析を可能にする。物理的測定に付随する不確実性の記述を添付することが不可欠であるのと同様に、アサインられた分類群の信頼性推定を伴う任意のリード分類に参加することが望ましい。Recentrifugeは、分類ソフトウェアによって得られたスコアをリードに読み取り、この貴重な情報を使用して、分析されたサンプルに関連する分類学的ツリー内の各分類群の平均信頼水準を計算する。この値はまた、リード品質または長さなどのさらなるパラメータの関数であってもよく、これは、ナノポアシーケンサーによって生成されたデータセットのように、リード長が大きく変動する場合に特に有用である。

Recentrifuge’s flowchart. 論文より転載。

Recentrifuge: robust comparative analysis and contamination removal for #metagenomics https://t.co/QE9iLvG5t2 … https://t.co/1mOCgmpXbG #bioRxiv Now freely available on @github and @pypi
— Jose Manuel Martí (@dyn_omics) February 18, 2019

"Negative control sequence were subtracted from patient sample reads by Recentrifuge" 👏👏👏 https://t.co/NdMWxovu4V
— Jose Manuel Martí (@dyn_omics) 2020年6月9日

Happy to see our systems biology approach to Verticillium wilt of olive published in @BioMedCentral Plant Biology: Metatranscriptomic dynamics after Verticillium dahliae infection and root damage in Olea europaea: https://t.co/wEtshVtA6l
— Jose Manuel Martí (@dyn_omics) 2020年2月23日

インストール

macos10.14のpython3.7.1環境とmacos10.12のpython3.6.7環境でテストした。

依存

Python 3.6 is required.

pandas for exporting results to CSV or TSV as extra files or for testing Recentrifuge.
openpyxl package is also required, additionally, for pandas to export results in Excel format.
matplotlib and xlrd are needed in addition to the previous packages for comprehensive testing the Recentrifuge package.

pip install pandas openpyxl xlrd matplotlib

#bioconda(link)
conda install -c bioconda -y recentrifuge

本体　Github

pip install recentrifuge

> rcf -h

$ rcf -h

=-= /Users/kazuma/miniconda3/bin/rcf =-= v0.28.7 - Mar 2019 =-= by Jose Manuel Martí =-=

usage: rcf [-h] [-V] [-n PATH] [--format GENERIC_FORMAT]

(-f FILE | -g FILE | -l FILE | -r FILE | -k FILE) [-o FILE]

[-e OUTPUT_TYPE] [-c CONTROLS_NUMBER] [-s SCORING] [-y NUMBER]

[-m INT] [-x TAXID] [-i TAXID] [-a] [-z NUMBER] [-w INT]

[-u OPTION] [-t] [--nokollapse] [-d] [--sequential]

Analyze results of metagenomic taxonomic classifiers

optional arguments:

-h, --help show this help message and exit

-V, --version show program's version number and exit

input:

Define Recentrifuge input files and formats

-n PATH, --nodespath PATH

path for the nodes information files (nodes.dmp and

names.dmp from NCBI)

--format GENERIC_FORMAT

Format of the output files from a generic classifier

included with the option -g. It is a string like

"TYP:csv,TID:1,LEN:3,SCO:6,UNC:0" where valid file

TYPes are csv/tsv/ssv, and the rest of fields indicate

the number of column used (starting in 1) for the

TaxIDs assigned, the LENgth of the read, the SCOre

given to the assignment, and the taxid code used for

UNClassified reads

-f FILE, --file FILE Centrifuge output files. If a single directory is

entered, every .out file inside will be taken as a

different sample. Multiple -f is available to include

several Centrifuge samples.

-g FILE, --generic FILE

Output file from a generic classifier. It requires the

flag --format (see such option for details). Multiple

-g is available to include several generic samples.

-l FILE, --lmat FILE LMAT output dir or file prefix. If just "." is

entered, every subdirectory under the current

directory will be taken as a sample and scanned

looking for LMAT output files. Multiple -l is

available to include several samples.

-r FILE, --clark FILE

CLARK full-mode output files. If a single directory is

entered, every .csv file inside will be taken as a

different sample. Multiple -r is available to include

several CLARK, CLARK-l, and CLARK-S full-mode samples.

-k FILE, --kraken FILE

Kraken output files. If a single directory is entered,

every .krk file inside will be taken as a different

sample. Multiple -k is available to include several

Kraken (version 1 or 2) samples.

output:

Related to the Recentrifuge output files

-o FILE, --outhtml FILE

HTML output file (if not given, the filename will be

inferred from input files)

-e OUTPUT_TYPE, --extra OUTPUT_TYPE

type of extra output to be generated, and can be one

of ['FULL', 'CMPLXCRUNCHER', 'CSV', 'TSV']

tuning:

Coarse tuning of algorithm parameters

-c CONTROLS_NUMBER, --controls CONTROLS_NUMBER

this number of first samples will be treated as

negative controls; default is no controls

-s SCORING, --scoring SCORING

type of scoring to be applied, and can be one of

['SHEL', 'LENGTH', 'LOGLENGTH', 'NORMA', 'LMAT',

'CLARK_C', 'CLARK_G', 'KRAKEN', 'GENERIC']

-y NUMBER, --minscore NUMBER

minimum score/confidence of the classification of a

read to pass the quality filter; all pass by default

-m INT, --mintaxa INT

minimum taxa to avoid collapsing one level into the

parent (if not specified a value will be automatically

assigned)

-x TAXID, --exclude TAXID

NCBI taxid code to exclude a taxon and all underneath

(multiple -x is available to exclude several taxid)

-i TAXID, --include TAXID

NCBI taxid code to include a taxon and all underneath

(multiple -i is available to include several taxid);

by default, all the taxa are considered for inclusion

-a, --avoidcross avoid cross analysis

fine tuning:

Fine tuning of algorithm parameters

-z NUMBER, --ctrlminscore NUMBER

minimum score/confidence of the classification of a

read in control samples to pass the quality filter; it

defaults to "minscore"

-w INT, --ctrlmintaxa INT

minimum taxa to avoid collapsing one level into the

parent (if not specified a value will be automatically

assigned)

-u OPTION, --summary OPTION

select to "add" summary samples to other samples, or

to "only" show summary samples or to "avoid" summaries

at all

-t, --takeoutroot remove counts directly assigned to the "root" level

--nokollapse show the "cellular organisms" taxon

advanced:

Advanced modes of running

-d, --debug increase output verbosity and perform additional

checks

--sequential deactivate parallel processing

rcf - Release 0.28.7 - Mar 2019

This program is free software: you can redistribute it and/or modify

it under the terms of the GNU Affero General Public License as

published by the Free Software Foundation, either version 3 of the

License, or (at your option) any later version.

This program is distributed in the hope that it will be useful,

but WITHOUT ANY WARRANTY; without even the implied warranty of

MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the

GNU Affero General Public License for more details.

You should have received a copy of the GNU Affero General Public License

along with this program. If not, see <https://www.gnu.org/licenses/>.

> rextract -h

$ rextract -h

=-= /Users/kazuma/miniconda3/bin/rextract =-= v0.28.7 - Mar 2019 =-= by Jose Manuel Martí =-=

usage: rextract [-h] [-V] [-d] -f FILE [-l NUMBER] [-m NUMBER] [-n PATH]

[-i TAXID] [-x TAXID] [-y NUMBER] (-q FILE | -1 FILE)

[-2 FILE]

Selectively extract reads by Centrifuge output

optional arguments:

-h, --help show this help message and exit

-V, --version show program's version number and exit

-d, --debug increase output verbosity and perform additional

checks

-f FILE, --file FILE Centrifuge output file.

-l NUMBER, --limit NUMBER

Limit of FASTQ reads to extract. Default: no limit

-m NUMBER, --maxreads NUMBER

Maximum number of FASTQ reads to search for the taxa.

Default: no maximum

-n PATH, --nodespath PATH

path for the nodes information files (nodes.dmp and

names.dmp from NCBI)

-i TAXID, --include TAXID

NCBI taxid code to include a taxon and all underneath

(multiple -i is available to include several taxid).

By default all the taxa is considered for inclusion.

-x TAXID, --exclude TAXID

NCBI taxid code to exclude a taxon and all underneath

(multiple -x is available to exclude several taxid)

-y NUMBER, --minscore NUMBER

minimum score/confidence of the classification of a

read to pass the quality filter; all pass by default

-q FILE, --fastq FILE

Single FASTQ file (no paired-ends)

-1 FILE, --mate1 FILE

Paired-ends FASTQ file for mate 1s (filename usually

includes _1)

-2 FILE, --mate2 FILE

Paired-ends FASTQ file for mate 2s (filename usually

includes _2)

rextract - Release 0.28.7 - Mar 2019

This program is free software: you can redistribute it and/or modify

it under the terms of the GNU Affero General Public License as

published by the Free Software Foundation, either version 3 of the

License, or (at your option) any later version.

This program is distributed in the hope that it will be useful,

but WITHOUT ANY WARRANTY; without even the implied warranty of

MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the

GNU Affero General Public License for more details.

You should have received a copy of the GNU Affero General Public License

along with this program. If not, see <https://www.gnu.org/licenses/>.

> retaxdump -h

retaxdump -h

usage: retaxdump [-h] [-V] [-n PATH]

Get needed taxdump files from NCBI servers

optional arguments:

-h, --help show this help message and exit

-V, --version show program's version number and exit

-n PATH, --nodespath PATH

path for the nodes information files (nodes.dmp and

names.dmp from NCBI

retaxdump - Release 0.28.8 - Apr 2019

This program is free software: you can redistribute it and/or modify

it under the terms of the GNU Affero General Public License as

published by the Free Software Foundation, either version 3 of the

License, or (at your option) any later version.

This program is distributed in the hope that it will be useful,

but WITHOUT ANY WARRANTY; without even the implied warranty of

MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the

GNU Affero General Public License for more details.

You should have received a copy of the GNU Affero General Public License

along with this program. If not, see <https://www.gnu.org/licenses/>.

追記

公式ではないようですが、docker imageも上がってますね。

https://hub.docker.com/r/replikation/recentrifuge

docker pull replikation/recentrifuge

データベースの準備

作業ディレクトリで以下のコマンドを打つ。

retaxdump

Downloading taxdmp.zip from NCBI FTP... OK!

Extracting nodes.dmp... OK!

Extracting names.dmp... OK!

実行方法

例えばcentrifugeの結果を使う。centrifugeのデータベース準備からランまでの流れはこちらを参照（リンク）。以下は fastqからtaxonomy assignmenを行うコマンドのみ記載。

１、まずcentrifugeをランする。sample1~sample3の3サンプル分のデータを解析する。

centrifuge -x abv -1 sample1_R1.fq -2 sample1_R2.fq -p 16 --report-file sample1_report.txt -S sample1.out

centrifuge -x abv -1 sample2_R1.fq -2 sample2_R2.fq -p 16 --report-file sample2_report.txt -S sample2.out

centrifuge -x abv -1 sample3_R1.fq -2 sample3_R2.fq -p 16 --report-file sample3_report.txt -S sample3.out

optional step、reextractを行う場合は一番下を参照。

２、Recentrifugeで再解析する。

rcf -f sample1.out -f sample2.out -f sample3.out

htmlファイルとxlsxファイルが出力される。複雑なメタゲノムサンプルだと、３サンプルでも数十分はかかる。

f:id:kazumaxneo:20190419173745j:plain

３、htmlファイルを開く。

kronaを使って結果は視覚化される。

f:id:kazumaxneo:20190419173853j:plain

図の見方は論文図５を参照（ダイレクトリンク）

クリックすればabundance、tax id、種名など確認できる。

f:id:kazumaxneo:20190419174159j:plain

ダブルクリックすれば下位の階級にジャンプできる。

f:id:kazumaxneo:20190419174316j:plain

rootに戻るには中心の文字部分を繰り返しダブルクリックしていく。

sampleを左上から切り替えられる。例えばsample1のみで検出されたものだけ表示することも可能。

f:id:kazumaxneo:20190419172806j:plain

他のパラメータもインタラクティブに変更可能。

sample1

f:id:kazumaxneo:20190419172925j:plain

sample2

f:id:kazumaxneo:20190419172935j:plain

sample3

f:id:kazumaxneo:20190419172943j:plain

サンプルを選ぶたび遠心機が回転するような動作でabundanceが切り替えられる。

shared species

f:id:kazumaxneo:20190419173038j:plain

shared genus

f:id:kazumaxneo:20190419173218j:plain

もちろんジーナス（genus）より上の階級も選択できる。

sample1 exclusive genus

f:id:kazumaxneo:20190419173308j:plain

切り替えは右のメニューからも行える。

f:id:kazumaxneo:20190419173640j:plain

　↑speciesの階級以下を表示する。

xlsxファイルにもkronaと同じ結果（元データ）が表示される。

f:id:kazumaxneo:20190419174812j:plain

別のsheetにより簡潔なsummaryも作られる。

f:id:kazumaxneo:20190419174715j:plain

reextractコマンドを使えば、centrifuge（Kraken）の結果をチェリーピッキング（*1）して、必要なデータのみ抽出してrcfコマンドにかけれる。

rextract -f S1.out -i 1117 -1 S1_R1.fastq -2 S1_R2.fastq

-i <id> NCBI taxid code to include a taxon and all underneath (multiple -i is available to include several taxid). By default all the taxa is considered for inclusion.
-x <id> NCBI taxid code to exclude a taxon and all underneath (multiple -x is available to exclude several taxid).
-y minimum score/confidence of the classification of a read to pass the quality filter; all pass by default
-q Single FASTQ file (no paired-ends)
-1 Paired-ends FASTQ file for mate 1s (filename usually includes _1)
-2 Paired-ends FASTQ file for mate 2s (filename usually includes _2)

引用

Recentrifuge: Robust comparative analysis and contamination removal for metagenomics

Jose Manuel Martí

PLoS Comput Biol. 2019 Apr 8;15(4):e1006967

関連

オーサーがそう述べている。

追記

Recentrifugeについての詳しいブログ記事をどもありがとうございました。Sorry for the confusion about "cherry-pick", I think it may be translated to 日本語 as「注意深く良い物だけを選ぶ」🙂
— Jose Manuel Martí (@dyn_omics) April 20, 2019

Thank you very much !!

注意深く良い物だけを選ぶ、というニュアンスで使われています。修正いたします。

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

メタゲノムの既知および未知バクテリアの存在量を推定single-copy phylogenetic marker genesに基づいて見積もる mOTUs2

教育にも利用できる、何百ものバイオインフォマティクスツールが入った包括的コンテナ環境 ORCA

long readのRNA seq向けマッピング評価ツール AlignQC

E.coliとKlebsiellaに対応したWGSからのプラスミド検出ツール PlaScope

スプライシングジャンクションを上手く処理できるエラーの多いロングリードRNA seqのアライナーdeSALT

machine leraningも併用するmetagenomeのビニングツール Autometa

メタゲノムのコンタミ除去やメタゲノムのサンプル間比較を行って結果を視覚化する Recentrifuge