2021-02-17

メタゲノムの組成を解析する CAMAMED

　メタゲノミクスは、分子ゲノミクス、微生物生態学、データ解析が交差する学際的な研究分野である。この分野の主な研究対象は、ある環境に存在する微生物のゲノム総量を指すメタゲノムである。メタゲノミクスは、ハイスループットゲノムシークエンシング技術をはじめとする微生物の培養に依存しない方法に基づいている。この方法では、腸などの特定の環境に生息する微生物（マイクロバイオーム）のDNAを抽出し、様々な計算機技術を用いて研究する。

　メタゲノムデータを解析するためのアプローチは大きく分けて2つある。(i) サンプル中の微生物の系統的多様性を記述する分類学的プロファイリング、(ii) サンプル中の微生物の機能的能力についての洞察を得るために、ゲノム配列を「機能的」グループにマッピングするための計算戦略を含む機能的プロファイリングである。

　マイクロバイオームの包括的な機能解析は、微生物群集の生化学的能力に関する我々の理解を大幅に向上させることができる(ref.3)。マイクロバイオームの機能性プロファイリングのための戦略として、リードをアセンブリしてより大きなコンティグにし、機能性プロファイリングのために遺伝子量を予測することが考えられる(ref.4)。これまで研究されていないマイクロバイオームに対しては、従来のアセンブリのほとんどが比較的精度が低いという事実にもかかわらず、アセンブリに基づく戦略は避けられない(ref.5)。対照的に、ヒトの腸のように既に広範囲に研究されている環境では、良好なマイクロバイオーム遺伝子カタログが以前に編集されている(ref.6)。このような場合には、マッピングに基づく解析をメタゲノムの機能解析に利用することができる。

　ショットガンメタゲノミクス研究では、微生物のコレクションは、培養や分離を行わずに直接DNAシークエンシングを行うことで研究される。遺伝子カタログにマッピングされた遺伝子の頻度を比較することにより、サンプル間の機能的差異を研究することができる。このように、遺伝子の豊富さデータは高いレベルの系統的変動の影響を受け、統計的な力を大きく低下させ、偽陽性を増加させる可能性がある(ref.7)。メタゲノムデータの系統的変動が遺伝子や微生物の観察された豊富さに影響を与える理由は多くある。重要な理由の一つは、各サンプルが異なる数のシークエンシングリードを持つことである。つまりシークエンシングの深さが異なることである(ref.8)。系統的変動の他の理由としては、サンプリング方法の不一致、DNA抽出、シーケンシングランの品質のばらつき、リードマッピングのエラー、リファレンス遺伝子カタログの不完全さなどが挙げられる(ref.9)。さらに、系統的変動は、微生物の平均ゲノムサイズの違い、種の豊富さ、リードに関連するGC含有量の違いに起因することがあり、これが観察された遺伝子の豊富さに影響を与える可能性がある（(ref.7,10）。

　次世代シークエンシング（NGS）データもまた、本質的に組成的(compositional)である。組成(compositional)とは、各ヌクレオチド断片の相対的な豊富さが他の断片の豊富さに依存していることを意味する。この性質は、シーケンシングマシンおよび基礎となる方法に関係しており、結果として得られる配列は、増幅およびそれに続くヌクレオチドサンプリングに関与するバイアスの影響を受ける((ref.11)。したがって、組成は、全体の不明確な部分である測定（例えば、NGSシーケンシングによって生成されたメタゲノムカウントデータ）において、このような曖昧さの結果である。The compositional data analysis (CoDA) は、このバイアスを処理して解決することを目的としている（(ref.12）。

　メタゲノムカウントデータもまた、他のNGSデータと比較して、より厳しい課題に直面している。これらの課題の1つは、異なるサンプルにおけるシークエンシングされたリードの数やシークエンシングの深さが非常に変動しやすいことである。第二の課題は、ゼロインフレーションと呼ばれるメタゲノムカウントデータにおけるゼロの割合が非常に高いことである（約50～90％）(ref.13)。また、メタゲノムデータは、他のNGSデータと比較して非常に高次元である（例えば、腸内マイクロバイオーム遺伝子カタログのサンプルでは、約10Mの遺伝子配列やfeaturesがある（ref.6））。一方で、DNAサンプリングの頻度が低いため、非常に希少な分類群は記録されておらず、テクニカルゼロと呼ばれている。また、構造的ゼロと呼ばれる欠落した個体群が記録されていない場合もある。もう一つの課題は、研究のサイズと、分類群の分布の大きなばらつき（過剰分散）である（ref.14）。

　正規化処理は系統的な変動や組成の偏りを識別して除去することができるため、データの前処理や解析には欠かせないステップである。高次元カウントデータに対しては多くの正規化手法が提案されてきたが、そのほとんどはメタゲノムデータでの性能が評価されていない(ref.7)。組成データに関わる課題に対処するために、様々なアプローチがまだ提案されている。例えば、シーケンシングの深さが不均一であるという問題を解決するために、2つのアプローチが紹介されている。第一に、異なるサンプルのリード数を再スケーリングしてライブラリサイズの固定値を達成すること、第二に、すべてのサンプルについて固定リード数を達成するためにリードを再サンプリングすることである(ref.2)。また、多くのCoDA法では、正規化の代わりに変換を使用している。これらの方法では、対数比変換を用いてデータを実空間にマッピングするため、従来の統計的手法を用いて更なる解析を行うことが可能となる。これらの方法は、データ変換にサブセット特徴量の幾何平均などの参照を使用しようとしている（ref.15）。

　説明したように、メタゲノムの組成データにおける主要な課題の一つは、sparsity（まばら）またはゼロインフレーションであり、これは遺伝子中心のカウントデータではより深刻になる。この問題を処理し、比較遺伝子量研究の精度を向上させるために、いくつかのパッケージやツールが開発されてきた。いくつかのパッケージは、メタゲノムデータのゼロを扱うために特別に開発されてきた。metagenomeSeqでは、遺伝子量データにゼロインフレーション対数正規モデルを使用している(ref.16,17)。この方法では、ゼロインフレーションはサンプル固有のものであり、シークエンシングの深さに依存すると仮定している(ref.18)。比率アプローチ（Ratio approach for identifying differential abundance (RAIDA)）は、最初にカウントを対数正規分布で記述される相対頻度に変換する統計モデルを使用している。RAIDA は、ほとんどの特徴は変動していないと仮定しているため、分類群や遺伝子レベルでのメタゲノムデータの解析に適している。また、このツールは 2 つの異なる条件でのメタゲノムデータサンプルの比較分析のために開発されたもので、2 つ以上の条件に一般化することが可能である(ref.19)。また、メタゲノムデータの統計モデルとしては、ゼロインフレーション負二項式やゼロインフレーションβ回帰モデルなど、いくつかのゼロインフレーション統計モデルがある(ref.20,21)。

　しかし、他にも、ANCOM（ref.22）、ZIBSeq（ref.21）、CPL（ref.23）、DESeq2（ref.24）、edgeR（ref.25）など、CoDAのためのパッケージやツールが提供されている。これらの手法は、異なる統計的手法を駆使し、ハイスループットシーケンシングデータの組成バイアスやゼロインフレーションを処理しようとしている。DESeq2とedgeRを含む上記のパッケージのほとんどは、もともとRNA-seqデータ解析のために開発されたものであることに注意することが重要である。最も広く使用されているCoDAパッケージのいくつかとその特性を論文表1に整理した。

　また、組成データの正規化手法も数多く提案されている。これらの手法のいくつかは、RNA-seqデータ、アンプリコンシーケンシングによって生成されたオペレーショナル・タクソノミック・ユニット(OTU)データ、およびメタゲノムデータに対して提供されている。しかし、これらの方法の比較研究では、性能とデータの特性の間に大きな依存性があることが示されている。例えば、RNA-seqデータに対してより優れた性能を持つ方法が、必ずしもメタゲノムデータに適しているとは限らない(ref.7)。これらの正規化手法のいくつかを以下に説明する。Total sum scaling（TSS）は、正規化された値の合計が１となるように、個々のカウントをサンプル中の総カウントで除算することで得られるカウントデータの標準的な正規化方法である（ref.28）。固定のスケーリング係数を持つ TSS は、技術的なシーケンシングのバイアスにより OTU カウントに悪影響を与える可能性がある。Cumulative sum scaling (累積和スケーリング)(CSS) （ref.16）は、低頻度（比較的一定で独立した）四分位に基づいてサンプルを再スケーリングし、高頻度サンプルの影響を排除しない。さらに、CSSは対数正規分布モデルとして高く評価されており、多くの研究で分類群/遺伝子レベルでのCoDAに使用されている(ref.2,18,29,30)。 Trimmed mean of M-values (TMM) は、差のある発現を識別するために、サンプル間のスケールファクターを推定する。TMM正規化は、ほとんどの遺伝子の発現がサンプル間で変動していないことを前提としている(ref.31)。組成データを正規化するための別の方法のファミリーも存在し、これは対数比変換に基づいている。組成における特徴間の相互依存性は、個々の特徴の分析が、各サンプルを新しい空間に変換するリファレンスベースラインに対して行われ、統計分析がこの新しい空間で行われることを暗示している。基準の選択に基づいて、異なる対数比に基づく方法が開発された。中心対数比（CLR）、加法対数比（ALR）、相対対数式（RLE）変換は、基準を選択するために異なる戦略を使用している（ref.11,32）。論文表1は、いくつかのパッケージ、関連する正規化方法、特性、およびCoDAのためのそれらの利点の詳細を示している。

　メタゲノムデータの機能解析のためには、メタゲノムデータの組成を考慮したパイプラインを使用する必要がある。本論文では、メタゲノムデータの分類学的・機能的解析を行うためのマッピングベースのソフトウェアパイプラインであるCAMAMEDを紹介する。そのため、適切な正規化手法は、メタゲノムデータの解析において非常に重要である。このデータの特性（例えば、スパース、高次元性、レアファクション、アンダーサンプリング、シークエンスデプスさの違いなど）を考慮すると、metagenomeSeq とその正規化手法としてのCSSは、メタゲノム研究において最も広く利用されている手法の一つである（ref.2,17,18,33）。しかし、METAGENOMESeqの最も重要な欠点の一つは、サンプルサイズが小さい場合のdifferential abundance analysis（存在量変動解析）の偽陽性率が高いことであると報告されている(ref.34,35)。CAMAMEDでは、分類、遺伝子、KO、EC数、反応レベルの疎なメタゲノムデータの組成バイアスに対応するため、metagenomeSeqを使用している。CAMAMED は、Linux オペレーティングシステムをベースとした Python3 と Shell プログラミングを用いて実装されている。CAMAMEDは非専門家向けに設計されており、比較的簡単に実行できるようになっている。また、CAMAMEDをより簡単に利用できるように、2つのDockerイメージを構築し、インストールの詳細や依存関係を介さずにCAMAMEDパイプラインを実行できるようにした。これらのイメージは www.hub.docker.com にある。
　この研究では、80のメタゲノムショットガン配列の糞便サンプルを使用した。このデータセット（当初は「cohort1」と命名）には、健常者（コントロール）24名、大腸腺腫27名、大腸がん29名からのサンプルが含まれている。これらのサンプルは、Illuminaプラットフォームとペアエンドシークエンシング法を用いてシークエンシングされた(36)。また、サンプル中の遺伝子の豊富さを得るために、9.88×106の非重複遺伝子配列を持つ既報の遺伝子カタログを用いた。このカタログは統合遺伝子カタログ（IGC）と呼ばれ、ヒト腸内マイクロバイオームのヌクレオチドとタンパク質の遺伝子配列が掲載されている（ref.6）。

マニュアルPDFはレポジトリに含まれています。

https://github.com/mhnb/CAMAMED/blob/master/CAMAMED-manual.pdf

インストール

docker imagesを使ってテストした。

依存

Software Requirements

Linux operating system (Preferably Ubuntu)
Python 2 or 3 (Preferably Python ≥3.7 and it is also better to have Python3 by default)

Hardware Requirements

It is better to run on a computer with at least 32GB of RAM and ten cores of processor.

Github

Docker

#dockerhub (link)
docker pull camamed/camamed_pipeline_db:latest

> ./camamed_pre_processing.py

# ./camamed_pre_processing.py

This function is a preprocessing step in pipeline

Command: ./camamed_pre_processing.py method [options]

method:

-h Shows help related to this function

-sra Running SAR toolkit to convert SRA files to Fastq or Fasta files. (**optional**)

Copy SRA samples in the /CAMAMED/sra_files folder.

Copy the sample names in the /CAMAMED/sra_files/sra_file_names.txt.

for example:

file1.sra

file2.sra

file3.sra

After executing SRA-Toolkit, Fastq or Fasta files are automatically copied to folder /Read_files.

For Example: ./camamed_pre_processing.py -sra

-gec Get information on sequences and gene catalogs.

options:

-gc (gene_catalog): Gene catalog sequences must have Fasta format.

First copy the gene catalog into the /CAMAMED folder.

-rff (read_files_format)[fastq/fasta]:

Copy the sample files in the /CAMAMED/Read_files folder.

Enter the file name of the samples in the /CAMAMED/Read_files/sample_file_names.txt

For single end fastq files:

file1.fastq

file2.fastq

file3.fastq

For paired end fastq files:

file1_1.fastq

file1_2.fastq

file2_1.fastq

file2_2.fastq

-rt (read_type): paired end or single end [p/s]

-is (insert_size): For paired end sequences (An integer number) or ignor enter -1 (default=-1).

For Example: ./camamed_pre_processing.py -gec -gc gene_catalog.fa -rff fastq -rt p

./camamed_pre_processing.py -gec -gc gene_catalog.fa -rff fastq -rt p -is 350

-cdh Running CD-HIT on the gene catalog to remove redundant genes. (**optional**)

You can use this option if you think your gene catalog has redundant sequences.

options:

-sit (sequence_identity_threshold): This value can be in the range of 0.8 to 1 (default=0.9).

For Example: ./camamed_pre_processing.py -cdh

./camamed_pre_processing.py -cdh -sit 0.95

-pgc Peprocessing gene catalog.

At this point, the names of the genes are deleted from the gene catalog and stored in the /CAMAMED/files/gene_name.txt

and for the genes the gene1, gene2 and ... are respectively selected.

For Example: ./camamed_pre_processing.py -pgc

> ./camamed_quality_control.py -h

# ./camamed_quality_control.py -h

This function executes fastqc quality control and SeqKit tools on sample sequences.

At this state, if the sequences have Fastq format, they will be quality controlled and their statistical information extracted.

But if they have Fasta format, only their statistical information obtained.

Before running this step, the sample data should be in the /CAMAMED/Read_files folder

and the file names should be written in the text file /CAMAMED/Read_files/sample_file_names.txt, respectively.

Command: ./camamed_quality_control.py method

method:

-h Shows help related to this function.

-fqc Execut FastQC quality control only for Fastq files.

For Example: ./camamed_quality_control.py -fqc

-sek Execut SeqKit to extract information from sample files.

For Example: ./camamed_quality_control.py -sek

-eti Extract total information from SeqKit outputs.

For Example: ./camamed_quality_control.py -eti

> ./camamed_metaphlan_profiling.py -h

# ./camamed_metaphlan_profiling.py -h

In this function, using the Metaphlan2 software extract the abundance of all bacteria.

Metaphlan2 can produce taxonomic profiling at different levels.

Such as Kingdom, Phylum, Class, Order, Family, Genus, and Species.

Command: ./camamed_metaphlan_profiling.py method [options]

method:

-h Shows help related to this function.

-mph Execute metaphlan2.

options:

-tl (taxa_level): For selecting Kingdom, Phylum, Class, Order, Family, Genus, or Species,

Select one of {'k', 'p', 'c', 'o', 'f', 'g', 's'}

-c (core_number): select number of cores for Metaphlan2 execution (default=1).

The number of cores must be a positive integer, otherwise one is chosen.

For Example: ./camamed_metaphlan_profiling.py -mph -tl s -c 3

-imp Extract information from metaphlan2 output files.

For Example: ./camamed_metaphlan_profiling.py -imp

> ./camamed_kegg_annotation.py -h

# ./camamed_kegg_annotation.py -h

This function extract annotated information from KEGG databases

Command: ./camamed_kegg_annotation.py method [options]

method:

-h Shows help related to this function

-pgc Preparing the Gene Catalog for extracting annotated information from the KEGG databases.

To obtain KOs associated with gene sequences using web services GhostKOALA and KAAS,

it would be better if the file size of the gene catalog is less than 300 MB.

The gene catalog is stored after the preprocessing and deletion of the duplicate genes (optional)

named main_gene_catalog.fa in the CAMAMED folder of the program.

If the gene catalog size is more than 300MB, you can use this option to convert it to smaller files.

For example, if the gene catalog has 300 gene sequences,

you can convert it to three files with 100 genes by entering the values 100 200.

For Example: ./camamed_kegg_annotation.py -pgc 100 200

-egk Extracting information from GhostKOALA or KAAS output files.

In this step, the data of the files generated by GhostKOALA or KAAS server are read and the required information is extracted.

To do this, first, copy the text files into CAMAMED/kegg_annotation and insert the filenames based on the example with space.

Be sure to arrange the file names based on the gene number.

For Example: ./camamed_kegg_annotation.py -egk gk_file1.txt gk_file2.txt

-kla KEGG local anotation

We extracted all the annotated information related to the KOs and EC numbers and reactions to the date 2019/6/30.

from the KEGG database and saved in the 'kegg_annotation' folder in the text files that start with the def prefix.

At this step, we extract all the annotated information associated with the sequences step by step.

For Example: ./camamed_kegg_annotation.py -kla

-kua KEGG user annotation

We extracted all the annotated information related to the KOs and EC numbers and reactions to the date 2019/6/30

from the KEGG database and stored in the CAMAMED/kegg_annotation folder in the text files that start with the def prefix.

But if you want to extract this information yourself, you can use these options.

****These steps may take a long time****

options:

-koec Extract KO-related EC numbers from the KEGG database.

-ecre Extract EC-related reactions from the KEGG database.

-reeq Extract reaction-related equation from the KEGG database.

For Example: ./camamed_kegg_annotation.py -kua -koec

./camamed_kegg_annotation.py -kua -ecre

./camamed_kegg_annotation.py -kua -reeq

> ./camamed_data_normalization.py -h

# ./camamed_data_normalization.py -h

This function Extract samples information in normalized bacteria, gene, KO, EC number and reaction matrix.

Command: ./camamed_data_normalization.py method [options]

method:

-h Shows help related to this function

-nlk Normalizing data using local information previously extracted from the KEGG database.

option:

-mtct Minimum total counts for each taxon. This value is percentage of all taxa, for example 0.5 (default=0.001).

-mtcg Minimum total counts of mapped reads per gene in total samples (default=5).

For Example: ./camamed_data_normalization.py -nlk

./camamed_data_normalization.py -nlk -mtct 0.002 -mtcg 10

-nuk Normalizing data using information extracted by the user from the KEGG database.

option:

-mtct Minimum total counts for each taxon. This value is percentage of all taxa, for example 0.5 (default=0.001).

-mtcg Minimum total counts of mapped reads per gene in total samples (default=5).

For Example: ./camamed_data_normalization.py -nuk

./camamed_data_normalization.py -nuk -mtct 0.002 -mtcg 10

> ./camamed_statistical_test.py -h

# ./camamed_statistical_test.py -h

This function perform the statistical test (Kruskal-Wallis H-test or ANOVA test) on normalized data

At this stage, we perform the statistical test on normalized bacteria, gene, KO, EC number, and Reaction data

that stored in the all_results folder.

To get started, you first select the label of class for each sample in the /Read_files/class_label.txt file.

The number of classes can be in the range [2:10]. For each sample, enter a separate row. for example:

class1

class2

class1

class3

Command: ./camamed_statistical_test.py method [options]

method:

-h Shows help related to this function

-kwd Running the statistical test on the default annotated data.

option:

-tsty Statistical type (Kruskal-Wallis H-test or ANOVA test)[krw/ano]

-pval The p-value to filter the output. This value can be in the interval [0:1] (default=0.05)

For Example: ./camamed_statistical_test.py -kwd -tsty krw

./camamed_statistical_test.py -kwd -tsty ano

./camamed_statistical_test.py -kwd -tsty krw -pval 0.01

./camamed_statistical_test.py -kwd -tsty ano -pval 0.01

-kwu Running the statistical test on the user extracted data.

option:

-tsty Statistical type (Kruskal-Wallis H-test or ANOVA test)[krw/ano]

-pval The p-value to filter the output. This value can be in the interval [0:1] (default=0.05)

For Example: ./camamed_statistical_test.py -kwu -tsty krw

./camamed_statistical_test.py -kwu -tsty ano

./camamed_statistical_test.py -kwu -tsty krw -pval 0.01

./camamed_statistical_test.py -kwu -tsty ano -pval 0.01

> ./camamed_mapping_mosaik.py -h

# ./camamed_mapping_mosaik.py -h

This function maps the reads to the genes catalog using the MOSAIK software.

To mapping the read sequences to the gene catalog, the following two steps should be implemented.

In the first step, the gene catalog should be converted into an acceptable form for the MOSAIK software.

In the second step, the reads are mapped to the gene catalog.

Command: ./camamed_mapping_mosaik.py method [options]

method:

-h Shows help related to this function

-cag Creating an acceptable form of gene catalog.

options:

-hw (hash word): Select the length of the hash word that can be in range 4 to 32 (default=15).

For Example: ./camamed_mapping_mosaik.py -cag

./camamed_mapping_mosaik.py -cag -hw 17

-mrc Mapping reads to the gene catalog using MOSAIK software.

options:

-c (core_number): Select number of cores for MOSAIK execution (default=1).

The number of cores must be a positive integer.

For Example: ./camamed_mapping_mosaik.py -mrc

./camamed_mapping_mosaik.py -mrc -c 5

実行方法

１、CAMAMEDはマッピングベースの手法のため、最適なマッピング結果を得るためには、適切な遺伝子カタログを提供する必要がある。メタゲノムのリード配列は、MOSAIK v2.2.3ソフトウェアを使用して遺伝子カタログにマッピングされる。

２、遺伝子カタログにリードをマッピングした後、各サンプルの遺伝子頻度を計算する。機能プロファイリングには、遺伝子とゲノムの包括的なリストとそれらの生化学的アノテーションを含むKEGGデータベースを使用する。各遺伝子に関連するKEGGオルソロジー（KO）グループを抽出するには、KEGG Automatic Annotation Server（KAAS, Ver.2.1）（ref.42）を利用する。このツールを用いて、ヌクレオチド配列とアミノ酸配列をKEGGにマッピングし、関連するKO群を検索する。また、アミノ酸配列をKEGGにマッピングするGhostKOALA Ver.2.2(43)を使用してもよい。各遺伝子に関連するＫＯを抽出した後、各ＫＯについて、酵素コミッション（EC）番号と反応IDを取得する。

３、遺伝子の頻度を抽出した後、これらのデータを組成バイアスのために補正する。このステップの前に、すべてのサンプル中のマッピングされたりーどが5未満の遺伝子は排除される。次に、各遺伝子の頻度を遺伝子の長さで割って、長さの異なる遺伝子の正規化されたアバンダンスを計算し、KO、EC数、反応頻度を計算する。そして、組成補正と正規化には、CSS法を用いたmetagenomeSeqパッケージを使用する。この方法を用いて、分類学データの種レベルでの組成の偏りを補正している。KO, EC数, 反応の正規化度数を計算するために、それらのサブセット遺伝子の正規化度数の和を使用する。

カレントと共有してイメージのラン

sudo docker run --rm -itv $PWD:/data camamed/camamed_pipeline_db:latest

１、準備

ペアエンドfastqファイルは、/camamed/sample_input_files/Read_files/paired_end_fastq/に用意されている。これを Read_files/にコピーする。

cp /camamed/sample_input_files/Read_files/paired_end_fastq/* /camamed/Read_files/
cp sample_input_files/gene_catalog/gene_catalog.fa

以下のコマンドを実行する。

./camamed_pre_processing.py -gec -gc gene_catalog.fa -rff fastq -rt p -is 350

-gec Get information on sequences and gene catalogs. options:
-gc (gene_catalog): Gene catalog sequences must have Fasta format.
First copy the gene catalog into the /CAMAMED folder.
-rff (read_files_format)[fastq/fasta]
-rt (read_type): paired end or single end [p/s]
-pgc Peprocessing gene catalog.
At this point, the names of the genes are deleted from the gene catalog and stored in the /CAMAMED/files/gene_name.txt
-is (insert_size): For paired end sequences (An integer number) or ignor enter -1 (default=-1).

Rad_files/sample_file_names.txtが出力される。

> cat cat Read_files/sample_file_names.txt

p_file1_1.fq

p_file1_2.fq

p_file2_1.fq

p_file2_2.fq

p_file3_1.fq

以後のコマンドではfastqのリストが自動で認識される。すなわち、CAMAMED/Read_files/sample_file_names.txtに記載されたfastqが使用される。

２、シークエンシングリードの前処理。Fastq/fastaに対して、Fastqcや SeqKit ツールを実行。３つのコマンドを順番に実行する。

./camamed_quality_control.py -fqc
./camamed_quality_control.py -sek
./camamed_quality_control.py -eti

-fqc Execut FastQC quality control only for Fastq files. For Example: ./camamed_quality_control.py -fqc
-sek Execut SeqKit to extract information from sample files. For Example: ./camamed_quality_control.py -sek
-eti Extract total information from SeqKit outputs. For Example: ./camamed_quality_control.py -eti

３、Metaphlan2を用いた細菌の豊富さの抽出。Metaphlan2は様々なレベルでの分類学的プロファイリングを行うことができる。ここでは種レベルを選択。

./camamed_metaphlan_profiling.py -mph -tl s -c 12
./camamed_metaphlan_profiling.py -imp

-mph Execute metaphlan2. options:
-tl (taxa_level): For selecting Kingdom, Phylum, Class, Order, Family, Genus, or Species, Select one of {'k', 'p', 'c', 'o', 'f', 'g', 's'}
-c select number of cores for Metaphlan2 execution (default=1).
-imp Extract information from metaphlan2 output files.

4、MOSAIKを用いてリード配列を遺伝子カタログにマッピングする。最初のステップでは、遺伝子カタログを MOSAIK ソフトウェアで使用可能な形式に変換する。第二のステップでは、リードを遺伝子カタログにマッピングする。

./camamed_mapping_mosaik.py -cag
./camamed_mapping_mosaik.py -mrc -c 12

-cag Creating an acceptable form of gene catalog. options:
-hw (hash word): Select the length of the hash word that can be in range 4 to 32 (default=15).
-mrc Mapping reads to the gene catalog using MOSAIK software.
-c (core_number): Select number of cores for MOSAIK execution (default=1).
The number of cores must be a positive integer.

ここでトラブル発生。以下、未テスト。

KEGGデータベースからアノテーション情報を抽出。

./camamed_kegg_annotation.py -pgc 100 200
./camamed_kegg_annotation.py -egk gk_file1.txt gk_file2.txt

./camamed_kegg_annotation.py -kla
./camamed_data_normalization.py -nlk
./camamed_statistical_test.py -kwd -tsty krw

./camamed_kegg_annotation.py -kua -koec
./camamed_kegg_annotation.py -kua -ecre
./camamed_kegg_annotation.py -kua -reeq
./camamed_data_normalization.py -nuk
./camamed_statistical_test.py -kwu -tsty krw

引用

CAMAMED: a pipeline for composition-aware mapping-based analysis of metagenomic data
Mohammad H Norouzi-Beirami, Sayed-Amir Marashi, Ali M Banaei-Moghaddam, Kaveh Kavousi
NAR Genomics and Bioinformatics, Volume 3, Issue 1, March 2021

参考

分野は異なりますが、組成データについては以下の資料を読みました。

関連

2021-02-16

データベースやゲノムアセンブリの汚染・キメラアセンブリ配列を調べる conterminator

2020 Genome Biology contamination

2022/06/22 タイトル修正, インストール手順追記

　公的・私的リポジトリのゲノム数は、少なくとも過去10年間で急増しており、その主な理由は、塩基配列決定にかかる費用が急速に低下したことにある。また、公開されているゲノムデータベースGenBankは、EMBLやDDBJと定期的に同期しており、約18ヶ月ごとにその数が倍増している。これらのゲノムデータベースは、約30年間にわたってバイオテクノロジーと医学の新しい知見を推進してきた重要な世界的リソースである。

　このデータベースは、数百から数千の無秩序なDNA配列断片からなるドラフトゲノムであり、GenBankに保存されている50万以上のゲノムの大部分を占めている。これらの断片の中には、試薬、実験材料、サンプル処理の人工物、多重化されたシークエンシングの実行による交差汚染などにより、外来DNAが含まれているものもある（論文図1a）。これらの汚染された配列は、メタゲノム研究における配列のラベル付けの誤り、遺伝子の水平移動に関する誤った結論、ゲノムのアノテーションの質の低下など、様々な問題を引き起こす可能性がある。

　汚染の問題に対処するために、NCBI（GenBankの本拠地）では、汚染された断片を検出するために2つのフィルタリングプロトコルを適用している。第一に、合成配列（ベクター、アダプター、リンカー、プライマーなど）を検出するためにVecScreenを使用し、第二に、一般的な汚染物質とのBLASTアライメントにより、より広範な汚染配列を検出する。これらのフィルターにもかかわらず、汚染は依然として発生しており、その検出は依然として困難な状況にある。

ヒトDNAはシーケンシングラボに常に存在しているため、ホモ・サピエンスはゲノムプロジェクトの主要な汚染源であり続けている。自動検索にもかかわらず、公開されているゲノムの中に汚染されたヒトDNAの断片が残っていることがある [ref.11]。例えば、最近の研究では、何千ものヒトDNA断片が細菌ゲノムのドラフトに含まれている可能性があり、これらの多くが誤ってタンパク質として翻訳され、アノテーションされていることが示されている[ref.10]。しかし、他の多くの種[ref.12-16]も汚染の原因となっている。汚染を検出するための系統的なアプローチは、提出されたすべてのゲノムを他の既知のすべてのゲノムと比較する計算コストによって制限されている。例えば、1.5Tbのサイズを持つRefSeqデータベースのBLAST全対全比較を行うには、約30,000CPU年かかると言われている。Minimap2やBowtie2のようなより高速なアラインメント手法は時間はかからないが、この比較の二次的な複雑さに悩まされることになる。Mashやsourmashのような他の高速な方法は、より迅速にゲノムを比較することができるが、より大きなゲノム内の小さな汚染配列を見つけるのには適していない。

　本研究では、分類学的キングダム間の局所的なアラインメントを計算することで、ヌクレオチドデータベースやタンパク質データベースの汚染を検出する高速な手法であるConterminator（論文図1b）を紹介する。この手法は、Linclustの線形時間の全対全比較アルゴリズムを利用し、MMseqs2を用いた網羅的なアラインメントを行う。これにより、巨大なヌクレオチドとタンパク質の配列セットを単一のサーバー上で処理することが可能となった。この方法を適用して、ヌクレオチドデータベースGenbankとRefSeq、および包括的なNRタンパク質データベースのコンタミネーションの現状を定量化した。

Conterminatorは、クロスキングドムのアラインメントを検出し、汚染を予測する。MMseqs2の既存のモジュールをベースに、汚染検出用に拡張したものである。Conterminatorは、ゲノム配列の中から、他のキングダムのゲノムと少なくとも100塩基以上の長さで、かつ配列同一性閾値が90%以上でアラインする領域を同定する。ごく少数の例外を除き、異なるキングダムのDNA配列は全くアライメントされるべきではなく、このレベルの同一性で一致する配列は汚染物の有力な候補となる。このルールの例外として、最近の水平方向の遺伝子導入イベントが挙げられるが、これは非常にまれである。

インストール

ubuntu18.04LTSでmambaを使ってテストした。

# SSE4.1
wget https://mmseqs.com/conterminator/conterminator-linux-sse41.tar.gz; tar xvfz conterminator-linux-sse41.tar.gz; export PATH=$(pwd)/conterminator/:$PATH
# AVX2
wget https://mmseqs.com/conterminator/conterminator-linux-avx2.tar.gz; tar xvfz conterminator-linux-avx2.tar.gz; export PATH=$(pwd)/conterminator/:$PATH

# conda、ここでは高速なmambaを使う（link）
mamba install -c bioconda conterminator -y

> conterminator -h

Search for Contamination.

conterminator Version: 570993be7f5f31ee357183c9118bf3aa75575870

Main workflows for database input/output

dna Searches for cross taxon contamination in DNA sequences

protein Searches for cross taxon contamination in protein sequences

Bash completion for modules and parameters can be installed by adding "source MMSEQS_HOME/util/bash-completion.sh" to your "$HOME/.bash_profile".

Include the location of the conterminator binary in your "$PATH" environment variable.

> conterminator dna -h

$ conterminator dna

Usage: conterminator dna <i:fasta/q> <i:mappingFile> <o:result>

<tmpDir> [options]

Options:

Align:

--min-seq-id FLOAT list matches above this sequence

identity (for clustering) (range 0.0-1.0) [0.900]

--min-aln-len INT minimum alignment length (range 0-INT_MAX) [100]

Misc:

--blacklist STR Comma separated list of ignored taxa in

LCA computation

[10239,12908,28384,81077,11632,340016,61964,48479,48510]

Common:

--threads INT number of cores used for the computation

(uses all cores by default) [128]

-v INT verbosity level: 0=nothing, 1: +errors,

2: +warnings, 3: +info [3]

--split-memory-limit BYTE Set max memory per split. E.g. 800B, 5K,

10M, 1G. Defaults (0) to all available system memory. [0]

References:

- Steinegger M, Soding J: MMseqs2 enables sensitive protein sequence

searching for the analysis of massive data sets. Nature Biotechnology,

35(11), 1026-1028 (2017)

Examples:

Searches for cross taxon contamination in DNA sequences

Show an extended list of options by calling 'conterminator dna -h'.

実行方法

BLASTデータベースを指定する。

#1 blastdbcmdコマンドを使って全ての配列を取り出す。
blastdbcmd -db nt -entry all > nt.fna

#2 taxonomical identifierを各fasta配列にアサインしたmapping.fileが必要。これを作る。
blastdbcmd -db nt -entry all -outfmt "%a %T" > nt.fna.taxidmapping

#3 conterminatorの実行
conterminator dna nt.fna nt.fna.taxidmapping nt.result tmp

出力例

${RESULT_PREFIX}_conterm_prediction

f:id:kazumaxneo:20210217004952p:plain

予測された汚染が含まれている。フォーマットについてはレポジトリを確認して下さい。

メモ

上の例は分割されたntⅠつ分の分析結果だが、ntデータベース全ても同時に分析してみたが、物理メモリ128GBのマシンではメモリ不足でエラーになった（2020/0にNCBIから取得）。

引用

Terminating contamination: large-scale search identifies more than 2,000,000 contaminated entries in GenBank
Martin Steinegger & Steven L. Salzberg
Genome Biology volume 21, Article number: 115 (2020)

関連

2021-02-15

: 高次元データのクラスタリングと可視化のためのインタラクティブな教育用ウェブリソース ClusterEnG

KnowEnG PeerJ 2018 education 結果の視覚化 (visualization) web tool

　クラスタリングは、何らかの尺度に従って類似したデータポイントをグループ化することにより、大規模データセットの構造を発見するための最も強力で広く利用されている分析手法の一つである。R(R Core Team, 2015)やPython(Pedregosa et al., 2011)のようないくつかのプログラミング言語は、カスタムデータをクラスタリングし、静的プロットを生成するためのライブラリまたはパッケージを提供している。しかし、ユーザがより深いレベルでデータを理解するのを助けるインタラクティブな可視化は、追加のライブラリや外部ソフトウェアを必要とする。ClustVis（Metsalu et al、2015）のようなWebサーバーは、主成分分析（PCA）やヒートマッププロットを可視化するためのシンプルでありながら強力なインターフェースを提供している。しかし、現在のところ、ClustVisではファイルのアップロードサイズが2MBに制限されており、プロットも静的なものになっている。

　次世代シーケンシングの出現により、研究者はこれまでにないスピードでビッグデータを生成することが可能になった。そのため、高次元の生物学的データの利用者が、クラスタリングなどの「初歩的な」解析を迅速に実行できるようなリソースが急務となっている（Stephens et al、2015）。このようなリソースを構築するための主な課題は、ビッグデータを扱うことと、その解釈を容易にすることである。クライアント側のコンピュータシステムやWebブラウザは、データを効率的にナビゲートするのに十分なパワーを持っているとは限らない。NIHは最近、この種の課題に取り組むためのBig Data to Knowledge (BD2K)センターに資金を提供している。KnowEnG BD2Kセンターの一環として、本著者らは、効率的な並列アルゴリズムとソフトウェアコンテナ化によるビッグデータのクラスタリングのためのClusterEnG（Clustering Engine for Genomicsの頭文字をとったもの）と呼ばれるウェブベースのリソースを開発した。高次元データの可視化を容易にするために、最も一般的な次元削減技術の一つであるPCAの対話型バージョンを実装した。ClusterEnGの2Dと3Dの主成分プロットは、データ内の構造を直感的に探索することを可能にする。論文図1は、ユーザーがアップロードしたデータから出力される可視化までのClusterEnGの様々なコンポーネントのフローチャートを示している。

　ClusterEnGは行列の表形式のデータを受け付け、RNA-seq、マイクロアレイ、薬物反応データなどの典型的な生物学的実験で生成されたほとんどのデータセットを分析することができる。入力データはdata.tableパッケージから高速で便利な "fread "関数を利用してRで読み込まれる(Dowle et al., 2014)。ClusterEnGサーバは現在1GBまでのファイルを受け付けているが、将来的には増加する予定である。アップロードされたファイルは7日間サーバに安全に保存され、その間、ユーザはファイルを取得したり、同じブラウザ（クッキーを有効にして）からより多くのジョブを実行したりすることができる。

（一部略）

　ClusterEnGは、2つのアルゴリズムの並列実装を含む7つのクラスタリングアルゴリズムを提供する。現在、シリアル実装は、CRANリポジトリで利用可能な様々なパッケージを用いてRプログラミング言語で記述されている（R Core Team, 2015）。7つのアルゴリズムは、k-means、k-medeedids、アフィニティ伝播、スペクトルクラスタリング、ガウス混合モデル、階層的クラスタリング、およびDBSCANを含む（Ester et al. k-meansアルゴリズムの並列コードはC言語で書かれたソフトウェアパッケージを利用しており(Liao, 2005)、スペクトルクラスタリングの並列コードはC++コードを利用している(Chen et al., 2011)。アルゴリズムの選択肢を提供することに加えて、ユーザーは、アルゴリズムのサブセットのために一般的に使用されるパラメータのリストを与えられ、修正して、変更したものを可視化することができる（図2）。

webサービス

http://education.knoweng.org/clustereng/ にアクセスする。

１、ファイルをアップロードする。

f:id:kazumaxneo:20210224211453p:plain

以下のフォーマットに対応している（左側）。

f:id:kazumaxneo:20200423213226p:plain

ここではexample dataを選んだ（ファイルアップロード前はデフォルトでこのexample dataが選択されている）。

f:id:kazumaxneo:20210224211611p:plain

そのまま下のstep2に進む。クラスタリング手法を選ぶ。

f:id:kazumaxneo:20210224211719p:plain

iマークをクリックすると右側に簡単な解説が表示される。

f:id:kazumaxneo:20210224211855p:plain

クラスタリング手法を選ぶ。ここではk-meansを選び、クラスター数を5と指定した。

f:id:kazumaxneo:20210224212713p:plain

テストした時はrunningから先に進めなかった。

f:id:kazumaxneo:20210224235707p:plain

引用
ClusterEnG: an interactive educational web resource for clustering and visualizing high-dimensional data

Manjunath M, Zhang Y, Yeo SH, Sobh O, Russell N, Followell C, Bushell C, Ravaioli U, Song JS

PeerJ Comput Sci. 2018;4. pii: e155

関連

2021-02-14

IGVのtips その２

tips 初心者向け IGV

2021 2/17 tips追記

IGVをより便利に使う方法はないでしょうかという質問があったので、今日は自分が知っているIGVのtipsをいくつか紹介します。

IGVの公式動画チャンネルがあります。

統合TVでもIGVの基本的な使い方について分かりやすく説明されています

インストール

IGVのCUIへの導入はcondaで出来ます。ここでは高速なmambaを使います。

mamba install -c bioconda -y igv

> igv --help

Command line options:

Space delimited list of data files to load

--preferences, -o Path or URL to a preference property file

--batch. -b Path or url to a batch command file

--port, -p IGV command port number (defaults to 60151)

--genome, -g Genome ID (e.g hg19) or path or url to .genome or indexed fasta file

--dataServerURL, -d Path or url to a data server registry file

--genomeServerURL, -u Path or url to a genome server registry file

--indexFile, -i Index file or comma delimited list of index files corresponding to data files

--coverageFile, -c Coverage file or comma delimited list of coverage files corresponding to data files

--name, -n Name or comma-delimited list of names for tracks corresponding to data files

--locus, -l Initial locus

--header, -H http header to include with all requests for list of data files

--igvDirectory Path to the local igv directory. Defaults to <user home>/igv

--version Print the IGV version and exit

--help Print this output and exit

様々なオプションが利用できます。たとえば-l (--locus)オプションを使うと、指定の染色体ポジションを初期位置にして起動できます。

Tips

１、マウスホバーした時にポップアップウィンドウが出ないようにする。

上のメニューの吹き出しマークをクリックしてNever～を選択します。show Details on clickを選ぶと、クリックした時に大型のポップアップウィンドウが表示されます。

f:id:kazumaxneo:20210214005223p:plain

２、読み込んだデータの環境を保存する。

IGVに複数のデータを読み込んで細かく設定を調整したら、後になって再びその環境を再現するのが難しくなります。その時はxmlのセッションファイルをexportしておくと、次回はそれを読み込むだけで環境が完全に再現されます。

IGVで表示を色々チューニングしたとします。

f:id:kazumaxneo:20210214010526p:plain

File => Save Sessionを選択して、xmlファイルを保存します。

f:id:kazumaxneo:20210214005859p:plain

xmlファイルを保存したらIGVを終了します。

再びIGVを起動してFile => Open Sessionを選択、

f:id:kazumaxneo:20210214010911p:plain

カスタムした設定で読み込まれました。ポジションも完全に再現されています。

f:id:kazumaxneo:20210214011022p:plain

CUI環境でIGVを起動するなら、オプションなしでxmlを指定します。

igv igv_session.xml

（前回と同じパスから起動する）

３、配列をコピーする。

様々なコピー方法が用意されています。まずはリファレンスの特定の領域の塩基配列をコピーするやり方です。メニューのアイコンからDefine regionsを選択します（上辺が赤いフロッピーディスクのようなマーク）。

f:id:kazumaxneo:20210214104720p:plain

アイコンの形が変わるので、カバレッジウィンドウ内でコピーしたい領域の左端か右端をクリックします。

f:id:kazumaxneo:20210214105356p:plain

縦に線がつきました。

左端を決めたので、次に右端をクリックします。クリックすると、その領域の上に赤いバーがつきます。

f:id:kazumaxneo:20210214105541p:plain

赤いバーを右クリックしてコピーします。

f:id:kazumaxneo:20210214110108p:plain

Copy sequences を選択、

テキストエディタにペースト。コピーされています。

f:id:kazumaxneo:20210214110208p:plain

赤いバーを右クリックしてremove~を選択するとバーは消えます。

次はシークエンシングリードの塩基配列をコピーしてしてみます。ここでは、ソフトマスクされた領域を調べることを目的とします。まずはソフトマスク領域を表示するようにIGVの設定を変更します。

View => Preference 、出てきたウィンドウのAlignmentタブを選択します。

f:id:kazumaxneo:20210214110728p:plain

show soft-clipped basesにチェックを付けます。これはデフォルトではオフになっています。

f:id:kazumaxneo:20210214110803p:plain

saveして閉じます。

末端がソフトクリップされたリードの全長が表示されました。リードの上で右クリックしてCopy read sequenceを選べば、そのリードの塩基配列がコピーされます。

f:id:kazumaxneo:20210214110956p:plain

ソフトクリップされた塩基配列だけコピーしたければ、copy left-clipped sequenceを選択します（ソフトクリップ領域がないリードでは表示されない）。ここではリファレンスと異なるGTTGGTTTTという配列だけがコピーされました。

f:id:kazumaxneo:20210214111534p:plain

次はCopy consensus sequenceです。これを選択すると、現在ウィンドウに表示されている領域のリファレンス配列とポジションごとのアラインメントの全塩基がATGCNそれぞれ集計されてコピーされます。

f:id:kazumaxneo:20210214112143p:plain

テキストエディタにペースト。samtools pileupの前半のフィールドと同じようなフォーマットになります。このままRなどで分析できます。

f:id:kazumaxneo:20210214112253p:plain

最後です。リファレンスにない塩基が挿入されている時、その挿入配列だけをコピーします。挿入はリードの上でⅠ（アイ）として表されます。

f:id:kazumaxneo:20210214112553p:plain

Ⅰ（アイ）のマークの上で右クリックしてCopy insert sequenceを選択するとコピーされます。

f:id:kazumaxneo:20210214112821p:plain

４、表示のカスタマイズ。

標準のリードアラインメントでは画面に収まりきらいない場合、右クリックしてExpandから変更します。collapsedに変えるとリードの厚みを小さくして、より多くのリードを表示できます。Suqalishedだと全てのリードがウィンドウ内に収まるように圧縮されます。

f:id:kazumaxneo:20210214123925p:plain

上のメニューresize tracesボタン（Xマーク）をクリックすると、一括でこのSuqalished表示を行えます。

f:id:kazumaxneo:20210214124215p:plain

カバレッジトラックの上でshow splice junction trackを選択すると、スプライスジャンクションをブリッジのマークで表示できます。

f:id:kazumaxneo:20210214130330p:plain

表示されました。このshow splice junction trackも１つのトラックです。

f:id:kazumaxneo:20210214130455p:plain

左端のトラックをシフトキーを押しながらクリックして、全トラックを選択します。それから右クリックすると、トラックの一括設定が行えます。

f:id:kazumaxneo:20210214124909p:plain

change track heightで200にします。

f:id:kazumaxneo:20210214124954p:plain

カバレッジトラックの高さが200になりました。

f:id:kazumaxneo:20210214125030p:plain

リードトラックを一時的に非表示にするなら、右クリックしてShow alignment Trackのチェックを外します。

f:id:kazumaxneo:20210217123202p:plain

リードトラックを完全に消すなら、右クリックしてremove trackを選択します。リードトラックは完全に消えます。

f:id:kazumaxneo:20210214125233p:plain

いかにもjavaのGUIライクなトラック上下の二重境界線が邪魔なので、トラックをドラッグして１つのトラック内にカバレッジトラックを集めます。

移動しました。

f:id:kazumaxneo:20210214125358p:plain

追記

二重線を最初から非表示にしたければ、Preference => General => Display all tracks in a single panelにチエックをつけるのも手です。

f:id:kazumaxneo:20210215175551p:plain

色を変えます。各トラックの上で右クリックしてchange track colorを選択します。

変更しました。

f:id:kazumaxneo:20210214125632p:plain

右クリックしてsave imageを選択するとウィンドウ画像を保存できます。

f:id:kazumaxneo:20210214125813p:plain

View => set name panelを選択して、ネームパネルの幅を広くします。

f:id:kazumaxneo:20210214131123p:plain

ここでは300を選択。

f:id:kazumaxneo:20210214131153p:plain

続いてchange fot sizeでフォントサイズを変えます。

f:id:kazumaxneo:20210214131253p:plain

ここでは30を選択、

f:id:kazumaxneo:20210214131320p:plain

大きくなりました。

f:id:kazumaxneo:20210214131350p:plain

色々カスタマイズしたら、セッションを保存しておきます（２参照）。

引用

Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration
Helga Thorvaldsdóttir, James T. Robinson, Jill P. Mesirov

Brief Bioinform. 2013 Mar; 14(2): 178–192. Published online 2012 Apr 19

参考

関連

2021-02-13

nf-coreのampliseqパイプライン

nf-core workflow manager 結果の視覚化 (visualization) 相関分析分子系統樹 automated pipeline 2020 Frontiers in Microbiology amplicon sequence nextflow docker

2021 2/13 誤字修正

2023/03/05 テストラン追記

　微生物群集の構成を明らかにし、微生物集団の動態を解明し、環境試料中の微生物の多様性を探るための主要な手法の一つとして、DNAやRNAを用いた16S rRNA（遺伝子）アンプリコンシークエンシングとバイオインフォマティクス解析を組み合わせたハイスループットスクリーニングがある。しかし、対照的な生息地からの環境試料に焦点を当てた場合、どのような解析手法が現実を最も正確に反映した結果を提供するのか、解析手法の違いによって微生物群集研究の解釈にどのような偏りがあるのか、使いやすいパイプラインで最適な解析ワークフローを実現できるのかについては、これまで体系的に評価されていなかった。ここでは、微生物群集の構成が既知の3つの模擬データセットを用いて、16S rRNA（遺伝子）アンプリコン配列解析ツール（Mothur, QIIME1, QIIME2, MEGAN）の性能を比較した。その結果、QIIME2は、配列復元（10倍以上の偽陽性の減少）、分類学的分類（22％以上のFスコアの改善）、多様性推定（5％以上の評価の改善）において、他のすべての調査ツールを凌駕することが示された。陸上と淡水の対照的な4つのサイトから得られた24の環境データセットをさらに分析した結果、すべてのパイプラインの微生物群集の構成が属数レベルで劇的に異なることが明らかになった。例えば、調査した河川水域では、SpaerotilusはQIIME1を用いた場合にのみ報告され（8％の含有量）、AgitococcusはQIIME1またはQIIME2を用いた場合にのみ報告された（それぞれ2％または3％の含有量）が、MothurまたはMEGANを用いた場合には、両属とも検出されなかった。これらの豊富な分類群は、これらのサイトにおける重要な生物地球化学的サイクル（例えば、硝酸塩や硫酸塩の還元）に関係していると考えられるため、それらの検出と半定量的な列挙は、有効な解釈のために重要である。16S rRNA（遺伝子）アンプリコン配列を生の配列ファイルから解析するためのFAIR（Findable, Accessible, Interoperable, and Re-usable）を可能にする高性能コンピューティング準拠のワークフローを構築し、本研究で明らかになった最適な手法を用いている。本ワークフローは、今後の研究のために検討されるべきものであり、これにより、微生物群集データ解析の信頼性と信頼性を最大化しつつ、ハイスループット16S rRNA（遺伝子）配列データの解析を大幅に促進することが可能となる。

By default, the pipeline currently performs the following:

Sequencing quality control (FastQC)
Trimming of reads (Cutadapt)
Illumina read processing with QIIME2
Infer Amplicon Sequence Variants (ASVs) (DADA2)
Taxonomical classification based on SILVA v132 or UNITE database
excludes unwanted taxa, produces absolute and relative feature/taxa count tables and plots, plots alpha rarefaction curves, computes alpha and beta diversity indices and plots thereof (QIIME2)
Calls differentially abundant taxa (ANCOM)
Overall pipeline run summaries (MultiQC)

Usage

https://nf-co.re/ampliseq/1.1.2/usage

インストール

macos10.14のnextflow version 20.10.0.5430で-profile dockerを使ってテストした（nextflowが古い場合は更新すること）。

Github

# Make sure that Java v8+ is installed:
java -version
# Install Nextflow (持ってない人だけ、condaでも導入可能)
curl -fsSL get.nextflow.io | bash
#パスの通ったディレクトリに移動
mv nextflow ~/bin/

#pull
nextflow pull nf-core/ampliseq

テストラン

#dockerを使う例（実行権がなければsudo実行する）
nextflow run nf-core/ampliseq -profile test,docker --outdir output_dir

-profile <docker/singularity/podman/conda/institute>

出力

f:id:kazumaxneo:20210213111847p:plain

実行方法

増幅プライマー配列と（--FW_primer & --RV_primer）、fastqのパスを指定する（--input）。ここでは　プロファイルにdockerを指定。任意でサンプルグループも指定する。パイプラインに提供するfastqは"*_R{1,2}_001.fastq.gz"の形式になっていないといけない。ファイルのパスを指定する際はダブルクオーテーションで囲む。メタデータファイルはQIIME2の仕様(https://docs.qiime2.org/2019.10/tutorials/metadata/)に準拠していなければならない。

nextflow run nf-core/ampliseq \
 -profile docker \
 --input "fastq_dir" \
 --FW_primer GTGYCAGCMGCCGCGGTAA \
 --RV_primer GGACTACNVGGGTWTCTAAT \
 --metadata "Metadata.tsv"

--input Folder containing paired-end demultiplexed FastQ files
--FW_primer Forward primer sequence
--RV_primer Reverse primer sequence
--metadata Path to metadata sheet, when missing most downstream analysis are skipped (barplots, PCoA plots, ...).
--multipleSequencingRuns If samples were sequenced in multiple sequencing runs
--manifest Path to tab-separated table with sample IDs and paths to sequencing files
--max_cpus　Maximum number of CPUs that can be requested for any single job.　default:16
--max_memory　Maximum amount of memory that can be requested for any single job.　default:'128.GB'

様々なオプションが用意されています。nf-coreのparameter docタブで確認してください。正常にランするには、データによりますが、メモリは最低64GBくらいは必要です。

メモ

QIIME2のメタデータファイル例

https://data.qiime2.org/2019.10/tutorials/metadata/faith_pd_vector.qza

f:id:kazumaxneo:20210213133755p:plain

引用

Interpretations of Environmental Microbial Community Studies Are Biased by the Selected 16S rRNA (Gene) Amplicon Sequencing Pipeline
Daniel Straub, Nia Blackwell, Adrian Langarica-Fuentes, Alexander Peltzer, Sven Nahnsen, Sara Kleindienst
Front. Microbiol., 23 October 2020

2021-02-12

RNAseqのDEGsを視覚化する DrEdGE

2020 RNA seq Bioinformatics web tool 結果の視覚化 (visualization) C. elegans

　Differential Expression Gene Explorer（DrEdGE）はウェブベースのツールで、インタラクティブなオンラインのデータビジュアライゼーションを簡単に作成できるようにgenomicists（*1）を案内する。 DrEdGEの機能を、公開されているデータセット（ヒトの神経組織、マウス胚組織、Caenorhabditis elegans全胚）から生成された3つのサンプルWebサイトで実証する。 DrEdGEは、独立した探査に対する技術的な障害を取り除くことにより、大規模なゲノムデータセットの有用性を高める。DrEdGEはhttp://dredge.bio.unc.eduにて無料で利用できる。

Github

WEBサービス

http://dredge.bio.unc.edu/にアクセスする。

使用方法は動画で丁寧に説明されています。

webでは３つのデータセットが利用できる。

Human neuronal tissue

f:id:kazumaxneo:20210211210917p:plain

Mouse embryonic tissue

f:id:kazumaxneo:20210211211002p:plain

C. elegans timecourse

f:id:kazumaxneo:20210211211030p:plain

ローカルで使うなら、HPからdredge-7.4.1.zipをダウンロードして解凍しておく。

f:id:kazumaxneo:20210211211331p:plain

index.htmlファイルを開き、指示に従ってDrEdGEを設定する。

f:id:kazumaxneo:20210212231408p:plain

引用

Differential Expression Gene Explorer (DrEdGE): a tool for generating interactive online visualizations of gene expression datasets
Sophia C Tintori, Patrick Golden, Bob Goldstein
Bioinformatics, Published: 03 January 2020

genomicist; A scientist whose speciality is genomics.

関連

2021-02-11

condaの代わりに高速なmambaを使う

tips 初心者向け高速なツール conda

2021 2/11 誤りを修正

2021 4/26 Rについて追記

2021 4/30 tips追記

2022 2/7 再インストール追記

Githubより

Mamba は C++ での conda パッケージマネージャの再実装です。マルチスレッドを使ったリポジトリデータとパッケージファイルの並列ダウンロード、依存関係の解決をより高速にするための libsolv、Red Hat、Fedora、OpenSUSE の RPM パッケージマネージャで使用されている最先端のライブラリです。
mambaのコア部分はC++で実装されており、最大限の効率化が図られています。
同時に、mamba は可能な限り互換性を保つために、codaと同じコマンドラインパーサ、パッケージのインストールとデインストール、トランザクション検証ルーチンを利用しています。

開発の動機のブログ記事。condaの問題点についても言及されています。

mambaという名前は、その名前の通りマンバ属（wikipedia）に由来するようです。

mambaを導入してもcondaが消えるわけではなく、両者は両立して使えます。インストールしたツールが置かれるパスも同じです。また不要であれば簡単に消去することができます。気軽にテストしてみて下さい。

インストール

macos10.14（miniconda3.8）、win10pro-2020H2適用済みWSL2環境 (ubuntu20.04, miniconda3.8)、ubuntu18.04(anaconda3.7)で動作確認した。

mamba Github repository

#conda (anaconda)
conda install -c conda-forge mamba

#pip (pypi)  （間違ってます。実行しないで下さい）
pip install mamba

> mamba

$ mamba

usage: mamba [-h] [-V] command ...

conda is a tool for managing and deploying applications, environments and packages.

Options:

positional arguments:

command

clean Remove unused packages and caches.

compare Compare packages between conda environments.

config Modify configuration values in .condarc. This is modeled after the git config command. Writes to the user .condarc file (/Users/kazu/.condarc) by default.

create Create a new conda environment from a list of specified packages.

help Displays a list of available conda commands and their help strings.

info Display information about current conda install.

init Initialize conda for shell interaction. [Experimental]

install Installs a list of packages into a specified conda environment.

list List linked packages in a conda environment.

package Low-level conda package utility. (EXPERIMENTAL)

remove Remove a list of packages from a specified conda environment.

uninstall Alias for conda remove.

run Run an executable in a conda environment. [Experimental]

search Search for packages and display associated information. The input is a MatchSpec, a query language for conda packages. See examples below.

update Updates conda packages to the latest compatible version.

upgrade Alias for conda update.

repoquery Query repositories using mamba.

optional arguments:

-h, --help Show this help message and exit.

-V, --version Show the conda version number and exit.

conda commands available from other packages:

env

使い方

condaの代わりにmambaを指定するだけで使えます。

mamba install hoge

biocondaチャネルを指定して、バイオインフォマティクスのツールを導入します。

#bwa mem2を導入。-yで確認メッセージをスキップできます。
mamba install -c bioconda bwa-mem2 -y

#bioconda 以外のチャネルからインストール可能なバイオインフォマティクスツールもあります。
mamba install -c bioconda samtools
mamba install -c bioconda/label/cf201901 samtools

#fastqcとartを同時に導入します。
mamba install -c bioconda fastqc art

導入したツールをアンインストールします。

#-yで確認メッセージをスキップできます。
mamba remove bwa-mem2 -y

仮想環境を作ってツールを導入することで、ほかのツールが使っているライブラリとのバージョン違いによるコンフリクトを防げます。インストールも早くなります。activate / deactivate だけはcondaを使います（以前はsource activateだったが最近ではconda activate推奨）。

#tensorflowのバージョン1.10.0をtensorflow1.10_testという環境に導入
mamba create -n tensorflow1.10_test tensorflow=1.10.0
conda activate tensorflow1.10_test

#使用後、環境を抜ける
conda deactivate

pythonのバージョンも指定して仮想環境を作り、その仮想環境にツールを導入します。"tested with python3.6"のように、特定のpythonバージョンだけで開発・検証されているツールを導入したい時や、使ったことのないツールをまとめてテストする環境を作りたい時に便利です。

#cutadaptをpython3.6に導入
mamba create -n hoge python==3.6 -y
conda activate hoge
mamba install -c bioconda cutadapt -y

#使用後、環境を抜ける
conda deactivate

# 〇〇先生との共同研究のデータ解析環境を作る
mamba create -n ○○experiment python==3.9 -y
conda activate ○○experiment
mamba install -c bioconda star samtools rsem -y
#使用後、環境を抜ける
conda deactivate

conda（mamba含む）で導入したツール一覧を確認します。

mamba list

conda（mamba含む）で作成した環境一覧を確認します。

mamba info -e

#もしくは
mamba env list

導入したツールをアンインストールします。

#-yで確認メッセージをスキップ
mamba remove bwa-mem2

作った環境を消します。

#○○experimentという環境を削除
mamba remove --all -n ○○experiment

不要なツールやキャッシュを消して空き容量を増やします。必要なライブラリが消えたという話もあるようなので、自己責任で行って下さい。

#remove index cashe
mamba clean -i

#remove unused packages
mamba clean -p

#all
mamba clean -a

conda環境で使用するpythonのバージョンを変更することもできます。python3.9からpython3.8に変更します。

#-yで確認メッセージをスキップ
mamba install python=3.8

今のところ、環境のexportと再構築のコマンドはないようで、これはcondaコマンドで行う必要があります。

間違ってました。mamba envコマンドが用意されています。

$ mamba env -h

usage: conda-env [-h] {create,export,list,remove,update,config} ...

positional arguments:

{create,export,list,remove,update,config}

create Create an environment based on an environment file

export Export a given environment

list List the Conda environments

remove Remove an environment

update Update the current environment based on environment file

config Configure a conda environment

optional arguments:

-h, --help Show this help message and exit.

conda commands available from other packages:

env

環境をエキスポートして、他のマシンや他のユーザーが環境を再現できるようにします。

#環境をエキスポートする
conda activate ○○experiment
mamba env export > env.yaml

> cat env.yaml

name: "\u25CB\u25CBexperiment"

channels:

- bioconda

- conda-forge

- defaults

- au-eoed

- ursky

dependencies:

- ca-certificates=2021.1.19=hecd8cb5_0

- certifi=2020.12.5=py39h6e9494a_1

- libcxx=11.0.1=habf9029_0

- libffi=3.3=h046ec9c_2

- ncurses=6.2=h2e338ed_4

- openssl=1.1.1i=h35c211d_0

- pip=21.0.1=pyhd8ed1ab_0

- python=3.9.0=h4f09611_5_cpython

- python_abi=3.9=1_cp39

- readline=8.1=h9ed2024_0

- setuptools=52.0.0=py39hecd8cb5_0

- sqlite=3.34.0=h17101e1_0

- tk=8.6.10=hb0a8c7a_1

- tzdata=2021a=he74cb21_0

- wheel=0.36.2=pyhd3deb0d_0

- xz=5.2.5=haf1e3a3_1

- zlib=1.2.11=h7795811_1010

prefix: "/Users/kazu/miniconda3/envs/\u25CB\u25CBexperiment"

作った環境を取り込みます。

mamba env create --file env.yaml
#activateすれば使えます。
conda activate ○○experiment

同名のmambaというフレームワークもあるようですが、このmambaとは別物です。

2021 4/26

Rの仮想環境としても使える（Rのパスがcondaのパスの方が優先される時）。

#R3.6 (link)
mamba create -n R36 -y
conda activate R36
mamba install -c r r
which R

#R4.0.3 (link)
mamba create -n R4 -y
conda activate R4
mamba install -c conda-forge r-base -y
which R

#devtools (anaconda) やbiomanager (anaconda) も配布されている
mamba install -c conda-forge r-biocmanager

仮想環境名R36のRにggplot2を入れると、miniconda3/envs/R36/lib/R/library/ggplot2に置かれる。bioconductorのパッケージはcondaをサポートしているので（リンク）、 BiocManagerでは時間がかかる時には上のように環境を作ってから、mambaで入れるのが簡単（ただし、バージョンが1～２段階古いことが多い）。

注意：他の方法で既にRを導入している場合、コンフリクトが起こる可能性があります。

2021/04/30

tar.gzファイルを直接指定して導入することもできる。例えばmetawrapを依存は他の方法で導入済みで(e.g., mamba install -c ursky metawrap-mg -y --only-deps)、本体のみ導入する。対応したプラットフォームのバイナリ(link)をダウンロードし、conda installでtar ballを直接（圧縮状態のまま）指定する。

conda activate metawrap-env
mamba install metawrap-1.2-hdfd78af_2.tar.bz2

対応したプラットフォームでなくてもインストールできてしまうので注意が必要だが、 anaconda/cloudとやり取りしないため、ほぼ一瞬でインストールできる。

もしくはconda installを--no-deps付きで実行する。

mamba install -c ursky metawrap-mg -y --no-deps

--no-deps Do not install, update, remove, or change dependencies. This WILL lead to broken environments and inconsistent behavior. Use at your own risk.
--only-deps Only install dependencies.

2022/02/07

挙動がおかしくなったら一度アンインストールして最新版を導入し直す。エラーが起きた時はこれで直った。

conda remove mamba
conda install -c conda-forge mamba

最新版ではさらに高速になった。

2022/08/23

base環境を抜ける。

#base enviromentでdeactivateする。
conda deactivate

condaのコマンドにパスは通っており、conda|maba自体は使える。ただしbase 環境でいれたツールにパスは通っていない。他の方法で入れたツールやライブラリがcondaで入れたツールと競合する時にこの手順を取る。尚、この状態でcondaコマンドを使ってツールをインストールするとbase環境に導入される。condaのbase環境を最アクティベートするには"conda activate base"を実行する。

関連

2022/07/16

Conda is awesome, but yeah, it's a snail 🐌 not a snake 🐍!
So, here's one last little surprise from #SciPy2022 you will love!
Conda now supports the mamba solver for a 5x-10x speedup 🚀!
Thanks @chenghlee!!! pic.twitter.com/rY575iPhFE
— Sebastian Raschka (@rasbt) July 15, 2022

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

メタゲノムの組成を解析する CAMAMED

データベースやゲノムアセンブリの汚染・キメラアセンブリ配列を調べる conterminator

: 高次元データのクラスタリングと可視化のためのインタラクティブな教育用ウェブリソース ClusterEnG

IGVのtips その２

nf-coreのampliseqパイプライン

RNAseqのDEGsを視覚化する DrEdGE

condaの代わりに高速なmambaを使う