MutantHuntWGS: an automated pipeline for identifying mutations in S. cerevisiae

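MutantHuntWGS is a user-friendly pipeline for analyzing Saccharomyces cerevisiae whole-genome sequencing data. It uses open-source programs to perform (1) sequence alignment of paired-end and single-end reads, (2) variant calling, and (3) prediction of variant effect and severity. MutantHuntWGS outputs a shortlist of variants while also giving access to all intermediate files. To demonstrate its utility, the authors used MutantHuntWGS to assess multiple published datasets; in every case it detected the same causal variants reported in the literature. To encourage broad adoption and promote reproducibility, they distribute a containerized version of the pipeline that lets users install it and analyze data with only two commands. The MutantHuntWGS software and documentation can be downloaded free of charge from https://github.com/mae92/MutantHuntWGS.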

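The MutantHuntWGS pipeline integrates open-source bioinformatics tools and Unix commands. It takes raw sequencing reads (compressed FASTQ format, i.e. .fastq.gz) and a text file containing ploidy information as input, and produces a list of sequence variants as output. Users must provide input data for at least two strains: a control strain and one or more experimental strains. The pipeline (1) aligns the reads of each input sample to the reference genome with Bowtie2, (2) processes the data and calculates genotype likelihoods with SAMtools, (3) calls variants with BCFtools, (4) compares the variants found in the experimental and control strains using VCFtools and custom shell commands, and (5) uses SnpEff and SIFT to assess where each variant lies relative to annotated genes and its potential impact on the expression and function of the affected gene product (Figure 1 of the paper). A detailed description of the commands used in the pipeline, along with all code, is available in the MutantHuntWGS Git repository (https://github.com/mae92/MutantHuntWGS; see README.md and Supplemental_Methods.docx).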

Installation

Tested on Ubuntu 18.04 using the official Docker image.

GitHub

git clone https://github.com/mae92/MutantHuntWGS.git

#dockerhub
docker pull mellison/mutant_hunt_wgs:version1

Usage

1. Prepare the files

Create Analysis_Directory and, inside it, a directory named FASTQ. Place all FASTQ files in that directory.

mkdir -p Analysis_Directory/FASTQ
cp <your>/*fastq.gz Analysis_Directory/FASTQ/

FASTQ files must be gzip-compressed and follow the naming convention below.

Single-end FASTQ: xxx.fastq.gz
Paired-end FASTQ: xxx_R1.fastq.gz and xxx_R2.fastq.gz

The xxx part must not contain spaces, punctuation, or underscores ("_").
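The naming rules above can be checked before starting a run. The following is a minimal sketch, not part of MutantHuntWGS: check_fastq_name and check_fastq_dir are hypothetical helper names, and the sketch assumes the strictest reading of the rule, namely that the xxx part may contain only ASCII letters and digits.

```shell
# Hypothetical helpers (not part of MutantHuntWGS), written for bash.
# Assumption: the xxx prefix may contain only ASCII letters and digits.

# Return 0 if a single file name follows the convention, 1 otherwise.
check_fastq_name() {
    local name prefix
    name=$(basename "$1")
    # The file must be gzip-compressed FASTQ.
    case "$name" in
        *.fastq.gz) ;;
        *) return 1 ;;
    esac
    prefix=${name%.fastq.gz}
    # Drop the paired-end suffix (_R1 or _R2) if present.
    case "$prefix" in
        *_R1|*_R2) prefix=${prefix%_R[12]} ;;
    esac
    # Reject an empty prefix or one containing anything but letters and digits.
    case "$prefix" in
        ""|*[!A-Za-z0-9]*) return 1 ;;
    esac
    return 0
}

# Check every file in a directory; report offenders on stderr.
check_fastq_dir() {
    local f status
    status=0
    for f in "$1"/*; do
        [ -e "$f" ] || continue
        if ! check_fastq_name "$f"; then
            echo "does not follow the naming convention: $f" >&2
            status=1
        fi
    done
    return "$status"
}
```

For example, `check_fastq_dir Analysis_Directory/FASTQ` returns a non-zero status if any file in the directory would need renaming.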

2. Start the Docker image, specifying the directory created above as the shared directory.

docker run --rm -it -v /<PATH>/<TO>/<YOUR>/Analysis_Directory:/Main/Analysis_Directory mellison/mutant_hunt_wgs:version1
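An absolute host path is the safe choice for the -v option, since Docker may otherwise interpret the value as a named volume. The sketch below shows one way to build that argument from a relative directory name. mount_arg is a hypothetical helper, not part of MutantHuntWGS or Docker; only the container-side path /Main/Analysis_Directory comes from the command above.

```shell
# Hypothetical helper (not part of MutantHuntWGS or Docker).
# Prints "<absolute host path>:/Main/Analysis_Directory" for use with docker run -v.
mount_arg() {
    local dir abs
    dir=$1
    # The pipeline expects the FASTQ subdirectory to exist inside the shared directory.
    if [ ! -d "$dir/FASTQ" ]; then
        echo "missing directory: $dir/FASTQ" >&2
        return 1
    fi
    # Resolve the directory to an absolute path.
    abs=$(cd "$dir" && pwd) || return 1
    printf '%s:/Main/Analysis_Directory\n' "$abs"
}

# Usage (requires Docker and the image pulled earlier):
#   docker run --rm -it -v "$(mount_arg Analysis_Directory)" mellison/mutant_hunt_wgs:version1
```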

3. Run MutantHuntWGS. The genome FASTA file and index files are included in the Docker image (presumably Saccharomyces cerevisiae S288C).

#A test run is available

MutantHuntWGS.sh \
-n wttoy -g /Main/MutantHuntWGS/S_cerevisiae_Bowtie2_Index_and_FASTA/genome \
-f /Main/MutantHuntWGS/S_cerevisiae_Bowtie2_Index_and_FASTA/genome.fa \
-p /Main/MutantHuntWGS/S_cerevisiae_Bowtie2_Index_and_FASTA/ploidy_n1.txt \
-d /Main/MutantHuntWGS/FASTQ_test \
-o /Main/Analysis_Directory/test_output \
-a YES -r single -s 0

The test run finishes in a few tens of seconds. A test_output directory is created under Analysis_Directory/.

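For a run on real sequencing data, point the "-d" option at /Main/Analysis_Directory/FASTQ and set "-n" to the FASTQ prefix of your own wild-type strain (the example below still shows wttoy, the prefix used in the test run). Note that "-t" is listed as available in version 1.1 only.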

MutantHuntWGS.sh \
-n wttoy -g /Main/MutantHuntWGS/S_cerevisiae_Bowtie2_Index_and_FASTA/genome \
-f /Main/MutantHuntWGS/S_cerevisiae_Bowtie2_Index_and_FASTA/genome.fa \
-p /Main/MutantHuntWGS/S_cerevisiae_Bowtie2_Index_and_FASTA/ploidy_n1.txt \
-d /Main/Analysis_Directory/FASTQ \
-o /Main/Analysis_Directory/output \
-a YES -r single -s 0 -t 8

-n The -n option takes the prefix of the FASTQ file name for the wild-type strain. For the example of FILENAME.fastq.gz or FILENAME_R1.fastq.gz, this prefix would simply be "FILENAME".
-g The -g option takes the file PATH to the Bowtie2 index files and the file prefix (genome). Use exactly what is shown above for this option.
-f The -f option takes the file PATH and file name of the genome FASTA file (genome.fa). Use exactly what is shown above for this option.
-r The -r option specifies whether the input data contains paired-end or single-end reads and can take values of "paired" or "single".
-s The -s option takes a score cutoff for the variant scores. The score is calculated as -10 * log10(P), where P is the probability that the variant call (ALT) in the VCF file is wrong.
-p The -p option takes the file PATH and file name of the ploidy file (ploidy_n1.txt). Use exactly what is shown above for this option.
-d Directory containing your FASTQ files. If you set things up as outlined above, this should stay the same as in the example: /Main/Analysis_Directory/FASTQ.
-o This allows you to specify a folder for your output data. It should be structured like the example /Main/Analysis_Directory/NAME_YOUR_OUTPUT_FOLDER, with a descriptive name of your choice replacing the NAME_YOUR_OUTPUT_FOLDER part of the file PATH.
-a This allows you to turn the alignment and calling step on and off. If you have already aligned reads and called variants and only want to reanalyze with a different score cutoff, set this to "NO"; if you are starting from FASTQ files that have not gone through this process yet, set this to "YES".
-t (version 1.1 only) This allows you to set the number of concurrent threads used when running Bowtie2. It is equivalent to setting -p/--threads in Bowtie2.
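To get a feel for the -s cutoff, the score formula given above can be evaluated directly. The following is a small sketch using awk; p_to_score and score_to_p are hypothetical helper names introduced here for illustration, not part of MutantHuntWGS.

```shell
# Hypothetical helpers illustrating the score formula; not part of MutantHuntWGS.

# score = -10 * log10(P), where P is the probability that the call is wrong.
# awk's log() is the natural logarithm, so divide by log(10).
p_to_score() {
    awk -v p="$1" 'BEGIN { printf "%.1f\n", -10 * log(p) / log(10) }'
}

# Inverse: P = 10^(-score / 10)
score_to_p() {
    awk -v s="$1" 'BEGIN { printf "%.6f\n", 10 ^ (-s / 10) }'
}

p_to_score 0.001   # prints 30.0
score_to_p 20      # prints 0.010000
```

A score of 30 therefore corresponds to a 1 in 1000 chance that the call is wrong. A score of 0 corresponds to P = 1, so assuming the cutoff acts as a minimum score, the "-s 0" used in the commands above would not exclude any variant on the basis of its score.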

Citation

MutantHuntWGS: A Pipeline for Identifying Saccharomyces cerevisiae Mutations
Mitchell A Ellison, Jennifer L Walker, Patrick J Ropp, Jacob D Durrant, Karen M Arndt
G3 Genes|Genomes|Genetics, Volume 10, Issue 9, 1 September 2020, Pages 3009–3014