2018-03-15

k-merサイズを変えながらエラー訂正を繰り返す SGA-ICE (IterativeErrorCorrection)

　イルミナのMiSeqでシーケンスを1回実行すると、300 bpのペアエンドで15ギガバイト（GB）のデータが出力される。Illumina HiSeq 2500では、最大ペアエンド250 bpで300 GBのシーケンスが可能担っている。この高いスループットは、ゲノムアセンブリにとって魅力的なものである。 Illuminaのデータではエラーは1％未満だが、１つのリードでエラーがゼロの確率は低く、特に250または300 bpの長いリードでは低くなる。ゲノムアセンブリのためには、可能な限り正確なリードを使うことが望ましい。よって、シーケンシングエラーの修正は必須の前処理ステップとなる。すべてのエラーはまれであり、シーケンシングカバレージは十分に高く、同じゲノム遺伝子座をカバーする他のリード情報を用いてエラーを訂正することができる。エラー訂正ツールは、塩基置換を主に扱うk-merベースの訂正と、挿入と削除を訂正するオーバーラップベースの2つに分類できる。それぞれの手法の詳細な概要はLaehnemann et al. [論文より ref.1]によって説明されている。

　k-merベースのエラー訂正の背後にあるアイデアは、シーケンシングエラーがゲノムには存在しないため、低い頻度のk-mer配列になっていることである。低頻度のk-merを検出・置換することにより、誤った塩基を修正することができる。 K-merベースの補正は、k-merサイズのためのパラメータ選択、およびまれでないk-merのカウントに依存する。オーバーラップベースの補正の背後にあるアイデアは、似たリード、すなわちおそらくは同じ遺伝子座に由来するリードでマルチプルアラインメントを構築することにある。シークエンシングエラーは、アラインメントのまれな差として検出され、アラインメントの同じカラムのコンセンサス配列で補正される。オーバーラップベースの補正は、リードの最小の相同性、まれな違いをカウントするしきい値、およびコンセンサスをサポートする最小のリード数のパラメータに依存する。

　ほとんどのツールは、テストされたデータセットでうまく機能し、小さなゲノムではほとんどすべてのエラーを修正することができる[ref.20]。しかし、ヒトゲノムのような複雑でリピートリッチなゲノムでは、エラー訂正後もかなりのリードが誤りを有する。例えば、最高性能のツール（[ref.20]の表2）で修正した後でも、ヒトの100bpのHiSeqシーケンスデータでは15〜20％のエラーが残る。このパフォーマンスは、リードが長くなるとさらに悪化する。MiSeqで読んだE.coliゲノムの250bpのリードの半分以上にエラーが残っている（[ref.20]の表3）。シーケンスエラーが訂正されないまま残る理由はいくつかある。第1に、多くのエラーを伴うリードは、他のリードと似ていないので、訂正が困難である。しかしこのようなリードは、頻繁に使用されるk-mersが数多く存在するため、破棄が容易である。第2に、低い低カバレッジの遺伝子座のリードは、solid k-mers（与えられた閾値よりも頻繁に発生するk -mer）の欠如のために訂正されないかもしれない。第3に、エラーのk-merがゲノムの他の場所に頻繁に見つかると、シークエンシングエラーとして検出されずに残っている可能性がある。これは特にリピート領域で起こる。最初の2つのケースは計算上の扱いが困難だが、250または300 bpのより長いリードならば第３のエラーは低減される。

本論文ではエラー訂正後に補正されていないシーケンシングエラーを伴う多くの読み取りが繰り返し重複していることを示している。これらのリードでは、短い誤ったk-merが別のリピートで同じように発生し、誤って正しいと見なされる。長いイルミナリードのエラー訂正を改善するために、String Graph Assemblerのモジュールを使用し、k-merベースの補正をk-merサイズを増やしながら複数回実行する反復誤差補正パイプラインが開発された。この反復戦略は、リピート領域のエラーを効果的に修正し、誤ったリードの総量を削減することを示している。さらに、この高い読み取り精度によって、コンティグが2~3倍長くなることを示している。

インストール

本体 Github

https://github.com/hillerlab/IterativeErrorCorrection

git clone https://github.com/hillerlab/IterativeErrorCorrection.git
cd IterativeErrorCorrection/
python SGA-ICE.py

python SGA-ICE.py

$ python SGA-ICE.py

usage: SGA-ICE.py [-h] [-k KMERS] [-t THREADS] [--noOvlCorr] [--noCleanup]

[--scriptName SCRIPTNAME] [--errorRate ERRORRATE]

[--minOverlap MINOVERLAP]

inputDir

SGA-ICE produces a shell script that contains all commands to run iterative

error correction of the given read data with the given parameters. Read data

must be in fastq format and files need to have the ending .fastq or .fq.

positional arguments:

inputDir Path to directory with the *.fastq or *.fq files. The

produced shell script will be located here.

optional arguments:

-h, --help show this help message and exit

-k KMERS, --kmers KMERS

List of k-mers for k-mer correction; values should be

comma-separated. If -k is not provided, SGA-ICE does 3

rounds of k-mer correction with k-mer sizes determined

based on the length of the read from the first file in

inputDir. We advise the user to choose k-mer values

manually if the sequences in the *.fastq files have

different read lengths.

-t THREADS, --threads THREADS

Number of threads used. Default is 1. Set to higher

values if you have more than one core and want to

reduce the runtime.

--noOvlCorr If set, do not run a final overlap-based correction

round.

--noCleanup If set, keep all intermediate files in the temporary

directory.

--scriptName SCRIPTNAME

Name of the shell script containing the error

correction commands. By default, script is called

runMe.sh

--errorRate ERRORRATE

sga correct -e parameter for overlap correction.

Maximum error rate allowed between two sequences to

consider them overlapped. Default is 0.01

--minOverlap MINOVERLAP

sga correct -m parameter for overlap correction.

Minimum overlap required between two reads. Default is

ラン

fastqのディレクトリを指定してランする（fqまたはfastqを認識する）。

SGA-ICE.py /path/to/fastq/data/ -k 40,60,100,125,150,200 --noCleanup --noOvlCorr --scriptName correctMyData.sh

-k List of k-mers for k-mer correction; values should be comma-separated. If -k is not provided, SGA-ICE does 3 rounds of k-mer correction with k-mer sizes determined based on the length of the read from the first file in inputDir. We advise the user to choose k-mer values manually if the sequences in the *.fastq files have different read lengths.
-noCleanup If set, keep all intermediate files in the temporary directory.
--noOvlCorr If set, do not run a final overlap-based correction round.
--scriptName Name of the shell script containing the error correction commands. By default, script is called runMe.sh
-t THREADS, --threads THREADS Number of threads used. Default is 1. Set to higher values if you have more than one core and want to reduce the runtime.

上のコマンドではk-merサイズを上げながら6回繰り返すことになる。ランが終わるとシェルスクリプトcorrectMyData.shができるので、それを実行するとエラー訂正が自動で行われる。

引用

Iterative error correction of long sequencing reads maximizes accuracy and improves contig assembly.

Sameith K, Roscito JG, Hiller M.

Briefings in Bioinformatics. 2017 Jan;18(1):1-8.

2018-03-15

メタゲノムデータを使ってシングルセルのエラー訂正を行う MeCorS

single cell error correction metagenome

　自然界に存在する大部分の微生物種は培養できないが、メタゲノミクスや最近のシングルセルシーケンス技術によりゲノムにアクセスできるようになってきた。シングルセルシーケンスとメタゲノムのショットガンシーケンスが同じ環境サンプルから生成され、方法論的に組み合わせられる。例えば、シングルセルのデータでメタゲノムビンを検証したり、シングルセルのアセンブリのcontiguityをメタゲノムデータで改善したりすることができる（論文より　Campbell et al, 2013、Hess et al, 2011）。しかし、通常シングルセルシーケンスを行うにはシーケンシングに先立って全ゲノム増幅する必要がある(MDA; Lasken, 2007)。この増幅はひどく偏っており、超低カバレッジ領域を発生させる(Chitsaz et al., 2011; Supplementary Fig. S1)。さらに、キメラ形成は、MDA中に約10kbp当たり約1回発生し、single amplified genomes (SAGs) をさらに複雑にする（Nurk et al, 2013、Rodrigue et al, 2009）。さまざまなケースのためのエラーコレクションツールが存在するが、SAGデータをエラー訂正するために特別に設計されたツールはhammer (Medvedev et al., 2011)（のちにBayesHammer （Nikolenko、2013）としてリファインされた）しか存在しない。

　MeCorSはシングルセルシーケンスのためのメタゲノム対応エラー訂正ツール。カバレッジバイアスのないメタゲノムデータを利用し、SAGの超低カバレッジ領域のエラー訂正を行う。またキメラのないメタのゲムデータを使い、キメラのSAGを訂正する。

インストール

cent OSに導入した。

本体 Github

https://github.com/metagenomics/MeCorS

git clone https://github.com/metagenomics/MeCorS.git 
cd MeCorS && make

> ./mecors

$ ./mecors

MeCorS version 0.4.1 by Andreas Bremges (andreas@cebitec.uni-bielefeld.de)

Usage: mecors [options] -s <SC.fastq> -m <MG.fastq>

-s <SC.fastq> single cell sequencing reads

-m <MG.fastq> metagenomic sequencing reads

(fastq or fasta, can be gzip'ed)

-k INT k-mer size for error correction [31]

-c INT min. coverage in the metagenome [2]

-t INT number of threads [16]

-B NUM process ~NUM bp in each batch [100M]

-v be verbose

パスの通ったディレクトリにコピーしておく。

ラン

シングルセルとメタゲノムのfastqを指定して実行する。

mecors [options] -s single_cell.fq -m metagenome.fq

-s single cell sequencing reads
-m metagenomic sequencing reads (fastq or fasta, can be gzip'ed)
-k k-mer size for error correction [31]
-c min. coverage in the metagenome [2]
-t number of threads [16]
-B process ~NUM bp in each batch [100M]
-v be verbose

引用

MeCorS: Metagenome-enabled error correction of single cell sequencing reads.

Bremges A, Singer E, Woyke T, Sczyrba A

Bioinformatics. 2016 Jul 15;32(14):2199-201.

2018-03-15

高カバレッジな細菌ゲノムのdenovoゲノムアセンブリツール HGA

assembly bacteria k-mer

　デノボゲノムアセンブリにはgreedy strategy、string overlap graph、そしてde Bruijn graphの3つの主なアプローチがある。greedy strategyは、シードリードを選択し、最大のオーバーラップが可能になるまで貪欲に拡張していくことによって機能する。このアプローチは、SSAKE [論文より ref.1]、SHARCGS [ref.2]、VCAKE [ref.3]などの初期のアセンブラで採用された。残念ながらこのアプローチは、リピートおよびシーケンシングエラーによって引き起こされるあいまいさを考慮せず、多数の誤ったアセンブリエラーを招く。string overlap graphのアプローチは、ノードとエッジとしてのリードと、すべてのノードの対を接続するエッジを有するグラフを、重複が最小の状態で構築することに基づいている。オーバーラップグラフを作成するには、計算集約的な全ペアごとの比較ステップが必要となる。グラフを構築した後、リードのレイアウトを計算し、コンセンサス配列をmultiple アライメントを用いて決定する。このアプローチは、Newbler [ref.4]やSGA [ref.5]のようなアセンブラで実装され、CABOG [ref.6]はSangerと454のシーケンシングのような長いリードに対してより効率的な方法になっている。

　de Bruijn graphモデル[ref.7]に基づく第3のアプローチは、ABySS [ref.8]、ALLPATHS-LG [ref.9]、Euler-USR [ref.10]、MaSuRCA [ref.10] (OLC+de Bruijn graph) 、SoapDenovo2 [ref.12]、SPAdes [ref.13]およびVelvet [14]など、NGSデータを対象とするアセンブラで最も一般的に使用されている[ref.11]。 de Bruijn graphの作成は、リードを長さk（k-mersと呼ばれる）の全部分文字列を集めることから始まる。それから k-mer配列aのk-1の長さがk-mer配列bの長さk-1の長さと一致しk+1の配列を得られるならば、aとbを重複させることによって拡張する。 de Bruijnグラフは線形時間で構築できるが、格納するには非常に大量のメモリが必要となり、通常はoverlap graphよりもはるかに大きくなる。 de Bruijn graphを作成した後、各アセンブラでは、ゲノムの反復やヘテロガスな部位によって引き起こされるcyclesやbulgesなどのグラフ構造をヒューリステックなアプローチで単純化する。最後にアセンブラは、最終的なコンティグを形成するde Bruijn graphの単純なパスを選択する。ゲノムアセンブリのアルゴリズムの詳細については、[ref.15-17]に記載されている。

　GAGE-Bの研究 [ref.20]では、バクテリアゲノムについて、高いカバレッジ（100-300×）のショートインサートライブラリをシーケンスして、MaSuRCAおよびSPAdes を単独で実行することで高いアセンブリ品質が得られることを示している。本論文のHGAは、このような高いカバレッジを利用するために、独立したサブセットのリードをアセンブルし、できたサブセットのアセンブリを結合し、最後にオリジナルのリードを使い結合されたコンティグを再アセンブリする。より高いカバレージと短いkmerよりも、低いカバレッジと短いkmerを使用する方がtips、bubbles、bulges、cycles、false branchは少なくなる。結果として得られるコンティグは、コンティグがより短くなるが、エラーがより少なくなる。この方法を実行するため、HGAは最初に読み込み全体を複数のパーティションに分割し、各パーティション内のカバレッジを低くしている。ただし低いカバレッジを使用することによって得られるコンティグは、カバレッジが高いよりも短くなり、より多くのギャップを有する。これを解決するために、各パーティションのアセンブリから生成された全てのコンティグをマージまたはアセンブルする。全てのパーティションから生成されたコンティグを組み立てることで、共通のコンティグ（より正確なコンティグである可能性が高い）を選択し、任意の重複または偽（部分または完全）コンティグをフィルタリングしている。100bpのIllumina HiSeqと250bpのIllumina MiSeqのデータからなる7つのGAGE-Bバクテリアデータセットを用いて、この方法論を経験的に評価している。結果は、HGAがN50および修正N50メトリックに基づいてアセンブリの品質を大幅に向上させることを示している。

インストール

環境

Python v2.7.6

依存

Velvet（make 'MAXKMERLENGTH=111' 'LONGSEQUENCES=1'をつけてビルドしていること、不明ならランのシェルスクリプトを真似する）

本体 Github

https://github.com/aalokaily/Hierarchical-Genome-Assembly-HGA

git clone https://github.com/aalokaily/Hierarchical-Genome-Assembly-HGA.git
cd Hierarchical-Genome-Assembly-HGA/
python HGA.py

$ python HGA.py

Hierarchical Genome Assembly version 1.0.0

--Prerequisite--

Velvet should be installed with option 'LONGSEQUENCES=1' in the make command, to allow Velvet to accept contigs (long sequences) as input; as well and 'MAXKMERLENGTH=111' as the default max kmer length is 31

--Parameters--:

-velvet path path to the directory where velvet bineries (velveth, velvetg) are located

-spades path path to the directory where spades.py file is located

-PA string select "SPAdes" or "velvet" to choose which assembler will be used to assemble the partitions

-P12 file file of the interlaced fastq file of paired reads that will be used in the partiton step, usually raw or clean (corrected) reads

-R12 file file of the interlaced fastq file of paired reads that will be used in the re-assembly step, usually raw or clean (corrected) reads

-ins int insert size of the fragments

-std int standard deviation of the inset size

-P int number of partitions

-Pkmer int(odd) kmer value for assembling the partitions

-Rkmer int(odd) kmer value for re-assembly step

-t int number of threads to be used for re-assembly step using SPAdes assembler, default 1.

-out Path output path

-h print option menue

ラン

テストランを行うスクリプトSample_test_cholera.shとSample_test_axonopodis.shが準備されている。これを動かしてみる。

chmod u+x Sample_test_cholera.sh
./Sample_test_cholera.sh

GAGE-B（リンク）からV_cholerae_HiSeq.tar.gzがダウンロードされ、解凍された後interleaveのfastqに分離される。spadesとvelvetがインストールされ、以下のコマンドが実行される。

curr=`pwd`

python HGA.py -velvet $curr/velvet_1.2.10 -spades $curr/SPAdes-3.0.0/bin -PA velvet -P12 $curr/data.fastq -R12 $curr/data.fastq -ins 335 -std 35 -P 4 -Pkmer 31 -Rkmer 51 -t 20 -out ./HGA

-P12 file of the interlaced fastq file of paired reads that will be used in the partiton step, usually raw or clean (corrected) reads
-R12 file of the interlaced fastq file of paired reads that will be used in the re-assembly step, usually raw or clean (corrected) reads
-ins insert size of the fragments
-std standard deviation of the inset size
-P number of partitions
-Pkmer (odd) kmer value for assembling the partitions
-Rkmer (odd) kmer value for re-assembly step
-t number of threads to be used for re-assembly step using SPAdes assembler, default 1.
-out output path

終わると出力ディレクトリにいくつかのファイルができる。part〜はサブセットのアセンブルディレクトリになる。上では-P 4なので、４つに分割してアセンブルしている。

$ ls -alth HGA/

total 848M

-rw-r--r-- 1 uesaka user 174K Mar 15 12:02 HGA.log

drwxr-xr-x 7 uesaka user 4.0K Mar 15 12:02 HGA_combined

drwxr-xr-x 7 uesaka user 4.0K Mar 15 11:33 HGA_merged

drwxr-xr-x 2 uesaka user 4.0K Mar 15 11:06 combined_contigs

drwxr-xr-x 2 uesaka user 4.0K Mar 15 11:05 merged_contigs

drwxr-xr-x 2 uesaka user 4.0K Mar 15 11:05 part_4_assembly

drwxr-xr-x 2 uesaka user 4.0K Mar 15 11:04 part_3_assembly

drwxr-xr-x 2 uesaka user 4.0K Mar 15 11:03 part_2_assembly

drwxr-xr-x 2 uesaka user 4.0K Mar 15 11:02 part_1_assembly

-rw-r--r-- 1 uesaka user 214M Mar 15 11:01 data._part_4.fastq

-rw-r--r-- 1 uesaka user 214M Mar 15 11:00 data._part_3.fastq

-rw-r--r-- 1 uesaka user 212M Mar 15 11:00 data._part_2.fastq

-rw-r--r-- 1 uesaka user 209M Mar 15 10:59 data._part_1.fastq

論文公開時はspade3.0を使っているが、現在ではspades3.11まで更新されている。現在のspadesと比較してみる。データは先ほどのコレラのtrimmed.fastqを使う。パラメータは上のテストと同じにした。spadesは以下のパラメータ設定でアセンブルを行った。

spades.py -k auto -t 30 --careful -1 reads_1.trimmed.fastq -2 reads_2.trimmed.fastq -o output

Questの分析結果。

f:id:kazumaxneo:20180315142207j:plain

このデータセットではspades(青線)の方がややcontiguityは上になった。

追記

spadesのk-merを51に変えても変化がなかった。

追記

HGAのk-merサイズを長くするとcontiguityが悪くなった（*1）。

引用

HGA: de novo genome assembly method for bacterial genomes using high coverage short sequencing reads.

Al-Okaily AA

BMC Genomics. 2016 Mar 5;17:193.

使えないと言っているわけではないことに注意してください。どんなツールもアルゴリズムの特性にあった使い方ができていなければパフォーマンスは発揮できません。

2018-03-14

メイトペア情報を使いスキャホールドの誤りを検出する NxRepair

contigのscaffolding assembly evaluation tool misassembly 2015 PeerJ

　ゲノムのde novoアセンブリの一般的な方法は、de Bruijnグラフ（論文より Compeau、Pevzner＆Tesler、2011）の構築に基づく。最も単純なケースでは、グラフはシングルエンドリードから構成されるが、シングルエンドのリードだけでは、de Bruijnグラフをもつれさせるリピート領域を解消することは困難である。Illumina Nextera mate pair（Illumina、2012）など、数キロベースの大きな挿入サイズのペアエンドシーケンスはリピート領域のアセンブルに有用である。アセンブリは、通常、2段階プロセスで行われる。まず、コンティグが構築される。次にコンティグがそれ以上伸長できなくなると、スキャフォールディングアルゴリズムは、コンティグの順序および近似ギャップサイズを決定するためにインサートサイズ情報を用いて複数のコンティグを連結させようとする。多くのアセンブラーは、コンティグのアセンブリプロセスとスキャフォールディングのプロセスの両方に、またはスキャフォールディングにメイトペアのインサートサイズ情報を組み込むが、エラーは引き続き発生する。最も重大な間違いは、ゲノムの2つの異なる領域が結合した大規模な伸長エラーである（論文　図1A）。同様に、large indelはde novoアセンブリで構造的な不規則性を作り出し、塩基レベルの誤りはそのポジションのみにエラーを導入する。

　de novoアセンブリのエラーは、アセンブルされたゲノムの品質に大きな影響を与える可能性があり、de novoアセンブリのエラー修正および品質評価は、かなりの注目されている問題である。 Assemblathon（Bradnam et al。、2013）やGAGE（Salzberg et al。、2012）などのアセンブラコンペティションでは、様々なアセンブラで作成された配列のクオリティをリファレンスゲノムと比較している。しかしながら、高品質のリファレンスがない場合、この方法は使えない。 Ghodsi et al（2013）は、参照ゲノムを必要とせずに、アセンブリクオリティスコアを計算するベイジアンアセンブリ品質評価方法を開発した。ただし、これは品質のスコアのみを出力し、エラーや低クオリティの領域を識別するために使用することはできない。最近の論文では、エラーの特定と修正方法が開発されており、アセンブリの品質をきめ細かく分析している。最もよく知られているものは、A5アセンブリパイプライン（Coil、Jospin＆Darling、2014; Tritt et al。、2012）で、A5にはメイトペア情報を利用するエラー検出と再スケーリングのステップが含まれている。REAPR（Hunt et al。、2013）とALE（Clark et al。、2013）も開発されたが、これらの新しいツールはメイトペア情報を使用するように最適化されていない。

　NxRepairは、Nextraのメイトペアのインサートサイズの分布を調べることで、最も重大なアセンブリの誤りであるスキャホールディングの誤りを特定できるアセンブリエラー検出ツール。異常なインサートサイズが多い領域を特定するか、またはリードが非常に少ない領域を特定し、スキャホールドを切断し、任意でトリミングも実行する。

チュートリアル

http://nxrepair.readthedocs.io/en/latest/tutorial.html#running-nxrepair

インストール

依存

scipy
numpy
matplotlib
pysam

pipで導入できる。

pip install numpy 
pip install scipy 
pip install matplotlib
pip install pysam

本体

git clone https://github.com/rebeccaroisin/nxrepair
cd nxrepair/nxrepair/
python nxrepair.py -h #ヘルプ

$ python nxrepair.py -h

usage: nxrepair.py [-h] [-min_size min_size] [-img_name img_name] [-trim trim]

[-T T] [-step_size step_size] [-window window]

[-minmapq minmapq] [-maxinsert maxinsert]

[-fraction fraction] [-prior prior]

bam fasta outfile newfasta

Routine to identify and correct large-scale misassemblies in de novo

assemblies

positional arguments:

bam bam

fasta scaffold fasta

outfile Output file name

newfasta Fasta file for new contigs, including filepath

optional arguments:

-h, --help show this help message and exit

-min_size min_size Minimum contig size to analyse

-img_name img_name Name under which to save (optional) graphs of

alignment quality. Default value: None (no graphs

produced)

-trim trim Number of bases to trim from each side of an

identified misassembly. Default value: 5000

-T T Threshold in Z score below which a misassembly is

called. Default value: -4.0

-step_size step_size Step-size in bases to traverse contigs. Default value:

1000

-window window Window size across which bridging mate pairs are

evaluated. Default value: 200

-minmapq minmapq Minimum MapQ value, above which a read pair is

included in calculating population statistics. Default

value: 40

-maxinsert maxinsert Maximum insert size, below which a read pair is

included in calculating population statistics. Default

value: 30000

-fraction fraction Minimum fraction of read pairs with correct

orientation to call support for the assembly. Default

value: 0.75

-prior prior Prior probablility that the insert size is anomalous.

Default value: 0.01

ラン

ランには、メイトペアのfastqをアセンブルしたfastaにアライメントして得たbamファイルが必要となる。bamが準備できたら、アセンブルのfastaと出力のファイル名も指定してコマンドを打つ。

python nxrepair.py smatepairs.bam assembly.fasta error_locations.csv new_fasta.fasta

-img_name img_name Name under which to save (optional) graphs of alignment quality. Default value: None (no graphs produced)
-min_size min_size Minimum contig size to analyse (default: 10000)

エラーを記載したCSVと、エラーが修正されたfastaファイルが出力される。

引用

NxRepair: error correction in de novo sequence assembly using Nextera mate pairs

Rebecca R. Murphy, Jared O’Connell, Anthony J. Cox, and Ole Schulz-Trieglaff

PeerJ. 2015; 3: e996.

2018-03-14

ウィルスintegration部位を分析するGUIツール ChimericSeq

virus human genome GUIツール

　ウイルスintegration部位の同定は、特定のウイルス感染に関連する疾患の病因および進行を理解する上で重要であるが、ウイルス - ホストjunction部位のNGSデータを解析するための現在の計算方法は、アクセス可能性の点で制限されている。たとえば、現在入手可能なプログラムのほとんどは、コマンドライン引数とLinuxベースの知識を必要とするため、Linuxのディストリビューションに精通していないMacユーザーや生物学者のアクセシビリティが制限されている。

ChimericSeqは、WGS、RNA-seq、およびtargeted sequencingを含むNGSデータから2生物間のキメラjunction配列を同定できるツールとして開発された。WindowsとMacの両方のオペレーティングシステムとの互換性を備えたグラフィカルユーザーインターフェイスを提供する最初のプログラムである。見つかったウイルスintegration部位の配列の位置および方向の情報、melting temperature (Tm) の分析、遺伝子情報および相同性分析を提供する。

インストール

pyenvでanaconda3-4.2.0環境に切り替えて動作を確認した。

本体（JBS ChimericSeq™ Software）の他、テストデータとリファレンスをJBS SCIENCE社（リンク）のHPからダウンロードできる。

http://www.jbs-science.com/ChimericSeq.php

ダウンロードしたzipファイルを解凍して内部で以下のように打つ。

python ChimericSeq.py
sudo chmod -R 0777 ./ChimericSeq_Mac 
sudo chmod a+x ./ChimericSeq_Mac/JBS_ChimericSeq.tool

本体を叩いて起動する。

f:id:kazumaxneo:20180314211741j:plain

ラン

テストデータを使ってランしてみる。Options -> Set Locationsを選択。

f:id:kazumaxneo:20180314154932j:plain

以下のようにした。Buildボタンを押すことでindexを構築できる。

f:id:kazumaxneo:20180314155029j:plain

ウィンドウをそのまま閉じる。

Options -> Configurationsを選択。

f:id:kazumaxneo:20180314155104j:plain

変更したいパラメータがあれば変えておく。

f:id:kazumaxneo:20180314155251j:plain

解析するfastqを選択する。ペアエンドデータは同時に選択してopenボタンを押す。

f:id:kazumaxneo:20180314154805j:plain

認識された。

f:id:kazumaxneo:20180314154846j:plain

解析をスタートするにはメインウィンドウ左下のStart Runを押す。

途中でアライメントを行うか聞いてくるので、メインウィンドウ左下のYes No ボタンを押して決定する。解析が終わると指定したフォルダにCSVファイル等ができる。

f:id:kazumaxneo:20180314155600j:plain

chromosome名、遺伝子のexon上かどうか、遺伝子のアノテーション、intergenicなら遺伝子までの距離などが記載されている。

f:id:kazumaxneo:20180314212206j:plain

引用

ChimericSeq: An open-source, user-friendly interface for analyzing NGS data to identify and characterize viral-host chimeric sequences.

Shieh FS, Jongeneel P, Steffen JD, Lin S, Jain S, Song W, Su YH

PLoS One. 2017 Aug 22;12(8):e0182843.

2018-03-13

ウィルスコミュニティを検出する viromescan

virus 2016 BMC Genomics

　ウイルスは常にヒトの体に生息している [論文よりref.1]。細菌および真菌のように、ある種のウイルスは、ヒト免疫の調節にとって重要な低レベルの免疫応答を刺激し得るが、代謝ホメオスタシスもまた刺激し得る。これに関して、FoxmanとIwasaki [ref.4]は、一般的な低病原性ウイルスによる絶え間の再感染が、免疫系の抗ウイルス成分を刺激し、1型糖尿病や喘息などの疾患に対する感受性と相関することを示した。一方、鼻咽頭の急性感染症に通常存在するウイルスは、健康な人でも広く検出されることが報告されている[ref.5]。

　通常、試料中のウイルスの量と多様性は過小評価されている[ref.14]。例えば、フィルタリング手順に基づいたウイルス単離のための方法は、巨大ウイルスを欠いていることが認識されている[ref.15]。全てのウイルスゲノムに共通する単一の遺伝子が存在しないため、特徴づけが困難であり、細菌に対するリボソームDNAプロファイリングに類似したアプローチの適用は難しい[ref.2]。微生物群集のウイルスは、ウイルスのシーケンスリードを適切なウイルスデータベースに割り当てることにより推定することができる。実際にメタゲノム試料は、細菌、古細菌、真核生物、ファージおよび真核生物ウイルスの核酸を含んでいる。未処理のシーケンスデータからリードを直接ウィルスゲノムに割り当てることができればウィルスのより迅速なcharacterizeを可能にし、濾過手順のために巨大ウイルスが失われることもない。

　ViromeScanはウイルスのコミュニティを正確にプロファイリングするツール。微生物内の真核生物ウイルスのコミュニティをプロファイリングするために開発された。ヒトおよび細菌起源のメタゲノム読み取りを除外し（denoise）、残りの配列を階層的ウイルスデータベースにマッピングすることによってDNAやRNAウイルスの相対的存在量を推定する。

インストール

cent OSに導入した。

本体SourceForgge

https://sourceforge.net/projects/viromescan/

mkdir viromescan
mv viromescan.tar.gz viromescan/ && cd viromescan/
tar -zxvf viromescan.tar.gz
cd viromescan/database/
gzip -d Bacteria_custom/* 
gzip -d bowtie2/* 
gzip -d hg19/*

> ./viromescan.sh

$ ./viromescan.sh

./viromescan.sh -1 <INPUT_FASTQ_paired_end1> -2 [INPUT_FASTQ_paired_end2] -d [DATABASE] -p [N_THREADS] -m [VIROMESCAN_PATH] -o <OUTPUT_DIR>

-1/--input1: .fastq file containing the sequences (paired end 1) (MANDATORY)

-2/--input2: .fastq file containing the sequences (paired end 2) (if available)

-d/--database: viral database, choose in the viromescan folder your database, among: human_ALL (RNA/DNA), human_DNA (DNA only), virus_ALL (vertebrates, invertebrates, plants and protozoa virus. NO bacteriophages), virus_DNA (vertebrates, invertebrates, plants and protozoa DNA virus. NO bacteriophages) (MANDATORY)

-p/--n_threads: number of threads to launch (default: 1)

-m/--viromescan_path: pathway to viromescan folder (default: working directory)

-o/--output: output directory (MANDATORY)

注; ランには空き容量が45 GB以上必要とされる。

ラン

入力ファイルは、DNA-seq またはRNA-seqのsingle-endまたはpaired-endのfastqとなっている。gzip、bzip2、zip形式も対応している。

ViromeScanは、ヒトDNAウイルス、ヒトDNA / RNAウイルス、真核生物DNAウイルス、および真核生物DNA / RNAウイルスからデータベースを選択してランする。ヒトウイルスデータベースは、自然宿主としてのヒトを有するウイルスのみを含む。一方、真核生物ウイルスデータベースには、バクテリオファージを除きながら、脊椎動物、無脊椎動物、真菌、藻類および植物のためのウイルスも含まれる。すべてのデータベースは、NCBIのウェブサイト[23]で利用可能な完全なウイルスゲノムに基づいている。

初回にindexを作成しておく。

cd database/hg19/
bmtool -d hg19reference.fa -o hg19reference.bitmask -A 0 -w 18

makeblastdb -in hg19reference.fa -dbtype nucl

ラン。ここでは"virus_ALL"を選んだ。

cd viromescan/
./viromescan.sh -1 pair1.fastq -2 pair2.fastq -p 12 -o output -d virus_ALL

-1　fastq file containing the sequences (paired end 1) (MANDATORY)
-2　fastq file containing the sequences (paired end 2) (if available)
-d　viral database, choose in the viromescan folder your database, among: human_ALL (RNA/DNA), human_DNA (DNA only), virus_ALL (vertebrates, invertebrates, plants and protozoa virus. NO bacteriophages), virus_DNA (vertebrates, invertebrates, plants and protozoa DNA virus. NO bacteriophages) (MANDATORY)
-p　number of threads to launch (default: 1)
-m　pathway to viromescan folder (default: working directory)
-o　output directory (MANDATORY)

databaseがないというエラーが出たら、-mでviromescanの場所を指定してやれば良い。例えば/home/uesaka/viromescan/viromescanにdatabase/があるなら、"-m /home/uesaka/viromescan/"と指定する。

引用

ViromeScan: a new tool for metagenomic viral community profiling.

Rampelli S, Soverini M, Turroni S, Quercia S, Biagi E, Brigidi P, Candela M.

BMC Genomics. 2016 Mar 1;17:165.

From Whole-Genome Shotgun Sequencing to Viral Community Profiling: The ViromeScan Tool.

Rampelli S, Turroni S.

Methods Mol Biol. 2018;1746:181-185.

Biostars (error)

https://www.biostars.org/p/280187/

2018-03-11

　　　シングルセルの汚染を検出する ACDC

single cell quality check contamination

　シングルセルシーケンスの主な課題は、コンタミの可能性とその検出である[論文よりref.7]。標的ゲノムに属さない外来DNAは、複数の方法で試料に導入され得る。コンタミの原因には、全ゲノム増幅試薬が含まれる可能性すらあり得る[ref.8、9]。これらの障害を考慮してコンタミを除くための自動プロトコールが最近開発された[ref.11]。 ProDeGeは、よく使われるPCA-reduced oligonucleotide profilingとBLASTアルゴリズムを組み合わせている。CheckM [ref.13]は、コンタミネーションの指標として、複数のシングルコピーマーカー遺伝子の存在のみを使っている（checkM紹介）。より最近の分類方法[ref.14,15]、特によく知られたKraken [ref.16]は、BLASTと同レベルの検出精度を持ち、かつはるかに速い（kraken紹介）。ただしこれらの技術はすべてリファレンスに大きく依存しているため、データベースに含まれていない汚染配列やサンプル中に存在しないマーカー遺伝子は検出できない。大多数の種は未知であるため[ref.5]、そのような方法で検出することは困難であり、教師なしの分析が必要である[ref.17]。計算知能の観点からは、コンタミの検出は、メタゲノムビニングと非常によく似ている。メタゲノムおよびSCSサンプルの両方は、一次元の高次元データポイントとして表すことができる（以下略）。

　ACDC（automated contamination detection and confidence estimation for single-cell genome data）は、リファレンスベースの手法とリファレンスを統合するシングルセルゲノムデータの自動汚染検出ツール。監督された方法と監督されない方法を組み合わせることにより、既知のコンタミと未知のコンタミの両方を確実に検出する。

公式サイト（ web版もあり）

https://bibiserv2.cebitec.uni-bielefeld.de/acdc

インストール

cent OSに導入した。

依存

C++11 compatible compiler
CMake >= 3.0
Boost C++ Libraries >= 1.55.0 (program_options, system, filesystem)
Kraken >= 0.10.5-beta (optional)
RNAmmer = 1.2 (optional)

本体 Github

https://github.com/mlux86/acdc

git clone https://github.com/mlux86/acdc.git
cd acdc/
mkdir build && cd build
cmake ..
make -j4
sudo make install

$ ./acdc -h

,'/

,' / █████╗ ██████╗██████╗ ██████╗ (a)utomated

,' /_____, ██╔══██╗██╔════╝██╔══██╗██╔════╝ (c)ontamination

.'____ ,' ███████║██║ ██║ ██║██║ (d)etection and

/ ,' ██╔══██║██║ ██║ ██║██║ (c)onfidence estimation for single-cell genome data

/ ,' ██║ ██║╚██████╗██████╔╝╚██████╗

/,' ╚═╝ ╚═╝ ╚═════╝╚═════╝ ╚═════╝

Usage:

-h [ --help ] Display this help message

-v [ --verbose ] Verbose output (use -vv for more or -vvv for maximum verbosity)

-q [ --quiet ] No output

-i [ --input-fasta ] arg Input FASTA file

-I [ --input-list ] arg File with a list of input FASTA files, one per line

-d [ --tsne-dimension ] arg (=2) T-SNE dimension

-u [ --tsne-pca-dimension ] arg (=50) T-SNE initial PCA dimension

-p [ --tsne-perplexity ] arg T-SNE perplexity (overrides automatic estimation)

-t [ --tsne-theta ] arg (=0.5) T-SNE parameter 'theta' of the underlying Barnes-Hut approximation

-m [ --min-contig-length ] arg (=0) Minimal length of contigs to consider for analysis

-k [ --window-kmer-length ] arg (=4) Length of the k-mers in the sequence vectorizer window

-w [ --window-width ] arg Width of the sliding sequence vectorizer window (overrides automatic estimation using number of target points)

-s [ --window-step ] arg Step of the sliding sequence vectorizer window (overrides automatic estimation using number of target points)

-n [ --target-num-points ] arg (=1000) Approximate number of target points for estimating window parameters

-b [ --num-bootstraps ] arg (=10) Number of bootstraps

-r [ --bootstrap-ratio ] arg (=0.75) Bootstrap subsampling ratio

-T [ --num-threads ] arg (=40) Number of threads for bootstrap analysis (default: detect number of cores)

-o [ --output-dir ] arg (=./results) Result output directory

-K [ --kraken-db ] arg Database to use for Kraken classification

-a [ --aggressive-threshold ] arg (=5000) Aggressive threshold: Treat clusters having a bp size below this threshold as outliers. (Default = 0 = aggressive mode disabled)

Dockerイメージでも提供されている。

ラン

fastqとkrakenデータベースを指定してランする。

acdc -i inout.fq -K ~/kraken/database/

引用

acdc - Automated Contamination Detection and Confidence estimation for single-cell genome data.

Lux M, Krüger J, Rinke C, Maus I, Schlüter A, Woyke T, Sczyrba A, Hammer B.

BMC Bioinformatics. 2016 Dec 20;17(1):543.

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

k-merサイズを変えながらエラー訂正を繰り返す SGA-ICE (IterativeErrorCorrection)

メタゲノムデータを使ってシングルセルのエラー訂正を行う MeCorS

高カバレッジな細菌ゲノムのdenovoゲノムアセンブリツール HGA

メイトペア情報を使いスキャホールドの誤りを検出する NxRepair

ウィルスintegration部位を分析するGUIツール ChimericSeq

ウィルスコミュニティを検出する viromescan

シングルセルの汚染を検出する ACDC