P_RNA_scaffolder: improving assemblies with paired-end RNA sequencing

Updated 2020-07-12

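In genome sequencing projects, gene identification is fundamental to functional studies and comparative analyses. Mate-pair libraries and long reads make it easier to produce high-quality assemblies, but recovering the complete structure of every gene remains difficult, and novel assembly methods are needed to solve this. The transcribed genome produces different types of RNA, such as mRNAs and long non-coding RNAs [from the paper, ref.1]. These abundant RNAs can be used in assembly to complete the structure of transcribed regions. Because genome assembly methods do not yet deliver satisfactory quality, many approaches have been developed that use proteins [ref.2] or transcripts [ref.3] as guides to improve the contiguity of gene regions. The authors previously developed L_RNA_scaffolder, an assembly tool that incorporates transcripts from long single-end RNA sequencing reads or paired-end RNA sequencing reads [ref.3]. That tool has been widely adopted in genome projects to improve genome assemblies and gene annotation [ref.5-8]. In contrast to the strategy of using long single-end transcripts, Mortazavi et al. used paired-end RNA sequencing for assembly with RNAPATH [ref.10]. BESST_RNA (https://github.com/ksahlin/BESST_RNA), Rascaf [ref.11], and AGOUTI [ref.12] are other scaffolders that use paired-end RNA sequencing data. However, these tools are either error-prone or have complex procedures and long run times. Having scaffolders take the coverage and size of the RNA sequencing data into account should improve the performance of tools that use paired-end RNA sequencing data.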

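P_RNA_scaffolder is a tool that scaffolds genome assemblies using paired-end RNA sequencing. Compared with other scaffolders that use RNA sequencing reads, the authors report that it has the shortest run time and produces the longest contigs with the highest accuracy. One notable advantage is that, after scaffolding, the improved proportion of completely covered protein-coding and non-coding genes is close to the proportion in the complete genome.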

Note: the links below are broken.

Official page: download

http://www.fishbrowser.org/software/P_RNA_scaffolder/index.php/Home/Index/Downloads.html

Manual

http://www.fishbrowser.org/software/P_RNA_scaffolder/index.php/Home/Index/Documentation.html

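Installation

GitHub

Installed on CentOS.

Dependencies

BWA (RNA mapping; used to create the SAM file for prokaryote runs)
HISAT2 (RNA mapping; used to create the SAM file for eukaryote runs)

Both BWA and HISAT2 can be installed with brew. Alternatively, in a Bioconda environment they can be installed with conda.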

brew install hisat2 bwa # install both with brew
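As a rough sketch of the conda route mentioned above, the comment below shows the usual Bioconda install command, and `check_tools` (a hypothetical helper name, not part of any of these packages) reports which mappers are still missing from PATH. It assumes a working conda setup with access to the Bioconda channel.

```shell
# Optional alternative to brew (assumes conda with the Bioconda channel available):
#   conda install -c conda-forge -c bioconda hisat2 bwa

# check_tools is a hypothetical helper: it reports every command that is
# not on PATH and returns non-zero if at least one is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage: check_tools hisat2 hisat2-build bwa
```

Running the check before a long mapping job avoids finding out halfway through that one of the mappers was never installed.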

The tool itself is implemented as shell scripts and C++ (precompiled), so it can be run simply by executing the shell script downloaded from the page above.

tar -zxvf P_RNA_scaffolder.tar.gz
cd P_RNA_scaffolder/
sh P_RNA_scaffolder.sh

$ sh P_RNA_scaffolder.sh

Usage: sh P_RNA_scaffolder.sh -d Program_DIR -i inputfile.sam -j contig.fasta -F read_1.fastq -R read_2.fastq -s yes

Input options
-d the installing directory of P_RNA_scaffolder [ mandatory ]
-i SAM file of RNA-seq alignments to contigs with hisat [ mandatory ]
-j Pre-assembled contig FASTA file [ mandatory ]
-F FASTQ file of left reads [ mandatory ]
-R FASTQ file of right reads [ mandatory ]

Output options
-o write all output files to this directory [ default: ./ ]

Species options
-s the target species is Eukaryote or Prokaryote [default: yes ]
    (1) yes represents that the target species is Eukaryote.
    (2) no represents that the target species is Prokaryote

Two modes selection options
-b re-align filtered RNA-seq reads to contigs with BLAT [ default: yes ]
    (1) If yes, perform the 'accurate' mode using BLAT to further filter out reads. The 'accurate' scaffolding has higher accuracy and longer running time than the 'fast' mode.
    (2) If no, perform the 'fast' mode without BLAT re-alignment and this mode is faster than the 'accurate' mode with less accuracy.
-p BLAT alignment identity cutoff [ default: 0.90 ]
-t number of threads used in BLAT re-alignment [ default: 5 ]

Scaffolding options
-e the maximal allowed intron length [ default: 100000 ]
-f the minimal supporting RNA-seq pair number [ default: 2 ]
-n the number of inserted N to indicate a gap [ default: 100 bp ]

All other dependency binaries are bundled in the download package.

Running

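A run requires the contigs obtained by assembling the DNA reads and a SAM file obtained by mapping the RNA-seq data to those contigs. For the mapping step, HISAT2 is recommended for eukaryotic RNA-seq; for prokaryotes, use BWA.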

RNA-seq from eukaryotes

hisat2-build contigs.fa human_hisat 
hisat2 -x human_hisat -1 read_1.fq -2 read_2.fq -k 3 -p 10 --pen-noncansplice 1000000 -S input.sam

RNA-seq from prokaryotes

bwa index -a is contigs.fa 
bwa mem -t 10 contigs.fa read_1.fq read_2.fq >input.sam
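The choice between the two mappers can be captured in a small dry-run helper. This is only a sketch: `print_mapping_cmds` and the index name derived from the contig file name are assumptions made for illustration, while the mapper options are the ones shown above.

```shell
# print_mapping_cmds is a hypothetical helper. It only prints the commands
# (a dry run) so they can be reviewed before being executed.
#   $1: yes = eukaryote (HISAT2), no = prokaryote (BWA)
#   $2: contig FASTA, $3/$4: paired FASTQ files, $5: output SAM
#   $6: number of threads (default 10)
print_mapping_cmds() {
    euk=$1; contigs=$2; r1=$3; r2=$4; sam=$5; threads=${6:-10}
    # Index name is an assumption: contig file name without its extension.
    index="${contigs%.*}_hisat"
    case "$euk" in
        yes)
            echo "hisat2-build $contigs $index"
            echo "hisat2 -x $index -1 $r1 -2 $r2 -k 3 -p $threads --pen-noncansplice 1000000 -S $sam"
            ;;
        no)
            echo "bwa index -a is $contigs"
            echo "bwa mem -t $threads $contigs $r1 $r2 > $sam"
            ;;
        *)
            echo "first argument must be yes or no" >&2
            return 1
            ;;
    esac
}

# Usage: print_mapping_cmds yes contigs.fa read_1.fq read_2.fq input.sam 10
```

The yes/no argument mirrors the `-s` option of P_RNA_scaffolder.sh, so the same value can be reused for the scaffolding step.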

--Test--

Let's try the official C. elegans run. Download the RNA-seq data from SRA with SRA Toolkit (if you do not have SRA Toolkit, download it and add it to your PATH). If the download is slow, consider Aspera.

prefetch SRR4017997 # download the sequence data
mkdir fastq
# Convert to paired-end FASTQ. The two files take about 8 GB each.
fastq-dump /Users/user/ncbi/public/sra/SRR4017997.sra --split-files -O fastq/
# This creates fastq/SRR4017997_1.fastq and fastq/SRR4017997_2.fastq.

# Delete the .sra file
rm /Users/user/ncbi/public/sra/SRR4017997.sra

# Download the C. elegans contig file from the official site
wget http://www.fishbrowser.org/software/P_RNA_scaffolder/index.php/Home/Index/down2/content/Celegans.random.contigs.fa.gz
# Decompress
gzip -dv Celegans.random.contigs.fa.gz
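Before mapping, it can be worth confirming that the two FASTQ files produced by `--split-files` contain the same number of reads, since an interrupted conversion is easy to miss with files this large. The helpers below are hypothetical names, and the sketch assumes plain four-line FASTQ records.

```shell
# count_fastq_reads is a hypothetical helper: it assumes four-line FASTQ
# records (no wrapped sequence lines) and prints the number of records.
count_fastq_reads() {
    awk 'END { print int(NR / 4) }' "$1"
}

# check_pairs prints both counts and returns 0 only when they are equal.
check_pairs() {
    n1=$(count_fastq_reads "$1")
    n2=$(count_fastq_reads "$2")
    echo "left=$n1 right=$n2"
    [ "$n1" -eq "$n2" ]
}

# Usage: check_pairs fastq/SRR4017997_1.fastq fastq/SRR4017997_2.fastq
```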

Create the SAM file.

hisat2-build Celegans.random.contigs.fa Celegans_hisat
hisat2 -x Celegans_hisat -1 SRR4017997_1.fastq -2 SRR4017997_2.fastq -k 3 -p 24 --pen-noncansplice 1000000 -S output.sam

-k 3 report up to 3 alignments per read.
-p 24 use 24 threads to align reads.
--pen-noncansplice 1000000 apply a high penalty to non-canonical splice sites.
-S output.sam write the alignments to output.sam.
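To get a feel for how much scaffolding signal a SAM file carries, one can count the read pairs whose mates align to different contigs. The sketch below is not P_RNA_scaffolder's own algorithm; `count_contig_links` is a hypothetical helper that relies only on standard SAM columns (FLAG, RNAME, RNEXT), and its minimum-support argument merely echoes the idea behind the `-f` option.

```shell
# count_contig_links is a hypothetical helper, not part of P_RNA_scaffolder.
#   $1: SAM file, $2: minimum number of supporting read pairs (default 2)
# It prints "contigA contigB pairs" for contig pairs linked by enough pairs.
count_contig_links() {
    awk -v min="${2:-2}" '
        /^@/ { next }                                   # skip header lines
        $3 == "*" || $7 == "*" || $7 == "=" || $7 == $3 { next }  # unmapped, or mate on the same contig
        int($2 / 256) % 2 == 1 || int($2 / 2048) % 2 == 1 { next } # skip secondary and supplementary records
        int($2 / 64) % 2 == 0 { next }                  # keep first-in-pair only, so each pair counts once
        {
            a = $3; b = $7
            if (a > b) { t = a; a = b; b = t }          # merge A-B and B-A into one key
            links[a " " b]++
        }
        END {
            for (k in links)
                if (links[k] >= min) print k, links[k]
        }
    ' "$1" | sort
}

# Usage: count_contig_links output.sam 2
```

Because the mapping above uses `-k 3`, a read can be reported more than once; the sketch keeps only primary records, which is a simplification.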

Run P_RNA_scaffolder.

sh P_RNA_scaffolder.sh -d /home/uesaka/P_RNA_scaffolder -i output.sam -j Celegans.random.contigs.fa -F SRR4017997_1.fastq -R SRR4017997_2.fastq -s yes

-d the installing directory of P_RNA_scaffolder [ mandatory ]
-i SAM file of RNA-seq alignments to contigs with hisat [ mandatory ]
-j Pre-assembled contig FASTA file [ mandatory ]
-F FASTQ file of left reads [ mandatory ]
-R FASTQ file of right reads [ mandatory ]
-o write all output files to this directory [ default: ./ ]
-s the target species is Eukaryote or Prokaryote [ default: yes ] yes represents that the target species is Eukaryote. no represents that the target species is Prokaryote
-b re-align filtered RNA-seq reads to contigs with BLAT [ default: yes ] (1) If yes, perform the 'accurate' mode using BLAT to further filter out reads. The 'accurate' scaffolding has higher accuracy and longer running time than the 'fast' mode. (2) If no, perform the 'fast' mode without BLAT re-alignment and this mode is faster than the 'accurate' mode with less accuracy.
-p BLAT alignment identity cutoff [ default: 0.90 ]
-t number of threads used in BLAT re-alignment [ default: 5 ]
-e the maximal allowed intron length [ default: 100000 ]
-n the number of inserted N to indicate a gap [ default: 100 bp ]
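Since all five inputs are mandatory, a thin wrapper that checks them before launching can save a failed run. This is a minimal sketch: `run_scaffolder`, the `DRY_RUN` variable, and the default output directory are assumptions for illustration; only the flags themselves come from the usage text.

```shell
# run_scaffolder is a hypothetical wrapper around P_RNA_scaffolder.sh.
#   $1: install directory, $2: SAM, $3: contig FASTA, $4/$5: FASTQ pair
#   $6: output directory (default ./scaffold_out, an arbitrary choice)
#   $7: yes/no for eukaryote (default yes)
# With DRY_RUN=1 it prints the command instead of running it.
run_scaffolder() {
    dir=$1; sam=$2; contigs=$3; r1=$4; r2=$5
    out=${6:-./scaffold_out}; euk=${7:-yes}

    # Fail early when the script or a mandatory input is missing.
    for f in "$dir/P_RNA_scaffolder.sh" "$sam" "$contigs" "$r1" "$r2"; do
        if [ ! -e "$f" ]; then
            echo "not found: $f" >&2
            return 1
        fi
    done

    set -- sh "$dir/P_RNA_scaffolder.sh" -d "$dir" -i "$sam" -j "$contigs" \
        -F "$r1" -R "$r2" -s "$euk" -o "$out"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$@"
    else
        # Creating the directory first is a precaution (an assumption that
        # the script does not create it itself).
        mkdir -p "$out" && "$@"
    fi
}

# Usage:
#   DRY_RUN=1 run_scaffolder /home/uesaka/P_RNA_scaffolder output.sam \
#       Celegans.random.contigs.fa SRR4017997_1.fastq SRR4017997_2.fastq
```

Options such as `-b no` for the faster mode could be appended in the same way once the basic run works.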