Informatics on Mac

A collection of notes on informatics for HTS (NGS) data.

BRANCH: improving de novo RNA-seq assemblies with reference-guided assembly

 

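De novo RNA-seq assembly in non-model organisms often yields only fragmented transcripts, which makes differential expression (DEG) analysis difficult. BRANCH is a method aimed at such incomplete transcripts: it uses the genome of a closely related species, or contig sequences, as a guide for the assembled RNA and so improves the accuracy of the assembly. It achieves this by inferring exons on the genomic sequences.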

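A run requires a transcript assembly produced by a de novo RNA assembler (Oases, Trinity, Trans-ABySS, SOAPdenovo, etc.) and genome (or contig or gene) sequences. Accuracy depends on how close the reference is (it will probably struggle if the reference is not from the same species) and on whether exons are identified correctly.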

 

Dependencies

LEMON: graph library

http://lemon.cs.elte.hu/trac/lemon/wiki/Downloads

Build LEMON.

cd lemon-x.y.z
mkdir build
cd build
cmake ..
make
make check #self-test
sudo make install # installs under /usr/local, which is normally on the search path

 > lemon -x

$ lemon -x

Lemon version 1.0

 

Main program (GitHub)

https://github.com/baoe/BRANCH

g++ -o BRANCH BRANCH.cpp -lemon -lpthread # LEMON must already be built and installed where the compiler can find it

> ./BRANCH

$ ./BRANCH 

BRANCH: boosting RNA-Seq assemblies with partial or related genome sequences

By Ergude Bao, CS Department, UC-Riverside. All Rights Reserved

 

BRANCH --read1 reads_1.fa --read2 reads_2.fa --transfrag transfrags.fa --contig contigs.fa --transcript transcripts.fa [--insertLow insertLow --insertHigh insertHigh --threshSize threshSize --threshCov threshCov --threshSplit threshSplit --threshConn threshConn --closeGap --noAlignment --lowEukaryote]

Inputs: 

--read1 is the first pair of PE RNA reads or single-end RNA reads in fasta format

--read2 is the second pair of PE RNA reads in fasta format

--transfrag is the de novo RNA transfrags to be extended

--contig is the DNA contigs or the genes of close related species

Output: 

--transcript is the extended de novo transfrags

Options: 

--insertLow is the lower bound of insert length (highly recommended; default: 0)

--insertHigh is the upper bound of insert length (highly recommended; default: 99999)

--threshSize is the minimum size of a genome region that could be identified as an exon (default: 2 bp)

--threshCov is the minimum coverage of a genome region that could be identified as an exon (default: 2)

--threshSplit is the minimum upstream and downstream junction coverages to split a genome region into more than one exons (default: 2)

--threshConn is the minimum connectivity of two exons that could be identified as a junction (default: 2)

--closeGap closes sequencing gaps using PE read information (default: none)

--noAlignment skips the initial time-consuming alignment step, if all the alignment files have been provided in tmp directory (default: none)

--lowEukaryote runs in a different mode for low eukaryotes with rare splice variants (default: none)

--misassemblyRemoval detects and then breaks at or removes misassembed regions (default: none)

Add the directory containing the BRANCH binary to your PATH.

 

Running

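Run BRANCH by specifying paired-end reads in FASTA format, the transcript sequences assembled with Oases or Trinity, and the reference DNA sequences (a genome, contigs, or genes all work).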

BRANCH --read1 reads1.fa --read2 reads2.fa --transfrag transfrags.fa --contig contigs.fa --transcript output.fa
  • --read1 is the first pair of PE RNA reads or single-end RNA reads in fasta format
  • --read2 is the second pair of PE RNA reads in fasta format
  • --transfrag is the de novo RNA transfrags to be extended
  • --contig is the DNA contigs or the genes of close related species
  • --transcript is the extended de novo transfrags (output file)

 

For FASTQ-to-FASTA conversion, see here.
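Since BRANCH takes reads in FASTA format, the conversion can also be sketched with plain awk. This is only a minimal sketch, not the method from the linked post: it assumes standard 4-line FASTQ records (no wrapped sequence lines), and the function name fastq_to_fasta and the demo file are hypothetical.

```shell
# Convert a 4-line-per-record FASTQ file to FASTA on stdout.
# Assumption: each record is exactly header, sequence, "+", quality.
fastq_to_fasta() {
    awk 'NR % 4 == 1 { print ">" substr($0, 2) }   # header: swap leading "@" for ">"
         NR % 4 == 2 { print }                      # sequence line, copied as-is
        ' "$1"
}

# Tiny demo on a throwaway single-record file.
demo_fq=$(mktemp)
printf '@read1/1\nACGTACGT\n+\nIIIIIIII\n' > "$demo_fq"
fastq_to_fasta "$demo_fq"
rm -f "$demo_fq"
```

For real data, each mate file would be converted separately, for example fastq_to_fasta reads_1.fq > reads_1.fa, keeping the two files in the same read order.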


 

 

Tips from the GitHub page

  • Single-end reads should have the same length and are not recommended, since the quality of single-end alignment is hard to be kept.
  • It is better to use related gene sequences rather than related genome sequences to greatly reduce run time and memory usage.
  • Though --insertLow and --insertHigh are options, they should always be specified to generate meaning result. Suppose the insert length is I, insertLow = I - 20 and insertHigh = I + 20 would be fine.
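The last tip amounts to simple arithmetic on the insert length. A minimal sketch, assuming a hypothetical insert length of 200 bp (substitute the value for your own library); it only prints the resulting command line rather than running BRANCH, and the file names are the placeholders used in the example above.

```shell
# Hypothetical insert length of the paired-end library, in bp.
insert_len=200
# Margin suggested in the tip: I - 20 and I + 20.
margin=20

insert_low=$((insert_len - margin))
insert_high=$((insert_len + margin))

# Print the command that would be run with these bounds.
echo "BRANCH --read1 reads1.fa --read2 reads2.fa --transfrag transfrags.fa --contig contigs.fa --transcript output.fa --insertLow $insert_low --insertHigh $insert_high"
```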

 

 

For clustering fragmented transcripts from non-model organisms, tools such as Corset are available.


 

Reference

BRANCH: boosting RNA-Seq assemblies with partial or related genomic sequences.

Bao E, Jiang T, Girke T.

Bioinformatics. 2013 May 15;29(10):1250-9. doi: 10.1093/bioinformatics/btt127. Epub 2013 Mar 14.