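De novo RNA-seq analysis of non-model organisms typically yields only fragmented transcripts, which makes differential expression (DEG) analysis difficult. BRANCH is a method designed for such incomplete transcripts: it uses the genome of a closely related species, or contig sequences, as a guide for the transcripts in order to improve assembly accuracy. It does this by inferring exons on the genomic sequence.
A run requires a transcript assembly produced by a de novo RNA assembler (Oases, Trinity, Trans-ABySS, SOAPdenovo, etc.) and genome (or contig or gene) sequences. Accuracy depends on how close the reference is (results will probably be poor if it is not from the same species) and on whether exons are identified correctly.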
Dependencies
lemon: graph library
http://lemon.cs.elte.hu/trac/lemon/wiki/Downloads
Build lemon.
cd lemon-x.y.z
mkdir build
cd build
cmake ..
make
make check #self-test
sudo make install # installs into /usr/local
$ lemon -x
Lemon version 1.0
Main program (GitHub)
https://github.com/baoe/BRANCH
g++ -o BRANCH BRANCH.cpp -lemon -lpthread # lemon must already be built and installed where the compiler can find it
$ ./BRANCH
BRANCH: boosting RNA-Seq assemblies with partial or related genome sequences
By Ergude Bao, CS Department, UC-Riverside. All Rights Reserved
BRANCH --read1 reads_1.fa --read2 reads_2.fa --transfrag transfrags.fa --contig contigs.fa --transcript transcripts.fa [--insertLow insertLow --insertHigh insertHigh --threshSize threshSize --threshCov threshCov --threshSplit threshSplit --threshConn threshConn --closeGap --noAlignment --lowEukaryote]
Inputs:
--read1 is the first pair of PE RNA reads or single-end RNA reads in fasta format
--read2 is the second pair of PE RNA reads in fasta format
--transfrag is the de novo RNA transfrags to be extended
--contig is the DNA contigs or the genes of close related species
Output:
--transcript is the extended de novo transfrags
Options:
--insertLow is the lower bound of insert length (highly recommended; default: 0)
--insertHigh is the upper bound of insert length (highly recommended; default: 99999)
--threshSize is the minimum size of a genome region that could be identified as an exon (default: 2 bp)
--threshCov is the minimum coverage of a genome region that could be identified as an exon (default: 2)
--threshSplit is the minimum upstream and downstream junction coverages to split a genome region into more than one exons (default: 2)
--threshConn is the minimum connectivity of two exons that could be identified as a junction (default: 2)
--closeGap closes sequencing gaps using PE read information (default: none)
--noAlignment skips the initial time-consuming alignment step, if all the alignment files have been provided in tmp directory (default: none)
--lowEukaryote runs in a different mode for low eukaryotes with rare splice variants (default: none)
--misassemblyRemoval detects and then breaks at or removes misassembed regions (default: none)
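Add BRANCH to your PATH.
Run
Run BRANCH by specifying the paired-end reads in FASTA format, the transcripts assembled with Oases, Trinity, or a similar tool, and the reference DNA sequences (a genome, contigs, or genes all work).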
BRANCH --read1 reads1.fa --read2 reads2.fa --transfrag transfrags.fa --contig contigs.fa --transcript output.fa
- --read1 is the first pair of PE RNA reads or single-end RNA reads in fasta format
- --read2 is the second pair of PE RNA reads in fasta format
- --transfrag is the de novo RNA transfrags to be extended
- --contig is the DNA contigs or the genes of close related species
- --transcript is the extended de novo transfrags (output file)
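When BRANCH is called from a pipeline, it can help to assemble the argument list in one place. The following is a minimal sketch, not part of BRANCH itself: the function name `build_branch_command` and the `extra` parameter are made up for illustration, and only flags that appear in the usage text are emitted.

```python
import shlex


def build_branch_command(read1, read2, transfrag, contig, transcript, extra=()):
    """Return the BRANCH argument list for a paired-end run.

    Hypothetical helper: it only strings together flags shown in the
    BRANCH usage text; `extra` passes optional flags through unchanged.
    """
    cmd = [
        "BRANCH",
        "--read1", read1,
        "--read2", read2,
        "--transfrag", transfrag,
        "--contig", contig,
        "--transcript", transcript,
    ]
    cmd.extend(extra)
    return cmd


cmd = build_branch_command(
    "reads1.fa", "reads2.fa", "transfrags.fa", "contigs.fa", "output.fa",
    extra=["--closeGap"],
)
print(shlex.join(cmd))  # shell-quoted form, handy for logging
# To actually launch it (assumes BRANCH is on PATH):
#   import subprocess; subprocess.run(cmd, check=True)
```

Keeping the command as a list rather than a single string avoids quoting problems when file names contain spaces.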
For converting FASTQ to FASTA, see here.
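Since BRANCH expects reads in FASTA format, here is a minimal conversion sketch. It assumes plain four-line FASTQ records (no wrapped sequences); the function names `fastq_to_fasta` and `convert_file` are illustrative and do not come from BRANCH or any particular library.

```python
from itertools import islice


def fastq_to_fasta(fastq_lines):
    """Yield FASTA lines from an iterable of 4-line FASTQ records."""
    # Skip blank lines (assumption: no zero-length reads in the input).
    it = (line for line in fastq_lines if line.strip())
    while True:
        record = [line.rstrip("\r\n") for line in islice(it, 4)]
        if not record:
            return
        header_ok = len(record) == 4 and record[0].startswith("@")
        if not header_ok or not record[2].startswith("+"):
            raise ValueError(f"malformed FASTQ record: {record!r}")
        yield ">" + record[0][1:]  # replace the leading '@' with '>'
        yield record[1]            # sequence; the quality line is dropped


def convert_file(fastq_path, fasta_path):
    """Convert one FASTQ file to FASTA, streaming record by record."""
    with open(fastq_path) as src, open(fasta_path, "w") as dst:
        for line in fastq_to_fasta(src):
            dst.write(line + "\n")
```

For paired-end data, `convert_file` would be called once per mate file, for example `convert_file("reads_1.fq", "reads_1.fa")`, where the file names are placeholders.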
Tips listed on GitHub
- Single-end reads should have the same length and are not recommended, since the quality of single-end alignment is hard to be kept.
- It is better to use related gene sequences rather than related genome sequences to greatly reduce run time and memory usage.
- Though --insertLow and --insertHigh are options, they should always be specified to generate meaning result. Suppose the insert length is I, insertLow = I - 20 and insertHigh = I + 20 would be fine.
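The first and third tips can be checked before a run with a small sketch like the following. The helper names are hypothetical, the 20 bp margin is taken from the tip above, and clamping the lower bound at 0 is an assumption of this sketch rather than documented BRANCH behaviour.

```python
def insert_bounds(insert_length, margin=20):
    """Return (insertLow, insertHigh) as insert_length -/+ margin."""
    if insert_length < 0 or margin < 0:
        raise ValueError("insert_length and margin must be non-negative")
    # Assumption: a negative lower bound makes no sense, so clamp at 0.
    return max(0, insert_length - margin), insert_length + margin


def insert_options(insert_length, margin=20):
    """Return the two options as a list ready to append to a command."""
    low, high = insert_bounds(insert_length, margin)
    return ["--insertLow", str(low), "--insertHigh", str(high)]


def read_lengths(fasta_lines):
    """Return the set of sequence lengths found in a FASTA stream.

    Wrapped sequences are handled by summing lines until the next header.
    A single-element result means all reads have the same length.
    """
    lengths = set()
    current = None
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if current is not None:
                lengths.add(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.add(current)
    return lengths


print(insert_options(200))  # ['--insertLow', '180', '--insertHigh', '220']
```

For single-end data, `len(read_lengths(open("reads.fa"))) == 1` would confirm that every read has the same length; `reads.fa` is a placeholder file name.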
For clustering fragmented transcripts from non-model organisms, tools such as Corset are available.
Reference
BRANCH: boosting RNA-Seq assemblies with partial or related genomic sequences.
Bao E, Jiang T, Girke T.
Bioinformatics. 2013 May 15;29(10):1250-9. doi: 10.1093/bioinformatics/btt127. Epub 2013 Mar 14.