2018-06-18

(RNA seq) 複数のde novoアセンブリ結果をマージし、冗長なcontigを除く DRAP

　第二世代シークエンシングプラットフォームは、多種多様な種および条件の遺伝子発現を分析することを可能にする、多量の転写産物のシーケンスデータの生成を可能にした。リファレンスゲノム配列を欠く種については、現在の古典的なプロセシングパイプラインには新規トランスクリプトームアセンブリステップが含まれるが、生データの変動性のため、正確なトランスクリプトームのリファレンスをアセンブリすることは困難である。この変動性は、以下の様々な要因に由来する：1.遺伝子発現レベルの変動は、通常、1〜数百万コピーの範囲である。 2.初期段階のイントロンをまだ含むプレmRNAや、後期の崩壊中のmRNAの存在。; 3.プレmRNAから多数のオルタナティブ転写産物合成。 4.潜在的なサンプル汚染。 5.シーケンスクオリティバイアス。ゲノムの大部分は、DjebaliらによってENCODEプロジェクトから明らかになったように、生物学的条件に依存して低存在量で発現され得る。

　今日、これらのRNA-Seqアセンブリ問題に対するユニークな最良の解決法は存在しないが、いくつかのソフトウェアパッケージが正しく再構築された発現転写物の大部分を含むコンティグセットを生成することが証明されている。 Trinity（Grabherr et al、2011）とOases（Schulz et al、2012）は良い例である。これらのパッケージによって生成された集合コンティグセットは、しばしば完全または部分トランスクリプトの複数のコピーおよびキメラを含む。キメラは、固有のトランスクリプト（自己キメラ）または複数のトランスクリプト（多重トランスクリプトキメラ）の構造異常である。それらは、トランスクリプトが同じ方向にある場合には「シス」と呼ばれ、反対方向にある場合には「トランス」と呼ばれる。天然のキメラトランスクリプトは、いくつかの癌組織に存在するが、まれである（Frenkel-Morgenstern et al、2013）。Yang and Smith、2013は、トランスクリプトームアセンブラが自己キメラコンティグを生産する傾向を示している。この現象は、アセンブリパラメータに依存する。複数転写キメラはコンティグアノテーションを変形する。同じコンティグでマージされたトランスクリプトの機能は非常に異なる可能性があり、したがって、そのようなキメラコンティグに与えられるしばしばユニークな注釈はその内容を反映しない。アセンブリには、転写またはシークエンシングノイズに対応するコンティグも含まれ、これはしばしばillegitimate transcriptionと呼ばれる現象である（Chelly et al、1989）。これらのコンティグはしばしばカバレッジが低く、同じ条件のレプリケーションデータでは見られない。

　いくつかのコンティグは、局所的な変異、置換、挿入または欠失などのシーケンス決定ミスを含む。これらの変動および誤差は、リードのアライメント率に深く影響し、注釈を妨害するフレームシフトを生成し、プライマー設計の有効性を制限し、偽の変化を生成する可能性がある。アセンブリは、転写後のマークであるポリA / Tテイルも含む。それらは通常、publish前に削除される。これらの理由のために、コンティグセットは通常、エラー訂正を必要とする。

　TrinityとOasesはアルゴリズムが異なり、遺伝子発現レベルに応じて長所と短所がある。主な違いは、アセンブリの戦略から来ている。 Trinityはde Bruijn graphを持つgreedyアルゴリズムを連鎖させ、Oasesは異なるk-merからの複数のde Bruijn graphを使用する。Trinityの第一段階は、第二段階でつながる、高度に表現された転写物の部分を組み立てるのに非常に効果的である。 Surget-Groba＆Montoya-Burgos（2010）に示されているように、Oasesのmulti k-merアセンブリアプローチは、非常に低いレベルから非常に高い発現レベルの転写物に対応するコンティグを構築できる。しかし、複数の転写産物を有する高度に発現した遺伝子は、主にバリエーションまたは配列エラーの存在のために非常に複雑なgraphを生成し、アセンブラによって有効であると考えられる新しい経路を形成し、多数の誤ったコンティグを生じる。すべての状況で最適なコンティグセットを生成するアセンブラはない。したがって、研究者は、コンティグセットのクオリティを最大にするため異なる戦略を併用する（Mbandi et al、2015; Bens et al、2016; He et al、2015; Nakasugi et al、2014）。最も簡単なアプローチは、ソフトウェアパッケージまたはパラメータセットごとに参照セットを作成し、それらのメトリックを比較し、最良のものを選択することである。異なる結果をマージしてフィルタリングすることもできる。

　アセンブリは異なる基準で比較できる。通常は、総カウント、全長、N50、平均長などの単純なコンティグメトリックとなるが、品質をチェックするための良い指標は、コンティグにマップされたリードの割合である。トランスクリプトの大部分がmRNAに対応するので、CEGMA（Parra、Bradnam＆Korf、2007）またはBUSCOによって行われたように、グローバル基準を用いて正しく再構成されたタンパク質の数を品質メトリックとして使用することも可能である。または系統的に非常に近縁な生物由来のタンパク質リファレンスセットを使用することもできる（一部略）。

　著者らは以下のアセンブリ問題を修正するためにde novo RNA-Seq Assembly Pipeline（DRAP）を構築した。すなわち、完全または部分トランスクリプトが複数コピーあったり、キメラ、アセンブラおよびポリAテイルによって低レベルの遺伝子間転写、挿入および欠失などの問題である。パイプラインの実装については、次のセクションで説明する。「結果と議論」では、7つの異なるデータセットのrawアセンブリメトリックと、DRAP処理後のアセンブリメトリックを比較する。

TrinityとOasesはデノボトランスクリプトームでよく使われるアセンブラで、質は良いとされているが、冗長性、エラー率、キメラなど考えるとまだ改善の余地はある。DRAPはこの TrinityとOasesを組み込んだパイプラインで、アセンブリのコンティグの数を1.3倍から15倍に減らすとされる。7つのアセンブラとの比較では、クオリティを損なわないままデータを圧縮できていることが示されている。

DRAPパイプライン。公式HPより転載。

公式ページ

http://www.sigenae.org/drap/index.html

quick start

http://www.sigenae.org/drap/quick_start.html

インストール

cent os6のdocker環境でテストした。

依存

下のcheckコマンドで表示されるツール

依存ツールが大量にありバージョン管理などが煩雑なため、公式ではdocker環境での導入を推奨している。

#imageを取ってくる
docker pull sigenae/drap 

#コンテナを作成。ここではホストの外付けHDDのfastqディレクトリとイメージのhomeを共有指定して立ち上げている。
docker run -i -t -v /Volumes/4TB-gene_new3/data/fastq/:/home sigenae/drap

#依存が揃ってるかテストするには付属configファイルを使う。
#drap1.91をチェックアウト
svn checkout svn://scm.mulcyber.toulouse.inra.fr/svnroot/drap/tags/drap-v1.91
cd drap-v1.91/cfg/

> runCheck --cfg-file drap.cfg

# runCheck --cfg-file drap.cfg

Status -> Softwares

............................................

[VALID] -> STAR

[VALID] -> TransDecoder.LongOrfs

[VALID] -> TransDecoder.Predict

[VALID] -> Trinity

[VALID] -> bedtools

[VALID] -> bl2seq

[VALID] -> blastn

[VALID] -> blat

[VALID] -> bwa

[VALID] -> cd-hit

[VALID] -> cd-hit-est

[VALID] -> cutadapt

[VALID] -> dc

[VALID] -> exonerate

[VALID] -> express

[VALID] -> fastq_illumina_filter

[VALID] -> generate_plot.py

[VALID] -> getorf

[VALID] -> insilico_read_normalization.pl

[VALID] -> interleave-reads.py

[VALID] -> normalize-by-median.py

[VALID] -> oases

[VALID] -> parallel

[VALID] -> perl

[VALID] -> python

[VALID] -> rsync

[VALID] -> run_BUSCO.py

[VALID] -> samtools

[VALID] -> seqclean

[VALID] -> split-paired-reads.py

[VALID] -> tgicl

[VALID] -> transrate

[VALID] -> trim_galore

[VALID] -> vecscreen

[VALID] -> velvetg

[VALID] -> velveth

Already is OK.

全て揃っている。

ヘルプ

> runDrap -h

# runDrap -h

Usage:

runDrap \

--outdir OUTPUT_DIR \

--R1 R1_FILE[,...,R1_FILE_n] \

[--R2 R2_FILE[,...,R2_FILE_n] \

[--strand R|F|RF|FR] \

[--ref FASTA_REF] \

[--dbg oases|trinity] \

[--kmer KMER_LIST] \

[--mapper bwa|STAR] \

[--alignTrim] \

[--alignNorm] \

[--write] \

[--run] \

... \

--help

Options:

-o, --outdir OUTPUT_DIR

The path for the output directory. The last folder will be

created by the workflow. If you use an already existing output

directory the workflow will process only the missing results

(this is used in rerun after error). If you use an already

existing output directory and give no other options, the

previous set of options will be used to rerun the workflow.

-1, --R1 R1_FILE[,...,R1_FILE_n]

The R1 file(s) in FASTQ format (gzipped or not),

comma-separated.

-2, --R2 R2_FILE[,...,R2_FILE_n]

The R2 file(s) in FASTQ format (gzipped or not),

comma-separated.

-s, --strand FR|RF|F|R

For strand specific sequencing data: FR or RF for paired reads;

F or R for single

-r, --ref FASTA_FILE

Fasta file with knowns proteins or transcripts to align with

contigs using Exonerate or Blat.

-d, --dbg trinity|oases

De Bruijn Graph assembler to perform assembly. The default

assembler is oases.

-k, --kmer INTEGER_LIST

Oases k-mers list to perform dbg assembly, comma-separated. The

default kmer list is 25,31,37,43,49,55,61,65,69. Ignored with

trinity.

--trim 'STRING'

trim_galore parameters BETWEEN SINGLE QUOTES to perform the

reads trimming step before dbg assembly. The default parameters

are '--quality 10 --stringency 3'. See trim_galore --help.

--no-trim

Do not trim reads using trim_galore before assembly.

-c, --clean 'STRING'

filter_illumina parameters BETWEEN SINGLE QUOTES to perform the

reads cleaning step before dbg assembly. The default parameters

are '-q 10 -t 33 -e'. See DRAP_DIR/bin/filter_illumina --help.

--norm-src trinity|khmer

Normalizer source to perform the reads normalization step before

dbg assembly. Available sources are trinity with the

insilico_read_normalization.pl script or the khmer package. The

default source is trinity.

--norm 'STRING'

Normalizer parameters BETWEEN SINGLE QUOTES. The default

parameters are '--max_cov 50' for trinity. No default parameters

for khmer.

--no-norm

Do not normalize reads before assembly.

--norm-mem INTEGER

The maximum memory used for the normalization step in Gb of RAM.

DRAP normalizes in two step if R1 (and R2) parameter(s) are

lists of fastq files: 1) DRAP normalizes each fastq file or pair

of fastq files separately. 2) DRAP merges normalized fastq files

and performs another normalization on the merged fastq files or

pairs of fastq files. The default mem is the highest between 1

Gb by million of reads for the heaviest fastq file or pair of

fastq files (step 1) or 1 Gb for 2 millions of reads in all

fastq files or pairs of fastq files (we expects files to be

twice as light after normalization). The default norm-mem has a

minimum of 64 Gb.

--no-norm-merge

If R1 (and R2) parameter(s) are lists of fastq files, only

normalize each fastq file or pair of fastq files separately.

--norm-merge-only

If R1 (and R2) parameter(s) are lists of fastq files, do not

normalize each fastq file or pair of fastq files separately.

Just merge fastq files and perform normalization on the merged

fastq files or pairs of fastq files (useful with fastq files

already normalized).

--dbg-mem INTEGER

The maximum memory used for the dbg step in Gb of RAM. The

default dbg-mem for oases is 100 Gb. For trinity, the default

dbg-mem is computed with 1 Gb by million of reads with a minimum

of 64 Gb.

-n, --discardN

Discard contigs containing N after dbg assembly.

-m, --mapper bwa|star

Mapper to perform the RMBT step (Reads Mapped Back to

Transcripts). Available mappers are bwa and star. The default

mapper is bwa.

--alignR1 R1_FILE[,...,R1_FILE_n]

The R1 file(s) to align at RMBT step if different from --R1,

comma-separated.

--alignR2 R2_FILE[,...,R2_FILE_n]

The R2 file(s) to align at RMBT step if different from --R2,

comma-separated.

--alignTrim

Map back trimmed reads at RMBT step.

--alignNorm

Map back normalized reads at RMBT step.

-l, --length INTEGER

Length filter cutoff in nucleotides at the contig filtering

step. The default length is 200.

-t, --type contig|orf

Filter target type is contig or orf. If type is contig, contigs

shorter than the length filter cutoff are discarded. If type is

orf, contigs with a longest orf shorter than the length filter

cutoff are discarded.

-v, --coverage INTEGER_LIST

FPKM list of values used as coverage filter cutoff,

comma-separated. For each value, a subset of the final contig

set will be created only with contigs having an higher abundance

estimation. Abundances are estimated with eXpress. Default list

is 1,3,5,10.

--no-rate

Without this option TransRate (http://hibberdlab.com/transrate/)

is used to produce a score on assembly from paired-end data. The

score is calculated as the geometric mean of all contig scores

multiplied by the proportion of input reads that provide

positive support for the assembly. By default, TransRate is

performed on the assembly filtered with the lowest FPKM value.

--optimize

Apply the TransRate optimization procedure on the assembly

filtered with the lowest FPKM value. This will create a subset

of the final contig set only with contigs having a TransRate

score higher than the TransRate optimal score.

-q, --quit INTEGER

Quit workflow after step: 1:pre-process ; 2:dbg ; 3:merge ;

4:clustering ; 5:asm ; 6:post-asm ; 7:rmbt-editing

--nb-frags

Number of reads in R1 fastq file(s). Computed before running

workflow if not provided.

--max-nb-frags

Max number of reads in R1 fastq file. Computed before running

workflow if not provided.

--exonerate

Align reference with contigs using Exonerate only.

--blat Align reference with contigs using Blat only.

--cfg-file

Run worflow with a user DRAP configuration file. This file must

keep the syntax of the default configuration file. The default

configuration file is DRAP_DIR/cfg/drap.cfg.

--write Only write job submission files and exit before running

workflow. Allow users to modify by hand some scripts executed

during the workflow.

--run Only execute job submission files. For example, after user

script modifications.

-h, --help

Print help.

> runMeta -h

# runMeta -h

Usage:

runMeta \

--outdir OUTPUT_DIR \

--drap-dirs DRAP_DIR_1,DRAP_DIR_2[,...,DRAP_DIR_n] \

[--strand R|F|RF|FR] \

[--ref FASTA_REF] \

[--mapper bwa|STAR] \

[--write] \

[--run] \

... \

--help

Options:

-o, --outdir OUTPUT_DIR

The path for the output directory. The last folder will be

created by the workflow. If you use an already existing output

directory the workflow will process only the missing results

(this is used in rerun after error). If you use an already

existing output directory and give no other options, the

previous set of options will be used to rerun the workflow.

--drap-dirs DRAP_DIR_1,DRAP_DIR_2[,...,DRAP_DIR_n]

The paths to the runDrap output directories of assemblies to

meta-assemble, comma-separated.

--input-type coding_transcripts|transcripts

runDrap assembly target type: coding_transcripts or transcripts.

Associated with the input-cov parameter, indicates from which

fasta files inside each runDrap output directories runMeta will

be performed. The default type is transcripts.

--input-cov INTEGER

runDrap assembly target coverage. By default, runMeta is

performed from transcripts_fpkm_$lowest.fa files found inside

runDrap directories with $lowest equal to the lowest fpkm value.

-s, --strand FR|RF|F|R

For strand specific sequencing data: FR or RF for paired reads;

F or R for single

-r, --ref FASTA_FILE

Fasta file with knowns proteins or transcripts to align with

contigs using Exonerate or Blat.

-m, --mapper bwa|star

Mapper to perform the RMBT step (Reads Mapped Back to

Transcripts). Available mappers are bwa and star. The default

mapper is bwa.

-y, --filter

Use the -y bwa option to filter Casava-filtered sequences at the

RMBT step. Ignored if mapper is star.

-l, --length INTEGER

Length filter cutoff in nucleotides at the contig filtering

step. The default length is 200.

-t, --type contig|orf

Filter target type is contig or orf. If type is contig, contigs

shorter than the length filter cutoff are discarded. If type is

orf, contigs with a longest orf shorter than the length filter

cutoff are discarded.

-v, --coverage INTEGER_LIST

FPKM list of values used as coverage filter cutoff,

comma-separated. For each value, a subset of the final contig

set will be created only with contigs having an higher abundance

estimation. Abundances are estimated with eXpress. Default list

is 1,3,5,10.

--no-rate

Without this option TransRate (http://hibberdlab.com/transrate/)

is used to produce a score on assembly from paired-end data. The

score is calculated as the geometric mean of all contig scores

multiplied by the proportion of input reads that provide

positive support for the assembly. By default, TransRate is

performed on the assembly filtered with the lowest FPKM value.

--optimize

Apply the TransRate optimization procedure on the assembly

filtered with the lowest FPKM value. This will create a subset

of the final contig set only with contigs having a TransRate

score higher than the TransRate optimal score.

--exonerate

Align reference with contigs using Exonerate only.

--blat Align reference with contigs using Blat only.

--cfg-file

Run worflow with a user DRAP configuration file. This file must

keep the syntax of the default configuration file. The default

configuration file is DRAP_DIR/cfg/drap.cfg.

--write Only write job submission files and exit before running

workflow. Allow users to modify by hand some scripts executed

during the workflow.

--run Only execute job submission files. For example, after user

script modifications.

-h, --help

Print help.

ラン

テストデータを使う。

テスト１、アセンブリとマージ

1、２つのテストデータをそれぞれアセンブリする。シーケンスデータは非圧縮fastqでもOK。

cd drap-v1.91/test/data/

#sampleAのアセンブリ。
runDrap --R1 sampleA_R1.fastq.gz --R2 sampleA_R2.fastq.gz --dbg oases -kmer 25,31,37,43,49 --outdir oases_splA --dbg-mem 16 --norm-mem 16

#sampleBのアセンブリ。
runDrap --R1 sampleB_R1.fastq.gz --R2 sampleB_R2.fastq.gz --dbg oases --kmer 25,31,37,43,49 --outdir oases_splB --dbg-mem 16 --norm-mem 16

--norm-mem The maximum memory used for the normalization step in Gb of RAM.
--dbg-mem The maximum memory used for the dbg step in Gb of RAM.
--dbg <trinity|oases> De Bruijn Graph assembler to perform assembly. The default assembler is oases.
-k Oases k-mers list to perform dbg assembly, comma-separated. The default kmer list is 25,31,37,43,49,55,61,65,69. Ignored with trinity.

２、1の出力２つを冗長性なしにマージする。

runMeta --drap-dirs oases_splA,oases_splB --ref Danio_rerio.pep.fasta --outdir meta_oases

> ls -alth meta_oases/

$ ls -alth meta_oases/

total 108K

drwxr-xr-x 2 root root 4.0K Jun 18 08:57 e-blat

drwxr-xr-x 2 root root 4.0K Jun 18 08:57 e-exonerate

drwxr-xr-x 2 root root 4.0K Jun 18 08:55 err_log

-rw-r--r-- 1 root root 5.1K Jun 18 08:55 00-META-ASSEMBLY_RATING

drwxr-xr-x 3 root root 4.0K Jun 18 08:55 transrate

drwxr-xr-x 11 root root 4.0K Jun 18 08:54 .

-rw-r--r-- 1 root root 5.7K Jun 18 08:54 .drap_conf.json

lrwxrwxrwx 1 root root 41 Jun 18 08:54 coding_transcripts_fpkm_1.fa -> d-cov_filter/coding_transcripts_fpkm_1.fa

lrwxrwxrwx 1 root root 42 Jun 18 08:54 coding_transcripts_fpkm_10.fa -> d-cov_filter/coding_transcripts_fpkm_10.fa

lrwxrwxrwx 1 root root 41 Jun 18 08:54 coding_transcripts_fpkm_3.fa -> d-cov_filter/coding_transcripts_fpkm_3.fa

lrwxrwxrwx 1 root root 41 Jun 18 08:54 coding_transcripts_fpkm_5.fa -> d-cov_filter/coding_transcripts_fpkm_5.fa

lrwxrwxrwx 1 root root 34 Jun 18 08:54 transcripts_fpkm_1.fa -> d-cov_filter/transcripts_fpkm_1.fa

lrwxrwxrwx 1 root root 35 Jun 18 08:54 transcripts_fpkm_10.fa -> d-cov_filter/transcripts_fpkm_10.fa

lrwxrwxrwx 1 root root 34 Jun 18 08:54 transcripts_fpkm_3.fa -> d-cov_filter/transcripts_fpkm_3.fa

lrwxrwxrwx 1 root root 34 Jun 18 08:54 transcripts_fpkm_5.fa -> d-cov_filter/transcripts_fpkm_5.fa

-rw-r--r-- 1 root root 0 Jun 18 08:54 00-META-ASSEMBLY_COMPLETE

-rw-r--r-- 1 root root 200 Jun 18 08:54 .processed_reads.tsv

drwxr-xr-x 3 root root 4.0K Jun 18 08:54 d-cov_filter

drwxr-xr-x 5 root root 4.0K Jun 18 08:54 c-rmbt

drwxr-xr-x 2 root root 4.0K Jun 18 08:54 b-contig_clustering

drwxr-xr-x 2 root root 4.0K Jun 18 08:54 a-orf_clustering

drwxr-xr-x 4 root root 4.0K Jun 18 08:54 report

-rwxrwxr-x 1 root root 910 Jun 18 08:54 10-meta_reference.sh

-rwxrwxr-x 1 root root 806 Jun 18 08:54 07-meta_rmbt.sh

-rwxrwxr-x 1 root root 2.4K Jun 18 08:54 08-meta_filter.sh

-rwxrwxr-x 1 root root 933 Jun 18 08:54 09-meta_postprocess.sh

-rwxrwxr-x 1 root root 755 Jun 18 08:54 02-meta_longest_orf.sh

-rwxrwxr-x 1 root root 658 Jun 18 08:54 03-meta_cluster_orf.sh

-rwxrwxr-x 1 root root 2.0K Jun 18 08:54 04-meta_longest_contig.sh

-rwxrwxr-x 1 root root 666 Jun 18 08:54 05-meta_cluster_contig.sh

-rwxrwxr-x 1 root root 245 Jun 18 08:54 06-meta_index.sh

drwxr-xr-x 5 root root 4.0K Jun 18 08:54 ..

-rw-r--r-- 1 root root 105 Jun 18 08:54 00-runMeta_cmd.log

-rwxrwxr-x 1 root root 741 Jun 18 08:54 01-meta_merge.sh

テスト２、アセンブリと評価。

同じデータセットの異なるパラメータでのアセンブリ結果を評価するコマンドも用意されている。例えば、DRAPのoasesとTrinityのアセンブリ結果を比較する。

runDrap --R1 sampleA_R1.fastq.gz --R2 sampleA_R2.fastq.gz --ref Danio_rerio.pep.fasta --dbg trinity --outdir trinity_splA --dbg-mem 16 --norm-mem 16

runDrap --R1 sampleB_R1.fastq.gz --R2 sampleB_R2.fastq.gz --ref Danio_rerio.pep.fasta --dbg trinity --outdir trinity_splB --dbg-mem 16 --norm-mem 16

冗長性なしにマージする。

runMeta --drap-dirs trinity_splA,trinity_splB --ref Danio_rerio.pep.fasta --outdir meta_trinityrunMeta

テスト１とテスト２のマージ結果を比較評価する。

#Rename assemblies 
cp meta_oases/d-cov_filter/transcripts_fpkm_1.fa meta_oases/oases_fpkm_1.fa
cp meta_trinityrunMeta/d-cov_filter/transcripts_fpkm_1.fa meta_trinityrunMeta/trinity_fpkm_1.fa


# Launch assessment
runAssessment --assemblies meta_oases/oases_fpkm_1.fa,meta_trinityrunMeta/trinity_fpkm_1.fa \
--R1 sampleA_R1.fastq.gz,sampleB_R1.fastq.gz --R2 sampleA_R2.fastq.gz,sampleB_R2.fastq.gz \
--protein-db Danio_rerio.pep.fasta --outdir assessment

出力。

f:id:kazumaxneo:20180619125537j:plain

BUSCO評価結果。

f:id:kazumaxneo:20180619125243j:plain

Transrateの評価結果は scoreing/oases_fpkm_1_transrate.txtとscoreing/trinity_fpkm_1_transrate.txtに保存される。片方を１ページほど開いてみる。

head -n 60 /Users/user/Downloads/assessment/scoring/trinity_fpkm_1_transrate.txt

[ INFO] 2018-06-19 03:41:42 : Loading assembly: /drap-v1.91/test/data/meta_trinityrunMeta/trinity_fpkm_1.fa

[ INFO] 2018-06-19 03:41:42 : Analysing assembly: /drap-v1.91/test/data/meta_trinityrunMeta/trinity_fpkm_1.fa

[ INFO] 2018-06-19 03:41:42 : Results will be saved in /drap-v1.91/test/data/assessment/scoring/trinity_fpkm_1/trinity_fpkm_1

[ INFO] 2018-06-19 03:41:42 : Calculating contig metrics...

[ INFO] 2018-06-19 03:41:42 : Contig metrics:

[ INFO] 2018-06-19 03:41:42 : -----------------------------------

[ INFO] 2018-06-19 03:41:42 : n seqs 237

[ INFO] 2018-06-19 03:41:42 : smallest 201

[ INFO] 2018-06-19 03:41:42 : largest 1045

[ INFO] 2018-06-19 03:41:42 : n bases 72016

[ INFO] 2018-06-19 03:41:42 : mean len 303.86

[ INFO] 2018-06-19 03:41:42 : n under 200 0

[ INFO] 2018-06-19 03:41:42 : n over 1k 1

[ INFO] 2018-06-19 03:41:42 : n over 10k 0

[ INFO] 2018-06-19 03:41:42 : n with orf 7

[ INFO] 2018-06-19 03:41:42 : mean orf percent 78.01

[ INFO] 2018-06-19 03:41:42 : n90 219

[ INFO] 2018-06-19 03:41:42 : n70 255

[ INFO] 2018-06-19 03:41:42 : n50 298

[ INFO] 2018-06-19 03:41:42 : n30 358

[ INFO] 2018-06-19 03:41:42 : n10 541

[ INFO] 2018-06-19 03:41:42 : gc 0.4

[ INFO] 2018-06-19 03:41:42 : bases n 0

[ INFO] 2018-06-19 03:41:42 : proportion n 0.0

[ INFO] 2018-06-19 03:41:42 : Contig metrics done in 0 seconds

[ INFO] 2018-06-19 03:41:42 : Calculating read diagnostics...

[ INFO] 2018-06-19 03:41:46 : Read mapping metrics:

[ INFO] 2018-06-19 03:41:46 : -----------------------------------

[ INFO] 2018-06-19 03:41:46 : fragments 40000

[ INFO] 2018-06-19 03:41:46 : fragments mapped 12820

[ INFO] 2018-06-19 03:41:46 : p fragments mapped 0.32

[ INFO] 2018-06-19 03:41:46 : good mappings 8167

[ INFO] 2018-06-19 03:41:46 : p good mapping 0.2

[ INFO] 2018-06-19 03:41:46 : bad mappings 4653

[ INFO] 2018-06-19 03:41:46 : potential bridges 3

[ INFO] 2018-06-19 03:41:46 : bases uncovered 27328

[ INFO] 2018-06-19 03:41:46 : p bases uncovered 0.38

[ INFO] 2018-06-19 03:41:46 : contigs uncovbase 118

[ INFO] 2018-06-19 03:41:46 : p contigs uncovbase 0.5

[ INFO] 2018-06-19 03:41:46 : contigs uncovered 36

[ INFO] 2018-06-19 03:41:46 : p contigs uncovered 0.15

[ INFO] 2018-06-19 03:41:46 : contigs lowcovered 65

[ INFO] 2018-06-19 03:41:46 : p contigs lowcovered 0.27

[ INFO] 2018-06-19 03:41:46 : contigs segmented 68

[ INFO] 2018-06-19 03:41:46 : p contigs segmented 0.29

[ INFO] 2018-06-19 03:41:46 : Read metrics done in 4 seconds

[ INFO] 2018-06-19 03:41:46 : Calculating comparative metrics...

[ INFO] 2018-06-19 03:42:48 : Comparative metrics:

[ INFO] 2018-06-19 03:42:48 : -----------------------------------

[ INFO] 2018-06-19 03:42:48 : CRBB hits 84

[ INFO] 2018-06-19 03:42:48 : n contigs with CRBB 84

[ INFO] 2018-06-19 03:42:48 : p contigs with CRBB 0.35

[ INFO] 2018-06-19 03:42:48 : rbh per reference 0.0

[ INFO] 2018-06-19 03:42:48 : n refs with CRBB 80

[ INFO] 2018-06-19 03:42:48 : p refs with CRBB 0.0

[ INFO] 2018-06-19 03:42:48 : cov25 17

[ INFO] 2018-06-19 03:42:48 : p cov25 0.0

[ INFO] 2018-06-19 03:42:48 : cov50 4

[ INFO] 2018-06-19 03:42:48 : p cov50 0.0

[ INFO] 2018-06-19 03:42:48 : cov75

他にもmappingによる評価や、キメラcontigの評価などのmetricsがある。詳細はreport/DRAP_assessment.htmlを確認してください。

引用

Compacting and correcting Trinity and Oases RNA-Seq de novo assemblies.

Cabau C, Escudié F, Djari A, Guiguen Y, Bobe J, Klopp C

PeerJ. 2017 Feb 16;5:e2988. doi: 10.7717/peerj.2988. eCollection 2017.

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

(RNA seq) 複数のde novoアセンブリ結果をマージし、冗長なcontigを除く DRAP