2018-02-01

複数のアセンブラのコンティグをマージする GAM-NGS

GAGEのペーパーでも議論されているが、生き物をと言わず常にベストな結果を出すアセンブラと言うのは現状存在しない。アセンブルを改善するため、複数のアセンブル結果をマージしてアセンブル結果を強化するポストアセンブルのツールがいくつか発表されてきている。これらのツールは、近縁種または同一種のリファレンスをガイドとしてアセンブル結果を並べマージするツールと、リファレンスを使用せずconitgをつないだりクラスタリングするツールに分かれ、それぞれいくつかの方法論が発表されている。GAM-NGSは後者に属するツールで、連続性と正確性を向上させるために、リファレンスを使わず2つ以上のアセンブリを統合する。

公式サイト

http://garm-meta-assem.sourceforge.net

インストール

cent OSにインストールした。

依存

Perl v5.8 or above
Parallel::ForkManager
List::MoreUtils

perlのモジュールはcpanmでインストールしておく。

cpanm Parallel::ForkManager
cpanm List::MoreUtils

本体　SourceForge

https://sourceforge.net/projects/garm-meta-assem/files/

ダウンロードして解凍する。

他の依存は全てダウンロードしたファイルに入っているので、READMEに従って環境変数を定義する。~/.bash_profile(またはbashrc)に以下を追加しておく。

export GARMBIN=/Users/user/Downloads/GARM_v0.7.5/bin/ 
export GARMLIB=/Users/user/Downloads/GARM_v0.7.5/lib/ 
export MUMBIN=/Users/user/Downloads/GARM_v0.7.5/MUMmer3.22/ 
export AMOSBIN=/Users/user/Downloads/GARM_v0.7.5/amos-3.0.0/bin/ 
export AMOSLIB=/Users/user/Downloads/GARM_v0.7.5/amos-3.0.0/lib/ 
export PATH=$PATH:/Users/user/Downloads/GARM_v0.7.5/

"source ~/.bash_profile"しておく。

準備できたら正常に動くかテストする。

$ perl config.pl

Your shell is /bin/bash

Configuring GARM env variables

### Your enviromental variables are already defined:

/home/disk1/uesaka/GARM_directory/GARM_v0.7.5/bin/

/home/disk1/uesaka/GARM_directory/GARM_v0.7.5/lib/

/home/disk1/uesaka/GARM_directory/GARM_v0.7.5/MUMmer3.22/

/home/disk1/uesaka/GARM_directory/GARM_v0.7.5/amos-3.0.0/bin/

### Checking apps

Checking /home/disk1/uesaka/GARM_directory/GARM_v0.7.5/MUMmer3.22//nucmer

Checking /home/disk1/uesaka/GARM_directory/GARM_v0.7.5/amos-3.0.0/bin//toAmos

Checking /home/disk1/uesaka/GARM_directory/GARM_v0.7.5/amos-3.0.0/bin//bank-transact

Checking /home/disk1/uesaka/GARM_directory/GARM_v0.7.5/amos-3.0.0/bin//nucmer2ovl

Checking /home/disk1/uesaka/GARM_directory/GARM_v0.7.5/amos-3.0.0/bin//sort2

Checking /home/disk1/uesaka/GARM_directory/GARM_v0.7.5/amos-3.0.0/bin//tigger

Checking /home/disk1/uesaka/GARM_directory/GARM_v0.7.5/amos-3.0.0/bin//make-consensus

Checking /home/disk1/uesaka/GARM_directory/GARM_v0.7.5/amos-3.0.0/bin//bank2contig

Checking /home/disk1/uesaka/GARM_directory/GARM_v0.7.5/amos-3.0.0/bin//bank2fasta

Checking /home/disk1/uesaka/GARM_directory/GARM_v0.7.5/amos-3.0.0/bin//listReadPlacedStatus

Checking /home/disk1/uesaka/GARM_directory/GARM_v0.7.5/amos-3.0.0/bin//dumpreads

You can run now GARM without problems :D (hopefully...)

問題ないなら、ランに必要なconfigファイルを準備する。configファイルには１列目にマージしたいcontigのフルパス、２列目に固有の名前を書いておく。

ラン

configファイルを指定してランする。

GARM.pl -g config_file -o <prefix>

ラン中にエラーがでます。直ったら追記します。

引用

GAM-NGS: genomic assemblies merger for next generation sequencing.

Vicedomini R, Vezzi F, Scalabrin S, Arvestad L, Policriti A.

BMC Bioinformatics. 2013;14 Suppl 7:S6.

パフォーマンス比較。

2018-01-31

リファンレンスガイドのトランスクリプトのアセンブル strawberry

RNA seq assembly RNAseqの定量 Reference-assisted assembly 高速なツール

ゲノムガイドのRNAのアセンブル法は、遺伝子アノテーション情報を使わず、RNA-Seqデータから転写物の再構成を行う方法である。

Strawberryはゲノムガイドのアセンブリと定量の2つのモジュールで構成されており、ゲノムガイドのアセンブルではbamをスプライシンググラフにして解析し、最も可能性の高い転写物を選択する。定量化モジュールは、スプライシンググラフのノードからの読み取りカウントを転写物に割り当て、転写物量を推定し、EMアルゴリズムによりバイアスを補正する。シミュレートされたデータとリアルデータの両方を使用して検証されており、Strawberryはアセンブリと定量化の両方のいずれもCUfflinksとStringTieを上る。動作は高速で、1000万リードのゲノムガイドアセンブルが２分程度で終わるとされる（２スレッド使用時）。

インストール

cent OSに導入した。

Github

https://github.com/ruolin/strawberry

git clone --recursive https://github.com/ruolin/Strawberry.git 
cd Strawberry/
sh cmake.sh
cd build 
make

cd ../bin/
./strawberry #テスト

$ ./strawberry

strawberry v0.9.1

--------------------------------------

Usage: strawberry [options] <input.bam>

General Options:

-o/--output-dir Output files directory. [default: ./strawberry_out ]

-g/--GTF Reference transcripts annotation file. Current support gff3 and gtf format. [default: NULL]

-r/--no-assembly Skip assembly and use reference annotation to quantify transcript abundance (only use with -g) [default: false]

--no-quant Skip quantification [default: false]

-p/--num-threads number of threads used for Strawberry [default: 1]

-v/--verbose Strawberry starts to gives more information. [default: false]

-q/--min-mapping-qual Minimum mapping quality to be included in the analyses. [default: 0]

-J/--max-junction-splice-size Maximum spliced junction. [default: 300000]

-j/--min-junction-splice-size Minimum spliced junction size. [default: 50]

-m/--min-isoform-frac Minimum isoform fraction. [default: 0.01]

--allow-multimapped-hits By default, Strawberry only use reads which map to unique position in the genome. [default: false]

Assembly Options:

-t/--min-transcript-size Minimun transcript size to be assembled. [default: 200]

-d/--max-overlap-distance Maximum distance between read clusters to be merged. [default: 30]

-s/--min-anchor-size Read overhang less than this value is subject to Binomial test. [default: 10]

-a/--small-anchor-alpha Threshold alpha for junction binomial test filter. [default: 0]

--min-support-4-intron Minimum number of spliced aligned read required to support a intron. [default: 2.0]

--min-exon-cov Minimum exon coverage. [default: 1.0]

-c/-combine-short-transfrag merging non-overlap short transfrags. [default: false]

--min-depth-4-transcript Minimum average read depth for transcript. [default: 1.0]

Quantification Options:

-f/--fragment-context Print fragment context for differential expression to this file. [default: Disabled]

-i/--insert-size-mean-and-sd User specified insert size mean and standard deviation, format: mean/sd, e.g., 300/25. [default: Disabled]

This will disable empirical insert distribution learning. [default: NULL]

-b/--bias-correction Specify reference genome for bias correction. [default: NULL]

-e/--filter-low-expression Skip isoforms whose relative expression (within locus) are less than this number. [default: 0.]

パスを通しておく。

ラン

解析前にゲノムにリードをアライメントして、bamを作っておく。

--テストラン--

ゲノムガイドアセンブリと定量の両方を行う。

strawberry examples/geuvadis_300/sample_01.sorted.bam -o output_dir -p 8

-o　Output files directory. [default: ./strawberry_out ]
-p　number of threads used for Strawberry [default: 1]

既存のアノテーションファイルを使い定量を行う。新規isoformも検出する。

strawberry examples/geuvadis_300/sample_01.sorted.bam -o output_dir -g reference.gtf -p 8

-rをつけると、ゲノムガイドアセンブルはスキップされ、新規isoform探索は一切行われません。定量だけ行うときに使うフラグです。

引用

Strawberry: Fast and accurate genome-guided transcript reconstruction and quantification from RNA-Seq.

PLoS Comput Biol. 2017 Nov 27;13(11):e1005851.

Liu R, Dickerson J.

2018-01-31

リファンレンスガイドのトランスクリプトのアセンブリツール Scallop

RNA seq Reference-assisted assembly contigのscaffolding 2017 Nature Biotechnology

2020 2/5 condaインストール追記

Scallopは、リファンレンスガイドのトランスクリプトのアセンブルツール。マルチエキソンの転写物や低発現の転写物を組み立てる際の高い精度を特徴とする。ヒトRNA-seqサンプルでは、ScallopはStringTieおよびTransCombよりも34.5％および36.3％正確なマルチエキソン転写物を作り、低発現の転写物もそれぞれ67.5％および52.3％同定したと報告されている。2017年のNature Biotechnologyに掲載された。

podcastでの説明

the bioinformatics chat by Roman Cheplyaka on Apple Podcasts

インストール

依存

htslib (zlibが必要)

Githubのリンクからmac向けのバイナリをダウンロードできる(10.10~10.12)。また主要なlinuxディストリビューション系のバイナリも用意されている。

Github

#bioconda (link)
conda install -c bioconda -y scallop

> scallop

$ scallop

scallop scalpel-discovery scalpel-export

user-no-MacBook-Pro-2:mapping user$ scal

scallop scalpel-discovery scalpel-export

user-no-MacBook-Pro-2:mapping user$ scallop

Usage: scallop -i <bam-file> -o <gtf-file> [options]

Options:

--help print usage of Scallop and exit

--version print current version of Scallop and exit

--verbose <0, 1, 2> 0: quiet; 1: one line for each graph; 2: with details, default: 1

--library_type <first, second, unstranded> library type of the sample, default: unstranded

--min_transcript_coverage <float> minimum coverage required for a multi-exon transcript, default: 1.01

--min_single_exon_coverage <float> minimum coverage required for a single-exon transcript, default: 20

--min_transcript_length_increase <integer> default: 50

--min_transcript_length_base <integer> default: 250, minimum length of a transcript would be

--min_transcript_length_base + --min_transcript_length_increase * num-of-exons

--min_mapping_quality <integer> ignore reads with mapping quality less than this value, default: 1

--min_bundle_gap <integer> minimum distances required to start a new bundle, default: 50

--min_num_hits_in_bundle <integer> minimum number of reads required in a bundle, default: 20

--min_flank_length <integer> minimum match length in each side for a spliced read, default: 3

--min_splice_bundary_hits <integer> minimum number of spliced reads required for a junction, default: 1

___ ___ ___ ___ ___

/ /\ / /\ / /\ / /\ / /\

/ /:/_ / /:/ / /::\ / /::\ / /::\

/ /:/ /\ / /:/ / /:/\:\ ___ ___ ___ ___ / /:/\:\ / /:/\:\

/ /:/ /::\ / /:/ ___ / /:/~/::\ /__/\ / /\ /__/\ / /\ / /:/ \:\ / /:/~/:/

/__/:/ /:/\:\ /__/:/ / /\ /__/:/ /:/\:\ \ \:\ / /:/ \ \:\ / /:/ /__/:/ \__\:\ /__/:/ /:/

\ \:\/:/~/:/ \ \:\ / /:/ \ \:\/:/__\/ \ \:\ /:/ \ \:\ /:/ \ \:\ / /:/ \ \:\/:/

\ \::/ /:/ \ \:\ /:/ \ \::/ \ \:\/:/ \ \:\/:/ \ \:\ /:/ \ \::/

\__\/ /:/ \ \:\/:/ \ \:\ \ \::/ \ \::/ \ \:\/:/ \ \:\

/__/:/ \ \::/ \ \:\ \__\/ \__\/ \ \::/ \ \:\

\__\/ \__\/ \__\/ \__\/ \__\/

ラン

マッピングしたbamを指定してランする。

scallop -i input.bam -o output.gtf

idbaとTrinityとspadesのレポートがPDF形式で出力される。

bamはsortされている必要がある。(samtools sort input.bam > output_sorted.bam)

GithubにはsalmonとScallopを組み合わせてRNA seqの定量を行う例が載っています。確認してみてください。

引用

Accurate assembly of transcripts through phase-preserving graph decomposition.

Shao M, Kingsford C.

Nat Biotechnol. 2017 Dec;35(12):1167-1169. doi: 10.1038/nbt.4020. Epub 2017 Nov 13.

真核生物のRNAのコード領域を予測するGeneMarkS-T

RNA seq annotation 2015 Nucleic Acids Research coding region

GeneMarkS-T は教師なし学習でトレーニングされたRNAのタンパク質コード領域を予測ツール。原核生物向けのGeneMarkSを真核生物向けに拡張して作られた。データサイズに寄らず一定の検出率を示すため、データが莫大になるメタトランスクリプトーム解析のコード領域予測にも向いているとされる。

ダウンロード

このリンクからlinux64bit版をダウンロードできる。

GeneMark™ download

> perl gmst.pl

$ perl gmst

gmst_linux_64.tar gmst.pl

[uesaka@cyano GeneMarkS-T]$ perl gmst.pl

GeneMarkS version 5.1 March 2014

Usage: gmst.pl [options] <sequence file name>

input sequence in FASTA format

Output options:

(output is in current working directory)

--output <string> output file with predicted gene coordinates by GeneMarh.hmm

and species parameters derived by GeneMarkS-T.

(default: <sequence file name>.lst)

GeneMark.hmm can be executed independently after finishing GeneMarkS training.

This method may be the preferable option in some situations, as it provides accesses to GeneMarh.hmm options.

--format <string> output coordinates of predicted genes in this format

(default: LST; supported: LST and GFF)

--fnn create file with nucleotide sequence of predicted genes

--faa create file with protein sequence of predicted genes

--clean <number> delete all temporary files

(default: 1; supported: 1 <true> and 0 <false>)

Run options:

--bins <number> number of clusters for inhomogeneous genome

(default: 0; supported: 0(automatic clustering),1,2,3)

--filter <number> keep at most one prediction per sequence

(default: 1; supported: 1 <true> and 0 <false>)

--strand <string> sequence strand to predict genes in

(default: 'both'; supported: direct, reverse and both )

--order <number> markov chain order

(default: 4; supported in range: >= 0)

--order_non <number> order for non-coding parameters

(default: 2)

--gcode <number> genetic code

(default: 1; supported: 11, 4 and 1)

--motif <number> iterative search for a sequence motif associated with CDS start

(default: 1; supported: 1 <true> and 0 <false>)

--width <number> motif width

(default: 12; supported in range: >= 3)

--prestart <number> length of sequence upstream of translation initiation site that presumably includes the motif

(default: 6; supported in range: >= 0)

--fixmot

if <number> the motif is located at a fixed position with regard to the start; motif could overlap start codon

(default: 1; supported: 1 <true> and 0 <false> if this option is on, it changes the meaning of --prestart

option which in this case will define the distance from start codon to motif start)

--offover <number> prohibits gene overlap

(default: 1; supported: 1 <true> and 0 <false>)

Combined output and run options:

--prok to run program on prokaryotic transcripts

(this option is the same as: --bins 1 --filter 0 --order 2 --order_non 2 --gcode 11 --width 6 --prestart 40 --fixmotif 0)

Test/developer options:

--par <file name> custom parameters for GeneMarkS

(default is selected based on gcode value: 'par_<gcode>.default' )

--gibbs <number> version of Gibbs sampler software

(default: 3; supported versions: 1 and 3 )

--test installation test

--identity <number> identity level assigned for termination of iterations

(default: 0.99; supported in range: >=0 and <= 1)

--maxitr <number> maximum number of iterations

(default: 10; supported in range: >= 1)

--verbose

--version

ラン

de novo assemblyで作ったfragments（FASTA）を指定してランする。

perl gmst.pl transcripts.fa --output output --fnn --faa --strand both --gcode 1

--fnn　create file with nucleotide sequence of predicted genes
--faa　create file with protein sequence of predicted genes
--strand <string>　sequence strand to predict genes in (default: 'both'; supported: direct, reverse and both )
--gcode <number>　genetic code (default: 1; supported: 11, 4 and 1)

原核生物のデータに使う場合、-prokをつけると複数のパラメータを一括指定できる。

引用

Identification of protein coding regions in RNA transcripts.

Tang S, Lomsadze A, Borodovsky M.

Nucleic Acids Res. 2015 Jul 13;43(12):e78. doi: 10.1093/nar/gkv227. Epub 2015 Apr 13.

2018-01-25

de novo transcriptome assembliesを評価する rnaQUAST

evaluation tool RNA seq assembly 2016 Bioinformatics de novo transcriptome

2020 2/3 インストール追記、実行例追記

2020 8/13 インストール追記

　rnaQUASTはde novo transcriptomeのアセンブルパフォーマンスを比較するツール。リファレンスゲノムやtranscriptsのカタログにアセンブルした配列をアライメントし、様々な統計データをPDFで出力する。リファンレンスの遺伝子情報（gtf）がない時でも、ラン中にGeneMarkS-Tを動かし遺伝子を予測してランすることもできる。内部でRNAのアライナーを動かし（STARやTophatが必要）、カバレッジを調べることもできる。

公式サイト

http://cab.spbu.ru/software/rnaquast/

マニュアル

http://cab.spbu.ru/files/rnaquast/release1.5.0/manual.html

インストール

ubuntu18.04LTS のマシンでテストした（macosには対応していない）。

依存

Python 2 (2.5 or higher)
matplotlib python package
joblib python package
gffutils python package (needs biopython)
NCBI BLAST+ (blastn)
GMAP (or BLAT) aligner

optional

STAR (or alternatively TopHat aligner + SAM tools)
GeneMarkS-T

gffutils、matplotlib、joblibはpipでインストールする。BLATやBLASTはbrewで導入できる。GMAPを使うなら、github（リンク）からダウンロードしてビルドする。

Github

#bioconda (link)
conda create -n rnaquast -y
conda activate rnaquast
conda install -c bioconda -y rnaquast

>rnaQUAST.py -h

$ rnaQUAST.py -h

usage: /home/kazu/anaconda3/envs/rnaquast/bin/rnaQUAST.py [-h]

[-r REFERENCE [REFERENCE ...]]

[--gtf GTF [GTF ...]]

[--gene_db GENE_DB]

[-c TRANSCRIPTS [TRANSCRIPTS ...]]

[-psl ALIGNMENT [ALIGNMENT ...]]

[-sam READS_ALIGNMENT]

[-1 LEFT_READS]

[-2 RIGHT_READS]

[-s SINGLE_READS]

[--gmap_index GMAP_INDEX]

[-o OUTPUT_DIR]

[--test] [-d]

[-t THREADS]

[-l LABELS [LABELS ...]]

[-ss]

[--min_alignment MIN_ALIGNMENT]

[--no_plots]

[--blat] [--tophat]

[--gene_mark]

[--meta]

[--lower_threshold LOWER_THRESHOLD]

[--upper_threshold UPPER_THRESHOLD]

[--disable_infer_genes]

[--disable_infer_transcripts]

[--busco_lineage BUSCO_LINEAGE]

[--prokaryote]

QUALITY ASSESSMENT FOR TRANSCRIPTOME ASSEMBLIES /home/kazu/anaconda3/envs/rnaquast/bin/rnaQUAST.py v.1.5.1

Usage:

python /home/kazu/anaconda3/envs/rnaquast/bin/rnaQUAST.py --transcripts TRANSCRIPTS --reference REFERENCE --gtf GENE_COORDINATES

optional arguments:

-h, --help show this help message and exit

Input data:

-r REFERENCE [REFERENCE ...], --reference REFERENCE [REFERENCE ...]

Single file (or several files for meta RNA) with

reference genome in FASTA format or *.txt file with

one-per-line list of FASTA files with reference

sequences

--gtf GTF [GTF ...] File with gene coordinates (or several files or *.txt

file with one-per-line list of GTF / GFF files for

meta RNA). We recommend to use files downloaded from

GENCODE or Ensembl [GTF/GFF]

--gene_db GENE_DB Path to the gene database generated by gffutils to be

used

-c TRANSCRIPTS [TRANSCRIPTS ...], --transcripts TRANSCRIPTS [TRANSCRIPTS ...]

File(s) with transcripts [FASTA]

-psl ALIGNMENT [ALIGNMENT ...], --alignment ALIGNMENT [ALIGNMENT ...]

File(s) with transcript alignments to the reference

genome [PSL]

-sam READS_ALIGNMENT, --reads_alignment READS_ALIGNMENT

File with read alignments to the reference genome

[SAM]

-1 LEFT_READS, --left_reads LEFT_READS

File with forward paired-end reads [FASTQ or gzip-

compressed]

-2 RIGHT_READS, --right_reads RIGHT_READS

File with reverse paired-end reads [FASTQ or gzip-

compressed]

-s SINGLE_READS, --single_reads SINGLE_READS

File with unpaired reads [FASTQ or gzip-compressed]

--gmap_index GMAP_INDEX

Folder containing GMAP index for the reference genome

Basic options:

-o OUTPUT_DIR, --output_dir OUTPUT_DIR

Directory to store all results [default:

rnaQUAST_results/results_<datetime>]

--test Run rnaQUAST on the test data from the test_data

folder, output directory is rnaOUAST_test_output

-d, --debug Report detailed information, typically used only for

detecting problems.

Advanced options:

-t THREADS, --threads THREADS

Maximum number of threads, default: min(number of CPUs

/ 2, 16)

-l LABELS [LABELS ...], --labels LABELS [LABELS ...]

Name(s) of assemblies that will be used in the reports

-ss, --strand_specific

Set if transcripts were assembled using strand-

specific RNA-Seq data

--min_alignment MIN_ALIGNMENT

Minimal alignment length, default: 50

--no_plots Do not draw plots (to speed up computation)

--blat Run with BLAT alignment tool

(http://hgwdev.cse.ucsc.edu/~kent/exe/) instead of

GMAP

--tophat Run with TopHat tool

(https://ccb.jhu.edu/software/tophat/index.shtml)

instead of STAR

--gene_mark Run with GeneMarkS-T tool

(http://topaz.gatech.edu/GeneMark/)

--meta Run QUALITY ASSESSMENT FOR METATRANSCRIPTOME

ASSEMBLIES

--lower_threshold LOWER_THRESHOLD

Lower threshold for x-assembled/covered/matched

metrics, default: 0.5

--upper_threshold UPPER_THRESHOLD

Upper threshold for x-assembled/covered/matched

metrics, default: 0.95

--prokaryote Use this option if the genome is prokaryotic

Gffutils related options:

--disable_infer_genes

Use this option if your GTF file already contains

genes records

--disable_infer_transcripts

Use this option if your GTF already contains

transcripts records

BUSCO related options:

--busco_lineage BUSCO_LINEAGE

Run with BUSCO tool (http://busco.ezlab.org/). Path to

the BUSCO lineage data to be used (Eukaryota, Metazoa,

Arthropoda, Vertebrata or Fungi)

Don't forget to add GMAP (or BLAT) to PATH.

(rnaquast) kazu@kazu:~/Downloads$

ラン

アセンブルしたfastaとリファレンスのゲノム、リファレンスのgtfファイルを指定してランする。複数のアセンブリを評価する時はスペースで区切る

（-c transcripts1.fasta<space>transcripts2.fasta）。

rnaQUAST.py -c transcripts1.fasta transcripts2.fasta -r reference.fasta --gtf gene_coordinates.gtf -t 2 -o output

-c File(s) with transcripts in FASTA format separated by space.
-o Directory to store all results. Default is rnaQUAST_results/results_<datetime>.
-r Single file with reference genome containing all chromosomes/scaffolds in FASTA format (preferably with *.fasta, *.fa, *.fna, *.ffn or *.frn extension) OR *.txt file containing the one-per-line list of FASTA files with reference sequences.
--gtf File with gene coordinates in GTF/GFF format (needs information about parent relations). We recommend to use files downloaded from GENCODE or Ensembl.
-t Maximum number of threads. Default is min(number of CPUs / 2, 16).
--blat Run with BLAT alignment tool instead of GMAP.
--gene_mark Run with GeneMarkS-T gene prediction tool. Use --prokaryote option if the genome is prokaryotic.

レポートがPDF形式で出力される。

f:id:kazumaxneo:20180125100017j:plain

f:id:kazumaxneo:20180125100012j:plain

各評価項目の詳細はrnaQUAST 1.5 Manualから確認してください。オプション解説の下のほうに載ってます。

追記

buscoも走らせる場合、busco v3のHP（link）からダウンロードして解答したlineage ディレクトリを指定する。

 rnaQUAST.py -1 R1.fastq -2 R2.fastq -t 12 -o quast \
 -c Trinity.fasta transcripts.fasta \
 --busco_lineage protists_ensembl/


#reference genomeとアノテーション情報も利用できるなら使う
rnaQUAST.py -1 R1.fastq -2 R2.fastq -t 12 -o quast \
 -c Trinity.fasta transcripts.fasta \
 --busco_lineage protists_ensembl/
 -r ref_genome.fa --gtf ref_annotation.gtf

引用

rnaQUAST: a quality assessment tool for de novo transcriptome assemblies.

Bushmanova E, Antipov D, Lapidus A, Suvorov V, PrjibelskiAD.

Bioinformatics. 2016 Jul 15;32(14):2210-2.

demulitiplexしてサンプルを分割する sabre

trimming / preprocessing barcode / index demultiplexing

sabreはバーコードをdemulitiplexするツール。バーコードを除いたあと、バーコードに従って分割する。バーコードがないリードは別ファイルにまとめて出力される。gzip入力もサポートしている。

インストール

Github

https://github.com/najoshi/sabre

git clone https://github.com/najoshi/sabre.git
cd sabre/
make
./sabre #動作確認

> ./sabre pe

$ ./sabre pe

Usage: sabre pe -f <paired-end fastq file 1> -r <paired-end fastq file 2> -b <barcode file> -u <unknown barcode output file 1> -w <unknown barcode output file 2>

Options:

-f, --pe-file1, Input paired-end fastq file 1 (required, must have same number of records as pe2)

-r, --pe-file2, Input paired-end fastq file 2 (required, must have same number of records as pe1)

-b, --barcode-file, File with barcode and two output file names per line (required)

-u, --unknown-output1, Output paired-end file 1 that contains records with no barcodes found. (required)

-w, --unknown-output2, Output paired-end file 2 that contains records with no barcodes found. (required)

-c, --both-barcodes, Optional flag that indicates that both fastq files have barcodes.

-m <n>, --max-mismatch <n>, Optional argument that is the maximum number of mismatches allowed in a barcode. Default 0.

--quiet, don't print barcode matching info

--help, display this help and exit

--version, output version information and exit

> ./sabre se

o$ sabre se

Usage: sabre se -f <fastq sequence file> -b <barcode file> -u <unknown barcode output file>

Options:

-f, --fastq-file, Input fastq file (required)

-b, --barcode-file, File with barcode and output file name per line (required)

-u, --unknown-output, Output file that contains records with no barcodes found. (required)

-m <n>, --max-mismatch <n>, Optional argument that is the maximum number of mismatches allowed in a barcode. Default 0.

--quiet, don't output matching info

--help, display this help and exit

--version, output version information and exit

パスの通ったディレクトリに移動しておく。

ラン

１、シングルエンド

バーコードのフォーマット

barcode1 barcode1_output_file.fastq
barcode2 barcode2_output_file.fastq
...

sabre se -f input_file.fastq -b barcode_data.txt -u unknown_barcode.fastq

２、ペアードエンド

バーコードのフォーマット

barcode1 barcode1_output_file1.fastq barcode1_output_file2.fastq
barcode2 barcode2_output_file1.fastq barcode2_output_file2.fastq
...

sabre pe -m 0 -f input_file1.fastq -r input_file2.fastq -b barcode_data.txt -u unknown_barcode1.fastq -w unknown_barcode1.fastq

両方のfastqにバーコードがあるなら、-cのフラグをつける。

引用

https://github.com/najoshi/sabre

https://blog.insidedna.me/insidedna-now-supports-sabre-a-barcode-demultiplexing-and-trimming-tool-for-fastq-files/

2018-01-22

トランスクリプトームから主要なtrasncriptsを選抜する EvidentialGene

RNA seq sequence clustering assembly de novo transcriptome dereplication

2019 6/19 関連論文追記

2020 2/17 依存のインストール手順更新

EvidentialGeneのtr2aacds.plは、de novo アセンブルツールの結果から生物学的に有用な最良のmRNAセットにクラスタリングするパイプライン。論文は準備中で不明な点もあるが、ポスターによると以下の流れで冗長なtranscirptsを減らすらしい。fastanrdbとcd-hitを使ったあと、blastを使いprimaryなtranscirptsを選抜している。

Algorithm of tr2aacds:

　0. collect input transcripts.tr, produce CDS and AA sequences, work mostly on CDS.

　1. perfect redundant removal with fastanrdb

　2. perfect fragment removal with cd-hit-est

　3. blastn, basic local align high-identity subsequences for alternate tr.

　4. classify main/alternate cds, okay & drop subsets by CDS-align, protein metrics.

　5. output sequence sets from classifier: okay-main, okay-alts, drops. See http://eugenes.org/EvidentialGene/about/EvidentialGene_trassembly_pipe.html

すでにいくつかのde novo transcriptome解析の論文で、複数のde novo アセンブルツールの結果をマージして冗長性を減らすために使われている（ref1）。

公式サイト

http://arthropods.eugenes.org/genes2/about/EvidentialGene_trassembly_pipe.html

wiki

https://sourceforge.net/p/evidentialgene/wiki/Home/

インストール

依存

fastanrdb of exonerate package, quickly reduces perfect duplicate sequences
cd-hit, cd-hit-est clusters protein or nucleotide sequences.
blastn and makeblastdb of NCBI BLAST, Basic Local Alignment Search Tool, finds regions of local similarity between sequences.

#bioconda
conda install -c bioconda -y exonerate blast

本体

http://arthropods.eugenes.org/EvidentialGene/evigene/

The best way to get〜のftpリンクからダウンロードする。

解凍してpub/evigene/scripts/prot/tr2aacds2.plを使う。

$ perl evigene/scripts/prot/tr2aacds2.pl

EvidentialGene tr2aacds.pl VERSION 2017.12.21

convert large, redundant mRNA assembly set to best protein coding sequences,

filtering by quality of duplicates, fragments, and alternate transcripts.

See http://eugenes.org/EvidentialGene/about/EvidentialGene_trassembly_pipe.html

Usage: tr2aacds.pl -mrnaseq transcripts.fasta[.gz]

opts: -MINCDS=90 -NCPU=1 -MAXMEM=1000.Mb -[no]smallclass -logfile -tidyup -dryrun -debug

ラン

de novo アセンブルツールの結果をマージしたfastaを入力とする。複数あるなら"cat *fa > merged.fa"などでコンカテネートしておく。

１、解析前にFASTAを修正する。protocol.ioにJared Mamrotさんが投稿されたDe novo transcriptome assembly workflowのワークフローを真似て、FASTAのヘッダーをシンプルな名前に修正する。

perl -ane 'if(/\>/){$a++;print ">Locus_$a\n"}else{print;}' input.fasta > renamed.fasta

#さらにformatを修正
pat.pl -output output.fa -input renamed.fasta

(trformat.pl :regularize IDs in fasta of Velvet,Soap,Trinity, ensure unique IDs, add prefixes for parameter sets.)

evigene/scripts/prot/にtr2aacds2.plはある。ラン。

perl evigene/scripts/prot/tr2aacds2.pl -mrnaseq output.fa -MINCDS=90 -NCPU=12 -logfile

okayset/のoutput.okalt.faとoutput.okay.faは同じものではなく、主要な転写配列とalternativeの転写配列になる。こちらも参考にしてください。

https://translate.google.co.jp/translate?hl=ja&sl=en&u=https://www.biostars.org/p/273551/&prev=search

引用

poster

Gene-omes built from mRNA seq not genome DNA

ref1: cd-hitより効率的。

https://sourceforge.net/p/evidentialgene/discussion/general/thread/a4f0e29f/

この論文でEvidentialGeneと比較されています。

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6105091/

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

複数のアセンブラのコンティグをマージする GAM-NGS

リファンレンスガイドのトランスクリプトのアセンブル strawberry

リファンレンスガイドのトランスクリプトのアセンブリツール Scallop

真核生物のRNAのコード領域を予測するGeneMarkS-T

de novo transcriptome assembliesを評価する rnaQUAST

demulitiplexしてサンプルを分割する sabre

トランスクリプトームから主要なtrasncriptsを選抜する EvidentialGene