2018-01-17

de novo transcriptome 解析のツールまとめ（主にde novo assemblerについて）

随時更新

2020 2/4 help更新

RNAの発現は分子によって大きな差があるため、どのようなデータセットでも万能なRNAのアセンブルツールというのは存在しない。そのため単一のアセンブルツールでアセンブルするより、複数のアセンブルツールでかつ複数のk-merサイズでアセンブルして、そのプール全体からよりよいmRNAセットを選抜してくるほうが、より完全なRNAのコレクションになりやすい。一般に、より大きいk値は、高い遺伝子発現レベルまたはより長いコンティグで良好に機能し、小さいk値は、低い遺伝子発現レベルまたはより短いコンティグで良好に機能する。したがって、Trans-ABySS、Oases-M（Oaseのmultiple-kバージョン）、IDBA-Tranなどのいくつかのアセンブラでは、複数のアセンブリが異なるk値が合併されてより高い感度が得られるが、同時により多くの偽陽性転写産物もできる傾向にある。

以下のポスターの右側に、その複数のアセンブルツールを使う覚書があるので確認してみてください。

Do and Don't

https://f1000research.com/posters/5-1695

追記

Pre-processing

前処理　　RNA seqデータの包括的な前処理ツール FastqPuri
マッピング分析　RNA seqのクオリティコントロールツール RSeQC
クオリティチェック　RNA seqのクオリティチェックツール QoRTs
rRNA フィルタリング　rRNAのコンタミを除く SortMeRNA
エラーコレクション　RNA seqのエラーコレクションツール Rcorrector

De novo assembly

最適化　複数のアセンブラを使った自動アセンブリOyster River Protocol
スーパートランスクリプト　SuperTranscripts 其の１、 其の２
BRANCH 　リファレンスガイドアセンブリを行う
BinPacker de novoアセンブリツール
Scallop リファンレンスガイドのトランスクリプトのアセンブリ
TransLiG de novo transcriptomeのアセンブリTransLiG
RNA-Bloom シングルセルとバルクのトランスクリプトームのアセンブリ
TransPI 包括的なde novoトランスクリプトームアセンブリのパイプライン

Hybrid de novo assembly

IDP-denovo hybrid-assemblyによる de novo transcriptomeアセンブリ
SPAdes（Hybrid transcriptome assemblyは3.14から実装）SPAdesアセンブラ

Genome guide assembly (よく似たゲノムが利用できる場合)

StringTie2（genomeガイドアセンブリを行いGTFを出力） StringTie2
strawberry 　リファンレンスガイドのトランスクリプトアセンブリ

Non-coding RNA

真核と原核のSSU rRNA再構成にも使えるphyloFlash
入力配列lからrRNAを検出するBarrnap
rRNAを除くSortMeRNA

Long read

isONclust ロングリードのクラスタリングツール
エラー修正　既知変異を保護してエラー修正する TranscriptClean
ONT cDNAロングリードのエラー修正　エラー修正を行うisONcorrect
ハイブリッドエラーコレクション　 Transcript-level Aware なエラー修正 TALC

Clustering

EvidentialGene　トランスクリプトームから主要なtrasncriptsを選抜
Corset　de novo transcriptomeのcontigクラスタリングツール
Grouper クラスタリングとclosely rellatedな種情報を用いたアノテーション
Clust 共発現遺伝子の自動クラスタリングツール
CD-HIT 配列をクラスタリングする CD-HIT

Pan-transcriptome

GET_HOMOLOGUES-EST パントランスクリプトーム解析

Coding region

TransDecoder タンパク質のコード領域を推定する TransDecoder
orfipy　柔軟な出力パラメータをもつ高速なORF予測ツール orfipy
Borf　de-novo assembled transcriptomeのORF予測を行う Borf

Merge

Annotation

Trinotate　de novo transcriptome向けのアノテーションツール
dammit　de novo transcriptomeのアノテーションツール
MEGANTE　植物ゲノムアノテーションwebサービス MEGANTE
GO FEAT　webベースのfunctional annotation ツール
FunctionAnnotator　包括的なfunctional annotationを行うwebツール
KAAS KEGGのアノテーションサーバー KAAS
eggNOG-mapper 機能アノテーション付けを行う eggNOG-Mapper
統合サービス DAVIDデータベース
汚染のチェックや機能解析　de novo transcriptomeの系統・機能解析 TRAPID 2.0
タンパク質の機能的アノテーション　Functional annotationを行うwebサーバー PANNZER2
ディープラーニングによるアノテーションツール　DeepNOG

GO enrichment analysis

Gonet　GO term enrichment解析を行う（ヒト、マウス向け）
agriGO　GO enrichment解析データベース agriGO （植物、藻類向け）
shinyGO インタラクティブなGOエンリッチメント解析webサービス ShinyGO （pathway解析にも利用可能）
DiVenn (Omics向け) 従来のベン図表現を拡張する DiVenn
GOとアノテーション　GOアノテーション間の関係と類似性を調べるNaviGO
GO enrichment解析結果の視覚化　REVIGO
GenFam　植物遺伝子のファミリー分類とエンリッチメント解析
WEGO2 GO annotation分析ウェブサービス WEGO2

Functional enrichment analysis

g:Profiler Functional enrichment analysisとID変換

Pathway analysis

WikiPathways App for Cytoscape　 WikiPathwaysのCytoscapeプラグイン

Evaluation

BUSCO　アセンブル結果をCore gene setの検出数で評価する BUSCO
TransRate アセンブル結果をリードのアライメントから評価 TransRate
RSEM-EVAL リファレンスなしアセンブル評価 RSEM-EVAL（DETONATE）
rnaQUAST transcriptome assembliesを評価する rnaQUAST
アセンブリ配列比較　blast比較結してベン図を描く VennBLAST

More comprehensive transcriptome

Necklace よく似たゲノム情報を使い不完全なゲノム情報のRNA seq解析の精度を上げるNecklace

Visualization

intervene (Venn diagram) 共通 / 非共通の遺伝子リストを視覚化
Vennt (Venn diagram)　　DGEリストからベン図を作成するwebサービス
jvenn (Venn diagram) 共通/非共通の要素をベン図で視覚化し、抽出
svist4get(mapping) genome trackを可視化
shinyheatmap (heatmap) インタラクティブなヒートマップを作成
coverageトラックの視覚化　　カバレッジトラックを視覚化する SparK

Integrated Analysis Platform (web service)

iDEP インタラクティブなRNA seq解析webアプリケーション iDEP
GREIN NCBI GEO のRNA-seqデータを分析する GREIN
IRIS-EDA IRIS-EDA（立ち上げだけ）
SRAデータ比較　SRAのRNA seqを比較・分析 Digital expression explorer 2
Degust　　webベースでRNA seqのDEG解析などができるDegust
RaNA-seq 自動化された高速なRNA seq解析webサービス

Other

結果をまとめるツール　様々なツールの分析結果を１つに集約して分析できる MultiQC
ID変換サービス　 Functional enrichment analysisとID変換
統合サービス DAVIDデータベース

Database

主要なpathwayデータベースへのリンク　PathBank
AmiGO2 Gene Ontologyデータベース AmiGO2

ここでは代表的なRNA seqのアセンブルプログラムの導入と使い方についてまとめる。

１、Trinity

Releases · trinityrnaseq/trinityrnaseq · GitHub

#bioconda　（link）
mamba create -n trinity python=3.9
conda activate trinity
#バージョン指定しないと古いバージョンが入ることがあるので、バージョンを指定するのが無難。2022年5月現在ならレポジトリは2.14が最新、conda配布版は2.13.2が最新（上のリンクから確認）。
mamba install -c bioconda -y trinity=2.13.2

#bowtie1も必要
conda install -c bioconda -y bowtie

#homebrew (試してません)
brew install Trinity

#dockerhub (link) 
docker pull trinityrnaseq/trinityrnaseq:latest

> Trinity

$ Trinity

################################################################

# Required:

# --seqType <string> :type of reads: (fq or fa)

# If paired reads:

# --left <string> :left reads

# --right <string> :right reads

# Or, if unpaired reads:

# --single <string> :single reads

# --output <string> :name of directory for output (will be created if it doesn't already exist)

# default( "trinity_out_dir" )

# if strand-specific data, set:

# --SS_lib_type <string> :if paired: RF or FR, if single: F or R

# Butterfly-related options:

# --bfly_opts <string> :parameters to pass through to butterfly (see butterfly documentation).

# --bflyHeapSpace <string> :java heap space setting for butterfly (default: 1000M) => yields command java -Xmx1000M -jar Butterfly.jar ... $bfly_opts

# --no_run_butterfly :stops after the Chrysalis stage. You'll need to run the Butterfly computes separately, such as on a computing grid.

# Then, concatenate all the Butterfly assemblies by running:

# find trinity_out_dir/ -name "*allProbPaths.fasta" -exec cat {} + > trinity_out_dir/Trinity.fasta

# Inchworm-related options:

# --no_meryl :do not use meryl for computing the k-mer catalog (default: uses meryl, providing improved runtime performance)

# --min_kmer_cov <int> :min count for K-mers to be assembled by Inchworm (default: 1)

# Chyrsalis-related options:

# --max_reads_per_graph <int> :maximum number of reads to anchor within a single graph (default: 20000000)

# Misc:

# --CPU <int> :number of CPUs to use, default: 2

# --min_contig_length <int> :minimum assembled contig length to report (def=200)

# --paired_fragment_length <int> :maximum length expected between fragment pairs (aim for 90% percentile) (def=300)

# --jaccard_clip :option, set if you have paired reads and you expect high gene density with UTR overlap (use FASTQ input file format for reads).

#############################################################

ラン

Trinityにはgenome guide assemblyとde novo asemblyの２つの方式がある。

https://github.com/trinityrnaseq/trinityrnaseq/wiki/Genome-Guided-Trinity-Transcriptome-Assembly

A. Genome guided assembly

#まずindexをつけ、ゲノムにmappingする。
#index
mkdir genome/ #空フォルダを作っておく
STAR --runMode genomeGenerate --genomeDir genome/ --genomeFastaFiles chromosome.fa --runThreadN 12 --sjdbGTFfile reference.gtf #gtfがないゲノムなら、gtfのオプションは指定なしでindex。

#mapping 予めmergeしておいたfastqを使いマッピング(例えば cat *pair1.fq > merge_pair1.fq)
STAR --genomeDir genome/ --readFilesIn merged_pair1.fq merged_pair2.fq --runThreadN 30 --outSAMtype BAM SortedByCoordinate --outFileNamePrefix sample

#bamのアライメント情報を使いtrinityでgenome guide assembly (k-merはdefaultの25)
Trinity --genome_guided_bam sample_mergeAligned.sortedByCoord.out.bam --genome_guided_max_intron 50000 --max_memory 100G --CPU 20

B. De novo assembly

#全サンプルをmergeしたfastqを使い、de novoでアセンブル (mergeはcat *pair1.fq > merge_pair1.fq)
Trinity --seqType fq --left pair1.fastq --right pair2.fastq --max_memory 100G --CPU 40 --output outdir

注意：Trinityはアップデートが続けられており、古いバージョンより最新のバージョンの方が連続性、精度共に向上する可能性が高いです。

Trinity --version

コマンドを打って、古いバージョンを使っていないか定期的に確認してください。

追記

２、Trans-ABySS

GitHub - bcgsc/transabyss: de novo assembly of RNA-Seq data using ABySS.

依存

#bioconda　（link）
conda install -c bioconda -y transabyss


#condaを使わないなら依存から順番に入れていく。python2環境でうごくので、pyenvやconda createでpython2.7環境にしておく。BLATはbrewで導入する。ただしABySSはbrewで導入するとversion2以降が入ってしまう。上のリンクからyosemite用のバイナリをダウンロードしてパスを通す。
brew install blat

#依存のigraphのインストール。
pip install python-igraph

#本体はソースから。リンクを張るかパスを通しておく。
git clone https://github.com/bcgsc/transabyss.git
cd  transabyss/
./transabyss

> transabyss -h

$ transabyss -h

usage: transabyss [-h] [--version] [--se PATH [PATH ...]]

[--pe PATH [PATH ...]] [--SS] [--outdir PATH] [--name STR]

[--stage {dbg,unitigs,contigs,references,junctions,final}]

[--length INT] [--cleanup {0,1,2,3}] [--threads INT]

[--mpi INT] [-k INT] [-c INT] [-e INT] [-E INT] [-q INT]

[-Q INT] [-n INT] [-s INT] [--gsim INT] [--indel INT]

[--island INT] [--useblat] [--pid FLOAT] [--walk FLOAT]

[--noref]

Assemble RNAseq with Trans-ABySS.

optional arguments:

-h, --help show this help message and exit

--version show program's version number and exit

Input:

--se PATH [PATH ...] single-end read files

--pe PATH [PATH ...] paired-end read files

--SS input reads are strand-specific

Basic Options:

--outdir PATH output directory

[/Users/kazu/1/transabyss_2.0.1_assembly]

--name STR assembly name [transabyss] (ie. output assembly:

'transabyss-final.fa')

--stage {dbg,unitigs,contigs,references,junctions,final}

run up to the specified stage [final]

--length INT minimum output sequence length [100]

--cleanup {0,1,2,3} level of clean-up of intermediate files [1]

ABySS Parameters:

--threads INT number of threads ('j' in abyss-pe) [1]

--mpi INT number of MPI processes ('np' in abyss-pe) [0]

-k INT, --kmer INT k-mer size [32]

-c INT, --cov INT minimum mean k-mer coverage of a unitig [2]

-e INT, --eros INT minimum erosion k-mer coverage [c]

-E INT, --seros INT minimum erosion k-mer coverage per strand [0]

-q INT, --qends INT minimum base quality on 5' and 3' ends of a read [3]

-Q INT, --qall INT minimum base quality throughout a read

-n INT, --pairs INT minimum number of pairs for building contigs [2]

-s INT, --seed INT minimum unitig size for building contigs [k]

Advanced Options:

--gsim INT maximum iterations of graph simplification [2]

--indel INT indel size tolerance [1]

--island INT minimum length of island unitigs [0]

--useblat use BLAT alignments to remove redundant sequences.

--pid FLOAT minimum percent sequence identity of redundant

sequences [0.95]

--walk FLOAT percentage of mean k-mer coverage of seed for path-

walking [0.05]

--noref do not include reference paths in final assembly

Written by Ka Ming Nip.

Report bugs to <trans-abyss@googlegroups.com>

ランチュートリアル

https://github.com/bcgsc/transabyss/blob/master/TUTORIAL.txt

ラン

アセンブル。

#transabyssでアセンブル
#K=21
transabyss --pe merged_pair1.fq merged_pair2.fq -k 21 --threads 10 --length 100 --outdir k21
#K=31
transabyss --pe merged_pair1.fq merged_pair2.fq -k 31 --threads 10 --length 100 --outdir k31
#K=41
transabyss --pe merged_pair1.fq merged_pair2.fq -k 41 --threads 10 --length 100 --outdir k41
#K=51
transabyss --pe merged_pair1.fq merged_pair2.fq -k 51 --threads 10 --length 100 --outdir k51
#K=61
transabyss --pe merged_pair1.fq merged_pair2.fq -k 61 --threads 10 --length 100 --outdir k61

# transabyss-mergeで複数のk-merアセンブルをmerge
 transabyss-merge k21.fa k31.fa k41.fa k51.fa k61.fa --mink 21 --maxk 61 --prefix k21. k31. k41. k51. k61

３、SOAPdenovo-Trans

macではgnuのmakeに失敗したので、深追いせずcent OSに導入した。

ソースコード

SOAPdenovo-Trans - Browse /SOAPdenovo-Trans at SourceForge.net

BInary (linux)

SOAP :: Short Oligonucleotide Analysis Package

#bioconda　（link）
conda install -c bioconda -y soapdenovo-trans


mkdir SOAPdenovo && cd SOAPdenovo/
wget https://downloads.sourceforge.net/project/soapdenovotrans/SOAPdenovo-Trans/bin/v1.03/SOAPdenovo-Trans-bin-v1.03.tar.gz
tar zxvf SOAPdenovo-Trans-bin-v1.03.tar.gz

#カレントにパスを通す。
echo export PATH=\$PATH:`pwd`\ >> ~/.bash_profile && source ~/.bash_profile

> SOAPdenovo-Trans-31mer

$ SOAPdenovo-Trans-31mer

The version 1.03: released on July 19th, 2013

Usage: SOAPdenovo-Trans <command> [option]

pregraph construction kmer-graph

contig eliminate errors and output contigs

map map reads to contigs

scaff scaffolding

all doing all the above in turn

SOAPdenovo-Trans-127merもある。

ラン

はじめにconfig_fileを作る。

cat > SOAP.config <<TEXT 
max_rd_len=150 
[LIB] avg_ins=192 
reverse_seq=0 
asm_flags=3 
q1=R1_pairedout.fastq 
q2=R2_pairedout.fastq 
TEXT

config_fileを指定してアセンブル。

SOAPdenovo-Trans-31mer all -K 21 -p 20 -s config_file -o SOAPdenovo-Trans_output_k21

SOAPdenovo-Trans_output_k21.scafStatistics にアセンブルのサマリーが出る。アセンブル結果はSOAPdenovo-Trans_output_k21.scafSeq に出る。

SOAPdenovo-Transのランタイムはかなり短いが、メモリ要求量はかなり多め。

４、Oases

このOases自体、velvetの複数k-merアセンブルをマージするツールとなる。

https://github.com/dzerbino/oases

#bioconda (link)
conda install -c bioconda -y oases

#またはソースから。velvetのインストールについては、こちらのリンクのvelvetの項目を参照。
git clone --recursive https://github.com/dzerbino/oases 
cd oases 
make

> oases

$ oases

oases - De novo transcriptome assembler for the Velvet package

Version 0.2.09

This is free software; see the source for copying conditions. There is NO

warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Compilation settings:

CATEGORIES = 4

MAXKMERLENGTH = 191

OPENMP

LONGSEQUENCES

Usage:

./oases directory [options]

directory : working directory name

Standard options:

--help : this help message

--version : print version and exit

--citation : print citation to Oases manuscript and exit

-ins_length2 <integer> : expected distance between two paired-end reads in the second short-read dataset (default: no read pairing)

-ins_length_long <integer> : expected distance between two long paired-end reads (default: no read pairing)

-ins_length*_sd <integer> : est. standard deviation of respective dataset (default: 10% of corresponding length)

[replace '*' by nothing, '2' or '_long' as necessary]

-unused_reads <yes|no> : export unused reads in UnusedReads.fa file (default: no)

-amos_file <yes|no> : export assembly to AMOS file (default: no export)

-alignments <yes|no> : export a summary of contig alignment to the reference sequences (default: no)

Advanced options:

-cov_cutoff <floating-point> : removal of low coverage nodes AFTER tour bus or allow the system to infer it (default: 3)

-min_pair_count <integer> : minimum number of paired end connections to justify the scaffolding of two long contigs (default: 4)

-min_trans_lgth <integer> : Minimum length of output transcripts (default: hash-length)

-paired_cutoff <floating-point> : minimum ratio allowed between the numbers of observed and estimated connecting read pairs

Must be part of the open interval ]0,1[ (default: 0.1)

-merge <yes|no> :Preserve contigs mapping onto long sequences to be preserved from coverage cutoff (default: no)

-edgeFractionCutoff <floating-point> : Remove edges which represent less than that fraction of a nodes outgoing flow

Must be part of the open interval ]0,1[ (default: 0.01)

-scaffolding <yes|no> : Allow gaps in transcripts (default: no)

-degree_cutoff <integer> : Maximum allowed degree on either end of a contigg to consider it 'unique' (default: 3)

Output:

directory/transcripts.fa

directory/contig-ordering.txt

ラン

アセンブル。 velveth => velvetg => oasesと進む。

velveth velvet_k21 21 -fastq merged.fq 
velvetg velvet_k21 -ins_length1 300 -ins_length2 3000 -read_trkg yes 
oases velvet_k21/ -min_trans_lgth 100 -ins_length 215

oasesはスクリプトでもランできる。

python scripts/oases_pipeline.py -m 21 -M 23 -d 'data/test_reads.fa'

５、IDBA-tran

$ idba_tran

not enough parameters

IDBA-Tran - Iterative de Bruijn Graph Assembler for next-generation transcriptome sequencing data.

Usage: idba_tran -r read.fa -o output_dir

Allowed Options:

-o, --out arg (=out) output directory

-r, --read arg fasta read file (<=600)

-l, --long_read arg fasta long read file (>600)

--mink arg (=20) minimum k value (<=312)

--maxk arg (=60) maximum k value (<=312)

--step arg (=10) increment of k-mer of each iteration

--inner_mink arg (=10) inner minimum k value

--inner_step arg (=5) inner increment of k-mer

--prefix arg (=3) prefix length used to build sub k-mer table

--min_count arg (=2) minimum multiplicity for filtering k-mer when building the graph

--min_support arg (=1) minimum supoort in each iteration

--num_threads arg (=0) number of threads

--seed_kmer arg (=30) seed kmer size for alignment

--min_contig arg (=200) minimum size of contig

--min_transcript arg (=300) minimum size of transcript

--similar arg (=0.95) similarity for alignment

--max_mismatch arg (=3) max mismatch of error correction

--no_local do not use local assembly

--no_coverage do not iterate on coverage

--no_correct do not do correction

--pre_correction perform pre-correction before assembly

--max_isoforms arg (=3) maximum number of isoforms

--max_component_size arg (=30) maximum size of components

１、マージしたfastaの作成

fastqはfastaとして与える必要がある。ペアエンドデータの場合、マージする必要もあるため、IDBAのfq2faコマンドを使ってマージしながらfastaに変換するマージする。fq2faコマンドはgzip圧縮fastqは受け付けないので、解凍してから指定する。

#ペアエンド
fq2fa --merge --filter pair_1.fq pair_2.fq read.fa
#シングルエンド
fq2fa single.fq read.fa

２、IDBA-tranのラン

24スレッド指定、k値は25から125まで5ずつ増やす。相同性98％、サイズ300bp以上とする。precorrection実行。

idba_tran -r read.fa --num_threads 24 --pre_correction -o out_dir --mink 25 --maxk 125 --step 10 --min_transcript 300 --similar 0.98

引用

SOAPdenovo-Trans: de novo transcriptome assembly with short RNA-Seq reads.

Xie Y, Wu G, Tang J, Luo R, Patterson J, Liu S, Huang W, He G, Gu S, Li S, Zhou X, Lam TW, Li Y, Xu X, Wong GK, Wang J.

Bioinformatics. 2014 Jun 15;30(12):1660-6.

Oases: robust de novo RNA-seq assembly across the dynamic range of expression levels.

Schulz MH1, Zerbino DR, Vingron M, Birney E.

Bioinformatics. 2012 Apr 15;28(8):1086-92.

Full-length transcriptome assembly from RNA-Seq data without a reference genome.

Manfred G Grabherr, Brian J Haas, Moran Yassour, Joshua Z Levin, Dawn A Thompson, Ido Amit, Xian Adiconis, Lin Fan, Raktima Raychowdhury, Qiandong Zeng, Zehua Chen, Evan Mauceli, Nir Hacohen, Andreas Gnirke, Nicholas Rhind, Federica di Palma, Bruce W Birren, Chad Nusbaum, Kerstin Lindblad-Toh, Nir Friedman & Aviv Regev

Nature Biotechnology 29, 644–652 (2011)

De novo assembly and analysis of RNA-seq data.

Gordon Robertson, Jacqueline Schein[…]Inanc Birol

Nature Methods 7, 909–912 (2010).

Velvet: algorithms for de novo short read assembly using de Bruijn graphs.

D.R. Zerbino and E. Birney.

Genome Res. 2008 May;18(5):821-9.

De novo transcriptome assembly workflow

https://www.protocols.io/view/de-novo-transcriptome-assembly-workflow-ghebt3e?step=9

参考資料

Sildeshareの関連プレゼン資料

Differential expression analysis of de novo assembled transcriptomes - Nadia Davidson from Australian Bioinformatics Network

論文３つ

Combining Transcriptome Assemblies from Multiple De Novo Assemblers in the Allo-Tetraploid Plant Nicotiana benthamiana

https://dl.sciencesocieties.org/publications/tpg/pdfs/8/1/plantgenome2014.10.0064

https://www.ncbi.nlm.nih.gov/pubmed/27054874

論文４ mergeしてパフォーマンスを上げる

https://www.ncbi.nlm.nih.gov/pubmed/28224052

追記

評価のペーパーが出ています。コマンドや手順も載っているのでぜひ読んで見て下さい。

こちらも読んでみて下さい。

A consensus-based ensemble approach to improve de novo transcriptome assembly

https://www.biorxiv.org/content/10.1101/2020.06.08.139964v1

f:id:kazumaxneo:20200915192446p:plain

2018-01-16

de novo transcriptomeのアセンブル評価ツール RSEM-EVAL（DETONATEパッケージ）

RNA seq evaluation tool assembly de novo transcriptome 2014 Genome Biology

2018、8/9 誤字脱字修正

2021 12/24 タイトル変更

　RNAシーケンシング（RNA-Seq）技術は、トランスクリプトームの大規模分析を可能にすることによって、ゲノムの配列がまだ決定されていない種の研究に革命をもたらしている。このようなトランスクリプトームを研究するためには、ゲノム配列情報の助けを借りずに、RNA-Seqのリードからde novoトランスクリプトームアセンブリを介してtranscripts配列を再構成し、転写物配列のセットを決定しなければならない。 Roche 454 Life Scienceのプラットフォームデータ[論文より　ref.9]〜[12]を対象としたイルミナのプラットフォームデータ[1]〜[8]やその他のものについて、多くのde novo transcriptome assemblersが現在利用可能である。これらのアセンブラは、しばしばかなりのユーザ調整可能パラメータのセットと組み合わされて、単一データセットの候補アセンブリの大きなスペースの生成を可能にしている。しかし、特にgrand truthがわからないときに、アセンブリの精度を適切に評価することは困難である。

　de novo transcriptome assemblyの評価には多くの研究が行われている[ref.13]〜[20]。このような研究で使用されるアセンブリ評価尺度は、リファレンスベースとリファレンスなしの2つのクラスに分類することができる。基準に基づく尺度は、既知の系列を用いて計算される尺度である。例えば、アセンブリとリファレンス転写産物セットとの間の対応を確立した後、リファレンス転写産物とアセンブリのマッチした割合（recall）、または転写産物とこれらの2つの組み合わせ（例えば、F1 measure）[5]、[16]、[17]。転写産物セットに加えて、ゲノムおよびタンパク質配列もまた、アセンブリ評価のためのリファレンスとして使用することができる[ref.2]、[ref.4]、[ref.8]、[ref.13]、[ref.15]、[ref.20]。

　しかし、デノボアセンブリに関心のあるほとんどのケースでは、リファレンスシーケンスは利用できないか、不完全であるか、または関心のあるサンプルのgrand truthからかなり離れているため、アセンブリ評価作業が著しく困難になる。そのような場合には、リファレンスフリーの手段に頼らなければならない。一般的に使用されるリファレンスフリー測定値には、コンティグ長の中央値、コンティグ数およびN50が含まれる[13]、[16]、[17]。残念なことに、これらの措置は原始的であり、しばしば誤解を招く[20]。たとえば、最も一般的なリファレンスフリー測定の1つであるN50は、簡単なアセンブリで最大化できる。 N50は、少なくともその長さのすべてのコンティグがアセンブリの塩基の少なくとも50％を構成するような最長コンティグの長さと定義される[21]。この対策の動機は、より良いアセンブリは、入力リード間の識別されたオーバーラップの数が多いことに起因するため、より長いコンティグに組み合わされたリードが多くなることである。しかし、すべての入力を1つのコンティグに連結することによって構築された簡単なアセンブリは、この尺度を最大化してしまうことは容易に理解できる。要するに、N50はコンティグの連続性を測定するが、精度は測定しない[22]。アセンブリがシングルトン（すなわち、シングルリードから得られるコンティグ）を含む場合に潜在的に有益であることが示されているが、他の単純化されたリファレンスフリー測定は同様に誤解を招く可能性がある[20]。

　著者らは、DETONATE方法論とソフトウェアパッケージを提示することにより、トランスクリプトームアセンブリ評価における最先端技術を向上させる。 DETONATEは、RSEM-EVALとREF-EVALの2つのコンポーネントで構成されている。 DETONATEの主な寄稿者であるRSEM-EVALは、アセンブリにのみ依存する新規確率モデルに基づくリファレンスフリーの評価方法である。 RSEM-EVALは、統計モデルを用いてゲノムおよびメタゲノム[24]、[25]アセンブリを評価または構築する最近のアプローチに類似しているが、著者らが論じるように、転写産物のabundanceおよびオルタナティブスプライシング、N50のような単純化されたリファレンスフリー測定とは異なり、RSEM-EVALは、アセンブリのコンパクトさやRNA-Seqデータからのアセンブリのサポートなど、複数の要因を単一の統計的に原理的な評価スコアに組み合わせる。このスコアは、最適なアセンブラを選択し、アセンブラのパラメータを最適化し、新しいアセンブラの設計を目的関数として導くために使用できる。さらに、アセンブリ内の各コンティグについて、RSEM-EVALは、コンティグがRNA-Seqデータによってどの程度うまくサポートされているかを評価するスコアを提供し、不要なコンティグをフィルタリングするために使用することができる。 DETONATEの第2のコンポーネントであるREF-EVALは、リファレンスベースの対策のツールキットである。これは、既存のリファレンスベースの測定値よりも、アセンブリの精度をより洗練された視点で提供する。

　著者らは、RSEM-EVALスコアの価値を実証するために、実際のデータとシミュレートされたデータの両方について多数の実験を行った（以下略）。

DETONATE:

http://deweylab.biostat.wisc.edu/detonate/

インストール

公式サイトからダウンロードする。

DETONATE: DE novo TranscriptOme rNa-seq Assembly with or without the Truth Evaluation

解凍してビルド。

cd detonate-1.11/rsem-eval
make

#またはcondaを使う
#bioconda(link)
conda install -c bioconda -y detonate

ラン

１、初めに近縁なモデル生物のtranscripts.fastaの長さを出力する。

cd detonate-1.11/rsem-eval/
./rsem-eval-estimate-transcript-length-distribution transcripts.fa parameter_file

parameter_fileが出力される。

２、アセンブルしたFASTAとそれに使ったfastqを指定して RSEM-EVALスコアを計算する。

Paired-end

./rsem-eval-calculate-score -p 8 --transcript-length-parameters parameter_file --paired-end pair1.fq pair2.fq assembly.fasta L

L　For single-end data, L represents the average read length. For paired-end data, L represents the average fragment length. It should be a positive integer (real value will be rounded to the nearest integer).

Single-end

./rsem-eval-calculate-score -p 8 --transcript-length-parameters parameter_file single.fq assembly.fasta L

解析が終わるといくつかファイルができるが、~socreが評価ファイルである。下はテストデータを使った時の~socreファイル出力。

$ cat detonate-1.11/rsem-eval/toy_assembly_1.score

Score -87426.14

BIC_penalty -8.25

Prior_score_on_contig_lengths_(f_function_canceled) -7.91

Prior_score_on_contig_sequences -867.82

Data_likelihood_in_log_space_without_correction -86542.17

Correction_term_(f_function_canceled) -0.00

Number_of_contigs 1

Expected_number_of_aligned_reads_given_the_data 3812.00

Number_of_contigs_smaller_than_expected_read/fragment_length 0

Number_of_contigs_with_no_read_aligned_to 0

Maximum_data_likelihood_in_log_space -86541.97

Number_of_alignable_reads 3812

Number_of_alignments_in_total 3812

1行目はスコア。負の値であるが、大きい方が優れている（-20万より-8万がより良い）

スコアが続き、７行目の Number_of_contigはアセンブリに含まれるコンティグの数。

８行目のNumber_of_contigs_smaller_than_expected_read/fragment_lengthは、長さが期待されるフラグメント長よりも短いコンティグの数。Number_of_alignable_readsは少なくとも1つのアラインメントが検出されたリードの数（テストデータなので3800少ししかない）で、Number_of_alignments_in_totalがその総数。

計算の詳細は論文と公式ページで確認してください。

引用
Evaluation of de novo transcriptome assemblies from RNA-Seq data

Bo Li, Nathanael Fillmore, Yongsheng Bai, Mike Collins, James A Thomson, Ron Stewart and Colin N Dewey

Genome Biol. 2014 Dec 21;15(12):553

RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome
Li B, Dewey CN.

BMC Bioinformatics. 2011 Aug 4;12:323.

2018-01-16

RNAのリファレンスガイドアセンブリを行いDe novo RNA seqの精度を上げる BRANCH

RNA seq assembly Reference-assisted assembly de novo transcriptome

非モデル生物のDe novo RNA seq解析は断片化したRNAしかできないので、DEG解析が困難となる。BRANCHはそういった不完全なRNAに対して使う方法論で、近縁種のゲノム、またはcontigの情報をRNAのガイドとして利用し、アセンブルの精度を高める方法論。ゲノムのエキソンを推定することで実現している。

ランにはRNAのde novo assemblyツール（OasisやTrinity、TransABySS、SOAPdenovoなど）で作成したRNAのアセンブルデータと、ゲノム（またはcontigやgene）配列が必要。精度はリファンレスがどれほど近いか（同じ種でないならおそらく厳しい）、エキソンが正しく識別できるかに依存する。

依存

lemon: graph library

http://lemon.cs.elte.hu/trac/lemon/wiki/Downloads

lemonをビルドする。

cd lemon-x.y.z
mkdir build
cd build
cmake ..
make
make check #self-test
sudo make install #/usr/localにパスが通る

> lemon -x

$ lemon -x

Lemon version 1.0

本体　Github

https://github.com/baoe/BRANCH

g++ -o BRANCH BRANCH.cpp -lemon -lpthread #あらかじめlemonがビルドされてpathが通っていること

> ./BRANCH

$ ./BRANCH

BRANCH: boosting RNA-Seq assemblies with partial or related genome sequences

BRANCH --read1 reads_1.fa --read2 reads_2.fa --transfrag transfrags.fa --contig contigs.fa --transcript transcripts.fa [--insertLow insertLow --insertHigh insertHigh --threshSize threshSize --threshCov threshCov --threshSplit threshSplit --threshConn threshConn --closeGap --noAlignment --lowEukaryote]

Inputs:

--read1 is the first pair of PE RNA reads or single-end RNA reads in fasta format

--read2 is the second pair of PE RNA reads in fasta format

--transfrag is the de novo RNA transfrags to be extended

--contig is the DNA contigs or the genes of close related species

Output:

--transcript is the extended de novo transfrags

Options:

--insertLow is the lower bound of insert length (highly recommended; default: 0)

--insertHigh is the upper bound of insert length (highly recommended; default: 99999)

--threshSize is the minimum size of a genome region that could be identified as an exon (default: 2 bp)

--threshCov is the minimum coverage of a genome region that could be identified as an exon (default: 2)

--threshSplit is the minimum upstream and downstream junction coverages to split a genome region into more than one exons (default: 2)

--threshConn is the minimum connectivity of two exons that could be identified as a junction (default: 2)

--closeGap closes sequencing gaps using PE read information (default: none)

--noAlignment skips the initial time-consuming alignment step, if all the alignment files have been provided in tmp directory (default: none)

--lowEukaryote runs in a different mode for low eukaryotes with rare splice variants (default: none)

--misassemblyRemoval detects and then breaks at or removes misassembed regions (default: none)

パスを通しておく。

ラン

FASTA形式のペアリードとoasisやtriniityでRNAをアセンブルした配列、参照するDNA配列（ゲノムでもcontigでもgeneでも使える）を指定してランする。

BRANCH --read1 reads1.fa --read2 reads2.fa --transfrag transfrags.fa --contig contigs.fa --transcript output.fa

--read1 is the first pair of PE RNA reads or single-end RNA reads in fasta format -
-read2 is the second pair of PE RNA reads in fasta format
--transfrag is the de novo RNA transfrags to be extended
--contig is the DNA contigs or the genes of close related species
--transcript is the extended de novo transfrags (output file)

fastqからfastaへの変換はこちらを参照。

Githubに記載されているtips

Single-end reads should have the same length and are not recommended, since the quality of single-end alignment is hard to be kept.
It is better to use related gene sequences rather than related genome sequences to greatly reduce run time and memory usage.
Though --insertLow and --insertHigh are options, they should always be specified to generate meaning result. Suppose the insert length is I, insertLow = I - 20 and insertHigh = I + 20 would be fine.

非モデル生物の断片化したRNAのクラスタリングにはCorsetなどがあります。

引用

BRANCH: boosting RNA-Seq assemblies with partial or related genomic sequences.

Bao E1, Jiang T, Girke T.

Bioinformatics. 2013 May 15;29(10):1250-9. doi: 10.1093/bioinformatics/btt127. Epub 2013 Mar 14.

2018-01-14

ヒトゲノムのmulti-deletion、duplication、inversion、deletionなどのSVsを検出するSVelter

structural variations (SV) human genome

インストール

依存

Cython

Github

https://github.com/mills-lab/svelter

git clone https://github.com/mills-lab/svelter.git 
cd svelter 
python setup.py install --user 
export PATH=$PATH:$HOME/.local/bin

$ svelter.py

SVelter-0.1 Contact: xuefzhao@umich.edu Last Update:2015-08-20

Usage: SVelter [Options] [Parameters]

Options:

Setup

Clean

NullModel

BPSearch

BPIntegrate

SVPredict

SVIntegrate

SVelter Setup should be run first

Type in SVelter [options] for detailed parameters

ラン

gitからクローンすると、ヒトゲノム用のデータベースも入っている。

まずワーキングディレクトリのセットアップを行う。ヒトhg19のbedなどが入ったSupport/hg19/を指定する。他にhg38、GRCh37、GRCh38がある。--referenceでbamのアライメントに使ったFASTAと（samtools faidxでINDEXもあらかじめ作成しておく）をフルパスで指定する必要がある。

svelter.py Setup --reference reference.fa --workdir working/ --support Support/hg19/

ランすると、working/ができ、その中にいくつかファイルができる。これがランに必要になる（working/がすでにあるとランに失敗する）。

先ほどのリファンレスにアライメントしたbamデータを指定し、デフォルト条件でラン。

svelter.py --sample /absolute/path/of/sample.bam --workdir /working/directory/

終わるとworking/にvcfファイルが出力される。

引用

Resolving complex structural genomic rearrangements using a randomized approach.

Zhao X, Emery SB, Myers B, Kidd JM, Mills RE.

Genome Biol. 2016 Jun 10;17(1):126. doi: 10.1186/s13059-016-0993-1.

2018-01-14

複数の似たリファレンスが利用できるデータのアライメント作業を高速化するCompMap

Compression / decompression mapping

種によって利用できるリファンレスの数は大きく異なる。例えばアウトブレイクした菌種を同定するために、１つのfastqデータをたくさんのリファンレスにアライメントするような作業を行う場合、リファレンスが数百ー数万も利用できると、アライメント作業が計算の99%以上を占めてしまい、非常に効率が悪い。CompMapはリファレンスが複数利用できる場合に、リファレンスを圧縮してアライメントすることで、計算時間を短縮する方法論。複数のFASTAから代表するFASTAを作成し、それに対してアライメントすることで、リファレンスが非常に似ている時は劇的に計算時間とメモリ使用量を削減できる。もちろんリファンレスがある程度違っても活用できる（論文の 3 CASE STUDIES参照）。

インストール

依存

BWA or Bowtie

公式サイトからソースコードをダウンロードする。マニュアルもあり。

http://csse.szu.edu.cn/staff/zhuzx/CompMap/#Install

make
make clean

> ./compMAP comp -h

$ ./compMAP comp -h

comp Usage:

compMAP comp <genome_dir> [ option ] <output>

the genomes & the output in the FASTA format.

OPTIONS:

-L INT

Set the minimum length of valid repeat sequences to identify in database sequences. (Default: 1000)

-e FLO

Set the mismatch tolerance rate in local alignment. (Default: 0.05)

-N INT

Set the size of prospecting window in local alignment. When a mismatch is detected, the program looks beyond for N more bases. If there are more than N/2 mismatches within these bases, then the matching goes the other direction or terminates. (Default: 10)

-k INT

Set the length of a kmers used in locate local alignment. (Default: 8)

-P <str1, str2, ... str10>

Set the kmer prefixes. The max number of prefixes is 10.

For example, 'compMAP comp db_dir -P CG, AA, TA' assign three prefixes namely "CG","AC", and "TA". (Default: "CG").

-R INT <filename1, filename2, filename3...> [<@listfiles>]

Assign the reference sequences for database compression.

0: randomly select a reference;

1: select the longest sequence as reference;

2: provide the file names of reference sequences. (Default: 1)

For example,'compMAP comp db_dir -R 2 db1.fasta, db2.fasta ref.fa'

assigns "db1.fasta" and "db2.fasta" as the reference.

'compMAP comp db_dir -R 2 @listfiles.txt ref.fa'

provides the names of reference sequences in a text file "listfiles.txt" which could contain the following contents

db1.fasta

db2.fasta

db3.fasta

...

-h : print the help!

Eg:

compMAP comp db_dir ref.fa

compMAP comp db_dir -R 2 db1.fasta ref.fa

compMAP comp db_dir -L 200 -E 0.03 -N 16 -K 10 -P CG, AT -R 2 @listfiles.txt ref.fa

output file: ref.fa ref.fa.mat ref.fa.log

> ./compMAP map -h

$ ./compMAP map -h

map Usage:

compMAP map [ option ] <aln.sam> <ref.fa>

"aln.sam" is the output of the align tool such as BWA and bowtie; "ref.fa" is the output of the previous command.

Make sure "ref.fa.mat" and "ref.fa.log" are stored in the same directory with "ref.fa".

OPTIONS:

-F Used to speed up the running of the program it fixed-length short reads are input

-D

Display the information of the mapped positions including the number of successful, failed and junction alignments.

-h : print the help!

E.g:

compMAP map -D aln.sam ref.fa

compMAP map -F aln.sam ref.fa

パスを通しておく。

ラン

db_dir/にリファレンスのfasta（４ファイル）があるものとする。

compMAPのcompコマンドでリファレンスを代表するref.faを作成する。

compMAP comp db_dir ref.fa

ref.faが出力される。

このref.faを使ってbwaでfastqをアライメントする（bowtieでも検証されている）。

bwa index ref.fa 
bwa mem ref.fa reads.fq > aln.sam

aln.map.samができる。

元のアライメントを復旧する。

compMAP map aln.sam ref.fa

-F　Used to speed up the running of the program if fixed-length short reads are input.-F Used to speed up the running of the program if fixed-length short reads are input.
-D　Display the information of the mapped positions including the number of successful, failed and junction alignments.

samの行数を調べれば、圧縮したFASTAの数だけ、復旧後のsamはaln.samより倍増していることが確認できる。

引用

CompMap: a reference-based compression program to speed up read mapping to related reference sequences.

Zhu Z, Li L, Zhang Y, Yang Y, Yang X.

Bioinformatics. 2015 Feb 1;31(3):426-8.

2018-01-14

de novo assemblyで得たRNAのコンティグをクラスター化して、非モデル生物のDEG解析を可能にする Corset

RNA seq

　RNA seqデータをde novoでアセンブルすると、一般に同じ遺伝子のアイソフォームが区別され、それぞれを別々にアセンブルするため、似た複数のコンティグが生じてしまう（ SNPまたはindelだけが異なるコンティグを繰り返し報告する）。付け加えて、こうしてできたコンティグが異なるアイソフォームを真に反映している保証もない。こういったそっくりなコンティグは、アイソフォーム、二倍体生物や集団内のシークエンシングエラー、反復、カバレッジの変化または遺伝的変異などに起因して生じている。よってデノボアセンブリによって生成されるコンティグの数は、本来のトランスクリプトームの数より大きくなる。このようなデータを使ってDEG解析を行うと、より多くのコンティグの間でリードを割り当てなければならないので、コンティグあたりの平均カウントが減少する。また、統計的に有意な変動を示すコンティグが同定されたとしても、実際は複数のコンティグがリストに存在するので、解釈が難しくなる。

　Corsetは、コンティグに複数回マップされた（一連の読み取りごとに複数のアライメントが報告される）リードを対象に、発現パターンに基づいてコンティグを階層的にクラスター化する。

Github

https://github.com/Oshlack/Corset

wiki

https://github.com/Oshlack/Corset/wiki

ダウンロード

linuxなら以下からビルド済みバイナリファイルをダウンロードできる。

https://github.com/Oshlack/Corset/releases

ラン

1、テストデータのダウンロード。

prefetch SRR453569 & #ダウンロード1
prefetch SRR453570 & #ダウンロード2
prefetch SRR453571 #ダウンロード3

#sraからpiared-end.fastqに変換。
mkdir fastq_db #ここに収納
fastq-dump ~/ncbi/public/sra/SRR453569.sra -O fastq_db/ --split-files &
fastq-dump ~/ncbi/public/sra/SRR453570.sra -O fastq_db/ --split-files &
fastq-dump ~/ncbi/public/sra/SRR453571.sra -O fastq_db/ --split-files
 
seqkit stats fastq_db/*

$ seqkit stats fastq_db/*

file format type num_seqs sum_len min_len avg_len max_len

fastq_db/SRR453569_1.fastq FASTQ DNA 4,032,514 407,283,914 101 101 101

fastq_db/SRR453569_2.fastq FASTQ DNA 4,032,514 407,283,914 101 101 101

fastq_db/SRR453570_1.fastq FASTQ DNA 6,745,975 681,343,475 101 101 101

fastq_db/SRR453570_2.fastq FASTQ DNA 6,745,975 681,343,475 101 101 101

fastq_db/SRR453571_1.fastq FASTQ DNA 6,163,396 622,502,996 101 101 101

fastq_db/SRR453571_2.fastq FASTQ DNA 6,163,396 622,502,996 101 101 101

２、トリミング (trimmomaticはbrewで導入できます)。

trimmomatic PE -phred33 SRR453569_1.fastq SRR453569_2.fastq SRR453569_pair1.fastq U1.fastq SRR453569_pair2.fastq U2.fastq LEADING:20 TRAILING:20 MINLEN:50

trimmomatic PE -phred33 SRR453570_1.fastq SRR453570_2.fastq SRR453570_pair1.fastq U1.fastq SRR453570_pair2.fastq U2.fastq LEADING:20 TRAILING:20 MINLEN:50

trimmomatic PE -phred33 SRR453571_1.fastq SRR453571_2.fastq SRR453571_pair1.fastq U1.fastq SRR453571_pair2.fastq U2.fastq LEADING:20 TRAILING:20 MINLEN:50

３、fastqを１つにまとめ、Trinityでアセンブルする（プールが大きいほど様々なmRNAのアセンブルが期待できる）。今回は３データをまとめアセンブル。

cat *pair1.fastq | sed 's/ HWI/\/1 HWI/g' > all_1.fastq 
cat *pair2.fastq | sed 's/ HWI/\/2 HWI/g' > all_2.fastq

４、１つにまとめたfastqを使いtrinityでde novo アセンブル (Trinityもbrewで導入できます)。

Trinity --max_memory 30G --seqType fq --left all_1.fastq --right all_2.fastq

５、パターン１、salmonでリードをマッピング (salmonもbrewで導入可)

#index
cp trinity_out_dir/Trinity.fasta . #fastaをカレントに移動
salmon index --index Trinity --transcripts Trinity.fasta

#以下のシェルスクリプトsalomn_loop.shを実行する。
FILES=`ls SRR*_P1.fastq | sed 's/_P1.fastq//g'` 
for F in $FILES ; do 
  R1=${F}_P1.fastq 
  R2=${F}_P2.fastq 
  salmon quant --index Trinity --libType IU --dumpEq -1 $R1 -2 $R2 --output ${F}.out 
done

bash salmon_loop.sh #実行

salmonの出力ディレクトリのaux_info/eq_classes.txtを指定して、Corsetをラン。

corset -g 1,1,1,2,2,2 -n A1,A2,A3,B1,B2,B3 -i salmon_eq_classes SRR*/aux_info/eq_classes.txt

６、パターン２、bowiteでリードをマッピング

bowtieを使うなら、以下のように進める。

#index
cp cp trinity_out_dir/Trinity.fasta . #fastaをカレントに移動
bowtie-build Trinity.fasta Trinity

#以下のシェルスクリプトbowtie_loop.shを実行する。
FILES=`ls *_1.fastq | sed 's/_1.fastq//g'`
for F in $FILES ; do
　R1=${F}_1.fastq
　R2=${F}_2.fastq
　bowtie --all -S Trinity -p 20 -1 $R1 -2 $R2 > ${F}.sam #20スレッド
　samtools view -@ 10 -S -b ${F}.sam > ${F}.bam
done

bash bowtie_loop.sh

出力bamを指定してCorsetをラン。

corset -g 1,1,1,2,2,2 -n A1,A2,A3,B1,B2,B3 *.bam

修正途中のまま時間が経ってしまったのでまとめ直しました。

引用

Corset: enabling differential gene expression analysis for de novoassembled transcriptomes

Nadia M Davidson and Alicia Oshlack

Genome Biology201415:410

2018-01-14

並列化に対応したリファレンスベースのfastq圧縮ツール LW-FQZip2

fasta/fastqの操作 Compression / decompression 高速なツール

fastqの圧縮の方法論にはいくつか種類があるが、その内の１つリファレンスベースの圧縮ツールは、シーケンスデータをそのまま圧縮するのではなく、リファンレスとの位置合わせ結果を記録する方法論である。そのために、リファレンスにリードをアライメントして、そこでマッチした数やミスマッチや計測する作業を行う必要がある。この方式のメリットは、参照に基づく方法は参照を使わない方法に比べて優れた圧縮率を得ることにある。LW-FQZip1はロスレスのリファレンスベースの圧縮ツールとして発表された。LW-FQZip2は並列化に対応することで LW-FQZip1より高速化している。

インストール

cent OSに導入した。

本体

Github

https://github.com/Zhuzxlab/LW-FQZip2

git clone https://github.com/Zhuzxlab/LW-FQZip2.git 
cd LW-FQZip2 
make clean 
make

> ./LWFQZip2 -h

]$ ./LWFQZip2 -h

LW-FQZip 2 -- Reference-based compression of long-read FASTQ files

Usage: LWFQZip2 <mode>...[options] ...

Mode:

-c --compression

-d --decompression

Compression/Decompression Options:

-i, --input FASTQ file or compressed file.

-r, --input Reference file.

-h, --help print this message

-v, --version display program version

-m, --maximal read length,range from 30000 to 300000 (Default: 300000)

-s, --Calculate the counts of the prefixes.

-g, --best compression model but slowest

for example: LWFQZip2 -c -i input -r reference -g.

LWFQZip2 -d -i input.lz -r reference -g.

-a, --assemble-based model, An optional amount (Default: 0.3 percent of the original file size) of reads, which contains the predefined prefix (Default: 'CG', could be combined to be an artificial reference. At the end of the package, this artificial reference is included.

for example: LWFQZip2 -c -i input -a 0.003(Default: '-a 0.003').

LWFQZip2 -d -i input.lz -a.

Mapping Options Options:

-b, --the number of mapping thread(Default: 10, mininum: 6 )

-p, --specify the kmer prefixes, e.g.,'CG', 'AT', and 'TAG' (Default: '-p CG'). 'AA' is not recommended as a prefix.

-k, --length of a kmer used in locate local alignment. (Default: '-k 8')

-e, --the tolerance rate of mismatches.(Default: '-e 0.05')

-L, --the mini length of a legal alignment.(Default: '-l 12')

-o, --open the complementart palindrome mode.(Default: '-o 1' means open the complementart palindrome mode.)

同じディレクトリのzpaq、lpaq9mが動作しないとだめなので、そちらも確認してからパスを通しておく。

ラン

圧縮。

LWFQZip2 -c -i SRR1063349.fastq -r NC_017634.1.fasta -b 6

-i　 --input FASTQ file or compressed file.
-r　 --input Reference file.
-b　--the number of mapping thread(Default: 10, mininum: 6 )

はじめにリファレンス配列にリードはアライメントされ、それから圧縮が行われる。ファイルサイズはうまくいけば元の10%以下になる。

解凍。

LWFQZip2 -d -i SRR1063349.fastq.lz -r NC_017634.1.fasta

低頻度のk-merの頻度を確認した限り完全なロスレス圧縮だが、ファイルサイズは圧縮・解凍後、わずかに変動している。

引用

LW-FQZip 2: a parallelized reference-based compression of FASTQ files.

Huang ZA, Wen Z, Deng Q, Chu Y, Sun Y, Zhu Z.

BMC Bioinformatics. 2017 Mar 20;18(1):179.

Light-weight reference-based compression of FASTQ data.

Zhang Y, Li L, Yang Y, Yang X, He S, Zhu Z.

BMC Bioinformatics. 2015 Jun 9;16:188.

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

de novo transcriptome 解析のツールまとめ（主にde novo assemblerについて）

de novo transcriptomeのアセンブル評価ツール RSEM-EVAL（DETONATEパッケージ）

RNAのリファレンスガイドアセンブリを行いDe novo RNA seqの精度を上げる BRANCH

ヒトゲノムのmulti-deletion、duplication、inversion、deletionなどのSVsを検出するSVelter

複数の似たリファレンスが利用できるデータのアライメント作業を高速化するCompMap

de novo assemblyで得たRNAのコンティグをクラスター化して、非モデル生物のDEG解析を可能にする Corset

並列化に対応したリファレンスベースのfastq圧縮ツール LW-FQZip2