2019-02-05

fusion geneを検出する FuSeq

　染色体再編成の一種である遺伝子融合は、発ガンにおいて重要な役割を果たすことがわかっている[ref.1、2]。それはキメラタンパク質の増加、ガンの危険性および腫瘍の表現型と密接に関連しており、それらはすべて臨床解釈の可能性を秘めている[ref.2]。融合遺伝子は、乳がん[ref.3、4]、肺がん[ref.5]、黒色腫[ref.6]、神経膠腫[ref.7]など、さまざまな種類のガンで報告されている。小児期の急性リンパ芽球性白血病の約20〜25％に、融合遺伝子ETV6-RUNX1が最近発見された[ref.8]。ガンにおける遺伝子融合のさらなる考察は最近の総説に見られる[ref.2]。

　RNAシークエンシング（RNA-seq）技術の出現は、我々が効率的に新規な融合遺伝子を発見することを可能にする。 RNA-seqデータを用いて融合転写物を検出するための多くのツールが開発されており、それらの比較は最近のいくつかの論文で利用可能である[ref.9、10]。これらの方法は様々なアプローチを使用するが、一般的に３つの主なステップを含む：（I）リードアラインメント、（ii）融合候補検出、および（iii）偽陽性排除。リードアラインメントは、通常TopHat-Fusion [ref.11]、SnowShoe-FTD [ref.12]、EricScript [ref.13]、JAFFA [ref.14]などのRNA-seqの標準的なリードアラインメント方法、あるいはFusionMap [ref.15]のような独自の方法によって行われる。融合候補を決定するために、ほとんどの方法は、スパンリードペアおよび／またはスプリットリードなどの不一致を使用する。（一部略）

　現在の融合遺伝子検出方法のほとんどは、かなりの計算上の要求を必要とする。 15最近報告された15のの融合検出法、すなわちBreakFusion [ref.16]、Chimerascan [ref.17]、defuse [ref.18]、EricScript [ref.13]、FusionCatcher [ref.3、19]、FusionHunter [ref.20]、FusionMap [ref.15]、 FusionQ [21]、JAFFA [14]、MapSplice [22]、PRADA [23]、ShortFuse [24]、SnowShoes-FTD [12]、SOAPfuse [25]、TopHat-Fusion [11]があるが、118Mの100bpペアエンドリードの単一の前立腺ガンサンプルを分析するため7hから240hが必要である。これは、多数のサンプルを有するデータセットにおいて日常的な融合遺伝子検出を実行することを非実用的にする。これに対処するために、著者らはFuSeqを開発した。これは、従来のアラインメント法よりもかなり速いアラインメントのquasi-mapping method を利用する新しい融合検出法である。 FuSeqは、リードペア（MR）とジャンクションスプリットリード（SR）に基づく2つの別々のパイプラインで構成されており、最終ステップで組み合わされる。 MRパイプラインでは、FuSeqは融合遺伝子候補を生成するために新しい概念fusion-equivalence classを導入する。 SRパイプラインでは、融合遺伝子候補は、２つの異なる遺伝子が同じリードを共有し、各遺伝子が少なくともk-merのマッピング塩基を有するスプリットリードから収集される。さらに、主に偽陽性の低減のため、様々なフィルターおよび統計的検定が融合遺伝子候補に適用される。 FuSeqを4つの検証済みデータセットに適用し、それが感度、特異性およびdiscovery operating characteristicsにおいて一般的に使用される方法Tophat-fusion、SOAPfuseおよびJAFFAより優れていることを示す。またFuSeqは計算時間が桁違いに速い。

インストール

ubuntu16.04でバイナリをダウンロードしてテストした（docker使用; ホストOS mac os 10.12）。

依存

FuSeq is implemented in R and C++. We acknowledge for materials from Sailfish, Rapmap and other tools used in this software.

A C++-11 compliant compiler version of GCC (g++ >= 4.8.2)
R packages version 3.3.0 or latter with following installed packages: GenomicFeatures
FuSeq requires information of flags from Sailfish including DFETCH_BOOST, DBOOST_ROOT, DTBB_INSTALL_DIR and DCMAKE_INSTALL_PREFIX. Please refer to the Sailfish website for more details of these flags.

本体　Github

#バイナリをダウンロードする
wget https://github.com/nghiavtr/FuSeq/releases/download/v1.1.0/FuSeq_v1.1.0_linux_x86-64.tar.gz -O FuSeq_v1.1.0_linux_x86-64.tar.gz
tar -xzvf FuSeq_v1.1.0_linux_x86-64.tar.gz
cd FuSeq_v1.1.0_linux_x86-64
bash configure.sh

#Add paths of lib folder and bin folder to LD_LIBRARY_PATH and PATH
export LD_LIBRARY_PATH=/path/to/FuSeq_v1.1.0_linux_x86-64/linux/lib:$LD_LIBRARY_PATH
export PATH=/path/to/FuSeq_v1.1.0_linux_x86-64/linux/bin:$PATH

>FuSeq -h

$ FuSeq -h

Fusion equivalence class detection

==========

Perform quasi-mapping from RNA-seq reads

to detect fusion equivalence class

sailfish quant options:

basic options:

-v [ --version ] print version string

-h [ --help ] produce help message

-i [ --index ] arg Sailfish index

-l [ --libType ] arg Format string describing the library type

-r [ --unmatedReads ] arg List of files containing unmated reads of

(e.g. single-end reads)

-1 [ --mates1 ] arg File containing the #1 mates

-2 [ --mates2 ] arg File containing the #2 mates

-p [ --threads ] arg (=18) The number of threads to use

concurrently.

-o [ --output ] arg Output quantification file.

-g [ --geneMap ] arg File containing a mapping of transcripts

to genes. If this file is provided

Sailfish will output both quant.sf and

quant.genes.sf files, where the latter

contains aggregated gene-level abundance

estimates. The transcript to gene

mapping should be provided as either a

GTF file, or a in a simple tab-delimited

format where each line contains the name

of a transcript and the gene to which it

belongs separated by a tab. The

extension of the file is used to

determine how the file should be parsed.

Files ending in '.gtf' or '.gff' are

assumed to be in GTF format; files with

any other extension are assumed to be in

the simple format.

--biasCorrect Perform sequence-specific bias correction

--gcBiasCorrect [experimental] Perform fragment GC bias

correction

advanced options:

--auxDir arg (=aux) The sub-directory of the quantification

directory where auxiliary information

e.g. bootstraps, bias parameters, etc.

will be written.

--dumpEq Dump the equivalence class counts that

were computed during quasi-mapping

--gcSizeSamp arg (=1) The value by which to down-sample

transcripts when representing the GC

content. Larger values will reduce

memory usage, but may decrease the

fidelity of bias modeling results.

--gcSpeedSamp arg (=1) The value at which the fragment length

PMF is down-sampled when evaluating GC

fragment bias. Larger values speed up

effective length correction, but may

decrease the fidelity of bias modeling

results.

--strictIntersect Modifies how orphans are assigned. When

this flag is set, if the intersection of

the quasi-mappings for the left and right

is empty, then all mappings for the left

and all mappings for the right read are

reported as orphaned quasi-mappings

--simplifiedLengthCorrection Use a "simplfied" effective length

correction approach, rather than

convolving the FLD with the

characteristic function over each

transcript.

--maxFragLen arg (=1000) The maximum length of a fragment to

consider when building the empirical

fragment length distribution

--txpAggregationKey arg (=gene_id) When generating the gene-level estimates,

use the provided key for aggregating

transcripts. The default is the

"gene_id" field, but other fields (e.g.

"gene_name") might be useful depending on

the specifics of the annotation being

used. Note: this option only affects

aggregation when using a GTF annotation;

not an annotation in "simple" format.

--ignoreLibCompat Disables strand-aware processing

completely. All hits are considered

"valid".

--enforceLibCompat Enforces "strict" library compatibility.

Fragments that map in a manner other than

what is specified by the expected library

type will be discarded, even if there are

no mappings that agree with the expected

library type.

--allowDovetail Allow paired-end reads from the same

fragment to "dovetail", such that the

ends of the mapped reads can extend past

each other.

--discardOrphans This option will discard orphaned

fragments. This only has an effect on

paired-end input, but enabling this

option will discard, rather than count,

any reads where only one of the paired

fragments maps to a transcript.

--noBiasLengthThreshold [experimental] : If this option is

enabled, then bias correction will be

allowed to estimate effective lengths

shorter than the approximate mean

fragment length

--numBiasSamples arg (=1000000) Number of fragment mappings to use when

learning the sequence-specific bias

model.

--numFragSamples arg (=10000) Number of fragments from unique

alignments to sample when building the

fragment length distribution

--fldMean arg (=200) If single end reads are being used for

quantification, or there are an

insufficient number of uniquely mapping

reads when performing paired-end

quantification to estimate the empirical

fragment length distribution, then use

this value to calculate effective

lengths.

--fldSD arg (=80) The standard deviation used in the

fragment length distribution for

single-end quantification or when an

empirical distribution cannot be learned.

-w [ --maxReadOcc ] arg (=200) Reads "mapping" to more than this many

places won't be considered.

--noEffectiveLengthCorrection Disables effective length correction when

computing the probability that a fragment

was generated from a transcript. If this

flag is passed in, the fragment length

distribution is not taken into account

when computing this probability.

--useVBOpt Use the Variational Bayesian EM rather

than the traditional EM algorithm to

estimate transcript abundances.

--numGibbsSamples arg (=0) [*super*-experimental]: Number of Gibbs

sampling rounds to perform.

--numBootstraps arg (=0) [*super*-experimental]: Number of

bootstrap samples to generate. Note: This

is mutually exclusive with Gibbs

sampling.

> GenTC -h

$ GenTC -h

Equivalence class detection

==========

Perform quasi-mapping from RNA-seq reads

to detect equivalence class

sailfish quant options:

basic options:

-v [ --version ] print version string

-h [ --help ] produce help message

-i [ --index ] arg Sailfish index

-l [ --libType ] arg Format string describing the library type

-r [ --unmatedReads ] arg List of files containing unmated reads of

(e.g. single-end reads)

-1 [ --mates1 ] arg File containing the #1 mates

-2 [ --mates2 ] arg File containing the #2 mates

-p [ --threads ] arg (=18) The number of threads to use

concurrently.

-o [ --output ] arg Output quantification file.

-g [ --geneMap ] arg File containing a mapping of transcripts

to genes. If this file is provided

Sailfish will output both quant.sf and

quant.genes.sf files, where the latter

contains aggregated gene-level abundance

estimates. The transcript to gene

mapping should be provided as either a

GTF file, or a in a simple tab-delimited

format where each line contains the name

of a transcript and the gene to which it

belongs separated by a tab. The

extension of the file is used to

determine how the file should be parsed.

Files ending in '.gtf' or '.gff' are

assumed to be in GTF format; files with

any other extension are assumed to be in

the simple format.

--biasCorrect Perform sequence-specific bias correction

--gcBiasCorrect [experimental] Perform fragment GC bias

correction

advanced options:

--auxDir arg (=aux) The sub-directory of the quantification

directory where auxiliary information

e.g. bootstraps, bias parameters, etc.

will be written.

--dumpEq Dump the equivalence class counts that

were computed during quasi-mapping

--gcSizeSamp arg (=1) The value by which to down-sample

transcripts when representing the GC

content. Larger values will reduce

memory usage, but may decrease the

fidelity of bias modeling results.

--gcSpeedSamp arg (=1) The value at which the fragment length

PMF is down-sampled when evaluating GC

fragment bias. Larger values speed up

effective length correction, but may

decrease the fidelity of bias modeling

results.

--strictIntersect Modifies how orphans are assigned. When

this flag is set, if the intersection of

the quasi-mappings for the left and right

is empty, then all mappings for the left

and all mappings for the right read are

reported as orphaned quasi-mappings

--simplifiedLengthCorrection Use a "simplfied" effective length

correction approach, rather than

convolving the FLD with the

characteristic function over each

transcript.

--maxFragLen arg (=1000) The maximum length of a fragment to

consider when building the empirical

fragment length distribution

--txpAggregationKey arg (=gene_id) When generating the gene-level estimates,

use the provided key for aggregating

transcripts. The default is the

"gene_id" field, but other fields (e.g.

"gene_name") might be useful depending on

the specifics of the annotation being

used. Note: this option only affects

aggregation when using a GTF annotation;

not an annotation in "simple" format.

--ignoreLibCompat Disables strand-aware processing

completely. All hits are considered

"valid".

--enforceLibCompat Enforces "strict" library compatibility.

Fragments that map in a manner other than

what is specified by the expected library

type will be discarded, even if there are

no mappings that agree with the expected

library type.

--allowDovetail Allow paired-end reads from the same

fragment to "dovetail", such that the

ends of the mapped reads can extend past

each other.

--discardOrphans This option will discard orphaned

fragments. This only has an effect on

paired-end input, but enabling this

option will discard, rather than count,

any reads where only one of the paired

fragments maps to a transcript.

--noBiasLengthThreshold [experimental] : If this option is

enabled, then bias correction will be

allowed to estimate effective lengths

shorter than the approximate mean

fragment length

--numBiasSamples arg (=1000000) Number of fragment mappings to use when

learning the sequence-specific bias

model.

--numFragSamples arg (=10000) Number of fragments from unique

alignments to sample when building the

fragment length distribution

--fldMean arg (=200) If single end reads are being used for

quantification, or there are an

insufficient number of uniquely mapping

reads when performing paired-end

quantification to estimate the empirical

fragment length distribution, then use

this value to calculate effective

lengths.

--fldSD arg (=80) The standard deviation used in the

fragment length distribution for

single-end quantification or when an

empirical distribution cannot be learned.

-w [ --maxReadOcc ] arg (=200) Reads "mapping" to more than this many

places won't be considered.

--noEffectiveLengthCorrection Disables effective length correction when

computing the probability that a fragment

was generated from a transcript. If this

flag is passed in, the fragment length

distribution is not taken into account

when computing this probability.

--useVBOpt Use the Variational Bayesian EM rather

than the traditional EM algorithm to

estimate transcript abundances.

--numGibbsSamples arg (=0) [*super*-experimental]: Number of Gibbs

sampling rounds to perform.

--numBootstraps arg (=0) [*super*-experimental]: Number of

bootstrap samples to generate. Note: This

is mutually exclusive with Gibbs

sampling.

> TxIndexer -h

$ TxIndexer

---------------------------------------------------------------

--- FuSeq acknowledges the Sailfish and Rapmap for this indexer ---

---------------------------------------------------------------

Exception: [the option '--out' is required but missing]. Exiting.

kazu@9a49f407e1fb:/data$ TxIndexer -h

---------------------------------------------------------------

--- FuSeq acknowledges the Sailfish and Rapmap for this indexer ---

---------------------------------------------------------------

indexing

==========

Command Line Options:

-v [ --version ] print version string

-h [ --help ] produce help message

-t [ --transcripts ] arg Transcript fasta file(s).

-k [ --kmerSize ] arg (=31) Kmer size.

-o [ --out ] arg Output stem [all files needed by Sailfish will be

of the form stem.*].

--perfectHash [quasi index only] Build the index using a

perfect hash rather than a dense hash. This will

require less memory (especially during

quantification), but will take longer to

construct

-p [ --threads ] arg (=18) The number of threads to use concurrently.

-f [ --force ]

$ TxIndexer

---------------------------------------------------------------

--- FuSeq acknowledges the Sailfish and Rapmap for this indexer ---

---------------------------------------------------------------

Exception: [the option '--out' is required but missing]. Exiting.

kazu@9a49f407e1fb:/data$ TxIndexer -h

---------------------------------------------------------------

--- FuSeq acknowledges the Sailfish and Rapmap for this indexer ---

---------------------------------------------------------------

indexing

==========

Command Line Options:

-v [ --version ] print version string

-h [ --help ] produce help message

-t [ --transcripts ] arg Transcript fasta file(s).

-k [ --kmerSize ] arg (=31) Kmer size.

-o [ --out ] arg Output stem [all files needed by Sailfish will be

of the form stem.*].

--perfectHash [quasi index only] Build the index using a

perfect hash rather than a dense hash. This will

require less memory (especially during

quantification), but will take longer to

construct

-p [ --threads ] arg (=18) The number of threads to use concurrently.

-f [ --force ]

実行方法

Paired-end RNA seqシーケンシングデータを用意する。また、リファレンスとfasta、gtfに加え、RDataというファイルも必要になる。ここではGithubにリンクが張られているGRCh37のcDNA fasta、gtf、RDataを使用する。

以下の３ファイルを用意した。

f:id:kazumaxneo:20190205101627j:plain

１、indexing

TxIndexer -k 31 -t Homo_sapiens.GRCh37.75.cdna.all.fa -o TxIndexer_idx

-k (31) Kmer size
-t Transcript fasta file(s)
-o Output stem [all files needed by Sailfish will be of the form stem.*].

split-read機能を動かすためk-merサイズはリード長の半分以下に指定する。

出力/

f:id:kazumaxneo:20190205110243j:plain

２、Extract fusion equivalence classes and split reads

FuSeq -i TxIndexer_idx -l IU -1 pair1.fasta -2 pair2.fasta -p 8 -g Homo_sapiens.GRCh37.75.gtf -o feqDir

-i Sailfish index
-l Format string describing the library type
-1 File containing the #1 mates
-2 File containing the #2 mates
-p The number of threads to use concurrently
-o Output quantification file
-g File containing a mapping of transcripts to genes

かなりたくさんのファイルが出力される。

３、gtfからsqliteデータベースを用意する。

Rscript FuSeq_home/R/createSqlite.R Homo_sapiens.GRCh37.75.gtf Homo_sapiens.GRCh37.75.sqlite

４、fusion gene予測

Rscript FuSeq_home/R/FuSeq.R in=feqDir \
 txfasta=/path/to/transcripts.fa \
 sqlite=Homo_sapiens.GRCh37.75.sqlite \
 txanno=txanno_file out=FuSeq_outDir \
 params=params_file

引用

A fast detection of fusion genes from paired-end RNA-seq data

Trung Nghia Vu, Wenjiang Deng, Quang Thinh Trac, Stefano Calza, Woochang Hwang, Yudi Pawitan
BMC Genomics 2018 19:786