染色体再編成の一種である遺伝子融合は、発ガンにおいて重要な役割を果たすことがわかっている[ref.1、2]。それはキメラタンパク質の増加、ガンの危険性および腫瘍の表現型と密接に関連しており、それらはすべて臨床解釈の可能性を秘めている[ref.2]。融合遺伝子は、乳がん[ref.3、4]、肺がん[ref.5]、黒色腫[ref.6]、神経膠腫[ref.7]など、さまざまな種類のガンで報告されている。小児期の急性リンパ芽球性白血病の約20〜25%に、融合遺伝子ETV6-RUNX1が最近発見された[ref.8]。ガンにおける遺伝子融合のさらなる考察は最近の総説に見られる[ref.2]。
RNAシークエンシング(RNA-seq)技術の出現は、我々が効率的に新規な融合遺伝子を発見することを可能にする。 RNA-seqデータを用いて融合転写物を検出するための多くのツールが開発されており、それらの比較は最近のいくつかの論文で利用可能である[ref.9、10]。これらの方法は様々なアプローチを使用するが、一般的に3つの主なステップを含む:(I)リードアラインメント、(ii)融合候補検出、および(iii)偽陽性排除。リードアラインメントは、通常TopHat-Fusion [ref.11]、SnowShoe-FTD [ref.12]、EricScript [ref.13]、JAFFA [ref.14]などのRNA-seqの標準的なリードアラインメント方法、あるいはFusionMap [ref.15]のような独自の方法によって行われる。融合候補を決定するために、ほとんどの方法は、スパンリードペアおよび/またはスプリットリードなどの不一致を使用する。(一部略)
現在の融合遺伝子検出方法のほとんどは、かなりの計算上の要求を必要とする。 15最近報告された15のの融合検出法、すなわちBreakFusion [ref.16]、Chimerascan [ref.17]、defuse [ref.18]、EricScript [ref.13]、FusionCatcher [ref.3、19]、FusionHunter [ref.20]、FusionMap [ref.15]、 FusionQ [21]、JAFFA [14]、MapSplice [22]、PRADA [23]、ShortFuse [24]、SnowShoes-FTD [12]、SOAPfuse [25]、TopHat-Fusion [11]があるが、118Mの100bpペアエンドリードの単一の前立腺ガンサンプルを分析するため7hから240hが必要である。これは、多数のサンプルを有するデータセットにおいて日常的な融合遺伝子検出を実行することを非実用的にする。これに対処するために、著者らはFuSeqを開発した。これは、従来のアラインメント法よりもかなり速いアラインメントのquasi-mapping method を利用する新しい融合検出法である。 FuSeqは、リードペア(MR)とジャンクションスプリットリード(SR)に基づく2つの別々のパイプラインで構成されており、最終ステップで組み合わされる。 MRパイプラインでは、FuSeqは融合遺伝子候補を生成するために新しい概念fusion-equivalence classを導入する。 SRパイプラインでは、融合遺伝子候補は、2つの異なる遺伝子が同じリードを共有し、各遺伝子が少なくともk-merのマッピング塩基を有するスプリットリードから収集される。さらに、主に偽陽性の低減のため、様々なフィルターおよび統計的検定が融合遺伝子候補に適用される。 FuSeqを4つの検証済みデータセットに適用し、それが感度、特異性およびdiscovery operating characteristicsにおいて一般的に使用される方法Tophat-fusion、SOAPfuseおよびJAFFAより優れていることを示す。またFuSeqは計算時間が桁違いに速い。
インストール
ubuntu16.04でバイナリをダウンロードしてテストした(docker使用; ホストOS mac os 10.12)。
依存
FuSeq is implemented in R and C++. We acknowledge for materials from Sailfish, Rapmap and other tools used in this software.
- A C++-11 compliant compiler version of GCC (g++ >= 4.8.2)
- R packages version 3.3.0 or latter with following installed packages: GenomicFeatures
- FuSeq requires information of flags from Sailfish including DFETCH_BOOST, DBOOST_ROOT, DTBB_INSTALL_DIR and DCMAKE_INSTALL_PREFIX. Please refer to the Sailfish website for more details of these flags.
本体 Github
#バイナリをダウンロードする
wget https://github.com/nghiavtr/FuSeq/releases/download/v1.1.0/FuSeq_v1.1.0_linux_x86-64.tar.gz -O FuSeq_v1.1.0_linux_x86-64.tar.gz
tar -xzvf FuSeq_v1.1.0_linux_x86-64.tar.gz
cd FuSeq_v1.1.0_linux_x86-64
bash configure.sh
#Add paths of lib folder and bin folder to LD_LIBRARY_PATH and PATH
export LD_LIBRARY_PATH=/path/to/FuSeq_v1.1.0_linux_x86-64/linux/lib:$LD_LIBRARY_PATH
export PATH=/path/to/FuSeq_v1.1.0_linux_x86-64/linux/bin:$PATH
>FuSeq -h
$ FuSeq -h
Fusion equivalence class detection
==========
Perform quasi-mapping from RNA-seq reads
to detect fusion equivalence class
sailfish quant options:
basic options:
-v [ --version ] print version string
-h [ --help ] produce help message
-i [ --index ] arg Sailfish index
-l [ --libType ] arg Format string describing the library type
-r [ --unmatedReads ] arg List of files containing unmated reads of
(e.g. single-end reads)
-1 [ --mates1 ] arg File containing the #1 mates
-2 [ --mates2 ] arg File containing the #2 mates
-p [ --threads ] arg (=18) The number of threads to use
concurrently.
-o [ --output ] arg Output quantification file.
-g [ --geneMap ] arg File containing a mapping of transcripts
to genes. If this file is provided
Sailfish will output both quant.sf and
quant.genes.sf files, where the latter
contains aggregated gene-level abundance
estimates. The transcript to gene
mapping should be provided as either a
GTF file, or a in a simple tab-delimited
format where each line contains the name
of a transcript and the gene to which it
belongs separated by a tab. The
extension of the file is used to
determine how the file should be parsed.
Files ending in '.gtf' or '.gff' are
assumed to be in GTF format; files with
any other extension are assumed to be in
the simple format.
--biasCorrect Perform sequence-specific bias correction
--gcBiasCorrect [experimental] Perform fragment GC bias
correction
advanced options:
--auxDir arg (=aux) The sub-directory of the quantification
directory where auxiliary information
e.g. bootstraps, bias parameters, etc.
will be written.
--dumpEq Dump the equivalence class counts that
were computed during quasi-mapping
--gcSizeSamp arg (=1) The value by which to down-sample
transcripts when representing the GC
content. Larger values will reduce
memory usage, but may decrease the
fidelity of bias modeling results.
--gcSpeedSamp arg (=1) The value at which the fragment length
PMF is down-sampled when evaluating GC
fragment bias. Larger values speed up
effective length correction, but may
decrease the fidelity of bias modeling
results.
--strictIntersect Modifies how orphans are assigned. When
this flag is set, if the intersection of
the quasi-mappings for the left and right
is empty, then all mappings for the left
and all mappings for the right read are
reported as orphaned quasi-mappings
--simplifiedLengthCorrection Use a "simplfied" effective length
correction approach, rather than
convolving the FLD with the
characteristic function over each
transcript.
--maxFragLen arg (=1000) The maximum length of a fragment to
consider when building the empirical
fragment length distribution
--txpAggregationKey arg (=gene_id) When generating the gene-level estimates,
use the provided key for aggregating
transcripts. The default is the
"gene_id" field, but other fields (e.g.
"gene_name") might be useful depending on
the specifics of the annotation being
used. Note: this option only affects
aggregation when using a GTF annotation;
not an annotation in "simple" format.
--ignoreLibCompat Disables strand-aware processing
completely. All hits are considered
"valid".
--enforceLibCompat Enforces "strict" library compatibility.
Fragments that map in a manner other than
what is specified by the expected library
type will be discarded, even if there are
no mappings that agree with the expected
library type.
--allowDovetail Allow paired-end reads from the same
fragment to "dovetail", such that the
ends of the mapped reads can extend past
each other.
--discardOrphans This option will discard orphaned
fragments. This only has an effect on
paired-end input, but enabling this
option will discard, rather than count,
any reads where only one of the paired
fragments maps to a transcript.
--noBiasLengthThreshold [experimental] : If this option is
enabled, then bias correction will be
allowed to estimate effective lengths
shorter than the approximate mean
fragment length
--numBiasSamples arg (=1000000) Number of fragment mappings to use when
learning the sequence-specific bias
model.
--numFragSamples arg (=10000) Number of fragments from unique
alignments to sample when building the
fragment length distribution
--fldMean arg (=200) If single end reads are being used for
quantification, or there are an
insufficient number of uniquely mapping
reads when performing paired-end
quantification to estimate the empirical
fragment length distribution, then use
this value to calculate effective
lengths.
--fldSD arg (=80) The standard deviation used in the
fragment length distribution for
single-end quantification or when an
empirical distribution cannot be learned.
-w [ --maxReadOcc ] arg (=200) Reads "mapping" to more than this many
places won't be considered.
--noEffectiveLengthCorrection Disables effective length correction when
computing the probability that a fragment
was generated from a transcript. If this
flag is passed in, the fragment length
distribution is not taken into account
when computing this probability.
--useVBOpt Use the Variational Bayesian EM rather
than the traditional EM algorithm to
estimate transcript abundances.
--numGibbsSamples arg (=0) [*super*-experimental]: Number of Gibbs
sampling rounds to perform.
--numBootstraps arg (=0) [*super*-experimental]: Number of
bootstrap samples to generate. Note: This
is mutually exclusive with Gibbs
sampling.
> GenTC -h
$ GenTC -h
Equivalence class detection
==========
Perform quasi-mapping from RNA-seq reads
to detect equivalence class
sailfish quant options:
basic options:
-v [ --version ] print version string
-h [ --help ] produce help message
-i [ --index ] arg Sailfish index
-l [ --libType ] arg Format string describing the library type
-r [ --unmatedReads ] arg List of files containing unmated reads of
(e.g. single-end reads)
-1 [ --mates1 ] arg File containing the #1 mates
-2 [ --mates2 ] arg File containing the #2 mates
-p [ --threads ] arg (=18) The number of threads to use
concurrently.
-o [ --output ] arg Output quantification file.
-g [ --geneMap ] arg File containing a mapping of transcripts
to genes. If this file is provided
Sailfish will output both quant.sf and
quant.genes.sf files, where the latter
contains aggregated gene-level abundance
estimates. The transcript to gene
mapping should be provided as either a
GTF file, or a in a simple tab-delimited
format where each line contains the name
of a transcript and the gene to which it
belongs separated by a tab. The
extension of the file is used to
determine how the file should be parsed.
Files ending in '.gtf' or '.gff' are
assumed to be in GTF format; files with
any other extension are assumed to be in
the simple format.
--biasCorrect Perform sequence-specific bias correction
--gcBiasCorrect [experimental] Perform fragment GC bias
correction
advanced options:
--auxDir arg (=aux) The sub-directory of the quantification
directory where auxiliary information
e.g. bootstraps, bias parameters, etc.
will be written.
--dumpEq Dump the equivalence class counts that
were computed during quasi-mapping
--gcSizeSamp arg (=1) The value by which to down-sample
transcripts when representing the GC
content. Larger values will reduce
memory usage, but may decrease the
fidelity of bias modeling results.
--gcSpeedSamp arg (=1) The value at which the fragment length
PMF is down-sampled when evaluating GC
fragment bias. Larger values speed up
effective length correction, but may
decrease the fidelity of bias modeling
results.
--strictIntersect Modifies how orphans are assigned. When
this flag is set, if the intersection of
the quasi-mappings for the left and right
is empty, then all mappings for the left
and all mappings for the right read are
reported as orphaned quasi-mappings
--simplifiedLengthCorrection Use a "simplfied" effective length
correction approach, rather than
convolving the FLD with the
characteristic function over each
transcript.
--maxFragLen arg (=1000) The maximum length of a fragment to
consider when building the empirical
fragment length distribution
--txpAggregationKey arg (=gene_id) When generating the gene-level estimates,
use the provided key for aggregating
transcripts. The default is the
"gene_id" field, but other fields (e.g.
"gene_name") might be useful depending on
the specifics of the annotation being
used. Note: this option only affects
aggregation when using a GTF annotation;
not an annotation in "simple" format.
--ignoreLibCompat Disables strand-aware processing
completely. All hits are considered
"valid".
--enforceLibCompat Enforces "strict" library compatibility.
Fragments that map in a manner other than
what is specified by the expected library
type will be discarded, even if there are
no mappings that agree with the expected
library type.
--allowDovetail Allow paired-end reads from the same
fragment to "dovetail", such that the
ends of the mapped reads can extend past
each other.
--discardOrphans This option will discard orphaned
fragments. This only has an effect on
paired-end input, but enabling this
option will discard, rather than count,
any reads where only one of the paired
fragments maps to a transcript.
--noBiasLengthThreshold [experimental] : If this option is
enabled, then bias correction will be
allowed to estimate effective lengths
shorter than the approximate mean
fragment length
--numBiasSamples arg (=1000000) Number of fragment mappings to use when
learning the sequence-specific bias
model.
--numFragSamples arg (=10000) Number of fragments from unique
alignments to sample when building the
fragment length distribution
--fldMean arg (=200) If single end reads are being used for
quantification, or there are an
insufficient number of uniquely mapping
reads when performing paired-end
quantification to estimate the empirical
fragment length distribution, then use
this value to calculate effective
lengths.
--fldSD arg (=80) The standard deviation used in the
fragment length distribution for
single-end quantification or when an
empirical distribution cannot be learned.
-w [ --maxReadOcc ] arg (=200) Reads "mapping" to more than this many
places won't be considered.
--noEffectiveLengthCorrection Disables effective length correction when
computing the probability that a fragment
was generated from a transcript. If this
flag is passed in, the fragment length
distribution is not taken into account
when computing this probability.
--useVBOpt Use the Variational Bayesian EM rather
than the traditional EM algorithm to
estimate transcript abundances.
--numGibbsSamples arg (=0) [*super*-experimental]: Number of Gibbs
sampling rounds to perform.
--numBootstraps arg (=0) [*super*-experimental]: Number of
bootstrap samples to generate. Note: This
is mutually exclusive with Gibbs
sampling.
> TxIndexer -h
$
$ TxIndexer
---------------------------------------------------------------
--- FuSeq acknowledges the Sailfish and Rapmap for this indexer ---
---------------------------------------------------------------
Exception: [the option '--out' is required but missing]. Exiting.
kazu@9a49f407e1fb:/data$ TxIndexer -h
---------------------------------------------------------------
--- FuSeq acknowledges the Sailfish and Rapmap for this indexer ---
---------------------------------------------------------------
indexing
==========
Command Line Options:
-v [ --version ] print version string
-h [ --help ] produce help message
-t [ --transcripts ] arg Transcript fasta file(s).
-k [ --kmerSize ] arg (=31) Kmer size.
-o [ --out ] arg Output stem [all files needed by Sailfish will be
of the form stem.*].
--perfectHash [quasi index only] Build the index using a
perfect hash rather than a dense hash. This will
require less memory (especially during
quantification), but will take longer to
construct
-p [ --threads ] arg (=18) The number of threads to use concurrently.
-f [ --force ]
$ TxIndexer
---------------------------------------------------------------
--- FuSeq acknowledges the Sailfish and Rapmap for this indexer ---
---------------------------------------------------------------
Exception: [the option '--out' is required but missing]. Exiting.
kazu@9a49f407e1fb:/data$ TxIndexer -h
---------------------------------------------------------------
--- FuSeq acknowledges the Sailfish and Rapmap for this indexer ---
---------------------------------------------------------------
indexing
==========
Command Line Options:
-v [ --version ] print version string
-h [ --help ] produce help message
-t [ --transcripts ] arg Transcript fasta file(s).
-k [ --kmerSize ] arg (=31) Kmer size.
-o [ --out ] arg Output stem [all files needed by Sailfish will be
of the form stem.*].
--perfectHash [quasi index only] Build the index using a
perfect hash rather than a dense hash. This will
require less memory (especially during
quantification), but will take longer to
construct
-p [ --threads ] arg (=18) The number of threads to use concurrently.
-f [ --force ]
実行方法
Paired-end RNA seqシーケンシングデータを用意する。また、リファレンスとfasta、gtfに加え、RDataというファイルも必要になる。ここではGithubにリンクが張られているGRCh37のcDNA fasta、gtf、RDataを使用する。
以下の3ファイルを用意した。
1、indexing
TxIndexer -k 31 -t Homo_sapiens.GRCh37.75.cdna.all.fa -o TxIndexer_idx
- -k (31) Kmer size
- -t Transcript fasta file(s)
-
-o Output stem [all files needed by Sailfish will be of the form stem.*].
split-read機能を動かすためk-merサイズはリード長の半分以下に指定する。
出力/
2、Extract fusion equivalence classes and split reads
FuSeq -i TxIndexer_idx -l IU -1 pair1.fasta -2 pair2.fasta -p 8 -g Homo_sapiens.GRCh37.75.gtf -o feqDir
- -i Sailfish index
-
-l Format string describing the library type
-
-1 File containing the #1 mates
-
-2 File containing the #2 mates
-
-p The number of threads to use concurrently
-
-o Output quantification file
-
-g File containing a mapping of transcripts to genes
かなりたくさんのファイルが出力される。
3、gtfからsqliteデータベースを用意する。
Rscript FuSeq_home/R/createSqlite.R Homo_sapiens.GRCh37.75.gtf Homo_sapiens.GRCh37.75.sqlite
4、fusion gene予測
Rscript FuSeq_home/R/FuSeq.R in=feqDir \
txfasta=/path/to/transcripts.fa \
sqlite=Homo_sapiens.GRCh37.75.sqlite \
txanno=txanno_file out=FuSeq_outDir \
params=params_file
引用
A fast detection of fusion genes from paired-end RNA-seq data
Trung Nghia Vu, Wenjiang Deng, Quang Thinh Trac, Stefano Calza, Woochang Hwang, Yudi Pawitan
BMC Genomics 2018 19:786