2018-07-10

tandem-genotypes: detecting tandem repeats from long-read mappings

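Tandem repeats are regions of genomic DNA where multiple copies of a sequence lie adjacent to one another. Because of replication errors during cell division, these regions vary greatly between individuals. They are also a source of phenotypic variation in disease and health. Copy number changes in tandem repeats cause more than 30 human diseases (ref.1).
The pathogenic copy number change relative to the reference ranges from a few copies to several thousand, and the repeat unit length ranges from 3 (triplet-repeat diseases) to several thousand (macrosatellite repeats). As expected from such diverse underlying causes, the disease mechanisms also vary. Well-known examples of triplet-repeat expansion diseases in protein-coding regions are the polyglutamine diseases (e.g. spinal and bulbar muscular atrophy, Huntington's disease). Triplet-repeat expansion of the glutamine-encoding CAG or CAA codons leads to toxic protein aggregation and neuronal death. Another triplet-repeat disease is caused by a CUG repeat expansion in the 3'UTR of transcripts from the DMPK gene, which produces a toxic gain-of-function transcript that sequesters splicing factor proteins, causes aberrant splicing, and results in multiple symptoms. Besides gain-of-function mutations, loss-of-function repeat changes in promoter regions, caused by transcriptional silencing, have also been reported (e.g. fragile X syndrome). In addition to short tandem repeat diseases, abnormal repeat copy number in human disease has been reported for a macrosatellite repeat (D4Z4). Contraction of the D4Z4 repeat causes aberrant expression of the flanking gene DUX4, which is toxic to muscle cells. The threshold for pathogenic repeat expansion in coding regions is usually below 100 copies, and sometimes a difference of only a few copies causes disease (e.g. oculopharyngeal muscular dystrophy). In contrast, some disease-causing tandem repeat expansions in introns or UTRs can be very long (e.g. DMPK). Furthermore, some repeats are interrupted by different sequences (e.g. DMPK, ATXN10, SAMD12), which makes it difficult to analyze the exact repeat structure.
About ten years have passed since high-throughput short-read sequencers were introduced into clinical genetics. Thanks mainly to targeted sequencing (e.g. whole-exome sequencing), many small nucleotide changes, especially in coding regions, have been identified. However, the diagnostic rate remains at about 30% (depending on the diagnostic platform used), and most Mendelian diseases remain unsolved. There may be many reasons, but the simplest is that the remaining patients have mutations in non-coding regions, or mutations in coding regions that were overlooked because of the limitations of short-read sequencing. One candidate is tandem repeat regions, which are difficult to analyze genome-wide with conventional technologies. Identifying disease-causing tandem repeat copy number changes has usually relied on classical genetic techniques (i.e. linkage analysis, Southern blotting, etc.) and targeted repeat region analysis in large numbers of families.
Recent advances in long-read sequencing offer a good solution, because a read can contain an entire repeat and can be analyzed using the flanking unique sequence. Very recently, long-read sequencers such as PacBio and nanopore sequencers have begun to enter clinical genetics. As of 2018, these technologies are steadily improving in accuracy and data output. In clinical laboratories, however, they remain difficult to adopt because of cost-effectiveness and the computational burden of large-scale data. It is therefore desirable to detect tandem repeat changes from low-coverage data (~10X) if possible.
The authors are aware of two existing methods for determining tandem repeat copy number from long-read sequencing: PacmonSTR and RepeatHMM. These methods align reads to a reference genome, collect the reads that cover a tandem repeat region of the reference, and then perform sophisticated probability-based comparisons between these reads and the repeat sequence. This study shows, however, that these methods do not always succeed with current long-read sequencing data.
The authors recently proposed a method that uses the LAST software to align long reads to a genome while allowing for genomic rearrangements and duplications (pubmed). The method has two key features. First, it determines the rates of insertion, deletion, and each kind of substitution in the data, and uses these rates to find the most probable alignments. Second, it splits each read into one or more parts and finds the most probable alignment of each part. This method found diverse types of genomic rearrangement, of which the most common was tandem multiplication (e.g. heptuplication), often in tandem repeat regions (pubmed).
Here, tandem repeat copy number changes are detected by aligning long reads to the reference genome with LAST and analyzing these alignments in a very effective way. The authors point out several practical difficulties in analyzing tandem repeat sequences, which motivate their crude analysis method. The approach can analyze tandem repeats genome-wide even with relatively low-coverage sequencing data. The authors believe the tool will be very useful for identifying disease-causing mutations in tandem repeat regions that short-read sequencing overlooks.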

Installation

Tested on macOS 10.12. LAST was run in a Python 2.7 environment, and nanosv in an Anaconda3.5.2 environment.

Dependencies

LAST

git clone https://github.com/mcfrith/tandem-genotypes.git
cd tandem-genotypes/

> python tandem-genotypes -h

Usage: tandem-genotypes [options] microsat.txt alignments.maf

Try to indicate genotypes of tandem repeats.

Options:

-h, --help show this help message and exit

-g FILE, --genes=FILE

read genes from a genePred or BED file

-m PROB, --mismap=PROB

ignore any alignment with mismap probability > PROB

(default=1e-06)

--postmask=NUMBER ignore mostly-lowercase alignments (default=1)

-p BP, --promoter=BP promoter length (default=300)

-s N, --select=N select: all repeats (0), non-intergenic repeats (1),

non-intergenic non-intronic repeats (2) (default=0)

-u BP, --min-unit=BP ignore repeats with unit shorter than BP (default=2)

-f BP, --far=BP require alignment >= BP beyond both sides of a repeat

(default=100)

-n BP, --near=BP count insertions <= BP beyond a repeat (default=60)

--mode=LETTER L=lenient, S=strict (default=L)

-v, --verbose show more details

Run

Create the MAF file by following the guide below.

https://github.com/mcfrith/last-rna/blob/master/last-long-reads.md

1. Prepare the genome.

#1-1 Without repeat masking. Build the index with 16 threads.
lastdb -P16 -uNEAR -R01 mydb genome.fa
#Several files named mydb* are created. Go on to step 2.

#1-2 With repeat masking, using windowmasker (pubmed).
windowmasker -mk_counts -in genome.fa > genome.wmstat
windowmasker -ustat genome.wmstat -outfmt fasta -in genome.fa > genome-wm.fa
#This creates genome-wm.fa, a copy of the genome in which repeats are lowercase.
lastdb -P16 -uNEAR -R11 -c mydb genome-wm.fa

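2. If the sequencing data are in FASTQ format, convert them to FASTA here.

#assumes 4-line FASTQ records; reads are renamed 1, 2, 3, ...
awk 'NR % 4 == 1 {print ">" ++n} NR % 4 == 2' nanopore.fq > nanopore.fa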
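The FASTQ-to-FASTA step can be sketched as a small shell function. This is only a minimal sketch: the function name fastq_to_fasta is hypothetical, and it assumes every FASTQ record occupies exactly four lines (no wrapped sequences). It keeps the sequence line of each record and renames the reads to serial numbers.

```shell
#!/bin/sh
# fastq_to_fasta: hypothetical helper, assumes strict 4-line FASTQ records.
# Reads FASTQ on stdin, writes FASTA on stdout with reads renamed 1, 2, 3, ...
fastq_to_fasta() {
    awk 'NR % 4 == 1 { print ">" ++n }   # header line: emit a serial name
         NR % 4 == 2 { print }           # sequence line: copy unchanged
        '
}

# Tiny demo on two inline records (the quality line of read 2 starts with ">").
printf '@r1\nACGT\n+\nIIII\n@r2\nGGCC\n+\n>>>>\n' | fastq_to_fasta
```

Selecting lines by position rather than by pattern matters here, because quality strings may themselves contain ">" or "@".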

3. Use last-train (link) to estimate the substitution and gap rates. 16 threads are specified.

last-train -P16 mydb nanopore.fa > myseq.par

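Mapping is first run with preset parameters, and better-fitting parameters are then written out.

4. Align the reads to the genome, allowing for duplications and rearrangements. The parameter file myseq.par obtained from last-train in step 3 is specified.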

lastal -P16 -p myseq.par mydb nanopore.fa | last-split -m1e-6 - > myseq.maf

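5. To run tandem-genotypes, you need to supply a file of microsatellites or other repeats. When using a file downloaded from UCSC (link), extract the first four columns. For human microsatellites, for example, choose microsatellite at the link and download it, then extract the first four columns with "cut -f 1-4 input".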

The file is fine if it has the following format (from GitHub).

chr22  41914573  41914611  GCGCGA

chr22  41994883  41994923  TG
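A quick sanity check of this four-column format can be sketched as follows. The helper name check_repeats is hypothetical and it is not part of tandem-genotypes; it assumes whitespace-separated columns "chrom start end unit", with start smaller than end and a repeat unit made only of A, C, G, and T.

```shell
#!/bin/sh
# check_repeats: hypothetical helper, not part of tandem-genotypes.
# Reads a repeat file on stdin and reports lines that do not look like
# "chrom start end unit" (start < end, unit made of ACGT letters).
check_repeats() {
    awk 'NF < 4 || $2 !~ /^[0-9]+$/ || $3 !~ /^[0-9]+$/ || $2 + 0 >= $3 + 0 || $4 !~ /^[ACGTacgt]+$/ {
             printf "line %d looks wrong: %s\n", NR, $0
             bad = 1
         }
         END { exit bad + 0 }'
}

# Demo with the two example lines of the format.
printf 'chr22\t41914573\t41914611\tGCGCGA\nchr22\t41994883\t41994923\tTG\n' | check_repeats && echo "format OK"
```

The function exits with a non-zero status when any line fails the check, so it can gate later steps of a script.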

Also download a gene BED file from the UCSC Table Browser (human link), since it is used in the next step.

6. Detect tandem repeats with tandem-genotypes. microsat.txt is the repeat file prepared in step 5, and refGene.txt is the BED file downloaded in step 5.

tandem-genotypes -g refGene.txt microsat.txt myseq.maf > tg.txt

7. Plot histograms.

python tandem-genotypes-plot tg.txt

A PDF with histograms of the repeat copy counts is written.

Citation

Robust detection of tandem repeat expansions from long DNA reads

Satomi Mitsuhashi, Martin C Frith, Takeshi Mizuguchi, Satoko Miyatake, Tomoko Toyota, Hiroaki Adachi, Yoko Oma, Yoshihiro Kino, Hiroaki Mitsuhashi, Naomichi Matsumoto

bioRxiv preprint first posted online Jun. 27, 2018; doi: http://dx.doi.org/10.1101/356931.

Paper added 2019 3/24

Tandem-genotypes: robust detection of tandem repeat expansions from long DNA reads
Satomi Mitsuhashi, Martin C. Frith, Takeshi Mizuguchi, Satoko Miyatake, Tomoko Toyota, Hiroaki Adachi, Yoko Oma, Yoshihiro Kino, Hiroaki Mitsuhashi, Naomichi Matsumoto
Genome Biology 2019 20:58

2018-07-08

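MECAT: a fast mapping, error correction, and assembly tool for long reads

error correction fast tools mapping Nanopore long read Pacbio assembly 2017 Nature Methods docker

2020 2/7 title revised

MECAT is a tool for ultra-fast mapping, error correction, and de novo assembly of single-molecule sequencing (SMRT) reads. It uses new alignment and error correction algorithms that are much more efficient than state-of-the-art aligners and error correction tools. MECAT can be used for efficient de novo assembly of large genomes. For example, on a 32-thread computer with 2.0 GHz CPUs, MECAT can assemble 54x SMRT human genome sequencing data in 9.5 days, which is 40 times faster than the current PBcR-Mhap pipeline. MECAT can also assemble 102x SMRT data from a diploid human genome in just 25 days. The latter assembly substantially improves the quality of the previous genome assembled from 54x haploid SMRT data. MECAT's performance was compared with the PBcR-Mhap pipeline, FALCON, and Canu (v1.3) on five real datasets. The quality of the contigs produced by MECAT was equal to or better than that of the PBcR-Mhap pipeline and FALCON. A table on GitHub compares these tools on a 32-thread computer with 2.0 GHz CPUs and 512 GB of RAM (link).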

[Figure: running time comparison, reproduced from GitHub]

Installation

Tested from source on Ubuntu 16.04. The Docker image was also tested.

Dependencies

HDF5
dextract

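#HDF5 (/public/users/chenying/smrt_asm/hdf5 is an example install prefix; replace it with your own directory here and in the exports further down)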
wget https://support.hdfgroup.org/ftp/HDF5/releases/hdf5-1.8/hdf5-1.8.15-patch1/src/hdf5-1.8.15-patch1.tar.gz
tar xzvf hdf5-1.8.15-patch1.tar.gz
mkdir hdf5
cd hdf5-1.8.15-patch1
./configure --enable-cxx --prefix=/public/users/chenying/smrt_asm/hdf5
make
make install
cd ..

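#dextract (dextract_makefile is copied from the MECAT repository, which is cloned further below, so clone MECAT first)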
git clone https://github.com/PacificBiosciences/DEXTRACTOR.git
cp MECAT/dextract_makefile DEXTRACTOR
cd DEXTRACTOR
export HDF5_INCLUDE=/public/users/chenying/smrt_asm/hdf5/include
export HDF5_LIB=/public/users/chenying/smrt_asm/hdf5/lib
make -f dextract_makefile
cd ..

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/public/users/chenying/smrt_asm/hdf5/lib

Main program: GitHub

git clone https://github.com/xiaochuanle/MECAT.git
cd MECAT
make -j 8
cd ..

#Add MECAT/Linux-amd64/bin to PATH, and add dextract to PATH as well

There are four main commands.

mecat2pw, a fast and accurate pairwise mapping tool for SMRT reads
mecat2ref, a fast and accurate reference mapping tool for SMRT reads
mecat2cns, correct noisy reads based on their pairwise overlaps
mecat2canu, a modified and more efficient version of the Canu pipeline. Canu is a customized version of the Celera Assembler that is designed for high-noise single-molecule sequencing

Here the Docker image is used.

https://hub.docker.com/r/robegan21/mecat/

docker pull robegan21/mecat

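Show the help from the host. --rm discards the container after the job finishes (otherwise a stopped container is left behind every time a job is submitted, so it is removed each time).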

> docker run --rm -it robegan21/mecat mecat2pw -h

[parse_arguments, 142] unrecognised option 'h'

usage:

mecat2pw [-j task] [-d dataset] [-o output] [-w working dir] [-t threads] [-n candidates] [-g 0/1]

options:

-j <integer> job: 0 = seeding, 1 = align

default: 1

-d <string> reads file name

-o <string> output file name

-w <string> working folder name, will be created if not exist

-t <integer> number of cput threads

default: 1

-n <integer> number of candidates for gapped extension

Default: 100

-a <integer> minimum size of overlaps

Default: 2000 if x = 0, 500 if x = 1

-k <integer> minimum number of kmer match a matched block has

Default: 4 if x = 0, 2 if x = 1

-g <0/1> whether print gapped extension start point, 0 = no, 1 = yes

Default: 0

-x <0/x> sequencing technology: 0 = pacbio, 1 = nanopore

Default: 0

> docker run --rm -it -v /home/kazu/MECAT/:/root robegan21/mecat mecat2ref -h

Error: argument to option 'h' is missing!

usage:

mecat2ref [-d reads] [-r reference] [-o output] [-w working dir] [-t threads]

options:

-d <string> reads file name

-r <string> reference file name

-o <string> output file name

-w <string> working folder name, will be created if not exist

-t <integer> number of cput threads

default: 1

-n <integer> number of of candidates for gap extension

default: 10

-b <integer> output the best b alignments

default: 10

-m <0/1/2> output format: 0 = ref, 1 = m4, 2 = sam

default: 0

-x <0/1> sequencing technology: 0 = pacbio, 1 = nanopore

default: 0

> docker run --rm -it -v /home/kazu/MECAT/:/root robegan21/mecat mecat2cns -h

input_type: 1

number of threads: 1

batch size: 100000

mapping ratio: 0.9

align size: 2000

cov: 6

min size: 5000

partition files: 10

tech: 0

USAGE:

mecat2cns [options] input reads output

OPTIONS:

-x <0/1> sequencing platform: 0 = PACBIO, 1 = NANOPORE

default: 0

-i <0/1> input type: 0 = candidte, 1 = m4

-t <Integer> number of threads (CPUs)

-p <Integer> batch size that the reads will be partitioned

-r <Real> minimum mapping ratio

-a <Integer> minimum overlap size

-c <Integer> minimum coverage under consideration

-l <Integer> minimum length of corrected sequence

-k <Integer> number of partition files when partitioning overlap results (if < 0, then it will be set to system limit value)

-h print usage info.

If 'x' is set to be '0' (pacbio), then the other options have the following default values:

-i 1 -t 1 -p 100000 -r 0.9 -a 2000 -c 6 -l 5000 -k 10

If 'x' is set to be '1' (nanopore), then the other options have the following default values:

-i 1 -t 1 -p 100000 -r 0.4 -a 400 -c 6 -l 2000 -k 10

> docker run --rm -it -v /home/kazu/MECAT/:/root robegan21/mecat mecat2canu -h

usage: canu [-correct | -trim | -assemble] \

[-s <assembly-specifications-file>] \

-p <assembly-prefix> \

-d <assembly-directory> \

genomeSize=<number>[g|m|k] \

errorRate=0.X \

[other-options] \

[-pacbio-raw | -pacbio-corrected | -nanopore-raw | -nanopore-corrected] *fastq

By default, all three stages (correct, trim, assemble) are computed.

To compute only a single stage, use:

-correct - generate corrected reads

-trim - generate trimmed reads

-assemble - generate an assembly

The assembly is computed in the (created) -d <assembly-directory>, with most

files named using the -p <assembly-prefix>.

The genome size is your best guess of the genome size of what is being assembled.

It is used mostly to compute coverage in reads. Fractional values are allowed: '4.7m'

is the same as '4700k' and '4700000'

The errorRate is not used correctly (we're working on it). Set it to 0.06 and

use the various utg*ErrorRate options.

A full list of options can be printed with '-options'. All options

can be supplied in an optional sepc file.

Reads can be either FASTA or FASTQ format, uncompressed, or compressed

with gz, bz2 or xz. Reads are specified by the technology they were

generated with:

-pacbio-raw <files>

-pacbio-corrected <files>

-nanopore-raw <files>

-nanopore-corrected <files>

Complete documentation at http://canu.readthedocs.org/en/latest/

ERROR: Invalid command line option '-h'. Did you forget quotes around options with spaces?

ERROR: Assembly name prefix not supplied with -p.

ERROR: Directory not supplied with -d.

ERROR: Required parameter 'genomeSize' is not set

Run

Test 1: PacBio

1. Download the test sequencing data.

cd /home/kazu/MECAT/
wget http://gembox.cbcb.umd.edu/mhap/raw/ecoli_filtered.fastq.gz
gzip -dv ecoli_filtered.fastq.gz

ecoli_filtered.fastq is created.

The jobs are submitted from the host through Docker. For readability, each command line is broken with a backslash just before the MECAT command.

2. mecat2pw: a fast and accurate pairwise mapping tool for SMRT reads

Detect overlaps by mapping. Specify ecoli_filtered.fastq.

#current path: /home/kazu/MECAT/
sudo docker run --rm -it -v /home/kazu/MECAT/:/root robegan21/mecat \
mecat2pw -j 0 -d ecoli_filtered.fastq -o ecoli_filtered.fastq.pm.can -w wrk_dir -t 16

ecoli_filtered.fastq.pm.can is written.

3. mecat2cns: correct noisy reads based on their pairwise overlaps

Error correction. Specify ecoli_filtered.fastq.pm.can and ecoli_filtered.fastq.

sudo docker run --rm -it -v /home/kazu/MECAT/:/root robegan21/mecat \
mecat2cns -i 0 -t 16 ecoli_filtered.fastq.pm.can ecoli_filtered.fastq corrected_ecoli_filtered.fasta

corrected_ecoli_filtered.fasta is written.

4. Extract the longest 25X corrected reads. Specify the estimated genome size and the coverage (link).

#extract_sequences usage
extract_sequences inputReads outputReads-prefix genomeSize coverage

sudo docker run --rm -it -v /home/kazu/MECAT/:/root robegan21/mecat \
extract_sequences corrected_ecoli_filtered.fasta corrected_ecoli_25x 4800000 25

corrected_ecoli_25x.fasta is written.
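The idea behind this step, keeping the longest reads until genome size times coverage bases have been collected, can be sketched as below. This is an illustration only and not MECAT's extract_sequences itself: the function name longest_reads is hypothetical, and the sketch assumes a FASTA file in which every sequence sits on a single line.

```shell
#!/bin/sh
# longest_reads: hypothetical sketch, NOT MECAT's extract_sequences itself.
# Assumes FASTA on stdin with each sequence on a single line.
# Usage: longest_reads GENOME_SIZE COVERAGE < in.fasta > out.fasta
longest_reads() {
    awk 'NR % 2 == 1 { name = $0; next }            # header line
         { print length($0) "\t" name "\t" $0 }     # length, header, sequence
        ' |
    sort -k1,1nr |
    awk -F '\t' -v target="$(( $1 * $2 ))" '
         total >= target { exit }                    # enough bases collected
         { print $2; print $3; total += $1 }'
}

# Demo: genome size 10, coverage 1, so reads are kept (longest first)
# until at least 10 bases have been collected.
printf '>a\nACGT\n>b\nACGTACGT\n>c\nAC\n>d\nACGTAC\n' | longest_reads 10 1
```

With the values used in this test (4800000 and 25), the target would be 120,000,000 bases.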

5. mecat2canu: a modified and more efficient version of the Canu pipeline. Canu is a customized version of the Celera Assembler that is designed for high-noise single-molecule sequencing

Assembly.

sudo docker run --rm -it -v /home/kazu/MECAT/:/root robegan21/mecat \
mecat2canu -trim-assemble -p ecoli -d ecoli genomeSize=4800000 ErrorRate=0.02 maxMemory=40 maxThreads=16 useGrid=0 Overlapper=mecat2asmpw -pacbio-corrected corrected_ecoli_25x.fasta

Test 2: Nanopore

1. Download.

wget http://nanopore.s3.climb.ac.uk/MAP006-PCR-1_2D_pass.fasta

2. Detect overlaps by mapping.

#current path: /home/kazu/MECAT/
sudo docker run --rm -it -v /home/kazu/MECAT/:/root robegan21/mecat \
mecat2pw -j 0 -d MAP006-PCR-1_2D_pass.fasta -o candidates.txt -w wrk_dir -t 16 -x 1

3. Error correction.

sudo docker run --rm -it -v /home/kazu/MECAT/:/root robegan21/mecat \
mecat2cns -i 0 -t 16 -x 1 candidates.txt MAP006-PCR-1_2D_pass.fasta corrected_ecoli.fasta

4. Extract the longest 25X corrected reads.

sudo docker run --rm -it -v /home/kazu/MECAT/:/root robegan21/mecat \
extract_sequences corrected_ecoli.fasta corrected_ecoli_25x 4800000 25

5. Assembly.

sudo docker run --rm -it -v /home/kazu/MECAT/:/root robegan21/mecat \
mecat2canu -trim-assemble -p ecoli -d ecoli genomeSize=4800000 \
ErrorRate=0.06 maxMemory=40 maxThreads=16 useGrid=0 \
Overlapper=mecat2asmpw -nanopore-corrected corrected_ecoli_25x.fasta

Use mecat2ref for mapping.

Citation

MECAT: fast mapping, error correction, and de novo assembly for single-molecule sequencing reads

Xiao CL, Chen Y, Xie SQ, Chen KN, Wang Y, Han Y, Luo F, Xie Z.

Nat Methods. 2017 Nov;14(11):1072-1074

2018-07-08

ChimPipe: detecting fusion genes and chimeric transcripts

tumor chimera transcript RNA seq fusion genes

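2020 5/11 link added

2021 12/5 typo fixed

Chimeric transcripts are transcripts whose sequence derives from two or more different genes in the genome [from the paper, ref.1], and they can be explained by several different biological mechanisms at the genomic or transcriptional level. Because of its historical connection with cancer, the best-known mechanism is genomic rearrangement. This process brings two genes that are far apart in the germline genome close to each other, and in the same orientation, in the cancer genome. The fusion gene created this way can play a deleterious role as a protein or as a transcript [ref.1,2]. Beyond the known role of chimeras in cancer, there are also transcriptional mechanisms that can explain chimera formation in normal or tumor cells: polymerase read-through and trans-splicing [ref.1].

As the name indicates, polymerase read-through occurs when the polymerase reads from one gene into the next, creating a chimera between two adjacent genes. Initially thought to be an exception, this mechanism turned out to be widespread in mammals [ref.3-5] once large collections of ESTs (expressed sequence tags) and cDNAs (complementary DNA) became available and were mapped to the genome, and once the ENCODE (Encyclopedia of DNA Elements) consortium systematically surveyed the transcriptome associated with annotated protein-coding genes [ref.6-9]. Read-through occurs between exons of adjacent genes, preferentially between the second-to-last exon of the upstream (5') gene and the second exon of the downstream (3') gene, yielding new proteins that contain domains of both parent genes and thereby increasing the proteome diversity of the species [ref.1,3,4,10,11]. These chimeras are also largely conserved across vertebrates [ref.11,12] and could be a way to regulate the expression of one or both parent genes [ref.12].

Trans-splicing, unlike the well-known cis-splicing, is a splicing mechanism that occurs between two different pre-messenger RNA (pre-mRNA) molecules that lie close together in the three-dimensional (3D) space of the nucleus and are thought to belong to the same "transcription factory". When the two pre-mRNAs come from two different genes, a transcriptional chimera is generated [ref.1,13-16]. The two joined genes can therefore lie far apart in the genome, but the chimeric junction must have canonical splice sites. Initially thought to be restricted to trypanosomes, trans-splicing has attracted interest in human research since several studies found chimeras between genes on different chromosomes or strands without evidence of an underlying genomic rearrangement [ref.13,14,16]. One hypothesis is that such trans-spliced transcripts occurring in normal cells trigger genomic rearrangements, which in turn produce more of these transcripts (through a different mechanism) and ultimately lead to tumorigenesis [ref.13].

(three paragraphs omitted)

State-of-the-art chimera detection programs usually comprise three steps: (1) mapping and filtering for chimeric reads, (2) chimeric junction detection, and (3) chimera assembly and filtering. They map reads to the genome (and sometimes the transcriptome) and exploit two kinds of information for chimera detection: (1) discordant paired-end (PE) reads, i.e. pairs whose two mates map in a way that is inconsistent with the annotated gene structure, for example to different chromosomes; and (2) 'split' reads, i.e. reads that do not map contiguously to the genome but have to be split into several blocks (usually two) to be mapped (paper, Fig. 1). Depending on whether one or both kinds of reads are used, three approaches to chimeric junction detection are possible: (1) the whole paired-end approach, (2) the direct fragmentation approach, and (3) the paired-end + fragmentation approach [41].

Benchmarks of these programs show high false-positive rates and poor overlap between their outputs on the same dataset [ref.42,43]. Moreover, these programs were usually developed to detect fusion genes in human cancer, so they cannot always detect read-through events and cannot always be used on non-human species. In addition, they do not always predict multiple isoforms per gene pair and, more importantly, do not always provide base-pair resolution, which hinders downstream functional validation. To address these issues, the authors present ChimPipe, a modular method that uses the paired-end + fragmentation approach and a stringent set of filters to reliably detect both chimeric transcripts and fusion genes from Illumina paired-end RNA-seq data of both normal and tumor samples (rest omitted).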

Manual

https://chimpipe.readthedocs.io/en/latest/

Installation

Dependencies

64-bit Linux System (ChimPipe is written in Bash and Awk)
Bedtools v2.20.1 or higher
Samtools v0.1.19 or higher
Blast v2.2.29+ or higher

Main program: GitHub

git clone https://github.com/Chimera-tools/ChimPipe.git
cd ChimPipe/

> ./ChimPipe.sh

[ERROR] The mate 1 FASTQ provided does not exist. Mandatory argument --fastq_1

**** ChimPipe version v0.9.5 ****

Execute ChimPipe on one Illumina paired-end RNA-seq dataset (sample).

*** USAGE

FASTQ:

./ChimPipe.sh --fastq_1 <mate1_fastq> --fastq_2 <mate2_fastq> -g <genome_index> -a <annotation> -t <transcriptome_index> -k <transcriptome_keys> [OPTIONS]

BAM:

./ChimPipe.sh --bam <bam> -g <genome_index> -a <annotation> [OPTIONS]

*** MANDATORY

* FASTQ:

--fastq_1 <FASTQ> First mate sequencing reads in FASTQ format. It can be gzip compressed [.gz].

--fastq_2 <FASTQ> Second mate sequencing reads in FASTQ format. It can be gzip compressed [.gz].

-g|--genome-index <GEM> Reference genome index in GEM format.

-a|--annotation <GTF> Reference gene annotation file in GTF format.

-t|--transcriptome-index <GEM> Annotated transcriptome index in GEM format.

-k|--transcriptome-keys <KEYS> Transcriptome to genome coordinate conversion keys.

--sample-id <STRING> Sample identifier (output files are named according to this id).

* BAM:

--bam <BAM> Mapped reads in BAM format. A splicing aware aligner is needed to map the reads.

-g|--genome-index <GEM> Reference genome index in GEM format.

-a|--annotation <GTF> Reference genome annotation file in GTF format.

--sample-id <STRING> Sample identifier (the output files are named according to this id).

*** [OPTIONS] can be:

* General:

--threads <INTEGER> Number of threads to use. Default 1.

-o|--output-dir <PATH> Output directory. Default current working directory.

--tmp-dir <PATH> Temporary directory. Default /tmp.

--no-cleanup Keep intermediate files.

-h|--help Display partial usage information, only mandatory plus general arguments.

-f|--full-help Display full usage information with additional options.

A complete documentation for ChimPipe can be found at: http://chimpipe.readthedocs.org/en/latest/index.html

Run

An all-in-one package for a test run is provided (link) (5.2 GB).

wget http://public-docs.crg.es/rguigo/Papers/ChimPipe/ChimPipe_tutorial.tar.gz
tar -zxvf ChimPipe_tutorial.tar.gz

Run the downloaded test data.

cd ChimPipe_tutorial/input/

../../ChimPipe.sh --fastq_1 MCF-7_1.fastq.gz --fastq_2 MCF-7_2.fastq.gz -g Homo_sapiens.GRCh37.chromosomes.chr.M.gem \
 -a gencode.v19.annotation.long.gtf -t gencode.v19.annotation.long.gtf.junctions.gem \
-k gencode.v19.annotation.long.gtf.junctions.keys --sample-id MCF-7 --threads 20 \
--similarity-gene-pairs gencode.v19.annotation.long.similarity.txt

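While GEMTools was running, an error reported that no value had been given for the -q flag. As a stopgap, I opened the main shell script and added the line quality="33" just before line 283, the line starting with run "$gemtools --loglevel $logLevel rna-pipeline -, and then ran the pipeline.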
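Assuming that the quality value in this workaround is the FASTQ quality offset (33 or 64), it may be worth checking which offset the reads actually use before hard-coding it. The following is a heuristic sketch only; the helper name guess_phred_offset is hypothetical and is not part of ChimPipe or GEMTools. It assumes strict 4-line FASTQ records and looks at the lowest quality character in the first 1000 reads.

```shell
#!/bin/sh
# guess_phred_offset: hypothetical helper, not part of ChimPipe or GEMTools.
# Reads FASTQ (strict 4-line records) on stdin and prints 33 or 64
# according to the lowest quality character seen in the first 1000 records.
guess_phred_offset() {
    awk 'BEGIN { for (i = 33; i < 127; i++) ord[sprintf("%c", i)] = i; min = 127 }
         NR % 4 == 0 {                               # quality line of each record
             for (j = 1; j <= length($0); j++) {
                 c = ord[substr($0, j, 1)]
                 if (c && c < min) min = c           # skip characters outside the table
             }
         }
         NR >= 4000 { exit }                         # 1000 records are enough
         END { print (min < 59 ? 33 : 64) }'         # characters below ";" imply offset 33
}

# Demo on one inline record whose lowest quality character is "#".
printf '@r\nACGT\n+\nII5#\n' | guess_phred_offset
```

For gzip-compressed input such as the tutorial FASTQ files, the data would first need to be decompressed to stdout, for example with gzip -dc.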

Citation

ChimPipe: accurate detection of fusion genes and transcription-induced chimeras from RNA-seq data.

Rodríguez-Martín B, Palumbo E, Marco-Sola S, Griebel T, Ribeca P, Alonso G, Rastrojo A, Aguado B, Guigó R, Djebali S.

BMC Genomics. 2017 Jan 3;18(1):7.

2018-07-07

picardmetrics: analyzing multiple BAM files with various metrics and merging the results

bam/sam RNA seq visualization evaluation tool

2020 8/24 title revised

picardmetrics is a shell script published by Kamil Slowikowski that runs the Picard Tools commands for analyzing BAM files and merges their results.

Commands

https://slowkow.github.io/picardmetrics/

Installation

Installed on Ubuntu 18.04.

Dependencies

Picard
samtools, which depends on htslib
stats
gtfToGenePred
ggplot2(optional)

#Install stats
git clone https://github.com/arq5x/filo.git
cd filo/
make
#Copy to /usr/local/bin/
sudo cp bin/stats /usr/local/bin/

For gtfToGenePred, download the Linux binary from the kentUtils FTP server (link) and move it to a directory on your PATH.

Main program: GitHub

git clone https://github.com/slowkow/picardmetrics 
cd picardmetrics

# Download and install the dependencies. 
make get-deps PREFIX=~/.local

# Install picardmetrics and the man page. 
make install PREFIX=~/.local

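picardmetrics.conf is copied to the home directory ($HOME). From then on, picardmetrics uses $HOME/picardmetrics.conf as its config file. The genome FASTA path in it is only a placeholder, so correct the FASTA path in picardmetrics.conf before the first run. Correct the Picard tools path as well if it differs. For RNA-seq, the annotation file paths also need to be corrected.

#Edit the config file. Open it with emacs, vim, or vi.
vi ~/picardmetrics.conf

Alternatively, specify picardmetrics.conf with -f on every run.

Run

Analyze all BAM files in data/project1/sample/.

for bam in data/project1/sample/*.bam
do
 picardmetrics run -k -o out/rnaseq "$bam"
done

#All data are written under out/. Each sample gets its own analysis files and PDF. These are merged with the collate command.
#Merge the data: collate everything in out/ and write the result to summary/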
picardmetrics collate out/ summary/

default-all-metrics.tsv is created.

Viewed in Excel. Here the results for two BAM samples have been merged.

There are as many as 68 metrics, so only the first columns are shown here (many more columns follow to the right).
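With this many columns, it can be handy to pull out just a few of them by name before opening the table. The sketch below is an illustration, not part of picardmetrics: the function name pick_columns is hypothetical, the table is assumed to be tab-separated with a header row, and the demo rows (including the SAMPLE column and its values) are made up.

```shell
#!/bin/sh
# pick_columns: hypothetical helper, not part of picardmetrics.
# Prints the named columns of a tab-separated table that has a header row.
# Usage: pick_columns NAME1,NAME2,... < table.tsv
pick_columns() {
    awk -F '\t' -v OFS='\t' -v names="$1" '
        NR == 1 {
            n = split(names, want, ",")
            for (i = 1; i <= NF; i++) col[$i] = i     # header name -> column index
            for (k = 1; k <= n; k++)
                if (!(want[k] in col)) {
                    print "no such column: " want[k] | "cat 1>&2"
                    exit 1
                }
        }
        {
            line = $(col[want[1]])
            for (k = 2; k <= n; k++) line = line OFS $(col[want[k]])
            print line
        }'
}

# Demo with made-up values; PF_READS and PF_ALIGNED_BASES are Picard metric names.
printf 'SAMPLE\tPF_READS\tPF_ALIGNED_BASES\nA\t100\t9000\nB\t200\t18000\n' |
    pick_columns SAMPLE,PF_ALIGNED_BASES
```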

Plot with ggplot2.

R #start R

> library(ggplot2) 
> dat <- read.delim("project1-all-metrics.tsv", stringsAsFactors = FALSE)
> ggplot(dat) + geom_point(aes(PF_READS, PF_ALIGNED_BASES))

Citation

GitHub - slowkow/picardmetrics: Run Picard on BAM files and collate 90 metrics into one file.

2018-07-07

Scalpel: a somatic and germline variant detection tool

local assembly tumor small indel family trios k-mer human whole genome human genome family germline human exome 2014 Nature Methods

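Note: a link to a Docker image is also given below, but it threw an error when tested. Installing with conda on a Linux machine seems to be the safe choice.

Although the analysis of SNVs has become a standard technique for studying human genetics [from the paper, ref.1], insertions and deletions (indels) in DNA sequence cannot yet be detected reliably [ref.2,3]. Indels are the second most common type of variation in the human genome and the most common type of structural variation [ref.4]. Within microsatellites (simple sequence repeats, SSRs, 1-6 bp motifs), indels alter the length of the repeat motif and have been linked to more than 40 neurological diseases [ref.5]. Indels also carry an important genetic component in autism: de novo indels that are likely to disrupt the encoded protein are nearly twice as abundant in affected children as in their unaffected siblings [ref.6].

Indel detection is difficult for several reasons. (1) Reads overlapping an indel are hard to align and may be aligned with multiple mismatches rather than a gap. (2) Variable capture efficiency and non-uniform read distribution in exome sequencing increase the number of false positives. (3) Increased error rates make detection within microsatellites very difficult. (4) As shown in this study, localized, near-identical repeats can produce a high false-positive rate. For these reasons, the indels detectable with the available software tools are relatively small, rarely exceeding a few dozen bases [ref.8].

Two major paradigms are currently used for indel detection. The most common approach maps all reads to the reference genome with a read mapper (e.g. BWA, Bowtie, Novoalign), but the available algorithms are not effective at mapping across indels longer than a few bases. More advanced approaches use paired-end information and local realignment to detect longer variants (e.g. GATK UnifiedGenotyper [ref.1] and Dindel [ref.9]), but in practice their sensitivity drops sharply for longer variants (≥20 bp). Split-read methods (e.g. Pindel [ref.10] and Splitread [ref.11]) can in theory detect deletions of any size, but with the short read lengths of current sequencing technology (at the time the paper was written) their ability to detect insertions is limited. The second paradigm performs de novo whole-genome assembly and detects variants between the assembled contigs and the reference genome [ref.12,13]. While it has the potential to detect larger mutations, in practice this paradigm needs a fine-grained, localized analysis to report homozygous and heterozygous mutations accurately. Recently, three approaches based on de novo assembly have been developed: GATK HaplotypeCaller, SOAPindel [ref.14], and Cortex [ref.15]. Another recent approach, TIGRA [ref.16], also uses local assembly, but it is tuned to detect breakpoints only and does not report the indel sequences.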

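The authors present Scalpel, a micro-assembly pipeline for detecting indels in exome-capture data (paper, Fig. 1). By combining the power of mapping and assembly, Scalpel carefully searches the de Bruijn graph for sequence paths (contigs) that span each exon. The algorithm includes an on-the-fly repeat composition analysis of each exon, coupled with a self-tuning k-mer strategy.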
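One way to picture a self-tuning k-mer strategy is to raise k until no k-mer occurs twice in the window being assembled, so that exact repeats no longer create ambiguous paths in the de Bruijn graph. The sketch below illustrates only that idea under this assumption; it is not Scalpel's actual code, and the function name smallest_repeat_free_k is hypothetical.

```shell
#!/bin/sh
# smallest_repeat_free_k: hypothetical sketch, not Scalpel's actual code.
# Prints the smallest k >= KMIN for which no k-mer occurs twice in SEQ.
# Usage: smallest_repeat_free_k SEQ KMIN
smallest_repeat_free_k() {
    awk -v seq="$1" -v kmin="$2" 'BEGIN {
        len = length(seq)
        for (k = kmin; k <= len; k++) {
            split("", seen)                      # clear the set of k-mers seen so far
            dup = 0
            for (i = 1; i + k - 1 <= len; i++) {
                kmer = substr(seq, i, k)
                if (kmer in seen) { dup = 1; break }
                seen[kmer] = 1
            }
            if (!dup) { print k; exit }          # first k without a repeated k-mer
        }
    }'
}

# Demo: ACGT is repeated in this sequence, so small values of k are rejected.
smallest_repeat_free_k ACGTACGTAA 3
```

In the demo sequence, every k from 3 to 5 still contains a repeated k-mer, so the search only stops at k = 6.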

Official site

http://scalpel.sourceforge.net/manual.html

Manual 1

http://scalpel.sourceforge.net/manual.html

Manual 2

https://sourceforge.net/p/scalpel/wiki/Manual/

Installation

Tested with Anaconda2.4.2 on Ubuntu 18.04.

Github

#In an Anaconda environment, use conda (Linux only)
conda install -c bioconda scalpel

#A Docker image is also provided.
docker pull hanfang/scalpel:0.5.3

For the Docker image, look up the image ID with docker images before running it.

> scalpel-discovery -h

Local date and time: Sat Jul 7 10:13:11 2018

Program: scalpel-discovery (micro-assembly variant detection)

Version: 0.5.3 (beta), January 25 2016

Contact: Giuseppe Narzisi <gnarzisi@nygenome.org>

usage: scalpel-discovery <COMMAND> [OPTIONS]

COMMAND:

--help : this (help) message

--verbose : verbose mode

--single : single exome study

--denovo : family study (mom,dad,affected,sibling)

--somatic : normal/tumor study

> scalpel-discovery --single

Local date and time: Sat Jul 7 10:14:20 2018

Program: scalpel-discovery (micro-assembly variant detection)

Version: 0.5.3 (beta), January 25 2016

Contact: Giuseppe Narzisi <gnarzisi@nygenome.org>

usage: scalpel-discovery --single --bam <BAM file> --bed <BED file> --ref <FASTA file> [OPTIONS]

Detect indels in one single dataset (e.g., one individual).

OPTIONS:

--help : this (help) message

--verbose : verbose mode

Required:

--bam <BAM file> : BAM file with the reference-aligned reads

--bed <BED file> : file with list of regions (BED format) in sorted order or single region in format chr:start-end (example: 1:31656613-31656883)

--ref <FASTA file> : reference genome in FASTA format (same one that was used to create the BAM file)

Optional:

--kmer <int> : k-mer size [default 25]

--covthr <int> : threshold used to select source and sink [default 5]

--lowcov <int> : threshold used to remove low-coverage nodes [default 2]

--covratio <float> : minimum coverage ratio for sequencing errors (default: 0.01)

--radius <int> : left and right extension (in base-pairs) [default 100]

--window <int> : window-size of the region to assemble (in base-pairs) [default 400]

--maxregcov <int> : maximum average coverage allowed per region [default 10000]

--step <int> : delta shift for the sliding window (in base-pairs) [default 100]

--mapscore <int> : minimum mapping quality for selecting reads to assemble [default 1]

--pathlimit <int> : limit number of sequence paths to [default 1000000]

--mismatches <int> : max number of mismatches in near-perfect repeat detection [default 3]

--dir <directory> : output directory [default ./outdir]

--numprocs <int> : number of parallel jobs (1 for no parallelization) [default 1]

--sample <string> : only process reads/fragments in sample [default ALL]

--coords <file> : file with list of selected locations to examine [default null]

Output:

--format : export mutations in selected format (annovar | vcf) [default vcf]

--intarget : export mutations only inside the target regions from the BED file

--logs : keep log files

Note 1: the list of detected indels is saved in file: OUTDIR/variants.indel.*

where OUTDIR is the output directory selected with option "--dir" [default ./outdir]

Note 2: use the export tool (scalpel-export) to export mutations using different filtering criteria
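The help text above asks for the --bed regions "in sorted order". A quick pre-flight check can be sketched as follows, assuming that "sorted" means ordered by chromosome name (byte order) and then by numeric start position; the help text does not spell this out, so treat that as an assumption. The helper name bed_is_sorted and the file names in the usage comment are hypothetical.

```shell
#!/bin/sh
# bed_is_sorted: hypothetical helper, not part of Scalpel.
# Succeeds if the BED file given as $1 is sorted by chromosome name
# (byte order) and then by numeric start position.
bed_is_sorted() {
    LC_ALL=C sort -c -k1,1 -k2,2n "$1" 2>/dev/null
}

# Usage idea (file names are placeholders):
#   bed_is_sorted regions.bed || LC_ALL=C sort -k1,1 -k2,2n regions.bed > regions.sorted.bed
```

sort -c only checks the order and writes nothing, so the function can be used directly in a condition.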

> scalpel-discovery --somatic

Local date and time: Sat Jul 7 10:14:53 2018

Program: scalpel-discovery (micro-assembly variant detection)

Version: 0.5.3 (beta), January 25 2016

Contact: Giuseppe Narzisi <gnarzisi@nygenome.org>

usage: scalpel-discovery --somatic --normal <BAM file> --tumor <BAM file> --bed <BED file> --ref <FASTA file> [OPTIONS]

Detect somatic indels in a tumor/normal pair

OPTIONS:

--help : this (help) message

--verbose : verbose mode

Required:

--normal <BAM file> : normal BAM file

--tumor <BAM file> : tumor BAM file