既知変異を保護しながらロングリードRNA seqのエラーを訂正する TranscriptClean

　従来のショートリードRNAシークエンシングは、様々な用途における遺伝子発現を定量するために広く使用されている。ショートリードリードは正確で費用効果が高いが、一般に数キロベース長ある全長哺乳動物アイソフォームを解決する能力が欠けている（論文より　Conesa、2016）。 Pacific Biosciences（PacBio）やOxford Nanoporeなどのロングリードシーケンシングプラットフォームを使うことでショートリードの転写再構成の課題を回避できるが、エラー率は大幅に高くなる。 Raw PacBioのリードは、１塩基ミスマッチやmicroindelsを含む確率的エラーが11〜15％ある（Eid、2009）。Microindelsがあるとスプライスジャンクションの位置を誤る可能性があるため、Microindelsはアイソフォームマッピング中に特に問題になる。

　PacBio ToFU分析パイプラインでのCircular consensus correctionおよびリードポリッシュステップでは、rawリードが処理されると、ほとんどの転写産物のエラー率を大幅に低下させる（Eid、2009; Gordon、2015）。しかし、この訂正プロセスは、同一のインサート分子上の複数のシークエンシングパスが利用可能である場合にのみ有効であり、転写物長が増加すると可能性が低下する（Rhoads and Au、2015）。

　この問題に対処するために、ToFUパイプラインの下流で転写産物を訂正するための様々なPacBio特有のツールが開発されている。 TAPIS、HapIsoおよびSQANTIは、エキソン内のindelsを補正するためのリファレンスガイドアプローチを用いている（Abdel-Ghany、2016; Mangul et al、2017; Tardaguila、2018）。 HapIsoは、ロングリードをフェージングにするハプロタイプawareな方式で、１塩基バリアントとエラーを区別する。 TAPISとSQANTIは、スプライスジャンクションクオリティフィルタを使用し、後者はランダムフォレスト分類子を使用して、影響を受けた転写産物を削除することによって残りのエラーを処理する。これらの方法では、よりクリーンなPacBioデータセットが生成されるが、microindelsエラーに起因する非標準的なスプライスジャンクションを訂正しようとするものはない。さらに、HapIsoは、フェージングのために遺伝子あたり複数の転写産物を必要とするが、これはシーケンシングデプスおよび遺伝子発現レベルに依存しない。

　TranscriptCleanは、リファレンスゲノム、スプライスアノテーションおよびバリアントファイルを使用して、PacBio転写産物中のミスマッチ、microindelsおよび非標準のスプライスジャンクションを訂正し、一方で既知のバリアントは保護するプログラムである。 GM12878（Tilgner、2014）の公的に入手可能なPacBioヒトトランスクリプトーム上でTranscriptCleanを実行することにより、これらの転写産物に存在するindelsの99％、ミスマッチの98％および非標準スプライスジャンクションの39％を修正した。これにより、非標準スプライスジャンクションのために、以前のワークフローの下で廃棄されたであろう転写産物32536を回収することができた。

　TranscriptCleanは、SAMフォーマットの転写産物を処理し、リファレンスゲノムに対する挿入、欠失およびミスマッチを探す。サイズ閾値以下（デフォルト≤5bp）のIndelsは、リファレンス配列と一致するように変更される。転写産物中のミスマッチは参照塩基に置き換えられる。 Indelおよびミスマッチ訂正は、バリアント認識モードで実行して、関心のあるバリアントをユーザーに削除しないようにすることもできる。このモードでは、ユーザーが提供するVCFファイル上の既知のバリアントポジションとシーケンスが一致しない場合にのみ、ミスマッチとインデルがリファレンスシーケンスに変更される。この方法の潜在的な欠点は、VCFで提供されていないnovelなSNPまたはRNA編集イベントを除去することである。

　TranscriptCleanの出力は、修正されたSAMファイルで、CIGAR、シーケンス、MD / NMフィールドが更新される。また、個々のエラーや記録の変更を追跡するログファイルとともに、修正されたシーケンスのfastaファイルも出力する。アクセサリスクリプトgenerate_report.Rは、TranscriptCleanの結果を要約した数値を生成する。

論文では、GM12878のPacbioシーケンシングデータ（論文）に適用されている。

SRAリンク

https://www.ncbi.nlm.nih.gov/sra/?term=SRP036136

インストール

ubuntu16.04のminiconda2.4.0.5環境でテストした（docker使用。ホストOS: mac os10.12）。

依存

TranscriptClean is designed to be run with Python version 2.7

Bedtools (v2.25.0): http://bedtools.readthedocs.io/en/latest/content/installation.html
pybedtools (v0.7.8): https://daler.github.io/pybedtools/
pyfasta (v0.5.2): https://pypi.python.org/pypi/pyfasta/
R (v.3.3.2 recommended) is needed to run the visualization script, generate_report.R

conda install -c bioconda -y Bedtools==2.25.0 pybedtools==0.7.8 pyfasta==0.5.2

本体 Github

git clone https://github.com/dewyman/TranscriptClean.git
cd TranscriptClean/

> python TranscriptClean.py -h

$ python TranscriptClean.py -h

Usage: TranscriptClean.py [options]

Options:

-h, --help show this help message and exit

-s FILE, --sam=FILE Input SAM file containing transcripts to correct

-g FILE, --genome=FILE

Reference genome fasta file. Should be the

same one used during mapping to generate the sam file.

-j FILE, --spliceJns=FILE

Splice junction file obtained by mapping Illumina

reads to the genome using STAR. More formats may be

supported in the future.

-v FILE, --variants=FILE

VCF formatted file of variants to avoid

correcting away in the data (optional).

--maxLenIndel=MAXLENINDEL

Maximum size indel to correct (Default: 5 bp)

--maxSJOffset=MAXSJOFFSET

Maximum distance from annotated splice

junction to correct (Default: 5 bp)

-o FILE, --outprefix=FILE

output file prefix. '_clean' plus a file

extension will be added to the end.

-m CORRECTMISMATCHES, --correctMismatches=CORRECTMISMATCHES

If set to false, TranscriptClean will skip

mismatch correction. Default: True

-i CORRECTINDELS, --correctIndels=CORRECTINDELS

If set to false, TranscriptClean will skip indel

correction. Default: True

--correctSJs=CORRECTSJS

If set to false, TranscriptClean will skip

splice junction correction. Default: True, but

requires splice junction

annotation file to work.

--dryRun If this option is set, TranscriptClean will read

in the sam file and record all insertions, deletions,

mismatches, and noncanonical splice junctions, but it

will skip correction. This is

useful for checking the distribution

of transcript errors in the data before running

correction.

--primaryOnly If this option is set, TranscriptClean will only

output primary mappings of transcripts (ie it will

filter out unmapped and

multimapped lines from the SAM input.

実行方法

python TranscriptClean.py --sam transcripts.sam --genome hg38.fa --outprefix /my/path/outfile

引用

TranscriptClean: Variant-aware correction of indels, mismatches, and splice junctions in long-read transcripts
Wyman D, Mortazavi A

Bioinformatics. 2018 Jun 15

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

既知変異を保護しながらロングリードRNA seqのエラーを訂正する TranscriptClean