Complete Genomicsのシーケンスリードをマッピングする sirfast

　ハイスループットシークエンシング（HTS）技術は、[論文より ref.1]におけるペアエンド配列決定、および全ゲノムショットガンシーケンシング（WGS）[ref.2]の最初の使用以来、魅力的な速度で進化し続けている。 Roche / 454 [ref.3]、Illumina [ref.4]、ABI SOLiD [ref.5]、Pacific Biosciences [ref.6]など多くのシーケンシング技術が開発されている[ref.7]。これらのプラットフォームの多くでは、リードのマッピング[ref.8]、バリアントのコール[ref.9,10]、ゲノムアセンブリ[ref.11]のための多くのツールが開発されている。

　多くの異なるアルゴリズムとそのオープンソース実装が利用可能であるため、研究者はHTS技術によって生成されたデータをより有効に利用することができる。しかし、Complete Genomics [ref.12]（CG）プラットフォームによって生成されたデータをマップし分析するために考案された公に利用可能なアルゴリズムは存在しない。これは主に、Complete Genomics社がシーケンシングサービスのみを提供し、独自のソフトウェアを使用して独自の分析結果を顧客に提供しているためである。 CGデータを分析するために考案された唯一のアルゴリズムセットは、同社の科学者グループによって記述されている[ref.13]が、これらのアルゴリズムの実装は独自のものであり、未発表のままである。間違いなく、同社独自の分析パイプラインは、異なる特性を持つデータに対してきめ細かく調整されている。それでも、1000 Genomes Project [ref.14]のような多くのゲノムの解析と新しいアルゴリズム開発を組み込んだ研究プロジェクトは、解析結果をさらに改善するためのオープンソースのツールとアルゴリズムが利用できるというメリットがある。 CGデータを分析するためのオープンソースツールがあれば、他の研究者が他の方法で発見することができない新しい結果を開発することが可能になる。

　本論文では、CGプラットフォームで生成されたデータ用に設計された新しいリードマッパーsirFASTを紹介する。 Complete Genomicsは、他のすべてのHTS技術とは異なり、リードは複数のサブリード（読み取りセクション）で構成され、重複またはギャップを持つ可能性がある。著者らはsirFASTを設計して、リードごとにフレキシブルな "予想される"ギャップ/オーバーラップが存在する場合に、そのようなリードをリファレンスアセンブリに効率よくアライメントするにした。 Burrows-Wheeler変換[ref.15]に基づくメソッドはindelsでうまく拡張できないため、sirFASTをハッシュベースのシード・アンド・延長アルゴリズムとして実装した。イルミナとSOLiDをベースにしたツール[ref.16,17,18,19]と同様に、sirFASTはSAM [ref.20]とDIVET [ref.21]の両方のファイル形式をサポートしている。

インストール

Github

https://github.com/BilkentCompGen/sirfast

git clone https://github.com/BilkentCompGen/sirfast.git
cd sirfast/
make

> ./sirfast

$ ./sirfast

sirFAST : Short interrupted Read Fast Alignment Search Tool. Enhanced with FastHASH.

Usage: sirfast [options]

General Options:

-v|--version Current Version.

-h Shows the help file.

Indexing Options:

--index [file] Generate an index from the specified fasta file.

Searching Options:

--search [file] Search in the specified genome. Provide the path to the fasta file.

Index file should be in the same directory.

--pe Search will be done in Paired-End mode.

--seq [file] Input sequences in fasta/fastq format [file]. If

paired end reads are interleaved, use this option.

--seq1 [file] Input sequences in fasta/fastq format [file] (First

file). Use this option to indicate the first file of

paired end reads.

--seq2 [file] Input sequences in fasta/fastq format [file] (Second

file). Use this option to indicate the second file of

paired end reads.

-o [file] Output of the mapped sequences. The default is "output".

-u [file] Save unmapped sequences in fasta/fastq format.

--best Only the best mapping from all the possible mapping is returned.

--seqcomp Indicates that the input sequences are compressed (gz).

--outcomp Indicates that output file should be compressed (gz).

-e [int] Maximum allowed hamming distance (default 4% of the read length).

--min [int] Min distance allowed between a pair of end sequences.

--max [int] Max distance allowed between a pair of end sequences.

--maxoea [int] Max number of One End Anchored (OEA) returned for each read pair.

We recommend 100 or above for NovelSeq use. Default = 100.

--maxdis [int] Max number of discordant map locations returned for each read pair.

We recommend 300 or above for VariationHunter use. Default = 300.

--crop [int] Trim the reads to the given length.

--sample [string] Sample name to be added to the SAM header (optional).

--rg [string] Read group ID to be added to the SAM header (optional).

--lib [string] Library name to be added to the SAM header (optional).

パスの通ったディレクトリに移動しておく。

ラン

テストデータを使う。はじめにindexファイルを作成する。

sirfast --index Sample_Data/sample_genome.fa

シングルエンドのテストデータをsample_genomeにマッピングする。シーケンスデータはtsv形式で、以下のようになっている（format）。

f:id:kazumaxneo:20180413214026j:plain

マッピングを実行する。

sirfast --search Sample_Data/sample_genome.fa --seq Sample_Data/sample_reads_e0 -e 0 -o Sample_Data/out.sam

--seq Input sequences in fasta/fastq format [file]. If --seq [file] Input sequences in fasta/fastq format [file]. If paired end reads are interleaved, use this option.
-e Maximum allowed hamming distance (default 4% of the read length).
-o Output of the mapped sequences. The default is "output".

out.sam (mapped) とunmappedが出力される。デフォルトでは条件を満たした全ての部位にマッピングされる。ベストマッピングだけに限定したければ--bestをつける。

引用

Fast and Accurate Mapping of Complete Genomics Reads

Donghyuk Lee,†, Farhad Hormozdiari,†, Hongyi Xin,a Faraz Hach, Onur Mutlu, and Can Alkand

Methods. 2015 Jun 1; 0: 3–10.

What is Complete Genomics data and how does it compare with other genomics datasets?

What is Complete Genomics data and how does it compare with other genomics datasets? - Quora