古いサンプルのデータ (fastqやbam) から効率的にアダプターを除く leeHom

　古いDNAが断片化したサンプルからのシーケンスが増えている。しばしば数万年前のサンプルからも抽出される古代のサンプルのDNAは断片化が起きており、うまくDNAを抽出してもサイズが100-bpを超えることは滅多にない。短いDNAをペアードエンドでシーケンスすると、インサート全体がシーケンスされ、さらにアダプターまでシーケンスされることになる（図参照）。

f:id:kazumaxneo:20180121020540j:plain

ResearchGateより転載。

　シーケンス全体がオーバーラップしているので、正しくマージできれば、単純に高いクオリテイのペア側の配列に更新するだけでシーケンス精度を高めることも可能になるが、元々のDNAの質が悪いので（ミスマッチとギャップが発生しうる）、マージは難しいと推測される。

　leeHomは5'側と3'側のアダプターを除去し、ベイジアン最大事後確率アプローチを用いて元のDNA配列を再構成する方法論となる。シミュレーションと、古代のサンプルとして有名なネアンデルタール人のシーケンスデータを使いテストされており、他の方法論より精度が高いという結果が出ている。シングルエンド、ペアードエンドのシーケンスデータに対応している。

公式サイト

https://bioinf.eva.mpg.de/leehom/

インストール

Github

https://github.com/grenaud/leeHom

git clone --recursive https://github.com/grenaud/leeHom.git 
cd leeHom/ 
make 
src/leeHom #動作確認

$ src/leeHom

Usage:

src/leeHom [options] BAMfile

This program takes an unaligned BAM where mates are consecutive

or fastq files and trims and merges reads

You can specify a unaligned bam file or one or two fastq :

-fq1 First fastq

-fq2 Second fastq file (for paired-end)

-fqo Output fastq prefix

-o , --outfile Output (BAM format).

-u Produce uncompressed bam (good for pipe)

--aligned Allow reads to be aligned (default false)

-v , --verbose Turn all messages on (default false)

--log [log file] Print a tally of merged reads to this log file (default only to stderr)

--phred64 Use PHRED 64 as the offset for QC scores (default : PHRED33)

Paired End merging/Single Read trimming options

You can specify either:

--ancientdna ancient DNA (default false)

this allows for partial overlap

or if you know your size length distribution:

--loc Location for lognormal dist. (default none)

--scale Scale for lognormal dist. (default none)

--keepOrig Write original reads if they are trimmed or merged (default false)

Such reads will be marked as PCR duplicates

-f , --adapterFirstRead Adapter that is observed after the forward read (def. Multiplex: AGATCGGAAGAGCACACGTCTGAACTCCAG)

-s , --adapterSecondRead Adapter that is observed after the reverse read (def. Multiplex: AGATCGGAAGAGCGTCGTGTAGGGAAAGAG)

-c , --FirstReadChimeraFilter If the forward read looks like this sequence, the cluster is filtered out.

Provide several sequences separated by comma (def. Multiplex: ACACTCTTTCCCTACACGTCTGAACTCCAG)

-k , --key Key sequence with which each sequence starts. Comma separate for forward and reverse reads. (default '')

-i , --allowMissing Allow one base in one key to be missing or wrong. (default false)

--trimCutoff Lowest number of adapter bases to be observed for single Read trimming (default 1)

パスを通しておく。

マルチコア版

> src/leeHomMulti #-tでthread数を決める

$ src/leeHomMulti

Usage:

leeHomMulti [options] BAMfile

This program takes an unaligned BAM where mates are consecutive

or fastq files and trims and merges reads

You can specify a unaligned bam file or one or two fastq :

-fq1 First fastq

-fq2 Second fastq file (for paired-end)

-fqo Output fastq prefix

-o , --outfile Output (BAM format).

-u Produce uncompressed bam (good for pipe)

--aligned Allow reads to be aligned (default false)

-v , --verbose Turn all messages on (default false)

--log [log file] Print a tally of merged reads to this log file (default only to stderr)

--phred64 Use PHRED 64 as the offset for QC scores (default : PHRED33)

-t [threads] Use multiple cores (default : 1)

Paired End merging/Single Read trimming options

You can specify either:

--ancientdna ancient DNA (default false)

this allows for partial overlap

or if you know your size length distribution:

--loc Location for lognormal dist. (default none)

--scale Scale for lognormal dist. (default none)

--keepOrig Write original reads if they are trimmed or merged (default false)

Such reads will be marked as PCR duplicates

-f , --adapterFirstRead Adapter that is observed after the forward read (def. Multiplex: AGATCGGAAGAGCACACGTCTGAACTCCAG)

-s , --adapterSecondRead Adapter that is observed after the reverse read (def. Multiplex: AGATCGGAAGAGCGTCGTGTAGGGAAAGAG)

-c , --FirstReadChimeraFilter If the forward read looks like this sequence, the cluster is filtered out.

Provide several sequences separated by comma (def. Multiplex: ACACTCTTTCCCTACACGTCTGAACTCCAG)

-k , --key Key sequence with which each sequence starts. Comma separate for forward and reverse reads. (default '')

-i , --allowMissing Allow one base in one key to be missing or wrong. (default false)

--trimCutoff Lowest number of adapter bases to be observed for single Read trimming (default 1)

ラン

テストラン。

fasrqファイルからアダプターを除く。

leeHom -f AGATCGGAAGAGCACACGTCTGAACTCCAG -s GGAAGAGCGTCGTGTAGGGAAAGAGTGTAG --ancientdna -fq1 testData/rawAncientDNA.f1.gz -fq2 testData/rawAncientDNA.f2.gz -fqo testData/outfq

-f Adapter that is observed after the forward read (def. Multiplex: AGATCGGAAGAGCACACGTCTGAACTCCAG)
-s Adapter that is observed after the reverse read (def. Multiplex: AGATCGGAAGAGCGTCGTGTAGGGAAAGAG)
--ancientdna ancient DNA (default false) this allows for partial overlap
-fq1 First fastq-fq1 First fastq
-fq2 Second fastq file (for paired-end)
-fqo Output fastq prefix

ランが終わると下の解析logが表示される。

Total 50000; Merged (trimming) 40540; Merged (overlap) 7376; Kept PE/SR 1955; Trimmed SR 0; Adapter dimers/chimeras 129; Failed Key 0

47,916のペアードエンドがマージされ（=40540+7376)、1955は重複が見つからずマージされなかったことを意味している。またインサートなしにアダプターが２つタンデムに結合したキメラは129見つかった。

出力はマージされたfastq.gzと、マージされなかったr1.fastq.gz、r2.fastq.gz、そして条件が満たされなかったfailのfastq.gzが出力される。

bamファイルからアダプターを除く。

leeHom -f AGATCGGAAGAGCACACGTCTGAACTCCAG -s GGAAGAGCGTCGTGTAGGGAAAGAGTGTAG --ancientdna -o testData/reconsAncientDNA.bam testData/rawAncientDNA.bam

-o , --outfile Output (BAM format).

引用

leeHom: adaptor trimming and merging for Illumina sequencing reads

Gabriel Renaud, Udo Stenzel, and Janet Kelso.

Nucleic Acids Res. 2014 Oct 13; 42(18): e141.

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

古いサンプルのデータ (fastqやbam) から効率的にアダプターを除く leeHom