アセンブリのエラーやギャップ（NNN）を検出し、ポリッシュしたFASTAを出力するPilon

2018 8/31 タイトルと紹介文修正、11/5 タイトル修正

2019 1/11 追記、3/3ラストにnanopore long read追記、4/12 ラストにpacbio long read追記、6/12 リンク追記、6/27 merged.fq追記、7/15 追記、9/29 追記、10/28インストール追記

2021 2/28 関連論文追記、3/1 1.24のリンク追加、4/12 help更新、tips追加

　何百というバクテリアゲノムをシーケンスできる時代になったが、それに伴い大量のデータを効率よく分析できる堅牢でスケール変化に対応できる手法が求められている。Pilonは全自動で動作するバクテリア向けのリファレンスベースのアセンブルツールである。draftゲノムかfinishゲノムのリファレンスとそのbamを指定することで、リファレンスと構造的に異なる部位を識別し、解析株のゲノムをリファンレンスから再構築して出力する。この時、構造が異なった部位はバリアントして出力する。リファレンスベースでアセンブルを行なっているため、SNVやsmall indelのみならず、リードの長さ以上の構造変化も検出することができる。また、ドラフトゲノムのポリッシュにも使われる。ポリッシュの過程で、ミスマッチ、indel、SV、scaffoldsのNNN...などが修復される。全てのプロセスは自動化されているため、ユーザーはFASTAとbamを指定するだけで、解析データのゲノムを得ることが可能になっている。2014年に論文が発表された。

公式サイト

http://software.broadinstitute.org/software/pilon/

ダウンロード

Github

Javaのコンパリル済みバイナリも配布されている。バイナリをダウンロードして、実行権をつける。

#1.24をダウンロード（リリース）（2021 3/1時点最新）
wget https://github.com/broadinstitute/pilon/releases/download/v1.24/pilon-1.24.jar
chmod u+x pilon-1.24.jar


# 1つ前1.23をダウンロード
wget https://github.com/broadinstitute/pilon/releases/download/v1.23/pilon-1.23.jar
chmod u+x pilon-1.23.jar

#biocondaのサポートがあるのでcondaで導入できますが、エラーが起きた時はjavaの権限でダウンロードした.jarの実行ファイルを直接叩いてください。


#bioconda (link)
conda install -c bioconda -y pilon

> java -Xmx40G -jar pilon-1.24.jar --help

$ java -Xmx40G -jar pilon-1.24.jar --help

Pilon version 1.24 Thu Jan 28 13:00:45 2021 -0500

Usage: pilon --genome genome.fasta [--frags frags.bam] [--jumps jumps.bam] [--unpaired unpaired.bam]

[...other options...]

pilon --help for option details

INPUTS:

--genome genome.fasta

The input genome we are trying to improve, which must be the reference used

for the bam alignments. At least one of --frags or --jumps must also be given.

--frags frags.bam

A bam file consisting of fragment paired-end alignments, aligned to the --genome

argument using bwa or bowtie2. This argument may be specifed more than once.

--jumps jumps.bam

A bam file consisting of jump (mate pair) paired-end alignments, aligned to the

--genome argument using bwa or bowtie2. This argument may be specifed more than once.

--unpaired unpaired.bam

A bam file consisting of unpaired alignments, aligned to the --genome argument

using bwa or bowtie2. This argument may be specifed more than once.

--bam any.bam

A bam file of unknown type; Pilon will scan it and attempt to classify it as one

of the above bam types.

--nanopore ont.bam

A bam file containing Oxford Nanopore read alignments. Experimental.

--pacbio pb.bam

A bam file containing Pacific Biosciences read alignments. Experimental.

OUTPUTS:

--output prefix

Prefix for output files

--outdir directory

Use this directory for all output files.

--changes

If specified, a file listing changes in the <output>.fasta will be generated.

--vcf

If specified, a vcf file will be generated

--vcfqe

If specified, the VCF will contain a QE (quality-weighted evidence) field rather

than the default QP (quality-weighted percentage of evidence) field.

--tracks

This options will cause many track files (*.bed, *.wig) suitable for viewing in

a genome browser to be written.

CONTROL:

--variant

Sets up heuristics for variant calling, as opposed to assembly improvement;

equivalent to "--vcf --fix all,breaks".

--chunksize

Input FASTA elements larger than this will be processed in smaller pieces not to

exceed this size (default 10000000).

--diploid

Sample is from diploid organism; will eventually affect calling of heterozygous SNPs

--fix fixlist

A comma-separated list of categories of issues to try to fix:

"snps": try to fix individual base errors;

"indels": try to fix small indels;

"gaps": try to fill gaps;

"local": try to detect and fix local misassemblies;

"all": all of the above (default);

"bases": shorthand for "snps" and "indels" (for back compatibility);

"none": none of the above; new fasta file will not be written.

The following are experimental fix types:

"amb": fix ambiguous bases in fasta output (to most likely alternative);

"breaks": allow local reassembly to open new gaps (with "local");

"circles": try to close circlar elements when used with long corrected reads;

"novel": assemble novel sequence from unaligned non-jump reads.

--dumpreads

Dump reads for local re-assemblies.

--duplicates

Use reads marked as duplicates in the input BAMs (ignored by default).

--iupac

Output IUPAC ambiguous base codes in the output FASTA file when appropriate.

--nonpf

Use reads which failed sequencer quality filtering (ignored by default).

--targets targetlist

Only process the specified target(s). Targets are comma-separated, and each target

is a fasta element name optionally followed by a base range.

Example: "scaffold00001,scaffold00002:10000-20000" would result in processing all of

scaffold00001 and coordinates 10000-20000 of scaffold00002.

If "targetlist" is the name of a file, each line will be treated as a target

specification.

--verbose

More verbose output.

--debug

Debugging output (implies verbose).

--version

Print version string and exit.

HEURISTICS:

--defaultqual qual

Assumes bases are of this quality if quals are no present in input BAMs (default 10).

--flank nbases

Controls how much of the well-aligned reads will be used; this many bases at each

end of the good reads will be ignored (default 10).

--gapmargin

Closed gaps must be within this number of bases of true size to be closed (100000)

--K

Kmer size used by internal assembler (default 47).

--mindepth depth

Variants (snps and indels) will only be called if there is coverage of good pairs

at this depth or more; if this value is >= 1, it is an absolute depth, if it is a

fraction < 1, then minimum depth is computed by multiplying this value by the mean

coverage for the region, with a minumum value of 5 (default 0.1: min depth to call

is 10% of mean coverage or 5, whichever is greater).

--mingap

Minimum size for unclosed gaps (default 10)

--minmq

Minimum alignment mapping quality for a read to count in pileups (default 0)

--minqual

Minimum base quality to consider for pileups (default 0)

--nostrays

Skip making a pass through the input BAM files to identify stray pairs, that is,

those pairs in which both reads are aligned but not marked valid because they have

inconsistent orientation or separation. Identifying stray pairs can help fill gaps

and assemble larger insertions, especially of repeat content. However, doing so

sometimes consumes considerable memory.

ラン

リファレンス（近縁種でも可能）と、そのbamファイルが必要である。mate-pairやシングルエンドのbamにも対応している。

java -Xmx16G -jar pilon-1.23.jar --genome ref.fa --frags pair.bam --vcf --tracks --changes --verbose --threads 8 --outdir output_dir

INPUTS:

--genome genome.fasta　The input genome we are trying to improve, which must be the reference used for the bam alignments. At least one of --frags or --jumps must also be given.
--frags frags.bam　A bam file consisting of fragment paired-end alignments, aligned to the --genome argument using bwa or bowtie2. This argument may be specifed more than once.
--jumps　A bam file consisting of jump (mate pair) paired-end alignments, aligned to the --genome argument using bwa or bowtie2. This argument may be specifed more than once.
--unpaired　A bam file consisting of unpaired alignments, aligned to the --genome argument using bwa or bowtie2. This argument may be specifed more than once.
--bam 　A bam file of unknown type; Pilon will scan it and attempt to classify it as one of the above bam types.

OUTPUTS:

--output　Prefix for output files
--outdir　Use this directory for all output files.
--changes　If specified, a file listing changes in the <output>.fasta will be generated.
--vcf　 If specified, a vcf file will be generated
--vcfqe　If specified, the VCF will contain a QE (quality-weighted evidence) field rather than the default QP (quality-weighted percentage of evidence) field.
--tracks　This options will cause many track files (*.bed, *.wig) suitable for viewing in a genome browser to be written.

解析が終わると、指定したディレクトリにFASTAと、リファレンスと異なる部位を示したファイル（.change）が出力される。また、--taracksもつけていれば、異なる部位を記載したbedやGC、カバレッジ、copy number variationなどのファイルも出力される。

変更箇所が記載されたpilon.cganges

f:id:kazumaxneo:20180831133820j:plain

出力解説

https://github.com/broadinstitute/pilon/wiki/Output-File-Descriptions

追記

de novo assemblyで得たcontigを使えば、polishも行える。

まずショートリードをcontigにマッピングする。

#マッピング
bwa mem -t 20 contig.fasta illumina_read1.fq illumina_read2.fq |samtools view -@ 20 -bS - > aln.bam

#ソート
samtools sort -@ 20 aln.bam > aln_sorted.bam

#index
samtools index aln_sorted.bam

#ポリッシュ
java -Xmx16G -jar pilon-1.23.jar --genome racon_polished3.fasta --frags aln_sorted.bam --vcf --changes --verbose --threads 20 --outdir  polish_dir

tracks情報をigvで確認する。

cd directory/
igv -g ../ref.fa pair.bam,,pilonBadCoverage.wig,pilonChanges.wig,pilonClippedAlignments.wig,pilonCopyNumber.wig,pilonCoverage.wig,pilonDeltaCoverage.wig,pilonDipCoverage.wig,pilonGC.wig,pilonPctBad.wig,pilonPhysicalCoverage.wig,pilonPilon.bed,pilonUnconfirmed.wig,pilonWeightedMq.wig,pilonWeightedQual.wig &

リファレンスは出力ではなく、参照したレファンレンスのfastaの方を指定する。

.changesファイルの部位をIGVの窓にコピペすれば移動が速い（IGVフォーマットに対応している）。

SNV部位

f:id:kazumaxneo:20171001224403j:plain

large Insertion部位

f:id:kazumaxneo:20171001224207j:plain

large deletion部位

f:id:kazumaxneo:20171001224318j:plain

copy number variation部位

f:id:kazumaxneo:20171001224507j:plain

ただしこの部位は検出できていない。

他にも k-merの値、領域、差分、欠損のcoverage閾値など多様なオプションがある。詳細は公式サイトで確認してください。

https://github.com/broadinstitute/pilon/wiki/Requirements-&-Usage

主観ですがエンドユーザにとって良いツールだと思います。ただし、この手法で認識されない変異もあります。複雑な構造変化の検出はショートリードでは難しいでしょう。

追記

他のツールと比較されています。とても参考になります。

What is a good genome assembly? – Albertsen Lab

2019 4/12追記

v1.23にはlong readを使ったpolishオプションが実装されています。

ショートリードのbamとロングリードのbamを両方指定する。#vcfは遅くなるので消す。

nanopre

java -Xmx24G -jar pilon-1.23.jar --genome assembly.fasta --frags short_resd.bam --nanopore long_read.bam --tracks --changes --verbose --threads 20 --outdir pilon_output_dir

pacbio

java -Xmx24G -jar pilon-1.23.jar --genome assembly.fasta --frags short_resd.bam --pacbio long_read.bam --tracks --changes --verbose --threads 20 --outdir pilon_output_dir

テストした限り、ショトーリードもロングリードも両方あった方がpolishできる部位はかなり増えます。ロングリード情報があった方がNNNの解消にも有利なはずです。

2019 6/27追記

short-readのmerged.fqを使うと、リードが長い分さらにエラーが修正できるかもしれない。試した限り、マージしたfastqでしか修正できないindelエラーがいくつかあった（アラインメントを目視で見る限り、修正後の方が正しかったのでうまくワークしていると思われる）。

#creates merged.fq (BBtools)
bbmerge.sh in1=reads1.fq.gz in2=reads2.fq.gz out=merged.fq.gz

#mapping
minimap2 -t 18 -ax sr ref.fa merged.fq.gz | samtools sort -@ 18 -O BAM - > merged.bam && samtools index -@ 12 merged.bam

java -Xmx24G -jar pilon-1.23.jar --genome assembly.fasta --frags short.bam --pacbio merged.bam --tracks --changes --verbose --threads 20 --outdir pilon_output_dir

merged.fqは--unpairedで指定する。注意して使って下さい。

diploidsゲノムなら--diploidをつける。

--diploid Sample is from diploid organism; will eventually affect calling of heterozygous SNPs

複雑な領域については、polishingすることで結果が悪化する可能性もあります。以下の記事を読んでみて下さい。

追記 - メタゲノムアセンブリへの適用

メタゲノムアセンブリへのmappingを視覚化すると、ホモのミスマッチが大量に生じていることが多い。これをpolishすることで、下流の解析の精度が向上する。pilonはメタゲノムのpolsih向けに設計されておらず、以下のisuueでも否定されている。

https://github.com/broadinstitute/pilon/issues/31

しかし、実際には比較的上手く機能する。polish前後で評価し、幾つかのmetricで改善しているなら使うのも手かもしれません。

2021 2/28 追記 - --vcfqe

以下の論文：臨床の複雑な環境を想定すると、straightforwardな薬剤耐性変異ばかりが耐性菌に生じているわけではなく、代謝系の変異からの代謝ダイナミクスの変化で生じているのではないか、と言うことを調べた研究。その研究で、単離サンプルとポピュレーションサンプルのレアバリアントのコールにpilonが --vcfと--vcfqe フラグを立てて使用されている。

Clinically relevant mutations in core metabolic genes confer antibiotic resistance

https://science.sciencemag.org/content/371/6531/eaba0862

--vcfqeの解説

--vcfqe　If specified, the VCF will contain a QE (quality-weighted evidence) field rather than the default QP (quality-weighted percentage of evidence) field.

このフラグを立ててランすると、全ポジションのコール情報と QE field を含む巨大なvcfが出力される。これをパースして情報を得ているものと思われる。

2021 4/12

ホモに見えても、ヘテロな部位とみなされてNが研磨されないことがあります。どうしてもN以外の塩基をアサインしたけれれば、研磨対象の配列のNを適当な塩基（認識できるように固定の塩基にする）に置換し、それからpilonをランしてみて下さい。

pilon-1.24.jar --genome changeN2A.fasta --frags input.bam --changes --fix amb,snps --threads 12 --outdir outdir

推奨しませんが、この方法であればNは消えます。

引用

Pilon: An Integrated Tool for Comprehensive Microbial Variant Detection and Genome Assembly Improvement

Bruce J. Walker , Thomas Abeel , Terrance Shea, Margaret Priest, Amr Abouelliel, Sharadha Sakthikumar, Christina A. Cuomo, Qiandong Zeng, Jennifer Wortman, Sarah K. Young, Ashlee M. Earl

PLoS One. 2014 Nov 19;9(11):e112963.