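Genome assembly using next-generation sequencing technologies is now indispensable to biological research, but many steps of both sequencing and assembly remain error-prone. Unfortunately, these errors propagate to downstream analyses and can substantially affect results and conclusions. Such errors are often appreciated when working with diploid genotype data, but modern reference assemblies (which are represented as haploid sequences) lack a concise quality assessment for every position. Referee is a program that uses diploid genotype quality information to assign a quality score to every position in a haploid assembly. Referee aims to provide concise, Phred-like quality information for an assembly in FASTQ format so that low-quality sites can be filtered easily. Referee can also output the quality scores in BED format, which can be visualized as a track in most genome browsers. Referee is freely available at https://gwct.github.io/referee/.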
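To make "Phred-like scale" and "filtering low-quality sites" concrete, here is a minimal sketch of the general Phred convention, Q = -10 * log10(p), together with a simple threshold filter. This is the standard Phred definition, not necessarily the exact formula Referee uses; the function names, the example scores, and the threshold of 20 are all assumptions made for illustration.

```python
import math


def phred_from_error_prob(p_error: float) -> float:
    """Convert an error probability to a Phred-scaled score: Q = -10 * log10(p)."""
    if not 0.0 < p_error <= 1.0:
        raise ValueError("p_error must be in the interval (0, 1]")
    return -10.0 * math.log10(p_error)


def error_prob_from_phred(q: float) -> float:
    """Inverse conversion: p = 10 ** (-Q / 10)."""
    return 10.0 ** (-q / 10.0)


def low_quality_positions(scores, min_score):
    """Return the 0-based indices of positions whose score is below min_score."""
    return [i for i, q in enumerate(scores) if q < min_score]


if __name__ == "__main__":
    # Made-up per-position scores, for illustration only.
    example_scores = [42, 3, 0, 57, 18]
    print(low_quality_positions(example_scores, min_score=20))
```

On this scale, a score of 30 corresponds to an error probability of 1 in 1,000, so a cutoff such as 20 (1 in 100) is a judgment call that depends on the downstream analysis.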
Homepage
https://gwct.github.io/referee/
Usage
https://gwct.github.io/referee/readme.html
Installation
git clone https://github.com/gwct/referee.git
cd referee/
> python referee.py -h
# =================================================
__
_ __ ___ / _| ___ _ __ ___ ___
| '__/ _ \ |_ / _ \ '__/ _ \/ _ \
| | | __/ _| __/ | | __/ __/
|_| \___|_| \___|_| \___|\___|
Reference genome quality score calculator.
usage: referee.py [-h] [-ref REF_FILE] [-gl GL_FILE] [-d OUTDIR]
[-prefix PREFIX] [--overwrite] [-p PROCESSES]
[-l LINES_PER_PROC] [--pileup] [--fastq] [--fasta] [--bed]
[--haploid] [--correct] [--mapped] [--mapq] [--raw]
[--quiet] [--version]
Referee: Reference genome quality scoring.
optional arguments:
-h, --help show this help message and exit
-ref REF_FILE The FASTA assembly to which you have mapped your reads.
-gl GL_FILE The file containing the genotype likelihood calculations
or a pileup file (be sure to set --pileup!).
-d OUTDIR An output directory for all files associated with this
run. Will be created if it doesn't exist. Default:
referee-[date]-[time]
-prefix PREFIX A prefix for all files associated with this run. Default:
referee-[date]-[time]
--overwrite Set this option to explicitly overwrite files within a
previous output directory.
-p PROCESSES The number of processes Referee should use. Default: 1.
-l LINES_PER_PROC The number of lines to be read per process. Decreasing
may reduce memory usage at the cost of slightly higher
run times. Default: 100000.
--pileup Set this option if your input file(s) are in pileup
format and Referee will calculate genotype likelihoods
for you.
--fastq Set this option to output in FASTQ format in addition to
the default tab delimited format.
--fasta Set this option to output the corrected sequence in FASTA
format in addition to the default tab delimited format.
Can only be set with --corrected.
--bed Set this option to output in BED format in addition to
the default tab delimited format. BED files can be viewed
as tracks in genome browsers.
--haploid Set this option if your input data are from a haploid
species. Referee will limit its likelihood calculations
to the four haploid genotypes. Can only be used with
--pileup.
--correct Set this option to allow Referee to suggest alternate
reference bases for sites that score 0.
--mapped Set this to calculate scores only for positions that have
reads mapped to them.
--mapq Set with --pileup to indicate whether to consider mapping
quality scores in the final score calculation. These
should be in the seventh column of the pileup file.
--raw Set this flag to output the raw score as the fourth
column in the tabbed output.
--quiet Set this flag to prevent Referee from reporting detailed
information about each step.
--version Simply print the version and exit. Can also be called as
'-version', '-v', or '--v'
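The help text above states two constraints between options: --haploid can only be used with --pileup, and --fasta can only be set together with the correction option (--correct). As a rough sketch, a small helper could assemble the argument list and check those constraints before Referee is launched. The helper name `build_referee_command` and the file names are hypothetical; only the flags themselves come from the help output.

```python
import shlex


def build_referee_command(gl_file, ref_file, pileup=False, haploid=False,
                          correct=False, fasta=False, fastq=False, bed=False,
                          processes=1, outdir=None):
    """Assemble a referee.py argument list from flags listed in the help text.

    This helper is hypothetical; it only builds the command and does not run it.
    """
    if haploid and not pileup:
        raise ValueError("--haploid can only be used with --pileup")
    if fasta and not correct:
        raise ValueError("--fasta can only be set together with --correct")

    cmd = ["python", "referee.py", "-gl", gl_file, "-ref", ref_file]
    # Append each boolean flag only when it was requested.
    for enabled, flag in [(pileup, "--pileup"), (haploid, "--haploid"),
                          (correct, "--correct"), (fasta, "--fasta"),
                          (fastq, "--fastq"), (bed, "--bed")]:
        if enabled:
            cmd.append(flag)
    if processes != 1:
        cmd += ["-p", str(processes)]
    if outdir is not None:
        cmd += ["-d", outdir]
    return cmd


if __name__ == "__main__":
    # "reads.pileup", "ref.fasta" and "referee-out" are placeholder names.
    cmd = build_referee_command("reads.pileup", "ref.fasta", pileup=True,
                                fastq=True, bed=True, processes=4,
                                outdir="referee-out")
    print(shlex.join(cmd))
    # To actually run it: subprocess.run(cmd, check=True) from the referee/ directory.
```

Building the command as a list rather than a single string avoids shell-quoting problems when file paths contain spaces.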
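How to run
1. Map the reads (the ones used to build the genome) to the finished assembly.
2. Either create a pileup file so that Referee can calculate the genotype likelihoods itself, or precompute the genotype log-likelihoods for all 10 diploid genotypes (AA AC AG AT CC CG CT GG GT TT) at every position in the genome (ANGSD is recommended).
3. Run Referee.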
python referee.py -gl [genotype likelihood file] -ref ref.fasta --pileup
*When the input is a precomputed genotype log-likelihood file, omit the --pileup flag. With --pileup set, the file passed to -gl is the pileup file.
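The 10 genotypes listed in step 2 are the unordered pairs of the four bases. The short sketch below enumerates them, along with the four haploid genotypes that the --haploid option restricts the calculation to. It only illustrates the genotype space; it does not compute likelihoods, and the function names are made up for this example.

```python
from itertools import combinations_with_replacement

BASES = "ACGT"


def diploid_genotypes(bases=BASES):
    """All unordered pairs of bases, in the order AA, AC, ..., TT."""
    return ["".join(pair) for pair in combinations_with_replacement(bases, 2)]


def haploid_genotypes(bases=BASES):
    """The four single-base genotypes considered for haploid data."""
    return list(bases)


if __name__ == "__main__":
    print(" ".join(diploid_genotypes()))
    print(" ".join(haploid_genotypes()))
```

A precomputed input would therefore need one log-likelihood per genotype, ten values in total, for every position in the assembly.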
Citation
Referee: Reference Assembly Quality Scores
Gregg W C Thomas, Matthew W Hahn
Genome Biology and Evolution, Volume 11, Issue 5, May 2019, Pages 1483–1486