配列ベースの生物学的コミュニティの特徴付けの過程において、配列の教師ありのtaxonomic classification は重要な目標である。多数の配列分類ソフトウェアプログラムは、配列類似性を測り、そして配列類似性と分類学的所属との間の関係をモデル化することによってこれを達成する。(一部略)
配列組成に基づく分類の間、関心のあるマーカーに隣接する無関係の配列領域は、SINTAX、RDPナイーブベイズ分類器、VSEARCHなどの多くのツールで使用されるk-mer分布を偏らせる(Wang et al、2007; Edgar 2016; Rognes et al、 2017)。上位N個の配列の距離分布を分析することによって分類学的境界を推定する分類器(Huson et al、2011; Bengtsson-Palme et al、2015; Gao et al、2017)では、配列重複の保持は距離分布の特徴付けに偏りがある。ただし、2つの配列が同じ分類群からのものであることを保証せずに分類群的に単純な配列重複の除去を行うと、 2つの分類群間の変異に関してデータベースから重要な情報が削除されることになる。
これらの問題を解決するために、MetaCuratorツールキットを開発した。メインツールであるIterRazorは、例えば全ゲノムアセンブリを含む任意の入手可能な供給源から標的マーカーリファレンス配列を同定および抽出する。与えられたマーカーのリファレンスを抽出した後、DerepByTaxonomyは分類学を意識したアプローチを使ってリファレンス配列を逆複製するために使われる。追加のPythonツールおよびUnixシェルスクリプトは、階層的キュレーションのための分類系統のフォーマット化およびNCBIヌクレオチドデータベース内に一般的に見られる分類系統アーチファクトの除去を容易にする(NCBI Resource Coordinators、2018)。
インストール
ubuntu16.04にて、conda createで2.7.14の仮想環境を作ってテストした。
依存
MetaCurator is a command-line only toolkit which runs on typical Unix/Linux environments. It requires the following software to be installed and globally accessible.
- Python 2.7.5 or greater
- Perl 5.16.0 or compatible
- VSEARCH 2.8.1 or compatible
- HMMER3 3.1b2 or compatible
- MAFFT 7.270 or compatible
#仮想環境を作って依存を導入
conda create -n metacurator -c bioconda python=2.7.14 vsearch hmmer mafft
本体 Github
wget https://github.com/RTRichar/MetaCurator/archive/MetaCurator_v1.0beta.1.tar.gz
tar xzf MetaCurator_v1.0beta.1.tar.gz
cd MetaCurator-MetaCurator_v1.0beta.1/
export PATH=$PATH:$PWD
> IterRazor.py -h
# IterRazor.py -h
usage: IterRazor is a tool which uses hidden Markov models to identify and extract precise DNA barcode sequences of interest from the full diversity of reference sequence data available through sources such as NCBI (e.g. whole genome sequences, whole chloroplast sequences or Sanger seuqnces representative of variable portions of the locus of interest)
[-h] -r REFERENCESFILE -i INPUTFILE -o OUTPUTFILE [-t THREADS]
[--SaveTemp SAVETEMP] [-e HMMEVALUE] [-is ITERATIONSERIES]
[-cs COVERAGESERIES]
optional arguments:
-h, --help show this help message and exit
-r REFERENCESFILE, --ReferencesFile REFERENCESFILE
Fasta file of sequences trimmed to exact region of
interest (header format should conform to --InputFile
header format). This should be 5 to 10 sequences
spanning a diversity of the phylogenetic tree of
interest, so as to build a relatively broadly
inclusive HMM from the first iteration (REQUIRED)
-i INPUTFILE, --InputFile INPUTFILE
Input file of reference sequences to be extracted
(e.g. whole chloroplast genome sequences from NCBI).
Sequence headers should include a unique sequence
identifier, NCBI accessions are recomended, preceded
only by '>' (REQUIRED)
-o OUTPUTFILE, --OutPutFile OUTPUTFILE
Output file for extracted sequences (REQUIRED)
-t THREADS, --threads THREADS
Number of processors to use
--SaveTemp SAVETEMP Option for saving alignments and HMM files produced
during extraction
-e HMMEVALUE, --HmmEvalue HMMEVALUE
Evalue threshold for hmm search
-is ITERATIONSERIES, --IterationSeries ITERATIONSERIES
The IterationSeries and CoverageSeries arguments
dictate the number of iterative searches to run and
the minimum HMM coverege required for each round of
extracting. This argument consists of a comma-
separated list specifying the number of searches to
run during each round of extraction. The list can vary
in length, changing the total number of extraction
rounds conducted. However, if a search iteration fails
to yield any new reference sequence amplicons,
IterRazor will break out of that round of searching
and move to the next. By default, the software
conducts four rounds of extraction with 20, 10, 5 and
5 search iterations per round (-is 20,10,5,5).
-cs COVERAGESERIES, --CoverageSeries COVERAGESERIES
This argument consists of a comma-separated list
specifying the minimum HMM coverage required per round
for an extracted amplicon reference sequence to be
added to the reference sequence fasta. The list can
vary in length but must be equal in length to the
IterationSeries list. By default, the software
conducts four rounds of extraction with coverage
limits of 1.0, 0.95, 0.9 and 0.65 for each respective
search round (-cs 1.0,0.95,0.9,0.65).
> MetaCurator.py -h
# MetaCurator.py -h
usage: MetaCurator.py [-h] -r REFERENCESFILE -i INPUTFILE -it INTAX -ot OUTTAX
-of OUTFASTA [-t THREADS] [--SaveTemp SAVETEMP]
[-e HMMEVALUE] [-tf TAXONOMIZRFORMAT] [-ct CLEANTAX]
[-is ITERATIONSERIES] [-cs COVERAGESERIES] [-mh MAXHITS]
[-di ITERATIONS]
optional arguments:
-h, --help show this help message and exit
required arguments:
-r REFERENCESFILE, --ReferencesFile REFERENCESFILE
Fasta file of sequences trimmed to exact region of
interest (header format should conform to inputfile
header format). This should be approximately 10
sequences spanning a diversity of the phylogenetic
tree of interest, so as to build an appropriately
inclusive HMM from the first iteration.
-i INPUTFILE, --InputFile INPUTFILE
Input file of reference sequences to be extracted
(sequence headers should include a unique sequence
identifier, NCBI accessions are recomended, preceded
only by '>')
-it INTAX, --InTax INTAX
Input taxonomic lineages (a tab-delimited file with
the first column containing unique sequence
identifiers compatible with the files provided under
'-r' and '-i'). Normally, this would be in
Metaxa2-compatible format, however, if lineages come
directly from Taxonomizr, you can skip reformatting
and declare '-tf True.' See below for more details.
-ot OUTTAX, --OutTax OUTTAX
Filename for output taxonomic lineages
-of OUTFASTA, --OutFasta OUTFASTA
Filename for output fasta formatted sequences
universal optional arguments:
-t THREADS, --threads THREADS
Number of processors to use
--SaveTemp SAVETEMP Option for saving alignments and HMM files produced
during extraction and dereplication (True or False)
-tf TAXONOMIZRFORMAT, --TaxonomizrFormat TAXONOMIZRFORMAT
Specify if tax lineages are in Taxonomizr format and
must first be converted to Metaxa2-compatible tab-
delimited format (True or False)
-ct CLEANTAX, --CleanTax CLEANTAX
Specify if tax lineages are to be cleaned of common
NCBI artifacts and revised at unresolved midpoints
(True or False)
IterRazor optional arguments:
-e HMMEVALUE, --HmmEvalue HMMEVALUE
Evalue threshold for HMM search
-is ITERATIONSERIES, --IterationSeries ITERATIONSERIES
The IterationSeries and CoverageSeries arguments
dictate the number of iterative searches to run and
the minimum HMM coverege required for each round of
extracting. This argument consists of a comma-
separated list specifying the number of searches to
run during each round of extraction. The list can vary
in length, changing the total number of extraction
rounds conducted. However, if a search iteration fails
to yield any new reference sequence amplicons,
IterRazor will break out of that round of searching
and move to the next. By default, the software
conducts four rounds of extraction with 20, 10, 5 and
5 search iterations per round (i.e. '-is 20,10,5,5').
-cs COVERAGESERIES, --CoverageSeries COVERAGESERIES
This argument consists of a comma-separated list
specifying the minimum HMM coverage required per round
for an extracted amplicon reference sequence to be
added to the reference sequence fasta. The list can
vary in length but must be equal in length to the
IterationSeries list. By default, the software
conducts four rounds of extraction with coverage
limits of 1.0, 0.95, 0.9 and 0.65 for each respective
search round (i.e. '-cs 1.0,0.95,0.9,0.65').
DerepByTaxonomy-specific optional arguments:
-mh MAXHITS, --MaxHits MAXHITS
Max hits to find before ending search for any given
query using VSEARCH (default: 1000)
-di ITERATIONS, --Iterations ITERATIONS
Number of iterative VSEARCH searches to run (default:
10). This is important in case the number of
replicates for a given sequence greatly exceeds '--
MaxHits'
テストラン
1、メタバーコーディングする領域を指定するため手動でトリミングして集めてきた8つのrbcL塩基配列"rbcL_Reps.fa"を入力とし、NCBIでrbcLとアノテートされている数千のrbcL塩基配列"rbcL_sample.fa"をキュレーションする。
cd TestMetaCurator/
MetaCurator.py -r rbcL_Reps.fa -i rbcL_sample.fa -it rbcL_sample.tax -tf True -ct True -of Test.fa -ot Test.tax
元のデータベースrbcL_sample.fa平均サイズは850.9-bpで2500配列あったが、キュレーション後は1691配列、平均476.6-bpに減少した。
> seqkit stats rbcL_sample.fa Test.fa
# seqkit stats rbcL_sample.fa Test.fa
file format type num_seqs sum_len min_len avg_len max_len
rbcL_sample.fa FASTA DNA 2,500 2,127,215 94 850.9 8,413
Test.fa FASTA DNA 1,691 805,958 319 476.6 494
Taxnomy情報を表すTest.tax も出力される。
上ではNCBIに登録されているrbcL配列群rbcL_sample.fa を-iで指定しているが、実際のデータ解析(メタバーコーディングのプライマー作成、増幅領域のマルチプルアラインメントなど)で使用するときは、ゲノム、メタゲノムのアセンブリなどを-iで指定する。
引用
MetaCurator: A hidden Markov model-based toolkit for extracting and curating sequences from taxonomically-informative genetic markers
Rodney T. Richardson, Douglas B. Sponsler, Harper McMinn-Sauder, Reed M. Johnson
bioRxiv preprint first posted online Jun. 18, 2019