メタバーコディングのデータベース配列キュレーションなどを行うツールキット MetaCurator

　配列ベースの生物学的コミュニティの特徴付けの過程において、配列の教師ありのtaxonomic classification は重要な目標である。多数の配列分類ソフトウェアプログラムは、配列類似性を測り、そして配列類似性と分類学的所属との間の関係をモデル化することによってこれを達成する。（一部略）

　配列組成に基づく分類の間、関心のあるマーカーに隣接する無関係の配列領域は、SINTAX、RDPナイーブベイズ分類器、VSEARCHなどの多くのツールで使用されるk-mer分布を偏らせる（Wang et al、2007; Edgar 2016; Rognes et al、 2017）。上位N個の配列の距離分布を分析することによって分類学的境界を推定する分類器（Huson et al、2011; Bengtsson-Palme et al、2015; Gao et al、2017）では、配列重複の保持は距離分布の特徴付けに偏りがある。ただし、2つの配列が同じ分類群からのものであることを保証せずに分類群的に単純な配列重複の除去を行うと、 2つの分類群間の変異に関してデータベースから重要な情報が削除されることになる。
　これらの問題を解決するために、MetaCuratorツールキットを開発した。メインツールであるIterRazorは、例えば全ゲノムアセンブリを含む任意の入手可能な供給源から標的マーカーリファレンス配列を同定および抽出する。与えられたマーカーのリファレンスを抽出した後、DerepByTaxonomyは分類学を意識したアプローチを使ってリファレンス配列を逆複製するために使われる。追加のPythonツールおよびUnix シェルスクリプトは、階層的キュレーションのための分類系統のフォーマット化およびNCBI ヌクレオチドデータベース内に一般的に見られる分類系統アーチファクトの除去を容易にする（NCBI Resource Coordinators、2018）。

インストール

ubuntu16.04にて、conda createで2.7.14の仮想環境を作ってテストした。

依存

MetaCurator is a command-line only toolkit which runs on typical Unix/Linux environments. It requires the following software to be installed and globally accessible.

Python 2.7.5 or greater
Perl 5.16.0 or compatible
VSEARCH 2.8.1 or compatible
HMMER3 3.1b2 or compatible
MAFFT 7.270 or compatible

#仮想環境を作って依存を導入 
conda create -n metacurator -c bioconda python=2.7.14 vsearch hmmer mafft

本体　Github

wget https://github.com/RTRichar/MetaCurator/archive/MetaCurator_v1.0beta.1.tar.gz
tar xzf MetaCurator_v1.0beta.1.tar.gz
cd MetaCurator-MetaCurator_v1.0beta.1/
export PATH=$PATH:$PWD

> IterRazor.py -h

# IterRazor.py -h

usage: IterRazor is a tool which uses hidden Markov models to identify and extract precise DNA barcode sequences of interest from the full diversity of reference sequence data available through sources such as NCBI (e.g. whole genome sequences, whole chloroplast sequences or Sanger seuqnces representative of variable portions of the locus of interest)

[-h] -r REFERENCESFILE -i INPUTFILE -o OUTPUTFILE [-t THREADS]

[--SaveTemp SAVETEMP] [-e HMMEVALUE] [-is ITERATIONSERIES]

[-cs COVERAGESERIES]

optional arguments:

-h, --help show this help message and exit

-r REFERENCESFILE, --ReferencesFile REFERENCESFILE

Fasta file of sequences trimmed to exact region of

interest (header format should conform to --InputFile

header format). This should be 5 to 10 sequences

spanning a diversity of the phylogenetic tree of

interest, so as to build a relatively broadly

inclusive HMM from the first iteration (REQUIRED)

-i INPUTFILE, --InputFile INPUTFILE

Input file of reference sequences to be extracted

(e.g. whole chloroplast genome sequences from NCBI).

Sequence headers should include a unique sequence

identifier, NCBI accessions are recomended, preceded

only by '>' (REQUIRED)

-o OUTPUTFILE, --OutPutFile OUTPUTFILE

Output file for extracted sequences (REQUIRED)

-t THREADS, --threads THREADS

Number of processors to use

--SaveTemp SAVETEMP Option for saving alignments and HMM files produced

during extraction

-e HMMEVALUE, --HmmEvalue HMMEVALUE

Evalue threshold for hmm search

-is ITERATIONSERIES, --IterationSeries ITERATIONSERIES

The IterationSeries and CoverageSeries arguments

dictate the number of iterative searches to run and

the minimum HMM coverege required for each round of

extracting. This argument consists of a comma-

separated list specifying the number of searches to

run during each round of extraction. The list can vary

in length, changing the total number of extraction

rounds conducted. However, if a search iteration fails

to yield any new reference sequence amplicons,

IterRazor will break out of that round of searching

and move to the next. By default, the software

conducts four rounds of extraction with 20, 10, 5 and

5 search iterations per round (-is 20,10,5,5).

-cs COVERAGESERIES, --CoverageSeries COVERAGESERIES

This argument consists of a comma-separated list

specifying the minimum HMM coverage required per round

for an extracted amplicon reference sequence to be

added to the reference sequence fasta. The list can

vary in length but must be equal in length to the

IterationSeries list. By default, the software

conducts four rounds of extraction with coverage

limits of 1.0, 0.95, 0.9 and 0.65 for each respective

search round (-cs 1.0,0.95,0.9,0.65).

> MetaCurator.py -h

# MetaCurator.py -h

usage: MetaCurator.py [-h] -r REFERENCESFILE -i INPUTFILE -it INTAX -ot OUTTAX

-of OUTFASTA [-t THREADS] [--SaveTemp SAVETEMP]

[-e HMMEVALUE] [-tf TAXONOMIZRFORMAT] [-ct CLEANTAX]

[-is ITERATIONSERIES] [-cs COVERAGESERIES] [-mh MAXHITS]

[-di ITERATIONS]

optional arguments:

-h, --help show this help message and exit

required arguments:

-r REFERENCESFILE, --ReferencesFile REFERENCESFILE

Fasta file of sequences trimmed to exact region of

interest (header format should conform to inputfile

header format). This should be approximately 10

sequences spanning a diversity of the phylogenetic

tree of interest, so as to build an appropriately

inclusive HMM from the first iteration.

-i INPUTFILE, --InputFile INPUTFILE

Input file of reference sequences to be extracted

(sequence headers should include a unique sequence

identifier, NCBI accessions are recomended, preceded

only by '>')

-it INTAX, --InTax INTAX

Input taxonomic lineages (a tab-delimited file with

the first column containing unique sequence

identifiers compatible with the files provided under

'-r' and '-i'). Normally, this would be in

Metaxa2-compatible format, however, if lineages come

directly from Taxonomizr, you can skip reformatting

and declare '-tf True.' See below for more details.

-ot OUTTAX, --OutTax OUTTAX

Filename for output taxonomic lineages

-of OUTFASTA, --OutFasta OUTFASTA

Filename for output fasta formatted sequences

universal optional arguments:

-t THREADS, --threads THREADS

Number of processors to use

--SaveTemp SAVETEMP Option for saving alignments and HMM files produced

during extraction and dereplication (True or False)

-tf TAXONOMIZRFORMAT, --TaxonomizrFormat TAXONOMIZRFORMAT

Specify if tax lineages are in Taxonomizr format and

must first be converted to Metaxa2-compatible tab-

delimited format (True or False)

-ct CLEANTAX, --CleanTax CLEANTAX

Specify if tax lineages are to be cleaned of common

NCBI artifacts and revised at unresolved midpoints

(True or False)

IterRazor optional arguments:

-e HMMEVALUE, --HmmEvalue HMMEVALUE

Evalue threshold for HMM search

-is ITERATIONSERIES, --IterationSeries ITERATIONSERIES

The IterationSeries and CoverageSeries arguments

dictate the number of iterative searches to run and

the minimum HMM coverege required for each round of

extracting. This argument consists of a comma-

separated list specifying the number of searches to

run during each round of extraction. The list can vary

in length, changing the total number of extraction

rounds conducted. However, if a search iteration fails

to yield any new reference sequence amplicons,

IterRazor will break out of that round of searching

and move to the next. By default, the software

conducts four rounds of extraction with 20, 10, 5 and

5 search iterations per round (i.e. '-is 20,10,5,5').

-cs COVERAGESERIES, --CoverageSeries COVERAGESERIES

This argument consists of a comma-separated list

specifying the minimum HMM coverage required per round

for an extracted amplicon reference sequence to be

added to the reference sequence fasta. The list can

vary in length but must be equal in length to the

IterationSeries list. By default, the software

conducts four rounds of extraction with coverage

limits of 1.0, 0.95, 0.9 and 0.65 for each respective

search round (i.e. '-cs 1.0,0.95,0.9,0.65').

DerepByTaxonomy-specific optional arguments:

-mh MAXHITS, --MaxHits MAXHITS

Max hits to find before ending search for any given

query using VSEARCH (default: 1000)

-di ITERATIONS, --Iterations ITERATIONS

Number of iterative VSEARCH searches to run (default:

10). This is important in case the number of

replicates for a given sequence greatly exceeds '--

MaxHits'

テストラン

１、メタバーコーディングする領域を指定するため手動でトリミングして集めてきた8つのrbcL塩基配列"rbcL_Reps.fa"を入力とし、NCBIでrbcLとアノテートされている数千のrbcL塩基配列"rbcL_sample.fa"をキュレーションする。

cd TestMetaCurator/
MetaCurator.py -r rbcL_Reps.fa -i rbcL_sample.fa -it rbcL_sample.tax -tf True -ct True -of Test.fa -ot Test.tax

元のデータベースrbcL_sample.fa平均サイズは850.9-bpで2500配列あったが、キュレーション後は1691配列、平均476.6-bpに減少した。

> seqkit stats rbcL_sample.fa Test.fa

# seqkit stats rbcL_sample.fa Test.fa

file format type num_seqs sum_len min_len avg_len max_len

rbcL_sample.fa FASTA DNA 2,500 2,127,215 94 850.9 8,413

Test.fa FASTA DNA 1,691 805,958 319 476.6 494

Taxnomy情報を表すTest.tax も出力される。

上ではNCBIに登録されているrbcL配列群rbcL_sample.fa を-iで指定しているが、実際のデータ解析（メタバーコーディングのプライマー作成、増幅領域のマルチプルアラインメントなど）で使用するときは、ゲノム、メタゲノムのアセンブリなどを-iで指定する。

引用

MetaCurator: A hidden Markov model-based toolkit for extracting and curating sequences from taxonomically-informative genetic markers

Rodney T. Richardson, Douglas B. Sponsler, Harper McMinn-Sauder, Reed M. Johnson

bioRxiv preprint first posted online Jun. 18, 2019

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

メタバーコディングのデータベース配列キュレーションなどを行うツールキット MetaCurator