2018-06-10

ターゲット遺伝子座のリファレンスガイドアセンブリを行う aTRAM2.0

2021 7/21 タイトル修正

　大規模なシーケンスからの迅速な標的遺伝子座特異的なアセンブリは、現在、医学から広範囲の系統学までの応用分野で、生物学科学全体で一般的に使用されている。ターゲットアセンブリ手法は、完全なゲノムのデノボアセンブリと比較してアセンブリの計算規模を大幅に削減し、全ゲノムアセンブリのエラーを避けることができる（論文より ref2-5）

　aTRAM1.0 (automated Target Restricted Assembly Method）は、次世代シークエンシング（NGS）データから任意の遺伝子座をアセンブルするために開発された(6,7)。このプロセスでは、NGSのターゲットにマッチするリードのみアセンブリされる。Perlで書かれ、個々のコンピュータと高性能コンピューティング環境の両方で動作するように設計されている。aTRAMは、最後に、BLASTを使用して、非常に異なるtaxonomy、すなわちhighly closedなゲノムを持たないraxonの遺伝子座のアセンブリを可能にする、一致するリードを見出す（一部略）。

　ここでは、新しい機能を追加し、より詳細なチューニングオプションを使用してアセンブリの制御を大幅に向上させ、以前の方法をより効率的に処理するため、TRAM 1.0を完全な再実装したTRAM 2.0を提示する。（1）BLASTデータベースへの読み込みを増加させてヒットを増加させること、（2）新しいデータベース化戦略を用いて配列クエリーおよび検索を高速化すること、（3）アセンブリされたコンティグをBLASTすルためフィルタリングステップを逆にする（4）さらなるユースケースを可能にするために新たなデノボアセンブラを追加する、（5）多くの分類群から多くの遺伝子座を迅速に組み立てるための自動化パイプラインを追加する、（6）コードの完全なるPythonへの書き換えによって長期的な持続性とパフォーマンスを向上させる。

　aTRAM 2.0は、2つのステップで動作するコマンドラインのPythonベースのツールである。第1に、ユーザはターゲットとユーザオプションのセットを指定し、次いでaTRAMフォーマットのライブラリが構築され、そこではペアエンドリードが分割される。分割はシーケンスの内容とは無関係で、データベースサイズのしきい値（デフォルトで250 MB、シャード (shards) サイズと数はユーザーが調整可能）に基づいている。このデータセット分割により、並列クエリ読み取りが可能になり、シリアルクエリでもパフォーマンスが大幅に向上する。各シャードについて、BLASTフォーマットのデータベースが両方のメイト対に対して構築される。第2のステップでは、関心対象の遺伝子座を標的とし、アセンブルする。これを行うために、クエリ配列は、シャードされたデータベースに対してBLASTされる。一致するリードは、デノボアセンブリモジュールでアセンブルされる。アセンブリは反復アプローチによって改善される。つまり、2回目の反復では、アセンブルされたコンティグが元のクエリを置き換え、ショートリードデータベースにぶつかる。最初の繰り返しと同様に、マッチングしたリードがデノボでアセンブリされる。このプロセスは、ユーザー指定の反復制限に達するか、新しいコンティグがアセンブルされなくなるまで続く（論文図1 ワークフロー）。各遺伝子が独立してアセンブルされるので、複数のレベルでパラレル化が可能になっている。多数の遺伝子を同時に並行して迅速にアセンブルすることができる。さらに、データベースが断片に分割され、それぞれが独立して検索されるように、各実行内で並列処理が可能になっている。最後に、アセンブリモジュール内の並列化制御も実装されている。これらの処理ステップは、ユーザーオプション設定によって高度にカスタマイズ可能になった。例えば、大規模なデータセット（例えば、ミトコンドリアゲノム）のような高カバレッジなターゲットについてaTRAMライブラリーの一部のみを検索することが可能である。デノボアセンブリモジュールの多くのオプションを活用することも可能になっている。

インストール

cent OS6のPython 3.6.3 :: Anacondaでテストした。

依存（アセンブラはどれか１つあればOK）

BLAST - version 2.7.1
Velvet
Trinity - version 2.5.1
Abyss - version 2.0.2
SPAdes - version 3.11.1

本体　Github

https://github.com/HKU-BAL/megagta

virtualenvを使い環境を構築する。

git clone https://github.com/juliema/aTRAM.git 
cd atram 
virtualenv venv -p python3
source venv/bin/activate
pip install -r requirements.txt

> python atram_preprocessor.py -h

$ python atram_preprocessor -h

usage: atram_preprocessor.py [-h] [--version]

[--end-1 FASTA_or_FASTQ [FASTA_or_FASTQ ...]]

[--end-2 FASTA_or_FASTQ [FASTA_or_FASTQ ...]]

[--mixed-ends FASTA_or_FASTQ [FASTA_or_FASTQ ...]]

[--single-ends FASTA_or_FASTQ [FASTA_or_FASTQ ...]]

[-b DB] [--cpus CPUS] [-t DIR] [-l LOG_FILE]

[-s SHARDS] [--path PATH] [--sqlite-temp-dir DIR]

This script prepares data for use by the atram.py

script. It takes fasta or fastq files of paired-end (or

single-end) sequence reads and creates a set of atram

databases.

You need to prepare the sequence read archive files so that the

header lines contain only a sequence ID with the optional

paired-end suffix at the end of the header line. The separator

for the optional trailing paired-end suffix may be a space,

a slash "/", a dot ".", or an underscore "_".

For example:

>DBRHHJN1:427:H9YYAADXX:1:1101:10001:77019/1

GATTAA...

>DBRHHJN1:427:H9YYAADXX:1:1101:10001:77019/2

ATAGCC...

>DBRHHJN1:427:H9YYAADXX:1:1101:10006:63769/2

CGAAAA...

optional arguments:

-h, --help show this help message and exit

--version show program's version number and exit

--end-1 FASTA_or_FASTQ [FASTA_or_FASTQ ...], -1 FASTA_or_FASTQ [FASTA_or_FASTQ ...]

Sequence read archive files that have only end 1

sequences. The sequence names do not need an end

suffix, we will assume the suffix is always 1. The

files are in fasta or fastq format. You may enter more

than one file or you may use wildcards.

--end-2 FASTA_or_FASTQ [FASTA_or_FASTQ ...], -2 FASTA_or_FASTQ [FASTA_or_FASTQ ...]

Sequence read archive files that have only end 2

sequences. The sequence names do not need an end

suffix, we will assume the suffix is always 2. The

files are in fasta or fastq format. You may enter more

than one file or you may use wildcards.

--mixed-ends FASTA_or_FASTQ [FASTA_or_FASTQ ...], -m FASTA_or_FASTQ [FASTA_or_FASTQ ...]

Sequence read archive files that have a mix of both

end 1 and end 2 sequences (or single ends). The files

are in fasta or fastq format. You may enter more than

one file or you may use wildcards.

--single-ends FASTA_or_FASTQ [FASTA_or_FASTQ ...], -0 FASTA_or_FASTQ [FASTA_or_FASTQ ...]

Sequence read archive files that have only unpaired

sequences. Any sequence suffix will be ignored. The

files are in fasta or fastq format. You may enter more

than one file or you may use wildcards.

preprocessor arguments:

-b DB, --blast-db DB, --output DB, --db DB

This is the prefix of all of the blast database files.

So you can identify different blast database sets. You

may include a directory as part of the prefix. The

default is "./atram_2018-06-10".

--cpus CPUS, --processes CPUS, --max-processes CPUS

Number of CPU threads to use. On this machine the

default is ("10")

-t DIR, --temp-dir DIR

You may save intermediate files for debugging in this

directory. The directory must be empty.

-l LOG_FILE, --log-file LOG_FILE

Log file (full path). The default is to use the DB and

program name to come up with a name like

"<DB>_atram_preprocessor.log"

-s SHARDS, --shards SHARDS, --number SHARDS

Number of blast DB shards to create. The default is to

have each shard contain roughly 250MB of sequence

data.

--path PATH If blast or makeblastdb is not in your $PATH then use

this to prepend directories to your path.

--sqlite-temp-dir DIR

Use this directory to save temporary SQLITE3 files.

This is a possible fix for "database or disk is full"

errors.

——

> python atram.py -h

$ python atram.py -h

usage: atram.py [-h] [--version] -b DB [DB ...] [-q QUERY [QUERY ...]]

[-Q QUERY_SPLIT [QUERY_SPLIT ...]] -o OUTPUT_PREFIX

[-a {abyss,trinity,velvet,spades,none}] [-i N] [-p]

[--fraction FRACTION] [--cpus CPUS] [--log-file LOG_FILE]

[--path PATH] [-t DIR] [--sqlite-temp-dir DIR] [-T SECONDS]

[--no-filter] [--bit-score SCORE]

[--contig-length CONTIG_LENGTH] [--db-gencode CODE]

[--evalue EVALUE] [--max-target-seqs MAX] [--no-long-reads]

[--kmer KMER] [--mpi] [--bowtie2] [--max-memory MEMORY]

[--exp-coverage EXP_COVERAGE] [--ins-length INS_LENGTH]

[--min-contig-length MIN_CONTIG_LENGTH]

[--cov-cutoff COV_CUTOFF]

This is the aTRAM script. It takes a query sequence and a blast

database built with the atram_preprocessor.py script and builds an

assembly.

If you specify more than one query sequence and/or more than one blast

database then aTRAM will build one assembly for each query/blast

DB pair.

NOTE: You may use a text file to hold the command-line arguments

like: @/path/to/args.txt. This is particularly useful when specifying

multiple blast databases or multiple query sequences.

optional arguments:

-h, --help show this help message and exit

--version show program's version number and exit

required arguments:

-b DB [DB ...], --blast-db DB [DB ...], --sra DB [DB ...], --db DB [DB ...], --database DB [DB ...]

This needs to match the DB prefix you entered for

atram_preprocessor.py. You may repeat this argument to

run the --query sequence(s) against multiple blast

databases.

-q QUERY [QUERY ...], --query QUERY [QUERY ...], --target QUERY [QUERY ...], --probe QUERY [QUERY ...]

The path to the fasta file with sequences of interest.

You may repeat this argument. If you do then Each

--query sequence file will be run against every

--blast-db.

-Q QUERY_SPLIT [QUERY_SPLIT ...], --query-split QUERY_SPLIT [QUERY_SPLIT ...], --target-split QUERY_SPLIT [QUERY_SPLIT ...]

The path to the fasta file with multiple sequences of

interest. This will take every sequence in the fasta

file and treat it as if it were its own --query

argument. So every sequence in --query-split will be

run against every --blast-db.

-o OUTPUT_PREFIX, --output-prefix OUTPUT_PREFIX

This is the prefix of all of the output files. So you

can identify different blast output file sets. You may

include a directory as part of the prefix. aTRAM will

add suffixes to differentiate ouput files.

optional aTRAM arguments:

-a {abyss,trinity,velvet,spades,none}, --assembler {abyss,trinity,velvet,spades,none}

Which assembler to use. Choosing "none" (the default)

will do a single blast run and stop before any

assembly.

-i N, --iterations N The number of pipeline iterations. The default is "5".

-p, --protein Are the query sequences protein? aTRAM will guess if

you skip this argument.

--fraction FRACTION Use only the specified fraction of the aTRAM database.

The default is "1.0"

--cpus CPUS, --processes CPUS, --max-processes CPUS

Number of CPU threads to use. This will also be used

for the assemblers when possible. On this machine the

default is ("10")

--log-file LOG_FILE Log file (full path)."

--path PATH If the assembler or blast you want to use is not in

your $PATH then use this to prepend directories to

your path.

-t DIR, --temp-dir DIR

You may save intermediate files for debugging in this

directory. The directory must be empty.

--sqlite-temp-dir DIR

Use this directory to save temporary SQLITE3 files.

This is a possible fix for "database or disk is full"

errors.

-T SECONDS, --timeout SECONDS

How many seconds to wait for an assembler before

stopping the run. To wait forever set this to 0. The

default is "300" (5 minutes).

optional values for blast-filtering contigs:

--no-filter Do not filter the assembled contigs. This will: set

both the --bit-score and --contig-length to 0

--bit-score SCORE Remove contigs that have a value less than this. The

default is "70.0". This is turned off by the --no-

filter argument.

--contig-length CONTIG_LENGTH, --length CONTIG_LENGTH

Remove blast hits that are shorter than this length.

The default is "100". This is turned off by the --no-

filter argument.

optional blast arguments:

--db-gencode CODE The genetic code to use during blast runs. The default

is "1".

--evalue EVALUE The default evalue is "1e-10".

--max-target-seqs MAX

Maximum hit sequences per shard. Default is calculated

based on the available memory and the number of

shards.

optional assembler arguments:

--no-long-reads Do not use long reads during assembly. (Abyss,

Trinity, Velvet)

--kmer KMER k-mer size. The default is "64" for Abyss and "31" for

Velvet. Note: the maximum kmer length for Velvet is

31. (Abyss, Velvet)

--mpi Use MPI for this assembler. The assembler must have

been compiled to use MPI. (Abyss)

--bowtie2 Use bowtie2 during assembly. (Trinity)

--max-memory MEMORY Maximum amount of memory to use in gigabytes. The

default is "14". (Trinity, Spades)

--exp-coverage EXP_COVERAGE, --expected-coverage EXP_COVERAGE

The expected coverage of the region. The default is

"30". (Velvet)

--ins-length INS_LENGTH

The size of the fragments used in the short-read

library. The default is "300". (Velvet)

--min-contig-length MIN_CONTIG_LENGTH

The minimum contig length used by the assembler

itself. The default is "100". (Velvet)

--cov-cutoff COV_CUTOFF

Read coverage cutoff value. Must be a positive float

value, or "auto", or "off". The default is "off".

(Spades)

——

ラン

１、プリプロセシング。

mkdir library
python atram_preprocessor.py --cpus 20 -b library/name --end-1 pair11.fastq --end-2 pair2.fastq

--cpus CPUS, --processes CPUS, --max-processes CPUS
-b This is the prefix of all of the blast database files so you can identify different blast database sets and so they can be stored together without resorting to subdirectories. You may include a directory as part of the prefix. The default is ./atram_2017-10-1

gzip圧縮fastqには対応していない。

２、ターゲットアセンブリ。ここではvelvetを使う。velvetのk-merはデフォルト31。ターゲット遺伝子座のFASTAファイルを指定する。

python atram.py -b library/name -q ref_locus.fasta -i 5 --cpus 40 --kmer 31 -o Locus_atram2.fasta --log-file Locus.log -a velvet

-q This specifies the query sequence or sequences. If one sequence use the "-q" option. For many query sequences, "-Q".
-i The number of pipeline iterations. The default is "5".
-a {abyss, trinity, velvet, spades} Choose which assembler to use from the list. If you do not use this argument then aTRAM will do a single blast run and stop before assembly.
-o Give a prefix for output files. You may include a directory path as part of the prefix.
-p Are the query sequences protein? aTRAM will guess if you skip this argument.
--kmer The default is 64 for Abyss and 31 for Velvet. Note: the maximum kmer length for Velvet is 31 (Abyss, Velvet).

ランが終わるとターゲットアセンブリのFASTAが出力される。

f:id:kazumaxneo:20180610193618j:plain

実際のランでは、sqliteのデータベースが増え続ける？ようなバグに備え、場所を変えることを推奨している。さらにメジャーアップデートをした時は、データベースをリセットすることが推奨されている。

引用

aTRAM 2.0: An Improved, Flexible Locus Assembler for NGS Data.

Allen JM, LaFrance R, Folk RA, Johnson KP, Guralnick RP

Evol Bioinform Online. 2018 May 8;14

aTRAM - automated target restricted assembly method: a fast method for assembling loci across divergent taxa from next-generation sequencing data.

Allen JM, Huang DI, Cronk QC, Johnson KP

BMC Bioinformatics. 2015 Mar 25;16:98.