macでインフォマティクス

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

コンティグの拡張アセンブリを行う PRICE

 

 

低価格のDNAシーケンス技術により、ゲノム、トランスクリプトーム、生態系全体のメタゲノム解析における直接核酸シーケンスの役割は拡大しています。このような大規模なデータセットに対する人間や機械の理解は、配列断片を長く連続した配列ブロック(コンティグ)に合成することで簡略化されますが、アセンブリの分野における進歩のほとんどは、メタゲノムよりもむしろゲノムを単独で対象としています。ここでは、複雑なメタゲノム・データを入力として、特定の核酸種を重点的にアセンブルする戦略であるpaired-read iterative contig extension (PRICE) のためのソフトウェアを紹介する。PRICEは、複雑なメタゲノムから特定の遺伝子、転写産物、ウイルスゲノムを抽出するためのソフトウェアで、カポジ肉腫関連ヘルペスウイルスのBCBL-1株のアセンブリも含まれています。PRICEはオープンソースで、無料でダウンロードできます(derisilab.ucsf.edu/ software/price/ or sourceforge.net/projects/pricedenovo/ )。
 

 

HP (ソースコードが公開されている。マニュアルはリンク先の表をクリックするとダウンロードできる。)

https://derisilab.ucsf.edu/software/price/index.html

PRICE Documentation

https://vcru.wisc.edu/simonlab/bioinformatics/programs/price/PriceDocumentation140408/index.html

 

インストール

ダウンロードしてmakeする。得られるPriceTIに外部依存はないのでそのままパスの通ったディレクトリにコピーする。

ビルド依存

SourceForge

https://sourceforge.net/projects/pricedenovo/files/latest/download

cd PriceSorcexxxx/
make

> ./PriceTI

PRICE Assembler v1.0.1

 

 

These are the command options for the PRICE assembler.

For more details, see the accompanying README.txt file.

Contact: price@derisilab.ucsf.edu

 

Usage: ./PriceTI [args]

 

INPUT FILES:

 accepted formats are fasta (appended .fa, .fasta, .fna, .ffn, .frn),

fastq (.fq, .fastq, or _sequence.txt), or priceq (.pq or .priceq)

  NOTE ABOUT FASTQ ENCODING: multiple encodings are currently used for

fastq quality scores.  The traditional encoding is Phred+33,

                             and PRICE will interpret scores from any

.fq or .fastq file according to that encoding.  The Phred+64

                             encoding has been used extensively by

Illumina, and so it is applied to Illumina's commonly-used

_sequence.txt

                             file append.  Please make sure that your

encoding matches your file append.

INPUT READ FILES:

  NOTE: these flags can be used multiple times in the same command to

include multiple read datasets.

  (default % ID = 90%)

 PAIRED-END FILES (reads are 3p of one another on opposite strands,

i.e. pointing towards one another)

 -fp a b c [d e [f]]: (a,b)input file pair, (c)amplicon insert size

(including read)

                  (d,e,f) are optional; (d)the num. cycles to be

skipped before this file is used;

                  if (f) is provided, then the file will alternate

between being used for (e) cycles and not used for (f) cycles;

                  otherwise, the file will be used for (e) cycles then

will not be used again.

 -fpp a b c d [e f [g]]: (a,b)input file pair, (c)amplicon insert size

(including read), (d)required % identity for match (25-100 allowed)

                  (e,f,g) are optional; (e)the num. cycles to be

skipped before this file is used;

                  if (g) is provided, then the file will alternate

between being used for (f) cycles and not used for (g) cycles;

                  otherwise, the file will be used for (f) cycles then

will not be used again.

 -fs a b [c d [e]]: (a)input paired-end file (alternating entries are

paired-end reads), (b)amplicon insert size (including read)

                  (c,d,e) are optional; (c)the num. cycles to be

skipped before this file is used;

                  if (e) is provided, then the file will alternate

between being used for (d) cycles and not used for (e) cycles;

                  otherwise, the file will be used for (d) cycles then

will not be used again.

 -fsp a b c [d e [f]]: (a)input paired-end file (alternating entries

are paired-end reads), (b)amplicon insert size (including read),

                  (c)required % identity for match (25-100 allowed)

                  (d,e,f) are optional; (d)the num. cycles to be

skipped before this file is used;

                  if (f) is provided, then the file will alternate

between being used for (e) cycles and not used for (f) cycles;

                  otherwise, the file will be used for (e) cycles then

will not be used again.

 MATE-PAIR FILES (reads are 5p of one another on opposite strands,

i.e. pointing away from one another)

 -mp a b c [d e [f]]: like -fp above, but with reads in the opposite

orientation.

 -mpp a b c d [e f [g]]: like -fpp above, but with reads in the

opposite orientation.

 -ms a b [c d [e]]: like -fs above, but with reads in the opposite orientation.

 -msp a b c [d e [f]]: like -fsp above, but with reads in the opposite

orientation.

 FALSE PAIRED-END FILES (unpaired reads are split into paired ends,

with the scores of double-use nuceotides halved)

 -spf a b c [d e [f]]: (a)input file, (b)the length of the 'reads'

that will be taken from each side of the input reads,

                  (c)amplicon insert size (including read)

                  (d,e,f) are optional; (d)the num. cycles to be

skipped before this file is used;

                  if (f) is provided, then the file will alternate

between being used for (e) cycles and not used for (f) cycles;

                  otherwise, the file will be used for (e) cycles then

will not be used again.

 -spfp a b c d [e f [g]]: (a)input file, (b)the length of the 'reads'

that will be taken from each side of the input reads,

                  (c)amplicon insert size (including read),

(d)required % identity for match (25-100 allowed)

                  (e,f,g) are optional; (e)the num. cycles to be

skipped before this file is used;

                  if (g) is provided, then the file will alternate

between being used for (f) cycles and not used for (g) cycles;

                  otherwise, the file will be used for (f) cycles then

will not be used again.

INPUT INITIAL CONTIG FILES:

  NOTE: these flags can be used multiple times in the same command to

include multiple initial contig datasets.

 -icf a b c d: (a)initial contig file, (b)number of addition steps,

(c)number of cycles per step,

               (d)const by which to multiply quality scores

 -picf a b c d e: (a)num of initial contigs from this file, (b)initial

contig file, (c)num addition steps,

                  (d)num cycles per step (e)const by which to multiply

quality scores

 -icfNt/-picfNt: same as -icf/-picf, but if target mode is invoked,

contigs with matches to these input sequences will not necessarily

                 be retained

OUTPUT FILES:

 accepted formats are fasta (.fa or .fasta) or priceq (.pq or .priceq)

 -o a: (a)output file name (.fasta or .priceq)

 -nco a: (a)num. cycles that pass in between output files being

written (default=1)

OTHER PARAMS:

 -nc a: (a)num. of cycles

 -link a: (a)max. number of contigs that are allowed to replace a read

in a mini-assembly (default=2)

 -mol a: (a)minimum overlap length for mini-assembly (default=35)

 -tol a: (a)threshold seq num for scaling overlap for overhang

assemblies (default=20)

 NOTE: -mol and -tol do not affect the parameters for

de-Bruijn-graph-based assembly.

 -mpi a: (a)minimum % identity for mini-assembly (default=85)

 -tpi a: (a)threshold seq num for scaling % ID for mini-assemblies (default=20)

 -MPI, -TPI : same as above, but for meta-assembly (-MPI default=85,

-TPI default=1000)

 NOTE: there is no minimum overlap value for meta-assembly

 -dbmax a: (a) the maximum length sequence that will be fed into de

Bruijn assembly

           (default=100; recommended: max paired-end read length)

 -dbk a: (a) the k-mer size for de Bruijn assembly (default=20; keep

less than the read length)

 -dbms a: (a) the minimum number of sequences to which de Bruijn

assembly will be applied (default=3)

 -r a: (a) alignment score reward for a nucleotide match; should be a

positive integer (default=1)

 -q a: (a) alignment score penalty for a nucleotide mismatch; should

be a negative integer (default=-2)

 -G a: (a) alignment score penalty for opening a gap; should be a

negative integer (default=-5)

 -E a: (a) alignment score penalty for extending a gap; should be a

negative integer (default=-2)

FILTERING READS:

 -rqf a b [c d]: filters pairs of reads if either has an unaccptably

high number of low-quality nucleotides, as defined

                 by the provided quality scores (only applies to files

whose formats include quality score information).

                 (a) the percentage of nucleotides in a read that must

be high-quality; (b) the minimum allowed probability

                 of a nucleotide being correct (must be between 0 and

1, and will usually be a decimal value close to 1);

                 (c) and (d) optionally constrain this filter to use

after (c) cycles have passed, to run for (d) cycles.

                 This flag may be called multiple times to generate

variable behavior across a PRICE run.

 -rnf a [b c]: filters pairs of reads if either has an unaccptably

high number of uncalled nucleotides (Ns or other ambiguous

                 IUPAC codes).  Like -rqf, but will also filter

fasta-format data.  (a) the percentage of nucleotides in a read

                 that must be called; (b) and (c) optionally constrain

this filter to use after (b) cycles have passed, to run

                 for (c) cycles.  This flag may be called multiple

times to generate variable behavior across a PRICE run.

 -maxHp a: filters out a pair of reads if either read has a

homo-polymer track >(a) nucleotides in length.

 -maxDi a: filters out a pair of reads if either read has a repeating

di-nucleotide track >(a) nucleotides in length.

           NOTE: this will also catch mono-nucleotide repeats of the

specified length (a string of A's is also a string

           of AA's), so calling -maxHp in addition to -maxDi is

superfluous unless -maxHp is given a smaller max value.

 -badf a b: prevents reads with a match of at least (b)% identity to a

sequence in file (a) from being mapped to contigs.

 -repmask a b c d e f [g]: uses coverage levels of constructed and/or

input contigs to find repetitive elements and mask them

                           as if they were sequences input using -badf.

                           (a) = cycle number (1-indexed) at which

repeats will be detected.

                           (b) = 's' if repeats will be sought at the

start of the cycle or 'f' if they will be sought at the finish.

                           (c) = the min. number of variance units

above the median that will be counted as high-coverage.

                           (d) = the min. fold increase in coverage

above the median that will be counted as high-coverage.

                           (e) = the min. size in nt for a detected repeat.

                           (f) = reads with a match of at least this %

identity to a repeat will not be mapped to contigs.

                           (g) = an optional output file (.fasta or

.priceq) to which the detected repeats will be written.

 -reset a [b c d...]: re-introduces contigs that were previously not

generating assembly jobs of their own

                      (a) is the one-indexed cycle where the contigs

will be reset.  Same with b, c, d.

                      Any number of args may be added.

FILTERING INITIAL CONTIGS:

 -icbf a b [c]: prevents input sequences with a match of at least (b)%

identity to a sequence in file (a) from being used.

                This filter is optionally not applied to sequences of

length greater than (c) nucleotides.

 -icmHp a [b]: filters out an initial contig if it has a homo-polymer

track >(a) nucleotides in length.

               This filter is optionally not applied to sequences of

length greater than (b) nucleotides.

 -icmDi a [b]: filters out an initial contig if it has a repeating

di-nucleotide track >(a) nucleotides in length.

               NOTE: this will also catch mono-nucleotide repeats of

the specified length (a string of A's is also a string

               of AA's), so calling -icmHp in addition to -icmDi is

superfluous unless -icmHp is given a smaller max value.

               This filter is optionally not applied to sequences of

length greater than (b) nucleotides.

 -icqf a b [c]: filters out an initial contig if it has an unaccptably

high number of low-quality nucleotides, as defined

                by the provided quality scores (only applies to files

whose formats include quality score information).

                (a) the percentage of nucleotides in a read that must

be high-quality; (b) the minimum allowed probability

                of a nucleotide being correct (must be between 0 and

1, and will usually be a decimal value close to 1);

                This filter is optionally not applied to sequences of

length greater than (c) nucleotides.

 -icnf a [b]: filters pairs of reads if either has an unaccptably high

number of uncalled nucleotides (Ns or other ambiguous

              IUPAC codes).  Like -icqf, but will also filter

fasta-format data.  (a) the percentage of nucleotides in a read

              that must be called. This filter is optionally not

applied to sequences of length greater than (c) nucleotides.

FILTERING/PROCESSING ASSEMBLED CONTIGS:

 -lenf a b: filters out contigs shorter than (a) nt at the end of

every cycle, after skipping (b) cycles.

            NOTE: multiple -lenf commands can be entered; for any

cycle, the most recently-initiated filter is used.

            Example: -lenf 50 2 -lenf 300 4 -lenf 200 6 => no filter

for the first two cycles, then a 50nt filter for cycles

                     3 & 4, then a 300nt filter for cycles 5 & 6, then

a 200nt filter for cycles 7 onwards.

 -trim a b [c]: at the end of the (a)th cycle (indexed from 1), trim

off the edges of conigs until reaching the minimum coverage

                level (b), optionally deleting contigs shorter than

(c) after trimming; this flag may be used repeatedly.

 -trimB a b [c]: basal trim; after skipping (a) cycles, trim off the

edges of conigs until reaching the minimum coverage

                 level (b) at the end of EVERY cycle, optionally

deleting contigs shorter than (c) after trimming.

                 -trimB may be called many times, and multiple calls

will interact in the same way as multiple -lenf calls

                 (explained above). A call to -trim will override the

basal trim values for that specified cycle only.

 -trimI a [b]: initial trim; input initial contigs are trimmed before

being used by PRICE to seed assemblies. Contigs are

               trimmed from their outside edges until reaching the

minimum coverage level (a), optionally deleting contigs

               shorter than (b) after trimming. This flag is most

appropriate for .priceq input, can be appropriate for

               .fastq input, and is inappropriate for .fasta input.

It will be equally applied to ALL input contigs.

 -target a b [c d]: limit output contigs to those with matches to

input initial contigs at the end of each cycle.

                    (a) % identity to an input initial contig to count

as a match (ungapped); (b)num cycles to skip

                    before applying this filter.  [c and d are

optional, but must both be provided if either is]

                    After target filtering has begin,

target-filtered/-unfiltered cycles will alternate with (c)

                    filtered cycles followed by (d) unfiltered cycles.

 -targetF a b [c d]: the same as -target, but now matches to all reads

in the input set will be specified, not just

                    the ones that have been introduced up to that

point (this is FullFile mode).

COMPUTATIONAL EFFICIENCY:

 -a x: (x)num threads to use (default=1)

 -mtpf a: (a)max threads per file (default=1)

USER INTERFACE:

 -log a: determines the type of outputmakes the output verbose (lots

of time stamp tags)

         (a) = c: concise stdout (default)

         (a) = n: no stdout

         (a) = v: verbose stdout

 -logf a: (a)the name of an output file for verbose log info to be

written (doesn't change stdout format)

 -, -h, or --help: user interface info.

 

number of cycles MUST be specified with -nc flag

No assembly was run; help message printed instead.

 

 

実行方法

fastaかfastqファイル、もしくはpriceqファイルを指定する。

 

テストラン

パラインフルエンザ4の分離株のゲノムのアセンブルとなっている。

sample job

上のHPからファイルはダウンロードできる。

PriceTI -fpp s_2_1_sequence.txt s_2_2_sequence.txt 300 95 -icf sangerReads.fasta 1 1 5 -nc 30 -dbmax 72 -mol 30 -tol 20 -mpi 80 -target 90 2 1 1 -o practice.fasta

ペアエンドのイルミナFASTQファイルの入力ファイル(Solexaペアエンドリード)を2つ指定している。アンプリコンインサートサイズを300とする。リードをコンティグにアラインする際の配列同一性は95%以上とする。"-icf sangerReads.fasta 1 1 5"の部分はシード配列(sangerリード、454リード、個別のIlluminaリードなど)とパラメータになっている。最初の "1"は、Priceに、入力されたシードの一部を様々なサイクルで追加するのではなく、一度にすべて使用するように指示する。次の "1 "は、別のシードの一部を追加する前に待機するサイクル数を示す。最後の"5"という数字は、品質スコアの定数倍で指定する。"-nc 30"はPRICEに30サイクル実行するよう指示している。"-dbmax 72"はPRICEに、小さな領域のアセンブリにde Bruijn graphサブアセンブラを最大配列長72ヌクレオチドで使うように指示している。"-mol 30"は、配列がマッチする際のオーバーラップの最小値を30ntに設定している。この数値は、-tolフラグで指定された閾値により、カバレッジの増加に応じて自動的にスケールアップされる。多くのアセンブリジョブでは、40ntから50ntの長さで十分である。カバレッジが低いときにこの値を大きくしすぎると、適切なリードがないためにコンティグの成長が止まってしまう可能性がある。"-tol 20"はオーバーラップの最小長をスケーリングするための閾値フラグ。アセンブリジョブのリード数が20を超えると、最小のオーバーラップは上方にスケールされる。"-mpi 80"は、配列が一致するための最小一致率で、80%に設定されている。この値もカバレッジの深さに応じてスケーリングされる。スケーリングを開始する閾値は "-tpi "フラグで指定される。この値を低く設定すると、より緩やかなマッチングが可能になり、値を高く設定すると、マッチングの厳しさが増す。"-target 90 2 1 1"はPRICEを "target "モードで実行することを指定している。これは、最終的なコンティグは入力シード配列を拡張した配列に限定することを意味する。最初の数字 "90 "は、マッチとしてカウントするために必要な、最初のシードのコンティグに対する最小の同一性のパーセンテージである。2番目の数字 "2 "は、PRICEに2サイクル実行させ、アウトボード(ペアエンドの集合によって生成された)コンティグを保持し、出力を最初のシードコンティグとマップまたはコラップしたものに限定するように指示する。次の2つの数字、"1 1"は、その後、アウトボードコンティグの保存とシードコンティグへの制限を交互に行うよう指示している。最後は、"-o practice.fasta"で、PRICEに出力を "practice.fasta "というファイルに格納するよう指示する。

 

実行すると、それぞれのサイクルごとにシードから拡張したfastaファイルが出力される。ファイル名は指定されたprefixとサイクル数を表す"prefix + <INT> + .fasta"となっている。

 

マニュアルより

  • イルミナFASTQを使用する場合、ファイルの末尾は"_sequence.txt "でなければならない。イルミナ以外のFASTQを使用する場合、ファイルの末尾は".fq "または".fastq "でなければならない。
  • FASTAファイルを使用する場合、ファイルの末尾は ".fa" または ".fasta"の必要がある。

 

コメント

渦鞭毛藻類の断片化したプラスチドゲノムを伸ばすために利用されていて気づいたツールです。de novoアセンブリの手法ではうまく組み立てることができない断片化した配列のシード配列からの伸長に利用できそうです。

引用

PRICE: Software for the Targeted Assembly of Components of (Meta) Genomic Sequence Data

J Graham Ruby, Priya Bellare, Joseph L Derisi

G3 (Bethesda). 2013 May 20;3(5):865-80

 

関連

https://kazumaxneo.hatenablog.com/entry/2021/06/17/232020