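orfipy: a fast ORF prediction tool with flexible output parameters

2021-02-13: added the paper citation, updated the help output, and added usage examples

Searching for ORFs in transcripts is an important step before annotating coding regions in newly sequenced genomes, and also when looking for alternative reading frames within known genes. With the enormous growth of RNA-Seq data, faster tools are needed to handle large input datasets. Such tools must be versatile enough to fine-tune the search criteria and to support efficient downstream analysis. Here the authors present orfipy, a new Python-based tool for flexibly searching for open reading frames in FASTA sequences. The search is fast and fully customizable, with a choice of FASTA and BED output formats.

orfipy is implemented in Python and is compatible with Python v3.6 and later. It can be installed from source, from PyPI (https://pypi.org/project/orfipy), or from bioconda (https://anaconda.org/bioconda/orfipy).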

Excited to share that orfipy is now published in Bioinformatics. I used it to identify ORFs in millions of transcripts efficiently and quickly. Fully flexible and rapid search. https://t.co/5aXb4cEcZT @EveSyrkin #Bioinformatics #RNASeq #python
Urminder Singh (@_urminder), February 13, 2021

[Figure: Comparison with getorf and OrfM (reproduced from GitHub)]

Installation

Tested on Ubuntu 18.04 LTS (Python 3.8).

GitHub

#bioconda; using mamba instead of conda is faster
mamba install -c bioconda orfipy -y

#pip (pypi)
pip install orfipy

#development version
git clone https://github.com/urmi-21/orfipy.git
cd orfipy
pip install .

$ orfipy

usage:

orfipy [<options>] <infile>

By default orfipy reports ORFs as sequences between start and stop codons. See ORF searching options to change this behaviour.

If no output type, i.e. dna, rna, pep, bed or bed12, is specified, default output is bed format to stdout.

orfipy: extract Open Reading Frames (version 0.0.3)

positional arguments:

infile The input file, in plain Fasta/Fastq or gzipped format, containing Nucletide sequences

optional arguments:

-h, --help show this help message and exit

--procs PROCS Num processor cores to use Default:mp.cpu_count()

--single-mode Run in single mode i.e. no parallel processing (SLOWER). If supplied with procs, this is ignored. Default: False

--table TABLE The codon table number to use or path to .json file with codon table. Use --show-tables to see available tables compiled from: https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi?chapter=cgencodes Default: 1

--start START Comma-separated list of start-codons. This will override start codons described in translation table. E.g. "--start ATG,ATT" Default: Derived from the translation table selected

--stop STOP Comma-separated list of stop codons. This will override stop codons described in translation table. E.g. "--start TAG,TTT" Default: Derived from the translation table selected

--outdir OUTDIR Path to outdir default: orfipy_<infasta>_out

--bed12 BED12 bed12 out file Default: None

--bed BED bed out file Default: None

--dna DNA fasta (DNA) out file Default: None

--rna RNA fasta (RNA) out file Default: None

--pep PEP fasta (peptide) out file Default: None

--min MIN Minimum length of ORF, excluding stop codon (nucleotide) Default: 30

--max MAX Maximum length of ORF, excluding stop codon (nucleotide) Default: 1,000,000,000

--strand {f,r,b} Strands to find ORFs [(f)orward,(r)everse,(b)oth] Default: b

--partial-3 Output ORFs with a start codon but lacking an inframe stop codon. E.g. "ATG TTT AAA" Default: False

--partial-5 Output ORFs with an inframe stop codon lacking an inframe start codon. E.g. "TTT AAA TAG" Default: False

--between-stops Output ORFs defined as regions between stop codons (regions free of stop codon). This will set --partial-3 and --partial-5 true. Default: False

--include-stop Include stop codon in the results, if a stop codon exists. This output format is compatible with TransDecoder's which includes stop codon coordinates Default: False

--longest Output a separate BED file for longest ORFs per sequence. Requires bed option. Default: False

--by-frame Output separate BED files for ORFs by frame. Can be combined with "--longest" to output longest ORFs in each frame. Requires bed option. Default: False

--chunk-size CHUNK_SIZE Max chunk size in MB. This is useful for limiting memory usage when processing large fasta files using multiple processes The files are processed in chunks if file size is greater than chunk-size. By default orfipy computes the chunk size based on available memory and cpu cores. Providing a smaller chunk-size will lower the memory usage but, actual memory used by orfipy can be more than the chunk-size. Providing a very high chunk-size can lead to memory issues for larger sequences such as large chromosomes. It is best to let orfipy decide on the chunk-size. Default: estimated by orfipy based on system memory and cpu

--show-tables Print translation tables and exit. Default: False

--version Print version information and exit

Usage

Specify a genome FASTA file.

orfipy input.fasta --dna orfs.fa --min 10 --max 10000 --procs 4 --table 1 --outdir orfs_out

--min Minimum length of ORF, excluding stop codon (nucleotide) Default: 30
--max Maximum length of ORF, excluding stop codon
--table The codon table number to use or path to .json file with codon table
--dna (DNA) out file Default: None
--procs Num processes Default:mp.cpu_count()

Output


Because ORFs are predicted in every reading frame, the output can become quite large.
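To make "every reading frame" concrete, here is a minimal Python sketch of a six-frame scan (three frames on each strand) that reports start-to-stop ORFs. It is only an illustration of the idea and is not orfipy's actual algorithm; the function names, the default start codon set, and the coordinate convention are assumptions made for this example.

```python
# Minimal six-frame ORF scan; an illustration only, not orfipy's implementation.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, starts=("ATG",), stops=("TAA", "TAG", "TGA"), min_len=30):
    """Yield (strand, frame, start, end, orf) for complete start-to-stop ORFs.

    Coordinates are 0-based and relative to the scanned strand (an assumption
    of this sketch). The stop codon is excluded from both the ORF and its
    length, which is how the help text describes --min.
    """
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            orf_start = None  # first start codon seen since the last stop
            for i in range(frame, len(strand_seq) - 2, 3):
                codon = strand_seq[i:i + 3]
                if orf_start is None and codon in starts:
                    orf_start = i
                elif codon in stops:
                    if orf_start is not None and i - orf_start >= min_len:
                        yield strand, frame, orf_start, i, strand_seq[orf_start:i]
                    orf_start = None  # a stop codon always closes the open ORF


if __name__ == "__main__":
    for hit in find_orfs("CCATGAAATTTGGGTAGCC", min_len=6):
        print(hit)
```

Even this toy version visits every codon six times per sequence, which gives a feel for why the number of reported ORFs grows quickly on genome-scale input and why --min is worth setting.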

Use the standard codon table, but restrict the start codon to ATG only.

orfipy input.fa.gz --dna orfs.fa --start ATG

--start Comma-separated list of start-codons. This will override start codons described in translation table. E.g. "--start ATG,ATT" Default: Derived from the translation table selected

Show the available codon tables

orfipy --show-tables

$ orfipy --show-tables

Translation tables compiled from: https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi?chapter=cgencodes

Table# Name Start Stop

1 Standard (transl_table=1) [TTG,CTG,ATG] [TAA,TAG,TGA]

2 Vertebrate Mitochondrial (transl_table=2) [ATT,ATC,ATA,ATG,GTG] [TAA,TAG,AGA,AGG]

3 Yeast Mitochondrial (transl_table=3) [ATA,ATG] [TAA,TAG]

4 Mold Mitochondrial; Protozoan Mitochondrial; Coelenterate Mitochondrial; Mycoplasma; Spiroplasma (transl_table=4) [TTA,TTG,CTG,ATT,ATC,ATA,ATG,GTG] [TAA,TAG]

5 Invertebrate Mitochondrial (transl_table=5) [TTG,ATT,ATC,ATA,ATG,GTG] [TAA,TAG]

6 Ciliate Nuclear; Dasycladacean Nuclear; Hexamita Nuclear (transl_table=6) [ATG] [TGA]

7 Echinoderm Mitochondrial; Flatworm Mitochondrial (transl_table=9) [ATG,GTG] [TAA,TAG]

8 Euplotid Nuclear (transl_table=10) [ATG] [TAA,TAG]

9 Bacterial, Archaeal and Plant Plastid (transl_table=11) [TTG,CTG,ATT,ATC,ATA,ATG,GTG] [TAA,TAG,TGA]

10 Alternative Yeast Nuclear (transl_table=12) [CTG,ATG] [TAA,TAG,TGA]

11 Ascidian Mitochondrial (transl_table=13) [TTG,ATA,ATG,GTG] [TAA,TAG]

12 Alternative Flatworm Mitochondrial (transl_table=14) [ATG] [TAG]

13 Chlorophycean Mitochondrial (transl_table=16) [ATG] [TAA,TGA]

14 Trematode Mitochondrial (transl_table=21) [ATG,GTG] [TAA,TAG]

15 Scenedesmus obliquus Mitochondrial Code (transl_table=22) [ATG] [TCA,TAA,TGA]

16 Thraustochytrium mitochondrial code (transl_table=23) [ATT,ATG,GTG] [TTA,TAA,TAG,TGA]

17 Pterobranchia Mitochondrial (transl_table=24) [TTG,CTG,ATG,GTG] [TAA,TAG]

18 Candidate Division SR1 and Gracilibacteria (transl_table=25) [TTG,ATG,GTG] [TAA,TAG]

19 Pachysolen tannophilus Nuclear Code (transl_table=26) [CTG,ATG] [TAA,TAG,TGA]

20 Karyorelict Nuclear (transl_table=27) [ATG] [TGA]

21 Condylostoma Nuclear (transl_table=28) [ATG] [TAA,TAG,TGA]

22 Mesodinium Nuclear (transl_table=29) [ATG] [TGA]

23 Peritrich Nuclear (transl_table=30) [ATG] [TGA]

Output coding regions as amino acid sequences

orfipy input.fasta --pep orfs_peptides.fa --min 50 --procs 4

--pep fasta (peptide) out file Default: None
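For readers who want to check a peptide by hand, the sketch below translates a DNA ORF with the standard genetic code (transl_table=1). It is a stand-alone illustration, not orfipy code. In particular, how orfipy handles alternative start codons or ambiguous bases in its peptide output is not covered here, and writing unknown codons as "X" is an assumption of this example.

```python
from itertools import product

# Standard genetic code (NCBI transl_table=1). Codons are enumerated with the
# bases in T, C, A, G order, which matches the amino acid string below.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(orf):
    """Translate a DNA ORF codon by codon; unknown codons become 'X'."""
    orf = orf.upper()
    usable = len(orf) - len(orf) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE.get(orf[i:i + 3], "X") for i in range(0, usable, 3))


if __name__ == "__main__":
    print(translate("ATGAAATTTGGG"))
```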

Output as a BED6 annotation file

orfipy input.fasta --bed orfs.bed --min 50 --procs 4

--bed bed out file Default: None
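A BED6 file is easy to post-process. The following sketch reads the six standard BED columns (chrom, start, end, name, score, strand) and computes each ORF's length as end minus start, since BED intervals are 0-based and half-open. The record type and the sample lines are hypothetical; the exact contents orfipy writes into the name field are not assumed here.

```python
from typing import NamedTuple


class BedRecord(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive
    name: str
    score: str
    strand: str

    @property
    def length(self):
        """Interval length in nucleotides (half-open coordinates)."""
        return self.end - self.start


def parse_bed6(lines):
    """Yield BedRecord objects from tab-separated BED lines."""
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and the optional header lines that BED files allow.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end, name, score, strand = line.split("\t")[:6]
        yield BedRecord(chrom, int(start), int(end), name, score, strand)


if __name__ == "__main__":
    # Hypothetical records, made up for illustration.
    example_lines = [
        "tx1\t2\t14\tORF_A\t0\t+\n",
        "tx1\t30\t330\tORF_B\t0\t-\n",
    ]
    for rec in parse_bed6(example_lines):
        print(rec.chrom, rec.name, rec.strand, rec.length)
```

For a real run, the same function could be fed an open file handle on the BED file produced above, for example to keep only ORFs above some length.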

Output as a BED12 annotation file

orfipy input.fasta --min 100 --bed12 orfs.bed --partial-5 --partial-3 --include-stop

--bed12 bed12 out file Default: None
--include-stop Include stop codon in the results, if a stop codon exists. This output format is compatible with TransDecoder's which includes stop codon coordinates Default: False

Specifying multiple output types at once

orfipy input.fasta --dna orfs.fa --min 10 --max 10000 --procs 4 --table 1 --outdir orfs_out --pep orfs_peptides.fa --bed orfsBED6.bed --bed12 orfsBED12.bed

With --longest, the longest ORFs are written to a separate BED6 file.

orfipy genes.fasta --dna orfs.fa --min 10 --max 10000 --procs 4 --table 1 --outdir orfs_out --pep orfs_peptides.fa --bed orfs.bed --longest

--longest Output a separate BED file for longest ORFs per sequence. Requires bed option. Default: False
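Conceptually, "longest ORF per sequence" is a group-by on the sequence ID followed by a maximum over interval length. The sketch below shows that selection on plain tuples. It is a hypothetical helper for illustration, not what orfipy runs internally, and keeping the first record on ties is an assumption of this example.

```python
def longest_per_sequence(records):
    """Map each sequence ID to its longest (seq_id, start, end, name) record.

    On ties the record seen first is kept (an assumption of this sketch).
    """
    best = {}
    for rec in records:
        seq_id, start, end = rec[0], rec[1], rec[2]
        current = best.get(seq_id)
        if current is None or end - start > current[2] - current[1]:
            best[seq_id] = rec
    return best


if __name__ == "__main__":
    # Hypothetical ORF intervals, made up for illustration.
    orfs = [
        ("tx1", 2, 14, "ORF_A"),
        ("tx1", 30, 330, "ORF_B"),
        ("tx2", 0, 90, "ORF_C"),
    ]
    for seq_id, rec in longest_per_sequence(orfs).items():
        print(seq_id, rec)
```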

Citation

orfipy: a fast and flexible tool for extracting ORFs

Urminder Singh, Eve Syrkin Wurtele

bioRxiv, Posted October 21, 2020

2021 2/13

orfipy: a fast and flexible tool for extracting ORFs
Urminder Singh, Eve Syrkin Wurtele
Bioinformatics, Published: 12 February 2021
