2019-07-03

アセンブリの前処理としてロングリードのキメラ領域（低オーバーラップ領域）を除く yacrd

2019 コマンドの誤り修正

2020 3/30 バージョンによるコマンドの違いを記載

2020 3/31 version0.6.0のコマンドを一番下に追記

2020 4/23 論文追記

　第三世代DNAシーケンシング法（PacBio、オックスフォードナノポア）は、リファレンスゲノムの構築（デノボアセンブリ）のための重要な技術となりつつある。この種のデータに対する新しいバイオインフォマティクス手法が急速に登場している。
　一部のロングリードアセンブラは、アセンブリ前にリードに対してエラー訂正を実行する。訂正は、第3世代リードの高いエラー率を減らし、アセンブリを扱いやすくするのに役立つが、時間とメモリを消費するステップでもある。最近のアセンブラ（例：Li（2016）; Ruan and Li（2019）など）は、未訂正の未加工のリードを直接アセンブルする方法を見つけた。したがって、ここでは未訂正アセンブリのみに焦点を当てる。この設定では、アセンブリの品質は、キメラリードと非常に誤った領域（Myers、2015）の影響を受ける。
　DASCRUBBERプログラム（Myers、2017）は、リードの「スクラビング」の概念を導入した。これは、他の方法で塩基を修正することを試みることなく、リード中の問題のある領域を迅速に除去する。その考えは、リードをスクラブすることは訂正よりも軽量の操作であり、それゆえ高性能で訂正のないゲノムアセンブラに適しているということである。
　DASCRUBBERは、リードのall-against-allマッピングを実行し、リードごとにパイルアップを作成する。次に、マッピング品質を分析して、推定上高いエラー率の領域を決定する。これは、パイルアップ内の他のリードからの同等で高品質の領域に置き換えられる。 MiniScrub（LaPierre et al、2018）は、オーバーラップ検出に使用されたアンカーの位置を記録するために、Minimap2（Li、2017）の修正版を使用する別のスクラビングツールである。 MiniScrubはリードごとにアンカー位置をイメージに変換する。次に、畳み込みニューラルネットワークが低品質のリード領域を検出して削除する。
　リードスクラビングのさらに上流にあるもう1つの問題は、リード間のオーバーラップの計算である。オーバーラップの保存はディスクを大量に消費し、著者らの知る限りでは、その潜在的に高いディスクスペースを最適化する試みはこれまでになかった。
　本稿では、ロングリードのアセンブラの初期段階を一緒に最適化する2つのツールを紹介する。 1つは、高速で効果的なリードのスクラビング用のyacrd（for Yet Another Chimeric Read Detector）である。もう1つは、リード間で検出される重複をフィルター処理するfpa（フィルターペアアライメント）である。

　DASCRUBBERやMiniScrubと同様に、yacrdはリードの低品質領域は他のリードでは十分にサポートされていないという仮定に基づいている。そのような領域を検出するために、yacrdはMinimap2を使用してall-against-allリードマッピングを実行してから、各リードのカバレッジを計算する。 DASCRUBBERおよびMiniScrubとは対照的に、yacrdはMinimap2によって与えられるおおよその位置マッピング情報のみを使用する。これは時間のかかるアライメントステップを回避する。これはベースレベルのアライメントを持たないことを犠牲にしているが、これはスクラビングを実行するのに十分であることが判明している。カバレッジが特定のしきい値（デフォルトでは4に設定されている）を下回った場所でリードが分割され、カバレッジの低い領域が完全に削除される。

f:id:kazumaxneo:20190703004847p:plain
Githubより転載

2022/03/02

Yacrd 1.0.0 Magby is out.

No major change just dependency update.

I think that all yacrd features are now implemented and stable.

I'm working on a tool that would have the same goal as yacrd but with another approach.https://t.co/LXZSAU3wAx #rustlang #Bioinformatics
— Pierre Marijon 🏳️‍🌈 (@pierre_marijon) March 1, 2022

Preprint of my new paper about two of my tools #yacrd and #fpa (I think I'm a little too much talk about it here) is now available !!

Discover how to optimize your long read assemblies for a minimum cost with two simple and efficient tools. https://t.co/wCC1qUOLl7
— Pierre Marijon 🏳️‍🌈 (@pierre_marijon) June 18, 2019

2020 3/3追記

https://twitter.com/pierre_marijon/status/1234783942986944512

インストール

ubuntu16.0.4のminicona3.4.0.5環境でテストした。

依存

Rust in stable channel
libgz
libbzip2
liblzma
minimap2 (リード同士のマッピングに必要)

conda install -c bioconda -y minimap2

本体 Github

#bioconda (link) 0.6とはコマンドが異なる。0.5.1を導入してテストした。
conda install -c bioconda -y yacrd==0.5.1

#cargo
cargo install yacrd

> yacrd -h

$ yacrd -h

yacrd 0.5.1 Omanyte

Pierre Marijon <pierre.marijon@inria.fr>

Yet Another Chimeric Read Detector

USAGE:

yacrd [SUBCOMMAND]

FLAGS:

-h, --help Prints help information

-V, --version Prints version information

SUBCOMMANDS:

chimeric In chimeric mode yacrd detect chimera if coverage gap are in middle of read

help Prints this message or the help of the given subcommand(s)

scrubbing In scrubbing mode yacrd remove all part of read not covered

> yacrd chimeric -h

$ yacrd chimeric -h

yacrd-chimeric

In chimeric mode yacrd detect chimera if coverage gap are in middle of read

USAGE:

yacrd chimeric [FLAGS] [OPTIONS]

FLAGS:

-j, --json Yacrd report are write in json format

-h, --help Prints help information

-V, --version Prints version information

OPTIONS:

-i, --input <input>...

Mapping input file in PAF or MHAP format (with .paf or .mhap extension), use - for read standard input (no

compression allowed, paf format by default) [default: -]

-o, --output <output>

Path where yacrd report are writen, use - for write in standard output same compression as input or use

--compression-out [default: -]

-f, --filter <filter>...

Create a new file {original_path}_fileterd.{original_extension} with only not chimeric records, format

support fasta|fastq|mhap|paf

-e, --extract <extract>...

Create a new file {original_path}_extracted.{original_extension} with only chimeric records, format support

fasta|fastq|mhap|paf

-s, --split <split>...

Create a new file {original_path}_splited.{original_extension} where chimeric records are split, format

support fasta|fastq

-F, --format <format> Force the format used [possible values: paf, mhap]

-c, --chimeric-threshold <chimeric-threshold>

Overlap depth threshold below which a gap should be created [default: 0]

-n, --not-covered-threshold <not-covered-threshold>

Coverage depth threshold above which a read are marked as not covered [default: 0.80]

--filtered-suffix <filtered-suffix>

Change the suffix of file generate by filter option [default: _filtered]

--extracted-suffix <extracted-suffix>

Change the suffix of file generate by extract option [default: _extracted]

--splited-suffix <splited-suffix>

Change the suffix of file generate by split option [default: _splited]

-C, --compression-out <compression-out>

Output compression format, the input compression format is chosen by default [possible values: gzip, bzip2,

lzma, no]

> yacrd scrubbing -h

$ yacrd scrubbing -h

yacrd-scrubbing

In scrubbing mode yacrd remove all part of read not covered

USAGE:

yacrd scrubbing [FLAGS] [OPTIONS] --mapping <mapping> --report <report> --scrubbed <scrubbed> --sequence <sequence>

FLAGS:

-j, --json Yacrd report are write in json format

-h, --help Prints help information

-V, --version Prints version information

OPTIONS:

-m, --mapping <mapping>

Path to mapping file in PAF or MHAP format (with .paf or .mhap extension, paf format by default)

-s, --sequence <sequence>

Path to sequence you want scrubbed, format support fasta|fastq

-r, --report <report> Path where yacrd report are writen

-S, --scrubbed <scrubbed> Path where scrubbed read are write [default: -]

-c, --chimeric-threshold <chimeric-threshold>

Overlap depth threshold below which a gap should be created [default: 0]

-n, --not-covered-threshold <not-covered-threshold>

Coverage depth threshold above which a read are marked as not covered [default: 0.80]

-M, --mapping-format <format> Force the format used [possible values: paf, mhap]

実行方法

1、chimeric - カバレッジギャップがリードの中央にある場合にキメラ検出

キメラ検出。必要に応じて一括セッティングオプション"-x"を使用する (e.g., "-x ava-ont"))。12スレッド使用。

minimap2 -t 12 reads.fq.gz reads.fq.gz | yacrd chimeric -o reads.yacrd

キメラフィルタリングされたリードを出力

minimap2 -t 12 reads.fq.gz reads.fq.gz > mapping.paf
yacrd chimeric -i mapping.paf -f reads.fasta > reads.yacrd # produce reads_fileterd.fasta

-f, --filter <filter> Create a new file {original_path}_fileterd.{original_extension} with only not chimeric records, format support fasta|fastq|mhap|paf

キメラリードのみ出力

minimap2 -t 12 reads.fq.gz reads.fq.gz > mapping.paf
yacrd chimeric -i mapping.paf -e reads.fasta > reads.yacrd # produce reads_extracted.fasta

-e, --extract <extract> Create a new file {original_path}_extracted {original_extension} with only chimeric records, format support fasta|fastq|mhap|paf

キメラ領域をsplitして除き、全リード出力

minimap2 -t 12 reads.fq.gz reads.fq.gz > mapping.paf
yacrd chimeric -i mapping.paf -s reads.fasta > reads.yacrd # produce reads_splited.fasta

-s, --split <split> Create a new file {original_path}_splited.{original_extension} where chimeric records are split, format support fasta|fastq

2、scrubbing - chimericとは違い、カバレッジギャップがリードのどの領域にあってもキメラ検出

ONTのキメラ検出。12スレッド使用。

minimap2 -x ava-ont -g 500 -t 12 reads.fq.gz reads.fq.gz > overlap.paf
yacrd scrubbing -c 4 -n 0.4 -m overlap.paf -s reads.fasta -S reads_scrubbed.fasta -r scrubbed_report.yacrd

minimap2

-g stop chain enlongation if there are no minimizers in INT-bp [5000]

yacrd

-c Overlap depth threshold below which a gap should be created [default: 0]
-n Coverage depth threshold above which a read are marked as not covered [default: 0.80]
-m Path to mapping file in PAF or MHAP format (with .paf or .mhap extension, paf format by default)
-s Path to sequence you want scrubbed, format support fasta|fastq
-S Path where scrubbed read are write [default: -]
-r Path where yacrd report are writen

PacbioのP6-C4のキメラ検出。12スレッド使用。

minimap2 -x ava-pb -g 800 -t 12 reads.fq.gz reads.fq.gz > overlap.paf
yacrd scrubbing -c 4 -n 0.4 -m overlap.paf -s reads.fasta -S reads_scrubbed.fasta -r scrubbed_report.yacrd

PacbioのSequelのキメラ検出。12スレッド使用。

minimap2 -x ava-pb -g 800 -t 12 reads.fq.gz reads.fq.gz > overlap.paf
yacrd scrubbing -c 4 -n 0.4 -m overlap.paf -s reads.fasta -S reads_scrubbed.fasta -r scrubbed_report.yacrd

Preprintでは、キメラリードを除くことでヒトゲノムと線虫のRedBean (旧 wtdbg2)とminiasmを使ったde novo assemblyのパフォーマンスが大きく伸びることが示されています。

fpaは別に紹介します。

2020 03/30

v0.6からコマンドが大きく変わったので、使い方を簡単に書いておきます。

#ONT-readsのall versus all overlap
minimap2 -x ava-ont -g 500 -t 16 ONT.fq.gz reads.fq.gz > overlap.paf

#キメラがないか分析
yacrd -i overlap.paf -o reads.yacrd

#キメラやオーバーラップなしリードの表記があれば、以下のコマンドでフィルタリング
#その前にリードをfasta変換しておく。
seqkit fq2fa ONT.fq.gz > ONT.fa

#yacrdのサブコマンド;filter実行
yacrd -i overlap.paf -o reads.yacrd filter -i ONT.fa -o reads.filter.fasta

引用

yacrd and fpa: upstream tools for long-read genome assembly

Pierre Marijon, Rayan Chikhi, Jean-Stéphane Varré

bioRxiv preprint first posted online Jun. 18, 2019

yacrd and fpa: upstream tools for long-read genome assembly
Pierre Marijon, Rayan Chikhi, Jean-Stéphane Varré
Bioinformatics, 2020

メタゲノムデータセットをタンパク質レベルでアセンブリし、ホモログサーチを行う GRASP2

2019 BMC Bioinformatics 高速なツール protein search assembly assembly graph

　メタゲノミクスは、特定の微生物群集のゲノム含有量を研究するための培養に依存しないアプローチである。典型的なメタゲノミクス研究では、環境サンプルから微生物のDNAが抽出され、次世代シークエンシング（NGS）技術を使用してシークエンシングされる。中程度または複雑度の高い微生物群集から得られたメタゲノミクスデータの分析は、構成微生物の分類学的およびゲノム的多様性が高いこと、およびしばしば不十分なおよび／または不均一なシーケンシングカバレッジのために困難である。そのようなデータセットのde novoアセンブリ（シーケンスリードから個々の微生物ゲノムを再構築することを目的とする）は、しばしば不完全で断片的であり、下流の機能的および分類学的分析を妨げる。たとえば、MetaHIT（Metagenomics of the Human Intestinal Tract）プロジェクトデータアセンブリでは、コンティグN50はわずか2.2kbであり、リードの53％以上は未アセンブルのままだった[ref.1]。

　アセンブリに依存しない分析方法は、利用可能なデータベースに対してそれらを検索することによって個々のリードに直接アノテーションを付ける。これらのデータベースは、completeシーケンシングされたゲノム、タンパク質およびタンパク質ドメイン、ならびにアノテーション付き分類法を有するマーカー遺伝子を含み得る。データベースに対する有意なヒットは、データベース内のリードと配列の間の相同性を示唆し、個々のリードの機能と分類法を推論し、続いてコミュニティ全体の機能を予測することを可能にする。ただし、このアプローチは既存のデータベースの完全性に大きく依存している。実際には、そのようなデータベースは、ほとんどの環境における微生物の多様性が十分に特徴付けられていないかまたはシーケンシングされていない、単純でよく研究されたコミュニティを除いて、めったに利用できない。この場合、ほとんどのデータベース検索では、中程度または遠隔相同性の検出が行われる。これは、closely relatredの検出と比較してより困難である。そのようなホモログ検索は、限られた量の機能的および分類学的シグナルしか含まないリード長の短さによってさらに混乱する。

　本著者らは、失われた機能的および分類学的シグナルを再構築するためにリードを重ね合わせそれらを長くすることでこの問題に取り組むことを試みた。リードのオーバーラップ検出は、デノボアセンブリで採用されているものと同様である。しかし、ほとんどのアセンブラのように過度のエラー修正や積極的なグラフフィルタリング、プルーニングは行わなかったため、豊富なゲノムのほとんどの多型を保持することができる。さらに、（FragGeneScan [ref.2]を使用して）ヌクレオチドリードから翻訳されたショートペプチド配列に対してリードオーバーラップ検出を行った。このアプローチは、アミノ酸空間における同義置換の崩壊により、タンパク質配列をより効果的に再構築することが示されている[ref.3]。そのような直感に基づいて、著者らは以前に概念的な重なりグラフにおいてクエリ／リファレンスタンパク質のホモログに対応するパスウェイを見つけることを目的としたGRAPS（ガイド付き参照ベースの短いペプチドのアセンブリ）と呼ばれる同時アラインメントおよびアセンブリアルゴリズムを開発した。。この意味で、GRASPはリファレンスのホモログタンパク質配列を再構築する遺伝子中心アセンブリツールとして使用することができる。それはまた、アセンブリされたコンティグをテンプレートとして使用することによって、個々の相同なリードを動員するためのホモログ検索ツールとしても使用され得る。ベンチマークデータは、GRASPが既存のホモログ検索ツールの再現率を精度を落とすことなく約40％から約60％に向上させたことを示した[ref.4]。 RAPSearch [ref.5]やDIAMOND [ref.6]のような最近開発されたホモログ検索ツールは、計算効率の向上に焦点を当てており、再現率のパフォーマンスは低い。したがって、効率的な同族体検索ツールは多数利用可能だが、GRASPは依然として高い再現率と全体的なパフォーマンスを誇るツールである。

　しかし、GRASPで行われるde novoアセンブリのため、BLASTのような伝統的なリードベースのホモログ検索ツールよりはるかに遅くなる。さらに、GRASPはまた、アセンブリ工程を補助するために使用される接尾辞配列データ構造を記憶するために大量のメモリを必要とする。したがって、GRASPは、人間の口腔環境から生成されたデータなど、比較的単純で小さなメタゲノムデータセットにのみ適用されていた。 GRASPの実用的な実用性を拡張するために、本著者らはGRASPxと呼ばれる新しいアライメントとアセンブリの同時アルゴリズムを開発した[ref.7]。 GRASPxは、GRASPの本来のパフォーマンスを犠牲にすることなく、GRASPの速度を最大30倍向上させる。 GRASPxもGRASPと同量のメモリを使用する。パフォーマンスが大幅に向上したにもかかわらず、GRASPxは他のホモログ検索ツールよりも多くの時間とスペースを必要としていた。

　この論文では、GRASPxの計算効率とスペース効率をさらに向上させるために、GRASP2と呼ばれる、完全に再設計された同時アライメントとアセンブリのアルゴリズムを紹介する。 GRASPx、BLAST、PSI-BLAST、およびFASTMを含むホモログ検索アルゴリズムのセットに対してGRASP2をベンチマークした。ベンチマークの結果は、GRASP2がGRASPxよりも8倍高速であり、インデックス作成フェーズとアセンブリ/検索フェーズの両方で8倍メモリ使用量が少ないことを示した。 GRASP2はGRASPxと同じ高い性能を持っている。そしてGRASP2は他のホモログ検索ツールよりもかなり優れている。これらの結果は、本発明者らの新しいアルゴリズムが、同時アラインメントおよびアセンブリアルゴリズムの実行時間および所要スペースを効果的に減らすことができることを示唆している。結果として得られるソフトウェアGRASP2は、その高い性能と著しく改善された計算効率のために大きな応用の可能性を持っている。

　最初にGRASPxのアルゴリズムフレームワークを要約して、ここでさらに改善された制限を特定する。 GRASPxは、２つの主な段階、すなわち効率的なオーバーラップ検出を容易にするためにリードセット全体から接尾辞配列を構築するためのindexing段階、およびリファレンスタンパク質のホモログに対応するパスウェイについて概念的重なりグラフを検索するアセンブリ／検索段階を含む。（以下略）

インストール

ubuntu16.04でテストした（docker使用、ホストOS macos10.12）。

依存

GCC version 2.8.0+
Boost library version 1.46.0+
cmake
make

#手っ取り早く全部入れる
apt update && apt install libboost-all-dev cmake make gcc

#Fraggenescanなどシーケンシングリードからproteinを予測する方法も必要
#bioconda (link)
conda install -c bioconda -y fraggenescan

GRASP2本体　SourceForge

ターボールをダウンロードして解凍、ビルドする。

tar -zxvf GRASP2_V0.02.tar
cd GRASP2_release/
cmake CMakeLists.txt
make -j 16
cd bin/

> ./grasp2-assemble --help

# ./grasp2-assemble --help

Usage: grasp-assemble [query] [peptide_db] [contig_out]

List of options:

--help print the help message

--query arg query sequences (in FASTA format)

--db arg short-peptide reads (in FASTA format)

--contig_out arg assembled contigs output (in FASTA format)

--alignment_out arg alignment output (between query and contigs, in

BLAST-style format)

--index arg (=index) working directory where the indexing files should be

found

--gap_open arg (=-10) gap open penalty for sequence alignment

--gap_extend arg (=-1) gap extension penalty for sequence alignment

--band_size arg (=40) band size for sequence alignment

--e_value arg (=10) e-value cutoff (Karlin-Altschul statistics)

--orphan arg align orphan reads (true/false; default false)

--num_threads arg (=1) maximum number of threads to be used

--verbose arg print intermediate information (true/false; default

true)

> ./grasp2-build --help

# ./grasp2-build --help

Usage: grasp-build [peptide_db (FASTA)]

List of options:

--help print the help message

--db_file arg short-peptide reads (in FASTA format)

--index arg (=index) working directory for indexing file dump

--extension_len arg (=10) minimum overlap length for path extension

--neighbor_score arg (=11) neighbor score for 3-mer seed matches

--num_threads arg (=1) maximum number of threads to be used

--verbose arg print intermediate information (default true)

GRASPxも入れる。

同様にbuildする。

cd GRASPx_src/
#Makefileの2行目をカレントパスに修正してからmake
make -j 16

実行方法

1、FragGeneScanを使いgenomeやショートリードからアミノ酸配列を得る。ここではアセンブリ配列contig.fastaを指定。

run_FragGeneScan.pl -genome=contigs.fasta -out=output -complete=0 -train=illumina_5 -thread=20

-genome= sequence file name including the full path
-complete= 1 if the sequence file has complete genomic sequences. 0 if the sequence file has short sequence reads
train= file name that contains model parameters. [complete] for complete genomic sequences or short sequence reads without sequencing error. [sanger_5] for Sanger sequencing reads with about 0.5% error rate. [sanger_10] for Sanger sequencing reads with about 1% error rate. [454_10] for 454 pyrosequencing reads with about 1% error rate. [454_30] for 454 pyrosequencing reads with about 3% error rate. [illumina_5] for Illumina sequencing reads with about 0.5% error rate. [illumina_10] for Illumina sequencing reads with about 1% error rate.

output.faやgffファイル等が出力される。

2、Indexing

bin/grasp2-build Short-peptides.fa

3、アセンブリ

bin/grasp2-assemble [Query_proteins] [Short-peptide_reads] [Contig_output]

4、リードをアセンブリしたcontigにマッピング

graps-map [Short-peptide_reads] [Contig_output] [Map_output]

テスト中。

引用

GRASP2: fast and memory-efficient gene-centric assembly and homolog search for metagenomic sequencing data

Cuncong Zhong, Youngik Yang, Shibu Yooseph
BMC Bioinformatics 2019 20 (Suppl 11) :276

2019-07-01

blast結果を視覚化する BlastViewer

BLAST GUIツールインフォマティクス解析をサポートするツール結果の視覚化 (visualization) DNA解析ソフト

BlastViewerは、NCBI BLASTソフトウェアの結果をグラフィカルに表示することを目的として開発されたツールである。詳細はGIhutbとwiki参照。

wiki

Home · pgdurand/BlastViewer Wiki · GitHub

インストール

windows10 proとubuntu16.0.4でテストした（いずれもjava version1.8.0）。

依存

Java Virtual Machine 1.7 (or above) from Oracle (link).

本体　GIthub

リリースよりビルド済みjarファイルをダウンロードする。

実行方法

１、blastを実行してlegacy xmlファイルを出力しておく。NCBIにアクセスしてblastを実行したなら、上のメニューからxml形式でダウンロードする。

f:id:kazumaxneo:20190701121004j:plain

左上の方にあるXMLを選択してダウンロード実行。

f:id:kazumaxneo:20190701123724j:plain

ローカルblastの場合、"-outfmt 5"付きでblastを実行する（legacy blastを使っているなら"-m 7"）。

#例えばblastpなら
blastp -db blastp_database -query query.fasta -out output.xml \
 -num_threads 8 -outfmt 5

-outfmt <String>

alignment view options:
0 = Pairwise,
1 = Query-anchored showing identities,
2 = Query-anchored no identities,
3 = Flat query-anchored showing identities,
4 = Flat query-anchored no identities,
5 = BLAST XML,
6 = Tabular,
7 = Tabular with comment lines,
8 = Seqalign (Text ASN.1),
9 = Seqalign (Binary ASN.1),
10 = Comma-separated values,
11 = BLAST archive (ASN.1),
12 = Seqalign (JSON),
13 = Multiple-file BLAST JSON,
14 = Multiple-file BLAST XML2,
15 = Single-file BLAST JSON,
16 = Single-file BLAST XML2,
18 = Organism Report

２、コンソールで以下のように打ってBlastViewerを起動。

java -jar blastviewer-5.2.0.jar

#またはメモリ指定して起動。512mメモリで立ち上げ、10gまでメモリ使用を許可。
java -Xms512m -Xmx10G -jar blastviewer-5.0.0.jar

windowsで起動したところ。

f:id:kazumaxneo:20190701115822j:plain

xmlファイルを読み込む。

f:id:kazumaxneo:20190701124051j:plain

読み込んだところ。BCBIでblastを実行した時と同じように、表形式で結果はまとめられる。

f:id:kazumaxneo:20190701124126j:plain

ヒット部位

f:id:kazumaxneo:20190701143446j:plain

それぞれのヒットをクリックすると、その配列に関する情報が表示される。

f:id:kazumaxneo:20190701143611j:plain

Multiple Sequence Alignment (MSA) 表示

f:id:kazumaxneo:20190701124401j:plain

引用

https://github.com/pgdurand/BlastViewer

AMOSアセンブラパッケージのMinimusとMinimus2

2007 2011 BMC Bioinformatics assembly dereplication

2021 6/11 minimus2のコマンドを修正

MInumusのpaper（Sommer et al., 2007）より

　大規模な全ゲノムシークエンシングプロジェクトの課題に対処するためのアルゴリズムの必要性に応えて、ゲノムアセンブラは非常に大きく複雑になっている。しかし、アセンブラの最も一般的な用途の多くは、より少ないソフトウェアコンポーネントのみ必要とし、よりメモリ使用量が少なく、インストールと実行がはるかに簡単な、より単純なタイプのアセンブラである。

　これらの問題に対処するためにMinimusアセンブラを開発し、さまざまなアセンブリ問題でテストした。Minimusがウイルスゲノム、個々の遺伝子、およびBACクローンのアセンブリを含む、いくつかの小さなアセンブリ作業でうまく機能することを示す。さらに、大規模なアセンブリパイプラインの構成要素としての適合性を評価するために、バクテリアゲノムアセンブリにおけるMinimusの性能を評価する。これらのタスクに現在使用されている他のソフトウェアとは異なり、Minimusは、より細分化されたアセンブリを生成することを犠牲にして、大幅に少ないアセンブリエラーを生成することを示す。

　スモールゲノムや他の小さなアセンブリ作業のために、Minimusが既存のツールより速くそしてはるかに柔軟であることを見つける。その小型サイズとモジュラー設計により、Minimusは複雑なアセンブリパイプラインのコンポーネントとして最適である。 Minimusはオープンソースソフトウェアプロジェクトとしてリリースされており、そのコードはSourceforgeのAMOSプロジェクトの一部として入手可能である。

AMOS wiki

http://amos.sourceforge.net/wiki/index.php/AMOS

Minimus wiki

http://amos.sourceforge.net/wiki/index.php/Minimus

Minimus2 wiki

http://amos.sourceforge.net/wiki/index.php/Minimus2

インストール

condaを使ってpython2.7の仮想環境に導入した。

SourceForge

A clone of the official AMOS git repo on sourceforge

Amosパッケージを入れれば使える。

#bioconda(link)
mamba create -n amos -y python=2.7
conda activate amos
mamba install -c bioconda -y amos

> minimus

# minimus

The log file is: runAmos.log

Cannot substitute variable strip .afg PREFIX

>minimus2 -h

# minimus2 -h

minimus2 - The AMOS Pipeline for merging 2 assemblies

Usage:

minimus2 prefix \

-D REFCOUNT=<n> \ # Number of sequences in the 1st assembly ; (Required)

-D OVERLAP=<n> \ # Assembly 1 vs 2 minimum overlap (Default 40bp)

-D CONSERR=<f> \ # Maximum consensus error (0..1) (Default 0.06)

-D MINID=<n> \ # Minimum overlap percent identity for alignments (Default 94)

-D MAXTRIM=<n> # Maximum sequence trimming length (Default 20bp)

実行方法

１、minimus

少数の配列セットをアセンブルする（次世代のようなたくさんの配列は処理できない）。

multi-fastaのmy_reads.seqを指定する。

toAmos -s input.seq -o input.afg
minimus input

作業ディレクトリとアセンブル結果のinput.fastaが出力される。

２、minimus2

redundancyを取り除きながら２組のコンティグのマージを行うならminimus2を使う。

cat pair1.seq pair2.seq > concatenate.seq
toAmos -s concatenate.seq -o concatenate.afg
minimus2 concatenate -D REFCOUNT=<number>

上で指定する<number>は最初の配列pair1.seqのサイズになる。grepで取得する。

grep -c "^>" pair1.seq

引用

Minimus: a fast, lightweight genome assembler.
Sommer DD, Delcher AL, Salzberg SL, Pop M

BMC Bioinformatics. 2007 Feb 26;8:64.

Next generation sequence assembly with AMOS
Treangen TJ, Sommer DD, Angly FE, Koren S, Pop M

Curr Protoc Bioinformatics. 2011 Mar;Chapter 11:Unit 11.8

2019-06-29

(メタゲノム向け) blastアノテーション結果をインタラクティブなグラフで視覚化する Keanu

2019 BMC Bioinformatics 結果の視覚化 (visualization) BLAST テスト失敗

　メタゲノミクスは、環境サンプルから回収された遺伝物質の研究である。これらのサンプルは、特定の環境の多様性や生態学に関する情報を提供する。メタゲノミクス研究は通常、ショットガンシーケンスデータセットから得られた微生物シーケンスに焦点を当てている[ref.1]。ウイルス、細菌、真核生物の成分の特徴付けは、特定のサンプルに存在する多様性を総合的に理解するために重要である[ref.1]。シーケンシングプラットフォームとツールの急速な開発と改良により、研究者は環境サンプルの種の豊富さと多様性をよりよく観察することができる。しかしながら、これらの進歩はまた分析され解釈されるべき膨大な量のデータの発生をもたらした。大量のテキストデータを分析するのは面倒で困難なプロセスである。メタゲノミクス研究も例外ではない。得られたデータセットは、本質的な多次元性と、複数レベルの階層および接続性の存在によって特徴付けられる[ref.2]。視覚化は、研究者がデータについてさらに質問をすることができるもの、および追加の分析を実行できるものを決定するのに役立つ。対話機能により、さらに調査と分析のオプションが追加される。メタゲノミクスデータの視覚化は研究の活発な分野であり、毎年新しい方法が開発されている。視覚化方法は、それらの視覚化能力に従って分類することができる。いくつかの方法は、単一のメタゲノムを視覚的に探索することを目的としているが、他の方法は、複数のメタゲノムの視覚化を可能にする。このツールには、円グラフ、バブルチャート、ツリーと樹状図、ボックスプロット、自己組織化マップなど、さまざまな種類の視覚化も含まれる。さまざまな視覚化方法とツールの詳細なレビューについては[ref.2]を参照してください。

　ここでは、アラスカ内陸部の古代の土壌における生物多様性のパターンを調査するために、微生物、ウイルス、真核生物種を含むメタゲノムの種構成を調べることができる可視化ツール、Keanuを開発した。このツールは、ローカルで開くことができ、内容に基づいてサンプルをさらに分析する方法を決定するために探索できるインタラクティブなWebページを生成する。この種の探索的データ分析は、どのような種類の追加の質問にデータを尋ねることができるのか、およびより多くのデータを収集する必要があるのかどうかを知らせる[ref.3]。静止画像は小さなデータセットを探索するのに役立つが、そのような画像はデータによっては非常に大きくまたは非常に詳細になることがあるため、それらの探索は困難になる。この対話型の視覚化はNCBI分類データベースからの情報を使用して、ルートから種レベルまで降順で各分類群の存在量を示している。ユーザーは、データセット全体の混沌としたノイズの多い視覚化を作成せずに分類の特定のサブセクションを探索できる。

　Keanuに似たツールが開発された。 MetaSee [ref.4]は、かつてはWebサービスおよびスタンドアロンのGUIアプリケーションとして利用可能だった。Keanuが使用しているのと非常によく似た形式でメタゲノムデータを表示できる対話型の視覚化ツールだったが、MetaSeeを実行することは、サーバーのソースコードをダウンロードしてローカルで実行する必要があるため、簡単な作業ではない。 Blobtools [ref.5]とMetacoder [ref.6]はどちらも大きな静止画像を作成する。 Blobtoolsは、BLOBプロットを作成するPythonツールである。 Metacoder [ref.6]はヒートツリーを作成するRパッケージである。どちらもKeanuの開発を駆り立てたインタラクティブな機能を欠いている。 Keanuはローカルで実行されるため、ユーザーはWebサービスを使用してデータベースを管理するのではなく最新のデータベースを管理したり、サービスをホストしたりすることができる。また、Keanuの対話機能は出力の分析に役立つ。

f:id:kazumaxneo:20190317181449p:plain

Fig. 2
flow diagram of Keanu showing how data is piped into different tools or processes

論文より転載

インストール

依存

Keanu can be run on any system where Python is available, which includes Windows, macOS, and Linux.

Keanu is written in Python 3.5 and is available with an ex- ample dataset at https://github.com/IGBB/keanu (Fig. 1).

Github

git clone https://github.com/IGBB/keanu.git
cd keanu/

> python3 make_db.py -h

# python3 make_db.py -h

usage: make_db.py [-h] [-names NAMES] [-nodes NODES] [-merged MERGED]

[-deleted DELETED] [-out_db OUT_DB] [-out_md_db OUT_MD_DB]

A tool to format NCBI taxonomy data for Keanu

optional arguments:

-h, --help show this help message and exit

-names NAMES Location of names.dmp

-nodes NODES Location of nodes.dmp

-merged MERGED Location of merged.dmp

-deleted DELETED Location of delnodes.dmp

-out_db OUT_DB Name of output taxonomy database

-out_md_db OUT_MD_DB Name of output merged/deleted database

> python3 format_input.py -h

# python3 format_input.py -h

usage: format_input.py [-h] [-in INPUT] [-out OUTPUT]

A tool to format BLAST qseqid/staxid files into Keanu's input

optional arguments:

-h, --help show this help message and exit

-in INPUT, --input INPUT

BLAST query/taxon data

-out OUTPUT, --output OUTPUT

Output filename

> python3 keanu.py -h

# python3 keanu.py -h

usage: keanu.py [-h] [-db DATABASE] [-md_db MERGED_DELETED_DATABASE]

[-view {tree,bilevel}] [-in INPUT] [-out OUTPUT]

optional arguments:

-h, --help show this help message and exit

-db DATABASE, --database DATABASE

Formatted taxonomy database

-md_db MERGED_DELETED_DATABASE, --merged_deleted_database MERGED_DELETED_DATABASE

Merged/deleted taxonomy database

-view {tree,bilevel}, --view {tree,bilevel}

View choice

-in INPUT, --input INPUT

Input data set

-out OUTPUT, --output OUTPUT

Output HTML filename

実行方法

１、blast実行。" -outfmt '6 std staxids' "をつけて実行する。ここではblastnを使っているが、感度を上げるならアミノ酸で探す（クエリが塩基配列なら、プロテインデータベースを用意してx）。

blastn -db blast_database/nt -query sample2.fasta -outfmt '6 std staxids' -out query.txt -num_threads 10 -evalue 1e-10

２、データベース作成。初回のみ。ローカルのblastデータベースを指定する。

#taxdmpファイルのダウンロード
wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz
tar -zxvf taxdump.tar.gz

python3 make_db.py -names names.dmp -nodes nodes.dmp \
 -out_db taxonomy.dat -deleted delnodes.dmp -merged merged.dmp \
 -out_md_db merged_deleted.dat

３、クエリデータの準備。

python3 format_input.py -in query.txt -out keanu.txt

４、グラフ表示

bilevel partition graph

python3 keanu.py -db taxonomy.dat -md_db merged_deleted.dat \
 -in keanu.txt -view bilevel -out bilevel.html

５、collapsible tree表示

python3 keanu.py -db taxonomy.dat -md_db merged_deleted.dat \
 -in keanu.txt -view tree -out tree.html

テスト時はhtml出力でエラーになxった。

引用

Keanu: a novel visualization tool to explore biodiversity in metagenomes

Adam Thrash, Mark ArickII, Robyn A. Barbato, Robert M. Jones, Thomas A. Douglas, Julie Esdale, Edward J. Perkins and Natàlia Garcia-ReyeroEmail author
BMC Bioinformatics 2019 20 (Suppl 2) :103

参考

https://evosite3d.blogspot.com/2013/06/browsing-ncbi-taxonomy-with-python.html

2019-06-28

MMseqs2 コマンド其の2 タンパク質配列のクラスタリング

2017 Nature Biotechnology 高速なツール sequence clustering

インストール

以前の記事を参照

> mmseqs

$ mmseqs

MMseqs2 (Many against Many sequence searching) is an open-source software suite for very fast,

parallelized protein sequence searches and clustering of huge protein sequence data sets.

Please cite: M. Steinegger and J. Soding. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology, doi:10.1038/nbt.3988 (2017).

MMseqs2 Version: 905109c40ac720c407282d4d6517a164c3470fba

Easy workflows (for non-experts)

easy-search Search with a query fasta against target fasta (or database) and return a BLAST-compatible result in a single step

easy-linsearch Linear time search with a query fasta against target fasta (or database) and return a BLAST-compatible result in a single step

easy-linclust Compute clustering of a fasta database in linear time. The workflow outputs the representative sequences, a cluster tsv and a fasta-like format containing all sequences.

easy-cluster Compute clustering of a fasta database. The workflow outputs the representative sequences, a cluster tsv and a fasta-like format containing all sequences.

easy-taxonomy Compute taxonomy and lowest common ancestor for each sequence. The workflow outputs a taxonomic classification for sequences and a hierarchical summery report.

Main tools (for non-experts)

createdb Convert protein sequence set in a FASTA file to MMseqs sequence DB format

search Search with query sequence or profile DB (iteratively) through target sequence DB

linsearch Search with query sequence DB through target sequence DB

map Fast ungapped mapping of query sequences to target sequences.

cluster Compute clustering of a sequence DB (quadratic time)

linclust Cluster sequences of >30% sequence identity *in linear time*

createindex Precompute index table of sequence DB for faster searches

createlinindex Precompute index for linsearch

enrich Enrich a query set by searching iteratively through a profile sequence set.

rbh Find reciprocal best hits between query and target

clusterupdate Update clustering of old sequence DB to clustering of new sequence DB

Utility tools for format conversions

createtsv Create tab-separated flat file from prefilter DB, alignment DB, cluster DB, or taxa DB

convertalis Convert alignment DB to BLAST-tab format or specified custom-column output format

convertprofiledb Convert ffindex DB of HMM files to profile DB

convert2fasta Convert sequence DB to FASTA format

result2flat Create a FASTA-like flat file from prefilter DB, alignment DB, or cluster DB

createseqfiledb Create DB of unaligned FASTA files (1 per cluster) from sequence DB and cluster DB

Taxonomy tools

taxonomy Compute taxonomy and lowest common ancestor for each sequence.

createtaxdb Annotates a sequence database with NCBI taxonomy information

addtaxonomy Add taxonomy information to result database.

lca Compute the lowest common ancestor from a set of taxa.

taxonomyreport Create Kraken-style taxonomy report.

filtertaxdb Filter taxonomy database.

An extended list of all tools can be obtained by calling 'mmseqs -h'.

Bash completion for tools and parameters can be installed by adding "source MMSEQS_HOME/util/bash-completion.sh" to your "$HOME/.bash_profile".

Include the location of the MMseqs2 binary is in your "$PATH" environment variable.

kazu@kazu:~/taniguchi_datadir/metagenome/sample2/test$

kamisakumanoMBP:~ kazuma$

> mmseqs cluster

$ mmseqs cluster

Usage: mmseqs cluster <i:sequenceDB> <o:clusterDB> <tmpDir> [options]

Compute clustering of a sequence DB (quadratic time)

Options:

Prefilter:

--max-seqs INT Maximum result sequences per query allowed to pass the prefilter (this parameter affects sensitivity) [20]

Align:

-c FLOAT list matches above this fraction of aligned (covered) residues (see --cov-mode) [0.800]

--cov-mode INT 0: coverage of query and target, 1: coverage of target, 2: coverage of query 3: target seq. length needs be at least x% of query length, 4: query seq. length needs be at least x% of target length [0]

-a add backtrace string (convert to alignments with mmseqs convertalis utility)

-e FLOAT list matches below this E-value (range 0.0-inf) [0.001]

--min-seq-id FLOAT list matches above this sequence identity (for clustering) (range 0.0-1.0) [0.000]

--min-aln-len INT minimum alignment length (range 0-INT_MAX) [0]

--seq-id-mode INT 0: alignment length 1: shorter, 2: longer sequence [0]

--alt-ali INT Show up to this many alternative alignments [0]

--force-reuse reuse tmp file in tmp/latest folder ignoring parameters and git version change

Clust:

--cluster-mode INT 0: Setcover, 1: connected component, 2: Greedy clustering by sequence length 3: Greedy clustering by sequence length (low mem) [0]

--single-step-clustering 0 switches from cascaded to simple clustering workflow [1, set to 0 to disable]

Common:

--threads INT number of cores used for the computation (uses all cores by default) [56]

--compressed INT write results in compressed format [0]

-v INT verbosity level: 0=nothing, 1: +errors, 2: +warnings, 3: +info [3]

An extended list of options can be obtained by calling 'mmseqs cluster -h'.

- Steinegger M, Soding J: Clustering huge protein sequence sets in linear time. Nature Communications, doi:10.1038/s41467-018-04964-5 (2018)

- Hauser M, Steinegger M, Soding J: MMseqs software suite for fast and deep clustering and searching of large protein sequence sets. Bioinformatics, 32(9), 1323-1330 (2016).

- Steinegger M, Soding J: MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology, doi:10.1038/nbt.3988 (2017)

3 Database paths are required

> mmseqs createdb

$ mmseqs createdb

Usage: mmseqs createdb <i:fastaFile1[.gz]> ... <i:fastaFileN[.gz]> <o:sequenceDB> [options]

Convert protein sequence set in a FASTA file to MMseqs sequence DB format

Options:

Misc:

--dont-split-seq-by-len 0 Dont split sequences by --max-seq-len [1, set to 0 to disable]

--dbtype INT Database type 0: auto, 1: amino acid 2: nucleotides [0]

--dont-shuffle 0 Do not shuffle input database [1, set to 0 to disable]

--id-offset INT numeric ids in index file are offset by this value [0]

Common:

--compressed INT write results in compressed format [0]

-v INT verbosity level: 0=nothing, 1: +errors, 2: +warnings, 3: +info [3]

An extended list of options can be obtained by calling 'mmseqs createdb -h'.

- Steinegger M, Soding J: MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology, doi:10.1038/nbt.3988 (2017)

2 Database paths are required

> mmseqs createtsv

$ mmseqs createtsv

Usage: mmseqs createtsv <i:queryDB> [<i:targetDB>] <i:resultDB> <o:tsvFile> [options]

Create tab-separated flat file from prefilter DB, alignment DB, cluster DB, or taxa DB

Options:

Misc:

--first-seq-as-repr Use the first sequence of the clustering result as representative sequence

--target-column INT Select a target column (default 1), 0 if no target id exists. [1]

--full-header Replace DB ID by its corresponding Full Header

--idx-seq-src INT 0: auto, 1: split/translated sequences, 2: input sequences [0]

--db-output Output a result db instead of a text file

Common:

--threads INT number of cores used for the computation (uses all cores by default) [56]

--compressed INT write results in compressed format [0]

-v INT verbosity level: 0=nothing, 1: +errors, 2: +warnings, 3: +info [3]

An extended list of options can be obtained by calling 'mmseqs createtsv -h'.

- Steinegger M, Soding J: MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology, doi:10.1038/nbt.3988 (2017)

> mmseqs result2flat -h

$ mmseqs result2flat -h

Usage: mmseqs result2flat <i:queryDB> <i:targetDB> <i:resultDB> <o:fastaDB> [options]

Create a FASTA-like flat file from prefilter DB, alignment DB, or cluster DB

By Martin Steinegger <martin.steinegger@mpibpc.mpg.de>

Options:

Misc:

--use-fasta-header use the id parsed from the fasta header as the index key instead of using incrementing numeric identifiers

Common:

-v INT verbosity level: 0=nothing, 1: +errors, 2: +warnings, 3: +info [3]

> mmseqs result2repseq

$ mmseqs result2repseq

Usage: mmseqs result2repseq <i:sequenceDB> <i:resultDB> <o:sequenceDb> [options]

Get representative sequences for a result database

Options:

Common:

--threads INT number of cores used for the computation (uses all cores by default) [56]

--compressed INT write results in compressed format [0]

-v INT verbosity level: 0=nothing, 1: +errors, 2: +warnings, 3: +info [3]

An extended list of options can be obtained by calling 'mmseqs result2repseq -h'.

- Steinegger M, Soding J: MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology, doi:10.1038/nbt.3988 (2017)

実行方法

mmseq2 cluster

１、まずproteome.fastaをデータベースに変換する。

mmseqs createdb proteome.fasta DB

２、クラスタリング実行。作業ディレクトリが必要。ここでは/tmpとした。巨大なproteomeファイルの場合はかなりのディスクスペースを必要とするので、作業ディレクトリの空き容量に注意する。

mmseqs cluster DB DB_clu tmp

--min-seq-id <FLOAT> list matches above this sequence identity (for clustering) (range 0.0-1.0) [0.000]
--cov-mode <INT> 0: coverage of query and target, 1: coverage of target, 2: coverage of query 3: target seq. length needs be at least x% of query length, 4: query seq. length needs be at least x% of target length [0]

クラスタリング感度は"--min-seq-id"、"-c"、”--cov-mode”などを立てることで手動設定できる。

３、クラスタリング結果をTSV出力する。

mmseqs createtsv DB DB DB_clu clustered.tsv

> head clustered.tsv

$ head clustered.tsv

k161_4496204_2_367_+ k161_4496204_2_367_+

k161_2597819_391_528_- k161_2597819_391_528_-

k161_2397988_694_1326_- k161_2397988_694_1326_-

k161_2397988_694_1326_- k161_732702_1_726_+

k161_2397988_694_1326_- k161_4554834_402_1124_-

k161_1498756_826_1108_- k161_1498756_826_1108_-

k161_3197337_1_597_+ k161_3197337_1_597_+

k161_4696064_350_931_+ k161_4696064_350_931_+

k161_4696064_350_931_+ k161_4519536_22584_23195_-

k161_4696064_350_931_+ k161_3060204_1_699_+

４、クラスタリング後のfastaを出力する。

mmseqs result2flat DB DB DB_clu_rep DB_clu_rep.fasta --use-fasta-header

メタゲノムのアセンブリ配列をFragGeneScaにかけてprootein配列にしたもの（output.faa）を入力として、クラスタリングを実行した。

f:id:kazumaxneo:20190625000340p:plain

ラン後のファイルはDB_clu_rep.fastaだが、わずかに配列数が減少している。

mmseq2 linclust

Linclustは線形時間でのクラスタリングである。実行時間は早くなるが、クラスタリングよりも少しだけ感度が劣る。

１、データベース作成

mmseqs createdb proteome.fasta DB

２、クラスタリング実行。

mmseqs linclust DB DB_clu /tmp

４、クラスタリング後のfastaを出力する。

mmseqs result2repseq DB DB_clu DB_clu_rep
mmseqs result2flat DB DB DB_clu_rep DB_clu_rep.fasta --use-fasta-header

引用

MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets
Steinegger M, Söding J

Nat Biotechnol. 2017 Nov;35(11):1026-1028

MMseqs software suite for fast and deep clustering and searching of large protein sequence sets.
Hauser M, Steinegger M, Söding J

Bioinformatics. 2016 May 1;32(9):1323-30.

Clustering huge protein sequence sets in linear time

Martin Steinegger & Johannes Söding

Nat Commun. 2018; 9: 2542.

参考

Check out my new beginner-friendly post on how to write a shell script that checks whether your computer has SSE or AVX instructions and then runs the correct version of @thesteinegger's MMseqs2 program automatically! https://t.co/N5OtebdBhY
— Ryan Moore (@TenderIsTheByte) May 21, 2019

2019-06-27

FragGeneScan

2010 Nucleic Acids Research metagenome HMM coding region

　次世代シーケンシング技術の進歩は、環境試料（すなわちメタゲノム）内の遺伝物質の全コレクションを直接シーケンシングしようと試みるメタゲノム研究を促進した。メタゲノムアセンブリは利用できないことが多いので（論文執筆時点）、ショートリードから直接遺伝子を同定することは、メタゲノムにアノテーションををつける重要だがチャレンジングな問題となっている。全ゲノム（例えばGlimmer）用に開発され、そしてメタゲノム配列（例えばMetaGene）用に最近開発された遺伝子予測法は、シーケンシングエラー率が増加するにつれて、またはリードが短くなるにつれて、性能の有意な低下を示す。本著者らは、ショートリードにおけるタンパク質コード領域の予測を改善するために、シークエンシングエラーモデルと隠れマルコフモデルにおけるコドン使用法を組み合わせた、新規な遺伝子予測方法FragGeneScanを開発した。 FragGeneScanの性能は、完全なゲノムに関してGlimmerおよびMetaGeneと同等であった。しかしショートリードでは、FragGeneScanは一貫してMetaGeneを上回った（精度は400塩基リード長で1％シークエンスエラーの場合62％、100塩基リード長でエラーがない場合は18％向上した）。メタゲノムに適用した場合、FragGeneScanはMetaGeneが予測したよりも実質的に多くの遺伝子（相同性検索によって同定された遺伝子の90％超）、および現在のタンパク質配列データベースにホモログを含まない多くの新規遺伝子を回収した。

インストール

ubbuntuのminiconda2.4.0.5環境でテストした。

SourcceForge

#bioconda (link)
conda install -c bioconda -y fraggenescan

実行方法

genomeからproteinを得る。-complete=1と-train=completeを指定。

run_FragGeneScan.pl -genome=contigs.fasta -out=output -complete=1 --train=complete -thread=20

-genome= sequence file name including the full path
-complete= 1 if the sequence file has complete genomic sequences. 0 if the sequence file has short sequence reads
train= file name that contains model parameters. [complete] for complete genomic sequences or short sequence reads without sequencing error. [sanger_5] for Sanger sequencing reads with about 0.5% error rate. [sanger_10] for Sanger sequencing reads with about 1% error rate. [454_10] for 454 pyrosequencing reads with about 1% error rate. [454_30] for 454 pyrosequencing reads with about 3% error rate. [illumina_5] for Illumina sequencing reads with about 0.5% error rate. [illumina_10] for Illumina sequencing reads with about 1% error rate.

output.faやgffファイル等が出力される。

illuminaのショートリードからproteinを得る。-complete=0とtrain=illumina_5（またはillumina_10）を指定。

run_FragGeneScan.pl -genome=ngs.fasta -out=output -complete=0 -train=illumina_5 -thread=20

-genome= sequence file name including the full path
-complete= 1 if the sequence file has complete genomic sequences. 0 if the sequence file has short sequence reads
train= file name that contains model parameters. [complete] for complete genomic sequences or short sequence reads without sequencing error. [sanger_5] for Sanger sequencing reads with about 0.5% error rate. [sanger_10] for Sanger sequencing reads with about 1% error rate. [454_10] for 454 pyrosequencing reads with about 1% error rate. [454_30] for 454 pyrosequencing reads with about 3% error rate. [illumina_5] for Illumina sequencing reads with about 0.5% error rate. [illumina_10] for Illumina sequencing reads with about 1% error rate.

454なら-train=454_10、-train=454_30を指定。

引用

FragGeneScan: predicting genes in short and error-prone reads
Mina Rho, Haixu Tang, Yuzhen Ye

Nucleic Acids Res. 2010 Nov; 38(20): e191.

FragGeneScan-Plus

https://ieeexplore.ieee.org/document/7300341

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

アセンブリの前処理としてロングリードのキメラ領域（低オーバーラップ領域）を除く yacrd

メタゲノムデータセットをタンパク質レベルでアセンブリし、ホモログサーチを行う GRASP2

blast結果を視覚化する BlastViewer

AMOSアセンブラパッケージのMinimusとMinimus2

(メタゲノム向け) blastアノテーション結果をインタラクティブなグラフで視覚化する Keanu

MMseqs2 コマンド其の2 タンパク質配列のクラスタリング

FragGeneScan