2020-11-18

ヒトとマウスのショートオープンリーディングフレーム(sORF)のデータベース MetamORF

2020 Preprint human genome database web tool mouse

　ハイスループット技術の開発により、ほとんどの真核生物のRNAにnon-canonicalなショートオープンリーディングフレーム（sORF）が存在することが明らかになった。これらは、種を超えて高度に保存されたユビキタスな遺伝的要素であり、多くの細胞プロセスに関与していると考えられている。MetamORF (http://metamorf.hb.univ-amu.fr/) は、ヒトおよびマウスのゲノム上で同定されたユニークなsORFのリポジトリを、実験と計算の両方のアプローチで提供することを目的としている。公開されているsORFデータを収集し、正規化し、冗長な情報を要約することで、合計1,162,675個のユニークなsORFを同定することができた。ORFを短い、上流、下流のいずれかに分類するのが一般的であるにもかかわらず、これらのカテゴリーの定義に関する明確なコンセンサスは現在のところ存在しない。そのため、データは正規化された命名法を用いて再処理されている。MetamORFは、遺伝子座、遺伝子、転写産物、ORFレベルでの新たな解析を可能にし、将来的にはsORFの機能に関する新たな疑問を解決する可能性を提供する。リポジトリは、ユーザーフレンドリーなWebインターフェースで利用可能で、簡単なブラウジング、可視化、複数の条件でのフィルタリング、エクスポートが可能である。データベースの内容は、UCSC Genome Browserのトラックハブを介して利用できるようになっている。

MetamORFウェブインターフェースは、データの閲覧、複数の基準（ORFの長さ、アノテーション、細胞タイプなど）を用いた可視化やフィルタリング、結果のエクスポートが可能である。現在のウェブインターフェースに加えて、データベースのコンテンツはUCSC Genome Browserのトラックハブを介して利用可能である。これらのゲノムトラックへのリンク、およびUCSCゲノムトラックからMetamORFのウェブサイトへのリンクが実装されており、1つのトラックから別のトラックへのナビゲートを支援している。

Tutorial

http://metamorf.hb.univ-amu.fr/doc#docTutorial

Download

http://metamorf.hb.univ-amu.fr/downloads

webサービス

macのsafariだと検索機能が動作しなかったため、chromeを使用した（対応の詳細はtutorialのWeb browser compatibilityを参照）。

http://metamorf.hb.univ-amu.fr/searchにアクセスする。

GENEを検索してみる。GENEをクリック。

f:id:kazumaxneo:20201118000659p:plain

MetamORFデータベースではORFを持たないsORFは登録されていないので注意する。

Speciesを選択。

f:id:kazumaxneo:20201118000807p:plain

H. sapiensとM. musculusが選べる。

絞り込みできるようになっている。

f:id:kazumaxneo:20201118000856p:plain 複数の遺伝子で同じエイリアスが使われている場合などで、提供されている参照先が曖昧な場合は、中間ページに移動し、選択を正確に行うように指示される。

検索結果

その遺伝子座のsORFの一覧が表示される。

f:id:kazumaxneo:20201118091023p:plain

結果はBED、CSV、FASTA(NT/AA)形式でダウンロードできる。

GROUP BY TRANSCRIPTS

転写パターンが異なりTRANSCRIPTS IDが複数ある場合、TRANSCRIPTS IDごとにまとめて表示できる。

f:id:kazumaxneo:20201118092407p:plain

sORFのリンクを１つクリックしてみた。ポジションや配列など、そのsORFに関する情報がまとめられる。

f:id:kazumaxneo:20201118091513p:plain f:id:kazumaxneo:20201118091516p:plain

中央のGENOME BROWSERSボタンはUCSC genome browserにリンクしている。

f:id:kazumaxneo:20201118091708p:plain

（試した時はエラーになった）

BROWSEでは、セルタイプで区別して閲覧することもできる。

f:id:kazumaxneo:20201118093146p:plain

種とセルタイプを選択。

f:id:kazumaxneo:20201118092659p:plain

humanのB.cellブラウズ結果

f:id:kazumaxneo:20201118092807p:plain

引用

MetamORF: A repository of unique short Open Reading Frames identified by both experimental and computational approaches for gene-level and meta-analysis

Sebastien A. Choteau, Audrey Wagner, Philippe Pierre, Lionel Spinelli, Christine Brun

bioRxiv, Posted November 14, 2020

2020-11-16

Viral quasispeciesのアセンブリを行う virus-vg

2019 Bioinformatics Viruses assembly quasispecies

　ウイルスは、遺伝的に関連のある突然変異株のコレクションであるquasispeciesとして宿主に寄生する。Viral quasispeciesのアセンブリは、リードデータから株固有のハプロタイプを再構築することであり、株間の相対的な存在量を予測することは、治療に関連した様々な理由から重要なステップとなる。リファレンスゲノムに依存しない（「de novo」）アプローチは、リファレンスガイド型アプローチに比べて利点があるが、これは発散している株を扱う場合、リファレンスに起因するバイアスが圧倒的に大きくなるからである。現存するDe novo法は非常に正確ではあるが、コンティグが短いものしか得られない。残る課題は、このようなコンティグから完全長のハプロタイプを再構築することと、それに伴うその豊富さを再構築することである。
　本研究では、あらかじめアセンブリされたコンティグからウイルスのハプロタイプを再構成するためのデノボアプローチとしてVirus-VGを提案する。この手法では、リファレンスゲノムを使用せずに、短い入力コンティグからバリエーショングラフを構築する。次に、元のハプロタイプを反映したバリエーショングラフのパスを得るために、最小化問題を解き、バリエーショングラフのノードに対して計算されたリードカバレッジとの互換性の観点から最適な最大長のパスを選択する。結果として得られた最大長のパスの選択をハプロタイプとその豊富さとともに出力する。困難なシミュレーションおよび実データセット上でのベンチマーク実験では、入力コンティグと比較してアセンブリの連続性が大幅に改善されていることが示された。Virus-vgは、https://bitbucket.org/jbaaijens/virus-vg から自由に利用できる。

インストール

condaでpython3.7の仮想環境を作ってテストした（macos10.14使用）。

依存

Bitbucket

git clone https://bitbucket.org/jbaaijens/virus-vg.git
cd virus-vg/
conda create --name virus-vg-deps --file conda_list_explicit.txt
conda activate virus-vg-deps
#gurobiのライセンスアクティベート（参考）

#vgのバイナリダウンロード
wget https://github.com/vgteam/vg/releases/download/v1.7.0/vg-v1.7.0
chmod u+x vg-v1.7.0

> python build_graph_msga.py -h

$ python build_graph_msga.py -h

usage: build_graph_msga.py [-h] -f FORWARD [-r REVERSE] -c CONTIGS -vg VG [-t THREADS] [-w MSGA_W] [--reuse_graph]

Build contig variation graph using vg msga and compute node abundances.

optional arguments:

-h, --help show this help message and exit

-f FORWARD, --forward FORWARD

Forward reads in fastq format

-r REVERSE, --reverse REVERSE

Reverse reads in fastq format

-c CONTIGS, --contigs CONTIGS

Input contigs in fastq format

-vg VG, --vg_path VG Path to vg executable

-t THREADS, --threads THREADS

Set number of threads used for vg.

-w MSGA_W, --msga_w MSGA_W

Set alignment band width parameter for vg msga.

--reuse_graph

> python optimize_strains.py -h

# python optimize_strains.py -h

usage: python opt_edge.py [-h] -m M -c MIN_COV [-d MIN_DEPTH] [-o FASTA]

[-p PATHS] [-r REDUCE_OBJ] [-t THREADS]

[--trim TRIM] [--max_strains MAX_STRAINS]

abundancefile graphfile

Build strains from contigs in variation graph and estimate their abundances.

positional arguments:

abundancefile Node abundance file

graphfile GFA file containing the contig variation graph

optional arguments:

-h, --help show this help message and exit

-m M, --min_abundance M

Minimal node abundance; nodes below this threshold are

considered erroneous and allowed to match any other

node.

-c MIN_COV, --min_cov MIN_COV

Minimum coverage required per strain

-d MIN_DEPTH, --min_depth MIN_DEPTH

Output a list of nodes with sequence depth less than

<min_depth>.

-o FASTA, --out_fasta FASTA

Output fasta file with final haplotypes

-p PATHS, --paths PATHS

Read candidate paths from file instead of enumerating

them.

-r REDUCE_OBJ, --reduce_obj REDUCE_OBJ

Use a reduced objective function by specifying how

many times a given combination of paths will be

evaluated (reduces runtime and memory usage).

-t THREADS, --threads THREADS

Set number of threads used for Gurobi.

--trim TRIM number of bases trimmed on either end of contig

--max_strains MAX_STRAINS

set an upper bound on the number of strains predicted

テストラン

example/のペアエンドfastqを指定する。実行時にはvgのバイナリのパスを指定する必要がある。

cd virus-vg/example/
python ../scripts/build_graph_msga.py -f forward.fastq -r reverse.fastq -c input.fasta -vg ../vg-v1.7.0 -t 8

optimize_strains.py -m <node_ab> -c <strain_ab> node_abundance.txt contig_graph.final.gfa

引用

Full-length de novo viral quasispecies assembly through variation graph construction
Jasmijn A Baaijens, Bastiaan Van der Roest, Johannes Köster, Leen Stougie, Alexander Schönhuth
Bioinformatics, Volume 35, Issue 24, 15 December 2019, Pages 5086–5094

2020-11-15

ターゲットの菌株の種類と豊富さを調べる mixtureS

2020 Bioinformatics bacteria sub-populations

　環境試料中の細菌株を研究することは不可欠である。既存の方法やツールは、既知の菌株や変異株に依存していることが多く、個々のサンプルに対応できない、信頼性が低い、使い勝手が悪いなどの問題がある。そのため、より正確に菌株を同定できる、より使いやすいツールを開発することが重要である。

　著者らは、菌株やその変異についての事前知識がなくても、クローンやメタゲノムサンプルのショットガンリードから細菌株をde novoで同定できるmixtureSと呼ばれる新しいツールを開発した。243のシミュレーションデータセットと195の実験データセットでテストしたところ、mixtureSは確実に菌株、その数、およびその豊富さを同定した。3つのツールと比較して、ほとんどすべてのシミュレーションデータセットと大多数の実験データセットにおいて、mixtureSはより優れた性能を示した。ソースコードとツールmixtureSは、http://www.cs.ucf.edu/〜xiaoman/mixtureS/にある。

インストール

ubuntu18.04のdocker環境でテストした（macos10.14使用、）。

依存

python=3.7.6
R
pysam=0.15.3

download

http://www.cs.ucf.edu/~xiaoman/mixtureS/

MixtureS.zipをダウンロードして解凍する。

cd MixtureS/
conda env create -f ./mixture.yml
conda activate mixture

> python preprocessing.py -h

# python preprocessing.py -h

usage: preprocessing.py [-h] [--sample_name SAMPLE_NAME] [--pair1 PAIR1]

[--pair2 PAIR2] [--process PROCESS]

[--genome_name GENOME_NAME] [--res_dir RES_DIR]

optional arguments:

-h, --help show this help message and exit

--sample_name SAMPLE_NAME

Give a unique sample name

--pair1 PAIR1 Give a unique sample name

--pair2 PAIR2 Give a unique sample name

--process PROCESS Give a unique sample name

--genome_name GENOME_NAME

Input genome name

--res_dir RES_DIR result directory

> python mixture_model.py -h

# python mixture_model.py -h

usage: mixture_model.py [-h] [--sample_name SAMPLE_NAME]

[--genome_len GENOME_LEN] [--genome_name GENOME_NAME]

[--genome_file_loc GENOME_FILE_LOC]

[--bam_file BAM_FILE] [--res_dir RES_DIR]

optional arguments:

-h, --help show this help message and exit

--sample_name SAMPLE_NAME

Give a unique sample name

--genome_len GENOME_LEN

Input genome length

--genome_name GENOME_NAME

Input genome name

--genome_file_loc GENOME_FILE_LOC

Input genome file location

--bam_file BAM_FILE Input sorted bam file

--res_dir RES_DIR result directory

テストラン

mixtureSのランにはbamファイルが必要。テストデータではpreprocessing.pyを使って作成されたbamが用意されている。

cd MixtureS/example_test_data/
python ../mixture_model.py --sample_name sample_name\
 --genome_len 1853160 --genome_name NC_009515.1\
 --genome_file_loc GCF_000016525.1_ASM1652v1_genomic.fna\
 --bam_file test_sorted.bam --res_dir result

出力

f:id:kazumaxneo:20201114204730p:plain

filter_polymorphic_sites - polymorphic sitesを示したファイル

position、A、C、G、Tの順

f:id:kazumaxneo:20201115130754p:plain

sample_name_haplotypes

株名が>で表現され、カバレッジ比が表される。この例では0.3。次の行からpolymorphic_sitesが示される。

f:id:kazumaxneo:20201115131057p:plain

第2の株以降も>で始まる。

引用

mixtureS: a novel tool for bacterial strain genome reconstruction from reads
Xin Li, Haiyan Hu, Xiaoman Li
Bioinformatics, Published: 17 August 2020

2020-11-14

様々な種のバリアント情報をまとめたデータベース Genome Variation Map（GVM）

2020 download eukaryotic genome annotation animal plant virus Nucleic Acids Research variant database dbSNP SARS-CoV-2

Genome Variation Map (GVM; http://bigd.big.ac.cn/gvm/) は、ゲノム変異の公開データリポジトリである。幅広い種のゲノム変異を収集・統合することを目的としており、世界中から様々な変異タイプの投稿を受け付けており、世界中の研究活動を支援するために、公開されているすべてのデータへの無料のオープンアクセスを提供している。特に旧バージョンと比較して、現行バージョンのGVMでは、合計22種、115プロジェクト、55 935サンプル、463 429 609バリアント、66 220アソシエーション、56投稿（2020年9月7日現在）が新たに追加されている。現在のリリースでは、GVMは13の動物、25の植物、3のウイルスを含む41の種からの合計約9億6000万のバリアントを格納している。さらに、64,819個の個々の遺伝子型と260,393個の手動でキュレーションされた高品質な遺伝子型間の関連付けが組み込まれている。GVMは設立以来、世界中のユーザーから提出された43,754サンプルのゲノム変異データをアーカイブ化し、100万件以上のデータダウンロードリクエストに対応してきた。GVMは、ナショナル・ゲノミクス・データセンター（NGDC）の中核的なリソースとして、多様な種の貴重なゲノム変異を提供しており、機能的ゲノミクス研究と分子育種の両方において重要な役割を果たしている。

webサービス

GVM（https://bigd.big.ac.cn/gvm/）にアクセスする。

ここではデータのサブミット手順などについては触れず、browse、search、そしてDownloadの機能についてだけ簡単に紹介する。

Browse

f:id:kazumaxneo:20201113234146p:plain

Animalからジャイアントパンダの#sampleの34をクリック。

f:id:kazumaxneo:20201113234259p:plain

34サンプルの地理的分布が表示された。

f:id:kazumaxneo:20201113234346p:plain

f:id:kazumaxneo:20201113234349p:plain

表のリンクはNCBI Biosampleや引用元の論文（Pubmed）繋がっている。

Search

Searchタブでは、シークエンシングのタイプ、遺伝子名、MAF、Clinvar、GWAS-catalogue、Pfamなど様々な方法でバリアントを検索できる。

ヒトなど42の生物に対応している。ウィルスにはSARS-CoV-2も含まれている。

f:id:kazumaxneo:20201113230955p:plain

様々な方法で検索できる。以下はヒトのGWAS-Catalogタブ。

f:id:kazumaxneo:20201113231007p:plain

humanのGWAS catalogueの１カテゴリーで絞り込んで検索した。dbSNPとMAFのカラムが確認できる。ここからさらにclinvarのeffectやMAFの頻度、領域など指定してバリアントを絞り込める。

f:id:kazumaxneo:20201113231410p:plain

遺伝子名はEnsembl humanにリンクしている。

Download

SNPやindelのファイル（VCFやFASTA形式）をダウンロードできる。

f:id:kazumaxneo:20201113232506p:plain

2020年11月現在、37種からのバリアントデータをダウンロード可能。

引用

Genome Variation Map: a worldwide collection of genome variations across multiple species
Cuiping Li, Dongmei Tian, Bixia Tang, Xiaonan Liu, Xufei Teng, Wenming Zhao, Zhang Zhang, Shuhui Song
Nucleic Acids Research, Published: 10 November 2020

2020-11-13

ウィルスゲノムのアノテーションを行う VIGOR

2012 Nucleic Acids Research 2010 BMC Bioinformatics annotation Viruses

　遺伝子予測プログラムVIGOR（Viral Genome ORF Reader）は、2010年にJ.Craig Venter Instituteで開発され、感染症ゲノムシークエンシングセンターのプロジェクトでコロナウイルス、インフルエンザ、ライノウイルス、ロタウイルスの遺伝子コールに成功している。VIGORでは、カスタムのタンパク質データベースに対して配列類似性検索を行い、タンパク質のコーディング領域、開始コドン、停止コドン、その他の遺伝子の特徴を特定する。リボ核酸編集などの特徴は、配列類似性とシグネチャー残基に基づいて正確に同定される。VIGORは、遺伝子予測ファイル、相補DNAファイル、アラインメントファイル、遺伝子特徴量表ファイルの4つの出力ファイルを生成する。遺伝子特徴量表は、GenBankへの申請に使用することができる。VIGORは単一の入力：FASTA形式のウイルスゲノム配列を受け取る。VIGORは、麻疹ウイルス、おたふくかぜウイルス、風疹ウイルス、呼吸器感染症ウイルス、アルファウイルス、ベネズエラ馬脳炎ウイルス、ノロウイルス、メタニューモウイルス、黄熱病ウイルス、日本脳炎ウイルス、パラインフルエンザウイルス、仙台ウイルスの12種類のウイルスの遺伝子を予測するために拡張されている。VIGORは、リボ核酸編集、ストップコドンリーク、リボソームシャントなどの複雑な遺伝子の特徴を正確に検出する。いくつかのウイルスのmat_peptide開裂を正確に特定することは、VIGORの組み込み機能である。これらのウイルスの遺伝子予測は、GenBankの27〜240ゲノムからテストして評価されている。

インストール

mvnを使って導入した（JDK11）。

依存

### Java 8 or above
Although VIGOR4 may work on other operating systems, it has only been tested in a linux environment.
VIGOR4 uses exonerate to generate its initial alignments.

Github

#依存のexonerateの導入
conda install -c bioconda exonerate

git clone https://github.com/JCVenterInstitute/VIGOR4.git
cd VIGOR4/
mvn -DskipTests clean package
unzip target/vigor-4.1.20201015-032846-7c6c78d.zip -d INSTALL_DIRECTORY
cd INSTALL_DIRECTORY/vigor-4.1.20201015-032846-7c6c78d/bin/

> ./vigor4 -h

$ ./vigor4 -h

usage: vigor4 -i inputfasta -o outputprefix [ -d refdb ]

named arguments:

-h, --help show this help message and exit

-i <input fasta>, --input-fasta <input fasta>

path to fasta file of genomic sequences to be annotated.

-o <output prefix>, --output-prefix <output prefix>

prefix for outputfile files, e.g. if the output prefix is /mydir/anno VIGOR will create output files /mydir/anno.tbl, /mydir/anno.stats, etc. An

output prefix without a directory element will create the output files in the current working directory.

-c MIN_COVERAGE, --min-coverage MIN_COVERAGE

minimum coverage of reference product (0-100) required to report a gene, by default coverage is ignored

-P <parameter=value~~...~~parameter=value>, --parameter <parameter=value~~...~~parameter=value>

~~ separated list of VIGOR parameters to override default values. Use --list-config-parameters to see settable parameters.

-v, --verbose verbose logging (default=terse)

--list-config-parameters [{all,current}]

list available configuration parameters and exit. By default only lists description, use the verbose option before this option to list more

information

--list-databases list the names and other information about the found vigor compatible databases. Requires reference database path to be set either by passing the

--reference-database-path command line parameter or setting reference_database_path in the configuration file

--version print version information

--config-file CONFIG_FILE

config file to use

--reference-database-path REFERENCE_DATABASE_PATH

reference database path

--virus-config VIRUSSPECIFICCONFIG

Path to virus specific configuration

--virus-config-path VIRUSSPECIFICCONFIGPATH

Path to directory containing virus specific config files.

--overwrite-output overwrite existing output files if they exist

--temporary-directory TEMPORARYDIRECTORY

Root directory to use for temporary directories

--list-output-formats list acceptable output formats and exit

reference database:

-d <ref db>, --reference-database <ref db>

specify the reference database to be used

locus tag usage:

-l, --no-locus-tags do NOT use locus_tags in TBL file output (incompatible with -L)

-L [<locus_tag_prefix>], --locus-tags [<locus_tag_prefix>]

USE locus_tags in TBL file output (incompatible with -l). If no prefix is provided, the prefix "vigor_" will be used.

Unimplemented/Ignored for backward compatibilty:

-0, --circular complete circular genome (allows gene to span origin). This feature is currently unimplemented

-f {0,1,2}, --frameshift-sensitivity {0,1,2}

frameshift sensitivity, 0=ignore frameshifts, 1=normal (default), 2=sensitive.

-m, --ignore-reference-requirements

ignore reference match requirements (coverage/identity/similarity), sometimes useful when running VIGOR to evaluate raw contigs and rough draft

sequences

-x <ref_id,...,ref_id>, --ignore-refID <ref_id,...,ref_id>

comma separated list of reference sequence IDs to ignore (useful when debugging a reference database). Not currently implemented

Outputs:

outputprefix.rpt - summary of program results

outputprefix.cds - fasta file of predicted CDSs

outputprefix.pep - fasta file of predicted proteins

outputprefix.tbl - predicted features in GenBank tbl format

outputprefix.aln - alignment of predicted protein to reference, and reference protein to genome

実行方法

fastaとconfigファイルを指定する。また-dでウィルスの種類（GithubのREADME.md参照）を指定する。

vigor4 -i input.fna -o outputprefix -d flua --config-file vigor.ini

-d specify the reference database to be used

引用

VIGOR extended to annotate genomes for additional 12 different viruses

Shiliang Wang, Jaideep P Sundaram, Timothy B Stockwell

Nucleic Acids Res. 2012 Jul;40(Web Server issue):W186-92

VIGOR, an annotation program for small viral genomes
Jaideep P Sundaram, David Spiro

BMC Bioinformatics. 2010; 11: 451

2020-11-12

リピートの多いゲノム配列にロングリードをマッピングするために最適化されたアライナー Winnowmap

2020 Preprint Nanopore long read Pacbio mapping repetitive sequences genome alignment 2022 Nature Methods

2022/04/02 論文引用

　ヒトゲノムの約5～10％は、セグメント重複やタンデムリピート配列などの繰り返し配列が存在するため、機能解析にアクセスできない状態になっている。高品質な個人ゲノムのリシークエンシングを可能にするためには、リピートを考慮したリードマッピング手法を用いて、エンドツーエンドでのゲノムバリアント発見をサポートすることが重要である。本研究では、既存のロングリードマッパーでは、対立遺伝子バイアスの影響を受けやすく、長い、ほぼ同一のリピート内の不正確なアラインメントやバリアントコールが生じることが多いという事実を明らかにした。リピート内に非リファレンスアリルが存在する場合、標準的なペアワイズ配列アライメントスコアリングシステムが真のバリアントにペナルティを与えるため、その領域からサンプリングされたリードは誤ったリピートコピーにマッピングされる可能性がある。

　上記の問題に対処するために、著者らは、minimal confidently alignable substrings (MCASs)を利用することで対立遺伝子のバイアスに対処する、新しいロングリードマッピング方法を提案する。MCASは、十分なマッピング信頼度（すなわち、ユーザーが指定した閾値以上のマッピング品質スコア）を持つ参照遺伝子座へのユニークなアラインメントを持つリードの最小長さの部分文字列として定式化される。このアプローチでは、各リードのマッピングを確信度の高いサブアラインメントの集合として扱い、構造変化に対してより寛容であり、リピート内のパラログ特異的バリアント（PSV）に対してより敏感である。本研究では、MCASを数学的に定義し、正確なアルゴリズムとその計算のための実用的なヒューリスティックな方法について述べる。提案手法（Winnowmap2と呼ばれる）は、最近完成したヒト染色体Xとヒト8番染色体のギャップレスアセンブリを参照して、シミュレーションと実際の長さのリードベンチマークを用いて評価した。その結果、Winnowmap2が対立遺伝子のバイアスの問題にうまく対処し、反復配列の下流のバリアントコールをより正確に行えることを示した。例として、シミュレーションされたPacBio HiFiリードと8番染色体の構造バリアントを使用した場合、Winnowmap2のアラインメントは、ほぼ同一反復内の構造バリアントコールについて、それぞれminimap2（39.62％、5.88％）とNGMLR（56.60％、36.11％）と比較して、最も低い偽陰性率と偽陽性率（1.89％、1.89％）を達成していた。Winnowmap2のコードは、https://github.com/marbl/Winnowmapで利用できる。

インストール

Github

git clone https://github.com/marbl/Winnowmap.git
cd Winnowmap
make -j8
cd bin/

> ./winnowmap -h

$ ./winnowmap -h

winnowmap is built on top of minimap2, while modifying its minimizer sampling and indexing procedures

Usage: winnowmap [options] <target.fa>|<target.idx> [query.fa] [...]

Options:

Indexing:

-H use homopolymer-compressed k-mer

-k INT k-mer size (no larger than 28) [15]

-w INT minimizer window size [50]

-W FILE input file containing list of high freq. k-mers

-I NUM split index for every ~NUM input bases [4G]

-d FILE dump index to FILE

Mapping:

-f FLOAT filter out top FLOAT (<1) fraction of repetitive minimizers [0.0]

-g NUM stop chain enlongation if there are no minimizers in INT-bp [5000]

-G NUM max intron length (effective with -xsplice; changing -r) [200k]

-F NUM max fragment length (effective with -xsr or in the fragment mode) [800]

-r NUM bandwidth used in chaining and DP-based alignment [500]

-n INT minimal number of minimizers on a chain [3]

-m INT minimal chaining score (matching bases minus log gap penalty) [40]

-X skip self and dual mappings (for the all-vs-all mode)

-p FLOAT min secondary-to-primary score ratio [0.8]

--sv-off turn off SV-aware mode

Alignment:

-A INT matching score [2]

-B INT mismatch penalty [4]

-O INT[,INT] gap open penalty [4,24]

-E INT[,INT] gap extension penalty; a k-long gap costs min{O1+k*E1,O2+k*E2} [2,1]

-z INT[,INT] Z-drop score and inversion Z-drop score [400,200]

-s INT minimal peak DP alignment score [80]

-u CHAR how to find GT-AG. f:transcript strand, b:both strands, n:don't match GT-AG [n]

Input/Output:

-a output in the SAM format (PAF by default)

-o FILE output alignments to FILE [stdout]

-L write CIGAR with >65535 ops at the CG tag

-R STR SAM read group line in a format like '@RG\tID:foo\tSM:bar'

-c output CIGAR in PAF

--cs[=STR] output the cs tag; STR is 'short' (if absent) or 'long' [none]

--MD output the MD tag

--eqx write =/X CIGAR operators

-Y use soft clipping for supplementary alignments

-t INT manually set pthread count rather than automatically

-K NUM minibatch size for mapping [1000M]

--version show version number

Preset:

-x STR preset (always applied before other options)

- map-ont (ont-to-ref, uses default param)

- map-pb (hifi-to-ref, all defaults but does finer read fragmentation in SV-aware mapping)

- map-pb-clr (clr-to-ref, sets --sv-off)

- splice/splice:hq - long-read/Pacbio-CCS spliced alignment, sets --sv-off

- asm5/asm10/asm20 - asm-to-ref mapping

> ./meryl

$ ./meryl

usage: ./meryl ...

A meryl command line is formed as a series of commands and files, possibly

grouped using square brackets. Each command operates on the file(s) that

are listed after it.

COMMANDS:

print display kmers on the screen as 'kmer<tab>count'. accepts exactly one input.

count Count the occurrences of canonical kmers in the input. must have 'output' specified.

count-forward Count the occurrences of forward kmers in the input. must have 'output' specified.

count-reverse Count the occurrences of reverse kmers in the input. must have 'output' specified.

k=<K> create mers of size K bases (mandatory).

n=<N> expect N mers in the input (optional; for precise memory sizing).

memory=M use no more than (about) M GB memory.

threads=T use no more than T threads.

less-than N return kmers that occur fewer than N times in the input. accepts exactly one input.

greater-than N return kmers that occur more than N times in the input. accepts exactly one input.

equal-to N return kmers that occur exactly N times in the input. accepts exactly one input.

not-equal-to N return kmers that do not occur exactly N times in the input. accepts exactly one input.

increase X add X to the count of each kmer.

decrease X subtract X from the count of each kmer.

multiply X multiply the count of each kmer by X.

divide X divide the count of each kmer by X.

modulo X set the count of each kmer to the remainder of the count divided by X.

union return kmers that occur in any input, set the count to the number of inputs with this kmer.

union-min return kmers that occur in any input, set the count to the minimum count

union-max return kmers that occur in any input, set the count to the maximum count

union-sum return kmers that occur in any input, set the count to the sum of the counts

intersect return kmers that occur in all inputs, set the count to the count in the first input.

intersect-min return kmers that occur in all inputs, set the count to the minimum count.

intersect-max return kmers that occur in all inputs, set the count to the maximum count.

intersect-sum return kmers that occur in all inputs, set the count to the sum of the counts.

difference return kmers that occur in the first input, but none of the other inputs

symmetric-difference return kmers that occur in exactly one input

MODIFIERS:

output O write kmers generated by the present command to an output meryl database O

mandatory for count operations.

EXAMPLES:

Example: Report 22-mers present in at least one of input1.fasta and input2.fasta.

Kmers from each input are saved in meryl databases 'input1' and 'input2',

but the kmers in the union are only reported to the screen.

meryl print \

union \

[count k=22 input1.fasta output input1] \

[count k=22 input2.fasta output input2]

Example: Find the highest count of each kmer present in both files, save the kmers to

database 'maxCount'.

meryl intersect-max input1 input2 output maxCount

Example: Find unique kmers common to both files. Brackets are necessary

on the first 'equal-to' command to prevent the second 'equal-to' from

being used as an input to the first 'equal-to'.

meryl intersect [equal-to 1 input1] equal-to 1 input2

実行方法

１、pre-computing high frequency k-mers using meryl

meryl count k=15 output merylDB ref.fa 
meryl print greater-than distinct=0.9998 merylDB > repetitive_k15.txt

２、mapping

#ONT
winnowmap -W repetitive_k15.txt -ax map-ont ref.fa ont.fq.gz > output.sam

#Pacbio
winnowmap -W repetitive_k15.txt -ax map-pb ref.fa hifi.fq.gz > output.sam

引用

A long read mapping method for highly repetitive reference sequences

Chirag Jain, Arang Rhie, Nancy Hansen, Sergey Koren, Adam M. Phillippy

bioRxiv, Posted November 02, 2020

2022/04/02

Long-read mapping to repetitive reference sequences using Winnowmap2

Chirag Jain, Arang Rhie, Nancy F. Hansen, Sergey Koren & Adam M. Phillippy

Nature Methods, Published: 01 April 2022

NCBIのデータベースへのリモート検索によって保存された遺伝子クラスターを探索し、クラスタリングして視覚化する cblaster

2020 gene cluster Preprint bacteria 結果の視覚化 (visualization) amino acids NCBI BLAST heatmap

2020 11/11 extractコマンド追記

2022/11/21 登録コマンド追記

　代謝、薬剤耐性、病原性などの生物学的パスウェイに関与する遺伝子は、多くの場合、遺伝子クラスターとしてクラスター化されている。相同な遺伝子クラスターを特定することは、その機能や進化の研究に役立つが、既存のツールは局所的な配列データベースの検索に限られている。また、そのためには、オンラインでのゲノムデータの急速な増加に対応するために、リモートで公開データベースを検索するためのツールが必要とされている。本研究では、Pythonベースのツールであるcblasterを用いて、ローカルデータベースとリモートデータベースのコロケーション遺伝子を迅速に検出するツールを紹介する。cblasterは比較ゲノミクスツールボックスの重要なアップデートである。

遺伝子クラスターを同一性によってグループ化すると、相同タンパク質のブロックが強調され、クラスター境界の指定や、コアとなる生合成遺伝子や機能的なサブクラスターの同定が容易になる。これは、機能的および進化的関係の有用なプロキシである。例えば、著者らは最近この技術を使用して、近縁の2つの天然物、インドロトリプトリンとインドロカルバゾールの生合成遺伝子クラスターをグループ化した。

　cblasterはタンパク質配列のコレクションが与えられると、リモート（NCBI BLAST API経由）またはローカル（DIAMOND経由）で配列データベースを検索する。検索結果は解析され、identity、coverage、e-valueのユーザー定義しきい値に基づいてフィルタリングされる。残りのヒットのゲノム座標は、NCBIのIdentical Protein Group (IPG)データベース（またはローカル検索の場合はローカルデータベース）から取得される。最後に、cblasterはcollocationのインスタンスをスキャンし、可視化する。

解析の手順はプレプリントの図１にまとめられています。

インストール

macOSではエラーが出た。ubuntu18でcondaでpython3.6の環境を作ってからpipで導入した（conda create -n cblaster python=3.6）。

依存

cblaster is tested on Python 3.6, and its only external Python dependency is the requests module (used for interaction with NCBI APIs). If you want to perform local searches, you should have diamond installed and available on your system $PATH.

Github

pip3 install cblaster

> cblaster

$ cblaster

usage: cblaster [-h] [--version] [-d] [-i INDENT]

{gui,makedb,search,gne,extract} ...

cblaster finds co-located sequence homologues.

Documentation available at https://cblaster.readthedocs.io

Type -h/--help after either subcommand for full description of available arguments.

positional arguments:

{gui,makedb,search,gne,extract}

gui Launch cblaster GUI

makedb Generate JSON/diamond databases from GenBank files

search Start a local/remote cblaster search

gne Perform gene neighbourhood estimation

extract Extract hit sequences from session files

optional arguments:

-h, --help show this help message and exit

--version show program's version number and exit

-d, --debug Print debugging information

-i INDENT, --indent INDENT

Total spaces to use as indent in JSON file (def. None)

Cameron Gilchrist, 2020

> cblaster search -h

$ cblaster search -h

usage: cblaster search [-h] [-qf QUERY_FILE | -qi QUERY_IDS [QUERY_IDS ...]]

[-o OUTPUT] [-ohh] [-ode OUTPUT_DELIMITER]

[-odc OUTPUT_DECIMALS] [-b BINARY] [-bhh]

[-bde BINARY_DELIMITER] [-bkey {len,max,sum}]

[-bat {identity,coverage,bitscore,evalue}]

[-bdc BINARY_DECIMALS] [-p [PLOT]]

[--blast_file BLAST_FILE] [--ipg_file IPG_FILE]

[-m {local,remote}] [-db DATABASE] [-jdb JSON_DB]

[-eq ENTREZ_QUERY] [--rid RID]

[-s [SESSION_FILE [SESSION_FILE ...]]]

[-rcp [RECOMPUTE]] [-hs HITLIST_SIZE] [-g GAP]

[-u UNIQUE] [-mh MIN_HITS] [-r REQUIRE [REQUIRE ...]]

[-me MAX_EVALUE] [-mi MIN_IDENTITY] [-mc MIN_COVERAGE]

Remote/local cblaster searches.

optional arguments:

-h, --help show this help message and exit

Input:

-qf QUERY_FILE, --query_file QUERY_FILE

Path to FASTA file containing protein sequences to be

searched

-qi QUERY_IDS [QUERY_IDS ...], --query_ids QUERY_IDS [QUERY_IDS ...]

A collection of valid NCBI sequence identifiers to be

searched

Output:

-o OUTPUT, --output OUTPUT

Write results to file

-ohh, --output_hide_headers

Hide headers when printing result output.

-ode OUTPUT_DELIMITER, --output_delimiter OUTPUT_DELIMITER

Delimiter character to use when printing result

output.

-odc OUTPUT_DECIMALS, --output_decimals OUTPUT_DECIMALS

Total decimal places to use when printing score values

-b BINARY, --binary BINARY

Generate a binary table.

-bhh, --binary_hide_headers

Hide headers in the binary table.

-bde BINARY_DELIMITER, --binary_delimiter BINARY_DELIMITER

Delimiter used in binary table (def. none = human

readable).

-bkey {len,max,sum}, --binary_key {len,max,sum}

Key function used when generating binary table cell

values.

-bat {identity,coverage,bitscore,evalue}, --binary_attr {identity,coverage,bitscore,evalue}

Hit attribute used when generating binary table cell

values.

-bdc BINARY_DECIMALS, --binary_decimals BINARY_DECIMALS

Total decimal places to use when printing score values

-p [PLOT], --plot [PLOT]

Generate a cblaster plot. If this argument is

specified with no file name, the plot will be served

using Python's HTTP server. If a file name is

specified, a static HTML file will be generated at

that path.

--blast_file BLAST_FILE

Save BLAST/DIAMOND hit table to file

--ipg_file IPG_FILE Save IPG table to file

Searching:

-m {local,remote}, --mode {local,remote}

cblaster search mode

-db DATABASE, --database DATABASE

Database to be searched. This should be either a path

to a local DIAMOND database (if 'local' is passed to

--mode) or a valid NCBI database name (def. nr)

-jdb JSON_DB, --json_db JSON_DB

Path to local JSON database created using cblaster

makedb. If this argument is provided, genomic context

will be fetched from this database instead of through

NCBI IPG.

-eq ENTREZ_QUERY, --entrez_query ENTREZ_QUERY

An NCBI Entrez search term for pre-search filtering of

an NCBI database when using command line BLASTp (i.e.

only used if 'remote' is passed to --mode); e.g.

"Aspergillus"[organism]

--rid RID Request Identifier (RID) for a web BLAST search. This

is only used if 'remote' is passed to --mode. Useful

if you have previously run a web BLAST search and want

to directly retrieve those results instead of running

a new search.

-s [SESSION_FILE [SESSION_FILE ...]], --session_file [SESSION_FILE [SESSION_FILE ...]]

Load session from JSON. If the specified file does not

exist, the results of the new search will be saved to

this file.

-rcp [RECOMPUTE], --recompute [RECOMPUTE]

Recompute previous search session using new

thresholds. The filtered session will be written to

the file specified by this argument. If this argument

is specified with no value, the session will be

filtered but not saved (e.g. for plotting purposes).

-hs HITLIST_SIZE, --hitlist_size HITLIST_SIZE

Maximum total hits to save in a BLAST search (def.

5000). Setting this value too low may result in missed

hits/clusters.

Clustering:

-g GAP, --gap GAP Maximum allowed intergenic distance (bp) between

conserved hits to be considered in the same block

(def. 20000)

-u UNIQUE, --unique UNIQUE

Minimum number of unique query sequences that must be

conserved in a hit cluster (def. 3)

-mh MIN_HITS, --min_hits MIN_HITS

Minimum number of hits in a cluster (def. 3)

-r REQUIRE [REQUIRE ...], --require REQUIRE [REQUIRE ...]

Names of query sequences that must be represented in a

hit cluster

Filtering:

-me MAX_EVALUE, --max_evalue MAX_EVALUE

Maximum e-value for a BLAST hit to be saved (def.

0.01)

-mi MIN_IDENTITY, --min_identity MIN_IDENTITY

Minimum percent identity for a BLAST hit to be saved

(def. 30)

-mc MIN_COVERAGE, --min_coverage MIN_COVERAGE

Minimum percent query coverage for a BLAST hit to be

saved (def. 50)

Example usage

-------------

Run a remote cblaster search, save the session and generate a plot:

$ cblaster search -qf query.fa -s session.json -p

Recompute a search session with new parameters:

$ cblaster search -s session.json -rcp new.json -u 4 -g 40000

Merge multiple search sessions:

$ cblaster search -s one.json two.json three.json -rcp merged.json

Perform a local search:

$ cblaster makedb $(ls folder/*.gbk) mydb

$ cblaster search -qf query.fa -db mydb.dmnd -jdb mydb.json

Save plot as a static HTML file:

$ cblaster search -s session.json -p gne.html

Kitchen sink example:

$ cblaster search --query_file query.fa \

--session_file session.json \

--plot my_plot.html \

--output summary.csv --output_decimals 2 \

--binary abspres.csv --binary_delimiter "," \

--entrez_query "Aspergillus"[orgn] \

--max_evalue 0.05 --min_identity 50 --min_coverage 70 \

--gap 50000 --unique 2 --min_hits 3 --require Gene1 Gene2

Cameron Gilchrist, 2020

> cblaster makedb -h

$ cblaster makedb -h

usage: cblaster makedb [-h] genbanks [genbanks ...] filename

positional arguments:

genbanks Path/s to GenBank files to use when building JSON/diamond

databases

filename Name to use when building JSON/diamond databases (with

extensions .json and .dmnd, respectively)

optional arguments:

-h, --help show this help message and exit

> cblaster gne -h

$ cblaster gne -h

usage: cblaster gne [-h] [--max_gap MAX_GAP] [--samples SAMPLES]

[--scale {linear,log}] [-o OUTPUT] [-hh] [-d DELIMITER]

[-e DECIMALS] [-p PLOT]

session

Gene neighbourhood estimation.

Repeatedly recomputes homologue clusters with different --gap values.

positional arguments:

session cblaster session file

optional arguments:

-h, --help show this help message and exit

Parameters:

--max_gap MAX_GAP Maximum intergenic distance (def. 100000)

--samples SAMPLES Total samples taken from max_gap (def. 100)

--scale {linear,log} Draw sampling values from a linear or log scale (def.

linear)

Output:

-o OUTPUT, --output OUTPUT

Write results to file

-hh, --hide_headers Hide headers when printing result output.

-d DELIMITER, --delimiter DELIMITER

Delimiter character to use when printing result

output.

-e DECIMALS, --decimals DECIMALS

Total decimal places to use when printing score values

-p PLOT, --plot PLOT GNE plot HTML file. The plot is generated by default;

this option will just save a static version of it.

Example usage

-------------

Maximum gap value 200Kbp, with 200 evenly distributed gap values:

$ cblaster gne session.json --max_gap 200000 --samples 200 --scale linear

Draw gap values from a log scale (gaps increase as values increase):

$ cblaster gne session.json --scale log

Save delimited tabular output:

$ cblaster gne session.json --output gne.csv --delimiter ","

Save plot as a static HTML file:

$ cblaster gne session.json -p gne.html

Cameron Gilchrist, 2020

> cblaster extract -h

$ cblaster extract -h

usage: cblaster extract [-h] [-q QUERIES [QUERIES ...]]

[-or ORGANISMS [ORGANISMS ...]]

[-sc SCAFFOLDS [SCAFFOLDS ...]] [-o OUTPUT] [-d] [-no]

[-de DELIMITER]

session

Extract information from session files

positional arguments:

session cblaster session file

optional arguments:

-h, --help show this help message and exit

Filters:

-q QUERIES [QUERIES ...], --queries QUERIES [QUERIES ...]

IDs of query sequences

-or ORGANISMS [ORGANISMS ...], --organisms ORGANISMS [ORGANISMS ...]

Organism names

-sc SCAFFOLDS [SCAFFOLDS ...], --scaffolds SCAFFOLDS [SCAFFOLDS ...]

Scaffold names/ranges

Output:

-o OUTPUT, --output OUTPUT

Output file name

-d, --download Fetch sequences from NCBI and write in FASTA format

-no, --name_only Do not save sequence descriptions (i.e. no genomic

coordinates)

-de DELIMITER, --delimiter DELIMITER

Sequence description delimiter

Example usage

-------------

Extract names of sequences matching a specific query:

$ cblaster extract session.json -q "Query1"

Extract, download from NCBI and write to file in FASTA format:

$ cblaster extract session.json -q "Query1" -d -o output.fasta

Extract only from specific organisms (regular expressions):

$ cblaster extract session.json -or "Aspergillus.*" "Penicillium.*"

Generate delimited table (CSV) of all hits in clusters:

$ cblaster extract session.json -de ","

Cameron Gilchrist, 2020

テストラン

ランするにはFASTA形式の配列ファイル、またはNCBIのaccession ID, GI numbersが記載されたファイルを指定する。ここではテストデータとして用意されているFASTA形式の配列ファイルを使う。

git clone https://github.com/gamcil/cblaster.git
cd /cblaster/tests

#自分のメールアドレスを登録してデータを利用する
cblaster config --email name@domain.com
cblaster search --query_file test.faa -p out.html --output summary.csv -s session.json

-qf, --query_file Path to FASTA file containing protein sequences to be searched
-me, --max_evalue Maximum e-value for a BLAST hit to be saved (def.
0.01)
-mi, --min_identity Minimum percent identity for a BLAST hit to be saved
(def. 30)
-mc, --min_coverage Minimum percent query coverage for a BLAST hit to be saved (def. 50)
-u, --unique Minimum number of unique query sequences that must be conserved in a hit cluster (def. 3)
-mh, --min_hits Minimum number of hits in a cluster (def. 3)
-g, --gap Maximum allowed intergenic distance (bp) between conserved hits to be considered in the same block (def. 20000)
-ode, --output_delimiter Delimiter character to use when printing result output.
-odc, --output_decimals Total decimal places to use when printing score values
-s, --session_file Load session from JSON. If the specified file does not exist, the results of the new search will be saved to this file.

リモートでの配列検索が実行される。

[12:26:10] INFO - Starting cblaster in remote mode

[12:26:10] INFO - Launching new search

[12:26:12] INFO - Request Identifier (RID): UPPEKFF9016

[12:26:12] INFO - Request Time Of Execution (RTOE): 27s

[12:26:39] INFO - Polling NCBI for completion status

[12:26:39] INFO - Checking search status...

[12:27:39] INFO - Checking search status...

[12:27:40] INFO - Search has completed successfully!

[12:27:40] INFO - Retrieving results for search UPPEKFF9016

[12:27:43] INFO - Parsing results...

[12:27:43] INFO - Found 6835 hits meeting score thresholds

[12:27:43] INFO - Fetching genomic context of hits

[12:28:18] WARNING - Found no hits for IPG 346263175

[12:28:18] WARNING - Found no hits for IPG 346272747

[12:28:18] WARNING - Found no hits for IPG 336900197

[12:28:18] WARNING - Found no hits for IPG 341670659

[12:28:18] WARNING - Found no hits for IPG 334707585

[12:28:18] WARNING - Found no hits for IPG 341666946

[12:28:18] WARNING - Found no hits for IPG 341669695

[12:28:18] WARNING - Found no hits for IPG 341667385

（以下省略）

出力

0%(白)から100%(青)の同一性を示すヒートマップが表示される。

f:id:kazumaxneo:20201110214535p:plain

右上には

”our search of 3 queries returned 12241 hits from 12240 unique sequences.
cblaster detected 14 clusters across 10515 genomic scaffolds from 1237 organisms.”

と記載されており、14ゲノムからの遺伝子であることが分かる。

2回目以降のランならjsonファイルをロードする事で素早く視覚化できる（配列検索は終わっているので、配列検索時に使うオプションは機能しない)。

cblaster search -s session.json -p out.html

-s, --session_file Load session from JSON. If the specified file does not exist, the results of the new search will be saved to this file.

追記

配列を取り出す。例えばクエリのQBE85647.1とマッチする配列をダウンロードする。

cblaster extract session.json -q "QBE85647.1" -d -o output.fasta

> grep -n ">" output.fasta

f:id:kazumaxneo:20201111234347p:plain

14配列ある。上手く抽出できている。

生物間でよく保存された遺伝子クラスターの配列を使うと、ヒット数が多すぎて表示が重くなる。

f:id:kazumaxneo:20201111155426p:plain

その時はBLAST検索の minimum identiyを増やしたりe-value 最大値を下げる工夫が必要と思われる。クエリの遺伝子数が多いなら"-u"や"-mh"の値を上げる事も効果的かもしれない。

追記

少し厳しくした。

GUIバージョン

cblaster gui

f:id:kazumaxneo:20201110202533p:plain

引用

cblaster: a remote search tool for rapid identification and visualisation of homologous gene clusters

Cameron Laurence Mathison Gilchrist, Thomas J Booth, Yit-Heng Heng Chooi

bioRxiv, Posted November 09, 2020

追記

cblaster: a remote search tool for rapid identification and visualization of homologous gene clusters
Cameron L M Gilchrist, Thomas J Booth, Bram van Wersch, Liana van Grieken, Marnix H Medema, Yit-Heng Chooi
Bioinformatics Advances, Volume 1, Issue 1, 2021

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

ヒトとマウスのショートオープンリーディングフレーム(sORF)のデータベース MetamORF

Viral quasispeciesのアセンブリを行う virus-vg

ターゲットの菌株の種類と豊富さを調べる mixtureS

様々な種のバリアント情報をまとめたデータベース Genome Variation Map（GVM）

ウィルスゲノムのアノテーションを行う VIGOR

リピートの多いゲノム配列にロングリードをマッピングするために最適化されたアライナー Winnowmap

NCBIのデータベースへのリモート検索によって保存された遺伝子クラスターを探索し、クラスタリングして視覚化する cblaster