2022-12-27

生のシークエンシングリードからスケーラブルな高精度の系統樹を生成する Read2Tree（再掲載）

系統樹の推定は生物学の基礎となるものである。しかし、最新の系統樹解析では、複雑なパイプラインを実行する必要があり、多大な計算コストと人件費がかかる上、シーケンスのカバレッジ、アセンブリ、アノテーションの質にも制約がある。このような課題を克服するために、著者らはRead2Treeを提案する。Read2Treeは、生のシーケンシングリードを直接処理して、対応する遺伝子のグループに分割する。様々なデータセットを含むベンチマークにおいて、本アセンブル不要のアプローチは、従来のアプローチよりも10-100倍速く、ほとんどの場合においてより正確であった（例外はシーケンスカバレッジが高く、リファレンス種が非常に遠い場合）。このツールの幅広い適用性を示すため、5億9000万年の進化に及ぶ435種の酵母の生命の系統樹を再構築した。コロナウイルス科のサンプルに適用したところ、Read2Treeは、非常に多様な動物サンプルとほぼ同一のSARS-CoV-2配列を1つのツリー上で正確に分類し、驚くべき幅と深さを示した。Read2Treeのスピード、精度、汎用性により、大規模な比較ゲノム解析が可能になる。

インストール

依存

The following python packages are needed: numpy, scipy, cython, lxml, tqdm, pysam, pyparsing, requests, filelock, natsort, pyyaml, biopython, ete3, dendropy.
Besides, you need softwares including mafft (multiple sequence aligner), iqtree (phylogenomic inference), ngmlr, ngm/nextgenmap (long and short read mappers), and samtools which could be installed using conda.

Github

#依存するツールはcondaで導入できる
mamba create -n Read2Tree python=3
mamba install -c conda-forge biopython numpy Cython ete3 lxml tqdm scipy pyparsing requests natsort pyyaml filelock -y
mamba install -c bioconda dendropy #もしくはpipで 
mamba install -c bioconda mafft iqtree ngmlr nextgenmap samtools -y
#本体
git clone https://github.com/DessimozLab/read2tree.git
cd read2tree
python setup.py install

#docker
docker pull dessimozlab/read2tree:latest

テストラン

docker

docker run --rm -i -v $PWD/tests:/input -v $PWD/tests/:/reads -v $PWD/output:/out -v $PWD/run:/run  dessimozlab/read2tree:latest  --tree --standalone_path /input/marker_genes --reads /reads/sample_1.fastq --output_path /out

引用

Read2Tree: scalable and accurate phylogenetic trees from raw reads

David Dylus, Adrian Altenhoff, Sina Majidian, Fritz J Sedlazeck, Christophe Dessimoz

bioRxiv. 2022 Dec 13;2022.04.18.488678

一度紹介しましたが、dockerイメージが公開されているので改めて紹介しました。

https://kazumaxneo.hatenablog.com/entry/2022/04/21/021338

2022-12-20

タンパク質のMSA中のSNPを比較することにより、SNP共起パターンを同定する SpydrPick

　共分散に基づく共選択的圧力下での多型の発見やエピスタシスは、近年、集団ゲノム科学において非常に注目されている。染色体上の対立遺伝子の集団レベルでの共分散を統計的にモデル化し、多型のペア間の依存性をモデルフリーで検定することで、細菌集団における選択のパターンを見出すことに成功したことが示されている。ここでは、計算効率が高く、多くの細菌のパンゲノムのスケールで解析が可能なモデルフリー手法SpydrPickを紹介する。SpydrPickは集団構造に対する効率的な補正を組み込んでおり、明示的な系統樹を必要とせずにデータ中の系統的なシグナルを調整することができる。また、ゲノムワイド関連研究で用いられるマンハッタンプロットに似た、新しいタイプの結果の可視化を導入し、同定された共進化のシグナルを迅速に探索することができるようになった。シミュレーションにより、本手法の有用性を示すとともに、この種の解析が最も成功しやすい時期についての知見を得ることができる。この方法を2つの主要なヒト病原体、Streptococcus pneumoniaeとNeisseria meningitidisの大規模な集団ゲノムデータセットに適用したところ、病原性と抗生物質耐性に関する共選択の既知ターゲットと新規ターゲットの両方が発見された。
（中略）

集団の配列データから共進化シグナルを検出するための比較手法は、ここ数十年の間に多くの注目を集めている。その顕著な例として、大規模なタンパク質アラインメントにおける非隣接部位間の共分散の統計的解析が、タンパク質の3次元構造における部位間の接触を予測するのに有効であることが証明されている。タンパク質構造中の接触部位は共通の構造的制約の下で共進化するため、タンパク質アラインメント中に検出可能な相関の痕跡を与える。同様に、共通の選択圧のもとで進化した部位は、適切な表現型データがない場合でも、配列アラインメントから検出可能な共選択パターンを生じさせる可能性がある。そのため、近年、細菌集団のゲノムワイド塩基配列の探索的共変化解析が注目されている。この解析の目的は、共通の選択圧のもとで共進化し、エピスタティック相互作用に関与している可能性のある部位を発見することである。

インストール

Github

#conda (link)
mamba create -n spydrpick -y
conda activate spydrpick
mamba install -c bioconda spydrpick -y

> SpydrPick -h

$ SpydrPick -h

SpydrPick: MI-ARACNE genome-wide co-evolution analysis.

Usage: SpydrPick [options] <alignmentfile> [-o <outputfile>]

Option '--help' will print a list of available options.

-h [ --help ] Print this help message.

--version Print version information.

-v [ --verbose ] Be verbose.

--mi-threshold arg (=-1) The MI threshold value. Experience

suggests that a value of 0.11 is often

reasonable. Zero indicates no threshold

and negative values will trigger

auto-define heuristics.

--mi-values arg (=0) Approximate number of MI values to

calculate from data (default=#samples*1

00).

--mi-pseudocount arg (=0.5) The MI pseudocount value.

--mi-threshold-iterations arg (=10) Number of iterations for estimating

saving threshold.

--mi-threshold-pairs arg (=0) Number of sampled pairs for estimating

saving threshold (0=determine

automatically).

--ld-threshold arg (=0) Threshold distance for linkage

disequilibrium (LD).

--no-aracne Skip ARACNE, only calculate MI.

-t [ --threads ] arg (=-1) Number of threads per MPI/shared memory

node (-1=use all hardware threads that

the OS/environment exposes).

--alignmentfile arg The input alignment filename(s). When

two filenames are specified, only

inter-alignment links will be probed

for.

--include-list arg Name of file containing list of loci to

include in analysis.

--exclude-list arg Name of file containing list of loci to

exclude from analysis.

--mapping-list arg Name of file containing loci mappings.

--sample-list arg The sample filter list input filename.

--sample-weights arg Name of file containing sample weights.

--input-indexing-base arg (=1) Base index for input.

--no-filter-alignment Do not filter alignment columns by

applying MAF and GAP thresholds.

--maf-threshold arg (=0.01) Minor state frequency threshold. Loci

with less than 2 states above threshold

are removed from alignment.

--gap-threshold arg (=0.14999999999999999)

Gap frequency threshold. Positions with

a gap frequency above the threshold are

excluded from the pair-analysis.

--no-sample-reweighting Turn sample reweighting off, i.e. do

not try to correct for population

structure.

--sample-reweighting-threshold arg (=0.90000000000000002)

Fraction of identical positions

required for two samples to be

considered identical.

--rescale-sample-weights Rescale sample weights so that

n(effective) == n.

--output-state-frequencies Write column-wise state frequencies to

file.

--output-sample-weights Output sample weights.

--output-sample-distance-matrix Output triangular sample-sample Hamming

distance matrix.

--output-indexing-base arg (=1) Base index for output.

--output-alignment Write alignment to file.

--output-filtered-alignment Write filtered alignment to file.

--genome-size arg Genome size, if different from input.

Default = 0: detect size from input.

--linear-genome Assume linear genome in distance

calculations.

--begin arg (=1) The first locus index to work on. Used

to define a range.

--end arg (=-1) The last locus index to work on (-1=end

of input). Used to define a range.

-o [ --aracne-outputfile ] arg (=aracne.out)

The ARACNE output filename. This is a

binary file for "plot.r".

--aracne-edge-threshold arg (=2.2204460492503131e-16)

Equality tolerance threshold. Edges

differing by less than this value are

considered equal in strength.

--aracne-block-size arg (=16384) Block size for graph processing.

--aracne-node-grouping-size arg (=16) Grouping size for node processing.

実行方法

FASTA形式の多重整列ファイルを指定する。A、C、G、Tは異なるカテゴリとしてマップさが、その他の文字は1つのギャップカテゴリにマップされる。大文字と小文字は区別されない。

SpydrPick -v core_gene_alignment.aln

-v Be verbose.
--ld-threshold <arg> (=0) 　 Threshold distance for linkage disequilibrium (LD).
--linear-genome 　Assume linear genome in distance calculations.

出力例

Githubより

SpydrPickのメイン出力ファイル *.spydrpick_couplings.?-based.*edges は、入力アラインメントの列（左から右）に従って番号付けされたMI値（Mutual information value; 論文で定義された2つの確率変数の相互依存性を表す情報理論的な尺度）と位置インデックスのペア（デフォルトでは1ベースのインデックス使用、--output-indexing-baseでコントロール）の降順リストで空白区切りのリストが含まれている。MI値が外れ値閾値以上で --ld-閾値より離れている位置ペアは、いくつかの追加データと共に *.outliers というファイルでも見つけることができる。
デフォルトの設定は全体的に合理的な結果を得られるように設計されている。
注目すべき例外は、連鎖不平衡（LD）の閾値距離オプション --ld-threshold= と、アライメントが環状染色体（デフォルト）か線状染色体（追加 --linear-genome）かである。LDやアセンブリレベルは、研究対象の生物の組換え特性などの要因に依存する。細菌ゲノムの場合、--ld-thresholdの典型的な値は500-20000bpの範囲にある。
一般的には、より長く、より保守的な距離の推定値を使用する方が安全である。これは、短すぎる距離のカットオフよりも外れ値および極端な外れ値のしきい値への影響が少ないため。

Rスクリプトgwes_plot.rを使うと、結果のedgesファイルから論文のFIg.3AやFIg.4のようなGWES (genome-wide epistasis and co-selection study) Manhattan plotをプロットできる。使用するにはRのコード；gwes_plot.rの10行目input_full_filepathで出力されたedgeファイルをフルパスで指定する。

git clone https://github.com/santeripuranen/SpydrPick.git
cd SpydrPick/
Rscript gwes_plot.r

引用

Genome-wide epistasis and co-selection study using mutual information
Johan Pensar, Santeri Puranen, Brian Arnold, Neil MacAlasdair, Juri Kuronen, Gerry Tonkin-Hill, Maiju Pesonen, Yingying Xu, Aleksi Sipola, Leonor Sánchez-Busó, John A Lees, Claire Chewapreecha, Stephen D Bentley, Simon R Harris, Julian Parkhill, Nicholas J Croucher, Jukka Corander
Nucleic Acids Research, Volume 47, Issue 18, 10 October 2019, Page e112

2022-12-14

ロングリード配列やコンティグ配列をビニングする LRBinner

Algorithms for Molecular Biology 2022 metagenome Binning (metagenomics)

　メタゲノム解析の進歩により、環境から直接微生物群集を研究することが可能になった。メタゲノム解析は、微生物群集の種を特定するための重要なステップである。次世代シーケンサーのリードは、ショートリードの情報が限られているため、通常コンティグにアセンブルされる。第三世代シーケンサーでは、ショートリードからアセンブルしたコンティグに近い長さのリードを得ることができる。しかし、既存のコンティグビニングツールは、カバレッジ情報がないことや高いエラー率の存在により、ロングリードに直接適用することができない。既存のロングリードのビニングツールは、組成のみを使用するか、組成とカバレッジ情報を別々に使用するかのどちらかである。このため、存在量の少ない生物種に対応するビンを無視したり、coveragesが均一でない生物種に対応するビンを誤って分割してしまうことがある。本発表では、完全なロングリードのデータを使い、組成と被覆率を組み合わせた、参照不要のビン分割手法であるLRBinnerを紹介する。また、LRBinnerは距離ヒストグラムに基づくクラスタリングアルゴリズムを用いて、様々な大きさのクラスタを抽出することができる。
　シミュレーションデータと実データを用いた実験の結果、LRBinnerは、サンプリングを行わない完全なデータセットを扱う場合、ほとんどのケースで最良のビニング精度を達成することが分かった。さらに、LRBinnerを用いたビニング処理により、ビニング処理に必要な計算量を削減し、十分なビン化品質を達成することができた。
　LRBinnerは、メタゲノム解析におけるロングリードのビニングを支援するために、ディープラーニング技術を効果的な特徴量の集約に利用できることを示す。さらに、ロングリードの正確なビニングは、特に複雑なデータセットにおけるメタゲノム解析の改善をサポートする。また、ビニングはアセンブリに必要なリソースを削減することにも役立つ。LRBinnerのソースコードは、https://github.com/anuradhawick/LRBinner で自由に利用することができる。

Extended LRBinner(https://t.co/I1n3A2cbBq) to handle long reads assemblies with fewer contigs and higher N50. Needs performance tuning. Many updates and blogs to come. Ideas/code snippets from @viji112 and FaceNet paper. Feedback is ❤️#microbiome #metagenomics #MachineLearning
— Anuradha (@Anuradhawick) September 19, 2021

インストール

依存

LRBinner is coded purely using C++ (v9) and Python 3.7. To run LRBinner, you will need to install the following python and C++ modules.

Python dependencies
Essential libraries

numpy 1.16.4
scipy 1.3.0
seaborn 0.9.0
h5py 2.9.0
tabulate 0.8.7
pytorch 1.4.0

Essential for contig binning

HDBSCAN

C++ requirements

GCC version 9.1.0
OpenMP 4.5 for multi processing

Github

mamba create -n lrbinner python=3.7 -y
conda activate lrbinner
mamba install -c conda-forge -c bioconda numpy scipy seaborn h5py tabulate pytorch hdbscan gcc openmp tqdm biopython fraggenescan hmmer
#pytorchははいらなかったのでpipで導入
pip install torch

git clone https://github.com/anuradhawick/LRBinner.git
cd LRBinner/
python setup.py build

> python LRBinner -h

$ python LRBinner -h

usage: LRBinner [-h] [--version] {reads,contigs} ...

LRBinner Help. A tool developed for binning of metagenomics long reads

(PacBio/ONT) and long read assemblies. Tool utilizes composition and coverage

profiles of reads based on k-mer frequencies to perform dimension reduction

via a deep variational auto-encoder. Dimension reduced reads are then

clustered. Minimum RAM requirement is 9GB (4GB GPU if cuda used).

optional arguments:

-h, --help show this help message and exit

--version, -v Show version.

LRBinner running Mode:

{reads,contigs}

reads for binning reads

contigs for binning contigs

> python LRBinner reads -h

usage: LRBinner reads [-h] --reads-path READS_PATH [--k-size {3,4,5}]

[--bin-size BIN_SIZE] [--bin-count BIN_COUNT]

[--ae-epochs AE_EPOCHS] [--ae-dims AE_DIMS]

[--ae-hidden AE_HIDDEN] [--threads THREADS] [--separate]

[--cuda] [--resume] --output <DEST>

[--min-bin-size MIN_BIN_SIZE]

[--bin-iterations BIN_ITERATIONS]

optional arguments:

-h, --help show this help message and exit

--reads-path READS_PATH, -r READS_PATH

Reads path for binning

--k-size {3,4,5}, -k {3,4,5}

k value for k-mer frequency vector. Choose between 3

and 5.

--bin-size BIN_SIZE, -bs BIN_SIZE

Bin size for the coverage histogram.

--bin-count BIN_COUNT, -bc BIN_COUNT

Number of bins for the coverage histogram.

--ae-epochs AE_EPOCHS

Epochs for the auto_encoder.

--ae-dims AE_DIMS Size of the latent dimension.

--ae-hidden AE_HIDDEN

Hidden layer sizes eg: 128,128

--threads THREADS, -t THREADS

Thread count for computations

--separate, -sep Flag to separate reads/contigs into bins detected.

Avaialbe in folder named 'binned'.

--cuda Whether to use CUDA if available.

--resume Continue from the last step or the binning step (which

ever comes first). Can save time needed count k-mers.

--output <DEST>, -o <DEST>

Output directory

--min-bin-size MIN_BIN_SIZE, -mbs MIN_BIN_SIZE

The minimum number of reads a bin should have.

--bin-iterations BIN_ITERATIONS, -bit BIN_ITERATIONS

Number of iterations for cluster search. Use 0 for

exhaustive search.

> python LRBinner contigs -h

usage: LRBinner contigs [-h] --reads-path READS_PATH [--k-size {3,4,5}]

[--bin-size BIN_SIZE] [--bin-count BIN_COUNT]

[--ae-epochs AE_EPOCHS] [--ae-dims AE_DIMS]

[--ae-hidden AE_HIDDEN] [--threads THREADS]

[--separate] [--cuda] [--resume] --output <DEST>

--contigs CONTIGS

optional arguments:

-h, --help show this help message and exit

--reads-path READS_PATH, -r READS_PATH

Reads path for binning

--k-size {3,4,5}, -k {3,4,5}

k value for k-mer frequency vector. Choose between 3

and 5.

--bin-size BIN_SIZE, -bs BIN_SIZE

Bin size for the coverage histogram.

--bin-count BIN_COUNT, -bc BIN_COUNT

Number of bins for the coverage histogram.

--ae-epochs AE_EPOCHS

Epochs for the auto_encoder.

--ae-dims AE_DIMS Size of the latent dimension.

--ae-hidden AE_HIDDEN

Hidden layer sizes eg: 128,128

--threads THREADS, -t THREADS

Thread count for computations

--separate, -sep Flag to separate reads/contigs into bins detected.

Avaialbe in folder named 'binned'.

--cuda Whether to use CUDA if available.

--resume Continue from the last step or the binning step (which

ever comes first). Can save time needed count k-mers.

--output <DEST>, -o <DEST>

Output directory

--contigs CONTIGS, -c CONTIGS

Contigs path

実行方法

ロングリードを直接binningするモード（reads ）とcontigをbinningするモード(contigs)

がある。

１、LRBinnerの実行

レポジトリからリンクされる形でテストデータが公開されている。平均8500-bp、最小5000、最長37000-bpのシークエンシングリードのfastaファイルとなっている。

python LRBinner reads -r reads.fasta -bc 10 -bs 32 -o outdir --resume --cuda -mbs 5000 --ae-dims 4 --ae-epochs 200 -bit 0 -t 32

生のfastqには対応しているがgzipされたfasta/fastq配列には対応してない。

リードを提供した場合、各配列がどのbinに属するかの番号が出力ディレクトリ内に書き出される。

２、bin配列にリードを分類する。出力ディレクトリにあるPKLファイルと１で使ったリードファイルを指定する。

python separate_reads.py --reads reads.fasta -b outdir/binning_result.pkl -o bin

出力例

bin/

bin-1/

リードは重複しているので、このあと単離されたゲノムのアセンブラか、まだ汚染があるならメタゲノムアセンブラで個別にDe novoアセンブルする(*1)。

試した限り、k-mer frequency（3,4,5）の値で結果は変化することが多いです。issuesではパラメータについても議論されています。興味がある方は確認してみてください。

引用

Binning long reads in metagenomics datasets using composition and coverage information
Anuradha Wickramarachchi & Yu Lin
Algorithms for Molecular Biology volume 17, Article number: 14 (2022)

ハイブリッドRNAシーケンスデータを使ってゲノムアノテーションを改善する annotate_my_genomes

2022 GigaScience docker eukaryotic genome annotation nextflow Iso-Seq

2022/12/27,28 追記

　ハイブリッドシーケンステクノロジーの進歩により、ハイブリッドシーケンス・トランスクリプトミクスを用いてしばしばアノテーションされるゲノムアセンブリがますます拡大し、ゲノムの特性解析が向上し、さまざまな生物における新規遺伝子やアイソフォームの同定に繋がっている。
ハイブリッドシーケンスデータから収集した転写産物を入力とし、いくつかのバイオインフォマティクス的アプローチを統合することにより、GTFフォーマットの過去のアノテーションとの遺伝子照合を含め、コーディングRNAとロングノンコーディング RNAを区別する使いやすいゲノム誘導型トランスクリプトームアノテーションパイプラインを開発した。また、ニワトリのSCO-spondin遺伝子（105以上のエクソンを含む）の全エクソンを正しくアセンブルし、アノテーションすることにより、本手法の有効性を実証した（相同性割り当てによるニワトリのリファレンスアノテーションにおける欠損遺伝子の同定を含む）。
本手法は、ニワトリ脳のトランスクリプトームアノテーションを改善するのに役立つ。Anaconda/NextflowとDockerで実装された使いやすいパッケージで、幅広い種、組織、研究分野に適用でき、現在のアノテーションの改善と調和に役立つものである。コードとデータセットは、https://github.com/cfarkas/annotate_my_genomes で公開されている。

wiki

https://github.com/cfarkas/annotate_my_genomes/wiki

このパイプラインは新規アノテーションを１からつけるためよりも、既にある程度の品質のアノテーションが公開されており、Iso-seqなどのlong RNA seqのデータを使ってそれをさらに改善することを目的としています。

インストール

dockerイメージは文字化けエラーを起こしたので、condaで環境を作ってテストした（#2; install option2の手順）。

Github

#1 
git clone https://github.com/cfarkas/annotate_my_genomes.git
cd annotate_my_genomes
current_dir=$(pwd)
nextflow run makefile.nf --workdir $current_dir --conda ./22.04_environment.yml
sudo cp ./bin/* /usr/local/bin/

#2
git clone https://github.com/cfarkas/annotate_my_genomes.git
cd annotate_my_genomes
conda config --add channels bioconda
conda config --add channels conda-forge
mamba env create -f 22.04_environment.yml
conda activate annotate_my_genomes
bash makefile.sh
#パスの通ったディレクトリにコピー。ここでは仮想環境のbinに
sudo cp ./bin/* /home/kazu/mambaforge/envs/annotate_my_genomes/bin/     

#3 docker
docker pull carlosfarkas/annotate_my_genomes:latest

> annotate-my-genomes

arguments -a, -r, -g, -c, -t and -o must be provided

annotate-my-genomes [-h] [-a <stringtie.gtf>] [-r

<reference_genome.gtf>] [-g <reference_genome.fasta>] [-c

<gawn_config>] [-t <threads>]

This pipeline will Overlap StringTie transcripts (GTF format) with

current UCSC annotation and will annotate novel transcripts.

Arguments:

-h show this help text

-a StringTie GTF

-r UCSC gene annotation (in GTF format)

-g Reference genome (in fasta format)

-c GAWN config file (path to gawn_config.sh in annotate_my_genomes folder)

-t Number of threads for processing (integer)

-o output folder (must exist)

> add-ncbi-annotation

add-ncbi-annotation [-h] [-a <stringtie.gtf>] [-n

<NCBI_reference.gtf>] [-r <reference_genome.gtf>] [-g

<reference_genome.fasta>] [-c <gawn_config>] [-t <threads>] [-o

<output>]

This pipeline will Overlap StringTie transcripts (GTF format) with

current NCBI annotation and will annotate novel transcripts.

Arguments:

-h show this help text

-a StringTie GTF

-n NCBI gene annotation (in GTF format)

-r UCSC gene annotation (in GTF format)

-g Reference genome (in fasta format)

-c GAWN config file (path to gawn_config.sh in annotate_my_genomes folder)

-t Number of threads for processing (integer)

arguments -a, -n, -r, -g, -c, -t and -o must be provided

-o output folder (must exist)

(an

テストラン

１、ゲノムの準備

cd annotate_my_genomes/nextflow_scripts/
mkdir outdir
nextflow run genome-download.nf \
--genome galGal6 \
--conda ../22.04_environment.yml --outdir outdir

最近のバージョンのnextflowを使っているなら-dsl1をつけてランする。nextflowのバージョンが古いとエラーが起きるので注意。

outdir/

２、annotate-my-genomesを実行する。
ファイルはフルパスで指定する。nextflowを使うとエラーが出たのでここでは使わず実行する。stringtie.gtf の作り方のコマンドレシピはこちらで説明されている。ここでは

#エラーになる
cd annotate_my_genomes/nextflow_scripts/
mkdir output_folder
nextflow run annotate-my-genomes.nf \
--stringtie $PWD/outdir/stringtie.gtf \
--ref_annotation $PWD/outdir/galGal6.gtf \ 
--genome $PWD/outdir/galGal6.fa \
--config $PWD/gawn_config.sh \
--threads 20 \
--conda /path/to/22.04_environment.yml --outdir $PWD/output_folder/

#エラーが出たのでnextflowのスクリプトを使わずに実行
mkdir output2
annotate-my-genomes -a outdir/transcripts.gtf -g outdir/galGal6.fa -c gawn_config.sh -t 20 -o output2/ -r output/galGal6.gtf

outdir2

現在のアノテーションにlong RNA seq read 由来アノテーション（stringtie2.gtf）がマージされたファイルがfinal_annotated.gtf。

３、さらにNCBIのアノテーションを追加することもできる。その場合、２のコマンドの代わりにadd-ncbi-annotationを使用する。-n以外のパラメータは２と同じ。

mkdir output3
add-ncbi-annotation -a output/transcripts.gtf -n output/galGal6_ncbiRefSeq.gtf -r output/galGal6.gtf -g output/galGal6.fa -c gawn_config.sh -t 30 -o output3/

outdir3/

BRAKER2 パイプラインの出力 braker.gtf または TSEBRA パイプラインの出力 tsebra.gtfも使ってNCBI annotationとマージしたアノテーションファイルを作ることができる（braker2の出力はbraker.gff3とbraker.gtfだが、gtfのほうを使う）。それにはAGAT toolkitを使ってbraker.gtfのフォーマットを修正し、add-ncbi-annotationコマンドを走らせる。詳細はレポジトリの"VI Annotation of BRAKER2 / TSEBRA gtf output"で説明されている。
stringtie.gtfをadd-ncbi-annotationでアノテーションした場合、add-ncbi-annotationパイプラインの出力としてtranscripts annotation table（csv）を作成することができる。isoform-identificationコマンドを使う。
新規タンパク質のアノテーションと、パイプラインで同定された新規遺伝子内のパラログの同定を行うことができる（リンク）。

引用

annotate_my_genomes: an easy-to-use pipeline to improve genome annotation and uncover neglected genes by hybrid RNA sequencing
Carlos Farkas, Antonia Recabal, Andy Mella, Daniel Candia-Herrera, Maryori González Olivero, Jody Jonathan Haigh, Estefanía Tarifeño-Saldivia, Teresa Caprile
GigaScience, Volume 11, 2022, Published: 06 December 2022

EMBL-EBIのMGnifyのアップデート

jupyter notebook metagenome 2022 Nucleic Acids Research ヒト口腔 (human oral) ヒト腸内 (human gut) Marine Metagenomics animal fish

　MGnifyプラットフォーム（https://www.ebi.ac.uk/metagenomics）は、微生物由来の核酸配列のアセンブリ、解析、保存を容易にする。このプラットフォームでは、メタバーコード、メタトランススクリプトーム、メタゲノムなど、様々な環境由来のデータセットを対象とした約50万件の解析について、分類や機能アノテーションを利用することができる。過去3年間で、MGnifyは収録データセット数の増加だけでなく、ロングリード配列の解析など、提供される解析の幅も広がった。MGnifyのタンパク質データベースは、メタゲノム解析から予測される非冗長配列が24億個を超えるようになった。このコレクションはリレーショナルデータベースに整理され、ソースアセンブリやサンプルメタデータへのナビゲーションを通じてタンパク質のゲノムコンテキストを理解することが可能になり、大きな進歩を遂げた。MGnifyで既に提供されている機能的アノテーションをさらに拡張するために、著者らは深層学習ベースのアノテーション手法を適用した。また、MGnifyのアプリケーションプログラミングインターフェース（API）とウェブサイトの基盤技術をアップグレードし、Jupyter Lab環境を導入してMGnifyデータのダウンストリーム解析ができるようにした。

Documentation

https://docs.mgnify.org/en/latest/

Github（EBI-Metagenomics / genomes-pipeline Public)レポジトリは他にもあり。

webサービス

https://www.ebi.ac.uk/metagenomics/ にアクセスする。

以前にも紹介していますので、ここでは更新されている部分について見ていきます。

Browse dataでは閲覧できるデータがかなり増えている。

Genomesタブでも利用できる大規模メタゲノムプロジェクトからのMAGが増えている。口腔、海、家畜や魚の腸内細菌など幅広い。

クリックすると利用可能なゲノムが表示される。Marine v1.0では1504個のゲノムについて閲覧できる。これらMAGsは上にリンクを張ったGIthubのパイプラインで解析して得られたものになる（興味があるなら下の画像のPilpeline v1.2.1のところをクリック）。パイプラインの詳細はこちらで説明されている。

Genomeをクリックすると詳細が表示される。基本的なゲノムの統計、COG、KEGG、などの分析結果になる。

KEGG class

EMBL-EBI MGnify | Jupyter Notebook Lab

https://shiny-portal.embl.de/shinyapps/app/06_mgnify-notebook-lab

引用

MGnify: the microbiome sequence data analysis resource in 2023
Lorna Richardson, Ben Allen, Germana Baldi, Martin Beracochea, Maxwell L Bileschi, Tony Burdett, Josephine Burgin, Juan Caballero-Pérez, Guy Cochrane, Lucy J Colwell, Tom Curtis, Alejandra Escobar-Zepeda, Tatiana A Gurbich, Varsha Kale, Anton Korobeynikov, Shriya Raj, Alexander B Rogers, Ekaterina Sakharova, Santiago Sanchez, Darren J Wilkinson, Robert D Finn
Nucleic Acids Research, Published: 07 December 2022

2022-12-08

微生物パンゲノム解析のスコア付けを行う Scoary

pan-GWAS Genome Biology 2016 PAV (presence/absence variations) pan-genome

2024/04/09 追記

　ゲノムワイド関連研究（GWAS）は、ヒトの医学やゲノミクスにおいて不可欠なものとなっているが、細菌を対象とした研究はほとんど行われていない。本発表では、パンゲノムの構成要素について、観察された表現形質との関連を、集団の階層性を考慮しながら、進化過程の仮定を最小限にしてスコア化する、超高速で使いやすく、広く応用できるソフトウェアツール、Scoaryを紹介する。著者らは、この手法を従来の一塩基多型（SNP）ベースのGWASと区別するために、 pan-GWASと呼んでいる。ScoaryはPythonで実装され、オープンソースGPLv3ライセンスの下、https://github.com/AdmiralenOla/Scoary で入手可能である。

インストール

依存

Python (Tested with versions 2.7, 3.4, 3.5, 3.6 and 3.6-dev)
SciPy (Tested with versions 0.16, 0.17, 0.18)

If you supply custom trees (Optional)

ete3
six

Github

#PyPI(link)
pip install scoary==1.6.16
#option
pip install ete3 six

#conda (link) 
mamba install -c bioconda scoary

> scoary -h

usage: scoary [-h] [-t TRAITS] [-g GENES] [-n NEWICKTREE] [-s START_COL]

[--delimiter DELIMITER] [-r RESTRICT_TO] [-o OUTDIR] [-u]

[-p P_VALUE_CUTOFF [P_VALUE_CUTOFF ...]]

[-c [{I,B,BH,PW,EPW,P} [{I,B,BH,PW,EPW,P} ...]]] [-m MAX_HITS]

[--include_input_columns GRABCOLS] [-w] [--no-time] [-e PERMUTE]

[--no_pairwise] [--collapse] [--threads THREADS] [--test]

[--citation] [--version]

Scoary version 1.6.16 - Screen pan-genome for trait-associated variants

optional arguments:

-h, --help show this help message and exit

Input options:

-t TRAITS, --traits TRAITS

Input trait table (comma-separated-values). Trait

presence is indicated by 1, trait absence by 0.

Assumes strain names in the first column and trait

names in the first row

-g GENES, --genes GENES

Input gene presence/absence table (comma-separated-

values) from ROARY. Strain names must be equal to

those in the trait table

-n NEWICKTREE, --newicktree NEWICKTREE

Supply a custom tree (Newick format) for phylogenetic

analyses instead instead of calculating it internally.

-s START_COL, --start_col START_COL

On which column in the gene presence/absence file do

individual strain info start. Default=15. (1-based

indexing)

--delimiter DELIMITER

The delimiter between cells in the gene

presence/absence and trait files, as well as the

output file.

-r RESTRICT_TO, --restrict_to RESTRICT_TO

Use if you only want to analyze a subset of your

strains. Scoary will read the provided comma-separated

table of strains and restrict analyzes to these.

Output options:

-o OUTDIR, --outdir OUTDIR

Directory to place output files. Default = .

-u, --upgma_tree This flag will cause Scoary to write the calculated

UPGMA tree to a newick file

-p P_VALUE_CUTOFF [P_VALUE_CUTOFF ...], --p_value_cutoff

P_VALUE_CUTOFF [P_VALUE_CUTOFF ...]

P-value cut-off / alpha level. For Fishers,

Bonferronis, and Benjamini-Hochbergs tests, SCOARY

will not report genes with higher p-values than this.

For empirical p-values, this is treated as an alpha

level instead. I.e. 0.02 will filter all genes except

the lower and upper percentile from this test. Run

with "-p 1.0" to report all genes. Accepts standard

form (e.g. 1E-8). Provide a single value (applied to

all) or exactly as many values as correction criteria

and in corresponding order. (See example under

correction). Default = 0.05

-c [{I,B,BH,PW,EPW,P} [{I,B,BH,PW,EPW,P} ...]], --correction

[{I,B,BH,PW,EPW,P} [{I,B,BH,PW,EPW,P} ...]]

Apply the indicated filtration measure. Allowed values

are I, B, BH, PW, EPW, P. I=Individual (naive)

p-value. B=Bonferroni adjusted p-value. BH=Benjamini-

Hochberg adjusted p. PW=Best (lowest) pairwise

comparison. EPW=Entire range of pairwise comparison

p-values. P=Empirical p-value from permutations. You

can enter as many correction criteria as you would

like. These will be associated with the

p_value_cutoffs you enter. For example "-c I EPW -p

0.1 0.05" will apply the following cutoffs: Naive

p-value must be lower than 0.1 AND the entire range of

pairwise comparison values are below 0.05 for this

gene. Note that the empirical p-values should be

interpreted at both tails. Therefore, running "-c P -p

0.05" will apply an alpha of 0.05 to the empirical

(permuted) p-values, i.e. it will filter everything

except the upper and lower 2.5 percent of the

distribution. Default = Individual p-value. (I)

-m MAX_HITS, --max_hits MAX_HITS

Maximum number of hits to report. SCOARY will only

report the top max_hits results per trait

--include_input_columns GRABCOLS

Grab columns from the input Roary file. and puts them

in the output. Handles comma and ranges, e.g.

--include_input_columns 4,6,8,16-23. The special

keyword ALL will include all relevant input columns in

the output

-w, --write_reduced Use with -r if you want Scoary to create a new gene

presence absence file from your reduced set of

isolates. Note: Columns 1-14 (No. sequences, Avg group

size nuc etc) in this file do not reflect the reduced

dataset. These are taken from the full dataset.

--no-time Output file in the form TRAIT.results.csv, instead of

TRAIT_TIMESTAMP.csv. When used with the -w argument

will output a reduced gene matrix in the form

gene_presence_absence_reduced.csv rather than

gene_presence_absence_reduced_TIMESTAMP.csv

Analysis options:

-e PERMUTE, --permute PERMUTE

Perform N number of permutations of the significant

results post-analysis. Each permutation will do a

label switching of the phenotype and a new p-value is

calculated according to this new dataset. After all N

permutations are completed, the results are ordered in

ascending order, and the percentile of the original

result in the permuted p-value distribution is

reported.

--no_pairwise Do not perform pairwise comparisons. Inthis mode,

Scoary will perform population structure-naive

calculations only. (Fishers test, ORs etc). Useful for

summary operations and exploring sets. (Genes unique

in groups, intersections etc) but not causal analyses.

--collapse Add this to collapse correlated genes (genes that have

identical distribution patterns in the sample) into

merged units.

Misc options:

--threads THREADS Number of threads to use. Default = 1

--test Run Scoary on the test set in exampledata, overriding

all other parameters.

--citation Show citation information, and exit.

--version Display Scoary version, and exit.

by Ola Brynildsrud (olbb@fhi.no)

実行方法

GUI上でも実行できるが（scoary_GUI）、ここではコマンドライン上での基本的な実行方法を説明する。

gene presence absence tableのCSVファイルと形質を1/0で示したCSVファイルを指定する。gene presence absence tableのフォーマットはroaryの出力もしくはLS-BSRの出力を想定している。他のパンゲノムツールの出力を使う場合、１列目にユニークな遺伝子名、15列目以降（excel等ではO列以降）に各ゲノムについて１列ずつ遺伝子の有無が書いてあれば、C-N列の情報は白紙でも受け付けることを確認した。

O列以降の１行目は各ゲノム名。

もう１つは表現型マトリックスのファイル。

gene presence absence tableのゲノム名と同じ名前で形質の有無を記載したCSVファイルを作成する。形質列は複数列あってもよい。

v1.6.9以降、欠損データを扱えるようになった。欠損値は "NA", ".", "-" のいずれかで指定する（Scoary はこれらの欠損値に対して実際には何らかの不確実性モデルを指定せず、単にさらなる分析から除外することに注意（マニュアルより））。

オプションでサンプルの系譜を定義する系統樹も使える。提供されない場合は、入力遺伝子型ファイルの分離ハミング距離を通して内部で計算される。

準備ができたらscoaryを実行する。macなどで作成した場合、改行形式に注意する（linux推奨）。

scoary -g gene_presence_absence.csv -t traits.csv

出力例。

traits ファイル内の trait 毎に 1 つの csv ファイルが出力される。有意性の順にソートされ、デフォルトではnaive p-value < 0.05 のすべての遺伝子が報告される（オプションで変更可）。

サンプルサイズ、オッズ比、native P値（Fisher’s exact test）、Bonferroni法での補正後の調整後P値(FWER)、Benjamini-Hochbergで補正後の調整後P値（FDR）、すべての検定推定量をランク付けした後の経験的p値（Empirical_p）などが表示される。

バージョン 1.6.12 から、scoary は VCF ファイルを Roary/Scoary 形式に変換する機能を備えている。これにより、バリアント(SNPs, indel, structural variants など)を入力に使用できる。

vcf2scoary myvariants.vcf

(すべてのVCFファイルを正しく扱えるとは限らないため、バグがある場合は報告すること)

論文の表２にサンプルサイズを変えて検出力がnative P値、FDR、経験的p値でどう変化するか示した結果があります。また、Githubには他の色々な使用例が書かれています。確認してください。

追記

macでエラーが出る時は入力CSVの改行コードがLFになっているか確認する。

引用

Rapid scoring of genes in microbial pan-genome-wide association studies with Scoary
Ola Brynildsrud, Jon Bohlin, Lonneke Scheffer, Vegard Eldholm

Genome Biol. 2016 Nov 25;17(1):238

世界中の微生物種の生態を調べる Microbe Atlas Project（MAP）データベース

2017 Bioinformatics geographic distribution web tool bacteria metagenome

https://microbeatlas.org/index.html?action=aboutより

　メタゲノム配列が決定された大規模なサンプル群を集約的に解析することで、未知あるいは研究が不十分な微生物分類群が存在する典型的な存在量や環境に関する情報を蓄積できる。これにより、未知の微生物分類群がどのような環境に多く存在するのかについての第一線の情報が得られ、その能力の推論が可能になる。この情報により、微生物研究者は、試料中の微生物分類群のうち、自分の研究関心に関連しそうなものに優先順位をつけることができるようになる。
　微生物群集は、同じ環境から採取されたサンプルであっても、多様で大きく変動することが多いため、環境における有意差や変化を特徴付けるためには、参照ゲノムセットの利用が不可欠となる。MAPのウェブサイトでは、研究者が自身のマイクロバイオームサンプルを100万以上のMAPリファレンスマイクロバイオームと比較することができる。これらは、著者らが以前に解析したメタゲノム配列のサンプルで、NCBI SRAデータベースから取得したものである。このような違いは、研究対象サンプルに特有の興味深い生物学的効果によるものか、あるいはプロトコルの違いによるものかもしれませんが、結果を解釈する際には考慮されるべきものである。

MAPについて
世界の微生物多様性がいかに未知であるかは、完全に配列決定されたゲノムや培養収集株が利用可能な微生物種のごく一部（2%）によく表れている。大多数の微生物種は、メタゲノム試料中の16Sまたは18Sリボソーム RNA遺伝子（SSU rRNA）の配列決定からしか存在を知られていない。
The Microbe Atlas Projectは、配列決定された大量の微生物群集を活用することで、これらの捉えどころのない微生物の生態に新たな光を当てることを目的としている。個々の微生物群集や単一の研究の解析結果を提供するだけでなく、すべての微生物群集や研究の解析結果を組み合わせて、微生物の生態を包括的に把握することができるのが、本アプローチの特徴である。このリソースは、いくつかの手法、ツール、パイプライン（HPC-CLUST、MAPseq）の締結により実現された。
配列決定された微生物群集に関する微生物量の定量化は、MAPseq（引用1）を用いたクローズド・リファレンス・リボソーム RNA解析に基づいて行われる。共通のリファレンスを使用することで、異なるシーケンスプロトコルを使用するサンプルおよび研究間で、微生物分類群の存在量を直接比較することができる。したがって、解析されたシーケンスデータには、アンプリコンシーケンス、ショットガンシークエンス、メタトランススクリプトームシーケンスが含まれる。

Documentation（ワークフローが説明されています）

https://microbeatlas.org/index.html?action=help

webサービス

https://microbeatlas.org/にアクセスする。

種名や配列などで検索できる。Nostoc punctiformeで検索してみる。

Nostoc punctiformeがトップヒットした。興味があるtaxaをクリックする。

どのような環境サンプルに含まれているか集計された結果が表示される。

シークエンスされたサンプルでは土壌に多いことが分かった。

MAPをクリックするとサンプリングされた場所がどこかを定量的に確認できる。

ソースごとに色分けすることもできる。

Readsのkronaスタイルの視覚化。

Projectsタブでは収録されているSRAのIDから探索できる。

Publication explorerタブでは論文を検索できる。

MAPseqタブでは、rRNA配列をペーストすることで素早く階層的な分類情報を調べることができる。

出力例

微生物群集の性質は多様であるため、典型的な微生物群集の特徴を明らかにするためには、統計学の利用が必要である。また、腸内細菌叢のような環境では、複数の異なる特徴的なタイプの微生物群集（腸内細菌型）が存在することが示されている。つまり、メタデータや環境情報だけを使って微生物コミュニティを典型的なコミュニティの集合に分類すると、グループが粗くなりすぎてしまう可能性がある。そこで、階層的クラスタリング手法により、MAP参照マイクロバイオームをコミュニティの類似性に基づいてあらかじめグループ分けした。このようにして、研究者は自分のサンプルとMAPの最も類似したタイプの特徴的な環境との間に存在することが判明した差異を得ることができる。
2つのサンプルグループ間の微生物量の統計的有意差は、ユーザー提供サンプルとグループ、またはユーザーサンプルとMAP提供リファレンスサンプル間の両方について、MAPウェブサイトを使用して計算することができる。これにより、研究者は研究している環境に特有な微生物分類を特定することができる。

作成中

引用

MAPseq: highly efficient k-mer search with confidence estimates, for rRNA sequence analysis
João F Matias Rodrigues, Thomas S B Schmidt, Janko Tackmann, Christian von Mering
Bioinformatics, Volume 33, Issue 23, 01 December 2017, Pages 3808–3810

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

生のシークエンシングリードからスケーラブルな高精度の系統樹を生成する Read2Tree（再掲載）

タンパク質のMSA中のSNPを比較することにより、SNP共起パターンを同定する SpydrPick

ロングリード配列やコンティグ配列をビニングする LRBinner

ハイブリッドRNAシーケンスデータを使ってゲノムアノテーションを改善する annotate_my_genomes

EMBL-EBIのMGnifyのアップデート

微生物パンゲノム解析のスコア付けを行う Scoary

世界中の微生物種の生態を調べる Microbe Atlas Project（MAP）データベース