SANS serif: a reference-free, alignment-free phylogenetic comparison tool

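In computational pangenomics and phylogenetics, analyzing many genomes in parallel is a major challenge. Conventional approaches to phylogenetic reconstruction rely on aligning specific sequences such as marker genes. Multiple sequence alignment is a hard problem, however, and becomes impractical on large datasets. Whole-genome approaches need neither marker gene identification nor expensive alignments, but they usually perform a quadratic number of sequence comparisons (in terms of k-mers or other patterns) to obtain pairwise distances. Runtime therefore grows quadratically with the number of input sequences, which makes these approaches unsuitable for projects with many genomes.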
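To make the quadratic growth concrete, here is a small sketch that counts the unordered genome pairs a pairwise-distance method has to compare. The function name and the genome counts are illustrative only; nothing here comes from SANS itself.

```python
from itertools import combinations


def pairwise_comparisons(n_genomes):
    """Number of unordered genome pairs: n * (n - 1) / 2."""
    return n_genomes * (n_genomes - 1) // 2


# Cross-check the closed form against explicit enumeration for a small case.
assert pairwise_comparisons(12) == len(list(combinations(range(12), 2)))

for n in (12, 100, 1000, 10000):
    print(f"{n:>6} genomes -> {pairwise_comparisons(n):>12,} pairwise comparisons")
```

Going from 100 to 1000 genomes multiplies the number of comparisons by roughly 100, not 10.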

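The authors present SANS serif, a whole-genome based tool that is both alignment-free and reference-free. This command-line tool accepts assembled genomes as well as raw reads as input and computes a set of splits that can be visualized as a phylogenetic tree or network with existing tools such as SplitsTree (Huson and Bryant, 2006). Instead of computing pairwise distances, it follows a pangenomic approach that does not require a quadratic number of sequence comparisons. An evaluation of the authors' earlier implementation, SANS (Wittler, 2020), already showed promising results: the method was significantly faster and more memory-efficient than other whole-genome based approaches. The new version, SANS serif, is a standalone re-implementation without third-party library dependencies. It introduces several new features and further improves performance, reducing runtime and memory usage to about 20% of the original implementation.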
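The idea of deriving splits directly from shared k-mers, without pairwise distances, can be illustrated with a toy sketch. This is not the authors' implementation: the function names are made up for illustration, everything is held in plain Python dictionaries, and the two sides of a split are combined with a geometric mean on the assumption that this mirrors the default of the `-m` option.

```python
from collections import defaultdict
from math import sqrt

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def kmer_set(sequence, k):
    """All canonical k-mers of one sequence."""
    return {canonical(sequence[i:i + k]) for i in range(len(sequence) - k + 1)}


def split_weights(genomes, k):
    """Map each split (stored as its smaller side) to a toy weight."""
    all_names = frozenset(genomes)
    # 1) For every k-mer, record which genomes contain it.
    owners = defaultdict(set)
    for name, sequence in genomes.items():
        for kmer in kmer_set(sequence, k):
            owners[kmer].add(name)
    # 2) Count k-mers per genome subset; k-mers found everywhere separate nothing.
    counts = defaultdict(int)
    for names in owners.values():
        side = frozenset(names)
        if side != all_names:
            counts[side] += 1
    # 3) Combine both sides of each split with a geometric mean (assumed default).
    weights = {}
    for side, count in counts.items():
        other = all_names - side
        key = min(side, other, key=lambda s: (len(s), sorted(s)))
        weights[key] = sqrt(count * counts.get(other, 0))
    return weights


toy = {"A": "AAAAC", "B": "AAAAC", "C": "GGGGT"}
print(split_weights(toy, k=3))  # {frozenset({'C'}): 2.0}
```

Each k-mer is visited once and credited to the set of genomes that contain it, so no genome pair is ever compared explicitly. A split with support on only one side gets weight 0 under a plain geometric mean, which is presumably the reason a pseudo-count variant is offered.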

Installation

GitLab

git clone https://gitlab.ub.uni-bielefeld.de/gi/sans.git
cd sans/
make

# dendropy is required for the additional analyses
conda install -c bioconda -c conda-forge dendropy -y

> ./SANS

SANS serif | version 2.0_10D
Usage: SANS [PARAMETERS]

  Input arguments:
    -i, --input     Input file: list of sequence files, one per line
    -g, --graph     Graph file: load a Bifrost graph, file name prefix
                    (requires compiler flag -DuseBF, please edit makefile)
    -s, --splits    Splits file: load an existing list of splits file
                    (allows to filter -t/-f, other arguments are ignored)
    (either --input and/or --graph, or --splits must be provided)

  Output arguments:
    -o, --output    Output TSV file: list of splits, sorted by weight desc.
    -N, --newick    Output Newick file
                    (only applicable in combination with -f strict or n-tree)
    (at least --output or --newick must be provided, or both)

  Optional arguments:
    -k, --kmer      Length of k-mers (default: 31)
    -t, --top       Number of splits in the output list (default: all)
    -m, --mean      Mean weight function to handle asymmetric splits
                    options: arith: arithmetic mean
                             geom:  geometric mean (default)
                             geom2: geometric mean with pseudo-counts
    -f, --filter    Output a greedy maximum weight subset
                    options: strict: compatible to a tree
                             weakly: weakly compatible network
                             n-tree: compatible to a union of n trees
                                     (where n is an arbitrary number)
    -x, --iupac     Extended IUPAC alphabet, resolve ambiguous bases
                    Specify a number to limit the k-mers per position
                    between 1 (no ambiguity) and 4^k (allows NNN...N)
    -n, --norev     Do not consider reverse complement k-mers
    -v, --verbose   Print information messages during execution
    -h, --help      Display this help page and quit

  Contact: sans-service@cebitec.uni-bielefeld.de
  Evaluation: https://www.surveymonkey.de/r/denbi-service?sc=bigi&tool=sans
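The `-f strict` option is described as a greedy maximum weight subset of splits that is compatible with a tree. A minimal sketch of that idea follows, using the standard compatibility criterion for splits: two splits fit on one tree if at least one of the four intersections of their sides is empty. The function names and the naive quadratic check are illustrative assumptions, not how SANS serif implements the filter.

```python
def compatible(a, b, taxa):
    """Two splits (given by one side each) are compatible if one of the
    four side-by-side intersections is empty."""
    a_rest, b_rest = taxa - a, taxa - b
    return not (a & b) or not (a & b_rest) or not (a_rest & b) or not (a_rest & b_rest)


def greedy_tree_filter(splits, taxa):
    """splits: iterable of (weight, side) pairs, where side is a frozenset of taxa.
    Walk the splits by decreasing weight and keep each one that is compatible
    with everything kept so far."""
    kept = []
    for weight, side in sorted(splits, key=lambda s: s[0], reverse=True):
        if all(compatible(side, other, taxa) for _, other in kept):
            kept.append((weight, side))
    return kept


taxa = frozenset("abcd")
splits = [
    (5.0, frozenset("ab")),  # ab | cd
    (4.0, frozenset("ac")),  # ac | bd, conflicts with ab | cd
    (3.0, frozenset("a")),   # a | bcd, fits any tree
]
print(greedy_tree_filter(splits, taxa))
```

In this toy input the split ac | bd is dropped because it conflicts with the heavier split ab | cd, while the trivial split a | bcd is kept.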

Test run

The repository ships with a Drosophila example dataset that can be downloaded and run.

cd sans/example_data/drosophila/
./download.sh
cd fa/

list.txt (the input list: one sequence file per line)
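For your own data, the list passed to `-i` is just a text file with one sequence file per line. A small helper like the following could generate it. The helper name and the file extensions are assumptions for illustration, and the example dataset already provides its own list.txt.

```python
from pathlib import Path


def write_input_list(directory, out_name="list.txt", patterns=("*.fa", "*.fasta", "*.fna")):
    """Write the names of all matching sequence files in `directory`,
    one per line, to `out_name` inside that directory."""
    directory = Path(directory)
    names = sorted({p.name for pattern in patterns for p in directory.glob(pattern)})
    (directory / out_name).write_text("".join(name + "\n" for name in names))
    return names


if __name__ == "__main__":
    # Hypothetical usage: list the FASTA files in the current directory.
    print(write_input_list("."))
```

The names are written relative to the directory, which matches running SANS from inside that directory as in the example run.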

Run

SANS -i list.txt -o sans_greedytree.splits -t 130 -f strict -N sans_greedytree.new -v

#compare to reference
../../../scripts/newick2sans.py ../Reference.new > Reference.splits
../../../scripts/comp.py sans_greedytree.splits Reference.splits list.txt

Reference.splits

Output

read taxa
read split file 1
read split file 2
found:
21 21
compute precision and recall:
#precision recall (unweighted)
1.0 0.047619047619047616
1.0 0.09523809523809523
1.0 0.14285714285714285
1.0 0.19047619047619047
1.0 0.23809523809523808
1.0 0.2857142857142857
1.0 0.3333333333333333
1.0 0.38095238095238093
1.0 0.42857142857142855
1.0 0.47619047619047616
1.0 0.5238095238095238
1.0 0.5714285714285714
1.0 0.6190476190476191
1.0 0.6666666666666666
1.0 0.7142857142857143
1.0 0.7619047619047619
1.0 0.8095238095238095
1.0 0.8571428571428571
1.0 0.9047619047619048
1.0 0.9523809523809523
1.0 1.0
1.0 1.0 unweighted
1.0 1.0 weighted
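The unweighted rows in this output look like a cumulative precision/recall curve over the splits ranked by weight: after each additional split, precision is the fraction of splits so far that occur in the reference, and recall is the fraction of reference splits recovered. The sketch below reproduces that calculation under this assumption. It is not the code of comp.py, and it expects both inputs to use the same representation for a split.

```python
def precision_recall_curve(predicted, reference):
    """predicted: splits ordered by decreasing weight.
    reference: collection of splits regarded as correct.
    Returns one (precision, recall) pair per rank."""
    reference = set(reference)
    hits = 0
    rows = []
    for rank, split in enumerate(predicted, start=1):
        if split in reference:
            hits += 1
        rows.append((hits / rank, hits / len(reference)))
    return rows


# 21 predicted splits that all occur in a 21-split reference,
# with integers standing in for real splits.
reference = set(range(21))
predicted = list(range(21))
for precision, recall in precision_recall_curve(predicted, reference):
    print(precision, recall)
```

With 21 correct splits out of 21, precision stays at 1.0 while recall climbs in steps of 1/21, starting at 0.047619047619047616 as in the first row above.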

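With -N, SANS also writes the tree in Newick format (sans_greedytree.new in the run above), which can be opened in SplitsTree and similar viewers. newick2sans.py goes the other way: it converts a Newick tree into a list of splits, as done for Reference.new in the comparison step.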

References

SANS serif: alignment-free, whole-genome based phylogenetic reconstruction
Andreas Rempel, Roland Wittler
bioRxiv, posted January 03, 2021

Alignment- and reference-free phylogenomics with colored de Bruijn graphs
Roland Wittler
Algorithms for Molecular Biology (2020) 15:4
