生のシークエンシングリードからスケーラブルな高精度の系統樹を生成する Read2Tree

2023/06/27 論文引用

　シーケンスのリードデータから系統樹を推定することは、生物学の基礎となるものである。しかし、最新の系統樹解析では、複雑なパイプラインを実行する必要があり、多大な計算コストと人件費がかかる上、シーケンスのカバレッジ、アセンブリ、アノテーションの質にも制約がある。このような課題を克服するために、本著者らはRead2treeを開発した。Read2treeは、シーケンスのリードを直接処理し、対応する遺伝子のグループに分割する。様々なデータセットを含むベンチマークにおいて、本アセンブリフリーアプローチは、従来のアプローチよりも10倍から100倍速く、ほとんどの場合、より正確だった（例外は、シーケンスカバレッジが高く、参照種が非常に遠い場合）。このツールの幅広い適用性を説明するために、5億9000万年の進化に及ぶ435種の酵母のtree of lifeを再構築した。コロナウイルス科のサンプルに適用したところ、Read2Treeは、非常に多様な動物サンプルとほぼ同一のSARSCoV2配列を1つのツリー上に正確に分類し、驚くべき幅と深さを示した。Read2Treeのスピード、精度、汎用性により、大規模な比較ゲノム解析が可能になる。

Read2Tree: an assembly-free comp. genomics method for small & large-scale projects. Works on @Pacbio @nanopore @illumina& RNA/DNA & low coverage 0.2x+. Get your phylo tree from raw reads in minutes! e.g #COVID19 https://t.co/M16N8aO8oY @ddylus @cdessimoz @BCM_HGSC #Bioinformatics pic.twitter.com/b3PxAfJi2b
— Fritz Sedlazeck (@sedlazeck) 2022年4月20日

インストール

condaで環境を作って依存を導入した (ubuntu18)。

read2tree was built and tested with python 3.5.1.

Github

mamba create -n read2tree python=3.7 -y
conda activate read2tree
#依存ツール
mamba install -c conda-forge biopython numpy Cython ete3 lxml tqdm scipy pyparsing requests natsort pyyaml -y
mamba install -c bioconda dendropy -y
mamba install -c bioconda mafft iqtree ngmlr nextgenmap samtools -y

git clone https://github.com/DessimozLab/pyham.git
python -m pip install -e ./pyham
git clone https://github.com/DessimozLab/pyoma.git
python -m pip install -e ./pyoma
#or
pip install pyham
pip install pyoma

#本体
git clone https://github.com/DessimozLab/read2tree.git
cd read2tree
python setup.py install

> read2tree -h

usage: read2tree [-h] [--version] [--threads THREADS]

[--standalone_path STANDALONE_PATH]

[--reads READS [READS ...]] [--read_type READ_TYPE]

[--split_reads] [--split_len SPLIT_LEN]

[--split_overlap SPLIT_OVERLAP]

[--split_min_read_len SPLIT_MIN_READ_LEN] [--sample_reads]

[--coverage COVERAGE] [--min_cons_coverage MIN_CONS_COVERAGE]

[--genome_len GENOME_LEN] [--output_path OUTPUT_PATH]

[--dna_reference DNA_REFERENCE]

[--ignore_species IGNORE_SPECIES]

[--sc_threshold SC_THRESHOLD]

[--remove_species_mapping REMOVE_SPECIES_MAPPING]

[--remove_species_ogs REMOVE_SPECIES_OGS]

[--ngmlr_parameters NGMLR_PARAMETERS] [--keep_all_ogs]

[--check_mate_pairing] [--debug]

[--sequence_selection_mode SEQUENCE_SELECTION_MODE]

[-s SPECIES_NAME] [--tree] [--merge_all_mappings] [-r]

[--min_species MIN_SPECIES] [--single_mapping SINGLE_MAPPING]

[--ref_folder REF_FOLDER]

read2tree is a pipeline allowing to use read data in combination with an OMA

standalone output run to produce high quality trees.

optional arguments:

-h, --help show this help message and exit

--version Show programme's version number and exit.

--threads THREADS [Default is 1] Number of threads for the mapping using

ngm / ngmlr!

--standalone_path STANDALONE_PATH

[Default is current directory] Path to oma standalone

directory.

--reads READS [READS ...]

[Default is none] Reads to be mapped to reference. If

paired end add separated by space.

--read_type READ_TYPE

[Default is short reads] Type of reads to use for

mapping. Either ngm for short reads or ngmlr for long

will be used.

--split_reads [Default is off] Splits reads as defined by split_len

(200) and split_overlap (0) parameters.

--split_len SPLIT_LEN

[Default is 200] Parameter for selection of read split

length can only be used in combinationwith with long

read option.

--split_overlap SPLIT_OVERLAP

[Default is 0] Reads are split with an overlap defined

by this argument.

--split_min_read_len SPLIT_MIN_READ_LEN

[Default is 200] Reads longer than this value are cut

into smaller values as defined by --split_len.

--sample_reads [Default is off] Splits reads as defined by split_len

(200) and split_overlap (0) parameters.

--coverage COVERAGE [Default is 10] coverage in X.

--min_cons_coverage MIN_CONS_COVERAGE

[Default is 1] Minimum number of nucleotides at

column.

--genome_len GENOME_LEN

[Default is 2000000] Genome size in bp.

--output_path OUTPUT_PATH

[Default is current directory] Path to output

directory.

--dna_reference DNA_REFERENCE

[Default is None] Reference file that contains

nucleotide sequences (fasta, hdf5). If not given it

will usethe RESTapi and retrieve sequences from

http://omabrowser.org directly. NOTE: internet

connection required!

--ignore_species IGNORE_SPECIES

[Default is none] Ignores species part of the OMA

standalone pipeline. Input is comma separated list

without spaces, e.g. XXX,YYY,AAA.

--sc_threshold SC_THRESHOLD

[Default is 0.25; Range 0-1] Parameter for selection

of sequences from mapping by completeness compared to

its reference sequence (number of ACGT basepairs vs

length of sequence). By default, all sequences are

selected.

--remove_species_mapping REMOVE_SPECIES_MAPPING

[Default is none] Remove species present in data set

after mapping step completed and only do analysis on

subset. Input is comma separated list without spaces,

e.g. XXX,YYY,AAA.

--remove_species_ogs REMOVE_SPECIES_OGS

[Default is none] Remove species present in data set

after mapping step completed to build OGs. Input is

comma separated list without spaces, e.g. XXX,YYY,AAA.

--ngmlr_parameters NGMLR_PARAMETERS

[Default is none] In case this parameters need to be

changed all 3 values have to be changed [x,subread-

length,R]. The standard is: ont,256,0.25.

Possibilities for these parameter can be found in the

original documentation of ngmlr.

--keep_all_ogs [Default is on] Keep all orthologs after addition of

mapped seq, which means also the OGs that have no

mapped sequence. Otherwise only OGs are used that have

the mapped sequence for alignment and tree inference.

--check_mate_pairing Check whether in case of paired end reads we have

consistent mate pairing. Setting this option will

automatically select the overlapping reads and do not

consider single reads.

--debug [Default is off] Changes to debug mode: * bam files

are saved!* reads are saved by mapping to OG

--sequence_selection_mode SEQUENCE_SELECTION_MODE

[Default is sc] Possibilities are cov and cov_sc for

mapped sequence.

-s SPECIES_NAME, --species_name SPECIES_NAME

[Default is name of read 1st file] Name of species for

mapped sequence.

--tree [Default is false] Compute tree, otherwise just output

concatenated alignment!

--merge_all_mappings [Default is off] In case multiple species were mapped

to the same reference this allows to merge this

mappings and build a tree with all included species!

-r, --reference [Default is off] Just generate the reference dataset

for mapping.

--min_species MIN_SPECIES

Min number of species in selected orthologous groups.

If not selected it will be estimated such that around

1000 OGs are available.

--single_mapping SINGLE_MAPPING

[Default is none] Single species file allowing to map

in a job array.

--ref_folder REF_FOLDER

[Default is none] Folder containing reference files

with sequences sorted by species.

テストラン

ランにはペアエンドfastqとリファレンスとするオーソロググループの遺伝子セットのディレクトリが必要。マーカー遺伝子はOMAブラウザから取得できる。

テストデータでは、Mnemiopsis leidyi、Xenopus laevis、Homo sapiens、Gorilla gorilla、Rattus norvegicusの5種をリファレンスにしている。OMAブラウザーを用いて、これら5種のマーカー遺伝子20個がリファレンスのオルソログ群としてダウンロードされ、test/mareker_genes フォルダーに配置されている（Githubより）。

test/marker_genes/

1つ開いてみる。

OMAGroup_638532.fa

テストデータの目的は、Musculusのシーケンスリードを用いて樹形を推定することになっている。ペアエンドfastqとマーカー遺伝子のディレクトリを指定してread2treeをランする。

cd tests
read2tree --tree --standalone_path marker_genes/ --reads sample_1.fastq sample_2.fastq --output_path output/

出力

output/

tree_sample_1.nwkがNewick formatのリーファイル。そのほかの出力についてはGithubで説明されています。

引用

Read2Tree: scalable and accurate phylogenetic trees from raw reads
David Dylus, Adrian M Altenhoff, Sina Majidian, Fritz J Sedlazeck, Christophe Dessimoz

bioRxiv, Posted April 19, 2022.

2023/06/27

Inference of phylogenetic trees directly from raw sequencing reads using Read2Tree
David Dylus, Adrian Altenhoff, Sina Majidian, Fritz J. Sedlazeck & Christophe Dessimoz

Nature Biotechnology (2023)