2023/06/27 論文引用
シーケンスのリードデータから系統樹を推定することは、生物学の基礎となるものである。しかし、最新の系統樹解析では、複雑なパイプラインを実行する必要があり、多大な計算コストと人件費がかかる上、シーケンスのカバレッジ、アセンブリ、アノテーションの質にも制約がある。このような課題を克服するために、本著者らはRead2treeを開発した。Read2treeは、シーケンスのリードを直接処理し、対応する遺伝子のグループに分割する。様々なデータセットを含むベンチマークにおいて、本アセンブリフリーアプローチは、従来のアプローチよりも10倍から100倍速く、ほとんどの場合、より正確だった(例外は、シーケンスカバレッジが高く、参照種が非常に遠い場合)。このツールの幅広い適用性を説明するために、5億9000万年の進化に及ぶ435種の酵母のtree of lifeを再構築した。コロナウイルス科のサンプルに適用したところ、Read2Treeは、非常に多様な動物サンプルとほぼ同一のSARSCoV2配列を1つのツリー上に正確に分類し、驚くべき幅と深さを示した。Read2Treeのスピード、精度、汎用性により、大規模な比較ゲノム解析が可能になる。
Read2Tree: an assembly-free comp. genomics method for small & large-scale projects. Works on @Pacbio @nanopore @illumina& RNA/DNA & low coverage 0.2x+. Get your phylo tree from raw reads in minutes! e.g #COVID19 https://t.co/M16N8aO8oY @ddylus @cdessimoz @BCM_HGSC #Bioinformatics pic.twitter.com/b3PxAfJi2b
— Fritz Sedlazeck (@sedlazeck) 2022年4月20日
インストール
condaで環境を作って依存を導入した (ubuntu18)。
- read2tree was built and tested with python 3.5.1.
mamba create -n read2tree python=3.7 -y
conda activate read2tree
#依存ツール
mamba install -c conda-forge biopython numpy Cython ete3 lxml tqdm scipy pyparsing requests natsort pyyaml -y
mamba install -c bioconda dendropy -y
mamba install -c bioconda mafft iqtree ngmlr nextgenmap samtools -y
git clone https://github.com/DessimozLab/pyham.git
python -m pip install -e ./pyham
git clone https://github.com/DessimozLab/pyoma.git
python -m pip install -e ./pyoma
#or
pip install pyham
pip install pyoma
#本体
git clone https://github.com/DessimozLab/read2tree.git
cd read2tree
python setup.py install
> read2tree -h
usage: read2tree [-h] [--version] [--threads THREADS]
[--standalone_path STANDALONE_PATH]
[--reads READS [READS ...]] [--read_type READ_TYPE]
[--split_reads] [--split_len SPLIT_LEN]
[--split_overlap SPLIT_OVERLAP]
[--split_min_read_len SPLIT_MIN_READ_LEN] [--sample_reads]
[--coverage COVERAGE] [--min_cons_coverage MIN_CONS_COVERAGE]
[--genome_len GENOME_LEN] [--output_path OUTPUT_PATH]
[--dna_reference DNA_REFERENCE]
[--ignore_species IGNORE_SPECIES]
[--sc_threshold SC_THRESHOLD]
[--remove_species_mapping REMOVE_SPECIES_MAPPING]
[--remove_species_ogs REMOVE_SPECIES_OGS]
[--ngmlr_parameters NGMLR_PARAMETERS] [--keep_all_ogs]
[--check_mate_pairing] [--debug]
[--sequence_selection_mode SEQUENCE_SELECTION_MODE]
[-s SPECIES_NAME] [--tree] [--merge_all_mappings] [-r]
[--min_species MIN_SPECIES] [--single_mapping SINGLE_MAPPING]
[--ref_folder REF_FOLDER]
read2tree is a pipeline allowing to use read data in combination with an OMA
standalone output run to produce high quality trees.
optional arguments:
-h, --help show this help message and exit
--version Show programme's version number and exit.
--threads THREADS [Default is 1] Number of threads for the mapping using
ngm / ngmlr!
--standalone_path STANDALONE_PATH
[Default is current directory] Path to oma standalone
directory.
--reads READS [READS ...]
[Default is none] Reads to be mapped to reference. If
paired end add separated by space.
--read_type READ_TYPE
[Default is short reads] Type of reads to use for
mapping. Either ngm for short reads or ngmlr for long
will be used.
--split_reads [Default is off] Splits reads as defined by split_len
(200) and split_overlap (0) parameters.
--split_len SPLIT_LEN
[Default is 200] Parameter for selection of read split
length can only be used in combinationwith with long
read option.
--split_overlap SPLIT_OVERLAP
[Default is 0] Reads are split with an overlap defined
by this argument.
--split_min_read_len SPLIT_MIN_READ_LEN
[Default is 200] Reads longer than this value are cut
into smaller values as defined by --split_len.
--sample_reads [Default is off] Splits reads as defined by split_len
(200) and split_overlap (0) parameters.
--coverage COVERAGE [Default is 10] coverage in X.
--min_cons_coverage MIN_CONS_COVERAGE
[Default is 1] Minimum number of nucleotides at
column.
--genome_len GENOME_LEN
[Default is 2000000] Genome size in bp.
--output_path OUTPUT_PATH
[Default is current directory] Path to output
directory.
--dna_reference DNA_REFERENCE
[Default is None] Reference file that contains
nucleotide sequences (fasta, hdf5). If not given it
will usethe RESTapi and retrieve sequences from
http://omabrowser.org directly. NOTE: internet
connection required!
--ignore_species IGNORE_SPECIES
[Default is none] Ignores species part of the OMA
standalone pipeline. Input is comma separated list
without spaces, e.g. XXX,YYY,AAA.
--sc_threshold SC_THRESHOLD
[Default is 0.25; Range 0-1] Parameter for selection
of sequences from mapping by completeness compared to
its reference sequence (number of ACGT basepairs vs
length of sequence). By default, all sequences are
selected.
--remove_species_mapping REMOVE_SPECIES_MAPPING
[Default is none] Remove species present in data set
after mapping step completed and only do analysis on
subset. Input is comma separated list without spaces,
e.g. XXX,YYY,AAA.
--remove_species_ogs REMOVE_SPECIES_OGS
[Default is none] Remove species present in data set
after mapping step completed to build OGs. Input is
comma separated list without spaces, e.g. XXX,YYY,AAA.
--ngmlr_parameters NGMLR_PARAMETERS
[Default is none] In case this parameters need to be
changed all 3 values have to be changed [x,subread-
length,R]. The standard is: ont,256,0.25.
Possibilities for these parameter can be found in the
original documentation of ngmlr.
--keep_all_ogs [Default is on] Keep all orthologs after addition of
mapped seq, which means also the OGs that have no
mapped sequence. Otherwise only OGs are used that have
the mapped sequence for alignment and tree inference.
--check_mate_pairing Check whether in case of paired end reads we have
consistent mate pairing. Setting this option will
automatically select the overlapping reads and do not
consider single reads.
--debug [Default is off] Changes to debug mode: * bam files
are saved!* reads are saved by mapping to OG
--sequence_selection_mode SEQUENCE_SELECTION_MODE
[Default is sc] Possibilities are cov and cov_sc for
mapped sequence.
-s SPECIES_NAME, --species_name SPECIES_NAME
[Default is name of read 1st file] Name of species for
mapped sequence.
--tree [Default is false] Compute tree, otherwise just output
concatenated alignment!
--merge_all_mappings [Default is off] In case multiple species were mapped
to the same reference this allows to merge this
mappings and build a tree with all included species!
-r, --reference [Default is off] Just generate the reference dataset
for mapping.
--min_species MIN_SPECIES
Min number of species in selected orthologous groups.
If not selected it will be estimated such that around
1000 OGs are available.
--single_mapping SINGLE_MAPPING
[Default is none] Single species file allowing to map
in a job array.
--ref_folder REF_FOLDER
[Default is none] Folder containing reference files
with sequences sorted by species.
read2tree (C) 2017-2022 David Dylus
テストラン
ランにはペアエンドfastqとリファレンスとするオーソロググループの遺伝子セットのディレクトリが必要。マーカー遺伝子はOMAブラウザから取得できる。
テストデータでは、Mnemiopsis leidyi、Xenopus laevis、Homo sapiens、Gorilla gorilla、Rattus norvegicusの5種をリファレンスにしている。OMAブラウザーを用いて、これら5種のマーカー遺伝子20個がリファレンスのオルソログ群としてダウンロードされ、test/mareker_genes フォルダーに配置されている(Githubより)。
test/marker_genes/
1つ開いてみる。
OMAGroup_638532.fa
テストデータの目的は、Musculusのシーケンスリードを用いて樹形を推定することになっている。ペアエンドfastqとマーカー遺伝子のディレクトリを指定してread2treeをランする。
cd tests
read2tree --tree --standalone_path marker_genes/ --reads sample_1.fastq sample_2.fastq --output_path output/
出力
output/
tree_sample_1.nwkがNewick formatのリーファイル。そのほかの出力についてはGithubで説明されています。
引用
Read2Tree: scalable and accurate phylogenetic trees from raw reads
David Dylus, Adrian M Altenhoff, Sina Majidian, Fritz J Sedlazeck, Christophe Dessimoz
bioRxiv, Posted April 19, 2022.
2023/06/27
Inference of phylogenetic trees directly from raw sequencing reads using Read2Tree
David Dylus, Adrian Altenhoff, Sina Majidian, Fritz J. Sedlazeck & Christophe Dessimoz
Nature Biotechnology (2023)
関連