Mashtree: rapid comparison of whole genome sequence files

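Over the past decade, the number of publicly available bacterial genomes has increased dramatically. Genomes are sequenced, shared publicly, and then analyzed for phylogenetic relatedness. If two genomes of epidemiological interest are found to be related, further investigation might be prompted. However, comparing large numbers of genomes for phylogenetic relatedness is computationally expensive and, at scale, laborious. Consequently, there are many strategies for reducing the complexity of the data before downstream analysis.

One major kmer strategy is to reduce each genome to split kmers. Split-kmer analysis records the kmers on both sides of a variant site and identifies the nucleotide that varies between them. When two or more genomes are compared, it is these variant sites that are compared. Split kmers are implemented in software packages such as kSNP and SKA (Gardner, Slezak, & Hall, 2015; Harris, 2018).

The other major kmer strategy converts genomic data into a manageable dataset, usually called a sketch (Baker & Langmead, 2019; Ondov et al., 2016; Zhao, 2018). The most notable example is the min-hash algorithm implemented in the Mash package. In the min-hash algorithm, all kmers are recorded and transformed into integers using hashing and a Bloom filter (Bloom, 1970). These hashed kmers are sorted, and only the first several are retained. The kmers at the top of the sorted list are collectively called the sketch. Two sketches are compared by counting the hashed kmers they have in common. Because min-hash yields a distance between any two genomes, those distances can be fed to the neighbor-joining algorithm (Saitou & Nei, 1987) to cluster genomes into a tree quickly. This idea was implemented in software called Mashtree, which rapidly and efficiently generates large trees that would be too computationally intensive with other methods.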
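The sketching and comparison steps described above can be illustrated with a minimal Python sketch. It is not Mash's implementation: the function names are hypothetical, BLAKE2b stands in for the hash function Mash actually uses, the Bloom filter and reverse-complement handling are omitted, and the distance formula is the one given by Ondov et al. (2016).

```python
import hashlib
import math


def kmers(seq, k):
    """Yield every substring of length k in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def hash_kmer(kmer):
    """Map a kmer to a 64-bit integer (assumption: BLAKE2b as a stand-in hash)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def sketch(seq, k=21, size=10000):
    """Return the `size` smallest distinct kmer hashes, in sorted order."""
    hashes = {hash_kmer(km) for km in kmers(seq.upper(), k)}
    return sorted(hashes)[:size]


def jaccard_estimate(sketch_a, sketch_b, size):
    """Estimate the Jaccard index from the `size` smallest hashes of both sketches."""
    merged = sorted(set(sketch_a) | set(sketch_b))[:size]
    if not merged:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in merged if h in shared) / len(merged)


def mash_distance(j, k):
    """Mash distance D = -(1/k) * ln(2j / (1 + j)); defined as 1.0 when j is 0."""
    if j <= 0:
        return 1.0
    return max(0.0, -math.log(2 * j / (1 + j)) / k)
```

Calling `sketch()` on two sequences with the same `k` and `size`, then passing the result of `jaccard_estimate()` to `mash_distance()`, gives one pairwise distance. The defaults of 21 and 10000 mirror the `--kmerlength` and `--sketch-size` defaults shown in the help output.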

Documentation (Markdown)

https://github.com/lskatz/mashtree/tree/master/docs

Installation

Installed with conda into a Python 2.7 virtual environment.

GitHub: https://github.com/lskatz/mashtree

#conda
mamba install -c bioconda -y mashtree

> mashtree

mashtree: main::main: need more arguments
mashtree: use distances from Mash (min-hash algorithm) to make a NJ tree
  Usage: mashtree [options] *.fastq *.fasta *.gbk *.msh > tree.dnd
  NOTE: fastq files are read as raw reads;
        fasta, gbk, and embl files are read as assemblies;
        Input files can be gzipped.
  --tempdir            ''   If specified, this directory will not be
                            removed at the end of the script and can
                            be used to cache results for future
                            analyses.
                            If not specified, a dir will be made for you
                            and then deleted at the end of this script.
  --numcpus            1    This script uses Perl threads.
  --outmatrix          ''   If specified, will write a distance matrix
                            in tab-delimited format
  --file-of-files           If specified, mashtree will try to read
                            filenames from each input file. The file of
                            files format is one filename per line. This
                            file of files cannot be compressed.
  --outtree                 If specified, the tree will be written to
                            this file and not to stdout. Log messages
                            will still go to stderr.
  --version                 Display the version and exit
  --citation                Display the preferred citation and exit

  TREE OPTIONS
  --truncLength        250  How many characters to keep in a filename
  --sort-order         ABC  For neighbor-joining, the sort order can
                            make a difference. Options include:
                            ABC (alphabetical), random, input-order

  MASH SKETCH OPTIONS
  --genomesize         5000000
  --mindepth           5    If mindepth is zero, then it will be
                            chosen in a smart but slower method,
                            to discard lower-abundance kmers.
  --kmerlength         21
  --sketch-size        10000
  --seed               42   Seed for mash sketch
  --save-sketches      ''   If a directory is supplied, then sketches
                            will be saved in it.
                            If no directory is supplied, then sketches
                            will be saved alongside source files.

Usage

Specify FASTA or FASTQ files as input. Compressed files (gz, bz2, zip) are also accepted.

mashtree --numcpus 12 *.fasta > tree.dnd

--numcpus This script uses Perl threads.
--outmatrix If specified, will write a distance matrix in tab-delimited format
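When the run is driven from a script rather than an interactive shell, the same command can be assembled in Python. The following is a minimal sketch, assuming `mashtree` is on the PATH. The helper names are hypothetical, and only flags listed in the help output above are used. Because the shell is not involved, the `*.fasta` pattern is expanded with `glob` instead.

```python
import glob
import subprocess


def build_mashtree_command(inputs, numcpus=1, outmatrix=None, outtree=None):
    """Return the argument list for a mashtree run; nothing is executed here."""
    cmd = ["mashtree", "--numcpus", str(numcpus)]
    if outmatrix:
        cmd += ["--outmatrix", outmatrix]
    if outtree:
        cmd += ["--outtree", outtree]
    return cmd + list(inputs)


def run_mashtree(pattern="*.fasta", **kwargs):
    """Expand the file pattern in Python, run mashtree, and return its stdout."""
    inputs = sorted(glob.glob(pattern))
    if not inputs:
        raise FileNotFoundError(f"no files match {pattern!r}")
    cmd = build_mashtree_command(inputs, **kwargs)
    # Without --outtree, the Newick tree is written to stdout.
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout
```

For example, `run_mashtree("*.fasta", numcpus=12, outmatrix="dist.tsv")` would return the Newick text and leave the distance matrix in `dist.tsv` (a file name chosen here for illustration).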

FASTQ files are interpreted as raw read files. FASTA, GenBank, and EMBL files are interpreted as genome assemblies.

A Newick (.dnd) file is produced as output. If --outmatrix is specified, a distance matrix is written as well.
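To show how the two outputs relate, here is a minimal Python sketch that reads a tab-delimited distance matrix and builds a Newick string with the textbook neighbor-joining procedure (Saitou & Nei, 1987). It is not Mashtree's own code. The function names are hypothetical, and the matrix layout (a header row whose first cell is a placeholder, then one row per genome) is an assumption that should be checked against a real `--outmatrix` file.

```python
def parse_matrix(text):
    """Parse a square tab-delimited matrix: header row, then name + distances."""
    rows = [line.split("\t") for line in text.splitlines() if line.strip()]
    names = rows[0][1:]  # assumption: the first header cell is a placeholder
    dist = {row[0]: dict(zip(names, map(float, row[1:]))) for row in rows[1:]}
    return names, dist


def neighbor_joining(names, dist):
    """Return a Newick string for a symmetric dist[a][b]; needs >= 2 unique names."""
    d = {a: dict(dist[a]) for a in names}  # copy so the caller's data is untouched
    nodes = list(names)
    while len(nodes) > 2:
        n = len(nodes)
        # r[a] is the total distance from a to every other active node.
        r = {a: sum(d[a][b] for b in nodes if b != a) for a in nodes}
        # Join the pair that minimises Q(a, b) = (n - 2) * d(a, b) - r(a) - r(b).
        a, b = min(
            ((x, y) for i, x in enumerate(nodes) for y in nodes[i + 1:]),
            key=lambda p: (n - 2) * d[p[0]][p[1]] - r[p[0]] - r[p[1]],
        )
        len_a = 0.5 * d[a][b] + (r[a] - r[b]) / (2 * (n - 2))
        len_b = d[a][b] - len_a
        new = f"({a}:{len_a:.5f},{b}:{len_b:.5f})"
        d[new] = {}
        for c in nodes:
            if c not in (a, b):
                # Distance from the new internal node to each remaining node.
                d[new][c] = d[c][new] = 0.5 * (d[a][c] + d[b][c] - d[a][b])
        nodes = [c for c in nodes if c not in (a, b)] + [new]
    a, b = nodes
    # The last two nodes are joined; the whole remaining distance goes on one branch.
    return f"({a}:{d[a][b]:.5f},{b}:0.00000);"
```

Internal nodes are keyed by their Newick text, which keeps the example short but assumes the genome names are unique and contain no Newick punctuation. Ties in Q are broken by input order, which is one reason the order of genomes can change the resulting tree.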

Citation

Mashtree: a rapid comparison of whole genome sequence files
Katz, Lee; Griswold, Taylor; Morrison, Shatavia; Caravas, Jason; Zhang, Shaokang; Bakker, Henk; Deng, Xiangyu; Carleton, Heather

Journal of Open Source Software, vol. 4, issue 44, id. 1762. December 2019