Sourmashのgatherコマンド - macでインフォマティクス

Sourmash helpより

　Sourmashのサブコマンド `gather` は、メタゲノム解析で最適なリファレンスゲノム（のシグネチャファイル）をリファレンスゲノムデータベースから選択する。 k-merは非常に特異的なので、Sourmash gather は、過去にシークエンシングされたことのある環境や、多くのゲノムが単離されシークエンシングされたことのある環境からのメタゲノムに最も適している（引用2）。

チュートリアル

https://taylorreiter.github.io/2022-07-28-From-raw-metagenome-reads-to-phyloseq-taxonomy-table-using-sourmash-gather-and-sourmash-taxonomy/

インストール

本体　Github

mamba install -c bioconda sourmash -y

> sourmash -v

sourmash 4.3.0

$ sourmash

== This is sourmash version 4.3.0. ==

== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

Compute, compare, manipulate, and analyze MinHash sketches of DNA sequences.

Usage instructions:

Basic operations

sourmash compare --help compare sequence signatures made by compute

sourmash compute --help compute sequence signatures for inputs

sourmash gather --help search a metagenome signature against dbs

sourmash index --help index signatures for rapid search

sourmash info --help display sourmash version and other information

sourmash plot --help plot distance matrix made by 'compare'

sourmash prefetch --help search a signature against dbs, find all overlaps

sourmash search --help search a signature against other signatures

Taxonomic operations

sourmash lca --help

Manipulate signature files

sourmash sig --help

sourmash signature --help

Create signatures

sourmash sketch --help

Operations on storage

sourmash storage --help

Integrate taxonomy information based on "gather" results

sourmash tax --help

Options:

-h, --help show this help message and exit

-v, --version show program's version number and exit

-q, --quiet don't print citation information

> sourmash gather

usage:

The `gather` subcommand selects the best reference genomes to use for

a metagenome analysis, by finding the smallest set of non-overlapping

matches to the query in a database. This is specifically meant for

metagenome and genome bin analysis. (See "Classifying Signatures" [1]

in the command line documentation for more information on the

different approaches that can be used here.)

If the input signature was created with `-p abund`, output

will be abundance weighted (unless `--ignore-abundances` is

specified). `-o/--output` will create a CSV file containing the

matches.

`gather`, like `search`, will load all of provided signatures into

memory. You can use `sourmash index` to create a Sequence Bloom Tree

(SBT) that can be quickly searched on disk; this is the same format in

which we provide GenBank and other databases.

Command line usage:

```

sourmash gather query.sig [ list of signatures or SBTs ]

```

Example output:

```

overlap p_query p_match

--------- ------- --------

1.4 Mbp 11.0% 58.0% JANA01000001.1 Fusobacterium sp. OBRC...

1.0 Mbp 7.7% 25.9% CP001957.1 Haloferax volcanii DS2 pla...

0.9 Mbp 7.4% 11.8% BA000019.2 Nostoc sp. PCC 7120 DNA, c...

0.7 Mbp 5.9% 23.0% FOVK01000036.1 Proteiniclasticum rumi...

0.7 Mbp 5.3% 17.6% AE017285.1 Desulfovibrio vulgaris sub...

```

The command line option `--threshold-bp` sets the threshold below

which matches are no longer reported; by default, this is set to

50kb. see the Appendix in Classifying Signatures [1] for details.

Note:

Use `sourmash gather` to classify a metagenome against a collection of

genomes with no (or incomplete) taxonomic information. Use `sourmash

lca summarize` to classify a metagenome using a collection of genomes

with taxonomic information.

[1] https://sourmash.readthedocs.io/en/latest/classifying-signatures.html

---

gather: error: the following arguments are required: query, databases

チュートリアル

上にリンクを張ったチュートリアルでは、iHMP IBD cohortのstool sampleを使用している（サンプリングした一定期間後に症状が出た個人と出なかった個人のサンプル）。

１、それぞれのペアエンドfastqからMinHash sketchの作成。

for infile in *_1.fastq.gz
do
    bn=$(basename ${infile} _1.fastq.gz)
    sourmash sketch dna -p k=21,k=31,k=51,scaled=1000,abund --merge ${bn} -o ${bn}.sig ${infile} ${bn}_2.fastq.gz
done

２、sourmashデータベースとして、 GTDB rs207 representatives databaseをダウンロードする。チュートリアルに書いてある通り、rs207 fullやGenBankなど使うことでより多くのサンプル配列にマッチさせることができる。

wget -O gtdb-rs207.genomic-reps.dna.k31.zip https://osf.io/3a6gn/download

gtdb-rs207.genomic-reps.dna.k31.zip（1.8GB）がダウンロードされる。

３、sourmash gatherを実行して、メタゲノムに含まれる既知のk-merをすべてカバーするデータベースの最小限のメタゲノムを求める。

for infile in *_1.fastq.gz
do
    bn=$(basename ${infile} _1.fastq.gz)
    sourmash sketch dna -p k=21,k=31,k=51,scaled=1000,abund --merge ${bn} -o ${bn}.sig ${infile} ${bn}_2.fastq.gz
done

４、sourmash taxonomy を使ってtaxonomyに変換する。

wget -O gtdb-rs207.taxonomy.csv.gz https://osf.io/v3zmg/download
gunzip gtdb-rs207.taxonomy.csv.gz
sourmash tax prepare -t gtdb-rs207.taxonomy.csv -o gtdb-rs207.taxonomy.sqldb -F sql

for infile in *_gather_gtdbrs207_reps.csv;
do
 sourmash tax annotate -g ${infile} -t gtdb-rs207.taxonomy.sqldb;
done

出力例