macでインフォマティクス

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

メタゲノムのリードカバレッジ とrelative abundanceの計算ツール coverM

2021 8/5追記

2021 9/6 追記

 

Githubより

CoverMは、メタゲノミクスアプリケーションに特化した、設定可能で使いやすく、高速なDNAリードカバレッジおよび相対的な存在比の計算ツールである。ゲノム/MAGのカバレッジをまたは個々のコンティグのカバレッジを計算する。入力は、リファレンスごとにソートされたBAMファイル、または様々なフォーマットの生のリードとリファレンスゲノムのいずれかである。

 

HP

https://wwood.github.io/CoverM/coverm-genome.html#name

 

インストール

ubuntu18.04LTSでテストした(linuxマシン使用)。

依存

  • samtools v1.9
  • tee, which is installed by default on most Linux operating systems.
  • man, which is installed by default on most Linux operating systems.

some mapping software:

  • minimap2 v2.17-r941 (2.18 will not work)
  • bwa-mem2 v2.0

For dereplication:

  • Dashing v0.4.0
  • FastANI v1.3

Github

#bioconda (link)
mamba install -c bioconda -y coverm

#cargoにも対応
cargo install coverm

coverm

$ coverm

 

Mapping coverage analysis for metagenomics

 

Usage: coverm <subcommand> ...

 

Main subcommands:

contig Calculate coverage of contigs

genome Calculate coverage of genomes

 

Less used utility subcommands:

make Generate BAM files through alignment

filter Remove (or only keep) alignments with insufficient identity

cluster Dereplicate and cluster genomes

shell-completion

Generate shell completion scripts

 

Other options:

-V, --version Print version information

 

Ben J. Woodcroft <benjwoodcroft near gmail.com>

> coverm contig

$ coverm contig

error: The following required arguments were not provided:

--coupled <coupled>...

--interleaved <interleaved>...

-1 <read1>...

-2 <read2>...

--reference <reference>...

--single <single>...

 

USAGE:

coverm contig --contig-end-exclusion <contig-end-exclusion> --coupled <coupled>... --interleaved <interleaved>... --mapper <mapper> --methods <methods>... --min-covered-fraction <min-covered-fraction> --output-format <output-format> -1 <read1>... -2 <read2>... --reference <reference>... --single <single>... --threads <threads> --trim-max <trim-max> --trim-min <trim-min>

 

For more information try --help

(base) kazu@kazu:/media/kazu/optane/TY1$ coverm contig -h

 

coverm contig

Calculate coverage of individual contigs

 

Example: Calculate mean coverage from reads and assembly:

 

coverm contig --coupled read1.fastq.gz read2.fastq.gz --reference assembly.fna

 

Example: Calculate MetaBAT adjusted coverage from a sorted BAM file, saving

the unfiltered BAM files in the saved_bam_files folder:

 

coverm contig --method metabat --bam-files my.bam

--bam-file-cache-directory saved_bam_files

 

See coverm contig --full-help for further options and further detail.

> coverm genome -h

$ coverm genome -h

 

coverm genome

Calculate coverage of individual genomes

 

Example: Map paired reads to 2 genomes, and output relative abundances

to output.tsv:

 

coverm genome --coupled read1.fastq.gz read2.fastq.gz

--genome-fasta-files genome1.fna genome2.fna -o output.tsv

 

Example: Calculate coverage of genomes defined as .fna files in

genomes_directory/ from a sorted BAM file:

 

coverm genome --bam-files my.bam --genome-fasta-directory genomes_directory/

 

Example: Dereplicate genomes at 99% ANI before mapping unpaired reads:

 

coverm genome --genome-fasta-directory genomes/ --dereplicate

--single single_reads.fq.gz

 

See coverm genome --full-help for further options and further detail.

 

>  coverm bam -h

 

$ coverm bam -h

error: Found argument 'bam' which wasn't expected, or isn't valid in this context

 

USAGE:

coverm [FLAGS] [SUBCOMMAND]

 

For more information try --help

 

> coverm filter -h

 

$ coverm filter -h

 

coverm filter

Filter BAM file alignments

 

Example: Filter a BAM file by removing alignments shorter than 50bp:

 

coverm filter --bam-files input.bam --output-bam filtered.bam

--min-read-aligned-length 50

 

Example: Filter inverse: Keep alignments that have <95% alignment identity

and those which do map at all. Note that the output BAM file will likely

records that are still mapped, but align with < 95% identity. Use 16

threads for output compression:

 

coverm filter -b input.bam -o inverse_filtered.bam --inverse

--min-read-percent-identity 95 --threads 16

 

See coverm filter --full-help for further options and further detail.

> coverm cluster -h

$ coverm cluster -h

 

coverm cluster

Cluster (dereplicate) genomes

 

Example: Dereplicate at 99% (after pre-clustering at 95%) a directory of .fna

FASTA files and create a new directory of symlinked FASTA files of

epresentatives:

 

coverm cluster --genome-fasta-directory input_genomes/ 

--output-representative-fasta-directory output_directory/

 

Example: Dereplicate a set of genomes with paths specified in genomes.txt at

95% ANI, after a preclustering at 90% using the MinHash finch method, and

output the cluster definition to clusters.tsv:

 

coverm cluster --ani 95 --precluster-ani 90 --precluster-method finch

--genome-fasta-list genomes.txt 

--output-cluster-definition clusters.tsv

 

See coverm cluster --full-help for further options and further detail.

--full-helpをつけてコマンドを実行するとオプションの詳細が確認できる。 

 

 

実行方法

coverm genome ゲノム単位のカバレッジ(ゲノムモード)の計算

ゲノムモードの計算は、コンティグ単位のカバレッジ(コンティグモード)の計算と似ているが、報告の単位がコンティグ単位ではなくゲノム単位である点が異なる。カバレッジに基づいて塩基の位置を除外する計算方法では、すべてのコンティグからのすべての位置が一緒に考慮されます。例えば、2,000,000bpのコンティグがあり、そのコンティグの全ての位置のカバレッジが1Xである場合、2000bpの全ての位置がカバレッジでソートされた位置の上位5%に入るため、trimmed_meanは0となります。

bamファイルとゲノム配列のディレクトリを指定する。fastaの拡張子が.fnaなら”-x fna”とする。

#relative_abundance
coverm genome --bam-files input.bam --genome-fasta-directory bin_dir -x fna -m relative_abundance > out.tsv

#mean
coverm genome --bam-filesinput.bam --genome-fasta-directory bin_dir -x fna -m mean > out.tsv

#TPM
coverm genome --bam-filesinput.bam --genome-fasta-directory bin_dir -x fna -m tpm > out.tsv

#read count
coverm genome --bam-filesinput.bam --genome-fasta-directory bin_dir -x fna -m count --min-covered-fraction > out.tsv
  • -f, --genome-fasta-files  Path(s) to FASTA files of each genome

  • -d, --genome-fasta-directory  Directory containing FASTA files of each genome.

  • -x, --genome-fasta-extension  File extension of genomes in the directory specified with -d/--genome-fasta-directory. [default: fna]

  • --single-genome   All contigs are from the same genome. Requires --reference. [default: not set]
  • -r, --reference   FASTA file of contigs e.g. concatenated genomes or metagenome assembly, or minimap2 index (with --minimap2-reference-is-index), or BWA index stem (with -p bwa-mem). If multiple references FASTA files are provided and
    --sharded is specified, then reads will be mapped to references separately as sharded BAMs. NOTE: If genomic FASTA files are specified elsewhere (e.g. with --genome-fasta-files or --genome-fasta-directory), then --reference
    is not needed as a reference FASTA file can be derived by concatenating input genomes. However, while not necessary, --reference can be specified if an alternate reference sequence set is desired.
  • -m, --methods   Method(s) for calculating coverage [default: relative_abundance]. A more thorough description of the different methods is available at https://github.com/wwood/CoverM#calculation-methods

 

複数の定量方法がサポートされている。デフォルトではrelative_abundanceとなっている。"-m"オプションを使って変更できる。

f:id:kazumaxneo:20210801223650p:plain

(HPより)

 

 fastqを指定

coverm genome --coupled in_R1.fq.gz in_R2.fq.gz --genome-fasta-directory genome_dir -x fna -o out.tsv

 

 

coverm filter  identityが不十分なアラインメントを削除する

coverm filter -b input.bam -o filtered.bam --inverse --min-read-percent-identity 95 --threads 16

 

coverm contig  コンティグ単位のカバレッジの計算

coverm genome --bam-files input.bam --genome-fasta-files ref.fasta

 

 

coverm make  アライメントによるBAMファイルの生成

coverm cluster  ゲノムのdereplicationとクラスター化

引用

GitHub - wwood/CoverM: Read coverage calculator for metagenomics