2023-12-14

マイクロバイオーム関連に特化したsamtoolsの拡張 msamtools

msamtoolsは、マイクロバイオームデータ解析、特にショットガンメタゲノミクスやメタトランスクリプトミクスデータを解析する際によく使われる便利な機能を提供している。既にいくつかの論文で使用されている。

インストール

M1 macstudioでテストした（rosetta2エミュレーション使用）。

依存

samtools v1.9

on any flavor of linux and UNIX
should also work on macOS (not tested)

Github

#bioconda (link)
#既にsamtoolsを導入しているなら、衝突を避けるために別の環境に導入する
mamba create -n msamtools -y
conda activate msamtools
mamba install -c bioconda msamtools -y

> msamtools help

$ msamtools

Program: msamtools (Metagenomics-related extension to samtools)

Version: 1.1.3 (using samtools/htslib 1.9)

Usage: msamtools <command> [options]

Commands:

-- Filtering

filter filter alignments based on alignment statistics

-- Profiling

profile estimate relative abundance profile of reference sequences or genomes in bam file

-- Coverage

coverage estimate per-base or per-sequence read coverage of each reference sequence

-- Summary

summary summarize alignment statistics per read in a table format

> msamtools filter

$ msamtools filter

Usage:

------

msamtools filter [-buhSkv] <bamfile> [--help] [-l <int>] [-p <int>] [--ppt=<int>] [-z <int>] [--rescore] [--besthit] [--uniqhit]

General options:

----------------

These options specify the input/output formats of BAM/SAM files

(same meaning as in 'samtools view'):

-b output BAM (default: false)

-u uncompressed BAM output (force -b) (default: false)

-h print header for the SAM output (default: false)

-S input is SAM (default: false)

<bamfile> input SAM/BAM file

--help print this help and exit

Specific options:

-----------------

-l <int> min. length of alignment (default: 0)

-p <int> min. sequence identity of alignment, in percentage, integer between 0 and 100; requires NM field to be present (default: 0)

--ppt=<int> min/max sequence identity of alignment, in parts per thousand, integer between -1000 and 1000; requires NM field to be present (default: 0)

NOTE:

-----

When using --ppt, +ve values mean minimum ppt and -ve values mean maximum ppt.

E.g., '--ppt 950' will report alignments with ppt>950,

and '--ppt -950' will report alignments with ppt<=950.

-z <int> min. percent of the query that must be aligned, between 0 and 100 (default: 0)

-k, --keep_unmapped report unmapped reads, when filtering using upper-limit thresholds (default: false)

-v, --invert invert the effect of the filter (default: false)

CAUTION:

--------

When using --invert or -v, be precise in what needs to be inverted.

Adding -v gives you the complement of what you get without -v.

Sometimes, this might be counter-intuitive.

E.g., '-l 65 -p 95' will report alignments that are (>65bp AND >95%).

'-l 65 -p 95 -v' will not report (<65bp AND <95%), as one might think.

'-l 65 -p 95 -v' will report NOT(>65bp AND >95%) which is (<65bp OR <95%).

Notice the 'OR' in the final logical operation. This means that

an alignment that fails one condition will still be reported if it

satisfies the other condition.

If you only wanted alignments that are below 95%, then specify '-p 95 -v'

--rescore rescore alignments using MD or NM fields, in that order (default: false)

Special filters:

----------------

The following special filters cannot be combined with -v, but require:

(1) the alignments to be sorted by name,

(2) AS field (alignment score) to be present.

You can usually achieve sorting by:

samtools sort -n -T tmp.sort input.bam | msamtools -m filter --besthit -

If AS is missing, you can rescore alignments by:

samtools sort -n -T tmp.sort input.bam | msamtools -m filter --rescore --besthit -

--besthit keep all highest scoring hit(s) per read (default: false)

--uniqhit keep only one highest scoring hit per read, only if it is unique (default: false)

> msamtools coverage

$ msamtools coverage

Usage:

------

msamtools coverage [-Sxz] <bamfile> [--help] -o <file> [--summary] [-w <int>]

General options:

----------------

These options specify the input/output formats of BAM/SAM files

(same meaning as in 'samtools view'):

-S input is SAM (default: false)

<bamfile> input SAM/BAM file

--help print this help and exit

Specific options:

-----------------

-o <file> name of output file (required)

--summary do not report per-position coverage but report fraction of sequence covered (default: false)

-x, --skipuncovered do not report coverage for sequences without aligned reads (default: false)

-w, --wordsize=<int> number of words (coverage values) per line (default: 17)

-z, --gzip compress output file using gzip (default: true)

Description:

------------

Produces per-position sequence coverage information for all reference sequences

in the BAM file. Output is similar to old-style quality files from the Sanger

sequencing era, with a fasta-style header followed by lines of space-delimited

numbers.

For large datasets, option '-x' comes in handy when only a small fraction of

reference sequences are covered.

If using '-z', output file does NOT automatically get '.gz' extension. This is

up to the user to specify the correct full output file name.

> msamtools summary

$ msamtools summary

Usage:

------

msamtools summary [-Sc] <bamfile> [--help] [-e <num>] [--stats=<string>]

General options:

----------------

These options specify the input/output formats of BAM/SAM files

(same meaning as in 'samtools view'):

-S input is SAM (default: false)

<bamfile> input SAM/BAM file

--help print this help and exit

Specific options:

-----------------

-e, --edge=<num> ignore alignment if reads map to <num> bases at the edge of target sequence (default: 0)

-c, --count count number of unique inserts in BAM file (default: false)

--stats=<string> {mapped|unmapped|edit|score} only report readcount distribution for specified stats, not read-level stats (default: none)

Description

-----------

Prints summary of alignments in the given BAM/SAM file. By default, it prints

a summary line per alignment entry in the file. The summary is a tab-delimited

line with the following fields:

qname,aligned_qlen,target_name,glocal_align_len,matches,percent_identity

glocal_align_len includes the unaligned qlen mimicing a global alignment

in the query and local alignment in target, thus glocal.

With --stats option, summary is consolidated as distribution of read counts

for a given measure.

--stats=mapped - distribution for number of mapped query bases

--stats=unmapped - distribution for number of unmapped query bases

--stats=edit - distribution for edit distances

--stats=score - distribution for score=match-edit

使用例

現在４つのサブコマンドがある。それぞれについて、レポジトリでの利用例を紹介する。

1、msamtools filter - アライメントとフィルタリングをワンステップで行う。

filterプログラムは、同一性パーセント、リード長、リード長に対するアラインメントの割合、またはそれらの組み合わせに基づくアラインメントのフィルタリングを提供する。例えば、メタゲノミックリードをデータベースにマッピングして生物種レベルのアノテーションを行う場合、通常95%未満の配列同一性のアラインメントは除外する。

例えばbwa-mem2のSAM出力をストリームで読み込んで（-S）、アラインメント長が80bp以上 (-l 80)、同一性が95%以上 (-p 95)、リードの全長の80%以上がアラインメントされている (-z 80)、のフィルタリングをして、これら全て満たすアラインメントをBAM形式で書き出す (-b)。

bwa-mem2 mem DB SAMPLE.1.fq.gz SAMPLE.2.fq.gz \
 | msamtools filter -S -b -l 80 -p 95 -z 80 > filtered.bam

もう１つフィルタリング例が紹介されている。

bwa-mem2 mem HUMAN_DB SAMPLE.1.fq.gz SAMPLE.2.fq.gz \
 | msamtools filter -S -l 30 --invert --keep_unmapped -bu - \
 | samtools fastq -1 human_free_1.fq.gz -2 human_free_2.fq.gz -s /dev/null -o /dev/null -c 6 -N -
Ex

ここでは、メタゲノムからヒト関連リードを適切に除外するために、SAMフォーマットを読みこみ (-S)、アラインメント長が30bp以上 (-l 30)、その条件を反転して(--invert)30bp以下を取り出し、マップされていないリードも保持し (--keep_unmapped)、非圧縮BAM形式で出力を書き出し(-bu)、出力をsamtoolsにパイプする。samtoolsを使って圧縮された (-c 5)fastq形式で書き出す。フォワードリードとリバースリードを別々のファイルに書き出す(-1 -2)。ペアになっていないリードは無視する(-s /dev/null -0 /dev/null)。リードに/1と/2を付加する(-N)、となっている。

2、msamtools profile - bamファイル中の参照配列またはゲノムの相対存在量プロファイルを推定

profileプログラムは、配列アバンダンスのプロファイリング機能を提供する。デフォルトでは、BAMファイル中の各配列の相対アバンダンスが報告される。ただし、”--genome"オプションを使用すると、BAMファイル中のデータベース配列をゲノムやMAGなどの大きなフィーチャーに関連付けることができる。配列/ゲノムの存在量は、その配列/ゲノムにマッピングされたペアのインサート（mate-pairs/paired-ends）数をその長さで割ったものとして推定される。存在量は4つの単位：１）存在量（ab）、２）相対的存在量（rel）、３）100万リードあたりの配列1キロベースあたりの断片数（fpkm）、４）100万あたりの転写産物数（tpm）のいずれかで推定される。デフォルトでは、配列の合計が1になる相対アバンダンスが計算される。

ab - 配列にマップされたインサートの数で、配列長で正規化
rel - abを全配列の合計で正規化した相対存在量
fpkm - 100万リードあたりの配列1キロベースあたりの断片数
tpm - 100万リードあたりの転写産物

（おそらtpmとfpkmはゲノムのプロファイリングには適さない（レポジトリより））。

ペアエンドfastq.gzをGENE_DBの遺伝子カタログデータベース（990万個の遺伝子からなる(Li et al, Nat. biotech 2014)、bwa-mem2形式）にアライメントし、msamtools filterに渡してフィルタリングし、最後にmsamtools profileに渡して相対存在量のプロファイリングを行う。

bwa-mem2 mem GENE_DB SAMPLE.1.fq.gz SAMPLE.2.fq.gz \
 | msamtools filter -S -bu -l 80 -p 95 -z 80 --besthit - \
 | msamtools profile --multi=proportional --label=SAMPLE --unit=rel -o SAMPLE.profile.txt.gz -

bwa-mem2の出力をmsamtools filterにパイプしてSAMフォーマットを読み込む(-S)。msamtools filterでは、アラインメント長80bp以上 (-l 80)、同一性95%以上 (-p 95)、リードの80%以上がアラインメント (-z 80)、リードごとにベストスコアヒットのみ (--besthit)でフィルタリングし（つまり最良のアライメントのみ保持、同じベストアライメントスコアなら複数保持）、非圧縮BAM形式で書き出す (-bu)。その出力をmsamtools profileにパイプし、相対存在量プロファイルを計算する (--unit=rel)、マルチヒットの挿入はヒット間で共有 (--multi=proportional)、出力ファイルにサンプルラベルSAMPLEを使用 (--label=SAMPLE)、圧縮出力SAMPLE.profile.txt.gzに書き出す (-o)。

ほかにも以下の３つを使う例が紹介されている。

すべてのマルチヒット挿入を無視する (--multi=ignore)
配列長の正規化を避ける (--nolen)
相対量 (--unit=rel)ではなく、実際のアバンダンスプロファイルを計算する (--unit=ab)

v1.0.0から配列セットで定義されたゲノムのプロファイリングがサポートされた。実行するには以下の形式のタブ区切り定義ファイルが必要。

MAG_1 Contig_1
MAG_1 Contig_2
MAG_1 Contig_3
MAG_1 Contig_4
MAG_2 Contig_18
MAG_2 Contig_27
MAG_2 Contig_32
MAG_3 Contig_23
MAG_3 Contig_24
MAG_3 Contig_35
MAG_3 Contig_48

この情報がmyMAGs.genome.defというファイルに保存されている場合、以下のように実行すると、ゲノムレベルでのプロファイルを得ることができる。

msamtools filter -b -u -l 80 -p 95 -z 80 --besthit sample1.myMAGs.bam \
  | msamtools profile --multi=proportional --label=sample1 --unit=rel --genome myMAGs.genome.def -o sample1.myMAGs.profile.txt.gz -

3、msamtools coverage - 各参照配列の塩基ごとまたは配列ごとのリードカバレッジを推定

BAMファイル中の全配列の位置ごとのカバレッジを出力する。

msamtools coverage -x -z -o sample1.coverage.txt.gz sample1.IGC.filtered.bam

-x do not report coverage for sequences without aligned reads (default: false)
-w number of words (coverage values) per line (default: 17)
-z compress output file using gzip (default: true)

位置ごとのカバレッジ出力ファイルは、fastaヘッダーとスペース区切りの番号を持つ古いSanger qualityファイルの形式である。これは非常に大きなファイルになる可能性があるため、"-x"オプションを使用すると、1つのリードもマッピングされていない配列のカバレッジは表示されない。このような配列のカバレッジは各ポジションで基本的にゼロであるため、カバレッジを表示することはスペースの無駄になるため。

BAMファイル中の各配列の分数カバレッジを出力する。

msamtools coverage -z --summary -o sample1.coverage.summary.txt.gz sample1.IGC.filtered.bam

BAMファイルのどの配列のどれだけの配列の割合がアラインメントでカバーされているかを知るためには、各配列の分画カバレッジとシーケンスカバレッジを出力する "--summary"オプションを使用する。

４、msamtools summary - BAMファイルで与えられたアラインメントを要約する。また、アラインメント間の特定の特徴の分布を提供することもできる。

msamtools summary input.bam | head

-e ignore alignment if reads map to <num> bases at the edge of target sequence (default: 0)
-c count number of unique inserts in BAM file (default: false)
--stats=<string> {mapped|unmapped|edit|score} only report readcount distribution for specified stats, not read-level stats (default: none)

samtoolsプログラムの精神に則り、msamtoolsもデータストリームで動作するように設計されている。
filterサブコマンドで--besthitまたは--uniqhitを使用する場合、入力BAMファイルはQNAMEでソートされる。しかし、BAMファイルを自分で加工して修正した場合、filterからの出力に影響を与える可能性がある。そのような場は、"samtools sort -n"を使って、QNAMEでソートし直す。
profileサブコマンドでは、複数の場所にマップされたリードを追跡できるように、BAMファイルが名前順にソートされていることを想定している。BAMファイルがそのようにソートされていることを確認する。Profilerはこれをチェックしないため、coordinate-sortedされたBAMファイルを渡すと誤った結果を出すことがある。
profileサブコマンドでは、複数の配列/ゲノムにマッピングされたリードは、3つの異なる方法で配列/ゲノム間で共有する（レポジトリ参照）。
profileサブコマンドでは、v1.0.0からデフォルトの出力はgzip圧縮されたテキストファイルに変更された。引数 --gzip または -z を指定するとエラーになる。
profileプログラムに送る前に、アラインメントをフィルタリングすることを強く推奨する。プロファイルプログラムは各アラインメントを重要視する（例えば、アラインメントの質は見ない）。
profileサブコマンドの定量でオプションの--nolenフラグを立てると、abとrelの配列長正規化がオフになる。--unit=abと--nolenを組み合わせると、各配列にマップされた生の挿入数が得られ、それらを合計すると、BAMファイルの挿入総数（または--totalで渡されたもの）と一致する。
メタゲノムデータでは、データベースにマップされなかったリードの割合を特定し、残りの割合をBAMファイルの配列に割り当てる必要がある場合がある。方法をレポジトリで説明している。
データベースにほんの一握りのリードだけがマッピングされた場合、それが本当に存在量が少ないだけなのか、それとも疑わしいマッピングなのかはすぐにはわからない。そのため、妥当な数のリードがそのフィーチャーにマップされた場合にのみ、そのフィーチャーをダウンストリーム解析に含むように、v1.1.0以降では、-mincountが追加された。これでメタゲノムデータやメタトランスクリプトームデータで存在している or 発現している、とみなすための最小リード数を指定できる。