Metalign - macでインフォマティクス

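Metagenomic profiling, which predicts the presence and relative abundance of microbes in a sample, is a critical first step in microbiome analysis. Alignment-based approaches are often considered accurate but computationally infeasible. The paper introduces Metalign, a new method for efficient and accurate alignment-based metagenomic profiling. The authors use a novel containment min hash approach to pre-filter the reference database before alignment, and then process both uniquely aligned and multi-aligned reads to produce accurate abundance estimates. In performance evaluations on both real and simulated datasets, Metalign was the only evaluated method that maintained high performance and competitive running time on all datasets.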
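
To make the pre-filtering idea more concrete, here is a minimal sketch of containment-based filtering in Python. It is not Metalign's or CMash's actual implementation: the function names, the k-mer size, the sketch size, and the use of an exact hash set for the sample (instead of a probabilistic structure such as a Bloom filter) are all assumptions made for illustration. The idea is to estimate what fraction of each reference's k-mers also occur in the sample, and to keep only the references above a cutoff for the alignment step.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all k-mers (substrings of length k) in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer using a stable hash."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def bottom_sketch(seq, k, size):
    """Keep the `size` smallest distinct k-mer hashes of seq (a bottom-s sketch)."""
    return sorted(hash_kmer(x) for x in kmers(seq, k))[:size]


def containment_estimate(ref_sketch, sample_hashes):
    """Fraction of the reference sketch whose hashes also occur in the sample."""
    if not ref_sketch:
        return 0.0
    hits = sum(1 for h in ref_sketch if h in sample_hashes)
    return hits / len(ref_sketch)


def prefilter(references, sample_seq, k=5, size=16, cutoff=0.01):
    """Return names of references whose estimated containment exceeds cutoff.

    Hypothetical helper: `references` maps a name to a sequence string.
    Using a strict `>` comparison is an assumption of this sketch.
    """
    sample_hashes = {hash_kmer(x) for x in kmers(sample_seq, k)}
    kept = []
    for name, seq in references.items():
        estimate = containment_estimate(bottom_sketch(seq, k, size), sample_hashes)
        if estimate > cutoff:
            kept.append(name)
    return kept
```

A reference that is a substring of the sample gets an estimate of 1.0 and is kept, while a reference sharing no k-mers with the sample gets 0.0 and is dropped. The default cutoff of 0.01 here mirrors the default of the `--cutoff` option in the help output further down, but whether the tool applies it in exactly this way is not something this sketch claims.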

Method

Tested in a conda virtual environment on macOS 10.14.

GitHub

conda create -n Metalign python=3.8 -y
conda activate Metalign
conda install -c bioconda Metalign -y

> metalign.py -h

usage: metalign.py [-h] [--cutoff CUTOFF] [--db_dir DB_DIR] [--dbinfo_in DBINFO_IN] [--keep_temp_files] [--input_type {fastq,fasta,AUTO}] [--length_normalize]
                   [--low_mem] [--min_abundance MIN_ABUNDANCE] [--no_quantify_unmapped] [--output OUTPUT] [--pct_id PCT_ID] [--precise] [--rank_renormalize]
                   [--read_cutoff READ_CUTOFF] [--sampleID SAMPLEID] [--sensitive] [--strain_level] [--temp_dir TEMP_DIR] [--threads THREADS] [--verbose]
                   reads data

Runs full metalign pipeline on input reads file(s).

positional arguments:
  reads                 Path to reads file.
  data                  Path to data/ directory with the files from setup_data.sh

optional arguments:
  -h, --help            show this help message and exit
  --cutoff CUTOFF       CMash cutoff value. Default is 0.01.
  --db_dir DB_DIR       Directory with all organism files in the full database.
  --dbinfo_in DBINFO_IN
                        Location of db_info file. Default: data/db_info.txt
  --keep_temp_files     Retain KMC files after this script finishes.
  --input_type {fastq,fasta,AUTO}
                        Type of input file (fastq/fasta). Default: try to auto-determine
  --length_normalize    Normalize abundances by genome length.
  --low_mem             Run in low memory mode, with inexact multimapped processing.
  --min_abundance MIN_ABUNDANCE
                        Minimum abundance for a taxa to be included in the results. Default: 10^(-4).
  --no_quantify_unmapped
                        Do not factor in unmapped reads in abundance estimation.
  --output OUTPUT       Output abundances file. Default: abundances.tsv
  --pct_id PCT_ID       Minimum percent identity from reference to count a hit.
  --precise             Run in precise mode. Overwrites --read_cutoff and --min_abundance to 100 and 0.1.
  --rank_renormalize    Renormalize abundances to 100 pct. at each rank, e.g if an organism has a species but not genus label.
  --read_cutoff READ_CUTOFF
                        Number of reads to count an organism as present.
  --sampleID SAMPLEID   Sample ID for output. Defaults to input file name(s).
  --sensitive           Run in sensitive mode. Sets --cutoff value to 0.0.
  --strain_level        Profile strains (off by default).
  --temp_dir TEMP_DIR   Directory to write temporary files to.
  --threads THREADS     Number of compute threads for Minimap2/KMC. Default: 4
  --verbose             Print verbose output.
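
The description of `--rank_renormalize` ("renormalize abundances to 100 pct. at each rank") is essentially a small piece of arithmetic, so here is a rough sketch of what such a step could look like. This is only my reading of the help text, not code taken from Metalign; the function name and the `(rank, taxon) -> percent` data layout are hypothetical.

```python
def renormalize_by_rank(abundances):
    """Rescale abundances so that the values at each rank sum to 100.

    `abundances` maps (rank, taxon) to a percentage. This layout is an
    assumption for illustration, not Metalign's internal representation.
    """
    # Sum the abundances separately for each taxonomic rank.
    totals = {}
    for (rank, _taxon), pct in abundances.items():
        totals[rank] = totals.get(rank, 0.0) + pct

    # Divide each value by its rank total; ranks with a zero total stay at 0.
    return {
        (rank, taxon): (100.0 * pct / totals[rank] if totals[rank] > 0 else 0.0)
        for (rank, taxon), pct in abundances.items()
    }
```

For example, if two species each have 30% and a single genus has 30%, the species become 50% each and the genus becomes 100%, so every rank adds up to 100 on its own.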

Test run

wget https://ucla.box.com/shared/static/ybz1xgke32kh56p4lqsg41t4g1bq8xy0.gz && mv ybz1xgke32kh56p4lqsg41t4g1bq8xy0.gz reads.fna.gz
metalign.py reads.fna.gz data/ --output metalign_results.tsv
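
If you want to launch the same run from a script, the command line can be assembled in Python. The sketch below only uses options that appear in the help output above; the helper name `build_metalign_cmd` is hypothetical. Note that, according to the help text, the `data/` directory is expected to contain the files from setup_data.sh, so this assumes that step has already been done.

```python
import shlex


def build_metalign_cmd(reads, data_dir, output="abundances.tsv",
                       threads=4, precise=False, sensitive=False):
    """Build the argument list for a metalign.py run (hypothetical helper).

    Only flags listed in `metalign.py -h` are used; the defaults for
    output and threads follow the defaults shown in that help text.
    """
    cmd = ["metalign.py", reads, data_dir,
           "--output", output,
           "--threads", str(threads)]
    if precise:
        cmd.append("--precise")
    if sensitive:
        cmd.append("--sensitive")
    return cmd


if __name__ == "__main__":
    cmd = build_metalign_cmd("reads.fna.gz", "data/",
                             output="metalign_results.tsv")
    # Print the command in shell-quoted form for inspection.
    print(shlex.join(cmd))
```

To actually execute it, pass the returned list to `subprocess.run(cmd, check=True)`. Building the command as a list rather than a single string avoids quoting problems when file names contain spaces.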

Testing in progress.

Citation

Metalign: efficient alignment-based metagenomic profiling via containment min hash

Nathan LaPierre, Mohammed Alser, Eleazar Eskin, David Koslicki & Serghei Mangul

Genome Biology volume 21, Article number: 242 (2020)