2018-07-05

bam-readcount: output detailed read count information

bam/sam

2021/11/8: added installation instructions

bam-readcount targets single-end data. With paired-end data, the two reads of a pair are counted independently.

Installation

Tested on macOS 10.13.

Dependencies

git
cmake 2.8.3+ (cmake.org)

macOS does not ship with cmake. The quickest way to get it is brew install cmake.

Source: GitHub

git clone https://github.com/genome/bam-readcount.git
cd bam-readcount/
mkdir build 
cd build
cmake ..
make
cd bin/

#conda
conda create -n bamreadcount python=3.7 -y
conda activate bamreadcount
conda install -c bioconda bam-readcount

> bam-readcount

$ bam-readcount

Usage: bam-readcount [OPTIONS] <bam_file> [region]

Generate metrics for bam_file at single nucleotide positions.

Example: bam-readcount -f ref.fa some.bam

Available options:

-h [ --help ] produce this message

-v [ --version ] output the version number

-q [ --min-mapping-quality ] arg (=0) minimum mapping quality of reads used for counting.

-b [ --min-base-quality ] arg (=0) minimum base quality at a position to use the read for counting.

-d [ --max-count ] arg (=10000000) max depth to avoid excessive memory usage.

-l [ --site-list ] arg file containing a list of regions to report readcounts within.

-f [ --reference-fasta ] arg reference sequence in the fasta format.

-D [ --print-individual-mapq ] arg report the mapping qualities as a comma separated list.

-p [ --per-library ] report results by library.

-w [ --max-warnings ] arg maximum number of warnings of each type to emit. -1 gives an unlimited number.

-i [ --insertion-centric ] generate indel centric readcounts. Reads containing insertions will not be included in per-base counts

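Run

Count coverage using only reads with MAPQ of at least 1. The minimum mapping quality is set with -q (-b sets the minimum base quality instead).

bam-readcount -f ref.fa -q 1 input.bam > output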

Count coverage only within the regions listed in a BED file passed with -l, and suppress warning messages with -w 0.

bam-readcount -f ref.fa -w 0 -l input.bed input.bam > output
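A note on the file passed to -l: as far as I recall from the bam-readcount README, the site list is expected as tab-separated chromosome, start, and end with 1-based coordinates, whereas BED is 0-based and half-open. Treat that as an assumption and check it against the README for your version. Under that assumption, a minimal Python sketch for converting BED intervals might look like this (bed_to_site_list is a hypothetical helper name, not part of bam-readcount):

```python
def bed_to_site_list(bed_lines):
    """Convert BED intervals (0-based, half-open) into 1-based inclusive
    "chrom<TAB>start<TAB>end" lines. The target format is an assumption."""
    out = []
    for line in bed_lines:
        line = line.strip()
        # Skip blank lines, comments, and BED header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split("\t")[:3]
        # Only the start shifts by one; the half-open end equals the inclusive end.
        out.append(f"{chrom}\t{int(start) + 1}\t{int(end)}")
    return out


if __name__ == "__main__":
    for row in bed_to_site_list(["chr1\t99\t100\texample"]):
        print(row)
```

The extra BED columns (name, score, and so on) are dropped because only the first three are needed for a region list.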

Output:

head -n 1 output

chr1 100 A 10 =:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00 A:10:60.00:37.10:60.00:6:4:0.60:0.00:0.00:6:0.41:287.90:0.49 C:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00 G:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00 T:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00 N:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00

This is position 100 on chr1; the reference base is A and the depth is 10. Each of the remaining fields has the form

base:count:avg_mapping_quality:avg_basequality:avg_se_mapping_quality:num_plus_strand:num_minus_strand:avg_pos_as_fraction:avg_num_mismatches_as_fraction:avg_sum_mismatch_qualities:num_q2_containing_reads:avg_distance_to_q2_start_in_q2_reads:avg_clipped_length:avg_distance_to_effective_3p_end

As explained on GitHub, the fields are:

base → the base that all reads following in this field contain at the reported position i.e. C
count → the number of reads containing the base
avg_mapping_quality → the mean mapping quality of reads containing the base
avg_basequality → the mean base quality for these reads
avg_se_mapping_quality → mean single ended mapping quality
num_plus_strand → number of reads on the plus/forward strand
num_minus_strand → number of reads on the minus/reverse strand
avg_pos_as_fraction → average position on the read as a fraction (calculated with respect to the length after clipping). This value is normalized to the center of the read (bases occurring strictly at the center of the read have a value of 1, those occurring strictly at the ends should approach a value of 0)
avg_num_mismatches_as_fraction → average number of mismatches on these reads per base
avg_sum_mismatch_qualities → average sum of the base qualities of mismatches in the reads
num_q2_containing_reads → number of reads with q2 runs at the 3’ end
avg_distance_to_q2_start_in_q2_reads → average distance of position (as fraction of unclipped read length) to the start of the q2 run
avg_clipped_length → average clipped read length of reads
avg_distance_to_effective_3p_end → average distance to the 3’ prime end of the read (as fraction of unclipped read length)

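The per-base blocks are easy to split programmatically. Below is a minimal Python sketch that turns one output line into a dictionary, using the field names listed above. It assumes the default output (no -p), whitespace-separated columns, and colon-separated values inside each block; parse_readcount_line and base_fraction are hypothetical helper names.

```python
FIELDS = [
    "count", "avg_mapping_quality", "avg_basequality", "avg_se_mapping_quality",
    "num_plus_strand", "num_minus_strand", "avg_pos_as_fraction",
    "avg_num_mismatches_as_fraction", "avg_sum_mismatch_qualities",
    "num_q2_containing_reads", "avg_distance_to_q2_start_in_q2_reads",
    "avg_clipped_length", "avg_distance_to_effective_3p_end",
]
INT_FIELDS = {"count", "num_plus_strand", "num_minus_strand", "num_q2_containing_reads"}


def parse_readcount_line(line):
    """Parse one bam-readcount line into chrom, pos, ref, depth, and per-base metrics."""
    chrom, pos, ref, depth, *blocks = line.split()
    bases = {}
    for block in blocks:
        base, *values = block.split(":")
        bases[base] = {
            name: int(value) if name in INT_FIELDS else float(value)
            for name, value in zip(FIELDS, values)
        }
    return {"chrom": chrom, "pos": int(pos), "ref": ref,
            "depth": int(depth), "bases": bases}


def base_fraction(record, base):
    """Fraction of the reported depth that carries the given base."""
    depth = record["depth"]
    return record["bases"][base]["count"] / depth if depth else 0.0


if __name__ == "__main__":
    line = ("chr1 100 A 10 "
            "A:10:60.00:37.10:60.00:6:4:0.60:0.00:0.00:6:0.41:287.90:0.49 "
            "C:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00")
    record = parse_readcount_line(line)
    print(record["ref"], record["depth"], base_fraction(record, "A"))
```

Dividing a base's count by the depth, as base_fraction does, is a common way to get a rough allele fraction at a site.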

If the BAM file contains several libraries and you want per-library output, run with -p.

References

Github

GitHub - genome/bam-readcount: count DNA sequence reads in BAM files

wiki

https://genome.sph.umich.edu/wiki/Bam_read_count

Biostars

https://www.biostars.org/p/82993/

2018-07-05

Meta-aligner: a long-read aligner

Pacbio mapping

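The number of long reads produced by next-generation sequencing (NGS) technologies is growing rapidly. Mapping these long reads to a reference genome efficiently and accurately clearly plays an important role, not only in improving downstream analyses in applications such as resequencing, RNA-Seq, and ChIP-Seq, but also in lowering overall NGS costs. Current NGS technologies fall into two categories according to the overall quality and length of their reads. Sequencers such as Illumina HiSeq and Ion Torrent Proton belong to the first category, whereas PacBio RS II and Nanopore MinION are typical examples of sequencers that produce long but noisy sequences.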

(One paragraph omitted.)

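In this paper, the authors are interested in aligning longer sequences (300 bp or more). A common design in this area is to extract seeds (short subsequences) from a read and find sequences in the reference that match each seed exactly or very closely. After the extracted seeds have been anchored at several locations, a local alignment algorithm determines the best match. A common theme among the available long-read aligners is that they treat all reads equally. As this paper shows, however, distinguishing reads that originate from repeat regions from reads that originate from non-repeat regions greatly improves the performance of an alignment scheme. The main difference between the authors' approach and conventional aligners is its focus on exploiting the inherent structure of the genome and the underlying statistics of the reads from the outset.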
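The seed-and-anchor idea described here can be illustrated with a toy example. The Python sketch below is a generic illustration, not Meta-aligner's actual implementation: it indexes every k-mer of the reference, looks up exact seed matches from the read, and lets each match vote for the reference start position it implies. The function names, k, and the step size are all choices made for the example.

```python
from collections import Counter, defaultdict


def build_kmer_index(reference, k):
    """Map every k-mer in the reference to the list of positions where it occurs."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def anchor_read(read, index, k, step):
    """Return (implied read start on the reference, number of supporting seeds),
    or None when no seed matches exactly."""
    votes = Counter()
    for offset in range(0, len(read) - k + 1, step):
        seed = read[offset:offset + k]
        for ref_pos in index.get(seed, ()):
            # A seed at read offset `offset` matching at `ref_pos` implies
            # that the read starts at ref_pos - offset.
            votes[ref_pos - offset] += 1
    if not votes:
        return None
    return votes.most_common(1)[0]


if __name__ == "__main__":
    reference = "ACGTTGCAAGGCTTACCGATAGGCTAACGTTAGC"
    index = build_kmer_index(reference, k=4)
    print(anchor_read(reference[10:26], index, k=4, step=4))
```

A real aligner would follow the anchoring with a local alignment around the winning position; reads from repeats show up here as several start positions with similar vote counts.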

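At a high level, Meta-aligner consists of two distinct stages, alignment and assignment. The first stage is designed to align short fragments of the reads to the reference genome quickly and accurately using a conventional short-read aligner. The authors' results clearly show that, owing to the statistical properties of real genomes, a large fraction of the reads is handled at this stage. The remaining reads are comparatively hard to align, so the additional processing in the assignment stage has to be tuned accordingly. These reads are processed by aligning all of their remaining short fragments. Because relatively few reads reach the second stage, however, its overall running time is lower than that of the first stage.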
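One way to picture how the first stage decides that a read is settled is the confirmation rule hinted at by the -l1 and -cfd1 options in the help output: two consecutive subfragments of length l1 confirm a read when they anchor about l1 apart, within a tolerance g1 (default 0.1*l1). The Python sketch below is my reading of that help text, not the tool's code; confirm_read and its input format are hypothetical.

```python
def confirm_read(fragment_positions, l1=40, g1=None):
    """Return the implied read start if two consecutive subfragments anchor
    consistently, else None (the read would be left for the assignment stage).

    fragment_positions: reference start of each consecutive l1-length
    subfragment, or None where the short-read aligner found no hit.
    """
    if g1 is None:
        g1 = 0.1 * l1
    for i in range(len(fragment_positions) - 1):
        first, second = fragment_positions[i], fragment_positions[i + 1]
        if first is None or second is None:
            continue
        # Consecutive fragments should lie l1 apart, give or take g1 for indels.
        if abs((second - first) - l1) <= g1:
            return first - i * l1
    return None


if __name__ == "__main__":
    print(confirm_read([None, 1040, 1081, None]))  # confirmed near position 1000
    print(confirm_read([500, 9000, None]))         # left for the assignment stage
```

Reads that return None in a scheme like this are the "hard" ones that the paragraph above says are passed to the second stage.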


Installation

Tested on CentOS 6.

Dependencies

Download the program from the lab's website.

http://brl.ce.sharif.edu/software/meta-aligner/

# Compile
g++ launcher.cpp -O2 -std=c++0x -o PE.out

g++ main.cpp Idea.cpp SamAnalyzer.cpp LocalAligner.cpp -pthread -O2 -std=c++11 -o meta-align.out

g++ Assignment.cpp Path.cpp LocalAligner.cpp -pthread -std=c++11 -O2 -o assignment.out

> ./meta-align.out -h

$ ./meta-align.out -h

*** Welcome to Meta-aligner ***

./meta-aligner [options]* -x <index name> -fa <ref name> -r <read name> -o [<hit>]

Main arguments:

============================

-x <index name>

The base name of the indexes. All indexes must exist with this name.

For Soap2 and mrsFast aligners, this name must be used without any suffix.

-fa <ref name>

The reference genome in FASTA format.

-r <read name>

The base name of the input read set (assumed to be in FastQ format).

-o <output name>

File to write outputs in SAM format (default is "output.sam").

============================

Options:

============================

Input option:

============================

-FA

This command is for Fasta input reads.

-pg

Percent of gap within the input read set which can be estimated by the PE algorithm.(default is 0.01).

============================

Alignment options:

============================

-al <int>

Flag for using different short-read aligners at the alignment stage. Flag values are: Bowtie 1 / mrsFast 2 / SOAP2 3 (default is 1 - Bowtie).

-l1 <int>

The subfragment length (l1) which can be estimated by the PE algorithm (default is 40).

-sl1 <int>

The length of sliding window which can be estimated by the PE algorithm (default is 0 with no sliding).

-cfd1 <int>

The consecutive distance between two anchored subfragments (g1) which is used for confirming two fragments of a read (default is 0.1*l1).

-d <int>

Edit distance between fragments and the reference genome which can be estimated by the PE algorithm. (default value is 2).

For Bowtie: this command works as -v (may be an integer from 0 through 3) and determines only number of mismatches.

For mrsFast: this command works as -e.

For Soap2: this command works as -v.

-tr <int>

Length of reads that are trimmed and only <int> bases of each read is used for anchoring and the remaining bases are used in the local alignment (default is not used).

============================

Assignment options:

============================

-l2 <int>

The subfragment length (l2) which is used at the 2nd step of the assignment stage (default is 150).

-sl2 <int>

The length of sliding window for the 2nd step of the assignment stage (default is 50).

-cfd2 <int>

The consecutive fragments distance which is used for confirming two fragments of a read and anchor it (g2) at the 2nd step of the assignment stage (default is 0.1*l2).

-seedmm2 <int>

Number of mismatches which is allowed in a seed alignment at the 2nd step of the assignment stage (default is 1).

-seedlen2 <int>

Length of the seed substrings to align at the 2nd step of the assignment stage (default is 20).

-ls1 <int>

List size of the assignment stage when Bowtie is used at the 1st step of the assignment stage (default is 10).

-ls2 <int>

List size of the assignment stage when Bowtie2 is used at the 2nd step of the assignment stage (default is 40).

-thrsc <double>

Threshold of path selection from their scores at the assignment stages (default is 0.3).

============================

Scoring options:

============================

-ms <double>

The match score.

-mp <double>

The mismatch penalty.

-gp <double>

The gap penalty.

============================

Reporting options:

============================

-dis

This option discards local alignment of the anchored reads. Reports are flags and positions of anchored reads.

-disHeader

This option suppresses the header of the output SAM file.

============================

Other options:

============================

-step <1 or 2 or 3>

This parameter specifies that Meta-aligner is run up to the selected step, in case of selecting "1": run only the alignment stage; "2": run the alignment stage and the 1st step of the assignment stage; "3": run all steps (default is "2").

-dir <address>

Meta-aligner creates a new directory at the input address, and all steps are executed at that address (default is "./results").

-p <int>

Number of threads (default value is 1).

-ed <double>

This parameter controls the normalized cutting length for the local alignment table. Only ed/2*(read_length) cells adjacent to the original diagonal of the local alignment table are used for local alignment procedure (between 0 and 2). This parameter can be estimated from indel rate by the PE algorithm (default is 5*pg).

-ram <double>

The available RAM. Meta-aligner runs without any restriction of RAM. If some reads cannot be processed by this value of RAM (even with one thread), Meta-aligner reports these reads in a file (named "NotEnoughRAM.txt") in Fastq format and their flags and the anchored positions are written in their header section by underline.

-h

Help.

——

Run

1. Build the Bowtie index

 bowtie2-build -f ref.fa genomeindex

2. Mapping

./meta-align.out -x genomeindex -fa chr1.fa -r reads.fq -o output.sam -l1 25 -d 1 -sl1 10 -p 4 -ram 4

This run ended with a core dump. I will update this post if I find a fix.

References

Meta-aligner: long-read alignment based on genome statistics

Damoon Nashta-ali, Ali Aliyari, Ahmad Ahmadian Moghadam, Mohammad Amin Edrisi, Seyed Abolfazl Motahari and Babak Hossein Khalaj
BMC Bioinformatics. 2017;18:126.

2018-07-05

ARIBA: AMR genotyping from NGS data

bacteria MLST antibiotic resistance genes (ARGs) AMR assembly

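Antimicrobial resistance (AMR; antibiotic resistance, AR or ABR, is a subclass of AMR) has become one of the major threats to human health and is estimated to be the direct cause of 700,000 deaths per year worldwide [ref. 1 in the paper]. If the threat is not addressed, this figure is estimated to rise to 10 million by 2050 [ref. 1]. A key element of any strategy for tackling AMR is a rapid and accurate method for identifying markers of resistance. Genome sequencing in personalized medicine, which is set to become one of the key tools in the fight against AMR, has made genome sequence data more widely available, and this has improved our understanding of the mechanisms and diversity of AMR. However, few bioinformatics tools can identify AMR determinants directly from the data produced by currently widespread sequencing technologies. The methods available today are either limited in the types of AMR mechanism they can detect or do not scale to high-throughput settings.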

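The limitations of existing tools include being available only through web services that are not high-throughput; being restricted to specific reference sequences that may not comprehensively represent current knowledge of AMR across all microbial species; requiring assembled genome sequences as input; being unable to identify and interpret AMR determinants based on single-nucleotide polymorphisms (SNPs); and having high computational resource requirements. Most tools fall into one of two categories: those that align sequencing reads to a set of reference genes, and those that search de novo assembled sequences for matches to reference genes. The widely used SRST2 [ref. 2] is an example of a method that aligns reads to a set of reference sequences to predict the presence of genes in a sample. KmerResistance [ref. 3] takes a similar approach but uses k-mer matching to determine gene presence. SRST2 and KmerResistance can be used with custom reference gene sets, but they cannot directly identify or interpret variants such as resistance-conferring SNPs, so resistance is inferred from gene presence alone. Mykrobe predictor [ref. 4] is a very fast tool that matches k-mers against a reference graph and can identify variants, but it requires a purpose-built database of AMR determinants and is currently limited to Staphylococcus aureus and Mycobacterium tuberculosis.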

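Most other AMR detection tools require assembled sequences, which are computationally expensive to generate from sequencing reads, and AMR determinants can be missed because of assembly errors or failures that stem from the complexity of assembling a complete genome de novo. For these reasons, alignment-based approaches have previously been shown to outperform de novo assembly methods [ref. 2, 3]. Tools that take assembled sequences as input include ResFinder [ref. 5], ARG-ANNOT [ref. 6], SSTAR [ref. 7], and RAST [ref. 8]. These methods match assembled sequences against reference genes, usually with the BLAST [ref. 9] algorithm, to identify AMR genes.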

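Here the authors present a new tool called ARIBA (Antimicrobial Resistance Identification By Assembly). It combines a mapping/alignment approach with targeted local assembly to identify AMR genes and variants efficiently and accurately from paired sequencing reads. Local assembly greatly reduces the complexity of the assembly process while still providing contiguous gene or nucleotide sequences, without the ambiguity involved in interpreting aligned data. ARIBA includes support for a number of public databases, including ARG-ANNOT [ref. 6], CARD [ref. 10], MEGARes [ref. 11], and ResFinder [ref. 5]. It distinguishes coding from non-coding sequences and provides details about each sequence present in a sample. It checks whether each identified gene is complete, truncated, or fragmented in the sample, and it reports the effects of SNPs and indels, such as frameshifts, non-synonymous substitutions, or nonsense mutations. To make the results easier to interpret, ARIBA includes functions for summarizing results across multiple samples. These summaries are compatible with the Phandango interactive visualization tool [ref. 12]. When minimum inhibitory concentration (MIC) data are available for the samples, ARIBA supports statistical analysis and plotting of MIC against genotype. Beyond AMR, ARIBA can be used more generally to find any input sequences of interest. It supports the PlasmidFinder [ref. 13] and VFDB [ref. 14] databases, as well as multi-locus sequence typing (MLST) using data from PubMLST [ref. 15].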

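The authors developed ARIBA to identify AMR determinants from public or custom databases using paired-end sequencing data as input. Briefly, the reference sequences in the AMR database are clustered by similarity using CD-HIT [ref. 16]. The reads are mapped to the reference sequences with minimap [ref. 17] to produce a set of reads for each cluster; a read belongs to a cluster's set if it maps to at least one sequence in that cluster. The reads for each cluster, together with their mates, are assembled independently using fermi-lite (Heng Li, GitHub) under several parameter combinations. The reference sequence closest to the contigs is identified with nucmer from the MUMmer package [ref. 18]. The contigs are then compared with that reference sequence using MUMmer's nucmer and show-snps programs to establish how complete the assembled sequence is and to identify any variants between the two. The cluster's reads are mapped to the assembly with Bowtie2 [ref. 19], and variants are called with SAMtools [ref. 20]. Finally, a detailed report is produced of all variants identified in the sample, including the presence or absence of variants predefined as important for AMR.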
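The read-binning rule in this description (a read joins a cluster when it maps to at least one sequence of that cluster) can be shown with a few lines of Python. This is only an illustration of the rule with made-up identifiers; it does not parse CD-HIT or minimap output, and reads_per_cluster is a hypothetical name.

```python
from collections import defaultdict


def reads_per_cluster(seq_to_cluster, read_hits):
    """Group reads by reference cluster.

    seq_to_cluster: reference sequence name -> cluster name
    read_hits: read name -> iterable of reference sequences the read maps to
    """
    clusters = defaultdict(set)
    for read, sequences in read_hits.items():
        for seq in sequences:
            clusters[seq_to_cluster[seq]].add(read)
    return dict(clusters)


if __name__ == "__main__":
    # Toy data: two similar bla genes fall in one cluster, tetA in another.
    seq_to_cluster = {"blaA_1": "cluster_bla", "blaA_2": "cluster_bla", "tetA": "cluster_tet"}
    read_hits = {"read1": ["blaA_1", "blaA_2"], "read2": ["tetA"], "read3": ["blaA_2"]}
    print(reads_per_cluster(seq_to_cluster, read_hits))
```

Each resulting read set is what would then be assembled independently, which is why the assembly problem stays small.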

Workflow (figure reproduced from the paper).


Installation

Dependencies

Python3 version >= 3.3.2
Bowtie2 version >= 2.1.0
CD-HIT version >= 4.6
MUMmer version >= 3.23

Python packages

dendropy >= 4.2.0
matplotlib (no minimum version required, but only tested on 2.0.0)
pyfastaq >= 3.12.0
pysam >= 0.9.1
pymummer >= 0.10.1

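Source: GitHub

ARIBA can be installed with conda, and a Docker image is also provided. Because it has many dependencies, the easiest options are probably to install it with conda in an Anaconda environment, or to run it in an isolated environment such as a container so that it cannot conflict with other tools.

# In an Anaconda environment, use conda
conda install -c bioconda ariba

# Pull the Docker image
docker pull sangerpathogens/ariba

# Without Anaconda, install ARIBA itself with pip and install the dependencies separately
pip install ariba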

> docker run --rm -it -v /Users/user/docker_share/:/data sangerpathogens/ariba ariba -h

usage: ariba <command> <options>

ARIBA: Antibiotic Resistance Identification By Assembly

optional arguments:

-h, --help show this help message and exit

Available commands:

aln2meta Converts multi-aln fasta and SNPs to metadata

expandflag Expands flag column of report file

flag Translate the meaning of a flag

getref Download reference data

micplot Make violin/dot plots using MIC data

prepareref Prepare reference data for input to "run"

pubmlstget Download species from PubMLST and make db

pubmlstspecies

Get list of available species from PubMLST

refquery Get cluster or sequence info from prepareref output

run Run the local assembly pipeline

summary Summarise multiple reports made by "run"

test Run small built-in test dataset

version Get versions and exit

——

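The command above starts the container with the host directory /Users/user/docker_share shared as /data inside the container. With --rm the container is removed when it exits (without --rm, stopped containers accumulate).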

No alias has been set, but in the rest of this post docker run --rm -it -v /Users/user/docker_share:/data sangerpathogens/ariba ariba is abbreviated to just ariba.
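If you script these steps, the long Docker prefix can be kept in one place. The Python sketch below only builds the argument list (it does not start Docker); ariba_cmd is a hypothetical helper, and the volume path is the one used in this post.

```python
import shlex

# Prefix copied from the docker run command used in this post.
DOCKER_PREFIX = [
    "docker", "run", "--rm", "-it",
    "-v", "/Users/user/docker_share:/data",
    "sangerpathogens/ariba", "ariba",
]


def ariba_cmd(*args, use_docker=True):
    """Return the argv list for an ariba subcommand, with or without Docker."""
    prefix = DOCKER_PREFIX if use_docker else ["ariba"]
    return prefix + list(args)


if __name__ == "__main__":
    # Print the full command line; pass the list to subprocess.run to execute it.
    print(shlex.join(ariba_cmd("getref", "card", "out.card")))
```

Note that -it asks for an interactive terminal, so when running from a non-interactive script you may need to drop that flag.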

Run

1. getref: Download reference data

For example, download the CARD database.

#usage: ariba getref [options] <db> <outprefix>

ariba getref card out.card

<db> Database to download. Must be one of: argannot card megares plasmidfinder resfinder srst2_argannot vfdb_core vfdb_full virulencefinder

2. prepareref: Prepare reference data for input to "run"

Convert the downloaded reference sequences into a database that run can use.

#usage: ariba prepareref [options] <outdir>

ariba prepareref -f out.card.fa -m out.card.tsv out.card.prepareref

-f REQUIRED. Name of fasta file. Can be used more than once if your sequences are spread over more than one file
-m Name of tsv file of metadata about the input sequences. Can be used more than once if your metadata is spread over more than one file. Incompatible with --all_coding.

This creates the out.card.prepareref directory.

3. run: Run the local assembly pipeline

Run the local assembly.

# Download NGS data from SRA (*1)
fastq-dump --split-files SRR5132030 # download and convert to paired-end FASTQ

# Local assembly
#usage: ariba run [options] <prepareref_dir> <reads1.fq> <reads2.fq> <outdir>
ariba run out.card.prepareref pair1.fq pair2.fq output

--threads Experimental. Number of threads. Will run clusters in
parallel, but not minimap (yet) [1]
--verbose Be verbose
--min_scaff_depth Minimum number of read pairs needed as evidence for
scaffold link between two contigs [10]
--nucmer_min_id Minimum alignment identity for nucmer (delta-filter -i) [90]
--nucmer_min_len Minimum alignment length (delta-filter -i) [20]

*1 Here I searched SRA Explorer for "Salmonella Outbreak" and used the top hit (https://www.ncbi.nlm.nih.gov/sra/?term=ERR1197641).

When the job finishes, the output directory is created.

Contents of output/ (shown as a screenshot in the original post).

4. summary: Summarise multiple reports made by "run"

Create a summary report. For example, merge the results of three runs.

ariba summary out.summary output1/report1.tsv output2/report2.tsv output3/report3.tsv

This writes out.summary.csv. Only one dataset was used here, so the table contains a single sample row (shown as a screenshot in the original post).
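With several samples, the run and summary steps follow a simple pattern: one ariba run per sample, then one ariba summary over the reports. The Python sketch below builds those command lines from the usage strings shown above. Two things are assumptions on my part: the per-sample output directory naming, and that each run directory contains a file called report.tsv (the example above writes report1.tsv and so on). ariba_batch_commands is a hypothetical helper.

```python
def ariba_batch_commands(prepareref_dir, samples, summary_prefix="out.summary"):
    """Build 'ariba run' commands per sample and one 'ariba summary' command.

    samples: dict of sample name -> (reads1, reads2)
    """
    run_cmds, reports = [], []
    for name, (reads1, reads2) in samples.items():
        outdir = f"{name}.ariba"  # naming scheme chosen for this example
        run_cmds.append(["ariba", "run", prepareref_dir, reads1, reads2, outdir])
        reports.append(f"{outdir}/report.tsv")  # assumed report file name
    summary_cmd = ["ariba", "summary", summary_prefix] + reports
    return run_cmds, summary_cmd


if __name__ == "__main__":
    samples = {
        "s1": ("s1_1.fq", "s1_2.fq"),
        "s2": ("s2_1.fq", "s2_2.fq"),
    }
    runs, summary = ariba_batch_commands("out.card.prepareref", samples)
    for cmd in runs + [summary]:
        print(" ".join(cmd))
```

Each list can be handed to subprocess.run once the run step has been checked on a single sample.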

pubmlstget: Download species from PubMLST and make db

Download the Staphylococcus aureus data from PubMLST and convert it into a database that can be used with run.

ariba pubmlstget 'Staphylococcus aureus' out

pubmlstspecies: Get list of available species from PubMLST

List all species available from PubMLST.

ariba pubmlstspecies

There is also a micplot command that produces MIC plots from a summary report.

https://github.com/sanger-pathogens/pathogen-informatics-training/blob/master/Notebooks/ARIBA/micplot.ipynb

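Other commands include refquery, which queries a prepared database, and aln2meta, which converts a multi-FASTA file from a multiple alignment of a gene of interest, together with a file of SNP information (TSV format), into metadata. After running aln2meta, build the database with prepareref as above and use it for AMR detection.

For the MLST calling workflow, see the tutorial.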

https://github.com/sanger-pathogens/ariba/wiki/MLST-calling-with-ARIBA

References

ARIBA: rapid antimicrobial resistance genotyping directly from sequencing reads
Martin Hunt, Alison E Mather, Leonor Sánchez-Busó, Andrew J Page, Julian Parkhill, Jacqueline A Keane, Simon R Harris

Microb Genom. 2017 Sep 4;3(10):e000131.

2018-07-04

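staramr: detect AMR genes in genome or contig sequences

antibiotic resistance genes (ARGs) clinical and diagnostic bacteria AMR

2019 7/5: corrected commands

2019 7/8: revised the flow of the explanation

2019 7/14: added video

staramr is a tool that detects genes that cause (or are associated with) antimicrobial resistance (AMR) in DNA sequences such as contigs or genomes. It searches the ResFinder and PointFinder databases.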

Installation

Tested on macOS 10.13 with Anaconda 3.4.0.

Dependencies

Python 3
BLAST+
Git

Source: GitHub

# Install with conda, including dependencies
conda install -c bioconda -y staramr

The databases are downloaded as well.

> staramr -h

usage: staramr [-h] [--verbose] [-V] {search,db} ...

Do AMR detection for genes and point mutations

positional arguments:

{search,db} Subcommand for AMR detection.

search Search for AMR genes

db Download ResFinder/PointFinder databases

optional arguments:

-h, --help show this help message and exit

--verbose Turn on verbose logging [False].

-V, --version show program's version number and exit

——

Database information

> staramr db info

$ staramr db info

resfinder_db_dir = /home/kazu/anaconda3/lib/python3.6/site-packages/staramr/databases/data/dist/resfinder

resfinder_db_url = https://bitbucket.org/genomicepidemiology/resfinder_db.git

resfinder_db_commit = dc33e2f9ec2c420f99f77c5c33ae3faa79c999f2

resfinder_db_date = Tue, 20 Mar 2018 16:49

pointfinder_db_dir = /home/kazu/anaconda3/lib/python3.6/site-packages/staramr/databases/data/dist/pointfinder

pointfinder_db_url = https://bitbucket.org/genomicepidemiology/pointfinder_db.git

pointfinder_db_commit = ba65c4d175decdc841a0bef9f9be1c1589c0070a

pointfinder_db_date = Fri, 06 Apr 2018 09:02

pointfinder_gene_drug_version = 050218

——

Update the databases to the latest version.

staramr db update --update-default

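Run

1. Prepare the contigs.

staramr takes a FASTA file of assembled contigs. It cannot use raw NGS reads directly, so assemble them beforehand if you do not already have an assembly. As de novo assemblers, the authors suggest tools such as Shovill, a tuned wrapper around SPAdes.

shovill --outdir out --R1 test/R1.fq.gz --R2 test/R2.fq.gz --ram 8

2. Run staramr.

Specify the FASTA file and the output directory.

staramr search -o staramr_output input.fa

When the analysis finishes, several files are written to staramr_output/. The *.tsv files hold the results: tab-delimited tables of the detected AMR genes, such as antibiotic resistance genes, with their annotations. In addition there are a settings.txt file, a results.xlsx file (the same results, one table per sheet), and FASTA files of the detected AMR genes. See the Markdown README on GitHub for details.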

Walkthrough

https://github.com/phac-nml/staramr/blob/development/doc/tutorial/staramr-tutorial.ipynb

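This is a practical tutorial that downloads dataset 1 (137 contigs) and dataset 2 (one chromosome and one plasmid), analyzes them with staramr to detect AMR, and compares the results with antimicrobial susceptibility test results (phenotypes).

# Download the sequences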
wget -O GCF_001478105.1.fasta.gz ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/478/105/GCF_001478105.1_Salmonella_enterica_CVM_N31384-SQ_v1.0/GCF_001478105.1_Salmonella_enterica_CVM_N31384-SQ_v1.0_genomic.fna.gz
wget -O GCF_001931595.1.fasta.gz ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/931/595/GCF_001931595.1_ASM193159v1/GCF_001931595.1_ASM193159v1_genomic.fna.gz

# Decompress
gunzip GCF_001478105.1.fasta.gz
gunzip GCF_001931595.1.fasta.gz

# Search with staramr
staramr search --pointfinder-organism salmonella -o out *.fasta

# Show all ResFinder hits
cut -f 1,2,4,5,6,7 out/resfinder.tsv | column -s$'\t' -t

Isolate ID       Gene         %Identity  %Overlap  HSP Length/Total Length  Contig
GCF_001478105.1  blaCMY-2     100.00     100.00    1146/1146                ref|NZ_JYVD01000056.1|
GCF_001931595.1  aac(3)-IVa   99.87      100.00    786/786                  ref|NZ_CP016411.1|
GCF_001931595.1  aph(3')-Ia   99.39      99.75     814/816                  ref|NZ_CP016411.1|
GCF_001931595.1  aph(4)-Ia    100.00     100.00    1026/1026                ref|NZ_CP016411.1|
GCF_001931595.1  blaCTX-M-65  100.00     100.00    876/876                  ref|NZ_CP016411.1|
GCF_001931595.1  dfrA14       99.79      100.00    483/483                  ref|NZ_CP016411.1|
GCF_001931595.1  floR         98.19      99.92     1214/1215                ref|NZ_CP016411.1|
GCF_001931595.1  sul1         100.00     100.00    927/927                  ref|NZ_CP016411.1|
GCF_001931595.1  tet(A)       100.00     100.00    1200/1200                ref|NZ_CP016411.1|

# Show all point mutations
cut -f 1,2,5,6,7,8,10 out/pointfinder.tsv | column -s$'\t' -t

Isolate ID       Gene         Position  Mutation             %Identity  %Overlap  Contig
GCF_001931595.1  gyrA (D87Y)  87        GAC -> TAC (D -> Y)  99.43      100.00    ref|NZ_CP016410.1|

# Show only Predicted Phenotype
cut -f 1,3 out/summary.tsv | column -s$'\t' -t

Isolate ID       Predicted Phenotype
GCF_001478105.1  ampicillin, amoxicillin/clavulanic acid, cefoxitin, ceftriaxone
GCF_001931595.1  gentamicin, kanamycin, hygromicin, ampicillin, ceftriaxone, trimethoprim, chloramphenicol, ciprofloxacin I/R, nalidixic acid, unknown[sul1_2_CP002151], tetracycline
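For further processing, the TSV files can be read with Python's csv module instead of cut and column. The sketch below groups ResFinder gene hits by isolate and applies an identity cutoff. The column names ("Isolate ID", "Gene", "%Identity") are taken from the header row printed above and may differ between staramr versions; load_resfinder_hits is a hypothetical helper.

```python
import csv
import io


def load_resfinder_hits(tsv_text, min_identity=0.0):
    """Return {isolate: [genes]} for hits whose %Identity is at least min_identity."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    hits = {}
    for row in reader:
        if float(row["%Identity"]) >= min_identity:
            hits.setdefault(row["Isolate ID"], []).append(row["Gene"])
    return hits


if __name__ == "__main__":
    # Two rows copied from the walkthrough output, with tabs restored.
    example = (
        "Isolate ID\tGene\t%Identity\t%Overlap\n"
        "GCF_001478105.1\tblaCMY-2\t100.00\t100.00\n"
        "GCF_001931595.1\tfloR\t98.19\t99.92\n"
    )
    print(load_resfinder_hits(example, min_identity=99.0))
```

To read a real result, pass the file contents, for example open("out/resfinder.tsv").read().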

Validation

To check the validity of the results, the walkthrough ends by comparing the staramr predictions with the AMR genotypes from NCBI Pathogen Detection and with the AST phenotypes from NCBI antimicrobial susceptibility testing. Read the walkthrough for the details.

https://github.com/phac-nml/staramr/blob/development/doc/tutorial/staramr-tutorial.ipynb

The tool is described as still under development, so more features may be added.


Addendum

Reference video

StaPH-B monthly webinar - May2019

References

GitHub - phac-nml/staramr: Scans genome contigs against the ResFinder and PointFinder databases.

Zankari E, Hasman H, Cosentino S, Vestergaard M, Rasmussen S, Lund O, Aarestrup FM, Larsen MV. 2012. Identification of acquired antimicrobial resistance genes. J. Antimicrob. Chemother. 67:2640–2644. doi: 10.1093/jac/dks261

Zankari E, Allesøe R, Joensen KG, Cavaco LM, Lund O, Aarestrup F. PointFinder: a novel web tool for WGS-based detection of antimicrobial resistance associated with chromosomal point mutations in bacterial pathogens. J Antimicrob Chemother. 2017; 72(10): 2764–8. doi: 10.1093/jac/dkx217

2018-07-04

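SerotypeFinder: predict serotypes

species identification (taxonomic profiling) bacteria web tool Serotype clinical and diagnostic

Escherichia coli is usually a harmless commensal, but some strains have acquired the ability to cause disease in humans and/or animals through specific virulence mechanisms. In some cases infection can be fatal (ref. 1 in the paper). Serotyping has been used to classify E. coli since the 1940s and has developed into a standardized procedure (refs. 2-4). Performing it requires a high level of expertise and access to antisera, and it is time-consuming and laborious.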

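The O:K:H serotype is based on a combination of three immunogenic structures: the lipopolysaccharide (LPS) O antigen, the capsular (K) antigen, and the flagellar (H) antigen.
Because most laboratories cannot perform K typing, O:H serotyping has become the gold standard for characterizing pathogenic E. coli. O:H serotyping is extremely important for outbreak detection, epidemiological surveillance, taxonomic differentiation of E. coli, detection of pathogenic serotypes within the species, and studies of clones and evolution. In contrast to more recently developed molecular typing methods such as pulsed-field gel electrophoresis (PFGE), ribotyping, and multilocus sequence typing (MLST), serotyping provides information directly related to the antigenic response, which matters for the ecology of an isolate. At present this typing method cannot be replaced by any other.
The current serotyping scheme comprises 188 O groups, named O1 to O188 (publication of O182 to O188 is pending), of which O31, O47, O67, O72, O94, and O122 have been removed from the scheme (refs. 3, 5). The scheme contains 53 H antigens, named H1 to H56 with the exception of H13, H22, and H50 (refs. 5, 6).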
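The numbering can be checked with a short calculation: H1 to H56 without H13, H22, and H50 gives 56 - 3 = 53 H antigens, and O1 to O188 without the six removed groups leaves 182 O designations in use. The Python sketch below encodes those sets and uses them to sanity-check an O:H string. It is a simplification for illustration: it ignores real-world results such as non-motile (H-) or untypeable isolates, and is_valid_serotype is a hypothetical helper.

```python
import re

REMOVED_O = {31, 47, 67, 72, 94, 122}
REMOVED_H = {13, 22, 50}

# Designations still in use according to the scheme described above.
O_GROUPS = {n for n in range(1, 189) if n not in REMOVED_O}
H_ANTIGENS = {n for n in range(1, 57) if n not in REMOVED_H}

SEROTYPE_RE = re.compile(r"^O(\d+):H(\d+)$")


def is_valid_serotype(serotype):
    """True if the string looks like Ox:Hy with both numbers in the scheme."""
    match = SEROTYPE_RE.match(serotype)
    if not match:
        return False
    return int(match.group(1)) in O_GROUPS and int(match.group(2)) in H_ANTIGENS


if __name__ == "__main__":
    print(len(O_GROUPS), len(H_ANTIGENS))
    print(is_valid_serotype("O157:H7"), is_valid_serotype("O31:H7"))
```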

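Lipopolysaccharide (LPS) is embedded in the outer membrane and consists of three components: lipid A, the core oligosaccharide, and an O-specific polysaccharide chain called the O antigen. The O antigen usually consists of 10 to 25 repeats of an oligosaccharide (the O unit) containing two to seven sugar residues drawn from a wide range of sugars, and it is the most variable region of the bacterial cell. Most O units are translocated across the membrane and polymerized by O-antigen processing proteins encoded by wzx (O-antigen flippase) and wzy (O-antigen polymerase). This translocation is referred to as the Wzx/Wzy-dependent pathway. Acidic capsular K antigens belonging to groups 1 and 4 also use the Wzx/Wzy-dependent pathway for translocation and are often co-expressed with one of the neutral LPS-linked polymers of O groups O8, O9, O20, or O101 (ref. 7). When these neutral LPS-linked O groups are absent, group 1 and 4 K antigens are given O designations: K87 is O32, K85 is O141, and K9 is O104. A separate, ABC-transporter-dependent pathway for O-antigen translocation has been described for O groups O8, O9, O52, and O99, in which translocation is mediated by a transporter protein encoded by wzm and an ATP-binding component encoded by wzt (ref. 9).
(Two paragraphs omitted.)
The authors previously presented VirulenceFinder, a tool for detecting E. coli virulence genes in whole-genome sequencing (WGS) data, which gave fast and comparable typing results and proved a good alternative to the current routine typing of verocytotoxin-producing E. coli (VTEC) (ref. 18). Here they introduce SerotypeFinder, a publicly available web tool from the Center for Genomic Epidemiology (CGE) built to predict E. coli serotypes from WGS data on the basis of the O-antigen processing genes wzx, wzy, wzm, and wzt. Comparing SerotypeFinder results with those of conventional serotyping showed the advantage of WGS-based typing, which can be performed far more quickly and cost-effectively than conventional serotyping.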

Manual

https://cge.cbs.dtu.dk/services/SerotypeFinder/instructions.php

Run

https://cge.cbs.dtu.dk/services/SerotypeFinder/

Only E. coli is supported.

Upload a FASTA file of assembled contigs. FASTQ cannot be used. As with the other Center for Genomic Epidemiology tools, jobs can take a while to start when the server is busy.

Output

https://cge.cbs.dtu.dk/services/SerotypeFinder/output.php

The serotype is shown in the middle of the page (b), along with the matching genes and related information. Each result can be downloaded as text.

References

Rapid and Easy In Silico Serotyping of Escherichia coli Isolates by Use of Whole-Genome Sequencing Data
Joensen KG, Tetzschner AM, Iguchi A, Aarestrup FM, Scheutz F
J Clin Microbiol. 2015 Aug;53(8):2410-26.

2018-07-04

drVM: an assembly tool for known eukaryotic viruses

assembly virus metagenome species identification (taxonomic profiling)

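Viruses are the most abundant biological entities on Earth and are found in every type of cellular life, including animals, plants, bacteria, and fungi. More than 4,500 viral species had been discovered at the time the paper was written, and their sequence information is being collected by researchers [refs. 1-3 in the paper]. Viruses have caused some of the most dramatic and deadly disease epidemics in human history, and outbreaks of viral disease tend to occur every few years. Over the past 20 years, avian influenza H5N1, SARS coronavirus, pandemic H1N1, MERS coronavirus, Ebola virus, and Zika virus have emerged in human populations. During such outbreaks, pathogen identification and comparative genome analysis are fundamental to disease surveillance and epidemiology. Next-generation sequencing (NGS) has emerged as an attractive approach for identifying viruses in a variety of samples, including blood, feces, sputum, and other swab samples [refs. 8, 9]. The technology allows potential pathogens to be identified in a single assay without prior knowledge of the target [ref. 10]. However, computational analysis of crude metagenomic deep-sequencing reads is very time-consuming.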

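SURPI [ref. 10] and Taxonomer [ref. 11] are pathogen detection tools proposed for rapid analysis of metagenomic NGS data and for comprehensive diagnostic applications. Neither tool, however, can assemble complete viral genomes. VIP follows the same strategy as SURPI and subtracts host and bacterial reads before identifying viral reads. Classifying reads at or below the genus level in this way is an alternative assembly strategy that may improve viral assembly. VIP builds a phylogenetic tree and makes it easy to visualize the relationship between a candidate virus and existing reference sequences, but its final report does not include the assembled viral sequences. In addition, SURPI and VIP are difficult to operate, which keeps most labs from using them and often leaves them accessible only to skilled users. VirusTAP is a web-based integrated NGS analysis tool for assembling viral genomes from metagenomic sequencing reads [ref. 13]. With this user-friendly tool, a viral genome can be obtained simply by uploading raw NGS data and clicking through a few options. However, VirusTAP accepts only Illumina data and does not support database updates. There is therefore a pressing need for a viral metagenomics tool that is computationally efficient, accurate (producing competitive viral genome assemblies), and easy to use.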

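Here the authors present drVM (detect and reconstruct known viral genomes from metagenomes), a bioinformatics pipeline that rapidly classifies NGS reads generated by Illumina or Ion Torrent sequencing against a viral database, partitions the viral reads into genus groups, and finally assembles the reads of each genus de novo. To ease distribution, a Docker container [ref. 14], an Amazon Machine Image, and a virtual machine image [ref. 15] were created for drVM. The performance of the platform was evaluated on 349 sequencing datasets from the Sequence Read Archive (SRA), drawn from 18 independent studies [8-10, 13, 16-29]. These datasets cover a variety of sample types, viruses, and sequencing depths. drVM proved highly capable of detecting and reconstructing a variety of known viral genomes and outperformed other analysis pipelines, including SURPI, VIP, and VirusTAP.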

drVM flowchart (figure reproduced from the paper).


Installation

Dependencies

python 2.7
blastn
g++
bz2file
screed
setuptools
khmer

Because there are many dependencies, the easiest route is the Docker image prepared by the authors. The downloads are hosted on SourceForge, which also has the PDF manual and an .ova file.

https://sourceforge.net/projects/sb2nhri/files/drVM/

docker pull 990210oliver/drvm

# Start the container with the host's docker_share/ shared as /home in the container
docker run -i -t -v /Users/user/docker_share/:/home 990210oliver/drvm

> drVM.py -h

# drVM.py -h

usage:

drVM.py -1 read1.fastq -2 read2.fastq [options]

options:

-type iontorrent [default: illumina]

-dn on/off [digital normalization. default: on]

-t <int> [number of threads, default: 2]

-md <int> [min depth, default: 1]

-ar <float> [alignment rate, default: 0.5 (0.1~0.9)]

-bi <int> [blast identity, default: 80 (50~100)]

-cl <int> [contig length, to keep assembly, default: 3000]

-keep [keep sam file]

An .ova file that can be imported into VirtualBox or VMware is also available for download.

Run

Assemble the test data described in the manual.

Prepare the database.

cd /opt/
mkdir VMDB && cd VMDB
wget https://sourceforge.net/projects/sb2nhri/files/drVM/sequence_20160316.tar.gz 
tar -zxvf sequence_20160316.tar.gz
# Some sequences are too short; remove them.
seqkit seq -m 500 sequence.fasta > sequence2.fa
CreateDB.py -s sequence2.fa # a "killed" error may mean the input is too large
export MyDB='/opt/VMDB'
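If seqkit is not available, the length filter (seqkit seq -m 500) is simple to reproduce. The Python sketch below keeps only FASTA records whose sequence is at least min_len bases long. It is a plain-Python stand-in written for this post, not part of drVM, and filter_fasta_min_length is a hypothetical name.

```python
def filter_fasta_min_length(lines, min_len):
    """Yield (header, sequence) for FASTA records with at least min_len bases."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None and sum(map(len, chunks)) >= min_len:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    # Do not forget the last record in the file.
    if header is not None and sum(map(len, chunks)) >= min_len:
        yield header, "".join(chunks)


if __name__ == "__main__":
    fasta = [">short", "ACGT", ">long", "ACGTACGT", "ACGT"]
    for name, seq in filter_fasta_min_length(fasta, min_len=10):
        print(f">{name}\n{seq}")
```

For the real database, iterate over an open file handle (for example open("sequence.fasta")) and write the kept records to sequence2.fa.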

Download the FASTQ files and run the assembly.

# Download and split into paired-end files in one step.
fastq-dump --split-spot --skip-technical --split-files DRR049387 

drVM.py -type illumina -1 DRR049387_1.fastq -2 DRR049387_2.fastq -t 16

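Output directory (shown as a screenshot in the original post).

When I tested in the Docker environment, no coverage plot file was generated. It may be better to run in the virtual machine environment, as the PDF manual does.

An Amazon EC2 setup is also provided. As described in the PDF manual, you can log in to AWS and use the drVM image. Costs depend on the instance specification, but this lets you run the analysis with minimal effort.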

References

drVM: a new tool for efficient genome assembly of known eukaryotic viruses from metagenomes
Lin HH, Liao YC.

Gigascience. 2017 Feb 1;6(2):1-10.

2018-07-03

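ABRicate: quickly search for antibiotic resistance and virulence genes

antibiotic resistance genes (ARGs) bacteria Virulence Factor species identification (taxonomic profiling) virus plasmid AMR docker

2019 3/1: updated commands and run workflow

2019 3/3: fixed links

2019 3/14: added conda installation

2019 4/12: added Docker link

2019 9/27: added comment

2020 4/25: addendum

ABRicate is a tool published on GitHub by Torsten Seemann for searching for antibiotic resistance genes, virulence genes, and Enterobacteriaceae plasmids. Web tools can keep you waiting for hours when they are busy, whereas ABRicate runs from the command line and returns results quickly. It supports multiple databases, and a database can be updated to the latest version with just two commands.

ABRicate detects AMR genes and similar targets using its bundled databases, and you can also add your own database.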

Installation

Tested on Ubuntu 14.04 with Miniconda3 4.0.5.

Source: GitHub

#bioconda
conda install -c bioconda -y abricate

#homebrew
brew tap brewsci/science  # only if you have not tapped it yet
brew tap brewsci/bio # only if you have not tapped it yet
brew install abricate

The databases are downloaded as well.

> abricate -h

Synopsis:

Find and collate amplicons in assembled contigs

Author:

Torsten Seemann <torsten.seemann@gmail.com>

Usage:

% abricate --list

% abricate [options] <contigs.{fasta,gbk,embl}[.gz]> > out.tab

% abricate --summary <out1.tab> <out2.tab> <out3.tab> ... > summary.tab

Options:

--help This help.

--debug Verbose debug output (default '0').

--quiet Quiet mode, no stderr output (default '0').

--version Print version and exit.

--setupdb Format all the BLAST databases (default '0').

--list List included databases (default '0').

--check Check dependencies are installed (default '0').

--summary Summarize multiple reports into a table (default '0').

--datadir [X] Location of database folders (default '/home/kazu/.pyenv/versions/miniconda3-4.0.5/bin/../db').

--db [X] Database to use (default 'resfinder').

--noheader Suppress column header row (default '0').

--csv Output CSV instead of TSV (default '0').

--minid [n.n] Minimum DNA %identity (default '75').

--mincov [n.n] Minimum DNA %coverage (default '0').

--nopath Strip filename paths from FILE column (default '0').

Documentation:

https://github.com/tseemann/abricate

Check the dependencies.

> abricate --check

$ abricate --check

Checking dependencies are installed:

Found 'blastn' => /home/kazu/.pyenv/versions/miniconda3-4.0.5/bin/blastn

Found 'makeblastdb' => /home/kazu/.pyenv/versions/miniconda3-4.0.5/bin/makeblastdb

Found 'blastdbcmd' => /home/kazu/.pyenv/versions/miniconda3-4.0.5/bin/blastdbcmd

Found 'seqret' => /home/kazu/.pyenv/versions/miniconda3-4.0.5/bin/seqret

Found 'gzip' => /bin/gzip

Found 'unzip' => /home/kazu/.pyenv/versions/miniconda3-4.0.5/bin/unzip

O.K

> abricate --list

$ abricate --list

DATABASE SEQUENCES DATE

argannot 1749 2018-Jul-16

card 2220 2018-Jul-16

ecoh 597 2018-Jul-16

ncbi 4324 2018-Jul-16

plasmidfinder 263 2018-Jul-16

resfinder 2280 2018-Jul-16

vfdb 2597 2018-Jul-16

Addendum

Docker image

https://hub.docker.com/r/staphb/abricate/

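Run

Specify a FASTA file and a database. The input is first converted to FASTA with the EMBOSS seqret command, so slightly malformed FASTA files are repaired and still run correctly.

abricate --db resfinder input.fa > output # you can also let it print to the terminal (stdout) instead of redirecting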

--db Database to use (default 'resfinder').
--minid Minimum DNA %identity (default '75').
--csv Output CSV instead of TSV (default '0').
--datadir Location of database folders (default '/usr/local/Cellar/abricate/0.8/libexec/bin/../db').

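Several FASTA files can be analyzed at once with a wildcard. GenBank files are also accepted.

Output columns

(Table reproduced from GitHub in the original post.)

Switching to another database: check the available databases with abricate --list.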

DATABASE SEQUENCES DATE

argannot 1749 2018-Jul-3

card 2220 2018-Jul-3

ecoh 597 2018-Jul-3

ncbi 4324 2018-Jul-3

plasmidfinder 263 2018-Jul-3

resfinder 2280 2018-Jul-3

vfdb 2597 2018-Jul-3

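The default database is resfinder in the version whose help output is shown above (newer ABRicate releases default to ncbi). Use --db to switch databases. For example, to use card: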

abricate --db card input.fa

Merging multiple results. Here two results are combined.

abricate 1.fna > 1.tab
abricate 2.fna > 2.tab

abricate --summary 1.tab 2.tab

--summary Summarize multiple reports into a table (default '0').

Updating a database to the latest version. For resfinder, for example, the steps are:

abricate-get_db --db resfinder --force
abricate-get_db --db resfinder

You can also add your own database. See GitHub for details.

Addendum

Run every database and write a matrix (summary) file.

# Define a variable
CONTIG=assembly.fa

abricate --db card $CONTIG > card_out
abricate --db argannot $CONTIG > argannot_out
abricate --db ncbi $CONTIG > ncbi_out
abricate --db resfinder $CONTIG > resfinder_out
abricate --db vfdb $CONTIG > vfdb_out
abricate --db plasmidfinder $CONTIG > plasmidfinder_out
abricate --db ecoh $CONTIG > ecoh_out

abricate --summary card_out argannot_out ncbi_out resfinder_out vfdb_out plasmidfinder_out ecoh_out > summary 

# Or search a single database and write a summary
abricate --db resfinder binned*fasta > resfinder_out
abricate --summary resfinder_out > summary
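The per-database commands above hard-code the database names. Because the set of bundled databases changes between releases, one option is to read the names from the abricate --list table. The Python sketch below parses a table in the three-column format shown in this post (DATABASE, SEQUENCES, DATE) and builds one command per database. Newer releases may print additional columns, in which case the parser would need adjusting. Both function names are hypothetical.

```python
def parse_abricate_list(text):
    """Parse 'abricate --list' output (three-column format) into tuples."""
    databases = []
    for line in text.splitlines():
        parts = line.split()
        # Skip blank lines, the header row, and anything that is not a 3-column row.
        if len(parts) != 3 or parts[0] == "DATABASE":
            continue
        name, n_sequences, date = parts
        databases.append((name, int(n_sequences), date))
    return databases


def commands_for_all(databases, contig):
    """Return (argv, output file name) pairs, one per database."""
    return [(["abricate", "--db", name, contig], f"{name}_out")
            for name, _, _ in databases]


if __name__ == "__main__":
    listing = "DATABASE SEQUENCES DATE\n\ncard 2220 2018-Jul-16\n\nvfdb 2597 2018-Jul-16\n"
    for argv, outfile in commands_for_all(parse_abricate_list(listing), "assembly.fa"):
        print(" ".join(argv), ">", outfile)
```

Each argv list can be run with subprocess.run(argv, stdout=open(outfile, "w")), followed by abricate --summary over the output files as shown above.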

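Note that the search is not done at the protein level, so resistance genes that are distant from the database sequences will not be detected (use --minid to change the DNA identity threshold).

2020 4/25 addendum

More databases are now available.

References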

https://github.com/tseemann/abricate

