2020-07-22

真核生物の予測されたタンパク質のデータベース EukProt

2020 7/23 説明追加

　EukProtは、公開されている予測タンパク質セットと、真核生物の多様性を表すために選択された非注釈ゲノムのデータベースで、すべての主要なスーパーグループからの742種とorphan taxaを含む。系統図、遺伝子ファミリー進化、その他の遺伝子ベースの研究のための単一の便利なリソースを提供することを目的としている。各種は、下流の解析を容易にするためにUniEuk分類学的枠組みの中に配置され、各データセットは、解析間の比較と複製を容易にするために、一意の永続的な識別子に関連付けられている。また、そのデータベースは現在バージョン2であり、すべてのバージョンは永久に保存され、FigShareを介して利用できるようになる。真核生物の多様性と多様化を理解するための研究を促進するための共同リソースを構築することを目標に、今後のバージョンに含まれる新しいデータセットや新しいアノテーション機能についての提案をコミュニティに提供していく。

　一般的には、種ごとに1つの株/アイソレートからのデータのみを収録している。しかし、ある種の特定の株について単一のトランスクリプトームデータセットのみが利用可能であり、同じ種の他の株について追加のトランスクリプトームデータセットが発表されている場合、単一のトランスクリプトームがある条件または実験でのみ発現された遺伝子を欠く可能性を防ぐために、デフォルトのパラメータ値で実行されたCD-HIT (Li & Godzik, 2006)を用いてそれらを結合した。複数の系統をマージして種のデータセットを作成した場合（そのようなケースは25件あった）、この情報はデータセットのメタデータに記載されている。

　種および株の同定は、データセットを記載したpublicationsを読み、命名変更の文献を参照し、18SリボソームDNA配列を参照配列データベースと比較することで行った。以前に他の名称で知られていた種については、種が元々属名として割り当てられていたが、種レベルでは同定されていない場合を除き、データセットのメタデータに記録した（例えば、現在はGoniomonas avonleaとして同定されているGoniomonas sp.は、旧名称としてリストアップされていない）。

　他のいくつかのリソースで使用されている分類体系とは異なり、我々が提供するすべての種の完全な分類体系（UniEukプロジェクトで開発されたフレームワークに従う）は、固定された数のランクに基づいているのではなく、系統的証拠を可能な限り密接に一致させるために、自由で無制限の数の分類レベルに基づいている。これは、エンドユーザーにより多くの情報と柔軟性を提供するが、下流の解析結果をまとめるのがより困難になる可能性がある。そこで、真核生物の多様性を、同等の系統的深さや生態学的関連性を持つ固定数の分類群に分散させることが有用である場合に、エンドユーザーを支援するために、２つの追加フィールド（"supergroup "と "taxogroup"）を提供する。42の「スーパーグループ」（そのうち38がEukProtに含まれている）は、「古典的な」アルベオラータ（Alveolata）、根粒菌（Rhizaria）、ストラメノピレス（Stramenopiles）と同等の系統深さを持つ、厳密に単系統で深く分岐する真核生物の系統から構成されている（以下略）。

data

FigShare 252に掲載されているEukProtのデータベースは5つのファイルを含む1つのアーカイブで配布されている。1つ目のファイルには、予測されたタンパク質を含むゲノム（239）またはシングルセルゲノム（10）、トランスクリプトーム（453）、シングルセルトランスクリプトーム（7）、ESTアセンブリ（17）のいずれかを持つ種の726のタンパク質データセットが含まれている。。2つ目のファイルには予測されたタンパク質のアノテーションを欠く16のゲノム、シングルセルゲノムを持つ15種とゲノム配列を持つ1種が含まれている。これらから、翻訳配列相同性検索ソフトウェアを用いて、興味のあるタンパク質を検索できる。3つ目のファイルには、公開されているmRNA配列のリードはあるが、公開されているアセンブリはない53種のトランスクリプトームコンティグが含まれている。これらのアセンブリから予測されたタンパク質がタンパク質ファイルに含まれている。最後に、データベースのメタデータは2つのファイルとして配布されている。1つは現在のバージョンのデータベース（742）に含まれるデータセットのためのファイルで、もう1つは含まれないデータセットのためのファイル（50）で、含まれなかった理由（例えば、配列が公開されているが公開されていない場合や、同じ種のより質の高いデータセットに置き換えられた場合など）を添付している。

EukProtのメタデータは、データソースからダウンロードした後、各データセットに適用された追加の手順を示している指定されたデータセットのメタデータレコードに記載されていない限り、すべてのソフトウェアのパラメータ値はデフォルトである。

'assemble mRNA'：Trinity v. 2.8.4を用いたde novoトランスクリプトームアセンブリ。Illuminaの入力リードをTrimmomatic v. 0.3.9の'--trimmomatic'オプションと'ILLUMINACLIP:[454 adapters FASTA file]:2:30:10 SLIDINGWINDOW:4:5 LEADING:5 TRAILING:5 MINLEN:25'を指定してアダプタートリミングと品質トリミングを行なった。
'translate mRNA'：Transdecoder v. 5.3.0を用いたmRNA配列のde novo翻訳、http://transdecoder.github.io/。与えられた種の予測タンパク質配列の数が入力mRNA配列の半分以下の場合、予測タンパク質の最小長さをデフォルトの100から50に減らした。
CD-HIT」：CD-HIT v.4.6を使用してタンパク質配列のクラスタリングを行い、冗長性のないデータセットを作成した。このツールは主に同一種の異なる系統のタンパク質予測を組み合わせるために使用したが、冗長性の証拠を示した非常に大きな予測タンパク質セット（50,000タンパク質以上）のサイズを縮小するためにも使用した。
EMBOSS パッケージ v. 6.6.6.0.0,のextractfeatを使い、遺伝子アノテーションを持つが、公開されているタンパク質配列がないゲノムからコーディング配列（CDS）を作成した。CDSを直接タンパク質に翻訳するためにEMBOSSのtranseqを使用した。また、その配列は、遺伝子アノテーションが公開されていないゲノムから収集したものである。

データセット

Figshareからデータをダウンロードした。

2020 7/21アクセス時のVersionは2で、ダウンロードサイズは5GBくらいだった。解凍すると以下のようなファイルが展開される。

f:id:kazumaxneo:20200722233719p:plain

１、EukProt_proteins.v02.2020_06_30.tgz

ゲノム（239）またはシングルセルゲノム（10）の予測されたタンパク質配列、トランスクリプトーム（453）、シングルセルトランスクリプトーム（7）、ESTアセンブリ（17）の計726のタンパク質データセット。

解凍したところ。合計７２６ファイルある。中身はアミノ酸配列。

f:id:kazumaxneo:20200722234537p:plain

２、EukProt_unannotated_genomes.v02.2020_06_30.tgz

予測されたタンパク質のアノテーションを欠く16のゲノム配列（シングルセル１５＋ゲノム１）。

解凍したところ。合計１６ファイルある。中身は塩基配列。

f:id:kazumaxneo:20200722235307p:plain

３、EukProt_assembled_transcriptomes.v02.2020_06_30.tgz

公開されているmRNA配列リードはあるが、アセンブリの公開がない53種のトランスクリプトームアセンブリ。これらのアセンブリから予測されるタンパク質は、上のリストのタンパク質ファイルに含まれている。

解凍したところ。合計５３ファイルある。このファイル自体は塩基配列。

f:id:kazumaxneo:20200722235718p:plain

他のEukProt_included_data_sets.v02.2020_06_30.txtおよびEukProt_not_included_data_sets.v02.2020_06_30.txtは、データベースに含まれる742データセットまたは含まれない50のデータセットの表になる。

EukProt_included_data_sets.v02.2020_06_30.txt

f:id:kazumaxneo:20200723000435p:plain

EukProt_not_included_data_sets.v02.2020_06_30.txt

f:id:kazumaxneo:20200723000341p:plain

表のEukProt_IDはデータセットに関連付けられた一意の識別子になる。これはバージョン更新では変更されない。新しいデータセットがその種で利用可能になった場合は、新しい一意の識別子が割り当てられる。Taxonomy_UniEukはUniEuk分類法における種の完全な系統でセミコロンで区切られている。データがダウンロードされたURLやデータを公開した論文も記載されている。また公開されているファイルを処理するために行われた操作のカラムもある。

引用

EukProt: a database of genome-scale predicted proteins across the diversity of eukaryotic life

Daniel J. Richter, Cédric Berney, Jürgen F. H. Strassert, Fabien Burki, Colomban de Vargas

bioRxiv, Posted July 01, 2020

2020-07-21

メタゲノム由来コンティグから真核生物のタンパク質配列を予測する MetaEuk

2020 Microbiome (Journal) k-mer metagenome 生物種の推定 (taxonomic profiling) GFF Evidence-driven gene prediction

2020 7/26 更新完了

　メタゲノミクスは、微生物とその生物学的、生物医学的、地球化学的プロセスへの関与の研究に革命をもたらしており、事前の培養を必要とせずに、膨大な数の生物を直接シーケンスして調査することが可能になっている。単細胞真核生物は、主な捕食者、分解者、光栄養剤、細菌宿主、共生体、植物や動物への寄生体として、ほとんどの微生物群集において重要な役割を果たしている。そのため、これらの役割を研究することは、生態学、バイオテクノロジー、人の健康、進化に大きな関心を寄せている。しかし、一般的にシークエンシングのカバレッジが低く、遺伝子やゲノムの構造が複雑で、真核生物に特有の実験や計算方法がないため、メタゲノミクスの傍観者として扱われてきた。
　MetaEukは、真核生物のメタゲノムコンティグにおけるタンパク質コード遺伝子のハイスループット、リファレンスベースの発見、アノテーションのためのツールキットである。可能なすべてのエクソンをカバーする6フレーム翻訳フラグメントの高速検索を行い、マッチしたものを複数のエクソンタンパク質に最適に結合する。7つの多様なアノテーションされたゲノムのベンチマークを使用して、リファレンスデータベースとの配列類似性が低い条件下でもMetaEukが高感度であることを示した。大規模なメタゲノムデータから新規の真核生物タンパク質を発見するMetaEukの能力を実証するために、Tara Oceansプロジェクトの912サンプルからコンティグを収集した。MetaEukは、10台の16コアサーバー上で8日間で1,200万以上のタンパク質コード遺伝子を予測した。発見されたタンパク質のほとんどは、既知のタンパク質と高度に乖離しており、サンプル数が非常に少ない真核生物のスーパーグループに由来するものである。

MetaEukはBUSCOのゲノムモードで、保存されたタンパク質コード遺伝子を見つけるためにも使用されています。

インストール

ubuntu18.04でテストした（CPUはxeon platinum）。

Github

ソースからコンパイルするか、静的コンパイルされたバージョンをダウンロードして使用できる（SSE4.1以上のSIMD対応CPU）。

# static build sse4.1
wget https://mmseqs.com/metaeuk/metaeuk-linux-sse41.tar.gz
tar xvfz metaeuk-linux-sse41.tar.gz
export PATH=$(pwd)/metaeuk/bin/:$PATH

# static build AVX2
wget https://mmseqs.com/metaeuk/metaeuk-linux-avx2.tar.gz
tar xvfz metaeuk-linux-avx2.tar.gz
export PATH=$(pwd)/metaeuk/bin/:$PATH

> metaeuk

$ metaeuk

MetaEuk is homology-based strategy to efficiently query many contigs assembled from metagenomic samples against a comprehensive protein/profile target database to describe their protein repertoire. It does not require preliminary binning of the contigs and makes no assumption concerning the splicing signal when searching for multi-exon proteins.

Please cite:

Levy Karin E, Mirdita M, Soding J: MetaEuk — sensitive, high-throughput gene discovery, and annotation for large-scale eukaryotic metagenomics. Microbiome (2020) 8:48.

metaeuk Version: e7e2d95f454105e5d4aa40bc221a8b18fdc1ce41

usage: metaeuk <command> [<args>]

Main workflows for database input/output

predictexons Call optimal exon sets based on protein similarity

easy-predict Predict protein-coding genes from contigs (fasta/database) based on similarities to targets (fasta/database) and return a fasta of the predictions in a single step

taxtocontig Assign taxonomic labels to predictions and aggregate them per contig

reduceredundancy Cluster metaeuk calls that share an exon and select representative prediction

unitesetstofasta Create a fasta output from optimal exon sets

groupstoacc Create a TSV output from representative prediction to member

An extended list of all modules can be obtained by calling 'metaeuk -h'.

> metaeuk easy-predict

$ metaeuk easy-predict

usage: metaeuk easy-predict <i:contigs> <i:targets> <o:predictionsFasta> <tmpDir> [options]

options:

-s FLOAT Sensitivity: 1.0 faster; 4.0 fast; 7.5 sensitive [4.000]

--max-seqs INT Maximum results per query sequence allowed to pass the prefilter (affects sensitivity) [300]

-a BOOL Add backtrace string (convert to alignments with mmseqs convertalis module) [0]

--alignment-mode INT How to compute the alignment:

0: automatic

1: only score and end_pos

2: also start_pos and cov

3: also seq.id

4: only ungapped alignment [2]

-e FLOAT List matches below this E-value (range 0.0-inf) [100.000]

--min-seq-id FLOAT List matches above this sequence identity (for clustering) (range 0.0-1.0) [0.000]

--min-aln-len INT Minimum alignment length (range 0-INT_MAX) [0]

--seq-id-mode INT 0: alignment length 1: shorter, 2: longer sequence [0]

--alt-ali INT Show up to this many alternative alignments [0]

-c FLOAT List matches above this fraction of aligned (covered) residues (see --cov-mode) [0.000]

--cov-mode INT 0: coverage of query and target

1: coverage of target

2: coverage of query

3: target seq. length has to be at least x% of query length

4: query seq. length has to be at least x% of target length

5: short seq. needs to be at least x% of the other seq. length [0]

--max-rejected INT Maximum rejected alignments before alignment calculation for a query is stopped [2147483647]

--max-accept INT Maximum accepted alignments before alignment calculation for a query is stopped [2147483647]

--e-profile FLOAT Include sequences matches with < e-value thr. into the profile (>=0.0) [0.001]

--num-iterations INT Number of iterative profile search iterations [1]

--rescore-mode INT Rescore diagonals with:

0: Hamming distance

1: local alignment (score only)

2: local alignment

3: global alignment

4: longest alignment fullfilling window quality criterion [0]

--allow-deletion BOOL Allow deletions in a MSA [0]

--min-length INT Minimum codon number in open reading frames [15]

--max-length INT Maximum codon number in open reading frames [32734]

--max-gaps INT Maximum number of codons with gaps or unknown residues before an open reading frame is rejected [2147483647]

--contig-start-mode INT Contig start can be 0: incomplete, 1: complete, 2: both [2]

--contig-end-mode INT Contig end can be 0: incomplete, 1: complete, 2: both [2]

--orf-start-mode INT Orf fragment can be 0: from start to stop, 1: from any to stop, 2: from last encountered start to stop (no start in the middle) [1]

--forward-frames STR Comma-seperated list of frames on the forward strand to be extracted [1,2,3]

--reverse-frames STR Comma-seperated list of frames on the reverse strand to be extracted [1,2,3]

--translate INT Translate ORF to amino acid [0]

--use-all-table-starts BOOL Use all alteratives for a start codon in the genetic table, if false - only ATG (AUG) [0]

--id-offset INT Numeric ids in index file are offset by this value [0]

--add-orf-stop BOOL Add stop codon '*' at complete start and end [0]

--search-type INT Search type 0: auto 1: amino acid, 2: translated, 3: nucleotide, 4: translated nucleotide alignment [0]

--start-sens FLOAT Start sensitivity [4.000]

--sens-steps INT Number of search steps performed from --start-sens to -s [1]

--metaeuk-eval FLOAT maximal combined evalue of an optimal set [0.0, inf] [0.001]

--metaeuk-tcov FLOAT minimal length ratio of combined set to target [0.0, 1.0] [0.500]

--max-intron INT Maximal allowed intron length [10000]

--min-intron INT Minimal allowed intron length [15]

--min-exon-aa INT Minimal allowed exon length in amino acids [11]

--max-overlap INT Maximal allowed overlap of consecutive exons in amino acids [10]

--set-gap-open INT Gap open penalty (negative) for missed target amino acids between exons [-1]

--set-gap-extend INT Gap extend penalty (negative) for missed target amino acids between exons [-1]

--overlap INT allow predictions to overlap another on the same strand. when not allowed (default), only the prediction with better E-value will be retained [0,1] [0]

--protein INT translate the joint exons coding sequence to amino acids [0,1] [0]

--target-key INT write the target key (internal DB identifier) instead of its accession. By default (0) target accession will be written [0,1] [0]

--reverse-fragments INT reverse AA fragments to compute under null [0,1] [0]

--threads INT Number of CPU-cores used (all by default) [56]

--compressed INT Write compressed output [0]

-v INT Verbosity level: 0: quiet, 1: +errors, 2: +warnings, 3: +info [3]

examples:

Combines the following MetaEuk modules into a single step: predictexons, reduceredundancy and unitesetstofasta

references:

- Levy Karin E, Mirdita M, Soeding J: MetaEuk – sensitive, high-throughput gene discovery and annotation for large-scale eukaryotic metagenomics. biorxiv, 851964 (2019).

Show an extended list of options by calling 'metaeuk easy-predict -h'.

Not enough input paths provided. 4 paths are required.

実行方法

１、必要ならデータベースを作成する。

metaeuk createdb database_proteins.faa database

easy-predict

このワークフローはMetaEukのpredictexons、reduceredundancy、unitesetstofastaモジュールを1つのステップにまとめたものになる。入力はFastaファイルまたは以前に作成されたデータベース）、そしてターゲットのタンパク質配列のFastaファイルまたは以前に作成されたタンパク質またはタンパク質プロファイルのデータベースになる。

メタゲノムアセンブリのコンティグ配列をクエリにして、アミノ酸データベースを検索する。predictexonsがコンティグからタンパク質配列（最適なexon配列セット）を予測し、冗長性を除いてからfasta出力する。

metaeuk easy-predict query_nucleotides.fasta database predsResults tempFolder

出力

f:id:kazumaxneo:20200726170927p:plain

他に、予測されたMetaEukタンパク質にtaxonomic labelを割り当て、その予測値をコンティグに付与する機能もある。手順はGithubで確認して下さい。

引用

MetaEuk—sensitive, high-throughput gene discovery, and annotation for large-scale eukaryotic metagenomics

Eli Levy Karin, Milot Mirdita & Johannes Söding
Microbiome volume 8, Article number: 48 (2020)

関連

2020-07-20

ノイズの多いロングリードを使ってSVをコールする SVIM

2019 Bioinformatics structural variations (SV) Nanopore long read Pacbio

2020 7/21 出力画像追加

　構造変異とは、50 bpよりも大きいゲノム変異と定義されている。構造変異は、一塩基多型や小さな挿入・欠失よりも、任意のゲノムのより多くの塩基に影響を与えることが示されている。さらに、これらの変異はヒトの表現型や多様性に大きな影響を与え、多くの疾患と関連している。それらのサイズとリピートとの関連性のため、ショットガンシーケンシングでは、特にショートリードに基づいている場合には検出が困難である。Pacific BiosciencesやOxford Nanopore Technologiesが提供するようなロングリード、１分子シーケンシング技術は、数千塩基対の長さのリードを生成する。より高いエラー率とシークエンシングコストにもかかわらず、ロングリードシークエンシングは構造変異の検出に多くの利点をもたらす。しかし、利用可能なソフトウェアツールでは、その可能性を十分に活用できていないのが現状である。
　本研究では、ロングリードデータから構造バリアントを高感度に検出し、正確に特徴付けするためのツールであるSVIMを紹介する。SVIMは、リードアラインメントからの構造バリアントシグネチャの収集、クラスタリング、および組み合わせのための3つのコンポーネントから構成されている。SVIMは、タンデム重複やインタースパン重複、新規エレメント挿入などの類似したタイプを含む5つの異なるバリアントクラスを識別する。SVIMは、ゲノム上の重複の発生源と発生先の両方を抽出できる点でユニークである。Pacific Biosciences社とNanopore社のシーケンシングマシンによるシミュレーションデータと実データでの評価では、既存のツールと比較して良好な結果が得られている。
　SVIM のソースコードと実行ファイルは Github: github.com/eldariont/svim で公開されている。SVIMはPython 3で実装されており、biocondaとPythonパッケージインデックスで利用できる。

SVIMは4つの主要なステップで構成される。

COLLECTは、ロングリードアラインメントのSVのシグネチャを検出する。
CLUSTER は同じ SV からのシグネチャをマージする。
COMBINEは、異なるゲノム領域からのクラスタを結合し、異なるSVタイプに分類する。
GENOTYPEはSVにまたがるアラインメントを用いて遺伝子型を決定する。

wiki

https://github.com/eldariont/svim/wiki

raw long readsではなくハプロイドや二倍体ゲノムアセンブリ、またはコンティグを解析する場合は、別の方法であるSVIM-asmを使う。

インストール

ubuntu18.04LTSでテストした。

Github

#bioconda (link)
conda create -n svim_env --channel bioconda svim
conda activate svim_env

#pip (Python 3.6 or newer)
pip install svim

#Install from github 
git clone https://github.com/eldariont/svim.git
cd svim
pip install .

> svim -h

$ svim -h

usage: svim [-h] [--version] {reads,alignment} ...

SVIM (pronounced SWIM) is a structural variant caller for long reads.

It discriminates five different variant classes: deletions, tandem and interspersed duplications,

inversions and insertions. SVIM is unique in its capability of extracting both the genomic origin and

destination of duplications.

SVIM consists of four major steps:

- COLLECT detects signatures for SVs in long read alignments

- CLUSTER merges signatures that come from the same SV

- COMBINE combines clusters from different genomic regions and classifies them into distinct SV types

- GENOTYPE uses alignments spanning SVs to determine their genotype

SVIM can process two types of input. Firstly, it can detect SVs from raw reads by aligning them to a given reference genome first ("SVIM.py reads [options] working_dir reads genome").

Alternatively, it can detect SVs from existing reads alignments in SAM/BAM format ("SVIM.py alignment [options] working_dir bam_file").

positional arguments:

{reads,alignment} modes

reads Detect SVs from raw reads. Align reads to given reference

genome first.

alignment Detect SVs from an existing alignment

optional arguments:

-h, --help show this help message and exit

--version, -v show program's version number and exit

> svim reads -h

$ svim reads -h

usage: svim reads [-h] [--verbose] [--cores CORES]

[--aligner {ngmlr,minimap2}] [--nanopore]

[--min_mapq MIN_MAPQ] [--min_sv_size MIN_SV_SIZE]

[--max_sv_size MAX_SV_SIZE]

[--segment_gap_tolerance SEGMENT_GAP_TOLERANCE]

[--segment_overlap_tolerance SEGMENT_OVERLAP_TOLERANCE]

[--all_bnds]

[--partition_max_distance PARTITION_MAX_DISTANCE]

[--distance_normalizer DISTANCE_NORMALIZER]

[--cluster_max_distance CLUSTER_MAX_DISTANCE]

[--del_ins_dup_max_distance DEL_INS_DUP_MAX_DISTANCE]

[--trans_destination_partition_max_distance TRANS_DESTINATION_PARTITION_MAX_DISTANCE]

[--trans_partition_max_distance TRANS_PARTITION_MAX_DISTANCE]

[--trans_sv_max_distance TRANS_SV_MAX_DISTANCE]

[--skip_genotyping] [--minimum_score MINIMUM_SCORE]

[--homozygous_threshold HOMOZYGOUS_THRESHOLD]

[--heterozygous_threshold HETEROZYGOUS_THRESHOLD]

[--minimum_depth MINIMUM_DEPTH] [--sample SAMPLE]

[--types TYPES] [--sequence_alleles] [--insertion_sequences]

[--tandem_duplications_as_insertions]

[--interspersed_duplications_as_insertions] [--read_names]

[--zmws]

working_dir reads genome

positional arguments:

working_dir Working and output directory. Existing files in the

directory are overwritten. If the directory does not

exist, it is created.

reads Read file (FASTA, FASTQ, gzipped FASTA, gzipped FASTQ

or file list). The read file has to have one of the

following supported file endings: FASTA: .fa, .fasta,

.FA, .fa.gz, .fa.gzip, .fasta.gz, .fasta.gzip FASTQ:

.fq, .fastq, .FQ, .fq.gz, .fq.gzip, .fastq.gz,

.fastq.gzip FILE LIST: .fa.fn, fq.fn

genome Reference genome file (FASTA)

optional arguments:

-h, --help show this help message and exit

--verbose Enable more verbose logging (default: False)

ALIGN:

--cores CORES CPU cores to use for the alignment (default: 1)

--aligner {ngmlr,minimap2}

Tool for read alignment: ngmlr or minimap2 (default:

ngmlr)

--nanopore Use Nanopore settings for read alignment (default:

False)

COLLECT:

--min_mapq MIN_MAPQ Minimum mapping quality of reads to consider (default:

20). Reads with a lower mapping quality are ignored.

--min_sv_size MIN_SV_SIZE

Minimum SV size to detect (default: 40). SVIM can

potentially detect events of any size but is limited

by the signal-to-noise ratio in the input alignments.

That means that more accurate reads and alignments

enable the detection of smaller events. For current

PacBio or Nanopore data, we would recommend a minimum

size of 40bp or larger.

--max_sv_size MAX_SV_SIZE

Maximum SV size to detect (default: 100000). This

parameter is used to distinguish long deletions (and

inversions) from translocations which cannot be

distinguished from the alignment alone. Split read

segments mapping far apart on the reference could

either indicate a very long deletion (inversion) or a

translocation breakpoint. SVIM calls a translocation

breakpoint if the mapping distance is larger than this

parameter and a deletion (or inversion) if it is

smaller or equal.

--segment_gap_tolerance SEGMENT_GAP_TOLERANCE

Maximum tolerated gap between adjacent alignment

segments (default: 10). This parameter applies to gaps

on the reference and the read. Example: Deletions are

detected from two subsequent segments of a split read

that are mapped far apart from each other on the

reference. The segment gap tolerance determines the

maximum tolerated length of the read gap between both

segments. If there is an unaligned read segment larger

than this value between the two segments, no deletion

is called.

--segment_overlap_tolerance SEGMENT_OVERLAP_TOLERANCE

Maximum tolerated overlap between adjacent alignment

segments (default: 5). This parameter applies to

overlaps on the reference and the read. Example:

Deletions are detected from two subsequent segments of

a split read that are mapped far apart from each other

on the reference. The segment overlap tolerance

determines the maximum tolerated length of an overlap

between both segments on the read. If the overlap

between the two segments on the read is larger than

this value, no deletion is called.

--all_bnds Output all rearrangements additionally in BND notation

(default: False). By default, SV signatures from the

read alignments are used to detect complete SVs, such

as deletions, insertions and inversions. When this

option is enabled, all SVs are also output in breakend

(BND) notation as defined in the VCF specs. For

instance, a deletion gets two records in the VCF

output: 1. the normal <DEL> record and 2. a <BND>

record representing the novel adjacency between the

deletion's start and end coordinate in the sample

genome.

CLUSTER:

--partition_max_distance PARTITION_MAX_DISTANCE

Maximum distance in bp between SVs in a partition

(default: 1000). Before clustering, the SV signatures

are divided into coarse partitions. This parameter

determines the maximum distance between two subsequent

signatures in the same partition. If the distance

between two subsequent signatures is larger than this

parameter, they are distributed into separate

partitions.

--distance_normalizer DISTANCE_NORMALIZER

Distance normalizer used for span-position distance

(default: 900). SVIM clusters the SV signatures using

an hierarchical clustering approach and a novel

distance metric called "span-position distance". Span-

position distance is the sum of two components, span

distance and position distance. The span distance is

the difference in lengths between signatures

normalized by the greater length and always lies in

the interval [0,1]. The position distance is the

difference in position between signatures normalized

by the distance normalizer (this parameter). For a

position difference of 1.8kb and a distance normalizer

of 900, the position distance will be 2. A smaller

distance normalizer leads to a higher position

distance and as a consequence increases the importance

of the position distance in the span-position distance

relative to the span distance.

--cluster_max_distance CLUSTER_MAX_DISTANCE

Maximum span-position distance between SVs in a

cluster (default: 0.3). This is the most important

parameter because it determines the strictness of

clustering. Choosing a large value leads to fewer but

larger clusters with larger distances between its

members. Choosing a small value leads to more but

smaller clusters with smaller distances between its

members. This parameter determines the height of the

cut-off in the hierarchical clustering dendrogram.

COMBINE:

--del_ins_dup_max_distance DEL_INS_DUP_MAX_DISTANCE

Maximum span-position distance between the origin of

an insertion and a deletion to be flagged as a

potential cut&paste insertion (default: 1.0)

--trans_destination_partition_max_distance TRANS_DESTINATION_PARTITION_MAX_DISTANCE

Maximum distance in bp between translocation

breakpoint destinations in a partition (default: 1000)

--trans_partition_max_distance TRANS_PARTITION_MAX_DISTANCE

Maximum distance in bp between translocation

breakpoints in a partition (default: 200)

--trans_sv_max_distance TRANS_SV_MAX_DISTANCE

Maximum distance in bp between a translocation

breakpoint and an SV signature to be combined

(default: 500)

GENOTYPE:

--skip_genotyping Disable genotyping (default: False)

--minimum_score MINIMUM_SCORE

Minimum score for genotyping (default: 3). Only SV

candidates with a higher or equal score are genotyped.

Depending on the score distribution among the SV

candidates, decreasing this value increases the

runtime. We recommend to choose a value close to the

score threshold used for filtering the SV candidates.

--homozygous_threshold HOMOZYGOUS_THRESHOLD

Minimum variant allele frequency to be called as

homozygous (default: 0.8). Allele frequency is

computed as the fraction of reads supporting the

variant over the total number of reads covering the

variant. Variants with an allele frequence greater

than or equal to this threshold are called as

homozygous alternative.

--heterozygous_threshold HETEROZYGOUS_THRESHOLD

Minimum variant allele frequency to be called as

heterozygous (default: 0.2). Allele frequency is

computed as the fraction of reads supporting the

variant over the total number of reads covering the

variant. Variants with an allele frequence greater

than or equal to this threshold but lower than the

homozygous threshold are called as heterozygous

alternative. Variants with an allele frequence lower

than this threshold are called as homozygous

reference.

--minimum_depth MINIMUM_DEPTH

Minimum total read depth for genotyping (default: 4).

Variants covered by a total number of reads lower than

this value are not assigned a genotype (./. in the

output VCF file).

OUTPUT:

--sample SAMPLE Sample ID to include in output vcf file (default:

Sample)

--types TYPES SV types to include in output VCF (default:

DEL,INS,INV,DUP:TANDEM,DUP:INT,BND). Give a comma-

separated list of SV types. The possible SV types are:

DEL (deletions), INS (novel insertions), INV

(inversions), DUP:TANDEM (tandem duplications),

DUP:INT (interspersed duplications), BND (breakends).

--sequence_alleles Use nucleotide sequences for alleles of deletions,

inversions and insertions in output VCF (default:

False). By default, all SVs are represented by

symbolic alleles, such as <DEL>, <INV> or <INS>. If

enabled, ALT alleles of insertions are obtained from

the sequence of a random read that supports the

variant.

--insertion_sequences

Output insertion sequences in INFO tag of VCF

(default: False). If enabled, the INFO/SEQS tag

contains a list of insertion sequences from the

supporting reads. However, the insertion sequences are

not combined into a consensus sequence.

--tandem_duplications_as_insertions

Represent tandem duplications as insertions in output

VCF (default: False). By default, tandem duplications

are represented by the SVTYPE=DUP:TANDEM and the

genomic source is given by the POS and END tags. When

enabling this option, duplications are instead

represented by the SVTYPE=INS and POS and END both

give the insertion point of the duplication.

--interspersed_duplications_as_insertions

Represent interspersed duplications as insertions in

output VCF (default: False). By default, interspersed

duplications are represented by the SVTYPE=DUP:INT and

the genomic source is given by the POS and END tags.

When enabling this option, duplications are instead

represented by the SVTYPE=INS and POS and END both

give the insertion point of the duplication.

--read_names Output names of supporting reads in INFO tag of VCF

(default: False). If enabled, the INFO/READS tag

contains the list of names of the supporting reads.

--zmws look for information on ZMWs in PacBio read names

(default: False). If enabled, the INFO/ZMWS tag

contains the number of ZMWs that produced supporting

reads.

> svim alignment -h

$ svim alignment -h

usage: svim alignment [-h] [--verbose] [--min_mapq MIN_MAPQ]

[--min_sv_size MIN_SV_SIZE] [--max_sv_size MAX_SV_SIZE]

[--segment_gap_tolerance SEGMENT_GAP_TOLERANCE]

[--segment_overlap_tolerance SEGMENT_OVERLAP_TOLERANCE]

[--partition_max_distance PARTITION_MAX_DISTANCE]

[--distance_normalizer DISTANCE_NORMALIZER]

[--cluster_max_distance CLUSTER_MAX_DISTANCE]

[--all_bnds]

[--del_ins_dup_max_distance DEL_INS_DUP_MAX_DISTANCE]

[--trans_destination_partition_max_distance TRANS_DESTINATION_PARTITION_MAX_DISTANCE]

[--trans_partition_max_distance TRANS_PARTITION_MAX_DISTANCE]

[--trans_sv_max_distance TRANS_SV_MAX_DISTANCE]

[--skip_genotyping] [--minimum_score MINIMUM_SCORE]

[--homozygous_threshold HOMOZYGOUS_THRESHOLD]

[--heterozygous_threshold HETEROZYGOUS_THRESHOLD]

[--minimum_depth MINIMUM_DEPTH] [--sample SAMPLE]

[--types TYPES] [--sequence_alleles]

[--insertion_sequences]

[--tandem_duplications_as_insertions]

[--interspersed_duplications_as_insertions]

[--read_names] [--zmws]

working_dir bam_file genome

positional arguments:

working_dir Working and output directory. Existing files in the

directory are overwritten. If the directory does not

exist, it is created.

bam_file Coordinate-sorted and indexed BAM file with aligned

long reads

genome Reference genome file that the long reads were aligned

to (FASTA)

optional arguments:

-h, --help show this help message and exit

--verbose Enable more verbose logging (default: False)

COLLECT:

--min_mapq MIN_MAPQ Minimum mapping quality of reads to consider (default:

20). Reads with a lower mapping quality are ignored.

--min_sv_size MIN_SV_SIZE

Minimum SV size to detect (default: 40). SVIM can

potentially detect events of any size but is limited

by the signal-to-noise ratio in the input alignments.

That means that more accurate reads and alignments

enable the detection of smaller events. For current

PacBio or Nanopore data, we would recommend a minimum

size of 40bp or larger.

--max_sv_size MAX_SV_SIZE

Maximum SV size to detect (default: 100000). This

parameter is used to distinguish long deletions (and

inversions) from translocations which cannot be

distinguished from the alignment alone. Split read

segments mapping far apart on the reference could

either indicate a very long deletion (inversion) or a

translocation breakpoint. SVIM calls a translocation

breakpoint if the mapping distance is larger than this

parameter and a deletion (or inversion) if it is

smaller or equal.

--segment_gap_tolerance SEGMENT_GAP_TOLERANCE

Maximum tolerated gap between adjacent alignment

segments (default: 10). This parameter applies to gaps

on the reference and the read. Example: Deletions are

detected from two subsequent segments of a split read

that are mapped far apart from each other on the

reference. The segment gap tolerance determines the

maximum tolerated length of the read gap between both

segments. If there is an unaligned read segment larger

than this value between the two segments, no deletion

is called.

--segment_overlap_tolerance SEGMENT_OVERLAP_TOLERANCE

Maximum tolerated overlap between adjacent alignment

segments (default: 5). This parameter applies to

overlaps on the reference and the read. Example:

Deletions are detected from two subsequent segments of

a split read that are mapped far apart from each other

on the reference. The segment overlap tolerance

determines the maximum tolerated length of an overlap

between both segments on the read. If the overlap

between the two segments on the read is larger than

this value, no deletion is called.

CLUSTER:

--partition_max_distance PARTITION_MAX_DISTANCE

Maximum distance in bp between SVs in a partition

(default: 1000). Before clustering, the SV signatures

are divided into coarse partitions. This parameter

determines the maximum distance between two subsequent

signatures in the same partition. If the distance

between two subsequent signatures is larger than this

parameter, they are distributed into separate

partitions.

--distance_normalizer DISTANCE_NORMALIZER

Distance normalizer used for span-position distance

(default: 900). SVIM clusters the SV signatures using

an hierarchical clustering approach and a novel

distance metric called "span-position distance". Span-

position distance is the sum of two components, span

distance and position distance. The span distance is

the difference in lengths between signatures

normalized by the greater length and always lies in

the interval [0,1]. The position distance is the

difference in position between signatures normalized

by the distance normalizer (this parameter). For a

position difference of 1.8kb and a distance normalizer

of 900, the position distance will be 2. A smaller

distance normalizer leads to a higher position

distance and as a consequence increases the importance

of the position distance in the span-position distance

relative to the span distance.

--cluster_max_distance CLUSTER_MAX_DISTANCE

Maximum span-position distance between SVs in a

cluster (default: 0.3). This is the most important

parameter because it determines the strictness of

clustering. Choosing a large value leads to fewer but

larger clusters with larger distances between its

members. Choosing a small value leads to more but

smaller clusters with smaller distances between its

members. This parameter determines the height of the

cut-off in the hierarchical clustering dendrogram.

--all_bnds Output all rearrangements additionally in BND notation

(default: False). By default, SV signatures from the

read alignments are used to detect complete SVs, such

as deletions, insertions and inversions. When this

option is enabled, all SVs are also output in breakend

(BND) notation as defined in the VCF specs. For

instance, a deletion gets two records in the VCF

output: 1. the normal <DEL> record and 2. a <BND>

record representing the novel adjacency between the

deletion's start and end coordinate in the sample

genome.

COMBINE:

--del_ins_dup_max_distance DEL_INS_DUP_MAX_DISTANCE

Maximum span-position distance between the origin of

an insertion and a deletion to be flagged as a

potential cut&paste insertion (default: 1.0)

--trans_destination_partition_max_distance TRANS_DESTINATION_PARTITION_MAX_DISTANCE

Maximum distance in bp between translocation

breakpoint destinations in a partition (default: 1000)

--trans_partition_max_distance TRANS_PARTITION_MAX_DISTANCE

Maximum distance in bp between translocation

breakpoints in a partition (default: 200)

--trans_sv_max_distance TRANS_SV_MAX_DISTANCE

Maximum distance in bp between a translocation

breakpoint and an SV signature to be combined

(default: 500)

GENOTYPE:

--skip_genotyping Disable genotyping (default: False)

--minimum_score MINIMUM_SCORE

Minimum score for genotyping (default: 3). Only SV

candidates with a higher or equal score are genotyped.

Depending on the score distribution among the SV

candidates, decreasing this value increases the

runtime. We recommend to choose a value close to the

score threshold used for filtering the SV candidates.

--homozygous_threshold HOMOZYGOUS_THRESHOLD

Minimum variant allele frequency to be called as

homozygous (default: 0.8). Allele frequency is

computed as the fraction of reads supporting the

variant over the total number of reads covering the

variant. Variants with an allele frequence greater

than or equal to this threshold are called as

homozygous alternative.

--heterozygous_threshold HETEROZYGOUS_THRESHOLD

Minimum variant allele frequency to be called as

heterozygous (default: 0.2). Allele frequency is

computed as the fraction of reads supporting the

variant over the total number of reads covering the

variant. Variants with an allele frequence greater

than or equal to this threshold but lower than the

homozygous threshold are called as heterozygous

alternative. Variants with an allele frequence lower

than this threshold are called as homozygous

reference.

--minimum_depth MINIMUM_DEPTH

Minimum total read depth for genotyping (default: 4).

Variants covered by a total number of reads lower than

this value are not assigned a genotype (./. in the

output VCF file).

OUTPUT:

--sample SAMPLE Sample ID to include in output vcf file (default:

Sample)

--types TYPES SV types to include in output VCF (default:

DEL,INS,INV,DUP:TANDEM,DUP:INT,BND). Give a comma-

separated list of SV types. The possible SV types are:

DEL (deletions), INS (novel insertions), INV

(inversions), DUP:TANDEM (tandem duplications),

DUP:INT (interspersed duplications), BND (breakends).

--sequence_alleles Use nucleotide sequences for alleles of deletions,

inversions and insertions in output VCF (default:

False). By default, all SVs are represented by

symbolic alleles, such as <DEL>, <INV> or <INS>. If

enabled, ALT alleles of insertions are obtained from

the sequence of a random read that supports the

variant.

--insertion_sequences

Output insertion sequences in INFO tag of VCF

(default: False). If enabled, the INFO/SEQS tag

contains a list of insertion sequences from the

supporting reads. However, the insertion sequences are

not combined into a consensus sequence.

--tandem_duplications_as_insertions

Represent tandem duplications as insertions in output

VCF (default: False). By default, tandem duplications

are represented by the SVTYPE=DUP:TANDEM and the

genomic source is given by the POS and END tags. When

enabling this option, duplications are instead

represented by the SVTYPE=INS and POS and END both

give the insertion point of the duplication.

--interspersed_duplications_as_insertions

Represent interspersed duplications as insertions in

output VCF (default: False). By default, interspersed

duplications are represented by the SVTYPE=DUP:INT and

the genomic source is given by the POS and END tags.

When enabling this option, duplications are instead

represented by the SVTYPE=INS and POS and END both

give the insertion point of the duplication.

--read_names Output names of supporting reads in INFO tag of VCF

(default: False). If enabled, the INFO/READS tag

contains the list of names of the supporting reads.

--zmws look for information on ZMWs in PacBio read names

(default: False). If enabled, the INFO/ZMWS tag

contains the number of ZMWs that produced supporting

reads.

実行方法

生のロングリードを与えられたリファレンスゲノムにアラインメントすることで、生のロングリードからSVを検出することができる。また、SAM/BAMフォーマットの既存のリードアライメントからSVを検出することもできる。

デフォルトのngmlrを使ってリードをアラインし、SVをコール。

#PacBio reads
svim reads out_dir input_reads.fa ref.fa

#ONT reads
svim reads --nanopore out_dir input_reads.fa ref.fa

生のfasta、fastq、gzip圧縮を扱える。

指定したout_dirディレクトリにログファイル、VCF 形式の SVコール結果、BED 形式の SVコール結果、BED形式の中間シグネチャクラスタが出力される。

minimap2を使ってリードをアラインし、SVをコール。

#PacBio reads
svim reads --aligner minimap2 out_dir input_reads.fa ref.fa

#Use only high-quality alignments
svim reads --min_mapq 30 out_dir input_reads.fa ref.fa

bamを指定する。

svim alignment my_sample input.bam

出力

f:id:kazumaxneo:20200721141259p:plain

variants.vcf

f:id:kazumaxneo:20200721141302p:plain

SVIM は全ての SVコールとスコアを出力する。そのため、結果をこれらのスコアに基づいてフィルタリングすることが強く推奨されている。これはすべてのバリアントコールを出力して、フィルタリングを後処理のステップとして残した方が、よりユーザーフレンドリーであるというオーサーらの考えによる。
SVIM は各 variant コールに対して 0 から 100 の間のスコアを出力する。スコアが高いほど、そのコールがより信頼できることを意味する。スコアはVCF出力のQUAL列やBEDファイルの5番目の列にある。スコアは主にサポートするリードの数に基づいており。さらにSVのスパンと位置に関するサポートリード間の一致も考慮されている。
結果のフィルタリングに関してだが、スコアがサポートリード数に依存するため、そのスコア分布は入力のシーケンシングカバレッジに応じて変化する。したがって、適切なスコアカットオフについて一概には言えない。カバレッジの高いデータセット（>40倍）では、10-15のしきい値が推奨される。カバー率の低いデータセットでは，閾値を低くすべきである。

引用

SVIM: structural variant identification using mapped long reads
David Heller, Martin Vingron
Bioinformatics, Volume 35, Issue 17, 1 September 2019

2020-07-19

インタラクティブなレポートを出力するONTのクオリティコントロールツール pycoQC

2018 JOSS trimming / preprocessing Nanopore long read jupyter notebook

2020 7/21 コマンドでダブルスペースになっていた部分を修正

　核酸のナノポアシーケンシングは、開発に30年近くを要し、現在では合成法によるシーケンシングの代替手段として確固たる地位を確立している(Deamer, Akeson, & Branton, 2016)。オックスフォード・ナノポア・テクノロジーズ（ONT）は、2014年にDNAシーケンシング用の最初の商業的なナノポアデバイスをリリースし、その後も継続的に技術を改良してきた（Jain, Olsen, Paten, & Akeson, 2016）。読み取り精度は90%程度に過ぎないが、ONT技術は非常に長い分子をシーケンスすることができ、リアルタイムでデータを生成することができる。また、RNAを直接シーケンスすることができ、修飾された塩基を検出することができる（Garalde et al）.
　ナノポアのアレイによって取得された電気信号は、HDF5形式で保存され、シーケンスされた分子ごとに1つのファイル（FAST5と呼ばれる）を持つ。次いで、信号は、ベースコールソフトウェアを使用して、核酸配列に変換される。いくつかの選択肢があるが、読み取り精度のための最良のパフォーマーは、ONTによって開発され、維持されているAlbacoreまたはGuppyである(Wick, Judd, & Holt, 2018)。どちらもFASTQファイル、ベースコール情報とテキストサマリーファイルを含むFAST5ファイルを生成することができる。ONTは最近、シーケンシングランの品質管理分析のためのベストプラクティスガイドラインを発表したが（Oxford Nanopore Technologies, 2019）、シーケンシングデータの品質を深く探求するためのターンキーソリューションを提供していなかった。
　ここでは、ベースコールされたナノポアリードまたはAlbacoreとGuppyによって生成されたサマリーファイルからインタラクティブな品質管理メトリクスとプロットを生成するための新しいツールであるpycoQCを紹介する。Nanoplot（De Coster, D'Hert, Schultz, Cruts, & Van Broeckhoven, 2018）、MinionQC（Lanfear, Schalamun, Kainer, Wang, & Schwessinger, 2018）、 toulligQC（Laffay, Ferrato-Berberian, Jour-dren, Lemoine, & Le Crom, 2018）など、他のオープンソースの代替ツールもあるが、pycoQCにはいくつかの斬新な機能がある。

Documentation

https://a-slide.github.io/pycoQC/

jupyter notebook上で使う

jupyter API usage - pycoQC

インストール

macOS10.14でpipを使って導入した。

依存

numpy>=1.13
scipy>=1.1
pandas>=0.23
plotly>=3.4
jinja2>=2.10
h5py>=2.8.0
tqdm>=4.23'

Github

#pipで導入可能。
pip install pycoQC

#condaの仮想環境に入れるなら
conda install pip
pip install pycoQC

#bioconda (link) 未テスト
conda create -n pycoQC python=3.6 -y
conda activate pycoQC
conda install -c bioconda -y pycoqc

#development version (unstable)
pip install --index-url https://test.pypi.org/simple/ pycoQC -U

> pycoQC

$ pycoQC

ERROR: `--summary_file` is a required argument

usage: pycoQC [-h] [--version]

[--summary_file [SUMMARY_FILE [SUMMARY_FILE ...]]]

[--barcode_file [BARCODE_FILE [BARCODE_FILE ...]]]

[--bam_file [BAM_FILE [BAM_FILE ...]]]

[--html_outfile HTML_OUTFILE] [--json_outfile JSON_OUTFILE]

[--min_pass_qual MIN_PASS_QUAL] [--min_pass_len MIN_PASS_LEN]

[--filter_calibration] [--filter_duplicated]

[--min_barcode_percent MIN_BARCODE_PERCENT]

[--report_title REPORT_TITLE] [--template_file TEMPLATE_FILE]

[--config_file CONFIG_FILE] [--sample SAMPLE] [--default_config]

[-v | -q]

pycoQC computes metrics and generates interactive QC plots from the sequencing summary

report generated by Oxford Nanopore technologies basecallers

* Minimal usage

pycoQC -f sequencing_summary.txt -o pycoQC_output.html

* Including Guppy barcoding file + html output + json output

pycoQC -f sequencing_summary.txt -b barcoding_sequencing.txt -o pycoQC_output.html -j pycoQC_output.json

* Including Bam file + html output

pycoQC -f sequencing_summary.txt -a alignment.bam -o pycoQC_output.html

optional arguments:

-h, --help show this help message and exit

--version show program's version number and exit

-v, --verbose Increase verbosity

-q, --quiet Reduce verbosity

Input/output options:

--summary_file [SUMMARY_FILE [SUMMARY_FILE ...]], -f [SUMMARY_FILE [SUMMARY_FILE ...]]

Path to a sequencing_summary generated by Albacore

1.0.0 + (read_fast5_basecaller.py) / Guppy 2.1.3+

(guppy_basecaller). One can also pass multiple space

separated file paths or a UNIX style regex matching

multiple files (Required)

--barcode_file [BARCODE_FILE [BARCODE_FILE ...]], -b [BARCODE_FILE [BARCODE_FILE ...]]

Path to the barcode_file generated by Guppy 2.1.3+

(guppy_barcoder) or Deepbinner 0.2.0+. This is not a

required file. One can also pass multiple space

separated file paths or a UNIX style regex matching

multiple files (optional)

--bam_file [BAM_FILE [BAM_FILE ...]], -a [BAM_FILE [BAM_FILE ...]]

Path to a Bam file corresponding to reads in the

summary_file. Preferably aligned with Minimap2 One can

also pass multiple space separated file paths or a

UNIX style regex matching multiple files (optional)

--html_outfile HTML_OUTFILE, -o HTML_OUTFILE

Path to an output html file report (required if

json_outfile not given)

--json_outfile JSON_OUTFILE, -j JSON_OUTFILE

Path to an output json file report (required if

html_outfile not given)

Filtering options:

--min_pass_qual MIN_PASS_QUAL

Minimum quality to consider a read as 'pass' (default:

--min_pass_len MIN_PASS_LEN

Minimum read length to consider a read as 'pass'

(default: 0)

--filter_calibration If given, reads flagged as calibration strand by the

basecaller are removed (default: False)

--filter_duplicated If given, duplicated read_ids are removed but the

first occurence is kept (Guppy sometimes outputs the

same read multiple times) (default: False)

--min_barcode_percent MIN_BARCODE_PERCENT

Minimal percent of total reads to retain barcode

label. If below, the barcode value is set as

`unclassified` (default: 0.1)

HTML report options:

--report_title REPORT_TITLE

Title to use in the html report (default: PycoQC

report)

--template_file TEMPLATE_FILE

Jinja2 html template for the html report (default: )

--config_file CONFIG_FILE

Path to a JSON configuration file for the html report.

If not provided, looks for it in ~/.pycoQC and

~/.config/pycoQC/config. If it's still not found,

falls back to default parameters. The first level keys

are the names of the plots to be included. The second

level keys are the parameters to pass to each plotting

function (default: )")

Other options:

--sample SAMPLE If not None a n number of reads will be randomly

selected instead of the entire dataset for ploting

function (deterministic sampling) (default: 100000)

--default_config, -d Print default configuration file. Can be used to

generate a template JSON file (default: False)

実行方法

AlbacoreやGuppyが出力するsequencing_summary.txtを指定する。

pycoQC \
 -f Guppy-2.1.3_basecall-1D-RNA_sequencing_summary.txt.gz \
 -o Guppy-2.1.3_basecall-1D_RNA.html \
 -j Guppy-2.1.3_basecall-1D_RNA.json \
 --min_pass_qual 6 \
 --min_pass_len 100 \
 --filter_calibration \
 --min_barcode_percent 10 \
 --quiet

出力

f:id:kazumaxneo:20200719210345p:plain

Guppy-2.1.3_basecall-1D_RNA.html

f:id:kazumaxneo:20200719205235p:plain

f:id:kazumaxneo:20200719205239p:plain

f:id:kazumaxneo:20200719205248p:plain

f:id:kazumaxneo:20200719205251p:plain

f:id:kazumaxneo:20200719205338p:plain

引用

pycoQC, interactive quality control for Oxford Nanopore Sequencing

Adrien Leger, Tommaso Leonardi

JOSS, Published: 28 February 2019

関連

2020-07-18

TGSデータのためのQCツール LongQC

2020 G3: Genes, Genomes, Genetics docker Pacbio Nanopore long read trimming / preprocessing

2020 7/27 追記　

　ショートリードシーケンス技術は、過去 10 年間の生物学のパラダイムを変えてきた。最近では、TGSが登場し、１分子からの非常に長いが比較的エラーが発生しやすいリードを提供している。FastQC (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) や PRINSEQ (Schmieder and Edwards 2011) のようなショートリードのための標準的な品質管理（QC）方法があるが、これらの方法は、エラーが発生しやすいロングリードを特徴付けるために完全に最適化されていなかった。科学界ではTGSの採用が急速に進んでいるため、ロングリードデータに対するこのような手法の適用可能性と適合性を再考する価値がある。ロングリードデータのユニークな特性を説明するためには、異なる特性を記述するための異なるQC統計量が必要である。

　Phredクオリティスコアは、シーケンシングデータの品質を評価するために広く使われている測定値である。残念ながら、PacBio Sequel システムでは意味のある Phred スコアを提供していない。ベースコールの信頼性を示す指標としての Phred スコアがなければ、データセットから信頼性の高いコンセンサス配列を得るためには、カバレッジが重要な基準となる。これは、主にONTデータセット用に開発された最近リリースされたNanoPack(De Coster et al. 2018)にも影響を与える。PacBio uBamローディングの機能にもかかわらず、NanoPackはPhredスコアの欠如により、データを完全に評価することができない。ONTコミュニティ内で利用可能な他のQCツールは、ONTシステムによって生成されたメタデータを使用してプラットフォーム固有の統計を要約するため、ONTにのみ適している(Loman and Quinlan 2014; Watson et al. 2015; Lanfear et al. 2019)。このようなツールは、完全な分析の前に適用することができるが、メタデータのフォーマット変更の影響を受けやすい。実際、生データのフォーマット（Fast5）やベースコールプログラムの変更は、ONTのために頻繁に行われてきた。以前のONT QCツールは、最新のFast5フォーマットでは使用できなくなった。さらに、Scrappie や Bonito のような新しい ONT ベースコーラーは、Sequel のように Phred スコアの計算をスキップする。

　別の QC アプローチとしては、リードポリッシュツールを適用する方法がある。DASCRUBBERはPacBio社のリードをスクラブするために開発されたスイートである（https://github.com/thegenemyers/DASCRUBBER）が、それ自体はQC専用ではなくて、品質評価に使用することができる。DASCRUBBERのフレームワークの下では、リードのall versus all比較を実行して、誤りのあるセグメントを見つけ出してトリミングする。このプロセスは、完全性と精度のためにワークステーション上で何時間もかかることがある。さらに、サポートされているファイル形式の数が限られていること、異なるパラメータオプションでスイート内で複数のコマンドを実行していること、データの品質を判断するための明示的な手段がないことなどが、経験の浅いユーザーにとっての課題となる可能性がある。

　低品質のポアからのONTリードは、時に誤解を招くほど高い品質スコアを示すことがあるにもかかわらず、ランダムであることが報告されている（Mojarro et al., 2018）。これらのリードは人為的に生成されたものであり、サンプルとの生物学的関連性はない。実際、各TGSデータセットには、一定量の「ノンセンスリード」が含まれることがあり、これは、同じライブラリ内の他のどの分子の配列にもマッピングできないユニークなリードとここで定義する。このように、データセットに含まれるノンセンスリードの割合は、シーケンス品質のロバストな指標として利用できることを提案する。

　ここでは、LongQC（Long read Quality Control）と名付けられたTGSデータのための新しいQCパイプラインを報告する。LongQCは計算効率が高く、プラットフォームに依存しないQCツールで、完全な解析を行う前に問題を発見することができる。このツールは、誤ったロングリードデータを対象に統計を視覚化し、生物学的サンプルに起因する潜在的な問題や、シーケンス段階で発生した問題を強調表示する。LongQCは主要な TGS ファイル形式をサポートしている。LongQC は k-mer ベースの内部オーバーラップとスキップアライメントに依存しているため、リファレンスゲノムがなくても効率的に動作する。

インストール

condaの仮想環境を作ってテストした（ubuntu 18.04）。またdockerイメージをビルドしてテストした(macos 10.14)。

依存

LongQC was written in python3 and has dependencies for popular python libraries:

numpy
scipy
matplotlib (version 2.1.2 or higher)
scikit-learn
pandas (version 0.24.0 or higher)
jinja2
h5py

Also, it depends on some bioinformatics packages.

pysam
edlib (with its python wrapper, python-edlib)

Github

#仮想環境に入れる
conda create -n longqc -y
conda activate longqc
conda install h5py pandas scipy jinja2 matplotlib scikit-learn pysam edlib python-edlib -y

git clone https://github.com/yfukasawa/LongQC.git
cd LongQC/
#Modified version of minimap2 is required.
git clone https://github.com/yfukasawa/LongQC.git
cd minimap2_mod
make extra
export PATH=$PATH:$PWD

> python longQC.py -h

$ python longQC.py -h

usage: LongQC [-h] [-v] {runqc,sampleqc,help} ...

LongQC is a software to asses the quality of long read data from the third

generation sequencers.

positional arguments:

{runqc,sampleqc,help}

runqc see `runqc -h`

sampleqc see `sampleqc -h`

help see `help -h`

optional arguments:

-h, --help show this help message and exit

-v, --version show program's version number and exit

> python longQC.py runqc -h

$ python longQC.py runqc -h

usage: LongQC runqc [-h] [-s SUF] [-o OUT] platform raw_data_dir

positional arguments:

platform a platform to be evaluated. [rs2, sequel, minion,

gridion]

raw_data_dir a path for a dir containing the raw data

optional arguments:

-h, --help show this help message and exit

-s SUF, --suffix SUF suffix for each output file.

-o OUT, --output OUT path for output directory

> python longQC.py sampleqc -h

$ python longQC.py sampleqc -h

usage: LongQC sampleqc [-h] -o OUT -x preset [-t] [-n NSAMPLE] [-s SUF]

[-c TRIM] [--adapter_5 ADP5] [--adapter_3 ADP3] [-f]

[-p NCPU] [-d] [-m MEM] [-i INDS] [-b]

input

positional arguments:

input Input [fasta, fastq or pbbam]

optional arguments:

-h, --help show this help message and exit

-o OUT, --output OUT path for output directory

-x preset, --preset preset

a platform/kit to be evaluated. adapter and some ovlp

parameters are automatically applied. (pb-rs2, pb-

sequel, ont-ligation, ont-rapid, ont-1dsq)

-t, --transcript applies the preset for transcripts, RNA or cDNA

sequences

-n NSAMPLE, --n_sample NSAMPLE

the number of sequences for sampling. (>0 and <=10000)

[Default is 5000].

-s SUF, --sample_name SUF

sample name is added as a suffix for each output file.

-c TRIM, --trim_output TRIM

path for trimmed reads. If this is not given, trimmed

reads won't be saved.

--adapter_5 ADP5 adapter sequence for 5'.

--adapter_3 ADP3 adapter sequence for 3'.

-f, --fast this turns off sensitive setting. Faster but less

accurate.

-p NCPU, --ncpu NCPU the number of cpus for LongQC analysis [Default is 4]

-d, --db make minimap2 db in parallel to other tasks.

-m MEM, --mem MEM memory limit for chunking. Please specify in gigabytes

(>0 and <=2). [Default is 0.5]

-i INDS, --index INDS

Give index size for minimap2 (-I) in bp. Reduce when

running on a small memory machine.Default is 4G.

-b, --short this turns on the highly sensitive setting for very

short and erroneous reads (<500bp).

Dockerイメージ

wget https://raw.githubusercontent.com/yfukasawa/LongQC/master/Dockerfile
docker build -t longqc .
docker run -it --rm -v /path/to/shared_dir/:/data longqc

実行方法

pacbio

#RS-II
python longQC.py sampleqc -x pb-rs2 -o out_dir input_reads.(fq|fa)

#sequal
python longQC.py sampleqc -x pb-sequel -o out_dir input_reads.bam

ONT

#1D ligation
python longQC.py sampleqc -x ont-ligation -o out_dir input_reads.fq

#rapid
python longQC.py sampleqc -x ont-rapid -o out_dir input_reads.fq

ValueError: Number of processes must be at least 1 が出て修正できなかった。

追記

issue#3を参考にした。

https://github.com/yfukasawa/LongQC/issues/3

"ーp 4"をつけると動作する。

python longQC.py sampleqc -x ont-rapid -o out_dir input_reads.fq -p 4

-p NCPU, --ncpu NCPU the number of cpus for LongQC analysis [Default is 4]

f:id:kazumaxneo:20200727192311p:plain

f:id:kazumaxneo:20200727191811p:plain

f:id:kazumaxneo:20200727191856p:plain

f:id:kazumaxneo:20200727191830p:plain

引用
LongQC: A Quality Control Tool for Third Generation Sequencing Long Read Data

Yoshinori Fukasawa, Luca Ermini, Hai Wang , Karen Carty , Min-Sin Cheung

G3. 2020 Apr 9;10(4):1193-1196

関連

2020-07-17

Bwa-mem2

高速なツール mapping benchmark

2020 7/19 benchmark追記、一部修正

2020 10/15 condaインストール追記

　Bwa-mem2はbwaのbwa-memアルゴリズムのネクストバージョンである。bwaと同じアラインメントを生成し、データセット、実行中のマシンに依存して～1.3～3.1倍高速になる。オリジナルのbwaはHeng Liによって開発された。bwa-mem2の性能向上は、主にIntelのParallel Computing LabのVasimuddin MdとSanchit Misraによって行われた。Bwa-mem2はMITライセンスで配布されている。

　一般ユーザーは、リリースページにあるコンパイル済みのバイナリを使用することが推奨されている。これらのバイナリは Intel コンパイラでコンパイルされており、gcc でコンパイルされたバイナリよりも高速に動作する。プリコンパイルされたバイナリは、間接的に CPU のディスパッチ（解説）もサポートしている。bwa-mem2 バイナリは、実行中のマシンで利用可能な SIMD 命令セットに基づいて、最も効率的な実装を自動的に選択できる。

BWA-MEM2 production release: Produces alignments identical to BWA-MEM 0.7.17 and is "~1.3-3.0x faster depending on the use-case, dataset and the running machine". https://t.co/asCQah6la1
— Christopher Watson (@ChrisM_Watson) 2020年7月13日

9/11

BWAMEM2 index file size and its memory footprint are now ~16GB down from 42GB.
We are actively working on further reducing it down to ~10GB. @sanchit_misra @lh3lh3
#BWA #BWAMEM #BWAMEM2 #Genomics #Bioinformatics https://t.co/0s3E71TpqX
— Mohd Vasimuddin (@wasim_galaxy) 2020年9月10日

2021 3/9

New release of bwa-mem2 is out. This one has passed the validation test - to make sure it produces the same output at bwa v0.7.17 - on ~88 billion reads https://t.co/96a7C0NKHi #bwamem #bwamem2 @lh3lh3 @wasim_galaxy #Genomics pic.twitter.com/oN1CsSgMYa
— Sanchit Misra (@sanchit_misra) 2021年3月8日

インストール

ubuntu18.04でテストした（*1)。

Github

CentOS6マシン上でコンパイルされたバイナリが用意されている。これをダウンロードして使用する。

curl -L https://github.com/bwa-mem2/bwa-mem2/releases/download/v2.0pre2/bwa-mem2-2.0pre2_x64-linux.tar.bz2 \
 | tar jxf -
cd bwa-mem2-2.0pre2_x64-linux/

#bioconda (link)
conda install -c bioconda bwa-mem2 -y

> ./bwa-mem2

$ ./bwa-mem2

Usage: bwa-mem2 <command> <arguments>

Commands:

index create index

mem alignment

version print version number

> ./bwa-mem2 mem # 最適なSIMD対応バイナリが自動で選択される（赤字）

# ./bwa-mem2 mem

-----------------------------

Executing in AVX2 mode!!

-----------------------------

Usage: bwa2 mem [options] <idxbase> <in1.fq> [in2.fq]

Options:

Algorithm options:

-o STR Output SAM file name

-t INT number of threads [1]

-k INT minimum seed length [19]

-w INT band width for banded alignment [100]

-d INT off-diagonal X-dropoff [100]

-r FLOAT look for internal seeds inside a seed longer than {-k} * FLOAT [1.5]

-y INT seed occurrence for the 3rd round seeding [20]

-c INT skip seeds with more than INT occurrences [500]

-D FLOAT drop chains shorter than FLOAT fraction of the longest overlapping chain [0.50]

-W INT discard a chain if seeded bases shorter than INT [0]

-m INT perform at most INT rounds of mate rescues for each read [50]

-S skip mate rescue

-o output file name missing

-P skip pairing; mate rescue performed unless -S also in use

Scoring options:

-A INT score for a sequence match, which scales options -TdBOELU unless overridden [1]

-B INT penalty for a mismatch [4]

-O INT[,INT] gap open penalties for deletions and insertions [6,6]

-E INT[,INT] gap extension penalty; a gap of size k cost '{-O} + {-E}*k' [1,1]

-L INT[,INT] penalty for 5'- and 3'-end clipping [5,5]

-U INT penalty for an unpaired read pair [17]

Input/output options:

-p smart pairing (ignoring in2.fq)

-R STR read group header line such as '@RG\tID:foo\tSM:bar' [null]

-H STR/FILE insert STR to header if it starts with @; or insert lines in FILE [null]

-j treat ALT contigs as part of the primary assembly (i.e. ignore <idxbase>.alt file)

-v INT verbose level: 1=error, 2=warning, 3=message, 4+=debugging [3]

-T INT minimum score to output [30]

-h INT[,INT] if there are <INT hits with score >80% of the max score, output all in XA [5,200]

-a output all alignments for SE or unpaired PE

-C append FASTA/FASTQ comment to SAM output

-V output the reference FASTA header in the XR tag

-Y use soft clipping for supplementary alignments

-M mark shorter split hits as secondary

-I FLOAT[,FLOAT[,INT[,INT]]]

specify the mean, standard deviation (10% of the mean if absent), max

(4 sigma from the mean if absent) and min of the insert size distribution.

FR orientation only. [inferred]

Note: Please read the man page for detailed description of the command line and options.

他にbwa-mem2.avx2、 bwa-mem2.avx512bw、 bwa-mem2.sse41がある。必要に応じてパスを通す。

実行方法

１、indexing

bwa-mem2 index re.fasta

f:id:kazumaxneo:20200719145036p:plain

２、mapping

bwa-mem2 mem re.fasta pair1.fq.gz pair2.fq.gz -t 20 > out.sam

高速化と引き換えにメモリ使用量が増加し、indexサイズも相当大きくなっています。導入する際は注意して下さい。

追記

シロイヌナズナゲノム（120-Mb）を使って簡単なベンチマークを行いました。３つの計算機を使い、スレッド数を変えて10GB fastqのbwa-mem2マッピングランタイム（real）とその時のピークメモリを比較した結果が次の表です。

f:id:kazumaxneo:20200719144841p:plain

SR3990Xは結果が揺らいだので５回行なった平均値を載せています。他２つのCPUは１回ずつの結果です。SR3990Xは64スレッド指定してもCPU率はまだ半分近く余っていたので、念のためCPUのスレッド数の倍である256を指定し、飽和値指定した時のタイムも測定しました。

引用

https://github.com/bwa-mem2/bwa-mem2

Vasimuddin Md, Sanchit Misra, Heng Li, Srinivas Aluru. Efficient Architecture-Aware Acceleration of BWA-MEM for Multicore Systems. IEEE Parallel and Distributed Processing Symposium (IPDPS), 2019.

参考

https://oyat.nl/bwa2/

CPUがintelのxeon platinum だと、avx512bwも含めて配布されているバイナリ全てが動作した。xeon E5 2680 v4ではavx512bw以外は動作した。AMDのSR3990xだとSSE41 以外は動作しなかったため、ソースからビルドした。

関連

2021 12/11

With the recent (last year) advances in short-read aligners, it would be nice to see an independent benchmarking. I ran some of the tools in my small benchmarking, I include results and brief conclusions below. Could serve as inspiration for a more thorough benchmark. https://t.co/PHxBFLeA3a pic.twitter.com/iYBAI7vCiX
— Kristoffer Sahlin (@krsahlin) December 10, 2021

2020-07-16

（ヒトゲノム）ミトコンドリアハプロタイプを検出することでサンプルの汚染を検出する Haplocheck

2020 Preprint contamination mitochondria human genome ハプログループ

　ヒトのミトコンドリアDNA（mtDNA）は、長さ16.6kbの核外DNAである（Andrews et al）. mtDNAは母系を介してのみ継承され、世界的にヒトの母系の系統と女性の（前）歴史的な人口動態パターンの再構築を容易にしている。mtDNAの厳密な母方遺伝は、ハプロタイプの配列ハプロタイプをハプログループと呼ばれる単系統のクラスターに自然にグループ化する。さらに、次世代シークエンシング（NGS）または大規模並列シークエンシング（MPS）により、ミトコンドリアゲノム全体のヘテロプラスミーの検出が可能になった。ヘテロプラスミーとは、調査対象の生物学的サンプル（細胞や組織など）において、少なくとも2つの異なるmtDNAハプロタイプが存在することを指す。シーケンシングのカバレッジにもよるが、ヘテロプラスミックな位置は1%のバリアントレベルまで確実に検出可能である(Weissensteiner et al. 2016; Ye et al. 2014)。近年、ミトコンドリアデータにおける見かけのヘテロプラスミーに関する問題とデータの解釈は、いくつかの研究（Bandelt and Salas 2012; He et al. 2010; Ye et al. 2014; Just et al. 2014a）で取り上げられ、その結果、シーケンシング研究から得られたmtDNAデータの品質に関する包括的なレビューが行われている（Just et al. 2015）。研究ではヘテロプラスミーの存在を大幅に過大評価していることが示されており、これはしばしば外部汚染(Yao et al. 2007; Just et al. 2014b, 2015; Brandhagen et al. 2020)、人工組換え(Bandelt et al. 2004)、人工物や解析ソフトウェアの不整合(Weissensteiner et al. 2016)によって説明されることがある。核内 DNA（nDNA）と mtDNA のシーケンシング研究では、サンプルの汚染がいまだに大きな問題となっており、過去に Sanger シーケンシング研究で発生したようなミスを防ぐ必要がある (Salas et al. 2005)。 NGS の精度と感度が向上した計算モデルが利用可能になったことにより、全ゲノムシークエンシング（WGS）研究では、種内汚染は 1%レベルまで追跡可能になった (Jun et al. 2012)。 mtDNAシーケンシング研究におけるコンタミネーションを検出するためのいくつかのアプローチが存在する。最近の研究(Weissensteiner et al. 2016)では、ヘテロプラスミーとして観察される系統的に相容れないミトコンドリアハプロタイプの共存に基づくコンタミネーションアプローチが、既に他の研究者によって実証されているように実現可能であることを示した(Avital et al. 2012; Li et al. 2010, 2015)。他の方法としては、Galaxy-based approach (Dickins et al. 2014)などがあり、neighbor joining treesを構築することで汚染のチェックを容易にしている。 Mixemt (Vohr et al. 2017) は、ミトコンドリア系統を組み込み、各配列を読み込んだ際に最も確率の高いハプロタイプを推定している。Mixemt で実装された計算量の多いアルゴリズムは、1 つのサンプル内の複数のハプロタイプの汚染検出に有利であり、バリアント頻度に依存しない。古代DNAの研究では、schmutzi(Renaud et al. 2015)は、汚染を推定するために配列のdeamination patternsとフラグメント長分布を使用している。さらに、ダブルバーコードシーケンシングアプローチを含む、汚染を排除するための特定のラボプロトコルが設計された(Yin et al. 2019)。ミトコンドリア研究における汚染検出のために、VerifyBamID(Zhang et al. 2020; Jun et al. 2012)のような広く受け入れられているソフトウェアツールを適用して、DNAのクロスコンタミネーションを調査することがよくある(Wei et al. 2019; Ding et al. 2015; Yuan et al. 2020)。それにもかかわらず、汚染を簡単に検出し、ミトコンドリア研究における実際のヘテロプラスミックな位置と区別するためのツールがないことが明らかになってきた。 mtDNAも細胞の種類によっては細胞あたり数百倍から数千倍の割合で存在しているため、常染色体ゲノムに特化したWGSデータでもミトコンドリアゲノムのカバレッジが高い。そこで、mtDNAのみを見ることで、nDNAの汚染度を推定できるのではないかという仮説を立てた。

　本論文では、コンタミネーション検出のためにmtDNA系統図を使用するアプローチを体系的に評価し、NGS研究におけるコンタミネーション検出ツールとして使用できるhaplocheckソフトウェアを紹介する。ウェットラボとインシリコの異なるデータセットを用いて、haplocheckがヘテロプラズム位置を正確に検出できることを示し、mtDNA研究においても1%までのコンタミネーションを検出できることを示している。in-silicoのWGSデータを作成し、1000 Genomes Projectのデータを再解析することで（1000 Genomes Project Consortium et al. 2015）、haplocheckがnDNA汚染レベルを推定するための効率的なプロキシとして使用できることを示し、ミトコンドリアのコピー数（mtCN）の影響を調査した。最後に、Haplocheckがサンプル内で同定されたハプロタイプに起因する汚染源の発見に役立つことを示す。

Haplocheck: Contamination Detection in mtDNA and whole-genome sequencing studies

https://haplogrep.i-med.ac.at/2019/12/20/introducing-haplocheck/

Mitoverse （cloud web service）

https://mitoverse.i-med.ac.at/index.html#!

wiki

Contamination Method - Mitoverse

インストール

Github

mkdir haplocheck 
cd haplocheck
curl -s install.cloudgene.io | bash 
./cloudgene install https://github.com/genepi/haplocheck/releases/download/v1.1.3/haplocheck.zip

> ./cloudgene run haplocheck@1.1.3

$ ./cloudgene run haplocheck@1.1.3

Cloudgene 2.2.0

http://www.cloudgene.io

Built by travis on 2020-05-07T11:59:20Z

Haplocheck 1.1.3

ERROR: Missing required option: files

usage: input parameters:

--baseQ <number> Minimal Base Quality

(default: 20)

--conf <arg> Hadoop configuration folder

--files <local_folder> Input Files (VCF/BAM/CRAM)

--force Force Cloudgene to reinstall application in

HDFS even if it already installed.

--level <number> Level

(default: 0.01)

--mapQ <number> Minimal Map Quality

(default: 20)

--output <arg> Output folder

--show-log Stream logging messages to stdout

--show-output Stream output to stdout

--user <arg> Hadoop username [default: cloudgene]

テストラン

#Download 1000G Phase3 Data
wget https://github.com/genepi/haplocheck/raw/master/test-data/contamination/1000G/all/1000g-nobaq.vcf.gz 
 
#Run haplocheck
./cloudgene run haplocheck@1.1.3 --files 1000g-nobaq.vcf.gz --output results 
 
#open results in firefox
firefox results/report/report.html

Haplocheckは、各ミトコンドリア入力サンプルの汚染状況をレポートし、(a)グラフィカルなレポートと(b)テキストを出力する。

出力

汚染が検出されたサンプル

f:id:kazumaxneo:20200716234246p:plain

f:id:kazumaxneo:20200716234423p:plain

汚染が検出されなかったサンプル

f:id:kazumaxneo:20200716234246p:plain

summary

f:id:kazumaxneo:20200716234248p:plain

表は、特定のサンプルでフィルタリング、ソート、検索したりできる。さらに、各サンプルについて、系統樹は Phylotree 17からのグラフ情報を用いて生成される。ツリーはルートノード(rCRS)から始まり、最終的なハプログループ(Haplogrepによって割り当てられた)に到達するまでの各遷移のホモプラスミック(青)/ヘテロプラスミック(緑)の位置を示している。2つの枝は、メジャープロファイルとマイナープロファイルの最終的なミトコンドリアハプログループを表す。

引用

Haplocheck: Phylogeny-based Contamination Detection in Mitochondrial and Whole-Genome Sequencing Studies
Hansi Weissensteinera, Lukas Forera, Liane Fendta, Azin Kheirkhaha,
Antonio Salasb, Florian Kronenberga, Sebastian Schoenherra

bioRxiv, May 8, 2020

参考

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

真核生物の予測されたタンパク質のデータベース EukProt

メタゲノム由来コンティグから真核生物のタンパク質配列を予測する MetaEuk

ノイズの多いロングリードを使ってSVをコールする SVIM

インタラクティブなレポートを出力するONTのクオリティコントロールツール pycoQC

TGSデータのためのQCツール LongQC

Bwa-mem2

（ヒトゲノム）ミトコンドリアハプロタイプを検出することでサンプルの汚染を検出する Haplocheck