2020-11-03

複雑な構造多型の分析と視覚化を行う Svpluscnv

　ほとんどの腫瘍種において体細胞構造変化（SV）が広く普及しているにもかかわらず、その分子的意味合いについての理解が不十分な場合が多い。SVはその大きさと複雑さが非常に不均一であり、その病原性の役割の解釈を妨げている。ガンの体細胞構造を完全に特徴づけるためには、プラットフォームを横断して大規模なSVデータセットを統合するツールが必要である。
　svpluscnv R パッケージは、コピーナンバーバリアント（CNV）セグメンテーションプロファイルやシーケンスベースの構造バリアントコール（SVC）を含む直交データセットの統合と解釈のためのスイスアーミーナイフである。このパッケージには、染色体の不安定性や多型性の評価、再発性SVを持つ遺伝子の同定、chromothripsis（wiki）やchromoplexia（wiki）などの複雑な再配列の検出を行うための解析および可視化ツールが実装されている。また、ホットスポットとなっているゲノム領域を系統的に同定することができ、代替検出法やデータセット間での再現性を示している。

インストール

ubuntuのサーバーに導入した（ubuntu18.04とubuntu16.04の２つの計算機に入れたが、ubuntu16.04のサーバーでは導入できなかった）。

Github

devtools::install_github("ccbiolab/svpluscnv")

テストラン

Rのコンソールを立ち上げる。ライブラリを読み込む。

library(svpluscnv)

２つのデータタイプがサポートされている。

CNV のセグメンテーションデータ - (cnv データタイプ) sample, chrom, start, end, probes, segmean の順に 6 カラムが必要（テストデータ）
構造変化のコール。(SVC データ型) sample, chrom1, pos1, strand1, chrom2, pos2, strand2 & svclass の順に 8 カラム必要。svclassフィールドで受け付けるタイプは、重複(DUP)、欠失(DEL)、逆位(INV)、挿入(INS)、転座(TRA)、未定義バリアントのブレークエンド(BND)。

また、svpluscnvの機能を調べるために、2つのデータセットがパッケージに含まれている（Github参照）。ここでは、そのデータセットを使う。

#CNV
head(nbl_segdat)

## Sample Chromosome Start End Num_markers Seg_CN
## 1 PAISNS 1 11000 833000 337 -0.0270
## 2 PAISNS 1 835000 2715000 916 -1.0257
## 3 PAISNS 1 2717000 5969000 1552 -0.8316
## 4 PAISNS 1 5971000 12481000 3256 -0.9593
## 5 PAISNS 1 12483000 12777000 148 -0.2305
## 6 PAISNS 1 12779000 15551000 1287 -0.0083

#SV
head(nbl_svdat)

## TARGET.USI LeftChr LeftPosition LeftStrand RightChr RightPosition RightStrand Type
## 1 PAISNS chr1 12481576 - chr7 123358964 + TRA
## 2 PAISNS chr1 120543859 - chr2 65103235 - TRA
## 3 PAISNS chr2 231680218 + chr21 40045998 + TRA
## 4 PAISNS chr3 54814482 - chr17 42657036 + TRA
## 5 PAISNS chr4 97761321 - chr4 97765146 + INV
## 6 PAISNS chr4 190936134 + chr9 68411170 + TRA

データのイニシャライズ（reformat）

#CNV
cnv <- validate.cnv(nbl_segdat)

#SV
svc <- validate.svc(nbl_svdat)

ゲノム全体のCNVのゲイン/ロス頻度の可視化。閾値fc.pctはパーセンテージで表される（例：0.2→参照と比較して20％の倍数変化）。

cnv_freq <- cnv.freq(cnv, fc.pct = 0.2, ploidy = FALSE)

f:id:kazumaxneo:20201102231535p:plain

svpluscnvはサンプルごとの中央値を返す関数med.segmeanを実装しており、ハイパープロイディを持つサンプルの場合、各サンプルセグメントのlogRからサンプルごとの中央値を差し引くことができる（GithubのREADME参照）。

chr.arm.cnv 関数は，各サンプルの各染色体アームのセグメント加重平均対数比を取得する。

charm.mat <- chr.arm.cnv(cnv, genome.v = "hg19", verbose = FALSE)
# heatmap plot of chromosome arm level CNV
require(gplots,quietly = TRUE,warn.conflicts = FALSE)
heatmap.2(charm.mat[order(rownames(charm.mat))[1:42],],Rowv=NA,trace='none',cexCol=.5, lhei=c(0.25,1), dendrogram='col', key.title="Copy number",
 col=colorRampPalette(c("blue","white","red"))(256))

f:id:kazumaxneo:20201103120243p:plain

shattered.regions関数で生成されたオブジェクトを受け取るcirc.chromo.plotというcircclizeパッケージのラッパー関数を使ってcircos.plotを描画できる。

# It is important to make sure the input data.frame has no factors
library(taRifx)
segdat_lung_ccle <- remove.factors(segdat_lung_ccle)
svdat_lung_ccle <- remove.factors(svdat_lung_ccle)
cnv <- validate.cnv(segdat_lung_ccle)
# remove likely artifacts from segmentation data and fill gaps in the segmentation data (optional)
cnv_clean <- clean.cnv.artifact(cnv, verbose=FALSE,n.reps = 4,fill.gaps = TRUE) 
svc <- validate.svc(svdat_lung_ccle)

shatt_lung <- shattered.regions(cnv, svc, fc.pct = 0.1, verbose=FALSE)
shatt_lung

shatt_lung_cnv <- shattered.regions.cnv(cnv_clean, fc.pct = 0.1, verbose=FALSE)
shatt_lung_cnv

circ.wg.plot(cnv,svc,sample.id = "SCLC21H_LUNG")

f:id:kazumaxneo:20201103120306p:plain

circ.chromo.plot(shatt_lung_cnv,sample.id = "SCLC21H_LUNG")

f:id:kazumaxneo:20201103120312p:plain

circ.chromo.plot(shatt_lung,sample.id = "SCLC21H_LUNG")

f:id:kazumaxneo:20201103120320p:plain

環状プロットは内向きから外向きの順番で、SV、CNV、shattered.region（紫）、そしてイデオグラムを表す。

３つまとめて

par(mfrow=c(1,3))
circ.wg.plot(cnv,svc,sample.id = "SCLC21H_LUNG")
circ.chromo.plot(shatt_lung_cnv,sample.id = "SCLC21H_LUNG")
circ.chromo.plot(shatt_lung,sample.id = "SCLC21H_LUNG")

f:id:kazumaxneo:20201102232019p:plain

GithubのREADMEでは、染色体不安定性を調べるためのCNVプロファイルの計算、CNVやSVコールに由来するブレイクポイントの総負荷を測定してブレイクポイント総負荷とゲノム変化率の相関を調べる関数、局所的な領域にブレイクポイントが蓄積されているのか、ゲノム全体にランダムな分布を示すのかを調べるためのブレイクポイント負荷の測定法などが説明されています。さらに、遺伝子ごとのCNVやSV数を集計して図にする手順も紹介されています。レポジトリを確認して下さい。

引用

Svpluscnv: analysis and visualization of complex structural variation data
Gonzalo Lopez, Laura E Egolf, Federico M Giorgi, Sharon J Diskin, Adam A Margolin
Bioinformatics, Published: 14 October 2020

2020-11-02

異なるphylogenetic cladesで保存されているタンパク質を検索するwebサービス PhyloGene

2015 Nucleic Acids Research heatmap web tool protein search orthologue

　同じパスウェイ、タンパク質複合体、または同じ環境条件で機能するタンパク質は、系統発生クレード全体で類似した配列保存パターンを示すことがある。特定のタンパク質複合体またはパスウェイをもはや必要としない種では、これらのタンパク質は、グループとして、失われるか分岐する傾向がある。真核生物の大規模なセットにわたる配列保存のパターンの類似性の分析から、異なるタンパク質間の機能的関連を予測し、新しいパスウェイのメンバーを特定し、以前に特徴付けられていないタンパク質の機能を明らかにすることができる。正規化された系統発生プロファイリングを使用して、タンパク質機能を予測し、新しいパスウェイメンバーと疾患遺伝子を特定した。ヒト、マウス、Caenorhabditis elegans、およびショウジョウバエのゲノムに保存されている何万ものタンパク質の系統発生プロファイルは、新しいWebサーバーPhyloGeneで照会できる。 PhyloGeneは直観的で使いやすいプラットフォームを提供し、86の動物、真菌、植物、原生生物ゲノムの保存パターンを照会する。タンパク質クエリは、集中的に研究された種の全ゲノムタンパク質セットから名前を選択するか、タンパク質配列を入力することにより送信できる。グラフィック出力は、クエリの配列保存のプロファイルと、選択したゲノム内のタンパク質の最も類似した系統発生プロファイルを示している。ユーザーはこの出力を数値形式でダウンロードすることもできる。

チュートリアルも用意されているが、現在はリンク切れになっている。

http://genetics.mgh.harvard.edu/phylogene/update_1117/help_text.htm

webサービス

http://genetics.mgh.harvard.edu/phylogene/ にアクセスする。

初期はヒトになっている。

f:id:kazumaxneo:20200318082114p:plain

タンパク質名を選択する。ここではACE2を選択した。

f:id:kazumaxneo:20201101225345p:plain

またはタンパク質配列をアップロードすることもできる。

f:id:kazumaxneo:20201101224609p:plain

ヒートマップが表示された。ヒートマップのX軸はクエリの真核生物ゲノムに対応しており、主要な分類（動物、菌類、植物、原生生物）にグループ化され、保存パターンを手動で簡単に確認できるようになっている。ヒートマップのy軸は、検索したゲノムのタンパク質の系統的プロファイルが、検索したタンパク質のプロファイルと最も類似しているタンパク質に対応している。一番上の行はクエリタンパク質のプロファイルに対応している。

f:id:kazumaxneo:20201101225258p:plain

上はACE2（一番上）とそれに似た保存パターンを持つタンパク質の検索結果。

ヒートマップは、ヒートマップ行の左に表示されているタンパク質との配列類似性がないものが白で、配列保存率が高いものほど紺色で表示される。左端の遺伝子名は Ensembl (http://www.ensembl.org) へリンクしている。

f:id:kazumaxneo:20201101224124p:plain

ヒートマップの左端2つの列は色が異なっているが、これはピアソン相関係数とクエリとの類似度の統計的有意性を表している。

f:id:kazumaxneo:20201101224203p:plain

出力はエクセルファイル形式でダウンロードできる。

f:id:kazumaxneo:20201101224430p:plain

他にもPhylogeneという名前のツールがあるようです。注意して下さい。

引用
PhyloGene server for identification and visualization of co-evolving proteins using normalized phylogenetic profiles
Ilyas R. Sadreyev, Fei Ji, Emiliano Cohen, Gary Ruvkun, Yuval Tabach

Nucleic Acids Res. 2015 Jul 1;43(W1):W154-9

関連

2020-11-01

データベースのFASTAファイルをBLASTでのアラインメントに適した小さな断片に分割する AlignBucket

2015 Bioinformatics fasta/fastqの操作 database BLAST

　次世代シークエンシング時代では、増え続ける生物学的配列やそのバリエーションを正確にアノテーションするための信頼性の高い、高速かつ効率的なアプローチが求められている。類似性検索に基づくアノテーションのtransferは、標準的なアプローチである。全タンパク質比較の手順は、データベースに既に存在する情報に基づいて配列をアノテーションする様々な利用可能な方法の前段階である。また、その際には、配列比較の時間を短縮するために、データの前処理を行う方法が必要となる。

　大量の配列（データベース全体）を、限られた最小値と期待されるアラインメントカバレッジに応じて配列長値（残基数）を制約した集合に分割し、最適化するアルゴリズムを提案する。タンパク質配列をその長さに応じて最適にグループ化し、選択された長さの範囲内にある配列間のall-against-allシーケンスアラインメントを計算するという考え方である。この論文では、数学的に最適な解を記述し、本手法が実世界のケースで5倍のスピードアップにつながることを示す。本手法の実装は http://www.biocomp.unibo.it/∼giuseppe/partitioning.html からダウンロードできる。

タンパク質データセットにおける「all-against-all」のシーケンスアラインメントは、BLAST（Altschul etal. 1990）や動的プログラミングアプローチ（Needleman and Wunschun, 1970）などのプログラムを用いて日常的に行われている。これにより、配列は配列同一性値のサブセットにグループ化され、タンパク質配列間の構造的特徴や機能的特徴の伝達など、いくつかのアプリケーションで重要となる。少なくとも約30％の配列同一性を共有するタンパク質は、その三次元構造が重複している可能性があり（Chothia and Lesk, 1986）、少なくとも約60％の配列同一性を共有するタンパク質は、同様の機能を共有している可能性が高い（Rost, 2002; Tian and Skolnick, 2003）。また、アノテーションをtransferする前に、ペアワイズアラインメントされた配列の相対的な長さを考慮することも重要である(Devos and Valencia, 2001)。マルチドメイン蛋白質はシングルドメイン蛋白質よりも機能的に保存されていないため（Hegyi and Gerstein, 2001）、相同性検索後の機能transferに基づく方法では、機能の異なる蛋白質を同じクラスタ内でグループ化することがあり得る。したがって、アノテーション方法の開発では、配列同一性の値とアライメント長（カバレッジ）についてのオーバーラップの割合の両方が制約に含まれた（Mieleら、2011; Piovesanら、2011、2013）。進化論的研究は、そのワークフローの初期段階でのアラインメントを含む（Vilella etal. 2009; Waterhouse etal. 2013）が、同時に計算を高速化するための計算方法も重要になる（Roth etal. 2008; Sonnhammer etal. 2014; Wittwer etal. 2014）。

　計算コストを考えると、 all-against-all similarity searchは非常に負荷の高いプロセスである。最近、進化研究に関連する問題において、Wittwer etal. (2014)は、部分配列レベルの相同性を考慮することで、相同配列を検索する可能性を維持したまま、 all-against-all similarity searchの処理速度を4倍に高速化できることを示した。本研究では、期待されるアラインメントカバレッジを制限することにより、 similarity searchの計算時間を高速化するアルゴリズムを提案する。

HP（AlignBucketのソースコードに加えて、論文で使用されたデータセットもダウンロード可能）

http://www.biocomp.unibo.it/~giuseppe/partitioning.html

インストール

ビルド依存

The program was tested on a GNU/Linux Debian 7 system.

The required libraries are:

GNU Multiple Precision library (gmp)
boost library (at least version 1.49)

Github

git clone https://github.com/profgiuseppe/alignbucket.git
cd alignbucket/
make
chmod u+x alignbucket

> ./alignbucket

# ./alignbucket

Using default minimum length: 1

ERROR: please specify a distribution or a fasta file

Allowed options:

--help produce help message

--delta arg (=90) Coverage percentage needed, please see the paper

-s [ --start ] arg Minimum sequence length to consider

-d [ --distribution ] arg file with length distribution. The content should

be <length> <num. of sequences>

-f [ --fasta ] arg file containing the sequences in FASTA format

--outdir arg output directory

-v [ --verbose ] verbose mode. WARNING: may generate a huge amount

of data

> ./alignbucket.sh -h

# ./alignbucket.sh -h

Using default minimum length: 1

Reading fasta file -h

ERROR: Can't open file -h

実行方法

ランにはalignbucket.shを使う。FASTAファイルを指定する。論文では2015年5月のuniprot_sprotを使用している。

alignbucket.sh fasta_file --delta 90

デフォルトではFASTAファイルを最適化されたバケットに分割して、90%のカバレッジを実現する。xxx-yyy.fastaという形式で出力されるが、xxxとyyyはファイル内の配列の最小長と最大長を表している。

出力

f:id:kazumaxneo:20201101134134p:plain

アラインメントカバレッジが同程度のチャンクに分割される。

uniprot_sprot_2015の分割結果の1-4を開いてみた。

f:id:kazumaxneo:20201101134637p:plain

（なぜこうゆう配列が登録されているのかは分からない）

引用

AlignBucket: a tool to speed up ‘all-against-all’ protein sequence alignments optimizing length constraints
Giuseppe Profiti, Piero Fariselli, Rita Casadio

Bioinformatics, Volume 31, Issue 23, 1 December 2015, Pages 3841–3843,

関連

2020-10-31

10x genomicsのシングルセルRNA-seq解析パイプライン cellranger（version4について）

docker single cell 10X Genomics automated pipeline mapping scRNAseq

2020 10/31 説明を追加

2021 2/11 docker インストールにv５.01追加

　Cell Rangerは、ChromiumのシングルセルRNA-seq出力を処理して、リードのアラインメント、フィーチャ-バーコードマトリックスの生成、クラスタリングと遺伝子発現解析を行う解析パイプラインのセットである。Cell Rangerには、シングルセル遺伝子発現実験に関連する4つのパイプラインが含まれている。

　cellranger mkfastqは、Illuminaシーケンサーで生成された生のベースコール（BCL）ファイルをFASTQファイルにデマルチプレックスする。これはIlluminaのbcl2fastqのラッパーで、10xライブラリーに特化した便利な機能を追加し、サンプルシートのフォーマットを簡略化したものである。

　cellranger countは、cellranger mkfastqからFASTQファイルを取得し、アラインメント、フィルタリング、バーコードカウント、UMIカウントを実行する。それは、Chromium セルラーバーコードを使用して、フィーチャ-バーコードマトリックスを生成し、クラスタを決定し、遺伝子発現解析を実行するために使用される。カウントパイプラインは、同じGEMウェル上の複数のシークエンシング実行からの入力を取り込むことができる。cellranger countはまた、遺伝子発現リードと一緒にフィーチャーバーコードデータを処理する。

　cellranger aggr は、cellranger count の複数回の実行からの出力を集約し、それらの実行を同じシークエンシングデプスに正規化してから、特徴バーコードマトリクスを再計算し、結合されたデータを解析する。aggrパイプラインを使用して、複数のサンプルからのデータを実験全体のフィーチャ-バーコードマトリックスと解析に結合することができる。

　cellranger reanalyzeは、cellranger countまたはcellranger aggrによって生成された特徴-バーコードマトリックスを利用し、調整可能なパラメータ設定を用いて次元削減、クラスタリング、遺伝子発現アルゴリズムを再実行する。

　これらのパイプラインは、広く使用されているRNA-seqアライナーSTARとChromium固有のアルゴリズムを組み合わせたものである。出力は、標準的なBAM、MEX、CSV、HDF5、HTML形式で提供され、細胞情報が付加されている。

What is Cell Ranger?

https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/what-is-cell-ranger

テスト環境

intel xeon platinumのシングルノード環境でテスト（osはubuntu18.04LTS）。

インストール

10x GenomicsのHPからダウンロードする。ダウンロード前に所属機関やメールアドレス等の記入が必要（ --transcriptomeで指定するリファレンスもリンク先からダウンロードできる）。

Cell Ranger Installation -Software -Single Cell Gene Expression -Official 10x Genomics Support

#解凍
tar -xzvf cellranger-4.0.0.tar.gz
cd cellranger-4.0.0/bin/

> cellranger

$ cellranger

cellranger 4.0.0

Process 10x Genomics Gene Expression, Feature Barcode, and Immune Profiling data

USAGE:

cellranger <SUBCOMMAND>

FLAGS:

-h, --help Prints help information

-V, --version Prints version information

SUBCOMMANDS:

count Count gene expression and feature barcoding reads from a single sample and GEM well

vdj Assembles single-cell VDJ receptor sequences from 10x Immune Profiling libraries

aggr Aggregate data from multiple Cell Ranger runs

reanalyze Re-run secondary analysis (dimensionality reduction, clustering, etc)

targeted-compare Analyze targeted enrichment performance by comparing a targeted sample to its cognate parent WTA sample (used as input for targeted gene expression)

targeted-depth Estimate targeted read depth values (mean reads per cell) for a specified input parent WTA sample and a target panel CSV file

mkvdjref Prepare a reference for use with CellRanger VDJ

mkfastq Run Illumina demultiplexer on sample sheets that contain 10x-specific sample index sets

testrun Execute the 'count' pipeline on a small test dataset

mat2csv Convert a gene count matrix to CSV format

mkgtf Prepare a GTF file for use as a 10x transcriptome reference

mkref Prepare a reference for use with 10x analysis software. Requires a GTF and FASTA

upload Upload a summary of an analysis pipeline job to 10x Genomics support

sitecheck Collect Linux system configuration information

help Prints this message or the help of the given subcommand(s)

> cellranger count -h

$ cellranger count -h

cellranger-count

Count gene expression and feature barcoding reads from a single sample and GEM well

USAGE:

cellranger count [FLAGS] [OPTIONS] --id <ID> --transcriptome <PATH>

FLAGS:

--no-target-umi-filter Turn off the target UMI filtering subpipeline

--nosecondary Disable secondary analysis, e.g. clustering. Optional

--no-libraries Proceed with processing using a --feature-ref but no Feature Barcode libraries specified with the 'libraries' flag

--dry Do not execute the pipeline. Generate a pipeline invocation (.mro) file and stop

--disable-ui Do not serve the UI

--noexit Keep web UI running after pipestance completes or fails

--nopreflight Skip preflight checks

-h, --help Prints help information

OPTIONS:

--id <ID> A unique run id and output folder name [a-zA-Z0-9_-]+

--description <TEXT> Sample description to embed in output files

--transcriptome <PATH> Path of folder containing 10x-compatible transcriptome reference

-f, --fastqs <PATH>... Path to input FASTQ data

-p, --project <TEXT> Name of the project folder within a mkfastq or bcl2fastq-generated folder to pick FASTQs from

-s, --sample <PREFIX>... Prefix of the filenames of FASTQs to select

--lanes <NUMS>... Only use FASTQs from selected lanes

--libraries <CSV> CSV file declaring input library data sources

--feature-ref <CSV> Feature reference CSV file, declaring Feature Barcode constructs and associated barcodes

--target-panel <CSV> The target panel CSV file declaring the target panel used, if any

--expect-cells <NUM> Expected number of recovered cells

--force-cells <NUM> Force pipeline to use this number of cells, bypassing cell detection

--r1-length <NUM> Hard trim the input Read 1 to this length before analysis

--r2-length <NUM> Hard trim the input Read 2 to this length before analysis

--chemistry <CHEM> Assay configuration. NOTE: by default the assay configuration is detected automatically, which is the recommened mode. You usually will not need to specify a chemistry. Options are: 'auto' for

autodetection, 'threeprime' for Single Cell 3', 'fiveprime' for Single Cell 5', 'SC3Pv1' or 'SC3Pv2' or 'SC3Pv3' for Single Cell 3' v1/v2/v3, 'SC5P-PE' or 'SC5P-R2' for Single Cell 5', paired-

end/R2-only, 'SC-FB' for Single Cell Antibody-only 3' v2 or 5' [default: auto]

--jobmode <MODE> Job manager to use. Valid options: local (default), sge, lsf, slurm or a .template file. Search for help on "Cluster Mode" at support.10xgenomics.com for more details on configuring the pipeline to

use a compute cluster [default: local]

--localcores <NUM> Set max cores the pipeline may request at one time. Only applies to local jobs

--localmem <NUM> Set max GB the pipeline may request at one time. Only applies to local jobs

--localvmem <NUM> Set max virtual address space in GB for the pipeline. Only applies to local jobs

--mempercore <NUM> Reserve enough threads for each job to ensure enough memory will be available, assuming each core on your cluster has at least this much memory available. Only applies in cluster jobmodes

--maxjobs <NUM> Set max jobs submitted to cluster at one time. Only applies in cluster jobmodes

--jobinterval <NUM> Set delay between submitting jobs to cluster, in ms. Only applies in cluster jobmodes

--overrides <PATH> The path to a JSON file that specifies stage-level overrides for cores and memory. Finer-grained than --localcores, --mempercore and --localmem. Consult the 10x support website for an example

override file

--uiport <PORT> Serve web UI at http://localhost:PORT

> cellranger bamtofastq -h

$ cellranger bamtofastq -h

bamtofastq v1.2.1

10x Genomics BAM to FASTQ converter.

Tool for converting 10x BAMs produced by Cell Ranger or Long Ranger back to

FASTQ files that can be used as inputs to re-run analysis. The FASTQ files

emitted by the tool should contain the same set of sequences that were

input to the original pipeline run, although the order will not be

preserved. The FASTQs will be emitted into a directory structure that is

compatible with the directories created by the 'mkfastq' tool.

10x BAMs produced by Long Ranger v2.1+ and Cell Ranger v1.2+ contain header

fields that permit automatic conversion to the correct FASTQ sequences.

Older 10x pipelines require one of the arguments listed below to indicate

which pipeline created the BAM.

NOTE: BAMs created by non-10x pipelines are unlikely to work correctly,

unless all the relevant tags have been recreated.

NOTE: BAM produced by the BASIC and ALIGNER pipeline from Long Ranger 2.1.2 and earlier

are not compatible with bamtofastq

NOTE: BAM files created by CR < 1.3 do not have @RG headers, so bamtofastq will use the GEM well

annotations attached to the CB (cell barcode) tag to split data from multiple input libraries.

Reads without a valid barcode do not carry the CB tag and will be dropped. These reads would

not be included in any valid cell.

Usage:

bamtofastq [options] <bam> <output-path>

bamtofastq (-h | --help)

Options:

--nthreads=<n> Threads to use for reading BAM file [default: 4]

--locus=<locus> Optional. Only include read pairs mapping to locus. Use chrom:start-end format.

--reads-per-fastq=N Number of reads per FASTQ chunk [default: 50000000]

--gemcode Convert a BAM produced from GemCode data (Longranger 1.0 - 1.3)

--lr20 Convert a BAM produced by Longranger 2.0

--cr11 Convert a BAM produced by Cell Ranger 1.0-1.1

--bx-list=L Only include BX values listed in text file L. Requires BX-sorted and index BAM file (see Long Ranger support for details).

-h --help Show this screen.

(grabseqs) [

cellranger-4.0.0/bin/にパスを通しておく。

cell barcodeのファイルは

cellranger-4.0.0/lib/python/cellranger/barcodes/737K-august-2016.txt

がそうらしい。テスト時は737280行あった。

非公式だがdockerイメージも上がっている。windowsやmacでランする時に便利と思われる。

https://hub.docker.com/r/cumulusprod/cellranger/tags

#6.1.2
docker pull cumulusprod/cellranger:6.1.2

#v5.0.1
docker pull cumulusprod/cellranger:5.0.1

#v4.0.0
docker pull cumulusprod/cellranger:4.0.0

#v3.1.0
docker pull cumulusprod/cellranger:3.1.0

#v2.2.0
docker pull cumulusprod/cellranger:2.2.0

実行方法

10xのチュートリアルではSRAのデータを例に説明しているので、ここでもそれに則る。SRAのscRNAseqデータを使う時の注意点だが、（fastqがindexとペアに分かれてアップロードされているなら問題ないが、）ここで使うチュートリアルデータのように１つのfastqとして登録されている場合、このfastqをダウンロードしてもindexやペアエンドが分離されておらず使えない。10xのbarcoded bam形式bamをダウンロードする（このページの下のAWSに置いてあるbarcoded bamをダウンロード）。それからcellranger bamtofastqを使い、それぞれのライブラリに正しく分離されたfastqを作る。この作業で、複数poolしていればライブラリごとのfastqディレクトリと、その中にfastq（通常ペアエンドなのでR1とR2のfastq.gzと、indexのfastq.gz）ができる。fastqは５千万リード毎にchunk出力される。cell ranger countのランには、この作業で得られたディレクトリを指定する。

１、cellranger bamtofastq（link）によりbarcoded bamからfastqを出力する。ここでは公式でexampleデータとして使っているC05.bam.1と C07.bam.1のうちのC05.bam.1を使う。こちらからダウンロードできる。

cellranger bamtofastq --nthreads=4 --reads-per-fastq=50000000  C05.bam.1 normal

"--nthreads=4"と"--reads-per-fastq=50000000"はなくても動作する。

ランが終わると、normal/にindepth_C05_MissingLibrary_1_HL5G3BBXXとindepth_C05_MissingLibrary_1_HNNWNBBXXができる。

出力ディレクトリ

ls normal/indepth_C05_MissingLibrary_1_HL5G3BBXX/

-rw-r--r-- 1 kazumaxgene 317M 10月 24 16:10 bamtofastq_S1_L003_I1_002.fastq.gz

-rw-r--r-- 1 kazumaxgene 563M 10月 24 16:10 bamtofastq_S1_L003_R1_002.fastq.gz

-rw-r--r-- 1 kazumaxgene 1.5G 10月 24 16:10 bamtofastq_S1_L003_R2_002.fastq.gz

-rw-r--r-- 1 kazumaxgene 475M 10月 24 16:10 bamtofastq_S1_L004_I1_006.fastq.gz

-rw-r--r-- 1 kazumaxgene 804M 10月 24 16:10 bamtofastq_S1_L004_R1_006.fastq.gz

-rw-r--r-- 1 kazumaxgene 2.6G 10月 24 16:10 bamtofastq_S1_L004_R2_006.fastq.gz

drwxr-xr-x 2 kazumaxgene 4.0K 10月 24 16:04 ./

-rw-r--r-- 1 kazumaxgene 677M 10月 24 16:04 bamtofastq_S1_L004_I1_005.fastq.gz

-rw-r--r-- 1 kazumaxgene 1.2G 10月 24 16:04 bamtofastq_S1_L004_R1_005.fastq.gz

-rw-r--r-- 1 kazumaxgene 2.6G 10月 24 16:04 bamtofastq_S1_L004_R2_005.fastq.gz

-rw-r--r-- 1 kazumaxgene 672M 10月 24 15:56 bamtofastq_S1_L004_I1_004.fastq.gz

-rw-r--r-- 1 kazumaxgene 1.2G 10月 24 15:56 bamtofastq_S1_L004_R1_004.fastq.gz

-rw-r--r-- 1 kazumaxgene 2.7G 10月 24 15:56 bamtofastq_S1_L004_R2_004.fastq.gz

-rw-r--r-- 1 kazumaxgene 697M 10月 24 15:56 bamtofastq_S1_L003_I1_001.fastq.gz

-rw-r--r-- 1 kazumaxgene 1.3G 10月 24 15:56 bamtofastq_S1_L003_R1_001.fastq.gz

-rw-r--r-- 1 kazumaxgene 2.8G 10月 24 15:56 bamtofastq_S1_L003_R2_001.fastq.gz

-rw-r--r-- 1 kazumaxgene 680M 10月 24 15:47 bamtofastq_S1_L004_I1_003.fastq.gz

-rw-r--r-- 1 kazumaxgene 1.3G 10月 24 15:47 bamtofastq_S1_L004_R1_003.fastq.gz

-rw-r--r-- 1 kazumaxgene 2.7G 10月 24 15:47 bamtofastq_S1_L004_R2_003.fastq.gz

-rw-r--r-- 1 kazumaxgene 680M 10月 24 15:38 bamtofastq_S1_L004_I1_002.fastq.gz

-rw-r--r-- 1 kazumaxgene 1.3G 10月 24 15:38 bamtofastq_S1_L004_R1_002.fastq.gz

-rw-r--r-- 1 kazumaxgene 2.7G 10月 24 15:38 bamtofastq_S1_L004_R2_002.fastq.gz

-rw-r--r-- 1 kazumaxgene 680M 10月 24 15:30 bamtofastq_S1_L004_I1_001.fastq.gz

-rw-r--r-- 1 kazumaxgene 1.3G 10月 24 15:30 bamtofastq_S1_L004_R1_001.fastq.gz

-rw-r--r-- 1 kazumaxgene 2.7G 10月 24 15:30 bamtofastq_S1_L004_R2_001.fastq.gz

このディレクトリを以後の解析に使用する。

２、cellranger countを実行する。

cellranger count --id=normal \
 --transcriptome=refdata-gex-mm10-2020-A/ \
 --fastqs=indepth_C05_MissingLibrary_1_HL5G3BBXX/,indepth_C05_MissingLibrary_1_HNNWNBBXX/ 
 --chemistry auto --sample=bamtofastq --localcores 16

"--chemistry auto --sample=bamtofastq --localcores 16"はなくても動作する。

出力ディレクトリのouts/が出力

f:id:kazumaxneo:20201026095243p:plain

outs/cloupe.cloupeはloupe rangerで開くことができる。

summary.html f:id:kazumaxneo:20201101123433p:plain

f:id:kazumaxneo:20201101123435p:plain

Tips

SRAからダウンロードしたデータなどで、ディレクトリにfastq.gzを置いているにも関わらずcellrangercountがfastqのファイル名が認識できない場合はこちらを参照（SRAからダウンロードした少し前のデータなど）。

https://kb.10xgenomics.com/hc/en-us/articles/115003802691

他のコマンドは10xのマニュアルを参照して下さい。

引用

What is Cell Ranger? -Software -Single Cell Gene Expression -Official 10x Genomics Support

status 255

https://kb.10xgenomics.com/hc/en-us/articles/360042488811-Cellranger-job-failed-in-stage-code-exit-status-255

2020 12/02 追記

DBKEROのシングルセルチュートリアル。基本的にコピペで実行できるようになっている。

https://kero.hgc.jp/open_tutorials/learning/

2022/09/28

Useful tip & heads-up❗️for #scrnaseq cellranger @10xGenomics users:starting with v.7.0, cellranger count also considers reads aligned to introns (20-40% of all reads).

this is great for increased sensitivity!

however, what to do if previous data has been counted w/o introns? 🧵 pic.twitter.com/m0JZ8u4Cyj
— Simona Cristea (@simocristea) September 26, 2022

2020-10-30

配列比較結果を視覚化する

初心者向け tips heatmap rRNA all versus all sequence comarison

2020 11/2 誤字修正

先日紹介したneedleallやvsearchによるall versus allの配列比較のテキスト出力をもとに、ヒートマップで視覚化する。ここではggplot2パッケージを使う。

EMBOSS needleallによるall versus allの配列比較

１、配列の準備

all versus allの配列比較に使う塩基配列は、ある程度長くて全長がよく似ていれば何でも良いが、ここでは大腸菌と赤痢菌の16S rRNA遺伝子を比較してみる。（このブログでは何でも良いが）一般論としてよく分からない株を比較しても議論できないので、まずLPSNでEscherichia coli とShigella dysenteriae のタイプストレインを調べ、その遺伝子を使う。

Escherichia coli (Migula 1895) 、1895年に"Bacillus coli"として再分類（wiki））は

ATCC 11775; CCUG 24; CCUG 29300; CIP 54.8; DSM 30083; JCM 1649; LMG 2092; NBRC 102203; NCCB 54008; NCTC 9001、がタイプストレインとして登録されている。

Shigella dysenteriae (Shiga 1897)は

ATCC 13313; CIP 57.28; DSM 4781; NCTC 4837、がタイプストレインとして登録されている。

NCBIのbrowse by organism機能（紹介）を使って、これらのタイプストレインの利用可能なゲノムデータを探した。Escherichia coliで探すといきなりATCC 11775の完全長ゲノムが見つかった（accession: NZ_CP033092のバージョン２であるNZ_CP033092.2）。これをダウンロードした（*2）。

f:id:kazumaxneo:20201030193857p:plain

（K-12は病原性がなくて弱い大腸菌）

次にShigella dysenteriaeだが、ATCC 13313の完全長ゲノムが登録されている（NZ_CP026774.1）。これをダウンロードする。

f:id:kazumaxneo:20201030195508p:plain

ここでは2つのゲノムファイルだけ取ってくれば良いので問題ないが、タイプストレインのゲノムを自動で大量に取ってくるのは意外にも難しい（*1）。

ダウンロードしたゲノムからbarrnap（紹介）を使ってSSU rRNA（以後16S rRNA）を取り出す（16S rRNA配列も複数登録されており、アノテーションファイル中にも当然存在すると思われるが、ここでは演習としてゲノムから取り出す）。トラブルの元なのでファイル名に空白は入れないこと。

#E.coli_ATCC11775
barrnap --outseq E.coli_ATCC11775_rRNA.fna --kingdom bac --threads 4 E.coli_ATCC11775.fna > E.coli_ATCC11775_rRNAs.gff

#S.dysenteriae_ATCC13313
barrnap --outseq S.dysenteriae_rRNAs.fa --kingdom bac --threads 4 S.dysenteriae_ATCC13313.fna > S.dysenteriae_rRNAs.gff

--kingdom Kingdom: euk mito bac arc (default 'bac')
--lencutoff [n.n] Proportional length threshold to label as partial (default '0.8')
--reject [n.n] Proportional length threshold to reject prediction (default '0.25')
--evalue [n.n] Similarity e-value cut-off (default '1e-06')

barrnapの出力から16S rRNAを取り出す。S.dysenteriae_ATCC13313ゲノムからは16S rRNAが７コピー見つかった。

f:id:kazumaxneo:20201030201147p:plain

（barrnapで撮り出されたrRNAは向きが統一されており、向きを考えずそのまま配列比較に使える）

E.coli_ATCC11775_rRNAsゲノムからも16S rRNAは7コピー見つかった。

f:id:kazumaxneo:20201030201325p:plain

16S rRNAの配列だけを取り出し、16S.faとして１つのファイルに保存した。長さを確認する。

$ s 16S.fa

file format type num_seqs sum_len min_len avg_len max_len

16S.fa FASTA DNA 14 21,532 1,538 1,538 1,538

全て1,538 bp

ヘッダを確認する。

$ grep -n ">" 16S.fa

1:>16S_rRNA::NZ_CP026774.1:2536626-2538164(+)

3:>16S_rRNA::NZ_CP026774.1:2636479-2638017(+)

5:>16S_rRNA::NZ_CP026774.1:2768885-2770423(+)

7:>16S_rRNA::NZ_CP026774.1:2810847-2812385(+)

9:>16S_rRNA::NZ_CP026774.1:3061594-3063132(+)

11:>16S_rRNA::NZ_CP026774.1:3646780-3648318(+)

13:>16S_rRNA::NZ_CP026774.1:1662010-1663548(-)

15:>16S_rRNA::NZ_CP033092.2:458561-460099(+)

17:>16S_rRNA::NZ_CP033092.2:1285125-1286663(+)

19:>16S_rRNA::NZ_CP033092.2:3780640-3782178(-)

21:>16S_rRNA::NZ_CP033092.2:4551515-4553053(-)

23:>16S_rRNA::NZ_CP033092.2:4591684-4593222(-)

25:>16S_rRNA::NZ_CP033092.2:4844587-4846125(-)

27:>16S_rRNA::NZ_CP033092.2:4726193-4727731(-)

grepで検索するとき、grep -n > 16S.fa としてはならない。

２、16S配列の総当たり比較

ここではvsearchを使う。

vsearch --allpairs_global 16S.fa --id 0.9 --uc vsearch_output

$ head vsearch_output

f:id:kazumaxneo:20201030203413p:plain

vsearch_outputの4,9,10列目を取り出す。不要な行（先頭がアスタリスク）があるので、パイプしてgrepで除く（grep -vでマッチは除外）。

cut -f 9,10,4 vsearch_output | grep -v ^* - > seq_ide

$ head -n 5 seq_ide

99.2 16S_rRNA::NZ_CP026774.1:1662010-1663548(-) 16S_rRNA::NZ_CP033092.2:4726193-4727731(-)

99.1 16S_rRNA::NZ_CP026774.1:1662010-1663548(-) 16S_rRNA::NZ_CP033092.2:458561-460099(+)

99.1 16S_rRNA::NZ_CP026774.1:1662010-1663548(-) 16S_rRNA::NZ_CP033092.2:1285125-1286663(+)

99.1 16S_rRNA::NZ_CP026774.1:1662010-1663548(-) 16S_rRNA::NZ_CP033092.2:3780640-3782178(-)

99.1 16S_rRNA::NZ_CP026774.1:1662010-1663548(-) 16S_rRNA::NZ_CP033092.2:4551515-4553053(-)

遺伝子名に冗長な情報が多く見えづらい。

見づらいのでヘッダの不要な情報を消す。アクセッションIDも株名に置換する。

sed -e "s/16S_rRNA:://g" \
 -e s/$'NZ_CP033092.2:'/S.dysenteriae_/g \
 -e s/$'NZ_CP026774.1:'/E.coli_/g \
 seq_ide > seq_rename

$ head -n 5 seq_rename

99.2 E.coli_1662010-1663548(-) S.dysenteriae_4726193-4727731(-)

99.1 E.coli_1662010-1663548(-) S.dysenteriae_458561-460099(+)

99.1 E.coli_1662010-1663548(-) S.dysenteriae_1285125-1286663(+)

99.1 E.coli_1662010-1663548(-) S.dysenteriae_3780640-3782178(-)

99.1 E.coli_1662010-1663548(-) S.dysenteriae_4551515-4553053(-)

seq_renameを開き、先頭行にidentity<tab>genome1<tab>genome2をつけて上書き保存した。

f:id:kazumaxneo:20201030212110p:plain

３、結果の視覚化

Rのggplot2を使う。

Rのコンソールに切り替え、ファイルを読み込む。

#R
#データの読み込み（カレントにないならフルパスで指定）
input <- read.table("seq_rename", header=T, sep="\t")

> head(input)

identity genome1 genome2

1 99.2 E.coli_1662010-1663548(-) S.dysenteriae_4726193-4727731(-)

2 99.1 E.coli_1662010-1663548(-) S.dysenteriae_458561-460099(+)

3 99.1 E.coli_1662010-1663548(-) S.dysenteriae_1285125-1286663(+)

4 99.1 E.coli_1662010-1663548(-) S.dysenteriae_3780640-3782178(-)

5 99.1 E.coli_1662010-1663548(-) S.dysenteriae_4551515-4553053(-)

6 99.1 E.coli_1662010-1663548(-) S.dysenteriae_4591684-4593222(-)

視覚化する。

#ggplot2
library("ggplot2")

#default settings
ggplot(data = input, aes(x=genome1, y=genome2, fill=identity)) + geom_tile()

#カスタム。99-100%、色は青。内部に数値。figのraioを1
ggplot(data = input, aes(x=genome1, y=genome2, fill=identity)) + geom_tile(color = "white")+
scale_fill_gradient2(low = "#ffffff", high = "#0000cd", mid = "#ffffff", 
 midpoint = 99.5, limit = c(99,100), space = "Lab",
 name="16S rRNA\nsequence identity (%)") +
 theme_minimal()+ 
 theme(axis.text.x = element_text(angle = 90, vjust = 1,
 size = 9, hjust = 1))+
 coord_fixed() + theme(aspect.ratio = 1) + 
 geom_text(aes(label = identity), size = 3)

出力された。一番低い組み合わせでも99.1%の配列同一性があった。

f:id:kazumaxneo:20201030213336p:plain

一部欠損があるが、これはvsearchが計算していない配列の組み合わせがあるため。EMBOSS neddleallではこうゆうことはなくて、14x14比較なら196行出力（& 最後に4行のコメント行）される。また、色が二層性で綺麗に分かれてしまったのは、シゲラのゲノムとE.coliのゲノムの16S rRNAはいずれのコピーも互いにペアワイズ比較すると99.1程度の配列同一性しかなく、同じゲノム内のコピーはほぼ100%合致したため。比較するデータがヒートマップで視覚化するには不向きだったということになる（= 総当たりの表を示せば議論できる。あえてヒートマップで視覚化する必要はない）。

2020 10/31 追記

もう少し距離がある配列間の相同性を視覚化してみる。脊索動物のピルビン酸デヒドロゲナーゼ複合体サブユニット遺伝子５つを使う。

#1 配列のall vs all比較
needleall PDHA1proteins.faa PDHA1proteins.faa out -gapopen 10.0 -gapextend 0.5 -aformat pair

#2 scirptを書いてidentitywの数値を取り出す("# Identity"で始まるの行の()内がIdentity、"# 1:"が比較元の名前、"# 2:"が比較先の名前
perl extracct_best_hit.pl -i out -o out

2の出力

genome1	genome2	identity
NP_000275.1[Homo	NP_000275.1[Homo	100
XP_010641961.1[Fukomys	NP_000275.1[Homo	99.2
XP_006911688.1[Pteropus	NP_000275.1[Homo	97.4
XP_021570775.1[Carlito	NP_000275.1[Homo	95.8
PNI15832.1	NP_000275.1[Homo	93.8
NP_000275.1[Homo	XP_010641961.1[Fukomys	99.2
XP_010641961.1[Fukomys	XP_010641961.1[Fukomys	100
XP_006911688.1[Pteropus	XP_010641961.1[Fukomys	98.2
XP_021570775.1[Carlito	XP_010641961.1[Fukomys	95.5
PNI15832.1	XP_010641961.1[Fukomys	93
NP_000275.1[Homo	XP_006911688.1[Pteropus	97.4
XP_010641961.1[Fukomys	XP_006911688.1[Pteropus	98.2
XP_006911688.1[Pteropus	XP_006911688.1[Pteropus	100
XP_021570775.1[Carlito	XP_006911688.1[Pteropus	93.8
PNI15832.1	XP_006911688.1[Pteropus	91.3
NP_000275.1[Homo	XP_021570775.1[Carlito	95.8
XP_010641961.1[Fukomys	XP_021570775.1[Carlito	95.5
XP_006911688.1[Pteropus	XP_021570775.1[Carlito	93.8
XP_021570775.1[Carlito	XP_021570775.1[Carlito	100
PNI15832.1	XP_021570775.1[Carlito	90
NP_000275.1[Homo	PNI15832.1	93.8
XP_010641961.1[Fukomys	PNI15832.1	93
XP_006911688.1[Pteropus	PNI15832.1	91.3
XP_021570775.1[Carlito	PNI15832.1	90
PNI15832.1	PNI15832.1	100

Rのコンソールに移動する。上の表をコピペ読み込み。

#mac
input <- read.table(pipe("pbpaste"), header = TRUE)

#windows
input <- read.table("clipboard"), header = TRUE) 

#またはファイルinput.tsvに保存し、データフレームに読み込む。 
input <- read.table("input.tsv", header=T, sep="\t")

#視覚化する
#ggplot2
library("ggplot2")

#default settings
ggplot(data = input, aes(x=genome1, y=genome2, fill=identity)) + geom_tile()

#custom1 - 90-100%、青のグラデーション。内部にidentityの数値、figの比率は1
ggplot(data = input, aes(x=genome1, y=genome2, fill=identity)) + geom_tile(color = "white")+
scale_fill_gradient2(low = "#ffffff", high = "#0000cd", mid = "#ffffff", 
 midpoint = 93, limit = c(90,100), space = "Lab",
 name="PDHA\nsequence identity (%)") +
 theme_minimal()+ 
 theme(axis.text.x = element_text(angle = 90, vjust = 1,
 size = 9, hjust = 1))+
 coord_fixed() + theme(aspect.ratio = 1) + 
 geom_text(aes(label = identity), size = 5)

#default settings

f:id:kazumaxneo:20201031160246p:plain

#custom1

f:id:kazumaxneo:20201031160720p:plain

クラスタリングしたり本格的に計算してから視覚化する場合、データフレームをas.matrixで行列に変換してから進める。

参考

大腸菌研究の歴史

https://www.sbj.or.jp/wp-content/uploads/file/sbj/9010/9010_yomoyama-1.pdf

関連

全配列間の多重整列を実行し、その結果から距離行列ファイルを出す。

最近出たGTDBの論文（DOI: https://doi.org/10.1038/s41587-020-0501-8）では、LPSNやGTDB、bacdiveとの識別子の互換性が乏しい点についての苦労話がある。現状、タイプストレインのゲノムだけ公共データベースから大量にfetchするのは骨が折れる。

タイプストレインは有用だが、完全長ゲノム情報が利用できない株がたくさん存在する。

乖離している理由だが、それには第一に、いわゆる細菌学は百年よりずっと昔から続く歴史ある研究分野であるのに対し、細菌ゲノムをサンガーシークエンシングやHTSのシークエンサーで読んでゲノムベースで取り組む研究手法は、登場から数十年程度の歴史しかない、相対的にずっと新しい研究分野であることを考慮する必要がある。

2020-10-28

All Versus Allの配列比較（sequence comparison）を行うEMBOSSの needleall

sequence comparison 2000 Trends in Genetics all versus all sequence comarison

2020 10/29 vsearchのコマンド追記

needleall は入力された一連の配列を読み込み、それらをすべて 1 つ以上の配列と比較し、最適なグローバル配列のアラインメントをファイルに書き込む。Needleman-Wunschアライメントアルゴリズムを使用して、全長に沿った2つの配列の最適アライメント（ギャップを含む）を見つける。このアルゴリズムは動的計画法を使用して、可能なすべてのアラインメントを探索し、最適なものを選択することで、最適なアラインメントを確実にする。可能性のある残基またはヌクレオチドが一致するすべての値を含むスコアリングマトリクスが読み込まれる。needleallは、アラインメントのスコアがスコアリングマトリクスから取得したマッチの合計から、アラインメントされた配列のギャップを開いたり拡張したりすることで生じるペナルティを差し引いた値に等しい最大のスコアを持つアラインメントを見つける。置換マトリクスとギャップの開きと拡張のペナルティはユーザーが指定する。

アルゴリズム

Needleman-Wunsch アルゴリズムは、2 つの配列のベストスコアとアラインメントを mn ステップ（n と m は配列の長さ）で計算できるアルゴリズムの一種である。これらの動的プログラミングアルゴリズムは、タンパク質の配列比較のためにNeedlemanとWunschによって最初に開発されたが、同様の手法は1960年代後半から1970年代初頭にかけて、音声処理やコンピュータサイエンスの分野で使用するために独自に考案された。重要な問題は、ギャップ、すなわちアライメントスコアを最適化するために挿入されるスペースの処理である。各ギャップが開かれるごとにスコアからペナルティが減算され（「ギャップオープン」ペナルティ）、ギャップスペースの合計数にコストを掛けたものがスコアから減算される（「ギャップエクステンション」ペナルティ）。一般的に、ギャップを拡張するためのコストは、ギャップを開くためのコストの5～10倍に設定されている。

n個の位置のギャップに対するペナルティは、次の式を用いて計算される。

gap opening penalty + (n - 1) * gap extension penalty
Needleman-Wunschのグローバルアラインメントでは、各配列の全長がアラインメントされる。配列が部分的に重なっている場合もあれば、一方の配列が他方の配列に完全に内部的にアラインメントされている場合もある。オーバーラップの両端が垂れ下がっていてもペナルティはない。バイオインフォマティクスでは配列が不完全であると考えるのが一般的であり、欠落している塩基のアラインメントに失敗してもペナルティはない。

EMBOSS eexplorer

https://www.bioinformatics.nl/cgi-bin/emboss/needleall

needleall wiki

EMBOSS: needleall

インストール

EMBOSSはcondaやbrewで導入できる。

#bioconda
conda install -c bioconda -y emboss 

#homebrew 
brew install emboss

> needleall -h

$ needleall -h

Many-to-many pairwise alignments of two sequence sets

Version: EMBOSS:6.6.0.0

Standard (Mandatory) qualifiers:

[-asequence] seqset Sequence set filename and optional format,

or reference (input USA)

[-bsequence] seqall Sequence(s) filename and optional format, or

reference (input USA)

-gapopen float [10.0 for any sequence] The gap open penalty

is the score taken away when a gap is

created. The best value depends on the

choice of comparison matrix. The default

value assumes you are using the EBLOSUM62

matrix for protein sequences, and the

EDNAFULL matrix for nucleotide sequences.

(Floating point number from 1.0 to 100.0)

-gapextend float [0.5 for any sequence] The gap extension,

penalty is added to the standard gap penalty

for each base or residue in the gap. This

is how long gaps are penalized. Usually you

will expect a few long gaps rather than many

short gaps, so the gap extension penalty

should be lower than the gap penalty. An

exception is where one or both sequences are

single reads with possible sequencing

errors in which case you would expect many

single base gaps. You can get this result by

setting the gap open penalty to zero (or

very low) and using the gap extension

penalty to control gap scoring. (Floating

point number from 0.0 to 10.0)

[-outfile] align [*.needleall] Output alignment file name

(default -aformat score)

Additional (Optional) qualifiers:

-datafile matrixf [EBLOSUM62 for protein, EDNAFULL for DNA]

This is the scoring matrix file used when

comparing sequences. By default it is the

file 'EBLOSUM62' (for proteins) or the file

'EDNAFULL' (for nucleic sequences). These

files are found in the 'data' directory of

the EMBOSS installation.

-endweight boolean [N] Apply end gap penalties.

-endopen float [10.0 for any sequence] The end gap open

penalty is the score taken away when an end

gap is created. The best value depends on

the choice of comparison matrix. The default

value assumes you are using the EBLOSUM62

matrix for protein sequences, and the

EDNAFULL matrix for nucleotide sequences.

(Floating point number from 1.0 to 100.0)

-endextend float [0.5 for any sequence] The end gap

extension, penalty is added to the end gap

penalty for each base or residue in the end

gap. (Floating point number from 0.0 to

10.0)

-minscore float [1.0 for any sequence] Minimum alignment

score to report an alignment. (Floating

point number from -10.0 to 100.0)

-errfile outfile [needleall.error] Error file to be written

Advanced (Unprompted) qualifiers:

-[no]brief boolean [Y] Brief identity and similarity

General qualifiers:

-help boolean Report command line options and exit. More

information on associated and general

qualifiers can be found with -help -verbose

実行方法

クエリのmulti-fasta、比較先のmulti-fastaの順番で指定する（塩基配列かアミノ酸配列）。コマンドだけ叩くと、対話モードで実行できる。

needleall first_sequences.fasta second_sequences.fasta output -auto

-gapopen float [10.0 for any sequence] The gap open penalty is the score taken away when a gap is created. The best value depends on the choice of comparison matrix. The default value assumes you are using the EBLOSUM62
matrix for protein sequences, and the EDNAFULL matrix for nucleotide sequences. (Floating point number from 1.0 to 100.0)
-gapextend float [0.5 for any sequence] The gap extension, penalty is added to the standard gap penalty for each base or residue in the gap. This is how long gaps are penalized. Usually you will expect a few long gaps rather than many short gaps, so the gap extension penalty should be lower than the gap penalty. An exception is where one or both sequences are single reads with possible sequencing errors in which case you would expect many single base gaps. You can get this result by setting the gap open penalty to zero (or very low) and using the gap extension penalty to control gap scoring. (Floating point number from 0.0 to 10.0)
-auto Turn off prompts

１行形式の出力（"-aformat <>"で変更可能*1）

f:id:kazumaxneo:20201028200806p:plain

avaBLAST（DOI: 10.1109/CIBEC.2008.4786046）というプログラム（all versus all blast 比較を行う）も試したかったのですが、ダウンロードHPが消えていたのでこちらを紹介しました。二配列間の比較にはEMBOSSパッケージのneedleまたはstretcherが利用できます。needleについては、ばいばいバイオさんのHPで分かりやすく説明されています。

ゲノムのような大きな配列間のall vs all比較には使えません。また、遠く離れた配列間の比較にも適していません。注意して下さい。

2020 10/29追記

windowmoonさんに教えていただきました。

vsearchの--allpairs_globalサブコマンドを使う（Needleman-Wunsch の非常に高速な実装が利用でき、needleallの数十倍以上高速）。

vsearch --allpairs_global 16S_seq.fa --id 0.5 --uc output

--alnout <FILENAME> filename for human-readable alignment output
--acceptall output all pairwise alignments
--id <REAL> reject if identity lower
--uc <FILENAME> specify filename for UCLUST-like output

引用

EMBOSS: the European Molecular Biology Open Software Suite.
Rice P, Longden I, Bleasby A

Trends Genet. 2000 Jun;16(6):276-7

参考

The available multiple alignment format names are: unknown, multiple, simple, fasta, msf, trace, srs

The available pairwise alignment format names are: pair, markx0, markx1, markx2, markx3, markx10, srspair, score

関連

2020-10-27

タンパク質をコードする遺伝子配列の組換えイベントや正の選択下にある部位を見つける PoSeiDon

workflow 2019 Bioinformatics docker nextflow evolution Selection pressure recombination Viruses

2020 10/27 テストデータ結果追記

　選択圧力は、遺伝子の進化に継続的に影響を与え、多くの方法で研究することができる（Vittiら、2013）。例えば正の選択、または多様化する選択は、オルソロガスな遺伝子のアラインメントにおける非同義置換（dN）と同義置換（dS）の割合を比較することによって検出することができる。いくつかの部位（コドン）にわたって、dN／dS比（またはω）は、１を大きく上回る値に達することがあり（Yang、2007）、そのような部位は、正に選択されている可能性が高い。例えば、特定のアミノ酸の変化は、それが病原体に対する宿主のフィットネスを増加させる場合に有利である（Fumagalliら、2011）。また、別の方法として病原体の遺伝子が影響を受けることは、COVID 19パンデミックのように、ウイルスのスパイクタンパク質中の陽性選択部位が懸念の原因を与えた（Korberら、2020）。Positive selectionを検出することで、遺伝子の進化を解明し、宿主と常に「腕を競い合う」病原体への対策を立てることができる。
　組換えは進化過程に大きな影響を与え、系統の再構成や正選択の正確な検出に悪影響を及ぼす可能性があるため（Shrinerら、2003）、研究者は、組換えを行うことで、病原体の進化を理解することができ、その対策を立てることができる。アライメント内の組換え部位を定義するためのブレイクポイントのスクリーニングは、比較進化研究の標準的なステップであるべきである。有意にポジティブに選択された部位の包括的な進化解析は、(1)インフレームアライメント、(2)インデル補正、(3)系統樹計算、(4)最適なヌクレオチド置換モデルの選択、(5)トポロジカル不整合の検出とブレークポイントの検出を含む、いくつかの複雑なステップから構成される。(6) 様々なモデルの下で正に選択された部位(ω > 1)を計算し、(7) アライメント全体に作用する選択圧への影響を計算する。このように、このような解析には何十もの異なるツールやパラメータ設定が必要となる。さらに、この分野の進化科学において、多くの確立された、広く使用されているツールの結果は、解釈および処理が容易ではない。特に、仮定的組換えイベントの正確な検出と処理は、挑戦的ではあるが本質的な作業である。ここでは、上記のすべてのタスクを自動的に処理することで、研究者が包括的な進化研究を行うことを可能にするパイプラインであるPoSeiDonを紹介する。

　PoSeiDonは、タンパク質をコードする配列の組換えイベントや正の選択下にある部位を見つけるのに役立つ使いやすいパイプラインである。相同な配列を入力することで、PoSeiDonはアライメントを構築し、最適な置換モデルを推定し、組換え解析を実行した後、対応するすべての系統を構築する。最後に、完全なアライメントと組換えフラグメントの可能性のあるモデルに応じて、選択された部位の中で有意にポジティブな部位が検出される。PoSeiDonの結果は、ユーザーフレンドリーなHTMLページに要約されており、すべての中間結果と組換えイベントとポジティブに選択されたサイトのグラフ表示を提供する。PoSeiDon は https://github.com/hoelzer/poseidon で自由に利用できる。パイプラインはDockerをサポートしたNextflowで実装されており、様々なツールの出力を処理する。

インストール

最初に試したinte xeonのlinuxマシンでは原因不明のエラーが起き、step1から進めなかった。TRのlinuxマシンでは何も問題は起きなかった（どちらも-r v1.0.0指定してラン）。

依存

TranslatorX (v1.1), Abascal et al. (2010)
Muscle (v3.8.31), Edgar (2004)
RAxML (v8.0.25), Stamatakis (2014)
Newick Utilities (v1.6), Junier and Zdobnov (2010)
MODELTEST, Posada and Crandall (1998)
HyPhy (v2.2), Pond et al. (2005)
GARD, Pond et al. (2006)
PaML/CodeML (v4.8), Yang (2007)
Ruby (v2.3.1)
Inkscape (v1.0)
pdfTeX (v3.14)

Github

git clone https://github.com/hoelzer/poseidon.git
cd poseidon
nextflow run poseidon.nf --help

#or pull from the repository（ここではv1.0.0）
nextflow pull hoelzer/poseidon -r v1.0.0

> nextflow run poseidon.nf --help

$ nextflow run poseidon.nf --help

N E X T F L O W ~ version 20.07.1

Launching `poseidon.nf` [wise_leakey] - revision: 557606b4ba

WARN: DSL 2 IS AN EXPERIMENTAL FEATURE UNDER DEVELOPMENT -- SYNTAX MAY CHANGE IN FUTURE RELEASE

____________________________________________________________________________________________

PoSeiDon -- Positive Selection Detection and Recombination Analysis

Usage example:

nextflow run poseidon.nf --fasta '*/*.fasta'

Input:

--fasta '*.fasta' -> one FASTA file per transcriptome assembly

..change above input to csv: --list

General options:

--cores max cores per process for local use [default 6]

--max_cores max cores used on the machine for local use [default 56]

--memory memory limitations for polisher tools in GB [default: 8 GB]

--output name of the result folder [default: results]

--reference resulting amino acid changes and sites will be reported according to this species (FASTA id) [default: NA]

--root outgroup species (FASTA id) for tree rooting; comma-separated [default: NA]

--bootstrap number of bootstrap calculations [default: 100]

Model parameters:

--model nucleotide model used for recombination analysis, will be estimated automatically if not defined [default: 010010]

--model_rc model rate classes [default: 4]

--model_sm model selection method [default: 1]

--model_rl model rejection level [default: 0.05]

Recombination parameters (GARD):

--gard_rv GARD rate variation [default: 2]

--gard_rc GARD rate classes [default: 3]

--kh use insignificant breakpoints (based on KH test) for fragment calcuations [default: false]

Nextflow options:

-with-report rep.html cpu / ram usage (may cause errors)

-with-dag chart.html generates a flowchart for the process tree

-with-timeline time.html timeline (may cause errors)

LSF computing:

For execution of the workflow on a HPC with LSF adjust the following parameters:

--workdir defines the path where nextflow writes tmp files [default: /tmp/nextflow-work-kazu]

--cachedir defines the path where images (singularity) are cached [default: singularity]

Profile:

Merge profiles comma-separated

-profile local,docker

local,conda

lsf,docker,singularity (adjust workdir and cachedir according to your HPC config)

slurm,conda (adjust workdir and cachedir according to your HPC config)

gcloud,docker (GCP google-lifescience with docker)

テストラン

相同な配列を入力すると、PoSeiDonはアライメントを構築し、最適な置換モデルを推定し、組換え解析を実行し、その後、対応するすべての系統の構築を行う。最後に、完全なアラインメントと可能性のある組換えフラグメントのモデルに従って、選択されたサイトが有意に陽性であるかどうかまとめられる。

FASTA配列を指定する。テストデータとして用意されているのは、SARS-CoVのスパイク糖タンパク質をコードする遺伝子8配列のmulti-fastaになる（入力は塩基配列）。

nextflow run poseidon.nf --fasta examples/cov/Spike_ali_nucl-VIRULIGN.fasta --cores 12

--fasta one FASTA file per transcriptome assembly
--reference resulting amino acid changes and sites will be reported according to this species (FASTA id) [default: NA]
--root outgroup species (FASTA id) for tree rooting; comma-separated [default: NA]
--bootstrap number of bootstrap calculations [default: 100]

ランにはfastaファイルを指定する。複数ある場合はワイルドカード指定する。 dockerの実行権がユーザーにないならsudo実行する。

相同タンパク質コード配列のインフレームアライメント、組換えイベントと進化的ブレイクポイントの検出、系統再構成、フルアライメントと可能性のあるすべてのフラグメントのポジティブに選択されたサイトの検出が行われる。最後に、すべての結果を組み合わせて、ユーザーフレンドリーなHTMLのウェブページに結果がまとめられる。

出力

f:id:kazumaxneo:20201028110633p:plain

Spike_ali_nucl-VIRULIGN/

f:id:kazumaxneo:20201028110638p:plain

結果は静的なhtmlとして保存される。

html/fragment_1/index.html

f:id:kazumaxneo:20201028111109p:plain

左のメニューから各ファイルにアクセスできる。

例えばツリー => PDFをクリック

f:id:kazumaxneo:20201028111442p:plain

Treeファイルが表示された。

f:id:kazumaxneo:20201028111453p:plain

recombination => GARD （paper）

f:id:kazumaxneo:20201028111616p:plain

組換えイベントの検出と視覚化

f:id:kazumaxneo:20201028111708p:plain

significant sites

f:id:kazumaxneo:20201028112106p:plain

ハイクオリティな表として出力される。

f:id:kazumaxneo:20201028110829p:plain

引用

PoSeiDon: a Nextflow pipeline for the detection of evolutionary recombination events and positive selection
Martin Hölzer, Manja Marz
Bioinformatics, Published: 31 July 2020

nextflowのインストール

wget -qO- https://get.nextflow.io | bash
cp nextflow /usr/local/bin/

このツールはikemen Mas Kotさんのツイートをホタペンさんがリツートしてくれて知ったと記憶しています。教えていただきありがとうございました。

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

複雑な構造多型の分析と視覚化を行う Svpluscnv

異なるphylogenetic cladesで保存されているタンパク質を検索するwebサービス PhyloGene

データベースのFASTAファイルをBLASTでのアラインメントに適した小さな断片に分割する AlignBucket

10x genomicsのシングルセルRNA-seq解析パイプライン cellranger（version4について）

配列比較結果を視覚化する

All Versus Allの配列比較（sequence comparison）を行うEMBOSSの needleall

タンパク質をコードする遺伝子配列の組換えイベントや正の選択下にある部位を見つける PoSeiDon