2020-08-06

エラーの多いロングリードをアセンブリグラフにアラインする GraphAligner

ゲノムグラフは、遺伝的変異や配列の不確実性を表現することができる。ゲノムグラフに配列をアラインさせることは、エラー修正、ゲノムアセンブリ、パンゲノムグラフ内のバリアントのジェノタイピングなど、多くのアプリケーションの鍵を握っている。しかし、これまでのところ、この作業は非常に時間がかかることが多い。本研究では、ロングリードをゲノムグラフにアラインメントするためのツールであるGraphAlignerを紹介する。GraphAlignerは、最新のツールと比較して、12倍の速度と5倍少ないメモリ使用量を実現しており、直鎖リファレンスゲノムにリードをアラインメントするのと同等の効率性を実現している。エラー修正のためにGraphAlignerを使用した場合、既存のツールと比較して、ほぼ3倍の精度と15倍以上の速度が得られる。

インストール

ubuntu18.04でテストした。

GIthub

#bioconda(link)
conda install -c bioconda graphaligner

> GraphAligner -h

$ GraphAligner -h

GraphAligner bioconda 1.0.12-

Mandatory parameters:

-g [ --graph ] arg input graph (.gfa / .vg)

-f [ --reads ] arg input reads (fasta or fastq, uncompressed or

gzipped)

-a [ --alignments-out ] arg output alignment file (.gaf/.gam/.json)

--corrected-out arg output corrected reads file (.fa/.fa.gz)

--corrected-clipped-out arg output corrected clipped reads file (.fa/.fa.gz)

General parameters:

-h [ --help ] help message

--version print version

-t [ --threads ] arg number of threads (int) (default 1)

--verbose print progress messages

--E-cutoff arg discard alignments with E-value > arg

--all-alignments return all alignments instead of the best

non-overlapping alignments

--extra-heuristic use heuristics to discard more seed hits

--try-all-seeds don't use heuristics to discard seed hits

--global-alignment force the read to be aligned end-to-end even if the

alignment score is poor

--optimal-alignment calculate the optimal alignment (VERY SLOW)

Seeding:

--seeds-clustersize arg discard seed clusters with fewer than

arg seeds (int)

--seeds-extend-density arg extend up to approximately the best

(arg * sequence length) seeds (double)

(-1 for all)

--seeds-minimizer-length arg k-mer length for minimizer seeding

(int)

--seeds-minimizer-windowsize arg window size for minimizer seeding (int)

--seeds-minimizer-density arg keep approximately (arg * sequence

length) least common minimizers

(double) (-1 for all)

--seeds-minimizer-ignore-frequent arg ignore arg most frequent fraction of

minimizers (double)

--seeds-mum-count arg arg longest maximal unique matches

fully contained in a node (int) (-1 for

all)

--seeds-mem-count arg arg longest maximal exact matches fully

contained in a node (int) (-1 for all)

--seeds-mxm-length arg minimum length for maximal unique /

exact matches (int)

--seeds-mxm-cache-prefix arg store the mum/mem seeding index to the

disk for reuse, or reuse it if it

exists (filename prefix)

-s [ --seeds-file ] arg external seeds (.gam)

--seeds-first-full-rows arg no seeding, instead calculate the first

arg rows fully. VERY SLOW except on

tiny graphs (int)

Extension:

-b [ --bandwidth ] arg alignment bandwidth (int)

-B [ --ramp-bandwidth ] arg ramp bandwidth (int)

-C [ --tangle-effort ] arg tangle effort limit, higher results in slower but

more accurate alignments (int) (-1 for unlimited)

--high-memory use slightly less CPU but a lot more memory

Preset parameters:

-x [ --preset ] arg Preset parameters

dbg - Parameters optimized for de Bruijn graphs

vg - Parameters optimized for variation graphs

テストラン

入力としてリードのfasta/fastq、アセンブリグラフとしてGFA/vgを指定する。ここではGAFフォーマット（GAF）で出力しているが、-a aln.gamを使用すると、vgと互換性のある出力が得られる。

git clone https://github.com/maickrau/GraphAligner.git
cd GraphAligner/
GraphAligner -g test/graph.gfa -f test/read.fa -a aln.gaf -x vg

-g input graph (.gfa / .vg)
-f input reads (fasta or fastq, uncompressed or gzipped)
-a output alignment file (.gaf/.gam/.json)
-t number of threads (int) (default 1)
-x Preset parameters
dbg - Parameters optimized for de Bruijn graphs
vg - Parameters optimized for variation graphs

"-x vg"は、variation graphsにリードを整列させるためのパラメータプリセットで、"-x dbg"はde brujin graphにアラインメントするプリセットになる。

引用

Bit-parallel sequence-to-graph alignment
Mikko Rautiainen, Veli Mäkinen, Tobias Marschall
Bioinformatics, Volume 35, Issue 19, 1 October 2019, Pages 3599–3607

関連

2020-08-05

フェージングの品質を評価、改善する phaseME

2020 GigaScience evaluation tool phasing

　同じDNA分子上でどの突然変異が発生しているかを検出することは、その結果を予測するために不可欠である。これは、ゲノム変異のphasingによって達成することができる。それにもかかわらず、最先端のハプロタイプphasingは、現在のところ、再構成されたハプロタイプの精度と品質を評価することが困難なブラックボックスとなっている。
ここでは、リンケージデータに基づいてサンプルのphasing結果を理解し、改善するための汎用性の高い手法であるPhaseMEを紹介する。ロングリードと高品質のコンセンサスリードの両方を含むPacific Biosciences、そしてOxford Nanopore Technologies、10x Genomics、Illuminaのシーケンシング技術についてPhaseMEの性能と重要性を比較する。10x GenomicsとOxford Nanoporeでは、高いN50とphaseブロックの完全性を維持しつつ、Phasingを大幅に改善できることがわかった。PhaseMEは、レポートとサマリープロットを生成して、フェージング性能と正確性についての洞察を提供する。シーケンシング技術のそれぞれに固有のphasingの問題が観察され、品質評価の必要性が強調された。PhaseMEは、5つの技術すべてにおいて、ハミングエラー率(ref.29)を平均で22.4%も大幅に減少させることができた。さらに、ロングスイッチエラー (inaccurately joined haplotypes)の低減においても大きな改善が得られている。特に高品質のコンセンサスリードについては、54.6%の改善が見られ、その見返りとして、phase block N50の長さが5%だけ減少した。
　PhaseMEは、リンケージ情報を利用してphasingの品質と精度を評価し、phasingの品質を向上させるためのユニバーサルな手法である。パッケージは https://github.com/smajidian/phaseme で自由に利用できる。

Great collaboration with @Siiinaaam: PhaseME: utilization of linkage information to assess and improve the accuracy of phasing using @PacBio / @nanopore / @illumina. Hamming error rate reduced by on average 22.4%. https://t.co/M7S8UnHPzU @GigaScience @BCM_HGSC #Bioinformatics pic.twitter.com/omUMFcymbQ
— Fritz Sedlazeck (@sedlazeck) 2020年7月24日

インストール

ubuntu18.04のpython3.7環境でテストした。

Github

git clone https://github.com/smajidian/phaseme.git
cd phaseme/

> python --version

# python --version

Python 3.7.3

(base) root@d2fd96e04b90:/data# python phaseme.py -h

Please provide enough argumnets

python phaseme.py qc my.vcf out

python phaseme.py improver my.vcf out

python phaseme.py improver my.vcf out path_shapeit path_1000g

If you choose individual, please make sure that the followings work well in bash: ls path_1000g/1000GP_Phase3_chr1.hap.gz; ls path_shapeit/shapeit; ./path_shapeit/shapeit;

テストラン

https://github.com/smajidian/phaseme/tree/master/example

Precomputed Mode

１、QC

WhatsHapなどでphase化されたvcfを出力し、そのvcfを指定する。

python phaseme.py quality example/my.vcf example/out_pre

出力

f:id:kazumaxneo:20200805235820p:plain

２、Improve

python phaseme.py improver example/my.vcf example/out_pre_imp

出力

f:id:kazumaxneo:20200806000006p:plain

QC.csv

f:id:kazumaxneo:20200806011505p:plain

引用

PhaseME: Automatic rapid assessment of phasing quality and phasing improvement
Sina Majidian, Fritz J Sedlazeck
GigaScience, Volume 9, Issue 7, July 2020

関連

2020-08-04

De Bruijnアセンブラ Minia

2020 8/5 誤字修正

de Bruijnグラフデータ構造は、次世代シークエンシング（NGS）で広く利用されている。デノボアセンブラなどの多くのプログラムは、このグラフのインメモリ表現に依存している。しかし、現在のヒトゲノムのde Bruijnグラフを表現する技術では、30GB以上のメモリを必要とする。

　本研究では、de Bruijnグラフの新しい符号化方式を提案する。このエンコーディングはBloomフィルタに基づいており、重要な偽陽性を除去するための構造を追加している。この構造を実装したアセンブリソフトウェアMiniaは、5.7GBのメモリを使用してヒトゲノムのショートリードの完全なde novoアセンブリを23時間で実行した。

インストール

ビルド依存

CMake 2.6+; see http://www.cmake.org/cmake/resources/software.html
C++11 compiler; (g++ version>=4.7 (Linux), clang version>=4.3 (Mac OSX))

Github

#bioconda(link)
conda install -c bioconda minia -y

#building from source
git clone --recursive https://github.com/GATB/minia.git
cd minia/
sh INSTALL

> minia

$ minia

Minia 3, git commit b0715310a

Specifiy -in

[minia options]

[assembly options]

-in (1 arg) : input reads (fasta/fastq/compressed) or hdf5 file [default '']

-keep-isolated (0 arg) : keep short (<= max(2k, 150 bp)) isolated output sequences

-traversal (1 arg) : traversal type ('contig', 'unitig') [default 'contig']

-fasta-line (1 arg) : number of nucleotides per line in fasta output (0 means one line) [default '0']

[graph simplifications options]

-no-bulge-removal (0 arg) : ask to not perform bulge removal

-no-tip-removal (0 arg) : ask to not perform tip removal

-no-ec-removal (0 arg) : ask to not perform erroneous connection removal

-tip-len-topo-kmult (1 arg) : remove all tips of length <= k * X bp [default '2.500000']

-tip-len-rctc-kmult (1 arg) : remove tips that pass coverage criteria, of length <= k * X bp [default '10.000000']

-tip-rctc-cutoff (1 arg) : tip relative coverage coefficient: mean coverage of neighbors > X * tip coverage [default '2.000000']

-bulge-len-kmult (1 arg) : bulges shorter than k*X bp are candidate to be removed [default '3.000000']

-bulge-len-kadd (1 arg) : bulges shorter than k+X bp are candidate to be removed [default '100']

-bulge-altpath-kadd (1 arg) : explore up to k+X nodes to find alternative path [default '50']

-bulge-altpath-covmult (1 arg) : bulges of coverage <= X*cov_altpath will be removed [default '1.100000']

-ec-len-kmult (1 arg) : EC shorter than k*X bp are candidates to be removed [default '9.000000']

-ec-rctc-cutoff (1 arg) : EC relative coverage coefficient (similar in spirit as tip) [default '4.000000']

-no-mphf (0 arg) : don't construct the MPHF

[kmer count options]

-kmer-size (1 arg) : size of a kmer [default '31']

-abundance-min (1 arg) : min abundance threshold for solid kmers [default '2']

-abundance-max (1 arg) : max abundance threshold for solid kmers [default '2147483647']

-abundance-min-threshold (1 arg) : min abundance hard threshold (only used when min abundance is "auto") [default '2']

-histo-max (1 arg) : max number of values in kmers histogram [default '10000']

-solidity-kind (1 arg) : way to compute counts of several files (sum, min, max, one, all, custom) [default 'sum']

-solidity-custom (1 arg) : when solidity-kind is custom, specifies list of files where kmer must be present [default '']

-max-memory (1 arg) : max memory (in MBytes) [default '5000']

-max-disk (1 arg) : max disk (in MBytes) [default '0']

-solid-kmers-out (1 arg) : output file for solid kmers (only when constructing a graph) [default '']

-out (1 arg) : output file [default '']

-out-dir (1 arg) : output directory [default '.']

-out-tmp (1 arg) : output directory for temporary files [default '.']

-out-compress (1 arg) : h5 compression level (0:none, 9:best) [default '0']

-storage-type (1 arg) : storage type of kmer counts ('hdf5' or 'file') [default 'hdf5']

-histo2D (1 arg) : compute the 2D histogram (with first file = genome, remaining files = reads) [default '0']

-histo (1 arg) : output the kmer abundance histogram [default '0']

[kmer count, advanced performance tweaks options]

-minimizer-type (1 arg) : minimizer type (0=lexi, 1=freq) [default '0']

-minimizer-size (1 arg) : size of a minimizer [default '10']

-repartition-type (1 arg) : minimizer repartition (0=unordered, 1=ordered) [default '0']

[bloom options]

-bloom (1 arg) : bloom type ('basic', 'cache', 'neighbor') [default 'neighbor']

-debloom (1 arg) : debloom type ('none', 'original' or 'cascading') [default 'cascading']

-debloom-impl (1 arg) : debloom impl ('basic', 'minimizer') [default 'minimizer']

[branching options]

-branching-nodes (1 arg) : branching type ('none' or 'stored') [default 'stored']

-topology-stats (1 arg) : topological information level (0 for none) [default '0']

[general options]

-config-only (0 arg) : dump config only

-nb-cores (1 arg) : number of cores [default '0']

-all-abundance-counts (0 arg) : output all k-mer abundance counts instead of mean

-edge-km (1 arg) : edge km representation [default '0']

-verbose (1 arg) : verbosity level [default '1']

-integer-precision (1 arg) : integers precision (0 for optimized value) [default '0']

[debug options]

-redo-bcalm (0 arg) : debug function, redo the bcalm algo

-skip-bcalm (0 arg) : same, but skip bcalm

-redo-bglue (0 arg) : same, but redo bglue (needs debug_keep_glue_files=true in source code)

-skip-bglue (0 arg) : same, but skip bglue

-redo-links (0 arg) : same, but redo links

-skip-links (0 arg) : same, but skip links

-nb-glue-partitions (1 arg) : number of glue partitions (automatically calculated by default) [default '0']

実行方法

入力のfastqを指定する。

minia minia -in input.fq -kmer-size 31 -out minia_assembly

-in input reads (fasta/fastq/compressed) or hdf5 file [default '']
-kmer-size size of a kmer [default '31']

引用
Space-efficient and exact de Bruijn graph representation based on a Bloom filter

Rayan Chikhi, Guillaume Rizk

Algorithms Mol Biol. 2013 Sep 16;8(1):22

関連

参考

2020-08-03

Transcript-level Aware なロングリードのエラーコレクションを行う TALC

2020 error correction Nanopore direct RNA-seq Nanopore long read Bioinformatics

　ロングリードシーケンシング技術は、複雑なRNAトランスクリプト構造を決定するために非常に重要だが、エラーが発生しやすい。同じサンプルからシーケンスされたショートリードの精度と深さを利用してロングリードを補正する「ハイブリッド補正」アルゴリズムがゲノムデータ用に多数開発されている。これらのアルゴリズムは、より複雑なトランスクリプトームシーケンスデータの補正には適していない。
　著者らは、TALC (Transcript-level Aware Long Read Correction)と呼ばれる新しいリファレンスフリーのアルゴリズムを作成した。これは、RNA発現とアイソフォーム表現の変化を重み付きDe-Bruijnグラフでモデル化し、トランスクリプトーム研究のロングリードを補正するものである。TALCによるトランスクリプトレベルアウェアなロングリード補正は、下流のRNA-seqアプリケーションの全スペクトルの精度を向上させ、ロングリード技術を用いたトランスクリプトーム解析に必要であることを示している。TALCはC ++で実装されており、https://github.com/lbroseus/TALCで入手できる。

インストール

ubuntu18.04でテストした。jellyfish2はcondaでjellyfish2の仮想環境を作って導入、実行した（bioconda jellyfish）。

ビルド依存

gcc version > 5.

GitLab

git clone https://gitlab.igh.cnrs.fr/lbroseus/TALC.git
cd TALC
git clone https://github.com/seqan/seqan.git
make -j

> ./talc

# ./talc -h

******************************************************

* TALC : Transcriptome-Aware Long Read Correction *

*----------------------------------------------------*

* *

* Kmers are assumed directional *

******************************************************

[TALC]: Parsing arguments

TALC: Transcriptome-Aware Long Read Correction - Hybrid Long Read Correction using Short Read coverage

======================================================================================================

SYNOPSIS

DESCRIPTION

REQUIRED ARGUMENTS

Input_fastQ/A_file_containing_Long_Reads STRING

OPTIONS

-h, --help

Display the help message.

--version-check BOOL

Turn this option off to disable version update notifications of the application. One of 1, ON, TRUE, T, YES,

0, OFF, FALSE, F, and NO. Default: 1.

--version

Display version information.

-o, --output STRING

Prefix to be used for output files Default: out.

-k, --kmerSize INTEGER

length k of k-mers. In range [18..30].

-qm, --query-mode STRING

Mode that should be used to query kmers (advised: memory). One of memory and jellyfish2. Default: memory.

-SR, --SRCounts STRING

Short reads kmer counts in format .dump file (memory mode) or .jf file (jf2 mode), obtained from Jellyfish2

-j, --junctions STRING

k-mers flanking junctions and their counts.

-jf2, --pathToJF2 STRING

Specifies where the Jellyfish2 program should be found. Default: .

-MIN_INNER_SCORE, --MIN_INNER_SCORE DOUBLE

Minimum %ID score required for the best-correction-path-between-any-two-solid-regions to be kept. In range

[0.3..0.9]. Default: 0.7.

-MIN_BORDER_SCORE, --MIN_BORDER_SCORE DOUBLE

Minimum %ID score required for the best-correction-path-on-left-or-right-borders to be kept. In range

[0.5..0.9]. Default: 0.7.

-MIN_COUNT, --MIN_COUNT INTEGER

Minimal count for a k-mer to be kept in the SR-de Bruijn Graph. In range [2..inf]. Default: 2.

-SR_ERROR_RATE, --SR_ERROR_RATE DOUBLE

Prior estimate of the error rate in short reads. In range [0.01..0.1]. Default: 0.025.

-WINDOW_SIZE, --WINDOW_SIZE INTEGER

[ADVANCED] Size of the window. The larger, the more fresh air. In range [6..inf]. Default: 9.

-MAX_NB_BRANCHES, --MAX_NB_BRANCHES INTEGER

[ADVANCED] Maximal number of competing branches. In range [5..inf]. Default: 7.

-ALPHA_FOR_PRED, --ALPHA_FOR_PRED DOUBLE

[ADVANCED] Coefficient used to build count confidence interval: [count +/- ALPHA*sqrt(count)]. In range

[0.67..inf]. Default: 2.57.

-t, --num_threads INTEGER

number of threads In range [1..inf]. Default: 1.

-DEBUG_MODE, --DEBUG_MODE STRING

Will activate output for debugging purposes

-rev, --reverse

If set, long reads will be reverse complemented before correction by short reads.

VERSION

Last update: September 2019

TALC: Transcriptome-Aware Long Read Correction version: 1.01

SeqAn version: 2.4.0

実行方法

１、jellyfish2によるk-merカウント

raw fastqを指定する。

jellyfish count --mer 25 -s 100M -o out.jf -t 20 pair_R*
jellyfish dump -c out.jf > out.dump

-s Initial hash size
-m Length of mer

2、talcによるエラーコレクション

ロングリード（raw fastq/fasta）と１の出力のdumpファイルを指定する。

talc long-reads.fq --SRCounts out.dump -k 31 -o talc-out -t 30 > log

-o Prefix to be used for output files Default: out
-k length k of k-mers. In range [18..30]

出力

f:id:kazumaxneo:20200802232938p:plain

talc-out.faがエラー修正されたリングリード。

TALCの重み付きde Bruijnグラフは方向性があるので、ショートリードとロングリードの配列は同じ方向でなければならない。ロングリードがショートリードと逆相補的な場合は”--reverse”を追加する。また既知スプライスジャンクション情報をエラー修正時に提供することもできる。手順はGitlabで確認してください。

引用

TALC: Transcript-level Aware Long Read Correction
Lucile Broseus, Aubin Thomas, Andrew J Oldfield, Dany Severac, Emeric Dubois, William Ritchie
Bioinformatics, Published: 16 July 2020

2020-08-02

tty-tableを使ってターミナルで表を整形表示する

tips

2020 8/8 タイトル修正

タイトルの通り。CLI環境で表を見やすく表示する。

インストール

Node.jsのパッケージマネージャnpmを使う（gemにも対応）
npm install tty-table -g

> tty-table -h

$ tty-table -h

オプション:

--version バージョンを表示 [真偽]

--config Specify the configuration for your table.

--csv-delimiter Set the field delimiter. One character only.

[デフォルト: ","]

--csv-escape Set the escape character. One character only.

--csv-rowDelimiter String used to delimit record rows. You can also use a

special constant: "auto","unix","max","windows","unicode".

[デフォルト: "

--format Set input data format

[選択してください: "json", "csv"] [デフォルト: "csv"]

--options‐* Specify an optional setting where * is the setting name.

See README.md for a complete list.

-h ヘルプを表示 [真偽]

実行方法

cat input.csv |tty-table

f:id:kazumaxneo:20200802151257p:plain

JOSN形式のテーブルなら

cat input.csv |tty-table --format json

excelのxlsxファイルは、xlsx2csvでCSVに変換すれば読み込める(*1)。

xlsx2csv input.xlsx |tty-table - |less

他のCSV整形表示コマンドと比較してみる。

csvtkのcsvlook (*2)

csvlook input.csv

f:id:kazumaxneo:20200802152642p:plain

こちらも見やすい。

columnコマンドだと

column -s, -t input.csv

f:id:kazumaxneo:20200802152507p:plain

標準で使えるcolumnコマンドよりtty-tableやcsvlookの方が見やすい。

引用

GitHub - piotrmurach/tty-table: A flexible and intuitive table generator

参考

pip install xlsx2csv

pip install csvtk

2020-08-01

De brujin アセンブラ BCALM 2

2016 Bioinformatics k-mer assembly graph assembly

　シーケンシング実験あたりのデータ量が増加するにつれて、フラグメントアセンブリはますます計算量が増加している。De Bruijn graphは、フラグメントアセンブリアルゴリズムで広く使用されているデータ構造で、リードのセットからの情報を表現するために使用されている。Compactionは、長い単純なパスを単一の頂点に圧縮するde Bruijn graphベースのアルゴリズムにおいて、データ削減の重要なステップである。最近ではアセンブリパイプラインのボトルネックとなっている圧縮処理の実行時間とメモリ使用量を改善することが重要な課題となっている。

　ここではDe Bruijn graphの圧縮のためのアルゴリズムとツールBCALM 2を紹介する。BCALM2はミニマムハッシュ法に基づいて入力を分散させる並列アルゴリズムであり、実行中のメモリ使用量のバランスを考慮している。ヒトのシーケンシングデータに対しては、BCALM2を適用することで、de Bruijn graphの圧縮にかかる計算負荷を約1時間、メモリ使用量を3GBに削減することができた。また、22 Gbpのloblolly pineと20Gbpのwhite spruceのシーケンシングデータにもBCALM 2を適用した。コンパクト化されたグラフは、1台のマシン上で2日以内に生のリードから40GBのメモリで構築された。そのため、BCALM 2は他の方法と比較して、少なくとも一桁以上の効率性がある。

インストール

macos10.14でビルドしてテストした。

ビルド依存

GCC >= 4.8 or a very recent C++11 capable compiler

Github

#bioconda (link)
conda install -c bioconda bcalm -y

#from source
git clone --recursive https://github.com/GATB/bcalm 
cd bcalm
mkdir build; cd build
cmake ..; make -j 8

#large k-merを扱うなら
git clone --recursive https://github.com/GATB/bcalm 
cd bcalm
mkdir build; cd build
rm -Rf CMake* && cmake -DKSIZE_LIST="32 64 96 128 160 192 224 256 320" .. && make -j 8

> ./bcalm

$ ./bcalm

BCALM 2, version v2.2.3, git commit 9b7b581

Specifiy -in

[bcalm_1 options]

-nb-cores (1 arg) : number of cores [default '0']

-verbose (1 arg) : verbosity level [default '1']

-version (0 arg) : version

-help (0 arg) : help

[graph options]

-no-mphf (0 arg) : don't construct the MPHF

[kmer count options]

-in (1 arg) : reads file [default '']

-kmer-size (1 arg) : size of a kmer [default '31']

-abundance-min (1 arg) : min abundance threshold for solid kmers [default '2']

-abundance-max (1 arg) : max abundance threshold for solid kmers [default '2147483647']

-solidity-custom (1 arg) : when solidity-kind is custom, specifies list of files where kmer must be present [default '']

-max-memory (1 arg) : max memory (in MBytes) [default '5000']

-max-disk (1 arg) : max disk (in MBytes) [default '0']

-out (1 arg) : output file [default '']

-out-dir (1 arg) : output directory [default '.']

-out-tmp (1 arg) : output directory for temporary files [default '.']

-out-compress (1 arg) : h5 compression level (0:none, 9:best) [default '0']

-storage-type (1 arg) : storage type of kmer counts ('hdf5' or 'file') [default 'hdf5']

-histo2D (1 arg) : compute the 2D histogram (with first file = genome, remaining files = reads) [default '0']

-histo (1 arg) : output the kmer abundance histogram [default '0']

[kmer count, advanced performance tweaks options]

-minimizer-type (1 arg) : minimizer type (0=lexi, 1=freq) [default '1']

-minimizer-size (1 arg) : size of a minimizer [default '10']

-repartition-type (1 arg) : minimizer repartition (0=unordered, 1=ordered) [default '1']

[bloom options]

-bloom (1 arg) : bloom type ('basic', 'cache', 'neighbor') [default 'neighbor']

-debloom (1 arg) : debloom type ('none', 'original' or 'cascading') [default 'cascading']

-debloom-impl (1 arg) : debloom impl ('basic', 'minimizer') [default 'minimizer']

[branching options]

-branching-nodes (1 arg) : branching type ('none' or 'stored') [default 'stored']

-topology-stats (1 arg) : topological information level (0 for none) [default '0']

[general options]

-config-only (0 arg) : dump config only

-nb-cores (1 arg) : number of cores [default '0']

-all-abundance-counts (0 arg) : output all k-mer abundance counts instead of mean

-edge-km (1 arg) : edge km representation [default '0']

-verbose (1 arg) : verbosity level [default '1']

-integer-precision (1 arg) : integers precision (0 for optimized value) [default '0']

[debug options]

-redo-bcalm (0 arg) : debug function, redo the bcalm algo

-skip-bcalm (0 arg) : same, but skip bcalm

-redo-bglue (0 arg) : same, but redo bglue (needs debug_keep_glue_files=true in source code)

-skip-bglue (0 arg) : same, but skip bglue

-redo-links (0 arg) : same, but redo links

-skip-links (0 arg) : same, but skip links

-nb-glue-partitions (1 arg) : number of glue partitions

実行方法

fastqのリストファイルを指定する。

ls -1 *.fastq > list_reads
bcalm -in list_reads -kmer-size 55 -abundance-min 2 -out-dir outdir

-kmer-size size of a kmer [default '31']
-abundance-min min abundance threshold for solid kmers [default '2']
-out-dir output directory [default '.']

unitigが出力される。

出力について

https://github.com/GATB/bcalm/blob/master/bidirected-graphs-in-bcalm2/bidirected-graphs-in-bcalm2.md

付属のスクリプトでunitigをGFAに変換する。unitig、出力GFA、k-mer値の順番に指定する。

python bcalm/scripts/convertToGFA.py list_reads.unitigs.fa output.gfa 55

引用

Compacting de Bruijn graphs from sequencing data quickly and in low memory
Rayan Chikhi, Antoine Limasset, Paul Medvedev
Bioinformatics, Volume 32, Issue 12, 15 June 2016, Pages i201–i208

2020-07-31

高速なアライナー Accel-Align

mapping 高速なツール 2020 Preprint 2021 BMC Bioinformatics

　シーケンシング技術の向上により、シーケンシングコストはゲノムあたり100ドルに向かって進み続けている。しかし、シーケンスデータをリファレンスゲノムにマッピングすることは、シーケンスによって導入されるindelやミスマッチを処理するための編集距離に依存しているため、計算量の多い作業であることに変わりない。最近のすべてのアライナーは、編集距離計算のオーバーヘッドを減らすために、シードフィルタリング拡張（SFE）手法を使用し、フィルタリングヒューリスティックに依存している。しかし、フィルタリングはエラーパターンを前提としており、固有の性能と精度のトレードオフがあり、データセットに適合させるためには慎重なハンドチューニングが必要である。

　ランダム化された低歪みエンベッディングにおけるアルゴリズムの進歩に触発され、シーケンスマッパーやアライナーを開発するための新しい設計手法であるSEE（seed-embed-extend）を導入した。SFEが最適でない候補を排除することに焦点を当てるのに対し、SEEは最適な候補を特定することに焦点を当てている。そのために、SEEは、ランダム化されたアルゴリズムを用いて埋め込みを行うことで、読み取り文字列と参照文字列を編集距離領域からハミング領域に変換し、埋め込み集合上のハミング距離を用いて最適な候補を特定する。実際にSEEがうまく機能することを示すために、著者らはSEEベースのショートリードシーケンスマッパーとアライナーであるAccel-Alignを紹介する。これは、同等の精度を提供しながら、特別な目的のハードウェアを使用せずに、汎用CPU上の最先端のアライナーよりも3-12倍高速である。

Our latest work with @nimchat12 on randomized low distortion embedding for sequence alignment is out. We introduce seed-embed-extend, a new design method, and show that an SEE aligner can provide 10x reduction in exec. time over bwa and bowtie2 without any special hardware. https://t.co/UATGzjCgxS
— Raja Appuswamy (@raja_appuswamy) 2020年7月21日

インストール

配布されているdocker imageを使ってテストした。

Girthub

現在ソースコードは公開されていない。実行形式ファイルのバイナリとビルド済みのdocker imageだけが提供されている。

git clone https://github.com/raja-appuswamy/accel-align-release.git
cd accel-align-release/

> ./accindex-x86-64

# ./accindex-x86-64

index [options] <ref.fa>

options:

-l INT length of seed. [32]

-m Use low mem

> ./accalign-x86-64

# ./accalign-x86-64

accalign [options] <ref.fa> [read1.fastq] [read2.fastq]

options:

-t INT number of threads to use[1]

-l INT length of seed [32].

-o name of output file to use

-x alignment-free

実行方法

1、indexing

accindex-x86-64 -m ref.fasta

-m Use low mem

出力。シロイヌナズナゲノムでは2.6GBになった。

f:id:kazumaxneo:20200729010702p:plain

2、mapping

accalign-x86-64 -o out.sam -t 30 ref.fasta paired_1.fq paired_2.fq

-t number of threads to use[1]
-l length of seed [32].
-o name of output file to use

引用

Accel-Align: A Fast Sequence Mapper and Aligner based on the Seed–Embed–Extend Method

Yiqing Yan, Nimisha Chaturvedi, Raja Appuswamy

bioRxiv, Posted July 21, 2020

2021 8/30

Accel-Align: a fast sequence mapper and aligner based on the seed–embed–extend method
Yiqing Yan, Nimisha Chaturvedi & Raja Appuswamy
BMC Bioinformatics volume 22, Article number: 257 (2021)

感想

　シロイヌナズナゲノムを使い約1400万x２リード（1.6GBx2）のマッピング(sam出力)にかかる時間を調べてみると39秒だった（*）。精度は見ていないが、ランタイムは短い。しかし2020 7/30現在、ソースコードが公開されていない。。。

* オーサー提供のdockerイメージを使用。xeon E5 platinum 2.1 GHz/ 28コア x 1、30スレッド指定。

関連

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

エラーの多いロングリードをアセンブリグラフにアラインする GraphAligner

フェージングの品質を評価、改善する phaseME

De Bruijnアセンブラ Minia

Transcript-level Aware なロングリードのエラーコレクションを行う TALC

tty-tableを使ってターミナルで表を整形表示する

De brujin アセンブラ BCALM 2

高速なアライナー Accel-Align