2018-01-14

454とSOLIDのfastqで高いロスレス圧縮を行う LFQC

fasta/fastqの操作 Compression / decompression

高い圧縮率を示すfastqの圧縮ツール。圧縮率が高いだけあって時間はかかるが、1/10ほどのサイズの圧縮ファイルを作ることができる（ロスレス）。

インストール

cent OSに導入した。

環境

Unix system with at least 4gb of RAM (preferably 8)
Ruby

本体　Github

GitHub - mariusmni/lfqc: LFQC: Fastq Compression Algorithm

git clone https://github.com/mariusmni/lfqc.git 
cd lfqc/lfqc/

zpaqとlpaq8のサブフォルダーのzpaqとlpaq8も必要になる。ランして動かなければそれぞれmakeし直す。

>./lpaq8

$ lpaq8

Licensed under GPL, http://www.gnu.org/copyleft/gpl.html

>./zpaq

$ zpaq

zpaq v7.02 journaling archiver, compiled Jan 14 2018

zpaq archiver for incremental backups with rollback capability.

http://mattmahoney.net/zpaq

Usage: zpaq {add|extract|list|test} archive[.zpaq] files... -options...

Files... may be directory trees. Default is the whole archive.

Archive may be "" to test compression or comparison.

* and ? in archive match numbers or digits in a multi-part archive.

Part 0 is the index. If present, no other parts are needed to add or list.

Commands (a,x,l,t) and options may be abbreviated if not ambiguous.

-all [N] Extract/list versions in N [4] digit directories.

-key [password] AES-256 encrypted archive [prompt without echo].

-noattributes Ignore/don't save file attributes or permissions.

-not files... Exclude. * and ? match any string or char.

-only files... Include only matches (default: *).

-summary Be brief.

-test Do not write to disk.

-threads N Use N threads (default: 40).

-to out... Rename files... to out... or all to out/all.

-until N Roll back archive to N'th update or -N from end.

-until 2018-01-14 04:36:49 Set date, roll back (UT, default time: 235959).

add options. archive can be "" to test compression with no output:

-force Add files even if the date is unchanged.

-nodelete Do not mark unmatched files as deleted.

-method L Compress level L (0..5 = faster..better, default 1).

LB Use 2^B MB blocks (0..11, default 04, 14, 26..56).

i Index (file metadata only).

-fragment N Set average dedupe fragment size = 2^N KiB (default: 6).

extract options:

-force Overwrite existing files (default: skip).

list (compare files) options:

-force Compare file contents instead of dates (slower).

-not =[+-#?^] Exclude by comparison result.

-summary [N] Show N largest files/dirs only (default: 20).

この２つが動作してパスが通っている必要がある。

ラン

圧縮

ruby lfqc.rb file.fastq

1GBのfastqで10分程度時間がかかる。終わるとfile.fastq.lfqcができる。

またはfastqの型を明示して圧縮。

ruby lfqc.rb -ls454 file.fastq 
ruby lfqc.rb -solid file.fastq 
ruby lfqc.rb -solexa file.fastq

解凍。

ruby lfqcd.rb file.fastq.lfqc output.fastq

引用

LFQC: a lossless compression algorithm for FASTQ files

Marius Nicolae, Sudipta Pathak, and Sanguthevar Rajasekaran*

Bioinformatics. 2015 Oct 15; 31(20): 3276–3281.

2018-01-12

シングルコアでも高速なRNA seqのアライナー RapMap

RNA seq 高速なツール mapping

RapMapはRNAのアライナー。非常に高速で、ほかのツールと比較すると、Bowtie2より数十倍高速で、高速なSTARと比べても２倍以上高速にアライメントできる（Figure2参照）。アライメントが具体的には7500万のリードをヒトトランスクリプトームに10分程度でアライメントできるとされる（シングルコア使用時）。アライメント精度の指標となるrecallやprecisionは他のRNAのアライナーKallisto、STAR、Bowiteと同程度となっている。

インストール

ビルドにCMakeが必要。なければbrewで導入しておく。

本体　Github

https://github.com/COMBINE-lab/RapMap

ダウンロードしてビルドする。

git clone https://github.com/COMBINE-lab/RapMap.git
cd RapMap/
mkdir build && cd build
cmake .. 
make
make install
cd ../bin 
./rapmap -h

> ./rapmap -h

$ ./rapmap -h

RapMap Transcriptome Aligner v0.5.0

=====================================

There are currently 4 RapMap subcommands

pseudoindex --- builds a k-mer-based index

pseudomap --- map reads using a k-mer-based index

quasiindex --- builds a suffix array-based (SA) index

quasimap --- map reads using the SA-based index

Run a corresponding command "rapmap <cmd> -h" for

more information on each of the possible RapMap

commands.

パスを通しておく。

ラン

１、 indexの作成。

 rapmap quasiindex -t ref.fa -i ref_index

hashサイズを最小にしてメモリ使用量を削減する。

rapmap quasiindex -t ref.fa -i ref_index -p -x 4

-p enables the minimum perfect hash
-x <INT> tells RapMap to use up to <int> threads when building the perfect hash

２、ペアードエンドのマッピング（samtolsにパイプしてbam出力している）。

rapmap quasimap -i ref_index -1 <(gunzip -c r1.fq.gz) -2 <(gunzip -c r2.fq.gz) -t 8 -o mapped_reads.sam | samtools view -Sb -@ 4 - > mapped_reads.bam

シングルエンドは"-1、-2"の代わりに"-r"で指定します。リード数が数百万だと本当に動作しているのか不安になるほど短い時間でアライメントを完了します。indexの作成にかかる時間もSTARよりずっと短くて済むため、トータルで考えると、STARよりずっと短時間でアライメントできます。"laptop"のマッパーというプログラム名に偽りなしです。

アライメントの感度と精度については、私のデータではbowtieと同じぐらいでした。

引用

RapMap: a rapid, sensitive and accurate tool for mapping RNA-seq reads to transcriptomes.

Srivastava A, Sarkar H, Gupta N, Patro R.

Bioinformatics. 2016 Jun 15;32(12):i192-i200.

2018-01-12

Synteny blockを検出して、染色体の類似領域を可視化する Cinteny

GUIツール結果の視覚化 (visualization) ゲノム比較 (comparative genomics) human genome synteny_block 2007 BMC Bioinformatics

Cintenyは類似の生物間のゲノム配列を比較し、Synteny block（wiki）で描画するツール。人やマウスなどのデータについてはビルド済みのwebサーバーが提供されており、すぐにゲノム比較を行うことができる。

webサーバー

http://cinteny.cchmc.org

使い方についてのデモ動画。

http://cinteny.cchmc.org/doc/demo.php

人とチンパンジーのゲノムを選択しstartをクリック。

f:id:kazumaxneo:20180112012420j:plain

whole genome analysis のsubmitをクリック。

f:id:kazumaxneo:20180112012426j:plain

上がヒトゲノム、下がチンバンジーゲノム。色が残っている領域が２生物間で類似している染色体領域となる。ただしこの画面ではどことどこがシンテシーを示すかは表示されていない。

f:id:kazumaxneo:20180112012429j:plain

確認したいシンテシー部分をクリックすると、染色体同士の比較像に切り替わる。

f:id:kazumaxneo:20180112012437j:plain

Cintenyでは新規プロジェクトも登録して比較できるようですが、手順が煩雑なので、スモールゲノムならばArtemisやmauveを使うことをおすすめします。

ACTサポートツール

mauve

引用

Cinteny: flexible analysis and visualization of synteny and genome rearrangements in multiple organisms

Amit U Sinha and Jaroslaw Meller

BMC Bioinformatics 2007 8:82

2018-01-11

ゲノムを比較し、染色体間の組み替えを可視化する SMASH

structural variations (SV) 結果の視覚化 (visualization) human genome ゲノム比較 (comparative genomics) synteny_block

SMASHは２つの相同なゲノム（染色体）を比較し、組み替えを見つけて結果をビジュアル出力できるツール。解析にはNGSのデータなどは必要としない。純粋にchromosomeの配列だけを使って相同性のある部位や組み替え部位が検出される。霊長類のような大きなゲノム向けのツールとなる。Nature Scientific Reportsに2015年に発表された。

Githubより

公式サイト

smash | UA.PT Bioinformatics

インストール

本体　Github

https://github.com/pratas/smash

ダウンロードしてビルドする。公式サイトからは実行形式のバイナリもダウンロードできる。

brew install cmake wget gcc48 
wget https://github.com/pratas/smash/archive/master.zip 
unzip master.zip 
cd smash-master/src/ 
cmake . 
make

> ./smash

Usage: smash <OPTIONS>... [FILE] [FILE]

-h give this help,

-V display version number,

-v verbose mode,

-f force (be sure!),

-c <context> context order (DEF: 20),

-t <threshold> threshold [0.0,2.0] (DEF: 1.5),

-m <mSize> minimum block size (DEF: 1000000),

-i do not show inversions,

-n do not show regulars,

-r <ratio> image size ratio (MaxSeq/150),

-a <alpha> alpha estimator (DEF: 1000),

-s <seed> seed for random 'N',

-w <wSize> window size,

-wt <wType> window type [0|1|2|3] (DEF: 0),

-d <dSize> sub-sample (DEF: 10000),

-nd do not delete temporary files,

-wi <width> sequence width (DEF: 25),

-p <posFile> output positions file,

-o <outFile> output svg plot file,

[refFile] reference file,

[tarFile] target file.

パスを通しておく。

ラン

ヒトゲノムchr20とオラウータンchr20のゲノムをダウンロードして解凍する（GitではNCBIのリンクを紹介していますが、アドレスが変わっているのでEnsemblからダウンロードに変更しています）。

wget ftp://ftp.ensembl.org/pub/release-91/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosome.20.fa.gz
wget ftp://ftp.ensembl.org/pub/release-91/fasta/pongo_abelii/dna/Pongo_abelii.PPYG2.dna.chromosome.20.fa.gz 

#ヘッダを除き(grep -v ">")、ATGCN以外の余計な文字があれば消す(tr -dc "ATGCN")。
gzcat Homo_sapiens.GRCh38.dna.chromosome.20.fa.gz | grep -v ">" | tr -d -c "ACGTN" > HS20 
gzcat Pongo_abelii.PPYG2.dna.chromosome.20.fa.gz | grep -v ">" | tr -d -c "ACGTN" > PA20

ラン。

SMASH -v -c 20 -t 1.5 HS20 PA20

-v verbose mode
-c <context> context order (DEF: 20)
-t <threshold> threshold [0.0,2.0] (DEF: 1.5),

数分で解析は終わりSVGなどが出力される。SVGはphotoshopなどで開ける。

f:id:kazumaxneo:20180111221026j:plain

position情報は.posファイルに出力される。

$ head HS20PA20.pos

TARGET 1 12890 9068115 0-regular

REFERENCE 1 5542700 14243450 0-regular

TARGET 2 9100340 13134910 0-regular

REFERENCE 2 14243450 16350965 0-regular

REFERENCE 2 16383190 18129785 0-regular

TARGET 3 13154245 18883850 0-regular

REFERENCE 3 18136230 19083645 0-regular

REFERENCE 3 19160985 23666040 0-regular

TARGET 4 18961190 20920470 0-regular

REFERENCE 4 23840055 23975400

出力や設定できるパラメータの詳細についてはGithubを確認してください。詳しく書かれています。

https://github.com/pratas/smash

引用

An alignment-free method to find and visualise rearrangements between pairs of DNA sequences

Diogo Pratas,a, Raquel M. Silva, Armando J. Pinho, and Paulo J.S.G. Ferreira.

Sci Rep. 2015; 5: 10203. Published online 2015 May 18.

2018-01-11

ターゲットゲノムにしかない配列を探し出すEAGLE

ゲノム比較 (comparative genomics) k-mer

EAGLEは指定したゲノムにしかない配列を検出するツール。論文では、EAGLEを使い、ヒトゲノムのどこにもない、エボラウィルスからしか見つからない14merの配列を検出している。特定の種のみ存在する配列があれば、ユニークなプライマーを設計したり、RNA干渉のターゲットにするなど、様々な用途に応用できる可能性がある。また変異株の変異をリファレンスフリーで検出可能な方法論にも応用可能で、すでにいくつかの方法論が発表されている。EAGLEと似たツールにGenomeTester4がある（リンク）。

インストール

本体　Github

https://github.com/pratas/eagle

ダウンロードしてビルドする。　

brew install cmake wget gcc48 #ない人だけ

wget https://github.com/pratas/eagle/archive/master.zip 
unzip master.zip 
cd eagle-master/src/ 
cmake . 
make

> ./EAGLE

$ ./EAGLE

Usage: EAGLE <OPTIONS>... -r [FILE] [FILE]:<...>

-v verbose mode,

-a about EAGLE,

-t use multi-threading,

-i use inversions,

-min <k-mer> k-mer minimum size,

-max <k-mer> k-mer maximum size,

-r [rFile] reference file (db),

[tFile1]:<tFile2>:<...> target file(s).

パスを通しておく。

ラン

k=9からk=13まで探索する。

EAGLE -v -t -min 9 -max 13 -i -r ref.fa target.fa

-t use multi-threading
-min <k-mer> k-mer minimum size
-max <k-mer> k-mer maximum size
-r [rFile] reference file (db)

引用

Three minimal sequences found in Ebola virus genomes and absent from human DNA

Raquel M. Silva Diogo Pratas Luísa Castro Armando J. Pinho Paulo J. S. G. Ferreira

Bioinformatics, Volume 31, Issue 15, 1 August 2015, Pages 2421–2425

2018-01-11

ゲノムデータの圧縮を行うGeCo

Compression / decompression

GeCoはゲノム（fasta）の圧縮ツール。高効率な圧縮を行うことができる（ロスレスかどうかは不明）。

公式サイト

http://bioinformatics.ua.pt/software/geco/

インストール

本体　Github

https://github.com/pratas/geco

ダウンロードしてビルドする。

brew install cmake wget gcc48 #ない人だけ
git clone https://github.com/pratas/geco.git 
cd geco/src/ 
cmake . 
make

> ./GeCo

$ ./GeCo

Usage: GeCo [OPTION]... -r [FILE] [FILE]:[...]

Compress and analyze a genomic sequence (by default, compress).

Non-mandatory arguments:

-h give this help,

-x show several running examples,

-s show GeCo compression levels,

-v verbose mode (more information),

-V display version number,

-f force overwrite of output,

-l <level> level of compression [1;9] (lazy -tm setup),

-g <gamma> mixture decayment forgetting factor. It is

a real value in the interval [0;1),

-c <cache> maximum collisions for hash cache. Memory

values are higly dependent of the parameter

specification,

-e it creates a file with the extension ".iae"

with the respective information content. If

the file is FASTA or FASTQ it will only use

the "ACGT" (genomic) data,

-rm <c>:<d>::<m/e> reference context model (ex:-rm 13:100:0:0/0),

-rm <c>:<d>::<m/e> reference context model (ex:-rm 18:1000:0:1/1000),

...

-tm <c>:<d>::<m/e> target context model (ex:-tm 4:1:0:0/0),

-tm <c>:<d>::<m/e> target context model (ex:-tm 18:20:1:2/10),

...

target and reference templates use <c> for

context-order size, <d> for alpha (1/<d>),

(0 or 1) to set the usage of inverted

repeats (1 to use) and <m> to the maximum

allowed mutation on the context without

being discarded (usefull in deep contexts),

under the estimator <e>,

-r <FILE> reference file ("-rm" are loaded here),

Mandatory arguments:

<FILE> file to compress (last argument). For more

files use splitting ":" characters.

Report bugs to <{pratas,ap,pjf}@ua.pt>.

> ./GeDe

$ ./GeDe

Usage: GeDe [OPTION]... -r [FILE] [FILE]:[...]

Decompress a genomic sequence compressed by GeCo.

Non-mandatory arguments:

-h give this help,

-v verbose mode (more information),

-r <FILE> reference file,

Mandatory arguments:

<FILE> file to uncompress (last argument). For

more files use splitting ":" characters.

Report bugs to <{pratas,ap,pjf}@ua.pt>.

パスを通しておく。

ラン

レベル5で圧縮する。

GeCo -l 5 File.seq

-l <level> level of compression [1;9] (lazy -tm setup),

デコード

GeDe File.seq.co

fastqを圧縮すると、ヘッダーやqualityが消去されて１つの配列になってしまうので使わないこと。

引用

Efficient Compression of Genomic Sequences

D. Pratas, A. J. Pinho, P. J. S. G. Ferreira.

March 2016 DOI10.1109/DCC.2016.60 ConferenceData Compression ConferenceAtSnowbird, Utah

2018-01-11

固有のindex（バーコード）を設計するTagGD

trimming / preprocessing barcode / index demultiplexing

index (バーコード配列) を設計する際は、判別可能かつ無駄のない適切な長さ、増幅バイアスが起きないようなGC含量、実験データとの干渉がないなどを考える必要がある。それに加えて、index配列に塩基置換、indelなどのシーケンスエラーが起きる可能性があるが、それでもdemulplexingできる十分な特異性を持ったindex配列が望ましい。しかしながら、indexを数千必要とするような実験系では手動での選抜は困難である。TagGDはこのindexを自動設計するツールで、数千〜数万のユニークなindexを数秒~数分で設計することができる。

インストール

Github

https://github.com/JoelSjostrand/taggd

git clone https://github.com/pelinakan/UBD.git
cd UBD/bin/mac/

#実行権の付与
chmod u+x

#binにパスを通す。または →
echo export PATH=\$PATH:`pwd`\ >> ~/.bash_profile && source ~/.bash_profile

#→ パスの通ったディレクトリにコピーする。
cp designBarcode /usr/local/bin/
cp findIndexes /usr/local/bin/

上記リンクからダウンロードして、bin/mac/のバイナリを使う。上記のように打って実行権をつけ、パスも通しておく。

> ./designBarcode

$ UBD/bin/mac/designBarcode

Version: 1.0

Usage: designBarcodes [options] <output>

Options:

-c [STR] Configuration file

-n [INT] Number of barcodes to output.

>./findIndexes

$ /bin/mac/findIndexes

Program: findIndexes

Contact: Paul Costea <paul.igor.costea@embl.de>

Usage: findIndexes [options] <ids.txt> <in.fastq> <out.fastq>

Options:

-m INT allowed mismatches [2]

-k INT kMer length [1/3*length]

-s INT start position of ID [0]

-l INT length of ID [0]

-e INT id positional error [0]

-p STRING name of pair file [NULL]

uesaka-no-Air-2:UBD kazumaxneo$

python版もある。

GitHub - JoelSjostrand/taggd: Genetic barcode demultiplexing

ラン

index配列を10000種類設計する（数が万近くまで増えると重くなる）。

designBarcode -c /config_example.txt -n 10000 out.txt

fastqからバーコードを探すfindIndexesというコマンドもある。Githubで使い方は確認してください。

引用

TagGD: Fast and Accurate Software for DNA Tag Generation and Demultiplexing

Paul Igor Costea, Joakim Lundeberg, Pelin Akan

Published: March 4, 2013https://doi.org/10.1371/journal.pone.0057521

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

454とSOLIDのfastqで高いロスレス圧縮を行う LFQC

シングルコアでも高速なRNA seqのアライナー RapMap

Synteny blockを検出して、染色体の類似領域を可視化する Cinteny

ゲノムを比較し、染色体間の組み替えを可視化する SMASH

ターゲットゲノムにしかない配列を探し出すEAGLE

ゲノムデータの圧縮を行うGeCo

固有のindex（バーコード）を設計するTagGD