2018-01-20

高頻度なk-merを効率的にカウントする Turtle

　k-merを用いたde Bruijnグラフ構造は今日普及しているゲノムアセンブルの中核であり、多くの方法論で使われている。k-merはCeleraのようなOLCのアセンブルツールでも重複のシードを用いるのに使われている。また、いくつかのエラー訂正ツールは、k-merの頻度を分析してエラー訂正を行う。すなわち複数回出現するk-merをゲノムの真の配列に由来するものと考え、対照的に１回しか出現しないk-merをシーケンスエラーとみなす。

　サイズがgのゲノムでは、最大g回までのユニークなk-merが期待される。ただしこの数は重複により減少する。k-merが小さくなっても、偶然ユニークでない可能性が高まるため、減ることになる（5merならATGCの全組み合わせは4^5しかない）。しかしながら、シーケンスデータのユニークなk-merを全て数え上げると、ゲノムサイズから期待されるユニークなk-mer数よりずっと多くなる。この理由は、シーケンスエラーが直感的に理解するより頻繁に起きているからである。例えばphread quality scoreが平均30のデータからk=31でk-merを数え上げることを考える。シーケンスエラーが偶然起きる確率は１つの部位で0.1%なので、正解率は99.9%、すなわち31mer連続してエラーが起きない可能性は96.6%になる。裏を返せば3.4%はシーケンスエラーをどこかに含んでいる。クオリティが20なら、シーケンスエラーがゼロの31merは全体の73%まで落ちる。ここに、ゲノムのレアなバリアント、ハプロタイプ、コンタミなども乗ってくるため、ゲノムサイズから予想される値よりずっと大きくなる。よって、ゲノムサイズを推測するには、低頻度なk-merを除きカウントする必要がある。

　Turtleは高頻度なk-merの数をBloom filterを使って省メモリで計算する方法論。並列化にも対応しており、巨大なデータのk-merを少ないリソースで計算するすることができる。具体的には、135.3Gbのヒトゲノムデータの31-merのカウントを、20スレッド使用時に2時間未満で終えるとされる。

インストール

cent OSに導入した。

公式サイトからダウンロードする。

http://bioinformatics.rutgers.edu/Software/Turtle/

> scTurtle64

$ scTurtle64

This program comes with ABSOLUTELY NO WARRANTY.

This is free software, and you are welcome to redistribute it under certain conditions. For details see the document COPYING.

Parameters received:

Please specify an input file.

scTurtle64 Usage:

scTurtle64 [arguments]

example: ./scTurtle64 -f 1Mreads.fq -o kmer_counts -k 31 -n 6000000 -t 9

-i input reads file in fasta format.

-f input reads file in fastq format. This is mutually exclusive with -i.

-o ouput files prefix. k-mers and their counts are stored in fasta format (headers indicating frequency) in multiple files named prefix0, prefix1... which the user can concatenate if desired.

-q ouput files prefix. k-mers and their counts are stored in tab delimited fromat (quake compatible) in multiple files named prefix0, prefix1... which the user can concatenate if desired.

-k k-mer length.

-t Number of threads.

-n Expected number of frequent k-mers. For uniform coverage libraries this is usually close to genome length. For single-cell libraries, 2-3 times the gemome length is recommended.

-s The approximate amount of space (in GB) to be used. It is used to indirectly compute -n and is mutually exclusive with -n. When both -n and -s are specified, the one that appears last is used.

-h Print this help menu.

-v Print software version.

パスを通しておく。

ラン

scTurtle32 -f reads.fq -o kmer_counts -k 31 -n 3900000 -t 8

-i input reads file in fasta format.-i input reads file in fasta format.
-f input reads file in fastq format. This is mutually exclusive with -i.
-o ouput files prefix. k-mers and their counts are stored in fasta format (headers indicating frequency) in multiple files named prefix0, prefix1... which the user can concatenate if desired.-o ouput files prefix. k-mers and their counts are stored in fasta format (headers indicating frequency) in multiple files named prefix0, prefix1... which the user can concatenate if desired.
-q ouput files prefix. k-mers and their counts are stored in tab delimited fromat (quake compatible) in multiple files named prefix0, prefix1... which the user can concatenate if desired.
-k k-mer length.
-t Number of threads.
-n Expected number of frequent k-mers. For uniform coverage libraries this is usually close to genome length. For single-cell libraries, 2-3 times the gemome length is recommended.

出力

STATs:

No of reads: 254371

No of k-mers: 62516374

no of frequent k-mers found :3854800

no of frequent k-mersはゲノムサイズに近くなる。

作成中

引用

Turtle: identifying frequent k-mers with cache-efficient algorithms

Rajat Shuvro Roy Debashish Bhattacharya Alexander Schliep.

Bioinformatics, Volume 30, Issue 14, 15 July 2014, Pages 1950–1957.

2018-01-20

バーコードやアダプターをトリミングする AdapterRemoval v2

trimming / preprocessing barcode / index contamination

化石のようなサンプル（リンク）や昔の人の骨、歯から断片化したDNAを抽出してシーケンスシーケンスすることが増えており、それに伴ってアダプターに5'と3'両側が汚染されたシーケンスデータが増えてきている。AdapterRemoval は柔軟なパラメータセットを持つ、はSSE1とSSE2命令に対応したやDemultiplexやアダプタートリミングのツール。柔軟な条件でトリミングが行える。初期バージョンからシングルエンドとペアードエンドのシーケンスデータに対応していたが、バージョン２では上記の拡張命令に加え、gzipとbzip2圧縮ファイルへの対応、処理の並列化、interleaveのfastqへの対応、オーバーラップするfastqのマージなどの機能が追加された。

インストール

github

https://github.com/MikkelSchubert/adapterremoval

Binaryをダウンロードする。

wget -O adapterremoval-2.1.7.tar.gz https://github.com/MikkelSchubert/adapterremoval/archive/v2.1.7.tar.gz 
tar xvzf adapterremoval-2.1.7.tar.gz 
cd adapterremoval-2.1.7
make #ビルド
sudo make install #インストール

> adapterremoval

$ AdapterRemoval

AdapterRemoval ver. 2.1.7

This program searches for and removes remnant adapter sequences from

your read data. The program can analyze both single end and paired end

data. For detailed explanation of the parameters, please refer to the

man page. For comments, suggestions and feedback please contact Stinus

Lindgreen (stinus@binf.ku.dk) and Mikkel Schubert (MikkelSch@gmail.com).

If you use the program, please cite the paper:

Schubert, Lindgreen, and Orlando (2016). AdapterRemoval v2: rapid

adapter trimming, identification, and read merging.

BMC Research Notes, 12;9(1):88.

http://bmcresnotes.biomedcentral.com/articles/10.1186/s13104-016-1900-2

Arguments: Description:

--help Display this message.

--version Print the version string.

--file1 FILE Input file containing mate 1 reads or single-ended reads [REQUIRED].

--file2 FILE Input file containing mate 2 reads [OPTIONAL].

FASTQ OPTIONS:

--qualitybase BASE Quality base used to encode Phred scores in input; either 33, 64, or solexa

[current: 33].

--qualitybase-output BASE Quality base used to encode Phred scores in output; either 33, 64, or solexa. By

default, reads will be written in the same format as the that specified using

--qualitybase.

--qualitymax BASE Specifies the maximum Phred score expected in input files, and used when writing

output. ASCII encoded values are limited to the characters '!' (ASCII = 33) to

'~' (ASCII = 126), meaning that possible scores are 0 - 93 with offset 33, and

0 - 62 for offset 64 and Solexa scores [default: 41].

--mate-separator CHAR Character separating the mate number (1 or 2) from the read name in FASTQ

records [default: '/'].

--interleaved This option enables both the --interleaved-input option and the

--interleaved-output option [current: off].

--interleaved-input The (single) input file provided contains both the mate 1 and mate 2 reads, one

pair after the other, with one mate 1 reads followed by one mate 2 read. This

option is implied by the --interleaved option [current: off].

--interleaved-output If set, trimmed paired-end reads are written to a single file containing mate 1

and mate 2 reads, one pair after the other. This option is implied by the

--interleaved option [current: off].

OUTPUT FILES:

--basename BASENAME Default prefix for all output files for which no filename was explicitly set

[current: your_output].

--settings FILE Output file containing information on the parameters used in the run as well as

overall statistics on the reads after trimming [default: BASENAME.settings]

--output1 FILE Output file containing trimmed mate1 reads [default: BASENAME.pair1.truncated

(PE), BASENAME.truncated (SE), or BASENAME.paired.truncated (interleaved PE)]

--output2 FILE Output file containing trimmed mate 2 reads [default: BASENAME.pair2.truncated

(only used in PE mode, but not if --interleaved-output is enabled)]

--singleton FILE Output file to which containing paired reads for which the mate has been

discarded [default: BASENAME.singleton.truncated]

--outputcollapsed FILE If --collapsed is set, contains overlapping mate-pairs which have been merged

into a single read (PE mode) or reads for which the adapter was identified by

a minimum overlap, indicating that the entire template molecule is present.

This does not include which have subsequently been trimmed due to low-quality

or ambiguous nucleotides [default: BASENAME.collapsed]

--outputcollapsedtruncated FILE Collapsed reads (see --outputcollapsed) which were trimmed due the presence of

low-quality or ambiguous nucleotides [default: BASENAME.collapsed.truncated]

--discarded FILE Contains reads discarded due to the --minlength, --maxlength or --maxns options

[default: BASENAME.discarded]

OUTPUT COMPRESSION:

--gzip Enable gzip compression [current: off]

--gzip-level LEVEL Compression level, 0 - 9 [current: 6]

--bzip2 Enable bzip2 compression [current: off]

--bzip2-level LEVEL Compression level, 0 - 9 [current: 9]

TRIMMING SETTINGS:

--adapter1 SEQUENCE Adapter sequence expected to be found in mate 1 reads [current:

AGATCGGAAGAGCACACGTCTGAACTCCAGTCACNNNNNNATCTCGTATGCCGTCTTCTGCTTG].

--adapter2 SEQUENCE Adapter sequence expected to be found in mate 2 reads [current:

AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT].

--adapter-list FILENAME Read table of white-space separated adapters pairs, used as if the first column

was supplied to --adapter1, and the second column was supplied to --adapter2;

only the first adapter in each pair is required SE trimming mode [current:

<not set>].

--mm MISMATCH_RATE Max error-rate when aligning reads and/or adapters. If > 1, the max error-rate

is set to 1 / MISMATCH_RATE; if < 0, the defaults are used, otherwise the

user-supplied value is used directly. [defaults: 1/3 for trimming; 1/10 when

identifing adapters].

--maxns MAX Reads containing more ambiguous bases (N) than this number after trimming are

discarded [current: 1000].

--shift N Consider alignments where up to N nucleotides are missing from the 5' termini

[current: 2].

--trimns If set, trim ambiguous bases (N) at 5'/3' termini [current: off]

--trimqualities If set, trim bases at 5'/3' termini with quality scores <= to --minquality value

[current: off]

--minquality PHRED Inclusive minimum; see --trimqualities for details [current: 2]

--minlength LENGTH Reads shorter than this length are discarded following trimming [current: 15].

--maxlength LENGTH Reads longer than this length are discarded following trimming [current:

4294967295].

--collapse When set, paired ended read alignments of --minalignmentlength or more bases are

combined into a single consensus sequence, representing the complete insert,

and written to either basename.collapsed or basename.collapsed.truncated (if

trimmed due to low-quality bases following collapse); for single-ended reads,

putative complete inserts are identified as having at least

--minalignmentlength bases overlap with the adapter sequence, and are written

to the the same files [current: off].

--minalignmentlength LENGTH If --collapse is set, paired reads must overlap at least this number of bases to

be collapsed, and single-ended reads must overlap at least this number of

bases with the adapter to be considered complete template molecules [current:

11].

--minadapteroverlap LENGTH In single-end mode, reads are only trimmed if the overlap between read and the

adapter is at least X bases long, not counting ambiguous nucleotides (N); this

is independant of the --minalignmentlength when using --collapse, allowing a

conservative selection of putative complete inserts while ensuring that all

possible adapter contamination is trimmed [current: 0].

DEMULTIPLEXING:

--barcode-list FILENAME List of barcodes or barcode pairs for single or double-indexed demultiplexing.

Note that both indexes should be specified for both single-end and paired-end

trimming, if double-indexed multiplexing was used, in order to ensure that the

demultiplexed reads can be trimmed correctly [current: <not set>].

--barcode-mm N Maximum number of mismatches allowed when counting mismatches in both the mate 1

and the mate 2 barcode for paired reads.

--barcode-mm-r1 N Maximum number of mismatches allowed for the mate 1 barcode; if not set, this

value is equal to the '--barcode-mm' value; cannot be higher than the

'--barcode-mm value'.

--barcode-mm-r2 N Maximum number of mismatches allowed for the mate 2 barcode; if not set, this

value is equal to the '--barcode-mm' value; cannot be higher than the

'--barcode-mm value'.

MISC:

--identify-adapters Attempt to identify the adapter pair of PE reads, by searching for overlapping

reads [current: off].

--seed SEED Sets the RNG seed used when choosing between bases with equal Phred scores when

collapsing. Note that runs are not deterministic if more than one thread is

used. If not specified, a seed is generated using the current time.

--threads THREADS Maximum number of threads [current: 1]

ラン

シングルエンドのfastqのNとクオリティの低い領域を除いてgzip出力する。

AdapterRemoval --file1 single.fq --basename output_single --trimns --trimqualities --gzip --threads 4

--file1 FILE Input file containing mate 1 reads or single-ended reads [REQUIRED].
--basename BASENAME Default prefix for all output files for which no filename was explicitly set [current: your_output].
--trimns If set, trim ambiguous bases (N) at 5'/3' termini [current: off]
--trimqualities If set, trim bases at 5'/3' termini with quality scores <= to --minquality value [current: off]
--minquality PHRED Inclusive minimum; see --trimqualities for details [current: 2]
--gzip Enable gzip compression [current: off]
--threads THREADS Maximum number of threads [current: 1]

ペアードエンドのfastqのNとクオリティの低い領域を除いて出力する。11mer以上の重複があるペアはマージして出力される。

AdapterRemoval --file1 pair_1.fq --file2 pair_2.fq --basename output_paired --trimns --trimqualities --collapse --threads 4

--file1 FILE Input file containing mate 1 reads or single-ended reads [REQUIRED].
--file2 FILE Input file containing mate 2 reads [OPTIONAL].
--collapse When set, paired ended read alignments of --minalignmentlength or more bases are combined into a single consensus sequence, representing the complete insert, and written to either basename.collapsed or basename.collapsed.truncated (if trimmed due to low-quality bases following collapse); for single-ended reads, putative complete inserts are identified as having at least --minalignmentlength bases overlap with the adapter sequence, and are written

引用

AdapterRemoval v2: rapid adapter trimming, identification, and read merging.

Schubert M, Lindgreen S, Orlando L.

BMC Res Notes. 2016 Feb 12;9:88.

AdapterRemoval: easy cleaning of next-generation sequencing reads.

Lindgreen S.

BMC Res Notes. 2012 Jul 2;5:337.

2018-01-20

並列化に対応し、高速にバーコードやアダプターをトリミングする FLEXBAR

trimming / preprocessing barcode / index demultiplexing miRNA seq

2021/10/21, 24 インストール追記

FLEXBARはMultiplexで読んだシーケンスのdemultiplexやアダプタートリミングに使われるツール。柔軟な条件でランできる。よく使われているらしく、現在Flexbar3まで発表されている。解析時間は短く、100Mのリードなら数秒〜10秒程度の時間でアダプターをトリミングできる。ペアリードの順番は破壊しないので安心して使える。

マニュアル

インストール

github

Binaryをダウンロードして（リンクのmac版）、環境変数にパスを通す。

export DYLD_LIBRARY_PATH=/path/FlexbarDir:$DYLD_LIBRARY_PATH #カレントがダウンロードしたFlexbarDir/なら後半は ='pwd':$DYLD_LIBRARY_PATH

#conda (20211021追記)
mamba install -c bioconda -y flexbar

> flexbar --help

$ flexbar -h

flexbar - flexible barcode and adapter removal

==============================================

SYNOPSIS

flexbar -r reads [-b barcodes] [-a adapters] [options]

DESCRIPTION

The program Flexbar preprocesses high-throughput sequencing data

efficiently. It demultiplexes barcoded runs and

removes adapter sequences. Several adapter removal presets for

Illumina libraries are included. Flexbar computes

exact overlap alignments using SIMD and multicore parallelism.

Moreover, trimming and filtering features are

provided, e.g. trimming of homopolymers at read ends. Flexbar

increases read mapping rates and improves genome as

well as transcriptome assemblies. Unique molecular identifiers can

be extracted in a flexible way. The software

supports data in fasta and fastq format from multiple sequencing

platforms. Refer to the manual on

github.com/seqan/flexbar/wiki or contact Johannes Roehr on

github.com/jtroehr for support with this application.

OPTIONS

-h, --help

Display the help message.

-hh, --full-help

Display the help message with advanced options.

-v, --versions

Print Flexbar and SeqAn version numbers.

-c, --cite

Show program references for citation.

Basic options:

-n, --threads INTEGER

Number of threads to employ. Default: 1.

-t, --target OUTPUT_PREFIX

Prefix for output file names or paths. Default: flexbarOut.

-r, --reads INPUT_FILE

Fasta/q file or stdin (-) with reads that may contain barcodes.

-p, --reads2 INPUT_FILE

Second input file of paired reads, gz and bz2 files supported.

Barcode detection:

-b, --barcodes INPUT_FILE

Fasta file with barcodes for demultiplexing, may contain N.

-br, --barcode-reads INPUT_FILE

Fasta/q file containing separate barcode reads for detection.

-bo, --barcode-min-overlap INTEGER

Minimum overlap of barcode and read. Default: barcode length.

-be, --barcode-error-rate DOUBLE

Error rate threshold for mismatches and gaps. Default: 0.0.

-bt, --barcode-trim-end STRING

Type of detection, see section trim-end modes. Default: LTAIL.

Adapter removal:

-a, --adapters INPUT_FILE

Fasta file with adapters for removal that may contain N.

-a2, --adapters2 INPUT_FILE

File with extra adapters for second read set in paired mode.

-aa, --adapter-preset STRING

One of TruSeq, SmallRNA, Methyl, Ribo, Nextera, and NexteraMP.

-ao, --adapter-min-overlap INTEGER

Minimum overlap for removal without pair overlap. Default: 3.

-ae, --adapter-error-rate DOUBLE

Error rate threshold for mismatches and gaps. Default: 0.1.

-at, --adapter-trim-end STRING

Type of removal, see section trim-end modes. Default: RIGHT.

-ap, --adapter-pair-overlap STRING

Overlap detection of paired reads. One of ON, SHORT, and ONLY.

Filtering and trimming:

-u, --max-uncalled INTEGER

Allowed uncalled bases N for each read. Default: 0.

-x, --pre-trim-left INTEGER

Trim given number of bases on 5' read end before detection.

-y, --pre-trim-right INTEGER

Trim specified number of bases on 3' end prior to detection.

-m, --min-read-length INTEGER

Minimum read length to remain after removal. Default: 18.

Quality-based trimming:

-q, --qtrim STRING

Quality-based trimming mode. One of TAIL, WIN, and BWA.

-qf, --qtrim-format STRING

Quality format. One of sanger, solexa, i1.3, i1.5, and i1.8.

-qt, --qtrim-threshold INTEGER

Minimum quality as threshold for trimming. Default: 20.

Trimming of homopolymers:

-hr, --htrim-right STRING

Trim certain homopolymers on right read end after removal.

-hi, --htrim-min-length INTEGER

Minimum length of homopolymers at read ends. Default: 3.

-he, --htrim-error-rate DOUBLE

Error rate threshold for mismatches. Default: 0.1.

Output selection:

-f, --fasta-output

Prefer non-quality format fasta for output.

-z, --zip-output STRING

Direct compression of output files. One of GZ and BZ2.

-1, --stdout-reads

Write reads to stdout, tagged and interleaved if needed.

Logging and tagging:

-l, --align-log STRING

Print chosen read alignments. One of ALL, MOD, and TAB.

-o, --stdout-log

Write statistics to stdout instead of target log file.

-g, --removal-tags

Tag reads that are subject to adapter or barcode removal.

TRIM-END MODES

ANY: longer side of read remains after removal of overlap

LEFT: right side remains after removal, align <= read end

RIGHT: left part remains after removal, align >= read start

LTAIL: consider first n bases of reads in alignment

RTAIL: use only last n bases, see tail-length options

EXAMPLES

flexbar -r reads.fq -t target -q TAIL -qf i1.8

flexbar -r reads.fq -b barcodes.fa -bt LTAIL

flexbar -r reads.fq -a adapters.fa -ao 3 -ae 0.1

flexbar -r r1.fq -p r2.fq -a a1.fa -a2 a2.fa -ap ON

flexbar -r r1.fq -p r2.fq -aa TruSeq -ap ON

VERSION

Last update: May 2019

flexbar version: 3.5.0

SeqAn version: 2.4.0

Available on github.com/seqan/flexbar

ラン

アダプターを除く。

flexbar -r input.fastq -a adaptor.fa -be LTAIL -t target -n 4

-t Prefix for output file names or paths. Default: flexbarOut.
-r Fasta/q file or stdin (-) with reads that may contain barcodes.
-p Second input file of paired reads, gz and bz2 files supported.
-a Fasta file with adapters for removal that may contain N.
-n Number of threads to employ. Default: 1.
-be Type of detection, see section trim-end modes. Default: LTAIL.

TRIM-END MODES

ANY: longer side of read remains after removal of overlap
LEFT: right side remains after removal, align <= read end
RIGHT: left part remains after removal, align >= read start
LTAIL: consider first n bases of reads in alignment
RTAIL: use only last n bases, see tail-length options

引用

Flexbar 3.0 - SIMD and multicore parallelization.

Roehr JT, Dieterich C, Reinert K.

Bioinformatics. 2017 Sep 15;33(18):2941-2942.

FLEXBAR-Flexible Barcode and Adapter Processing for Next-Generation Sequencing Platforms.

Dodt M, Roehr JT, Ahmed R, Dieterich C.

Biology (Basel). 2012 Dec 14;1(3):895-905.

2018-01-19

様々なバイオインフォマティクスツールの分析結果を１つに集約して分析できる MultiQC

evaluation tool bam/sam 結果の視覚化 (visualization) RNA seq trimming / preprocessing assembly

2019 1/16 誤字修正および対応ツール情報更新、12/29 ツイート追加

2020 1/17 condaインストール追記、4/19 説明追記、5/25 ツイート追記

2023/12/20ツイート追記

今まで様々なNGSの評価ツールが発表されてきたが、それらは特定のデータを評価するものであり、プロジェクト全体で品質評価（クオリティチェック）するためのツールがなかった。プロジェクト全体で一貫した品質評価ができないと、チェック漏れが出てしまう恐れがある。例えば、fastqcでfastqを分析すると、データごとに分析結果がビジュアル出力される。大量のシーケンスデータを処理していると、この分析だけでもたくさんの時間がかかり、エラーの多いデータの見過ごしも出てくる可能性が高まる。外れ値のデータの除去漏れなどがないようにするには、NGSのデータをまとめて解析・出力し、結果を１つの図または表に落とし込んで比較できる使いやすいツールが必要となる。

MulitiQCはこういった背景から開発された、たくさんのNGSツールの結果をまとめて比較するためのQCツール。簡単なコマンドでファイルを自動認識し、プロジェクト全体を満遍なく評価できる。結果はhtmlで出力される。

公式サイト

http://opensource.scilifelab.se/projects/multiqc/

Using MulitiQC Reports

presentation

現在以下のコマンドの出力結果がサポートされている。

f:id:kazumaxneo:20180119181121j:plain

2019年1月16日現在。さらに増えている。

RNA seq example report

http://multiqc.info/examples/rna-seq/multiqc_report.html

WGS example eport

http://multiqc.info/examples/wgs/multiqc_report.html

Bisulfite example eport

http://multiqc.info/examples/bs-seq/multiqc_report.html

Hi-C example eport

http://multiqc.info/examples/hi-c/multiqc_report.html

2023/12/20

MultiQC v1.19 was just released - just in time for the holidays! 🎁 🎄

This release has a bunch of bugfixes, with 9 different modules receiving tweaks and improvements 🛠️ The MultiQC core code also got a bit of sprucing up and spring cleaning 🧽🫧

Enjoy!https://t.co/wSpbUKxIq2
— MultiQC (@MultiQC) 2023年12月18日

It's finally happened! After talking about it for years, the @MultiQC GitHub repo has finally left @tallphil's personal account and moved to the @MultiQC organisation! 💥

Whilst we were at it, we also renamed `master` -> `main` 🎉https://t.co/5OPE5wBsse
— MultiQC (@MultiQC) 2023年12月18日

2019/11/21

✨🎉🎊🌈 MultiQC v1.8 is released! 🌈🎊🎉✨

Nearly a year in the making (👶👶🍼🙄) this release is packed with updates and bug fixes - thanks for your patience, especially those who submitted issues and the 19 code contributors involved in v1.8!https://t.co/qDmiyRcKhA
— Phil Ewels (@tallphil) 2019年11月21日

インストール

github

#bioconda (link)
conda install -c bioconda -c conda-forge multiqc -y

#pip
pip install multiqc

ラン

使うには、解析が終わったディレクトリで以下のコマンドを打つ。

multiqc .

サブディレクトリも含めてファイルが検索され、tagを認識して自動で結果がまとめられる。

デフォルトで認識されるタグはmultiqc --view-tagsで確認できる。

> multiqc --view-tags

$ multiqc --view-tags

MultiQC Available module tag groups:

- ancient:

- dedup

- damageprofiler

- mtnucratio

- clipandmerge

- cancer:

- conpair

- theta2

- chip:

- phantompeakqualtools

- deeptools

- homer

- macs2

- Denovo:

- quast

- busco

- supernova

- denovo:

- prokka

- DNA:

- conpair

- peddy

- methylQA

- qualimap

- preseq

- quast

- goleft_indexcov

- supernova

- deeptools

- verifybamid

- happy

- homer

- theta2

- snpeff

- gatk

- htseq

- bcftools

- featureCounts

- fgbio

- dedup

- damageprofiler

- biobambam2

- mtnucratio

- picard

- samblaster

- samtools

- sexdeterrmine

- bamtools

- jellyfish

- vcftools

- longranger

- stacks

- bbmap

- bismark

- biscuit

- kat

- leehom

- adapterRemoval

- clipandmerge

- cutadapt

- flexbar

- trimmomatic

- skewer

- sortmerna

- biobloomtools

- fastq_screen

- afterqc

- fastp

- fastqc

- minionqc

- mosdepth

- clusterflow

- bcl2fastq

- interop

- flash

- seqyclean

- hi-c:

- hicexplorer

- hicup

- hicpro

- Metagenomics:

- quast

- Methylation:

- methylQA

- methylation:

- bismark

- biscuit

- clusterflow

- bcl2fastq

- interop

- miRNA:

- mirtrace

- prokarytotic:

- prokka

- RNA:

- qualimap

- preseq

- qorts

- rna_seqc

- rsem

- rseqc

- disambiguate

- deeptools

- sargasso

- homer

- htseq

- featureCounts

- fgbio

- biobambam2

- picard

- samblaster

- samtools

- bamtools

- bbmap

- salmon

- kallisto

- star

- hisat2

- tophat

- bowtie2

- bowtie1

- leehom

- adapterRemoval

- cutadapt

- flexbar

- trimmomatic

- skewer

- sortmerna

- biobloomtools

- fastq_screen

- afterqc

- fastp

- fastqc

- minionqc

- mosdepth

- clusterflow

- bcl2fastq

- interop

- flash

- seqyclean

- slam:

- slamdunk

- smRNA:

- mirtrace

- umi:

- fgbio

- WGS:

- conpair

- longranger

検索パターンはsearch_patterns.yaml に記載されている。condaを使って~/anaconda3のpython3.7に導入しているなら、search_patterns.yaml は

anaconda3/lib/python3.7/site-packages/multiqc/utils/search_patterns.yaml

に存在する。

例1 fastqc

fastqcで16のfastqファイルのqualityをチェックする。

f:id:kazumaxneo:20180119211540j:plain

fastqc *gz

ランが終わったらそれぞれのfastqのレポートができる。

f:id:kazumaxneo:20180119213008j:plain

出力をそのままhtmlで見ても、サンプル間の比較がしにくい。

mulitiqcを走らせて、分析結果を統合して表示する。

mulitiqc .

自動でfastqcの出力ファイルを認識して、情報が統合される。htmlが出力されるので開く。

duplicationのレベル。

f:id:kazumaxneo:20180119213624j:plain

ポジションによるqualityの変化。

f:id:kazumaxneo:20180119213102j:plain

16サンプルが１つのグラフにまとめられた。これならサンプル間の比較が容易にできる。

例2 STAR

idnex

mkdir genome #出力用のディレクトリを作成
STAR --runMode genomeGenerate --genomeDir genome/ --genomeFastaFiles reference.fasta --sjdbGTFfile reference.gtf --sjdbOverhang 100 --runThreadN 12

mapping (3サンプル)

STAR --genomeDir genome/ --readFilesIn sample1.fastq --runThreadN 12 --outSAMtype BAM SortedByCoordinate --outFileNamePrefix sample1

STAR --genomeDir genome/ --readFilesIn sample2.fastq --runThreadN 12 --outSAMtype BAM SortedByCoordinate --outFileNamePrefix sample2

STAR --genomeDir genome/ --readFilesIn sample3.fastq --runThreadN 12 --outSAMtype BAM SortedByCoordinate --outFileNamePrefix sample3

mulitiqc .

結果。

f:id:kazumaxneo:20180121191354j:plain

WGSのデータ解析レポート例

引用

MultiQC: summarize analysis results for multiple tools and samples in a single report

Philip Ewels,,* Måns Magnusson, Sverker Lundin, and Max Käller

Bioinformatics. 2016 Oct 1; 32(19): 3047–3048.

2021 04

Here's another #Rstats package that I've just published on CRAN! TidyMultiqc converts QC reports from MultiQC into tidy R data frames for further downstream analysis: https://t.co/uYR0LkvkfJ
— Michael Milton (@multimeric) 2021年4月10日

I added a new section to the #MultiQC docs about tools designed to work with MultiQC output for downstream analysis: https://t.co/SeDS6R7dhR Am I missing any? 🔭🕵🏻‍♂️📊
— Phil Ewels (@tallphil) 2021年4月18日

2018-01-19

Feature response courveによりアセンブルを評価するFRC_align

evaluation tool assembly

アセンブルのパフォーマンスを表す指標として N50やコンティグの数などがよく用いられているが、アセンブルの精度はこの値には反映されていない。FRC_alignは、Feature response courve: FRC（FRCを使ったアセンブル評価）を計算出力することで、異なるツール間のアセンブル結果をより正しく評価できるツール。カバレッジの深さ、k-merの反復数、indelの有無、mate-pairの分布、リードの向き（--> <--)などをもとに評価する。

FRC wiki

http://amos.sourceforge.net/wiki/index.php/FRCurve

ダウンロード

依存

boost #ない人はこちらを参考にインストールしてください（リンク）。

公式サイト

http://opensource.scilifelab.se/projects/frc/

git clone https://github.com/vezzi/FRC_align.git 
cd FRC_align 
mkdir build 
cd build 
cmake .. #boostパスが見つからないと怒られる
make

FRC_align/bin/にFRCができる。パスの通ったディレクトリに移動しておく。

> FRC --help

$ FRC --help

FRC version 1.3.0

Allowed options:

--help produce help message

--pe-sam arg paired end alignment file (in sam or bam format).

Orientation must be -> <-

--pe-max-insert arg maximum allowed insert size for PE (to filter out

outleyers)

--mp-sam arg mate pairs alignment file. (in sam or bam format).

Orientation must be <- ->

--mp-max-insert arg maximum allowed insert size for MP (to filter out

outleyers)

--genome-size arg estimated genome size (if not supplied genome size is

believed to be assembly length

--output arg Header output file names (default FRC.txt and

Features.txt)

--CEstats-PE-min arg minimum allowed CE_stats in PE library

--CEstats-PE-max arg maximum allowed CE_stats in PE library

--CEstats-MP-min arg minimum allowed CE_stats in MP library

--CEstats-MP-max arg maximum allowed CE_stats in MP library

ラン

アセンブルしたcontigにショートのpaired-endとMarte-pairのリードをアライメントして、そのbamを入力とする。ショートのpaired-endのbamは必須となる。入力のbamはsasmtoolsなどでcoordinate sortされている必要がある。

FRC --pe-sam paired.bam --pe-max-insert 3000 --genome-size 40000000 --output OUTPUT_HEADER

--pe-sam arg paired end alignment file (in sam or bam format). Orientation must be -> <-
--pe-max-insert arg maximum allowed insert size for PE (to filter out outleyers)
--mp-sam arg mate pairs alignment file. (in sam or bam format). Orientation must be <- ->
--mp-max-insert arg maximum allowed insert size for MP (to filter out outleyers)
--genome-size arg estimated genome size (if not supplied genome size is believed to be assembly length
--output arg Header output file names (default FRC.txt and Features.txt)

mate-pairのデータはできる限りあったほうが良いとされる。

引用

Feature-by-Feature – Evaluating De Novo Sequence Assembly

Francesco Vezzi, Giuseppe Narzisi, Bud Mishra

PLoS One. 2012; 7(2): e31002.

Reevaluating Assembly Evaluations with Feature Response Curves: GAGE and Assemblathons

Francesco Vezzi, Giuseppe Narzisi and Bud Mishra

PLoS One. 2012; 7(12): e52210.

Comparing De Novo Genome Assembly: The Long and Short of It

Giuseppe Narzisi, and Bud Mishra

PLoS One. 2011; 6(4): e19175.

2018-01-19

DNAでもRNAでも使える、複数サンプルのマッピングを同時比較できるGUIツール Qualimap2

evaluation tool bam/sam GUIツール RNA seq 結果の視覚化 (visualization) 2016 Bioinformatics

2019 9/8 インストール追記

公式サイト

http://qualimap.bioinfo.cipf.es

ユーザーマニュアル

http://qualimap.bioinfo.cipf.es/doc_html/index.html

ワークフロー

http://qualimap.bioinfo.cipf.es/doc_html/workflow.html

CUI環境でのラン。

http://qualimap.bioinfo.cipf.es/doc_html/command_line.html

インストール

依存

Picard: a Java API (SAM-JDK) for creating programs that read and write SAM files.
FastQC: a quality control tool for high throughput sequence data.
SAMTools: essential utilities for manipulating alignments in the SAM format.
NOISeq （リンク） quality control and differential gene expression analysis for RNA-seq data.
Repitools （リンク）quality assessment, visualization, summarization and statistical analysis of epigenomics experiments.

公式サイトからダウンロードしてフォルダにパスを通す。

#解凍して中に入り、次のように打つ。
echo export PATH=\$PATH:`pwd`\ >> ~/.bash_profile && source ~/.bash_profile

#またはcondaを使う、ここではcondaの仮想環境に導入。
#biocona (link)
conda create -n qualimap-env -y qualimap
conda activate qualimap-env

> qualimap -h

$ qualimap -h

Java memory size is set to 1200M

Launching application...

Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=1024m; support was removed in 8.0

QualiMap v.2.2.1

Built on 2016-10-03 18:14

usage: qualimap <tool> [options]

To launch GUI leave <tool> empty.

Available tools:

bamqc Evaluate NGS mapping to a reference genome

rnaseq Evaluate RNA-seq alignment data

counts Counts data analysis (further RNA-seq data evaluation)

multi-bamqc Compare QC reports from multiple NGS mappings

clustering Cluster epigenomic signals

comp-counts Compute feature counts

Special arguments:

--java-mem-size Use this argument to set Java memory heap size. Example:

qualimap bamqc -bam very_large_alignment.bam --java-mem-size=4G

ラン

qualimap --java-mem-size=4G #GUI版が開く

File -> New analysis -> BAM QCを選択。

f:id:kazumaxneo:20180119012031j:plain

BAMファイルを選択（SAMも可能だがposition sort されている必要がある）。

f:id:kazumaxneo:20180119012147j:plain

>>> Start analysisをクリックして解析開始。

マッピングの詳細。

f:id:kazumaxneo:20180119012350j:plain

coverage

f:id:kazumaxneo:20180119012446j:plain

coverage histgram

f:id:kazumaxneo:20180119012518j:plain

duplication率

f:id:kazumaxneo:20180119012640j:plain

リードのポジション毎の塩基の分布

f:id:kazumaxneo:20180119012657j:plain

GC率の分布

f:id:kazumaxneo:20180119012711j:plain

リードのポジション毎のクリッピング率

f:id:kazumaxneo:20180119012732j:plain

Homo polymer

f:id:kazumaxneo:20180119012933j:plain

Mapping quality

f:id:kazumaxneo:20180119012952j:plain

Mapping qualityのヒストグラム

f:id:kazumaxneo:20180119013043j:plain

インサートサイズ

f:id:kazumaxneo:20180119013128j:plain

インサートサイズの分布

f:id:kazumaxneo:20180119013101j:plain

結果はexport as PDFまたはHTMLから画像出力できる。

f:id:kazumaxneo:20180119013307j:plain

複数sampleを比較するなら、Multi-sample BAM QCから開始する。PCA plotなどが追加される。データも重ねて表示される（12sample例）。

f:id:kazumaxneo:20180119015611j:plain

リンク先から色々なデータを確認できます。

http://qualimap.bioinfo.cipf.es/doc_html/samples.html

引用

Qualimap 2: advanced multi-sample quality control for high-throughput sequencing data.

Okonechnikov K, Conesa A, García-Alcalde F.

Bioinformatics. 2016 Jan 15;32(2):292-4.

Qualimap: evaluating next-generation sequencing alignment data.

García-Alcalde F1, Okonechnikov K, Carbonell J, Cruz LM, Götz S, Tarazona S, Dopazo J, Meyer TF, Conesa A.

Bioinformatics. 2012 Oct 15;28(20):2678-9.

2018-01-18

シンプルなSRA検索webサイト SRA Explorer

download tips web tool SRA

DDBJ、EMBL-EBI、NCBIのSRAの検索エンジンは情報が多く、簡単にシーケンスデータを取ってくるにはやや使いにくい。ExplorerはSRAの検索ツール。Phil Ewels さんが作成されたwebツールで、SRAのAPIを使い、高速にSRAのデータを検索する。シンプルなインターフェイスで使いやすい。bashのダウンロードスクリプトも作ってくれるので、ペーストすればですぐにシーケンスデータのダウンロードを始められる。

公式サイト

http://opensource.scilifelab.se/projects/sra-explorer/

Github