goleft: a bioinformatics toolkit for analyzing BAM files

2020 3/15: added installation notes, updated help output

2020 4/19: addendum

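goleft is a bioinformatics toolkit released under the MIT license. It is written in Go.

Installation

A macOS (osx) binary can be downloaded from the releases page (link). Make it executable and move it to a directory on your PATH.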

chmod u+x goleft-osx.dms
mv goleft-osx.dms /usr/local/bin/goleft
goleft # check that it runs

#bioconda (link)
conda install -c bioconda -y goleft
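
After either install route, it can help to confirm that the binary is actually visible on PATH before moving on. The following is a minimal sketch; the helper name `have_tool` is hypothetical and not part of goleft.

```shell
# Return success if the named command can be found on PATH.
# have_tool is a hypothetical helper name used only for this sketch.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

if have_tool goleft; then
    echo "goleft found at: $(command -v goleft)"
else
    echo "goleft not found on PATH"
fi
```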

> goleft -h

$ goleft -h

goleft Version: 0.2.0

covstats : coverage stats across bams by sampling

depth : parallelize calls to samtools in user-defined windows

depthwed : matricize output from depth to n-sites * n-samples

indexcov : quick coverage estimate using only the bam index

indexsplit : create regions of even coverage across bams/crams

samplename : report samplename(s) from a bam's SM tag

> goleft covstats -h

$ goleft covstats -h

coverage insert_mean insert_sd insert_5th insert_95th template_mean template_sd pct_unmapped pct_bad_reads pct_duplicate pct_proper_pair read_length bam sample

Usage: goleft [--n N] [--regions REGIONS] [--fasta FASTA] BAMS [BAMS ...]

Positional arguments:

BAMS bams/crams for which to estimate coverage

Options:

--n N, -n N number of reads to sample for length [default: 1000000]

--regions REGIONS, -r REGIONS

optional bed file to specify target regions

--fasta FASTA, -f FASTA

fasta file. required for cram format

--help, -h display this help and exit

> goleft depth -h

$ goleft depth -h

Usage: goleft [--windowsize WINDOWSIZE] [--maxmeandepth MAXMEANDEPTH] [--ordered] [--q Q] [--chrom CHROM] [--mincov MINCOV] [--stats] [--reference REFERENCE] [--processes PROCESSES] [--bed BED] --prefix PREFIX BAM

Positional arguments:

BAM bam for which to calculate depth

Options:

--windowsize WINDOWSIZE, -w WINDOWSIZE

window size in which to calculate high-depth regions [default: 250]

--maxmeandepth MAXMEANDEPTH, -m MAXMEANDEPTH

windows with depth > than this are high-depth. The default reports the depth of all regions.

--ordered, -o force output to be in same order as input even with -p.

--q Q, -Q Q mapping quality cutoff [default: 1]

--chrom CHROM, -c CHROM

optional chromosome to limit analysis

--mincov MINCOV minimum depth considered callable [default: 4]

--stats, -s report sequence stats [GC CpG masked] for each window

--reference REFERENCE, -r REFERENCE

path to reference fasta

--processes PROCESSES, -p PROCESSES

number of processors to parallelize.

--bed BED, -b BED optional file of positions or regions to restrict depth calculations.

--prefix PREFIX prefix for output files depth.bed and callable.bed

--help, -h display this help and exit

> goleft indexcov -h

$ goleft indexcov -h

Usage: goleft --directory DIRECTORY [--includegl] [--excludepatt EXCLUDEPATT] [--sex SEX] [--chrom CHROM] [--fai FAI] BAM [BAM ...]

Positional arguments:

BAM bam(s) or crais for which to estimate coverage

Options:

--directory DIRECTORY, -d DIRECTORY

directory for output files

--includegl, -e plot GL chromosomes like: GL000201.1 which are not plotted by default

--excludepatt EXCLUDEPATT [default: ^chrEBV$|^NC|_random$|Un_|^HLA\-|_alt$|hap\d$]

--sex SEX, -X SEX comma delimited names of the sex chromosome(s) used to infer sex. Set to '' if no sex chromosomes are present. [default: X,Y]

--chrom CHROM, -c CHROM

optional chromosome to extract depth. default is entire genome.

--fai FAI, -f FAI fasta index file. Required when crais are used.

--help, -h display this help and exit

> goleft depthwed -h

$ goleft depthwed -h

Usage: goleft --size SIZE BEDS [BEDS ...]

Positional arguments:

BEDS depth.bed files from goleft depth

Options:

--size SIZE, -s SIZE sizes of windows to aggregate to must be >= window in input files.

--help, -h display this help and exit

> goleft indexsplit -h

$ goleft indexsplit -h

Usage: goleft --n N [--fai FAI] [--problematic PROBLEMATIC] INDEXES [INDEXES ...]

Positional arguments:

INDEXES bai's/crais to use for splitting genome.

Options:

--n N, -n N number of regions to split to.

--fai FAI fasta index file.

--problematic PROBLEMATIC, -p PROBLEMATIC

pipe-delimited list of regions to split small.

--help, -h display this help and exit

> goleft samplename -h

$ goleft samplename -h

samplename 0.2.0

Usage: goleft [--errormulti] BAM

Positional arguments:

BAM bam for to get sample name(s)

Options:

--errormulti, -e return an error if there is not exactly 1 sample in the bam.

--help, -h display this help and exit

--version display version and exit

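Running

1. covstats: sample a BAM and report coverage and insert size

Run it with a BAM file as the argument. A BED file (--regions) can restrict the analysis to specific regions or chromosomes.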

goleft covstats pair.bam

The output includes coverage, the mean insert size and its SD, and related statistics.

[Figure: covstats output]

The output shown was piped through cut and column for aligned display.
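
The exact cut and column command is not given in the post, so the following is only a sketch of one way to do it. It assumes covstats writes tab-separated text with the header listed in the help output above. The choice of columns (coverage, insert_mean, insert_sd, read_length, sample) and the function name `format_covstats` are assumptions made for illustration.

```shell
# Keep a few columns of covstats output and align them for reading.
# Column numbers follow the header shown in the covstats help:
#   1=coverage 2=insert_mean 3=insert_sd 12=read_length 14=sample
# The column selection is an assumption, not the command used in the post.
format_covstats() {
    cut -f 1,2,3,12,14 |
    if command -v column >/dev/null 2>&1; then
        column -t
    else
        cat   # fall back to raw tab-separated text if column is unavailable
    fi
}

# Intended use (requires goleft and a BAM file):
#   goleft covstats pair.bam | format_covstats
```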

2. depth: compute BAM coverage in fixed-size windows

goleft depth --reference ref.fa --prefix output input.bam

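Coverage for each window of the chosen size (--windowsize, default 250) is written to the depth.bed and callable.bed files named with the given prefix.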
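
One simple follow-up is to list the windows whose depth falls below a threshold. The sketch below assumes that the depth file is a tab-separated BED with four columns (chromosome, start, end, mean depth) and that it is named output.depth.bed for the prefix used above. Both points are assumptions to verify against your own output. The function name `low_depth_windows` is hypothetical.

```shell
# Print windows whose depth (column 4) is below a threshold.
# Assumes a 4-column tab-separated BED: chrom, start, end, mean depth.
# That layout is an assumption; check it against your own depth.bed.
low_depth_windows() {
    # $1 = path to the depth BED file, $2 = depth threshold
    awk -F '\t' -v min="$2" '$4 + 0 < min + 0' "$1"
}

# Intended use (file name is an assumption based on --prefix output):
#   low_depth_windows output.depth.bed 4
```

The threshold of 4 in the usage line mirrors the --mincov default from the help output; any other value works the same way.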

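3. indexcov: extremely fast coverage estimation from the BAM index (.bam.bai) alone. It can analyze 30 samples in about 30 seconds (ref. 1).

Put multiple BAM files, with their indexes, in a directory and run it on them.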

goleft indexcov --directory output/ bam_data/*.bam

Unrelated files in the directory caused an error for me, so it may be better to collect only the BAM and BAI files. The results are written as HTML.
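
One way to gather only the BAM and BAI files without copying large data is to symlink them into a fresh directory. This is a minimal sketch; the function name `collect_bams` and the directory name `bam_only` are hypothetical.

```shell
# Symlink only *.bam and *.bai files from a source directory into a
# clean directory, leaving every other file behind.
# collect_bams is a hypothetical helper name used only for this sketch.
collect_bams() {
    src=$(cd "$1" && pwd) || return 1   # absolute path so the links resolve
    dst=$2
    mkdir -p "$dst"
    for f in "$src"/*.bam "$src"/*.bai; do
        [ -e "$f" ] || continue         # skip glob patterns that matched nothing
        ln -sf "$f" "$dst/"
    done
}

# Intended use (requires goleft):
#   collect_bams bam_data bam_only
#   goleft indexcov --directory output/ bam_only/*.bam
```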

[Figure: indexcov HTML report]

Only two samples are shown above. The official page presents results that are easier to interpret (link).

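The toolkit also includes indexsplit, which splits the genome into N regions holding roughly equal amounts of data across the BAMs (useful for parallelizing computation), and samplename, which reports the sample name(s) from the SM tag.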

2020 4/19

Creating a report with multiqc.

goleft indexcov --directory goleft_outdir *.bam
multiqc .

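goleft itself produces an excellent HTML report, but the multiqc output makes it easier to compare samples and is convenient for building a combined report together with other analysis results.

References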

ref.1

Indexcov: fast coverage quality control for whole-genome sequencing.

Pedersen BS, Collins RL, Talkowski ME, Quinlan AR

Gigascience. 2017 Nov 1;6(11):1-6.

https://github.com/brentp/goleft
