2017-06-20

Oxford Nanoporeリードのアセンブリ smartdenovo

2021 3/9 追記

　SMARTdenovoは、PacBioおよびOxford Nanopore（ONT）データ用の新しいアセンブラ。エラー訂正なしでリードをアライメントしてアセンブリを生成する。

　SMARTdenovoはいくつかの独立したコマンドラインツールで構成されている：rawリードのall vs allのアライメントとオーバーラップのためのwtzmo、欠落オーバーラップを救済するwtgbo、低品質領域とキメラを識別するためのwtclp、およびより良いunitigコンセンサスを生成するwtcnsまたはwtmsa。 smartdenovo.plスクリプトは、これらのプログラムを一度に呼び出すための便利なインターフェイスを提供している。

インストール

Github

git clone https://github.com/ruanjue/smartdenovo.git && (cd smartdenovo; make)

ラン

サンプルデータのダウンロードと整形

wget -O- http://www.cbcb.umd.edu/software/PBcR/data/selfSampleData.tar.gz | tar zxf - awk 'NR%4==1||NR%4==2' selfSampleData/pacbio_filtered.fastq | sed 's/^@/>/g' > reads.fa

アセンブルしてunitigを出力する（unitig）。

smartdenovo/smartdenovo.pl reads.fa > wtasm.mak #アセンブル前の準備
make -f wtasm.mak #アセンブル

引用

https://github.com/ruanjue/smartdenovo/blob/master/README.md

2021 3/9

SMARTdenovo: a de novo assembler using long noisy reads
Hailin LiuShigang WuAlun LiJue Ruan*

Gigabyte,1,2021 https://doi.org/10.46471/gigabyte.15

2017-06-20

ナノポアのアダプタートリミングツール Porechop

Nanopore long read trimming / preprocessing

2020 5/18 インストール手順追記

2024/04/13　追記

PorechopはOXford Nanoporeのリードのアダプタートリミングツール。データベースを保持しており、自動でアダプター配列を認識し除去してくれる。マルチプレックスのidnex配列を除く機能も持つ。

ダウンロードリンク

GitHub - rrwick/Porechop: Adapter trimmer for Oxford Nanopore reads

インストール

Github

#bioconda(link)
mamba create -n porechop -y
conda activate porechop
mamba install -c bioconda porechop -y

git clone https://github.com/rrwick/Porechop.git 
cd Porechop 
python3 setup.py install # usr/locan/binにパスも通る
porechop -h #ヘルプの表示

> porechop -h

$ porechop -h

usage: porechop -i INPUT [-o OUTPUT] [--format {auto,fasta,fastq,fasta.gz,fastq.gz}] [-v VERBOSITY] [-t THREADS] [-b BARCODE_DIR] [--barcode_threshold BARCODE_THRESHOLD] [--barcode_diff BARCODE_DIFF] [--require_two_barcodes]

[--untrimmed] [--discard_unassigned] [--adapter_threshold ADAPTER_THRESHOLD] [--check_reads CHECK_READS] [--scoring_scheme SCORING_SCHEME] [--end_size END_SIZE] [--min_trim_size MIN_TRIM_SIZE]

[--extra_end_trim EXTRA_END_TRIM] [--end_threshold END_THRESHOLD] [--no_split] [--discard_middle] [--middle_threshold MIDDLE_THRESHOLD] [--extra_middle_trim_good_side EXTRA_MIDDLE_TRIM_GOOD_SIDE]

[--extra_middle_trim_bad_side EXTRA_MIDDLE_TRIM_BAD_SIDE] [--min_split_read_size MIN_SPLIT_READ_SIZE] [-h] [--version]

Porechop: a tool for finding adapters in Oxford Nanopore reads, trimming them from the ends and splitting reads with internal adapters

Main options:

-i INPUT, --input INPUT FASTA/FASTQ of input reads or a directory which will be recursively searched for FASTQ files (required)

-o OUTPUT, --output OUTPUT Filename for FASTA or FASTQ of trimmed reads (if not set, trimmed reads will be printed to stdout)

--format {auto,fasta,fastq,fasta.gz,fastq.gz}

Output format for the reads - if auto, the format will be chosen based on the output filename or the input read format (default: auto)

-v VERBOSITY, --verbosity VERBOSITY Level of progress information: 0 = none, 1 = some, 2 = lots, 3 = full - output will go to stdout if reads are saved to a file and stderr if reads are printed to stdout (default: 1)

-t THREADS, --threads THREADS Number of threads to use for adapter alignment (default: 12)

Barcode binning settings:

Control the binning of reads based on barcodes (i.e. barcode demultiplexing)

-b BARCODE_DIR, --barcode_dir BARCODE_DIR

Reads will be binned based on their barcode and saved to separate files in this directory (incompatible with --output)

--barcode_threshold BARCODE_THRESHOLD

A read must have at least this percent identity to a barcode to be binned (default: 75.0)

--barcode_diff BARCODE_DIFF If the difference between a read's best barcode identity and its second-best barcode identity is less than this value, it will not be put in a barcode bin (to exclude cases which are too

close to call) (default: 5.0)

--require_two_barcodes Reads will only be put in barcode bins if they have a strong match for the barcode on both their start and end (default: a read can be binned with a match at its start or end)

--untrimmed Bin reads but do not trim them (default: trim the reads)

--discard_unassigned Discard unassigned reads (instead of creating a "none" bin) (default: False)

Adapter search settings:

Control how the program determines which adapter sets are present

--adapter_threshold ADAPTER_THRESHOLD

An adapter set has to have at least this percent identity to be labelled as present and trimmed off (0 to 100) (default: 90.0)

--check_reads CHECK_READS This many reads will be aligned to all possible adapters to determine which adapter sets are present (default: 10000)

--scoring_scheme SCORING_SCHEME Comma-delimited string of alignment scores: match, mismatch, gap open, gap extend (default: 3,-6,-5,-2)

End adapter settings:

Control the trimming of adapters from read ends

--end_size END_SIZE The number of base pairs at each end of the read which will be searched for adapter sequences (default: 150)

--min_trim_size MIN_TRIM_SIZE Adapter alignments smaller than this will be ignored (default: 4)

--extra_end_trim EXTRA_END_TRIM This many additional bases will be removed next to adapters found at the ends of reads (default: 2)

--end_threshold END_THRESHOLD Adapters at the ends of reads must have at least this percent identity to be removed (0 to 100) (default: 75.0)

Middle adapter settings:

Control the splitting of read from middle adapters

--no_split Skip splitting reads based on middle adapters (default: split reads when an adapter is found in the middle)

--discard_middle Reads with middle adapters will be discarded (default: reads with middle adapters are split) (required for reads to be used with Nanopolish, this option is on by default when outputting reads

into barcode bins)

--middle_threshold MIDDLE_THRESHOLD Adapters in the middle of reads must have at least this percent identity to be found (0 to 100) (default: 90.0)

--extra_middle_trim_good_side EXTRA_MIDDLE_TRIM_GOOD_SIDE

This many additional bases will be removed next to middle adapters on their "good" side (default: 10)

--extra_middle_trim_bad_side EXTRA_MIDDLE_TRIM_BAD_SIDE

This many additional bases will be removed next to middle adapters on their "bad" side (default: 100)

--min_split_read_size MIN_SPLIT_READ_SIZE

Post-split read pieces smaller than this many base pairs will not be outputted (default: 1000)

Help:

-h, --help Show this help message and exit

--version Show program's version number and exit

(porechop) kamisakakazumanoMac-mini:plastome_output20200512 kazu$

初めに10000リードランダムに（？）抽出して、Porechopのアダプターライブラリと称号を行いアダプターを検出する。その時の閾値は90%以上の相同性となっているが、--adapter_thresholdを指定すれば変更可能。

ラン

Porechopのデータベースと比較して、アダプター配列をstartとendから除く。

porechop -i input_reads.fastq.gz -o output_reads.fastq.gz

非圧縮のfastq/fastaも使用できる。重要そうなパラメータを載せておく。

--adapter_threshold: An adapter set has to have at least this percent identity to be labelled as present and trimmed off (0 to 100) (default: 90.0)
--end_size: The number of base pairs at each end of the read which will be searched for adapter sequences (default: 150)
--min_trim_size: Adapter alignments smaller than this will be ignored (default: 4)
--end_threshold: Adapters at the ends of reads must have at least this percent identity to be removed (0 to 100) (default: 75.0)
--extra_end_trim: his many additional bases will be removed next to adapters found at the ends of reads (default: 2)

特に--end_thresholdは大きく影響を与えそうである。著者は1DのONTリードでのみ検証しており、2Dのデータの使用については保証していない。精度の高い2Dのデータならばもう少し相同性に関わる値を厳しくした方が良い可能性がある。

結果

Trimming adapters from read ends

SQK-NSK007_Y_Top: AATGTACTTCGTTCAGTTACGTATTGCT

SQK-NSK007_Y_Bottom: GCAATACGTAACTGAACGAAGT

Rapid_adapter: GTTTTCGCATTTATCGTGAAACGCTTTCGCGTTTTTCGTGCGCCGCTTCA

14198 / 22907 reads had adapters trimmed from their start (842478 bp removed)

778 / 22907 reads had adapters trimmed from their end (8997 bp removed)

Splitting reads containing middle adapters

0 / 22907 reads were split based on middle adapters

Saved result to /Users/user/nanopore2/merged_trimmed.fastq

lambdaのコントロールのシーケンスデータを読むと、このような結果となった。

--verbosity 2をつけると、どこにアダプター配列を含むかなどが細かく表示される。

porechop -i input_reads.fastq.gz -o output_reads.fastq.gz --verbosity 2

Porechopが検索する既知のNanoporeアダプターは、adapters.pyファイルで定義されている（Githubより）。以下の通り。

Ligation kit adapters
Rapid kit adapters
PCR kit adapters
Barcodes
Native barcoding
Rapid barcoding

自分のアダプター配列をPorechopに追加したい場合は、adapters.pyファイルを編集する。手順はそのファイルに記載されている（現在のバージョンだと55−74行目）。

追記

リード数が多いと数百GBのメインメモリを使う。メモリが足りない時はfastqを分割して実行する。

引用

GitHub - rrwick/Porechop: adapter trimmer for Oxford Nanopore reads

2017-06-19

sam/bam関係のツールまとめ

bam/sam format conversion summary

随時更新

2019 1/23 リンク修正

2020 4/17 samtoolsについてmultiqcと連携する例を追記、4/18 help更新、インストール方法追加

2024/02/24 誤字修正 &インストールコマンド修正（conda => mamba ）

samとbamのハンドリングに関するツールを紹介する。

追記

--2017--

8/20 samblaster samblasterでduplicationリードにタグをつける
8/29 BBTools 　 其の１、其の２
9/27 bamに塩基置換やindel変異を起こす　bamに塩基置換やindel変異を起こすbamsurgeon
12/21 sambamba 高速なbam/samの解析ツール Sambamba
12.28 TopHatのunmapped.bamを修復する TopHat-Recondition

--2018--

1/19 Qualimap2 複数サンプルのマッピングを同時比較できるGUIツール Qualimap2
2/13 分析レポート出力 bamの分析に使うバイオインフォマティクスのツールキット goleft
2/21 各chromosomeのカバレッジ　 BBtoolsを使い各クロモソームのカバレッジを計算する
2/28 RNA seqのbam品質管理　RNA seqのクオリティチェックツール QoRTs
2/28 RNA seqのbam操作と品質管理　RNA seqのクオリティコントロールを行う RSeQC
3/9 bam操作　NGSUtilsその１ bamutils
3/31 SAMTools互換のsam高速処理　SAMTools互換で高速なsam,bam,cram処理ツールelprep
4/20 ファイルの単純分割　samやfastqの単純分割
4/28 構造変化部位の可視化 構造変化が起きた部位のマッピング状況を出力
5/16 realignmnet　コード領域のリアライメントによってコールを改善ABRA
5/25 bamからprimerトリム　bamのプライマートリミング BAMClipper
5/27 bamのプロセシング　高速なbam処理ツール biobambam2
6/6 bamの領域ごとのカバレッジ　 WGS、WESのカバレッジを素早く計算する mosdepth
6/6 bamをVCFコール周辺に限定　BAMを感心対象のみにするVariantBam
6/9 マルチマッピング補正　マルチマッピングを補正する MMR
6/20 bamの分析　bamの分析ツール Alfred
6/22 sam/bamの分析　シンプルなfastq、sam、bamの分析ツール fastqp
6/27 同じサンプルか判定　同じサンプルに由来するかvariant callingから判定する BAM-matcher
6/28 bam/samのカバレッジ　bam/samのカバレッジなどを計算する pysamstats
6/29 bam/fastqの管理ツール　bam, fastqのユーティリティツール EA-Utils
7/5 bam/samのカバレッジ　詳細なリードカウント情報を出力する bam-readcount
7/6 bamのvisualization statistics bamの可視化分析ツール bam.iobio.io
7/7 Picard Toolsのbam分析コマンドを自動実行　bamを分析し結果を統合する picardmetrics
8/1 bamとfastqが同一サンプルかチェック　bamとfastqの整合性を確認する BamHash
8/4 sam/bamがmalformedではないか分析　PicardのValidateSamFile
9/16 BAMを複雑な条件でフィルタリング　BAMを様々な条件でフィルタリングできる BAMQL
11/3 シーケンシングデータのハプロタイプを可視化 リードを分類する HapFlow
12/21 bamをクリーニング　gencore

2019

1/14 リードの抽出とリアライメント　リアライメントを素早く実行する Bazam
4/15 bamのデプスをコンソールで表示　bamのカバレッジを素早く確認bamcov
5/8　コンセンサス配列call　deep bamからコンセンサス出力ConsensusFixer
6/16　bamを扱うcライブラリ bamM bamファイルを扱う bamM
8/6　fasta/fastq/bamのユーティリティツール　 fxtools

2020

coverageトラックの視覚化　　カバレッジトラックを視覚化する SparK
bamを操作する包括的なツールキット(下でも紹介) BamDeal

2021

バリアント領域のアラインメント画像を自動作成（たくさんの領域の自動処理にも対応）　BamSnap
ポジションに沿ってカバレッジプロットを生成 tinycov
ホストゲノムにunmapのリードを回収　 Bowtie 2を使って素早くホスト由来リード除去
クリッピング領域を除く　samファイルのクリッピングされたアラインメントを除く samclip

2022

高効率なカバレッジ計算ツール　BamToCov
BAMカバレッジ、polymorphic サイト率など計算する　CMSeq
ゲノムのBAMファイルを転写産物の BAMに変換してsalmonで扱う　mudskipper
BAMインデックス（bam.bai）ファイルを細かく切り刻む　chopBAI
効率的な圧縮器　Genozip TM

2023

包括的なフォーマットコンバーター Bioconvert

picard-tools

homebrewやbiocondaでインストールできる。

#homebrew
brew install picard-tools

#bioconda 
mamba install -c bioconda picard -y

BamIndexStats リファレンスにアラインされたリードの数。

picard BamIndexStats I=output_sorted.bam > bam_index_stats.tsv

I=でbamを指定。

>のリダイレクトで出力ファイル名を指定。

クロモソームごとのアライメントされたリードの数を出力できる。

Lambda_NEB length= 48502 Aligned= 4613 Unaligned= 0

CollectAlignmentSummaryMetrics アライメントされたリードの平均サイズ。

picard CollectAlignmentSummaryMetrics I=R1R2_sorted.bam OUTPUT=aln_sum_metric.tsv

CollectInsertSizeMetrics インサートサイズの分析。

picard CollectInsertSizeMetrics INPUT=R1R2_sorted.bam OUTPUT=insert_size_metric.tsv H=insert_size_metric.pdf

図が出力される。

f:id:kazumaxneo:20170723191440j:plain

QualityScoreDistribution クオリティスコアの分析。

picard QualityScoreDistribution  INPUT=R1R2_sorted.bam OUTPUT=quality_score_dist.tsv CHART_OUTPUT=quality_score_dist.pdf

図が出力される。

f:id:kazumaxneo:20170723191557j:plain

gatkのBQSR補正したbamを使った場合、補正前のスコアがOriginal quality scoreで表示され、補正前後のクオリティの変化を比較できる。

SamToFastq fastqの抽出

picard SamToFastq  INPUT=R1R2_sorted.bam F=R1.fq F2=R2.fq

ペアリードをFとF2で指定する。シングルならF=で指定。ペアリードの片方だけアンマップのペアや、セカンドアライメントのペア、ペアでないリードだけ返すようなオプションもある。https://gsl.hudsonalpha.org/information/software/bam2fastqにはunmapのリードを返せるとあるが、そのようなオプションはないように見える。

（下に載せた"bamtools split"でunmapped.bamを作り、それからリードを取り出すことでunmapリードを回収できる。）

samtools

インストール

#bioconda
mamba install -c bioconda -y samtools

#homebrew
brew install samtools

$ samtools

Program: samtools (Tools for alignments in the SAM format)

Version: 1.9 (using htslib 1.9)

Usage: samtools <command> [options]

Commands:

-- Indexing

dict create a sequence dictionary file

faidx index/extract FASTA

fqidx index/extract FASTQ

index index alignment

-- Editing

calmd recalculate MD/NM tags and '=' bases

fixmate fix mate information

reheader replace BAM header

targetcut cut fosmid regions (for fosmid pool only)

addreplacerg adds or replaces RG tags

markdup mark duplicates

-- File operations

collate shuffle and group alignments by name

cat concatenate BAMs

merge merge sorted alignments

mpileup multi-way pileup

sort sort alignment file

split splits a file by read group

quickcheck quickly check if SAM/BAM/CRAM file appears intact

fastq converts a BAM to a FASTQ

fasta converts a BAM to a FASTA

-- Statistics

bedcov read depth per BED region

depth compute the depth

flagstat simple stats

idxstats BAM index stats

phase phase heterozygotes

stats generate stats (former bamcheck)

-- Viewing

flags explain BAM flags

tview text alignment viewer

view SAM<->BAM<->CRAM conversion

depad convert padded BAM to unpadded BAM

flagstat - シンプルなアライメントレポートを出力

samtools flagstat -@ 4 input.bam

#multiqc(link)で統合レポート
samtools flagstat -@ 4 sample1.bam > sample1_flagstats
samtools flagstat -@ 4 sample2.bam > sample2_flagstats
samtools flagstat -@ 4 sample3.bam > sample3_flagstats
multiqc .

出力結果の例

3153604 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary 
0 + 0 supplementary 
0 + 0 duplicates
18472 + 0 mapped (0.59% : N/A) 
3153604 + 0 paired in sequencing 
1576802 + 0 read1 
1576802 + 0 read2
5708 + 0 properly paired (0.18% : N/A) 
16576 + 0 with itself and mate mapped
1896 + 0 singletons (0.06% : N/A) 
0 + 0 with mate mapped to a different chr 
0 + 0 with mate mapped to a different chr (mapQ>=5)

properly pairedは、同じクロモソームに適切な向きと位置でアライメントされたペアリードの数になる。詳細に興味がある人はsamtoolsのマニュアルを参考にしてください。

http://www.htslib.org/doc/samtools.html

レポート結果が正しいかどうかは簡単なカウントで判断できる。

samtools view -@ 4 input.bam |wc #１行目と同じ数値になる。

stats -アライメントレポートを出力

samtools stats -@ 4 input.bam

#multiqc(link)で統合レポート
samtools stats -@ 4 sample1.bam > sample1_stats
samtools stats -@ 4 sample2.bam > sample2_stats
samtools stats -@ 4 sample3.bam > sample3_stats
multiqc .

idxstats - BAM indexからのレポート

samtools idxstats input.bam

#multiqc(link)で統合レポート
samtools idxstats sample1.bam > sample1_idxstats
samtools idxstats sample2.bam > sample2_idxstats
samtools idxstats sample3.bam > sample3_idxstats
multiqc .

depth - アライメントされたリードのデプスレポート

samtools depth input.bam > output

#zero coverage positionも含める
samtools depth -a input.bam > output

-a output all positions (including zero depth)
-a -a (or -aa) output absolutely all positions, including unused ref. sequences
-l <int> read length threshold (ignore reads shorter than <int>) [0]
-d/-m <int> maximum coverage depth [8000]. If 0, depth is set to the maximum
integer value, effectively removing any depth limit.
-q <int> base quality threshold [0]
-Q <int> mapping quality threshold [0]

全ポジションのカバレッジを出力したい時に使える。depthコマンドを使えばpileupのカバレッジフィールドをパースする必要がない。

$ head -n 5 depth

chr 1 20

chr 2 20

chr 3 20

chr 4 22

chr 5 22

chr 6 22

chr 7 22

chr 8 22

chr 9 22

chr 10 23

merge - sam/bamのマージ

samtools merge -@ 8 output.bam input1.bam input2.bam input3.bam

２つ以上のbam、samを１つにまとめる。splitコマンドでクロモソームごとに分割し、別のクラスターでランして、後で統合するような用途にも使えそうである。

split - sam/bamの分割

samtools split

fasta - bam/samからfastaを抽出

samtools fasta input.bam > input.fa

fastq - bam/samからfastqを抽出

#シングルエンド
samtools fastq -@ 8 single.bam > single.fq
#または
samtools fastq -@ 8 single.bam -s input.fq

#ペアエンド
samtools fastq -@ 8 pair.bam -1 pair_1.fq -2 pair_2.fq -n

-1 write paired reads flagged READ1 to FILE
-2 write paired reads flagged READ2 to FILE
-0 write paired reads flagged both or neither READ1 and READ2 to
-s write singleton reads to FILE [assume single-end]
-n don't append /1 and /2 to the read name
-N always append /1 and /2 to the read name
FILE

ペアエンドの順番は考慮されない。ペアエンドの順番を維持したい場合は"bam bam2FastQ"コマンド（下の方で紹介）を使う。もしくは一度fastqを出力し、それからBBtoolsのrepairコマンドを使ってペアの同期を回復する。↓

ペアリードファイルの修復（BBTools）。

#pair_1.fq、pair_2.fqを修復
repair.sh -Xmx20g in1=pair_1.fq in2=pair_2.fq out1=fixed1.fq.gz out2=fixed2.fq.gz outs=singletons.fq repair

アンマップのリードだけSPAdesを使ってアセンブルする。

spades.py -1 fixed1.fq.gz -2 fixed2.fq.gz -s singletons.fq -o unmap-spades -k auto -t 8

=> blastnやblastxでどんな生物由来の配列かを推定する。

reheader - bamのヘッダーを変更

samtools reheader header.sam input.bam > output.bam

input.bamのヘッダーをheader.samのヘッダー情報で置換する。bam=>sam=>bamと変換するより圧倒的に早い。例えば以下のようにして使う。

#header情報だけsamに変換
samtools view -H input.bam > header.sam
# viなどでheader.bamを編集。例えばSO:unsortedをSO:sortedに修正。
# sample名をつけ忘れていたなら,@RGのSM:部分を編集する => SM:sample1

#元のbamのヘッダーを編集したheaderで置換。
samtools reheader header.sam input.bam > reheaer.bam 

#headerが交換されたことを確認。
samtools view -H reheaer.bam |head

rmdup - PCR duplicateを除く。

samtools rmdup input.bam output.bam

ペアリードに使える。ペアリードがFRの向きで同じクロモソームの同じ位置にアライメントされているリードが2つ以上あると適用される。duplicate判定されるとマッピングクオリティがベストのものだけ残し、あとのペアは排除される。ペアの片方しかアライメントされていないものには適用されない。duplicateがあるかどうかは、fastqcでfastqを分析すれば調べられる。"-s"でシングルエンドに対応。

addreplacerg - リードグループ情報を追加したり、消したりする。

samtools addreplacerg

phase - heterozygous SNPsのコール。

samtools phase

sort - リードを染色体のポジション順にソート（coordinate sort）。

samtools sort -@ 8 -m 2G pair.bam -o pair_sorted.bam

-@　Set number of sorting and compression threads [1]
-m　Set maximum memory per thread; suffix K/M/G recognized [768M]
-n　Sort by read name

デフォルトではメモリーを768Mまでしか使わない。メモリーがもっとあるなら、-mで指定することで、特に大きなゲノムのソートが速くなる。上では2G/thread指定して、スレッド数を12にしている。入力はbamとsamに対応している。samを指定すると、ソートしたbamを出力してくれる。

"sam"からポジションソート済みbamを作成

samtools sort -@ 8 -m 8G -O BAM pair.sam -o pair_sorted.bam

-O < FORMAT> Specify output format (SAM, BAM, CRAM)

view

bam <=> sam変換やフィルタリング

#samからbamに変換
samtools view -@ 10 -bS input.sam > output.bam

#bamからsamに変換
samtools view -@ 8 -h input.bam > output.sam

#bamの中身を見る。headerは見れない。
samtools view -@ 8 input.bam |less

#bamのheaderだけ見る。ソートされているか
samtools view -@ 8 -H input.bam |less

#50%ランダムサンプリング
samtools view -@ 8 -s 0.5 -b input.bam > random0.5.bam

#マッピングされたリードだけbamとして返す
samtools view -@ 8 -h -b -F 4 input.bam > mapped.bam

#マッピングされなかったリードだけbamとして返す
samtools view -@ 8 -h -b -f 4 input.bam > mapped.bam

#chr1とchr2のマッピングだけ取り出しbamとして返す
samtools view -@ 8 -b input.bam chr1 chr2 > chr1-2.bam

#chr1の100-300のマッピングだけ取り出しbamとして返す
samtools view -@ 8 -b input.bam chr1:100-300 > chr1_100-300.bam

#chr1の100-300のマッピングだけ取り出しペアエンドfastqとして返す
samtools view -b input.bam chr1:100-300 |samtools fastq -1 pair1.fq -2 pair2.fq -

#elprepには、マッピング時にbedで指定された領域のリードだけ残すか捨てるオプション"--filter-non-overlapping-reads bed-file"があります。exome解析などに使えるかと思います。

-b　output BAM
-C　output CRAM requires -T)
-h　include header in SAM output
-S　ignored (input format is auto-detected)
-T　Reference sequence FASTA FILE [null]
-s　FLOAT integer part sets seed of random number generator [0]; rest sets fraction of templates to subsample [no subsampling]
-F INT only include reads with none of the bits set in INT set in FLAG [0]
-f INT have all of the FLAGs present
-@　Number of additional threads to use [0]

他にも以下のような使い方がある。

samからmappingされたリードだけbamに変換。

samtools view -@ 8 -bS -F 4 input.sam > output.bam

samからmappingされたリードだけbamに変換して、さらにfastq出力（unmapを除き、mapされたリードだけfastqで回収）。

samtools view -@ 8 -bS -F 4 input.sam |samtools fastq -@ 10 - > output.fq

bamからchr1にmappingされたリードだけ抽出して、さらにfastq出力。

samtools view -@ 8 -b input.bam chr1 > chr1.bam
samtools index chr1.bam
samtools fastq -@ 8 chr1.bam > chr1_mapped.fq

mapping quality (MAPQ) が１以上のリードだけ出力（mapping qualityについての情報: リンク）。

samtools view -@ 8 -bSq 1 input.sam > filtered.bam

マッピングされなかったリードとマルチマッピングのリードを除く。

samtools view -bS -F 0x904 input.sam > filtered.bam

index

bamファイルのindexing

samtools index sorted.bam

-b Generate BAI-format index for BAM files [default]
-c Generate CSI-format index for BAM files

quickcheck - 追記 bamがおかしくなっていないか、カレントの全てのbamを調べる。

samtools quickcheck -v *.bam > bad_bams.fofn && echo 'all ok' || echo 'some files failed check, see bad_bams.fofn'

bamがおかしくなっていないか（download中に切れたことを気づかないまま進めたり、変換中におちたbamを使ったりなど）。 "||"をつかうことで、エラーがある時だけ||以降が実行されるようにしている。

bamがcoordinateソート（chr順&ポジション順）されているかどうか。ソートされていれば、 SO:coordinateが表示される。

samtools view -H iput.bam | grep @HD

sample名取得。

samtools view -H iput.bam | grep '^@RG' | sed "s/.*SM:\([^\t]*\).*/\1/g" | uniq

GNU parallelsを使った複数のbamの並列処理。同時処理数を"-j"オプションで6つまでに制限している。samtools sortコマンド自体も4スレッドの並列処理をしている。

ls *.bam | parallel -j 6 "samtools sort -@ 4 {} > {.}_sorted.bam; samtools index {.}_sorted.bam"

splitした小さなbamなどを並列処理する時に一番高速化が期待できます。ただし、samtoolsのコマンドはI/O intensitiveなので、同時処理数が増えすぎるとかえって速度が遅くなったり、最悪、コンピュータがフリーズする恐れがあります。注意して使ってください。parallelのあとに"--dry-run"をつけると（parallel --dry-run）、コマンド内容は実行せず、実行コマンドだけ表示されます。まず--dry-runをつけて実行してください。時間短縮を考えるなら、elprep splitなどでbamを分割し、timeをつけて実行時間を測定し、最速の組み合わせを検討することをお勧めします。I/OやCPUなど環境の違いによって最速の組み合わせは変化するはずです。

minimap2でmapping、coordinate sortしたbamを出力。スレッドは12、メモリは2G / thread。リードグループ（@RG）のサンプル名はsample1。

minimap2 -R "@RG\tID:X\tLB:Y\tSM:sample1\tPL:ILLUMINA" -t 12 -ax sr \
ref.fa pair1.fq.gz pair2.fq.gz \
|samtools sort -@ 12 -m 2G -O BAM - > sorted.bam \
&& samtools index sorted.bam

minimap2でmapping、elprepでフィルタリングしてbam出力。elprep以外は上と同じだが、

minimap2 -ax sr -R "@RG\tID:X\tLB:Y\tSM:sample1\tPL:ILLUMINA" -t 12 \
-ax sr ref.fa pair1.fq.gz pair2.fq.gz \
|elprep filter /dev/stdin /dev/stdout \
--mark-duplicates --remove-duplicates --filter-mapping-quality 1 \
--clean-sam --nr-of-threads 12 \ --sorting-order coordinate \
|samtools sort -@ 12 -m 2G -O BAM - > sorted.bam \
&& samtools index sorted.bam

elprepの mark duplicatesとsortはin memoryで動作するため、メモリ要求量はデータサイズに直接比例します（samファイルサイズの10倍近くのメモリが必要）。メモリが足りなければ、--mark-duplicatesと--remove-duplicatesを外すか、samのsplitを行ってから順次フィルタリングするよう変更して下さい。

optional fieldのMD tagをつける。

samtools calmd -@ 12 -b input.bam ref.fasta > input.md.bam 
samtools index input.md.bam

例えばchr2のギャップが検出された領域100-150のリードだけ抽出してfastaに変換、mafftでマルチプルアラインメントしてコンセンサス配列consensus.fastaを出力。mafft（リンク）とEMBOSSのcons（consensus call）を使う。

samtools view -@ 10 -b sort.bam chr2:100-150 \
 |samtools fasta -@ 10 - > gapped.fasta && \
 mafft --auto gapped.fasta > output.maf && \
 cons output.maf consensus.fasta
#注意 *1

#小改良。karretでerror correctionしてからマルチプルアラインメント、最後にnucdiff（紹介）でSNV、indelをコールする（リファレンスはref.faと記載）
samtools view -@ 10 -b sort.bam chr2:100-150 \
 |samtools fasta -@ 10 - > gapped.fasta && \
 |karect -correct -threads=8 -matchtype=hamming -celltype=haploid -inputfile=gapped.fasta -kmer=9 && 
 |mafft --auto gapped.fasta > output.maf && \
 |cons output.maf consensus.fasta && \
 |nucdiff --vcf yes ref.fa onsensus.fasta outdir unmatched

リードではなく、リファレンスゲノム配列の特定の染色体のポジションの配列だけ取り出す。

samtools

#chr2のリファレンス配列だけ取り出して保存する
samtools faidx genome.fa chr2 > chr2.fasta

#chr3の15000-20000のリファレンス配列だけ取り出して保存する
samtools faidx genome.fa chr3:15000-20000 > chr3_15000-20000.fasta

#数が多い時はリストファイルを指定する（テキストに１行に１つずつ指定）
samtools faidx genome.fa -r region_list > chr3_15000-20000.fasta

bamtools

bamファイルを使い色々なことができるツール。samtoolsの機能と重複するものが多いが、よりデータ解析よりに作られている。

インストール

#bioconda
mamba install -c bioconda bamtools -y

#homebrew
brew install bamtools

$ bamtools -h

usage: bamtools [--help] COMMAND [ARGS]

Available bamtools commands:

convert Converts between BAM and a number of other formats

count Prints number of alignments in BAM file(s)

coverage Prints coverage statistics from the input BAM file

filter Filters BAM file(s) by user-specified criteria

header Prints BAM header information

index Generates index for BAM file

merge Merge multiple BAM files into single file

random Select random alignments from existing BAM file(s), intended more as a testing tool.

resolve Resolves paired-end reads (marking the IsProperPair flag as needed)

revert Removes duplicate marks and restores original base qualities

sort Sorts the BAM file according to some criteria

split Splits a BAM file on user-specified property, creating a new BAM output file for each value found

stats Prints some basic statistics from input BAM file(s)

See 'bamtools help COMMAND' for more information on a specific command.

convert - bamtoolsの一番基本的な使い方はbamからのformat変換である。

bamtools convert -format <format> -in input.bam -out output

formatの部分に出力するフォーマットを記載する。pileup、bedのほかfastaやfastqに直接変換することも可能になっている（bedtoolsのbamtofastqと同じ）。ただしbamtoolsはシングルスレッドで動くので、samの出力はマルチスレッド化したsamtools viewが速いと思われる。

stats - bamの情報を出力できる。

bamtools stats -in input.bam

bamtools statsの出力

**********************************************

Stats for BAM file(s):

**********************************************

Total reads: 525231

Mapped reads: 524170 (99.798%)

Forward strand: 262861 (50.0467%)

Reverse strand: 262370 (49.9533%)

Failed QC: 0 (0%)

Duplicates: 0 (0%)

Paired-end reads: 525231 (100%)

'Proper-pairs': 517998 (98.6229%)

Both pairs mapped: 523577 (99.6851%)

Read 1: 262603

Read 2: 262628

Singletons: 593 (0.112903%)

samtoolsのflagstatの出力

525231 + 0 in total (QC-passed reads + QC-failed reads)

0 + 0 secondary

1683 + 0 supplementary

0 + 0 duplicates

524170 + 0 mapped (99.80% : N/A)

523548 + 0 paired in sequencing

261774 + 0 read1

261774 + 0 read2

517042 + 0 properly paired (98.76% : N/A)

521908 + 0 with itself and mate mapped

579 + 0 singletons (0.11% : N/A)

846 + 0 with mate mapped to a different chr

791 + 0 with mate mapped to a different chr (mapQ>=5)

大きな差はないが、個人的にはbamtools statの方が分かりやすい表記だと思う。

追記

#multiqcで統合
bamtools stats -in sample1.bam > sample1_bamtools
bamtools stats -in sample2.bam > sample2_bamtools
bamtools stats -in sample3.bam > sample3_bamtools
multiqc .

count - アライメントされたリード数のカウント。

bamtools count -in input.bam

sort - bamをname順に並べ替える。

bamtools sort -in input.bam -out input_sort.bam -byname

リード名順に並べ替えられる。samtools view -Hでヘッダーを確認すると

$ samtools view -H input_sort.bam |head 1

@HD VN:1.3 SO:queryname

になっており、名前順にソートされていることを確認できる。(-HはSAMheaderのみprintするsamtools viewのオプション）。samtools sort -nと同じと考えられる。

index - bamのindexを作る。

bamtools index -in input_sort.bam

samtools indexと同じである。~.baiが出力される。

coverage - アライメントされたリード数のカウント。

bamtools convert -in input.bam -out coverage.txt

version2.4.1ではなぜかsortしてないというエラーメッセージが出て検証できていない（bamtools sortとindexをかけていてもエラーになってしまう）。

filter - bamのフィルタリング。

bamtools filter -in input.bam -out output.bam -length 50

様々なパラメータでbamをフィルタリングできる。上記では50bpのリードのみ出力される。ただし~以上や~以下、また複数指定もできない（正直、使い勝手はイマイチである）。以下のようなフィルターがある。

-alignmentFlag keep reads with this *exact* alignment flag (for more detailed queries, see below)
-insertSize keep reads with insert size that matches pattern -length <int> keep reads with length that matches pattern
-mapQuality keep reads with map quality that matches pattern -name <string> keep reads with name that matches pattern
-queryBases keep reads with motif that matches pattern
-tag keep reads with this key=>value pair

header - ヘッダー情報を出力。

bamtools header -in input.bam

bamのheaderを出力する。samtools view -Hと同じである。

merge - 複数のbamを１つにマージする。３つ以上も可。

bamtools merge -in input1.bam -in input2.bam -out merge.bam

random - bamを指定数ランダムサンプリングして出力。

bamtools random -in input1.bam -out output.bam -n 10000

この例だと10000リードランダム出力する。bamtools formatやsamtools view -hでsamに戻し、リード数をカウントしてチェックする (wc -l input.sam)。

split - ユーザー定義の情報に従いbamを分割する。

bamtools split -in input.bam -mapped

上記ではリファレンスにマッピングされたリード（> input.MAPPED.bam）、マッピングされていないリード(> input.UNMAPPED.bam)に分けて出力される。利用できる定義は

-mapped split mapped/unmapped alignments
-paired split single-end/paired-end alignments
-reference split alignments by reference
-tag <tag name> splits alignments based on all values of TAG encountered (i.e. -tag RG creates a BAM file for each read group in original BAM file)

クロモソームごとに分割するなら、 -referenceのフラグをつける。

bamtools split -in input.bam -reference

input_chrromosome1.bam、input_chromosome2.bam...unmapped.bamのようにリファレンスのFASTAのヘッダー名付のbamが出力される。

他にもいくつかの機能を持つ。bamtools <command> -hで各コマンドの全機能は確認できる。

elprep（紹介）

インストール

#bioconda 
mamba install -c bioconda -y elprep

# elprep

elprep version 4.1.5 compiled with go1.12.7 - see http://github.com/exascience/elprep for more information.

2020/04/18 03:07:51 Incorrect number of parameters.

Print command details:

[--help]

[--help-extended]

Available commands: filter, sfm, vcf-to-elsites, bed-to-elsites, fasta-to-elfasta

filter/sfm parameters:

elprep [filter | sfm] sam-file sam-output-file

[--replace-reference-sequences sam-file]

[--filter-unmapped-reads]

[--filter-unmapped-reads-strict]

[--filter-mapping-quality mapping-quality]

[--filter-non-exact-mapping-reads]

[--filter-non-exact-mapping-reads-strict]

[--filter-non-overlapping-reads bed-file]

[--replace-read-group read-group-string]

[--mark-duplicates]

[--mark-optical-duplicates file]

[--optical-duplicates-pixel-distance nr]

[--remove-duplicates]

[--remove-optional-fields [all | list]]

[--keep-optional-fields [none | list]]

[--sorting-order [keep | unknown | unsorted | queryname | coordinate]]

[--clean-sam]

[--bqsr recal-file]

[--bqsr-reference elfasta]

[--quantize-levels nr]

[--sqq list]

[--max-cycle nr]

[--known-sites list]

[--nr-of-threads nr]

[--timed]

[--log-path path]

[--intermediate-files-output-prefix name] (sfm only)

[--intermediate-files-output-type [sam | bam]] (sfm only)

[--tmp-path path]

[--single-end] (sfm only)

[--contig-group-size nr] (sfm only)

vcf-to-elsites parameters:

elprep vcf-to-elsites vcf-file elsites-file

[--log-path path]

bed-to-elsites parameters:

elprep bed-to-elsites bed-file elsites-file

[--log-path path]

fasta-to-elfasta parameters:

elprep fasta-to-elfasta fasta-file elfasta-file

[--log-path path]

filter　以下の作業をワンライナーで行う。

アンマップリードを完全に除く（--filter-unmapped-reads-strict）。
duplicationのタグを付け（--mark-duplicates）duplicationリードを捨てる（--remove-duplicates）。
mapping qualityがゼロのリード（bwaならmultiple mappingはゼロ）を捨てる（--filter-mapping-quality）
samのcleaning（--clean-sam）。

elprep filter input.sam output.sam --mark-duplicates --remove-duplicates --filter-mapping-quality 0 --clean-sam

split インプット (sam/bam/cram)をクロモソームごとに分割する。unmapリードは別出力する。

elprep split input.sam output/ --output-prefix "split-sam" --output-type sam --nr-of-threads 8 --single-end

ソートやindexファイルがなくても動作する。シングルエンドのinputは--single-endをつける。デフォルトでは入力をペアエンドとして扱う。

merge splitで分割した複数のインプット (sam/bam/cram)をマージする。

elprep merge input/ output.sam --nr-of-threads 8

シングルエンドのinputは--single-endをつける。デフォルトでは入力をペアエンドとして扱う。bamとcram出力にも対応している。

VariantBam（紹介）

インストール

git clone --recursive https://github.com/jwalabroad/VariantBam.git 
cd VariantBam
./configure
make
cd src/

#bioconda (link)
mamba install -c biocondavariantbam -y

> variant

# variant

Usage: variant <input.bam> [OPTIONS]

Description: Filter a BAM/SAM/CRAM/STDIN according to hierarchical rules

General options

-h, --help Display this help and exit

-v, --verbose Verbose output

-x, --no-output Don't output reads (used for profiling with -q)

-r, --rules JSON ecript for the rules.

-k, --proc-regions-file Samtools-style region string (e.g. 1:1,000-2,000) or BED/VCF of regions to process. -k UN iterates over unmapped-unmapped reads

-Q, --mark-as-qc-fail Flag reads that don't pass VariantBam with the failed QC flag, rather than deleting the read.

Output options

-o, --output Output file to write to (BAM/SAM/CRAM) file instead of stdout

-C, --cram Output file should be in CRAM format

-b, --bam Output should be in binary BAM format

-T, --reference Path to reference. Required for reading/writing CRAM

-s, --strip-tags Remove the specified tags, separated by commas. eg. -s RG,MD

-S, --strip-all-tags Remove all alignment tags

-Z, --write-trimmed Output the base-quality trimmed sequence rather than the original sequence. Also removes quality scores

Filtering options

-q, --qc-file Output a qc file that contains information about BAM

-m, --max-coverage Maximum coverage of output file. BAM must be sorted. Negative values enforce a minimum coverage

-p, --min-phred Set the minimum base quality score considered to be high-quality

Region specifiers

-g, --region Regions (e.g. myvcf.vcf or WG for whole genome) or newline seperated subsequence file.

-G, --exclude-region Same as -g, but for region where satisfying a rule EXCLUDES this read.

-l, --linked-region Same as -g, but turns on mate-linking

-L, --linked-exclude-region Same as -l, but for mate-linked region where satisfying this rule EXCLUDES this read.

-P, --region-pad Apply a padding to each region supplied with the region flags (specify after region flag)

Command line rules shortcuts (to be used without supplying a -r script)

--min-clip Minimum number of quality clipped bases

--max-nbases Maximum number of N bases

--min-mapq Minimum mapping quality

--min-del Minimum number of deleted bases

--min-ins Minimum number of inserted bases

--min-length Minimum read length (after base-quality trimming)

--motif Motif file

-R, --read-group Limit to just a single read group

-f, --include-aln-flag Flags to include (like samtools -f)

-F, --exclude-aln-flag Flags to exclude (like samtools -F)

chr1の1-1,100,000にマッピングされたリードのみbam出力する。

variant input.bam -g 'chr1:1-1,100,000' --min-mapq 10 -b -o mini.bam -v

sam出力なら-bを外す。mini.bamが出力される。

VCFでコールされたポジション周囲100-bpのリードのみbam出力する。

variant input.bam -l input.vcf -P 100 -b -o mini.bam -v

sub-samplingしてmax-coverageを100に減らす（マッパーによっては、リピート領域にMAPQ=0のリードが多数マッピングされる）。

variant input.bam -m 100 -o mini.bam -v -b

clippingされたハイクオリティなリードのみbam出力する。

variant input.bam --min-phred 4 --min-clip 5 -b -o mini.bam -v

goleft（紹介）

インストール

#bioconda 
mamba install -c bioconda goleft -y

$ goleft

goleft Version: 0.2.0

covstats : coverage stats across bams by sampling

depth : parallelize calls to samtools in user-defined windows

depthwed : matricize output from depth to n-sites * n-samples

indexcov : quick coverage estimate using only the bam index

indexsplit : create regions of even coverage across bams/crams

samplename : report samplename(s) from a bam's SM tag

covstats bamをサンプリングしてカバレッジとインサートサイズをレポートする。

goleft covstats pair.bam

カバレッジ、インサートサイズとそのSDなどが出力される。

f:id:kazumaxneo:20180905091943j:plain

insert sizeは-> <-の内側サイズが計算される。250x2でinner inset sizeが100なら、outer inset sizeは600。

depth 一定のウィンドウサイズでbamのカバレッジを計算する。

goleft depth --reference ref.fa --prefix output input.bam

indexcov bam.baiからのカバレッジの超高速な推定。30サンプルを30秒で解析可能。複数のbamをディレクトリに準備してランする。

goleft indexcov --directory goleft_outdir/ *.bam

関係ないファイルがあるとエラーになるのでbamとbaiだけ集める。htmlで結果は出力される。ヒトゲノムでも早ければ１サンプル数秒で解析できる。

出力ディレクトリにできるindex.htmlが全サンプルのまとめ

f:id:kazumaxneo:20200418121928p:plain

multiqcでレポート作成。

goleft indexcov --directory goleft_outdir *.bam
multiqc .

mosdepth（紹介）

インストール

#bioconda (link) (linux)
mamba create -n mosdepth -y
conda activate mosdepth
mamba install -c bioconda mosdepth -y

chr1のデプスを計算する。

mosdepth --by capture.bed output exome.bam -Q 0 -t 10 -c chr1

-Q mapping quality threshold [default: 0]
-t number of BAM decompression threads [default: 0]
-c chromosome to restrict depth calculation.

multiqcで統合レポートを作成する。全スレッド使用。MQ≥1。

mosdepth --by gencode.v33.annotation.bed sample1.mosdepth sample1.bam -Q 1
mosdepth --by gencode.v33.annotation.bed sample2.mosdepth sample2.bam -Q 1
mosdepth --by gencode.v33.annotation.bed sample3.mosdepth sample3.bam -Q 1
multiqc .

biobambam2（紹介）

インストール

#bioconda (link)
mamba install -c bioconda biobambam -y

duplicateにタグをつけ、coordinateソートも行って出力する。

bamsormadup < input.bam level=1 threads=8 inputformat=bam SO=coordinate outputformat=bam indexfilename=output.bam.bai > output.bam

処理の大半が並列化されており、I/Oがリミットになりうるため、I/Oの高速なSSDなどのストレージ使用が推奨されている。threadsフラグがなければ利用できる全ての論理コアを使う。

ペアエンドを照合し、secondary hitは除く。samtools collateと同様の機能（samtools1.8 manual）。 level=0だと非圧縮になる。

bamcollate2 <input.bam collate=1 level=1 exclude=SECONDARY > out.bam

bamを再圧縮する。

bamrecompress <input.bam level=9 numthreads=8 > output.bam

ポジションソートしてbam出力する。

./bamsort SO=coordinate < input.bam inputthreads=8 level=1 outputformat=bam > out.bam

bamからペアエンドfastqを出力。

bamtofastq < out.bam F=1.fq F2=2.fq

Sambamba（紹介）

インストール

#bioconda 
mamba install -c bioconda Sambamba -y

# sambamba -h

sambamba 0.6.6

Usage: sambamba [command] [args...]

Available commands: 'view', 'index', 'merge', 'sort',

'flagstat', 'slice', 'markdup', 'depth', 'mpileup'

To get help on a particular command, just call it without args.

Leave bug reports and feature requests at

https://github.com/lomereiter/sambamba/issues

view　bamの基本情報

sambamba view --reference-info input.bam

-I, --reference-info　 output to stdout only reference names and lengths in JSON

view　samの基本情報

sambamba view -S --reference-info input.bam

-S　specify that input is in SAM format

bamのリファレンス"chr19"に適切にアライメントされるリードだけsam出力。（カウントなら-cをつけ、-o以下を消す）

sambamba view -F "proper_pair" input.bam chr19 -t 8 -f sam -o output.sam

-o　specify output filename

sort（リンク）　ソート。

指定がなければindexが自動でつくcoordinate-sorted。

sambamba sort input.bam -o sorted.bam -t 20 -p -l 6

-t　use specified number of threads
-p　show progressbar in STDERR
-l　level of compression for sorted BAM, from 0 to 9（指定がなければ"6"くらいのサイズになる）

名前でソート。(indexは付けられない)

sambamba sort input.bam -o sorted.bam -t 20 -p -l 5

flagstat（リンク）　bamの情報表示。

bamの情報表示。

sambamba flagstat input.bam -t 8 -p

sambamba flagstat input.bam

509280 + 0 in total (QC-passed reads + QC-failed reads)

0 + 0 secondary

538 + 0 supplementary

0 + 0 duplicates

456058 + 0 mapped (89.55%:N/A)

508742 + 0 paired in sequencing

254371 + 0 read1

254371 + 0 read2

451458 + 0 properly paired (88.74%:N/A)

454670 + 0 with itself and mate mapped

850 + 0 singletons (0.17%:N/A)

0 + 0 with mate mapped to a different chr

0 + 0 with mate mapped to a different chr (mapQ>=5)

表示された情報については上のflagstatのリンク先参照。bamのBGZFフォーマットしか対応していないのでsamはviewで変換してから使う。

index（リンク）　bamのindexファイル作成

sambamba index input.bam -t 12 -p

merge（リンク）　bamのマージ。

圧縮率最大の設定で、12スレッド使い3つのbamをマージ。

sambamba merge merge.bam input1.bam input2.bam nput3.bam -t 12 -p -l 9

-l　Specify compression level of output file as a number from 0 to 9

mpileup（リンク）　samtoolsのmpileupのparallelバージョン。

20スレッド使い、pileupを実行。samtoolsの条件は--samtoolsをつけて書く。

sambamba mpileup input.bam -t 12 --samtools samtools -gu -S -D -d 1000 -L 1000 -m 3 -F 0.0002

mpileup出力 (manual) からリファレンスと異なる塩基の数を数える (引用)。pileup出力の５列目に全バリアント情報があるので、ここを取り出してパースする。

#５フィールド目を取り出して小文字を全て大文字に変換（tr）、１行１文字ずつにして（fold -w 1）ソートし(sort)、ユニークな文字数をカウント(uniq -c)
cut -f 5 sambamba-pileup_output | tr '[a-z]' '[A-Z]' | fold -w 1 | sort | uniq -c > counts

> cat counts

17026 !

947 "

178 #

2048447 $

78 %

78 &

122 '

117 (

101 )

4231 *

3801 +

157425826 ,

3737 -

157527525 .

203 /

80 0

6055 1

1157 2

330 3

249 4

123 5

90 6

250 7

88 8

99 9

118 :

109 ;

88 <

108 =

84 >

105 ?

88 @

95195 A

88 B

98141 C

82 D

90 E

70 F

97752 G

91 H

79 I

104 J

104 K

91 L

99 M

97 N

107 O

93 P

2117 Q

93 R

105 S

94768 T

94 U

296 V

1017 W

151 X

690 Y

2549 Z

231 [

990 \

2010030 ]

2041594 ^

5フィールド目は

- ^ この位置がリードによってカバーされる最初の位置である場合

- ドット(.)、カンマ(,) 塩基が参照塩基と一致

- ACGTN acgtn 塩基が参照塩基とミスマッチしている

- > < 参照スキップ (CIGAR "N ")

- * */# 参照塩基の削除(CIGAR "D")

- $ リードでカバーされる最後の位置の場合

読み取った塩基の後に挿入があった場合、"\+[0-9]+[ACGTNacgtn*#]+"にマッチするテキスト："+"文字の後に挿入の長さを示す整数が続き、その後に挿入された配列が表示される。パッドは "*"で表示され、--reverse-delが使用された場合は、逆鎖のパッドが "#"で表示される。
この読み取り塩基の後に削除がある場合、「-[0-9]+[ACGTNacgtn]+」に一致するテキスト：「-」文字の後に削除された参照塩基が同様に表される。

Samblaster（紹介）

インストール

#bioconda 
mamba install -c bioconda samblaster -y

samblaster: Version 0.1.25

Author: Greg Faust (gf4ea@virginia.edu)

Tool to mark duplicates and optionally output split reads and/or discordant pairs.

Input sam file must contain sequence header and be grouped by read ids (QNAME).

Input typicallly contains paired-end data, although singleton data is allowed with --ignoreUnmated option.

Output will be all alignments in the same order as input, with duplicates marked with FLAG 0x400.

Usage:

For use as a post process on an aligner (eg. bwa mem):

bwa mem <idxbase> samp.r1.fq samp.r2.fq | samblaster [-e] [-d samp.disc.sam] [-s samp.split.sam] | samtools view -Sb - > samp.out.bam

bwa mem -M <idxbase> samp.r1.fq samp.r2.fq | samblaster -M [-e] [-d samp.disc.sam] [-s samp.split.sam] | samtools view -Sb - > samp.out.bam

For use with a pre-existing bam file to pull split, discordant and/or unmapped reads without marking duplicates:

samtools view -h samp.bam | samblaster -a [-e] [-d samp.disc.sam] [-s samp.split.sam] [-u samp.umc.fasta] -o /dev/null

For use with a bam file of singleton long reads to pull split and/or unmapped reads with/without marking duplicates:

samtools view -h samp.bam | samblaster --ignoreUnmated [-e] --maxReadLength 100000 [-s samp.split.sam] [-u samp.umc.fasta] | samtools view -Sb - > samp.out.bam

samtools view -h samp.bam | samblaster --ignoreUnmated -a [-e] [-s samp.split.sam] [-u samp.umc.fasta] -o /dev/null

Input/Output Options:

-i --input FILE Input sam file [stdin].

-o --output FILE Output sam file for all input alignments [stdout].

-d --discordantFile FILE Output discordant read pairs to this file. [no discordant file output]

-s --splitterFile FILE Output split reads to this file abiding by paramaters below. [no splitter file output]

-u --unmappedFile FILE Output unmapped/clipped reads as FASTQ to this file abiding by parameters below. [no unmapped file output].

Requires soft clipping in input file. Will output FASTQ if QUAL information available, otherwise FASTA.

Other Options:

-a --acceptDupMarks Accept duplicate marks already in input file instead of looking for duplicates in the input.

-e --excludeDups Exclude reads marked as duplicates from discordant, splitter, and/or unmapped file.

-r --removeDups Remove duplicates reads from all output files. (Implies --excludeDups).

--addMateTags Add MC and MQ tags to all output paired-end SAM lines.

--ignoreUnmated Suppress abort on unmated alignments. Use only when sure input is read-id grouped,

and either paired-end alignments have been filtered or the input file contains singleton reads.

-M Run in compatibility mode; both 0x100 and 0x800 are considered chimeric. Similar to BWA MEM -M option.

--maxReadLength INT Maximum allowed length of the SEQ/QUAL string in the input file. [500]

Primarily useful for marking duplicates in files containing singleton long reads.

--maxSplitCount INT Maximum number of split alignments for a read to be included in splitter file. [2]

--maxUnmappedBases INT Maximum number of un-aligned bases between two alignments to be included in splitter file. [50]

--minIndelSize INT Minimum structural variant feature size for split alignments to be included in splitter file. [50]

--minNonOverlap INT Minimum non-overlaping base pairs between two alignments for a read to be included in splitter file. [20]

--minClipSize INT Minumum number of bases a mapped read must be clipped to be included in unmapped file. [20]

-q --quiet Output fewer statistics.

bwa memのアライメントの過程でduplicationにタグをつける。

bwa index -a is input.fa
bwa mem -t 12 -R "@RG\tID:X\tLB:Y\tSM:Z\tPL:ILLUMINA" input.fa *.fastq | samblaster |samtools view -@ 12 -Sb - |samtools sort -@ 12 - > samp.sorted.bam

discordant-readとsplit-readは別ファイルに出力。

bwa mem -t 12 -R "@RG\tID:X\tLB:Y\tSM:Z\tPL:ILLUMINA" input.fa *.fastq | samblaster -e -d samp.disc.sam -s samp.split.sam > output.sam

-e　Exclude reads marked as duplicates from discordant, splitter, and/or unmapped file.
-d　FILE Output discordant read pairs to this file. [no discordant file output]
-s　FILE Output split reads to this file abiding by paramaters below. [no splitter file output]

duplication-readは全出力から除く。

bwa mem -t 12 -R "@RG\tID:X\tLB:Y\tSM:Z\tPL:ILLUMINA" input.fa *.fastq | samblaster -r -e -d samp.disc.sam -s samp.split.sam > output.sam

-r　Remove duplicates reads from all output files.

注意；bwa memで-M（mark shorter split hits as secondary）をつけている時は、samblasterにも-Mをつけてランを行う。

BamDeal（紹介）

BamDeal convert

bam2soap - bam/sam => SOAP bam

BamDeal convert bam2soap -InFile in.bam -OutPut out_SOAP.bam

soap2bam - SOAP bam => bam/sam

BamDeal convert soap2bam -InSoap in_SOAP.bam -OutBam out.bam -Dict Ref.fa

bam2fq - bam/sam => fastq

BamDeal convert bam2fq -i in.bam -o out.fq
=> out.fq.gzが出力される

bam2fa - bam/sam => fasta

BamDeal convert bam2fa -i in.bam -o out.fa
=> out.fa.gzが出力される

BamDeal modify

bamFilter -低クオリティなリードをフィルタリング

#Q15以下のリード、30bp以下のリード、duplicateリードをフィルタリング
BamDeal modify bamFilter -i in.bam -o out.bam -q 15 -l 30 -d

-q the quality to filter reads, default [15]
-l the length to filter reads, default [30]
-s the beginning of interval containing the 1-based leftmost mapping position of first matching base, default [0]
-e the end of interval containing the 1-based leftmost mapping position of first matching base, default [1e9]
-c specify the chromosome to output, default [all chromosomes]
-d remove the duplicate read

bamSplit - クロモソームごとにbamを分離

mkdir output_dir
BamDeal modify bamSplit -i in.bam -o output_dir -q 0

-r reset output files headers by remove the chromosomes not in the output files

bamAssign - ユーザー指定の組み合わせでbamを分離（詳細はBamDeal modify bamAssign -h参照）

mkdir out_dir
BamDeal modify bamAssign -l list -o output_dir

-a list indicating how to assign chromosomes to outputs
-r reset output files headers by remove the chromosomes not in the output files

bamCat - bamをマージ

bamCat -i A.bam B.bam -o merged.bam -s

-s output sort bam file when all inputs were sorted

bamRand - bamをランダムサンプリング

#10 percent
BamDeal modify bamRand -i in.bam -p 0.1 -o out.bam

-p probability with which each read would be outputed, default [0.1]
-s random seed, default [time]

bamSubChr - bamからリストで指定したchrを削除、または追加

#リストのchrを除去
bamSubChr -i in.bam -d delete.list -o out.bam -r

#リストのchrを追加
bamSubChr -i in.bam -k keep.list -o out.bam -r

#unmapのリードを除去
bamSubChr -i in.bam> -o AAA -u

-k list of chromosomes to be kept
-d list of chromosomes to be deleted
-u remove unmapped reads
-r reset output headers by remove the chr(s) not in the out files

bamShiftQ - Phred qualityのタイプを修正

#リードのPhred qualityをASCII33に修正して出力
bamShiftQ -i in.bam -o out.bam -p 1

#リードのPhred qualityをASCII64に修正して出力
bamShiftQ -i in.bam -o out.bam -p 2

-p phred quality in output BAM, [1]: ASCII+33 or [2]: ASCII+64, default [1]
-q the quality to filter reads, default [10]
-l the length to filter reads, default [30]

bamLimit - 大きなbamを分割

#最大1000000リードずつ分割
mkdir out_dir
BamDeal modify bamLimit -i in.bam -o out_dir/ -n 1000000

-n max read number for each bam[1000000000]

BamDeal statistics

Coverage - Coverage/Depth/GC分布を調べる

BamDeal statistics Coverage -i in.bam -r  ref.fasta -o outprefix

-r input reference FASTA to get Depth-GC wig
-w windows size for Depth-GC wig, default [10000]
-b list of the regions of which the coverage and mean of depth would be given
-d Filter the duplicated read

BasesCount - 全ポジションのATGC全てのデプスカウント

#3つのbamを調べる。q10以上
BamDeal statistics BasesCount -i A.bam B.bam C.bam -o outprefix -q 10

DeteCNV - CNV/Deletionを検出

BamDeal statistics DeteCNV -i A.bam B.bam -r ref.fasta -m 1000 -o outprefix

-f <float> depthRatio to judge breakpoint of merge adjacent[0.45]
-c for each chromosome, use its own mean of depth into calculation default would use the mean of depth of the whole genome
-m <int> set the minimum length of CNV, default [1800]
-p <float> p-value of CNV depth bias, default [0.02]

DeteSV - ペアエンドのインサートサイズ情報を使ってSVを検出

BamDeal statistics DeteSV -i in.bam -r ref.fasta -o outprefix -m 1800

LowDepth - low depthの領域を検出

BamDeal statistics LowDepth -i in.bam -o out.bed -q 10 -s 1000

-o output bed region file
-x set the minimum value of low depth,default[2]
-s the length to filter short region, default [1000]

BamDeal visualize

StatQC - クオリティレポート出力（Rのggplot2とreshapeパッケージが必要*1）

BamDeal visualize StatQC -i in.bam -o outdir

#複数bamをlist指定
 BamDeal visualize StatQC -i in.bam -o outdir

分析結果のテキストとPDFが出力される。

DepthCov - リードデプス出力

BamDeal visualize DepthCov -i in.bam -o out

DepthCov - リードデプス-GCプロット出力

BamDeal visualize DepthCov -i in.bam -r ref.fasta -o outprefix -k -q 10

DepthSlide - ゲノムのchr、ポジション順のマンハッタンプロット出力

BamDeal visualize DepthSlide -i in.bam -r ref.fasta -o outprefix

Alfred（紹介）

BAM統計、フィーチャーカウント、フィーチャーアノテーションのツール。

インストール

#bioconda  (errorはこちらを参照)
mamba install -c bioconda alfred -y

リファレンスのFASTAと.bamを指定する。

alfred qc -r ref.fa -o qc.tsv.gz input.bam

#結果の可視化
#なければ本体をcloneして取ってくる 
git clone https://github.com/tobiasrausch/alfred.git 
#可視化
Rscript alfred/R/stats.R qc.tsv.gz

出力例（direct link）。

bamM（紹介）

インストール

#bioconda 
mamba create -n bamM -y python=2.7
conda activate bamM
mamba install -c bioconda bamm -y

# bamm

...::: BamM :::...

Working with the BAM, not against it...

-------------------------------------------------------------------------

version: 1.7.3

-------------------------------------------------------------------------

bamm make -> Make BAM/TAM files (sorted + indexed)

bamm parse -> Get coverage profiles / linking reads / insert types

bamm extract -> Extract reads / headers from BAM files

bamm filter -> Filter BAM file reads

USE: bamm OPTION -h to see detailed options

bamm make mappingしてbamを作成

ペアエンド、シングルエンド、インターリーブのfastqをそれぞれ指定する。20スレッド指定する。unmapのリードは出力しない（default）。

bamm make -d ref.fasta -c pair_1.fastq.gz pair_2.fastq.gz \
 -s unpaired.fastq.gz -i interleave.fq.gz \
 -p mapped -t 20 --alignment_algorithm mem -f

bamが出力される。

bamm parse カバレッジやインサートサイズ計算

複数のbamのカバレッジ、インサートサイズを出力する。

bamm parse -c covs.tsv -l links.tsv -i inserts.tsv -t 8 -m pmean \
 -b f1.bam f2.bam f3.bam

extract

bamからリードを出力する。出力はgzip圧縮される。

bamm extract -g ref.fasta -b mapped.bam -t 8

ペアエンドのbamを指定した場合、ペアエンドfastq、ペアエンドfastqでsingletonになったfastqの３つのファイルが出力される。ペアエンドfastqの順番は同期されている。

KneadData（紹介）

メタゲノムからヒトゲノムなどホストゲノムの汚染を除く作業を自動で行う。有名なhuttenhowerラボのツール。

（初回のみ）初めにリファレンスゲノムのインデックスを作成する。ヒトやマウスのゲノムならKneadDataのコマンドからbowite2のindex作成済みゲノムをダウンロードできる。

mkdir genome
kneaddata_database --download human_genome bowtie2 genome/ #3.44GBある

（初回のみ）indexを作成

bowtie2-build Homo_sapiens.fasta -o Homo_sapiens_db test
#作成したindexをディレクトリに収納する
mkdir genome_index
mv test* genome_index/

作成したindexのフォルダを指定してランする。

#single
kneaddata --i seq.fastq --reference-db genome_index/ --output kneaddata_output -t 20

#pair
kneaddata --i seq1.fastq --i seq2.fastq -db genome_index/ --output kneaddata_output -t 20

-t　number of threads [ Default : 1 ]
-i　input FASTQ file (add a second argument instance to run with paired input files)
-o　directory to write output files
-d　location of reference database (additional arguments add databases)

Trimmomaticがないと言われたら --trimmomatic フラグをつけて trimmomaticのディレクトリを指定する。面倒ならwgetでダウンロードしておいておけばいい。

wget http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/Trimmomatic-0.33.zip
unzip Trimmomatic-0.33.zip

#trimmomaticを指定してラン
kneaddata --input seq1.fastq --input seq2.fastq -db genome_index/ --output kneaddata_output -t 20 --trimmomatic Trimmomatic-0.33/

fastqp（紹介）

fastqp input.bam

複数のmetricsで分析が行われる。

BAMStats

bamのアライメント情報をGUIで確認できるツール。samtools statコマンドを利用してる。Sourceforgeからビルド済みのJAVA実行ファイルをダウンロードする。

http://bamstats.sourceforge.net

java -Xmx6g -jar BAMStats-GUI-1.25.jarをコマンドラインで叩くか、このファイルをダブルクリックするとアプリが起動する。

GUI版では下のような情報を見ることができる。

このように、平均や中央値でなく、分布でリードのアライメント状況を見ることができる。 bam/samを追加して複数データを同時比較することも可能である。

BAMStatsはコマンドライン版もある。ランするには以下のように打つ。

java -Xmx4g -jar BAMStats-1.25.jar -i input.bam

-iでbam/samを指定する。

samstat

公式サイト SAMStat

fastq、sam、bamなどを分析できる。インストールは

tar -zxcf samstat.tgz 
cd samstat
./configure
make clean 
make

ビルドできたらsrc/のsamstatにパスを通す。

fastqの分析

samstat input.fq

~.htmlが出力される。

f:id:kazumaxneo:20170620183443j:plain

f:id:kazumaxneo:20170620183455j:plain

sam/bamの分析

samstat input.bam

~.htmlが出力される。

f:id:kazumaxneo:20170620183839j:plain

ショートリード向けに設計されているので、ロングリードを使うとエラーになる。

bcftools

非常にたくさんのコマンドがある。よく使うだろうコマンドを紹介する。

インストール

#homebrew
brew install bcftools

#bioconda
mamba install -c bioconda  bcftools -y

# bcftools

Program: bcftools (Tools for variant calling and manipulating VCFs and BCFs)

Version: 1.9 (using htslib 1.9)

Usage: bcftools [--version|--version-only] [--help] <command> <argument>

Commands:

-- Indexing

index index VCF/BCF files

-- VCF/BCF manipulation

annotate annotate and edit VCF/BCF files

concat concatenate VCF/BCF files from the same set of samples

convert convert VCF/BCF files to different formats and back

isec intersections of VCF/BCF files

merge merge VCF/BCF files files from non-overlapping sample sets

norm left-align and normalize indels

plugin user-defined plugins

query transform VCF/BCF into user-defined formats

reheader modify VCF/BCF header, change sample names

sort sort VCF/BCF file

view VCF/BCF conversion, view, subset and filter VCF/BCF files

-- VCF/BCF analysis

call SNP/indel calling

consensus create consensus sequence by applying VCF variants

cnv HMM CNV calling

csq call variation consequences

filter filter VCF/BCF files using fixed thresholds

gtcheck check sample concordance, detect sample swaps and contamination

mpileup multi-way pileup producing genotype likelihoods

roh identify runs of autozygosity (HMM)

stats produce VCF/BCF stats

Most commands accept VCF, bgzipped VCF, and BCF with the file type detected

automatically even when streaming from a pipe. Indexed VCF and BCF will work

in all situations. Un-indexed VCF and BCF and streams will work in most but

not all situations.

Consensus fasta

minimap2を使いfastqをリファレンスにmappingしてbamを作成、そのbamからリファレンスと異なる部位を修正したコンセンサス配列を作成する。

ここではnanoporeのlong readを使っている。

#mapping to the ref
minimap2 -t 8 -ax map-ont ref.fa long.fq \
|samtools sort -@ 8 -O BAM - > mapping.bam \
&& samtools index -@ 8 mapping.bam

#pileup and consensus calling
samtools mpileup -C50 -uf ref.fa mapping.bam | bcftools call -c -Oz -o calls.vcf.gz

#indexing
bcftools index calls.vcf.gz 

#export consensus fasta
cat ref.fa | bcftools consensus calls.vcf.gz > consensus.fa

samtools mpileup

-C, --adjust-MQ INT adjust mapping quality; recommended:50, disable:0 [0]
-f faidx indexed reference sequence file

bcftools call

-c, --consensus-caller the original calling method (conflicts with -m)
-o, --output <file> write output to a file [standard output]
-O, --output-type <b|u|z|v> output type: 'b' compressed BCF; 'u' uncompressed BCF; 'z' compressed VCF; 'v' uncompressed VCF [v]

fass

インストール

#bioconda
mamba install -c bioconda bamutil -y

$ bam -h

Set of tools for operating on SAM/BAM files.

Version: 1.0.14; Built: Tue Jun 26 23:14:05 PDT 2018 by distiller

Tools to Rewrite SAM/BAM Files:

convert - Convert SAM/BAM to SAM/BAM

writeRegion - Write a file with reads in the specified region and/or have the specified read name

splitChromosome - Split BAM by Chromosome

splitBam - Split a BAM file into multiple BAM files based on ReadGroup

findCigars - Output just the reads that contain any of the specified CIGAR operations.

Tools to Modify & write SAM/BAM Files:

clipOverlap - Clip overlapping read pairs in a SAM/BAM File already sorted by Coordinate or ReadName

filter - Filter reads by clipping ends with too high of a mismatch percentage and by marking reads unmapped if the quality of mismatches is too high

revert - Revert SAM/BAM replacing the specified fields with their previous values (if known) and removes specified tags

squeeze - reduces files size by dropping OQ fields, duplicates, & specified tags, using '=' when a base matches the reference, binning quality scores, and replacing readNames with unique integers

trimBam - Trim the ends of reads in a SAM/BAM file changing read ends to 'N' and quality to '!' or softclipping the ends (resulting file will not be sorted)

mergeBam - merge multiple BAMs and headers appending ReadGroupIDs if necessary

polishBam - adds/updates header lines & adds the RG tag to each record

dedup - Mark Duplicates

dedup_LowMem - Mark Duplicates using only a little memory

recab - Recalibrate

Informational Tools

validate - Validate a SAM/BAM File

diff - Diff 2 coordinate sorted SAM/BAM files.

stats - Stats a SAM/BAM File

gapInfo - Print information on the gap between read pairs in a SAM/BAM File.

Tools to Print Information In Readable Format

dumpHeader - Print SAM/BAM Header

dumpRefInfo - Print SAM/BAM Reference Name Information

dumpIndex - Print BAM Index File in English

readReference - Print the reference string for the specified region

explainFlags - Describe flags

Additional Tools

bam2FastQ - Convert the specified BAM file to fastQs.

Dummy/Example Tools

readIndexedBam - Read Indexed BAM By Reference and write it from reference id -1 to maxRefId

Usage:

bam <tool> [<tool arguments>]

The usage for each tool is described by specifying the tool with no arguments.

fastqの抽出

bam bam2FastQ --in input.bam --outBase outprefix

ペアエンドは同期されて出力される。同期できなかったリードはシングルで出力される。

NGSUtils - bamutils

#bioconda (link) linux （テストした時は導入できなかった）
mamba create -n ngsutils -y python=2.7
conda activate ngsutils
mamba install -c bioconda ngsutils -y

keepbest（リンク） best mappingを抽出。

bamutils keepbest input.bam output.bam

export（リンク） 指定した情報を全リードから出力

bamutils export -name input.bam > output #リード名を出力

bamutils export -ref input.bam > output #mappingされたクロモソーム名を出力

expressed（リンク） マッピングされた領域をBED形式で出力。（samtools indexでbam.baiファイルを準備しておく必要がある）

bamutils expressed input.bam > output.bed

filter（リンク） 指定した条件を満たすリードだけbamで出力。

#chr1にマッピングされており、リード長が100以上でペアエンドが-> <-の向きに両方マップングされているリードだけbamで出力。
bamutils filter input.bam output.bam -minlen 100 -properpair -includeref chr1

merge（リンク） bamをマージ（best matchを選択してくる）。

bamutils merge output.bam input1.bam input2.bam

pair（リンク）別々にマッピングされているペアエンドについて、指定条件内でペアを再度調べる。

bamutils pair output.bam input1.bam input2.bam

removevlipping（リンク）hard-clipされた領域を捨てる。

bamutils removeclipping input.bam output.bam

renamepair（リンク）一部アライナーはペアエンドのリード名に/1や/2をつけてbsmを作り、これが下流の解析でエラーになる。これを消す（ペアエンドは同じリード名になる）。

bamutils renamepair input.bam output.bam

tag（リンク）tagをbamに追加する。

bamutils tag -xs input.bam output.bam

check（リンク）破損のチェック。

bamutils check input.bam

cleancigar（リンク）CIGAR情報の修復。

bamutils cleancigar  input.bam

stats（リンク）bamを分析。

bamutils stats input.bam

$ bamutils stats input.bam

Calculating Read stats...

Done! (0:36)

out.bam

Mapped: 1083286

Unmapped: 0

Flag distribution

[0x001] Multiple fragments 1083286 100.00%

[0x002] All fragments aligned 888871 82.05%

[0x008] Next unmapped 58635 5.41%

[0x010] Reverse complimented 542299 50.06%

[0x020] Next reverse complimented 539537 49.81%

[0x040] First fragment 1083286 100.00%

[0x800] Supplementary 1 0.00%

Template length: 568.90 +/- 824.04

Reference counts count

1 1083286

bedファイルを出力するにはbedtoolsを使う。

bamをbedに変換する(bed)。

bedtools bamtobed -i reads.bam > reads.bed

bedtoolsについては以下にまとめています。

bam/samが増えてきていちいちコマンドを実行するのが面倒になったら、スクリプトを組む事を考える。ただし簡単な作業ならparallelを使い並列化した方が簡単で高速な場合も多い。

例えば特定のディレクトリにbamが複数あり、それぞれについてpicardを使いマッピングスコアを分析する。

find *bam | parallel 'Picard QualityScoreDistribution INPUT={} OUTPUT={}.quality_score_dist.tsv CHART_OUTPUT={}.quality_score_dist.pdf'

bamそれぞれについてpdfファイルとtsvファイルが出力される。

findでbamファイルを全て拾い、parallelコマンドに渡している。parallelコマンドが' 'で囲まれた部位を並列実行する。ただし負荷はファイル分増加するので気をつけること。

fasta、fastqなどを扱うなら、seqkitが高速で使い易いと思います。ランダムサインプリング、逆相補鎖に変換、ソート、重複の除去、指定した範囲だけ抽出など、なんでもできます。

以下にまとめています。

引用

BCFtools/RoH: a hidden Markov model approach for detecting autozygosity from next-generation sequencing data.

Narasimhan V1, Danecek P1, Scally A2, Xue Y1, Tyler-Smith C1, Durbin R1.

Bioinformatics. 2016 Jun 1;32(11):1749-51. doi: 10.1093/bioinformatics/btw044. Epub 2016 Jan 30.

https://www.ncbi.nlm.nih.gov/pubmed/26826718

The Sequence Alignment/Map format and SAMtools.

Li H1, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R; 1000 Genome Project Data Processing Subgroup.

Bioinformatics. 2009 Aug 15;25(16):2078-9. doi: 10.1093/bioinformatics/btp352. Epub 2009 Jun 8.

https://www.ncbi.nlm.nih.gov/pubmed/19505943

SAMStat: monitoring biases in next generation sequencing data.

Lassmann T1, Hayashizaki Y, Daub CO.

Bioinformatics. 2011 Jan 1;27(1):130-1. doi: 10.1093/bioinformatics/btq614. Epub 2010 Nov 18.

https://www.ncbi.nlm.nih.gov/pubmed/21088025

BamTools: a C++ API and toolkit for analyzing and managing BAM files.

Barnett DW1, Garrison EK, Quinlan AR, Strömberg MP, Marth GT.

Bioinformatics. 2011 Jun 15;27(12):1691-2. doi: 10.1093/bioinformatics/btr174. Epub 2011 Apr 14.

BamTools: a C++ API and toolkit for analyzing and managing BAM files. - PubMed - NCBI

VariantBam: filtering and profiling of next-generational sequencing data using region-specific rules.

Wala J, Zhang CZ, Meyerson M, Beroukhim R

Bioinformatics. 2016 Jul 1;32(13):2029-31.

biobambam: tools for read pair collation based algorithms on BAM files

German Tischler, Steven Leonard

Source Code Biol Med. 2014; 9: 13.

elPrep: High-Performance Preparation of Sequence Alignment/Map Files for Variant Calling.

Herzeel C, Costanza P, Decap D, Fostier J, Reumers J.

PLoS One. 2015 Jul 16;10(7):e0132868.

parallel

samtools

http://davetang.org/wiki/tiki-index.php?page=SAMTools

bamのsplit

https://www.biostars.org/p/46327/

Biostars

https://www.biostars.org/p/247903/

Tip and tricks for BAM files

https://github.com/IARCbioinfo/BAM-tricks

参考

このまま実行しても、大半のコンセンサスはnになってしまうと思います。flagで取り出すリードを調節したり、エラーコレクションを挟んだりする必要があるかなと思います。

2017-06-19

contigからscaffoldを作るツールの検証

assembly simulator

2020 10/4 コメント追加

アセンブルして作ったcontigをペアリード情報やロングリード情報を使って統合し、Scaffoldを作るツールがいくつか発表されている。代表的なものをインストールしてテストした。

ツールのインストール

テストには２種のバクテリア、Nostoc pcc 7120(Taxonomy Browser)とSynechocystis sp. PCC6803 （Taxonomy Browser）のクロモソーム配列とプラスミド配列を使う。Nostocのゲノムサイズは約6.4Mbで、Synechocystisのゲノムサイズは約3.6Mbである。プラスミド配列はSynechocystisのみ含めた。

配列の準備

artを使いillumina Miseqのリードをシミューレート。250bpx2、デプス100のデータを作る。以下ではNostocの流れを書いているが、Synechocystisもパラメータは同じで作った。

art_illumina -ss MSv3 -sam -p -i Nostoc_spPCC7120.fa  -l 250 -f 100 -m 600 -s 100 -o Nostoc_250_paired_

Nostoc_250_paired_1.fqとNostoc_250_paired_2.fqができる。

PBSIMを使いPacbioのリードをシミューレート。CLRでデプスx30のデータを作る。

pbsim --data-type CLR --depth 30 --model_qc PBSIM-PacBio-Simulator-master/data/model_qc_clr Nostoc_spPCC7120.fa

sd_001.fqができる。

アセンブル

３つの方法アセンブルを試みた。

１、illumina Miseqのペアリードをアセンブル (spadesを使用)。

spades.py --careful -k auto -t 20 -1 Nostoc_250_paired_1.fq -2 Nostoc_250_paired_2.fq -o miseq

f:id:kazumaxneo:20170618115415j:plain

Nostocシーケンスデータのアセンブル結果。58 contigある。縦軸はcontigの長さ(bp)。

f:id:kazumaxneo:20170619181019j:plain

Synechocystisシーケンスデータのアセンブル結果。81 contigある。縦軸はcontigの長さ(bp)。

２、spadesでアセンブルされたcontigを、pacbioのロングリードを使ってスキャッホールド化 (SSPACE-LONGREADを使用)。

perl SSPACE-LongRead.pl -c scaffolds.fasta -p sd_001.fq -b output

Nostocは1本の配列にアセンブルされた。縦軸はcontigの長さ(bp)。

f:id:kazumaxneo:20170619182626j:plain

Nostoc contigデータのスキャッホールド化結果。縦軸はcontigの長さ(bp)。

f:id:kazumaxneo:20170619183956j:plain Synechocystisシーケンスデータのスキャッホールド化結果。 33contigある。縦軸はcontigの長さ(bp)。

３、illumina MiseqのペアリードとPacbioのロングリードを使いハイブリッドアセンブル (spadesを使用)。

spades.py --careful -k auto -t 20 -1 Nostoc_250_paired_1.fq -2 Nostoc_250_paired_2.fq --pacbiosd_001.fq -o hybrid

hybrid_assemblyでもNostocは1本の配列にアセンブルされた。

f:id:kazumaxneo:20170619181636j:plain

Nostocシーケンスデータのアセンブル結果。縦軸はcontigの長さ(bp)。

f:id:kazumaxneo:20170619181446j:plain

Synechocystisシーケンスデータのアセンブル結果。10 contigある。縦軸はcontigの長さ(bp)。

NostocはSSPACEとspadesのhybrid-assemblyでscaffold数に差はなかった。Synechocystisは、SSPACEよりspadesの方がscaffold数は減り、長く伸びた。具体的にはspadesのhybrid-assemblyだとプラスミドを７つ全てがそれぞれ１本にアセンブルされた。クロモソームは３つのscaffoldまでアセンブルされた。

scaffoldの評価

最後にアセンブルされたscaffoldsを評価する。QUASTのwebサーバー版を使う。

http://quast.bioinf.spbau.ru

試した日はサーバーの調子が悪いみたいでジョブが終わらない。quastのローカル版で評価する。

./quast.py spades_scaffolds.fasta SSPACE_LONG_scaffolds.fasta hybrid_scaffolds.fasta   -R Nostoc_spPCC7120.fa

lessで結果を表示。

less quast_results/latest/report.txt

f:id:kazumaxneo:20170618124628j:plain

このような結果となった。

アセンブリにエラーがないかも確認する。

python quast.py scaffolds.fasta SSPACE_LONG_scaffolds.fasta hybrid_scaffolds.fasta -R Nostoc_spPCC7120.fa

quast_results/の最新のフォルダ(エイリアス latest)に移動し、report.htmlをsafariで開く。

f:id:kazumaxneo:20170618125307j:plain

リンクのView in Icarus contig browserをクリック。

f:id:kazumaxneo:20170618125313j:plain

ここではシミュレートしたリードを使っているため連続性の高い配列を作成できています。しかし実際のシークエンシングでは、短い配列、エラーが多い配列、場合によっては汚染DNAの配列なども読まれてきます。ショートリードもロングリードもあらかじめクオリティフィルタリングしておくことが重要になります。ロングリードについては、短すぎる配列を除くことも有効です（小さなプラスミドのリード情報を脱落させる可能性もあるので注意する）。並行して汚染を除く操作が必要な場合もあります。

2017-06-19

フォーマット変換 Fastq=> Fasta

format conversion fasta/fastqの操作

awkのコマンドで一発でできる。

awk '(NR - 1) % 4 < 2' test.fq | sed 's/@/>/' > test.fa

または、embossのseqretコマンドでも同じことができる。seqretコマンドは別に紹介しています。

2017-06-18

Prokaryotesのアノテーションツール Prokka

annotation bacteria 2014 Bioinformatics docker coding region HMM CRISPR AMR transposon bacterial annotation

2018 10/6 タイトル修正

2019 説明修正、dockerリンク追加、インストール方法追加、dockerリンク修正、コマンド修正、help追加、タイトル修正、インストールの説明の誤り修正、バージョンアップ追記、crisper、IS、AMR tag追加、誤字修正、dockerを使う例を追記

2020 4/19 multiqcとの連携例、4/28 コマンド例追記、10/ 5 help更新、10/20 追記、12/28 コマンド例追加

2021 5/9 condaのインストール手順アップデート、 9/25 インストール手順修正

2023/01/14 mt追記、07/04 virus追記, 10/19 追記

Prokkaは、バクテリア、アーキア、ウィルスのアノテーションツール。はじめにblast+でcore geneを特定し、それからHMMER3を使ってより高感度かつ精度の高いコード領域の特定が行われる。

Prokka passed 10,000 citations on Google Scholar this week. It's mind-blowing to think it has annotated millions of bacterial genomes over the last 8 years. Thanks to everyone for your support and feedback. 💚🦠🧬💻 pic.twitter.com/kKfKs1vjAe
— Torsten Seemann (@torstenseemann) December 26, 2022

バージョン

Releases · tseemann/prokka · GitHub

インストール

ubuntu18.0.4のpython3.9環境(mambaforge3.9)にてテストした。

本体　Github

conda、またはbrewで導入する。

#bioconda (link) (python3.8でもインストール可能) ここでは高速なmambaを使用
mamba install -c conda-forge -c bioconda -c defaults prokka -y

#python3のコードだが、condaで入れる際に依存するperlライブラリがなぜか導入されず。prokkaを動かせない場合がある。試した際は3.8で失敗し、3.6では全てのライブラリが正常に導入された。
mamba create -n prokka
conda activate prokka
mamba install -c conda-forge -c bioconda -c defaults prokka -y

#エラーが起こったので導入方法を変更（ubuntu18のdockerイメージでは正しく導入された）
mamba create -n prokka_env python=3.7 -y
conda activate prokka_env
mamba install -c bioconda -y perl-bioperl==1.7.2
mamba install -c bioconda prokka -y

#homebrew
brew install prokka

#perlライブラリが入らなければcpanmで導入 -Lでパス指定
sudo cpan Time::Piece XML::Simple Digest::MD5 Bio::Perl

#インストールができない場合、gitで最新版をダウンロードする方法が推奨されている
git clone https://github.com/tseemann/prokka.git 
cd prokka/ #ダウンロードしたprokka/に移動
bin/prokka --setupdb #Index the sequence databasesをする

dockerイメージ

https://hub.docker.com/r/staphb/prokka/

docker pull staphb/prokka:latest

> prokka -h

# prokka

Name:

Prokka 1.14.5 by Torsten Seemann <torsten.seemann@gmail.com>

Synopsis:

rapid bacterial genome annotation

Usage:

prokka [options] <contigs.fasta>

General:

--help This help

--version Print version and exit

--citation Print citation for referencing Prokka

--quiet No screen output (default OFF)

--debug Debug mode: keep all temporary files (default OFF)

Setup:

--dbdir [X] Prokka database root folders (default '/prokka-1.14.5/db')

--listdb List all configured databases

--setupdb Index all installed databases

--cleandb Remove all database indices

--depends List all software dependencies

Outputs:

--outdir [X] Output folder [auto] (default '')

--force Force overwriting existing output folder (default OFF)

--prefix [X] Filename output prefix [auto] (default '')

--addgenes Add 'gene' features for each 'CDS' feature (default OFF)

--addmrna Add 'mRNA' features for each 'CDS' feature (default OFF)

--locustag [X] Locus tag prefix [auto] (default '')

--increment [N] Locus tag counter increment (default '1')

--gffver [N] GFF version (default '3')

--compliant Force Genbank/ENA/DDJB compliance: --addgenes --mincontiglen 200 --centre XXX (default OFF)

--centre [X] Sequencing centre ID. (default '')

--accver [N] Version to put in Genbank file (default '1')

Organism details:

--genus [X] Genus name (default 'Genus')

--species [X] Species name (default 'species')

--strain [X] Strain name (default 'strain')

--plasmid [X] Plasmid name or identifier (default '')

Annotations:

--kingdom [X] Annotation mode: Archaea|Bacteria|Mitochondria|Viruses (default 'Bacteria')

--gcode [N] Genetic code / Translation table (set if --kingdom is set) (default '0')

--prodigaltf [X] Prodigal training file (default '')

--gram [X] Gram: -/neg +/pos (default '')

--usegenus Use genus-specific BLAST databases (needs --genus) (default OFF)

--proteins [X] FASTA or GBK file to use as 1st priority (default '')

--hmms [X] Trusted HMM to first annotate from (default '')

--metagenome Improve gene predictions for highly fragmented genomes (default OFF)

--rawproduct Do not clean up /product annotation (default OFF)

--cdsrnaolap Allow [tr]RNA to overlap CDS (default OFF)

Matching:

--evalue [n.n] Similarity e-value cut-off (default '1e-09')

--coverage [n.n] Minimum coverage on query protein (default '80')

Computation:

--cpus [N] Number of CPUs to use [0=all] (default '8')

--fast Fast mode - only use basic BLASTP databases (default OFF)

--noanno For CDS just set /product="unannotated protein" (default OFF)

--mincontiglen [N] Minimum contig size [NCBI needs 200] (default '1')

--rfam Enable searching for ncRNAs with Infernal+Rfam (SLOW!) (default '0')

--norrna Don't run rRNA search (default OFF)

--notrna Don't run tRNA search (default OFF)

--rnammer Prefer RNAmmer over Barrnap for rRNA prediction (default OFF)

実行方法

アセブルして作ったscaffodls.fastaやFinished genomeをアノテーションする（*1）。

prokka genome.fa -o prokka_output

ランが終わるとoutputディレクトリの中にgtf、gff、faa、fnn、gbfなどができる。gbfがgenbank形式のファイルになる。

f:id:kazumaxneo:20190706172830j:plain

データベースに登録するためのオプションや、Genus特異的に調べるオプションもある。またローカルにすでにたくさんのgenbankファイルを持っている人なら、それをデータベースにして、精度の高いアノテーション解析を行うことができる。詳細は公式マニュアルを確認してください。

出力ファイル名のprefixを指定、デフォルトはprokka+日付だがEcoliに変更。

prokka genome.fa -o prokka_output --prefix Ecoli

locus tag名をJ001に変更。

prokka genome.fa -o prokka_output --prefix Ecoli --locustag J001

f:id:kazumaxneo:20201227233742p:plain

あとでアノテーションを追加することを想定して、番号を1ずつ増加から５きざみに変更。

prokka genome.fa -o prokka_output --prefix Ecoli --locustag J001 --increment 5

f:id:kazumaxneo:20201227234101p:plain

上記に加え、GFFのバージョンは3、アノテーションバージョンは1、センター名はcenter_of_genomicsを追加。また、デフォルトのCDSに加えてgene（--addgenes）と mRNA（--addmrna）をアノテーションフィーチャーとして追加。

prokka genome.fa -o prokka_output --prefix Ecoli --locustag J001 --increment 5 --gffver 3 --centre center_of_genomics --addgenes --addmrna

f:id:kazumaxneo:20201227234658p:plain

metagenome（binningしていない複数ゲノムが混ざった配列、binning後の品質の高いMAGなら--metagenomeは付けない）

prokka assembly.fasta -o prokka_output --metagenome --cpus 16

--metagenome Improve gene predictions for highly fragmented genomes (default OFF)

ミトコンドリアゲノム。codon tableを変更するか（参考）、kingdomオプションをつける。あとでアノテーションを追加することを想定して、番号を1ずつ増加から５きざみに変更。

prokka genome.fa -o prokka_output --prefix mitogenome --locustag mt --increment 5 --kingdom Mitochondria

コメント；種によってはイントロンが存在するミトコンドリア遺伝子もあり、そのような遺伝子ではexon-intron境界がずれていたり、一部のexonしか予測できなかったりする。rRNAにも遺伝子とは別のタイプのintronが存在することがあり、rRNAも注意が必要（rRNAなら近縁種とのblastn検索に基づく手動アノテーションがより正確）。あくまでprokkaはイントロンがない原核生物向けの遺伝子予測ツールであり、遺伝子予測のhintにしかならないことに注意したい。

prokkaの代わりに真核生物向けのab initio遺伝子予測手法が有効ではないかと考えるかもしれない。しかし、ab initio予測ツールの訓練にはintronを持つある程度たくさんの遺伝子が必要であり、サイズが小さいmtで十分な訓練ができるかどうかは疑わしい。また、自動アノテーションを行うwebサーバーもいくつか公開されているが、モデル生物から遠い場合、精度が低い可能性がある。これらを考慮すると、mtゲノムの決まった種から進化的に遠いmitogenomeのアノテーションでは、保存されたタンパク質のマッピングなどに基づくマニュアルのアノテーションが早いかもしれない。手動で進めるならこちらが役立つ。RNA-seqのイントロンを考慮したマッピング結果もヒントの１つに使える。ESTやTrinityのassembly（TransDecorderで選抜した配列）はgmap（紹介）などでアラインできる。保存されたタンパク質のマッピングにはexonerateなどが使える。prokkaの遺伝子モデルも参考になる。既存の分類群から遠いウィルスも同様と思われる。

ウィルスゲノム。

prokka genome.fa -o prokka_output --prefix prefix --locustag LOCUS --increment 5 --kingdom Viruses

1、ゲノムサイズが小さいと訓練が不十分になり、精度が上がらない可能性がある。アノテーション後、ORFが予測されていない領域をORF finderでチェックする。予測されたタンパク質産物はBLASTpサーチしてウィルスで予測されるタンパク質が出てくるのかなど確認する。後で十分な品質チェックを行うなら利用可能と考えられる。

２、環状アセンブリでは、アノテーション時に見かけの開始点にORFがないか注意する。oriがわかっているならその上流を+1にする。分からないなら、一度アノテーションを行なってORFが見つからない領域を探す。そのような領域を見かけ上の+1にするのが無難と思われる。

dockerを使う。

docker run --rm -itv $PWD:/data/ -w /data staphb/prokka:latest \
 prokka input_genome.fasta -o prokka_output

2020 4/19 追記、2020 12/28オプション追加

multiqcと連携する。

以下のスクリプトでディレクトリ中の全fastaのアノテーションを行う。gene IDは入力fasta名から取る（--locustag $folder）。出力ディレクトリ（--prefix $folder）と株名も同様（--strain $folder）。

#!/bin/sh 
#prooka annotation

for file in `\find *fasta -maxdepth 1 -type f`; do
 genome=${file}
 folder=${file%.fasta}
 
 prokka $genome --outdir $folder --addgenes --prefix $folder --strain $folder --cpus 12 --quiet --locustag $folder --prefix $folder
done

prokka.shとして保存。--strain xxx がmultiqcで出力される名前になる。

実行。

bash prokka.sh

multiqc .

2020 4/20 追記

PROKKAで出力されるgffからbedを出力。

#BEDOPSを入れていないなら導入
apt insall BEDOPS

convert2bed --input=gff < PROKKA.gff > PROKKA.bed

2021 11/3

genetic codeはデフォルトでは細菌、アーキアなどの11になっています（--kingdom Bacteriaがデフォルトで、これにより--gcodeのデフォルトである0から11に変更されている）。

参考ページ

The Genetic Codes

引用

Prokka: rapid prokaryotic genome annotation

Seemann T

Bioinformatics. 2014 Jul 15;30(14):2068-9

2023/03/29

prokkaはコンティグ末端のORF予測で、不完全なものを除外する可能性があります。そのため、コンティグの連続性が低い場合、コンティグ末端のORF予測を排除することで、そのような遺伝子がコア遺伝子探索でMissingになる可能性があります。自分の場合はprokkaのベースとなるprodigalに変えることで予測できるようになりました。再現性がない現象かもしれませんが注意してください。

（ただしprodigalの出すGFF3は微妙にフォーマットがおかしく、そのままパンゲノム解析などに使用するとエラーになる。prokkaの出す標準化されたGFF3は貴重）

tbl2asnでエラーになったら、tbl2asnだけ導入し直してください。NCBIの定めた使用期限が設定されているようです。

Could not run command: tbl2asn · Issue #139 · tseemann/prokka · GitHub

Oxford Nanoporeリードのアセンブリ MiniasmとNanopolish

assembly Nanopore long read 高速なツール 2016 Bioinformatics assembly graph polish Overlap-layout-consensus (OLC)

2019 4/4 ヘルプ追記

2019 6/21 文章修正

2019 7/17 コメント追加

2019 7/26　追記

2019 10/14追記

2019 11/5 コマンドに-t <NUM> 追加の修正

2020 3/30 関連ツール追記

MiniasmはPacbioのロングリードやナノポアのロングリードのアセンブルツールで2015年に論文が発表された (ref.1)。アルゴリズムはオーバーラップ法になる。アセンブル時間が非常に短いのが特徴で、ナノポアリードのアセンブルの比較ペーパーでは、競合アセンブラが数時間かけるデータを２分で終えると書かれている(ref.2)。

インストール

GitHub

GitHubのダウンロードリンクからソースをダウンロードすることもできるが、brewやcondaでワンライナーインストールもできる。

#bioconda
conda install -y -c bioconda minimap
conda install -y -c bioconda miniasm

#homebrew
brew install miniasm

>miniasm

$ miniasm

Usage: miniasm [options] <in.paf>

Options:

Pre-selection:

-m INT min match length [100]

-i FLOAT min identity [0.05]

-s INT min span [2000]

-c INT min coverage [3]

Overlap:

-o INT min overlap [same as -s]

-h INT max over hang length [1000]

-I FLOAT min end-to-end match ratio [0.8]

Layout:

-g INT max gap differences between reads for trans-reduction [1000]

-d INT max distance for bubble popping [50000]

-e INT small unitig threshold [4]

-f FILE read sequences

-n INT rounds of short overlap removal [3]

-r FLOAT[,FLOAT]

max and min overlap drop ratio [0.7,0.5]

-F FLOAT aggressive overlap drop ratio in the end [0.8]

Miscellaneous:

-p STR output information: bed, paf, sg or ug [ug]

-b both directions of an arc are present in input

-1 skip 1-pass read selection

-2 skip 2-pass read selection

-V print version number

See miniasm.1 for detailed description of the command-line options.

> minimap

$ minimap

Usage: minimap [options] <target.fa> [query.fa] [...]

Options:

Indexing:

-k INT k-mer size [15]

-w INT minizer window size [{-k}*2/3]

-I NUM split index for every ~NUM input bases [4G]

-d FILE dump index to FILE

-l the 1st argument is a index file (overriding -k, -w and -I)

Mapping:

-f FLOAT filter out top FLOAT fraction of repetitive minimizers [0.001]

-r INT bandwidth [500]

-m FLOAT merge two chains if FLOAT fraction of minimizers are shared [0.50]

-c INT retain a mapping if it consists of >=INT minimizers [4]

-L INT min matching length [40]

-g INT split a mapping if there is a gap longer than INT [10000]

-T INT SDUST threshold; 0 to disable SDUST [0]

-S skip self and dual mappings

-O drop isolated hits before chaining (EXPERIMENTAL)

-P filtering potential repeats after mapping (EXPERIMENTAL)

-x STR preset (recommended to be applied before other options)

ava10k: -Sw5 -L100 -m0 (PacBio/ONT all-vs-all read mapping)

Input/Output:

-t INT number of threads [3]

-V show version number

See minimap.1 for detailed description of the command-line options.

> minimap2

r$ minimap2

Usage: minimap2 [options] <target.fa>|<target.idx> [query.fa] [...]

Options:

Indexing:

-H use homopolymer-compressed k-mer (preferrable for PacBio)

-k INT k-mer size (no larger than 28) [15]

-w INT minizer window size [10]

-I NUM split index for every ~NUM input bases [4G]

-d FILE dump index to FILE

Mapping:

-f FLOAT filter out top FLOAT fraction of repetitive minimizers [0.0002]

-g NUM stop chain enlongation if there are no minimizers in INT-bp [5000]

-G NUM max intron length (effective with -xsplice; changing -r) [200k]

-F NUM max fragment length (effective with -xsr or in the fragment mode) [800]

-r NUM bandwidth used in chaining and DP-based alignment [500]

-n INT minimal number of minimizers on a chain [3]

-m INT minimal chaining score (matching bases minus log gap penalty) [40]

-X skip self and dual mappings (for the all-vs-all mode)

-p FLOAT min secondary-to-primary score ratio [0.8]

-N INT retain at most INT secondary alignments [5]

Alignment:

-A INT matching score [2]

-B INT mismatch penalty [4]

-O INT[,INT] gap open penalty [4,24]

-E INT[,INT] gap extension penalty; a k-long gap costs min{O1+k*E1,O2+k*E2} [2,1]

-z INT[,INT] Z-drop score and inversion Z-drop score [400,200]

-s INT minimal peak DP alignment score [80]

-u CHAR how to find GT-AG. f:transcript strand, b:both strands, n:don't match GT-AG [n]

Input/Output:

-a output in the SAM format (PAF by default)

-Q don't output base quality in SAM

-L write CIGAR with >65535 ops at the CG tag

-R STR SAM read group line in a format like '@RG\tID:foo\tSM:bar'

-c output CIGAR in PAF

--cs[=STR] output the cs tag; STR is 'short' (if absent) or 'long' [none]

--MD output the MD tag

--eqx write =/X CIGAR operators

-Y use soft clipping for supplementary alignments

-t INT number of threads [3]

-K NUM minibatch size for mapping [500M]

--version show version number

Preset:

-x STR preset (always applied before other options; see minimap2.1 for details)

- map-pb/map-ont: PacBio/Nanopore vs reference mapping

- ava-pb/ava-ont: PacBio/Nanopore read overlap

- asm5/asm10/asm20: asm-to-ref mapping, for ~0.1/1/5% sequence divergence

- splice: long-read spliced alignment

- sr: genomic short-read mapping

See `man ./minimap2.1' for detailed description of command-line options.

テストデータ

オーサーたちの準備したtestデータのダウンロード

wget -O- http://www.cbcb.umd.edu/software/PBcR/data/selfSampleData.tar.gz | tar zxf -

解凍したディレクトリは以下のようになっている。

user $ ls -lh selfSampleData

total 272M

-rw-r--r-- 1 uesaka user 267M Feb 21 2015 pacbio_filtered.fastq

-rwxr-xr-x 1 uesaka user 169 Feb 6 2015 pacbio.spec

-rw-r----- 1 uesaka user 3.1K Feb 28 2015 README

-rw-r--r-- 1 uesaka user 4.5M Mar 20 2013 reference.fasta

実行方法

１、overlap

minimap -Sw5 -L100 -m0 -t8 pacbio_filtered.fastq pacbio_filtered.fastq | gzip -1 > reads.paf.gz

-S　skip self and dual mappings

-L　INT min matching length [40]

-m　FLOAT merge two chains if FLOAT fraction of minimizers are shared [0.50]

-t　INT number of threads [3]

２、layout

miniasm -f pacbio_filtered.fastq reads.paf.gz > reads.gfa

-f　FILE read sequences

GFAファイルを開く。

>less -S reads.gfa

f:id:kazumaxneo:20170622221043j:plain

最後にawkとsedを使ってGFAからFASTAに変換する。

awk '/^S/{print ">"$2"\n"$3}' reads.gfa | fold > reads.fa

先頭がSの行がアセンブルされたunitigで、その２フィールド目と３フィールド目をfastaの>と改行をつけながらfoldに渡し、改行を加えて出力している。

GFA=> FASTAにはany2fastaのようなツールを使ってもよい。

追記

miinmap2を使う。ONTのリードの場合

minimap2 -x ava-ont -t 12 long-read.fq long-read.fq > ont.paf 
miniasm -f long-read.fq ont.paf > assembly.gfa 
awk '/^S/{print ">"$2"\n"$3}' assembly.gfa | fold > assembly.fasta

アセンブルのエラーをNanopolishで修復する。Nanopolishは使い方に癖があるツールなので、以下のエントリでーNanopolishを単独で説明している。

OLC（参考PDF）のtagをつけているが、miniasmにはconsensus callのステップがないことに注意する。エラー率はアセンブリに使ったリードと同じになる。

3020 03/30追記

all versus allリードマッピングにfpaのフィルタリングを取り込む。

引用

１、Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences

Heng Li

Bioinformatics (2016) 32 (14): 2103-2110. DOI:

https://academic.oup.com/bioinformatics/article/32/14/2103/1742895/Minimap-and-miniasm-fast-mapping-and-de-novo

２、Comparison of bacterial genome assembly software for MinION data and their applicability to medical microbiology

Kim Judge, Martin Hunt, Sandra Reuter, Alan Tracey, Michael A. Quail, Julian Parkhill, Sharon J. Peacock

01 September 2016, Microbial Genomics , 2016 2, doi: 10.1099/mgen.0.000085

http://mgen.microbiologyresearch.org/content/journal/mgen/10.1099/mgen.0.000085

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

Oxford Nanoporeリードのアセンブリ smartdenovo

ナノポアのアダプタートリミングツール Porechop

sam/bam関係のツールまとめ

samtools

quickcheck - 追記 bamがおかしくなっていないか、カレントの全てのbamを調べる。

bamがおかしくなっていないか（download中に切れたことを気づかないまま進めたり、変換中におちたbamを使ったりなど）。 "||"をつかうことで、エラーがある時だけ||以降が実行されるようにしている。

bamtools

bamファイルを使い色々なことができるツール。samtoolsの機能と重複するものが多いが、よりデータ解析よりに作られている。

elprep（紹介）

VariantBam（紹介）

goleft（紹介）

mosdepth（紹介）

biobambam2（紹介）

Sambamba（紹介）

Samblaster（紹介）

BamDeal（紹介）

Alfred（紹介）

bamM（紹介）

KneadData（紹介）

fastqp（紹介）

BAMStats

samstat

bcftools

fass

NGSUtils - bamutils

contigからscaffoldを作るツールの検証

フォーマット変換 Fastq=> Fasta

Prokaryotesのアノテーションツール Prokka

Oxford Nanoporeリードのアセンブリ MiniasmとNanopolish