2017-10-18

マルチプルシーケンスアラインメントを行うMAFFT

2019 6/13 説明及びインストール追記、6/21 コマンド微修正、7/3 説明修正、7/15 help追記、9/29 twitter追加、11/4 関連追加、m11/13 誤字修正

2020 4/15 タイトル修正、5/30 link追加

2024/04/27 インストール追記

MAFFT開発秘話。私が4回生で宮田研に加わった時、まず手動で多重アラインメントする訓練から始まりました。研究室内部で使われていたそのエディタに、後にMAFFTの主軸となるアルゴリズムが間もなく入りました。Xcedと呼ばれた、メンバーしか知らないエディタです。https://t.co/KVeCggUcCO
— 🧬Shigehiro Kuraku🦈工樂樹洋 (@sighex) September 28, 2019

MAFFTはマルチプルアライメントを行うツール。t-coffeeやclustal omegaより高速に動作するとされる。HPに数百のrRNA配列に対してマルチプルアライメントを実行する例が載っている。

クイックマニュアル

Manpage of MAFFT

Tips

https://mafft.cbrc.jp/alignment/software/tips0.html

インストール

macOSX版ダウンロード

MAFFT for Mac OS X - a multiple sequence alignment program

pkgファイルを叩き、指示に従ってインストールする。brewで導入することもできる。

追記
#bioconda (link)
mamba install -c bioconda -y mafft

#もしくはソースファイルをダウンロードしてビルドする（condaインストールと違って文書化されていないサポートスクリプトも使用できる)
#1 こちらからダウンロード
#2 解凍してビルド
gunzip -cd mafft-x.x-src.tgz | tar xfv -
cd mafft-x.x/core/
make clean
make
su
sudo make install

> mafft -h

$ mafft -h

/Users/user/.pyenv/versions/miniconda2-4.3.30/bin/mafft: Cannot open -h.

------------------------------------------------------------------------------

MAFFT v7.407 (2018/Jul/23)

https://mafft.cbrc.jp/alignment/software/

MBE 30:772-780 (2013), NAR 30:3059-3066 (2002)

------------------------------------------------------------------------------

High speed:

% mafft in > out

% mafft --retree 1 in > out (fast)

High accuracy (for <~200 sequences x <~2,000 aa/nt):

% mafft --maxiterate 1000 --localpair in > out (% linsi in > out is also ok)

% mafft --maxiterate 1000 --genafpair in > out (% einsi in > out)

% mafft --maxiterate 1000 --globalpair in > out (% ginsi in > out)

If unsure which option to use:

% mafft --auto in > out

--op # : Gap opening penalty, default: 1.53

--ep # : Offset (works like gap extension penalty), default: 0.0

--maxiterate # : Maximum number of iterative refinement, default: 0

--clustalout : Output: clustal format, default: fasta

--reorder : Outorder: aligned, default: input order

--quiet : Do not report progress

--thread # : Number of threads (if unsure, --thread -1)

実行方法

公式サイトから59のrRNAの配列のFASTAをダウンロードできる（リンク）。これを使ってみる。example 1のmulti fastaをダウンロード。

ダウンロードした配列を指定してautoモードでラン。

mafft --auto ex1.txt > output

マルチプルアライメント結果が出力される。

ユーザーと対話式でランもできる。

mafft

user$ mafft

---------------------------------------------------------------------

MAFFT v7.305b (2016/Aug/16)

MBE 30:772-780 (2013), NAR 30:3059-3066 (2002)

http://mafft.cbrc.jp/alignment/software/

---------------------------------------------------------------------

Input file? (fasta format)

ex1.txt

Output file?

@ out

Output format?

1. Clustal format / Sorted

2. Clustal format / Input order

3. Fasta format / Sorted

4. Fasta format / Input order

5. Phylip format / Sorted

6. Phylip format / Input order

@ 1

OK. arguments = --reorder

指示に従って実行する。

mafft webサービス

産業技術総合研究所のhttps://mafft.cbrc.jp/alignment/server/ にアクセスする。

こちらではmafftを使ったマルチプルシーケンスアラインメント（MSA）の他に、MSAから各種方法による分子系統樹作成までほぼ自動で行える。

FASTAファイルをウィンドウ内にペーストするか、ファイルを選択からアップロードする。

ラン後、上のメニューボタンからMSAViewerに切り変えれば結果を視覚化できます。

viewerはjava webスタートとしてlocalでも実行できる。

マルチプルシーケンスアラインメントをすでに実行済なら、.alnファイルを提供することで系統推定だけ行うこともできる。

Molecular phylogeny by the NJ / UPG methods

ツリーのみ描画

Rough clustering of Unaligned Sequences

追記

1、通常、アミノ酸配列は大文字で、塩基配列は小文字で出力される。

https://academic.oup.com/mbe/article/30/4/772/1073398

しかし何らかの理由で小文字と大文字を使い分けている場合、小文字は小文字のまま、大文字は大文字のまま、すなわち入力ファイルの状態で出力したい事がある。そのような時は--preservecaseオプションをつけてランする。この場合、入力ファイルが小文字なら小文字のまま、大文字は大文字のまま出力されるが（混ざっていても動作する）、一部の大文字を想定したMSAビューアではエラーになるので注意。

２、デフォルトでは入力配列と同じ順番で出力されるが、--reorderオプションを使うと配列の類似性によりソートして出力される。

引用

MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform.

Katoh, Misawa, Kuma and Miyata

Nucleic Acids Res. 30:3059-3066, 2002

MAFFT version 5: improvement in accuracy of multiple sequence alignment.

Katoh, Kuma, Toh and Miyata

Nucleic Acids Res. 33:511-518, 2005

PartTree: an algorithm to build an approximate tree from a large number of unaligned sequences.

Katoh and Toh

Bioinformatics 23:372-374, 2007

MAFFTを使ってマルチプルアラインメントを行う統合TV(togotv)｜生命科学系DB・ツール使い倒し系チャンネル

トランスポゾン検出ツール6　Tangram

structural variations (SV) transposon 作成中 2014 BMC Genomics

2021 8/20 追記

Tangramはトランスポゾンの検出に特化した構造変化検出ツール。SV検出で用いられるread-pairとsplit-readのアルゴリズムを使い高感度にトランスポゾンを検出する。1000ゲノムでもmobile element検出ツールとして用いられた。トランスポゾン検出ツールは様々報告されているが、Tangramはsplit read情報とread pair情報を両方使い、１塩基の精度でトランスポゾン挿入位置を検出することが可能である。

インストール

github

依存

g++ 4.2.0 and above
zlib
pthread lib

ダウンロードしてビルドする。

git clone git://github.com/jiantao/Tangram.git 
cd Tangram/src/
make -j

>./tangram_scan

$ ./tangram_scan

Usage: tangram_scan [options] -in <input_file_list> -dir <output_dir>

Mandatory arguments: -in FILE the list of input bam files

-dir STRING the path to the output dir (must be empty or non-existing)

Options: -cf FLOAT threashold for normal read pair in the fragment length distribution[0.01 total for both side]

-tr FLOAT trim rate for the fragment length distribution[0.02 total for both side]

-mq INT minimum mapping quality for a normal read pair

-mf INT minimum number of nomral fragments in a library[10000]

-help print this help message

> /tangram_bam

$ ./tangram_bam

Usage: tangram_bam [options] -i <in_bam> -r <ref_fa> -o <out_bam>

Mandatory arguments:

-i --input FILE The input of bam file [stdin].

-r --ref FILE The input of special reference file.

-o --output FILE The output of bam file [stdout].

Options:

-h --help Print this help message.

-t --target-ref-name STRING Chromosome region.

-m --required-match INT The number of required matches.

between reads and special references [50].

Notes:

1. tangram_bam will add ZA tags that are required for the following detection.

> ./tangram_merge

$ ./tangram_merge

Usage: tangram_merge -dir <input_dir>

Mandatory arguments: -dir STRING the path to the dir contains all the fragment length distribution files

-help print this help message

> ./tangram_filter.pl

$ ./tangram_filter.pl

Usage: ./tangram_filter [options] --vcf <input_vcf> --msk <mask_input_list>

Mandatory Arguments: --vcf FILE input vcf file for filtering

--msk FILE input list of mask files with window size information

Options: --type STRING SV event type for filtering, choosing from "DEL", "DUP", "INV" and "MEI" (case insensitive) [MEI]

--rpf INT Minimum number of supporting fragments (reads) for read-pair events. For MEI events, this threshold

must be satisfied for reads from both 5' and 3' [2]

--srf INT Minimum number of supporting fragments (reads) for split-read events [2]

--out FILE Output of filtered and sorted VCF file [stdout]

--help Show this help message

Note:

1. This script require the installation of "bedtools" package and Unix

sort in the default directory.

2. Each entry of the list of mask files is a tab delimited file

with following format:

"TYPE WINDOW_SIZE FILE_NAME"

"TYPE" (string) is the type of this mask file. For a referenced MEI

mask file, it must match the first two characters of the family name

in the VCF file (For example AL: ALU, L1: L1, SV: SVA and HE: HERV).

This mask file will only be applied to the corresponding type of MEI

events. For example, AL mask file will only be applied to ALU insertions.

The rest of the mask files, such as segmental duplication

mask and simple repeat mask, their "TYPE" string can be anything and

it will be applied to all the entries in the VCF file. No space is allowed

in the type name.

"WINDOW_SIZE" (integer) is the window size around each entry of

the mask file.

"FILE_NAME" (string) is the path to the corresponding mask file

All the mask files must be in the BED format. For detailed information

about this format, please check http://genome.ucsc.edu/FAQ/FAQformat.html

bin/にパスを通しておく。

実行方法

以下のフローでトランスポゾンを検出する。

f:id:kazumaxneo:20171017213212j:plain

Gitのマニュアルより

MOSAIKでアライメントしたza tag付きのbamが必要。なければ、bamにZA -tagを付加する作業を最初に行う。fastaのindexがなければ、ここで自動作成される。

step1　ZA tagの付加

tangram_bam -i input.bam -r ref.fa -o ZA-tagged.bam

<@/&:MQ1:MQ2:SP_REF:NUM_MAP:CIGAR:MD>のようなTagがbamに付加される。詳細はGitのマニュアルを参考にしてください。

step2　bamのスキャン

tangram_scan -in ZA-tagged.bam -dir output

step3　index

tangram_index -ref -sp output/input.bam -out output/

step4　トランスポゾンの検出

tangram_detect

step5　フィルタリング

tangram_filter

引用

Tangram: a comprehensive toolbox for mobile element insertion detection

Jiantao Wu, Wan-Ping Lee, Alistair Ward, Jerilyn A Walker, Miriam K Konkel, Mark A Batzer and Gabor T Marth

BMC Genomics 2014 15:795

アダンプタートリミングツール TagDust2

trimming / preprocessing barcode / index low complexity 2015 BMC Bioinformatics

TgaDust2は、アダプター、バーコード、単純リピートなどの不要な情報を見つけて除去するツール。2009年にTagDustが発表され、その後2015年にTagDust2が発表された。

公式サイト

TagDust

インストール

brewで導入できる。

brew install TagDust

brewではTagDust2が導入されている。

user$ tagdust

Usage: tagdust [options] <file> -o <output prefix>

Options:

-Q FLT confidence threshold [20].

-l STR log file directory name.

-start INT start of search area [0].

-end INT end of search area [length of sequence].

-format STR format of input sequence file.

-minlen INT minimal accepted read length [16].

-ref STR reference fasta file to be compared against[].

-fe INT number of errors allowed when comparing to reference[2].

-dust INT remove low complexity sequences. [100].

-e FLT expected sequencer error rate [0.05].

-o STR output file name.

-a STR output file for artifacts [NA].

-t INT number of threads [8].

-show_finger_seq NA print fingerprint as sequence (default is as base 4 number).

-h/help NA print help.

-v/version NA print version number.

-1 STR type of the first HMM building block.

-2 STR type of the second HMM building block.

-... STR type of the . . . HMM building block.

実行方法

tagdust adapter.fasta input.fastq -e 0.05 -fe 2 -o output.fastq -a artifactual.fastq -t 12

-dust　remove low complexity sequences. [100].
-e　expected sequencer error rate [0.05].
-o　output file name.
-a　output file for artifacts [NA].
-t　number of threads [8].
-fe　number of errors allowed when comparing to reference[2].
-minlen　minimal accepted read length [16].

Scytheのオーサーらは、Scytheでアダプター除去し、Scytheで除けなかった5'側のアダプターなどをTagDustで除去する２段構えのアダプター除去プロセスを提案しています（その場合のワークフローは、Scythe-> TagDust2 -> クオリティトリミング-> QC、の流れにすべきとしています）。Scytheは以前紹介しています。

引用

TagDust--a program to eliminate artifacts from next generation sequencing data

Lassmann T, Hayashizaki Y, Daub CO

Bioinformatics. 2009 Nov 1;25(21):2839-40

TagDust2: a generic method to extract reads from sequencing data

Lassmann T

BMC Bioinformatics. 2015 Jan 28;16:24

http://catway.jp/bioinformatics/qc/removeseq.html

2017-10-17

EMBOSSのseqretを使ってfastaファイルを修復する

format conversion fasta/fastqの操作 tips 2000 GenBank

2019 6/19 インストール追記

2019 7/15 タイトル修正

2019 8/7リンク追加

2019 10/3コメント追加

fastaファイルfをいじっていると、何らかの拍子に構造がおかしくなってsamtoolsのindexでsegmentation errorを起こすことがある。途中に空行ができていたり、特殊文字が入っていたり、何らかの理由があるわけだが、embossのseqretを使うと簡単に修復することができる。seqretは入力ファイルを分析し、パースして標準的なNCBIのFASTA形式で出力することに使われるコマンドである。

配列中の数値、スペースなども消してくれるので、genbankをコピーして余計な文字を消すときにも使えます。

このような配列も

f:id:kazumaxneo:20201019161021p:plain

↓ 修復してくれる。

f:id:kazumaxneo:20201019161051p:plain

公式サイト

http://emboss.sourceforge.net/apps/release/6.6/emboss/apps/seqret.html

インストール

embossはcondaやbrewで導入できる。

#bioconda (link)
mamba install -c bioconda -y emboss

#homebrew
brew cask install xquartz #xquartzも無ければ入れておく
brew install emboss

> seqret -h

$ seqret -h

Read and write (return) sequences

Version: EMBOSS:6.6.0.0

Standard (Mandatory) qualifiers:

[-sequence] seqall (Gapped) sequence(s) filename and optional

format, or reference (input USA)

[-outseq] seqoutall [<sequence>.<format>] Sequence set(s)

filename and optional format (output USA)

Additional (Optional) qualifiers: (none)

Advanced (Unprompted) qualifiers:

-feature boolean Use feature information

-firstonly boolean [N] Read one sequence and stop

General qualifiers:

-help boolean Report command line options and exit. More

information on associated and general

qualifiers can be found with -help -verbose

実行方法

seqret

入力のFASTAと出力のFASTA名を順番に入力する。

user$ seqret

Read and write (return) sequences

Input (gapped) sequence(s): input.fasta

output sequence(s) [chr.fasta]:out.fa

またはinputとoutputのfasta名を指定する。

seqret input.fasta output.fasta

これだけでFASTAを修復できる。

UCSCからも同様のツールが提供されています。

https://users.soe.ucsc.edu/~kent/dnaDust/dnadust.html

追記

gffからgbkに変換 (manual)

seqret -sequence genome.fasta -feature -fformat gff -fopenfile input.gff -osformat genbank -osname_outseq output.gbk -ofdirectory_outseq gbk_file -auto

空のgff（先頭のコメント行のみ）を使えば、遺伝子アノテーションが無いgbkファイルを作ることもできます。

Proteinの修復ならProtein Duster が利用できます。

引用

EMBOSS: The European Molecular Biology Open Software Suite

Rice P1, Longden I, Bleasby A.

Trends Genet. 2000 Jun;16(6):276-7.

http://seqanswers.com/forums/showthread.php?t=2352

Existing tool for converting gff3 to genbank (gbk)

https://bioinformatics.stackexchange.com/questions/11115/existing-tool-for-converting-gff3-to-genbank-gbk

2017-10-15

リファレンス配列に点変異やSVを導入するEMBOSSのmsbar

small indel structural variations (SV) simulator artificial indels Trends in Genetics 2000

2019 7/16 タイトル修正

EMBOSSパッケージのmsbarを使うと、リファレンスに変異を導入することができる。変異のシミュレーション実験などに使える機能である。

公式サイト

http://emboss.sourceforge.net

EMBOSS: msbar

インストール

embossはcondaやbrewで導入できる。

mamba install -c bioconda -y emboss

brew install emboss

実行方法

msbarをタイプし、指示に従えば変異を導入できる。

user$ msbar

Mutate a sequence

Input sequence(s): Homo_sapiens.GRCh38.dna.chromosome.19.fasta

Number of times to perform the mutation operations [1]: 10

Point mutation operations

0 : None

1 : Any of the following

2 : Insertions

3 : Deletions

4 : Changes

5 : Duplications

6 : Moves

Types of point mutations to perform [0]: 0

Block mutation operations

0 : None

1 : Any of the following

2 : Insertions

3 : Deletions

4 : Changes

5 : Duplications

6 : Moves

Types of block mutations to perform [0]: 2

Codon mutation operations

0 : None

1 : Any of the following

2 : Insertions

3 : Deletions

4 : Changes

5 : Duplications

6 : Moves

Types of codon mutations to perform [0]: 0

output sequence(s) [19.fasta]: out.fa

humanのchr19に構造変化を起こすInsertion変異を10回導入して、out.faで出力した。

ワンライナーでも動作する。

1-1000bpのBlock mutationを100回導入する。他の変異は発生させない。

msbar -sequence Homo_sapiens_chr19.fasta -outseq output.fa -count 100 -point 0 -block 3 -codon 0 -minimum 1 -maximum 1000

Point mutationの１塩基挿入を10回導入する。他の変異は発生させない。

msbar -sequence Homo_sapiens_chr19.fasta -outseq output.fa -count 10 -point 2 -block 0 -codon 0

変異後の配列からfastqを発生させてオリジナルの配列にマッピングすれば、変異株のリシーケンスのシミュレーション実験ができる。どうやら正確な変異のbreakpintをレポートする機能はないようなので、盲検テストとしてツール間の感度分析などに使うと良いかもしれない。

引用

EMBOSS: The European Molecular Biology Open Software Suite

Rice P, Longden I, Bleasby A.

Trends Genet. 2000 Jun;16(6):276-7.

2017-10-14

SVを検出する wham

small indel structural variations (SV) PLOS Computational Biology 2015

whamはsplit-read情報、soft-clipping情報、コンセンサス配列情報などを統合してSVを検出するSV検出の方法論。サイズの大きなSVも検出することが可能である。ダウンロードできるパッケージにはwhamとwhamgの２つのツールが入っている。2015年に発表された論文ではオリジナルのwhamが使用されているが、whamは高感度でfalse positiveも多いため、オーサーらは、false positiveを減らすように改良されたwhamgを使うことを推奨している。

公式サイト

Wham

インストール

Github

git clone --recursive https://github.com/zeeev/wham.git; cd wham; make

macでは動作中にエラ-を起こしたのでcent OSに導入した。

condaでも導入できる（linux only リンク）。

#Anaconda環境にて
conda install -c bioconda wham

実行方法

whamgを使うフローのみ紹介する。

inputのbamはbwa memを用いてアライメント後、ソートされduplicate tagもつけたものを使うことが推奨されている。

whamg -a Homo_sapiens_assembly19.fasta -f CHM1_1.bam | perl utils/filtWhamG.pl > chm1.vcf 2> chm1.err

制作途中

引用

Wham: Identifying Structural Variants of Biological Consequence

Zev N. Kronenberg, Edward J. Osborne, Kelsey R. Cone, Brett J. Kennedy, Eric T. Domyan, Michael D. Shapiro, Nels C. Elde, Mark Yandell

PLoS Comput Biol. 2015 Dec 1;11(12):e1004572.

2017-10-07

ナノポアのロングリードのトリミングやフィルタリングを行うNanofilt

Nanopore long read trimming / preprocessing fasta/fastqの操作 2018 Bioinformatics web tool

2019 2/14 コマンド追加

2019 5/19 ヘルプ追加、パラメータ変更

2019 12/30並列処理例追加

2020 10/10 リンク追加

nanofitはナノポアのロングリードのクオリティトリミングができるツールである。

インストール

Github

https://github.com/wdecoster/nanofilt

https://github.com/wdecoster/NanoPlot

mamba install -y -c conda-forge -c bioconda Nanofilt
mamba install -y -c conda-forge -c bioconda NanoPlot

> NanoPlot -h

$ NanoPlot -h

usage: NanoPlot [-h] [-v] [-t THREADS] [--verbose] [--store] [--raw]

[-o OUTDIR] [-p PREFIX] [--maxlength N] [--minlength N]

[--drop_outliers] [--downsample N] [--loglength]

[--percentqual] [--alength] [--minqual N]

[--readtype {1D,2D,1D2}] [--barcoded] [-c COLOR]

[-f {eps,jpeg,jpg,pdf,pgf,png,ps,raw,rgba,svg,svgz,tif,tiff}]

[--plots [{kde,hex,dot,pauvre} [{kde,hex,dot,pauvre} ...]]]

[--listcolors] [--no-N50] [--N50] [--title TITLE]

CREATES VARIOUS PLOTS FOR LONG READ SEQUENCING DATA.

General options:

-h, --help show the help and exit

-v, --version Print version and exit.

-t, --threads THREADS

Set the allowed number of threads to be used by the script

--verbose Write log messages also to terminal.

--store Store the extracted data in a pickle file for future plotting.

--raw Store the extracted data in tab separated file.

-o, --outdir OUTDIR Specify directory in which output has to be created.

-p, --prefix PREFIX Specify an optional prefix to be used for the output files.

Options for filtering or transforming input prior to plotting:

--maxlength N Drop reads longer than length specified.

--minlength N Drop reads shorter than length specified.

--drop_outliers Drop outlier reads with extreme long length.

--downsample N Reduce dataset to N reads by random sampling.

--loglength Logarithmic scaling of lengths in plots.

--percentqual Use qualities as theoretical percent identities.

--alength Use aligned read lengths rather than sequenced length (bam mode)

--minqual N Drop reads with an average quality lower than specified.

--readtype {1D,2D,1D2}

Which read type to extract information about from summary. Options are 1D, 2D,

1D2

--barcoded Use if you want to split the summary file by barcode

Options for customizing the plots created:

-c, --color COLOR Specify a color for the plots, must be a valid matplotlib color

-f, --format {eps,jpeg,jpg,pdf,pgf,png,ps,raw,rgba,svg,svgz,tif,tiff}

Specify the output format of the plots.

--plots [{kde,hex,dot,pauvre} [{kde,hex,dot,pauvre} ...]]

Specify which bivariate plots have to be made.

--listcolors List the colors which are available for plotting and exit.

--no-N50 Hide the N50 mark in the read length histogram

--N50 Show the N50 mark in the read length histogram

--title TITLE Add a title to all plots, requires quoting if using spaces

Input data sources, one of these is required.:

--fastq file [file ...]

Data is in one or more default fastq file(s).

--fasta file [file ...]

Data is in one or more fasta file(s).

--fastq_rich file [file ...]

Data is in one or more fastq file(s) generated by albacore or MinKNOW with

additional information concerning channel and time.

--fastq_minimal file [file ...]

Data is in one or more fastq file(s) generated by albacore or MinKNOW with

additional information concerning channel and time. Minimal data is extracted

swiftly without elaborate checks.

--summary file [file ...]

Data is in one or more summary file(s) generated by albacore.

--bam file [file ...]

Data is in one or more sorted bam file(s).

--cram file [file ...]

Data is in one or more sorted cram file(s).

--pickle pickle Data is a pickle file stored earlier.

EXAMPLES:

Nanoplot --summary sequencing_summary.txt --loglength -o summary-plots-log-transformed

NanoPlot -t 2 --fastq reads1.fastq.gz reads2.fastq.gz --maxlength 40000 --plots hex dot

NanoPlot --color yellow --bam alignment1.bam alignment2.bam alignment3.bam --downsample 10000

> NanoFilt -h

# NanoFilt -h

usage: NanoFilt [-h] [-v] [--logfile LOGFILE] [-l LENGTH]

[--maxlength MAXLENGTH] [-q QUALITY] [--minGC MINGC]

[--maxGC MAXGC] [--headcrop HEADCROP] [--tailcrop TAILCROP]

[-s SUMMARY] [--readtype {1D,2D,1D2}]

Perform quality and/or length and/or GC filtering of (long read) fastq data. Reads on stdin.

General options:

-h, --help show the help and exit

-v, --version Print version and exit.

--logfile LOGFILE Specify the path and filename for the log file.

Options for filtering reads on.:

-l LENGTH, --length LENGTH

Filter on a minimum read length

--maxlength MAXLENGTH

Filter on a maximum read length

-q QUALITY, --quality QUALITY

Filter on a minimum average read quality score

--minGC MINGC Sequences must have GC content >= to this. Float between 0.0 and 1.0. Ignored if

using summary file.

--maxGC MAXGC Sequences must have GC content <= to this. Float between 0.0 and 1.0. Ignored if

using summary file.

Options for trimming reads.:

--headcrop HEADCROP Trim n nucleotides from start of read

--tailcrop TAILCROP Trim n nucleotides from end of read

Input options.:

-s SUMMARY, --summary SUMMARY

Use summary file for quality scores

--readtype {1D,2D,1D2}

Which read type to extract information about from summary. Options are 1D, 2D or

1D2

EXAMPLES:

gunzip -c reads.fastq.gz | NanoFilt -q 10 -l 500 --headcrop 50 | minimap2 genome.fa - | samtools sort -O BAM -@24 -o alignment.bam -

gunzip -c reads.fastq.gz | NanoFilt -q 12 --headcrop 75 | gzip > trimmed-reads.fastq.gz

gunzip -c reads.fastq.gz | NanoFilt -q 10 | gzip > highQuality-reads.fastq.gz

root@fc7ac9b00489:/#

ラン

5'末端75-bpのトリミング、平均クオリティ10以下のリードを捨てるクオリティフィルタリング、500bp以下のリードを捨てるサイズフィルタリングを実行する。

gunzip -c input.fq.gz |NanoFilt -q 10 -l 500 --headcrop 75 | gzip > trimmed.fq.gz

-q　QUALITY Filter on a minimum average read quality score
-s　SUMMARYFILE optional, the sequencing_summary file from albacore for extracting quality scores
-l　LENGTH Filter on a minimum read length
--headcrop　HEADCROP Trim n nucleotides from start of read
--tailcrop　TAILCROP Trim n nucleotides from end of read

ナノポアのリードの先頭数十bpは特にクオリティが悪く、解析に悪影響を与えるので強制トリミングしている。

1Dのデータを分析してみる。

まずはfast5からbasecallingして作ったraw fastqを分析する（webでも利用可能）。

NanoPlot --fastq E.coli.fastq --loglength -t 12

f:id:kazumaxneo:20171007151034j:plain

quality6以下、1000bp付近に非常にたくさんのリードが出ており、クオリティの山が２つある状態である。また、山の形状も異なるのも興味深い。左下に伸びた短いリードはつまりジャンクということだろうか？

下の山をクオリティ6で切る。また5'末端50-bpをトリミングし、100bp以下になったリードは捨てる。

$gzip compressed fastq
gunzip -c input.fq.gz |NanoFilt -q 6 --headcrop 50 -l 100 > trimmed.fq

#fastq
cat input.fq |NanoFilt -q 6 --headcrop 50 -l 100 > trimmed.fq

nanoplotで分析。

NanoPlot --fastq trimmed.fastq --loglength -t 12 -o qc_result_dir

f:id:kazumaxneo:20171007152051j:plain

平均クオリティ6以下が完全になくなっている。

nanoplotは別に紹介しています。

追記

2019 12/30 mergeする前のfastqがあるならGNU parallelで簡単に並列処理できる（*1）。"q10>"、"500bp>"、"先頭50bpトリミング"を８並列で実行する。

ls *.fastq | parallel -j 8 'cat {} | NanoFilt -q 10 --headcrop 50 -l 500 > filtered_{}'

引用

NanoPack: visualizing and processing long read sequencing data.

De Coster W, D'Hert S, Schultz DT, Cruts M, Van Broeckhoven C

Bioinformatics. 2018 Mar 14.

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

マルチプルシーケンスアラインメントを行うMAFFT

トランスポゾン検出ツール6　Tangram

アダンプタートリミングツール TagDust2

EMBOSSのseqretを使ってfastaファイルを修復する

リファレンス配列に点変異やSVを導入するEMBOSSのmsbar

SVを検出する wham

ナノポアのロングリードのトリミングやフィルタリングを行うNanofilt