2017-08-24

illumina-utils: a toolkit for manipulating FASTQ files

illumina-utils is a set of utilities, written in Python, for Illumina sequencing data. It can merge overlapping paired-end reads and filter reads by quality.

Installation

sudo pip install illumina-utils

Usage

Demultiplexing raw FASTQ files.

The run requires a text file that maps index (barcode) information to samples. See the official example for the format.

iu-demultiplex -s barcode_to_sample.txt --r1 r1.fastq --r2 r2.fastq --index index.fastq -o output/

Illumina's own demultiplexing tool is bcl2fastq.

Most of the commands need a config file to run. A config file can be generated from a tab-separated file like the following (input.txt).

sample	r1	r2
e_coli	ecoli-R1.fastq	ecoli-R2.fastq
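With many samples it may be easier to generate this sheet from a script. Below is a minimal Python sketch, assuming only the three tab-separated columns shown above (`sample`, `r1`, `r2`). The helper name `format_sample_sheet` is hypothetical and is not part of illumina-utils.

```python
def format_sample_sheet(samples):
    """Return a tab-separated sample sheet (columns: sample, r1, r2) as text.

    `samples` is a list of (sample_name, r1_path, r2_path) tuples.
    """
    lines = ["\t".join(("sample", "r1", "r2"))]
    for name, r1, r2 in samples:
        lines.append("\t".join((name, r1, r2)))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Reproduces the one-sample input.txt shown above.
    sheet = format_sample_sheet([("e_coli", "ecoli-R1.fastq", "ecoli-R2.fastq")])
    with open("input.txt", "w") as handle:
        handle.write(sheet)
```

The resulting input.txt can then be passed to iu-gen-configs as in the next step.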

Generate the config file.

iu-gen-configs input.txt

In this example, e_coli.ini is created. Check its contents.

user$ cat e_coli.ini

[general]

project_name = e_coli

researcher_email = u@example.edu

input_directory = /Users/user/Documents/test/

output_directory = /Users/use/Documents/test/

[files]

pair_1 = ecoli-R1.fastq

pair_2 = ecoli-R2.fastq
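Because the config is an ordinary INI file, it can be checked from a script before starting a long merge. A minimal sketch using Python's standard `configparser`, with an abridged copy of the file above embedded as a string; the function name `read_pairs` is made up for this example.

```python
import configparser

# Abridged copy of the e_coli.ini shown above (directory entries omitted).
INI_TEXT = """\
[general]
project_name = e_coli
researcher_email = u@example.edu

[files]
pair_1 = ecoli-R1.fastq
pair_2 = ecoli-R2.fastq
"""


def read_pairs(ini_text):
    """Return (project_name, pair_1, pair_2) from an INI laid out as above."""
    config = configparser.ConfigParser()
    config.read_string(ini_text)
    return (
        config["general"]["project_name"],
        config["files"]["pair_1"],
        config["files"]["pair_2"],
    )


print(read_pairs(INI_TEXT))
```

For a file on disk, `config.read("e_coli.ini")` can be used in place of `read_string`.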

Merge overlapping paired-end reads. Pass the config file created above.

iu-merge-pairs e_coli.ini

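The run takes a fair amount of time (on a Core i7 MacBook Pro, more than 1 hour for 2 million x 2 reads of 300 bp). When merging finishes, pairs that merged successfully and pairs that failed are written to separate files. A report is also written.

The header of each merged sequence carries information describing how the pair was merged. The quality lines are dropped, so the output is in FASTA format.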

user$ head -4 e_coli_MERGED

>e_coli_19|M02077:18:000000000-A88N7:1:1101:14587:1648 1:N:0:1|o:229|m/o:0.013100|MR:n=0;r1=3;r2=0|Q30:n/a|CO:0|mismatches:3

aattaaaaaaacttggctggcaatatgttcctggctgtttggaagatcaacctgttcctactgatccactgaTtactgaacgggaaagttttaagcagattcttattaagccccgacttcagcaagcCcttaagcgCattaatctgactgatgatggagagccatggctagatgactaccaaattgagtcggccatttcccagctagagcgggctgtcaccaccgaaaagctgatcgaagccaatcaactcatcacagaactgctctggaatggtgtgactgtattcgttcccaatggtaaagatgaaattgttcagttcattgatttcgagaatattgagcagaatgatttccttgccatcaatcaatatg

>e_coli_28|M02077:18:000000000-A88N7:1:1101:20129:1679 1:N:0:1|o:223|m/o:0.026906|MR:n=0;r1=5;r2=1|Q30:n/a|CO:0|mismatches:6

agaagctatcgctgaattttcccccctggaacaggaggacgttatgcagttaacaaccagttggatgcttcagggcattgaacagggcatCgaacgtggacaaaaatctctgctactcaaacaaattagacatcgTtttggagagttgaatgcagttaatTtgtcgaggattgacattctcaCagtgcctcaattggaacagttggGagaagtgctgttggactgctctgatttcgcagaattagaacaatggctagcggcccaatCtgaaacacctgagcgaaaaatttagtgaatttccagaggtggatgttgccatgaatgttcccaaagaatcaggtttcaaccatcggcttacaattccaaactgagttgata
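The header fields are separated by `|`, so they are easy to pull apart in a script. The sketch below is a rough Python example, not an official parser. It assumes the first two fields are the new sequence ID and the original read ID, and that the rest are `key:value` pairs. Reading `o` as the overlap length and `m/o` as mismatches divided by overlap is inferred from the two headers above (3/229 = 0.013100, 6/223 = 0.026906), so check it against the illumina-utils documentation.

```python
def parse_merged_header(header):
    """Split an iu-merge-pairs FASTA header into its '|'-separated parts.

    Assumption: parts[0] is the new sequence ID, parts[1] the original read
    ID, and every later part is a 'key:value' annotation.
    """
    parts = header.lstrip(">").strip().split("|")
    info = {"id": parts[0], "read": parts[1]}
    for part in parts[2:]:
        # Split on the first ':' only, so values may contain ':' or ';'.
        key, _, value = part.partition(":")
        info[key] = value
    return info


header = (
    ">e_coli_19|M02077:18:000000000-A88N7:1:1101:14587:1648 1:N:0:1"
    "|o:229|m/o:0.013100|MR:n=0;r1=3;r2=0|Q30:n/a|CO:0|mismatches:3"
)
info = parse_merged_header(header)

# Recompute the ratio from the parsed fields to compare with the 'm/o' field.
ratio = int(info["mismatches"]) / int(info["o"])
print(info["o"], info["mismatches"], f"{ratio:.6f}")
```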

--min-overlap-size  Minimum expected overlap. Default is 15.
--max-num-mismatches  Maximum number of mismatches at the overlapped region to retain the pair. The default behavior relies on the `-P` parameter and does not pay attention to the number of mismatches at the overlapped region.
-P  Any merged sequence with P below the declared value is discarded and stored in a separate file.
--min-qual-score  Minimum Q-score for a base to overwrite a mismatch at the overlapped region. If there is a mismatch at the overlapped region, the base with higher quality is being used in the final sequence. Alternatively, if the Q-score of the base with higher quality is lower than the Q-score declared with this parameter, that base is being marked as an ambiguous base, which may result in the elimination of the merged sequence depending on the --ignore-Ns parameter. The default value is 15.
--retain-only-overlap  When set, merger will only return the parts of reads that do overlap, and parts of reads that do not overlap will be trimmed.

The report is written to e_coli_STATS.


Merge only reads that overlap completely.

iu-merge-pairs e_coli.ini --marker-gene-stringent --retain-only-overlap --max-num-mismatches 0

For quality filtering, see the official page.

Reference

A Filtering Method to Generate High Quality Short Reads Using Illumina Paired-End Technology

A. Murat Eren , Joseph H. Vineis , Hilary G. Morrison, Mitchell L. Sogin

PLoS One. 2013 Jun 17;8(6):e66643

2017-08-24

GET_HOMOLOGUES-EST: finding core genes across multiple transcriptomes

RNA seq comparative genomics pan-genome ANI sequence clustering Applied and Environmental Microbiology 2013 2017 Frontiers in Plant Science docker plant

2018 9/27 Corrected an error in the references

2020 4/13 Added installation steps and help; revised the title

2020 4/14 Revised the installation steps

2020 5/27 Revised the title

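The pan-genome of a species is defined as the union of all the genes and non-coding sequences found in all its individuals. However, constructing a pan-genome for plants with large genomes is daunting both in sequencing cost and in the scale of the required computational analysis. A more affordable alternative is to focus on the genic repertoire by using transcriptomic data. Here, the software GET_HOMOLOGUES-EST was benchmarked with genomic and RNA-seq data from 19 Arabidopsis thaliana ecotypes and then applied to the analysis of transcripts from 16 Hordeum vulgare genotypes. The goal was to sample their pan-genomes and to classify sequences as core, if detected in all accessions, or accessory, if absent from some of them. The resulting sequence clusters were used to simulate pan-genome growth and to compile average nucleotide identity matrices that summarize intra-species variation. Although transcripts were found to underestimate pan-genome size by at least 10%, the authors concluded that clusters of expressed sequences can recapitulate phylogeny and reproduce two properties observed in A. thaliana gene models: accessory loci show lower expression and higher non-synonymous substitution rates than core genes. Finally, accessory sequences were observed to preferentially encode transposon components in both species, plus disease resistance genes in cultivated barley and a variety of protein domains from other families that appear frequently associated with presence/absence variation in the literature. These results demonstrate that pan-genome analyses are useful for exploring germplasm diversity.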

Manual

GET_HOMOLOGUES-EST

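Installation

I downloaded the macOS build from the releases page and tested it on macOS 10.14.

Source: GitHub

Binaries of GET_HOMOLOGUES-EST and GET_HOMOLOGUES for each platform can be downloaded from the releases page. After that, download and install the databases.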

cd get_homologues-macosx-20200226/
./install.pl

> ./get_homologues-est.pl

$ ./get_homologues-est.pl

usage: ./get_homologues-est.pl [options]

-h this message

-v print version, credits and checks installation

-d directory with input FASTA files (.fna , optionally .faa), (use of pre-clustered sequences

1 per sample, or subdirectories (subdir.clusters/subdir_) ignores -c)

with pre-clustered sequences (.faa/.fna ). Files matching

tag 'flcdna' are handled as full-length transcripts.

Allows for files to be added later.

Creates output folder named 'directory_est_homologues'

Optional parameters:

-o only run BLASTN/Pfam searches and exit (useful to pre-compute searches)

-i cluster redundant isoforms, including those that can be (min overlap, default: -i 40,

concatenated with no overhangs, and perform use -i 0 to disable)

calculations with longest

-c report transcriptome composition analysis (follows order in -I file if enforced,

with -t N skips clusters occup<N [OMCL],

ignores -r,-e)

-R set random seed for genome composition analysis (optional, requires -c, example -R 1234)

-s save memory by using BerkeleyDB; default parsing stores

sequence hits in RAM

-m runmode [local|cluster] (default: -m local)

-n nb of threads for BLASTN/HMMER/MCL in 'local' runmode (default=2)

-I file with .fna files in -d to be included (takes all by default, requires -d)

Algorithms instead of default bidirectional best-hits (BDBH):

-M use orthoMCL algorithm (OMCL, PubMed=12952885)

Options that control sequence similarity searches:

-C min %coverage of shortest sequence in BLAST alignments (range [1-100],default: -C 75)

-E max E-value (default: -E 1e-05 , max=0.01)

-D require equal Pfam domain composition (best with -m cluster or -n threads)

when defining similarity-based orthology

-S min %sequence identity in BLAST query/subj pairs (range [1-100],default: -S 95 [BDBH|OMCL])

-b compile core-transcriptome with minimum BLAST searches (ignores -c [BDBH])

Options that control clustering:

-t report sequence clusters including at least t taxa (default: t=numberOfTaxa,

t=0 reports all clusters [OMCL])

-L add redundant isoforms to clusters (optional, requires -i)

-r reference transcriptome .fna file (by default takes file with

least sequences; with BDBH sets

first taxa to start adding genes)

-e exclude clusters with inparalogues (by default inparalogues are

included)

-F orthoMCL inflation value (range [1-5], default: -F 1.5 [OMCL])

-A calculate average identity of clustered sequences, (optional, creates tab-separated matrix,

uses blastn results [OMCL])

-P calculate percentage of conserved sequences (POCS), (optional, creates tab-separated matrix,

uses blastn results, best with CDS [OMCL])

-z add soft-core to genome composition analysis (optional, requires -c [OMCL])

This program uses BLASTN/HMMER to define clusters of 'orthologous' transcripts

and pan/core-trancriptome sets. Different algorithm choices are available

and search parameters are customizable. It is designed to process (in a HPC computer

cluster) files contained in a directory (-d), so that new .fna/.faa files can be added

while conserving previous BLASTN/HMMER results. In general the program will try to re-use

previous results when run with the same input directory.

dockerhub

docker pull csicunam/get_homologues

#help
docker run --rm -itv $PWD:/data csicunam/get_homologues get_homologues-est.pl -h

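#Some R libraries are not installed in the image, so producing output such as heatmaps fails with an error.
I installed the following packages and committed the image again.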

docker run -it csicunam/get_homologues
> install.packages("gplots")
> install.packages("dendextend")
> install.packages("factoextra")
> quit(save = "no")
#check the container ID
docker ps -a
#commit
docker commit xxxxxxxx csicunam/get_homologues

Usage

Run on the test data. Specify the directory sample_transcripts_fasta.

./get_homologues-est.pl -d sample_transcripts_fasta

-d  directory with input FASTA files (.fna , optionally .faa)

$ ./get_homologues-est.pl -d sample_transcripts_fasta

# ./get_homologues-est.pl -d sample_transcripts_fasta -o 0 -i 40 -e 0 -r 0 -t all -c 0 -z 0 -I 0 -m local -n 2 -M 0 -C 75 -S 95 -E 1e-05 -F 1.5 -b 0 -s 0 -D 0 -R 0 -L 0 -A 0 -P 0

# version 26022020

# results_directory=/Users/kazu/Documents/get_homologues-macosx-20200226/sample_transcripts_fasta_est_homologues

# parameters: MAXEVALUEBLASTSEARCH=0.01 MAXPFAMSEQS=250 BATCHSIZE=1000 MINSEQLENGTH=20 MAXSEQLENGTH=25000

# checking input files...

# Esterel.trinity.fna.bz2 5892 median length = 506

# Franka.trinity.fna.bz2 6036 median length = 523

# Hs_Turkey-19-24.trinity.fna.bz2 6204 median length = 476

# flcdnas_Hnijo.fna.gz 28620 [full length sequences] median length = 1504

# 4 genomes, 46752 sequences

# taxa considered = 4 sequences = 46752 residues = 63954041

# mask=Esterel_alltaxa_algBDBH_e0_ (_algBDBH)

# running makeblastdb with /Users/kazu/Documents/get_homologues-macosx-20200226/sample_transcripts_fasta_est_homologues/Esterel.trinity.fna.bz2.nucl.fasta

# running makeblastdb with /Users/kazu/Documents/get_homologues-macosx-20200226/sample_transcripts_fasta_est_homologues/Franka.trinity.fna.bz2.nucl.fasta

# running makeblastdb with /Users/kazu/Documents/get_homologues-macosx-20200226/sample_transcripts_fasta_est_homologues/Hs_Turkey-19-24.trinity.fna.bz2.nucl.fasta

# running makeblastdb with /Users/kazu/Documents/get_homologues-macosx-20200226/sample_transcripts_fasta_est_homologues/flcdnas_Hnijo.fna.gz.nucl.fasta

# running BLAST searches ...

# done

# concatenating and sorting blast results...

# sorting _Esterel.trinity.fna.bz2.nucl results (2.5MB)

# sorting _Franka.trinity.fna.bz2.nucl results (2.1MB)

# sorting _Hs_Turkey-19-24.trinity.fna.bz2.nucl results (2.1MB)

# sorting _flcdnas_Hnijo.fna.gz.nucl results (11MB)

# done

# parsing blast result! (/Users/kazu/Documents/get_homologues-macosx-20200226/sample_transcripts_fasta_est_homologues/tmp/all.blast , 18MB)

# parsing file finished

# making temporary indexes required for clustering isoforms

# construct_taxa_indexes: number of taxa found = 4

# number of file addresses/BLAST queries = 4.7e+04

# clustering redundant isoforms in Esterel.trinity.fna.bz2.nucl

# Esterel.trinity.fna.bz2.nucl : 41 sequences

# clustering redundant isoforms in Franka.trinity.fna.bz2.nucl

# Franka.trinity.fna.bz2.nucl : 65 sequences

# clustering redundant isoforms in Hs_Turkey-19-24.trinity.fna.bz2.nucl

# Hs_Turkey-19-24.trinity.fna.bz2.nucl : 60 sequences

# clustering redundant isoforms in flcdnas_Hnijo.fna.gz.nucl

# flcdnas_Hnijo.fna.gz.nucl : 2298 sequences

# redundancy-filtering blast file

# created nr blast file

# parsing blast result! (/Users/kazu/Documents/get_homologues-macosx-20200226/sample_transcripts_fasta_est_homologues/tmp/all.blast.nr , 16MB)

# parsing file finished

# creating indexes, this might take some time (lines=2.09e+05) ...

# construct_taxa_indexes: number of taxa found = 4

# number of file addresses/BLAST queries = 4.4e+04

# clustering orthologous sequences

# clustering inparalogues in Esterel.trinity.fna.bz2.nucl (reference)

# 2611 sequences

# clustering inparalogues in Franka.trinity.fna.bz2.nucl

# 2057 sequences

# finding BDBHs between Esterel.trinity.fna.bz2.nucl and Franka.trinity.fna.bz2.nucl (1)

# 357 sequences

# clustering inparalogues in Hs_Turkey-19-24.trinity.fna.bz2.nucl

# 2331 sequences

# finding BDBHs between Esterel.trinity.fna.bz2.nucl and Hs_Turkey-19-24.trinity.fna.bz2.nucl (1)

# 307 sequences

# clustering inparalogues in flcdnas_Hnijo.fna.gz.nucl

# 5843 sequences

# finding BDBHs between Esterel.trinity.fna.bz2.nucl and flcdnas_Hnijo.fna.gz.nucl (1)

# 2006 sequences

# looking for valid sequence clusters (n_of_taxa=4)...

# number_of_clusters = 17

# cluster_list = sample_transcripts_fasta_est_homologues/Esterel_alltaxa_algBDBH_e0_.cluster_list

# cluster_directory = sample_transcripts_fasta_est_homologues/Esterel_alltaxa_algBDBH_e0_

# runtime: 137 wallclock secs (11.31 usr 0.58 sys + 94.95 cusr 12.17 csys = 119.01 CPU)

# RAM use: 139.3 MB
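The log is regular enough to check from a script. As a small sanity check, the sketch below (Python, with a few lines of the log above pasted in as a string) extracts the per-file sequence counts from the "checking input files" section and confirms they add up to the reported total. The regular expression is written against this particular log and is an assumption; other versions may format these lines differently.

```python
import re

# Lines copied from the "checking input files" part of the log above.
LOG = """\
# Esterel.trinity.fna.bz2 5892 median length = 506
# Franka.trinity.fna.bz2 6036 median length = 523
# Hs_Turkey-19-24.trinity.fna.bz2 6204 median length = 476
# flcdnas_Hnijo.fna.gz 28620 [full length sequences] median length = 1504
# 4 genomes, 46752 sequences
"""

# Matches "# <file> <count> [optional note] median length = <n>".
INPUT_LINE = re.compile(r"^# (\S+) (\d+) .*median length = (\d+)$")


def sequences_per_file(log_text):
    """Map each input file named in the log to its sequence count."""
    counts = {}
    for line in log_text.splitlines():
        match = INPUT_LINE.match(line)
        if match:
            counts[match.group(1)] = int(match.group(2))
    return counts


counts = sequences_per_file(LOG)
print(len(counts), sum(counts.values()))
```

The four counts sum to 46752, which agrees with the "4 genomes, 46752 sequences" line.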

Output

[Figure: screenshot of the output]

The description of GET_HOMOLOGUES has been moved to a separate post.

References
Analysis of Plant Pan-Genomes and Transcriptomes with GET_HOMOLOGUES-EST, a Clustering Solution for Sequences of the Same Species
Bruno Contreras-Moreira, Carlos P. Cantalapiedra, María J. García-Pereira, Sean P. Gordon, John P. Vogel, Ernesto Igartua, Ana M. Casas, Pablo Vinuesa

Front Plant Sci. 2017; 8: 184. Published online 2017 Feb 14

GET_HOMOLOGUES, a Versatile Software Package for Scalable and Robust Microbial Pangenome Analysis

Contreras-Moreira B, Vinuesa P

Appl Environ Microbiol. 2013 Dec;79(24):7696-701

http://journal.frontiersin.org/article/10.3389/fpls.2017.00184/full

2017-08-24

DIAMOND: a fast, BLAST-compatible homology search tool

BLAST fast tools protein search 2015 Nature Methods all versus all sequence comparison

2019 1/20 Added help and commands; 6/9 removed --max-target-seqs from the command examples; 7/19 addendum

2021 2/13 Added tweets

2022/04/07 Added installation notes; 07/22 added examples and updated help

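DIAMOND speeds up BLASTX-style searches through a carefully designed indexing scheme. It offers functionality equivalent to BLAST, and the paper claims speedups of up to 20,000-fold over BLAST. It is said to be especially fast when there are very many query sequences. The paper was published in Nature Methods in 2015.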

2021 2/13

DIAMOND v2.0.7 now supports full-matrix Smith Waterman extensions (vectorized using the SWIPE algorithm) and the new extended taxonomy mapping file from NCBI. https://t.co/YtVTQlDicf
Benjamin Buchfink (@bbuchfink), February 13, 2021

3/11

DIAMOND v2.0.8 now supports directly using BLAST databases, database reformatting no longer needed. https://t.co/YtVTQlDicf
Benjamin Buchfink (@bbuchfink), March 10, 2021

Manual

manual

ppt

https://www.donarmstrong.com/ld/dmnd2015/diamond_presentation_2015.pdf

Installation

Github

It can be installed with conda or brew.

#bioconda(link)
mamba install -c bioconda diamond

#also installable with brew (not tested)
brew install diamond

> diamond help #v2.0.13

$ diamond --help

diamond v2.0.13.151 (C) Max Planck Society for the Advancement of Science

Documentation, support and updates available at http://www.diamondsearch.org

Please cite: http://dx.doi.org/10.1038/s41592-021-01101-x Nature Methods (2021)

Syntax: diamond COMMAND [OPTIONS]

Commands:

makedb Build DIAMOND database from a FASTA file

blastp Align amino acid query sequences against a protein reference database

blastx Align DNA query sequences against a protein reference database

view View DIAMOND alignment archive (DAA) formatted file

help Produce help message

version Display version information

getseq Retrieve sequences from a DIAMOND database file

dbinfo Print information about a DIAMOND database file

test Run regression tests

makeidx Make database index

General options:

--threads (-p) number of CPU threads

--db (-d) database file

--out (-o) output file

--outfmt (-f) output format

0 = BLAST pairwise

5 = BLAST XML

6 = BLAST tabular

100 = DIAMOND alignment archive (DAA)

101 = SAM

Value 6 may be followed by a space-separated list of these keywords:

qseqid means Query Seq - id

qlen means Query sequence length

sseqid means Subject Seq - id

sallseqid means All subject Seq - id(s), separated by a ';'

slen means Subject sequence length

qstart means Start of alignment in query

qend means End of alignment in query

sstart means Start of alignment in subject

send means End of alignment in subject

qseq means Aligned part of query sequence

qseq_translated means Aligned part of query sequence (translated)

full_qseq means Query sequence

full_qseq_mate means Query sequence of the mate

sseq means Aligned part of subject sequence

full_sseq means Subject sequence

evalue means Expect value

bitscore means Bit score

score means Raw score

length means Alignment length

pident means Percentage of identical matches

nident means Number of identical matches

mismatch means Number of mismatches

positive means Number of positive - scoring matches

gapopen means Number of gap openings

gaps means Total number of gaps

ppos means Percentage of positive - scoring matches

qframe means Query frame

btop means Blast traceback operations(BTOP)

cigar means CIGAR string

staxids means unique Subject Taxonomy ID(s), separated by a ';'

(in numerical order)

sscinames means unique Subject Scientific Name(s), separated by a ';'

sskingdoms means unique Subject Super Kingdom(s), separated by a ';'

skingdoms means unique Subject Kingdom(s), separated by a ';'

sphylums means unique Subject Phylum(s), separated by a ';'

stitle means Subject Title

salltitles means All Subject Title(s), separated by a '<>'

qcovhsp means Query Coverage Per HSP

scovhsp means Subject Coverage Per HSP

qtitle means Query title

qqual means Query quality values for the aligned part of the query

full_qqual means Query quality values

qstrand means Query strand

Default: qseqid sseqid pident length mismatch gapopen qstart qend

sstart send evalue bitscore

--verbose (-v) verbose console output

--log enable debug log

--quiet disable console output

--header Write header lines to blast tabular format.

Makedb options:

--in input reference file in FASTA format

--taxonmap protein accession to taxid mapping file

--taxonnodes taxonomy nodes.dmp from NCBI

--taxonnames taxonomy names.dmp from NCBI

Aligner options:

--query (-q) input query file

--strand query strands to search (both/minus/plus)

--un file for unaligned queries

--al file or aligned queries

--unfmt format of unaligned query file (fasta/fastq)

--alfmt format of aligned query file (fasta/fastq)

--unal report unaligned queries (0=no, 1=yes)

--max-target-seqs (-k) maximum number of target sequences to report

alignments for (default=25)

--top report alignments within this percentage

range of top alignment score (overrides --max-target-seqs)

--max-hsps maximum number of HSPs per target sequence to

report for each query (default=1)

--range-culling restrict hit culling to overlapping query ranges

--compress compression for output files (0=none, 1=gzip, zstd)

--evalue (-e) maximum e-value to report alignments (default=0.001)

--min-score minimum bit score to report alignments

(overrides e-value setting)

--id minimum identity% to report an alignment

--query-cover minimum query cover% to report an alignment

--subject-cover minimum subject cover% to report an alignment

--fast enable fast mode

--mid-sensitive enable mid-sensitive mode

--sensitive enable sensitive mode)

--more-sensitive enable more sensitive mode

--very-sensitive enable very sensitive mode

--ultra-sensitive enable ultra sensitive mode

--iterate iterated search with increasing sensitivity

--global-ranking (-g) number of targets for global ranking

--block-size (-b) sequence block size in billions of letters

(default=2.0)

--index-chunks (-c) number of chunks for index processing (default=4)

--tmpdir (-t) directory for temporary files

--parallel-tmpdir directory for temporary files used by multiprocessing

--gapopen gap open penalty

--gapextend gap extension penalty

--frameshift (-F) frame shift penalty (default=disabled)

--long-reads short for --range-culling --top 10 -F 15

--matrix score matrix for protein alignment (default=BLOSUM62)

--custom-matrix file containing custom scoring matrix

--comp-based-stats composition based statistics mode (0-4)

--masking masking algorithm (none, seg, tantan=default)

--query-gencode genetic code to use to translate query (see

user manual)

--salltitles include full subject titles in DAA file

--sallseqid include all subject ids in DAA file

--no-self-hits suppress reporting of identical self hits

--taxonlist restrict search to list of taxon ids (comma-separated)

--taxon-exclude exclude list of taxon ids (comma-separated)

--seqidlist filter the database by list of accessions

--skip-missing-seqids ignore accessions missing in the database

Advanced options:

--algo Seed search algorithm

(0=double-indexed/1=query-indexed/ctg=contiguous-seed)

--bin number of query bins for seed search

--min-orf (-l) ignore translated sequences without an open

reading frame of at least this length

--seed-cut cutoff for seed complexity

--freq-masking mask seeds based on frequency

--freq-sd number of standard deviations for ignoring

frequent seeds

--motif-masking softmask abundant motifs (0/1)

--id2 minimum number of identities for stage 1 hit

--xdrop (-x) xdrop for ungapped alignment

--gapped-filter-evalue E-value threshold for gapped filter (auto)

--band band for dynamic programming computation

--shapes (-s) number of seed shapes (default=all available)

--shape-mask seed shapes

--multiprocessing enable distributed-memory parallel processing

--mp-init initialize multiprocessing run

--mp-recover enable continuation of interrupted multiprocessing run

--mp-query-chunk process only a single query chunk as specified

--ext-chunk-size chunk size for adaptive ranking (default=auto)

--no-ranking disable ranking heuristic

--ext Extension mode (banded-fast/banded-slow/full)

--culling-overlap minimum range overlap with higher scoring hit

to delete a hit (default=50%)

--taxon-k maximum number of targets to report per species

--range-cover percentage of query range to be covered for

range culling (default=50%)

--dbsize effective database size (in letters)

--no-auto-append disable auto appending of DAA and DMND file extensions

--xml-blord-format Use gnl|BL_ORD_ID| style format in XML output

--stop-match-score Set the match score of stop codons against each other.

--tantan-minMaskProb minimum repeat probability for masking (default=0.9)

--file-buffer-size file buffer size in bytes (default=67108864)

--memory-limit (-M) Memory limit for extension stage in GB

--no-unlink Do not unlink temporary files.

--target-indexed Enable target-indexed mode

--ignore-warnings Ignore warnings

View options:

--daa (-a) DIAMOND alignment archive (DAA) file

--forwardonly only show alignments of forward strand

Getseq options:

--seq Space-separated list of sequence numbers to display.

Online documentation at http://www.diamondsearch.org

Running

First, build the index (database) file from the amino acid sequences that will serve as the database.

diamond makedb --in input.faa -d nr

--in Path to the input protein reference database file in FASTA format (may be gzip compressed). If this parameter is omitted, the input will be read from stdin.
--taxonmap Path to mapping file that maps NCBI protein accession numbers to taxon ids (gzip compressed). This parameter is optional and needs to be supplied in order to provide taxonomy features. The file can be downloaded from NCBI: ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/prot.accession2taxid.gz.
--taxonnodes Path to the nodes.dmp file from the NCBI taxonomy. This parameter is optional and needs to be supplied in order to provide taxonomy features. The file is contained within this archive downloadable at NCBI: ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdmp.zip.

Run a homology search with blastx. The input is nucleotide sequences.

diamond blastx -d nr -q query.fna -o matches.m8

--threads Number of CPU threads. By default, the program will auto-detect and use all available virtual cores on the machine.

The output is tab-delimited.

user$ head matches.m8

gi|451813329|ref|NC_020286.1|:3569362-3569561,1-772 gi|451813330|ref|YP_007449782.1| 100.0 323 0 0 1 969 1 323 1.8e-179 622.1

gi|451813329|ref|NC_020286.1|:3569362-3569561,1-772 gi|451813441|ref|YP_007449893.1| 33.0 233 142 4 82 747 34 263 8.8e-25 108.2
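The tabular output is straightforward to post-process. Here is a minimal Python sketch that reads the 12 default columns into dictionaries and keeps hits below an e-value cutoff. It uses the two hits above as tab-separated test data; `read_hits` and its `max_evalue` argument are names made up for this example.

```python
import csv
import io

# Default column order of the tabular output.
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def read_hits(handle, max_evalue=1e-5):
    """Yield one dict per hit whose e-value is at most `max_evalue`."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        hit = dict(zip(COLUMNS, row))
        for key in ("pident", "evalue", "bitscore"):
            hit[key] = float(hit[key])
        if hit["evalue"] <= max_evalue:
            yield hit


QUERY = "gi|451813329|ref|NC_020286.1|:3569362-3569561,1-772"
example = "\n".join([
    "\t".join([QUERY, "gi|451813330|ref|YP_007449782.1|", "100.0", "323",
               "0", "0", "1", "969", "1", "323", "1.8e-179", "622.1"]),
    "\t".join([QUERY, "gi|451813441|ref|YP_007449893.1|", "33.0", "233",
               "142", "4", "82", "747", "34", "263", "8.8e-25", "108.2"]),
]) + "\n"

hits = list(read_hits(io.StringIO(example), max_evalue=1e-100))
print([hit["sseqid"] for hit in hits])
```

For a real run, pass `open("matches.m8")` instead of the `io.StringIO` object.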

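DIAMOND's default e-value cutoff (0.001) is far lower than BLAST's default (10), so by default it reports hits more stringently than BLAST. In addition, the default parameters are tuned for short reads, so adding --sensitive or --more-sensitive is recommended when the query sequences are long.

As described in the manual, add --outfmt 6 to get tabular output. The output fields can also be chosen freely by listing keywords such as the following after it.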

qseqid Query Seq - id
qlen Query sequence length
sseqid Subject Seq - id
sallseqid All subject Seq - id(s), separated by a ';'
slen Subject sequence length
qstart Start of alignment in query
qend End of alignment in query
sstart Start of alignment in subject
send End of alignment in subject
qseq Aligned part of query sequence
sseq Aligned part of subject sequence
full_sseq Full subject sequence
evalue Expect value
bitscore Bit score
score Raw score
length Alignment length
pident Percentage of identical matches
nident Number of identical matches
mismatch Number of mismatches
positive Number of positive - scoring matches
gapopen Number of gap openings
gaps Total number of gaps
ppos Percentage of positive - scoring matches
qframe Query frame
btop Blast traceback operations(BTOP)
staxids Unique Subject Taxonomy ID(s), separated by a ’;’ (in numerical order). This field requires setting the --taxonmap parameter for makedb.
salltitles All Subject Title(s), separated by a ’<>’
qcovhsp Query Coverage Per HSP
qtitle Query title

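By default, the following fields are output:

"qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore"

For example, run a blastx search with DIAMOND and output only qseqid, sseqid, and evalue in tabular format. The default for --max-target-seqs is 25, so it is increased here.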

diamond blastx --query input.fa \
 --db uniprot_ref_proteomes.diamond.dmnd \
 --outfmt 6 qseqid sseqid evalue \
 --sensitive \
 --max-target-seqs 1000 \
 > blast.out

Addendum

Filter with E-value ≤ 1e-100, amino acid identity ≥ 80%, and minimum length ≥ 300 aa.

diamond blastx --query input.fa \
 --db uniprot_ref_proteomes.diamond.dmnd \
 --outfmt 6 \
 --min-orf 300 --id 80 --evalue 1e-100 \
 > blast.out

To increase sensitivity, use the newly introduced --very-sensitive or --ultra-sensitive mode (see reference 2). The default for --max-target-seqs is only 25, so increase it.

diamond blastx --query input.fa \
 --db uniprot_ref_proteomes.diamond.dmnd \
 --evalue 1e-1 --very-sensitive \
 --max-target-seqs 100000 \
 --outfmt 6 \
 > blast.out

all versus all

All-against-all comparison of proteins.faa. The parameter settings are chosen to avoid partial hits (taken from the cited source).

diamond makedb --in proteins.faa --db protein_db
diamond blastp --query proteins.faa --db protein_db --out blastp.tsv \
--outfmt 6 --evalue 1e-5 --max-target-seqs 10000 --query-cover 50 \
--subject-cover 50

Example output

[Figure: example output]
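In an all-versus-all search every protein also hits itself, and each pair usually appears twice (A vs B and B vs A). The help output above lists a `--no-self-hits` option for the first issue. The Python sketch below is one possible way to handle both in post-processing. The protein IDs and scores are made up for illustration, and `unique_pairs` is a hypothetical helper, not part of DIAMOND.

```python
def unique_pairs(rows):
    """Collapse all-versus-all hits into unordered pairs, skipping self hits.

    `rows` is an iterable of (qseqid, sseqid, bitscore) tuples; the best
    bit score seen for each unordered pair is kept.
    """
    best = {}
    for query, subject, bitscore in rows:
        if query == subject:
            continue  # a protein always matches itself
        pair = tuple(sorted((query, subject)))
        if bitscore > best.get(pair, float("-inf")):
            best[pair] = bitscore
    return best


# Made-up hits standing in for the qseqid, sseqid and bitscore columns.
rows = [
    ("protA", "protA", 650.0),
    ("protA", "protB", 310.5),
    ("protB", "protA", 308.2),
    ("protB", "protB", 640.1),
]
print(unique_pairs(rows))
```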

References

1

Fast and sensitive protein alignment using DIAMOND
Benjamin Buchfink, Chao Xie & Daniel H. Huson

Nature Methods 12, 59–60 (2015) doi:10.1038/nmeth.3176

PDF

https://lemosbioinfo.files.wordpress.com/2016/11/nmeth-3176.pdf

2

Sensitive protein alignments at tree-of-life scale using DIAMOND

Benjamin Buchfink, Klaus Reuter & Hajk-Georg Drost
Nature Methods volume 18, pages 366–368 (2021)

Addendum

Question about max-target-seqs option #29

Question about max-target-seqs option · Issue #29 · bbuchfink/diamond · GitHub

The KMC website proposes a way to combine DIAMOND with KMC. For details, download the scripts from the KMC website and take a look.

http://sun.aei.polsl.pl/REFRESH/index.php?page=projects&project=kmc&subpage=download

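Addendum

Figure 1 of the paper compares run time and sensitivity when reads from various sequencers are searched against proteins. For some datasets the speedup exceeds 20,000-fold.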

https://lemosbioinfo.files.wordpress.com/2016/11/nmeth-3176.pdf

Addendum

For downloading local databases, the following is a useful reference.

GitHub - josuebarrera/GenEra: genEra is an easy-to-use, low-dependency command-line tool that estimates the age of the last common ancestor of protein-coding gene families.

Addendum

AC-DIAMOND

Addendum

ID convert

Getting a rough overview.

Using the NCBI BLAST databases.

2021 4/8

DIAMOND v2 is here! Check out this paper from @bbuchfink & @HajkDrost in our Department. Instead of "we blasted gazillions of genomes, which took several days" it will now be "we diamonded gazillions of genomes during the coffee break." https://t.co/THmCxR2aeo
Weigel Lab 🌱 aka WeigelWorld 🌱 (@PlantEvolution), April 7, 2021

We introduce two new sensitivity modes: -very-sensitive and -ultra-sensitive allowing users to match the alignment sensitivity levels of BLAST while maintaining superior computational speed up to 360x. #Bioinformatics #Genomics #Phylogenomics pic.twitter.com/4IvXfiDT7y
Hajk-Georg Drost (@HajkDrost), April 7, 2021

Together with an optimized HPC and cloud-computing infrastructure, DIAMOND can now scale with the demands of ongoing bulk-sequencing efforts and exponentially growing genome assembly databases to facilitate massive comparative genomics efforts. #ERGA pic.twitter.com/zPkuHooama
Hajk-Georg Drost (@HajkDrost), April 7, 2021

2017-08-24

Barrnap: quickly detecting SSU rRNA

RNA seq rRNA

2019 3/10 Revised the title

2019 5/30 Added installation instructions

2020 6/15 Corrected commands; added help

2020 6/29 Added examples

Barrnap is a tool that finds rRNA genes in genomes.

Search targets

bacteria (5S,23S,16S)
archaea (5S,5.8S,23S,16S)
mitochondria (12S,16S)
eukaryotes (5S,5.8S,28S,18S)

Installation

Source: GitHub

#bioconda (link)
mamba install -c bioconda barrnap -y

#homebrew
brew install barrnap

> barrnap -h

$ barrnap -h

Synopsis:

barrnap 0.9 - rapid ribosomal RNA prediction

Author:

Torsten Seemann

Usage:

barrnap [options] chr.fa

barrnap [options] < chr.fa

barrnap [options] - < chr.fa

Options:

--help This help

--version Print version and exit

--citation Print citation for referencing barrnap

--kingdom [X] Kingdom: euk mito bac arc (default 'bac')

--quiet No screen output (default OFF)

--threads [N] Number of threads/cores/CPUs to use (default '1')

--lencutoff [n.n] Proportional length threshold to label as partial (default '0.8')

--reject [n.n] Proportional length threshold to reject prediction (default '0.25')

--evalue [n.n] Similarity e-value cut-off (default '1e-06')

--incseq Include FASTA _input_ sequences in GFF3 output (default OFF)

--outseq [X] Save rRNA hit seqs to this FASTA file (default '')

Running

barrnap --threads 8 genome.fa > 16SrRNAs.gff

--threads Number of threads/cores/CPUs to use (default '1')

The default output is in GFF3 format.

Run with a kingdom specified. Here, eukaryotes.
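Since the result is plain GFF3 (nine tab-separated columns), the 16S hits can be pulled out with a few lines of Python. In the sketch below the two feature lines are invented for illustration. The attribute layout `Name=16S_rRNA;product=...` is an assumption, so compare it with the ninth column of your own output before relying on it. `rrna_features` is a hypothetical helper name.

```python
def rrna_features(gff_text, name=None):
    """Return (seqid, start, end, strand, Name) for each GFF3 feature line.

    If `name` is given, only features whose Name attribute equals it are kept.
    """
    features = []
    for line in gff_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and '##gff-version 3' style headers
        cols = line.split("\t")
        if len(cols) != 9:
            continue  # not a standard GFF3 feature line
        attrs = dict(item.split("=", 1) for item in cols[8].split(";") if "=" in item)
        if name is None or attrs.get("Name") == name:
            features.append((cols[0], int(cols[3]), int(cols[4]), cols[6], attrs.get("Name")))
    return features


# Invented example records; coordinates and contig name are placeholders.
example = (
    "##gff-version 3\n"
    "contig1\tbarrnap:0.9\trRNA\t1200\t2730\t0\t+\t.\tName=16S_rRNA;product=16S ribosomal RNA\n"
    "contig1\tbarrnap:0.9\trRNA\t3100\t6000\t0\t+\t.\tName=23S_rRNA;product=23S ribosomal RNA\n"
)
print(rrna_features(example, name="16S_rRNA"))
```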

barrnap --kingdom euk --threads 12 input.fa > euk_rRNA.gff

--kingdom Kingdom: euk mito bac arc (default 'bac')

Mitochondria.

barrnap --kingdom mito --threads 12 input.fa > mito_rRNA.gff

When introns are present, keep in mind that predictions may come out fragmented.

Also write the rRNA sequences to a separate file.

barrnap --outseq rRNAs.fa --threads 12 input.fa > rRNA.gff

Reference

https://github.com/tseemann/barrnap

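STAR: a fast RNA-seq mapping tool

RNA seq mapping fast tools 2013 Bioinformatics chimera transcript 2016 Current Protocols in Bioinformatics

2019 2/15 Added a video and installation via bioconda

2020 7/6 Added comments and help

2021 10/9 Added the option for gzipped FASTQ; 12/5 added a note on chimeric output

2024/02/20 Tidied up the information

STAR is a fast RNA-seq aligner. It supports split alignment across intron-exon junctions. It is reported to run more than 10 times faster than Bowtie 2, with mapping sensitivity and error rates comparable to existing tools.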

github

https://github.com/alexdobin/STAR

Manual

https://github.com/alexdobin/STAR/blob/master/doc/STARmanual.pdf

STAR: RNA-Seq Read Aligner

Installation

wget https://github.com/alexdobin/STAR/archive/2.5.3a.tar.gz 
tar -xzf 2.5.3a.tar.gz
cd STAR-2.5.3a/bin/MacOSX_x86_64/

#if you use Anaconda, it can be installed with conda
conda install -c bioconda -y star

> star --help

$ star --help

Usage: STAR [options]... --genomeDir /path/to/genome/index/ --readFilesIn R1.fq R2.fq

Spliced Transcripts Alignment to a Reference (c) Alexander Dobin, 2009-2019

For more details see:

<https://github.com/alexdobin/STAR>

<https://github.com/alexdobin/STAR/blob/master/doc/STARmanual.pdf>

### versions

versionGenome 2.7.1a

string: earliest genome index version compatible with this STAR release. Please do not change this value!

### Parameter Files

parametersFiles -

string: name of a user-defined parameters file, "-": none. Can only be defined on the command line.

### System

sysShell -

string: path to the shell binary, preferably bash, e.g. /bin/bash.

- ... the default shell is executed, typically /bin/sh. This was reported to fail on some Ubuntu systems - then you need to specify path to bash.

### Run Parameters

runMode alignReads

string: type of the run.

alignReads ... map reads

genomeGenerate ... generate genome files

inputAlignmentsFromBAM ... input alignments from BAM. Presently only works with --outWigType and --bamRemoveDuplicates.

liftOver ... lift-over of GTF files (--sjdbGTFfile) between genome assemblies using chain file(s) from --genomeChainFiles.

runThreadN 1

int: number of threads to run STAR

runDirPerm User_RWX

string: permissions for the directories created at the run-time.

User_RWX ... user-read/write/execute

All_RWX ... all-read/write/execute (same as chmod 777)

runRNGseed 777

int: random number generator seed.

### Genome Parameters

genomeDir ./GenomeDir/

string: path to the directory where genome files are stored (for --runMode alignReads) or will be generated (for --runMode genomeGenerate)

genomeLoad NoSharedMemory

string: mode of shared memory usage for the genome files. Only used with --runMode alignReads.

LoadAndKeep ... load genome into shared and keep it in memory after run

LoadAndRemove ... load genome into shared but remove it after run

LoadAndExit ... load genome into shared memory and exit, keeping the genome in memory for future runs

Remove ... do not map anything, just remove loaded genome from memory

NoSharedMemory ... do not use shared memory, each job will have its own private copy of the genome

genomeFastaFiles -

string(s): path(s) to the fasta files with the genome sequences, separated by spaces. These files should be plain text FASTA files, they *cannot* be zipped.

Required for the genome generation (--runMode genomeGenerate). Can also be used in the mapping (--runMode alignReads) to add extra (new) sequences to the genome (e.g. spike-ins).

genomeChainFiles -

string: chain files for genomic liftover. Only used with --runMode liftOver .

genomeFileSizes 0

uint(s)>0: genome files exact sizes in bytes. Typically, this should not be defined by the user.

genomeConsensusFile -

string: VCF file with consensus SNPs (i.e. alternative allele is the major (AF>0.5) allele)

### Genome Indexing Parameters - only used with --runMode genomeGenerate

genomeChrBinNbits 18

int: =log2(chrBin), where chrBin is the size of the bins for genome storage: each chromosome will occupy an integer number of bins. For a genome with large number of contigs, it is recommended to scale this parameter as min(18, log2[max(GenomeLength/NumberOfReferences,ReadLength)]).

genomeSAindexNbases 14

int: length (bases) of the SA pre-indexing string. Typically between 10 and 15. Longer strings will use much more memory, but allow faster searches. For small genomes, the parameter --genomeSAindexNbases must be scaled down to min(14, log2(GenomeLength)/2 - 1).

genomeSAsparseD 1

int>0: suffix array sparsity, i.e. distance between indices: use bigger numbers to decrease needed RAM at the cost of mapping speed reduction

genomeSuffixLengthMax -1

int: maximum length of the suffixes, has to be longer than read length. -1 = infinite.
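The two scaling rules quoted in this block (for --genomeChrBinNbits and --genomeSAindexNbases) are easy to get wrong by hand, so here is a small sketch that evaluates them. The function names are my own, and rounding down to an integer is an assumption; the help text does not say how to round.

```python
import math


def genome_sa_index_nbases(genome_length):
    """min(14, log2(GenomeLength)/2 - 1), rounded down (rounding is an assumption)."""
    return min(14, int(math.log2(genome_length) / 2 - 1))


def genome_chr_bin_nbits(genome_length, n_references, read_length):
    """min(18, log2[max(GenomeLength/NumberOfReferences, ReadLength)]), rounded down."""
    return min(18, int(math.log2(max(genome_length / n_references, read_length))))


# A ~4.6 Mb bacterial genome needs a smaller pre-indexing string than the default 14.
print(genome_sa_index_nbases(4_600_000))
# A 100 Mb draft assembly split into 50,000 contigs, mapped with 100-bp reads.
print(genome_chr_bin_nbits(100_000_000, 50_000, 100))
```

The resulting values would be passed as `--genomeSAindexNbases` and `--genomeChrBinNbits` when running with `--runMode genomeGenerate`.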

### Splice Junctions Database

sjdbFileChrStartEnd -

string(s): path to the files with genomic coordinates (chr <tab> start <tab> end <tab> strand) for the splice junction introns. Multiple files can be supplied and will be concatenated.

sjdbGTFfile -

string: path to the GTF file with annotations

sjdbGTFchrPrefix -

string: prefix for chromosome names in a GTF file (e.g. 'chr' for using ENSEMBL annotations with UCSC genomes)

sjdbGTFfeatureExon exon

string: feature type in GTF file to be used as exons for building transcripts

sjdbGTFtagExonParentTranscript transcript_id

string: GTF attribute name for parent transcript ID (default "transcript_id" works for GTF files)

sjdbGTFtagExonParentGene gene_id

string: GTF attribute name for parent gene ID (default "gene_id" works for GTF files)

sjdbGTFtagExonParentGeneName gene_name

string(s): GTF attribute name for parent gene name

sjdbGTFtagExonParentGeneType gene_type gene_biotype

string(s): GTF attribute name for parent gene type

sjdbOverhang 100

int>0: length of the donor/acceptor sequence on each side of the junctions, ideally = (mate_length - 1)

sjdbScore 2

int: extra alignment score for alignments that cross database junctions

sjdbInsertSave Basic

string: which files to save when sjdb junctions are inserted on the fly at the mapping step

Basic ... only small junction / transcript files

All ... all files including big Genome, SA and SAindex - this will create a complete genome directory
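As a sketch of how the genome and splice-junction options fit together, the snippet below assembles a genome generation command, setting --sjdbOverhang to mate_length - 1 as the help text suggests. It only uses options that appear in the help output above; the file names (genome.fa, genes.gtf, star_index) and the helper function are hypothetical placeholders.

```python
import shlex


def genome_generate_cmd(genome_dir, fasta, gtf, mate_length, threads=4):
    """Build a STAR argument list for --runMode genomeGenerate."""
    return [
        "STAR",
        "--runMode", "genomeGenerate",
        "--genomeDir", genome_dir,
        "--genomeFastaFiles", fasta,          # plain-text FASTA, not compressed
        "--sjdbGTFfile", gtf,
        "--sjdbOverhang", str(mate_length - 1),  # ideally mate_length - 1
        "--runThreadN", str(threads),
    ]


cmd = genome_generate_cmd("star_index", "genome.fa", "genes.gtf", mate_length=101)
print(" ".join(shlex.quote(part) for part in cmd))
```

The list can be handed to `subprocess.run(cmd, check=True)` once STAR is on the PATH; printing it first makes it easy to check the command before a long indexing run.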

### Variation parameters

varVCFfile -

string: path to the VCF file that contains variation data.

### Input Files

inputBAMfile -

string: path to BAM input file, to be used with --runMode inputAlignmentsFromBAM

### Read Parameters

readFilesType Fastx

string: format of input read files

Fastx ... FASTA or FASTQ

SAM SE ... SAM or BAM single-end reads; for BAM use --readFilesCommand samtools view

SAM PE ... SAM or BAM paired-end reads; for BAM use --readFilesCommand samtools view

readFilesIn Read1 Read2

string(s): paths to files that contain input read1 (and, if needed, read2)

readFilesPrefix -

string: prefix for the read files names, i.e. it will be added in front of the strings in --readFilesIn

-: no prefix

readFilesCommand -

string(s): command line to execute for each of the input file. This command should generate FASTA or FASTQ text and send it to stdout

For example: zcat - to uncompress .gz files, bzcat - to uncompress .bz2 files, etc.

readMapNumber -1

int: number of reads to map from the beginning of the file

-1: map all reads

readMatesLengthsIn NotEqual

string: Equal/NotEqual - lengths of names,sequences,qualities for both mates are the same / not the same. NotEqual is safe in all situations.

readNameSeparator /

string(s): character(s) separating the part of the read names that will be trimmed in output (read name after space is always trimmed)

readQualityScoreBase 33

int>=0: number to be subtracted from the ASCII code to get Phred quality score

clip3pNbases 0

int(s): number(s) of bases to clip from 3p of each mate. If one value is given, it will be assumed the same for both mates.

clip5pNbases 0

int(s): number(s) of bases to clip from 5p of each mate. If one value is given, it will be assumed the same for both mates.

clip3pAdapterSeq -

string(s): adapter sequences to clip from 3p of each mate. If one value is given, it will be assumed the same for both mates.

clip3pAdapterMMp 0.1

double(s): max proportion of mismatches for 3p adapter clipping for each mate. If one value is given, it will be assumed the same for both mates.

clip3pAfterAdapterNbases 0

int(s): number of bases to clip from 3p of each mate after the adapter clipping. If one value is given, it will be assumed the same for both mates.
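The read options above can be combined into a mapping command. The sketch below picks --readFilesCommand from the file extension (zcat for .gz, bzcat for .bz2, as in the help text) and leaves it out for uncompressed input. The helper and file names are hypothetical. On macOS, `gunzip -c` is often used in place of `zcat`, so treat the choice of decompression command as an assumption to verify on your system.

```python
def align_cmd(genome_dir, read1, read2=None, threads=4, prefix="./"):
    """Build a STAR argument list for --runMode alignReads."""
    reads = [read1] + ([read2] if read2 else [])
    cmd = [
        "STAR",
        "--runMode", "alignReads",
        "--genomeDir", genome_dir,
        "--runThreadN", str(threads),
        "--outFileNamePrefix", prefix,
        "--readFilesIn", *reads,
    ]
    # Compressed input needs a command that writes plain FASTA/FASTQ to stdout.
    if all(r.endswith(".gz") for r in reads):
        cmd += ["--readFilesCommand", "zcat"]
    elif all(r.endswith(".bz2") for r in reads):
        cmd += ["--readFilesCommand", "bzcat"]
    return cmd


paired_gz = align_cmd("star_index", "sample_R1.fastq.gz", "sample_R2.fastq.gz")
single_plain = align_cmd("star_index", "sample.fastq")
print(" ".join(paired_gz))
print(" ".join(single_plain))
```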

### Limits

limitGenomeGenerateRAM 31000000000

int>0: maximum available RAM (bytes) for genome generation

limitIObufferSize 150000000

int>0: max available buffers size (bytes) for input/output, per thread

limitOutSAMoneReadBytes 100000

int>0: max size of the SAM record (bytes) for one read. Recommended value: > 2*(LengthMate1+LengthMate2+100)*outFilterMultimapNmax

limitOutSJoneRead 1000

int>0: max number of junctions for one read (including all multi-mappers)

limitOutSJcollapsed 1000000

int>0: max number of collapsed junctions

limitBAMsortRAM 0

int>=0: maximum available RAM (bytes) for sorting BAM. If =0, it will be set to the genome index size. 0 value can only be used with --genomeLoad NoSharedMemory option.

limitSjdbInsertNsj 1000000

int>=0: maximum number of junction to be inserted to the genome on the fly at the mapping stage, including those from annotations and those detected in the 1st step of the 2-pass run

limitNreadsSoft -1

int: soft limit on the number of reads
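For --limitOutSAMoneReadBytes, the recommended lower bound in the help text is simple arithmetic on the mate lengths and --outFilterMultimapNmax. A minimal sketch (the function name is my own) that shows when the default of 100000 stops being enough:

```python
DEFAULT_LIMIT_OUT_SAM_ONE_READ_BYTES = 100000


def min_out_sam_one_read_bytes(len_mate1, len_mate2, multimap_nmax=10):
    """Lower bound from the help text: 2*(LengthMate1+LengthMate2+100)*outFilterMultimapNmax."""
    return 2 * (len_mate1 + len_mate2 + 100) * multimap_nmax


for nmax in (10, 500):
    bound = min_out_sam_one_read_bytes(150, 150, nmax)
    needs_raise = bound >= DEFAULT_LIMIT_OUT_SAM_ONE_READ_BYTES
    print(f"outFilterMultimapNmax={nmax}: bound={bound}, raise limit: {needs_raise}")
```

With 2x150-bp reads the default is comfortable at --outFilterMultimapNmax 10, but a large multimapper limit pushes the bound past it.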

### Output: general

outFileNamePrefix ./

string: output files name prefix (including full or relative path). Can only be defined on the command line.

outTmpDir -

string: path to a directory that will be used as temporary by STAR. All contents of this directory will be removed!

- the temp directory will default to outFileNamePrefix_STARtmp

outTmpKeep None

string: whether to keep the temporary files after the STAR run is finished

None ... remove all temporary files

All ... keep all files

outStd Log

string: which output will be directed to stdout (standard out)

Log ... log messages

SAM ... alignments in SAM format (which normally are output to Aligned.out.sam file), normal standard output will go into Log.std.out

BAM_Unsorted ... alignments in BAM format, unsorted. Requires --outSAMtype BAM Unsorted

BAM_SortedByCoordinate ... alignments in BAM format, sorted by coordinate. Requires --outSAMtype BAM SortedByCoordinate

BAM_Quant ... alignments to transcriptome in BAM format, unsorted. Requires --quantMode TranscriptomeSAM

outReadsUnmapped None

string: output of unmapped and partially mapped (i.e. mapped only one mate of a paired end read) reads in separate file(s).

None ... no output

Fastx ... output in separate fasta/fastq files, Unmapped.out.mate1/2

outQSconversionAdd 0

int: add this number to the quality score (e.g. to convert from Illumina to Sanger, use -31)

outMultimapperOrder Old_2.4

string: order of multimapping alignments in the output files

Old_2.4 ... quasi-random order used before 2.5.0

Random ... random order of alignments for each multi-mapper. Read mates (pairs) are always adjacent, all alignment for each read stay together. This option will become default in the future releases.
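To make --readQualityScoreBase and --outQSconversionAdd concrete: a quality character is turned into a Phred score by subtracting the base from its ASCII code, and adding -31 to each character converts old Illumina (Phred+64) strings to Sanger (Phred+33). The snippet below is a stand-alone illustration with a made-up quality string, not STAR's own code.

```python
def phred_scores(quality, base=33):
    """Phred score = ASCII code minus the quality score base."""
    return [ord(c) - base for c in quality]


def shift_quality(quality, add):
    """Shift every quality character by `add`, as --outQSconversionAdd is described to do."""
    return "".join(chr(ord(c) + add) for c in quality)


illumina64 = "hhhh^B"                     # made-up Phred+64 quality string
sanger33 = shift_quality(illumina64, -31)  # 64 - 33 = 31

print(sanger33)
print(phred_scores(illumina64, base=64))
print(phred_scores(sanger33, base=33))
```

Both `phred_scores` calls print the same list, which is the point: the shift changes the encoding, not the scores.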

### Output: SAM and BAM

outSAMtype SAM

strings: type of SAM/BAM output

1st word:

BAM ... output BAM without sorting

SAM ... output SAM without sorting

None ... no SAM/BAM output

2nd, 3rd:

Unsorted ... standard unsorted

SortedByCoordinate ... sorted by coordinate. This option will allocate extra memory for sorting which can be specified by --limitBAMsortRAM.

outSAMmode Full

string: mode of SAM output

None ... no SAM output

Full ... full SAM output

NoQS ... full SAM but without quality scores

outSAMstrandField None

string: Cufflinks-like strand field flag

None ... not used

intronMotif ... strand derived from the intron motif. Reads with inconsistent and/or non-canonical introns are filtered out.

outSAMattributes Standard

string: a string of desired SAM attributes, in the order desired for the output SAM

NH HI AS nM NM MD jM jI XS MC ch ... any combination in any order

None ... no attributes

Standard ... NH HI AS nM

All ... NH HI AS nM NM MD jM jI MC ch

vA ... variant allele

vG ... genomic coordinate of the variant overlapped by the read

vW ... 0/1 - alignment does not pass / passes WASP filtering. Requires --waspOutputMode SAMtag

STARsolo:

CR CY UR UY ... sequences and quality scores of cell barcodes and UMIs for the solo* demultiplexing

CB UB ... error-corrected cell barcodes and UMIs for solo* demultiplexing. Requires --outSAMtype BAM SortedByCoordinate.

sM ... assessment of CB and UMI

sS ... sequence of the entire barcode (CB,UMI,adapter...)

sQ ... quality of the entire barcode

Unsupported/undocumented:

rB ... alignment block read/genomic coordinates

vR ... read coordinate of the variant

outSAMattrIHstart 1

int>=0: start value for the IH attribute. 0 may be required by some downstream software, such as Cufflinks or StringTie.

outSAMunmapped None

string(s): output of unmapped reads in the SAM format

1st word:

None ... no output

Within ... output unmapped reads within the main SAM file (i.e. Aligned.out.sam)

2nd word:

KeepPairs ... record unmapped mate for each alignment, and, in case of unsorted output, keep it adjacent to its mapped mate. Only affects multi-mapping reads.

outSAMorder Paired

string: type of sorting for the SAM output

Paired: one mate after the other for all paired alignments

PairedKeepInputOrder: one mate after the other for all paired alignments, the order is kept the same as in the input FASTQ files

outSAMprimaryFlag OneBestScore

string: which alignments are considered primary - all others will be marked with 0x100 bit in the FLAG

OneBestScore ... only one alignment with the best score is primary

AllBestScore ... all alignments with the best score are primary

outSAMreadID Standard

string: read ID record type

Standard ... first word (until space) from the FASTx read ID line, removing /1,/2 from the end

Number ... read number (index) in the FASTx file

outSAMmapqUnique 255

int: 0 to 255: the MAPQ value for unique mappers

outSAMflagOR 0

int: 0 to 65535: sam FLAG will be bitwise OR'd with this value, i.e. FLAG=FLAG | outSAMflagOR. This is applied after all flags have been set by STAR, and after outSAMflagAND. Can be used to set specific bits that are not set otherwise.

outSAMflagAND 65535

int: 0 to 65535: sam FLAG will be bitwise AND'd with this value, i.e. FLAG=FLAG & outSAMflagAND. This is applied after all flags have been set by STAR, but before outSAMflagOR. Can be used to unset specific bits.

outSAMattrRGline -

string(s): SAM/BAM read group line. The first word contains the read group identifier and must start with "ID:", e.g. --outSAMattrRGline ID:xxx CN:yy "DS:z z z".

xxx will be added as RG tag to each output alignment. Any spaces in the tag values have to be double quoted.

Comma separated RG lines correspond to different (comma separated) input files in --readFilesIn. Commas have to be surrounded by spaces, e.g.

--outSAMattrRGline ID:xxx , ID:zzz "DS:z z" , ID:yyy DS:yyyy

outSAMheaderHD -

strings: @HD (header) line of the SAM header

outSAMheaderPG -

strings: extra @PG (software) line of the SAM header (in addition to STAR)

outSAMheaderCommentFile -

string: path to the file with @CO (comment) lines of the SAM header

outSAMfilter None

string(s): filter the output into main SAM/BAM files

KeepOnlyAddedReferences ... only keep the reads for which all alignments are to the extra reference sequences added with --genomeFastaFiles at the mapping stage.

KeepAllAddedReferences ... keep all alignments to the extra reference sequences added with --genomeFastaFiles at the mapping stage.

outSAMmultNmax -1

int: max number of multiple alignments for a read that will be output to the SAM/BAM files.

-1 ... all alignments (up to --outFilterMultimapNmax) will be output

outSAMtlen 1

int: calculation method for the TLEN field in the SAM/BAM files

1 ... leftmost base of the (+)strand mate to rightmost base of the (-)mate. (+)sign for the (+)strand mate

2 ... leftmost base of any mate to rightmost base of any mate. (+)sign for the mate with the leftmost base. This is different from 1 for overlapping mates with protruding ends

outBAMcompression 1

int: -1 to 10 BAM compression level, -1=default compression (6?), 0=no compression, 10=maximum compression

outBAMsortingThreadN 0

int: >=0: number of threads for BAM sorting. 0 will default to min(6,--runThreadN).

outBAMsortingBinsN 50

int: >0: number of genome bins for coordinate-sorting
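The order of --outSAMflagAND and --outSAMflagOR described above (AND first, then OR) can be written out in a couple of lines. This is just the bit arithmetic from the help text, with example FLAG values chosen for illustration.

```python
def apply_flag_masks(flag, flag_and=65535, flag_or=0):
    """Apply outSAMflagAND first, then outSAMflagOR, as the help text describes."""
    return (flag & flag_and) | flag_or


SECONDARY = 0x100   # "not primary alignment" bit
DUPLICATE = 0x400   # "PCR or optical duplicate" bit

flag = 355  # paired, proper pair, mate reverse, first in pair, secondary
cleared = apply_flag_masks(flag, flag_and=65535 - SECONDARY)  # unset 0x100
marked = apply_flag_masks(cleared, flag_or=DUPLICATE)         # set 0x400

print(flag, cleared, marked)
```

With the defaults (AND 65535, OR 0) the FLAG passes through unchanged.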

### BAM processing

bamRemoveDuplicatesType -

string: mark duplicates in the BAM file, for now only works with (i) sorted BAM fed with inputBAMfile, and (ii) for paired-end alignments only

- ... no duplicate removal/marking

UniqueIdentical ... mark all multimappers, and duplicate unique mappers. The coordinates, FLAG, CIGAR must be identical

UniqueIdenticalNotMulti ... mark duplicate unique mappers but not multimappers.

bamRemoveDuplicatesMate2basesN 0

int>0: number of bases from the 5' of mate 2 to use in collapsing (e.g. for RAMPAGE)

### Output Wiggle

outWigType None

string(s): type of signal output, e.g. "bedGraph" OR "bedGraph read1_5p". Requires sorted BAM: --outSAMtype BAM SortedByCoordinate .

1st word:

None ... no signal output

bedGraph ... bedGraph format

wiggle ... wiggle format

2nd word:

read1_5p ... signal from only 5' of the 1st read, useful for CAGE/RAMPAGE etc

read2 ... signal from only 2nd read

outWigStrand Stranded

string: strandedness of wiggle/bedGraph output

Stranded ... separate strands, str1 and str2

Unstranded ... collapsed strands

outWigReferencesPrefix -

string: prefix matching reference names to include in the output wiggle file, e.g. "chr", default "-" - include all references

outWigNorm RPM

string: type of normalization for the signal

RPM ... reads per million of mapped reads

None ... no normalization, "raw" counts

### Output Filtering

outFilterType Normal

string: type of filtering

Normal ... standard filtering using only current alignment

BySJout ... keep only those reads that contain junctions that passed filtering into SJ.out.tab

outFilterMultimapScoreRange 1

int: the score range below the maximum score for multimapping alignments

outFilterMultimapNmax 10

int: maximum number of loci the read is allowed to map to. Alignments (all of them) will be output only if the read maps to no more loci than this value.

Otherwise no alignments will be output, and the read will be counted as "mapped to too many loci" in the Log.final.out .

outFilterMismatchNmax 10

int: alignment will be output only if it has no more mismatches than this value.

outFilterMismatchNoverLmax 0.3

real: alignment will be output only if its ratio of mismatches to *mapped* length is less than or equal to this value.

outFilterMismatchNoverReadLmax 1.0

real: alignment will be output only if its ratio of mismatches to *read* length is less than or equal to this value.

outFilterScoreMin 0

int: alignment will be output only if its score is higher than or equal to this value.

outFilterScoreMinOverLread 0.66

real: same as outFilterScoreMin, but normalized to read length (sum of mates' lengths for paired-end reads)

outFilterMatchNmin 0

int: alignment will be output only if the number of matched bases is higher than or equal to this value.

outFilterMatchNminOverLread 0.66

real: same as outFilterMatchNmin, but normalized to the read length (sum of mates' lengths for paired-end reads).

outFilterIntronMotifs None

string: filter alignment using their motifs

None ... no filtering

RemoveNoncanonical ... filter out alignments that contain non-canonical junctions

RemoveNoncanonicalUnannotated ... filter out alignments that contain non-canonical unannotated junctions when using annotated splice junctions database. The annotated non-canonical junctions will be kept.

outFilterIntronStrands RemoveInconsistentStrands

string: filter alignments

RemoveInconsistentStrands ... remove alignments that have junctions with inconsistent strands

None ... no filtering
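The numeric thresholds in this block act together: an alignment is kept only if it satisfies every one of them. The sketch below is a simplified reading of the help text using the default values listed above. It is not STAR's implementation, the `Alignment` fields are hypothetical, and the multimapper and intron-motif filters are left out.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    score: int
    n_mismatch: int
    n_matched: int
    mapped_length: int
    read_length: int  # sum of mates' lengths for paired-end reads


# Default values as listed in the help output above.
DEFAULTS = {
    "outFilterMismatchNmax": 10,
    "outFilterMismatchNoverLmax": 0.3,
    "outFilterMismatchNoverReadLmax": 1.0,
    "outFilterScoreMin": 0,
    "outFilterScoreMinOverLread": 0.66,
    "outFilterMatchNmin": 0,
    "outFilterMatchNminOverLread": 0.66,
}


def passes_filters(aln, p=DEFAULTS):
    """True if the alignment meets all mismatch, score and match thresholds."""
    return (
        aln.n_mismatch <= p["outFilterMismatchNmax"]
        and aln.n_mismatch <= p["outFilterMismatchNoverLmax"] * aln.mapped_length
        and aln.n_mismatch <= p["outFilterMismatchNoverReadLmax"] * aln.read_length
        and aln.score >= p["outFilterScoreMin"]
        and aln.score >= p["outFilterScoreMinOverLread"] * aln.read_length
        and aln.n_matched >= p["outFilterMatchNmin"]
        and aln.n_matched >= p["outFilterMatchNminOverLread"] * aln.read_length
    )


good = Alignment(score=180, n_mismatch=4, n_matched=190, mapped_length=194, read_length=200)
short = Alignment(score=120, n_mismatch=2, n_matched=125, mapped_length=127, read_length=200)
print(passes_filters(good), passes_filters(short))
```

The second alignment fails because only 125 of 200 bases are matched, below the 0.66 fraction required by the defaults. That is the usual reason short or heavily soft-clipped alignments are reported as "too short" by STAR.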

### Output Filtering: Splice Junctions

outSJfilterReads All

string: which reads to consider for collapsed splice junctions output

All: all reads, unique- and multi-mappers

Unique: uniquely mapping reads only

outSJfilterOverhangMin 30 12 12 12

4 integers: minimum overhang length for splice junctions on both sides for: (1) non-canonical motifs, (2) GT/AG and CT/AC motif, (3) GC/AG and CT/GC motif, (4) AT/AC and GT/AT motif. -1 means no output for that motif