macでインフォマティクス

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

fastqの操作ツール illumina-utils

 

illumina-utilsはpythonで記述されたilluminaのシーケンスデータのユーティリティツール。オーバーラップしたペアリードのmergeやクオリティフィルタリングを行うことができる。

 

インストール

Github

sudo pip install illumina-utils

 

実行方法

raw fastqのdemulitiplexing。

ランにはindexの情報が載ったテキストファイルが必要。フォーマットについては公式の例を参照。

iu-demultiplex -s barcode_to_sample.txt --r1 r1.fastq --r2 r2.fastq --index index.fastq -o output/

illuminaの提供しているdemulitiplexツールはbcl2fastq(リンク)。 

 

 

ほとんどのコマンドのランにはコンフィグファイルが必要。以下のようなファイル(input.txt)を元にコンフィグファイルを作成することができる。

sample	r1	r2
e_coli	ecoli-R1.fastq	ecoli-R2.fastq

コンフィグファイルの作成。

iu-gen-configs input.txt

この例だとe_coli.iniができる。中身を確認する。

user$ cat e_coli.ini

[general]

project_name = e_coli

researcher_email = u@example.edu

input_directory = /Users/user/Documents/test/

output_directory = /Users/use/Documents/test/

 

[files]

pair_1 = ecoli-R1.fastq

pair_2 = ecoli-R2.fastq

 

 

 

オーバーラップがあるペアリードのマージ。先ほどのconfigファイルを指定する。

iu-merge-pairs e_coli.ini

ランにはそれなりの時間がかかる(macbook pro corei7モデルだと300-bpの200万x2リードの処理1時間以上)。マージが終わると、マージが成功した配列と失敗した配列が別々のファイルで出力される。また、レポートも出力される。

マージされたfastqは、ヘッダーにどのような状態でマージされたかを示す情報がつく。クオリティ行は削られfastaフォーマットになっている。

user$ head -4 e_coli_MERGED 

>e_coli_19|M02077:18:000000000-A88N7:1:1101:14587:1648 1:N:0:1|o:229|m/o:0.013100|MR:n=0;r1=3;r2=0|Q30:n/a|CO:0|mismatches:3

aattaaaaaaacttggctggcaatatgttcctggctgtttggaagatcaacctgttcctactgatccactgaTtactgaacgggaaagttttaagcagattcttattaagccccgacttcagcaagcCcttaagcgCattaatctgactgatgatggagagccatggctagatgactaccaaattgagtcggccatttcccagctagagcgggctgtcaccaccgaaaagctgatcgaagccaatcaactcatcacagaactgctctggaatggtgtgactgtattcgttcccaatggtaaagatgaaattgttcagttcattgatttcgagaatattgagcagaatgatttccttgccatcaatcaatatg

>e_coli_28|M02077:18:000000000-A88N7:1:1101:20129:1679 1:N:0:1|o:223|m/o:0.026906|MR:n=0;r1=5;r2=1|Q30:n/a|CO:0|mismatches:6

agaagctatcgctgaattttcccccctggaacaggaggacgttatgcagttaacaaccagttggatgcttcagggcattgaacagggcatCgaacgtggacaaaaatctctgctactcaaacaaattagacatcgTtttggagagttgaatgcagttaatTtgtcgaggattgacattctcaCagtgcctcaattggaacagttggGagaagtgctgttggactgctctgatttcgcagaattagaacaatggctagcggcccaatCtgaaacacctgagcgaaaaatttagtgaatttccagaggtggatgttgccatgaatgttcccaaagaatcaggtttcaaccatcggcttacaattccaaactgagttgata

  • --min-overlap-size Minimum expected overlap. Default is 15.
  • --max-num-mismatches Maximum number of mismatches at the overlapped region to retain the pair. The default behavior relies on `-P` parameter an does not pay attention to the number of mismatches at the overlapped region.
  • -P Any merged sequence with P below the declared value is discarded and stored in a seperate file.
  • --min-qual-score Minimum Q-score for a base to overwrite a mismatch at the overlapped region. If there is a mismatch at the overlapped region, the base with higher quality is being used in the final sequence. Alternatively, if the Q-score of the base with higher quality is lower than the Q-score declared with this parameter, that base is being marked as an ambiguous base, which may result in the elimination of the merged sequence depending on the --ignore-Ns paranmeter. The default value is 15.
  • --retain-only-overlap When set, merger will only return the parts of reads that do overlap, and parts of reads that do not overlap will be trimmed.

 e_coli_STATSにはレポートが出力される。

f:id:kazumaxneo:20170825200415j:plain

 

 

完全にオーバーラップしているリードだけマージする。

iu-merge-pairs e_coli.ini --marker-gene-stringent --retain-only-overlap --max-num-mismatches 0

 

 クオリティフィルタリングは公式ページを参照してください。 

 

 

 

引用

A Filtering Method to Generate High Quality Short Reads Using Illumina Paired-End Technology

A. Murat Eren , Joseph H. Vineis , Hilary G. Morrison, Mitchell L. Sogin

PLoS One. 2013 Jun 17;8(6):e66643

 

 

複数のトランスクリプトーム解析からコア遺伝子を探索するGET_HOMOLOGUES-EST

2018 9/27 引用の誤り修正

2020 4/13 インストール手順とヘルプ追記, タイトル修正

2020 4/14  インストール手順修正

2020 5/27 タイトル修正

 

 種のパンゲノムとは、その種のすべての個体に見られるすべての遺伝子とノンコーディング配列の集合体と定義される。しかし、大規模なゲノムを持つ植物のパンゲノムを構築することは、配列決定のコストと必要とされる計算解析の規模の両方において困難である。より手頃な方法として、トランスクリプトームデータを利用してゲノムのレパートリーに注目する方法がある。ここでは、ソフトウェアGET_HOMOLOGUES-ESTを、19のシロイヌナズナエコタイプのゲノムおよびRNA-seqデータを用いてベンチマークし、16のHordeum vulgare遺伝子型からの転写物の解析に適用した。その目的は、それらのパンゲノムをサンプリングし、すべてのアクセッションで検出された場合はコア配列、一部のアクセッションでは検出されなかった場合はアクセサリー配列に分類することであった。その結果得られた配列クラスターは、パンゲノムの成長をシミュレートし、種内変異をまとめた平均ヌクレオチド同一性マトリックスを作成するために使用された。その結果、転写産物はパンゲノムサイズを少なくとも10%程度過小評価していることがわかったが、発現配列のクラスターは系統を再現し、A. thaliana遺伝子モデルで観察される2つの特性を再現できると結論づけた:アクセサリ遺伝子座はコア遺伝子よりも発現が低く、非同義置換率が高い。最後に、アクセサリ配列は、両種のトランスポゾンコンポーネントに加えて、栽培種大麦の病害抵抗性遺伝子、および文献によく見られる有無の変化に関連する他のファミリーの様々なタンパク質ドメインを優先的にコードしていることが観察された。これらの結果は、パンゲノム解析が生殖形質の多様性を探るのに有用であることを示している。

 

Manual

GET_HOMOLOGUES-EST

 

インストール

リリースからmacosのビルドをダウンロードし、macos10.14でテストした。

本体 Github

リリースから、各プラットフォーム向けにGET_HOMOLOGUES-ESTとGET_HOMOLOGUESのバイナリがダウンロードできる。その後、データベースをダウンロードしてインストールする。

cd get_homologues-macosx-20200226/
./install.pl

./get_homologues-est.pl

$ ./get_homologues-est.pl 

 

usage: ./get_homologues-est.pl [options]

 

-h this message

-v print version, credits and checks installation

-d directory with input FASTA files (.fna , optionally .faa),  (use of pre-clustered sequences

   1 per sample, or subdirectories (subdir.clusters/subdir_)    ignores -c)

   with pre-clustered sequences (.faa/.fna ). Files matching

   tag 'flcdna' are handled as full-length transcripts.

   Allows for files to be added later.

   Creates output folder named 'directory_est_homologues'

 

Optional parameters:

-o only run BLASTN/Pfam searches and exit                      (useful to pre-compute searches)

-i cluster redundant isoforms, including those that can be     (min overlap, default: -i 40,

   concatenated with no overhangs, and perform                  use -i 0 to disable)

   calculations with longest

-c report transcriptome composition analysis                   (follows order in -I file if enforced,

                                                                with -t N skips clusters occup<N [OMCL],

                                                                ignores -r,-e)

-R set random seed for genome composition analysis             (optional, requires -c, example -R 1234)

-s save memory by using BerkeleyDB; default parsing stores

   sequence hits in RAM

-m runmode [local|cluster]                                     (default: -m local)

-n nb of threads for BLASTN/HMMER/MCL in 'local' runmode       (default=2)

-I file with .fna files in -d to be included                   (takes all by default, requires -d)

 

Algorithms instead of default bidirectional best-hits (BDBH):

-M use orthoMCL algorithm (OMCL, PubMed=12952885)

 

Options that control sequence similarity searches:

-C min %coverage of shortest sequence in BLAST alignments      (range [1-100],default: -C 75)

-E max E-value                                                 (default: -E 1e-05 , max=0.01)

-D require equal Pfam domain composition                       (best with -m cluster or -n threads)

   when defining similarity-based orthology

-S min %sequence identity in BLAST query/subj pairs            (range [1-100],default: -S 95 [BDBH|OMCL])

-b compile core-transcriptome with minimum BLAST searches      (ignores -c [BDBH])

 

Options that control clustering:

-t report sequence clusters including at least t taxa          (default: t=numberOfTaxa,

                                                                t=0 reports all clusters [OMCL])

-L add redundant isoforms to clusters                          (optional, requires -i)

-r reference transcriptome .fna file                           (by default takes file with

                                                                least sequences; with BDBH sets

                                                                first taxa to start adding genes)

-e exclude clusters with inparalogues                          (by default inparalogues are

                                                                included)

-F orthoMCL inflation value                                    (range [1-5], default: -F 1.5 [OMCL])

-A calculate average identity of clustered sequences,          (optional, creates tab-separated matrix,

 uses blastn results                                            [OMCL])

-P calculate percentage of conserved sequences (POCS),         (optional, creates tab-separated matrix,

 uses blastn results, best with CDS                             [OMCL])

-z add soft-core to genome composition analysis                (optional, requires -c [OMCL])

 

This program uses BLASTN/HMMER to define clusters of 'orthologous' transcripts

and pan/core-trancriptome sets. Different algorithm choices are available

and search parameters are customizable. It is designed to process (in a HPC computer

cluster) files contained in a directory (-d), so that new .fna/.faa files can be added

while conserving previous BLASTN/HMMER results. In general the program will try to re-use

previous results when run with the same input directory.

 

dockerhub

docker pull csicunam/get_homologues

#help
docker run --rm -itv $PWD:/data csicunam/get_homologues get_homologues-est.pl -h

#一部のRライブラリが導入されていないので、ヒートマップなど出力する時にエラにーなる。
以下を入れてコミットし直した。

docker run -it csicunam/get_homologues
> install.packages("gplots")
> install.packages("dendextend")
> install.packages("factoextra")
> quit(y)
#ID確認
docker ps -a
#commit
docker commit xxxxxxxx csicunam/get_homologues

 

 

実行方法

テストデータのラン。ディレクトリ;sample_transcripts_fastaを指定する。

./get_homologues-est.pl -d sample_transcripts_fasta
  • -d directory with input FASTA files (.fna , optionally .faa)

$ ./get_homologues-est.pl -d sample_transcripts_fasta

# ./get_homologues-est.pl -d sample_transcripts_fasta -o 0 -i 40 -e 0 -r 0 -t all -c 0 -z 0 -I 0 -m local -n 2 -M 0 -C 75 -S 95 -E 1e-05 -F 1.5 -b 0 -s 0 -D 0 -R 0 -L 0 -A 0 -P 0

 

# version 26022020

# results_directory=/Users/kazu/Documents/get_homologues-macosx-20200226/sample_transcripts_fasta_est_homologues

# parameters: MAXEVALUEBLASTSEARCH=0.01 MAXPFAMSEQS=250 BATCHSIZE=1000 MINSEQLENGTH=20 MAXSEQLENGTH=25000

 

# checking input files...

# Esterel.trinity.fna.bz2 5892  median length = 506

# Franka.trinity.fna.bz2 6036  median length = 523

# Hs_Turkey-19-24.trinity.fna.bz2 6204  median length = 476

# flcdnas_Hnijo.fna.gz 28620 [full length sequences] median length = 1504

 

# 4 genomes, 46752 sequences

 

# taxa considered = 4 sequences = 46752 residues = 63954041

 

# mask=Esterel_alltaxa_algBDBH_e0_ (_algBDBH)

 

# running makeblastdb with /Users/kazu/Documents/get_homologues-macosx-20200226/sample_transcripts_fasta_est_homologues/Esterel.trinity.fna.bz2.nucl.fasta

 

# running makeblastdb with /Users/kazu/Documents/get_homologues-macosx-20200226/sample_transcripts_fasta_est_homologues/Franka.trinity.fna.bz2.nucl.fasta

 

# running makeblastdb with /Users/kazu/Documents/get_homologues-macosx-20200226/sample_transcripts_fasta_est_homologues/Hs_Turkey-19-24.trinity.fna.bz2.nucl.fasta

 

# running makeblastdb with /Users/kazu/Documents/get_homologues-macosx-20200226/sample_transcripts_fasta_est_homologues/flcdnas_Hnijo.fna.gz.nucl.fasta

 

# running BLAST searches ...

# done

 

# concatenating and sorting blast results...

# sorting _Esterel.trinity.fna.bz2.nucl results (2.5MB)

# sorting _Franka.trinity.fna.bz2.nucl results (2.1MB)

# sorting _Hs_Turkey-19-24.trinity.fna.bz2.nucl results (2.1MB)

# sorting _flcdnas_Hnijo.fna.gz.nucl results (11MB)

# done

 

 

# parsing blast result! (/Users/kazu/Documents/get_homologues-macosx-20200226/sample_transcripts_fasta_est_homologues/tmp/all.blast , 18MB)

# parsing file finished

 

# making temporary indexes required for clustering isoforms

# construct_taxa_indexes: number of taxa found = 4

# number of file addresses/BLAST queries = 4.7e+04

 

# clustering redundant isoforms in Esterel.trinity.fna.bz2.nucl

# Esterel.trinity.fna.bz2.nucl : 41 sequences

 

# clustering redundant isoforms in Franka.trinity.fna.bz2.nucl

# Franka.trinity.fna.bz2.nucl : 65 sequences

 

# clustering redundant isoforms in Hs_Turkey-19-24.trinity.fna.bz2.nucl

# Hs_Turkey-19-24.trinity.fna.bz2.nucl : 60 sequences

 

# clustering redundant isoforms in flcdnas_Hnijo.fna.gz.nucl

# flcdnas_Hnijo.fna.gz.nucl : 2298 sequences

 

# redundancy-filtering blast file

# created nr blast file

 

# parsing blast result! (/Users/kazu/Documents/get_homologues-macosx-20200226/sample_transcripts_fasta_est_homologues/tmp/all.blast.nr , 16MB)

# parsing file finished

 

# creating indexes, this might take some time (lines=2.09e+05) ...

 

# construct_taxa_indexes: number of taxa found = 4

# number of file addresses/BLAST queries = 4.4e+04

 

# clustering orthologous sequences

 

# clustering inparalogues in Esterel.trinity.fna.bz2.nucl (reference)

# 2611 sequences

 

# clustering inparalogues in Franka.trinity.fna.bz2.nucl

# 2057 sequences

 

# finding BDBHs between Esterel.trinity.fna.bz2.nucl and Franka.trinity.fna.bz2.nucl (1)

# 357 sequences

 

# clustering inparalogues in Hs_Turkey-19-24.trinity.fna.bz2.nucl

# 2331 sequences

 

# finding BDBHs between Esterel.trinity.fna.bz2.nucl and Hs_Turkey-19-24.trinity.fna.bz2.nucl (1)

# 307 sequences

 

# clustering inparalogues in flcdnas_Hnijo.fna.gz.nucl

# 5843 sequences

 

# finding BDBHs between Esterel.trinity.fna.bz2.nucl and flcdnas_Hnijo.fna.gz.nucl (1)

# 2006 sequences

 

# looking for valid sequence clusters (n_of_taxa=4)...

 

# number_of_clusters = 17

# cluster_list = sample_transcripts_fasta_est_homologues/Esterel_alltaxa_algBDBH_e0_.cluster_list

# cluster_directory = sample_transcripts_fasta_est_homologues/Esterel_alltaxa_algBDBH_e0_

 

# runtime: 137 wallclock secs (11.31 usr  0.58 sys + 94.95 cusr 12.17 csys = 119.01 CPU)

# RAM use: 139.3 MB

 

出力

f:id:kazumaxneo:20200413155859p:plain

 

 GET_HOMOLOGUESの説明は別の記事に移しました。

引用
Analysis of Plant Pan-Genomes and Transcriptomes with GET_HOMOLOGUES-EST, a Clustering Solution for Sequences of the Same Species
Bruno Contreras-Moreira, Carlos P. Cantalapiedra, María J. García-Pereira, Sean P. Gordon, John P. Vogel, Ernesto Igartua, Ana M. Casas, Pablo Vinuesa

Front Plant Sci. 2017; 8: 184. Published online 2017 Feb 14

 

GET_HOMOLOGUES, a Versatile Software Package for Scalable and Robust Microbial Pangenome Analysis

Contreras-Moreira B, Vinuesa P

Appl Environ Microbiol. 2013 Dec;79(24):7696-701

 

http://journal.frontiersin.org/article/10.3389/fpls.2017.00184/full

BLASTとコンパチブルで高速なホモロジー検索ツール Diamond

2019 1/20 help追加 、コマンド追記,  6/9 -コマンド例から-max-target-seqs削除, 7/19 追記

2021 2/13  ツイート追記

2022/04/07 インストール追記、07/22 例追記、help更新

  

Diamondはindexのつけ方を工夫することでBLASTXの解析速度を加速できるツール。blastと同等の機能を持つが、論文ではblastより最大20000倍高速化できると主張されている。特にクエリー配列が非常に多い場合に高速とされる。2015年にnature methodsに論文が発表された。

 

2021 2/13

3/11

 

 

  マニュアル

manual

ppt

https://www.donarmstrong.com/ld/dmnd2015/diamond_presentation_2015.pdf

インストール

Github

condaやbrewを使って導入できる。

#bioconda(link)
mamba install -c bioconda diamond

#brewでも導入可能(試していません)
brew install diamond

 > diamond help #v2.0.13

$ diamond --help

diamond v2.0.13.151 (C) Max Planck Society for the Advancement of Science

Documentation, support and updates available at http://www.diamondsearch.org

Please cite: http://dx.doi.org/10.1038/s41592-021-01101-x Nature Methods (2021)

 

Syntax: diamond COMMAND [OPTIONS]

 

Commands:

makedb    Build DIAMOND database from a FASTA file

blastp    Align amino acid query sequences against a protein reference database

blastx    Align DNA query sequences against a protein reference database

view    View DIAMOND alignment archive (DAA) formatted file

help    Produce help message

version    Display version information

getseq    Retrieve sequences from a DIAMOND database file

dbinfo    Print information about a DIAMOND database file

test    Run regression tests

makeidx    Make database index

 

General options:

--threads (-p)           number of CPU threads

--db (-d)                database file

--out (-o)               output file

--outfmt (-f)            output format

    0   = BLAST pairwise

    5   = BLAST XML

    6   = BLAST tabular

    100 = DIAMOND alignment archive (DAA)

    101 = SAM

 

    Value 6 may be followed by a space-separated list of these keywords:

 

    qseqid means Query Seq - id

    qlen means Query sequence length

    sseqid means Subject Seq - id

    sallseqid means All subject Seq - id(s), separated by a ';'

    slen means Subject sequence length

    qstart means Start of alignment in query

    qend means End of alignment in query

    sstart means Start of alignment in subject

    send means End of alignment in subject

    qseq means Aligned part of query sequence

    qseq_translated means Aligned part of query sequence (translated)

    full_qseq means Query sequence

    full_qseq_mate means Query sequence of the mate

    sseq means Aligned part of subject sequence

    full_sseq means Subject sequence

    evalue means Expect value

    bitscore means Bit score

    score means Raw score

    length means Alignment length

    pident means Percentage of identical matches

    nident means Number of identical matches

    mismatch means Number of mismatches

    positive means Number of positive - scoring matches

    gapopen means Number of gap openings

    gaps means Total number of gaps

    ppos means Percentage of positive - scoring matches

    qframe means Query frame

    btop means Blast traceback operations(BTOP)

    cigar means CIGAR string

    staxids means unique Subject Taxonomy ID(s), separated by a ';'

(in numerical order)

    sscinames means unique Subject Scientific Name(s), separated by a ';'

    sskingdoms means unique Subject Super Kingdom(s), separated by a ';'

    skingdoms means unique Subject Kingdom(s), separated by a ';'

    sphylums means unique Subject Phylum(s), separated by a ';'

    stitle means Subject Title

    salltitles means All Subject Title(s), separated by a '<>'

    qcovhsp means Query Coverage Per HSP

    scovhsp means Subject Coverage Per HSP

    qtitle means Query title

    qqual means Query quality values for the aligned part of the query

    full_qqual means Query quality values

    qstrand means Query strand

 

    Default: qseqid sseqid pident length mismatch gapopen qstart qend

sstart send evalue bitscore

--verbose (-v)           verbose console output

--log                    enable debug log

--quiet                  disable console output

--header                 Write header lines to blast tabular format.

 

Makedb options:

--in                     input reference file in FASTA format

--taxonmap               protein accession to taxid mapping file

--taxonnodes             taxonomy nodes.dmp from NCBI

--taxonnames             taxonomy names.dmp from NCBI

 

Aligner options:

--query (-q)             input query file

--strand                 query strands to search (both/minus/plus)

--un                     file for unaligned queries

--al                     file or aligned queries

--unfmt                  format of unaligned query file (fasta/fastq)

--alfmt                  format of aligned query file (fasta/fastq)

--unal                   report unaligned queries (0=no, 1=yes)

--max-target-seqs (-k)   maximum number of target sequences to report

alignments for (default=25)

--top                    report alignments within this percentage

range of top alignment score (overrides --max-target-seqs)

--max-hsps               maximum number of HSPs per target sequence to

report for each query (default=1)

--range-culling          restrict hit culling to overlapping query ranges

--compress               compression for output files (0=none, 1=gzip, zstd)

--evalue (-e)            maximum e-value to report alignments (default=0.001)

--min-score              minimum bit score to report alignments

(overrides e-value setting)

--id                     minimum identity% to report an alignment

--query-cover            minimum query cover% to report an alignment

--subject-cover          minimum subject cover% to report an alignment

--fast                   enable fast mode

--mid-sensitive          enable mid-sensitive mode

--sensitive              enable sensitive mode)

--more-sensitive         enable more sensitive mode

--very-sensitive         enable very sensitive mode

--ultra-sensitive        enable ultra sensitive mode

--iterate                iterated search with increasing sensitivity

--global-ranking (-g)    number of targets for global ranking

--block-size (-b)        sequence block size in billions of letters

(default=2.0)

--index-chunks (-c)      number of chunks for index processing (default=4)

--tmpdir (-t)            directory for temporary files

--parallel-tmpdir        directory for temporary files used by multiprocessing

--gapopen                gap open penalty

--gapextend              gap extension penalty

--frameshift (-F)        frame shift penalty (default=disabled)

--long-reads             short for --range-culling --top 10 -F 15

--matrix                 score matrix for protein alignment (default=BLOSUM62)

--custom-matrix          file containing custom scoring matrix

--comp-based-stats       composition based statistics mode (0-4)

--masking                masking algorithm (none, seg, tantan=default)

--query-gencode          genetic code to use to translate query (see

user manual)

--salltitles             include full subject titles in DAA file

--sallseqid              include all subject ids in DAA file

--no-self-hits           suppress reporting of identical self hits

--taxonlist              restrict search to list of taxon ids (comma-separated)

--taxon-exclude          exclude list of taxon ids (comma-separated)

--seqidlist              filter the database by list of accessions

--skip-missing-seqids    ignore accessions missing in the database

 

Advanced options:

--algo                   Seed search algorithm

(0=double-indexed/1=query-indexed/ctg=contiguous-seed)

--bin                    number of query bins for seed search

--min-orf (-l)           ignore translated sequences without an open

reading frame of at least this length

--seed-cut               cutoff for seed complexity

--freq-masking           mask seeds based on frequency

--freq-sd                number of standard deviations for ignoring

frequent seeds

--motif-masking          softmask abundant motifs (0/1)

--id2                    minimum number of identities for stage 1 hit

--xdrop (-x)             xdrop for ungapped alignment

--gapped-filter-evalue   E-value threshold for gapped filter (auto)

--band                   band for dynamic programming computation

--shapes (-s)            number of seed shapes (default=all available)

--shape-mask             seed shapes

--multiprocessing        enable distributed-memory parallel processing

--mp-init                initialize multiprocessing run

--mp-recover             enable continuation of interrupted multiprocessing run

--mp-query-chunk         process only a single query chunk as specified

--ext-chunk-size         chunk size for adaptive ranking (default=auto)

--no-ranking             disable ranking heuristic

--ext                    Extension mode (banded-fast/banded-slow/full)

--culling-overlap        minimum range overlap with higher scoring hit

to delete a hit (default=50%)

--taxon-k                maximum number of targets to report per species

--range-cover            percentage of query range to be covered for

range culling (default=50%)

--dbsize                 effective database size (in letters)

--no-auto-append         disable auto appending of DAA and DMND file extensions

--xml-blord-format       Use gnl|BL_ORD_ID| style format in XML output

--stop-match-score       Set the match score of stop codons against each other.

--tantan-minMaskProb     minimum repeat probability for masking (default=0.9)

--file-buffer-size       file buffer size in bytes (default=67108864)

--memory-limit (-M)      Memory limit for extension stage in GB

--no-unlink              Do not unlink temporary files.

--target-indexed         Enable target-indexed mode

--ignore-warnings        Ignore warnings

 

View options:

--daa (-a)               DIAMOND alignment archive (DAA) file

--forwardonly            only show alignments of forward strand

 

Getseq options:

--seq                    Space-separated list of sequence numbers to display.

 

Online documentation at http://www.diamondsearch.org

 

 

ラン

はじめにデータベースとなるアミノ酸配列のindexファイルを作成する。

diamond makedb --in input.faa -d nr

 

 

blastxでホモロジー検索を行う。inputは塩基配列

diamond blastx -d nr -q query.fna -o matches.m8 
  • --threads   Number of CPU threads. By default, the program will auto-detect and use all available virtual cores on the machine.

 出力はタブ区切り形式である。

 user$ head matches.m8 

gi|451813329|ref|NC_020286.1|:3569362-3569561,1-772 gi|451813330|ref|YP_007449782.1| 100.0 323 0 0 1 969 1 323 1.8e-179 622.1

gi|451813329|ref|NC_020286.1|:3569362-3569561,1-772 gi|451813441|ref|YP_007449893.1| 33.0 233 142 4 82 747 34 263 8.8e-25 108.2

 Diamondの検出閾値はblastのdefaultの検出閾値よりずっと低いため、stringencyはblastより高くなっている。また、defaultのパラメータはショートリード向けの設定のため、クエリ配列が長い場合、--sensitiveや--more-sensitiveをつけることが推奨されている。

 

マニュアルに書いてあるが、タブ出力するには--outfmt 6をつける。さらに以下のような指定を行うことで、出力項目を好きに設定できる。

  • qseqid Query Seq - id
  • qlen Query sequence length
  • sseqid Subject Seq - id
  • sallseqid All subject Seq - id(s), separated by a ’;’ slen Subject sequence length
  • qstart Start of alignment in query
  • qend End of alignment in query
  • sstart Start of alignment in subject
  • send End of alignment in subject
  • qseq Aligned part of query sequence
  • sseq Aligned part of subject sequence
  • full sseq Full subject sequence
  • evalue Expect value
  • bitscore Bit score
  • score Raw score
  • length Alignment length
  • pident Percentage of identical matches
  • nident Number of identical matches
  • mismatch Number of mismatches
  • positive Number of positive - scoring matches
  • gapopen Number of gap openings
  • gaps Total number of gaps
  • ppos Percentage of positive - scoring matches
  • qframe Query frame
  • btop Blast traceback operations(BTOP)
  • staxids Unique Subject Taxonomy ID(s), separated by a ’;’ (in numerical order). This field requires setting the --taxonmap parameter for makedb.
  • salltitles All Subject Title(s), separated by a ’<>’
  • qcovhsp Query Coverage Per HSP
  • qtitle Query title

デフォルトでは

"qseqid sseqid pident length mismatch

gapopen qstart qend sstart send evalue bitscore"が出力されるようになっている。

例えばdiamondでblastxサーチを行う。tabularでqseqid、sseqid、evalueのみ出力する。max target seqはデフォルト25なので増やす。

diamond blastx --query input.fa \
--db uniprot_ref_proteomes.diamond.dmnd \
--outfmt 6 qseqid sseqid evalue \
--sensitive \
--max-target-seqs 1000 \
> blast.out

 

追記

E-value ≤1E-100、amino acid identity ≥ 80%、minimum length (amino acid) ≥ 300 a.a。

diamond blastx --query input.fa \
--db uniprot_ref_proteomes.diamond.dmnd \
--outfmt 6 \
--min-orf 300 --id 80 --evalue 1e-100 \
> blast.out

感度を上げるには新しく導入された--very-sensitive--ultra-sensitiveを使う(引用2参照)。max target seqはデフォルト25しかないので増やす。

diamond blastx --query input.fa \
--db uniprot_ref_proteomes.diamond.dmnd \
--evalue 1e-1 --very-sensitive \
--max-target-seqs 100000 \
--outfmt 6 \
> blast.out

all versus all

proteins.faaの総当たり比較。部分ヒットを避けるパラメータ設定を使用する(引用)。

diamond makedb --in proteins.faa --db protein_db
diamond blastp --query proteins.faa --db protein_db --out blastp.tsv \
--outfmt 6 --evalue 1e-5 --max-target-seqs 10000 --query-cover 50 \
--subject-cover 50

出力例

f:id:kazumaxneo:20220407231300p:plain

 

引用

Fast and sensitive protein alignment using DIAMONDFast and sensitive protein alignment using DIAMOND
Benjamin Buchfink, Chao Xie & Daniel H

Nature Methods 12, 59–60 (2015) doi:10.1038/nmeth.3176

PDF

https://lemosbioinfo.files.wordpress.com/2016/11/nmeth-3176.pdf

 

Sensitive protein alignments at tree-of-life scale using DIAMOND

Benjamin Buchfink, Klaus Reuter & Hajk-Georg Drost 
Nature Methods volume 18, pages 366–368 (2021)

 

 

 

追記

Question about max-target-seqs option #29

Question about max-target-seqs option · Issue #29 · bbuchfink/diamond · GitHub

 

KMCのホームページで、diamondとKMCの連携について提案があります。詳細はKMCのHPからスクリプトをダウンロードして確認してください。

http://sun.aei.polsl.pl/REFRESH/index.php?page=projects&project=kmc&subpage=download

 

追記

 論文図1で色々なシーケンサー由来のリードを使ってタンパク質と相同性検索した時の処理時間と感度が比較されています。データによっては2万倍以上高速化しています。

https://lemosbioinfo.files.wordpress.com/2016/11/nmeth-3176.pdf

 

追記

ローカルデータベースのダウンロードについては、こちらが参考になります。

GitHub - josuebarrera/GenEra: genEra is an easy-to-use, low-dependency command-line tool that estimates the age of the last common ancestor of protein-coding gene families.

 

追記

AC-DIAMOND

 

追記

ID convert


大雑把に調べる。



NCBIのblast DBを使う。


 

2021 4/8

 

 

 

SSU rRNAを素早く検出する Barrnap

2019 3/10 タイトル修正

2019 5/30 インストール方法追記

2020 6/15 コマンド修正, help追記

2020 6/29  例追記

 

BarrnapはrRNAをゲノムから探すツール。

 検索対象

  • bacteria (5S,23S,16S)
  • archaea (5S,5.8S,23S,16S)
  • mitochondria (12S,16S)
  • eukaryotes (5S,5.8S,28S,18S)

 

インストール

本体 Github

#bioconda(link
mamba install -c bioconda barrnap -y

#homebrew
brew install barrnap

barrnap -h

$ barrnap -h

Synopsis:

  barrnap 0.9 - rapid ribosomal RNA prediction

Author:

  Torsten Seemann

Usage:

  barrnap [options] chr.fa

  barrnap [options] < chr.fa

  barrnap [options] - < chr.fa

Options:

  --help            This help

  --version         Print version and exit

  --citation        Print citation for referencing barrnap

  --kingdom [X]     Kingdom: euk mito bac arc (default 'bac')

  --quiet           No screen output (default OFF)

  --threads [N]     Number of threads/cores/CPUs to use (default '1')

  --lencutoff [n.n] Proportional length threshold to label as partial (default '0.8')

  --reject [n.n]    Proportional length threshold to reject prediction (default '0.25')

  --evalue [n.n]    Similarity e-value cut-off (default '1e-06')

  --incseq          Include FASTA _input_ sequences in GFF3 output (default OFF)

  --outseq [X]      Save rRNA hit seqs to this FASTA file (default '')

 

 

ラン

barrnap --threads 8 genome.fa > 16SrRNAs.gff
  • --threads   Number of threads/cores/CPUs to use (default '8')

 default出力はGFF3形式になる。

 

kingdomを指定してラン。eukaryotes指定。

barrnap --kingdom euk --threads 12 input.fa > enk_rRNA.gff
  •  --kingdom    Kingdom: euk mito bac arc (default 'bac')

 

ミトコンドリア指定。

barrnap --kingdom mito --threads 12 input.fa > mito_rRNA.gff

イントロンがある場合、断片的な予測が起きる可能性を考慮して下さい。

 

rRNA配列も別出力する。

barrnap --outseq rRNAs.fa --threads 12 input.fa > rRNA.gff

引用

https://github.com/tseemann/barrnap

 

関連


高速なRNA seqのマッピングツール STAR

2019 2/15 動画とbiocondaによる install追加

2020 7/6 コメントとhelp追加

2021 10/9 gzip fastqのオプション追記、12/5 chimera出力について追記

2024/02/20 情報を整頓

 

STARは高速なRNAのアライメントツール。intron-exonのsplit-alingmentに対応している。動作はbowtie2より10倍以上高速とされ、マッピング感度の高さとエラー率の低さは既存のツールと同等とされている。

 

github

https://github.com/alexdobin/STAR

マニュアル

https://github.com/alexdobin/STAR/blob/master/doc/STARmanual.pdf

STAR: RNA-Seq Read Aligner

 

インストール

wget https://github.com/alexdobin/STAR/archive/2.5.3a.tar.gz 
tar -xzf 2.5.3a.tar.gz
cd STAR-2.5.3a/bin/MacOSX_x86_64/

#Anacondaを使っているならcondaで導入可能
conda install -c bioconda -y star

star --help

$ star --help

Usage: STAR  [options]... --genomeDir /path/to/genome/index/   --readFilesIn R1.fq R2.fq

Spliced Transcripts Alignment to a Reference (c) Alexander Dobin, 2009-2019

 

For more details see:

<https://github.com/alexdobin/STAR>

<https://github.com/alexdobin/STAR/blob/master/doc/STARmanual.pdf>

 

### versions

versionGenome           2.7.1a

    string: earliest genome index version compatible with this STAR release. Please do not change this value!

 

### Parameter Files

parametersFiles          -

    string: name of a user-defined parameters file, "-": none. Can only be defined on the command line.

 

### System

sysShell            -

    string: path to the shell binary, preferably bash, e.g. /bin/bash.

                    - ... the default shell is executed, typically /bin/sh. This was reported to fail on some Ubuntu systems - then you need to specify path to bash.

 

### Run Parameters

runMode                         alignReads

    string: type of the run.

 

                                alignReads             ... map reads

                                genomeGenerate         ... generate genome files

                                inputAlignmentsFromBAM ... input alignments from BAM. Presently only works with --outWigType and --bamRemoveDuplicates.

                                liftOver               ... lift-over of GTF files (--sjdbGTFfile) between genome assemblies using chain file(s) from --genomeChainFiles.

 

runThreadN                      1

    int: number of threads to run STAR

 

runDirPerm                      User_RWX

    string: permissions for the directories created at the run-time.

                                User_RWX ... user-read/write/execute

                                All_RWX  ... all-read/write/execute (same as chmod 777)

 

runRNGseed                      777

    int: random number generator seed.

 

 

### Genome Parameters

genomeDir                   ./GenomeDir/

    string: path to the directory where genome files are stored (for --runMode alignReads) or will be generated (for --runMode generateGenome)

 

genomeLoad                NoSharedMemory

    string: mode of shared memory usage for the genome files. Only used with --runMode alignReads.

                          LoadAndKeep     ... load genome into shared and keep it in memory after run

                          LoadAndRemove   ... load genome into shared but remove it after run

                          LoadAndExit     ... load genome into shared memory and exit, keeping the genome in memory for future runs

                          Remove          ... do not map anything, just remove loaded genome from memory

                          NoSharedMemory  ... do not use shared memory, each job will have its own private copy of the genome

 

genomeFastaFiles            -

    string(s): path(s) to the fasta files with the genome sequences, separated by spaces. These files should be plain text FASTA files, they *cannot* be zipped.

                            Required for the genome generation (--runMode genomeGenerate). Can also be used in the mapping (--runMode alignReads) to add extra (new) sequences to the genome (e.g. spike-ins).

 

genomeChainFiles            -

    string: chain files for genomic liftover. Only used with --runMode liftOver .

 

genomeFileSizes             0

    uint(s)>0: genome files exact sizes in bytes. Typically, this should not be defined by the user.

 

genomeConsensusFile         -

    string: VCF file with consensus SNPs (i.e. alternative allele is the major (AF>0.5) allele)

 

### Genome Indexing Parameters - only used with --runMode genomeGenerate

genomeChrBinNbits           18

    int: =log2(chrBin), where chrBin is the size of the bins for genome storage: each chromosome will occupy an integer number of bins. For a genome with large number of contigs, it is recommended to scale this parameter as min(18, log2[max(GenomeLength/NumberOfReferences,ReadLength)]).

 

genomeSAindexNbases         14

    int: length (bases) of the SA pre-indexing string. Typically between 10 and 15. Longer strings will use much more memory, but allow faster searches. For small genomes, the parameter --genomeSAindexNbases must be scaled down to min(14, log2(GenomeLength)/2 - 1).

 

genomeSAsparseD             1

    int>0: suffux array sparsity, i.e. distance between indices: use bigger numbers to decrease needed RAM at the cost of mapping speed reduction

 

genomeSuffixLengthMax       -1

    int: maximum length of the suffixes, has to be longer than read length. -1 = infinite.

 

 

### Splice Junctions Database

sjdbFileChrStartEnd                     -

    string(s): path to the files with genomic coordinates (chr <tab> start <tab> end <tab> strand) for the splice junction introns. Multiple files can be supplied wand will be concatenated.

 

sjdbGTFfile                             -

    string: path to the GTF file with annotations

 

sjdbGTFchrPrefix                        -

    string: prefix for chromosome names in a GTF file (e.g. 'chr' for using ENSMEBL annotations with UCSC genomes)

 

sjdbGTFfeatureExon                      exon

    string: feature type in GTF file to be used as exons for building transcripts

 

sjdbGTFtagExonParentTranscript          transcript_id

    string: GTF attribute name for parent transcript ID (default "transcript_id" works for GTF files)

 

sjdbGTFtagExonParentGene                gene_id

    string: GTF attribute name for parent gene ID (default "gene_id" works for GTF files)

 

sjdbGTFtagExonParentGeneName            gene_name

    string(s): GTF attrbute name for parent gene name

 

sjdbGTFtagExonParentGeneType            gene_type gene_biotype

    string(s): GTF attrbute name for parent gene type

 

sjdbOverhang                            100

    int>0: length of the donor/acceptor sequence on each side of the junctions, ideally = (mate_length - 1)

 

sjdbScore                               2

    int: extra alignment score for alignmets that cross database junctions

 

sjdbInsertSave                          Basic

    string: which files to save when sjdb junctions are inserted on the fly at the mapping step

Basic ... only small junction / transcript files

All   ... all files including big Genome, SA and SAindex - this will create a complete genome directory

 

### Variation parameters

varVCFfile                              -

    string: path to the VCF file that contains variation data.

 

### Input Files

inputBAMfile                -

    string: path to BAM input file, to be used with --runMode inputAlignmentsFromBAM

 

### Read Parameters

readFilesType               Fastx

    string: format of input read files

                            Fastx       ... FASTA or FASTQ

                            SAM SE      ... SAM or BAM single-end reads; for BAM use --readFilesCommand samtools view

                            SAM PE      ... SAM or BAM paired-end reads; for BAM use --readFilesCommand samtools view

 

readFilesIn                 Read1 Read2

    string(s): paths to files that contain input read1 (and, if needed,  read2)

 

readFilesPrefix             -

    string: preifx for the read files names, i.e. it will be added in front of the strings in --readFilesIn

                            -: no prefix

 

readFilesCommand             -

    string(s): command line to execute for each of the input file. This command should generate FASTA or FASTQ text and send it to stdout

               For example: zcat - to uncompress .gz files, bzcat - to uncompress .bz2 files, etc.

 

readMapNumber               -1

    int: number of reads to map from the beginning of the file

                            -1: map all reads

 

readMatesLengthsIn          NotEqual

    string: Equal/NotEqual - lengths of names,sequences,qualities for both mates are the same  / not the same. NotEqual is safe in all situations.

 

readNameSeparator           /

    string(s): character(s) separating the part of the read names that will be trimmed in output (read name after space is always trimmed)

 

readQualityScoreBase        33

    int>=0: number to be subtracted from the ASCII code to get Phred quality score

 

clip3pNbases                 0

    int(s): number(s) of bases to clip from 3p of each mate. If one value is given, it will be assumed the same for both mates.

 

clip5pNbases                 0

    int(s): number(s) of bases to clip from 5p of each mate. If one value is given, it will be assumed the same for both mates.

 

clip3pAdapterSeq            -

    string(s): adapter sequences to clip from 3p of each mate.  If one value is given, it will be assumed the same for both mates.

 

clip3pAdapterMMp            0.1

    double(s): max proportion of mismatches for 3p adpater clipping for each mate.  If one value is given, it will be assumed the same for both mates.

 

clip3pAfterAdapterNbases    0

    int(s): number of bases to clip from 3p of each mate after the adapter clipping. If one value is given, it will be assumed the same for both mates.

 

 

### Limits

limitGenomeGenerateRAM               31000000000

    int>0: maximum available RAM (bytes) for genome generation

 

limitIObufferSize                    150000000

    int>0: max available buffers size (bytes) for input/output, per thread

 

limitOutSAMoneReadBytes              100000

    int>0: max size of the SAM record (bytes) for one read. Recommended value: >(2*(LengthMate1+LengthMate2+100)*outFilterMultimapNmax

 

limitOutSJoneRead                    1000

    int>0: max number of junctions for one read (including all multi-mappers)

 

limitOutSJcollapsed                  1000000

    int>0: max number of collapsed junctions

 

limitBAMsortRAM                         0

    int>=0: maximum available RAM (bytes) for sorting BAM. If =0, it will be set to the genome index size. 0 value can only be used with --genomeLoad NoSharedMemory option.

 

limitSjdbInsertNsj                     1000000

    int>=0: maximum number of junction to be inserted to the genome on the fly at the mapping stage, including those from annotations and those detected in the 1st step of the 2-pass run

 

limitNreadsSoft                        -1

    int: soft limit on the number of reads

 

### Output: general

outFileNamePrefix               ./

    string: output files name prefix (including full or relative path). Can only be defined on the command line.

 

outTmpDir                       -

    string: path to a directory that will be used as temporary by STAR. All contents of this directory will be removed!

            - the temp directory will default to outFileNamePrefix_STARtmp

 

outTmpKeep                      None

    string: whether to keep the tempporary files after STAR runs is finished

                                None ... remove all temporary files

                                All .. keep all files

 

outStd                          Log

    string: which output will be directed to stdout (standard out)

                                Log                    ... log messages

                                SAM                    ... alignments in SAM format (which normally are output to Aligned.out.sam file), normal standard output will go into Log.std.out

                                BAM_Unsorted           ... alignments in BAM format, unsorted. Requires --outSAMtype BAM Unsorted

                                BAM_SortedByCoordinate ... alignments in BAM format, unsorted. Requires --outSAMtype BAM SortedByCoordinate

                                BAM_Quant              ... alignments to transcriptome in BAM format, unsorted. Requires --quantMode TranscriptomeSAM

 

outReadsUnmapped                None

   string: output of unmapped and partially mapped (i.e. mapped only one mate of a paired end read) reads in separate file(s).

                                None    ... no output

                                Fastx   ... output in separate fasta/fastq files, Unmapped.out.mate1/2

 

outQSconversionAdd              0

   int: add this number to the quality score (e.g. to convert from Illumina to Sanger, use -31)

 

outMultimapperOrder             Old_2.4

    string: order of multimapping alignments in the output files

                                Old_2.4             ... quasi-random order used before 2.5.0

                                Random              ... random order of alignments for each multi-mapper. Read mates (pairs) are always adjacent, all alignment for each read stay together. This option will become default in the future releases.

 

### Output: SAM and BAM

outSAMtype                      SAM

    strings: type of SAM/BAM output

                                1st word:

                                BAM  ... output BAM without sorting

                                SAM  ... output SAM without sorting

                                None ... no SAM/BAM output

                                2nd, 3rd:

                                Unsorted           ... standard unsorted

                                SortedByCoordinate ... sorted by coordinate. This option will allocate extra memory for sorting which can be specified by --limitBAMsortRAM.

 

outSAMmode                      Full

    string: mode of SAM output

                                None ... no SAM output

                                Full ... full SAM output

                                NoQS ... full SAM but without quality scores

 

outSAMstrandField                               None

    string: Cufflinks-like strand field flag

                                None        ... not used

                                intronMotif ... strand derived from the intron motif. Reads with inconsistent and/or non-canonical introns are filtered out.

 

outSAMattributes                Standard

    string: a string of desired SAM attributes, in the order desired for the output SAM

                                NH HI AS nM NM MD jM jI XS MC ch ... any combination in any order

                                None        ... no attributes

                                Standard    ... NH HI AS nM

                                All         ... NH HI AS nM NM MD jM jI MC ch

                                vA          ... variant allele

                                vG          ... genomic coordiante of the variant overlapped by the read

                                vW          ... 0/1 - alignment does not pass / passes WASP filtering. Requires --waspOutputMode SAMtag

                                STARsolo:

                                CR CY UR UY ... sequences and quality scores of cell barcodes and UMIs for the solo* demultiplexing

                                CB UB       ... error-corrected cell barcodes and UMIs for solo* demultiplexing. Requires --outSAMtype BAM SortedByCoordinate.

                                sM          ... assessment of CB and UMI

                                sS          ... sequence of the entire barcode (CB,UMI,adapter...)

                                sQ          ... quality of the entire barcode

                                Unsupported/undocumented:

                                rB          ... alignment block read/genomic coordinates

                                vR          ... read coordinate of the variant

 

outSAMattrIHstart               1

    int>=0:                     start value for the IH attribute. 0 may be required by some downstream software, such as Cufflinks or StringTie.

 

outSAMunmapped                  None

    string(s): output of unmapped reads in the SAM format

                                1st word:

                                None   ... no output

                                Within ... output unmapped reads within the main SAM file (i.e. Aligned.out.sam)

                                2nd word:

                                KeepPairs ... record unmapped mate for each alignment, and, in case of unsorted output, keep it adjacent to its mapped mate. Only affects multi-mapping reads.

 

outSAMorder                     Paired

    string: type of sorting for the SAM output

                                Paired: one mate after the other for all paired alignments

                                PairedKeepInputOrder: one mate after the other for all paired alignments, the order is kept the same as in the input FASTQ files

 

outSAMprimaryFlag OneBestScore

    string: which alignments are considered primary - all others will be marked with 0x100 bit in the FLAG

                                OneBestScore ... only one alignment with the best score is primary

                                AllBestScore ... all alignments with the best score are primary

 

outSAMreadID Standard

    string: read ID record type

                                Standard ... first word (until space) from the FASTx read ID line, removing /1,/2 from the end

                                Number   ... read number (index) in the FASTx file

 

outSAMmapqUnique        255

    int: 0 to 255: the MAPQ value for unique mappers

 

outSAMflagOR           0

    int: 0 to 65535: sam FLAG will be bitwise OR'd with this value, i.e. FLAG=FLAG | outSAMflagOR. This is applied after all flags have been set by STAR, and after outSAMflagAND. Can be used to set specific bits that are not set otherwise.

 

outSAMflagAND           65535

    int: 0 to 65535: sam FLAG will be bitwise AND'd with this value, i.e. FLAG=FLAG & outSAMflagOR. This is applied after all flags have been set by STAR, but before outSAMflagOR. Can be used to unset specific bits that are not set otherwise.

 

outSAMattrRGline        -

    string(s): SAM/BAM read group line. The first word contains the read group identifier and must start with "ID:", e.g. --outSAMattrRGline id:xxx CN:yy "DS:z z z".

            xxx will be added as RG tag to each output alignment. Any spaces in the tag values have to be double quoted.

            Comma separated RG lines correspons to different (comma separated) input files in --readFilesIn. Commas have to be surrounded by spaces, e.g.

            --outSAMattrRGline id:xxx , id:zzz "DS:z z" , id:yyy DS:yyyy

 

outSAMheaderHD          -

    strings: @HD (header) line of the SAM header

 

outSAMheaderPG          -

    strings: extra @PG (software) line of the SAM header (in addition to STAR)

 

outSAMheaderCommentFile -

    string: path to the file with @CO (comment) lines of the SAM header

 

outSAMfilter            None

    string(s): filter the output into main SAM/BAM files

                        KeepOnlyAddedReferences ... only keep the reads for which all alignments are to the extra reference sequences added with --genomeFastaFiles at the mapping stage.

                        KeepAllAddedReferences ...  keep all alignments to the extra reference sequences added with --genomeFastaFiles at the mapping stage.

 

 

outSAMmultNmax          -1

    int: max number of multiple alignments for a read that will be output to the SAM/BAM files.

                        -1 ... all alignments (up to --outFilterMultimapNmax) will be output

 

outSAMtlen              1

    int: calculation method for the TLEN field in the SAM/BAM files

                        1 ... leftmost base of the (+)strand mate to rightmost base of the (-)mate. (+)sign for the (+)strand mate

                        2 ... leftmost base of any mate to rightmost base of any mate. (+)sign for the mate with the leftmost base. This is different from 1 for overlapping mates with protruding ends

 

outBAMcompression       1

    int: -1 to 10  BAM compression level, -1=default compression (6?), 0=no compression, 10=maximum compression

 

outBAMsortingThreadN    0

    int: >=0: number of threads for BAM sorting. 0 will default to min(6,--runThreadN).

 

outBAMsortingBinsN      50

    int: >0:  number of genome bins fo coordinate-sorting

 

### BAM processing

bamRemoveDuplicatesType  -

    string: mark duplicates in the BAM file, for now only works with (i) sorted BAM fed with inputBAMfile, and (ii) for paired-end alignments only

                        -                       ... no duplicate removal/marking

                        UniqueIdentical         ... mark all multimappers, and duplicate unique mappers. The coordinates, FLAG, CIGAR must be identical

                        UniqueIdenticalNotMulti  ... mark duplicate unique mappers but not multimappers.

 

bamRemoveDuplicatesMate2basesN   0

    int>0: number of bases from the 5' of mate 2 to use in collapsing (e.g. for RAMPAGE)

 

### Output Wiggle

outWigType          None

    string(s): type of signal output, e.g. "bedGraph" OR "bedGraph read1_5p". Requires sorted BAM: --outSAMtype BAM SortedByCoordinate .

                    1st word:

                    None       ... no signal output

                    bedGraph   ... bedGraph format

                    wiggle     ... wiggle format

                    2nd word:

                    read1_5p   ... signal from only 5' of the 1st read, useful for CAGE/RAMPAGE etc

                    read2      ... signal from only 2nd read

 

outWigStrand        Stranded

    string: strandedness of wiggle/bedGraph output

                    Stranded   ...  separate strands, str1 and str2

                    Unstranded ...  collapsed strands

 

outWigReferencesPrefix    -

    string: prefix matching reference names to include in the output wiggle file, e.g. "chr", default "-" - include all references

 

outWigNorm              RPM

    string: type of normalization for the signal

                        RPM    ... reads per million of mapped reads

                        None   ... no normalization, "raw" counts

 

### Output Filtering

outFilterType                   Normal

    string: type of filtering

                                Normal  ... standard filtering using only current alignment

                                BySJout ... keep only those reads that contain junctions that passed filtering into SJ.out.tab

 

outFilterMultimapScoreRange     1

    int: the score range below the maximum score for multimapping alignments

 

outFilterMultimapNmax           10

    int: maximum number of loci the read is allowed to map to. Alignments (all of them) will be output only if the read maps to no more loci than this value.

         Otherwise no alignments will be output, and the read will be counted as "mapped to too many loci" in the Log.final.out .

 

outFilterMismatchNmax           10

    int: alignment will be output only if it has no more mismatches than this value.

 

outFilterMismatchNoverLmax      0.3

    real: alignment will be output only if its ratio of mismatches to *mapped* length is less than or equal to this value.

 

outFilterMismatchNoverReadLmax  1.0

    real: alignment will be output only if its ratio of mismatches to *read* length is less than or equal to this value.

 

 

outFilterScoreMin               0

    int: alignment will be output only if its score is higher than or equal to this value.

 

outFilterScoreMinOverLread      0.66

    real: same as outFilterScoreMin, but  normalized to read length (sum of mates' lengths for paired-end reads)

 

outFilterMatchNmin              0

    int: alignment will be output only if the number of matched bases is higher than or equal to this value.

 

outFilterMatchNminOverLread     0.66

    real: sam as outFilterMatchNmin, but normalized to the read length (sum of mates' lengths for paired-end reads).

 

outFilterIntronMotifs           None

    string: filter alignment using their motifs

None                           ... no filtering

RemoveNoncanonical             ... filter out alignments that contain non-canonical junctions

RemoveNoncanonicalUnannotated  ... filter out alignments that contain non-canonical unannotated junctions when using annotated splice junctions database. The annotated non-canonical junctions will be kept.

 

outFilterIntronStrands          RemoveInconsistentStrands

    string: filter alignments

                RemoveInconsistentStrands      ... remove alignments that have junctions with inconsistent strands

                None                           ... no filtering

 

### Output Filtering: Splice Junctions

outSJfilterReads                All

    string: which reads to consider for collapsed splice junctions output

                All: all reads, unique- and multi-mappers

                Unique: uniquely mapping reads only

 

outSJfilterOverhangMin          30  12  12  12

    4 integers:    minimum overhang length for splice junctions on both sides for: (1) non-canonical motifs, (2) GT/AG and CT/AC motif, (3) GC/AG and CT/GC motif, (4) AT/AC and GT/AT motif. -1 means no output for that motif

                                does not apply to annotated junctions

 

outSJfilterCountUniqueMin       3   1   1   1

    4 integers: minimum uniquely mapping read count per junction for: (1) non-canonical motifs, (2) GT/AG and CT/AC motif, (3) GC/AG and CT/GC motif, (4) AT/AC and GT/AT motif. -1 means no output for that motif

                                Junctions are output if one of outSJfilterCountUniqueMin OR outSJfilterCountTotalMin conditions are satisfied

                                does not apply to annotated junctions

 

outSJfilterCountTotalMin     3   1   1   1

    4 integers: minimum total (multi-mapping+unique) read count per junction for: (1) non-canonical motifs, (2) GT/AG and CT/AC motif, (3) GC/AG and CT/GC motif, (4) AT/AC and GT/AT motif. -1 means no output for that motif

                                Junctions are output if one of outSJfilterCountUniqueMin OR outSJfilterCountTotalMin conditions are satisfied

                                does not apply to annotated junctions

 

outSJfilterDistToOtherSJmin     10  0   5   10

    4 integers>=0: minimum allowed distance to other junctions' donor/acceptor

                                does not apply to annotated junctions

 

outSJfilterIntronMaxVsReadN        50000 100000 200000

    N integers>=0: maximum gap allowed for junctions supported by 1,2,3,,,N reads

                                i.e. by default junctions supported by 1 read can have gaps <=50000b, by 2 reads: <=100000b, by 3 reads: <=200000. by >=4 reads any gap <=alignIntronMax

                                does not apply to annotated junctions

 

### Scoring

scoreGap                     0

    int: splice junction penalty (independent on intron motif)

 

scoreGapNoncan               -8

    int: non-canonical junction penalty (in addition to scoreGap)

 

scoreGapGCAG                 -4

    GC/AG and CT/GC junction penalty (in addition to scoreGap)

 

scoreGapATAC                 -8

    AT/AC  and GT/AT junction penalty  (in addition to scoreGap)

 

scoreGenomicLengthLog2scale   -0.25

    extra score logarithmically scaled with genomic length of the alignment: scoreGenomicLengthLog2scale*log2(genomicLength)

 

scoreDelOpen                 -2

    deletion open penalty

 

scoreDelBase                 -2

    deletion extension penalty per base (in addition to scoreDelOpen)

 

scoreInsOpen                 -2

    insertion open penalty

 

scoreInsBase                 -2

    insertion extension penalty per base (in addition to scoreInsOpen)

 

scoreStitchSJshift           1

    maximum score reduction while searching for SJ boundaries inthe stitching step

 

 

### Alignments and Seeding

 

seedSearchStartLmax             50

    int>0: defines the search start point through the read - the read is split into pieces no longer than this value

 

seedSearchStartLmaxOverLread    1.0

    real: seedSearchStartLmax normalized to read length (sum of mates' lengths for paired-end reads)

 

seedSearchLmax       0

    int>=0: defines the maximum length of the seeds, if =0 max seed lengthis infinite

 

seedMultimapNmax      10000

    int>0: only pieces that map fewer than this value are utilized in the stitching procedure

 

seedPerReadNmax       1000

    int>0: max number of seeds per read

 

seedPerWindowNmax     50

    int>0: max number of seeds per window

 

seedNoneLociPerWindow    10

    int>0: max number of one seed loci per window

 

seedSplitMin                12

    int>0: min length of the seed sequences split by Ns or mate gap

 

alignIntronMin              21

    minimum intron size: genomic gap is considered intron if its length>=alignIntronMin, otherwise it is considered Deletion

 

alignIntronMax              0

    maximum intron size, if 0, max intron size will be determined by (2^winBinNbits)*winAnchorDistNbins

 

alignMatesGapMax            0

    maximum gap between two mates, if 0, max intron gap will be determined by (2^winBinNbits)*winAnchorDistNbins

 

alignSJoverhangMin          5

    int>0: minimum overhang (i.e. block size) for spliced alignments

 

alignSJstitchMismatchNmax   0 -1 0 0

    4*int>=0: maximum number of mismatches for stitching of the splice junctions (-1: no limit).

                            (1) non-canonical motifs, (2) GT/AG and CT/AC motif, (3) GC/AG and CT/GC motif, (4) AT/AC and GT/AT motif.

 

alignSJDBoverhangMin        3

    int>0: minimum overhang (i.e. block size) for annotated (sjdb) spliced alignments

 

alignSplicedMateMapLmin     0

    int>0: minimum mapped length for a read mate that is spliced

 

alignSplicedMateMapLminOverLmate 0.66

    real>0: alignSplicedMateMapLmin normalized to mate length

 

alignWindowsPerReadNmax     10000

    int>0: max number of windows per read

 

alignTranscriptsPerWindowNmax       100

    int>0: max number of transcripts per window

 

alignTranscriptsPerReadNmax               10000

    int>0: max number of different alignments per read to consider

 

alignEndsType           Local

    string: type of read ends alignment

                        Local             ... standard local alignment with soft-clipping allowed

                        EndToEnd          ... force end-to-end read alignment, do not soft-clip

                        Extend5pOfRead1   ... fully extend only the 5p of the read1, all other ends: local alignment

                        Extend5pOfReads12 ... fully extend only the 5p of the both read1 and read2, all other ends: local alignment

 

alignEndsProtrude       0    ConcordantPair

    int, string:        allow protrusion of alignment ends, i.e. start (end) of the +strand mate downstream of the start (end) of the -strand mate

                        1st word: int: maximum number of protrusion bases allowed

                        2nd word: string:

                                            ConcordantPair ... report alignments with non-zero protrusion as concordant pairs

                                            DiscordantPair ... report alignments with non-zero protrusion as discordant pairs

 

alignSoftClipAtReferenceEnds    Yes

    string: allow the soft-clipping of the alignments past the end of the chromosomes

                                Yes ... allow

                                No  ... prohibit, useful for compatibility with Cufflinks

 

alignInsertionFlush     None

    string: how to flush ambiguous insertion positions

                        None    ... insertions are not flushed

                        Right   ... insertions are flushed to the right

 

### Paired-End reads

peOverlapNbasesMin          0

    int>=0:             minimum number of overlap bases to trigger mates merging and realignment

 

peOverlapMMp                0.01

    real, >=0 & <1:     maximum proportion of mismatched bases in the overlap area

 

### Windows, Anchors, Binning

 

winAnchorMultimapNmax           50

    int>0: max number of loci anchors are allowed to map to

 

winBinNbits                     16

    int>0: =log2(winBin), where winBin is the size of the bin for the windows/clustering, each window will occupy an integer number of bins.

 

winAnchorDistNbins              9

    int>0: max number of bins between two anchors that allows aggregation of anchors into one window

 

winFlankNbins                   4

    int>0: log2(winFlank), where win Flank is the size of the left and right flanking regions for each window

 

winReadCoverageRelativeMin      0.5

    real>=0: minimum relative coverage of the read sequence by the seeds in a window, for STARlong algorithm only.

 

winReadCoverageBasesMin      0

    int>0: minimum number of bases covered by the seeds in a window , for STARlong algorithm only.

 

### Chimeric Alignments

chimOutType                 Junctions

    string(s): type of chimeric output

                            Junctions       ... Chimeric.out.junction

                            SeparateSAMold  ... output old SAM into separate Chimeric.out.sam file

                            WithinBAM       ... output into main aligned BAM files (Aligned.*.bam)

                            WithinBAM HardClip  ... (default) hard-clipping in the CIGAR for supplemental chimeric alignments (defaultif no 2nd word is present)

                            WithinBAM SoftClip  ... soft-clipping in the CIGAR for supplemental chimeric alignments

 

chimSegmentMin              0

    int>=0: minimum length of chimeric segment length, if ==0, no chimeric output

 

chimScoreMin                0

    int>=0: minimum total (summed) score of the chimeric segments

 

chimScoreDropMax            20

    int>=0: max drop (difference) of chimeric score (the sum of scores of all chimeric segments) from the read length

 

chimScoreSeparation         10

    int>=0: minimum difference (separation) between the best chimeric score and the next one

 

chimScoreJunctionNonGTAG    -1

    int: penalty for a non-GT/AG chimeric junction

 

chimJunctionOverhangMin     20

    int>=0: minimum overhang for a chimeric junction

 

chimSegmentReadGapMax       0

    int>=0: maximum gap in the read sequence between chimeric segments

 

chimFilter                  banGenomicN

    string(s): different filters for chimeric alignments

                            None ... no filtering

                            banGenomicN ... Ns are not allowed in the genome sequence around the chimeric junction

 

chimMainSegmentMultNmax        10

    int>=1: maximum number of multi-alignments for the main chimeric segment. =1 will prohibit multimapping main segments.

 

chimMultimapNmax                    0

    int>=0: maximum number of chimeric multi-alignments

                                0 ... use the old scheme for chimeric detection which only considered unique alignments

 

chimMultimapScoreRange          1

    int>=0: the score range for multi-mapping chimeras below the best chimeric score. Only works with --chimMultimapNmax > 1

 

chimNonchimScoreDropMin         20

    int>=0: to trigger chimeric detection, the drop in the best non-chimeric alignment score with respect to the read length has to be greater than this value

 

chimOutJunctionFormat           0

    int: formatting type for the Chimeric.out.junction file

                                0 ... no comment lines/headers

                                1 ... comment lines at the end of the file: command line and Nreads: total, unique, multi

 

### Quantification of Annotations

quantMode                   -

    string(s): types of quantification requested

                            -                ... none

                            TranscriptomeSAM ... output SAM/BAM alignments to transcriptome into a separate file

                            GeneCounts       ... count reads per gene

 

quantTranscriptomeBAMcompression    1       1

    int: -2 to 10  transcriptome BAM compression level

                            -2  ... no BAM output

                            -1  ... default compression (6?)

                             0  ... no compression

                             10 ... maximum compression

 

quantTranscriptomeBan       IndelSoftclipSingleend

    string: prohibit various alignment type

                            IndelSoftclipSingleend  ... prohibit indels, soft clipping and single-end alignments - compatible with RSEM

                            Singleend               ... prohibit single-end alignments

 

### 2-pass Mapping

twopassMode                 None

    string: 2-pass mapping mode.

                            None        ... 1-pass mapping

                            Basic       ... basic 2-pass mapping, with all 1st pass junctions inserted into the genome indices on the fly

 

twopass1readsN              -1

    int: number of reads to process for the 1st step. Use very large number (or default -1) to map all reads in the first step.

 

 

### WASP parameters

waspOutputMode              None

    string: WASP allele-specific output type. This is re-implemenation of the original WASP mappability filtering by Bryce van de Geijn, Graham McVicker, Yoav Gilad & Jonathan K Pritchard. Please cite the original WASP paper: Nature Methods 12, 1061–1063 (2015), https://www.nature.com/articles/nmeth.3582 .

                            SAMtag      ... add WASP tags to the alignments that pass WASP filtering

 

### STARsolo (single cell RNA-seq) parameters

soloType                    None

    string(s): type of single-cell RNA-seq

                            CB_UMI_Simple   ... (a.k.a. Droplet) one UMI and one Cell Barcode of fixed length in read2, e.g. Drop-seq and 10X Chromium

                            CB_UMI_Complex  ... one UMI of fixed length, but multiple Cell Barcodes of varying length, as well as adapters sequences are allowed in read2 only, e.g. inDrop.

 

soloCBwhitelist             -

    string(s): file(s) with whitelist(s) of cell barcodes. Only one file allowed with 

 

soloCBstart                 1

    int>0: cell barcode start base

 

soloCBlen                   16

    int>0: cell barcode length

 

soloUMIstart                17

    int>0: UMI start base

 

soloUMIlen                  10

    int>0: UMI length

 

soloBarcodeReadLength       1

    int: length of the barcode read

                            1   ... equal to sum of soloCBlen+soloUMIlen

                            0   ... not defined, do not check

 

soloCBposition              -

    strings(s)              position of Cell Barcode(s) on the barcode read.

                            Presently only works with --soloType CB_UMI_Complex, and barcodes are assumed to be on Read2.

                            Format for each barcode: startAnchor_startDistance_endAnchor_endDistance

                            start(end)Anchor defines the anchor base for the CB: 0: read start; 1: read end; 2: adapter start; 3: adapter end

                            start(end)Distance is the distance from the CB start(end) to the Anchor base

                            String for different barcodes are separated by space.

                            Example: inDrop (Zilionis et al, Nat. Protocols, 2017):

                            --soloCBposition  0_0_2_-1  3_1_3_8

 

soloUMIposition             -

    string                  position of the UMI on the barcode read, same as soloCBposition

                            Example: inDrop (Zilionis et al, Nat. Protocols, 2017):

                            --soloCBposition  3_9_3_14

 

soloAdapterSequence         -

    string:                 adapter sequence to anchor barcodes.

 

soloAdapterMismatchesNmax   1

    int>0:                  maximum number of mismatches allowed in adapter sequence.

 

soloCBmatchWLtype           1MM_multi

    string:                 matching the Cell Barcodes to the WhiteList

                            Exact                   ... only exact matches allowed

                            1MM                     ... only one match in whitelist with 1 mismatched base allowed. Allowed CBs have to have at least one read with exact match.

                            1MM_multi               ... multiple matches in whitelist with 1 mismatched base allowed, posterior probability calculation is used choose one of the matches. 

                                                        Allowed CBs have to have at least one read with exact match. Similar to CellRanger 2.2.0

                            1MM_multi_pseudocounts  ... same as 1MM_Multi, but pseudocounts of 1 are added to all whitelist barcodes.

                                                        Similar to CellRanger 3.x.x

 

soloStrand                  Forward

    string: strandedness of the solo libraries:

                            Unstranded  ... no strand information

                            Forward     ... read strand same as the original RNA molecule

                            Reverse     ... read strand opposite to the original RNA molecule

 

soloFeatures                Gene

    string(s):              genomic features for which the UMI counts per Cell Barcode are collected

                            Gene            ... genes: reads match the gene transcript

                            SJ              ... splice junctions: reported in SJ.out.tab

                            GeneFull        ... full genes: count all reads overlapping genes' exons and introns

                            Transcript3p   ... quantification of transcript for 3' protocols

 

soloUMIdedup                1MM_All

    string(s):              type of UMI deduplication (collapsing) algorithm

                            1MM_All             ... all UMIs with 1 mismatch distance to each other are collapsed (i.e. counted once)

                            1MM_Directional     ... follows the "directional" method from the UMI-tools by Smith, Heger and Sudbery (Genome Research 2017).

                            Exact               ... only exactly matching UMIs are collapsed

 

soloUMIfiltering            -

    string(s)               type of UMI filtering

                            -               ... basic filtering: remove UMIs with N and homopolymers (similar to CellRanger 2.2.0)

                            MultiGeneUMI    ... remove lower-count UMIs that map to more than one gene (introduced in CellRanger 3.x.x)

 

soloOutFileNames            Solo.out/          features.tsv barcodes.tsv        matrix.mtx

    string(s)               file names for STARsolo output:

                            file_name_prefix   gene_names   barcode_sequences   cell_feature_count_matrix

 

soloCellFilter              CellRanger2.2 3000 0.99 10

    string(s):              cell filtering type and parameters

                            CellRanger2.2   ... simple filtering of CellRanger 2.2, followed by thre numbers: number of expected cells, robust maximum percentile for UMI count, maximum to minimum ratio for UMI count

                            TopCells        ... only report top cells by UMI count, followed by the excat number of cells

                            None            ... do not output filtered cells

 

 

 

 

 

ラン

1,indexing 

mkdir genome #出力用のディレクトリを作成
STAR --runMode genomeGenerate --genomeDir genome/ --genomeFastaFiles reference.fasta --sjdbGTFfile reference.gtf --sjdbOverhang 100 --runThreadN 12
  • --runMode genomeGenerate  generate genome files  
  • --genomeDir path/to/genomeDir
  • --genomeFastaFiles path/to/genome/fasta1,fasta2...
  • --sjdbGTFfile path/to/annotation.gtf
  • --sjdbOverhang (default100) length of the donor/acceptor sequence on each side of the junctions, ideally = (mate_length - 1)
  • --runThreadN (default1)number of threads to run STAR

 

2、Mapping

STAR --genomeDir genome/ --readFilesIn R1.fastq R2.fastq --runThreadN 12 --outSAMtype BAM SortedByCoordinate --outFileNamePrefix sample1

#gzipped fastq
STAR --genomeDir genome/ --readFilesIn R1.fq.gz R2.fq.gz --runThreadN 12 --outSAMtype BAM SortedByCoordinate --outFileNamePrefix sample1 --readFilesCommand zcat
  • --genomeDir path/to/genomeDir
  • --readFilesIn paths to files that contain input read1 (and, if needed, read2)
  • --runThreadN (default1)number of threads to run STAR
  • --outFileNamePrefix output files name prefix (including full or relative path).
  • --outSAMtype BAM output BAM without sorting
  • --readFilesCommand       string(s): command line to execute for each of the input file. This command should generate FASTA or FASTQ text and send it to stdout. For example: zcat - to uncompress .gz files, bzcat - to uncompress .bz2 files, etc.

終わるとsample1Aligned.sortedByCoord.out.bamができる。

 

 

その他

  • contigの数が多くてメモリエラーが出てしまう場合はlimitGenomeGenerateRAMの数値を上げてやり直してください。それでもダメな場合はgenomeSAsparseDのフラグを1から2以上に切り替えて見てください。マッピング時間は長くなってしまいますが、メモリ要求量を削減できます。例えば"--genomeSAsparseD 3"など
  • STARではunmapのリード情報は出力されません。unmapをfastqとして出力するにはフラグを立てる必要があります。注意して下さい。
  • BAMoutput.cpp:27:BAMoutput: exiting because of *OUTPUT FILE* error:というエラーはユーザが利用できるリソースを増やすことで回避できます。ulimitの数を1024から倍に増やすには"ulimit -n 2048"
  • 最適化に関するペーパーが出ています。

    Optimizing RNA-Seq Mapping with STAR.

    https://www.ncbi.nlm.nih.gov/pubmed/27115637

  • 標準のSTAR出力には、ゲノムに対するリニアなアライメントのみが含まれていて、キメラ配列のアライメントは含まれていません。キメラ転写産物のリードも出力するには--chimSegmentMin <INT>を1以上に指定する必要があります(引用)。数値は、キメラ配列の2つのキメラセグメントのそれぞれに許される最小の長さを意味します。0だとキメラ配列のアラインメントは出力されません。
#gzipped fastq
STAR --genomeDir genome/ --readFilesIn R1.fq.gz R2.fq.gz --runThreadN 12 --outSAMtype BAM SortedByCoordinate --outFileNamePrefix sample1 --readFilesCommand zcat --chimSegmentMin 20
  • --chimSegmentMin <INT>   int>=0: minimum length of chimeric segment length, if ==0, no chimeric output

=> キメラ配列が含まれたアラインメントが追加で出力されます。Chimeric.out.junctionには、STAR専用フォーマットのキメラ・ジャンクション情報が含まれています。マニュアルに詳細は記載されています。

 

  • multiqcでレポートを作成
multiqc . #STARの出力ディレクトリで打つ。

f:id:kazumaxneo:20180206120540j:plain

 

  • 時間がかかりすぎることが気になるなら、RApMapを検討してみてください。メモリなどのリソースが少ないマシンでも動くように設計されています。

 

引用

STAR: ultrafast universal RNA-seq aligner

Alexander Dobin,1,* Carrie A. Davis,1 Felix Schlesinger,1 Jorg Drenkow,1 Chris Zaleski,1 Sonali Jha,1 Philippe Batut,1 Mark Chaisson,2 and Thomas R. Gingeras1

Bioinformatics. 2013 Jan; 29(1): 15–21.

 

Mapping RNA-seq Reads with STAR

Alexander Dobin and Thomas R. Gingeras

Curr Protoc Bioinformatics. Author manuscript; available in PMC 2016 Sep 3.

 

参考

バイオインフォ道場

http://bioinfo-dojo.net/2017/04/18/star_rna-seq-aligner/

 

RNA seqのリードカウント HTSeq-count

2020 8/15 condaによるインストールとhelp追記

2021 8/6 リンク消去

 

HTSeqはNGSデータの各種ハンドリングができるツール。ここではその1つhtseq-countコマンドを紹介する。htseq-countはリードのアライメントデータからカウントデータを出力するために使う。htseq-countを使うと、bamから数分でカウントデータを得ることができる。samtoolsのAPIを使い、bamの読み書きを行なっている(リンク)。2015年にBioinformaticsに論文が発表された。 

 

公式サイト

http://htseq.readthedocs.io/en/release_0.9.1/

 

インストール

依存

piplかcondaで導入できる。

pip install htseq

#bioconda(依存も含めて導入される)
conda install -c bioconda -y htseq

> htseq-count -h

$ htseq-count -h

usage: htseq-count [options] alignment_file gff_file

 

This script takes one or more alignment files in SAM/BAM format and a feature file in GFF format and calculates for each feature the number of reads mapping to it. See http://htseq.readthedocs.io/en/master/count.html for details.

 

positional arguments:

  samfilenames          Path to the SAM/BAM files containing the mapped reads. If '-' is selected, read from standard input

  featuresfilename      Path to the GTF file containing the features

 

optional arguments:

  -h, --help            show this help message and exit

  -f {sam,bam,auto}, --format {sam,bam,auto}

                        Type of <alignment_file> data. DEPRECATED: file format is detected automatically. This option is ignored.

  -r {pos,name}, --order {pos,name}

                        'pos' or 'name'. Sorting order of <alignment_file> (default: name). Paired-end sequencing data must be sorted either by position or by read name, and the sorting order must be specified. Ignored for single-end data.

  --max-reads-in-buffer MAX_BUFFER_SIZE

                        When <alignment_file> is paired end sorted by position, allow only so many reads to stay in memory until the mates are found (raising this number will use more memory). Has no effect for single end or paired end sorted by name

  -s {yes,no,reverse}, --stranded {yes,no,reverse}

                        Whether the data is from a strand-specific assay. Specify 'yes', 'no', or 'reverse' (default: yes). 'reverse' means 'yes' with reversed strand interpretation

  -a MINAQUAL, --minaqual MINAQUAL

                        Skip all reads with MAPQ alignment quality lower than the given minimum value (default: 10). MAPQ is the 5th column of a SAM/BAM file and its usage depends on the software used to map the reads.

  -t FEATURETYPE, --type FEATURETYPE

                        Feature type (3rd column in GTF file) to be used, all features of other type are ignored (default, suitable for Ensembl GTF files: exon)

  -i IDATTR, --idattr IDATTR

                        GTF attribute to be used as feature ID (default, suitable for Ensembl GTF files: gene_id). All feature of the right type (see -t option) within the same GTF attribute will be added together. The typical way of using this option is to count all exonic reads from each gene and

                        add the exons but other uses are possible as well.

  --additional-attr ADDITIONAL_ATTR

                        Additional feature attributes (default: none, suitable for Ensembl GTF files: gene_name). Use multiple times for more than one additional attribute. These attributes are only used as annotations in the output, while the determination of how the counts are added together is

                        done based on option -i.

  -m {union,intersection-strict,intersection-nonempty}, --mode {union,intersection-strict,intersection-nonempty}

                        Mode to handle reads overlapping more than one feature (choices: union, intersection-strict, intersection-nonempty; default: union)

  --nonunique {none,all,fraction,random}

                        Whether and how to score reads that are not uniquely aligned or ambiguously assigned to features (choices: none, all, fraction, random; default: none)

  --secondary-alignments {score,ignore}

                        Whether to score secondary alignments (0x100 flag)

  --supplementary-alignments {score,ignore}

                        Whether to score supplementary alignments (0x800 flag)

  -o SAMOUTS, --samout SAMOUTS

                        Write out all SAM alignment records into SAM/BAM files (one per input file needed), annotating each line with its feature assignment (as an optional field with tag 'XF'). See the -p option to use BAM instead of SAM.

  -p {SAM,BAM,sam,bam}, --samout-format {SAM,BAM,sam,bam}

                        Format to use with the --samout option.

  -d OUTPUT_DELIMITER, --delimiter OUTPUT_DELIMITER

                        Column delimiter in output (default: TAB).

  -c OUTPUT_FILENAME, --counts_output OUTPUT_FILENAME

                        Filename to output the counts to instead of stdout.

  --append-output       Append counts output. This option is useful if you have already creates a TSV/CSV/similar file with a header for your samples (with additional columns for the feature name and any additionl attributes) and want to fill in the rest of the file.

  -n NPROCESSES, --nprocesses NPROCESSES

                        Number of parallel CPU processes to use (default: 1).

  --feature-query FEATURE_QUERY

                        Restrict to features descibed in this expression. Currently supports a single kind of expression: attribute == "one attr" to restrict the GFF to a single gene or transcript, e.g. --feature-query 'gene_name == "ACTB"' - notice the single quotes around the argument of this

                        option and the double quotes around the gene name. Broader queries might become available in the future.

  -q, --quiet           Suppress progress report

  --version             Show software version and exit

 

Written by Simon Anders (sanders@fs.tum.de), European Molecular Biology Laboratory (EMBL) and Fabio Zanini (fabio.zanini@unsw.edu.au), UNSW Sydney. (c) 2010-2020. Released under the terms of the GNU General Public License v3. Part of the 'HTSeq' framework, version 0.12.4.

htseq-count-barcodes -h

$ htseq-count-barcodes -h

usage: htseq-count-barcodes [options] alignment_file gff_file

 

This script takes one alignment file in SAM/BAM format and a feature file in GFF format and calculates for each feature the number of reads mapping to it, accounting for barcodes. See http://htseq.readthedocs.io/en/master/count.html for details.

 

positional arguments:

  samfilename           Path to the SAM/BAM file containing the barcoded, mapped reads. If '-' is selected, read from standard input

  featuresfilename      Path to the GTF file containing the features

 

optional arguments:

  -h, --help            show this help message and exit

  -f {sam,bam,auto}, --format {sam,bam,auto}

                        Type of <alignment_file> data. DEPRECATED: file format is detected automatically. This option is ignored.

  -r {pos,name}, --order {pos,name}

                        'pos' or 'name'. Sorting order of <alignment_file> (default: name). Paired-end sequencing data must be sorted either by position or by read name, and the sorting order must be specified. Ignored for single-end data.

  --max-reads-in-buffer MAX_BUFFER_SIZE

                        When <alignment_file> is paired end sorted by position, allow only so many reads to stay in memory until the mates are found (raising this number will use more memory). Has no effect for single end or paired end sorted by name

  -s {yes,no,reverse}, --stranded {yes,no,reverse}

                        Whether the data is from a strand-specific assay. Specify 'yes', 'no', or 'reverse' (default: yes). 'reverse' means 'yes' with reversed strand interpretation

  -a MINAQUAL, --minaqual MINAQUAL

                        Skip all reads with MAPQ alignment quality lower than the given minimum value (default: 10). MAPQ is the 5th column of a SAM/BAM file and its usage depends on the software used to map the reads.

  -t FEATURETYPE, --type FEATURETYPE

                        Feature type (3rd column in GTF file) to be used, all features of other type are ignored (default, suitable for Ensembl GTF files: exon)

  -i IDATTR, --idattr IDATTR

                        GTF attribute to be used as feature ID (default, suitable for Ensembl GTF files: gene_id)

  --additional-attr ADDITIONAL_ATTR

                        Additional feature attributes (default: none, suitable for Ensembl GTF files: gene_name). Use multiple times for each different attribute

  -m {union,intersection-strict,intersection-nonempty}, --mode {union,intersection-strict,intersection-nonempty}

                        Mode to handle reads overlapping more than one feature (choices: union, intersection-strict, intersection-nonempty; default: union)

  --nonunique {none,all}

                        Whether to score reads that are not uniquely aligned or ambiguously assigned to features

  --secondary-alignments {score,ignore}

                        Whether to score secondary alignments (0x100 flag)

  --supplementary-alignments {score,ignore}

                        Whether to score supplementary alignments (0x800 flag)

  -o SAMOUT, --samout SAMOUT

                        Write out all SAM alignment records into aSAM/BAM file, annotating each line with its feature assignment (as an optional field with tag 'XF'). See the -p option to use BAM instead of SAM.

  -p {SAM,BAM,sam,bam}, --samout-format {SAM,BAM,sam,bam}

                        Format to use with the --samout option.

  -d OUTPUT_DELIMITER, --delimiter OUTPUT_DELIMITER

                        Column delimiter in output (default: TAB).

  -c OUTPUT_FILENAME, --counts_output OUTPUT_FILENAME

                        TSV/CSV filename to output the counts to instead of stdout.

  --cell-barcode CB_TAG

                        BAM tag used for the cell barcode (default compatible with 10X Genomics Chromium is CB).

  --UMI UB_TAG          BAM tag used for the unique molecular identifier, also known as molecular barcode (default compatible with 10X Genomics Chromium is UB).

  -q, --quiet           Suppress progress report

  --version             Show software version and exit

 

Written by Simon Anders (sanders@fs.tum.de), European Molecular Biology Laboratory (EMBL) and Fabio Zanini (fabio.zanini@unsw.edu.au), UNSW Sydney. (c) 2010-2020. Released under the terms of the GNU General Public License v3. Part of the 'HTSeq' framework, version 0.12.4.

htseq-qa -h

$ htseq-qa -h

usage: htseq-qa [-h] [-t {sam,bam,solexa-export,fastq,solexa-fastq}] [-o OUTFILE] [-r READLEN] [-g GAMMA] [-n] [-m MAXQUAL] [--primary-only] [--max-records MAX_RECORDS] readfilename

 

This script take a file with high-throughput sequencing reads (supported formats: SAM, Solexa _export.txt, FASTQ, Solexa _sequence.txt) and performs a simply quality assessment by producing plots showing the distribution of called bases and base-call quality scores by position within the reads. The

plots are output as a PDF file.

 

positional arguments:

  readfilename          The file to count reads in (SAM/BAM or Fastq)

 

optional arguments:

  -h, --help            show this help message and exit

  -t {sam,bam,solexa-export,fastq,solexa-fastq}, --type {sam,bam,solexa-export,fastq,solexa-fastq}

                        type of read_file (one of: sam [default], bam, solexa-export, fastq, solexa-fastq)

  -o OUTFILE, --outfile OUTFILE

                        output filename (default is <read_file>.pdf)

  -r READLEN, --readlength READLEN

                        the maximum read length (when not specified, the script guesses from the file

  -g GAMMA, --gamma GAMMA

                        the gamma factor for the contrast adjustment of the quality score plot

  -n, --nosplit         do not split reads in unaligned and aligned ones

  -m MAXQUAL, --maxqual MAXQUAL

                        the maximum quality score that appears in the data (default: 41)

  --primary-only        For SAM/BAM input files, ignore alignments that are not primary. This only affects 'multimapper' reads that align to several regions in the genome. By choosing this option, each read will only count as one; without this option, each of its alignments counts as one.

  --max-records MAX_RECORDS

                        Limit the analysis to the first N reads/alignments.

 

 

ラン

シングルエンドのカウント。

htseq-count -f bam align.bam reference.gtf > count.txt
  •  -f type of <alignment_file> data, either 'sam' or 'bam' (default: sam)

 

ペアードエンドのカウント。入力するbamはあらかじめsamtoolsでソートされている必要がある。

htseq-count -f bam -r pos align.bam reference.gtf > count.txt
  • -r  'pos' or 'name'. Sorting order of <alignment_file> (default: name). Paired-end sequencing data must be sorted either by position or by read name, and the sorting order must be specified. Ignored for single- end data.
  • -f type of <alignment_file> data, either 'sam' or 'bam' (default: sam)
  • -s whether the data is from a strand-specific assay. Specify 'yes', 'no', or 'reverse' (default: yes). 'reverse' means 'yes' with reversed strand
  • -a skip all reads with alignment quality lower than the given minimum value (default: 10)

 

 

引用

HTSeq—a Python framework to work with high-throughput sequencing data

Simon Anders,* Paul Theodor Pyl, and Wolfgang Huber

Bioinformatics. 2015 Jan 15; 31(2): 166–169.

 

 

クオリティトリミングツール sickle

2020 10/31 インストール追記

2020 11/24 help追記

2021 6/15 コマンド追記

 

sickleはfastqのクオリティトリミングツール。リード長の0.1倍のウィンドウサイズでリードを分析し、指定値以下のクオリティになった領域をトリムする。Trimmomaticと同様、ペアリードの順番が破壊されないよう、ペアの数を同じに揃えて出力できる(orphanなリードは別出力)。

 

 

インストール

condaやbrewで導入できる。

Github

#conda
conda install -c bioconda -y sickle-trim

#homebrew
brew install sickle

sickle se

$ sickle se

 

Usage: sickle se [options] -f <fastq sequence file> -t <quality type> -o <trimmed fastq file>

 

Options:

-f, --fastq-file, Input fastq file (required)

-t, --qual-type, Type of quality values (solexa (CASAVA < 1.3), illumina (CASAVA 1.3 to 1.7), sanger (which is CASAVA >= 1.8)) (required)

-o, --output-file, Output trimmed fastq file (required)

-q, --qual-threshold, Threshold for trimming based on average quality in a window. Default 20.

-l, --length-threshold, Threshold to keep a read based on length after trimming. Default 20.

-x, --no-fiveprime, Don't do five prime trimming.

-n, --trunc-n, Truncate sequences at position of first N.

-g, --gzip-output, Output gzipped files.

--quiet, Don't print out any trimming information

--help, display this help and exit

--version, output version information and exit

 

sickle pe -h

$ sickle pe -h

sickle: invalid option -- h

 

If you have separate files for forward and reverse reads:

Usage: sickle pe [options] -f <paired-end forward fastq file> -r <paired-end reverse fastq file> -t <quality type> -o <trimmed PE forward file> -p <trimmed PE reverse file> -s <trimmed singles file>

 

If you have one file with interleaved forward and reverse reads:

Usage: sickle pe [options] -c <interleaved input file> -t <quality type> -m <interleaved trimmed paired-end output> -s <trimmed singles file>

 

If you have one file with interleaved reads as input and you want ONLY one interleaved file as output:

Usage: sickle pe [options] -c <interleaved input file> -t <quality type> -M <interleaved trimmed output>

 

Options:

Paired-end separated reads

--------------------------

-f, --pe-file1, Input paired-end forward fastq file (Input files must have same number of records)

-r, --pe-file2, Input paired-end reverse fastq file

-o, --output-pe1, Output trimmed forward fastq file

-p, --output-pe2, Output trimmed reverse fastq file. Must use -s option.

 

Paired-end interleaved reads

----------------------------

-c, --pe-combo, Combined (interleaved) input paired-end fastq

-m, --output-combo, Output combined (interleaved) paired-end fastq file. Must use -s option.

-M, --output-combo-all, Output combined (interleaved) paired-end fastq file with any discarded read written to output file as a single N. Cannot be used with the -s option.

 

Global options

--------------

-t, --qual-type, Type of quality values (solexa (CASAVA < 1.3), illumina (CASAVA 1.3 to 1.7), sanger (which is CASAVA >= 1.8)) (required)

-s, --output-single, Output trimmed singles fastq file

-q, --qual-threshold, Threshold for trimming based on average quality in a window. Default 20.

-l, --length-threshold, Threshold to keep a read based on length after trimming. Default 20.

-x, --no-fiveprime, Don't do five prime trimming.

-n, --truncate-n, Truncate sequences at position of first N.

-g, --gzip-output, Output gzipped files.

--quiet, do not output trimming info

--help, display this help and exit

--version, output version information and exit

 

 

入出力について

  • 対応クオリティフォーマット

=> Illumina、Solexa、Sanger。

  • 3行目の+のラインは入力に関わらずCASAVA >= 1.8で標準の+だけで出力される。
  • gzip圧縮ファイルの入力にも対応。
  • 出力はdefaulでは非圧縮fastq。

 

 

ラン

シングルエンド。Q30以下の領域をトリムし、40-bp以下になったリードは除く。

sickle se -f single.fq -t sanger -o trimmed_output.fq -q 30 -l 40
  • se single-end sequence trimming
  • -f Input fastq file (required)
  • -t Type of quality values (solexa (CASAVA < 1.3), illumina (CASAVA 1.3 to 1.7), sanger (which is CASAVA >= 1.8)) (required)
  • -o Output trimmed fastq file (required)
  • -q Threshold for trimming based on average quality in a window. Default 20.
  • -l  Threshold to keep a read based on length after trimming. Default 20.
  • -x Don't do five prime trimming.

-xをつけると3'側のみがトリミング対象になる。

 

 

ペアエンド。Q30以下の領域をトリムし、20-bp以下になったリードは除く。-gをつけるとgzip出力する。

sickle pe -f R1.fq.gz -r R2.fq.gz -t sanger -o trimmed_R1.fq.gz -p trimmed_R2.fq.gz -s trimmed_singles.fq.gz -q 30 -l 20 -g
  • pe paired-end sequence trimming
  • -f Input paired-end forward fastq file (Input files must have same number of records)
  • -r Input paired-end reverse fastq file
  • -o Output trimmed forward fastq file
  • -p Output trimmed reverse fastq file. Must use
  • -s --output-single, Output trimmed singles fastq file
  • -q Threshold for trimming based on average quality in a window. Default 20.
  • -l Threshold to keep a read based on length after trimming. Default 20.
  • -n Truncate sequences at position of first N.
  • -g    Output gzipped files.

 

ペアエンドのインターレースファイル。

sickle pe -c interlace.fastq -t sanger -m interlace_trimmed.fastq -s trimmed_singles.fastq
  • -c Combined (interleaved) input paired-end fastq
  • -m Output combined (interleaved) paired-end fastq file. Must use -s option.
  • -M Output combined (interleaved) paired-end fastq file with any discarded read written to output file as a single N. Cannot be used with the -s option.

 

 

 

テスト

最近シーケンスしたデータを使う。

p='R1.fq' 
q='R2.fq'
mkdir raw_data_qc_reports
mkdir Quality30_trimmed_reports
sickle pe -f $p -r $q -t sanger -o ${p%.fastq}_Q30_trimmed.fastq -p ${q%.fastq}_Q30_trimmed.fastq -s trimmed_singles.fastq -q 30 -l 20

 

fastqcで分析

fastqc --nogroup -o ./raw_data_qc_reports $p $q 

 処理前

f:id:kazumaxneo:20170907172550j:plain

処理後

a=${p%.fastq}_Q30_trimmed.fastq 
b=${q%.fastq}_Q30_trimmed.fastq
fastqc --nogroup -o ./Quality30_trimmed_reports $a $b

f:id:kazumaxneo:20170907172606j:plain

 

引用

Sickle: A sliding-window, adaptive, quality-based trimming tool for FastQ files (Version 1.33)

Joshi NA, Fass JN. (2011).

[Software]. Available at https://github.com/najoshi/sickle.