Informatics on Mac (macでインフォマティクス)

A collection of notes on informatics for NGS.

EvidentialGene: selecting the primary transcripts from a transcriptome

 

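EvidentialGene's tr2aacds.pl is a pipeline that clusters the output of de novo assemblers into the best, most biologically useful set of mRNAs. The paper is still in preparation and some details are unclear, but according to the poster it reduces redundant transcripts as follows: after running fastanrdb and cd-hit, it uses blast to select the primary transcripts.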

 

Algorithm of tr2aacds:

 0. collect input transcripts.tr, produce CDS and AA sequences, work mostly on CDS.

 1. perfect redundant removal with fastanrdb

 2. perfect fragment removal with cd-hit-est

 3. blastn, basic local align high-identity subsequences for alternate tr.

 4. classify main/alternate cds, okay & drop subsets by CDS-align, protein metrics.

 5. output sequence sets from classifier: okay-main, okay-alts, drops. See  http://eugenes.org/EvidentialGene/about/EvidentialGene_trassembly_pipe.html
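Step 1 of the list above (removing perfect duplicates) can be pictured with a small sketch. This is only an illustration of the idea of exact-duplicate removal; it does not reproduce fastanrdb's actual behavior or options, and the function name and toy sequences are made up for the example.

```python
def drop_exact_duplicates(records):
    """Keep the first record for each distinct sequence (case-insensitive).

    records: list of (name, sequence) tuples.
    Illustrative only; not how fastanrdb is implemented.
    """
    seen = set()
    unique = []
    for name, seq in records:
        key = seq.upper()
        if key not in seen:
            seen.add(key)
            unique.append((name, seq))
    return unique


# Toy input (made-up sequences, not real data)
records = [("t1", "ATGGCC"), ("t2", "atggcc"), ("t3", "ATGGCA")]
for name, seq in drop_exact_duplicates(records):
    print(name, seq)
```

Here t2 is dropped because its sequence is identical to t1 once case is ignored.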

  

It appears to be well regarded (ref1), and several de novo transcriptome papers have already used it to merge the results of multiple de novo assemblers and reduce redundancy.

 

Official site

http://arthropods.eugenes.org/genes2/about/EvidentialGene_trassembly_pipe.html

wiki

https://sourceforge.net/p/evidentialgene/wiki/Home/ 

 

Installation

Dependencies

  • fastanrdb of exonerate package, quickly reduces perfect duplicate sequences
  • cd-hit, cd-hit-est clusters protein or nucleotide sequences.
  • blastn and makeblastdb of NCBI BLAST, Basic Local Alignment Search Tool, finds regions of local similarity between sequences.

 

Main program

http://arthropods.eugenes.org/EvidentialGene/evigene/

Download it from the ftp link given under "The best way to get ...".

Unpack it and use pub/evigene/scripts/prot/tr2aacds2.pl.

$ perl evigene/scripts/prot/tr2aacds2.pl 

EvidentialGene tr2aacds.pl VERSION 2017.12.21

  convert large, redundant mRNA assembly set to best protein coding sequences, 

  filtering by quality of duplicates, fragments, and alternate transcripts.

  See http://eugenes.org/EvidentialGene/about/EvidentialGene_trassembly_pipe.html

Usage: tr2aacds.pl -mrnaseq transcripts.fasta[.gz] 

  opts: -MINCDS=90 -NCPU=1 -MAXMEM=1000.Mb -[no]smallclass -logfile -tidyup -dryrun -debug

 

Run

The input is a FASTA file that merges the results of de novo assemblers. If there are several, concatenate them first, for example with "cat *fa > merged.fa".

 

Some preparation is needed before the analysis. Following the De novo transcriptome assembly workflow that Jared Mamrot posted on protocols.io, rename the FASTA headers to simple names.

perl -ane 'if(/\>/){$a++;print ">Locus_$a\n"}else{print;}' input.fasta > renamed.fasta

# then regularize the format
perl evigene/scripts/rnaseq/trformat.pl -output output.fasta -input renamed.fasta

(trformat.pl :regularize IDs in fasta of Velvet,Soap,Trinity, ensure unique IDs, add prefixes for parameter sets.)
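For readers who prefer Python, the header-renaming one-liner above can be sketched as follows. This is a rough equivalent rather than a drop-in replacement: it only treats lines that start with ">" as headers (the Perl version matches ">" anywhere in the line), and the function name and the "Locus_" default are just taken from the example above.

```python
import io


def rename_headers(lines, prefix="Locus_"):
    """Yield FASTA lines with each header replaced by >prefix + running number."""
    count = 0
    for line in lines:
        if line.startswith(">"):
            count += 1
            yield f">{prefix}{count}\n"
        else:
            yield line


# Toy FASTA (made-up records) held in memory instead of a file
toy_fasta = io.StringIO(">TRINITY_DN1_c0_g1 len=4\nACGT\n>NODE_2_length_2\nGG\n")
print("".join(rename_headers(toy_fasta)), end="")
```

To apply it to real files, open input.fasta for reading and write the yielded lines to renamed.fasta.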

 

perl evigene/scripts/prot/tr2aacds2.pl -mrnaseq output.fasta -MINCDS=90 -NCPU=12 -MAXMEM=1000.Mb -logfile

 

 

References

poster

 Gene-omes built from mRNA seq not genome DNA

ref1: More efficient than cd-hit.

https://sourceforge.net/p/evidentialgene/discussion/general/thread/a4f0e29f/

 

transfuse: merging multiple transcriptomes

 

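transfuse is a tool that clusters the transcripts that pass its filters and produces fused transcripts. It is used to merge the transcripts built by several RNA assemblers into a higher-grade set of transcripts. The paper is reportedly in preparation.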

 

Installation

Github

https://github.com/cboursnell/transfuse

transfuse is written in Ruby and can be installed with gem, Ruby's package manager.

sudo gem install transfuse

> transfuse

$ transfuse

/System/Library/Frameworks/Ruby.framework/Versions/2.3/usr/lib/ruby/2.3.0/universal-darwin17/rbconfig.rb:214: warning: Insecure world writable dir /Users/kazumaxneo/Documents/art_bin_MountRainier/ in PATH, mode 040777

 

  Transfuse v0.5.0

  by Chris Boursnell <cmb211@cam.ac.uk> and

     Richard Smith-Unna <rds45@cam.ac.uk>

 

  DESCRIPTION:

  Merge multiple assemblies.

 

  USAGE:

  transfuse <options>

 

  OPTIONS:

  -a, --assemblies=<s>    assembly files in FASTA format, comma-separated

  -l, --left=<s>          left reads file in FASTQ format

  -r, --right=<s>         right reads file in FASTQ format

  -o, --output=<s>        write merged assembly to file

  -t, --threads=<i>       number of threads (default: 1)

  -i, --id=<f>            sequence identity to cluster at (default: 1.0)

  -n, --install           install dependencies

  -v, --verbose           be verbose

  -e, --version           Print version and exit

  -h, --help              Show this message

 

Run

 

transfuse --assemblies soap-k31.fa,soap-k41.fa,soap-k51.fa --left reads_1.fq --right reads_2.fq --output soap-merged.fa --threads 12

 

 

References

https://github.com/cboursnell/transfuse

 

https://groups.google.com/forum/#!topic/trinityrnaseq-users/Rt9Wnrs3k0A

 

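salmon: ultra-fast read counting (quantification) for RNA-seq

salmon is a method for fast, accurate, and robust quantification of RNA-seq expression that incorporates rich bias models. The authors present data showing that, at the same FDR, it is more than twice as accurate as kallisto and eXpress (less than half as many genes are called as DEGs, i.e. fewer false positives).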

(Figure reproduced from Supplementary Figure 1.)

 

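salmon performs both "alignment" and "quantification" on its own: it takes an indexed transcript reference and FASTQ files as input and quantifies directly, without writing an intermediate alignment file. This saves considerable time and disk space. As one example, 600 million reads (75 bp, paired-end) can reportedly be quantified in only 23 minutes (with 30 threads). It was published in Nature Methods in 2017.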

 

Official site

Overview – Salmon: Fast, accurate and bias-aware transcript quantification from RNA-seq data

quick start guide

Getting Started – Salmon: Fast, accurate and bias-aware transcript quantification from RNA-seq data

Document

http://salmon.readthedocs.io/en/latest/

 

Installation

It can be installed with brew.

brew install salmon

Prebuilt binaries are also available from the GitHub releases, and a Docker image is provided as well.

 > salmon

$ salmon

Salmon v0.8.2

 

Usage:  salmon -h|--help or 

        salmon -v|--version or 

        salmon -c|--cite or 

        salmon [--no-version-check] <COMMAND> [-h | options]

 

Commands:

     cite  Show salmon citation information

     index Create a salmon index

     quant Quantify a sample

     swim  Perform super-secret operation

 

 

Run

Build an index for the transcript FASTA.

salmon index -p 2 -t transcripts.fa.gz -i ref_index 
  • -p  [ --threads ] arg (=2) Number of threads to use (only used for computing bias features)
  • -t  [ --transcripts ] arg Transcript fasta file.
  • -k  [ --kmerLen ] arg (=31) The size of k-mers that should be used for the quasi index.
  • -i  [ --index ] arg Salmon index.

 

Quantification

salmon quant -i ref_index -l A -1 pair1.fastq.gz -2 pair2.fastq.gz -p 8 -o output/sample1

Several files are created in output/sample1/. quant.sf is the file with the quantification results.
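As a quick way to look at quant.sf, the following sketch lists the transcripts with the highest TPM. It assumes the usual tab-separated layout with the header Name, Length, EffectiveLength, TPM, NumReads; the function name and the example rows are made up for illustration.

```python
import csv
import io


def top_transcripts(handle, n=3):
    """Return the n (name, TPM) pairs with the highest TPM from a quant.sf handle."""
    reader = csv.DictReader(handle, delimiter="\t")
    rows = [(row["Name"], float(row["TPM"])) for row in reader]
    rows.sort(key=lambda item: item[1], reverse=True)
    return rows[:n]


# Made-up example content in quant.sf layout (not real results)
EXAMPLE_QUANT_SF = (
    "Name\tLength\tEffectiveLength\tTPM\tNumReads\n"
    "tx1\t1500\t1320.5\t12.0\t30\n"
    "tx2\t900\t720.5\t250.5\t410\n"
    "tx3\t2000\t1820.5\t0.0\t0\n"
)

for name, tpm in top_transcripts(io.StringIO(EXAMPLE_QUANT_SF), n=2):
    print(name, tpm)
```

For a real run, replace the in-memory example with open("output/sample1/quant.sf").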

 

References

Salmon provides fast and bias-aware quantification of transcript expression

Rob Patro, Geet Duggal, Michael I Love, Rafael A Irizarry & Carl Kingsford

Nature Methods 14, 417–419 (2017)

 

RNA seq Blog

http://www.rna-seqblog.com/salmon-fast-and-bias-aware-quantification-of-transcript-expression/

Primer3_masker: supporting primer design by masking regions where specific primers cannot be designed

 

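Primer3_masker builds a database of k-mer frequencies for a genome and masks the sequences to which primers would bind with high frequency, which helps in designing specific primers.

The paper was accepted on January 17, 2018, and the documentation is still sparse. I will add to this post when there are updates.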

 

Installation

Github

https://github.com/bioinfo-ut/primer3_masker

git clone https://github.com/bioinfo-ut/primer3_masker 
cd primer3_masker/src/
make primer3_masker

# download the k-mer lists (for the human genome)
cd ../kmer_lists/
wget http://primer3.ut.ee/lists/homo_sapiens_11.list # 11-mer list
wget http://primer3.ut.ee/lists/homo_sapiens_16.list # 16-mer list

> ./primer3_masker 

$ primer3_masker 

Usage: ./primer3_masker [OPTIONS] <INPUTFILE>

Options:

    -h, --help                   - print this usage screen and exit

 

    -p, --probability_cutoff     - masking cutoff [0, 1] (default: >=0.1)

    -lh, --kmer_lists_path       - path to the kmer list files (default: ../kmer_lists/)

    -lp, --list_prefix           - prefix of the k-mer lists to use with default model (default: homo_sapiens)

 

    -a, --absolute_value_cutoff  - masking cutoff based on k-mer count; requires a single list name, defined with -l

    -l, --list                   - define a single k-mer list for masking with absolute cutoff option -a

 

    -m5, --mask_5p               - nucleotides to mask in 5' direction (default: 1)

    -m3, --mask_3p               - nucleotides to mask in 3' direction (default: 0)

    -c, --masking_char           - character used for masking (default: N)

    -s, --soft_mask              - use soft masking (default: false)

    -d, --masking_direction      - a strand to mask (fwd, rev, both) (default: both)

Move the binary to a directory on your PATH.

 

 

Run

Download the k-mer lists.

http://primer3.ut.ee/lists.htm

If you want to mask the genome of an organism that is not on the list, the authors ask you to contact them (see the link above).

 

Design primers for template.fasta (a human genome sequence). The sequence looks like this.

$ cat ../test_data/template.fasta |fold -w 80

>template

TTGTCAAGGTTAGATGCTGTTTCTACAGGTCACCAACTGCGGAAACAATGACATGGTCTGAAAATATGGACACGCTTTTA

GCCAACCAAGGTAAGATTTAACTAATAATAGGCTTAAAATACAATAATTAAATATAAATTATTAAATTCTGAAAGTTGGT

AACATATCATAAAGTATGAGTTTAATCAATGAAGTATAAAATTATTAATAATCATAAATTCATAAAAATCCAAAATCTAA

ATAGAATCAGGTTGGGGCTAAAATAAGTTTATAGGTTAACTCTGTACATTAAAACAAAAGGGAAATTCAATCTAGCAAGT

GAAATTTTCCATTGCCTTAGACTCACTTTAACATTTTTTATTATTTTTTATTTTAATACAGAGTCTCACTCTCTCTCTCT

ATCAGGCTGGAGTGCAGTGGCATGATCTCAGCTCACTGCAAACTCCACCTTCTGGGTTCAAGCAATTTTCCTGACTCAGC

CTCCTGAGTAGCTGAGATTACAGACATGCACCACCATACCCGGCTAATTTTTGTATTTTTAGTAGAGACAGGGTTTCACC

ATGTTGGCCAGGCTGGTATCAAACTCCTGACCTCAGGTGATCCACCCACCTCAGCATCCCAAAGTGCTGGGATTCAATTC

AGGTGTGAGCCACTGTGCCAGCCCTAGGCTCGCTGTGTGTGTGTGTGTGTGTATACACACATACACATACATATATATAT

GTATTTTTTTTTTTTTTGAGACGGAGTCTTGCTTTACCACCCAGACTGGAGTGTAGAGTGTAGTGGTGTGATCTCTGCTC

ACTGCAACCTCTGCCTCCCGGGTTCAAGGGATTCTCCTGCCTCAGCCTCCCGAGGAGCTGGGACTACGGGAGCATGCCAC

GACACCAAGCTAATATGTGTATTTTTAGTAGAGACAGGTGTTCGCCACATTAGCCAGGCTGGTCTCGAACTTCTGACCCC

AGATGATCTGCCTGCCTTGACCTCCCAAAGTGCTAGGAT

 

primer3_masker -lh test_data/ -lp test test_data/template.fasta
  • -lh path to the kmer list files (default: ../kmer_lists/)
  • -lp define prefix of the k-mer lists to use (default: homo_sapiens) 

Result (piped to fold and wrapped at 80 characters)

>template

TTGTCAAGGTTAGATGCTGTTTCTACAGGTCACCAACTGCGGAAACAATGACATGGTCTGAAAATATGGACACGCTTTTA

GCCAACCAAGGTAAGATTTAACTAATAATAGGCTTAAAATACAATAATTAAATATAAATTATTAAATTCTGAAAGTTGGT

AACATATCATAAAGTATGAGTTTAATCAATGAAGTATAAAATTATTAATAATCATAAATTCATAAAAATCCAAAATCTAA

ATAGAATCAGGTTGGGGCTAAAATAAGTTTATAGGTTAACTCTGTACATTAAAACAAAAGGGAAATTCAATCTAGCAAGT

GAAATTTTCCATTGCCTTAGACTCACTTTAACATTTTTTATTATTTTTTATTTTAATACAGAGTCTCACTCTCTCTCTCT

ATNNNNNNNNAGTGCAGNNNNNTGNNCTCAGCTCACTGCNNACTCCACCTTCTGGGTTCAAGCAATTTTCCTGANNNNNC

CTCCTGAGTNNNNNAGATTACAGACATGCACCACCATACCCNNNNNNNNNNNNNNNNNNNNNNNNNNNNANNNNNNCANN

NNGTTNGCNNNNCNNGNATCANNNNCCTGACCTCAGGTGATCCACCCACCTCAGCANNNCAAAGTGCTGGGNNNCAATTC

AGGTGTGAGCCACTGTGCCAGCCCTAGGCTCGCNNNNNNNGTGTGTGNNNNNNTACACACATACACATACATATATATAT

GTANNNNNNNNNTTNNNGNNNNGGAGTCTTGCTTTACCACCCAGACTGGAGTGTAGAGTGTAGTGGTGTGATCTCTNNNN

NCTGCAACCTCTGCCTCCCGGGTTCAAGGNNNNNNNNNNCCTCANNNNNNNNAGGAGCTGGGACTACGGGAGCATGCCAC

GACACCAAGCTAATATGNNNNNNTTTAGTAGANNNAGGTGTTCGCCACATTAGCCAGGCTGGTCTCGAACTTCTGACCCC

AGATGATCTGCCTGCCTTGACCTCCCAAAGTGCTAGGAT

 

 

Mask k-mers with a frequency of 10 or more.

primer3_masker -a 10 -l test_data/test_16.list test_data/template.fasta |fold -w 80
  • -a masking cutoff based on k-mer count; requires a single list name, defined with -l
  • -l define a single k-mer list; for using with absolute cutoff option -a 

This is a stricter condition than the first run.

>template

TTGTCAAGGTTAGATGCTGTTTCTACAGGTCACCAACTGCGGAAACAATGACATGGTCTGAAAATATGGACACGCTTTTA

GCCAACCAAGGTAAGATTTAACTAATAATAGGCTTAAAATACAATAATTAAATATAAATTATTAAATTCTGAAAGTTGGT

AACATATCATAAAGTATGAGTTTAATCAATGAAGTATAAAATTATTAATAATCATAAATTCATAAAAATCCAAAATCTAA

ATAGAATCAGGTTGGGGCTAAAATAAGTTTATAGGTTAACTCTGTACATTAAAACAAAAGGGAAATTCAATCTAGCAAGT

GAAATTTTCCATTGCCTTAGACTCACTTTAACATTTTTTATTATTTTTTATTTTAATACAGAGTCTCACTCTCTCTCTCT

ANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNACTCCACCTTNTGGGTNCAAGCAATNTNCCTNANNNNNN

NNCNTGNNTNNNNNNNNTTACNNACATGCACCACCATNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

NNNNNNNNNNNNNNNNNANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNANNNCAAAGTGCTGGGNNNCAATTN

NNNNGTGAGCCACTNNNNNAGCCCTAGGCTCGNNNNNNNNGTGTGTGNNNNNNNNCACACATACACATACATATATATAN

NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNTTACNACCCAGACTGGAGTNTAGAGTGTAGTGGTGTGNNNNNNNNNN

NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNTNNNNNNNNNNNNGAGCTGGGACTACGGGAGCATGCCAC

GACACCAAGCTAATANNNNNNNNTTTAGTANNNNNNNNTGTTCGCCACANTANNNNGGCTGGTCNCGNNNNTCTGACCCC

AGANGATCTGCCTGNNNNNANNNNCCAAANNNNNANNNN

 

With -a set to 2, naturally even more regions are masked.

>template

TTGTCAAGGTTAGATGCTGTTTCTACAGGTCACCAACTGCGGAAACAATGACATGGTCTGAAAATATGGACACGCTTTTA

GCCAACCAAGGTAAGATTTAACTAATAATAGGCTTAAAATACAATAATTAAATATAAATTATTAAATTCTGAAAGTTGGT

AACATATCATAAAGTATGAGTTTAATCAATGAAGTATAAAATTATTAATAATCATAAATTCATAAAAATCCAAAATCTAA

ATAGAATCAGGTTGGGGCTAAAATAAGTTTATAGGTTAACTCTGTACATTAAAACAAAAGGGAAATTCAATCTAGCAAGT

GAAATTTTCCATTGCCTTAGACTCACTTTANCANNTNNNATTATTNTTNNTNNNAATACAGAGNNTCACTCTCTCTCTNN

ANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNTGNNNNNNNNNNNNNTNNNNNNNNNNNN

NNNNNNNNNNNNNNNNNNNNNNNANNNNCANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNAANNGCTNNNNNNNNATTN

NNNNNNGAGCCACTNNNNNNNCCCTAGGCTCNNNNNNNNNNNNNNTNNNNNNNNNNNNNNNNNNNNNNNNNNNTANNNNN

NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNTTACNNNNCAGACTGGAGTNNNNNNTGTNNNNNNNNNNNNNNNNNNN

NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCTANGGGNGCATGCCAC

GACACCAAGCTAATANNNNNNNNNTTAGTANNNNNNNNNGTTCGCCACNNNANNNNNNNNNNNNNNGNNNNNNNNNNNCN

NNNNGATCTGNNNNNNNNNNNNNNCNNNNNNNNNNNNNN
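The effect of an absolute k-mer count cutoff, as in the -a runs above, can be pictured with a toy sketch. It is a deliberate simplification and not Primer3_masker's actual algorithm: it counts k-mers in a small made-up "genome", looks only at the forward strand, and masks the last base of every template k-mer whose count reaches the cutoff. All names and sequences here are hypothetical.

```python
from collections import Counter


def count_kmers(sequence, k):
    """Count every overlapping k-mer in sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def mask_by_count(template, counts, k, cutoff, mask_char="N"):
    """Mask the last base of each template k-mer seen at least cutoff times."""
    masked = list(template)
    for i in range(len(template) - k + 1):
        if counts[template[i:i + k]] >= cutoff:
            masked[i + k - 1] = mask_char
    return "".join(masked)


# Toy data (made up): ACGT occurs three times in the "genome"
genome = "ACGTACGTACGTTTGCA"
counts = count_kmers(genome, k=4)
template = "GGACGTCC"

print(mask_by_count(template, counts, k=4, cutoff=2))  # lower cutoff masks more
print(mask_by_count(template, counts, k=4, cutoff=4))  # higher cutoff masks less
```

As in the runs above, lowering the cutoff can only increase the number of masked positions.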

 

 

References

Primer3_masker: integrating masking of template sequence with primer design software

Triinu Kõressaar Maarja Lepamets Lauris Kaplinski Kairi Raime Reidar Andreson Maido Remm

Bioinformatics, bty036 Published: 19 January 2018

SortMeRNA: removing rRNA contamination

 

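SortMeRNA is a tool that sensitively detects and filters rRNA in metatranscriptome and metagenome sequencing data. Output can be written as FASTA, FASTQ, SAM alignments, or BLAST-like formats. It supports Illumina, 454, Ion Torrent, and PacBio sequencing data. Used together with QIIME, it can also pick OTUs for use in phylogenetic analysis.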

 

Manual

 http://bioinfo.lifl.fr/RNA/sortmerna/code/SortMeRNA-user-manual-v2.1.pdf

FAQ

http://bioinfo.lifl.fr/sortmerna/faqs.php

 

Download

Binaries can be downloaded from the official site.

Official site

http://bioinfo.lifl.fr/sortmerna/sortmerna.php

 

> sortmerna -h

$ sortmerna -h

 

  Program:     SortMeRNA version 2.1, 01/02/2016

  Copyright:   2012-16 Bonsai Bioinformatics Research Group:

               LIFL, University Lille 1, CNRS UMR 8022, INRIA Nord-Europe

               2014-16 Knight Lab, Department of Pediatrics, UCSD, La Jolla,

  Disclaimer:  SortMeRNA comes with ABSOLUTELY NO WARRANTY; without even the

               implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

               See the GNU Lesser General Public License for more details.

  Contact:     Evguenia Kopylova, jenya.kopylov@gmail.com 

               Laurent Noé, laurent.noe@lifl.fr

               Hélène Touzet, helene.touzet@lifl.fr

 

 

  usage:   ./sortmerna --ref db.fasta,db.idx --reads file.fa --aligned base_name_output [OPTIONS]:

 

  -------------------------------------------------------------------------------------------------------------

  | parameter          value           description                                                    default |

  -------------------------------------------------------------------------------------------------------------

     --ref             STRING,STRING   FASTA reference file, index file                               mandatory

                                         (ex. --ref /path/to/file1.fasta,/path/to/index1)

                                         If passing multiple reference files, separate 

                                         them using the delimiter ':',

                                         (ex. --ref /path/to/file1.fasta,/path/to/index1:/path/to/file2.fasta,path/to/index2)

     --reads           STRING          FASTA/FASTQ reads file                                         mandatory

     --aligned         STRING          aligned reads filepath + base file name                        mandatory

                                         (appropriate extension will be added)

 

   [COMMON OPTIONS]: 

     --other           STRING          rejected reads filepath + base file name

                                         (appropriate extension will be added)

     --fastx           BOOL            output FASTA/FASTQ file                                        off

                                         (for aligned and/or rejected reads)

     --sam             BOOL            output SAM alignment                                           off

                                         (for aligned reads only)

     --SQ              BOOL            add SQ tags to the SAM file                                    off

     --blast           STRING          output alignments in various Blast-like formats                

                                        '0' - pairwise

                                        '1' - tabular (Blast -m 8 format)

                                        '1 cigar' - tabular + column for CIGAR 

                                        '1 cigar qcov' - tabular + columns for CIGAR

                                                         and query coverage

                                        '1 cigar qcov qstrand' - tabular + columns for CIGAR,

                                                                query coverage and strand

     --log             BOOL            output overall statistics                                      off

     --num_alignments  INT             report first INT alignments per read reaching E-value          -1

                                        (--num_alignments 0 signifies all alignments will be output)

       or (default)

     --best            INT             report INT best alignments per read reaching E-value           1

                                         by searching --min_lis INT candidate alignments

                                        (--best 0 signifies all candidate alignments will be searched)

     --min_lis         INT             search all alignments having the first INT longest LIS         2

                                         LIS stands for Longest Increasing Subsequence, it is 

                                         computed using seeds' positions to expand hits into

                                         longer matches prior to Smith-Waterman alignment. 

     --print_all_reads BOOL            output null alignment strings for non-aligned reads            off

                                         to SAM and/or BLAST tabular files

     --paired_in       BOOL            both paired-end reads go in --aligned fasta/q file             off

                                         (interleaved reads only, see Section 4.2.4 of User Manual)

     --paired_out      BOOL            both paired-end reads go in --other fasta/q file               off

                                         (interleaved reads only, see Section 4.2.4 of User Manual)

     --match           INT             SW score (positive integer) for a match                        2

     --mismatch        INT             SW penalty (negative integer) for a mismatch                   -3

     --gap_open        INT             SW penalty (positive integer) for introducing a gap            5

     --gap_ext         INT             SW penalty (positive integer) for extending a gap              2

     -N                INT             SW penalty for ambiguous letters (N's)                         scored as --mismatch

     -F                BOOL            search only the forward strand                                 off

     -R                BOOL            search only the reverse-complementary strand                   off

     -a                INT             number of threads to use                                       1

     -e                DOUBLE          E-value threshold                                              1

     -m                INT             INT Mbytes for loading the reads into memory                   1024

                                        (maximum -m INT is 49152)

     -v                BOOL            verbose                                                        off

 

 

   [OTU PICKING OPTIONS]: 

     --id              DOUBLE          %id similarity threshold (the alignment must                   0.97

                                         still pass the E-value threshold)

     --coverage        DOUBLE          %query coverage threshold (the alignment must                  0.97

                                         still pass the E-value threshold)

     --de_novo_otu     BOOL            FASTA/FASTQ file for reads matching database < %id             off

                                         (set using --id) and < %cov (set using --coverage) 

                                         (alignment must still pass the E-value threshold)

     --otu_map         BOOL            output OTU map (input to QIIME's make_otu_table.py)            off

 

 

   [ADVANCED OPTIONS] (see SortMeRNA user manual for more details): 

    --passes           INT,INT,INT     three intervals at which to place the seed on the read         L,L/2,3

                                         (L is the seed length set in ./indexdb_rna)

    --edges            INT             number (or percent if INT followed by % sign) of               4

                                         nucleotides to add to each edge of the read

                                         prior to SW local alignment 

    --num_seeds        INT             number of seeds matched before searching                       2

                                         for candidate LIS 

    --full_search      BOOL            search for all 0-error and 1-error seed                        off

                                         matches in the index rather than stopping

                                         after finding a 0-error match (<1% gain in

                                         sensitivity with up four-fold decrease in speed)

    --pid              BOOL            add pid to output file names                                   off

 

 

   [HELP]:

     -h                BOOL            help

     --version         BOOL            SortMeRNA version number

 

 

> indexdb_rna -h

$ indexdb_rna -h

 

  Program:     SortMeRNA version 2.1, 01/02/2016

  Copyright:   2012-16 Bonsai Bioinformatics Research Group:

               LIFL, University Lille 1, CNRS UMR 8022, INRIA Nord-Europe

               2014-16 Knight Lab, Department of Pediatrics, UCSD, La Jolla,

  Disclaimer:  SortMeRNA comes with ABSOLUTELY NO WARRANTY; without even the

               implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

               See the GNU Lesser General Public License for more details.

  Contact:     Evguenia Kopylova, jenya.kopylov@gmail.com 

               Laurent Noé, laurent.noe@lifl.fr

               Hélène Touzet, helene.touzet@lifl.fr

 

 

  usage:   ./indexdb_rna --ref db.fasta,db.idx [OPTIONS]:

 

  --------------------------------------------------------------------------------------------------------

  | parameter        value           description                                                 default |

  --------------------------------------------------------------------------------------------------------

     --ref           STRING,STRING   FASTA reference file, index file                            mandatory

                                      (ex. --ref /path/to/file1.fasta,/path/to/index1)

                                       If passing multiple reference sequence files, separate

                                       them by ':',

                                      (ex. --ref /path/to/file1.fasta,/path/to/index1:/path/to/file2.fasta,path/to/index2)

   [OPTIONS]:

     --tmpdir        STRING          directory where to write temporary files

     -m              INT             the amount of memory (in Mbytes) for building the index     3072 

     -L              INT             seed length                                                 18

     --max_pos       INT             maximum number of positions to store for each unique L-mer  10000

                                      (setting --max_pos 0 will store all positions)

     -v              BOOL            verbose

     -h              BOOL            help

Add it to your PATH.

 

Run

The analysis requires an index for the rRNA database (FASTA). Here an index is built for the bacterial 16S database in rRNA_databases/.

mkdir index
indexdb_rna --ref ./rRNA_databases/silva-bac-16s-id90.fasta,./index/silva-bac-16s-db -v
  • -v   verbose
  • --ref   STRING,STRING   FASTA reference file, index file

Separate the FASTA file and the index with ",". Multiple files can also be specified (see the manual).

The databases also include archaeal and eukaryotic rRNA sequences.

 

Once the database is ready, detect rRNA reads in the FASTQ and write them to a separate file.

sortmerna --ref ./rRNA_databases/silva-bac-16s-id90.fasta,./index/silva-bac-16s-db --reads file.fq --aligned mapped --fastx --other nohit
  • --fastx    output FASTA/FASTQ file  
  • --aligned         STRING          aligned reads filepath + base file name 

mapped.fastq and nohit.fastq are written.
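To get a feel for how much rRNA was present, the two output files can be compared by read count. The sketch below assumes the common four-lines-per-record FASTQ layout; the function names are made up, and the toy records stand in for the real files.

```python
def count_fastq_records(lines):
    """Count FASTQ records, assuming exactly four lines per record."""
    return sum(1 for _ in lines) // 4


def rrna_fraction(n_aligned, n_other):
    """Fraction of reads that matched the rRNA database."""
    total = n_aligned + n_other
    return n_aligned / total if total else 0.0


# Toy data (made up): 1 rRNA read and 3 non-rRNA reads
one_record = ["@read\n", "ACGT\n", "+\n", "IIII\n"]
n_mapped = count_fastq_records(one_record * 1)
n_nohit = count_fastq_records(one_record * 3)
print(f"rRNA fraction: {rrna_fraction(n_mapped, n_nohit):.2f}")
```

With real data, pass open("mapped.fastq") and open("nohit.fastq") to count_fastq_records instead of the toy lists.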

 

Besides --fastx and --other, there are also --sam and --blast. See the help for details.

 

References

SortMeRNA: fast and accurate filtering of ribosomal RNAs in metatranscriptomic data.

Kopylova E, Noé L, Touzet H.

Bioinformatics. 2012 Dec 15;28(24):3211-7.

 

leeHom: efficiently removing adapters from ancient-sample data (fastq or bam)

 

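Sequencing of old samples with fragmented DNA is on the rise. DNA from ancient samples, which is sometimes extracted from material tens of thousands of years old, is fragmented, and even when extraction goes well the fragments rarely exceed 100 bp. When such short DNA is sequenced paired-end, the whole insert is sequenced and the reads run on into the adapters (see the figure).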

(Figure reproduced from ResearchGate.)

 

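Because the reads overlap over their whole length, a correct merge could in principle raise sequence accuracy simply by taking, at each position, the base from the mate with the higher quality. In practice the original DNA is of poor quality (mismatches and gaps can occur), so merging is expected to be difficult.

leeHom removes the 5' and 3' adapters and reconstructs the original DNA sequence with a Bayesian maximum a posteriori approach. It was tested on simulations and on sequencing data from Neanderthals, a well-known ancient sample, and the results show higher accuracy than other methods. It supports both single-end and paired-end sequencing data.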
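The naive idea mentioned above, taking the higher-quality base at each position of two fully overlapping mates, can be sketched as follows. This is not leeHom's Bayesian method: it assumes equal-length reads that overlap completely, no gaps, and qualities already decoded to Phred integers. The function names and toy reads are made up.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def naive_merge(seq1, qual1, seq2, qual2):
    """Merge fully overlapping mates by keeping the higher-quality base.

    seq2 and qual2 are the reverse read as sequenced; qualities are lists of ints.
    """
    if len(seq1) != len(seq2):
        raise ValueError("sketch only handles equal-length, fully overlapping reads")
    rc2 = reverse_complement(seq2)
    rq2 = qual2[::-1]
    merged_seq = []
    merged_qual = []
    for base1, q1, base2, q2 in zip(seq1, qual1, rc2, rq2):
        if q1 >= q2:
            merged_seq.append(base1)
            merged_qual.append(q1)
        else:
            merged_seq.append(base2)
            merged_qual.append(q2)
    return "".join(merged_seq), merged_qual


# Toy pair (made up): the true fragment is ACCTA, read 1 has a low-quality error
forward = "ACGTA"
forward_q = [30, 30, 10, 30, 30]
reverse = reverse_complement("ACCTA")
reverse_q = [20, 20, 35, 20, 20]
print(naive_merge(forward, forward_q, reverse, reverse_q))
```

In the toy pair the low-quality G in the forward read is replaced by the C supported by the reverse read.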

 

 

Official site

https://bioinf.eva.mpg.de/leehom/

 

Installation

Github

https://github.com/grenaud/leeHom

git clone --recursive https://github.com/grenaud/leeHom.git 
cd leeHom/
make
src/leeHom # check that it runs

$ src/leeHom

Usage:

 

src/leeHom [options] BAMfile

 

This program takes an unaligned BAM where mates are consecutive

or fastq files and trims and merges reads

 

You can specify a unaligned bam file or one or two fastq :

-fq1 First fastq

-fq2 Second  fastq file (for paired-end)

-fqo Output fastq prefix

 

-o , --outfile Output (BAM format).

-u            Produce uncompressed bam (good for pipe)

--aligned Allow reads to be aligned (default false)

-v , --verbose Turn all messages on (default false)

--log [log file] Print a tally of merged reads to this log file (default only to stderr)

--phred64 Use PHRED 64 as the offset for QC scores (default : PHRED33)

 

Paired End merging/Single Read trimming  options

You can specify either:

--ancientdna ancient DNA (default false)

            this allows for partial overlap

 

or if you know your size length distribution:

--loc Location for lognormal dist. (default none)

--scale Scale for lognormal dist. (default none)

 

--keepOrig Write original reads if they are trimmed or merged  (default false)

Such reads will be marked as PCR duplicates

 

-f , --adapterFirstRead Adapter that is observed after the forward read (def. Multiplex: AGATCGGAAGAGCACACGTCTGAACTCCAG)

-s , --adapterSecondRead Adapter that is observed after the reverse read (def. Multiplex: AGATCGGAAGAGCGTCGTGTAGGGAAAGAG)

-c , --FirstReadChimeraFilter If the forward read looks like this sequence, the cluster is filtered out.

Provide several sequences separated by comma (def. Multiplex: ACACTCTTTCCCTACACGTCTGAACTCCAG)

-k , --key Key sequence with which each sequence starts. Comma separate for forward and reverse reads. (default '')

-i , --allowMissing Allow one base in one key to be missing or wrong. (default false)

--trimCutoff Lowest number of adapter bases to be observed for single Read trimming (default 1)

Add it to your PATH.

Multi-core version

> src/leeHomMulti    # set the number of threads with -t

$ src/leeHomMulti 

Usage:

 

leeHomMulti [options] BAMfile

 

This program takes an unaligned BAM where mates are consecutive

or fastq files and trims and merges reads

 

You can specify a unaligned bam file or one or two fastq :

-fq1 First fastq

-fq2 Second  fastq file (for paired-end)

-fqo Output fastq prefix

 

-o , --outfile Output (BAM format).

-u            Produce uncompressed bam (good for pipe)

--aligned Allow reads to be aligned (default false)

-v , --verbose Turn all messages on (default false)

--log [log file] Print a tally of merged reads to this log file (default only to stderr)

--phred64 Use PHRED 64 as the offset for QC scores (default : PHRED33)

-t [threads] Use multiple cores (default : 1)

 

Paired End merging/Single Read trimming  options

You can specify either:

--ancientdna ancient DNA (default false)

            this allows for partial overlap

 

or if you know your size length distribution:

--loc Location for lognormal dist. (default none)

--scale Scale for lognormal dist. (default none)

 

--keepOrig Write original reads if they are trimmed or merged  (default false)

Such reads will be marked as PCR duplicates

 

-f , --adapterFirstRead Adapter that is observed after the forward read (def. Multiplex: AGATCGGAAGAGCACACGTCTGAACTCCAG)

-s , --adapterSecondRead Adapter that is observed after the reverse read (def. Multiplex: AGATCGGAAGAGCGTCGTGTAGGGAAAGAG)

-c , --FirstReadChimeraFilter If the forward read looks like this sequence, the cluster is filtered out.

Provide several sequences separated by comma (def. Multiplex: ACACTCTTTCCCTACACGTCTGAACTCCAG)

-k , --key Key sequence with which each sequence starts. Comma separate for forward and reverse reads. (default '')

-i , --allowMissing Allow one base in one key to be missing or wrong. (default false)

--trimCutoff Lowest number of adapter bases to be observed for single Read trimming (default 1)

 

Run

Test run.

Remove adapters from fastq files.

leeHom -f AGATCGGAAGAGCACACGTCTGAACTCCAG -s GGAAGAGCGTCGTGTAGGGAAAGAGTGTAG --ancientdna -fq1 testData/rawAncientDNA.f1.gz -fq2 testData/rawAncientDNA.f2.gz -fqo testData/outfq
  • -f    Adapter that is observed after the forward read (def. Multiplex: AGATCGGAAGAGCACACGTCTGAACTCCAG)
  • -s    Adapter that is observed after the reverse read (def. Multiplex: AGATCGGAAGAGCGTCGTGTAGGGAAAGAG)
  • --ancientdna   ancient DNA (default false) this allows for partial overlap
  • -fq1   First fastq
  • -fq2   Second  fastq file (for paired-end)
  • -fqo   Output fastq prefix

When the run finishes, the following summary log is displayed.

Total 50000; Merged (trimming) 40540; Merged (overlap) 7376; Kept PE/SR 1955; Trimmed SR 0; Adapter dimers/chimeras 129; Failed Key 0

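This means that 47,916 read pairs were merged (= 40540 + 7376), while 1955 were not merged because no overlap was found. In addition, 129 chimeras were found in which two adapters were joined in tandem with no insert.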
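The arithmetic on the summary line can be checked with a small parser. This is a sketch that assumes the semicolon-separated "label number" layout of the line shown above; the function name is made up.

```python
import re


def parse_leehom_summary(line):
    """Split a 'label number; label number; ...' summary line into a dict."""
    stats = {}
    for field in line.strip().split(";"):
        match = re.match(r"\s*(.+?)\s+(\d+)\s*$", field)
        if match:
            stats[match.group(1)] = int(match.group(2))
    return stats


# Summary line copied from the test run above
SUMMARY = (
    "Total 50000; Merged (trimming) 40540; Merged (overlap) 7376; "
    "Kept PE/SR 1955; Trimmed SR 0; Adapter dimers/chimeras 129; Failed Key 0"
)

stats = parse_leehom_summary(SUMMARY)
merged = stats["Merged (trimming)"] + stats["Merged (overlap)"]
print(f"merged pairs: {merged} of {stats['Total']}")
```

The individual categories also add up to the total of 50000 reported on the line.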

The output consists of a fastq.gz of merged reads, r1.fastq.gz and r2.fastq.gz of unmerged reads, and fastq.gz files of failed reads that did not meet the criteria.

 

Remove adapters from a bam file.

leeHom -f AGATCGGAAGAGCACACGTCTGAACTCCAG -s GGAAGAGCGTCGTGTAGGGAAAGAGTGTAG --ancientdna -o testData/reconsAncientDNA.bam testData/rawAncientDNA.bam 
  • -o , --outfile Output (BAM format).

 

References

leeHom: adaptor trimming and merging for Illumina sequencing reads

Gabriel Renaud, Udo Stenzel, and Janet Kelso.

Nucleic Acids Res. 2014 Oct 13; 42(18): e141.

 

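QoRTs: a quality-check tool for RNA-seq

RNA-Seq is prone to specific biases and artifacts, so robust and comprehensive quality checks are important. In particular, errors in sample preparation, library construction, or sequencing cause unexpected artifacts and biases. Detecting such problems is important so that they can be handled appropriately, yet no comprehensive method for testing quality automatically exists.

QoRTs generates a wide range of quality metrics to help informaticians carry out appropriate quality checks.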

  

Official site

QoRTs: Quality of RNA-Seq Toolset

Example Walkthrough

QoRTs/example-walkthrough.pdf at master · hartleys/QoRTs · GitHub

 

Installation

A precompiled jar file can be downloaded from the GitHub releases. A manual is also available.

https://github.com/hartleys/QoRTs/releases

java -jar QoRTs.jar QC --man # manual

# Also install the R package used for plotting. From within R:
install.packages("http://hartleys.github.io/QoRTs/QoRTs_LATEST.tar.gz", repos=NULL, type="source");

$ java -jar QoRTs.jar QC --man

Starting QoRTs v1.3.0 (Compiled Fri Oct 20 11:56:37 EDT 2017)

Starting time: (Mon Jan 22 17:49:51 JST 2018)

NAME

QC

   Version: 1.3.0 (Updated Fri Oct 20 11:56:37 EDT 2017)

 

USAGE

    java [Java Options] -jar QoRTs.jar QC [options] infile 

        gtffile.gtf qcDataDir

 

DESCRIPTION:

    This utility runs a large battery of QC / data processing tools 

    on a single given sam or bam file. This is the primary function 

    of the QoRTs utility. All analyses are run via a single pass 

    through the sam/bam file.

 

REQUIRED ARGUMENTS:

    infile

        The input .bam or .sam file of aligned sequencing reads. Or 

        '-' to read from stdin.

        (String)

    gtffile.gtf

        The gtf annotation file. This tool was designed to use the 

        standard gtf annotations provided by Ensembl, but other 

        annotations can be used as well.

        If the filename ends with ".gz" or ".zip", the file will be 

        parsed using the appropriate decompression method.

        (String)

    qcDataDir

        The output directory.

        (String)

 

OPTIONS:

    --singleEnded

        Flag to indicate that reads are single end.

        (flag)

 

    --stranded

        Flag to indicate that data is stranded.

        (flag)

 

    --stranded_fr_secondstrand

        Flag to indicate that reads are from a fr_secondstrand type 

        of stranded library (equivalent to the "stranded = yes" 

        option in HTSeq or the "fr_secondStrand" library-type option 

        in TopHat/CuffLinks). If your data is stranded, you must 

        know the library type in order to analyze it properly. This 

        utility uses the same definitions as cufflinks to define 

        strandedness type. By default, the fr_firststrand library 

        type is assumed for all stranded data (equivalent to the 

        "stranded = reverse" option in HTSeq).

        (flag)

 

    --maxReadLength len

        Sets the maximum read length. For unclipped datasets this 

        option is not necessary since the read length can be 

        determined from the data. By default, QoRTs will attempt to 

        determine the max read length by examining the first 1000 

        reads. If your data is hard-clipped prior to alignment, then 

        it is strongly recommended that this option be included, or 

        else an error may occur. Note that hard-clipping data prior 

        to alignment is generally not recommended, because this 

        makes it difficult (or impossible) to determine the 

        sequencer read-cycle of each nucleotide base. This may 

        obfuscate cycle-specific artifacts, trends, or errors, the 

        detection of which is one of the primary purposes of QoRTs! 

        In addition, hard clipping (whether before or after 

        alignment) removes quality score data, and thus quality 

        score metrics may be misleadingly optimistic. A MUCH 

        preferable method of removing undesired sequence is to 

        replace such sequence with N's, which preserves the quality 

        score and the sequencer cycle information while still 

        removing undesired sequence.

        (Int)

 

    --minMAPQ num

        Filter out reads with less than the given MAPQ. Most RNA-Seq 

        aligners use the MAPQ field to differentiate uniquely-mapped 

        and multi-mapped reads. However, different aligners use a 

        different MAPQ conventions. By default, all reads with a 

        MAPQ of less than 255 will be excluded, as this is the MAPQ 

        associated with uniquely-aligned reads generated by the 

        RNA-STAR aligner. For use with TopHat2 you should set this 

        to 50. The MAPQ behavior for GSNAP is not well documented, 

        but it appears that a filtering threshold of 30 should be 

        adequate. Set this to 0 to turn off mapq filtering.

        (Int)

 

    --generatePlots

        Generate all single-replicate QC plots. Equivalent to the 

        combination of: --generateMultiPlot --generateSeparatePlots 

        and --generatePdfReport. This option will cause QoRTs to 

        make an Rscript system call, loading the R package QoRTs. 

        (Note: this requires that R be installed and in the PATH, 

        and that QoRTs be installed on that R installation)

        (flag)

 

    --testRun

        Flag to indicate that only the first 100k reads should be 

        read in. Used for testing.

        (flag)

 

    --keepMultiMapped

        Flag to indicate that the tool should NOT filter out 

        multi-mapped reads. Note that even with this flag raised 

        this utility will still only use the 'primary' alignment 

        location for each read. By default any reads that are marked 

        as multi-mapped will be ignored entirely. Most aligners use 

        the MAPQ value to mark multi-mapped reads. Any read with 

        MAPQ < 255 is assumed to be non-uniquely mapped (this is the 

        standard used by RNA-STAR and TopHat/TopHat2). This option 

        is equivalent to "--minMAPQ 0".

        (flag)

 

    --noGzipOutput

        Flag to indicate that output files should NOT be compressed 

        into the gzip format. By default almost all output files are 

        compressed to save space.

        (flag)

 

    --readGroup readGroupName

        If this option is set, all analyses will be restricted to 

        ONLY reads that are tagged with the given readGroupName 

        (using an RG tag). This can be used if multiple read-groups 

        have already been combined into a single bam file, but you 

        want to summarize each read-group separately.

        (String)

 

    --dropChrom dropChromosomes

        A comma-delimited list of chromosomes to ignore and exclude 

        from all analyses. Important: no whitespace!

        (CommaDelimitedListOfStrings)

 

    --skipFunctions func1,func2,...

        A list of functions to skip (comma-delimited, no 

        whitespace). See the sub-functions list, below. The 

        default-on functions are: NVC, GCDistribution, GeneCalcs, 

        readLengthDistro, QualityScoreDistribution, 

        writeJunctionSeqCounts, writeKnownSplices, 

        writeNovelSplices, writeClippedNVC, CigarOpDistribution, 

        overlapMatch, cigarLocusCounts, InsertSize, chromCounts, 

        writeSpliceExon, writeGenewiseGeneBody, JunctionCalcs, 

        writeGeneCounts, writeBiotypeCounts, writeDESeq, 

        writeDEXSeq, writeGeneBody, StrandCheck

        (CommaDelimitedListOfStrings)

 

    --addFunctions func1,func2,...

        A list of functions to add (comma-delimited, no whitespace). 

        This can be used to add functions that are off by default. 

        Followed by a comma delimited list, with no internal 

        whitespace. See the sub-functions list, below. The 

        default-off functions are: mismatchEngine, 

        annotatedSpliceExonCounts, calcOnTarget, FPKM, cigarMatch, 

        testDataDump, writeGeneBodyIv, fastqUtils, referenceMatch, 

        writeDocs, makeJunctionBed, makeWiggles, 

        makeAllBrowserTracks, calcDetailedGeneCounts

        (CommaDelimitedListOfStrings)

 

    --runFunctions func1,func2,...

        The complete list of functions to run (comma-delimited, no 

        whitespace). Setting this option runs ONLY for the functions 

        explicitly requested here (along with any functions upon 

        which the assigned functions are dependent). See the 

        sub-functions list, below. Allowed options are: NVC, 

        mismatchEngine, annotatedSpliceExonCounts, GCDistribution, 

        calcOnTarget, GeneCalcs, FPKM, readLengthDistro, cigarMatch, 

        QualityScoreDistribution, testDataDump, 

        writeJunctionSeqCounts, writeKnownSplices, 

        writeNovelSplices, writeClippedNVC, CigarOpDistribution, 

        overlapMatch, cigarLocusCounts, InsertSize, chromCounts, 

        writeGeneBodyIv, fastqUtils, writeSpliceExon, 

        referenceMatch, writeGenewiseGeneBody, JunctionCalcs, 

        writeGeneCounts, writeDocs, makeJunctionBed, 

        writeBiotypeCounts, writeDESeq, writeDEXSeq, makeWiggles, 

        writeGeneBody, StrandCheck, makeAllBrowserTracks, 

        calcDetailedGeneCounts

        (CommaDelimitedListOfStrings)

 

    --seqReadCt val

        (Optional) The total number of reads (or read-pairs, for 

        paired-end data) generated by the sequencer for this sample, 

        prior to alignment. This will be passed on into the 

        QC.summary.txt file and used to calculate mapping rate.

        (Int)

 

    --rawfastq myfastq.1.fq.gz,myfastq.2.fq.gz

        (Optional) The raw fastq, prior to alignment. In normal 

        operation, this is used ONLY to calculate the number of 

        pre-alignment reads (or read-pairs) simply by counting the 

        number of lines and dividing by 4. Alternatively, the number 

        of pre-alignment read-pairs can be included explicitly via 

        the --seqReadCt option, or added in the plotting / 

        cross-comparison step by including the input.read.pair.count 

        column in the replicate decoder.In general, the --seqReadCt 

        option is recommended when available.

        Certain optional QC functions are also available that 

        utilize the raw Fastq file in other ways. If the filename 

        ends with ".gz" or ".zip", the file will be parsed using the 

        appropriate decompression method.

        (CommaDelimitedListOfStrings)

 

    --chromSizes chrom.sizes.txt

        A chrom.sizes file. The first (tab-delimited) column must 

        contain all chromosomes found in the dataset. The second 

        column must contain chromosome sizes (in base-pairs). If a 

        standard genome is being used, it is strongly recommended 

        that this be generated by the UCSC utility 

        'fetchChromSizes'.

        This file is ONLY needed to produce wiggle files. If this is 

        provided, then by default QoRTs will produce 100-bp-window 

        wiggle files (and junction '.bed' files) for the supplied 

        data.In order to produce wiggle files, this parameter is 

        REQUIRED.

        (String)

 

    --title myTitle

        The title of the replicate. Used for the track name in the 

        track definition line of any browser tracks ('.wig' or 

        '.bed' files) generated by this utility. Also may be used in 

        the figure text, if figures are being generated.Note that no 

        browser tracks will be created by default, unless the 

        '--chromSizes' option is set. Bed files can also be 

        generated using the option '--addFunction makeJunctionBed'

        (String)

 

    --flatgff flattenedGffFile.gff.gz

        A "flattened" gff file that matches the standard gtf file. 

        Optional. The "flattened" gff file assigns unique 

        identifiers for all exons, splice junctions, and 

        aggregate-genes. This is used for the junction counts and 

        exon counts (for DEXSeq). The flattened gtf file can be 

        generated using the "makeFlatGff" command. Flattened GFF 

        files containing novel splice junctions can be generated 

        using the "mergeNovelSplices" function. Note that (for most 

        purposes) the command should be run with the same 

        strandedness code as found in the dataset. Running a 

        flattened gff that was generated using a different 

        strandedness mode may be useful for certain purposes, but is 

        generally not supported and is for advanced users only.See 

        the documentation for makeFlatGff for more information.

        If the filename ends with ".gz" or ".zip", the file will be 

        parsed using the appropriate decompression method.

        (String)

 

    --generateMultiPlot

        Generate a multi-frame figure, containing a visual summary 

        of all QC stats. (Note: this requires that R be installed 

        and in the PATH, and that QoRTs be installed on that R 

        installation)

        (flag)

 

    --generateSeparatePlots

        Generate seperate plots for each QC stat, rather than only 

        one big multiplot. (Note: this requires that R be installed 

        and in the PATH, and that QoRTs be installed on that R 

        installation)

        (flag)

 

    --generatePdfReport

        Generate a pdf report. (Note: this requires that R be 

        installed and in the PATH, and that QoRTs be installed on 

        that R installation)

        (flag)

 

    --adjustPhredScore val

        QoRTs expects input files to conform to the SAM format 

        specification, which requires all Phred scores to be in 

        Phred+33 encoding. However some older tools produce SAM 

        files with nonstandard encodings. To read such data, you can 

        set this parameter to subtract from the apparent (phred+33) 

        phred score. Thus, to read Phred+64 data (produced by 

        Illumina v1.3-1.7), set this parameter to 31. Note: QoRTs 

        does not support negative Phred scores. NOTE: THIS OPTION IS 

        EXPERIMENTAL!

        (Int)

 

    --maxPhredScore val

        According to the standard FASTQ and SAM format 

        specification, Phred quality scores are supposed to range 

        from 0 to 41. However, certain sequencing machines such as 

        the HiSeq4000 supposedly produce occasional quality scores 

        as high as 45. If your dataset contains quality scores in 

        excess of 41, then you must use this option to set the 

        maximum legal quality score. Otherwise, QoRTs will throw an 

        error.

        (Int)

 

    --summaryFileSuffix .summary.txt

        The suffix of the 'summary' file. This is useful to set if 

        you want to run multiple QC runs in parallel to reduce 

        runtime, without overwriting one another's summary files.In 

        particular, the NVC metrics often take a long time to run, 

        so splitting those off using the --runFunctions parameter 

        might speed things up considerably. Note that 'QC' will be 

        appended in the actual filename. THIS OPTION IS BETA!

        (String)

 

    --extractReadsByMetric metric=value

        THIS OPTIONAL PARAMETER IS STILL UNDER BETA TESTING. This 

        parameter allows the user to extract anomalous reads that 

        showed up in previous QoRTs runs. Currently reads can be 

        extracted based on the following metrics: StrandTestStatus, 

        InsertSize and GCcount.

        (String)

 

    --keepOnlyOnTarget

        Experimental flag. Ignores reads that DO NOT fall within the 

        target region (specified by the required bedfile using the 

        --targetRegionBed parameter).

        (flag)

 

    --dropOnTarget

        Experimental flag. Ignores reads that DO fall within the 

        target region (specified by the required bedfile using the 

        --targetRegionBed parameter).

        (flag)

 

    --randomSubsample 1.00

        If this option is set, QoRTs will ignore a random fraction 

        of the input read pairs. This can drastically reduce 

        runtime, though it may reduce the accuracy of the output QC 

        metrics.

        (Double)

 

    --restrictToGeneList geneList.txt

        If this option is set, almost all analyses will be 

        restricted to reads that are found on genes named in the 

        supplied gene list file. The file should contain a gene ID 

        on each line and nothing else. The only functions that will 

        be run on the full set of all reads will be the functions 

        that calculate the gene mapping itself. NOTE: if you want to 

        include ambiguous reads, include a line with the text: 

        '_ambiguous'. If you want to include reads that do not map 

        to any known feature, include a line with the text: 

        '_no_feature'. WARNING: this is not intended for default 

        use. It is intended to be used when re-running QoRTs, with 

        the intention of examining artifacts that can be caused in 

        various plots by a small number of genes with extremely high 

        coverage. For example, GC content plots sometimes contain 

        visible spikes caused by small mitochondrial genes with 

        extremely high expression.ADDITIONAL WARNING: This feature 

        is in BETA, and is not yet fully tested.

        (String)

 

    --dropGeneList geneList.txt

        If this option is set, almost all analyses will be 

        restricted to reads that are NOT found on genes named in the 

        supplied gene list file. The file should contain a gene ID 

        on each line and nothing else. The only functions that will 

        be run on the full set of all reads will be the functions 

        that calculate the gene mapping itself. NOTE: if you want to 

        EXCLUDE ambiguous reads, include a line with the text: 

        '_ambiguous'. If you want to EXCLUDE reads that do not map 

        to any known feature, include a line with the text: 

        '_no_feature'. WARNING: this is not intended for default 

        use. It is intended to be used when re-running QoRTs, with 

        the intention of examining artifacts that can be caused by 

        certain individual 'problem genes'. For example, GC content 

        plots sometimes contain visible spikes caused by small 

        transcripts / RNA's with extremely high expression 

        levels.ADDITIONAL WARNING: This feature is in BETA, and is 

        not yet fully tested.

        (String)

 

    --DNA

        BETA: This flag makes various changes to allow QoRTs to run 

        on whole-exome or whole-genome DNA-Seq data.

        (flag)

 

    --RNA

        Indicates that the data is RNA-Seq (this is the default: 

        flag does nothing).

        (flag)

 

    --genomeFA chr.fa.gz[,chr2.fa,...]

        Reference genome sequence. This can either be a single FASTA 

        file with all the chromosomes included, or a comma-delimited 

        list of fasta files with 1 chromosome each. Note: IF 

        multiple fasta files are specificed, each must contain ONLY 

        ONE chromosome. If a single multi-chromosome fasta file is 

        specificed, performance will be improved if the chromosomes 

        are in the same order as they are found in the BAM file, 

        however, this is not required. The genomic sequence is used 

        by certain experimental sub-utilities (currently only the 

        referenceMatch utility). Comma delimited, no spaces. Fasta 

        files can be in plaintext, gzipped or zipped.

        (CommaDelimitedListOfStrings)

 

    --genomeBufferSize val

        The size of the genome fasta buffer. Tuning this parameter 

        may improve performance.

        (Int)

 

    --outfilePrefix sampID

        Prefix to be prepended to all output files. If this is set, 

        all output files will use the format: 

        "outfiledir/<prefix>QC.qcfilename.txt.gz"

        (String)

 

    --nameSorted

        DEPRECATED: Relevant for paired-end reads only. 

        This flag is used to run QoRTs in "name-sorted" mode. This 

        flag is optional, as under the default mode QoRTs will 

        accept BAM files sorted by either name OR position. However, 

        The only actual requirement in this mode is that read pairs 

        be adjacent.

        Errors may occur if the SAM flags are inconsistent: for 

        example, if orphaned reads appear with the "mate mapped" SAM 

        flag set.

        (flag)

 

    --coordSorted

        DEPRECATED: this mode is now subsumed by the default mode 

        and as such this parameter is now nonfunctional.

        Note that, in the default mode, for paired-end data QoRTs 

        will accept EITHER coordinate-sorted OR name-sorted bam 

        files. In "--nameSorted" mode, QoRTs ONLY accepts 

        name-sorted bam files.

        If a large fraction of the read-pairs are mapped to 

        extremely distant loci (or to different chromosomes), then 

        memory issues may arise. However, this should not be a 

        problem with most datasets. Technically by default QoRTs can 

        run on arbitrarily-ordered bam files, but this is STRONGLY 

        not recommended, as memory usage will by greatly increased.

        (flag)

 

    --fileContainsNoMultiMappedReads

        DEPRECATED. Flag to indicate that the input sam/bam file 

        contains only primary alignments (ie, no multi-mapped 

        reads). This flag is ALWAYS OPTIONAL, but when applicable 

        this utility will run (slightly) faster when using this 

        argument. (DEPRECIATED! The performance improvement was 

        marginal)

        (flag)

 

    --parallelFileRead

        DEPRECATED: DO NOT USE. Flag to indicate that bam file 

        reading should be run in paralell for increased speed. Note 

        that in this mode you CANNOT read from stdin. Also note that 

        for this to do anything useful, the numThreads option must 

        be set to some number greater than 1. Also note that 

        additional threads above 9 will have no appreciable affect 

        on speed.

        (flag)

 

    --numThreads num

        DEPRECIATED, nonfunctional.

        (Int)

 

    --checkForAlignmentBlocks

        Certain aligners will mark reads 'aligned' even though they 

        have no aligned bases. This option will automatically check 

        for some reads and ignore them, rather than throwing an 

        error.

        (flag)

 

    --targetRegionBed targetRegion.bed

        For whole exome sequencing, this specifies the exome target 

        regions.

        (String)

 

    --stopAfterNReads n

        Stop after reading in n reads or read-pairs.

        (Int)

 

    --randomSeed n

        Use specified random seed.

        (Long)

 

    --parseIlluminaStyleReadIDs

        Specifies that the read-names are in the illumina style. 

        CURRENTLY NONFUNCTIONAL!

        (flag)

 

    --verbose

        Flag to indicate that debugging information and extra 

        progress information should be sent to stderr.

        (flag)

 

    --quiet

        Flag to indicate that only errors and warnings should be 

        sent to stderr.

        (flag)

 

DEFAULT SUB-FUNCTIONS

    NVC

        Nucleotide-vs-Cycle counts.

    GCDistribution

        Calculate GC content distribution.

    GeneCalcs

        Find gene assignment and gene body calculations.

    readLengthDistro

        Tabulates the distribution of read lengths. 

    QualityScoreDistribution

        Calculate quality scores by cycle.

    writeJunctionSeqCounts

        Write counts file designed for use with JunctionSeq 

        (contains known splice junctions, gene counts, and exon 

        counts). [Depends: writeSpliceExon]

    writeKnownSplices

        Write known splice junction counts. [Depends: JunctionCalcs]

    writeNovelSplices

        Write novel splice junction counts. [Depends: JunctionCalcs]

    writeClippedNVC

        Write NVC file containing clipped sequences. [Depends: NVC]

    CigarOpDistribution

        Cigar operation rates by cycle and cigar operation length 

        rates (deletions, insertions, splicing, clipping, etc).

    overlapMatch

        BETA: This function calculates the matching of overlapping 

        sections of paired reads. [Depends: mismatchEngine]

    cigarLocusCounts

        BETA: This function is still undergoing basic testing. It is 

        not intended for production use at this time.

    InsertSize

        Insert size distribution (paired-end data only).

    chromCounts

        Calculate chromosome counts

    writeSpliceExon

        Synonym for function "writeJunctionSeqCounts" (for 

        backwards-compatibility) [Depends: JunctionCalcs]

    writeGenewiseGeneBody

        Write file containing gene-body distributions for each 

        (non-overlapping) gene. [Depends: writeGeneBody]

    JunctionCalcs

        Find splice junctions (both novel and annotated).

    writeGeneCounts

        Write extended gene-level read/read-pair counts file 

        (includes counts for CDS/UTR, ambiguous regions, etc). 

        [Depends: GeneCalcs]

    writeBiotypeCounts

        Write a table listing read counts for each gene BioType 

        (uses the optional "gene_biotype" GTF attribute). [Depends: 

        GeneCalcs]

    writeDESeq

        Write gene-level read/read-pair counts file, suitable for 

        use with DESeq, EdgeR, etc. [Depends: GeneCalcs]

    writeDEXSeq

        Write exon-level read/read-pair counts file, designed for 

        use with DEXSeq. [Depends: JunctionCalcs]

    writeGeneBody

        Write gene-body distribution file. [Depends: GeneCalcs]

    StrandCheck

        Check the strandedness of the data. Note that if the 

        stranded option is set incorrectly, this tool will 

        automatically print a warning to that effect.

NON-DEFAULT SUB-FUNCTIONS

    mismatchEngine

        Internal module that runs overlap/reference mismatch 

        calculations. Automatically included on any runs that 

        include these functions.

    annotatedSpliceExonCounts

        Write counts for exons, known-splice-junctions, and genes, 

        with annotation columns indicating chromosome, etc (default: 

        OFF) [Depends: JunctionCalcs]

    calcOnTarget

        BETA: requires --targetRegionBed parameter. This function 

        calculates the rates at which reads intersect with the 

        On-Target area. Intended for whole exome sequencing data. 

        Make sure to use the --targetRegionBed parameter or else 

        this function will deactivate! (Default: ON iff 

        targetRegionBed param is found)

    FPKM

        Write FPKM values. Note: FPKMs are generally NOT the 

        recommended normalization method. We recommend using a more 

        advanced normalization as provided by DESeq, edgeR, 

        CuffLinks, or similar (default: OFF)

    cigarMatch

        Work-In-Progress: this function is a placeholder for future 

        functionality, and is not intended for use at this time. 

        (default: OFF)

    testDataDump

        EXPERIMENTAL: This function dumps a bunch of information for 

        internal testing purposes. NOT FOR GENERAL USE! (default: 

        OFF)

    writeGeneBodyIv

        Writes an optional additional file detailing the intervals 

        used in the gene-body coverage calculations 

        ("QC.geneBodyCoverage.DEBUG.intervals.txt.gz") (default: 

        OFF) [Depends: writeGeneBody]

    fastqUtils

        BETA: requires --rawfastq parameter. Adds additional tests 

        that use the supplied raw fastq file. Requires that one (or 

        two) fastq files be supplied. (Default: ON iff rawfastq 

        param is found)

    referenceMatch

        BETA: requires --genomeFA parameter. This function 

        calculates the matching against the reference. Requires the 

        specification of genome fasta file(s). REQUIRES 

        COORDINATE-SORTED BAM FILES! REQUIRES THAT FA AND BAM HAVE 

        THE SAME CHROMOSOME ORDERING! (Default: ON iff genomeFA 

        param is found) [Depends: mismatchEngine]

    writeDocs

        Writes a QC.documentation.txt file that documents all output 

        files.

    makeJunctionBed

        Write splice-junction count "bed" files. (default: OFF)

    makeWiggles

        Write "wiggle" coverage files with 100-bp window size. Note: 

        this REQUIRES that the --chromSizes parameter be included! 

        (default: OFF)

    makeAllBrowserTracks

        Write both the "wiggle" and the splice-junction bed files 

        (default: OFF) [Depends: makeJunctionBed, makeWiggles]

    calcDetailedGeneCounts

        Calculate more detailed read counts for each gene, counting 

        the number of reads that cover introns, cross-strand, etc 

        (default: OFF)

AUTHORS:

    Stephen W. Hartley, Ph.D. stephen.hartley (at nih dot gov)

LEGAL:

    This software is "United States Government Work" under the terms 

    of the United States Copyright Act. It was written as part of 

    the authors' official duties for the United States Government 

    and thus cannot be copyrighted. This software is freely 

    available to the public for use without a copyright notice. 

    Restrictions cannot be placed on its present or future use.

    Although all reasonable efforts have been taken to ensure the 

    accuracy and reliability of the software and data, the National 

    Human Genome Research Institute (NHGRI) and the U.S. Government 

    does not and cannot warrant the performance or results that may 

    be obtained by using this software or data. NHGRI and the U.S. 

    Government disclaims all warranties as to performance, 

    merchantability  or fitness for any particular purpose.

    In any work or product derived from this material, proper 

    attribution of the authors as the source of the software or data 

    should be made, using "NHGRI Genome Technology Branch" as the 

    citation.

    NOTE: This package includes (internally) the sam-1.113.jar 

    library from picard tools. That package uses the MIT license, 

    which can be accessed using the command:

     java -jar thisjarfile.jar help samjdkinfo

Done. (Mon Jan 22 17:49:51 JST 2018)
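The --minMAPQ entry in the manual above is easy to overlook: by default, reads with MAPQ below 255 are excluded, which matches the RNA-STAR convention but not necessarily other aligners. The sketch below only records the thresholds quoted in that manual text in a lookup table; `min_mapq_args` and the dictionary keys are hypothetical names chosen for illustration.

```python
# Thresholds quoted in the QoRTs QC manual text above (v1.3.0).
# The dictionary keys are labels chosen for this sketch.
MIN_MAPQ_BY_ALIGNER = {
    "star": 255,    # default: MAPQ of uniquely aligned reads from RNA-STAR
    "tophat2": 50,  # "For use with TopHat2 you should set this to 50"
    "gsnap": 30,    # "a filtering threshold of 30 should be adequate"
}


def min_mapq_args(aligner):
    """Return the --minMAPQ arguments for an aligner label.

    Unknown aligners raise KeyError so that a threshold is never guessed.
    """
    threshold = MIN_MAPQ_BY_ALIGNER[aligner.lower()]
    return ["--minMAPQ", str(threshold)]


print(min_mapq_args("TopHat2"))
```

According to the same manual entry, setting the value to 0 turns MAPQ filtering off entirely.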

For example, move the jar into a bin/ directory and define an alias so that it can be run under the short name "QoRTs".

mv QoRTs.jar /usr/local/bin/
echo alias QoRTs=\'java -jar /usr/local/bin/QoRTs.jar\' >> ~/.bash_profile && source ~/.bash_profile

 

Run

Run it by specifying the BAM file of alignments and the GTF file. Results are written to the specified directory.

QoRTs QC --generatePdfReport input.bam input.gtf qcDataDir
  • --singleEnded Flag to indicate that reads are single end. (flag)
  • --stranded Flag to indicate that data is stranded. (flag)
  • --stranded_fr_secondstrand Flag to indicate that reads are from a fr_secondstrand type of stranded library (equivalent to the "stranded = yes" option in HTSeq or the "fr_secondStrand" library-type option in TopHat/CuffLinks). If your data is stranded, you must know the library type in order to analyze it properly. This utility uses the same definitions as cufflinks to define strandedness type. By default, the fr_firststrand library type is assumed for all stranded data (equivalent to the "stranded = reverse" option in HTSeq). (flag)
  • --generateMultiPlot Generate a multi-frame figure, containing a visual summary of all QC stats. (Note: this requires that R be installed and in the PATH, and that QoRTs be installed on that R installation) (flag)
  • --generatePdfReport Generate a pdf report. (Note: this requires that R be installed and in the PATH, and that QoRTs be installed on that R installation)

For a single-end BAM file, add the "--singleEnded" flag. Results are written to the specified directory, qcDataDir/.
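When several samples are processed, the command line can be assembled programmatically. This is a minimal sketch, assuming the jar sits at the path used for the alias above; `build_qorts_qc_command` is a hypothetical helper, and only flags that appear in the manual output are used.

```python
import shlex


def build_qorts_qc_command(bam, gtf, out_dir, *, single_ended=False,
                           stranded=False, pdf_report=False,
                           jar="/usr/local/bin/QoRTs.jar"):
    """Return the argument list for one QoRTs QC run.

    The positional order (infile, gtf, output directory) follows the
    USAGE line of the manual; the default jar path is an assumption.
    """
    cmd = ["java", "-jar", jar, "QC"]
    if single_ended:
        cmd.append("--singleEnded")
    if stranded:
        cmd.append("--stranded")
    if pdf_report:
        cmd.append("--generatePdfReport")
    cmd += [bam, gtf, out_dir]
    return cmd


cmd = build_qorts_qc_command("input.bam", "input.gtf", "qcDataDir",
                             single_ended=True, pdf_report=True)
# Print a copy-pastable shell command instead of running it.
print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` to start the run; printing it first makes it easy to inspect the exact command.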

 

With --generatePdfReport, a PDF report is produced.

[Images: six screenshots of pages from the PDF report]

 

 

 

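In addition to the QC command, there are several utility commands, for example for merging QC output from replicates, converting coverage to wig files that can be used with the UCSC browser, and generating reports. Check the details with the following command.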

QoRTs utilname --man

It is also clearly explained here.

QoRTs: Quality of RNA-Seq Toolset

 

Citation

QoRTs: a comprehensive toolset for quality control and data processing of RNA-Seq experiments

Stephen W. Hartley and James C. Mullikin

BMC Bioinformatics 2015, 16:224