macでインフォマティクス

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

illumina、454、SOLIDのショートリードのシミュレータ ART

2019 7/16 タイトル修正 インストール追記、help追加

2023/12/14 追記

 

NGSのリードをシミュレートする手法はいくつかあるが、ここでははMiseqのリードなどもシミュレートできるARTを紹介する。

https://www.niehs.nih.gov/research/resources/software/biostatistics/art/

 ARTはIllumina's Solexa, Roche's 454 and Applied Biosystems' SOLiDのリードをジェネレートできる。illuminaはMiseq v3の250bpをサポートしている。gccのライブラリをインストールしておけばbinary版をダウンロードして叩くだけで動く。

 

インストール

ダウンロードリンク。

から一番新しいmacOSX版(ソースと書いてない方がbinary版)

またはbrewでも導入できる。

#bioconda (link)
mamba install -c bioconda -y art

#homebrew
brew install art

> art_illumina 

$ art_illumina 

 

    ====================ART====================

             ART_Illumina (2008-2016)          

          Q Version 2.5.8 (June 6, 2016)       

     Contact: Weichun Huang <whduke@gmail.com> 

    -------------------------------------------

 

===== USAGE =====

 

art_illumina [options] -ss <sequencing_system> -sam -i <seq_ref_file> -l <read_length> -f <fold_coverage> -o <outfile_prefix>

art_illumina [options] -ss <sequencing_system> -sam -i <seq_ref_file> -l <read_length> -c <num_reads_per_sequence> -o <outfile_prefix>

art_illumina [options] -ss <sequencing_system> -sam -i <seq_ref_file> -l <read_length> -f <fold_coverage> -m <mean_fragsize> -s <std_fragsize> -o <outfile_prefix>

art_illumina [options] -ss <sequencing_system> -sam -i <seq_ref_file> -l <read_length> -c <num_reads_per_sequence> -m <mean_fragsize> -s <std_fragsize> -o <outfile_prefix>

 

===== PARAMETERS =====

 

  -1   --qprof1   the first-read quality profile

  -2   --qprof2   the second-read quality profile

  -amp --amplicon amplicon sequencing simulation

  -c   --rcount   number of reads/read pairs to be generated per sequence/amplicon (not be used together with -f/--fcov)

  -d   --id       the prefix identification tag for read ID

  -ef  --errfree  indicate to generate the zero sequencing errors SAM file as well the regular one

                  NOTE: the reads in the zero-error SAM file have the same alignment positions

                  as those in the regular SAM file, but have no sequencing errors

  -f   --fcov     the fold of read coverage to be simulated or number of reads/read pairs generated for each amplicon

  -h   --help     print out usage information

  -i   --in       the filename of input DNA/RNA reference

  -ir  --insRate  the first-read insertion rate (default: 0.00009)

  -ir2 --insRate2 the second-read insertion rate (default: 0.00015)

  -dr  --delRate  the first-read deletion rate (default:  0.00011)

  -dr2 --delRate2 the second-read deletion rate (default: 0.00023)

  -k   --maxIndel the maximum total number of insertion and deletion per read (default: up to read length)

  -l   --len      the length of reads to be simulated

  -m   --mflen    the mean size of DNA/RNA fragments for paired-end simulations

  -mp  --matepair indicate a mate-pair read simulation

  -M  --cigarM    indicate to use CIGAR 'M' instead of '=/X' for alignment match/mismatch

  -nf  --maskN    the cutoff frequency of 'N' in a window size of the read length for masking genomic regions

                  NOTE: default: '-nf 1' to mask all regions with 'N'. Use '-nf 0' to turn off masking

  -na  --noALN    do not output ALN alignment file

  -o   --out      the prefix of output filename

  -p   --paired   indicate a paired-end read simulation or to generate reads from both ends of amplicons

                  NOTE: art will automatically switch to a mate-pair simulation if the given mean fragment size >= 2000

  -q   --quiet    turn off end of run summary

  -qL  --minQ     the minimum base quality score

  -qU  --maxQ     the maxiumum base quality score

  -qs  --qShift   the amount to shift every first-read quality score by 

  -qs2 --qShift2  the amount to shift every second-read quality score by

                  NOTE: For -qs/-qs2 option, a positive number will shift up quality scores (the max is 93) 

                  that reduce substitution sequencing errors and a negative number will shift down 

                  quality scores that increase sequencing errors. If shifting scores by x, the error

                  rate will be 1/(10^(x/10)) of the default profile.

  -rs  --rndSeed  the seed for random number generator (default: system time in second)

                  NOTE: using a fixed seed to generate two identical datasets from different runs

  -s   --sdev     the standard deviation of DNA/RNA fragment size for paired-end simulations.

  -sam --samout   indicate to generate SAM alignment file

  -sp  --sepProf  indicate to use separate quality profiles for different bases (ATGC)

  -ss  --seqSys   The name of Illumina sequencing system of the built-in profile used for simulation

       NOTE: sequencing system ID names are:

            GA1 - GenomeAnalyzer I (36bp,44bp), GA2 - GenomeAnalyzer II (50bp, 75bp)

           HS10 - HiSeq 1000 (100bp),          HS20 - HiSeq 2000 (100bp),      HS25 - HiSeq 2500 (125bp, 150bp)

           HSXn - HiSeqX PCR free (150bp),     HSXt - HiSeqX TruSeq (150bp),   MinS - MiniSeq TruSeq (50bp)

           MSv1 - MiSeq v1 (250bp),            MSv3 - MiSeq v3 (250bp),        NS50 - NextSeq500 v2 (75bp)

===== NOTES =====

 

* ART by default selects a built-in quality score profile according to the read length specified for the run.

 

* For single-end simulation, ART requires input sequence file, output file prefix, read length, and read count/fold coverage.

 

* For paired-end simulation (except for amplicon sequencing), ART also requires the parameter values of

  the mean and standard deviation of DNA/RNA fragment lengths

 

===== EXAMPLES =====

 

 1) single-end read simulation

  art_illumina -ss HS25 -sam -i reference.fa -l 150 -f 10 -o single_dat

 

 2) paired-end read simulation

       art_illumina -ss HS25 -sam -i reference.fa -p -l 150 -f 20 -m 200 -s 10 -o paired_dat

 

 3) mate-pair read simulation

       art_illumina -ss HS10 -sam -i reference.fa -mp -l 100 -f 20 -m 2500 -s 50 -o matepair_dat

 

 4) amplicon sequencing simulation with 5' end single-end reads 

  art_illumina -ss GA2 -amp -sam -na -i amp_reference.fa -l 50 -f 10 -o amplicon_5end_dat

 

 5) amplicon sequencing simulation with paired-end reads

       art_illumina -ss GA2 -amp -p -sam -na -i amp_reference.fa -l 50 -f 10 -o amplicon_pair_dat

 

 6) amplicon sequencing simulation with matepair reads

       art_illumina -ss MSv1 -amp -mp -sam -na -i amp_reference.fa -l 150 -f 10 -o amplicon_mate_dat

 

 7) generate an extra SAM file with zero-sequencing errors for a paired-end read simulation

       art_illumina -ss HSXn -ef -i reference.fa -p -l 150 -f 20 -m 200 -s 10 -o paired_twosam_dat

 

 8) reduce the substitution error rate to one 10th of the default profile

       art_illumina -i reference.fa -qs 10 -qs2 10 -l 50 -f 10 -p -m 500 -s 10 -sam -o reduce_error

 

 9) turn off the masking of genomic regions with unknown nucleotides 'N'

       art_illumina -ss HS20 -nf 0  -sam -i reference.fa -p -l 100 -f 20 -m 200 -s 10 -o paired_nomask

 

 10) masking genomic regions with >=5 'N's within the read length 50

       art_illumina -ss HSXt -nf 5 -sam -i reference.fa -p -l 150 -f 20 -m 200 -s 10 -o paired_maskN5

 

> art_SOLiD 

$ art_SOLiD 

=================================================================

                    ART_SOLiD (Version 1.3.3)                    

         Simulation of Applied Biosystems' SOLiD Sequencing       

   Copyright (c) 2008-2015, Weichun Huang. All Rights Reserved.  

=================================================================

 

===== USAGES =====

 

SINGLE-END (F3 READ) SIMULATION

art_SOLiD [ options ] <INPUT_SEQ_FILE> <OUTPUT_FILE_PREFIX> <LEN_READ> <FOLD_COVERAGE>

 

MATE-PAIR READS (F3-R3 PAIR) SIMULATION

art_SOLiD [ options ] <INPUT_SEQ_FILE> <OUTPUT_FILE_PREFIX> <LEN_READ> <FOLD_COVERAGE> <MEAN_FRAG_LEN> <STD_DEV>

 

PAIRED-END READS (F3-F5 PAIR) SIMULATION

art_SOLiD [ options ] <INPUT_SEQ_FILE> <OUTPUT_FILE_PREFIX> <LEN_READ_F3> <LEN_READ_F5> <FOLD_COVERAGE> <MEAN_FRAG_LEN> <STD_DEV>

 

AMPLICON SEQUENCING SIMULATION

art_SOLiD [ options ] -A s <INPUT_SEQ_FILE> <OUTPUT_FILE_PREFIX> <LEN_READ> <READS_PER_AMPLICON>

art_SOLiD [ options ] -A m <INPUT_SEQ_FILE> <OUTPUT_FILE_PREFIX> <LEN_READ> <READ_PAIRS_PER_AMPLICON>

art_SOLiD [ options ] -A p <INPUT_SEQ_FILE> <OUTPUT_FILE_PREFIX> <LEN_READ_F3> <LEN_READ_F5> <READ_PAIRS_PER_AMPLICON>

 

===== PARAMETERS =====

 

INPUT_SEQ_FILE            -  filename of DNA/RNA reference sequences in FASTA format

OUTPUT_FILE_PREFIX        -  prefix or directory for all output read data files

FOLD_COVERAGE             -  fold of read coverage over the reference sequences 

LEN_READ                  -  length of F3/R3 reads

LEN_READ_F3               -  length of F3 reads for paired-end read simulation

LEN_READ_F5               -  length of F5 reads for paired-end read simulation

READS_PER_AMPLICON        -  number of reads per amplicon

READ_PAIRS_PER_AMPLICON   -  number of read pairs per amplicon

MEAN_FRAG_LEN             -  mean DNA/RNA fragment size for matepair/paired-end read simulation

STD_DEV                   -  standard deviation of the DNA/RNA fragment sizes for matepair/paired-end read simulation

 

===== OPTIONAL PARAMETERS =====

 

-A specify the read type for amplicon sequencing simulation (s:single-end, m: matepair, p: paired-end)

-M indicate to use CIGAR 'M' instead of '=/X' for alignment match/mismatch

-s indicate to generate a SAM alignment file

-r specify the random seed for the simulation

-f specify the scale factor adjusting error rate (e.g., -f 0 for zero-error rate simulation)

-p specify user's own read profile for simulation

 

===== EXAMPLES =====

 

 1) singl-end 25bp reads simulation at 10X coverage

art_SOLiD -s seq_reference.fa ./outdir/single_dat 25 10

 2) singl-end 75bp reads simulation at 20X coverage with user's error profile

art_SOLiD -s -p ../SOLiD_profiles/profile_pseudo ./seq_reference.fa ./dat_userProfile 75 20

 3) matepair 35bp (F3-R3) reads simulation at 20X coverage with DNA/RNA MEAN fragment size 2000bp and STD 50

art_SOLiD -s seq_reference.fa ./outdir/matepair_dat 35 20 2000 50

 4) matepair reads simulation with a fixed random seed

art_SOLiD -r 777 -s seq_reference.fa ./outdir/matepair_fs 50 10 1500 50

 5) paired-end reads (75bp F3, 35bp F5) simulation with the MEAN fragment size 250 and STD 10 at 20X coverage

art_SOLiD -s seq_reference.fa ./outdir/paired_dat 75 35 50 250 10

 6) amplicon sequencing with 25bp single-end reads at 100 reads per amplicon

art_SOLiD -A s -s amp_reference.fa ./outdir/amp_single 25 100

 7) amplicon sequencing with 50bp matepair reads at 80 read pairs per amplicon

art_SOLiD -A m -s amp_reference.fa ./outdir/amp_matepair 50 80

 8) amplicon sequencing with paired-end reads (35bp F3, 25bp F5 reads) at 50 pairs per amplicon

art_SOLiD -A p -s amp_reference.fa ./outdir/amp_pair 35 25 50

art_454

$ art_454 

===================================================================

                      ART_454 (Version 2.6.0)                      

                   Simulation of 454 Pyrosequencing                

     Copyright (c) 2008-2015, Weichun Huang. All Rights Reserved.  

===================================================================

 

===== USAGE ===== 

 

SINGLE-END SIMULATION

art_454 [-s] [-a ] [-t] [-r rand_seed] [ -p read_profile ] [ -c num_flow_cycles ] <INPUT_SEQ_FILE> <OUTPUT_FILE_PREFIX> <FOLD_COVERAGE>

 

PAIRED-END SIMULATION

art_454 [-s] [-a ] [-t] [-r rand_seed] [ -p read_profile ] [ -c num_flow_cycles ] <INPUT_SEQ_FILE> <OUTPUT_FILE_PREFIX> <FOLD_COVERAGE> <MEAN_FRAG_LEN> <STD_DEV>

 

AMPLICON SEQUENCING SIMULATION

art_454 [-s] [-a ] [-t] [-r rand_seed] [ -p read_profile ] [ -c num_flow_cycles ] <-A|-B> <INPUT_SEQ_FILE> <OUTPUT_FILE_PREFIX> <#_READS/#_READ_PAIRS_PER_AMPLICON>

 

===== PARAMETERS =====

 

INPUT_SEQ_FILE           -  the filename of DNA/RNA reference sequences in FASTA format

OUTPUT_FILE_PREFIX       -  the prefix or directory of output read data file (*.fq) and read alignment file (*.aln)

FOLD_COVERAGE            -  the fold of read coverage over the reference sequences

MEAN_FRAG_LEN            -  the average DNA fragment size for paired-end read simulation

STD_DEV                  -  the standard deviation of the DNA fragment size for paired-end read simulation

#READS_PER_AMPLICON      -  number of reads per amplicon (for 5'end amplicon sequencing)

#READ_PAIRS_PER_AMPLICON -  number of read pairs per amplicon (for two-end amplicon sequencing)

 

===== OPTIONAL PARAMETERS =====

 

-A indicate to perform single-end amplicon sequencing simulation

-B indicate to perform paired-end amplicon sequencing simulation

-M indicate to use CIGAR 'M' instead of '=/X' for alignment match/mismatch

-a indicate to output the ALN alignment file

-s indicate to output the SAM alignment file

-d print out warning messages for debugging

-t indicate to simulate reads from the built-in GS FLX Titanium profile [default: GS FLX profile]

-r specify a fixed random seed for the simulation (to generate two identical datasets from two different runs)

-c specify the number of flow cycles by the sequencer [ default: 100 for GS-FLX, and 200 for GS-FLX Titanium ] 

-p specify user's own read profile for simulation

   NOTE: the name of a read profile is the directory containing read profile data files.

         please read the REAME file about the format of 454 read profile data files and.

         and the default filenames of these data files.

 

===== EXAMPLES =====

 

 1) singl-end simulation with 20X coverage

art_454 -s seq_reference.fa ./outdir/single_dat 20

 2) paired-end simulation with the mean fragment size 1500 and STD 20 using GS FLX Titanium platform

art_454 -s -t seq_reference.fa ./outdir/paired_dat 10 1500 20

 3) paired-end simulation with a fixed random seed

art_454 -s -r 777 seq_reference.fa ./outdir/paired_fxSeed 10 2500 50

 4) single-end amplicon sequencing with 10 reads per amplicon

art_454 -A -s amplicon_ref.fa ./outdir/amp_single 10

 5) paired-end amplicon sequencing with 10 read pairs per amplicon

art_454 -B -s amplicon_ref.fa ./outdir/amp_paired 10

 

 

 

ラン

Miseq v3のエラープロファイルでカバレッジ100、インサートサイズ600、インサートサイズのSD10、250bpのペアードエンドリードを発生させる。alnファイルは出力しない。

art_illumina -ss MSv3 -i ref.fa -p -l 250 -f 100 -m 600 -s 10 --noALN -o 250bp_paired
  • -l リード生成元のfasta
  • -l リード長(最大250)
  • -m インサートサイズ(リード込みのサイズ)
  • -f リードデプス(100なら-lで指定したゲノムのx100カバレッジ
  • -p ペアードエンド。

 ランが終わるとシミュレート結果のペアエンドエンドfastqができる。-samをつけるとゴールデンスタンダードとして利用できるsamファイルも出力される。

 

Hiseq2500のエラープロファイルでカバレッジ20、インサートサイズ2500、SD50、100-bpのメイトペアリードを発生させる。

art_illumina -ss HS25 -i ref.fa -mp -l 100 -f 20 -m 2500 -s 50 -o matepair_dat

  illuminaのデータをシミュレートして作った配列を見ると、5'末のクオリティが非常に低いことを確認できる。

 

SOLiD

solidのデータも出力できる。出力はcolorspaceのfastq。エラー発生率は過去の大量のデータから学習された確率に基づく。

 25bpで平均カバレッジが10のシングルリードを発生させる。

art_SOLiD -s ref.fa single 25 10
  • -A specify the read type for amplicon sequencing simulation (s:single-end, m: matepair, p: paired-end)
  • -M indicate to use CIGAR 'M' instead of '=/X' for alignment match/mismatch
  • -s indicate to generate a SAM alignment file
  • -r specify the random seed for the simulation
  • -f specify the scale factor adjusting error rate (e.g., -f 0 for zero-error rate simulation)
  • -p specify user's own read profile for simulation

samが必要なければ-sを消す。

長さが75 (F3) x 35 (F5)で、カバレッジ50、インサートサイズ250、SD10のペアードエンドリードを発生させる。

art_SOLiD -s ref.fa pair 75 35 50 250 10

長さ、カバレッジなどを記載する順番は決まっているので注意。

 

長さが35-35 (F3-R3)で、カバレッジ20、インサートサイズ2000、SD50のメイトペアリードを発生させる。

art_SOLiD -s ref.fa mate 35 20 2000 50

 

454も同じような感じで使用できます。

 

2023/12/14追記

たくさんのゲノムをシミュレートする時は並列化すると早い。ここではGNU parallelを使う。拡張子を.fnaとすると

#10並列、HS25プロファイルでシングルエンドの10xカバレッジ、あとでgzip圧縮&.alnは消す
ls *.fna | parallel -j 10 'base_name=$(basename {} .fna);
art_illumina -ss HS25 -i {} -l 150 -f 10 -o $base_name && rm ${base_name}.aln && gzip ${base_name}.fq'

basenameについて

 

 引用

ART: a next-generation sequencing read simulator.

Huang W, Li L, Myers JR, Marth GT.

Bioinformatics. 2012 Feb 15;28(4):593-4.