SeqyClean: a preprocessing tool for Illumina and 454 reads


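 Modern high-throughput sequencing instruments produce massive amounts of data, which often contain noise in the form of sequencing errors, sequencing adapters, and contaminating reads. This noise complicates genomics studies. Many preprocessing software tools have been developed to reduce sequence noise, but many of them cannot handle data from multiple technologies, and few address more than one type of noise. The paper presents SeqyClean, a comprehensive preprocessing software pipeline. SeqyClean effectively removes multiple sources of noise in high-throughput sequence data and, according to the authors' tests, outperforms other available preprocessing tools. The authors first show that preprocessing data with SeqyClean improves both de novo genome assembly and genome mapping. They have used SeqyClean extensively in the genomics core of the Institute for Bioinformatics and Evolutionary Studies (IBEST) at the University of Idaho, so it has been validated with both test data and production data. SeqyClean is available as open-source software under the MIT license.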


SeqyClean offers (from the GitHub README):

  1. Adapter/key/primers filtering
  2. Vector and contaminants filtering.
  3. Quality trimming.
  4. Poly A/T trimming.
  5. Overlapping paired reads.




I installed it with conda in a miniconda3-4.0.5 environment on macOS 10.14.


To build from source, clone or download the repository, then cd to the seqyclean home folder and type make. Build requirements:

  • zlib
  • make

Main repository: GitHub

conda install -c bioconda -y seqyclean

$ seqyclean -h

Version: 1.10.09 (2018-10-16)


usage: ./seqyclean libflag input_file_name_1 [libflag input_file_name_2] -o output_prefix [options]


Common arguments for all library types:

   -h, --help - Show this help and exit.

   -v <filename> - Turns on vector trimming, default=off. <filename> - is a path to a FASTA-file containing vector genomes.

   -c <filename> - Turns on contaminants screening, default=off, <filename> - is a path to a FASTA-file containing contaminant genomes.

   -k <value> - Common size of k-mer, default=15

   -d - Distance between consecutive k-mers, default=1

   -kc <value> - Size of k-mer used in sampling contaminat genome, default=15

   -qual <max_average_error> <max_error_at_ends> - Turns on quality trimming, default=off. Error boundaries: max_average_error (default=0.01), max_error_at_ends (default=0.01)

   -bracket <window_size> <max_avg_error> - Bracket window_size (default=0.794) and maximum_average_error (default=0.794) for quality trimming

   -window window_size max_avg_error [window_size max_avg_error ...] - Parameters for window trimming. There are two windows with size of 50 and 10bp and max_avg_err of 0.794 by default.

   -ow - Overwrite existing results, default=off

   -minlen <value> - Minimum length of read to accept, default=50 bp.

   -polyat [cdna] [cerr] [crng] - Turns on poly A/T trimming, default=off. Parameters: cdna (default=10) - maximum size of a poly tail, cerr (default=3) - maximum number of G/C nucleotides within a tail, cnrg (default=50) - range to look for a tail within a read.

   -verbose - Verbose output, default=off.

   -detrep - Generate detailed report for each read, default=off.

   -dup [-startdw 10][-sizedw 35] [-maxdup 3] - Turns on screening duplicated sequences, default=off. Here: -startdw (defalt=10) and -sizedw (default=25) are starting position and size of the window within a read, -maxdup (default=3) - maximum number of duplicated sequences allowed.

   -no_adapter_trim - Turns off trimming of adapters, default=off.

Roche 454 only arguments:

   -t <value> - Number of threads (not yet applicable to Illumina mode), default=4.

   -fastq - Output in FASTQ format, default=off.

   -fasta_out - Output in FASTA format, default=off.

   -m <filename> - Using custom barcodes, default=off. <filename> - a path to a FASTA-file with custom barcodes.

Illumina paired- and single-end arguments:

   -1 <filename1> -2 <filename2> - Paired-end mode (see examples below)

   -U <filename> - Single-end mode

   -shuffle - Store non-paired Illumina reads in shuffled file, default=off.

   -i64 - Turns on 64-quality base, default = off.

   -adp <filename> - Turns on using custom adapters, default=off. <filename> - FASTA file with adapters

   -at <value> - This option sets the similarity threshold for adapter trimming by overlap (only in paired-end mode). By default its value is set to 0.75.

   -overlap <value> - This option turns on merging overlapping paired-end reads and <value> is the minimum overlap length. By default the minimum overlap length is 16 base pairs.

   -new2old - Switch to fix read IDs, default=off ( As is detailed in: ).

   -gz - compressed output (GZip format, .gz).

   -alen - Maximum adapter length, default=30 bp.(only for paired-end mode)


Roche 454:

./seqyclean -454 test_data/in_001.sff -o test/Test454 -v test_data/vectors.fasta

Paired-end Illumina library:

./seqyclean -1 test_data/R1.fastq.gz -2 test_data/R2.fastq.gz -o test/Test_Illumina

Single-end Illumina library:

./seqyclean -U test_data/R1.fastq.gz -o test/Test_Illumina

Please ask Ilya by email: in case of any questions.





# Illumina paired-end (use -U instead for single-end)
seqyclean -1 pair_1.fq.gz -2 pair_2.fq.gz -o illumina_output

# Roche 454
seqyclean -454 input.sff -o 454_output -t 8
  • -t <value>    Number of threads (not yet applicable to Illumina mode), default=4.
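
When several paired-end libraries need the same treatment, the paired-end invocation above can be generated once per sample. The following is a minimal sketch and not part of SeqyClean: the function name `build_seqyclean_cmd`, the `<sample>_1.fq.gz` / `<sample>_2.fq.gz` naming scheme, and the `_clean` output suffix are all assumptions made for illustration. It only prints the commands (a dry run), so they can be inspected before anything is executed.

```shell
#!/bin/sh
# Hypothetical helper: print the seqyclean paired-end command for one sample.
# Assumes reads are named <sample>_1.fq.gz and <sample>_2.fq.gz; adjust this
# to your own naming scheme.
build_seqyclean_cmd() {
    sample=$1
    printf 'seqyclean -1 %s_1.fq.gz -2 %s_2.fq.gz -o %s_clean\n' \
        "$sample" "$sample" "$sample"
}

# Dry run: print one command per sample name given on the command line.
for s in "$@"; do
    build_seqyclean_cmd "$s"
done
```

Saved under a name of your choice and called with sample names as arguments, the script prints one command per sample. Once the output looks right, it can be piped to `sh` to run the commands, provided the sample names contain no spaces or shell metacharacters.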





# Illumina paired-end with quality trimming, vector trimming, and contaminant screening
seqyclean -1 pair_1.fq.gz -2 pair_2.fq.gz -o illumina_output \
-qual -v vector.fasta -c contamination.fasta
  • -qual     Turns on quality trimming, default=off
  • -v <filename>     Turns on vector trimming, default=off. <filename> - is a path to a FASTA-file containing vector genomes.
  • -c <filename>     Turns on contaminants screening, default=off, <filename> - is a path to a FASTA-file containing contaminant genomes.
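
The error thresholds accepted by `-qual` are easier to choose when related to Phred quality scores. Assuming they are per-base error probabilities in the usual Phred sense (an assumption based on the "max_average_error" wording in the help text, which does not state this explicitly), the relation is P = 10^(-Q/10). Under that assumption the default of 0.01 corresponds to Q20, and the 0.794 shown for window trimming corresponds to roughly Q1. The helper name `phred_to_perr` below is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: convert a Phred quality score Q to an error
# probability P = 10^(-Q/10), printed with four decimal places.
phred_to_perr() {
    awk -v q="$1" 'BEGIN { printf "%.4f\n", 10 ^ (-q / 10) }'
}

phred_to_perr 20   # 0.0100, matches the -qual default
phred_to_perr 30   # 0.0010, a stricter threshold
```

For example, to require roughly Q30 both on average and at the read ends, one could pass `-qual 0.001 0.001`, following the `-qual <max_average_error> <max_error_at_ends>` form shown in the help text.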




SeqyClean: A Pipeline for High-throughput Sequence Data Preprocessing

Ilya Y. Zhbannikov, Samuel S. Hunter, James A. Foster, Matthew L. Settles

Conference Paper

ACM-BCB '17 Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics
Pages 407-416



Adapter sequences made up of the P5/P7 adapters, the i5/i7 indexes, and linkers are removed.