macでインフォマティクス

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

ショートリードまたはロングリードのシーケンスデータからウイルスや微生物を迅速に同定する fastv

 

 本論文では、ショートリードまたはロングリードのシーケンスデータからウイルスや微生物を迅速に同定するためのツールセットと関連リソースを紹介する。fastvは、シーケンシングデータ中に存在する微生物の配列を検出し、対象となる微生物を同定し、微生物ゲノムのカバレッジを可視化するための超高速ツールである。このツールは、k-mer mapping and extension法に基づいている。K-merセットは、このツールセットに含まれるもう一つのツールであるUniqueKMERによって生成される。UniqueKMERは、大規模なウイルスや微生物のゲノムセット内の各ゲノムに対するユニークなk-merの完全なセットを生成することができる。便宜上,微生物およびヒトを悩ませる一般的なウイルスのユニークなk-merが生成され,ツールとともに提供されている。fastvは、FASTQデータを入力し、その結果をHTMLおよびJSON形式で直接出力する軽量なツールである。fastvは、k-mer解析の前に、アダプターのトリミング、クオリティトリミング、塩基補正などの前処理を自動的に行い、k-mer解析の精度を確保します。特に、fastvは、SARS-CoV-2の迅速な同定とタイピングをサポートしている。実験の結果、fastvはシーケンスデータからSARS-CoV-2を100%の感度と100%の特異度で検出することができ、SARS-CoV-2をSARS、Middle East respiratory syndrome、その他のコロナウイルスと区別することができた。このツールセットは、https://github.com/OpenGene/fastvから利用できる。

 

手順

fastvは、FASTQファイルを入力として受け取り、以下の手順で微生物の配列を検出し、レポートを生成する。

1)fastpと同様に、データのQCとクオリティフィルタリングを行う(アダプターのカット、低品質リードの除去、誤塩基の修正)。
2)クリーンデータをスキャンして、ユニークなk-merを含む配列や、リファレンス微生物ゲノムにマッピング可能な配列を収集する。
3)統計処理を行い、結果をHTML形式で可視化し、JSON形式で出力する。
オンターゲットのシーケンスリードを出力し、下流のツールで解析できるようにする。

本ツールは、ショートリード(Illumina, BGIなど)とロングリード(ONT, PacBioなど)の両方に対応している。 

 

インストール

Github

#bioconda (link)
mamba create -n fastv -y
conda activate fastv
mamba install -c bioconda -y fastv

> fastv -h

$ fastv -h

option needs value: --html

usage: fastv [options] ... 

options:

-i, --in1 read1 input file name (string [=])

-I, --in2 read2 input file name (string [=])

-o, --out1 file name to store read1 with on-target sequences (string [=])

-O, --out2 file name to store read2 with on-target sequences (string [=])

-c, --kmer_collection the unique k-mer collection file in fasta format, see an example: http://opengene.org/kmer_collection.fasta (string [=])

-k, --kmer the unique k-mer file of the detection target in fasta format. data/SARS-CoV-2.kmer.fa will be used if none of k-mer/Genomes/k-mer_Collection file is specified (string [=])

-g, --genomes the genomes file of the detection target in fasta format. data/SARS-CoV-2.genomes.fa will be used if none of k-mer/Genomes/k-mer_Collection file is specified (string [=])

-p, --positive_threshold the data is considered as POSITIVE, when its mean coverage of unique kmer >= positive_threshold (0.001 ~ 100). 0.1 by default. (float [=0.1])

-d, --depth_threshold For coverage calculation. A region is considered covered when its mean depth >= depth_threshold (0.001 ~ 1000). 1.0 by default. (float [=1])

-E, --ed_threshold If the edit distance of a sequence and a genome region is <=ed_threshold, then consider it a match (0 ~ 50). 8 by default. (int [=8])

--long_read_threshold A read will be considered as long read if its length >= long_read_threshold (100 ~ 10000). 200 by default. (int [=200])

--read_segment_len A long read will be splitted to read segments, with each <= read_segment_len (50 ~ 5000, should be < long_read_threshold). 100 by default. (int [=100])

--bin_size For coverage calculation. The genome is splitted to many bins, with each bin has a length of bin_size (1 ~ 100000), default 0 means adaptive. (int [=0])

-j, --json the json format report file name (string [=fastv.json])

-h, --html the html format report file name (string [=fastv.html])

-R, --report_title should be quoted with ' or ", default is "fastv report" (string [=fastv report])

-w, --thread worker thread number, default is 4 (int [=4])

-6, --phred64 indicate the input is using phred64 scoring (it'll be converted to phred33, so the output will still be phred33)

-z, --compression compression level for gzip output (1 ~ 9). 1 is fastest, 9 is smallest, default is 4. (int [=4])

--stdin input from STDIN. If the STDIN is interleaved paired-end FASTQ, please also add --interleaved_in.

--stdout stream passing-filters reads to STDOUT. This option will result in interleaved FASTQ output for paired-end output. Disabled by default.

--interleaved_in indicate that <in1> is an interleaved FASTQ which contains both read1 and read2. Disabled by default.

--reads_to_process specify how many reads/pairs to be processed. Default 0 means process all reads. (int [=0])

--dont_overwrite don't overwrite existing files. Overwritting is allowed by default.

-V, --verbose output verbose log information (i.e. when every 1M reads are processed).

-A, --disable_adapter_trimming adapter trimming is enabled by default. If this option is specified, adapter trimming is disabled

-a, --adapter_sequence the adapter for read1. For SE data, if not specified, the adapter will be auto-detected. For PE data, this is used if R1/R2 are found not overlapped. (string [=auto])

--adapter_sequence_r2 the adapter for read2 (PE data only). This is used if R1/R2 are found not overlapped. If not specified, it will be the same as <adapter_sequence> (string [=auto])

--adapter_fasta specify a FASTA file to trim both read1 and read2 (if PE) by all the sequences in this FASTA file (string [=])

--detect_adapter_for_pe by default, the auto-detection for adapter is for SE data input only, turn on this option to enable it for PE data.

-f, --trim_front1 trimming how many bases in front for read1, default is 0 (int [=0])

-t, --trim_tail1 trimming how many bases in tail for read1, default is 0 (int [=0])

-b, --max_len1 if read1 is longer than max_len1, then trim read1 at its tail to make it as long as max_len1. Default 0 means no limitation (int [=0])

-F, --trim_front2 trimming how many bases in front for read2. If it's not specified, it will follow read1's settings (int [=0])

-T, --trim_tail2 trimming how many bases in tail for read2. If it's not specified, it will follow read1's settings (int [=0])

-B, --max_len2 if read2 is longer than max_len2, then trim read2 at its tail to make it as long as max_len2. Default 0 means no limitation. If it's not specified, it will follow read1's settings (int [=0])

--poly_g_min_len the minimum length to detect polyG in the read tail. 10 by default. (int [=10])

-G, --disable_trim_poly_g disable polyG tail trimming, by default trimming is automatically enabled for Illumina NextSeq/NovaSeq data

-x, --trim_poly_x enable polyX trimming in 3' ends.

--poly_x_min_len the minimum length to detect polyX in the read tail. 10 by default. (int [=10])

-5, --cut_front move a sliding window from front (5') to tail, drop the bases in the window if its mean quality < threshold, stop otherwise.

-3, --cut_tail move a sliding window from tail (3') to front, drop the bases in the window if its mean quality < threshold, stop otherwise.

-r, --cut_right move a sliding window from front to tail, if meet one window with mean quality < threshold, drop the bases in the window and the right part, and then stop.

-W, --cut_window_size the window size option shared by cut_front, cut_tail or cut_sliding. Range: 1~1000, default: 4 (int [=4])

-M, --cut_mean_quality the mean quality requirement option shared by cut_front, cut_tail or cut_sliding. Range: 1~36 default: 20 (Q20) (int [=20])

--cut_front_window_size the window size option of cut_front, default to cut_window_size if not specified (int [=4])

--cut_front_mean_quality the mean quality requirement option for cut_front, default to cut_mean_quality if not specified (int [=20])

--cut_tail_window_size the window size option of cut_tail, default to cut_window_size if not specified (int [=4])

--cut_tail_mean_quality the mean quality requirement option for cut_tail, default to cut_mean_quality if not specified (int [=20])

--cut_right_window_size the window size option of cut_right, default to cut_window_size if not specified (int [=4])

--cut_right_mean_quality the mean quality requirement option for cut_right, default to cut_mean_quality if not specified (int [=20])

-Q, --disable_quality_filtering quality filtering is enabled by default. If this option is specified, quality filtering is disabled

-q, --qualified_quality_phred the quality value that a base is qualified. Default 15 means phred quality >=Q15 is qualified. (int [=15])

-u, --unqualified_percent_limit how many percents of bases are allowed to be unqualified (0~100). Default 40 means 40% (int [=40])

-n, --n_base_limit if one read's number of N base is >n_base_limit, then this read/pair is discarded. Default is 5 (int [=5])

-e, --average_qual if one read's average quality score <avg_qual, then this read/pair is discarded. Default 0 means no requirement (int [=0])

-L, --disable_length_filtering length filtering is enabled by default. If this option is specified, length filtering is disabled

-l, --length_required reads shorter than length_required will be discarded, default is 15. (int [=15])

--length_limit reads longer than length_limit will be discarded, default 0 means no limitation. (int [=0])

-y, --low_complexity_filter enable low complexity filter. The complexity is defined as the percentage of base that is different from its next base (base[i] != base[i+1]).

-Y, --complexity_threshold the threshold for low complexity filter (0~100). Default is 30, which means 30% complexity is required. (int [=30])

--filter_by_index1 specify a file contains a list of barcodes of index1 to be filtered out, one barcode per line (string [=])

--filter_by_index2 specify a file contains a list of barcodes of index2 to be filtered out, one barcode per line (string [=])

--filter_by_index_threshold the allowed difference of index barcode for index filtering, default 0 means completely identical. (int [=0])

-C, --correction enable base correction in overlapped regions (only for PE data), default is disabled

--overlap_len_require the minimum length to detect overlapped region of PE reads. This will affect overlap analysis based PE merge, adapter trimming and correction. 30 by default. (int [=30])

--overlap_diff_limit the maximum number of mismatched bases to detect overlapped region of PE reads. This will affect overlap analysis based PE merge, adapter trimming and correction. 5 by default. (int [=5])

--overlap_diff_percent_limit the maximum percentage of mismatched bases to detect overlapped region of PE reads. This will affect overlap analysis based PE merge, adapter trimming and correction. Default 20 means 20%. (int [=20])

-U, --umi enable unique molecular identifier (UMI) preprocessing

--umi_loc specify the location of UMI, can be (index1/index2/read1/read2/per_index/per_read, default is none (string [=])

--umi_len if the UMI is in read1/read2, its length should be provided (int [=0])

--umi_prefix if specified, an underline will be used to connect prefix and UMI (i.e. prefix=UMI, UMI=AATTCG, final=UMI_AATTCG). No prefix by default (string [=])

--umi_skip if the UMI is in read1/read2, fastv can skip several bases following UMI, default is 0 (int [=0])

-?, --help print this message

 

 

データベース

レポジトリのリンクから、2種類の前もってビルドされたk-merデータベースをダウンロードできる。

1viruses、2、viruses and common human microorganisms

 

SARS-CoV-2のk-mer データベースだけ必要なら、下のテストランのようにレポジトリをcloneする。”-k”でfastv/data/にあるk-merファイル(ユニークなk-merのfasta形式ファイル)を指定する。全データもダウンロードできるが、UniqueKMERを使って指定した頻度以上のk-merを取り出す必要がある。

 

テストラン1

SARS-CoV-2 identification

git clone https://github.com/OpenGene/fastv.git
cd fastv/data/
wget http://opengene.org/fastv/testdata.fq.gz
fastv -i testdata.fq.gz -k SARS-CoV-2.kmer.fa -g SARS-CoV-2.genomes.fa -j report.json -h report.html -w 4
  • -i      read1 input file name (string [=])
  • -k     the unique k-mer file of the detection target in fasta format. data/SARS-CoV-2.kmer.fa will be used if none of k-mer/Genomes/k-mer_Collection file is specified (string [=])
  • -g     the genomes file of the detection target in fasta format. data/SARS-CoV-2.genomes.fa will be used if none of k-mer/Genomes/k-mer_Collection file is specified (string [=])
  • -j      the json format report file name (string [=fastv.json])
  • -h     the html format report file name (string [=fastv.html])
  • -w    worker thread number, default is 4 (int [=4])

 

出力

f:id:kazumaxneo:20210710224428p:plain

report.html

f:id:kazumaxneo:20210710233531p:plain

f:id:kazumaxneo:20210710233555p:plain

非表示の項目はクリックすると展開される。

 

ヒットがない場合、Detection result for target k-mer file: negativeとなる。

f:id:kazumaxneo:20210710230114p:plain

 

テストラン2

メタゲノムからの細菌検出例(レポジトリ参照)

 

 

実行方法
 ヒットしたリードを下流解析のために保存する。

#illumina paired-end
fastv -i pair_R1.fq.gz -I pair_R2.fq.gz -o out_R1.fq.gz -O out_R2.fq.gz -k SARS-CoV-2.kmer.fa -j report.json -h report.html

 

感想

fastpの高速な前処理機能と、臨床で重要なウィルスや菌の分析機能を統合したツールと言えますね。たいへん有用なツールだと思います。

引用

A computational toolset for rapid identification of SARS-CoV-2, other viruses and microorganisms from sequencing data
Shifu Chen, Changshou He, Yingqiang Li, Zhicheng Li, Charles E Melançon, III
Briefings in Bioinformatics, Volume 22, Issue 2, March 2021, Pages 924–935

 

関連