fastqの操作ツール illumina-utils - macでインフォマティクス

illumina-utilsはpythonで記述されたilluminaのシーケンスデータのユーティリティツール。オーバーラップしたペアリードのmergeやクオリティフィルタリングを行うことができる。

インストール

sudo pip install illumina-utils

実行方法

raw fastqのdemulitiplexing。

ランにはindexの情報が載ったテキストファイルが必要。フォーマットについては公式の例を参照。

iu-demultiplex -s barcode_to_sample.txt --r1 r1.fastq --r2 r2.fastq --index index.fastq -o output/

illuminaの提供しているdemulitiplexツールはbcl2fastq（リンク）。

ほとんどのコマンドのランにはコンフィグファイルが必要。以下のようなファイル（input.txt）を元にコンフィグファイルを作成することができる。

sample	r1	r2
e_coli	ecoli-R1.fastq	ecoli-R2.fastq

コンフィグファイルの作成。

iu-gen-configs input.txt

この例だとe_coli.iniができる。中身を確認する。

user$ cat e_coli.ini

[general]

project_name = e_coli

researcher_email = u@example.edu

input_directory = /Users/user/Documents/test/

output_directory = /Users/use/Documents/test/

[files]

pair_1 = ecoli-R1.fastq

pair_2 = ecoli-R2.fastq

オーバーラップがあるペアリードのマージ。先ほどのconfigファイルを指定する。

iu-merge-pairs e_coli.ini

ランにはそれなりの時間がかかる（macbook pro corei7モデルだと300-bpの200万x2リードの処理1時間以上）。マージが終わると、マージが成功した配列と失敗した配列が別々のファイルで出力される。また、レポートも出力される。

マージされたfastqは、ヘッダーにどのような状態でマージされたかを示す情報がつく。クオリティ行は削られfastaフォーマットになっている。

user$ head -4 e_coli_MERGED

>e_coli_19|M02077:18:000000000-A88N7:1:1101:14587:1648 1:N:0:1|o:229|m/o:0.013100|MR:n=0;r1=3;r2=0|Q30:n/a|CO:0|mismatches:3

aattaaaaaaacttggctggcaatatgttcctggctgtttggaagatcaacctgttcctactgatccactgaTtactgaacgggaaagttttaagcagattcttattaagccccgacttcagcaagcCcttaagcgCattaatctgactgatgatggagagccatggctagatgactaccaaattgagtcggccatttcccagctagagcgggctgtcaccaccgaaaagctgatcgaagccaatcaactcatcacagaactgctctggaatggtgtgactgtattcgttcccaatggtaaagatgaaattgttcagttcattgatttcgagaatattgagcagaatgatttccttgccatcaatcaatatg

>e_coli_28|M02077:18:000000000-A88N7:1:1101:20129:1679 1:N:0:1|o:223|m/o:0.026906|MR:n=0;r1=5;r2=1|Q30:n/a|CO:0|mismatches:6

agaagctatcgctgaattttcccccctggaacaggaggacgttatgcagttaacaaccagttggatgcttcagggcattgaacagggcatCgaacgtggacaaaaatctctgctactcaaacaaattagacatcgTtttggagagttgaatgcagttaatTtgtcgaggattgacattctcaCagtgcctcaattggaacagttggGagaagtgctgttggactgctctgatttcgcagaattagaacaatggctagcggcccaatCtgaaacacctgagcgaaaaatttagtgaatttccagaggtggatgttgccatgaatgttcccaaagaatcaggtttcaaccatcggcttacaattccaaactgagttgata

--min-overlap-size　Minimum expected overlap. Default is 15.
--max-num-mismatches　Maximum number of mismatches at the overlapped region to retain the pair. The default behavior relies on `-P` parameter an does not pay attention to the number of mismatches at the overlapped region.
-P　Any merged sequence with P below the declared value is discarded and stored in a seperate file.
--min-qual-score　Minimum Q-score for a base to overwrite a mismatch at the overlapped region. If there is a mismatch at the overlapped region, the base with higher quality is being used in the final sequence. Alternatively, if the Q-score of the base with higher quality is lower than the Q-score declared with this parameter, that base is being marked as an ambiguous base, which may result in the elimination of the merged sequence depending on the --ignore-Ns paranmeter. The default value is 15.
--retain-only-overlap　When set, merger will only return the parts of reads that do overlap, and parts of reads that do not overlap will be trimmed.

e_coli_STATSにはレポートが出力される。

f:id:kazumaxneo:20170825200415j:plain

完全にオーバーラップしているリードだけマージする。

iu-merge-pairs e_coli.ini --marker-gene-stringent --retain-only-overlap --max-num-mismatches 0

クオリティフィルタリングは公式ページを参照してください。

引用

A Filtering Method to Generate High Quality Short Reads Using Illumina Paired-End Technology

A. Murat Eren , Joseph H. Vineis , Hilary G. Morrison, Mitchell L. Sogin

PLoS One. 2013 Jun 17;8(6):e66643