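Cookiecutter: k-mer-based read filtering

2022/02/08: added installation notes

Next-generation sequencing has become cheaper and is now useful for routine analysis. Many tasks require extracting or removing particular sequences from raw reads before assembly. Using only a small portion of extracted region-specific reads (for example, from mtDNA or rRNA) can significantly improve the speed and quality of the resulting assembly and its analysis (Hahn et al., 2013). On the other hand, read trimming is not always a desirable strategy, and it can be simpler and more effective to remove every read that contains a fragment of a technical sequence. Such a removal procedure keeps all reads at the same length, which is required, for example, for pseudogenome suffix array construction (Kowalski et al., 2015).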
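Several tools have been developed for raw read processing: trimming based on a library of adapter sequences (Bolger et al., 2014; Martin, 2011), and removing reads according to coverage and alignment identity (Schmieder and Edwards, 2011a) (Morgan et al., 2009), sequence entropy measures (Schmieder and Edwards, 2011b), or copy number (Brown et al., 2012). However, tools for filtering reads by their similarity to an arbitrary sequence are lacking. Cookiecutter was developed to address this problem.
Cookiecutter is a standalone command-line tool package that is easy to integrate into any analysis pipeline. It takes one or more FASTQ files as input and requires a file containing a list of k-mers for filtering. That list is either supplied by the user or generated by Cookiecutter from a provided FASTA file. Cookiecutter handles both single-end and paired-end reads. For paired-end reads, it keeps a pair when both reads pass the filter. If one read of a pair is filtered out, the other read is written separately as a single-end read.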
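The paired-end rule described above can be sketched in a few lines of Python. This is only an illustration of the rule as stated in the text, not Cookiecutter's actual code; the function name `route_pair` and the labels "paired", "single" and "filtered" are made up for this example.

```python
def route_pair(read1_passes, read2_passes):
    """Decide where each mate of a pair goes (hypothetical labels).

    A pair is kept only when both mates pass the filter. If exactly one
    mate is filtered out, the surviving mate is routed to a separate
    single-end output.
    """
    if read1_passes and read2_passes:
        return ("paired", "paired")
    if read1_passes:
        return ("single", "filtered")
    if read2_passes:
        return ("filtered", "single")
    return ("filtered", "filtered")


# Example: mate 2 hit the k-mer filter, so mate 1 becomes a single-end read.
print(route_pair(True, False))
```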

Installation

Tested on macOS 10.13.

Dependencies

make;
gcc 4.7 or higher;
python 2.7.

GitHub

Prebuilt binaries can be downloaded from the releases page.

https://github.com/ad3002/Cookiecutter/releases

wget  https://github.com/ad3002/Cookiecutter/releases/download/v1.0.0/cookiecutter_osx.tar.gz
tar -xvzf cookiecutter_osx.tar.gz
cd cookiecutter_osx.65cwMW/bin/
export PATH=$PWD:$PATH

# 2022/02/08: create a Python 2.7 environment and install into it (Linux)
mamba create -n cookiecutter python=2.7 -y
conda activate cookiecutter
wget https://github.com/ad3002/Cookiecutter/releases/download/v1.0.0/cookiecutter_linux_x64.tar.gz
tar -xvzf cookiecutter_linux_x64.tar.gz
cd cookiecutter_linux_x64/bin/
export PATH=$PWD:$PATH

$ ./cookiecutter -h
usage: cookiecutter [-h] [-v] [-e]
                    {extract,remove,rm_reads,separate,make_library} ...

Cookiecutter: a kmer-based read filtration and extraction tool.

positional arguments:
  {extract,remove,rm_reads,separate,make_library}
    extract             extract reads matching the specified k-mers
    remove              remove reads matching the specified k-mers
    rm_reads            classify reads applying the specified filters
    separate            separate reads matching or unmatching the specified
                        k-mers
    make_library        create a file of k-mers from sequences of the
                        specified FASTA file

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  -e, --echo            print commands to be launched instead of launching
                        them

$ cookiecutter make_library -h
usage: cookiecutter make_library [-h] -i INPUT [INPUT ...] -o OUTPUT -l LENGTH

Create a library of k-mers from the specified FASTA file.

optional arguments:
  -h, --help            show this help message and exit

required_arguments:
  -i INPUT [INPUT ...], --input INPUT [INPUT ...]
                        a list of FASTA files
  -o OUTPUT, --output OUTPUT
                        an output file of k-mers
  -l LENGTH, --length LENGTH
                        the length of generated k-mers

$ cookiecutter remove -h
usage: cookiecutter remove [-h]
                           (-i INPUT [INPUT ...] | -1 FASTQ1 [FASTQ1 ...])
                           [-2 FASTQ2 [FASTQ2 ...]] [-t THREADS] -f FRAGMENTS
                           -o OUTPUT

Removes reads according to a given list of k-mers and outputs only reads
without any matches to the provided k-mer list.

optional arguments:
  -h, --help            show this help message and exit
  -2 FASTQ2 [FASTQ2 ...], --fastq2 FASTQ2 [FASTQ2 ...]
                        a FASTQ file of the second paired-end reads
  -t THREADS, --threads THREADS
                        the number of threads for parallel processing of
                        multiple input files (default: 1)

required arguments:
  -i INPUT [INPUT ...], --input INPUT [INPUT ...]
                        a FASTQ file of single-end reads
  -1 FASTQ1 [FASTQ1 ...], --fastq1 FASTQ1 [FASTQ1 ...]
                        a FASTQ file of the first paired-end reads
  -f FRAGMENTS, --fragments FRAGMENTS
                        a file of k-mers
  -o OUTPUT, --output OUTPUT
                        a directory for output files

$ cookiecutter extract -h
usage: cookiecutter extract [-h]
                            (-i INPUT [INPUT ...] | -1 FASTQ1 [FASTQ1 ...])
                            [-2 FASTQ2 [FASTQ2 ...]] [-t THREADS] -f FRAGMENTS
                            -o OUTPUT

Extracts reads according to a given list of k-mers and outputs only the reads
that matched the list.

optional arguments:
  -h, --help            show this help message and exit
  -2 FASTQ2 [FASTQ2 ...], --fastq2 FASTQ2 [FASTQ2 ...]
                        a FASTQ file of the second paired-end reads
  -t THREADS, --threads THREADS
                        the number of threads for parallel processing of
                        multiple input files (default: 1)

required arguments:
  -i INPUT [INPUT ...], --input INPUT [INPUT ...]
                        a FASTQ file of single-end reads
  -1 FASTQ1 [FASTQ1 ...], --fastq1 FASTQ1 [FASTQ1 ...]
                        a FASTQ file of the first paired-end reads
  -f FRAGMENTS, --fragments FRAGMENTS
                        a file of k-mers
  -o OUTPUT, --output OUTPUT
                        a directory for output files

$ cookiecutter separate -h
usage: cookiecutter separate [-h]
                             (-i INPUT [INPUT ...] | -1 FASTQ1 [FASTQ1 ...])
                             [-2 FASTQ2 [FASTQ2 ...]] [-t THREADS] -f
                             FRAGMENTS -o OUTPUT

Outputs both matched and not matched reads in separate files.

optional arguments:
  -h, --help            show this help message and exit
  -2 FASTQ2 [FASTQ2 ...], --fastq2 FASTQ2 [FASTQ2 ...]
                        a FASTQ file of the second paired-end reads
  -t THREADS, --threads THREADS
                        the number of threads for parallel processing of
                        multiple input files (default: 1)

required arguments:
  -i INPUT [INPUT ...], --input INPUT [INPUT ...]
                        a FASTQ file of single-end reads
  -1 FASTQ1 [FASTQ1 ...], --fastq1 FASTQ1 [FASTQ1 ...]
                        a FASTQ file of the first paired-end reads
  -f FRAGMENTS, --fragments FRAGMENTS
                        a file of k-mers
  -o OUTPUT, --output OUTPUT
                        a directory for output files

$ cookiecutter rm_reads -h
usage: cookiecutter rm_reads [-h]
                             (-i INPUT [INPUT ...] | -1 FASTQ1 [FASTQ1 ...])
                             [-2 FASTQ2 [FASTQ2 ...]] [-t THREADS] [-p POLYGC]
                             [-l LENGTH] [-d] [-c DUST_CUTOFF] [-k DUST_K]
                             [-N] -f FRAGMENTS -o OUTPUT

The rm_reads tool is an extended version of remove enhanced with the DUST
filter, removing reads containing (G)n- and (C)n-tracks and unknown
nucleotides and filtering reads by their length; also its output includes both
filtered and unfiltered reads.

optional arguments:
  -h, --help            show this help message and exit
  -2 FASTQ2 [FASTQ2 ...], --fastq2 FASTQ2 [FASTQ2 ...]
                        a FASTQ file of the second paired-end reads
  -t THREADS, --threads THREADS
                        the number of threads for parallel processing of
                        multiple input files (default: 1)
  -p POLYGC, --polygc POLYGC
                        the polyG/polyC sequence length cutoff (default: 13)
  -l LENGTH, --length LENGTH
                        the read length cutoff (default: 50)
  -d, --dust            use the DUST filter (default: False)
  -c DUST_CUTOFF, --dust_cutoff DUST_CUTOFF
                        the score cutoff for the DUST filter (default: 2)
  -k DUST_K, --dust_k DUST_K
                        the window size for the DUST filter (default: 4)
  -N, --filterN         filter reads by the presence of Ns (default: False)

required arguments:
  -i INPUT [INPUT ...], --input INPUT [INPUT ...]
                        a FASTQ file of single-end reads
  -1 FASTQ1 [FASTQ1 ...], --fastq1 FASTQ1 [FASTQ1 ...]
                        a FASTQ file of the first paired-end reads
  -f FRAGMENTS, --fragments FRAGMENTS
                        a file of k-mers
  -o OUTPUT, --output OUTPUT
                        a directory for output files

Run

make_library

First, a k-mer library must be built. Specify a genome sequence in FASTA format; here the k-mer size is 27.

cookiecutter make_library -i ref.fa -o adapters.txt -l 27

$ head adapters.txt
TTGATCCCTCTTCATATCTAGGAGTTT 1
CTCCCTGGCCCTGGGGGGTCAGTTGCT 1
GGCCAGGGCTTCGCTATACAGACCAGA 1
AAACCCAAACAGCTTCGGCAGGATTGT 1
ATAATCAGTTTTAATAAAGCCGAACTC 1
ATAAGCCGGCACCCGGAAAGGAAACTC 1
ATCGCCTGGGCCGACGGGGCACGGTGG 1
TAACCGGGCGGAAACTAAATATTTACC 1
ATGGCCTGCACTTGATGCAAAAACCAA 1
AGAGAACACCAAGAAAATACCCATAGT 1
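To make the library format concrete, here is a minimal Python sketch that produces "k-mer, count" pairs like those shown above from FASTA input. It is a conceptual illustration, not the make_library implementation: the function names `read_fasta` and `count_kmers` are hypothetical, and skipping k-mers that contain N and ignoring reverse complements are assumptions of this sketch.

```python
from collections import Counter


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def count_kmers(sequences, k):
    """Count every overlapping k-mer in the given sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if "N" not in kmer:  # assumption: ambiguous k-mers are skipped
                counts[kmer] += 1
    return counts


# Toy example with k=4; the run above used -l 27.
fasta_lines = [">ref", "ACGTACGTAC"]
sequences = [seq for _, seq in read_fasta(fasta_lines)]
for kmer, count in sorted(count_kmers(sequences, 4).items()):
    print(kmer, count)
```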

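Processing a large genome or FASTQ file this way takes a long time, so in that case build the k-mer library with Jellyfish 2, whose output is compatible.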

jellyfish count -m 27 -s 10G -t 8 --text -o kmer_library.dat input.fastq

remove searches given k-mers in reads and outputs the reads without any matches to the k-mers.

This removes reads that contain k-mers from the genome above. Specify adapters.txt and the FASTQ file.

cookiecutter remove -i input.fq -f adapters.txt -o filtered

A FASTQ file of the reads that did not match is written to the filtered directory. For paired-end reads, use "-1" and "-2" instead of "-i".

extract searches given k-mers in reads and outputs the reads that matched the k-mers.

cookiecutter extract -i input.fq -f adapters.txt -o filtered

With extract, the reads that matched are output.

separate searches given k-mers in reads and outputs both matched and unmatched reads to two separate files.

cookiecutter separate -1 pair1.fq -2 pair2.fq -f adapters.txt -o output_dir

With separate, matched and unmatched reads are both output, split into separate files.
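The three commands above differ only in which side of the k-mer match they keep. The following Python sketch shows that idea on in-memory reads. It is a simplified illustration rather than Cookiecutter's implementation: the function names are hypothetical, the library is assumed to be lines of the form "KMER COUNT" as in adapters.txt, and only exact forward-strand matches are checked (whether the real tool also considers reverse complements is not covered here).

```python
def load_kmers(lines):
    """Collect the k-mers from lines of the form 'KMER COUNT'."""
    kmers = set()
    for line in lines:
        fields = line.split()
        if fields:
            kmers.add(fields[0].upper())
    return kmers


def contains_library_kmer(sequence, kmers, k):
    """Return True if any k-length window of the read is in the library."""
    sequence = sequence.upper()
    return any(sequence[i:i + k] in kmers
               for i in range(len(sequence) - k + 1))


def separate(reads, kmers, k):
    """Split (name, sequence) reads into matched and unmatched lists."""
    matched, unmatched = [], []
    for read in reads:
        if contains_library_kmer(read[1], kmers, k):
            matched.append(read)
        else:
            unmatched.append(read)
    return matched, unmatched


# Toy example with k=4.
kmers = load_kmers(["ACGT 1", "TTTT 3"])
reads = [("r1", "GGACGTGG"), ("r2", "CCCCCCCC")]
matched, unmatched = separate(reads, kmers, k=4)
print(matched)    # the side that extract would keep
print(unmatched)  # the side that remove would keep
```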

For more complex filtering, use the rm_reads command. It can filter by read length, low complexity, polyG/polyC runs, Ns, and so on.

rm_reads is an extension of remove that additionally provides options to filter reads by the presence of (C)n/(G)n tracks or unknown nucleotides, read length or low sequence complexity and outputs both filtered and unfiltered reads.

cookiecutter rm_reads -1 pair1.fq -2 pair2.fq -f adapters.txt -o output_dir --polygc 13 --length 50 --dust --filterN

When I tested it, the -N filter was the only one that did not work.
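As a rough illustration of what the extra rm_reads options check, here is a Python sketch of the length, polyG/polyC and N filters. It is not the tool's code: the function name `passes_basic_filters` is hypothetical, and it assumes that reads shorter than the length cutoff are rejected and that a run of at least `polygc` consecutive G or C bases triggers the polyG/polyC filter. The DUST low-complexity filter is left out.

```python
import re


def passes_basic_filters(sequence, polygc=13, min_length=50, filter_n=False):
    """Return True if a read survives the length, polyG/polyC and N checks.

    Defaults mirror the values shown in the rm_reads help text; the exact
    comparison semantics are assumptions of this sketch.
    """
    sequence = sequence.upper()
    if len(sequence) < min_length:
        return False
    # A run of polygc or more consecutive Gs, or of consecutive Cs.
    if re.search("G{%d,}|C{%d,}" % (polygc, polygc), sequence):
        return False
    if filter_n and "N" in sequence:
        return False
    return True


# A 73-base read ending in thirteen Gs is rejected by the polyG check.
print(passes_basic_filters("ACGT" * 15 + "G" * 13))
```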

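The high memory usage and computational load are a concern, but used well, I think it can serve many purposes, such as removing contamination or enriching a target genome from a metagenome. Multiple files can also be specified. See the README on GitHub for details.

BBDuk from BBTools can also perform k-mer filtering.

Citation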

Cookiecutter: a tool for kmer-based read filtering and extraction

Ekaterina Starostina, Gaik Tamazian, Pavel Dobrynin, Stephen O'Brien, Aleksey Komissarov

bioRxiv preprint first posted online Aug. 16, 2015; doi: http://dx.doi.org/10.1101/024679.
