PLR-GEN: generating pseudo-long reads for metagenome analysis

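Metagenome analysis based on high-throughput sequencing data is a powerful approach for constructing microbial genomes from environmental samples without cultivation. However, because a metagenome is a mixture of genomes from multiple microorganisms, the analysis is complex and challenging, particularly when only short reads are available. Long-read sequencing technologies have been developed and are beginning to be used for metagenome analysis, but generating long reads costs more than generating short reads, so many metagenome studies have been based on short reads.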

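The study presents a new method called PLR-GEN. Given a set of reference genome sequences, it creates pseudo-long reads from metagenomic short reads while taking into account the small sequence variations that exist in individual genomes of the same or different species. When applied to a mock community data set from the Human Microbiome Project, PLR-GEN dramatically extended 101 bp short reads into pseudo-long reads with an N50 of 33 Kbp and an error rate of 0.4%. Using the pseudo-long reads generated by PLR-GEN clearly improved metagenome analysis in terms of the number of sequences, assembly contiguity, and the prediction of species and genes.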
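Since the headline result is quoted as an N50, it may help to recall how that statistic is computed. The following is a minimal sketch of the usual definition (the length at which the cumulative sum of lengths, taken from longest to shortest, first reaches half of the total). The function name `n50` and the toy lengths are illustrative assumptions and are not taken from PLR-GEN or the paper.

```python
def n50(lengths):
    """Return the N50 of a collection of read or contig lengths."""
    ordered = sorted(lengths, reverse=True)  # longest first
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # First length at which the cumulative sum covers half of all bases
        if running >= half_total:
            return length
    raise ValueError("lengths must not be empty")


# Toy lengths (hypothetical, not PLR-GEN output)
toy_lengths = [2, 2, 2, 3, 3, 4, 8, 8]
print(n50(toy_lengths))
```

Here the total is 32 bases, and the two longest sequences already cover 16 of them, so the N50 of the toy set is 8 even though most sequences are much shorter.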

Installation

I installed it in a conda environment. If TAMA is used to prepare the reference genomes, TAMA must be installed as well.

Dependencies

Third party programs

Bowtie2
BEDtools
SAMtools

Perl libraries

Parallel::ForkManager
Getopt::Long
File::Basename
Scalar::Util
FindBin
Math::Round
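Before running the pipeline, it can be worth checking that the dependencies listed above are actually visible in the active environment. The sketch below is one possible check, not part of PLR-GEN: it assumes the executables are installed under their usual command names (`bowtie2`, `bedtools`, `samtools`) and tests each Perl module by asking `perl` to load it. The helper names `missing_tools` and `missing_perl_modules` are made up for this example.

```python
import shutil
import subprocess

# Assumed command names for the third-party programs listed above
TOOLS = ["bowtie2", "bedtools", "samtools"]
PERL_MODULES = [
    "Parallel::ForkManager", "Getopt::Long", "File::Basename",
    "Scalar::Util", "FindBin", "Math::Round",
]


def missing_tools(names):
    """Return the executables from `names` that are not found on PATH."""
    return [name for name in names if shutil.which(name) is None]


def missing_perl_modules(modules, perl="perl"):
    """Return the Perl modules that `perl -MModule -e 1` fails to load."""
    if shutil.which(perl) is None:
        return list(modules)  # without perl, nothing can be loaded
    missing = []
    for module in modules:
        result = subprocess.run(
            [perl, f"-M{module}", "-e", "1"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        if result.returncode != 0:
            missing.append(module)
    return missing


if __name__ == "__main__":
    print("missing tools:", missing_tools(TOOLS))
    print("missing Perl modules:", missing_perl_modules(PERL_MODULES))
```

Two empty lists suggest that the environment is ready. Anything reported as missing would need to be installed before running PLR-GEN.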

GitHub

git clone https://github.com/jkimlab/PLR-GEN.git
cd PLR-GEN
mamba env create -f plrgen_env.yml
conda activate plrgen_env
./build.pl install

#docker
docker pull jkimlab/plrgen

> perl PLR-GEN.pl

Usage: PLR-GEN.pl [options] -1 <pe1> -2 <pe2> (or -s <se>) -r <ref_list> -o <out_dir>

== MANDATORY

-s <se> File with unpaired reads [incompatible with -1 and -2]

-1 <pe1> File with #1 mates (paired 1) [incompatible with -s]

-2 <pe2> File with #2 mates (paired 2) [incompatible with -s]

-r|-ref <ref_list> The list of reference genome sequence files

-tama Reference preparation using TAMA [incompatible with -r|-ref]

-sampling <proportion> proportion to random sampling for references (default: off, range: 0-1)

-o <out_dir> Output directory (default: ./PR.out)

==Running and filtering options

-p|-core <integer> The number of threads (default: 1)

-q|-mapq <integer> Minimum mapping quality (default: 20)

-l|-min_length <integer> Cutoff of minimum length of pseudo-long reads (default: 100bp)

-c|-min_count <integer> Cutoff of minimum mapping depth for each node (default: 1)

-d|-min_depth <integer> Cutoff of mapping depth of bubbles (default: 1, range: 0-100)

0: all bubbles are used.

1: bubbles with less than 1% mapping depth from mapping depth distribution of bubbles are converted to normal nodes.

100: all bubbles are converted to normal nodes

==Other options

-t|-temp If you use -t option, all intermediate files are left.

Please careful to use this option because it has to be needed very large space.

-h|-help Print help page.
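The help text above states several constraints: `-s` cannot be combined with `-1`/`-2`, `-tama` cannot be combined with `-r`, `-sampling` ranges from 0 to 1, and `-d` ranges from 0 to 100. As a rough illustration, these can be checked before launching a long run. The wrapper `build_plrgen_args` below is hypothetical and not part of PLR-GEN. It also assumes that exactly one of `-r` and `-tama` must be given, which the help text implies but does not state outright.

```python
def build_plrgen_args(out_dir, se=None, pe1=None, pe2=None, ref_list=None,
                      tama=False, sampling=None, threads=1, min_depth=1):
    """Assemble a PLR-GEN.pl argument list after checking the documented constraints."""
    if se and (pe1 or pe2):
        raise ValueError("-s is incompatible with -1 and -2")
    if not se and not (pe1 and pe2):
        raise ValueError("give either -s, or both -1 and -2")
    # Assumption: exactly one reference source (-r or -tama) is required
    if tama == bool(ref_list):
        raise ValueError("give exactly one of -r and -tama")
    if sampling is not None and not 0 <= sampling <= 1:
        raise ValueError("-sampling must be between 0 and 1")
    if not 0 <= min_depth <= 100:
        raise ValueError("-d must be between 0 and 100")

    args = ["perl", "PLR-GEN.pl"]
    args += ["-s", se] if se else ["-1", pe1, "-2", pe2]
    args += ["-tama"] if tama else ["-r", ref_list]
    if sampling is not None:
        args += ["-sampling", str(sampling)]
    args += ["-p", str(threads), "-d", str(min_depth), "-o", out_dir]
    return args


# Paired-end run with a reference list (file names are placeholders)
print(" ".join(build_plrgen_args("outdir", pe1="read_1.fq", pe2="read_2.fq",
                                 ref_list="reference_list.txt", threads=8)))
```

The returned list can be passed to `subprocess.run` as-is. Building it separately makes it easy to inspect the command before committing to a run that may take a day.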

How to run

Specify single-end or paired-end FASTQ files and a list of reference genomes (ls <path>/<to>/ref*fna.gz > list).

./PLR-GEN.pl -1 read_1.fq -2 read_2.fq -r reference_list.txt -o outdir

-s File with unpaired reads [incompatible with -1 and -2]
-1 File with #1 mates (paired 1) [incompatible with -s]
-2 File with #2 mates (paired 2) [incompatible with -s]
-r The list of reference genome sequence files
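The reference list passed to `-r` is a plain text file with one reference path per line, as produced by the `ls` redirection mentioned above. If a scripted version is preferred, the following sketch writes the same kind of file. The function name `write_reference_list` is made up here, and writing absolute paths is an assumption intended to keep the list valid when PLR-GEN is started from another directory.

```python
import glob
import os


def write_reference_list(pattern, list_path):
    """Write every file matching `pattern` to `list_path`, one path per line."""
    paths = sorted(os.path.abspath(p) for p in glob.glob(pattern))
    if not paths:
        # Fail early rather than hand PLR-GEN an empty list
        raise FileNotFoundError(f"no reference files match {pattern!r}")
    with open(list_path, "w") as handle:
        for path in paths:
            handle.write(path + "\n")
    return paths
```

For example, `write_reference_list("refs/ref*fna.gz", "reference_list.txt")` would collect the matching genomes under a hypothetical `refs/` directory.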

References are processed one at a time, so a long reference list takes considerable time (about a day with the test data).

Example output

> seqkit stats *reads.fa.gz

(output for the test data)
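If seqkit is not available, the basic numbers it reports (sequence count, total length, minimum, average and maximum length) can be approximated with a few lines of Python. This is only a rough stand-in under the assumption that the output is a gzip-compressed FASTA file, as the `*reads.fa.gz` pattern suggests. The function names and the file name in the usage note are placeholders.

```python
import gzip


def read_lengths(fasta_gz):
    """Yield the length of each sequence in a gzip-compressed FASTA file."""
    length = None  # None until the first header line is seen
    with gzip.open(fasta_gz, "rt") as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if length is not None:
                    yield length
                length = 0
            elif length is not None:
                length += len(line)
    if length is not None:
        yield length


def summarize(fasta_gz):
    """Return count and length statistics similar to the seqkit stats columns."""
    lengths = list(read_lengths(fasta_gz))
    if not lengths:
        return {"num_seqs": 0, "sum_len": 0, "min_len": 0,
                "avg_len": 0.0, "max_len": 0}
    total = sum(lengths)
    return {
        "num_seqs": len(lengths),
        "sum_len": total,
        "min_len": min(lengths),
        "avg_len": total / len(lengths),
        "max_len": max(lengths),
    }
```

Calling `summarize("outdir/example.reads.fa.gz")` on a pseudo-long read file (the path is hypothetical) returns a dictionary that can be compared with the input short reads to see how much the reads were extended.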

Reference
Generation and application of pseudo-long reads for metagenome assembly
Mikang Sim, Jongin Lee, Suyeon Wy, Nayoung Park, Daehwan Lee, Daehong Kwon, Jaebum Kim

Gigascience. 2022 May 17;11:giac044