yacrd: removing chimeric regions (low-overlap regions) from long reads as a pre-processing step for assembly

2019: fixed errors in the commands

2020 3/30: noted how the commands differ between versions

2020 3/31: added the version 0.6.0 commands at the bottom

2020 4/23: added the published paper

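Third-generation DNA sequencing (PacBio, Oxford Nanopore) is becoming a key technology for building reference genomes (de novo assembly). New bioinformatics methods for this type of data are emerging rapidly.
Some long-read assemblers perform error correction on the reads before assembly. Correction reduces the high error rate of third-generation reads and makes assembly more tractable, but it is also a time- and memory-consuming step. Recent assemblers (e.g., Li (2016); Ruan and Li (2019)) have found ways to assemble raw, uncorrected reads directly. We therefore focus here on correction-free assembly only. In this setting, assembly quality is affected by chimeric reads and highly erroneous regions (Myers, 2015).
The DASCRUBBER program (Myers, 2017) introduced the concept of read "scrubbing": quickly removing problematic regions from reads without otherwise attempting to correct bases. The idea is that scrubbing reads is a lighter-weight operation than correction and is therefore well suited to high-performance, correction-free genome assemblers.
DASCRUBBER performs an all-against-all mapping of the reads and builds a pileup for each read. It then analyzes mapping quality to identify regions with a putatively high error rate, which are replaced by equivalent, higher-quality regions from other reads in the pileup. MiniScrub (LaPierre et al., 2018) is another scrubbing tool; it uses a modified version of Minimap2 (Li, 2017) to record the positions of the anchors used for overlap detection. For each read, MiniScrub converts the anchor positions into an image. A convolutional neural network then detects and removes low-quality read regions.
Another problem, further upstream of read scrubbing, is the computation of overlaps between reads. Storing overlaps is disk-intensive, and to the authors' knowledge there has been no attempt so far to optimize this potentially large disk footprint.
This paper presents two tools that jointly optimize the early stages of long-read assembly. One is yacrd (for Yet Another Chimeric Read Detector), for fast and effective read scrubbing. The other is fpa (Filter Pairwise Alignment), which filters the overlaps detected between reads.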

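Like DASCRUBBER and MiniScrub, yacrd relies on the assumption that low-quality regions of a read are not well supported by other reads. To detect such regions, yacrd runs an all-against-all read mapping with Minimap2 and then computes the coverage of each read. In contrast to DASCRUBBER and MiniScrub, yacrd uses only the approximate positional mapping information given by Minimap2, which avoids the time-consuming alignment step. This comes at the cost of having no base-level alignment, but it turns out to be sufficient for scrubbing. Reads are split wherever coverage drops below a given threshold (set to 4 by default), and the low-coverage regions are removed entirely.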
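The coverage idea described above can be sketched with standard shell tools. This is only an illustration, not yacrd's implementation: the function name low_coverage_regions is hypothetical, it assumes a tab-separated PAF file (columns 1-4 and 6-9 holding read name, length, start and end for the query and target), and it treats positions whose overlap depth is strictly below the threshold as low coverage. Reads that never appear in the PAF file are not reported.

```bash
# Sketch only: list low-coverage intervals per read from a PAF file.
# Usage: low_coverage_regions overlap.paf [threshold]
low_coverage_regions() {
    paf="$1"
    threshold="${2:-4}"
    # Each overlap covers one interval on the query and one on the target,
    # so emit a +1 event at each interval start and a -1 event at each end.
    awk -F'\t' -v OFS='\t' '{
        print $1, $2, $3, 1; print $1, $2, $4, -1
        print $6, $7, $8, 1; print $6, $7, $9, -1
    }' "$paf" |
    sort -t "$(printf '\t')" -k1,1 -k3,3n -k4,4n |
    awk -F'\t' -v OFS='\t' -v t="$threshold" '
        function emit(s, e) {        # merge adjacent low-coverage stretches
            if (have && e0 == s) { e0 = e; return }
            if (have) print name, s0, e0
            s0 = s; e0 = e; have = 1
        }
        function finish() {          # close the read currently being swept
            if (name == "") return
            if (pos < len && depth < t) emit(pos, len)
            if (have) print name, s0, e0
            have = 0
        }
        $1 != name { finish(); name = $1; len = $2; pos = 0; depth = 0 }
        {
            # depth holds the coverage of the stretch that starts at pos
            if ($3 > pos) { if (depth < t) emit(pos, $3); pos = $3 }
            depth += $4
        }
        END { finish() }
    '
}
```

Calling low_coverage_regions overlap.paf 4 prints one line per interval (read name, start, end). Splitting reads at these intervals, as yacrd does, is left out of the sketch.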

(Figure reproduced from GitHub)

2022/03/02

Yacrd 1.0.0 Magby is out.

No major change just dependency update.

I think that all yacrd features are now implemented and stable.

I'm working on a tool that would have the same goal as yacrd but with another approach.https://t.co/LXZSAU3wAx #rustlang #Bioinformatics
— Pierre Marijon 🏳️‍🌈 (@pierre_marijon) March 1, 2022

Preprint of my new paper about two of my tools #yacrd and #fpa (I think I'm a little too much talk about it here) is now available !!

Discover how to optimize your long read assemblies for a minimum cost with two simple and efficient tools. https://t.co/wCC1qUOLl7
— Pierre Marijon 🏳️‍🌈 (@pierre_marijon) June 18, 2019

2020 3/3 addendum

https://twitter.com/pierre_marijon/status/1234783942986944512

Installation

Tested on Ubuntu 16.04 in a Miniconda3 4.0.5 environment.

Dependencies

Rust in stable channel
libgz
libbzip2
liblzma
minimap2 (needed for the read-to-read mapping)

conda install -c bioconda -y minimap2

yacrd itself (GitHub)

#bioconda (link): the commands differ from 0.6; version 0.5.1 was installed and tested here.
conda install -c bioconda -y yacrd==0.5.1

#cargo
cargo install yacrd
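
After installing, a quick check that both commands are on the PATH can save a failed run later. A minimal sketch, assuming a POSIX shell; the helper name missing_tools is hypothetical and not part of yacrd or minimap2:

```bash
# Sketch only: print the name of every listed tool that is not on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}
```

Running missing_tools minimap2 yacrd prints nothing when both tools are available.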

$ yacrd -h

yacrd 0.5.1 Omanyte
Pierre Marijon <pierre.marijon@inria.fr>
Yet Another Chimeric Read Detector

USAGE:
    yacrd [SUBCOMMAND]

FLAGS:
    -h, --help       Prints help information
    -V, --version    Prints version information

SUBCOMMANDS:
    chimeric     In chimeric mode yacrd detect chimera if coverage gap are in middle of read
    help         Prints this message or the help of the given subcommand(s)
    scrubbing    In scrubbing mode yacrd remove all part of read not covered

$ yacrd chimeric -h

yacrd-chimeric
In chimeric mode yacrd detect chimera if coverage gap are in middle of read

USAGE:
    yacrd chimeric [FLAGS] [OPTIONS]

FLAGS:
    -j, --json       Yacrd report are write in json format
    -h, --help       Prints help information
    -V, --version    Prints version information

OPTIONS:
    -i, --input <input>...
            Mapping input file in PAF or MHAP format (with .paf or .mhap extension), use - for read standard input (no
            compression allowed, paf format by default) [default: -]
    -o, --output <output>
            Path where yacrd report are writen, use - for write in standard output same compression as input or use
            --compression-out [default: -]
    -f, --filter <filter>...
            Create a new file {original_path}_fileterd.{original_extension} with only not chimeric records, format
            support fasta|fastq|mhap|paf
    -e, --extract <extract>...
            Create a new file {original_path}_extracted.{original_extension} with only chimeric records, format support
            fasta|fastq|mhap|paf
    -s, --split <split>...
            Create a new file {original_path}_splited.{original_extension} where chimeric records are split, format
            support fasta|fastq
    -F, --format <format>    Force the format used [possible values: paf, mhap]
    -c, --chimeric-threshold <chimeric-threshold>
            Overlap depth threshold below which a gap should be created [default: 0]
    -n, --not-covered-threshold <not-covered-threshold>
            Coverage depth threshold above which a read are marked as not covered [default: 0.80]
        --filtered-suffix <filtered-suffix>
            Change the suffix of file generate by filter option [default: _filtered]
        --extracted-suffix <extracted-suffix>
            Change the suffix of file generate by extract option [default: _extracted]
        --splited-suffix <splited-suffix>
            Change the suffix of file generate by split option [default: _splited]
    -C, --compression-out <compression-out>
            Output compression format, the input compression format is chosen by default [possible values: gzip, bzip2,
            lzma, no]

$ yacrd scrubbing -h

yacrd-scrubbing
In scrubbing mode yacrd remove all part of read not covered

USAGE:
    yacrd scrubbing [FLAGS] [OPTIONS] --mapping <mapping> --report <report> --scrubbed <scrubbed> --sequence <sequence>

FLAGS:
    -j, --json       Yacrd report are write in json format
    -h, --help       Prints help information
    -V, --version    Prints version information

OPTIONS:
    -m, --mapping <mapping>
            Path to mapping file in PAF or MHAP format (with .paf or .mhap extension, paf format by default)
    -s, --sequence <sequence>
            Path to sequence you want scrubbed, format support fasta|fastq
    -r, --report <report>        Path where yacrd report are writen
    -S, --scrubbed <scrubbed>    Path where scrubbed read are write [default: -]
    -c, --chimeric-threshold <chimeric-threshold>
            Overlap depth threshold below which a gap should be created [default: 0]
    -n, --not-covered-threshold <not-covered-threshold>
            Coverage depth threshold above which a read are marked as not covered [default: 0.80]
    -M, --mapping-format <format>    Force the format used [possible values: paf, mhap]

Usage

1. chimeric - flags a read as chimeric when a coverage gap lies in the middle of the read

Chimera detection. If needed, use minimap2's preset option "-x" (e.g., "-x ava-ont"). Uses 12 threads.

minimap2 -t 12 reads.fq.gz reads.fq.gz | yacrd chimeric -o reads.yacrd

Output the reads with chimeric reads filtered out

minimap2 -t 12 reads.fq.gz reads.fq.gz > mapping.paf
yacrd chimeric -i mapping.paf -f reads.fasta > reads.yacrd # produces reads_filtered.fasta

-f, --filter <filter> Create a new file {original_path}_fileterd.{original_extension} with only not chimeric records, format support fasta|fastq|mhap|paf

Output only the chimeric reads

minimap2 -t 12 reads.fq.gz reads.fq.gz > mapping.paf
yacrd chimeric -i mapping.paf -e reads.fasta > reads.yacrd # produce reads_extracted.fasta

-e, --extract <extract> Create a new file {original_path}_extracted.{original_extension} with only chimeric records, format support fasta|fastq|mhap|paf

Split reads at chimeric regions, drop those regions, and output all reads

minimap2 -t 12 reads.fq.gz reads.fq.gz > mapping.paf
yacrd chimeric -i mapping.paf -s reads.fasta > reads.yacrd # produce reads_splited.fasta

-s, --split <split> Create a new file {original_path}_splited.{original_extension} where chimeric records are split, format support fasta|fastq

2. scrubbing - unlike chimeric, detects chimeras wherever in the read the coverage gap lies

Chimera detection for ONT reads. Uses 12 threads.

minimap2 -x ava-ont -g 500 -t 12 reads.fq.gz reads.fq.gz > overlap.paf
yacrd scrubbing -c 4 -n 0.4 -m overlap.paf -s reads.fasta -S reads_scrubbed.fasta -r scrubbed_report.yacrd

minimap2

-g stop chain enlongation if there are no minimizers in INT-bp [5000]

yacrd

-c Overlap depth threshold below which a gap should be created [default: 0]
-n Coverage depth threshold above which a read are marked as not covered [default: 0.80]
-m Path to mapping file in PAF or MHAP format (with .paf or .mhap extension, paf format by default)
-s Path to sequence you want scrubbed, format support fasta|fastq
-S Path where scrubbed read are write [default: -]
-r Path where yacrd report are writen
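
How -c and -n interact can be illustrated with a toy classifier. This is one reading of the help text above, not yacrd's actual code: it assumes the gaps (stretches at or below the -c depth) are already known, that a read counts as not covered when the gaps add up to more than the -n fraction of its length, and that it is otherwise chimeric when at least one gap touches neither end of the read. The function name classify_read and the three labels are hypothetical.

```bash
# Sketch only: classify one read from its length, the -n ratio and its gaps.
# Usage: classify_read <read_length> <not_covered_ratio> [begin,end ...]
classify_read() {
    len="$1"
    ratio="$2"
    shift 2
    printf '%s\n' "$@" | awk -F',' -v len="$len" -v ratio="$ratio" '
        NF == 2 {
            total += $2 - $1                      # bases inside coverage gaps
            if ($1 > 0 && $2 < len) internal = 1  # gap touches neither read end
        }
        END {
            if (total > ratio * len) print "not_covered"
            else if (internal)       print "chimeric"
            else                     print "ok"
        }'
}
```

With -n 0.4, as in the commands above, a 1000 bp read with gaps 0,300 and 600,1000 has 700 gap bases, more than 0.4 x 1000, so the sketch labels it not_covered.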

Chimera detection for PacBio P6-C4 reads. Uses 12 threads.

minimap2 -x ava-pb -g 800 -t 12 reads.fq.gz reads.fq.gz > overlap.paf
yacrd scrubbing -c 4 -n 0.4 -m overlap.paf -s reads.fasta -S reads_scrubbed.fasta -r scrubbed_report.yacrd

Chimera detection for PacBio Sequel reads. Uses 12 threads.

minimap2 -x ava-pb -g 800 -t 12 reads.fq.gz reads.fq.gz > overlap.paf
yacrd scrubbing -c 4 -n 0.4 -m overlap.paf -s reads.fasta -S reads_scrubbed.fasta -r scrubbed_report.yacrd
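
Since the minimap2 calls above differ only in the preset and the -g value, they can be generated by a small helper. A sketch under assumptions: the function name overlap_cmd and the labels ont and pb are hypothetical, and the values are simply the ones used in this post (ava-ont with -g 500, ava-pb with -g 800). The function prints the command instead of running it, so it can be checked first.

```bash
# Sketch only: print the all-versus-all minimap2 command for a technology.
# Usage: overlap_cmd <ont|pb> <reads file> [threads]
overlap_cmd() {
    tech="$1"
    reads="$2"
    threads="${3:-12}"
    case "$tech" in
        ont) preset="ava-ont"; gap=500 ;;   # values used for ONT in this post
        pb)  preset="ava-pb";  gap=800 ;;   # values used for PacBio in this post
        *)   echo "unknown technology: $tech" >&2; return 1 ;;
    esac
    printf 'minimap2 -x %s -g %s -t %s %s %s\n' \
        "$preset" "$gap" "$threads" "$reads" "$reads"
}
```

For example, overlap_cmd ont reads.fq.gz prints the ONT command shown earlier, ready to be redirected into overlap.paf once it has been run.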

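The preprint shows that removing chimeric reads substantially improves the performance of de novo assembly of the human and nematode genomes with RedBean (formerly wtdbg2) and miniasm.

fpa is covered in a separate post.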

2020 03/30

The commands changed substantially from v0.6 onward, so here is a brief note on usage.

#all-versus-all overlap of the ONT reads
minimap2 -x ava-ont -g 500 -t 16 ONT.fq.gz ONT.fq.gz > overlap.paf

#check whether there are chimeras
yacrd -i overlap.paf -o reads.yacrd

#if the report lists chimeric or not-covered reads, filter them with the command below
#convert the reads to FASTA first
seqkit fq2fa ONT.fq.gz > ONT.fa

#run yacrd's filter subcommand
yacrd -i overlap.paf -o reads.yacrd filter -i ONT.fa -o reads.filter.fasta
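
To see at a glance whether filtering is worthwhile, the report can be tallied by read type. A minimal sketch, assuming the report is tab-separated with the read type in the first column; check a few lines of your own reads.yacrd first, since that layout is an assumption here. The function name report_summary is hypothetical.

```bash
# Sketch only: count how many reads fall into each type listed in the report.
# Usage: report_summary reads.yacrd
report_summary() {
    awk -F'\t' '{ n[$1]++ } END { for (k in n) print k "\t" n[k] }' "$1" | sort
}
```

Each output line holds a read type and the number of reads with that type.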

References

yacrd and fpa: upstream tools for long-read genome assembly

Pierre Marijon, Rayan Chikhi, Jean-Stéphane Varré

bioRxiv preprint first posted online Jun. 18, 2019

yacrd and fpa: upstream tools for long-read genome assembly
Pierre Marijon, Rayan Chikhi, Jean-Stéphane Varré
Bioinformatics, 2020