rRNAを除く SortMeRNA - macでインフォマティクス

2020 2/5 condaインストール追記

2020 6/16 コマンドが大きく変更したため更新（v２.１）

2020 12/9 unmapを出力するようにコマンドを修正, 再びhelp更新（v４.２）

次世代シーケンシング（NGS）技術を生物群集から直接抽出したRNAに適用すると、コーディング型とノンコーディング型の両方のRNAを特徴づける断片が混在している。これらを区別し、メッセンジャー RNAとリボソーム RNA（rRNA）のファミリーをさらに分類することは、相互作用環境の遺伝子発現パターンや構成種の系統分類を調べる上で重要なステップである。

　本研究では、メタトランスクリプトームデータからrRNA断片を迅速にフィルタリングするために設計された新しいソフトウェアSortMeRNAを紹介する。このソフトウェアは、大規模なリードセットを扱うことができ、rRNAデータベースに一致する全ての断片を高感度かつ低ランニングタイムでソートすることが可能である。

出力はfasta、fastq、アライメントのsam、またblastライクな出力も可能である。Illumina, 454, Ion Torrent and PacBioのシーケンスデータに対応している。QIIMEと一緒に使用することで、OTUを検出し系統解析にも利用することができる。

マニュアル

http://bioinfo.lifl.fr/RNA/sortmerna/code/SortMeRNA-user-manual-v2.1.pdf

FAQ

http://bioinfo.lifl.fr/sortmerna/faqs.php

インストール

Github

conda install -c bioconda -y sortmerna

> sortmerna -h

# sortmerna -h

[process:1369] === Options processing starts ... ===

Found value: sortmerna

Found flag: -h

[process:1453] Processing option: h with value:

Program: SortMeRNA version 4.2.0

Turnhoutseweg 30, 2340 Beerse, Belgium

2014-2016 Knight Lab:

Department of Pediatrics, UCSD, La Jolla

2012-2014 Bonsai Bioinformatics Research Group:

LIFL, University Lille 1, CNRS UMR 8022, INRIA Nord-Europe

Disclaimer: SortMeRNA comes with ABSOLUTELY NO WARRANTY; without even the

implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

See the GNU Lesser General Public License for more details.

Contributors: Jenya Kopylova jenya.kopylov@gmail.com

Laurent Noﾃｩ laurent.noe@lifl.fr

Pierre Pericard pierre.pericard@lifl.fr

Daniel McDonald wasade@gmail.com

Mikaﾃｫl Salson mikael.salson@lifl.fr

Hﾃｩlﾃｨne Touzet helene.touzet@lifl.fr

Rob Knight robknight@ucsd.edu

Usage: sortmerna --ref FILE [--ref FILE] --reads FWD_READS [--reads REV_READS] [OPTIONS]:

-------------------------------------------------------------------------------------------------------------

| option type-format description default |

-------------------------------------------------------------------------------------------------------------

[REQUIRED]

--ref PATH Required Reference file (FASTA) absolute or relative path.

Use mutliple times, once per a reference file

--reads PATH Required Raw reads file (FASTA/FASTQ/FASTA.GZ/FASTQ.GZ).

Use twice for files with paired reads

[COMMON]

--workdir PATH Optional Directory for storing Reference index, USRDIR/sortmerna/run/

Key-value database, and the output.

Default structure:

WORKDIR/

idx/

kvdb/

out/

--kvdb PATH Optional Directory for storing Key-value database WORKDIR/kvdb

KVDB is used for storing alignement results.

--idx PATH Optional Directory for storing Reference index. WORKDIR/idx

--fastx BOOL Optional Output aligned reads into FASTA/FASTQ file

--sam BOOL Optional Output SAM alignment for aligned reads.

--SQ BOOL Optional Add SQ tags to the SAM file

--blast STRING Optional output alignments in various Blast-like formats

'0' - pairwise

'1' - tabular(Blast - m 8 format)

'1 cigar' - tabular + column for CIGAR

'1 cigar qcov' - tabular + columns for CIGAR

and query coverage

'1 cigar qcov qstrand' - tabular + columns for CIGAR,

query coverage and strand

--aligned STRING/BOOL Optional Aligned reads file prefix [dir/][pfx] WORKDIR/out/aligned

Directory and file prefix for aligned output i.e.

each output file will go into the specified directory with the given prefix.

The appropriate extension (fasta|fastq|blast|sam|etc) will be automatically added.

Both 'dir' and 'pfx' are optional.

The 'dir' can be a relative or an absolute path.

If 'dir' is not specified, the output will be created in the WORKDIR/out/

If 'pfx' is not specified, the prefix 'aligned' will be used

Examples:

-aligned $MYDIR/dir_1/dir_2/1 -> $MYDIR/dir_1/dir_2/1.fasta

-aligned dir_1/apfx -> $PWD/dir_1/apfx.fasta

-aligned dir_1/ -> $PWD/aligned.fasta

-aligned apfx -> $PWD/apfx.fasta

-aligned (no argument) -> WORKDIR/out/aligned.fasta

--other STRING/BOOL Optional Non-aligned reads file prefix [dir/][pfx] WORKDIR/out/other

Must be used with 'fastx'.

Directory and file prefix for non-aligned output i.e.

each output file will go into the specified directory with the given prefix.

The appropriate extension (fasta|fastq|blast|sam|etc) will be automatically added.

Both 'dir' and 'pfx' are optional.

The 'dir' can be a relative or an absolute path.

If 'dir' is not specified, the output will be created in the WORKDIR/out/

If 'pfx' is not specified, the prefix 'other' will be used

Examples:

-other $MYDIR/dir_1/dir_2/1 -> $MYDIR/dir_1/dir_2/1.fasta

-other dir_1/apfx -> $PWD/dir_1/apfx.fasta

-other dir_1/ -> $PWD/dir_1/other.fasta

-other apfx -> $PWD/apfx.fasta

-other (no argument) -> aligned_out/other.fasta

i.e. the same output directory as used

for aligned output

--num_alignments INT Optional Positive integer (INT >=0).

Report first INT alignments per read reaching E-value

If INT = 0, all alignments will be output

--best INT Optional Report INT best alignments per read reaching E-value 1

by searching --min_lis INT candidate alignments

If INT == 0: search All candidate alignments

If INT > 0: search INT best alignments.

The larger is the INT, the longer is the search time.

Explanation:

A read can potentially be aligned (reaching E-value threshold)

to multiple reference sequences.

The 'best' alignment is an alignment that is better

than the previously found alignments.

The very first found alignment is automatically the best alignment

until a better one is found.

--min_lis INT Optional Search all alignments having the first INT longest LIS

LIS stands for Longest Increasing Subsequence,

it is computed using seeds' positions to expand hits into

longer matches prior to Smith - Waterman alignment.

Requires option 'best'.

Mutually exclusive with option 'num_alignments'

--print_all_reads BOOL Optional Output null alignment strings for non-aligned reads False

to SAM and/or BLAST tabular files

--paired BOOL Optional Indicates a Single reads file with paired reads False

If a single reads file with paired reads is used,

and neither 'paired_in' nor 'paired_out' are specified,

use this option together with 'out2' to output

FWD and REV reads into separate files

--paired_in BOOL Optional If one of the paired-end reads is Aligned, False

put both reads into Aligned FASTA/Q file

Must be used with 'fastx'.

Mutually exclusive with 'paired_out'.

--paired_out BOOL Optional If one of the paired-end reads is Non-aligned, False

put both reads into Non-Aligned FASTA/Q file

Must be used with 'fastx'.

Mutually exclusive with 'paired_in'.

--out2 BOOL Optional Output paired reads into separate files. False

Must be used with 'fastx'.

Ignored without either of 'paired_in' |

'paired_out' | 'paired' | two 'reads'

--match INT Optional SW score (positive integer) for a match. 2

--mismatch INT Optional SW penalty (negative integer) for a mismatch. -3

--gap_open INT Optional SW penalty (positive integer) for introducing a gap. 5

--gap_ext INT Optional SW penalty (positive integer) for extending a gap. 2

-e DOUBLE Optional E-value threshold. 1

Defines the 'statistical significance' of a local alignment.

Exponentially correllates with the Minimal Alignment Score.

Higher E-values (100, 1000, ...) cause More reads

to Pass the alignment threshold

-F BOOL Optional Search only the forward strand. False

-N BOOL Optional SW penalty for ambiguous letters (N's) scored

as --mismatch

-R BOOL Optional Search only the reverse-complementary strand. False

[OTU_PICKING]

--id INT Optional %%id similarity threshold (the alignment 0.97

must still pass the E-value threshold).

--coverage INT Optional %%query coverage threshold (the alignment must 0.97

still pass the E-value threshold)

--de_novo_otu BOOL Optional FASTA/FASTQ file for reads matching database < %%id False

(set using --id) and < %%cov (set using --coverage)

(alignment must still pass the E-value threshold).

--otu_map BOOL Optional Output OTU map (input to QIIME's make_otu_table.py). False

[ADVANCED]

--passes INT,INT,INT Optional Three intervals at which to place the seed on the read L,L/2,3

(L is the seed length)

--edges INT Optional Number (or percent if INT followed by %% sign) of 4

nucleotides to add to each edge of the read

prior to SW local alignment

--num_seeds BOOL Optional Number of seeds matched before searching 2

for candidate LIS

--full_search INT Optional Search for all 0-error and 1-error seed False

matches in the index rather than stopping

after finding a 0-error match (<1%% gain in

sensitivity with up four-fold decrease in speed)

--pid BOOL Optional Add pid to output file names. False

-a INT Optional DEPRECATED in favour of '-threads'. Number of numCores

processing threads to use.

Automatically redirects to '-threads'

--threads INT Optional Number of Processing threads to use numCores

[INDEXING]

-L DOUBLE Optional Indexing: seed length. 18

-m DOUBLE Optional Indexing: the amount of memory (in Mbytes) for building 3072

the index.

-v BOOL Optional Produce verbose output when building the index True

--interval INT Optional Indexing: Positive integer: index every Nth L-mer in 1

the reference database e.g. '-interval 2'.

--max_pos INT Optional Indexing: maximum (integer) number of positions to store 1000

for each unique L-mer. If 0 all positions are stored.

[HELP]

-h BOOL Optional Print help information

--version BOOL Optional Print SortMeRNA version number

[DEVELOPER]

--dbg_put_db BOOL Optional

--cmd BOOL Optional Launch an interactive session (command prompt) False

--task INT Optional Processing Task: 4

0 - align. Only perform alignment

1 - post-processing (log writing)

2 - generate reports

3 - align and post-process

4 - all

データベース

git clone https://github.com/biocore/sortmerna.git

> ls -lh

$ ls -lh sortmerna/data/rRNA_databases/

total 149672

-rw-r--r-- 1 kazu staff 2.0K 6 16 17:25 README.txt

-rw-r--r-- 1 kazu staff 2.2M 6 16 17:25 rfam-5.8s-database-id98.fasta

-rw-r--r-- 1 kazu staff 8.1M 6 16 17:25 rfam-5s-database-id98.fasta

drwxr-xr-x 5 kazu staff 160B 6 16 17:25 scripts/

-rw-r--r-- 1 kazu staff 3.7M 6 16 17:25 silva-arc-16s-id95.fasta

-rw-r--r-- 1 kazu staff 734K 6 16 17:25 silva-arc-23s-id98.fasta

-rw-r--r-- 1 kazu staff 19M 6 16 17:25 silva-bac-16s-id90.fasta

-rw-r--r-- 1 kazu staff 12M 6 16 17:25 silva-bac-23s-id98.fasta

-rw-r--r-- 1 kazu staff 13M 6 16 17:25 silva-euk-18s-id95.fasta

-rw-r--r-- 1 kazu staff 14M 6 16 17:25 silva-euk-28s-id98.fasta

-rw-r--r-- 1 kazu staff 591K 6 16 17:25 silva_ids_acc_tax.tar.gz

（データベースにはアーキアと真核生物のrRNA配列もある。SILVAからダウンロードするとarb ファイルで扱いにくいので、最新バージョンではないがこちらの配列セットを様々な処理に使うのもあり）

ラン

１、indexファイルの準備

解析にはデータベースのrRNA (FASTA) にindexをつける必要がある。ここではSILVAの16S rRNAのindexを作成している。１はスキップ可能。

indexdb_rna --ref silva-arc-16s-id95.fasta,silva-arc-16s-id95.fasta-db

-v verbose
--ref STRING,STRING FASTA reference file, index file

fastaとinexの間は","で区切る。複数のファイルを指定することもできる（マニュアル参照）。

２、SortMeRNAの実行

indexがない場合、作成されてからランされる。

mkdir temp
sortmerna --ref silva-arc-16s-id95.fasta,silva-arc-16s-id95.fasta-db \
 --reads pair_1.fq --reads pair_2.fq \
 --aligned mapped --fastx --workdir temp --sam mapped --other unmap --out2