2020 2/5 condaインストール追記
2020 6/16 コマンドが大きく変更したため更新(v2.1)
2020 12/9 unmapを出力するようにコマンドを修正, 再びhelp更新(v4.2)
次世代シーケンシング(NGS)技術を生物群集から直接抽出したRNAに適用すると、コーディング型とノンコーディング型の両方のRNAを特徴づける断片が混在している。これらを区別し、メッセンジャーRNAとリボソームRNA(rRNA)のファミリーをさらに分類することは、相互作用環境の遺伝子発現パターンや構成種の系統分類を調べる上で重要なステップである。
本研究では、メタトランスクリプトームデータからrRNA断片を迅速にフィルタリングするために設計された新しいソフトウェアSortMeRNAを紹介する。このソフトウェアは、大規模なリードセットを扱うことができ、rRNAデータベースに一致する全ての断片を高感度かつ低ランニングタイムでソートすることが可能である。
出力はfasta、fastq、アライメントのsam、またblastライクな出力も可能である。Illumina, 454, Ion Torrent and PacBioのシーケンスデータに対応している。QIIMEと一緒に使用することで、OTUを検出し系統解析にも利用することができる。
マニュアル
http://bioinfo.lifl.fr/RNA/sortmerna/code/SortMeRNA-user-manual-v2.1.pdf
FAQ
http://bioinfo.lifl.fr/sortmerna/faqs.php
インストール
conda install -c bioconda -y sortmerna
> sortmerna -h
# sortmerna -h
[process:1369] === Options processing starts ... ===
Found value: sortmerna
Found flag: -h
[process:1453] Processing option: h with value:
Program: SortMeRNA version 4.2.0
Copyright: 2016-2020 Clarity Genomics BVBA:
Turnhoutseweg 30, 2340 Beerse, Belgium
2014-2016 Knight Lab:
Department of Pediatrics, UCSD, La Jolla
2012-2014 Bonsai Bioinformatics Research Group:
LIFL, University Lille 1, CNRS UMR 8022, INRIA Nord-Europe
Disclaimer: SortMeRNA comes with ABSOLUTELY NO WARRANTY; without even the
implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
See the GNU Lesser General Public License for more details.
Contributors: Jenya Kopylova jenya.kopylov@gmail.com
Laurent Noテゥ laurent.noe@lifl.fr
Pierre Pericard pierre.pericard@lifl.fr
Daniel McDonald wasade@gmail.com
Mikaテォl Salson mikael.salson@lifl.fr
Hテゥlティne Touzet helene.touzet@lifl.fr
Rob Knight robknight@ucsd.edu
Usage: sortmerna --ref FILE [--ref FILE] --reads FWD_READS [--reads REV_READS] [OPTIONS]:
-------------------------------------------------------------------------------------------------------------
| option type-format description default |
-------------------------------------------------------------------------------------------------------------
[REQUIRED]
--ref PATH Required Reference file (FASTA) absolute or relative path.
Use mutliple times, once per a reference file
--reads PATH Required Raw reads file (FASTA/FASTQ/FASTA.GZ/FASTQ.GZ).
Use twice for files with paired reads
[COMMON]
--workdir PATH Optional Directory for storing Reference index, USRDIR/sortmerna/run/
Key-value database, and the output.
Default structure:
WORKDIR/
idx/
kvdb/
out/
--kvdb PATH Optional Directory for storing Key-value database WORKDIR/kvdb
KVDB is used for storing alignement results.
--idx PATH Optional Directory for storing Reference index. WORKDIR/idx
--fastx BOOL Optional Output aligned reads into FASTA/FASTQ file
--sam BOOL Optional Output SAM alignment for aligned reads.
--SQ BOOL Optional Add SQ tags to the SAM file
--blast STRING Optional output alignments in various Blast-like formats
'0' - pairwise
'1' - tabular(Blast - m 8 format)
'1 cigar' - tabular + column for CIGAR
'1 cigar qcov' - tabular + columns for CIGAR
and query coverage
'1 cigar qcov qstrand' - tabular + columns for CIGAR,
query coverage and strand
--aligned STRING/BOOL Optional Aligned reads file prefix [dir/][pfx] WORKDIR/out/aligned
Directory and file prefix for aligned output i.e.
each output file will go into the specified directory with the given prefix.
The appropriate extension (fasta|fastq|blast|sam|etc) will be automatically added.
Both 'dir' and 'pfx' are optional.
The 'dir' can be a relative or an absolute path.
If 'dir' is not specified, the output will be created in the WORKDIR/out/
If 'pfx' is not specified, the prefix 'aligned' will be used
Examples:
-aligned $MYDIR/dir_1/dir_2/1 -> $MYDIR/dir_1/dir_2/1.fasta
-aligned dir_1/apfx -> $PWD/dir_1/apfx.fasta
-aligned dir_1/ -> $PWD/aligned.fasta
-aligned apfx -> $PWD/apfx.fasta
-aligned (no argument) -> WORKDIR/out/aligned.fasta
--other STRING/BOOL Optional Non-aligned reads file prefix [dir/][pfx] WORKDIR/out/other
Must be used with 'fastx'.
Directory and file prefix for non-aligned output i.e.
each output file will go into the specified directory with the given prefix.
The appropriate extension (fasta|fastq|blast|sam|etc) will be automatically added.
Both 'dir' and 'pfx' are optional.
The 'dir' can be a relative or an absolute path.
If 'dir' is not specified, the output will be created in the WORKDIR/out/
If 'pfx' is not specified, the prefix 'other' will be used
Examples:
-other $MYDIR/dir_1/dir_2/1 -> $MYDIR/dir_1/dir_2/1.fasta
-other dir_1/apfx -> $PWD/dir_1/apfx.fasta
-other dir_1/ -> $PWD/dir_1/other.fasta
-other apfx -> $PWD/apfx.fasta
-other (no argument) -> aligned_out/other.fasta
i.e. the same output directory as used
for aligned output
--num_alignments INT Optional Positive integer (INT >=0).
Report first INT alignments per read reaching E-value
If INT = 0, all alignments will be output
--best INT Optional Report INT best alignments per read reaching E-value 1
by searching --min_lis INT candidate alignments
If INT == 0: search All candidate alignments
If INT > 0: search INT best alignments.
The larger is the INT, the longer is the search time.
Explanation:
A read can potentially be aligned (reaching E-value threshold)
to multiple reference sequences.
The 'best' alignment is an alignment that is better
than the previously found alignments.
The very first found alignment is automatically the best alignment
until a better one is found.
--min_lis INT Optional Search all alignments having the first INT longest LIS
LIS stands for Longest Increasing Subsequence,
it is computed using seeds' positions to expand hits into
longer matches prior to Smith - Waterman alignment.
Requires option 'best'.
Mutually exclusive with option 'num_alignments'
--print_all_reads BOOL Optional Output null alignment strings for non-aligned reads False
to SAM and/or BLAST tabular files
--paired BOOL Optional Indicates a Single reads file with paired reads False
If a single reads file with paired reads is used,
and neither 'paired_in' nor 'paired_out' are specified,
use this option together with 'out2' to output
FWD and REV reads into separate files
--paired_in BOOL Optional If one of the paired-end reads is Aligned, False
put both reads into Aligned FASTA/Q file
Must be used with 'fastx'.
Mutually exclusive with 'paired_out'.
--paired_out BOOL Optional If one of the paired-end reads is Non-aligned, False
put both reads into Non-Aligned FASTA/Q file
Must be used with 'fastx'.
Mutually exclusive with 'paired_in'.
--out2 BOOL Optional Output paired reads into separate files. False
Must be used with 'fastx'.
Ignored without either of 'paired_in' |
'paired_out' | 'paired' | two 'reads'
--match INT Optional SW score (positive integer) for a match. 2
--mismatch INT Optional SW penalty (negative integer) for a mismatch. -3
--gap_open INT Optional SW penalty (positive integer) for introducing a gap. 5
--gap_ext INT Optional SW penalty (positive integer) for extending a gap. 2
-e DOUBLE Optional E-value threshold. 1
Defines the 'statistical significance' of a local alignment.
Exponentially correllates with the Minimal Alignment Score.
Higher E-values (100, 1000, ...) cause More reads
to Pass the alignment threshold
-F BOOL Optional Search only the forward strand. False
-N BOOL Optional SW penalty for ambiguous letters (N's) scored
as --mismatch
-R BOOL Optional Search only the reverse-complementary strand. False
[OTU_PICKING]
--id INT Optional %%id similarity threshold (the alignment 0.97
must still pass the E-value threshold).
--coverage INT Optional %%query coverage threshold (the alignment must 0.97
still pass the E-value threshold)
--de_novo_otu BOOL Optional FASTA/FASTQ file for reads matching database < %%id False
(set using --id) and < %%cov (set using --coverage)
(alignment must still pass the E-value threshold).
--otu_map BOOL Optional Output OTU map (input to QIIME's make_otu_table.py). False
[ADVANCED]
--passes INT,INT,INT Optional Three intervals at which to place the seed on the read L,L/2,3
(L is the seed length)
--edges INT Optional Number (or percent if INT followed by %% sign) of 4
nucleotides to add to each edge of the read
prior to SW local alignment
--num_seeds BOOL Optional Number of seeds matched before searching 2
for candidate LIS
--full_search INT Optional Search for all 0-error and 1-error seed False
matches in the index rather than stopping
after finding a 0-error match (<1%% gain in
sensitivity with up four-fold decrease in speed)
--pid BOOL Optional Add pid to output file names. False
-a INT Optional DEPRECATED in favour of '-threads'. Number of numCores
processing threads to use.
Automatically redirects to '-threads'
--threads INT Optional Number of Processing threads to use numCores
[INDEXING]
-L DOUBLE Optional Indexing: seed length. 18
-m DOUBLE Optional Indexing: the amount of memory (in Mbytes) for building 3072
the index.
-v BOOL Optional Produce verbose output when building the index True
--interval INT Optional Indexing: Positive integer: index every Nth L-mer in 1
the reference database e.g. '-interval 2'.
--max_pos INT Optional Indexing: maximum (integer) number of positions to store 1000
for each unique L-mer. If 0 all positions are stored.
[HELP]
-h BOOL Optional Print help information
--version BOOL Optional Print SortMeRNA version number
[DEVELOPER]
--dbg_put_db BOOL Optional
--cmd BOOL Optional Launch an interactive session (command prompt) False
--task INT Optional Processing Task: 4
0 - align. Only perform alignment
1 - post-processing (log writing)
2 - generate reports
3 - align and post-process
4 - all
データベース
git clone https://github.com/biocore/sortmerna.git
> ls -lh
$ ls -lh sortmerna/data/rRNA_databases/
total 149672
-rw-r--r-- 1 kazu staff 2.0K 6 16 17:25 README.txt
-rw-r--r-- 1 kazu staff 2.2M 6 16 17:25 rfam-5.8s-database-id98.fasta
-rw-r--r-- 1 kazu staff 8.1M 6 16 17:25 rfam-5s-database-id98.fasta
drwxr-xr-x 5 kazu staff 160B 6 16 17:25 scripts/
-rw-r--r-- 1 kazu staff 3.7M 6 16 17:25 silva-arc-16s-id95.fasta
-rw-r--r-- 1 kazu staff 734K 6 16 17:25 silva-arc-23s-id98.fasta
-rw-r--r-- 1 kazu staff 19M 6 16 17:25 silva-bac-16s-id90.fasta
-rw-r--r-- 1 kazu staff 12M 6 16 17:25 silva-bac-23s-id98.fasta
-rw-r--r-- 1 kazu staff 13M 6 16 17:25 silva-euk-18s-id95.fasta
-rw-r--r-- 1 kazu staff 14M 6 16 17:25 silva-euk-28s-id98.fasta
-rw-r--r-- 1 kazu staff 591K 6 16 17:25 silva_ids_acc_tax.tar.gz
(データベースにはアーキアと真核生物のrRNA配列もある。SILVAからダウンロードするとarb ファイルで扱いにくいので、最新バージョンではないがこちらの配列セットを様々な処理に使うのもあり)
ラン
1、indexファイルの準備
解析にはデータベースのrRNA (FASTA) にindexをつける必要がある。ここではSILVAの16S rRNAのindexを作成している。1はスキップ可能。
indexdb_rna --ref silva-arc-16s-id95.fasta,silva-arc-16s-id95.fasta-db
- -v verbose
- --ref STRING,STRING FASTA reference file, index file
fastaとinexの間は","で区切る。複数のファイルを指定することもできる(マニュアル参照)。
2、SortMeRNAの実行
indexがない場合、作成されてからランされる。
mkdir temp
sortmerna --ref silva-arc-16s-id95.fasta,silva-arc-16s-id95.fasta-db \
--reads pair_1.fq --reads pair_2.fq \
--aligned mapped --fastx --workdir temp --sam mapped --other unmap --out2
- --fastx output FASTA/FASTQ file
- --aligned STRING aligned reads filepath + base file name
mapped.fq(インターレース)とunmap_fwd.fq, unmap_rev.fqが出力される。ペアの順番は同期されていないので注意すること。
--fastx、--otherのほかに、--samや--blastがある。詳細はhelpから確認してください。
引用
SortMeRNA: fast and accurate filtering of ribosomal RNAs in metatranscriptomic data.
Kopylova E, Noé L, Touzet H.
Bioinformatics. 2012 Dec 15;28(24):3211-7.