2018-03-11

教師なしトリミングツール UrQt

信頼性の低いヌクレオチドがあると、後の分析において偽陰性および偽陽性の数を増加させるか、またはデノボアセンブリにおいて誤ったk-merを生成し、アセンブリを複雑にして誤ったアセンブルを引き起こす可能性がある[論文より ref.4]。信頼性の低いヌクレオチドを除去し、有益なヌクレオチドのみを使用するために、ほとんどのNGSデータ分析は、分析前にQCステップから開始される。

　低品質のヌクレオチドに対処する古典的なQC戦略は、FastQC [ref.6]などのツールを使用してヌクレオチドごとの品質を視覚化した後、FASTX-Toolkit [ref.5]などのツールを使用してリードの5'末端と３’末端の任意の数のヌクレオチドを除去し、次に、特定のphredスコア以下の長さの所定の割合を持つすべてのヌクレオチドを除外する。より最近のアプローチは、低頻度の多型を除去するようにヌクレオチドを改変する（エラー訂正）。この種のアプローチは、最も頻度の高いモチーフに基づいて低頻度のモチーフを修正するために、k-mer配列を使用することが多い。しかしながら、このタイプのアプローチは高いシークエンシングカバレッジ（Quakeの場合は15倍、ALLPATHS-LGの場合は100倍）を必要とし、RNA seqのような非一様なシークエンシング実験には適用できない）。

UrQtは教師なしセグメンテーションアルゴリズムを実装し、最尤法によって各リードで最適なトリミングカットポイントを見つける。UrQtはデータ依存のパラメータを必要とせず、最新のマルチコアアーキテクチャを利用しており、分析パイプラインへの組み込みに有益なツールとなっている。

公式サイト

https://lbbe.univ-lyon1.fr/-UrQt-.html?lang=fr

マニュアル

https://lbbe.univ-lyon1.fr/Documentation-5173.html?lang=fr

インストール

git clone https://github.com/l-modolo/UrQt
cd UrQt/
make
./UrQt

$ ./UrQt

UrQt.1.0.18

Argument must be defined.

Usage: ./UrQt--in <input.fastq> --out <output.fastq>

--in input fastq file

--out output fastq file

Optional:

--inpair input fastq file for paired end data

--outpair output fastq file for paired end data, empty read in one file will be removed in both

--phred <number> [33 = Sanger (ASCII 33 to 126), 64 = Illumina 1.3 (ASCII 64 to 126), 59 = Solexa/Illumina 1.0 (ASCII 59 to 126)] (default: 33)

Trimming option:

--t <number> minimum phred score for a ``good quality'' (default: 20)

--N <character> polyN to trim (default: QC trimming)

--max_head_trim <number> maximum number of nucleotide trimmed at the head of the reads (default: read length)

--max_tail_trim <number> maximum number of nucleotide trimmed at the tail of the reads (default: read length)

--min_read_size <number> remove all reads smaller than this size after the trimming step (default: 0)

--pos <head|tail|both> (expected position of trimmed sequence in the read) (default: both)

--r no removing of empty reads (100% of bases trimmed) (default: the empty reads are removed from the output)

--min_QC_length <double> if present with --min_QC_phred the minimum percentage of base with min_QC_phred necessary to keep a read (default: without QC percentage for a length)

--min_QC_phred <int> if present with --min_QC_length, the minimum phred score on min_QC_length percent of the base necessary to keep a read (default: without QC percentage for a length)

Estimation :

--s <number> number of reads to sample to compute the fixe proportion of the 4 different nucleotides (default: proportion computed in the partitioning of each reads)

--S if present the proportion of the 4 different nucleotides is set to 1/4 (default: proportion computed in the partitioning of each reads)

Other:

--v verbose

--gz gziped output

--m <number> number of thread to use

--buffer <buffer> max number of reads in memory

パスの通ったディレクトリにコピーしておく。

ラン

シングルエンド。gzも扱える。

UrQt --in single.fastq.gz --gz --m 8 --out trimmed.fastq.gz

--in input fastq file
--outpair output fastq file for paired end data, empty read in one file
--t <number> minimum phred score for a ``good quality'' (default: 20)
--gz gziped output
--m <number> number of thread to use

ペアエンド

UrQt --in pair1.fq --inpair pair2.fq --outpair1_trimmed.fastq --outpair pair2_trimmed.fastq

before（リード数641,426、平均294bp）

f:id:kazumaxneo:20180311201237j:plain

after（リード数641,413、平均216bp） (--t 20)

f:id:kazumaxneo:20180311201321j:plain

sickleの結果と比較してみる（sickle紹介）(qualtiy ≥ 20)

after（リード数599,641、平均187.8bp）

f:id:kazumaxneo:20180311202056j:plain

引用

UrQt: an efficient software for the Unsupervised Quality trimming of NGS data.

Modolo L, Lerat E

BMC Bioinformatics. 2015 Apr 29;16:137.

2018-03-11

バクテリアをstrainレベルで検出する StrainSeeker

bacteria metagenome bed GUIツール結果の視覚化 (visualization) 生物種の推定 (taxonomic profiling) web tool 2017 PeerJ MLST abundance estimation in metagenomics data

　病原性細菌の検出には、細菌病原体を迅速に同定する必要がある。このために、通常、病原体は単離され、PCRや全ゲノム配列が行われる。分子タイピングの主な目標の1つは、病原体をクローン群に分類することである。なぜなら、同じ種の系統は宿主に対して大きく異なる効果を持つからである。よく知られている例は、大腸菌 O157:H7（Tu、He＆Zhou、2014）および大腸菌EC958（Petty et al、2014）のようないくつかの株である。分類するために、MLST（Maiden、2006）またはクローン特異的なマーカーが使用されている（Inouye et al, 2014）。KvarQ（Steiner et al, 2014）、Mykrobe（Bradleyら, 2015）およびSRST2（Inouyeら, 2014）などのWGSデータから直接に関連する突然変異および対立遺伝子を検出することができるいくつかのアプローチが開発されているが、大部分の場合、十分なカバレッジおよび特化したアレルデータベースが必要である（例えばMykrobeはMycobacterium tuberculosisおよびStaphylococcus aureusの同定にのみ使用できる）。そのようなプログラムの使用は、株の同定プロセスを複雑にする。

　クローン特異的マーカーを探す代わりに、完全なバクテリアゲノムを参照配列として使用することができる。 Kraken（Wood＆Salzberg、2014）またはCLARK（Ounit et al、2015）のようなk-mer長の配列の検出に基づくバクテリア同定プログラムは、RefSeqバクテリアゲノムデータベース全体を使用し、さらに、それぞれのリードを個別に分類するため、低カバレッジWGSサンプルも処理できる。 Sigma（Ahn、Chai＆Pan、2015）のようなアライメントベースのツールと比較して、k-merベースのプログラムは、特に実行時間を考慮すると優れていることが示されている（Lindgreen、Adair＆Gardner、2016; Peabody et al、2015 ）。 KrakenはNCBIのtaxonomy treeを使用して、リードを別々に識別し、ツリー上の各taxonへのヒットをカウントし、最も多くのヒット数を持つブランチを見つける（kraken紹介）。

　StrainSeekerは、異なる分類学レベルで共有されるk-merの数に基づいて、バクテリアの分離株をWGSデータから直接クローンまたはクレードに分類する。NCBI分類法のような既存のtaxonomyシステムに結びついていないリファレンスバクテリアゲノムと系統レベルとの間の系統学的関係を近似するためのガイドツリーを使用するので、大腸菌およびShigella sp. 間のような論争を避けるのに役立つ（ガイドツリーは、ユーザーが提供する必要がある）。

マニュアル

http://bioinfo.ut.ee/strainseeker/index.php?r=site/page&view=manual

webツール

Web Tool

インストール

cent OSに導入した。

依存

PERL
R

Builer

GenomeTester4（GlistMaker、GlistCompare、GlistQuery）

Seeker

GenomeTester4（GlistMaker、GlistCompare、GlistQuery、GDistribution）
R scripts for statistical tests

GenomeTester4のインストールは以前紹介しています（リンク）。

本体& helper scirptのダウンロード

http://bioinfo.ut.ee/strainseeker/index.php?r=site/page&view=downloadable

本体のbuilder.pl、seeker.plと解凍したhelper scirptを同じディレクトリに置いておく。

> perl builder.pl -h

$ perl builder.pl -h

Usage: builder.pl -n <NWK FILE> -d <DIR OF FASTA FILES> -o <USER DEFINED DB NAME> [OPTIONAL PARAMETERS]

Options:

-h, --help - Print this help

-v, --version - Print version of the program

-n, --newick - Guide tree in newick format (same names as fasta files without suffix .fna)

-d, --dir - Directory of fasta files (.fna)

-o, --output - User defined database name

-b, --blacklist - .list file of k-mers unwanted in database (human, plasmids etc)

-w, --word - K-mer length used in database building and later searching (default 32)

-m, --min - Minimal amout of k-mers in node to be considered as subroot (default 250)

-g, --greater - Maximum times child could have more k-mers than parent (default 250)

-t, --threads - Number of cores used

-max - Maximum number of k-mers in one list (default 100000)

> perl seeker.pl -h

$ perl seeker.pl -h

Usage: seeker.pl -d <DB DIR NAME> -i <SAMPLE.fastq> [OPTIONAL PARAMETERS]

Options:

-h, --help - Print this help

-v, --version - Print version of the program

-i, none - Input file (can be multiple, each with own flag)

-o, --output - Output file name (default StrainSeeker_output)

-d, --dir - Path to database directory

-verbose - Print out more of the working process

ラン

１、データベースの構築（300GBくらい空き容量が必要）

perl builder.pl -n refseq_guide_tree.nwk -d strain_fasta_directory -w 32 -b ss_blacklist_w32.list -o my_database

-n is the guide tree in Newick format, describing the relationships between given strains.
-d is a directory containing all the .fna files for strains used in the Newick file.
-b is the path to blacklist (must have the same k-mer length as parameter -w).
-w is the k-mer length.
-o user-defined database name.

ここでは既存のデータベースを使う。ダウンロードリンク（リンク）から32-merのDatabase - pre-built from 4,324 NCBI RefSeq strains (w32) か16-merのDatabase - pre-built from 4,324 NCBI RefSeq strains (w16)をダウンロードする。

シーケンスデータは、オーサーらが評価ペーパー（リンク）にしたがって準備したfastqを使う。このリンクからダウンロードできる。

検索する。

perl seeker.pl -i sample.fastq -d ss_db_w32 -o sample_result.txt -verbose

出力の確認（一部）。

> cat sample_result.txt

$ cat sample_result.txt

Sample:sample_result.txt

22.66893% KNOWN Streptococcus_mitis_B6

17.53777% KNOWN Streptococcus_pseudopneumoniae_IS7493

17.02426% KNOWN Streptococcus_pneumoniae_Hungary19A-6

10.73490% RELATED Lactobacillus_helveticus_strain_KLDS18701,Lactobacillus_kefiranofaciens_ZW3,Lactobacillus_helveticus_strain_CAUH18,Lactobacillus_helveticus_H10,Lactobacillus_helveticus_DPC_4571,Lactobacillus_helveticus_CNRZ32,Lactobacillus_helveticus_R0052,Lactobacillus_helveticus_strain_MB2-1,Lactobacillus_helveticus_H9

500MBまでの限定だが、web版もある。

web版(4300のbacteriaデータベース）

http://bioinfo.ut.ee/strainseeker/index.php?r=site/webtool

テスト結果

http://bioinfo.ut.ee/strainseeker/index.php?r=site/results&id=demoresults

http://bioinfo.ut.ee/strainseeker/demo/demoresults/setA1_bacarc_5feb_11min.txt

上のデータを解析すると以下のような結果が得られた。

f:id:kazumaxneo:20180311181530j:plain

引用

StrainSeeker: fast identification of bacterial strains from raw sequencing reads using user-provided guide trees.

Roosaare M1, Vaher M1, Kaplinski L1, Möls M1,2, Andreson R1, Lepamets M1, Kõressaar T1, Naaber P3,4, Kõljalg S4,5, Remm M1.

PeerJ. 2017 May 18;5:e3353.

2018-03-11

ウィルスのintegration部位を検出する Virus-Clip

virus human genome

　ウイルス感染は、様々なヒト悪性腫瘍の共通の危険因子である。例えばB型肝炎ウイルス（HBV）は、感染時にヒトゲノムに組み込まれ、発癌にかかりやすい遺伝子機能の破壊をもたらすことがある。過去には、PCRに基づきウィルスを検出していたが、制限が多かった。この障害は、近年のNGSの進歩により解決されたが依然として未解決の制限がある。たとえば、VirusSeq [論文より ref.2]は、正確なウイルスのブレークポイント位置を検出できず、ViralFusionSeq [ref.3]とVirusFinder [ref.4,5]は、複雑なインストール手順と長いランタイムを必要とする。

　Virus-Clipは、ソフトクリップされたリードからウイルスが組み込まれたブレークポイントの正確な位置を特定する。さらにannovarと連携して影響を受けたヒト遺伝子に自動的に注釈を付けることもできる。最小限のステップと簡素化された手順により、実行パフォーマンスは既存のツールと比較して大幅に向上し、高速でメモリ効率が高いとされる。Virus-Clipはトランスクリプトームデータを用いて検証され、その検出は満足できる感度および特異性を有することが確認された。

インストール

依存

SAMTools
BWA
BLAST+
ANNOVAR（ダウンロードリンク）

本体は下記からダウンロードする。wordマニュアルも同封されている。

http://web.hku.hk/~dwhho/Virus-Clip.zip

virus_clip.shを開いて各ツールの場所とデータベース場所を指定しておく必要がある。私の場合は、whichコマンドで調べ(e.g., which samtools)、以下のように設定した。データベースの場所は後で設定する。

f:id:kazumaxneo:20180311171303j:plain

ラン

テストデータとしてVirusFinder 2 で使用されているWhole genome sequencing of HBV-positive hepatocellular carcinoma (HCC) samples（リンク）を使用した。

indexの準備

bwa index HBV.fasta

humanのblastデータベースの作成

makeblastdb -in human.fa -dbtype nucl -out human

作業ディレクトリと出力ディレクトリの作成

mkdir virus_clip
mkdir virus_clip/align #read alignmentの保存場所
mkdir virus_clip/result #viral integration detection resultの保存場所

データベースの場所をvirus_clip.shに記載する。以下のようにした。

f:id:kazumaxneo:20180311171348j:plain

ペアエンドのfastq.gzをVirus-Clip/sequenceに置いてある。ペアエンドのランは以下のように実行する（シングルエンドなら最後は1）。

bash virus_clip.sh /Users/user/Downloads/Virus-Clip/sequence  /Users/user/Downloads/Virus-Clip/virus_clip/ output 2

引用

Virus-Clip: a fast and memory-efficient viral integration site detection tool at single-base resolution with annotation capability

Daniel WH Ho, Karen MF Sze, and Irene OL Ng

Oncotarget. 2015 Aug 28; 6(25): 20959–20963.

2018-03-11

FASTQ、BED、BAMを操作するNGSUtilsその４ gtfutils

GTF

４回目はgtfを操作するgtfutilsを紹介する。

インストール

公式ページ

NGSUtils

git clone git://github.com/ngsutils/ngsutils.git
cd ngsutils/
make  #依存がインストールされる（詳細はwebマニュアル参照）

#conda
mamba create -n ngsutils python=2.7 -y
conda activate ngsutils
mamba install -c bioconda ngsutils -y

$ gtfutils

Usage: gtfutils COMMAND

Commands

General

add_isoform - Appends isoform annotation from UCSC isoforms file

add_reflink - Appends isoform/name annotation from RefSeq/refLink

add_xref - Appends name annotation from UCSC Xref file

annotate - Annotates genomic positions based on a GTF model

filter - Filter annotations from a GTF file

genesize - Extract genomic/transcript sizes for genes

junctions - Build a junction library from FASTA and GTF model

query - Query a GTF file by coordinates

Conversion

fromgff - Convert a GFF to a GTF file

tobed - Convert a GFF/GTF file to BED format

Run 'gtfutils help CMD' for more information about a specific command

ngsutils 0.5.9-a7f08f5

パスを通しておく。

ラン

filter（リンク）条件でフィルタリング

#UCSCスタイル(1、2)からNCBIスタイル（chr1、chr2）に変更
gtfutils filter -to-ucsc input.gtf > output.gtf

-chr str    Remove annotations from chromosomes with 'str' in the name
    
-to-ucsc    Rename Ensembl-style chromosome names (1, 2, etc) to 
            UCSC/NCBI-style names (chr1, chr2, etc.)

genesize（リンク）サイズを抽出

gtfutils genesize input.gtf

query（リンク）指定領域内のgeneを抽出

#chr1の10000-20000に存在する遺伝子
gtfutils query input.gtf chr1:10000-20000

remove_dup（リンク）重複するアノテーションを別名に修正

gtfutils remove_dup input.gtf

fromgff（リンク）GFFをGTFに変換

gtfutils fromgff input.gff > output.gtf

tobed（リンク）GTF/GFFをBEDに変換

gtfutils robed input.gtf > output.gtf

tobed（リンク）GTF/GFFをBEDに変換

gtfutils robed input.gtf > output.gtf

add_xref（リンク）UCSCのkgXref tableを使いGTFにgene_nameを追加する

gtfutils add_xref input.gtf kgXref.txt

kgXref tableの５カラム目が使われる。

add_reflink（リンク）UCSCのrefLink tableを使いGTFにgene_nameとisoform annotationsを追加する

gtfutils add_reflink input.gtf reflink.txt

annotate（リンク）指定のテキストファイルに従ってアノテーションを追加する

junctions（リンク）splicing junctionを繋いだ配列を作る

引用

NGSUtils: a software suite for analyzing and manipulating next-generation sequencing datasets

Marcus R. Breese and Yunlong Liu

Bioinformatics. 2013 Feb 15; 29(4): 494–496.

2018-03-10

FASTQ、BED、BAMを操作するNGSUtilsその３ fastqutils

fasta/fastqの操作

３回目はfastqを操作するfastqutilsを紹介する。

インストール

公式ページ

NGSUtils - bedutils

git clone git://github.com/ngsutils/ngsutils.git
cd ngsutils/
make  #依存がインストールされる（詳細はwebマニュアル参照）

$ fastqutils

Usage: fastqutils COMMAND

Commands

General

barcode_split - Splits a FASTQ/FASTA file based on sequence barcodes

filter - Filter out reads using a number of metrics

merge - Merges paired FASTQ files into one file

names - Write out the read names

properpairs - Find properly paired reads (when fragments are filtered separately)

revcomp - Reverse compliment a FASTQ file

sort - Sorts a FASTQ file by name or sequence

split - Splits a FASTQ file into N chunks

stats - Calculate summary statistics for a FASTQ file

tag - Adds a prefix or suffix to the read names in a FASTQ file

tile - Splits long FASTQ reads into smaller (tiled) chunks

trim - Remove 5' and 3' linker sequences (slow, S/W aligned)

truncate - Truncates reads to a maximum length

unmerge - Unmerged paired FASTQ files into two (or more) files

Conversion

convertqual - Converts qual values from Illumina to Sanger scale

csencode - Converts color-space FASTQ file to encoded FASTQ

fromfasta - Converts (cs)FASTA/qual files to FASTQ format

fromqseq - Converts Illumina qseq (export/sorted) files to FASTQ

tobam - Converts to BAM format (unmapped)

tofasta - Converts to FASTA format (seq or qual)

Run 'fastqutils help CMD' for more information about a specific command

ngsutils 0.5.9-a7f08f5

パスを通しておく。

ラン

filter（リンク）条件でフィルタリング

#100bp以上のリードを抽出
fastqutils filter -size 100 input.fq > output.fq

#150bp以下になるまでリードをトリミング
fastqutils filter -truncate 150 input.fq > output.fq

#ウィンドウサイズ4でスキャンし、quality30以下をトリミング
fastqutils filter -qual 30 4 input.fq > output.fq

  -wildcard num               Discard reads w/ more than N wildcards (N or .)

  -size minsize               Discard reads that are too short

  -truncate size              Trim reads to a maximum length

  -qual minval window_size    Truncate reads (5'->3') where the quality falls
                              below a threshold (floating average over
                              window_size)

  -prefix size                Trim away [size] bases from the 5' end

  -suffixqual minval          Trim away bases from the 3' end with low quality
                              value should be given as a character (in Sanger
                              scale)(like Illumina B-trim)

  -trim seq pct mintrim       Trim away at least [mintrim] bases that match a
                              [sequence] (3' adaptor) allowing for a match
                              percentage [pct] (0.0-1.0)

  -paired                     Only keep reads that are correctly paired
                              (Requires an interleaved FASTQ file)

  -whitelist keeplist.txt     Only keep reads whose name is in the keeplist

names（リンク）リード名を出力。

fastqutils names input.fq > output

properpairs（リンク）ペアエンドでペアになっていないリードを除く。

fastqutils properpairs pair1.fq pair2.fq output_pair1.fq output_pair2.fq

revcomp（リンク）逆相補鎖（reverse complementary）に変換。

fastqutils revcomp input.fq > output.fq

sort（リンク）nameやポジションでソートする。

fastqutils sort input.fq > sort.fq

split（リンク）指定数に分割する。

#3つに分割する。
fastqutils split input.fq output 3

convertqual（リンク）イルミナのクオリティ値をsanger scaleに変換。

fastqutils convertqual input.fq > output.fq

csencode（リンク）color spaceをATGCに変換。

fastqutils csencode input.fq > output.fq

0 -> A
1 -> C
2 -> G
3 -> T
4,5,6,. -> N

fromfasta（リンク）qualとfastaからfastqを作成。

fastqutils fromfasta input.csfasta input.qual  > output.fq

fromqseq（リンク）illumina export.txtからfastqを作成。

fastqutils fromqseq export.txt

stats（リンク）statistics

fastqutils stats input.fq > summary

出力。全部位の情報も出る。

$ fastqutils stats forward.fq

Done! (0:58)

Space: basespace

Pairing: Fragmented

Quality scale: Illumina

Number of reads: 224924

Length distribution

Mean: 230.287452651

StdDev: 49.7016332922

Min: 20

25 percentile: 204

Median: 235

75 percentile: 262

Max: 301

Total: 224924

Quality distribution

pos mean stdev min 25pct 50pct 75pct max count

1 33.9262595366 0.436774185711 30 34 34 34 38 224924

2 33.8028489623 1.54960128452 10 34 34 34 38 224924

3 33.8104426384 1.54959912975 10 34 34 34 38 224924

4 33.8467082214 1.44051755323 10 34 34 34 38 224924

5 33.8715655066 1.56917725475 10 34 34 34 38 224924

6 37.6461204674 1.84245487446 10 38 38 38 38 224924

7 37.622272412 1.94525249168 10 38 38 38 38 224924

8 37.6478232647 1.86618349415 10 38 38 38 38 224924

9 37.6512333055 1.85726711398 10 38 38 38 38 224924

10 37.6626060358 1.79494590177 10 38 38 38 38 224924

11 37.6324714126 1.89571282258 9 38 38 38 38 224924

12 37.6014342622 1.99703097792 10 38 38 38 38 224924

13 37.6435818321 1.8427123548 10 38 38 38 38 224924

14 37.6336940478 1.88442047255 10 38 38 38 38 224924

15 37.6378332237 1.85404372502 9 38 38 38 38 224924

16 37.6625215628 1.78640763134 9 38 38 38 38 224924

17 37.6524514947 1.80913909303 10 38 38 38 38 224924

18 37.6461649268 1.81338343748 9 38 38 38 38 224924

merge（リンク）ペアエンドをマージ。

unmerge（リンク）マージしたペアエンドを分割。

bfq（リンク）BFQ (binary-compressed FASTQ format) ⇆ fastq変換。

tag（リンク）リード名にtagをつける。

tile（リンク）指定長でfastqをオーバーラップありで分割。

trim（リンク）リンカーを除去。

truncate（リンク）指定長までリードをトリム。

tobam（リンク）bamを作成。いまいちわからない。

tofasta（リンク）fastaを作成。

.gzも扱えます。

引用

NGSUtils: a software suite for analyzing and manipulating next-generation sequencing datasets

Marcus R. Breese and Yunlong Liu

Bioinformatics. 2013 Feb 15; 29(4): 494–496.

2018-03-10

FASTQ、BED、BAMを操作するNGSUtilsその２ bedutils

bed

２回目はbedを操作するbedutilsを紹介する。

インストール

公式ページ

NGSUtils - bedutils

git clone git://github.com/ngsutils/ngsutils.git
cd ngsutils/
make  #依存がインストールされる（詳細はwebマニュアル参照）

$ ./bedutils

Usage: bedutils COMMAND

Commands

General

clean - Cleans a BED file (score should be integers)

extend - Extends BED regions (3')

overlap - Find overlapping BED regions from a query and target file

reduce - Merges overlapping BED regions

refcount - Given a number of BED files, calculate the number of samples that overlap regions in a reference BED file

sizes - Extract the sizes of BED regions

sort - Sorts a BED file (in place)

stats - Calculates simple stats for a BED file

subtract - Subtracts one set of BED regions from another

Conversion

annotate - Annotate BED files by adding / altering columns

frombasecall - Converts a file in basecall format to BED3 format

fromprimers - Converts a list of PCR primer pairs to BED regions

fromvcf - Converts a file in VCF format to BED6

tobed3 - Removes extra columns from a BED (or BED compatible) file

tobed6 - Removes extra columns from a BED (or BED compatible) file

tobedgraph - BED to BedGraph

tofasta - Extract BED regions from a reference FASTA file

Misc

cleanbg - Cleans up a bedgraph file

Run 'bedutils help CMD' for more information about a specific command

ngsutils 0.5.9-a7f08f5

パスを通しておく。

ラン

clean（リンク）ポジションを整数に修復する。

bamutils clean input.bed > output.bed

extend（リンク）ポジションを指定数変更する。

#5'側を100bp、3'側を100bp伸ばす
bamutils extend -5 100 -3 100 input.bed > output.bed

nearest（リンク）リファレンスの最も近い位置のアノテーションを取ってくる

 bamutils nearest input.bed reference.bed

-max The maximal distance to look for a nearest region (default: 100K)
-match Only use regions in the reference that contain the name from the query file.

sizes（リンク）領域のサイズを調べる。

 bamutils sizes input.bed > output

sort（リンク）ポジションソートする。

 bamutils sort input.bed output.bed

stats（リンク）statistics

 bamutils stats input.bed

tobed3（リンク）3カラム形式のbedに変換。

 bamutils tobed3 input.bed > output.bed

tobed6（リンク）6カラム形式のbedに変換。

 bamutils tobed6 input.bed > output.bed

tofasta（リンク）bedの領域をfastaで出力。

 bamutils tofasta input.bed ref.fasta > output.fasta

overlap（リンク）オーバーラップする領域を探す

reduce（リンク）オーバーラップする領域を減らす。

refcount（リンク）オーバーラップする領域数を数える。

subtract（リンク）2つのbedを比較してオーバーラップする領域を消す。

annotate（リンク）アノテーションをつけてBED3からBED6に変換。

frombasecall（リンク）basecallフォーマットからBED3に変換。

fromprimers（リンク）primerペアからBEDで領域を作成。

tobedgraph（リンク）オーバーラップがあるbedからbedgraphを作成。

cleanbg（リンク）bedgraphを修復。

bedのオーバーラップを抽出したりするならbedtoolsを使ってください。

引用

NGSUtils: a software suite for analyzing and manipulating next-generation sequencing datasets

Marcus R. Breese and Yunlong Liu

Bioinformatics. 2013 Feb 15; 29(4): 494–496.

2018-03-10

FASTQ、BED、BAMを操作するNGSUtilsその１ bamutils

bam/sam

2020 4/17 インストール追記

NGSUtilsは、FASTQ、BED、BAM形式のファイルなどを操作するためのツール。 Mac OS XおよびLinuxで動作する。コマンドが多いので３回に分けて紹介する。１回目はbamを操作するbamutils。

インストール

公式ページ

NGSUtils - Tools for next-generation sequencing analysis

webマニュアル

http://ngsutils.org/installation/

依存

Mac OSX or Linux operating system (tested on RHEL5 and RHEL6)
Python 2.6+ (including development packages)
Make

上記に含まれていないが、cent OSに導入した（python 2.7.2でビルドした）。

git clone git://github.com/ngsutils/ngsutils.git
cd ngsutils/
make  #依存がインストールされる（詳細はwebマニュアル参照）

#bioconda (link) (失敗した)
conda create -n ngsutils -y
conda activate ngsutils
conda install -c bioconda -y ngsutils

ラン

keepbest（リンク） best mappingを抽出。

bamutils keepbest input.bam output.bam

best（リンク） best mappingのみ出力

bamutils best input.bam output.bam

export（リンク） 指定した情報を全リードから出力

bamutils export -name input.bam > output #リード名を出力

bamutils export -ref input.bam > output #mappingされたクロモソーム名を出力

-name Read name
-ref Mapped reference (chrom)
-pos Mapped position (1-based)
-strand Mapped strand (+/-)
-cigar CIGAR alignment string
-flags Mapping flags (base 10 number)
-seq Sequence
-qual Quality sequence
-mapq MAPQ score
-nextref Next mapped reference (paired-end)
-nextpos Next mapped position (paired-end)
-tlen Template length (paired-end)
-tag:tag_name Any tag

　　　For example: -tag:AS -tag:NH Outputs the alignment score and the edit distance

-tag:* Outputs all tags in SAM format

expressed（リンク） マッピングされた領域をBED形式で出力。（samtools indexでbam.baiファイルを準備しておく必要がある）

bamutils expressed input.bam > output.bed

-ns Ignore strandedness when creating regions (default: false)
-uniq Only use unique starting positions when performing counts (default: false)
-dist <N> minimum distance required between regions - they will be merged if w/in this distance (default: 10)
-mincount <N> The minimum number of reads required in a region (default: 2)

filter（リンク） 指定した条件を満たすリードだけbamで出力。

#chr1にマッピングされており、リード長が100以上でペアエンドが-> <-の向きに両方マップングされているリードだけbamで出力。
bamutils filter input.bam output.bam -minlen 100 -properpair -includeref chr1

-minlen val Remove reads that are smaller than {val}
-maxlen val Remove reads that are larger than {val}
-mapped Keep only mapped reads
-unmapped Keep only unmapped reads
-properpair Keep only properly paired reads (both mapped, correct orientation, flag set in BAM)
-noproperpair Keep only not-properly paired reads

他にも様々な条件を指定できる。

junctioncount（リンク） 指定した領域をspanしてマッピングされているリードをカウント

bamutils junctioncount input.bam chr1:5000-5500

merge（リンク） bamをマージ（best matchを選択してくる）。

bamutils merge output.bam input1.bam input2.bam

同じリードが複数ヶ所にアライメントされている場合、ベストマッチを選択してくる。

pair（リンク）別々にマッピングされているペアエンドについて、指定条件内でペアを再度調べる。

bamutils pair output.bam input1.bam input2.bam

-size <low-high> The minimum/maximum insert size to accept. By default, this will attempt to minimize the distance between reads, upto the lower-bound. Any pair over the upper bound will be discarded. Note: for RNA, because it is impossible to detect junctions that are between the reads, this should be a very big range (ex: 50-1000000) Default: 50-10000
-tag <VAL> Tag to use to determine from which file reads will be taken. (must be type :i or :f) You may have more than one of these, in which case they will be sorted in order. You can add a +/- at the end of the name to signify sort order (asc/desc). Default: AS+, NM-

pcrdup（リンク）ペアエンドのPCR duplicationを調べる（たまたま同じ位置で始まっているペアもPCR duplicationとみなす）。

bamutils pcrdup input.bam -count

peakheight（リンク）指定したbedファイルの領域のpeak positionを調べ、bedに追記する。

bamutils peakheight input.bam input.bed

maximum peak size、total unique reads、 total coverage (sum of coverage over each base)、 coverage density (total coverage / length)が追記される。

removevlipping（リンク）hard-clipされた領域を捨てる。

bamutils removeclipping input.bam output.bam

renamepair（リンク）一部アライナーはペアエンドのリード名に/1や/2をつけてbsmを作り、これが下流の解析でエラーになる。これを消す（ペアエンドは同じリード名になる）。

bamutils renamepair input.bam output.bam

split（リンク）bamを分割する。

#refereneで分割
bamutils split -ref input.bam output

#リード数100000ずつ分割
bamutils split -n 100000 input.bam output

stats（リンク）bamを分析。

bamutils stats input.bam

$ bamutils stats input.bam

Calculating Read stats...

Done! (0:36)

out.bam

Mapped: 1083286

Unmapped: 0

Flag distribution

[0x001] Multiple fragments 1083286 100.00%

[0x002] All fragments aligned 888871 82.05%

[0x008] Next unmapped 58635 5.41%

[0x010] Reverse complimented 542299 50.06%

[0x020] Next reverse complimented 539537 49.81%

[0x040] First fragment 1083286 100.00%

[0x800] Supplementary 1 0.00%

Template length: 568.90 +/- 824.04

Reference counts count

1 1083286

tag（リンク）tagをbamに追加する。

bamutils tag -xs input.bam output.bam

-xs Add the XS:A tag for +/- strandedness (req'd by Cufflinks)
-suffix suff A suffix to add to each read name
-tag tag Add an arbitrary tag (ex: -tag XX:Z:test)
-orig-ref tag Add a new tag with the original reference name (For example, in a region-based BAM will be converted to standard coordinates)

check（リンク）破損のチェック。

bamutils check input.bam

cleancigar（リンク）CIGAR情報の修復。

bamutils cleancigar  input.bam

nearest（リンク） TSSを指定したbedの指定ポジション中などからもっとも近い部位を抽出。

innerdist（リンク） mate-pairのinner distanceを計算。

inner mate-pair distanceの定義はリンク先を参照。

他にもfasta、fastq、bedに変換する抽出系コマンド、RNA seqと変異解析用コマンドもある。

引用

NGSUtils: a software suite for analyzing and manipulating next-generation sequencing datasets

Marcus R. Breese and Yunlong Liu

Bioinformatics. 2013 Feb 15; 29(4): 494–496.

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

教師なしトリミングツール UrQt

バクテリアをstrainレベルで検出する StrainSeeker

ウィルスのintegration部位を検出する Virus-Clip

FASTQ、BED、BAMを操作するNGSUtilsその４ gtfutils

FASTQ、BED、BAMを操作するNGSUtilsその３ fastqutils

FASTQ、BED、BAMを操作するNGSUtilsその２ bedutils

FASTQ、BED、BAMを操作するNGSUtilsその１ bamutils