macでインフォマティクス

Informatics notes on HTS (NGS) data analysis.

AfterQC: automatic QC, error correction, trimming, and report generation

 

AfterQC is a tool that performs fastq filtering, trimming, error correction, and quality control fully automatically. Error correction is carried out by comparing the base qualities of overlapping paired-end reads. The paper was published in 2017.
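The overlap-based correction idea can be sketched in a few lines of Python (a simplified illustration that assumes a full-length overlap; `correct_overlap` and `revcomp` are hypothetical names, not AfterQC's actual code):

```python
# Simplified sketch of overlap-based error correction for a read pair.
# Hypothetical illustration: the real algorithm also locates the overlap
# and handles partial overlaps and edge cases.

def revcomp(seq):
    comp = {"A": "T", "T": "A", "G": "C", "C": "G", "N": "N"}
    return "".join(comp[b] for b in reversed(seq))

def correct_overlap(r1, q1, r2, q2):
    """Compare R1 against the reverse complement of R2 over a full overlap
    and resolve each mismatch by keeping the higher-quality base."""
    r2rc = revcomp(r2)
    q2rc = q2[::-1]
    fixed1, fixed2rc = list(r1), list(r2rc)
    for i, (b1, b2) in enumerate(zip(r1, r2rc)):
        if b1 != b2:
            if q1[i] >= q2rc[i]:
                fixed2rc[i] = b1  # trust the read1 base
            else:
                fixed1[i] = b2    # trust the read2 base
    return "".join(fixed1), revcomp("".join(fixed2rc))
```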

 

Installation

Github

git clone https://github.com/OpenGene/AfterQC.git

$ after.py -h

 

Usage: Automatic Filtering, Trimming, Error Removing and Quality Control for Illumina fastq data 

 

Simplest usage:

cd to the folder containing your fastq data, run <python after.py>

 

Options:

  --version             show program's version number and exit

  -h, --help            show this help message and exit

  -1 READ1_FILE, --read1_file=READ1_FILE

                        file name of read1, required. If input_dir is

                        specified, then this arg is ignored.

  -2 READ2_FILE, --read2_file=READ2_FILE

                        file name of read2, if paired. If input_dir is

                        specified, then this arg is ignored.

  -7 INDEX1_FILE, --index1_file=INDEX1_FILE

                        file name of 7' index. If input_dir is specified, then

                        this arg is ignored.

  -5 INDEX2_FILE, --index2_file=INDEX2_FILE

                        file name of 5' index. If input_dir is specified, then

                        this arg is ignored.

  -d INPUT_DIR, --input_dir=INPUT_DIR

                        the input dir to process automatically. If read1_file

                        are input_dir are not specified, then current dir (.)

                        is specified to input_dir

  -g GOOD_OUTPUT_FOLDER, --good_output_folder=GOOD_OUTPUT_FOLDER

                        the folder to store good reads, by default it is named

                        'good', in the current directory

  -b BAD_OUTPUT_FOLDER, --bad_output_folder=BAD_OUTPUT_FOLDER

                        the folder to store bad reads, by default it is named

                        'bad', in the same folder as good_output_folder

  -r REPORT_OUTPUT_FOLDER, --report_output_folder=REPORT_OUTPUT_FOLDER

                        the folder to store QC reports, by default it is named

                        'QC', in the same folder as good_output_folder

  --read1_flag=READ1_FLAG

                        specify the name flag of read1, default is R1, which

                        means a file with name *R1* is read1 file

  --read2_flag=READ2_FLAG

                        specify the name flag of read2, default is R2, which

                        means a file with name *R2* is read2 file

  --index1_flag=INDEX1_FLAG

                        specify the name flag of index1, default is I1, which

                        means a file with name *I1* is index2 file

  --index2_flag=INDEX2_FLAG

                        specify the name flag of index2, default is I2, which

                        means a file with name *I2* is index2 file

  -f TRIM_FRONT, --trim_front=TRIM_FRONT

                        number of bases to be trimmed in the head of read. -1

                        means auto detect

  -t TRIM_TAIL, --trim_tail=TRIM_TAIL

                        number of bases to be trimmed in the tail of read. -1

                        means auto detect

  --trim_pair_same=TRIM_PAIR_SAME

                        use same trimming configuration for read1 and read2 to

                        keep their sequence length identical, default is true

  -q QUALIFIED_QUALITY_PHRED, --qualified_quality_phred=QUALIFIED_QUALITY_PHRED

                        the quality value that a base is qualifyed. Default 15

                        means phred base quality >=Q15 is qualified.

  -u UNQUALIFIED_BASE_LIMIT, --unqualified_base_limit=UNQUALIFIED_BASE_LIMIT

                        if exists more than unqualified_base_limit bases that

                        quality is lower than qualified quality, then this

                        read/pair is bad. Default is 60

  -p POLY_SIZE_LIMIT, --poly_size_limit=POLY_SIZE_LIMIT

                        if exists one polyX(polyG means GGGGGGGGG...), and its

                        length is >= poly_size_limit, then this read/pair is

                        bad. Default is 35

  -a ALLOW_MISMATCH_IN_POLY, --allow_mismatch_in_poly=ALLOW_MISMATCH_IN_POLY

                        the count of allowed mismatches when detection polyX.

                        Default 2 means allow 2 mismatches for polyX detection

  -n N_BASE_LIMIT, --n_base_limit=N_BASE_LIMIT

                        if exists more than maxn bases have N, then this

                        read/pair is bad. Default is 5

  -s SEQ_LEN_REQ, --seq_len_req=SEQ_LEN_REQ

                        if the trimmed read is shorter than seq_len_req, then

                        this read/pair is bad. Default is 35

  --debubble            specify whether apply debubble algorithm to remove the

                        reads in the bubbles. Default is False

  --debubble_dir=DEBUBBLE_DIR

                        specify the folder to store output of debubble

                        algorithm, default is debubble

  --draw=DRAW           specify whether draw the pictures or not, when use

                        debubble or QC. Default is on

  --barcode=BARCODE     specify whether deal with barcode sequencing files,

                        default is on, which means all files with barcode_flag

                        in filename will be treated as barcode sequencing

                        files

  --barcode_length=BARCODE_LENGTH

                        specify the designed length of barcode

  --barcode_flag=BARCODE_FLAG

                        specify the name flag of a barcoded file, default is

                        barcode, which means a file with name *barcode* is a

                        barcoded file

  --barcode_verify=BARCODE_VERIFY

                        specify the verify sequence of a barcode which is

                        adjunct to the barcode

  --store_overlap=STORE_OVERLAP

                        specify whether store only overlapped bases of the

                        good reads

  --overlap_output_folder=OVERLAP_OUTPUT_FOLDER

                        the folder to store only overlapped bases of the good

                        reads

  --qc_only             if qconly is true, only QC result will be output, this

                        can be much fast

  --qc_sample=QC_SAMPLE

                        sample up to qc_sample reads when do QC, 0 means

                        sample all reads. Default is 200,000

  --qc_kmer=QC_KMER     specify the kmer length for KMER statistics for QC,

                        default is 8

  --no_correction       disable base correction for mismatched base pairs in

                        overlapped areas

  --mask_mismatch       set the qual num to 0 for mismatched base pairs in

                        overlapped areas to mask them out

  --no_overlap          disable overlap analysis (usually much faster with

                        this option)

——

 

Run

Single-end

python after.py -1 R1.fq

Paired-end

python after.py -1 R1.fq -2 R2.fq

Specify a directory of fastq files (all fastq files in it are analyzed).

python after.py -d input/

 

Almost no log output appears during the run, which can be unsettling, but a pair of 200 MB fastq files takes about 10 minutes, so leave it alone and wait. When the run finishes, three directories are created in the working directory.


The run above processed a folder containing two sets of paired-end fastq files. bad/ holds the reads that failed filtering, while good/ holds the trimmed reads that met the minimum length requirement. To keep the mates in sync, good/ contains only pairs in which both reads passed.
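The pair-synchronized output can be sketched as follows (an illustrative Python sketch; the pass criterion here is just minimum read length, whereas AfterQC combines several tests):

```python
def split_pairs(pairs, passes):
    """Keep a pair in 'good' only if BOTH mates pass; otherwise the whole
    pair goes to 'bad', so the R1 and R2 output files stay in sync."""
    good, bad = [], []
    for r1, r2 in pairs:
        if passes(r1) and passes(r2):
            good.append((r1, r2))
        else:
            bad.append((r1, r2))
    return good, bad

# Example criterion: minimum length 35 bp (cf. -s/--seq_len_req).
min_len = lambda read: len(read) >= 35
```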

 

Trim at Q>30 and discard reads that become 25 bp or shorter.

python after.py -1 R1.fq -2 R2.fq -q 30 -s 25

Filtering options

  • -f <int> number of bases to be trimmed in the head of read. -1 means auto detect.
  • -t <int> number of bases to be trimmed in the tail of read. -1 means auto detect.
  • -q <int> the quality value that a base is qualifyed. Default 20 means base quality >=Q20 is qualified.
  • -u <int> if exists more than unqualified_base_limit bases that quality is lower than qualified quality, then this read/pair is bad. Default 0 means do not filter reads by low quality base count.
  • -p <int> POLY_SIZE_LIMIT if exists one polyX(polyG means GGGGGGGGG...), and its length is >= poly_size_limit, then this read/pair is bad. Default is 35.
  • -a <int> the count of allowed mismatches when evaluating poly_X. Default 5 means disallow any mismatches.
  • -n <int> if exists more than maxn bases have N, then this read/pair is bad. Default is 5.
  • -s <int> if the trimmed read is shorter than seq_len_req, then this read/pair is bad. Default is 35.
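How the -q/-u, -n, and -s thresholds combine can be approximated with a single predicate (a sketch based on the option descriptions above, not AfterQC's source; `read_is_bad` is a hypothetical name):

```python
def read_is_bad(seq, quals, qualified_q=15, unqualified_limit=60,
                n_limit=5, min_len=35):
    """Return True if the read fails any of the documented filters:
    too short after trimming (-s), too many N bases (-n), or too many
    bases below the qualified quality (-q / -u)."""
    if len(seq) < min_len:
        return True
    if seq.count("N") > n_limit:
        return True
    low = sum(1 for q in quals if q < qualified_q)
    return low > unqualified_limit
```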

Debubble options (not recommended for ordinary sequencing data)

  • --debubble enable debubble algorithm to remove the reads in the bubbles. Default is False

Barcode options

  • --barcode specify whether deal with barcode sequencing files, default is on

Quality-check options

  • --qc_only enable this option, only QC result will be output, this can be much faster
  • --qc_kmer <int> specify the kmer length for KMER statistics for QC, default is 8

 

The QC/ directory contains the analysis results as HTML reports. They are rendered with HTML5 and can be edited.


 

 

AfterQC behaves like a combination of tools such as FastQC, BBTools, and scythe. Because it is somewhat slow, a paper on its successor, fastp, has also been published.

   

Citation

AfterQC: automatic filtering, trimming, error removing and quality control for fastq data.

Shifu Chen, Tanxiao Huang, Yanqing Zhou, Yue Han, Mingyan Xu and Jia Gu.

BMC Bioinformatics 2017 18(Suppl 3):80

 

VCFtools: managing and editing VCF files

2019 4/16: conda installation added

2019 12/9: fixed an error in the build procedure

2020 1/5: added a description of merge

2020 4/18: added basic commands

2020 10/13: addendum

2020 10/14: clarified a confusing explanation

2021 2/17: added docker link

2021 5/16: changed "mutation" to "variant"

2023/09/29: fixed the vcffilter example

 

VCFtools is a tool for merging, sorting, and filtering VCF (variant call format) files, extracting variants unique to a sample, and more. Only the commands you are likely to use often are introduced here.

  

 

Website

Manual

https://vcftools.github.io/perl_module.html

 

Installation

Github

#linux
git clone https://github.com/vcftools/vcftools.git
cd vcftools
./autogen.sh
./configure --prefix=/usr/local/
make -j
make install

#Bioconda (link) mac, linux. Note: the conda package's commands differ, so build from the git clone as shown above.
conda install -c bioconda -y vcftools

#2021 2/17 dockerhub (link)
docker pull biocontainers/vcftools:v0.1.16-1-deb_cv1

Manual

man vcftools

vcftools(man)                    05 January 2016                    vcftools(man)

 

NAME

       vcftools v0.1.14 - Utilities for the variant call format (VCF) and binary variant call format (BCF)

 

SYNOPSIS

       vcftools [ --vcf FILE | --gzvcf FILE | --bcf FILE] [ --out OUTPUT PREFIX ] [ FILTERING OPTIONS ]  [ OUTPUT OPTIONS ]

 

DESCRIPTION

       vcftools is a suite of functions for use on genetic variation data in the form of VCF and BCF files. The tools provided will be used mainly to summarize data, run calculations on data, filter out data, and convert data into other useful file formats.

 

EXAMPLES

       Output allele frequency for all sites in the input vcf file from chromosome 1

         vcftools --gzvcf input_file.vcf.gz --freq --chr 1 --out chr1_analysis

 

       Output a new vcf file from the input vcf file that removes any indel sites

         vcftools --vcf input_file.vcf --remove-indels --recode --recode-INFO-all --out SNPs_only

 

       Output file comparing the sites in two vcf files

         vcftools --gzvcf input_file1.vcf.gz --gzdiff input_file2.vcf.gz --diff-site --out in1_v_in2

 

       Output a new vcf file to standard out without any sites that have a filter tag, then compress it with gzip

         vcftools --gzvcf input_file.vcf.gz --remove-filtered-all --recode --stdout | gzip -c > output_PASS_only.vcf.gz

 

       Output a Hardy-Weinberg p-value for every site in the bcf file that does not have any missing genotypes

         vcftools --bcf input_file.bcf --hardy --max-missing 1.0 --out output_noMissing

 

       Output nucleotide diversity at a list of positions

         zcat input_file.vcf.gz | vcftools --vcf - --site-pi --positions SNP_list.txt --out nucleotide_diversity

 

BASIC OPTIONS

       These options are used to specify the input and output files.

 

   INPUT FILE OPTIONS

         --vcf <input_filename>

           This  option defines the VCF file to be processed. VCFtools expects files in VCF format v4.0, v4.1 or v4.2. The latter two are supported with some small limitations. If the user provides a dash character '-' as a file name, the program expects a VCF file to

           be piped in through standard in.

 

         --gzvcf <input_filename>

           This option can be used in place of the --vcf option to read compressed (gzipped) VCF files directly.

 

         --bcf <input_filename>

           This option can be used in place of the --vcf option to read BCF2 files directly. You do not need to specify if this file is compressed with BGZF encoding. If the user provides a dash character '-' as a file name, the program expects a BCF2 file to be piped

           in through standard in.

 

   OUTPUT FILE OPTIONS

         --out <output_prefix>

           This option defines the output filename prefix for all files generated by vcftools. For example, if <prefix> is set to output_filename, then all output files will be of the form output_filename.*** . If this option is omitted, all output files will have the

           prefix "out." in the current working directory.

 

         --stdout

         -c

           These options direct the vcftools output to standard out so it can be piped into another program or written directly to a filename of choice. However, a select few output functions cannot be written to standard out.

 

         --temp <temporary_directory>

           This option can be used to redirect any temporary files that vcftools creates into a specified directory.

 

SITE FILTERING OPTIONS

       These options are used to include or exclude certain sites from any analysis being performed by the program.

 

   POSITION FILTERING

         --chr <chromosome>

         --not-chr <chromosome>

           Includes or excludes sites with indentifiers matching <chromosome>. These options may be used multiple times to include or exclude more than one chromosome.

 

         --from-bp <integer>

         --to-bp <integer>

           These options specify a lower bound and upper bound for a range of sites to be processed. Sites with positions less than or greater than these values will be excluded. These options can only be used in conjunction with a single usage of --chr. Using one  of

           these does not require use of the other.

 

         --positions <filename>

         --exclude-positions <filename>

           Include or exclude a set of sites on the basis of a list of positions in a file. Each line of the input file should contain a (tab-separated) chromosome and position. The file can have comment lines that start with a "#", they will be ignored.

 

         --positions-overlap <filename>

         --exclude-positions-overlap <filename>

           Include  or  exclude  a set of sites on the basis of the reference allele overlapping with a list of positions in a file. Each line of the input file should contain a (tab-separated) chromosome and position. The file can have comment lines that start with a

           "#", they will be ignored.

 

         --bed <filename>

         --exclude-bed <filename>

           Include or exclude a set of sites on the basis of a BED file. Only the first three columns (chrom, chromStart and chromEnd) are required. The BED file is expected to have a header line. A site will be kept or excluded if any part of any allele (REF or  ALT)

           at a site is within the range of one of the BED entries.

 

         --thin <integer>

           Thin sites so that no two sites are within the specified distance from one another.

 

         --mask <filename>

         --invert-mask <filename>

         --mask-min <integer>

           These options are used to specify a FASTA-like mask file to filter with. The mask file contains a sequence of integer digits (between 0 and 9) for each position on a chromosome that specify if a site at that position should be filtered or not.

           An example mask file would look like:

             >1

             0000011111222...

             >2

             2222211111000...

           In this example, sites in the VCF file located within the first 5 bases of the start of chromosome 1 would be kept, whereas sites at position 6 onwards would be filtered out. And sites after the 11th position on chromosome 2 would be filtered out as well.

           The "--invert-mask" option takes the same format mask file as the "--mask" option, however it inverts the mask file before filtering with it.

           And the "--mask-min" option specifies a threshold mask value between 0 and 9 to filter positions by. The default threshold is 0, meaning only sites with that value or lower will be kept.
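The mask semantics can be sketched in Python (assuming, per the description above, that a site is kept when its mask digit is less than or equal to the --mask-min threshold; `kept_positions` is a hypothetical helper):

```python
def kept_positions(mask_digits, positions, mask_min=0):
    """Keep 1-based positions whose mask digit is <= mask_min.
    mask_digits is the digit string following a '>chrom' header."""
    return [p for p in positions
            if p <= len(mask_digits) and int(mask_digits[p - 1]) <= mask_min]

# Mask for chromosome 1 from the example above (first 5 bases are 0).
chr1_mask = "00000111112222"
```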

 

   SITE ID FILTERING

         --snp <string>

           Include SNP(s) with matching ID (e.g. a dbSNP rsID). This command can be used multiple times in order to include more than one SNP.

 

         --snps <filename>

         --exclude <filename>

           Include or exclude a list of SNPs given in a file. The file should contain a list of SNP IDs (e.g. dbSNP rsIDs), with one ID per line. No header line is expected.

 

   VARIANT TYPE FILTERING

         --keep-only-indels

         --remove-indels

           Include or exclude sites that contain an indel. For these options "indel" means any variant that alters the length of the REF allele.

 

   FILTER FLAG FILTERING

         --remove-filtered-all

           Removes all sites with a FILTER flag other than PASS.

 

         --keep-filtered <string>

         --remove-filtered <string>

           Includes or excludes all sites marked with a specific FILTER flag. These options may be used more than once to specify multiple FILTER flags.

 

   INFO FIELD FILTERING

         --keep-INFO <string>

         --remove-INFO <string>

           Includes or excludes all sites with a specific INFO flag. These options only filter on the presence of the flag and not its value. These options can be used multiple times to specify multiple INFO flags.

 

   ALLELE FILTERING

         --maf <float>

         --max-maf <float>

           Include  only  sites  with  a Minor Allele Frequency greater than or equal to the "--maf" value and less than or equal to the "--max-maf" value. One of these options may be used without the other. Allele frequency is defined as the number of times an allele

           appears over all individuals at that site, divided by the total number of non-missing alleles at that site.
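The allele-frequency definition above can be sketched as follows (an illustration only; `minor_allele_freq` is a hypothetical helper operating on a flat list of allele calls, with '.' marking a missing allele):

```python
from collections import Counter

def minor_allele_freq(genotype_alleles):
    """Frequency of the second most common allele, counting only
    non-missing alleles, per the definition above."""
    calls = [a for a in genotype_alleles if a != "."]
    counts = Counter(calls).most_common()
    if len(counts) < 2:
        return 0.0  # monomorphic site: no minor allele observed
    return counts[1][1] / len(calls)
```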

 

         --non-ref-af <float>

         --max-non-ref-af <float>

         --non-ref-ac <integer>

         --max-non-ref-ac <integer>

         --non-ref-af-any <float>

         --max-non-ref-af-any <float>

         --non-ref-ac-any <integer>

         --max-non-ref-ac-any <integer>

           Include only sites with all Non-Reference (ALT) Allele Frequencies (af) or Counts (ac) within the range specified, and including the specified value. The default options require all alleles to meet the specified criteria, whereas the options  appended  with

           "any" require only one allele to meet the criteria. The Allele frequency is defined as the number of times an allele appears over all individuals at that site, divided by the total number of non-missing alleles at that site.

 

         --mac <integer>

         --max-mac <integer>

           Include  only  sites  with Minor Allele Count greater than or equal to the "--mac" value and less than or equal to the "--max-mac" value. One of these options may be used without the other. Allele count is simply the number of times that allele appears over

           all individuals at that site.

 

         --min-alleles <integer>

         --max-alleles <integer>

           Include only sites with a number of alleles greater than or equal to the "--min-alleles" value and less than or equal to the "--max-alleles" value. One of these options may be used without the other.

           For example, to include only bi-allelic sites, one could use:

             vcftools --vcf file1.vcf --min-alleles 2 --max-alleles 2

 

   GENOTYPE VALUE FILTERING

         --min-meanDP <float>

         --max-meanDP <float>

           Includes only sites with mean depth values (over all included individuals) greater than or equal to the "--min-meanDP" value and less than or equal to the "--max-meanDP" value. One of these options may be used without the other. These options  require  that

           the "DP" FORMAT tag is included for each site.

 

         --hwe <float>

           Assesses sites for Hardy-Weinberg Equilibrium using an exact test, as defined by Wigginton, Cutler and Abecasis (2005). Sites with a p-value below the threshold defined by this option are taken to be out of HWE, and therefore excluded.

 

         --max-missing <float>

           Exclude sites on the basis of the proportion of missing data (defined to be between 0 and 1, where 0 allows sites that are completely missing and 1 indicates no missing data allowed).
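The --max-missing semantics (0 allows fully missing sites, 1 demands complete data) can be sketched as a call-rate threshold (a hypothetical helper illustrating the documented behaviour, not vcftools' source):

```python
def site_passes_max_missing(genotypes, max_missing):
    """Keep a site if its call rate (fraction of non-missing genotypes)
    is at least max_missing: 0 keeps everything, 1 allows no missing data."""
    called = sum(1 for g in genotypes if g not in ("./.", "."))
    return called / len(genotypes) >= max_missing
```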

 

         --max-missing-count <integer>

           Exclude sites with more than this number of missing genotypes over all individuals.

 

         --phased

           Excludes all sites that contain unphased genotypes.

 

   MISCELLANEOUS FILTERING

         --minQ <float>

           Includes only sites with Quality value above this threshold.

 

INDIVIDUAL FILTERING OPTIONS

       These options are used to include or exclude certain individuals from any analysis being performed by the program.

         --indv <string>

         --remove-indv <string>

           Specify an individual to be kept or removed from the analysis. This option can be used multiple times to specify multiple individuals. If both options are specified, then the "--indv" option is executed before the "--remove-indv option".

 

         --keep <filename>

         --remove <filename>

           Provide  files  containing a list of individuals to either include or exclude in subsequent analysis. Each individual ID (as defined in the VCF headerline) should be included on a separate line. If both options are used, then the "--keep" option is executed

           before the "--remove" option. When multiple files are provided, the union of individuals from all keep files subtracted by the union of individuals from all remove files are kept. No header line is expected.

 

         --max-indv <integer>

           Randomly thins individuals so that only the specified number are retained.

 

GENOTYPE FILTERING OPTIONS

       These options are used to exclude genotypes from any analysis being performed by the program. If excluded, these values will be treated as missing.

         --remove-filtered-geno-all

           Excludes all genotypes with a FILTER flag not equal to "." (a missing value) or PASS.

 

         --remove-filtered-geno <string>

           Excludes genotypes with a specific FILTER flag.

 

         --minGQ <float>

           Exclude all genotypes with a quality below the threshold specified. This option requires that the "GQ" FORMAT tag is specified for all sites.

 

         --minDP <float>

         --maxDP <float>

           Includes only genotypes greater than or equal to the "--minDP" value and less than or equal to the "--maxDP" value. This option requires that the "DP" FORMAT tag is specified for all sites.

 

OUTPUT OPTIONS

       These options specify which analyses or conversions to perform on the data that passed through all specified filters.

 

   OUTPUT ALLELE STATISTICS

         --freq

         --freq2

           Outputs the allele frequency for each site in a file with the suffix ".frq". The second option is used to suppress output of any information about the alleles.

 

         --counts

         --counts2

           Outputs the raw allele counts for each site in a file with the suffix ".frq.count". The second option is used to suppress output of any information about the alleles.

 

         --derived

           For use with the previous four frequency and count options only. Re-orders the output file columns so that the ancestral allele appears first. This option relies on the ancestral allele being specified in the VCF file using the AA tag in the INFO field.

 

   OUTPUT DEPTH STATISTICS

         --depth

           Generates a file containing the mean depth per individual. This file has the suffix ".idepth".

 

         --site-depth

           Generates a file containing the depth per site summed across all individuals. This output file has the suffix ".ldepth".

 

         --site-mean-depth

           Generates a file containing the mean depth per site averaged across all individuals. This output file has the suffix ".ldepth.mean".

 

         --geno-depth

           Generates a (possibly very large) file containing the depth for each genotype in the VCF file. Missing entries are given the value -1. The file has the suffix ".gdepth".

 

   OUTPUT LD STATISTICS

         --hap-r2

           Outputs a file reporting the r2, D, and D' statistics using phased haplotypes. These are the traditional measures of LD often reported in the population genetics literature. The output file has the suffix ".hap.ld". This option assumes that  the  VCF  input

           file has phased haplotypes.

 

         --geno-r2

           Calculates  the squared correlation coefficient between genotypes encoded as 0, 1 and 2 to represent the number of non-reference alleles in each individual. This is the same as the LD measure reported by PLINK. The D and D' statistics are only available for

           phased genotypes. The output file has the suffix ".geno.ld".

 

         --geno-chisq

           If your data contains sites with more than two alleles, then this option can be used to test for genotype independence via the chi-squared statistic. The output file has the suffix ".geno.chisq".

 

         --hap-r2-positions <positions list file>

         --geno-r2-positions <positions list file>

           Outputs a file reporting the r2 statistics of the sites contained in the provided file verses all other sites. The output files have the suffix ".list.hap.ld" or ".list.geno.ld", depending on which option is used.

 

         --ld-window <integer>

           This optional parameter defines the maximum number of SNPs between the SNPs being tested for LD in the "--hap-r2", "--geno-r2", and "--geno-chisq" functions.

 

         --ld-window-bp <integer>

           This optional parameter defines the maximum number of physical bases between the SNPs being tested for LD in the "--hap-r2", "--geno-r2", and "--geno-chisq" functions.

 

         --ld-window-min <integer>

           This optional parameter defines the minimum number of SNPs between the SNPs being tested for LD in the "--hap-r2", "--geno-r2", and "--geno-chisq" functions.

 

         --ld-window-bp-min <integer>

           This optional parameter defines the minimum number of physical bases between the SNPs being tested for LD in the "--hap-r2", "--geno-r2", and "--geno-chisq" functions.

 

         --min-r2 <float>

           This optional parameter sets a minimum value for r2, below which the LD statistic is not reported by the "--hap-r2", "--geno-r2", and "--geno-chisq" functions.

 

         --interchrom-hap-r2

         --interchrom-geno-r2

           Outputs a file reporting the r2 statistics for sites on different chromosomes. The output files have the suffix ".interchrom.hap.ld" or ".interchrom.geno.ld", depending on the option used.

 

   OUTPUT TRANSITION/TRANSVERSION STATISTICS

         --TsTv <integer>

           Calculates the Transition / Transversion ratio in bins of size defined by this option. Only uses bi-allelic SNPs. The resulting output file has the suffix ".TsTv".

         --TsTv-summary

           Calculates a simple summary of all Transitions and Transversions. The output file has the suffix ".TsTv.summary".

 

         --TsTv-by-count

           Calculates the Transition / Transversion ratio as a function of alternative allele count. Only uses bi-allelic SNPs. The resulting output file has the suffix ".TsTv.count".

 

         --TsTv-by-qual

           Calculates the Transition / Transversion ratio as a function of SNP quality threshold. Only uses bi-allelic SNPs. The resulting output file has the suffix ".TsTv.qual".

 

         --FILTER-summary

           Generates a summary of the number of SNPs and Ts/Tv ratio for each FILTER category. The output file has the suffix ".FILTER.summary".
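The transition/transversion distinction behind these statistics can be sketched like this (an illustrative helper, assuming bi-allelic SNPs with REF differing from ALT):

```python
PURINES = {"A", "G"}

def is_transition(ref, alt):
    """A transition swaps purine<->purine (A<->G) or pyrimidine<->pyrimidine
    (C<->T); any other change between the four bases is a transversion."""
    return (ref in PURINES) == (alt in PURINES)

def ts_tv_ratio(snps):
    """Ts/Tv ratio over (ref, alt) pairs of bi-allelic SNPs."""
    ts = sum(1 for r, a in snps if is_transition(r, a))
    tv = len(snps) - ts
    return ts / tv
```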

 

   OUTPUT NUCLEOTIDE DIVERGENCE STATISTICS

         --site-pi

           Measures nucleotide divergency on a per-site basis. The output file has the suffix ".sites.pi".

 

         --window-pi <integer>

         --window-pi-step <integer>

           Measures the nucleotide diversity in windows, with the number provided as the window size. The output file has the suffix ".windowed.pi". The latter is an optional argument used to specify the step size in between windows.

 

   OUTPUT FST STATISTICS

         --weir-fst-pop <filename>

           This option is used to calculate an Fst estimate from Weir and Cockerham's 1984 paper. This is the preferred calculation of Fst. The provided file must contain a list of individuals (one individual per line) from the VCF file that correspond to one  popula-

           tion. This option can be used multiple times to calculate Fst for more than two populations. These files will also be included as "--keep" options. By default, calculations are done on a per-site basis. The output file has the suffix ".weir.fst".

 

         --fst-window-size <integer>

         --fst-window-step <integer>

           These options can be used with "--weir-fst-pop" to do the Fst calculations on a windowed basis instead of a per-site basis. These arguments specify the desired window size and the desired step size between windows.

 

   OUTPUT OTHER STATISTICS

         --het

           Calculates a measure of heterozygosity on a per-individual basis. Specfically, the inbreeding coefficient, F, is estimated for each individual using a method of moments. The resulting file has the suffix ".het".

 

         --hardy

           Reports  a  p-value for each site from a Hardy-Weinberg Equilibrium test (as defined by Wigginton, Cutler and Abecasis (2005)). The resulting file (with suffix ".hwe") also contains the Observed numbers of Homozygotes and Heterozygotes and the corresponding

           Expected numbers under HWE.

 

         --TajimaD <integer>

           Outputs Tajima's D statistic in bins with size of the specified number. The output file has the suffix ".Tajima.D".

 

         --indv-freq-burden

           This option calculates the number of variants within each individual of a specific frequency. The resulting file has the suffix ".ifreqburden".

 

         --LROH

           This option will identify and output Long Runs of Homozygosity. The output file has the suffix ".LROH". This function is experimental, and will use a lot of memory if applied to large datasets.

 

         --relatedness

           This option is used to calculate and output a relatedness statistic based on the method of Yang et al, Nature Genetics 2010 (doi:10.1038/ng.608). Specifically, it calculates the unadjusted Ajk statistic. The expectation of Ajk is zero for individuals within a population, and one for an individual with themselves. The output file has the suffix ".relatedness".

 

         --relatedness2

           This option is used to calculate and output a relatedness statistic based on the method of Manichaikul et al., BIOINFORMATICS 2010 (doi:10.1093/bioinformatics/btq559). The output file has the suffix ".relatedness2".

 

         --site-quality

           Generates a file containing the per-site SNP quality, as found in the QUAL column of the VCF file. This file has the suffix ".lqual".

 

         --missing-indv

           Generates a file reporting the missingness on a per-individual basis. The file has the suffix ".imiss".

 

         --missing-site

           Generates a file reporting the missingness on a per-site basis. The file has the suffix ".lmiss".

 

         --SNPdensity <integer>

           Calculates the number and density of SNPs in bins of size defined by this option. The resulting output file has the suffix ".snpden".

 

         --kept-sites

           Creates a file listing all sites that have been kept after filtering. The file has the suffix ".kept.sites".

 

         --removed-sites

           Creates a file listing all sites that have been removed after filtering. The file has the suffix ".removed.sites".

 

         --singletons

           This  option  will generate a file detailing the location of singletons, and the individual they occur in. The file reports both true singletons, and private doubletons (i.e. SNPs where the minor allele only occurs in a single individual and that individual

           is homozygotic for that allele). The output file has the suffix ".singletons".

 

         --hist-indel-len

           This option will generate a histogram file of the length of all indels (including SNPs). It shows both the count and the percentage of all indels for indel lengths that occur at least once in the input file. SNPs are considered indels with length zero.  The

           output file has the suffix ".indel.hist".

 

         --hapcount <BED file>

           This option will output the number of unique haplotypes within user specified bins, as defined by the BED file. The output file has the suffix ".hapcount".

 

         --mendel <PED file>

           This option is used to report Mendel errors identified in trios. The command requires a PLINK-style PED file, with the first four columns specifying a family ID, the child ID, the father ID, and the mother ID. The output of this command has the suffix ".mendel".

 

         --extract-FORMAT-info <string>

           Extract information from the genotype fields in the VCF file relating to a specified FORMAT identifier. The resulting output file has the suffix ".<FORMAT_ID>.FORMAT". For example, the following command would extract all of the GT (i.e. Genotype) entries:

             vcftools --vcf file1.vcf --extract-FORMAT-info GT

 

         --get-INFO <string>

           This option is used to extract information from the INFO field in the VCF file. The <string> argument specifies the INFO tag to be extracted, and the option can be used multiple times in order to extract multiple INFO entries. The resulting file, with suffix ".INFO", contains the required INFO information in a tab-separated table. For example, to extract the NS and DB flags, one would use the command:

             vcftools --vcf file1.vcf --get-INFO NS --get-INFO DB

 

   OUTPUT VCF FORMAT

         --recode

         --recode-bcf

           These options are used to generate a new file in either VCF or BCF from the input VCF or BCF file after applying the filtering options specified by the user. The output file has the suffix ".recode.vcf" or ".recode.bcf". By default, the INFO fields are removed from the output file, as the INFO values may be invalidated by the recoding (e.g. the total depth may need to be recalculated if individuals are removed). This behavior may be overridden by the following options. By default, BCF files are written out as BGZF compressed files.

 

         --recode-INFO <string>

         --recode-INFO-all

           These options can be used with the above recode options to define an INFO key name to keep in the output file. This option can be used multiple times to keep more of the INFO fields. The second option is used to keep all INFO values in the original file.

 

         --contigs <string>

           This option can be used in conjunction with --recode-bcf when the input file does not have any contig declarations. This option expects a file name with one contig header per line. These lines are included in the output file.

 

   OUTPUT OTHER FORMATS

         --012

           This option outputs the genotypes as a large matrix. Three files are produced. The first, with suffix ".012", contains the genotypes of each individual on a separate line. Genotypes are represented as 0, 1 and 2, where the number represent  that  number  of

           non-reference alleles. Missing genotypes are represented by -1. The second file, with suffix ".012.indv" details the individuals included in the main file. The third file, with suffix ".012.pos" details the site locations included in the main file.

 

         --IMPUTE

           This  option  outputs  phased  haplotypes in IMPUTE reference-panel format. As IMPUTE requires phased data, using this option also implies --phased. Unphased individuals and genotypes are therefore excluded. Only bi-allelic sites are included in the output.

           Using this option generates three files. The IMPUTE haplotype file has the suffix ".impute.hap", and the IMPUTE legend file has the suffix ".impute.hap.legend". The third file, with suffix ".impute.hap.indv", details the individuals included in the haplotype file, although this file is not needed by IMPUTE.

 

         --ldhat

         --ldhelmet

         --ldhat-geno

           These options output data in LDhat/LDhelmet format. These options require the "--chr" filter option to also be used. The first two options output phased data only, and therefore also imply "--phased", leading to unphased individuals and genotypes being excluded. For LDhelmet, only SNPs are considered, so "--remove-indels" is implied. The third option treats all of the data as unphased, and therefore outputs LDhat files in genotype/unphased format. Two output files are generated with the suffixes ".ldhat.sites" and ".ldhat.locs", which correspond to the LDhat "sites" and "locs" input files respectively; for LDhelmet, the two files generated have the suffixes ".ldhelmet.snps" and ".ldhelmet.pos", which correspond to the "SNPs" and "positions" files.

 

         --BEAGLE-GL

         --BEAGLE-PL

           These options output genotype likelihood information for input into the BEAGLE program. The VCF file is required to contain FORMAT fields with "GL" or "PL" tags, which can generally be output by SNP callers such as the GATK. Use of this  option  requires  a

           chromosome to be specified via the "--chr" option. The resulting output file has the suffix ".BEAGLE.GL" or ".BEAGLE.PL" and contains genotype likelihoods for biallelic sites. This file is suitable for input into BEAGLE via the "like=" argument.

 

         --plink

         --plink-tped

         --chrom-map

           These options output the genotype data in PLINK PED format. With the first option, two files are generated, with suffixes ".ped" and ".map". Note that only bi-allelic loci will be output. Further details of these files can be found in the PLINK documentation.

           Note: The first option can be very slow on large datasets. Using the --chr option to divide up the dataset is advised, or alternatively use the --plink-tped option which outputs the files in the PLINK transposed format with suffixes ".tped" and ".tfam".

           For usage with variant sites in species other than humans, the --chrom-map option may be used to specify a file name that has a tab-delimited mapping of chromosome name to a desired integer value with one line per chromosome. This file must contain a mapping for every chromosome value found in the file.

 

COMPARISON OPTIONS

       These options are used to compare the original variant file to another variant file and output the results. All of the diff functions require both files to contain the same chromosomes and that the files be sorted in the same order. If one of the files contains

       chromosomes that the other file does not, use the --not-chr filter to remove them from the analysis.

 

   DIFF VCF FILE

         --diff <filename>

         --gzdiff <filename>

         --diff-bcf <filename>

           These options compare the original input file to this specified VCF, gzipped VCF, or BCF file. These options must be specified with one additional option described below in order to specify what type of comparison is to be performed. See the examples section for typical usage.

 

   DIFF OPTIONS

         --diff-site

           Outputs the sites that are common / unique to each file. The output file has the suffix ".diff.sites_in_files".

 

         --diff-indv

           Outputs the individuals that are common / unique to each file. The output file has the suffix ".diff.indv_in_files".

 

         --diff-site-discordance

           This option calculates discordance on a site by site basis. The resulting output file has the suffix ".diff.sites".

 

         --diff-indv-discordance

           This option calculates discordance on a per-individual basis. The resulting output file has the suffix ".diff.indv".

 

         --diff-indv-map <filename>

           This  option  allows  the  user to specify a mapping of individual IDs in the second file to those in the first file. The program expects the file to contain a tab-delimited line containing an individual's name in file one followed by that same individual's

           name in file two with one mapping per line.

 

         --diff-discordance-matrix

           This option calculates a discordance matrix. This option only works with bi-allelic loci with matching alleles that are present in both files. The resulting output file has the suffix ".diff.discordance.matrix".

 

         --diff-switch-error

           This option calculates phasing errors (specifically "switch errors"). This option creates an output file describing switch errors found between sites, with suffix ".diff.switch".

 

AUTHORS

       Adam Auton (adam.auton@einstein.yu.edu)

       Anthony Marcketta (anthony.marcketta@einstein.yu.edu)

 

 

 

Usage

Both gzip-compressed VCFs and uncompressed VCFs (the latter for some commands only) can be handled.

#To compress (if bgzip is missing, "conda install -c bioconda tabix")
bgzip input.vcf
tabix -p vcf input.vcf.gz

#Decompress
bgzip -d input.vcf.gz

#Decompress, keeping the original file
bgzip -dc input.vcf.gz > output.vcf

 

Force a VCF in format 4.3 to masquerade as 4.2 (not recommended). From a Biostars thread (link).

#v4.3 => 4.2
sed 's/^##fileformat=VCFv4.3/##fileformat=VCFv4.2/' in.vcf > out.vcf
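A quick way to convince yourself the substitution does what you expect (a toy check added here, not part of the original post's pipeline) is to pipe a single header line through it:

```shell
# The real command streams the whole file; here only one header line is piped through.
echo '##fileformat=VCFv4.3' | sed 's/^##fileformat=VCFv4.3/##fileformat=VCFv4.2/'
# => ##fileformat=VCFv4.2
```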

  

Output only a specific chromosome. Specify an uncompressed VCF with --vcf and a gzip-compressed VCF with --gzvcf.

#output only chr1 call
vcftools --vcf input_file.vcf --recode --out output --chr 1

#output only chr1 call, and includes all INFO fields
vcftools --vcf input_file.vcf --recode --out output --chr 1 --recode-INFO-all

#remove chr1 and chr2 call
vcftools --vcf input_file.vcf --recode --out output --not-chr 1 --not-chr 2

#output only chr21:1-500000 call
vcftools --vcf input_file.vcf --recode --out output --chr 21 --from-bp 1 --to-bp 500000
  • --recode / --recode-bcf    These options are used to generate a new file in either VCF or BCF from the input VCF or BCF file after applying the filtering options specified by the user. The output file has the suffix ".recode.vcf" or ".recode.bcf". By default, the INFO fields are removed from the output file, as the INFO values may be invalidated by the recoding (e.g. the total depth may need to be recalculated if individuals are removed). This behavior may be overridden by the following options. By default, BCF files are written out as BGZF compressed files.
  • --recode-INFO <string> / --recode-INFO-all     These options can be used with the above recode options to define an INFO key name to keep in the output file. This option can be used multiple times to keep more of the INFO fields. The second option is used to keep all INFO values in the original file.
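Conceptually, --chr just keeps the header lines plus the records whose CHROM column matches. A toy awk sketch with made-up records illustrates the idea (for illustration only; use vcftools itself on real data):

```shell
# Keep header lines (starting with #) and body lines whose CHROM column equals "1".
printf '##fileformat=VCFv4.2\n1\t100\t.\tA\tG\n2\t200\t.\tC\tT\n' | awk -F'\t' '/^#/ || $1=="1"'
```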

 

A log file and a VCF file are output. The log file records the parameters used and the number of calls that remained.

f:id:kazumaxneo:20200415144538p:plain

If you do not know the chromosome names, look at the top of the VCF or at fasta.fai. Here chromosome 1 is "chr1".

f:id:kazumaxneo:20200415144942p:plain



Convert to BCF and output. (This can of course be combined with the other operations.)

vcftools --vcf input.vcf --recode-bcf --recode-INFO-all --out output.bcf
  • --recode-bcf   Generates a new BCF file from the input after applying the specified filtering options. The output file has the suffix ".recode.bcf". By default the INFO fields are removed, as the INFO values may be invalidated by the recoding; --recode-INFO-all keeps them all.

 

Remove indels, or keep only indels. Specify an uncompressed VCF with --vcf and a gzip-compressed VCF with --gzvcf.

#remove indel
vcftools --vcf input_file.vcf --remove-indels --recode --recode-INFO-all --out SNPs

#keep only indel
vcftools --vcf input_file.vcf --keep-only-indels --recode --recode-INFO-all --out indels
  • --keep-only-indels / --remove-indels     Include or exclude sites that contain an indel. For these options "indel" means any variant that alters the length of the REF allele.

 

Output the shared / unique variants. Specify an uncompressed VCF with --vcf and a gzip-compressed VCF with --gzvcf.

vcftools --gzvcf input_file1.vcf.gz --gzdiff input_file2.vcf.gz --diff-site --out output
  • --diff-site  Outputs the sites that are common / unique to each file. The output file has the suffix ".diff.sites_in_files".
  • --diff-indv    Outputs the individuals that are common / unique to each file. The output file has the suffix ".diff.indv_in_files".
  •  --diff-site-discordance   This option calculates discordance on a site by site basis. The resulting output file has the suffix ".diff.sites".
  • --diff-indv-discordance     This option calculates discordance on a per-individual basis. The resulting output file has the suffix ".diff.indv".
  • --diff-indv-map <filename>     This  option  allows  the  user  to specify a mapping of individual IDs in the second file to those in the first file. The program expects the file to contain a tab-delimited line containing an individual's name in file one followed by that same individual's name in file two with one mapping per line.
  • --diff-discordance-matrix   This option calculates a discordance matrix. This option only works with bi-allelic loci with matching alleles that are present  in  both  files.  The  resulting  output  file  has  the  suffix ".diff.discordance.matrix".
  •  --diff-switch-error   This option calculates phasing errors (specifically "switch errors"). This option creates an output file describing switch errors found between sites, with suffix ".diff.switch".

 

Output relatedness statistics based on the method of Manichaikul et al. (doi:10.1093/bioinformatics/btq559).

vcftools --vcf input_file.vcf --relatedness2

The output suffix is ".relatedness2".

 

Calculate the Transition / Transversion ratio.

vcftools --vcf input_file.vcf --TsTv-by-count
vcftools --vcf input_file.vcf --TsTv-by-qual
vcftools --vcf input_file.vcf --TsTv-summary
  • --TsTv-summary   Calculates a simple summary of all Transitions and Transversions. The output file has the suffix ".TsTv.summary".

  • --TsTv-by-count   Calculates the Transition / Transversion ratio as a function of alternative allele count. Only uses bi-allelic SNPs. The resulting output file has the suffix ".TsTv.count".

  • --TsTv-by-qual    Calculates the Transition / Transversion ratio as a function of SNP quality threshold. Only uses bi-allelic SNPs. The resulting output file has the suffix ".TsTv.qual". 
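As a reminder of what is being counted: transitions are A<->G and C<->T, and every other substitution is a transversion. A toy awk tally over hypothetical REF/ALT pairs (not vcftools output) shows the bookkeeping:

```shell
# Classify each "REF ALT" pair and print the transition/transversion ratio.
printf 'A G\nC T\nA C\nG T\n' | awk '
  $1$2 ~ /^(AG|GA|CT|TC)$/ {ts++; next}
  {tv++}
  END {printf "Ts=%d Tv=%d ratio=%.1f\n", ts, tv, ts/tv}'
# => Ts=2 Tv=2 ratio=1.0
```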

 

Compare two VCF files to determine which sites and individuals are shared. Specify the second VCF with --diff, --gzdiff, or --diff-bcf.

vcftools --vcf input_data.vcf --diff other_data.vcf --out compare
  • --diff <filename>, --gzdiff <filename>,  --diff-bcf <filename>   These options compare the original input file to this specified VCF, gzipped VCF, or BCF file. These options must be specified with one additional option described below in order to specify what type of comparison is to be performed. See the  examples  section for typical usage.

There are several other commands as well.

 

 

The following work only when vcftools is built from source (they were not recognized when installed via conda).

 vcf-merge  

Merge two or more VCFs.

vcf-merge A.vcf.gz B.vcf.gz C.vcf.gz | bgzip -c > out.vcf.gz

#Addendum: merging with bcftools
bcftools merge --merge all --force-samples \
wt.vcf.gz mt1.vcf.gz mt2.vcf.gz > merged.vcf

Sites at the same position are collapsed into a single line. The four rightmost columns show what happened.

Present in A and B; absent from C.

GT:GQ:DP:PL:AD 1/1:75:25:1115,75,0:0,25 1/1:60:37:795,60,0:0,37 .

Present only in A.

GT:GQ:DP:PL:AD 1/1:99:59:2415,181,0:0,59 . .

If you are merging SV calls, SURVIVOR is convenient.

SURVIVOR, a tool for SV simulation, SV merging, and report generation

 

vcf-concat 

Concatenate two or more VCFs, excluding comment (##) lines. (It does not merge sites at the same position or sort.)

vcf-concat A.vcf.gz B.vcf.gz C.vcf.gz | gzip -c > out.vcf.gz 

 

vcf-sort

Sort by chromosome and position.

vcf-sort file.vcf.gz

This gives the same result as running sort -k1,1d -k2,2n (dictionary sort on the chromosome column plus numeric sort on the position column).
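A toy check of sort -k1,1d -k2,2n on a couple of made-up coordinate lines (illustration only, not from the original post):

```shell
# -k1,1d: dictionary sort on the chromosome column; -k2,2n: numeric sort on the position column.
printf 'chr2\t50\nchr1\t300\nchr1\t20\n' | sort -k1,1d -k2,2n
# => chr1 20, then chr1 300, then chr2 50 (tab-separated, one record per line)
```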

 

vcf-consensus

Modify the reference fasta based on the variant sites in the VCF.

cat ref.fa | vcf-consensus file.vcf.gz > out.fa 
  • --sample <name> If not given, all variants are applied
  • --haplotype <int> Apply only variants for the given haplotype (1,2)
  • --iupac-codes Apply variants in the form of IUPAC ambiguity codes

 

vcf-convert

Change the VCF format from 3.3 to 4.0.

zcat file.vcf.gz | vcf-convert -r reference.fa > out.vcf
  • --version <string> 4.0, 4.1, 4.2

 

vcf-isec

Detect shared and unique variants.

Extract variants present in two or more of the VCFs.

vcf-isec -n +2 A.vcf.gz B.vcf.gz | bgzip -c > out.vcf.gz 
  • -n [+-=]<int> Output positions present in this many (=), this many or more (+), or this many or fewer (-) files.

Extract variants present in A but missing from B and C. To continue despite errors about differing columns (i.e. the VCFs being different), add the "-f" option.

vcf-isec -c A.vcf.gz B.vcf.gz C.vcf.gz | bgzip -c > out.vcf.gz
  • -c Output positions present in the first file but missing from the other files.  
  • -f  Continue even if the script complains about differing columns, VCF versions, etc.
  • -p <path> If present, multiple files will be created with all possible isec combinations. (Suitable for Venn Diagram analysis.)
  • -r <list|file> Do only the given regions (comma-separated list or one region per line in a file).
  • -w <int> In repetitive sequences, the same indel can be called at different positions. Consider records this far apart as matching (be it a SNP or an indel).

 

vcf-annotate

Let sites with MQ of 30 or higher "PASS". (MQ: link)

vcf-annotate -f MinMQ=30 input.vcf > output.vcf
  • -f <file|list> Apply filters, list is in the format flt1=value/flt2/flt3=value/etc. If argument to -f is a file, user-defined filters be applied. See User Defined Filters below.
  • -H Remove lines with FILTER anything else than PASS or "."
  • -r <list> Comma-separated list of tags to be removed (e.g. ID,INFO/DP,FORMAT/DP,FILTER).

Checking the output, the sites with MQ of 30 or higher pass the filter.

f:id:kazumaxneo:20170916144847j:plain

 

Filter on depth, allele depth, mapping quality, and so on.

vcf-annotate -f +/-a/c=3,10/d=10/-D/MinMQ=40/ input.vcf > output.vcf 

Filters like the following are available.

f:id:kazumaxneo:20170916151147j:plain

Above, MinMQ is written out in full, but q works as well. Sites that do not PASS are tagged with the filter's name: a site that fails MinDP gets "MinDP" in the FILTER column.

 

Addendum

Filtering may fail to work correctly depending on the VCF version and similar factors. In that case, try filtering with vcffilter from vcflib (bioconda; install with "conda install -c bioconda vcflib"). For example, filter with DP>10, MQ>30 and QD>20: most low-coverage regions and rare variants are removed. Docker images are also available (an example of a frequently pulled image: docker pull shollizeck/vcflib:latest).

(Note: some VCFs lack the MQ tag, for example, and filtering then fails with an error. Check with grep beforehand that the relevant tag exists in the VCF file.)

#Use a docker image here to avoid differences between vcffilter versions
docker run -itv $PWD:/data --rm shollizeck/vcflib:latest
vcffilter -h
cd /data/

#Quality filtering
vcffilter -f "DP > 10 & MQ > 30 & QD > 20" input.vcf > filtered.vcf
#If a "cannot compare" error appears, the tag may be missing from the VCF. Check with grep => grep "MQ" in.vcf

#With Freebayes, MQM can be used instead of MQ
vcffilter -f "DP > 10 & MQM > 30" Freebayes.vcf > Freebayes_filtered.vcf

#Then extract only the homozygous calls
cat filtered.vcf | vcffilter -g "GT = 1/1" | vcffixup - | vcffilter -f "AC > 0" > homozygous.vcf

#Compress
bgzip homozygous.vcf
tabix -p vcf homozygous.vcf.gz

 

 

 

 

 

When viewing in IGV, variant colors can be adjusted under Preferences => Variants.

f:id:kazumaxneo:20180823100428j:plain

If a VCF merged with vcf-merge etc. is loaded into IGV, the calls are shown nested under the merged vcf.gz, so you can also see which tool each call came from.

f:id:kazumaxneo:20180823100925j:plain

 

  

vcf-compare

Compare VCFs and output the numbers of unique and shared variants (usable for drawing a Venn diagram).

 vcf-compare A.vcf.gz B.vcf.gz C.vcf.gz > output
  • -a Ignore lines where FILTER column is anything else than PASS or '.'
  • -c <list|file> Same as -r, left for backward compatibility. Please do not use as it will be dropped in the future.
  • -d Debugging information. Giving the option multiple times increases verbosity
  • --cmp-genotypes Compare genotypes, not only positions
  • -ignore-indels Exclude sites containing indels from genotype comparison
  • -R <file> Compare the actual sequence, not just positions. Use with -w to compare indels.

Comparing two VCF files gives output like the following.

$ cat out

# This file was generated by vcf-compare.

# The command line was: vcf-compare(v0.1.14-12-gcdb80b8) HaplotypeCaller.vcf.gz UnifiedGenotyper.vcf.gz

#

#VN 'Venn-Diagram Numbers'. Use `grep ^VN | cut -f 2-` to extract this part.

#VN The columns are: 

#VN        1  .. number of sites unique to this particular combination of files

#VN        2- .. combination of files and space-separated number, a fraction of sites in the file

VN 5 UnifiedGenotyper.vcf.gz (7.9%) #unique to UnifiedGenotyper.vcf.gz

VN 19 HaplotypeCaller.vcf.gz (24.7%) #unique to HaplotypeCaller.vcf.gz

VN 58 HaplotypeCaller.vcf.gz (75.3%) UnifiedGenotyper.vcf.gz (92.1%) #shared

#SN Summary Numbers. Use `grep ^SN | cut -f 2-` to extract this part.

SN Number of REF matches: 58

SN Number of ALT matches: 57

SN Number of REF mismatches: 0

SN Number of ALT mismatches: 1

SN Number of samples in GT comparison: 0

Comparing five VCFs (the sites unique to each combination are listed in ascending order of count; for example, line 8 shows that exactly one variant is found only in SRR12340681 and SRR12340683).

f:id:kazumaxneo:20210424191830p:plain
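As the #VN comment in the header suggests, the Venn-diagram numbers can be pulled out with grep and cut. A sketch on a hand-written sample line (not real vcf-compare output):

```shell
# vcf-compare output is tab-delimited; keep the VN lines and drop the leading "VN" field.
printf 'VN\t5\tB.vcf.gz (7.9%%)\nSN\tNumber of REF matches:\t58\n' | grep ^VN | cut -f 2-
```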

 

 

vcf-to-tab

Convert a VCF to a simple tab-delimited format.

zcat file.vcf.gz | vcf-to-tab > out.tab

#Samples identical to REF appear as "./"; with a huge vcf you may want to remove them. Delete with sed.
sed s/'\.\/'/''/g out.tab > remove.tab
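The sed call simply deletes every "./" occurrence; on one made-up line of tab output it behaves like this (the tab separators are left intact, so the column becomes empty):

```shell
# Strip the "./" genotype columns from a vcf-to-tab style line.
printf 'chr\t831647\tC\t./\tT/T\n' | sed s/'\.\/'/''/g
```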

Output like the following is obtained.

user$ head -6 out.tab

#CHROM  POS     REF     Z

chr     831647  C       T/T

chr     943495  G       A/A

chr     1012958 G       T/T

chr     1114645 T       TTTAATTAGGGTGATAAATTCG/TTTAATTAGGGTGATAAATTCG

chr     1204616 G       A/A

When a VCF containing multiple samples is converted:

f:id:kazumaxneo:20210517180912p:plain


 

 

vcf-stats 

Summary statistics of SNPs and indels.

#Analyze only sample NA00001 in the VCF
vcf-stats file.vcf.gz -f SAMPLE/NA00001/DP=1:200 -p out/

#Analyze all samples in the VCF
vcf-stats file.vcf.gz -f FORMAT/DP=10:200 -p out/

#With bcftools (all samples in the VCF)
bcftools stats -F ref.fasta -s - mpileup.vcf.gz > mpileup.vcf.gz.stats
#create report pdf
plot-vcfstats -p report/ mpileup.vcf.gz.stats

 

vcf-contrast  

Find differences between groups.

#Check whether either of samples A or B differs from all of C, D and E, requiring a minimum depth of 20
vcf-contrast +A,B -C,D,E -d 20 file.vcf.gz > output

 

fill-an-ac 

Fill in or recalculate the AN and AC INFO fields.

fill-an-ac [OPTIONS] in.vcf > out.vcf

 

fill-fs 

Mask known variants in a VCF file with N and annotate the flanking sequence (INFO/FS tag). Useful for primer design.

#Mask the variants in the VCF file with N and lowercase the regions in the BED file.
fill-fs file.vcf -v mask.vcf -m lc -b mask.bed

 

fill-ref-md5

Fill in missing reference information and put the sequence MD5s into the VCF header.

fill-ref-md5 -i AS:NCBIM37,SP:"Mus\ Musculus" -r NCBIM37_um.fa -d NCBIM37_um.fa.dict in.vcf.gz out.vcf.gz

 

vcf-indel-stats

Calculate the in-frame ratio.

vcf-indel-stats [OPTIONS] < in.vcf > out.txt

 

vcf-subset 

Remove some columns from a VCF file.

cat in.vcf | vcf-subset -r -t indels -e -c SAMPLE1 > out.vcf

 

vcf-query

A powerful command for converting a VCF file into a user-defined format. Supports retrieving subsets of positions, columns, and fields.

vcf-query file.vcf.gz 1:1000-2000 -c NA001,NA002,NA003

 

vcf-shuffle-cols 

Reorder the columns of a VCF.

vcf-shuffle-cols [OPTIONS] -t template.vcf.gz file.vcf.gz > out.vcf

 

vcf-annotate Adds or removes filters and custom annotations to VCF files. Custom annotations are supplied as a tab-delimited file.

vcf-fix-newlines Fixes the newline representation in a VCF file.

vcf-fix-ploidy Fixes diploid vs haploid genotypes on sex chromosomes, including the pseudoautosomal regions.

vcf-phased-join Concatenates multiple overlapping VCFs preserving phasing.

 

 

追記

例えばvarscan2でバリアントコール(SNPsとsmall indels)し、VCFtoolsでSNPs と indelを分けて出力する。

#pileup
sambamba mpileup input.bam -t 12 --samtools -f ref.fa > pileup

#varscanのmpileup2cnsでバリアントコール(--output-vcf 1でVCF出力)
varscan mpileup2cns mpileup --output-vcf 1 --variants 1 \
--min-coverage 8 --min-reads2 4 --min-avg-qual 15 --min-var-freq 0.01 \
--min-freq-for-hom 0.75 --p-value 99e-02 \
> varscan.vcf

#フィルタリング1 SNV
vcftools --vcf varscan.vcf --remove-indels --recode --recode-INFO-all --out varscan_SNV.vcf
#フィルタリング2 indel
vcftools --vcf varscan.vcf --keep-only-indels --recode --recode-INFO-all --out varscan_indel.vcf

  

引用

The variant call format and VCFtools

Petr Danecek,1,† Adam Auton,2,† Goncalo Abecasis,3 Cornelis A. Albers,1 Eric Banks,4 Mark A. DePristo,4 Robert E. Handsaker,4 Gerton Lunter,2 Gabor T. Marth,5 Stephen T. Sherry,6 Gilean McVean,2,7 Richard Durbin,1,* and 1000 Genomes Project Analysis Groupcorresponding author

Bioinformatics. 2011 Aug 1; 27(15): 2156–2158.

 

BIostars 質問 VCFのマージ

https://www.biostars.org/p/49730/

 

 

追記

変異が数十万以上あって動作が緩慢なら、最初にcyvcf2で不要な情報を除外してから使ってください。cyvcf2は2017年に発表された高速なVCFのパーサです。

 

マルチプルアライメントのトリミングツール trimAl

2020 5/14 help追記

2021 1/23 condaによるインストール追記

 

マルチプルアライメントを行うと、ほとんどアライメントされない領域ができることがある。そういった領域は情報として利用するのが難しいため、一般に除去しても問題にならない。trimAlはラージスケールにも対応したマルチプルアライメントのトリミングツールで、何千もの配列のマルチプルアライメント出力からアライメントが貧弱な領域を除去することができる。入力できるのはPhylip、Fasta、Clustal、NBRF/Pir、Mega、Nexusなどになる。

 

マニュアル

http://trimal.cgenomics.org/use_of_the_command_line_trimal_v1.2

チュートリアル

http://trimal.cgenomics.org/_media/manual.b.pdf

 

インストール

Github

Download

http://trimal.cgenomics.org/downloads

ソースをgit cloneで取得(または上記Downloadページから入手して解凍)し、ビルドする。

git clone https://github.com/scapella/trimal.git
cd trimal/source/
make -j

#conda
mamba install -c bioconda trimal
mamba install -c bioconda/label/cf201901 trimal

./trimal

$ ./trimal 

 

trimAl v1.4.rev22 build[2015-05-21]. 2009-2015. Salvador Capella-Gutierrez and Toni Gabaldón.

 

trimAl webpage: http://trimal.cgenomics.org

 

This program is free software: you can redistribute it and/or modify 

it under the terms of the GNU General Public License as published by 

the Free Software Foundation, the last available version.

 

Please cite:

trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses.

Salvador Capella-Gutierrez; Jose M. Silla-Martinez; Toni Gabaldon.

Bioinformatics 2009, 25:1972-1973.

 

Basic usage

trimal -in <inputfile> -out <outputfile> -(other options).

 

Common options (for a complete list please see the User Guide or visit http://trimal.cgenomics.org):

 

    -h                          Print this information and show some examples.

    --version                   Print the trimAl version.

 

    -in <inputfile>             Input file in several formats (clustal, fasta, NBRF/PIR, nexus, phylip3.2, phylip).

 

    -compareset <inputfile>     Input list of paths for the files containing the alignments to compare.

    -forceselect <inputfile>    Force selection of the given input file in the files comparison method.

 

    -backtrans <inputfile>      Use a Coding Sequences file to get a backtranslation for a given AA alignment

    -ignorestopcodon            Ignore stop codons in the input coding sequences

    -splitbystopcodon           Split input coding sequences up to first stop codon appearance

 

    -matrix <inpufile>          Input file for user-defined similarity matrix (default is Blosum62).

    --alternative_matrix <name> Select an alternative similarity matrix already loaded. 

                                Only available 'degenerated_nt_identity'

 

    -out <outputfile>           Output alignment in the same input format (default stdout). (default input format)

    -htmlout <outputfile>       Get a summary of trimal's work in an HTML file.

 

    -keepheader                 Keep original sequence header including non-alphanumeric characters.

                                Only available for input FASTA format files. (future versions will extend this feature)

 

    -nbrf                       Output file in NBRF/PIR format

    -mega                       Output file in MEGA format

    -nexus                      Output file in NEXUS format

    -clustal                    Output file in CLUSTAL format

 

    -fasta                      Output file in FASTA format

    -fasta_m10                  Output file in FASTA format. Sequences name length up to 10 characters.

 

    -phylip                     Output file in PHYLIP/PHYLIP4 format

    -phylip_m10                 Output file in PHYLIP/PHYLIP4 format. Sequences name length up to 10 characters.

    -phylip_paml                Output file in PHYLIP format compatible with PAML

    -phylip_paml_m10            Output file in PHYLIP format compatible with PAML. Sequences name length up to 10 characters.

    -phylip3.2                  Output file in PHYLIP3.2 format

    -phylip3.2_m10              Output file in PHYLIP3.2 format. Sequences name length up to 10 characters.

 

    -complementary              Get the complementary alignment.

    -colnumbering               Get the relationship between the columns in the old and new alignment.

 

    -selectcols { n,l,m-k }     Selection of columns to be removed from the alignment. Range: [0 - (Number of Columns - 1)]. (see User Guide).

    -selectseqs { n,l,m-k }     Selection of sequences to be removed from the alignment. Range: [0 - (Number of Sequences - 1)]. (see User Guide).

 

    -gt -gapthreshold <n>       1 - (fraction of sequences with a gap allowed). Range: [0 - 1]

    -st -simthreshold <n>       Minimum average similarity allowed. Range: [0 - 1]

    -ct -conthreshold <n>       Minimum consistency value allowed.Range: [0 - 1]

    -cons <n>                   Minimum percentage of the positions in the original alignment to conserve. Range: [0 - 100]

 

    -nogaps                     Remove all positions with gaps in the alignment.

    -noallgaps                  Remove columns composed only by gaps.

    -keepseqs                   Keep sequences even if they are composed only by gaps.

 

    -gappyout                   Use automated selection on "gappyout" mode. This method only uses information based on gaps' distribution. (see User Guide).

    -strict                     Use automated selection on "strict" mode. (see User Guide).

    -strictplus                 Use automated selection on "strictplus" mode. (see User Guide).

                               (Optimized for Neighbour Joining phylogenetic tree reconstruction).

 

    -automated1                 Use a heuristic selection of the automatic method based on similarity statistics. (see User Guide). (Optimized for Maximum Likelihood phylogenetic tree reconstruction).

 

    -terminalonly               Only columns out of internal boundaries (first and last column without gaps) are 

                                candidates to be trimmed depending on the selected method

    --set_boundaries { l,r }    Set manually left (l) and right (r) boundaries - only columns out of these boundaries are 

                                candidates to be trimmed depending on the selected method. Range: [0 - (Number of Columns - 1)]

    -block <n>                  Minimum column block size to be kept in the trimmed alignment. Available with manual and automatic (gappyout) methods

 

    -resoverlap                 Minimum overlap of a positions with other positions in the column to be considered a "good position". Range: [0 - 1]. (see User Guide).

    -seqoverlap                 Minimum percentage of "good positions" that a sequence must have in order to be conserved. Range: [0 - 100](see User Guide).

 

    -clusters <n>               Get the most Nth representatives sequences from a given alignment. Range: [1 - (Number of sequences)]

    -maxidentity <n>            Get the representatives sequences for a given identity threshold. Range: [0 - 1].

 

    -w <n>                      (half) Window size, score of position i is the average of the window (i - n) to (i + n).

    -gw <n>                     (half) Window size only applies to statistics/methods based on Gaps.

    -sw <n>                     (half) Window size only applies to statistics/methods based on Similarity.

    -cw <n>                     (half) Window size only applies to statistics/methods based on Consistency.

 

    -sgc                        Print gap scores for each column in the input alignment.

    -sgt                        Print accumulated gap scores for the input alignment.

    -ssc                        Print similarity scores for each column in the input alignment.

    -sst                        Print accumulated similarity scores for the input alignment.

    -sfc                        Print sum-of-pairs scores for each column from the selected alignment

    -sft                        Print accumulated sum-of-pairs scores for the selected alignment

    -sident                     Print identity scores matrix for all sequences in the input alignment. (see User Guide).

    -soverlap                   Print overlap scores matrix for all sequences in the input alignment. (see User Guide).

 

 

 

実行方法

入力はマルチプルアライメントの出力ファイルとなる。

 10%を超える配列にギャップがあるカラムを全てトリミングして出力する(ただし-cons 60の指定により、元のアライメントの少なくとも60%のカラムは保持される)。

trimal -in input.aln -out output.aln -htmlout output.html -gt 0.9 -cons 60 
  • -in Input file in several formats (clustal, fasta, NBRF/PIR, nexus, phylip3.2, phylip).
  • -out Output alignment in the same input format (default stdout). (default input format)
  • -htmlout Get a summary of trimal's work in an HTML file.
  • -gt 1 - (fraction of sequences with a gap allowed).
  • -cons Minimum percentage of the positions in the original alignment to conserve. 
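-gt 0.9の判定基準(カラムごとのギャップ率)は、小さなアライメントで手計算してみると分かりやすい。以下は各カラムのギャップ率をawkで求める確認用スケッチ(ファイル名・配列は仮のもの)。

```shell
# 3配列・5カラムの小さなアライメントを仮に作る
cat > aln.fa <<'EOF'
>seq1
ATG-A
>seq2
ATGCA
>seq3
AT--A
EOF

# カラムiのギャップ率 = ギャップを持つ配列数 / 配列数
awk '/^>/ { next }
     { n++; for (i = 1; i <= length($0); i++) if (substr($0, i, 1) == "-") gap[i]++ }
     END { for (i = 1; i <= length($0); i++) printf "col%d %.2f\n", i, gap[i] / n }' aln.fa
```

col3(0.33)とcol4(0.67)はギャップ率が0.1を超えるため、-gt 0.9では除去対象になる。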

 

ギャップの閾値を自動で決める。4つの方法がある。

trimal -in input.aln -out output.aln -gappyout
  • -gappyout Use automated selection on "gappyout" mode. This method only uses information based on gaps' distribution. (see User Guide).
trimal -in input.aln -out output.aln -strict
  • -strict Use automated selection on "strict" mode. (see User Guide).
trimal -in input.aln -out output.aln -strictplus
  • -strictplus Use automated selection on "strictplus" mode. (see User Guide). (Optimized for Neighbour Joining phylogenetic tree reconstruction).
 trimal -in input.aln -out output.aln -automated1
  • -automated1 Use a heuristic selection of the automatic method based on similarity statistics. (see User Guide). (Optimized for Maximum Likelihood phylogenetic tree reconstruction).

 

  

マルチプルアライメントは t-coffeeなどで行うことができる(リンク)。

t_coffee input.fasta

 

引用

trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses

Salvador Capella-Gutiérrez, José M. Silla-Martínez and Toni Gabaldón∗

Bioinformatics. 2009 Aug 1;25(15):1972-3.

 

関連

リファレンスを変えて、変異株のゲノム配列を作る。

2019 8/3  リンク追加

2021 2/17 dockerhubリンク追加

 

変異のコール結果であるVCFファイルを元に変異株のゲノムを作りたいことが時々ある。そういう時は、gatkのFastaAlternateReferenceMakerが利用できる。

 

 マニュアル

f:id:kazumaxneo:20170913153420j:plain

 gatkがない人はbrewで導入しておく。

brew install gatk

2021 2/17 dockerhub (link)
docker pull broadinstitute/gatk3:3.8-1
docker run --rm -itv $PWD:/data/ broadinstitute/gatk3:3.8-1
java -jar GenomeAnalysisTK.jar -h

 

実行方法

 入力は変異コール結果のVCFファイルとそのリファレンスファイル。以下のように打つ。

gatk -T FastaAlternateReferenceMaker -R ref.fa -o output.fa -V indel.vcf
  • --lineWidth Maximum length of sequence to write per line (60). 

VCFファイルのコール部位通りに修正されたFASTAファイルが出力される。
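FastaAlternateReferenceMakerがSNPに対して行うのは、要するにリファレンス配列の該当位置の塩基をALTに置き換える処理である。bashの文字列操作で書くと次のようなイメージになる(1塩基置換だけの簡略スケッチ。実際のツールはindelやIUPACコードも扱う)。

```shell
# POS=3 REF=G ALT=T の置換(1-origin)を短い配列で確認する
ref="ACGTACGT"
pos=3; alt="T"
echo "${ref:0:pos-1}${alt}${ref:pos}"   # ACTTACGT
```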

  

  おまけ

この塩基置換が再現されているか、output.faからfastqをシミュレートして作り、WTのリファレンスにアライメントさせて確認する。

f:id:kazumaxneo:20170913152556j:plain

 

 変異が再現されている。

f:id:kazumaxneo:20170913152653j:plain

 

VCFtoolsでも同様のことができます。

cat ref.fa | vcf-consensus file.vcf.gz > out.fa 
  • --sample <name> If not given, all variants are applied
  • --haplotype <int> Apply only variants for the given haplotype (1,2)
  • --iupac-codes Apply variants in the form of IUPAC ambiguity codes

 

詳細はこちら


メタゲノムcontigのカバレッジ、GC、taxonomy情報を可視化して分析できる BlobTools

2019 1/16 テストラン追加、diamondデータベースbuidコマンドエラー修正

2019 1/19 diamondデータベースbuidコマンド修正

2019 1/21 追記

2019 6/22 インストール追記

2020 7/29  シミュレーション追記

2020 9/29 追記

2021 9/1   ビルドコマンド修正(リンク修正と、公開データベースのファイル構造が変わっているようなので、正しくビルドできるように直しました)

2021 11/4 helpの誤字修正

 

アセンブリしたcontig中に、アセンブリツールのアーティファクトやコンタミ由来のcontigが混じることは頻繁に起きる。そのため、アセンブリのクオリティチェックの一つとして、ターゲットとなる生物以外の配列がどれほど混じっているか見積もることが重要になる。BlobToolsはcontigのカバレッジ、GC、taxonomy情報などをグラフに出力し、候補となるcontigのスクリーニングを支援するツールである。

 

 

公式サイト

https://github.com/DRL/blobtools

マニュアル

https://blobtools.readme.io/docs

 

インストール

依存

  • samtools1.5

Github

installを実行する。依存も含めてインストールされる。

git clone https://github.com/DRL/blobtools.git 
cd blobtools
sudo ./install

#bioconda (link)
conda install -c bioconda -y blobtools

[+] Python dependencies installed.

[+] Creating BlobTools executable...done.

[+] Downloading samtools-1.5...done.

[+] Unpacking samtools-1.5...done.

[+] Configuring samtools-1.5...done.

[+] Compiling samtools-1.5...done.

[+] Cleaning up...

[+] Downloading NCBI taxdump from ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz ...done.

[+] Unpacking nodes/names.dmp ...done.

[+] Creating nodesDB from /root/blobtools/data/nodes.dmp and /root/blobtools/data/names.dmp

[+] Store nodesDB in default location /root/blobtools/data/nodesDB.txt

[%] 100%

[+] Removing intermediate files...

[+] BlobTools was installed successfully. Please run ./blobtools

ビルドの過程で、samtools1.5が導入される。NCBI taxonomyファイルもダウンロードされる。blobtoolsのバイナリができる。

> blobtools  

# blobtools     

usage: blobtools [<command>] [<args>...] [--help] [--version]

 

commands:

 

    create        create a BlobDB

    view          generate tabular view, CONCOCT input or COV files from BlobDB

    plot          generate a BlobPlot from a BlobDB

    covplot       generate a CovPlot from a BlobDB and a COV file

 

    map2cov       generate a COV file from BAM file

    taxify        generate a BlobTools compatible HITS file (TSV)

    bamfilter     subset paired-end reads from a BAM file

    seqfilter     subset sequences in FASTA file based sequence IDs in list

    nodesdb       create nodesdb based on NCBI Taxdump's names.dmp and nodes.dmp

 

    -h, --help      show this

    -v, --version   show version number

 

See 'blobtools <command> --help' for more information on a specific command.

 

examples:

 

    # 1. Create a BlobDB

    ./blobtools create -i example/assembly.fna -b example/mapping_1.bam -t example/blast.out -o example/test

 

    # 2. Generate a tabular view

    ./blobtools view -i example/test.blobDB.json

 

    # 3. Generate a blobplot

    ./blobtools blobplot -i example/test.blobDB.json

> blobtools create -h

# blobtools create -h

usage: blobtools create     -i FASTA [-y FASTATYPE] [-o PREFIX] [--title TITLE]

                              [-b BAM...] [-s SAM...] [-a CAS...] [-c COV...]

                              [--nodes <NODES>] [--names <NAMES>] [--db <NODESDB>]

                              [-t HITS...] [-x TAXRULE...] [-m FLOAT] [-d FLOAT] [--tax_collision_random]

                              [-h|--help]

 

    Options:

        -h --help                       show this

        -i, --infile FASTA              FASTA file of assembly. Headers are split at whitespaces.

        -y, --type FASTATYPE            Assembly program used to create FASTA. If specified,

                                        coverage will be parsed from FASTA header.

                                        (Parsing supported for 'spades', 'velvet', 'platanus')

        -t, --hitsfile HITS...          Hits file in format (qseqid\ttaxid\tbitscore)

                                        (e.g. BLAST output "--outfmt '6 qseqid staxids bitscore'")

                                        Can be specified multiple times

        -x, --taxrule <TAXRULE>...      Taxrule determines how taxonomy of blobs

                                        is computed (by default both are calculated)

                                        "bestsum"       : sum bitscore across all

                                                          hits for each taxonomic rank

                                        "bestsumorder"  : sum bitscore across all

                                                          hits for each taxonomic rank.

                                                  - If first <TAX> file supplies hits, bestsum is calculated.

                                                  - If no hit is found, the next <TAX> file is used.

        -m, --min_score <FLOAT>         Minimal score necessary to be considered for taxonomy calculaton, otherwise set to 'no-hit'

                                        [default: 0.0]

        -d, --min_diff <FLOAT>          Minimal score difference between highest scoring

                                        taxonomies (otherwise "unresolved") [default: 0.0]

        --tax_collision_random          Random allocation of taxonomy if highest scoring

                                        taxonomies have equal scores (otherwise "unresolved") [default: False]

        --nodes <NODES>                 NCBI nodes.dmp file. Not required if '--db'

        --names <NAMES>                 NCBI names.dmp file. Not required if '--db'

        --db <NODESDB>                  NodesDB file (default: $BLOBTOOLS/data/nodesDB.txt).

        -b, --bam <BAM>...              BAM file(s), can be specified multiple times

        -s, --sam <SAM>...              SAM file(s), can be specified multiple times

        -a, --cas <CAS>...              CAS file(s) (requires clc_mapping_info in $PATH), can be specified multiple times

        -c, --cov <COV>...              COV file(s), can be specified multiple times

        -o, --out <PREFIX>              BlobDB output prefix

        --title TITLE                   Title of BlobDB [default: output prefix)

> blobtools plot -h

# blobtools plot -h

usage: blobtools blobplot  -i BLOBDB

                                [-p INT] [-l INT] [--cindex] [-n] [-s]

                                [-r RANK] [-x TAXRULE] [--label GROUPS...]

                                [--lib COVLIB] [-o PREFIX] [-m]

                                [--sort ORDER] [--sort_first LABELS] [--hist HIST] [--notitle] [--filelabel]

                                [--colours FILE] [--exclude FILE]

                                [--refcov FILE] [--catcolour FILE]

                                [--format FORMAT] [--noblobs] [--noreads] [--legend]

                                [--cumulative] [--multiplot]

                                [-h|--help]

 

    Options:

        -h --help                   show this

        -i, --infile BLOBDB         BlobDB file (created with "blobtools create")

        --lib COVLIB                Plot only certain covlib(s). Separated by ","

        --notitle                   Do not add filename as title to plot

        --filelabel                 Label axis based on filenames

        -p, --plotgroups INT        Number of (taxonomic) groups to plot, remaining

                                     groups are placed in 'other' [default: 7]

        -l, --length INT            Minimum sequence length considered for plotting [default: 100]

        --cindex                    Colour blobs by 'c index' [default: False]

        -n, --nohit                 Hide sequences without taxonomic annotation [default: False]

        -s, --noscale               Do not scale sequences by length [default: False]

        --legend                    Plot legend of blobplot in separate figure

        -m, --multiplot             Multi-plot. Print blobplot for each (taxonomic) group separately

        --cumulative                Print plot after addition of each (taxonomic) group

        --sort <ORDER>              Sort order for plotting [default: span]

                                     span  : plot with decreasing span

                                     count : plot with decreasing count

        --sort_first <L1,L2,...>    Labels that should always be plotted first, regardless of sort order

                                     ("no-hit,other,undef" is often a useful setting)

        --hist <HIST>               Data for histograms [default: span]

                                     span  : span-weighted histograms

                                     count : count histograms

        -r, --rank <RANK>           Taxonomic rank used for colouring of blobs [default: phylum]

                                     (Supported: species, genus, family, order,

                                        phylum, superkingdom)

        -x, --taxrule <TAXRULE>     Taxrule which has been used for computing taxonomy

                                     (Supported: bestsum, bestsumorder) [default: bestsum]

        --format FORMAT             Figure format for plot (png, pdf, eps, jpeg,

                                        ps, svg, svgz, tiff) [default: png]

        --noblobs                   Omit blobplot [default: False]

        --noreads                   Omit plot of reads mapping [default: False]

 

        -o, --out PREFIX            Output prefix

 

        --label GROUPS...           Relabel (taxonomic) groups, can be used several times.

                                     e.g. "A=Actinobacteria,Proteobacteria"

        --colours COLOURFILE        File containing colours for (taxonomic) groups

        --exclude GROUPS            Exclude these (taxonomic) groups (also works for 'other')

                                     e.g. "Actinobacteria,Proteobacteria,other"

        --refcov <FILE>               File containing number of "total" and "mapped" reads

                                     per coverage file. (e.g.: bam0,900,100). If provided, info

                                     will be used in read coverage plot(s).

        --catcolour <FILE>            Colour plot based on categories from FILE

                                     (format : "seq category").

 

 

 

 ラン

example/のデータを使う。

blobDBの作成。

blobtools create -i example/assembly.fna -b example/mapping_1.bam -t example/blast.out -o example/my_first_blobplot
  • -i FASTA file of assembly. Headers are split at whitespaces.
  • -b BAM file (requires samtools in $PATH)
  • -t Taxonomy file in format (qseqid taxid bitscore) (e.g. BLAST output "--outfmt '6 qseqid staxids bitscore'")
  • -s SAM file
  • -o BlobDB output prefix
  • --title Title of BlobDB
  • --db NodesDB file [default: data/nodesDB.txt]. 
  • -y Assembly program used to create FASTA. If specified, coverage will be parsed from FASTA header.  (Parsing supported for 'spades', 'soap', 'velvet', 'abyss')

入力するのは、contigのFASTAファイル、そのFASTAにリードをマッピングして作ったbamファイル、blast検索でtaxonomy情報を付けたファイルとなる。taxonomyファイルの作成方法は下の方に記載した。taxonomyファイル自体はなくてもランは可能である。日本語パスのファイルがあると動作しないので注意する。

 

 

blobDBのデータベース情報の確認。

blobtools view -i example/my_first_blobplot.blobDB.json -o example/

出力されるtableファイルは以下のような内容になっている。 

user$ cat example/my_first_blobplot.blobDB.table.txt 

## blobtools v1.0

## assembly : /Users/user/local/blobtools/example/assembly.fna

## coverage : bam0 - /Users/user/local/blobtools/example/mapping_1.bam

## taxonomy : tax0 - /Users/user/local/blobtools/example/blast.out

## nodesDB : /Users/user/local/blobtools/data/nodesDB.txt

## taxrule : bestsum

## min_score : 0.0

## min_diff : 0.0

## tax_collision_random : False

##

# name length GC N bam0 phylum.t.6 phylum.s.7 phylum.c.8

contig_1 756 0.2606 0 90.406 Actinobacteria 200.0 0

contig_2 1060 0.2623 0 168.409 Actinobacteria 2300.0 0

contig_3 602 0.2342 0 43.761 Actinobacteria 10000.0 0

contig_4 951 0.3155 0 456.313 Actinobacteria 1000.0 0

contig_5 614 0.329 0 163.557 Nematoda 2000.0 0

contig_6 216 0.1944 0 25.88 Tardigrada 4000.0 2

contig_7 4060 0.2584 0 52.312 Nematoda 2000.0 0

contig_8 2346 0.2801 0 91.742 unresolved 2000.0 1

contig_9 1599 0.2439 0 74.757 Nematoda 200.0 0

  • bam0 Coverage from bam0 (see main header for filename)
  • phylum.t.6 The assigned taxonomy of the sequence at the taxonomic rank of "phylum" under the tax-rule "best-sum"
  • phylum.s.7 The sum of scores for the taxonomy of the sequence at the taxonomic rank of "phylum" under the tax-rule "best-sum"
  • phylum.c.8 The c-index for the taxonomy of the sequence at the taxonomic rank of "phylum" under the tax-rule "best-sum"
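このtableは空白区切りのテキストなので、awkで簡単に集計できる。例えばphylumごとの合計長(blobplotのspanに相当)を求める例(仮の抜粋データを使用)。

```shell
# tableの抜粋(name length GC N cov phylum score c-index)を仮に作る
cat > table.txt <<'EOF'
contig_1 756 0.2606 0 90.406 Actinobacteria 200.0 0
contig_5 614 0.329 0 163.557 Nematoda 2000.0 0
contig_7 4060 0.2584 0 52.312 Nematoda 2000.0 0
EOF

# 6列目(phylum)ごとに2列目(length)を合計する
awk '{ span[$6] += $2 } END { for (p in span) print p, span[p] }' table.txt | sort
```

この例ではActinobacteriaが756、Nematodaが4674(614+4060)になる。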

 

テストラン

1、jsonファイル作成

blobtools create -i example/assembly.fna -b example/mapping_1.bam -t example/blast.out -o example/

my_first_blobplot.blobDB.jsonとmy_first_blobplot.mapping_1.bam.covが出力される。

 

 blobplotの作成。 

blobtools plot -i example/my_first_blobplot.blobDB.json -o example/
  • --format    Figure format for plot (png, pdf, eps, jpeg, ps, svg, svgz, tiff) [default: png]

以下の2つの図が出力される。

f:id:kazumaxneo:20170911232614p:plain

contig.faへのマッピング率とtaxonomyごとのマッピング率。

 

f:id:kazumaxneo:20170911232623p:plain

横軸がcontigのGC率、縦軸がcontigにリードを当てた時のカバレッジ。プロットの大きさがcontigの長さを表す。メタゲノムの分析にも利用できると思われる。

 

追記

カスタマイズする。階級はsuperkingdom。プロットの順番を変える(一番下に持って行きたいのを--sort_firstフラグで記載)。nohitはプロットしない( --nohit)。出力はPDF。

blobtools plot -i blobDB.json --rank superkingdom --format pdf --nohit --sort_first bacteria --multiplot
  • --multiplot   Multi-plot. Print blobplot for each (taxonomic) group separately
  • --format <FORMAT>  Figure format for plot (png, pdf, eps, jpeg,
    ps, svg, svgz, tiff) [default: png]
  • --nohit   Hide sequences without taxonomic annotation [default: False]
  • --rank <RANK> Taxonomic rank used for colouring of blobs [default: phylum]
    (Supported: species, genus, family, order, phylum, superkingdom)

 

門レベルを指定。分類ごとファイルを分け、順番に重ねる形でプロットする(--multiplot、最後のファイルには全てプロットされる)。sort countで上位20をプロットする。

blobtools plot -i blobDB.json --rank phylum --plotgroups 20 --multiplot --sort count
  • --multiplot   Multi-plot. Print blobplot for each (taxonomic) group separately
  • --plotgroups <INT> Number of (taxonomic) groups to plot, remaining
    groups are placed in 'other' [default: 7]
  • --sort <ORDER> Sort order for plotting [default: span]
    span : plot with decreasing span
    count : plot with decreasing count

 

 

taxonomyファイル(blast.out)の作成方法

1、Diamond database for UniProt Reference Proteomes

Uniprotのデータベース(download)からReference proteomesとProteome mappingをダウンロードする(link

#2022/05/02
wget https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/reference_proteomes/Reference_Proteomes_2022_01.tar.gz

2019年1月16日にダウンロードした時は2018_11が最新だった。=>現在は2022.01が最新

 

ここから先は公式マニュアル通りに行う(link)。この工程のみCentOSで作業した。

#1 Unpack protein FASTAs for each kingdom
parallel -j4 'gunzip {}' ::: $(ls | grep "fasta.gz" | grep -v 'DNA' | grep -v 'additional')

#またはGNU parallelを使わず1つずつ進める。必要なのはアミノ酸fasta
tar -zxvf Reference_Proteomes_2019_01.tar.gz

#以下は不要なので消す(消さないと結合時に巻き込まれる)
rm */*/*acc.gz
rm */*/*DNA*gz
rm */*/*additional.fasta.gz

#タンパク質配列を結合(2021.03はディレクトリ構造が変わったので修正した)
find . -maxdepth 3 -name '*.fasta.gz' |xargs cat > uniprot_ref_proteomes.fasta.gz
gzip -dv uniprot_ref_proteomes.fasta.gz

#3 Simplify sequence IDs
cat uniprot_ref_proteomes.fasta | sed -r 's/(^>sp\|)|(^>tr\|)/>/g' | cut -f1 -d"|" > temp
mv temp uniprot_ref_proteomes.fasta

#4 Subset mapping file to only contain NCBI TaxID entries
cat */*/*.idmapping.gz > idmapping.gz
gzip -dv idmapping.gz
cat idmapping | grep "NCBI_TaxID" > uniprot_ref_proteomes.taxids

#5 Make Diamond database
diamond makedb --in uniprot_ref_proteomes.fasta --taxonmap uniprot_ref_proteomes.taxids -d uniprot_ref_proteomes.diamond


#taxidsの取り込みでエラーが出たので、以下のように修正した
wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdmp.zip
unzip taxdmp.zip -d taxdmp
wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/prot.accession2taxid.gz
gzip -dv prot.accession2taxid.gz
#database作成
diamond makedb --in uniprot_ref_proteomes.fasta \
--taxonmap prot.accession2taxid \
--taxonnodes taxdmp/nodes.dmp \
-d uniprot_ref_proteomes.diamond

 

2、RNAcentral SILVA collection

# Download ID mapping file from RNAcentral release 5.0
wget ftp://ftp.ebi.ac.uk/pub/databases/RNAcentral/releases/5.0/id_mapping/id_mapping.tsv.gz

# extract SILVA sequence IDs (this is the TaxID-mapping file)
gunzip -c id_mapping.tsv.gz | grep SILVA | sort | uniq > id_mapping.SILVA.tsv

# generate a list of sequence IDs
cut -f1 id_mapping.SILVA.tsv > id_mapping.SILVA.names.txt

# Download FASTA from RNAcentral release 5.0 (contains all sequences)
wget ftp://ftp.ebi.ac.uk/pub/databases/RNAcentral/releases/5.0/sequences/rnacentral_active.fasta.gz
gunzip rnacentral_active.fasta.gz

# Subselect only SILVA rRNA sequences
blobtools seqfilter -i rnacentral_active.fasta -l id_mapping.SILVA.names.txt

# Convert rRNA to rDNA
perl -ne 'chomp; if(m/^>/){@temp=split(" "); print $temp[0]."\n";} else {$_ =~ tr/U/T/; print $_."\n"}' rnacentral_active.filtered.fna > rnacentral_active.SILVA.rDNA.fasta

# make the blastDB
makeblastdb -dbtype nucl -in rnacentral_active.SILVA.rDNA.fasta

 

データベースができたら、scaffolds.fastaをblastにかける。--outfmt 6でタブ出力、出力項目はqseqid、staxids、 bitscoreとする。 

diamond blastx --query scaffolds.fa \
--db uniprot_ref_proteomes.diamond.dmnd \
--outfmt 6 qseqid staxids bitscore \
--sensitive --max-target-seqs 1 \
--evalue 1e-25 > blast.out

これでblastのテキスト出力が得られる。しかし、taxidsがつかない配列が多数だったので、uniprot_ref_proteomes.taxidsをデータベースとして、diamondのタブ出力に手動でtaxonomy idをアサインするスクリプトを書いた。下の図は、系統IDを手動アサインして、blobtoolsで可視化した結果。
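手動アサインの中身は、実質的にはidmappingファイルをキーにしたテーブルのjoinである。diamond出力にsseqid(ヒットしたタンパク質ID)が含まれている前提での簡略スケッチを示す(ファイル名・内容はすべて仮のもの)。

```shell
# 仮のdiamond出力(qseqid, sseqid, bitscore)とidmapping由来のtaxidsファイルを作る
printf 'contig_1\tP12345\t250\ncontig_2\tQ99999\t180\n' > hits.tsv
printf 'P12345\tNCBI_TaxID\t9606\nQ99999\tNCBI_TaxID\t562\n' > uniprot_ref_proteomes.taxids

# 1ファイル目でsseqid->taxidの辞書を作り、2ファイル目のsseqidをtaxidに置き換える
awk -F'\t' 'NR==FNR { tax[$1] = $3; next } { print $1 "\t" tax[$2] "\t" $3 }' \
    uniprot_ref_proteomes.taxids hits.tsv
```

出力は blobtools create の -t に渡せる「qseqid taxid bitscore」形式になる。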

 

Mock metagenomeのアセンブリ配列分析結果

f:id:kazumaxneo:20190121203448p:plain

f:id:kazumaxneo:20190121203853p:plain

 human oral swab

f:id:kazumaxneo:20190928130544p:plain

metagenome of marburg forest soil(SRX2359704)

f:id:kazumaxneo:20190122112207j:plain

f:id:kazumaxneo:20190122112007p:plain

f:id:kazumaxneo:20190928130513p:plain

アサインできてないcontigがたくさん発生している。

 

 

 

追記

2020 7/29

シミュレートデータを使って調べてみる。

 1、大腸菌のゲノムからARTを使いショートリードを50x分発生させた。

art_illumina -ss HS25 -i GCF_000005845.2_ASM584v2_genomic.fna -p -l 150 -f 50 -m 300 -s 30 -o ecoli_

次にシロイヌナズナの4番染色体配列からショートリードを15x分発生させた。

art_illumina -ss HS25 -i Arabidopsis_thaliana.TAIR10.dna.chromosome.4.fa -p -l 150 -f 15 -m 300 -s 30 -o at_

リード数は以下の通り。リード数が全体に占める割合は、大腸菌のゲノムが45.4%、シロイヌナズナの4番染色体が54.5%となる。シロイヌナズナの方は15xしかリードを発生させていないが、染色体サイズは18.5 Mbと大きいため、リードの占める割合はシロイヌナズナが大腸菌を上回った。
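この割合は、リード量がおよそ「ゲノムサイズ×カバレッジ」に比例することから概算できる。

```shell
# ゲノムサイズ(Mb) x カバレッジからリード量の比を概算する
awk 'BEGIN {
  ecoli = 4.64 * 50     # E. coli: 4.64 Mb x 50x
  at4   = 18.5 * 15     # At chr4: 18.5 Mb x 15x
  total = ecoli + at4
  printf "E.coli %.1f%%  At_chr4 %.1f%%\n", 100 * ecoli / total, 100 * at4 / total
}'
```

概算では約45.5% : 54.5%となり、実際のリード数の比とほぼ一致する。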

Table.1

f:id:kazumaxneo:20200729174003p:plain


2、2組のペアエンドfastqをマージしてmetaspadesでDe novoアセンブリした。

metaspades.py -1 merge-1.fq -2 merge-2.fq -k 77 -o outdir -t 30

=>  contigs数は978、トータルサイズは22,746,854 bpだった。

 

3、Uniref90のデータベース(上の方と同じ手順で作成)に対してdiamond blastxで検索し(≤1e-50)、結果をblobtoolsでまとめて視覚化した。tax rankはspeciesとする。

blobtools create -i contig.fasta -b mapped.bam -t blastx-out
blobtools plot -i blobDB.json -r species

f:id:kazumaxneo:20200729175240p:plain

Fig,1

 

 

f:id:kazumaxneo:20200729175242p:plain

Fig,2

 

図の情報を保存しているblobplot.stats.txtファイルをexcelで開く。少し見栄えを良くしてある。

Table.2

f:id:kazumaxneo:20200729182620p:plain

 

A 、L、M列を拡大した。

大腸菌とシロイヌナズナのアサインは太字の5-6行目。7行目以降は誤判定となる(E value ≤ 1e-50で検索)。太字にしたM列は上の図Fig.2で使用されている。リード数が全体に占める割合は、準備のセクションに記載した通り、大腸菌のゲノムが45.4%、シロイヌナズナの4番染色体が54.5%(比率1.20)だが、表(下の図)の36.9%:43.7%(ratio 1.18)はこれに近い。つまりbam0_read_mapは、全リードのうちどれだけのリードがその分類群に当たったかを示していることになる。bam0_read_map_pはその%表記。

f:id:kazumaxneo:20200729182714p:plain

L列はマップされたリード数。元のリード数より多いのは複数回マッピングされたリードを全て+1カウントしているため。

 

E列のspan visibleはそれぞれのテンプレートサイズ(E.coli chr: 4.64Mb、At chr4: 18.6 Mb)に近い。両方ともテンプレートより1 Mb 以上小さくなっているが、これはミスアサインされたコンティグがあるため。

f:id:kazumaxneo:20200729184044p:plain

 C列のcount visibleはコンティグの数を示している。合計978コンティグあるが、そのうちの短い配列は除いてblobtoolsにかけているので902になっている。

引用

BlobTools: Interrogation of genome assemblies 

Laetsch DR, Blaxter ML

F1000Res. 2017;6:1287.

 

関連


BlobTools2

https://blobtoolkit.genomehubs.org/blobtools2/

 

アセンブリグラフを可視化するGUIツール Bandage

2018 9/19 コマンド修正

2018 11/17 文章修正

2018 12/15 インストール追記

2019 2/28 追記

2019 3/19  scaffoldsのコマンドのミス修正

2019 5/15リンク追加 

2020 3/8動画追加

2022/06/02 ツイート追加

 

Bandageはde novo assemblerのfastgファイルを入力として、アセンブリグラフを描画してくれるツール。2015年に論文が発表された。

 

 Installation

Official site

f:id:kazumaxneo:20170911215239j:plain

Download the dmg file from the official site and install by following the instructions.

bandage wiki

https://github.com/rrwick/Bandage/wiki

On Linux, it can be installed with conda and figures can be generated from the command line.

conda install -c bioconda -y bandage


$ Bandage  -h

 

  ____                  _                  

 |  _ \                | |                 

 | |_) | __ _ _ __   __| | __ _  __ _  ___ 

 |  _ < / _` | '_ \ / _` |/ _` |/ _` |/ _ \

 | |_) | (_| | | | | (_| | (_| | (_| |  __/

 |____/ \__,_|_| |_|\__,_|\__,_|\__, |\___|

                                 __/ |     

                                |___/      

Version: 0.8.1

 

Usage:    Bandage <command> [options]

          

Commands: <blank>      Launch the Bandage GUI

          load         Launch the Bandage GUI and load a graph file

          info         Display information about a graph

          image        Generate an image file of a graph

          querypaths   Output graph paths for BLAST queries

          reduce       Save a subgraph of a larger graph

          

Options:  --help       View this help message

          --helpall    View all command line settings

          --version    View Bandage version number


$ ./spades_contig_graph.py  -h

usage: spades_contig_graph.py [-h] [-p PATHS_OUT] (-c | -l)

                              graph contigs paths output

 

SPAdes Contig Graph: a tool for creating FASTG contig graphs from SPAdes

assemblies

 

positional arguments:

  graph                 The assembly_graph.fastg file made by SPAdes

  contigs               A contigs or scaffolds fasta file made by SPAdes

  paths                 The paths file which corresponds to the contigs or

                        scaffolds file

  output                The graph file made by this program

 

optional arguments:

  -h, --help            show this help message and exit

  -p PATHS_OUT, --paths_out PATHS_OUT

                        Save the paths to this file (default: do not save

                        paths)

  -c, --connection_priority

                        Prioritise graph connections over segment length

  -l, --length_priority

                        Prioritise segment length over graph connections

——

 

After downloading the program, copy it to a directory on your PATH.

sudo cp Bandage /usr/local/bin/

#launch
Bandage
#after launching, check that BLAST is available

 

When using SPAdes graph data, the format needs a small conversion with the SPAdes helper script spades_contig_graph.py. If SPAdes was installed with brew, run the following.

#length priority
./spades_contig_graph.py -l assembly_graph.fastg contigs.fasta contigs.paths output_length.fastg

#connection priority
./spades_contig_graph.py -c assembly_graph.fastg contigs.fasta contigs.paths output_connection.fastg

#to run on scaffolds instead
#connection priority
./spades_contig_graph.py -c assembly_graph.fastg scaffolds.fasta scaffolds.paths output_connection.fastg

contigs.fasta, contigs.paths, and assembly_graph.fastg are all found in the SPAdes assembly output directory; those are what the command points to. With "-p prefix", the paths file can also be saved separately.

Alternatively, run it on the scaffolds files. Feed the resulting fastg to Bandage.
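The GUI steps below can also be scripted with the CLI subcommands listed in the help above; a sketch (filenames assumed):

```shell
# print basic statistics about a graph (node count, total length, ...)
Bandage info assembly_graph.fastg

# render the graph to an image without opening the GUI
Bandage image assembly_graph.fastg graph.png
```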

 

Usage

Select the fastg file via input.

 f:id:kazumaxneo:20170911233837j:plain

Click Draw Graph and the assembly graph is rendered. 

f:id:kazumaxneo:20170911234007j:plain 

-Basic operations-

  • Zoom: pinch in/out on the trackpad or turn the mouse wheel.
  • Scroll the view: two-finger drag on the trackpad or Ctrl + drag.
  • Move a node: drag it.
  • Move multiple nodes: rubber-band select them with the mouse, then drag.
  • Reshape an edge: hold Control and drag the part of the edge to change.
  • Rotate the view: two-finger rotate on the trackpad, or Ctrl + right-click drag.
  • Locally reshape a node: Control + click.

 

Selected information can be displayed on top of the nodes.

f:id:kazumaxneo:20170911234452j:plain

 You may want to check where in the assembled contigs a marker gene sits. In that case, run a BLAST search inside Bandage: select Create/view BLAST search.

f:id:kazumaxneo:20170911234846j:plain

 Click Build BLAST database to build a BLAST database from the contig sequences.

f:id:kazumaxneo:20170911234639j:plain

  Select Load from FASTA file at the upper left and choose the fasta file with the sequence to search for. When done, click Run BLAST search at the bottom center.

 

When the search finishes, the names and regions of the hit nodes are listed, and in addition the hit regions on the nodes are highlighted in color.

f:id:kazumaxneo:20170911235520j:plain

If you have an fna file of coding regions from annotation, genes can be visualized as well.

 

Dragging to rubber-band nodes reports the total size and average depth of the selected nodes.

f:id:kazumaxneo:20181119124339j:plain

 

 To look up a contig, type the contig name into the small box at the upper right.

f:id:kazumaxneo:20170911235756j:plain

 

 

 

With Bandage it is also easy to pinpoint the location of a resistance cassette whose assembly may be fragmented. For example, the article below describes a workflow for finding cassette sequences in an assembly graph with Bandage.

https://holtlab.net/2015/07/20/locating-drug-resistance-regions-in-short-reads-using-bandage-and-ismapper/

 

Addendum

This person's walkthrough is very instructive; the video is embedded below.

 

 

Addendum


Citation

Bandage: interactive visualization of de novo genome assemblies

Wick RR, Schultz MB, Zobel J, Holt KE

Bioinformatics. 2015 Oct 15;31(20):3350-2

 

Addendum

Using assembly graphs with metagenomic datasets


2021 3/16

Finding the node that shares the most edges with other nodes

 

If, while using Bandage in a CUI environment, this error appears:

Bandage: error while loading shared libraries: libGL.so.1: cannot open shared object file: No such file or directory

apt-get install libglu1

resolves it (on Ubuntu). 

 

2022/06/02 

A fork of Bandage

Bandage-NG


qplot: a tool for evaluating mapping results

 

qplot outputs statistics for a mapping result and can produce a PDF of plots such as the difference between empirical base quality scores and the reported base quality scores computed from the mapping (a known-SNPs file is required). It is useful for typical quality checks, for example whether low-quality bases (such as barcodes) remain, or whether one sample among multiple sequencing runs has markedly poor quality. The paper was published in 2013.

 

Installation

Github

 Build from source.

cd qplot-master/
cd ../libStatGen; make cloneLib #build the required libStatGen library
cd ../qplot-master/
make #build qplot
sudo mv bin/qplot /usr/local/bin/

 qplot is created in bin/. Move it to a directory on your PATH or create a symlink.

Alternatively, download the prebuilt binary; human dbSNP files and other data are downloaded along with it.

qplot.20130627.tar.gz (File Size: 1.7G)

 

Usage

The Github discussion targets human data, but any organism can be analyzed if a control SNP database is available. For example, evaluate a bam using a SNP list from the wild-type strain.

qplot --stats qplot.stats --dbsnp gatk.vcf --reference input.fasta sorted.bam

 

Stats\BAM sorted.bam

TotalReads(e6) 0.53

MappingRate(%) 99.80

MapRate_MQpass(%) 99.80

TargetMapping(%) 0.00

ZeroMapQual(%) 2.24

MapQual<10(%) 2.26

PairedReads(%) 100.00

ProperPaired(%) 98.62

MappedBases(e9) 0.15

Q20Bases(e9) 0.15

Q20BasesPct(%) 96.74

MeanDepth 39.10

GenomeCover(%) 99.84

EPS_MSE 2.61

EPS_Cycle_Mean 26.19

GCBiasMSE 0.00

ISize_mode 383

ISize_medium 401

SecondaryRate(%) 0.00

DupRate(%) 0.00

QCFailRate(%) 0.00

BaseComp_A(%) 26.4

BaseComp_C(%) 23.6

BaseComp_G(%) 23.5

BaseComp_T(%) 26.5

BaseComp_O(%) 0.0

Depth>=1(%) 99.8

Depth>=5(%) 99.8

Depth>=10(%) 99.8

Depth>=15(%) 99.6

Depth>=25(%) 95.2

Depth>=30(%) 84.7

 

 

Output the summary plots as well.

qplot --plot qplot.pdf --stats qplot.stats --Rcode qplot.R --dbsnp gatk.vcf --reference input.fasta sorted.bam

 

f:id:kazumaxneo:20170909050645j:plain

f:id:kazumaxneo:20170909050648j:plain

 

For Illumina, empirical base quality scores and reported base quality scores are said to be nearly identical, but for platforms such as SOLiD the reported base quality scores tend to be somewhat low, perhaps because unmappable regions are trimmed before mapping.

 

 

Citation

QPLOT: A Quality Assessment Tool for Next Generation Sequencing Data

Bingshan Li, Xiaowei Zhan, Mary-Kate Wing, Paul Anderson, Hyun Min Kang, Goncalo R. Abecasis

Biomed Res Int. 2013; 2013: 865181.