macでインフォマティクス (Informatics on Mac)

A collection of notes on informatics for HTS (NGS).

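Checking whether a sam/bam file is malformed with Picard's ValidateSamFile

When you work with sam/bam files, they can end up broken for many reasons: the header is missing or duplicated, a download was incomplete, and so on. Picard's ValidateSamFile is a command that analyzes a sam/bam file for errors. When you run it, it reports where problems were found.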
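For the incomplete-download case specifically, a quick pre-check is possible before running Picard at all. The sketch below assumes the file is a BGZF-compressed BAM and tests whether it ends with the 28-byte end-of-file marker defined in the SAM/BAM specification. The function name `has_bam_eof` is made up for this example, and this is only a rough screen, not a substitute for ValidateSamFile.

```shell
#!/bin/sh
# The 28-byte BGZF end-of-file marker from the SAM/BAM specification, as hex.
BAM_EOF_HEX="1f8b08040000000000ff0600424302001b0003000000000000000000"

# has_bam_eof FILE (hypothetical helper): succeed if FILE ends with the marker.
has_bam_eof() {
    # Dump the last 28 bytes as hex and strip the whitespace od inserts.
    tail_hex=$(tail -c 28 "$1" | od -An -v -tx1 | tr -d ' \t\n')
    [ "$tail_hex" = "$BAM_EOF_HEX" ]
}

# Usage (input.bam is a placeholder file name):
#   has_bam_eof input.bam && echo "EOF block present" || echo "possibly truncated"
```

A missing marker suggests the file was cut off; a present marker says nothing about the records themselves, which is what ValidateSamFile checks.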

 

From the GATK documentation:

https://software.broadinstitute.org/gatk/documentation/article.php?id=7571

 

Installing Picard-tools

It can be installed with brew.

brew install picard-tools

picard ValidateSamFile

$ picard ValidateSamFile

ERROR: Option 'INPUT' is required.

 

USAGE: ValidateSamFile [options]

 

Documentation: http://broadinstitute.github.io/picard/command-line-overview.html#ValidateSamFile

 

Validates a SAM or BAM file.

This tool reports on the validity of a SAM or BAM file relative to the SAM format specification.  This is useful for

troubleshooting errors encountered with other tools that may be caused by improper formatting, faulty alignments,

incorrect flag values, etc. 

 

By default, the tool runs in VERBOSE mode and will exit after finding 100 errors and output them to the console

(stdout). Therefore, it is often more practical to run this tool initially using the MODE=SUMMARY option.  This mode

outputs a summary table listing the numbers of all 'errors' and 'warnings'.

 

When fixing errors in your file, it is often useful to prioritize the severe validation errors and ignore the

errors/warnings of lesser concern.  This can be done using the IGNORE and/or IGNORE_WARNINGS arguments.  For helpful

suggestions on error prioritization, please follow this link to obtain additional documentation on ValidateSamFile

(https://www.broadinstitute.org/gatk/guide/article?id=7571).

 

After identifying and fixing your 'warnings/errors', we recommend that you rerun this tool to validate your SAM/BAM file

prior to proceeding with your downstream analysis.  This will verify that all problems in your file have been addressed.

 

Usage example:

 

java -jar picard.jar ValidateSamFile \

I=input.bam \

MODE=SUMMARY

 

To obtain a complete list with descriptions of both 'ERROR' and 'WARNING' messages, please see our additional 

documentation (https://www.broadinstitute.org/gatk/guide/article?id=7571) for this tool.

 

 

Version: 2.18.9-SNAPSHOT

 

 

Options:

 

--help

-h                            Displays options specific to this tool.

 

--stdhelp

-H                            Displays options specific to this tool AND options common to all Picard command line

                              tools.

 

--version                     Displays program version.

 

INPUT=File

I=File                        Input SAM/BAM file  Required. 

 

OUTPUT=File

O=File                        Output file or standard out if missing  Default value: null. 

 

MODE=Mode

M=Mode                        Mode of output  Default value: VERBOSE. This option can be set to 'null' to clear the

                              default value. Possible values: {VERBOSE, SUMMARY} 

 

IGNORE=Type                   List of validation error types to ignore.  Default value: null. Possible values:

                              {INVALID_QUALITY_FORMAT, INVALID_FLAG_PROPER_PAIR, INVALID_FLAG_MATE_UNMAPPED,

                              MISMATCH_FLAG_MATE_UNMAPPED, INVALID_FLAG_MATE_NEG_STRAND, MISMATCH_FLAG_MATE_NEG_STRAND,

                              INVALID_FLAG_FIRST_OF_PAIR, INVALID_FLAG_SECOND_OF_PAIR,

                              PAIRED_READ_NOT_MARKED_AS_FIRST_OR_SECOND, INVALID_FLAG_NOT_PRIM_ALIGNMENT,

                              INVALID_FLAG_SUPPLEMENTARY_ALIGNMENT, INVALID_FLAG_READ_UNMAPPED, INVALID_INSERT_SIZE,

                              INVALID_MAPPING_QUALITY, INVALID_CIGAR, ADJACENT_INDEL_IN_CIGAR, INVALID_MATE_REF_INDEX,

                              MISMATCH_MATE_REF_INDEX, INVALID_REFERENCE_INDEX, INVALID_ALIGNMENT_START,

                              MISMATCH_MATE_ALIGNMENT_START, MATE_FIELD_MISMATCH, INVALID_TAG_NM, MISSING_TAG_NM,

                              MISSING_HEADER, MISSING_SEQUENCE_DICTIONARY, MISSING_READ_GROUP, RECORD_OUT_OF_ORDER,

                              READ_GROUP_NOT_FOUND, RECORD_MISSING_READ_GROUP, INVALID_INDEXING_BIN,

                              MISSING_VERSION_NUMBER, INVALID_VERSION_NUMBER, TRUNCATED_FILE,

                              MISMATCH_READ_LENGTH_AND_QUALS_LENGTH, EMPTY_READ, CIGAR_MAPS_OFF_REFERENCE,

                              MISMATCH_READ_LENGTH_AND_E2_LENGTH, MISMATCH_READ_LENGTH_AND_U2_LENGTH,

                              E2_BASE_EQUALS_PRIMARY_BASE, BAM_FILE_MISSING_TERMINATOR_BLOCK, UNRECOGNIZED_HEADER_TYPE,

                              POORLY_FORMATTED_HEADER_TAG, HEADER_TAG_MULTIPLY_DEFINED,

                              HEADER_RECORD_MISSING_REQUIRED_TAG, HEADER_TAG_NON_CONFORMING_VALUE, INVALID_DATE_STRING,

                              TAG_VALUE_TOO_LARGE, INVALID_INDEX_FILE_POINTER, INVALID_PREDICTED_MEDIAN_INSERT_SIZE,

                              DUPLICATE_READ_GROUP_ID, MISSING_PLATFORM_VALUE, INVALID_PLATFORM_VALUE,

                              DUPLICATE_PROGRAM_GROUP_ID, MATE_NOT_FOUND, MATES_ARE_SAME_END,

                              MISMATCH_MATE_CIGAR_STRING, MATE_CIGAR_STRING_INVALID_PRESENCE,

                              INVALID_UNPAIRED_MATE_REFERENCE, INVALID_UNALIGNED_MATE_START, MISMATCH_CIGAR_SEQ_LENGTH,

                              MISMATCH_SEQ_QUAL_LENGTH, MISMATCH_FILE_SEQ_DICT, QUALITY_NOT_STORED, DUPLICATE_SAM_TAG,

                              CG_TAG_FOUND_IN_ATTRIBUTES} This option may be specified 0 or more times. 

 

MAX_OUTPUT=Integer

MO=Integer                    The maximum number of lines output in verbose mode  Default value: 100. This option can be

                              set to 'null' to clear the default value. 

 

IGNORE_WARNINGS=Boolean       If true, only report errors and ignore warnings.  Default value: false. This option can be

                              set to 'null' to clear the default value. Possible values: {true, false} 

 

VALIDATE_INDEX=Boolean        DEPRECATED.  Use INDEX_VALIDATION_STRINGENCY instead.  If true and input is a BAM file

                              with an index file, also validates the index.  Until this parameter is retired VALIDATE

                              INDEX and INDEX_VALIDATION_STRINGENCY must agree on whether to validate the index. 

                              Default value: true. This option can be set to 'null' to clear the default value. Possible

                              values: {true, false} 

 

INDEX_VALIDATION_STRINGENCY=IndexValidationStringency

                              If set to anything other than IndexValidationStringency.NONE and input is a BAM file with

                              an index file, also validates the index at the specified stringency. Until VALIDATE_INDEX

                              is retired, VALIDATE INDEX and INDEX_VALIDATION_STRINGENCY must agree on whether to

                              validate the index.  Default value: EXHAUSTIVE. This option can be set to 'null' to clear

                              the default value. Possible values: {EXHAUSTIVE, LESS_EXHAUSTIVE, NONE} 

 

IS_BISULFITE_SEQUENCED=Boolean

BISULFITE=Boolean             Whether the SAM or BAM file consists of bisulfite sequenced reads. If so, C->T is not

                              counted as an error in computing the value of the NM tag.  Default value: false. This

                              option can be set to 'null' to clear the default value. Possible values: {true, false} 

 

MAX_OPEN_TEMP_FILES=Integer   Relevant for a coordinate-sorted file containing read pairs only. Maximum number of file

                              handles to keep open when spilling mate info to disk. Set this number a little lower than

                              the per-process maximum number of file that may be open. This number can be found by

                              executing the 'ulimit -n' command on a Unix system.  Default value: 8000. This option can

                              be set to 'null' to clear the default value. 

 

SKIP_MATE_VALIDATION=Boolean

SMV=Boolean                   If true, this tool will not attempt to validate mate information. In general cases, this

                              option should not be used.  However, in cases where samples have very high duplication or

                              chimerism rates (> 10%), the mate validation process often requires extremely large

                              amounts of memory to run, so this flag allows you to forego that check.  Default value:

                              false. This option can be set to 'null' to clear the default value. Possible values:

                              {true, false} 

 

 

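Run

Analyze a bam file. Without MODE=SUMMARY, the tool runs in the default VERBOSE mode and prints each problematic record individually (up to MAX_OUTPUT lines, 100 by default).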

picard ValidateSamFile I=input.bam MODE=SUMMARY 

$ picard ValidateSamFile I=~/Documents/input.sam MODE=SUMMARY

# partially omitted

MAX_OPEN_TEMP_FILES=8000 SKIP_MATE_VALIDATION=false VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false

[Sat Aug 04 17:26:55 JST 2018] Executing as kamisakakazuma@kamisakBookpuro on Mac OS X 10.13.6 x86_64; OpenJDK 64-Bit Server VM 1.8.0_92-b15; Deflater: Intel; Inflater: Intel; Provider GCS is not available; Picard version: 2.18.9-SNAPSHOT

No errors found

 

If no problems are found in the sam/bam file, "No errors found" is printed at the end. If there are problems, they are reported as WARNING or ERROR entries.
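When validating many files, it can help to save the report with O= and classify it in a script. The following is a minimal sketch, not part of Picard: `summarize_report` is a hypothetical helper, and it assumes that the saved SUMMARY report either contains the line "No errors found" or contains tab-separated rows of the form `ERROR:TYPE<TAB>COUNT` and `WARNING:TYPE<TAB>COUNT`. Check this against the output of your Picard version before relying on it.

```shell
#!/bin/sh
# summarize_report FILE (hypothetical helper): classify a saved SUMMARY report.
# Assumed row format: "ERROR:TYPE<TAB>COUNT" or "WARNING:TYPE<TAB>COUNT".
summarize_report() {
    if grep -q "No errors found" "$1"; then
        echo "clean"
        return 0
    fi
    # Sum the count column separately for ERROR and WARNING rows.
    awk -F '\t' '
        /^ERROR:/   { errors   += $2 }
        /^WARNING:/ { warnings += $2 }
        END { printf "errors=%d warnings=%d\n", errors, warnings }
    ' "$1"
}

# Usage (report.txt is a placeholder file name):
#   picard ValidateSamFile I=input.bam O=report.txt MODE=SUMMARY
#   summarize_report report.txt
```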

 

 

Next, modify the read group in the bam header to see how the program behaves.

samtools view -H input.bam > modify_header.sam
vi modify_header.sam

Original

@RG     ID:X    LB:Y    SM:sample1      PL:ILLUMINA

After modification

@RG     ID:X    LB:Y    SM:sample1      SM:sample1      PL:ILLUMINA

This is too rare a case to be a good example, but as a quick test I simply duplicated the sample name.
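The same edit can be made without opening an editor. As a sketch, the hypothetical filter `duplicate_sm_tag` below reads header lines on standard input and repeats the SM tag on @RG lines only, assuming the header fields are tab-separated as in the SAM format.

```shell
#!/bin/sh
# duplicate_sm_tag (hypothetical helper): repeat the SM tag on every @RG line.
# Reads SAM header lines on stdin and writes the modified header to stdout.
duplicate_sm_tag() {
    awk 'BEGIN { FS = OFS = "\t" }
         $1 == "@RG" {
             for (i = 2; i <= NF; i++)
                 if ($i ~ /^SM:/) $i = $i OFS $i   # emit the SM field twice
         }
         { print }'
}

# Usage (file names are placeholders):
#   samtools view -H input.bam | duplicate_sm_tag > modify_header.sam
```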

Replace the header of the original bam with the modified header.

samtools reheader modify_header.sam input.bam > reheader.bam

Run ValidateSamFile.

picard ValidateSamFile I=reheader.bam MODE=SUMMARY

The error is detected, including its specific type: ERROR:HEADER_TAG_MULTIPLY_DEFINED.
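To see what this error type means in practice, here is a small sketch that scans header lines for a two-letter tag appearing more than once on the same line. It only imitates the idea behind HEADER_TAG_MULTIPLY_DEFINED and is not how Picard implements the check; `find_duplicate_header_tags` is a name invented for this example, and @CO comment lines are skipped because they carry free text.

```shell
#!/bin/sh
# find_duplicate_header_tags (hypothetical helper): report repeated tags per header line.
# Reads SAM header lines on stdin; prints one message per repeated tag.
find_duplicate_header_tags() {
    awk -F '\t' '
        /^@/ && $1 != "@CO" {
            split("", seen)                 # reset the per-line tag table
            for (i = 2; i <= NF; i++) {
                tag = substr($i, 1, 2)      # tags are two characters, e.g. SM
                if (seen[tag]++)
                    printf "line %d: %s tag %s defined more than once\n", NR, $1, tag
            }
        }'
}

# Usage (file name is a placeholder):
#   samtools view -H reheader.bam | find_duplicate_header_tags
```

An unmodified header produces no output, so an empty result means no repeated tags were seen.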

 

For how to fix individual cases, please refer to the examples in the GATK tutorial.

GATK | Doc #7571 | Errors in SAM/BAM files can be diagnosed with ValidateSamFile

 

Reference

Picard Tools - By Broad Institute