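When you work with SAM/BAM files, they can end up broken for various reasons: a missing or duplicated header, an incomplete download, and so on. Picard's ValidateSamFile is a command that checks a SAM/BAM file for errors. When run, it reports where the errors were found.
From the GATK documentation: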
https://software.broadinstitute.org/gatk/documentation/article.php?id=7571
Installing picard-tools
It can be installed with brew.
brew install picard-tools
> picard ValidateSamFile
$ picard ValidateSamFile
ERROR: Option 'INPUT' is required.
USAGE: ValidateSamFile [options]
Documentation: http://broadinstitute.github.io/picard/command-line-overview.html#ValidateSamFile
Validates a SAM or BAM file.
This tool reports on the validity of a SAM or BAM file relative to the SAM format specification. This is useful for
troubleshooting errors encountered with other tools that may be caused by improper formatting, faulty alignments,
incorrect flag values, etc.
By default, the tool runs in VERBOSE mode and will exit after finding 100 errors and output them to the console
(stdout). Therefore, it is often more practical to run this tool initially using the MODE=SUMMARY option. This mode
outputs a summary table listing the numbers of all 'errors' and 'warnings'.
When fixing errors in your file, it is often useful to prioritize the severe validation errors and ignore the
errors/warnings of lesser concern. This can be done using the IGNORE and/or IGNORE_WARNINGS arguments. For helpful
suggestions on error prioritization, please follow this link to obtain additional documentation on ValidateSamFile
(https://www.broadinstitute.org/gatk/guide/article?id=7571).
After identifying and fixing your 'warnings/errors', we recommend that you rerun this tool to validate your SAM/BAM file
prior to proceeding with your downstream analysis. This will verify that all problems in your file have been addressed.
Usage example:
java -jar picard.jar ValidateSamFile \
I=input.bam \
MODE=SUMMARY
To obtain a complete list with descriptions of both 'ERROR' and 'WARNING' messages, please see our additional
documentation (https://www.broadinstitute.org/gatk/guide/article?id=7571) for this tool.
Version: 2.18.9-SNAPSHOT
Options:
--help
-h Displays options specific to this tool.
--stdhelp
-H Displays options specific to this tool AND options common to all Picard command line
tools.
--version Displays program version.
INPUT=File
I=File Input SAM/BAM file Required.
OUTPUT=File
O=File Output file or standard out if missing Default value: null.
MODE=Mode
M=Mode Mode of output Default value: VERBOSE. This option can be set to 'null' to clear the
default value. Possible values: {VERBOSE, SUMMARY}
IGNORE=Type List of validation error types to ignore. Default value: null. Possible values:
{INVALID_QUALITY_FORMAT, INVALID_FLAG_PROPER_PAIR, INVALID_FLAG_MATE_UNMAPPED,
MISMATCH_FLAG_MATE_UNMAPPED, INVALID_FLAG_MATE_NEG_STRAND, MISMATCH_FLAG_MATE_NEG_STRAND,
INVALID_FLAG_FIRST_OF_PAIR, INVALID_FLAG_SECOND_OF_PAIR,
PAIRED_READ_NOT_MARKED_AS_FIRST_OR_SECOND, INVALID_FLAG_NOT_PRIM_ALIGNMENT,
INVALID_FLAG_SUPPLEMENTARY_ALIGNMENT, INVALID_FLAG_READ_UNMAPPED, INVALID_INSERT_SIZE,
INVALID_MAPPING_QUALITY, INVALID_CIGAR, ADJACENT_INDEL_IN_CIGAR, INVALID_MATE_REF_INDEX,
MISMATCH_MATE_REF_INDEX, INVALID_REFERENCE_INDEX, INVALID_ALIGNMENT_START,
MISMATCH_MATE_ALIGNMENT_START, MATE_FIELD_MISMATCH, INVALID_TAG_NM, MISSING_TAG_NM,
MISSING_HEADER, MISSING_SEQUENCE_DICTIONARY, MISSING_READ_GROUP, RECORD_OUT_OF_ORDER,
READ_GROUP_NOT_FOUND, RECORD_MISSING_READ_GROUP, INVALID_INDEXING_BIN,
MISSING_VERSION_NUMBER, INVALID_VERSION_NUMBER, TRUNCATED_FILE,
MISMATCH_READ_LENGTH_AND_QUALS_LENGTH, EMPTY_READ, CIGAR_MAPS_OFF_REFERENCE,
MISMATCH_READ_LENGTH_AND_E2_LENGTH, MISMATCH_READ_LENGTH_AND_U2_LENGTH,
E2_BASE_EQUALS_PRIMARY_BASE, BAM_FILE_MISSING_TERMINATOR_BLOCK, UNRECOGNIZED_HEADER_TYPE,
POORLY_FORMATTED_HEADER_TAG, HEADER_TAG_MULTIPLY_DEFINED,
HEADER_RECORD_MISSING_REQUIRED_TAG, HEADER_TAG_NON_CONFORMING_VALUE, INVALID_DATE_STRING,
TAG_VALUE_TOO_LARGE, INVALID_INDEX_FILE_POINTER, INVALID_PREDICTED_MEDIAN_INSERT_SIZE,
DUPLICATE_READ_GROUP_ID, MISSING_PLATFORM_VALUE, INVALID_PLATFORM_VALUE,
DUPLICATE_PROGRAM_GROUP_ID, MATE_NOT_FOUND, MATES_ARE_SAME_END,
MISMATCH_MATE_CIGAR_STRING, MATE_CIGAR_STRING_INVALID_PRESENCE,
INVALID_UNPAIRED_MATE_REFERENCE, INVALID_UNALIGNED_MATE_START, MISMATCH_CIGAR_SEQ_LENGTH,
MISMATCH_SEQ_QUAL_LENGTH, MISMATCH_FILE_SEQ_DICT, QUALITY_NOT_STORED, DUPLICATE_SAM_TAG,
CG_TAG_FOUND_IN_ATTRIBUTES} This option may be specified 0 or more times.
MAX_OUTPUT=Integer
MO=Integer The maximum number of lines output in verbose mode Default value: 100. This option can be
set to 'null' to clear the default value.
IGNORE_WARNINGS=Boolean If true, only report errors and ignore warnings. Default value: false. This option can be
set to 'null' to clear the default value. Possible values: {true, false}
VALIDATE_INDEX=Boolean DEPRECATED. Use INDEX_VALIDATION_STRINGENCY instead. If true and input is a BAM file
with an index file, also validates the index. Until this parameter is retired VALIDATE
INDEX and INDEX_VALIDATION_STRINGENCY must agree on whether to validate the index.
Default value: true. This option can be set to 'null' to clear the default value. Possible
values: {true, false}
INDEX_VALIDATION_STRINGENCY=IndexValidationStringency
If set to anything other than IndexValidationStringency.NONE and input is a BAM file with
an index file, also validates the index at the specified stringency. Until VALIDATE_INDEX
is retired, VALIDATE INDEX and INDEX_VALIDATION_STRINGENCY must agree on whether to
validate the index. Default value: EXHAUSTIVE. This option can be set to 'null' to clear
the default value. Possible values: {EXHAUSTIVE, LESS_EXHAUSTIVE, NONE}
IS_BISULFITE_SEQUENCED=Boolean
BISULFITE=Boolean Whether the SAM or BAM file consists of bisulfite sequenced reads. If so, C->T is not
counted as an error in computing the value of the NM tag. Default value: false. This
option can be set to 'null' to clear the default value. Possible values: {true, false}
MAX_OPEN_TEMP_FILES=Integer Relevant for a coordinate-sorted file containing read pairs only. Maximum number of file
handles to keep open when spilling mate info to disk. Set this number a little lower than
the per-process maximum number of file that may be open. This number can be found by
executing the 'ulimit -n' command on a Unix system. Default value: 8000. This option can
be set to 'null' to clear the default value.
SKIP_MATE_VALIDATION=Boolean
SMV=Boolean If true, this tool will not attempt to validate mate information. In general cases, this
option should not be used. However, in cases where samples have very high duplication or
chimerism rates (> 10%), the mate validation process often requires extremely large
amounts of memory to run, so this flag allows you to forego that check. Default value:
false. This option can be set to 'null' to clear the default value. Possible values:
{true, false}
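Run
Analyze a BAM file. Omitting MODE=SUMMARY falls back to the default VERBOSE mode, which prints each problem individually, up to MAX_OUTPUT lines (100 by default).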
picard ValidateSamFile I=input.bam MODE=SUMMARY
$ picard ValidateSamFile I=~/Documents/input.sam MODE=SUMMARY
# (partly omitted)
MAX_OPEN_TEMP_FILES=8000 SKIP_MATE_VALIDATION=false VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
[Sat Aug 04 17:26:55 JST 2018] Executing as kamisakakazuma@kamisakBookpuro on Mac OS X 10.13.6 x86_64; OpenJDK 64-Bit Server VM 1.8.0_92-b15; Deflater: Intel; Inflater: Intel; Provider GCS is not available; Picard version: 2.18.9-SNAPSHOT
No errors found
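If nothing is wrong with the SAM/BAM file, "No errors found" is printed at the end. If there are problems, they are reported as WARNINGS or ERRORS.
Next, modify the read group of a BAM file to see how the program behaves.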
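The two modes combine naturally: a summary pass to see which problem types exist, then a verbose pass limited to errors. The following is a minimal sketch, assuming `picard` is on the PATH as installed above. The function name `validate_two_pass` and the cap of 50 lines are hypothetical choices for illustration; the options themselves (MODE, IGNORE_WARNINGS, MAX_OUTPUT) are taken from the help text.

```shell
# Sketch only: validate_two_pass is a hypothetical helper, not part of Picard.
validate_two_pass() {
  local input="$1"
  # Pass 1: summary table with the counts of each error and warning type.
  picard ValidateSamFile I="$input" MODE=SUMMARY
  # Pass 2: individual errors only (warnings hidden), capped at 50 lines.
  picard ValidateSamFile I="$input" MODE=VERBOSE IGNORE_WARNINGS=true MAX_OUTPUT=50
}
# Usage: validate_two_pass input.bam
```

Running the summary first keeps the console output small; the verbose pass is then only worth reading if the summary lists errors.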
samtools view -H input.bam > modify_header.sam
vi modify_header.sam
Original
@RG ID:X LB:Y SM:sample1 PL:ILLUMINA
Modified
@RG ID:X LB:Y SM:sample1 SM:sample1 PL:ILLUMINA
This case is too rare to be a good example, but as a quick test I simply duplicated the sample name tag.
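The same edit can be made without opening an editor. This is a sketch under the assumption that the header is tab-separated, as the SAM format requires, and that each @RG line carries one SM tag; the function name `duplicate_sm_tag` is hypothetical.

```shell
# Sketch only: duplicate_sm_tag is a hypothetical helper for this experiment.
# Reads a SAM header on stdin and repeats the SM tag on every @RG line.
duplicate_sm_tag() {
  awk 'BEGIN { FS = OFS = "\t" }
       /^@RG/ {
         for (i = 2; i <= NF; i++)
           if ($i ~ /^SM:/) { $i = $i OFS $i; break }  # SM:x becomes SM:x<TAB>SM:x
       }
       { print }'
}
# Usage: samtools view -H input.bam | duplicate_sm_tag > modify_header.sam
```

Lines other than @RG pass through unchanged, so the output can be handed to `samtools reheader` directly.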
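Replace the header of the original BAM with the modified header.
samtools reheader modify_header.sam input.bam > reheader.bam
Run ValidateSamFile.
picard ValidateSamFile I=reheader.bam MODE=SUMMARY
The summary reports ERROR:HEADER_TAG_MULTIPLY_DEFINED, so the tool not only detects the problem but also identifies what kind of error it is.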
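To make concrete what HEADER_TAG_MULTIPLY_DEFINED refers to in this experiment, the condition can be approximated with a few lines of awk. This is only an illustrative sketch, not how Picard implements the check; the function name `find_duplicate_header_tags` and its message format are hypothetical, and free-text @CO comment lines are skipped.

```shell
# Sketch only: a rough stand-in for the duplicated-tag check, not Picard's code.
# Reads a tab-separated SAM header on stdin and reports repeated tags per line.
find_duplicate_header_tags() {
  awk 'BEGIN { FS = "\t" }
       /^@/ && !/^@CO/ {
         split("", seen)                    # clear the per-line tag counter
         for (i = 2; i <= NF; i++) {
           tag = substr($i, 1, 2)           # two-letter tag name, e.g. SM
           if (seen[tag]++) print "line " NR ": " $1 " repeats tag " tag
         }
       }'
}
# Usage: samtools view -H reheader.bam | find_duplicate_header_tags
```

A quick pre-check like this only covers the one error type exercised here; ValidateSamFile remains the tool to run for the full set of checks listed under IGNORE.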
For how to fix individual cases, refer to the examples in the GATK tutorial.
GATK | Doc #7571 | Errors in SAM/BAM files can be diagnosed with ValidateSamFile
References