--min-overlap-size Minimum expected overlap. Default is 15.
--max-num-mismatches Maximum number of mismatches at the overlapped region to retain the pair. The default behavior relies on the `-P` parameter and does not pay attention to the number of mismatches at the overlapped region.
-P Any merged sequence with P below the declared value is discarded and stored in a separate file.
--min-qual-score Minimum Q-score for a base to overwrite a mismatch at the overlapped region. If there is a mismatch at the overlapped region, the base with the higher quality is used in the final sequence. However, if the Q-score of the higher-quality base is lower than the Q-score declared with this parameter, that base is marked as ambiguous, which may result in the elimination of the merged sequence depending on the --ignore-Ns parameter. The default value is 15.
--retain-only-overlap When set, merger will only return the parts of reads that do overlap, and parts of reads that do not overlap will be trimmed.
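The mismatch rule described for --min-qual-score can be sketched in a few lines of Python. This is only an illustration of the rule as worded above, not the merger's actual code; the function name `resolve_mismatch` and the tie-breaking choice are assumptions.

```python
def resolve_mismatch(base1, q1, base2, q2, min_qual_score=15):
    """Pick the base to keep at a mismatching position in the overlap.

    base1/q1 and base2/q2 are the base call and Q-score from each read.
    """
    # Assumption: on a quality tie the base from the first read is kept.
    best_base, best_q = (base1, q1) if q1 >= q2 else (base2, q2)
    # If even the better base is below the threshold, mark it ambiguous.
    return best_base if best_q >= min_qual_score else "N"
```

A resulting "N" is what may later cause the merged sequence to be dropped, depending on --ignore-Ns.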
# running makeblastdb with /Users/kazu/Documents/get_homologues-macosx-20200226/sample_transcripts_fasta_est_homologues/Esterel.trinity.fna.bz2.nucl.fasta
# running makeblastdb with /Users/kazu/Documents/get_homologues-macosx-20200226/sample_transcripts_fasta_est_homologues/Franka.trinity.fna.bz2.nucl.fasta
# running makeblastdb with /Users/kazu/Documents/get_homologues-macosx-20200226/sample_transcripts_fasta_est_homologues/Hs_Turkey-19-24.trinity.fna.bz2.nucl.fasta
# running makeblastdb with /Users/kazu/Documents/get_homologues-macosx-20200226/sample_transcripts_fasta_est_homologues/flcdnas_Hnijo.fna.gz.nucl.fasta
Citation: Analysis of Plant Pan-Genomes and Transcriptomes with GET_HOMOLOGUES-EST, a Clustering Solution for Sequences of the Same Species. Bruno Contreras-Moreira, Carlos P. Cantalapiedra, María J. García-Pereira, Sean P. Gordon, John P. Vogel, Ernesto Igartua, Ana M. Casas, Pablo Vinuesa
Front Plant Sci. 2017; 8: 184. Published online 2017 Feb 14
GET_HOMOLOGUES, a Versatile Software Package for Scalable and Robust Microbial Pangenome Analysis
DIAMOND v2.0.7 now supports full-matrix Smith Waterman extensions (vectorized using the SWIPE algorithm) and the new extended taxonomy mapping file from NCBI. https://t.co/YtVTQlDicf
--in Path to the input protein reference database file in FASTA format (may be gzip compressed). If this parameter is omitted, the input will be read from stdin
--taxonnodes Path to the nodes.dmp file from the NCBI taxonomy. This parameter is optional and needs to be supplied in order to provide taxonomy features. The file is contained within this archive downloadable at NCBI: ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdmp.zip.
DIAMOND v2 is here! Check out this paper from @bbuchfink & @HajkDrost in our Department. Instead of "we blasted gazillions of genomes, which took several days" it will now be "we diamonded gazillions of genomes during the coffee break." https://t.co/THmCxR2aeo
We introduce two new sensitivity modes: --very-sensitive and --ultra-sensitive allowing users to match the alignment sensitivity levels of BLAST while maintaining superior computational speed up to 360x. #Bioinformatics #Genomics #Phylogenomics pic.twitter.com/4IvXfiDT7y
Together with an optimized HPC and cloud-computing infrastructure, DIAMOND can now scale with the demands of ongoing bulk-sequencing efforts and exponentially growing genome assembly databases to facilitate massive comparative genomics efforts. #ERGA pic.twitter.com/zPkuHooama
string: earliest genome index version compatible with this STAR release. Please do not change this value!
### Parameter Files
parametersFiles -
string: name of a user-defined parameters file, "-": none. Can only be defined on the command line.
### System
sysShell -
string: path to the shell binary, preferably bash, e.g. /bin/bash.
- ... the default shell is executed, typically /bin/sh. This was reported to fail on some Ubuntu systems - then you need to specify path to bash.
### Run Parameters
runMode alignReads
string: type of the run.
alignReads ... map reads
genomeGenerate ... generate genome files
inputAlignmentsFromBAM ... input alignments from BAM. Presently only works with --outWigType and --bamRemoveDuplicates.
liftOver ... lift-over of GTF files (--sjdbGTFfile) between genome assemblies using chain file(s) from --genomeChainFiles.
runThreadN 1
int: number of threads to run STAR
runDirPerm User_RWX
string: permissions for the directories created at the run-time.
User_RWX ... user-read/write/execute
All_RWX ... all-read/write/execute (same as chmod 777)
runRNGseed 777
int: random number generator seed.
### Genome Parameters
genomeDir ./GenomeDir/
string: path to the directory where genome files are stored (for --runMode alignReads) or will be generated (for --runMode genomeGenerate)
genomeLoad NoSharedMemory
string: mode of shared memory usage for the genome files. Only used with --runMode alignReads.
LoadAndKeep ... load genome into shared memory and keep it in memory after run
LoadAndRemove ... load genome into shared memory but remove it after run
LoadAndExit ... load genome into shared memory and exit, keeping the genome in memory for future runs
Remove ... do not map anything, just remove loaded genome from memory
NoSharedMemory ... do not use shared memory, each job will have its own private copy of the genome
genomeFastaFiles -
string(s): path(s) to the fasta files with the genome sequences, separated by spaces. These files should be plain text FASTA files, they *cannot* be zipped.
Required for the genome generation (--runMode genomeGenerate). Can also be used in the mapping (--runMode alignReads) to add extra (new) sequences to the genome (e.g. spike-ins).
genomeChainFiles -
string: chain files for genomic liftover. Only used with --runMode liftOver.
genomeFileSizes 0
uint(s)>0: genome files exact sizes in bytes. Typically, this should not be defined by the user.
genomeConsensusFile -
string: VCF file with consensus SNPs (i.e. alternative allele is the major (AF>0.5) allele)
### Genome Indexing Parameters - only used with --runMode genomeGenerate
genomeChrBinNbits 18
int: =log2(chrBin), where chrBin is the size of the bins for genome storage: each chromosome will occupy an integer number of bins. For a genome with large number of contigs, it is recommended to scale this parameter as min(18, log2[max(GenomeLength/NumberOfReferences,ReadLength)]).
genomeSAindexNbases 14
int: length (bases) of the SA pre-indexing string. Typically between 10 and 15. Longer strings will use much more memory, but allow faster searches. For small genomes, the parameter --genomeSAindexNbases must be scaled down to min(14, log2(GenomeLength)/2 - 1).
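The two scaling formulas quoted above for --genomeChrBinNbits and --genomeSAindexNbases can be evaluated with a small helper. This is a sketch only: the function names are hypothetical, and the way the result is turned into an integer (nearest integer for the first, truncation for the second) is an assumption, since the text only gives the real-valued expressions.

```python
import math


def genome_chr_bin_nbits(genome_length, n_references, read_length):
    """min(18, log2[max(GenomeLength/NumberOfReferences, ReadLength)])."""
    size = max(genome_length / n_references, read_length)
    # Assumption: round to the nearest integer because the option is an int.
    return min(18, round(math.log2(size)))


def genome_sa_index_nbases(genome_length):
    """min(14, log2(GenomeLength)/2 - 1)."""
    # Assumption: truncate the fractional part to obtain an int.
    return min(14, int(math.log2(genome_length) / 2 - 1))


if __name__ == "__main__":
    # A 3 Gb assembly split into 100,000 scaffolds, 100-base reads.
    print(genome_chr_bin_nbits(3_000_000_000, 100_000, 100))
    # A small 100 kb genome.
    print(genome_sa_index_nbases(100_000))
```

For a large genome with few chromosomes both functions simply return the defaults (18 and 14).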
genomeSAsparseD 1
int>0: suffix array sparsity, i.e. distance between indices: use bigger numbers to decrease needed RAM at the cost of mapping speed reduction
genomeSuffixLengthMax -1
int: maximum length of the suffixes, has to be longer than read length. -1 = infinite.
### Splice Junctions Database
sjdbFileChrStartEnd -
string(s): path to the files with genomic coordinates (chr <tab> start <tab> end <tab> strand) for the splice junction introns. Multiple files can be supplied and will be concatenated.
sjdbGTFfile -
string: path to the GTF file with annotations
sjdbGTFchrPrefix -
string: prefix for chromosome names in a GTF file (e.g. 'chr' for using ENSEMBL annotations with UCSC genomes)
sjdbGTFfeatureExon exon
string: feature type in GTF file to be used as exons for building transcripts
sjdbGTFtagExonParentTranscript transcript_id
string: GTF attribute name for parent transcript ID (default "transcript_id" works for GTF files)
sjdbGTFtagExonParentGene gene_id
string: GTF attribute name for parent gene ID (default "gene_id" works for GTF files)
SAM SE ... SAM or BAM single-end reads; for BAM use --readFilesCommand samtools view
SAM PE ... SAM or BAM paired-end reads; for BAM use --readFilesCommand samtools view
readFilesIn Read1 Read2
string(s): paths to files that contain input read1 (and, if needed, read2)
readFilesPrefix -
string: prefix for the read files names, i.e. it will be added in front of the strings in --readFilesIn
-: no prefix
readFilesCommand -
string(s): command line to execute for each of the input file. This command should generate FASTA or FASTQ text and send it to stdout
For example: zcat - to uncompress .gz files, bzcat - to uncompress .bz2 files, etc.
readMapNumber -1
int: number of reads to map from the beginning of the file
-1: map all reads
readMatesLengthsIn NotEqual
string: Equal/NotEqual - lengths of names, sequences, qualities for both mates are the same / not the same. NotEqual is safe in all situations.
readNameSeparator /
string(s): character(s) separating the part of the read names that will be trimmed in output (read name after space is always trimmed)
readQualityScoreBase 33
int>=0: number to be subtracted from the ASCII code to get Phred quality score
clip3pNbases 0
int(s): number(s) of bases to clip from 3p of each mate. If one value is given, it will be assumed the same for both mates.
clip5pNbases 0
int(s): number(s) of bases to clip from 5p of each mate. If one value is given, it will be assumed the same for both mates.
clip3pAdapterSeq -
string(s): adapter sequences to clip from 3p of each mate. If one value is given, it will be assumed the same for both mates.
clip3pAdapterMMp 0.1
double(s): max proportion of mismatches for 3p adapter clipping for each mate. If one value is given, it will be assumed the same for both mates.
clip3pAfterAdapterNbases 0
int(s): number of bases to clip from 3p of each mate after the adapter clipping. If one value is given, it will be assumed the same for both mates.
### Limits
limitGenomeGenerateRAM 31000000000
int>0: maximum available RAM (bytes) for genome generation
limitIObufferSize 150000000
int>0: max available buffers size (bytes) for input/output, per thread
limitOutSAMoneReadBytes 100000
int>0: max size of the SAM record (bytes) for one read. Recommended value: >2*(LengthMate1+LengthMate2+100)*outFilterMultimapNmax
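As a worked example of the recommendation for --limitOutSAMoneReadBytes, the lower bound can be computed from the mate lengths and --outFilterMultimapNmax. The helper name is hypothetical; the arithmetic is just the expression quoted above.

```python
def min_out_sam_one_read_bytes(len_mate1, len_mate2, out_filter_multimap_nmax=10):
    """Lower bound suggested for --limitOutSAMoneReadBytes.

    Implements 2*(LengthMate1+LengthMate2+100)*outFilterMultimapNmax;
    the actual setting should be larger than this value.
    """
    return 2 * (len_mate1 + len_mate2 + 100) * out_filter_multimap_nmax


if __name__ == "__main__":
    # 2x100 paired-end reads with the default outFilterMultimapNmax of 10.
    print(min_out_sam_one_read_bytes(100, 100))
```

For 2x100 reads and the default multimapper limit the bound is 6000 bytes, well below the default of 100000.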
limitOutSJoneRead 1000
int>0: max number of junctions for one read (including all multi-mappers)
limitOutSJcollapsed 1000000
int>0: max number of collapsed junctions
limitBAMsortRAM 0
int>=0: maximum available RAM (bytes) for sorting BAM. If =0, it will be set to the genome index size. 0 value can only be used with --genomeLoad NoSharedMemory option.
limitSjdbInsertNsj 1000000
int>=0: maximum number of junction to be inserted to the genome on the fly at the mapping stage, including those from annotations and those detected in the 1st step of the 2-pass run
limitNreadsSoft -1
int: soft limit on the number of reads
### Output: general
outFileNamePrefix ./
string: output files name prefix (including full or relative path). Can only be defined on the command line.
outTmpDir -
string: path to a directory that will be used as temporary by STAR. All contents of this directory will be removed!
- the temp directory will default to outFileNamePrefix_STARtmp
outTmpKeep None
string: whether to keep the temporary files after the STAR run is finished
None ... remove all temporary files
All .. keep all files
outStd Log
string: which output will be directed to stdout (standard out)
Log ... log messages
SAM ... alignments in SAM format (which normally are output to Aligned.out.sam file), normal standard output will go into Log.std.out
BAM_Unsorted ... alignments in BAM format, unsorted. Requires --outSAMtype BAM Unsorted
BAM_SortedByCoordinate ... alignments in BAM format, sorted by coordinate. Requires --outSAMtype BAM SortedByCoordinate
BAM_Quant ... alignments to transcriptome in BAM format, unsorted. Requires --quantMode TranscriptomeSAM
outReadsUnmapped None
string: output of unmapped and partially mapped (i.e. mapped only one mate of a paired end read) reads in separate file(s).
None ... no output
Fastx ... output in separate fasta/fastq files, Unmapped.out.mate1/2
outQSconversionAdd 0
int: add this number to the quality score (e.g. to convert from Illumina to Sanger, use -31)
outMultimapperOrder Old_2.4
string: order of multimapping alignments in the output files
Old_2.4 ... quasi-random order used before 2.5.0
Random ... random order of alignments for each multi-mapper. Read mates (pairs) are always adjacent, all alignments for each read stay together. This option will become default in the future releases.
### Output: SAM and BAM
outSAMtype SAM
strings: type of SAM/BAM output
1st word:
BAM ... output BAM without sorting
SAM ... output SAM without sorting
None ... no SAM/BAM output
2nd, 3rd:
Unsorted ... standard unsorted
SortedByCoordinate ... sorted by coordinate. This option will allocate extra memory for sorting which can be specified by --limitBAMsortRAM.
outSAMmode Full
string: mode of SAM output
None ... no SAM output
Full ... full SAM output
NoQS ... full SAM but without quality scores
outSAMstrandField None
string: Cufflinks-like strand field flag
None ... not used
intronMotif ... strand derived from the intron motif. Reads with inconsistent and/or non-canonical introns are filtered out.
outSAMattributes Standard
string: a string of desired SAM attributes, in the order desired for the output SAM
NH HI AS nM NM MD jM jI XS MC ch ... any combination in any order
None ... no attributes
Standard ... NH HI AS nM
All ... NH HI AS nM NM MD jM jI MC ch
vA ... variant allele
vG ... genomic coordinate of the variant overlapped by the read
vW ... 0/1 - alignment does not pass / passes WASP filtering. Requires --waspOutputMode SAMtag
STARsolo:
CR CY UR UY ... sequences and quality scores of cell barcodes and UMIs for the solo* demultiplexing
CB UB ... error-corrected cell barcodes and UMIs for solo* demultiplexing. Requires --outSAMtype BAM SortedByCoordinate.
sM ... assessment of CB and UMI
sS ... sequence of the entire barcode (CB,UMI,adapter...)
int>=0: start value for the IH attribute. 0 may be required by some downstream software, such as Cufflinks or StringTie.
outSAMunmapped None
string(s): output of unmapped reads in the SAM format
1st word:
None ... no output
Within ... output unmapped reads within the main SAM file (i.e. Aligned.out.sam)
2nd word:
KeepPairs ... record unmapped mate for each alignment, and, in case of unsorted output, keep it adjacent to its mapped mate. Only affects multi-mapping reads.
outSAMorder Paired
string: type of sorting for the SAM output
Paired: one mate after the other for all paired alignments
PairedKeepInputOrder: one mate after the other for all paired alignments, the order is kept the same as in the input FASTQ files
outSAMprimaryFlag OneBestScore
string: which alignments are considered primary - all others will be marked with 0x100 bit in the FLAG
OneBestScore ... only one alignment with the best score is primary
AllBestScore ... all alignments with the best score are primary
outSAMreadID Standard
string: read ID record type
Standard ... first word (until space) from the FASTx read ID line, removing /1,/2 from the end
int: 0 to 65535: sam FLAG will be bitwise OR'd with this value, i.e. FLAG=FLAG | outSAMflagOR. This is applied after all flags have been set by STAR, and after outSAMflagAND. Can be used to set specific bits that are not set otherwise.
outSAMflagAND 65535
int: 0 to 65535: sam FLAG will be bitwise AND'd with this value, i.e. FLAG=FLAG & outSAMflagAND. This is applied after all flags have been set by STAR, but before outSAMflagOR. Can be used to unset specific bits that would otherwise be set.
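The order of the two FLAG masks (AND first, then OR) can be illustrated with plain bit arithmetic. The function below is a hypothetical sketch of that order of operations, not code taken from STAR.

```python
def apply_flag_masks(flag, flag_and=65535, flag_or=0):
    """Apply the AND mask first, then the OR mask, to a SAM FLAG value."""
    return (flag & flag_and) | flag_or


if __name__ == "__main__":
    # Clear the 0x400 bit from a FLAG by dropping it from the AND mask.
    print(apply_flag_masks(1123, flag_and=65535 ^ 0x400))
    # Force the 0x200 bit on with the OR mask.
    print(apply_flag_masks(99, flag_or=0x200))
```

With the defaults (AND mask 65535, and an OR mask of 0 assumed here) the FLAG passes through unchanged.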
outSAMattrRGline -
string(s): SAM/BAM read group line. The first word contains the read group identifier and must start with "ID:", e.g. --outSAMattrRGline ID:xxx CN:yy "DS:z z z".
xxx will be added as RG tag to each output alignment. Any spaces in the tag values have to be double quoted.
Comma separated RG lines correspond to different (comma separated) input files in --readFilesIn. Commas have to be surrounded by spaces, e.g.
strings: extra @PG (software) line of the SAM header (in addition to STAR)
outSAMheaderCommentFile -
string: path to the file with @CO (comment) lines of the SAM header
outSAMfilter None
string(s): filter the output into main SAM/BAM files
KeepOnlyAddedReferences ... only keep the reads for which all alignments are to the extra reference sequences added with --genomeFastaFiles at the mapping stage.
KeepAllAddedReferences ... keep all alignments to the extra reference sequences added with --genomeFastaFiles at the mapping stage.
outSAMmultNmax -1
int: max number of multiple alignments for a read that will be output to the SAM/BAM files.
-1 ... all alignments (up to --outFilterMultimapNmax) will be output
outSAMtlen 1
int: calculation method for the TLEN field in the SAM/BAM files
1 ... leftmost base of the (+)strand mate to rightmost base of the (-)mate. (+)sign for the (+)strand mate
2 ... leftmost base of any mate to rightmost base of any mate. (+)sign for the mate with the leftmost base. This is different from 1 for overlapping mates with protruding ends
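The difference between the two TLEN methods only shows up when mates overlap with protruding ends. The sketch below follows the wording of the two options for a single +/- pair; the `Mate` tuple, the function name and the tie rule for equal start positions are assumptions made for illustration.

```python
from collections import namedtuple

# 1-based inclusive alignment coordinates plus strand ("+" or "-").
Mate = namedtuple("Mate", ["start", "end", "strand"])


def tlen_pair(mate1, mate2, method=1):
    """Return (TLEN for mate1, TLEN for mate2) for one +/- mate pair."""
    if method == 1:
        # Leftmost base of the (+) mate to rightmost base of the (-) mate.
        plus, minus = (mate1, mate2) if mate1.strand == "+" else (mate2, mate1)
        span = minus.end - plus.start + 1
        mate1_positive = plus is mate1
    elif method == 2:
        # Leftmost base of any mate to rightmost base of any mate.
        span = max(mate1.end, mate2.end) - min(mate1.start, mate2.start) + 1
        # Assumption: on equal start positions, mate1 takes the + sign.
        mate1_positive = mate1.start <= mate2.start
    else:
        raise ValueError("method must be 1 or 2")
    return (span, -span) if mate1_positive else (-span, span)


if __name__ == "__main__":
    # Overlapping mates with protruding ends: the (-) mate starts first.
    m1, m2 = Mate(100, 220, "+"), Mate(90, 200, "-")
    print(tlen_pair(m1, m2, method=1), tlen_pair(m1, m2, method=2))
```

For mates that do not protrude past each other, both methods give the same value.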
Normal ... standard filtering using only current alignment
BySJout ... keep only those reads that contain junctions that passed filtering into SJ.out.tab
outFilterMultimapScoreRange 1
int: the score range below the maximum score for multimapping alignments
outFilterMultimapNmax 10
int: maximum number of loci the read is allowed to map to. Alignments (all of them) will be output only if the read maps to no more loci than this value.
Otherwise no alignments will be output, and the read will be counted as "mapped to too many loci" in the Log.final.out .
outFilterMismatchNmax 10
int: alignment will be output only if it has no more mismatches than this value.
outFilterMismatchNoverLmax 0.3
real: alignment will be output only if its ratio of mismatches to *mapped* length is less than or equal to this value.
outFilterMismatchNoverReadLmax 1.0
real: alignment will be output only if its ratio of mismatches to *read* length is less than or equal to this value.
outFilterScoreMin 0
int: alignment will be output only if its score is higher than or equal to this value.
outFilterScoreMinOverLread 0.66
real: same as outFilterScoreMin, but normalized to read length (sum of mates' lengths for paired-end reads)
outFilterMatchNmin 0
int: alignment will be output only if the number of matched bases is higher than or equal to this value.
outFilterMatchNminOverLread 0.66
real: same as outFilterMatchNmin, but normalized to the read length (sum of mates' lengths for paired-end reads).
outFilterIntronMotifs None
string: filter alignment using their motifs
None ... no filtering
RemoveNoncanonical ... filter out alignments that contain non-canonical junctions
RemoveNoncanonicalUnannotated ... filter out alignments that contain non-canonical unannotated junctions when using annotated splice junctions database. The annotated non-canonical junctions will be kept.
outFilterIntronStrands RemoveInconsistentStrands
string: filter alignments
RemoveInconsistentStrands ... remove alignments that have junctions with inconsistent strands
None ... no filtering
### Output Filtering: Splice Junctions
outSJfilterReads All
string: which reads to consider for collapsed splice junctions output
All: all reads, unique- and multi-mappers
Unique: uniquely mapping reads only
outSJfilterOverhangMin 30 12 12 12
4 integers: minimum overhang length for splice junctions on both sides for: (1) non-canonical motifs, (2) GT/AG and CT/AC motif, (3) GC/AG and CT/GC motif, (4) AT/AC and GT/AT motif. -1 means no output for that motif
does not apply to annotated junctions
outSJfilterCountUniqueMin 3 1 1 1
4 integers: minimum uniquely mapping read count per junction for: (1) non-canonical motifs, (2) GT/AG and CT/AC motif, (3) GC/AG and CT/GC motif, (4) AT/AC and GT/AT motif. -1 means no output for that motif
Junctions are output if one of outSJfilterCountUniqueMin OR outSJfilterCountTotalMin conditions are satisfied
does not apply to annotated junctions
outSJfilterCountTotalMin 3 1 1 1
4 integers: minimum total (multi-mapping+unique) read count per junction for: (1) non-canonical motifs, (2) GT/AG and CT/AC motif, (3) GC/AG and CT/GC motif, (4) AT/AC and GT/AT motif. -1 means no output for that motif
Junctions are output if one of outSJfilterCountUniqueMin OR outSJfilterCountTotalMin conditions are satisfied
does not apply to annotated junctions
outSJfilterDistToOtherSJmin 10 0 5 10
4 integers>=0: minimum allowed distance to other junctions' donor/acceptor
does not apply to annotated junctions
outSJfilterIntronMaxVsReadN 50000 100000 200000
N integers>=0: maximum gap allowed for junctions supported by 1,2,3,...,N reads
i.e. by default junctions supported by 1 read can have gaps <=50000b, by 2 reads: <=100000b, by 3 reads: <=200000. by >=4 reads any gap <=alignIntronMax
does not apply to annotated junctions
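The --outSJfilterIntronMaxVsReadN rule amounts to a lookup by the number of supporting reads, falling back to alignIntronMax once the list is exhausted. A minimal sketch, with a hypothetical function name and the limits passed in explicitly:

```python
def max_allowed_gap(n_reads, align_intron_max, thresholds=(50000, 100000, 200000)):
    """Largest junction gap allowed for a junction supported by n_reads reads.

    thresholds[i] is the limit for junctions supported by i+1 reads;
    with more reads than thresholds, only align_intron_max applies.
    """
    if n_reads < 1:
        raise ValueError("a junction needs at least one supporting read")
    if n_reads <= len(thresholds):
        return thresholds[n_reads - 1]
    return align_intron_max


if __name__ == "__main__":
    for n in (1, 2, 3, 4):
        print(n, max_allowed_gap(n, align_intron_max=1000000))
```

The value 1000000 for alignIntronMax is only an example input here, not a default taken from the text.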
### Scoring
scoreGap 0
int: splice junction penalty (independent on intron motif)
scoreGapNoncan -8
int: non-canonical junction penalty (in addition to scoreGap)
scoreGapGCAG -4
GC/AG and CT/GC junction penalty (in addition to scoreGap)
scoreGapATAC -8
AT/AC and GT/AT junction penalty (in addition to scoreGap)
scoreGenomicLengthLog2scale -0.25
extra score logarithmically scaled with genomic length of the alignment: scoreGenomicLengthLog2scale*log2(genomicLength)
scoreDelOpen -2
deletion open penalty
scoreDelBase -2
deletion extension penalty per base (in addition to scoreDelOpen)
scoreInsOpen -2
insertion open penalty
scoreInsBase -2
insertion extension penalty per base (in addition to scoreInsOpen)
scoreStitchSJshift 1
maximum score reduction while searching for SJ boundaries in the stitching step
### Alignments and Seeding
seedSearchStartLmax 50
int>0: defines the search start point through the read - the read is split into pieces no longer than this value
seedSearchStartLmaxOverLread 1.0
real: seedSearchStartLmax normalized to read length (sum of mates' lengths for paired-end reads)
seedSearchLmax 0
int>=0: defines the maximum length of the seeds, if =0 max seed length is infinite
seedMultimapNmax 10000
int>0: only pieces that map fewer than this value are utilized in the stitching procedure
seedPerReadNmax 1000
int>0: max number of seeds per read
seedPerWindowNmax 50
int>0: max number of seeds per window
seedNoneLociPerWindow 10
int>0: max number of one seed loci per window
seedSplitMin 12
int>0: min length of the seed sequences split by Ns or mate gap
alignIntronMin 21
minimum intron size: genomic gap is considered intron if its length>=alignIntronMin, otherwise it is considered Deletion
alignIntronMax 0
maximum intron size, if 0, max intron size will be determined by (2^winBinNbits)*winAnchorDistNbins
alignMatesGapMax 0
maximum gap between two mates, if 0, max intron gap will be determined by (2^winBinNbits)*winAnchorDistNbins
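When alignIntronMax or alignMatesGapMax is left at 0, the effective limit follows from the window parameters. The snippet below simply evaluates (2^winBinNbits)*winAnchorDistNbins for the defaults listed in this section (16 and 9); the function name is hypothetical.

```python
def default_max_span(win_bin_nbits=16, win_anchor_dist_nbins=9):
    """Evaluate (2^winBinNbits) * winAnchorDistNbins in bases."""
    return (2 ** win_bin_nbits) * win_anchor_dist_nbins


if __name__ == "__main__":
    print(default_max_span())
```

With the default window settings this comes to 589824 bases.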
alignSJoverhangMin 5
int>0: minimum overhang (i.e. block size) for spliced alignments
alignSJstitchMismatchNmax 0 -1 0 0
4*int>=0: maximum number of mismatches for stitching of the splice junctions (-1: no limit).
(1) non-canonical motifs, (2) GT/AG and CT/AC motif, (3) GC/AG and CT/GC motif, (4) AT/AC and GT/AT motif.
int>0: max number of loci anchors are allowed to map to
winBinNbits 16
int>0: =log2(winBin), where winBin is the size of the bin for the windows/clustering, each window will occupy an integer number of bins.
winAnchorDistNbins 9
int>0: max number of bins between two anchors that allows aggregation of anchors into one window
winFlankNbins 4
int>0: log2(winFlank), where winFlank is the size of the left and right flanking regions for each window
winReadCoverageRelativeMin 0.5
real>=0: minimum relative coverage of the read sequence by the seeds in a window, for STARlong algorithm only.
winReadCoverageBasesMin 0
int>0: minimum number of bases covered by the seeds in a window, for STARlong algorithm only.
### Chimeric Alignments
chimOutType Junctions
string(s): type of chimeric output
Junctions ... Chimeric.out.junction
SeparateSAMold ... output old SAM into separate Chimeric.out.sam file
WithinBAM ... output into main aligned BAM files (Aligned.*.bam)
WithinBAM HardClip ... (default) hard-clipping in the CIGAR for supplemental chimeric alignments (default if no 2nd word is present)
WithinBAM SoftClip ... soft-clipping in the CIGAR for supplemental chimeric alignments
chimSegmentMin 0
int>=0: minimum length of chimeric segment, if ==0, no chimeric output
chimScoreMin 0
int>=0: minimum total (summed) score of the chimeric segments
chimScoreDropMax 20
int>=0: max drop (difference) of chimeric score (the sum of scores of all chimeric segments) from the read length
chimScoreSeparation 10
int>=0: minimum difference (separation) between the best chimeric score and the next one
chimScoreJunctionNonGTAG -1
int: penalty for a non-GT/AG chimeric junction
chimJunctionOverhangMin 20
int>=0: minimum overhang for a chimeric junction
chimSegmentReadGapMax 0
int>=0: maximum gap in the read sequence between chimeric segments
chimFilter banGenomicN
string(s): different filters for chimeric alignments
None ... no filtering
banGenomicN ... Ns are not allowed in the genome sequence around the chimeric junction
chimMainSegmentMultNmax 10
int>=1: maximum number of multi-alignments for the main chimeric segment. =1 will prohibit multimapping main segments.
chimMultimapNmax 0
int>=0: maximum number of chimeric multi-alignments
0 ... use the old scheme for chimeric detection which only considered unique alignments
chimMultimapScoreRange 1
int>=0: the score range for multi-mapping chimeras below the best chimeric score. Only works with --chimMultimapNmax > 1
chimNonchimScoreDropMin 20
int>=0: to trigger chimeric detection, the drop in the best non-chimeric alignment score with respect to the read length has to be greater than this value
chimOutJunctionFormat 0
int: formatting type for the Chimeric.out.junction file
0 ... no comment lines/headers
1 ... comment lines at the end of the file: command line and Nreads: total, unique, multi
### Quantification of Annotations
quantMode -
string(s): types of quantification requested
- ... none
TranscriptomeSAM ... output SAM/BAM alignments to transcriptome into a separate file
GeneCounts ... count reads per gene
quantTranscriptomeBAMcompression 1 1
int: -2 to 10 transcriptome BAM compression level
-2 ... no BAM output
-1 ... default compression (6?)
0 ... no compression
10 ... maximum compression
quantTranscriptomeBan IndelSoftclipSingleend
string: prohibit various alignment type
IndelSoftclipSingleend... prohibit indels, soft clipping and single-end alignments - compatible with RSEM
Singleend ... prohibit single-end alignments
### 2-pass Mapping
twopassMode None
string: 2-pass mapping mode.
None ... 1-pass mapping
Basic ... basic 2-pass mapping, with all 1st pass junctions inserted into the genome indices on the fly
twopass1readsN -1
int: number of reads to process for the 1st step. Use very large number (or default -1) to map all reads in the first step.
string: WASP allele-specific output type. This is a re-implementation of the original WASP mappability filtering by Bryce van de Geijn, Graham McVicker, Yoav Gilad & Jonathan K Pritchard. Please cite the original WASP paper: Nature Methods 12, 1061–1063 (2015), https://www.nature.com/articles/nmeth.3582 .
SAMtag ... add WASP tags to the alignments that pass WASP filtering
CB_UMI_Simple ... (a.k.a. Droplet) one UMI and one Cell Barcode of fixed length in read2, e.g. Drop-seq and 10X Chromium
CB_UMI_Complex ... one UMI of fixed length, but multiple Cell Barcodes of varying length, as well as adapter sequences are allowed in read2 only, e.g. inDrop.
soloCBwhitelist -
string(s): file(s) with whitelist(s) of cell barcodes. Only one file allowed with
soloCBstart 1
int>0: cell barcode start base
soloCBlen 16
int>0: cell barcode length
soloUMIstart 17
int>0: UMI start base
soloUMIlen 10
int>0: UMI length
soloBarcodeReadLength 1
int: length of the barcode read
1 ... equal to sum of soloCBlen+soloUMIlen
0 ... not defined, do not check
soloCBposition -
string(s): position of Cell Barcode(s) on the barcode read.
Presently only works with --soloType CB_UMI_Complex, and barcodes are assumed to be on Read2.
Format for each barcode: startAnchor_startDistance_endAnchor_endDistance
start(end)Anchor defines the anchor base for the CB: 0: read start; 1: read end; 2: adapter start; 3: adapter end
start(end)Distance is the distance from the CB start(end) to the Anchor base
String for different barcodes are separated by space.
Example: inDrop (Zilionis et al, Nat. Protocols, 2017):
--soloCBposition 0_0_2_-1 3_1_3_8
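To make the startAnchor_startDistance_endAnchor_endDistance format concrete, the sketch below splits a barcode position string into its four integer fields and checks the anchor codes against the list above. The class and function names are hypothetical and this is not how STAR itself parses the option.

```python
from typing import NamedTuple

# Anchor codes as listed in the option description.
ANCHORS = {0: "read start", 1: "read end", 2: "adapter start", 3: "adapter end"}


class BarcodePosition(NamedTuple):
    start_anchor: int
    start_distance: int
    end_anchor: int
    end_distance: int


def parse_barcode_position(spec):
    """Parse one 'startAnchor_startDistance_endAnchor_endDistance' string."""
    fields = [int(x) for x in spec.split("_")]
    if len(fields) != 4:
        raise ValueError(f"expected 4 fields, got {len(fields)}: {spec!r}")
    pos = BarcodePosition(*fields)
    if pos.start_anchor not in ANCHORS or pos.end_anchor not in ANCHORS:
        raise ValueError(f"unknown anchor code in {spec!r}")
    return pos


if __name__ == "__main__":
    # Two barcodes separated by a space, each with four fields.
    for spec in "0_0_2_-1 3_1_3_8".split():
        print(parse_barcode_position(spec))
```

Read this way, a string such as 0_0_2_-1 describes a barcode running from the read start to one base before the adapter start.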
soloUMIposition -
string: position of the UMI on the barcode read, same as soloCBposition
Example: inDrop (Zilionis et al, Nat. Protocols, 2017):
--soloUMIposition 3_9_3_14
soloAdapterSequence -
string: adapter sequence to anchor barcodes.
soloAdapterMismatchesNmax 1
int>0: maximum number of mismatches allowed in adapter sequence.
soloCBmatchWLtype 1MM_multi
string: matching the Cell Barcodes to the WhiteList
Exact ... only exact matches allowed
1MM ... only one match in whitelist with 1 mismatched base allowed. Allowed CBs have to have at least one read with exact match.
1MM_multi ... multiple matches in whitelist with 1 mismatched base allowed, posterior probability calculation is used to choose one of the matches.
Allowed CBs have to have at least one read with exact match. Similar to CellRanger 2.2.0
1MM_multi_pseudocounts ... same as 1MM_multi, but pseudocounts of 1 are added to all whitelist barcodes.
Similar to CellRanger 3.x.x
soloStrand Forward
string: strandedness of the solo libraries:
Unstranded ... no strand information
Forward ... read strand same as the original RNA molecule
Reverse ... read strand opposite to the original RNA molecule
soloFeatures Gene
string(s): genomic features for which the UMI counts per Cell Barcode are collected
Gene ... genes: reads match the gene transcript
SJ ... splice junctions: reported in SJ.out.tab
GeneFull ... full genes: count all reads overlapping genes' exons and introns
Transcript3p ... quantification of transcript for 3' protocols
soloUMIdedup 1MM_All
string(s): type of UMI deduplication (collapsing) algorithm
1MM_All ... all UMIs with 1 mismatch distance to each other are collapsed (i.e. counted once)
1MM_Directional ... follows the "directional" method from the UMI-tools by Smith, Heger and Sudbery (Genome Research 2017).
Exact ... only exactly matching UMIs are collapsed
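The difference between Exact and 1MM_All collapsing can be shown on a toy list of UMIs for a single cell and gene. The sketch below treats 1MM_All as transitive grouping (UMIs linked by chains of single mismatches count once); that reading, and the function names, are assumptions for illustration and not necessarily how STARsolo implements it. 1MM_Directional is not covered.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def count_umis(umis, method="1MM_All"):
    """Count collapsed UMIs for one cell barcode and one gene."""
    unique = sorted(set(umis))
    if method == "Exact":
        return len(unique)
    if method != "1MM_All":
        raise ValueError("only Exact and 1MM_All are sketched here")

    # Union-find over UMIs; link every pair within one mismatch.
    parent = {u: u for u in unique}

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving
            u = parent[u]
        return u

    for i, a in enumerate(unique):
        for b in unique[i + 1:]:
            if len(a) == len(b) and hamming(a, b) <= 1:
                parent[find(a)] = find(b)
    return len({find(u) for u in unique})


if __name__ == "__main__":
    umis = ["AAAA", "AAAT", "AATT", "GGGG", "AAAA"]
    print(count_umis(umis, "Exact"), count_umis(umis, "1MM_All"))
```

The pairwise comparison is quadratic in the number of distinct UMIs, which is fine for a toy example but not for real data.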
soloUMIfiltering -
string(s): type of UMI filtering
- ... basic filtering: remove UMIs with N and homopolymers (similar to CellRanger 2.2.0)
MultiGeneUMI ... remove lower-count UMIs that map to more than one gene (introduced in CellRanger 3.x.x)
CellRanger2.2 ... simple filtering of CellRanger 2.2, followed by three numbers: number of expected cells, robust maximum percentile for UMI count, maximum to minimum ratio for UMI count
TopCells ... only report top cells by UMI count, followed by the exact number of cells
--readFilesIn paths to files that contain input read1 (and, if needed, read2)
--runThreadN (default 1) number of threads to run STAR
--outFileNamePrefix output files name prefix (including full or relative path).
--outSAMtype BAM output BAM without sorting
--readFilesCommand string(s): command line to execute for each of the input file. This command should generate FASTA or FASTQ text and send it to stdout. For example: zcat - to uncompress .gz files, bzcat - to uncompress .bz2 files, etc.
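The options summarized above can be combined into a single mapping command. The helper below only assembles the argument list (it does not run STAR); the function name and the file paths in the usage example are hypothetical, while the flags are the ones listed in this section.

```python
def star_align_command(genome_dir, reads, out_prefix, threads=1, gzipped=False):
    """Assemble a STAR mapping command as an argument list."""
    cmd = [
        "STAR",
        "--runMode", "alignReads",
        "--genomeDir", genome_dir,
        "--readFilesIn", *reads,          # read1 and, if needed, read2
        "--runThreadN", str(threads),
        "--outFileNamePrefix", out_prefix,
        "--outSAMtype", "BAM", "Unsorted",
    ]
    if gzipped:
        # zcat uncompresses .gz input and writes FASTQ text to stdout.
        cmd += ["--readFilesCommand", "zcat"]
    return cmd


if __name__ == "__main__":
    cmd = star_align_command("./GenomeDir/", ["r1.fq.gz", "r2.fq.gz"],
                             "sample1_", threads=8, gzipped=True)
    print(" ".join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)` once STAR is installed and a genome index exists in the given directory.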
Alexander Dobin, Carrie A. Davis, Felix Schlesinger, Jorg Drenkow, Chris Zaleski, Sonali Jha, Philippe Batut, Mark Chaisson, and Thomas R. Gingeras
This script takes one or more alignment files in SAM/BAM format and a feature file in GFF format and calculates for each feature the number of reads mapping to it. See http://htseq.readthedocs.io/en/master/count.html for details.
positional arguments:
samfilenames Path to the SAM/BAM files containing the mapped reads. If '-' is selected, read from standard input
featuresfilename Path to the GTF file containing the features
optional arguments:
-h, --help show this help message and exit
-f {sam,bam,auto}, --format {sam,bam,auto}
Type of <alignment_file> data. DEPRECATED: file format is detected automatically. This option is ignored.
-r {pos,name}, --order {pos,name}
'pos' or 'name'. Sorting order of <alignment_file> (default: name). Paired-end sequencing data must be sorted either by position or by read name, and the sorting order must be specified. Ignored for single-end data.
--max-reads-in-buffer MAX_BUFFER_SIZE
When <alignment_file> is paired end sorted by position, allow only so many reads to stay in memory until the mates are found (raising this number will use more memory). Has no effect for single end or paired end sorted by name
-s {yes,no,reverse}, --stranded {yes,no,reverse}
Whether the data is from a strand-specific assay. Specify 'yes', 'no', or 'reverse' (default: yes). 'reverse' means 'yes' with reversed strand interpretation
-a MINAQUAL, --minaqual MINAQUAL
Skip all reads with MAPQ alignment quality lower than the given minimum value (default: 10). MAPQ is the 5th column of a SAM/BAM file and its usage depends on the software used to map the reads.
-t FEATURETYPE, --type FEATURETYPE
Feature type (3rd column in GTF file) to be used, all features of other type are ignored (default, suitable for Ensembl GTF files: exon)
-i IDATTR, --idattr IDATTR
GTF attribute to be used as feature ID (default, suitable for Ensembl GTF files: gene_id). All features of the right type (see -t option) within the same GTF attribute will be added together. The typical way of using this option is to count all exonic reads from each gene and add the exons but other uses are possible as well.
--additional-attr ADDITIONAL_ATTR
Additional feature attributes (default: none, suitable for Ensembl GTF files: gene_name). Use multiple times for more than one additional attribute. These attributes are only used as annotations in the output, while the determination of how the counts are added together is done based on option -i.
-m {union,intersection-strict,intersection-nonempty}, --mode {union,intersection-strict,intersection-nonempty}
Mode to handle reads overlapping more than one feature (choices: union, intersection-strict, intersection-nonempty; default: union)
--nonunique {none,all,fraction,random}
Whether and how to score reads that are not uniquely aligned or ambiguously assigned to features (choices: none, all, fraction, random; default: none)
--secondary-alignments {score,ignore}
Whether to score secondary alignments (0x100 flag)
--supplementary-alignments {score,ignore}
Whether to score supplementary alignments (0x800 flag)
-o SAMOUTS, --samout SAMOUTS
Write out all SAM alignment records into SAM/BAM files (one per input file needed), annotating each line with its feature assignment (as an optional field with tag 'XF'). See the -p option to use BAM instead of SAM.
-c OUTPUT_FILENAME, --counts_output OUTPUT_FILENAME
Filename to output the counts to instead of stdout.
--append-output Append counts output. This option is useful if you have already created a TSV/CSV/similar file with a header for your samples (with additional columns for the feature name and any additional attributes) and want to fill in the rest of the file.
-n NPROCESSES, --nprocesses NPROCESSES
Number of parallel CPU processes to use (default: 1).
--feature-query FEATURE_QUERY
Restrict to features described in this expression. Currently supports a single kind of expression: attribute == "one attr" to restrict the GFF to a single gene or transcript, e.g. --feature-query 'gene_name == "ACTB"' - notice the single quotes around the argument of this option and the double quotes around the gene name. Broader queries might become available in the future.
-q, --quiet Suppress progress report
--version Show software version and exit
Written by Simon Anders (sanders@fs.tum.de), European Molecular Biology Laboratory (EMBL) and Fabio Zanini (fabio.zanini@unsw.edu.au), UNSW Sydney. (c) 2010-2020. Released under the terms of the GNU General Public License v3. Part of the 'HTSeq' framework, version 0.12.4.
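The three overlap-resolution modes named in the help above (union, intersection-strict, intersection-nonempty) can be sketched in Python as set operations. This follows the description in the HTSeq documentation, where each read position has a set of overlapping features; it is an illustration rather than HTSeq's own code, and the function name and input representation are assumptions.

```python
def assign_feature(position_sets, mode="union"):
    """Assign a read to a feature.

    position_sets: list with one set per read position, each holding the
    IDs of the features overlapping that position (possibly empty).
    """
    if mode == "union":
        candidates = set().union(*position_sets)
    elif mode == "intersection-strict":
        candidates = set.intersection(*position_sets) if position_sets else set()
    elif mode == "intersection-nonempty":
        nonempty = [s for s in position_sets if s]
        candidates = set.intersection(*nonempty) if nonempty else set()
    else:
        raise ValueError(f"unknown mode: {mode}")

    if not candidates:
        return "__no_feature"
    if len(candidates) > 1:
        return "__ambiguous"
    return next(iter(candidates))


# Read that hangs over the end of gene_A into unannotated sequence.
partial = [{"gene_A"}, {"gene_A"}, set()]
# Read inside gene_A that also touches gene_B at one position.
shared = [{"gene_A"}, {"gene_A", "gene_B"}]

for mode in ("union", "intersection-strict", "intersection-nonempty"):
    print(mode, assign_feature(partial, mode), assign_feature(shared, mode))
```

With these inputs, union assigns the first read to gene_A and calls the second ambiguous, intersection-strict reports no feature for the first read and gene_A for the second, and intersection-nonempty assigns both to gene_A.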
This script takes one alignment file in SAM/BAM format and a feature file in GFF format and calculates for each feature the number of reads mapping to it, accounting for barcodes. See http://htseq.readthedocs.io/en/master/count.html for details.
positional arguments:
samfilename Path to the SAM/BAM file containing the barcoded, mapped reads. If '-' is selected, read from standard input
featuresfilename Path to the GTF file containing the features
optional arguments:
-h, --help show this help message and exit
-f {sam,bam,auto}, --format {sam,bam,auto}
Type of <alignment_file> data. DEPRECATED: file format is detected automatically. This option is ignored.
-r {pos,name}, --order {pos,name}
'pos' or 'name'. Sorting order of <alignment_file> (default: name). Paired-end sequencing data must be sorted either by position or by read name, and the sorting order must be specified. Ignored for single-end data.
--max-reads-in-buffer MAX_BUFFER_SIZE
When <alignment_file> is paired end sorted by position, allow only so many reads to stay in memory until the mates are found (raising this number will use more memory). Has no effect for single end or paired end sorted by name
-s {yes,no,reverse}, --stranded {yes,no,reverse}
Whether the data is from a strand-specific assay. Specify 'yes', 'no', or 'reverse' (default: yes). 'reverse' means 'yes' with reversed strand interpretation
-a MINAQUAL, --minaqual MINAQUAL
Skip all reads with MAPQ alignment quality lower than the given minimum value (default: 10). MAPQ is the 5th column of a SAM/BAM file and its usage depends on the software used to map the reads.
-t FEATURETYPE, --type FEATURETYPE
Feature type (3rd column in GTF file) to be used, all features of other type are ignored (default, suitable for Ensembl GTF files: exon)
-i IDATTR, --idattr IDATTR
GTF attribute to be used as feature ID (default, suitable for Ensembl GTF files: gene_id)
--additional-attr ADDITIONAL_ATTR
Additional feature attributes (default: none, suitable for Ensembl GTF files: gene_name). Use multiple times for each different attribute
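-m {union,intersection-strict,intersection-nonempty}, --mode {union,intersection-strict,intersection-nonempty}
Mode to handle reads overlapping more than one feature (choices: union, intersection-strict, intersection-nonempty; default: union)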
--nonunique {none,all}
Whether to score reads that are not uniquely aligned or ambiguously assigned to features
--secondary-alignments {score,ignore}
Whether to score secondary alignments (0x100 flag)
--supplementary-alignments {score,ignore}
Whether to score supplementary alignments (0x800 flag)
-o SAMOUT, --samout SAMOUT
Write out all SAM alignment records into a SAM/BAM file, annotating each line with its feature assignment (as an optional field with tag 'XF'). See the -p option to use BAM instead of SAM.
-c OUTPUT_FILENAME, --counts_output OUTPUT_FILENAME
TSV/CSV filename to output the counts to instead of stdout.
--cell-barcode CB_TAG
BAM tag used for the cell barcode (default compatible with 10X Genomics Chromium is CB).
--UMI UB_TAG
BAM tag used for the unique molecular identifier, also known as molecular barcode (default compatible with 10X Genomics Chromium is UB).
-q, --quiet Suppress progress report
--version Show software version and exit
Written by Simon Anders (sanders@fs.tum.de), European Molecular Biology Laboratory (EMBL) and Fabio Zanini (fabio.zanini@unsw.edu.au), UNSW Sydney. (c) 2010-2020. Released under the terms of the GNU General Public License v3. Part of the 'HTSeq' framework, version 0.12.4.
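The idea of counting "accounting for barcodes" can be sketched as follows in Python: reads are grouped by cell barcode (CB tag) and feature, and each distinct UMI (UB tag) is counted once. Whether htseq-count-barcodes collapses UMIs in exactly this way is an assumption here; the input tuples stand in for values that would be read from the BAM tags and the feature assignment.

```python
from collections import defaultdict


def count_umis(records):
    """Count distinct UMIs per (cell barcode, gene).

    records: iterable of (cell_barcode, umi, gene) tuples; entries with a
    missing value (None) are skipped.
    """
    seen = defaultdict(set)
    for cell, umi, gene in records:
        if cell is None or umi is None or gene is None:
            continue
        seen[(cell, gene)].add(umi)
    return {key: len(umi_set) for key, umi_set in seen.items()}


records = [
    ("CELL1", "AACG", "ACTB"),
    ("CELL1", "AACG", "ACTB"),  # same molecule sequenced twice
    ("CELL1", "TTGC", "ACTB"),
    ("CELL2", "AACG", "ACTB"),
    ("CELL2", None, "ACTB"),    # read without a UB tag
]
print(count_umis(records))
# {('CELL1', 'ACTB'): 2, ('CELL2', 'ACTB'): 1}
```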
This script takes a file with high-throughput sequencing reads (supported formats: SAM, Solexa _export.txt, FASTQ, Solexa _sequence.txt) and performs a simple quality assessment by producing plots showing the distribution of called bases and base-call quality scores by position within the reads. The plots are output as a PDF file.
positional arguments:
readfilename The file to count reads in (SAM/BAM or Fastq)
-t {sam,bam,solexa-export,fastq,solexa-fastq}, --type {sam,bam,solexa-export,fastq,solexa-fastq}
type of read_file (one of: sam [default], bam, solexa-export, fastq, solexa-fastq)
-o OUTFILE, --outfile OUTFILE
output filename (default is <read_file>.pdf)
-r READLEN, --readlength READLEN
the maximum read length (when not specified, the script guesses from the file)
-g GAMMA, --gamma GAMMA
the gamma factor for the contrast adjustment of the quality score plot
-n, --nosplit do not split reads in unaligned and aligned ones
-m MAXQUAL, --maxqual MAXQUAL
the maximum quality score that appears in the data (default: 41)
--primary-only For SAM/BAM input files, ignore alignments that are not primary. This only affects 'multimapper' reads that align to several regions in the genome. By choosing this option, each read will only count as one; without this option, each of its alignments counts as one.
--max-records MAX_RECORDS
Limit the analysis to the first N reads/alignments.
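The kind of per-position tally behind such a quality plot can be sketched in a few lines of Python. This is not the htseq-qa implementation; it assumes Sanger-style (Phred+33) quality strings as input and only builds the counts, leaving the plotting aside. The function name is hypothetical.

```python
from collections import Counter


def quality_by_position(quality_strings, offset=33):
    """Return a list where entry i counts the quality scores seen at read position i."""
    counts = []
    for qual in quality_strings:
        for pos, char in enumerate(qual):
            if pos >= len(counts):
                counts.append(Counter())  # first read reaching this position
            counts[pos][ord(char) - offset] += 1
    return counts


quals = ["II", "I#", "I"]  # 'I' encodes Q40 and '#' encodes Q2 with offset 33
table = quality_by_position(quals)
for pos, counter in enumerate(table):
    print(pos, dict(counter))
```

Each row of the resulting table corresponds to one column of the quality-score plot.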
-r 'pos' or 'name'. Sorting order of <alignment_file> (default: name). Paired-end sequencing data must be sorted either by position or by read name, and the sorting order must be specified. Ignored for single-end data.
-f type of <alignment_file> data, either 'sam' or 'bam' (default: sam)
-s whether the data is from a strand-specific assay. Specify 'yes', 'no', or 'reverse' (default: yes). 'reverse' means 'yes' with reversed strand interpretation
-a skip all reads with alignment quality lower than the given minimum value (default: 10)
Citation
HTSeq—a Python framework to work with high-throughput sequencing data
Simon Anders,* Paul Theodor Pyl, and Wolfgang Huber
-m, --output-combo, Output combined (interleaved) paired-end fastq file. Must use -s option.
-M, --output-combo-all, Output combined (interleaved) paired-end fastq file with any discarded read written to output file as a single N. Cannot be used with the -s option.
Global options
--------------
-t, --qual-type, Type of quality values (solexa (CASAVA < 1.3), illumina (CASAVA 1.3 to 1.7), sanger (which is CASAVA >= 1.8)) (required)
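The three quality types differ in how the FASTQ quality characters are encoded. The Python sketch below decodes a quality string under each convention, assuming the commonly documented encodings: Phred+33 for sanger, Phred+64 for illumina (CASAVA 1.3 to 1.7), and Solexa-scaled scores with offset 64 for solexa, converted to the Phred scale. It is independent of sickle's own code, and the function name is hypothetical.

```python
import math


def decode_quality(qual: str, qual_type: str = "sanger"):
    """Convert a FASTQ quality string to Phred-scale scores."""
    if qual_type == "sanger":
        return [ord(c) - 33 for c in qual]
    if qual_type == "illumina":
        return [ord(c) - 64 for c in qual]
    if qual_type == "solexa":
        # Solexa scores use a different scale; map them onto Phred.
        return [10 * math.log10(10 ** ((ord(c) - 64) / 10) + 1) for c in qual]
    raise ValueError(f"unknown quality type: {qual_type}")


print(decode_quality("I#", "sanger"))    # [40, 2]
print(decode_quality("hB", "illumina"))  # [40, 2]
print([round(q, 2) for q in decode_quality("h;", "solexa")])
```

Choosing the wrong type shifts every score by 31, which is why sickle requires the option to be given explicitly.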
sickle pe -c interlace.fastq -t sanger -m interlace_trimmed.fastq -s trimmed_singles.fastq
-c Combined (interleaved) input paired-end fastq
-m Output combined (interleaved) paired-end fastq file. Must use -s option.
-M Output combined (interleaved) paired-end fastq file with any discarded read written to output file as a single N. Cannot be used with the -s option.