2020 3/14 Added video
2020 9/30 Added paper citation
2022/02/04 v0.9
2022/06/08 Revised the commands following the update; updated the help output
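Current workflows for producing human genome assemblies from long-read sequencing technologies have cost and production-time bottlenecks that prevent efficient scaling to large cohorts. The authors demonstrate an optimized PromethION nanopore sequencing method for eleven human genomes. From sequencing performed on a single machine over nine days, using only three flow cells per sample, they achieved on average 63x coverage, a read N50 of 42 Kb, 90% average read identity, and 6.5x coverage in reads of 100 Kb or longer. To assemble these data they introduce new computational tools: Shasta, a de novo long-read assembler, and MarginPolish & HELEN, a suite of nanopore assembly polishing algorithms. On a single commercial compute node, Shasta can produce a complete human genome assembly in under six hours, and MarginPolish & HELEN can polish the result in just one day, achieving 99.9% identity (QV30) for haploid samples from nanopore reads alone. The authors evaluate assembly performance for diploid, haploid, and trio-binned human samples in terms of accuracy, cost, and time, and demonstrate improvements over current state-of-the-art methods in all areas. They further show that adding Hi-C sequencing yields chromosome-level scaffolds for all eleven genomes.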
user documentation
https://chanzuckerberg.github.io/shasta/QuickStart.html
2022/02/04
🧬Shasta 0.9.0 is now available!🧬
— Dr. Sara Simmonds 🧬🐚🐠🌱🌸 (@SeSimmonds) February 3, 2022
*Improved de novo long read phased assembly
*Higher quality, fewer artifacts
We are actively developing phased assemblies and love feedback. Please give it a try and reach out to us on GitHub.@BenedictPaten @cziscience https://t.co/tFcYF2KxHy
2020/10/06
New Shasta v0.6 release: https://t.co/3F3MbCoV0I
— Benedict Paten (@BenedictPaten) October 6, 2020
Huge progress on almost every front; first hints of haplotype resolved ONT only assembly. @nanopore @cziscience
London Calling 2019
Shasta is also mentioned (around 18:50)
Quickstart
https://chanzuckerberg.github.io/shasta/QuickStart.html#QuickStartLinux
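Installation
I tested using the Docker image provided by the authors.
Source: GitHub
Development-stage binaries are distributed for each platform (they could not be downloaded at the time of testing).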
#conda (link)
mamba create -n shasta -y
conda activate shasta
mamba install -c bioconda shasta -y
#docker image: pull the latest tag
docker pull tpesout/shasta:latest
#older versions
#linux(Ubuntu 16.04, and 18.04, Linux Mint 18.3, CentOS 7.6, Debian 9, Fedora 29)
wget https://github.com/chanzuckerberg/shasta/releases/download/X.Y.Z/shasta-Linux-X.Y.Z
chmod ugo+x shasta-Linux-X.Y.Z
#macos
curl -O -L https://github.com/chanzuckerberg/shasta/releases/download/X.Y.Z/shasta-macOS-X.Y.Z
#rename the file to shasta-macOS-X.Y.Z
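The X.Y.Z placeholders in the download commands above can be filled in with a small helper. The following is a minimal sketch that only prints the URL, following the URL layout shown above; the function name `shasta_release_url` is hypothetical, and whether a binary exists for a given version and platform has to be checked on the GitHub releases page.

```shell
#!/bin/sh
# Hypothetical helper: build the release-download URL for a Shasta version.
# The URL layout is copied from the wget/curl lines above.
shasta_release_url() {
    version="$1"              # for example 0.8.0
    os="${2:-$(uname -s)}"    # Linux or Darwin; second argument overrides detection
    case "$os" in
        Linux)  platform="Linux" ;;
        Darwin) platform="macOS" ;;
        *) echo "unsupported OS: $os" >&2; return 1 ;;
    esac
    printf 'https://github.com/chanzuckerberg/shasta/releases/download/%s/shasta-%s-%s\n' \
        "$version" "$platform" "$version"
}

# Print the URL only; nothing is downloaded here.
shasta_release_url 0.8.0 Linux
# prints https://github.com/chanzuckerberg/shasta/releases/download/0.8.0/shasta-Linux-0.8.0
```

The printed URL can then be passed to `wget` or `curl -O -L` as in the commands above.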
> shasta
Shasta Release 0.8.0
2022-Jun-07 23:51:27.173048 Assembly begins with the following command line:
shasta
Option "--config" is missing and is now required to run an assembly.
It must specify either a configuration file
or one of the following built-in configurations:
HiFi-Oct2021
Nanopore-Dec2019
Nanopore-Jun2020
Nanopore-Oct2021
Nanopore-OldGuppy-Sep2020
Nanopore-Phased-Aug2021
Nanopore-Plants-Apr2021
Nanopore-Sep2020
Nanopore-UL-Dec2019
Nanopore-UL-Jun2020
Nanopore-UL-Oct2021
Nanopore-UL-Phased-Oct2021
Nanopore-UL-Sep2020
Nanopore-UL-iterative-Sep2020
2022-Jun-07 23:51:27.177074 Option "--config" is missing and is now required to run an assembly.
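Because `--config` is required, it can help to check a configuration name before launching a long run. The sketch below hard-codes the names printed by Shasta 0.8.0 in the message above; the variable and function names are hypothetical, and other releases may list different configurations, so a name missing here is not necessarily invalid in a newer version.

```shell
#!/bin/sh
# Built-in configuration names copied from the Shasta 0.8.0 message above.
SHASTA_080_CONFIGS="HiFi-Oct2021 Nanopore-Dec2019 Nanopore-Jun2020 \
Nanopore-Oct2021 Nanopore-OldGuppy-Sep2020 Nanopore-Phased-Aug2021 \
Nanopore-Plants-Apr2021 Nanopore-Sep2020 Nanopore-UL-Dec2019 \
Nanopore-UL-Jun2020 Nanopore-UL-Oct2021 Nanopore-UL-Phased-Oct2021 \
Nanopore-UL-Sep2020 Nanopore-UL-iterative-Sep2020"

# Hypothetical helper: succeed if the name appears in the list above.
is_listed_config() {
    for name in $SHASTA_080_CONFIGS; do
        if [ "$name" = "$1" ]; then
            return 0
        fi
    done
    return 1
}

if is_listed_config "Nanopore-Oct2021"; then
    echo "listed"
else
    echo "not listed"
fi
# prints listed
```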
> shasta -h
Options allowed only on the command line:
-h [ --help ] Write a help message.
-v [ --version ] Identify the Shasta version.
--config arg Configuration name. Can be the name of
a built-in configuration or the name of
a configuration file.
--input arg Names of input files containing reads.
Specify at least one.
--assemblyDirectory arg (=ShastaRun) Name of the output directory. If
command is assemble, this directory
must not exist.
--command arg (=assemble) Command to run. Must be one of:
assemble, saveBinaryData,
cleanupBinaryData, explore,
createBashCompletionScript
--memoryMode arg (=anonymous) Specify whether allocated memory is
anonymous or backed by a filesystem.
Allowed values: anonymous, filesystem.
--memoryBacking arg (=4K) Specify the type of pages used to back
memory.
Allowed values: disk, 4K , 2M (for best
performance). All combinations
(memoryMode, memoryBacking) are allowed
except for (anonymous, disk).
Some combinations require root
privilege, which is obtained using sudo
and may result in a password prompting
depending on your sudo set up.
--threads arg (=0) Number of threads, or 0 to use one
thread per virtual processor.
--exploreAccess arg (=user) Specify allowed access for --command
explore. Allowed values: user, local,
unrestricted. DO NOT CHANGE FROM
DEFAULT VALUE WITHOUT UNDERSTANDING THE
SECURITY IMPLICATIONS.
--port arg (=17100) Port to be used by the http server
(command --explore).
--alignmentsPafFile arg The name of a PAF file containing
alignments of reads to a reference.
Only used for --command explore, for
display of the alignment candidate
graph. Experimental.
Options allowed on the command line and in the config file:
--Reads.minReadLength arg (=10000) Read length cutoff. Shorter reads are
discarded.
--Reads.desiredCoverage arg (=0) Reduce coverage to desired value. If
not zero, specifies desired coverage
(number of bases). The read length
cutoff specified via
--Reads.minReadLength is increased to
reduce coverage to the specified value.
Power of 10 multipliers can be used,
for example 120Gb to request 120 Gb of
coverage.
--Reads.noCache If set, skip the Linux cache when
loading reads. This is done by
specifying the O_DIRECT flag when
opening input files containing reads.
--Reads.palindromicReads.skipFlagging
Skip flagging palindromic reads. Oxford
Nanopore reads should be flagged for
better results.
--Reads.palindromicReads.maxSkip arg (=100)
Used for palindromic read detection.
--Reads.palindromicReads.maxDrift arg (=100)
Used for palindromic read detection.
--Reads.palindromicReads.maxMarkerFrequency arg (=10)
Used for palindromic read detection.
--Reads.palindromicReads.alignedFractionThreshold arg (=0.1)
Used for palindromic read detection.
--Reads.palindromicReads.nearDiagonalFractionThreshold arg (=0.1)
Used for palindromic read detection.
--Reads.palindromicReads.deltaThreshold arg (=100)
Used for palindromic read detection.
--Reads.palindromicReads.detectOnFastqLoad
Filter reads that have exceptionally
poor quality in the second half, which
is a strong indication of palindromic
sequence.
--Reads.palindromicReads.qScoreRelativeMeanDifference arg (=0.09)
When filtering palindrome quality, this
parameter describes how much worse
should the right side average p(error)
should be.
--Reads.palindromicReads.qScoreMinimumMean arg (=0.15)
When filtering palindrome quality, this
parameter describes the absolute
minimum average p(error) in the right
half.
--Reads.palindromicReads.qScoreMinimumVariance arg (=0.025)
When filtering palindrome quality, this
parameter describes the absolute
minimum variance in the right half.
--Kmers.generationMethod arg (=0) Method to generate marker k-mers: 0 =
random, 1 = random, excluding globally
overenriched k-mers,2 = random,
excluding k-mers overenriched even in a
single read,3 = read from file.4 =
random, excluding k-mers appearing in
two copies close to each other even in
a single read.
--Kmers.k arg (=10) Length of marker k-mers (in run-length
space).
--Kmers.probability arg (=0.1) Fraction k-mers used as a marker.
--Kmers.enrichmentThreshold arg (=100.)
Enrichment threshold for
Kmers.generationMethod 1 and 2.
--Kmers.distanceThreshold arg (=1000) Distance threshold, in RLE bases, for
Kmers.generationMethod 4
--Kmers.file arg The absolute path of a file containing
the k-mers to be used as markers, one
per line. A relative path is not
accepted. Only used if
Kmers.generationMethod is 3.
--MinHash.version arg (=0) Controls the version of the LowHash
algorithm to use. Can be 0 (default) or
1.(experimental).
--MinHash.m arg (=4) The number of consecutive markers that
define a MinHash/LowHash feature.
--MinHash.hashFraction arg (=0.01) Defines how low a hash has to be to be
used with the LowHash algorithm.
--MinHash.minHashIterationCount arg (=10)
The number of MinHash/LowHash
iterations, or 0 to let
--MinHash.alignmentCandidatesPerRead
control the number of iterations.
--MinHash.alignmentCandidatesPerRead arg (=20)
If --MinHash.minHashIterationCount is
0, MinHash iteration is stopped when
the average number of alignment
candidates that each read is involved
in reaches this value. If
--MinHash.minHashIterationCount is not
0, this is not used.
--MinHash.minBucketSize arg (=0) The minimum bucket size to be used by
the LowHash algorithm.
--MinHash.maxBucketSize arg (=10) The maximum bucket size to be used by
the LowHash algorithm.
--MinHash.minFrequency arg (=2) The minimum number of times a pair of
reads must be found by the
MinHash/LowHash algorithm in order to
be considered a candidate alignment.
--MinHash.allPairs Skip the MinHash algorithm and mark all
pairs of reads as alignmentcandidates
with both orientation. This should only
be used for experimentation on very
small runs because it is very time
consuming.
--Align.alignMethod arg (=3) The alignment method to be used to
create the read graph & the marker
graph. 0 = old Shasta method, 1 = SeqAn
(slow), 3 = banded SeqAn, 4 = new
Shasta method (experimental).
--Align.maxSkip arg (=30) The maximum number of markers that an
alignment is allowed to skip.
--Align.maxDrift arg (=30) The maximum amount of marker drift that
an alignment is allowed to tolerate
between successive markers.
--Align.maxTrim arg (=30) The maximum number of unaligned markers
tolerated at the beginning and end of
an alignment.
--Align.maxMarkerFrequency arg (=10) Marker frequency threshold. Markers
more frequent than this value in either
of two oriented reads being aligned are
discarded and not used to compute the
alignment.
--Align.minAlignedMarkerCount arg (=100)
The minimum number of aligned markers
for an alignment to be used.
--Align.minAlignedFraction arg (=0) The minimum fraction of aligned markers
for an alignment to be used.
--Align.matchScore arg (=6) Match score for marker alignments (only
used for alignment methods 1 and 3).
--Align.mismatchScore arg (=-1) Mismatch score for marker alignments
(only used for alignment methods 1 and
3).
--Align.gapScore arg (=-1) Gap score for marker alignments (only
used for alignment methods 1 and 3).
--Align.downsamplingFactor arg (=0.10000000000000001)
Downsampling factor (only used for
alignment method 3).
--Align.bandExtend arg (=10) Amount to extend the downsampled band
(only used for alignment method 3).
--Align.maxBand arg (=1000) Maximum alignment band (only used for
alignment method 3).
--Align.sameChannelReadAlignment.suppressDeltaThreshold arg (=0)
If not zero, alignments between reads
from the same nanopore channel and
close in time are suppressed. The
"read" meta data fields from the FASTA
or FASTQ header are checked. If their
difference, in absolute value, is less
than the value of this option, the
alignment is suppressed. This can help
avoid assembly artifact. This check is
only done if the two reads have
identical meta data fields "runid",
"sampleid", and "ch". If any of these
meta data fields are missing, this
check is suppressed and this option has
no effect.
--Align.suppressContainments Suppress containment alignments, that
is alignments in which one read is
entirely contained in another read,
except possibly for up to maxTrim
markers at the beginning and end.
--Align.align4.deltaX arg (=200) Only used for alignment method 4
(experimental).
--Align.align4.deltaY arg (=10) Only used for alignment method 4
(experimental).
--Align.align4.minEntryCountPerCell arg (=10)
Only used for alignment method 4
(experimental).
--Align.align4.maxDistanceFromBoundary arg (=100)
Only used for alignment method 4
(experimental).
--ReadGraph.creationMethod arg (=0) The method used to create the read
graph (0 default, 1 or 2 experimental).
--ReadGraph.maxAlignmentCount arg (=6)
The maximum number of alignments to be
kept for each read.
--ReadGraph.maxChimericReadDistance arg (=2)
Used for chimeric read detection.
--ReadGraph.strandSeparationMethod arg (=1)
Strand separation method: 0 = no strand
separation, 1 = limited strand
separation, 2 = strict strand
separation.
--ReadGraph.crossStrandMaxDistance arg (=6)
Maximum distance (edges) for strand
separation method 1.
--ReadGraph.removeConflicts Remove conflicts from the read graph.
Experimental - do not use.
--ReadGraph.markerCountPercentile arg (=0.015)
Percentile for --ReadGraph.markerCount
(only used when --ReadGraph.creationMet
hod is 2).
--ReadGraph.alignedFractionPercentile arg (=0.12)
Percentile for adaptive selection of
--ReadGraph.alignedFraction (only used
when --ReadGraph.creationMethod is 2).
--ReadGraph.maxSkipPercentile arg (=0.12)
Percentile for adaptive selection of
--ReadGraph.maxSkip (only used when
--ReadGraph.creationMethod is 2).
--ReadGraph.maxDriftPercentile arg (=0.12)
Percentile for adaptive selection of
--ReadGraph.maxDrift (only used when
--ReadGraph.creationMethod is 2).
--ReadGraph.maxTrimPercentile arg (=0.015)
Percentile for adaptive selection of
--ReadGraph.maxTrim (only used when
--ReadGraph.creationMethod is 2).
--ReadGraph.flagInconsistentAlignments
Flag inconsistent alignments.
Experimental.
--ReadGraph.flagInconsistentAlignments.triangleErrorThreshold arg (=200)
Triangle error threshold, in markers,
for flagging inconsistent alignments.
Only used if --ReadGraph.flagInconsiste
ntAlignments is set. Experimental.
--ReadGraph.flagInconsistentAlignments.leastSquareErrorThreshold arg (=200)
Least square error threshold, in
markers, for flagging inconsistent
alignments. Only used if
--ReadGraph.flagInconsistentAlignments
is set. Experimental.
--ReadGraph.flagInconsistentAlignments.leastSquareMaxDistance arg (=1)
Least square max distance for flagging
inconsistent alignments. Only used if
--ReadGraph.flagInconsistentAlignments
is set. Experimental.
--MarkerGraph.minCoverage arg (=10) Minimum coverage (number of supporting
oriented reads) for a marker graph
vertex to be created.Specifying 0
causes a suitable value of this
parameter to be selected automatically.
--MarkerGraph.maxCoverage arg (=100) Maximum coverage (number of supporting
oriented reads) for a marker graph
vertex.
--MarkerGraph.minCoveragePerStrand arg (=0)
Minimum coverage (number of supporting
oriented reads) for each strand for a
marker graph vertex.
--MarkerGraph.minEdgeCoverage arg (=6)
Minimum edge coverage (number of
supporting oriented reads) for a marker
graph edge to be created.Experimental.
Only used with --Assembly.mode 1.
--MarkerGraph.minEdgeCoveragePerStrand arg (=2)
Minimum edge coverage (number of
supporting oriented reads) on each
strand for a marker graph edge to be
created.Experimental. Only used with
--Assembly.mode 1.
--MarkerGraph.allowDuplicateMarkers Specifies whether to allow more than
one marker on the same oriented read on
a single marker graph vertex.
Experimental.
--MarkerGraph.cleanupDuplicateMarkers
Specifies whether to clean up marker
graph vertices with more than one
marker on the same oriented read.
Experimental.
--MarkerGraph.duplicateMarkersPattern1Threshold arg (=0.5)
Used when cleaning up marker graph
vertices with more than one marker on
the same oriented read. Experimental.
--MarkerGraph.lowCoverageThreshold arg (=0)
Used during approximate transitive
reduction. Marker graph edges with
coverage lower than this value are
always marked as removed regardless of
reachability.
--MarkerGraph.highCoverageThreshold arg (=256)
Used during approximate transitive
reduction. Marker graph edges with
coverage higher than this value are
never marked as removed regardless of
reachability.
--MarkerGraph.maxDistance arg (=30) Used during approximate transitive
reduction.
--MarkerGraph.edgeMarkerSkipThreshold arg (=100)
Used during approximate transitive
reduction.
--MarkerGraph.pruneIterationCount arg (=6)
Number of prune iterations.
--MarkerGraph.simplifyMaxLength arg (=10,100,1000)
Maximum lengths (in markers) used at
each iteration of simplifyMarkerGraph.
--MarkerGraph.crossEdgeCoverageThreshold arg (=0)
Experimental. Cross edge coverage
threshold. If this is not zero,
assembly graph cross-edges with average
edge coverage less than this value are
removed, together with the
corresponding marker graph edges. A
cross edge is defined as an edge v0->v1
with out-degree(v0)>1, in-degree(v1)>1.
--MarkerGraph.refineThreshold arg (=0)
Experimental. Length threshold, in
markers, for the marker graph
refinement step, or 0 to turn off the
refinement step.
--MarkerGraph.reverseTransitiveReduction
Perform approximate reverse transitive
reduction of the marker graph.
--MarkerGraph.peakFinder.minAreaFraction arg (=0.080000000000000002)
Used in the automatic selection of
--MarkerGraph.minCoverage when
--MarkerGraph.minCoverage is set to 0.
--MarkerGraph.peakFinder.areaStartIndex arg (=2)
Used in the automatic selection of
--MarkerGraph.minCoverage when
--MarkerGraph.minCoverage is set to 0.
--Assembly.mode arg (=0) Assembly mode (0=default,
1=experimental).
--Assembly.crossEdgeCoverageThreshold arg (=3)
Maximum average edge coverage for a
cross edge of the assembly graph to be
removed.
--Assembly.markerGraphEdgeLengthThresholdForConsensus arg (=1000)
Controls assembly of long marker graph
edges.
--Assembly.consensusCaller arg (=Modal)
Selects the consensus caller for repeat
counts. See the documentation for
available choices.
--Assembly.storeCoverageData Used to request storing coverage data
in binary format.
--Assembly.storeCoverageDataCsvLengthThreshold arg (=0)
Used to specify the minimum length of
an assembled segment for which coverage
data in csv format should be stored. If
0, no coverage data in csv format is
stored.
--Assembly.writeReadsByAssembledSegment
Used to request writing the reads that
contributed to assembling each segment.
--Assembly.pruneLength arg (=0) Prune length (in markers) for pruning
of the assembly graph. Assembly graph
leaves shorter than this number of
markers are iteratively pruned. Set to
zero to suppress pruning of the
assembly graph. Assembly graph pruning
takes place separately and in addition
to marker graph pruning.
--Assembly.detangleMethod arg (=0) Specify the method used to detangle the
assembly graph. 0 = no detangling, 1 =
strict detangling, 2 = less strict
detangling, controlled by
Assembly.detangle.* options
(experimental).
--Assembly.detangle.diagonalReadCountMin arg (=1)
Minimum number of reads on detangle
matrix diagonal elements required for
detangling.
--Assembly.detangle.offDiagonalReadCountMax arg (=2)
Maximum number of reads on detangle
matrix off-diagonal elements allowed
for detangling.
--Assembly.detangle.offDiagonalRatio arg (=0.29999999999999999)
Maximum ratio of total off-diagonal
elements over diagonal element allowed
for detangling.
--Assembly.iterative Used to request iterative assembly
(experimental).
--Assembly.iterative.iterationCount arg (=3)
Number of iterations for iterative
assembly (experimental).
--Assembly.iterative.pseudoPathAlignMatchScore arg (=1)
Pseudopath alignment match score for
iterative assembly (experimental).
--Assembly.iterative.pseudoPathAlignMismatchScore arg (=-1)
Pseudopath alignment mismatch score for
iterative assembly (experimental).
--Assembly.iterative.pseudoPathAlignGapScore arg (=-1)
Pseudopath alignment gap score for
iterative assembly (experimental).
--Assembly.iterative.mismatchSquareFactor arg (=3)
Mismatch square factor for iterative
assembly (experimental).
--Assembly.iterative.minScore arg (=0)
Minimum pseudo-alignment score for
iterative assembly (experimental).
--Assembly.iterative.maxAlignmentCount arg (=6)
Maximum number of read graph neighbors
for iterative assembly (experimental).
--Assembly.iterative.bridgeRemovalIterationCount arg (=3)
Number of read graph bridge removal
iterations for iterative assembly
(experimental).
--Assembly.iterative.bridgeRemovalMaxDistance arg (=2)
Maximum distance for read graph bridge
removal for iterative assembly
(experimental).
--Assembly.bubbleRemoval.discordantRatioThreshold arg (=0.20000000000000001)
Discordant ratio threshold for bubble
removal (assembly mode 2 only,
experimental).
--Assembly.bubbleRemoval.ambiguityThreshold arg (=0.5)
Ambiguity threshold for bubble removal
(assembly mode 2 only, experimental).
--Assembly.bubbleRemoval.maxPeriod arg (=4)
Maximum repeat period for bubble
removal (assembly mode 2 only,
experimental).
--Assembly.superbubbleRemoval.edgeLengthThreshold arg (=6)
Edge length threshold in markers for
superbubble removal (assembly mode 2
only, experimental).
--Assembly.phasing.minReadCount arg (=3)
Minimum number of reads for phasing
(assembly mode 2 only, experimental).
--Assembly.suppressGfaOutput Suppress all GFA output (Mode 2
assembly only).
--Assembly.suppressFastaOutput Suppress all FASTA output (Mode 2
assembly only).
--Assembly.suppressDetailedOutput Suppress output of detailed
representation of the assembly (Mode 2
assembly only).
--Assembly.suppressPhasedOutput Suppress output of phased
representation of the assembly (Mode 2
assembly only).
--Assembly.suppressHaploidOutput Suppress output of haploid
representation of the assembly (Mode 2
assembly only).
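The help text above states that `--Reads.desiredCoverage` takes a number of bases rather than a depth. A minimal sketch of the conversion, assuming a known genome size and a target depth; the helper name and the example values (a 3.1 Gb genome at 40x) are illustrative assumptions, not recommendations from the Shasta authors.

```shell
#!/bin/sh
# Hypothetical helper: genome size (bp) times target depth gives the
# number of bases to pass to --Reads.desiredCoverage.
desired_coverage_bases() {
    genome_size_bp="$1"
    depth="$2"
    echo $(( genome_size_bp * depth ))
}

# Illustrative values: 3,100,000,000 bp at 40x depth.
bases=$(desired_coverage_bases 3100000000 40)
printf '%s %s\n' --Reads.desiredCoverage "$bases"
# prints --Reads.desiredCoverage 124000000000
```

According to the help text, the same value could also be written with a power-of-10 multiplier, in the style of its `120Gb` example.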
Usage
Specify the long-read FASTA file and a config file. Built-in presets are also available.
shasta --input input.fasta --config Nanopore-May2022 --assemblyDirectory working_dir
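The help output notes that the assembly directory must not already exist when the command is `assemble`. The following sketch wraps that check around the command line above and only prints the command instead of running it; the function name `shasta_assemble_cmd` is hypothetical, and the thread default of 0 mirrors the `--threads` default shown in the help.

```shell
#!/bin/sh
# Hypothetical wrapper: print the shasta command for a run, refusing when the
# output directory already exists (shasta requires that it not exist).
shasta_assemble_cmd() {
    input="$1"
    config="$2"
    outdir="$3"
    threads="${4:-0}"    # 0 = one thread per virtual processor (help default)
    if [ -e "$outdir" ]; then
        echo "error: $outdir already exists; choose a new directory" >&2
        return 1
    fi
    printf 'shasta --input %s --config %s --assemblyDirectory %s --threads %s\n' \
        "$input" "$config" "$outdir" "$threads"
}

# Dry run with the same arguments as the command above.
shasta_assemble_cmd input.fasta Nanopore-May2022 working_dir
```

Printing rather than executing keeps the sketch safe to try without Shasta installed; for paths without spaces, the printed line can be pasted into the terminal as-is.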
The documentation describes ways to improve performance.
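One performance-related knob visible in the help output is the pair `--memoryMode` and `--memoryBacking`, where 2M pages are described as giving the best performance and the combination (anonymous, disk) is not allowed. A minimal sketch that validates a combination against those rules and prints the flags; the function name is hypothetical, and which combinations need root privilege on a given system is not checked here.

```shell
#!/bin/sh
# Hypothetical helper: validate a (memoryMode, memoryBacking) pair against
# the rules stated in the help output and print the corresponding flags.
shasta_memory_flags() {
    mode="$1"
    backing="$2"
    case "$mode" in
        anonymous|filesystem) ;;
        *) echo "invalid memoryMode: $mode" >&2; return 1 ;;
    esac
    case "$backing" in
        disk|4K|2M) ;;
        *) echo "invalid memoryBacking: $backing" >&2; return 1 ;;
    esac
    # The help output excludes exactly one combination.
    if [ "$mode" = "anonymous" ] && [ "$backing" = "disk" ]; then
        echo "combination (anonymous, disk) is not allowed" >&2
        return 1
    fi
    printf '%s %s %s %s\n' --memoryMode "$mode" --memoryBacking "$backing"
}

shasta_memory_flags filesystem 2M
# prints --memoryMode filesystem --memoryBacking 2M
```

The printed flags would be appended to a `shasta` command line; as the help text warns, some combinations obtain root privilege through sudo and may prompt for a password.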
Citation
Efficient de novo assembly of eleven human genomes using PromethION sequencing and a novel nanopore toolkit
Kishwar Shafin, Trevor Pesout, Ryan Lorig-Roach, Marina Haukness, Hugh E Olsen, Colleen Bosworth, Joel Armstrong, Kristof Tigyi, Nicholas Maurer, Sergey Koren, Fritz J Sedlazeck, Tobias Marschall, Simon Mayes, Vania Costa, Justin M Zook, Kelvin J Liu, Duncan Kilburn, Melanie Sorensen, Katy M Munson, Mitchell R Vollger, Evan E Eichler, Sofie Salama, David Haussler, Richard E Green, Mark Akeson, Adam Phillippy, Karen H Miga, Paolo Carnevali, Miten Jain, Benedict Paten
bioRxiv preprint first posted online Jul. 26, 2019
Nanopore sequencing and the Shasta toolkit enable efficient de novo assembly of eleven human genomes
Kishwar Shafin, Trevor Pesout, […]Benedict Paten
Nature Biotechnology volume 38, pages 1044–1053 (2020)
Related