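2021 6/27: added the paper citation
Metagenomic sequencing has led to the identification and assembly of many new bacterial genome sequences. These bacteria often contain plasmids, which remain poorly studied and poorly understood. To support the study of these plasmids, we developed SCAPP (Sequence Contents Aware Plasmid Peeler), a tool for assembling plasmid sequences from metagenomic sequencing data. SCAPP builds on ideas from the plasmid assembly algorithm Recycler and improves plasmid assembly by integrating biological knowledge about plasmids. We compared the performance of SCAPP with Recycler and metaplasmidSPAdes on simulated metagenomes, real human gut microbiome samples, and a human gut plasmidome sample that we generated. We also created plasmidome and metagenome samples in parallel from the cow rumen and used them to build a new evaluation procedure. In most cases, SCAPP performed as well as or better than Recycler and metaplasmidSPAdes across this wide range of datasets.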
Installation
Tested in an Ubuntu 18.04 Docker environment (host OS: macOS 10.14).
Dependencies
- SCAPP is written in Python3. SCAPP uses NumPy, NetworkX, pySAM, and nose. The necessary versions of these required dependencies will all be installed by the setup.py script.
- BWA (tested with v0.7.5 and v0.7.17), NCBI BLAST+ tools (tested with v2.7 and v2.9), and samtools (tested with v1.9 and v1.10).
- The PlasClass classifier should also be installed in order to use the full functionality of SCAPP.
# Using conda: download the YAML file and install the dependencies from it.
wget https://raw.githubusercontent.com/Shamir-Lab/SCAPP/master/install_scapp.yaml
conda env create -f install_scapp.yaml
conda activate scapp
# From source
git clone https://github.com/Shamir-Lab/SCAPP.git
cd SCAPP
python setup.py install
# Also install the PlasClass dependency
git clone https://github.com/Shamir-Lab/PlasClass.git
cd PlasClass
python setup.py install
# bwa, BLAST+, and samtools must also be on the PATH
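The PATH requirement can be checked before the first run. The following is a minimal sketch and not part of SCAPP: the function name missing_tools is hypothetical, and blastn and makeblastdb are assumed here to stand in for the BLAST+ suite as a whole.

```python
import shutil

# Assumed executable names: bwa and samtools are installed under these names;
# blastn and makeblastdb are two BLAST+ binaries used as stand-ins for the suite.
REQUIRED_TOOLS = ("bwa", "samtools", "blastn", "makeblastdb")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the entries of `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not on PATH: " + ", ".join(missing))
    else:
        print("All external tools were found on the PATH.")
```

The `which` parameter exists only so the lookup can be replaced in a test; in normal use the default shutil.which is enough.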
> python scapp.py
usage: scapp.py [-h] -g GRAPH -o OUTPUT_DIR [-k MAX_K] [-l MIN_LENGTH]
                [-m MAX_CV] [-p NUM_PROCESSES] [-sc USE_SCORES]
                [-gh USE_GENE_HITS] [-b BAM] [-r1 READS1] [-r2 READS2]
                [-pc PLASCLASS | -pf PLASFLOW] [-clft CLASSIFICATION_THRESH]
                [-gm GENE_MATCH_THRESH] [-sls SELFLOOP_SCORE_THRESH]
                [-slm SELFLOOP_MATE_THRESH] [-cst CHROMOSOME_SCORE_THRESH]
                [-clt CHROMOSOME_LEN_THRESH] [-pst PLASMID_SCORE_THRESH]
                [-plt PLASMID_LEN_THRESH] [-cd GOOD_CYC_DOMINATED_THRESH]
scapp.py: error: the following arguments are required: -g/--graph, -o/--output_dir
> python scapp.py -h
usage: scapp.py [-h] -g GRAPH -o OUTPUT_DIR [-k MAX_K] [-l MIN_LENGTH]
                [-m MAX_CV] [-p NUM_PROCESSES] [-sc USE_SCORES]
                [-gh USE_GENE_HITS] [-b BAM] [-r1 READS1] [-r2 READS2]
                [-pc PLASCLASS | -pf PLASFLOW] [-clft CLASSIFICATION_THRESH]
                [-gm GENE_MATCH_THRESH] [-sls SELFLOOP_SCORE_THRESH]
                [-slm SELFLOOP_MATE_THRESH] [-cst CHROMOSOME_SCORE_THRESH]
                [-clt CHROMOSOME_LEN_THRESH] [-pst PLASMID_SCORE_THRESH]
                [-plt PLASMID_LEN_THRESH] [-cd GOOD_CYC_DOMINATED_THRESH]

SCAPP extracts likely plasmids from de novo assembly graphs

optional arguments:
  -h, --help            show this help message and exit
  -g GRAPH, --graph GRAPH
                        Assembly graph FASTG file to process
  -o OUTPUT_DIR, --output_dir OUTPUT_DIR
                        Output directory
  -k MAX_K, --max_k MAX_K
                        Integer reflecting maximum k value used by the
                        assembler
  -l MIN_LENGTH, --min_length MIN_LENGTH
                        Minimum length required for reporting [default: 1000]
  -m MAX_CV, --max_CV MAX_CV
                        Coefficient of variation used for pre-selection
                        [default: 0.5, higher--> less restrictive]
  -p NUM_PROCESSES, --num_processes NUM_PROCESSES
                        Number of processes to use
  -sc USE_SCORES, --use_scores USE_SCORES
                        Boolean flag of whether to use sequence classification
                        scores in plasmid assembly
  -gh USE_GENE_HITS, --use_gene_hits USE_GENE_HITS
                        Boolean flag of whether to use plasmid-specific gene
                        hits in plasmid assembly
  -b BAM, --bam BAM     BAM file resulting from aligning reads to contigs
                        file, filtering for best matches
  -r1 READS1, --reads1 READS1
                        1st paired-end read file path
  -r2 READS2, --reads2 READS2
                        1st paired-end read file path
  -pc PLASCLASS, --plasclass PLASCLASS
                        PlasClass score file with scores of the assembly graph
                        nodes
  -pf PLASFLOW, --plasflow PLASFLOW
                        PlasFlow score file with scores of the assembly graph
                        nodes
  -clft CLASSIFICATION_THRESH, --classification_thresh CLASSIFICATION_THRESH
                        threshold for classifying potential plasmid [0.5]
  -gm GENE_MATCH_THRESH, --gene_match_thresh GENE_MATCH_THRESH
                        threshold for % identity and fraction of length to
                        match plasmid genes [0.75]
  -sls SELFLOOP_SCORE_THRESH, --selfloop_score_thresh SELFLOOP_SCORE_THRESH
                        threshold for self-loop plasmid score [0.9]
  -slm SELFLOOP_MATE_THRESH, --selfloop_mate_thresh SELFLOOP_MATE_THRESH
                        threshold for self-loop off loop mates [0.1]
  -cst CHROMOSOME_SCORE_THRESH, --chromosome_score_thresh CHROMOSOME_SCORE_THRESH
                        threshold for high confidence chromosome node score
                        [0.2]
  -clt CHROMOSOME_LEN_THRESH, --chromosome_len_thresh CHROMOSOME_LEN_THRESH
                        threshold for high confidence chromosome node length
                        [10000]
  -pst PLASMID_SCORE_THRESH, --plasmid_score_thresh PLASMID_SCORE_THRESH
                        threshold for high confidence plasmid node score [0.9]
  -plt PLASMID_LEN_THRESH, --plasmid_len_thresh PLASMID_LEN_THRESH
                        threshold for high confidence plasmid node length
                        [10000]
  -cd GOOD_CYC_DOMINATED_THRESH, --good_cyc_dominated_thresh GOOD_CYC_DOMINATED_THRESH
                        threshold for # of mate-pairs off the cycle in
                        dominated node [0.5]
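To make the --max_CV option more concrete: a coefficient of variation is the standard deviation divided by the mean. The sketch below assumes the value is computed over the coverage of the nodes in a candidate cycle, and that nodes carry SPAdes-style names (EDGE_<id>_length_<len>_cov_<cov>). It is an unweighted illustration under those assumptions, not SCAPP's actual pre-selection code, and all function names are hypothetical.

```python
import re
import statistics

# SPAdes-style node name, e.g. EDGE_12_length_3456_cov_78.9 (a trailing '
# marks the reverse complement). Other assemblers name nodes differently.
_NODE_RE = re.compile(r"EDGE_\d+_length_(\d+)_cov_([0-9]+(?:\.[0-9]+)?)")


def node_coverage(name):
    """Extract the coverage value from a SPAdes-style node name."""
    match = _NODE_RE.search(name)
    if match is None:
        raise ValueError(f"not a SPAdes-style node name: {name!r}")
    return float(match.group(2))


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean."""
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("mean is zero, CV is undefined")
    return statistics.pstdev(values) / mean


def passes_cv_filter(node_names, max_cv=0.5):
    """True if the coverage CV of the given nodes is at most max_cv."""
    coverages = [node_coverage(name) for name in node_names]
    return coefficient_of_variation(coverages) <= max_cv
```

With this definition, a higher max_cv accepts cycles whose nodes differ more in coverage, which matches the "higher--> less restrictive" note in the help text.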
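Usage
Specify the assembly graph FASTG file together with either the paired-end reads (-r1/-r2) or, as in the command below, a BAM file of the reads aligned to the contigs (-b).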
scapp -g input.fastg -o output_dir -k <max k value> -b input.bam -p 8
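When SCAPP is run over many samples, the same command can be assembled in a script. The following is a minimal sketch that uses only flags listed in the help output; build_scapp_command is a hypothetical helper, and preferring the BAM file over the read files when both are given is a choice of this sketch, not documented SCAPP behavior.

```python
def build_scapp_command(graph, output_dir, max_k=None, bam=None,
                        reads1=None, reads2=None, num_processes=None):
    """Build the argument list for one scapp run."""
    if bam is None and (reads1 is None or reads2 is None):
        raise ValueError("give either bam or both reads1 and reads2")

    cmd = ["scapp", "-g", str(graph), "-o", str(output_dir)]
    if max_k is not None:
        cmd += ["-k", str(max_k)]
    if bam is not None:
        # Sketch-level choice: an existing alignment takes priority over reads.
        cmd += ["-b", str(bam)]
    else:
        cmd += ["-r1", str(reads1), "-r2", str(reads2)]
    if num_processes is not None:
        cmd += ["-p", str(num_processes)]
    return cmd


# Example (not executed here):
#   import subprocess
#   subprocess.run(build_scapp_command("input.fastg", "output_dir",
#                                      max_k=55, bam="input.bam",
#                                      num_processes=8), check=True)
```

Returning a list rather than a single string lets subprocess.run execute it without a shell, so paths containing spaces need no quoting. The value 55 for max_k in the example is a placeholder.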
The main output is output_dir/assembly_graph.confident_cycs.fasta.
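A quick way to inspect the result is to count the reported sequences and their lengths. The following is a minimal sketch that reads any FASTA file with the standard library only; the function names are hypothetical and nothing here is specific to SCAPP.

```python
def read_fasta_lengths(path):
    """Return {record name: sequence length} for a FASTA file."""
    lengths = {}
    name = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # Use the first whitespace-delimited token as the record name.
                fields = line[1:].split()
                name = fields[0] if fields else f"record_{len(lengths) + 1}"
                lengths[name] = 0
            elif name is not None:
                lengths[name] += len(line)
    return lengths


def summarize(lengths):
    """Count the records and report total and longest sequence length."""
    return {
        "count": len(lengths),
        "total_bp": sum(lengths.values()),
        "longest_bp": max(lengths.values(), default=0),
    }


# Example (path taken from the text above):
#   lengths = read_fasta_lengths("output_dir/assembly_graph.confident_cycs.fasta")
#   print(summarize(lengths))
```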
Citation
SCAPP: An algorithm for improved plasmid assembly in metagenomes
David Pellow, Maraike Probst, Ori Furman, Alvah Zorea, Arik Segal, Itzik Mizrahi, Ron Shamir
bioRxiv preprint, Posted January 14, 2020
2021 6/27
SCAPP: an algorithm for improved plasmid assembly in metagenomes
David Pellow, Alvah Zorea, Maraike Probst, Ori Furman, Arik Segal, Itzhak Mizrahi, Ron Shamir
Microbiome. 2021 Jun 25;9(1):144
Related