2024-04-17

WGSやRNA-seqやTarget-captureなどのさまざまなショートリードデータからユーザーが指定した数百〜数千の遺伝子座の同祖配列を抽出し、系統解析に使用可能なMSAを出力する CAPTUS

　ターゲットキャプチャー、RNA-Seq、ゲノムスキミング、深く読んだ全ゲノムシーケンスなど、多様なハイスループットシーケンスデータは系統ゲノム解析に利用されているが、このようなミックスされたデータを単一の系統ゲノムデータセットに統合するには、多くのバイオインフォマティクスツールと多大な計算リソースが必要である。ここでは、ミックスされたデータを高速かつ効率的に解析する新しいパイプラインCAPTUSを紹介する。CAPTUSはこれらのデータをアセンブルし、アセンブルされたデータから目的の遺伝子座を検索し、最終的にパラログをフィルターしたアラインメントを作成する。他のソフトウェアと比較して、CAPTUSはより多くの生物種にわたり、より完全な遺伝子座をより多く復元することができる。本著者らは、ベゴニア（Begoniaceae）やウリ（Cucurbitaceae）を含む、主に熱帯の8科、約3,100種からなる被子植物目のCucurbitalesの4種類のシーケンスデータからなる包括的な混合データセットの構築にCAPTUSを適用した。系統学的結果は、ホロ寄生性のアポダンサ科の位置を除き、現在受け入れられているウリ科の分類を支持するものである。ウリ科では、現在認められているすべてのtribesが単系統であることが確認された。しかし、ウリ目とウリ科の両方に深い網目模様があることも明らかにした。我々（本著者ら）は、ウリ目における初期の系統学的研究の相反する結果が、遺伝子ツリーの対立を考慮することで調和可能であることを示す。

Documentation

https://edgardomortiz.github.io/captus.docs/index.html

マニュアルより

CaptusはHigh-Throughput Sequencingデータから系統ゲノムデータセットをアセンブルするためのツールキットである。Captusは当初、ターゲットエンリッチメント（RNAまたはDNAプローブのハイブリダイゼーションなどによる）シーケンスデータのアセンブリ専用に設計されたが、その後、ゲノムスキミング、Hyb-Seq（ターゲットエンリッチメント＋ゲノムスキミング）、RNA-seq、全ゲノムシーケンスなど、他の一般的なタイプのHTSデータにも対応できるように拡張された。ツールキットには、ターゲットエンリッチメント実験用のプローブをデザインするモジュールも含まれている。以下の特徴を持つ。

シンプル：インストールが簡単で使いやすい。
高速：感度を犠牲にすることなく、各ステップのスピードを最適化した。
再現性：コマンドは明確で、出力ディレクトリはよく整理されており、広範なログが常に保存されている。
柔軟：生データでも、独自のクリーンアップリードやアセンブルリードでも解析を開始できる。すべての解析をやり直すことなく、既存のデータセットにサンプルを追加したり、一部のサンプルのみで解析をやり直すことができる。
透明性：ワークフローのどの段階でも、サンプルの状態を迅速に評価できるように、情報量が多く、整理された読みやすい出力や、整然としたHTMLレポートを提供する。

インストール

レポジトリではmicromambaで環境を作って導入するのが最速と説明されている。ubuntu20でテストした（macへの導入については注意事項がある。レポジトリ参照）。

依存

Captus was written for python >= v3.7
BBTools (https://jgi.doe.gov/data-and-tools/bbtools/)
Falco (https://github.com/smithlabcode/falco) or FastQC (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)
MEGAHIT (https://github.com/voutcn/megahit)
* Scipio (https://www.webscipio.org/)
BioPerl (https://bioperl.org/)
YAML (https://metacpan.org/pod/YAML)
* BLAT >= 36x7 (http://hgdownload.soe.ucsc.edu/admin/exe/)
MMseqs2 (https://github.com/soedinglab/MMseqs2)
MAFFT (https://mafft.cbrc.jp/alignment/software/)
MUSCLE (https://www.drive5.com/muscle/)
ClipKIT (https://github.com/JLSteenwyk/ClipKIT)
pigz (https://zlib.net/pigz/)

本体 Github

https://github.com/edgardomortiz/Captus

mamba create -n captus captus -y
conda activate captus

> captus_assembly -h

usage: captus_assembly command [options]

Captus 1.0.1: Assembly of Phylogenomic Datasets from High-Throughput Sequencing data

Captus-assembly commands:

command Program commands (in typical order of execution)

clean = Trim adaptors and quality filter reads with BBTools, run FastQC on the raw and cleaned reads

assemble = Perform de novo assembly with MEGAHIT: Assembling reads that were cleaned with the 'clean' command is recommended, but reads cleaned elsewhere are also allowed

extract = Recover targeted markers with BLAT and Scipio: Extracting markers from the assembly obtained with the 'assemble' command is recommended, but any other assemblies in FASTA format are also allowed

align = Align extracted markers across samples with MAFFT or MUSCLE: Marker alignment depends on the directory structure created by the 'extract' command. This step also performs paralog filtering and alignment trimming using ClipKIT

Help:

-h, --help Show this help message and exit

--version Show Captus' version number

For help on a particular command: captus_assembly command -h

> captus_assembly clean -h

usage: captus_assembly clean -r READS [READS ...] [options]

Captus-assembly: Clean; remove adaptors and quality-filter reads with

BBTools

Input:

-r READS [READS ...], --reads READS [READS ...]

FASTQ files. Valid filename extensions are: .fq,

.fastq, .fq.gz, and .fastq.gz. The names must

include the string '_R1' (and '_R2' when pairs are

provided). Everything before the string '_R1' will

be used as sample name. There are a few ways to

provide the FASTQ files:

A directory = path to directory containing FASTQ

files (e.g.: -r ./raw_reads)

A list = filenames separated by space (e.g.: -r

A_R1.fq A_R2.fq B_R1.fq C_R1.fq)

A pattern = UNIX matching expression (e.g.: -r

./raw_reads/*.fastq.gz)

Output:

-o OUT, --out OUT Output directory name (default: ./01_clean_reads)

--keep_all Do not delete any intermediate files (default:

False)

--overwrite Overwrite previous results (default: False)

Adaptor trimming:

--adaptor_set ADAPTOR_SET

Set of adaptors to remove

Illumina = Illumina adaptors included in BBTools

BGI = BGISEQ, DNBSEG, or MGISEQ adaptors

ALL = Illumina + BGI

Alternatively, you can provide a path to a FASTA

file containing your own adaptors (default: ALL)

--rna Trim ploy-A tails from RNA-Seq reads (default:

False)

Quality trimming and filtering:

--trimq TRIMQ Leading and trailing read regions with average PHRED

quality score below this value will be trimmed

(default: 13)

--maq MAQ After quality trimming, reads with average PHRED

quality score below this value will be removed

(default: 16)

--ftl FTL Trim any base to the left of this position. For

example, if you want to remove 4 bases from the left

of the reads set this number to 5 (default: 0)

--ftr FTR Trim any base to the right of this position. For

example, if you want to truncate your reads length

to 100 bp set this number to 100 (default: 0)

QC Statistics:

--qc_program {fastqc,falco}

Which program to use to obtain the statistics from

the raw and cleaned FASTQ files. Falco produces

identical results to FastQC while being much faster

(default: fastqc)

--skip_qc_stats Skip FastQC/Falco analysis on raw and cleaned reads

(default: False)

Other:

--bbduk_path BBDUK_PATH

Path to bbduk.sh (default: bbduk.sh)

--falco_path FALCO_PATH

Path to Falco (default: falco)

--fastqc_path FASTQC_PATH

Path to FastQC (default: fastqc)

--ram RAM Maximum RAM in GB (e.g.: 4.5) dedicated to Captus,

'auto' uses 99% of available RAM (default: auto)

--threads THREADS Maximum number of CPUs dedicated to Captus, 'auto'

uses all available CPUs (default: auto)

--concurrent CONCURRENT

Captus will attempt to run FastQC concurrently on

this many samples. If set to 'auto', Captus will run

at most 4 instances of FastQC or as many CPU cores

are available, whatever number is lower (default:

auto)

--debug Enable debugging mode, parallelization is disabled

so errors are logged to screen (default: False)

--show_less Do not show individual sample information during the

run, the information is still written to the log

(default: False)

Help:

-h, --help Show this help message and exit

--version Show Captus' version number

For more information, please see https://github.com/edgardomortiz/Captus

> captus_assembly assemble -h

usage: captus_assembly assemble -r READS [READS ...] [options]

Captus-assembly: Assemble; perform de novo assembly using MEGAHIT

Input:

-r READS [READS ...], --reads READS [READS ...]

FASTQ files. Valid file name extensions are: .fq,

.fastq, .fq.gz, and .fastq.gz. The names must

include the string '_R1' (and '_R2' when pairs are

provided). Everything before the string '_R1' will

be used as sample name. There are a few ways to

provide the FASTQ files:

A directory = path to directory containing FASTQ

files (e.g.: -r ./raw_reads)

A list = file names separated by space (e.g.: -r

A_R1.fq A_R2.fq B_R1.fq C_R1.fq)

A pattern = UNIX matching expression (e.g.: -r

./raw_reads/*.fastq.gz) (default:

./01_clean_reads)

--sample_reads_target SAMPLE_READS_TARGET

Use this number of read pairs (or reads if single-

end) for assembly. Reads are randomly subsampled

with 'reformat.sh' from BBTools (option:

srt/samplereadstarget). Useful for limiting the

amount of data of samples with very high sequencing

depth. To use all the reads, set this value to 0

(default: 0)

Output:

-o OUT, --out OUT Output directory name. Inside this directory the

output for each sample will be stored in a

subdirectory named as 'Sample_name__captus-asm'

(default: ./02_assemblies)

--keep_all Do not delete any intermediate files (default:

False)

--overwrite Overwrite previous results (default: False)

MEGAHIT:

--k_list K_LIST Comma-separated list of kmer sizes, all must be odd

values in the range 15-255, in increments of at most

28. If not provided, a list optimized for

hybridization capture and/or genome skimming data

will be used. The final kmer size will be adjusted

automatically so it never exceeds the mean read

length of the sample by more than 31

--min_count MIN_COUNT Minimum contig depth (a.k.a. multiplicity in

MEGAHIT), accepted values are integers >= 1.

Reducing it to 1 may increase the amount of low-

depth contigs likely produced from reads with

errors. Increase above 2 if the data has high and

even sequencing depth

--prune_level PRUNE_LEVEL

Prunning strength for low-coverage edges during

graph cleaning. Increasing the value beyond 2 can

speed up the assembly at the cost of losing low-

depth contigs. Accepted values are integers between

0 and 3

--merge_level MERGE_LEVEL

Merge complex bubbles, the first number multiplied

by the kmer size represents the maximum bubble

length to merge, the second number represents the

minimum similarity required to merge bubbles

(default: 20,0.95)

--preset PRESET The default preset is 'CAPSKIM', these settings work

well with either hybridization capture or genome

skimming data (or a combination of both). You can

assemble RNA-Seq reads with the 'RNA' preset or

high-coverage Whole Genome Sequencing reads with the

'WGS' preset, however, both presets require a

minimum of 8GB of RAM to work well (default:

CAPSKIM)

CAPSKIM = --k-list

31,39,47,63,79,95,111,127,143,159,175

--min-count 2 --prune-level 2

RNA = --k-list 27,47,67,87,107,127,147,167

--min-count 2 --prune-level 2

WGS = --k-list 31,39,49,69,89,109,129,149,169

--min-count 3 --prune-level 2

--min_contig_len MIN_CONTIG_LEN

Minimum contig length in output assembly, 'auto' is

mean input read length + smallest kmer size in '--

k_list' (default: auto)

--max_contig_gc MAX_CONTIG_GC

Maximum GC percentage allowed per contig. Useful to

filter contamination. For example, bacteria usually

exceed 60% GC content while eukaryotes rarely exceed

that limit. 100.0 disables the GC filter (default:

100.0)

--tmp_dir TMP_DIR Location to create the temporary directory

'captus_assembly_tmp' for MEGAHIT assembly.

Sometimes, when working on external hard drives

MEGAHIT will refuse to run unless this directory is

created in an internal hard drive. (default: $HOME)

Other:

--reformat_path REFORMAT_PATH

Path to reformat.sh (default: reformat.sh)

--megahit_path MEGAHIT_PATH

Path to MEGAHIT (default: megahit)

--megahit_toolkit_path MEGAHIT_TOOLKIT_PATH

Path to MEGAHIT's toolkit (default: megahit_toolkit)

--ram RAM Maximum RAM in GB (e.g.: 4.5) dedicated to Captus,

'auto' uses 99% of available RAM (default: auto)

--threads THREADS Maximum number of CPUs dedicated to Captus, 'auto'

uses all available CPUs (default: auto)

--concurrent CONCURRENT

Captus will attempt to assemble this many samples

concurrently. RAM and CPUs will be divided by this

value for each individual MEGAHIT process. For

example if you set --threads to 12 and --concurrent

to 3, then each MEGAHIT assembly will be done using

--threads/--concurrent = 4 CPUs. If set to 'auto',

Captus will run as many concurrent assemblies as

possible with a minimum of 4 CPUs and 4 GB of RAM

per assembly (8 GB if presets RNA or WGS are used)

(default: auto)

--debug Enable debugging mode, parallelization is disabled

so errors are logged to screen (default: False)

--show_less Do not show individual sample information during the

run, the information is still written to the log

(default: False)

Help:

-h, --help Show this help message and exit

--version Show Captus' version number

For more information, please see https://github.com/edgardomortiz/Captus

> captus_assembly extract -h

usage: captus_assembly extract -a CAPTUS_ASSEMBLIES_DIR [options]

Captus-assembly: Extract; recover markers from FASTA assemblies

Input:

-a CAPTUS_ASSEMBLIES_DIR, --captus_assemblies_dir CAPTUS_ASSEMBLIES_DIR

Path to an output directory from the 'assemble' step

of Captus-assembly which is tipically called

'02_assemblies'. If you DID NOT assemble any sample

within Captus and want to start exclusivey with

FASTAs assembled elsewhere, the path provided here

will be created in order to contain your assemblies

provided with '-f' into a proper directory structure

needed by Captus (default: ./02_assemblies)

-f FASTAS [FASTAS ...], --fastas FASTAS [FASTAS ...]

FASTA assembly file(s) that were not assembled with

Captus. Valid file name extensions are: .fa, .fna,

.fasta, .fa.gz, .fna.gz, .fasta.gz. These FASTA

files must contain only nucleotides (no aminoacids).

All the text before the extension of the filename

will be used as sample name. These FASTAs will be

automatically copied to the path provided with

'-a'/'--captus_assemblies_dir' using the correct

directory structure needed by Captus. There are a

few ways to provide the FASTA files:

A directory = path to directory containing FASTA

files (e.g.: -f ./my_fastas)

A list = filenames separated by space (e.g.: -f

speciesA.fa speciesB.fasta.gz)

A pattern = UNIX matching expression (e.g.: -f

./my_fastas/*.fasta)

Output:

-o OUT, --out OUT Output directory name (default: ./03_extractions)

--max_paralogs MAX_PARALOGS

Maximum number of secondary hits (copies) of any

particular reference target marker allowed in the

output. We recommend disabling the removal of

paralogs (secondary hits/copies) during the

'extract' step because the 'align' step uses a more

sophisticated filter for paralogs. -1 disables the

removal of paralogs (default: -1)

--max_loci_files MAX_LOCI_FILES

When the number of markers in a reference target

file exceeds this number, Captus will not write a

separate FASTA file per sample per marker to not

overload I/O. The single FASTA file containing all

recovered markers per sample needed by the 'align'

step is still produced as are the rest of output

files (default: 0)

--keep_all Do not delete any intermediate files (default:

False)

--overwrite Overwrite previous results (default: False)

Proteins extraction global options (Scipio):

--max_loci_scipio_x2 MAX_LOCI_SCIPIO_X2

When the number of loci in a protein reference

target file exceeds this number, Captus will not run

a second, more exhaustive round of Scipio. Usually

the results from the first round are extremely

similar and sufficient, the second round can become

extremely slow as the number of reference target

proteins grows (default: 2000)

--predict Scipio flags introns as dubious when the splice

signals are not found at the exon edges, this may

indicate that there are additional aminoacids in the

recovered protein that are not present in the

reference target protein. Enable this flag to

attempt translation of these dubious introns, if the

translation does not introduce premature stop codons

they will be added to the recovered protein

(default: False)

Nuclear proteins extraction (Scipio):

-n NUC_REFS, --nuc_refs NUC_REFS

Set of nuclear protein reference target sequences,

options are:

Angiosperms353 = The original set of target

proteins from Angiosperms353

Mega353 = The improved set of target proteins from

Angiosperms353

Alternatively, provide a path to a FASTA file

containing your reference target protein sequences

in either nucleotide or aminoacid. When the FASTA

file is in nucleotides, '--nuc_transtable' will be

used to translate it to aminoacids

--nuc_transtable NUC_TRANSTABLE

Genetic code table to translate your nuclear

proteins. Complete list of tables at: https://www.nc

bi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi (default:

1: Standard)

--nuc_min_score NUC_MIN_SCORE

Minimum Scipio score to retain hits to reference

target proteins. (default: 0.13)

--nuc_min_identity NUC_MIN_IDENTITY

Minimum identity percentage to retain hits to

reference target proteins (default: 65)

--nuc_min_coverage NUC_MIN_COVERAGE

Minimum coverage percentage of reference target

protein to consider a hit by a contig (default: 20)

Plastidial proteins extraction (Scipio):

-p PTD_REFS, --ptd_refs PTD_REFS

Set of plastidial protein reference target

sequences, options are:

SeedPlantsPTD = A set of plastidial proteins for

Seed Plants, curated by us

Alternatively, provide a path to a FASTA file

containing your reference target protein sequences

in either nucleotide or aminoacid. When the FASTA

file is in nucleotides, '--ptd_transtable' will be

used to translate it to aminoacids

--ptd_transtable PTD_TRANSTABLE

Genetic code table to translate your plastidial

proteins. Complete list of tables at: https://www.nc

bi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi (default:

11: Bacterial, Archaeal and Plant Plastid)

--ptd_min_score PTD_MIN_SCORE

Minimum Scipio score to retain hits to reference

target proteins (default: 0.2)

--ptd_min_identity PTD_MIN_IDENTITY

Minimum identity percentage to retain hits to

reference target proteins (default: 65)

--ptd_min_coverage PTD_MIN_COVERAGE

Minimum coverage percentage of reference target

protein to consider a hit by a contig (default: 20)

Mitochondrial proteins extraction (Scipio):

-m MIT_REFS, --mit_refs MIT_REFS

Set of mitochondrial protein reference target

sequences, options are:

SeedPlantsMIT = A set of mitochondrial proteins

for Seed Plants, curated by us

Alternatively, provide a path to a FASTA file

containing your reference target protein sequences

in either nucleotide or aminoacid. When the FASTA

file is in nucleotides, '--mit_transtable' will be

used to translate it to aminoacids

--mit_transtable MIT_TRANSTABLE

Genetic code table to translate your mitochondrial

proteins. Complete list of tables at: https://www.nc

bi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi (default:

1: Standard)

--mit_min_score MIT_MIN_SCORE

Minimum Scipio score to retain hits to reference

target proteins (default: 0.2)

--mit_min_identity MIT_MIN_IDENTITY

Minimum identity percentage to retain hits to

reference target proteins (default: 65)

--mit_min_coverage MIT_MIN_COVERAGE

Minimum coverage percentage of reference target

protein to consider a hit by a contig (default: 20)

Miscellaneous DNA extraction (BLAT):

-d DNA_REFS, --dna_refs DNA_REFS

Path to a FASTA nucleotide file of miscellaneous DNA

reference targets

--dna_min_identity DNA_MIN_IDENTITY

Minimum identity percentage to reference target

sequences to retain matches (default: 80)

--dna_min_coverage DNA_MIN_COVERAGE

Minimum coverage percentage of reference target

sequence to retain matches (default: 20)

Assemblies clustering (MMseqs2):

-c, --cluster_leftovers

Enable MMseqs2 clustering across samples of the

contigs that had no hits to the reference target

markers. A new miscellaneous DNA reference is built

from the best representative of each cluster in

order to perform a miscellaneous DNA marker

extraction. (default: False)

--mmseqs_method {easy-linclust,easy-cluster}

MMseqs2 clustering algorithm, options are:

easy-linclust = Fast linear time (for huge

datasets), less sensitive

clustering

easy-cluster = Sensitive homology search (slower)

(default: easy-linclust)

--cl_sensitivity CL_SENSITIVITY

MMseqs2 sensitivity, from 1 to 7.5, only applicable

when using 'easy-cluster'. Common reference points

are: 1 (faster), 4 (fast), 7.5 (sens) (default: 7.5)

--cl_mode {0,1,2} MMseqs2 clustering mode

(https://github.com/soedinglab/mmseqs2/wiki#clustering-modes),

options are:

0 = Greedy set cover

1 = Connected component

2 = Greedy incremental (analogous to CD-HIT)

(default: 2)

--cl_min_identity CL_MIN_IDENTITY

Minimum identity percentage between sequences in a

cluster, when set to 'auto' it becomes 99% of the '

--dna_min_identity' value but never less than 75%

(default: auto)

--cl_seq_id_mode CL_SEQ_ID_MODE

MMseqs2 sequence identity mode, options are:

0 = Alignment length

1 = Shorter sequence

2 = Longer sequence (default: 1)

--cl_min_coverage CL_MIN_COVERAGE

Any sequence in a cluster has to be at least this

percent included in the length of the longest

sequence in the cluster (default: 80)

--cl_cov_mode CL_COV_MODE

MMseqs2 sequence coverage mode

(https://github.com/soedinglab/mmseqs2/wiki#how-to-

set-the-right-alignment-coverage-to-cluster)

(default: 1)

--cl_max_seq_len CL_MAX_SEQ_LEN

Do not cluster sequences longer than this length in

bp, the maximum allowed by MMseqs2 is 65535. Use 0

to disable this filter (default: 20000)

--cl_rep_min_len CL_REP_MIN_LEN

After clustering is finished, only accept cluster

representatives of at least this length to be part

of the new miscellaneous DNA reference targets. Use

0 to disable this filter (default: 500)

--cl_min_samples CL_MIN_SAMPLES

Minimum number of samples per cluster, if set to

'auto' the number is adjusted to 30% of the total

number of samples or at least 4 (default: auto)

--cl_max_copies CL_MAX_COPIES

Maximum average number of sequences per sample in a

cluster. This can exclude loci that are extremely

paralogous (default: 5)

--cl_tmp_dir CL_TMP_DIR

Where to create the temporary directory

'captus_mmseqs_tmp' for MMseqs2. Clustering can

become slow when done on external drives, set this

location to a fast, preferably local, drive

(default: $HOME)

Other:

--scipio_path SCIPIO_PATH

Path to Scipio (default: bundled)

--blat_path BLAT_PATH Path to BLAT >= 36x7 (this version is the first one

that guarantees the same result both in Mac and

Linux) (default: bundled)

--mmseqs_path MMSEQS_PATH

Path to MMseqs2 (default: mmseqs)

--mafft_path MAFFT_PATH

Path to MAFFT (default: mafft/mafft.bat)

--ram RAM Maximum RAM in GB (e.g.: 4.5) dedicated to Captus,

'auto' uses 99% of available RAM (default: auto)

--threads THREADS Maximum number of CPUs dedicated to Captus, 'auto'

uses all available CPUs (default: auto)

--concurrent CONCURRENT

Captus will attempt to execute this many extractions

concurrently. RAM and CPUs will be divided by this

value for each individual process. If set to 'auto',

Captus will set as many processes as to at least

have 2GB of RAM available for each process due to

the RAM requirements of BLAT (default: auto)

--debug Enable debugging mode, parallelization is disabled

so errors are logged to screen (default: False)

--show_less Do not show individual sample information during the

run, the information is still written to the log

(default: False)

Help:

-h, --help Show this help message and exit

--version Show Captus' version number

For more information, please see https://github.com/edgardomortiz/Captus

> captus_assembly align -h

usage: captus_assembly align -e CAPTUS_EXTRACTIONS_DIR [options]

Captus-assembly: Align; collect, align, and curate aligned markers

Input:

-e CAPTUS_EXTRACTIONS_DIR, --captus_extractions_dir CAPTUS_EXTRACTIONS_DIR

Path to the output directory that contains the

assemblies and extractions from previous steps of

Captus-assembly. This directory is called

'02_assemblies' if you did not specify a different

name during the 'assemble' or 'extract' steps

(default: ./03_extractions)

-m MARKERS, --markers MARKERS

Which markers to align, you can provide a

comma-separated list, no spaces (default: all)

NUC = Nuclear proteins inside directories

'01_coding_NUC'

PTD = Plastidial proteins inside directories

'02_coding_PTD'

MIT = Mitochondrial proteins inside directories

'03_coding_MIT'

DNA = Miscellaneous DNA markers inside directories

'04_misc_DNA'

CLR = Cluster-derived DNA markers inside

directories '05_clusters'

ALL = Shortcut for NUC,PTD,MIT,DNA,CLR

-f FORMATS, --formats FORMATS

Which alignment format(s) to prepare for each marker

category, you can provide a comma-separated list, no

spaces (default: AA,NT,GE,MA)

Valid types for NUC, PTD, and MIT markers:

AA = Coding sequences in aminoacids

NT = Coding sequences in nucleotides

GE = Complete gene sequences (exons + introns)

without flanking upstream or downstream

basepairs

GF = Complete gene sequences with flanking

upstream and downstream basepairs

Valid types for miscellaneous DNA and

CLusteR-derived markers:

MA = Matched sequences without flanking upstream

or downstream basepairs

MF = Matched sequences with flanking upstream and

downstream basepairs

ALL = Shortcut for AA,NT,GE,GF,MA,MF

--max_paralogs MAX_PARALOGS

Maximum number of secondary hits (copies) per sample

to import from the extraction step. Large numbers of

marker copies per sample can increase alignment

times. Hits (copies) are ranked from best to worst

during the 'extract' step. -1 disables the removal

of paralogs and aligns them all, which might be

useful if you expect very high ploidy levels for

example (default: 5)

--min_samples MIN_SAMPLES

Minimum number of samples in a marker to proceed

with alignment. Markers with fewer samples will be

skipped (default: 4)

Output:

-o OUT, --out OUT Output directory name (default: ./04_alignments)

--keep_all Do not delete any intermediate files (default:

False)

--overwrite Overwrite previous results (default: False)

Alignment:

--align_method {mafft_auto,mafft_genafpair,mafft_localpair,mafft_globalpair,mafft_retree1,mafft_retree2,muscle_align,muscle_super5}

For MAFFT's algorithms see: https://mafft.cbrc.jp/al

ignment/software/algorithms/algorithms.html For

MUSCLE's algorithms see:

https://drive5.com/muscle5/manual/commands.html

(default: mafft_auto)

--timeout TIMEOUT Maximum allowed time in seconds for a single

alignment (default: 21600)

--disable_codon_align Do not align nucleotide coding sequences based on

their protein alignment (default: False)

--outgroup OUTGROUP Outgroup sample names, separated by commas, no

spaces. Since phylogenetic programs usually root the

resulting trees at the first taxon in the alignment,

Captus will place these samples at the beggining of

every alignment in the given order

Paralog filtering:

--filter_method {naive,informed,both,none}

Methods for filtering paralogous sequences:

naive = Only the best hit for each sample (marked

as hit=00) is retained

informed = Only keep the copy (regardless of hit

ranking) that is most similar to the

reference target sequence that was

chosen most frequently among all other

samples in the alignmentboth = Two

separate folders will be created, each

containing the results from each

filtering methodnone = Skip paralog

removal, just remove reference target

sequences from the alignments. Useful

for phylogenetic methods that allow

paralogs like ASTRAL-Pro (default:

both)

--tolerance TOLERANCE Only applicable to the 'informed' filter. If the

selected copy's identity to the most commonly chosen

reference target is below this number of Standard

Deviations from the mean, it will also be removed

(the lower the number the stricter the filter)

(default: 4.0)

ClipKIT:

--clipkit_method {smart-gap,gappy,kpic,kpic-smart-gap,kpic-gappy,kpi,kpi-smart-gap,kpi-gappy}

ClipKIT's algorithm, see https://jlsteenwyk.com/Clip

KIT/advanced/index.html#modes (default: gappy)

--clipkit_gaps CLIPKIT_GAPS

Gappyness threshold per position. Accepted values

between 0 and 1. This argument is ignored when using

the 'kpi' and 'kpic' algorithms or intermediate

steps that use 'smart-gap' (default: 0.9)

--min_data_per_column MIN_DATA_PER_COLUMN

Minimum number of non-missing sites per column. When

this parameter is > 0, Captus will dynamically

calculate a '--clipkit_gaps' threshold per alignment

to keep this minimum amount of data per column

(default: 0)

--min_coverage MIN_COVERAGE

Minimum coverage of sequence as proportion of the

mean of sequence lengths in the alignment, ignoring

gaps. Accepted values between 0 and 1. After ClipKIT

finishes trimming columns, Captus will also remove

short sequences below this threshold (default: 0.4)

Other:

--collect_only Only collect the markers from the extraction folder

and exit (skips addition of reference target

sequences and subsequent steps) (default: False)

--redo_from {alignment,filtering,removal,trimming}

Repeat analysis from a particular stage:

alignment = Delete all subdirectories with

alignments and restart

filtering = Delete all subdirectories with

filtered alignments and restart

removal = Delete all subdirectories with

alignments with reference targets

removed and restart

trimming = Delete all subdirectories with trimmed

alignments and restart

--mafft_path MAFFT_PATH

Path to MAFFT (default: mafft/mafft.bat)

--muscle_path MUSCLE_PATH

Path to MUSCLE (default: muscle)

--clipkit_path CLIPKIT_PATH

Path to ClipKIT (default: clipkit)

--ram RAM Maximum RAM in GB (e.g.: 4.5) dedicated to Captus,

'auto' uses 99% of available RAM (default: auto)

--threads THREADS Maximum number of CPUs dedicated to Captus, 'auto'

uses all available CPUs (default: auto)

--concurrent CONCURRENT

Captus will attempt to execute this many alignments

concurrently. CPUs will be divided by this value for

each individual process. If set to 'auto', Captus

will set as many processes as to at least have 2

threads available for each MAFFT or MUSCLE process

(default: auto)

--debug Enable debugging mode, parallelization is disabled

so errors are logged to screen (default: False)

--show_more Show individual alignment information during the

run. Detailed information is written regardless to

the log (default: False)

Help:

-h, --help Show this help message and exit

--version Show Captus' version number

For more information, please see https://github.com/edgardomortiz/Captus

実行方法

１，fastqの準備

実行前に、命名規則に従ってFASTQファイル名を変える必要がある。FASTQファイル名が最終的な系統樹名に反映される。

https://edgardomortiz.github.io/captus.docs/assembly/clean/preparation/index.html

（マニュアルより転載）

注意（マニュアルより）；

サンプル名にアンダーバー（_）を使用することは可能だが、1つ以上連続して使用しない。
ペアエンドリードの場合、R1 と R2 のファイル名にはそれぞれ _R1 と _R2 のパターンが含まれていなければ正しくマッチングされない（_1と_2では不可）。シングルエンドの場合は、ファイル名に _R1 を含める必要がある。
有効な拡張子は、.fq、.fastq、.fq.gz、.fastq.gz
特殊文字を使用しない

テストデータ

Angiosperms353プローブセット(Johnson et al., 2018)を用いて、被子植物間で高度に保存された353の遺伝子座の targeted-capture sequencing (CAP)によって得られた4つの植物種(GenusA_speciesA, GenusB_speciesB, GenusC_speciesC, GenusD_speciesD)ペアエンドシークエンシングリード（合計4つ）がテスト用に公開されている。ここではこれを使う。

https://drive.google.com/uc?export=download&id=1qDwAQqMk9JTy0aHf7dKrqgEqPNXyF00J

ダウンロードして解凍する。

00_raw_reads/

２，ラン

captus_assemblyモジュールはclean、assemble、extract、alignの4つのコマンドからなり、通常この順番で実行する。しかし、リードが既にクリーニングされている場合はassembleモジュールから開始でき、既にアセンブルされたデータやリファレンスゲノムが提供された場合はextractモジュールで解析を開始でき、シナリオによって柔軟に利用する事が出来るように設計されている。

2-1. clean - アダプタートリミング（RNAseqリードをクリーニングする場合はポリAトリミングも行う）に続いて、BBToolsのbbduk.sh（紹介）を使ってクオリティトリミングを行う（PHREDスコア<13の先頭塩基と末尾塩基がトリミングされ、次に平均PHREDスコア<16のトリミングリードが除去される）。クリーニング後にはFalcoまたはFastQCによって全サンプルの未処理リードとクリーニング後リードのHTMLレポートが作成される。ポリAテールをトリミングするために”--rna”オプションが用意されている。

fastqファイルを含むディレクトリを指定して実行する。

captus_assembly clean -r 00_raw_reads

--rna Trim ploy-A tails from RNA-Seq reads (default: False)
--qc_program Which program to use to obtain the statistics from the raw and cleaned FASTQ files. Falco produces identical results to FastQC while being much faster (default: fastqc)

01_clean_readsディレクトリが出来て、結果はその中に保存される。

出力

クリーニングされたfastq.gzが出力される。ハイクオリティなレポートも生成される。

図はインタラクティブに操作可能

2-2. assemble - コンティグのアセンブル
クリーニングされたリードからターゲットキャプチャーとゲノムスキミングデータに最適化されたデフォルト設定でMEGAHITによるde novoアセンブルを行う。-rで2-1の出力ディレクトリを指定する。コンタミネーションが心配な場合は、-max_contig_gcオプションの使用を検討するように書かれている（；ほとんどの真核生物ゲノムの細菌汚染を除去するには最大60%のGCが適切（Fierst and Murdock 2017；論文より））。

captus_assembly assemble -r 01_clean_reads

--min_contig_len Minimum contig length in output assembly, 'auto' is mean input read length + smallest kmer size in '--k_list' (default: auto)
--max_contig_gc Maximum GC percentage allowed per contig. Useful to filter contamination. For example, bacteria usually exceed 60% GC content while eukaryotes rarely exceed that limit. 100.0 disables the GC filter (default: 100.0)

02_assembliesディレクトリが生成される。

出力

サブディレクトリの中にassembly.fastaが出来る、これがアセンブルされたコンティグとなる。

assembly report

2-3. extract - アセンブルしたコンティグからターゲット遺伝子座の配列を抽出

アセンブル後、目的の特定の遺伝子座（すなわち、リファレンスターゲット遺伝子座）を検索することができる。CAPTUSでは、5種類のマーカー（i）核タンパク質（NUC）、（ii）プラスチドタンパク質（PTD）、（iii）ミトコンドリアタンパク質（MIT）、（iv）雑多なDNA領域（DNA）、および（v）新しいクラスタリング由来の推定ホモログ（CLR）を同時に検索し、抽出することができる。タンパク質抽出を行うためにSCIPIO（Hatje et al）が使用されている。SCIPIOはシーケンスやアセンブリーエラーに起因するフレームシフトを自動的に修正でき、複数のexonや複数のコンティグにまたがるタンパク質も回収できる。

その他のDNA抽出は、他のタイプのDNA領域（例えば、イントロンを含む完全な遺伝子、非コード領域、リボソーム遺伝子、個々のエクソン、RADマーカーなど）を回収することを目的としている。この場合、CAPTUSはBLAT(Kent 2002)を使ってアセンブリを検索し、コンティグ間で見つかったヒットを独自のコードを使って貪欲にアセンブルし、連結する（論文より）。

-rで１のアセンブルディレクトリのパスを指定する。複数のマーカータイプ（核タンパク質、プラスチドリアタンパク質、ミトコンドリアタンパク質、および雑多なDNAマーカー）を同時に指定し、抽出することができる。ここでは、Angiosperms353に対応したキャプチャシークエンシングデータを使っているので、”-n”でAngiosperms353を指定している（captusに内蔵されているリファレンスの１つ）。"-a"で前項の2-2の出力ディレクトリを指定する。

captus_assembly extract -a 02_assemblies -n Angiosperms353

-a Path to the output directory from the assemble command.
-n Path to a FASTA file of reference sequences. Here we use the built-in Angiosperms353 reference dataset.
-f FASTA assembly file(s) that were not assembled with Captus. Valid file name extensions are: .fa, .fna, .fasta, .fa.gz, .fna.gz, .fasta.gz. These FASTA files must contain only nucleotides (no aminoacids). All the text before the extension of the filename will be used as sample name. These FASTAs will be automatically copied to the path provided with '-a'/'--captus_assemblies_dir' using the correct directory structure needed by Captus. （省略）
-n Set of nuclear protein reference target sequences, options are: Angiosperms353 = The original set of target proteins from Angiosperms353. Mega353 = The improved set of target proteins from Angiosperms353 Alternatively, provide a path to a FASTA file containing your reference target protein sequences in either nucleotide or aminoacid. When the FASTA file is in nucleotides, '--nuc_transtable' will be used to translate it to aminoacids
-p Set of plastidial protein reference target sequences, options are: SeedPlantsPTD = A set of plastidial proteins for Seed Plants, curated by us Alternatively, provide a path to a FASTA file containing your reference target protein sequences in either nucleotide or aminoacid. When the FASTA file is in nucleotides, '--ptd_transtable' will be used to translate it to aminoacids
-m Set of mitochondrial protein reference target sequences, options are: SeedPlantsMIT = A set of mitochondrial proteins for Seed Plants, curated by us Alternatively, provide a path to a FASTA file containing your reference target protein sequences in either nucleotide or aminoacid. When the FASTA file is in nucleotides, '--mit_transtable' will be used to translate it to aminoacids
-c Enable MMseqs2 clustering across samples of the contigs that had no hits to the reference target markers. A new miscellaneous DNA reference is built from the best representative of each cluster in order to perform a miscellaneous DNA marker extraction. (default: False)

出力

サブディレクトリにアノテーションされて抽出された配列が格納される（*.faaと*.fnaのFASTAファイル）。

report/

使用したデータそれぞれのマーカーの保存の有無と保存性の高さのヒートマップ。上のボタンからインタラクティブに表示する情報を切り替えることができる。

Total hits (Copy)に切り替えた。

2-4. align - 各遺伝子座について全サンプルから抽出した配列の多重整列を生成

2-3で抽出された出力は次にアラインメントモジュールで処理される。個々のマーカーはサンプル間で収集され、遺伝子座ごとに別々の FASTA ファイルに整理される。参照配列も各遺伝子座ファイルに追加され、サンプルから回収された配列が断片的な場合のアラインメントガイドとなる。その後、MAFFTまたはMUSCLE5を用いて、デフォルトの設定でアライメントを行う。ただし、アミノ酸単位のタンパク質配列と、それに対応するヌクレオチド単位のコード配列が同じランでアラインメントされる場合、CAPTUSはアミノ酸をMAFFTでアライメントし、AAアラインメントをヌクレオチドのテンプレートとして使用することで、コドンを考慮したCDSのアラインメントが行われる。

最後に、CLIPKITを用いて、生成された全てのアラインメントはトリミングされ、情報量や欠損データの割合などの基準に基づいてアラインメントのカラムを削除できる。デフォルトでは、90%以上の欠損データと40%未満の平均カバレッジを持つ配列が削除される。

-eで2-3の出力ディレクトリを指定する。

captus_assembly align -e 03_extractions

-e Path to the output directory from the extract command.

出力

1）MSA前、2）MSA後、3）MSA&トリミング後に分かれている。

03_trimmed/

01）MSA後のフィルタリングなど一切なし、02）ナイーブな手法でパラログがフィルタリングされたアラインメント、03）インフォームドメソッドによってパラログがフィルタリングされたアラインメント、04-06）01_unfiltered_w_refs、02_naive_w_refs、03_informed_w_refsと同等のアラインメントを含むが、参照配列を除いたもの（系統推定に不要のため）

各サブディレクトリにおいて、NTとAA両方でそれぞれの遺伝子座のMSAが保存される（318個あった）。

report

各処理ステップにおける一般的なアライメント統計量分布

マーカータイプ（右上）と統計量（ｘ軸、下のボタン）は変更可能

3、Phylogenetic inference

captus_assemblyパイプライン後の代表的な解析例として、系統推定がある。captusでは、IQ-TREEを用いた系統推論の簡単な例が提示されている。

# edge-linked partition modelで全遺伝子座位連結を用いた樹形推定
mkdir 05_phylogeny && cd 05_phylogeny
iqtree -p ../04_alignments/03_trimmed/06_informed/01_coding_NUC/02_NT -pre concat -T AUTO

https://edgardomortiz.github.io/captus.docs/tutorials/assembly/basic/index.html

論文とマニュアルより

出力される全てのファイルはマニュアルで説明されている（例）。
実際の解析でオプションをどう使うかについては、論文やチュートリアルでも説明されている。
CAPTUSは、異なる段階でワークフローに入ったサンプルを柔軟に組み合わせることができる。そのため、例えばターゲットキャプチャーやクリーンなRNA-Seqリードで表されるサンプル、および分類群サンプリングを増やすためにGenBankからダウンロードしたゲノムアセンブリで構成するなど、柔軟に実行できる。
プラスチドタンパク質とミトコンドリアタンパク質について；CaptusにはSeedPlantsPTDとSeedPlantsMITというリファレンスデータセットが組み込まれている。これらは、2-3のextractコマンドで-p SeedPlantsPTD と -m SeedPlantsMIT の引数を加えるだけで、あらゆる顕花植物のオルガネラゲノムから遺伝子配列を抽出することができる。
captus_assemblyモジュールの他にdesignモジュールがある（まだマニュアルは無い）

引用

A novel phylogenomics pipeline reveals complex pattern of reticulate evolution in Cucurbitales
Edgardo M. Ortiz, Alina Höwener, Gentaro Shigita, Mustafa Raza, Olivier Maurin, Alexandre Zuntini, Félix Forest, William J. Baker, Hanno Schaefer

bioRxiv, Posted November 01, 2023.

簡単に触ってみて、大規模な系統解析の負担を大きく軽減できるとても有用性の高い実装だと感じました。チュートリアルデータのランなら非常に簡単に最後まで進めることができますので、まずは触ってみると良いのではないでしょうか。

著者の方から教えていただきました。ありがとうございました。