2024-04-17

WGSやRNA-seqやTarget-captureなどのさまざまなショートリードデータからユーザーが指定した数百〜数千の遺伝子座の同祖配列を抽出し、系統解析に使用可能なMSAを出力する CAPTUS

　ターゲットキャプチャー、RNA-Seq、ゲノムスキミング、深く読んだ全ゲノムシーケンスなど、多様なハイスループットシーケンスデータは系統ゲノム解析に利用されているが、このようなミックスされたデータを単一の系統ゲノムデータセットに統合するには、多くのバイオインフォマティクスツールと多大な計算リソースが必要である。ここでは、ミックスされたデータを高速かつ効率的に解析する新しいパイプラインCAPTUSを紹介する。CAPTUSはこれらのデータをアセンブルし、アセンブルされたデータから目的の遺伝子座を検索し、最終的にパラログをフィルターしたアラインメントを作成する。他のソフトウェアと比較して、CAPTUSはより多くの生物種にわたり、より完全な遺伝子座をより多く復元することができる。本著者らは、ベゴニア（Begoniaceae）やウリ（Cucurbitaceae）を含む、主に熱帯の8科、約3,100種からなる被子植物目のCucurbitalesの4種類のシーケンスデータからなる包括的な混合データセットの構築にCAPTUSを適用した。系統学的結果は、ホロ寄生性のアポダンサ科の位置を除き、現在受け入れられているウリ科の分類を支持するものである。ウリ科では、現在認められているすべてのtribesが単系統であることが確認された。しかし、ウリ目とウリ科の両方に深い網目模様があることも明らかにした。我々（本著者ら）は、ウリ目における初期の系統学的研究の相反する結果が、遺伝子ツリーの対立を考慮することで調和可能であることを示す。

Documentation

https://edgardomortiz.github.io/captus.docs/index.html

マニュアルより

CaptusはHigh-Throughput Sequencingデータから系統ゲノムデータセットをアセンブルするためのツールキットである。Captusは当初、ターゲットエンリッチメント（RNAまたはDNAプローブのハイブリダイゼーションなどによる）シーケンスデータのアセンブリ専用に設計されたが、その後、ゲノムスキミング、Hyb-Seq（ターゲットエンリッチメント＋ゲノムスキミング）、RNA-seq、全ゲノムシーケンスなど、他の一般的なタイプのHTSデータにも対応できるように拡張された。ツールキットには、ターゲットエンリッチメント実験用のプローブをデザインするモジュールも含まれている。以下の特徴を持つ。

シンプル：インストールが簡単で使いやすい。
高速：感度を犠牲にすることなく、各ステップのスピードを最適化した。
再現性：コマンドは明確で、出力ディレクトリはよく整理されており、広範なログが常に保存されている。
柔軟：生データでも、独自のクリーンアップリードやアセンブルリードでも解析を開始できる。すべての解析をやり直すことなく、既存のデータセットにサンプルを追加したり、一部のサンプルのみで解析をやり直すことができる。
透明性：ワークフローのどの段階でも、サンプルの状態を迅速に評価できるように、情報量が多く、整理された読みやすい出力や、整然としたHTMLレポートを提供する。

インストール

レポジトリではmicromambaで環境を作って導入するのが最速と説明されている。ubuntu20でテストした（macへの導入については注意事項がある。レポジトリ参照）。

依存

Captus was written for python >= v3.7
BBTools (https://jgi.doe.gov/data-and-tools/bbtools/)
Falco (https://github.com/smithlabcode/falco) or FastQC (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)
MEGAHIT (https://github.com/voutcn/megahit)
* Scipio (https://www.webscipio.org/)
BioPerl (https://bioperl.org/)
YAML (https://metacpan.org/pod/YAML)
* BLAT >= 36x7 (http://hgdownload.soe.ucsc.edu/admin/exe/)
MMseqs2 (https://github.com/soedinglab/MMseqs2)
MAFFT (https://mafft.cbrc.jp/alignment/software/)
MUSCLE (https://www.drive5.com/muscle/)
ClipKIT (https://github.com/JLSteenwyk/ClipKIT)
pigz (https://zlib.net/pigz/)

本体 Github

https://github.com/edgardomortiz/Captus

mamba create -n captus captus -y
conda activate captus

> captus_assembly -h

usage: captus_assembly command [options]

Captus 1.0.1: Assembly of Phylogenomic Datasets from High-Throughput Sequencing data

Captus-assembly commands:

command Program commands (in typical order of execution)

clean = Trim adaptors and quality filter reads with BBTools, run FastQC on the raw and cleaned reads

assemble = Perform de novo assembly with MEGAHIT: Assembling reads that were cleaned with the 'clean' command is recommended, but reads cleaned elsewhere are also allowed

extract = Recover targeted markers with BLAT and Scipio: Extracting markers from the assembly obtained with the 'assemble' command is recommended, but any other assemblies in FASTA format are also allowed

align = Align extracted markers across samples with MAFFT or MUSCLE: Marker alignment depends on the directory structure created by the 'extract' command. This step also performs paralog filtering and alignment trimming using ClipKIT

Help:

-h, --help Show this help message and exit

--version Show Captus' version number

For help on a particular command: captus_assembly command -h

> captus_assembly clean -h

usage: captus_assembly clean -r READS [READS ...] [options]

Captus-assembly: Clean; remove adaptors and quality-filter reads with

BBTools

Input:

-r READS [READS ...], --reads READS [READS ...]

FASTQ files. Valid filename extensions are: .fq,

.fastq, .fq.gz, and .fastq.gz. The names must

include the string '_R1' (and '_R2' when pairs are

provided). Everything before the string '_R1' will

be used as sample name. There are a few ways to

provide the FASTQ files:

A directory = path to directory containing FASTQ

files (e.g.: -r ./raw_reads)

A list = filenames separated by space (e.g.: -r

A_R1.fq A_R2.fq B_R1.fq C_R1.fq)

A pattern = UNIX matching expression (e.g.: -r

./raw_reads/*.fastq.gz)

Output:

-o OUT, --out OUT Output directory name (default: ./01_clean_reads)

--keep_all Do not delete any intermediate files (default:

False)

--overwrite Overwrite previous results (default: False)

Adaptor trimming:

--adaptor_set ADAPTOR_SET

Set of adaptors to remove

Illumina = Illumina adaptors included in BBTools

BGI = BGISEQ, DNBSEG, or MGISEQ adaptors

ALL = Illumina + BGI

Alternatively, you can provide a path to a FASTA

file containing your own adaptors (default: ALL)

--rna Trim ploy-A tails from RNA-Seq reads (default:

False)

Quality trimming and filtering:

--trimq TRIMQ Leading and trailing read regions with average PHRED

quality score below this value will be trimmed

(default: 13)

--maq MAQ After quality trimming, reads with average PHRED

quality score below this value will be removed

(default: 16)

--ftl FTL Trim any base to the left of this position. For

example, if you want to remove 4 bases from the left

of the reads set this number to 5 (default: 0)

--ftr FTR Trim any base to the right of this position. For

example, if you want to truncate your reads length

to 100 bp set this number to 100 (default: 0)

QC Statistics:

--qc_program {fastqc,falco}

Which program to use to obtain the statistics from

the raw and cleaned FASTQ files. Falco produces

identical results to FastQC while being much faster

(default: fastqc)

--skip_qc_stats Skip FastQC/Falco analysis on raw and cleaned reads

(default: False)

Other:

--bbduk_path BBDUK_PATH

Path to bbduk.sh (default: bbduk.sh)

--falco_path FALCO_PATH

Path to Falco (default: falco)

--fastqc_path FASTQC_PATH

Path to FastQC (default: fastqc)

--ram RAM Maximum RAM in GB (e.g.: 4.5) dedicated to Captus,

'auto' uses 99% of available RAM (default: auto)

--threads THREADS Maximum number of CPUs dedicated to Captus, 'auto'

uses all available CPUs (default: auto)

--concurrent CONCURRENT

Captus will attempt to run FastQC concurrently on

this many samples. If set to 'auto', Captus will run

at most 4 instances of FastQC or as many CPU cores

are available, whatever number is lower (default:

auto)

--debug Enable debugging mode, parallelization is disabled

so errors are logged to screen (default: False)

--show_less Do not show individual sample information during the

run, the information is still written to the log

(default: False)

Help:

-h, --help Show this help message and exit

--version Show Captus' version number

For more information, please see https://github.com/edgardomortiz/Captus

> captus_assembly assemble -h

usage: captus_assembly assemble -r READS [READS ...] [options]

Captus-assembly: Assemble; perform de novo assembly using MEGAHIT

Input:

-r READS [READS ...], --reads READS [READS ...]

FASTQ files. Valid file name extensions are: .fq,

.fastq, .fq.gz, and .fastq.gz. The names must

include the string '_R1' (and '_R2' when pairs are

provided). Everything before the string '_R1' will

be used as sample name. There are a few ways to

provide the FASTQ files:

A directory = path to directory containing FASTQ

files (e.g.: -r ./raw_reads)

A list = file names separated by space (e.g.: -r

A_R1.fq A_R2.fq B_R1.fq C_R1.fq)

A pattern = UNIX matching expression (e.g.: -r

./raw_reads/*.fastq.gz) (default:

./01_clean_reads)

--sample_reads_target SAMPLE_READS_TARGET

Use this number of read pairs (or reads if single-

end) for assembly. Reads are randomly subsampled

with 'reformat.sh' from BBTools (option:

srt/samplereadstarget). Useful for limiting the

amount of data of samples with very high sequencing

depth. To use all the reads, set this value to 0

(default: 0)

Output:

-o OUT, --out OUT Output directory name. Inside this directory the

output for each sample will be stored in a

subdirectory named as 'Sample_name__captus-asm'

(default: ./02_assemblies)

--keep_all Do not delete any intermediate files (default:

False)

--overwrite Overwrite previous results (default: False)

MEGAHIT:

--k_list K_LIST Comma-separated list of kmer sizes, all must be odd

values in the range 15-255, in increments of at most

28. If not provided, a list optimized for

hybridization capture and/or genome skimming data

will be used. The final kmer size will be adjusted

automatically so it never exceeds the mean read

length of the sample by more than 31

--min_count MIN_COUNT Minimum contig depth (a.k.a. multiplicity in

MEGAHIT), accepted values are integers >= 1.

Reducing it to 1 may increase the amount of low-

depth contigs likely produced from reads with

errors. Increase above 2 if the data has high and

even sequencing depth

--prune_level PRUNE_LEVEL

Prunning strength for low-coverage edges during

graph cleaning. Increasing the value beyond 2 can

speed up the assembly at the cost of losing low-

depth contigs. Accepted values are integers between

0 and 3

--merge_level MERGE_LEVEL

Merge complex bubbles, the first number multiplied

by the kmer size represents the maximum bubble

length to merge, the second number represents the

minimum similarity required to merge bubbles

(default: 20,0.95)

--preset PRESET The default preset is 'CAPSKIM', these settings work

well with either hybridization capture or genome

skimming data (or a combination of both). You can

assemble RNA-Seq reads with the 'RNA' preset or

high-coverage Whole Genome Sequencing reads with the

'WGS' preset, however, both presets require a

minimum of 8GB of RAM to work well (default:

CAPSKIM)

CAPSKIM = --k-list

31,39,47,63,79,95,111,127,143,159,175

--min-count 2 --prune-level 2

RNA = --k-list 27,47,67,87,107,127,147,167

--min-count 2 --prune-level 2

WGS = --k-list 31,39,49,69,89,109,129,149,169

--min-count 3 --prune-level 2

--min_contig_len MIN_CONTIG_LEN

Minimum contig length in output assembly, 'auto' is

mean input read length + smallest kmer size in '--

k_list' (default: auto)

--max_contig_gc MAX_CONTIG_GC

Maximum GC percentage allowed per contig. Useful to

filter contamination. For example, bacteria usually

exceed 60% GC content while eukaryotes rarely exceed

that limit. 100.0 disables the GC filter (default:

100.0)

--tmp_dir TMP_DIR Location to create the temporary directory

'captus_assembly_tmp' for MEGAHIT assembly.

Sometimes, when working on external hard drives

MEGAHIT will refuse to run unless this directory is

created in an internal hard drive. (default: $HOME)

Other:

--reformat_path REFORMAT_PATH

Path to reformat.sh (default: reformat.sh)

--megahit_path MEGAHIT_PATH

Path to MEGAHIT (default: megahit)

--megahit_toolkit_path MEGAHIT_TOOLKIT_PATH

Path to MEGAHIT's toolkit (default: megahit_toolkit)

--ram RAM Maximum RAM in GB (e.g.: 4.5) dedicated to Captus,

'auto' uses 99% of available RAM (default: auto)

--threads THREADS Maximum number of CPUs dedicated to Captus, 'auto'

uses all available CPUs (default: auto)

--concurrent CONCURRENT

Captus will attempt to assemble this many samples

concurrently. RAM and CPUs will be divided by this

value for each individual MEGAHIT process. For

example if you set --threads to 12 and --concurrent

to 3, then each MEGAHIT assembly will be done using

--threads/--concurrent = 4 CPUs. If set to 'auto',

Captus will run as many concurrent assemblies as

possible with a minimum of 4 CPUs and 4 GB of RAM

per assembly (8 GB if presets RNA or WGS are used)

(default: auto)

--debug Enable debugging mode, parallelization is disabled

so errors are logged to screen (default: False)

--show_less Do not show individual sample information during the

run, the information is still written to the log

(default: False)

Help:

-h, --help Show this help message and exit

--version Show Captus' version number

For more information, please see https://github.com/edgardomortiz/Captus

> captus_assembly extract -h

usage: captus_assembly extract -a CAPTUS_ASSEMBLIES_DIR [options]

Captus-assembly: Extract; recover markers from FASTA assemblies

Input:

-a CAPTUS_ASSEMBLIES_DIR, --captus_assemblies_dir CAPTUS_ASSEMBLIES_DIR

Path to an output directory from the 'assemble' step

of Captus-assembly which is tipically called

'02_assemblies'. If you DID NOT assemble any sample

within Captus and want to start exclusivey with

FASTAs assembled elsewhere, the path provided here

will be created in order to contain your assemblies

provided with '-f' into a proper directory structure

needed by Captus (default: ./02_assemblies)

-f FASTAS [FASTAS ...], --fastas FASTAS [FASTAS ...]

FASTA assembly file(s) that were not assembled with

Captus. Valid file name extensions are: .fa, .fna,

.fasta, .fa.gz, .fna.gz, .fasta.gz. These FASTA

files must contain only nucleotides (no aminoacids).

All the text before the extension of the filename

will be used as sample name. These FASTAs will be

automatically copied to the path provided with

'-a'/'--captus_assemblies_dir' using the correct

directory structure needed by Captus. There are a

few ways to provide the FASTA files:

A directory = path to directory containing FASTA

files (e.g.: -f ./my_fastas)

A list = filenames separated by space (e.g.: -f

speciesA.fa speciesB.fasta.gz)

A pattern = UNIX matching expression (e.g.: -f

./my_fastas/*.fasta)

Output:

-o OUT, --out OUT Output directory name (default: ./03_extractions)

--max_paralogs MAX_PARALOGS

Maximum number of secondary hits (copies) of any

particular reference target marker allowed in the

output. We recommend disabling the removal of

paralogs (secondary hits/copies) during the

'extract' step because the 'align' step uses a more

sophisticated filter for paralogs. -1 disables the

removal of paralogs (default: -1)

--max_loci_files MAX_LOCI_FILES

When the number of markers in a reference target

file exceeds this number, Captus will not write a

separate FASTA file per sample per marker to not

overload I/O. The single FASTA file containing all

recovered markers per sample needed by the 'align'

step is still produced as are the rest of output

files (default: 0)

--keep_all Do not delete any intermediate files (default:

False)

--overwrite Overwrite previous results (default: False)

Proteins extraction global options (Scipio):

--max_loci_scipio_x2 MAX_LOCI_SCIPIO_X2

When the number of loci in a protein reference

target file exceeds this number, Captus will not run

a second, more exhaustive round of Scipio. Usually

the results from the first round are extremely

similar and sufficient, the second round can become

extremely slow as the number of reference target

proteins grows (default: 2000)

--predict Scipio flags introns as dubious when the splice

signals are not found at the exon edges, this may

indicate that there are additional aminoacids in the

recovered protein that are not present in the

reference target protein. Enable this flag to

attempt translation of these dubious introns, if the

translation does not introduce premature stop codons

they will be added to the recovered protein

(default: False)

Nuclear proteins extraction (Scipio):

-n NUC_REFS, --nuc_refs NUC_REFS

Set of nuclear protein reference target sequences,

options are:

Angiosperms353 = The original set of target

proteins from Angiosperms353

Mega353 = The improved set of target proteins from

Angiosperms353

Alternatively, provide a path to a FASTA file

containing your reference target protein sequences

in either nucleotide or aminoacid. When the FASTA

file is in nucleotides, '--nuc_transtable' will be

used to translate it to aminoacids

--nuc_transtable NUC_TRANSTABLE

Genetic code table to translate your nuclear

proteins. Complete list of tables at: https://www.nc

bi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi (default:

1: Standard)

--nuc_min_score NUC_MIN_SCORE

Minimum Scipio score to retain hits to reference

target proteins. (default: 0.13)

--nuc_min_identity NUC_MIN_IDENTITY

Minimum identity percentage to retain hits to

reference target proteins (default: 65)

--nuc_min_coverage NUC_MIN_COVERAGE

Minimum coverage percentage of reference target

protein to consider a hit by a contig (default: 20)

Plastidial proteins extraction (Scipio):

-p PTD_REFS, --ptd_refs PTD_REFS

Set of plastidial protein reference target

sequences, options are:

SeedPlantsPTD = A set of plastidial proteins for

Seed Plants, curated by us

Alternatively, provide a path to a FASTA file

containing your reference target protein sequences

in either nucleotide or aminoacid. When the FASTA

file is in nucleotides, '--ptd_transtable' will be

used to translate it to aminoacids

--ptd_transtable PTD_TRANSTABLE

Genetic code table to translate your plastidial

proteins. Complete list of tables at: https://www.nc

bi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi (default:

11: Bacterial, Archaeal and Plant Plastid)

--ptd_min_score PTD_MIN_SCORE

Minimum Scipio score to retain hits to reference

target proteins (default: 0.2)

--ptd_min_identity PTD_MIN_IDENTITY

Minimum identity percentage to retain hits to

reference target proteins (default: 65)

--ptd_min_coverage PTD_MIN_COVERAGE

Minimum coverage percentage of reference target

protein to consider a hit by a contig (default: 20)

Mitochondrial proteins extraction (Scipio):

-m MIT_REFS, --mit_refs MIT_REFS

Set of mitochondrial protein reference target

sequences, options are:

SeedPlantsMIT = A set of mitochondrial proteins

for Seed Plants, curated by us

Alternatively, provide a path to a FASTA file

containing your reference target protein sequences

in either nucleotide or aminoacid. When the FASTA

file is in nucleotides, '--mit_transtable' will be

used to translate it to aminoacids

--mit_transtable MIT_TRANSTABLE

Genetic code table to translate your mitochondrial

proteins. Complete list of tables at: https://www.nc

bi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi (default:

1: Standard)

--mit_min_score MIT_MIN_SCORE

Minimum Scipio score to retain hits to reference

target proteins (default: 0.2)

--mit_min_identity MIT_MIN_IDENTITY

Minimum identity percentage to retain hits to

reference target proteins (default: 65)

--mit_min_coverage MIT_MIN_COVERAGE

Minimum coverage percentage of reference target

protein to consider a hit by a contig (default: 20)

Miscellaneous DNA extraction (BLAT):

-d DNA_REFS, --dna_refs DNA_REFS

Path to a FASTA nucleotide file of miscellaneous DNA

reference targets

--dna_min_identity DNA_MIN_IDENTITY

Minimum identity percentage to reference target

sequences to retain matches (default: 80)

--dna_min_coverage DNA_MIN_COVERAGE

Minimum coverage percentage of reference target

sequence to retain matches (default: 20)

Assemblies clustering (MMseqs2):

-c, --cluster_leftovers

Enable MMseqs2 clustering across samples of the

contigs that had no hits to the reference target

markers. A new miscellaneous DNA reference is built

from the best representative of each cluster in

order to perform a miscellaneous DNA marker

extraction. (default: False)

--mmseqs_method {easy-linclust,easy-cluster}

MMseqs2 clustering algorithm, options are:

easy-linclust = Fast linear time (for huge

datasets), less sensitive

clustering

easy-cluster = Sensitive homology search (slower)

(default: easy-linclust)

--cl_sensitivity CL_SENSITIVITY

MMseqs2 sensitivity, from 1 to 7.5, only applicable

when using 'easy-cluster'. Common reference points

are: 1 (faster), 4 (fast), 7.5 (sens) (default: 7.5)

--cl_mode {0,1,2} MMseqs2 clustering mode

(https://github.com/soedinglab/mmseqs2/wiki#clustering-modes),

options are:

0 = Greedy set cover

1 = Connected component

2 = Greedy incremental (analogous to CD-HIT)

(default: 2)

--cl_min_identity CL_MIN_IDENTITY

Minimum identity percentage between sequences in a

cluster, when set to 'auto' it becomes 99% of the '

--dna_min_identity' value but never less than 75%

(default: auto)

--cl_seq_id_mode CL_SEQ_ID_MODE

MMseqs2 sequence identity mode, options are:

0 = Alignment length

1 = Shorter sequence

2 = Longer sequence (default: 1)

--cl_min_coverage CL_MIN_COVERAGE

Any sequence in a cluster has to be at least this

percent included in the length of the longest

sequence in the cluster (default: 80)

--cl_cov_mode CL_COV_MODE

MMseqs2 sequence coverage mode

(https://github.com/soedinglab/mmseqs2/wiki#how-to-

set-the-right-alignment-coverage-to-cluster)

(default: 1)

--cl_max_seq_len CL_MAX_SEQ_LEN

Do not cluster sequences longer than this length in

bp, the maximum allowed by MMseqs2 is 65535. Use 0

to disable this filter (default: 20000)

--cl_rep_min_len CL_REP_MIN_LEN

After clustering is finished, only accept cluster

representatives of at least this length to be part

of the new miscellaneous DNA reference targets. Use

0 to disable this filter (default: 500)

--cl_min_samples CL_MIN_SAMPLES

Minimum number of samples per cluster, if set to

'auto' the number is adjusted to 30% of the total

number of samples or at least 4 (default: auto)

--cl_max_copies CL_MAX_COPIES

Maximum average number of sequences per sample in a

cluster. This can exclude loci that are extremely

paralogous (default: 5)

--cl_tmp_dir CL_TMP_DIR

Where to create the temporary directory

'captus_mmseqs_tmp' for MMseqs2. Clustering can

become slow when done on external drives, set this

location to a fast, preferably local, drive

(default: $HOME)

Other:

--scipio_path SCIPIO_PATH

Path to Scipio (default: bundled)

--blat_path BLAT_PATH Path to BLAT >= 36x7 (this version is the first one

that guarantees the same result both in Mac and

Linux) (default: bundled)

--mmseqs_path MMSEQS_PATH

Path to MMseqs2 (default: mmseqs)

--mafft_path MAFFT_PATH

Path to MAFFT (default: mafft/mafft.bat)

--ram RAM Maximum RAM in GB (e.g.: 4.5) dedicated to Captus,

'auto' uses 99% of available RAM (default: auto)

--threads THREADS Maximum number of CPUs dedicated to Captus, 'auto'

uses all available CPUs (default: auto)

--concurrent CONCURRENT

Captus will attempt to execute this many extractions

concurrently. RAM and CPUs will be divided by this

value for each individual process. If set to 'auto',

Captus will set as many processes as to at least

have 2GB of RAM available for each process due to

the RAM requirements of BLAT (default: auto)

--debug Enable debugging mode, parallelization is disabled

so errors are logged to screen (default: False)

--show_less Do not show individual sample information during the

run, the information is still written to the log

(default: False)

Help:

-h, --help Show this help message and exit

--version Show Captus' version number

For more information, please see https://github.com/edgardomortiz/Captus

> captus_assembly align -h

usage: captus_assembly align -e CAPTUS_EXTRACTIONS_DIR [options]

Captus-assembly: Align; collect, align, and curate aligned markers

Input:

-e CAPTUS_EXTRACTIONS_DIR, --captus_extractions_dir CAPTUS_EXTRACTIONS_DIR

Path to the output directory that contains the

assemblies and extractions from previous steps of

Captus-assembly. This directory is called

'02_assemblies' if you did not specify a different

name during the 'assemble' or 'extract' steps

(default: ./03_extractions)

-m MARKERS, --markers MARKERS

Which markers to align, you can provide a

comma-separated list, no spaces (default: all)

NUC = Nuclear proteins inside directories

'01_coding_NUC'

PTD = Plastidial proteins inside directories

'02_coding_PTD'

MIT = Mitochondrial proteins inside directories

'03_coding_MIT'

DNA = Miscellaneous DNA markers inside directories

'04_misc_DNA'

CLR = Cluster-derived DNA markers inside

directories '05_clusters'

ALL = Shortcut for NUC,PTD,MIT,DNA,CLR

-f FORMATS, --formats FORMATS

Which alignment format(s) to prepare for each marker

category, you can provide a comma-separated list, no

spaces (default: AA,NT,GE,MA)

Valid types for NUC, PTD, and MIT markers:

AA = Coding sequences in aminoacids

NT = Coding sequences in nucleotides

GE = Complete gene sequences (exons + introns)

without flanking upstream or downstream

basepairs

GF = Complete gene sequences with flanking

upstream and downstream basepairs

Valid types for miscellaneous DNA and

CLusteR-derived markers:

MA = Matched sequences without flanking upstream

or downstream basepairs

MF = Matched sequences with flanking upstream and

downstream basepairs

ALL = Shortcut for AA,NT,GE,GF,MA,MF

--max_paralogs MAX_PARALOGS

Maximum number of secondary hits (copies) per sample

to import from the extraction step. Large numbers of

marker copies per sample can increase alignment

times. Hits (copies) are ranked from best to worst

during the 'extract' step. -1 disables the removal

of paralogs and aligns them all, which might be

useful if you expect very high ploidy levels for

example (default: 5)

--min_samples MIN_SAMPLES

Minimum number of samples in a marker to proceed

with alignment. Markers with fewer samples will be

skipped (default: 4)

Output:

-o OUT, --out OUT Output directory name (default: ./04_alignments)

--keep_all Do not delete any intermediate files (default:

False)

--overwrite Overwrite previous results (default: False)

Alignment:

--align_method {mafft_auto,mafft_genafpair,mafft_localpair,mafft_globalpair,mafft_retree1,mafft_retree2,muscle_align,muscle_super5}

For MAFFT's algorithms see: https://mafft.cbrc.jp/al

ignment/software/algorithms/algorithms.html For

MUSCLE's algorithms see:

https://drive5.com/muscle5/manual/commands.html

(default: mafft_auto)

--timeout TIMEOUT Maximum allowed time in seconds for a single

alignment (default: 21600)

--disable_codon_align Do not align nucleotide coding sequences based on

their protein alignment (default: False)

--outgroup OUTGROUP Outgroup sample names, separated by commas, no

spaces. Since phylogenetic programs usually root the

resulting trees at the first taxon in the alignment,

Captus will place these samples at the beggining of

every alignment in the given order

Paralog filtering:

--filter_method {naive,informed,both,none}

Methods for filtering paralogous sequences:

naive = Only the best hit for each sample (marked

as hit=00) is retained

informed = Only keep the copy (regardless of hit

ranking) that is most similar to the

reference target sequence that was

chosen most frequently among all other

samples in the alignmentboth = Two

separate folders will be created, each

containing the results from each

filtering methodnone = Skip paralog

removal, just remove reference target

sequences from the alignments. Useful

for phylogenetic methods that allow

paralogs like ASTRAL-Pro (default:

both)

--tolerance TOLERANCE Only applicable to the 'informed' filter. If the

selected copy's identity to the most commonly chosen

reference target is below this number of Standard

Deviations from the mean, it will also be removed

(the lower the number the stricter the filter)

(default: 4.0)

ClipKIT:

--clipkit_method {smart-gap,gappy,kpic,kpic-smart-gap,kpic-gappy,kpi,kpi-smart-gap,kpi-gappy}

ClipKIT's algorithm, see https://jlsteenwyk.com/Clip

KIT/advanced/index.html#modes (default: gappy)

--clipkit_gaps CLIPKIT_GAPS

Gappyness threshold per position. Accepted values

between 0 and 1. This argument is ignored when using

the 'kpi' and 'kpic' algorithms or intermediate

steps that use 'smart-gap' (default: 0.9)

--min_data_per_column MIN_DATA_PER_COLUMN

Minimum number of non-missing sites per column. When

this parameter is > 0, Captus will dynamically

calculate a '--clipkit_gaps' threshold per alignment

to keep this minimum amount of data per column

(default: 0)

--min_coverage MIN_COVERAGE

Minimum coverage of sequence as proportion of the

mean of sequence lengths in the alignment, ignoring

gaps. Accepted values between 0 and 1. After ClipKIT

finishes trimming columns, Captus will also remove

short sequences below this threshold (default: 0.4)

Other:

--collect_only Only collect the markers from the extraction folder

and exit (skips addition of reference target

sequences and subsequent steps) (default: False)

--redo_from {alignment,filtering,removal,trimming}

Repeat analysis from a particular stage:

alignment = Delete all subdirectories with

alignments and restart

filtering = Delete all subdirectories with

filtered alignments and restart

removal = Delete all subdirectories with

alignments with reference targets

removed and restart

trimming = Delete all subdirectories with trimmed

alignments and restart

--mafft_path MAFFT_PATH

Path to MAFFT (default: mafft/mafft.bat)

--muscle_path MUSCLE_PATH

Path to MUSCLE (default: muscle)

--clipkit_path CLIPKIT_PATH

Path to ClipKIT (default: clipkit)

--ram RAM Maximum RAM in GB (e.g.: 4.5) dedicated to Captus,

'auto' uses 99% of available RAM (default: auto)

--threads THREADS Maximum number of CPUs dedicated to Captus, 'auto'

uses all available CPUs (default: auto)

--concurrent CONCURRENT

Captus will attempt to execute this many alignments

concurrently. CPUs will be divided by this value for

each individual process. If set to 'auto', Captus

will set as many processes as to at least have 2

threads available for each MAFFT or MUSCLE process

(default: auto)

--debug Enable debugging mode, parallelization is disabled

so errors are logged to screen (default: False)

--show_more Show individual alignment information during the

run. Detailed information is written regardless to

the log (default: False)

Help:

-h, --help Show this help message and exit

--version Show Captus' version number

For more information, please see https://github.com/edgardomortiz/Captus

実行方法

１，fastqの準備

実行前に、命名規則に従ってFASTQファイル名を変える必要がある。FASTQファイル名が最終的な系統樹名に反映される。

https://edgardomortiz.github.io/captus.docs/assembly/clean/preparation/index.html

（マニュアルより転載）

注意（マニュアルより）；

サンプル名にアンダーバー（_）を使用することは可能だが、1つ以上連続して使用しない。
ペアエンドリードの場合、R1 と R2 のファイル名にはそれぞれ _R1 と _R2 のパターンが含まれていなければ正しくマッチングされない（_1と_2では不可）。シングルエンドの場合は、ファイル名に _R1 を含める必要がある。
有効な拡張子は、.fq、.fastq、.fq.gz、.fastq.gz
特殊文字を使用しない

テストデータ

Angiosperms353プローブセット(Johnson et al., 2018)を用いて、被子植物間で高度に保存された353の遺伝子座の targeted-capture sequencing (CAP)によって得られた4つの植物種(GenusA_speciesA, GenusB_speciesB, GenusC_speciesC, GenusD_speciesD)ペアエンドシークエンシングリード（合計4つ）がテスト用に公開されている。ここではこれを使う。

https://drive.google.com/uc?export=download&id=1qDwAQqMk9JTy0aHf7dKrqgEqPNXyF00J

ダウンロードして解凍する。

00_raw_reads/

２，ラン

captus_assemblyモジュールはclean、assemble、extract、alignの4つのコマンドからなり、通常この順番で実行する。しかし、リードが既にクリーニングされている場合はassembleモジュールから開始でき、既にアセンブルされたデータやリファレンスゲノムが提供された場合はextractモジュールで解析を開始でき、シナリオによって柔軟に利用する事が出来るように設計されている。

2-1. clean - アダプタートリミング（RNAseqリードをクリーニングする場合はポリAトリミングも行う）に続いて、BBToolsのbbduk.sh（紹介）を使ってクオリティトリミングを行う（PHREDスコア<13の先頭塩基と末尾塩基がトリミングされ、次に平均PHREDスコア<16のトリミングリードが除去される）。クリーニング後にはFalcoまたはFastQCによって全サンプルの未処理リードとクリーニング後リードのHTMLレポートが作成される。ポリAテールをトリミングするために”--rna”オプションが用意されている。

fastqファイルを含むディレクトリを指定して実行する。

captus_assembly clean -r 00_raw_reads

--rna Trim ploy-A tails from RNA-Seq reads (default: False)
--qc_program Which program to use to obtain the statistics from the raw and cleaned FASTQ files. Falco produces identical results to FastQC while being much faster (default: fastqc)

01_clean_readsディレクトリが出来て、結果はその中に保存される。

出力

クリーニングされたfastq.gzが出力される。ハイクオリティなレポートも生成される。

図はインタラクティブに操作可能

2-2. assemble - コンティグのアセンブル
クリーニングされたリードからターゲットキャプチャーとゲノムスキミングデータに最適化されたデフォルト設定でMEGAHITによるde novoアセンブルを行う。-rで2-1の出力ディレクトリを指定する。コンタミネーションが心配な場合は、-max_contig_gcオプションの使用を検討するように書かれている（；ほとんどの真核生物ゲノムの細菌汚染を除去するには最大60%のGCが適切（Fierst and Murdock 2017；論文より））。

captus_assembly assemble -r 01_clean_reads

--min_contig_len Minimum contig length in output assembly, 'auto' is mean input read length + smallest kmer size in '--k_list' (default: auto)
--max_contig_gc Maximum GC percentage allowed per contig. Useful to filter contamination. For example, bacteria usually exceed 60% GC content while eukaryotes rarely exceed that limit. 100.0 disables the GC filter (default: 100.0)

02_assembliesディレクトリが生成される。

出力

サブディレクトリの中にassembly.fastaが出来る、これがアセンブルされたコンティグとなる。

assembly report

2-3. extract - アセンブルしたコンティグからターゲット遺伝子座の配列を抽出

アセンブル後、目的の特定の遺伝子座（すなわち、リファレンスターゲット遺伝子座）を検索することができる。CAPTUSでは、5種類のマーカー（i）核タンパク質（NUC）、（ii）プラスチドタンパク質（PTD）、（iii）ミトコンドリアタンパク質（MIT）、（iv）雑多なDNA領域（DNA）、および（v）新しいクラスタリング由来の推定ホモログ（CLR）を同時に検索し、抽出することができる。タンパク質抽出を行うためにSCIPIO（Hatje et al）が使用されている。SCIPIOはシーケンスやアセンブリーエラーに起因するフレームシフトを自動的に修正でき、複数のexonや複数のコンティグにまたがるタンパク質も回収できる。

その他のDNA抽出は、他のタイプのDNA領域（例えば、イントロンを含む完全な遺伝子、非コード領域、リボソーム遺伝子、個々のエクソン、RADマーカーなど）を回収することを目的としている。この場合、CAPTUSはBLAT(Kent 2002)を使ってアセンブリを検索し、コンティグ間で見つかったヒットを独自のコードを使って貪欲にアセンブルし、連結する（論文より）。

-rで１のアセンブルディレクトリのパスを指定する。複数のマーカータイプ（核タンパク質、プラスチドリアタンパク質、ミトコンドリアタンパク質、および雑多なDNAマーカー）を同時に指定し、抽出することができる。ここでは、Angiosperms353に対応したキャプチャシークエンシングデータを使っているので、”-n”でAngiosperms353を指定している（captusに内蔵されているリファレンスの１つ）。"-a"で前項の2-2の出力ディレクトリを指定する。

captus_assembly extract -a 02_assemblies -n Angiosperms353

-a Path to the output directory from the assemble command.
-n Path to a FASTA file of reference sequences. Here we use the built-in Angiosperms353 reference dataset.
-f FASTA assembly file(s) that were not assembled with Captus. Valid file name extensions are: .fa, .fna, .fasta, .fa.gz, .fna.gz, .fasta.gz. These FASTA files must contain only nucleotides (no aminoacids). All the text before the extension of the filename will be used as sample name. These FASTAs will be automatically copied to the path provided with '-a'/'--captus_assemblies_dir' using the correct directory structure needed by Captus. （省略）
-n Set of nuclear protein reference target sequences, options are: Angiosperms353 = The original set of target proteins from Angiosperms353. Mega353 = The improved set of target proteins from Angiosperms353 Alternatively, provide a path to a FASTA file containing your reference target protein sequences in either nucleotide or aminoacid. When the FASTA file is in nucleotides, '--nuc_transtable' will be used to translate it to aminoacids
-p Set of plastidial protein reference target sequences, options are: SeedPlantsPTD = A set of plastidial proteins for Seed Plants, curated by us Alternatively, provide a path to a FASTA file containing your reference target protein sequences in either nucleotide or aminoacid. When the FASTA file is in nucleotides, '--ptd_transtable' will be used to translate it to aminoacids
-m Set of mitochondrial protein reference target sequences, options are: SeedPlantsMIT = A set of mitochondrial proteins for Seed Plants, curated by us Alternatively, provide a path to a FASTA file containing your reference target protein sequences in either nucleotide or aminoacid. When the FASTA file is in nucleotides, '--mit_transtable' will be used to translate it to aminoacids
-c Enable MMseqs2 clustering across samples of the contigs that had no hits to the reference target markers. A new miscellaneous DNA reference is built from the best representative of each cluster in order to perform a miscellaneous DNA marker extraction. (default: False)

出力

サブディレクトリにアノテーションされて抽出された配列が格納される（*.faaと*.fnaのFASTAファイル）。

report/

使用したデータそれぞれのマーカーの保存の有無と保存性の高さのヒートマップ。上のボタンからインタラクティブに表示する情報を切り替えることができる。

Total hits (Copy)に切り替えた。

2-4. align - 各遺伝子座について全サンプルから抽出した配列の多重整列を生成

2-3で抽出された出力は次にアラインメントモジュールで処理される。個々のマーカーはサンプル間で収集され、遺伝子座ごとに別々の FASTA ファイルに整理される。参照配列も各遺伝子座ファイルに追加され、サンプルから回収された配列が断片的な場合のアラインメントガイドとなる。その後、MAFFTまたはMUSCLE5を用いて、デフォルトの設定でアライメントを行う。ただし、アミノ酸単位のタンパク質配列と、それに対応するヌクレオチド単位のコード配列が同じランでアラインメントされる場合、CAPTUSはアミノ酸をMAFFTでアライメントし、AAアラインメントをヌクレオチドのテンプレートとして使用することで、コドンを考慮したCDSのアラインメントが行われる。

最後に、CLIPKITを用いて、生成された全てのアラインメントはトリミングされ、情報量や欠損データの割合などの基準に基づいてアラインメントのカラムを削除できる。デフォルトでは、90%以上の欠損データと40%未満の平均カバレッジを持つ配列が削除される。

-eで2-3の出力ディレクトリを指定する。

captus_assembly align -e 03_extractions

-e Path to the output directory from the extract command.

出力

1）MSA前、2）MSA後、3）MSA&トリミング後に分かれている。

03_trimmed/

01）MSA後のフィルタリングなど一切なし、02）ナイーブな手法でパラログがフィルタリングされたアラインメント、03）インフォームドメソッドによってパラログがフィルタリングされたアラインメント、04-06）01_unfiltered_w_refs、02_naive_w_refs、03_informed_w_refsと同等のアラインメントを含むが、参照配列を除いたもの（系統推定に不要のため）

各サブディレクトリにおいて、NTとAA両方でそれぞれの遺伝子座のMSAが保存される（318個あった）。

report

各処理ステップにおける一般的なアライメント統計量分布

マーカータイプ（右上）と統計量（ｘ軸、下のボタン）は変更可能

3、Phylogenetic inference

captus_assemblyパイプライン後の代表的な解析例として、系統推定がある。captusでは、IQ-TREEを用いた系統推論の簡単な例が提示されている。

# edge-linked partition modelで全遺伝子座位連結を用いた樹形推定
mkdir 05_phylogeny && cd 05_phylogeny
iqtree -p ../04_alignments/03_trimmed/06_informed/01_coding_NUC/02_NT -pre concat -T AUTO

https://edgardomortiz.github.io/captus.docs/tutorials/assembly/basic/index.html

論文とマニュアルより

出力される全てのファイルはマニュアルで説明されている（例）。
実際の解析でオプションをどう使うかについては、論文やチュートリアルでも説明されている。
CAPTUSは、異なる段階でワークフローに入ったサンプルを柔軟に組み合わせることができる。そのため、例えばターゲットキャプチャーやクリーンなRNA-Seqリードで表されるサンプル、および分類群サンプリングを増やすためにGenBankからダウンロードしたゲノムアセンブリで構成するなど、柔軟に実行できる。
プラスチドタンパク質とミトコンドリアタンパク質について；CaptusにはSeedPlantsPTDとSeedPlantsMITというリファレンスデータセットが組み込まれている。これらは、2-3のextractコマンドで-p SeedPlantsPTD と -m SeedPlantsMIT の引数を加えるだけで、あらゆる顕花植物のオルガネラゲノムから遺伝子配列を抽出することができる。
captus_assemblyモジュールの他にdesignモジュールがある（まだマニュアルは無い）

引用

A novel phylogenomics pipeline reveals complex pattern of reticulate evolution in Cucurbitales
Edgardo M. Ortiz, Alina Höwener, Gentaro Shigita, Mustafa Raza, Olivier Maurin, Alexandre Zuntini, Félix Forest, William J. Baker, Hanno Schaefer

bioRxiv, Posted November 01, 2023.

簡単に触ってみて、大規模な系統解析の負担を大きく軽減できるとても有用性の高い実装だと感じました。チュートリアルデータのランなら非常に簡単に最後まで進めることができますので、まずは触ってみると良いのではないでしょうか。

著者の方から教えていただきました。ありがとうございました。

2024-04-15

GO termからタンパク質の機能的要約を生成する GO2Sum

2024 npj Systems Biology and Applications GO term

　タンパク質の生物学的機能を理解することは、現代の生物学において基本的に重要である。タンパク質の機能を表現するために、制御された語彙であるGene Ontology (GO)は、オープンエンドなテキスト解釈を避け、コンピュータプログラムで扱いやすいため、頻繁に使用されている。特に、現在のタンパク質機能予測手法の大半はGO termに依存している。しかし、タンパク質の機能を記述するGO termの広範なリストは、解釈の際に生物学者に困難をもたらす可能性がある。この問題に対応するため、本著者らは、GO termのセットを入力とし、T5大規模言語モデルを用いて人間が読める要約を生成するモデル、GO2Sum（Gene Ontology terms Summarizer）を開発した。GO2Sumは、GO termの割り当てとUniProtエントリーのフリーテキスト機能記述についてT5を微調整することにより開発され、GO term記述を連結することにより機能記述を再現することを可能にした。その結果、GO2Sumは、UniProtエントリーの機能、サブユニット構造、パスウェイの段落を生成する際に、ウェブコーパス全体で学習させたオリジナルのT5モデルを大幅に上回ることが実証された。

新しい論文です：GO2Sum: generating human-readable functional summary of proteins from GO terms, npj Systems Biology and Applications. 続く）https://t.co/qrhFifp67g
— Daisuke Kihara (@d_kihara) 2024年3月19日

遺伝子の機能は元々データベースや論文で普通の文章（自然言語）で書かれていたのをコンピュータで扱いやすいように2０年くらい前にGOが作られましたが、GO2Sumでは、人間が読んで分かるようにまた自然言語による記述に戻したわけです。
— Daisuke Kihara (@d_kihara) 2024年3月19日

Behind the Paper

https://communities.springernature.com/posts/go2sum-generating-human-readable-functional-summary-of-proteins-from-go-terms

インストール

ubuntu20でcondaの環境を作ってテストした。

依存

Python 3.9 : https://www.python.org/downloads/
GO2SUM is based on simpleT5, a Pytorch-lightning implementation of T5 model

Github

git clone https://github.com/kiharalab/GO2Sum.git && cd GO2Sum
mamba create --name go2sum python=3.9
conda activate go2sum
#Pytorch-lightning implementation of T5 model（タイムアウト対策に--default-timeoutを指定）
pip --default-timeout=1000 install --upgrade simplet5

> python3 main.py

python3 main.py -h

Global seed set to 42

usage: main.py [-h] [--input_file INPUT_FILE] [--summary_type SUMMARY_TYPE] [--output_file OUTPUT_FILE]

Run GO2Sum

optional arguments:

-h, --help show this help message and exit

--input_file INPUT_FILE

Tab-separated file with Protein ID and GO Annotation list

--summary_type SUMMARY_TYPE

Type of summary to generate (function, subunit, pathway, or all)

--output_file OUTPUT_FILE

Name of result file

モデル

models.zip files

https://kiharalab.org/GO2Sum/models.zip

(2.3GB)

解凍してカレントパスに配置すれば認識する。

実行方法

スクリプトを使用するには以下のフォーマットのタブ区切りテキストファイルを作成する。

各行は1つのタンパク質とそれに関連するGO IDに対応する。1列目はタンパク質名、2列目はセミコロン";"で区切られたGO IDのリストを記載する。

（マニュアルより転載）

テストラン

cd GO2Sum/
python3 main.py --input_file test1.tab --summary_type function --output_file example.tab

--input_file Tab-separated file with Protein ID and GO Annotation list
--summary_type Type of summary to generate (function, subunit, pathway, or all)
--output_file Name of result file

出力

results/にfunction_example.tab、subunit_example.tab、pathway_example.tabが生成される。

> ls -l results/

> cat results/function_example.tab
Q5AK66 FUNCTION: Catalyzes the formation of phosphatidylserine from phosphatidylserine Plays a central role in phospholipid metabolism and in the interorganelle trafficking of phosphatidylserine Important for virulence
A9AJN2 FUNCTION: Ethanolamine-phosphate cytidylyltransferase that catalyzes the second step in the synthesis of phosphatidylethanolamine from ethanolamine via the CDP-ethanolamine pathway Phosphatidylethanolamine is a dominant inner-leaflet phospholipid in cell membranes, where it plays a role in membrane function by structurally stabilizing membrane-anchored proteins, and participates in important cellular processes such as cell division, cell fusion, blood coagulation, and apoptosis

引用

GO2Sum: generating human-readable functional summary of proteins from GO terms
Swagarika Jaharlal Giri, Nabil Ibtehaz & Daisuke Kihara

npj Systems Biology and Applications volume 10, Article number: 29 (2024)

2024-04-13

（ヒト）AlphaFoldでモデル化されたタンパク質間相互作用のデータベース Predictomes

Protein-protein interactions (PPIs) human genome 2024 Preprint AlphaFold

　タンパク質間相互作用(PPI)は生物学において普遍的なものであるが、生化学的プロセスの根底にあるPPIの包括的な構造解析は不足している。AlphaFold-Multimer(AF-M)はこの知識のギャップを埋める可能性を秘めているが、標準的なAF-Mの信頼性指標では、関連するPPIと偽陽性の豊富な予測を確実に分離することはできない。この限界に対処するため、本著者らは、十分にキュレートされたデータセットを用いて機械学習を行い、プロテオームワイドスクリーンを含め、真のPPIと偽のPPIを分離する優れた性能を示すSPOCと呼ばれるStructure Prediction and Omics informed Classifierを訓練した。SPOCを約300の human genome maintenance proteinsの all-by-all matrixに適用し、約40,000の予測を生成した。この予測はpredictomes.orgで閲覧することができ、ユーザはSPOCを使って自分の予測をスコアリングすることもできる。本アプローチで発見された信頼性の高いPPIは、ゲノム維持における新しい仮説を示唆している。この結果は、大規模なAF-Mスクリーニングを解釈するためのフレームワークを提供し、プロテオーム全体の構造相互作用の基礎を築くのに役立つ。

Tutorial

https://predictomes.org/tutorial

簡単に見てみます。

webサービス

https://predictomes.org/にアクセスする。

タンパク質間相互作用（PPI）は、事実上すべての生物学的プロセスに不可欠である。ハーバード大学医学部のウォルター研究室では、ディープラーニングシステムAlphaFold-Multimer（AF-M）を使って、PPIを系統的にスクリーニングしている。われわれはゲノム維持に焦点を当てているが、このアプローチは生物学のどの分野にも使える。

AlphaFold-Multimer（AF-M）は、AlphaFoldと同じディープラーニングの原理を使ってタンパク質複合体の構造を予測する。AF-MのColabfoldバージョンをローカルにインストールし、クラウドベースのGPUを借りて、ゲノム維持機構のコアとなるタンパク質間のバイナリーなタンパク質間相互作用（PPI）をすべて予測した。各タンパク質ペアは、テンプレートを有効にして独自に訓練された5つのAF-Mモデルのうち3つで折りたたまれた。このパイプラインは、潜在的なPPIの "all-by-all "マトリックスを生成した。計算時間を節約するため、AF-M構造は緩和されなかった。AF-MがGPUの処理能力を超えるようなタンパク質ペア（合計3600残基以上）は折りたたまれれなかった。

相互作用が真である可能性が高いかどうかを評価するために、SPOCと呼ばれる分類器をトレーニングした。さらに、標準的なAF-M信頼度指標（PAE, pLDDT, pDOCKQ）と、もう一つの指標である予測に一致したAF-Mモデルの平均数（"avg_models"）を提供した。AlphaFold多量体データの解析に使用しているスクリプトはGithubで公開されている（全てマニュアルより）。

matrixに移動する。

https://predictomes.org/view/ddr

タイルの色の濃さは信頼度（SPOC）に比例する。

ドラッグして囲んだ領域が拡大される。オレンジのドットがあるタイルはPDBで見つかったペアを表す。

カーソルを合わせるとそのタンパク質ペアの各スコアが表示される。

SPOCの良スコアは、5%以下のFDRと定義されている。図では0.975。カットオフ値は文脈に依存する。例えば、実際の相互作用体を濃縮すべき限られたタンパク質群（IP-質量分析データなど）をスクリーニングする場合、SPOCスコアが0.75を超えると5%のFDRが達成される）。一方、真の相互作用因子の割合が低いプロテオームワイドスクリーンを行う場合、5%のFDRを達成するためには0.95のSPOCスコアが必要である（128:1の曲線）（マニュアルより）。

ヒートマップの指標は左上から変更できる。

関心があるタイルをクリックすると対話型構造ビューアにジャンプする。

The predicted alignment error (PAE、単位はオングストローム）は、残基の位置決め精度のグローバルな尺度である。この値は相互作用するタンパク質内およびタンパク質間のすべての残基のペアについて計算され、0から30オングストロームの範囲である。PAEプロットでは、青は低いPAE値、赤は30以上のPAE値を表す（マニュアルより）。

PDBに複合体構造があるもの（オレンジタイル）は、スーパーインポーズ機能も利用できる（画像右上のsuperimposeから利用可能な複合体構造を選択する）。

構造の表示、pLDDTによる残基のフィルタリング、pLDDTのような異なる測定基準による構造の色付けなどのオプションがある。

続き

メインmatrixに戻る。

行、列ともに特定の生物学的パスウェイだけ選べる。columnsはTelomere（11）、RowsはTelomereとATPases（11）にした。

マニュアルと論文より

AF-M予測は必然的に「偽陽性」と「偽陰性」をもたらす。実際、SPOCスコアが低くても相互作用がない証拠にはならず（低いスコアしか得られなかったいくつかの既知の複合体がある）、スコアが高くても相互作用の決定的な証拠にはならない（通常相互作用しないMCM2とMCM7のようなタンパク質パラログのスコアが良いことから分かる）。さらに調べる対象を探す場合、一般的には、SPOCのスコアが最も高いPPIを優先すること、独立したエビデンスによって支持されているもの、あるいは生物学的現象の説明に役立つものを優先することが推奨される。全ての場合において、全ての予測を検証するためには実験的証拠が不可欠である。
AF-Mの実行には、ローカルにインストールしたColabFoldが使用されている。予測の大部分は40GBのA100 NVIDIA GPUで実行され、サブセットはL40S NVIDIA GPUで実行された。これらのGPUのメモリ制限を考慮し、すべてのジョブの上限を合計3,600アミノ酸とした。構造が特に注目される特定のケースでは、例外的に3,600アミノ酸を超える配列に対してAF-Mが実行された（論文メソッドより）。

引用

Predictomes: A classifier-curated database of AlphaFold-modeled protein-protein interactions
Ernst W. Schmid, Johannes C. Walter

bioRxiv, Posted April 12, 2024.

関連

Dataタブを見るとこのようになっており、将来的にさらに相互作用予測結果が増えるかもしれませんね。

2024-04-12

メタゲノムアセンブリからのターゲットとするウイルスゲノムの完全性と連続性を向上させる COBRA

2024 Nature Microbiology Viruses assembly

　ウイルスの研究はメタゲノムシークエンシングを用いて行われることが多いが、ゲノムの不完全性が包括的で正確な解析の妨げとなっている。Contig Overlap Based Re-Assembly (COBRA)は、de Bruijnグラフに基づいてアセンブリのブレークポイントを解決し、コンティグを結合する。ここでは、海洋と土壌のウイルスデータセットを用いてCOBRAのベンチマークを行った。COBRAはアセンブルされた配列を正確に結合し、ビニングツールよりも顕著に高いゲノム精度を達成した。231の淡水メタゲノムから、7,334のバクテリオファージクラスターが得られた。COBRA解析前は34%であったのに対し、70%が環状であった。巨大ファージ（≥200 kbp）のサンプリングを拡大し、その中で最大のファージ（717 kbp）を完成までキュレーションした。ロッツェー湖から得られたファージゲノムの改良により、メタトランスクリプトームデータの背景が明らかになり、巨大ファージ、whiBをコードするファージ、cysCおよびcysHをコードするファージのin situでの活性が示された。COBRAはウイルスゲノムのアセンブリの連続性と完全性を向上させ、遺伝子含有量、多様性、進化の解析の精度と信頼性を向上させる。

レポジトリより

バリエーション部位で切断されたコンティグは、各アセンブラによって決められた長さの末端オーバーラップを持つ。metaSPAdesとMEGAHITではde nonoアセンブリで使用されるmax-kmer（以下maxK）、IDBA_UDではmaxK-1であり、cobraではこれを「予想オーバーラップ長」と呼ぶ。COBRAは、すべてのコンティグについて予想オーバーラップ長（順方向と逆方向の両方）を決定し、コンティグのカバレッジ、コンティグのオーバーラップ関係、コンティグの連続性（ペアエンドリードのマッピングに基づく）などの特徴のリストに基づいて、ユーザーが提供する各クエリ（アセンブリ全体の一部であるはず）について有効な結合パスを探し、拡張が試みられる。

インストール

ubuntu20にて、condaで環境を作って導入した。

Github

https://github.com/linxingchen/cobra?tab=readme-ov-file

#conda
mamba create -n cobra python=3.8
conda activate cobra
mamba install bioconda::cobra-meta -y
#or
mamba install linxingchen1987::cobra-meta -y

#pip
pip install cobra-meta

> cobra-meta -h

-q QUERY -f FASTA -a {idba,megahit,metaspades} -mink MINK -maxk MAXK -m MAPPING -c COVERAGE [-lm LINKAGE_MISMATCH] [-o OUTPUT] [-t THREADS] [-v]

This script is used to get higher quality (including circular) virus genomes by joining assembled contigs based on their end overlaps.

options:

-h, --help show this help message and exit

-lm LINKAGE_MISMATCH, --linkage_mismatch LINKAGE_MISMATCH

the max read mapping mismatches for determining if two contigs are spanned by paired reads. [2]

-o OUTPUT, --output OUTPUT

the name of output folder (default = '<query>_COBRA').

-t THREADS, --threads THREADS

the number of threads for blastn. [16]

-v, --version show program's version number and exit

required named arguments:

-q QUERY, --query QUERY

the query contigs file (fasta format), or the query name list (text file, one column).

-f FASTA, --fasta FASTA

the whole set of assembled contigs (fasta format).

-a {idba,megahit,metaspades}, --assembler {idba,megahit,metaspades}

de novo assembler used, COBRA not tested for others.

-mink MINK, --mink MINK

the min kmer size used in de novo assembly.

-maxk MAXK, --maxk MAXK

the max kmer size used in de novo assembly.

-m MAPPING, --mapping MAPPING

the reads mapping file in sam or bam format.

-c COVERAGE, --coverage COVERAGE

the contig coverage file (two columns divided by tab).

実行方法

COBRAは4つのファイルを入力として必要とする；１次アセンブリのfasta、そのfastaへmappingして得たbam、bamから計算した各コンティグへのカバレッジ情報、アセンブル結果を改善させたいウィルスのfastaファイル（クエリ配列）。

1つのアセンブリから得られたすべてのコンティグを含むFasta形式のファイル（--fastaで指定）。idba, megahit, metaspadesのみテストされている。
--fastaファイルに含まれるすべてのコンティグのシーケンスカバレッジを2列（タブ区切り）ファイルにしたもの（-cで指定）
COBRAに伸長させたいクエリコンティグ。fasta形式のファイルか、クエリコンティグの名前を1列のテキストファイルとして指定（-qで指定）
--fastaで指定したfastaへのペアエンドリードのマッピング結果のbam（-mで指定）

２で指定するcoverageファイル（レポジトリより転載）

このcoverageファイルを作るには、4のbamを作り、それからmetabatのコマンドとcobraのスクリプトを使って作成する。

1, mapping
#bowtie2
bowtie2-build --threads 20 -f assembly.fasta bowtie2_index
bowtie2 -p 16 -x bowtie2_index -1 R1.fastq.gz -2 R2.fastq.gz | samtools sort -@ 8 -m 2G -O BAM - > mapping.bam

#bbmap
bbmap.sh ref=assembly.fasta in1=R1.fastq.gz in2=R2.fastq.gz threads=16 out=mapping.sam
samtools sort -@ 8 -m 2G -O BAM output.sam > out.bam

2, MetaBATの"jgi_summarize_bam_contig_depths "ツールでcoverageを計算
jgi_summarize_bam_contig_depths --outputDepth original.coverage.txt mapping.bam

#3 convert format
git clone https://github.com/linxingchen/cobra.git
cd cobra/
python coverage.transfer.py -i original.coverage.txt -o coverage.txt

準備が出来たらcobraを実行する。１－４のファイルのほか、スレッド数、出力ディレクトリ、multi k-merなde brujin graphアセンブラで使用したk値の最小と最大値も指定する。

cobra-meta -q query.fasta -f assembly.fasta -a megahit -mink 21 -maxk 141 -m mapping.bam -c coverage.txt -o outdir -t 16

-o the name of output folder (default = '<query>_COBRA').
-t the number of threads for blastn. [16]
-f the whole set of assembled contigs (fasta format).
-m the reads mapping file in sam or bam format.
-c the contig coverage file (two columns divided by tab).
-a {idba, megahit, metaspades} de novo assembler used, COBRA not tested for others.
-a the name of the de novo assembler used, currently only 'idba' (for IDBA_UD), 'metaspades' (for metaSPAdes), and 'megahit' (for MEGAHIT).
-maxk the largest kmer used in de novo assembly.
-mink the smallest kmer used in de novo assembly.

出力

COBRAは、クエリーコンティグが結合され環状化された状態によって異なるファイルに出力する。

その他

scaffolds（例えばmetaSPAdes）をCOBRAエクステンションの入力として使用することも可能だが、スキャフォールドステップでエラーが発生する可能性があるため、IDBA_UDからのスキャフォールドは使用しないことが推奨されている；IDBA_UDとMEGAHITのアセンブリーにはコンティグを使う
COBRAはIDBA_UD、metaSPAdes、MEGAHITのコンティグ／スキャフォールドに対してのみテストされている（他のアセンブラのコンティグ／スキャフォールドを使うのは危険）。

引用

COBRA improves the completeness and contiguity of viral genomes assembled from metagenomes

LinXing Chen & Jillian F. Banfield

Nature Microbiology volume 9, pages737–750 (2024)

2024-04-12

タンパク質言語モデルにより正確で高速なリモート相同性配列検索を行う PLMSearch

Protein language model protein search 2024 Nature Communications web tool

　Homologous protein searchは、タンパク質のアノテーションや解析に最もよく使われる手法の一つである。構造検索と比較して、配列のみから遠い進化関係を検出することは依然として困難である。ここでは、配列のみを入力とするHomologous protein searchメソッドPLMSearch（Protein Language Model）を提案する。PLMSearchは、事前に学習させたタンパク質言語モデルのdeep representationsを利用し、多数の実際の構造類似性を用いて類似性予測モデルを学習させる。これにより、PLMSearchは配列の背後に隠された遠隔の相同性情報を捉えることができる。広範な実験結果から、PLMSearchはMMseqs2のように数百万のクエリーとターゲットタンパク質のペアを数秒で検索でき、感度を3倍以上向上させ、最先端の構造検索手法に匹敵することが示されている。特に、従来の配列検索手法とは異なり、PLMSearchは、配列は異なるが構造は類似している、最も離れた相同性ペアを呼び出すことができる。PLMSearchはhttps://dmiip.sjtu.edu.cn/PLMSearchで利用できる。

配列検索の普遍性と効率を維持しながら感度を向上させるために、PLMSearchを提案する。PLMSearchは主に以下の3つのステップからなる： (1) PfamClanは、同じPfam clanドメインを共有するタンパク質ペアをフィルタリングする。(2) SS-predictor (Structural Similarity predictor)は、タンパク質言語モデルによって生成されたエンベッディングを用いて、全てのクエリー-ターゲットペア間の類似性を予測する。PLMSearchは、タンパク質言語モデルを使って、 deep sequence embeddingsから遠隔相同性情報を取り込むので、構造を入力として使わなくてもそれほど感度が落ちることはない。また、このステップで使用されるSS-predictorは、構造類似度（TM-score）を学習のグランドトゥルースとして使用する。これにより、PLMSearchは構造が入力としてなくても、信頼性の高い類似度を取得することができる。(3) PLMSearchは、PfamClanによって事前にフィルタリングされたペアを、予測された類似度に基づいてソートし、それに応じて各クエリタンパク質の検索結果を出力する。その後、PLMAlignは、PLMSearchによって検索されたトップランクのタンパク質ペアの配列アラインメントとアラインメントスコアを提供する。SCOPe40-testとSwiss-Protで検索テストを行った結果、PLMSearchは常に最良の手法の一つであり、精度とスピードのトレードオフが最適であることが明らかになった。

help（上のメニューから選択する）

https://dmiip.sjtu.edu.cn/PLMSearch#

Github

https://github.com/maovshao/PLMSearch

webサービス

https://dmiip.sjtu.edu.cn/PLMSearch#にアクセスする。

問い合わせするタンパク質の配列をペーストするかファイルを指定する。

クエリの最大数は100。ただし、リクエストが多すぎるときは”Too much tasks. Please try again later.”が表示される。

ターゲットDBとして、Swiss-Prot（568K配列）、PDB（680K配列）、UniRef50（53.6M配列）、またはクエリデータセット自体 (Self) を選べる。デフォルトは機能がわかっているタンパク質からなるSwiss-Prot DBになっている。DBのほか、MethodをPLMSearchかSS-predictorから選べる。

PLMSearchはSS-predictor（全てのタンパク質の類似度を計算する。タンパク質のペアは類似度に基づいてランク付けされる。）とは異なり、全てのタンパク質のペアを一から検索する事は回避し、PfamClanによって事前にフィルタリングされたペアを基に検索する。

デモ配列を使った検索では、サブミット後に結果のページがロードされるまで数秒かかった。

出力例

表にトップ５のヒットした配列名、Similarity、そのUniProtとAlphaFoldへのリンク、タンパク質のfastaファイルとPDBファイル、タンパク質名、生物名、tax idへのリンクが出ている。Similarityは0.3（UniRef50では0.5）以上のタンパク質ペアのみが保持される。

上のDownloadでは、ヒットした配列のNeedleman-Wunschアルゴリズムによる配列同一性と配列アライメントファイル、PLMAlignのスコアとPLMAlignによる配列アライメントなどをダウンロードできる。

引用

PLMSearch: Protein language model powers accurate and fast sequence search for remote homology
Wei Liu, Ziye Wang, Ronghui You, Chenghan Xie, Hong Wei, Yi Xiong, Jianyi Yang & Shanfeng Zhu
Nature Communications volume 15, Article number: 2775 (2024)

関連

2024-04-11

タンパク質構造へのバリアントのマッピングのためのコマンドラインツール 3Dmapper

2024 Bioinformatics AlphaFold mapping variant structural aligners

　ゲノムデータの解釈は、生物学的プロセスの分子メカニズムを理解する上で極めて重要である。タンパク質構造は、遺伝子をコードする変異体に機能的な背景を与えることにより、この解釈を容易にする上で重要な役割を果たす。しかし、遺伝子とタンパク質の対応付けは、データ形式の不一致のために、面倒でエラーが起こりやすい作業である。過去20年の間に、アノテーションされた位置やバリアントをタンパク質構造に自動的にマッピングするためのツールやデータベースが数多く開発されてきた。しかし、これらのツールのほとんどはウェブベースであり、大規模なゲノムデータ解析には適していない。
この問題を解決するために、PythonとRで開発されたスタンドアローンのコマンドラインツールである3Dmapperを紹介する。このツールは、アノテーションされたタンパク質の位置とバリアントをタンパク質構造に体系的にマッピングし、効率的で信頼性の高いソリューションを提供する。

wiki

https://github.com/vicruiser/3Dmapper/wiki

Tutorial

https://github.com/vicruiser/3Dmapper/wiki/Tutorial-mode

インストール

condaでpython3.7の環境を作ってテストした。

依存

BLAST standalone software version >= 2.6.
Python > 3.6
R version > 3.5
GNU parallel

Github

mamba create -n 3Dmapper python=3.7 -y
conda activate 3Dmapper
git clone https://github.com/vicruiser/3Dmapper.git
cd 3Dmapper
pip install . 
sh r_dependencies.sh

> makestructuraldb -h

usage: makestructuraldb [-h] --pdb <String> [<String> ...] --blast_db <String>

[-o <String>] [-d <String>] [-t <String>] [-e <float>]

[--pident <float>] [-c <float>]

[--interaction <String>] [--biolip] [-p] [-j <int>]

----------------------------------------------------

____ _____

|___ \| __ \

__) | | | |_ __ ___ __ _ _ __ _ __ ___ _ __

|__ <| | | | '_ ` _ \ / _` | '_ \| '_ \ / _ \ '__|

___) | |__| | | | | | | (_| | |_) | |_) | __/ |

|____/|_____/|_| |_| |_|\__,_| .__/| .__/ \___|_|

| | | |

|_| |_|

----------------------------------------------------

optional arguments:

-h, --help show this help message and exit

--pdb <String> [<String> ...]

PDB file path

--blast_db <String> proteome files path (output of makeblastdb)

-o <String>, --out <String>

output directory

-d <String>, --dist <String>

inter-residue distance threshold in angstroms (5 by

default)

-t <String>, --int-type <String>

interface definition. Options are: 'noh' (by default)

to calculate distance considering heavy atoms only;

'h' to compute hydrogen bonds; 'calpha' to compute CA-

CA distance; 'cbeta' to measure distances between CB-

CB (CA in the case of Glycine); 'sidechain' consider

atoms only from sidechains; 'backbone' to calculate

distances between atoms in the backbone only; or 'all'

to calcule interfaces between any possible atom

-e <float>, --evalue <float>

e-value threshold in BLAST. Default is 1e-7.

--pident <float> percent identity threshold between query (pdb chain)

and hit (protein) sequences. Default is 20 percent

-c <float>, --coverage <float>

percent coverage threshold of the protein sequence

(how much of the protein sequence is covered by the

PDB sequence). Default is 0 percent

--interaction <String>

Interaction type: 'protein', 'ligand', 'nucleic',

combination of the previous separated by space, or

'all' (by default)

--biolip consider BioLiP list

(https://zhanggroup.org/BioLiP/ligand_list) to remove

artifact ligands? Default is False

-p, --parallel Parallelize process

-j <int>, --jobs <int>

number of jobs to run in parallel

-----------------------------------

| >>>Publication link<<<<< |

| victoria.ruiz.serra@gmail.com |

-----------------------------------

> makevariantsdb -h

$ makevariantsdb -h

usage: makevariantsdb [-h]

(-vf <String> [<String> ...] | -maf <String> [<String> ...])

[-f] [-o <String>] [-s] [-p] [-j <int>]

----------------------------------------------------

____ _____

|___ \| __ \

__) | | | |_ __ ___ __ _ _ __ _ __ ___ _ __

|__ <| | | | '_ ` _ \ / _` | '_ \| '_ \ / _ \ '__|

___) | |__| | | | | | | (_| | |_) | |_) | __/ |

|____/|_____/|_| |_| |_|\__,_| .__/| .__/ \___|_|

| | | |

|_| |_|

----------------------------------------------------

optional arguments:

-h, --help show this help message and exit

-vf <String> [<String> ...], --varfile <String> [<String> ...]

input VCF, VEP or VEP-like file(s)

-maf <String> [<String> ...], --maf_file <String> [<String> ...]

input MAF file(s)

-f, --force force to owerwrite? Inactive by default

-o <String>, --out <String>

output directory. Default is current directory.

-s, --sort sort input file to split

-p, --parallel Speed up running time. Depends on GNU Parallel. O.

Tange(2011): GNU Parallel - The Command-Line Power

Tool, login: The USENIX Magazine, February 2011:

42-47.

-j <int>, --jobs <int>

number of jobs to run in parallel

-----------------------------------

| >>>Publication link<<<<< |

| victoria.ruiz.serra@gmail.com |

-----------------------------------

> mapper -h

usage: mapper [-h]

(-pid <String> [<String> ...] | -vid <String> [<String> ...])

-psdb <String> -vdb <String> [-o <String>] --id_mapping <String>

[-i <String> [<String> ...]] [-c <String> [<String> ...]]

[-d <float>] [--pident <int>] [-e <int>] [-f | -a] [-p]

[-j <int>] [-v] [-l] [-csv] [-hdf]

----------------------------------------------------

____ _____

|___ \| __ \

__) | | | |_ __ ___ __ _ _ __ _ __ ___ _ __

|__ <| | | | '_ ` _ \ / _` | '_ \| '_ \ / _ \ '__|

___) | |__| | | | | | | (_| | |_) | |_) | __/ |

|____/|_____/|_| |_| |_|\__,_| .__/| .__/ \___|_|

| | | |

|_| |_|

----------------------------------------------------

optional arguments:

-h, --help show this help message and exit

-pid <String> [<String> ...], --prot-id <String> [<String> ...]

one or more IDs of protein, transcripts or genes

provided via command line or from a file

-vid <String> [<String> ...], --var-id <String> [<String> ...]

single or list of variants ids provided via command

line or from a file

-psdb <String> interfaces database directory

-vdb <String>, --vardb <String>

variants database directory

-o <String>, --out <String>

output directory

--id_mapping <String>

File that contains the conversion of protein,

transcripts and gene IDs and optionally APPRIS

isoforms IDs.

-i <String> [<String> ...], --isoform <String> [<String> ...]

if available in the ID mapping file, this parameter

can filter by a single or a list of APPRIS isoforms.

The principal isoform is set by default. Options are:

principal1, principal2, ...

-c <String> [<String> ...], --consequence <String> [<String> ...]

filter by variant or position consequence type

-d <float>, --dist <float>

threshold of interface maximum distance allowed in

angstroms. By default, the maximum value will be the

one selected in makeinterfacedb

--pident <int> threshold of sequence identity (percertage)

-e <int>, --evalue <int>

threshold of evalue

-f, --force force to owerwrite? Inactive by default

-a, --append Two or more calls to the program write are able to

append results to the same output file.

-p, --parallel Parallelize process

-j <int>, --jobs <int>

number of jobs to run in parallel

-v, --verbose Print progress.

-l, --location Map all variants and detect their location.

Output format:

-csv Write the mapped data to a CSV file.

-hdf Write the mappedd data to an HDF5 file using HDFStore.

-----------------------------------

| >>>Publication link<<<<< |

| victoria.ruiz.serra@gmail.com |

-----------------------------------

> makevisualization -h

----------------------------------------------------

____ _____

|___ \| __ \

__) | | | |_ __ ___ __ _ _ __ _ __ ___ _ __

|__ <| | | | '_ ` _ \ / _` | '_ \| '_ \ / _ \ '__|

___) | |__| | | | | | | (_| | |_) | |_) | __/ |

|____/|_____/|_| |_| |_|\__,_| .__/| .__/ \___|_|

| | | |

|_| |_|

----------------------------------------------------

-----------------------------------

| >>>Publication link<<<<< |

| victoria.ruiz.serra@gmail.com |

-----------------------------------

⡀⠀ Loadingusage: makevisualization [-h] -p <string> [<string> ...] [--pdb_list]

[-i <string>] [-s <string>] [-o <string>]

[-n <string>] [-it <string>] [-l <string>]

[-bg <string>] [-ns] [-mol <string>] [-is <string>]

[-f]

makechimerax generates ChimeraX scripts of the provided PDB code(s) from the

data contained in the mapped file generated by 3Dmapper.

optional arguments:

-h, --help show this help message and exit

-p <string> [<string> ...], --pdb_code <string> [<string> ...]

PDB code/s to be found within the mapped file. If

assembly is not included it will generate a script for

all mapped assemblies of that PDB code.

--pdb_list Specifies that PDBs provided are in a file, sepparated

by spaces, tabs or new lines.

-i <string>, --interface_positions <string>

File containing the variants mapped to interfaces

generated by 3Dmapper.

-s <string>, --structure_positions <string>

File containing the variants mapped to protein

structure generated by 3Dmapper.

-o <string>, --output <string>

Output folder in which the ChimeraX script/s will be

saved.

-n <string>, --name <string>

Base name for the ChimeraX scripts.

-it <string>, --inter_type <string>

If interfaces are available, filter by specified

interaction type (ligand, protein or nucleic).

-l <string>, --lighting <string>

Select lighting option: full, soft or simple.

-bg <string>, --background <string>

Select background option: white or black.

-ns, --no_silhouette Display will not present silhouettes.

-mol <string>, --mol_style <string>

Select option for molecule style: ball, sphere or

stick.

-is <string>, --itf_style <string>

Select option for molecule style of the interfaces

displayed: ball, sphere or stick.

-f, --force Force to overwrite?

実行方法

3Dmapperは、構造データベースとバリアントアノテーションファイルの生成から始まり、バリアントのタンパク質構造へのマッピング、そしてChimeraXを使った結果の可視化でパイプラインは終了する。使用するには、クエリのPDBファイルと問い合わせ先のタンパク質セットのBLAST DBが必要。チュートリアルの流れを確認する。

１、BLAST DB作成

チュートリアルでは、BLAST+のmakeblastdbコマンドを使ってEnsembl ID:

ENST00000367182 & ENST00000374005のヒト転写産物のタンパク質データベースを作成している（実行後のDBはcloneした3Dmapper/example/1-makestructuraldb/に含まれている）。チュートリアル通りコマンドを実行するなら以下の通り。

#1 change direcotry
cd 3Dmapper/example/1-makestructuraldb/

#2 download human proteome (20,598 seqs)
wget https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/reference_proteomes/Eukaryota/UP000005640/UP000005640_9606.fasta.gz
gzip -dv UP000005640_9606.fasta.gz

#3 run makeblastdb commnad
makeblastdb -in UP000005640_9606.fasta -dbtype prot -out human_proteome_uniprot

human_proteome_uniprot.phr、human_proteome_uniprot.pin、human_proteome_uniprot.peqが出来る。

２，構造DB作成

makestructuraldbコマンドを使用する。”--pdb”でPDBファイルあるいはCIF形式ファイルのパスを１行ずつ記載したテキストを指定する。"--blast_db"で１のBLAST DBを指定する。

makestructuraldb --pdb input_pdbs.txt --blast_db human_proteome_blastdb/human_proteome_uniprot --pident 95 -o outdir

--pdb PDB file path
--blast_db proteome files path (output of makeblastdb)
--pident percent identity threshold between query (pdb chain) and hit (protein) sequences. Default is 20 percent
-o output directory

ここでは２つのヒト転写産物のタンパク質構造のPDBファイルが分かっているため、それを使用しているが、wikiでは、現在PDBに構造がないタンパク質の構造データを取得したい場合に、PDBにあるファイルをすべてダウンロードし、配列相同性に頼って構造的ホモログを見つけたり、AlphaFold2モデルをダウンロードして使用する事も提案している。

３、バリアントデータベースの生成

makevariantsdbコマンドを使用する。入力ファイルvariants.vepには、転写産物ENST00000367182とENST00000374005に関連するバリアントが含まれている。

cd 3Dmapper/example/2-makevariantsdb/
makevariantsdb -vf variants.vep

-vf input VCF, VEP or VEP-like file(s)

DB/内にファイルが作成される。これらのファイルには、それぞれの転写産物に関連する variants が含まれている。

> ls -lR DB/

４，構造データベースへのバリアントのマッピング

mapperを使う。"-psdb"で2で作成した構造DB、”-vdb”で3で作成したバリアントDBを指定する。"--id_mapping"ではタンパク質ID、転写産物ID、遺伝子IDの関係を示したCSVファイルを指定する。

cd 3Dmapper/example/3-mapper/
mapper -pid P09769 O15151 -psdb ../1-makestructuraldb/structural_db/structuralDB/ -vdb ../2-makevariantsdb/DBs/varDB/ --id_mapping dict_geneprot_GRCh38_uniprot.txt -csv -a -l -o 3dmapper_results

-pid one or more IDs of protein, transcripts or genes provided via command line or from a file
-psdb interfaces database directory
-vdb variants database directory
--id_mapping File that contains the conversion of protein, transcripts and gene IDs and optionally APPRIS isoforms IDs.
-o output directory

このコマンドはバリアントを構造データベースのタンパク質構造にマップし、バリアントがインターフェースにマップされているか、構造にマップされているか、マップされていないか、ノンコーディングであるかに基づいて、バリアントに関する情報を含む4つのCSVファイルを生成する。

ls -lR 3dmapper_results/

５、ステップ４で得られたマッピング結果の可視化

makevisualizationコマンドを使ってマッピング結果を可視化するためのChimeraXスクリプト(.cxc)を生成する。

cd 3Dmapper/example/4-makevisualization/
makevisualization -p list_pdbs.txt --pdb_list -i ../3-mapper/3dmapper_results/csv/InterfacePositions_pident20.0_isoform_all_consequence_all.csv --force -s ../3-mapper/3dmapper_results/csv/StructurePositions_pident20.0_isoform_all_consequence_all.csv -is sphere

レポジトリの可視化例を見ると、バリアントによって変化したアミノ酸残基が強調して視覚化されている。

https://github.com/vicruiser/3Dmapper/wiki/Step-4.-Results-Visualization-with-ChimeraX

Chimeraスクリプトの実行については、このサイトの解説が参考になると思います。

https://yubais.net/doc/chimera/#example

引用

3Dmapper: a command line tool for BioBank-scale mapping of variants to protein structures
Victoria Ruiz-Serra, Samuel Valentini, Sergi Madroñero, Alfonso Valencia, Eduard Porta-Pardo
Bioinformatics, Published: 02 April 2024

参考

UCSF Chimera入門

https://yubais.net/doc/chimera/#example

2024-04-08

細菌のpopulation genomicsのためのインタラクティブなビューア Phandango

2018 Bioinformatics web tool population genomics bacteria

　現在の細菌集団ゲノミクスのデータセットに含まれる豊富なデータを十分に活用するには、数百から数千の分離株における数百万塩基対にわたるさまざまなタイプの解析を統合し、統合する必要がある。現在のアプローチでは、系統学的、疫学的、統計学的、進化学的な解析結果を静的に表現していることが多く、相互に関連付けることが難しい。Phandangoは、ウェブブラウザ上で動作するインタラクティブなアプリケーションであり、複数のゲノム解析手法からの出力を直感的かつインタラクティブに組み合わせて、大規模な集団ゲノムデータセットを高速に探索することができる。Phandangoは、www.phandango.netで自由に利用できるウェブアプリケーションであり、サンプルとして多様なデータセットコレクションが含まれている。ソースコードと詳細なWikiページは、GitHubのhttps://github.com/jameshadfield/phandangoにある。

About

https://github.com/jameshadfield/phandango/wiki

webサービス

https://jameshadfield.github.io/phandango/#/にアクセスする。

Example

Phandangoは、メタデータやゲノムデータにリンクされた系統樹（ツリー）を表示するように設計されている。これらのファイルは、適切なファイルがブラウザにドラッグされると（またはアプリから例の1つがロードされると）表示される。ほぼすべてのケースでツリーが必須となっている（系統樹なしでメタデータを表示しようとした場合など、不適切なデータ型に対しては警告が表示される）。

Mainに移動してGubbinsのtreeファイル（finalのツリーファイル:.final_tree.tre）をドラッグ&ドロップした。

図はトラックパッド／マウスホイールでズームできる。controlまたはcommandを押しながらズームすると、水平方向にのみツリーをズームできる。ノードを右クリックすると表示オプションが表示される。

GubbinsのGFF3をドラッグした。

上のセッティングには画面表示に関するパラメータがある。

https://github.com/jameshadfield/phandango/wiki/Input-data-formatsより

メタデータはツリー内の関連する分類群に合わせて色分けされたブロックとして表示される。そのためツリーは必須。
plink形式のGWAS結果を視覚化できる。GWAS視覚化時のゲノムアノテーションは表示の右上に表示される。このアノテーションはGFF3形式で、組換え／GWASの結果を視覚化するためには必須。末尾が.gffまたは.gff3でなければならない。
BRAT NextGen（細菌データセット内の組換えイベントを検出するツール）の出力のタブ区切りのtxtファイル（デフォルトでsegments_tabular.txt）をを視覚化できる。
ROARY出力のgene_presence_absence.csvを視覚化できる。

引用

Phandango: an interactive viewer for bacterial population genomics
James Hadfield, Nicholas J Croucher, Richard J Goater, Khalil Abudahab, David M Aanensen, Simon R Harris Author Notes
Bioinformatics, Volume 34, Issue 2, January 2018, Pages 292–293

参考

Roaryのパンゲノムプロット（株それぞれのパンゲノムを構成する遺伝子の有り/無し）とツリーを視覚化する。

http://sepsis-omics.github.io/tutorials/modules/roary/#vizualize-with-phandango

関連

・近縁な何百~何千のバクテリアの系統解析を行うGubbins

・SNVをコールしたり、全ゲノムのマルチプルアライメントを行う Snippy

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

WGSやRNA-seqやTarget-captureなどのさまざまなショートリードデータからユーザーが指定した数百〜数千の遺伝子座の同祖配列を抽出し、系統解析に使用可能なMSAを出力する CAPTUS

GO termからタンパク質の機能的要約を生成する GO2Sum

（ヒト）AlphaFoldでモデル化されたタンパク質間相互作用のデータベース Predictomes

メタゲノムアセンブリからのターゲットとするウイルスゲノムの完全性と連続性を向上させる COBRA

タンパク質言語モデルにより正確で高速なリモート相同性配列検索を行う PLMSearch

タンパク質構造へのバリアントのマッピングのためのコマンドラインツール 3Dmapper

細菌のpopulation genomicsのためのインタラクティブなビューア Phandango