非コード転写産物のマルチマッピングおよびマルチオーバーラッピングアラインメントのあいまいさに対処するRNA-seq定量ツール MGcount

2022 1/27追記

　トータルRNAシーケンス（total-RNA-seq）により、コーディングとノンコーディングの両方のトランスクリプトームを同時に研究することができる。しかし、計算パイプラインは従来、特定のバイオタイプに焦点を当て、total-RNA-seqデータセットが満たさない仮定をしてきた。異なるRNAバイオタイプの転写産物は、長さ、生合成、および機能が異なり、ゲノム領域で重複することがあり、高いコピー数でゲノムに存在することがある。そのため、total-RNA-seqライブラリのリードは、ゲノムアラインメントをあいまいにする可能性があり、柔軟な定量アプローチが必要とされている。

　本発表では、曖昧なアラインメントを処理するための2つの戦略を組み合わせたtotal-RNA-seq定量ツール、MultiGraph count (MGcount)を紹介する。MGcountは、ゲノム上の同じ位置で転写産物が重なったときの長さの違いを考慮し、リードをsmall-RNAとlong-RNAのフィーチャーに階層的に割り当てる。次に、MGcountは、グラフベースのアプローチにより、リードが系統的にマルチマップする類似配列のRNA産物を集約する。MGcountは、RNAシーケンスの下流解析パイプラインと互換性のあるトランスクリプトームカウントマトリックス、バルクとシングルセルの両方の分解能、および異なるバイオタイプの反復転写構造をモデル化したグラフを出力する。本ソフトウェアは、Pythonモジュールまたは単一ファイルの実行プログラムとして使用することができる。

　MGcountは、複数のゲノム位置にアライメントするリードや、複数の遺伝子フィーチャーとオーバーラップするリードをうまく統合する、柔軟なtotal-RNA-seq定量ツールである。そのアプローチは、タンパク質コード、ロングノンコード、スモールノンコードの転写産物濃度を、前駆体および処理済みの両方の形態で同時に推定するのに適している。ソースコードとコンパイルされたソフトウェアの両方が、https://github.com/hitaandrea/MGcount で入手できる。

user guide

https://filedn.com/lTnUWxFTA93JTyX3Hvbdn2h/mgcount/UserGuide.html

マニュアルより
MGcountは、計算効率の良いソフトウェアとして知られるFeatureCountsの上に構築されている（Liao et.al., 2014）。

インストール

condaで仮想環境を作ってテストした。

Github

mamba create -n MGcount python=3.9 -y
conda activate MGcount
pip3 install git+https://github.com/hitaandrea/MGcount.git
#subread
mamba install -c bioconda subread -y

#非公式docker iamgeも見つかる
docker pull gcfntnu/mgcount:1.0.0

> python3 -m mgcount -h

$ python3 -m mgcount -h

usage: __main__.py [-h] --bam_infiles BAM_INFILES --outdir OUTDIR --gtf GTF [--sample_id SAMPLE_ID] [-T N_CORES] [-p] [-s {0,1,2}] [--th_low TH_LOW] [--th_high TH_HIGH] [--seed SEED] [--feature_output_long FEATURE_OUTPUT_LONG]

[--feature_biotype_long FEATURE_BIOTYPE_LONG] [--min_overlap_long MIN_OVERLAP_LONG] [--ml_flag_long ML_FLAG_LONG] [--feature_small FEATURE_SMALL] [--feature_output_small FEATURE_OUTPUT_SMALL]

[--feature_biotype_small FEATURE_BIOTYPE_SMALL] [--min_overlap_small MIN_OVERLAP_SMALL] [--ml_flag_small ML_FLAG_SMALL] [--featureCounts_path FEATURECOUNTS_PATH] [--btyperounds_filename BTYPEROUNDS_FILENAME]

MGcount RNA-seq quantification pipeline

optional arguments:

-h, --help show this help message and exit

--sample_id SAMPLE_ID

Optional sampleID names

-T N_CORES, --n_cores N_CORES

Number of cores for parallelization

-p, --paired_flag Paired end flag

-s {0,1,2}, --strand_option {0,1,2}

Options available are 0: unstranded, 1: forwardstranded (default) or 2:reverse_stranded

--th_low TH_LOW Low minimal threshold for feature-to-feature multi-mapping fraction

--th_high TH_HIGH High minimal threshold for feature-to-feature multi-mapping fraction

--seed SEED Optional fixed seed for random numbers generation during communities detection

--feature_output_long FEATURE_OUTPUT_LONG

GTF field name for which to summarize expresion of longRNA assigned reads

--feature_biotype_long FEATURE_BIOTYPE_LONG

GTF field name defining biotype for longRNA features

--min_overlap_long MIN_OVERLAP_LONG

Minimal feature-alignment overlapping fraction for assignation (default 0.75) for long round

--ml_flag_long ML_FLAG_LONG

Multi-loci graph detection based groups flag for long round

--feature_small FEATURE_SMALL

GTF feature type for smallRNA reads to be assigned to

--feature_output_small FEATURE_OUTPUT_SMALL

GTF field name for which to summarize counts of smallRNA assigned reads

--feature_biotype_small FEATURE_BIOTYPE_SMALL

GTF field name defining biotype for smallRNA features

--min_overlap_small MIN_OVERLAP_SMALL

Minimal feature-alignment overlapping fraction for assignation (default 1) for long round

--ml_flag_small ML_FLAG_SMALL

Multi-loci graph detection based groups flag for small round

--featureCounts_path FEATURECOUNTS_PATH

Path to featureCounts executable file (by default, the software is looked for on /user/bin/featureCounts)

--btyperounds_filename BTYPEROUNDS_FILENAME

Optional .csv file with gene_biotype assignation_round custom defined pairs

required arguments:

--bam_infiles BAM_INFILES, -i BAM_INFILES

Alignment input file names

--outdir OUTDIR, -o OUTDIR

Output directory path

--gtf GTF Annotations file name

チュートリアルデータのラン

マニュアルでは、サブサンプリングされたヒト脳RNA-seqからのbamファイル２つを使ったチュートリアルが用意されている。

１、データの準備

mkdir mgcount_tutorial
cd mgcount_tutorial
wget https://filedn.com/lTnUWxFTA93JTyX3Hvbdn2h/mgcount/tutorial_bamfiles.zip
unzip tutorial_bamfiles.zip -d input_bamfiles

２、MGcountのランに必要なファイルの準備

ランには、入力アライメントファイルのパスを指定したテキストファイルを提供する。

bamのパスを書いたテキストファイルの作成

printf "$PWD/%s\n" input_bamfiles/* > input_bamfilenames.txt

input_bamfiles/

f:id:kazumaxneo:20220126225613p:plain

もう１つ、転写産物のアノテーションを含む.gtf ファイルも必要。MGcountリポジトリでは、複数のデータベースからのアノテーションを統合した4つの.gtfファイルが用意されている。それらをダウンロードする。

wget https://filedn.com/lTnUWxFTA93JTyX3Hvbdn2h/mgcount/integrated_annotations_gtf.zip
unzip integrated_annotations_gtf.zip -d annotations_gtf

annotations_gtf/

f:id:kazumaxneo:20220126224930p:plain

３、MGcountのラン

GTFファイルとbamファイルのパスを示したテキストを指定する。featurecountsのパスは、パスが通っていても/user/bin/しかチェックされない。--featureCounts_pathでFeatureCountsのパスを指定する（which featureconts）。

python3 -m mgcount -T 2 --gtf annotations_gtf/Homo_sapiens.GRCh38.custom.gtf --outdir outputs --bam_infiles input_bamfilenames.txt --featureCounts_path <path>/<to>/featureCounts

-T Number of cores for parallelization
--gtf Annotations file name
--outdir Output directory path
--bam_infiles Alignment input file names
--featureCounts_path Path to featureCounts executable file (by default, the software is looked for on /user/bin/featureCounts)
-s {0,1,2} Options available are 0: unstranded, 1: forwardstranded (default) or 2:reverse_stranded

注意；featurecountsのパスエラーは、ランに失敗していもラン途中にエラーメッセージが出ないのでわかりにくい。/user/bin/に置いていないなら必ず--featureCounts_pathでfeaturecountsのパスを指定する。

MGcountが終了すると、出力ディレクトリにリードカウント行列、Featureメタデータテーブル、long RNAとsmall RNAそれぞれについて検出されたグラフ構造とコミュニティを含む2つのファイルなど、MGcountが生成した6つの新しい出力ファイルが書き出される（マニュアルより）。

出力例/

f:id:kazumaxneo:20220127210548p:plain

count_matrix.csv/

f:id:kazumaxneo:20220127212411p:plain

MGcountはRのコンソールでsystem関数で呼び出してランすることもできます。

また、MGcountのリポジトリには、アノテーションの管理、MGcount出力の可視化RののためのRのサポートスクリプトが含まれており、チュートリアルでは、これらを使ってexon, intron、non-codingのカウント数を比較したり、マルチマッピンググラフを視覚するチュートリアルが用意されています。アクセスしてみて下さい。

引用

MGcount: a total RNA-seq quantification tool to address multi-mapping and multi-overlapping alignments ambiguity in non-coding transcripts
Andrea Hita, Gilles Brocart, Ana Fernandez, Marc Rehmsmeier, Anna Alemany & Sol Schvartzman
BMC Bioinformatics volume 23, Article number: 39 (2022)