macでインフォマティクス (Informatics on Mac)

A collection of informatics notes related to HTS (NGS).

Installing Trinity

 

Trinity continues to receive bug fixes and performance improvements, and as of May 2022 v2.14 is available. v2.14 cannot yet be installed with conda, so I build it from source.

 

Installing Trinity

https://github.com/trinityrnaseq/trinityrnaseq/wiki/Installing-Trinity

 

Installation

Dependencies

GitHub

mamba create -n trinity python=3.9
conda activate trinity
mamba install -c bioconda -y kmer-jellyfish # or build from source
mamba install -c bioconda salmon -y
mamba install -c bioconda bowtie2=2.4.5 -y # slightly older versions cause an error
mamba install -c bioconda samtools -y # or build from source
# Trinity itself (v2.15 was released in December 2022; v2.15.1 is the latest as of February 2023)
wget https://github.com/trinityrnaseq/trinityrnaseq/releases/download/Trinity-v2.15.0/trinityrnaseq-v2.15.0.FULL.tar.gz
tar -xvf trinityrnaseq-v2.15.0.FULL.tar.gz
cd trinityrnaseq-v2.15.0/
make -j20
make plugins # build the additional plugin components
sudo make install
# if a Perl DB error appears
mamba install -c bioconda perl-db-file -y

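# Trinity itself can also be installed with conda (the Bioconda release, previously 2.13, is now 2.15.1; check via the link above)
# Bioconda
mamba create -n trinity python=3.9
conda activate trinity
# Without a version pin an old version is sometimes installed, so specify the version.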
mamba install -c conda-forge -c bioconda -y trinity=2.15.1
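
Before running anything, it can help to confirm that the tools installed above are actually visible in the active environment. The following is a minimal sketch, not part of the Trinity documentation: missing_tools is a hypothetical helper name, and the executable names (jellyfish, salmon, bowtie2, samtools, Trinity) are what I assume the packages above provide.

```shell
#!/bin/sh
# Print the name of every given command that is not found on PATH.
# Prints nothing when all of them are available.
missing_tools() {
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
        fi
    done
}

# Assumed executable names for the packages installed above.
missing=$(missing_tools jellyfish salmon bowtie2 samtools Trinity)

if [ -n "$missing" ]; then
    printf 'Not found on PATH:\n%s\n' "$missing"
else
    echo "All tools found."
fi
```

If something is reported as missing, check that the trinity environment is activated (conda activate trinity) before reinstalling.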

> Trinity

###############################################################################
#

     ______  ____   ____  ____   ____  ______  __ __
    |      ||    \ |    ||    \ |    ||      ||  |  |
    |      ||  D  ) |  | |  _  | |  | |      ||  |  |
    |_|  |_||    /  |  | |  |  | |  | |_|  |_||  ~  |
      |  |  |    \  |  | |  |  | |  |   |  |  |___, |
      |  |  |  .  \ |  | |  |  | |  |   |  |  |     |
      |__|  |__|\_||____||__|__||____|  |__|  |____/

    Trinity-v2.14.0


#
#
# Required:
#
#  --seqType <string>      :type of reads: ('fa' or 'fq')
#
#  --max_memory <string>      :suggested max memory to use by Trinity where limiting can be enabled. (jellyfish, sorting, etc)
#                            provided in Gb of RAM, ie.  '--max_memory 10G'
#
#  If paired reads:
#      --left  <string>    :left reads, one or more file names (separated by commas, no spaces)
#      --right <string>    :right reads, one or more file names (separated by commas, no spaces)
#
#  Or, if unpaired reads:
#      --single <string>   :single reads, one or more file names, comma-delimited (note, if single file contains pairs, can use flag: --run_as_paired )
#
#  Or,
#      --samples_file <string>         tab-delimited text file indicating biological replicate relationships.
#                                   ex.
#                                        cond_A    cond_A_rep1    A_rep1_left.fq    A_rep1_right.fq
#                                        cond_A    cond_A_rep2    A_rep2_left.fq    A_rep2_right.fq
#                                        cond_B    cond_B_rep1    B_rep1_left.fq    B_rep1_right.fq
#                                        cond_B    cond_B_rep2    B_rep2_left.fq    B_rep2_right.fq
#
#                      # if single-end instead of paired-end, then leave the 4th column above empty.
#
####################################
##  Misc:  #########################
#
#  --SS_lib_type <string>          :Strand-specific RNA-Seq read orientation.
#                                   if paired: RF or FR,
#                                   if single: F or R.   (dUTP method = RF)
#                                   See web documentation.
#
#  --CPU <int>                     :number of CPUs to use, default: 2
#  --min_contig_length <int>       :minimum assembled contig length to report
#                                   (def=200, must be >= 100)
#
#  --long_reads <string>           :fasta file containing error-corrected or circular consensus (CCS) pac bio reads
#                                   (** note: experimental parameter **, this functionality continues to be under development)
#
#  --genome_guided_bam <string>    :genome guided mode, provide path to coordinate-sorted bam file.
#                                   (see genome-guided param section under --show_full_usage_info)
#
#  --long_reads_bam <string>       :long reads to include for genome-guided Trinity
#                                  (bam file consists of error-corrected or circular consensus (CCS) pac bio read aligned to the genome)
#
#  --jaccard_clip                  :option, set if you have paired reads and
#                                   you expect high gene density with UTR
#                                   overlap (use FASTQ input file format
#                                   for reads).
#                                   (note: jaccard_clip is an expensive
#                                   operation, so avoid using it unless
#                                   necessary due to finding excessive fusion
#                                   transcripts w/o it.)
#
#  --trimmomatic                   :run Trimmomatic to quality trim reads
#                                        see '--quality_trimming_params' under full usage info for tailored settings.
#
#  --output <string>               :name of directory for output (will be
#                                   created if it doesn't already exist)
#                                   default( your current working directory: "/media/kazu/4TB4/Sugimoto_denovoRNAseq/trinity_out_dir" 
#                                    note: must include 'trinity' in the name as a safety precaution! )
#  
#  --full_cleanup                  :only retain the Trinity fasta file, rename as ${output_dir}.Trinity.fasta
#
#  --cite                          :show the Trinity literature citation
#
#  --verbose                       :provide additional job status info during the run.
#
#  --version                       :reports Trinity version (Trinity-v2.14.0) and exits.
#
#  --show_full_usage_info          :show the many many more options available for running Trinity (expert usage).
#
#
###############################################################################
#
#  *Note, a typical Trinity command might be:
#
#        Trinity --seqType fq --max_memory 50G --left reads_1.fq  --right reads_2.fq --CPU 6
#
#            (if you have multiple samples, use --samples_file ... see above for details)
#
#    and for Genome-guided Trinity, provide a coordinate-sorted bam:
#
#        Trinity --genome_guided_bam rnaseq_alignments.csorted.bam --max_memory 50G
#                --genome_guided_max_intron 10000 --CPU 6
#
#     see: /home/kazu/Documents/trinityrnaseq-v2.14.0.FULL/trinityrnaseq-v2.14.0/sample_data/test_Trinity_Assembly/
#          for sample data and 'runMe.sh' for example Trinity execution
#
#     For more details, visit: http://trinityrnaseq.github.io
#
###############################################################################

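Trinity is now installed. In v2.14 the default minimum contig length is reportedly 100 nt (note that the help output above prints def=200, must be >= 100), so take care when comparing results with earlier versions.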
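
One way to put assemblies from different versions on an equal footing is to apply the same length cutoff to each FASTA before comparing them. The sketch below is an illustration only: filter_fasta_by_length is a hypothetical helper of my own, not a Trinity utility, and the small demo file stands in for a real assembly.

```shell
#!/bin/sh
# Keep only FASTA records whose sequence is at least <min_len> nt long.
# Multi-line sequences are joined and written on a single line.
# Usage: filter_fasta_by_length <min_len> <in.fasta>
filter_fasta_by_length() {
    awk -v min="$1" '
        # Print the buffered record if it passes the length cutoff.
        function emit() {
            if (header != "" && length(seq) >= min + 0) {
                print header
                print seq
            }
        }
        /^>/ { emit(); header = $0; seq = ""; next }
        { seq = seq $0 }
        END { emit() }
    ' "$2"
}

# Tiny demo input (made-up sequences, not real Trinity output).
demo=$(mktemp)
cat > "$demo" <<'EOF'
>contig1
ACGTACGT
>contig2
ACG
EOF

# With a cutoff of 5 nt, only contig1 (8 nt) is printed.
filter_fasta_by_length 5 "$demo"
rm -f "$demo"
```

For a real comparison, the same function could be run with a cutoff of 200 on each assembly FASTA. Alternatively, setting --min_contig_length explicitly at run time avoids depending on the default.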

 

Example run

De novo transcriptome assembly

# De novo assembly using FASTQ files merged from all samples
Trinity --seqType fq --left R1.fq.gz --right R2.fq.gz --max_memory 200G --CPU 40 --output Trinity_outdir
  • --SS_lib_type  Strand-specific RNA-Seq read orientation. if paired: RF or FR, if single: F or R.   (dUTP method = RF) See web documentation.

  • --left    left reads, one or more file names (separated by commas, no spaces)

  • --right    right reads, one or more file names (separated by commas, no spaces)
  • --single   single reads, one or more file names, comma-delimited (note, if single file contains pairs, can use flag: --run_as_paired )
  • --samples_file    tab-delimited text file indicating biological replicate relationships.

  • --CPU   number of CPUs to use, default: 2
  • --long_reads    fasta file containing error-corrected or circular consensus (CCS) pac bio reads  (** note: experimental parameter **, this functionality continues to be under development)

     

  • --genome_guided_bam    genome guided mode, provide path to coordinate-sorted bam file.

  • --output   name of directory for output (will be created if it doesn't already exist)

  • --full_cleanup   only retain the Trinity fasta file, rename as ${output_dir}.Trinity.fasta
  • --jaccard_clip   option, set if you have paired reads and you expect high gene density with UTR overlap (use FASTQ input file format for reads). (note: jaccard_clip is an expensive operation, so avoid using it unless necessary due to finding excessive fusion transcripts w/o it.)
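
Since --left and --right accept several file names separated by commas with no spaces, per-sample files can be passed directly instead of being merged first. Here is a minimal sketch of building those lists. join_by_comma is a hypothetical helper, the sample file names are placeholders, and the command is only printed rather than executed.

```shell
#!/bin/sh
# Join all arguments with commas and no spaces, the form that
# --left / --right expect. The function body runs in a subshell
# so the IFS change does not leak into the caller.
join_by_comma() (
    IFS=,
    printf '%s\n' "$*"
)

# Placeholder per-sample files; replace with your own.
left=$(join_by_comma sampleA_R1.fq.gz sampleB_R1.fq.gz)
right=$(join_by_comma sampleA_R2.fq.gz sampleB_R2.fq.gz)

# Print the command for review instead of running it.
echo Trinity --seqType fq --left "$left" --right "$right" \
    --max_memory 200G --CPU 40 --output Trinity_outdir
```

The files must be listed in the same sample order for --left and --right. Remove the echo once the printed command looks right.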

                             

Assembly that incorporates cDNA long reads has been supported for several versions now. I would like to try it when I get the chance.

Citation

Full-length transcriptome assembly from RNA-Seq data without a reference genome.

Manfred G Grabherr, Brian J Haas, Moran Yassour, Joshua Z Levin, Dawn A Thompson, Ido Amit, Xian Adiconis, Lin Fan, Raktima Raychowdhury, Qiandong Zeng, Zehua Chen, Evan Mauceli, Nir Hacohen, Andreas Gnirke, Nicholas Rhind, Federica di Palma, Bruce W Birren, Chad Nusbaum, Kerstin Lindblad-Toh, Nir Friedman & Aviv Regev

Nature Biotechnology 29, 644–652 (2011)

 

Related