macでインフォマティクス (Informatics on Mac)

A collection of notes on informatics for HTS (NGS).

Porechop, an adapter trimming tool for Nanopore reads

2020 5/18: installation steps added

 

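 Porechop is an adapter trimming tool for Oxford Nanopore reads. It ships with its own database of adapter sequences, so it recognizes and removes adapters automatically. It can also remove the index (barcode) sequences of multiplexed runs.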

 

Download link

GitHub - rrwick/Porechop: Adapter trimmer for Oxford Nanopore reads

 

Installation

GitHub

# Option 1: install from bioconda into a dedicated environment
conda create -n porechop -y
conda activate porechop
conda install -c bioconda porechop -y

# Option 2: install from source
git clone https://github.com/rrwick/Porechop.git
cd Porechop
python3 setup.py install # also puts porechop on the PATH (under /usr/local/bin)
porechop -h # show the help
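After either installation route, it can be worth confirming that the porechop executable is actually found on the PATH. A minimal sketch that relies only on the --version flag listed in the help output; the STATUS variable is just an illustrative name:

```shell
# Check whether porechop is reachable from the current shell.
if command -v porechop >/dev/null 2>&1; then
    STATUS="porechop found, version: $(porechop --version 2>&1)"
else
    STATUS="porechop not found on PATH (is the conda environment active?)"
fi
echo "$STATUS"
```

If the second message appears after a bioconda install, running conda activate porechop first is the most likely fix.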

$ porechop -h

usage: porechop -i INPUT [-o OUTPUT] [--format {auto,fasta,fastq,fasta.gz,fastq.gz}] [-v VERBOSITY] [-t THREADS] [-b BARCODE_DIR] [--barcode_threshold BARCODE_THRESHOLD] [--barcode_diff BARCODE_DIFF] [--require_two_barcodes]

                [--untrimmed] [--discard_unassigned] [--adapter_threshold ADAPTER_THRESHOLD] [--check_reads CHECK_READS] [--scoring_scheme SCORING_SCHEME] [--end_size END_SIZE] [--min_trim_size MIN_TRIM_SIZE]

                [--extra_end_trim EXTRA_END_TRIM] [--end_threshold END_THRESHOLD] [--no_split] [--discard_middle] [--middle_threshold MIDDLE_THRESHOLD] [--extra_middle_trim_good_side EXTRA_MIDDLE_TRIM_GOOD_SIDE]

                [--extra_middle_trim_bad_side EXTRA_MIDDLE_TRIM_BAD_SIDE] [--min_split_read_size MIN_SPLIT_READ_SIZE] [-h] [--version]

 

Porechop: a tool for finding adapters in Oxford Nanopore reads, trimming them from the ends and splitting reads with internal adapters

 

Main options:

  -i INPUT, --input INPUT               FASTA/FASTQ of input reads or a directory which will be recursively searched for FASTQ files (required)

  -o OUTPUT, --output OUTPUT            Filename for FASTA or FASTQ of trimmed reads (if not set, trimmed reads will be printed to stdout)

  --format {auto,fasta,fastq,fasta.gz,fastq.gz}

                                        Output format for the reads - if auto, the format will be chosen based on the output filename or the input read format (default: auto)

  -v VERBOSITY, --verbosity VERBOSITY   Level of progress information: 0 = none, 1 = some, 2 = lots, 3 = full - output will go to stdout if reads are saved to a file and stderr if reads are printed to stdout (default: 1)

  -t THREADS, --threads THREADS         Number of threads to use for adapter alignment (default: 12)

 

Barcode binning settings:

  Control the binning of reads based on barcodes (i.e. barcode demultiplexing)

 

  -b BARCODE_DIR, --barcode_dir BARCODE_DIR

                                        Reads will be binned based on their barcode and saved to separate files in this directory (incompatible with --output)

  --barcode_threshold BARCODE_THRESHOLD

                                        A read must have at least this percent identity to a barcode to be binned (default: 75.0)

  --barcode_diff BARCODE_DIFF           If the difference between a read's best barcode identity and its second-best barcode identity is less than this value, it will not be put in a barcode bin (to exclude cases which are too

                                        close to call) (default: 5.0)

  --require_two_barcodes                Reads will only be put in barcode bins if they have a strong match for the barcode on both their start and end (default: a read can be binned with a match at its start or end)

  --untrimmed                           Bin reads but do not trim them (default: trim the reads)

  --discard_unassigned                  Discard unassigned reads (instead of creating a "none" bin) (default: False)

 

Adapter search settings:

  Control how the program determines which adapter sets are present

 

  --adapter_threshold ADAPTER_THRESHOLD

                                        An adapter set has to have at least this percent identity to be labelled as present and trimmed off (0 to 100) (default: 90.0)

  --check_reads CHECK_READS             This many reads will be aligned to all possible adapters to determine which adapter sets are present (default: 10000)

  --scoring_scheme SCORING_SCHEME       Comma-delimited string of alignment scores: match, mismatch, gap open, gap extend (default: 3,-6,-5,-2)

 

End adapter settings:

  Control the trimming of adapters from read ends

 

  --end_size END_SIZE                   The number of base pairs at each end of the read which will be searched for adapter sequences (default: 150)

  --min_trim_size MIN_TRIM_SIZE         Adapter alignments smaller than this will be ignored (default: 4)

  --extra_end_trim EXTRA_END_TRIM       This many additional bases will be removed next to adapters found at the ends of reads (default: 2)

  --end_threshold END_THRESHOLD         Adapters at the ends of reads must have at least this percent identity to be removed (0 to 100) (default: 75.0)

 

Middle adapter settings:

  Control the splitting of read from middle adapters

 

  --no_split                            Skip splitting reads based on middle adapters (default: split reads when an adapter is found in the middle)

  --discard_middle                      Reads with middle adapters will be discarded (default: reads with middle adapters are split) (required for reads to be used with Nanopolish, this option is on by default when outputting reads

                                        into barcode bins)

  --middle_threshold MIDDLE_THRESHOLD   Adapters in the middle of reads must have at least this percent identity to be found (0 to 100) (default: 90.0)

  --extra_middle_trim_good_side EXTRA_MIDDLE_TRIM_GOOD_SIDE

                                        This many additional bases will be removed next to middle adapters on their "good" side (default: 10)

  --extra_middle_trim_bad_side EXTRA_MIDDLE_TRIM_BAD_SIDE

                                        This many additional bases will be removed next to middle adapters on their "bad" side (default: 100)

  --min_split_read_size MIN_SPLIT_READ_SIZE

                                        Post-split read pieces smaller than this many base pairs will not be outputted (default: 1000)

 

Help:

  -h, --help                            Show this help message and exit

  --version                             Show program's version number and exit


 

 

 

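Porechop first takes 10000 reads (at random?) and matches them against its adapter library to detect which adapter sets are present. The threshold for this step is at least 90% identity by default, and it can be changed with --adapter_threshold. The number of reads checked is set by --check_reads.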
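As an illustration of the detection step described above, the following sketch tightens the adapter-set threshold and checks more reads. The file names and the values 95 and 20000 are assumptions chosen for the example, not recommendations; only flags that appear in the help output are used.

```shell
# Hypothetical file names (no spaces); replace with your own.
INPUT=input_reads.fastq.gz
OUTPUT=output_reads.fastq.gz

# Require 95% identity (default 90) for an adapter set to count as present,
# and inspect 20000 reads (default 10000) when deciding.
ARGS="-i $INPUT -o $OUTPUT --adapter_threshold 95 --check_reads 20000"
echo "porechop $ARGS"

# Run only when porechop is installed and the input file exists.
if command -v porechop >/dev/null 2>&1 && [ -f "$INPUT" ]; then
    porechop $ARGS
fi
```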

 

Run

Reads are compared against Porechop's adapter database, and adapter sequences are removed from the start and end of each read.

porechop -i input_reads.fastq.gz -o output_reads.fastq.gz 

Uncompressed fastq/fasta files can also be used as input. The parameters that look most important are listed here.

  • --adapter_threshold: An adapter set has to have at least this percent identity to be labelled as present and trimmed off (0 to 100) (default: 90.0)
  • --end_size: The number of base pairs at each end of the read which will be searched for adapter sequences (default: 150)
  • --min_trim_size: Adapter alignments smaller than this will be ignored (default: 4)
  • --end_threshold: Adapters at the ends of reads must have at least this percent identity to be removed (0 to 100) (default: 75.0)
  • --extra_end_trim: This many additional bases will be removed next to adapters found at the ends of reads (default: 2)

 

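--end_threshold in particular looks likely to have a large effect on the result. The author has validated Porechop only on 1D ONT reads and makes no guarantees for 2D data. For higher-accuracy 2D data, it may be better to make the identity-related thresholds somewhat stricter.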
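One way to see how much --end_threshold matters for a given dataset is to trim the same input at a few values and compare the summaries. This is only a sketch: the input name, the output and log names, and the values 75, 85 and 95 are assumptions made for illustration.

```shell
INPUT=input_reads.fastq.gz   # hypothetical input file
OUTPUTS=""

for t in 75 85 95; do
    out="trimmed_end${t}.fastq.gz"
    OUTPUTS="$OUTPUTS $out"
    # Run only when porechop is installed and the input file exists.
    if command -v porechop >/dev/null 2>&1 && [ -f "$INPUT" ]; then
        # With -o set, progress information goes to stdout, so keep it as a log.
        porechop -i "$INPUT" -o "$out" --end_threshold "$t" > "porechop_end${t}.log"
    fi
done
echo "planned outputs:$OUTPUTS"
```

Comparing the "reads had adapters trimmed" lines across the three logs shows how many extra reads are trimmed at the looser settings.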

Results

Trimming adapters from read ends

     SQK-NSK007_Y_Top: AATGTACTTCGTTCAGTTACGTATTGCT

  SQK-NSK007_Y_Bottom: GCAATACGTAACTGAACGAAGT

        Rapid_adapter: GTTTTCGCATTTATCGTGAAACGCTTTCGCGTTTTTCGTGCGCCGCTTCA

 

14198 / 22907 reads had adapters trimmed from their start (842478 bp removed)

  778 / 22907 reads had adapters trimmed from their end (8997 bp removed)

 

 

Splitting reads containing middle adapters

0 / 22907 reads were split based on middle adapters

 

 

Saved result to /Users/user/nanopore2/merged_trimmed.fastq

 

This is the output obtained when trimming sequencing data from the lambda control.
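The counts in this summary are easier to compare between runs as percentages. The sketch below feeds the two summary lines shown above to awk and reports the fraction of reads trimmed and the mean number of bases removed per trimmed read. It assumes the line layout stays exactly as in this output; the SUMMARY variable name is illustrative.

```shell
# Parse the two "reads had adapters trimmed" lines from the run above.
SUMMARY=$(printf '%s\n' \
  '14198 / 22907 reads had adapters trimmed from their start (842478 bp removed)' \
  '  778 / 22907 reads had adapters trimmed from their end (8997 bp removed)' |
awk '/reads had adapters trimmed from their/ {
    # Field 11 holds the bp count with a leading parenthesis; strip it.
    bp = substr($11, 2) + 0
    printf "%s: %.1f%% of reads trimmed, %.1f bp per trimmed read\n", $10, 100 * $1 / $3, bp / $1
}')
echo "$SUMMARY"
```

For a real run, the same awk program can be pointed at a saved log file instead of the printf input.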

Adding --verbosity 2 prints more detail, such as where in each read the adapter sequences were found.

porechop -i input_reads.fastq.gz -o output_reads.fastq.gz --verbosity 2
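The barcode binning mentioned at the top of this post is run in much the same way, with -b taking the place of -o (the help output notes the two are incompatible). A minimal sketch, assuming a multiplexed input file and an output directory name that are both hypothetical:

```shell
# Hypothetical names (no spaces); replace with your own.
INPUT=input_reads.fastq.gz
BARCODE_DIR=demultiplexed

# Bin reads by barcode, require a barcode match at both ends,
# and drop reads that cannot be assigned to any barcode.
ARGS="-i $INPUT -b $BARCODE_DIR --require_two_barcodes --discard_unassigned"
echo "porechop $ARGS"

# Run only when porechop is installed and the input file exists.
if command -v porechop >/dev/null 2>&1 && [ -f "$INPUT" ]; then
    porechop $ARGS
fi
```

Dropping --require_two_barcodes returns to the default, where a match at either the start or the end of a read is enough for binning.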