NextPolish Parameter Reference

NextPolish requires at least one assembly file (option: genome) and one read file list (option: sgs_fofn or lgs_fofn or hifi_fofn) as input, it works with gzip’d FASTA and FASTQ formats and uses a config file to pass options.

Input

  • genome file

    genome=/path/to/need_to_be_polished_assembly_file
    
  • read file list (one file one line, paired-end files should be interleaved)

    ls reads1_R1.fq reads1_R2.fq reads2_R1.fq.gz reads2_R2.fq.gz ... > sgs.fofn
    
  • config file

    A config file is a text file that contains a set of parameters (key=value pairs) to set runtime parameters for NextPolish. The following is a typical config file, which is also located in doc/run.cfg.

    [General]
    job_type = local
    job_prefix = nextPolish
    task = best
    rewrite = yes
    rerun = 3
    parallel_jobs = 6
    multithread_jobs = 5
    genome = ./raw.genome.fasta
    genome_size = auto
    workdir = ./01_rundir
    polish_options = -p {multithread_jobs}
    
    [sgs_option] #optional
    sgs_fofn = ./sgs.fofn
    sgs_options = -max_depth 100 -bwa
    
    [lgs_option] #optional
    lgs_fofn = ./lgs.fofn
    lgs_options = -min_read_len 1k -max_depth 100
    lgs_minimap2_options = -x map-ont
    
    [hifi_option] #optional
    hifi_fofn = ./hifi.fofn
    hifi_options = -min_read_len 1k -max_depth 100
    hifi_minimap2_options = -x asm20
    

Output

  • genome.nextpolish.fasta

    Polished genome with fasta format, the fasta header includes primary seqID, length. A lowercase letter indicates a low quality base after polishing, this usually caused by heterozygosity.

  • genome.nextpolish.fasta.stat

    Some basic statistical information of the polished genome.

Options

Global options

job_type = sge

local, sge, pbs… (default: sge)

job_prefix = nextPolish

prefix tag for jobs. (default: nextPolish)

task = best

task need to run [all, default, best, 1, 2, 5, 12, 1212…], 1, 2 are different algorithm modules for short reads, while 5 is the algorithm module for long reads, all=[5]1234, default=[5]12, best=[55]1212. (default: best)

rewrite = no

overwrite existed directory [yes, no]. (default: no)

rerun = 3

re-run unfinished jobs untill finished or reached ${rerun} loops, 0=no. (default: 3)

parallel_jobs = 6

number of tasks used to run in parallel. (default: 6)

multithread_jobs = 5

number of threads used to in a task. (default: 5)

submit = auto

command to submit a job, auto = automatically set by Paralleltask.

kill = auto

command to kill a job, auto = automatically set by Paralleltask.

check_alive = auto

command to check a job status, auto = automatically set by Paralleltask.

job_id_regex = auto

the job-id-regex to parse the job id from the out of submit, auto = automatically set by Paralleltask.

use_drmaa = no

use drmaa to submit and control jobs.

genome = genome.fa

genome file need to be polished. (required)

genome_size = auto

genome size, auto = calculate genome size using the input ${genome} file. (default: auto)

workdir = 01_rundir

work directory. (default: ./)

polish_options = -p {multithread_jobs}
-p, number of processes used for polishing.
-u, output uppercase sequences. (default: False)
-debug, output details of polished bases to stderr, only useful in short read polishing. (default: False)

Options for short reads

sgs_fofn = ./sgs.fofn

input short read files list, one file one line, paired-end files should be interleaved.

sgs_options = -max_depth 100 -bwa
-N, don't discard a read/pair if the read contains N base.
-use_duplicate_reads, use duplicate pair-end reads in the analysis. (default: False)
-unpaired, unpaired input files. (default: False)
-max_depth, use up to ${max_depth} fold reads data to polish. (default: 100)
-bwa, use bwa to do mapping. (default: -bwa)
-minimap2, use minimap2 to do mapping, which is much faster than bwa.

Options for long reads

lgs_fofn = ./lgs.fofn

input long read files list, one file one line.

lgs_options = -min_read_len 1k -max_depth 100
-min_read_len, filter reads with length shorter than ${min_read_len}. (default: 1k)
-max_read_len, filter reads with length longer than $ {max_read_len}, ultra-long reads usually contain lots of errors, and the mapping step requires significantly more memory and time, 0=disable (default: 0)
-max_depth, use up to ${max_depth} fold reads data to polish, 0=disable. (default: 100)
lgs_minimap2_options = -x map-pb -t {multithread_jobs}

minimap2 options, used to set PacBio/Nanopore reads mapping. (required)

Options for hifi reads

hifi_fofn = ./hifi.fofn

input hifi read files list, one file one line.

hifi_options = -min_read_len 1k -max_depth 100
-min_read_len, filter reads with length shorter than ${min_read_len}. (default: 1k)
-max_read_len, filter reads with length longer than $ {max_read_len}, ultra-long reads usually contain lots of errors, and the mapping step requires significantly more memory and time, 0=disable (default: 0)
-max_depth, use up to ${max_depth} fold reads data to polish, 0=disable. (default: 100)
hifi_minimap2_options = -x map-pb -t {multithread_jobs}

minimap2 options, used to set hifi reads mapping. (required)