NextPolish Parameter Reference

NextPolish requires at least one assembly file (option: genome) and one read file list (option: sgs_fofn, lgs_fofn or hifi_fofn) as input. It works with gzip’d FASTA and FASTQ formats and uses a config file to pass options.

Input

genome file

genome=/path/to/need_to_be_polished_assembly_file

read file list (one file per line; paired-end files should be interleaved, i.e. each R1 file followed by its R2 file, as in the example)

ls reads1_R1.fq reads1_R2.fq reads2_R1.fq.gz reads2_R2.fq.gz ... > sgs.fofn
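The same list can be built programmatically when the file names do not sort into R1/R2 order on their own. The following is a minimal sketch, not part of NextPolish: the helper names interleave_pairs and write_fofn are hypothetical, and the file names are the ones from the ls example above.

```python
from pathlib import Path


def interleave_pairs(r1_files, r2_files):
    """Return R1/R2 paths in alternating order (R1, R2, R1, R2, ...)."""
    if len(r1_files) != len(r2_files):
        raise ValueError("each R1 file needs a matching R2 file")
    ordered = []
    for r1, r2 in zip(r1_files, r2_files):
        ordered.extend([r1, r2])
    return ordered


def write_fofn(read_files, fofn_path):
    """Write one read-file path per line, preserving the given order."""
    Path(fofn_path).write_text("\n".join(read_files) + "\n")


# File names from the ls example; they are placeholders, not real data.
entries = interleave_pairs(
    ["reads1_R1.fq", "reads2_R1.fq.gz"],
    ["reads1_R2.fq", "reads2_R2.fq.gz"],
)
print("\n".join(entries))
```

Calling write_fofn(entries, "sgs.fofn") would then produce a file equivalent to the one created by the ls command.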

config file

A config file is a text file of key=value pairs that set runtime parameters for NextPolish. The following is a typical config file; a copy is located at doc/run.cfg.

[General]
job_type = local
job_prefix = nextPolish
task = best
rewrite = yes
rerun = 3
parallel_jobs = 6
multithread_jobs = 5
genome = ./raw.genome.fasta
genome_size = auto
workdir = ./01_rundir
polish_options = -p {multithread_jobs}

[sgs_option] #optional
sgs_fofn = ./sgs.fofn
sgs_options = -max_depth 100 -bwa

[lgs_option] #optional
lgs_fofn = ./lgs.fofn
lgs_options = -min_read_len 1k -max_depth 100
lgs_minimap2_options = -x map-ont

[hifi_option] #optional
hifi_fofn = ./hifi.fofn
hifi_options = -min_read_len 1k -max_depth 100
hifi_minimap2_options = -x asm20
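
To inspect such a file before a run, it can be read with Python's standard configparser module. This is only a sketch for checking values, not how NextPolish itself loads the file. The way the {multithread_jobs} placeholder is filled in below (from the key of the same name in [General]) is an assumption, and the config text is a shortened copy of the example above.

```python
import configparser

# Shortened copy of the example config above.
CONFIG_TEXT = """\
[General]
job_type = local
parallel_jobs = 6
multithread_jobs = 5
polish_options = -p {multithread_jobs}

[sgs_option] #optional
sgs_fofn = ./sgs.fofn
sgs_options = -max_depth 100 -bwa
"""


def load_config(text):
    """Parse config text; '#optional' after a section header is treated as a comment."""
    parser = configparser.ConfigParser(
        interpolation=None,              # keep "{multithread_jobs}" literal
        inline_comment_prefixes=("#",),  # strip trailing "#optional"
    )
    parser.read_string(text)
    return parser


cfg = load_config(CONFIG_TEXT)
general = cfg["General"]

# Assumption: the placeholder refers to the multithread_jobs key in [General].
polish_options = general["polish_options"].format(
    multithread_jobs=general["multithread_jobs"]
)

# At least one read file list is required as input.
read_lists = [("sgs_option", "sgs_fofn"), ("lgs_option", "lgs_fofn"), ("hifi_option", "hifi_fofn")]
has_reads = any(cfg.has_option(section, key) for section, key in read_lists)

print(polish_options, has_reads)
```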

Output

genome.nextpolish.fasta

Polished genome in FASTA format; the FASTA header includes the primary seqID and length. A lowercase letter indicates a low-quality base after polishing, which is usually caused by heterozygosity.

genome.nextpolish.fasta.stat

Basic statistics of the polished genome.
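
Because low-quality bases are written in lowercase, their share can be estimated directly from the polished FASTA. The function below is an illustrative sketch with a hypothetical name (lowercase_fraction); it is not a NextPolish utility and takes the FASTA content as an iterable of lines.

```python
def lowercase_fraction(fasta_lines):
    """Return the fraction of sequence characters that are lowercase."""
    low = 0
    total = 0
    for line in fasta_lines:
        if line.startswith(">"):
            continue  # header line, not sequence
        seq = line.strip()
        total += len(seq)
        low += sum(1 for base in seq if base.islower())
    return low / total if total else 0.0


# Toy records for illustration only; real input would be the polished FASTA.
example = [">seq1", "ACgT", "aCGT"]
print(lowercase_fraction(example))
```

For a real file, pass an open file handle, for example lowercase_fraction(open("genome.nextpolish.fasta")).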

Options

Global options

job_type = sge

local, sge, pbs… (default: sge)

job_prefix = nextPolish

prefix tag for jobs. (default: nextPolish)

task = best

tasks to run [all, default, best, 1, 2, 5, 12, 1212…]. 1 and 2 are different algorithm modules for short reads, while 5 is the algorithm module for long reads; all=[5]1234, default=[5]12, best=[55]1212. (default: best)
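
The aliases above can be read as shorthand for a string of module numbers. The sketch below expands them under one assumption that the text does not state explicitly: the bracketed part (the long-read module 5) is taken to run only when long reads are supplied. The function name expand_task is hypothetical.

```python
# Alias -> (bracketed long-read part, short-read part), as listed above.
TASK_ALIASES = {
    "all": ("5", "1234"),
    "default": ("5", "12"),
    "best": ("55", "1212"),
}


def expand_task(task, has_long_reads):
    """Expand a task alias into its module string; explicit strings pass through."""
    if task in TASK_ALIASES:
        long_part, short_part = TASK_ALIASES[task]
        # Assumption: the bracketed part applies only when long reads are given.
        return (long_part if has_long_reads else "") + short_part
    if not task.isdigit():
        raise ValueError(f"unknown task: {task!r}")
    return task


print(expand_task("best", has_long_reads=True))
print(expand_task("best", has_long_reads=False))
```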

rewrite = no

overwrite an existing directory [yes, no]. (default: no)

rerun = 3

re-run unfinished jobs until they finish or ${rerun} loops are reached, 0=no. (default: 3)

parallel_jobs = 6

number of tasks to run in parallel. (default: 6)

multithread_jobs = 5

number of threads used in a task. (default: 5)
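
When choosing these two values for a single machine, a rough upper bound on concurrent threads is their product. The sketch below assumes that every parallel task uses its full thread allowance at the same time, which may overestimate real usage; the helper names are hypothetical.

```python
import os


def peak_threads(parallel_jobs=6, multithread_jobs=5):
    """Rough upper bound: tasks in parallel times threads per task."""
    return parallel_jobs * multithread_jobs


def fits_on_host(parallel_jobs, multithread_jobs, cores=None):
    """Check the bound against the available cores (defaults to this host)."""
    cores = cores or os.cpu_count() or 1
    return peak_threads(parallel_jobs, multithread_jobs) <= cores


# With the defaults (6 tasks, 5 threads each) the bound is 30 threads.
print(peak_threads())
```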

submit = auto

command to submit a job, auto = automatically set by Paralleltask.

kill = auto

command to kill a job, auto = automatically set by Paralleltask.

check_alive = auto

command to check a job status, auto = automatically set by Paralleltask.

job_id_regex = auto

the job-id-regex to parse the job id from the output of submit, auto = automatically set by Paralleltask.

use_drmaa = no

use drmaa to submit and control jobs.

genome = genome.fa

genome file that needs to be polished. (required)

genome_size = auto

genome size, auto = calculate genome size using the input ${genome} file. (default: auto)
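
To see what value auto is likely to produce, the size can be computed by hand as the total sequence length in the FASTA file. This is an assumption about what "calculate genome size" means, not a description of NextPolish internals. The sketch below handles both plain and gzip’d input; the function names are hypothetical.

```python
import gzip


def open_maybe_gzip(path):
    """Open a text file, transparently handling gzip (detected by magic bytes)."""
    with open(path, "rb") as handle:
        magic = handle.read(2)
    if magic == b"\x1f\x8b":
        return gzip.open(path, "rt")
    return open(path, "rt")


def fasta_total_length(fasta_path):
    """Sum the lengths of all sequence lines, skipping '>' header lines."""
    total = 0
    with open_maybe_gzip(fasta_path) as handle:
        for line in handle:
            if not line.startswith(">"):
                total += len(line.strip())
    return total
```

For example, fasta_total_length("./raw.genome.fasta") would give the size of the assembly named in the example config.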

workdir = 01_rundir

work directory. (default: ./)

polish_options = -p {multithread_jobs}

-p, number of processes used for polishing.
-u, output uppercase sequences. (default: False)
-debug, output details of polished bases to stderr, only useful in short read polishing. (default: False)

Options for short reads

sgs_fofn = ./sgs.fofn

input short read file list, one file per line; paired-end files should be interleaved.

sgs_options = -max_depth 100 -bwa

-N, don't discard a read/pair if a read contains N bases.
-use_duplicate_reads, use duplicate paired-end reads in the analysis. (default: False)
-unpaired, unpaired input files. (default: False)
-max_depth, use up to ${max_depth} fold reads data to polish. (default: 100)
-bwa, use bwa to do mapping. (default: -bwa)
-minimap2, use minimap2 to do mapping, which is much faster than bwa.

Options for long reads

lgs_fofn = ./lgs.fofn

input long read file list, one file per line.

lgs_options = -min_read_len 1k -max_depth 100

-min_read_len, filter reads with length shorter than ${min_read_len}. (default: 1k)
-max_read_len, filter reads with length longer than ${max_read_len}; ultra-long reads usually contain lots of errors, and the mapping step requires significantly more memory and time, 0=disable. (default: 0)
-max_depth, use up to ${max_depth} fold reads data to polish, 0=disable. (default: 100)
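
The three filters above can be illustrated on a list of read lengths. The sketch below is not NextPolish's implementation and rests on stated assumptions: "1k" is taken to mean 1,000 bases, the depth budget is taken to be max_depth times the genome size, and reads are kept in input order until the budget is reached (how NextPolish actually chooses reads is not specified here). The function names are hypothetical.

```python
def parse_length(value):
    """Parse lengths such as '1k'. Assumption: the suffix k means 1,000."""
    text = str(value).strip().lower()
    if text.endswith("k"):
        return int(float(text[:-1]) * 1000)
    return int(text)


def select_reads(read_lengths, genome_size, min_read_len="1k",
                 max_read_len=0, max_depth=100):
    """Return the read lengths that survive the length and depth filters."""
    min_len = parse_length(min_read_len)
    max_len = parse_length(max_read_len)
    budget = max_depth * genome_size  # total bases allowed; unused when max_depth is 0
    kept = []
    bases = 0
    for length in read_lengths:
        if length < min_len:
            continue  # shorter than -min_read_len
        if max_len and length > max_len:
            continue  # longer than -max_read_len (0 disables this check)
        if max_depth and bases >= budget:
            break  # depth budget reached (0 disables this check)
        kept.append(length)
        bases += length
    return kept


# Toy numbers: a 40-base "genome" at 100-fold depth gives a 4,000-base budget.
print(select_reads([500, 2000, 3000, 50000], genome_size=40))
```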

lgs_minimap2_options = -x map-pb -t {multithread_jobs}

minimap2 options used for mapping PacBio/Nanopore reads. (required)

Options for hifi reads

hifi_fofn = ./hifi.fofn

input hifi read file list, one file per line.

hifi_options = -min_read_len 1k -max_depth 100

-min_read_len, filter reads with length shorter than ${min_read_len}. (default: 1k)
-max_read_len, filter reads with length longer than ${max_read_len}; ultra-long reads usually contain lots of errors, and the mapping step requires significantly more memory and time, 0=disable. (default: 0)
-max_depth, use up to ${max_depth} fold reads data to polish, 0=disable. (default: 100)

hifi_minimap2_options = -x map-pb -t {multithread_jobs}

minimap2 options used for mapping hifi reads. (required)