NextPolish Parameter Reference
NextPolish requires at least one assembly file (option: genome) and one read file list (option: sgs_fofn, lgs_fofn, or hifi_fofn) as input. It works with gzip-compressed FASTA and FASTQ formats and uses a config file to pass options.
Input
genome file
    genome=/path/to/need_to_be_polished_assembly_file
read file list (one file per line; paired-end files should be interleaved)
    ls reads1_R1.fq reads1_R2.fq reads2_R1.fq.gz reads2_R2.fq.gz ... > sgs.fofn
config file
    A config file is a text file that contains a set of key=value pairs that set runtime parameters for NextPolish. The following is a typical config file, which is also located in doc/run.cfg.

    [General]
    job_type = local
    job_prefix = nextPolish
    task = best
    rewrite = yes
    rerun = 3
    parallel_jobs = 6
    multithread_jobs = 5
    genome = ./raw.genome.fasta
    genome_size = auto
    workdir = ./01_rundir
    polish_options = -p {multithread_jobs}

    [sgs_option] #optional
    sgs_fofn = ./sgs.fofn
    sgs_options = -max_depth 100 -bwa

    [lgs_option] #optional
    lgs_fofn = ./lgs.fofn
    lgs_options = -min_read_len 1k -max_depth 100
    lgs_minimap2_options = -x map-ont

    [hifi_option] #optional
    hifi_fofn = ./hifi.fofn
    hifi_options = -min_read_len 1k -max_depth 100
    hifi_minimap2_options = -x asm20
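
Because the config file uses an INI-like layout, a quick sanity check before launching a run can be sketched with Python's standard `configparser`. This is only an illustration, assuming the file follows the layout of the typical config file; it is not how NextPolish itself reads the file, and the helper name `check_config` is hypothetical.

```python
import configparser

# Shortened sample in the same layout as doc/run.cfg; the trailing
# "#optional" remarks on section headers are left out to keep it simple.
SAMPLE_CFG = """\
[General]
job_type = local
task = best
parallel_jobs = 6
multithread_jobs = 5
genome = ./raw.genome.fasta
polish_options = -p {multithread_jobs}

[sgs_option]
sgs_fofn = ./sgs.fofn
sgs_options = -max_depth 100 -bwa
"""

READ_LIST_KEYS = ("sgs_fofn", "lgs_fofn", "hifi_fofn")


def check_config(text):
    """Return a list of problems: missing genome or no read file list."""
    # interpolation=None keeps values such as "{multithread_jobs}" untouched.
    cfg = configparser.ConfigParser(interpolation=None)
    cfg.read_string(text)
    problems = []
    if not cfg.has_option("General", "genome"):
        problems.append("missing required option: genome")
    has_reads = any(
        cfg.has_option(section, key)
        for section in cfg.sections()
        for key in READ_LIST_KEYS
    )
    if not has_reads:
        problems.append("need at least one of: " + ", ".join(READ_LIST_KEYS))
    return problems


print(check_config(SAMPLE_CFG))  # an empty list means both checks passed
```

The two checks mirror the stated input requirements: one assembly file and at least one read file list.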
Output
genome.nextpolish.fasta
    Polished genome in FASTA format. The FASTA header includes the primary seqID and the length. A lowercase letter indicates a low-quality base after polishing, which is usually caused by heterozygosity.
genome.nextpolish.fasta.stat
    Some basic statistics of the polished genome.
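
Since lowercase letters mark low-quality bases, one way to gauge how much of the polished assembly remains uncertain is to count them per sequence. The following is a minimal sketch, assuming an uncompressed FASTA file produced without the `-u` (uppercase) option; the function name `lowercase_fraction` and the toy record are hypothetical.

```python
import io


def lowercase_fraction(fasta_handle):
    """Map each sequence ID to the fraction of lowercase bases."""
    counts = {}  # seq_id -> [lowercase bases, total bases]
    seq_id = None
    for line in fasta_handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the ID.
            seq_id = line[1:].split()[0]
            counts[seq_id] = [0, 0]
        elif seq_id is not None:
            counts[seq_id][0] += sum(1 for base in line if base.islower())
            counts[seq_id][1] += len(line)
    return {
        sid: (low / total if total else 0.0)
        for sid, (low, total) in counts.items()
    }


# Toy record standing in for genome.nextpolish.fasta (made-up sequence).
demo = io.StringIO(">ctg1 10\nACGTacGTAC\n")
print(lowercase_fraction(demo))  # {'ctg1': 0.2}
```

For a real file, pass an open file handle instead of the `io.StringIO` object.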
Options
Global options
job_type = sge
    local, sge, pbs… (default: sge)
job_prefix = nextPolish
    prefix tag for jobs. (default: nextPolish)
task = best
    tasks to run [all, default, best, 1, 2, 5, 12, 1212…]. 1 and 2 are different algorithm modules for short reads, while 5 is the algorithm module for long reads; all=[5]1234, default=[5]12, best=[55]1212. (default: best)
rewrite = no
    overwrite an existing directory [yes, no]. (default: no)
rerun = 3
    re-run unfinished jobs until they finish or ${rerun} loops are reached, 0=no. (default: 3)
parallel_jobs = 6
    number of tasks to run in parallel. (default: 6)
multithread_jobs = 5
    number of threads used in a task. (default: 5)
submit = auto
    command to submit a job, auto = automatically set by Paralleltask.
kill = auto
    command to kill a job, auto = automatically set by Paralleltask.
check_alive = auto
    command to check a job's status, auto = automatically set by Paralleltask.
job_id_regex = auto
    the regular expression used to parse the job ID from the output of submit, auto = automatically set by Paralleltask.
use_drmaa = no
    use drmaa to submit and control jobs.
genome = genome.fa
    genome file that needs to be polished. (required)
genome_size = auto
    genome size, auto = calculate genome size using the input ${genome} file. (default: auto)
workdir = 01_rundir
    work directory. (default: ./)
polish_options = -p {multithread_jobs}
    -p, number of processes used for polishing.
    -u, output uppercase sequences. (default: False)
    -debug, output details of polished bases to stderr, only useful in short read polishing. (default: False)
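
The `{multithread_jobs}` placeholder in `polish_options` suggests that the value of `multithread_jobs` is substituted into the option string before the polishing command runs. The sketch below mimics that substitution with Python's `str.format` and estimates an upper bound on concurrent threads as `parallel_jobs` times `multithread_jobs`. Both the substitution mechanism and the thread estimate are assumptions made for illustration, not a description of NextPolish internals, and the function names are hypothetical.

```python
def expand_options(template, settings):
    """Fill {name} placeholders in an option string from a settings dict."""
    return template.format(**settings)


def max_threads(settings):
    """Rough upper bound: tasks running in parallel times threads per task."""
    return int(settings["parallel_jobs"]) * int(settings["multithread_jobs"])


# Values taken from the example config file.
general = {"parallel_jobs": "6", "multithread_jobs": "5"}

print(expand_options("-p {multithread_jobs}", general))  # -p 5
print(max_threads(general))  # 30
```

With the default values, a machine would need roughly 30 free threads to run every parallel task at full width, which can help when choosing these two settings for a local run.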
Options for short reads
sgs_fofn = ./sgs.fofn
    input short read files list, one file per line, paired-end files should be interleaved.
sgs_options = -max_depth 100 -bwa
    -N, don't discard a read/pair if the read contains an N base.
    -use_duplicate_reads, use duplicate paired-end reads in the analysis. (default: False)
    -unpaired, unpaired input files. (default: False)
    -max_depth, use up to ${max_depth} fold reads data to polish. (default: 100)
    -bwa, use bwa to do mapping. (default: -bwa)
    -minimap2, use minimap2 to do mapping, which is much faster than bwa.
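
The `sgs_fofn` list can also be generated programmatically. The sketch below reads "interleaved" as each R1 file being followed directly by its R2 mate, which matches the order in the `ls` example in the Input section; that reading, the function names, and the file names are assumptions for illustration.

```python
import tempfile
from pathlib import Path


def interleave_pairs(r1_files, r2_files):
    """Return paths ordered R1, R2, R1, R2, ... for matching pairs."""
    if len(r1_files) != len(r2_files):
        raise ValueError("each R1 file needs a matching R2 file")
    ordered = []
    for r1, r2 in zip(r1_files, r2_files):
        ordered.extend([r1, r2])
    return ordered


def write_fofn(read_files, fofn_path):
    """Write one read file path per line and return the output path."""
    fofn_path = Path(fofn_path)
    fofn_path.write_text("".join(f"{name}\n" for name in read_files))
    return fofn_path


# Hypothetical file names; a temporary directory keeps the demo side-effect free.
reads = interleave_pairs(
    ["reads1_R1.fq", "reads2_R1.fq.gz"],
    ["reads1_R2.fq", "reads2_R2.fq.gz"],
)
fofn = write_fofn(reads, Path(tempfile.mkdtemp()) / "sgs.fofn")
print(fofn.read_text(), end="")
```

Building the list from explicit pairs avoids depending on how the shell happens to sort file names.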
Options for long reads
lgs_fofn = ./lgs.fofn
    input long read files list, one file per line.
lgs_options = -min_read_len 1k -max_depth 100
    -min_read_len, filter out reads shorter than ${min_read_len}. (default: 1k)
    -max_read_len, filter out reads longer than ${max_read_len}; ultra-long reads usually contain lots of errors, and the mapping step requires significantly more memory and time, 0=disable. (default: 0)
    -max_depth, use up to ${max_depth} fold reads data to polish, 0=disable. (default: 100)
lgs_minimap2_options = -x map-pb -t {multithread_jobs}
    minimap2 options used for mapping PacBio/Nanopore reads. (required)
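
To make the effect of `-min_read_len`, `-max_read_len`, and `-max_depth` concrete, the sketch below applies them to a list of read lengths. It assumes that `1k` means 1,000 bases, that depth is the total of retained bases divided by the genome size, and that reads are taken in input order until the depth budget is used up. NextPolish may select reads differently; the function names are hypothetical.

```python
def parse_length(text):
    """Convert a length such as '1k' or '500' to bases (assumes k = 1,000)."""
    text = text.strip().lower()
    if text.endswith("k"):
        return int(float(text[:-1]) * 1000)
    return int(text)


def select_reads(lengths, genome_size, min_read_len="1k",
                 max_read_len="0", max_depth=100):
    """Return the read lengths kept after length and depth filtering."""
    min_len = parse_length(min_read_len)
    max_len = parse_length(max_read_len)  # 0 disables the upper limit
    budget = max_depth * genome_size      # 0 disables the depth cap
    kept, total = [], 0
    for length in lengths:
        # Drop reads outside the allowed length range.
        if length < min_len or (max_len and length > max_len):
            continue
        # Stop once adding this read would exceed the depth budget.
        if budget and total + length > budget:
            break
        kept.append(length)
        total += length
    return kept


# Toy data: a 10 kb "genome" capped at 2-fold depth (20,000 bases).
print(select_reads([500, 8000, 9000, 7000], 10_000, max_depth=2))  # [8000, 9000]
```

The same reasoning applies to `hifi_options`, which lists the same three filters.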
Options for hifi reads
hifi_fofn = ./hifi.fofn
    input hifi read files list, one file per line.
hifi_options = -min_read_len 1k -max_depth 100
    -min_read_len, filter out reads shorter than ${min_read_len}. (default: 1k)
    -max_read_len, filter out reads longer than ${max_read_len}; ultra-long reads usually contain lots of errors, and the mapping step requires significantly more memory and time, 0=disable. (default: 0)
    -max_depth, use up to ${max_depth} fold reads data to polish, 0=disable. (default: 100)
hifi_minimap2_options = -x map-pb -t {multithread_jobs}
    minimap2 options used for mapping hifi reads. (required)