NextPolish
NextPolish fixes base errors (SNVs/indels) in genomes assembled from noisy long reads. It can be used with short reads only, long reads only, or a combination of both. It contains two core modules and corrects erroneous bases in the reference genome in a stepwise fashion. To correct or assemble raw third-generation sequencing (TGS) long reads with approximately 10-15% sequencing errors, please use NextDenovo.
Installation
DOWNLOAD
Download the latest release from the GitHub releases page, or use the following command:
wget https://github.com/Nextomics/NextPolish/releases/latest/download/NextPolish.tgz
Note
If you get an error like
version 'GLIBC_2.14' not found
or
liblzma.so.0: cannot open shared object file
please download this version.
REQUIREMENT
- Python (both Python 2 and 3 are supported)
INSTALL
pip install paralleltask
tar -vxzf NextPolish.tgz && cd NextPolish && make
UNINSTALL
cd NextPolish && make clean
TEST
nextPolish test_data/run.cfg
Quick Start
Prepare sgs_fofn
ls reads1_R1.fq reads1_R2.fq reads2_R1.fq reads2_R2.fq > sgs.fofn
Create run.cfg
genome=input.genome.fa
echo -e "task = best\ngenome = $genome\nsgs_fofn = sgs.fofn" > run.cfg
Run
nextPolish run.cfg
Final polished genome
- Sequence:
/path_to_work_directory/genome.nextpolish.fasta
- Statistics:
/path_to_work_directory/genome.nextpolish.fasta.stat
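A missing or misspelled read file in sgs.fofn only surfaces once the pipeline is already running, so it can help to validate the list before the Run step. The following is a minimal sketch and not part of NextPolish; check_fofn is a hypothetical helper name introduced here for illustration.

```shell
# Hypothetical helper (not part of NextPolish): check that every path
# listed in a fofn (file of file names) exists and is non-empty.
check_fofn() {
    local fofn=$1 missing=0 path
    # "|| [ -n "$path" ]" also handles a last line without a trailing newline
    while IFS= read -r path || [ -n "$path" ]; do
        [ -z "$path" ] && continue          # skip blank lines
        if [ ! -s "$path" ]; then
            echo "missing or empty: $path" >&2
            missing=$((missing + 1))
        fi
    done < "$fofn"
    # Succeed only if no listed file was missing or empty
    [ "$missing" -eq 0 ]
}
```

It could be used as check_fofn sgs.fofn && nextPolish run.cfg, so that the run only starts when every listed file is present.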
Tip
You can also use your own alignment pipeline and use NextPolish only to polish the genome, which is faster than the default pipeline when running on a local system. The accuracy of the polished genome is the same as with the default pipeline. The following example uses bwa for alignment.
#Set input and parameters
round=2
threads=20
read1=reads_R1.fastq.gz
read2=reads_R2.fastq.gz
input=input.genome.fa
for ((i=1; i<=${round}; i++)); do
    #step 1:
    #index the genome file and do alignment
    bwa index ${input};
    bwa mem -t ${threads} ${input} ${read1} ${read2}|samtools view --threads 3 -F 0x4 -b -|samtools fixmate -m --threads 3 - -|samtools sort -m 2g --threads 5 -|samtools markdup --threads 5 -r - sgs.sort.bam
    #index bam and genome files
    samtools index -@ ${threads} sgs.sort.bam;
    samtools faidx ${input};
    #polish genome file
    python NextPolish/lib/nextpolish1.py -g ${input} -t 1 -p ${threads} -s sgs.sort.bam > genome.polishtemp.fa;
    input=genome.polishtemp.fa;
    #step 2:
    #index genome file and do alignment
    bwa index ${input};
    bwa mem -t ${threads} ${input} ${read1} ${read2}|samtools view --threads 3 -F 0x4 -b -|samtools fixmate -m --threads 3 - -|samtools sort -m 2g --threads 5 -|samtools markdup --threads 5 -r - sgs.sort.bam
    #index bam and genome files
    samtools index -@ ${threads} sgs.sort.bam;
    samtools faidx ${input};
    #polish genome file
    python NextPolish/lib/nextpolish1.py -g ${input} -t 2 -p ${threads} -s sgs.sort.bam > genome.nextpolish.fa;
    input=genome.nextpolish.fa;
done;
#Final polished genome file: genome.nextpolish.fa
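The loop above depends on bwa, samtools, and python being available, and a missing tool would otherwise only show up partway through a round. As a hedged sketch (require_tools is a hypothetical helper, not something NextPolish provides), a preflight check could look like this:

```shell
# Hypothetical preflight helper: fail early if a required tool is not on PATH.
require_tools() {
    local tool missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "required tool not found: $tool" >&2
            missing=1
        fi
    done
    # Succeed only if every requested tool was found
    [ "$missing" -eq 0 ]
}
```

Calling require_tools bwa samtools python || exit 1 before the loop would stop the script before any indexing or alignment starts.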
Note
It is recommended to polish the raw genome with long reads first (set task to start with “5” and set lgs_fofn, or use racon) before polishing with short reads. This avoids incorrect mapping of short reads in regions with a high error rate, especially for assemblies generated without a consensus step, such as those from miniasm.
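For the long-read-first route described in this note, the Quick Start config could be extended roughly as follows. This is a sketch under assumptions: the task value 551212 is only an illustrative choice of a task string that starts with “5”, and lgs.fofn is a hypothetical list of long-read files prepared the same way as sgs.fofn. Please check the parameter documentation before relying on either.

```shell
# Sketch only: write a run.cfg that adds long reads to the Quick Start config.
# ASSUMPTIONS: "551212" is an illustrative task string starting with "5";
# lgs.fofn is a hypothetical file listing the long-read files, one per line.
genome=input.genome.fa
cat > run.cfg <<EOF
task = 551212
genome = $genome
sgs_fofn = sgs.fofn
lgs_fofn = lgs.fofn
EOF
```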
Getting Help
HELP
Feel free to raise an issue on the issue page. Issues are also helpful to other users.
CONTACT
For additional help, please send an email to huj_at_grandomics_dot_com.
Copyright
NextPolish is freely available for academic use and other non-commercial use.
Cite
Limitations
NextPolish is designed for genomes assembled from long reads, so it assumes an input genome without gaps (N bases). If your input contains gaps, please split the assembly at its gaps and link the pieces back together after polishing. We usually scaffold a genome with BioNano or Hi-C data after the polishing step.
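Splitting an assembly at its gaps can be sketched with awk as follows. This is an illustration under assumptions, not a NextPolish feature: split_at_gaps and the _partN naming are hypothetical, any run of N or n counts as one gap, and each sequence is held in memory, so a dedicated tool may be preferable for large genomes.

```shell
# Hypothetical helper (not part of NextPolish): split each FASTA record at
# runs of N/n and print the pieces as separate records on stdout. Gap
# positions are written to a TSV so the pieces can be linked back later.
split_at_gaps() {
    local fasta=$1 gaps_tsv=$2
    : > "$gaps_tsv"                      # create/truncate the gap table
    awk -v gaps="$gaps_tsv" '
        function emit(   rest, part, pos) {
            if (name == "") return
            rest = seq; part = 0; pos = 0
            while (match(rest, /[Nn]+/)) {
                if (RSTART > 1) {        # sequence before this gap
                    part++
                    printf(">%s_part%d\n%s\n", name, part, substr(rest, 1, RSTART - 1))
                }
                # columns: record name, 1-based gap start in the original, gap length
                printf("%s\t%d\t%d\n", name, pos + RSTART, RLENGTH) > gaps
                pos += RSTART + RLENGTH - 1
                rest = substr(rest, RSTART + RLENGTH)
            }
            if (rest != "") {            # sequence after the last gap
                part++
                printf(">%s_part%d\n%s\n", name, part, rest)
            }
        }
        /^>/ { emit(); name = substr($1, 2); seq = ""; next }
        { seq = seq $0 }
        END { emit() }
    ' "$fasta"
}
```

For example, split_at_gaps input.genome.fa gaps.tsv > input.nogaps.fa produces a gap-free file for polishing. The coordinates in gaps.tsv refer to the original sequences, so after polishing they are mainly useful for restoring the order of the pieces and the original gap lengths.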
Star
You can track updates by clicking the Star button in the upper-right corner of the GitHub page.