NextPolish

NextPolish fixes base errors (SNVs and indels) in genomes assembled from noisy long reads. It can be used with short-read data only, long-read data only, or a combination of both. It contains two core modules and corrects erroneous bases in the reference genome in a stepwise fashion. To correct or assemble raw third-generation sequencing (TGS) long reads with approximately 10-15% sequencing errors, please use NextDenovo.

Installation

  • DOWNLOAD

    Download the latest release from the GitHub releases page, or use the following command:

    wget https://github.com/Nextomics/NextPolish/releases/latest/download/NextPolish.tgz
    

    Note

    If you get an error such as "version 'GLIBC_2.14' not found" or "liblzma.so.0: cannot open shared object file", please download this version.

  • REQUIREMENT

  • INSTALL

    pip install paralleltask
    tar -vxzf NextPolish.tgz && cd NextPolish && make
    
  • UNINSTALL

    cd NextPolish && make clean
    
  • TEST

    nextPolish test_data/run.cfg
    

Quick Start

  1. Prepare sgs_fofn

    ls reads1_R1.fq reads1_R2.fq reads2_R1.fq reads2_R2.fq > sgs.fofn
    
  2. Create run.cfg

    genome=input.genome.fa
    echo -e "task = best\ngenome = $genome\nsgs_fofn = sgs.fofn" > run.cfg
    
  3. Run

    nextPolish run.cfg
    
  4. Final polished genome

    • Sequence: /path_to_work_directory/genome.nextpolish.fasta
    • Statistics: /path_to_work_directory/genome.nextpolish.fasta.stat
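
Steps 1 and 2 above can be wrapped in a small helper that checks the inputs before writing the two files. This is only a minimal sketch: the function name write_quickstart_cfg is hypothetical and not part of NextPolish, and it writes exactly the three settings shown in step 2 (task, genome, sgs_fofn) and nothing else.

```shell
#!/usr/bin/env bash
# write_quickstart_cfg is a hypothetical helper, not part of NextPolish.
# Usage: write_quickstart_cfg <genome.fa> <reads_1.fq> [<reads_2.fq> ...]
# Writes sgs.fofn and run.cfg into the current directory.
write_quickstart_cfg() {
    if [ "$#" -lt 2 ]; then
        echo "usage: write_quickstart_cfg <genome> <reads>..." >&2
        return 1
    fi
    local genome=$1
    shift
    local f
    # Stop before writing anything if an input is missing or empty
    for f in "$genome" "$@"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty input: $f" >&2
            return 1
        fi
    done
    # One read file per line, as in step 1
    printf '%s\n' "$@" > sgs.fofn
    # Same three settings as in step 2
    printf 'task = best\ngenome = %s\nsgs_fofn = sgs.fofn\n' "$genome" > run.cfg
}
```

For example, `write_quickstart_cfg input.genome.fa reads1_R1.fq reads1_R2.fq` followed by `nextPolish run.cfg` would mirror steps 1 to 3.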

Tip

You can also use your own alignment pipeline and use NextPolish only for the polishing step, which is faster than the default pipeline when running on a local system. The accuracy of the polished genome is the same as with the default pipeline. The following example uses bwa for alignment.

#Set input and parameters
round=2
threads=20
read1=reads_R1.fastq.gz
read2=reads_R2.fastq.gz
input=input.genome.fa
for ((i=1; i<=${round};i++)); do
#step 1:
   #index the genome file and do alignment
   bwa index ${input};
   bwa mem -t ${threads} ${input} ${read1} ${read2}|samtools view --threads 3 -F 0x4 -b -|samtools fixmate -m --threads 3  - -|samtools sort -m 2g --threads 5 -|samtools markdup --threads 5 -r - sgs.sort.bam
   #index bam and genome files
   samtools index -@ ${threads} sgs.sort.bam;
   samtools faidx ${input};
   #polish genome file
   python NextPolish/lib/nextpolish1.py -g ${input} -t 1 -p ${threads} -s sgs.sort.bam > genome.polishtemp.fa;
   input=genome.polishtemp.fa;
#step2:
   #index genome file and do alignment
   bwa index ${input};
   bwa mem -t ${threads} ${input} ${read1} ${read2}|samtools view --threads 3 -F 0x4 -b -|samtools fixmate -m --threads 3  - -|samtools sort -m 2g --threads 5 -|samtools markdup --threads 5 -r - sgs.sort.bam
   #index bam and genome files
   samtools index -@ ${threads} sgs.sort.bam;
   samtools faidx ${input};
   #polish genome file
   python NextPolish/lib/nextpolish1.py -g ${input} -t 2 -p ${threads} -s sgs.sort.bam > genome.nextpolish.fa;
   input=genome.nextpolish.fa;
done;
#Final polished genome file: genome.nextpolish.fa
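
The script above calls bwa, samtools and python, so it can help to confirm that they are on the PATH before starting a long run. The following is a minimal sketch of such a check; check_tools is a hypothetical helper name and is not part of NextPolish.

```shell
#!/usr/bin/env bash
# check_tools is a hypothetical helper, not part of NextPolish.
# Reports every missing tool on stderr and returns non-zero if any is absent.
check_tools() {
    local tool missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "required tool not found on PATH: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Possible use at the top of the polishing script:
# check_tools bwa samtools python || exit 1
```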

Note

It is recommended to polish the raw genome with long reads first (set a task that starts with “5” together with lgs_fofn, or use racon) before polishing with short reads. This avoids incorrect mapping of short reads in regions with a high error rate, especially for assemblies generated without a consensus step, such as those from miniasm.

Getting Help

  • HELP

    Feel free to raise an issue on the issue page. Issues are also helpful to other users.

  • CONTACT

    For additional help, please send an email to huj_at_grandomics_dot_com.

Limitations

NextPolish is designed for genomes assembled from long reads, so it assumes an input genome without gaps (N bases). If your input contains gaps, please split the assembly at its gaps and link the pieces back together after polishing. We usually scaffold a genome with BioNano or Hi-C data after the polishing step.
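
The splitting step can be sketched with awk, assuming gaps are runs of N or n. The helper name split_at_gaps and the <seqid>_part<k> naming are hypothetical, not part of NextPolish. The sketch does not record gap positions or lengths, so linking the pieces back after polishing would need that information from elsewhere.

```shell
#!/usr/bin/env bash
# split_at_gaps is a hypothetical helper, not part of NextPolish.
# Splits every FASTA record at runs of N/n and writes the pieces to stdout,
# named <seqid>_part<k> (an assumed naming scheme).
split_at_gaps() {
    awk '
        # Emit the pieces of the record collected so far
        function flush_record(   n, i, k, parts) {
            if (name == "") return
            n = split(seq, parts, /[Nn]+/)
            k = 0
            for (i = 1; i <= n; i++) {
                if (parts[i] == "") continue   # leading or trailing gap
                k++
                print ">" name "_part" k
                print parts[i]
            }
        }
        # Header line: finish the previous record, keep the ID up to the first space
        /^>/ { flush_record(); name = substr($1, 2); seq = ""; next }
        # Sequence line: join wrapped lines into one string
        { seq = seq $0 }
        END { flush_record() }
    ' "$1"
}

# Possible use:
# split_at_gaps input.genome.fa > input.genome.nogap.fa
```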

Star

You can track updates by clicking the Star button in the upper-right corner of the GitHub page.