IDBA-UD Assembly

IDBA is the basic iterative de Bruijn graph assembler for second-generation sequencing reads. IDBA-UD, an extension of IDBA, is designed to utilize paired-end reads to assemble low-depth regions and to use progressive depth on contigs to reduce errors in high-depth regions. It is a general-purpose assembler and is especially good for single-cell and metagenomic sequencing data. See the IDBA home page for more info.

IDBA-UD requires paired-end reads stored in a single FASTA file in which the two reads of each pair follow one another directly. You can use fq2fa (part of the IDBA repository) to merge two FASTQ read files into a single file. The following command generates a FASTA-formatted file called reads12.fas by “shuffling” the reads from the FASTQ files read1.fq and read2.fq:

cd ~/workdir/assembly/

qsub -cwd -N fq2fa -l mtc=1 -b y \
/vol/cmg/bin/fq2fa --merge read1.fq read2.fq reads12.fas
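Once the fq2fa job has finished, a quick sanity check on the merged file can catch truncated output early. The following is a minimal sketch, not part of IDBA: the helper names count_fasta_records and check_interleaved are hypothetical, and an even record count is only a necessary condition for a correctly interleaved file, not proof of one.

```shell
# Count FASTA records in a file (every record starts with a '>' header line).
count_fasta_records() {
    awk '/^>/ { n++ } END { print n + 0 }' "$1"
}

# Succeed only if the file holds an even, non-zero number of records.
# Hypothetical helper: it does not verify that neighbours are true mates.
check_interleaved() {
    n=$(count_fasta_records "$1")
    [ "$n" -gt 0 ] && [ $((n % 2)) -eq 0 ]
}
```

For example, `check_interleaved reads12.fas && echo "record count is even"` prints the message only when the count is plausible for paired data.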

IDBA-UD can be run with the following command. As our compute instances have multiple cores, we use the option --num_threads 24 to tell IDBA-UD to use 24 parallel threads:

cd ~/workdir/assembly/

qsub -cwd -pe multislot 24 -N idba_ud -l mtc=1 -b y \
/vol/cmg/bin/idba_ud -r reads12.fas --num_threads 24 -o idba_ud_out

The contig sequences are written to the file contig.fa in the idba_ud_out directory. Again, let’s get some basic statistics on the contigs:

getN50.pl -s 500 -f idba_ud_out/contig.fa
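To make the N50 statistic concrete, here is a rough sketch of how it could be computed with standard tools. It is not the implementation of getN50.pl. The function name n50 is hypothetical, and the minimum-length argument assumes that -s 500 restricts the statistics to contigs of at least 500 bp, which this page does not state. N50 is taken here as the length of the contig at which the running sum of lengths, sorted from longest to shortest, first reaches half of the total.

```shell
# Print the N50 of all sequences in a FASTA file that are at least
# min_len bases long (min_len defaults to 0). Hypothetical helper.
n50() {
    fasta=$1
    min_len=${2:-0}
    # Step 1: emit one length per sequence (sequences may span several lines).
    awk -v min="$min_len" '
        /^>/ { if (len > 0 && len >= min) print len; len = 0; next }
             { len += length($0) }
        END  { if (len > 0 && len >= min) print len }
    ' "$fasta" |
    # Step 2: sort the lengths from longest to shortest.
    sort -rn |
    # Step 3: walk down the list until half of the total length is covered.
    awk '{ lens[NR] = $1; total += $1 }
         END {
             run = 0
             for (i = 1; i <= NR; i++) {
                 run += lens[i]
                 if (run * 2 >= total) { print lens[i]; exit }
             }
         }'
}
```

Under these assumptions, `n50 idba_ud_out/contig.fa 500` should be comparable to the value reported above, although the script may differ in details.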

Note

Most jobs above are started on the compute cluster using qsub.

qstat: check the status and JOBNUMBER of your jobs

qdel JOBNUMBER: delete job with job number JOBNUMBER

We usually submit jobs to the cluster with a job name, given by -N JOBNAME. This creates log files named:

JOBNAME.oJOBNUMBER: standard output messages of the tool

JOBNAME.eJOBNUMBER: standard error messages of the tool

You can look into these files by typing e.g. less JOBNAME.oJOBNUMBER (hit q to quit) or tail -f JOBNAME.oJOBNUMBER (hit ^C to quit).
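If a job has been submitted several times, the working directory collects several log files per job name. The sketch below shows one way to print the end of the most recent standard output and standard error logs for a given name. The function show_job_logs is a hypothetical helper, not a cluster command, and it relies only on the JOBNAME.oJOBNUMBER and JOBNAME.eJOBNUMBER naming described above.

```shell
# Print the last lines of the newest stdout (.o) and stderr (.e) log
# belonging to a job name. Hypothetical helper; run it in the directory
# the job was submitted from.
show_job_logs() {
    job=$1
    lines=${2:-20}
    for kind in o e; do
        newest=""
        for f in "$job.$kind"[0-9]*; do
            [ -e "$f" ] || continue          # glob matched nothing
            if [ -z "$newest" ] || [ "$f" -nt "$newest" ]; then
                newest=$f
            fi
        done
        if [ -n "$newest" ]; then
            echo "== $newest =="
            tail -n "$lines" "$newest"
        fi
    done
}
```

For example, `show_job_logs idba_ud 5` prints the last five lines of the latest idba_ud logs, if any exist.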