IDBA-UD Assembly
IDBA is a basic iterative de Bruijn graph assembler for second-generation sequencing reads. IDBA-UD, an extension of IDBA, is designed to use paired-end reads to assemble low-depth regions and to use progressive depth on contigs to reduce errors in high-depth regions. It is a general-purpose assembler and is especially good for single-cell and metagenomic sequencing data. See the IDBA home page for more info.
IDBA-UD requires paired-end reads stored in a single FASTA file, with the two reads of each pair as consecutive entries. You can use fq2fa (part of the IDBA repository) to merge two FASTQ read files into a single file. The following command generates a FASTA-formatted file called reads12.fas by “shuffling” (interleaving) the reads from the FASTQ files read1.fq and read2.fq:
cd ~/workdir/assembly/
qsub -cwd -N fq2fa -l mtc=1 -b y \
/vol/cmg/bin/fq2fa --merge read1.fq read2.fq reads12.fas
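As a quick sanity check on the merged file, the following sketch counts the FASTA records and verifies that the count is even, as it should be for an interleaved paired-end file. The function names count_fasta_records and check_interleaved are made up for this illustration and are not part of IDBA. The check only looks at the number of records, not at whether the read names actually match up.

```shell
# Count FASTA records (header lines starting with ">") in a file.
count_fasta_records() {
    awk '/^>/ { n++ } END { print n + 0 }' "$1"
}

# Succeed only if the file holds an even, non-zero number of records,
# which is what an interleaved paired-end file should contain.
check_interleaved() {
    n=$(count_fasta_records "$1")
    [ "$n" -gt 0 ] && [ $((n % 2)) -eq 0 ]
}

# Usage, once the fq2fa job has finished (skipped if the file is absent):
if [ -f reads12.fas ]; then
    if check_interleaved reads12.fas; then
        echo "reads12.fas: record count is even"
    else
        echo "reads12.fas: record count is odd or zero" >&2
    fi
fi
```

An odd count usually means the job was interrupted or the two input files did not contain the same number of reads.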
IDBA-UD can be run with the following command. As our compute instances have multiple cores, we use the option --num_threads 24 to tell IDBA-UD to use 24 parallel threads:
cd ~/workdir/assembly/
qsub -cwd -pe multislot 24 -N idba_ud -l mtc=1 -b y \
/vol/cmg/bin/idba_ud -r reads12.fas --num_threads 24 -o idba_ud_out
The contig sequences are located in the idba_ud_out directory in file contig.fa. Again, let’s get some basic statistics on the contigs:
getN50.pl -s 500 -f idba_ud_out/contig.fa
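To make clear what an N50 value means, here is a minimal sketch that computes it directly from a FASTA file: sort the contig lengths from longest to shortest and report the length at which the running sum reaches half of the total. The function name fasta_n50 is hypothetical, and the optional minimum-length argument is an assumption about what -s 500 does in getN50.pl; the script's exact definition may differ, so the numbers need not match exactly.

```shell
# Print the N50 of the sequences in a FASTA file.
# $1 = FASTA file, $2 = minimum sequence length to include (default 0).
fasta_n50() {
    awk -v min="${2:-0}" '
        # On each header, emit the length of the previous sequence.
        /^>/ { if (seen && len >= min) print len; len = 0; seen = 1; next }
        { len += length($0) }
        END { if (seen && len >= min) print len }
    ' "$1" | sort -rn | awk '
        { lens[NR] = $1; total += $1 }
        END {
            # Walk from longest to shortest until half the total is covered.
            for (i = 1; i <= NR; i++) {
                run += lens[i]
                if (run * 2 >= total) { print lens[i]; exit }
            }
        }'
}

# Usage, once the assembly has finished (skipped if the file is absent):
if [ -f idba_ud_out/contig.fa ]; then
    fasta_n50 idba_ud_out/contig.fa 500
fi
```

For example, contigs of lengths 8, 6, 4 and 2 sum to 20; the two longest already cover 14, which is at least half, so the N50 is 6.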
Note
Most jobs above are started on the compute cluster using qsub. Two related commands are:
qstat: check the status and JOBNUMBER of your jobs
qdel JOBNUMBER: delete the job with job number JOBNUMBER
We usually give the jobs submitted to the cluster a job name by using -N JOBNAME. This will create log files named:
JOBNAME.oJOBNUMBER: standard output messages of the tool
JOBNAME.eJOBNUMBER: standard error messages of the tool
You can look into these files by typing, e.g., less JOBNAME.oJOBNUMBER (hit q to quit) or tail -f JOBNAME.oJOBNUMBER (hit ^C to quit).
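If you have submitted the same job name several times, several log files will accumulate. The sketch below picks the most recently modified log for a job name so you do not have to look up the job number first. The helper name latest_log is made up for this illustration; it relies only on the JOBNAME.oJOBNUMBER and JOBNAME.eJOBNUMBER naming described above and assumes the logs are in the current directory.

```shell
# Print the most recently modified log file for a job name.
# $1 = job name, $2 = "o" for standard output or "e" for standard error.
# Prints nothing if no matching log file exists.
latest_log() {
    ls -t "$1.$2"[0-9]* 2>/dev/null | head -n 1 || true
}

# Usage: show the last lines of the newest idba_ud logs, if there are any.
for kind in o e; do
    log=$(latest_log idba_ud "$kind")
    if [ -n "$log" ]; then
        echo "== $log =="
        tail -n 20 "$log"
    fi
done
```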