Accelerating Matlab DSP Code on the GPU

Intrigued by GPUs, I've spent a few days testing out Jacket, an interface that lets you accelerate MATLAB (my favorite, if frustrating, language) on NVIDIA GPUs. It's definitely got some caveats. But it was really easy to accelerate my code. And the results were impressive. So I thought I'd put up a few simple DSP-related benchmarks I created and ran on my laptop (a MacBook Air with an NVIDIA GeForce 9400M graphics card). The m-files for the two functions I benchmarked (2D FFT and 2D interpolation) can be downloaded here.

If you're interested in lower-level GPU DSP programming, I suggest you check out Shehrzad Qureshi's excellent blog on the subject.

NOTE: The benchmarks I'm putting up (and all benchmarks, really) should be taken with large grains of salt. I threw my code together pretty quickly, and results will vary greatly depending on your system setup. My intent here is just to convey my impressions of the tool after spending a few hours with it.

I was pleasantly surprised by how quickly I was able to accelerate code. Basically, you just cast data you want processed on the GPU to one of Jacket's GPU data types. Then make normal MATLAB function calls. For example, the code below performs a 2D FFT on an image stored in a matrix 'I1'.

I1 = gsingle(I1);    % cast to a Jacket GPU type (moves the data to GPU memory)
I1_fft = fft2(I1);   % runs on the GPU, since I1 is now a GPU type

I cast here to 'gsingle' (as opposed to 'gdouble') because my GPU only supports single precision.

If you want to bring data back to the CPU (for instance, to use a function not supported by Jacket), you just cast the data back to a standard MATLAB datatype.
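As a sketch of that round trip (continuing the fft2 example above; the variable names are just illustrative):

```matlab
% Pull the FFT result back from GPU memory to CPU memory by
% casting to a standard MATLAB type ('single' here, matching
% the 'gsingle' precision of the GPU data).
I1_fft_cpu = single(I1_fft);

% Now any MATLAB function can operate on it, e.g.:
magnitude = abs(I1_fft_cpu);
```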

Getting optimal performance when combining functions requires a bit more work (see vectorizing your code, minimizing data transfers between the CPU and GPU, and a couple other general principles). But it doesn't seem too difficult from reading the documentation.
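As a rough sketch of the "minimize transfers" principle: chain the GPU operations together and cast only once at each end, so intermediate results never leave GPU memory. (Here 'H' is a hypothetical filter kernel the same size as the image; everything except the gsingle/single casts is standard MATLAB.)

```matlab
I1 = gsingle(I1);   % one CPU -> GPU transfer
H  = gsingle(H);    % filter kernel, also moved to the GPU

% Frequency-domain filtering: all three steps run on the GPU,
% with no intermediate transfers back to the CPU.
I1_filt = ifft2(fft2(I1) .* fft2(H));

I1_filt = single(I1_filt);   % one GPU -> CPU transfer at the end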

Figure 1 shows the speedup for the 2D FFT function (fft2.m). The x-axis specifies the size of the image passed to fft2 (128x128, 256x256, ..., 2048x2048).

Figure 1. GPU vs CPU Speedup

Interestingly, the speedup starts dropping off at 2048x2048 pixels, and at 4096x4096 pixels, I got an out-of-memory error. Not sure what's happening there. Probably a limitation of my GPU...

A more interesting function for me was interp2(), which performs 2D interpolation. As this is a very computationally demanding function, and one used in many image processing algorithms, acceleration could be useful. Figure 2 shows the speedup.
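A minimal sketch of what the GPU version looks like (2x upsampling of an image 'I1'; I cast the grid matrices to gsingle as well, since keeping all of interp2's inputs on the GPU is the safe route):

```matlab
% Sample grid and (2x denser) query grid for the interpolation.
[X, Y]   = meshgrid(1:size(I1,2), 1:size(I1,1));
[Xq, Yq] = meshgrid(1:0.5:size(I1,2), 1:0.5:size(I1,1));

% Move everything to the GPU, interpolate, then pull the result back.
I1_up = interp2(gsingle(X), gsingle(Y), gsingle(I1), ...
                gsingle(Xq), gsingle(Yq), 'linear');
I1_up = single(I1_up);
```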

Figure 2. GPU vs CPU Speedup

Again, I got an out-of-memory error on 4096x4096. But the speedup is impressive. I should also note that while the 128x128 speedup is actually a slight speedown (0.85X), this is probably due to not “warming up” the GPU properly (see this wiki on benchmarking Jacket for more on this subject). Running this benchmark (and FFT2) back to back produces a much better speedup for 128x128.
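The warm-up fix is simple: run the operation once, untimed, before the measured run, so any one-time driver/compilation overhead isn't charged to the benchmark. A sketch (note that Jacket may defer evaluation, so the wiki's advice on forcing evaluation before stopping the timer also applies):

```matlab
I1g = gsingle(I1);
fft2(I1g);          % throwaway warm-up run, result discarded

tic;
I1_fft = fft2(I1g); % the run we actually measure
t_gpu = toc;
```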

As for the tool's usefulness for DSP in general, it's probably not suitable for end implementations of most DSP applications, due to the real-time and cost-constrained nature of most DSP apps (a cellphone running MATLAB?). However, it will be useful in accelerating algorithm development for most DSP apps, which are usually designed in MATLAB. And it will be very useful for applications that process large datasets offline. Analysis of 3D seismic data for oil exploration, medical imaging, and analyzing radar/sonar/satellite data, are three examples that come to mind.

I should also note that if free is more your budget, there's also GPUmat, a freeware MATLAB accelerator also based on NVIDIA's CUDA platform. I haven't tested it out yet, but from the documentation it looks similar, with fewer supported functions, less documentation, and fewer application examples. It does FFTs, basic math and matrix operations, and some general MATLAB functions, but is sparse on more complex functions (for instance, no interpolation functions such as interp2.m).

Comment by March 28, 2010
Seth, thanks for the post. I have two questions: 1) Did you include the time to transfer the data to GPU memory in your speedup? 2) Has anyone done large FFTs with Jacket, such as greater than 4096x4096? I read in one blog that this is a problem and wanted to hear if anyone has tried.
Comment by March 28, 2010
1) Did you include the time to transfer the data to GPU memory in your speedup? No. From the Jacket UG: "Each casting operation to and from the GPU pushes or pulls data back and forth from CPU memory to GPU memory." Since I cast the data to 'gsingle' outside of the computation I'm measuring, the data transfer is not measured. It would be interesting to measure it; if you want to, you could just move the cast to 'gsingle' inside the 'tic/toc' calls in the benchmarking code. 2) Has anyone done large FFTs with Jacket, such as greater than 4096x4096? I've done a little searching, and I haven't found the answer either. I have read a few things about FFT size limitations in CUDA, but those postings were a bit old...
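A minimal sketch of that change, using the same gsingle/fft2 pattern as in the post (variable names are just illustrative):

```matlab
tic;
I1g    = gsingle(I1);      % CPU -> GPU transfer, now inside the timed region
I1_fft = fft2(I1g);
I1_fft = single(I1_fft);   % GPU -> CPU transfer (also forces evaluation)
t_total = toc;
```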
