Performance Tutorial (Part III): Using Accelerate

Author: David Gohara

Continuing where we left off last time, we'll take our 2D FFT and convert it to a 2D planar FFT and then make a full-fledged 3D FFT out of it. This part of the tutorial should cruise by as we've covered most of the ground already.

You can download the Xcode project here

Let's start by setting our build target to part3 and selecting part3.c in the Groups and Files pane.

In order to do a pseudo-/full 3D FFT, we need to make a few changes. We've added a third dimension (nx), we are now taking the log2 of that dimension as well.

During the weights setup we expand that to include getting the MAX value across nx, ny and nz (although this isn't really necessary for part3):

fft_weights = create_fftsetup(MAX(MAX(logX,logY),logZ),kFFTRadix2);

3D FFT as 2D Planes

Some applications (such as EM image reconstruction) work on stacks of 2D images. This is similar to what we will do here. We define a new DSPSplitComplex type called tmp1 and will use that as a pointer to the correct location in memory of our data start for each plane.

int offset = 0;
for(i=0;i<nz;i++){
	tmp1.realp = input.realp + offset;
	tmp1.imagp = input.imagp + offset;
	
	fft2d_zip(fft_weights,&tmp1,1,0,logX,logY,FFT_FORWARD);
	
	offset += nx*ny;
}

Here we simply increment the pointers to the real and imaginary parts of the array by adding an offset value to pointer address at each loop iteration. Pretty simple. The rest of the code doesn't change.

Full 3D FFT

Continuing, set the build target to part4 and select part4.c

The next addition to the code is to define and allocate space for a results array. The results array will be used to rearrange the data so that we can perform the full 3D FFT. Recall that Accelerate doesn't include a 3D FFT by default (otherwise we would simply use that). So in order to perform one using Accelerate we need to first performa a 2D FFT, followed by a 1D FFT along the third dimension.

DSPSplitComplex input, result;
DSPSplitComplex tmp1;

input.realp = (float*)malloc(dims*sizeof(float));
input.imagp = (float*)malloc(dims*sizeof(float));

result.realp = (float*)malloc(dims*sizeof(float));
result.imagp = (float*)malloc(dims*sizeof(float));

...and of course at the end we include...

free(result.realp);
free(result.imagp);

Our 2D FFT will be performed on the planes as we did in part3. Once that is done we need to rearrange the data such that data along the Z-axis is now available in memory at a unit stride of 1. We can use another Accelerate function called mtrans (Matrix Transpose) to do this for us. mtrans is designed to work on 2D arrays, but we can achieve the same effect on a "3D" array by adjusting a few parameters:

mtrans(input.realp,1,result.realp,1,(nx*ny),nz);
mtrans(input.imagp,1,result.imagp,1,(nx*ny),nz);

input data array 	= input.realp (or input.imagp)
stride1 		= 1 (we want to process every element)
output data array 	= result.realp (the rearranged data)
stride2			= 1 (we want the data to be placed with a stride of 1)
dim1			= nx*ny (row count)
dim2			= nz (column count)

Most of this is straightforward. For dim1 we set that to nx*ny, because our row count is actually the full subset of one 2D plane. For dim2 our column count is the number of elements along the current Z direction. After the transpose these will be swapped (as you'll see below). Note that since our data is in split complex format we need to perform two of these operations (one for the real part and one for the imaginary part). mtrans (as it's engineered) will not work on interleaved complex data.

Once we have our rearranged data we can then perform a series of 1D FFT's. We will need to perform nx*ny FFT's each with a length of nz.

for(i=0;i<nx*ny;i++){
	tmp1.realp = &result.realp[i*nz];
	tmp1.imagp = &result.imagp[i*nz];
	
	fft_zip(fft_weights,&tmp1,1,logZ,FFT_FORWARD);
}

  We use fft_zip (1D complex in-place):

weights 	= fft_weights
data 		= tmp1 (which is a pointer to the mtrans results adjusted by i*nz)
signal stride 	= 1 (we want to process every element)
log2n 		= logZ (the base 2 exponent)
direction	= FFT_FORWARD (we are doing the forward transform)

Once all of the 1D FFT's have been completed, we need rearrange the data so that it's ordered the original way it was supplied. So we make another call (two actually) to mtrans. This time we will take the result data and put it back into the input data (overwriting what is there). We also swap the dim1 and dim2 values to ensure the data is placed properly:

mtrans(result.realp,1,input.realp,1,nz,(nx*ny));
mtrans(result.imagp,1,input.imagp,1,nz,(nx*ny));

We end up by converting to interleaved complex (ztoc) and then normalizing the data (vsmul) as before.

Checking the Result

To check that everything is internally consistent we can perform the forward and reverse FFTs to see if we get back our input values. Set the target to part5 and select part5.c. In both FFT loops we simply do a forward FFT followed by a reverse FFT.

Build and Run the project and you'll see in the run log that we get back what we put in.

To better understand what mtrans does and why it's such a powerhouse over simple copy operations (especially on Altivec), you may want to check out the following page on Apple's website:

http://developer.apple.com/hardwaredrivers/ve/algorithms.html#transpose

Comments

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.

Speed of large FFT's

Thanks for the tutorial, it is exactly what i have been looking for.

I have been timing the FFT for various size radix 2 input arrays and comparing the result against matlab. It would appear that my little G4 iBook is a bit of a speed deamon when it uses vector processing. For a 2^7 x 2^7 array of ones i am getting a speed over 13 times faster in your program than in matlab run on the same laptop. It even bests my 3.2GHz Pentium 4 ubuntu system by almost a factor of 2.

I have one question though. For arrays larger than this the vector FFT starts to lose its edge eventually loosing out to matlab when the array size gets larger than 2^10;

vDSP 2^7 = 0.002420s, matlab 2^7 = 0.020070s
vDSP 2^8 = 0.042763, matlab 2^8 = 0.237347
vDSP 2^9 = 0.270737, matlab 2^9 = 0.443622
vDSP 2^10 = 1.073488, matlab 2^10 = 1.793051
vDSP 2^11 = 28.547014, matlab 2^11 = 9.270158

What possible tricks could matlab be pulling to scale its performance like this and what can i do to try and match it?

Thanks

Huw

Re: Speed of large FFT's

Hi Huw,

At the stage where you are losing performance it's probably due to the problem sizes falling out of cache. At least that's my first thought. Which FFT routines are you using in your test?

The people that wrote/write these routines sometimes frequent this site and can probably give you a definitive answer. I'll notify them of the question to see if they have any thoughts on this. In the meantime, can you respond back with the exact test you are doing in MATLAB and with Accelerate?

Thanks,

Dave

The performance of the

The performance of the Accelerate.framework FFT on PowerPC hits a brick wall once the data is no longer able to fit in the L1 cache (32 kB). The implementation would do a lot better if it had some internal tiling -- that is make a point of using the data for the next step while it is still in the cache -- but it apparently doesn't / didn't. At some not-too-large size, the implementation also gives up and just calls scalar code which is a good deal slower. I don't recall if that is 2**12, 2**14 or 2**20, but probably one of the first two. (For most computationally cheap algorithms, vectorization only really provides a significant speed win if the data is in the cache, anyway.)

I suggest filing a performance bug report against Accelerate.framework. However, it seems unlikely that there will be any performance fixes for PowerPC at this stage of the game. Apple is pretty busy just making sure the new processors (currently Penryn) look good. You may wish to benchmark on Leopard+Intel, instead.

Note that each time you add a factor of two to the size of the FFT, the worst case computational error limits go up by a factor of two. By around 2^20 you've lost nearly all the accuracy in the float (worst case), although stochastically we expect the errors to be better than that most of the time. Large FFTs should probably be done in double precision.

Oh, and make sure your

Oh, and make sure your timing loop doesn't include the cost of calling FFT_Setup. That function is not intended to be called every time you do a FFT. It is not particularly optimized and is usually a good deal slower than the FFT itself.

Public benchmarks:

http://www.fftw.org/speed/G4-1.06GHz-macosx/

The BenchFFT folks don't seem to have gotten around to publishing Intel numbers for MacOS X.

Thanks for all the info. I

Thanks for all the info. I should have guessed it would be a cache problem. The guys at matlab must have written a pretty good routine for their results to scale so well with matrix size. Although not so good that they can't be beaten for smaller FFT's :)

I have rewritten the code to run as double precision, but as Dave states PPC cant do vectorsied FFT for doubles so it sees a big slowdown (still slightly faster than matlab though). I'm not too worried about its performance on this old iBook i am looking at getting an Oct core Mac Pro for my research but need to justify the expense to my boss. I figured that by writing a program with threads and vectorisation and taking it to the local apple store to run on a Mac pro i should be able to get some results that persuade him pretty quickly.

Thanks again

Huw

Error in data transformation

Hello,

In using the above-mentioned 3D FFT, I am facing the following error. I have set the data size to be 8X8X8. I have set the data to be of real value 1 and complex value 0. I first performed a forward transform to get the 3D FFT coeffs. Once converted, I performed a inverse transform to get the original data. In doing this, I am not able to get the original data back. I wonder what I missed....

The code for the inverse is as follows....

// REVERSE CALCULATIONS (data has the 3D FFT Coeffs)

/* Split the complex (interleaved) data into two arrays */
ctoz(data,2,&input,1,dims);
/* Calculate 1D FFTs for the third dimension */
for(i=0;i

code continuation

for(i=0;i