OpenCL for cumulative addition?

Hi Folks,
here's one for you.

Looking at the sample apps and there's a couple for histograms. It looks like a good use of OpenCL to parallelise the problem.

I'm looking at taking the kernel a stage further to produce a cumulative histogram.

So basically, let's say you have 5 numbers, e.g. 2, 6, 1, 3, 7.

The cumulative addition of those gives 2, 2 + 6, 2 + 6 + 1, etc.

Resulting in: 2, 8, 9, 12, 19.

It seems to me that this operation is not worth coding with a kernel. To parallelise it you would split the problem domain up into multiple operations, each depending upon the previous result. Barriers would be required to wait until the previous result had finished, and then you would need more addition to include the other results.

My inclination would be to leave the histogram part as a kernel. It does it's business working through a large sample set in parallel, counting everything up.

Then, to do the accumulative part, back to the CPU with a for loop.

Any opinions? Am I on the money with this? Or is there another way to look at the problem so as to exploit OpenCL?

I'm still working through the sample apps provided by NVidia and Apple, so I'll be curious to see if they exclusively do everything on the GPU (perhaps even if it doesn't make the most sense?), or if they have GPU and CPU code working in concert on a problem.

Cheers,
Max

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.

OpenCL for cumulative addition?

Just thinking out loud here. Perhaps the way to do it is to restructure the histogram function so that each bin also counts the lesser bin.

So in this example let's say that the first thread counts bin 0 with 2 as a result. The second thread counts bin 1 with 6 items as a result.

I guess the way to do this would be for the second thread to also count bin 0's items.

There would be a lot of duplicate effort, but I wonder if it would be faster?

It's interesting that it seems with OpenCL you have to really look at the problem from the ground up. It you take things sequentially like you're used to with existing program code, you're not going to get the full benefit.

O(log n)

This may be a case of the parallel prefix problem. It can run in log time, less than linear time, if data parallelism is exploited in the accumulation algorithm.

The famous paper by Hillis and Steele:
www.cs.arizona.edu/classes/cs522/spring09/papers/dataparallel.pdf

I suspect OpenCL would work great for this problem as it is well suited for SIMD GPU devices.

Excellent thanks

The article looks very relevant. I'll have a close look. Thanks for that.

Cheers,
Max

More efficient

Looks like there is a more efficient algorithm...

http://developer.download.nvidia.com/compute/cuda/sdk/website/projects/scan/doc/scan.pdf