OpenCL Tutorial - Shared Memory Kernel Optimization

In this episode we'll go over an example of real-world code that has been parallelized by porting to the GPU. The use of shared memory to improve performance is covered, as well as a discussion of synchronization points for coordinated work within a work-group. Source code is provided.

In iTunes, you can subscribe to the podcasts by going to:

Advanced -> Subscribe to podcast
URL: http://feeds.feedburner.com/opencl

Episode 6 - Shared Memory Kernel Optimization (Desktop/iPhone/iPod touch)
Episode 6 - Shared Memory Kernel Optimization (PDF)
Source code for Episode 6

Comments


Possible subjects for the next podcast

Congratulations again on your tutorials, which are the best to be found on either CUDA or OpenCL. I think you are a great teacher, and all my colleagues, whether or not they are familiar with the concepts explained, think the same.

Regarding future subjects, I see several points of interest:
a) Reduction, even though, as you said, it is something we try to avoid on GPUs, since even when feasible it takes a lot of work to build something efficient
b) Usage of asynchronous calls to the GPU to let the CPU do useful work while the GPU is busy
c) Mixed-precision work, if you have that type of experience, even though Fermi will give us better double-precision performance
d) Analysis of NVIDIA's upcoming Fermi architecture, and how you see its cache structure helping in cases where shared memory was not feasible

Thanks again & waiting for your next podcast

Pierre

code question

When I comment out the lines of main.c that correspond to the scalar computations (lines 430-448 inclusive), in other words just executing the GPU code, the accumulated value that's reported is 0. Any idea what's going on?
Thanks
-Paul

re:code question

Come to think of it, when I run the code as is (downloaded from this site), I get some strange output for the GPU sections:

Connecting to NVIDIA GeForce GT 120...
Loading program 'mdh_orig.cl'

Build Log:
\200)p\377
Recommended Size: 140734799803584
Allocation: 0.007973929 Enqueue: 1.984e-06 Read: 0.000562317
Accumulated value: 0
GPU Loop - Unoptimized: 0.069678957
-----------------------------------------------------------
-----------------------------------------------------------
Connecting to NVIDIA GeForce GT 120...
Loading program 'mdh_opt.cl'

Build Log:
\200)p\377
Recommended Size: 140734799803584
Allocation: 0.006881355 Enqueue: 1.281e-06 Read: 0.000554055
Accumulated value: 0
GPU Loop - Optimized: 0.082082314
-----------------------------------------------------------

re:code question

more info about the video card:

Connecting to NVIDIA GeForce GT 120...
Vendor: NVIDIA
Device Name: GeForce GT 120
Profile: FULL_PROFILE
Supported Extensions: cl_khr_byte_addressable_store cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_APPLE_gl_sharing cl_APPLE_SetMemObjectDestructor cl_APPLE_ContextLoggingFunctions

Local Mem Type (Local=1, Global=2): 1
Global Mem Size (MB): 512
Global Mem Cache Size (Bytes): 0
Max Mem Alloc Size (MB): 128
Clock Frequency (MHz): 1400

Vector type width for: char = 1
Vector type width for: short = 1
Vector type width for: int = 1
Vector type width for: long = 1
Vector type width for: float = 1
Vector type width for: double = 0

Max Work Group Size: 512
Max Compute Units: 32

re:code question [solved]

OK, my fault ... It was the gotcha that you describe in the video (at 29 min 32 sec in) ... I had not set the "Set the working directory to:" flag to "Project Directory"; it was still set to "Build Products directory". All is good now.

There's a built-in for that!

I've just been through this process with my own code. I had a loop that was very similar to yours.

The OpenCL language has a built-in call that does the global-to-local copy; from what I gather, you've implemented the same thing explicitly.


event_t async_work_group_copy(__local gentype *dst,
                              const __global gentype *src,
                              size_t num_elements,
                              event_t event)

Interestingly, there's also a prefetch() function, which can be used to preload caches with global data too.

It's all in section 9.10.8 of the OpenCL Spec.
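As a hedged illustration (the kernel name and arguments here are hypothetical, not taken from the episode's source), a work-group could stage one tile of global data into local memory with the built-in like this:

```c
// Sketch of an OpenCL C kernel: stage one tile of global data into local
// memory with async_work_group_copy, then consume it.
__kernel void sum_tiles(__global const float *in,
                        __global float *out,
                        __local float *tile,
                        int tile_size)
{
    // Every work-item in the group must reach this call with identical
    // arguments; the copy is performed cooperatively by the whole group.
    event_t evt = async_work_group_copy(tile,
                                        in + get_group_id(0) * tile_size,
                                        tile_size, 0);

    // Block until the copy into local memory has completed.
    wait_group_events(1, &evt);

    // One work-item sums the tile serially, just to keep the sketch short;
    // a real kernel would reduce in parallel across the work-group.
    if (get_local_id(0) == 0) {
        float acc = 0.0f;
        for (int i = 0; i < tile_size; i++)
            acc += tile[i];
        out[get_group_id(0)] = acc;
    }
}
```

The copy plus wait_group_events replaces the explicit per-work-item load loop and its barrier from the episode's optimized kernel.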

Re: There's a built-in for that!

Yes. You can use that as well, but for pedagogical reasons I didn't discuss it this time. I felt it was better to present the long way, so people understand why coalescing across work-group items is faster.

With prefetch on the GPU I haven't had much success getting any speedup from preloading data into the caches. On the CPU I have seen a difference (for obvious reasons). What I do like about both of these approaches (using the built-ins) is that they reduce the manual overhead of managing caching to local memory, and in the case of prefetching, may also future-proof the code.

Dave

OpenCL kernel now runs on CPU too & faster than MPI! ;-)

David,

Thanks for creating the tutorial.

I have modified the application to run on the CPU, and it runs faster than the MPI version (21.4 vs. 26.9 seconds) on my dual-core MacBook Pro system.
It would be interesting if someone would run it on a many-core system and compare.

Can you explain the hard-coded "64" that you used for local_work_size? Can that number be derived from the kernel or from OpenCL device-specific information, or is it some kind of hard-coded size carried over from CUDA?
Avoiding this kind of device dependency seems like something that would be good to cover in the tutorials.

I can provide updated source (added error checking/printing too) if anyone wants it.

Alan

Here is a console dump of it running:

-----------------------------------------------------------
Accumulated value: 3.221799e+28
CPU Loop - Parallel: TIME:26.927154533
-----------------------------------------------------------

-----------------------------------------------------------
Connecting to NVIDIA GeForce 8600M GT...
Vendor: NVIDIA
Device Name: GeForce 8600M GT
Profile: FULL_PROFILE
Supported Extensions: cl_khr_byte_addressable_store cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_APPLE_gl_sharing cl_APPLE_SetMemObjectDestructor cl_APPLE_ContextLoggingFunctions

Local Mem Type (Local=1, Global=2): 1
Local Mem Size (MB): 0
Global Mem Size (MB): 256
Global Mem Cache Size (Bytes): 0
Max Mem Alloc Size (MB): 128
Clock Frequency (MHz): 940

Vector type width for: char = 1
Vector type width for: short = 1
Vector type width for: int = 1
Vector type width for: long = 1
Vector type width for: float = 1
Vector type width for: double = 0

Max Work Group Size: 512
Max Compute Units: 32

Loading program 'mdh_opt.cl'

Build Log:

LocalWorkSize:64 GlobalWorkSize:135168
Recommended Size: 512
Allocation: 0.015352243 Enqueue: 1.787761788 Read: 0.027141915
Accumulated value: 3.221799e+28
GPU Loop - Optimized: ERR:0 TIME:1.940809348
-----------------------------------------------------------

-----------------------------------------------------------
Connecting to Intel Intel(R) Core(TM)2 Duo CPU T7700 @ 2.40GHz...
Vendor: Intel
Device Name: Intel(R) Core(TM)2 Duo CPU T7700 @ 2.40GHz
Profile: FULL_PROFILE
Supported Extensions: cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_APPLE_gl_sharing cl_APPLE_SetMemObjectDestructor cl_APPLE_ContextLoggingFunctions

Local Mem Type (Local=1, Global=2): 2
Local Mem Size (MB): 0
Global Mem Size (MB): 3072
Global Mem Cache Size (Bytes): 64
Max Mem Alloc Size (MB): 1024
Clock Frequency (MHz): 2400

Vector type width for: char = 16
Vector type width for: short = 8
Vector type width for: int = 4
Vector type width for: long = 2
Vector type width for: float = 4
Vector type width for: double = 2

Max Work Group Size: 1
Max Compute Units: 2

Loading program 'mdh_opt.cl'

Build Log:

LocalWorkSize:1 GlobalWorkSize:135168
Recommended Size: 1
Allocation: 0.007816448 Enqueue: 21.453254013 Read: 0.00059402
Accumulated value: 3.221799e+28
CPU Loop - Optimized: ERR:0 TIME:21.46357646
-----------------------------------------------------------

Debugger stopped.
Program exited with status value:0.

Using the secondary GPU on a MacBook Pro...

Thanks so much for this great series of podcasts!

Just following up on the very good-sounding advice of using a secondary GPU for heavy computation to avoid graphics interruption (freezing, etc.)... Unfortunately it appears not to be so easy on the MacBook Pro. :/

First off, the MBP must be in "Higher performance" mode in the energy saver panel to enable both GPUs. It uses the faster GPU (9600M GT) for the primary display, which is too bad, because I'd really rather use that one for heavy computation whilst the other one (9400M) takes care of my not-so-demanding graphics display... A cursory inspection didn't uncover an easy way to switch the GPU used for primary display.

The real problem, however, is that even when using the non-primary GPU (8600M) for OpenCL, the primary display (controlled by the 9600M GT) still locks up under heavy computation (for a few seconds in the example project). This makes it pointless to ever use the secondary GPU, which is a real shame!

I can't figure it out... Perhaps they use a shared resource or something, or is this behavior intrinsic to OpenCL (I can't imagine that's the case)? The thing is, they are on separate buses (PCIe vs. PCI) and have different memory quantities (512MB vs. 256MB), which makes me think that they're totally distinct. So why would running a kernel on the secondary GPU cause the primary GPU's display to freeze for the duration of the computation? Did I miss something in my quick modification?

Here's the modified exec_kernel (I just made it fetch two devices and choose the second device if it exists): link

Update: My second CPU core is pegged (100% utilization) by the kernel_task process for the duration of both tests; I'm inclined to believe that this is the reason for the temporary lock-ups when using the non-primary GPU. Given that kernel_task does many things, I'm not sure if it's indeed something to do with the GPU/OpenCL, or otherwise some kind of anomaly in the example program. I also experimented to little effect with the OpenGL driver monitor. Any ideas?

Update #2: The unresponsiveness seems to happen just after clEnqueueNDRangeKernel() when clFinish() is called. As I understand it, this is when the kernel is executed on the GPU. Initially, I thought perhaps the slow-down was due to I/O during buffer creation, but it's pretty clear that it has to do with actual execution.

I did notice that there were a couple hundred Mach threads; could this be the issue, and the reason kernel_task is at 100%? If so, and considering data parallelism seems to require scads of threads (although I thought this took place entirely in hardware on the GPU), how does one avoid it?

Data that are not multiples of powers of 2

Dave, I would like to congratulate you for your tutorials for they have been extremely helpful and clear! Keep up the excellent work!

I was wondering if you could address situations in which the data dimensions are not multiples of powers of 2. I have found a handful of codes for, say, matrix multiplication, and everything is based on multiples of a block size, for example 16. How can I deal with shared memory and avoid bank conflicts when the dimensions may not be given like that?

Thank you and look forward to seeing your next chapter!

Regards,

Bruno

I'm stumped

Running on either of my MacBook Pro's GPUs (integrated 9400M or discrete 9600M GT) results in the kernel_task process appropriating all CPU resources, causing a temporary lockup.

Unfortunately, Dave seems to be too busy to help me sort this out. I'd really appreciate it if someone would either watch Activity Monitor or profile using Instruments while running the episode 6 code (with the CPU runs disabled, leaving only GPU runs enabled), and let me know if kernel_task's CPU usage spikes or not.

If anyone has another simple example that does not cause kernel_task to ravage the CPU, please provide it. I have tried a few 3rd party (not Apple-provided) examples all of which exhibit this behavior. Apple's examples are pretty complex, rendering the results directly to OpenGL; it's hard to discern if the CPU usage is elevated, but at least it doesn't lock the system up for the duration.

I've been trying to figure this out for over a week now. These freezes are a major barrier for my projects; if this kind of overhead is considered normal, I will have to abandon OpenCL entirely.

Will

Freezing Update

I've concluded that either the OpenCL implementation forces OS X's kernel_task process to block while stuff is executing on a GPU, or there is something peculiar about the new MacBook Pro's GPUs. Because OpenGL executes without issue on my MacBook Pro, it appears that this is a serious flaw in Apple's OpenCL implementation.

Unfortunately, I don't have an iMac or Mac Pro, and nobody with one has been kind enough to test it. Therefore all I can do (and have done) is submit a bug report to Apple and move on. What a shame.

Good luck,
Will

Questions

Thanks for the podcast. It really helped me understand how to work with OpenCL. I have two questions though.

Firstly, how does the memory structure of Nvidia cards relate to that of ATI cards? Do banks work the same on the two manufacturers' cards, or do we have to optimize separately?

Secondly, how can we query the hardware to get the Warp/Wavefront size, if we want to deploy a program that will work well for different cards?

Thanks again for the excellent podcast!

Re: Data that are not multiples of powers of 2

Simple answer: pad your data with 0s.

-jho

I've concluded that either

I've concluded that either the OpenCL implementation forces OS X's kernel_task process to block while stuff is executing on a GPU, or there is something peculiar about the new MacBook Pro's GPUs.

The kernel is still working. The screen is just prevented from updating. These video cards don't support preemption, so while an OpenCL kernel is running, the screen (with a few exceptions) can't update until it is done, because there is no way to unload the OpenCL kernel, do a bit of video work, and then reload it. It is suggested that you keep your kernels small.

OpenGL has the same problem, by the way. You can write a shader that takes a very long time, and the screen won't update. However, since the shader is likely producing the next video frame, people expect this behavior.

By the way, if you do find

By the way, if you do find stuff that is not working as desired and you think your code is probably not broken, it is far better to file a bug with Apple: http://bugreporter.apple.com. Complaining here is only slightly likely to get the problem diagnosed by an expert, and will almost certainly not result in a fix in the OS, if one is needed.

You can also better get the attention of Apple engineers by posting to the Apple developer forum for OpenCL:

https://devforums.apple.com/community/mac/opencl

I was under the impression

The kernel is still working. The screen is just prevented from updating.

I was under the impression that executing an OpenCL kernel (or OpenGL shader) on one GPU would not affect another GPU (which is being used for display). Furthermore, through profiling, I've seen the OS kernel block (resulting in a system-wide lock-up) for the duration of the OpenCL GPU execution; this indicates to me that the assumption that "the screen is just prevented from updating" is incorrect.

By the way, if you do find stuff that is not working as desired and you think your code is probably not broken, it is far better to file a bug with Apple.

I posted here because the code here is what manifested the problem, and someone more familiar with it might be able to help figure out what's going on, or otherwise try it on their machine to see whether it is actually my hardware that is the problem. Sorry if it came off as complaining; I assure you, I was only trying to be constructive and help determine the nature of what appears to me to be a serious performance issue.

I submitted a pretty thorough bug report quite some time ago (November '09) with the information I'd gathered. Thanks for the OpenCL devforums link; I don't think I'd seen that. For now, though, I'm just waiting for the wrinkles to be smoothed out before using OpenCL again.

Opencv accelerated using openCL

I don't know if there is a new podcast coming up, but I want to ask whether it's possible to use OpenCL to implement OpenCV programs for object detection, tracking, and things like that. Are they compatible?

Opencv accelerated using openCL

You would have to re-write the computationally intensive parts of OpenCV to run using OpenCL. You would also need compatible hardware, OS, and drivers.

Cheers,
Max

@paulgribble - I was using

@paulgribble - I was using the same graphics card before and had encountered some issues too, but when I upgraded to an Nvidia GeForce GTX 470, everything seemed to work just fine.

Best Regards,
Stirling Mortloc