OpenCL Tutorial - Memory Access and Layout

In this episode we cover some questions regarding function calls from kernels and the use of clFinish. Also, we'll discuss basic GPU architecture, memory layout, shared memory. Thread blocks, warps and efficient data loading will also be discussed.

In iTunes, you can subscribe to the podcasts by going to:

Advanced -> Subscribe to podcast
URL: http://feeds.feedburner.com/opencl

Episode 4 - Memory Layout and Access (Desktop/iPhone/iPod touch)
Episode 4 - Memory Layout and Access (PDF)

Comments

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.

Excellent reviews

Hi,
Thank you for posting these, they are making sense ("Gelling") it all much better than anything I found as of yet,

In this episode I am a bit confused - how many FPUs are there? are there each for an SM or each for SP? or maybe some other arrangement?

Also, if you could explain a bit more about the different kind of memory hardware in the system (not only texture and local) and how to actually handle them in code; btw, you'll be happy to know that nvidia's beta opencl drivers also have an example of the matrix transpose and they add 1 to align the half warps as well; actually, if you can explain the whole stride thing again with a different example maybe..

Great Episode, thank you for

Great Episode, thank you for doing this!

At the end I was a bit surprised. I thought that a stride of 16 would be perfect (1st thread reads bank 0, 16th thread reads bank 15, 17th thread would read bank 0 again). What did I miss in this example? I probably is something obvious, but I simply can't explain that bit.

Thanks again for taking the time to do this series! It really is helpful.

Q&A

Thanks for the questions. I'm going to do a Q&A podcast, and I'll address both of the questions in that episode. I'm hoping to get that short podcast out this week, but things have been a bit hectic. Hopefully things will make more sense once I put out that episode.

Dave

What causes a command queue to go invalid

I have an OpenCL application where for some reason, the command queue is invalid after the call to clEnqueueNDRangeKernel completes.

All of the clEnqueueWriteBuffer calls complete with no errors, as does clEnqueueNDRangeKernel itself. It's when I call clEnqueueReadBuffer that I get an error -36 (Invalid Command Queue).

Is there something that can cause a command queue to become invalid other than calling clReleaseCommandQueue?

Re: What causes a command queue to go invalid

Not that I know of. The command queue should stay valid as long as your context is valid and you haven't explicitly released either. Is this something you can send via email for me to look at?

I'm trying to recall an example where I've seen that specific error code, but I can't at the moment.

Dave

Re: What causes a command queue to go invalid.

I had something like this, and it was due to a clEnqueueReadBuffer going wrong. It took me a while to track down, but the key was attaching a notify function to the context so I was informed about errors. Apple ship a few in it's OpenCL Library which you can use out of the box.

extern void clLogMessagesToSystemLogAPPLE(...)
extern void clLogMessagesToStdoutAPPLE(...)
extern void clLogMessagesToStderrAPPLE(...)

Add one of these to your clCreateContext call, and you'll see which call invalidates things.

For me, it all boiled down to a kernel argument, which was a pointer to an output buffer, not being explicitly defined as global. I assume it therefore defaulted to local (possibly, not sure), and when I tried to read the buffer associated I wasn't allowed. This only happened when I was running on the GPU though.

There's a thread about it at http://www.khronos.org/message_boards/viewtopic.php?f=28&t=2061 , but it continues into trying to track down a separate compiler bug. The important thing was to get those error messages because it's some other call that's causing the problem, and it doesn't come back as an error code.

Its easier than that, even.

Its easier than that, even. Just do this:

export CL_LOG_ERRORS="stdout"

and the pfn_notify will be turned on for you. You can also use stderr or console, if you prefer.

Not that I know of. The

Not that I know of. The command queue should stay valid as long as your context is valid and you haven't explicitly released either. Is this something you can send via email for me to look at?

On the GPU, your command queue can die if you do something illegal, such as wander off the end of an array or something goes wrong like a deferred buffer allocation fails at execution time, or deferred compilation fails. Generally speaking, it happens when something is being done just in time under the covers well after the chance to correctly return an error condition out of the API that caused it to happen, or when your code crashes.

On the CPU, this currently cant happen, because it doesn't defer compilations or allocations. You can still crash though.