Is it just me or is OpenCL quite challenging?
Hi Folks,
I've been working my way through the matrix transpose and now the NVidia histogram examples.
My background is mainframe, then Java, and now more recently C, C++, and Objective-C.
I have to confess that OpenCL is the most challenging thing I have done thus far programming-wise. It looks immensely powerful, and I'm enjoying the challenge, but it's not proving easy.
To do things naively would be straightforward, but arranging everything in a fully optimised manner with coalesced reads and writes, and minimising bank conflicts, takes some careful thinking.
Just wondering if maybe I'm dumber than I thought I was, or if this really is quite tricky?
Any opinions out there? Ego boosting votes of confidence most welcome! Or you can tell me I'm as dumb as I thought I was... I can take it...
Cheers,
Max




OpenCL is challenging to me
This 47-year-old thinks it is challenging, especially all the GPU-specific details that need to be understood to really use it properly. Actually, I'm hoping that experts will write linear algebra libraries that use OpenCL so I don't have to work with it directly.
Then I'm dumb too.
"Or you can tell me I'm as dumb as I thought I was..."
Can't tell you that, but I can tell you that you are as dumb as I am.
I also think your mileage can vary according to implementation (version and GPU model) so you have to test carefully.
You might try out OpenMP, which I think is really straightforward and predictable, but you need a multi-core machine (more than 2 cores) to really see the benefit. Looking forward to those dual 6-core, virtual-threaded Mac Pros coming our way (24 threads?).
p.s. - I agree with the other comments ... if you use standard routines then wait for other people to do the work (IMHO).
thanks!
-lance
OpenCL Challenging
Guys, Guys, Guys...
Yes... It is challenging but once you understand a few key concepts you can really take advantage of the power of GPU computing (and even optimize your CPU computing) for problems that are a good fit for parallel programming.
Based on what Max mentioned above, and general feedback I've received, I think the part people get hung up on is memory access and layout. However, perhaps you can tell me a bit of what is causing problems (anyone), and maybe that could be the next podcast (I haven't done one for a while).
Let's get this sorted out :)
Dave
OpenCL Challenging
Hi Dave,
don't get me wrong. I'm fully into OpenCL and enjoying the challenge. I'm sort of glad that it is quite tricky. If it's too easy everyone will be doing it, and I want to re-skill over to something that's new, interesting, and hopefully soon to be much in demand.
The problem I have is in understanding the examples, in particular how the author comes up with an indexing strategy to avoid bank conflicts. I understand what a bank conflict is and why it is a bad thing.
What would be superbly useful would be some guidelines - at a code level - to say this is the code, this is the bank conflict, and here are the steps you need to take to optimise.
Take for example the NVidia oclHistogram 64 kernel. The author came up with the following code to adjust indexes to avoid the conflict (from Histogram64.cl):
//Encode local id in order to avoid bank conflicts at l_Hist[] accesses:
//each group of LOCAL_MEMORY_BANKS work-items accesses consecutive local memory banks
//and the same bytes [0..3] within the banks
//Because of this permutation workgroup size should be a multiple of 4 * LOCAL_MEMORY_BANKS
const uint lPos =
    ( (get_local_id(0) & ~(LOCAL_MEMORY_BANKS * 4 - 1)) << 0 ) |
    ( (get_local_id(0) &  (LOCAL_MEMORY_BANKS     - 1)) << 2 ) |
    ( (get_local_id(0) &  (LOCAL_MEMORY_BANKS * 3    )) >> 4 );
I'm curious to know how they arrived at this? Was it pre-mapping out things on a spreadsheet? Did they apply some rules of thumb? Some kind of visual tool (has to be a long way off I know) would be fantastically useful. In the meantime though, what can we do to detect a conflict and what are the code level strategies to resolve them?
Sorry I'm going on a bit here. I'll shut up now...
Cheers,
Max
OpenCL Challenging ps.
Hi Dave, I have to confess I haven't watched the 6th podcast yet, so maybe you've already covered some of this area.
Perhaps it's just a matter of going through more and more examples until things sink in.
I'm getting there myself, Brad. Maybe we just have to realise it's tricky stuff and it's a matter of perseverance...
Cheers,
Max
Challenges with OpenCL
Hi Dave,
Thanks for asking, and a special thanks for producing the videos about using OpenCL. I've played with OpenCL a bit, and the difficulties that I see, beginning with the most general, are:
1. If one is exploring new algorithms (as I am, for problems in fluid dynamics and quantum mechanics) instead of trying to optimize already well-defined algorithms, then the cost / benefit ratio of using OpenCL shifts. GCD by contrast is much simpler to use, though of course it doesn't use the GPU. But total development + run time may be less by avoiding OpenCL.
2. Again, for scientific exploration one often doesn't want to risk using single-precision calculations, but the performance of double-precision computations on most current GPUs (at least those available for Macs) is not significantly greater than that of the CPUs (please correct me if I'm wrong here). That looks like it will change with the new NVIDIA Fermi processors, but that leads me to ...
3. In a heterogeneous academic environment the performance of the GPUs deployed on machines in my group varies much more than the deployed CPUs. If I make the jump to OpenCL, will my students also benefit from it?
4. For the large-memory problems that I work on, memory bandwidth is an important limitation. To really take advantage of OpenCL I believe I would have to redesign my fluid code from the ground up. It relies heavily on Obj-C objects to implement a multigrid elliptic equation solver, and all that code would need to be transformed to procedural OpenCL code. Just using OpenCL to speed up an inner loop would probably lead to little improvement, as data structures would then have to be passed back and forth between the CPU and the GPU repeatedly. The move from OOP back to procedural programming leads me to the last obstacle that I've identified:
5. Finally -- and this is highly subjective -- OpenCL aesthetically seems awkward compared to GCD / C-blocks.
Well, I probably shouldn't go on like this, not having actually made the big move to OpenCL, as perhaps some or all of the above is inaccurate. But that's how it seems today.