OpenCL demo that really shows an advantage?
I have been learning OpenCL recently. I have some Apple demos up and running, and have written some myself. However, Apple's demos don't seem to make any performance measurements. I have done some myself on simple examples (doing some arbitrary calculations on large amounts of data), but so far, the CPU easily outperforms the GPU.
To be precise, it is 5 million floats and a simple kernel that computes a square root and a multiplication. Times (msec):
GPU 256
CPU 144
Raw CPU 72
"Raw CPU" does the whole thing in plain for loops, i.e. single-threaded and therefore on a single core. That should be the obvious loser, but it isn't.
I am working on an Intel Mac Mini with a GF9400M, not a very good GPU but one that should be supported.
I don't know whether my code is flawed (it works essentially like the Hello World demo but on more data), OpenCL is buggy, or the 9400M is unsupported or simply too weak for OpenCL. As far as I can tell from what is written about OpenCL, the 9400M should work (but there are also some rumors to the contrary).
http://forums.macrumors.com/showthread.php?t=700971
http://techreport.com/discussions.x/16268
I suppose I am doing something stupid. Is there some similar demo somewhere that does this right? Or someone out there who has seen the 9400M perform OpenCL decently?
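For anyone who wants to reproduce this kind of comparison, here is a minimal sketch of how a single-threaded baseline might be timed: run the work a few times and keep the best wall-clock time. It is written in Python purely for illustration (the original test was presumably C with OpenCL). The names `raw_cpu` and `best_time_ms` and the factor 2.0 are assumptions, not taken from the poster's code.

```python
import math
import time
from array import array


def raw_cpu(data, factor=2.0):
    """Single-threaded baseline: a square root followed by a multiplication."""
    return array("f", (math.sqrt(x) * factor for x in data))


def best_time_ms(func, *args, repeats=3):
    """Run func several times; return (best wall-clock time in ms, last result)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = func(*args)
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        best = min(best, elapsed_ms)
    return best, result


if __name__ == "__main__":
    n = 100_000  # kept small here; the post above used 5 million floats
    data = array("f", (float(i) for i in range(n)))
    ms, _ = best_time_ms(raw_cpu, data)
    print(f"raw CPU, {n} floats: {ms:.2f} ms")
```

Taking the best of several runs reduces the influence of one-off costs such as cache warm-up. The same wrapper could be put around each stage of an OpenCL run (write, kernel, read back) so that the stages are compared like for like.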




RE: OpenCL demo that really shows an advantage?
hi-
Very good questions! I don't know the answers. Maybe OpenCL does not support double on the 9400M and you are using sqrt(), which has a double prototype for its argument and return value? Did you try it with floats only and no function call, i.e. just multiplying, dividing, adding, and subtracting 32-bit floats?
thanks!-
-lance
RE: OpenCL demo that really shows an advantage?
Thanks for the quick response!
I tried with just multiplications in the kernel, and plain floats, and the CPU still wins, despite doing sqrt()!
Whatever I do, the pattern stays the same: GPU slow, CPU faster. And that is upside down.
It is tempting to believe that I've switched the timing values, but the difference is between 1 and 3 seconds so there is no room for such mistakes.
I will try the same program on Linux or MSW ASAP, to see if it can be a bug on Apple's part.
RE: OpenCL demo that really shows an advantage?
The 9400M is an integrated graphics card. While OpenCL does work with this card, the benefits will be marginal at best. It doesn't surprise me at all that the CPU wins in this kind of configuration. Is this a MacBook or MacBook Pro? If it's an MBP with dual graphics cards, target the 9600 device (which is a full graphics unit with 32 cores).
Also, you may or may not see an advantage for simple calculations. Remember that you have to copy data over to the card (integrated or not), and this copy operation is SLOW, especially compared to what the CPU can do in that time.
Finally, double precision support can be probed for using the OpenCL APIs. At the moment double support is an extension (and not a requirement). On the GPU with Apple's implementation it is not currently supported.
Dave
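Dave's point about copy cost can be put as a rough cost model: offloading pays off only if the device time, copies included, is below the CPU-only time. The Python sketch below makes that explicit. The function names are made up for illustration, and the bandwidth and kernel time in the usage section are hypothetical placeholders, not measurements of the 9400M or any other device.

```python
def transfer_seconds(n_bytes, bytes_per_second):
    """Time to move n_bytes over a link with the given bandwidth."""
    return n_bytes / bytes_per_second


def offload_seconds(bytes_in, bytes_out, bytes_per_second, kernel_seconds):
    """Total device-side cost: copy in, run the kernel, copy back."""
    return (transfer_seconds(bytes_in, bytes_per_second)
            + kernel_seconds
            + transfer_seconds(bytes_out, bytes_per_second))


def offload_wins(cpu_seconds, bytes_in, bytes_out, bytes_per_second, kernel_seconds):
    """True if the device path, copies included, beats the CPU-only time."""
    total = offload_seconds(bytes_in, bytes_out, bytes_per_second, kernel_seconds)
    return total < cpu_seconds


if __name__ == "__main__":
    n_bytes = 5_000_000 * 4  # 5 million 32-bit floats, as in the original post
    link = 1e9               # HYPOTHETICAL bandwidth: 1 GB/s
    kernel = 0.010           # HYPOTHETICAL kernel time in seconds
    cpu = 0.072              # the 72 ms "Raw CPU" time from the original post
    print("device total (s):", offload_seconds(n_bytes, n_bytes, link, kernel))
    print("offload wins:", offload_wins(cpu, n_bytes, n_bytes, link, kernel))
```

With a kernel that does very little work per element, the two transfer terms dominate the total. That is the situation described above.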
RE: OpenCL demo that really shows an advantage?
Do you have a 9600M GT or above? If not, you can email me your project if you like, and I'll run it on a 9400M and a 9600M to see how it works ( lbland@vvi.com ).
thanks!-
-lance
RE: OpenCL demo that really shows an advantage?
I just mailed the source. I tried varying the settings a bit, like running with no debug info and higher optimizations, but the general picture stays the same.
And the question is still open: Is there a good, pretty simple demo out there that actually measures performance? Otherwise I'd better finish mine.
Sure there is.
Is there a good, pretty simple demo out there that actually measures performance?
Pretty much all of the available code samples from Apple (the NBody, matrix transpose, and FFT examples, for instance) give some measure of performance. In your case, I think your code is the problem. Besides the GPU result you get (I don't expect too much of the 9400M anyway), your CPU results with OpenCL are suspicious, and you definitely need to find out where the problem is. Your theory that the problem lies in Apple's implementation of OpenCL does not make much sense, given that I (we) do see clear advantages from going multi-core or GPU with OpenCL on Snow Leopard.
Why don't you first try the nice code sample from David's tutorial (http://www.macresearch.org/opencl, episode 6)? It basically lets you see the difference in performance for a compute task running on the CPU (scalar and parallel) and on the GPU (non-optimized and optimized).
I think it is worth a try...
RE: OpenCL demo that really shows an advantage?
hi-
OpenCL_Hello_World_Example from Apple gives similar results to Ragnemalm's, so something is up ... the efficiency of OpenCL seems to be very dependent upon the caller's implementation. I'd like to know if Ragnemalm's code (or OpenCL_Hello_World_Example) can be sped up (OpenCL-GPU vs. OpenCL-CPU, for example). I think OpenCL-CPU should not suffer from the data shuffling problem Dave mentioned, so it should be fast. Offhand, I couldn't find anything wrong with the code, but I didn't try to optimize it by any means or find out why it is slow.
thanks-
-lance
RE: OpenCL demo that really shows an advantage?
Hi Lance,
Can you email me the example you used? sdg0919 [at] gmail
Thanks,
Dave
RE: OpenCL demo that really shows an advantage?
So here are some numbers based on the user example but adjusted to 51 million floats on an 8-core Mac Pro w/ GTX285. I've taken timings of each step of the kernel execution path (I didn't bother with compiling and disposing). Kernel execution is the same for CL on GPU and CPU. This doesn't surprise me, since we're not really doing any meaningful computation.
The largest discrepancy is in the Allocate and Write step, by about 40 ms (which I'd expect), and in the readback (with the GPU being faster, which I wouldn't have expected). I've included the raw CPU and a version parallelized on 16 threads as well. The serial raw calculation is slower, which I'd expect (although the parallel efficiency with CL isn't as good as without, which I'd expect only in that, again, we aren't doing any meaningful work).
CL on CPU:
  Allocate and write    0.166932
  Args and Info         0.000007
  Kernel Only           0.256035
  Read back             0.172207
  CPU total            ~0.621510
CL on GPU:
  Allocate and write    0.204211
  Args and Info         0.000062
  Kernel Only           0.227780
  Read back             0.088733
  GPU total            ~0.625147
Without CL:
  RAW                   0.922288
  RAW 16 Threads        0.095965
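To see where the time goes, the posted step timings can be turned into shares of the total. The Python sketch below does only that arithmetic on the numbers above. The dictionary and function names are made up for illustration, and the units are taken as posted (presumably seconds). The listed steps add up to about 0.595 for the CPU run and 0.521 for the GPU run, somewhat below the posted totals, which presumably also cover time not itemized here.

```python
# Step timings copied from the post above (units as posted, presumably seconds).
CL_CPU = {
    "Allocate and write": 0.166932,
    "Args and Info": 0.000007,
    "Kernel Only": 0.256035,
    "Read back": 0.172207,
}
CL_GPU = {
    "Allocate and write": 0.204211,
    "Args and Info": 0.000062,
    "Kernel Only": 0.227780,
    "Read back": 0.088733,
}
RAW_SERIAL = 0.922288
RAW_16_THREADS = 0.095965


def step_sum(steps):
    """Sum of the itemized steps (not the posted total)."""
    return sum(steps.values())


def share_of_total(steps):
    """Fraction of the itemized time spent in each step."""
    total = step_sum(steps)
    return {name: t / total for name, t in steps.items()}


if __name__ == "__main__":
    for label, steps in (("CL on CPU", CL_CPU), ("CL on GPU", CL_GPU)):
        print(f"{label}: sum of listed steps = {step_sum(steps):.6f}")
        for name, share in share_of_total(steps).items():
            print(f"  {name:<20s}{share:6.1%}")
    print(f"raw serial / raw 16 threads = {RAW_SERIAL / RAW_16_THREADS:.2f}x")
```

In both OpenCL runs the kernel accounts for less than half of the itemized time. The rest is allocation, writing, and reading back.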
Need Summary ...
So, what is the summary/conclusion:
OpenCL isn't that applicable to easy vector problems like a vector add, but OpenMP screams at them (meaning it is wonderfully effective and efficient)? OpenCL is applicable to bigger problems with O(n^2) and O(n log(n)) computational complexity (FFT, matrix problems, etc.) for large n? OpenCL takes more work while OpenMP is easy, so at least do OpenMP and maybe bypass OpenCL? For anything with doubles, at least do OpenMP but not OpenCL? And maybe the conclusions are implementation dependent and may change?
thanks!-
-lance
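One way to frame lance's question about problem size and complexity is arithmetic intensity: how many operations are done per element moved to and from the device. The Python sketch below compares rough figures for three problem types. The operation counts are textbook approximations (n for a vector add, about n log2 n for an FFT, about 2n^3 for a naive n-by-n matrix multiply), not measurements, and the function names are made up for illustration.

```python
import math


def vector_add_intensity(n):
    """n additions over 3n elements moved (two inputs, one output)."""
    return n / (3 * n)


def fft_intensity(n):
    """About n*log2(n) operations over 2n elements moved (input and output)."""
    return (n * math.log2(n)) / (2 * n)


def matmul_intensity(n):
    """Naive n x n multiply: about 2*n^3 operations over 3*n^2 elements moved."""
    return (2 * n**3) / (3 * n**2)


if __name__ == "__main__":
    for n in (1024, 4096):
        print(f"n = {n}")
        print(f"  vector add : {vector_add_intensity(n):8.2f} ops per element")
        print(f"  FFT        : {fft_intensity(n):8.2f} ops per element")
        print(f"  matmul     : {matmul_intensity(n):8.2f} ops per element")
```

Under these assumptions the vector add stays at a constant one third of an operation per element however large n gets, while the other two grow with n. That is consistent with the copy cost dominating in the simple test and mattering less for heavier kernels.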
Ep 6 looks good
Hakime pointed at the code in episode 6, which is indeed the kind of demo I was looking for! (I wasn't that far in that tutorial yet.) And it turns things in another direction.
The Apple demos I've tried, however, give no performance information. My code must be flawed somehow, but since it is based on Apple's demos, the problem is slightly deeper. But I will continue with more work on the ep 6 code.
Performance numbers for 9400 for episode 6 source code
As Hakime mentioned, the episode 6 source code is useful and is more of a real-world example. The advantage shows even on a 9400M!
Output from running it on a unibody MacBook:
-----------------------------------------------------------
Accumulated value: 3.221799e+28
CPU Loop - Scalar: 48.27731331
-----------------------------------------------------------
-----------------------------------------------------------
Accumulated value: 3.221799e+28
CPU Loop - Parallel: 24.303175453
-----------------------------------------------------------
-----------------------------------------------------------
Connecting to NVIDIA GeForce 9400M...
Loading program 'mdh_orig.cl'
Build Log:
Recommended Size: 512
Allocation: 0.01059249 Enqueue: 3.06181287 Read: 0.024699636
Accumulated value: 3.221799e+28
GPU Loop - Unoptimized: 3.422553173
-----------------------------------------------------------
-----------------------------------------------------------
Connecting to NVIDIA GeForce 9400M...
Loading program 'mdh_opt.cl'
Build Log:
Recommended Size: 512
Allocation: 0.008599839 Enqueue: 2.550169089 Read: 0.002314684
Accumulated value: 3.221799e+28
GPU Loop - Optimized: 2.613505051
-----------------------------------------------------------
The optimized code isn't giving as much back as it did for Dr. Gohara on the GTX 285, but running it on the 9400M is still clearly better (9x in this case) than on the dual core CPU found in the MacBook.
OpenCL can still give substantial advantages on GPU hardware near the middle of the range.
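The speedups implied by the output above can be checked with a few divisions. The Python sketch below uses only the four loop times from the log. The dictionary keys and the `speedup` helper are made up for illustration. The parallel CPU loop against the optimized GPU loop comes out at roughly 9.3x, which matches the "9x" quoted above, and the scalar CPU loop against the unoptimized GPU loop at roughly 14x.

```python
# Loop times copied from the program output above (units as printed by the demo).
TIMES = {
    "CPU scalar": 48.27731331,
    "CPU parallel": 24.303175453,
    "GPU unoptimized": 3.422553173,
    "GPU optimized": 2.613505051,
}


def speedup(baseline, candidate):
    """How many times faster `candidate` ran than `baseline`."""
    return TIMES[baseline] / TIMES[candidate]


if __name__ == "__main__":
    pairs = (
        ("CPU scalar", "CPU parallel"),
        ("CPU scalar", "GPU unoptimized"),
        ("CPU parallel", "GPU optimized"),
        ("GPU unoptimized", "GPU optimized"),
    )
    for baseline, candidate in pairs:
        print(f"{candidate} vs {baseline}: {speedup(baseline, candidate):.2f}x")
```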
I got similar numbers
I got similar numbers. It gives so much more hope to see OpenCL perform at 9x the CPU rather than 1/4x. I will go on from there.