OpenCL Tutorial - Building an OpenCL Project
In this episode we answer questions that were asked on the forums and by email about double-precision arithmetic, object-oriented programming, global and local work groups, and the types of scientific calculations that are amenable to GPU computing. We also go over in more detail how to query devices for specific information and features, and walk through an example of an OpenCL calculation in Xcode.
In iTunes, you can subscribe to the podcasts by going to:
Advanced -> Subscribe to podcast
URL: http://feeds.feedburner.com/opencl
Episode 3 - Building an OpenCL Project (Desktop/iPhone/iPod touch)
Episode 3 - Building an OpenCL Project (PDF)
Source code for Episode 3




Comments
assertion failed error on demo code line 92
Running…
Connecting to NVIDIA GeForce 9400M...
Assertion failed: (err == CL_SUCCESS), function runCL, file /Users/paulgribble/Desktop/Episode_3_source/main.c, line 92.
Program received signal: “SIGABRT”.
sharedlibrary apply-load-rules all
(gdb)
re: assertion failed error on demo code line 92
Hi Paul,
Looking at the error (I can't reproduce it on my system), the error code is CL_INVALID_VALUE, which for that function call means that an incorrect or NULL value was passed to clCreateProgramWithSource.
Can you modify this line so it looks like this and see if it prints the program string:
char *program_source = load_program_source(filename);
printf("%s\n",program_source);
Thanks,
Dave
OK the printf statement
OK the printf statement prints out (null) so I think you're right, the file is not being read in properly.
Just FYI, I downloaded the source code again on a different machine and hit the same problem, so it is not an idiosyncrasy of a particular machine.
solved : must click on
Solved: click on "Episode 3" under the "Executables" list item, choose Get Info (or Apple-I), then set the working directory to Project Directory (not Build Directory).
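The failure in this thread comes down to a relative path: the file name is resolved against the process's working directory, so when the program runs from the build directory the source file is not found, the loader returns NULL, and clCreateProgramWithSource rejects it with CL_INVALID_VALUE. A minimal sketch of a loader that reports the problem where it happens is below. It is not the tutorial's load_program_source; the name load_source_checked is hypothetical and the code uses only the C standard library.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not the tutorial's load_program_source):
 * reads a whole text file into a NUL-terminated heap buffer.
 * Returns NULL, after printing the reason, if the file cannot be opened. */
char *load_source_checked(const char *filename)
{
    FILE *fp = fopen(filename, "rb");
    if (fp == NULL) {
        /* A relative filename is resolved against the current working
         * directory, which is the usual culprit when this fails. */
        fprintf(stderr, "cannot open '%s': %s\n", filename, strerror(errno));
        return NULL;
    }

    /* Find the file size by seeking to the end. */
    if (fseek(fp, 0, SEEK_END) != 0) { fclose(fp); return NULL; }
    long size = ftell(fp);
    if (size < 0) { fclose(fp); return NULL; }
    rewind(fp);

    char *source = malloc((size_t)size + 1);
    if (source == NULL) { fclose(fp); return NULL; }

    size_t got = fread(source, 1, (size_t)size, fp);
    fclose(fp);
    if (got != (size_t)size) { free(source); return NULL; }

    source[size] = '\0';   /* terminate so it can be passed as a C string */
    return source;
}
```

Calling load_source_checked("example.cl") and checking the result for NULL before creating the program turns the assertion on line 92 into a readable "cannot open" message. The caller owns the returned buffer and should free() it.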
Thanks for episode 3
Dear Dave,
Thanks very much for episode 3. It answered a number of my questions, and the sparse matrix reference that you cited at the end looks quite informative. Now I need to think deeply about whether OpenCL makes sense for the algorithms I'm developing -- it comes down to the question of development effort and time vs. the speedup gained.
I too encountered the problem with the source code, but Paul's fix worked.
-- Brad
Thanks for the comments. I
Thanks for the comments. I too had the same situation and Paul's fix works here.
Integrated Graphics Solutions
Thanks for these great tutorials.
I'm wondering whether graphics cards without dedicated memory (such as the GeForce 9400M) have an advantage in this respect, since the data does not have to be transferred over PCI.
Modifying the __kernel function
I would like to know what kind of complexity can be put into the __kernel function. For example, I have tried with this:
answer[gid] = rand()%(b[gid]-a[gid]) + a[gid]
but it does not work. What is wrong?
More generally, is it possible to write answer[gid] = myfunction(gid) where my function would be in another file and would define a complex simulation dealing with agents (objects), and return for example the final number of agents?
Why so many clFinishes()?
In this code you seem to be obsessively checking that everything has finished before moving on. This will kill your performance as you're keeping the CPU and the GPU in lock step the whole way.
I know this is Hello World, but it's a bad habit to get into.
At the moment the code is using blocking enqueues, so they will all wait for the operation to finish before moving on, AND using clFinish() at various points to wait for the GPU again when there hasn't even been any more work submitted.
Surely in this code all the clFinish() calls can be removed, and all the memory writes made non-blocking. It's just the final read that needs to block so that the results are guaranteed to be in host memory before they are printed.
It's just the same principle as glFinish in OpenGL (avoid at all costs if you want performance), or am I missing something?
Adding this after the call
To Guillaume Chapron:
Adding this after the call to clBuildProgram(), but before the assert gives some more info:
(fullstops because the indentation gets lost)
if (err != CL_SUCCESS) {
....char log[512];
....err = clGetProgramBuildInfo(program[0], device, CL_PROGRAM_BUILD_LOG, 512, log, NULL);
....printf("** %s\n", log);
}
In this case:
** :8:23:{8:16-8:22}{8:25-8:40}: error: invalid operands to binary expression ('float' and 'float')
.........answer[gid] = rand() % (b[gid]-a[gid]) + a[gid];
.......................~~~~~~ ^ ~~~~~~~~~~~~~~~
This is because (to quote the spec): "The remainder (%) operates on built-in integer scalar and integer vector data types only"
Replacing % with a + reveals the other problem:
** kernel referenced an external function rand, that could not be found.
There is no random function in the OpenCL kernel language.
functions in the OpenCL kernel language
Thanks for the reply, but if there is no random function in the OpenCL kernel language, how can we do Monte Carlo simulations? Can we just write a kernel function that calls a compiled executable where the Monte Carlo would be coded?
re: Why so many clFinishes()?
The clFinishes are leftovers from a version that had timing code. To time the phases accurately, everything needs to block correctly on the CPU side or the timings are meaningless.
However, regarding performance, there is being pedantic and then there is being paranoid. The notion that forcing a clFinish on a copy operation to the GPU will kill performance is paranoid. If your code is bottlenecked on the write to the GPU anyway, then OpenCL on the GPU isn't for you. Synchronization between the CPU and GPU isn't necessarily a bad thing either. Remove the clFinishes and time the block. For any non-trivial application (meaning one where real work is being done in your enqueued kernels), you'll almost certainly see no difference. The bulk of the time should always go to arithmetic, not to setup or teardown.
Another consideration with why using clFinish on the command queue is important is there is no guarantee that you won't flood the command queue with too many operations. This is strictly implementation dependent. And in earlier revs of the OpenCL implementation on Snow Leopard it was necessary. Whether it is necessary now or not I haven't checked for problems where the command queue contains a large number of instructions. But again, issuing the clFinish periodically has in my extensive experience with OpenCL on SL not been a performance problem.
Dave
Re: Why so many clFinishes()?
Timing is certainly a reasonable reason to have them there. As you say, your measurements would otherwise be meaningless.
I also had no idea that the command queue might overflow, which would certainly come under "Am I missing anything". I hope that would be classified as a bug and that it's now fixed, but I'm coming to OpenCL as someone who's done GPGPU via OpenGL. You can't overflow the command buffers in OpenGL, so I'd be surprised if you can in a bug-free OpenCL.
All the time you're doing one task, I agree with you. You won't see much difference in performance, but why should the CPU be waiting while a copy or a kernel execution takes place? It can get on and do more interesting things.
To think of it another way: suppose I have a program that runs on a dual-core system. It spawns a single worker thread to run on the second CPU. Would my main thread just sit there and wait for the worker to complete? No, there'd be no performance advantage. The main thread would go and do something else, like preparing the next work block, or anything else in fact, as long as I had both processor cores doing work.
It's the same with OpenCL, but now the two cores are the CPU and the GPU. By using clFinish() / blocking calls when you don't need them you stop the CPU going and doing something more useful — if you have something for it to do. If you don't, no harm.
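The dual-core analogy can be sketched in plain C with POSIX threads. This is only an illustration under stated assumptions, not OpenCL code: the worker thread stands in for the GPU running an enqueued kernel, pthread_join plays the role of the one blocking call, and all names (block_t, sum_block, prepare_block, pipeline_sum) are hypothetical.

```c
#include <pthread.h>
#include <stddef.h>

#define BLOCK_MAX 1024

/* One unit of work handed to the worker thread. */
typedef struct {
    const int *data;
    size_t n;
    long sum;               /* result, valid only after pthread_join */
} block_t;

/* Worker: stands in for the device executing an enqueued kernel. */
void *sum_block(void *arg)
{
    block_t *blk = arg;
    blk->sum = 0;
    for (size_t i = 0; i < blk->n; i++)
        blk->sum += blk->data[i];
    return NULL;
}

/* Main-thread work that can overlap with the worker. */
void prepare_block(int *data, size_t n, int start)
{
    for (size_t i = 0; i < n; i++)
        data[i] = start + (int)i;
}

/* Sums `blocks` blocks of `n` ints (n <= BLOCK_MAX). While the worker sums
 * block k, the main thread fills the other buffer with block k+1.
 * Returns the grand total, or -1 on bad arguments or thread failure. */
long pipeline_sum(int blocks, size_t n)
{
    int buf[2][BLOCK_MAX];
    long total = 0;

    if (blocks <= 0 || n > BLOCK_MAX)
        return -1;

    prepare_block(buf[0], n, 0);    /* first block: nothing to overlap with yet */
    for (int k = 0; k < blocks; k++) {
        block_t blk = { buf[k % 2], n, 0 };
        pthread_t worker;
        if (pthread_create(&worker, NULL, sum_block, &blk) != 0)
            return -1;

        /* Useful work instead of waiting: fill the other buffer. */
        if (k + 1 < blocks)
            prepare_block(buf[(k + 1) % 2], n, (k + 1) * (int)n);

        /* Block only at the point where the result is actually needed. */
        pthread_join(worker, NULL);
        total += blk.sum;
    }
    return total;
}
```

Build with -pthread. Creating a thread per block keeps the sketch short; the point is only where the wait sits. Moving pthread_join directly after pthread_create would give the same answer with no overlap, which is the situation that a clFinish() after every enqueue creates.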
None of this is really relevant to the tutorial (which has been very good so far, and thanks for writing it), but I thought it was worth mentioning that if you start building code with lots of clFinish() calls in it you may be leaving performance on the floor if the problem you're solving is suitable for having both the CPU and GPU doing work.
I just didn't want people getting the idea that they were necessary after every enqueue call if they weren't, as they get in the way of system level optimisations later.
Re: Why so many clFinishes()?
Of course, I agree that in the case of an app that could be doing something else, blocking is silly. In many cases, though, I'd tend to think this kind of code will be running on a thread off the main application thread anyway. But your point is taken.
Dave
line 92 fix
Copy example.cl into the build directory and it should work.
Another consideration with
Another consideration with why using clFinish on the command queue is important is there is no guarantee that you won't flood the command queue with too many operations. This is strictly implementation dependent. And in earlier revs of the OpenCL implementation on Snow Leopard it was necessary. Whether it is necessary now or not I haven't checked for problems where the command queue contains a large number of instructions. But again, issuing the clFinish periodically has in my extensive experience with OpenCL on SL not been a performance problem.
Actually David, in normal operation, you should almost never have to use clFinish. The various blocking read/write/map enqueue calls always do an implicit flush, so the work will always get launched, and will force some synchronization between main thread and device. Cases where you might want a finish:
1) You are at risk of enqueuing hundreds of thousands of operations, causing a backlog that exhausts free memory and/or leaves 3 days of work enqueued that you have no way to cancel. Periodic finishes would prevent too much work from backing up and keep your latencies relatively interactive.
2) You want to make sure OpenCL is done and has released various temporary resources before you start something else resource intensive.
3) You are looking for reference counting leaks and want to make sure OpenCL is done churning.
If you find that OpenCL starts failing due to queue overflow with a relatively small number of commands (<10,000?) please file a bug with Apple. That isn't supposed to happen.
Thanks for the reply, but if
Thanks for the reply, but if there is no random function in the OpenCL kernel language, how can we do Monte Carlo simulations? Can we just write a kernel function that calls a compiled executable where the Monte Carlo would be coded?
Random number generators are a little tricky on massively parallel systems, due to contention over the random number generator state, or the cost of keeping many different copies of random state, and how to update random state cooperatively if some of the workitems don't make the function call. It is obviously not impossible, though. It just hasn't been high enough on the totem pole to get Khronos to do what research is required to make sure it can be implemented cheaply on existing hardware before requiring it in a specification. If you want this to happen, you can kick it along by implementing some efficient random generator on a variety of different OpenCL devices and attaching your work to a feature request at Khronos's bug site. Or, maybe just write a paper.
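One of the options mentioned above, keeping a separate copy of the generator state for each work-item, can be sketched in plain C. The generator here is Marsaglia's xorshift32, chosen only as an assumption for illustration (the comment does not name one), and the function names and the seeding scheme are hypothetical. The loop over gid stands in for a kernel launch; in OpenCL each work-item would load its own state from a buffer, draw its numbers, and store the state back.

```c
#include <stddef.h>
#include <stdint.h>

/* One step of Marsaglia's xorshift32. The state must never be zero. */
uint32_t xorshift32(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Give every work-item its own nonzero state (hypothetical seeding scheme). */
void seed_states(uint32_t *states, size_t n, uint32_t base)
{
    for (size_t gid = 0; gid < n; gid++) {
        uint32_t s = base + (uint32_t)gid + 1u;
        states[gid] = s ? s : 0x9E3779B9u;   /* avoid the all-zero state */
    }
}

/* Stand-in for a kernel: each work-item draws `draws` uniform values in
 * [0, 1) from its own state and writes their mean. Assumes draws > 0. */
void mean_of_draws(uint32_t *states, float *out, size_t n, int draws)
{
    for (size_t gid = 0; gid < n; gid++) {
        float sum = 0.0f;
        for (int k = 0; k < draws; k++) {
            /* keep the top 24 bits so the value is exact in a float */
            uint32_t bits = xorshift32(&states[gid]) >> 8;
            sum += (float)bits * (1.0f / 16777216.0f);
        }
        out[gid] = sum / (float)draws;
    }
}
```

No work-item touches another's state, so there is no contention, at the cost of one 32-bit word per work-item. Seeding neighbouring work-items with consecutive integers, as done here, gives correlated early outputs, so serious Monte Carlo work would want a more careful seeding scheme or a generator designed for parallel streams.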
In the meantime, it should be simple in most cases to fill an array with random numbers and pass that to the kernel.
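A minimal sketch of that approach, assuming integer data so that % is valid and b[gid] > a[gid] for every element: the host fills an array with rand() and the kernel body only reads from it. The names fill_random and pick_in_range are hypothetical, and pick_in_range is a plain C stand-in for the kernel, with one loop iteration per work-item.

```c
#include <stdlib.h>

/* Fill r[0..n-1] with pseudo-random non-negative ints on the host.
 * In a real program this array would then be copied to the device
 * alongside a and b. */
void fill_random(int *r, size_t n, unsigned int seed)
{
    srand(seed);
    for (size_t i = 0; i < n; i++)
        r[i] = rand();
}

/* Plain C stand-in for the kernel body, one loop iteration per work-item.
 * Assumes integer data with b[gid] > a[gid], so that % is defined and
 * the result lands in [a[gid], b[gid]). */
void pick_in_range(const int *a, const int *b, const int *r,
                   int *answer, size_t n)
{
    for (size_t gid = 0; gid < n; gid++)
        answer[gid] = r[gid] % (b[gid] - a[gid]) + a[gid];
}
```

In the OpenCL version, r would become one more __global const int * argument of the kernel, written to the device with clEnqueueWriteBuffer in the same way as the existing input arrays, and the kernel line would read r[gid] where the original attempt called rand().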
Hi, In episode 3 FFTs,
Hi,
In episode 3, FFTs, BLAS/LAPACK, and Monte Carlo implementations on the GPU were discussed. I know Apple has an OpenCL FFT implementation, but where can you find BLAS/LAPACK implementations? The ATI SDK for OpenCL includes some operations like matrix multiplication and eigenvalues, but I need a more complete package, especially one that has matrix inversion and singular value decomposition.
Thanks