## OpenCL Tutorial - OpenCL Fundamentals

Since Snow Leopard is coming soon and the first episode of the series was well received, I'm releasing the second episode earlier than I was planning. There seems to be a lot of excitement around Snow Leopard and OpenCL. So here's something to keep you busy until Snow Leopard Friday and the weekend of pure bliss we'll all be in.

In Episode 2 of the OpenCL tutorial series we'll cover some OpenCL fundamentals. This is a little more in-depth than the last one, and discusses general concepts that are important for understanding how OpenCL works. OpenCL address spaces are covered, as well as the basics of executing an OpenCL program (abstractly).

Also, thanks to everyone who gave feedback. I'm still working out a few things in producing these, but if you have comments, questions, or suggestions, please let me know (either by email or via this site).

In iTunes, you can subscribe to the podcasts by going to:

Advanced -> Subscribe to podcast

URL: http://feeds.feedburner.com/opencl

Episode 2 - OpenCL Fundamentals (Desktop/iPhone/iPod touch)

Episode 2 - OpenCL Fundamentals (PDF)

## Comments

## Awesome!

Awesome!

## Types of scientific computing problems amenable to OpenCL

Thanks very much for posting these slides.

I attended the OpenCL sessions at WWDC09, but as it was all new to me, I am still unclear about which classes of scientific computing problems are most amenable to speed-up via OpenCL. It would be helpful to know how useful OpenCL is for certain canonical problems such as:

1. FFTs (for what range of problem sizes)

2. BLAS-type vector-vector, matrix-vector, and matrix-matrix operations (again, for what size range)

3. Monte Carlo

4. PDEs (simple vs. complicated spatial lattices)

and problems of increasing algorithmic complexity. To take full advantage of OpenCL it is crucial to think carefully about the organization of data structures and memory bandwidth external to the GPU, so it isn't obvious to me how useful OpenCL is for different classes of scientific problems.

Another important issue is that most GPUs lack hardware support for double-precision calculations. Is this expected to change in the future? Perhaps single-precision will suffice for commodity uses and there will be little incentive to add double-precision capabilities. Also, if Intel's Larrabee architecture becomes widespread, will Grand Central Dispatch be a better programming model for it than OpenCL? (GCD is very easy to use.)

As it looks like it requires a fair bit of work to move to the GPU, it would be helpful to have more information before embarking on the journey.

## Re: Types of scientific computing problems amenable to OpenCL

These are all very good questions. And they are somewhat complicated to answer. This might be generally useful to discuss as part of the Q&A of the next podcast. If you can wait a bit (sometime next week), I can make a point to address these questions then if that's OK.

The one thing I'd point out now, however, is that while GCD is in fact easy to use, OpenCL is too once you get past the setup/initialization stuff. Sure, there may be algorithmic challenges. But having written some sparse matrix solvers, I can tell you that a little creative thinking and good numerical analysis can pay off handsomely.

Again, really good questions and I'll address them in the next video.

Dave

## That would be great

Thanks, Dave, that would be wonderful. I look forward to your next video and learning about your experiences using OpenCL.

Brad

## sorry but I have to ask

First. Great article.

Now...

I've been doing CUDA for a while, and it works really nicely. Now with OpenCL things get even more interesting, but besides portability, is there any other advantage to switching, performance-wise?

I did some basic tests, and it ran slower (matrix-matrix multiplication).

I'm about to start writing code for my thesis, and portability is nice, but not enough on its own to leave CUDA.

Thanks in advance.

## Thank you for doing these

Thank you for doing these tutorials, they are really good quality :)

I have a request for a next installment: could you talk a little bit about the numeric precision of the different supported graphics cards? I see that in the OpenCL specification, double precision is optional. If I declare a double-precision variable and it goes to the CPU, will I get double precision after an operation? What if it's accessed by a GPU that doesn't support doubles: will I get an error, or will the code execute as if it were a float variable?

Best Regards

Jesus Cruz

## Forgive me if some of this

Forgive me if some of this was mentioned in the video, I've only read the PDF slides.

It would be great if you could elaborate some more on the subtleties of the global vs the local work sizes, what the functional difference is, where you would want to use a specific local size, etc.

I ask because I noticed some weirdness with the `clEnqueueNDRangeKernel` function when experimenting with OpenCL on my own.

Firstly, it seems the local work size can never be larger than the global work size. In retrospect, this is obvious, but it would be nice to see where the "local" stuff fits into the picture (it's omitted from your slides).

Secondly, following on that first point, it seems possible to set the local size to NULL, and I'm wondering in what cases it would be beneficial to specify the local size manually vs. letting OpenCL handle this. If you set the local work size, it must follow a few constraints (each global dimension must be evenly divisible by the local dimension, the work-group must be smaller than a certain device limit, etc.). These issues bit me in my tests, and I think other people should be made aware of them.

Thirdly, it seems that you can specify up to 3 dimensions of global and local work sizes (your examples use a 2 dimensional matrix of work items). Again, what is the difference here with just using 1 dimension vs. 2 (or 3)?

Finally, thanks for the slides, they were very helpful.

## where is the sample code in the tutorial?

can we download it?

## I ask because I noticed some

> I ask because I noticed some weirdness with the `clEnqueueNDRangeKernel` function when experimenting with OpenCL on my own.

> Firstly, it seems the local work size must always be smaller than the global work size. In retrospect, this is obvious, but it would be nice to see where the "local" stuff fits into the picture (it's omitted from your slides).

The local workgroup size -- if you specify one -- must always evenly divide into the global workgroup size.

> Secondly, following on that first point, it seems possible to set the local size to NULL, and I'm wondering in what case it would be beneficial to specify the local size manually vs. letting OpenCL handle this? If you set the local work size, it must follow a few constraints (evenly divisible by the global work size in each dimension, smaller than a certain constant, etc). These issues bit me in my tests, I think other people should be made aware of them.

If you pass NULL, the implementation will attempt to select a local workgroup size for you. This will be based on the largest workgroup size and some factorization of your global workgroup size.

> Thirdly, it seems that you can specify up to 3 dimensions of global and local work sizes (your examples use a 2 dimensional matrix of work items). Again, what is the difference here with just using 1 dimension vs. 2 (or 3)?

You get 1, 2 or 3 dimensions. They are just arbitrary number lines. Pick one that works well for your problem.

## Seems like local work size

Seems like local work size and global work size are important to understand.

Currently I am doing an example on matrix multiplication, multiplying two matrices A[16][8] and B[8][36], and I have the following work sizes:

size_t local_work_size[] = {No. Rows of A = 16, No. Columns of B =36};

size_t global_work_size[] = {B.width = 36, A.height =16};

and the kernel multiplies and adds items from a row of A and a column of B and stores the result in C.

So, in a simple example of, say, adding the elements of arrays A[10] and B[10], my local size is 1 and global size is 10, because I want the kernel to execute 10 times, one work-item at a time, or one thread per work-item (and here the total threads are 10).

But in the matrix multiplication example, each row of A is read B.width = 36 times, and each column of B is read A.height = 16 times. So, I want the kernel to execute 36 times on one row = work-item, and 16 times on one column = work-item. So, local_work_size is the number of work-items, and each will be executed global_work_size times?

I am so confused. Please help clarify.