Understanding Matrix Transpose

Hi Folks,
I'm trying to figure out the Apple Matrix Transpose example.

Just wondering if anyone has been doing the same?

First thing to get stuck on is the work size.

Apple comments state:

// [5] This parameter is an array indicating the size of each dimension of the data array
// for all the data being processed by this kernel.
// For example, if your data represented a two-dimensional image that is 64 pixels wide
// by 128 pixels high, then you would supply an array of dimensions [64,128].

So for the example which has width 256 height 4096 I would expect an array of dimensions [256][4096] to be supplied.

Instead, the example supplies

global[0] = width * GROUP_DIMY; // 256 * 2 = 512
global[1] = height / GROUP_DIMX; // 4096 ÷ 32 = 128

Seems very strange as it does not even contain the same number of elements.

256 * 4096 = 1048576


whereas

512 * 128 = 65536

65536 is 1048576/16.


Any ideas?

This seems to be a key example. Understanding it means having grasped memory allocation so as to avoid bank conflicts.

I'm also confused about the indexing in the kernel itself, but one step at a time...

Any advice would be much appreciated.

Thanks,
Max

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.

Re: Understanding Matrix Transpose

Might have figured some of this out. The work item processes 16 elements.

So that needs to be executed 65536 times to process a total of 1048576 elements (256 * 4096).

Makes sense.

That said, how the work group is calculated is still a bit of a mystery...

Re: Understanding Matrix Transpose

From the NVIDIA_OpenCL_ProgrammingOverview...

Choosing the best NDRange depends on the kernel as well (in particular, the number of registers it uses) and ultimately requires some experimentation. For this reason, applications should not rely on the OpenCL implementation to determine the right work-group size (by setting local_work_size to NULL in clEnqueueNDRangeKernel()).

Re: Understanding Matrix Transpose

Loving this conversation with myself here. Anybody out there? Don't be shy...

anyway I think I have figured a bit more out. A half warp being 16 threads that means that for coalesced memory access, each thread will access one float. So that's 16 x 4 bytes = 64 bytes total.

That's why the work group size is 64 bytes.

Re: Understanding Matrix Transpose

Yes. You got it now.

I sort of discuss this in one the OpenCL tutorials. But the most efficient load is a 64 byte load (which is 16 floats). This gives you one float per bank per half warp. By definition a load to the same bank in two different half-warps, even half-warps of the same warp cannot cause a bank conflict. This is also reason that float2 and float4 aren't as efficient as people would expect or hope.

In any case, each thread will load the byte offset of data corresponding to the address modulo the thread id (for aligned and coalesced loads) for a 4-byte value.

Not that this WILL change at some point on future hardware (if it hasn't already), as the number of banks increase to service more threads per warp (that is going from half-warp servicing to full warp service).

Another thing to point out is that this is specific for NVIDIA hardware. On ATI hardware the equivalent (called a wavefront) is 64-bytes by default.

Regards,

Dave

Re: Understanding Matrix Transpose

Hi Dave,
that's great thanks.

I'm still confused about a few things. OpenCL looks really amazing so I want to make sure I understand the fundamentals. Hopefully I'll look back on all this one day and wonder why it seemed difficult at the time.

Anyway, this again relates to the Apple Matrix Transpose example.

The width is 256 and height 4096, so the total problem size, or NDRange, is 1048576 elements.

Now, the kernel processes 16 elements. Therefore it must be executed 65536 times to complete the problem.

How does the compiler know this?

Cheers,
Max

Re: Understanding Matrix Transpose

Sorry I am being stupid of course I answered that already.

global[0] = width * GROUP_DIMY; // 256 * 2 = 512
global[1] = height / GROUP_DIMX; // 4096 ÷ 32 = 128

So that's 512 * 128 = 65536.

Re: Understanding Matrix Transpose

Which leads me on to another point of confusion.

Sorry, this is becoming like a personal OpenCL tribulation blog!

I hope those that come after me find this useful :-)

Here goes again...

A work item is an instance of a kernel, and equivalent to a thread.

With NVIDIA hardware, a warp is 32 threads, or 32 work items.

In the Apple Matrix Transpose example the work group size is specified as 64 work items.

In the kernel code, it processes - hardcoded - 16 elements.

Therefore, each instance of the kernel - i.e., each work item - processes 16 elements.

So by specifying a work group size of 64 work items, what is being specified? Is it that 64 kernel instances should run in parallel?

Cheers,
Max

Re: Understanding Matrix Transpose

Would it be helpful if I did a podcast specifically on this example? I can imagine others may be confused as well. If you think it's worth it, I can try to put one together by weeks end.

Dave

Re: Understanding Matrix Transpose

Hi Dave,
well, I can't speak for others but personally I would jump up and down with joy if you did a podcast on this.

I think it's a great example. Understanding it means that key concepts have been grasped, and it really is the leap from the concepts to code that is the greatest challenge.

If you do decide to do it, it would be really brilliant if you could cover the local memory allocation (I can't understand why it's

sizeof(float) * GROUP_DIMX * (GROUP_DIMX + 1)

even assuming 1 is for padding.

The indexing inside the kernel is tricky too. It's doing some logical ANDs, and bit shifting to avoid bank conflicts I think.

Anyway, would be great if you did it. No worries either way though. You must be really busy fielding questions on all this.

Cheers,
Max

Re: Understanding Matrix Transpose

actually one thing I am starting to realise about this example is that it is a slightly different problem from that covered in the MacResearch tutorials 4 and 5. They transpose the entire image as one matrix, where this one - I think - is transposing sub-matrices.

What's strange though is that each kernel processes 16 items. So it must be loading 64 bytes per kernel instance.

If it processes one item that would be 4 bytes, and would give the more optimised 64 byte load for the half warp.

Re: Understanding Matrix Transpose

There's an NVIDIA example of this also and I have to say it's a lot simpler than the Apple one.

It looks - at first glance - pretty much how I'd expect given the tutorials thus far.

That said the Apple one is very puzzling and a bit of a challenge. I'd love to know why they are doing it at way, and to confirm that it is fully optimised.

Re: Understanding Matrix Transpose

That matrix transpose example from Apple's just shows a final optimized version of how to do it. It's tough to understand unless you go through the progression of the steps necessary to get that version of the code. The primary things to understand are how the transpose changes as a function of the workgroup size, and effective use of shared memory.

Dave

In algebra, the determinant

In algebra, the determinant is a special number associated with any square matrix. The fundamental geometric meaning of a determinant is a scale factor for measure when the matrix is regarded as a linear transformation. Thus a 2 × 2 matrix with determinant 2 when applied to a set of points with finite area will transform those points into a set with twice the area. Determinants are important both in calculus, where they enter the substitution rule for several variables, and in multilinear algebra. A matrix is invertible if and only if its determinant is non-zero.