OpenCL Tutorial - Questions and Answers to Episode Four

This episode covers questions generated by the previous podcast. We'll discuss GPU layout and terminology, and the bank conflicts that result from shared memory access.

In iTunes, you can subscribe to the podcasts by going to:

Advanced -> Subscribe to podcast
URL: http://feeds.feedburner.com/opencl

Episode 5 - Questions and Answers to Episode Four (Desktop/iPhone/iPod touch)
Episode 5 - Questions and Answers to Episode Four (PDF)

Comments

Vectors vs Banks

So what happens when using (for example) a float4 array and each thread accesses successive items?

Each float4 is 128-bits wide, so threads 0,4,8 & 12 would access banks 0-3; Thread 1,5,9 & 13 banks 4-7?
This would suggest that bank conflicts are only avoidable when using 32-bit (or less) data items. Is that true?

Also, how much do the numbers you give (memory sizes, warps, etc) vary across the various nvidia processors?

All of this seems to mean you need to write highly specialised code for each type of compute device. How much performance are we sacrificing if we write more generic code?

PDF in podcast feed?

Hi!

I was wondering if it is possible for you guys to also add PDF files to the podcast so I can subscribe in iTunes too? Thanks very much!

Re: Vectors vs Banks

Hi Paul,

"So what happens when using (for example) a float4 array and each thread accesses successive items?

Each float4 is 128-bits wide, so threads 0,4,8 & 12 would access banks 0-3; Thread 1,5,9 & 13 banks 4-7?
This would suggest that bank conflicts are only avoidable when using 32-bit (or less) data items. Is that true?"

In this case each float4 is mapped across 4 banks, because each bank entry is only 32 bits wide on current hardware. With 16 banks, the first 4 threads read their respective data with no bank conflict. The 5th thread's read results in a bank conflict because it wraps around to the same banks as the first thread, and so forth. Although the vector types (typeN, where N > 1) are supported, packing data into a vector type doesn't really bring a benefit when shared memory is used, from my understanding.
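The wrap-around described above can be checked with a little arithmetic. Here is a minimal sketch in Python (plain host-side code, not a kernel), assuming 16 banks of 32-bit words and assuming that thread t reads element t of the array. The function names are made up for illustration, and real hardware may split wide accesses into several requests, so treat the conflict count as a simplified model rather than a measurement.

```python
from collections import Counter

NUM_BANKS = 16   # assumed number of shared memory banks
HALF_WARP = 16   # threads serviced together by one request


def banks_for_element(index, words_per_element):
    """Banks touched by array element `index`, one bank per 32-bit word."""
    first_word = index * words_per_element
    return [(first_word + k) % NUM_BANKS for k in range(words_per_element)]


def conflict_degree(words_per_element):
    """Worst-case number of half-warp threads that land on the same bank
    when thread t reads component k of element t (same k for every thread)."""
    worst = 1
    for k in range(words_per_element):
        hits = Counter(
            banks_for_element(t, words_per_element)[k] for t in range(HALF_WARP)
        )
        worst = max(worst, max(hits.values()))
    return worst
```

In this model banks_for_element(0, 4) and banks_for_element(4, 4) both give banks 0 to 3, which is the overlap between the 1st and 5th threads. conflict_degree(1) is 1 for plain floats, while conflict_degree(4) is 4 for float4.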

Note that bank conflicts can only occur when more than one thread in a single half-warp is accessing the same bank (with the exception of broadcast, of course). This is because a separate read/write instruction is issued by the hardware for each half-warp.
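To make the half-warp rule concrete, here is a small Python sketch (again host-side arithmetic, not kernel code). It assumes 16 banks of 32-bit words and treats any number of threads reading the very same word as a broadcast, which is a simplification of what the hardware does. The function name is hypothetical.

```python
from collections import defaultdict

NUM_BANKS = 16   # assumed number of shared memory banks
HALF_WARP = 16   # each half-warp gets its own read/write instruction


def half_warp_conflicts(word_addresses):
    """Conflict degree for each half-warp.

    word_addresses holds one 32-bit word index per thread, in thread order.
    Only distinct words in the same bank count as a conflict; identical
    words are treated as a broadcast (simplified model).
    """
    degrees = []
    for start in range(0, len(word_addresses), HALF_WARP):
        words_per_bank = defaultdict(set)
        for word in word_addresses[start:start + HALF_WARP]:
            words_per_bank[word % NUM_BANKS].add(word)
        degrees.append(max(len(words) for words in words_per_bank.values()))
    return degrees
```

With unit stride, threads 0 and 16 both map to bank 0, but they sit in different half-warps, so half_warp_conflicts(list(range(32))) reports a degree of 1 for both halves. A stride of 2 words gives a 2-way conflict in each half-warp.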

"Also, how much do the numbers you give (memory sizes, warps, etc) vary across the various nvidia processors?"

On all hardware currently programmable with OpenCL or CUDA compute 1.1 or higher a warp is by definition 32 threads. There was talk that at some point a warp would no longer be serviced as two half-warps in future hardware. But I can't find the reference to where I heard/read that, so I'm not certain what/when/where that will change.

On current hardware (compare NVIDIA compute 1.0 vs. 1.2 hardware, for example), there are significant differences in what is done as a coalesced load versus an uncoalesced one. For example, on 1.0 hardware all threads must participate in a data load for the load to occur as a single transaction, whereas on 1.2 or higher hardware, that is not the case (the hardware will detect that using a set of heuristics).

All of that effectively means you may need to target multiple types of hardware to get optimal performance. However, if you can, design your data layout and access patterns for the dumbest hardware, and you'll get the performance for free on all future hardware. Easier said than done sometimes, of course :)

"All of this seems to mean you need to write highly specialised code for each type of compute device. How much performance are we sacrificing if we write more generic code?"

That's really application dependent, as I'm sure you can imagine. However, I've found that refactoring code for the lowest hardware I need to support gives me big wins in the end on the newer hardware (and, more importantly for me, when I back port the optimized code/data structures to the CPU). I tend to use the approach of optimizing where needed or when something comes up. Unfortunately, GPUs are simply not as forgiving as CPUs. At least not yet.

HTH,

Dave

Bank Layouts and Private Memory

First, thanks for the great work and effort.

So from what I am understanding, if I am understanding correctly at all, local shared SM memory is interleaved, bankwise. So given a block of 64B written sequentially, that will write into all 16 banks and not just bank 0. Is this correct?

Also, how much private register memory does each SP have in a gt200? I haven't been able to locate that info. Is it significant?

Re: Bank Layouts and Private Memory

Yes, that is correct. A 64B segment will be written sequentially across all 16 banks if the element type is 32 bits wide (sizeof(type) == 4).
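As a rough illustration of the interleaving, the byte-to-bank mapping can be written out in a few lines of Python. This assumes 16 banks that are each 4 bytes wide, with consecutive 32-bit words going to consecutive banks. The helper names are invented for this sketch.

```python
BANK_WIDTH_BYTES = 4   # each bank entry is 32 bits wide
NUM_BANKS = 16         # assumed number of shared memory banks


def bank_of_byte(byte_offset):
    """Bank that holds the given byte offset in shared memory."""
    return (byte_offset // BANK_WIDTH_BYTES) % NUM_BANKS


def banks_of_element(index, element_bytes):
    """Banks covered by element `index` of an array of fixed-size items."""
    start = index * element_bytes
    return sorted({bank_of_byte(b) for b in range(start, start + element_bytes)})
```

Bytes 0 to 63 cover banks 0 to 15 exactly once, and byte 64 wraps back to bank 0. An 8-byte element such as a double spans two banks, so banks_of_element(0, 8) gives [0, 1].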

Regarding registers: each SP has access to 2K entries of the shared register file (16K entries per SM). The total register file size is 64KB per SM (which is double the size on the G80 series). The registers are allocated and partitioned at compile time by the JIT compiler for the given kernel and card.
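The register numbers fit together as simple arithmetic, sketched below in Python. The 16K entries per SM come from the answer above; the 8 SPs per SM and 4 bytes per entry are assumptions that match the 2K and 64KB figures. The thread bound ignores allocation granularity and any other per-SM limits, so it is only an upper bound.

```python
REGISTERS_PER_SM = 16 * 1024   # 32-bit entries per SM, from the answer above
BYTES_PER_REGISTER = 4         # assumption: one entry is 32 bits
SPS_PER_SM = 8                 # assumption: 8 SPs share one SM


def register_file_bytes():
    """Size of one SM's register file in bytes."""
    return REGISTERS_PER_SM * BYTES_PER_REGISTER


def registers_per_sp():
    """Even share of the register file per SP."""
    return REGISTERS_PER_SM // SPS_PER_SM


def max_threads_by_registers(registers_per_thread):
    """Upper bound on resident threads per SM from register use alone."""
    return REGISTERS_PER_SM // registers_per_thread
```

Since the compiler fixes the register count per thread for a kernel, a kernel that needs 32 registers per thread could keep at most 512 threads resident on one SM in this model, and one that needs 16 could keep 1024.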

Hope that helps,

Dave

Thanks, and a Q about Scalar Arch

Thanks for the help. It does clarify things about resources.

One thing I'm finding is that the design for gt2xx (and beyond) isn't what I had imagined. It is at the core SP level not a vector processor at all. It is scalar. So operations on point vectors, etc, are serialized. Is that correct? A 4 element vector added to another 4 element vector would require 4 separate additions (1 per component).