

King Abdullah University of Science and Technology

# CS 380 - GPU and GPGPU Programming<br>Lecture 16: CUDA Memories, Pt. 2 Lecture 16: CUDA Memories, Pt. 2

Markus Hadwiger, KAUST

#### Reading Assignment #6 (until Oct 14)



Read (required):

eading Assignment #6 (until Oct 14)<br>• Programming Massively Parallel Processors book (4th edition),<br>• Programming Massively Parallel Processors book (4th edition),<br>• Chapter 5 (*Memory architecture and data locality*) Chapter 5 (Memory architecture and data locality)

Read (optional):

- eading Assignment #6 (until Oct 14)<br>
Frogramming Massively Parallel Processors book (4th edition),<br>
Chapter 5 (*Memory architecture and data locality*)<br>
For Programming Massively Parallel Processors book (4th edition),<br>
Ch Chapter 20 (An introduction to CUDA streams)
- ead (required):<br>• Programming Massively Parallel Processors book (4th edition),<br>**Chapter 5** (*Memory architecture and data locality*)<br>• and (optional):<br>• Programming Massively Parallel Processors book (4th edition),<br>• Prog Chapter 21 (CUDA dynamic parallelism)

#### Reading Assignment #7 (until Oct 21)



#### Read (required):

eading Assignment #7 (until Oct 21)<br>• Programming Massively Parallel Processors book (4th edition),<br>• Programming Massively Parallel Processors book (4th edition),<br>• Chapter 6 (*Performance considerations*) Chapter 6 (Performance considerations) • Inline PTX Assignment #7 (until Oct 21)<br>• Independent (required):<br>• Programming Massively Parallel Processors book (4th edition),<br>• Chapter 6 (*Performance considerations*)<br>• Inline PTX Assembly in CUDA: Inline\_PTX\_Assem ad (required):<br>Programming Massively Parallel Processors book (4th edition),<br>Chapter 6 (*Performance considerations*)<br>ad (optional):<br>• Inline PTX Assembly in CUDA: Inline\_PTX\_Assembly.pdf<br>• Dissecting GPU Architectures thr

#### Read (optional):

- 
- 

Volta: https://arxiv.org/abs/1804.06826 Turing: https://arxiv.org/abs/1903.07486

https://developer.download.nvidia.com/video/gputechconf/gtc/2019/presentation/ s9839-discovering-the-turing-t4-gpu-architecture-with-microbenchmarks.pdf

Ampere: https://www.nvidia.com/en-us/on-demand/session/gtcspring21-s33322/

#### Next Lectures



no lecture on Oct 14 ! (fall semester break)

Lecture 17: Tue, Oct 15: Vulkan tutorial (room 3128, 14:30-15:45) no lecture on Oct 14 ! (fall semester break)<br>Lecture 17: Tue, Oct 15: Vulkan tutorial (room 3128, 14:30-15:45)<br>Lecture 18: Thu, Oct 17: Quiz #2 (only quiz; room 3128, 10:00)<br>Lecture 20: Tue, Oct 22 (make-up lecture; 14:30

Lecture 18: Thu, Oct 17: Quiz #2 (only quiz; room 3128, 10:00)

Lecture 19: Mon, Oct 21

Lecture 21: Thu, Oct 24

#### Example: Matrix Multiplication

 $5<sub>2</sub>$ 



```
qlobal void MatrixMul( float *matA, float *matB, float *matC, int w)
{
        shared float blockA[ BLOCK SIZE ][ BLOCK SIZE ];
      ample: Matrix Multiplication<br>
bal_void MatrixMul(float *mata, float *matB, float *matC, int w)<br>
_shared_float blockB[ BLOCK_SIZE ][ BLOCK_SIZE ];<br>
_shared_float blockB[ BLOCK_SIZE ][ BLOCK_SIZE ];<br>
int bx = blockIdx.x; int
      int bx = blockIdx.x; int tx = threadIdx.x;
      int by = blockIdx.y; int ty = threadIdx.y;
      int col = bx * BLOCK SIZE + tx;
      int row = by * BLOCK SIZE + ty;
      float out = 0.0f;
      for ( int m = 0; m < w / BLOCK SIZE; m++ ) {
          blockA[ ty ][ tx ] = matA[ row * w + m * BLOCK SIZE + tx ];
          blockB[ ty ][ tx ] = matB[ col + ( m * BLOCK SIZE + ty ) * w ];
           syncthreads();
          for ( int k = 0; k < B</math> LOCK SIZE; <math>k++</math> ) {out += blockA[ ty ][ k ] * blockB[ k ][ tx ];
           }
            syncthreads();
      }
     matC[ row * w + col ] = out;
}
                                                      Caveat: for brevity, this code assumes matrix sizes 
                                                      are a multiple of the block size (either because 
                                                      they really are, or because padding is used; 
                                                      otherwise guard code would need to be added)
```
#### Example: Matrix Multiplication



```
syncthreads();
```
}

}

```
matC[ row * w + col ] = out;
```
Caveat: for brevity, this code assumes matrix sizes are a multiple of the block size (either because they really are, or because padding is used; otherwise guard code would need to be added)

# CUDA Memory: Shared Memory

# Memory and Cache Types Memory and Cache Types<br>
Global memory<br>
• [Device] L2 cache<br>
• [SM] L1 cache (shared mem carve



#### Global memory

- 
- Memory and Cache Types<br>• [SM] L1 cache<br>• [SM] L1 cache (shared mem carved out; or L1 shared with tex cache)<br>• [SM/TPC] Texture cache (separate, or shared with L1 cache)<br>• [SM] Pearl of the same (starse with the same of t • Memory and Cache Types<br>• [Device] L2 cache<br>• [SM] L1 cache (shared mem carved out; *or* L1 shared with tex cache<br>• [SM/TPC] Texture cache (separate, or shared with L1 cache)<br>• [SM] Read-only data cache (storage might be • [Device] L2 cache Types<br>• [Device] L2 cache<br>• [SM] L1 cache (shared mem carved out; or L1 shared with tex cache)<br>• [SM/TPC] Texture cache (separate, or shared with L1 cache)<br>• [SM] Read-only data cache (storage might be
- 
- 

#### Shared memory

• [Device] **L2 cache**<br>• [SM] **L1 cache** (shared mem carved out; or L1 shared with tex cache<br>• [SM/TPC] **Texture cache** (separate, or shared with L1 cache)<br>• [SM] **Read-only data cache** (storage might be same as tex cache)<br> (Hopper/CC 9.x: also thread block clusters) Shared memory<br>
• [SM] Shareable only between threads in same thread block<br>
(Hopper/CC 9.x: also thread block clusters)<br>
Constant memory: Constant (uniform) cache<br>
Unified memory programming: Device/host memory sharing<br>
Mar

Constant memory: Constant (uniform) cache

Unified memory programming: Device/host memory sharing

#### L1 Cache vs. Shared Memory



Different configs on Fermi and Kepler; carveout on Maxwell and newer <sup>1</sup> Cache vs. Shared Memory<br>
fferent configs on Fermi and Kepler; carveout on Maxwell and newer<br>
• More shared memory on newer GPUs (64KB, 96KB, 100KB, 164KB, ...)<br>
Carveout from unified L1/read-only data cache<br>
• Car CUDA

Carveout from unified L1/read-only data cache

(See CUDA C Programming Guide!)

```
// Device code
 qlobal void MyKernel(...)
    shared float buffer [BLOCK DIM];
// Host code
int carveout = 50; // prefer shared memory capacity 50% of maximum
// Named Carveout Values:
// carveout = cudaSharedmemCarveoutDefault; // (-1)// carveout = cudaSharedmemCarveoutMaxL1; // (0)
// carveout = cudaSharedmemCarveoutMaxShared; // (100)
cudaFuncSetAttribute(MyKernel, cudaFuncAttributePreferredSharedMemoryCarveout,
carveout);
MyKernel <<<qridDim, BLOCK DIM>>>(...);
```
#### NVIDIA GH100 SM SIDE Dispatch Unit (32 thread/clk)

#### Multiprocessor: SM (CC 9.0)

- 
- 
- 
- 

#### 4 partitions inside SM and the scheduler (32 thread/clk)

- 
- 
- 
- 
- dispatch unit, 16K register file



#### Compute Capab. 9.x (Hopper, Part 2)



#### K.8.3. Shared Memory

Similar to the NVIDIA Ampere GPU architecture, the amount of the unified data cache reserved for shared memory is configurable on a per kernel basis. For the NVIDIA H100 Tensor Core GPU architecture, the unified data cache has a size of 256 KB for devices of compute capability 9.0. The shared memory capacity can be set to 0, 8, 16, 32, 64, 100, 132, 164, 196 or 228 KB.

Markus Hadwiger, KAUST<br>
Markus

# **Shared Memory Allocation**

- 2 modes
- Static size within kernel

```
shared float vec[256];
```
Dynamic size when calling the kernel

```
// in main
int VecSize = MAX THREADS * sizeof (float4);
vecMat<<< blockGrid, threadBlock, VecSize >>>(p1, p2, ...);
```
// declare as extern within kernel

 $extern$  shared float  $vec[]$ ;

#### **Shared Memory**

- Accessible by all threads in a block  $\bullet$
- Fast compared to global memory
	- Low access latency ۸
	- **High bandwidth** ۸
- **Common uses:** 
	- **Software managed cache** ۸
	- **Data layout conversion**  $\qquad \qquad \bullet$



#### **Global Memory (DRAM)**

#### **Shared Memory/L1 Sizing**

#### Shared memory and L1 use the same 64KB

- Program-configurable split:
	- Fermi: 48:16, 16:48
	- Kepler: 48:16, 16:48, 32:32

later: use carveout

- use cudaFuncSetAttribute()
- CUDA API: cudaDeviceSetCacheConfig(), cudaFuncSetCacheConfig()
- Large L1 can improve performance when:
	- Spilling registers (more lines in the cache -> fewer evictions)
- Large SMEM can improve performance when:
	- **Occupancy is limited by SMEM**

#### **Shared Memory**

#### Uses:

- Inter-thread communication within a block  $\bullet$
- Cache data to reduce redundant global memory accesses  $\qquad \qquad \bullet$
- Use it to improve global memory access patterns  $\qquad \qquad \bullet$

#### **Organization:**

- 32 banks, 4-byte (or 8-byte) banks  $\qquad \qquad \bullet$
- Successive words accessed through different banks  $\bullet$

# Parallel Memory Architecture Parallel Memory Architecture<br>• In a parallel machine, many threads access memory<br>- Therefore, memory is divided into banks Parallel Memory Architecture<br>
n a parallel machine, many threads access mem<br>
– Therefore, memory is divided into banks<br>
– Essential to achieve high bandwidth Parallel Memory Architecture<br>
n a parallel machine, many threads access mer<br>
– Therefore, memory is divided into banks<br>
– Essential to achieve high bandwidth

- -
	-
- 
- In a parallel machine, many threads access memory<br>
 Therefore, memory is divided into banks<br>
 Essential to achieve high bandwidth<br>
 Each bank can service one address per cycle<br>
 A memory can service as many simultan m a parallel machine, many threads access memore-<br>
— Therefore, memory is divided into banks<br>
— Essential to achieve high bandwidth<br>
<br>
<br>
Each bank can service one address per cycle<br>
— A memory can service as many simultan accesses as it has banks
- − Essential to achieve high bandwidth<br>
 Each bank can service one address per cycle<br>
 A memory can service as many simultaneous<br>
accesses as it has banks<br>
 Multiple simultaneous accesses to a bank<br>
Fank 6<br>
Fank 6<br>
Fan result in a bank conflict Each bank can service one address per cycle<br>
— A memory can service as many simultaneous<br>
accesses as it has banks<br>
Multiple simultaneous accesses to a bank<br>
esult in a bank conflict<br>
— Conflicting accesses are serialized

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009 ECE 498AL, University of Illinois, Urbana-Champaign



#### Memory Banks Friendrich

Fermi/Kepler/Maxwell and newer: and newer.<br>32 banks

default: raun.<br>4B / bank

Kepler or newer: configurable to 8B / bank



#### **Shared Memory**

#### Uses:

- Inter-thread communication within a block
- Cache data to reduce redundant global memory accesses  $\qquad \qquad \bullet$
- Use it to improve global memory access patterns

#### **Performance:**

- smem accesses are issued per warp
- Throughput is 4 (or 8) bytes per bank per clock per multiprocessor
- serialization: if N threads of 32 access different words in the same bank,  $\bullet$ **N** accesses are executed serially
- **EXAMPLE 10 multicast: N threads access the same word in one fetch** 
	- Could be different bytes within the same word

#### **Shared Memory Organization**

- Organized in 32 independent banks
- **Optimal access: no two words from** same bank
	- Separate banks per thread
	- **Banks can multicast**  $\bullet$
- Multiple words from same bank serialize  $\bullet$





# Bank Addressing Examples



© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009 ECE 498AL, University of Illinois, Urbana-Champaign

# Bank Addressing Examples



© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009 ECE 498AL, University of Illinois, Urbana-Champaign

- 
- How addresses map to banks on G80<br>
Each bank has a bandwidth of 32 bits per clock cycle<br>
Successive 32-bit words are assigned to successive<br>
panks<br>
 So bank = address % 16<br>
 So bank = address % 16<br>
 Same as the size of How addresses map to banks on G80<br>• Each bank has a bandwidth of 32 bits per clock cycle<br>• Successive 32-bit words are assigned to successive How addresses map to banks on G80<br>• Each bank has a bandwidth of 32 bits per clock cycle<br>• Successive 32-bit words are assigned to successive<br>banks banks How addresses ma<br>
• Each bank has a bandwidth<br>
• Successive 32-bit words are<br>
banks<br>
• G80 has 16 banks<br>
– So bank = address % 16 Each bank has a bandwidth of 32 bits per c<br>
Successive 32-bit words are assigned to su<br>
banks<br>
- So bank = address % 16<br>
- Same as the size of a half-warp<br>
· No bank conflicts between different half-warps, or<br>
single half-• No bank conflicts between different half-warps, only within a single half-warp<br>
• No bank conflicts between different half-warps, only within a single half-warp
- -
	- - single half-warp

#### Fermi and newer have 32 banks, considers full warps instead of half warps!

# **Shared Memory Bank Conflicts**

- Shared memory is as fast as registers if there are no bank conflicts
- The fast case:  $\bullet$ 
	- If all threads of a half-warp access different banks, there is no bank conflict
	- If all threads of a half-warp access the identical address, there is no bank conflict (broadcast)
- The slow case:  $\bullet$ 
	- Bank Conflict: multiple threads in the same half-warp access the same bank
	- Must serialize the accesses
	- $Cost = max # of simultaneous accesses to a single bank$

#### full warps instead of half warps on Fermi and newer!

# **Linear Addressing**

Given:

shared float shared [256];  $float$  foo = shared[baseIndex + s \* threadIdx.x];

This is only bank-conflict-free if s  $\bullet$ shares no common factors with the number of banks

 $-$  16 on G80, so s must be odd





Parallel08 - Memory Access

Hendrik Lensch and Robert Strzodka

# **Data Types and Bank Conflicts**

This has no conflicts if type of shared is 32-bits:  $\bullet$ 

 $foo = shared[baseIndex + threadIdx.x]$ 

But not if the data type is smaller  $\bullet$ 

4-way bank conflicts: shared char shared[];  $foo = shared[baseIndex + threadIdx.x];$ 

#### not true on Fermi, because of multi-cast!

- 2-way bank conflicts: shared short shared[];  $foo = shared[baseIndex + threadIdx.x];$ 

#### not true on Fermi, because of multi-cast!





# **Structs and Bank Conflicts**

Struct assignments compile into as many memory accesses as there are struct members: Thread ( Bank (



- This has no bank conflicts for vector; struct size is 3 words 3 accesses per thread, contiguous banks (no common factor with 16) struct vector  $v =$  vectors [baseIndex + threadIdx.x];
- This has 2-way bank conflicts for myType;  $\bullet$ (each bank will be accessed by 2 threads simultaneously) struct myType  $m = myTypes[baseIndex + threadIdx.x];$

Parallel08 - Memory Access

### **Broadcast on Shared Memory**

- Each thread loads the same  $\bullet$ element – no bank conlict
	- $x = shared[0]$ ;
- Will be resolved implicitly  $\bullet$

#### multi-cast on Fermi and newer!



#### **Common Array Bank Conflict Patterns**  $1<sub>D</sub>$

- Each thread loads 2 elements into shared mem:
	- 2-way-interleaved loads result in 2-way bank conflicts:

int tid = threadIdx.x;  $shared[2*tid] = global[2*tid];$  $shared[2*tid+1] = global[2*tid+1];$ 

- This makes sense for traditional CPU  $\bullet$ threads, locality in cache line usage and reduced sharing traffic.
	- Not in shared memory usage where there is no cache line effects but banking effects



# **A Better Array Access Pattern**

Each thread loads one element in  $\bullet$ every consecutive group of blockDim elements.

```
shared[tid] = global[tid];shared[tid + blockDim.x] =qlobal[tid + blockDim.x];
```


#### **OPTIMIZE**

**Kernel Optimizations: Shared Memory Accesses** 

© NVIDIA 2013

#### **Case Study: Matrix Transpose**

- **Coalesced read**
- **Scattered write (stride N)**
- $\Rightarrow$  Process matrix tile, not single row/column, per block
- $\Rightarrow$  Transpose matrix tile within block





© NVIDIA 2013

#### **Case Study: Matrix Transpose**

- **Coalesced read**
- **Scattered write (stride N)**
- **Transpose matrix tile within block**  $\bullet$
- $\Rightarrow$  Need threads in a block to cooperate: use shared memory





#### **Transpose with coalesced read/write**

```
global
           transpose(float in[], float out[])
\overline{1}shared float tile[TILE] [TILE] ;
  int glob in = xIndex + (yIndex)*N;
 int glob out = xIndex + (yIndex) * N;tile[threadIdx.y][threadIdx.x] = in[glob_in];
   syncthreads();
  out[glob out] = tile[threadIdx.x][threadIdx.y];\mathbf{H}
```
#### **Fixed GMEM coalescing, but introduced SMEM bank conflicts**

transpose<<<grid, threads>>>(in, out);

© NVIDIA 2013

#### **Transpose with coalesced read/write**

```
global
           transpose(float in[], float out[])
\overline{1}shared float tile[TILE] [TILE] ;
  int glob in = xIndex + (yIndex) *N;
 int glob out = xIndex + (yIndex) * N;tile[threadIdx.y][threadIdx.x] = in[glob in];
   syncthreads () ;
 out[glob out] = tile[threadIdx.x][threadIdx.y];
```
#### **Fixed GMEM coalescing, but introduced SMEM bank conflicts**

transpose<<<<grid, threads>>>(in, out);

© NVIDIA 2013

#### **Shared Memory: Avoiding Bank Conflicts**

- Example: 32x32 SMEM array
- Warp accesses a column:

read (LD) from shared memory out[glob out] =

tile[threadIdx.x][threadIdx.y];

32-way bank conflicts (threads in a warp access the same bank)



#### **Shared Memory: Avoiding Bank Conflicts**



#### No bank conflicts anymore

}

```
_global__ transpose(float in[], float out[])
\overline{1}shared float tile[TILE][TILE+1];
  int glob in = xIndex + (yIndex) *N;
 int glob_out = xIndex + (yIndex)*N;tile[threadIdx.y][threadIdx.x] = in[glob in];
   syncthreads();
 out[glob out] = tile[threadIdx.x][threadIdx.y];
```
#### Thank you.