

King Abdullah University of Science and Technology

# CS 380 - GPU and GPGPU Programming<br>Lecture 19: CUDA Memories, Pt. 3 Lecture 19: CUDA Memories, Pt. 3

Markus Hadwiger, KAUST

### Reading Assignment #8 (until Oct 28)



Read (required):

- eading Assignment #8 (until Oct 28)<br>• Programming Massively Parallel Processors book, 4<sup>th</sup> edition<br>• Programming Massively Parallel Processors book, 4<sup>th</sup> edition<br>• Optimizing Parallel Reduction in CUDA, Mark Harris, Chapter 10: Reduction eading Assignment #8 (until Oct 28)<br>• Programming Massively Parallel Processors book, 4<sup>th</sup> edition<br>• Programming Massively Parallel Processors book, 4<sup>th</sup> edition<br>• Optimizing Parallel Reduction in CUDA, Mark Harris,<br>http ead (required):<br>• Programming Massively Parallel Processors book, 4<sup>th</sup> edition<br>• Optimizing Parallel Reduction in CUDA, Mark Harris,<br>https://developer.download.nvidia.com/assets/cuda/files/reduction.pdf<br>ead (optional):<br>•
- 

https://developer.download.nvidia.com/assets/cuda/files/reduction.pdf

Read (optional):

https://devblogs.nvidia.com/parallelforall/faster-parallel-reductions-kepler/

#### Next Lectures



Next Lectures<br>Lecture 20: Tue, Oct 22 (make-up lecture; 14:30 – 15:45)<br>Lecture 21: Thu, Oct 24 Lecture 21: Thu, Oct 24 **Next Lectures**<br>
Lecture 20: Tue, Oct 22 (make-up lecture; 14:30 – 15:45)<br>
Lecture 21: Thu, Oct 24<br>
Lecture 22: Mon, Oct 28<br>
Lecture 23: Tue, Oct 29 (make-up lecture; 14:30 – 15:45)<br>
Lecture 24: Thu, Oct 31

Lecture 22: Mon, Oct 28

Lecture 24: Thu, Oct 31

## CUDA Memory: Shared Memory

# Memory and Cache Types Memory and Cache Types<br>
Global memory<br>
• [Device] L2 cache<br>
• [SM] L1 cache (shared mem carve



#### Global memory

- 
- Memory and Cache Types<br>• [SM] L1 cache<br>• [SM] L1 cache (shared mem carved out; or L1 shared with tex cache)<br>• [SM/TPC] Texture cache (separate, or shared with L1 cache)<br>• [SM] Pearl of the same (starse with the same of t • Memory and Cache Types<br>• [Device] L2 cache<br>• [SM] L1 cache (shared mem carved out; *or* L1 shared with tex cache<br>• [SM/TPC] Texture cache (separate, or shared with L1 cache)<br>• [SM] Read-only data cache (storage might be • [Device] L2 cache Types<br>• [Device] L2 cache<br>• [SM] L1 cache (shared mem carved out; or L1 shared with tex cache)<br>• [SM/TPC] Texture cache (separate, or shared with L1 cache)<br>• [SM] Read-only data cache (storage might be
- 
- 

#### Shared memory

• [Device] **L2 cache**<br>• [SM] **L1 cache** (shared mem carved out; or L1 shared with tex cache<br>• [SM/TPC] **Texture cache** (separate, or shared with L1 cache)<br>• [SM] **Read-only data cache** (storage might be same as tex cache)<br> (Hopper/CC 9.x: also thread block clusters) Shared memory<br>
• [SM] Shareable only between threads in same thread block<br>
(Hopper/CC 9.x: also thread block clusters)<br>
Constant memory: Constant (uniform) cache<br>
Unified memory programming: Device/host memory sharing<br>
Mar

Constant memory: Constant (uniform) cache

Unified memory programming: Device/host memory sharing

## **Common Array Bank Conflict Patterns**  $1<sub>D</sub>$

- Each thread loads 2 elements into shared mem:
	- 2-way-interleaved loads result in 2-way bank conflicts:

int tid = threadIdx.x;  $shared[2*tid] = global[2*tid];$  $shared[2*tid+1] = global[2*tid+1];$ 

- This makes sense for traditional CPU  $\bullet$ threads, locality in cache line usage and reduced sharing traffic.
	- Not in shared memory usage where there is no cache line effects but banking effects



## **A Better Array Access Pattern**

Each thread loads one element in  $\bullet$ every consecutive group of blockDim elements.

```
shared[tid] = global[tid];shared[tid + blockDim.x] =qlobal[tid + blockDim.x];
```


## **Typical Parallel Programming Pattern**

log(n) steps



Helpful fact for counting nodes of full binary trees: If there are N leaf nodes, there will be N-1 non-leaf nodes

Parallel08 - Control Flow

Hendrik Lensch and Robert Strzodka

## **Vector Reduction**



Parallel08 - Control Flow

Hendrik Lensch and Robert Strzodka

## **Vector Reduction with Branch Divergence**



## A better implementation



Parallel08 - Control Flow

Hendrik Lensch and Robert Strzodka

## A better implementation

- Only the last 5 iterations will have divergence
- Entire warps will be shut down as iterations progress
	- For a 512-thread block, 4 iterations to shut down all but one warp in each block
	- Better resource utilization, will likely retire warps and thus blocks faster
- Recall, no bank conflicts either



## CUDA Memory: Uniforms & Textures

# Memory and Cache Types Memory and Cache Types<br>
Global memory<br>
• [Device] L2 cache<br>
• [SM] L1 cache (shared mem carve



#### Global memory

- 
- Memory and Cache Types<br>• [SM] L1 cache<br>• [SM] L1 cache (shared mem carved out; or L1 shared with tex cache)<br>• [SM/TPC] Texture cache (separate, or shared with L1 cache)<br>• [SM] Pearl of the same (starse with the same of t • Memory and Cache Types<br>• [Device] L2 cache<br>• [SM] L1 cache (shared mem carved out; *or* L1 shared with tex cache<br>• [SM/TPC] Texture cache (separate, or shared with L1 cache)<br>• [SM] Read-only data cache (storage might be • [Device] L2 cache Types<br>• [Device] L2 cache<br>• [SM] L1 cache (shared mem carved out; or L1 shared with tex cache)<br>• [SM/TPC] Texture cache (separate, or shared with L1 cache)<br>• [SM] Read-only data cache (storage might be
- 
- 

#### Shared memory

• [Device] L2 cache<br>• [SM] L1 cache (shared mem carved out; or L1 shared with tex cache<br>• [SM/TPC] Texture cache (separate, or shared with L1 cache)<br>• [SM] Read-only data cache (storage might be same as tex cache)<br>Shared m (Hopper/CC 9.x: also thread block clusters) Shared memory<br>
Shared memory<br>
(SM] Shareable only between threads in same thread block<br>
(Hopper/CC 9.x: also thread block clusters)<br> **Constant memory: Constant (uniform) cache**<br>
Unified memory programming: Device/host memo

#### Constant memory: Constant (uniform) cache

Unified memory programming: Device/host memory sharing

## Constants

- Immediate address constants
- Indexed address constants  $\bullet$
- **Constants stored in DRAM, and cached** on chip
	- $-$  L1 per SM
- A constant value can be broadcast to all threads in a Warp
	- Extremely efficient way of accessing a value that is common for all threads in a block!

#### // specify as global variable

device constant float gpuGamma[2];

```
// copy gamma value to constant device memory
cudaMemcpyToSymbol(gpuGamma, &gamma, sizeof(float));
// access as global variable in kernel
res = qpuGamma[0] * threadIdx.x;
```


# Memory and Cache Types Memory and Cache Types<br>
Global memory<br>
• [Device] L2 cache<br>
• [SM] L1 cache (shared mem carve



#### Global memory

- 
- Memory and Cache Types<br>• [SM] L1 cache<br>• [SM] L1 cache (shared mem carved out; or L1 shared with tex cache)<br>• [SM/TPC] Texture cache (separate, or shared with L1 cache)<br>• [SM] Pearl of the same (starse with the same of t • Memory and Cache Types<br>• [Device] L2 cache<br>• [SM] L1 cache (shared mem carved out; *or* L1 shared with tex cache<br>• [SM/TPC] Texture cache (separate, or shared with L1 cache)<br>• [SM] Read-only data cache (storage might be • [Device] L2 cache Types<br>• [Device] L2 cache<br>• [SM] L1 cache (shared mem carved out; or L1 shared with tex cache)<br>• [SM/TPC] Texture cache (separate, or shared with L1 cache)<br>• [SM] Read-only data cache (storage might be
- 
- 

#### Shared memory

• [Device] **L2 cache**<br>• [SM] **L1 cache** (shared mem carved out; or L1 shared with tex cache<br>• [SM/TPC] **Texture cache** (separate, or shared with L1 cache)<br>• [SM] **Read-only data cache** (storage might be same as tex cache)<br> (Hopper/CC 9.x: also thread block clusters) Shared memory<br>
• [SM] Shareable only between threads in same thread block<br>
(Hopper/CC 9.x: also thread block clusters)<br>
Constant memory: Constant (uniform) cache<br>
Unified memory programming: Device/host memory sharing<br>
Mar

Constant memory: Constant (uniform) cache

Unified memory programming: Device/host memory sharing

## **Texture Memory**

- Cached, potentially exhibiting higher bandwidth if there is locality in the texture fetches;
- They are not subject to the constraints on memory access patterns that global or constant memory reads must respect to get good performance
- The latency of addressing calculations is hidden better, possibly improving performance for applications that perform random accesses to the data
- No penalty when accessing float4
- Optional
	- 8-bit and 16-bit integer input data may be optionally converted to 32bit floating point
	- Packed data may be broadcast to separate variables in a single operation;
	- values in the range  $[0.0, 1.0]$  or  $[-1.0, 1.0]$
	- texture filtering
	- address modes, e.g. wrapping / texture borders

## **Additional Texture Functionality**

#### • All of these are "free"

- Dedicated hardware
- Must use CUDA texture objects
	- See CUDA Programming Guide for more details
	- Texture objects can interoperate graphics (OpenGL, DirectX)
- Out-of-bounds index handling: clamp or wrap-around

#### **Optional interpolation**

- Think: using fp indices for arrays
- Linear, bilinear, trilinear
	- Interpolation weights are 9-bit
- **Optional format conversion** 
	- {char, short, int, fp16} -> float

ี 2013 ผม**ก** 

© 2013. NVIDIA

## **Examples of Texture Object Indexing**



#### Integer indices fall between elements **Optional interpolation:**

Weights are determined by coordinate distance

#### **Index Wrap:**



**Index Clamp:** 



## **Native Memory Layout - Data** Locality

## CPU

- $\cdot$  1D input
- $\cdot$  1D output
- Other dimensions with offsets



Output

## **GPU**

- $\cdot$  2D input
- $2D$  output
- Other dimensions with

offsets Input

Output



### Space-Filling Curves: Morton Order (Z Order)



Map higher-dimensional space to 1D



## **1D Access**

#### • Access to linear Cuda memory

float4\* pos; cudaMalloc( (void\*\*) &pos, x\*sizeof(float4) );

#### • Texture reference

- $-$  type
- access/filtering mode

// global texture reference

texture< float4, 1, cudaReadModeElementType> texPos;

#### • Bind to linear array

```
cudaBindTexture(0, texPos, pos, x*sizeof(float4)));
cudaUnbindTexture(texPos);
```
#### • Within kernel

```
float4 pal = tex1Dfetch( texPos, threadIdx.x);
```
#### • Writing to a texture that is currently read by some threads is undefined!!!

## **2D Access**

#### • Optimized for 2D / 3D locality

texture< float4, 2, cudaReadModeElementType> texImq;

#### Requires binding to special Array memory special memory layout

```
cudachannelFormatDesc floatText =cudaCreateChannelDesc<float4>();
float4*src;cudaArray* img;
cudaMallocArray ( &img, &floatTex, w, h);
cudaMemcpyToArray(img, 0, 0, src, w*h*sizeof(float4),
cudaMemcpyHostToDevice);
cudaBindTextureToArray(texImq, imq, floatTex));
cudaUnbindTexture(texImq);
```
## **2D Access**

• Within kernel

```
float4 r = \text{tex2D} ( texImq, x +xoff, y+yoff);
```
- **Pros** 
	- optimized for 2D locality (optimized memory layout / spacefilling curve)
- Cons
	- If the result of some kernel should be used as 2D texture cudaMemcpyToArray is required
	- You cannot write to a texture which is currently read from

#### • CUDA "surfaces" are writeable textures!

#### **Texture performance**

#### Texture :

- **Provides hardware accelerated filtered** sampling of data (1D, 2D, 3D)
- Read-only data cache holds fetched samples  $\qquad \qquad \bullet$
- Backed up by the L2 cache

#### SMX vs Fermi SM:

- 4x filter ops per clock
- 4x cache capacity



#### **Texture Cache Unlocked**

#### Added a new path for compute

- **Avoids the texture unit**
- Allows a global address to be fetched and cached  $\qquad \qquad \bullet$
- **Eliminates texture setup**

#### Why use it?

- Separate pipeline from shared/L1
- **Highest miss bandwidth**
- Flexible, e.g. unaligned accesses
- **Managed automatically by compiler** 
	- "const \_\_ restrict" indicates eligibility



## CUDA Memory: Global Memory **CUDA Memory:<br>Global Memory<br>• Memory coalescing<br>• Cached memory access (L2 / L1) CUDA Memory:<br>• Global Memory<br>• Memory coalescing<br>• Cached memory access (L2 / L1)**

- 
- 

# Memory and Cache Types Memory and Cache Types<br>
Global memory<br>
• [Device] **L2 cache**<br>
• [SM] **L1 cache** (shared mem carve



#### Global memory

- 
- Memory and Cache Types<br>• [SM] L1 cache<br>• [SM] L1 cache (shared mem carved out; or L1 shared with tex cache)<br>• [SM/TPC] Texture cache (separate, or shared with L1 cache)<br>• [SM] Pearl of the same (starse with the same of t • Memory and Cache Types<br>• [Device] • L2 cache<br>• [SM] • L1 cache (shared mem carved out; or L1 shared with tex cache)<br>• [SM] • Read-only data cache (storage might be same as tex cache) • [Device] L2 cache Types<br>• [Device] L2 cache<br>• [SM] L1 cache (shared mem carved out; or L1 shared with tex cache)<br>• [SM/TPC] Texture cache (separate, or shared with L1 cache)<br>• [SM] Read-only data cache (storage might be
- 
- 

#### Shared memory

• [Device] L2 cache<br>• [SM] L1 cache (shared mem carved out; or L1 shared with tex cache<br>• [SM/TPC] Texture cache (separate, or shared with L1 cache)<br>• [SM] Read-only data cache (storage might be same as tex cache)<br>Shared m (Hopper/CC 9.x: also thread block clusters) Shared memory<br>
• [SM] Shareable only between threads in same thread block<br>
(Hopper/CC 9.x: also thread block clusters)<br>
Constant memory: Constant (uniform) cache<br>
Unified memory programming: Device/host memory sharing<br>
Mar

Constant memory: Constant (uniform) cache

Unified memory programming: Device/host memory sharing

## **Maximize Byte Use**



#### • Two things to keep in mind:

- Memory accesses are per warp
- Memory is accessed in discrete chunks
	- lines/segments ۰
	- want to make sure that bytes that travel from DRAM to SMs get used
		- For that we should understand how memory system works

#### Note: not that different from CPUs

- x86 needs SSE/AVX memory instructions to maximize performance

## **GPU Memory System**



**GPU NEC** 

© 2013, NVI DIA

- All data lives in DRAM
	- Global memory
	- Local memory
	- Textures
	- $-$  Constants

## **GPU Memory System**



**Ude** 

© 2013, NVIDIA

- All DRAM accesses go through L<sub>2</sub>
- **Including copies:** 
	- $-$  P2P
	- CPU-GPU

## **GPU Memory System**



**Tide** 

© 2013, NVIDIA

- Once in an SM, data goes into one of 3 caches/buffers
- **Programmer's choice** 
	- <del>L1 is the "default"</del>
	- Read-only, Const require explicit code

## **Access Path**

#### L1 path

- Global memory
	- Memory allocated with cudaMalloc()
	- Mapped CPU memory, peer GPU memory
	- Globally-scoped arrays qualified with global
- Local memory
	- · allocation/access managed by compiler so we'll ignore

#### Read-only/TEX path

- Data in texture objects, CUDA arrays
- CC 3.5 and higher:
	- Global memory accessed via intrinsics (or specially qualified kernel arguments)

#### Constant path

- Globally-scoped arrays qualified with constant

## **Access Via L1**

#### Natively supported word sizes per thread:

- $-1B$ , 2B, 4B, 8B, 16B
	- Addresses must be aligned on word-size boundary
- Accessing types of other sizes will require multiple instructions
- Accesses are processed per warp
	- Threads in a warp provide 32 addresses
		- Fewer if some threads are inactive
	- HW converts addresses into memory transactions
		- Address pattern may require multiple transactions for an instruction
		- If **N** transactions are needed, there will be  $(N-1)$  replays of the instruction

#### Global Memory Access



all recent compute capabilities and **Addresses:** Global Memory Access<br>all recent<br>compute capabilities<br>(- 9.x)

Beware:

Uncached here means

the L2 cache is always used!





### Compute Capab. 3.x (Kepler, Part 1)



#### K.3.2. Global Memory

Global memory accesses for devices of compute capability 3.x are cached in L2 and for devices of compute capability 3.5 or 3.7, may also be cached in the read-only data cache described in the previous section; they are normally not cached in L1. Some devices of compute capability 3.5 and devices of compute capability 3.7 allow opt-in to caching of global memory accesses in  $L1$  via the  $-$ Xptxas  $-dlcm=ca$  option to nvcc.

A cache line is 128 bytes and maps to a 128 byte aligned segment in device memory. Memory accesses that are cached in both L1 and L2 are serviced with 128-byte memory transactions, whereas memory accesses that are cached in L2 only are serviced with 32-byte memory transactions. Caching in L2 only can therefore reduce over-fetch, for example, in the case of scattered memory accesses.

If the size of the words accessed by each thread is more than 4 bytes, a memory request by a warp is first split into separate 128-byte memory requests that are issued independently:

- Two memory requests, one for each half-warp, if the size is 8 bytes,
- Four memory requests, one for each quarter-warp, if the size is 16 bytes.

#### Compute Capab. 3.x (Kepler, Part 2)



Each memory request is then broken down into cache line requests that are issued independently. A cache line request is serviced at the throughput of L1 or L2 cache in case of a cache hit, or at the throughput of device memory, otherwise.

Note that threads can access any words in any order, including the same words.

If a non-atomic instruction executed by a warp writes to the same location in global memory for more than one of the threads of the warp, only one thread performs a write and which thread does it is undefined.

Data that is read-only for the entire lifetime of the kernel can also be cached in the read-only data cache described in the previous section by reading it using the ldg() function (see Read-Only Data Cache Load Function). When the compiler detects that the read-only condition is satisfied for some data, it will use ldg() to read it. The compiler might not always be able to detect that the read-only condition is satisfied for some data. Marking pointers used for loading such data with both the const and restrict qualifiers increases the likelihood that the compiler will detect the read-only condition.

Figure 21 shows some examples of global memory accesses and corresponding memory transactions.

#### Compute Capab. 5.x (Maxwell, Part 1)



#### K.4.2. Global Memory

Global memory accesses are always cached in L2 and caching in L2 behaves in the same way as for devices of compute capability 3.x (see Global Memory).

Data that is read-only for the entire lifetime of the kernel can also be cached in the unified L1/ texture cache described in the previous section by reading it using the ldg() function (see Markus Hadwiger, KAUST 38

#### Compute Capab. 5.x (Maxwell, Part 2)



Data that is not read-only for the entire lifetime of the kernel cannot be cached in the unified L1/texture cache for devices of compute capability 5.0. For devices of compute capability 5.2, it is, by default, not cached in the unified L1/texture cache, but caching may be enabled using the following mechanisms:

- Perform the read using inline assembly with the appropriate modifier as described in the  $\blacktriangleright$ PTX reference manual:
- 
- 

■ Compile with the -xptxas -dlcm=ca compilation flag, in which case all reads are cached, except reads that are performed using inline assembly with a modifier that disables caching;<br>
■ Compile with the -xptxas -fscm=ca c

#### PTX State Spaces (1)



Memory type/access etc. organized using notion of state spaces

#### Table 6 State Spaces



#### PTX State Spaces (2)



#### Table 7 Properties of State Spaces



Notes:

 $1$  Variables in . const and . global state spaces are initialized to zero by default.

 $2$  Accessible only via the 1d. param instruction. Address may be taken via mov instruction.

 $3$  Accessible via 1d. param and  $st.$  param instructions. Device function input and return parameters may have their address taken via mov; the parameter is then located on the stack frame and its address is in the . local state space.

 $4$  Accessible only via the tex instruction.

#### PTX Cache Operators



#### Table 27 Cache Operators for Memory Load Instructions



#### SASS LD/ST Instructions



#### Architecture-dep.



#### Kepler:

#### (see also LDG.CI etc.)

#### Compute Capab. 6.x (Pascal)



#### K.5.2. Global Memory

Global memory behaves the same way as in devices of compute capability 5.x (See Global Memory).

### Compute Capab. 7.x (Volta/Turing)



### K.6.3. Global Memory

Global memory behaves the same way as in devices of compute capability 5.x (See Global Memory).

### Compute Capab. 8.x (Ampere/Ada)



### K.7.2. Global Memory

Global memory behaves the same way as for devices of compute capability 5.x (See Global Memory).

### Compute Capab. 9.x (Hopper)



#### K.8.2. Global Memory

Global memory behaves the same way as for devices of compute capability 5.x (See Global Memory).

## Thank you.