

**KAUST** 

# CS 380 - GPU and GPGPU Programming Lecture 13: GPU Compute APIs, Pt. 2

Markus Hadwiger, KAUST

# Reading Assignment #7 + #8 (until Oct 23)

Read (required):

- Programming Massively Parallel Processors book (4th edition), Chapter 7 (Convolution)
- Programming Massively Parallel Processors book (4th edition), Chapter 8 (Stencil)

Read (optional):

- Inline PTX Assembly in CUDA: Inline\_PTX\_Assembly.pdf
- Dissecting GPU Architectures through Microbenchmarking:

Volta: https://arxiv.org/abs/1804.06826 Turing: https://arxiv.org/abs/1903.07486

https://developer.download.nvidia.com/video/gputechconf/gtc/2019/presentation/ s9839-discovering-the-turing-t4-gpu-architecture-with-microbenchmarks.pdf

**Ampere:** https://www.nvidia.com/en-us/on-demand/session/gtcspring21-s33322/

## **Next Lectures**



Lecture 14: Wed, Oct 12

no lectures on Oct 16 and Oct 19 ! (mid-semester break and IEEE VIS conference)

Lecture 15: Sun, Oct 23

Lecture 16: Wed, Oct 26

Lecture 17: Sun, Oct 30

Lecture 18: Tue, Nov 1 (make-up lecture; 16:00 – 17:15 ?)

Lecture 19: Wed, Nov 2

# **GPU Compute APIs**





# **Memory Architecture**



| Memory   | Location | Cached               | Access | Scope                  | Lifetime    |
|----------|----------|----------------------|--------|------------------------|-------------|
| Register | On-chip  | N/A                  | R/W    | One thread             | Thread      |
| Local    | Off-chip | <del>-N≎</del> * YES | R/W    | One thread             | Thread      |
| Shared   | On-chip  | N/A                  | R/W    | All threads in a block | Block       |
| Global   | Off-chip | <del>-No</del> * YES | R/W    | All threads + host     | Application |
| Constant | Off-chip | Yes                  | R      | All threads + host     | Application |
| Texture  | Off-chip | Yes                  | R      | All threads + host     | Application |

\* cached on Fermi or newer!

## (Memory) State Spaces



### PTX ISA 7.8 (Chapter 5)

| Name                                           | Addressable      | Initializable          | Access     | Sharing                  |  |
|------------------------------------------------|------------------|------------------------|------------|--------------------------|--|
| .reg No                                        |                  | No                     | R/W        | per-thread               |  |
| .sreg                                          | No               | No                     | RO         | per-CTA                  |  |
| .const                                         | Yes              | Yes <sup>1</sup>       | RO         | per-grid                 |  |
| .global                                        | Yes              | Yes <sup>1</sup>       | R/W        | Context                  |  |
| .local                                         | Yes              | No                     | R/W        | per-thread               |  |
| .param (as input to<br>kernel)                 | Yes <sup>2</sup> | No                     | RO         | per-grid                 |  |
| .param (used in Restricted <sup>3</sup> No R/W |                  | R/W                    | per-thread |                          |  |
| .shared                                        | Yes              | Yes No R/W per-cluster |            | per-cluster <sup>5</sup> |  |
| .tex                                           | No <sup>4</sup>  | Yes, via driver        | RO         | Context                  |  |

#### Notes:

<sup>1</sup> Variables in .const and .global state spaces are initialized to zero by default.

<sup>2</sup> Accessible only via the ld.param instruction. Address may be taken via mov instruction.

<sup>3</sup> Accessible via ld.param and st.param instructions. Device function input and return parameters may have their address taken via mov; the parameter is then located on the stack frame and its address is in the .local state space.

 $^4$  Accessible only via the tex instruction.

<sup>5</sup> Visible to the owning CTA and other active CTAs in the cluster.

## **Managing Memory**

Unified memory space can be enabled on Fermi / CUDA 4.x and newer CPU and GPU have separate memory spaces

## Host (CPU) code manages device (GPU) memory:

- Allocate / free
- Copy data to and from device
- Applies to global device memory (DRAM)



# **GPU Memory Allocation / Release**

cudaMalloc(void \*\* pointer, size\_t nbytes)
 cudaMemset(void \* pointer, int value, size\_t count)
 cudaFree(void\* pointer)

```
int n = 1024;
int nbytes = 1024*sizeof(int);
int *a_d = 0;
cudaMalloc( (void**)&a_d, nbytes );
cudaMemset( a_d, 0, nbytes);
cudaFree(a_d);
```

 $\infty$ 



# **Data Copies**

### cudaMemcpy(void \*dst, void \*src, size\_t nbytes, enum cudaMemcpyKind direction);

- direction specifies locations (host or device) of src and dst
- Blocks CPU thread: returns after the copy is complete
- Doesn't start copying until previous CUDA calls complete

### enum cudaMemcpyKind

- cudaMemcpyHostToDevice
- cudaMemcpyDeviceToHost
- cudaMemcpyDeviceToDevice

**IVISIOD** 0 E WORLD OF VISUAL COMPUTI

 $\infty$ 



int main(void)

float \*a\_h, \*b\_h; // host data float \*a\_d, \*b\_d; // device data int N = 14, nBytes, i ;

nBytes = N\*sizeof(float); a\_h = (float \*)malloc(nBytes); b\_h = (float \*)malloc(nBytes); cudaMalloc((void \*\*) &a\_d, nBytes); cudaMalloc((void \*\*) &b\_d, nBytes);

for (i=0, i<N; i++) a\_h[i] = 100.f + i;

cudaMemcpy(a\_d, a\_h, nBytes, cudaMemcpyHostToDevice); cudaMemcpy(b\_d, a\_d, nBytes, cudaMemcpyDeviceToDevice); cudaMemcpy(b\_h, b\_d, nBytes, cudaMemcpyDeviceToHost);

```
for (i=0; i< N; i++) assert( a_h[i] == b_h[i] );
free(a_h); free(b_h); cudaFree(a_d); cudaFree(b_d);
return 0;</pre>
```

**NVISION** 08 HE WORLD OF VISUAL COMPUTING



### int main(void)

float \*a\_h, \*b\_h; // host data float \*a\_d, \*b\_d; // device data int N = 14, nBytes, i ;

nBytes = N\*sizeof(float); a\_h = (float \*)malloc(nBytes); b\_h = (float \*)malloc(nBytes); cudaMalloc((void \*\*) &a\_d, nBytes); cudaMalloc((void \*\*) &b\_d, nBytes);

```
for (i=0, i<N; i++) a_h[i] = 100.f + i;
```

cudaMemcpy(a\_d, a\_h, nBytes, cudaMemcpyHostToDevice); cudaMemcpy(b\_d, a\_d, nBytes, cudaMemcpyDeviceToDevice); cudaMemcpy(b\_h, b\_d, nBytes, cudaMemcpyDeviceToHost);

for (i=0; i< N; i++) assert( a\_h[i] == b\_h[i] );
free(a\_h); free(b\_h); cudaFree(a\_d); cudaFree(b\_d);
return 0;</pre>



**NVISION** 08 HE WORLD OF VISUAL COMPUTING

}

 $\ensuremath{\mathbb{C}}$  2008 NVIDIA Corporation.



### int main(void)

float \*a\_h, \*b\_h; // host data float \*a\_d, \*b\_d; // device data int N = 14, nBytes, i ;

nBytes = N\*sizeof(float); a\_h = (float \*)malloc(nBytes); b\_h = (float \*)malloc(nBytes); cudaMalloc((void \*\*) &a\_d, nBytes); cudaMalloc((void \*\*) &b\_d, nBytes);

```
for (i=0, i<N; i++) a_h[i] = 100.f + i;
```

cudaMemcpy(a\_d, a\_h, nBytes, cudaMemcpyHostToDevice); cudaMemcpy(b\_d, a\_d, nBytes, cudaMemcpyDeviceToDevice); cudaMemcpy(b\_h, b\_d, nBytes, cudaMemcpyDeviceToHost);

for (i=0; i< N; i++) assert( a\_h[i] == b\_h[i] );
free(a\_h); free(b\_h); cudaFree(a\_d); cudaFree(b\_d);
return 0;</pre>



 $\ensuremath{\mathbb{C}}$  2008 NVIDIA Corporation.



#### int main(void)

float \*a\_h, \*b\_h; // host data float \*a\_d, \*b\_d; // device data int N = 14, nBytes, i ;

nBytes = N\*sizeof(float); a\_h = (float \*)malloc(nBytes); b\_h = (float \*)malloc(nBytes); cudaMalloc((void \*\*) &a\_d, nBytes); cudaMalloc((void \*\*) &b\_d, nBytes);

### for (i=0, i<N; i++) a\_h[i] = 100.f + i;

cudaMemcpy(a\_d, a\_h, nBytes, cudaMemcpyHostToDevice); cudaMemcpy(b\_d, a\_d, nBytes, cudaMemcpyDeviceToDevice); cudaMemcpy(b\_h, b\_d, nBytes, cudaMemcpyDeviceToHost);

for (i=0; i< N; i++) assert( a\_h[i] == b\_h[i] );
free(a\_h); free(b\_h); cudaFree(a\_d); cudaFree(b\_d);
return 0;</pre>





#### int main(void)

float \*a\_h, \*b\_h; // host data float \*a\_d, \*b\_d; // device data int N = 14, nBytes, i ;

nBytes = N\*sizeof(float); a\_h = (float \*)malloc(nBytes); b\_h = (float \*)malloc(nBytes); cudaMalloc((void \*\*) &a\_d, nBytes); cudaMalloc((void \*\*) &b\_d, nBytes);

for (i=0, i<N; i++) a\_h[i] = 100.f + i;

cudaMemcpy(a\_d, a\_h, nBytes, cudaMemcpyHostToDevice); cudaMemcpy(b\_d, a\_d, nBytes, cudaMemcpyDeviceToDevice); cudaMemcpy(b\_h, b\_d, nBytes, cudaMemcpyDeviceToHost);

for (i=0; i< N; i++) assert( a\_h[i] == b\_h[i] );
free(a\_h); free(b\_h); cudaFree(a\_d); cudaFree(b\_d);
return 0;</pre>



© 2008 NVIDIA Corporation.



cudaMemcpy(a\_d, a\_h, nBytes, cudaMemcpyHostToDevice);

cudaMemcpy(b\_d, a\_d, nBytes, cudaMemcpyDeviceToDevice); cudaMemcpy(b\_h, b\_d, nBytes, cudaMemcpyDeviceToHost);

### int main(void)

return 0;

}

float \*a\_h, \*b\_h; // host data float \*a\_d, \*b\_d; // device data int N = 14, nBytes, i ;

nBytes = N\*sizeof(float); a\_h = (float \*)malloc(nBytes); b\_h = (float \*)malloc(nBytes); cudaMalloc((void \*\*) &a\_d, nBytes); cudaMalloc((void \*\*) &b\_d, nBytes);

```
for (i=0, i<N; i++) a_h[i] = 100.f + i;
```

for (i=0; i< N; i++) assert( a\_h[i] == b\_h[i] );

free(a\_h); free(b\_h); cudaFree(a\_d); cudaFree(b\_d);

**NVISION** 08 HE WORLD OF VISUAL COMPUTING



 $\ensuremath{\mathbb{C}}$  2008 NVIDIA Corporation.



### int main(void)

float \*a\_h, \*b\_h; // host data float \*a\_d, \*b\_d; // device data int N = 14, nBytes, i ;

nBytes = N\*sizeof(float); a\_h = (float \*)malloc(nBytes); b\_h = (float \*)malloc(nBytes); cudaMalloc((void \*\*) &a\_d, nBytes); cudaMalloc((void \*\*) &b\_d, nBytes);

for (i=0, i<N; i++) a\_h[i] = 100.f + i;

cudaMemcpy(a\_d, a\_h, nBytes, cudaMemcpyHostToDevice); cudaMemcpy(b\_d, a\_d, nBytes, cudaMemcpyDeviceToDevice); cudaMemcpy(b\_h, b\_d, nBytes, cudaMemcpyDeviceToHost);

for (i=0; i< N; i++) assert( a\_h[i] == b\_h[i] );
free(a\_h); free(b\_h); cudaFree(a\_d); cudaFree(b\_d);
return 0;</pre>



 $\ensuremath{\mathbb{C}}$  2008 NVIDIA Corporation.



### int main(void)

float \*a\_h, \*b\_h; // host data float \*a\_d, \*b\_d; // device data int N = 14, nBytes, i ;

nBytes = N\*sizeof(float); a\_h = (float \*)malloc(nBytes); b\_h = (float \*)malloc(nBytes); cudaMalloc((void \*\*) &a\_d, nBytes); cudaMalloc((void \*\*) &b\_d, nBytes);

```
for (i=0, i<N; i++) a_h[i] = 100.f + i;
```

cudaMemcpy(a\_d, a\_h, nBytes, cudaMemcpyHostToDevice); cudaMemcpy(b\_d, a\_d, nBytes, cudaMemcpyDeviceToDevice); cudaMemcpy(b\_h, b\_d, nBytes, cudaMemcpyDeviceToHost);

### for (i=0; i< N; i++) assert( a\_h[i] == b\_h[i] );

free(a\_h); free(b\_h); cudaFree(a\_d); cudaFree(b\_d);
return 0;



TVISIOD 08 HE WORLD OF VISUAL COMPUTING

}

 $\ensuremath{\mathbb{C}}$  2008 NVIDIA Corporation.



### int main(void)

float \*a\_h, \*b\_h; // host data float \*a\_d, \*b\_d; // device data int N = 14, nBytes, i ;

nBytes = N\*sizeof(float); a\_h = (float \*)malloc(nBytes); b\_h = (float \*)malloc(nBytes); cudaMalloc((void \*\*) &a\_d, nBytes); cudaMalloc((void \*\*) &b\_d, nBytes);

```
for (i=0, i<N; i++) a_h[i] = 100.f + i;
```

cudaMemcpy(a\_d, a\_h, nBytes, cudaMemcpyHostToDevice); cudaMemcpy(b\_d, a\_d, nBytes, cudaMemcpyDeviceToDevice); cudaMemcpy(b\_h, b\_d, nBytes, cudaMemcpyDeviceToHost);

```
for (i=0; i< N; i++) assert( a_h[i] == b_h[i] );
free(a_h); free(b_h); cudaFree(a_d); cudaFree(b_d);
return 0;</pre>
```



 $\ensuremath{\mathbb{C}}$  2008 NVIDIA Corporation.



# **Executing Code on the GPU**

### Kernels are C functions with some restrictions

- Cannot access host memory
- except: (\*) and (\*\*)

- Must have void return type
- No variable number of arguments ("varargs")
- Not recursive) recursion supported on \_\_device\_\_ functions from
  - No static variables cc. 2.x (i.e., basically on all current GPUs)

### Function arguments automatically copied from host to device

(\*) "unified memory programming" introduced with CUDA 6 (cc. 3.x +): allocate memory with cudaMallocManaged(); uses automatic migration

(\*\*) also: mapped pinned (page-locked) memory ("zero-copy memory") : allocate memory with cudaMallocHost(); beware of low performance!!

Note: UVA ("unified virtual addressing"; cc. 2.x +) is something different!! just pertains to unified pointers (see cudaPointerGetAttributes(), ...)

TVISION 08 TE WORLD OF VISUAL COMPUTING

# **Function Qualifiers**



# Kernels designated by function qualifier:

Function called from host and executed on deviceMust return void

# Other CUDA function qualifiers \_\_\_device\_\_\_

- Function called from device and run on device
- Cannot be called from host code

## \_\_host\_\_

Function called from host and executed on host (default)
 <u>host</u> and <u>device</u> qualifiers can be combined to generate both CPU and GPU code

**VISIOD** WORLD OF VISUAL COMP

 $\infty$ 

© 2008 NVIDIA Corporation.



# Variable Qualifiers (GPU code)

### \_\_\_device\_

- Stored in global memory (large, high latency, no cache)
- Allocated with cudaMalloc (\_\_device\_\_ qualifier implied)
- Accessible by all threads
- Lifetime: application

### \_shared\_

- Stored in on-chip shared memory (very low latency)
- Specified by execution configuration or at compile time
- Accessible by all threads in the same thread block
- Lifetime: thread block

### Unqualified variables:

- Scalars and built-in vector types are stored in registers
- What doesn't fit in registers spills to "local" memory

CUDA 6+: \_\_managed\_\_ (with \_\_device\_\_) for managed memory (unified memory programming)

© 2008 NVIDIA Corporation.



TVISIOD 0 TE WORLD OF VISUAL COMPUTI

# Launching Kernels

Modified C function call syntax:

kernel<<<dim3 dG, dim3 dB>>>(...)

Execution Configuration ("<<< >>>")

- **dG** dimension and size of grid in blocks
  - Two-dimensional: x and y
  - Blocks launched in the grid: dG.x \* dG.y
- **dB** dimension and size of blocks in threads:
  - Three-dimensional: x, y, and z
  - Threads per block: dB.x \* dB.y \* dB.z
- Unspecified dim3 fields initialize to 1

© 2008 NVIDIA Corporation.



INISION (

# **CUDA Built-in Device Variables**

All <u>global</u> and <u>device</u> functions have access to these automatically defined variables

- dim3 gridDim;
  - Dimensions of the grid in blocks (at most 2D)
  - dim3 blockDim;
    - Dimensions of the block in threads
- dim3 blockIdx;
  - Block index within the grid
- dim3 threadIdx;
  - Thread index within the block









# **Increment Array Example**

### **CPU** program

```
void inc_cpu(int *a, int N)
                                    __global__ void inc_gpu(int *a, int N)
 int idx;
                                     int idx = blockldx.x * blockDim.x
                                              + threadldx.x;
 for (idx = 0; idx<N; idx++)
                                     if (idx < N)
   a[idx] = a[idx] + 1;  a[idx] = a[idx] + 1;
int main()
                                    int main()
  inc_cpu(a, N);
                                     dim3 dimBlock (blocksize);
                                     dim3 dimGrid( ceil( N / (float)blocksize) );
}
                                     inc_gpu<<<dimGrid, dimBlock>>>(a, N);
```

**CUDA** program

© 2008 NVIDIA Corporation.



# **Thread Cooperation**

The Missing Piece: threads may need to cooperate

### Thread cooperation is valuable

- Share results to avoid redundant computation
- Share memory accesses
  - Drastic bandwidth reduction
- Thread cooperation is a powerful feature of CUDA

- Cooperation between a monolithic array of threads is not scalable
  - Cooperation within smaller batches of threads is scalable

© 2008 NVIDIA Corporation



# **Host Synchronization**

## All kernel launches are asynchronous

- control returns to CPU immediately
- kernel executes after all previous CUDA calls have completed

## cudaMemcpy() is synchronous

- control returns to CPU after copy completes
- copy starts after all previous CUDA calls have completed

### CudaThreadSynchronize()

blocks until all previous CUDA calls complete

CUDA 4.x or newer: cudaDeviceSynchronize() and cudaStreamSynchronize()

**NUISION** 08 E WORLD OF VISUAL COMPUTING

© 2008 NVIDIA Corporation.



# **Host Synchronization Example**

// copy data from host to device cudaMemcpy(a\_d, a\_h, numBytes, cudaMemcpyHostToDevice);

// execute the kernel
inc\_gpu<<<ceil(N/(float)blocksize), blocksize>>>(a\_d, N);

// run independent CPU code
run\_cpu\_stuff();

// copy data from device back to host cudaMemcpy(a\_h, a\_d, numBytes, cudaMemcpyDeviceToHost);

**IVISION** 08 IE WORLD OF VISUAL COMPUTING

© 2008 NVIDIA Corporation.





# **Device Runtime Component: Synchronization Function**

- void \_\_syncthreads();
- Synchronizes all threads in a block
  - Once all threads have reached this point, execution resumes normally
  - Used to avoid RAW / WAR / WAW hazards when accessing shared
- Allowed in conditional code only if the conditional is uniform across the entire thread block

# Synchronization

- Threads in the same block can communicate using shared memory
- \_\_syncthreads()
  - -Barrier for threads only within the current block
- \_\_threadfence()
  - Flushes global memory writes to make them visible to all threads

Plus newer sync functions, e.g., from compute capability 2.x on:

\_\_syncthreads\_count(), \_\_syncthreads\_and/or(), \_\_threadfence\_block(), \_\_threadfence\_system(), …

Now: *Must* use versions with \_sync suffix, because of Independent Thread Scheduling (compute capability 7.x and newer)! 70



# **COOPERATIVE GROUPS VS BUILT-IN FUNCTIONS**

Example: warp aggregated atomic

| <pre>// increment the value at ptr by 1 and return the old valuedevice int atomicAggInc(int *p);</pre> |                                                     |  |  |  |
|--------------------------------------------------------------------------------------------------------|-----------------------------------------------------|--|--|--|
| <pre>coalesced_group g = coalesced_threads();</pre>                                                    | <pre>int mask =activemask();</pre>                  |  |  |  |
|                                                                                                        | <pre>int rank =popc(mask &amp;lanemask_lt());</pre> |  |  |  |
|                                                                                                        | <pre>int leader_lane =ffs(mask) - 1;</pre>          |  |  |  |
| int res;                                                                                               | int res;                                            |  |  |  |
| <pre>if (g.thread_rank() == 0)</pre>                                                                   | if (rank == 0)                                      |  |  |  |
| <pre>res = atomicAdd(p, g.size());</pre>                                                               | <pre>res = atomicAdd(p,popc(mask));</pre>           |  |  |  |
| <pre>res = g.shfl(res, 0);</pre>                                                                       | <pre>res =shfl_sync(mask, res, leader_lane);</pre>  |  |  |  |
| <pre>return g.thread_rank() + res;</pre>                                                               | return rank + res;                                  |  |  |  |
|                                                                                                        | 48 🥺 <b>NVIDIA.</b>                                 |  |  |  |

## New in CC 9.0: Thread Block Clusters



### New thread hierarchy level!



all threads of a block are on the same SM !





all blocks of a cluster are on the same GPC !



Markus Hadwiger, KAUST

# **Code Examples**

# **Example #1: 1D Convolution**

## Example #1: 1D Convolution



1D Convolution with 3-tap averaging kernel (every thread is averaging three inputs)

| input[0] | input[1]  | input[2]  | input[3]  | input[4]  | input[5]  | input[6]  | input[7]  | input[8]  | input[9] |
|----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|----------|
|          |           |           |           |           |           |           |           |           |          |
|          | output[0] | output[1] | output[2] | output[3] | output[4] | output[5] | output[6] | output[7] |          |

output[i] = (input[i] + input[i+1] + input[i+2]) / 3.f;

# Running on a GP104 (Pascal) SM



#define THREADS PER BLK 128 global void convolve(int N, float\* input, float\* output) \_\_shared\_\_ float support[THREADS\_PER\_BLK+2]; int index = blockIdx.x \* blockDim.x + threadIdx.x; support[threadIdx.x] = input[index]; if (threadIdx.x < 2) {</pre> support[THREADS\_PER\_BLK+threadIdx.x] = input[index+THREADS PER BLK]; } \_\_syncthreads(); float result = 0.0f; // thread-local for (int i=0; i<3; i++)</pre> result += support[threadIdx.x + i]; output[index] = result / 3.f;

Recall, CUDA kernels execute as SPMD programs

On NVIDIA GPUs groups of 32 CUDA threads share an instruction stream. These groups called "warps". A convolve thread block is executed by 4 warps (4 warps x 32 threads/warp = 128 CUDA threads per block) (Warps are an important GPU implementation detail, but not a CUDA abstraction!)

SM core operation each clock:

- Select up to four runnable warps from 64 resident on SM core (thread-level parallelism)
- Select up to two runnable instructions per warp (instruction-level parallelism) \* (but no ALU dual-issue!)

# Running on a V100 (Volta) SM





A convolve thread block is executed by 4 warps (4 warps x 32 threads/warp = 128 CUDA threads per block)

SM core operation each clock:

- Each sub-core selects one runnable warp (from the 16 warps in its partition)
- Each sub-core runs next instruction for the CUDA threads in the warp (this instruction may apply to all or a subset of the CUDA threads in a warp depending on divergence)

courtesy Kayvon Fatahalian

Stanford CS149, Fall 2021

(sub-core == SM partition)

## Code on Same SM Arch. But Different #SMs



# Assigning work



(16 cores)



(6 cores)

# Desirable for CUDA program to run on all of these GPUs without modification

Note: there is no concept of num\_cores in the CUDA programs I have shown you. (CUDA thread launch is similar in spirit to a forall loop in data parallel model examples)

(could now be up to 144 SMs, etc., ...)

# Thank you.