



## CS 380 - GPU and GPGPU Programming Lecture 13: GPU Compute APIs, Pt. 3

Markus Hadwiger, KAUST

### Reading Assignment #7 (until Oct 20)



- Programming Massively Parallel Processors book, 3rd edition, Chapter 7 (*Parallel Patterns: Convolution*)
- PTX Instruction Set Architecture 7.4 (https://docs.nvidia.com/cuda/pdf/ptx\_isa\_7.4.pdf) Read Chapters 1 – 3; get an overview of Chapter 12; browse through the other chapters to get a feeling for what PTX looks like
- CUDA SASS, Chapter 4: https://docs.nvidia.com/cuda/pdf/CUDA\_Binary\_Utilities.pdf

Read (optional):

- Inline PTX Assembly in CUDA (CUDA SDK: Inline\_PTX\_Assembly.pdf)
- Dissecting GPU Architecture through Microbenchmarking:

Volta: https://arxiv.org/abs/1804.06826 Turing: https://arxiv.org/abs/1903.07486

https://developer.download.nvidia.com/video/gputechconf/gtc/2019/presentation/ s9839-discovering-the-turing-t4-gpu-architecture-with-microbenchmarks.pdf

Ampere: https://www.nvidia.com/en-us/on-demand/session/gtcspring21-s33322/

## Semester Project (proposal until Oct 25!)



- Choosing your own topic encouraged! (we will also suggest some topics)
  - Pick something that you think is really cool!
  - Can be completely graphics or completely computation, or both combined
  - Can be built on CS 380 frameworks, NVIDIA OpenGL SDK, CUDA SDK, ...
- Write short (1-2 pages) project proposal by end of Sep (announced later)
  - Talk to us before you start writing! (content and complexity should fit the lecture)
- Submit semester project with report (deadline: Dec 9)
- Present semester project (event in final exams week: Dec 13 (tentative))

#### Semester Project Ideas (1)



#### Some ideas for topics

- Procedural shading with noise + marble etc. (GPU Gems 2, chapter 26)
- Procedural shading with noise + bump mapping (GPU Gems 2, chapter 26)
- Subdivision surfaces (GPU Gems 2, chapter 7)
- Ambient occlusion, screen space ambient occlusion
- Shadow mapping, hard shadows, soft shadows
- Deferred shading
- Particle system rendering + CUDA particle sort
- Advanced image filters: fast bilateral filtering, Gaussian kD trees
- Advanced image de-convolution (e.g., convex L1 optimization)
- PDE solvers (e.g., anisotropic diffusion filtering, 2D level set segmentation, 2D fluid flow)

### Semester Project Ideas (2)



#### Some ideas for topics

- Distance field computation (GPU Gems 3, chapter 34)
- Livewire ("intelligent scissors") segmentation in CUDA
- Linear systems solvers, matrix factorization (Cholesky, ...); with/without CUBLAS
- CUDA + matlab
- Fractals (Sierpinski, Koch, ...)
- Image compression
- Bilateral grid filtering for multichannel images
- Discrete wavelet transforms
- Fast histogram computations
- Terrain rendering from height map images; clipmaps or adaptive tesselation

# **Matrix-Matrix Multiplication**

P=MN

Parallel08 – Memory Access

Hendrik Lensch and Robert Strzodka

#### Programming Model: Square Matrix Multiplication



Parallel08 – Memory Access

.

.

Hendrik Lensch and Robert Strzodka

# Multiply Using One Thread Block

- One block of threads computes matrix P
  - Each thread computes one element of P
- Each thread
  - Loads a row of matrix M
  - Loads a column of matrix N
  - Perform one multiply and addition for each pair of M and N elements
  - Compute to off-chip memory access ratio close to 1:1 (not very high)
- Size of matrix limited by the number of threads allowed in a thread block



### Matrix Multiplication Device-Side Kernel Function (cont.)

```
for (int k = 0; k < M.width; ++k)
    ſ
      float Melement = M.elements[ty * M.pitch + k];
      float Nelement = Nd.elements[k * N.pitch + tx];
     Pvalue += Melement * Nelement;
    }
    // Write the matrix to device memory;
    // each thread writes one element
   P.elements[ty * blockDim.x+ tx] = Pvalue;
}
                                                                    ty
                                                          tx
```

### Handling Arbitrary Sized Square Matrices

 Have each 2D thread block to compute a (BLOCK\_WIDTH)<sup>2</sup> sub-matrix (tile) of the result matrix Each has (BLOCK\_WIDTH)<sup>2</sup> threads Generate a 2D Grid of (WIDTH/BLOCK\_WIDTH)<sup>2</sup> blocks by You still need to put a loop around the kernel call for cases where ty WIDTH is greater than bx tx Max grid size!

## Multiply Using Several Blocks - Idea



Parallel08 – Memory Access

Hendrik Lensch and Robert Strzodka

## Multiply Using Several Blocks - Idea



Parallel08 – Memory Access

Hendrik Lensch and Robert Strzodka

#### Example: Matrix Multiplication (1)



 Copy matrices to device; invoke kernel; copy result matrix back to host

```
// Matrix multiplication - Host code
// Matrix dimensions are assumed to be multiples of BLOCK SIZE
void MatMul (const Matrix A, const Matrix B, Matrix C)
    // Load A and B to device memory
   Matrix d A;
    d A.width = d A.stride = A.width; d A.height = A.height;
    size t size = A.width * A.height * sizeof(float);
    cudaMalloc((void**)&d A.elements, size);
    cudaMemcpy(d A.elements, A.elements, size,
               cudaMemcpyHostToDevice);
    Matrix d B;
    d B.width = d B.stride = B.width; d B.height = B.height;
    size = B.width * B.height * sizeof(float);
    cudaMalloc((void**)&d B.elements, size);
    cudaMemcpy(d B.elements, B.elements, size,
               cudaMemcpyHostToDevice);
```

#### Example: Matrix Multiplication (2)



```
// Allocate C in device memory
Matrix d_C;
d_C.width = d_C.stride = C.width; d_C.height = C.height;
size = C.width * C.height * sizeof(float);
cudaMalloc((void**)&d_C.elements, size);
```

```
// Invoke kernel
dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
dim3 dimGrid(B.width / dimBlock.x, A.height / dimBlock.y);
MatMulKernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C);
```

```
// Free device memory
cudaFree(d_A.elements);
cudaFree(d_B.elements);
cudaFree(d_C.elements);
```

#### Example: Matrix Multiplication (3)





- Multiply matrix block-wise
- Set BLOCK\_SIZE for efficient hardware use, e.g., to 16 on cc. 1.x or 16 or 32 on cc. 2.x +

blockRow

- Maximize parallelism
  - Launch as many threads per block as block elements
  - Each thread fetches one element of block
  - Perform row \* column dot products in parallel

#### Example: Matrix Multiplication (4)

{

}

16



```
global void MatrixMul( float *matA, float *matB, float *matC, int w )
     shared float blockA[ BLOCK SIZE ][ BLOCK SIZE ];
     shared float blockB[ BLOCK SIZE ][ BLOCK SIZE ];
   int bx = blockIdx.x; int tx = threadIdx.x;
   int by = blockIdx.y; int ty = threadIdx.y;
   int col = bx * BLOCK SIZE + tx;
   int row = by * BLOCK SIZE + ty;
   float out = 0.0f;
   for ( int m = 0; m < w / BLOCK SIZE; m++ ) {</pre>
       blockA[ ty ][ tx ] = matA[ row * w + m * BLOCK SIZE + tx
                                                                           1;
       blockB[ ty ][ tx ] = matB[ col + ( m * BLOCK SIZE + ty ) * w ];
        syncthreads();
        for (int k = 0; k < BLOCK SIZE; k++) {
            out += blockA[ ty ][ k ] * blockB[ k ][ tx ];
        }
          syncthreads();
   }
                                                Caveat: for brevity, this code assumes matrix sizes
                                                are a multiple of the block size (either because
   matC[ row * w + col ] = out;
                                                they really are, or because padding is used;
                                                otherwise guard code would need to be added)
```



| М <sub>0,0</sub> | M <sub>1,0</sub> | M <sub>2,0</sub> | M <sub>3,0</sub> |
|------------------|------------------|------------------|------------------|
| M <sub>0,1</sub> | M <sub>1,1</sub> | M <sub>2,1</sub> | M <sub>3,1</sub> |
| M <sub>0,2</sub> | M <sub>1,2</sub> | M <sub>2,2</sub> | M <sub>3,2</sub> |
| M <sub>0,3</sub> | M <sub>1,3</sub> | M <sub>2,3</sub> | M <sub>3,3</sub> |



© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009 ECE 498AL, University of Illinois, Urbana-Champaign

# Memory Coalescing

- When accessing global memory, peak performance utilization occurs when all threads in a half warp (full warp on Fermi) access continuous memory locations.
- Requirements relaxed on >=1.2 devices; L1 cache on Fermi! Not coalesced coalesced



© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009 ECE 498AL, University of Illinois, Urbana-Champaign

# Memory Layout of a Matrix in C



19

# Memory Layout of a Matrix in C



ECE 498AL, University of Illinois, Urbana-Champaign

# Thank you.