


## CS 380 - GPU and GPGPU Programming

Lecture 3: Introduction, Pt. 3

Markus Hadwiger, KAUST

## Reading Assignment #2 (until Sep 11)



Read (required):

• Orange book (GLSL), Chapter 4 (*The OpenGL Programmable Pipeline*)



• Nice brief overviews of GLSL and the legacy assembly shading language:

  https://en.wikipedia.org/wiki/OpenGL\_Shading\_Language

  https://en.wikipedia.org/wiki/ARB\_assembly\_language

• GPU Gems 2 book, Chapter 30 (*The GeForce 6 Series GPU Architecture*)

http://download.nvidia.com/developer/GPU\_Gems\_2/GPU\_Gems2\_ch30.pdf

## Programming Assignments: Schedule (tentative)

Assignment #1:

- Querying the GPU (OpenGL/GLSL and CUDA), due Sep 4

Assignment #2:

- Phong shading and procedural texturing (GLSL)

Assignment #3:

- Deferred Shading and Image Processing with GLSL, due Oct 2

Assignment #4:

- Image Processing with CUDA
- Convolutional layers with CUDA, due Oct 23

Assignment #5:

- Linear Algebra (CUDA), due Nov 13

## What is in a GPU?



#### Lots of floating point processing power

- Processors, different names: ALUs, stream processors (SP), CUDA cores, FP32 cores, FP64 cores, ...
- Was vector processing, now scalar cores!


#### Still lots of fixed graphics functionality

- Attribute interpolation (per-vertex  $\rightarrow$  per-fragment)
- Rasterization (triangles  $\rightarrow$  fragments/pixels)
- Texture sampling and filtering
- Depth buffering (per-pixel visibility)
- Blending/compositing (semi-transparent geometry, ...)
- Frame buffers (and implicit atomic operations in ROPs)



## **NVIDIA Volta SM**

#### Multiprocessor: SM

- 64 FP32 + INT32 cores
- 32 FP64 cores
- 8 tensor cores (FP16/FP32 mixed-precision)

### 4 partitions inside SM

- 16 FP32 + INT32 cores each
- 8 FP64 cores each
- 8 LD/ST units each
- 2 tensor cores each
- Each has: warp scheduler, dispatch unit, register file
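As a quick consistency check, the per-partition counts above multiply out to the per-SM totals. The Python sketch below only does that arithmetic. The dictionary layout and names are illustrative choices, and the register file size (16,384 32-bit registers per partition) is read off the SM block diagram.

```python
# Per-partition resources of a Volta SM, as listed above.
PARTITIONS_PER_SM = 4
per_partition = {
    "fp32_cores": 16,   # each partition also has 16 INT32 cores
    "int32_cores": 16,
    "fp64_cores": 8,
    "ld_st_units": 8,
    "tensor_cores": 2,
}

# Per-SM totals are the per-partition counts times the number of partitions.
per_sm = {name: count * PARTITIONS_PER_SM for name, count in per_partition.items()}

# Register file: 16,384 registers of 32 bits (4 bytes) in each partition.
REGISTERS_PER_PARTITION = 16_384
regfile_kib_per_partition = REGISTERS_PER_PARTITION * 4 // 1024
regfile_kib_per_sm = regfile_kib_per_partition * PARTITIONS_PER_SM

if __name__ == "__main__":
    for name, count in per_sm.items():
        print(f"{name}: {count} per SM")
    print(f"register file: {regfile_kib_per_partition} KiB per partition, "
          f"{regfile_kib_per_sm} KiB per SM")
```

The totals match the SM-level list: 64 FP32 + INT32 cores, 32 FP64 cores, and 8 tensor cores.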

[Figure: Volta SM block diagram. A shared L1 instruction cache feeds four partitions. Each partition has an L0 instruction cache, a warp scheduler (32 threads/clk), a dispatch unit (32 threads/clk), a register file (16,384 x 32-bit), FP64, INT, and FP32 cores, two tensor cores, eight LD/ST units, and one SFU. All partitions share a 128 KB L1 data cache / shared memory.]

## Example for "Special Cores": Tensor Cores



Mixed-precision, fast matrix-matrix multiply and accumulate



From this, build larger sizes, higher dimensionalities, ...

Newer versions have additional precisions/formats, ...
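To make "mixed-precision multiply and accumulate" concrete, here is a small software model in Python of D = A·B + C with FP16 inputs and FP32 accumulation. It is a sketch under assumptions, not a description of the hardware: the function names are made up for illustration, the 4x4 tile size in the demo is an assumption (the slide figure is not reproduced here), and rounding after every accumulation step is a modeling choice that is not guaranteed to match what a tensor core does bit for bit.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def mma(a, b, c):
    """Return D = A*B + C with A, B rounded to FP16 and FP32 accumulation.

    Software model only: every partial sum is rounded to FP32.
    """
    n, k, m = len(a), len(b), len(b[0])
    d = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            acc = to_fp32(c[i][j])
            for p in range(k):
                # The product of two FP16 values fits exactly in FP32.
                acc = to_fp32(acc + to_fp16(a[i][p]) * to_fp16(b[p][j]))
            d[i][j] = acc
    return d

if __name__ == "__main__":
    identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    b = [[float(4 * i + j + 1) for j in range(4)] for i in range(4)]
    c = [[0.5] * 4 for _ in range(4)]
    for row in mma(identity, b, c):
        print(row)
    # FP16 inputs lose precision: 0.1 is not representable exactly.
    print(to_fp16(0.1))
```

Larger matrix products can then be assembled from such small tiles by looping over blocks of A, B, and C, which is the "build larger sizes" idea from the slide.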

# **Real-time graphics primitives (entities)**



**Represent surface as a 3D triangle mesh** 

Vertices

[Figure: four vertices, labeled 1 to 4]



Primitives (e.g., triangles, points, lines)




Courtesy Kayvon Fatahalian, CMU
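A triangle mesh like the one described above is commonly stored as a vertex array plus a list of index triples, one per triangle. The Python sketch below shows that layout for four vertices. The coordinates are made up for illustration (the slide only labels the vertices 1 to 4), and the helper names are hypothetical.

```python
import math

# Four vertices (x, y, z), numbered 1 to 4 on the slide, stored 0-based here.
# The coordinates are illustrative assumptions, not taken from the slide.
vertices = [
    (0.0, 0.0, 0.0),  # vertex 1
    (1.0, 0.0, 0.0),  # vertex 2
    (1.0, 1.0, 0.0),  # vertex 3
    (0.0, 1.0, 0.0),  # vertex 4
]

# Each triangle primitive is a triple of indices into the vertex list,
# so shared vertices are stored only once.
triangles = [(0, 1, 2), (0, 2, 3)]

def triangle_positions(tri):
    """Look up the three vertex positions referenced by an index triple."""
    return tuple(vertices[i] for i in tri)

def triangle_area(tri):
    """Area of a triangle: half the length of the edge cross product."""
    p0, p1, p2 = triangle_positions(tri)
    e1 = tuple(b - a for a, b in zip(p0, p1))
    e2 = tuple(b - a for a, b in zip(p0, p2))
    cross = (
        e1[1] * e2[2] - e1[2] * e2[1],
        e1[2] * e2[0] - e1[0] * e2[2],
        e1[0] * e2[1] - e1[1] * e2[0],
    )
    return 0.5 * math.sqrt(sum(c * c for c in cross))

if __name__ == "__main__":
    for tri in triangles:
        print(tri, triangle_positions(tri), triangle_area(tri))
```

Points and lines fit the same scheme with one or two indices per primitive instead of three.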

## What can the hardware do?



### Rasterization

- Decomposition into fragments
- Interpolation of color
- Texturing
  - Interpolation/filtering
- Fragment shading
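The first two rasterization steps above (decomposition into fragments, interpolation of color) can be sketched in a few lines of Python. This is a simplified software model under stated assumptions: it tests pixel centers against three edge functions, uses barycentric weights for interpolation, expects counter-clockwise vertex order, and omits the fill rules and perspective correction that real hardware applies. All function names are illustrative.

```python
def edge(a, b, p):
    """Twice the signed area of triangle (a, b, p); positive if p is left of a->b."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])

def rasterize(v0, v1, v2, c0, c1, c2, width, height):
    """Return a list of (x, y, color) fragments for a counter-clockwise 2D triangle."""
    fragments = []
    area = edge(v0, v1, v2)
    if area <= 0:
        return fragments  # degenerate or clockwise triangle: nothing to draw
    for y in range(height):
        for x in range(width):
            p = (x + 0.5, y + 0.5)  # sample at the pixel center
            w0 = edge(v1, v2, p)
            w1 = edge(v2, v0, p)
            w2 = edge(v0, v1, p)
            if w0 >= 0 and w1 >= 0 and w2 >= 0:
                # Barycentric weights sum to 1 and blend the vertex colors.
                b0, b1, b2 = w0 / area, w1 / area, w2 / area
                color = tuple(b0 * r + b1 * g + b2 * b
                              for r, g, b in zip(c0, c1, c2))
                fragments.append((x, y, color))
    return fragments

if __name__ == "__main__":
    red, green, blue = (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)
    frags = rasterize((0, 0), (4, 0), (0, 4), red, green, blue, 4, 4)
    for frag in frags:
        print(frag)
```

Any other per-vertex attribute, such as texture coordinates or normals, can be interpolated with the same weights.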

### Fragment operations (or: raster operations)

- Depth test (Z-test)
- Alpha blending (compositing)

- ...
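The two fragment operations named above can be modeled as follows in Python. This is a minimal sketch with several assumptions labeled in the code: a "less than" depth comparison, the common "source over destination" blend with the source alpha, and depth writes only for fully opaque fragments. The function names and buffer layout are hypothetical.

```python
def depth_test(frag_depth, stored_depth):
    """Assumed 'less than' depth function: pass if the fragment is closer."""
    return frag_depth < stored_depth

def blend_over(src_rgb, src_alpha, dst_rgb):
    """Alpha blending (compositing): src * alpha + dst * (1 - alpha)."""
    return tuple(src_alpha * s + (1.0 - src_alpha) * d
                 for s, d in zip(src_rgb, dst_rgb))

def process_fragment(color_buffer, depth_buffer, x, y, depth, rgb, alpha=1.0):
    """Apply depth test, then blending; return True if the fragment was kept."""
    if not depth_test(depth, depth_buffer[y][x]):
        return False  # hidden behind what is already in the frame buffer
    color_buffer[y][x] = blend_over(rgb, alpha, color_buffer[y][x])
    if alpha == 1.0:
        # Assumption: only opaque fragments update the depth buffer.
        depth_buffer[y][x] = depth
    return True

if __name__ == "__main__":
    color = [[(0.0, 0.0, 0.0)]]   # 1x1 color buffer, cleared to black
    depth = [[1.0]]               # 1x1 depth buffer, cleared to the far plane
    print(process_fragment(color, depth, 0, 0, 0.5, (1.0, 0.0, 0.0)))
    print(process_fragment(color, depth, 0, 0, 0.8, (0.0, 1.0, 0.0)))
    print(process_fragment(color, depth, 0, 0, 0.25, (0.0, 0.0, 1.0), alpha=0.5))
    print(color[0][0], depth[0][0])
```

In the demo, the opaque red fragment is written, the green fragment behind it is rejected, and the semi-transparent blue fragment in front is blended with the stored red.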



## **Graphics Pipeline**





### **Geometry Processing**







### Fragment (Raster) Operations












# **Graphics pipeline architecture**

Performs operations on vertices, triangles, fragments, and pixels



## Direct3D 10 Pipeline (~OpenGL 3.2)



Courtesy David Blythe, Microsoft

## Direct3D 11 Pipeline (~OpenGL 4.x)



#### New tessellation stages

- Hull shader
  - (OpenGL: tessellation control)
- Tessellator
  - (OpenGL: tessellation primitive generator)
- Domain shader
  - (OpenGL: tessellation evaluation)

#### Outside this pipeline

- Compute shader
- (Ray tracing cores, D3D 12)
- (Mesh shader pipeline, D3D 12.2)




## **Direct3D 12 Traditional Geometry Pipeline**





## Direct3D 12 Mesh Shader Pipeline



Reinventing the Geometry Pipeline

- Mesh and amplification shaders: new high-performance geometry pipeline based on compute shaders (DX 12 Ultimate / feature level 12.2)
- Compute shader-style replacement of IA/VS/HS/Tess/DS/GS



See talk by Shawn Hargreaves: https://www.youtube.com/watch?v=CFXKTXtil34

## Vulkan (1.3)








• Mesh and task shaders: new high-performance geometry pipeline based on compute shaders

• Mesh and task shaders are also available as an OpenGL 4.5/4.6 extension: GL\_NV\_mesh\_shader

[Figure: traditional pipeline. Pipelined memory, keeping interstage data on chip.]



### **Motivational Examples**



#### Doom (2016)

http://www.adriancourreges.com/blog/2016/09/09/doom-2016-graphics-study/

#### Doom Eternal

https://simoncoenen.com/blog/programming/graphics/DoomEternalStudy.html

#### Unreal Engine 5

## Thank you.