



#### CS 380 - GPU and GPGPU Programming Lecture 2: Introduction, Pt. 2

Markus Hadwiger, KAUST

#### Reading Assignment #1 (until Sep 7)

Read (required):

- Orange book, chapter 1 (Review of OpenGL Basics)
- Orange book, chapter 2 (*Basics*)

#### What are GPUs?

**Graphics** Processing Units

But evolved toward

- Very flexible, massively parallel floating point co-processors
- But not entirely programmable!
- Fixed-function parts have definite advantages (e.g., texture filtering, z-buffering)

We will cover both perspectives

- GPUs for graphics
- GPU computing (GPGPU: general-purpose computation on GPUs)









#### Peak Performance



[Figure: theoretical peak GFLOP/s at base clock (y-axis, 500 to 11,000) over the years 2003 to 2015 (x-axis), plotted for NVIDIA GPU single precision, NVIDIA GPU double precision, Intel CPU single precision, and Intel CPU double precision.]








## **RISE OF GPU COMPUTING**





Original data up to the year 2010 collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten. New plot and data collected for 2010-2015 by K. Rupp.

#### **GPU Architectures Over the Years**



# **GPU Roadmap**








#### **Recent Updates**



#### **NVIDIA Ampere architecture (2020)**

https://en.wikipedia.org/wiki/Ampere\_(microarchitecture)

Promo presentation from Sep 1, 2020:

https://www.nvidia.com/en-us/geforce/special-event/

GeForce 30-series (Ampere):

https://nvidia.com/en-us/geforce/graphics-cards/30-series/

RTX 3090 has 10,496 CUDA cores

A100 (Ampere):

https://www.nvidia.com/en-us/data-center/a100/

A100 has 6,912 CUDA cores
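As a rough illustration of how such core counts relate to theoretical peak throughput, the sketch below multiplies cores, FLOPs per core per cycle, and clock rate. The core counts are the ones listed above; the boost clocks (1.695 GHz and 1.410 GHz) are assumptions taken from NVIDIA's published specifications, and counting one fused multiply-add as 2 FLOPs is the usual convention rather than something stated on the slide. The helper name `peak_tflops` is made up for this example.

```python
# Theoretical peak FP32 throughput = cores x FLOPs per core per cycle x clock.
# Core counts come from the slide; the clocks below are ASSUMED boost clocks.

def peak_tflops(cuda_cores: int, clock_ghz: float, flops_per_cycle: int = 2) -> float:
    """Return the theoretical peak in TFLOP/s (a fused multiply-add counts as 2 FLOPs)."""
    return cuda_cores * flops_per_cycle * clock_ghz / 1000.0

gpus = {
    "RTX 3090": (10_496, 1.695),  # (CUDA cores, assumed boost clock in GHz)
    "A100": (6_912, 1.410),       # (CUDA cores, assumed boost clock in GHz)
}

for name, (cores, clock) in gpus.items():
    print(f"{name}: {peak_tflops(cores, clock):.1f} TFLOP/s FP32 (theoretical)")
```

These are upper bounds only; sustained throughput depends on memory bandwidth, occupancy, and the instruction mix.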






#### **Overviews and Specs**



Wikipedia has many comprehensive lists of architectures and specs

https://en.wikipedia.org/wiki/List\_of\_Nvidia\_graphics\_processing\_units

https://en.wikipedia.org/wiki/List\_of\_AMD\_graphics\_processing\_units

#### What is in a GPU?



Lots of floating point processing power

- Stream processing cores (different names: stream processors, CUDA cores, ...)
- Formerly vector processing, now scalar cores!

Still lots of fixed graphics functionality

- Attribute interpolation (per-vertex -> per-fragment)
- Rasterization (turning triangles into fragments/pixels)
- Texture sampling and filtering
- Depth buffering (per-pixel visibility)
- Blending/compositing (semi-transparent geometry, ...)
- Frame buffers
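To make one of these fixed-function stages concrete, here is a minimal software sketch of bilinear texture filtering, which the hardware performs per texture lookup. The conventions are assumptions chosen for brevity (texel centers at integer coordinates, clamp-to-edge addressing, a single grayscale channel), and `bilinear_sample` is a hypothetical helper, not a graphics API call.

```python
import math

def bilinear_sample(texture, u, v):
    """Sample a grayscale texture (list of rows) at continuous texel coordinates.

    Assumed conventions: texel (x, y) is centered at integer coordinates (x, y),
    and lookups outside the texture are clamped to the nearest edge texel.
    """
    height, width = len(texture), len(texture[0])
    x0, y0 = math.floor(u), math.floor(v)
    fx, fy = u - x0, v - y0  # fractional position inside the 2x2 texel footprint

    def texel(x, y):
        x = min(max(x, 0), width - 1)
        y = min(max(y, 0), height - 1)
        return texture[y][x]

    # Interpolate horizontally on both rows, then vertically between the rows.
    top = (1 - fx) * texel(x0, y0) + fx * texel(x0 + 1, y0)
    bottom = (1 - fx) * texel(x0, y0 + 1) + fx * texel(x0 + 1, y0 + 1)
    return (1 - fy) * top + fy * bottom

checker = [[0.0, 1.0],
           [1.0, 0.0]]
print(bilinear_sample(checker, 0.5, 0.5))   # equal mix of all four texels
print(bilinear_sample(checker, 0.25, 0.0))  # a quarter of the way from texel 0 to texel 1
```

Doing this in dedicated hardware is one reason fixed-function texture units remain attractive: every filtered lookup needs four memory reads and three interpolations.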



#### Example for "Special Cores": Tensor Cores



Mixed-precision, fast matrix-matrix multiply and accumulate
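The operation can be sketched in software as D = A·B + C, with A and B rounded to FP16 and the accumulation carried out in FP32. This is only an emulation under stated assumptions: the 4x4 operand size is assumed (the slide's figure is not reproduced here), rounding is emulated with Python's `struct` formats `e` (half) and `f` (single), and the helper names are hypothetical.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def mma(a, b, c):
    """Return D = A*B + C with FP16 inputs A, B and an FP32 accumulator."""
    n, k, m = len(a), len(b), len(b[0])
    d = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            acc = to_fp32(c[i][j])
            for p in range(k):
                # The product of two FP16 values fits exactly in FP32.
                prod = to_fp16(a[i][p]) * to_fp16(b[p][j])
                acc = to_fp32(acc + prod)
            d[i][j] = acc
    return d

identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
c = [[2048.0] * 4 for _ in range(4)]
d = mma(identity, identity, c)
print(d[0][0])           # 2049.0: the FP32 accumulator keeps the added 1
print(to_fp16(d[0][0]))  # 2048.0: an FP16 accumulator would have lost it
```

The last two lines show why the accumulator is kept at higher precision: FP16 cannot represent 2049, so small contributions would vanish from a long sum.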



From this, build larger sizes, higher dimensionalities, ...
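One way to picture building larger sizes is tiling: a large matrix product is decomposed into many small block multiply-accumulate steps. The sketch below assumes a 4x4 tile and matrix dimensions divisible by the tile size; `tile_mma` stands in for the hardware primitive and is a hypothetical name, and plain Python floats are used instead of mixed precision to keep the focus on the decomposition.

```python
TILE = 4  # assumed tile edge length, for illustration only

def tile_mma(a, b, c, i0, j0, k0):
    """Accumulate the TILE x TILE block product of A and B into block (i0, j0) of C."""
    for i in range(TILE):
        for j in range(TILE):
            acc = c[i0 + i][j0 + j]
            for k in range(TILE):
                acc += a[i0 + i][k0 + k] * b[k0 + k][j0 + j]
            c[i0 + i][j0 + j] = acc

def tiled_matmul(a, b):
    """Multiply A (n x k) by B (k x m) using only TILE x TILE block operations."""
    n, k, m = len(a), len(b), len(b[0])
    assert n % TILE == 0 and k % TILE == 0 and m % TILE == 0
    c = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, TILE):
        for j0 in range(0, m, TILE):
            for k0 in range(0, k, TILE):  # sum of block products along the shared dimension
                tile_mma(a, b, c, i0, j0, k0)
    return c

def naive_matmul(a, b):
    """Reference triple loop, used only to check the tiled version."""
    return [[sum(a[i][p] * b[p][j] for p in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

a = [[float(i + j) for j in range(8)] for i in range(8)]
b = [[float(i - j) for j in range(8)] for i in range(8)]
print(tiled_matmul(a, b) == naive_matmul(a, b))
```

On a GPU the three outer loops would be distributed across warps and thread blocks rather than executed sequentially.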

#### **NVIDIA Volta SM**

#### Multiprocessor: SM

- 64 FP32 cores + 64 INT32 cores
- 32 FP64 cores
- 8 tensor cores (FP16/FP32 mixed-precision)

#### 4 partitions inside SM

- 16 FP32 cores + 16 INT32 cores each
- 8 FP64 cores each
- 8 LD/ST units each
- 2 tensor cores each
- Each has: warp scheduler, dispatch unit, register file
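The per-SM totals follow directly from the per-partition counts. The short check below multiplies them out; the figure of 80 SMs for a Tesla V100 is an assumption taken from NVIDIA's published specifications, not from this slide.

```python
PARTITIONS_PER_SM = 4

# Execution units in one SM partition, as listed above.
per_partition = {"FP32": 16, "INT32": 16, "FP64": 8, "LD/ST": 8, "tensor": 2}

# Totals per SM are the partition counts times the number of partitions.
per_sm = {unit: count * PARTITIONS_PER_SM for unit, count in per_partition.items()}
for unit, count in per_sm.items():
    print(f"{unit}: {count} per SM")

V100_SMS = 80  # ASSUMPTION: number of enabled SMs on a Tesla V100
print("FP32 cores on the whole GPU:", per_sm["FP32"] * V100_SMS)
print("Tensor cores on the whole GPU:", per_sm["tensor"] * V100_SMS)
```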

[Figure: Volta SM block diagram. A shared L1 instruction cache feeds four partitions; each partition has an L0 instruction cache, a warp scheduler (32 thread/clk), a dispatch unit (32 thread/clk), a register file (16,384 x 32-bit), FP64, INT, and FP32 cores, two tensor cores, eight LD/ST units, and an SFU. An L1 data cache sits at the bottom of the SM.]

# **Real-time graphics primitives (entities)**



Represent surface as a 3D triangle mesh

[Figure: triangle mesh with four labeled vertices (1 to 4).]

- Vertices
- Primitives (e.g., triangles, points, lines)
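A common way to store such a mesh is an indexed representation: a vertex array plus a list of triangles that refer to vertices by index. The sketch below is a minimal illustration; the coordinates are made up (a unit square split into two triangles), since the slide only shows four labeled vertices.

```python
import math

# Vertex positions (hypothetical coordinates for the four labeled vertices).
vertices = [
    (0.0, 0.0, 0.0),  # vertex 1
    (1.0, 0.0, 0.0),  # vertex 2
    (1.0, 1.0, 0.0),  # vertex 3
    (0.0, 1.0, 0.0),  # vertex 4
]

# Each primitive is a triple of zero-based indices into the vertex array,
# so shared vertices are stored only once.
triangles = [(0, 1, 2), (0, 2, 3)]

def triangle_area(p, q, r):
    """Area of a 3D triangle: half the length of the cross product of two edges."""
    ux, uy, uz = (q[i] - p[i] for i in range(3))
    vx, vy, vz = (r[i] - p[i] for i in range(3))
    cx, cy, cz = uy * vz - uz * vy, uz * vx - ux * vz, ux * vy - uy * vx
    return 0.5 * math.sqrt(cx * cx + cy * cy + cz * cz)

total_area = sum(triangle_area(*(vertices[i] for i in tri)) for tri in triangles)
print(total_area)
```

The same split into a vertex buffer and an index buffer is what graphics APIs expect when a mesh is submitted for drawing.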




Courtesy Kayvon Fatahalian, CMU

### What can the hardware do?



#### Rasterization

- Decomposition into fragments
- Interpolation of color
- Texturing
  - Interpolation/Filtering
- Fragment Shading
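The first two steps can be sketched in software: test each pixel center against the triangle's three edge functions, and use the resulting barycentric weights to interpolate the vertex colors. This is a simplified illustration, not how the hardware is implemented; it assumes 2D screen-space vertices, samples at pixel centers, includes pixels exactly on an edge, and ignores fill rules and perspective correction. The function names are hypothetical.

```python
def edge(ax, ay, bx, by, px, py):
    """Edge function: twice the signed area of the triangle (a, b, p)."""
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax)

def rasterize(tri, colors, width, height):
    """Return {(x, y): (r, g, b)} for all pixel centers covered by a 2D triangle."""
    (x0, y0), (x1, y1), (x2, y2) = tri
    area = edge(x0, y0, x1, y1, x2, y2)
    fragments = {}
    if area == 0:
        return fragments  # degenerate triangle covers nothing
    for y in range(height):
        for x in range(width):
            px, py = x + 0.5, y + 0.5  # sample at the pixel center
            # Barycentric weights; dividing by the signed area handles both windings.
            w0 = edge(x1, y1, x2, y2, px, py) / area
            w1 = edge(x2, y2, x0, y0, px, py) / area
            w2 = edge(x0, y0, x1, y1, px, py) / area
            if w0 >= 0 and w1 >= 0 and w2 >= 0:
                # Interpolate each color channel with the same weights.
                fragments[(x, y)] = tuple(
                    w0 * c0 + w1 * c1 + w2 * c2 for c0, c1, c2 in zip(*colors)
                )
    return fragments

tri = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]
colors = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]  # red, green, blue
fragments = rasterize(tri, colors, 4, 4)
print(len(fragments), "fragments")
print(fragments[(0, 0)])  # mostly red near the first vertex
```

Texture coordinates, normals, and depth are interpolated with the same weights, which is why the hardware treats them uniformly as per-vertex attributes.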

#### Fragment Operations

- Depth Test (Z-Test)
- Alpha Blending (Compositing)
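Both fragment operations can be expressed in a few lines. The sketch below assumes a "less than" depth test with smaller values closer to the viewer, and standard source-alpha "over" blending; these are common defaults rather than the only modes the hardware offers, and `process_fragment` is a hypothetical helper. Writing depth for semi-transparent fragments is also a simplification, since real renderers usually sort transparent geometry and disable depth writes for it.

```python
def process_fragment(framebuffer, depthbuffer, x, y, color, alpha, depth):
    """Depth-test one fragment and, if it passes, blend it over the stored color.

    Assumptions: smaller depth means closer, the test is 'less than', and
    blending is alpha * source + (1 - alpha) * destination.
    Returns True if the fragment was written.
    """
    if depth >= depthbuffer[y][x]:
        return False  # hidden behind what is already stored
    dst = framebuffer[y][x]
    framebuffer[y][x] = tuple(alpha * s + (1 - alpha) * d for s, d in zip(color, dst))
    depthbuffer[y][x] = depth
    return True

# A 1x1 render target: black background, depth cleared to the far plane.
framebuffer = [[(0.0, 0.0, 0.0)]]
depthbuffer = [[1.0]]

results = [
    process_fragment(framebuffer, depthbuffer, 0, 0, (1.0, 0.0, 0.0), 1.0, 0.5),   # opaque red
    process_fragment(framebuffer, depthbuffer, 0, 0, (0.0, 1.0, 0.0), 1.0, 0.75),  # green, behind red
    process_fragment(framebuffer, depthbuffer, 0, 0, (0.0, 0.0, 1.0), 0.5, 0.25),  # half-transparent blue
]
print(results)
print(framebuffer[0][0])
```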



## **Graphics Pipeline**





#### **Geometry Processing**







#### **Fragment Operations**












## Thank you.