

# **CS 380 - GPU and GPGPU Programming Lecture 13: GPU Compute APIs 2**

Markus Hadwiger, KAUST

## Reading Assignment #7 (until Oct 19)



#### Read (required):

- Read https://en.wikipedia.org/wiki/Instruction pipelining
- CUDA NVCC doc (CUDA SDK: CUDA\_Compiler\_Driver\_NVCC.pdf)
  Read Chapters 1 3; Chapter 5; get an overview of the rest
- PTX Instruction Set Architecture 7.1 in CUDA SDK (ptx\_isa\_7.1.pdf)
   Read Chapters 1 3; get an overview of Chapter 12;
   browse through the other chapters to get a feeling for what PTX looks like
- Look at CUDA SASS in CUDA SDK: CUDA\_Binary\_Utilities.pdf, Chapter 4

#### Read (optional):

- Inline PTX Assembly in CUDA (CUDA SDK: Inline\_PTX\_Assembly.pdf)
- Dissecting GPU Architecture through Microbenchmarking:

```
Volta: https://arxiv.org/abs/1804.06826
Turing: https://arxiv.org/abs/1903.07486
```

https://developer.download.nvidia.com/video/gputechconf/gtc/2019/presentation/s9839-discovering-the-turing-t4-gpu-architecture-with-microbenchmarks.pdf

#### **CUDA Multi-Threading**

- CUDA model groups threads into blocks; blocks into grid
- Execution on actual hardware:
  - Block assigned to SM (up to 8, 16, or 32 blocks per SM; depending on compute capability)
  - 32 threads grouped into warp



#### **Execution Model**

#### **Software**







Threads are executed by thread processors





Thread blocks are executed on multiprocessors

Thread blocks do not migrate

Several concurrent thread blocks can reside on one multiprocessor - limited by multiprocessor resources (shared memory and register file)





A kernel is launched as a grid of thread blocks

Only one kernel can execute on a device at one time



## **Kernel Memory Access**

Per-thread



On-chip

Off-chip, uncached

cached on Fermi or newer

Per-block

Block



- On-chip, small
- Fast

Per-device



- Off-chip, large
- Uncached
- Persistent across kernel launches
- Kernel I/O

cached on Fermi or newer





## **Memory Architecture**



| Memory   | Location | Cached | Access | Scope                  | Lifetime    |
|----------|----------|--------|--------|------------------------|-------------|
| Register | On-chip  | N/A    | R/W    | One thread             | Thread      |
| Local    | Off-chip | No*    | R/W    | One thread             | Thread      |
| Shared   | On-chip  | N/A    | R/W    | All threads in a block | Block       |
| Global   | Off-chip | No*    | R/W    | All threads + host     | Application |
| Constant | Off-chip | Yes    | R      | All threads + host     | Application |
| Texture  | Off-chip | Yes    | R      | All threads + host     | Application |

\* cached on Fermi or newer

#### (Memory) State Spaces



#### PTX ISA 7.1 (Chapter 5)

| Name                        | Addressable             | Initializable    | Access | Sharing    |
|-----------------------------|-------------------------|------------------|--------|------------|
| .reg                        | No                      | No               | R/W    | per-thread |
| .sreg                       | No                      | No               | RO     | per-CTA    |
| .const                      | Yes                     | Yes <sup>1</sup> | R0     | per-grid   |
| .global                     | Yes                     | Yes <sup>1</sup> | R/W    | Context    |
| .local                      | Yes                     | No               | R/W    | per-thread |
| .param (as input to kernel) | Yes <sup>2</sup>        | No               | RO     | per-grid   |
| .param (used in functions)  | Restricted <sup>3</sup> | No               | R/W    | per-thread |
| .shared                     | Yes                     | No               | R/W    | per-CTA    |
| .tex                        | No <sup>4</sup>         | Yes, via driver  | RO     | Context    |

#### Notes:

<sup>&</sup>lt;sup>1</sup> Variables in .const and .global state spaces are initialized to zero by default.

 $<sup>^2</sup>$  Accessible only via the 1d.param instruction. Address may be taken via mov instruction.

<sup>&</sup>lt;sup>3</sup> Accessible via ld.param and st.param instructions. Device function input and return parameters may have their address taken via mov; the parameter is then located on the stack frame and its address is in the .local state space.

<sup>&</sup>lt;sup>4</sup> Accessible only via the tex instruction.

## **Managing Memory**

Unified memory space can be enabled on Fermi / CUDA 4.x and newer

- CPU and GPU have separate memory spaces
- Host (CPU) code manages device (GPU) memory:
  - Allocate / free
  - Copy data to and from device
  - Applies to global device memory (DRAM)



#### **GPU Memory Allocation / Release**

- cudaMalloc(void \*\* pointer, size\_t nbytes)
- cudaMemset(void \* pointer, int value, size\_t count)
- cudaFree(void\* pointer)

```
int n = 1024;
int nbytes = 1024*sizeof(int);
int *a d = 0;
cudaMalloc( (void**)&a d, nbytes );
cudaMemset( a_d, 0, nbytes);
cudaFree(a d);
```

#### **Data Copies**

- cudaMemcpy(void \*dst, void \*src, size\_t nbytes, enum cudaMemcpyKind direction);
  - direction specifies locations (host or device) of src and dst
  - Blocks CPU thread: returns after the copy is complete
  - Doesn't start copying until previous CUDA calls complete
- enum cudaMemcpyKind
  - cudaMemcpyHostToDevice
  - cudaMemcpyDeviceToHost
  - cudaMemcpyDeviceToDevice



```
int main(void)
 float *a_h, *b_h; // host data
 float *a_d, *b_d; // device data
 int N = 14, nBytes, i;
 nBytes = N*sizeof(float);
 a_h = (float *)malloc(nBytes);
 b_h = (float *)malloc(nBytes);
 cudaMalloc((void **) &a_d, nBytes);
 cudaMalloc((void **) &b_d, nBytes);
 for (i=0, i< N; i++) a h[i] = 100.f + i;
 cudaMemcpy(a_d, a_h, nBytes, cudaMemcpyHostToDevice);
 cudaMemcpy(b_d, a_d, nBytes, cudaMemcpyDeviceToDevice);
 cudaMemcpy(b_h, b_d, nBytes, cudaMemcpyDeviceToHost);
 for (i=0; i< N; i++) assert( a_h[i] == b_h[i] );
 free(a_h); free(b_h); cudaFree(a_d); cudaFree(b_d);
 return 0;
```

```
int main(void)
 float *a h, *b h; // host data
 float *a_d, *b_d; // device data
 int N = 14, nBytes, i;
 nBytes = N*sizeof(float);
 a_h = (float *)malloc(nBytes);
 b_h = (float *)malloc(nBytes);
 cudaMalloc((void **) &a_d, nBytes);
 cudaMalloc((void **) &b_d, nBytes);
 for (i=0, i< N; i++) a h[i] = 100.f + i;
 cudaMemcpy(a_d, a_h, nBytes, cudaMemcpyHostToDevice);
 cudaMemcpy(b_d, a_d, nBytes, cudaMemcpyDeviceToDevice);
 cudaMemcpy(b_h, b_d, nBytes, cudaMemcpyDeviceToHost);
 for (i=0; i< N; i++) assert( a_h[i] == b_h[i] );
 free(a_h); free(b_h); cudaFree(a_d); cudaFree(b_d);
 return 0;
```

© 2008 NVIDIA Corporation.

Host

a h

b h

```
int main(void)
 float *a h, *b h; // host data
 float *a_d, *b_d; // device data
 int N = 14, nBytes, i;
 nBytes = N*sizeof(float);
 a_h = (float *)malloc(nBytes);
 b_h = (float *)malloc(nBytes);
 cudaMalloc((void **) &a_d, nBytes);
 cudaMalloc((void **) &b_d, nBytes);
 for (i=0, i< N; i++) a h[i] = 100.f + i;
 cudaMemcpy(a_d, a_h, nBytes, cudaMemcpyHostToDevice);
 cudaMemcpy(b_d, a_d, nBytes, cudaMemcpyDeviceToDevice);
 cudaMemcpy(b_h, b_d, nBytes, cudaMemcpyDeviceToHost);
 for (i=0; i< N; i++) assert( a_h[i] == b_h[i] );
 free(a_h); free(b_h); cudaFree(a_d); cudaFree(b_d);
 return 0;
```

Host

a h

b h

Device

a d

b d

int main(void)

```
float *a h, *b h; // host data
float *a_d, *b_d; // device data
int N = 14, nBytes, i;
nBytes = N*sizeof(float);
a_h = (float *)malloc(nBytes);
b_h = (float *)malloc(nBytes);
cudaMalloc((void **) &a_d, nBytes);
cudaMalloc((void **) &b_d, nBytes);
for (i=0, i<N; i++) a_h[i] = 100.f + i;
cudaMemcpy(a_d, a_h, nBytes, cudaMemcpyHostToDevice);
cudaMemcpy(b_d, a_d, nBytes, cudaMemcpyDeviceToDevice);
cudaMemcpy(b_h, b_d, nBytes, cudaMemcpyDeviceToHost);
for (i=0; i< N; i++) assert( a_h[i] == b_h[i] );
free(a_h); free(b_h); cudaFree(a_d); cudaFree(b_d);
return 0;
```

Device

Host

a h

b h

a d

b d

```
int main(void)
 float *a h, *b h; // host data
                                                                 Host
 float *a_d, *b_d; // device data
 int N = 14, nBytes, i;
 nBytes = N*sizeof(float);
 a_h = (float *)malloc(nBytes);
 b_h = (float *)malloc(nBytes);
 cudaMalloc((void **) &a_d, nBytes);
 cudaMalloc((void **) &b_d, nBytes);
 for (i=0, i< N; i++) a h[i] = 100.f + i;
 cudaMemcpy(a_d, a_h, nBytes, cudaMemcpyHostToDevice);
 cudaMemcpy(b_d, a_d, nBytes, cudaMemcpyDeviceToDevice);
 cudaMemcpy(b_h, b_d, nBytes, cudaMemcpyDeviceToHost);
 for (i=0; i< N; i++) assert( a_h[i] == b_h[i] );
 free(a_h); free(b_h); cudaFree(a_d); cudaFree(b_d);
 return 0;
```

Device

b\_d

a h

b h

```
int main(void)
 float *a h, *b h; // host data
 float *a_d, *b_d; // device data
 int N = 14, nBytes, i;
 nBytes = N*sizeof(float);
 a_h = (float *)malloc(nBytes);
 b_h = (float *)malloc(nBytes);
 cudaMalloc((void **) &a_d, nBytes);
 cudaMalloc((void **) &b_d, nBytes);
 for (i=0, i< N; i++) a h[i] = 100.f + i;
 cudaMemcpy(a_d, a_h, nBytes, cudaMemcpyHostToDevice);
 cudaMemcpy(b_d, a_d, nBytes, cudaMemcpyDeviceToDevice);
 cudaMemcpy(b_h, b_d, nBytes, cudaMemcpyDeviceToHost);
 for (i=0; i< N; i++) assert( a_h[i] == b_h[i] );
 free(a_h); free(b_h); cudaFree(a_d); cudaFree(b_d);
 return 0;
```

Device

a\_d

b\_d

Host

a h

b h

```
int main(void)
 float *a h, *b h; // host data
 float *a_d, *b_d; // device data
 int N = 14, nBytes, i;
 nBytes = N*sizeof(float);
 a_h = (float *)malloc(nBytes);
 b_h = (float *)malloc(nBytes);
 cudaMalloc((void **) &a_d, nBytes);
 cudaMalloc((void **) &b_d, nBytes);
 for (i=0, i< N; i++) a h[i] = 100.f + i;
 cudaMemcpy(a_d, a_h, nBytes, cudaMemcpyHostToDevice);
 cudaMemcpy(b_d, a_d, nBytes, cudaMemcpyDeviceToDevice);
 cudaMemcpy(b_h, b_d, nBytes, cudaMemcpyDeviceToHost);
 for (i=0; i< N; i++) assert( a_h[i] == b_h[i] );
 free(a_h); free(b_h); cudaFree(a_d); cudaFree(b_d);
 return 0;
```



```
int main(void)
 float *a h, *b h; // host data
 float *a_d, *b_d; // device data
 int N = 14, nBytes, i;
 nBytes = N*sizeof(float);
 a_h = (float *)malloc(nBytes);
 b_h = (float *)malloc(nBytes);
 cudaMalloc((void **) &a_d, nBytes);
 cudaMalloc((void **) &b_d, nBytes);
 for (i=0, i< N; i++) a h[i] = 100.f + i;
 cudaMemcpy(a_d, a_h, nBytes, cudaMemcpyHostToDevice);
 cudaMemcpy(b_d, a_d, nBytes, cudaMemcpyDeviceToDevice);
 cudaMemcpy(b_h, b_d, nBytes, cudaMemcpyDeviceToHost);
 for (i=0; i< N; i++) assert( a h[i] == b h[i] );
 free(a_h); free(b_h); cudaFree(a_d); cudaFree(b_d);
 return 0;
```

Host

a\_h

b\_h

Device

a d

b d

```
int main(void)
 float *a h, *b h; // host data
 float *a_d, *b_d; // device data
 int N = 14, nBytes, i;
 nBytes = N*sizeof(float);
 a_h = (float *)malloc(nBytes);
 b_h = (float *)malloc(nBytes);
 cudaMalloc((void **) &a_d, nBytes);
 cudaMalloc((void **) &b_d, nBytes);
 for (i=0, i< N; i++) a h[i] = 100.f + i;
 cudaMemcpy(a_d, a_h, nBytes, cudaMemcpyHostToDevice);
 cudaMemcpy(b_d, a_d, nBytes, cudaMemcpyDeviceToDevice);
 cudaMemcpy(b_h, b_d, nBytes, cudaMemcpyDeviceToHost);
 for (i=0; i< N; i++) assert( a_h[i] == b_h[i] );
 free(a_h); free(b_h); cudaFree(a_d); cudaFree(b_d);
 return 0;
```

Device

Host

#### **Executing Code on the GPU**

- Kernels are C functions with some restrictions
  - Cannot access host memory except: (\*) and (\*\*)
  - Must have void return type
  - No variable number of arguments ("varargs")
  - ( Not recursive ) recursion supported on \_\_device\_\_ functions from
  - No static variables
    cc. 2.x (i.e., basically on all current GPUs)
- Function arguments automatically copied from host to device
  - (\*) "unified memory programming" introduced with CUDA 6 (cc. 3.x +): allocate memory with cudaMallocManaged(); uses automatic migration
  - (\*\*) also: mapped pinned (page-locked) memory ("zero-copy memory") : allocate memory with cudaMallocHost(); beware of low performance!!

Note: UVA ("unified virtual addressing"; cc. 2.x +) is something different!! just pertains to unified pointers (see cudaPointerGetAttributes(), ...) ☑ □VIDIA.

#### **Function Qualifiers**

- Kernels designated by function qualifier:
  - global
    - Function called from host and executed on device
    - Must return void
- Other CUDA function qualifiers
  - device
    - Function called from device and run on device
    - Cannot be called from host code
  - host
    - Function called from host and executed on host (default)
    - \_\_device\_\_ qualifiers can be combined to and generate both CPU and GPU code



#### Variable Qualifiers (GPU code)

- \_\_device
  - Stored in global memory (large, high latency, no cache)
  - Allocated with cudaMalloc (\_\_device\_\_ qualifier implied)
  - Accessible by all threads
  - Lifetime: application
- \_\_shared\_
  - Stored in on-chip shared memory (very low latency)
  - Specified by execution configuration or at compile time
  - Accessible by all threads in the same thread block
  - Lifetime: thread block
- Unqualified variables:
  - Scalars and built-in vector types are stored in registers
  - What doesn't fit in registers spills to "local" memory

CUDA 6+: \_\_managed\_\_ (with \_\_device\_\_) for managed memory (unified memory programming)



## **Launching Kernels**

Modified C function call syntax:

kernel << dim3 dG, dim3 dB>>> (...)

- Execution Configuration ("<<< >>>")
  - dG dimension and size of grid in blocks
    - Two-dimensional: x and y
    - Blocks launched in the grid: dG.x \* dG.y
  - dB dimension and size of blocks in threads:
    - Three-dimensional: x, y, and z
    - Threads per block: dB.x \* dB.y \* dB.z
  - Unspecified dim3 fields initialize to 1

#### **CUDA Built-in Device Variables**

- All <u>global</u> and <u>device</u> functions have access to these automatically defined variables
  - dim3 gridDim;
    - Dimensions of the grid in blocks (at most 2D)
  - dim3 blockDim;
    - Dimensions of the block in threads
  - dim3 blockIdx;
    - Block index within the grid
  - dim3 threadIdx;
    - Thread index within the block

## **Unique Thread IDs**

- Built-in variables are used to determine unique thread IDs
  - Map from local thread ID (threadIdx) to a global ID which can be used as array indices

blockldx.x
blockDim.x = 5
threadIdx.x

0 1 2 3 4

5 6 7 8 9

10 11 12 13 14

blockldx.x\*blockDim.x + threadIdx.x





#### **Increment Array Example**

#### **CPU** program

```
void inc_cpu(int *a, int N)
 int idx;
 for (idx = 0; idx<N; idx++)
   a[idx] = a[idx] + 1;
int main()
  inc_cpu(a, N);
```

#### **CUDA** program

```
__global__ void inc_gpu(int *a, int N)
  int idx = blockldx.x * blockDim.x
           + threadIdx.x;
 if (idx < N)
a[idx] = a[idx] + 1;
 int main()
  dim3 dimBlock (blocksize);
  dim3 dimGrid( ceil( N / (float)blocksize) );
  inc_gpu<<<dimGrid, dimBlock>>>(a, N);
```

## **Thread Cooperation**

- The Missing Piece: threads may need to cooperate
- Thread cooperation is valuable
  - Share results to avoid redundant computation
  - Share memory accesses
    - Drastic bandwidth reduction
- Thread cooperation is a powerful feature of CUDA
- Cooperation between a monolithic array of threads is not scalable
  - Cooperation within smaller batches of threads is scalable



#### **Host Synchronization**

- All kernel launches are asynchronous
  - control returns to CPU immediately
  - kernel executes after all previous CUDA calls have completed
- cudaMemcpy() is synchronous
  - control returns to CPU after copy completes
  - copy starts after all previous CUDA calls have completed
- cudaThreadSynchronize()
  - blocks until all previous CUDA calls complete

**CUDA 4.x or newer:** cudaDeviceSynchronize() and cudaStreamSynchronize()



### **Host Synchronization Example**

```
// copy data from host to device
cudaMemcpy(a_d, a_h, numBytes, cudaMemcpyHostToDevice);
// execute the kernel
inc_gpu<<<ceil(N/(float)blocksize), blocksize>>>(a_d, N);
// run independent CPU code
run_cpu_stuff();
// copy data from device back to host
cudaMemcpy(a_h, a_d, numBytes, cudaMemcpyDeviceToHost);
```

# Device Runtime Component: Synchronization Function



- void \_\_syncthreads();
- Synchronizes all threads in a block
  - Once all threads have reached this point, execution resumes normally
  - Used to avoid RAW / WAR / WAW hazards when accessing shared
- Allowed in conditional code only if the conditional is uniform across the entire thread block

## Synchronization

- Threads in the same block can communicate using shared memory
- No HW global synchronization function yet
- \_\_syncthreads()
  - -Barrier for threads only within the current block
- \_\_threadfence()
  - Flushes global memory writes to make them visible to all threads

```
Plus newer sync functions, e.g., from compute capability 2.x: __syncthreads_count(), __syncthreads_and/or(), __threadfence_system(), ...
```

Now: *Must* use versions with \_sync suffix, because of Independent Thread Scheduling (compute capability 7.x and newer)!

## Matrix-Matrix Multiplication

P=MN

# Programming Model: Square Matrix Multiplication

- P = M \* N of size WIDTH x WIDTH
- Without tiling:
  - One thread handles one element of P
  - M and N are loaded WIDTH times from global memory



## Multiply Using One Thread Block

- One block of threads computes matrix P
  - Each thread computes one element of P
- Each thread
  - Loads a row of matrix M
  - Loads a column of matrix N
  - Perform one multiply and addition for each pair of M and N elements
  - Compute to off-chip memory access ratio close to 1:1 (not very high)
- Size of matrix limited by the number of threads allowed in a thread block



# Matrix Multiplication Device-Side Kernel Function (cont.)

```
for (int k = 0; k < M.width; ++k)
  float Melement = M.elements[ty * M.pitch + k];
  float Nelement = Nd.elements[k * N.pitch + tx];
  Pvalue += Melement * Nelement;
// Write the matrix to device memory;
// each thread writes one element
P.elements[ty * blockDim.x+ tx] = Pvalue;
                                                               ty
                                                      tx
```

#### Handling Arbitrary Sized Square Matrices

 Have each 2D thread block to compute a (BLOCK\_WIDTH)<sup>2</sup> sub-matrix (tile) of the result matrix

- Each has (BLOCK\_WIDTH)<sup>2</sup> threads
- Generate a 2D Grid of (WIDTH/BLOCK\_WIDTH)<sup>2</sup> blocks

You still need to put a loop around the kernel call for cases where WIDTH is greater than Max grid size!



#### Multiply Using Several Blocks - Idea

- One thread per element of P
- Load sub-blocks of M and N into shared memory
- Each thread reads one element of M and one of N
- Reuse each sub-block for all threads, i.e. for all elements of P
- Outer loop on sub-blocks



Parallel08 - Memory Access

Hendrik Lensch and Robert Strzodka

tx

012

bsize-1

#### Multiply Using Several Blocks - Idea

- One thread per element of P
- Load sub-blocks of M and N into shared memory
- Each thread reads one element of M and one of N
- Reuse each sub-block for all threads, i.e. for all elements of P
- Outer loop on sub-blocks



Parallel08 - Memory Access

Hendrik Lensch and Robert Strzodka

tx

bsize-1

012



#### Example: Matrix Multiplication (1)

 Copy matrices to device; invoke kernel; copy result matrix back to host

```
// Matrix multiplication - Host code
// Matrix dimensions are assumed to be multiples of BLOCK SIZE
void MatMul (const Matrix A, const Matrix B, Matrix C)
    // Load A and B to device memory
   Matrix d A;
    d A.width = d A.stride = A.width; d A.height = A.height;
    size t size = A.width * A.height * sizeof(float);
    cudaMalloc((void**)&d A.elements, size);
    cudaMemcpy(d A.elements, A.elements, size,
               cudaMemcpyHostToDevice);
    Matrix d B;
    d B.width = d B.stride = B.width; d B.height = B.height;
    size = B.width * B.height * sizeof(float);
    cudaMalloc((void**)&d B.elements, size);
    cudaMemcpy(d B.elements, B.elements, size,
               cudaMemcpyHostToDevice);
```



#### Example: Matrix Multiplication (2)

```
// Allocate C in device memory
Matrix d C;
d C.width = d C.stride = C.width; d C.height = C.height;
size = C.width * C.height * sizeof(float);
cudaMalloc((void**)&d C.elements, size);
// Invoke kernel
dim3 dimBlock (BLOCK SIZE, BLOCK SIZE);
dim3 dimGrid(B.width / dimBlock.x, A.height / dimBlock.y);
MatMulKernel << dimGrid, dimBlock>>> (d A, d B, d C);
// Read C from device memory
cudaMemcpy(C.elements, d C.elements, size,
           cudaMemcpyDeviceToHost);
// Free device memory
cudaFree (d A.elements);
cudaFree (d B.elements);
cudaFree (d C.elements);
```





- Multiply matrix block-wise
- Set BLOCK\_SIZE for efficient hardware use, e.g., to 16 on cc. 1.x or 16 or 32 on cc. 2.x +
- Maximize parallelism
  - Launch as many threads per block as block elements
  - Each thread fetches one element of block
  - Perform row \* column dot products in parallel





#### Example: Matrix Multiplication (4)

```
global void MatrixMul( float *matA, float *matB, float *matC, int w )
     shared float blockA[ BLOCK SIZE ][ BLOCK SIZE ];
     shared float blockB[ BLOCK SIZE ][ BLOCK SIZE ];
   int bx = blockIdx.x; int tx = threadIdx.x;
   int by = blockIdx.y; int ty = threadIdx.y;
   int col = bx * BLOCK SIZE + tx;
   int row = by * BLOCK SIZE + ty;
   float out = 0.0f;
   for ( int m = 0; m < w / BLOCK SIZE; m++ ) {
       blockA[ ty ][ tx ] = matA[ row * w + m * BLOCK SIZE + tx
       blockB[ ty ][ tx ] = matB[ col + ( m * BLOCK SIZE + ty ) * w ];
       syncthreads();
       for ( int k = 0; k < BLOCK SIZE; k++ ) {
           out += blockA[ ty ][ k ] * blockB[ k ][ tx ];
         syncthreads();
   }
   matC[ row * w + col ] = out;
```

Caveat: for brevity, this code assumes matrix sizes are a multiple of the block size (either because they really are, or because padding is used; otherwise guard code would need to be added)

}

