A C microbenchmark for comparing several matrix multiplication implementations, from a naive triple loop to cache-aware and SIMD-accelerated variants.
This project is useful for studying how loop ordering, cache behavior, transposition, and vector instructions affect dense GEMM-like workloads on modern CPUs.
- Single-binary benchmark (`matrix`) with selectable algorithm variants
- Configurable square matrix size (`-n`)
- Deterministic random initialization (fixed seed) for reproducible checksums
- Simple timing output suitable for scripting
- Includes a helper evaluation script (`eval.sh`) for size sweeps
Repository files:

- `matrix.c`: benchmark implementation and algorithm variants
- `Makefile`: build configuration
- `eval.sh`: quick performance sweep by matrix size for a selected algorithm
- Linux
- GCC toolchain with C11 support (`gcc`, `make`)
Build with `make`. By default, the Makefile uses `-Ofast`, `-march=native`, `-flto`, and `-std=c11`.
To clean build artifacts: `make clean`

Usage: `./matrix [-n dimension] [-a algorithm]`

- `-n`: square matrix dimension (default: `1024`)
- `-a`: algorithm selector
Algorithm IDs:
- `0`: naive `i-j-k`
- `1`: loop reorder `i-k-j`
- `2`: loop reorder + tiling (block size 256)
- `3`: transpose `B`, then multiply
- `4`: transpose + SIMD path (AVX2/SSE/NEON, depending on build target)
- `99`: run all algorithms in sequence
Show help: `./matrix -h`

Run all variants for a 256x256 matrix: `./matrix -n 256 -a 99`

Typical output format:
matmult_opt0 0.044723 chsum: -648.131751
matmult_opt1 0.004007 chsum: -648.131751
matmult_opt2 0.003787 chsum: -648.131751
matmult_opt3 0.016045 chsum: -648.131751
matmult_opt4 0.003884 chsum: -648.131646
Fields:
- first column: algorithm label
- second column: elapsed time in seconds
- third column: checksum of the output matrix (`C`) for quick correctness sanity checks
`eval.sh` sweeps dimensions and prints a compact table.
Usage: `./eval.sh <algorithm>`

Example: `./eval.sh 1`

Output columns:

- `n`: matrix dimension
- `ws`: approximate matrix working-set size in KiB (`n*n*4*3/1024`)
- `dur`: measured duration in seconds
- Random inputs are generated with a fixed seed (`srand(292)`), so checksums are reproducible for the same build and architecture.
- Small checksum differences between SIMD and non-SIMD paths are expected due to floating-point accumulation order.
- Tiled and SIMD kernels perform best when dimensions align with vector/block widths.
- Algorithm `99` runs all variants sequentially, so runtime can be long for large matrices.
- SIMD behavior depends on the target ISA selected by the compiler and CPU (`AVX2`, `SSE`, or `NEON`).
- Run on an otherwise idle machine.
- Pin the process to a core if needed using `taskset`.
- Repeat runs and compare medians rather than single samples.
The implementation references optimization ideas discussed in: