Skip to content

CSL-KU/matmult

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

31 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Matrix Multiplication Microbenchmark (matmult)

A C microbenchmark for comparing several matrix multiplication implementations, from a naive triple loop to cache-aware and SIMD-accelerated variants.

This project is useful for studying how loop ordering, cache behavior, transposition, and vector instructions affect dense GEMM-like workloads on modern CPUs.

Features

  • Single-binary benchmark (matrix) with selectable algorithm variants
  • Configurable square matrix size (-n)
  • Deterministic random initialization (fixed seed) for reproducible checksums
  • Simple timing output suitable for scripting
  • Includes a helper evaluation script (eval.sh) for size sweeps

Repository Layout

  • matrix.c: benchmark implementation and algorithm variants
  • Makefile: build configuration
  • eval.sh: quick performance sweep by matrix size for a selected algorithm

Build

Prerequisites

  • Linux
  • GCC toolchain with C11 support (gcc, make)

Compile

make

By default, the Makefile uses:

  • -Ofast
  • -march=native
  • -flto
  • -std=c11

To clean build artifacts:

make clean

Usage

./matrix [-n dimension] [-a algorithm]
  • -n: square matrix dimension (default: 1024)
  • -a: algorithm selector

Algorithm IDs:

  • 0: naive i-j-k
  • 1: loop reorder i-k-j
  • 2: loop reorder + tiling (block size 256)
  • 3: transpose B then multiply
  • 4: transpose + SIMD path (AVX2/SSE/NEON depending on build target)
  • 99: run all algorithms in sequence

Show help:

./matrix -h

Example

Run all variants for a 256x256 matrix:

./matrix -n 256 -a 99

Typical output format:

matmult_opt0  0.044723  chsum: -648.131751
matmult_opt1  0.004007  chsum: -648.131751
matmult_opt2  0.003787  chsum: -648.131751
matmult_opt3  0.016045  chsum: -648.131751
matmult_opt4  0.003884  chsum: -648.131646

Fields:

  • first column: algorithm label
  • second column: elapsed time in seconds
  • third column: checksum of output matrix (C) for quick correctness sanity checks

Batch Evaluation

eval.sh sweeps dimensions and prints a compact table.

./eval.sh <algorithm>

Example:

./eval.sh 1

Output columns:

  • n: matrix dimension
  • ws: approximate matrix working-set size in KiB (n*n*4*3/1024)
  • dur: measured duration in seconds

Notes On Correctness And Comparisons

  • Random inputs are generated with a fixed seed (srand(292)), so checksums are reproducible for the same build and architecture.
  • Small checksum differences between SIMD and non-SIMD paths are expected due to floating-point accumulation order.

Known Limitations

  • Tiled and SIMD kernels perform best when dimensions align with vector/block widths.
  • Algorithm 99 runs all variants sequentially, so runtime can be long for large matrices.
  • SIMD behavior depends on target ISA selected by the compiler and CPU (AVX2, SSE, or NEON).

Reproducibility Tips

  • Run on an otherwise idle machine.
  • Pin process to a core if needed using taskset.
  • Repeat runs and compare medians rather than single samples.

Acknowledgments

The implementation references optimization ideas discussed in:

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors