Skip to content

Latest commit

 

History

History
212 lines (171 loc) · 8.36 KB

File metadata and controls

212 lines (171 loc) · 8.36 KB

Matrix Multiplication Bench

Compare MatMul performance of various library implementations.

Setup

Note

Eigen and Google's Benchmark need to be located one directory up from gemm_microbench.

Install Eigen

#Get Eigen from the official website
wget https://gitlab.com/libeigen/eigen/-/archive/3.4.0/eigen-3.4.0.zip

#Unpack and set the correct name
unzip eigen-3.4.0.zip && rm eigen-3.4.0.zip && mv eigen-3.4.0 Eigen

If the above doesn't work, reference Eigen's Official Website

Install Google's Benchmark

# Check out the library.
$ git clone https://github.com/google/benchmark.git

# Go to the library root directory
$ cd benchmark

# Make a build directory to place the build output
$ cmake -E make_directory "build"

# Generate build system files with cmake, and download any dependencies
$ cmake -E chdir "build" cmake -DBENCHMARK_DOWNLOAD_DEPENDENCIES=on -DCMAKE_BUILD_TYPE=Release ../

# or, starting with CMake 3.13, use a simpler form:
# cmake -DBENCHMARK_DOWNLOAD_DEPENDENCIES=on -DCMAKE_BUILD_TYPE=Release -S . -B "build"
# Build the library
$ cmake --build "build" --config Release

# Next, you can run the tests to check the build
$ cmake -E chdir "build" ctest --build-config Release

# Install the library globally
& sudo cmake --build "build" --config Release --target install

If the above doesn't work, reference Google's Benchmark GitHub

Install Intel oneAPU Base Toolkit

# Download the installer
$ wget https://registrationcenter-download.intel.com/akdlm/IRC_NAS/4a5320d1-0b48-458d-9668-fd0e4501208c/intel-oneapi-base-toolkit-2025.1.3.7_offline.sh

# Run the install script
$ sudo sh ./intel-oneapi-base-toolkit-2025.1.3.7_offline.sh -a --silent --cli --eula accept

# Set the environment variables
$ . /opt/intel/oneapi/setvars.sh

If the above doesn't work, reference Intel oneAPI Base Toolkit's Official Website

Running the Benchmark

# Default setup where the max threads available on your system are used
make all

# This specifies how many threads should be used
make all THREADS=16

Results

The benchmark also checks that the result of the GEMM operation matches across all libraries and crates.

Single-thread

Test Case Library Language Result
Input A Eigen C++ 0.466s
matrixmultiply Rust 0.599s
CBLAS Rust 0.406s
Input B Eigen C++ 0.495s
matrixmultiply Rust 0.607s
CBLAS Rust 0.409s
Input C Eigen C++ 2.073s
matrixmultiply Rust 2.644s
CBLAS Rust 1.705s
Input D Eigen C++ 40.823s
matrixmultiply Rust 42.953s
CBLAS Rust 29.670s

Grpahed Results for single-thread

2 Threads

Test Case Library Language Result
Input A Eigen C++ 0.268s
matrixmultiply Rust 0.329s
CBLAS Rust 0.207s
Input B Eigen C++ 0.281s
matrixmultiply Rust 0.343s
CBLAS Rust 0.211s
Input C Eigen C++ 1.151s
matrixmultiply Rust 1.443s
CBLAS Rust 0.852s
Input D Eigen C++ 28.048s
matrixmultiply Rust 25.668s
CBLAS Rust 15.485s

Grpahed Results for 2 threads

4 Threads (Maximum Threads Supported by matrixmultiply)

Test Case Library Language Result
Input A Eigen C++ 0.172s
matrixmultiply Rust 0.209s
CBLAS Rust 0.122s
Input B Eigen C++ 0.173s
matrixmultiply Rust 0.191s
CBLAS Rust 0.124s
Input C Eigen C++ 0.609s
matrixmultiply Rust 0.808s
CBLAS Rust 0.480s
Input D Eigen C++ 14.786s
matrixmultiply Rust 15.655s
CBLAS Rust 9.329s

Grpahed Results for 4 threads

8 Threads

Test Case Library Language Result
Input A Eigen C++ 0.091s
matrixmultiply Rust 0.248s
CBLAS Rust 0.0695s
Input B Eigen C++ 0.0959s
matrixmultiply Rust 0.253s
CBLAS Rust 0.0668s
Input C Eigen C++ 0.401s
matrixmultiply Rust 0.976s
CBLAS Rust 0.258s
Input D Eigen C++ 5.503s
matrixmultiply Rust 15.779s
CBLAS Rust 4.948s

Grpahed Results for 8 threads

16 Threads

Test Case Library Language Result
Input A Eigen C++ 0.0691s
matrixmultiply Rust 0.243s
CBLAS Rust 0.0498s
Input B Eigen C++ 0.0767s
matrixmultiply Rust 0.246s
CBLAS Rust 0.0343s
Input C Eigen C++ 0.259s
matrixmultiply Rust 1.024s
CBLAS Rust 0.165s
Input D Eigen C++ 3.381s
matrixmultiply Rust 16.960s
CBLAS Rust 3.994s

Grpahed Results for 16 threads

32 Threads

Test Case Library Language Result
Input A Eigen C++ 0.0606s
matrixmultiply Rust 0.231s
CBLAS Rust 0.0462s
Input B Eigen C++ 0.0725s
matrixmultiply Rust 0.240s
CBLAS Rust 0.0555s
Input C Eigen C++ 0.235s
matrixmultiply Rust 0.966s
CBLAS Rust 0.187s
Input D Eigen C++ 2.933s
matrixmultiply Rust 17.732s
CBLAS Rust 3.597s

Grpahed Results for 32 threads

48 Threads (Maximum Threads Supported by my testing machine - Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz)

Test Case Library Language Result
Input A Eigen C++ 0.0629s
matrixmultiply Rust 0.231s
CBLAS Rust 0.0307s
Input B Eigen C++ 0.0744s
matrixmultiply Rust 0.234s
CBLAS Rust 0.0462s
Input C Eigen C++ 0.161s
matrixmultiply Rust 1.009s
CBLAS Rust 0.121s
Input D Eigen C++ 3.065s
matrixmultiply Rust 18.508s
CBLAS Rust 3.928s

Grpahed Results for 48 threads