[draft] i8 quantization experiment by Vegetable26 · Pull Request #16 · cwida/SuperKMeans

Vegetable26 · 2026-03-23T05:09:48Z

Changes to the core algorithm:

Implements f32 -> i8 quantization for centroid assignment.
Centroid updates still use f32.
Adds an optional comparison that reports how often the f32 and i8 assignments agree.
Adds an option to compute KNN with the i8-quantized vectors and then perform the final assignment over the K candidates using the full-fidelity f32 vectors.
Uses XNNPACK for the i8 matmul implementation.
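
To make the f32 -> i8 step concrete, here is a minimal sketch of symmetric per-vector quantization with an int32-accumulated dot product. This is an illustration under assumptions, not the code in this PR: the `QuantizedVector`, `QuantizeSymmetric`, and `DotI8` names are hypothetical, and the actual scheme (per-vector vs. global scale, zero point, and so on) may differ.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical container: i8 values plus the f32 scale needed to dequantize.
struct QuantizedVector {
    std::vector<int8_t> values;
    float scale;  // dequantized value = values[i] * scale
};

// Symmetric quantization (assumed scheme): map the largest |x| to 127.
QuantizedVector QuantizeSymmetric(const std::vector<float>& v) {
    float max_abs = 0.0f;
    for (float x : v) max_abs = std::max(max_abs, std::fabs(x));

    QuantizedVector q;
    q.scale = max_abs > 0.0f ? max_abs / 127.0f : 1.0f;
    q.values.reserve(v.size());
    for (float x : v) {
        long r = std::lround(x / q.scale);
        r = std::clamp(r, -127L, 127L);  // keep the range symmetric
        q.values.push_back(static_cast<int8_t>(r));
    }
    return q;
}

// Approximate f32 dot product: accumulate in int32, then apply both scales.
float DotI8(const QuantizedVector& a, const QuantizedVector& b) {
    int32_t acc = 0;
    for (size_t i = 0; i < a.values.size(); ++i) {
        acc += static_cast<int32_t>(a.values[i]) * static_cast<int32_t>(b.values[i]);
    }
    return static_cast<float>(acc) * a.scale * b.scale;
}
```

The inner loop of `DotI8` is the part an i8 matmul kernel would replace; the scalar version here only shows the arithmetic.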

I attached the experimental results for a Cohere 2M benchmark (my Mac does not have enough memory for the full benchmark). Some interesting results:

E2E recall rates are roughly the same across all implementations
xnnpack f32 = 260 seconds, cblas_sgemm = 111 seconds, xnnpack i8 = 65 seconds.
Note that the xnnpack vs. cblas comparison is not fully fair, because cblas_sgemm uses the AMX coprocessor while xnnpack only uses sdot. Within xnnpack, though, we see a ~4x speedup from f32 -> i8 quantization (260 s vs. 65 s), which is roughly expected, since we are quantizing to 25% of the original size (32 -> 8 bits).
The original std::partial_sort implementation for top-K was quite inefficient. I implemented a new version with lower overhead (it filters invalid candidates more aggressively).
Finally, i8 assignment with top-10 candidates (then rescoring with f32) produces almost exactly the same assignments as the f32 implementation, and using i8 top-10 rather than i8 top-1 candidates adds very little overhead. If we only consider the top-1 candidate, the i8 and f32 assignments disagree somewhat more often.
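
The top-K-then-rescore idea can be sketched as follows. This is a simplified illustration, not the implementation in this PR: `SquaredL2` and `AssignWithRescoring` are hypothetical names, it uses plain loops and std::partial_sort rather than the matmul and the custom top-K described above, and it assumes the point and all centroids were quantized with the same scale so that i8 distances are comparable.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <utility>
#include <vector>

// Squared L2 distance; works for both int8_t (Acc = int32_t) and float rows.
template <typename T, typename Acc>
Acc SquaredL2(const T* a, const T* b, size_t dim) {
    Acc sum = 0;
    for (size_t i = 0; i < dim; ++i) {
        Acc d = static_cast<Acc>(a[i]) - static_cast<Acc>(b[i]);
        sum += d * d;
    }
    return sum;
}

// Picks k candidate centroids using i8 distances, then returns the index of
// the candidate that is closest in f32. Assumes at least one centroid and
// row-major centroid storage (n rows of length dim).
size_t AssignWithRescoring(const int8_t* point_i8, const float* point_f32,
                           const std::vector<int8_t>& centroids_i8,
                           const std::vector<float>& centroids_f32,
                           size_t dim, size_t k) {
    const size_t n = centroids_f32.size() / dim;
    k = std::min(k, n);

    // Stage 1: cheap approximate distances to every centroid.
    std::vector<std::pair<int32_t, size_t>> approx(n);
    for (size_t c = 0; c < n; ++c) {
        approx[c] = {SquaredL2<int8_t, int32_t>(point_i8, &centroids_i8[c * dim], dim), c};
    }
    std::partial_sort(approx.begin(), approx.begin() + k, approx.end());

    // Stage 2: exact f32 distances for the k candidates only.
    size_t best = approx[0].second;
    float best_dist = std::numeric_limits<float>::max();
    for (size_t j = 0; j < k; ++j) {
        const size_t c = approx[j].second;
        const float d = SquaredL2<float, float>(point_f32, &centroids_f32[c * dim], dim);
        if (d < best_dist) {
            best_dist = d;
            best = c;
        }
    }
    return best;
}
```

With k = 1 this reduces to pure i8 assignment; a larger k gives the f32 rescoring step a chance to correct candidates that rounding pushed out of first place, at the cost of k extra f32 distance computations per point.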

Vegetable26 · 2026-03-23T05:39:18Z

CMakeLists.txt

-find_package(OpenMP REQUIRED)
+
+# Apple Clang does not bundle OpenMP; point FindOpenMP at Homebrew's libomp.
+if(APPLE)


nit: mainly for my own setup

Vegetable26 · 2026-03-23T05:39:31Z

CMakeLists.txt

+FetchContent_MakeAvailable(xnnpack)
+set(XNNPACK_LINK_LIBRARIES XNNPACK pthreadpool)
+
+add_compile_definitions(CMAKE_SOURCE_DIR="${CMAKE_SOURCE_DIR}" BENCHMARK_TIME)


nit: mainly for my own setup

Vegetable26 · 2026-03-23T05:39:56Z

examples/simple_clustering.cpp

@@ -1,17 +1,20 @@
+#define BENCHMARK_TIME


nit: most of the changes here are for my own setup (quickly running a few different experiments)

Joseph Hwang added 3 commits March 22, 2026 23:18

Various changes

c5ffb61

Matmul

6448e3c

XNNpack

f00dd07

Vegetable26 force-pushed the jzh/base_experiment branch from 477c167 to f00dd07 · March 23, 2026 06:19

Joseph Hwang added 2 commits March 22, 2026 23:25

Other changes

4a5cb30

Ok

50865fc

Vegetable26 wants to merge 5 commits into cwida:main from Vegetable26:jzh/base_experiment
