[draft] i8 quantization experiment #16

Open
Vegetable26 wants to merge 5 commits into cwida:main from Vegetable26:jzh/base_experiment

Vegetable26 commented Mar 23, 2026

Changes to the core algorithm:

  • Implements f32 -> i8 quantization for centroid assignment.
  • Centroid updates still use f32.
  • Adds an optional check that measures how often the f32 and i8 assignments agree.
  • Adds an option to compute the KNN with i8-quantized vectors and then perform the final assignment among the K candidates using the full-fidelity f32 vectors.
  • Uses XNNPack for the i8 matmul implementation.
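
A minimal sketch of the quantize-then-rescore idea in the list above. It assumes symmetric per-vector scaling and squared L2 distance, neither of which is confirmed by this PR. All names (`QuantizedVector`, `QuantizeI8`, `ApproxSquaredL2`, `ExactSquaredL2`, `AssignWithRescoring`) are hypothetical, and the scalar loops only stand in for the XNNPack i8 matmul.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <utility>
#include <vector>

// Assumed scheme: symmetric per-vector quantization, value ~= values[i] * scale.
struct QuantizedVector {
    std::vector<std::int8_t> values;
    float scale;
};

QuantizedVector QuantizeI8(const std::vector<float>& v) {
    float max_abs = 0.0f;
    for (float x : v) max_abs = std::max(max_abs, std::fabs(x));
    QuantizedVector q;
    q.scale = max_abs > 0.0f ? max_abs / 127.0f : 1.0f;
    q.values.reserve(v.size());
    for (float x : v) {
        // Round to nearest and clamp to [-127, 127] to guard against float error.
        const long r = std::lround(x / q.scale);
        q.values.push_back(static_cast<std::int8_t>(std::max(-127L, std::min(127L, r))));
    }
    return q;
}

// Approximate squared L2 distance from integer dot products and the two scales.
float ApproxSquaredL2(const QuantizedVector& a, const QuantizedVector& b) {
    assert(a.values.size() == b.values.size());
    std::int32_t dot = 0, norm_a = 0, norm_b = 0;
    for (std::size_t i = 0; i < a.values.size(); ++i) {
        const std::int32_t x = a.values[i], y = b.values[i];
        dot += x * y;
        norm_a += x * x;
        norm_b += y * y;
    }
    return a.scale * a.scale * static_cast<float>(norm_a) +
           b.scale * b.scale * static_cast<float>(norm_b) -
           2.0f * a.scale * b.scale * static_cast<float>(dot);
}

float ExactSquaredL2(const std::vector<float>& a, const std::vector<float>& b) {
    assert(a.size() == b.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) sum += (a[i] - b[i]) * (a[i] - b[i]);
    return sum;
}

// Pick the top-k centroids by i8 distance, then choose the winner using f32.
std::size_t AssignWithRescoring(const std::vector<float>& point,
                                const std::vector<std::vector<float>>& centroids,
                                const std::vector<QuantizedVector>& centroids_q,
                                std::size_t k) {
    assert(!centroids.empty() && centroids.size() == centroids_q.size());
    const QuantizedVector point_q = QuantizeI8(point);
    std::vector<std::pair<float, std::size_t>> scored;
    scored.reserve(centroids_q.size());
    for (std::size_t c = 0; c < centroids_q.size(); ++c) {
        scored.emplace_back(ApproxSquaredL2(point_q, centroids_q[c]), c);
    }
    k = std::max<std::size_t>(1, std::min(k, scored.size()));
    std::partial_sort(scored.begin(), scored.begin() + k, scored.end());

    std::size_t best = scored[0].second;
    float best_dist = std::numeric_limits<float>::infinity();
    for (std::size_t i = 0; i < k; ++i) {
        const float d = ExactSquaredL2(point, centroids[scored[i].second]);
        if (d < best_dist) {
            best_dist = d;
            best = scored[i].second;
        }
    }
    return best;
}
```

With `k = 1` this reduces to a pure i8 assignment; larger `k` spends a few extra f32 distance computations per point on the cases where quantization error would otherwise flip the nearest centroid.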

I attached the experimental results for a cohere 2M benchmark (my Mac does not have enough memory for the full benchmark). Some interesting results:

  • E2E recall rates are roughly the same across all implementations
  • xnnpack f32 = 260 seconds, cblas_sgemm = 111 seconds, xnnpack i8 = 65 seconds.
  • Note that comparing xnnpack to cblas is not fully fair, because cblas_sgemm uses the AMX coprocessor while xnnpack only uses sdot. Within xnnpack, however, we see a ~4x speedup from f32 -> i8 quantization (260 s -> 65 s), which is roughly expected since we are quantizing to 25% of the original size (32 -> 8 bits).
  • The original std::partial_sort implementation for top-K was quite inefficient. I replaced it with a lower-overhead version that filters invalid candidates more aggressively.
  • Finally, i8 assignment with top-10 candidates (then rescoring with f32) produces almost exactly the same assignments as the f32 implementation, and it adds very little overhead compared to i8 top-1 candidate assignment. If we only consider the top-1 candidate, the i8 and f32 assignments disagree somewhat more often.
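
The PR's actual top-K code is not shown in this conversation, so the following is only one plausible shape for a lower-overhead selection, together with a way to measure the agreement rate mentioned above. It keeps a bounded max-heap of the K best candidates and rejects anything that cannot beat the current worst with a single comparison. `Candidate`, `TopKSmallest`, and `AgreementRate` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <utility>
#include <vector>

using Candidate = std::pair<float, std::size_t>;  // (distance, centroid index)

// Returns the k smallest distances with their indices, ascending by distance.
std::vector<Candidate> TopKSmallest(const std::vector<float>& distances, std::size_t k) {
    if (k == 0) return {};
    std::priority_queue<Candidate> heap;  // max-heap: top() is the worst kept candidate
    for (std::size_t i = 0; i < distances.size(); ++i) {
        const float d = distances[i];
        if (heap.size() < k) {
            heap.emplace(d, i);
        } else if (d < heap.top().first) {
            heap.pop();
            heap.emplace(d, i);
        }
        // Otherwise the candidate is discarded after one comparison.
    }
    std::vector<Candidate> result(heap.size());
    for (std::size_t i = result.size(); i-- > 0;) {
        result[i] = heap.top();
        heap.pop();
    }
    return result;
}

// Fraction of points whose two assignments (e.g. f32 vs. i8) are identical.
double AgreementRate(const std::vector<std::size_t>& a, const std::vector<std::size_t>& b) {
    assert(a.size() == b.size());
    if (a.empty()) return 1.0;
    std::size_t same = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (a[i] == b[i]) ++same;
    }
    return static_cast<double>(same) / static_cast<double>(a.size());
}
```

For small K, such as the top-10 case, most candidates fail the `d < heap.top().first` check once the heap has warmed up, so the heap is rarely modified.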

find_package(OpenMP REQUIRED)

# Apple Clang does not bundle OpenMP; point FindOpenMP at Homebrew's libomp.
if(APPLE)
Author

nit: mainly for my own setup

FetchContent_MakeAvailable(xnnpack)
set(XNNPACK_LINK_LIBRARIES XNNPACK pthreadpool)

add_compile_definitions(CMAKE_SOURCE_DIR="${CMAKE_SOURCE_DIR}" BENCHMARK_TIME)
Author

nit: mainly for my own setup

@@ -1,17 +1,20 @@
#define BENCHMARK_TIME
Author

nit: most of the changes here are for my own setup (quickly running a few different experiments)

Vegetable26 force-pushed the jzh/base_experiment branch from 477c167 to f00dd07 on March 23, 2026 06:19
Joseph Hwang added 2 commits March 22, 2026 23:25