Skip to content

Add MSVC and GCC support & add+improve CI Coverage (incl DEBUG runs)#69

Open
peterboncz wants to merge 98 commits intodevfrom
add-mvsc+gcc-support
Open

Add MSVC and GCC support & add+improve CI Coverage (incl DEBUG runs)#69
peterboncz wants to merge 98 commits intodevfrom
add-mvsc+gcc-support

Conversation

@peterboncz
Copy link
Copy Markdown

@peterboncz peterboncz commented Apr 3, 2026

Summary

Port FastLanes to compile and pass all tests on GCC Linux and MSVC (Windows x86-64 and ARM64), including shared library (DLL) builds. Also overhaul the CI matrix to cut job count from ~167 to ~50 per PR.

MSVC and GCC Compiler Support

  • New compiler.hpp portability header centralizing compiler-specific macros (vectorize pragmas, __builtin_clz/ctz, diagnostic push/pop)
  • CMakeLists.txt: accept MSVC alongside Clang, with appropriate flag mappings (/bigobj, /MP, /Zc:__cplusplus)
  • Replace ~700 #pragma clang loop vectorize(enable) with FLS_PRAGMA_VECTORIZE
  • Replace #pragma GCC/clang diagnostic with portable FLS_DIAG_* macros
  • Add Windows implementation for memory_usage.cpp (GetProcessMemoryInfo)
  • Add MSVC compat shims to fsst12.h (matching existing fsst.h pattern)
  • Guard Clang-only -Wno-macro-redefined flags in 8 subdirectory CMakeLists
  • Split interpreter.cpp into encoding/decoding files to reduce MSVC compile time
  • Fix C++20 structured binding in range-for (unsupported by MSVC)
  • Add msvc_heap_guard test helper to handle SEH guard-page exceptions in GoogleTest

GCC compiler support was an easy add after going through this. We only support gcc on Linux, because it is much harder to use on MacOS and msvc is the preferred Windows compiler. Note that clang works on all three platforms still.

Explicit Symbol Export

we hide all symbols by default in the shared libraries and export the API symbols -- required vor DLLs

  • FLS_API macro (__declspec(dllexport/dllimport)) on all public classes and functions
  • Explicit = delete of copy ops on classes with unique_ptr members (MSVC eagerly instantiates all special members for dllexport classes)
  • Out-of-line destructors where MSVC needs complete types for unique_ptr in dllexport context
  • FLS_STATIC propagated as PUBLIC compile definition on the FastLanes target so consumers (tests, examples) link correctly against the static library

Bug Fixes

while it is easy, popular (and justified) to hate on Windows, it does have another independent software stack with strong asan tools and debugger, which unearthed quite some bugs. Also, running the CI tests on debug put the system through more asan testing, exposing some more. In order to get working CI across compilers, platforms, Debug/Release, shared/static quite some issues needed to be fixed to get to green CI on this PR:

  1. MSVC portability bug caused by 32-bit unsigned long on Windows (1)
    std::stol/stoul overflow: Values that fit in 64 bits but exceed 32-bit range threw out_of_range. Fixed by using std::stoll/stoull in attribute.cpp and rowgroup.cpp.
  2. MSVC portability bug caused by 32-bit unsigned long on Windows (2)
    FSST12 symbol table cast: fsst12.h cast the symbol table to unsigned long*, giving a 4-byte stride on Windows instead of 8. This made symbol lookups read wrong offsets — producing correct-length but
    garbled strings. Fixed by casting to unsigned long long*.
  3. FSST12 buffer overrun, as it did a 64-bits wrote from &res (uint32). Only the first 1-3 bytes mattered, but this a buffer overrun and a 4-byte write should be done iso a 8-byte. All FSST12 fixes got ported back to the fsst repo.
  4. 2GB reserve in string decoders
    The 8 string decode operators called vector::reserve(1024 * 2MB) = 2GB. Linux overcommits virtual memory so this silently succeeds; Windows does not, throwing bad_alloc. Fixed with an amortized doubling growth strategy.
  5. GTest SEH guard-page clash
    When tests process large data (~80MB JPEG base64), the Windows heap allocator touches reserved-but-uncommitted guard pages, triggering a transient STATUS_ACCESS_VIOLATION (0xc0000005). Normally the NT heap manager's own vectored exception handler (VEH) commits the page and resumes transparently. But GoogleTest wraps every TEST body in an SEH __try/__except frame, and SEH frames fire after VEH handlers return EXCEPTION_CONTINUE_SEARCH — so GoogleTest's catch-all __except intercepts the fault first and reports a spurious failure. Fix: a custom first-chance VEH handler (msvc_heap_guard.cpp) that detects read-AVs on committable pages, calls VirtualAlloc(MEM_COMMIT) on the faulting page, and resumes execution — doing exactly what the heap manager would have done.
  6. fill_in undefined behavior
    The string column padding loop called push_back() on byte_arr while simultaneously indexing into the same vector to read the last value's bytes. When push_back triggers reallocation, the indices into the old
    buffer become dangling — undefined behavior. On Linux this happened to work (the allocator often didn't relocate), but MSVC's debug CRT caught it. Fix: copy the last value into a separate vector<uint8_t> before the loop, then insert() from that copy.
  7. Fix GALP null deref
    make_dec_galp_expr() called column_view.column_descriptor.encoding_rpn()->operand_tokens()->size() to set state.cur_operand. But GALP/ALP columns never emit operand tokens during encoding, so operand_tokens()
    returns null — dereferencing it is a null pointer crash. The fix was simply removing that line, since dec_alp_opr uses hardcoded segment indices and never reads cur_operand.
  8. ALP factor_idx out-of-bounds (encoder.cpp)
    In find_top_k_combinations(), the inner loop initialized factor_idx = exponent_idx and iterated down to 0. For float, MAX_EXPONENT is 10 but FACT_ARR has only 10 elements (indices 0-9). When exponent_idx == 10, factor_idx started at 10 — an out-of-bounds access into FACT_ARR[10]. Fix: clamp factor_idx to min(exponent_idx, FACT_ARR.size() - 1). For double this was harmless (MAX_EXPONENT=18, FACT_ARR has 19 elements), but for float it was reading one past the end. MSVC's debug iterator checks caught this.
  9. Null map not always allocated (rowgroup.cpp)
    TypedColumnView and NullMapView called .data() on null_map_arr without checking if the vector was empty. Columns without nulls have an empty null_map_arr, so .data() returns nullptr. Downstream code then dereferenced this null pointer. Fix: when null_map_arr is empty, point to a static zero_null_map[65536] array of zeroes (meaning "no nulls") instead.
  10. Signed/unsigned conversion warnings in string decoders
    The amortized growth code in all 8 string decode operators called std::max(capacity * 2, size + CFG::String::max_bytes_per_string) where size is size_t (unsigned) but max_bytes_per_string is a signed type. MSVC treats this as a narrowing conversion error. Fix: add static_cast<size_t>(CFG::String::max_bytes_per_string) in all 8 operators.
  11. Debug test failure (NextiaJD_Chicago_Crimes_2012_to_2017 Failed: Negative index).
    When the analyzer couldn't find a good compression option (all options had ≥ 20 exceptions), it fell through with the default bw = 64. Then in is_exception(), pow2(64) returned 0 due to the overflow guard (64 >= sizeof(T)*8), making real_upper = base + 0 = 0. This caused val >= 0 to be true for every value, so all 1024 values in the vector were flagged as exceptions. The count of 1024 was written to the file, and on decode FLS_ASSERT_CORRECT_POS correctly rejected it since 1024 > 1023. The fix: Early-return false in is_exception() when bw covers the full type width — by definition, no value can be an exception if you're using all the bits.
  12. Debug test failure due to out-of-bounds read (128*8=1024 is clearly out of bounds of the vector size), which gets optimized away in Release (fallback_scalar_aav_1024_uf1_unrsum_src.cpp)
  13. ODR violation — two Histogram classes in namespace fastlanes with different layouts
    Renamed the one in analyze_operator.hpp to AnalyzeHistogram4 files: header, impl, .cpp, plus rust vendor copies)

CI Overhaul

  • Eliminate push+PR duplication: all workflows now trigger on push: [main, dev] + pull_request: instead of bare push: (was running every job twice)
  • Clang-tidy on 1 platform: tidy diagnostics are platform-independent; run on ubuntu-24.04 only, other platforms get lightweight compile-checks
  • Reduce shared/static cross-product: static everywhere, shared on one representative per OS family
  • Benchmarks: reduced from 2 platforms to ubuntu-latest only
  • Dataset generation: reduced from 6 platforms to ubuntu-latest only
  • Compiler diversity in tests: GCC on ubuntu-24.04 + ubuntu-24.04-arm, Clang on ubuntu-22.04 + ubuntu-22.04-arm, Clang on macOS, MSVC on Windows. Avoid gcc Debug on ubuntu-arm though, because it is slow and times out while still compiling.

Accelerating Compiles by Splitting

Getting CI to run uncovered another issue: the long compilation times in the interpreter, where hundreds of templates need to be instantiated. This caused in CI timeouts of the Debug compiles. This in turn probably led to Debug CI builds being switched off. And that led to problems getting undetected, because e.g. ASAN and ASSERTs would not get tested. To reduce compilation time, these steps were made:

  • split the interpreter.cpp which instantiated hundreds of templates implicitly, in two source files (encoder and decoder).
  • split the biggest operators (rsum, slpatch, analyze)
  • further split the visit state in a pair of states <enc,dec> and split on that in all visitors for encoding resp. decoding.

this improved compilation time considerably -- but we still have to avoid ubuntu-clang-Debug-static CI as even with all these improvements compilation times out on both arm and x86 (but non-ubuntu, and shared Debug builds do run correctly now).

Motivation

Broader compiler support is not only nice for users with motives not to use clang (which has best auto-vectorization), but also enables vcpkg packaging and is required for the future DuckDB extension for FastLanes, which needs native Windows support. The improvements in compile times and (re-)enabling of Debug testing widens CI coverage and brought to light lingering bugs.

=================================

Enable FastLanes to compile with both Clang (Linux/MacOS) and MSVC (Windows).

(This is a first step towards vcpkg packaging, which will make including FastLanes
 in a project easier. We need Windows not only because it is such a modern and nice
 OS, but also because it is a native platform for DuckDB and we want to use the
 future vcpkg version in the a DuckDB extension for FastLanes.)

in detail:
- New compiler.hpp portability header centralizing compiler-specific macros
  (vectorize pragmas, __builtin_clz/ctz, diagnostic push/pop)
- CMakeLists.txt: accept MSVC alongside Clang, with appropriate flag mappings
- Replace ~700 #pragma clang loop vectorize(enable) with FLS_PRAGMA_VECTORIZE
- Replace #pragma GCC/clang diagnostic with portable FLS_DIAG_* macros
- Add Windows implementation for memory_usage.cpp (GetProcessMemoryInfo)
- Add MSVC compat shims to fsst12.h (matching existing fsst.h pattern)
- Guard Clang-only -Wno-macro-redefined flags in 8 subdirectory CMakeLists.txt
- Add /bigobj, /MP, /Zc:__cplusplus flags for MSVC builds
- Split interpreter.cpp into encoding/decoding files to reduce MSVC compile time
- Fix C++20 structured binding in range-for (unsupported by MSVC)
…msvc++ compile time passable for windows CI)

make format
- add newer macos 26, deprecate 13
- remove debug/true builds as they are slow and flaky
- add windows-arm as a platform
- add mvsc test and example
…ed as exceptions

When base + 2^bw overflowed the unsigned type (e.g. base=64512, bw=10
for uint16_t: 64512+1024=65536 wraps to 0), the upper bound check
val >= 0 became true for all values, producing 1024 exceptions per
vector instead of the expected 0-15. This silently inflated compressed
output in Release (assertions disabled) and caused assertion failures

Fix: replace the overflow-prone upper bound comparison with unsigned
delta arithmetic: (uval - ubase) >= pow2(bw). Unsigned subtraction
wraps correctly by definition, handling all base/bw combinations.
…reorder

Moving Finalize before Cast (needed for min/max in type-narrowing) left
bimap_frequency empty on cast columns, crashing the encoder with "Value
not found in BiMapFrequency." It also left min/max at defaults on newly
created cast columns, causing silent data corruption on round-trip.

Fix: split bimap_frequency population and min/max computation into a new
PopulateBiMap() step that runs after Cast on the final column types.
write 8 bytes of packed 12-bit codes, but `res` was declared as u32 (4
bytes), causing a 4-byte overread from the stack variable. On x64 MSVC
Release/static the /GS security cookie catches this and raises
STATUS_STACK_BUFFER_OVERRUN (0xc0000409), crashing EQUALITY_STRPT and
SINGLE_COLUMN_STRPT.

Fix: widen `res` from u32 to u64 so the 8-byte memcpy reads from a
valid 8-byte variable. Changed in all 4 copies of the FSST12 encoder:
  src/primitive/fsst12/fsst12.cpp
  src/cor/prm/fsst12/libfsst12.cpp
  rust/vendor/fastlanes/src/primitive/fsst12/fsst12.cpp
  rust/vendor/fastlanes/src/cor/prm/fsst12/libfsst12.cpp

Verified clean under MSVC AddressSanitizer (/fsanitize=address).
- hide all non-exported symbols since this may avoid certain bugs
split the four big operator files in encoder/decoder classes, such that:
(1) we get more compilation paralellism, and
(2) reduce the amount of template instantiations by half
    (in Debug this will count especially)
1. FSST12 memcpy overread — sizeof(u64) → sizeof(u32) for a u32 res variable (4 files)
2. ODR violation — two Histogram<T> classes in namespace fastlanes with different layouts
   Renamed the one in analyze_operator.hpp to AnalyzeHistogram<T>
   (4 files: header, impl, .cpp, plus rust vendor copies)
- fix histogram instantiation
…l_map_arr is empty but data has been padded to 1024 elements by FillMissingValues. Fixed by checking idx < null_map_arr.size() instead of !null_map_arr.empty(). Applied to both typed and string comparisons in src/table/rowgroup.cpp and rust/vendor/fastlanes/src/table/rowgroup.cpp.
…ompile times

The 247-alternative physical_operator variant caused std::visit() to generate
huge dispatch tables (28MB+ object files), making GCC Debug builds time out.

Split into enc_physical_operator (~110 alternatives) and dec_physical_operator
(~137 alternatives), with physical_operator as a thin wrapper variant over both.
Added visit_enc/visit_dec/visit_physical helpers that unwrap the outer variant
and dispatch into the correct sub-variant.

Object file sizes dropped ~60-70% (e.g. enc_transpose: 16MB→6.5MB,
dec_slpatch: 13MB→3.7MB). Total expression directory: ~200MB→~65MB.

Files changed:
- physical_expression.hpp: variant split + visit helpers
- interpreter_encoding.cpp: wrap emplace_back with enc_physical_operator
- interpreter_decoding.cpp: wrap emplace_back with dec_physical_operator
- 17 operator .cpp files: visit() → visit_enc/visit_dec/visit_physical
- materializer.cpp: two-argument visit_dec for operator+column dispatch
remove ubutu-arm clang debug static target (it is the only one still timing out)
@peterboncz peterboncz changed the title Add MSVC and GCC support in preparation for VCPKG Add MSVC and GCC support & add+improve CI Coverage (incl DEBUG runs) Apr 8, 2026
- all platforms and settings are at least built
- bring back clang on windows-latest and try windows-arm as well
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant