Add MSVC and GCC support & add+improve CI Coverage (incl DEBUG runs)#69
Open
peterboncz wants to merge 98 commits intodevfrom
Open
Add MSVC and GCC support & add+improve CI Coverage (incl DEBUG runs)#69peterboncz wants to merge 98 commits intodevfrom
peterboncz wants to merge 98 commits intodevfrom
Conversation
================================= Enable FastLanes to compile with both Clang (Linux/MacOS) and MSVC (Windows). (This is a first step towards vcpkg packaging, which will make including FastLanes in a project easier. We need Windows not only because it is such a modern and nice OS, but also because it is a native platform for DuckDB and we want to use the future vcpkg version in the a DuckDB extension for FastLanes.) in detail: - New compiler.hpp portability header centralizing compiler-specific macros (vectorize pragmas, __builtin_clz/ctz, diagnostic push/pop) - CMakeLists.txt: accept MSVC alongside Clang, with appropriate flag mappings - Replace ~700 #pragma clang loop vectorize(enable) with FLS_PRAGMA_VECTORIZE - Replace #pragma GCC/clang diagnostic with portable FLS_DIAG_* macros - Add Windows implementation for memory_usage.cpp (GetProcessMemoryInfo) - Add MSVC compat shims to fsst12.h (matching existing fsst.h pattern) - Guard Clang-only -Wno-macro-redefined flags in 8 subdirectory CMakeLists.txt - Add /bigobj, /MP, /Zc:__cplusplus flags for MSVC builds - Split interpreter.cpp into encoding/decoding files to reduce MSVC compile time - Fix C++20 structured binding in range-for (unsupported by MSVC)
…msvc++ compile time passable for windows CI) make format
- add newer macos 26, deprecate 13 - remove debug/true builds as they are slow and flaky - add windows-arm as a platform - add mvsc test and example
…ed as exceptions When base + 2^bw overflowed the unsigned type (e.g. base=64512, bw=10 for uint16_t: 64512+1024=65536 wraps to 0), the upper bound check val >= 0 became true for all values, producing 1024 exceptions per vector instead of the expected 0-15. This silently inflated compressed output in Release (assertions disabled) and caused assertion failures Fix: replace the overflow-prone upper bound comparison with unsigned delta arithmetic: (uval - ubase) >= pow2(bw). Unsigned subtraction wraps correctly by definition, handling all base/bw combinations.
…reorder Moving Finalize before Cast (needed for min/max in type-narrowing) left bimap_frequency empty on cast columns, crashing the encoder with "Value not found in BiMapFrequency." It also left min/max at defaults on newly created cast columns, causing silent data corruption on round-trip. Fix: split bimap_frequency population and min/max computation into a new PopulateBiMap() step that runs after Cast on the final column types.
write 8 bytes of packed 12-bit codes, but `res` was declared as u32 (4 bytes), causing a 4-byte overread from the stack variable. On x64 MSVC Release/static the /GS security cookie catches this and raises STATUS_STACK_BUFFER_OVERRUN (0xc0000409), crashing EQUALITY_STRPT and SINGLE_COLUMN_STRPT. Fix: widen `res` from u32 to u64 so the 8-byte memcpy reads from a valid 8-byte variable. Changed in all 4 copies of the FSST12 encoder: src/primitive/fsst12/fsst12.cpp src/cor/prm/fsst12/libfsst12.cpp rust/vendor/fastlanes/src/primitive/fsst12/fsst12.cpp rust/vendor/fastlanes/src/cor/prm/fsst12/libfsst12.cpp Verified clean under MSVC AddressSanitizer (/fsanitize=address).
- hide all non-exported symbols since this may avoid certain bugs
split the four big operator files in encoder/decoder classes, such that:
(1) we get more compilation paralellism, and
(2) reduce the amount of template instantiations by half
(in Debug this will count especially)
1. FSST12 memcpy overread — sizeof(u64) → sizeof(u32) for a u32 res variable (4 files) 2. ODR violation — two Histogram<T> classes in namespace fastlanes with different layouts Renamed the one in analyze_operator.hpp to AnalyzeHistogram<T> (4 files: header, impl, .cpp, plus rust vendor copies)
…o add-mvsc+gcc-support
- fix histogram instantiation
…l_map_arr is empty but data has been padded to 1024 elements by FillMissingValues. Fixed by checking idx < null_map_arr.size() instead of !null_map_arr.empty(). Applied to both typed and string comparisons in src/table/rowgroup.cpp and rust/vendor/fastlanes/src/table/rowgroup.cpp.
…o add-mvsc+gcc-support
…ompile times The 247-alternative physical_operator variant caused std::visit() to generate huge dispatch tables (28MB+ object files), making GCC Debug builds time out. Split into enc_physical_operator (~110 alternatives) and dec_physical_operator (~137 alternatives), with physical_operator as a thin wrapper variant over both. Added visit_enc/visit_dec/visit_physical helpers that unwrap the outer variant and dispatch into the correct sub-variant. Object file sizes dropped ~60-70% (e.g. enc_transpose: 16MB→6.5MB, dec_slpatch: 13MB→3.7MB). Total expression directory: ~200MB→~65MB. Files changed: - physical_expression.hpp: variant split + visit helpers - interpreter_encoding.cpp: wrap emplace_back with enc_physical_operator - interpreter_decoding.cpp: wrap emplace_back with dec_physical_operator - 17 operator .cpp files: visit() → visit_enc/visit_dec/visit_physical - materializer.cpp: two-argument visit_dec for operator+column dispatch
remove ubutu-arm clang debug static target (it is the only one still timing out)
- all platforms and settings are at least built - bring back clang on windows-latest and try windows-arm as well
…o add-mvsc+gcc-support
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Port FastLanes to compile and pass all tests on GCC Linux and MSVC (Windows x86-64 and ARM64), including shared library (DLL) builds. Also overhaul the CI matrix to cut job count from ~167 to ~50 per PR.
MSVC and GCC Compiler Support
compiler.hppportability header centralizing compiler-specific macros (vectorize pragmas,__builtin_clz/ctz, diagnostic push/pop)CMakeLists.txt: accept MSVC alongside Clang, with appropriate flag mappings (/bigobj,/MP,/Zc:__cplusplus)#pragma clang loop vectorize(enable)withFLS_PRAGMA_VECTORIZE#pragma GCC/clang diagnosticwith portableFLS_DIAG_*macrosmemory_usage.cpp(GetProcessMemoryInfo)fsst12.h(matching existingfsst.hpattern)-Wno-macro-redefinedflags in 8 subdirectory CMakeListsinterpreter.cppinto encoding/decoding files to reduce MSVC compile timemsvc_heap_guardtest helper to handle SEH guard-page exceptions in GoogleTestGCC compiler support was an easy add after going through this. We only support gcc on Linux, because it is much harder to use on MacOS and msvc is the preferred Windows compiler. Note that clang works on all three platforms still.
Explicit Symbol Export
we hide all symbols by default in the shared libraries and export the API symbols -- required vor DLLs
FLS_APImacro (__declspec(dllexport/dllimport)) on all public classes and functions= deleteof copy ops on classes withunique_ptrmembers (MSVC eagerly instantiates all special members for dllexport classes)unique_ptrin dllexport contextFLS_STATICpropagated as PUBLIC compile definition on the FastLanes target so consumers (tests, examples) link correctly against the static libraryBug Fixes
while it is easy, popular (and justified) to hate on Windows, it does have another independent software stack with strong asan tools and debugger, which unearthed quite some bugs. Also, running the CI tests on debug put the system through more asan testing, exposing some more. In order to get working CI across compilers, platforms, Debug/Release, shared/static quite some issues needed to be fixed to get to green CI on this PR:
unsigned longon Windows (1)std::stol/stoul overflow: Values that fit in 64 bits but exceed 32-bit range threw out_of_range. Fixed by using std::stoll/stoull in attribute.cpp and rowgroup.cpp.
unsigned longon Windows (2)FSST12 symbol table cast: fsst12.h cast the symbol table to unsigned long*, giving a 4-byte stride on Windows instead of 8. This made symbol lookups read wrong offsets — producing correct-length but
garbled strings. Fixed by casting to unsigned long long*.
The 8 string decode operators called vector::reserve(1024 * 2MB) = 2GB. Linux overcommits virtual memory so this silently succeeds; Windows does not, throwing bad_alloc. Fixed with an amortized doubling growth strategy.
When tests process large data (~80MB JPEG base64), the Windows heap allocator touches reserved-but-uncommitted guard pages, triggering a transient STATUS_ACCESS_VIOLATION (0xc0000005). Normally the NT heap manager's own vectored exception handler (VEH) commits the page and resumes transparently. But GoogleTest wraps every TEST body in an SEH __try/__except frame, and SEH frames fire after VEH handlers return EXCEPTION_CONTINUE_SEARCH — so GoogleTest's catch-all __except intercepts the fault first and reports a spurious failure. Fix: a custom first-chance VEH handler (msvc_heap_guard.cpp) that detects read-AVs on committable pages, calls VirtualAlloc(MEM_COMMIT) on the faulting page, and resumes execution — doing exactly what the heap manager would have done.
fill_inundefined behaviorThe string column padding loop called push_back() on byte_arr while simultaneously indexing into the same vector to read the last value's bytes. When push_back triggers reallocation, the indices into the old
buffer become dangling — undefined behavior. On Linux this happened to work (the allocator often didn't relocate), but MSVC's debug CRT caught it. Fix: copy the last value into a separate vector<uint8_t> before the loop, then insert() from that copy.
make_dec_galp_expr() called column_view.column_descriptor.encoding_rpn()->operand_tokens()->size() to set state.cur_operand. But GALP/ALP columns never emit operand tokens during encoding, so operand_tokens()
returns null — dereferencing it is a null pointer crash. The fix was simply removing that line, since dec_alp_opr uses hardcoded segment indices and never reads cur_operand.
In find_top_k_combinations(), the inner loop initialized factor_idx = exponent_idx and iterated down to 0. For float, MAX_EXPONENT is 10 but FACT_ARR has only 10 elements (indices 0-9). When exponent_idx == 10, factor_idx started at 10 — an out-of-bounds access into FACT_ARR[10]. Fix: clamp factor_idx to min(exponent_idx, FACT_ARR.size() - 1). For double this was harmless (MAX_EXPONENT=18, FACT_ARR has 19 elements), but for float it was reading one past the end. MSVC's debug iterator checks caught this.
TypedColumnView and NullMapView called .data() on null_map_arr without checking if the vector was empty. Columns without nulls have an empty null_map_arr, so .data() returns nullptr. Downstream code then dereferenced this null pointer. Fix: when null_map_arr is empty, point to a static zero_null_map[65536] array of zeroes (meaning "no nulls") instead.
The amortized growth code in all 8 string decode operators called std::max(capacity * 2, size + CFG::String::max_bytes_per_string) where size is size_t (unsigned) but max_bytes_per_string is a signed type. MSVC treats this as a narrowing conversion error. Fix: add static_cast<size_t>(CFG::String::max_bytes_per_string) in all 8 operators.
When the analyzer couldn't find a good compression option (all options had ≥ 20 exceptions), it fell through with the default bw = 64. Then in is_exception(), pow2(64) returned 0 due to the overflow guard (64 >= sizeof(T)*8), making real_upper = base + 0 = 0. This caused val >= 0 to be true for every value, so all 1024 values in the vector were flagged as exceptions. The count of 1024 was written to the file, and on decode FLS_ASSERT_CORRECT_POS correctly rejected it since 1024 > 1023. The fix: Early-return false in is_exception() when bw covers the full type width — by definition, no value can be an exception if you're using all the bits.
Renamed the one in analyze_operator.hpp to AnalyzeHistogram4 files: header, impl, .cpp, plus rust vendor copies)
CI Overhaul
push: [main, dev]+pull_request:instead of barepush:(was running every job twice)ubuntu-24.04only, other platforms get lightweight compile-checksubuntu-latestonlyubuntu-latestonlyAccelerating Compiles by Splitting
Getting CI to run uncovered another issue: the long compilation times in the interpreter, where hundreds of templates need to be instantiated. This caused in CI timeouts of the Debug compiles. This in turn probably led to Debug CI builds being switched off. And that led to problems getting undetected, because e.g. ASAN and ASSERTs would not get tested. To reduce compilation time, these steps were made:
this improved compilation time considerably -- but we still have to avoid ubuntu-clang-Debug-static CI as even with all these improvements compilation times out on both arm and x86 (but non-ubuntu, and shared Debug builds do run correctly now).
Motivation
Broader compiler support is not only nice for users with motives not to use clang (which has best auto-vectorization), but also enables vcpkg packaging and is required for the future DuckDB extension for FastLanes, which needs native Windows support. The improvements in compile times and (re-)enabling of Debug testing widens CI coverage and brought to light lingering bugs.