[feat] Pre-built JIT IR: compile C++ kernels to bitcode at build time for JIT inlining #501

Open
zhangxffff wants to merge 1 commit into bytedance:main from zhangxffff:feat/prebuild_jit_ir

@zhangxffff (Collaborator) commented Apr 14, 2026

What problem does this PR solve?

Issue Number: close #502

Type of Change

  • 🐛 Bug fix (non-breaking change which fixes an issue)
  • ✨ New feature (non-breaking change which adds functionality)
  • 🚀 Performance improvement (optimization)
  • ⚠️ Breaking change (fix or feature that would cause existing functionality to change)
  • 🔨 Refactoring (no logic changes)
  • 🔧 Build/CI or Infrastructure changes
  • 📝 Documentation only

Description

  • Introduce a pre-built JIT IR framework that compiles C++ kernel functions to LLVM bitcode at build time, embeds the bitcode in the binary, and links/inlines them into JIT-composed modules at runtime
  • Add store kernels that call DecodedVector's inline methods directly (compiled against the bolt headers), eliminating virtual dispatch in the JIT hot path
  • Add a store PoC test that writes all key columns with a single JIT function call per row, replacing N separate RowContainer::store() calls (one per key column)
  • Add LLVM optimization passes (AlwaysInliner + InstCombine/GVN) to ThrustJITv2's IR transform layer
  • Add JIT compile logging (LOG(INFO) for function name/time/code size, VLOG(1) for full IR dump)
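
To make the store-kernel bullet concrete, here is a minimal sketch of what one such kernel might look like in plain C++. The argument order (row, offset, column, index, nullByte, nullMask) is inferred from the call sites in the architecture diagram. `FlatColumn` is a hypothetical stand-in for DecodedVector, and the one-byte-per-row null layout is an assumption made only for this illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for a decoded column. The real kernels are described
// as using DecodedVector's inline methods; this struct only mimics that shape.
struct FlatColumn {
  const int64_t* values;
  const uint8_t* nulls;  // assumed layout: one byte per row, non-zero = null

  // Non-virtual inline accessors, so an inliner can fold them into the caller.
  bool isNullAt(int32_t i) const { return nulls != nullptr && nulls[i] != 0; }
  int64_t valueAt(int32_t i) const { return values[i]; }
};

// extern "C" keeps the symbol name unmangled, so it can be looked up as
// "jit_store_i64" after the bitcode is linked into a module.
extern "C" void jit_store_i64(char* row, int32_t offset, const FlatColumn* col,
                              int32_t idx, int32_t nullByte, uint8_t nullMask) {
  assert(row != nullptr && col != nullptr);
  if (col->isNullAt(idx)) {
    // Set the null bit for this column and zero the value slot.
    reinterpret_cast<uint8_t*>(row)[nullByte] |= nullMask;
    std::memset(row + offset, 0, sizeof(int64_t));
    return;
  }
  const int64_t v = col->valueAt(idx);
  std::memcpy(row + offset, &v, sizeof(v));  // memcpy avoids alignment assumptions
}
```

Because offset, nullByte, and nullMask arrive as plain arguments, a caller that passes constants gives the optimizer the chance to fold them once the kernel is inlined.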

Architecture

  How pre-built bitcode is created

    ┌──────────────┐     clang -emit-llvm     ┌──────────────┐
    │ kernels.cpp  │ ──────────────────────►   │ kernels.bc   │
    │ (C++ source) │         -O2               │ (LLVM IR)    │
    └──────────────┘                           └──────┬───────┘
                                                      │ xxd -i
                                                      ▼
                                               ┌──────────────┐
                                               │kernels_bc.h  │
                                               │(byte array)  │
                                               └──────────────┘
                                                  embedded in
                                                    binary
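
As a rough illustration of the embedding step, `xxd -i kernels.bc` emits a C array plus a length variable named after the input file. The sketch below mimics that shape with only the four magic bytes of a raw LLVM bitcode file ('B', 'C', 0xC0, 0xDE); a real header would contain the whole file. `BitcodeView`, `embeddedKernels`, and `looksLikeBitcode` are hypothetical helper names, not part of this PR.

```cpp
#include <cassert>
#include <cstddef>

// Shape of the generated kernels_bc.h. Only the magic number is shown here;
// the real array holds the complete bitcode file.
unsigned char kernels_bc[] = {0x42, 0x43, 0xc0, 0xde};
unsigned int kernels_bc_len = 4;

// Hypothetical non-owning view over the embedded bytes.
struct BitcodeView {
  const unsigned char* data;
  std::size_t size;
};

inline BitcodeView embeddedKernels() {
  // The length variable should always agree with the array size.
  assert(kernels_bc_len == sizeof(kernels_bc));
  return BitcodeView{kernels_bc, kernels_bc_len};
}

// Cheap sanity check before handing the buffer to a bitcode parser:
// raw LLVM bitcode starts with 'B' 'C' 0xC0 0xDE.
inline bool looksLikeBitcode(BitcodeView v) {
  return v.size >= 4 && v.data[0] == 'B' && v.data[1] == 'C' &&
         v.data[2] == 0xC0 && v.data[3] == 0xDE;
}
```

A check like this can catch a stale or truncated header at startup, before the buffer ever reaches the parser.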

  How a JIT module with pre-built IR is compiled at runtime
  (using store PoC as example, see PrebuiltIRTest.cpp)

  PrebuiltIRTest.cpp                 ThrustJITv2                     PrebuiltIR
  (RowContainer store PoC)           (JIT engine)                    (bitcode linker)
  ────────────────────────           ────────────                    ──────────────

  jit = getInstance() ─────────────► singleton

  CompileModule(irGen, name) ──────► check LRU cache (hit → return)
                                     create Module
                                         │
    irGenerator(module):                 │
      │                                  │
      ├── PrebuiltIR::linkInto(m) ──────►├── parseBitcodeFile
      │                                  │     (kernels_bc → Module)
      │   // now module has:             │
      │   // jit_store_i64 (internal)    ├── Linker::linkModules
      │   // jit_store_f64 (internal)    │     (merge into target)
      │   // jit_store_i32 (internal)    │
      │   // ...                         ├── mark all → InternalLinkage
      │                                  │
      ├── IRBuilder:                     │
      │   create store_keys func         │
      │   with baked-in constants:       │
      │                                  │
      │   func store_keys(rc, row,       │
      │                   cols, idx):    │
      │     call jit_store_i64(          │
      │       row, 0, cols[0], idx,      │  ← offsets, nullByte, nullMask
      │       32, 1)                     │    are compile-time constants
      │     call jit_store_f64(          │
      │       row, 8, cols[1], idx,      │
      │       32, 2)                     │
      │     call jit_store_i32(          │
      │       row, 16, cols[2], idx,     │
      │       32, 4)                     │
      │     ret void                     │
      │                                  │
      └── return ───────────────────────►│
                                         │
                                    IR Transform Layer:
                                         ├── AlwaysInliner
                                         │   jit_store_i64 → inlined
                                         │   jit_store_f64 → inlined
                                         │   jit_store_i32 → inlined
                                         ├── InstCombine/GVN/CFGSimplify
                                         │
                                    LLJIT → machine code
                                         │
                                    LOG: "Compiled 'jit_pb_store_...'
                                          in 14ms, 2048 bytes"
                                         │
                                    cache by name
                                         │
  mod = CompiledModuleSP ◄───────────────┘

  fn = mod->getFuncPtr(name)
  fn(rc, row, cols, idx) → runs JIT code
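
The "check LRU cache (hit → return)" and "cache by name" steps in the flow above can be sketched as a small name-keyed LRU cache. This is not ThrustJITv2's actual implementation: `ModuleCache`, `getOrCompile`, and the `CompiledModule` struct below are hypothetical stand-ins, and the compile callback takes the place of the IR generation, transform, and LLJIT steps.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <list>
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>

struct CompiledModule {  // hypothetical stand-in for the real compiled module
  std::string name;
};
using CompiledModuleSP = std::shared_ptr<CompiledModule>;

class ModuleCache {
 public:
  explicit ModuleCache(std::size_t capacity) : capacity_(capacity) {
    assert(capacity > 0);
  }

  // Returns the cached module for `name`, or runs `compile` once and caches it.
  CompiledModuleSP getOrCompile(const std::string& name,
                                const std::function<CompiledModuleSP()>& compile) {
    auto it = index_.find(name);
    if (it != index_.end()) {
      // Hit: move the entry to the front (most recently used) and return it.
      lru_.splice(lru_.begin(), lru_, it->second);
      return it->second->second;
    }
    CompiledModuleSP mod = compile();  // miss: pay the JIT cost
    lru_.emplace_front(name, mod);
    index_[name] = lru_.begin();
    if (lru_.size() > capacity_) {  // evict the least recently used entry
      index_.erase(lru_.back().first);
      lru_.pop_back();
    }
    return mod;
  }

  std::size_t size() const { return lru_.size(); }

 private:
  using Entry = std::pair<std::string, CompiledModuleSP>;
  std::size_t capacity_;
  std::list<Entry> lru_;  // front = most recently used
  std::unordered_map<std::string, std::list<Entry>::iterator> index_;
};
```

Keying by name means the name has to encode everything baked into the module (types, offsets, null masks); otherwise two different layouts could share one cached function.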


  Why pre-built vs IRBuilder?

  IRBuilder approach (existing):        Pre-built approach (new):

    400 lines of C++ using              40 lines of C++ using
    IRBuilder API to emit               IRBuilder API to emit
    LLVM IR instructions                calls to pre-built kernels
    ┌──────────────────────┐            ┌──────────────────────┐
    │ builder.CreateGEP    │            │ builder.CreateCall(  │
    │ builder.CreateLoad   │            │   "jit_store_i64",   │
    │ builder.CreateICmp   │            │   {row, offset, ...})│
    │ builder.CreateBr     │            └──────────────────────┘
    │ builder.CreatePHI    │
    │ ... 400 lines total  │            Kernel logic written in
    └──────────────────────┘            plain C++ (kernels.cpp),
                                        compiled to bitcode,
    Hard to write, review,              inlined at JIT time.
    and maintain.
                                        Easy to write and test.
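
One way to read the "easy to write and test" claim: because the kernels are ordinary C++, the same call sequence that the generated store_keys function makes can be compiled natively and unit-tested without any JIT. The sketch below assumes the row layout from the diagram (offsets 0, 8, 16). `storeFixed` and `KeyColumns` are hypothetical names, and null handling and the `rc` argument are omitted for brevity.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical fixed-width kernel: copy one value into the row at `offset`.
template <typename T>
inline void storeFixed(char* row, int32_t offset, const T* values, int32_t idx) {
  std::memcpy(row + offset, &values[idx], sizeof(T));
}

// Hypothetical bundle of the three key columns used in the PoC layout.
struct KeyColumns {
  const int64_t* c0;
  const double* c1;
  const int32_t* c2;
};

// Native equivalent of the JIT-composed function: one call per key column,
// with the offsets written as constants at each call site.
inline void store_keys(char* row, const KeyColumns& cols, int32_t idx) {
  assert(row != nullptr);
  storeFixed<int64_t>(row, 0, cols.c0, idx);
  storeFixed<double>(row, 8, cols.c1, idx);
  storeFixed<int32_t>(row, 16, cols.c2, idx);
}
```

A test can fill a row buffer through store_keys and read the values back with memcpy, which exercises the kernel logic in isolation from bitcode linking and inlining.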

Performance Impact

  • Positive Impact: Store PoC shows ~2-3x speedup for fixed-width types (3 keys, 10K rows).

Release Note

- Add pre-built JIT IR framework: compile C++ kernels to LLVM bitcode at build time, inline into JIT modules at runtime. Reduces JIT code complexity (40 lines vs 400) while maintaining performance.

Checklist (For Author)

  • I have added/updated unit tests (ctest).
  • I have verified the code with local build (Release/Debug).
  • I have run clang-format / linters.
  • (Optional) I have run Sanitizers (ASAN/TSAN) locally for complex C++ changes.
  • No need to test or manual test.

Breaking Changes

  • No

@zhangxffff marked this pull request as draft on April 14, 2026 01:58
@zhangxffff changed the title from "[feature] Pre-built JIT IR: compile C++ kernels to bitcode at build time for JIT inlining" to "[WIP] Pre-built JIT IR: compile C++ kernels to bitcode at build time for JIT inlining" on Apr 14, 2026
@zhangxffff requested a review from kexianda on April 14, 2026 02:20
@zhangxffff changed the title from "[WIP] Pre-built JIT IR: compile C++ kernels to bitcode at build time for JIT inlining" to "[feat] Pre-built JIT IR: compile C++ kernels to bitcode at build time for JIT inlining" on Apr 15, 2026
@zhangxffff marked this pull request as ready for review on April 15, 2026 08:50
@zhangxffff force-pushed the feat/prebuild_jit_ir branch 2 times, most recently from 7d7742f to 32b616d, on April 16, 2026 09:28
@zhangxffff force-pushed the feat/prebuild_jit_ir branch from dc77263 to 957dc70 on April 19, 2026 14:38

Development

Successfully merging this pull request may close these issues.

[Feature] Pre-built JIT IR: compile C++ kernels to bitcode for JIT inlining