
feat: build core operator framework with multi-device backends, Python bindings, testing, and CI #39

Open
voltjia wants to merge 93 commits into master from feat/dev-infra

Conversation

@voltjia
Collaborator

@voltjia voltjia commented Apr 1, 2026

This PR builds out the full InfiniOps operator library from the initial Tensor/DataType scaffolding. It spans 109 files across core framework, operator implementations, build system, Python bindings, testing, and CI.

Core Framework

  • Device abstraction supporting 6 device types (CPU, Nvidia, Iluvatar, MetaX, Moore, and Cambricon) with compile-time filtering.
  • Operator base class with hash-based caching (CacheKey) for workspace reuse.
  • Handle for stream/workspace dependency injection.
  • Dispatcher — tag-based compile-time + runtime dispatch over device/dtype (C++17 compatible).
  • DispatchFunc high-level interface for multi-type and mixed dispatch.
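The dispatcher itself is C++17 template machinery, but its runtime half boils down to selecting a kernel by a (device, dtype) tag pair. A minimal Python sketch of that idea (all names here are hypothetical, not the project's API):

```python
# Hypothetical sketch of the dispatch idea behind `DispatchFunc`: kernels are
# registered under a (device, dtype) tag pair and looked up at call time.
KERNELS = {}

def register(device, dtype):
    """Register a kernel under a (device, dtype) tag."""
    def wrap(fn):
        KERNELS[(device, dtype)] = fn
        return fn
    return wrap

@register("cpu", "float32")
def add_cpu_f32(a, b):
    return [x + y for x, y in zip(a, b)]

def dispatch(device, dtype, *args):
    """Runtime dispatch: fail loudly on an unsupported tag pair."""
    try:
        kernel = KERNELS[(device, dtype)]
    except KeyError:
        raise NotImplementedError(f"no kernel for {device}/{dtype}")
    return kernel(*args)
```

In the real C++ dispatcher the lookup is resolved through compile-time type lists rather than a runtime dictionary; the sketch only shows the tag-pair shape of the interface.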

Operators & Backend Support

| Operator      | CPU | NVIDIA | Iluvatar | MetaX | Moore | Cambricon |
|---------------|-----|--------|----------|-------|-------|-----------|
| Add           |     |        |          |       |       |           |
| GEMM          |     |        |          |       |       |           |
| RmsNorm       |     |        |          |       |       |           |
| SwiGLU        |     |        |          |       |       |           |
| CausalSoftmax |     |        |          |       |       |           |
Build System

  • CMake with per-backend flags (-DWITH_NVIDIA, etc.) and auto-detection mode (-DAUTO_DETECT_DEVICES).
  • `pyproject.toml` with a scikit-build-core backend; supports `pip install .`.
  • Compiler wrapper scripts for MetaX (MACA) and Cambricon (CNToolkit).

Python Bindings

  • libclang-based code generator (scripts/generate_wrappers.py) parses operator headers → emits pybind11 code.
  • PyTorch tensor interop via TensorFromPybind11Handle.

Testing

  • pytest with auto-parametrization by device & dtype, tolerance-based assertions.
  • @pytest.mark.auto_act_and_assert for declarative Act/Assert against PyTorch references.
  • Benchmark mode (--benchmark) via torch.utils.benchmark.
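A rough sketch of what device/dtype auto-parametrization plus tolerance-based assertion amounts to, assuming illustrative helper names rather than the project's actual fixtures:

```python
# Illustrative sketch of auto-parametrization and tolerance-based checking;
# the names here are stand-ins, not the project's real test harness.
import itertools

DEVICES = ["cpu", "cuda"]
DTYPES = ["float16", "float32"]

def auto_params():
    """Cross-product of devices and dtypes, one test case per pair."""
    return list(itertools.product(DEVICES, DTYPES))

def assert_close(actual, expected, rtol=1e-3):
    """Elementwise comparison against a reference within a relative tolerance."""
    for x, y in zip(actual, expected):
        assert abs(x - y) <= rtol * max(abs(x), abs(y), 1.0), (x, y)
```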

CI (.ci/)

  • Docker-based multi-platform pipeline with per-device config (config.yaml).
  • Remote dispatch agent with webhook scheduler (agent.py).
  • Auto-detects GPU hardware and runs platform-appropriate test suites.
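The detection step can be pictured as parsing vendor tool output to count visible GPUs; a hedged sketch assuming `nvidia-smi -L`-style text (the real CI also handles `mx-smi`, `mthreads-gmi`, and `cnmon`, with different formats):

```python
# Hypothetical sketch of GPU auto-detection: count device lines in
# `nvidia-smi -L`-style output. Per-vendor tools need per-vendor parsers.
def count_gpus(smi_output: str) -> int:
    """Count lines that describe a GPU, e.g. 'GPU 0: ... (UUID: ...)'."""
    return sum(1 for line in smi_output.splitlines()
               if line.strip().startswith("GPU "))

sample = "GPU 0: NVIDIA A100 (UUID: GPU-abc)\nGPU 1: NVIDIA A100 (UUID: GPU-def)\n"
```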

voltjia and others added 30 commits February 12, 2026 19:24
- Add additional entries to the Device enum class to support new hardware targets.
- Adapt GEMM mcblas implementation to use MetaX backend and add the test example.
- Extract common BLAS interfaces into a new blas.h abstraction for GEMM implementations to share.
…GEMM implementation

- Add `ConstexprMap` and compile-time traits in `common/` for efficient
  type-to-metadata mapping and relevant operations.
- Implement a generic dispatcher to reduce boilerplate for dispatching, especially for data types and devices.
- Add the CPU implementation for the GEMM
- Update `DataType` definitions and type lists to support wide
  dispatching.

Follow-up: support for fp16 and bf16 kernels is pending.
…further abstract `blas.h`

- further abstract `blas.h`, backends now only do name change
- fix various naming issues and small issues
- combined the `gemm` example programs across the platforms, now only one program for all platforms
zhangyue207 and others added 24 commits March 4, 2026 17:05
…pip install` on MetaX (#5)

* refactor: adapt the dispatcher to be C++17-compatible

- dispatcher now does not depend on C++20 features
- update the current dispatcher use cases
- add some relevant constexpr traits in common/traits.h
- add `PYBIND_ENABLE_EXTRAS` internal cmake variable for controlling the flags introduced by pybind

* style: format some comments in common/traits.h

* fix: support mxcc to use pytest by using `scripts/mxcc_wrapper.sh`

* build: add auto-detection for MetaX

* style: change the naming for types and variables in `common/traits.h`, `common/constexpr_map.h` and `dispatcher.h`

* style: fix the method and context string naming in `src/add/add.h`

* refactor: change the anonymous namespaces in `dispatcher.h` to namespace `detail` to comply with the styling rules

* style: fix comment styling issues

* fix: update `DispatchFunc` usage in `src/cuda/rms_norm/kernel.h`

---------

Co-authored-by: Jiacheng Huang <huangjiacheng0709@outlook.com>
* refactor: update `Operator::call` to accept `handle`, `stream`, `workspace`, and `workspace_size_in_bytes`

* feat: add `workspace_size_in_bytes` virtual method to `OperatorBase`
* feat: add the implementation of  operator on Cambricon

* chore: format `src/cambricon/gemm/cnblas.h` with `clang-format`

* refactor: update `src/cambricon/gemm/cnblas.h` to use latest `operator()` mechanism

* refactor: update `src/cambricon/gemm/cnblas.h` to use `workspace_` from `OperatorBase`

* chore: resolve PR comments

* chore: reverse tensor descriptor destruction order

---------

Co-authored-by: Jiacheng Huang <huangjiacheng0709@outlook.com>
…nds (#12)

* feat(ops): implement CausalSoftmax operator with CPU and CUDA backends

* refactor(ops): update CausalSoftmax constructor and method signatures for consistency

* style: improve assertion messages in CausalSoftmax for clarity and consistency

* chore: format files with `clang-format`

* test: disable test skipping for the missing `infini.ops.causal_softmax`

* refactor: update `causal_softmax` to use latest `operator()` mechanism

---------

Co-authored-by: Jiacheng Huang <huangjiacheng0709@outlook.com>
* feat: add the CPU implementation of float16 and bfloat16 and the CPU `Cast()` function

- add the CPU implementation of float16 and bfloat16 as `float16_t` and `bfloat16_t`
- add the CPU `Cast()` function that supports conversion between any two CPU-supported types, including the custom `float16_t` and `bfloat16_t`

* style: change `indexToOffset()` to `IndexToOffset()` to comply with the styling requirement
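For context, an `IndexToOffset()`-style helper typically maps a flat element index to a memory offset through explicit strides; a sketch of that computation (signature illustrative, not the project's exact API):

```python
# Sketch of the strided index computation an `IndexToOffset()`-style helper
# performs: walk dimensions from innermost to outermost, peeling off the
# coordinate in each dimension and scaling by that dimension's stride.
def index_to_offset(index, shape, strides):
    offset = 0
    for dim, stride in zip(reversed(shape), reversed(strides)):
        offset += (index % dim) * stride
        index //= dim
    return offset
```

For contiguous row-major strides the result equals the flat index itself; the helper matters for non-contiguous (e.g. padded or transposed) layouts.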

* feat: add the CUDA `Cast()` function

* refactor: refactor CUDA `Cast` utility with SFINAE-based hardware dispatch and move them into `common/cuda/cast.h`

* style: change the naming of some types in `common/cast.h` and `common/cuda/cast.h` to better comply with the naming rules

* chore: remove unused header `data_type.h` in `common/cuda/kernel_commons.h`

* style: adjust comments for styling rule compliance

* style: change `float16_t` and `bfloat16_t` to `Float16` and `BFloat16` and fix various styling issues.
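As background for the `Cast()` work: bfloat16 is simply the top 16 bits of an IEEE-754 float32, so a CPU-side conversion reduces to bit manipulation. A sketch with round-to-nearest-even (illustrative only, not the project's implementation):

```python
import struct

# Sketch of a float32 <-> bfloat16 conversion at the bit level. bfloat16
# keeps the float32 sign, exponent, and top 7 mantissa bits; rounding uses
# the standard round-to-nearest-even bias on the truncated low half.
def float32_to_bfloat16_bits(x: float) -> int:
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    rounding = 0x7FFF + ((bits >> 16) & 1)  # nearest-even bias
    return ((bits + rounding) >> 16) & 0xFFFF

def bfloat16_bits_to_float32(b: int) -> float:
    return struct.unpack("<f", struct.pack("<I", (b & 0xFFFF) << 16))[0]
```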
* feat: add  op with NVIDIA and CPU backends

* fix: fix code as pr comment

* chore: format `tests/test_swiglu.py` and `tests/utils.py`

---------

Co-authored-by: Jiacheng Huang <huangjiacheng0709@outlook.com>
* feat: reorganize casting utilities and enhance CPU support

- Moved casting functions to separate CPU and CUDA headers for better organization.
- Introduced a new `Cast()` function in the CPU implementation to handle type conversions, including support for custom types like `float16_t` and `bfloat16_t`.
- Updated various operators to utilize the new casting utilities, ensuring consistent type handling across CPU and CUDA backends.
- Enhanced test cases to cover additional data types and ensure compatibility with the new casting logic.

* fix: update bfloat16 test tolerance in `test_rms_norm.py`

- Increased the tolerance for `bfloat16` from `1e-2` to `2e-2` to better accommodate numerical precision in tests.

* format: simplify type dispatching in `Add` operator and formatting
* fix: add equality operators and CacheKey struct for improved operator caching

- Implemented `operator==` and `operator!=` for the `Device` class to facilitate comparison.
- Introduced `CacheKey` struct in `operator.h` to enhance caching mechanism with a hash and vector of tensors.
- Updated the `Operator::call` method to utilize `CacheKey` for caching operators based on input arguments.
- Added `MetaEqual` method in `Tensor` class for tensor comparison based on metadata.

* refactor: move CacheKey struct to detail namespace and enhance Tensor comparison

- Changed the namespace of `CacheKey` to `infini::ops::detail` for better organization.
- Updated the hash and equality operators for `CacheKey` to reflect the new namespace.
- Removed the `MetaEqual` method from the `Tensor` class and replaced it with a dedicated `std::equal_to` specialization for `Tensor` to improve comparison logic.
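The caching scheme these commits describe can be sketched as keying on the metadata that makes a configured operator reusable (device, dtype, shape, strides); all names below are hypothetical stand-ins for the C++ `CacheKey`:

```python
from dataclasses import dataclass

# Hypothetical sketch of the `CacheKey` idea: a hashable record of the
# metadata that determines whether a previously built operator (and its
# workspace) can be reused for a new call.
@dataclass(frozen=True)
class CacheKey:
    device: str
    dtype: str
    shape: tuple
    strides: tuple

_cache = {}

def get_operator(key: CacheKey, build):
    """Return a cached operator for `key`, building it only on first use."""
    if key not in _cache:
        _cache[key] = build()
    return _cache[key]
```

A repeat call with metadata-equal inputs then skips operator setup entirely, which is the point of the `std::equal_to<Tensor>` specialization mentioned above.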

* style: remove unnecessary blank line in cublas.h for improved readability
* feat(gemm-moore): add Moore (MUSA) GEMM backend support.

* refactor(gemm-moore): reuse shared BLAS helper and specialize scalars.

* build: use detected Python interpreter for wrapper generation.

---------

Co-authored-by: zhuyue <zhuyue@qiyuanlab.com>
…igLU` (#17)

* refactor: reorganize casting utilities and enhance CUDA kernel support

- Moved CPU casting functions to a new file `common/cpu/cast.h` and updated the `Cast` function to utilize these utilities.
- Updated CUDA kernel files to include the new casting utilities and improved block size handling in kernel launches.
- Enhanced the `Add`, `CausalSoftmax`, `Gemm`, `RmsNorm`, and `Swiglu` operators to utilize the new casting mechanisms for better type handling.
- Added support for additional data types in tests and adjusted test cases for consistency across CPU and GPU backends.

* refactor: improve formatting

* refactor: cache cudaDeviceProp per device via DevicePropertyCache

Introduce DevicePropertyCache to query and cache all device properties
once at first access, avoiding repeated cudaGetDeviceProperties calls.
QueryMaxThreadsPerBlock and GetOptimalBlockSize are simplified to
delegate to the cache. Also move block_size out of dispatch lambdas
in add and swiglu kernels since it does not depend on the dispatched type.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
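The `DevicePropertyCache` pattern can be sketched as memoizing a per-device query (here `fake_query` stands in for `cudaGetDeviceProperties`; all names are illustrative, not the project's actual API):

```python
import functools

# Sketch of the property-cache idea: hit the "driver" once per device id,
# then answer all later queries from the memoized result.
CALLS = []

def fake_query(device_id: int) -> dict:
    CALLS.append(device_id)  # record how often the driver is actually hit
    return {"max_threads_per_block": 1024, "warp_size": 32}

@functools.lru_cache(maxsize=None)
def get_device_props(device_id: int) -> dict:
    return fake_query(device_id)

def get_optimal_block_size(device_id: int) -> int:
    # Delegates to the cache, mirroring the simplified helpers described above.
    return min(256, get_device_props(device_id)["max_threads_per_block"])
```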

* perf: reduce redundant ops in Add, SwiGLU, and RmsNorm CUDA kernels

- Add __restrict__ to all pointer params in AddKernel and SwigluKernel
  to enable compiler alias analysis, vectorization, and prefetch
- Remove dead for-loop in Add/SwiGLU kernel launch (step >= output_size_
  by construction, loop body always executed exactly once); drop offset param
- Inline sigmoid in SwiGLU bfloat16/bfloat162 paths to eliminate
  redundant bf16<->float round-trips (8 -> 4 conversions for bfloat162,
  2 -> 1 for bfloat16)
- Use a temp variable in RmsNorm SumSquared to guarantee single global load

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
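For reference, assuming the common silu(gate) * up formulation, the computation the SwiGLU kernels implement is silu(x) = x * sigmoid(x) applied to the gate. A float-only sketch of the math (the fused fp16/bf16 kernels additionally avoid the repeated low-precision round-trips noted above):

```python
import math

# Float-only reference sketch of SwiGLU: silu(gate) * up, where
# silu(x) = x * sigmoid(x). Illustrative; precision handling is the
# interesting part in the actual kernels, not shown here.
def swiglu(gate, up):
    return [g * (1.0 / (1.0 + math.exp(-g))) * u for g, u in zip(gate, up)]
```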

* fix: abort with diagnostic on out-of-range device_id in GetDeviceProps

Returning a default-constructed dummy cudaDeviceProp silently propagated
incorrect device properties; now print an explicit error and abort so
the bug is immediately visible.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* style: improve code formatting and comments in CUDA kernel files

- Updated comments for clarity in `kernel_commons.h`.
- Reformatted kernel launch macro in `kernel.h` for better readability.
- Enhanced line breaks in `SwigluKernel` implementation for improved code structure.
- Adjusted test function formatting in `test_add.py` for consistency.

* refactor: remove unnecessary cuda_runtime.h includes in kernel headers

- Removed redundant `#include <cuda_runtime.h>` from `add`, `causal_softmax`, `rms_norm`, and `swiglu` kernel header files.
- Added TODO comments for future removal of the remaining includes to improve code clarity and maintainability.

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
… and add `_torch_rms_norm` fallback (#20)

* fix: remove uint16 test from test_add.py

- Removed `torch.uint16` from the list of integer data types in the `_INT_DTYPES` tuple to streamline the code and eliminate redundancy.

* refactor: enhance dtype handling in test_add.py

* refactor: streamline dtype parameterization in test_add.py and enhance rms_norm fallback handling in test_rms_norm.py

* refactor: add unsigned integer data types to test_add.py for enhanced dtype handling

* refactor: simplify integer dtype filtering

* refactor: simplify `_torch_rms_norm` fallback logic

---------

Co-authored-by: Jiacheng Huang <huangjiacheng0709@outlook.com>
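The reference such a fallback mirrors is the usual RMSNorm, x / sqrt(mean(x²) + eps) * weight; a sketch computed in plain floats (illustrative, not the exact fallback code), which also shows why low-precision dtypes want an upcast-then-normalize path:

```python
import math

# Reference sketch of RMSNorm over one row: normalize by the root mean
# square of the elements (plus eps for stability), then scale by weight.
def rms_norm(x, weight, eps=1e-6):
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms * w for v, w in zip(x, weight)]
```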
* feat(moore): add add-op support with musa integration.

* refactor: improve specialization logic

* fix: fix `AttributeError` on Cambricon

---------

Co-authored-by: Jiacheng Huang <huangjiacheng0709@outlook.com>
* feat(moore): add swiglu-op support with musa integration

- add a Moore swiglu backend on top of the shared CUDA-style path
- extract shared swiglu compute into a reusable op for backend override
- keep Moore-specific half and bfloat162 handling in the Moore backend only

* refactor: introduce `src/moore/polyfills.cuh`

* refactor: use polyfills for Moore SwiGLU

---------

Co-authored-by: Jiacheng Huang <huangjiacheng0709@outlook.com>
- add MetaX  operator specialization
- make the shared CUDA-style kernel compatible with MetaX
- reuse common casting utilities for fp16 and bf16 conversions

Co-authored-by: gongchensu <zhuyue@qiyuanlab.com>
- add MetaX `RmsNorm` operator specialization
- make the shared CUDA-style rms_norm kernel compatible with MetaX
- forward runtime `eps` when launching the kernel

Co-authored-by: gongchensu <zhuyue@qiyuanlab.com>
Co-authored-by: gongchensu <zhuyue@qiyuanlab.com>
* feat/nv ci test

* feat: ci sys for nv platform

* fix(ci): fix results dir permissions and reduce parallel workers

- Pass host UID/GID into container and `chown` results after tests,
  so mounted `ci-results/` is accessible by the host user.
- Limit `pytest-xdist` workers from `-n auto` to `-n 8` to prevent
  OOM worker crashes on high-core-count machines.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* refactor(ci): Refactor code structure for improved readability and maintainability

* docs: add multi-machine deployment guide for NVIDIA and Iluvatar platform

* feat(ci): enhance CI configuration and agent functionality with platform detection and job resolution

* feat(ci): add MetaX platform CI support

Add Dockerfile, config, and mx-smi GPU detection for MetaX (MACA) platform.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat(ci): improve job dispatch logging and handle job results more effectively

* feat(ci): add Moore Threads (MUSA) platform CI support

Add GPU detection via mthreads-gmi, Dockerfile, config, and update docs
with Moore and MetaX platform deployment instructions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat(ci): capture Docker error output for remote job diagnostics

* feat(ci): capture error output and improve CLI result display

- Capture last 50 lines of Docker output via ring buffer so failed
  jobs return diagnostic info to the CLI client.
- Store raw bytes during execution; decode only on the failure path.
- Align job name columns in `<==` result lines for readability.
- Show summary only when jobs fail, removing redundant all-pass output.

Co-Authored-By: Claude <noreply@anthropic.com>
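The ring-buffer capture can be sketched with `collections.deque(maxlen=...)`: keep only the last 50 raw lines and decode solely on the failure path (function names illustrative):

```python
from collections import deque

# Sketch of the tail-capture idea: a bounded deque keeps only the newest
# lines, stored as raw bytes; decoding happens only when a job fails.
def capture_tail(stream, maxlen=50):
    tail = deque(maxlen=maxlen)
    for raw_line in stream:  # raw bytes during execution
        tail.append(raw_line)
    return tail

def format_failure(tail):
    """Decode only on the failure path, tolerating partial UTF-8."""
    return b"".join(tail).decode("utf-8", errors="replace")
```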

* feat(ci): add Cambricon MLU platform CI support

- Add .ci/images/cambricon/Dockerfile for AnolisOS-based Cambricon image
- Add cambricon platform to config.yaml with MLU-style GPU passthrough
- Add GPU_STYLE_MLU constant and MLU_VISIBLE_DEVICES support in run.py
- Add cnmon-based GPU detection (_detect_gpus_cambricon) in ci_resource.py
- Add --test CLI flag to override pytest test path at runtime
- Skip empty stage run commands instead of erroring (compilation-only mode)
- Fix _torch_gemm fallback for CPU float16/bfloat16 (upcast to float32)
- Skip bfloat16 on MLU (cnnlBatchMatMulEx does not support it)
- Hoist _PYTEST_VALUE_FLAGS to module level; add ValueError guard in cambricon parser
- Remove redundant yaml import guard in agent.py (utils.py already handles it)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* docs(ci): translate README and comments to English, use ngpus for NVIDIA scheduler

- Rewrite README.md entirely in English; add Cambricon to platform
  table and directory tree.
- Translate all inline comments in config.yaml to English.
- Replace `gpu_ids: "0"` with `ngpus: 1` for NVIDIA platform so the
  scheduler auto-picks a free GPU rather than pinning to device 0.
- Add `ngpus` support to `parse_gpu_requirement` in ci_resource.py so
  scheduler correctly counts NVIDIA GPU demand.
- Replace deprecated `gpu_count` fallback with `ngpus` in run.py
  `build_docker_args`.

Co-Authored-By: Claude <noreply@anthropic.com>

* feat(ci): add --local flag to run.py for testing uncommitted changes

- Mount current directory read-only into container via `-v cwd:/workspace/repo:ro`
- Copy to writable `/tmp/src` inside container before setup runs, so host
  files are never modified by pip install or build artifacts
- Simplify README: fix ngpus example, add gpu_style column, add --local docs

Co-Authored-By: Claude <noreply@anthropic.com>

* style(ci): normalize comments to complete English sentences with markdown

- Backtick-quote tool/package names (`torch`, `pip`, `git`, `cmake`,
  `coreutils-single`, `conda`) and paths in Dockerfile comments.
- Add explanatory comment to the commented-out `agents:` block in
  `config.yaml` describing when to uncomment it.
- Convert all section-header banners in `.ci/tests/` to "Tests for
  `FunctionName`." sentence form; fix three docstrings in `test_agent.py`.
- Backtick-quote identifiers in `tests/test_gemm.py` inline comments.

Co-Authored-By: Claude <noreply@anthropic.com>

* style(tests): backtick-quote identifiers in test_gemm.py skip message

Co-Authored-By: Claude <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude <noreply@anthropic.com>
* feat: Add RMSNorm op in cambricon backend.

* refactor: make `Cast` utility to use `Device::Type` template parameter

* refactor: add `Caster` mixin

* refactor: rename `cast**` to `caster**`

* fix: fix the mlu naming to google c++ naming style

* chore: format files with `clang-format`

* refactor: update CUDA kernels to use `Caster`

* fix: fix rmsnorm dispatch to use one dispatch

---------

Co-authored-by: Jiacheng Huang <huangjiacheng0709@outlook.com>
…xed dispatch (#29)

* feat: add a convenient interface for any `int64_t`-convertible types and use `DispatchFunc()` to dispatch `DataType` and block sizes with a single call.

- add a convenient interface for any `int64_t`-convertible types, which is mostly used for multi-type dispatch and mixed dispatch
- use `DispatchFunc()` to dispatch `DataType` and block sizes with a single function call in various kernels' implementation
- remove the `CUDA_BLOCK_SIZE_XXX` macros and simply use numbers instead

* style: fix the styling issue by adding a period to the TODO comment

* fix: fix rebase error

* style: fix the styling issues for comments in `dispatcher.h` and `cuda/causal_softmax/kernel.h`
#38)

* refactor: make `TypeMap`, `IsFP16`, `IsBFloat16`, and `DispatchFunc` device-aware

* refactor: make `cuda/` shared headers self-contained and include-order-independent

* fix: update call sites to device-aware `TypeMap`, `IsFP16`/`IsBFloat16`, and `DispatchFunc`

* chore: format files with `clang-format`

* fix: update `cuda/swiglu` kernels to use device-aware type predicates

* fix: replace per-instance `blasHandle_t` with a static singleton in `Blas`

* fix: restore kernel headers for `moore/add` and `moore/swiglu` to use `clang-format off` and `clang-format on`

* fix: use absolute includes, consistent include guards, and formatted comments

* refactor: extract `GetOptimalBlockSize` logic into shared `ComputeOptimalBlockSize`

* fix: include `<musa_fp16.h>` in `polyfills.cuh` before `hrcp` macro to prevent collision

* chore: add blank lines between `using` type alias declarations in `device_.h`

* chore: add TODO comments for potential performance and concurrency issues

* fix: move `clang-format` guards to wrap only CUDA headers in `iluvatar/device_.h`
a_host.strides()};
Tensor b_device{b_ptr, b_host.shape(), b_host.dtype(), a_host.device(),
b_host.strides()};
Tensor c_device{c_ptr, c_host.shape(), c_host.dtype(), a_host.device(),

Do `a_host`/`b_host`/`c_host` refer to host-side Tensors? If so, why are `a_device`/`b_device`/`c_device` constructed directly from `a_host.device()`?


Is this a unified runtime API just for the examples? Why not use the one infiniop already wraps?


Does every elementwise op need its own header implementation? It feels like these could be unified into a single one, simplifying the process of adding new elementwise ops.


Why is this file called `device_.h`...

Backend::memcpyH2D);
Backend::memcpy(d_other_strides_, other_strides_.data(), strides_size,
Backend::memcpyH2D);
Backend::memcpy(d_out_strides_, out_strides_.data(), strides_size,

Leave a TODO to later reduce the malloc/memcpy as much as possible, or eliminate them for common shapes.
At the very least, the current code could malloc the total size once and hand out sub-buffers by offset; the same applies to the other elementwise kernels.


Does the trailing underscore in the filename carry a special meaning?
Also, wouldn't this file's contents be a better fit for a `.cuh`?


This one should also be a `.cuh`.


static constexpr auto free = cudaFree;

static constexpr auto memcpyH2D = cudaMemcpyHostToDevice;

Every op that uses these APIs has to spell them out again in its own `namespace infini::ops::op_name::DeviceBackend`, which is redundant. Why classify the APIs by op rather than by device (so each device only needs to be written once)?

#ifndef INFINI_OPS_CASTER_H_
#define INFINI_OPS_CASTER_H_

#include "data_type.h"

This file doesn't actually use anything from `data_type`, right?

@kilinchange

Also, the implementation code and headers currently all live under `src`; going forward, consider introducing a separate `include` directory for the public headers exposed externally, keeping `src` for internal implementation and private headers.


namespace infini::ops {

namespace swiglu {

There are also still many namespaces that carry the op name. The underlying issue seems to be that the device level belongs above the op level for better reuse; right now the hierarchy is inverted.
