
feat: build core operator framework with multi-device backends, Python bindings, testing, and CI #39

Open
voltjia wants to merge 93 commits into master from feat/dev-infra

Conversation

@voltjia
Collaborator

@voltjia voltjia commented Apr 1, 2026

This PR builds out the full InfiniOps operator library from the initial Tensor/DataType scaffolding. It spans 109 files across core framework, operator implementations, build system, Python bindings, testing, and CI.

Core Framework

  • Device abstraction supporting 6 device types (CPU, Nvidia, Iluvatar, MetaX, Moore, and Cambricon) with compile-time filtering.
  • Operator base class with hash-based caching (CacheKey) for workspace reuse.
  • Handle for stream/workspace dependency injection.
  • Dispatcher — tag-based compile-time + runtime dispatch over device/dtype (C++17 compatible).
  • DispatchFunc high-level interface for multi-type and mixed dispatch.
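The dispatcher itself is C++17 template machinery, but its runtime half boils down to selecting a kernel by a (device, dtype) tag pair. A minimal Python sketch of that idea (all names here are hypothetical, not the project's API):

```python
# Hypothetical sketch of the dispatch idea behind `DispatchFunc`: kernels are
# registered under a (device, dtype) tag pair and looked up at call time.
KERNELS = {}

def register(device, dtype):
    """Register a kernel under a (device, dtype) tag."""
    def wrap(fn):
        KERNELS[(device, dtype)] = fn
        return fn
    return wrap

@register("cpu", "float32")
def add_cpu_f32(a, b):
    return [x + y for x, y in zip(a, b)]

def dispatch(device, dtype, *args):
    """Runtime dispatch: fail loudly on an unsupported tag pair."""
    try:
        kernel = KERNELS[(device, dtype)]
    except KeyError:
        raise NotImplementedError(f"no kernel for {device}/{dtype}")
    return kernel(*args)
```

In the real C++ dispatcher the lookup is resolved through compile-time type lists rather than a runtime dictionary; the sketch only shows the tag-pair shape of the interface.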

Operators & Backend Support

| Operator      | CPU | NVIDIA | Iluvatar | MetaX | Moore | Cambricon |
|---------------|-----|--------|----------|-------|-------|-----------|
| Add           |     |        |          |       |       |           |
| GEMM          |     |        |          |       |       |           |
| RmsNorm       |     |        |          |       |       |           |
| SwiGLU        |     |        |          |       |       |           |
| CausalSoftmax |     |        |          |       |       |           |
Build System

  • CMake with per-backend flags (-DWITH_NVIDIA, etc.) and auto-detection mode (-DAUTO_DETECT_DEVICES).
  • `pyproject.toml` with a scikit-build-core backend; supports `pip install .`.
  • Compiler wrapper scripts for MetaX (MACA) and Cambricon (CNToolkit).

Python Bindings

  • libclang-based code generator (scripts/generate_wrappers.py) parses operator headers → emits pybind11 code.
  • PyTorch tensor interop via TensorFromPybind11Handle.

Testing

  • pytest with auto-parametrization by device & dtype, tolerance-based assertions.
  • @pytest.mark.auto_act_and_assert for declarative Act/Assert against PyTorch references.
  • Benchmark mode (--benchmark) via torch.utils.benchmark.
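A rough sketch of what device/dtype auto-parametrization plus tolerance-based assertion amounts to, assuming illustrative helper names rather than the project's actual fixtures:

```python
# Illustrative sketch of auto-parametrization and tolerance-based checking;
# the names here are stand-ins, not the project's real test harness.
import itertools

DEVICES = ["cpu", "cuda"]
DTYPES = ["float16", "float32"]

def auto_params():
    """Cross-product of devices and dtypes, one test case per pair."""
    return list(itertools.product(DEVICES, DTYPES))

def assert_close(actual, expected, rtol=1e-3):
    """Elementwise comparison against a reference within a relative tolerance."""
    for x, y in zip(actual, expected):
        assert abs(x - y) <= rtol * max(abs(x), abs(y), 1.0), (x, y)
```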

CI (.ci/)

  • Docker-based multi-platform pipeline with per-device config (config.yaml).
  • Remote dispatch agent with webhook scheduler (agent.py).
  • Auto-detects GPU hardware and runs platform-appropriate test suites.
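The detection step can be pictured as parsing vendor tool output to count visible GPUs; a hedged sketch assuming `nvidia-smi -L`-style text (the real CI also handles `mx-smi`, `mthreads-gmi`, and `cnmon`, with different formats):

```python
# Hypothetical sketch of GPU auto-detection: count device lines in
# `nvidia-smi -L`-style output. Per-vendor tools need per-vendor parsers.
def count_gpus(smi_output: str) -> int:
    """Count lines that describe a GPU, e.g. 'GPU 0: ... (UUID: ...)'."""
    return sum(1 for line in smi_output.splitlines()
               if line.strip().startswith("GPU "))

sample = "GPU 0: NVIDIA A100 (UUID: GPU-abc)\nGPU 1: NVIDIA A100 (UUID: GPU-def)\n"
```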

voltjia and others added 30 commits February 12, 2026 19:24
- Add additional entries to the Device enum class to support new hardware targets.
- Adapt GEMM mcblas implementation to use MetaX backend and add the test example.
- Extract common BLAS interfaces into a new blas.h abstraction for GEMM implementations to share.
…GEMM implementation

- Add `ConstexprMap` and compile-time traits in `common/` for efficient
  type-to-metadata mapping and relevant operations.
- Implement a generic dispatcher to reduce boilerplate for dispatching, especially for data types and devices.
- Add the CPU implementation for the GEMM
- Update `DataType` definitions and type lists to support wide
  dispatching.

Follow-up: support for fp16 and bf16 kernels is pending.
…further abstract `blas.h`

- further abstract `blas.h`, backends now only do name change
- fix various naming issues and small issues
- combined the `gemm` example programs across the platforms, now only one program for all platforms
zhangyue207 and others added 24 commits March 4, 2026 17:05
…pip install` on MetaX (#5)

* refactor: adapt the dispatcher to be C++17-compatible

- dispatcher now does not depend on C++20 features
- update the current dispatcher use cases
- add some relevant constexpr traits in common/traits.h
- add `PYBIND_ENABLE_EXTRAS` internal cmake variable for controlling the flags introduced by pybind

* style: format some comments in common/traits.h

* fix: support mxcc to use pytest by using `scripts/mxcc_wrapper.sh`

* build: add auto-detection for MetaX

* style: change the naming for types and variables in `common/traits.h`, `common/constexpr_map.h` and `dispatcher.h`

* style: fix the method and context string naming in `src/add/add.h`

* refactor: change the anonymous namespaces in `dispatcher.h` to namespace `detail` to comply with the styling rules

* style: fix comment styling issues

* fix: update `DispatchFunc` usage in `src/cuda/rms_norm/kernel.h`

---------

Co-authored-by: Jiacheng Huang <huangjiacheng0709@outlook.com>
* refactor: update `Operator::call` to accept `handle`, `stream`, `workspace`, and `workspace_size_in_bytes`

* feat: add `workspace_size_in_bytes` virtual method to `OperatorBase`
* feat: add the implementation of  operator on Cambricon

* chore: format `src/cambricon/gemm/cnblas.h` with `clang-format`

* refactor: update `src/cambricon/gemm/cnblas.h` to use latest `operator()` mechanism

* refactor: update `src/cambricon/gemm/cnblas.h` to use `workspace_` from `OperatorBase`

* chore: resolve PR comments

* chore: reverse tensor descriptor destruction order

---------

Co-authored-by: Jiacheng Huang <huangjiacheng0709@outlook.com>
…nds (#12)

* feat(ops): implement CausalSoftmax operator with CPU and CUDA backends

* refactor(ops): update CausalSoftmax constructor and method signatures for consistency

* style: improve assertion messages in CausalSoftmax for clarity and consistency

* chore: format files with `clang-format`

* test: disable test skipping for the missing `infini.ops.causal_softmax`

* refactor: update `causal_softmax` to use latest `operator()` mechanism

---------

Co-authored-by: Jiacheng Huang <huangjiacheng0709@outlook.com>
* feat: add the CPU implementation of float16 and bfloat16 and the CPU `Cast()` function

- add the CPU implementation of float16 and bfloat16 as `float16_t` and `bfloat16_t`
- add the CPU `Cast()` function that supports conversion between any two CPU-supported types, including the custom `float16_t` and `bfloat16_t`

* style: change `indexToOffset()` to `IndexToOffset()` to comply with the styling requirement
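For context, an `IndexToOffset()`-style helper typically maps a flat element index to a memory offset through explicit strides; a sketch of that computation (signature illustrative, not the project's exact API):

```python
# Sketch of the strided index computation an `IndexToOffset()`-style helper
# performs: walk dimensions from innermost to outermost, peeling off the
# coordinate in each dimension and scaling by that dimension's stride.
def index_to_offset(index, shape, strides):
    offset = 0
    for dim, stride in zip(reversed(shape), reversed(strides)):
        offset += (index % dim) * stride
        index //= dim
    return offset
```

For contiguous row-major strides the result equals the flat index itself; the helper matters for non-contiguous (e.g. padded or transposed) layouts.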

* feat: add the CUDA `Cast()` function

* refactor: refactor CUDA `Cast` utility with SFINAE-based hardware dispatch and move them into `common/cuda/cast.h`

* style: change the naming of some types in `common/cast.h` and `common/cuda/cast.h` to better comply with the naming rules

* chore: remove unused header `data_type.h` in `common/cuda/kernel_commons.h`

* style: adjust comments for styling rule compliance

* style: change `float16_t` and `bfloat16_t` to `Float16` and `BFloat16` and fix various styling issues.
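As background for the `Cast()` work: bfloat16 is simply the top 16 bits of an IEEE-754 float32, so a CPU-side conversion reduces to bit manipulation. A sketch with round-to-nearest-even (illustrative only, not the project's implementation):

```python
import struct

# Sketch of a float32 <-> bfloat16 conversion at the bit level. bfloat16
# keeps the float32 sign, exponent, and top 7 mantissa bits; rounding uses
# the standard round-to-nearest-even bias on the truncated low half.
def float32_to_bfloat16_bits(x: float) -> int:
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    rounding = 0x7FFF + ((bits >> 16) & 1)  # nearest-even bias
    return ((bits + rounding) >> 16) & 0xFFFF

def bfloat16_bits_to_float32(b: int) -> float:
    return struct.unpack("<f", struct.pack("<I", (b & 0xFFFF) << 16))[0]
```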
* feat: add  op with NVIDIA and CPU backends

* fix: fix code as pr comment

* chore: format `tests/test_swiglu.py` and `tests/utils.py`

---------

Co-authored-by: Jiacheng Huang <huangjiacheng0709@outlook.com>
* feat: reorganize casting utilities and enhance CPU support

- Moved casting functions to separate CPU and CUDA headers for better organization.
- Introduced a new `Cast()` function in the CPU implementation to handle type conversions, including support for custom types like `float16_t` and `bfloat16_t`.
- Updated various operators to utilize the new casting utilities, ensuring consistent type handling across CPU and CUDA backends.
- Enhanced test cases to cover additional data types and ensure compatibility with the new casting logic.

* fix: update bfloat16 test tolerance in `test_rms_norm.py`

- Increased the tolerance for `bfloat16` from `1e-2` to `2e-2` to better accommodate numerical precision in tests.

* format: simplify type dispatching in `Add` operator and formatting
* fix: add equality operators and CacheKey struct for improved operator caching

- Implemented `operator==` and `operator!=` for the `Device` class to facilitate comparison.
- Introduced `CacheKey` struct in `operator.h` to enhance caching mechanism with a hash and vector of tensors.
- Updated the `Operator::call` method to utilize `CacheKey` for caching operators based on input arguments.
- Added `MetaEqual` method in `Tensor` class for tensor comparison based on metadata.

* refactor: move CacheKey struct to detail namespace and enhance Tensor comparison

- Changed the namespace of `CacheKey` to `infini::ops::detail` for better organization.
- Updated the hash and equality operators for `CacheKey` to reflect the new namespace.
- Removed the `MetaEqual` method from the `Tensor` class and replaced it with a dedicated `std::equal_to` specialization for `Tensor` to improve comparison logic.
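The caching scheme these commits describe can be sketched as keying on the metadata that makes a configured operator reusable (device, dtype, shape, strides); all names below are hypothetical stand-ins for the C++ `CacheKey`:

```python
from dataclasses import dataclass

# Hypothetical sketch of the `CacheKey` idea: a hashable record of the
# metadata that determines whether a previously built operator (and its
# workspace) can be reused for a new call.
@dataclass(frozen=True)
class CacheKey:
    device: str
    dtype: str
    shape: tuple
    strides: tuple

_cache = {}

def get_operator(key: CacheKey, build):
    """Return a cached operator for `key`, building it only on first use."""
    if key not in _cache:
        _cache[key] = build()
    return _cache[key]
```

A repeat call with metadata-equal inputs then skips operator setup entirely, which is the point of the `std::equal_to<Tensor>` specialization mentioned above.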

* style: remove unnecessary blank line in cublas.h for improved readability
* feat(gemm-moore): add Moore (MUSA) GEMM backend support.

* refactor(gemm-moore): reuse shared BLAS helper and specialize scalars.

* build: use detected Python interpreter for wrapper generation.

---------

Co-authored-by: zhuyue <zhuyue@qiyuanlab.com>
…igLU` (#17)

* refactor: reorganize casting utilities and enhance CUDA kernel support

- Moved CPU casting functions to a new file `common/cpu/cast.h` and updated the `Cast` function to utilize these utilities.
- Updated CUDA kernel files to include the new casting utilities and improved block size handling in kernel launches.
- Enhanced the `Add`, `CausalSoftmax`, `Gemm`, `RmsNorm`, and `Swiglu` operators to utilize the new casting mechanisms for better type handling.
- Added support for additional data types in tests and adjusted test cases for consistency across CPU and GPU backends.

* refactor: improve formatting

* refactor: cache cudaDeviceProp per device via DevicePropertyCache

Introduce DevicePropertyCache to query and cache all device properties
once at first access, avoiding repeated cudaGetDeviceProperties calls.
QueryMaxThreadsPerBlock and GetOptimalBlockSize are simplified to
delegate to the cache. Also move block_size out of dispatch lambdas
in add and swiglu kernels since it does not depend on the dispatched type.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
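The `DevicePropertyCache` pattern can be sketched as memoizing a per-device query (here `fake_query` stands in for `cudaGetDeviceProperties`; all names are illustrative, not the project's actual API):

```python
import functools

# Sketch of the property-cache idea: hit the "driver" once per device id,
# then answer all later queries from the memoized result.
CALLS = []

def fake_query(device_id: int) -> dict:
    CALLS.append(device_id)  # record how often the driver is actually hit
    return {"max_threads_per_block": 1024, "warp_size": 32}

@functools.lru_cache(maxsize=None)
def get_device_props(device_id: int) -> dict:
    return fake_query(device_id)

def get_optimal_block_size(device_id: int) -> int:
    # Delegates to the cache, mirroring the simplified helpers described above.
    return min(256, get_device_props(device_id)["max_threads_per_block"])
```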

* perf: reduce redundant ops in Add, SwiGLU, and RmsNorm CUDA kernels

- Add __restrict__ to all pointer params in AddKernel and SwigluKernel
  to enable compiler alias analysis, vectorization, and prefetch
- Remove dead for-loop in Add/SwiGLU kernel launch (step >= output_size_
  by construction, loop body always executed exactly once); drop offset param
- Inline sigmoid in SwiGLU bfloat16/bfloat162 paths to eliminate
  redundant bf16<->float round-trips (8 -> 4 conversions for bfloat162,
  2 -> 1 for bfloat16)
- Use a temp variable in RmsNorm SumSquared to guarantee single global load

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
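For reference, assuming the common silu(gate) * up formulation, the computation the SwiGLU kernels implement is silu(x) = x * sigmoid(x) applied to the gate. A float-only sketch of the math (the fused fp16/bf16 kernels additionally avoid the repeated low-precision round-trips noted above):

```python
import math

# Float-only reference sketch of SwiGLU: silu(gate) * up, where
# silu(x) = x * sigmoid(x). Illustrative; precision handling is the
# interesting part in the actual kernels, not shown here.
def swiglu(gate, up):
    return [g * (1.0 / (1.0 + math.exp(-g))) * u for g, u in zip(gate, up)]
```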

* fix: abort with diagnostic on out-of-range device_id in GetDeviceProps

Returning a default-constructed dummy cudaDeviceProp silently propagated
incorrect device properties; now print an explicit error and abort so
the bug is immediately visible.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* style: improve code formatting and comments in CUDA kernel files

- Updated comments for clarity in `kernel_commons.h`.
- Reformatted kernel launch macro in `kernel.h` for better readability.
- Enhanced line breaks in `SwigluKernel` implementation for improved code structure.
- Adjusted test function formatting in `test_add.py` for consistency.

* refactor: remove unnecessary cuda_runtime.h includes in kernel headers

- Removed redundant `#include <cuda_runtime.h>` from `add`, `causal_softmax`, `rms_norm`, and `swiglu` kernel header files.
- Added TODO comments for future removal of the remaining includes to improve code clarity and maintainability.

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
… and add `_torch_rms_norm` fallback (#20)

* fix: remove uint16 test from test_add.py

- Removed `torch.uint16` from the list of integer data types in the `_INT_DTYPES` tuple to streamline the code and eliminate redundancy.

* refactor: enhance dtype handling in test_add.py

* refactor: streamline dtype parameterization in test_add.py and enhance rms_norm fallback handling in test_rms_norm.py

* refactor: add unsigned integer data types to test_add.py for enhanced dtype handling

* refactor: simplify integer dtype filtering

* refactor: simplify `_torch_rms_norm` fallback logic

---------

Co-authored-by: Jiacheng Huang <huangjiacheng0709@outlook.com>
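The reference such a fallback mirrors is the usual RMSNorm, x / sqrt(mean(x²) + eps) * weight; a sketch computed in plain floats (illustrative, not the exact fallback code), which also shows why low-precision dtypes want an upcast-then-normalize path:

```python
import math

# Reference sketch of RMSNorm over one row: normalize by the root mean
# square of the elements (plus eps for stability), then scale by weight.
def rms_norm(x, weight, eps=1e-6):
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms * w for v, w in zip(x, weight)]
```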
* feat(moore): add add-op support with musa integration.

* refactor: improve specialization logic

* fix: fix `AttributeError` on Cambricon

---------

Co-authored-by: Jiacheng Huang <huangjiacheng0709@outlook.com>
* feat(moore): add swiglu-op support with musa integration

- add a Moore swiglu backend on top of the shared CUDA-style path
- extract shared swiglu compute into a reusable op for backend override
- keep Moore-specific half and bfloat162 handling in the Moore backend only

* refactor: introduce `src/moore/polyfills.cuh`

* refactor: use polyfills for Moore SwiGLU

---------

Co-authored-by: Jiacheng Huang <huangjiacheng0709@outlook.com>
- add MetaX  operator specialization
- make the shared CUDA-style kernel compatible with MetaX
- reuse common casting utilities for fp16 and bf16 conversions

Co-authored-by: gongchensu <zhuyue@qiyuanlab.com>
- add MetaX `RmsNorm` operator specialization
- make the shared CUDA-style rms_norm kernel compatible with MetaX
- forward runtime `eps` when launching the kernel

Co-authored-by: gongchensu <zhuyue@qiyuanlab.com>
Co-authored-by: gongchensu <zhuyue@qiyuanlab.com>
* feat/nv ci test

* feat: ci sys for nv platform

* fix(ci): fix results dir permissions and reduce parallel workers

- Pass host UID/GID into container and `chown` results after tests,
  so mounted `ci-results/` is accessible by the host user.
- Limit `pytest-xdist` workers from `-n auto` to `-n 8` to prevent
  OOM worker crashes on high-core-count machines.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* refactor(ci): Refactor code structure for improved readability and maintainability

* docs: add multi-machine deployment guide for NVIDIA and Iluvatar platform

* feat(ci): enhance CI configuration and agent functionality with platform detection and job resolution

* feat(ci): add MetaX platform CI support

Add Dockerfile, config, and mx-smi GPU detection for MetaX (MACA) platform.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat(ci): improve job dispatch logging and handle job results more effectively

* feat(ci): add Moore Threads (MUSA) platform CI support

Add GPU detection via mthreads-gmi, Dockerfile, config, and update docs
with Moore and MetaX platform deployment instructions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat(ci): capture Docker error output for remote job diagnostics

* feat(ci): capture error output and improve CLI result display

- Capture last 50 lines of Docker output via ring buffer so failed
  jobs return diagnostic info to the CLI client.
- Store raw bytes during execution; decode only on the failure path.
- Align job name columns in `<==` result lines for readability.
- Show summary only when jobs fail, removing redundant all-pass output.

Co-Authored-By: Claude <noreply@anthropic.com>
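The ring-buffer capture can be sketched with `collections.deque(maxlen=...)`: keep only the last 50 raw lines and decode solely on the failure path (function names illustrative):

```python
from collections import deque

# Sketch of the tail-capture idea: a bounded deque keeps only the newest
# lines, stored as raw bytes; decoding happens only when a job fails.
def capture_tail(stream, maxlen=50):
    tail = deque(maxlen=maxlen)
    for raw_line in stream:  # raw bytes during execution
        tail.append(raw_line)
    return tail

def format_failure(tail):
    """Decode only on the failure path, tolerating partial UTF-8."""
    return b"".join(tail).decode("utf-8", errors="replace")
```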

* feat(ci): add Cambricon MLU platform CI support

- Add .ci/images/cambricon/Dockerfile for AnolisOS-based Cambricon image
- Add cambricon platform to config.yaml with MLU-style GPU passthrough
- Add GPU_STYLE_MLU constant and MLU_VISIBLE_DEVICES support in run.py
- Add cnmon-based GPU detection (_detect_gpus_cambricon) in ci_resource.py
- Add --test CLI flag to override pytest test path at runtime
- Skip empty stage run commands instead of erroring (compilation-only mode)
- Fix _torch_gemm fallback for CPU float16/bfloat16 (upcast to float32)
- Skip bfloat16 on MLU (cnnlBatchMatMulEx does not support it)
- Hoist _PYTEST_VALUE_FLAGS to module level; add ValueError guard in cambricon parser
- Remove redundant yaml import guard in agent.py (utils.py already handles it)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* docs(ci): translate README and comments to English, use ngpus for NVIDIA scheduler

- Rewrite README.md entirely in English; add Cambricon to platform
  table and directory tree.
- Translate all inline comments in config.yaml to English.
- Replace `gpu_ids: "0"` with `ngpus: 1` for NVIDIA platform so the
  scheduler auto-picks a free GPU rather than pinning to device 0.
- Add `ngpus` support to `parse_gpu_requirement` in ci_resource.py so
  scheduler correctly counts NVIDIA GPU demand.
- Replace deprecated `gpu_count` fallback with `ngpus` in run.py
  `build_docker_args`.

Co-Authored-By: Claude <noreply@anthropic.com>

* feat(ci): add --local flag to run.py for testing uncommitted changes

- Mount current directory read-only into container via `-v cwd:/workspace/repo:ro`
- Copy to writable `/tmp/src` inside container before setup runs, so host
  files are never modified by pip install or build artifacts
- Simplify README: fix ngpus example, add gpu_style column, add --local docs

Co-Authored-By: Claude <noreply@anthropic.com>

* style(ci): normalize comments to complete English sentences with markdown

- Backtick-quote tool/package names (`torch`, `pip`, `git`, `cmake`,
  `coreutils-single`, `conda`) and paths in Dockerfile comments.
- Add explanatory comment to the commented-out `agents:` block in
  `config.yaml` describing when to uncomment it.
- Convert all section-header banners in `.ci/tests/` to "Tests for
  `FunctionName`." sentence form; fix three docstrings in `test_agent.py`.
- Backtick-quote identifiers in `tests/test_gemm.py` inline comments.

Co-Authored-By: Claude <noreply@anthropic.com>

* style(tests): backtick-quote identifiers in test_gemm.py skip message

Co-Authored-By: Claude <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude <noreply@anthropic.com>
* feat: Add RMSNorm op in cambricon backend.

* refactor: make `Cast` utility to use `Device::Type` template parameter

* refactor: add `Caster` mixin

* refactor: rename `cast**` to `caster**`

* fix: fix the mlu naming to google c++ naming style

* chore: format files with `clang-format`

* refactor: update CUDA kernels to use `Caster`

* fix: fix rmsnorm dispatch to use one dispatch

---------

Co-authored-by: Jiacheng Huang <huangjiacheng0709@outlook.com>
…xed dispatch (#29)

* feat: add a convenient interface for any `int64_t`-convertible types and use `DispatchFunc()` to dispatch `DataType` and block sizes with a single call.

- add a convenient interface for any `int64_t`-convertible types, which is mostly used for multi-type dispatch and mixed dispatch
- use `DispatchFunc()` to dispatch `DataType` and block sizes with a single function call in various kernels' implementation
- remove the `CUDA_BLOCK_SIZE_XXX` macros and simply use numbers instead

* style: fix the styling issue by adding a period to the TODO comment

* fix: fix rebase error

* style: fix the styling issues for comments in `dispatcher.h` and `cuda/causal_softmax/kernel.h`
#38)

* refactor: make `TypeMap`, `IsFP16`, `IsBFloat16`, and `DispatchFunc` device-aware

* refactor: make `cuda/` shared headers self-contained and include-order-independent

* fix: update call sites to device-aware `TypeMap`, `IsFP16`/`IsBFloat16`, and `DispatchFunc`

* chore: format files with `clang-format`

* fix: update `cuda/swiglu` kernels to use device-aware type predicates

* fix: replace per-instance `blasHandle_t` with a static singleton in `Blas`

* fix: restore kernel headers for `moore/add` and `moore/swiglu` to use `clang-format off` and `clang-format on`

* fix: use absolute includes, consistent include guards, and formatted comments

* refactor: extract `GetOptimalBlockSize` logic into shared `ComputeOptimalBlockSize`

* fix: include `<musa_fp16.h>` in `polyfills.cuh` before `hrcp` macro to prevent collision

* chore: add blank lines between `using` type alias declarations in `device_.h`

* chore: add TODO comments for potential performance and concurrency issues

* fix: move `clang-format` guards to wrap only CUDA headers in `iluvatar/device_.h`
a_host.strides()};
Tensor b_device{b_ptr, b_host.shape(), b_host.dtype(), a_host.device(),
b_host.strides()};
Tensor c_device{c_ptr, c_host.shape(), c_host.dtype(), a_host.device(),

Do `a_host`/`b_host`/`c_host` refer to host-side Tensors? If so, why are `a_device`/`b_device`/`c_device` constructed directly from `a_host.device()`?


Is this a unified runtime API just for the examples? Why not use the one infiniop already wraps?


Does every elementwise op need its own header implementation? It feels like these could be unified into a single one, simplifying the process of adding new elementwise ops.


Why is this file called `device_.h`...

Backend::memcpyH2D);
Backend::memcpy(d_other_strides_, other_strides_.data(), strides_size,
Backend::memcpyH2D);
Backend::memcpy(d_out_strides_, out_strides_.data(), strides_size,

Leave a TODO to later reduce the malloc/memcpy as much as possible, or eliminate them for common shapes.
At the very least, the current code could malloc the total size once and hand out sub-buffers by offset; the same applies to the other elementwise kernels.


Does the trailing underscore in the filename carry a special meaning?
Also, wouldn't this file's contents be a better fit for a `.cuh`?


This one should also be a `.cuh`.


static constexpr auto free = cudaFree;

static constexpr auto memcpyH2D = cudaMemcpyHostToDevice;

Every op that uses these APIs has to spell them out again in its own `namespace infini::ops::op_name::DeviceBackend`, which is redundant. Why classify the APIs by op rather than by device (so each device only needs to be written once)?

#ifndef INFINI_OPS_CASTER_H_
#define INFINI_OPS_CASTER_H_

#include "data_type.h"

This file doesn't actually use anything from `data_type`, right?

@kilinchange

Also, the implementation code and headers currently all live under `src`; going forward, consider introducing a separate `include` directory for the public headers exposed externally, keeping `src` for internal implementation and private headers.


namespace infini::ops {

namespace swiglu {

There are also still many namespaces that carry the op name. The underlying issue seems to be that the device level belongs above the op level for better reuse; right now the hierarchy is inverted.
