feat: build core operator framework with multi-device backends, Python bindings, testing, and CI (#39)
- Add additional entries to the Device enum class to support new hardware targets. - Adapt GEMM mcblas implementation to use MetaX backend and add the test example. - Extract common BLAS interfaces into a new blas.h abstraction for GEMM implementations to share.
…GEMM implementation - Add `ConstexprMap` and compile-time traits in `common/` for efficient type-to-metadata mapping and relevant operations. - Implement a generic dispatcher to reduce boilerplate for dispatching, especially for data types and devices. - Add the CPU implementation for the GEMM - Update `DataType` definitions and type lists to support wide dispatching. Follow-up: support for fp16 and bf16 kernels is pending.
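The `ConstexprMap` idea above can be illustrated with a minimal compile-time key-to-value table. This is a hedged sketch under assumed names (`ConstexprMap::At`, `kTypeSize`, the `DataType` values), not the actual `common/constexpr_map.h` implementation:

```cpp
#include <array>
#include <cstddef>
#include <utility>

// Illustrative compile-time key/value map; a linear scan is fine for small N.
template <typename Key, typename Value, std::size_t N>
struct ConstexprMap {
    std::array<std::pair<Key, Value>, N> entries;

    constexpr Value At(Key key, Value fallback) const {
        for (std::size_t i = 0; i < N; ++i) {
            if (entries[i].first == key) return entries[i].second;
        }
        return fallback;
    }
};

enum class DataType { kF32, kF16, kI32 };

// Hypothetical type-to-metadata mapping: each data type's size in bytes.
constexpr ConstexprMap<DataType, std::size_t, 3> kTypeSize{{{
    {DataType::kF32, 4},
    {DataType::kF16, 2},
    {DataType::kI32, 4},
}}};

static_assert(kTypeSize.At(DataType::kF16, 0) == 2, "fp16 is 2 bytes");
```

Because `At` is `constexpr`, lookups resolve at compile time when the key is a constant, which is what makes such a map useful for dispatch metadata.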
…further abstract `blas.h` - further abstract `blas.h`, backends now only do name change - fix various naming issues and small issues - combined the `gemm` example programs across the platforms, now only one program for all platforms
…nd example header file inclusion
…pip install` on MetaX (#5) * refactor: adapt the dispatcher to be C++17-compatible - dispatcher now does not depend on C++20 features - update the current dispatcher use cases - add some relevant constexpr traits in common/traits.h - add `PYBIND_ENABLE_EXTRAS` internal cmake variable for controlling the flags introduced by pybind * style: format some comments in common/traits.h * fix: support mxcc to use pytest by using `scripts/mxcc_wrapper.sh` * build: add auto-detection for MetaX * style: change the naming for types and variables in `common/traits.h`, `common/constexpr_map.h` and `dispatcher.h` * style: fix the method and context string naming in `src/add/add.h` * refactor: change the anonymous namespaces in `dispatcher.h` to namespace `detail` to comply with the styling rules * style: fix comment styling issues * fix: update `DispatchFunc` usage in `src/cuda/rms_norm/kernel.h` --------- Co-authored-by: Jiacheng Huang <huangjiacheng0709@outlook.com>
* refactor: update `Operator::call` to accept `handle`, `stream`, `workspace`, and `workspace_size_in_bytes` * feat: add `workspace_size_in_bytes` virtual method to `OperatorBase`
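The reworked `Operator::call` signature described above can be sketched as follows. The surrounding types (`Handle`, `Stream`, `Tensor`) and the exact parameter order are stand-ins for illustration, not the repository's actual declarations:

```cpp
#include <cstddef>
#include <vector>

// Stand-in types for the backend handle, stream, and tensor.
struct Tensor {};
using Handle = void*;
using Stream = void*;

class OperatorBase {
public:
    virtual ~OperatorBase() = default;

    // New virtual: lets callers size the scratch buffer before dispatch.
    virtual std::size_t workspace_size_in_bytes() const { return 0; }

    // `call` now threads through a handle, stream, and caller-provided workspace.
    virtual void call(Handle handle, Stream stream, void* workspace,
                      std::size_t workspace_size_in_bytes,
                      const std::vector<Tensor>& inputs,
                      std::vector<Tensor>& outputs) = 0;
};

// Hypothetical no-op operator showing how a backend would override both hooks.
class NoOp : public OperatorBase {
public:
    std::size_t workspace_size_in_bytes() const override { return 256; }
    void call(Handle, Stream, void*, std::size_t,
              const std::vector<Tensor>&, std::vector<Tensor>&) override {}
};
```

Separating workspace sizing from invocation lets the caller allocate (or reuse) the scratch buffer once and pass it to every `call`.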
* feat: add the implementation of operator on Cambricon * chore: format `src/cambricon/gemm/cnblas.h` with `clang-format` * refactor: update `src/cambricon/gemm/cnblas.h` to use latest `operator()` mechanism * refactor: update `src/cambricon/gemm/cnblas.h` to use `workspace_` from `OperatorBase` * chore: resolve PR comments * chore: reverse tensor descriptor destruction order --------- Co-authored-by: Jiacheng Huang <huangjiacheng0709@outlook.com>
… in `src/pybind11_utils.h`
…nds (#12) * feat(ops): implement CausalSoftmax operator with CPU and CUDA backends * refactor(ops): update CausalSoftmax constructor and method signatures for consistency * style: improve assertion messages in CausalSoftmax for clarity and consistency * chore: format files with `clang-format` * test: disable test skipping for missing `infini.ops.causal_softmax` * refactor: update `causal_softmax` to use latest `operator()` mechanism --------- Co-authored-by: Jiacheng Huang <huangjiacheng0709@outlook.com>
* feat: add the CPU implementation of float16 and bfloat16 and the CPU `Cast()` function - add the CPU implementation of float16 and bfloat16 as `float16_t` and `bfloat16_t` - add the CPU `Cast()` function that supports conversion between any two CPU-supported types, including the custom `float16_t` and `bfloat16_t` * style: change `indexToOffset()` to `IndexToOffset()` to comply with the styling requirement * feat: add the CUDA `Cast()` function * refactor: refactor CUDA `Cast` utility with SFINAE-based hardware dispatch and move them into `common/cuda/cast.h` * style: change the naming of some types in `common/cast.h` and `common/cuda/cast.h` to better comply with the naming rules * chore: remove unused header `data_type.h` in `common/cuda/kernel_commons.h` * style: adjust comments for styling rule compliance * style: change `float16_t` and `bfloat16_t` to `Float16` and `BFloat16` and fix various styling issues.
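A software bfloat16 on CPU can be round-tripped with plain bit manipulation. This is a minimal sketch in the spirit of the custom types above (names `BFloat16`, `FromFloat`, `ToFloat` are illustrative); the repository's real `Float16`/`BFloat16` classes are necessarily more complete:

```cpp
#include <cstdint>
#include <cstring>

// bfloat16 keeps float32's sign and exponent but only 7 mantissa bits.
struct BFloat16 {
    std::uint16_t bits;
};

inline BFloat16 FromFloat(float f) {
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof(u));  // type-pun safely via memcpy
    // Round-to-nearest-even before dropping the low 16 mantissa bits.
    u += 0x7FFFu + ((u >> 16) & 1u);
    return BFloat16{static_cast<std::uint16_t>(u >> 16)};
}

inline float ToFloat(BFloat16 b) {
    // Widening is exact: shift the 16 stored bits back into the high half.
    std::uint32_t u = static_cast<std::uint32_t>(b.bits) << 16;
    float f;
    std::memcpy(&f, &u, sizeof(f));
    return f;
}
```

Values whose mantissa already fits in 7 bits (e.g. 1.0f, -2.5f) round-trip exactly; everything else is rounded to the nearest representable bfloat16.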
* feat: reorganize casting utilities and enhance CPU support - Moved casting functions to separate CPU and CUDA headers for better organization. - Introduced a new `Cast()` function in the CPU implementation to handle type conversions, including support for custom types like `float16_t` and `bfloat16_t`. - Updated various operators to utilize the new casting utilities, ensuring consistent type handling across CPU and CUDA backends. - Enhanced test cases to cover additional data types and ensure compatibility with the new casting logic. * fix: update bfloat16 test tolerance in `test_rms_norm.py` - Increased the tolerance for `bfloat16` from `1e-2` to `2e-2` to better accommodate numerical precision in tests. * format: simplify type dispatching in `Add` operator and formatting
* fix: add equality operators and CacheKey struct for improved operator caching - Implemented `operator==` and `operator!=` for the `Device` class to facilitate comparison. - Introduced `CacheKey` struct in `operator.h` to enhance caching mechanism with a hash and vector of tensors. - Updated the `Operator::call` method to utilize `CacheKey` for caching operators based on input arguments. - Added `MetaEqual` method in `Tensor` class for tensor comparison based on metadata. * refactor: move CacheKey struct to detail namespace and enhance Tensor comparison - Changed the namespace of `CacheKey` to `infini::ops::detail` for better organization. - Updated the hash and equality operators for `CacheKey` to reflect the new namespace. - Removed the `MetaEqual` method from the `Tensor` class and replaced it with a dedicated `std::equal_to` specialization for `Tensor` to improve comparison logic. * style: remove unnecessary blank line in cublas.h for improved readability
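The caching mechanism described above can be sketched as a key that pairs a precomputed hash with the input tensors' metadata, plus the hooks an `std::unordered_map` needs. Field names (`TensorMeta`, `dtype`, `shape`) are illustrative stand-ins, not the actual `infini::ops::detail::CacheKey` layout:

```cpp
#include <cstddef>
#include <functional>
#include <unordered_map>
#include <vector>

// Stand-in for the per-tensor metadata compared by the cache key.
struct TensorMeta {
    int dtype;
    std::vector<long> shape;
    bool operator==(const TensorMeta& o) const {
        return dtype == o.dtype && shape == o.shape;
    }
};

// Precomputed hash plus full metadata: the hash gives fast bucketing,
// the metadata comparison guards against hash collisions.
struct CacheKey {
    std::size_t hash;
    std::vector<TensorMeta> tensors;
    bool operator==(const CacheKey& o) const {
        return hash == o.hash && tensors == o.tensors;
    }
};

namespace std {
template <>
struct hash<CacheKey> {
    size_t operator()(const CacheKey& k) const noexcept { return k.hash; }
};
}  // namespace std
```

With these in place, a cache is just `std::unordered_map<CacheKey, OperatorPtr>`: lookups reuse an already-configured operator when the same input metadata recurs.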
* feat(gemm-moore): add Moore (MUSA) GEMM backend support. * refactor(gemm-moore): reuse shared BLAS helper and specialize scalars. * build: use detected Python interpreter for wrapper generation. --------- Co-authored-by: zhuyue <zhuyue@qiyuanlab.com>
…igLU` (#17) * refactor: reorganize casting utilities and enhance CUDA kernel support - Moved CPU casting functions to a new file `common/cpu/cast.h` and updated the `Cast` function to utilize these utilities. - Updated CUDA kernel files to include the new casting utilities and improved block size handling in kernel launches. - Enhanced the `Add`, `CausalSoftmax`, `Gemm`, `RmsNorm`, and `Swiglu` operators to utilize the new casting mechanisms for better type handling. - Added support for additional data types in tests and adjusted test cases for consistency across CPU and GPU backends. * refactor: improve formatting * refactor: cache cudaDeviceProp per device via DevicePropertyCache Introduce DevicePropertyCache to query and cache all device properties once at first access, avoiding repeated cudaGetDeviceProperties calls. QueryMaxThreadsPerBlock and GetOptimalBlockSize are simplified to delegate to the cache. Also move block_size out of dispatch lambdas in add and swiglu kernels since it does not depend on the dispatched type. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * perf: reduce redundant ops in Add, SwiGLU, and RmsNorm CUDA kernels - Add __restrict__ to all pointer params in AddKernel and SwigluKernel to enable compiler alias analysis, vectorization, and prefetch - Remove dead for-loop in Add/SwiGLU kernel launch (step >= output_size_ by construction, loop body always executed exactly once); drop offset param - Inline sigmoid in SwiGLU bfloat16/bfloat162 paths to eliminate redundant bf16<->float round-trips (8 -> 4 conversions for bfloat162, 2 -> 1 for bfloat16) - Use a temp variable in RmsNorm SumSquared to guarantee single global load Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix: abort with diagnostic on out-of-range device_id in GetDeviceProps Returning a default-constructed dummy cudaDeviceProp silently propagated incorrect device properties; now print an explicit error and abort so the bug is immediately visible. 
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * style: improve code formatting and comments in CUDA kernel files - Updated comments for clarity in `kernel_commons.h`. - Reformatted kernel launch macro in `kernel.h` for better readability. - Enhanced line breaks in `SwigluKernel` implementation for improved code structure. - Adjusted test function formatting in `test_add.py` for consistency. * refactor: remove unnecessary cuda_runtime.h includes in kernel headers - Removed redundant `#include <cuda_runtime.h>` from `add`, `causal_softmax`, `rms_norm`, and `swiglu` kernel header files. - Added TODO comments for future removal of the remaining includes to improve code clarity and maintainability. --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
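The `DevicePropertyCache` pattern above (query once, serve from cache thereafter) can be sketched host-only. Here `QueryProps` is a stand-in for `cudaGetDeviceProperties`, and the real cache aborts on an out-of-range `device_id` rather than throwing:

```cpp
#include <mutex>
#include <vector>

struct DeviceProps {
    int max_threads_per_block;
};

// Stand-in for the one-time runtime query (cudaGetDeviceProperties in the real code).
inline DeviceProps QueryProps(int /*device_id*/) {
    return DeviceProps{1024};
}

class DevicePropertyCache {
public:
    static const DeviceProps& Get(int device_id, int device_count) {
        // Populate the whole table exactly once, thread-safely.
        static std::once_flag flag;
        static std::vector<DeviceProps> props;
        std::call_once(flag, [device_count] {
            for (int i = 0; i < device_count; ++i) {
                props.push_back(QueryProps(i));
            }
        });
        // `.at()` makes an out-of-range id loudly visible instead of
        // silently returning garbage (the real code prints and aborts).
        return props.at(device_id);
    }
};
```

Helpers like `QueryMaxThreadsPerBlock` then reduce to a field read on the cached entry, so the runtime query cost is paid once per process rather than once per kernel launch.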
… and add `_torch_rms_norm` fallback (#20) * fix: remove uint16 test from test_add.py - Removed `torch.uint16` from the list of integer data types in the `_INT_DTYPES` tuple to streamline the code and eliminate redundancy. * refactor: enhance dtype handling in test_add.py * refactor: streamline dtype parameterization in test_add.py and enhance rms_norm fallback handling in test_rms_norm.py * refactor: add unsigned integer data types to test_add.py for enhanced dtype handling * refactor: simplify integer dtype filtering * refactor: simplify `_torch_rms_norm` fallback logic --------- Co-authored-by: Jiacheng Huang <huangjiacheng0709@outlook.com>
* feat(moore): add swiglu-op support with musa integration - add a Moore swiglu backend on top of the shared CUDA-style path - extract shared swiglu compute into a reusable op for backend override - keep Moore-specific half and bfloat162 handling in the Moore backend only * refactor: introduce `src/moore/polyfills.cuh` * refactor: use polyfills for Moore SwiGLU --------- Co-authored-by: Jiacheng Huang <huangjiacheng0709@outlook.com>
- add MetaX operator specialization - make the shared CUDA-style kernel compatible with MetaX - reuse common casting utilities for fp16 and bf16 conversions Co-authored-by: gongchensu <zhuyue@qiyuanlab.com>
* feat/nv ci test * feat: ci sys for nv platform * fix(ci): fix results dir permissions and reduce parallel workers - Pass host UID/GID into container and `chown` results after tests, so mounted `ci-results/` is accessible by the host user. - Limit `pytest-xdist` workers from `-n auto` to `-n 8` to prevent OOM worker crashes on high-core-count machines. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * refactor(ci): Refactor code structure for improved readability and maintainability * docs: add multi-machine deployment guide for NVIDIA and Iluvatar platform * feat(ci): enhance CI configuration and agent functionality with platform detection and job resolution * feat(ci): add MetaX platform CI support Add Dockerfile, config, and mx-smi GPU detection for MetaX (MACA) platform. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * feat(ci): improve job dispatch logging and handle job results more effectively * feat(ci): add Moore Threads (MUSA) platform CI support Add GPU detection via mthreads-gmi, Dockerfile, config, and update docs with Moore and MetaX platform deployment instructions. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * feat(ci): capture Docker error output for remote job diagnostics * feat(ci): capture error output and improve CLI result display - Capture last 50 lines of Docker output via ring buffer so failed jobs return diagnostic info to the CLI client. - Store raw bytes during execution; decode only on the failure path. - Align job name columns in `<==` result lines for readability. - Show summary only when jobs fail, removing redundant all-pass output. 
Co-Authored-By: Claude <noreply@anthropic.com> * feat(ci): add Cambricon MLU platform CI support - Add .ci/images/cambricon/Dockerfile for AnolisOS-based Cambricon image - Add cambricon platform to config.yaml with MLU-style GPU passthrough - Add GPU_STYLE_MLU constant and MLU_VISIBLE_DEVICES support in run.py - Add cnmon-based GPU detection (_detect_gpus_cambricon) in ci_resource.py - Add --test CLI flag to override pytest test path at runtime - Skip empty stage run commands instead of erroring (compilation-only mode) - Fix _torch_gemm fallback for CPU float16/bfloat16 (upcast to float32) - Skip bfloat16 on MLU (cnnlBatchMatMulEx does not support it) - Hoist _PYTEST_VALUE_FLAGS to module level; add ValueError guard in cambricon parser - Remove redundant yaml import guard in agent.py (utils.py already handles it) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * docs(ci): translate README and comments to English, use ngpus for NVIDIA scheduler - Rewrite README.md entirely in English; add Cambricon to platform table and directory tree. - Translate all inline comments in config.yaml to English. - Replace `gpu_ids: "0"` with `ngpus: 1` for NVIDIA platform so the scheduler auto-picks a free GPU rather than pinning to device 0. - Add `ngpus` support to `parse_gpu_requirement` in ci_resource.py so scheduler correctly counts NVIDIA GPU demand. - Replace deprecated `gpu_count` fallback with `ngpus` in run.py `build_docker_args`. 
Co-Authored-By: Claude <noreply@anthropic.com> * feat(ci): add --local flag to run.py for testing uncommitted changes - Mount current directory read-only into container via `-v cwd:/workspace/repo:ro` - Copy to writable `/tmp/src` inside container before setup runs, so host files are never modified by pip install or build artifacts - Simplify README: fix ngpus example, add gpu_style column, add --local docs Co-Authored-By: Claude <noreply@anthropic.com> * style(ci): normalize comments to complete English sentences with markdown - Backtick-quote tool/package names (`torch`, `pip`, `git`, `cmake`, `coreutils-single`, `conda`) and paths in Dockerfile comments. - Add explanatory comment to the commented-out `agents:` block in `config.yaml` describing when to uncomment it. - Convert all section-header banners in `.ci/tests/` to "Tests for `FunctionName`." sentence form; fix three docstrings in `test_agent.py`. - Backtick-quote identifiers in `tests/test_gemm.py` inline comments. Co-Authored-By: Claude <noreply@anthropic.com> * style(tests): backtick-quote identifiers in test_gemm.py skip message Co-Authored-By: Claude <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude <noreply@anthropic.com>
This reverts commit d6b5fd5.
* feat: Add RMSNorm op in cambricon backend. * refactor: make `Cast` utility to use `Device::Type` template parameter * refactor: add `Caster` mixin * refactor: rename `cast**` to `caster**` * fix: fix the mlu naming to google c++ naming style * chore: format files with `clang-format` * refactor: update CUDA kernels to use `Caster` * fix: fix rmsnorm dispatch to use one dispatch --------- Co-authored-by: Jiacheng Huang <huangjiacheng0709@outlook.com>
…xed dispatch (#29) * feat: add a convenient interface for any `int64_t`-convertible types and use `DispatchFunc()` to dispatch `DataType` and block sizes with a single call. - add a convenient interface for any `int64_t`-convertible types, which is mostly used for multi-type dispatch and mixed dispatch - use `DispatchFunc()` to dispatch `DataType` and block sizes with a single function call in various kernels' implementation - remove the `CUDA_BLOCK_SIZE_XXX` macros and simply use numbers instead * style: fix the styling issue by adding a period to the TODO comment * fix: fix rebase error * style: fix the styling issues for comments in `dispatcher.h` and `cuda/causal_softmax/kernel.h`
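The mixed-dispatch idea above can be illustrated with a simplified stand-in: a runtime value (anything convertible to `int64_t`) is matched against a compile-time candidate list, and the functor receives the match as an `integral_constant` so kernels can treat it as a constexpr (e.g. a block size). `DispatchValue` is an illustrative name, not the library's actual `DispatchFunc()` API:

```cpp
#include <cstdint>
#include <stdexcept>
#include <type_traits>
#include <utility>

// Try each candidate in turn; on a match, instantiate the functor with the
// matching value baked in as a compile-time constant. C++17-compatible.
template <std::int64_t First, std::int64_t... Rest, typename Func>
decltype(auto) DispatchValue(std::int64_t value, Func&& func) {
    if (value == First) {
        return func(std::integral_constant<std::int64_t, First>{});
    }
    if constexpr (sizeof...(Rest) > 0) {
        return DispatchValue<Rest...>(value, std::forward<Func>(func));
    } else {
        throw std::invalid_argument("unsupported dispatch value");
    }
}
```

A usage sketch, replacing what the removed `CUDA_BLOCK_SIZE_XXX` macros used to express:

```cpp
int picked = DispatchValue<128, 256, 512>(256, [](auto bs) {
    constexpr std::int64_t kBlock = decltype(bs)::value;  // compile-time
    return static_cast<int>(kBlock);
});
```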
#38) * refactor: make `TypeMap`, `IsFP16`, `IsBFloat16`, and `DispatchFunc` device-aware * refactor: make `cuda/` shared headers self-contained and include-order-independent * fix: update call sites to device-aware `TypeMap`, `IsFP16`/`IsBFloat16`, and `DispatchFunc` * chore: format files with `clang-format` * fix: update `cuda/swiglu` kernels to use device-aware type predicates * fix: replace per-instance `blasHandle_t` with a static singleton in `Blas` * fix: restore kernel headers for `moore/add` and `moore/swiglu` to use `clang-format off` and `clang-format on` * fix: use absolute includes, consistent include guards, and formatted comments * refactor: extract `GetOptimalBlockSize` logic into shared `ComputeOptimalBlockSize` * fix: include `<musa_fp16.h>` in `polyfills.cuh` before `hrcp` macro to prevent collision * chore: add blank lines between `using` type alias declarations in `device_.h` * chore: add TODO comments for potential performance and concurrency issues * fix: move `clang-format` guards to wrap only CUDA headers in `iluvatar/device_.h`
                a_host.strides()};
Tensor b_device{b_ptr, b_host.shape(), b_host.dtype(), a_host.device(),
                b_host.strides()};
Tensor c_device{c_ptr, c_host.shape(), c_host.dtype(), a_host.device(),
Do `a_host`/`b_host`/`c_host` refer to host-side Tensors? If so, why are `a_device`/`b_device`/`c_device` all constructed directly with `a_host.device()`?
Is this a unified runtime API meant specifically for the examples? Why not use the one already wrapped by infiniop?
Does every elementwise op need its own header-file implementation? It seems this could be unified into a single one to simplify the process of adding elementwise ops.
                Backend::memcpyH2D);
Backend::memcpy(d_other_strides_, other_strides_.data(), strides_size,
                Backend::memcpyH2D);
Backend::memcpy(d_out_strides_, out_strides_.data(), strides_size,
Leave a TODO to later reduce the mallocs/memcpys as much as possible, or eliminate them for some common shapes.
At minimum, the current code could be changed to a single malloc of the total size, handing out sub-buffers by offset; the same applies to the other elementwise kernels.
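The single-allocation suggestion can be sketched as follows: one buffer of `3 * ndim` elements, with the three stride arrays carved out at fixed offsets. Names (`StrideSlots`, `CarveStrideSlots`) are hypothetical, and a host buffer stands in for the device allocation:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Three disjoint sub-arrays served from one base pointer.
struct StrideSlots {
    std::int64_t* in;
    std::int64_t* other;
    std::int64_t* out;
};

// `base` points at a single allocation of at least 3 * ndim elements;
// each slot is an offset into it, so one malloc/free covers all three.
inline StrideSlots CarveStrideSlots(std::int64_t* base, std::size_t ndim) {
    return StrideSlots{base, base + ndim, base + 2 * ndim};
}
```

One `Backend::malloc` plus one (or three contiguous) `Backend::memcpy` calls then replace the three separate malloc/memcpy pairs in the original snippet.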
Does the underscore suffix in the filename carry any special meaning?
Also, wouldn't this file's contents fit better in a `.cuh` file?
static constexpr auto free = cudaFree;

static constexpr auto memcpyH2D = cudaMemcpyHostToDevice;
APIs like these have to be spelled out again in `namespace infini::ops::op_name::DeviceBackend` for every op that uses them, which is redundant. Why group the APIs by op rather than by device (so each device would only need to write them once)?
#ifndef INFINI_OPS_CASTER_H_
#define INFINI_OPS_CASTER_H_

#include "data_type.h"
Also, the implementation code and header files are currently all concentrated in …
namespace infini::ops {

namespace swiglu {
Also, there are still many namespaces that carry the op name. The core issue seems to be that the device level would fit better above the op level (better reuse); right now it is the other way around.
This PR builds out the full `InfiniOps` operator library from the initial `Tensor`/`DataType` scaffolding. It spans 109 files across core framework, operator implementations, build system, Python bindings, testing, and CI.

Core Framework
- Operator caching (`CacheKey`) for workspace reuse.
- `DispatchFunc` high-level interface for multi-type and mixed dispatch.

Operators & Backend Support

Build System
- Per-device build switches (`-DWITH_NVIDIA`, etc.) and auto-detection mode (`-DAUTO_DETECT_DEVICES`).
- `pyproject.toml` with scikit-build-core backend — supports `pip install .`.

Python Bindings
- libclang-based code generator (`scripts/generate_wrappers.py`) parses operator headers → emits pybind11 code.
- `TensorFromPybind11Handle`.

Testing
- `pytest` with auto-parametrization by `device` & `dtype`, tolerance-based assertions.
- `@pytest.mark.auto_act_and_assert` for declarative Act/Assert against PyTorch references.
- Benchmarking (`--benchmark`) via `torch.utils.benchmark`.

CI (`.ci/`)
- Platform configuration (`config.yaml`).
- Per-machine agent (`agent.py`).