6 matmul variants, 4 attention kernels, 14 element-wise ops — all written from scratch. Reverse-mode autodiff with gradient tracking through 18+ operations.
Familiar layers — Linear, Embedding, GELU, SiLU, LayerNorm, BatchNorm, Dropout, MultiHeadAttention, GQA, Flash Attention — plus LR schedulers, ready to train. A clean, readable implementation of a deep learning framework from first principles.
```bash
pip install tensorax
```

```python
from tensorax import Tensor, nn, optim, lr_scheduler, functional as F

# Build
model = nn.Sequential(nn.Linear(4, 8), nn.GELU(), nn.LayerNorm(8), nn.Linear(8, 3))
optimizer = optim.Adam(model.parameters(), lr=0.001)
scheduler = lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

# Train
for epoch in range(100):
    loss = F.mse_loss(model(x_train), y_train)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```

Full usage guide with all APIs, code examples, and details: docs/USAGE.md
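The reverse-mode autodiff that powers `loss.backward()` can be illustrated with a minimal scalar sketch in plain Python. This is a conceptual model of the mechanism `tensor.py` implements (record each op's parents and a local backward rule, then walk the graph in reverse topological order), not the library's actual code:

```python
class Scalar:
    """Minimal reverse-mode autodiff node: stores a value, accumulated grad,
    the parent nodes that produced it, and a rule mapping the output grad
    to each parent's grad contribution."""

    def __init__(self, value, parents=(), backward_rule=None):
        self.value = value
        self.grad = 0.0
        self._parents = parents
        self._backward_rule = backward_rule or (lambda g: ())

    def __add__(self, other):
        # d(a+b)/da = d(a+b)/db = 1, so the output grad flows through unchanged.
        return Scalar(self.value + other.value, (self, other), lambda g: (g, g))

    def __mul__(self, other):
        # Product rule: d(ab)/da = b, d(ab)/db = a.
        return Scalar(self.value * other.value, (self, other),
                      lambda g: (g * other.value, g * self.value))

    def backward(self):
        # Topologically sort the graph, then propagate grads output -> inputs.
        order, seen = [], set()

        def visit(node):
            if node not in seen:
                seen.add(node)
                for p in node._parents:
                    visit(p)
                order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            for parent, g in zip(node._parents, node._backward_rule(node.grad)):
                parent.grad += g  # accumulate, since a node may feed several ops
```

For example, with `z = x * y + x`, calling `z.backward()` gives `x.grad == y.value + 1` and `y.grad == x.value`, matching the chain rule.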
Features: Core · Neural Networks · Training · Attention · CUDA Kernels · Infra
Matrix Multiplication — fp32, 3×1024×1024, 100 runs:

```
PyTorch CUDA (ref)        ████████████████████████████████████████████  0.41s  (4.51×)
Tensorax 1D Block Tiling  ██████████████████████████████████████████    0.95s  (2.31×)  ← best
Tensorax Tiled            ████████████████████████████████              1.22s  (1.80×)
NumPy CPU (baseline)      █████████████████████████                     1.85s  (1.00×)
```

2.31× faster than NumPy · 43% of PyTorch's cuBLAS throughput · all hand-written, zero library calls
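The block-tiling idea behind the benchmarked kernels can be sketched in NumPy. This is a conceptual model of how the CUDA kernel partitions the output and accumulates over K-tiles (which lets the GPU stage tiles in shared memory); the tile size here is illustrative, not the kernel's actual configuration:

```python
import numpy as np

def tiled_matmul(A, B, tile=32):
    """Block-tiled matmul: each (i, j) output tile accumulates partial
    products from successive tiles along the shared K dimension."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2, "inner dimensions must match"
    C = np.zeros((M, N), dtype=np.result_type(A, B))
    for i in range(0, M, tile):
        for j in range(0, N, tile):
            for k in range(0, K, tile):
                # NumPy slicing clips at array bounds, so ragged edge
                # tiles are handled automatically.
                C[i:i+tile, j:j+tile] += A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
    return C
```

On a GPU the payoff comes from reusing each staged tile across many threads; in pure NumPy this is only a faithful model of the loop structure, not a speedup.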
Attention Kernels — 4 implementations from naive to flash, supporting arbitrary batch/heads, asymmetric sequence lengths, and optional masks.
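For reference, the computation all four kernels implement is scaled dot-product attention. A NumPy version supporting arbitrary batch/head dims, asymmetric query/key lengths, and an optional boolean mask is sketched below; this is illustrative and not the signature of the library's `F.sdpa`:

```python
import numpy as np

def sdpa_ref(q, k, v, mask=None):
    """Reference scaled dot-product attention.
    q: (..., Lq, d), k: (..., Lk, d), v: (..., Lk, dv);
    mask (optional, bool) broadcasts to (..., Lq, Lk), False = masked out."""
    d = q.shape[-1]
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d)     # (..., Lq, Lk)
    if mask is not None:
        scores = np.where(mask, scores, -np.inf)         # drop masked positions
    scores -= scores.max(axis=-1, keepdims=True)         # numerically stable softmax
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v                                         # (..., Lq, dv)
```

The flash-attention kernel computes the same result but tiles over `Lk` and keeps a running max and sum, so the full (Lq, Lk) score matrix never has to be materialized.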
```
csrc/                    C++ / CUDA backend
├── cuda/kernels/        elementwise · matmul (×6) · reduction · attention (×4)
├── cpu/                 CPU fallback for all ops
└── tensor_ops.{cpp,h}   pybind11 bindings
tensorax/                Python package
├── tensor.py            Tensor class + autograd
├── functional.py        F.relu, F.gelu, F.silu, F.softmax, F.sdpa, ...
├── nn/                  Linear, Embedding, norms, dropout, attention (SDPA, MHA, GQA)
├── optim.py             SGD, Adam
└── lr_scheduler.py      StepLR, CosineAnnealingLR, ExponentialLR, LinearLR, MultiStepLR
```
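`CosineAnnealingLR`, used in the quick-start example, follows the standard cosine annealing schedule: the learning rate decays from the base rate to a minimum over `T_max` steps along a half-cosine. A sketch of that formula (the standard definition, not the library's internals):

```python
import math

def cosine_annealing_lr(step, base_lr, T_max, eta_min=0.0):
    """lr(t) = eta_min + (base_lr - eta_min) * (1 + cos(pi * t / T_max)) / 2"""
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * step / T_max))
```

At step 0 this returns `base_lr`, at `T_max // 2` the midpoint, and at `T_max` it reaches `eta_min`, which is why the quick-start loop calls `scheduler.step()` once per epoch with `T_max=100`.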
| Status | Features |
|---|---|
| ✅ | Core ops · autograd · NN layers · norms · optimizers · losses · attention (4 CUDA kernels) · GQA · MHA · matmul (6 variants) · GELU/SiLU · Embedding · LR schedulers |
| 🚧 | Expanded benchmarking · higher test coverage |
| 🔮 | Conv2D · MaxPool2D · AdamW · indexing/slicing · serialization · DataLoader · multi-GPU · mixed precision · DDP · ONNX export |
| Doc | Contents |
|---|---|
| Usage Guide | API reference, code examples, training patterns |
| Architecture | System design, kernel strategy, autograd internals |
| Development | Build, test, contribute |
| Examples | Runnable scripts for common tasks |
Fork → Branch → Commit → PR
See DEVELOPMENT.md for build instructions and guidelines.
```bibtex
@software{tensorax2025,
  title  = {Tensorax: Pure C++/CUDA Tensor Library},
  author = {Shrirang Mahajan},
  year   = {2025},
  url    = {https://github.com/NotShrirang/tensorax}
}
```