contrib: Mixtral MoE (SDK 2.29) + Mistral-Small-4-119B-2603 #133

Open
jimburtoft wants to merge 1 commit into aws-neuron:main from jimburtoft:contrib/mixtral-moe-sdk29

Conversation

@jimburtoft
Contributor

Summary

  • Mixtral 8x7B: Updated with SDK 2.29 torch_block_wise workaround and benchmark results (40.4 tok/s, +5% over SDK 2.28). Added patch_moe.py script and documented TKG non-applicability for MoE models.
  • Mixtral 8x22B: New contrib directory with SDK 2.29 results (25.8 tok/s, +4% and +18% for long inputs). Includes NVMe storage instructions for 262GB model.
  • Mistral-Small-4-119B-2603: New custom model contrib with NeuronDeepseekV3ForCausalLM (429-line model class supporting MLA + 128-expert MoE). Achieves 74.5 tok/s on trn2.48xlarge TP=16 after fixing a critical MLA attention bug in stock NxDI code. Includes FP8→BF16 extraction, tokenizer fix, and all required patches.
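The body of `patch_moe.py` is not inlined in this PR description; it ships in the contrib directories. As a purely illustrative sketch of the monkey-patch pattern such a workaround script typically uses (every module and function name below is a stand-in, not an actual NxDI or Neuron symbol — the demo targets `math` only so it is runnable anywhere):

```python
import importlib

def swap_attr(module_name, attr, replacement):
    """Replace a module-level callable and return the original for reverting.

    In a real patch script, module_name/attr would point at the kernel entry
    point being overridden (hypothetical here), applied before model tracing.
    """
    mod = importlib.import_module(module_name)
    original = getattr(mod, attr)
    setattr(mod, attr, replacement)
    return original

# Demo on a stdlib module, standing in for the real kernel module:
import math
orig = swap_attr("math", "sqrt", lambda x: -1.0)
assert math.sqrt(9) == -1.0   # patched behavior in effect
setattr(math, "sqrt", orig)   # revert
assert math.sqrt(9) == 3.0
```

The revert handle matters in practice: keeping the original callable lets the same script A/B-test the patched and unpatched kernels in one process.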

Key Findings

  1. MLA Bug Fix (upstream candidate): `out_absorb = wkv_b[:, self.v_head_dim:, :]` should be `wkv_b[:, self.qk_nope_head_dim:, :]` in modeling_deepseek.py. The bug is invisible for stock DeepSeek V3 (both dims are 128) but crashes Mistral-Small-4 (v_head_dim=128, qk_nope_head_dim=64).
  2. TKG doesn't help MoE: Expert dispatch dominates TPOT (~60%), not attention.
  3. SDK 2.29 torch_block_wise is slightly faster than SDK 2.28 NKI blockwise (+4-5%).
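The slicing error in finding 1 is easy to see on shapes alone. A minimal sketch (the dimension values follow the description above; the packed layout of `wkv_b` is an assumption for illustration, not the actual NxDI tensor layout):

```python
import numpy as np

# Hypothetical MLA dims mirroring Mistral-Small-4 per the PR description:
qk_nope_head_dim = 64    # no-RoPE part of the query/key head dim
v_head_dim = 128         # value head dim
kv_lora_rank = 512       # latent width of the compressed KV projection (assumed)
n_heads = 4

# Assumed packing: per head, the key (nope) rows followed by the value rows.
wkv_b = np.zeros((n_heads, qk_nope_head_dim + v_head_dim, kv_lora_rank))

# Buggy slice: skips v_head_dim (128) rows, leaving only 64 "value" rows.
out_absorb_buggy = wkv_b[:, v_head_dim:, :]

# Fixed slice: skips exactly the qk_nope_head_dim (64) key rows.
out_absorb_fixed = wkv_b[:, qk_nope_head_dim:, :]

print(out_absorb_buggy.shape)  # (4, 64, 512)  -> shape mismatch downstream
print(out_absorb_fixed.shape)  # (4, 128, 512) -> matches v_head_dim
```

When `qk_nope_head_dim == v_head_dim`, as in stock DeepSeek V3, the two slices select identically shaped (and identically positioned) blocks, which is why the bug only surfaces once the dims diverge.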

Instance Requirements

| Model | Instance | TP | tok/s |
|---|---|---|---|
| Mixtral 8x7B | trn2.48xlarge | 8 | 40.4 |
| Mixtral 8x22B | trn2.48xlarge | 16 | 25.8 |
| Mistral-Small-4-119B | trn2.48xlarge | 16 | 74.5 |

- Mixtral 8x7B: Updated README with SDK 2.29 results (40.4 tok/s, +5% over 2.28),
  added patch_moe.py for torch_block_wise workaround, documented TKG non-applicability
- Mixtral 8x22B: New contrib directory with SDK 2.29 results (25.8 tok/s, +4%),
  patch_moe.py, NVMe storage instructions for 262GB model
- Mistral-Small-4-119B-2603: New contrib with custom NeuronDeepseekV3ForCausalLM model
  class (MLA+MoE), FP8->BF16 extraction script, MLA bug fix, tokenizer fix,
  74.5 tok/s on TP=16 (6.9x improvement over broken Phase 1 baseline)