contrib: Mixtral MoE (SDK 2.29) + Mistral-Small-4-119B-2603 #133

Open
jimburtoft wants to merge 1 commit into aws-neuron:main from jimburtoft:contrib/mixtral-moe-sdk29

Conversation

@jimburtoft
Contributor

Summary

  • Mixtral 8x7B: Updated with SDK 2.29 torch_block_wise workaround and benchmark results (40.4 tok/s, +5% over SDK 2.28). Added patch_moe.py script and documented TKG non-applicability for MoE models.
  • Mixtral 8x22B: New contrib directory with SDK 2.29 results (25.8 tok/s, +4% and +18% for long inputs). Includes NVMe storage instructions for 262GB model.
  • Mistral-Small-4-119B-2603: New custom model contrib with NeuronDeepseekV3ForCausalLM (429-line model class supporting MLA + 128-expert MoE). Achieves 74.5 tok/s on trn2.48xlarge TP=16 after fixing a critical MLA attention bug in stock NxDI code. Includes FP8→BF16 extraction, tokenizer fix, and all required patches.
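The body of `patch_moe.py` is not inlined in this PR description; it ships in the contrib directories. As a purely illustrative sketch of the monkey-patch pattern such a workaround script typically uses (every module and function name below is a stand-in, not an actual NxDI or Neuron symbol — the demo targets `math` only so it is runnable anywhere):

```python
import importlib

def swap_attr(module_name, attr, replacement):
    """Replace a module-level callable and return the original for reverting.

    In a real patch script, module_name/attr would point at the kernel entry
    point being overridden (hypothetical here), applied before model tracing.
    """
    mod = importlib.import_module(module_name)
    original = getattr(mod, attr)
    setattr(mod, attr, replacement)
    return original

# Demo on a stdlib module, standing in for the real kernel module:
import math
orig = swap_attr("math", "sqrt", lambda x: -1.0)
assert math.sqrt(9) == -1.0   # patched behavior in effect
setattr(math, "sqrt", orig)   # revert
assert math.sqrt(9) == 3.0
```

The revert handle matters in practice: keeping the original callable lets the same script A/B-test the patched and unpatched kernels in one process.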

Key Findings

  1. MLA Bug Fix (upstream candidate): `out_absorb = wkv_b[:, self.v_head_dim:, :]` should be `wkv_b[:, self.qk_nope_head_dim:, :]` in modeling_deepseek.py. The bug is invisible for stock DeepSeek V3 (both dims are 128) but crashes Mistral-Small-4 (v_head_dim=128, qk_nope_head_dim=64).
  2. TKG doesn't help MoE: Expert dispatch dominates TPOT (~60%), not attention.
  3. SDK 2.29 torch_block_wise is slightly faster than SDK 2.28 NKI blockwise (+4-5%).
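The slicing error in finding 1 is easy to see on shapes alone. A minimal sketch (the dimension values follow the description above; the packed layout of `wkv_b` is an assumption for illustration, not the actual NxDI tensor layout):

```python
import numpy as np

# Hypothetical MLA dims mirroring Mistral-Small-4 per the PR description:
qk_nope_head_dim = 64    # no-RoPE part of the query/key head dim
v_head_dim = 128         # value head dim
kv_lora_rank = 512       # latent width of the compressed KV projection (assumed)
n_heads = 4

# Assumed packing: per head, the key (nope) rows followed by the value rows.
wkv_b = np.zeros((n_heads, qk_nope_head_dim + v_head_dim, kv_lora_rank))

# Buggy slice: skips v_head_dim (128) rows, leaving only 64 "value" rows.
out_absorb_buggy = wkv_b[:, v_head_dim:, :]

# Fixed slice: skips exactly the qk_nope_head_dim (64) key rows.
out_absorb_fixed = wkv_b[:, qk_nope_head_dim:, :]

print(out_absorb_buggy.shape)  # (4, 64, 512)  -> shape mismatch downstream
print(out_absorb_fixed.shape)  # (4, 128, 512) -> matches v_head_dim
```

When `qk_nope_head_dim == v_head_dim`, as in stock DeepSeek V3, the two slices select identically shaped (and identically positioned) blocks, which is why the bug only surfaces once the dims diverge.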

Instance Requirements

| Model | Instance | TP | tok/s |
|---|---|---|---|
| Mixtral 8x7B | trn2.48xlarge | 8 | 40.4 |
| Mixtral 8x22B | trn2.48xlarge | 16 | 25.8 |
| Mistral-Small-4-119B | trn2.48xlarge | 16 | 74.5 |

- Mixtral 8x7B: Updated README with SDK 2.29 results (40.4 tok/s, +5% over 2.28),
  added patch_moe.py for torch_block_wise workaround, documented TKG non-applicability
- Mixtral 8x22B: New contrib directory with SDK 2.29 results (25.8 tok/s, +4%),
  patch_moe.py, NVMe storage instructions for 262GB model
- Mistral-Small-4-119B-2603: New contrib with custom NeuronDeepseekV3ForCausalLM model
  class (MLA+MoE), FP8->BF16 extraction script, MLA bug fix, tokenizer fix,
  74.5 tok/s on TP=16 (6.9x improvement over broken Phase 1 baseline)