Update gemma-3-1b-it contrib: fix head_dim=256 issues, add chunked attention and SWA support #129
Open
jimburtoft wants to merge 1 commit into aws-neuron:main from
Conversation
…emma3/

Replaces the standalone Annapurna Labs implementation with subclasses of the official NeuronGemma3* classes, adding only the overrides needed for the 1B variant (head_dim=256, vocab_size=262144, GQA 4:1):

- Chunked Q@K^T and scores@V for head_dim>128 (compiler DGE OOB fix)
- k_cache_transposed restored for SWA layers (GQA repeat_kv layout fix)
- vocab_size read from HF config instead of hardcoded 262208
- NKI attention kernel auto-disabled when head_dim>128
- query_pre_attn_scalar fused into Q/K weights at load time (zero cost)
- DecoderLayer and TextModel overrides to swap in correct attention class

Tested on trn2.3xlarge with upstream NxDI main (0.8.0+26b1fcf5.dev): compile 27s, load 12.5s, all 5 integration tests pass.
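The chunked Q@K^T and scores@V change can be sketched as follows. This is a minimal illustration of the technique, not the actual NxDI code: the function name, tensor layout, and the 128-wide chunk limit are assumptions. The idea is that Q@K^T contracts over head_dim, so the contraction can be split into ≤128-wide slices and accumulated, while scores@V produces head_dim output columns, so each ≤128-wide column slice can be computed independently and concatenated.

```python
import torch

CHUNK = 128  # assumed max head_dim slice the compiler handles without DGE OOB

def chunked_attention(q, k, v, chunk=CHUNK):
    """Illustrative chunking of Q@K^T and scores@V for head_dim > 128.

    q, k, v: [batch, heads, seq, head_dim]. Mathematically identical to
    unchunked attention; only the matmul shapes change.
    """
    d = q.shape[-1]
    # Q@K^T contracts over head_dim: accumulate partial products per slice.
    scores = sum(
        q[..., s:s + chunk] @ k[..., s:s + chunk].transpose(-2, -1)
        for s in range(0, d, chunk)
    )
    probs = torch.softmax(scores * d ** -0.5, dim=-1)
    # scores@V yields head_dim columns: compute each <=128-wide slice separately.
    return torch.cat(
        [probs @ v[..., s:s + chunk] for s in range(0, d, chunk)], dim=-1
    )
```

Because each partial matmul keeps the contraction or output width at 128 or below, the compiler's dynamic-gather/DGE indexing stays in bounds while the result is bit-for-bit equivalent up to floating-point accumulation order.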
Summary
This update replaces the code in models/gemma3/ with an implementation that fixes 5 issues specific to the 1B variant's unusual head_dim=256 architecture. Because it subclasses the official models/gemma3/ classes, future upstream fixes flow through automatically.

Why This Update?
The 1B variant has unusual architecture parameters vs the 4B/12B/27B variants: head_dim=256, vocab_size=262144, and a 4:1 GQA (grouped-query attention) ratio.
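The 4:1 GQA ratio means each KV head serves four query heads, so cached K/V must be expanded before the attention matmuls. A standard repeat_kv expansion (in the style of the Hugging Face transformers helper; the NxDI internals may differ) looks like this sketch:

```python
import torch

def repeat_kv(x, n_rep):
    # Expand [B, n_kv_heads, S, D] to [B, n_kv_heads * n_rep, S, D] so each
    # KV head is shared by n_rep query heads (n_rep=4 for gemma-3-1b-it).
    b, h_kv, s, d = x.shape
    return (
        x[:, :, None, :, :]
        .expand(b, h_kv, n_rep, s, d)
        .reshape(b, h_kv * n_rep, s, d)
    )
```

This expansion assumes the [B, H, S, D] layout; when the SWA layers instead store K transposed in the cache, the repeat must respect that layout, which is the layout mismatch the k_cache_transposed restoration addresses.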
The previous implementation disabled sliding window attention entirely and reimplemented all components from scratch. This update subclasses the official code and fixes only what's needed.
Issues Fixed

1. Compiler DGE out-of-bounds for head_dim>128: Q@K^T and scores@V are now chunked.
2. GQA repeat_kv layout mismatch on SWA layers: k_cache_transposed restored.
3. vocab_size hardcoded as 262208: now read from the HF config (262144 for 1B).
4. NKI attention kernel unsupported for head_dim>128: auto-disabled.
5. query_pre_attn_scalar scaling: fused into Q/K weights at load time (zero runtime cost).
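The query_pre_attn_scalar fusion can be sketched as below. How the scale is actually distributed in the PR is not shown here; splitting it evenly across the Q and K projections is an assumption chosen so the product q @ k^T picks up the full query_pre_attn_scalar ** -0.5 factor with no per-step multiply.

```python
import torch

def fuse_scale(w_q, w_k, query_pre_attn_scalar):
    # Gemma scales attention logits by query_pre_attn_scalar ** -0.5.
    # Multiplying each projection weight by the fourth root folds the
    # whole scale into q @ k^T, making the runtime scaling a no-op.
    s = query_pre_attn_scalar ** -0.25
    return w_q * s, w_k * s
```

Since the multiply happens once on the weights at load time, it costs nothing at inference, hence "zero cost" in the commit message.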
Architecture
Thin subclasses of the official implementation (no upstream files modified): DecoderLayer and TextModel overrides swap in the correct attention class for the 1B variant.
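The thin-subclass pattern is generic: override only what the 1B variant needs and inherit everything else. The class and attribute names below are hypothetical stand-ins for the NeuronGemma3* classes, shown with a stub base so the sketch is self-contained:

```python
# Generic sketch of the thin-subclass override pattern; not the NxDI classes.
class BaseAttention:
    def __init__(self, config):
        self.head_dim = getattr(config, "head_dim", 128)
        self.use_nki_kernel = True  # stand-in for the NKI kernel toggle

class Attention1B(BaseAttention):
    def __init__(self, config):
        super().__init__(config)
        if self.head_dim > 128:
            # NKI attention kernel does not support head_dim > 128.
            self.use_nki_kernel = False
```

Because the subclass touches only the attributes it must change, upstream fixes to the base class are inherited automatically.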
Required Configuration
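The required settings named in this section can be collected as a plain mapping; how they are actually passed to NxDI (kwargs vs. a config object) is not shown here and the dict form is an assumption:

```python
# The settings this contrib requires; bucket list truncated as "[512]+"
# in the PR, so only the documented first bucket is shown.
required_overrides = {
    "attn_kernel_enabled": False,  # NKI kernel unsupported for head_dim=256
    "k_cache_transposed": True,    # SWA layers need the transposed K layout
    "context_encoding_buckets": [512],
}
```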
attn_kernel_enabled: false
k_cache_transposed: true
context_encoding_buckets: [512]+

Validation
attn_kernel_enabled=False workaround

Compatibility