
Update tutorials for Neuron SDK 2.29 release (NKI 0.3.0)#122

Open
mrkcath-aws wants to merge 1 commit into main from nki-update

Conversation

@mrkcath-aws
Collaborator

Issue #, if available:

N/A

Description of changes:

Update samples for NKI 0.3.0

Testing:

Please see the detailed unit test requirements in CONTRIBUTING.md

  • [ N/A ] The change is covered by numeric check using nki.baremetal
  • [ N/A ] The change is covered by performance benchmark test using nki.benchmark
  • [ N/A ] The change is covered by end-to-end integration test

Pull Request Checklist

  • [ X ] I have filled in all the required fields in the template
  • [ X ] I have tested locally that all the tests pass
  • [ X ] By submitting this pull request, I confirm that my contribution is made under the terms of the MIT-0 license.

@mrkcath-aws mrkcath-aws changed the title Update tutorials for NeuronSDK 2.29 release (NKI 0.3.0) Update tutorials for Neuron SDK 2.29 release (NKI 0.3.0) Apr 16, 2026
Comment thread src/nki_samples/tutorials/attention_fwd_performance/test_attention.py Outdated
Comment thread src/nki_samples/tutorials/average_pool2d/average_pool2d_nki_kernels.py Outdated
Comment thread src/nki_samples/tutorials/tensor_addition/tensor_addition_nki_kernels.py Outdated
Comment thread src/nki_samples/tutorials/tensor_addition/tensor_addition_torch.py Outdated
Comment on lines +9 to +10

def softmax_isa(data, axis=(1,)):

How is this one different from nl.softmax?

# scores @ V, contract along seqlen_kv
attn_out: nt.tensor[seqlen_q, d_head] = nl.matmul(scores, v_sbuf_t, transpose_x=False)

# nl.matmul with transpose_x=False internally transposes to PSUM which

Doesn't that make this language API a bit moot then? This API should change if what it does in its implementation doesn't work on gen3+?

Given that nc_matmul also has an is_transpose mode telling the TensorE to transpose the tile in SBUF, I'm wondering whether we could switch to that instead?

qk = nl.ndarray((seqlen_q, seqlen_kv), dtype=nl.float32, buffer=nl.psum)
nisa.nc_matmul(dst=qk, stationary=q_sbuf, moving=k_sbuf)
qk_sbuf = nl.ndarray(qk.shape, dtype=nl.float32, buffer=nl.sbuf)
nisa.tensor_copy(dst=qk_sbuf, src=qk)
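
A minimal sketch of that alternative for the scores @ V step, assuming is_transpose takes the same dst/stationary/moving keyword form as the call above; the exact operand roles for is_transpose (and the identity_sbuf helper) are hypothetical and would need checking against the nc_matmul docs:

# Hypothetical sketch: let TensorE transpose the V tile via is_transpose,
# instead of relying on nl.matmul's internal PSUM transpose.
# identity_sbuf is a placeholder identity tile, not from the sample.
v_t_psum = nl.ndarray((d_head, seqlen_kv), dtype=nl.float32, buffer=nl.psum)
nisa.nc_matmul(dst=v_t_psum, stationary=v_sbuf, moving=identity_sbuf, is_transpose=True)
# then compute scores @ V with the already-transposed tile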

Nit: Our docs say tensor_reduce and tensor_scalar (used by softmax) can take PSUM inputs. I wonder if this copy is even necessary.
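
If tensor_scalar really does accept a PSUM input tile, the staging copy could collapse to a direct call, e.g. (hypothetical; keyword names mirrored from the surrounding snippets, with out_sbuf and inv_sum as placeholder names for the softmax output tile and reciprocal row sum):

# Hypothetical: read the PSUM tile qk directly, skipping the tensor_copy
nisa.tensor_scalar(dst=out_sbuf, data=qk, op0=nl.multiply,
                   operand0=inv_sum, engine=nisa.vector_engine)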


attn_out[...] = nisa.tensor_scalar(data=attn_out_psum, op0=nl.multiply,
operand0=inverse_sum_row, engine=nisa.vector_engine)
nisa.tensor_scalar(dst=attn_out[...], data=attn_out_psum, op0=nl.multiply,

nit: While it works, using dst=attn_out[...] instead of dst=attn_out is a bit weird to me

row_max_kv = nl.ndarray((PMAX, num_kv_tiles), dtype=nl.float32, buffer=nl.sbuf)
for i_tile_kv in range(num_kv_tiles):
qk_sbuf = nl.ndarray((PMAX, FMAX_MOVING), dtype=nl.float32, buffer=nl.sbuf)
nisa.tensor_copy(dst=qk_sbuf, src=qk_tiles[i_tile_kv])

Similar to above: tensor_reduce mentions it accepts PSUM input, so is this copy necessary?
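
If tensor_reduce does accept a PSUM data tile, the loop body could reduce qk_tiles[i_tile_kv] directly in place of the copy, e.g. (hypothetical; the dst= keyword form is assumed from the other 0.3.0-style calls in this diff):

for i_tile_kv in range(num_kv_tiles):
    # Hypothetical: reduce the PSUM tile straight into the SBUF result
    nisa.tensor_reduce(dst=row_max_kv[:, nl.ds(i_tile_kv, 1)], op=nl.max,
                       data=qk_tiles[i_tile_kv], axis=1)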

exp_row[:, nl.ds(i_tile_kv*FMAX_MOVING, FMAX_MOVING)] = nisa.activation(
for i_tile_kv in range(num_kv_tiles):
qk_sbuf = nl.ndarray((PMAX, FMAX_MOVING), dtype=nl.float32, buffer=nl.sbuf)
nisa.tensor_copy(dst=qk_sbuf, src=qk_tiles[i_tile_kv])

Same here for activation, where the data tile can be an SBUF or PSUM tile

@@ -362,9 +386,12 @@ def nki_matmul_fully_optimized_(
BLOCK_K = TILE_K * TILES_IN_BLOCK_K

# Verify the size is a multiple of block size

I'm waiting on Doug to merge https://github.com/aws-neuron/private-aws-neuron-sdk-staging/pull/2998, but once it is merged, I believe we can update the sample here with the new version of the fully_optimized_ kernel (not blocking this PR, just a reminder)

Comment on lines -30 to -31
offset_i_x = nl.program_id(0) * 128
offset_i_y = nl.program_id(1) * 512

Are we intentionally dropping the LNC stuff here?
