Update tutorials for Neuron SDK 2.29 release (NKI 0.3.0) #122

mrkcath-aws wants to merge 1 commit into main from
Conversation
```python
def softmax_isa(data, axis=(1,)):
```
How is this one different from nl.softmax?
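For reference, whatever the ISA-level decomposition of `softmax_isa` is, the math it should implement is the standard numerically stable softmax. A plain NumPy sketch (not NKI code, just the reference semantics):

```python
import numpy as np

def softmax_reference(data, axis=1):
    """Numerically stable softmax along `axis` (NumPy reference only)."""
    # Subtract the row max so exp() cannot overflow.
    shifted = data - np.max(data, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)
```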
```python
# scores @ V, contract along seqlen_kv
attn_out: nt.tensor[seqlen_q, d_head] = nl.matmul(scores, v_sbuf_t, transpose_x=False)

# nl.matmul with transpose_x=False internally transposes to PSUM which
```
Doesn't that make this language API a bit moot then? Shouldn't the API change if its implementation doesn't work on gen3+?
Given that `nc_matmul` also has an `is_transpose` mode telling the TensorE to transpose the tile in SBUF, I'm wondering if we could switch to that instead.
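For context on the transpose question: an engine that needs its stationary operand pre-transposed can still produce the same contraction, because `scores @ V == (V.T @ scores.T).T`. A plain NumPy sketch with illustrative shapes (not taken from the kernel):

```python
import numpy as np

# Hypothetical shapes, for illustration only.
seqlen_q, seqlen_kv, d_head = 4, 8, 16
rng = np.random.default_rng(0)
scores = rng.random((seqlen_q, seqlen_kv), dtype=np.float32)
v = rng.random((seqlen_kv, d_head), dtype=np.float32)

# scores @ V contracts over seqlen_kv. The same result can be computed
# from pre-transposed operands: (V^T @ scores^T)^T.
attn_out = scores @ v
attn_out_via_t = (v.T @ scores.T).T
assert np.allclose(attn_out, attn_out_via_t)
```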
```python
qk = nl.ndarray((seqlen_q, seqlen_kv), dtype=nl.float32, buffer=nl.psum)
nisa.nc_matmul(dst=qk, stationary=q_sbuf, moving=k_sbuf)
qk_sbuf = nl.ndarray(qk.shape, dtype=nl.float32, buffer=nl.sbuf)
nisa.tensor_copy(dst=qk_sbuf, src=qk)
```
Nit: our docs say `tensor_reduce` and `tensor_scalar` (used by softmax) can take PSUM input. I wonder whether this copy is even necessary.
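Assuming `nc_matmul` contracts along the partition (first) axis, i.e. computes `stationary.T @ moving` (an assumption about the layout here, not stated in the diff), the QK step above is equivalent to this NumPy sketch:

```python
import numpy as np

# Assumed layout: q_sbuf is (d_head, seqlen_q), k_sbuf is (d_head, seqlen_kv);
# contracting along the partition axis gives stationary.T @ moving.
d_head, seqlen_q, seqlen_kv = 16, 4, 8
rng = np.random.default_rng(0)
q_sbuf = rng.random((d_head, seqlen_q), dtype=np.float32)
k_sbuf = rng.random((d_head, seqlen_kv), dtype=np.float32)

qk = q_sbuf.T @ k_sbuf  # shape (seqlen_q, seqlen_kv), like the PSUM tile
assert qk.shape == (seqlen_q, seqlen_kv)
```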
```python
attn_out[...] = nisa.tensor_scalar(data=attn_out_psum, op0=nl.multiply,
                                   operand0=inverse_sum_row, engine=nisa.vector_engine)
nisa.tensor_scalar(dst=attn_out[...], data=attn_out_psum, op0=nl.multiply,
```
Nit: while it works, using `dst=attn_out[...]` instead of `dst=attn_out` reads a bit oddly to me.
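Semantically, this `tensor_scalar` step is a row-wise broadcast multiply by the per-row `1/sum(exp)` factor. A NumPy sketch with made-up shapes:

```python
import numpy as np

# Scale each row of attn_out by that row's inverse exp-sum (row-wise broadcast).
seqlen_q, d_head = 4, 16
rng = np.random.default_rng(0)
attn_out_psum = rng.random((seqlen_q, d_head), dtype=np.float32)
row_sums = rng.random((seqlen_q, 1), dtype=np.float32) + 1.0  # avoid zeros
inverse_sum_row = 1.0 / row_sums

attn_out = attn_out_psum * inverse_sum_row  # (seqlen_q, 1) broadcasts per row
```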
```python
row_max_kv = nl.ndarray((PMAX, num_kv_tiles), dtype=nl.float32, buffer=nl.sbuf)
for i_tile_kv in range(num_kv_tiles):
    qk_sbuf = nl.ndarray((PMAX, FMAX_MOVING), dtype=nl.float32, buffer=nl.sbuf)
    nisa.tensor_copy(dst=qk_sbuf, src=qk_tiles[i_tile_kv])
```
Similar to above: the `tensor_reduce` docs say it accepts PSUM input, so is this copy necessary?
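The tiled row-max pattern this loop computes can be checked against a single full-row max in NumPy (the tile constants are taken from the diff; the data is illustrative):

```python
import numpy as np

# Tiled row-max: reduce each KV tile separately, then reduce across tiles.
# This equals one max over the full concatenated row.
PMAX, FMAX_MOVING, num_kv_tiles = 128, 512, 4
rng = np.random.default_rng(0)
qk_tiles = [rng.random((PMAX, FMAX_MOVING), dtype=np.float32)
            for _ in range(num_kv_tiles)]

row_max_kv = np.stack([t.max(axis=1) for t in qk_tiles], axis=1)  # (PMAX, num_kv_tiles)
row_max = row_max_kv.max(axis=1)

full = np.concatenate(qk_tiles, axis=1)
assert np.allclose(row_max, full.max(axis=1))
```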
```python
exp_row[:, nl.ds(i_tile_kv*FMAX_MOVING, FMAX_MOVING)] = nisa.activation(
for i_tile_kv in range(num_kv_tiles):
    qk_sbuf = nl.ndarray((PMAX, FMAX_MOVING), dtype=nl.float32, buffer=nl.sbuf)
    nisa.tensor_copy(dst=qk_sbuf, src=qk_tiles[i_tile_kv])
```
Same here for `activation`, where the data tile can be an SBUF or PSUM tile.
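The activation step computes `exp(x - row_max)` per KV tile for numerical stability. A NumPy sketch, assuming `row_max` is the max over the row (shown here with a single tile):

```python
import numpy as np

# Per-tile exp(x - row_max): subtracting the row max before exponentiating
# keeps every exp() input <= 0, so the result stays in (0, 1].
PMAX, FMAX_MOVING = 128, 512
rng = np.random.default_rng(0)
qk_tile = rng.random((PMAX, FMAX_MOVING), dtype=np.float32)
row_max = qk_tile.max(axis=1, keepdims=True)

exp_tile = np.exp(qk_tile - row_max)
assert exp_tile.max() <= 1.0 + 1e-6
```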
```diff
@@ -362,9 +386,12 @@ def nki_matmul_fully_optimized_(
 BLOCK_K = TILE_K * TILES_IN_BLOCK_K

 # Verify the size is a multiple of block size
```
I'm waiting on Doug to merge https://github.com/aws-neuron/private-aws-neuron-sdk-staging/pull/2998, but once it is merged I believe we can replace the sample here with the updated version of the `fully_optimized_` kernel (not blocking this PR, just a reminder).
```python
offset_i_x = nl.program_id(0) * 128
offset_i_y = nl.program_id(1) * 512
```
Are we intentionally dropping the LNC stuff here?
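Whatever happens with the LNC handling, the offsets themselves just select one 128 x 512 output tile per program instance on a 2-D launch grid. A tiny sketch of that arithmetic (`tile_offsets` is a hypothetical helper, not in the kernel):

```python
# Each (pid0, pid1) program instance covers a 128 x 512 output tile;
# the offsets are the top-left corner of that tile.
def tile_offsets(pid0, pid1):
    return pid0 * 128, pid1 * 512

# e.g. program (2, 3) starts at row 256, column 1536
assert tile_offsets(2, 3) == (256, 1536)
```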
Issue #, if available:
N/A
Description of changes:
Update samples for NKI 0.3.0
Testing:
Please see detailed unit test requirements in the CONTRIBUTING.md
`nki.baremetal`, `nki.benchmark`

Pull Request Checklist