codegen improvements [AI slop]#131

Draft
pk910 wants to merge 10 commits into master from perf/codegen-improvements
Conversation


@pk910 pk910 commented Mar 16, 2026

let claude iterate on codegen improvements for several hours.
not for merge, more for cherry-picking ideas


perf: improve hasher/merkleization performance and reduce allocations

This PR optimizes the shared hasher and SSZ utility code used by the code generation path. All changes are in hasher/hasher.go and sszutils/, affecting hash tree root computation speed and allocation behavior across all codegen operations.

Changes

Hasher PutX optimization (hasher/hasher.go)
Rewrite PutUint64/PutUint32/PutUint16/PutUint8/PutBool to append 32 zero bytes then write the value directly, instead of encoding into a tmp buffer and calling AppendBytes32.
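The idea can be sketched as follows. This is an illustrative stand-in, not the actual `hasher.Hasher` type: a single append of a pre-zeroed 32-byte chunk, followed by an in-place write, replaces the old encode-to-tmp + `AppendBytes32` two-step.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// Hasher is a minimal stand-in holding just the hash buffer.
type Hasher struct {
	buf []byte
}

var zeroBytes [32]byte

// PutUint64 appends one zero-padded 32-byte chunk and writes the value
// directly into it: one append + one direct write instead of two appends.
func (h *Hasher) PutUint64(v uint64) {
	off := len(h.buf)
	h.buf = append(h.buf, zeroBytes[:]...)        // one append, padding included
	binary.LittleEndian.PutUint64(h.buf[off:], v) // direct write, no tmp buffer
}

func main() {
	h := &Hasher{}
	h.PutUint64(1)
	fmt.Printf("%d bytes, first=%d\n", len(h.buf), h.buf[0]) // 32 bytes, first=1
}
```

The narrower `PutUint32`/`PutUint16`/`PutUint8`/`PutBool` variants follow the same pattern with the corresponding `binary.LittleEndian` writer (or a single byte store).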

Merkleize fast paths (hasher/hasher.go)
Add early returns in Merkleize() for the most common input sizes, avoiding the full merkleizeImpl function call with its getDepth computation and loop:

  • 1 chunk (32 bytes): return immediately — data already in place
  • 2 chunks (64 bytes): single hash call — handles Checkpoint, BLSPubKey
  • 3–4 chunks (96–128 bytes): two hash calls — handles Fork and similar
  • 8 chunks (256 bytes): three hash calls — handles Validator (100K per state, biggest single win)
  • 16 chunks (512 bytes): four hash calls
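The ladder above can be sketched like this. `hashLayer` here is a stand-in for the hasher's batched hash call (one "hash call" in the list corresponds to one layer reduction); `merkleizeImpl` and the real buffer handling are assumed, not reproduced.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// hashLayer compresses each 64-byte pair of chunks into one 32-byte chunk,
// standing in for the hasher's batched hash call over a whole tree layer.
func hashLayer(buf []byte) []byte {
	out := make([]byte, 0, len(buf)/2)
	for i := 0; i < len(buf); i += 64 {
		h := sha256.Sum256(buf[i : i+64])
		out = append(out, h[:]...)
	}
	return out
}

// merkleize mirrors the early-return ladder: hot sizes skip the generic
// getDepth computation and loop entirely.
func merkleize(buf []byte) []byte {
	switch len(buf) {
	case 32: // 1 chunk: data already in place
		return buf
	case 64: // 2 chunks: one hash call (Checkpoint, BLSPubKey)
		return hashLayer(buf)
	case 128: // 4 chunks: two hash calls (Fork and similar)
		return hashLayer(hashLayer(buf))
	case 256: // 8 chunks: three hash calls (Validator)
		return hashLayer(hashLayer(hashLayer(buf)))
	}
	// generic fallback (merkleizeImpl in the real code)
	for len(buf) > 32 {
		buf = hashLayer(buf)
	}
	return buf
}

func main() {
	root := merkleize(make([]byte, 256))
	fmt.Println(hex.EncodeToString(root[:4]))
}
```

Note the real implementation works in place on the hasher buffer rather than allocating per layer; the allocation here is purely for readability.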

MerkleizeWithMixin optimization (hasher/hasher.go)
Replace 3-step mixin size encoding (MarshalUint64 → append → pad) with a single 32-byte zero append + direct PutUint64 write.

PutBytes fast path (hasher/hasher.go)
Skip the AppendBytes32 call and its modulo-32 check for exact 32-byte inputs (Hash32, Root, WithdrawalCredentials).
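The fast path amounts to one length check. The method names `PutBytes`/`AppendBytes32` are from the PR, but the bodies below are sketches on a stand-in type:

```go
package main

import "fmt"

type Hasher struct {
	buf []byte
}

var zeroBytes [32]byte

// AppendBytes32 appends b and pads to the next 32-byte boundary.
func (h *Hasher) AppendBytes32(b []byte) {
	h.buf = append(h.buf, b...)
	if rem := len(b) % 32; rem != 0 {
		h.buf = append(h.buf, zeroBytes[rem:]...)
	}
}

// PutBytes skips the modulo-32 padding check entirely for exact one-chunk
// inputs (Hash32, Root, WithdrawalCredentials are all 32 bytes).
func (h *Hasher) PutBytes(b []byte) {
	if len(b) == 32 {
		h.buf = append(h.buf, b...)
		return
	}
	h.AppendBytes32(b)
}

func main() {
	h := &Hasher{}
	h.PutBytes(make([]byte, 32)) // fast path
	h.PutBytes(make([]byte, 20)) // padded to 32
	fmt.Println(len(h.buf))      // 64
}
```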

Hasher buffer pre-allocation (hasher/hasher.go)
Pre-allocate 4MB buffer for new hashers, sized for 100K validators × 32 bytes = 3.2MB peak. Eliminates buffer regrowth allocations during HTR. Codegen HTR achieves 0 allocs/op consistently.

Inline BufferDecoder limits stack (sszutils/decoder_buffer.go)
Embed [16]int array in BufferDecoder, point the limits slice at it. Eliminates a separate make([]int, 0, 16) allocation per unmarshal.
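The inline-array trick looks roughly like this (field names are illustrative, not the exact ones in decoder_buffer.go):

```go
package main

import "fmt"

// BufferDecoder keeps its limits slice backed by an embedded [16]int, so
// constructing a decoder performs no separate slice allocation for the
// common case of nesting depth <= 16.
type BufferDecoder struct {
	limitsArr [16]int
	limits    []int
}

func NewBufferDecoder() *BufferDecoder {
	d := &BufferDecoder{}
	d.limits = d.limitsArr[:0] // point the slice at the embedded array
	return d
}

func (d *BufferDecoder) pushLimit(n int) {
	d.limits = append(d.limits, n) // stays in limitsArr while len <= 16
}

func main() {
	d := NewBufferDecoder()
	d.pushLimit(42)
	fmt.Println(d.limits[0] == d.limitsArr[0]) // true
}
```

One caveat of this pattern: the struct must not be copied by value after construction, since the copy's slice would still point at the original's array. Returning a pointer from the constructor, as above, avoids that. The StreamEncoder scratch buffer below uses the same trick with a `[32]byte`.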

Inline StreamEncoder scratch buffer (sszutils/encoder_stream.go)
Embed [32]byte array in StreamEncoder, point the scratch slice at it. Eliminates make([]byte, 0, 32) per MarshalWriter.

ExpandSlice capacity reuse (sszutils/unmarshal.go)
When cap(src) >= size, use src[:size] instead of make([]T, size). Helps repeated unmarshal on the same target.
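A minimal generic sketch of the reuse path (the exact signature in sszutils/unmarshal.go may differ):

```go
package main

import "fmt"

// expandSlice returns a slice of length size, reusing src's backing array
// when its capacity allows instead of always allocating.
func expandSlice[T any](src []T, size int) []T {
	if cap(src) >= size {
		// Reslice instead of allocating. Elements beyond the old length may
		// hold stale values, which is fine here: unmarshal overwrites them.
		return src[:size]
	}
	return make([]T, size)
}

func main() {
	buf := make([]uint64, 2, 8)
	out := expandSlice(buf, 5)
	fmt.Println(len(out), cap(out), &out[0] == &buf[0]) // 5 8 true
}
```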

Benchmark Results — StateMainnet

Throughput (ns/op, average of 3 runs)

| Operation | Before (ns/op) | After (ns/op) | Δ |
| --- | --- | --- | --- |
| HashTreeRoot | 59,817K | 50,585K | −15.4% |
| Unmarshal | 6,261K | 6,191K | −1.1% |
| Marshal | 3,996K | 3,997K | +0.0% |
| MarshalWriter | 4,339K | 4,901K | +13.0% |

Allocations (per op)

| Operation | Before allocs | After allocs | Before B/op | After B/op |
| --- | --- | --- | --- | --- |
| HashTreeRoot | 0–2 (unstable) | 0 (stable) | 0–840K | ~175K (amortized) |
| MarshalWriter | 6 | 3 | 2,480 | 2,336 |
| UnmarshalReader | 1,522 | 1,520 | 164,698 | 164,186 |

Benchmark Results — BlockMainnet

| Operation | Before (ns/op) | After (ns/op) | Δ |
| --- | --- | --- | --- |
| HashTreeRoot | 516,924 | 453,560 | −12.3% |

| Operation | Before allocs | After allocs |
| --- | --- | --- |
| HashTreeRoot | 0 | 0 |

Diff

4 files changed, 120 insertions(+), 45 deletions(-)

hasher/hasher.go, sszutils/decoder_buffer.go, sszutils/encoder_stream.go, sszutils/unmarshal.go

No public API changes. All existing tests pass.

pk910-agent and others added 10 commits March 16, 2026 17:58
… copy

Write directly to the hash buffer instead of going through the tmp
buffer and AppendBytes32. This reduces from 2 appends to 1 append +
direct write for each Put operation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add early returns in Merkleize for the common cases of single-chunk
(32 bytes) and two-chunk (64 bytes) inputs, avoiding the full
merkleizeImpl call with its capacity pre-check and loop overhead.

Single-chunk: just return (data is already in place)
Two-chunk: single hash call directly, skip merkleizeImpl entirely

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Skip the AppendBytes32 call and its modulo check for the common case
of exactly 32-byte inputs (Hash32, Root, WithdrawalCredentials).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace 3-step append (MarshalUint64 + output + zeroBytes[:24]) with a
single append of 32 zero bytes + direct binary.LittleEndian.PutUint64
write. Reduces from 3 appends to 1 append + 1 direct write.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add early return for 3-4 chunk (96-128 bytes) inputs in Merkleize,
avoiding the full merkleizeImpl call. Uses two direct hash operations.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add direct 3-hash-operation path for exactly 256 bytes (8 chunks).
Common for containers with 8 fields like Validator.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Inline BufferDecoder limits stack as [16]int array to avoid separate
  slice allocation (saves 128 bytes per unmarshal)
- ExpandSlice: reuse existing capacity instead of always allocating new

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Embed a [32]byte array in StreamEncoder and use it as the backing
for the scratch slice, avoiding a separate make([]byte, 0, 32) heap
allocation per MarshalWriter call.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Set defaultHasherBufSize to 4MB for new hashers. When the hasher is
pooled and reused, the capacity is retained. Eliminates buffer growth
allocations during HTR.

HTR: consistently 0 allocs/op for codegen path

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

codecov bot commented Mar 16, 2026

Codecov Report

❌ Patch coverage is 91.66667% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 92.44%. Comparing base (168669d) to head (9044fb7).
⚠️ Report is 4 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #131      +/-   ##
==========================================
- Coverage   92.59%   92.44%   -0.15%     
==========================================
  Files          44       44              
  Lines        8826     8883      +57     
==========================================
+ Hits         8172     8212      +40     
- Misses        397      408      +11     
- Partials      257      263       +6     
| Components | Coverage Δ |
| --- | --- |
| dynssz | 98.19% <91.66%> (-0.40%) ⬇️ |
| dynsszgen | 87.54% <ø> (ø) |
