
reflection improvements [AI slop] #132

Draft

pk910 wants to merge 19 commits into master from perf/reflection-improvements

Conversation


@pk910 pk910 commented Mar 16, 2026

Let Claude iterate on reflection improvements for several hours.
Not intended for merge; more for cherry-picking ideas.


perf: improve reflection path performance and reduce allocations

This PR optimizes the reflection-based SSZ paths across all operations: marshal, unmarshal, hash tree root, and their streaming variants. Changes target three areas: bulk data handling for uint64 slices, hasher/merkleization internals, and heap allocation reduction.

Changes

Bulk uint64 fast paths (reflection/marshal.go, reflection/unmarshal.go, reflection/treeroot.go)
For []uint64 lists and vectors (Balances, InactivityScores, Slashings), bypass per-element reflection dispatch. Marshal/HTR use unsafe.Pointer + unsafe.Slice to access the slice data directly and call EncodeUint64Slice / HashUint64Slice. Unmarshal uses reflect.MakeSlice with the correct target type (supporting defined types like type Gwei uint64) then decodes via unsafe view.

Hasher PutX optimization (hasher/hasher.go)
Rewrite PutUint64/PutUint32/PutUint16/PutUint8/PutBool to append 32 zero bytes then write the value directly, instead of encoding into a tmp buffer and calling AppendBytes32.
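A minimal sketch of the rewritten Put pattern (standalone function here; the real code is a method on the hasher writing into its internal buffer): append one 32-byte zero chunk, then write the value directly into its head, instead of encoding into a tmp buffer and padding via AppendBytes32.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// putUint64 appends a full zero-padded 32-byte chunk in one append, then
// writes the little-endian value into the first 8 bytes of that chunk.
func putUint64(buf []byte, v uint64) []byte {
	var zero [32]byte
	buf = append(buf, zero[:]...)
	binary.LittleEndian.PutUint64(buf[len(buf)-32:], v)
	return buf
}

func main() {
	b := putUint64(nil, 0x0102)
	fmt.Println(len(b), b[0], b[1], b[2]) // one 32-byte chunk, LE: 02 01 00 ...
}
```

PutUint32/PutUint16/PutUint8/PutBool follow the same shape with the corresponding write width.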

Merkleize fast paths (hasher/hasher.go)
Add early returns in Merkleize() for the most common input sizes, avoiding the full merkleizeImpl loop:

  • 1 chunk (32 bytes): return immediately — data already in place
  • 2 chunks (64 bytes): single hash call — handles Checkpoint, BLSPubKey
  • 3–4 chunks (96–128 bytes): two hash calls — handles Fork and similar
  • 8 chunks (256 bytes): three hash calls — handles Validator (100K per state)
  • 16 chunks (512 bytes): four hash calls

MerkleizeWithMixin optimization (hasher/hasher.go)
Replace 3-step mixin size encoding (MarshalUint64 → append → pad) with a single 32-byte zero append + direct PutUint64 write.

PutBytes / AppendBytes32 fast paths (hasher/hasher.go)
Skip the modulo-32 padding check for exact 32-byte inputs (Hash32, Root, WithdrawalCredentials).

Unsafe byte access for arrays (reflection/marshal.go, reflection/treeroot.go)
Use unsafe.Slice((*byte)(unsafe.Pointer(sourceValue.UnsafeAddr())), len) for addressable [N]byte arrays, avoiding reflect.Value.Bytes(), which takes a slow path for array types.

Fast size path for primitives (reflection/sszsize.go)
Return pre-computed TypeDescriptor.Size directly for primitive types and byte-array vectors in getSszValueSize, skipping the full switch dispatch.

Allocation reductions (reflection/common.go, sszutils/decoder_buffer.go, sszutils/encoder_stream.go, sszutils/unmarshal.go, reflection/unmarshal.go)

  • Return ReflectionCtx by value to keep it on the stack (−1 alloc per op)
  • Inline BufferDecoder limits as [16]int embedded array (−1 alloc, −128B per unmarshal)
  • Inline StreamEncoder scratch as [32]byte embedded array (−1 alloc per MarshalWriter)
  • Reuse existing slice backing arrays in unmarshal when capacity is sufficient
  • ExpandSlice: reuse existing capacity instead of always allocating
  • Eliminate Interface() boxing in bulk uint64 paths via unsafe.Pointer (−3 allocs on marshal)
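The embedded-array items above all use the same trick: keep small, bounded scratch space inline in the struct so it lives on the stack or inside the parent allocation. A sketch with illustrative names (the real BufferDecoder fields differ):

```go
package main

import "fmt"

// BufferDecoder keeps its limits stack inline as a fixed [16]int array
// instead of a separately allocated slice: one fewer heap allocation and
// 128 fewer bytes per unmarshal.
type BufferDecoder struct {
	limits    [16]int // inline backing storage, no make([]int, ...) needed
	limitsLen int
}

func (d *BufferDecoder) pushLimit(n int) {
	d.limits[d.limitsLen] = n
	d.limitsLen++
}

func (d *BufferDecoder) popLimit() int {
	d.limitsLen--
	return d.limits[d.limitsLen]
}

func main() {
	var d BufferDecoder // entire decoder, scratch included, in one allocation
	d.pushLimit(100)
	d.pushLimit(50)
	fmt.Println(d.popLimit(), d.popLimit())
}
```

The StreamEncoder change is the same idea with a [32]byte scratch array backing its slice.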

Hasher buffer pre-allocation (hasher/hasher.go)
Pre-allocate 4MB buffer for new hashers, sized for 100K validators × 32 bytes. Eliminates buffer regrowth allocations during HTR. Stabilizes alloc count from variable 0–5 to consistent 2.
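The pooled-hasher interaction can be sketched like this (illustrative names; the library's pooling details may differ). A 4MB capacity covers 100K validators × 32 bytes = 3.2MB of leaf chunks, so the buffer never regrows mid-HTR, and sync.Pool retains the capacity across reuses:

```go
package main

import (
	"fmt"
	"sync"
)

const defaultHasherBufSize = 4 << 20 // 4MB, sized for ~100K validators × 32 bytes

type hasher struct {
	buf []byte // chunk buffer; capacity is retained across pooled reuses
}

var hasherPool = sync.Pool{
	New: func() any {
		// Pre-size once at construction so HTR never triggers regrowth.
		return &hasher{buf: make([]byte, 0, defaultHasherBufSize)}
	},
}

func main() {
	h := hasherPool.Get().(*hasher)
	fmt.Println(cap(h.buf))
	h.buf = h.buf[:0] // reset length, keep capacity
	hasherPool.Put(h)
}
```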

Benchmark Results — StateMainnet

Throughput (ns/op, average of 3 runs)

| Operation | Before | After | Δ |
| --- | --- | --- | --- |
| Unmarshal | 33,956K | 28,308K | −16.6% |
| UnmarshalReader | 37,709K | 31,920K | −15.3% |
| Marshal | 21,120K | 19,652K | −6.9% |
| MarshalWriter | 21,428K | 18,400K | −14.1% |
| HashTreeRoot | 75,723K | 65,800K | −13.1% |

Allocations (per op)

| Operation | Before allocs | After allocs | Before B/op | After B/op |
| --- | --- | --- | --- | --- |
| Marshal | 2 | 2 | 131,104 | 131,104 |
| MarshalWriter | 3 | 2 | 2,176 | 2,176 |
| HashTreeRoot | 0–3 (unstable) | 2 (stable) | 0–1.3M | ~262K |
| Unmarshal | 102,090 | 102,091 | 18.3M | 18.4M |

Benchmark Results — BlockMainnet

| Operation | Before (ns/op) | After (ns/op) | Δ |
| --- | --- | --- | --- |
| HashTreeRoot | 561,691 | 518,923 | −7.6% |
| MarshalWriter | 104,943 | 102,322 | −2.5% |
| Unmarshal | 183,514 | 178,296 | −2.8% |

Diff

10 files changed, 282 insertions(+), 91 deletions(-)

No public API changes. All existing tests pass.

pk910-agent and others added 18 commits March 16, 2026 17:43
For lists and vectors of uint64 elements (like Balances, InactivityScores,
Slashings), use bulk memory operations instead of per-element reflection
dispatch. This avoids the overhead of reflect.Value.Index() + marshalType/
unmarshalType/buildRootFromType calls for each element.

Benchmarks show ~12-16% improvement on StateMainnet marshal/unmarshal.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… copy

Write directly to the hash buffer instead of going through the tmp
buffer and AppendBytes32. This reduces from 2 appends to 1 append +
direct write for each Put operation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Use the existing sszutils.HashUint64Slice() function for bulk memory
copy when hashing uint64 lists and vectors, instead of a per-element
AppendUint64 loop.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
For addressable fixed-size byte arrays (like [48]byte BLSPubKey, [32]byte
Hash32), use unsafe.Slice to get bytes directly instead of going through
reflect.Value.Bytes() which takes a slow path for arrays.

This avoids the reflect bytesSlow path overhead (~4% of marshal time).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Return the pre-computed Size directly for primitive types (bool, uintN,
intN, floatN) and byte-array vectors without entering the full switch
dispatch. This avoids unnecessary function call overhead when sizing
basic types recursively.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add early returns in Merkleize for the common cases of single-chunk
(32 bytes) and two-chunk (64 bytes) inputs, avoiding the full
merkleizeImpl call with its capacity pre-check and loop overhead.

Single-chunk: just return (data is already in place)
Two-chunk: single hash call directly, skip merkleizeImpl entirely

Benchmarks show ~9% improvement on StateMainnet HashTreeRoot.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Skip the AppendBytes32 call and its modulo check for the common case
of exactly 32-byte inputs (Hash32, Root, WithdrawalCredentials).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace 3-step append (MarshalUint64 + output + zeroBytes[:24]) with a
single append of 32 zero bytes + direct binary.LittleEndian.PutUint64
write. Reduces from 3 appends to 1 append + 1 direct write.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add early return for 3-4 chunk (96-128 bytes) inputs in Merkleize,
avoiding the full merkleizeImpl call. Uses two direct hash operations
instead of the loop-based approach.

This helps containers with 3-4 fields (like Fork, Checkpoint).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add direct 3-hash-operation path for exactly 256 bytes (8 chunks),
which is the common case for containers with 8 fields like Validator.
This avoids the merkleizeImpl function call overhead and its loop.

Only safe for exact power-of-2 chunk counts where no zero-hash
padding is needed at intermediate levels.

Benchmarks show ~4.5% additional improvement on StateMainnet HTR.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add direct 4-hash-operation path for exactly 512 bytes (16 chunks).
Useful for containers with 16 fields.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Use sszutils.HashUint64Slice() for bulk memory copy when hashing
uint64 vectors (like Slashings), instead of per-element AppendUint64.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Return ReflectionCtx by value from NewReflectionCtx to keep it on
  the stack (eliminates 1 heap alloc per operation)
- Inline BufferDecoder limits stack as [16]int array to avoid separate
  slice allocation (saves 128 bytes per unmarshal)
- Reuse existing slice backing arrays in unmarshalList/unmarshalVector/
  unmarshalDynamicList/unmarshalDynamicVector when capacity is sufficient
  (avoids reflect.MakeSlice for repeated unmarshal on same target)
- ExpandSlice: reuse existing capacity instead of always allocating new

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace sourceValue.Interface().([]uint64) with unsafe.Pointer +
unsafe.Slice to access uint64 slice data directly, avoiding the
heap allocation from reflect.Value.Interface() boxing.

Marshal:      5 → 2 allocs/op (−3), −72 B/op
MarshalWriter: 6 → 3 allocs/op (−3), −70 B/op
HTR:          5 → 4 allocs/op (−1)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Embed a [32]byte array in StreamEncoder and use it as the backing
for the scratch slice, avoiding a separate make([]byte, 0, 32) heap
allocation per MarshalWriter call.

MarshalWriter: 3 → 2 allocs/op, 2178 → 2176 B/op

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Set defaultHasherBufSize to 4MB for new hashers. This is enough for
most BeaconState HTR operations (100K validators × 32 bytes = 3.2MB)
without triggering buffer regrowth. When the hasher is pooled and
reused, the capacity is retained. When GC evicts it, the new hasher
starts with sufficient capacity.

HTR: 4 allocs/op → 2 allocs/op (stable), ~1MB → ~233KB B/op

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The bulk uint64 unmarshal path was creating make([]uint64, n) and
setting it via reflect.ValueOf, which panics when the target slice
has a defined element type (e.g. type Gwei uint64 vs Gwei = uint64).

Fix by using reflect.MakeSlice with the correct target type, then
using unsafe to get a []uint64 view for the bulk decode.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add DefinedUint64 (type DefinedUint64 uint64, not alias) test cases
for both list and vector operations. These catch bugs where bulk
uint64 paths create []uint64 instead of the correct defined type,
which causes reflect.Value.Set to panic.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@codecov

codecov bot commented Mar 17, 2026

Codecov Report

❌ Patch coverage is 86.62791% with 23 lines in your changes missing coverage. Please review.
✅ Project coverage is 91.94%. Comparing base (168669d) to head (693415f).
⚠️ Report is 4 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #132      +/-   ##
==========================================
- Coverage   92.59%   91.94%   -0.65%     
==========================================
  Files          44       44              
  Lines        8826     8956     +130     
==========================================
+ Hits         8172     8235      +63     
- Misses        397      444      +47     
- Partials      257      277      +20     
Components Coverage Δ
dynssz 97.02% <86.62%> (-1.57%) ⬇️
dynsszgen 87.54% <ø> (ø)

reflect.Value.Pointer() panics on array values — it only works on
slices, pointers, maps, channels, and funcs. Add sourceType.Kind ==
reflect.Slice guard to all bulk uint64 fast paths that use Pointer()
in marshal, unmarshal, and HTR.

Fixes panic in buildRootFromVector for array-typed uint64 vectors.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
