
reflection improvements [AI slop] #132

Draft

pk910 wants to merge 19 commits into master from perf/reflection-improvements

Conversation


@pk910 pk910 commented Mar 16, 2026

Let Claude iterate on reflection improvements for several hours.
Not intended for merge; more for cherry-picking ideas.


perf: improve reflection path performance and reduce allocations

This PR optimizes the reflection-based SSZ paths across all operations: marshal, unmarshal, hash tree root, and their streaming variants. Changes target three areas: bulk data handling for uint64 slices, hasher/merkleization internals, and heap allocation reduction.

Changes

Bulk uint64 fast paths (reflection/marshal.go, reflection/unmarshal.go, reflection/treeroot.go)
For []uint64 lists and vectors (Balances, InactivityScores, Slashings), bypass per-element reflection dispatch. Marshal/HTR use unsafe.Pointer + unsafe.Slice to access the slice data directly and call EncodeUint64Slice / HashUint64Slice. Unmarshal uses reflect.MakeSlice with the correct target type (supporting defined types like type Gwei uint64) then decodes via unsafe view.

Hasher PutX optimization (hasher/hasher.go)
Rewrite PutUint64/PutUint32/PutUint16/PutUint8/PutBool to append 32 zero bytes then write the value directly, instead of encoding into a tmp buffer and calling AppendBytes32.
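A minimal sketch of the rewritten Put pattern (standalone function here; the real code is a method on the hasher writing into its internal buffer): append one 32-byte zero chunk, then write the value directly into its head, instead of encoding into a tmp buffer and padding via AppendBytes32.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// putUint64 appends a full zero-padded 32-byte chunk in one append, then
// writes the little-endian value into the first 8 bytes of that chunk.
func putUint64(buf []byte, v uint64) []byte {
	var zero [32]byte
	buf = append(buf, zero[:]...)
	binary.LittleEndian.PutUint64(buf[len(buf)-32:], v)
	return buf
}

func main() {
	b := putUint64(nil, 0x0102)
	fmt.Println(len(b), b[0], b[1], b[2]) // one 32-byte chunk, LE: 02 01 00 ...
}
```

PutUint32/PutUint16/PutUint8/PutBool follow the same shape with the corresponding write width.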

Merkleize fast paths (hasher/hasher.go)
Add early returns in Merkleize() for the most common input sizes, avoiding the full merkleizeImpl loop:

  • 1 chunk (32 bytes): return immediately — data already in place
  • 2 chunks (64 bytes): single hash call — handles Checkpoint, BLSPubKey
  • 3–4 chunks (96–128 bytes): two hash calls — handles Fork and similar
  • 8 chunks (256 bytes): three hash calls — handles Validator (100K per state)
  • 16 chunks (512 bytes): four hash calls

MerkleizeWithMixin optimization (hasher/hasher.go)
Replace 3-step mixin size encoding (MarshalUint64 → append → pad) with a single 32-byte zero append + direct PutUint64 write.

PutBytes / AppendBytes32 fast paths (hasher/hasher.go)
Skip the modulo-32 padding check for exact 32-byte inputs (Hash32, Root, WithdrawalCredentials).

Unsafe byte access for arrays (reflection/marshal.go, reflection/treeroot.go)
Use unsafe.Slice((*byte)(unsafe.Pointer(sourceValue.UnsafeAddr())), len) for addressable [N]byte arrays, avoiding reflect.Value.Bytes(), which takes a slow path for array types.

Fast size path for primitives (reflection/sszsize.go)
Return pre-computed TypeDescriptor.Size directly for primitive types and byte-array vectors in getSszValueSize, skipping the full switch dispatch.

Allocation reductions (reflection/common.go, sszutils/decoder_buffer.go, sszutils/encoder_stream.go, sszutils/unmarshal.go, reflection/unmarshal.go)

  • Return ReflectionCtx by value to keep it on the stack (−1 alloc per op)
  • Inline BufferDecoder limits as [16]int embedded array (−1 alloc, −128B per unmarshal)
  • Inline StreamEncoder scratch as [32]byte embedded array (−1 alloc per MarshalWriter)
  • Reuse existing slice backing arrays in unmarshal when capacity is sufficient
  • ExpandSlice: reuse existing capacity instead of always allocating
  • Eliminate Interface() boxing in bulk uint64 paths via unsafe.Pointer (−3 allocs on marshal)
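The embedded-array items above all use the same trick: keep small, bounded scratch space inline in the struct so it lives on the stack or inside the parent allocation. A sketch with illustrative names (the real BufferDecoder fields differ):

```go
package main

import "fmt"

// BufferDecoder keeps its limits stack inline as a fixed [16]int array
// instead of a separately allocated slice: one fewer heap allocation and
// 128 fewer bytes per unmarshal.
type BufferDecoder struct {
	limits    [16]int // inline backing storage, no make([]int, ...) needed
	limitsLen int
}

func (d *BufferDecoder) pushLimit(n int) {
	d.limits[d.limitsLen] = n
	d.limitsLen++
}

func (d *BufferDecoder) popLimit() int {
	d.limitsLen--
	return d.limits[d.limitsLen]
}

func main() {
	var d BufferDecoder // entire decoder, scratch included, in one allocation
	d.pushLimit(100)
	d.pushLimit(50)
	fmt.Println(d.popLimit(), d.popLimit())
}
```

The StreamEncoder change is the same idea with a [32]byte scratch array backing its slice.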

Hasher buffer pre-allocation (hasher/hasher.go)
Pre-allocate 4MB buffer for new hashers, sized for 100K validators × 32 bytes. Eliminates buffer regrowth allocations during HTR. Stabilizes alloc count from variable 0–5 to consistent 2.
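The pooled-hasher interaction can be sketched like this (illustrative names; the library's pooling details may differ). A 4MB capacity covers 100K validators × 32 bytes = 3.2MB of leaf chunks, so the buffer never regrows mid-HTR, and sync.Pool retains the capacity across reuses:

```go
package main

import (
	"fmt"
	"sync"
)

const defaultHasherBufSize = 4 << 20 // 4MB, sized for ~100K validators × 32 bytes

type hasher struct {
	buf []byte // chunk buffer; capacity is retained across pooled reuses
}

var hasherPool = sync.Pool{
	New: func() any {
		// Pre-size once at construction so HTR never triggers regrowth.
		return &hasher{buf: make([]byte, 0, defaultHasherBufSize)}
	},
}

func main() {
	h := hasherPool.Get().(*hasher)
	fmt.Println(cap(h.buf))
	h.buf = h.buf[:0] // reset length, keep capacity
	hasherPool.Put(h)
}
```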

Benchmark Results — StateMainnet

Throughput (ns/op, average of 3 runs)

| Operation | Before | After | Δ |
| --- | --- | --- | --- |
| Unmarshal | 33,956K | 28,308K | −16.6% |
| UnmarshalReader | 37,709K | 31,920K | −15.3% |
| Marshal | 21,120K | 19,652K | −6.9% |
| MarshalWriter | 21,428K | 18,400K | −14.1% |
| HashTreeRoot | 75,723K | 65,800K | −13.1% |

Allocations (per op)

| Operation | Before allocs | After allocs | Before B/op | After B/op |
| --- | --- | --- | --- | --- |
| Marshal | 2 | 2 | 131,104 | 131,104 |
| MarshalWriter | 3 | 2 | 2,176 | 2,176 |
| HashTreeRoot | 0–3 (unstable) | 2 (stable) | 0–1.3M | ~262K |
| Unmarshal | 102,090 | 102,091 | 18.3M | 18.4M |

Benchmark Results — BlockMainnet

| Operation | Before (ns/op) | After (ns/op) | Δ |
| --- | --- | --- | --- |
| HashTreeRoot | 561,691 | 518,923 | −7.6% |
| MarshalWriter | 104,943 | 102,322 | −2.5% |
| Unmarshal | 183,514 | 178,296 | −2.8% |

Diff

10 files changed, 282 insertions(+), 91 deletions(-)

No public API changes. All existing tests pass.

pk910-agent and others added 18 commits March 16, 2026 17:43
For lists and vectors of uint64 elements (like Balances, InactivityScores,
Slashings), use bulk memory operations instead of per-element reflection
dispatch. This avoids the overhead of reflect.Value.Index() + marshalType/
unmarshalType/buildRootFromType calls for each element.

Benchmarks show ~12-16% improvement on StateMainnet marshal/unmarshal.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… copy

Write directly to the hash buffer instead of going through the tmp
buffer and AppendBytes32. This reduces from 2 appends to 1 append +
direct write for each Put operation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Use the existing sszutils.HashUint64Slice() function for bulk memory
copy when hashing uint64 lists and vectors, instead of a per-element
AppendUint64 loop.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
For addressable fixed-size byte arrays (like [48]byte BLSPubKey, [32]byte
Hash32), use unsafe.Slice to get bytes directly instead of going through
reflect.Value.Bytes() which takes a slow path for arrays.

This avoids the reflect bytesSlow path overhead (~4% of marshal time).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Return the pre-computed Size directly for primitive types (bool, uintN,
intN, floatN) and byte-array vectors without entering the full switch
dispatch. This avoids unnecessary function call overhead when sizing
basic types recursively.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add early returns in Merkleize for the common cases of single-chunk
(32 bytes) and two-chunk (64 bytes) inputs, avoiding the full
merkleizeImpl call with its capacity pre-check and loop overhead.

Single-chunk: just return (data is already in place)
Two-chunk: single hash call directly, skip merkleizeImpl entirely

Benchmarks show ~9% improvement on StateMainnet HashTreeRoot.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Skip the AppendBytes32 call and its modulo check for the common case
of exactly 32-byte inputs (Hash32, Root, WithdrawalCredentials).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace 3-step append (MarshalUint64 + output + zeroBytes[:24]) with a
single append of 32 zero bytes + direct binary.LittleEndian.PutUint64
write. Reduces from 3 appends to 1 append + 1 direct write.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add early return for 3-4 chunk (96-128 bytes) inputs in Merkleize,
avoiding the full merkleizeImpl call. Uses two direct hash operations
instead of the loop-based approach.

This helps containers with 3-4 fields (like Fork, Checkpoint).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add direct 3-hash-operation path for exactly 256 bytes (8 chunks),
which is the common case for containers with 8 fields like Validator.
This avoids the merkleizeImpl function call overhead and its loop.

Only safe for exact power-of-2 chunk counts where no zero-hash
padding is needed at intermediate levels.

Benchmarks show ~4.5% additional improvement on StateMainnet HTR.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add direct 4-hash-operation path for exactly 512 bytes (16 chunks).
Useful for containers with 16 fields.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Use sszutils.HashUint64Slice() for bulk memory copy when hashing
uint64 vectors (like Slashings), instead of per-element AppendUint64.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Return ReflectionCtx by value from NewReflectionCtx to keep it on
  the stack (eliminates 1 heap alloc per operation)
- Inline BufferDecoder limits stack as [16]int array to avoid separate
  slice allocation (saves 128 bytes per unmarshal)
- Reuse existing slice backing arrays in unmarshalList/unmarshalVector/
  unmarshalDynamicList/unmarshalDynamicVector when capacity is sufficient
  (avoids reflect.MakeSlice for repeated unmarshal on same target)
- ExpandSlice: reuse existing capacity instead of always allocating new

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace sourceValue.Interface().([]uint64) with unsafe.Pointer +
unsafe.Slice to access uint64 slice data directly, avoiding the
heap allocation from reflect.Value.Interface() boxing.

Marshal:      5 → 2 allocs/op (−3), −72 B/op
MarshalWriter: 6 → 3 allocs/op (−3), −70 B/op
HTR:          5 → 4 allocs/op (−1)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Embed a [32]byte array in StreamEncoder and use it as the backing
for the scratch slice, avoiding a separate make([]byte, 0, 32) heap
allocation per MarshalWriter call.

MarshalWriter: 3 → 2 allocs/op, 2178 → 2176 B/op

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Set defaultHasherBufSize to 4MB for new hashers. This is enough for
most BeaconState HTR operations (100K validators × 32 bytes = 3.2MB)
without triggering buffer regrowth. When the hasher is pooled and
reused, the capacity is retained. When GC evicts it, the new hasher
starts with sufficient capacity.

HTR: 4 allocs/op → 2 allocs/op (stable), ~1MB → ~233KB B/op

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The bulk uint64 unmarshal path was creating make([]uint64, n) and
setting it via reflect.ValueOf, which panics when the target slice
has a defined element type (e.g. type Gwei uint64 vs Gwei = uint64).

Fix by using reflect.MakeSlice with the correct target type, then
using unsafe to get a []uint64 view for the bulk decode.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add DefinedUint64 (type DefinedUint64 uint64, not alias) test cases
for both list and vector operations. These catch bugs where bulk
uint64 paths create []uint64 instead of the correct defined type,
which causes reflect.Value.Set to panic.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@codecov

codecov bot commented Mar 17, 2026

Codecov Report

❌ Patch coverage is 86.62791% with 23 lines in your changes missing coverage. Please review.
✅ Project coverage is 91.94%. Comparing base (168669d) to head (693415f).
⚠️ Report is 4 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #132      +/-   ##
==========================================
- Coverage   92.59%   91.94%   -0.65%     
==========================================
  Files          44       44              
  Lines        8826     8956     +130     
==========================================
+ Hits         8172     8235      +63     
- Misses        397      444      +47     
- Partials      257      277      +20     
Components Coverage Δ
dynssz 97.02% <86.62%> (-1.57%) ⬇️
dynsszgen 87.54% <ø> (ø)

reflect.Value.Pointer() panics on array values — it only works on
slices, pointers, maps, channels, and funcs. Add sourceType.Kind ==
reflect.Slice guard to all bulk uint64 fast paths that use Pointer()
in marshal, unmarshal, and HTR.

Fixes panic in buildRootFromVector for array-typed uint64 vectors.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
