fix: prevent seed state cache poisoning in loadState clone #9246
lodekeeper wants to merge 3 commits into ChainSafe:unstable
The default `clone()` in `@chainsafe/ssz` transfers the source sub-view's
cache to the new instance, which means `migrated.validators` and the seed
container's cached child-view snapshot share the SAME internal `nodes[]`
and `caches[]` arrays. A subsequent `migratedState.commit()` then writes
modified validator / inactivity-score nodes into those shared arrays,
silently corrupting the seed state's cache snapshot at the modified
indices.
The corruption stays latent until the seed is later cloned with
transfer-cache enabled - the path `verifyBlock` takes via
`preState.clone({dontTransferCache: false})`. On that next-block clone,
reads at the modified index return the migrated state's validator instead
of the seed's, which surfaces in production as
"Withdrawal mismatch at index=0" divergences between Lodestar and EL.
Use `clone(true)` so the migrated sub-view starts with a fresh empty
cache and its commit cannot reach into the seed's cache arrays. A
regression test exercises the `loadState -> seedState.clone() ->
validators.getReadonly(modifiedIndex)` sequence.
Root cause was introduced by ChainSafe#8857 which added the `loadOtherState` /
shared-head seed path that exercises this clone in production.
🤖 Generated with AI assistance
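The aliasing described in this commit message can be reproduced with a small toy model. This is only a sketch under stated assumptions: `ToyListView` and all of its members are hypothetical and are not the `@chainsafe/ssz` implementation. It only mimics the two behaviours the message relies on: a default `clone()` hands the same cache array to the new view, and `commit()` writes modified entries into whatever cache array the view holds.

```typescript
// Toy model only: hypothetical names, NOT the real @chainsafe/ssz classes.
class ToyListView {
  private pending = new Map<number, string>();

  constructor(
    private root: readonly string[], // committed data, never mutated in place
    private nodes: string[] = [] // read cache, possibly shared with another view
  ) {}

  getReadonly(index: number): string {
    const cached = this.nodes[index];
    if (cached !== undefined) return cached; // cache hit wins over committed data
    const value = this.root[index];
    this.nodes[index] = value;
    return value;
  }

  set(index: number, value: string): void {
    this.pending.set(index, value);
  }

  commit(): void {
    const next = this.root.slice(); // copy-on-write for the committed data
    this.pending.forEach((value, index) => {
      next[index] = value;
      this.nodes[index] = value; // written into the cache array, shared or not
    });
    this.root = next;
    this.pending.clear();
  }

  clone(dontTransferCache?: boolean): ToyListView {
    // Default path reuses the same cache array; `true` starts from an empty one.
    return new ToyListView(this.root, dontTransferCache ? [] : this.nodes);
  }
}

// Default clone: the migrated view's commit leaks into the seed's cache.
const seed = new ToyListView(["seed-0", "seed-1"]);
const migrated = seed.clone();
migrated.set(0, "migrated-0");
migrated.commit();
const seedReadAfterDefaultClone = seed.clone().getReadonly(0);

// clone(true): the migrated view has its own cache, so the seed is untouched.
const seed2 = new ToyListView(["seed-0", "seed-1"]);
const isolated = seed2.clone(true);
isolated.set(0, "migrated-0");
isolated.commit();
const seedReadAfterFreshClone = seed2.clone().getReadonly(0);
```

In this toy, `seedReadAfterDefaultClone` is `"migrated-0"` and `seedReadAfterFreshClone` is `"seed-0"`, which has the same shape as the `loadState -> seedState.clone() -> validators.getReadonly(modifiedIndex)` sequence the regression test exercises.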
Code Review
This pull request addresses a cache aliasing bug in the loadState utility where cloning subviews with default settings could lead to corruption of the seed state's cache during a commit. The fix involves using clone(true) for inactivityScores and validators to ensure a fresh cache. A regression test has been added to prevent future occurrences. Feedback suggests using the more explicit object syntax {dontTransferCache: true} in the clone calls to improve code readability and maintain consistency with the inline documentation.
// silently corrupting the seed container's cache snapshot. That corruption only surfaces
// on the next `seedState.clone({dontTransferCache: false})` read, producing a
// "Withdrawal mismatch at index=0"-style divergence downstream.
migratedState.inactivityScores = seedState.inactivityScores.clone(true);
For better readability and to avoid "magic boolean" parameters, it is recommended to use the object syntax for the clone options. This also ensures consistency with the documentation provided in the comments above (line 118) and the PR description, which both reference the {dontTransferCache: ...} syntax.
Suggested change:
- migratedState.inactivityScores = seedState.inactivityScores.clone(true);
+ migratedState.inactivityScores = seedState.inactivityScores.clone({dontTransferCache: true});
Thanks — but this suggestion doesn't match the actual SSZ API. TreeViewDU.clone() is typed as clone(dontTransferCache?: boolean) (see @chainsafe/ssz packages/ssz/src/viewDU/abstract.ts:111 and partialListComposite.ts:156), so {dontTransferCache: true} wouldn't type-check.
The inconsistency you flagged is real, but the right fix is to update my comments to match the boolean API rather than the code to match the (incorrect) comments. Pushed dde4635 correcting the three places that referenced the object-syntax form.
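For illustration, here is a minimal sketch of why the object form is a problem for a parameter typed `dontTransferCache?: boolean`. `cloneToy` is a hypothetical stand-in that only copies the parameter shape quoted above; it is not the ssz method. Beyond failing the type check, an options object would be read as truthy at runtime if it ever slipped past the compiler, so even `{dontTransferCache: false}` would behave like `true`.

```typescript
// Hypothetical stand-in with the same parameter shape as the quoted signature.
function cloneToy(dontTransferCache?: boolean): "fresh-cache" | "shared-cache" {
  return dontTransferCache ? "fresh-cache" : "shared-cache";
}

const viaBoolean = cloneToy(true); // explicit flag: fresh cache
const viaDefault = cloneToy(); // no argument: cache is shared

// @ts-expect-error an options object is not assignable to `boolean | undefined`
const viaObject = cloneToy({dontTransferCache: false});
// If the type error is suppressed, the object is truthy, so the `false`
// inside it is ignored and the call behaves like cloneToy(true).
```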
// arrays, silently corrupting the seed container's cache snapshot. That corruption only
// surfaces on the next `seedState.clone({dontTransferCache: false})` read, producing a
// "Withdrawal mismatch at index=0"-style divergence downstream.
migratedState.validators = seedState.validators.clone(true);
For better readability and to avoid "magic boolean" parameters, it is recommended to use the object syntax for the clone options. This also ensures consistency with the documentation provided in the comments above (line 201) and the PR description, which both reference the {dontTransferCache: ...} syntax.
Suggested change:
- migratedState.validators = seedState.validators.clone(true);
+ migratedState.validators = seedState.validators.clone({dontTransferCache: true});
Same as above — TreeViewDU.clone() only accepts a boolean, so the object syntax wouldn't compile. I updated the inline comments here and on the inactivity-scores path in dde4635 so they no longer reference a {dontTransferCache: false} form that doesn't exist.
The inline comments referenced a `seedState.clone({dontTransferCache: false})`
object-syntax call that does not exist — the SSZ `TreeViewDU.clone()` signature
is `clone(dontTransferCache?: boolean)`. Clarify that the corruption surfaces
on the next default `seedState.clone()` read, and that `clone(true)` is the
dontTransferCache flag.
🤖 Generated with AI assistance
Summary
Fixes the v1.42.0 "Withdrawal mismatch at index=0" regression by changing the two `clone()` calls in `loadState.ts` to `clone(true)` (i.e. `dontTransferCache=true`).
The default `clone()` in `@chainsafe/ssz` transfers the source sub-view's cache to the new instance, which means the `migratedState.validators` sub-view and the seed container's cached child-view snapshot share the same internal `nodes[]` and `caches[]` arrays. A subsequent `migratedState.commit()` writes modified validator / inactivity-score nodes into those shared arrays, silently corrupting the seed state's cache snapshot at the modified indices.
The corruption stays latent until the seed state is later cloned with transfer-cache enabled - the path `verifyBlock` takes via `preState.clone({dontTransferCache: false})`. On that next-block clone, reads at the modified index return the migrated state's validator instead of the seed's, which surfaces as `Withdrawal mismatch at index=0` divergences between Lodestar and EL.
Timeline of the corruption
On the next block:
Root cause
Introduced by #8857 (`chore: consume BeaconStateView`) which added the `loadOtherState` / shared-head seed path that exercises this clone in production.
Test plan
- `loadState does not poison seed state's cache` in `packages/state-transition/test/unit/util/loadState.test.ts`
- `0xaa`-filled validator on `postState.clone()`) and PASSES with the fix
- `state-transition` util tests pass
- `check-types` and `lint` clean
Relation to #9245
PR #9245 (`fix: gate loadOtherState validators/balances preload behind opt-in`) addresses a different regression from the same #8857-era changes - the eager `getAllReadonlyValues()` preload causing memory spikes on the API path. The two fixes are independent and both needed: `persistentCheckpointsCache` also exercises `loadState()` with the problematic `clone()`, so #9245's preload gating alone doesn't cover the full surface.
🤖 Generated with AI assistance