RST Auto Synchronization#318

Draft
iamjoemccormick wants to merge 7 commits into main from iamjoe/feat/rst-auto-sync

Conversation

@iamjoemccormick
Member

What does this PR do / why do we need it?

Required for all PRs.

Related Issue(s)

Required when applicable.

Where should the reviewer(s) start reviewing this?

Only required for larger PRs when this may not be immediately obvious.

Are there any specific topics we should discuss before merging?

Not required.

What are the next steps after this PR?

Not required.

Checklist before merging:

Required for all PRs.

When creating a PR, these are items to keep in mind that cannot be checked by GitHub Actions:

  • Documentation:
    • Does developer documentation (code comments, readme, etc.) need to be added or updated?
    • Does the user documentation need to be expanded or updated for this change?
  • Testing:
    • Does this functionality require changing or adding new unit tests?
    • Does this functionality require changing or adding new integration tests?
  • Git Hygiene:

For more details refer to the Go coding standards and the pull request process.

While integrating Watch with Remote, a few updates were needed:

- Detect the meta node ID and include it on the gRPC stream context.
- Default to waiting for subscribers to ack the last event received before streaming events.

Three bugs in the RST push+stub flow could cause state corruption or local data loss when a job is
cancelled under certain race conditions:

1. Remote's UpdateWork() had no terminal state guard. A late-arriving COMPLETED work result (e.g.
   from Sync journal replay after restart or a gRPC context cancellation race) could trigger
   job.Complete() on an already-cancelled job, overwriting the CANCELLED state and violating the
   user's cancel intent. For multipart uploads this typically results in a FAILED job (the multipart
   was already aborted so finishUpload fails), but the state corruption makes job history confusing
   and difficult to reason about. For non-multipart uploads this could create a stub file after the
   job was cancelled, though data loss should not occur since the contents were already synced to
   the bucket. Fixed by checking job.InTerminalState() before processing. The work result is still
   persisted for inspection, but no completion logic runs.
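The terminal-state guard described above can be sketched as follows. The type and method names (Job, InTerminalState, UpdateWork) follow the PR description, but the state values and struct layout here are simplified stand-ins, not the actual BeeRemote types:

```go
package main

import "fmt"

// State models a simplified job lifecycle; the names are illustrative.
type State int

const (
	StateRunning State = iota
	StateCompleted
	StateCancelled
	StateFailed
)

type Job struct{ state State }

// InTerminalState reports whether the job has reached a final state
// that must never be overwritten by late-arriving work results.
func (j *Job) InTerminalState() bool {
	return j.state == StateCompleted || j.state == StateCancelled || j.state == StateFailed
}

// UpdateWork sketches the fix: the work result would still be persisted
// for inspection, but completion logic is skipped for terminal jobs, so
// a late COMPLETED result cannot clobber a user's CANCELLED state.
func (j *Job) UpdateWork(result State) State {
	if j.InTerminalState() {
		return j.state // preserve the terminal state (e.g. cancel intent)
	}
	j.state = result
	return j.state
}

func main() {
	j := &Job{state: StateCancelled}
	fmt.Println(j.UpdateWork(StateCompleted)) // stays StateCancelled
	r := &Job{state: StateRunning}
	fmt.Println(r.UpdateWork(StateCompleted)) // normal completion
}
```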

2. Sync's gRPC server discarded work results when the work manager returned both a result and an
   error. This happens when Remote tries to cancel already-COMPLETED work: the manager returns the
   COMPLETED result alongside an error, but the server returned only the gRPC error. Remote never
   learned the work was COMPLETED and set the state to UNKNOWN. Fixed by returning the work result
   without a gRPC error when the manager provides one, so Remote sees the actual state.
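The result-over-error rule from bug 2 can be illustrated with a minimal handler-shaped sketch. The function and error names here are invented for the example; only the behavior (prefer returning the work result over the gRPC error) comes from the description:

```go
package main

import (
	"errors"
	"fmt"
)

// errAlreadyCompleted stands in for the manager error returned when
// cancelling work that has already finished.
var errAlreadyCompleted = errors.New("work already completed")

type WorkResult struct{ State string }

// cancelWork simulates a work manager that returns the COMPLETED result
// alongside an error when asked to cancel finished work.
func cancelWork() (*WorkResult, error) {
	return &WorkResult{State: "COMPLETED"}, errAlreadyCompleted
}

// handleCancel sketches the fixed server behavior: when the manager
// supplies a result, return it without a gRPC error so the caller sees
// the actual state instead of marking the work UNKNOWN.
func handleCancel() (*WorkResult, error) {
	result, err := cancelWork()
	if err != nil && result == nil {
		return nil, err // nothing to report; surface the error
	}
	return result, nil // a concrete result wins over the error
}

func main() {
	res, err := handleCancel()
	fmt.Println(res.State, err == nil)
}
```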

3. CreateOffloadedDataFile() opened the original file with O_TRUNC before writing stub content,
   creating a window where the file is 0 bytes. If the process crashed or SetFileDataState() failed
   after truncation, the file would be empty with no stub URL written. Fixed by writing stub content
   to a temp file with a random suffix, then atomically replacing the original via OverwriteFile
   (Renameat2). The original file is either fully intact or fully replaced. Note this issue was
   fairly inconsequential if the RST ID was set on the file's metadata, and the remote path was the
   same as the path inside BeeGFS.

Assisted-by: Claude:claude-opus-4-6
Assisted-by: Claude:claude-sonnet-4-6
@iamjoemccormick
Member Author

@claude review once

Comment on lines +150 to +152
func (m *Manager) ResetUserRateLimit(userId uint32) {
	m.rateLimiter.resetUser(userId)
}

🔴 ResetUserRateLimit panics with a nil pointer dereference when the dispatch Manager is disabled. Any caller invoking this exported method on a disabled Manager (cfg.Enabled=false) will crash because New() returns &Manager{} with rateLimiter=nil. Add an if !m.enabled { return } guard matching the Start()/Stop() pattern to fix it.

Extended reasoning...

Bug description

When cfg.Enabled=false, New() (dispatch.go:80) short-circuits and returns &Manager{}, a zero-value struct where rateLimiter is nil and enabled is false. ResetUserRateLimit() at lines 150–152 calls m.rateLimiter.resetUser(userId) unconditionally, which will immediately panic with a nil pointer dereference if the Manager is disabled.

Code path

// New() – disabled path
if !cfg.Enabled {
    log.Warn("automatically dispatching jobs from file system modification events is disabled")
    return &Manager{}, nil   // rateLimiter is nil, enabled is false
}

// ResetUserRateLimit() – no guard
func (m *Manager) ResetUserRateLimit(userId uint32) {
    m.rateLimiter.resetUser(userId)  // nil dereference if disabled
}

Why existing code doesn't prevent it

Start() and Stop() both guard with if !m.enabled { return }, so they are safe when the Manager is disabled. ResetUserRateLimit() is the only exported method that skips this pattern. The method's doc comment says "Safe to call from any goroutine (e.g., a gRPC handler)", which implies callers need not check whether dispatch is enabled before calling it, making the broken contract even more dangerous.

Impact

No current callers exist in this PR (the method is newly defined), so the bug cannot panic in production today. However, it is a public API explicitly designed for external callers such as gRPC handlers. Any future caller that does not independently check cfg.Enabled before invoking ResetUserRateLimit() will trigger a nil pointer dereference panic, crashing the BeeRemote process.

Step-by-step proof

  1. Caller configures dispatch.Config{Enabled: false, ...} (the default in beegfs-remote.toml).
  2. dispatch.New(cfg, ...) executes the !cfg.Enabled branch and returns &Manager{}.
  3. All fields are zero: enabled=false, rateLimiter=nil.
  4. A gRPC handler calls dispatchManager.ResetUserRateLimit(1000).
  5. m.rateLimiter is nil; calling nil.resetUser(1000) panics with runtime error: invalid memory address or nil pointer dereference.

Fix

Add the same guard used by Start() and Stop():

func (m *Manager) ResetUserRateLimit(userId uint32) {
    if !m.enabled {
        return
    }
    m.rateLimiter.resetUser(userId)
}
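The guarded method can be exercised end to end with a pared-down model of the Manager. The struct here is a stand-in with only the two fields the review discusses; field names mirror the review text but the rest is illustrative:

```go
package main

import "fmt"

// rateLimiter is a minimal stand-in that counts resets.
type rateLimiter struct{ resets int }

func (r *rateLimiter) resetUser(userId uint32) { r.resets++ }

// Manager models the two fields relevant to the bug: a disabled
// Manager is the zero value New() returns when cfg.Enabled=false.
type Manager struct {
	enabled     bool
	rateLimiter *rateLimiter
}

// ResetUserRateLimit with the suggested guard: a disabled Manager
// returns early instead of dereferencing the nil rateLimiter.
func (m *Manager) ResetUserRateLimit(userId uint32) {
	if !m.enabled {
		return
	}
	m.rateLimiter.resetUser(userId)
}

func main() {
	disabled := &Manager{}            // what New() returns when disabled
	disabled.ResetUserRateLimit(1000) // safe: guard returns, no panic
	enabled := &Manager{enabled: true, rateLimiter: &rateLimiter{}}
	enabled.ResetUserRateLimit(1000)
	fmt.Println(enabled.rateLimiter.resets)
}
```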

Comment thread common/beegfs/entry.go
Comment on lines +149 to +151
func IsDataStateOffloaded(state DataState) bool {
	return state == DataStateManualRestore || state == DataStateAutoRestore
}

🔴 The new IsDataStateOffloaded() function omits DataStateDelayedRestore, causing stub files created with --restore-policy=delayed to be silently mishandled by three call sites. Fix: add || state == DataStateDelayedRestore to the return expression in common/beegfs/entry.go:149-151.

Extended reasoning...

What the bug is and how it manifests

IsDataStateOffloaded() (introduced in this PR at common/beegfs/entry.go:149-151) returns true only for DataStateManualRestore and DataStateAutoRestore. This same PR also introduces DataStateDelayedRestore as a valid stub-file state, set by CreateOffloadedDataFile() when the caller passes restorePolicyToDataState(RESTORE_POLICY_DELAYED). Because IsDataStateOffloaded() doesn't include DataStateDelayedRestore, any file stubbed with the delayed restore policy is treated as a regular (non-offloaded) file by every caller of the function.

The specific code paths that break

Three callers are directly impacted:

  1. GetLockedInfo() in common/rst/rst.go:698: the if beegfs.IsDataStateOffloaded(...) block that
     reads the stub RST URL and populates lockedInfo.StubUrlRstId / StubUrlPath is skipped, so
     rstIds remains nil and the function returns ErrFileHasNoRSTs, making any push/pull/status
     operation fail for delayed-restore stubs.
  2. prepareJobRequests() in common/rst/jobrequest.go:205: the stub-content retrieval path is
     skipped, causing the function to fall through to the "no RSTs configured" error path instead
     of calling remote.GetStubContents.
  3. getDefaultReleaseUnusedFileLock() in rst/remote/internal/job/manager.go:1221: the guard that
     skips ClearAccessFlags for offloaded files returns false for DataStateDelayedRestore, so the
     write lock on a live stub file is incorrectly cleared, potentially allowing writes to stub
     content.

A fourth caller, getPathStatusFromDatabase() in ctl/pkg/ctl/rst/status.go:568, also uses IsDataStateOffloaded() to display the "contents are offloaded" message, so delayed-restore stubs would show incorrect status output.

Why existing code doesn't prevent it

GetEventDispatchFunc() in manager.go:266 intentionally compares directly to DataStateAutoRestore (not via IsDataStateOffloaded()), which is correct: delayed-restore stubs should not trigger auto-restore on blocked opens. That design choice is sound, but it means the intended behavior for the other callers, where all three restore-policy stub types must be treated as offloaded, is only expressed through the new helper that is now incomplete.

Step-by-step proof

  1. User runs beegfs-ctl rst push --stub-local --restore-policy=delayed /mnt/beegfs/file.
  2. CreateOffloadedDataFile() is called with dataState = DataStateDelayedRestore; the file becomes a stub with this state.
  3. User later runs beegfs-ctl rst pull /mnt/beegfs/file.
  4. prepareJobRequests() calls entry.GetEntry(), which returns DataStateDelayedRestore.
  5. beegfs.IsDataStateOffloaded(DataStateDelayedRestore) returns false.
  6. Code skips the stub-content retrieval branch, falls through to the RST-ID check, finds none, and returns ErrFileHasNoRSTs; the pull fails with a misleading error.

How to fix it

Change common/beegfs/entry.go:151 from:

return state == DataStateManualRestore || state == DataStateAutoRestore

to:

return state == DataStateManualRestore || state == DataStateAutoRestore || state == DataStateDelayedRestore
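The corrected predicate can be checked with a small table of states. The constant names follow the review; their numeric values here are illustrative stand-ins for the real definitions in common/beegfs/entry.go:

```go
package main

import "fmt"

// DataState and its constants are simplified stand-ins for the types
// in common/beegfs/entry.go; only the names come from the review.
type DataState int

const (
	DataStateNone DataState = iota
	DataStateManualRestore
	DataStateAutoRestore
	DataStateDelayedRestore
)

// IsDataStateOffloaded with the proposed fix applied: all three
// restore-policy stub states count as offloaded.
func IsDataStateOffloaded(state DataState) bool {
	return state == DataStateManualRestore ||
		state == DataStateAutoRestore ||
		state == DataStateDelayedRestore
}

func main() {
	stubStates := []DataState{DataStateManualRestore, DataStateAutoRestore, DataStateDelayedRestore}
	for _, s := range stubStates {
		fmt.Println(IsDataStateOffloaded(s)) // true for every stub state
	}
	fmt.Println(IsDataStateOffloaded(DataStateNone)) // false
}
```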

Comment on lines +179 to +201
				// Event was filtered out, just ack it.
				return ack
			}
			dispatch = m.defaultDispatchFn
		}
	}

	userId := e.V2.GetMsgUserId()
	eventType := e.V2.GetType()
	if !m.rateLimiter.check(userId, eventType) {
		m.log.Debug("event rate limited",
			zap.Uint32("userId", userId),
			zap.String("path", e.V2.GetPath()),
			zap.Any("type", eventType))
		dispatchEventsRateLimited.Add(1)
		return ack
	}

	if dispatch(event) {
		m.rateLimiter.record(userId, eventType)
		dispatchEventsAccepted.Add(1)
	} else {
		dispatchEventsRejected.Add(1)

🔴 In dispatch(), the local 'dispatch' variable is only assigned inside 'if m.dispatchFns != nil', so when New() is called with dispatchFns=nil and a non-nil defaultDispatchFn, dispatch remains nil and calling dispatch(event) panics. The fix is to add an else branch after the block: 'else if m.defaultDispatchFn != nil { dispatch = m.defaultDispatchFn }'.

Extended reasoning...

What the bug is and how it manifests

The dispatch() method in watch/pkg/dispatch/dispatch.go declares a local 'var dispatch DispatchFunc' (zero value: nil). It only assigns a function to this variable inside the block 'if m.dispatchFns != nil'. If m.dispatchFns is nil, the entire block is skipped, dispatch stays nil, and the subsequent call 'if dispatch(event)' panics with a nil function call.

The specific code path that triggers it

New() guards against misconfiguration with: 'if len(dispatchFns) == 0 && defaultDispatchFn == nil { return error }'. In Go, len(nil) on a map is 0, so passing (dispatchFns=nil, defaultDispatchFn=nonNilFn) satisfies len(dispatchFns)==0 but the overall guard is false because defaultDispatchFn != nil. The Manager is created with m.dispatchFns=nil and m.defaultDispatchFn!=nil. Then in dispatch(), 'if m.dispatchFns != nil' evaluates to false, the block is skipped, dispatch is never set, and dispatch(event) panics.

Why existing code does not prevent it

The guard in New() uses len() which treats nil and empty maps identically, while dispatch() uses != nil which distinguishes them. The API contract of New() allows (nil, nonNilFn) but dispatch() has no code path that consults m.defaultDispatchFn when m.dispatchFns is nil. The only production caller in rst/remote/cmd/beegfs-remote/main.go always passes a non-nil, non-empty map with defaultDispatchFn=nil, so the panic is never triggered today.

Impact

Any future caller that passes dispatchFns=nil with a non-nil defaultDispatchFn will encounter a panic in the dispatch worker goroutine, crashing that goroutine silently or bringing down the service.

Step-by-step proof

  1. Caller invokes dispatch.New(cfg, log, nil, myFallbackFn, grpcServer).
  2. Guard: len(nil)==0 && myFallbackFn==nil evaluates to true && false = false. No error returned.
  3. Manager created with m.dispatchFns=nil, m.defaultDispatchFn=myFallbackFn.
  4. An event arrives; dispatch() is called.
  5. var dispatch DispatchFunc sets dispatch to nil.
  6. if m.dispatchFns != nil evaluates to false. Block skipped entirely.
  7. if dispatch(event) causes a nil function call panic.

How to fix

Add an else branch after the if m.dispatchFns != nil block:

} else if m.defaultDispatchFn != nil {
	dispatch = m.defaultDispatchFn
}

This mirrors the existing inner branch that falls back to defaultDispatchFn when a specific type is not found in dispatchFns, and covers the nil-map case.
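The full selection logic with the else branch applied can be sketched as a standalone function. The types and the pickDispatch name are illustrative stand-ins for the dispatch() method discussed above:

```go
package main

import "fmt"

type Event struct{ Type string }
type DispatchFunc func(Event) bool

// Manager keeps only the two fields relevant to the bug.
type Manager struct {
	dispatchFns       map[string]DispatchFunc
	defaultDispatchFn DispatchFunc
}

// pickDispatch sketches the fixed selection: the else branch covers the
// nil-map case, so a Manager built with (nil, fallback) no longer
// leaves dispatch nil and panics on the call.
func (m *Manager) pickDispatch(e Event) DispatchFunc {
	var dispatch DispatchFunc
	if m.dispatchFns != nil {
		if fn, ok := m.dispatchFns[e.Type]; ok {
			dispatch = fn
		} else {
			dispatch = m.defaultDispatchFn
		}
	} else if m.defaultDispatchFn != nil {
		dispatch = m.defaultDispatchFn
	}
	return dispatch
}

func main() {
	fallback := func(Event) bool { return true }
	// The previously-panicking configuration: nil map, non-nil fallback.
	m := &Manager{dispatchFns: nil, defaultDispatchFn: fallback}
	fn := m.pickDispatch(Event{Type: "create"})
	fmt.Println(fn != nil && fn(Event{Type: "create"}))
}
```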
