Conversation
While integrating Watch with Remote, a few updates were needed:

- Detect the meta node ID and include it on the gRPC stream context.
- Default to waiting for subscribers to ack the last event received before streaming events.
Three bugs in the RST push+stub flow could cause state corruption or local data loss when a job is cancelled under certain race conditions:

1. Remote's UpdateWork() had no terminal state guard. A late-arriving COMPLETED work result (e.g. from Sync journal replay after restart or a gRPC context cancellation race) could trigger job.Complete() on an already-cancelled job, overwriting the CANCELLED state and violating the user's cancel intent. For multipart uploads this typically results in a FAILED job (the multipart was already aborted so finishUpload fails), but the state corruption makes job history confusing and difficult to reason about. For non-multipart uploads this could create a stub file after the job was cancelled, though data loss should not occur since the contents were already synced to the bucket. Fixed by checking job.InTerminalState() before processing. The work result is still persisted for inspection, but no completion logic runs.
2. Sync's gRPC server discarded work results when the work manager returned both a result and an error. This happens when Remote tries to cancel already-COMPLETED work: the manager returns the COMPLETED result alongside an error, but the server returned only the gRPC error. Remote never learned the work was COMPLETED and set the state to UNKNOWN. Fixed by returning the work result without a gRPC error when the manager provides one, so Remote sees the actual state.
3. CreateOffloadedDataFile() opened the original file with O_TRUNC before writing stub content, creating a window where the file is 0 bytes. If the process crashed or SetFileDataState() failed after truncation, the file would be empty with no stub URL written. Fixed by writing stub content to a temp file with a random suffix, then atomically replacing the original via OverwriteFile (Renameat2). The original file is either fully intact or fully replaced.
Note this issue was fairly inconsequential if the RST ID was set on the file's metadata, and the remote path was the same as the path inside BeeGFS.

Assisted-by: Claude:claude-opus-4-6
Assisted-by: Claude:claude-sonnet-4-6
@claude review once
```go
func (m *Manager) ResetUserRateLimit(userId uint32) {
	m.rateLimiter.resetUser(userId)
}
```
🔴 ResetUserRateLimit panics with a nil pointer dereference when the dispatch Manager is disabled. Any caller invoking this exported method on a disabled Manager (cfg.Enabled=false) will crash because New() returns &Manager{} with rateLimiter=nil. Add an if !m.enabled { return } guard matching the Start()/Stop() pattern to fix it.
Extended reasoning...
Bug description
When cfg.Enabled=false, New() (dispatch.go:80) short-circuits and returns &Manager{}, a zero-value struct where rateLimiter is nil and enabled is false. ResetUserRateLimit() at lines 150–152 calls m.rateLimiter.resetUser(userId) unconditionally, which will immediately panic with a nil pointer dereference if the Manager is disabled.
Code path
```go
// New(): disabled path
if !cfg.Enabled {
	log.Warn("automatically dispatching jobs from file system modification events is disabled")
	return &Manager{}, nil // rateLimiter is nil, enabled is false
}

// ResetUserRateLimit(): no guard
func (m *Manager) ResetUserRateLimit(userId uint32) {
	m.rateLimiter.resetUser(userId) // nil dereference if disabled
}
```

Why existing code doesn't prevent it
Start() and Stop() both guard with if !m.enabled { return }, so they are safe when the Manager is disabled. ResetUserRateLimit() is the only exported method that skips this pattern. The method's doc comment says "Safe to call from any goroutine (e.g., a gRPC handler)", which implies callers need not check whether dispatch is enabled before calling it, making the broken contract even more dangerous.
Impact
No current callers exist in this PR (the method is newly defined), so the bug cannot panic in production today. However, it is a public API explicitly designed for external callers such as gRPC handlers. Any future caller that does not independently check cfg.Enabled before invoking ResetUserRateLimit() will trigger a nil pointer dereference panic, crashing the BeeRemote process.
Step-by-step proof
- Caller configures `dispatch.Config{Enabled: false, ...}` (the default in `beegfs-remote.toml`).
- `dispatch.New(cfg, ...)` executes the `!cfg.Enabled` branch and returns `&Manager{}`. All fields are zero: `enabled=false`, `rateLimiter=nil`.
- A gRPC handler calls `dispatchManager.ResetUserRateLimit(1000)`.
- `m.rateLimiter` is `nil`; calling `nil.resetUser(1000)` panics with `runtime error: invalid memory address or nil pointer dereference`.
Fix
Add the same guard used by Start() and Stop():
```go
func (m *Manager) ResetUserRateLimit(userId uint32) {
	if !m.enabled {
		return
	}
	m.rateLimiter.resetUser(userId)
}
```

```go
func IsDataStateOffloaded(state DataState) bool {
	return state == DataStateManualRestore || state == DataStateAutoRestore
}
```
🔴 The new IsDataStateOffloaded() function omits DataStateDelayedRestore, causing stub files created with --restore-policy=delayed to be silently mishandled by three call sites. Fix: add || state == DataStateDelayedRestore to the return expression in common/beegfs/entry.go:149-151.
Extended reasoning...
What the bug is and how it manifests
IsDataStateOffloaded() (introduced in this PR at common/beegfs/entry.go:149-151) returns true only for DataStateManualRestore and DataStateAutoRestore. This same PR also introduces DataStateDelayedRestore as a valid stub-file state, set by CreateOffloadedDataFile() when the caller passes restorePolicyToDataState(RESTORE_POLICY_DELAYED). Because IsDataStateOffloaded() doesn't include DataStateDelayedRestore, any file stubbed with the delayed restore policy is treated as a regular (non-offloaded) file by every caller of the function.
The specific code paths that break
Three callers are directly impacted:

1. GetLockedInfo() in common/rst/rst.go:698: the `if beegfs.IsDataStateOffloaded(...)` block that reads the stub RST URL and populates lockedInfo.StubUrlRstId / StubUrlPath is skipped, so rstIds remains nil and the function returns ErrFileHasNoRSTs, making any push/pull/status operation fail for delayed-restore stubs.
2. prepareJobRequests() in common/rst/jobrequest.go:205: the stub-content retrieval path is skipped, causing the function to fall through to the "no RSTs configured" error path instead of calling remote.GetStubContents.
3. getDefaultReleaseUnusedFileLock() in rst/remote/internal/job/manager.go:1221: the guard that skips ClearAccessFlags for offloaded files returns false for DataStateDelayedRestore, so the write lock on a live stub file is incorrectly cleared, potentially allowing writes to stub content.
A fourth caller, getPathStatusFromDatabase() in ctl/pkg/ctl/rst/status.go:568, also uses IsDataStateOffloaded() to display the "contents are offloaded" message, so delayed-restore stubs would show incorrect status output.
Why existing code doesn't prevent it
GetEventDispatchFunc() in manager.go:266 intentionally compares directly to DataStateAutoRestore (not via IsDataStateOffloaded()), which is correct: delayed-restore stubs should not trigger auto-restore on blocked opens. That design choice is sound, but it means the intended behavior for the other callers (where all three restore-policy stub types must be treated as offloaded) is only expressed through the new helper that is now incomplete.
Step-by-step proof
- User runs `beegfs-ctl rst push --stub-local --restore-policy=delayed /mnt/beegfs/file`.
- `CreateOffloadedDataFile()` is called with `dataState = DataStateDelayedRestore`; the file becomes a stub with this state.
- User later runs `beegfs-ctl rst pull /mnt/beegfs/file`.
- `prepareJobRequests()` calls `entry.GetEntry()`, which returns `DataStateDelayedRestore`.
- `beegfs.IsDataStateOffloaded(DataStateDelayedRestore)` returns `false`.
- Code skips the stub-content retrieval branch, falls through to the RST-ID check, finds none, and returns `ErrFileHasNoRSTs`; the pull fails with a misleading error.
How to fix it
Change common/beegfs/entry.go:151 from:

```go
return state == DataStateManualRestore || state == DataStateAutoRestore
```

to:

```go
return state == DataStateManualRestore || state == DataStateAutoRestore || state == DataStateDelayedRestore
```

```go
			// Event was filtered out, just ack it.
			return ack
		}
		dispatch = m.defaultDispatchFn
	}
}

userId := e.V2.GetMsgUserId()
eventType := e.V2.GetType()
if !m.rateLimiter.check(userId, eventType) {
	m.log.Debug("event rate limited",
		zap.Uint32("userId", userId),
		zap.String("path", e.V2.GetPath()),
		zap.Any("type", eventType))
	dispatchEventsRateLimited.Add(1)
	return ack
}

if dispatch(event) {
	m.rateLimiter.record(userId, eventType)
	dispatchEventsAccepted.Add(1)
} else {
	dispatchEventsRejected.Add(1)
```
🔴 In dispatch(), the local 'dispatch' variable is only assigned inside 'if m.dispatchFns != nil', so when New() is called with dispatchFns=nil and a non-nil defaultDispatchFn, dispatch remains nil and calling dispatch(event) panics. The fix is to add an else branch after the block: 'else if m.defaultDispatchFn != nil { dispatch = m.defaultDispatchFn }'.
Extended reasoning...
What the bug is and how it manifests
The dispatch() method in watch/pkg/dispatch/dispatch.go declares a local 'var dispatch DispatchFunc' (zero value: nil). It only assigns a function to this variable inside the block 'if m.dispatchFns != nil'. If m.dispatchFns is nil, the entire block is skipped, dispatch stays nil, and the subsequent call 'if dispatch(event)' panics with a nil function call.
The specific code path that triggers it
New() guards against misconfiguration with: 'if len(dispatchFns) == 0 && defaultDispatchFn == nil { return error }'. In Go, len(nil) on a map is 0, so passing (dispatchFns=nil, defaultDispatchFn=nonNilFn) satisfies len(dispatchFns)==0 but the overall guard is false because defaultDispatchFn != nil. The Manager is created with m.dispatchFns=nil and m.defaultDispatchFn!=nil. Then in dispatch(), 'if m.dispatchFns != nil' evaluates to false, the block is skipped, dispatch is never set, and dispatch(event) panics.
Why existing code does not prevent it
The guard in New() uses len() which treats nil and empty maps identically, while dispatch() uses != nil which distinguishes them. The API contract of New() allows (nil, nonNilFn) but dispatch() has no code path that consults m.defaultDispatchFn when m.dispatchFns is nil. The only production caller in rst/remote/cmd/beegfs-remote/main.go always passes a non-nil, non-empty map with defaultDispatchFn=nil, so the panic is never triggered today.
Impact
Any future caller that passes dispatchFns=nil with a non-nil defaultDispatchFn will encounter a panic in the dispatch worker goroutine, crashing that goroutine silently or bringing down the service.
Step-by-step proof
- Caller invokes dispatch.New(cfg, log, nil, myFallbackFn, grpcServer).
- Guard: len(nil)==0 && myFallbackFn==nil evaluates to true && false = false. No error returned.
- Manager created with m.dispatchFns=nil, m.defaultDispatchFn=myFallbackFn.
- An event arrives; dispatch() is called.
- var dispatch DispatchFunc sets dispatch to nil.
- if m.dispatchFns != nil evaluates to false. Block skipped entirely.
- if dispatch(event) causes a nil function call panic.
How to fix
Add an else branch after the if m.dispatchFns != nil block:
```go
} else if m.defaultDispatchFn != nil {
	dispatch = m.defaultDispatchFn
}
```
This mirrors the existing inner branch that falls back to defaultDispatchFn when a specific type is not found in dispatchFns, and covers the nil-map case.
What does this PR do / why do we need it?
Required for all PRs.
Related Issue(s)
Required when applicable.
Where should the reviewer(s) start reviewing this?
Only required for larger PRs when this may not be immediately obvious.
Are there any specific topics we should discuss before merging?
Not required.
What are the next steps after this PR?
Not required.
Checklist before merging:
Required for all PRs.
When creating a PR, these are items to keep in mind that cannot be checked by GitHub Actions:
For more details refer to the Go coding standards and the pull request process.