RST Auto Synchronization#318

Draft
iamjoemccormick wants to merge 7 commits into main from iamjoe/feat/rst-auto-sync

Conversation

@iamjoemccormick
Member

What does this PR do / why do we need it?

Required for all PRs.

Related Issue(s)

Required when applicable.

Where should the reviewer(s) start reviewing this?

Only required for larger PRs when this may not be immediately obvious.

Are there any specific topics we should discuss before merging?

Not required.

What are the next steps after this PR?

Not required.

Checklist before merging:

Required for all PRs.

When creating a PR, these are items to keep in mind that cannot be checked by GitHub Actions:

  • Documentation:
    • Does developer documentation (code comments, readme, etc.) need to be added or updated?
    • Does the user documentation need to be expanded or updated for this change?
  • Testing:
    • Does this functionality require changing or adding new unit tests?
    • Does this functionality require changing or adding new integration tests?
  • Git Hygiene:

For more details refer to the Go coding standards and the pull request process.

While integrating Watch with Remote, a few updates were needed:

- Detect the meta node ID and include it on the gRPC stream context.
- Default to waiting for subscribers to ack the last event received before streaming events.

Three bugs in the RST push+stub flow could cause state corruption or local data loss when a job is
cancelled under certain race conditions:

1. Remote's UpdateWork() had no terminal state guard. A late-arriving COMPLETED work result (e.g.
   from Sync journal replay after restart or a gRPC context cancellation race) could trigger
   job.Complete() on an already-cancelled job, overwriting the CANCELLED state and violating the
   user's cancel intent. For multipart uploads this typically results in a FAILED job (the multipart
   was already aborted so finishUpload fails), but the state corruption makes job history confusing
   and difficult to reason about. For non-multipart uploads this could create a stub file after the
   job was cancelled, though data loss should not occur since the contents were already synced to
   the bucket. Fixed by checking job.InTerminalState() before processing. The work result is still
   persisted for inspection, but no completion logic runs.
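The terminal-state guard described above can be sketched as follows. The type and method names (Job, InTerminalState, UpdateWork) follow the PR description, but the state values and struct layout here are simplified stand-ins, not the actual BeeRemote types:

```go
package main

import "fmt"

// State models a simplified job lifecycle; the names are illustrative.
type State int

const (
	StateRunning State = iota
	StateCompleted
	StateCancelled
	StateFailed
)

type Job struct{ state State }

// InTerminalState reports whether the job has reached a final state
// that must never be overwritten by late-arriving work results.
func (j *Job) InTerminalState() bool {
	return j.state == StateCompleted || j.state == StateCancelled || j.state == StateFailed
}

// UpdateWork sketches the fix: the work result would still be persisted
// for inspection, but completion logic is skipped for terminal jobs, so
// a late COMPLETED result cannot clobber a user's CANCELLED state.
func (j *Job) UpdateWork(result State) State {
	if j.InTerminalState() {
		return j.state // preserve the terminal state (e.g. cancel intent)
	}
	j.state = result
	return j.state
}

func main() {
	j := &Job{state: StateCancelled}
	fmt.Println(j.UpdateWork(StateCompleted)) // stays StateCancelled
	r := &Job{state: StateRunning}
	fmt.Println(r.UpdateWork(StateCompleted)) // normal completion
}
```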

2. Sync's gRPC server discarded work results when the work manager returned both a result and an
   error. This happens when Remote tries to cancel already-COMPLETED work: the manager returns the
   COMPLETED result alongside an error, but the server returned only the gRPC error. Remote never
   learned the work was COMPLETED and set the state to UNKNOWN. Fixed by returning the work result
   without a gRPC error when the manager provides one, so Remote sees the actual state.
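The result-over-error rule from bug 2 can be illustrated with a minimal handler-shaped sketch. The function and error names here are invented for the example; only the behavior (prefer returning the work result over the gRPC error) comes from the description:

```go
package main

import (
	"errors"
	"fmt"
)

// errAlreadyCompleted stands in for the manager error returned when
// cancelling work that has already finished.
var errAlreadyCompleted = errors.New("work already completed")

type WorkResult struct{ State string }

// cancelWork simulates a work manager that returns the COMPLETED result
// alongside an error when asked to cancel finished work.
func cancelWork() (*WorkResult, error) {
	return &WorkResult{State: "COMPLETED"}, errAlreadyCompleted
}

// handleCancel sketches the fixed server behavior: when the manager
// supplies a result, return it without a gRPC error so the caller sees
// the actual state instead of marking the work UNKNOWN.
func handleCancel() (*WorkResult, error) {
	result, err := cancelWork()
	if err != nil && result == nil {
		return nil, err // nothing to report; surface the error
	}
	return result, nil // a concrete result wins over the error
}

func main() {
	res, err := handleCancel()
	fmt.Println(res.State, err == nil)
}
```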

3. CreateOffloadedDataFile() opened the original file with O_TRUNC before writing stub content,
   creating a window where the file is 0 bytes. If the process crashed or SetFileDataState() failed
   after truncation, the file would be empty with no stub URL written. Fixed by writing stub content
   to a temp file with a random suffix, then atomically replacing the original via OverwriteFile
   (Renameat2). The original file is either fully intact or fully replaced. Note this issue was
   fairly inconsequential if the RST ID was set on the file's metadata, and the remote path was the
   same as the path inside BeeGFS.

Assisted-by: Claude:claude-opus-4-6
Assisted-by: Claude:claude-sonnet-4-6
@iamjoemccormick
Member Author

@claude review once

Comment on lines +150 to +152
func (m *Manager) ResetUserRateLimit(userId uint32) {
	m.rateLimiter.resetUser(userId)
}

🔴 ResetUserRateLimit panics with a nil pointer dereference when the dispatch Manager is disabled. Any caller invoking this exported method on a disabled Manager (cfg.Enabled=false) will crash because New() returns &Manager{} with rateLimiter=nil. Add an if !m.enabled { return } guard matching the Start()/Stop() pattern to fix it.

Extended reasoning...

Bug description

When cfg.Enabled=false, New() (dispatch.go:80) short-circuits and returns &Manager{}, a zero-value struct where rateLimiter is nil and enabled is false. ResetUserRateLimit() at lines 150–152 calls m.rateLimiter.resetUser(userId) unconditionally, which will immediately panic with a nil pointer dereference if the Manager is disabled.

Code path

// New() – disabled path
if !cfg.Enabled {
    log.Warn("automatically dispatching jobs from file system modification events is disabled")
    return &Manager{}, nil   // rateLimiter is nil, enabled is false
}

// ResetUserRateLimit() – no guard
func (m *Manager) ResetUserRateLimit(userId uint32) {
    m.rateLimiter.resetUser(userId)  // nil dereference if disabled
}

Why existing code doesn't prevent it

Start() and Stop() both guard with if !m.enabled { return }, so they are safe when the Manager is disabled. ResetUserRateLimit() is the only exported method that skips this pattern. The method's doc comment says "Safe to call from any goroutine (e.g., a gRPC handler)", which implies callers need not check whether dispatch is enabled before calling it, making the broken contract even more dangerous.

Impact

No current callers exist in this PR (the method is newly defined), so the bug cannot panic in production today. However, it is a public API explicitly designed for external callers such as gRPC handlers. Any future caller that does not independently check cfg.Enabled before invoking ResetUserRateLimit() will trigger a nil pointer dereference panic, crashing the BeeRemote process.

Step-by-step proof

  1. Caller configures dispatch.Config{Enabled: false, ...} (the default in beegfs-remote.toml).
  2. dispatch.New(cfg, ...) executes the !cfg.Enabled branch and returns &Manager{}.
  3. All fields are zero: enabled=false, rateLimiter=nil.
  4. A gRPC handler calls dispatchManager.ResetUserRateLimit(1000).
  5. m.rateLimiter is nil; calling nil.resetUser(1000) panics with runtime error: invalid memory address or nil pointer dereference.

Fix

Add the same guard used by Start() and Stop():

func (m *Manager) ResetUserRateLimit(userId uint32) {
    if !m.enabled {
        return
    }
    m.rateLimiter.resetUser(userId)
}
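The guarded method can be exercised end to end with a pared-down model of the Manager. The struct here is a stand-in with only the two fields the review discusses; field names mirror the review text but the rest is illustrative:

```go
package main

import "fmt"

// rateLimiter is a minimal stand-in that counts resets.
type rateLimiter struct{ resets int }

func (r *rateLimiter) resetUser(userId uint32) { r.resets++ }

// Manager models the two fields relevant to the bug: a disabled
// Manager is the zero value New() returns when cfg.Enabled=false.
type Manager struct {
	enabled     bool
	rateLimiter *rateLimiter
}

// ResetUserRateLimit with the suggested guard: a disabled Manager
// returns early instead of dereferencing the nil rateLimiter.
func (m *Manager) ResetUserRateLimit(userId uint32) {
	if !m.enabled {
		return
	}
	m.rateLimiter.resetUser(userId)
}

func main() {
	disabled := &Manager{}            // what New() returns when disabled
	disabled.ResetUserRateLimit(1000) // safe: guard returns, no panic
	enabled := &Manager{enabled: true, rateLimiter: &rateLimiter{}}
	enabled.ResetUserRateLimit(1000)
	fmt.Println(enabled.rateLimiter.resets)
}
```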

Comment thread common/beegfs/entry.go
Comment on lines +149 to +151
func IsDataStateOffloaded(state DataState) bool {
	return state == DataStateManualRestore || state == DataStateAutoRestore
}

🔴 The new IsDataStateOffloaded() function omits DataStateDelayedRestore, causing stub files created with --restore-policy=delayed to be silently mishandled by three call sites. Fix: add || state == DataStateDelayedRestore to the return expression in common/beegfs/entry.go:149-151.

Extended reasoning...

What the bug is and how it manifests

IsDataStateOffloaded() (introduced in this PR at common/beegfs/entry.go:149-151) returns true only for DataStateManualRestore and DataStateAutoRestore. This same PR also introduces DataStateDelayedRestore as a valid stub-file state, set by CreateOffloadedDataFile() when the caller passes restorePolicyToDataState(RESTORE_POLICY_DELAYED). Because IsDataStateOffloaded() doesn't include DataStateDelayedRestore, any file stubbed with the delayed restore policy is treated as a regular (non-offloaded) file by every caller of the function.

The specific code paths that break

Three callers are directly impacted:

  1. GetLockedInfo() in common/rst/rst.go:698: the if beegfs.IsDataStateOffloaded(...) block that
     reads the stub RST URL and populates lockedInfo.StubUrlRstId / StubUrlPath is skipped, so
     rstIds remains nil and the function returns ErrFileHasNoRSTs, making any push/pull/status
     operation fail for delayed-restore stubs.
  2. prepareJobRequests() in common/rst/jobrequest.go:205: the stub-content retrieval path is
     skipped, causing the function to fall through to the "no RSTs configured" error path instead
     of calling remote.GetStubContents.
  3. getDefaultReleaseUnusedFileLock() in rst/remote/internal/job/manager.go:1221: the guard that
     skips ClearAccessFlags for offloaded files returns false for DataStateDelayedRestore, so the
     write lock on a live stub file is incorrectly cleared, potentially allowing writes to stub
     content.

A fourth caller, getPathStatusFromDatabase() in ctl/pkg/ctl/rst/status.go:568, also uses IsDataStateOffloaded() to display the "contents are offloaded" message, so delayed-restore stubs would show incorrect status output.

Why existing code doesn't prevent it

GetEventDispatchFunc() in manager.go:266 intentionally compares directly to DataStateAutoRestore (not via IsDataStateOffloaded()), which is correct: delayed-restore stubs should not trigger auto-restore on blocked opens. That design choice is sound, but it means the intended behavior for the other callers, where all three restore-policy stub types must be treated as offloaded, is only expressed through the new helper that is now incomplete.

Step-by-step proof

  1. User runs beegfs-ctl rst push --stub-local --restore-policy=delayed /mnt/beegfs/file.
  2. CreateOffloadedDataFile() is called with dataState = DataStateDelayedRestore; the file becomes a stub with this state.
  3. User later runs beegfs-ctl rst pull /mnt/beegfs/file.
  4. prepareJobRequests() calls entry.GetEntry(), which returns DataStateDelayedRestore.
  5. beegfs.IsDataStateOffloaded(DataStateDelayedRestore) returns false.
  6. Code skips the stub-content retrieval branch, falls through to the RST-ID check, finds none, and returns ErrFileHasNoRSTs; the pull fails with a misleading error.

How to fix it

Change common/beegfs/entry.go:151 from:

return state == DataStateManualRestore || state == DataStateAutoRestore

to:

return state == DataStateManualRestore || state == DataStateAutoRestore || state == DataStateDelayedRestore
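The corrected predicate can be checked with a small table of states. The constant names follow the review; their numeric values here are illustrative stand-ins for the real definitions in common/beegfs/entry.go:

```go
package main

import "fmt"

// DataState and its constants are simplified stand-ins for the types
// in common/beegfs/entry.go; only the names come from the review.
type DataState int

const (
	DataStateNone DataState = iota
	DataStateManualRestore
	DataStateAutoRestore
	DataStateDelayedRestore
)

// IsDataStateOffloaded with the proposed fix applied: all three
// restore-policy stub states count as offloaded.
func IsDataStateOffloaded(state DataState) bool {
	return state == DataStateManualRestore ||
		state == DataStateAutoRestore ||
		state == DataStateDelayedRestore
}

func main() {
	stubStates := []DataState{DataStateManualRestore, DataStateAutoRestore, DataStateDelayedRestore}
	for _, s := range stubStates {
		fmt.Println(IsDataStateOffloaded(s)) // true for every stub state
	}
	fmt.Println(IsDataStateOffloaded(DataStateNone)) // false
}
```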

Comment on lines +179 to +201
				// Event was filtered out, just ack it.
				return ack
			}
			dispatch = m.defaultDispatchFn
		}
	}

	userId := e.V2.GetMsgUserId()
	eventType := e.V2.GetType()
	if !m.rateLimiter.check(userId, eventType) {
		m.log.Debug("event rate limited",
			zap.Uint32("userId", userId),
			zap.String("path", e.V2.GetPath()),
			zap.Any("type", eventType))
		dispatchEventsRateLimited.Add(1)
		return ack
	}

	if dispatch(event) {
		m.rateLimiter.record(userId, eventType)
		dispatchEventsAccepted.Add(1)
	} else {
		dispatchEventsRejected.Add(1)

🔴 In dispatch(), the local 'dispatch' variable is only assigned inside 'if m.dispatchFns != nil', so when New() is called with dispatchFns=nil and a non-nil defaultDispatchFn, dispatch remains nil and calling dispatch(event) panics. The fix is to add an else branch after the block: 'else if m.defaultDispatchFn != nil { dispatch = m.defaultDispatchFn }'.

Extended reasoning...

What the bug is and how it manifests

The dispatch() method in watch/pkg/dispatch/dispatch.go declares a local 'var dispatch DispatchFunc' (zero value: nil). It only assigns a function to this variable inside the block 'if m.dispatchFns != nil'. If m.dispatchFns is nil, the entire block is skipped, dispatch stays nil, and the subsequent call 'if dispatch(event)' panics with a nil function call.

The specific code path that triggers it

New() guards against misconfiguration with: 'if len(dispatchFns) == 0 && defaultDispatchFn == nil { return error }'. In Go, len(nil) on a map is 0, so passing (dispatchFns=nil, defaultDispatchFn=nonNilFn) satisfies len(dispatchFns)==0 but the overall guard is false because defaultDispatchFn != nil. The Manager is created with m.dispatchFns=nil and m.defaultDispatchFn!=nil. Then in dispatch(), 'if m.dispatchFns != nil' evaluates to false, the block is skipped, dispatch is never set, and dispatch(event) panics.

Why existing code does not prevent it

The guard in New() uses len() which treats nil and empty maps identically, while dispatch() uses != nil which distinguishes them. The API contract of New() allows (nil, nonNilFn) but dispatch() has no code path that consults m.defaultDispatchFn when m.dispatchFns is nil. The only production caller in rst/remote/cmd/beegfs-remote/main.go always passes a non-nil, non-empty map with defaultDispatchFn=nil, so the panic is never triggered today.

Impact

Any future caller that passes dispatchFns=nil with a non-nil defaultDispatchFn will encounter a panic in the dispatch worker goroutine, crashing that goroutine silently or bringing down the service.

Step-by-step proof

  1. Caller invokes dispatch.New(cfg, log, nil, myFallbackFn, grpcServer).
  2. Guard: len(nil)==0 && myFallbackFn==nil evaluates to true && false = false. No error returned.
  3. Manager created with m.dispatchFns=nil, m.defaultDispatchFn=myFallbackFn.
  4. An event arrives; dispatch() is called.
  5. var dispatch DispatchFunc sets dispatch to nil.
  6. if m.dispatchFns != nil evaluates to false. Block skipped entirely.
  7. if dispatch(event) causes a nil function call panic.

How to fix

Add an else branch after the if m.dispatchFns != nil block:

} else if m.defaultDispatchFn != nil {
	dispatch = m.defaultDispatchFn
}

This mirrors the existing inner branch that falls back to defaultDispatchFn when a specific type is not found in dispatchFns, and covers the nil-map case.
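The full selection logic with the else branch applied can be sketched as a standalone function. The types and the pickDispatch name are illustrative stand-ins for the dispatch() method discussed above:

```go
package main

import "fmt"

type Event struct{ Type string }
type DispatchFunc func(Event) bool

// Manager keeps only the two fields relevant to the bug.
type Manager struct {
	dispatchFns       map[string]DispatchFunc
	defaultDispatchFn DispatchFunc
}

// pickDispatch sketches the fixed selection: the else branch covers the
// nil-map case, so a Manager built with (nil, fallback) no longer
// leaves dispatch nil and panics on the call.
func (m *Manager) pickDispatch(e Event) DispatchFunc {
	var dispatch DispatchFunc
	if m.dispatchFns != nil {
		if fn, ok := m.dispatchFns[e.Type]; ok {
			dispatch = fn
		} else {
			dispatch = m.defaultDispatchFn
		}
	} else if m.defaultDispatchFn != nil {
		dispatch = m.defaultDispatchFn
	}
	return dispatch
}

func main() {
	fallback := func(Event) bool { return true }
	// The previously-panicking configuration: nil map, non-nil fallback.
	m := &Manager{dispatchFns: nil, defaultDispatchFn: fallback}
	fn := m.pickDispatch(Event{Type: "create"})
	fmt.Println(fn != nil && fn(Event{Type: "create"}))
}
```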
