[management] Prevent JWT reuse during peer login#6002
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
🚧 Files skipped from review as they are similar to previous changes (2)
📝 WalkthroughWalkthroughAdds per-peer JWT claiming to the management gRPC server: a new SessionStore prevents token reuse via cache-backed registration, the server receives the sessionStore dependency, and tests/constructors updated to match the new nbgrpc.NewServer signature. Changes
Sequence Diagram(s)sequenceDiagram
participant Client
participant gRPC as gRPC Server
participant Auth as Auth Manager
participant Store as SessionStore
participant Cache as CacheStore
Client->>gRPC: Request with jwtToken
gRPC->>Auth: ValidateAndParseToken(jwtToken)
Auth-->>gRPC: claims (expiresAt, subject...)
gRPC->>Store: RegisterToken(jwtToken, expiresAt)
Store->>Store: hashToken(jwtToken)
Store->>Cache: Get(hash)
alt entry exists
Cache-->>Store: found
Store-->>gRPC: ErrTokenAlreadyUsed
gRPC-->>Client: Unauthenticated
else expired (expiresAt <= now)
Store-->>gRPC: ErrTokenExpired
gRPC-->>Client: Unauthenticated
else not found
Store->>Cache: Set(hash -> marker, TTL)
Cache-->>Store: OK
Store-->>gRPC: nil
gRPC-->>Client: proceed (OK)
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 3 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Warning Review ran into problems🔥 ProblemsGit: Failed to clone repository. Please run the Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
management/internals/shared/grpc/server.go (1)
542-578:⚠️ Potential issue | 🔴 CriticalRetry loop is broken: a token claimed on attempt 1 makes attempt 2 always fail with
ErrTokenAlreadyUsed.
claimLoginTokenis invoked betweenValidateAndParseTokenand the downstreamGetAccountIDFromUserAuth/EnsureUserAccessByJWTGroups/SyncUserJWTGroupscalls. If any of those downstream calls returns a transient error,processJwtToken's retry loop re-invokesvalidateTokenwith the same JWT, and the second pass now fails atclaimLoginTokenwithErrTokenAlreadyUsed→codes.Unauthenticated.Net effect:
- The "IdP cache issue" retry the comment refers to no longer works for failures past the parse step.
- A legitimate user can get permanently locked out of further login attempts with that JWT for the entire JWT TTL.
- The error returned to the client is misleading (always “JWT already used”) rather than the original transient cause.
Fix: claim the token only once the full validation chain has succeeded, so retries on transient downstream errors remain idempotent. Concurrent peer logins with the same JWT are still rejected because
RegisterTokenis the gating operation just before returning success (assuming the TOCTOU race inSessionStore.RegisterTokenis also fixed — see separate comment onsession.go).🐛 Proposed reordering
func (s *Server) validateToken(ctx context.Context, peerKey, jwtToken string) (string, error) { if s.authManager == nil { return "", status.Errorf(codes.Internal, "missing auth manager") } userAuth, token, err := s.authManager.ValidateAndParseToken(ctx, jwtToken) if err != nil { return "", status.Errorf(codes.InvalidArgument, "invalid jwt token, err: %v", err) } - if err := s.claimLoginToken(ctx, peerKey, jwtToken, token); err != nil { - return "", err - } - // we need to call this method because if user is new, we will automatically add it to existing or create a new account accountId, _, err := s.accountManager.GetAccountIDFromUserAuth(ctx, userAuth) if err != nil { return "", status.Errorf(codes.Internal, "unable to fetch account with claims, err: %v", err) } if userAuth.AccountId != accountId { log.WithContext(ctx).Debugf("gRPC server sets accountId from ensure, before %s, now %s", userAuth.AccountId, accountId) userAuth.AccountId = accountId } userAuth, err = s.authManager.EnsureUserAccessByJWTGroups(ctx, userAuth, token) if err != nil { return "", status.Error(codes.PermissionDenied, err.Error()) } err = s.accountManager.SyncUserJWTGroups(ctx, userAuth) if err != nil { log.WithContext(ctx).Errorf("gRPC server failed to sync user JWT groups: %s", err) } + if err := s.claimLoginToken(ctx, peerKey, jwtToken, token); err != nil { + return "", err + } + return userAuth.UserId, nil }Also applies to: 869-887
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@management/internals/shared/grpc/server.go` around lines 542 - 578, The validateToken flow currently calls claimLoginToken before downstream checks, causing retries to fail with ErrTokenAlreadyUsed; move the claimLoginToken call to after successful GetAccountIDFromUserAuth, EnsureUserAccessByJWTGroups and SyncUserJWTGroups (i.e., only claim/register the JWT once the full validation chain has succeeded) so transient downstream errors can be retried idempotently; update the same change for the duplicate block around the second occurrence (lines noted in the review) and ensure the final success path performs the token registration/claiming (keeping RegisterToken as the gating operation before returning success).
🧹 Nitpick comments (2)
management/server/auth/session_test.go (1)
60-73: Time-based TTL eviction test may be flaky on slow CI.A 50 ms TTL with a 120 ms sleep leaves only a ~70 ms safety margin, and any GC pause / scheduler hiccup on a busy CI runner can cause
RegisterTokenafter the sleep to still see the entry and surfaceErrTokenAlreadyUsed, failing the test intermittently. Consider widening the gap (e.g., 50 ms TTL + 500 ms sleep) or polling with a deadline rather than a single fixed sleep.♻️ Proposed change
- require.NoError(t, s.RegisterToken(ctx, token, time.Now().Add(50*time.Millisecond))) + ttl := 50 * time.Millisecond + require.NoError(t, s.RegisterToken(ctx, token, time.Now().Add(ttl))) - err := s.RegisterToken(ctx, token, time.Now().Add(50*time.Millisecond)) + err := s.RegisterToken(ctx, token, time.Now().Add(ttl)) assert.ErrorIs(t, err, ErrTokenAlreadyUsed) - time.Sleep(120 * time.Millisecond) + time.Sleep(ttl + 500*time.Millisecond) require.NoError(t, s.RegisterToken(ctx, token, time.Now().Add(time.Hour)))🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@management/server/auth/session_test.go` around lines 60 - 73, The TTL eviction test TestSessionStore_EntryEvictsAtTTLAndAllowsReRegistration is flaky due to tight timing: widen the margin by either increasing the TTL and sleep (e.g., use 50ms TTL + 500ms sleep or set both to larger values) or replace the fixed sleep with a polling loop that retries RegisterToken until success or a deadline expires; update calls to RegisterToken and assertions around ErrTokenAlreadyUsed accordingly so the test reliably observes eviction before attempting re-registration.management/internals/shared/grpc/server.go (1)
839-842: Silently bypassing claim whensessionStore == nilweakens the protection in production.If a future code path (or a misconfigured boot) constructs
Serverwithout asessionStore, JWT reuse protection silently disappears with no signal. Since the production constructor inboot.goalways wires one, consider treating a missingsessionStoreas a programmer error in production builds (or at least logging once at startup) rather than silently no-oping per request.If the
nilis exclusively to keep tests compiling, an alternative is to make the constructor toleratenilonce at init time and haveRegisterTokenalways be called via a non-nil interface (e.g., a no-op implementation injected in tests). Not blocking — keep current behavior if the test surface justifies it.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@management/internals/shared/grpc/server.go` around lines 839 - 842, The claimLoginToken function currently silently returns when s.sessionStore is nil, removing JWT reuse protection; change this by making a clear, deterministic handling: ensure the Server constructor (or boot.go wiring) requires a non-nil sessionStore or injects a no-op implementation for tests, and add a startup-time check that logs or fails fast if a real sessionStore is absent in production builds; alternatively, if you prefer a per-request guard, replace the silent return in claimLoginToken with an explicit log/error (and optionally a once-only metric) when s.sessionStore is nil so missing wiring is visible; reference: claimLoginToken, Server constructor/boot.go, and the sessionStore.RegisterToken call to locate the change points.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@management/server/auth/session.go`:
- Around line 33-56: The RegisterToken flow in SessionStore uses a non-atomic
Get-then-Set (s.cache.Get + s.cache.Set) allowing a TOCTOU race; fix by ensuring
atomic check-and-set: for single-instance deployments add an in-process lock
(e.g., a sync.Mutex or sharded mutex on the usedTokenKeyPrefix+hashToken(token)
key) to serialize RegisterToken calls before calling s.cache.Get/Set, or for
multi-instance Redis-backed caches replace the unconditional cache.Set with an
atomic SET ... NX (with expiration) against Redis so the Set is only successful
when the key did not previously exist; update RegisterToken to use the chosen
approach and keep the existing error handling
(ErrTokenAlreadyUsed/ErrTokenExpired).
---
Outside diff comments:
In `@management/internals/shared/grpc/server.go`:
- Around line 542-578: The validateToken flow currently calls claimLoginToken
before downstream checks, causing retries to fail with ErrTokenAlreadyUsed; move
the claimLoginToken call to after successful GetAccountIDFromUserAuth,
EnsureUserAccessByJWTGroups and SyncUserJWTGroups (i.e., only claim/register the
JWT once the full validation chain has succeeded) so transient downstream errors
can be retried idempotently; update the same change for the duplicate block
around the second occurrence (lines noted in the review) and ensure the final
success path performs the token registration/claiming (keeping RegisterToken as
the gating operation before returning success).
---
Nitpick comments:
In `@management/internals/shared/grpc/server.go`:
- Around line 839-842: The claimLoginToken function currently silently returns
when s.sessionStore is nil, removing JWT reuse protection; change this by making
a clear, deterministic handling: ensure the Server constructor (or boot.go
wiring) requires a non-nil sessionStore or injects a no-op implementation for
tests, and add a startup-time check that logs or fails fast if a real
sessionStore is absent in production builds; alternatively, if you prefer a
per-request guard, replace the silent return in claimLoginToken with an explicit
log/error (and optionally a once-only metric) when s.sessionStore is nil so
missing wiring is visible; reference: claimLoginToken, Server
constructor/boot.go, and the sessionStore.RegisterToken call to locate the
change points.
In `@management/server/auth/session_test.go`:
- Around line 60-73: The TTL eviction test
TestSessionStore_EntryEvictsAtTTLAndAllowsReRegistration is flaky due to tight
timing: widen the margin by either increasing the TTL and sleep (e.g., use 50ms
TTL + 500ms sleep or set both to larger values) or replace the fixed sleep with
a polling loop that retries RegisterToken until success or a deadline expires;
update calls to RegisterToken and assertions around ErrTokenAlreadyUsed
accordingly so the test reliably observes eviction before attempting
re-registration.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 1f6646b0-8e40-46a3-a730-7be0ae8b93d4
📒 Files selected for processing (11)
client/cmd/testutil_test.goclient/internal/engine_test.goclient/server/server_test.gomanagement/internals/server/boot.gomanagement/internals/server/controllers.gomanagement/internals/shared/grpc/server.gomanagement/server/auth/session.gomanagement/server/auth/session_test.gomanagement/server/management_proto_test.gomanagement/server/management_test.goshared/management/client/client_test.go
|



Describe your changes
Issue ticket number and link
Stack
Checklist
Documentation
Select exactly one:
Docs PR URL (required if "docs added" is checked)
Paste the PR link from https://github.com/netbirdio/docs here:
https://github.com/netbirdio/docs/pull/__
Summary by CodeRabbit
New Features
Tests