
feat: provision per-host Postgres user for RAG service instances #299

Open
tsivaprasad wants to merge 1 commit into PLAT-488-rag-service-api-design-validation from PLAT-489-rag-service-service-user-provisioning

Conversation


@tsivaprasad tsivaprasad commented Mar 16, 2026

Summary

This PR introduces RAGServiceUserRole, a new resource that creates a dedicated PostgreSQL database user for each RAG service instance running on its co-located host.

Changes

  • rag_service_user_role.go — New resource keyed by serviceInstanceID (not serviceID). Handles Create/Delete/Refresh lifecycle. Refresh queries pg_catalog.pg_roles to verify role existence. connectToColocatedPrimary filters instances by HostID before Patroni primary lookup, ensuring the role lands on the correct node.

  • orchestrator.go — Refactored GenerateServiceInstanceResources to dispatch by ServiceType. MCP logic extracted into generateMCPInstanceResources. New generateRAGInstanceResources and shared buildServiceInstanceResources helper added.

  • resources.go — Registers ResourceTypeRAGServiceUserRole.

  • plan_update.go — Skips ServiceInstanceMonitorResource for RAG instances since no Docker container exists yet (swarm.service_instance dependency would be unsatisfied).
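The dispatch described for orchestrator.go can be pictured roughly as follows. This is a minimal sketch and not the PR's code: the `ServiceType` values, the function name, and the resource names in the returned slices are placeholders chosen for illustration. The only detail taken from the description is that a RAG instance currently gets a database user and nothing else.

```go
package main

import "fmt"

// ServiceType and its values are hypothetical stand-ins for the real types.
type ServiceType string

const (
	ServiceTypeMCP ServiceType = "mcp"
	ServiceTypeRAG ServiceType = "rag"
)

// generateInstanceResources dispatches on the service type and returns the
// names of the resources to plan. All names here are illustrative only.
func generateInstanceResources(t ServiceType) ([]string, error) {
	switch t {
	case ServiceTypeMCP:
		// Placeholder list; the real MCP resource set is not shown in this PR text.
		return []string{"service_user_role", "service_instance", "service_instance_monitor"}, nil
	case ServiceTypeRAG:
		// Only the database user is provisioned: no container, so no monitor.
		return []string{"rag_service_user_role"}, nil
	default:
		return nil, fmt.Errorf("unsupported service type %q", t)
	}
}

func main() {
	for _, t := range []ServiceType{ServiceTypeMCP, ServiceTypeRAG, "other"} {
		res, err := generateInstanceResources(t)
		fmt.Println(t, res, err)
	}
}
```

Returning an error for an unknown type, rather than falling through to a default branch, keeps a newly added service type from silently inheriting another type's resources.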

Testing

  • Covered by unit tests

Manual verification:
  1. Created a cluster.

  2. Created a database using the following command:
    restish control-plane-local-1 create-database < ../demo/488/rag_create_db.json
    rag_create_db.json

  3. The database was created successfully.

  4. Connected to the database and confirmed that the RAG service user was created:

storefront=# SELECT r.rolname AS role, m.rolname AS member FROM pg_auth_members am JOIN pg_roles r ON r.oid = am.roleid JOIN pg_roles m ON m.oid = am.member WHERE m.rolname LIKE 'svc_%';
             role             |          member           
------------------------------+---------------------------
 pgedge_application_read_only | svc_storefront_rag_host_1
(1 row)

Checklist

  • Tests added

Notes for Reviewers

  • Why per-host (keyed by serviceInstanceID) instead of per-service? CREATE ROLE is not replicated by Spock in a multi-active setup.
  • Why no monitor resource for RAG? ServiceInstanceMonitorResource.Dependencies() hard-codes a dependency on swarm.service_instance. Since this PR provisions only the DB user (no container), that dependency would be unsatisfied and fail the planner.
  • Credential re-use on reconciliation: If Refresh finds the role missing in pg_roles, it returns ErrNotFound → Create is called again with a new password. This is intentional — avoids stale credential state across node migrations.

PLAT-489


coderabbitai bot commented Mar 16, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

@tsivaprasad tsivaprasad force-pushed the PLAT-489-rag-service-service-user-provisioning branch from 57345d8 to 4392b87 Compare March 16, 2026 17:04
@tsivaprasad
Contributor Author

@coderabbitai review


coderabbitai bot commented Mar 17, 2026

@tsivaprasad Sure, I'll review the changes in this PR!


🧠 Learnings used
Learnt from: rshoemaker
Repo: pgEdge/control-plane PR: 266
File: server/internal/workflows/update_database.go:122-153
Timestamp: 2026-02-09T21:47:53.293Z
Learning: In server/internal/workflows/update_database.go and provision_services.go (Go): Re-running ProvisionServices on already-provisioned service instances is intentional and not destructive. StoreServiceInstance performs an upsert that may transiently reset state to "creating," but the monitor quickly transitions it back to "running" (matching the database instance recovery pattern). CreateServiceUser is SQL-level idempotent (CREATE ROLE IF NOT EXISTS / ALTER ROLE), and resource deployment diffs desired vs. current state. This design supports recovery scenarios where a user retries with a corrected spec after a service failure.

Learnt from: rshoemaker
Repo: pgEdge/control-plane PR: 287
File: server/internal/orchestrator/swarm/service_user_role.go:59-60
Timestamp: 2026-03-11T16:05:35.061Z
Learning: In pgEdge/control-plane server/internal/orchestrator/swarm/service_user_role.go (Go): A ResourceVersion bump (e.g., "2" → "3") in ServiceUserRole is a documentation/safety marker only. When the resource Identifier key changes (e.g., from ServiceInstanceID to ServiceID), recreation is driven automatically by the (Type, ID) key mismatch — old resources with the previous identifier are deleted and new ones are created. Do not flag a ResourceVersion bump as a missing recreate trigger when an Identifier change is already causing full recreate.
✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

defer conn.Close(ctx)

var exists bool
err = conn.QueryRow(ctx,
Member

Non-blocking suggestion: the inverse of this query is repeated in server/internal/postgres/roles.go in CreateUserRole. We could refactor it out into its own function and reuse it here and in CreateUserRole, like:

func UserRoleNeedsCreate(name string) Query[bool] {
	return Query[bool]{
		SQL: "SELECT NOT EXISTS (SELECT 1 FROM pg_catalog.pg_roles WHERE rolname = @name);",
		Args: pgx.NamedArgs{
			"name": name,
		},
	}
}

Then you would invert the logic here:

needsCreate, err := postgres.UserRoleNeedsCreate(r.Username).Exec(ctx, conn)
if err != nil {
	// ...
}
if needsCreate {
	return resource.ErrNotFound
}

logger.Info().Msg("creating RAG service user role")

r.Username = database.GenerateServiceUsername(r.ServiceInstanceID)
password, err := utils.RandomString(32)
Member

It's not safe to rely on Create only getting called once, so we need to try and make all resource lifecycle methods idempotent. Could you change this to only generate a new password if the password is unset?:

if r.Password == "" {
	password, err := utils.RandomString(32)
	if err != nil {
		return fmt.Errorf("failed to generate password: %w", err)
	}
	r.Password = password
}
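
As a standalone illustration of the idempotency point, the sketch below generates a password only when none is set, so calling it twice leaves the stored credential unchanged. The `userRole` type, `ensurePassword`, and `randomString` are hypothetical; `randomString` merely stands in for `utils.RandomString`, whose implementation is not shown in this thread.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// userRole is a hypothetical stand-in for the resource under review.
type userRole struct {
	Username string
	Password string
}

// randomString returns n hex characters (n must be even). It is an assumed
// replacement for utils.RandomString, not that function's real implementation.
func randomString(n int) (string, error) {
	buf := make([]byte, n/2)
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}
	return hex.EncodeToString(buf), nil
}

// ensurePassword generates a password only when none is set, which makes
// repeated calls safe.
func (r *userRole) ensurePassword() error {
	if r.Password != "" {
		return nil
	}
	p, err := randomString(32)
	if err != nil {
		return fmt.Errorf("failed to generate password: %w", err)
	}
	r.Password = p
	return nil
}

func main() {
	r := &userRole{Username: "svc_example"}
	if err := r.ensurePassword(); err != nil {
		panic(err)
	}
	first := r.Password
	if err := r.ensurePassword(); err != nil {
		panic(err)
	}
	// Reports whether the second call left the password untouched.
	fmt.Println("password unchanged after second call:", first == r.Password)
}
```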

).Scan(&exists)
if err != nil {
// On query failure, assume it exists
logger.Warn().Err(err).Msg("pg_roles query failed, assuming RAG role exists")
Member

This doesn't seem like a safe assumption to me. IMO we should return this error and fail the Refresh so that the user can resolve the issue and retry the operation.
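
One way to picture the suggested behaviour is a Refresh that distinguishes three outcomes: the role exists, the role is missing, and the lookup itself failed. The sketch below is an assumption-laden illustration, not the project's code: `roleChecker`, `refresh`, and `errQuery` are hypothetical names, and `ErrNotFound` is a local stand-in for `resource.ErrNotFound`.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrNotFound is a local stand-in for resource.ErrNotFound.
var ErrNotFound = errors.New("resource not found")

// errQuery simulates a failed pg_roles query in the examples below.
var errQuery = errors.New("connection reset")

// roleChecker is a hypothetical abstraction over the pg_roles lookup.
type roleChecker func(name string) (exists bool, err error)

// refresh surfaces query failures instead of assuming the role exists, and
// reports ErrNotFound only when the lookup succeeded and found nothing.
func refresh(check roleChecker, username string) error {
	exists, err := check(username)
	if err != nil {
		return fmt.Errorf("failed to check role %q: %w", username, err)
	}
	if !exists {
		return ErrNotFound
	}
	return nil
}

func main() {
	present := func(string) (bool, error) { return true, nil }
	missing := func(string) (bool, error) { return false, nil }
	broken := func(string) (bool, error) { return false, errQuery }

	fmt.Println(refresh(present, "svc_example"))
	fmt.Println(refresh(missing, "svc_example"))
	fmt.Println(refresh(broken, "svc_example"))
}
```

Keeping the failure case separate from ErrNotFound means a transient connection problem fails the operation and can be retried, rather than being mistaken for either a healthy role or a missing one.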

Logger()
logger.Info().Msg("deleting RAG service user from database")

conn, err := r.connectToColocatedPrimary(ctx, rc, logger, "postgres")
Member

Change this to connect to r.DatabaseName so that Spock will replicate the user to every instance.

return nil
}

// connectToColocatedPrimary finds the primary Postgres instance on the same
Member

The things we're doing in connectToColocatedPrimary, resolveColocatedPrimary, colocatedInstances, and findPrimaryAmong are not necessary. You can rely on two things in this resource:

  • Your lifecycle methods are being evaluated on the host with the primary instance because you're using resource.PrimaryExecutor(r.NodeName)
  • As long as you connect to r.DatabaseName, Spock will replicate whichever operations you perform on the user

So, you can use the same methods we use in subscription_resource.go:

	primary, err := database.GetPrimaryInstance(ctx, rc, r.NodeName)
	if err != nil {
		return fmt.Errorf("failed to get primary instance: %w", err)
	}
	conn, err := primary.Connection(ctx, rc, r.DatabaseName)
	if err != nil {
		return fmt.Errorf("failed to connect to database %s on node %s: %w", r.DatabaseName, r.NodeName, err)
	}
	defer conn.Close(ctx)

// The role is created on the primary of the co-located Postgres instance
// (same HostID) and granted the pgedge_application_read_only built-in role.
type RAGServiceUserRole struct {
ServiceInstanceID string `json:"service_instance_id"`
Member

We only need one of these resources per rag service, so we can remove this property.

Suggested change
- ServiceInstanceID string `json:"service_instance_id"`

}

func (r *RAGServiceUserRole) Identifier() resource.Identifier {
return RAGServiceUserRoleIdentifier(r.ServiceInstanceID)
Member

We only need one of these per service, so this should use r.ServiceID:

Suggested change
- return RAGServiceUserRoleIdentifier(r.ServiceInstanceID)
+ return RAGServiceUserRoleIdentifier(r.ServiceID)
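
To illustrate why the key matters, the sketch below assumes resources are identified by a (Type, ID) pair, as the earlier learning note describes. Keying by service ID makes every instance of a service resolve to the same resource. The `Identifier` struct layout, the type string, and the instance names are hypothetical.

```go
package main

import "fmt"

// Identifier is a simplified, hypothetical version of resource.Identifier.
type Identifier struct {
	Type string
	ID   string
}

// resourceTypeRAGServiceUserRole is a placeholder value for illustration.
const resourceTypeRAGServiceUserRole = "rag_service_user_role"

// ragServiceUserRoleIdentifier keys the resource by service ID, so all
// instances of one service share a single user role resource.
func ragServiceUserRoleIdentifier(serviceID string) Identifier {
	return Identifier{Type: resourceTypeRAGServiceUserRole, ID: serviceID}
}

// instance pairs a service with one of its per-host instances.
type instance struct {
	ServiceID         string
	ServiceInstanceID string
}

// distinctRoles counts how many user role resources a set of instances yields.
func distinctRoles(instances []instance) int {
	seen := map[Identifier]bool{}
	for _, in := range instances {
		seen[ragServiceUserRoleIdentifier(in.ServiceID)] = true
	}
	return len(seen)
}

func main() {
	instances := []instance{
		{ServiceID: "rag", ServiceInstanceID: "rag-host-1"},
		{ServiceID: "rag", ServiceInstanceID: "rag-host-2"},
	}
	fmt.Println("distinct user role resources:", distinctRoles(instances))
}
```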
