
feat(hosting): add Helm chart for Agenta OSS Kubernetes deployment#3852

Open
endoze wants to merge 7 commits into Agenta-AI:main from endoze:feat-add-helm-chart-for-agenta-oss

Conversation

Contributor

@endoze endoze commented Feb 26, 2026

Enable self-hosted Kubernetes deployments as an alternative to docker-compose. The chart packages all Agenta OSS components (API, web, services, workers, cron, Redis, SuperTokens, PostgreSQL) with Bitnami PostgreSQL as a subchart dependency, Alembic migrations as a pre-install/pre-upgrade hook, and an optional ingress resource. Includes a CI workflow to publish the chart to GHCR on changes.



@dosubot dosubot bot added the size:XXL This PR changes 1000+ lines, ignoring generated files. label Feb 26, 2026
@CLAassistant

CLAassistant commented Feb 26, 2026

CLA assistant check
All committers have signed the CLA.

@vercel

vercel bot commented Feb 26, 2026

@endoze is attempting to deploy a commit to the agenta projects Team on Vercel.

A member of the Team first needs to authorize it.

devin-ai-integration[bot]

This comment was marked as resolved.

Member

@mmabrouk mmabrouk left a comment


Thank you for putting this together. This is a solid Helm chart that correctly models our docker-compose architecture. The dual Redis setup, the existingSecret pattern, and the external database support are all done well. I appreciate the comprehensive documentation updates too.

I reviewed the chart against Helm community best practices and tested it locally. Below are my findings.


What I Did

I compared this PR against our current docker-compose infrastructure and an older Helm chart attempt (PR #2775). I also ran the chart through multiple validation layers: helm lint, helm template with dry-run, and a full install in a Kind cluster.
The lint and template steps passed. The cluster install revealed two issues that block deployment.


Critical Issues

1. Helm hook ordering causes a deadlock
The Alembic job uses pre-install,pre-upgrade hooks. However, it depends on PostgreSQL, which is part of the main release. Helm runs hooks before the main release. This means Alembic waits for a PostgreSQL that does not exist yet.
The install times out after 10 minutes with the init container stuck waiting.

To fix this, the agent suggests the following:

Change the hook to post-install,post-upgrade in templates/alembic-job.yaml:

```yaml
annotations:
  helm.sh/hook: post-install,post-upgrade
  helm.sh/hook-weight: "0"
```

2. PostgreSQL image tag does not exist
The bundled Bitnami PostgreSQL subchart (v16.4.16) defaults to an image tag that has been removed from Docker Hub:

```
Failed to pull image "docker.io/bitnami/postgresql:17.4.0-debian-12-r4": not found
```
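One way to reduce exposure to Bitnami's tag cleanup is to pin the subchart image explicitly in values.yaml. A sketch of what that could look like (the key layout follows the Bitnami chart's image block; the tag below is illustrative and should be verified against what Docker Hub currently publishes):

```yaml
# Hypothetical pin for the Bitnami PostgreSQL subchart image.
# Verify the tag is still published before using it.
postgresql:
  image:
    registry: docker.io
    repository: bitnami/postgresql
    tag: "17.4.0-debian-12-r10"  # example tag, not verified
```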

Best Practice Improvements

These are not blockers, but they would strengthen the chart.

Security contexts. The chart defaults to empty security contexts, which means containers run as root. The Helm community recommends setting secure defaults:

```yaml
securityContext:
  runAsNonRoot: true
  allowPrivilegeEscalation: false
  capabilities:
    drop: [ALL]
  seccompProfile:
    type: RuntimeDefault
```

Image tags default to latest. This makes deployments unpredictable. Consider defaulting to .Chart.AppVersion instead:

```yaml
{{ .Values.api.image.tag | default .Chart.AppVersion }}
```

No values.schema.json. A JSON Schema catches misconfiguration at install time rather than at runtime. This is especially helpful for required fields like secrets.agentaAuthKey.
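A minimal values.schema.json covering the required secret might look like this (field names follow the values referenced in this review; treat it as a starting sketch rather than a complete schema):

```json
{
  "$schema": "https://json-schema.org/draft-07/schema#",
  "type": "object",
  "required": ["secrets"],
  "properties": {
    "secrets": {
      "type": "object",
      "required": ["agentaAuthKey"],
      "properties": {
        "agentaAuthKey": {
          "type": "string",
          "minLength": 1,
          "description": "Required auth key for the Agenta API"
        }
      }
    }
  }
}
```

Helm validates values against this file automatically during install, upgrade, lint, and template.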

PostgreSQL password sync. Users must set both secrets.postgresPassword and postgresql.auth.password to the same value. If they mismatch, the app cannot connect. Consider wiring the subchart to use the chart-managed secret via postgresql.auth.existingSecret.
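A sketch of that wiring, assuming the chart-managed secret stores the password under a `postgres-password` key (key names here follow the Bitnami subchart's auth block; the secret name is illustrative):

```yaml
postgresql:
  auth:
    # Point the subchart at the chart-managed secret instead of
    # duplicating the password in postgresql.auth.password.
    existingSecret: agenta-pgauth  # illustrative secret name
    secretKeys:
      adminPasswordKey: postgres-password
```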

No startup probes. The deployments have liveness and readiness probes, but no startup probes. If the API takes longer than 30 seconds to start, Kubernetes will kill it. Startup probes give slow-starting containers more time.
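A startup probe sketch for the API deployment (the endpoint path, port name, and timings are assumptions to be tuned against the real service):

```yaml
startupProbe:
  httpGet:
    path: /health  # assumed health endpoint
    port: http
  # Allow up to 30 x 5s = 150s for startup; liveness/readiness
  # probes only begin once the startup probe succeeds.
  failureThreshold: 30
  periodSeconds: 5
```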

Empty resource defaults. All components default to resources: {}. This means pods get "BestEffort" QoS class and are first to be evicted under memory pressure. Consider adding suggested defaults or a production values example.
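For example, a suggested default that gives pods Burstable QoS (numbers are illustrative and should be tuned per component):

```yaml
resources:
  requests:
    cpu: 250m
    memory: 512Mi
  limits:
    memory: 1Gi  # omitting a CPU limit avoids throttling
```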

Missing .helmignore. Without this file, the packaged chart may include unnecessary files.

No lint step in CI. The GitHub Actions workflow packages and pushes to GHCR, but it does not run helm lint or ct lint. Adding these steps would catch issues before publishing.
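A lint step could run ahead of the package/push steps, along these lines (the action version and chart path are assumptions based on this PR's layout):

```yaml
- name: Set up Helm
  uses: azure/setup-helm@v4

- name: Lint chart
  run: helm lint hosting/helm/agenta-oss
```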

@endoze endoze force-pushed the feat-add-helm-chart-for-agenta-oss branch from 4a1e166 to e499e94 on February 27, 2026 at 15:53
@endoze
Contributor Author

endoze commented Feb 27, 2026

Thank you very much for such quick feedback on my contribution!

I've updated my commit to address your feedback. One major thing to note: I swapped the Bitnami PostgreSQL chart for a newer version, which will also deploy a newer version of PostgreSQL. My cursory look through the codebase led me to think this is a safe upgrade, but I'm curious about your thoughts on it. As for why I upgraded it: Bitnami only keeps so many old tags around before cleaning them up, so I chose a much newer version to prolong its viability. I can adjust this as needed, however, to use the new chart version while defaulting to a specific version of PostgreSQL that meets the project's database needs.

Let me know if you find any other issues and I'll do my best to address them.

devin-ai-integration[bot]

This comment was marked as resolved.

@endoze endoze force-pushed the feat-add-helm-chart-for-agenta-oss branch from e499e94 to ee80414 on February 28, 2026 at 03:12
devin-ai-integration[bot]

This comment was marked as resolved.

@endoze endoze force-pushed the feat-add-helm-chart-for-agenta-oss branch from ee80414 to 9cf74d2 on February 28, 2026 at 16:09
@mmabrouk
Member

mmabrouk commented Mar 1, 2026

Follow-up: Full Cluster Testing Results

I deployed the chart on a k3s cluster (v1.33) and tested end-to-end in a browser. Thank you for addressing all the points from my first review -- the post-install hook, PostgreSQL upgrade, security contexts, startup probes, values schema, lint CI step, and shared PostgreSQL secret all look good.

The chart works, but I found three bugs and one documentation gap during testing. Details below.


Bugs Found

1. appVersion is missing the v prefix (image pull fails)

Chart.yaml has appVersion: "0.86.8", but the GHCR images are tagged v0.86.8 (with the v). A default install without explicit image tag overrides will fail with ImagePullBackOff because the tag 0.86.8 does not exist.

Fix: change appVersion: "0.86.8" to appVersion: "v0.86.8" in Chart.yaml.
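For context on why the prefix matters: with the tag-defaulting pattern suggested in the first review, the rendered image reference comes straight from appVersion, so a default install resolves to something like (the repository shown is illustrative):

```yaml
# {{ .Values.api.image.tag | default .Chart.AppVersion }} renders as:
image: ghcr.io/agenta-ai/agenta-api:0.86.8   # tag not published on GHCR
# with appVersion: "v0.86.8" it becomes:
image: ghcr.io/agenta-ai/agenta-api:v0.86.8  # matches the published tags
```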

2. Web container is unreachable (Next.js binding)

Next.js 15 defaults to binding on the pod hostname, not 0.0.0.0. Health probes and ingress traffic connect via localhost or the pod IP, so they cannot reach the web server. All readiness/liveness probes fail and the pod enters CrashLoopBackOff.

Fix: add HOSTNAME=0.0.0.0 to the web deployment env vars in templates/web-deployment.yaml:

```yaml
env:
  - name: HOSTNAME
    value: "0.0.0.0"
```

3. runAsNonRoot: true crashes all pods

The security contexts set runAsNonRoot: true, but our Docker images currently run as root (USER is not set in the Dockerfiles). Every pod fails immediately with a security context violation.

Short-term fix: change the default to runAsNonRoot: false in values.yaml for all components.

Long-term: we need to update our Dockerfiles to run as non-root (tracked in #3868). Once that ships, the chart can flip back to true.


Nginx Ingress: Paths Need Regex Capture Groups

The ingress template uses plain paths (/api, /services, /) with pathType: Prefix. This works with Traefik's StripPrefix middleware, but not with nginx's rewrite-target annotation.

For nginx, the rewrite-target: /$1 annotation requires regex capture groups in the paths. Without them, $1 is empty and everything rewrites to /, causing a redirect loop on the web frontend.

The docs correctly tell nginx users to set rewrite-target and use-regex annotations, but the chart's hardcoded paths won't work with those annotations. Users would need to manually patch the ingress paths to:

```
/api/(.*)
/services/(.*)
/(.*)
```

with pathType: ImplementationSpecific.

This is tricky to fix in the template, since Traefik needs Prefix paths and nginx needs ImplementationSpecific regex paths. One option: add an ingress.pathOverrides value, or detect the className and switch path styles. Or just document it clearly for now and fix it in a follow-up.


Testing Summary

| Test | Result |
|------|--------|
| helm lint | Pass |
| helm template --dry-run | Pass |
| Cluster install (all 11 pods) | Pass (with the three fixes above) |
| helm test | Pass |
| Migration job (Alembic) | Completed successfully |
| Web UI in browser | Works (login, navigation) |
| API health | `{"status":"ok"}` |
| Services health | 200 OK |

Cluster: k3s v1.33 on Hetzner, nginx ingress controller, all images pulled with tag: latest.


I pushed a commit with all three fixes plus documentation improvements to your branch.

@mmabrouk
Member

mmabrouk commented Mar 1, 2026

Follow-up: Configurable ingress paths for NGINX support

Pushed a second commit (1c80b77) that makes ingress paths configurable via values.yaml.

Problem

The ingress template hardcoded Prefix paths (/api, /services, /). This works with Traefik's StripPrefix middleware, but NGINX Ingress Controller needs regex capture groups in the paths for rewrite-target to work. Without them, $1 is empty and the web frontend gets stuck in a redirect loop.

Fix

Added ingress.paths.{api,services,web} to values, each with path and pathType. Defaults are unchanged (Prefix), so Traefik setups are not affected.

NGINX users override like this:

```yaml
ingress:
  className: "nginx"
  annotations:
    nginx.ingress.kubernetes.io/use-regex: "true"
    nginx.ingress.kubernetes.io/rewrite-target: /$1
  paths:
    api:
      path: /api/(.*)
      pathType: ImplementationSpecific
    services:
      path: /services/(.*)
      pathType: ImplementationSpecific
    web:
      path: /(.*)
      pathType: ImplementationSpecific
```

Verified

Upgraded the chart on the test cluster (k3s + NGINX Ingress Controller) with the new path overrides. All routes work:

  • Web: 200 (follows redirect from / to /w)
  • API: {"status":"ok"}
  • Services: 200

Also updated values.schema.json and the Kubernetes deployment docs with the new fields and a complete NGINX example.

devin-ai-integration[bot]

This comment was marked as resolved.

@endoze
Contributor Author

endoze commented Mar 2, 2026

@mmabrouk Do you want me to address Devin-ai's latest comment (which does indeed seem to identify a real logical hole) or did you want to? Happy to do so but I don't want to step on anyone's toes 😄

@mmabrouk
Member

mmabrouk commented Mar 2, 2026

@endoze I'd be thankful if you did :)

endoze and others added 4 commits March 2, 2026 21:39
Enable self-hosted Kubernetes deployments as an alternative to
docker-compose. The chart packages all Agenta OSS components (API, web,
services, workers, cron, Redis, SuperTokens, PostgreSQL) with Bitnami
PostgreSQL as a subchart dependency, Alembic migrations as a
pre-install/pre-upgrade hook, and an optional ingress resource. Includes
a CI workflow to publish the chart to GHCR on changes.

- Fix appVersion to use v-prefixed tag (v0.86.8) matching GHCR images
- Add HOSTNAME=0.0.0.0 to web deployment so Next.js binds to all interfaces
- Change runAsNonRoot default to false (images currently run as root)
- Document PostgreSQL secret name dependency on release name
- Document ingress className default (traefik) with override instructions

The ingress template previously hardcoded Prefix paths which only work
with Traefik. NGINX Ingress Controller requires regex capture groups
and ImplementationSpecific pathType for rewrite-target to work.

Add ingress.paths.{api,services,web} to values.yaml so users can
override path patterns and pathType per backend. Defaults remain
Prefix (backward compatible with Traefik). Update docs with the
full nginx configuration including path overrides.

When users provide a pre-created Kubernetes Secret via
secrets.existingSecret, the Bitnami PostgreSQL subchart silently
fails to find the password unless global.postgresql.auth.existingSecret
is also pointed at the same secret. This adds a fail-fast validation
template and clearer NOTES.txt guidance so users get an actionable
error at install time instead of a broken deployment.
@endoze endoze force-pushed the feat-add-helm-chart-for-agenta-oss branch from 3f1dd6e to 26291b7 on March 3, 2026 at 02:40
@endoze
Contributor Author

endoze commented Mar 3, 2026

@mmabrouk I've rebased the branch off the latest from main as well as addressed the last bit of feedback from Devin-ai's review. Let me know if you need anything else on this branch.

@vercel

vercel bot commented Mar 3, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
|---------|------------|---------|---------------|
| agenta-documentation | Ready | Preview, Comment | Mar 3, 2026 11:55am |

@mmabrouk
Member

mmabrouk commented Mar 3, 2026

@all-contributors please add @endoze for infrastructure and docs and infra

@allcontributors
Contributor

@mmabrouk

I've put up a pull request to add @endoze! 🎉

Member

@mmabrouk mmabrouk left a comment


Many thanks @endoze this looks great!

@jp-agenta lgtm from my side, I did a final test locally on k3s and it all worked fine.

@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label Mar 3, 2026
@mmabrouk
Member

mmabrouk commented Mar 3, 2026

Hey @endoze, feel free to share your LinkedIn or Twitter if you would like to be mentioned in a post when we merge this

@endoze
Contributor Author

endoze commented Mar 4, 2026

@mmabrouk Just my GitHub if you'd like. I also sent over a pull request to handle running the containers as non-root, as a complement to this one. #3899 should enable hardening the defaults in the helm chart.

@jp-agenta
Member

jp-agenta commented Mar 6, 2026

Hey @endoze,

Thank you for contributing, and specifically for this PR. 🚀

Only two open issues (caught by our agents) remain before we can merge, leaving aside the root/non-root work.

1. postgresql-auth-secret.yaml is missing hook annotations

It looks like the bridge secret for the Bitnami subchart has no Helm hook lifecycle annotations.

secrets.yaml is a pre-install,pre-upgrade hook with helm.sh/resource-policy: keep.
postgresql-auth-secret.yaml has none of these annotations. It is a plain resource managed in the default Helm sync wave.

Two failure modes. First, Helm does not guarantee resource ordering within the same sync wave. The Bitnami PostgreSQL StatefulSet may attempt to start before this secret exists, causing the init container to fail to read POSTGRES_PASSWORD. The pod enters CrashLoopBackOff until the secret appears. Second, helm uninstall deletes this secret (since it has no resource-policy: keep), but the main secret survives (it does have keep). A subsequent helm install fails because the main secret still exists in the namespace while the pgauth secret is gone and the Bitnami subchart cannot read credentials. The user must manually delete the orphaned main secret or recreate the pgauth secret before re-installing.

Suggestion: Add the same hook annotations as secrets.yaml:

```yaml
annotations:
  helm.sh/hook: pre-install,pre-upgrade
  helm.sh/hook-weight: "-5"
  helm.sh/resource-policy: keep
```

This ensures the pgauth secret is created before any release resources and survives helm uninstall alongside the main secret.

2. Release name mismatch with pgauth secret default

The default global.postgresql.auth.existingSecret value only works for release name agenta

global.postgresql.auth.existingSecret defaults to the hardcoded string agenta-pgauth. The Bitnami PostgreSQL subchart reads this value as a plain string at template render time and cannot evaluate Helm template expressions. The values.yaml comments explain this and advise users with non-default release names to override the value. The existing _validations.tpl only validates the existingSecret case, not the release name mismatch.

A user who runs helm install myrelease hosting/helm/agenta-oss ... gets a chart where every resource is named myrelease-agenta-oss-* except the pgauth secret, which is still named agenta-pgauth. This works (both sides agree on the name) but creates a naming inconsistency that is confusing when inspecting resources. More importantly, two releases in the same namespace would collide on the same agenta-pgauth secret name.

Suggestion: Add a validation in _validations.tpl that detects when the release name would cause a fullname different from agenta-agenta-oss while the pgauth secret still has its default value:

```yaml
{{- if and .Values.postgresql.enabled
          (not .Values.secrets.existingSecret)
          (ne (include "agenta.fullname" .) "agenta-agenta-oss")
          (eq .Values.global.postgresql.auth.existingSecret "agenta-pgauth") }}
{{- fail "..." }}
{{- end }}
```

This catches the mismatch at helm install time rather than producing a silent naming inconsistency.

@jp-agenta jp-agenta added runtime/kubernetes Kubernetes runtime maintenance/community Primarily under community maintenance support/experimental Experimental or best-effort repository surface and removed runtime/kubernetes Kubernetes runtime labels Mar 6, 2026
@jp-agenta jp-agenta moved this from Todo to In Progress in Kubernetes (community) Mar 6, 2026
Contributor

@devin-ai-integration devin-ai-integration bot left a comment


Devin Review found 1 new potential issue.

View 22 additional findings in Devin Review.


Comment on lines +63 to +74
```bash
helm install agenta hosting/helm/agenta-oss \
--namespace agenta --create-namespace \
--set secrets.agentaAuthKey=$AG_AUTH_KEY \
--set secrets.agentaCryptKey=$AG_CRYPT_KEY \
--set secrets.postgresPassword=$PG_PASS \
--set postgresql.auth.password=$PG_PASS
```

:::info
`secrets.postgresPassword` and `postgresql.auth.password` must match. The first is used by the application pods; the second is used by the Bitnami PostgreSQL subchart to set the database password.
:::
Contributor


🟡 Documentation instructs setting postgresql.auth.password which is silently ignored

The quick start install command includes --set postgresql.auth.password=$PG_PASS and states that secrets.postgresPassword and postgresql.auth.password must match because "the second is used by the Bitnami PostgreSQL subchart to set the database password." This is factually incorrect. The chart's values.yaml:26 sets global.postgresql.auth.existingSecret: "agenta-pgauth", which causes the Bitnami subchart to read the password exclusively from the pgauth secret (created by hosting/helm/agenta-oss/templates/postgresql-auth-secret.yaml from secrets.postgresPassword). When existingSecret is configured, the Bitnami chart completely ignores auth.password. This means --set postgresql.auth.password=$PG_PASS is a no-op.

If a user later does helm upgrade and only changes postgresql.auth.password (believing it updates the database password), the actual password remains unchanged. Or if a user sets them to different values, the database uses secrets.postgresPassword while the user believes postgresql.auth.password controls it, causing operational confusion during debugging.

Suggested change
```bash
helm install agenta hosting/helm/agenta-oss \
--namespace agenta --create-namespace \
--set secrets.agentaAuthKey=$AG_AUTH_KEY \
--set secrets.agentaCryptKey=$AG_CRYPT_KEY \
--set secrets.postgresPassword=$PG_PASS \
--set postgresql.auth.password=$PG_PASS
```
:::info
`secrets.postgresPassword` and `postgresql.auth.password` must match. The first is used by the application pods; the second is used by the Bitnami PostgreSQL subchart to set the database password.
:::
```bash
helm install agenta hosting/helm/agenta-oss \
--namespace agenta --create-namespace \
--set secrets.agentaAuthKey=$AG_AUTH_KEY \
--set secrets.agentaCryptKey=$AG_CRYPT_KEY \
--set secrets.postgresPassword=$PG_PASS
```

:::info
`secrets.postgresPassword` is used both by the application pods and by the Bitnami PostgreSQL subchart (via the chart-managed pgauth secret).
:::



Member

@jp-agenta jp-agenta Mar 6, 2026


@endoze ☝️

@jp-agenta
Member

On another note,

You prompted us to start community projects, and Kubernetes is the first one: Kubernetes (community).

Thank you @endoze !
