
docs(platform): add workload orchestration lifecycle, monitor failures, and workflow restart retry documentation #74095

Draft
Ian Alton (ian-at-airbyte) wants to merge 3 commits into master from devin/1772235147-workload-orchestration-docs
@ian-at-airbyte Ian Alton (ian-at-airbyte) commented Feb 27, 2026

What

Addresses documentation gaps around Airbyte's workload orchestration that were causing user confusion and poor chatbot answers. Three specific areas were identified:

  1. Users see a multi-minute gap in the logs between "APPLY Stage: LAUNCH" and "Attempting to update workload ... to LAUNCHED" and don't understand why.
  2. Users encounter WorkloadMonitorException with the message "Airbyte could not track the sync progress" and have no documentation to explain or debug it.
  3. After a workflow restart, users see "An internal transient Airbyte error has occurred. The sync should work fine on the next retry" but no automatic retry occurs — the existing retry docs don't cover this scenario.

How

Adds new sections to two existing documentation pages (no new pages created):

docs/platform/understanding-airbyte/jobs.md:

  • Workload Launch Pipeline — table of the 7 launcher pipeline stages (BUILD → CLAIM → LOAD_SHED → CHECK_STATUS → MUTEX → ARCHITECTURE → LAUNCH) with descriptions, plus an explanation of the LAUNCH → LAUNCHED delay (Kubernetes scheduling, image pulls, autoscaling).
  • Workload Monitor — documents the background cron that fails workloads missing their deadlines, with a table mapping each check type to its error message and likely root cause. Includes kubectl-based debugging steps.
  • Workflow Restarts and Retry Limits — explains that workflow restarts terminally fail all in-progress jobs (no automatic retry), reset retry counters, and clarifies the misleading "next retry" language in the error message.
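
The stage order listed for the Workload Launch Pipeline can be sketched as a small Python enum. This is only an illustration of the ordering named above, not the platform's code: per the review guide the real definitions live in Kotlin sources such as StageName.kt, and the names `LaunchStage` and `next_stage` are hypothetical.

```python
from enum import IntEnum
from typing import Optional


class LaunchStage(IntEnum):
    """Launcher pipeline stages, in the order listed in this PR (hypothetical model)."""

    BUILD = 1
    CLAIM = 2
    LOAD_SHED = 3
    CHECK_STATUS = 4
    MUTEX = 5
    ARCHITECTURE = 6
    LAUNCH = 7


def next_stage(stage: LaunchStage) -> Optional[LaunchStage]:
    """Return the stage that follows `stage`, or None once LAUNCH is reached."""
    if stage is LaunchStage.LAUNCH:
        return None
    return LaunchStage(stage + 1)


if __name__ == "__main__":
    # Print the stages in pipeline order.
    print(" -> ".join(s.name for s in LaunchStage))
```

LAUNCHED is deliberately absent from the enum: as described in this PR, it is a workload status set after the LAUNCH stage completes, not a pipeline stage.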

docs/platform/understanding-airbyte/heartbeats.md:

  • Related: Workload Monitor — brief cross-reference linking to the new Workload Monitor section in jobs.md, without duplicating platform-level details on the connector heartbeats page.

All technical details were derived from reading the platform source code (airbyte-platform-internal).

Review guide

  1. docs/platform/understanding-airbyte/jobs.md — the bulk of the changes. Three new sections inserted into the existing page structure.
    • Verify technical accuracy of pipeline stage descriptions against source (StageName.kt, LaunchPipeline.kt, individual stage classes).
    • Verify quoted error messages match what the code actually produces (WorkloadMonitor.kt, FailureHelper.kt, JobCreationAndStatusUpdateHelper.kt).
    • Check that the kubectl label selector (airbyte=workload) and container name (orchestrator) in the debugging steps are correct for real deployments.
    • Confirm the :::note admonition syntax renders correctly in Docusaurus.
  2. docs/platform/understanding-airbyte/heartbeats.md — two-sentence addition at the end. Check that the ./jobs.md#workload-monitor anchor link resolves correctly.
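
One way a reviewer might exercise the kubectl check above is to build the two commands and run them against a test cluster. The sketch below only assembles the command lines. The label selector `airbyte=workload` and the container name `orchestrator` are the values this PR asks reviewers to verify, so treat them as unconfirmed; the helper names, the namespace `airbyte`, and the pod name `example-pod` are placeholders introduced for illustration.

```python
import shlex
from typing import List

# Values quoted in the review guide; the PR flags them as needing verification.
WORKLOAD_LABEL_SELECTOR = "airbyte=workload"
ORCHESTRATOR_CONTAINER = "orchestrator"


def list_workload_pods_cmd(namespace: str) -> List[str]:
    """Build a `kubectl get pods` command filtered by the workload label."""
    return ["kubectl", "get", "pods", "-n", namespace, "-l", WORKLOAD_LABEL_SELECTOR]


def orchestrator_logs_cmd(namespace: str, pod_name: str) -> List[str]:
    """Build a `kubectl logs` command for the orchestrator container of one pod."""
    return ["kubectl", "logs", "-n", namespace, pod_name, "-c", ORCHESTRATOR_CONTAINER]


if __name__ == "__main__":
    # Namespace and pod name are placeholders; substitute real values.
    print(shlex.join(list_workload_pods_cmd("airbyte")))
    print(shlex.join(orchestrator_logs_cmd("airbyte", "example-pod")))
```

If the first command returns no pods on a deployment with active syncs, or the second reports that the container does not exist, the selector or container name in the docs needs correcting.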

Human review checklist

  • Error message strings in the Workload Monitor table match what users actually see in production
  • kubectl label selector and container name in debugging steps are correct
  • Claim that workflow restarts terminally fail the entire job (not just the attempt) and do not trigger automatic retries is accurate
  • Pipeline stage descriptions are accurate and complete

User Impact

Users and the AI chatbot will have documentation coverage for three previously undocumented platform behaviors: workload launch delays, WorkloadMonitorException, and workflow restart retry semantics. This should reduce support burden and improve chatbot answer quality for these topics.

Can this PR be safely reverted and rolled back?

  • YES 💚

Link to Devin run
Requested by: Ian Alton (@ian-at-airbyte)

…kflow restart retry docs

Co-Authored-By: ian.alton@airbyte.io <ian.alton@airbyte.io>



github-actions bot commented Feb 27, 2026

Deploy preview for airbyte-docs ready!

✅ Preview
https://airbyte-docs-qt5yb1olk-airbyte-growth.vercel.app

Built with commit ae3c810.
This pull request is being automatically deployed with vercel-action

Co-Authored-By: ian.alton@airbyte.io <ian.alton@airbyte.io>
Comment on lines 75 to 79
Airbyte also has a separate **platform-level Workload Monitor** that checks whether the workload pod itself is alive and progressing through its lifecycle (pending → claimed → launched → running). If the pod crashes, is OOM-killed, or never starts, the Workload Monitor fails the workload with the message:

> _"Airbyte could not track the sync progress. Sync process exited without reporting status."_

This error is surfaced as a `WorkloadMonitorException` and is distinct from the source/destination heartbeat errors described above. For details on how the Workload Monitor works and how to debug these errors, see [Workload Monitor](./jobs.md#workload-monitor).
Contributor Author

I like the details here, but as much as possible, I'd like to avoid talking about platform heartbeats on the connector-level heartbeats page. Let's keep the details separate, refer them over to that page, and document what needs to be documented.

Contributor

Makes sense — slimmed this down to just a two-sentence cross-reference with a link to the jobs.md section. No platform heartbeat details remain on this page. Pushed in ae3c810.




After the LAUNCH stage completes, the pipeline's success handler transitions the workload status to **LAUNCHED** via the Workload API.

#### Why is there a delay between LAUNCH and LAUNCHED?
Contributor Author

Rephrase as statement, not question

Contributor

Rephrased to "LAUNCH to LAUNCHED delay" — pushed in ae3c810.



…ss-reference

Co-Authored-By: ian.alton@airbyte.io <ian.alton@airbyte.io>