docs(platform): add workload orchestration lifecycle, monitor failures, and workflow restart retry documentation #74095
…kflow restart retry docs Co-Authored-By: ian.alton@airbyte.io <ian.alton@airbyte.io>
Deploy preview for airbyte-docs ready! ✅ Preview built with commit ae3c810.
Co-Authored-By: ian.alton@airbyte.io <ian.alton@airbyte.io>
Airbyte also has a separate **platform-level Workload Monitor** that checks whether the workload pod itself is alive and progressing through its lifecycle (pending → claimed → launched → running). If the pod crashes, is OOM-killed, or never starts, the Workload Monitor fails the workload with the message:

> _"Airbyte could not track the sync progress. Sync process exited without reporting status."_

This error is surfaced as a `WorkloadMonitorException` and is distinct from the source/destination heartbeat errors described above. For details on how the Workload Monitor works and how to debug these errors, see [Workload Monitor](./jobs.md#workload-monitor).
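The behavior described in this passage can be illustrated with a minimal sketch. It is not the platform's actual `WorkloadMonitor.kt` logic: the `Workload` class, the `advance` and `monitor` functions, the `FAILURE` state name, and the timeout value are all hypothetical and exist only for illustration. Only the lifecycle order and the failure message come from the text above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional


class WorkloadStatus(Enum):
    PENDING = "pending"
    CLAIMED = "claimed"
    LAUNCHED = "launched"
    RUNNING = "running"
    FAILURE = "failure"  # hypothetical terminal state name for this sketch


# Forward-only lifecycle order, as listed in the docs text.
LIFECYCLE = [
    WorkloadStatus.PENDING,
    WorkloadStatus.CLAIMED,
    WorkloadStatus.LAUNCHED,
    WorkloadStatus.RUNNING,
]

FAILURE_MESSAGE = (
    "Airbyte could not track the sync progress. "
    "Sync process exited without reporting status."
)


@dataclass
class Workload:
    status: WorkloadStatus
    last_update: datetime  # when the workload last changed status


def advance(workload: Workload, now: datetime) -> None:
    """Move the workload to the next lifecycle stage and record the time."""
    idx = LIFECYCLE.index(workload.status)  # ValueError if already failed
    if idx + 1 >= len(LIFECYCLE):
        raise ValueError("workload is already running")
    workload.status = LIFECYCLE[idx + 1]
    workload.last_update = now


def monitor(workload: Workload, now: datetime, timeout: timedelta) -> Optional[str]:
    """Fail the workload if it has made no progress within `timeout`.

    Returns the failure message when the workload is failed, else None.
    The single shared timeout is an assumption of this sketch.
    """
    if workload.status is WorkloadStatus.FAILURE:
        return None
    if now - workload.last_update > timeout:
        workload.status = WorkloadStatus.FAILURE
        return FAILURE_MESSAGE
    return None
```

In this sketch, a workload that stays in one stage longer than the timeout is marked failed regardless of which stage it was in. That mirrors the idea that the monitor tracks the pod's progress rather than connector heartbeats.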
I like the details here, but as much as possible, I'd like to avoid talking about platform heartbeats on the connector-level heartbeats page. Let's keep the details separate, refer them over to that page, and document what needs to be documented.
Makes sense — slimmed this down to just a two-sentence cross-reference with a link to the jobs.md section. No platform heartbeat details remain on this page. Pushed in ae3c810.
After the LAUNCH stage completes, the pipeline's success handler transitions the workload status to **LAUNCHED** via the Workload API.

#### Why is there a delay between LAUNCH and LAUNCHED?
Rephrase as statement, not question
…ss-reference Co-Authored-By: ian.alton@airbyte.io <ian.alton@airbyte.io>
What
Addresses documentation gaps around Airbyte's workload orchestration that were causing user confusion and poor chatbot answers. Three specific areas were identified:
Users see `APPLY Stage: LAUNCH` and `Attempting to update workload ... to LAUNCHED` in logs and don't understand why.
Users encounter WorkloadMonitorException with the message "Airbyte could not track the sync progress" and have no documentation to explain or debug it.
How
Adds new sections to two existing documentation pages (no new pages created):
docs/platform/understanding-airbyte/jobs.md:
docs/platform/understanding-airbyte/heartbeats.md:
All technical details were derived from reading the platform source code (airbyte-platform-internal).
Review guide
docs/platform/understanding-airbyte/jobs.md: the bulk of the changes. Three new sections inserted into the existing page structure.
StageName.kt, LaunchPipeline.kt, individual stage classes).
WorkloadMonitor.kt, FailureHelper.kt, JobCreationAndStatusUpdateHelper.kt).
kubectl label selector (airbyte=workload) and container name (orchestrator) in the debugging steps are correct for real deployments.
:::note admonition syntax renders correctly in Docusaurus.
docs/platform/understanding-airbyte/heartbeats.md: two-sentence addition at the end. Check that the ./jobs.md#workload-monitor anchor link resolves correctly.
Human review checklist
kubectl label selector and container name in debugging steps are correct
User Impact
Users and the AI chatbot will have documentation coverage for three previously undocumented platform behaviors: workload launch delays, WorkloadMonitorException, and workflow restart retry semantics. This should reduce support burden and improve chatbot answer quality for these topics.
Can this PR be safely reverted and rolled back?
Link to Devin run
Requested by: Ian Alton (@ian-at-airbyte)