Skip to content

BED-4600: Add Request Timeout#188

Merged
StranDutton merged 7 commits intomainfrom
BED-4600-ingest-timeout-bug
Apr 17, 2026
Merged

BED-4600: Add Request Timeout#188
StranDutton merged 7 commits intomainfrom
BED-4600-ingest-timeout-bug

Conversation

@StranDutton
Copy link
Copy Markdown
Contributor

@StranDutton StranDutton commented Apr 15, 2026

Ticket: BED-4600

Problem: Customer reported AzureHound collections not completing. After analysis, it seems that the issue may have been that we were not failing gracefully if any api call hung - we would block indefinitely.

This PR addresses that behavior so we do not block the entire pipeline if a single call hangs.

An improvement I'd like to make as a result of this would be to implement proper status reporting back to BHE so the user knows that only a partial collection was completed and that some objects may not have been collected - that is outside of the scope of this ticket though.

Summary by CodeRabbit

  • Bug Fixes

    • Added per-page request timeouts (2 minutes) for paginated network operations to avoid hanging requests and improve responsiveness.
  • Tests

    • Added tests covering successful pagination, enforced per-page timeouts, parent-context cancellation handling, error propagation, and graceful termination of pagination.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 15, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 5cb3533d-da82-451c-963e-64a79c99f2f1

📥 Commits

Reviewing files that changed from the base of the PR and between 9057951 and cdd9104.

📒 Files selected for processing (1)
  • client/client_test.go
✅ Files skipped from review due to trivial changes (1)
  • client/client_test.go

Walkthrough

The changes add a package-level 2-minute per-page request timeout in getAzureObjectList, deriving a pageCtx for each page request and using it for all network calls; pageCtx is cancelled on errors or after successful decode. New tests exercise success, timeout, and parent-cancellation behaviors.

Changes

Cohort / File(s) Summary
Timeout Implementation
client/client.go
Add var pageRequestTimeout = 2 * time.Minute. getAzureObjectList now creates a per-page pageCtx via context.WithTimeout(ctx, pageRequestTimeout) and uses it for all network calls; pageCtx is explicitly cancelled on error paths and after successful decode.
Test Coverage
client/client_test.go
New tests with fakeRestClient covering: success (two items emitted), timeout (override pageRequestTimeout, simulate hung request and assert deadline), and parent cancellation (cancel parent ctx and assert error propagation and channel closure).

Sequence Diagram(s)

sequenceDiagram
    participant Caller
    participant getAzureObjectList as getAzureObjectList()
    participant Context as Context Management
    participant HTTP as client.Get / client.Send
    participant Decoder as JSON Decoder
    participant Out as Result Channel

    Caller->>getAzureObjectList: invoke with parent ctx
    loop per page
        getAzureObjectList->>Context: WithTimeout(parent ctx, 2m) -> pageCtx
        getAzureObjectList->>HTTP: request using pageCtx
        alt response OK
            HTTP-->>getAzureObjectList: HTTP 200 + body
            getAzureObjectList->>Decoder: decode body
            Decoder-->>getAzureObjectList: items
            getAzureObjectList->>Context: cancel pageCtx
            getAzureObjectList->>Out: send items
            Out-->>Caller: items received
        else error / timeout
            HTTP--x getAzureObjectList: error / ctx.Done
            getAzureObjectList->>Context: cancel pageCtx
            getAzureObjectList->>Out: send error result
            Out-->>Caller: error received
        end
    end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 A two-minute hop per page I keep,

Through contexts deep and deadlines steep,
I cancel when decode is done,
Or when the hung requests have won,
Hooray—no stray calls in my heap!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly identifies the main change: adding request timeout functionality to handle API calls that hang indefinitely, addressing the core objective of preventing pipeline blocking.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch BED-4600-ingest-timeout-bug

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@StranDutton StranDutton self-assigned this Apr 17, 2026
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (3)
client/client_test.go (2)

39-39: fakeRestClient.Send returns (nil, nil) — silent NPE risk for any future test exercising pagination.

The nextLink branch in getAzureObjectList (client.go lines 129-151) calls client.Send(req) and then dereferences res.Body. None of the current tests trigger that path, but a future test that returns a payload containing @odata.nextLink will nil-panic in the second iteration. Consider returning an explicit error so the failure mode is loud:

-func (s *fakeRestClient) Send(req *http.Request) (*http.Response, error) { return nil, nil }
+func (s *fakeRestClient) Send(req *http.Request) (*http.Response, error) {
+	return nil, fmt.Errorf("fakeRestClient.Send not implemented")
+}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/client_test.go` at line 39, The fakeRestClient.Send implementation
returns (nil, nil) which can cause a nil-pointer dereference when
getAzureObjectList calls client.Send(req) and reads res.Body; change
fakeRestClient.Send in client_test.go to return a non-nil error (e.g.,
fmt.Errorf with a clear message) when no response is intended, or return a valid
*http.Response with a non-nil Body for tests that exercise pagination; update
the Send method to either construct and return an appropriate http.Response
(with io.NopCloser) for success cases or return (nil, err) to fail loudly so
tests won't silently NPE in getAzureObjectList.

70-74: Mutating the package-level pageRequestTimeout makes these tests parallel-unsafe.

TestGetAzureObjectList_HungResponseTimesOut and TestGetAzureObjectList_StalledResponseBody_Integration both write to the package global pageRequestTimeout and restore it via defer. As written this works only because go test runs tests in the same package serially by default; adding t.Parallel() to any test in this package — or to a future test that reads the timeout — would introduce a data race that -race would flag.

Two cleaner options:

  1. Inject the timeout via a function parameter or a struct field on a future azureClient so tests don't need to touch globals.
  2. At minimum, document the constraint and avoid t.Parallel() here.

Also applies to: 142-145

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/client_test.go` around lines 70 - 74, The tests mutate the
package-level pageRequestTimeout (used in
TestGetAzureObjectList_HungResponseTimesOut and
TestGetAzureObjectList_StalledResponseBody_Integration), which is
parallel-unsafe; instead refactor the code so the timeout is injected (e.g., add
a timeout parameter to the function that fetches pages or add a timeout field to
the azureClient struct and use that value instead of the global), update the
tests to construct an azureClient with the shorter timeout (or call the function
with the timeout) rather than writing to pageRequestTimeout, and remove any
direct writes to the global to eliminate races (if refactoring is infeasible,
document the serial-only constraint and ensure these tests do not call
t.Parallel()).
client/client.go (1)

38-39: Consider making pageRequestTimeout configurable.

A hard-coded 2-minute timeout is reasonable for the reported BED-4600 hang scenario, but environments with large pages or high-latency tenants may legitimately exceed it (especially since the timeout also bounds body read/decode at line 161, not just header receipt). Exposing this via the existing config.Config (or a CLI flag) would let operators tune it without a code change. Out-of-scope deferral is fine; flagging for future iteration.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/client.go` around lines 38 - 39, Replace the hard-coded var
pageRequestTimeout = 2 * time.Minute with a configurable value sourced from the
existing config (e.g., add a PageRequestTimeout time.Duration field to
config.Config with a default of 2m) and use that value wherever
pageRequestTimeout is referenced (including the body read/decode timeout usage).
Update the client initialization (where the client is constructed or config is
passed) to read config.Config.PageRequestTimeout and fall back to 2*time.Minute
if zero, and ensure any tests or constructors that instantiate the client accept
or pass the config so operators can tune the timeout without changing code.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@client/client_test.go`:
- Around line 92-100: The tests read from the out channel but skip assertions
when the channel is closed (ok==false), allowing false positives; update each
failing test (TestGetAzureObjectList_TimedOut,
TestGetAzureObjectList_ParentContextCanceled,
TestGetAzureObjectList_StalledResponseBody_Integration) to explicitly assert the
receiver succeeded before checking the result: when receiving from out do
"result, ok := <-out" then require.True(t, ok, "expected a value on out, channel
closed") and only after that run require.Error(t, result.Error, ...) and
require.ErrorIs(t, result.Error, context.DeadlineExceeded) (or the existing
error checks) so the test fails if the channel closed without a sent AzureResult
from getAzureObjectList.

---

Nitpick comments:
In `@client/client_test.go`:
- Line 39: The fakeRestClient.Send implementation returns (nil, nil) which can
cause a nil-pointer dereference when getAzureObjectList calls client.Send(req)
and reads res.Body; change fakeRestClient.Send in client_test.go to return a
non-nil error (e.g., fmt.Errorf with a clear message) when no response is
intended, or return a valid *http.Response with a non-nil Body for tests that
exercise pagination; update the Send method to either construct and return an
appropriate http.Response (with io.NopCloser) for success cases or return (nil,
err) to fail loudly so tests won't silently NPE in getAzureObjectList.
- Around line 70-74: The tests mutate the package-level pageRequestTimeout (used
in TestGetAzureObjectList_HungResponseTimesOut and
TestGetAzureObjectList_StalledResponseBody_Integration), which is
parallel-unsafe; instead refactor the code so the timeout is injected (e.g., add
a timeout parameter to the function that fetches pages or add a timeout field to
the azureClient struct and use that value instead of the global), update the
tests to construct an azureClient with the shorter timeout (or call the function
with the timeout) rather than writing to pageRequestTimeout, and remove any
direct writes to the global to eliminate races (if refactoring is infeasible,
document the serial-only constraint and ensure these tests do not call
t.Parallel()).

In `@client/client.go`:
- Around line 38-39: Replace the hard-coded var pageRequestTimeout = 2 *
time.Minute with a configurable value sourced from the existing config (e.g.,
add a PageRequestTimeout time.Duration field to config.Config with a default of
2m) and use that value wherever pageRequestTimeout is referenced (including the
body read/decode timeout usage). Update the client initialization (where the
client is constructed or config is passed) to read
config.Config.PageRequestTimeout and fall back to 2*time.Minute if zero, and
ensure any tests or constructors that instantiate the client accept or pass the
config so operators can tune the timeout without changing code.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 4ae43e88-83ee-4975-84d4-cf5513b889fd

📥 Commits

Reviewing files that changed from the base of the PR and between 96436c7 and c9365f2.

📒 Files selected for processing (2)
  • client/client.go
  • client/client_test.go

Comment thread client/client_test.go
@StranDutton StranDutton marked this pull request as ready for review April 17, 2026 16:14
@StranDutton StranDutton merged commit 1e211df into main Apr 17, 2026
10 checks passed
@github-actions github-actions Bot locked and limited conversation to collaborators Apr 17, 2026
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants