feat(transport): add HTTP retry with exponential backoff #1520
```c
opts->traces_sample_rate = 0.0;
opts->max_spans = SENTRY_SPANS_MAX;
opts->handler_strategy = SENTRY_HANDLER_STRATEGY_DEFAULT;
opts->http_retry = true;
```
Question: default to false to avoid behavioral change? On the other hand, Crashpad already retries, so defaulting to true actually makes the behavior consistent across backends.
I generally think retry should be on by default, but the bigger problem is the consequence behind that very basic binary decision.
Retry is what most users expect from other SDKs, since most HTTP clients in higher-level languages retry by default. But those retry mechanisms are often trivial and short-lived (5 attempts with exponential back-off, but not for hours or even half an hour), and they do not account for in-flight requests that persist, so we can continue retrying on the next start.
We can essentially center the discussion on two points:

> default to false to avoid behavioral change
What behavior should a user expect when no retry is occurring? Is this about the storage of inflight envelopes after a crash or a shutdown? Then yes, retry-by-default could be a problem.
> Crashpad already retries, so defaulting to true actually makes the behavior consistent across backends
Yes, but also consider that this affects all events, not only crash events. In the general case, there will be many more regular events than crash events.
This brings us back to the initial point about retry mechanisms: exposing them requires configuration.
If we don't want to expose the retry policy configuration as a first step, we shouldn't default to retrying behind a binary option.
supervacuus left a comment:
I think this is a sensible retry implementation, and the policy makes sense as well. However, as discussed initially, it applies a single policy for all use cases and hides it behind a binary interface.
I think it is highly likely that all currently hard-coded configurations (timeout, attempts, interval, throttle, http-response codes, etc.) will have to be exposed at some point. And I wonder why we wouldn't do that right away.
This PR started with a configurable number of retries, but after discussing it with others, we ended up simplifying it to a boolean because we didn't find these details exposed in other SDKs. For the other parameters, I just decided to follow Crashpad. Now that we have a conversion from Crashpad reports to Sentry envelopes for offline caching, there's an opportunity to disable retries in Crashpad and let sentry-native take care of it in a backend-agnostic way. Then we could also easily make any retry parameters customizable.
- Document the 5-attempt retry limit
- Note there is no rate limiting between attempts
- Warn about potential event loss during extended network outages
…ponses

Only network failures (negative status codes) trigger retries. HTTP responses, including 5xx (500, 502, 503, 504), are discarded.
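The decision described in this commit reduces to a single sign check. A minimal sketch (the function name is illustrative, not sentry-native's actual API):

```c
#include <stdbool.h>

/* Sketch of the retry decision described above: only transport-level
 * failures (reported as negative status codes) are retried; any real
 * HTTP response, including 5xx, is treated as final and discarded. */
static bool should_retry(int status_code)
{
    return status_code < 0;
}
```

This keeps retry semantics unambiguous: a 503 means the server answered, so the envelope is handed to rate limiting or discarded rather than re-sent.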
Bugbot Autofix prepared fixes for 1 of the 1 bugs found in the latest run.
Or push these changes by commenting: Preview (41e84b4fc2)

```diff
diff --git a/src/sentry_retry.c b/src/sentry_retry.c
--- a/src/sentry_retry.c
+++ b/src/sentry_retry.c
@@ -121,13 +121,13 @@
     return sentry__path_join_str(retry->cache_path, filename);
 }

-void
+bool
 sentry__retry_write_envelope(
     sentry_retry_t *retry, const sentry_envelope_t *envelope)
 {
     sentry_uuid_t event_id = sentry__envelope_get_event_id(envelope);
     if (sentry_uuid_is_nil(&event_id)) {
-        return;
+        return false;
     }

     char uuid[37];
@@ -139,9 +139,13 @@
         if (sentry_envelope_write_to_path(envelope, path) != 0) {
             SENTRY_WARNF(
                 "failed to write retry envelope to \"%s\"", path->path);
+            sentry__path_free(path);
+            return false;
         }
         sentry__path_free(path);
+        return true;
     }
+    return false;
 }

 static bool
@@ -379,7 +383,9 @@
     if (sentry__atomic_fetch(&retry->sealed)) {
         return;
     }
-    sentry__retry_write_envelope(retry, envelope);
+    if (!sentry__retry_write_envelope(retry, envelope)) {
+        return;
+    }

     // prevent the startup poll from re-processing this session's envelope
     retry->startup_time = 0;
     if (sentry__atomic_compare_swap(&retry->scheduled, 0, 1)) {
diff --git a/src/sentry_retry.h b/src/sentry_retry.h
--- a/src/sentry_retry.h
+++ b/src/sentry_retry.h
@@ -38,8 +38,9 @@
 /**
  * Writes an event envelope to the retry dir. Non-event envelopes are skipped.
+ * Returns true if a file was written, false otherwise.
  */
-void sentry__retry_write_envelope(
+bool sentry__retry_write_envelope(
     sentry_retry_t *retry, const sentry_envelope_t *envelope);

 /**
```
Could not push Autofix changes. The PR branch may have changed since the Autofix ran, or the Autofix commit may no longer exist.
- Change sentry__retry_write_envelope to return bool indicating success
- Return false for nil event IDs (session envelopes) and write failures
- sentry__retry_enqueue now returns early if write fails, preserving startup_time for session envelopes so retry_flush_task can flush them
Add null check after cloning cache_path to prevent dereferencing null later in sentry__retry_send when iterating directory or joining paths.
When system clock moves backward (now < ts), the condition now >= ts was false, causing the backoff check to be skipped entirely. This made items immediately eligible for retry regardless of their count. Now checks if now < ts (clock skew) OR if backoff hasn't elapsed.
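The fixed eligibility check can be sketched as follows (names and the millisecond-epoch convention are illustrative, not the merged code): an item written at time `ts` is retried only once its backoff has elapsed, and a backward clock jump keeps it waiting instead of releasing it immediately.

```c
#include <stdbool.h>
#include <stdint.h>

/* Sketch of the corrected check: a backward clock jump (now < ts)
 * counts as "not yet due" rather than "immediately eligible". */
static bool retry_eligible(uint64_t now, uint64_t ts, uint64_t backoff)
{
    if (now < ts) {
        return false; /* clock moved backward: keep waiting */
    }
    return now - ts >= backoff;
}
```

The old code only tested `now >= ts && now - ts < backoff` to skip an item, so `now < ts` fell through to "eligible", which is exactly the skew case the fix guards against.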
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.
Crashpad's kRetryAttempts=5 with `upload_attempts > kRetryAttempts` (checked before post-increment) allows upload_attempts 0-5, giving 6 retries with backoffs: 15m, 30m, 1h, 2h, 4h, 8h. sentry-native's `count + 1 < SENTRY_RETRY_ATTEMPTS` with the old value of 5 only allowed counts 0-3 to be re-enqueued, so the max backoff reached was 4h. Bumping to 6 gives the same 6 retries and the full backoff sequence up to 8h. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
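The arithmetic behind this commit can be sketched in a few lines (constants mirror the discussion, not necessarily the merged code): the n-th retry waits 2ⁿ × 15 minutes, and the `count + 1 < attempts` bound decides whether a failed item is re-enqueued.

```c
#include <stdint.h>

/* Backoff for the n-th retry: 2^n * 15 minutes, i.e.
 * 15m, 30m, 1h, 2h, 4h, 8h for counts 0..5. */
static uint64_t backoff_minutes(unsigned count)
{
    return 15u * ((uint64_t)1 << count);
}

/* Re-enqueue bound as discussed above: with max_attempts = 5 only
 * counts 0..3 pass, capping the backoff at 4h; bumping to 6 allows
 * counts 0..4 and reaches the full 8h backoff on the final retry. */
static int will_reenqueue(unsigned count, unsigned max_attempts)
{
    return count + 1 < max_attempts;
}
```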
Rename sentry__retry_flush to sentry__retry_shutdown and remove the bgworker_flush call so bgworker_shutdown gets the full timeout. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Cursor Bugbot has reviewed your changes and found 3 potential issues.
Check the sentry__path_rename return value and log a warning on failure instead of silently ignoring it. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move eligible++ after all item fields are populated so a NULL path from allocation failure does not leave a half-initialized item in the array. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Cursor Bugbot has reviewed your changes and found 1 potential issue.
```c
sentry_uuid_t event_id = sentry__envelope_get_event_id(envelope);
if (sentry_uuid_is_nil(&event_id)) {
    return false;
}
```
Retry drops session-only envelopes
Medium Severity
sentry__retry_write_envelope only persists envelopes that have a non-nil event_id. Envelopes containing only sessions (and other non-event payloads) will not be written to the retry queue on network failures, so those telemetry items get discarded even when HTTP retry is enabled.
Replace mutable startup_time + sealed with a state enum so that the startup flag is managed via atomic operations instead of clearing a uint64_t that may tear on 32-bit systems. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
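The state machine this commit describes can be modeled with a single atomic word; a hedged sketch using C11 atomics (enum and function names are illustrative, sentry-native uses its own `sentry__atomic_*` helpers):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* One atomic int replaces the mutable startup_time + sealed pair, so
 * every transition is a single-word CAS that cannot tear on 32-bit
 * platforms the way a plain uint64_t store can. */
typedef enum {
    RETRY_STATE_STARTUP,
    RETRY_STATE_RUNNING,
    RETRY_STATE_SEALED,
} retry_state_t;

static atomic_int g_retry_state = RETRY_STATE_STARTUP;

/* Leave startup exactly once; a no-op if already running or sealed. */
static bool retry_leave_startup(void)
{
    int expected = RETRY_STATE_STARTUP;
    return atomic_compare_exchange_strong(
        &g_retry_state, &expected, RETRY_STATE_RUNNING);
}

/* Sealing is terminal and unconditional. */
static void retry_seal(void)
{
    atomic_store(&g_retry_state, RETRY_STATE_SEALED);
}
```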
If sentry__retry_new() fails, retry_func was still set on the transport, causing can_retry to return true and envelopes to be dropped instead of cached. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
```c
}
sentry__atomic_compare_swap(
    &retry->state, SENTRY_RETRY_STARTUP, SENTRY_RETRY_RUNNING);
if (sentry__atomic_compare_swap(&retry->scheduled, 0, 1)) {
    sentry__bgworker_submit_delayed(retry->bgworker, retry_poll_task, NULL,
```
Bug: A race condition between retry_poll_task and retry_enqueue can cause a new retry item to be queued without a corresponding poll task being scheduled, delaying its processing.
Severity: MEDIUM
Suggested Fix
The race condition can be resolved by modifying retry_poll_task. After setting retry->scheduled to 0, it should re-check if the queue is empty. If new items have been added in the meantime, it should immediately attempt to schedule a new poll task itself by performing the same compare_swap operation used in retry_enqueue. This ensures that even if a race occurs, the system self-corrects and schedules the necessary poll.
Location: src/sentry_retry.c#L408-L412
Potential issue: A race condition exists between the background worker polling for
retries and the function that enqueues new items for retry. Specifically, when the poll
task (`retry_poll_task`) finishes processing the queue, it sets a `scheduled` flag to 0.
However, there is a small window before this flag is set where a new item can be
enqueued by `retry_enqueue`. In this scenario, `retry_enqueue` will see the `scheduled`
flag as 1, assume a poll is active, and fail to schedule a new one. The poll task then
proceeds to set `scheduled` to 0, leaving the newly enqueued item stranded in the queue
with no scheduled task to process it.
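The suggested self-correction can be illustrated with a small single-threaded model (all names are hypothetical; this is not sentry-native's actual code): after the poll task clears `scheduled`, it re-checks the queue and re-submits itself if an enqueue raced in while the flag was still set.

```c
#include <stdatomic.h>

static atomic_int scheduled = 0;     /* 1 while a poll task is pending */
static atomic_int queue_len = 0;     /* items waiting for retry */
static int polls_submitted = 0;      /* counts bgworker submissions */

static void submit_poll(void) { polls_submitted++; }

static void retry_enqueue(void)
{
    atomic_fetch_add(&queue_len, 1);
    int expected = 0;
    if (atomic_compare_exchange_strong(&scheduled, &expected, 1)) {
        submit_poll(); /* no poll pending: schedule one */
    }
}

static void retry_poll_task_finish(void)
{
    atomic_store(&scheduled, 0);
    /* Self-correction: pick up any item enqueued in the window where
     * `scheduled` was still 1 but the queue had already been drained. */
    if (atomic_load(&queue_len) > 0) {
        int expected = 0;
        if (atomic_compare_exchange_strong(&scheduled, &expected, 1)) {
            submit_poll();
        }
    }
}
```

Without the re-check in `retry_poll_task_finish`, an item enqueued between the queue drain and the `scheduled = 0` store would sit in the queue with no poll scheduled, exactly the stranding the report describes.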
Add HTTP retry with exponential backoff for network failures, modeled after Crashpad's upload retry behavior.
Note
Adds an explicit 15s connect timeout to both curl and WinHTTP, matching the value used by Crashpad. This reduces the time the transport worker is blocked on unreachable hosts, previously ~75s on curl (OS TCP SYN retry limit) and 60s on WinHTTP.
Failed envelopes are stored as `<db>/cache/<ts>-<n>-<uuid>.envelope` and retried on startup after a 100ms throttle, and then with exponential backoff (15min, 30min, 1h, 2h, 4h, 8h). When retries are exhausted and offline caching is enabled, envelopes are stored as `<db>/cache/<uuid>.envelope` instead of being discarded.

```mermaid
flowchart TD
    startup --> R{retry?}
    R -->|yes| throttle
    R -->|no| C{cache?}
    throttle -. 100ms .-> resend
    resend -->|success| C
    resend -->|fail| C2[<db>/cache/<br/><ts>-<n>-<uuid>.envelope]
    C2 --> backoff
    backoff -. 2ⁿ×15min .-> resend
    C -->|yes| CACHE[<db>/cache/<br/><uuid>.envelope]
    C -->|no| discard
```

Builds upon:
See also: