
feat(transport): add HTTP retry with exponential backoff #1520

Open
jpnurmi wants to merge 75 commits into master from jpnurmi/feat/http-retry

Conversation

@jpnurmi
Collaborator

@jpnurmi jpnurmi commented Feb 13, 2026

Add HTTP retry with exponential backoff for network failures, modeled after Crashpad's upload retry behavior.

Note

Adds an explicit 15s connect timeout to both curl and WinHTTP, matching the value used by Crashpad. This reduces the time the transport worker is blocked on unreachable hosts, previously ~75s on curl (OS TCP SYN retry limit) and 60s on WinHTTP.

Failed envelopes are stored as <db>/cache/<ts>-<n>-<uuid>.envelope and retried on startup after a 100ms throttle, then with exponential backoff (15min, 30min, 1h, 2h, 4h, 8h). When retries are exhausted and offline caching is enabled, envelopes are stored as <db>/cache/<uuid>.envelope instead of being discarded.

```mermaid
flowchart TD
    startup --> R{retry?}
    R -->|yes| throttle
    R -->|no| C{cache?}
    throttle -. 100ms .-> resend
    resend -->|success| C
    resend -->|fail| C2[&lt;db&gt;/cache/<br/>&lt;ts&gt;-&lt;n&gt;-&lt;uuid&gt;.envelope]
    C2 --> backoff
    backoff -. 2ⁿ×15min .-> resend
    C -->|yes| CACHE[&lt;db&gt;/cache/<br/>&lt;uuid&gt;.envelope]
    C -->|no| discard
```
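The backoff edge labeled 2ⁿ×15min amounts to a doubling delay per attempt. A minimal sketch (the constant and function names here are illustrative, not the PR's actual identifiers):

```c
#include <stdint.h>

/* Illustrative sketch of the schedule described above: the n-th retry
 * waits 2^n * 15 minutes, for n = 0..5 (15m, 30m, 1h, 2h, 4h, 8h).
 * Names are hypothetical, not taken from the PR. */
#define RETRY_BASE_DELAY_MS (15ULL * 60 * 1000)
#define RETRY_MAX_ATTEMPTS 6

static uint64_t
retry_backoff_ms(int attempt)
{
    return RETRY_BASE_DELAY_MS << attempt;
}
```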

Builds upon:

See also:

@github-actions

github-actions bot commented Feb 13, 2026

Messages
📖 Do not forget to update Sentry-docs with your feature once the pull request gets approved.

Generated by 🚫 dangerJS against 889ba35

@jpnurmi jpnurmi force-pushed the jpnurmi/feat/http-retry branch 2 times, most recently from b083a57 to a264f66 on February 13, 2026 17:47
@jpnurmi
Collaborator Author

jpnurmi commented Feb 16, 2026

@sentry review

@jpnurmi
Collaborator Author

jpnurmi commented Feb 16, 2026

@cursor review


@cursor cursor bot left a comment

✅ Bugbot reviewed your changes and found no new issues!

Comment @cursor review or bugbot run to trigger another review on this PR

@jpnurmi jpnurmi force-pushed the jpnurmi/feat/http-retry branch 2 times, most recently from 243c880 to d6aa792 on February 16, 2026 19:41
@jpnurmi
Collaborator Author

jpnurmi commented Feb 16, 2026

@cursor review

@jpnurmi jpnurmi force-pushed the jpnurmi/feat/http-retry branch 2 times, most recently from fbbffb2 to abd5815 on February 17, 2026 08:34
@jpnurmi
Collaborator Author

jpnurmi commented Feb 17, 2026

@cursor review

@jpnurmi jpnurmi force-pushed the jpnurmi/feat/http-retry branch from f030cf9 to 22e3fc4 on February 17, 2026 10:24
@jpnurmi
Collaborator Author

jpnurmi commented Feb 17, 2026

@sentry review

@jpnurmi
Collaborator Author

jpnurmi commented Feb 17, 2026

@cursor review

@jpnurmi jpnurmi force-pushed the jpnurmi/feat/http-retry branch from 22e3fc4 to ffce486 on February 17, 2026 10:49
@jpnurmi
Collaborator Author

jpnurmi commented Feb 17, 2026

@cursor review

@jpnurmi jpnurmi requested a review from mujacica February 23, 2026 15:42
opts->traces_sample_rate = 0.0;
opts->max_spans = SENTRY_SPANS_MAX;
opts->handler_strategy = SENTRY_HANDLER_STRATEGY_DEFAULT;
opts->http_retry = true;
Collaborator Author

Question: default to false to avoid behavioral change? On the other hand, Crashpad already retries, so defaulting to true actually makes the behavior consistent across backends.

Collaborator

I generally think retry should be on by default, but the bigger problem is the consequence behind that very basic binary decision.

Retry is what most users expect from other SDKs, since most HTTP clients in higher-level languages retry by default. But those retry mechanisms are often trivial and short-lived (5 attempts with exponential back-off, but not for hours or even half an hour), and they do not account for in-flight requests that persist, so we can continue retrying on the next start.

We can essentially center on:

default to false to avoid behavioral change

What behavior should a user expect when no retry is occurring? Is this about the storage of inflight envelopes after a crash or a shutdown? Then yes, retry-by-default could be a problem.

Crashpad already retries, so defaulting to true actually makes the behavior consistent across backends

Yes, but also consider that this affects all events, not only crash events. In the general case, there will be many more regular events than crash events.

This brings us back to the initial point about retry mechanisms: exposing them requires configuration.

If we don't want to expose the retry policy configuration as a first step, we shouldn't default to retrying behind a binary option.

Collaborator

@supervacuus supervacuus left a comment

I think this is a sensible retry implementation, and the policy makes sense as well. However, as discussed initially, it applies a single policy for all use cases and hides it behind a binary interface.

I think it is highly likely that all currently hard-coded configurations (timeout, attempts, interval, throttle, http-response codes, etc.) will have to be exposed at some point. And I wonder why we wouldn't do that right away.

@jpnurmi
Collaborator Author

jpnurmi commented Feb 24, 2026

I think it is highly likely that all currently hard-coded configurations (timeout, attempts, interval, throttle, http-response codes, etc.) will have to be exposed at some point. And I wonder why we wouldn't do that right away.

This PR started with a configurable number of retries, but after discussing it with others, we ended up simplifying it to a boolean because we didn't find these details exposed in other SDKs.

For the other parameters, I just decided to follow Crashpad. Now that we have a conversion from Crashpad reports to Sentry envelopes for offline caching, there's an opportunity to disable retries in Crashpad and let sentry-native take care of it in a backend-agnostic way. Then we could also easily make any retry parameters customizable.

- Document the 5-attempt retry limit
- Note there is no rate limiting between attempts
- Warn about potential event loss during extended network outages
…ponses

Only network failures (negative status codes) trigger retries. HTTP responses
including 5xx (500, 502, 503, 504) are discarded.
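The policy stated in that commit reduces to a one-line predicate. A sketch with a hypothetical helper name (not the PR's actual function):

```c
#include <stdbool.h>

/* Sketch of the retry policy described above: only transport-level
 * failures, which surface as negative status codes, are retried; any
 * genuine HTTP response, including 5xx, is not. Name is hypothetical. */
static bool
retry_should_retry(int status_code)
{
    return status_code < 0;
}
```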
@cursor

cursor bot commented Feb 24, 2026

Bugbot Autofix prepared fixes for 1 of the 1 bugs found in the latest run.

  • ✅ Fixed: Enqueue side effects fire even when no file written
    • Changed sentry__retry_write_envelope to return bool and modified sentry__retry_enqueue to only clear startup_time and schedule poll task when a file was actually written.


Or push these changes by commenting:

@cursor push 41e84b4fc2
Preview (41e84b4fc2)
diff --git a/src/sentry_retry.c b/src/sentry_retry.c
--- a/src/sentry_retry.c
+++ b/src/sentry_retry.c
@@ -121,13 +121,13 @@
     return sentry__path_join_str(retry->cache_path, filename);
 }
 
-void
+bool
 sentry__retry_write_envelope(
     sentry_retry_t *retry, const sentry_envelope_t *envelope)
 {
     sentry_uuid_t event_id = sentry__envelope_get_event_id(envelope);
     if (sentry_uuid_is_nil(&event_id)) {
-        return;
+        return false;
     }
 
     char uuid[37];
@@ -139,9 +139,13 @@
         if (sentry_envelope_write_to_path(envelope, path) != 0) {
             SENTRY_WARNF(
                 "failed to write retry envelope to \"%s\"", path->path);
+            sentry__path_free(path);
+            return false;
         }
         sentry__path_free(path);
+        return true;
     }
+    return false;
 }
 
 static bool
@@ -379,7 +383,9 @@
     if (sentry__atomic_fetch(&retry->sealed)) {
         return;
     }
-    sentry__retry_write_envelope(retry, envelope);
+    if (!sentry__retry_write_envelope(retry, envelope)) {
+        return;
+    }
     // prevent the startup poll from re-processing this session's envelope
     retry->startup_time = 0;
     if (sentry__atomic_compare_swap(&retry->scheduled, 0, 1)) {

diff --git a/src/sentry_retry.h b/src/sentry_retry.h
--- a/src/sentry_retry.h
+++ b/src/sentry_retry.h
@@ -38,8 +38,9 @@
 
 /**
  * Writes an event envelope to the retry dir. Non-event envelopes are skipped.
+ * Returns true if a file was written, false otherwise.
  */
-void sentry__retry_write_envelope(
+bool sentry__retry_write_envelope(
     sentry_retry_t *retry, const sentry_envelope_t *envelope);
 
 /**
This Bugbot Autofix run was free. To enable Autofix for future PRs, go to the Cursor dashboard.

@jpnurmi
Collaborator Author

jpnurmi commented Feb 24, 2026

@cursor push 41e84b4

@cursor

cursor bot commented Feb 24, 2026

Could not push Autofix changes. The PR branch may have changed since the Autofix ran, or the Autofix commit may no longer exist.

- Change sentry__retry_write_envelope to return bool indicating success
- Return false for nil event IDs (session envelopes) and write failures
- sentry__retry_enqueue now returns early if write fails, preserving
  startup_time for session envelopes so retry_flush_task can flush them

Add null check after cloning cache_path to prevent dereferencing null
later in sentry__retry_send when iterating directory or joining paths.

When system clock moves backward (now < ts), the condition now >= ts
was false, causing the backoff check to be skipped entirely. This made
items immediately eligible for retry regardless of their count.

Now checks if now < ts (clock skew) OR if backoff hasn't elapsed.
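The corrected check described in that commit can be sketched as follows (names and the inlined backoff constant are illustrative, not the PR's actual code):

```c
#include <stdbool.h>
#include <stdint.h>

/* Sketch of the corrected eligibility check (hypothetical names): an
 * item written at `ts_ms` with retry count `count` becomes eligible
 * only after its backoff elapses, and a backward clock jump
 * (now < ts) defers it instead of bypassing the check entirely. */
static bool
retry_is_eligible(uint64_t now_ms, uint64_t ts_ms, int count)
{
    uint64_t backoff_ms = (15ULL * 60 * 1000) << count;
    if (now_ms < ts_ms) {
        return false; /* clock moved backward: not eligible yet */
    }
    return now_ms - ts_ms >= backoff_ms;
}
```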
@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.

jpnurmi and others added 2 commits February 24, 2026 18:12
Crashpad's kRetryAttempts=5 with `upload_attempts > kRetryAttempts`
(checked before post-increment) allows upload_attempts 0-5, giving
6 retries with backoffs: 15m, 30m, 1h, 2h, 4h, 8h.

sentry-native's `count + 1 < SENTRY_RETRY_ATTEMPTS` with the old
value of 5 only allowed counts 0-3 to be re-enqueued, so the max
backoff reached was 4h. Bumping to 6 gives the same 6 retries and
the full backoff sequence up to 8h.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
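The arithmetic above can be checked with a toy comparison of the two loop guards (illustrative code, not taken from Crashpad or the PR):

```c
/* Toy comparison of the two guards described above. Crashpad's
 * `upload_attempts > kRetryAttempts` with kRetryAttempts = 5 admits
 * attempts 0..5 (6 retries); the old sentry-native guard
 * `count + 1 < 5` admits counts 0..3 (4 retries). */
static int
count_admitted_crashpad(void)
{
    int retries = 0;
    for (int upload_attempts = 0; !(upload_attempts > 5); upload_attempts++) {
        retries++;
    }
    return retries;
}

static int
count_admitted_old_sentry(void)
{
    int retries = 0;
    for (int count = 0; count + 1 < 5; count++) {
        retries++;
    }
    return retries;
}
```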
Rename sentry__retry_flush to sentry__retry_shutdown and remove the
bgworker_flush call so bgworker_shutdown gets the full timeout.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 3 potential issues.

Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.

jpnurmi and others added 2 commits February 24, 2026 19:21
Check the sentry__path_rename return value and log a warning on failure
instead of silently ignoring it.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move eligible++ after all item fields are populated so a NULL path from
allocation failure does not leave a half-initialized item in the array.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.

sentry_uuid_t event_id = sentry__envelope_get_event_id(envelope);
if (sentry_uuid_is_nil(&event_id)) {
return false;
}

Retry drops session-only envelopes

Medium Severity

sentry__retry_write_envelope only persists envelopes that have a non-nil event_id. Envelopes containing only sessions (and other non-event payloads) will not be written to the retry queue on network failures, so those telemetry items get discarded even when HTTP retry is enabled.


Replace mutable startup_time + sealed with a state enum so that
the startup flag is managed via atomic operations instead of
clearing a uint64_t that may tear on 32-bit systems.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
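The state-enum approach described in that commit can be sketched with C11 atomics (enum values and names are hypothetical, not the PR's actual identifiers):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Sketch of the state-enum approach described above (hypothetical
 * names): a single atomic int holds the lifecycle state, so 32-bit
 * targets never observe a torn value the way a plain, non-atomic
 * uint64_t timestamp could. */
typedef enum {
    RETRY_STATE_STARTUP,
    RETRY_STATE_RUNNING,
    RETRY_STATE_SEALED,
} retry_state_t;

static _Atomic int g_retry_state = RETRY_STATE_STARTUP;

/* Transition STARTUP -> RUNNING exactly once, atomically. */
static bool
retry_leave_startup(void)
{
    int expected = RETRY_STATE_STARTUP;
    return atomic_compare_exchange_strong(
        &g_retry_state, &expected, RETRY_STATE_RUNNING);
}
```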
If sentry__retry_new() fails, retry_func was still set on the transport,
causing can_retry to return true and envelopes to be dropped instead of
cached.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Comment on lines +408 to +412
}
sentry__atomic_compare_swap(
&retry->state, SENTRY_RETRY_STARTUP, SENTRY_RETRY_RUNNING);
if (sentry__atomic_compare_swap(&retry->scheduled, 0, 1)) {
sentry__bgworker_submit_delayed(retry->bgworker, retry_poll_task, NULL,

Bug: A race condition between retry_poll_task and retry_enqueue can cause a new retry item to be queued without a corresponding poll task being scheduled, delaying its processing.
Severity: MEDIUM

Suggested Fix

The race condition can be resolved by modifying retry_poll_task. After setting retry->scheduled to 0, it should re-check if the queue is empty. If new items have been added in the meantime, it should immediately attempt to schedule a new poll task itself by performing the same compare_swap operation used in retry_enqueue. This ensures that even if a race occurs, the system self-corrects and schedules the necessary poll.

Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent.
Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not
valid.

Location: src/sentry_retry.c#L408-L412

Potential issue: A race condition exists between the background worker polling for
retries and the function that enqueues new items for retry. Specifically, when the poll
task (`retry_poll_task`) finishes processing the queue, it sets a `scheduled` flag to 0.
However, there is a small window before this flag is set where a new item can be
enqueued by `retry_enqueue`. In this scenario, `retry_enqueue` will see the `scheduled`
flag as 1, assume a poll is active, and fail to schedule a new one. The poll task then
proceeds to set `scheduled` to 0, leaving the newly enqueued item stranded in the queue
with no scheduled task to process it.
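The suggested fix can be sketched with the queue and scheduler stubbed out (all names hypothetical): after clearing `scheduled`, the poll task re-checks for pending items and re-schedules itself, so an item enqueued inside the race window is not stranded.

```c
#include <stdatomic.h>
#include <stdbool.h>

static _Atomic int scheduled = 1;     /* a poll is currently running */
static _Atomic int pending_items = 0; /* stub for the retry queue     */
static int polls_scheduled = 0;       /* stub for the bgworker        */

static void
schedule_poll(void)
{
    polls_scheduled++;
}

static void
retry_poll_task(void)
{
    /* ... process_queue() would drain the queue here ... */

    /* Simulate a racing enqueue: it observes scheduled == 1 and
     * therefore does not schedule a poll of its own. */
    atomic_store(&pending_items, 1);

    atomic_store(&scheduled, 0);

    /* Self-correcting re-check: if work slipped in during the race
     * window, schedule a poll for it ourselves. */
    if (atomic_load(&pending_items) > 0) {
        int expected = 0;
        if (atomic_compare_exchange_strong(&scheduled, &expected, 1)) {
            schedule_poll();
        }
    }
}
```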
