
feat: add retry logic #46

Closed
jcabrero wants to merge 1 commit into main from feat/add_retry_logic

Conversation

@jcabrero (Member) commented Jan 23, 2026

The other day, we saw that one of the main potential causes of Inconclusive requests is transient network issues. These can come in many forms (e.g. AMD blocking certificate downloads, GitHub being unavailable, etc.). Such issues are usually transient and can be resolved by retrying after some time.

This PR introduces retry logic to automatically retry functions when required. This is applied to most contract call functions (both read and write) as well as to verification calls.

For read calls, the delay between retries is 5 seconds. For write calls to the blockchain and for verification calls, the retry period is 30 seconds; both can be adjusted through the consts.rs file. A verification is only retried if the result is Inconclusive, because those errors are outside the control of nilCC or the verifier.

With this approach, if a user’s transaction is rejected due to gas costs on a given block, it may succeed on a different block on a subsequent retry 30 seconds later. Similarly, if a request to AMD times out, retries allow time for certificates to be re-downloaded successfully.
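As a rough sketch of the tunables described above: apart from DEFAULT_RETRY_DELAY_SECS and DEFAULT_MAX_RETRY_ATTEMPTS, which appear in the diff below, the names here (READ_RETRY_DELAY, WRITE_RETRY_DELAY, for_reads, for_writes) are illustrative, not necessarily the PR's actual code.

```rust
use std::time::Duration;

// Hypothetical consts.rs entries; only the attempt count and the 30 s
// write delay are stated in the PR, the rest follows the description above.
pub const DEFAULT_MAX_RETRY_ATTEMPTS: u32 = 3;
pub const READ_RETRY_DELAY: Duration = Duration::from_secs(5);
pub const WRITE_RETRY_DELAY: Duration = Duration::from_secs(30);

#[derive(Clone, Copy, Debug)]
pub struct RetryConfig {
    pub max_attempts: u32,
    pub delay: Duration,
}

impl RetryConfig {
    /// Contract reads: short delay, since transient RPC hiccups clear quickly.
    pub fn for_reads() -> Self {
        Self { max_attempts: DEFAULT_MAX_RETRY_ATTEMPTS, delay: READ_RETRY_DELAY }
    }

    /// Writes and verifications: longer delay, to wait out a bad block or a
    /// failed certificate download.
    pub fn for_writes() -> Self {
        Self { max_attempts: DEFAULT_MAX_RETRY_ATTEMPTS, delay: WRITE_RETRY_DELAY }
    }
}
```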

I am not sure this is the best way to approach this, and whether we want to introduce this change. Happy to open a discussion on this.

@jcabrero jcabrero force-pushed the feat/add_retry_logic branch from 196c1f6 to 22cbd94 Compare January 23, 2026 15:28
@jcabrero jcabrero force-pushed the feat/add_retry_logic branch from 22cbd94 to 179d671 Compare January 23, 2026 15:30
@jcabrero jcabrero marked this pull request as ready for review January 23, 2026 15:36
@jimouris (Member)

I haven't followed the latest changes on the contracts, but this will heavily depend on the logic there. Do the contracts expect a response within a certain time limit? They should; otherwise, a node could simply not respond and the HTX would never be acknowledged (either positively or negatively).

I like the retry idea, but it might be better to respond "inconclusive" in time than keep retrying and have the contract assume that you never responded.

cc: @tlitre

@mfontanini (Member)

You need to reply within a limit, yes, I think it's 100 blocks as it is now.

@jcabrero (Member, Author)

I think I read somewhere it was 5 minutes. That's why I chose 30 seconds x 3. Indeed, if it is 100 blocks, then it may be better to adjust the 30 seconds to something smaller.

@mfontanini (Member)

I like the retry idea, but it might be better to respond "inconclusive" in time than keep retrying and have the contract assume that you never responded.

I agree with this. I don't know if we want to keep trying if we got Inconclusive. This is a hard decision: there are many failure types, some of which will be "always inconclusive" while others will only be transient.

@mfontanini (Member) left a comment

I feel a little iffy about this whole thing. I'm not against it; I just feel like we're adding lots of implicit retries, then top-level retries, etc. This may make it hard to understand exactly how long something can take. E.g. if we try to verify and something internally fails, retries, eventually gives up, and then bubbles up, the next level may retry too, internally retrying again, and so on. Maybe I'm seeing this wrong, but I feel like we can't tell exactly how long a verification can take now, given all the inner retries.
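The compounding concern can be made concrete with a rough worst-case bound. The figures below are illustrative, not from the PR: an inner call taking up to 10 s per try, retried 3 times with 30-second delays, wrapped by an outer layer that also retries 3 times with 30-second delays.

```rust
use std::time::Duration;

/// Worst case for one retry layer: `attempts` tries of up to `per_attempt`
/// each, with `attempts - 1` fixed delays in between (multiplier 1.0, no backoff).
pub fn worst_case(attempts: u32, per_attempt: Duration, delay: Duration) -> Duration {
    per_attempt * attempts + delay * (attempts - 1)
}

/// Illustrative nesting: the outer layer retries the whole inner retry loop.
pub fn nested_worst_case() -> (Duration, Duration) {
    // Inner call: 3 tries x 10 s, plus 2 x 30 s delays -> 90 s.
    let inner = worst_case(3, Duration::from_secs(10), Duration::from_secs(30));
    // Outer layer: 3 tries x 90 s, plus 2 x 30 s delays -> 330 s, which
    // already exceeds a 5-minute response window.
    let outer = worst_case(3, inner, Duration::from_secs(30));
    (inner, outer)
}
```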

retry(RetryConfig::for_reads(), "getNodes", || async {
    self.contract
        .getNodes()
        .call()
        .await
})

This is fine as is, but I think there could be a much less verbose way of doing the same thing so you don't need closures everywhere: create an extension trait, e.g. something like this (uncompiled code):

#[async_trait]
pub trait CallBuilderExt<T> {
    async fn call_with_retries(self, config: RetryConfig) -> Result<T, SomeError>;
}

impl <...> CallBuilderExt<...> for CallBuilder<P, D> {
    async fn call_with_retries(self, config: RetryConfig) -> Result<T, SomeError> {
        .... do the actual retries
    }
}


// usage

self.contract
      .getNodes()
      .call_with_retries(RetryConfig::for_reads())
      .await?

We don't have to switch to it (and I'm not sure if you wouldn't hit some annoyance around async here) but just to throw it out there.
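To show the extension-trait shape above in compiling form, here is a synchronous stand-in (the real CallBuilderExt would be async and generic over the contract call builder; RetryConfig's fields and everything else here are illustrative assumptions):

```rust
use std::thread::sleep;
use std::time::Duration;

#[derive(Clone, Copy)]
pub struct RetryConfig {
    pub max_attempts: u32,
    pub delay: Duration,
}

/// Extension trait: any fallible operation gains `.call_with_retries(config)`,
/// so call sites don't need an explicit closure around every call.
pub trait RetryExt<T, E> {
    fn call_with_retries(self, config: RetryConfig) -> Result<T, E>;
}

impl<T, E, F> RetryExt<T, E> for F
where
    F: FnMut() -> Result<T, E>,
{
    fn call_with_retries(mut self, config: RetryConfig) -> Result<T, E> {
        let mut attempt = 1;
        loop {
            match self() {
                Ok(value) => return Ok(value),
                // Out of attempts: surface the last error.
                Err(err) if attempt >= config.max_attempts => return Err(err),
                // Transient failure: wait, then try again.
                Err(_) => {
                    attempt += 1;
                    sleep(config.delay);
                }
            }
        }
    }
}
```

With this, a flaky operation that succeeds on its third call returns Ok after two retries, without any retry plumbing at the call site.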

// =============================================================================

/// Default delay between retry attempts in seconds
pub const DEFAULT_RETRY_DELAY_SECS: u64 = 30;

nit: you can use Duration directly so you don't need the _SECS and the conversion to Duration on use later on
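Concretely, the nit amounts to something like the following (Duration::from_secs is a const fn, so it works in a const item; the _SECS-free name is a suggestion, not code from the PR):

```rust
use std::time::Duration;

/// Default delay between retry attempts.
pub const DEFAULT_RETRY_DELAY: Duration = Duration::from_secs(30);
```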

max_attempts: DEFAULT_MAX_RETRY_ATTEMPTS,
delay: Duration::from_secs(DEFAULT_RETRY_DELAY_SECS),
backoff_multiplier: 1.0,
max_delay: Duration::from_secs(300), // 5 minute cap

To Dimitris' point, I think this goes beyond the 100-block range, but I could be wrong.

@tlitre (Collaborator) commented Jan 23, 2026

I think I read somewhere it was 5 minutes. That's why I chose 30 seconds x 3. Indeed, if it is 100 blocks, then it may be better to adjust the 30 seconds to something smaller.

Yes, it's 5 minutes! Though this is a parameter that's easily changeable by the admin: RESPONSE_WINDOW_SEC.

@jcabrero (Member, Author) commented Feb 5, 2026

I think it's best to close this PR for now. I agree we may be causing unnecessary delays and unpredictable behaviour.

@jcabrero jcabrero closed this Feb 5, 2026
@jcabrero jcabrero deleted the feat/add_retry_logic branch February 5, 2026 10:40
@jcabrero jcabrero restored the feat/add_retry_logic branch February 5, 2026 10:40
@jcabrero jcabrero deleted the feat/add_retry_logic branch February 5, 2026 10:40