Conversation
196c1f6 to
22cbd94
Compare
22cbd94 to
179d671
Compare
|
I haven't followed the latest changes on the contracts, but this will heavily depend on the logic there. Do the contracts expect a response within a certain time limit? They should otherwise, a node could just not respond and cause the HTX to never be acknowledged (either positively or negatively). I like the retry idea, but it might be better to respond "inconclusive" in time than keep retrying and have the contract assume that you never responded. cc: @tlitre |
|
You need to reply within a limit, yes, I think it's 100 blocks as it is now. |
|
I think I read somewhere it was 5 minutes. That's why I chose |
I agree with this. I don't know if we want to keep trying if we got inconclusive. This is a hard decision, there's too many failure types, some of which will be inconclusive but "always inconclusive" vs others which will only be transient. |
mfontanini
left a comment
There was a problem hiding this comment.
I feel a little iffy about this whole thing. I'm not against it, just feel like we're adding lots of implicit retries, then top level retries, etc. This may make it hard to understand how long exactly can something take. e.g. if we try to verify and something internally fails, retries, then eventually gives up, then bubbles up, the next one may retry too, internally retrying again, etc. Maybe I'm seeing this wrong but I feel like we can't tell exactly how long can a verification take now given all the inner retries.
| retry(RetryConfig::for_reads(), "getNodes", || async { | ||
| self.contract | ||
| .getNodes() | ||
| .call() |
There was a problem hiding this comment.
This is fine as is but I think there could be a much less verbose way of doing the same so you don't need closures everywhere: create an extension trait. e.g. something like (uncompiled code)
#[async_trait]
pub trait CallBuilderExt<T> {
async fn call_with_retries(self, config: RetryConfig) -> Result<T, SomeError>;
}
impl <...> CallBuilderExt<...> for CallBuilder<P, D> {
async fn call_with_retries(self, config: RetryConfig) -> Result<T, SomeError> {
.... do the actual retries
}
}
// usage
self.contract
.getNodes()
.call_with_retries(RetryConfig::for_reads())
.await?We don't have to switch to it (and I'm not sure if you wouldn't hit some annoyance around async here) but just to throw it out there.
| // ============================================================================= | ||
|
|
||
| /// Default delay between retry attempts in seconds | ||
| pub const DEFAULT_RETRY_DELAY_SECS: u64 = 30; |
There was a problem hiding this comment.
nit: you can use Duration directly so you don't need the _SECS and the conversion to Duration on use later on
| max_attempts: DEFAULT_MAX_RETRY_ATTEMPTS, | ||
| delay: Duration::from_secs(DEFAULT_RETRY_DELAY_SECS), | ||
| backoff_multiplier: 1.0, | ||
| max_delay: Duration::from_secs(300), // 5 minute cap |
There was a problem hiding this comment.
To dimitris' point, this I think goes off the 100 block range but I could be wrong.
| max_attempts: DEFAULT_MAX_RETRY_ATTEMPTS, | ||
| delay: Duration::from_secs(DEFAULT_RETRY_DELAY_SECS), | ||
| backoff_multiplier: 1.0, | ||
| max_delay: Duration::from_secs(300), // 5 minute cap |
There was a problem hiding this comment.
To dimitris' point, this I think goes off the 100 block range but I could be wrong.
yes it's 5 minutes! though this is a parameter that's easily changeable by the admin |
|
I think it's best to remove this PR for now. I agree we may be causing unecessary delays and unpredictable behaviour |
The other day, we saw that one of the potential main causes of Inconclusive requests are transient network issues. These can come in many forms (e.g. AMD blocking certificate downloads, GitHub not being available, etc.). These issues are usually transient and can be resolved by retrying after some time.
This PR introduces retry logic to automatically retry functions when required. This is applied to most contract call functions (both read and write) as well as to verification calls.
For read calls, the delay between retries is 5 seconds. For write calls to the blockchain and for verification calls, the retry period is 30 seconds, and can be adjusted through the
consts.rsfile. A verification is only retried if the result isInconclusive, this is because those are errors outside the control of nilCC or the verifier.With this approach, if a user’s transaction is rejected due to gas costs on a given block, it may succeed on a different block on a subsequent retry 30 seconds later. Similarly, if a request to AMD times out, retries allow time for certificates to be re-downloaded successfully.
I am not sure this is the best way to approach this, and whether we want to introduce this change. Happy to open a discussion on this.