🐛 Workload should still be resilient when catalog is deleted #2439
Conversation
✅ Deploy Preview for olmv1 ready!
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
Pull request overview
This PR adds comprehensive end-to-end tests to verify that installed OLM extensions continue functioning correctly when their source catalog is deleted. The tests cover both standard runtime and experimental Boxcutter runtime scenarios.
Changes:
- Added new feature file with 8 scenarios testing catalog deletion resilience
- Implemented CatalogIsDeleted function to support catalog deletion in tests (see the sketch after this list)
- Added step registrations for ClusterExtension update operations
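For reference, a minimal sketch of what such a CatalogIsDeleted step helper could look like, assuming a controller-runtime client and the ClusterCatalog API type; the import path, signature, and wait helpers in the actual steps.go may differ:

```go
package steps

import (
	"context"
	"fmt"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/util/wait"
	"sigs.k8s.io/controller-runtime/pkg/client"

	ocv1 "github.com/operator-framework/operator-controller/api/v1" // assumed import path
)

// CatalogIsDeleted deletes the named ClusterCatalog and waits until it is
// fully gone, so that subsequent steps observe the deletion.
func CatalogIsDeleted(ctx context.Context, c client.Client, name string) error {
	catalog := &ocv1.ClusterCatalog{}
	catalog.SetName(name)
	if err := c.Delete(ctx, catalog); err != nil && !apierrors.IsNotFound(err) {
		return fmt.Errorf("deleting ClusterCatalog %q: %w", name, err)
	}
	// Poll until the catalog object is no longer found.
	return wait.PollUntilContextTimeout(ctx, time.Second, time.Minute, true,
		func(ctx context.Context) (bool, error) {
			err := c.Get(ctx, client.ObjectKey{Name: name}, &ocv1.ClusterCatalog{})
			if apierrors.IsNotFound(err) {
				return true, nil // deletion complete
			}
			return false, err // still exists (err == nil) or a real error
		})
}
```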
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| test/e2e/steps/steps.go | Adds CatalogIsDeleted function and step registrations for testing catalog deletion and ClusterExtension updates |
| test/e2e/features/catalog-deletion-resilience.feature | Defines 8 test scenarios covering extension resilience, resource restoration, config changes, version upgrades, and revision behavior when catalog is deleted |
Force-pushed from d3cbb5a to f31b184
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
Force-pushed from f31b184 to dce6d68
internal/operator-controller/controllers/clusterextension_reconcile_steps.go (resolved comment)
Force-pushed from dce6d68 to b15c262
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
Force-pushed from b15c262 to b1d259e
Force-pushed from b1d259e to c6870c5
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.
internal/operator-controller/controllers/clusterextension_reconcile_steps.go (outdated, resolved comment)
internal/operator-controller/controllers/clusterextension_reconcile_steps.go (resolved comment)
Force-pushed from c6870c5 to 36e9069
Force-pushed from 36e9069 to 6799025
Force-pushed from 84e6cc6 to 865ac9b
Force-pushed from 865ac9b to 23b7677
Pull request overview
Copilot reviewed 9 out of 9 changed files in this pull request and generated 6 comments.
internal/operator-controller/controllers/clusterextension_reconcile_steps.go (outdated, resolved comment)
internal/operator-controller/controllers/clusterextension_reconcile_steps.go (resolved comment)
internal/operator-controller/controllers/clusterextension_reconcile_steps.go (outdated, resolved comment)
internal/operator-controller/controllers/clusterextension_reconcile_steps.go (outdated, resolved comment)
internal/operator-controller/controllers/clusterextension_reconcile_steps.go (outdated, resolved comment)
Codecov Report

❌ Patch coverage is …

Additional details and impacted files:

@@            Coverage Diff             @@
##             main    #2439      +/-   ##
==========================================
+ Coverage   73.00%   73.31%   +0.31%
==========================================
  Files         100      100
  Lines        7641     7727      +86
==========================================
+ Hits         5578     5665      +87
+ Misses       1625     1620       -5
- Partials      438      442       +4

Flags with carried forward coverage won't be shown. View full report in Codecov by Sentry.
Force-pushed from 23b7677 to e14ff89
Force-pushed from e14ff89 to db1c787
Pull request overview
Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.
internal/operator-controller/controllers/clusterextension_reconcile_steps.go (resolved comment)
Enables installed extensions to continue working when their source catalog becomes unavailable or is deleted. When resolution fails due to catalog unavailability, the operator now continues reconciling with the currently installed bundle instead of failing.

Changes:
- Resolution falls back to installed bundle when catalog unavailable
- Unpacking skipped when maintaining current installed state
- Helm and Boxcutter appliers handle nil contentFS gracefully
- Version upgrades properly blocked without catalog access

This ensures workloads remain stable and operational even when the catalog they were installed from is temporarily unavailable or deleted, while appropriately preventing version changes that require catalog access.
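As a rough illustration of the fallback rule this commit message describes, here is a hedged Go sketch; resolveWithFallback, Bundle, and the helper shapes are hypothetical and do not match the actual reconcile-step structure:

```go
package controllers

import (
	"context"
	"fmt"
)

// Bundle stands in for the resolved-bundle type, for illustration only.
type Bundle struct {
	Name    string
	Version string
}

// resolveWithFallback sketches the fallback rule: if resolution fails but a
// bundle is already installed and the spec does not request a different
// version, keep reconciling the installed bundle.
func resolveWithFallback(
	ctx context.Context,
	resolve func(context.Context) (*Bundle, error),
	installed *Bundle,
	specVersion string,
) (*Bundle, error) {
	b, err := resolve(ctx)
	if err == nil {
		return b, nil
	}
	if installed == nil {
		// Nothing installed yet: resolution failure is fatal for this pass.
		return nil, err
	}
	if specVersion != "" && specVersion != installed.Version {
		// A version change requires catalog access: fail and retry.
		return nil, fmt.Errorf("cannot change version without catalog access: %w", err)
	}
	// Fall back: keep the installed bundle. The caller then skips unpacking,
	// so the appliers see contentFS == nil and maintain the current state.
	return installed, nil
}
```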
Force-pushed from db1c787 to e59f517
/hold TBD internal discussion: https://redhat-internal.slack.com/archives/C06KP34REFJ/p1768241320285279?thread_ts=1768235428.251779&cid=C06KP34REFJ

/hold cancel to receive reviews
pedjak left a comment:
For the e2e test, after deleting the catalog we need to ensure that we assert resources only after the next reconciliation is done.
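One hedged way to do that in the Go step definitions, assuming Gomega, a controller-runtime client c, and the Retrying reason the feature file already asserts; the real step helpers may expose this differently:

```go
// After deleting the catalog, poll until the status reflects a new
// reconciliation (here: Progressing reports Retrying) before asserting
// on the workload resources.
g.Eventually(func(g gomega.Gomega) {
	got := &ocv1.ClusterExtension{}
	g.Expect(c.Get(ctx, client.ObjectKeyFromObject(ext), got)).To(gomega.Succeed())
	cond := meta.FindStatusCondition(got.Status.Conditions, "Progressing")
	g.Expect(cond).NotTo(gomega.BeNil())
	g.Expect(cond.Reason).To(gomega.Equal("Retrying"))
}).WithTimeout(2 * time.Minute).Should(gomega.Succeed())
```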
		ServiceAccount: ocv1.ServiceAccountReference{Name: "default"},
	},
}
require.NoError(t, cl.Create(ctx, ext))
If we are testing reconcile after a successful installation, then we should also populate the status properly to indicate that, or have two Reconcile calls: the first when the catalog is available, and the second when it is not.
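A sketch of that two-Reconcile pattern, assuming the reconciler, fake client cl, catalog, ext, and key from the surrounding test (all names taken on trust from the test context, not verified):

```go
// First reconcile: catalog present, extension installs successfully.
_, err := reconciler.Reconcile(ctx, reconcile.Request{NamespacedName: key})
require.NoError(t, err)

// Simulate the catalog disappearing between reconciliations.
require.NoError(t, cl.Delete(ctx, catalog))

// Second reconcile: resolution fails, but the installed bundle is kept.
_, err = reconciler.Reconcile(ctx, reconcile.Request{NamespacedName: key})
require.NoError(t, err)

// The extension should still report Installed=True.
require.NoError(t, cl.Get(ctx, key, ext))
cond := meta.FindStatusCondition(ext.Status.Conditions, "Installed")
require.NotNil(t, cond)
require.Equal(t, metav1.ConditionTrue, cond.Status)
```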
| "installedVersion", installedVersion) | ||
| setStatusProgressing(ext, err) | ||
| setInstalledStatusFromRevisionStates(ext, state.revisionStates) | ||
| ensureAllConditionsWithReason(ext, ocv1.ReasonRetrying, err.Error()) |
perhaps it would be better to set the condition message to the one emitted to the log?
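For instance, building the message once and reusing it for both the log line and the conditions (wording illustrative; the helper names are those visible in the quoted diff):

```go
// Build the message once so the status surfaces the same explanation as the logs.
msg := fmt.Sprintf("resolution failed, falling back to installed bundle %q: %v",
	installedVersion, err)
l.Info(msg)
setStatusProgressing(ext, errors.New(msg))
setInstalledStatusFromRevisionStates(ext, state.revisionStates)
ensureAllConditionsWithReason(ext, ocv1.ReasonRetrying, msg)
```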
// If spec requests a different version, we cannot fall back - must fail and retry
if specVersion != "" && specVersion != installedVersion {
	l.Info("resolution failed and spec requests version change - cannot fall back",
perhaps l.Error is better suited here?
| "error", err) | ||
| setStatusProgressing(ext, err) | ||
| setInstalledStatusFromRevisionStates(ext, state.revisionStates) | ||
| ensureAllConditionsWithReason(ext, ocv1.ReasonRetrying, err.Error()) |
perhaps it would be better to set the condition message to the one emitted to the log?
if catalogsExist {
	// Catalogs exist but resolution failed - likely a transient issue (catalog updating, cache stale, etc.)
	// Retry resolution instead of falling back
	l.Info("resolution failed but catalogs exist - retrying instead of falling back",
l.Error perhaps?
        matchLabels:
          "olm.operatorframework.io/metadata.name": test-catalog
  """
Then ClusterExtension reports Progressing as True with Reason Retrying
how are updates blocked, when we expect Retrying to be reported?
@BoxcutterRuntime
Scenario: Revision resources are restored after catalog deletion
  Given ClusterExtension is applied
    """
    apiVersion: olm.operatorframework.io/v1
    kind: ClusterExtension
    metadata:
      name: ${NAME}
    spec:
      namespace: ${TEST_NAMESPACE}
      serviceAccount:
        name: olm-sa
      source:
        sourceType: Catalog
        catalog:
          packageName: test
          selector:
            matchLabels:
              "olm.operatorframework.io/metadata.name": test-catalog
    """
  And ClusterExtension is rolled out
  And ClusterExtension is available
  And ClusterExtensionRevision "${NAME}-1" reports Available as True with Reason ProbesSucceeded
  And resource "configmap/test-configmap" exists
  When ClusterCatalog "test" is deleted
  And resource "configmap/test-configmap" is removed
  Then resource "configmap/test-configmap" is eventually restored
  And ClusterExtensionRevision "${NAME}-1" reports Available as True with Reason ProbesSucceeded
Not sure if we really need that test, given that we already have Scenario: Resources are restored after catalog deletion. From a user perspective it is only important that the removed resource is restored, and that ClusterExtension reports the right things. IMHO, ClusterExtensionRevision is an implementation detail that is of no interest to users.
We must ensure that we do not, for example, delete resources that belong to the owner. We should ensure that everything is kept.
And resource "deployment/test-operator" reports as not ready
Then ClusterExtensionRevision "${NAME}-1" reports Available as False with Reason ProbeFailure
When resource "deployment/test-operator" reports as ready
Why do we expect that the deployment should become not ready? The scenario states that
"Revision remains available when workload recovers after catalog deletion",
hence I would expect that the deployment remains available too.
And ClusterExtensionRevision "${NAME}-1" reports Available as True with Reason ProbesSucceeded

@BoxcutterRuntime
Scenario: Version upgrade with revisions blocked without catalog
IMHO, we can drop this test; we are perfectly covered by the test that does not assert ClusterExtensionRevision resources at all (which are, IMHO, an implementation detail).
And ClusterExtensionRevision "${NAME}-1" reports Available as True with Reason ProbesSucceeded

@BoxcutterRuntime
Scenario: Multiple revisions remain stable after catalog deletion
similar to above, not sure if we really need this test.
Problem

When a catalog becomes unavailable (deleted, registry offline, network issues), installed extensions break or stop being maintained. This PR ensures extensions continue working with their installed version until the catalog becomes available again.

What This Fixes

Issues on main when catalog is unavailable/deleted:

Note: Boxcutter already maintains resources via the CER controller; Helm did not.
Solution

Added smart fallback logic:

Key Changes
- reconcileExistingRelease() to maintain resources when contentFS == nil
- contentFS == nil (CER controller maintains)
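A minimal sketch of that nil-contentFS guard, assuming hypothetical applier method names; only reconcileExistingRelease is named in this PR, and the actual applier interface differs:

```go
// Apply tolerates a nil contentFS: when no new catalog content was unpacked,
// the Helm path re-reconciles the existing release instead of failing.
func (a *Applier) Apply(ctx context.Context, contentFS fs.FS, ext *ocv1.ClusterExtension) error {
	if contentFS == nil {
		// No unpacked catalog content: maintain the currently installed
		// release rather than failing the reconcile.
		return a.reconcileExistingRelease(ctx, ext)
	}
	// Normal path: render and apply the freshly unpacked bundle content.
	return a.applyBundle(ctx, contentFS, ext) // applyBundle is hypothetical
}
```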
What "Extension Continues Working" Means

An extension continues working when:
- Installed=True

Testing
Added a comprehensive e2e test suite in test/e2e/features/catalog-deletion-resilience.feature. All scenarios are tested for both Helm and Boxcutter runtimes where applicable.
What Still Requires Catalog (Correct Behavior)
Resolution Fails?
TL;DR Reconcile Workflow and Scenarios
Step 1: Resolution + rollout succeed (healthy)
Step 2: Resolution succeeds, rollout starts, rollout fails partway
Step 3: Catalog missing; resolution would fail, but we skip it
What happens:
Result:
Why this makes sense:
When does fallback happen?
Fallback to Installed only happens when:
In this scenario RollingOut is populated, so fallback never triggers.
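The gating rule above, as a predicate sketch (names assumed; Bundle as in the earlier fallback sketch): fall back only when something is installed, no rollout is in flight, and the spec does not pin a different version.

```go
// shouldFallBack reports whether reconciliation may fall back to the
// installed bundle instead of failing when resolution is impossible.
func shouldFallBack(installed *Bundle, rollingOut bool, specVersion string) bool {
	if installed == nil || rollingOut {
		return false // nothing to fall back to, or a rollout is in progress
	}
	return specVersion == "" || specVersion == installed.Version
}
```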
/hold until we have an RFC approved