🐛 Workload should still be resilient when catalog is deleted #2439
Conversation
✅ Deploy Preview for olmv1 ready!
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
Pull request overview
This PR adds comprehensive end-to-end tests to verify that installed OLM extensions continue functioning correctly when their source catalog is deleted. The tests cover both standard runtime and experimental Boxcutter runtime scenarios.
Changes:
- Added new feature file with 8 scenarios testing catalog deletion resilience
- Implemented CatalogIsDeleted function to support catalog deletion in tests (see the sketch after this list)
- Added step registrations for ClusterExtension update operations
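For reference, a minimal sketch of what such a CatalogIsDeleted step helper could look like, assuming a controller-runtime client and the ClusterCatalog API type; the import path, signature, and wait helpers in the actual steps.go may differ:

```go
package steps

import (
	"context"
	"fmt"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/util/wait"
	"sigs.k8s.io/controller-runtime/pkg/client"

	ocv1 "github.com/operator-framework/operator-controller/api/v1" // assumed import path
)

// CatalogIsDeleted deletes the named ClusterCatalog and waits until it is
// fully gone, so that subsequent steps observe the deletion.
func CatalogIsDeleted(ctx context.Context, c client.Client, name string) error {
	catalog := &ocv1.ClusterCatalog{}
	catalog.SetName(name)
	if err := c.Delete(ctx, catalog); err != nil && !apierrors.IsNotFound(err) {
		return fmt.Errorf("deleting ClusterCatalog %q: %w", name, err)
	}
	// Poll until the catalog object is no longer found.
	return wait.PollUntilContextTimeout(ctx, time.Second, time.Minute, true,
		func(ctx context.Context) (bool, error) {
			err := c.Get(ctx, client.ObjectKey{Name: name}, &ocv1.ClusterCatalog{})
			if apierrors.IsNotFound(err) {
				return true, nil // deletion complete
			}
			return false, err // still exists (err == nil) or a real error
		})
}
```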
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| test/e2e/steps/steps.go | Adds CatalogIsDeleted function and step registrations for testing catalog deletion and ClusterExtension updates |
| test/e2e/features/catalog-deletion-resilience.feature | Defines 8 test scenarios covering extension resilience, resource restoration, config changes, version upgrades, and revision behavior when catalog is deleted |
Force-pushed from d3cbb5a to f31b184
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
Force-pushed from f31b184 to dce6d68
internal/operator-controller/controllers/clusterextension_reconcile_steps.go (resolved comment)
Force-pushed from dce6d68 to b15c262
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
Force-pushed from b15c262 to b1d259e
Force-pushed from b1d259e to c6870c5
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.
internal/operator-controller/controllers/clusterextension_reconcile_steps.go (outdated, resolved comment)
internal/operator-controller/controllers/clusterextension_reconcile_steps.go (resolved comment)
Force-pushed from c6870c5 to 36e9069
Force-pushed from 36e9069 to 6799025
Force-pushed from 84e6cc6 to 865ac9b
Force-pushed from 865ac9b to 23b7677
Pull request overview
Copilot reviewed 9 out of 9 changed files in this pull request and generated 6 comments.
internal/operator-controller/controllers/clusterextension_reconcile_steps.go (outdated, resolved comment)
internal/operator-controller/controllers/clusterextension_reconcile_steps.go (resolved comment)
internal/operator-controller/controllers/clusterextension_reconcile_steps.go (outdated, resolved comment)
internal/operator-controller/controllers/clusterextension_reconcile_steps.go (outdated, resolved comment)
internal/operator-controller/controllers/clusterextension_reconcile_steps.go (outdated, resolved comment)
Codecov Report

❌ Patch coverage is …

Additional details and impacted files:

@@            Coverage Diff             @@
##             main    #2439      +/-   ##
==========================================
+ Coverage   73.00%   73.31%   +0.31%
==========================================
  Files         100      100
  Lines        7641     7727      +86
==========================================
+ Hits         5578     5665      +87
+ Misses       1625     1620       -5
- Partials      438      442       +4

Flags with carried forward coverage won't be shown. View full report in Codecov by Sentry.
Force-pushed from 23b7677 to e14ff89
Force-pushed from e14ff89 to db1c787
Pull request overview
Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.
internal/operator-controller/controllers/clusterextension_reconcile_steps.go (resolved comment)
Enables installed extensions to continue working when their source catalog becomes unavailable or is deleted. When resolution fails due to catalog unavailability, the operator now continues reconciling with the currently installed bundle instead of failing.

Changes:
- Resolution falls back to installed bundle when catalog unavailable
- Unpacking skipped when maintaining current installed state
- Helm and Boxcutter appliers handle nil contentFS gracefully
- Version upgrades properly blocked without catalog access

This ensures workloads remain stable and operational even when the catalog they were installed from is temporarily unavailable or deleted, while appropriately preventing version changes that require catalog access.
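As a rough illustration of the fallback rule this commit message describes, here is a hedged Go sketch; resolveWithFallback, Bundle, and the helper shapes are hypothetical and do not match the actual reconcile-step structure:

```go
package controllers

import (
	"context"
	"fmt"
)

// Bundle stands in for the resolved-bundle type, for illustration only.
type Bundle struct {
	Name    string
	Version string
}

// resolveWithFallback sketches the fallback rule: if resolution fails but a
// bundle is already installed and the spec does not request a different
// version, keep reconciling the installed bundle.
func resolveWithFallback(
	ctx context.Context,
	resolve func(context.Context) (*Bundle, error),
	installed *Bundle,
	specVersion string,
) (*Bundle, error) {
	b, err := resolve(ctx)
	if err == nil {
		return b, nil
	}
	if installed == nil {
		// Nothing installed yet: resolution failure is fatal for this pass.
		return nil, err
	}
	if specVersion != "" && specVersion != installed.Version {
		// A version change requires catalog access: fail and retry.
		return nil, fmt.Errorf("cannot change version without catalog access: %w", err)
	}
	// Fall back: keep the installed bundle. The caller then skips unpacking,
	// so the appliers see contentFS == nil and maintain the current state.
	return installed, nil
}
```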
Force-pushed from db1c787 to e59f517
/hold TBD internal discussion: https://redhat-internal.slack.com/archives/C06KP34REFJ/p1768241320285279?thread_ts=1768235428.251779&cid=C06KP34REFJ

/hold cancel to receive reviews
pedjak left a comment:
For the e2e test, after deleting the catalog we need to ensure that we assert resources only after the next reconciliation is done.
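One hedged way to do that in the Go step definitions, assuming Gomega, a controller-runtime client c, and the Retrying reason the feature file already asserts; the real step helpers may expose this differently:

```go
// After deleting the catalog, poll until the status reflects a new
// reconciliation (here: Progressing reports Retrying) before asserting
// on the workload resources.
g.Eventually(func(g gomega.Gomega) {
	got := &ocv1.ClusterExtension{}
	g.Expect(c.Get(ctx, client.ObjectKeyFromObject(ext), got)).To(gomega.Succeed())
	cond := meta.FindStatusCondition(got.Status.Conditions, "Progressing")
	g.Expect(cond).NotTo(gomega.BeNil())
	g.Expect(cond.Reason).To(gomega.Equal("Retrying"))
}).WithTimeout(2 * time.Minute).Should(gomega.Succeed())
```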
		ServiceAccount: ocv1.ServiceAccountReference{Name: "default"},
	},
}
require.NoError(t, cl.Create(ctx, ext))
If we are testing reconcile after a successful installation, then we should also populate the status properly to indicate that, or have two Reconcile calls: the first when the catalog is available, and the second when it is not.
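A sketch of that two-Reconcile pattern, assuming the reconciler, fake client cl, catalog, ext, and key from the surrounding test (all names taken on trust from the test context, not verified):

```go
// First reconcile: catalog present, extension installs successfully.
_, err := reconciler.Reconcile(ctx, reconcile.Request{NamespacedName: key})
require.NoError(t, err)

// Simulate the catalog disappearing between reconciliations.
require.NoError(t, cl.Delete(ctx, catalog))

// Second reconcile: resolution fails, but the installed bundle is kept.
_, err = reconciler.Reconcile(ctx, reconcile.Request{NamespacedName: key})
require.NoError(t, err)

// The extension should still report Installed=True.
require.NoError(t, cl.Get(ctx, key, ext))
cond := meta.FindStatusCondition(ext.Status.Conditions, "Installed")
require.NotNil(t, cond)
require.Equal(t, metav1.ConditionTrue, cond.Status)
```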
| "installedVersion", installedVersion) | ||
| setStatusProgressing(ext, err) | ||
| setInstalledStatusFromRevisionStates(ext, state.revisionStates) | ||
| ensureAllConditionsWithReason(ext, ocv1.ReasonRetrying, err.Error()) |
perhaps it would be better to set the condition message to the one emitted to the log?
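For instance, building the message once and reusing it for both the log line and the conditions (wording illustrative; the helper names are those visible in the quoted diff):

```go
// Build the message once so the status surfaces the same explanation as the logs.
msg := fmt.Sprintf("resolution failed, falling back to installed bundle %q: %v",
	installedVersion, err)
l.Info(msg)
setStatusProgressing(ext, errors.New(msg))
setInstalledStatusFromRevisionStates(ext, state.revisionStates)
ensureAllConditionsWithReason(ext, ocv1.ReasonRetrying, msg)
```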
// If spec requests a different version, we cannot fall back - must fail and retry
if specVersion != "" && specVersion != installedVersion {
	l.Info("resolution failed and spec requests version change - cannot fall back",
perhaps l.Error is better suited here?
| "error", err) | ||
| setStatusProgressing(ext, err) | ||
| setInstalledStatusFromRevisionStates(ext, state.revisionStates) | ||
| ensureAllConditionsWithReason(ext, ocv1.ReasonRetrying, err.Error()) |
perhaps it would be better to set the condition message to the one emitted to the log?
if catalogsExist {
	// Catalogs exist but resolution failed - likely a transient issue (catalog updating, cache stale, etc.)
	// Retry resolution instead of falling back
	l.Info("resolution failed but catalogs exist - retrying instead of falling back",
l.Error perhaps?
        matchLabels:
          "olm.operatorframework.io/metadata.name": test-catalog
  """
Then ClusterExtension reports Progressing as True with Reason Retrying
how are updates blocked, when we expect Retrying to be reported?
@BoxcutterRuntime
Scenario: Revision resources are restored after catalog deletion
  Given ClusterExtension is applied
    """
    apiVersion: olm.operatorframework.io/v1
    kind: ClusterExtension
    metadata:
      name: ${NAME}
    spec:
      namespace: ${TEST_NAMESPACE}
      serviceAccount:
        name: olm-sa
      source:
        sourceType: Catalog
        catalog:
          packageName: test
          selector:
            matchLabels:
              "olm.operatorframework.io/metadata.name": test-catalog
    """
  And ClusterExtension is rolled out
  And ClusterExtension is available
  And ClusterExtensionRevision "${NAME}-1" reports Available as True with Reason ProbesSucceeded
  And resource "configmap/test-configmap" exists
  When ClusterCatalog "test" is deleted
  And resource "configmap/test-configmap" is removed
  Then resource "configmap/test-configmap" is eventually restored
  And ClusterExtensionRevision "${NAME}-1" reports Available as True with Reason ProbesSucceeded
Not sure if we really need that test, given that we already have Scenario: Resources are restored after catalog deletion. From a user perspective it is only important that the removed resource is restored, and that ClusterExtension reports the right things. IMHO, ClusterExtensionRevision is an implementation detail that is of no interest to users.
We must ensure that we do not, for example, delete resources that belong to the owner. We should ensure that everything is kept.
And resource "deployment/test-operator" reports as not ready
Then ClusterExtensionRevision "${NAME}-1" reports Available as False with Reason ProbeFailure
When resource "deployment/test-operator" reports as ready
Why do we expect that the deployment should become not ready? The scenario states that
"Revision remains available when workload recovers after catalog deletion",
hence I would expect that the deployment remains available too.
And ClusterExtensionRevision "${NAME}-1" reports Available as True with Reason ProbesSucceeded

@BoxcutterRuntime
Scenario: Version upgrade with revisions blocked without catalog
IMHO, we can drop this test; we are perfectly covered by the test that does not assert ClusterExtensionRevision resources at all (which are, IMHO, an implementation detail).
And ClusterExtensionRevision "${NAME}-1" reports Available as True with Reason ProbesSucceeded

@BoxcutterRuntime
Scenario: Multiple revisions remain stable after catalog deletion
similar to above, not sure if we really need this test.
Problem

When a catalog becomes unavailable (deleted, registry offline, network issues), installed extensions break or stop being maintained. This PR ensures extensions continue working with their installed version until the catalog becomes available again.

What This Fixes

Issues on main when catalog is unavailable/deleted:

Note: Boxcutter already maintains resources via the CER controller; Helm did not.
Solution

Added smart fallback logic:

Key Changes
- reconcileExistingRelease() to maintain resources when contentFS == nil
- contentFS == nil (CER controller maintains)
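A minimal sketch of that nil-contentFS guard, assuming hypothetical applier method names; only reconcileExistingRelease is named in this PR, and the actual applier interface differs:

```go
// Apply tolerates a nil contentFS: when no new catalog content was unpacked,
// the Helm path re-reconciles the existing release instead of failing.
func (a *Applier) Apply(ctx context.Context, contentFS fs.FS, ext *ocv1.ClusterExtension) error {
	if contentFS == nil {
		// No unpacked catalog content: maintain the currently installed
		// release rather than failing the reconcile.
		return a.reconcileExistingRelease(ctx, ext)
	}
	// Normal path: render and apply the freshly unpacked bundle content.
	return a.applyBundle(ctx, contentFS, ext) // applyBundle is hypothetical
}
```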
What "Extension Continues Working" Means

An extension continues working when:
- Installed=True

Testing
Added a comprehensive e2e test suite in test/e2e/features/catalog-deletion-resilience.feature. All scenarios are tested for both Helm and Boxcutter runtimes where applicable.
What Still Requires Catalog (Correct Behavior)
Resolution Fails?
TL;DR Reconcile Workflow and Scenarios
Step 1: Resolution + rollout succeed (healthy)
Step 2: Resolution succeeds, rollout starts, rollout fails partway
Step 3: Catalog missing; resolution would fail, but we skip it
What happens:
Result:
Why this makes sense:
When does fallback happen?
Fallback to Installed only happens when:
In this scenario RollingOut is populated, so fallback never triggers.
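The gating rule above, as a predicate sketch (names assumed; Bundle as in the earlier fallback sketch): fall back only when something is installed, no rollout is in flight, and the spec does not pin a different version.

```go
// shouldFallBack reports whether reconciliation may fall back to the
// installed bundle instead of failing when resolution is impossible.
func shouldFallBack(installed *Bundle, rollingOut bool, specVersion string) bool {
	if installed == nil || rollingOut {
		return false // nothing to fall back to, or a rollout is in progress
	}
	return specVersion == "" || specVersion == installed.Version
}
```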
/hold until we have an RFC approved