Pods: Gang Scheduling & Topology awareness by Varsius · Pull Request #494 · cobaltcore-dev/cortex

Varsius · 2026-02-04T17:15:58Z

Overview

This PR introduces Gang Scheduling (#393) together with Topology Aware Scheduling (TAS) (#423) to the pods scheduler, and refactors the overall scheduling behavior (includes #421).

TODO

The code is in a working state, but the PR stays in draft because more work is needed to integrate it cleanly into the project:

Changelog

Introduce PodGroupSet (PGS) as a new CRD used for gang-scheduled workloads
Introduce a scheduling cache for tracking the state of the cluster's nodes and topology
- Events are received using informers, and the state of the pod/PGS is fetched from the respective lister
- Add code generation to get a client, informer and lister for PGSs
Introduce a priority scheduling queue similar to the one in kube-scheduler
- ActiveQueue: workloads that should be scheduled immediately
- BackoffQueue: workloads that could not be scheduled due to temporary failures, e.g. pipeline not ready
- UnschedulableQueue: workloads that are waiting for some event, e.g. new node or other pod completes
- Priority assignment using Largest-Gang-First-Served (LGFS)
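A minimal sketch of how Largest-Gang-First-Served ordering could look for the active queue, built on Go's container/heap. The type and field names (workload, GangSize, ActiveQueue) are hypothetical and not taken from this PR, and breaking ties by arrival order is an assumption:

```go
package main

import (
	"container/heap"
	"fmt"
)

// workload is a hypothetical queue entry: a single pod has GangSize 1,
// a PodGroupSet has GangSize equal to the number of pods in the gang.
type workload struct {
	Name     string
	GangSize int
	seq      uint64 // arrival order, used only as a tie-breaker (assumption)
}

// lgfsHeap orders workloads largest gang first, then by arrival order.
type lgfsHeap []*workload

func (h lgfsHeap) Len() int { return len(h) }
func (h lgfsHeap) Less(i, j int) bool {
	if h[i].GangSize != h[j].GangSize {
		return h[i].GangSize > h[j].GangSize
	}
	return h[i].seq < h[j].seq
}
func (h lgfsHeap) Swap(i, j int) { h[i], h[j] = h[j], h[i] }
func (h *lgfsHeap) Push(x any)   { *h = append(*h, x.(*workload)) }
func (h *lgfsHeap) Pop() any {
	old := *h
	n := len(old)
	w := old[n-1]
	old[n-1] = nil // drop the reference so it can be collected
	*h = old[:n-1]
	return w
}

// ActiveQueue is a simplified, non-thread-safe stand-in for the active queue.
type ActiveQueue struct {
	items lgfsHeap
	next  uint64
}

// Add enqueues a workload with the given gang size.
func (q *ActiveQueue) Add(name string, gangSize int) {
	heap.Push(&q.items, &workload{Name: name, GangSize: gangSize, seq: q.next})
	q.next++
}

// Pop returns the highest-priority workload, or false if the queue is empty.
func (q *ActiveQueue) Pop() (string, bool) {
	if q.items.Len() == 0 {
		return "", false
	}
	return heap.Pop(&q.items).(*workload).Name, true
}

func main() {
	q := &ActiveQueue{}
	q.Add("single-pod", 1)
	q.Add("pgs-small", 2)
	q.Add("pgs-large", 8)
	for name, ok := q.Pop(); ok; name, ok = q.Pop() {
		fmt.Println(name)
	}
}
```

A real queue would also need locking and the movement of workloads between the active, backoff and unschedulable queues, which this sketch leaves out.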
Reduce responsibilities of PipelineDecisionController to reconciling pipeline steps
Introduce PodGroupSetController that handles the lifecycle events of PodGroupSets
Introduce the scheduling loop as a separate goroutine
- Replaces the reconciliation of decisions for scheduling pods
- Enables the unified scheduling of both pods and PGSs
- Enables usage of the scheduling queue for event-based scheduling retry attempts
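To illustrate the idea of a scheduling loop running in its own goroutine, here is a rough sketch. It uses plain channels in place of the scheduling queue, and every name in it (runLoop, outcome, errNotReady) is hypothetical rather than taken from this PR:

```go
package main

import (
	"errors"
	"fmt"
)

// errNotReady stands in for a temporary failure such as "pipeline not ready".
var errNotReady = errors.New("pipeline not ready")

// outcome reports the result of one scheduling attempt.
type outcome struct {
	Workload string
	Err      error
}

// runLoop takes workloads from active one at a time, attempts to schedule
// each, and reports the result on out. It returns when stop is closed or
// active is closed and drained, and closes out on exit.
func runLoop(stop <-chan struct{}, active <-chan string, schedule func(string) error, out chan<- outcome) {
	defer close(out)
	for {
		select {
		case <-stop:
			return
		case w, ok := <-active:
			if !ok {
				return
			}
			select {
			case out <- outcome{Workload: w, Err: schedule(w)}:
			case <-stop:
				return
			}
		}
	}
}

func main() {
	active := make(chan string, 2)
	active <- "pod-a"
	active <- "pgs-b"
	close(active)

	// A fake scheduling attempt: the PGS fails with a temporary error.
	schedule := func(w string) error {
		if w == "pgs-b" {
			return errNotReady
		}
		return nil
	}

	out := make(chan outcome)
	go runLoop(make(chan struct{}), active, schedule, out)
	for o := range out {
		if o.Err != nil {
			// A real scheduler would move the workload to a backoff
			// or unschedulable queue here instead of just logging.
			fmt.Printf("%s: requeue (%v)\n", o.Workload, o.Err)
			continue
		}
		fmt.Printf("%s: scheduled\n", o.Workload)
	}
}
```

Because pods and PGSs pass through the same loop, a single code path can decide whether a failed workload should be retried after a backoff or parked until a cluster event arrives.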
Introduce a Topology data structure, used for TAS, that supports hierarchical scheduling attempts for PGSs
- The topology is defined by using node labels, e.g. cortex/topology-rack=rack-23
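A hierarchical scheduling attempt over label-defined topology could be sketched as follows: try the narrowest topology level first and widen only if no single domain can hold the whole gang. The node type, the free-slot counter, the placeGang function and the cortex/topology-zone label are assumptions made for illustration; only cortex/topology-rack appears in this PR.

```go
package main

import (
	"fmt"
	"sort"
)

// node is a simplified view of a cluster node. Free counts free pod
// slots and stands in for real resource accounting.
type node struct {
	Name   string
	Labels map[string]string
	Free   int
}

// placeGang walks the topology levels from narrowest to widest and
// returns the first domain whose nodes can hold the whole gang.
func placeGang(nodes []node, levels []string, gangSize int) (level, domain string, ok bool) {
	for _, key := range levels {
		// Sum free capacity per domain (label value) at this level.
		free := map[string]int{}
		for _, n := range nodes {
			if v, found := n.Labels[key]; found {
				free[v] += n.Free
			}
		}
		domains := make([]string, 0, len(free))
		for d := range free {
			domains = append(domains, d)
		}
		sort.Strings(domains) // deterministic choice among fitting domains
		for _, d := range domains {
			if free[d] >= gangSize {
				return key, d, true
			}
		}
	}
	return "", "", false
}

func main() {
	// "cortex/topology-zone" is a hypothetical second level.
	levels := []string{"cortex/topology-rack", "cortex/topology-zone"}
	nodes := []node{
		{"n1", map[string]string{"cortex/topology-rack": "rack-23", "cortex/topology-zone": "zone-a"}, 2},
		{"n2", map[string]string{"cortex/topology-rack": "rack-23", "cortex/topology-zone": "zone-a"}, 1},
		{"n3", map[string]string{"cortex/topology-rack": "rack-24", "cortex/topology-zone": "zone-a"}, 2},
	}
	for _, size := range []int{3, 5, 6} {
		level, domain, ok := placeGang(nodes, levels, size)
		fmt.Println(size, level, domain, ok)
	}
}
```

In this example a gang of 3 fits within rack-23, a gang of 5 only fits when widening to the zone, and a gang of 6 does not fit anywhere, so it would have to wait for a cluster event.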

Varsius added 10 commits February 4, 2026 17:31

pod-pipeline: add gang scheduling and TAS

3e70ad7

pods-pipeline: add events for publishing scheduling results

15944fb

WIP: checkpoint in case everything goes wrong

c10036a

WIP: pod scheduling works, PGS not, needs a lot of cleanup

597e21d

WIP: cleanup

bed0495

WIP: PGS informer/lister (untested)

c6b474a

WIP: testing successfull, cleanup needed

e089c9f

cleanup

c389881

run make generate

37a6d96

fix lints

bf6ca2e

Varsius wants to merge 10 commits into cobaltcore-dev:main from Varsius:topology-awareness