Pods: Gang Scheduling & Topology awareness #494

Draft
Varsius wants to merge 10 commits into cobaltcore-dev:main from Varsius:topology-awareness


Varsius commented Feb 4, 2026

Overview

This PR introduces Gang Scheduling (#393) together with Topology Aware Scheduling (TAS) (#423) to the pods scheduler, and refactors the overall scheduling behavior (includes #421).

TODO

The code is in a working state. However, the PR remains a draft because more work is needed for a clean integration into the project:

  • Add back the PodGroupSetController to handle the creation of pods from scheduled PGSs, which is currently done by the scheduler
  • Add unit tests for all newly added and refactored components:
    • Scheduling Cache
    • Event Handlers
    • PodGroupSet Scheduling
    • Pod Scheduling
    • Scheduling Queue
    • Scheduler
    • Topology
  • List and implement the various TODO comments throughout the code
  • Documentation for PodGroupSets and Topology labels

Changelog

  • Introduce PodGroupSet (PGS) as a new CRD which is used for gang-scheduled workloads
  • Introduce a scheduling cache for tracking the state of the cluster's nodes and topology
    • Events are received using informers, and the state of the pod/PGS is fetched from the respective lister
    • Add code generation to get a client, informer and lister for PGSs
  • Introduce a priority scheduling queue similar to the kube-scheduler
    • ActiveQueue: workloads that should be scheduled immediately
    • BackoffQueue: workloads that could not be scheduled due to temporary failures, e.g. the pipeline is not ready
    • UnschedulableQueue: workloads that are waiting for some event, e.g. a new node is added or another pod completes
    • Priority assignment using Largest-Gang-First-Served (LGFS)
  • Reduce responsibilities of PipelineDecisionController to reconciling pipeline steps
  • Introduce PodGroupSetController that handles the lifecycle events of PodGroupSets
  • Introduce the scheduling loop as a separate goroutine
    • Replaces the reconciliation of decisions for scheduling pods
    • Enables the unified scheduling of both pods and PGSs
    • Enables usage of the scheduling queue for event-based scheduling retry attempts
  • Introduce a Topology data structure, used for TAS, that supports hierarchical scheduling attempts for PGSs
    • The topology is defined by using node labels, e.g. cortex/topology-rack=rack-23
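
To illustrate the label-based topology from the last changelog item, here is a minimal sketch of a hierarchical placement attempt. It is not the code in this PR. The `Node` type, the `findDomain` helper, and the `cortex/topology-zone` key are hypothetical; only `cortex/topology-rack` comes from the description above. The sketch also assumes that a domain "fits" a gang when the summed free pod slots of its nodes cover the gang size.

```go
package main

import (
	"fmt"
	"sort"
)

// Node is a simplified, hypothetical stand-in for a cluster node:
// its labels plus the number of pods it can still accept.
type Node struct {
	Name   string
	Labels map[string]string
	Free   int
}

const (
	// RackLabel is the label key mentioned in the PR description.
	RackLabel = "cortex/topology-rack"
	// ZoneLabel is a hypothetical broader level, added only for illustration.
	ZoneLabel = "cortex/topology-zone"
)

// groupByLabel buckets nodes by the value of one topology label.
// Nodes that do not carry the label are skipped.
func groupByLabel(nodes []Node, key string) map[string][]Node {
	domains := make(map[string][]Node)
	for _, n := range nodes {
		if v, ok := n.Labels[key]; ok {
			domains[v] = append(domains[v], n)
		}
	}
	return domains
}

// findDomain walks the levels from narrowest to broadest and returns the
// first domain whose total free capacity fits the whole gang.
func findDomain(nodes []Node, levels []string, gangSize int) (level, domain string, ok bool) {
	for _, key := range levels {
		domains := groupByLabel(nodes, key)
		names := make([]string, 0, len(domains))
		for name := range domains {
			names = append(names, name)
		}
		sort.Strings(names) // deterministic iteration order
		for _, name := range names {
			free := 0
			for _, n := range domains[name] {
				free += n.Free
			}
			if free >= gangSize {
				return key, name, true
			}
		}
	}
	return "", "", false
}

// exampleNodes returns a tiny cluster: two racks inside one zone.
func exampleNodes() []Node {
	return []Node{
		{Name: "node-1", Labels: map[string]string{RackLabel: "rack-23", ZoneLabel: "zone-a"}, Free: 2},
		{Name: "node-2", Labels: map[string]string{RackLabel: "rack-23", ZoneLabel: "zone-a"}, Free: 1},
		{Name: "node-3", Labels: map[string]string{RackLabel: "rack-24", ZoneLabel: "zone-a"}, Free: 2},
	}
}

func main() {
	levels := []string{RackLabel, ZoneLabel}
	fmt.Println(findDomain(exampleNodes(), levels, 3)) // fits inside rack-23
	fmt.Println(findDomain(exampleNodes(), levels, 5)) // only fits at zone level
}
```

In this sketch a gang of 3 stays within a single rack, while a gang of 5 falls back to the broader zone level. A real implementation would also have to account for per-node resources and filters rather than a single free-slot counter.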

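The Largest-Gang-First-Served (LGFS) priority assignment listed in the changelog can be sketched with Go's `container/heap`. This is an illustration under assumptions, not the queue implemented in this PR. It reads the name literally (the workload with the most pods is popped first) and breaks ties in FIFO order, which is an assumption. The `Workload` and `ActiveQueue` types here are hypothetical.

```go
package main

import (
	"container/heap"
	"fmt"
)

// Workload is a hypothetical queue entry. A single pod has GangSize 1;
// a PodGroupSet has GangSize equal to the number of pods it needs.
type Workload struct {
	Name     string
	GangSize int
	seq      int // insertion order, used as a FIFO tie-breaker (assumption)
}

// lgfsHeap implements heap.Interface so the largest gang is popped first.
type lgfsHeap []*Workload

func (h lgfsHeap) Len() int { return len(h) }

func (h lgfsHeap) Less(i, j int) bool {
	if h[i].GangSize != h[j].GangSize {
		return h[i].GangSize > h[j].GangSize // larger gangs come first
	}
	return h[i].seq < h[j].seq // equal size: earlier arrival first
}

func (h lgfsHeap) Swap(i, j int) { h[i], h[j] = h[j], h[i] }

func (h *lgfsHeap) Push(x any) { *h = append(*h, x.(*Workload)) }

func (h *lgfsHeap) Pop() any {
	old := *h
	n := len(old)
	w := old[n-1]
	old[n-1] = nil // drop the reference so it can be garbage collected
	*h = old[:n-1]
	return w
}

// ActiveQueue is a minimal, non-concurrent stand-in for the active queue.
type ActiveQueue struct {
	items lgfsHeap
	next  int
}

// Add enqueues a workload with the given gang size.
func (q *ActiveQueue) Add(name string, gangSize int) {
	heap.Push(&q.items, &Workload{Name: name, GangSize: gangSize, seq: q.next})
	q.next++
}

// Pop returns the highest-priority workload, or false if the queue is empty.
func (q *ActiveQueue) Pop() (*Workload, bool) {
	if q.items.Len() == 0 {
		return nil, false
	}
	return heap.Pop(&q.items).(*Workload), true
}

func main() {
	q := &ActiveQueue{}
	q.Add("pod-a", 1)
	q.Add("pgs-train", 8)
	q.Add("pgs-infer", 4)
	q.Add("pod-b", 1)

	for {
		w, ok := q.Pop()
		if !ok {
			break
		}
		fmt.Println(w.Name, w.GangSize)
	}
}
```

The sketch leaves out locking and the backoff and unschedulable queues. A scheduling loop running in its own goroutine would need synchronization around `Add` and `Pop`.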