Skip to content

Conversation

@pajakd
Copy link
Contributor

@pajakd pajakd commented Sep 16, 2025

What type of PR is this?

/kind feature

What this PR does / why we need it:

This PR introduces an alternative mode of pod (or pod slices) placement in TAS: balanced placement. In this placement, Kueue TAS still tries to use the minimum possible number of topology domains but also maximizes the minimum number of pods (or pod slices) placed on a domain. This balancing happens when feature gate TASBalancedPlacement is enabled on a level indicated by the user with the Preferred flag.

Which issue(s) this PR fixes:

Fixes #6554

Special notes for your reviewer:

Does this PR introduce a user-facing change?

TAS: The balanced placement is introduced with the TASBalancedPlacement feature gate.

@k8s-ci-robot
Copy link
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/feature Categorizes issue or PR as related to a new feature. labels Sep 16, 2025
@netlify
Copy link

netlify bot commented Sep 16, 2025

Deploy Preview for kubernetes-sigs-kueue canceled.

Name Link
🔨 Latest commit de3d9c5
🔍 Latest deploy log https://app.netlify.com/projects/kubernetes-sigs-kueue/deploys/690daa3d4a60580008cee22e

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Sep 16, 2025
@pajakd
Copy link
Contributor Author

pajakd commented Sep 16, 2025

/test all

@pajakd
Copy link
Contributor Author

pajakd commented Sep 16, 2025

/test all

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 17, 2025
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 17, 2025
@pajakd
Copy link
Contributor Author

pajakd commented Oct 17, 2025

/test all

1 similar comment
@pajakd
Copy link
Contributor Author

pajakd commented Oct 17, 2025

/test all

@pajakd pajakd marked this pull request as ready for review October 20, 2025 06:26
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 20, 2025
@k8s-ci-robot k8s-ci-robot requested a review from tenzen-y October 20, 2025 06:26
sortDomainsByCapacityAndEntropy(domains)
}

// domain_placements[i][j][k] stores a list of domains that uses 'i' domains with
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: this is explanation is generally nice - but in 1 way ambiguous:

A. "use (i domains with j leaders) and (k pods left to fit)
B. "use i domains with ((j leaders and k pods) left to fit)

It's not "abstract mean creativity"; I originally understood A (while I think you meant B).
I think I fell into A due to reading left-to-right too impatiently.

I'm not sure if it can be phrased more clearly and still nicely.
How about just placing a pair of ( ) around "j leaders and k pods"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the description. Also updated k pods to k workers. Bc leaders are pods too. But this might actually cause more confusion as LeaderWorkerSet is only one of the types of jobs we support.

Copy link
Contributor

@PBundyra PBundyra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will proceed with the review when you address olekzabl's comments

@pajakd
Copy link
Contributor Author

pajakd commented Nov 6, 2025

@mwysokin we can definitely think about how to incorporate required and unconstrained into the balanced TAS.

The idea of using the preferred is that we need a way to mark a specific level of topology on which we perform the balancing. I might be wrong but I think that the preferred should cover all the use-cases we had so far.

@mwysokin
Copy link
Contributor

mwysokin commented Nov 6, 2025

I wouldn't block this PR from being merged just because there's no support for required and unconstrained. I do think though that it'd be quite beneficial long-term so I'd do a quick follow-up. Maybe we could have someone from AWS and Azure chime-in what the default behavior should be for unconstrained? We'll need to make assumptions about what is the level where the balancing happens, we know that for GKE the topology is block/rack/host/node. If AWS, Azure, Oracle and OpenShift have similar structures maybe we could always assume that we're balancing at 3rd from the bottom (racks) if not level is provided like in unconstrained.

@mimowo
Copy link
Contributor

mimowo commented Nov 12, 2025

What is the algorithmic complexity of the balanced placement in relation to BestFit, at least in some simplified cases. Do we expect increased performance might be a drawback?

Knowing that could influence the decision if this should be default mode or the drawbacks are so big that it should be explicitly opted in

EDIT: it is not a blocker for Alpha. just would like to know going forward

@pajakd
Copy link
Contributor Author

pajakd commented Nov 12, 2025

What is the algorithmic complexity of the balanced placement in relation to BestFit, at least in some simplified cases. Do we expect increased performance might be a drawback?

Knowing that could influence the decision if this should be default mode or the drawbacks are so big that it should be explicitly opted in

EDIT: it is not a blocker for Alpha. just would like to know going forward

The main component in the complexity of Balanced placement is something like

(max # racks in a block)^2 * (# slices to place)

if we balance on "rack" level.

The complexity of BestFit is (unless I'm missing something) linear everywhere. So, there might be some increase in complexity but I don't think it will be too much -- especially that this feature is designed for a special hardware.

// if unconstrained is set, we'll only do it once
currFitDomain = s.updateCountsToMinimumGeneric(currFitDomain, count, leaderCount, sliceSize, unconstrained, true)

for levelIdx := fitLevelIdx; levelIdx+1 < len(s.domainsPerLevel); levelIdx++ {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering about the proper injection point. The current approach is to split at the very top, but I think conceptually balanced placement is very similar to phase 2b. I can see similar traverse down code. So, the natural idea is to inject as we are stepping down and reuse the iteration. I would expect what balanced placement changes is the algorithm for sorting the domains inside sortedDomains, but to reuse calls to updateCountsToMinimumGeneric.

I'm not totally sure if this is a good idea, this is an exploratory question to check your thoughts.

Comment on lines +744 to +745
var currFitDomain []*domain
var fitLevelIdx int
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is better to reduce the scope of the helper variables between the algorithms

Comment on lines +747 to +749
if features.Enabled(features.TASBalancedPlacement) && !required && !unconstrained {
var bestThreshold int32
currFitDomain, bestThreshold, useBalancedPlacement = findBestDomainsForBalancedPlacement(s, levelIdx, sliceLevelIdx, count, leaderCount, sliceSize)
Copy link
Contributor

@mimowo mimowo Nov 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer to encapsulate the information, like:

type balancedPlacementInput struct{
...
}
then

initializationInfo := initializeAlgorithm()
if initializationInfo.useBalancedPlacement {
   // applyBalancedPlacementAlgorithm
} else {
  // applyBestFit
} 

but I also left an exploratory question to consider commonizing the code more #6851 (comment)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming the overall structure of the code stays, it feels it would read better in this form:

var result
var reason
if tryUsingBalancedPlacement() {
   result, reason = useBalancedPlacement()
	if len(reason) > 0 {
		return nil, reason
	}
}
result, reason := useDefaultPlacement()
if len(reason) > 0 {
	return nil, reason
}
return result

As to make the dispatching logic more clear

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think what I'm trying to say is that the current injection method seems like a weird mix of two approaches:

  1. generalize the current algorithm flow for 2b and inject the logic when stepping down
  2. generalize the top-level idea of a placement algorithm and introduce some structs for results, using structure similar to [TAS] Balanced placement #6851 (comment)

So, currently we don't benefit from code-reuse in (1.), but also there is no clear place for algorithm dispatching (2.) , because some internal structures are leaking into the dispatching logic. If we go with (2.) I would like to have some structs representing the result in a generic way.

I think going with (1.) feels more natural to me, and is more consistent with the direction we chose for LeastFreeCapacity, but I don't have a strong view between them.

IF the directions were considered but determined as "too hard" I would like to know the reasons. We may then consider if this goes in the PR, or we plan some refactoring for future, or just keep as is long term.

Comment on lines +1435 to +1460
func clearState(d *domain) {
d.state = int32(0)
d.sliceState = int32(0)
d.stateWithLeader = int32(0)
d.sliceStateWithLeader = int32(0)
d.leaderState = int32(0)
for _, child := range d.children {
clearState(child)
}
}

func (s *TASFlavorSnapshot) pruneDomainsBelowThreshold(domains []*domain, threshold int32, sliceSize int32, sliceLevelIdx int, level int) {
for _, d := range domains {
for _, c := range d.children {
if c.sliceStateWithLeader < threshold {
clearState(c)
}
}
}
for _, d := range domains {
d.state, d.sliceState, d.stateWithLeader, d.sliceStateWithLeader, d.leaderState = s.fillInCountsHelper(d, sliceSize, sliceLevelIdx, level)
if d.sliceStateWithLeader < threshold {
clearState(d)
}
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would move out these functions from tas_flavor_snapshot, because they are not used here, so we have a cyclic dependency between files. While this works, it is a bit of code smell. If the functions have a potential for re-usability (they may have), then move them to a helper file like tas_helpers.go

@mimowo
Copy link
Contributor

mimowo commented Nov 13, 2025

Thank you 👍
/lgtm
/approve

I think the remaining comments are non-blocking:

We also need to improve the release-note, so that a user get a feeling of how it works, and why to enable it.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 13, 2025
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: 3175763111f4e891762e487df15487006f3d3264

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mimowo, olekzabl, pajakd

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 13, 2025
@k8s-ci-robot k8s-ci-robot merged commit 79cef95 into kubernetes-sigs:main Nov 13, 2025
23 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v0.15 milestone Nov 13, 2025

// findBestDomainsForBalancedPlacement evaluates domains for balanced placement.
// It returns the best set of domains, the balance threshold, and whether a balanced placement is possible.
func findBestDomainsForBalancedPlacement(s *TASFlavorSnapshot, levelIdx, sliceLevelIdx int, count, leaderCount, sliceSize int32) ([]*domain, int32, bool) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The last returned bool is redundant, I think we could just check if threshold > 0

This was referenced Nov 13, 2025
Singularity23x0 pushed a commit to Singularity23x0/kueue that referenced this pull request Nov 17, 2025
* Balanced placement for TAS

* Update after merge

* Sort keys

* Address comments

* Fix pruning

* Fix

* Address comments

* Fix verify

* Comment on the future of the FG

* Update doc

* Update site/content/en/docs/concepts/topology_aware_scheduling.md

Co-authored-by: Olek Zabłocki <[email protected]>

* Update site/content/en/docs/concepts/topology_aware_scheduling.md

Co-authored-by: Olek Zabłocki <[email protected]>

* Update keps/2724-topology-aware-scheduling/README.md

Co-authored-by: Olek Zabłocki <[email protected]>

* Add comment

* Add link to cleanup issue

* Address comments

* Align after merge

* Additional testcases to increase coverage

---------

Co-authored-by: Olek Zabłocki <[email protected]>
@pajakd pajakd mentioned this pull request Nov 27, 2025
3 tasks
@tenzen-y
Copy link
Member

/release-note-edit

TAS: The balanced placement is introduced with the TASBalancedPlacement feature gate. 

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/documentation Categorizes issue or PR as related to documentation. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

TAS: support balanced placement

8 participants