-
Notifications
You must be signed in to change notification settings - Fork 479
[TAS] Balanced placement #6851
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[TAS] Balanced placement #6851
Conversation
|
Skipping CI for Draft Pull Request. |
✅ Deploy Preview for kubernetes-sigs-kueue canceled.
|
|
/test all |
|
/test all |
|
/test all |
1 similar comment
|
/test all |
| sortDomainsByCapacityAndEntropy(domains) | ||
| } | ||
|
|
||
| // domain_placements[i][j][k] stores a list of domains that uses 'i' domains with |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: this is explanation is generally nice - but in 1 way ambiguous:
A. "use (i domains with j leaders) and (k pods left to fit)
B. "use i domains with ((j leaders and k pods) left to fit)
It's not "abstract mean creativity"; I originally understood A (while I think you meant B).
I think I fell into A due to reading left-to-right too impatiently.
I'm not sure if it can be phrased more clearly and still nicely.
How about just placing a pair of ( ) around "j leaders and k pods"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the description. Also updated k pods to k workers. Bc leaders are pods too. But this might actually cause more confusion as LeaderWorkerSet is only one of the types of jobs we support.
PBundyra
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will proceed with the review when you address olekzabl's comments
|
@mwysokin we can definitely think about how to incorporate The idea of using the |
|
I wouldn't block this PR from being merged just because there's no support for |
|
What is the algorithmic complexity of the balanced placement in relation to BestFit, at least in some simplified cases. Do we expect increased performance might be a drawback? Knowing that could influence the decision if this should be default mode or the drawbacks are so big that it should be explicitly opted in EDIT: it is not a blocker for Alpha. just would like to know going forward |
The main component in the complexity of Balanced placement is something like (max # racks in a block)^2 * (# slices to place) if we balance on "rack" level. The complexity of BestFit is (unless I'm missing something) linear everywhere. So, there might be some increase in complexity but I don't think it will be too much -- especially that this feature is designed for a special hardware. |
| // if unconstrained is set, we'll only do it once | ||
| currFitDomain = s.updateCountsToMinimumGeneric(currFitDomain, count, leaderCount, sliceSize, unconstrained, true) | ||
|
|
||
| for levelIdx := fitLevelIdx; levelIdx+1 < len(s.domainsPerLevel); levelIdx++ { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering about the proper injection point. The current approach is to split at the very top, but I think conceptually balanced placement is very similar to phase 2b. I can see similar traverse down code. So, the natural idea is to inject as we are stepping down and reuse the iteration. I would expect what balanced placement changes is the algorithm for sorting the domains inside sortedDomains, but to reuse calls to updateCountsToMinimumGeneric.
I'm not totally sure if this is a good idea, this is an exploratory question to check your thoughts.
| var currFitDomain []*domain | ||
| var fitLevelIdx int |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is better to reduce the scope of the helper variables between the algorithms
| if features.Enabled(features.TASBalancedPlacement) && !required && !unconstrained { | ||
| var bestThreshold int32 | ||
| currFitDomain, bestThreshold, useBalancedPlacement = findBestDomainsForBalancedPlacement(s, levelIdx, sliceLevelIdx, count, leaderCount, sliceSize) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer to encapsulate the information, like:
type balancedPlacementInput struct{
...
}
then
initializationInfo := initializeAlgorithm()
if initializationInfo.useBalancedPlacement {
// applyBalancedPlacementAlgorithm
} else {
// applyBestFit
}
but I also left an exploratory question to consider commonizing the code more #6851 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Assuming the overall structure of the code stays, it feels it would read better in this form:
var result
var reason
if tryUsingBalancedPlacement() {
result, reason = useBalancedPlacement()
if len(reason) > 0 {
return nil, reason
}
}
result, reason := useDefaultPlacement()
if len(reason) > 0 {
return nil, reason
}
return resultAs to make the dispatching logic more clear
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think what I'm trying to say is that the current injection method seems like a weird mix of two approaches:
- generalize the current algorithm flow for 2b and inject the logic when stepping down
- generalize the top-level idea of a placement algorithm and introduce some structs for results, using structure similar to [TAS] Balanced placement #6851 (comment)
So, currently we don't benefit from code-reuse in (1.), but also there is no clear place for algorithm dispatching (2.) , because some internal structures are leaking into the dispatching logic. If we go with (2.) I would like to have some structs representing the result in a generic way.
I think going with (1.) feels more natural to me, and is more consistent with the direction we chose for LeastFreeCapacity, but I don't have a strong view between them.
IF the directions were considered but determined as "too hard" I would like to know the reasons. We may then consider if this goes in the PR, or we plan some refactoring for future, or just keep as is long term.
| func clearState(d *domain) { | ||
| d.state = int32(0) | ||
| d.sliceState = int32(0) | ||
| d.stateWithLeader = int32(0) | ||
| d.sliceStateWithLeader = int32(0) | ||
| d.leaderState = int32(0) | ||
| for _, child := range d.children { | ||
| clearState(child) | ||
| } | ||
| } | ||
|
|
||
| func (s *TASFlavorSnapshot) pruneDomainsBelowThreshold(domains []*domain, threshold int32, sliceSize int32, sliceLevelIdx int, level int) { | ||
| for _, d := range domains { | ||
| for _, c := range d.children { | ||
| if c.sliceStateWithLeader < threshold { | ||
| clearState(c) | ||
| } | ||
| } | ||
| } | ||
| for _, d := range domains { | ||
| d.state, d.sliceState, d.stateWithLeader, d.sliceStateWithLeader, d.leaderState = s.fillInCountsHelper(d, sliceSize, sliceLevelIdx, level) | ||
| if d.sliceStateWithLeader < threshold { | ||
| clearState(d) | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would move out these functions from tas_flavor_snapshot, because they are not used here, so we have a cyclic dependency between files. While this works, it is a bit of code smell. If the functions have a potential for re-usability (they may have), then move them to a helper file like tas_helpers.go
|
Thank you 👍 I think the remaining comments are non-blocking:
We also need to improve the release-note, so that a user get a feeling of how it works, and why to enable it. |
|
LGTM label has been added. Git tree hash: 3175763111f4e891762e487df15487006f3d3264
|
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: mimowo, olekzabl, pajakd The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
|
||
| // findBestDomainsForBalancedPlacement evaluates domains for balanced placement. | ||
| // It returns the best set of domains, the balance threshold, and whether a balanced placement is possible. | ||
| func findBestDomainsForBalancedPlacement(s *TASFlavorSnapshot, levelIdx, sliceLevelIdx int, count, leaderCount, sliceSize int32) ([]*domain, int32, bool) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The last returned bool is redundant, I think we could just check if threshold > 0
* Balanced placement for TAS * Update after merge * Sort keys * Address comments * Fix pruning * Fix * Address comments * Fix verify * Comment on the future of the FG * Update doc * Update site/content/en/docs/concepts/topology_aware_scheduling.md Co-authored-by: Olek Zabłocki <[email protected]> * Update site/content/en/docs/concepts/topology_aware_scheduling.md Co-authored-by: Olek Zabłocki <[email protected]> * Update keps/2724-topology-aware-scheduling/README.md Co-authored-by: Olek Zabłocki <[email protected]> * Add comment * Add link to cleanup issue * Address comments * Align after merge * Additional testcases to increase coverage --------- Co-authored-by: Olek Zabłocki <[email protected]>
|
/release-note-edit |
What type of PR is this?
/kind feature
What this PR does / why we need it:
This PR introduces an alternative mode of pod (or pod slices) placement in TAS: balanced placement. In this placement, Kueue TAS still tries to use the minimum possible number of topology domains but also maximizes the minimum number of pods (or pod slices) placed on a domain. This balancing happens when feature gate
TASBalancedPlacementis enabled on a level indicated by the user with thePreferredflag.Which issue(s) this PR fixes:
Fixes #6554
Special notes for your reviewer:
Does this PR introduce a user-facing change?