[BestEffortFIFO] Implement Sticky ClusterQueue Head Policy #7157
Conversation
✅ Deploy Preview for kubernetes-sigs-kueue canceled.
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: gabesaba. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Please use the release-note block so that the label …
I think it was because I had /release-note-edit
@gabesaba: /release-note-edit must be used with a single release note block.
/release-note-edit
@gabesaba: /release-note-edit must be used with a single release note block.
/release-note-edit
@gabesaba: /release-note-edit must be used with a single release note block.
test/integration/singlecluster/scheduler/fairsharing/fair_sharing_test.go
for range 4 {
    createWorkload("cq-p1", "2")
}
util.ExpectReservingActiveWorkloadsMetric(cqp1, 4)
nit: maybe it is better to use the Admitted metrics, as this could sometimes be updated before the quota reservation is fully processed, causing flakes; see #7155.
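For concreteness, a sketch of the suggested swap in the test (assuming util.ExpectAdmittedWorkloadsTotalMetric takes the ClusterQueue, a priority label, and the expected count, matching the call site shown later in this thread):

// Instead of asserting on the quota-reservation metric:
//     util.ExpectReservingActiveWorkloadsMetric(cqp1, 4)
// assert on the admitted-workloads counter, as suggested above,
// to avoid the flake described in #7155:
util.ExpectAdmittedWorkloadsTotalMetric(cqp1, "", 4)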
Updated.
/lgtm
In case you could fix #7157 (comment): since this is just a potential issue, I'm OK to optimize for fix delivery, so feel free to unhold; we could potentially fix it in a follow-up.
@mimowo: once the present PR merges, I will cherry-pick it on top of …
util.FinishEvictionOfWorkloadsInCQ(ctx, k8sClient, cq2, 2)

ginkgo.By("Expect active workloads")
util.ExpectReservingActiveWorkloadsMetric(cq1, 1)
I tried running locally and hit a flake here on the first run:
[FAILED] Timed out after 10.000s.
The function passed to Eventually failed at /usr/local/google/home/michalwozniak/go/src/sigs.k8s.io/kueue/test/util/util.go:571 with:
Expected
<int>: 0
to equal
<int>: 1
In [It] at: /usr/local/google/home/michalwozniak/go/src/sigs.k8s.io/kueue/test/integration/singlecluster/scheduler/fairsharing/fair_sharing_test.go:898 @ 10/03/25 17:11:51.661
PTAL
This flake seems to go away when this is addressed: #7157 (comment)
I ran this test 32 times without flakes. I'll try running 128 times now to see if there are any.
I see, maybe this is fixed by the metric change? I will also test locally with the metric change
Indeed, I tested locally 60 times after the latest changes and saw no failures; feel free to resolve.
It flaked 3/128 times for me.
There is a race condition where the preempted workloads get scheduled again, given that we consider the inadmissible head after deleting the sticky workload.
I updated the test to keep finishing the evictions. It passed 128/128 runs:
gomega.Eventually(func(g gomega.Gomega) {
    // Keep finishing any evictions on every poll, so workloads that are
    // evicted later (after earlier assertions already ran) are handled too.
    util.FinishEvictionsOfAnyWorkloadsInCq(ctx, k8sClient, cq2)
    // cq1's sticky head should eventually be admitted exactly once,
    // and both queues should converge to zero weighted share.
    util.ExpectAdmittedWorkloadsTotalMetric(cq1, "", 1)
    util.ExpectClusterQueueWeightedShareMetric(cq1, 0)
    util.ExpectClusterQueueWeightedShareMetric(cq2, 0)
}, util.Timeout, util.Interval).Should(gomega.Succeed())
https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/kubernetes-sigs_kueue/7157/pull-kueue-test-integration-extended-main/1975555739745259520
EDIT: #7194

/test pull-kueue-test-integration-extended-main

Thank you 👍
/lgtm
@mimowo: once the present PR merges, I will cherry-pick it on top of …
LGTM label has been added. Git tree hash: 01585ba9d560c7214e5fd48ab4bfb68291ff7477
@mimowo: new pull request created: #7197
@mimowo: #7157 failed to apply on top of branch "release-0.13".
@gabesaba Can you manually open a cherry-pick PR to release-0.13?
Yes. I am waiting for one more PR that I want to cherry-pick at the same time: additional logging for this feature.
Which PR do you mean? Has it already been merged?
[BestEffortFIFO] Implement Sticky ClusterQueue Head Policy (kubernetes-sigs#7157)
* [BestEffortFIFO] Implement Sticky ClusterQueue Head Policy
* assert using ExpectAdmittedWorkloadsTotalMetric
* FinishEvictionOfAnyWorkloadsInCq method
* fix flakiness; new test case
Here: #7201. It will get cherry-picked to 0.14 without conflicts. We can already start preparing the cherry-pick for 0.13 manually, and open the PR for them.
* [BestEffortFIFO] Implement Sticky ClusterQueue Head Policy (#7157)
* [BestEffortFIFO] Implement Sticky ClusterQueue Head Policy
* assert using ExpectAdmittedWorkloadsTotalMetric
* FinishEvictionOfAnyWorkloadsInCq method
* fix flakiness; new test case
* logging for 7157
/release-note-edit |
What type of PR is this?
/kind bug
What this PR does / why we need it:
When a workload, in a queue with BestEffortFIFO enabled, has preemption targets, we keep it at the head of the queue until it schedules (or returns inadmissible).
See original proposal here: #6929 (comment)
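As a rough illustration of the intended behavior, here is a minimal, hypothetical Go sketch. It is not the actual kueue scheduler code: the stickyHeadQueue and Workload types and the head/onSchedulingResult methods are invented for this example, and dequeueing and requeueing details are omitted.

package sketch

// Workload is a stand-in for a queued workload (hypothetical type for this sketch).
type Workload struct {
    Name string
}

// stickyHeadQueue sketches a BestEffortFIFO ClusterQueue whose head stays
// "sticky" while the preemptions it triggered are still in flight.
type stickyHeadQueue struct {
    stickyHead *Workload   // workload that issued preemptions and must be retried first
    fifo       []*Workload // remaining workloads in FIFO order
}

// head returns the next workload to attempt to schedule.
func (q *stickyHeadQueue) head() *Workload {
    if q.stickyHead != nil {
        // Keep retrying the same workload; other workloads in the queue
        // do not jump ahead while its preemption targets are being evicted.
        return q.stickyHead
    }
    if len(q.fifo) == 0 {
        return nil
    }
    return q.fifo[0]
}

// onSchedulingResult records the outcome of a scheduling attempt for w.
func (q *stickyHeadQueue) onSchedulingResult(w *Workload, admitted, inadmissible, issuedPreemptions bool) {
    switch {
    case admitted, inadmissible:
        // The workload either scheduled or came back inadmissible:
        // release the sticky head so the rest of the queue can proceed.
        q.stickyHead = nil
    case issuedPreemptions:
        // Preemption targets were chosen but have not finished evicting:
        // pin this workload at the head until the next scheduling cycle.
        q.stickyHead = w
    }
}

The point is simply that, under BestEffortFIFO, other workloads in the same ClusterQueue cannot overtake a workload whose preemption targets are still being evicted, which is what makes the head "sticky".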
Which issue(s) this PR fixes:
Fixes #7101, #6929
Special notes for your reviewer:
Let's discuss whether we want to put this behind a feature gate.
Does this PR introduce a user-facing change?