-
Notifications
You must be signed in to change notification settings - Fork 479
RayJobs with clusterSelectors should not be managed/validated by kueue. #7218
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RayJobs with clusterSelectors should not be managed/validated by kueue. #7218
Conversation
✅ Deploy Preview for kubernetes-sigs-kueue canceled.
|
|
Welcome @laurafitzgerald! |
|
Hi @laurafitzgerald. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
/ok-to-test |
|
@laurafitzgerald thank you for the PR. What was the behavior on the main branch when "RayJob is with clusterSelector, but without queue-name"? Also, could you please add an integration or e2e test? cc @andrewsykim ptal |
|
Please target the PR against the main branch, once merged we will cherrypick the fix against 0.13 and 0.14 |
|
Can't speak to correctness of the code, will defer to @mimowo. But it makes sense that Kueue should ignore RayJobs that speciify a cluster selector |
|
Thank you @andrewsykim for confirming the behavior change is intended. The code changes themselves lgtm, but I would like to have an integration test. |
kryanbeane
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lgtm!
f98847e to
06d587d
Compare
06d587d to
ecc5972
Compare
|
Hi @mimowo
The behaviour on main for "RayJob is with clusterSelector, but without queue-name"?
w.manageJobsWithoutQueueName = typically false (default) In the case where manageJobsWithoutQueueName is true, there is at least one point where there could be a panic faced. For example https://github.com/kubernetes-sigs/kueue/blob/main/pkg/controller/jobs/rayjob/rayjob_webhook.go#L128 I can fix that. If the value is not present it's fine because then there is no autoscaling so we should simply handle the deref.
For sure. 👍 |
ecc5972 to
4a30ceb
Compare
|
/remove-milestone v0.13 |
|
@mimowo would it be possible to include this in any upcoming 0.14.x. Currently it's blocking the possibility to use namespace labelling for kueue with RayJobs which have a clusterSpec, aka a RayJob against a RayCluster which is already managed by kueue. So it could be described as a bug. if there was a 0.13.x is there any possibility to include it there? |
4a30ceb to
5a5ed32
Compare
|
When this merges it will be cherry picked to 0.14 and 0.13. |
5a5ed32 to
cb52a4f
Compare
|
@laurafitzgerald: The following test failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
cb52a4f to
e1de66e
Compare
…to panic len(spec.ClusterSelector)>0 && spec.RayClusterSpec != nil -> validation error len(spec.ClusterSelector)>0 && spec.RayClusterSpec == nil - valid len(spec.ClusterSelector)==0 && spec.ClusterSpec == nil -> validation error len(spec.ClusterSelector)==0 && spec.ClusterSpec != nil -> valid + perform additional validation
e1de66e to
f4fa7f7
Compare
mimowo
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you 👍
/lgtm
/approve
|
LGTM label has been added. Git tree hash: 0b24af9692339576244e0ac6ab240f6237251ecb
|
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: kryanbeane, laurafitzgerald, mimowo The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
@mimowo: new pull request created: #7240 In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
@mimowo: new pull request created: #7241 In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
@laurafitzgerald could you please prepare the cherrypicks manually? The PRs prepared by the robot fail You can use for that the ./hack/cherry_pick_pull.sh script. |
|
/kind bug |
Title
Fix RayJob webhook panic and skip reconciliation for jobs with clusterSelector
Description
This PR fixes a panic in the RayJob webhook validation and ensures that Kueue completely ignores RayJobs that use
clusterSelectorto reference existing RayClusters.Problem
When a RayJob has
spec.clusterSelectorset (indicating it should use an existing RayCluster), the webhook validation would panic trying to access fields inspec.RayClusterSpecthat may be nil or incomplete. Additionally, even if the webhook passed, the reconciler would attempt to manage these jobs, which is incorrect since they reference existing clusters.Changes
1. Webhook Changes (
pkg/controller/jobs/rayjob/rayjob_webhook.go)validateCreate()to skip all Kueue validation whenclusterSelectoris presentclusterSelector(since we now ignore them completely)RayClusterSpecfields for these jobs2. Reconciler Changes (
pkg/controller/jobs/rayjob/rayjob_controller.go)JobWithSkipinterface for RayJobSkip()method that returnstruewhenclusterSelectoris presentBehavior
Use Case
This enables the following workflow:
clusterSelectorto run on that clusterTesting
clusterSelectorclusterSelectorclusterSelector) continue to work as expectedRelated Issues
Fixes #[issue-number]
Checklist
JobWithSkipinterface)/release-note-edit