Skip to content

Conversation

@laurafitzgerald
Copy link
Contributor

@laurafitzgerald laurafitzgerald commented Oct 9, 2025

Title

Fix RayJob webhook panic and skip reconciliation for jobs with clusterSelector

Description

This PR fixes a panic in the RayJob webhook validation and ensures that Kueue completely ignores RayJobs that use clusterSelector to reference existing RayClusters.

Problem

When a RayJob has spec.clusterSelector set (indicating it should use an existing RayCluster), the webhook validation would panic trying to access fields in spec.RayClusterSpec that may be nil or incomplete. Additionally, even if the webhook passed, the reconciler would attempt to manage these jobs, which is incorrect since they reference existing clusters.

Changes

1. Webhook Changes (pkg/controller/jobs/rayjob/rayjob_webhook.go)

  • Added early return in validateCreate() to skip all Kueue validation when clusterSelector is present
  • Removed the validation error that rejected jobs with clusterSelector (since we now ignore them completely)
  • Prevents nil pointer dereference by not accessing RayClusterSpec fields for these jobs

2. Reconciler Changes (pkg/controller/jobs/rayjob/rayjob_controller.go)

  • Implemented JobWithSkip interface for RayJob
  • Added Skip() method that returns true when clusterSelector is present
  • Ensures reconciler exits immediately without processing these jobs

Behavior

RayJob Configuration Webhook Reconciler Result
WITH clusterSelector (any labels) ✅ Passes immediately ✅ Skips immediately Completely ignored by Kueue
WITHOUT clusterSelector + queue label ✅ Full validation ✅ Full reconciliation Managed by Kueue
WITHOUT clusterSelector + no queue label ✅ No validation ✅ Ignores (no queue name) Not managed by Kueue

Use Case

This enables the following workflow:

  1. User creates a RayCluster with a Kueue queue label
  2. Kueue admits the RayCluster when resources are available
  3. User submits RayJobs with clusterSelector to run on that cluster
  4. The RayJobs are handled entirely by the Ray operator, without Kueue interference
  5. Multiple jobs can run against the same admitted cluster

Testing

  • Webhook validation passes for RayJobs with clusterSelector
  • Reconciler skips RayJobs with clusterSelector
  • Normal RayJobs (without clusterSelector) continue to work as expected
  • No linter errors

Related Issues

Fixes #[issue-number]

Checklist

  • Added appropriate comments explaining the behavior
  • Implemented both webhook and reconciler changes for consistency
  • Followed existing patterns (JobWithSkip interface)
  • No breaking changes to existing RayJob functionality

/release-note-edit

Fix handling of RayJobs which specify the spec.clusterSelector and the "queue-name" label for Kueue. These jobs should be ignored by kueue as they are being submitted to a RayCluster which is where the resources are being used and was likely already admitted by kueue. No need to double admit.
Fix on a panic on kueue managed jobs if spec.rayClusterSpec wasn't specified.

@k8s-ci-robot k8s-ci-robot added this to the v0.13 milestone Oct 9, 2025
@k8s-ci-robot k8s-ci-robot added do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Oct 9, 2025
@netlify
Copy link

netlify bot commented Oct 9, 2025

Deploy Preview for kubernetes-sigs-kueue canceled.

Name Link
🔨 Latest commit f4fa7f7
🔍 Latest deploy log https://app.netlify.com/projects/kubernetes-sigs-kueue/deploys/68ecd50321eae60008f0bf4d

@k8s-ci-robot
Copy link
Contributor

Welcome @laurafitzgerald!

It looks like this is your first PR to kubernetes-sigs/kueue 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/kueue has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Oct 9, 2025
@k8s-ci-robot
Copy link
Contributor

Hi @laurafitzgerald. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Oct 9, 2025
@mimowo
Copy link
Contributor

mimowo commented Oct 9, 2025

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Oct 9, 2025
@mimowo
Copy link
Contributor

mimowo commented Oct 9, 2025

@laurafitzgerald thank you for the PR. What was the behavior on the main branch when "RayJob is with clusterSelector, but without queue-name"?

Also, could you please add an integration or e2e test?

cc @andrewsykim ptal

@mimowo
Copy link
Contributor

mimowo commented Oct 9, 2025

Please target the PR against the main branch, once merged we will cherrypick the fix against 0.13 and 0.14

@andrewsykim
Copy link
Member

Can't speak to correctness of the code, will defer to @mimowo. But it makes sense that Kueue should ignore RayJobs that speciify a cluster selector

@mimowo
Copy link
Contributor

mimowo commented Oct 10, 2025

Thank you @andrewsykim for confirming the behavior change is intended. The code changes themselves lgtm, but I would like to have an integration test.

Copy link

@kryanbeane kryanbeane left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lgtm!

@laurafitzgerald laurafitzgerald changed the base branch from release-0.13 to main October 10, 2025 11:10
@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Oct 10, 2025
@laurafitzgerald laurafitzgerald force-pushed the fix-failure-on-webhook-ray-jobs-not-kueue-managed branch from f98847e to 06d587d Compare October 10, 2025 11:14
@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Oct 10, 2025
@laurafitzgerald laurafitzgerald force-pushed the fix-failure-on-webhook-ray-jobs-not-kueue-managed branch from 06d587d to ecc5972 Compare October 10, 2025 11:25
@laurafitzgerald
Copy link
Contributor Author

laurafitzgerald commented Oct 10, 2025

Hi @mimowo

@laurafitzgerald thank you for the PR. What was the behavior on the main branch when "RayJob is with clusterSelector, but without queue-name"?

The behaviour on main for "RayJob is with clusterSelector, but without queue-name"?
The entire validation block is skipped - the webhook passes without checking anything.
Based on the line https://github.com/kubernetes-sigs/kueue/blob/main/pkg/controller/jobs/rayjob/rayjob_webhook.go#L110

if w.manageJobsWithoutQueueName || jobframework.QueueName(kueueJob) != ""

w.manageJobsWithoutQueueName = typically false (default)
jobframework.QueueName(kueueJob) = "" (no queue label)
Condition evaluates to false || false = false

In the case where manageJobsWithoutQueueName is true,

there is at least one point where there could be a panic faced. For example https://github.com/kubernetes-sigs/kueue/blob/main/pkg/controller/jobs/rayjob/rayjob_webhook.go#L128

I can fix that. If the value is not present it's fine because then there is no autoscaling so we should simply handle the deref.

Also, could you please add an integration or e2e test?

For sure. 👍

@laurafitzgerald laurafitzgerald force-pushed the fix-failure-on-webhook-ray-jobs-not-kueue-managed branch from ecc5972 to 4a30ceb Compare October 10, 2025 12:12
@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Oct 10, 2025
@kannon92
Copy link
Contributor

@mimowo @tenzen-y
I notice this has a milestone of v0.13.

Is there anything that can be an issue for that? I'm not sure I can remove milestone.

@mimowo
Copy link
Contributor

mimowo commented Oct 10, 2025

/remove-milestone v0.13
/milestone v0.15

@k8s-ci-robot k8s-ci-robot modified the milestones: v0.13, v0.15 Oct 10, 2025
@laurafitzgerald
Copy link
Contributor Author

laurafitzgerald commented Oct 12, 2025

@mimowo would it be possible to include this in any upcoming 0.14.x. Currently it's blocking the possibility to use namespace labelling for kueue with RayJobs which have a clusterSpec, aka a RayJob against a RayCluster which is already managed by kueue. So it could be described as a bug. if there was a 0.13.x is there any possibility to include it there?

@laurafitzgerald laurafitzgerald force-pushed the fix-failure-on-webhook-ray-jobs-not-kueue-managed branch from 4a30ceb to 5a5ed32 Compare October 12, 2025 20:36
@kannon92
Copy link
Contributor

@laurafitzgerald

#7218 (comment)

When this merges it will be cherry picked to 0.14 and 0.13.

@laurafitzgerald laurafitzgerald force-pushed the fix-failure-on-webhook-ray-jobs-not-kueue-managed branch from 5a5ed32 to cb52a4f Compare October 13, 2025 08:54
@k8s-ci-robot
Copy link
Contributor

k8s-ci-robot commented Oct 13, 2025

@laurafitzgerald: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-kueue-test-unit-release-0-13 f98847e link true /test pull-kueue-test-unit-release-0-13

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@laurafitzgerald laurafitzgerald force-pushed the fix-failure-on-webhook-ray-jobs-not-kueue-managed branch from cb52a4f to e1de66e Compare October 13, 2025 09:20
…to panic

len(spec.ClusterSelector)>0 && spec.RayClusterSpec != nil -> validation error
len(spec.ClusterSelector)>0 && spec.RayClusterSpec == nil - valid
len(spec.ClusterSelector)==0 && spec.ClusterSpec == nil -> validation error
len(spec.ClusterSelector)==0 && spec.ClusterSpec != nil -> valid + perform additional validation
@laurafitzgerald laurafitzgerald force-pushed the fix-failure-on-webhook-ray-jobs-not-kueue-managed branch from e1de66e to f4fa7f7 Compare October 13, 2025 10:31
Copy link
Contributor

@mimowo mimowo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you 👍
/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 13, 2025
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: 0b24af9692339576244e0ac6ab240f6237251ecb

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kryanbeane, laurafitzgerald, mimowo

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 13, 2025
@k8s-ci-robot k8s-ci-robot merged commit 2b0d854 into kubernetes-sigs:main Oct 13, 2025
22 of 23 checks passed
@k8s-infra-cherrypick-robot
Copy link
Contributor

@mimowo: new pull request created: #7240

In response to this:

Preparation only, the PRs will be open by robot once this is merged
/cherrypick release-0.14
/cherrypick release-0.13

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-infra-cherrypick-robot
Copy link
Contributor

@mimowo: new pull request created: #7241

In response to this:

Preparation only, the PRs will be open by robot once this is merged
/cherrypick release-0.14
/cherrypick release-0.13

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@mimowo
Copy link
Contributor

mimowo commented Oct 13, 2025

@laurafitzgerald could you please prepare the cherrypicks manually? The PRs prepared by the robot fail

You can use for that the ./hack/cherry_pick_pull.sh script.

@mimowo
Copy link
Contributor

mimowo commented Nov 13, 2025

/kind bug

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Nov 13, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants