Skip to content

Conversation

@harche
Copy link
Contributor

@harche harche commented Oct 10, 2025

What type of PR is this?

/kind bug

What this PR does / why we need it:

This PR adds validation to reject DRA (Dynamic Resource Allocation) v1 API features that are enabled
by default in Kubernetes 1.34 but explicitly excluded from Kueue's DRA support scope per KEP-2941.

Without validation, Kueue silently ignores unsupported DRA features, leading to:

  1. Silent admission failures - FirstAvailable workloads are admitted with 0 DRA resources, causing
    runtime failures
  2. Misleading errors - 5 out of 6 unsupported features produce confusing "DeviceClass not mapped"
    errors instead of clear feature-not-supported messages
  3. Poor UX - Users waste time debugging DeviceClass mappings when the actual issue is using
    unsupported features

Solution is to add early validation in pkg/dra/claims.go to explicitly reject all 6 enabled-by-default DRA features
that Kueue doesn't support:

GA Features:

  • AllocationMode 'All'
  • CEL Selectors
  • Device Constraints
  • Device Config

Beta Features (enabled by default in K8s 1.34):

  • FirstAvailable (DRAPrioritizedList)
  • AdminAccess (DRAAdminAccess)

Each feature now returns a clear, actionable error message like "FirstAvailable device selection is
not supported" instead of misleading DeviceClass mapping errors.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Kueue now properly validates and rejects unsupported DRA (Dynamic Resource Allocation) features with clear error messages instead of silently failing or producing misleading "DeviceClass not mapped" errors. Unsupported features include: AllocationMode 'All', CEL Selectors, Device Constraints, Device Config, FirstAvailable device selection, and AdminAccess.

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Oct 10, 2025
@netlify
Copy link

netlify bot commented Oct 10, 2025

Deploy Preview for kubernetes-sigs-kueue ready!

Name Link
🔨 Latest commit 057dd22
🔍 Latest deploy log https://app.netlify.com/projects/kubernetes-sigs-kueue/deploys/68fa302c29e15b0008adf648
😎 Deploy Preview https://deploy-preview-7226--kubernetes-sigs-kueue.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify project configuration.

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Oct 10, 2025
@k8s-ci-robot
Copy link
Contributor

Hi @harche. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Oct 10, 2025
@mimowo
Copy link
Contributor

mimowo commented Oct 10, 2025

/ok-to-test
@alaypatel07 would you like to give it a pass?

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Oct 10, 2025
@kannon92
Copy link
Contributor

/ok-to-test

/cc @alaypatel07

@alaypatel07
Copy link
Contributor

Yes I'll take a look.

var dcName string
var q int64
if req.Exactly != nil {
// Check for CEL selectors
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you please convert the entire if condition into a switch case handling all the unsupported features first before considering the valid supported case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed, thanks.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@harche why does this have to be nested ifs and switch cases? Can we work out a logic with one single switch case?

For example:

		switch {
		case req.FirstAvailable != nil:
		case req.Exactly != nil && len(req.Exactly.Selectors) > 0:
		// error
		case req.Exactly != nil && req.Exactly.AdminAccess != nil && *req.Exactly.AdminAccess:
		// error
		case req.Exactly != nil && req.Exactly.AllocationMode == DeviceAllocationModeAll:
		// error
		}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alaypatel07 I didn't want to repeat req.Exactly != nil, would something like following do?

switch {
		case req.FirstAvailable != nil:
			return nil, fmt.Errorf("%w: FirstAvailable device selection is not supported", ErrUnsupportedDRAFeature)
		case req.Exactly == nil:
			continue
		case len(req.Exactly.Selectors) > 0:
			return nil, fmt.Errorf("%w: CEL selectors are not supported", ErrUnsupportedDRAFeature)
		case req.Exactly.AdminAccess != nil && *req.Exactly.AdminAccess:
			return nil, fmt.Errorf("%w: AdminAccess is not supported", ErrUnsupportedDRAFeature)
		case req.Exactly.AllocationMode == resourcev1.DeviceAllocationModeAll:
			return nil, fmt.Errorf("%w: AllocationMode 'All' is not supported", ErrUnsupportedDRAFeature)
		case req.Exactly.AllocationMode == resourcev1.DeviceAllocationModeExactCount:
			dcName = req.Exactly.DeviceClassName
			q = req.Exactly.Count
		case req.Exactly.AllocationMode == "":
			dcName = req.Exactly.DeviceClassName
		default:
			return nil, fmt.Errorf("%w: unsupported allocation mode: %s", ErrUnsupportedDRAFeature, req.Exactly.AllocationMode)
		}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated, let me know if that works, but I am open to changing it to repeated req.Exactly != nil.

ErrDeviceClassNotMapped = errors.New("DeviceClass is not mapped in DRA configuration")
ErrResourceClaimInUse = errors.New("ResourceClaim is in use")
ErrClaimSpecNotFound = errors.New("failed to get claim spec")
ErrDeviceClassNotMapped = errors.New("DeviceClass is not mapped in DRA configuration")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: can we use a common error "UnsupportedDRAfeature" and then wrap a human readable error message to be returned from the function later.

I think we only need to have the following concrete error types:

  1. ErrResourceClaimInUse and other
  2. ErrClaimSpecNotFound
  3. ErrUnsupportedDRAfeature

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed, thanks.

},
want: map[kueuev1beta1.PodSetReference]corev1.ResourceList{"main": {"res-1": resource.MustParse("2")}},
},
{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added integration tests.

@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Oct 10, 2025
@harche harche requested a review from alaypatel07 October 10, 2025 18:54
@harche
Copy link
Contributor Author

harche commented Oct 10, 2025

/test pull-kueue-test-e2e-multikueue-main

@harche
Copy link
Contributor Author

harche commented Oct 10, 2025

@alaypatel07 I have updated the PR. Can you take another look? Thanks.

wl := utiltesting.MakeWorkload("test-wl-all-mode", ns.Name).
Queue("test-lq").
Obj()
wl.Spec.PodSets[0].Template.Spec.ResourceClaims = []corev1.PodResourceClaim{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated, thanks.


ginkgo.It("Should reject workload with AllocationMode 'All'", func() {
ginkgo.By("Creating a ResourceClaimTemplate with AllocationMode All")
rct := &resourcev1.ResourceClaimTemplate{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we can create a ResourceClaimWrapper to produce these templates/claims in utiltesting:

type PodSetWrapper struct{ kueue.PodSet }

it will be helpful in unit testing as well as integration tests

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds like a great idea, but should we tackle this outside this PR? I can create a separate issue to look into it. I am afraid it might escalate the scope of this PR (which is just to harden the validation around unsupported DRA fields/features)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. IMHO, since this PR is adding all the permutations of creating resource slices with different features, this is the right moment to add the wrapper. I dont really think it expands the scope. Thanks for tackling that.

@harche harche force-pushed the dra_validation branch 3 times, most recently from 03ca256 to f7b275d Compare October 11, 2025 00:12
@alaypatel07
Copy link
Contributor

@harche from what I can see, AllocationMode field itself can be "" i.e. empty because it is not a pointer but it should always be defaulted to AllocationModeExact https://github.com/kubernetes/kubernetes/blob/e1b996a4d3358b783a17e5f54a70e573e394a697/pkg/apis/resource/v1beta1/defaults.go#L43

Are you seeing that this defaulting is not working?

@harche
Copy link
Contributor Author

harche commented Oct 22, 2025

@harche from what I can see, AllocationMode field itself can be "" i.e. empty because it is not a pointer but it should always be defaulted to AllocationModeExact https://github.com/kubernetes/kubernetes/blob/e1b996a4d3358b783a17e5f54a70e573e394a697/pkg/apis/resource/v1beta1/defaults.go#L43

Are you seeing that this defaulting is not working?

yeah, if I remove the req.Exactly.AllocationMode == "" check in the switch, the test that specifically sets AllocationMode to "" fails, but it passes if I bring back that check.

% go test ./pkg/dra/...
--- FAIL: Test_GetResourceRequests (0.05s)
    --- FAIL: Test_GetResourceRequests/Empty_allocation_mode_defaults_to_exact_count (0.00s)
        claims_test.go:392: unexpected error: unsupported DRA feature in ResourceClaimTemplate claim-tmpl-default in workload wl podset main: unsupported DRA feature: unsupported allocation mode: 
FAIL
FAIL	sigs.k8s.io/kueue/pkg/dra	0.402s
FAIL

but passes if I add req.Exactly.AllocationMode == ""

go test ./pkg/dra/...
ok  	sigs.k8s.io/kueue/pkg/dra	(cached)

@alaypatel07
Copy link
Contributor

alaypatel07 commented Oct 22, 2025

@harche from what I can see, AllocationMode field itself can be "" i.e. empty because it is not a pointer but it should always be defaulted to AllocationModeExact https://github.com/kubernetes/kubernetes/blob/e1b996a4d3358b783a17e5f54a70e573e394a697/pkg/apis/resource/v1beta1/defaults.go#L43
Are you seeing that this defaulting is not working?

yeah, if I remove the req.Exactly.AllocationMode == "" check in the switch, the test that specifically sets AllocationMode to "" fails, but it passes if I bring back that check.

% go test ./pkg/dra/...
--- FAIL: Test_GetResourceRequests (0.05s)
    --- FAIL: Test_GetResourceRequests/Empty_allocation_mode_defaults_to_exact_count (0.00s)
        claims_test.go:392: unexpected error: unsupported DRA feature in ResourceClaimTemplate claim-tmpl-default in workload wl podset main: unsupported DRA feature: unsupported allocation mode: 
FAIL
FAIL	sigs.k8s.io/kueue/pkg/dra	0.402s
FAIL

but passes if I add req.Exactly.AllocationMode == ""

go test ./pkg/dra/...
ok  	sigs.k8s.io/kueue/pkg/dra	(cached)

That's because this is a unit test and the resource claim is not going through the API server which has defaulting logic. This test will not fail in the e2e since it goes through API server there. The correct thing IMO is to remove this unit test and let it just be an e2e test

@k8s-ci-robot k8s-ci-robot removed the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Oct 22, 2025
@harche
Copy link
Contributor Author

harche commented Oct 22, 2025

@harche from what I can see, AllocationMode field itself can be "" i.e. empty because it is not a pointer but it should always be defaulted to AllocationModeExact https://github.com/kubernetes/kubernetes/blob/e1b996a4d3358b783a17e5f54a70e573e394a697/pkg/apis/resource/v1beta1/defaults.go#L43
Are you seeing that this defaulting is not working?

yeah, if I remove the req.Exactly.AllocationMode == "" check in the switch, the test that specifically sets AllocationMode to "" fails, but it passes if I bring back that check.

% go test ./pkg/dra/...
--- FAIL: Test_GetResourceRequests (0.05s)
    --- FAIL: Test_GetResourceRequests/Empty_allocation_mode_defaults_to_exact_count (0.00s)
        claims_test.go:392: unexpected error: unsupported DRA feature in ResourceClaimTemplate claim-tmpl-default in workload wl podset main: unsupported DRA feature: unsupported allocation mode: 
FAIL
FAIL	sigs.k8s.io/kueue/pkg/dra	0.402s
FAIL

but passes if I add req.Exactly.AllocationMode == ""

go test ./pkg/dra/...
ok  	sigs.k8s.io/kueue/pkg/dra	(cached)

That's because this is a unit test and the resource claim is not going through the API server which has defaulting logic. This test will not fail in the e2e since it goes through API server there. The correct thing IMO is to remove this unit test and let it just be an e2e test

@alaypatel07 good point, removed it. Thanks.

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Oct 22, 2025
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Oct 22, 2025
@harche
Copy link
Contributor Author

harche commented Oct 22, 2025

/test pull-kueue-test-e2e-tas-main

@harche
Copy link
Contributor Author

harche commented Oct 22, 2025

That's because this is a unit test and the resource claim is not going through the API server which has defaulting logic. This test will not fail in the e2e since it goes through API server there. The correct thing IMO is to remove this unit test and let it just be an e2e test

@alaypatel07 good point, removed it. Thanks.

@alaypatel07 replaced with an integration test, https://github.com/kubernetes-sigs/kueue/pull/7226/files#diff-5dc66fc561764c1765de2f7abf5aca47f48c675141ab4c6c5fd6ce7a3a40b1a6R727

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 23, 2025
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 23, 2025
@harche
Copy link
Contributor Author

harche commented Oct 23, 2025

@mimowo @alaypatel07 I have addressed all the review comments. Thanks.

Copy link
Contributor

@mimowo mimowo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve
Looks better than before, thank you!

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 24, 2025
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: c89963c8bc4ae64e075c076acfad0c422a188c56

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: harche, mimowo

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 24, 2025
@mimowo
Copy link
Contributor

mimowo commented Oct 24, 2025

@harche let me know if:

  1. you could follow up on DRA cleanup: use structured assert on errors using cmp.Diff #7343
  2. you want to cherrypick to 0.14, this can be done but requires manual conflict resolution. Since the integration is an alpha feature it is not necessary

@k8s-ci-robot k8s-ci-robot merged commit 32f93ef into kubernetes-sigs:main Oct 24, 2025
23 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v0.15 milestone Oct 24, 2025
@harche harche deleted the dra_validation branch October 24, 2025 11:08
@harche
Copy link
Contributor Author

harche commented Oct 24, 2025

@harche let me know if:

  1. you could follow up on DRA cleanup: use structured assert on errors using cmp.Diff #7343
  2. you want to cherrypick to 0.14, this can be done but requires manual conflict resolution. Since the integration is an alpha feature it is not necessary

@mimowo thanks for your review.

  1. I have assigned DRA cleanup: use structured assert on errors using cmp.Diff #7343 to me.
  2. I am not looking for backporting it to 0.14, since DRA support is in Kueue is alpha anyway.

Singularity23x0 pushed a commit to Singularity23x0/kueue that referenced this pull request Nov 3, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants