-
Notifications
You must be signed in to change notification settings - Fork 479
Add validation for unsupported DRA features #7226
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
✅ Deploy Preview for kubernetes-sigs-kueue ready!
To edit notification comments on pull requests, go to your Netlify project configuration. |
|
Hi @harche. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
/ok-to-test |
|
/ok-to-test /cc @alaypatel07 |
|
Yes I'll take a look. |
c524414 to
4318946
Compare
pkg/dra/claims.go
Outdated
| var dcName string | ||
| var q int64 | ||
| if req.Exactly != nil { | ||
| // Check for CEL selectors |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you please convert the entire if condition into a switch case handling all the unsupported features first before considering the valid supported case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed, thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@harche why does this have to be nested ifs and switch cases? Can we work out a logic with one single switch case?
For example:
switch {
case req.FirstAvailable != nil:
case req.Exactly != nil && len(req.Exactly.Selectors) > 0:
// error
case req.Exactly != nil && req.Exactly.AdminAccess != nil && *req.Exactly.AdminAccess:
// error
case req.Exactly != nil && req.Exactly.AllocationMode == DeviceAllocationModeAll:
// error
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@alaypatel07 I didn't want to repeat req.Exactly != nil, would something like following do?
switch {
case req.FirstAvailable != nil:
return nil, fmt.Errorf("%w: FirstAvailable device selection is not supported", ErrUnsupportedDRAFeature)
case req.Exactly == nil:
continue
case len(req.Exactly.Selectors) > 0:
return nil, fmt.Errorf("%w: CEL selectors are not supported", ErrUnsupportedDRAFeature)
case req.Exactly.AdminAccess != nil && *req.Exactly.AdminAccess:
return nil, fmt.Errorf("%w: AdminAccess is not supported", ErrUnsupportedDRAFeature)
case req.Exactly.AllocationMode == resourcev1.DeviceAllocationModeAll:
return nil, fmt.Errorf("%w: AllocationMode 'All' is not supported", ErrUnsupportedDRAFeature)
case req.Exactly.AllocationMode == resourcev1.DeviceAllocationModeExactCount:
dcName = req.Exactly.DeviceClassName
q = req.Exactly.Count
case req.Exactly.AllocationMode == "":
dcName = req.Exactly.DeviceClassName
default:
return nil, fmt.Errorf("%w: unsupported allocation mode: %s", ErrUnsupportedDRAFeature, req.Exactly.AllocationMode)
}There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated, let me know if that works, but I am open to changing it to repeated req.Exactly != nil.
pkg/dra/claims.go
Outdated
| ErrDeviceClassNotMapped = errors.New("DeviceClass is not mapped in DRA configuration") | ||
| ErrResourceClaimInUse = errors.New("ResourceClaim is in use") | ||
| ErrClaimSpecNotFound = errors.New("failed to get claim spec") | ||
| ErrDeviceClassNotMapped = errors.New("DeviceClass is not mapped in DRA configuration") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: can we use a common error "UnsupportedDRAfeature" and then wrap a human readable error message to be returned from the function later.
I think we only need to have the following concrete error types:
- ErrResourceClaimInUse and other
- ErrClaimSpecNotFound
- ErrUnsupportedDRAfeature
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed, thanks.
| }, | ||
| want: map[kueuev1beta1.PodSetReference]corev1.ResourceList{"main": {"res-1": resource.MustParse("2")}}, | ||
| }, | ||
| { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should also update the integration tests for these errors at https://github.com/kubernetes-sigs/kueue/tree/main/test/integration/singlecluster/controller/dra
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added integration tests.
4318946 to
006e7b3
Compare
|
/test pull-kueue-test-e2e-multikueue-main |
|
@alaypatel07 I have updated the PR. Can you take another look? Thanks. |
| wl := utiltesting.MakeWorkload("test-wl-all-mode", ns.Name). | ||
| Queue("test-lq"). | ||
| Obj() | ||
| wl.Spec.PodSets[0].Template.Spec.ResourceClaims = []corev1.PodResourceClaim{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we use this opportunity to start using, https://github.com/kubernetes-sigs/kueue/blob/main/pkg/util/testing/wrappers.go#L623 and https://github.com/kubernetes-sigs/kueue/blob/main/pkg/util/testing/wrappers.go#L637?
I think this change is needed across all the test cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated, thanks.
|
|
||
| ginkgo.It("Should reject workload with AllocationMode 'All'", func() { | ||
| ginkgo.By("Creating a ResourceClaimTemplate with AllocationMode All") | ||
| rct := &resourcev1.ResourceClaimTemplate{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we can create a ResourceClaimWrapper to produce these templates/claims in utiltesting:
kueue/pkg/util/testing/wrappers.go
Line 422 in 5a6b10a
| type PodSetWrapper struct{ kueue.PodSet } |
it will be helpful in unit testing as well as integration tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds like a great idea, but should we tackle this outside this PR? I can create a separate issue to look into it. I am afraid it might escalate the scope of this PR (which is just to harden the validation around unsupported DRA fields/features)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1. IMHO, since this PR is adding all the permutations of creating resource slices with different features, this is the right moment to add the wrapper. I dont really think it expands the scope. Thanks for tackling that.
03ca256 to
f7b275d
Compare
|
@harche from what I can see, AllocationMode field itself can be Are you seeing that this defaulting is not working? |
yeah, if I remove the % go test ./pkg/dra/...
--- FAIL: Test_GetResourceRequests (0.05s)
--- FAIL: Test_GetResourceRequests/Empty_allocation_mode_defaults_to_exact_count (0.00s)
claims_test.go:392: unexpected error: unsupported DRA feature in ResourceClaimTemplate claim-tmpl-default in workload wl podset main: unsupported DRA feature: unsupported allocation mode:
FAIL
FAIL sigs.k8s.io/kueue/pkg/dra 0.402s
FAILbut passes if I add go test ./pkg/dra/...
ok sigs.k8s.io/kueue/pkg/dra (cached) |
That's because this is a unit test and the resource claim is not going through the API server which has defaulting logic. This test will not fail in the e2e since it goes through API server there. The correct thing IMO is to remove this unit test and let it just be an e2e test |
e46910b to
8ad5160
Compare
@alaypatel07 good point, removed it. Thanks. |
8ad5160 to
081eea8
Compare
|
/test pull-kueue-test-e2e-tas-main |
@alaypatel07 replaced with an integration test, https://github.com/kubernetes-sigs/kueue/pull/7226/files#diff-5dc66fc561764c1765de2f7abf5aca47f48c675141ab4c6c5fd6ce7a3a40b1a6R727 |
Signed-off-by: Harshal Patil <[email protected]>
081eea8 to
057dd22
Compare
|
@mimowo @alaypatel07 I have addressed all the review comments. Thanks. |
mimowo
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
/approve
Looks better than before, thank you!
|
LGTM label has been added. Git tree hash: c89963c8bc4ae64e075c076acfad0c422a188c56
|
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: harche, mimowo The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
@harche let me know if:
|
@mimowo thanks for your review.
|
Signed-off-by: Harshal Patil <[email protected]>
What type of PR is this?
/kind bug
What this PR does / why we need it:
This PR adds validation to reject DRA (Dynamic Resource Allocation) v1 API features that are enabled
by default in Kubernetes 1.34 but explicitly excluded from Kueue's DRA support scope per KEP-2941.
Without validation, Kueue silently ignores unsupported DRA features, leading to:
runtime failures
errors instead of clear feature-not-supported messages
unsupported features
Solution is to add early validation in pkg/dra/claims.go to explicitly reject all 6 enabled-by-default DRA features
that Kueue doesn't support:
GA Features:
Beta Features (enabled by default in K8s 1.34):
Each feature now returns a clear, actionable error message like "FirstAvailable device selection is
not supported" instead of misleading DeviceClass mapping errors.
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Does this PR introduce a user-facing change?