-
Notifications
You must be signed in to change notification settings - Fork 270
level-based topological sorting and execution #827
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
level-based topological sorting and execution #827
Conversation
Signed-off-by: Bruno Schaatsbergen <[email protected]>
Signed-off-by: Bruno Schaatsbergen <[email protected]>
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: bschaatsbergen The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
Hi @bschaatsbergen. Thanks for your PR. I'm waiting for a github.com member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
/ok-to-test |
7da7ea0 to
8a530d5
Compare
96671a5 to
1dab5b0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewers, please focus on the parallelization logic, whether the mutexes I introduced are actually protecting state properly, and the potential impact of multiple calls to a single applyset on cluster mutations beyond level 1.
jakobmoellerdev
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First Review from me (need to do a checkout and another proper one): Thanks for the work on this, I remember from our slack discussions that you were very invested in this over a few days 🚀
Most importantly: I would love you to especially look at the deletion / pruning case again, I DONT think we are good on this. Apart from that mostly style and nits to get the PR in easier as the changeset is large.
Also missing tests for the levelled apply
Thanks also for adopting Kahns algorithm as I proposed. I believe this is the right way to do levelled apply. I do wish that you would share a bit more of implementation intent in the code as if you dont know that people wont know why youre doing what youre doing.
Last but not least, a lot of this change will have implicit changes to the reconcile paths in the controller which is why we need to adjust docs and prep release notes. Not something to fix immediately but we need to prep some information for our users.
| // Prune resources that are no longer part of the desired state | ||
| // The ApplySet tracks inventory via GKNN in parent annotations | ||
| // We do a final apply with prune=true to clean up orphaned resources | ||
| pruneResult, err := aset.Apply(ctx, true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have an issue with prune and levels. Technically we need to prune in a very specific levelled order.
Let me visualize.
Assuming we have a set of resources A,B,C,D and they visualize to this graph:
A->B->C->D
Then the removal order for pruning must be ordered as
D->C->B->A. Currently pruning does not respect this order. I think this is a pretty heavy issue that is even present right now as pruning doesnt respect topology.
I think this is partially covered by deleteResourcesInOrder in the delete branch but is ignored here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for catching this. You're right, pruning needs to respect reverse topological order to safely remove dependents before their dependencies. We'll need to get this sorted out (pun intended).
I explored adding PruneByLevels to mirror the existing deleteResourcesInOrder logic, but the implementation is proving more complex than I want it to be, given I prefer to keep applyset changes minimal. Would appreciate another set of eyes on this. How would you approach integrating topological pruning without major refactoring?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TBH Im not sure if you can easily achieve this. I think fundamentally apply sets are not designed to be levelled, so the upstream package copy also isnt. anything we build with applysets would be on top of the already existing specification.
| // Use errgroup to process all resources in parallel | ||
| // Resources in the same level have no dependencies on each other | ||
| g, gCtx := errgroup.WithContext(ctx) | ||
| g.SetLimit(stdruntime.NumCPU()) // Limit parallelism to number of CPUs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we have a flag for concurrency limit that I think we should respect
| if _, err := igr.runtime.Synchronize(); err != nil { | ||
| return fmt.Errorf("failed to synchronize during deletion state initialization: %w", err) | ||
| } | ||
| func (igr *instanceGraphReconciler) initializeDeletionState(levels [][]string) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Super high level but why do we even need initializeDeletionState still? I was assuming that this state should be covered by a proper pruned diff.
|
/ok-to-test |
|
@bschaatsbergen: The following test failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
Thanks everyone for the discussion so far. After a productive conversation this morning with @jakobmoellerdev, we’ve decided to defer this change for now, hence I'll convert it to a draft and attempt to keep it up-to-date with the default branch for the time being. At the moment, it’s challenging to integrate this into the controller due to current limitations in applyset and pruning. Before we can move forward, we need to address these areas, specifically ensuring that our apply and prune logic is designed with level-based topological sorting in mind. To move this in the right direction, @jakobmoellerdev and I will be drafting a KREP over the coming weeks to outline the path forward. For those who are waiting on this feature: thank you for your patience. We want to make sure we’re aligned on the why before we commit to the how. This level of rigor is what keeps kro predictable, stable, and production-ready. |
Closes #822
This change is about how kro processes ResourceGraphDefinition instances, moving from sequential resource-by-resource processing to level-based topological sorting with parallelized execution within each level, while maintaining a single ApplySet for the entire instance lifecycle.
The reconciliation logic currently processes resources in discrete topological levels rather than a flat sequential order. The DAG (
pkg/graph/dag/dag.go) now usesTopologicalSortLevels()based on Kahn's algorithm, which returns[][]stringrepresenting levels where all resources within a level have no dependencies on each other, which allows us to do parallel execution within levels.Instead of creating multiple ApplySets or applying resources individually, we maintain a single ApplySet instance for the entire ResourceGraphDefinition instance. The ApplySet follows the kubectl ApplySet specification, using GKNN-based (Group/Kind/Namespace/Name) inventory tracking stored in parent resource annotations.
The reconciliation for instances (
controller_reconcile.go:reconcileInstance):ApplySetat the startApplySetApply(ctx, false))Apply(ctx, true))Resource deletion now mirrors the creation flow in reverse order, processing topological levels from bottom to top (last level to first). Resources within each level are deleted in parallel, leveraging the same concurrency patterns used during creation to improve deletion performance. The controller validates that all resources in a level are fully deleted before moving to the next level.
Reviewers, please focus on the parallelization logic, whether the mutexes I introduced are actually protecting state properly, and the potential impact of multiple calls to a single applyset on cluster mutations beyond level 1.