@bschaatsbergen bschaatsbergen commented Nov 17, 2025

Closes #822

This change reworks how kro processes ResourceGraphDefinition instances: it moves from sequential, resource-by-resource processing to level-based topological sorting with parallel execution within each level, while maintaining a single ApplySet for the entire instance lifecycle.

The reconciliation logic now processes resources in discrete topological levels rather than in a flat sequential order. The DAG (pkg/graph/dag/dag.go) gains TopologicalSortLevels(), based on Kahn's algorithm, which returns [][]string: each level contains only resources that have no dependencies on one another, which allows parallel execution within a level.
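
For illustration, here is a minimal, self-contained sketch of the level-grouping variant of Kahn's algorithm. The function name mirrors TopologicalSortLevels(), but the adjacency representation and types are simplified stand-ins, not the actual pkg/graph/dag implementation:

```go
package main

import "fmt"

// topologicalSortLevels groups nodes into levels using Kahn's algorithm:
// every node in a level depends only on nodes in earlier levels, so all
// nodes within a level can be processed in parallel. An edge a -> b in
// adj means b depends on a.
func topologicalSortLevels(adj map[string][]string) ([][]string, error) {
	inDegree := make(map[string]int)
	for n, dependents := range adj {
		if _, ok := inDegree[n]; !ok {
			inDegree[n] = 0
		}
		for _, m := range dependents {
			inDegree[m]++
		}
	}

	// Seed the first level with all roots (no unmet dependencies).
	var current []string
	for n, d := range inDegree {
		if d == 0 {
			current = append(current, n)
		}
	}

	var levels [][]string
	visited := 0
	for len(current) > 0 {
		levels = append(levels, current)
		var next []string
		for _, n := range current {
			visited++
			for _, m := range adj[n] {
				inDegree[m]--
				if inDegree[m] == 0 {
					next = append(next, m)
				}
			}
		}
		current = next
	}

	// If some nodes were never reached, the graph has a cycle.
	if visited != len(inDegree) {
		return nil, fmt.Errorf("graph contains a cycle")
	}
	return levels, nil
}

func main() {
	adj := map[string][]string{
		"A": {"B", "C"},
		"B": {"D"},
		"C": {"D"},
	}
	levels, _ := topologicalSortLevels(adj)
	fmt.Println(levels) // [[A] [B C] [D]] (order within a level is unspecified)
}
```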

Instead of creating multiple ApplySets or applying resources individually, we maintain a single ApplySet for the whole lifecycle of each ResourceGraphDefinition instance. The ApplySet follows the kubectl ApplySet specification, tracking inventory by GKNN (Group/Kind/Namespace/Name) in annotations on the parent resource.
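
As a rough illustration of the inventory shape (the type and field names here are assumptions for this sketch, not kro's actual ApplySet types):

```go
// gknn is a hypothetical inventory key. The ApplySet keeps a set of
// these for every object it has applied, serialized into annotations
// on the parent resource, and diffs against it to decide what to prune.
type gknn struct {
	Group, Kind, Namespace, Name string
}

// String renders the key in Group/Kind/Namespace/Name form.
func (k gknn) String() string {
	return fmt.Sprintf("%s/%s/%s/%s", k.Group, k.Kind, k.Namespace, k.Name)
}
```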

The reconciliation for instances (controller_reconcile.go:reconcileInstance) now proceeds as follows (a condensed sketch follows this list):

  • Creates one ApplySet at the start
  • Iterates through topological levels sequentially
  • For each level:
    • Processes all resources in parallel, adding them to the ApplySet
    • Applies the ApplySet without pruning (Apply(ctx, false))
    • Synchronizes runtime state so later levels can reference earlier resource outputs via CEL
    • Validates all resources in the level are ready before proceeding
  • After all levels complete successfully, performs a final apply with pruning enabled (Apply(ctx, true))
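
Condensed, the loop looks roughly like the sketch below. processResource, waitForLevelReady, and the applySet field name are hypothetical stand-ins for the PR's actual helpers; Apply(ctx, bool) and runtime.Synchronize() are from the changeset itself:

```go
func (igr *instanceGraphReconciler) reconcileLevels(ctx context.Context, levels [][]string) error {
	for _, level := range levels {
		// Render every resource in this level in parallel and add it
		// to the single shared ApplySet.
		g, gCtx := errgroup.WithContext(ctx)
		for _, id := range level {
			id := id // capture loop variable for the goroutine
			g.Go(func() error { return igr.processResource(gCtx, id) })
		}
		if err := g.Wait(); err != nil {
			return err
		}

		// Apply this level without pruning; pruning is deferred until
		// the whole graph has been applied.
		if _, err := igr.applySet.Apply(ctx, false); err != nil {
			return err
		}

		// Refresh runtime state so CEL expressions in later levels can
		// reference this level's outputs.
		if _, err := igr.runtime.Synchronize(); err != nil {
			return err
		}

		// Do not start the next level until this one is ready.
		if err := igr.waitForLevelReady(ctx, level); err != nil {
			return err
		}
	}

	// Final apply with pruning removes resources no longer desired.
	_, err := igr.applySet.Apply(ctx, true)
	return err
}
```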

Resource deletion now mirrors the creation flow in reverse, processing topological levels from bottom to top (last level first). Resources within each level are deleted in parallel, reusing the concurrency patterns from creation to improve deletion performance. The controller verifies that all resources in a level are fully deleted before moving on to the next level; see the sketch below.
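
A sketch of that reverse flow, again with hypothetical helper names (deleteResource, waitForLevelDeleted):

```go
func (igr *instanceGraphReconciler) deleteLevels(ctx context.Context, levels [][]string) error {
	// Walk levels from last to first so dependents are removed before
	// the resources they depend on.
	for i := len(levels) - 1; i >= 0; i-- {
		g, gCtx := errgroup.WithContext(ctx)
		for _, id := range levels[i] {
			id := id // capture loop variable for the goroutine
			g.Go(func() error { return igr.deleteResource(gCtx, id) })
		}
		if err := g.Wait(); err != nil {
			return err
		}
		// Block until every resource in this level is fully gone
		// before moving up to the previous level.
		if err := igr.waitForLevelDeleted(ctx, levels[i]); err != nil {
			return err
		}
	}
	return nil
}
```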

Reviewers, please focus on the parallelization logic, on whether the mutexes I introduced actually protect state properly, and on the potential impact of multiple calls to a single ApplySet on cluster mutations beyond level 1.

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Nov 17, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: bschaatsbergen
Once this PR has been reviewed and has the lgtm label, please assign nicslatts for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Nov 17, 2025
@k8s-ci-robot
Contributor

Hi @bschaatsbergen. Thanks for your PR.

I'm waiting for an org member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Nov 17, 2025
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 17, 2025
@a-hilaly
Member

/ok-to-test

@bschaatsbergen bschaatsbergen changed the title from "[WIP] level-based topological sorting and execution" to "level-based topological sorting and execution" Nov 17, 2025
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 17, 2025

@jakobmoellerdev jakobmoellerdev (Member) left a comment

First review from me (I still need to do a checkout and a proper second pass): thanks for the work on this; I remember from our Slack discussions that you were very invested in this over a few days 🚀

Most importantly: I would love for you to look at the deletion/pruning case again, because I do NOT think we are good on this. Apart from that, mostly style comments and nits to get the PR in more easily, since the changeset is large.

Tests for the levelled apply are also missing.

Thanks also for adopting Kahn's algorithm as I proposed; I believe this is the right way to do levelled apply. I do wish you would share a bit more implementation intent in the code: without that context, people won't know why you're doing what you're doing.

Last but not least, a lot of this change will implicitly change the reconcile paths in the controller, which is why we need to adjust the docs and prepare release notes. Not something to fix immediately, but we need to prepare some information for our users.

```go
// Prune resources that are no longer part of the desired state
// The ApplySet tracks inventory via GKNN in parent annotations
// We do a final apply with prune=true to clean up orphaned resources
pruneResult, err := aset.Apply(ctx, true)
```

I have an issue with prune and levels. Technically we need to prune in a very specific levelled order.

Let me visualize. Assuming we have a set of resources A, B, C, D that form this graph:

A -> B -> C -> D

then the removal order for pruning must be:

D -> C -> B -> A

Currently pruning does not respect this order. I think this is a pretty heavy issue, and it is even present right now, since pruning doesn't respect topology. I think this is partially covered by deleteResourcesInOrder in the delete branch, but it is ignored here.
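
For illustration, one possible shape for topology-aware pruning along these lines: group prune candidates by their topological level, then delete the groups from deepest to shallowest. Everything here (pruneCandidates, levelOf, deleteFn) is hypothetical, not part of the current ApplySet code:

```go
func pruneByLevels(ctx context.Context, pruneCandidates []string, levelOf map[string]int, deleteFn func(context.Context, string) error) error {
	// Bucket prune candidates by their topological level.
	byLevel := make(map[int][]string)
	maxLevel := 0
	for _, id := range pruneCandidates {
		l := levelOf[id]
		byLevel[l] = append(byLevel[l], id)
		if l > maxLevel {
			maxLevel = l
		}
	}
	// Delete deepest levels first: for A -> B -> C -> D this yields
	// D, C, B, A.
	for l := maxLevel; l >= 0; l-- {
		for _, id := range byLevel[l] {
			if err := deleteFn(ctx, id); err != nil {
				return err
			}
		}
	}
	return nil
}
```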

@bschaatsbergen bschaatsbergen Nov 18, 2025

Thanks for catching this. You're right: pruning needs to respect reverse topological order so that dependents are removed before their dependencies. We'll need to get this sorted out (pun intended).

I explored adding PruneByLevels to mirror the existing deleteResourcesInOrder logic, but the implementation is proving more complex than I would like, given that I prefer to keep the applyset changes minimal. I would appreciate another set of eyes on this: how would you approach integrating topological pruning without a major refactoring?


TBH, I'm not sure you can easily achieve this. I think apply sets are fundamentally not designed to be levelled, so the upstream package copy isn't either. Anything we build with applysets would be on top of the already existing specification.

```go
// Use errgroup to process all resources in parallel
// Resources in the same level have no dependencies on each other
g, gCtx := errgroup.WithContext(ctx)
g.SetLimit(stdruntime.NumCPU()) // Limit parallelism to number of CPUs
```

We have a flag for the concurrency limit that I think we should respect here.
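
Something like the sketch below could wire that flag in, keeping NumCPU only as a fallback; maxConcurrent stands in for whatever the existing flag is actually called:

```go
// newLevelGroup builds the errgroup for one level, honoring the
// controller's configured concurrency limit instead of hardcoding
// the CPU count.
func newLevelGroup(ctx context.Context, maxConcurrent int) (*errgroup.Group, context.Context) {
	g, gCtx := errgroup.WithContext(ctx)
	if maxConcurrent <= 0 {
		maxConcurrent = stdruntime.NumCPU() // fall back to CPU count
	}
	g.SetLimit(maxConcurrent)
	return g, gCtx
}
```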

```go
if _, err := igr.runtime.Synchronize(); err != nil {
	return fmt.Errorf("failed to synchronize during deletion state initialization: %w", err)
}

func (igr *instanceGraphReconciler) initializeDeletionState(levels [][]string) error {
```

Super high level, but why do we even need initializeDeletionState at all? I was assuming that this state should be covered by a proper pruned diff.

@jakobmoellerdev
Member

/ok-to-test

@k8s-ci-robot
Contributor

@bschaatsbergen: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
presubmits-e2e-tests | 1c9c2bb | link | true | /test presubmits-e2e-tests

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.


@bschaatsbergen bschaatsbergen commented Nov 19, 2025

Thanks everyone for the discussion so far. After a productive conversation this morning with @jakobmoellerdev, we've decided to defer this change for now; I'll convert it to a draft and try to keep it up to date with the default branch in the meantime.

At the moment, it’s challenging to integrate this into the controller due to current limitations in applyset and pruning. Before we can move forward, we need to address these areas, specifically ensuring that our apply and prune logic is designed with level-based topological sorting in mind.

To move this in the right direction, @jakobmoellerdev and I will be drafting a KREP over the coming weeks to outline the path forward.

For those who are waiting on this feature: thank you for your patience. We want to make sure we’re aligned on the why before we commit to the how. This level of rigor is what keeps kro predictable, stable, and production-ready.

@bschaatsbergen bschaatsbergen marked this pull request as draft November 19, 2025 21:09
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 19, 2025