@olivergondza
Contributor

When sync fails due to incorrect manifest declarations, this permits fixes to be deployed via future commits.

This is controlled by a new boolean Application CRD field syncPolicy.retry.refresh or via the --sync-retry-refresh flag.
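
To illustrate where the new field sits, here is a minimal sketch that models the retry block with stand-in Go types and prints the resulting JSON fragment. The type names (`RetryPolicy`, `SyncPolicy`), the helper `renderSyncPolicy`, and the `limit` value are assumptions made for this example and are not taken from this PR; only the `syncPolicy.retry.refresh` path is.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// RetryPolicy is a hypothetical stand-in for the retry section of syncPolicy.
// It is not the actual Argo CD API type.
type RetryPolicy struct {
	Limit   int64 `json:"limit,omitempty"`
	Refresh bool  `json:"refresh,omitempty"` // the new boolean described above
}

// SyncPolicy is a hypothetical stand-in that holds only the retry section.
type SyncPolicy struct {
	Retry *RetryPolicy `json:"retry,omitempty"`
}

// renderSyncPolicy serializes the policy under a top-level "syncPolicy" key.
func renderSyncPolicy(p SyncPolicy) (string, error) {
	out, err := json.Marshal(map[string]SyncPolicy{"syncPolicy": p})
	if err != nil {
		return "", err
	}
	return string(out), nil
}

func main() {
	s, err := renderSyncPolicy(SyncPolicy{Retry: &RetryPolicy{Limit: 5, Refresh: true}})
	if err != nil {
		panic(err)
	}
	fmt.Println(s)
}
```

With `Refresh` left at its zero value the `refresh` key is omitted from the output, which is how this sketch models the option being off.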

Closes #11494
Related to #6055
Discussed at https://www.youtube.com/watch?v=baIX9Bk6f5w&t=1173s

Kudos to @aslafy-z and @Sayrus for doing most of the heavy lifting here (#15603).

Checklist:

  • Either (a) I've created an enhancement proposal and discussed it with the community, (b) this is a bug fix, or (c) this does not need to be in the release notes.
  • The title of the PR states what changed and the related issues number (used for the release note).
  • The title of the PR conforms to the Toolchain Guide
  • I've included "Closes [ISSUE #]" or "Fixes [ISSUE #]" in the description to automatically close the associated issue.
  • I've updated both the CLI and UI to expose my feature, or I plan to submit a second PR with them.
  • Does this PR require documentation updates?
  • I've updated documentation as required by this PR.
  • I have signed off all my commits as required by DCO
  • I have written unit and/or e2e tests for my change. PRs without these are unlikely to be merged.
  • My build is green (troubleshooting builds).
  • My new feature complies with the feature status guidelines.
  • I have added a brief description of why this PR is necessary and/or what this PR solves.
  • Optional. My organization is added to USERS.md.
  • [n/a] Optional. For bug fixes, I've indicated what older releases this fix should be cherry-picked into (this may or may not happen depending on risk/complexity).

@olivergondza olivergondza requested review from a team as code owners May 19, 2025 11:13
@bunnyshell

bunnyshell bot commented May 19, 2025

❌ Preview Environment deleted from Bunnyshell

Available commands (reply to this comment):

  • 🚀 /bns:deploy to deploy the environment

@blakepettersson blakepettersson changed the title from "Fixes 11494: feat(sync): Permit using newer revision when retrying failed sync" to "feat(sync): Permit using newer revision when retrying failed sync (#11494)" May 19, 2025
@blakepettersson blakepettersson changed the title from "feat(sync): Permit using newer revision when retrying failed sync (#11494)" to "feat(controller): Permit using newer revision when retrying failed sync (#11494)" May 19, 2025
@codecov

codecov bot commented May 19, 2025

Codecov Report

❌ Patch coverage is 92.30769% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 60.36%. Comparing base (8b8d04e) to head (e22dfc2).
⚠️ Report is 786 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| cmd/argocd/commands/app.go | 70.00% | 3 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #23038      +/-   ##
==========================================
+ Coverage   60.32%   60.36%   +0.03%     
==========================================
  Files         350      350              
  Lines       60032    60061      +29     
==========================================
+ Hits        36217    36254      +37     
+ Misses      20901    20895       -6     
+ Partials     2914     2912       -2     

@olivergondza
Contributor Author

@jannfis, I understand you had a PoC doing something similar, so I would appreciate your review. Thanks!

Member

@agaudreault agaudreault left a comment

As discussed in the contributor meeting, only auto-sync should have this behaviour. #11494 (comment)

@agaudreault agaudreault marked this pull request as draft July 3, 2025 16:17
@anandf
Member

anandf commented Jul 4, 2025

If a sync fails, is it possible to differentiate between a retryable error (e.g. nodes, resources, or the API server not being available) and a non-retryable error (e.g. a manifest that does not conform to the schema, an image that does not exist, etc.)? The sync operation would then retry only on a retryable error; in the other case the sync would simply fail, since it does not make sense to keep retrying if the error is, say, a manifest not matching the schema.

@reggie-k reggie-k added this to the v3.2 milestone Jul 14, 2025
@olivergondza
Contributor Author

> If a sync fails, is it possible to differentiate between a retryable error (e.g. nodes, resources, or the API server not being available) and a non-retryable error (e.g. a manifest that does not conform to the schema, an image that does not exist, etc.)?

Interesting point. I have a few observations:

  • "Image does not exist" can be recoverable if the tag is being pushed in parallel.
  • Is there a way to tell what exactly has caused the sync error?
    • For "manifest is not according to the schema", I agree it sounds unrecoverable, provided we can tell it apart from "CRD not created/updated yet".

> The sync operation would then retry only on a retryable error; in the other case the sync would simply fail, since it does not make sense to keep retrying if the error is, say, a manifest not matching the schema.

Agreed that it makes no sense to retry manifests that do not match the schema with the same commit. BUT that is exactly the case I would like Argo CD to retry with a new commit ASAP.

@olivergondza
Contributor Author

At the contributors meeting it was agreed that updating the commit SHA (as originally implemented here) can be confusing for users, as it does not indicate that a sync operation with HEAD~1 has failed. The consensus was that it is better to find a way to make the current retry fail clearly, and let/force the new one to kick in.
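
The agreed direction can be sketched as a small decision function: after a failed attempt, if the refresh option is on and a newer revision exists, the current operation is allowed to fail so that a fresh sync can start on the newer commit. The names `retryAction` and `nextAction` and their signatures are assumptions for illustration and do not describe the controller code in this PR.

```go
package main

import "fmt"

// retryAction is a hypothetical result type for this sketch.
type retryAction string

const (
	retrySameRevision retryAction = "retry-same-revision"
	failAndResync     retryAction = "fail-and-let-new-sync-start"
)

// nextAction decides what to do after a failed sync attempt.
// refresh mirrors syncPolicy.retry.refresh; both revisions are commit SHAs.
func nextAction(refresh bool, attemptedRevision, latestRevision string) retryAction {
	if refresh && latestRevision != "" && latestRevision != attemptedRevision {
		// A newer commit exists: fail this operation visibly instead of
		// silently swapping the SHA under the running operation.
		return failAndResync
	}
	return retrySameRevision
}

func main() {
	fmt.Println(nextAction(true, "abc123", "def456"))
	fmt.Println(nextAction(true, "abc123", "abc123"))
	fmt.Println(nextAction(false, "abc123", "def456"))
}
```

Keeping the failed operation in the history is what preserves the signal that the sync with the older commit did not succeed.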

@olivergondza
Contributor Author

olivergondza commented Jul 16, 2025

Thanks for the pointers, @agaudreault.

I have changed the approach to let the current attempt fail and let the new one kick in, but I am struggling to get things to work. The phase never gets to Succeeded in the very last step.

Also, the UI seems to be in an odd state. The "SYNC STATUS" reports success on the last (fixed) commit, but "LAST SYNC" reports a failure on an earlier commit.

Screenshot From 2025-07-16 16-49-28

Hitting [Sync] in the E2E UI turns the app completely green, but for some reason this does not happen on its own.

@olivergondza olivergondza requested a review from agaudreault July 16, 2025 14:59
@olivergondza olivergondza marked this pull request as ready for review July 16, 2025 14:59
@olivergondza olivergondza force-pushed the issue-11494 branch 4 times, most recently from 0e6c345 to 3259682 Compare July 23, 2025 12:08
Signed-off-by: Zadkiel AHARONIAN <[email protected]>
Signed-off-by: Oliver Gondža <[email protected]>
Signed-off-by: Zadkiel AHARONIAN <[email protected]>
Signed-off-by: Oliver Gondža <[email protected]>
Signed-off-by: Zadkiel AHARONIAN <[email protected]>
Signed-off-by: Oliver Gondža <[email protected]>
@olivergondza
Contributor Author

@agaudreault, I have updated the tests. Curious to hear your thoughts regarding the previous comment...

@agaudreault
Member

agaudreault commented Sep 3, 2025

What is the motivation here? Do you suggest the explicit UI sync to behave the same?

These are the use cases / business requirements to test with this change. The UI sync behaviour should also populate the refresh option and correct sources. The rollback experience is different and should never try to refresh. The behaviour of the UI vs CLI for the same operation should be consistent with how retries are configured for each.

To make sure the operation can be refreshed correctly, you must make sure that the sources are properly set in the operation object.
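
The requirements above can be summarized in a small sketch: sync operations (automatic or explicit) follow the configured refresh option, while a rollback never refreshes. The `operationKind` enum and the `refreshOnRetry` helper are hypothetical names introduced for this example and are not identifiers from the Argo CD code base.

```go
package main

import "fmt"

// operationKind is a hypothetical classification of how an operation was started.
type operationKind int

const (
	autoSync operationKind = iota
	manualSync // explicit sync from the UI or the CLI
	rollback
)

// refreshOnRetry reports whether a failed operation of the given kind may
// pick up a newer revision, given the configured retry refresh option.
func refreshOnRetry(kind operationKind, retryRefresh bool) bool {
	if kind == rollback {
		// A rollback targets a fixed historical revision and must not move.
		return false
	}
	return retryRefresh
}

func main() {
	fmt.Println(refreshOnRetry(autoSync, true), refreshOnRetry(manualSync, true), refreshOnRetry(rollback, true))
}
```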

Signed-off-by: Zadkiel AHARONIAN <[email protected]>
Signed-off-by: Oliver Gondža <[email protected]>
Signed-off-by: Zadkiel AHARONIAN <[email protected]>
Signed-off-by: Oliver Gondža <[email protected]>
Signed-off-by: Zadkiel AHARONIAN <[email protected]>
Signed-off-by: Oliver Gondža <[email protected]>
@olivergondza
Contributor Author

@agaudreault, I have added the tests (and prod code fix) for explicit CLI sync.

Though, the (CLI) rollback seems to be a no-brainer: FailedPrecondition desc = rollback cannot be initiated when auto-sync is enabled.

@olivergondza olivergondza force-pushed the issue-11494 branch 2 times, most recently from eb627ec to 8317692 Compare September 10, 2025 11:14
olivergondza and others added 12 commits September 10, 2025 13:16
Signed-off-by: Alexandre Gaudreault <[email protected]>
Signed-off-by: Alexandre Gaudreault <[email protected]>
Signed-off-by: Alexandre Gaudreault <[email protected]>
Signed-off-by: Alexandre Gaudreault <[email protected]>
Signed-off-by: Alexandre Gaudreault <[email protected]>
Signed-off-by: Alexandre Gaudreault <[email protected]>
Member

@agaudreault agaudreault left a comment

LGTM

@agaudreault agaudreault enabled auto-merge (squash) September 11, 2025 14:46
@agaudreault agaudreault added the for-release-blog-3-2 PR that should be highlighted in the Release Blog label Sep 11, 2025
@agaudreault agaudreault merged commit 5a8b427 into argoproj:master Sep 11, 2025
31 of 32 checks passed
@olivergondza
Contributor Author

Thanks, @agaudreault. Your help was priceless!

downfa11 pushed a commit to downfa11/argo-cd that referenced this pull request Sep 12, 2025
…nc (argoproj#11494) (argoproj#23038)

Signed-off-by: Zadkiel AHARONIAN <[email protected]>
Signed-off-by: Oliver Gondža <[email protected]>
Signed-off-by: Alexandre Gaudreault <[email protected]>
Co-authored-by: Zadkiel AHARONIAN <[email protected]>
Co-authored-by: Alexandre Gaudreault <[email protected]>
Labels

for-release-blog-3-2 PR that should be highlighted in the Release Blog

Development

Successfully merging this pull request may close these issues.

Retrying failed sync's block newer commits; how to achieve declarative, level based gitops semantics?

6 participants