-
Notifications
You must be signed in to change notification settings - Fork 26
use ctrl-runtime for fault-remediation #449
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. Weβll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
WalkthroughAdds controller-runtime integration and a new public FaultRemediationReconciler with dual run modes (controller-managed and non-managed), shifts metrics registration to use the controller-runtime registry, updates initializer/main wiring and flags, augments Helm RBAC/deployment/values for ctrl-runtime, and expands multiple go.mod dependency graphs. Changes
Sequence Diagram(s)sequenceDiagram
participant App as main
participant Manager as ctrl-runtime Manager
participant Reconciler as FaultRemediationReconciler
participant Watcher as ChangeStreamWatcher
participant Datastore as Datastore
participant Metrics as Metrics/Health Server
rect rgb(220,240,220)
Note over App,Manager: Controller-runtime managed path
App->>Manager: Start manager (leader-election optional)
Manager->>Reconciler: SetupWithManager(mgr)
Watcher->>Reconciler: AdaptEvents -> typed events
Reconciler->>Datastore: Parse & process event (Reconcile)
Reconciler->>Datastore: Update remediation state / create CRs
Manager->>Reconciler: Health/Ready probes invoked
end
rect rgb(240,230,220)
Note over App,Metrics: Non-controller-runtime path
App->>Metrics: Start metrics/health server (promauto.With(crmetrics.Registry))
Reconciler->>Watcher: StartWatcherStream(ctx)
loop event loop
Watcher->>Reconciler: emit EventWithToken
Reconciler->>Datastore: Parse & process event
Reconciler->>Datastore: Update remediation state
end
App->>Reconciler: Shutdown
Reconciler->>Datastore: CloseAll()
end
Estimated code review effortπ― 4 (Complex) | β±οΈ ~60 minutes
Poem
Pre-merge checks and finishing touchesβ Failed checks (1 warning)
β Passed checks (2 passed)
β¨ Finishing touches
π§ͺ Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. π§ golangci-lint (2.5.0)level=error msg="[linters_context] typechecking error: pattern ./...: directory prefix . does not contain main module or its selected dependencies" Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
cfa5826 to
783b2df
Compare
64193af to
f722b46
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 8
π§Ή Nitpick comments (13)
health-events-analyzer/go.mod (1)
89-89: Verify whether controller-runtime should be a direct dependency.Line 89 lists
sigs.k8s.io/controller-runtime v0.22.4 // indirect. According to the PR objectives, this PR integrates controller-runtime for fault-remediation. If the fault-remediation component actively imports and uses controller-runtime APIs, it should be declared as a direct dependency in therequiresection (lines 7β18), not as an indirect dependency.Review the actual code changes to confirm whether controller-runtime is directly used in this module. If so, move it to the direct requires section.
commons/pkg/server/server.go (1)
218-234: Improve documentation to clarify usage and differences.The implementation looks correct, but the documentation could be clearer:
- The example code still references
WithPrometheusMetrics()instead ofWithPrometheusMetricsCtrlRuntime()- It doesn't explain when to use this option versus
WithPrometheusMetrics()- It should note that these two options are mutually exclusive (both register
/metrics)Consider updating the documentation to:
- Explain that this option uses the controller-runtime metrics registry instead of the default Prometheus registry
- Indicate when to prefer this option (e.g., "Use this when integrating with controller-runtime to consolidate metrics from controller-runtime and custom metrics in a single registry")
- Note that only one metrics option should be used at a time
Apply this diff to improve the documentation:
-// WithPrometheusMetricsCtrlRuntime registers the Prometheus metrics handler at /metrics. -// This uses the controller runtime Prometheus registry. +// WithPrometheusMetricsCtrlRuntime registers the Prometheus metrics handler at /metrics +// using the controller-runtime metrics registry (sigs.k8s.io/controller-runtime/pkg/metrics). +// +// Use this option when integrating with controller-runtime to consolidate metrics from +// controller-runtime (e.g., workqueue, leader election) and custom application metrics. +// Note: This option is mutually exclusive with WithPrometheusMetrics() - use only one. // // Example: // // srv := NewServer( // WithPort(2112), -// WithPrometheusMetrics(), +// WithPrometheusMetricsCtrlRuntime(), // )distros/kubernetes/nvsentinel/charts/fault-remediation/templates/clusterrole.yaml (2)
31-55: Consider making RBAC rules conditional onctrlRuntimeEnabled.The new RBAC rules for
leasesandeventsare required for controller-runtime leader election and event recording. However, they are unconditionally added regardless of whetherctrlRuntimeEnabledis true. For least-privilege, consider wrapping these in a conditional:{{- if (.Values.global).ctrlRuntimeEnabled }} - apiGroups: - "coordination.k8s.io" resources: - leases verbs: - get - create - update - patch - apiGroups: - "" resources: - "events" verbs: - "create" - "patch" {{- end }}
31-39: Minor formatting inconsistency.The new
coordination.k8s.iorule uses 4-space indentation forapiGroups,resources, andverbs, while other rules in this file use 2-space indentation. Consider aligning for consistency:- apiGroups: - - "coordination.k8s.io" + - "coordination.k8s.io" resources: - - leases + - leases verbs: - - get - - create - - update - - patch + - get + - create + - update + - patchfault-remediation/pkg/initializer/init.go (2)
99-116: Resource cleanup on error paths.The datastore and watcher are created but if subsequent initialization fails (e.g., reconciler creation), these resources are not cleaned up. Consider adding cleanup logic or using defer patterns.
Example approach:
ds, err := datastore.NewDataStore(ctx, *datastoreConfig) if err != nil { return nil, fmt.Errorf("error initializing datastore: %w", err) } // Ensure cleanup on subsequent errors var cleanupNeeded bool defer func() { if cleanupNeeded { if closeErr := ds.Close(ctx); closeErr != nil { slog.Error("failed to close datastore during cleanup", "error", closeErr) } } }() cleanupNeeded = true // ... rest of initialization ... cleanupNeeded = false // Success, don't cleanup return &Components{...}, nil
105-108: Consider extracting the collection name to a constant.The
"HealthEvents"collection name is hardcoded. Consider defining this as a constant to ensure consistency across the codebase and make it easier to maintain.fault-remediation/main.go (3)
17-39: Non-standard import ordering.Go convention groups imports: stdlib, then a blank line, then external packages. Currently stdlib and external imports are interleaved (e.g.,
errors,flag,fmtmixed withfd.axjsq.dpdns.org/nvidia/...,golang.org/x/sync/errgroup).import ( "context" "errors" "flag" "fmt" - "github.com/nvidia/nvsentinel/commons/pkg/server" - "golang.org/x/sync/errgroup" - "k8s.io/apimachinery/pkg/runtime" "log/slog" "os" "os/signal" - "sigs.k8s.io/controller-runtime/pkg/healthz" - metricsserver "sigs.k8s.io/controller-runtime/pkg/metrics/server" "strconv" "strings" "syscall" "time" + "golang.org/x/sync/errgroup" + "k8s.io/apimachinery/pkg/runtime" ctrl "sigs.k8s.io/controller-runtime" + "sigs.k8s.io/controller-runtime/pkg/healthz" + metricsserver "sigs.k8s.io/controller-runtime/pkg/metrics/server" + "github.com/nvidia/nvsentinel/commons/pkg/server" "github.com/nvidia/nvsentinel/commons/pkg/logger" "github.com/nvidia/nvsentinel/fault-remediation/pkg/initializer" )
263-266: Unnecessary explicit return statement.The explicit
returnat the end of a function with no return values is redundant.flag.Parse() - - return }
195-198: Inconsistent error wrapping.Other errors in this function are wrapped with context (e.g., line 191), but the manager start error is returned unwrapped. As per coding guidelines, wrap errors with context using
fmt.Errorf.if err = mgr.Start(ctrl.SetupSignalHandler()); err != nil { slog.Error("Problem running manager", "error", err) - return err + return fmt.Errorf("manager start failed: %w", err) }fault-remediation/pkg/reconciler/reconciler_e2e_test.go (3)
230-232: Inconsistent error handling: panic vs log.Fatalf.Manager creation errors use
panic(err)while other setup errors uselog.Fatalf. For consistency, uselog.Fatalfthroughout TestMain.mgr, err := ctrl.NewManager(testEnv.Config, ctrl.Options{ Scheme: scheme.Scheme, Metrics: metricsserver.Options{ BindAddress: "0", }, }) if err != nil { - panic(err) + log.Fatalf("Failed to create manager: %v", err) }
1346-1348: Remove commented-out code.The commented
//mockStore := &MockHealthEventStore{}should be removed as it's dead code.watcher := NewMockChangeStreamWatcher() - //mockStore := &MockHealthEventStore{} r := &FaultRemediationReconciler{
17-52: Non-standard import ordering.Imports mix stdlib (
context,log,os,testing, etc.) with external packages. Go convention groups stdlib imports first, then a blank line, then external packages.fault-remediation/pkg/reconciler/reconciler.go (1)
505-517: CloseAll returns early on first error, potentially leaving resources open.If
r.ds.Close(ctx)fails,r.Watcher.Close(ctx)is never called. Consider closing all resources and aggregating errors to ensure proper cleanup.func (r *FaultRemediationReconciler) CloseAll(ctx context.Context) error { + var errs []error + if err := r.ds.Close(ctx); err != nil { slog.Error("failed to close datastore", "error", err) - return err + errs = append(errs, fmt.Errorf("close datastore: %w", err)) } if err := r.Watcher.Close(ctx); err != nil { slog.Error("failed to close Watcher", "error", err) - return err + errs = append(errs, fmt.Errorf("close watcher: %w", err)) } + if len(errs) > 0 { + return errors.Join(errs...) + } return nil }Note: You'll need to add
"errors"to the imports.
π Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
β Files ignored due to path filters (5)
commons/go.sumis excluded by!**/*.sumevent-exporter/go.sumis excluded by!**/*.sumfault-remediation/go.sumis excluded by!**/*.sumhealth-events-analyzer/go.sumis excluded by!**/*.sumhealth-monitors/syslog-health-monitor/go.sumis excluded by!**/*.sum
π Files selected for processing (18)
commons/go.mod(2 hunks)commons/pkg/server/server.go(2 hunks)distros/kubernetes/nvsentinel/charts/fault-remediation/templates/clusterrole.yaml(2 hunks)distros/kubernetes/nvsentinel/charts/fault-remediation/templates/deployment.yaml(1 hunks)distros/kubernetes/nvsentinel/values-full.yaml(1 hunks)distros/kubernetes/nvsentinel/values-tilt.yaml(2 hunks)event-exporter/go.mod(1 hunks)fault-remediation/go.mod(4 hunks)fault-remediation/main.go(3 hunks)fault-remediation/pkg/initializer/init.go(5 hunks)fault-remediation/pkg/reconciler/metrics.go(4 hunks)fault-remediation/pkg/reconciler/reconciler.go(17 hunks)fault-remediation/pkg/reconciler/reconciler_e2e_test.go(44 hunks)fault-remediation/pkg/reconciler/reconciler_test.go(15 hunks)fault-remediation/pkg/reconciler/remediation.go(1 hunks)health-events-analyzer/go.mod(1 hunks)health-monitors/syslog-health-monitor/go.mod(1 hunks)tests/fault_remediation_test.go(1 hunks)
π§° Additional context used
π Path-based instructions (3)
**/*.go
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging vialog/slogin Go code
Wrap errors with context usingfmt.Errorf("context: %w", err)in Go code
Withinretry.RetryOnConflictblocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such assyncedoverokfor cache sync checks
Useclient-gofor Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types
Files:
tests/fault_remediation_test.gofault-remediation/pkg/reconciler/remediation.gocommons/pkg/server/server.gofault-remediation/pkg/reconciler/reconciler_test.gofault-remediation/pkg/initializer/init.gofault-remediation/main.gofault-remediation/pkg/reconciler/metrics.gofault-remediation/pkg/reconciler/reconciler.gofault-remediation/pkg/reconciler/reconciler_e2e_test.go
**/*_test.go
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/*_test.go: Useenvtestfor testing Kubernetes controllers instead of fake clients
Usetestify/assertandtestify/requirefor assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format:TestFunctionName_Scenario_ExpectedBehavior
Files:
tests/fault_remediation_test.gofault-remediation/pkg/reconciler/reconciler_test.gofault-remediation/pkg/reconciler/reconciler_e2e_test.go
**/go.mod
π CodeRabbit inference engine (.github/copilot-instructions.md)
Use
go.modfor each service as a separate Go module with semantic import versioning
Files:
commons/go.modhealth-events-analyzer/go.modfault-remediation/go.modhealth-monitors/syslog-health-monitor/go.modevent-exporter/go.mod
π§ Learnings (13)
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Name Go tests descriptively using format: `TestFunctionName_Scenario_ExpectedBehavior`
Applied to files:
tests/fault_remediation_test.go
π Learning: 2025-11-06T16:18:09.952Z
Learnt from: KaivalyaMDabhadkar
Repo: NVIDIA/NVSentinel PR: 280
File: tests/fault_management_test.go:334-345
Timestamp: 2025-11-06T16:18:09.952Z
Learning: In the fault-quarantine component, the "quarantinedNodeUncordonedManually" annotation is set to the string literal "True" (with uppercase T), defined as the constant QuarantinedNodeUncordonedManuallyAnnotationValue in fault-quarantine/pkg/common/common.go. Tests should compare against "True", not "true".
Applied to files:
tests/fault_remediation_test.gofault-remediation/pkg/reconciler/reconciler_e2e_test.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Implement proper shutdown handling with context cancellation in Go code
Applied to files:
fault-remediation/pkg/reconciler/remediation.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Prefer informers over direct API calls for watching Kubernetes resources
Applied to files:
fault-remediation/pkg/reconciler/remediation.gohealth-events-analyzer/go.modfault-remediation/go.modhealth-monitors/syslog-health-monitor/go.modevent-exporter/go.mod
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Use `commons/` for shared utilities across Go modules
Applied to files:
commons/go.modhealth-events-analyzer/go.modhealth-monitors/syslog-health-monitor/go.mod
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/go.mod : Use `go.mod` for each service as a separate Go module with semantic import versioning
Applied to files:
commons/go.modhealth-events-analyzer/go.modfault-remediation/go.modhealth-monitors/syslog-health-monitor/go.modevent-exporter/go.mod
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use `client-go` for Kubernetes API interactions in Go code
Applied to files:
commons/go.modhealth-events-analyzer/go.modhealth-monitors/syslog-health-monitor/go.modevent-exporter/go.mod
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients
Applied to files:
commons/go.modfault-remediation/pkg/reconciler/reconciler_e2e_test.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Keep Go dependencies minimal and up-to-date
Applied to files:
commons/go.modhealth-events-analyzer/go.modfault-remediation/go.modhealth-monitors/syslog-health-monitor/go.modevent-exporter/go.mod
π Learning: 2025-10-29T15:37:49.210Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 160
File: health-events-analyzer/pkg/reconciler/metrics.go:52-58
Timestamp: 2025-10-29T15:37:49.210Z
Learning: In health-events-analyzer/pkg/reconciler/metrics.go, the ruleMatchedTotal metric should include both "rule_name" and "node_name" labels to identify which rule triggered on which node and how many times, as per the project's observability requirements.
Applied to files:
fault-remediation/pkg/reconciler/metrics.go
π Learning: 2025-11-26T13:54:26.427Z
Learnt from: rupalis-nv
Repo: NVIDIA/NVSentinel PR: 361
File: distros/kubernetes/nvsentinel/values-tilt.yaml:215-223
Timestamp: 2025-11-26T13:54:26.427Z
Learning: For values-tilt.yaml files: Keep documentation minimal and implementation-focused. Unlike production values.yaml files which need comprehensive inline comments, values-tilt.yaml is a Tilt-specific testing/development override file that doesn't require extensive documentation.
Applied to files:
distros/kubernetes/nvsentinel/values-tilt.yaml
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Within `retry.RetryOnConflict` blocks, return errors without wrapping to preserve retry behavior
Applied to files:
fault-remediation/pkg/reconciler/reconciler.go
π Learning: 2025-12-01T17:53:20.067Z
Learnt from: jtschelling
Repo: NVIDIA/NVSentinel PR: 455
File: docs/designs/013-remediation-plugins.md:162-204
Timestamp: 2025-12-01T17:53:20.067Z
Learning: Code examples in design documents (docs/designs/) in the NVSentinel repository are illustrative stubs meant to show intent and architectural concepts, and do not need to be 100% correct or complete implementations (e.g., undefined variables, missing error handling, incomplete struct initialization are acceptable).
Applied to files:
fault-remediation/pkg/reconciler/reconciler.go
𧬠Code graph analysis (3)
fault-remediation/pkg/reconciler/reconciler_test.go (1)
fault-remediation/pkg/reconciler/reconciler.go (3)
NewFaultRemediationReconciler(73-88)HealthEventDoc(52-55)ReconcilerConfig(41-50)
fault-remediation/pkg/initializer/init.go (4)
fault-remediation/pkg/reconciler/reconciler.go (2)
FaultRemediationReconciler(63-71)NewFaultRemediationReconciler(73-88)store-client/pkg/datastore/registry.go (1)
NewDataStore(120-123)store-client/pkg/watcher/factory.go (2)
WatcherConfig(26-38)CreateChangeStreamWatcher(65-87)store-client/pkg/datastore/interfaces.go (1)
HealthEventStore(63-87)
fault-remediation/pkg/reconciler/reconciler_e2e_test.go (3)
fault-remediation/pkg/reconciler/reconciler.go (3)
FaultRemediationReconciler(63-71)ReconcilerConfig(41-50)NewFaultRemediationReconciler(73-88)data-models/pkg/model/health_event_extentions.go (1)
HealthEventStatus(45-51)fault-remediation/pkg/reconciler/remediation.go (1)
FaultRemediationClient(56-68)
π Additional comments (29)
health-events-analyzer/go.mod (2)
27-94: Rungo mod tidyto validate dependency consistency.The expansion of indirect dependencies when integrating controller-runtime is expected, but ensure the go.mod file is in a canonical, validated state by running
go mod tidyto regenerate and clean up the dependency graph.
36-49: Verify unusual go-openapi/swag submodule entries in indirect dependencies.Lines 36β49 list multiple
go-openapi/swag/submodules (cmdutils, conv, fileutils, jsonname, jsonutils, loading, mangling, netutils, stringutils, typeutils, yamlutils) as separate indirect entries with explicit version constraints. Standard Go module practice does not list subpackages within a single module as separate entries in go.mod; only the root module is typically included.Run
go mod tidyto ensure the go.mod file reflects the true dependency graph and remove any spurious submodule entries if present.health-monitors/syslog-health-monitor/go.mod (2)
7-19: Direct dependencies are clean and appropriate.The direct dependencies remain minimal and focused (systemd, retryablehttp, prometheus client, test utilities, gRPC, protobuf, and Kubernetes apimachinery). Local replacements for
commonsanddata-modelsare correctly configured, aligning with coding guidelines for shared utilities.
21-81: Verify necessity of expanded indirect dependency surface and apply go mod diagnostic tools.The indirect dependency list has grown significantly, particularly with OpenAPI tooling (
go-openapi/*), Kubernetes modules (k8s.io/api,k8s.io/client-go,sigs.k8s.io/controller-runtime), and serialization utilities. This aligns with the PR objective to integrate controller-runtime.Per Go module best practices, verify the expansion is justified by:
- Running
go mod tidyto ensure no unused transitive dependencies remain.- Using
go mod why -m <module>on questionable indirect modules (especiallygo-openapi/*submodules) to confirm they're actually needed by imports.- Running
go mod graphto inspect the full dependency chain and identify any unexpectedly deep pulls.If any indirect dependencies are not genuinely required by the codebase, remove them or upgrade direct dependency versions to reduce transitive bloat.
fault-remediation/pkg/reconciler/remediation.go (1)
382-384: LGTM! Minor formatting improvement.The blank line improves readability by visually separating the context setup from the defer statement.
fault-remediation/go.mod (1)
32-32: LGTM! Expected transitive dependencies from controller-runtime.The added indirect dependencies (fsnotify, btree, zap) and updated prometheus/common version are legitimate transitive dependencies pulled in by the controller-runtime integration introduced in this PR.
Also applies to: 51-51, 62-62, 72-72
event-exporter/go.mod (1)
18-84: LGTM! Transitive dependencies from controller-runtime integration.The expanded indirect dependency graph (Kubernetes ecosystem, controller-runtime, and supporting libraries) is expected as the commons module now depends on controller-runtime. These are automatically managed transitive dependencies.
distros/kubernetes/nvsentinel/values-full.yaml (1)
49-61: LGTM! Well-documented feature flag for controller-runtime integration.The introduction of
ctrlRuntimeEnabledwith defaultfalseensures backward compatibility while enabling optional controller-runtime management. The separatehealthPortand clear documentation of when each port is used provides good operational guidance. Marking this as "still experimental" appropriately sets expectations.tests/fault_remediation_test.go (1)
28-28: LGTM! Typo fix improves test naming.Corrected "Reemdiated" to "Remediated" in the test function name.
distros/kubernetes/nvsentinel/values-tilt.yaml (2)
24-24: LGTM! Consistent feature flag configuration.Adding
ctrlRuntimeEnabled: falseensures the Tilt testing environment aligns with the production default configuration.
220-220: LGTM! Appropriate replica count for development.Setting
replicaCount: 1is suitable for the Tilt development/testing environment.commons/go.mod (1)
17-17: LGTM! Enables controller-runtime integration in commons.Adding
sigs.k8s.io/controller-runtime v0.22.4as a direct dependency appropriately supports the newWithPrometheusMetricsCtrlRuntime()option incommons/pkg/server/server.go. The version aligns with the Kubernetes ecosystem dependencies used throughout the project.distros/kubernetes/nvsentinel/charts/fault-remediation/templates/deployment.yaml (1)
81-91: LGTM!The conditional addition of controller-runtime and leader-elect flags, along with the health port exposure, is correctly structured with nil-safe value access.
fault-remediation/pkg/reconciler/reconciler_test.go (4)
377-380: LGTM!The test correctly handles the updated
performRemediationsignature that now returns an error as the third value.
219-221: LGTM!The test correctly uses the new
NewFaultRemediationReconcilerconstructor with nil dependencies for unit testing, which is appropriate when the tests don't exercise the datastore/watcher paths.
273-274: LGTM!Field access correctly updated to use
r.config.RemediationClientmatching the now-private field naming convention.
995-1008: Verify constructor usage and method availability inreconciler_test.go.The review flags inconsistent usage of
NewReconciler(cfg, false)at line 995 versusNewFaultRemediationReconcilerused elsewhere, and questions whetherrunLogCollectorexists on the updated reconciler type. Confirm that:
NewReconcilerconstructor is still available (or has been replaced)runLogCollectormethod exists onFaultRemediationReconciler- The suggested diff is compatible with the current API
fault-remediation/pkg/initializer/init.go (1)
131-134: LGTM!The new
FaultRemediationReconcileris correctly wired with all required dependencies (datastore, watcher, healthEventStore, config, and dryRun flag).fault-remediation/pkg/reconciler/metrics.go (3)
53-59: Breaking change: metric label renamed fromerror_typetoaction.The
totalUnsupportedRemediationActionsmetric now usesactioninstead oferror_typeas a label. While semantically more accurate, this is a breaking change that will affect existing dashboards or alerting rules referencing this metric.Verify that no existing dashboards, alerts, or monitoring configurations reference
fault_remediation_unsupported_actions_total{error_type=...}.
62-92: LGTM!The metrics are consistently registered with
crmetrics.Registry, which is the correct pattern for controller-runtime integration. The histogram buckets and label definitions are appropriate.
29-38: Verify dual-mode operation doesn't cause metric registration conflicts.All metrics are now registered with
crmetrics.Registry. Confirm that when running without controller-runtime integration, the metrics are still properly exposed through the standard Prometheus endpoint, or that a separate initialization path handles non-controller-runtime mode.fault-remediation/main.go (1)
42-42: Scheme is declared but not populated.The
schemevariable is passed to the controller-runtime manager but no types are registered viaAddToScheme. This works currently because the reconciler uses dynamic/unstructured clients, but if you later add typed resources (e.g., watching CRDs with generated types), the manager won't recognize them.Confirm this is intentional for now. If typed resources will be added later, consider adding a comment or TODO.
fault-remediation/pkg/reconciler/reconciler_e2e_test.go (2)
171-255: Well-structured test setup using envtest.The TestMain setup properly uses
envtestfor testing Kubernetes controllers, creates necessary clients, and sets up the controller-runtime manager. This aligns with the coding guidelines.
96-108: Mock implementation looks good.The
MockHealthEventStorewithupdateCalledcounter provides a clean way to verify update behavior in tests.fault-remediation/pkg/reconciler/reconciler.go (5)
321-323: Log collector error silently ignored.The error from
runLogCollectoris discarded with_ = r.runLogCollector(...). While log collection may be non-critical, silently ignoring the error could make debugging difficult. The function already logs internally (lines 178-180), but consider whether the caller should also be aware of failures.Confirm this is intentional. If log collection failures should not block remediation but should be visible, the current behavior (internal logging + discarding at call site) is acceptable.
73-88: Constructor looks good.The
NewFaultRemediationReconcilerproperly initializes all required fields. The embeddedclient.Clientis left unset as it's populated by controller-runtime duringSetupWithManager.
549-573: Well-implemented channel adapter.The
AdaptEventsfunction correctly adapts the watcher's event channel to controller-runtime's typed event channel. The implementation properly:
- Handles context cancellation
- Closes the output channel when done
- Creates a copy of the event before taking its pointer (avoiding loop variable capture issues)
205-206: Acknowledged TODO for ctrl-runtime retry pattern.Based on past review comments, returning errors for ctrl-runtime built-in retries will be addressed in a follow-up PR to keep this change focused. The current retry logic works but will be simplified.
90-125: Reconcile method follows controller-runtime patterns.The method signature
Reconcile(ctx context.Context, event *datastore.EventWithToken) (ctrl.Result, error)aligns with controller-runtime'sReconcilerinterface pattern. Metrics tracking and error handling are appropriate.
distros/kubernetes/nvsentinel/charts/fault-remediation/templates/deployment.yaml
Outdated
Show resolved
Hide resolved
distros/kubernetes/nvsentinel/charts/fault-remediation/templates/deployment.yaml
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
β»οΈ Duplicate comments (2)
fault-remediation/pkg/reconciler/reconciler.go (1)
338-343: Inconsistent error handling: MarkProcessed failure returns nil.When
MarkProcessedfails, the function returnsnilerror instead of propagating it, unlike similar error handling at lines 271, 292, and 315.As noted in past reviews, this is intentional to retain current behavior, with follow-up planned.
fault-remediation/main.go (1)
144-154: Event loop doesn't respect context cancellation.The
for event := range ...loop blocks until the channel closes. If the context is cancelled (e.g., SIGTERM), this loop won't exit until the watcher's channel is closed.Consider using a select with
ctx.Done():g.Go(func() error { components.FaultRemediationReconciler.StartWatcherStream(gCtx) slog.Info("Listening for events on the channel...") - for event := range components.FaultRemediationReconciler.Watcher.Events() { - slog.Info("Event received", "event", event) - _, _ = components.FaultRemediationReconciler.Reconcile(gCtx, &event) + eventCh := components.FaultRemediationReconciler.Watcher.Events() + for { + select { + case <-gCtx.Done(): + slog.Info("Context cancelled, stopping event processing") + return nil + case event, ok := <-eventCh: + if !ok { + return nil + } + slog.Info("Event received", "event", event) + _, _ = components.FaultRemediationReconciler.Reconcile(gCtx, &event) + } } - - return nil })
π§Ή Nitpick comments (7)
fault-remediation/pkg/reconciler/reconciler_e2e_test.go (3)
17-27: Non-standard import grouping.Imports are not grouped according to Go conventions (stdlib, then external, then internal packages). The
k8s.io/client-go/kubernetes/schemeimport at line 19 is placed between stdlib imports.Consider reorganizing imports:
import ( "context" - "k8s.io/client-go/kubernetes/scheme" "log" "os" "path/filepath" - ctrl "sigs.k8s.io/controller-runtime" - metricsserver "sigs.k8s.io/controller-runtime/pkg/metrics/server" "testing" "text/template" "time" + + ctrl "sigs.k8s.io/controller-runtime" + metricsserver "sigs.k8s.io/controller-runtime/pkg/metrics/server" + "k8s.io/client-go/kubernetes/scheme"
369-372: Consider using the constructor for consistency.Tests directly instantiate
FaultRemediationReconcilervia struct literal rather than usingNewFaultRemediationReconciler. While this works for isolated tests, it may miss initialization logic added to the constructor in the future.For consistency with the public API, consider using the constructor or documenting why direct instantiation is preferred here.
1345-1351: Remove commented-out code.Line 1346 contains commented-out code that should be removed to maintain code cleanliness.
watcher := NewMockChangeStreamWatcher() - //mockStore := &MockHealthEventStore{} r := &FaultRemediationReconciler{ Watcher: watcher, }fault-remediation/main.go (3)
17-39: Non-standard import grouping.Imports mix stdlib, external, and internal packages without proper grouping. Go conventions recommend grouping: stdlib first, then external packages, then internal packages.
import ( "context" "errors" "flag" "fmt" - "github.com/nvidia/nvsentinel/commons/pkg/server" - "golang.org/x/sync/errgroup" - "k8s.io/apimachinery/pkg/runtime" "log/slog" "os" "os/signal" - "sigs.k8s.io/controller-runtime/pkg/healthz" - metricsserver "sigs.k8s.io/controller-runtime/pkg/metrics/server" "strconv" "strings" "syscall" "time" - - ctrl "sigs.k8s.io/controller-runtime" + + "golang.org/x/sync/errgroup" + "k8s.io/apimachinery/pkg/runtime" + ctrl "sigs.k8s.io/controller-runtime" + "sigs.k8s.io/controller-runtime/pkg/healthz" + metricsserver "sigs.k8s.io/controller-runtime/pkg/metrics/server" "github.com/nvidia/nvsentinel/commons/pkg/logger" + "github.com/nvidia/nvsentinel/commons/pkg/server" "github.com/nvidia/nvsentinel/fault-remediation/pkg/initializer" )
77-79: Error message should not end with punctuation.Go convention is that error messages should not end with punctuation or be capitalized (unless proper nouns).
if !enableControllerRuntime && enableLeaderElection { - return errors.New("leader-election requires controller-runtime.") + return errors.New("leader-election requires controller-runtime") }
264-267: Remove unnecessary return statement.The explicit
returnat the end ofparseFlags()is unnecessary since the function has no return values.flag.Parse() - - return }fault-remediation/pkg/reconciler/reconciler.go (1)
73-88: Constructor returns value type instead of pointer.
NewFaultRemediationReconcilerreturnsFaultRemediationReconciler(value) but the struct embedsclient.Clientwhich is typically assigned post-construction when using controller-runtime's manager. Returning a value makes it awkward to assign the client later.Consider returning a pointer for consistency with controller-runtime patterns:
func NewFaultRemediationReconciler( ds datastore.DataStore, watcher datastore.ChangeStreamWatcher, healthEventStore datastore.HealthEventStore, config ReconcilerConfig, dryRun bool, -) FaultRemediationReconciler { - return FaultRemediationReconciler{ +) *FaultRemediationReconciler { + return &FaultRemediationReconciler{ ds: ds, Watcher: watcher, healthEventStore: healthEventStore, config: config, annotationManager: config.RemediationClient.GetAnnotationManager(), dryRun: dryRun, } }This would require updating the
Componentsstruct and callers accordingly.
π Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
π Files selected for processing (4)
distros/kubernetes/nvsentinel/charts/fault-remediation/templates/deployment.yaml(1 hunks)fault-remediation/main.go(3 hunks)fault-remediation/pkg/reconciler/reconciler.go(17 hunks)fault-remediation/pkg/reconciler/reconciler_e2e_test.go(44 hunks)
π§ Files skipped from review as they are similar to previous changes (1)
- distros/kubernetes/nvsentinel/charts/fault-remediation/templates/deployment.yaml
π§° Additional context used
π Path-based instructions (2)
**/*.go
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging vialog/slogin Go code
Wrap errors with context usingfmt.Errorf("context: %w", err)in Go code
Withinretry.RetryOnConflictblocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such assyncedoverokfor cache sync checks
Useclient-gofor Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types
Files:
fault-remediation/main.gofault-remediation/pkg/reconciler/reconciler_e2e_test.gofault-remediation/pkg/reconciler/reconciler.go
**/*_test.go
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/*_test.go: Useenvtestfor testing Kubernetes controllers instead of fake clients
Usetestify/assertandtestify/requirefor assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format:TestFunctionName_Scenario_ExpectedBehavior
Files:
fault-remediation/pkg/reconciler/reconciler_e2e_test.go
π§ Learnings (6)
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Implement proper shutdown handling with context cancellation in Go code
Applied to files:
fault-remediation/main.gofault-remediation/pkg/reconciler/reconciler_e2e_test.gofault-remediation/pkg/reconciler/reconciler.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Prefer informers over direct API calls for watching Kubernetes resources
Applied to files:
fault-remediation/main.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients
Applied to files:
fault-remediation/pkg/reconciler/reconciler_e2e_test.go
π Learning: 2025-11-06T16:18:09.952Z
Learnt from: KaivalyaMDabhadkar
Repo: NVIDIA/NVSentinel PR: 280
File: tests/fault_management_test.go:334-345
Timestamp: 2025-11-06T16:18:09.952Z
Learning: In the fault-quarantine component, the "quarantinedNodeUncordonedManually" annotation is set to the string literal "True" (with uppercase T), defined as the constant QuarantinedNodeUncordonedManuallyAnnotationValue in fault-quarantine/pkg/common/common.go. Tests should compare against "True", not "true".
Applied to files:
fault-remediation/pkg/reconciler/reconciler_e2e_test.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Within `retry.RetryOnConflict` blocks, return errors without wrapping to preserve retry behavior
Applied to files:
fault-remediation/pkg/reconciler/reconciler.go
π Learning: 2025-12-01T17:53:20.067Z
Learnt from: jtschelling
Repo: NVIDIA/NVSentinel PR: 455
File: docs/designs/013-remediation-plugins.md:162-204
Timestamp: 2025-12-01T17:53:20.067Z
Learning: Code examples in design documents (docs/designs/) in the NVSentinel repository are illustrative stubs meant to show intent and architectural concepts, and do not need to be 100% correct or complete implementations (e.g., undefined variables, missing error handling, incomplete struct initialization are acceptable).
Applied to files:
fault-remediation/pkg/reconciler/reconciler.go
𧬠Code graph analysis (1)
fault-remediation/main.go (3)
fault-remediation/pkg/initializer/init.go (3)
InitializationParams(34-39)InitializeAll(46-135)Components(41-43)fault-remediation/pkg/reconciler/reconciler.go (1)
FaultRemediationReconciler(63-71)commons/pkg/server/server.go (4)
NewServer(375-396)WithPort(151-153)WithPrometheusMetricsCtrlRuntime(227-234)WithSimpleHealth(246-254)
π Additional comments (6)
fault-remediation/pkg/reconciler/reconciler_e2e_test.go (1)
288-345: Good refactor: Error returns instead of panics.The function now properly returns errors instead of panicking, making tests more robust and easier to debug. The callers have been updated to handle these errors appropriately.
fault-remediation/main.go (1)
42-43: Empty scheme may limit controller-runtime capabilities.The
schemeis declared but no types are registered to it. While this works for the current implementation using dynamic clients and external event sources, it may cause issues if the reconciler later needs to work with typed Kubernetes objects.Verify this is intentional. If the reconciler doesn't need to watch/reconcile typed Kubernetes resources, this is fine. Otherwise, register the needed types:
import ( corev1 "k8s.io/api/core/v1" utilruntime "k8s.io/apimachinery/pkg/util/runtime" ) func init() { utilruntime.Must(corev1.AddToScheme(scheme)) }Also applies to: 162-164
fault-remediation/pkg/reconciler/reconciler.go (4)
90-125: Reconcile method correctly handles malformed events.The method returns
ctrl.Result{}, nilfor parse failures and nil health events, which prevents controller-runtime from retrying malformed events. This aligns with the discussed behavior of dropping unparseable events.
205-224: Manual retry loop to be replaced with controller-runtime retries.The TODO at line 205 notes this retry logic should be replaced with controller-runtime's built-in retry mechanism. The current implementation maintains backward compatibility but adds complexity.
This is acknowledged for follow-up work as discussed in past reviews.
519-545: Watcher lifecycle not tied to manager.While the context is now passed as a parameter (addressing part of the past review), the watcher's lifecycle isn't tied to the manager's lifecycle. If the manager stops gracefully, the watcher won't be explicitly stopped until the context is cancelled by the parent signal handler.
This may be acceptable if the signal context propagation is sufficient. However, for explicit lifecycle management, consider registering the watcher as a
Runnable:if err := mgr.Add(manager.RunnableFunc(func(ctx context.Context) error { r.Watcher.Start(ctx) <-ctx.Done() return r.Watcher.Close(ctx) })); err != nil { return fmt.Errorf("failed to add watcher runnable: %w", err) }This ensures the watcher is cleanly stopped when the manager shuts down.
547-571: Well-structured channel adaptation with proper context handling.The
AdaptEventsfunction correctly:
- Uses select to respect context cancellation
- Properly closes the output channel via defer
- Creates a local copy before taking pointer (line 564-565) to avoid data races
|
hey @ivelichkovich , can you please complete the DCO -- https://github.com/NVIDIA/NVSentinel/pull/449/checks?check_run_id=57068729072 so that we can trigger the e2e tests on the PR for some feedback? |
9f5f130 to
47c2e36
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
β»οΈ Duplicate comments (3)
fault-remediation/pkg/reconciler/reconciler.go (2)
338-343: Inconsistent error handling for MarkProcessed failure.When
MarkProcessedfails at line 338, the function returnsctrl.Result{}, nilinstead of returning the error. This is inconsistent with similar error handling at lines 292 and 315, which return the error. This could cause the event to be reprocessed if marking fails, but controller-runtime won't be notified for retry backoff.Apply this diff to make error handling consistent:
if err = watcherInstance.MarkProcessed(ctx, eventWithToken.ResumeToken); err != nil { processingErrors.WithLabelValues("mark_processed_error", nodeName).Inc() slog.Error("Error updating resume token", "error", err) - return ctrl.Result{}, nil + return ctrl.Result{}, err }Note: Based on past review discussions, this is a known issue with planned follow-up work to align error handling patterns.
505-517: CloseAll returns early, potentially leaving resources open.If
ds.Closefails at line 506, the function returns immediately without callingWatcher.Close, potentially leaving the watcher resource open. This violates proper cleanup patterns.Consider closing all resources regardless of individual failures and aggregating errors:
+import "errors" + func (r *FaultRemediationReconciler) CloseAll(ctx context.Context) error { + var errs []error + if err := r.ds.Close(ctx); err != nil { slog.Error("failed to close datastore", "error", err) - return err + errs = append(errs, fmt.Errorf("datastore close: %w", err)) } if err := r.Watcher.Close(ctx); err != nil { slog.Error("failed to close Watcher", "error", err) - return err + errs = append(errs, fmt.Errorf("watcher close: %w", err)) } - return nil + return errors.Join(errs...) }This ensures both resources are always closed, and any errors are properly reported.
Based on coding guidelines requiring proper shutdown handling with context cancellation.
fault-remediation/pkg/reconciler/reconciler_e2e_test.go (1)
237-241: Manager start goroutine lacks readiness synchronization.The manager is started in a goroutine at line 238, but there's no synchronization to ensure it's ready before
m.Run()executes at line 250. Tests may start before the manager and reconciler are fully initialized, leading to flaky test behavior.Consider adding a synchronization mechanism to wait for the manager cache to sync:
+ // Wait for manager to be ready + go func() { + if err := mgr.Start(testContext); err != nil { + log.Printf("Failed to start the test environment manager: %v", err) + } + }() + + // Wait for caches to sync + cacheSynced := mgr.GetCache().WaitForCacheSync(testContext) + if !cacheSynced { + log.Fatalf("Failed to sync manager cache") + } - go func() { - if err := mgr.Start(testContext); err != nil { - log.Fatalf("Failed to start the test environment manager: %v", err) - } - }()This ensures the manager is ready before tests run. Additionally, consider replacing
log.Fatalfwithlog.Printfin the goroutine to avoid abrupt process termination without cleanup.Based on coding guidelines requiring proper test setup and synchronization.
π§Ή Nitpick comments (1)
fault-remediation/pkg/reconciler/reconciler.go (1)
205-224: Manual retry logic could be replaced with controller-runtime retries.The manual retry loop with
UpdateMaxRetriesandUpdateRetryDelayduplicates functionality that controller-runtime provides. The TODO at line 205 acknowledges this. By returning errors fromperformRemediation, controller-runtime can handle retries automatically with exponential backoff.Consider returning errors and letting controller-runtime handle retries in a follow-up:
// Remove manual retry loop success, crName := r.config.RemediationClient.CreateMaintenanceResource(ctx, healthEventData) if !success { processingErrors.WithLabelValues("cr_creation_failed", nodeName).Inc() return false, "", fmt.Errorf("failed to create maintenance resource for node %s", nodeName) } return true, crName, nilThen propagate the error in the caller (
handleRemediationEventline 325) instead of ignoring it.Based on past review discussions indicating this refactoring is planned for a future iteration.
π Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
π Files selected for processing (4)
distros/kubernetes/nvsentinel/charts/fault-remediation/templates/deployment.yaml(1 hunks)fault-remediation/main.go(3 hunks)fault-remediation/pkg/reconciler/reconciler.go(17 hunks)fault-remediation/pkg/reconciler/reconciler_e2e_test.go(44 hunks)
π§° Additional context used
π Path-based instructions (2)
**/*.go
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging vialog/slogin Go code
Wrap errors with context usingfmt.Errorf("context: %w", err)in Go code
Withinretry.RetryOnConflictblocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such assyncedoverokfor cache sync checks
Useclient-gofor Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types
Files:
fault-remediation/main.gofault-remediation/pkg/reconciler/reconciler_e2e_test.gofault-remediation/pkg/reconciler/reconciler.go
**/*_test.go
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/*_test.go: Useenvtestfor testing Kubernetes controllers instead of fake clients
Usetestify/assertandtestify/requirefor assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format:TestFunctionName_Scenario_ExpectedBehavior
Files:
fault-remediation/pkg/reconciler/reconciler_e2e_test.go
π§ Learnings (8)
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Implement proper shutdown handling with context cancellation in Go code
Applied to files:
fault-remediation/main.gofault-remediation/pkg/reconciler/reconciler_e2e_test.gofault-remediation/pkg/reconciler/reconciler.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Prefer informers over direct API calls for watching Kubernetes resources
Applied to files:
fault-remediation/main.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/values.yaml : Note truthy value requirements in Helm chart documentation where applicable
Applied to files:
distros/kubernetes/nvsentinel/charts/fault-remediation/templates/deployment.yaml
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use meaningful variable names such as `synced` over `ok` for cache sync checks
Applied to files:
fault-remediation/pkg/reconciler/reconciler_e2e_test.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients
Applied to files:
fault-remediation/pkg/reconciler/reconciler_e2e_test.go
π Learning: 2025-11-06T16:18:09.952Z
Learnt from: KaivalyaMDabhadkar
Repo: NVIDIA/NVSentinel PR: 280
File: tests/fault_management_test.go:334-345
Timestamp: 2025-11-06T16:18:09.952Z
Learning: In the fault-quarantine component, the "quarantinedNodeUncordonedManually" annotation is set to the string literal "True" (with uppercase T), defined as the constant QuarantinedNodeUncordonedManuallyAnnotationValue in fault-quarantine/pkg/common/common.go. Tests should compare against "True", not "true".
Applied to files:
fault-remediation/pkg/reconciler/reconciler_e2e_test.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Within `retry.RetryOnConflict` blocks, return errors without wrapping to preserve retry behavior
Applied to files:
fault-remediation/pkg/reconciler/reconciler.go
π Learning: 2025-12-01T17:53:20.067Z
Learnt from: jtschelling
Repo: NVIDIA/NVSentinel PR: 455
File: docs/designs/013-remediation-plugins.md:162-204
Timestamp: 2025-12-01T17:53:20.067Z
Learning: Code examples in design documents (docs/designs/) in the NVSentinel repository are illustrative stubs meant to show intent and architectural concepts, and do not need to be 100% correct or complete implementations (e.g., undefined variables, missing error handling, incomplete struct initialization are acceptable).
Applied to files:
fault-remediation/pkg/reconciler/reconciler.go
𧬠Code graph analysis (3)
fault-remediation/main.go (5)
node-drainer/pkg/config/config.go (1)
Duration(36-38)fault-remediation/pkg/initializer/init.go (3)
InitializationParams(34-39)InitializeAll(46-135)Components(41-43)fault-remediation/pkg/reconciler/reconciler.go (1)
FaultRemediationReconciler(63-71)commons/pkg/server/server.go (4)
NewServer(375-396)WithPort(151-153)WithPrometheusMetricsCtrlRuntime(227-234)WithSimpleHealth(246-254)health-monitors/kubernetes-object-monitor/pkg/annotations/manager.go (1)
NewManager(41-43)
fault-remediation/pkg/reconciler/reconciler_e2e_test.go (4)
fault-remediation/pkg/reconciler/reconciler.go (3)
FaultRemediationReconciler(63-71)ReconcilerConfig(41-50)NewFaultRemediationReconciler(73-88)data-models/pkg/model/health_event_extentions.go (1)
HealthEventStatus(45-51)health-monitors/kubernetes-object-monitor/pkg/annotations/manager.go (1)
NewManager(41-43)store-client/pkg/datastore/types.go (1)
EventWithToken(101-104)
fault-remediation/pkg/reconciler/reconciler.go (5)
store-client/pkg/storewatcher/watch_store.go (1)
ChangeStreamWatcher(43-57)store-client/pkg/datastore/interfaces.go (1)
HealthEventStore(63-87)fault-remediation/pkg/reconciler/annotation.go (1)
NodeAnnotationManagerInterface(38-43)data-models/pkg/model/health_event_extentions.go (2)
HealthEventStatus(45-51)Status(23-23)data-models/pkg/protos/health_event.pb.go (1)
RecommendedAction_NONE(43-43)
Can I see progress or logs to make sure its running anywhere? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 7
Caution
Some comments are outside the diff and canβt be posted inline due to platform limitations.
β οΈ Outside diff range comments (1)
fault-remediation/pkg/reconciler/reconciler.go (1)
15-15: Add package-level godoc comment.Per coding guidelines, all Go packages require a package-level godoc comment.
As per coding guidelines, package-level godoc is required for all Go packages.
Add a comment above the package declaration:
+// Package reconciler implements the fault remediation reconciliation logic, +// supporting both standalone and controller-runtime managed execution modes. package reconciler
β»οΈ Duplicate comments (1)
fault-remediation/pkg/reconciler/reconciler.go (1)
339-344: Inconsistent error handling: MarkProcessed failure returns nil.When
MarkProcessedfails at line 339, the function returnsctrl.Result{}, nilinstead of propagating the error. This is inconsistent with similar error paths at lines 293, 316, and 266/272 which return the error.This means controller-runtime won't be aware of the failure and won't apply retry backoff, potentially causing the event to be reprocessed without proper error signaling.
Based on past review comments, this behavior is intentional to retain current behavior for now, with error handling improvements deferred to a follow-up. However, for consistency and proper ctrl-runtime integration, consider updating this in the follow-up:
if err = watcherInstance.MarkProcessed(ctx, eventWithToken.ResumeToken); err != nil { processingErrors.WithLabelValues("mark_processed_error", nodeName).Inc() slog.Error("Error updating resume token", "error", err) - return ctrl.Result{}, nil + return ctrl.Result{}, err }
π§Ή Nitpick comments (2)
fault-remediation/pkg/reconciler/reconciler.go (2)
206-206: TODO comment should reference an issue.Per coding guidelines, TODO comments in Go code should reference issues.
As per coding guidelines, TODO comments should reference issues in Go code.
If an issue exists for this work, please add the reference:
- //TODO: return error to use built in ctrl runtime retries + // TODO(#issue_number): return error to use built in ctrl runtime retriesIf no issue exists yet, consider creating one to track the planned ctrl-runtime retry integration.
472-475: Consider using context from caller instead of context.Background().The error path uses
context.Background()forMarkProcessedcalls. If the main context is cancelled, this may cause the MarkProcessed operation to continue running.If the intent is to ensure MarkProcessed completes even on cancellation, add a brief comment explaining this. Otherwise, consider using the context from the caller:
- if markErr := watcherInstance.MarkProcessed(context.Background(), eventWithToken.ResumeToken); markErr != nil { + if markErr := watcherInstance.MarkProcessed(ctx, eventWithToken.ResumeToken); markErr != nil {Note: Similar pattern also appears at line 486.
π Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
π Files selected for processing (1)
fault-remediation/pkg/reconciler/reconciler.go(17 hunks)
π§° Additional context used
π Path-based instructions (1)
**/*.go
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging vialog/slogin Go code
Wrap errors with context usingfmt.Errorf("context: %w", err)in Go code
Withinretry.RetryOnConflictblocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such assyncedoverokfor cache sync checks
Useclient-gofor Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types
Files:
fault-remediation/pkg/reconciler/reconciler.go
π§ Learnings (3)
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Within `retry.RetryOnConflict` blocks, return errors without wrapping to preserve retry behavior
Applied to files:
fault-remediation/pkg/reconciler/reconciler.go
π Learning: 2025-12-01T17:53:20.067Z
Learnt from: jtschelling
Repo: NVIDIA/NVSentinel PR: 455
File: docs/designs/013-remediation-plugins.md:162-204
Timestamp: 2025-12-01T17:53:20.067Z
Learning: Code examples in design documents (docs/designs/) in the NVSentinel repository are illustrative stubs meant to show intent and architectural concepts, and do not need to be 100% correct or complete implementations (e.g., undefined variables, missing error handling, incomplete struct initialization are acceptable).
Applied to files:
fault-remediation/pkg/reconciler/reconciler.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Implement proper shutdown handling with context cancellation in Go code
Applied to files:
fault-remediation/pkg/reconciler/reconciler.go
π Additional comments (1)
fault-remediation/pkg/reconciler/reconciler.go (1)
81-88: Consider initializing or documenting the nil Client field.The embedded
client.Clientfield is not initialized in this constructor. If methods onFaultRemediationReconcilerattempt to use the Client before it's set, this will cause a nil pointer dereference. Verify whether the Client is injected viaSetupWithManageror another initialization mechanism, and either accept it as a constructor parameter, provide a setter method, or document the injection requirement in the godoc.
3104f00 to
1a4c25c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
π§Ή Nitpick comments (1)
fault-remediation/pkg/reconciler/reconciler.go (1)
331-331: Consider logging runLogCollector errors for observability.The error return from
runLogCollectoris currently discarded. While log collection is auxiliary and its failure shouldn't block remediation, consider logging the error for debugging purposes:- _ = r.runLogCollector(ctx, healthEvent) + if err := r.runLogCollector(ctx, healthEvent); err != nil { + slog.Warn("Log collector failed but continuing with remediation", "node", healthEvent.NodeName, "error", err) + }
π Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
π Files selected for processing (2)
.github/workflows/lint-test.yml(1 hunks)fault-remediation/pkg/reconciler/reconciler.go(17 hunks)
π§° Additional context used
π Path-based instructions (1)
**/*.go
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging vialog/slogin Go code
Wrap errors with context usingfmt.Errorf("context: %w", err)in Go code
Withinretry.RetryOnConflictblocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such assyncedoverokfor cache sync checks
Useclient-gofor Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types
Files:
fault-remediation/pkg/reconciler/reconciler.go
π§ Learnings (6)
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Within `retry.RetryOnConflict` blocks, return errors without wrapping to preserve retry behavior
Applied to files:
fault-remediation/pkg/reconciler/reconciler.go
π Learning: 2025-12-01T17:53:20.067Z
Learnt from: jtschelling
Repo: NVIDIA/NVSentinel PR: 455
File: docs/designs/013-remediation-plugins.md:162-204
Timestamp: 2025-12-01T17:53:20.067Z
Learning: Code examples in design documents (docs/designs/) in the NVSentinel repository are illustrative stubs meant to show intent and architectural concepts, and do not need to be 100% correct or complete implementations (e.g., undefined variables, missing error handling, incomplete struct initialization are acceptable).
Applied to files:
fault-remediation/pkg/reconciler/reconciler.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Implement proper shutdown handling with context cancellation in Go code
Applied to files:
fault-remediation/pkg/reconciler/reconciler.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Function comments required for all exported Go functions
Applied to files:
fault-remediation/pkg/reconciler/reconciler.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : TODO comments should reference issues in Go code
Applied to files:
fault-remediation/pkg/reconciler/reconciler.go
π Learning: 2025-10-29T15:37:49.210Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 160
File: health-events-analyzer/pkg/reconciler/metrics.go:52-58
Timestamp: 2025-10-29T15:37:49.210Z
Learning: In health-events-analyzer/pkg/reconciler/metrics.go, the ruleMatchedTotal metric should include both "rule_name" and "node_name" labels to identify which rule triggered on which node and how many times, as per the project's observability requirements.
Applied to files:
fault-remediation/pkg/reconciler/reconciler.go
π Additional comments (9)
.github/workflows/lint-test.yml (1)
18-20: Workflow trigger addition is sound.The new
pull_requesttrigger withbranches: '*'correctly enables lint-test to run for all PRs. The wildcard pattern is valid and idiomatic in GitHub Actions. Existing conditional logic (line 200) properly routes pull_request events to the coverage-report job.Please confirm this trigger scope aligns with your repository's CI/CD policies. If stricter branch filtering is needed (e.g., PRs targeting only main or specific branches), you can replace
'*'with a more specific pattern.fault-remediation/pkg/reconciler/reconciler.go (8)
17-40: LGTM! Clean controller-runtime integration imports.The new imports for controller-runtime (
ctrl,builder,client,event,handler,source) and thenvstoreclientalias are well-organized and support the refactoring objectives.
64-93: LGTM! Well-structured reconciler with clear dependencies.The
FaultRemediationReconcilerstruct properly embedsclient.Clientfor controller-runtime integration, and the constructor follows standard dependency injection patterns. Godoc comments have been properly added.
95-133: LGTM! Clean reconciliation flow with proper delegation.The
Reconcilemethod follows controller-runtime conventions, with proper context handling, metrics tracking, and delegation to specialized handlers for cancellation and remediation events.
195-255: Manual retry loop tracked for future refactoring.The TODO comment at line 213 correctly notes that this retry logic should eventually be replaced with controller-runtime's built-in retry mechanism. Per past review discussions, this is deferred to a follow-up PR to keep changes manageable.
354-451: LGTM! Status update and CR check logic is sound.The
updateNodeRemediatedStatusmethod includes a manual retry loop (lines 377-390) which is consistent with the pattern inperformRemediationand tracked for future refactoring. ThecheckExistingCRStatusmethod properly validates state and delegates to the status checker.
455-505: LGTM! Robust event parsing with proper error classification.The
parseHealthEventmethod correctly uses shared parsing utilities, classifies errors for metrics, and marks unparseable events as processed to prevent reprocessing. The use ofcontext.Background()forMarkProcessedon error paths (lines 479, 493) is appropriate to ensure cleanup even during cancellation.
507-560: LGTM! Excellent lifecycle management with proper context handling.The lifecycle methods are well-implemented:
StartWatcherStreamprovides a clean entry point for non-controller-runtime modeCloseAllproperly aggregates errors usingerrors.Jointo ensure all cleanup attempts are madeSetupWithManagercorrectly uses the provided context (addressing previous review feedback) and follows controller-runtime patterns for typed controllers
562-589: LGTM! Clean channel adapter with proper lifecycle management.The
AdaptEventsfunction correctly transforms datastore events into controller-runtime generic events, with proper handling of context cancellation, channel closure, and goroutine cleanup viadefer.
9a0d94a to
1a4c25c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
Caution
Some comments are outside the diff and canβt be posted inline due to platform limitations.
β οΈ Outside diff range comments (3)
fault-remediation/pkg/reconciler/reconciler_e2e_test.go (1)
96-169: GuardMockHealthEventStore.updateCalledagainst concurrent access
updateCalledis incremented insideUpdateHealthEventStatusFn(called from the reconciler running under the manager goroutines) and read/reset directly from tests (e.g.,mockStore.updateCalled = 0,updateCountBefore := mockStore.updateCalled). Without synchronization this is a data race and can produce flaky assertions under concurrent processing or when run with-race.Consider switching to an atomic counter or adding a mutex and helper methods:
type MockHealthEventStore struct { - UpdateHealthEventStatusFn func(ctx context.Context, id string, status datastore.HealthEventStatus) error - updateCalled int + UpdateHealthEventStatusFn func(ctx context.Context, id string, status datastore.HealthEventStatus) error + mu sync.Mutex + updateCalled int } // UpdateHealthEventStatus updates a health event status (mock implementation) func (m *MockHealthEventStore) UpdateHealthEventStatus(ctx context.Context, id string, status datastore.HealthEventStatus) error { if m.UpdateHealthEventStatusFn != nil { return m.UpdateHealthEventStatusFn(ctx, id, status) } - return nil + m.mu.Lock() + m.updateCalled++ + m.mu.Unlock() + return nil } + +func (m *MockHealthEventStore) ResetUpdateCalled() { + m.mu.Lock() + defer m.mu.Unlock() + m.updateCalled = 0 +} + +func (m *MockHealthEventStore) UpdateCalled() int { + m.mu.Lock() + defer m.mu.Unlock() + return m.updateCalled }Then update tests to use
ResetUpdateCalled()/UpdateCalled()instead of touching the field directly.Also applies to: 741-801, 988-989
fault-remediation/pkg/reconciler/reconciler.go (2)
176-255: Documented TODO inperformRemediationshould reference an issue, and retry loop is a good future ctrl-runtime cleanup candidateThe current
performRemediationimplementation preserves existing behavior by:
- Manually retrying
CreateMaintenanceResourceup toUpdateMaxRetrieswithUpdateRetryDelay.- Updating node labels to remediating / succeeded / failed and emitting metrics.
The new
//TODO: return error to use built in ctrl runtime retriesis helpful, but per guidelines it should reference a tracking issue (e.g.// TODO(<issue-id>): ...) so this behavior change isnβt forgotten.Given ctrl-runtime already handles backoff/retries on returned errors, this manual loop is a natural future simplification: once youβre ready to change behavior, you can drop the loop and return errors directly to ctrl-runtime instead of sleeping in-place.
Also applies to: 354-401
258-282: Use request context forMarkProcessedin cancellation pathIn
handleCancellationEvent, you correctly usectxfor annotation cleanup, but still call:if err := watcherInstance.MarkProcessed(context.Background(), resumeToken); err != nil {This bypasses cancellation/timeouts from the reconcilerβs context and is inconsistent with other call sites that use
ctxwhen marking events processed.Switching to
ctxhere would align behavior and avoid potentially hangingMarkProcessedcalls during shutdown:- if err := watcherInstance.MarkProcessed(context.Background(), resumeToken); err != nil { + if err := watcherInstance.MarkProcessed(ctx, resumeToken); err != nil {
β»οΈ Duplicate comments (2)
health-monitors/syslog-health-monitor/go.mod (1)
31-42: Critical: Malformed subpackage entriesβconsolidate to module path.Lines 32β42 incorrectly list subpackages (
cmdutils,conv,fileutils, etc.) as separatego.modentries. In Go's module system, only the module pathfd.axjsq.dpdns.org/go-openapi/swagshould appear; subpackages are internal to that module and must not be listed separately. Consolidate to a single entry.This is a duplicate of a previously flagged issue that remains unresolved.
Run
go mod tidyto repair invalid subpackage entries:#!/bin/bash # Navigate to the module and run go mod tidy to consolidate subpackage entries cd health-monitors/syslog-health-monitor go mod tidy # Verify the swag entry is consolidated grep -n "go-openapi/swag" go.modfault-remediation/main.go (1)
117-157: Nonβcontroller-runtime event loop still ignores context cancellationIn
setupNonCtrlRuntimeManaged, the goroutine:components.FaultRemediationReconciler.StartWatcherStream(gCtx) slog.Info("Listening for events on the channel...") for event := range components.FaultRemediationReconciler.Watcher.Events() { slog.Info("Event received", "event", event) _, _ = components.FaultRemediationReconciler.Reconcile(gCtx, &event) }will block until the events channel is closed. If
gCtx/ctxis cancelled (SIGINT/SIGTERM or shutdown path) but the watcher does not close its channel promptly, this goroutine will hang and delay graceful shutdown, contrary to the context-cancellation guideline.Consider selecting on both the event channel and
gCtx.Done():- slog.Info("Listening for events on the channel...") - for event := range components.FaultRemediationReconciler.Watcher.Events() { - slog.Info("Event received", "event", event) - _, _ = components.FaultRemediationReconciler.Reconcile(gCtx, &event) - } + slog.Info("Listening for events on the channel...") + eventCh := components.FaultRemediationReconciler.Watcher.Events() + for { + select { + case <-gCtx.Done(): + slog.Info("Context cancelled, stopping event processing") + return nil + case event, ok := <-eventCh: + if !ok { + return nil + } + slog.Info("Event received", "event", event) + _, _ = components.FaultRemediationReconciler.Reconcile(gCtx, &event) + } + }This ensures the loop exits promptly on cancellation even if the watcher channel remains open.
Also applies to: 144-154
π§Ή Nitpick comments (8)
fault-remediation/pkg/initializer/init.go (2)
70-80: Align TODO with guidelines by referencing an issue or converting to a regular comment
//TODO: replace this with manager clientis useful context, but per the Go coding guidelines here, TODO comments should reference an issue (for example,// TODO(#1234): ...). Please either link this to a tracking issue or convert it into a plain explanatory comment if thereβs no concrete follow-up item yet.
99-117: Centralize the"HealthEvents"collection name to avoid future driftThe new datastore/watch initialization looks good, but
CollectionName: "HealthEvents"is hard-coded here while the rest of the datastore API is strongly typed. If thereβs already a shared constant (e.g., in data-models or store-client) for the HealthEvents collection, prefer using it here to avoid silent breakage if the collection name ever changes.Also applies to: 131-134
fault-remediation/pkg/reconciler/reconciler_test.go (1)
991-1009: Consider standardizing onNewFaultRemediationReconcilerin tests
TestLogCollectorOnlyCalledWhenShouldCreateCRstill usesNewReconciler(cfg, false)while the rest of this file has migrated toNewFaultRemediationReconciler. IfNewReconcileris now just a thin wrapper around the new constructor, consider updating this test to callNewFaultRemediationReconcilerdirectly for consistency and to keep tests aligned with the primary public entry point.tests/fault_remediation_test.go (1)
28-73: Test name spelling fix looks goodThe rename corrects the typo and keeps the existing behavior intact. If you touch this again, consider renaming to follow the
TestFunction_Scenario_ExpectedBehaviorstyle used elsewhere.fault-remediation/pkg/reconciler/reconciler_e2e_test.go (1)
171-255: Ctrl-runtime envtest wiring and reconciler usage look solidThe updated
TestMain,createTestRemediationClient, and the integration tests correctly:
- Build a real
FaultRemediationClientagainst envtesttestRestConfig/testDynamic.- Construct
ReconcilerConfigwithStateManagerand pass it intoNewFaultRemediationReconciler.- Wire the reconciler into a controller-runtime
ManagerviaSetupWithManagerwhile still using explicit struct literals in tests that exercise methods directly.The overall structure reflects the new ctrl-runtimeβaware reconciler API without introducing obvious correctness issues in the tests.
Also applies to: 288-345, 347-428, 472-488, 546-558, 603-627, 644-707, 740-895, 897-927, 986-1058, 1060-1139, 1141-1223, 1335-1356
fault-remediation/main.go (1)
41-62: Controller-runtime path, leader election gating, and flag plumbing look correct
- Rejecting
--leader-electwhen--controller-runtimeis disabled prevents invalid combinations.setupCtrlRuntimeManagementcorrectly:
- Uses the shared
ctx(with signal handling) formgr.Start.- Configures metrics, health probes, and leader-election timings from flags.
- Delegates reconciler wiring via
SetupWithManager.- Flag parsing cleanly separates controller-runtime options and reuses the same
metrics-address/health-addressflags across both modes.The structure meets the shutdown-handling and ctrl-runtime integration goals.
Also applies to: 74-115, 159-202, 204-267
fault-remediation/pkg/reconciler/reconciler.go (2)
95-133: Reconcile flow and unsupported-action handling look correct
Reconciledelegates parsing, cancellation vs remediation, and metrics without leaking parse errors back to ctrl-runtime (parse errors are logged + metrics, then dropped), which avoids hot loops on bad events.shouldSkipEventcleanly handles:
- NONE actions.
- Already remediated events.
- Unsupported actions, where it logs, increments
totalUnsupportedRemediationActions, and sets the remediation-failed node label viaStateManager.This matches the desired behavior of dropping unprocessable events while keeping node-level state/metrics accurate.
Also applies to: 135-174
531-560: Ctrl-runtimeSetupWithManagerandAdaptEventsintegration looks good
SetupWithManagerstarts the watcher with the providedctxand wires the reconciler into controller-runtime using a typed channel source and rate-limiting workqueue, which aligns with best practices for channel-driven reconcilers.AdaptEventscorrectly:
- Listens on the input channel until either
ctx.Done()or the channel is closed.- Sends a per-iteration copy (
eventOut) so eachTypedGenericEventholds a stable pointer.- Closes the output channel on exit.
This provides a clean bridge between the datastore event stream and controller-runtimeβs reconciliation loop.
Also applies to: 562-589
π Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
β Files ignored due to path filters (5)
commons/go.sumis excluded by!**/*.sumevent-exporter/go.sumis excluded by!**/*.sumfault-remediation/go.sumis excluded by!**/*.sumhealth-events-analyzer/go.sumis excluded by!**/*.sumhealth-monitors/syslog-health-monitor/go.sumis excluded by!**/*.sum
π Files selected for processing (18)
commons/go.mod(2 hunks)commons/pkg/server/server.go(2 hunks)distros/kubernetes/nvsentinel/charts/fault-remediation/templates/clusterrole.yaml(2 hunks)distros/kubernetes/nvsentinel/charts/fault-remediation/templates/deployment.yaml(1 hunks)distros/kubernetes/nvsentinel/values-full.yaml(1 hunks)distros/kubernetes/nvsentinel/values-tilt.yaml(2 hunks)event-exporter/go.mod(1 hunks)fault-remediation/go.mod(2 hunks)fault-remediation/main.go(3 hunks)fault-remediation/pkg/initializer/init.go(5 hunks)fault-remediation/pkg/reconciler/metrics.go(4 hunks)fault-remediation/pkg/reconciler/reconciler.go(17 hunks)fault-remediation/pkg/reconciler/reconciler_e2e_test.go(44 hunks)fault-remediation/pkg/reconciler/reconciler_test.go(15 hunks)fault-remediation/pkg/reconciler/remediation.go(1 hunks)health-events-analyzer/go.mod(1 hunks)health-monitors/syslog-health-monitor/go.mod(1 hunks)tests/fault_remediation_test.go(1 hunks)
π§ Files skipped from review as they are similar to previous changes (5)
- fault-remediation/pkg/reconciler/remediation.go
- distros/kubernetes/nvsentinel/values-tilt.yaml
- commons/go.mod
- distros/kubernetes/nvsentinel/charts/fault-remediation/templates/deployment.yaml
- distros/kubernetes/nvsentinel/charts/fault-remediation/templates/clusterrole.yaml
π§° Additional context used
π Path-based instructions (3)
**/*.go
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging vialog/slogin Go code
Wrap errors with context usingfmt.Errorf("context: %w", err)in Go code
Withinretry.RetryOnConflictblocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such assyncedoverokfor cache sync checks
Useclient-gofor Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types
Files:
tests/fault_remediation_test.gofault-remediation/pkg/reconciler/metrics.gofault-remediation/pkg/initializer/init.gofault-remediation/main.gofault-remediation/pkg/reconciler/reconciler_e2e_test.gofault-remediation/pkg/reconciler/reconciler_test.gocommons/pkg/server/server.gofault-remediation/pkg/reconciler/reconciler.go
**/*_test.go
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/*_test.go: Useenvtestfor testing Kubernetes controllers instead of fake clients
Usetestify/assertandtestify/requirefor assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format:TestFunctionName_Scenario_ExpectedBehavior
Files:
tests/fault_remediation_test.gofault-remediation/pkg/reconciler/reconciler_e2e_test.gofault-remediation/pkg/reconciler/reconciler_test.go
**/go.mod
π CodeRabbit inference engine (.github/copilot-instructions.md)
Use
go.modfor each service as a separate Go module with semantic import versioning
Files:
event-exporter/go.modhealth-events-analyzer/go.modfault-remediation/go.modhealth-monitors/syslog-health-monitor/go.mod
π§ Learnings (15)
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Name Go tests descriptively using format: `TestFunctionName_Scenario_ExpectedBehavior`
Applied to files:
tests/fault_remediation_test.go
π Learning: 2025-10-29T15:37:49.210Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 160
File: health-events-analyzer/pkg/reconciler/metrics.go:52-58
Timestamp: 2025-10-29T15:37:49.210Z
Learning: In health-events-analyzer/pkg/reconciler/metrics.go, the ruleMatchedTotal metric should include both "rule_name" and "node_name" labels to identify which rule triggered on which node and how many times, as per the project's observability requirements.
Applied to files:
fault-remediation/pkg/reconciler/metrics.gofault-remediation/pkg/reconciler/reconciler.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/go.mod : Use `go.mod` for each service as a separate Go module with semantic import versioning
Applied to files:
event-exporter/go.modhealth-events-analyzer/go.modfault-remediation/go.modhealth-monitors/syslog-health-monitor/go.mod
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Keep Go dependencies minimal and up-to-date
Applied to files:
event-exporter/go.modhealth-events-analyzer/go.modfault-remediation/go.modhealth-monitors/syslog-health-monitor/go.mod
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use `client-go` for Kubernetes API interactions in Go code
Applied to files:
event-exporter/go.modhealth-events-analyzer/go.modhealth-monitors/syslog-health-monitor/go.mod
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Prefer informers over direct API calls for watching Kubernetes resources
Applied to files:
event-exporter/go.modhealth-events-analyzer/go.modfault-remediation/main.gofault-remediation/go.modhealth-monitors/syslog-health-monitor/go.mod
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Use `commons/` for shared utilities across Go modules
Applied to files:
health-events-analyzer/go.modhealth-monitors/syslog-health-monitor/go.mod
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Implement proper shutdown handling with context cancellation in Go code
Applied to files:
fault-remediation/main.gofault-remediation/pkg/reconciler/reconciler_e2e_test.gofault-remediation/pkg/reconciler/reconciler.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use meaningful variable names such as `synced` over `ok` for cache sync checks
Applied to files:
fault-remediation/pkg/reconciler/reconciler_e2e_test.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients
Applied to files:
fault-remediation/pkg/reconciler/reconciler_e2e_test.go
π Learning: 2025-11-06T16:18:09.952Z
Learnt from: KaivalyaMDabhadkar
Repo: NVIDIA/NVSentinel PR: 280
File: tests/fault_management_test.go:334-345
Timestamp: 2025-11-06T16:18:09.952Z
Learning: In the fault-quarantine component, the "quarantinedNodeUncordonedManually" annotation is set to the string literal "True" (with uppercase T), defined as the constant QuarantinedNodeUncordonedManuallyAnnotationValue in fault-quarantine/pkg/common/common.go. Tests should compare against "True", not "true".
Applied to files:
fault-remediation/pkg/reconciler/reconciler_e2e_test.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : TODO comments should reference issues in Go code
Applied to files:
health-monitors/syslog-health-monitor/go.modfault-remediation/pkg/reconciler/reconciler.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Within `retry.RetryOnConflict` blocks, return errors without wrapping to preserve retry behavior
Applied to files:
fault-remediation/pkg/reconciler/reconciler.go
π Learning: 2025-12-01T17:53:20.067Z
Learnt from: jtschelling
Repo: NVIDIA/NVSentinel PR: 455
File: docs/designs/013-remediation-plugins.md:162-204
Timestamp: 2025-12-01T17:53:20.067Z
Learning: Code examples in design documents (docs/designs/) in the NVSentinel repository are illustrative stubs meant to show intent and architectural concepts, and do not need to be 100% correct or complete implementations (e.g., undefined variables, missing error handling, incomplete struct initialization are acceptable).
Applied to files:
fault-remediation/pkg/reconciler/reconciler.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Function comments required for all exported Go functions
Applied to files:
fault-remediation/pkg/reconciler/reconciler.go
𧬠Code graph analysis (2)
fault-remediation/pkg/initializer/init.go (3)
fault-remediation/pkg/reconciler/reconciler.go (2)
FaultRemediationReconciler(67-75)NewFaultRemediationReconciler(78-93)store-client/pkg/watcher/factory.go (2)
WatcherConfig(26-38)CreateChangeStreamWatcher(65-87)store-client/pkg/datastore/interfaces.go (1)
HealthEventStore(63-87)
fault-remediation/pkg/reconciler/reconciler.go (5)
store-client/pkg/storewatcher/watch_store.go (1)
ChangeStreamWatcher(43-57)store-client/pkg/datastore/interfaces.go (1)
HealthEventStore(63-87)fault-remediation/pkg/reconciler/annotation.go (1)
NodeAnnotationManagerInterface(38-43)data-models/pkg/model/health_event_extentions.go (4)
HealthEventStatus(45-51)UnQuarantined(34-34)Cancelled(37-37)Status(23-23)data-models/pkg/protos/health_event.pb.go (1)
RecommendedAction_NONE(43-43)
π Additional comments (10)
health-monitors/syslog-health-monitor/go.mod (1)
71-80: Verify necessity of new Kubernetes and controller-runtime dependencies.Lines 71β80 introduce numerous indirect dependencies on
k8s.io/*andsigs.k8s.io/*modules (controller-runtime, client-go, kube-openapi, etc.). Since the PR integrates controller-runtime into fault-remediation, these may be transitive pulls from shared modules likecommonsordata-models. However, verify:
- Whether syslog-health-monitor actually uses any of these, or if they're incidental transitive bloat.
- If these are pulled from commons or data-models, whether they should be direct requires in those modules instead.
Per coding guidelines, keep Go dependencies minimal and up-to-date. Ensure no unnecessary transitive overhead is being carried.
distros/kubernetes/nvsentinel/values-full.yaml (1)
48-62: Global ctrl-runtime toggles and ports look consistentThe updated metricsPort comment plus the new
healthPortandctrlRuntimeEnabledflags are coherent with the dual execution modes described in the PR; defaults and documentation look good.fault-remediation/go.mod (1)
24-95: New indirect deps (fsnotify, btree) align with controller-runtime usageAdding
fsnotifyandgoogle/btreeas indirect dependencies is consistent with the controller-runtime/k8s controller stack youβre using; nothing stands out as problematic here. Just ensurego mod tidyis run periodically so the indirect list reflects actual usage and doesnβt accumulate stale entries.fault-remediation/pkg/reconciler/reconciler_test.go (1)
219-222: Tests updated forNewFaultRemediationReconcilerand newperformRemediationsignature look soundThe migration of tests to
NewFaultRemediationReconciler(nil, nil, nil, cfg, dryRun)and the updated assertions handling(success, crName, err)fromperformRemediationare consistent with the reconcilerβs new API. Using nil for the datastore/watcher/healthEventStore in these unit tests is fine given that the exercised code paths only rely onconfig,RemediationClient, andStateManager.Also applies to: 263-274, 371-381, 428-438, 474-485, 784-803
commons/pkg/server/server.go (1)
52-55: Fix example to use the new ctrl-runtime metrics option and clarify usageThe implementation of
WithPrometheusMetricsCtrlRuntimelooks correct for wiring metrics tocrmetrics.Registry, but the example in the comment still showsWithPrometheusMetrics()instead of the new option. Also, consider adding a brief note thatWithPrometheusMetricsandWithPrometheusMetricsCtrlRuntimeshould not both be used in the same server, sinceServeMux.Handle("/metrics", ...)will panic on duplicate registration.event-exporter/go.mod (1)
16-85: Confirm indirect deps are needed and keepgo.modtidy for event-exporterThe expanded indirect require block (OpenAPI, k8s, controller-runtime, various x/* and sigs.k8s.io/* modules) should be verified to ensure all are actually required. Run
go mod tidyto remove any unnecessary transitive dependencies and keep the manifest lean per Go best practices. If specific indirect versions must be pinned for compatibility, document that rationale.health-events-analyzer/go.mod (1)
27-94: Rungo mod tidyto ensure indirect dependencies are necessaryThe expanded indirect dependency set includes controller-runtime, Kubernetes, OpenAPI, and JSON/YAML libraries. Ensure these modules are actually required by this service and not accumulated cruft. Run
go mod tidyinhealth-events-analyzerto remove unused transitive dependencies and keep the dependency set minimal per project standards.fault-remediation/pkg/reconciler/metrics.go (1)
20-21: Controller-runtime metrics registry integration is correct; address breaking label change and TODO complianceUsing
crmetrics.Registrywithpromauto.With()properly aligns reconciler metrics with the controller-runtime metrics server and the newWithPrometheusMetricsCtrlRuntimehandler option.Critical before merge:
- The label changes for
fault_remediation_unsupported_actions_totalare a breaking change. Since fault-remediation metrics are actively scraped in production (per Helm/Prometheus documentation), existing dashboards and alerts will fail or behave unexpectedly when labels change. Either preserve the old label name for backward compatibility or explicitly call out the breaking change in release notes.- The TODO comment should reference a tracking issue per coding guidelines.
- Verify all processes and binaries importing this metrics package are updated to use the controller-runtime registry server option; anything still using
WithPrometheusMetrics()will not expose these metrics.fault-remediation/pkg/reconciler/reconciler.go (2)
42-52: Reconciler shape and constructor align well with new ctrl-runtime modelThe
FaultRemediationReconcilerstruct andNewFaultRemediationReconcilercleanly capture all dependencies (datastore, watcher, health store, config, annotationManager, dryRun) and embedclient.Clientfor ctrl-runtime use, while still working for non-managed mode.Constructor wiring via
config.RemediationClient.GetAnnotationManager()keeps tests and main in sync without extra pluming.Also applies to: 64-93
284-352: Remediation path and metrics wiring look consistentThe main remediation path in
handleRemediationEvent:
- Respects
shouldSkipEventand marks skipped events processed.- Uses
checkExistingCRStatusfor deduplication, emittingCRStatusSkippedmetrics when reusing an in-progress CR.- Runs the log collector only when actually creating a new CR.
- Invokes
performRemediationandupdateNodeRemediatedStatus, then marks the event processed and bumpsCRStatusCreated.This is consistent with the dedup and metrics expectations described in the tests and maintains the previous semantics while fitting into ctrl-runtimeβs Reconcile signature.
|
looks like the only test that failed is https://github.com/NVIDIA/NVSentinel/actions/runs/19919748936/job/57105903821?pr=449 can you please take a look? |
292c06c to
5004e32
Compare
|
/ok to test 5004e32 |
|
/ok to test 560a318 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
β»οΈ Duplicate comments (4)
health-monitors/syslog-health-monitor/go.mod (1)
32-42: Consolidate go-openapi/swag subpackages into a single module entry.Lines 32β42 incorrectly list subpackage paths (e.g.,
github.com/go-openapi/swag/cmdutils,github.com/go-openapi/swag/conv, etc.) as separate module entries. In Go'sgo.mod, only module paths should appear; subpackages are internal to their parent module and must not be listed separately. Replace these eleven entries with a single line:github.com/go-openapi/swag v0.25.4 // indirect.Run
go mod tidyto automatically repair these entries and remove invalid subpackage paths:go mod tidyAfter running
go mod tidy, verify the file contains only:github.com/go-openapi/swag v0.25.4 // indirectinstead of the current eleven separate subpackage entries.
fault-remediation/main.go (1)
120-162: Nonβctrl runtime event loop doesnβt observe context cancellation directly.
StartWatcherStream(gCtx)gets a cancellable context, but the subsequentfor event := range components.FaultRemediationReconciler.Watcher.Events()will keep blocking until the watcher closes the channel. If the watcher doesnβt close promptly ongCtx.Done(), shutdown can hang ong.Wait(). Consider aselectongCtx.Done()+ the events channel so this goroutine can exit immediately when the context is cancelled.fault-remediation/pkg/reconciler/reconciler_e2e_test.go (1)
171-241: Avoidlog.Fatalfin the manager goroutine and add a simple readiness sync.The goroutine running
mgr.Start(testContext)still useslog.Fatalf, which callsos.Exitfrom a background goroutine and can bypass test cleanup; thereβs also no sync to ensure the manager has started before tests run. Prefer logging the error and failing tests via assertions, and use a small channel ormgr.GetCache().WaitForCacheSyncto signal readiness beforem.Run().fault-remediation/pkg/reconciler/reconciler.go (1)
518-541: Nil-guardingCloseAllwould make it safer for test-constructed reconcilers.
CloseAllunconditionally callsr.ds.Close(ctx)andr.Watcher.Close(ctx). In production these are always non-nil viaInitializeAll, but tests sometimes constructFaultRemediationReconcilerwithdsand/orWatcherset to nil, which would panic ifCloseAllis reused there. Adding simple nil checks around eachClosemakes this helper safer without affecting the main path.func (r *FaultRemediationReconciler) CloseAll(ctx context.Context) error { var errs []error - if err := r.ds.Close(ctx); err != nil { - slog.Error("failed to close datastore", "error", err) - errs = append(errs, err) - } - - if err := r.Watcher.Close(ctx); err != nil { - slog.Error("failed to close Watcher", "error", err) - errs = append(errs, err) - } + if r.ds != nil { + if err := r.ds.Close(ctx); err != nil { + slog.Error("failed to close datastore", "error", err) + errs = append(errs, err) + } + } + + if r.Watcher != nil { + if err := r.Watcher.Close(ctx); err != nil { + slog.Error("failed to close Watcher", "error", err) + errs = append(errs, err) + } + } return errors.Join(errs...) }
π§Ή Nitpick comments (3)
fault-remediation/pkg/reconciler/metrics.go (1)
30-30: Add an issue reference to the TODO per repo guidelinesThe TODO at Line 30 lacks an issue reference, which goes against the stated Go guidelines for TODO comments. Please either link the tracking issue or rephrase as a non-TODO comment.
A minimal change could look like:
- //TODO: evaluate and remove redundant metrics with ctrl-runtime defaults + // TODO(<issue-id>): evaluate and remove redundant metrics with ctrl-runtime defaultsfault-remediation/pkg/initializer/init.go (1)
45-45: Add issue reference to TODO comment.Per coding guidelines, TODO comments should reference GitHub issues. Please update the comment to reference a specific issue number.
-// nolint: cyclop // todo +// nolint: cyclop // TODO(#XXX): Refactor to reduce cyclomatic complexityBased on coding guidelines: "TODO comments should reference issues in Go code"
fault-remediation/main.go (1)
164-208: Controller-runtime manager setup is generally correct; consider scheme registration for future K8s clients.The manager wiring (metrics bind address, health/ready checks, leader election options, and using the same
ctxasrun()) looks good. If you ever start usingmgr.GetClient()or caching typed resources, youβll want to register the core/client-go scheme (e.g.clientgoscheme.AddToScheme(scheme)) rather than relying on an emptyruntime.Scheme.
π Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
β Files ignored due to path filters (5)
commons/go.sumis excluded by!**/*.sumevent-exporter/go.sumis excluded by!**/*.sumfault-remediation/go.sumis excluded by!**/*.sumhealth-events-analyzer/go.sumis excluded by!**/*.sumhealth-monitors/syslog-health-monitor/go.sumis excluded by!**/*.sum
π Files selected for processing (18)
commons/go.mod(2 hunks)commons/pkg/server/server.go(2 hunks)distros/kubernetes/nvsentinel/charts/fault-remediation/templates/clusterrole.yaml(2 hunks)distros/kubernetes/nvsentinel/charts/fault-remediation/templates/deployment.yaml(1 hunks)distros/kubernetes/nvsentinel/values-full.yaml(1 hunks)distros/kubernetes/nvsentinel/values-tilt.yaml(2 hunks)event-exporter/go.mod(1 hunks)fault-remediation/go.mod(2 hunks)fault-remediation/main.go(3 hunks)fault-remediation/pkg/initializer/init.go(5 hunks)fault-remediation/pkg/reconciler/metrics.go(4 hunks)fault-remediation/pkg/reconciler/reconciler.go(17 hunks)fault-remediation/pkg/reconciler/reconciler_e2e_test.go(44 hunks)fault-remediation/pkg/reconciler/reconciler_test.go(16 hunks)fault-remediation/pkg/reconciler/remediation.go(1 hunks)health-events-analyzer/go.mod(1 hunks)health-monitors/syslog-health-monitor/go.mod(1 hunks)tests/fault_remediation_test.go(1 hunks)
π§ Files skipped from review as they are similar to previous changes (5)
- fault-remediation/pkg/reconciler/remediation.go
- distros/kubernetes/nvsentinel/values-full.yaml
- distros/kubernetes/nvsentinel/charts/fault-remediation/templates/clusterrole.yaml
- commons/pkg/server/server.go
- distros/kubernetes/nvsentinel/values-tilt.yaml
π§° Additional context used
π Path-based instructions (3)
**/go.mod
π CodeRabbit inference engine (.github/copilot-instructions.md)
Use
go.modfor each service as a separate Go module with semantic import versioning
Files:
health-events-analyzer/go.modfault-remediation/go.modcommons/go.modhealth-monitors/syslog-health-monitor/go.modevent-exporter/go.mod
**/*.go
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging vialog/slogin Go code
Wrap errors with context usingfmt.Errorf("context: %w", err)in Go code
Withinretry.RetryOnConflictblocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such assyncedoverokfor cache sync checks
Useclient-gofor Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types
Files:
fault-remediation/main.gofault-remediation/pkg/reconciler/reconciler_test.gofault-remediation/pkg/initializer/init.gotests/fault_remediation_test.gofault-remediation/pkg/reconciler/metrics.gofault-remediation/pkg/reconciler/reconciler_e2e_test.gofault-remediation/pkg/reconciler/reconciler.go
**/*_test.go
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/*_test.go: Useenvtestfor testing Kubernetes controllers instead of fake clients
Usetestify/assertandtestify/requirefor assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format:TestFunctionName_Scenario_ExpectedBehavior
Files:
fault-remediation/pkg/reconciler/reconciler_test.gotests/fault_remediation_test.gofault-remediation/pkg/reconciler/reconciler_e2e_test.go
π§ Learnings (16)
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/go.mod : Use `go.mod` for each service as a separate Go module with semantic import versioning
Applied to files:
health-events-analyzer/go.modfault-remediation/go.modcommons/go.modhealth-monitors/syslog-health-monitor/go.modevent-exporter/go.mod
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Keep Go dependencies minimal and up-to-date
Applied to files:
health-events-analyzer/go.modfault-remediation/go.modcommons/go.modhealth-monitors/syslog-health-monitor/go.modevent-exporter/go.mod
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use `client-go` for Kubernetes API interactions in Go code
Applied to files:
health-events-analyzer/go.modcommons/go.modhealth-monitors/syslog-health-monitor/go.modevent-exporter/go.mod
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Use `commons/` for shared utilities across Go modules
Applied to files:
health-events-analyzer/go.modcommons/go.modhealth-monitors/syslog-health-monitor/go.mod
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Prefer informers over direct API calls for watching Kubernetes resources
Applied to files:
health-events-analyzer/go.modfault-remediation/main.gofault-remediation/go.modhealth-monitors/syslog-health-monitor/go.modevent-exporter/go.mod
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Implement proper shutdown handling with context cancellation in Go code
Applied to files:
fault-remediation/main.gofault-remediation/pkg/reconciler/reconciler_e2e_test.gofault-remediation/pkg/reconciler/reconciler.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/values.yaml : Note truthy value requirements in Helm chart documentation where applicable
Applied to files:
distros/kubernetes/nvsentinel/charts/fault-remediation/templates/deployment.yaml
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients
Applied to files:
commons/go.modfault-remediation/pkg/reconciler/reconciler_e2e_test.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Name Go tests descriptively using format: `TestFunctionName_Scenario_ExpectedBehavior`
Applied to files:
tests/fault_remediation_test.go
π Learning: 2025-10-29T15:37:49.210Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 160
File: health-events-analyzer/pkg/reconciler/metrics.go:52-58
Timestamp: 2025-10-29T15:37:49.210Z
Learning: In health-events-analyzer/pkg/reconciler/metrics.go, the ruleMatchedTotal metric should include both "rule_name" and "node_name" labels to identify which rule triggered on which node and how many times, as per the project's observability requirements.
Applied to files:
fault-remediation/pkg/reconciler/metrics.gofault-remediation/pkg/reconciler/reconciler.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : TODO comments should reference issues in Go code
Applied to files:
health-monitors/syslog-health-monitor/go.modfault-remediation/pkg/reconciler/reconciler.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use meaningful variable names such as `synced` over `ok` for cache sync checks
Applied to files:
fault-remediation/pkg/reconciler/reconciler_e2e_test.go
π Learning: 2025-11-06T16:18:09.952Z
Learnt from: KaivalyaMDabhadkar
Repo: NVIDIA/NVSentinel PR: 280
File: tests/fault_management_test.go:334-345
Timestamp: 2025-11-06T16:18:09.952Z
Learning: In the fault-quarantine component, the "quarantinedNodeUncordonedManually" annotation is set to the string literal "True" (with uppercase T), defined as the constant QuarantinedNodeUncordonedManuallyAnnotationValue in fault-quarantine/pkg/common/common.go. Tests should compare against "True", not "true".
Applied to files:
fault-remediation/pkg/reconciler/reconciler_e2e_test.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Within `retry.RetryOnConflict` blocks, return errors without wrapping to preserve retry behavior
Applied to files:
fault-remediation/pkg/reconciler/reconciler.go
π Learning: 2025-12-01T17:53:20.067Z
Learnt from: jtschelling
Repo: NVIDIA/NVSentinel PR: 455
File: docs/designs/013-remediation-plugins.md:162-204
Timestamp: 2025-12-01T17:53:20.067Z
Learning: Code examples in design documents (docs/designs/) in the NVSentinel repository are illustrative stubs meant to show intent and architectural concepts, and do not need to be 100% correct or complete implementations (e.g., undefined variables, missing error handling, incomplete struct initialization are acceptable).
Applied to files:
fault-remediation/pkg/reconciler/reconciler.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Function comments required for all exported Go functions
Applied to files:
fault-remediation/pkg/reconciler/reconciler.go
𧬠Code graph analysis (5)
fault-remediation/main.go (2)
fault-remediation/pkg/initializer/init.go (3)
InitializationParams(34-39)InitializeAll(46-135)Components(41-43)commons/pkg/server/server.go (2)
WithPrometheusMetricsCtrlRuntime(227-234)WithSimpleHealth(246-254)
fault-remediation/pkg/reconciler/reconciler_test.go (2)
fault-remediation/pkg/reconciler/reconciler.go (3)
NewFaultRemediationReconciler(78-93)HealthEventDoc(53-56)ReconcilerConfig(42-51)data-models/pkg/protos/health_event.pb.go (1)
RecommendedAction_NONE(43-43)
fault-remediation/pkg/initializer/init.go (5)
fault-remediation/pkg/reconciler/reconciler.go (2)
FaultRemediationReconciler(67-75)NewFaultRemediationReconciler(78-93)store-client/pkg/datastore/registry.go (1)
NewDataStore(120-123)store-client/pkg/watcher/factory.go (2)
WatcherConfig(26-38)CreateChangeStreamWatcher(65-87)store-client/pkg/datastore/types.go (1)
Pipeline(121-121)store-client/pkg/datastore/interfaces.go (1)
HealthEventStore(63-87)
fault-remediation/pkg/reconciler/reconciler_e2e_test.go (3)
fault-remediation/pkg/reconciler/reconciler.go (3)
FaultRemediationReconciler(67-75)ReconcilerConfig(42-51)NewFaultRemediationReconciler(78-93)fault-remediation/pkg/reconciler/remediation.go (1)
FaultRemediationClient(56-68)store-client/pkg/datastore/types.go (1)
EventWithToken(101-104)
fault-remediation/pkg/reconciler/reconciler.go (7)
store-client/pkg/datastore/types.go (1)
EventWithToken(101-104)store-client/pkg/storewatcher/watch_store.go (1)
ChangeStreamWatcher(43-57)store-client/pkg/datastore/interfaces.go (1)
HealthEventStore(63-87)fault-remediation/pkg/reconciler/annotation.go (1)
NodeAnnotationManagerInterface(38-43)data-models/pkg/model/health_event_extentions.go (4)
HealthEventStatus(45-51)UnQuarantined(34-34)Cancelled(37-37)Status(23-23)commons/pkg/statemanager/statemanager.go (1)
StateManager(197-200)data-models/pkg/protos/health_event.pb.go (1)
RecommendedAction_NONE(43-43)
β° Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (48)
- GitHub Check: ko-build-test (event-exporter, .)
- GitHub Check: ko-build-test (fault-remediation, .)
- GitHub Check: modules-lint-test (janitor)
- GitHub Check: modules-lint-test (tests)
- GitHub Check: ko-build-test (health-monitors/csp-health-monitor, ./cmd/csp-health-monitor)
- GitHub Check: modules-lint-test (store-client)
- GitHub Check: modules-lint-test (node-drainer)
- GitHub Check: modules-lint-test (commons)
- GitHub Check: ko-build-test (health-monitors/csp-health-monitor, ./cmd/maintenance-notifier)
- GitHub Check: modules-lint-test (fault-quarantine)
- GitHub Check: modules-lint-test (fault-remediation)
- GitHub Check: modules-lint-test (metadata-collector)
- GitHub Check: modules-lint-test (labeler)
- GitHub Check: modules-lint-test (health-events-analyzer)
- GitHub Check: modules-lint-test (data-models)
- GitHub Check: modules-lint-test (platform-connectors)
- GitHub Check: ko-build-test (janitor, .)
- GitHub Check: modules-lint-test (event-exporter)
- GitHub Check: simple-lint (scripts, make -C scripts lint, Run shellcheck on scripts)
- GitHub Check: ko-build-test (fault-quarantine, .)
- GitHub Check: ko-build-test (node-drainer, .)
- GitHub Check: ko-build-test (labeler, .)
- GitHub Check: simple-lint (file-server-cleanup, make -C log-collector lint-file-server-cleanup, Run lint, false)
- GitHub Check: ko-build-test (health-events-analyzer, .)
- GitHub Check: ko-build-test (platform-connectors, .)
- GitHub Check: simple-lint (log-collector, make -C log-collector lint-log-collector, Run lint, false)
- GitHub Check: simple-lint (helm-charts, make helm-lint, Validate Helm charts)
- GitHub Check: container-build-test (kubernetes-object-monitor, make -C health-monitors/kubernetes-object-monito...
- GitHub Check: simple-lint (postgres-schema, make validate-postgres-schema, Validate PostgreSQL schema consistency)
- GitHub Check: container-build-test (syslog-health-monitor, make -C health-monitors/syslog-health-monitor docker...
- GitHub Check: simple-lint (gomod, make gomod-lint, Run gomod lint)
- GitHub Check: simple-lint (license-headers, make license-headers-lint, Run license headers check)
- GitHub Check: simple-lint (kubernetes-distro, make kubernetes-distro-lint, Run lint)
- GitHub Check: container-build-test (gpu-health-monitor-dcgm4, make -C health-monitors/gpu-health-monitor docker...
- GitHub Check: container-build-test (log-collector, make -C log-collector docker-build-log-collector)
- GitHub Check: health-monitors-lint-test (gpu-health-monitor, true, true, false)
- GitHub Check: container-build-test (gpu-health-monitor-dcgm3, make -C health-monitors/gpu-health-monitor docker...
- GitHub Check: container-build-test (file-server-cleanup, make -C log-collector docker-build-file-server-cleanup)
- GitHub Check: health-monitors-lint-test (kubernetes-object-monitor)
- GitHub Check: simple-lint (protos, make protos-lint, Run protos lint)
- GitHub Check: tilt-modules-lint-test (tilt/simple-health-client)
- GitHub Check: health-monitors-lint-test (syslog-health-monitor)
- GitHub Check: health-monitors-lint-test (csp-health-monitor)
- GitHub Check: E2E Tests (AMD64 + PostgreSQL)
- GitHub Check: E2E Tests (ARM64 + MongoDB)
- GitHub Check: E2E Tests (ARM64 + PostgreSQL)
- GitHub Check: E2E Tests (AMD64 + MongoDB)
- GitHub Check: CodeQL PR Analysis
π Additional comments (33)
health-monitors/syslog-health-monitor/go.mod (1)
22-80: Clarify the distinction between direct and indirect dependencies.The 27 listed dependencies are marked
// indirect, meaning they are transitive dependencies automatically included by other dependencies when runninggo mod tidy. Indirect dependencies do not need to be directly imported in the codebaseβthey exist because a direct dependency requires them. To verify the module is correctly configured, check that direct dependencies (unmarked by// indirect) are actually imported in the code, and confirm that runninggo mod tidyregenerates the same indirect dependency list.fault-remediation/pkg/reconciler/metrics.go (1)
20-91: Add issue reference to TODO commentThe TODO comment at line 30 must reference a GitHub issue per project guidelines. Update
//TODO: evaluate and remove redundant metrics with ctrl-runtime defaultsto include an issue number, e.g.,//TODO(#XXXX): evaluate and remove redundant metrics with ctrl-runtime defaults.The controller-runtime metrics registry wiring and label sets (including consistent
node_namelabels across all counter vectors and histogram vectors) follow the correct pattern for observability requirements.fault-remediation/go.mod (1)
32-32: LGTM! Expected transitive dependencies.These indirect dependencies (fsnotify and btree) are expected additions from the controller-runtime integration. Both are well-known, stable libraries and are correctly marked as indirect.
Also applies to: 51-51
health-events-analyzer/go.mod (1)
29-93: LGTM! Expected transitive dependencies from commons module.These indirect dependencies are transitively pulled in from the commons module's controller-runtime integration. All are correctly marked as indirect, and this is expected behavior when a dependency adds substantial new functionality like controller-runtime support.
distros/kubernetes/nvsentinel/charts/fault-remediation/templates/deployment.yaml (4)
81-84: LGTM! Controller-runtime flags properly configured.The conditional addition of controller-runtime and leader-election flags aligns with the main.go logic that enforces leader election when controller-runtime is enabled. The Helm template syntax is correct.
88-91: LGTM! Health port configuration is correct.The health port is correctly added only when controller-runtime is enabled, as the controller-runtime manager provides dedicated health endpoints on this port. The default value of 9440 matches the configuration in values-full.yaml.
95-95: LGTM! Liveness probe port correctly switches based on mode.The ternary expression correctly uses the nil-safe pattern and switches between the health port (when controller-runtime is enabled) and metrics port (when disabled). The /healthz path is consistent across both modes.
102-103: LGTM! Readiness probe correctly configured for both modes.The readiness probe correctly uses different paths and ports based on the controller-runtime mode:
- Controller-runtime:
/readyzon health port (standard for controller-runtime)- Non-controller-runtime:
/healthzon metrics portBoth ternary expressions use the nil-safe pattern.
event-exporter/go.mod (1)
18-84: LGTM! Expected transitive dependencies from commons module.Similar to health-events-analyzer, these indirect dependencies are transitively pulled in from the commons module's controller-runtime integration. All are correctly marked as indirect, and this is expected when a shared dependency adds controller-runtime support.
fault-remediation/pkg/initializer/init.go (3)
31-31: LGTM! Watcher import correctly added.The watcher package import is necessary for the new controller-runtime integration and follows proper Go import conventions.
99-116: LGTM! Datastore and watcher initialization properly implemented.The initialization logic correctly:
- Creates the datastore with proper error handling
- Uses the factory pattern for watcher creation
- Wraps errors with context using
fmt.Errorf("context: %w", err)- Retrieves the health event store from the datastore interface
The collection name "HealthEvents" is hardcoded, which is acceptable for now but consider making it configurable if it needs to vary across environments.
132-133: LGTM! Reconciler initialization correctly updated.The constructor call properly passes all required dependencies (datastore, watcher, health event store, config, and dry-run flag) to the new
NewFaultRemediationReconcilerfunction, following proper dependency injection patterns.fault-remediation/pkg/reconciler/reconciler_test.go (3)
219-219: LGTM! Constructor calls correctly updated for unit tests.All test cases properly use the new
NewFaultRemediationReconcilerconstructor with nil values for datastore, watcher, and health event store dependencies. This is appropriate for unit tests that focus on testing reconciliation logic with mocked clients rather than actual datastore interactions.Also applies to: 263-263, 320-320, 371-371, 428-428, 474-474, 501-501, 582-582, 661-661, 692-692, 724-724, 789-789, 864-864, 937-937, 995-995
221-221: LGTM! Field access correctly updated to use unexported fields.The test code correctly accesses the now-unexported fields (
dryRunandconfig) of the reconciler. This change improves encapsulation while maintaining testability through package-level access.Also applies to: 273-273, 588-590, 663-663, 694-694, 730-731
377-379: LGTM! performRemediation error handling correctly updated.The test cases properly handle the updated
performRemediationsignature that now returns an error. All test cases correctly assert on the error return value usingassert.NoError(t, err), improving error handling coverage in tests.Also applies to: 434-436, 481-483
tests/fault_remediation_test.go (1)
28-28: LGTM! Typo corrected in test function name.The function name has been corrected from "Reemdiated" to "Remediated", improving code quality and readability with no functional impact.
commons/go.mod (1)
17-17: No action required. Controller-runtime v0.22.4 is a valid, stable release (published 03 Nov 2024) with no known security advisories. The version follows semantic versioning conventions and includes legitimate improvements (cache SyncPeriod configuration, namespaced List fixes, priority-queue fixes, envtest binary-path handling).fault-remediation/main.go (3)
76-118: Leader-election gating and shutdown wiring look sound.Flag parsing + the
!enableControllerRuntime && enableLeaderElectionguard correctly prevent invalid combinations, and using a single signal-derivedctxfor both initialization andmgr.Start(ctx)avoids the earlier dual-signal issue. The deferredreconciler.CloseAll(ctx)centralizes cleanup for both runtime modes cleanly.
210-285: Flag surface looks coherent with the new modes.The new flags for controller-runtime enablement, leader election, timings, and log-collector behavior are clear and align with the runtime branches in
run(). Once behavior stabilizes, consider documenting expected combinations (e.g., production defaults) in README/Helm values to keep operator UX consistent.
120-135:metrics-addresssemantics differ between ctrl-runtime and non-ctrl paths.In the nonβctrl path,
TrimPrefix(metricsAddr, ":")andAtoiexpect only a bare port (e.g.,":2112"); values like"0.0.0.0:2112"will fail withinvalid metrics port, despite the flag help text indicating "address/port" support. The ctrlβruntime path usesmetricsAddrdirectly as a bind address. Either constrain the flag description to "port" for the nonβctrl path, or parse host:port usingnet.SplitHostPortto consistently support full addresses in both modes.Also applies to: 210-223
fault-remediation/pkg/reconciler/reconciler_e2e_test.go (5)
60-87: Mocks are well-shaped for the new reconciler API.
MockChangeStreamWatcherandMockHealthEventStoreimplement the required interfaces and give you the right hooks (EventsChan,UpdateHealthEventStatusFn,updateCalled) to exercise event-loop and metrics behavior without external dependencies. This keeps the e2e tests focused on controller logic while remaining deterministic.Also applies to: 96-100
288-345:createTestRemediationClientwiring matches production behavior.The helper now returns
(*FaultRemediationClient, error)and wires discovery, RESTMapper, template, annotation manager, andCRStatusCheckerthe same way production code does, with proper error propagation at each step. That makes the integration tests realistic while keeping failure modes explicit.
740-895: Event-loop E2E tests effectively validate ctrl-runtime wiring and state behavior.The new tests that drive events through
mockWatcher.EventsChanand assert on annotations, CR existence, metrics, andMarkProcessedcounts give good coverage of the ctrl-runtime powered event path and deduplication/cancellation semantics. This is a solid foundation for catching regressions as you iterate on the controller-runtime integration.Also applies to: 935-955, 1006-1058, 1066-1139, 1161-1223
1335-1355: Metrics error-path test matches the new parsing behavior.
TestMetrics_ProcessingErrorsnow drives a malformed event through aFaultRemediationReconcilerwith a mock watcher and asserts that theprocessingErrorscounter increases for theunmarshal_doc_errorlabel. This aligns with the newparseHealthEventimplementation and ensures you donβt regress error metric emission.
741-801: Unable to complete verification of this review comment. The repository is inaccessible, preventing inspection of the mockStore definition, UpdateHealthEventStatusFn implementation, goroutine usage patterns, and all mentioned line ranges (741-801, 847-849, 986-1049, 1062-1063, 1143-1144). To verify this concern, please provide access to the codebase or share the relevant code sections from the test file.fault-remediation/pkg/reconciler/reconciler.go (8)
42-52: New reconciler struct and constructor are well-aligned with ctrl-runtime.
FaultRemediationReconcilercleanly encapsulates the datastore, watcher, health store, config, and annotation manager, andNewFaultRemediationReconcilerwiresannotationManagerfromRemediationClientso tests and main share the same construction path. This gives you a clear surface for both standalone and controller-runtime-managed operation.Also applies to: 64-92
95-133: Reconcile + parseHealthEvent behavior for malformed events is coherent.On parse failures you increment
processingErrors, log, mark the event as processed, and haveReconcilereturnnilso controller-runtime doesnβt hot-loop on unprocessable events. The nil-guard forHealthEventand the cancellation vs remediation split vianodeQuarantinedkeep the main decision tree straightforward.Also applies to: 464-516
365-412: Health-event status update helper looks correct and bounded.
updateNodeRemediatedStatusbuilds a minimalHealthEventStatus, optionally setsLastRemediationTimestampon success, and caps retries usingUpdateMaxRetries/UpdateRetryDelay, wrapping the final error with document ID context. This keeps write amplification bounded while still surfacing persistent store failures clearly.
414-462: CR deduplication via annotations and status checker is clear and robust.
checkExistingCRStatuslooks up the equivalence group viacommon.GetRemediationGroupForAction, inspects the annotation, and usesstatusChecker.ShouldSkipCRCreationto decide whether to skip or allow a new CR, cleaning up the annotation group on completion/failure. The fallbacks on annotation or status-checker errors are conservative (βallow creationβ) and well-logged.
543-573: SetupWithManager + StartWatcherStream integrate cleanly with controller-runtime.Calling
Watcher.Start(ctx)and adaptingWatcher.Events()viaAdaptEventsinto aTypedChannelsource, then wiring it into a named controller with a typed workqueue, is an idiomatic way to plug this non-K8s event source into controller-runtime. Since youβre using the samectxthatmgr.Start(ctx)uses, the watcher and adapter goroutine will respect the managerβs lifecycle.
574-602: AdaptEvents implementation is correct and avoids common pointer pitfalls.The adapter goroutine selects on
ctx.Done()and the input channel, closes the output channel on exit, and copies eachEventWithTokeninto a new variable before taking its address, so each enqueued pointer is stable. This should behave well under load while respecting context cancellation.
260-287: Unable to verify: Filefault-remediation/pkg/reconciler/reconciler.gonot found in repositoryThe review comment references lines 260-287 and 290-363 in this file, but the file does not exist in the current repository state. This prevents verification of the claims regarding error handling in
handleCancellationEventandhandleRemediationEvent, specifically whether theMarkProcessedcall inhandleRemediationEventreturnsnilon error or whether adding a clarifying comment is appropriate.
135-174: Review verification inconclusive due to repository access failure.The review comment makes specific factual claims about TODO comments without issue references and manual retry loops with
UpdateMaxRetriesandtime.Sleepin theperformRemediationandupdateNodeRemediatedStatusfunctions (lines 176-257). However, I cannot access the repository to verify these claims directly.To proceed, either provide the code snippets for lines 176-194 and 196-257, or re-run this verification once repository access is available.
Merging this branch will decrease overall coverage
Coverage by fileChanged unit test files
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
Caution
Some comments are outside the diff and canβt be posted inline due to platform limitations.
β οΈ Outside diff range comments (1)
fault-remediation/pkg/reconciler/reconciler_e2e_test.go (1)
104-116: Potential data race onupdateCalledcounter.Unlike
MockChangeStreamWatcher.markProcessedCount, theupdateCalledcounter is not protected by a mutex. SinceUpdateHealthEventStatusmay be called from the reconciler goroutine while tests readupdateCalled(e.g., lines 749, 808, 856), this could cause a data race.Apply this diff to add thread safety:
// MockHealthEventStore provides a mock implementation of datastore.HealthEventStore for testing type MockHealthEventStore struct { + mu sync.Mutex UpdateHealthEventStatusFn func(ctx context.Context, id string, status datastore.HealthEventStatus) error updateCalled int } // UpdateHealthEventStatus updates a health event status (mock implementation) func (m *MockHealthEventStore) UpdateHealthEventStatus(ctx context.Context, id string, status datastore.HealthEventStatus) error { if m.UpdateHealthEventStatusFn != nil { + m.mu.Lock() + m.updateCalled++ + m.mu.Unlock() return m.UpdateHealthEventStatusFn(ctx, id, status) } return nil } + +// GetUpdateCalledCount returns the update call count in a thread-safe manner +func (m *MockHealthEventStore) GetUpdateCalledCount() int { + m.mu.Lock() + defer m.mu.Unlock() + return m.updateCalled +}
β»οΈ Duplicate comments (1)
fault-remediation/pkg/reconciler/reconciler_e2e_test.go (1)
245-249:log.Fatalfin goroutine and missing manager readiness synchronization.This issue was previously flagged. Using
log.Fatalfin a goroutine terminates the process abruptly without cleanup. Additionally, there's no synchronization to ensure the manager's cache is synced before tests run.Consider waiting for cache sync after starting the manager:
go func() { if err := mgr.Start(testContext); err != nil { - log.Fatalf("Failed to start the test environment manager: %v", err) + log.Printf("Failed to start the test environment manager: %v", err) } }() + + // Wait for cache to sync before running tests + if !mgr.GetCache().WaitForCacheSync(testContext) { + log.Fatalf("Failed to sync manager cache") + }
π§Ή Nitpick comments (2)
fault-remediation/pkg/reconciler/reconciler_e2e_test.go (2)
377-380: Consider usingNewFaultRemediationReconcilerconstructor for consistency.Multiple test cases manually construct
FaultRemediationReconcilerwith partial fields. Using the constructor ensures tests stay aligned with production initialization patterns and reduces maintenance burden if the struct evolves.- r := FaultRemediationReconciler{ - config: cfg, - annotationManager: cfg.RemediationClient.GetAnnotationManager(), - } + r := NewFaultRemediationReconciler(nil, nil, nil, cfg, false)
1121-1121: Consider replacingtime.Sleepwith deterministic synchronization.Using
time.Sleepfor synchronization can lead to flaky tests. If possible, userequire.Eventuallyto poll for a verifiable state change, or add a channel/signal mechanism to confirm event processing completion.
π Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
π Files selected for processing (1)
fault-remediation/pkg/reconciler/reconciler_e2e_test.go(43 hunks)
π§° Additional context used
π Path-based instructions (2)
**/*.go
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging vialog/slogin Go code
Wrap errors with context usingfmt.Errorf("context: %w", err)in Go code
Withinretry.RetryOnConflictblocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such assyncedoverokfor cache sync checks
Useclient-gofor Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types
Files:
fault-remediation/pkg/reconciler/reconciler_e2e_test.go
**/*_test.go
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/*_test.go: Useenvtestfor testing Kubernetes controllers instead of fake clients
Usetestify/assertandtestify/requirefor assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format:TestFunctionName_Scenario_ExpectedBehavior
Files:
fault-remediation/pkg/reconciler/reconciler_e2e_test.go
π§ Learnings (4)
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Implement proper shutdown handling with context cancellation in Go code
Applied to files:
fault-remediation/pkg/reconciler/reconciler_e2e_test.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use meaningful variable names such as `synced` over `ok` for cache sync checks
Applied to files:
fault-remediation/pkg/reconciler/reconciler_e2e_test.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients
Applied to files:
fault-remediation/pkg/reconciler/reconciler_e2e_test.go
π Learning: 2025-11-06T16:18:09.952Z
Learnt from: KaivalyaMDabhadkar
Repo: NVIDIA/NVSentinel PR: 280
File: tests/fault_management_test.go:334-345
Timestamp: 2025-11-06T16:18:09.952Z
Learning: In the fault-quarantine component, the "quarantinedNodeUncordonedManually" annotation is set to the string literal "True" (with uppercase T), defined as the constant QuarantinedNodeUncordonedManuallyAnnotationValue in fault-quarantine/pkg/common/common.go. Tests should compare against "True", not "true".
Applied to files:
fault-remediation/pkg/reconciler/reconciler_e2e_test.go
𧬠Code graph analysis (1)
fault-remediation/pkg/reconciler/reconciler_e2e_test.go (3)
data-models/pkg/model/health_event_extentions.go (1)
HealthEventStatus(45-51)fault-remediation/pkg/reconciler/reconciler.go (3)
FaultRemediationReconciler(67-75)ReconcilerConfig(42-51)NewFaultRemediationReconciler(78-93)fault-remediation/pkg/reconciler/remediation.go (1)
FaultRemediationClient(56-68)
β° Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: E2E Tests (ARM64 + PostgreSQL)
- GitHub Check: E2E Tests (AMD64 + MongoDB)
- GitHub Check: E2E Tests (AMD64 + PostgreSQL)
- GitHub Check: E2E Tests (ARM64 + MongoDB)
π Additional comments (3)
fault-remediation/pkg/reconciler/reconciler_e2e_test.go (3)
56-61: Good addition of mutex for thread safety.The mutex protects concurrent access to
markProcessedCountbetween the reconciler goroutine and test assertions. This prevents data races in the test.
296-353: LGTM!The refactored
createTestRemediationClientnow properly returns errors instead of callinglog.Fatalfinternally, allowing callers to handle failures appropriately.
218-230: LGTM on test setup refactoring.The test setup properly creates the reconciler using
NewFaultRemediationReconciler, wires mock dependencies, and integrates with the controller-runtime manager. The mock store callback correctly tracks update calls.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
Caution
Some comments are outside the diff and canβt be posted inline due to platform limitations.
β οΈ Outside diff range comments (3)
fault-remediation/pkg/reconciler/reconciler_e2e_test.go (3)
55-116: Fix data race onMockHealthEventStore.updateCalledwhile preserving new watcher mutexThe new
sync.Mutexuse inMockChangeStreamWatcherformarkProcessedCountis good, butMockHealthEventStore.updateCalledis incremented from the reconciler goroutine and read/reset directly in tests, with no synchronization. Thatβs a classic data race and can makeupdateCalledβbased assertions flaky under concurrent event processing.Consider guarding
updateCalledsimilarly and exposing helpers instead of direct field access, e.g.:type MockHealthEventStore struct { - UpdateHealthEventStatusFn func(ctx context.Context, id string, status datastore.HealthEventStatus) error - updateCalled int + UpdateHealthEventStatusFn func(ctx context.Context, id string, status datastore.HealthEventStatus) error + mu sync.Mutex + updateCalled int } + +func (m *MockHealthEventStore) incUpdateCalled() { + m.mu.Lock() + defer m.mu.Unlock() + m.updateCalled++ +} + +func (m *MockHealthEventStore) ResetUpdateCalled() { + m.mu.Lock() + defer m.mu.Unlock() + m.updateCalled = 0 +} + +func (m *MockHealthEventStore) GetUpdateCalled() int { + m.mu.Lock() + defer m.mu.Unlock() + return m.updateCalled +}Then in setup/tests:
mockStore = &MockHealthEventStore{} mockStore.UpdateHealthEventStatusFn = func(ctx context.Context, id string, status datastore.HealthEventStatus) error { - mockStore.updateCalled++ + mockStore.incUpdateCalled() return nil }And replace all direct uses like
mockStore.updateCalled = 0,updateCountBefore := mockStore.updateCalled, and equality checks with the new helpers.
678-688: Fix misleading assertion message in EventSequence testHere you assert that a new CR should be created after the previous CR succeeds:
shouldCreate, _, err = r.checkExistingCRStatus(ctx, event3) assert.NoError(t, err) assert.True(t, shouldCreate, "RESTART_VM should be skipped (CR succeeded)")The assertion message says βshould be skippedβ, which contradicts both the comment above and the
assert.True. Suggest updating the message:- assert.True(t, shouldCreate, "RESTART_VM should be skipped (CR succeeded)") + assert.True(t, shouldCreate, "RESTART_VM should be allowed after CR succeeded")This keeps the behavior asβis but avoids confusion when reading failures.
748-857: MakeupdateCalledβbased dedup assertion more robust against async processingThe E2E subtest uses
mockStore.updateCalledto ensure the duplicate event doesnβt trigger a DB update:mockStore.updateCalled = 0 ... updateCountBefore := mockStore.updateCalled ... assert.Equal(t, updateCountBefore, mockStore.updateCalled, "MongoDB update should not be called for skipped event")Given that event handling happens in the reconcilerβs goroutine, this has two issues:
- As noted earlier, direct access to
updateCalledis a data race.- Even after
assert.Eventuallyconfirms deduplication at the annotation level, thereβs still a chance more store updates from prior events are in flight, making an exact equality check brittle.Once
MockHealthEventStorehas synchronized accessors, you can both fix the race and make the assertion more intentβfocused, e.g.:mockStore.ResetUpdateCalled() ... updateCountBefore := mockStore.GetUpdateCalled() ... assert.Eventually(t, func() bool { return mockStore.GetUpdateCalled() == updateCountBefore }, 2*time.Second, 100*time.Millisecond, "MongoDB update should not be called for skipped event")This retains the dedup guarantee while handling async behavior more gracefully.
β»οΈ Duplicate comments (1)
fault-remediation/pkg/reconciler/reconciler_e2e_test.go (1)
167-261: Avoidlog.Fatalfinside manager goroutine and consider minimal readiness sync
TestMaincorrectly uses envtest and a shared reconciler, but the manager goroutine still does:go func() { if err := mgr.Start(testContext); err != nil { log.Fatalf("Failed to start the test environment manager: %v", err) } }()Calling
log.Fatalffrom this goroutine can terminate the process abruptly without runningtearDownTestEnvironment()or flushing test logs. It also hides the failure from individual tests.A safer pattern is to log the error and let tests fail via assertions, optionally adding a simple readiness signal:
- go func() { - if err := mgr.Start(testContext); err != nil { - log.Fatalf("Failed to start the test environment manager: %v", err) - } - }() + mgrReady := make(chan struct{}) + go func() { + close(mgrReady) // manager start goroutine is running + if err := mgr.Start(testContext); err != nil { + log.Printf("Failed to start the test environment manager: %v", err) + } + }() + <-mgrReadyYou can optionally follow this with
mgr.GetCache().WaitForCacheSync(testContext)if you want cache readiness before tests proceed.
π§Ή Nitpick comments (2)
fault-remediation/pkg/reconciler/reconciler_e2e_test.go (2)
297-353: createTestRemediationClient wiring looks correct for envtestThe
createTestRemediationClienthelper cleanly wires discovery, RESTMapper, dynamic client, template loading, annotation manager, dryβrun mode, andCRStatusCheckeragainst the envtesttestRestConfig. This aligns with the guidance to use envtest and real k8s clients for controller tests.One small optional improvement would be to wrap returned errors with context (e.g.,
fmt.Errorf("load rebootnode template: %w", err)) to make failures easier to debug, but the current direct returns are acceptable for test code.
1347-1360: EnsureReconciledoesnβt depend on uninitialized fields inTestMetrics_ProcessingErrorsIn
TestMetrics_ProcessingErrors, you construct:invalidEventToken := &datastore.EventWithToken{ ... } watcher := NewMockChangeStreamWatcher() r := &FaultRemediationReconciler{ Watcher: watcher, } r.Reconcile(testContext, invalidEventToken)This assumes that the
Reconcilepath for invalid documents only touchesWatcher(if at all) and global metrics, and does not readds,healthEventStore,config, orannotationManager. If any of those are accessed, this test will panic withnildereferences.If
Reconcilealready guards those paths, this is fine. Otherwise, consider constructing the reconciler viaNewFaultRemediationReconcilerwith a minimalReconcilerConfigto ensure all used dependencies are nonβnil.
π Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
π Files selected for processing (3)
distros/kubernetes/nvsentinel/values-tilt-mongodb.yaml(1 hunks)distros/kubernetes/nvsentinel/values-tilt.yaml(2 hunks)fault-remediation/pkg/reconciler/reconciler_e2e_test.go(43 hunks)
π§° Additional context used
π Path-based instructions (2)
**/*.go
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging vialog/slogin Go code
Wrap errors with context usingfmt.Errorf("context: %w", err)in Go code
Withinretry.RetryOnConflictblocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such assyncedoverokfor cache sync checks
Useclient-gofor Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types
Files:
fault-remediation/pkg/reconciler/reconciler_e2e_test.go
**/*_test.go
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/*_test.go: Useenvtestfor testing Kubernetes controllers instead of fake clients
Usetestify/assertandtestify/requirefor assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format:TestFunctionName_Scenario_ExpectedBehavior
Files:
fault-remediation/pkg/reconciler/reconciler_e2e_test.go
π§ Learnings (5)
π Learning: 2025-11-26T13:54:26.427Z
Learnt from: rupalis-nv
Repo: NVIDIA/NVSentinel PR: 361
File: distros/kubernetes/nvsentinel/values-tilt.yaml:215-223
Timestamp: 2025-11-26T13:54:26.427Z
Learning: For values-tilt.yaml files: Keep documentation minimal and implementation-focused. Unlike production values.yaml files which need comprehensive inline comments, values-tilt.yaml is a Tilt-specific testing/development override file that doesn't require extensive documentation.
Applied to files:
distros/kubernetes/nvsentinel/values-tilt.yaml
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Implement proper shutdown handling with context cancellation in Go code
Applied to files:
fault-remediation/pkg/reconciler/reconciler_e2e_test.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use meaningful variable names such as `synced` over `ok` for cache sync checks
Applied to files:
fault-remediation/pkg/reconciler/reconciler_e2e_test.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients
Applied to files:
fault-remediation/pkg/reconciler/reconciler_e2e_test.go
π Learning: 2025-11-06T16:18:09.952Z
Learnt from: KaivalyaMDabhadkar
Repo: NVIDIA/NVSentinel PR: 280
File: tests/fault_management_test.go:334-345
Timestamp: 2025-11-06T16:18:09.952Z
Learning: In the fault-quarantine component, the "quarantinedNodeUncordonedManually" annotation is set to the string literal "True" (with uppercase T), defined as the constant QuarantinedNodeUncordonedManuallyAnnotationValue in fault-quarantine/pkg/common/common.go. Tests should compare against "True", not "true".
Applied to files:
fault-remediation/pkg/reconciler/reconciler_e2e_test.go
𧬠Code graph analysis (1)
fault-remediation/pkg/reconciler/reconciler_e2e_test.go (2)
store-client/pkg/datastore/types.go (1)
EventWithToken(101-104)fault-remediation/pkg/reconciler/reconciler.go (3)
FaultRemediationReconciler(67-75)ReconcilerConfig(42-51)NewFaultRemediationReconciler(78-93)
π Additional comments (4)
distros/kubernetes/nvsentinel/values-tilt.yaml (2)
24-24: Consistent flag enablement across Tilt environment configurations.The
ctrlRuntimeEnabled: trueflag is properly positioned in the global section and will affect downstream service deployments. Placement and indentation are correct.A minor observation:
values-tilt-mongodb.yamlincludes an explanatory comment for this flag clarifying the dual-test-coverage rationale, while this file does not. Given the minimal-documentation guideline for Tilt development overrides, this asymmetry is acceptable, though consistency might aid maintainability.
220-220: VerifyreplicaCount: 1is recognized and intentional for fault-remediation Helm chart.The addition of
replicaCount: 1for fault-remediation in a Tilt development environment is reasonable for single-replica testing. However, confirm that:
- The fault-remediation Helm chart recognizes and properly handles the
replicaCountfield.- The value
1is the intended default for controller-runtime-enabled testing (especially relevant given the pod anti-affinity rules below, which become effectively no-op with a single replica).- This change does not contribute to the e2e test failures mentioned in the PR objectives.
distros/kubernetes/nvsentinel/values-tilt-mongodb.yaml (2)
27-28: Clear rationale for dual-mode controller-runtime testing.The comment explaining the test coverage strategy (MongoDB with ctrl-runtime enabled, PostgreSQL with legacy behavior) is excellent documentation and clarifies the design intent for multi-datastore scenarios.
Confirm that
values-tilt-postgresql.yamlintentionally omits or disablesctrlRuntimeEnabledto maintain the legacy behavior pathway for PostgreSQL and ensure the dual test coverage design is fully implemented.
1-62: Verify PostgreSQL configuration maintains legacy behavior per stated dual-test-coverage design.The MongoDB file explicitly documents the test coverage strategy: "Use ctrl runtime in mongo and leave postgres with legacy so both get test coverage." Confirm that
distros/kubernetes/nvsentinel/values-tilt-postgresql.yamleither does not includectrlRuntimeEnabled(defaults to false/unset) or explicitly setsctrlRuntimeEnabled: falseto ensure the dual-mode testing architecture is properly maintained.
583aff9 to
131fb72
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
β»οΈ Duplicate comments (7)
health-events-analyzer/go.mod (1)
36-49: Malformed indirect module pathsβsubpackages listed instead of module.Lines 36-49 contain entries like
github.com/go-openapi/swag/cmdutils v0.25.4 // indirect. In standard Go module semantics, subpackages within a module should not appear as separate require entries. These should typically be consolidated to a singlefd.axjsq.dpdns.org/go-openapi/swag v0.25.4entry.Run
go mod tidyto verify these entries are correctly resolved, or confirm if go-openapi/swag v0.25+ has been restructured into separate submodules.go-openapi/swag v0.25 submodules structurehealth-monitors/syslog-health-monitor/go.mod (1)
32-42: Already flagged: Malformed go-openapi/swag subpackage entries.This issue was already identified in a previous review. The subpackage entries (cmdutils, conv, fileutils, etc.) should be consolidated to a single module entry and verified with
go mod tidy.event-exporter/go.mod (1)
28-38: Same malformed go-openapi/swag subpackage entries as other modules.Consistent with the pattern in other go.mod files, these subpackage entries may need consolidation. Ensure
go mod tidyis run across all modules to verify the dependency graph is valid.fault-remediation/pkg/reconciler/reconciler_e2e_test.go (1)
206-249: Consider graceful error handling in manager goroutine.Using
log.Fatalfinside a goroutine (line 247) will abruptly terminate the test process without proper cleanup. This was flagged in a past review but not fully addressed.Consider logging the error and letting tests fail naturally:
go func() { if err := mgr.Start(testContext); err != nil { - log.Fatalf("Failed to start the test environment manager: %v", err) + log.Printf("Failed to start the test environment manager: %v", err) } }()Additionally, consider adding a cache sync wait after starting the manager to ensure it's ready before tests run.
fault-remediation/pkg/reconciler/reconciler.go (2)
527-541: Add nil guards to CloseAll for safer test usage.
CloseAllunconditionally callsr.ds.Close(ctx)andr.Watcher.Close(ctx). In production,InitializeAllensures both are non-nil, but tests constructFaultRemediationReconcilerwithnilvalues. CallingCloseAllon such instances would panic.func (r *FaultRemediationReconciler) CloseAll(ctx context.Context) error { var errs []error - if err := r.ds.Close(ctx); err != nil { - slog.Error("failed to close datastore", "error", err) - errs = append(errs, err) + if r.ds != nil { + if err := r.ds.Close(ctx); err != nil { + slog.Error("failed to close datastore", "error", err) + errs = append(errs, err) + } } - if err := r.Watcher.Close(ctx); err != nil { - slog.Error("failed to close Watcher", "error", err) - errs = append(errs, err) + if r.Watcher != nil { + if err := r.Watcher.Close(ctx); err != nil { + slog.Error("failed to close Watcher", "error", err) + errs = append(errs, err) + } } return errors.Join(errs...) }
355-360: Inconsistent error handling for MarkProcessed.When
MarkProcessedfails at line 355, the function returnsctrl.Result{}, nilinstead of the error. This is inconsistent with similar error handling at lines 276-283, 302-306, and 327-331, which return the error. Without returning the error, controller-runtime won't apply retry backoff.if err = watcherInstance.MarkProcessed(ctx, eventWithToken.ResumeToken); err != nil { processingErrors.WithLabelValues("mark_processed_error", nodeName).Inc() slog.Error("Error updating resume token", "error", err) - return ctrl.Result{}, nil + return ctrl.Result{}, err }fault-remediation/main.go (1)
148-159: Event loop doesn't respect context cancellation.The
for event := rangeloop will block until the channel closes. IfgCtxis cancelled (e.g., SIGTERM), the goroutine won't exit untilWatcher.Events()channel is closed. This was flagged in past reviews but appears only partially addressed.Consider using a select with
gCtx.Done():- for event := range components.FaultRemediationReconciler.Watcher.Events() { - slog.Info("Event received", "event", event) - _, _ = components.FaultRemediationReconciler.Reconcile(gCtx, &event) - } - - return nil + eventCh := components.FaultRemediationReconciler.Watcher.Events() + for { + select { + case <-gCtx.Done(): + slog.Info("Context cancelled, stopping event processing") + return nil + case event, ok := <-eventCh: + if !ok { + return nil + } + slog.Info("Event received", "event", event) + _, _ = components.FaultRemediationReconciler.Reconcile(gCtx, &event) + } + }Based on learnings, proper shutdown handling with context cancellation is required for Go code.
π§Ή Nitpick comments (5)
fault-remediation/pkg/reconciler/metrics.go (1)
30-31: TODO should reference an issue.Per coding guidelines, TODO comments should reference issues in Go code. Consider creating a tracking issue for evaluating redundant metrics and updating this comment.
- //TODO: evaluate and remove redundant metrics with ctrl-runtime defaults + // TODO(#XXX): evaluate and remove redundant metrics with ctrl-runtime defaultsfault-remediation/pkg/initializer/init.go (2)
45-45: Consider addressing the cyclomatic complexity.The
nolint:cyclopdirective suppresses cyclomatic complexity warnings. While acceptable for now, consider refactoring initialization into smaller helper functions in a follow-up to improve maintainability.
70-70: TODO should reference an issue.Per coding guidelines, TODO comments should reference issues. Consider creating a tracking issue for the manager client replacement.
- //TODO: replace this with manager client + //TODO(#XXX): replace this with manager clientfault-remediation/pkg/reconciler/reconciler.go (2)
215-216: TODO should reference an issue.Per coding guidelines, TODO comments should reference issues for tracking.
- //TODO: return error to use built in ctrl runtime retries + //TODO(#XXX): return error to use built-in ctrl-runtime retries
339-339: Consider handling runLogCollector error.The error from
runLogCollectoris silently discarded. While log collector failures may be non-fatal, consider at minimum logging at warn level or tracking via metrics.- _ = r.runLogCollector(ctx, healthEvent) + if err := r.runLogCollector(ctx, healthEvent); err != nil { + slog.Warn("Log collector failed but continuing with remediation", "node", nodeName, "error", err) + }
π Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
β Files ignored due to path filters (5)
commons/go.sumis excluded by!**/*.sumevent-exporter/go.sumis excluded by!**/*.sumfault-remediation/go.sumis excluded by!**/*.sumhealth-events-analyzer/go.sumis excluded by!**/*.sumhealth-monitors/syslog-health-monitor/go.sumis excluded by!**/*.sum
π Files selected for processing (19)
commons/go.mod(2 hunks)commons/pkg/server/server.go(2 hunks)distros/kubernetes/nvsentinel/charts/fault-remediation/templates/clusterrole.yaml(2 hunks)distros/kubernetes/nvsentinel/charts/fault-remediation/templates/deployment.yaml(1 hunks)distros/kubernetes/nvsentinel/values-full.yaml(1 hunks)distros/kubernetes/nvsentinel/values-tilt-mongodb.yaml(1 hunks)distros/kubernetes/nvsentinel/values-tilt.yaml(2 hunks)event-exporter/go.mod(1 hunks)fault-remediation/go.mod(2 hunks)fault-remediation/main.go(3 hunks)fault-remediation/pkg/initializer/init.go(5 hunks)fault-remediation/pkg/reconciler/metrics.go(4 hunks)fault-remediation/pkg/reconciler/reconciler.go(17 hunks)fault-remediation/pkg/reconciler/reconciler_e2e_test.go(43 hunks)fault-remediation/pkg/reconciler/reconciler_test.go(16 hunks)fault-remediation/pkg/reconciler/remediation.go(1 hunks)health-events-analyzer/go.mod(1 hunks)health-monitors/syslog-health-monitor/go.mod(1 hunks)tests/fault_remediation_test.go(1 hunks)
π§ Files skipped from review as they are similar to previous changes (7)
- fault-remediation/pkg/reconciler/remediation.go
- distros/kubernetes/nvsentinel/values-tilt.yaml
- fault-remediation/go.mod
- distros/kubernetes/nvsentinel/charts/fault-remediation/templates/clusterrole.yaml
- distros/kubernetes/nvsentinel/values-full.yaml
- distros/kubernetes/nvsentinel/charts/fault-remediation/templates/deployment.yaml
- tests/fault_remediation_test.go
π§° Additional context used
π Path-based instructions (3)
**/*.go
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging vialog/slogin Go code
Wrap errors with context usingfmt.Errorf("context: %w", err)in Go code
Withinretry.RetryOnConflictblocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such assyncedoverokfor cache sync checks
Useclient-gofor Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types
Files:
fault-remediation/pkg/initializer/init.gofault-remediation/main.gocommons/pkg/server/server.gofault-remediation/pkg/reconciler/reconciler_e2e_test.gofault-remediation/pkg/reconciler/metrics.gofault-remediation/pkg/reconciler/reconciler_test.gofault-remediation/pkg/reconciler/reconciler.go
**/go.mod
π CodeRabbit inference engine (.github/copilot-instructions.md)
Use
go.modfor each service as a separate Go module with semantic import versioning
Files:
commons/go.modhealth-events-analyzer/go.modhealth-monitors/syslog-health-monitor/go.modevent-exporter/go.mod
**/*_test.go
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/*_test.go: Useenvtestfor testing Kubernetes controllers instead of fake clients
Usetestify/assertandtestify/requirefor assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format:TestFunctionName_Scenario_ExpectedBehavior
Files:
fault-remediation/pkg/reconciler/reconciler_e2e_test.gofault-remediation/pkg/reconciler/reconciler_test.go
π§ Learnings (14)
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Use `commons/` for shared utilities across Go modules
Applied to files:
commons/go.modhealth-events-analyzer/go.modhealth-monitors/syslog-health-monitor/go.mod
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/go.mod : Use `go.mod` for each service as a separate Go module with semantic import versioning
Applied to files:
commons/go.modhealth-events-analyzer/go.modhealth-monitors/syslog-health-monitor/go.modevent-exporter/go.mod
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Keep Go dependencies minimal and up-to-date
Applied to files:
commons/go.modhealth-events-analyzer/go.modhealth-monitors/syslog-health-monitor/go.modevent-exporter/go.mod
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use `client-go` for Kubernetes API interactions in Go code
Applied to files:
commons/go.modhealth-events-analyzer/go.modhealth-monitors/syslog-health-monitor/go.modevent-exporter/go.mod
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients
Applied to files:
commons/go.modfault-remediation/pkg/reconciler/reconciler_e2e_test.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Prefer informers over direct API calls for watching Kubernetes resources
Applied to files:
health-events-analyzer/go.modfault-remediation/main.gohealth-monitors/syslog-health-monitor/go.modevent-exporter/go.mod
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Implement proper shutdown handling with context cancellation in Go code
Applied to files:
fault-remediation/main.gofault-remediation/pkg/reconciler/reconciler_e2e_test.gofault-remediation/pkg/reconciler/reconciler.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use meaningful variable names such as `synced` over `ok` for cache sync checks
Applied to files:
fault-remediation/pkg/reconciler/reconciler_e2e_test.go
π Learning: 2025-11-06T16:18:09.952Z
Learnt from: KaivalyaMDabhadkar
Repo: NVIDIA/NVSentinel PR: 280
File: tests/fault_management_test.go:334-345
Timestamp: 2025-11-06T16:18:09.952Z
Learning: In the fault-quarantine component, the "quarantinedNodeUncordonedManually" annotation is set to the string literal "True" (with uppercase T), defined as the constant QuarantinedNodeUncordonedManuallyAnnotationValue in fault-quarantine/pkg/common/common.go. Tests should compare against "True", not "true".
Applied to files:
fault-remediation/pkg/reconciler/reconciler_e2e_test.go
π Learning: 2025-10-29T15:37:49.210Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 160
File: health-events-analyzer/pkg/reconciler/metrics.go:52-58
Timestamp: 2025-10-29T15:37:49.210Z
Learning: In health-events-analyzer/pkg/reconciler/metrics.go, the ruleMatchedTotal metric should include both "rule_name" and "node_name" labels to identify which rule triggered on which node and how many times, as per the project's observability requirements.
Applied to files:
fault-remediation/pkg/reconciler/metrics.gofault-remediation/pkg/reconciler/reconciler.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : TODO comments should reference issues in Go code
Applied to files:
health-monitors/syslog-health-monitor/go.modfault-remediation/pkg/reconciler/reconciler.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Within `retry.RetryOnConflict` blocks, return errors without wrapping to preserve retry behavior
Applied to files:
fault-remediation/pkg/reconciler/reconciler.go
π Learning: 2025-12-01T17:53:20.067Z
Learnt from: jtschelling
Repo: NVIDIA/NVSentinel PR: 455
File: docs/designs/013-remediation-plugins.md:162-204
Timestamp: 2025-12-01T17:53:20.067Z
Learning: Code examples in design documents (docs/designs/) in the NVSentinel repository are illustrative stubs meant to show intent and architectural concepts, and do not need to be 100% correct or complete implementations (e.g., undefined variables, missing error handling, incomplete struct initialization are acceptable).
Applied to files:
fault-remediation/pkg/reconciler/reconciler.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Function comments required for all exported Go functions
Applied to files:
fault-remediation/pkg/reconciler/reconciler.go
𧬠Code graph analysis (4)
fault-remediation/pkg/initializer/init.go (4)
fault-remediation/pkg/reconciler/reconciler.go (2)
FaultRemediationReconciler(67-75)NewFaultRemediationReconciler(78-93)store-client/pkg/datastore/registry.go (1)
NewDataStore(120-123)store-client/pkg/watcher/factory.go (2)
WatcherConfig(26-38)CreateChangeStreamWatcher(65-87)store-client/pkg/datastore/interfaces.go (1)
HealthEventStore(63-87)
fault-remediation/pkg/reconciler/reconciler_e2e_test.go (4)
store-client/pkg/datastore/types.go (1)
EventWithToken(101-104)data-models/pkg/model/health_event_extentions.go (1)
HealthEventStatus(45-51)fault-remediation/pkg/reconciler/reconciler.go (3)
FaultRemediationReconciler(67-75)ReconcilerConfig(42-51)NewFaultRemediationReconciler(78-93)fault-remediation/pkg/reconciler/remediation.go (1)
FaultRemediationClient(56-68)
fault-remediation/pkg/reconciler/reconciler_test.go (1)
fault-remediation/pkg/reconciler/reconciler.go (3)
NewFaultRemediationReconciler(78-93)HealthEventDoc(53-56)ReconcilerConfig(42-51)
fault-remediation/pkg/reconciler/reconciler.go (4)
store-client/pkg/storewatcher/watch_store.go (1)
ChangeStreamWatcher(43-57)store-client/pkg/datastore/interfaces.go (1)
HealthEventStore(63-87)fault-remediation/pkg/reconciler/annotation.go (1)
NodeAnnotationManagerInterface(38-43)data-models/pkg/protos/health_event.pb.go (1)
RecommendedAction_NONE(43-43)
π Additional comments (38)
distros/kubernetes/nvsentinel/values-tilt-mongodb.yaml (1)
27-28: LGTM!Good approach to enable controller-runtime for MongoDB while keeping legacy mode for PostgreSQL to ensure both paths get test coverage during development.
commons/pkg/server/server.go (1)
227-233: Implementation is correct.The function properly uses
promhttp.HandlerForwith the controller-runtime registry (crmetrics.Registry), enabling metrics from controller-runtime components to be exposed alongside custom metrics.fault-remediation/pkg/reconciler/metrics.go (1)
33-92: LGTM! Consistent migration to controller-runtime registry.All metrics are correctly migrated to use
promauto.With(crmetrics.Registry), ensuring they're registered with the controller-runtime metrics registry. This aligns with theWithPrometheusMetricsCtrlRuntime()server option that exposes these metrics via/metrics.commons/go.mod (1)
17-17: controller-runtime v0.22.4 is the current stable version as of December 2025.Adding
sigs.k8s.io/controller-runtimeas a direct dependency in commons is appropriate for the shared module. However, verify that the module organization follows the guideline requiring each service to usego.modas a separate Go module with semantic import versioning.fault-remediation/pkg/initializer/init.go (4)
31-31: LGTM!The watcher package import is correctly added to support the new factory-based watcher initialization.
41-43: LGTM!The
Componentsstruct field is appropriately renamed to reflect the newFaultRemediationReconcilertype, maintaining consistency with the reconciler refactoring.
99-116: LGTM - initialization flow is well-structured.The datastore, watcher, and health event store are initialized with proper error handling and wrapping. The factory pattern for watcher creation aligns with the relevant code snippets provided.
One consideration: the collection name
"HealthEvents"is hardcoded. Verify this is intentional and doesn't need to be configurable via the TOML config.
131-134: LGTM!The reconciler is correctly constructed using the new
NewFaultRemediationReconcilerconstructor with all required dependencies wired appropriately.fault-remediation/pkg/reconciler/reconciler_test.go (10)
219-222: LGTM!Test correctly updated to use the new constructor with
nildependencies (appropriate for unit tests that don't exercise datastore/watcher) and validates the privatedryRunfield.
263-274: LGTM!Constructor updated correctly. Note the discarded second return value from
CreateMaintenanceResourceis acceptable here since the test focuses on the success boolean.
320-323: LGTM!Test correctly uses the new constructor and validates
shouldSkipEventbehavior for unsupported actions.
371-381: LGTM!Test properly updated to handle the new
(bool, string, error)return signature fromperformRemediation, with appropriateassert.NoErrorcheck.
428-438: LGTM!Failure case test correctly updated with new constructor and error handling.
474-485: LGTM!Label update failure test correctly adapted to the new API surface.
500-501: LGTM!Tests for
shouldSkipEventand log collector behavior correctly updated to user.config.EnableLogCollectorandr.config.RemediationClient.Also applies to: 582-593
661-663: LGTM!All log collector tests correctly adapted to the new constructor and field access patterns.
Also applies to: 692-694, 724-731
784-789: LGTM!
TestUpdateNodeRemediatedStatuscorrectly sets up the mock health store and uses the new constructor.
864-864: LGTM!CR-based deduplication and equivalence group tests correctly migrated to the new constructor API.
Also applies to: 937-937, 995-995
fault-remediation/main.go (5)
17-40: LGTM!Imports are well-organized, including controller-runtime packages, error handling, and the initializer package.
42-63: LGTM!Package-level flag variables are cleanly declared with clear naming. This centralizes configuration management effectively.
75-118: LGTM!The
run()function properly:
- Validates that leader election requires controller-runtime mode
- Uses signal context for graceful shutdown
- Defers
CloseAllfor resource cleanup- Routes to the appropriate setup function based on mode
164-208: LGTM!The controller-runtime setup is well-structured:
- Manager configured with metrics, health probes, and leader election
- Health/ready checks added via
healthz.Ping- Reconciler properly wired via
SetupWithManager- Manager started with the parent context for proper lifecycle management
210-285: LGTM!The
parseFlagsfunction is comprehensive and well-documented:
- Clear flag descriptions with requirements noted
- Sensible defaults for leader election timing
- Proper separation of concerns from the main logic
fault-remediation/pkg/reconciler/reconciler_e2e_test.go (8)
56-61: LGTM - Good concurrency safety!Adding mutex protection to
MockChangeStreamWatchercounters prevents data races when tests and the reconciler access counts concurrently. This is a proper testing practice.Also applies to: 82-101
105-108: LGTM!The
updateCalledcounter enables precise assertions on database update behavior, supporting the deduplication verification tests.
167-177: LGTM!Package-level test variables properly updated to use
FaultRemediationReconcilerand separate mock instances for watcher and store.
297-353: LGTM!
createTestRemediationClientnow properly returns(*FaultRemediationClient, error), following Go error handling conventions. All callers are updated to handle the error appropriately.
364-380: LGTM - Tests properly use public API.Integration tests now construct
FaultRemediationReconcilerdirectly withannotationManagerwired throughcfg.RemediationClient.GetAnnotationManager(). This aligns with the public API design and avoids relying on internal fields.Also applies to: 427-439, 480-492, 554-566, 622-634
393-394: LGTM!All
performRemediationcalls correctly handle the new(bool, string, error)return signature withassert.NoError(t, err)checks.Also applies to: 451-452, 504-505, 531-532, 578-579, 652-653, 715-716
749-749: LGTM!Tests properly reset
mockStore.updateCalledbefore each test run to ensure isolated assertions.Also applies to: 808-808, 856-856, 996-996, 1070-1070, 1151-1151
1344-1363: LGTM!Metrics processing error test correctly creates a local reconciler instance with a fresh mock watcher, avoiding interference with global test state.
fault-remediation/pkg/reconciler/reconciler.go (7)
17-40: LGTM!Imports are well-organized with controller-runtime packages, error handling, and proper aliasing for
nvstoreclient.
64-93: LGTM!The
FaultRemediationReconcilerstruct andNewFaultRemediationReconcilerconstructor have proper godoc comments. The struct correctly embedsclient.Clientfor controller-runtime compatibility and stores all necessary dependencies.
95-133: LGTM!The
Reconcilemethod properly follows the controller-runtime pattern:
- Returns
ctrl.Resultanderror- Metrics tracking for event duration
- Proper delegation to cancellation or remediation handlers
135-174: LGTM!
shouldSkipEventcorrectly handles:
- NONE actions (no remediation needed)
- Already remediated events
- Unsupported actions with metrics tracking and label updates
176-194: LGTM!
runLogCollectornow returns an error, enabling callers to handle failures appropriately.
543-572: LGTM!
SetupWithManagercorrectly:
- Starts the watcher with the provided context (addressing past review about context.Background())
- Uses
AdaptEventsto convert events to controller-runtime format- Wires the reconciler with a typed channel source
574-602: LGTM!
AdaptEventsproperly:
- Transforms events to controller-runtime
TypedGenericEvent- Respects context cancellation
- Closes output channel when input closes or context is done
|
/ok to test 131fb72 |
Merging this branch will decrease overall coverage
Coverage by fileChanged unit test files
|
73a7468 to
260636b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
β»οΈ Duplicate comments (4)
health-monitors/syslog-health-monitor/go.mod (1)
31-42: Fix malformed subpackage module pathsβthis issue persists from prior review.Lines 32β42 list invalid subpackage paths like
github.com/go-openapi/swag/cmdutils,conv,fileutils, etc., as separate module entries. In Go'sgo.mod, only the module path should appear. Subpackages (cmdutils, conv, etc.) are internal to thefd.axjsq.dpdns.org/go-openapi/swagmodule and must not be listed separately.Line 31 already correctly declares
github.com/go-openapi/swag v0.25.4 // indirect, which implicitly covers all subpackages. Remove lines 32β42 and rungo mod tidyto reconcile.github.com/go-openapi/swag v0.25.4 // indirect - github.com/go-openapi/swag/cmdutils v0.25.4 // indirect - github.com/go-openapi/swag/conv v0.25.4 // indirect - github.com/go-openapi/swag/fileutils v0.25.4 // indirect - github.com/go-openapi/swag/jsonname v0.25.4 // indirect - github.com/go-openapi/swag/jsonutils v0.25.4 // indirect - github.com/go-openapi/swag/loading v0.25.4 // indirect - github.com/go-openapi/swag/mangling v0.25.4 // indirect - github.com/go-openapi/swag/netutils v0.25.4 // indirect - github.com/go-openapi/swag/stringutils v0.25.4 // indirect - github.com/go-openapi/swag/typeutils v0.25.4 // indirect - github.com/go-openapi/swag/yamlutils v0.25.4 // indirect github.com/gogo/protobuf v1.3.2 // indirectcommons/pkg/server/server.go (1)
218-234: Fix the example in docstringβshows wrong function name.The docstring example on line 225 shows
WithPrometheusMetrics()but this documentation is forWithPrometheusMetricsCtrlRuntime(). This appears to be a copy-paste oversight from the existingWithPrometheusMetrics()function.Apply this diff to fix the example:
// WithPrometheusMetricsCtrlRuntime registers the Prometheus metrics handler at /metrics. // This uses the controller runtime Prometheus registry. // // Example: // // srv := NewServer( // WithPort(2112), -// WithPrometheusMetrics(), +// WithPrometheusMetricsCtrlRuntime(), // )fault-remediation/main.go (1)
148-159: Event loop doesn't respond to context cancellation.The
for rangeloop at lines 153-156 blocks untilWatcher.Events()closes. IfgCtxis cancelled (e.g., SIGTERM), this goroutine won't exit until the watcher closes its channel, potentially delaying graceful shutdown.Consider using a select statement to monitor both the event channel and context:
g.Go(func() error { components.FaultRemediationReconciler.StartWatcherStream(gCtx) slog.Info("Listening for events on the channel...") - for event := range components.FaultRemediationReconciler.Watcher.Events() { - slog.Info("Event received", "event", event) - _, _ = components.FaultRemediationReconciler.Reconcile(gCtx, &event) + eventCh := components.FaultRemediationReconciler.Watcher.Events() + for { + select { + case <-gCtx.Done(): + slog.Info("Context cancelled, stopping event processing") + return nil + case event, ok := <-eventCh: + if !ok { + return nil + } + slog.Info("Event received", "event", event) + _, _ = components.FaultRemediationReconciler.Reconcile(gCtx, &event) + } } - - return nil })Based on coding guidelines requiring proper shutdown handling with context cancellation.
fault-remediation/pkg/reconciler/reconciler_e2e_test.go (1)
245-249: Add manager readiness synchronization and avoid log.Fatalf.The goroutine uses
log.Fatalfwhich abruptly terminates the test process without cleanup. Additionally, there's no synchronization to ensure the manager is ready beforem.Run()executes tests.Consider adding a readiness channel and handling errors gracefully:
+ mgrReady := make(chan struct{}) go func() { + close(mgrReady) if err := mgr.Start(testContext); err != nil { - log.Fatalf("Failed to start the test environment manager: %v", err) + log.Printf("Failed to start the test environment manager: %v", err) } }() + <-mgrReady // Wait for manager goroutine to startAlternatively, use
mgr.GetCache().WaitForCacheSync(testContext)after starting to ensure caches are synchronized before running tests.
π§Ή Nitpick comments (2)
fault-remediation/pkg/reconciler/reconciler.go (2)
525-541: Add nil guards in CloseAll for safer test usage.
CloseAllunconditionally callsr.ds.Close(ctx)andr.Watcher.Close(ctx). While production code ensures both are non-nil, tests constructFaultRemediationReconcilerinstances with nil fields. CallingCloseAllon those would panic.Add nil checks to make the method safer:
func (r *FaultRemediationReconciler) CloseAll(ctx context.Context) error { var errs []error - if err := r.ds.Close(ctx); err != nil { - slog.Error("failed to close datastore", "error", err) - errs = append(errs, err) + if r.ds != nil { + if err := r.ds.Close(ctx); err != nil { + slog.Error("failed to close datastore", "error", err) + errs = append(errs, err) + } } - if err := r.Watcher.Close(ctx); err != nil { - slog.Error("failed to close Watcher", "error", err) - errs = append(errs, err) + if r.Watcher != nil { + if err := r.Watcher.Close(ctx); err != nil { + slog.Error("failed to close Watcher", "error", err) + errs = append(errs, err) + } } return errors.Join(errs...) }
215-215: TODO comment should reference a tracking issue.The TODO comment lacks an issue reference. Per coding guidelines, TODO comments should reference issues for tracking.
Consider creating an issue to track the controller-runtime retry refactoring and updating the comment:
- //TODO: return error to use built in ctrl runtime retries + // TODO(#ISSUE): return error to use built-in ctrl-runtime retriesAs per coding guidelines, TODO comments should reference issues in Go code.
π Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
β Files ignored due to path filters (5)
commons/go.sumis excluded by!**/*.sumevent-exporter/go.sumis excluded by!**/*.sumfault-remediation/go.sumis excluded by!**/*.sumhealth-events-analyzer/go.sumis excluded by!**/*.sumhealth-monitors/syslog-health-monitor/go.sumis excluded by!**/*.sum
π Files selected for processing (19)
commons/go.mod(2 hunks)commons/pkg/server/server.go(2 hunks)distros/kubernetes/nvsentinel/charts/fault-remediation/templates/clusterrole.yaml(2 hunks)distros/kubernetes/nvsentinel/charts/fault-remediation/templates/deployment.yaml(1 hunks)distros/kubernetes/nvsentinel/values-full.yaml(1 hunks)distros/kubernetes/nvsentinel/values-tilt-mongodb.yaml(1 hunks)distros/kubernetes/nvsentinel/values-tilt.yaml(2 hunks)event-exporter/go.mod(1 hunks)fault-remediation/go.mod(2 hunks)fault-remediation/main.go(3 hunks)fault-remediation/pkg/initializer/init.go(5 hunks)fault-remediation/pkg/reconciler/metrics.go(4 hunks)fault-remediation/pkg/reconciler/reconciler.go(17 hunks)fault-remediation/pkg/reconciler/reconciler_e2e_test.go(43 hunks)fault-remediation/pkg/reconciler/reconciler_test.go(16 hunks)fault-remediation/pkg/reconciler/remediation.go(1 hunks)health-events-analyzer/go.mod(1 hunks)health-monitors/syslog-health-monitor/go.mod(1 hunks)tests/fault_remediation_test.go(1 hunks)
π§ Files skipped from review as they are similar to previous changes (7)
- distros/kubernetes/nvsentinel/values-tilt-mongodb.yaml
- fault-remediation/go.mod
- commons/go.mod
- distros/kubernetes/nvsentinel/charts/fault-remediation/templates/deployment.yaml
- fault-remediation/pkg/reconciler/remediation.go
- distros/kubernetes/nvsentinel/charts/fault-remediation/templates/clusterrole.yaml
- distros/kubernetes/nvsentinel/values-full.yaml
π§° Additional context used
π Path-based instructions (3)
**/*.go
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging vialog/slogin Go code
Wrap errors with context usingfmt.Errorf("context: %w", err)in Go code
Withinretry.RetryOnConflictblocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such assyncedoverokfor cache sync checks
Useclient-gofor Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types
Files:
commons/pkg/server/server.gofault-remediation/pkg/initializer/init.gofault-remediation/pkg/reconciler/reconciler_test.gofault-remediation/main.gofault-remediation/pkg/reconciler/reconciler.gofault-remediation/pkg/reconciler/reconciler_e2e_test.gofault-remediation/pkg/reconciler/metrics.gotests/fault_remediation_test.go
**/go.mod
π CodeRabbit inference engine (.github/copilot-instructions.md)
Use
go.modfor each service as a separate Go module with semantic import versioning
Files:
event-exporter/go.modhealth-monitors/syslog-health-monitor/go.modhealth-events-analyzer/go.mod
**/*_test.go
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/*_test.go: Useenvtestfor testing Kubernetes controllers instead of fake clients
Usetestify/assertandtestify/requirefor assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format:TestFunctionName_Scenario_ExpectedBehavior
Files:
fault-remediation/pkg/reconciler/reconciler_test.gofault-remediation/pkg/reconciler/reconciler_e2e_test.gotests/fault_remediation_test.go
π§ Learnings (17)
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/go.mod : Use `go.mod` for each service as a separate Go module with semantic import versioning
Applied to files:
event-exporter/go.modhealth-monitors/syslog-health-monitor/go.modhealth-events-analyzer/go.mod
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Keep Go dependencies minimal and up-to-date
Applied to files:
event-exporter/go.modhealth-monitors/syslog-health-monitor/go.modhealth-events-analyzer/go.mod
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use `client-go` for Kubernetes API interactions in Go code
Applied to files:
event-exporter/go.modhealth-monitors/syslog-health-monitor/go.modhealth-events-analyzer/go.mod
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Prefer informers over direct API calls for watching Kubernetes resources
Applied to files:
event-exporter/go.modhealth-monitors/syslog-health-monitor/go.modhealth-events-analyzer/go.modfault-remediation/main.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : TODO comments should reference issues in Go code
Applied to files:
health-monitors/syslog-health-monitor/go.modfault-remediation/pkg/reconciler/reconciler.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Use `commons/` for shared utilities across Go modules
Applied to files:
health-monitors/syslog-health-monitor/go.modhealth-events-analyzer/go.mod
π Learning: 2025-11-26T13:54:26.427Z
Learnt from: rupalis-nv
Repo: NVIDIA/NVSentinel PR: 361
File: distros/kubernetes/nvsentinel/values-tilt.yaml:215-223
Timestamp: 2025-11-26T13:54:26.427Z
Learning: For values-tilt.yaml files: Keep documentation minimal and implementation-focused. Unlike production values.yaml files which need comprehensive inline comments, values-tilt.yaml is a Tilt-specific testing/development override file that doesn't require extensive documentation.
Applied to files:
distros/kubernetes/nvsentinel/values-tilt.yaml
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Implement proper shutdown handling with context cancellation in Go code
Applied to files:
fault-remediation/main.gofault-remediation/pkg/reconciler/reconciler.gofault-remediation/pkg/reconciler/reconciler_e2e_test.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Within `retry.RetryOnConflict` blocks, return errors without wrapping to preserve retry behavior
Applied to files:
fault-remediation/pkg/reconciler/reconciler.go
π Learning: 2025-12-01T17:53:20.067Z
Learnt from: jtschelling
Repo: NVIDIA/NVSentinel PR: 455
File: docs/designs/013-remediation-plugins.md:162-204
Timestamp: 2025-12-01T17:53:20.067Z
Learning: Code examples in design documents (docs/designs/) in the NVSentinel repository are illustrative stubs meant to show intent and architectural concepts, and do not need to be 100% correct or complete implementations (e.g., undefined variables, missing error handling, incomplete struct initialization are acceptable).
Applied to files:
fault-remediation/pkg/reconciler/reconciler.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Function comments required for all exported Go functions
Applied to files:
fault-remediation/pkg/reconciler/reconciler.go
π Learning: 2025-10-29T15:37:49.210Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 160
File: health-events-analyzer/pkg/reconciler/metrics.go:52-58
Timestamp: 2025-10-29T15:37:49.210Z
Learning: In health-events-analyzer/pkg/reconciler/metrics.go, the ruleMatchedTotal metric should include both "rule_name" and "node_name" labels to identify which rule triggered on which node and how many times, as per the project's observability requirements.
Applied to files:
fault-remediation/pkg/reconciler/reconciler.gofault-remediation/pkg/reconciler/metrics.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use meaningful variable names such as `synced` over `ok` for cache sync checks
Applied to files:
fault-remediation/pkg/reconciler/reconciler_e2e_test.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients
Applied to files:
fault-remediation/pkg/reconciler/reconciler_e2e_test.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go
Applied to files:
fault-remediation/pkg/reconciler/reconciler_e2e_test.go
π Learning: 2025-11-06T16:18:09.952Z
Learnt from: KaivalyaMDabhadkar
Repo: NVIDIA/NVSentinel PR: 280
File: tests/fault_management_test.go:334-345
Timestamp: 2025-11-06T16:18:09.952Z
Learning: In the fault-quarantine component, the "quarantinedNodeUncordonedManually" annotation is set to the string literal "True" (with uppercase T), defined as the constant QuarantinedNodeUncordonedManuallyAnnotationValue in fault-quarantine/pkg/common/common.go. Tests should compare against "True", not "true".
Applied to files:
fault-remediation/pkg/reconciler/reconciler_e2e_test.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Name Go tests descriptively using format: `TestFunctionName_Scenario_ExpectedBehavior`
Applied to files:
tests/fault_remediation_test.go
𧬠Code graph analysis (3)
fault-remediation/pkg/initializer/init.go (5)
fault-remediation/pkg/reconciler/reconciler.go (2)
FaultRemediationReconciler(67-75)NewFaultRemediationReconciler(78-93)store-client/pkg/datastore/registry.go (1)
NewDataStore(120-123)store-client/pkg/watcher/factory.go (2)
WatcherConfig(26-38)CreateChangeStreamWatcher(65-87)store-client/pkg/datastore/types.go (1)
Pipeline(121-121)store-client/pkg/datastore/interfaces.go (1)
HealthEventStore(63-87)
fault-remediation/pkg/reconciler/reconciler_test.go (2)
fault-remediation/pkg/reconciler/reconciler.go (3)
NewFaultRemediationReconciler(78-93)HealthEventDoc(53-56)ReconcilerConfig(42-51)data-models/pkg/protos/health_event.pb.go (1)
RecommendedAction_NONE(43-43)
fault-remediation/pkg/reconciler/reconciler_e2e_test.go (4)
store-client/pkg/datastore/types.go (1)
EventWithToken(101-104)fault-remediation/pkg/reconciler/reconciler.go (3)
FaultRemediationReconciler(67-75)ReconcilerConfig(42-51)NewFaultRemediationReconciler(78-93)health-monitors/kubernetes-object-monitor/pkg/annotations/manager.go (1)
NewManager(41-43)fault-remediation/pkg/reconciler/remediation.go (1)
FaultRemediationClient(56-68)
π Additional comments (22)
distros/kubernetes/nvsentinel/values-tilt.yaml (2)
220-220: Verify that replicaCount is properly referenced in the fault-remediation deployment template.The addition sets the replica count for fault-remediation to 1, appropriate for Tilt testing. Ensure the deployment template properly references this value using
{{ .Values.replicaCount }}or similar template syntax.
24-24: Verify thatctrlRuntimeEnabledflag is properly referenced in the fault-remediation deployment template.This addition enables controller-runtime integration in the Tilt environment. Ensure the corresponding deployment template references and acts on the
global.ctrlRuntimeEnabledflag to confirm this configuration change is actually utilized.event-exporter/go.mod (1)
18-84: Verify that event-exporter's controller-runtime and k8s.io dependencies are necessary and not unnecessarily transitive.Event-exporter/go.mod includes the same pattern of controller-runtime and k8s.io indirect dependencies as health-events-analyzer. Confirm whether event-exporter directly imports these packages or if they are pulled in purely through the commons module. If these are only transitive dependencies that event-exporter does not directly use, consider whether they should be scoped only to modules that explicitly require controller-runtime integration.
Additionally, verify that version pinning (k8s.io/api v0.34.2, controller-runtime v0.22.4) is consistent across all services and does not create version skew.
To verify:
- Search event-exporter source code for direct imports of
sigs.k8s.io/controller-runtime,k8s.io/client-go, ork8s.io/api- Check commons/go.mod to confirm these are direct requirements there
- Confirm version alignment with other services in the codebase
health-events-analyzer/go.mod (1)
29-94: Verify that all indirect dependencies are transitive from controller-runtime v0.22.4 and Go version compatibility.The expansion of indirect dependencies (lines 29β94) is expected when integrating controller-runtime, but confirm that all added packages are pulled in exclusively by controller-runtime v0.22.4 and k8s.io modules, rather than manually added. Additionally, verify Go 1.25 with toolchain go1.25.3 is compatible with controller-runtime v0.22.4, k8s.io/api v0.34.2, and k8s.io/client-go v0.34.2.
Per coding guidelines, ensure Go dependencies remain minimal. If any indirect dependencies can be removed without breaking the integration, remove them.
commons/pkg/server/server.go (1)
54-54: LGTM on the import addition.The controller-runtime metrics import is correctly aliased and appropriately placed with other imports.
fault-remediation/pkg/reconciler/metrics.go (2)
29-38: LGTM on the metrics registry migration.The migration to
promauto.With(crmetrics.Registry)correctly aligns metrics registration with the controller-runtime registry. This ensures metrics are properly exposed when usingWithPrometheusMetricsCtrlRuntime()in the server.
39-92: Consistent registry usage across all metrics.All remaining metrics (
eventsProcessed,processingErrors,totalUnsupportedRemediationActions,eventHandlingDuration,logCollectorJobs,logCollectorJobDuration,logCollectorErrors) are correctly migrated to use the controller-runtime registry.fault-remediation/pkg/initializer/init.go (1)
131-134: LGTM on the FaultRemediationReconciler wiring.The reconciler is correctly instantiated with all required dependencies: datastore, watcher, healthEventStore, config, and dryRun flag.
tests/fault_remediation_test.go (1)
28-28: LGTM on the typo fix.Correcting "Reemdiated" β "Remediated" in the test function name improves readability. As per coding guidelines, test names should be descriptive using the format
TestFunctionName_Scenario_ExpectedBehavior.fault-remediation/pkg/reconciler/reconciler_test.go (4)
219-221: LGTM on constructor and field access updates.The test correctly uses the new
NewFaultRemediationReconcilerconstructor withnilfor dependencies not exercised in this test, and accesses the unexporteddryRunfield appropriately.
377-378: Proper error handling for performRemediation.The updated
performRemediationreturn signature(bool, string, error)is correctly handled withassert.NoError(t, err)assertions across all test cases.Also applies to: 434-435, 481-482
588-589: Consistent access to config fields.The tests correctly access
r.config.EnableLogCollectorandr.config.RemediationClientfor testing the log collector behavior, matching the updated struct field visibility.Also applies to: 730-731
784-789: Test setup correctly updated for new constructor.The
TestUpdateNodeRemediatedStatustest properly initializes the reconciler with the new constructor pattern.fault-remediation/main.go (3)
210-285: LGTM! Flag definitions are clear and complete.The flag definitions comprehensively cover controller-runtime configuration, leader election parameters, and service addresses with helpful descriptions.
164-208: LGTM! Controller-runtime manager setup is correct.The manager is properly configured with metrics, health checks, leader election, and the reconciler is correctly wired via SetupWithManager. The manager correctly uses the provided context parameter.
76-118: LGTM! Initialization and setup flow is well-structured.The validation that leader-election requires controller-runtime is appropriate, initialization is clean, and the deferred cleanup properly logs errors.
fault-remediation/pkg/reconciler/reconciler_e2e_test.go (2)
56-102: LGTM! Mock watcher has proper concurrency protection.The mutex guards ensure thread-safe access to counters when the reconciler and tests interact concurrently.
367-395: LGTM! Test patterns correctly use the new public reconciler API.Tests properly construct
FaultRemediationReconcilervia the public constructor and invoke methods through the reconciler's public surface with appropriate error handling.Also applies to: 427-461, 480-505, 554-578
fault-remediation/pkg/reconciler/reconciler.go (4)
95-133: LGTM! Reconcile method is well-documented and correctly implements controller-runtime interface.The method properly parses events, handles cancellation vs remediation, and returns appropriate
ctrl.Resultvalues for controller-runtime.
543-572: LGTM! SetupWithManager correctly wires the reconciler with controller-runtime.The method properly starts the watcher with the provided context and registers the reconciler using a typed channel source, enabling controller-runtime management.
574-602: LGTM! AdaptEvents correctly handles context cancellation and channel lifecycle.The adapter properly monitors both context cancellation and input channel closure, ensuring clean shutdown and no goroutine leaks.
Based on learnings requiring proper shutdown handling with context cancellation.
64-75: LGTM! All exported types and functions have proper godoc comments.The struct, constructor, and lifecycle methods are well-documented per Go conventions.
As per coding guidelines, function comments are required for all exported Go functions.
Also applies to: 77-93, 518-523
|
/ok to test 260636b |
Merging this branch will decrease overall coverage
Coverage by fileChanged files (no unit tests)
Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code. Changed unit test files
|
joejulian
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great! Thanks Igor.
Signed-off-by: Igor Velichkovich <[email protected]> Signed-off-by: Igor Velichkovich <[email protected]>
260636b to
5ab3169
Compare
|
/ok to test 5ab3169 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
β»οΈ Duplicate comments (3)
fault-remediation/pkg/initializer/init.go (1)
99-113: Resource leak: datastore not closed if watcher creation fails.If
watcher.CreateChangeStreamWatcherfails on line 110, the datastore created on line 99 is not closed, potentially leaking database connections and goroutines. This issue was flagged in a previous review.Apply this diff to ensure proper cleanup:
ds, err := datastore.NewDataStore(ctx, *datastoreConfig) if err != nil { return nil, fmt.Errorf("error initializing datastore: %w", err) } // Create watcher using the factory pattern watcherConfig := watcher.WatcherConfig{ Pipeline: pipeline, CollectionName: "HealthEvents", } watcherInstance, err := watcher.CreateChangeStreamWatcher(ctx, ds, watcherConfig) if err != nil { + if closeErr := ds.Close(ctx); closeErr != nil { + slog.Error("failed to close datastore after watcher creation failure", "error", closeErr) + } return nil, fmt.Errorf("error initializing change stream watcher: %w", err) }fault-remediation/main.go (1)
162-173: Event loop still doesn't respect context cancellation.The
for event := range ...loop blocks until the channel closes. If the context is cancelled (e.g., SIGTERM), this goroutine won't exit untilWatcher.Events()channel is closed. This was flagged in a past review.Consider using a select statement:
g.Go(func() error { components.FaultRemediationReconciler.StartWatcherStream(gCtx) slog.Info("Listening for events on the channel...") - - for event := range components.FaultRemediationReconciler.Watcher.Events() { - slog.Info("Event received", "event", event) - _, _ = components.FaultRemediationReconciler.Reconcile(gCtx, &event) + eventCh := components.FaultRemediationReconciler.Watcher.Events() + for { + select { + case <-gCtx.Done(): + slog.Info("Context cancelled, stopping event processing") + return nil + case event, ok := <-eventCh: + if !ok { + return nil + } + slog.Info("Event received", "event", event) + _, _ = components.FaultRemediationReconciler.Reconcile(gCtx, &event) + } } - - return nil })fault-remediation/pkg/reconciler/reconciler_e2e_test.go (1)
245-249: Avoidlog.Fatalfinside goroutine.Using
log.Fatalfin a goroutine will abruptly terminate the test process without proper cleanup or useful error context. This was flagged in a past review. Consider using a channel for error signaling or at minimum usinglog.Printfto log the error while allowing tests to fail naturally.+ mgrReady := make(chan struct{}) + mgrErr := make(chan error, 1) go func() { + close(mgrReady) if err := mgr.Start(testContext); err != nil { - log.Fatalf("Failed to start the test environment manager: %v", err) + mgrErr <- err } }() + <-mgrReady
π§Ή Nitpick comments (6)
distros/kubernetes/nvsentinel/charts/fault-remediation/templates/clusterrole.yaml (1)
49-55: Consider removing unnecessary quotes for consistency.The events rule is correctly added for controller-runtime event recording. However, the quotes around
"events","create", and"patch"are inconsistent with the unquoted style used in other rules (e.g.,nodes,get,list).- apiGroups: - "" resources: - - "events" + - events verbs: - - "create" - - "patch" + - create + - patchfault-remediation/pkg/reconciler/reconciler_test.go (1)
784-789: Trailing whitespace/formatting issue.There appears to be inconsistent spacing in the config struct definition. Consider ensuring consistent formatting.
cfg := ReconcilerConfig{ RemediationClient: mockK8sClient, UpdateMaxRetries: 1, UpdateRetryDelay: 0, - } + }fault-remediation/main.go (1)
262-268: Clarify lease duration flag description.The help text says "Interval at which non-leader candidates will wait to force acquire leadership" but
LeaseDurationin controller-runtime is the duration that non-leader candidates will wait before attempting to acquire leadership after observing the leader. The phrasing could be clearer.flag.DurationVar( &leaderElectionLeaseDuration, "leader-elect-lease-duration", 15*time.Second, - "Interval at which non-leader candidates will wait to force acquire leadership (duration string). "+ + "Duration that the leader is valid before a new election. "+ "Requires controller-runtime enabled.", )fault-remediation/pkg/reconciler/reconciler.go (3)
215-216: TODO comment should reference an issue.Per coding guidelines, TODO comments should reference issues. Consider linking to a tracking issue for the retry simplification work.
- //TODO: return error to use built in ctrl runtime retries + //TODO(#NNN): return error to use built-in ctrl-runtime retries
339-339: Discarded error from log collector may mask failures.The error from
runLogCollectoris silently discarded. Consider logging a warning or metric to track log collector failures without blocking remediation.- _ = r.runLogCollector(ctx, healthEvent) + if err := r.runLogCollector(ctx, healthEvent); err != nil { + slog.Warn("Log collector failed, continuing with remediation", "node", nodeName, "error", err) + }
530-538: Add nil guards for safer test usage.
CloseAllunconditionally callsr.ds.Close(ctx)andr.Watcher.Close(ctx). Tests may constructFaultRemediationReconcilerwith nil fields, causing panics. This was flagged in a previous review but not addressed.- if err := r.ds.Close(ctx); err != nil { - slog.Error("failed to close datastore", "error", err) - errs = append(errs, err) - } - - if err := r.Watcher.Close(ctx); err != nil { - slog.Error("failed to close Watcher", "error", err) - errs = append(errs, err) - } + if r.ds != nil { + if err := r.ds.Close(ctx); err != nil { + slog.Error("failed to close datastore", "error", err) + errs = append(errs, err) + } + } + + if r.Watcher != nil { + if err := r.Watcher.Close(ctx); err != nil { + slog.Error("failed to close Watcher", "error", err) + errs = append(errs, err) + } + }
π Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
β Files ignored due to path filters (5)
commons/go.sumis excluded by!**/*.sumevent-exporter/go.sumis excluded by!**/*.sumfault-remediation/go.sumis excluded by!**/*.sumhealth-events-analyzer/go.sumis excluded by!**/*.sumhealth-monitors/syslog-health-monitor/go.sumis excluded by!**/*.sum
π Files selected for processing (19)
commons/go.mod(2 hunks)commons/pkg/server/server.go(2 hunks)distros/kubernetes/nvsentinel/charts/fault-remediation/templates/clusterrole.yaml(2 hunks)distros/kubernetes/nvsentinel/charts/fault-remediation/templates/deployment.yaml(1 hunks)distros/kubernetes/nvsentinel/values-full.yaml(1 hunks)distros/kubernetes/nvsentinel/values-tilt-mongodb.yaml(1 hunks)distros/kubernetes/nvsentinel/values-tilt.yaml(2 hunks)event-exporter/go.mod(1 hunks)fault-remediation/go.mod(2 hunks)fault-remediation/main.go(3 hunks)fault-remediation/pkg/initializer/init.go(5 hunks)fault-remediation/pkg/reconciler/metrics.go(4 hunks)fault-remediation/pkg/reconciler/reconciler.go(17 hunks)fault-remediation/pkg/reconciler/reconciler_e2e_test.go(43 hunks)fault-remediation/pkg/reconciler/reconciler_test.go(16 hunks)fault-remediation/pkg/reconciler/remediation.go(1 hunks)health-events-analyzer/go.mod(1 hunks)health-monitors/syslog-health-monitor/go.mod(1 hunks)tests/fault_remediation_test.go(1 hunks)
π§ Files skipped from review as they are similar to previous changes (7)
- health-monitors/syslog-health-monitor/go.mod
- distros/kubernetes/nvsentinel/values-tilt-mongodb.yaml
- commons/pkg/server/server.go
- distros/kubernetes/nvsentinel/values-tilt.yaml
- distros/kubernetes/nvsentinel/values-full.yaml
- fault-remediation/pkg/reconciler/metrics.go
- fault-remediation/go.mod
π§° Additional context used
π Path-based instructions (3)
**/go.mod
π CodeRabbit inference engine (.github/copilot-instructions.md)
Use
go.modfor each service as a separate Go module with semantic import versioning
Files:
event-exporter/go.modhealth-events-analyzer/go.modcommons/go.mod
**/*.go
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging vialog/slogin Go code
Wrap errors with context usingfmt.Errorf("context: %w", err)in Go code
Withinretry.RetryOnConflictblocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such assyncedoverokfor cache sync checks
Useclient-gofor Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types
Files:
fault-remediation/main.gotests/fault_remediation_test.gofault-remediation/pkg/initializer/init.gofault-remediation/pkg/reconciler/reconciler_e2e_test.gofault-remediation/pkg/reconciler/reconciler_test.gofault-remediation/pkg/reconciler/remediation.gofault-remediation/pkg/reconciler/reconciler.go
**/*_test.go
π CodeRabbit inference engine (.github/copilot-instructions.md)
**/*_test.go: Useenvtestfor testing Kubernetes controllers instead of fake clients
Usetestify/assertandtestify/requirefor assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format:TestFunctionName_Scenario_ExpectedBehavior
Files:
tests/fault_remediation_test.gofault-remediation/pkg/reconciler/reconciler_e2e_test.gofault-remediation/pkg/reconciler/reconciler_test.go
π§ Learnings (16)
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/go.mod : Use `go.mod` for each service as a separate Go module with semantic import versioning
Applied to files:
event-exporter/go.modhealth-events-analyzer/go.modcommons/go.mod
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Keep Go dependencies minimal and up-to-date
Applied to files:
event-exporter/go.modhealth-events-analyzer/go.modcommons/go.mod
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use `client-go` for Kubernetes API interactions in Go code
Applied to files:
event-exporter/go.modhealth-events-analyzer/go.modcommons/go.mod
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Prefer informers over direct API calls for watching Kubernetes resources
Applied to files:
event-exporter/go.modhealth-events-analyzer/go.modfault-remediation/main.gofault-remediation/pkg/reconciler/remediation.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Use `commons/` for shared utilities across Go modules
Applied to files:
health-events-analyzer/go.modcommons/go.mod
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Implement proper shutdown handling with context cancellation in Go code
Applied to files:
fault-remediation/main.gofault-remediation/pkg/initializer/init.gofault-remediation/pkg/reconciler/reconciler_e2e_test.gofault-remediation/pkg/reconciler/reconciler.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients
Applied to files:
commons/go.modfault-remediation/pkg/reconciler/reconciler_e2e_test.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Name Go tests descriptively using format: `TestFunctionName_Scenario_ExpectedBehavior`
Applied to files:
tests/fault_remediation_test.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use meaningful variable names such as `synced` over `ok` for cache sync checks
Applied to files:
fault-remediation/pkg/reconciler/reconciler_e2e_test.go
π Learning: 2025-11-06T16:18:09.952Z
Learnt from: KaivalyaMDabhadkar
Repo: NVIDIA/NVSentinel PR: 280
File: tests/fault_management_test.go:334-345
Timestamp: 2025-11-06T16:18:09.952Z
Learning: In the fault-quarantine component, the "quarantinedNodeUncordonedManually" annotation is set to the string literal "True" (with uppercase T), defined as the constant QuarantinedNodeUncordonedManuallyAnnotationValue in fault-quarantine/pkg/common/common.go. Tests should compare against "True", not "true".
Applied to files:
fault-remediation/pkg/reconciler/reconciler_e2e_test.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Within `retry.RetryOnConflict` blocks, return errors without wrapping to preserve retry behavior
Applied to files:
fault-remediation/pkg/reconciler/reconciler.go
π Learning: 2025-12-01T17:53:20.067Z
Learnt from: jtschelling
Repo: NVIDIA/NVSentinel PR: 455
File: docs/designs/013-remediation-plugins.md:162-204
Timestamp: 2025-12-01T17:53:20.067Z
Learning: Code examples in design documents (docs/designs/) in the NVSentinel repository are illustrative stubs meant to show intent and architectural concepts, and do not need to be 100% correct or complete implementations (e.g., undefined variables, missing error handling, incomplete struct initialization are acceptable).
Applied to files:
fault-remediation/pkg/reconciler/reconciler.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Function comments required for all exported Go functions
Applied to files:
fault-remediation/pkg/reconciler/reconciler.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : TODO comments should reference issues in Go code
Applied to files:
fault-remediation/pkg/reconciler/reconciler.go
π Learning: 2025-10-29T15:37:49.210Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 160
File: health-events-analyzer/pkg/reconciler/metrics.go:52-58
Timestamp: 2025-10-29T15:37:49.210Z
Learning: In health-events-analyzer/pkg/reconciler/metrics.go, the ruleMatchedTotal metric should include both "rule_name" and "node_name" labels to identify which rule triggered on which node and how many times, as per the project's observability requirements.
Applied to files:
fault-remediation/pkg/reconciler/reconciler.go
π Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/values.yaml : Note truthy value requirements in Helm chart documentation where applicable
Applied to files:
distros/kubernetes/nvsentinel/charts/fault-remediation/templates/deployment.yaml
𧬠Code graph analysis (5)
fault-remediation/main.go (5)
node-drainer/pkg/config/config.go (1)
Duration(36-38)fault-remediation/pkg/initializer/init.go (3)
InitializationParams(34-39)InitializeAll(46-135)Components(41-43)fault-remediation/pkg/reconciler/reconciler.go (1)
FaultRemediationReconciler(67-75)commons/pkg/server/server.go (4)
NewServer(375-396)WithPort(151-153)WithPrometheusMetricsCtrlRuntime(227-234)WithSimpleHealth(246-254)health-monitors/kubernetes-object-monitor/pkg/annotations/manager.go (1)
NewManager(41-43)
fault-remediation/pkg/initializer/init.go (4)
fault-remediation/pkg/reconciler/reconciler.go (2)
FaultRemediationReconciler(67-75)NewFaultRemediationReconciler(78-93)store-client/pkg/datastore/registry.go (1)
NewDataStore(120-123)store-client/pkg/watcher/factory.go (2)
WatcherConfig(26-38)CreateChangeStreamWatcher(65-87)store-client/pkg/datastore/interfaces.go (1)
HealthEventStore(63-87)
fault-remediation/pkg/reconciler/reconciler_e2e_test.go (6)
store-client/pkg/datastore/types.go (1)
EventWithToken(101-104)data-models/pkg/model/health_event_extentions.go (1)
HealthEventStatus(45-51)fault-remediation/pkg/reconciler/reconciler.go (3)
FaultRemediationReconciler(67-75)ReconcilerConfig(42-51)NewFaultRemediationReconciler(78-93)commons/pkg/statemanager/statemanager.go (2)
StateManager(197-200)NewStateManager(206-210)health-monitors/kubernetes-object-monitor/pkg/annotations/manager.go (1)
NewManager(41-43)fault-remediation/pkg/reconciler/remediation.go (1)
FaultRemediationClient(58-70)
fault-remediation/pkg/reconciler/reconciler_test.go (4)
fault-remediation/pkg/reconciler/reconciler.go (3)
NewFaultRemediationReconciler(78-93)HealthEventDoc(53-56)ReconcilerConfig(42-51)node-drainer/pkg/config/config.go (1)
ReconcilerConfig(162-167)commons/pkg/statemanager/statemanager.go (1)
StateManager(197-200)data-models/pkg/protos/health_event.pb.go (1)
RecommendedAction_NONE(43-43)
fault-remediation/pkg/reconciler/reconciler.go (9)
janitor/pkg/csp/aws/aws.go (1)
Client(48-50)store-client/pkg/storewatcher/watch_store.go (1)
ChangeStreamWatcher(43-57)store-client/pkg/datastore/interfaces.go (1)
HealthEventStore(63-87)node-drainer/pkg/config/config.go (1)
ReconcilerConfig(162-167)fault-remediation/pkg/reconciler/annotation.go (1)
NodeAnnotationManagerInterface(38-43)data-models/pkg/model/health_event_extentions.go (4)
HealthEventStatus(45-51)UnQuarantined(34-34)Cancelled(37-37)Status(23-23)commons/pkg/statemanager/statemanager.go (1)
StateManager(197-200)data-models/pkg/protos/health_event.pb.go (1)
RecommendedAction_NONE(43-43)health-monitors/kubernetes-object-monitor/pkg/annotations/manager.go (1)
Manager(37-39)
π Additional comments (29)
fault-remediation/pkg/reconciler/remediation.go (1)
388-390: LGTM!This is a minor formatting change adding a blank line for readability. No functional impact.
health-events-analyzer/go.mod (1)
89-93: LGTM!The indirect dependency additions, including
sigs.k8s.io/controller-runtime v0.22.4, are transitive dependencies from the commons module's controller-runtime integration. This is expected behavior from the dependency graph expansion.event-exporter/go.mod (1)
80-84: LGTM!The indirect dependency additions are consistent with the controller-runtime integration in the commons module. No direct dependency changes.
distros/kubernetes/nvsentinel/charts/fault-remediation/templates/deployment.yaml (2)
90-100: LGTM!The conditional controller-runtime configuration is well-structured:
--controller-runtime=trueand--leader-elect=trueargs are properly gated.- Health port binding is correctly conditional with a sensible default (9440).
- Nil-safe access pattern
((.Values.global).ctrlRuntimeEnabled)is used consistently.
101-116: LGTM!The probe configuration correctly adapts to controller-runtime mode:
- livenessProbe uses
/healthzon the appropriate port.- readinessProbe uses
/readyz(controller-runtime standard) vs/healthz(legacy) with matching ports.The nil-safe ternary pattern is applied consistently, addressing previous review feedback.
fault-remediation/pkg/initializer/init.go (3)
31-31: LGTM!The watcher import is correctly added to support the new change stream watcher initialization.
130-134: LGTM!The initialization return correctly uses the new
NewFaultRemediationReconcilerconstructor with all required dependencies (datastore, watcher, healthEventStore, config, dryRun flag).
41-43: Components struct uses value type instead of pointer.The
FaultRemediationReconcilerfield is now a value type rather than a pointer. Verify that all callers accessingComponents.FaultRemediationReconcilerhandle this appropriatelyβparticularly checking for any unintended pointer operations or dereferencing patterns that may have assumed a pointer type.commons/go.mod (1)
18-18: No action needed β controller-runtime v0.22.4 is the current latest stable release.The dependency
sigs.k8s.io/controller-runtime v0.22.4is appropriate; this is the latest version released November 3, 2025, and suitable for the FaultRemediationReconciler integration.distros/kubernetes/nvsentinel/charts/fault-remediation/templates/clusterrole.yaml (1)
31-39: LGTM! Necessary RBAC for controller-runtime leader election.The leases permission in the
coordination.k8s.ioAPI group is required for controller-runtime's leader election mechanism. The verbs (get, create, update, patch) are the standard set needed.tests/fault_remediation_test.go (1)
28-28: Good catch fixing the typo in the test function name.Correcting "Reemdiated" to "Remediated" improves discoverability and readability. Based on coding guidelines, consider adopting the naming format
TestFunctionName_Scenario_ExpectedBehaviorfor even clearer test intent (e.g.,TestNewCRsAreCreated_AfterFaultsRemediated).fault-remediation/pkg/reconciler/reconciler_test.go (3)
219-221: Constructor migration looks correct.The test correctly uses the new
NewFaultRemediationReconcilerconstructor with nil dependencies (ds, watcher, healthEventStore) since this unit test only validates construction and the dry-run flag, not actual datastore operations.
377-378: LGTM! Error handling correctly added forperformRemediation.The updated signature returning
(bool, string, error)is properly handled withassert.NoError(t, err). This correctly tests the success path where remediation should complete without error.
588-589: Accessing config fields throughr.configis appropriate.The test correctly accesses
r.config.EnableLogCollectorandr.config.RemediationClientafter the refactoring moved these to unexported fields accessed via the config struct.fault-remediation/main.go (3)
92-94: Good constraint: leader-election requires controller-runtime.This validation correctly prevents misconfiguration where leader election is enabled without controller-runtime, which would fail at runtime.
113-117: Proper cleanup with deferred close.Using
deferfor closing datastore components ensures cleanup happens regardless of exit path. Error logging on failure is appropriate.
178-222: Controller-runtime setup looks well-structured.The manager configuration properly includes:
- Metrics server with configurable bind address
- Health/ready probe endpoints
- Leader election with configurable parameters
- Proper error handling and logging
The use of
mgr.Start(ctx)with the signal context addresses the previous review concern about dual signal handlers.fault-remediation/pkg/reconciler/reconciler_e2e_test.go (7)
56-61: Good addition of mutex for thread safety.Adding
sync.Mutexto guard access tomarkProcessedCountprevents data races between the reconciler goroutine writing and test assertions reading. This is a proper fix for concurrent access.
82-87: Thread-safe MarkProcessed implementation.The mutex lock/unlock correctly guards the increment of
markProcessedCount.
99-101: Thread-safe GetCallCounts implementation.The mutex ensures consistent reads of counter values during test assertions.
206-230: TestMain setup is comprehensive.The test initialization correctly:
- Creates a remediation client with error handling
- Configures the ReconcilerConfig with StateManager and retry settings
- Sets up mock health event store with instrumentation
- Creates the reconciler using the new constructor
297-303: Good refactoring to return errors fromcreateTestRemediationClient.Returning
(*FaultRemediationClient, error)instead of usinglog.Fatalfallows callers to handle errors gracefully in test assertions.
376-380: Test refactored to use struct literal correctly.Using
FaultRemediationReconciler{config: cfg, annotationManager: ...}aligns with the new private field structure. This pattern is consistently applied across tests.
1347-1359: Test refactor forTestMetrics_ProcessingErrorslooks correct.The test now uses a pointer to
FaultRemediationReconcilerwith only theWatcherfield set, which is sufficient for testing the error path inReconcilewhen processing invalid events.fault-remediation/pkg/reconciler/reconciler.go (5)
64-93: Clean struct definition and constructor implementation.The godoc comments are properly added, and the constructor correctly initializes all fields except the embedded
client.Client, which is documented as needing to be set separately for ctrl-runtime mode.
95-133: Reconcile method properly orchestrates event processing.The flow correctly handles parsing, nil checks, and dispatches to either cancellation or remediation handlers. Returning
nilerror for unparseable events prevents hot-looping (per past discussion).
279-284: Consider usingctxinstead ofcontext.Background()forMarkProcessed.The function receives
ctxbut usescontext.Background()forMarkProcessed. This prevents proper cancellation during shutdown. If the intent is to ensure the token is always persisted, consider adding a brief comment explaining why.- if err := watcherInstance.MarkProcessed(context.Background(), resumeToken); err != nil { + if err := watcherInstance.MarkProcessed(ctx, resumeToken); err != nil {
547-572: Well-structured controller-runtime integration.The implementation correctly uses the provided context for watcher lifecycle, properly adapts events to the controller-runtime channel source pattern, and the typed handler cleanly enqueues events.
574-602: Correct channel adaptation with proper memory safety.Good practice:
eventOut := eon line 595 correctly copies the loop variable before taking its pointer, avoiding the classic Go closure/loop-variable bug. The goroutine properly handles context cancellation and channel closure.
Merging this branch will decrease overall coverage
Coverage by fileChanged files (no unit tests)
Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code. Changed unit test files
|
|
congrats @ivelichkovich !! |
Summary
Type of Change
Component(s) Affected
Testing
Checklist
Summary by CodeRabbit
New Features
Configuration
Bug Fixes
Chores
Tests
βοΈ Tip: You can customize this high-level summary in your review settings.