Skip to content

Conversation

@ivelichkovich
Copy link
Contributor

@ivelichkovich ivelichkovich commented Nov 25, 2025

Summary

Type of Change

  • πŸ› Bug fix
  • ✨ New feature
  • πŸ’₯ Breaking change
  • πŸ“š Documentation
  • πŸ”§ Refactoring
  • πŸ”¨ Build/CI

Component(s) Affected

  • Core Services
  • Documentation/CI
  • Fault Management
  • Health Monitors
  • Janitor
  • Other: ____________

Testing

  • Tests pass locally
  • Manual testing completed
  • No breaking changes (or documented)

Checklist

  • Self-review completed
  • Documentation updated (if needed)
  • Ready for review

Summary by CodeRabbit

  • New Features

    • Optional controller-runtime management mode with manager integration and leader-election support
    • Controller-runtime-aware metrics and an additional health port when enabled
  • Configuration

    • New ctrlRuntimeEnabled flag (default: false), healthPort (default: 9440), and leader-election flags; dev tilt values enable ctrl-runtime
    • RBAC expanded to allow leases and event creation/patching
  • Bug Fixes

    • Fixed typo in a test name
  • Chores

    • Expanded indirect dependencies across multiple modules
  • Tests

    • Tests and e2e scaffolding updated to run with manager-enabled flow and new reconciler wiring

✏️ Tip: You can customize this high-level summary in your review settings.

@copy-pr-bot
Copy link

copy-pr-bot bot commented Nov 25, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Nov 25, 2025

Walkthrough

Adds controller-runtime integration and a new public FaultRemediationReconciler with dual run modes (controller-managed and non-managed), shifts metrics registration to use the controller-runtime registry, updates initializer/main wiring and flags, augments Helm RBAC/deployment/values for ctrl-runtime, and expands multiple go.mod dependency graphs.

Changes

Cohort / File(s) Summary
Dependency manifests
commons/go.mod, event-exporter/go.mod, fault-remediation/go.mod, health-events-analyzer/go.mod, health-monitors/syslog-health-monitor/go.mod
Expanded dependency graphs: added/updated numerous indirect modules including sigs.k8s.io/controller-runtime and various k8s/OpenAPI/transitive modules.
Server metrics option
commons/pkg/server/server.go
Added WithPrometheusMetricsCtrlRuntime() which registers /metrics using the controller-runtime metrics registry (crmetrics.Registry) via promauto.With(...).
Reconciler refactor & metrics registration
fault-remediation/pkg/reconciler/reconciler.go, fault-remediation/pkg/reconciler/metrics.go, .../reconciler_*test.go
Replaced internal Reconciler with public FaultRemediationReconciler; added NewFaultRemediationReconciler(...), ctrl-runtime Reconcile signature, lifecycle helpers (SetupWithManager, StartWatcherStream, CloseAll, AdaptEvents); moved metrics to promauto.With(crmetrics.Registry); updated tests to use the new public constructor and manager wiring.
Main: dual run modes & flags
fault-remediation/main.go
Introduced controller-runtime-managed and non-managed execution paths, new flags (--controller-runtime, --leader-elect and leader-election timing/namespace), validation that leader-election requires controller-runtime, and dedicated setup functions.
Initializer & components wiring
fault-remediation/pkg/initializer/init.go
Initializer now creates datastore, change-stream watcher, and healthEventStore; Components exposes FaultRemediationReconciler (field renamed); reconciler constructed via NewFaultRemediationReconciler(...).
Helm charts & values
distros/kubernetes/nvsentinel/charts/fault-remediation/templates/clusterrole.yaml, .../deployment.yaml, distros/kubernetes/nvsentinel/values-full.yaml, distros/kubernetes/nvsentinel/values-tilt.yaml, distros/kubernetes/nvsentinel/values-tilt-mongodb.yaml
Added RBAC rules for coordination.k8s.io/leases and core events; conditional container args (--controller-runtime, --leader-elect); conditional health/metrics ports and probe paths; added ctrlRuntimeEnabled, healthPort, and tilt replicaCount.
Tests & scaffolding
fault-remediation/pkg/reconciler/reconciler_e2e_test.go, fault-remediation/pkg/reconciler/reconciler_test.go, tests/fault_remediation_test.go
Tests updated to use NewFaultRemediationReconciler(...), adjusted field access (r.config, r.dryRun), performRemediation now returns (bool,string,error); e2e tests run a controller-runtime Manager; one test renamed to fix a typo; mocks gain mutexes/counters for concurrency.
Minor formatting
fault-remediation/pkg/reconciler/remediation.go
Small formatting change (blank line); no behavioral change.

Sequence Diagram(s)

sequenceDiagram
    participant App as main
    participant Manager as ctrl-runtime Manager
    participant Reconciler as FaultRemediationReconciler
    participant Watcher as ChangeStreamWatcher
    participant Datastore as Datastore
    participant Metrics as Metrics/Health Server

    rect rgb(220,240,220)
    Note over App,Manager: Controller-runtime managed path
    App->>Manager: Start manager (leader-election optional)
    Manager->>Reconciler: SetupWithManager(mgr)
    Watcher->>Reconciler: AdaptEvents -> typed events
    Reconciler->>Datastore: Parse & process event (Reconcile)
    Reconciler->>Datastore: Update remediation state / create CRs
    Manager->>Reconciler: Health/Ready probes invoked
    end

    rect rgb(240,230,220)
    Note over App,Metrics: Non-controller-runtime path
    App->>Metrics: Start metrics/health server (promauto.With(crmetrics.Registry))
    Reconciler->>Watcher: StartWatcherStream(ctx)
    loop event loop
      Watcher->>Reconciler: emit EventWithToken
      Reconciler->>Datastore: Parse & process event
      Reconciler->>Datastore: Update remediation state
    end
    App->>Reconciler: Shutdown
    Reconciler->>Datastore: CloseAll()
    end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

  • Pay special attention to:
    • fault-remediation/pkg/reconciler/reconciler.go β€” new public type, ctrl-runtime Reconcile flow, SetupWithManager, AdaptEvents, and lifecycle changes.
    • fault-remediation/main.go β€” dual-mode branching, flag parsing, and leader-election validation.
    • fault-remediation/pkg/reconciler/metrics.go and commons/pkg/server/server.go β€” metrics registry switch and potential duplicate registration.
    • Helm templates β€” RBAC additions and conditional probe/port logic.
    • go.mod updates across modules β€” version pins and transitive dependency alignment.

Poem

πŸ‡ I hopped in with a registry bright and neat,

Metrics humming, leader beats, and probes to meet,
A reconciler listens to the stream tonight,
Two paths to run β€” one manager, one light,
I nibble logs and cheer the cluster's heartbeat.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 35.14% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
βœ… Passed checks (2 passed)
Check name Status Explanation
Description Check βœ… Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check βœ… Passed The title 'use ctrl-runtime for fault-remediation' directly and clearly describes the main change in the PR: integrating controller-runtime across the fault-remediation component.
✨ Finishing touches
  • πŸ“ Generate docstrings
πŸ§ͺ Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

πŸ”§ golangci-lint (2.5.0)

level=error msg="[linters_context] typechecking error: pattern ./...: directory prefix . does not contain main module or its selected dependencies"


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❀️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@ivelichkovich ivelichkovich force-pushed the ctrlruntime branch 3 times, most recently from cfa5826 to 783b2df Compare November 25, 2025 22:24
@ivelichkovich ivelichkovich force-pushed the ctrlruntime branch 4 times, most recently from 64193af to f722b46 Compare December 3, 2025 19:07
@ivelichkovich ivelichkovich marked this pull request as ready for review December 3, 2025 19:07
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 8

🧹 Nitpick comments (13)
health-events-analyzer/go.mod (1)

89-89: Verify whether controller-runtime should be a direct dependency.

Line 89 lists sigs.k8s.io/controller-runtime v0.22.4 // indirect. According to the PR objectives, this PR integrates controller-runtime for fault-remediation. If the fault-remediation component actively imports and uses controller-runtime APIs, it should be declared as a direct dependency in the require section (lines 7–18), not as an indirect dependency.

Review the actual code changes to confirm whether controller-runtime is directly used in this module. If so, move it to the direct requires section.

commons/pkg/server/server.go (1)

218-234: Improve documentation to clarify usage and differences.

The implementation looks correct, but the documentation could be clearer:

  1. The example code still references WithPrometheusMetrics() instead of WithPrometheusMetricsCtrlRuntime()
  2. It doesn't explain when to use this option versus WithPrometheusMetrics()
  3. It should note that these two options are mutually exclusive (both register /metrics)

Consider updating the documentation to:

  • Explain that this option uses the controller-runtime metrics registry instead of the default Prometheus registry
  • Indicate when to prefer this option (e.g., "Use this when integrating with controller-runtime to consolidate metrics from controller-runtime and custom metrics in a single registry")
  • Note that only one metrics option should be used at a time

Apply this diff to improve the documentation:

-// WithPrometheusMetricsCtrlRuntime registers the Prometheus metrics handler at /metrics.
-// This uses the controller runtime Prometheus registry.
+// WithPrometheusMetricsCtrlRuntime registers the Prometheus metrics handler at /metrics
+// using the controller-runtime metrics registry (sigs.k8s.io/controller-runtime/pkg/metrics).
+// 
+// Use this option when integrating with controller-runtime to consolidate metrics from
+// controller-runtime (e.g., workqueue, leader election) and custom application metrics.
+// Note: This option is mutually exclusive with WithPrometheusMetrics() - use only one.
 //
 // Example:
 //
 //	srv := NewServer(
 //	    WithPort(2112),
-//	    WithPrometheusMetrics(),
+//	    WithPrometheusMetricsCtrlRuntime(),
 //	)
distros/kubernetes/nvsentinel/charts/fault-remediation/templates/clusterrole.yaml (2)

31-55: Consider making RBAC rules conditional on ctrlRuntimeEnabled.

The new RBAC rules for leases and events are required for controller-runtime leader election and event recording. However, they are unconditionally added regardless of whether ctrlRuntimeEnabled is true. For least-privilege, consider wrapping these in a conditional:

{{- if (.Values.global).ctrlRuntimeEnabled }}
- apiGroups:
    - "coordination.k8s.io"
  resources:
    - leases
  verbs:
    - get
    - create
    - update
    - patch
- apiGroups:
  - ""
  resources:
  - "events"
  verbs:
  - "create"
  - "patch"
{{- end }}

31-39: Minor formatting inconsistency.

The new coordination.k8s.io rule uses 4-space indentation for apiGroups, resources, and verbs, while other rules in this file use 2-space indentation. Consider aligning for consistency:

 - apiGroups:
-    - "coordination.k8s.io"
+  - "coordination.k8s.io"
   resources:
-    - leases
+  - leases
   verbs:
-    - get
-    - create
-    - update
-    - patch
+  - get
+  - create
+  - update
+  - patch
fault-remediation/pkg/initializer/init.go (2)

99-116: Resource cleanup on error paths.

The datastore and watcher are created but if subsequent initialization fails (e.g., reconciler creation), these resources are not cleaned up. Consider adding cleanup logic or using defer patterns.

Example approach:

ds, err := datastore.NewDataStore(ctx, *datastoreConfig)
if err != nil {
    return nil, fmt.Errorf("error initializing datastore: %w", err)
}

// Ensure cleanup on subsequent errors
var cleanupNeeded bool
defer func() {
    if cleanupNeeded {
        if closeErr := ds.Close(ctx); closeErr != nil {
            slog.Error("failed to close datastore during cleanup", "error", closeErr)
        }
    }
}()
cleanupNeeded = true

// ... rest of initialization ...

cleanupNeeded = false // Success, don't cleanup
return &Components{...}, nil

105-108: Consider extracting the collection name to a constant.

The "HealthEvents" collection name is hardcoded. Consider defining this as a constant to ensure consistency across the codebase and make it easier to maintain.

fault-remediation/main.go (3)

17-39: Non-standard import ordering.

Go convention groups imports: stdlib, then a blank line, then external packages. Currently stdlib and external imports are interleaved (e.g., errors, flag, fmt mixed with github.com/nvidia/..., golang.org/x/sync/errgroup).

 import (
 	"context"
 	"errors"
 	"flag"
 	"fmt"
-	"github.com/nvidia/nvsentinel/commons/pkg/server"
-	"golang.org/x/sync/errgroup"
-	"k8s.io/apimachinery/pkg/runtime"
 	"log/slog"
 	"os"
 	"os/signal"
-	"sigs.k8s.io/controller-runtime/pkg/healthz"
-	metricsserver "sigs.k8s.io/controller-runtime/pkg/metrics/server"
 	"strconv"
 	"strings"
 	"syscall"
 	"time"
 
+	"golang.org/x/sync/errgroup"
+	"k8s.io/apimachinery/pkg/runtime"
 	ctrl "sigs.k8s.io/controller-runtime"
+	"sigs.k8s.io/controller-runtime/pkg/healthz"
+	metricsserver "sigs.k8s.io/controller-runtime/pkg/metrics/server"
 
+	"github.com/nvidia/nvsentinel/commons/pkg/server"
 	"github.com/nvidia/nvsentinel/commons/pkg/logger"
 	"github.com/nvidia/nvsentinel/fault-remediation/pkg/initializer"
 )

263-266: Unnecessary explicit return statement.

The explicit return at the end of a function with no return values is redundant.

 	flag.Parse()
-
-	return
 }

195-198: Inconsistent error wrapping.

Other errors in this function are wrapped with context (e.g., line 191), but the manager start error is returned unwrapped. As per coding guidelines, wrap errors with context using fmt.Errorf.

 	if err = mgr.Start(ctrl.SetupSignalHandler()); err != nil {
 		slog.Error("Problem running manager", "error", err)
-		return err
+		return fmt.Errorf("manager start failed: %w", err)
 	}
fault-remediation/pkg/reconciler/reconciler_e2e_test.go (3)

230-232: Inconsistent error handling: panic vs log.Fatalf.

Manager creation errors use panic(err) while other setup errors use log.Fatalf. For consistency, use log.Fatalf throughout TestMain.

 	mgr, err := ctrl.NewManager(testEnv.Config, ctrl.Options{
 		Scheme: scheme.Scheme,
 		Metrics: metricsserver.Options{
 			BindAddress: "0",
 		},
 	})
 	if err != nil {
-		panic(err)
+		log.Fatalf("Failed to create manager: %v", err)
 	}

1346-1348: Remove commented-out code.

The commented //mockStore := &MockHealthEventStore{} should be removed as it's dead code.

 	watcher := NewMockChangeStreamWatcher()
-	//mockStore := &MockHealthEventStore{}

 	r := &FaultRemediationReconciler{

17-52: Non-standard import ordering.

Imports mix stdlib (context, log, os, testing, etc.) with external packages. Go convention groups stdlib imports first, then a blank line, then external packages.

fault-remediation/pkg/reconciler/reconciler.go (1)

505-517: CloseAll returns early on first error, potentially leaving resources open.

If r.ds.Close(ctx) fails, r.Watcher.Close(ctx) is never called. Consider closing all resources and aggregating errors to ensure proper cleanup.

 func (r *FaultRemediationReconciler) CloseAll(ctx context.Context) error {
+	var errs []error
+
 	if err := r.ds.Close(ctx); err != nil {
 		slog.Error("failed to close datastore", "error", err)
-		return err
+		errs = append(errs, fmt.Errorf("close datastore: %w", err))
 	}

 	if err := r.Watcher.Close(ctx); err != nil {
 		slog.Error("failed to close Watcher", "error", err)
-		return err
+		errs = append(errs, fmt.Errorf("close watcher: %w", err))
 	}

+	if len(errs) > 0 {
+		return errors.Join(errs...)
+	}
 	return nil
 }

Note: You'll need to add "errors" to the imports.

πŸ“œ Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

πŸ“₯ Commits

Reviewing files that changed from the base of the PR and between 1971e0d and f722b46.

β›” Files ignored due to path filters (5)
  • commons/go.sum is excluded by !**/*.sum
  • event-exporter/go.sum is excluded by !**/*.sum
  • fault-remediation/go.sum is excluded by !**/*.sum
  • health-events-analyzer/go.sum is excluded by !**/*.sum
  • health-monitors/syslog-health-monitor/go.sum is excluded by !**/*.sum
πŸ“’ Files selected for processing (18)
  • commons/go.mod (2 hunks)
  • commons/pkg/server/server.go (2 hunks)
  • distros/kubernetes/nvsentinel/charts/fault-remediation/templates/clusterrole.yaml (2 hunks)
  • distros/kubernetes/nvsentinel/charts/fault-remediation/templates/deployment.yaml (1 hunks)
  • distros/kubernetes/nvsentinel/values-full.yaml (1 hunks)
  • distros/kubernetes/nvsentinel/values-tilt.yaml (2 hunks)
  • event-exporter/go.mod (1 hunks)
  • fault-remediation/go.mod (4 hunks)
  • fault-remediation/main.go (3 hunks)
  • fault-remediation/pkg/initializer/init.go (5 hunks)
  • fault-remediation/pkg/reconciler/metrics.go (4 hunks)
  • fault-remediation/pkg/reconciler/reconciler.go (17 hunks)
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go (44 hunks)
  • fault-remediation/pkg/reconciler/reconciler_test.go (15 hunks)
  • fault-remediation/pkg/reconciler/remediation.go (1 hunks)
  • health-events-analyzer/go.mod (1 hunks)
  • health-monitors/syslog-health-monitor/go.mod (1 hunks)
  • tests/fault_remediation_test.go (1 hunks)
🧰 Additional context used
πŸ““ Path-based instructions (3)
**/*.go

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types

Files:

  • tests/fault_remediation_test.go
  • fault-remediation/pkg/reconciler/remediation.go
  • commons/pkg/server/server.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/initializer/init.go
  • fault-remediation/main.go
  • fault-remediation/pkg/reconciler/metrics.go
  • fault-remediation/pkg/reconciler/reconciler.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
**/*_test.go

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

**/*_test.go: Use envtest for testing Kubernetes controllers instead of fake clients
Use testify/assert and testify/require for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: TestFunctionName_Scenario_ExpectedBehavior

Files:

  • tests/fault_remediation_test.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
**/go.mod

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

Use go.mod for each service as a separate Go module with semantic import versioning

Files:

  • commons/go.mod
  • health-events-analyzer/go.mod
  • fault-remediation/go.mod
  • health-monitors/syslog-health-monitor/go.mod
  • event-exporter/go.mod
🧠 Learnings (13)
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Name Go tests descriptively using format: `TestFunctionName_Scenario_ExpectedBehavior`

Applied to files:

  • tests/fault_remediation_test.go
πŸ“š Learning: 2025-11-06T16:18:09.952Z
Learnt from: KaivalyaMDabhadkar
Repo: NVIDIA/NVSentinel PR: 280
File: tests/fault_management_test.go:334-345
Timestamp: 2025-11-06T16:18:09.952Z
Learning: In the fault-quarantine component, the "quarantinedNodeUncordonedManually" annotation is set to the string literal "True" (with uppercase T), defined as the constant QuarantinedNodeUncordonedManuallyAnnotationValue in fault-quarantine/pkg/common/common.go. Tests should compare against "True", not "true".

Applied to files:

  • tests/fault_remediation_test.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Implement proper shutdown handling with context cancellation in Go code

Applied to files:

  • fault-remediation/pkg/reconciler/remediation.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Prefer informers over direct API calls for watching Kubernetes resources

Applied to files:

  • fault-remediation/pkg/reconciler/remediation.go
  • health-events-analyzer/go.mod
  • fault-remediation/go.mod
  • health-monitors/syslog-health-monitor/go.mod
  • event-exporter/go.mod
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Use `commons/` for shared utilities across Go modules

Applied to files:

  • commons/go.mod
  • health-events-analyzer/go.mod
  • health-monitors/syslog-health-monitor/go.mod
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/go.mod : Use `go.mod` for each service as a separate Go module with semantic import versioning

Applied to files:

  • commons/go.mod
  • health-events-analyzer/go.mod
  • fault-remediation/go.mod
  • health-monitors/syslog-health-monitor/go.mod
  • event-exporter/go.mod
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use `client-go` for Kubernetes API interactions in Go code

Applied to files:

  • commons/go.mod
  • health-events-analyzer/go.mod
  • health-monitors/syslog-health-monitor/go.mod
  • event-exporter/go.mod
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients

Applied to files:

  • commons/go.mod
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Keep Go dependencies minimal and up-to-date

Applied to files:

  • commons/go.mod
  • health-events-analyzer/go.mod
  • fault-remediation/go.mod
  • health-monitors/syslog-health-monitor/go.mod
  • event-exporter/go.mod
πŸ“š Learning: 2025-10-29T15:37:49.210Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 160
File: health-events-analyzer/pkg/reconciler/metrics.go:52-58
Timestamp: 2025-10-29T15:37:49.210Z
Learning: In health-events-analyzer/pkg/reconciler/metrics.go, the ruleMatchedTotal metric should include both "rule_name" and "node_name" labels to identify which rule triggered on which node and how many times, as per the project's observability requirements.

Applied to files:

  • fault-remediation/pkg/reconciler/metrics.go
πŸ“š Learning: 2025-11-26T13:54:26.427Z
Learnt from: rupalis-nv
Repo: NVIDIA/NVSentinel PR: 361
File: distros/kubernetes/nvsentinel/values-tilt.yaml:215-223
Timestamp: 2025-11-26T13:54:26.427Z
Learning: For values-tilt.yaml files: Keep documentation minimal and implementation-focused. Unlike production values.yaml files which need comprehensive inline comments, values-tilt.yaml is a Tilt-specific testing/development override file that doesn't require extensive documentation.

Applied to files:

  • distros/kubernetes/nvsentinel/values-tilt.yaml
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Within `retry.RetryOnConflict` blocks, return errors without wrapping to preserve retry behavior

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler.go
πŸ“š Learning: 2025-12-01T17:53:20.067Z
Learnt from: jtschelling
Repo: NVIDIA/NVSentinel PR: 455
File: docs/designs/013-remediation-plugins.md:162-204
Timestamp: 2025-12-01T17:53:20.067Z
Learning: Code examples in design documents (docs/designs/) in the NVSentinel repository are illustrative stubs meant to show intent and architectural concepts, and do not need to be 100% correct or complete implementations (e.g., undefined variables, missing error handling, incomplete struct initialization are acceptable).

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler.go
🧬 Code graph analysis (3)
fault-remediation/pkg/reconciler/reconciler_test.go (1)
fault-remediation/pkg/reconciler/reconciler.go (3)
  • NewFaultRemediationReconciler (73-88)
  • HealthEventDoc (52-55)
  • ReconcilerConfig (41-50)
fault-remediation/pkg/initializer/init.go (4)
fault-remediation/pkg/reconciler/reconciler.go (2)
  • FaultRemediationReconciler (63-71)
  • NewFaultRemediationReconciler (73-88)
store-client/pkg/datastore/registry.go (1)
  • NewDataStore (120-123)
store-client/pkg/watcher/factory.go (2)
  • WatcherConfig (26-38)
  • CreateChangeStreamWatcher (65-87)
store-client/pkg/datastore/interfaces.go (1)
  • HealthEventStore (63-87)
fault-remediation/pkg/reconciler/reconciler_e2e_test.go (3)
fault-remediation/pkg/reconciler/reconciler.go (3)
  • FaultRemediationReconciler (63-71)
  • ReconcilerConfig (41-50)
  • NewFaultRemediationReconciler (73-88)
data-models/pkg/model/health_event_extentions.go (1)
  • HealthEventStatus (45-51)
fault-remediation/pkg/reconciler/remediation.go (1)
  • FaultRemediationClient (56-68)
πŸ”‡ Additional comments (29)
health-events-analyzer/go.mod (2)

27-94: Run go mod tidy to validate dependency consistency.

The expansion of indirect dependencies when integrating controller-runtime is expected, but ensure the go.mod file is in a canonical, validated state by running go mod tidy to regenerate and clean up the dependency graph.


36-49: Verify unusual go-openapi/swag submodule entries in indirect dependencies.

Lines 36–49 list multiple go-openapi/swag/ submodules (cmdutils, conv, fileutils, jsonname, jsonutils, loading, mangling, netutils, stringutils, typeutils, yamlutils) as separate indirect entries with explicit version constraints. Standard Go module practice does not list subpackages within a single module as separate entries in go.mod; only the root module is typically included.

Run go mod tidy to ensure the go.mod file reflects the true dependency graph and remove any spurious submodule entries if present.

health-monitors/syslog-health-monitor/go.mod (2)

7-19: Direct dependencies are clean and appropriate.

The direct dependencies remain minimal and focused (systemd, retryablehttp, prometheus client, test utilities, gRPC, protobuf, and Kubernetes apimachinery). Local replacements for commons and data-models are correctly configured, aligning with coding guidelines for shared utilities.


21-81: Verify necessity of expanded indirect dependency surface and apply go mod diagnostic tools.

The indirect dependency list has grown significantly, particularly with OpenAPI tooling (go-openapi/*), Kubernetes modules (k8s.io/api, k8s.io/client-go, sigs.k8s.io/controller-runtime), and serialization utilities. This aligns with the PR objective to integrate controller-runtime.

Per Go module best practices, verify the expansion is justified by:

  1. Running go mod tidy to ensure no unused transitive dependencies remain.
  2. Using go mod why -m <module> on questionable indirect modules (especially go-openapi/* submodules) to confirm they're actually needed by imports.
  3. Running go mod graph to inspect the full dependency chain and identify any unexpectedly deep pulls.

If any indirect dependencies are not genuinely required by the codebase, remove them or upgrade direct dependency versions to reduce transitive bloat.

fault-remediation/pkg/reconciler/remediation.go (1)

382-384: LGTM! Minor formatting improvement.

The blank line improves readability by visually separating the context setup from the defer statement.

fault-remediation/go.mod (1)

32-32: LGTM! Expected transitive dependencies from controller-runtime.

The added indirect dependencies (fsnotify, btree, zap) and updated prometheus/common version are legitimate transitive dependencies pulled in by the controller-runtime integration introduced in this PR.

Also applies to: 51-51, 62-62, 72-72

event-exporter/go.mod (1)

18-84: LGTM! Transitive dependencies from controller-runtime integration.

The expanded indirect dependency graph (Kubernetes ecosystem, controller-runtime, and supporting libraries) is expected as the commons module now depends on controller-runtime. These are automatically managed transitive dependencies.

distros/kubernetes/nvsentinel/values-full.yaml (1)

49-61: LGTM! Well-documented feature flag for controller-runtime integration.

The introduction of ctrlRuntimeEnabled with default false ensures backward compatibility while enabling optional controller-runtime management. The separate healthPort and clear documentation of when each port is used provides good operational guidance. Marking this as "still experimental" appropriately sets expectations.

tests/fault_remediation_test.go (1)

28-28: LGTM! Typo fix improves test naming.

Corrected "Reemdiated" to "Remediated" in the test function name.

distros/kubernetes/nvsentinel/values-tilt.yaml (2)

24-24: LGTM! Consistent feature flag configuration.

Adding ctrlRuntimeEnabled: false ensures the Tilt testing environment aligns with the production default configuration.


220-220: LGTM! Appropriate replica count for development.

Setting replicaCount: 1 is suitable for the Tilt development/testing environment.

commons/go.mod (1)

17-17: LGTM! Enables controller-runtime integration in commons.

Adding sigs.k8s.io/controller-runtime v0.22.4 as a direct dependency appropriately supports the new WithPrometheusMetricsCtrlRuntime() option in commons/pkg/server/server.go. The version aligns with the Kubernetes ecosystem dependencies used throughout the project.

distros/kubernetes/nvsentinel/charts/fault-remediation/templates/deployment.yaml (1)

81-91: LGTM!

The conditional addition of controller-runtime and leader-elect flags, along with the health port exposure, is correctly structured with nil-safe value access.

fault-remediation/pkg/reconciler/reconciler_test.go (4)

377-380: LGTM!

The test correctly handles the updated performRemediation signature that now returns an error as the third value.


219-221: LGTM!

The test correctly uses the new NewFaultRemediationReconciler constructor with nil dependencies for unit testing, which is appropriate when the tests don't exercise the datastore/watcher paths.


273-274: LGTM!

Field access correctly updated to use r.config.RemediationClient matching the now-private field naming convention.


995-1008: Verify constructor usage and method availability in reconciler_test.go.

The review flags inconsistent usage of NewReconciler(cfg, false) at line 995 versus NewFaultRemediationReconciler used elsewhere, and questions whether runLogCollector exists on the updated reconciler type. Confirm that:

  • NewReconciler constructor is still available (or has been replaced)
  • runLogCollector method exists on FaultRemediationReconciler
  • The suggested diff is compatible with the current API
fault-remediation/pkg/initializer/init.go (1)

131-134: LGTM!

The new FaultRemediationReconciler is correctly wired with all required dependencies (datastore, watcher, healthEventStore, config, and dryRun flag).

fault-remediation/pkg/reconciler/metrics.go (3)

53-59: Breaking change: metric label renamed from error_type to action.

The totalUnsupportedRemediationActions metric now uses action instead of error_type as a label. While semantically more accurate, this is a breaking change that will affect existing dashboards or alerting rules referencing this metric.

Verify that no existing dashboards, alerts, or monitoring configurations reference fault_remediation_unsupported_actions_total{error_type=...}.


62-92: LGTM!

The metrics are consistently registered with crmetrics.Registry, which is the correct pattern for controller-runtime integration. The histogram buckets and label definitions are appropriate.


29-38: Verify dual-mode operation doesn't cause metric registration conflicts.

All metrics are now registered with crmetrics.Registry. Confirm that when running without controller-runtime integration, the metrics are still properly exposed through the standard Prometheus endpoint, or that a separate initialization path handles non-controller-runtime mode.

fault-remediation/main.go (1)

42-42: Scheme is declared but not populated.

The scheme variable is passed to the controller-runtime manager but no types are registered via AddToScheme. This works currently because the reconciler uses dynamic/unstructured clients, but if you later add typed resources (e.g., watching CRDs with generated types), the manager won't recognize them.

Confirm this is intentional for now. If typed resources will be added later, consider adding a comment or TODO.

fault-remediation/pkg/reconciler/reconciler_e2e_test.go (2)

171-255: Well-structured test setup using envtest.

The TestMain setup properly uses envtest for testing Kubernetes controllers, creates necessary clients, and sets up the controller-runtime manager. This aligns with the coding guidelines.


96-108: Mock implementation looks good.

The MockHealthEventStore with updateCalled counter provides a clean way to verify update behavior in tests.

fault-remediation/pkg/reconciler/reconciler.go (5)

321-323: Log collector error silently ignored.

The error from runLogCollector is discarded with _ = r.runLogCollector(...). While log collection may be non-critical, silently ignoring the error could make debugging difficult. The function already logs internally (lines 178-180), but consider whether the caller should also be aware of failures.

Confirm this is intentional. If log collection failures should not block remediation but should be visible, the current behavior (internal logging + discarding at call site) is acceptable.


73-88: Constructor looks good.

The NewFaultRemediationReconciler properly initializes all required fields. The embedded client.Client is left unset as it's populated by controller-runtime during SetupWithManager.


549-573: Well-implemented channel adapter.

The AdaptEvents function correctly adapts the watcher's event channel to controller-runtime's typed event channel. The implementation properly:

  • Handles context cancellation
  • Closes the output channel when done
  • Creates a copy of the event before taking its pointer (avoiding loop variable capture issues)

205-206: Acknowledged TODO for ctrl-runtime retry pattern.

Based on past review comments, returning errors for ctrl-runtime built-in retries will be addressed in a follow-up PR to keep this change focused. The current retry logic works but will be simplified.


90-125: Reconcile method follows controller-runtime patterns.

The method signature Reconcile(ctx context.Context, event *datastore.EventWithToken) (ctrl.Result, error) aligns with controller-runtime's Reconciler interface pattern. Metrics tracking and error handling are appropriate.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (2)
fault-remediation/pkg/reconciler/reconciler.go (1)

338-343: Inconsistent error handling: MarkProcessed failure returns nil.

When MarkProcessed fails, the function returns nil error instead of propagating it, unlike similar error handling at lines 271, 292, and 315.

As noted in past reviews, this is intentional to retain current behavior, with follow-up planned.

fault-remediation/main.go (1)

144-154: Event loop doesn't respect context cancellation.

The for event := range ... loop blocks until the channel closes. If the context is cancelled (e.g., SIGTERM), this loop won't exit until the watcher's channel is closed.

Consider using a select with ctx.Done():

 	g.Go(func() error {
 		components.FaultRemediationReconciler.StartWatcherStream(gCtx)

 		slog.Info("Listening for events on the channel...")
-		for event := range components.FaultRemediationReconciler.Watcher.Events() {
-			slog.Info("Event received", "event", event)
-			_, _ = components.FaultRemediationReconciler.Reconcile(gCtx, &event)
+		eventCh := components.FaultRemediationReconciler.Watcher.Events()
+		for {
+			select {
+			case <-gCtx.Done():
+				slog.Info("Context cancelled, stopping event processing")
+				return nil
+			case event, ok := <-eventCh:
+				if !ok {
+					return nil
+				}
+				slog.Info("Event received", "event", event)
+				_, _ = components.FaultRemediationReconciler.Reconcile(gCtx, &event)
+			}
 		}
-
-		return nil
 	})
🧹 Nitpick comments (7)
fault-remediation/pkg/reconciler/reconciler_e2e_test.go (3)

17-27: Non-standard import grouping.

Imports are not grouped according to Go conventions (stdlib, then external, then internal packages). The k8s.io/client-go/kubernetes/scheme import at line 19 is placed between stdlib imports.

Consider reorganizing imports:

 import (
 	"context"
-	"k8s.io/client-go/kubernetes/scheme"
 	"log"
 	"os"
 	"path/filepath"
-	ctrl "sigs.k8s.io/controller-runtime"
-	metricsserver "sigs.k8s.io/controller-runtime/pkg/metrics/server"
 	"testing"
 	"text/template"
 	"time"
+
+	ctrl "sigs.k8s.io/controller-runtime"
+	metricsserver "sigs.k8s.io/controller-runtime/pkg/metrics/server"
+	"k8s.io/client-go/kubernetes/scheme"

369-372: Consider using the constructor for consistency.

Tests directly instantiate FaultRemediationReconciler via struct literal rather than using NewFaultRemediationReconciler. While this works for isolated tests, it may miss initialization logic added to the constructor in the future.

For consistency with the public API, consider using the constructor or documenting why direct instantiation is preferred here.


1345-1351: Remove commented-out code.

Line 1346 contains commented-out code that should be removed to maintain code cleanliness.

 	watcher := NewMockChangeStreamWatcher()
-	//mockStore := &MockHealthEventStore{}

 	r := &FaultRemediationReconciler{
 		Watcher: watcher,
 	}
fault-remediation/main.go (3)

17-39: Non-standard import grouping.

Imports mix stdlib, external, and internal packages without proper grouping. Go conventions recommend grouping: stdlib first, then external packages, then internal packages.

 import (
 	"context"
 	"errors"
 	"flag"
 	"fmt"
-	"github.com/nvidia/nvsentinel/commons/pkg/server"
-	"golang.org/x/sync/errgroup"
-	"k8s.io/apimachinery/pkg/runtime"
 	"log/slog"
 	"os"
 	"os/signal"
-	"sigs.k8s.io/controller-runtime/pkg/healthz"
-	metricsserver "sigs.k8s.io/controller-runtime/pkg/metrics/server"
 	"strconv"
 	"strings"
 	"syscall"
 	"time"
-
-	ctrl "sigs.k8s.io/controller-runtime"
+
+	"golang.org/x/sync/errgroup"
+	"k8s.io/apimachinery/pkg/runtime"
+	ctrl "sigs.k8s.io/controller-runtime"
+	"sigs.k8s.io/controller-runtime/pkg/healthz"
+	metricsserver "sigs.k8s.io/controller-runtime/pkg/metrics/server"

 	"github.com/nvidia/nvsentinel/commons/pkg/logger"
+	"github.com/nvidia/nvsentinel/commons/pkg/server"
 	"github.com/nvidia/nvsentinel/fault-remediation/pkg/initializer"
 )

77-79: Error message should not end with punctuation.

Go convention is that error messages should not end with punctuation or be capitalized (unless proper nouns).

 	if !enableControllerRuntime && enableLeaderElection {
-		return errors.New("leader-election requires controller-runtime.")
+		return errors.New("leader-election requires controller-runtime")
 	}

264-267: Remove unnecessary return statement.

The explicit return at the end of parseFlags() is unnecessary since the function has no return values.

 	flag.Parse()
-
-	return
 }
fault-remediation/pkg/reconciler/reconciler.go (1)

73-88: Constructor returns value type instead of pointer.

NewFaultRemediationReconciler returns FaultRemediationReconciler (value) but the struct embeds client.Client which is typically assigned post-construction when using controller-runtime's manager. Returning a value makes it awkward to assign the client later.

Consider returning a pointer for consistency with controller-runtime patterns:

 func NewFaultRemediationReconciler(
 	ds datastore.DataStore,
 	watcher datastore.ChangeStreamWatcher,
 	healthEventStore datastore.HealthEventStore,
 	config ReconcilerConfig,
 	dryRun bool,
-) FaultRemediationReconciler {
-	return FaultRemediationReconciler{
+) *FaultRemediationReconciler {
+	return &FaultRemediationReconciler{
 		ds:                ds,
 		Watcher:           watcher,
 		healthEventStore:  healthEventStore,
 		config:            config,
 		annotationManager: config.RemediationClient.GetAnnotationManager(),
 		dryRun:            dryRun,
 	}
 }

This would require updating the Components struct and callers accordingly.

πŸ“œ Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

πŸ“₯ Commits

Reviewing files that changed from the base of the PR and between f722b46 and 9f5f130.

πŸ“’ Files selected for processing (4)
  • distros/kubernetes/nvsentinel/charts/fault-remediation/templates/deployment.yaml (1 hunks)
  • fault-remediation/main.go (3 hunks)
  • fault-remediation/pkg/reconciler/reconciler.go (17 hunks)
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go (44 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • distros/kubernetes/nvsentinel/charts/fault-remediation/templates/deployment.yaml
🧰 Additional context used
πŸ““ Path-based instructions (2)
**/*.go

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types

Files:

  • fault-remediation/main.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/reconciler/reconciler.go
**/*_test.go

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

**/*_test.go: Use envtest for testing Kubernetes controllers instead of fake clients
Use testify/assert and testify/require for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: TestFunctionName_Scenario_ExpectedBehavior

Files:

  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
🧠 Learnings (6)
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Implement proper shutdown handling with context cancellation in Go code

Applied to files:

  • fault-remediation/main.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/reconciler/reconciler.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Prefer informers over direct API calls for watching Kubernetes resources

Applied to files:

  • fault-remediation/main.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
πŸ“š Learning: 2025-11-06T16:18:09.952Z
Learnt from: KaivalyaMDabhadkar
Repo: NVIDIA/NVSentinel PR: 280
File: tests/fault_management_test.go:334-345
Timestamp: 2025-11-06T16:18:09.952Z
Learning: In the fault-quarantine component, the "quarantinedNodeUncordonedManually" annotation is set to the string literal "True" (with uppercase T), defined as the constant QuarantinedNodeUncordonedManuallyAnnotationValue in fault-quarantine/pkg/common/common.go. Tests should compare against "True", not "true".

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Within `retry.RetryOnConflict` blocks, return errors without wrapping to preserve retry behavior

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler.go
πŸ“š Learning: 2025-12-01T17:53:20.067Z
Learnt from: jtschelling
Repo: NVIDIA/NVSentinel PR: 455
File: docs/designs/013-remediation-plugins.md:162-204
Timestamp: 2025-12-01T17:53:20.067Z
Learning: Code examples in design documents (docs/designs/) in the NVSentinel repository are illustrative stubs meant to show intent and architectural concepts, and do not need to be 100% correct or complete implementations (e.g., undefined variables, missing error handling, incomplete struct initialization are acceptable).

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler.go
🧬 Code graph analysis (1)
fault-remediation/main.go (3)
fault-remediation/pkg/initializer/init.go (3)
  • InitializationParams (34-39)
  • InitializeAll (46-135)
  • Components (41-43)
fault-remediation/pkg/reconciler/reconciler.go (1)
  • FaultRemediationReconciler (63-71)
commons/pkg/server/server.go (4)
  • NewServer (375-396)
  • WithPort (151-153)
  • WithPrometheusMetricsCtrlRuntime (227-234)
  • WithSimpleHealth (246-254)
πŸ”‡ Additional comments (6)
fault-remediation/pkg/reconciler/reconciler_e2e_test.go (1)

288-345: Good refactor: Error returns instead of panics.

The function now properly returns errors instead of panicking, making tests more robust and easier to debug. The callers have been updated to handle these errors appropriately.

fault-remediation/main.go (1)

42-43: Empty scheme may limit controller-runtime capabilities.

The scheme is declared but no types are registered to it. While this works for the current implementation using dynamic clients and external event sources, it may cause issues if the reconciler later needs to work with typed Kubernetes objects.

Verify this is intentional. If the reconciler doesn't need to watch/reconcile typed Kubernetes resources, this is fine. Otherwise, register the needed types:

import (
    corev1 "k8s.io/api/core/v1"
    utilruntime "k8s.io/apimachinery/pkg/util/runtime"
)

func init() {
    utilruntime.Must(corev1.AddToScheme(scheme))
}

Also applies to: 162-164

fault-remediation/pkg/reconciler/reconciler.go (4)

90-125: Reconcile method correctly handles malformed events.

The method returns ctrl.Result{}, nil for parse failures and nil health events, which prevents controller-runtime from retrying malformed events. This aligns with the discussed behavior of dropping unparseable events.


205-224: Manual retry loop to be replaced with controller-runtime retries.

The TODO at line 205 notes this retry logic should be replaced with controller-runtime's built-in retry mechanism. The current implementation maintains backward compatibility but adds complexity.

This is acknowledged for follow-up work as discussed in past reviews.


519-545: Watcher lifecycle not tied to manager.

While the context is now passed as a parameter (addressing part of the past review), the watcher's lifecycle isn't tied to the manager's lifecycle. If the manager stops gracefully, the watcher won't be explicitly stopped until the context is cancelled by the parent signal handler.

This may be acceptable if the signal context propagation is sufficient. However, for explicit lifecycle management, consider registering the watcher as a Runnable:

if err := mgr.Add(manager.RunnableFunc(func(ctx context.Context) error {
    r.Watcher.Start(ctx)
    <-ctx.Done()
    return r.Watcher.Close(ctx)
})); err != nil {
    return fmt.Errorf("failed to add watcher runnable: %w", err)
}

This ensures the watcher is cleanly stopped when the manager shuts down.


547-571: Well-structured channel adaptation with proper context handling.

The AdaptEvents function correctly:

  • Uses select to respect context cancellation
  • Properly closes the output channel via defer
  • Creates a local copy before taking pointer (line 564-565) to avoid data races

@lalitadithya
Copy link
Collaborator

hey @ivelichkovich , can you please complete the DCO -- https://github.com/NVIDIA/NVSentinel/pull/449/checks?check_run_id=57068729072 so that we can trigger the e2e tests on the PR for some feedback?

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (3)
fault-remediation/pkg/reconciler/reconciler.go (2)

338-343: Inconsistent error handling for MarkProcessed failure.

When MarkProcessed fails at line 338, the function returns ctrl.Result{}, nil instead of returning the error. This is inconsistent with similar error handling at lines 292 and 315, which return the error. This could cause the event to be reprocessed if marking fails, but controller-runtime won't be notified for retry backoff.

Apply this diff to make error handling consistent:

 	if err = watcherInstance.MarkProcessed(ctx, eventWithToken.ResumeToken); err != nil {
 		processingErrors.WithLabelValues("mark_processed_error", nodeName).Inc()
 		slog.Error("Error updating resume token", "error", err)
-		return ctrl.Result{}, nil
+		return ctrl.Result{}, err
 	}

Note: Based on past review discussions, this is a known issue with planned follow-up work to align error handling patterns.


505-517: CloseAll returns early, potentially leaving resources open.

If ds.Close fails at line 506, the function returns immediately without calling Watcher.Close, potentially leaving the watcher resource open. This violates proper cleanup patterns.

Consider closing all resources regardless of individual failures and aggregating errors:

+import "errors"
+
 func (r *FaultRemediationReconciler) CloseAll(ctx context.Context) error {
+	var errs []error
+
 	if err := r.ds.Close(ctx); err != nil {
 		slog.Error("failed to close datastore", "error", err)
-		return err
+		errs = append(errs, fmt.Errorf("datastore close: %w", err))
 	}
 
 	if err := r.Watcher.Close(ctx); err != nil {
 		slog.Error("failed to close Watcher", "error", err)
-		return err
+		errs = append(errs, fmt.Errorf("watcher close: %w", err))
 	}
 
-	return nil
+	return errors.Join(errs...)
 }

This ensures both resources are always closed, and any errors are properly reported.

Based on coding guidelines requiring proper shutdown handling with context cancellation.

fault-remediation/pkg/reconciler/reconciler_e2e_test.go (1)

237-241: Manager start goroutine lacks readiness synchronization.

The manager is started in a goroutine at line 238, but there's no synchronization to ensure it's ready before m.Run() executes at line 250. Tests may start before the manager and reconciler are fully initialized, leading to flaky test behavior.

Consider adding a synchronization mechanism to wait for the manager cache to sync:

+	// Wait for manager to be ready
+	go func() {
+		if err := mgr.Start(testContext); err != nil {
+			log.Printf("Failed to start the test environment manager: %v", err)
+		}
+	}()
+
+	// Wait for caches to sync
+	cacheSynced := mgr.GetCache().WaitForCacheSync(testContext)
+	if !cacheSynced {
+		log.Fatalf("Failed to sync manager cache")
+	}
-	go func() {
-		if err := mgr.Start(testContext); err != nil {
-			log.Fatalf("Failed to start the test environment manager: %v", err)
-		}
-	}()

This ensures the manager is ready before tests run. Additionally, consider replacing log.Fatalf with log.Printf in the goroutine to avoid abrupt process termination without cleanup.

Based on coding guidelines requiring proper test setup and synchronization.

🧹 Nitpick comments (1)
fault-remediation/pkg/reconciler/reconciler.go (1)

205-224: Manual retry logic could be replaced with controller-runtime retries.

The manual retry loop with UpdateMaxRetries and UpdateRetryDelay duplicates functionality that controller-runtime provides. The TODO at line 205 acknowledges this. By returning errors from performRemediation, controller-runtime can handle retries automatically with exponential backoff.

Consider returning errors and letting controller-runtime handle retries in a follow-up:

// Remove manual retry loop
success, crName := r.config.RemediationClient.CreateMaintenanceResource(ctx, healthEventData)
if !success {
	processingErrors.WithLabelValues("cr_creation_failed", nodeName).Inc()
	return false, "", fmt.Errorf("failed to create maintenance resource for node %s", nodeName)
}
return true, crName, nil

Then propagate the error in the caller (handleRemediationEvent line 325) instead of ignoring it.

Based on past review discussions indicating this refactoring is planned for a future iteration.

πŸ“œ Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

πŸ“₯ Commits

Reviewing files that changed from the base of the PR and between 9f5f130 and 47c2e36.

πŸ“’ Files selected for processing (4)
  • distros/kubernetes/nvsentinel/charts/fault-remediation/templates/deployment.yaml (1 hunks)
  • fault-remediation/main.go (3 hunks)
  • fault-remediation/pkg/reconciler/reconciler.go (17 hunks)
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go (44 hunks)
🧰 Additional context used
πŸ““ Path-based instructions (2)
**/*.go

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types

Files:

  • fault-remediation/main.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/reconciler/reconciler.go
**/*_test.go

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

**/*_test.go: Use envtest for testing Kubernetes controllers instead of fake clients
Use testify/assert and testify/require for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: TestFunctionName_Scenario_ExpectedBehavior

Files:

  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
🧠 Learnings (8)
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Implement proper shutdown handling with context cancellation in Go code

Applied to files:

  • fault-remediation/main.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/reconciler/reconciler.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Prefer informers over direct API calls for watching Kubernetes resources

Applied to files:

  • fault-remediation/main.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/values.yaml : Note truthy value requirements in Helm chart documentation where applicable

Applied to files:

  • distros/kubernetes/nvsentinel/charts/fault-remediation/templates/deployment.yaml
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use meaningful variable names such as `synced` over `ok` for cache sync checks

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
πŸ“š Learning: 2025-11-06T16:18:09.952Z
Learnt from: KaivalyaMDabhadkar
Repo: NVIDIA/NVSentinel PR: 280
File: tests/fault_management_test.go:334-345
Timestamp: 2025-11-06T16:18:09.952Z
Learning: In the fault-quarantine component, the "quarantinedNodeUncordonedManually" annotation is set to the string literal "True" (with uppercase T), defined as the constant QuarantinedNodeUncordonedManuallyAnnotationValue in fault-quarantine/pkg/common/common.go. Tests should compare against "True", not "true".

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Within `retry.RetryOnConflict` blocks, return errors without wrapping to preserve retry behavior

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler.go
πŸ“š Learning: 2025-12-01T17:53:20.067Z
Learnt from: jtschelling
Repo: NVIDIA/NVSentinel PR: 455
File: docs/designs/013-remediation-plugins.md:162-204
Timestamp: 2025-12-01T17:53:20.067Z
Learning: Code examples in design documents (docs/designs/) in the NVSentinel repository are illustrative stubs meant to show intent and architectural concepts, and do not need to be 100% correct or complete implementations (e.g., undefined variables, missing error handling, incomplete struct initialization are acceptable).

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler.go
🧬 Code graph analysis (3)
fault-remediation/main.go (5)
node-drainer/pkg/config/config.go (1)
  • Duration (36-38)
fault-remediation/pkg/initializer/init.go (3)
  • InitializationParams (34-39)
  • InitializeAll (46-135)
  • Components (41-43)
fault-remediation/pkg/reconciler/reconciler.go (1)
  • FaultRemediationReconciler (63-71)
commons/pkg/server/server.go (4)
  • NewServer (375-396)
  • WithPort (151-153)
  • WithPrometheusMetricsCtrlRuntime (227-234)
  • WithSimpleHealth (246-254)
health-monitors/kubernetes-object-monitor/pkg/annotations/manager.go (1)
  • NewManager (41-43)
fault-remediation/pkg/reconciler/reconciler_e2e_test.go (4)
fault-remediation/pkg/reconciler/reconciler.go (3)
  • FaultRemediationReconciler (63-71)
  • ReconcilerConfig (41-50)
  • NewFaultRemediationReconciler (73-88)
data-models/pkg/model/health_event_extentions.go (1)
  • HealthEventStatus (45-51)
health-monitors/kubernetes-object-monitor/pkg/annotations/manager.go (1)
  • NewManager (41-43)
store-client/pkg/datastore/types.go (1)
  • EventWithToken (101-104)
fault-remediation/pkg/reconciler/reconciler.go (5)
store-client/pkg/storewatcher/watch_store.go (1)
  • ChangeStreamWatcher (43-57)
store-client/pkg/datastore/interfaces.go (1)
  • HealthEventStore (63-87)
fault-remediation/pkg/reconciler/annotation.go (1)
  • NodeAnnotationManagerInterface (38-43)
data-models/pkg/model/health_event_extentions.go (2)
  • HealthEventStatus (45-51)
  • Status (23-23)
data-models/pkg/protos/health_event.pb.go (1)
  • RecommendedAction_NONE (43-43)

@ivelichkovich
Copy link
Contributor Author

hey @ivelichkovich , can you please complete the DCO -- https://github.com/NVIDIA/NVSentinel/pull/449/checks?check_run_id=57068729072 so that we can trigger the e2e tests on the PR for some feedback?

Can I see progress or logs to make sure its running anywhere?

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
fault-remediation/pkg/reconciler/reconciler.go (1)

15-15: Add package-level godoc comment.

Per coding guidelines, all Go packages require a package-level godoc comment.

As per coding guidelines, package-level godoc is required for all Go packages.

Add a comment above the package declaration:

+// Package reconciler implements the fault remediation reconciliation logic,
+// supporting both standalone and controller-runtime managed execution modes.
 package reconciler
♻️ Duplicate comments (1)
fault-remediation/pkg/reconciler/reconciler.go (1)

339-344: Inconsistent error handling: MarkProcessed failure returns nil.

When MarkProcessed fails at line 339, the function returns ctrl.Result{}, nil instead of propagating the error. This is inconsistent with similar error paths at lines 293, 316, and 266/272 which return the error.

This means controller-runtime won't be aware of the failure and won't apply retry backoff, potentially causing the event to be reprocessed without proper error signaling.

Based on past review comments, this behavior is intentional to retain current behavior for now, with error handling improvements deferred to a follow-up. However, for consistency and proper ctrl-runtime integration, consider updating this in the follow-up:

 	if err = watcherInstance.MarkProcessed(ctx, eventWithToken.ResumeToken); err != nil {
 		processingErrors.WithLabelValues("mark_processed_error", nodeName).Inc()
 		slog.Error("Error updating resume token", "error", err)
-		return ctrl.Result{}, nil
+		return ctrl.Result{}, err
 	}
🧹 Nitpick comments (2)
fault-remediation/pkg/reconciler/reconciler.go (2)

206-206: TODO comment should reference an issue.

Per coding guidelines, TODO comments in Go code should reference issues.

As per coding guidelines, TODO comments should reference issues in Go code.

If an issue exists for this work, please add the reference:

-	//TODO: return error to use built in ctrl runtime retries
+	// TODO(#issue_number): return error to use built in ctrl runtime retries

If no issue exists yet, consider creating one to track the planned ctrl-runtime retry integration.


472-475: Consider using context from caller instead of context.Background().

The error path uses context.Background() for MarkProcessed calls. If the main context is cancelled, this may cause the MarkProcessed operation to continue running.

If the intent is to ensure MarkProcessed completes even on cancellation, add a brief comment explaining this. Otherwise, consider using the context from the caller:

-		if markErr := watcherInstance.MarkProcessed(context.Background(), eventWithToken.ResumeToken); markErr != nil {
+		if markErr := watcherInstance.MarkProcessed(ctx, eventWithToken.ResumeToken); markErr != nil {

Note: Similar pattern also appears at line 486.

πŸ“œ Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

πŸ“₯ Commits

Reviewing files that changed from the base of the PR and between 47c2e36 and 07e9ff0.

πŸ“’ Files selected for processing (1)
  • fault-remediation/pkg/reconciler/reconciler.go (17 hunks)
🧰 Additional context used
πŸ““ Path-based instructions (1)
**/*.go

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types

Files:

  • fault-remediation/pkg/reconciler/reconciler.go
🧠 Learnings (3)
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Within `retry.RetryOnConflict` blocks, return errors without wrapping to preserve retry behavior

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler.go
πŸ“š Learning: 2025-12-01T17:53:20.067Z
Learnt from: jtschelling
Repo: NVIDIA/NVSentinel PR: 455
File: docs/designs/013-remediation-plugins.md:162-204
Timestamp: 2025-12-01T17:53:20.067Z
Learning: Code examples in design documents (docs/designs/) in the NVSentinel repository are illustrative stubs meant to show intent and architectural concepts, and do not need to be 100% correct or complete implementations (e.g., undefined variables, missing error handling, incomplete struct initialization are acceptable).

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Implement proper shutdown handling with context cancellation in Go code

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler.go
πŸ”‡ Additional comments (1)
fault-remediation/pkg/reconciler/reconciler.go (1)

81-88: Consider initializing or documenting the nil Client field.

The embedded client.Client field is not initialized in this constructor. If methods on FaultRemediationReconciler attempt to use the Client before it's set, this will cause a nil pointer dereference. Verify whether the Client is injected via SetupWithManager or another initialization mechanism, and either accept it as a constructor parameter, provide a setter method, or document the injection requirement in the godoc.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
fault-remediation/pkg/reconciler/reconciler.go (1)

331-331: Consider logging runLogCollector errors for observability.

The error return from runLogCollector is currently discarded. While log collection is auxiliary and its failure shouldn't block remediation, consider logging the error for debugging purposes:

-	_ = r.runLogCollector(ctx, healthEvent)
+	if err := r.runLogCollector(ctx, healthEvent); err != nil {
+		slog.Warn("Log collector failed but continuing with remediation", "node", healthEvent.NodeName, "error", err)
+	}
πŸ“œ Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

πŸ“₯ Commits

Reviewing files that changed from the base of the PR and between 07e9ff0 and 9a0d94a.

πŸ“’ Files selected for processing (2)
  • .github/workflows/lint-test.yml (1 hunks)
  • fault-remediation/pkg/reconciler/reconciler.go (17 hunks)
🧰 Additional context used
πŸ““ Path-based instructions (1)
**/*.go

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types

Files:

  • fault-remediation/pkg/reconciler/reconciler.go
🧠 Learnings (6)
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Within `retry.RetryOnConflict` blocks, return errors without wrapping to preserve retry behavior

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler.go
πŸ“š Learning: 2025-12-01T17:53:20.067Z
Learnt from: jtschelling
Repo: NVIDIA/NVSentinel PR: 455
File: docs/designs/013-remediation-plugins.md:162-204
Timestamp: 2025-12-01T17:53:20.067Z
Learning: Code examples in design documents (docs/designs/) in the NVSentinel repository are illustrative stubs meant to show intent and architectural concepts, and do not need to be 100% correct or complete implementations (e.g., undefined variables, missing error handling, incomplete struct initialization are acceptable).

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Implement proper shutdown handling with context cancellation in Go code

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Function comments required for all exported Go functions

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : TODO comments should reference issues in Go code

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler.go
πŸ“š Learning: 2025-10-29T15:37:49.210Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 160
File: health-events-analyzer/pkg/reconciler/metrics.go:52-58
Timestamp: 2025-10-29T15:37:49.210Z
Learning: In health-events-analyzer/pkg/reconciler/metrics.go, the ruleMatchedTotal metric should include both "rule_name" and "node_name" labels to identify which rule triggered on which node and how many times, as per the project's observability requirements.

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler.go
πŸ”‡ Additional comments (9)
.github/workflows/lint-test.yml (1)

18-20: Workflow trigger addition is sound.

The new pull_request trigger with branches: '*' correctly enables lint-test to run for all PRs. The wildcard pattern is valid and idiomatic in GitHub Actions. Existing conditional logic (line 200) properly routes pull_request events to the coverage-report job.

Please confirm this trigger scope aligns with your repository's CI/CD policies. If stricter branch filtering is needed (e.g., PRs targeting only main or specific branches), you can replace '*' with a more specific pattern.

fault-remediation/pkg/reconciler/reconciler.go (8)

17-40: LGTM! Clean controller-runtime integration imports.

The new imports for controller-runtime (ctrl, builder, client, event, handler, source) and the nvstoreclient alias are well-organized and support the refactoring objectives.


64-93: LGTM! Well-structured reconciler with clear dependencies.

The FaultRemediationReconciler struct properly embeds client.Client for controller-runtime integration, and the constructor follows standard dependency injection patterns. Godoc comments have been properly added.


95-133: LGTM! Clean reconciliation flow with proper delegation.

The Reconcile method follows controller-runtime conventions, with proper context handling, metrics tracking, and delegation to specialized handlers for cancellation and remediation events.


195-255: Manual retry loop tracked for future refactoring.

The TODO comment at line 213 correctly notes that this retry logic should eventually be replaced with controller-runtime's built-in retry mechanism. Per past review discussions, this is deferred to a follow-up PR to keep changes manageable.


354-451: LGTM! Status update and CR check logic is sound.

The updateNodeRemediatedStatus method includes a manual retry loop (lines 377-390) which is consistent with the pattern in performRemediation and tracked for future refactoring. The checkExistingCRStatus method properly validates state and delegates to the status checker.


455-505: LGTM! Robust event parsing with proper error classification.

The parseHealthEvent method correctly uses shared parsing utilities, classifies errors for metrics, and marks unparseable events as processed to prevent reprocessing. The use of context.Background() for MarkProcessed on error paths (lines 479, 493) is appropriate to ensure cleanup even during cancellation.


507-560: LGTM! Excellent lifecycle management with proper context handling.

The lifecycle methods are well-implemented:

  • StartWatcherStream provides a clean entry point for non-controller-runtime mode
  • CloseAll properly aggregates errors using errors.Join to ensure all cleanup attempts are made
  • SetupWithManager correctly uses the provided context (addressing previous review feedback) and follows controller-runtime patterns for typed controllers

562-589: LGTM! Clean channel adapter with proper lifecycle management.

The AdaptEvents function correctly transforms datastore events into controller-runtime generic events, with proper handling of context cancellation, channel closure, and goroutine cleanup via defer.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
fault-remediation/pkg/reconciler/reconciler_e2e_test.go (1)

96-169: Guard MockHealthEventStore.updateCalled against concurrent access

updateCalled is incremented inside UpdateHealthEventStatusFn (called from the reconciler running under the manager goroutines) and read/reset directly from tests (e.g., mockStore.updateCalled = 0, updateCountBefore := mockStore.updateCalled). Without synchronization this is a data race and can produce flaky assertions under concurrent processing or when run with -race.

Consider switching to an atomic counter or adding a mutex and helper methods:

 type MockHealthEventStore struct {
-	UpdateHealthEventStatusFn func(ctx context.Context, id string, status datastore.HealthEventStatus) error
-	updateCalled              int
+	UpdateHealthEventStatusFn func(ctx context.Context, id string, status datastore.HealthEventStatus) error
+	mu                        sync.Mutex
+	updateCalled              int
 }
 
 // UpdateHealthEventStatus updates a health event status (mock implementation)
 func (m *MockHealthEventStore) UpdateHealthEventStatus(ctx context.Context, id string, status datastore.HealthEventStatus) error {
 	if m.UpdateHealthEventStatusFn != nil {
 		return m.UpdateHealthEventStatusFn(ctx, id, status)
 	}
-	return nil
+	m.mu.Lock()
+	m.updateCalled++
+	m.mu.Unlock()
+	return nil
 }
+
+func (m *MockHealthEventStore) ResetUpdateCalled() {
+	m.mu.Lock()
+	defer m.mu.Unlock()
+	m.updateCalled = 0
+}
+
+func (m *MockHealthEventStore) UpdateCalled() int {
+	m.mu.Lock()
+	defer m.mu.Unlock()
+	return m.updateCalled
}

Then update tests to use ResetUpdateCalled() / UpdateCalled() instead of touching the field directly.

Also applies to: 741-801, 988-989

fault-remediation/pkg/reconciler/reconciler.go (2)

176-255: Documented TODO in performRemediation should reference an issue, and retry loop is a good future ctrl-runtime cleanup candidate

The current performRemediation implementation preserves existing behavior by:

  • Manually retrying CreateMaintenanceResource up to UpdateMaxRetries with UpdateRetryDelay.
  • Updating node labels to remediating / succeeded / failed and emitting metrics.

The new //TODO: return error to use built in ctrl runtime retries is helpful, but per guidelines it should reference a tracking issue (e.g. // TODO(<issue-id>): ...) so this behavior change isn’t forgotten.

Given ctrl-runtime already handles backoff/retries on returned errors, this manual loop is a natural future simplification: once you’re ready to change behavior, you can drop the loop and return errors directly to ctrl-runtime instead of sleeping in-place.

Also applies to: 354-401


258-282: Use request context for MarkProcessed in cancellation path

In handleCancellationEvent, you correctly use ctx for annotation cleanup, but still call:

if err := watcherInstance.MarkProcessed(context.Background(), resumeToken); err != nil {

This bypasses cancellation/timeouts from the reconciler’s context and is inconsistent with other call sites that use ctx when marking events processed.

Switching to ctx here would align behavior and avoid potentially hanging MarkProcessed calls during shutdown:

-	if err := watcherInstance.MarkProcessed(context.Background(), resumeToken); err != nil {
+	if err := watcherInstance.MarkProcessed(ctx, resumeToken); err != nil {
♻️ Duplicate comments (2)
health-monitors/syslog-health-monitor/go.mod (1)

31-42: Critical: Malformed subpackage entriesβ€”consolidate to module path.

Lines 32–42 incorrectly list subpackages (cmdutils, conv, fileutils, etc.) as separate go.mod entries. In Go's module system, only the module path github.com/go-openapi/swag should appear; subpackages are internal to that module and must not be listed separately. Consolidate to a single entry.

This is a duplicate of a previously flagged issue that remains unresolved.

Run go mod tidy to repair invalid subpackage entries:

#!/bin/bash
# Navigate to the module and run go mod tidy to consolidate subpackage entries
cd health-monitors/syslog-health-monitor
go mod tidy
# Verify the swag entry is consolidated
grep -n "go-openapi/swag" go.mod
fault-remediation/main.go (1)

117-157: Non–controller-runtime event loop still ignores context cancellation

In setupNonCtrlRuntimeManaged, the goroutine:

components.FaultRemediationReconciler.StartWatcherStream(gCtx)

slog.Info("Listening for events on the channel...")
for event := range components.FaultRemediationReconciler.Watcher.Events() {
    slog.Info("Event received", "event", event)
    _, _ = components.FaultRemediationReconciler.Reconcile(gCtx, &event)
}

will block until the events channel is closed. If gCtx/ctx is cancelled (SIGINT/SIGTERM or shutdown path) but the watcher does not close its channel promptly, this goroutine will hang and delay graceful shutdown, contrary to the context-cancellation guideline.

Consider selecting on both the event channel and gCtx.Done():

-	slog.Info("Listening for events on the channel...")
-	for event := range components.FaultRemediationReconciler.Watcher.Events() {
-		slog.Info("Event received", "event", event)
-		_, _ = components.FaultRemediationReconciler.Reconcile(gCtx, &event)
-	}
+	slog.Info("Listening for events on the channel...")
+	eventCh := components.FaultRemediationReconciler.Watcher.Events()
+	for {
+		select {
+		case <-gCtx.Done():
+			slog.Info("Context cancelled, stopping event processing")
+			return nil
+		case event, ok := <-eventCh:
+			if !ok {
+				return nil
+			}
+			slog.Info("Event received", "event", event)
+			_, _ = components.FaultRemediationReconciler.Reconcile(gCtx, &event)
+		}
+	}

This ensures the loop exits promptly on cancellation even if the watcher channel remains open.

Also applies to: 144-154

🧹 Nitpick comments (8)
fault-remediation/pkg/initializer/init.go (2)

70-80: Align TODO with guidelines by referencing an issue or converting to a regular comment

//TODO: replace this with manager client is useful context, but per the Go coding guidelines here, TODO comments should reference an issue (for example, // TODO(#1234): ...). Please either link this to a tracking issue or convert it into a plain explanatory comment if there’s no concrete follow-up item yet.


99-117: Centralize the "HealthEvents" collection name to avoid future drift

The new datastore/watch initialization looks good, but CollectionName: "HealthEvents" is hard-coded here while the rest of the datastore API is strongly typed. If there’s already a shared constant (e.g., in data-models or store-client) for the HealthEvents collection, prefer using it here to avoid silent breakage if the collection name ever changes.

Also applies to: 131-134

fault-remediation/pkg/reconciler/reconciler_test.go (1)

991-1009: Consider standardizing on NewFaultRemediationReconciler in tests

TestLogCollectorOnlyCalledWhenShouldCreateCR still uses NewReconciler(cfg, false) while the rest of this file has migrated to NewFaultRemediationReconciler. If NewReconciler is now just a thin wrapper around the new constructor, consider updating this test to call NewFaultRemediationReconciler directly for consistency and to keep tests aligned with the primary public entry point.

tests/fault_remediation_test.go (1)

28-73: Test name spelling fix looks good

The rename corrects the typo and keeps the existing behavior intact. If you touch this again, consider renaming to follow the TestFunction_Scenario_ExpectedBehavior style used elsewhere.

fault-remediation/pkg/reconciler/reconciler_e2e_test.go (1)

171-255: Ctrl-runtime envtest wiring and reconciler usage look solid

The updated TestMain, createTestRemediationClient, and the integration tests correctly:

  • Build a real FaultRemediationClient against envtest testRestConfig/testDynamic.
  • Construct ReconcilerConfig with StateManager and pass it into NewFaultRemediationReconciler.
  • Wire the reconciler into a controller-runtime Manager via SetupWithManager while still using explicit struct literals in tests that exercise methods directly.

The overall structure reflects the new ctrl-runtime–aware reconciler API without introducing obvious correctness issues in the tests.

Also applies to: 288-345, 347-428, 472-488, 546-558, 603-627, 644-707, 740-895, 897-927, 986-1058, 1060-1139, 1141-1223, 1335-1356

fault-remediation/main.go (1)

41-62: Controller-runtime path, leader election gating, and flag plumbing look correct

  • Rejecting --leader-elect when --controller-runtime is disabled prevents invalid combinations.
  • setupCtrlRuntimeManagement correctly:
    • Uses the shared ctx (with signal handling) for mgr.Start.
    • Configures metrics, health probes, and leader-election timings from flags.
    • Delegates reconciler wiring via SetupWithManager.
  • Flag parsing cleanly separates controller-runtime options and reuses the same metrics-address/health-address flags across both modes.

The structure meets the shutdown-handling and ctrl-runtime integration goals.

Also applies to: 74-115, 159-202, 204-267

fault-remediation/pkg/reconciler/reconciler.go (2)

95-133: Reconcile flow and unsupported-action handling look correct

  • Reconcile delegates parsing, cancellation vs remediation, and metrics without leaking parse errors back to ctrl-runtime (parse errors are logged + metrics, then dropped), which avoids hot loops on bad events.
  • shouldSkipEvent cleanly handles:
    • NONE actions.
    • Already remediated events.
    • Unsupported actions, where it logs, increments totalUnsupportedRemediationActions, and sets the remediation-failed node label via StateManager.

This matches the desired behavior of dropping unprocessable events while keeping node-level state/metrics accurate.

Also applies to: 135-174


531-560: Ctrl-runtime SetupWithManager and AdaptEvents integration looks good

  • SetupWithManager starts the watcher with the provided ctx and wires the reconciler into controller-runtime using a typed channel source and rate-limiting workqueue, which aligns with best practices for channel-driven reconcilers.
  • AdaptEvents correctly:
    • Listens on the input channel until either ctx.Done() or the channel is closed.
    • Sends a per-iteration copy (eventOut) so each TypedGenericEvent holds a stable pointer.
    • Closes the output channel on exit.

This provides a clean bridge between the datastore event stream and controller-runtime’s reconciliation loop.

Also applies to: 562-589

πŸ“œ Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

πŸ“₯ Commits

Reviewing files that changed from the base of the PR and between 9a0d94a and dc2fca4.

β›” Files ignored due to path filters (5)
  • commons/go.sum is excluded by !**/*.sum
  • event-exporter/go.sum is excluded by !**/*.sum
  • fault-remediation/go.sum is excluded by !**/*.sum
  • health-events-analyzer/go.sum is excluded by !**/*.sum
  • health-monitors/syslog-health-monitor/go.sum is excluded by !**/*.sum
πŸ“’ Files selected for processing (18)
  • commons/go.mod (2 hunks)
  • commons/pkg/server/server.go (2 hunks)
  • distros/kubernetes/nvsentinel/charts/fault-remediation/templates/clusterrole.yaml (2 hunks)
  • distros/kubernetes/nvsentinel/charts/fault-remediation/templates/deployment.yaml (1 hunks)
  • distros/kubernetes/nvsentinel/values-full.yaml (1 hunks)
  • distros/kubernetes/nvsentinel/values-tilt.yaml (2 hunks)
  • event-exporter/go.mod (1 hunks)
  • fault-remediation/go.mod (2 hunks)
  • fault-remediation/main.go (3 hunks)
  • fault-remediation/pkg/initializer/init.go (5 hunks)
  • fault-remediation/pkg/reconciler/metrics.go (4 hunks)
  • fault-remediation/pkg/reconciler/reconciler.go (17 hunks)
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go (44 hunks)
  • fault-remediation/pkg/reconciler/reconciler_test.go (15 hunks)
  • fault-remediation/pkg/reconciler/remediation.go (1 hunks)
  • health-events-analyzer/go.mod (1 hunks)
  • health-monitors/syslog-health-monitor/go.mod (1 hunks)
  • tests/fault_remediation_test.go (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (5)
  • fault-remediation/pkg/reconciler/remediation.go
  • distros/kubernetes/nvsentinel/values-tilt.yaml
  • commons/go.mod
  • distros/kubernetes/nvsentinel/charts/fault-remediation/templates/deployment.yaml
  • distros/kubernetes/nvsentinel/charts/fault-remediation/templates/clusterrole.yaml
🧰 Additional context used
πŸ““ Path-based instructions (3)
**/*.go

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types

Files:

  • tests/fault_remediation_test.go
  • fault-remediation/pkg/reconciler/metrics.go
  • fault-remediation/pkg/initializer/init.go
  • fault-remediation/main.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • commons/pkg/server/server.go
  • fault-remediation/pkg/reconciler/reconciler.go
**/*_test.go

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

**/*_test.go: Use envtest for testing Kubernetes controllers instead of fake clients
Use testify/assert and testify/require for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: TestFunctionName_Scenario_ExpectedBehavior

Files:

  • tests/fault_remediation_test.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
**/go.mod

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

Use go.mod for each service as a separate Go module with semantic import versioning

Files:

  • event-exporter/go.mod
  • health-events-analyzer/go.mod
  • fault-remediation/go.mod
  • health-monitors/syslog-health-monitor/go.mod
🧠 Learnings (15)
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Name Go tests descriptively using format: `TestFunctionName_Scenario_ExpectedBehavior`

Applied to files:

  • tests/fault_remediation_test.go
πŸ“š Learning: 2025-10-29T15:37:49.210Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 160
File: health-events-analyzer/pkg/reconciler/metrics.go:52-58
Timestamp: 2025-10-29T15:37:49.210Z
Learning: In health-events-analyzer/pkg/reconciler/metrics.go, the ruleMatchedTotal metric should include both "rule_name" and "node_name" labels to identify which rule triggered on which node and how many times, as per the project's observability requirements.

Applied to files:

  • fault-remediation/pkg/reconciler/metrics.go
  • fault-remediation/pkg/reconciler/reconciler.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/go.mod : Use `go.mod` for each service as a separate Go module with semantic import versioning

Applied to files:

  • event-exporter/go.mod
  • health-events-analyzer/go.mod
  • fault-remediation/go.mod
  • health-monitors/syslog-health-monitor/go.mod
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Keep Go dependencies minimal and up-to-date

Applied to files:

  • event-exporter/go.mod
  • health-events-analyzer/go.mod
  • fault-remediation/go.mod
  • health-monitors/syslog-health-monitor/go.mod
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use `client-go` for Kubernetes API interactions in Go code

Applied to files:

  • event-exporter/go.mod
  • health-events-analyzer/go.mod
  • health-monitors/syslog-health-monitor/go.mod
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Prefer informers over direct API calls for watching Kubernetes resources

Applied to files:

  • event-exporter/go.mod
  • health-events-analyzer/go.mod
  • fault-remediation/main.go
  • fault-remediation/go.mod
  • health-monitors/syslog-health-monitor/go.mod
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Use `commons/` for shared utilities across Go modules

Applied to files:

  • health-events-analyzer/go.mod
  • health-monitors/syslog-health-monitor/go.mod
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Implement proper shutdown handling with context cancellation in Go code

Applied to files:

  • fault-remediation/main.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/reconciler/reconciler.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use meaningful variable names such as `synced` over `ok` for cache sync checks

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
πŸ“š Learning: 2025-11-06T16:18:09.952Z
Learnt from: KaivalyaMDabhadkar
Repo: NVIDIA/NVSentinel PR: 280
File: tests/fault_management_test.go:334-345
Timestamp: 2025-11-06T16:18:09.952Z
Learning: In the fault-quarantine component, the "quarantinedNodeUncordonedManually" annotation is set to the string literal "True" (with uppercase T), defined as the constant QuarantinedNodeUncordonedManuallyAnnotationValue in fault-quarantine/pkg/common/common.go. Tests should compare against "True", not "true".

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : TODO comments should reference issues in Go code

Applied to files:

  • health-monitors/syslog-health-monitor/go.mod
  • fault-remediation/pkg/reconciler/reconciler.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Within `retry.RetryOnConflict` blocks, return errors without wrapping to preserve retry behavior

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler.go
πŸ“š Learning: 2025-12-01T17:53:20.067Z
Learnt from: jtschelling
Repo: NVIDIA/NVSentinel PR: 455
File: docs/designs/013-remediation-plugins.md:162-204
Timestamp: 2025-12-01T17:53:20.067Z
Learning: Code examples in design documents (docs/designs/) in the NVSentinel repository are illustrative stubs meant to show intent and architectural concepts, and do not need to be 100% correct or complete implementations (e.g., undefined variables, missing error handling, incomplete struct initialization are acceptable).

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Function comments required for all exported Go functions

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler.go
🧬 Code graph analysis (2)
fault-remediation/pkg/initializer/init.go (3)
fault-remediation/pkg/reconciler/reconciler.go (2)
  • FaultRemediationReconciler (67-75)
  • NewFaultRemediationReconciler (78-93)
store-client/pkg/watcher/factory.go (2)
  • WatcherConfig (26-38)
  • CreateChangeStreamWatcher (65-87)
store-client/pkg/datastore/interfaces.go (1)
  • HealthEventStore (63-87)
fault-remediation/pkg/reconciler/reconciler.go (5)
store-client/pkg/storewatcher/watch_store.go (1)
  • ChangeStreamWatcher (43-57)
store-client/pkg/datastore/interfaces.go (1)
  • HealthEventStore (63-87)
fault-remediation/pkg/reconciler/annotation.go (1)
  • NodeAnnotationManagerInterface (38-43)
data-models/pkg/model/health_event_extentions.go (4)
  • HealthEventStatus (45-51)
  • UnQuarantined (34-34)
  • Cancelled (37-37)
  • Status (23-23)
data-models/pkg/protos/health_event.pb.go (1)
  • RecommendedAction_NONE (43-43)
πŸ”‡ Additional comments (10)
health-monitors/syslog-health-monitor/go.mod (1)

71-80: Verify necessity of new Kubernetes and controller-runtime dependencies.

Lines 71–80 introduce numerous indirect dependencies on k8s.io/* and sigs.k8s.io/* modules (controller-runtime, client-go, kube-openapi, etc.). Since the PR integrates controller-runtime into fault-remediation, these may be transitive pulls from shared modules like commons or data-models. However, verify:

  1. Whether syslog-health-monitor actually uses any of these, or if they're incidental transitive bloat.
  2. If these are pulled from commons or data-models, whether they should be direct requires in those modules instead.

Per coding guidelines, keep Go dependencies minimal and up-to-date. Ensure no unnecessary transitive overhead is being carried.

distros/kubernetes/nvsentinel/values-full.yaml (1)

48-62: Global ctrl-runtime toggles and ports look consistent

The updated metricsPort comment plus the new healthPort and ctrlRuntimeEnabled flags are coherent with the dual execution modes described in the PR; defaults and documentation look good.

fault-remediation/go.mod (1)

24-95: New indirect deps (fsnotify, btree) align with controller-runtime usage

Adding fsnotify and google/btree as indirect dependencies is consistent with the controller-runtime/k8s controller stack you’re using; nothing stands out as problematic here. Just ensure go mod tidy is run periodically so the indirect list reflects actual usage and doesn’t accumulate stale entries.

fault-remediation/pkg/reconciler/reconciler_test.go (1)

219-222: Tests updated for NewFaultRemediationReconciler and new performRemediation signature look sound

The migration of tests to NewFaultRemediationReconciler(nil, nil, nil, cfg, dryRun) and the updated assertions handling (success, crName, err) from performRemediation are consistent with the reconciler’s new API. Using nil for the datastore/watcher/healthEventStore in these unit tests is fine given that the exercised code paths only rely on config, RemediationClient, and StateManager.

Also applies to: 263-274, 371-381, 428-438, 474-485, 784-803

commons/pkg/server/server.go (1)

52-55: Fix example to use the new ctrl-runtime metrics option and clarify usage

The implementation of WithPrometheusMetricsCtrlRuntime looks correct for wiring metrics to crmetrics.Registry, but the example in the comment still shows WithPrometheusMetrics() instead of the new option. Also, consider adding a brief note that WithPrometheusMetrics and WithPrometheusMetricsCtrlRuntime should not both be used in the same server, since ServeMux.Handle("/metrics", ...) will panic on duplicate registration.

event-exporter/go.mod (1)

16-85: Confirm indirect deps are needed and keep go.mod tidy for event-exporter

The expanded indirect require block (OpenAPI, k8s, controller-runtime, various x/* and sigs.k8s.io/* modules) should be verified to ensure all are actually required. Run go mod tidy to remove any unnecessary transitive dependencies and keep the manifest lean per Go best practices. If specific indirect versions must be pinned for compatibility, document that rationale.

health-events-analyzer/go.mod (1)

27-94: Run go mod tidy to ensure indirect dependencies are necessary

The expanded indirect dependency set includes controller-runtime, Kubernetes, OpenAPI, and JSON/YAML libraries. Ensure these modules are actually required by this service and not accumulated cruft. Run go mod tidy in health-events-analyzer to remove unused transitive dependencies and keep the dependency set minimal per project standards.

fault-remediation/pkg/reconciler/metrics.go (1)

20-21: Controller-runtime metrics registry integration is correct; address breaking label change and TODO compliance

Using crmetrics.Registry with promauto.With() properly aligns reconciler metrics with the controller-runtime metrics server and the new WithPrometheusMetricsCtrlRuntime handler option.

Critical before merge:

  • The label changes for fault_remediation_unsupported_actions_total are a breaking change. Since fault-remediation metrics are actively scraped in production (per Helm/Prometheus documentation), existing dashboards and alerts will fail or behave unexpectedly when labels change. Either preserve the old label name for backward compatibility or explicitly call out the breaking change in release notes.
  • The TODO comment should reference a tracking issue per coding guidelines.
  • Verify all processes and binaries importing this metrics package are updated to use the controller-runtime registry server option; anything still using WithPrometheusMetrics() will not expose these metrics.
fault-remediation/pkg/reconciler/reconciler.go (2)

42-52: Reconciler shape and constructor align well with new ctrl-runtime model

The FaultRemediationReconciler struct and NewFaultRemediationReconciler cleanly capture all dependencies (datastore, watcher, health store, config, annotationManager, dryRun) and embed client.Client for ctrl-runtime use, while still working for non-managed mode.

Constructor wiring via config.RemediationClient.GetAnnotationManager() keeps tests and main in sync without extra pluming.

Also applies to: 64-93


284-352: Remediation path and metrics wiring look consistent

The main remediation path in handleRemediationEvent:

  • Respects shouldSkipEvent and marks skipped events processed.
  • Uses checkExistingCRStatus for deduplication, emitting CRStatusSkipped metrics when reusing an in-progress CR.
  • Runs the log collector only when actually creating a new CR.
  • Invokes performRemediation and updateNodeRemediatedStatus, then marks the event processed and bumps CRStatusCreated.

This is consistent with the dedup and metrics expectations described in the tests and maintains the previous semantics while fitting into ctrl-runtime’s Reconcile signature.

@lalitadithya
Copy link
Collaborator

looks like the only test that failed is https://github.com/NVIDIA/NVSentinel/actions/runs/19919748936/job/57105903821?pr=449 can you please take a look?

@lalitadithya
Copy link
Collaborator

/ok to test 5004e32

@lalitadithya
Copy link
Collaborator

/ok to test 560a318

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (4)
health-monitors/syslog-health-monitor/go.mod (1)

32-42: Consolidate go-openapi/swag subpackages into a single module entry.

Lines 32–42 incorrectly list subpackage paths (e.g., github.com/go-openapi/swag/cmdutils, github.com/go-openapi/swag/conv, etc.) as separate module entries. In Go's go.mod, only module paths should appear; subpackages are internal to their parent module and must not be listed separately. Replace these eleven entries with a single line: github.com/go-openapi/swag v0.25.4 // indirect.

Run go mod tidy to automatically repair these entries and remove invalid subpackage paths:

go mod tidy

After running go mod tidy, verify the file contains only:

github.com/go-openapi/swag v0.25.4 // indirect

instead of the current eleven separate subpackage entries.

fault-remediation/main.go (1)

120-162: Non‑ctrl runtime event loop doesn’t observe context cancellation directly.

StartWatcherStream(gCtx) gets a cancellable context, but the subsequent for event := range components.FaultRemediationReconciler.Watcher.Events() will keep blocking until the watcher closes the channel. If the watcher doesn’t close promptly on gCtx.Done(), shutdown can hang on g.Wait(). Consider a select on gCtx.Done() + the events channel so this goroutine can exit immediately when the context is cancelled.

fault-remediation/pkg/reconciler/reconciler_e2e_test.go (1)

171-241: Avoid log.Fatalf in the manager goroutine and add a simple readiness sync.

The goroutine running mgr.Start(testContext) still uses log.Fatalf, which calls os.Exit from a background goroutine and can bypass test cleanup; there’s also no sync to ensure the manager has started before tests run. Prefer logging the error and failing tests via assertions, and use a small channel or mgr.GetCache().WaitForCacheSync to signal readiness before m.Run().

fault-remediation/pkg/reconciler/reconciler.go (1)

518-541: Nil-guarding CloseAll would make it safer for test-constructed reconcilers.

CloseAll unconditionally calls r.ds.Close(ctx) and r.Watcher.Close(ctx). In production these are always non-nil via InitializeAll, but tests sometimes construct FaultRemediationReconciler with ds and/or Watcher set to nil, which would panic if CloseAll is reused there. Adding simple nil checks around each Close makes this helper safer without affecting the main path.

 func (r *FaultRemediationReconciler) CloseAll(ctx context.Context) error {
 	var errs []error
 
-	if err := r.ds.Close(ctx); err != nil {
-		slog.Error("failed to close datastore", "error", err)
-		errs = append(errs, err)
-	}
-
-	if err := r.Watcher.Close(ctx); err != nil {
-		slog.Error("failed to close Watcher", "error", err)
-		errs = append(errs, err)
-	}
+	if r.ds != nil {
+		if err := r.ds.Close(ctx); err != nil {
+			slog.Error("failed to close datastore", "error", err)
+			errs = append(errs, err)
+		}
+	}
+
+	if r.Watcher != nil {
+		if err := r.Watcher.Close(ctx); err != nil {
+			slog.Error("failed to close Watcher", "error", err)
+			errs = append(errs, err)
+		}
+	}
 
 	return errors.Join(errs...)
 }
🧹 Nitpick comments (3)
fault-remediation/pkg/reconciler/metrics.go (1)

30-30: Add an issue reference to the TODO per repo guidelines

The TODO at Line 30 lacks an issue reference, which goes against the stated Go guidelines for TODO comments. Please either link the tracking issue or rephrase as a non-TODO comment.

A minimal change could look like:

-   //TODO: evaluate and remove redundant metrics with ctrl-runtime defaults
+   // TODO(<issue-id>): evaluate and remove redundant metrics with ctrl-runtime defaults
fault-remediation/pkg/initializer/init.go (1)

45-45: Add issue reference to TODO comment.

Per coding guidelines, TODO comments should reference GitHub issues. Please update the comment to reference a specific issue number.

-// nolint: cyclop // todo
+// nolint: cyclop // TODO(#XXX): Refactor to reduce cyclomatic complexity

Based on coding guidelines: "TODO comments should reference issues in Go code"

fault-remediation/main.go (1)

164-208: Controller-runtime manager setup is generally correct; consider scheme registration for future K8s clients.

The manager wiring (metrics bind address, health/ready checks, leader election options, and using the same ctx as run()) looks good. If you ever start using mgr.GetClient() or caching typed resources, you’ll want to register the core/client-go scheme (e.g. clientgoscheme.AddToScheme(scheme)) rather than relying on an empty runtime.Scheme.

πŸ“œ Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

πŸ“₯ Commits

Reviewing files that changed from the base of the PR and between dc2fca4 and 5004e32.

β›” Files ignored due to path filters (5)
  • commons/go.sum is excluded by !**/*.sum
  • event-exporter/go.sum is excluded by !**/*.sum
  • fault-remediation/go.sum is excluded by !**/*.sum
  • health-events-analyzer/go.sum is excluded by !**/*.sum
  • health-monitors/syslog-health-monitor/go.sum is excluded by !**/*.sum
πŸ“’ Files selected for processing (18)
  • commons/go.mod (2 hunks)
  • commons/pkg/server/server.go (2 hunks)
  • distros/kubernetes/nvsentinel/charts/fault-remediation/templates/clusterrole.yaml (2 hunks)
  • distros/kubernetes/nvsentinel/charts/fault-remediation/templates/deployment.yaml (1 hunks)
  • distros/kubernetes/nvsentinel/values-full.yaml (1 hunks)
  • distros/kubernetes/nvsentinel/values-tilt.yaml (2 hunks)
  • event-exporter/go.mod (1 hunks)
  • fault-remediation/go.mod (2 hunks)
  • fault-remediation/main.go (3 hunks)
  • fault-remediation/pkg/initializer/init.go (5 hunks)
  • fault-remediation/pkg/reconciler/metrics.go (4 hunks)
  • fault-remediation/pkg/reconciler/reconciler.go (17 hunks)
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go (44 hunks)
  • fault-remediation/pkg/reconciler/reconciler_test.go (16 hunks)
  • fault-remediation/pkg/reconciler/remediation.go (1 hunks)
  • health-events-analyzer/go.mod (1 hunks)
  • health-monitors/syslog-health-monitor/go.mod (1 hunks)
  • tests/fault_remediation_test.go (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (5)
  • fault-remediation/pkg/reconciler/remediation.go
  • distros/kubernetes/nvsentinel/values-full.yaml
  • distros/kubernetes/nvsentinel/charts/fault-remediation/templates/clusterrole.yaml
  • commons/pkg/server/server.go
  • distros/kubernetes/nvsentinel/values-tilt.yaml
🧰 Additional context used
πŸ““ Path-based instructions (3)
**/go.mod

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

Use go.mod for each service as a separate Go module with semantic import versioning

Files:

  • health-events-analyzer/go.mod
  • fault-remediation/go.mod
  • commons/go.mod
  • health-monitors/syslog-health-monitor/go.mod
  • event-exporter/go.mod
**/*.go

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types

Files:

  • fault-remediation/main.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/initializer/init.go
  • tests/fault_remediation_test.go
  • fault-remediation/pkg/reconciler/metrics.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/reconciler/reconciler.go
**/*_test.go

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

**/*_test.go: Use envtest for testing Kubernetes controllers instead of fake clients
Use testify/assert and testify/require for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: TestFunctionName_Scenario_ExpectedBehavior

Files:

  • fault-remediation/pkg/reconciler/reconciler_test.go
  • tests/fault_remediation_test.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
🧠 Learnings (16)
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/go.mod : Use `go.mod` for each service as a separate Go module with semantic import versioning

Applied to files:

  • health-events-analyzer/go.mod
  • fault-remediation/go.mod
  • commons/go.mod
  • health-monitors/syslog-health-monitor/go.mod
  • event-exporter/go.mod
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Keep Go dependencies minimal and up-to-date

Applied to files:

  • health-events-analyzer/go.mod
  • fault-remediation/go.mod
  • commons/go.mod
  • health-monitors/syslog-health-monitor/go.mod
  • event-exporter/go.mod
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use `client-go` for Kubernetes API interactions in Go code

Applied to files:

  • health-events-analyzer/go.mod
  • commons/go.mod
  • health-monitors/syslog-health-monitor/go.mod
  • event-exporter/go.mod
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Use `commons/` for shared utilities across Go modules

Applied to files:

  • health-events-analyzer/go.mod
  • commons/go.mod
  • health-monitors/syslog-health-monitor/go.mod
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Prefer informers over direct API calls for watching Kubernetes resources

Applied to files:

  • health-events-analyzer/go.mod
  • fault-remediation/main.go
  • fault-remediation/go.mod
  • health-monitors/syslog-health-monitor/go.mod
  • event-exporter/go.mod
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Implement proper shutdown handling with context cancellation in Go code

Applied to files:

  • fault-remediation/main.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/reconciler/reconciler.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/values.yaml : Note truthy value requirements in Helm chart documentation where applicable

Applied to files:

  • distros/kubernetes/nvsentinel/charts/fault-remediation/templates/deployment.yaml
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients

Applied to files:

  • commons/go.mod
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Name Go tests descriptively using format: `TestFunctionName_Scenario_ExpectedBehavior`

Applied to files:

  • tests/fault_remediation_test.go
πŸ“š Learning: 2025-10-29T15:37:49.210Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 160
File: health-events-analyzer/pkg/reconciler/metrics.go:52-58
Timestamp: 2025-10-29T15:37:49.210Z
Learning: In health-events-analyzer/pkg/reconciler/metrics.go, the ruleMatchedTotal metric should include both "rule_name" and "node_name" labels to identify which rule triggered on which node and how many times, as per the project's observability requirements.

Applied to files:

  • fault-remediation/pkg/reconciler/metrics.go
  • fault-remediation/pkg/reconciler/reconciler.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : TODO comments should reference issues in Go code

Applied to files:

  • health-monitors/syslog-health-monitor/go.mod
  • fault-remediation/pkg/reconciler/reconciler.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use meaningful variable names such as `synced` over `ok` for cache sync checks

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
πŸ“š Learning: 2025-11-06T16:18:09.952Z
Learnt from: KaivalyaMDabhadkar
Repo: NVIDIA/NVSentinel PR: 280
File: tests/fault_management_test.go:334-345
Timestamp: 2025-11-06T16:18:09.952Z
Learning: In the fault-quarantine component, the "quarantinedNodeUncordonedManually" annotation is set to the string literal "True" (with uppercase T), defined as the constant QuarantinedNodeUncordonedManuallyAnnotationValue in fault-quarantine/pkg/common/common.go. Tests should compare against "True", not "true".

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Within `retry.RetryOnConflict` blocks, return errors without wrapping to preserve retry behavior

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler.go
πŸ“š Learning: 2025-12-01T17:53:20.067Z
Learnt from: jtschelling
Repo: NVIDIA/NVSentinel PR: 455
File: docs/designs/013-remediation-plugins.md:162-204
Timestamp: 2025-12-01T17:53:20.067Z
Learning: Code examples in design documents (docs/designs/) in the NVSentinel repository are illustrative stubs meant to show intent and architectural concepts, and do not need to be 100% correct or complete implementations (e.g., undefined variables, missing error handling, incomplete struct initialization are acceptable).

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Function comments required for all exported Go functions

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler.go
🧬 Code graph analysis (5)
fault-remediation/main.go (2)
fault-remediation/pkg/initializer/init.go (3)
  • InitializationParams (34-39)
  • InitializeAll (46-135)
  • Components (41-43)
commons/pkg/server/server.go (2)
  • WithPrometheusMetricsCtrlRuntime (227-234)
  • WithSimpleHealth (246-254)
fault-remediation/pkg/reconciler/reconciler_test.go (2)
fault-remediation/pkg/reconciler/reconciler.go (3)
  • NewFaultRemediationReconciler (78-93)
  • HealthEventDoc (53-56)
  • ReconcilerConfig (42-51)
data-models/pkg/protos/health_event.pb.go (1)
  • RecommendedAction_NONE (43-43)
fault-remediation/pkg/initializer/init.go (5)
fault-remediation/pkg/reconciler/reconciler.go (2)
  • FaultRemediationReconciler (67-75)
  • NewFaultRemediationReconciler (78-93)
store-client/pkg/datastore/registry.go (1)
  • NewDataStore (120-123)
store-client/pkg/watcher/factory.go (2)
  • WatcherConfig (26-38)
  • CreateChangeStreamWatcher (65-87)
store-client/pkg/datastore/types.go (1)
  • Pipeline (121-121)
store-client/pkg/datastore/interfaces.go (1)
  • HealthEventStore (63-87)
fault-remediation/pkg/reconciler/reconciler_e2e_test.go (3)
fault-remediation/pkg/reconciler/reconciler.go (3)
  • FaultRemediationReconciler (67-75)
  • ReconcilerConfig (42-51)
  • NewFaultRemediationReconciler (78-93)
fault-remediation/pkg/reconciler/remediation.go (1)
  • FaultRemediationClient (56-68)
store-client/pkg/datastore/types.go (1)
  • EventWithToken (101-104)
fault-remediation/pkg/reconciler/reconciler.go (7)
store-client/pkg/datastore/types.go (1)
  • EventWithToken (101-104)
store-client/pkg/storewatcher/watch_store.go (1)
  • ChangeStreamWatcher (43-57)
store-client/pkg/datastore/interfaces.go (1)
  • HealthEventStore (63-87)
fault-remediation/pkg/reconciler/annotation.go (1)
  • NodeAnnotationManagerInterface (38-43)
data-models/pkg/model/health_event_extentions.go (4)
  • HealthEventStatus (45-51)
  • UnQuarantined (34-34)
  • Cancelled (37-37)
  • Status (23-23)
commons/pkg/statemanager/statemanager.go (1)
  • StateManager (197-200)
data-models/pkg/protos/health_event.pb.go (1)
  • RecommendedAction_NONE (43-43)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (48)
  • GitHub Check: ko-build-test (event-exporter, .)
  • GitHub Check: ko-build-test (fault-remediation, .)
  • GitHub Check: modules-lint-test (janitor)
  • GitHub Check: modules-lint-test (tests)
  • GitHub Check: ko-build-test (health-monitors/csp-health-monitor, ./cmd/csp-health-monitor)
  • GitHub Check: modules-lint-test (store-client)
  • GitHub Check: modules-lint-test (node-drainer)
  • GitHub Check: modules-lint-test (commons)
  • GitHub Check: ko-build-test (health-monitors/csp-health-monitor, ./cmd/maintenance-notifier)
  • GitHub Check: modules-lint-test (fault-quarantine)
  • GitHub Check: modules-lint-test (fault-remediation)
  • GitHub Check: modules-lint-test (metadata-collector)
  • GitHub Check: modules-lint-test (labeler)
  • GitHub Check: modules-lint-test (health-events-analyzer)
  • GitHub Check: modules-lint-test (data-models)
  • GitHub Check: modules-lint-test (platform-connectors)
  • GitHub Check: ko-build-test (janitor, .)
  • GitHub Check: modules-lint-test (event-exporter)
  • GitHub Check: simple-lint (scripts, make -C scripts lint, Run shellcheck on scripts)
  • GitHub Check: ko-build-test (fault-quarantine, .)
  • GitHub Check: ko-build-test (node-drainer, .)
  • GitHub Check: ko-build-test (labeler, .)
  • GitHub Check: simple-lint (file-server-cleanup, make -C log-collector lint-file-server-cleanup, Run lint, false)
  • GitHub Check: ko-build-test (health-events-analyzer, .)
  • GitHub Check: ko-build-test (platform-connectors, .)
  • GitHub Check: simple-lint (log-collector, make -C log-collector lint-log-collector, Run lint, false)
  • GitHub Check: simple-lint (helm-charts, make helm-lint, Validate Helm charts)
  • GitHub Check: container-build-test (kubernetes-object-monitor, make -C health-monitors/kubernetes-object-monito...
  • GitHub Check: simple-lint (postgres-schema, make validate-postgres-schema, Validate PostgreSQL schema consistency)
  • GitHub Check: container-build-test (syslog-health-monitor, make -C health-monitors/syslog-health-monitor docker...
  • GitHub Check: simple-lint (gomod, make gomod-lint, Run gomod lint)
  • GitHub Check: simple-lint (license-headers, make license-headers-lint, Run license headers check)
  • GitHub Check: simple-lint (kubernetes-distro, make kubernetes-distro-lint, Run lint)
  • GitHub Check: container-build-test (gpu-health-monitor-dcgm4, make -C health-monitors/gpu-health-monitor docker...
  • GitHub Check: container-build-test (log-collector, make -C log-collector docker-build-log-collector)
  • GitHub Check: health-monitors-lint-test (gpu-health-monitor, true, true, false)
  • GitHub Check: container-build-test (gpu-health-monitor-dcgm3, make -C health-monitors/gpu-health-monitor docker...
  • GitHub Check: container-build-test (file-server-cleanup, make -C log-collector docker-build-file-server-cleanup)
  • GitHub Check: health-monitors-lint-test (kubernetes-object-monitor)
  • GitHub Check: simple-lint (protos, make protos-lint, Run protos lint)
  • GitHub Check: tilt-modules-lint-test (tilt/simple-health-client)
  • GitHub Check: health-monitors-lint-test (syslog-health-monitor)
  • GitHub Check: health-monitors-lint-test (csp-health-monitor)
  • GitHub Check: E2E Tests (AMD64 + PostgreSQL)
  • GitHub Check: E2E Tests (ARM64 + MongoDB)
  • GitHub Check: E2E Tests (ARM64 + PostgreSQL)
  • GitHub Check: E2E Tests (AMD64 + MongoDB)
  • GitHub Check: CodeQL PR Analysis
πŸ”‡ Additional comments (33)
health-monitors/syslog-health-monitor/go.mod (1)

22-80: Clarify the distinction between direct and indirect dependencies.

The 27 listed dependencies are marked // indirect, meaning they are transitive dependencies automatically included by other dependencies when running go mod tidy. Indirect dependencies do not need to be directly imported in the codebaseβ€”they exist because a direct dependency requires them. To verify the module is correctly configured, check that direct dependencies (unmarked by // indirect) are actually imported in the code, and confirm that running go mod tidy regenerates the same indirect dependency list.

fault-remediation/pkg/reconciler/metrics.go (1)

20-91: Add issue reference to TODO comment

The TODO comment at line 30 must reference a GitHub issue per project guidelines. Update //TODO: evaluate and remove redundant metrics with ctrl-runtime defaults to include an issue number, e.g., //TODO(#XXXX): evaluate and remove redundant metrics with ctrl-runtime defaults.

The controller-runtime metrics registry wiring and label sets (including consistent node_name labels across all counter vectors and histogram vectors) follow the correct pattern for observability requirements.

fault-remediation/go.mod (1)

32-32: LGTM! Expected transitive dependencies.

These indirect dependencies (fsnotify and btree) are expected additions from the controller-runtime integration. Both are well-known, stable libraries and are correctly marked as indirect.

Also applies to: 51-51

health-events-analyzer/go.mod (1)

29-93: LGTM! Expected transitive dependencies from commons module.

These indirect dependencies are transitively pulled in from the commons module's controller-runtime integration. All are correctly marked as indirect, and this is expected behavior when a dependency adds substantial new functionality like controller-runtime support.

distros/kubernetes/nvsentinel/charts/fault-remediation/templates/deployment.yaml (4)

81-84: LGTM! Controller-runtime flags properly configured.

The conditional addition of controller-runtime and leader-election flags aligns with the main.go logic that enforces leader election when controller-runtime is enabled. The Helm template syntax is correct.


88-91: LGTM! Health port configuration is correct.

The health port is correctly added only when controller-runtime is enabled, as the controller-runtime manager provides dedicated health endpoints on this port. The default value of 9440 matches the configuration in values-full.yaml.


95-95: LGTM! Liveness probe port correctly switches based on mode.

The ternary expression correctly uses the nil-safe pattern and switches between the health port (when controller-runtime is enabled) and metrics port (when disabled). The /healthz path is consistent across both modes.


102-103: LGTM! Readiness probe correctly configured for both modes.

The readiness probe correctly uses different paths and ports based on the controller-runtime mode:

  • Controller-runtime: /readyz on health port (standard for controller-runtime)
  • Non-controller-runtime: /healthz on metrics port

Both ternary expressions use the nil-safe pattern.

event-exporter/go.mod (1)

18-84: LGTM! Expected transitive dependencies from commons module.

Similar to health-events-analyzer, these indirect dependencies are transitively pulled in from the commons module's controller-runtime integration. All are correctly marked as indirect, and this is expected when a shared dependency adds controller-runtime support.

fault-remediation/pkg/initializer/init.go (3)

31-31: LGTM! Watcher import correctly added.

The watcher package import is necessary for the new controller-runtime integration and follows proper Go import conventions.


99-116: LGTM! Datastore and watcher initialization properly implemented.

The initialization logic correctly:

  • Creates the datastore with proper error handling
  • Uses the factory pattern for watcher creation
  • Wraps errors with context using fmt.Errorf("context: %w", err)
  • Retrieves the health event store from the datastore interface

The collection name "HealthEvents" is hardcoded, which is acceptable for now but consider making it configurable if it needs to vary across environments.


132-133: LGTM! Reconciler initialization correctly updated.

The constructor call properly passes all required dependencies (datastore, watcher, health event store, config, and dry-run flag) to the new NewFaultRemediationReconciler function, following proper dependency injection patterns.

fault-remediation/pkg/reconciler/reconciler_test.go (3)

219-219: LGTM! Constructor calls correctly updated for unit tests.

All test cases properly use the new NewFaultRemediationReconciler constructor with nil values for datastore, watcher, and health event store dependencies. This is appropriate for unit tests that focus on testing reconciliation logic with mocked clients rather than actual datastore interactions.

Also applies to: 263-263, 320-320, 371-371, 428-428, 474-474, 501-501, 582-582, 661-661, 692-692, 724-724, 789-789, 864-864, 937-937, 995-995


221-221: LGTM! Field access correctly updated to use unexported fields.

The test code correctly accesses the now-unexported fields (dryRun and config) of the reconciler. This change improves encapsulation while maintaining testability through package-level access.

Also applies to: 273-273, 588-590, 663-663, 694-694, 730-731


377-379: LGTM! performRemediation error handling correctly updated.

The test cases properly handle the updated performRemediation signature that now returns an error. All test cases correctly assert on the error return value using assert.NoError(t, err), improving error handling coverage in tests.

Also applies to: 434-436, 481-483

tests/fault_remediation_test.go (1)

28-28: LGTM! Typo corrected in test function name.

The function name has been corrected from "Reemdiated" to "Remediated", improving code quality and readability with no functional impact.

commons/go.mod (1)

17-17: No action required. Controller-runtime v0.22.4 is a valid, stable release (published 03 Nov 2024) with no known security advisories. The version follows semantic versioning conventions and includes legitimate improvements (cache SyncPeriod configuration, namespaced List fixes, priority-queue fixes, envtest binary-path handling).

fault-remediation/main.go (3)

76-118: Leader-election gating and shutdown wiring look sound.

Flag parsing + the !enableControllerRuntime && enableLeaderElection guard correctly prevent invalid combinations, and using a single signal-derived ctx for both initialization and mgr.Start(ctx) avoids the earlier dual-signal issue. The deferred reconciler.CloseAll(ctx) centralizes cleanup for both runtime modes cleanly.


210-285: Flag surface looks coherent with the new modes.

The new flags for controller-runtime enablement, leader election, timings, and log-collector behavior are clear and align with the runtime branches in run(). Once behavior stabilizes, consider documenting expected combinations (e.g., production defaults) in README/Helm values to keep operator UX consistent.


120-135: metrics-address semantics differ between ctrl-runtime and non-ctrl paths.

In the non‑ctrl path, TrimPrefix(metricsAddr, ":") and Atoi expect only a bare port (e.g., ":2112"); values like "0.0.0.0:2112" will fail with invalid metrics port, despite the flag help text indicating "address/port" support. The ctrl‑runtime path uses metricsAddr directly as a bind address. Either constrain the flag description to "port" for the non‑ctrl path, or parse host:port using net.SplitHostPort to consistently support full addresses in both modes.

Also applies to: 210-223

fault-remediation/pkg/reconciler/reconciler_e2e_test.go (5)

60-87: Mocks are well-shaped for the new reconciler API.

MockChangeStreamWatcher and MockHealthEventStore implement the required interfaces and give you the right hooks (EventsChan, UpdateHealthEventStatusFn, updateCalled) to exercise event-loop and metrics behavior without external dependencies. This keeps the e2e tests focused on controller logic while remaining deterministic.

Also applies to: 96-100


288-345: createTestRemediationClient wiring matches production behavior.

The helper now returns (*FaultRemediationClient, error) and wires discovery, RESTMapper, template, annotation manager, and CRStatusChecker the same way production code does, with proper error propagation at each step. That makes the integration tests realistic while keeping failure modes explicit.


740-895: Event-loop E2E tests effectively validate ctrl-runtime wiring and state behavior.

The new tests that drive events through mockWatcher.EventsChan and assert on annotations, CR existence, metrics, and MarkProcessed counts give good coverage of the ctrl-runtime powered event path and deduplication/cancellation semantics. This is a solid foundation for catching regressions as you iterate on the controller-runtime integration.

Also applies to: 935-955, 1006-1058, 1066-1139, 1161-1223


1335-1355: Metrics error-path test matches the new parsing behavior.

TestMetrics_ProcessingErrors now drives a malformed event through a FaultRemediationReconciler with a mock watcher and asserts that the processingErrors counter increases for the unmarshal_doc_error label. This aligns with the new parseHealthEvent implementation and ensures you don’t regress error metric emission.


741-801: Unable to complete verification of this review comment. The repository is inaccessible, preventing inspection of the mockStore definition, UpdateHealthEventStatusFn implementation, goroutine usage patterns, and all mentioned line ranges (741-801, 847-849, 986-1049, 1062-1063, 1143-1144). To verify this concern, please provide access to the codebase or share the relevant code sections from the test file.

fault-remediation/pkg/reconciler/reconciler.go (8)

42-52: New reconciler struct and constructor are well-aligned with ctrl-runtime.

FaultRemediationReconciler cleanly encapsulates the datastore, watcher, health store, config, and annotation manager, and NewFaultRemediationReconciler wires annotationManager from RemediationClient so tests and main share the same construction path. This gives you a clear surface for both standalone and controller-runtime-managed operation.

Also applies to: 64-92


95-133: Reconcile + parseHealthEvent behavior for malformed events is coherent.

On parse failures you increment processingErrors, log, mark the event as processed, and have Reconcile return nil so controller-runtime doesn’t hot-loop on unprocessable events. The nil-guard for HealthEvent and the cancellation vs remediation split via nodeQuarantined keep the main decision tree straightforward.

Also applies to: 464-516


365-412: Health-event status update helper looks correct and bounded.

updateNodeRemediatedStatus builds a minimal HealthEventStatus, optionally sets LastRemediationTimestamp on success, and caps retries using UpdateMaxRetries/UpdateRetryDelay, wrapping the final error with document ID context. This keeps write amplification bounded while still surfacing persistent store failures clearly.


414-462: CR deduplication via annotations and status checker is clear and robust.

checkExistingCRStatus looks up the equivalence group via common.GetRemediationGroupForAction, inspects the annotation, and uses statusChecker.ShouldSkipCRCreation to decide whether to skip or allow a new CR, cleaning up the annotation group on completion/failure. The fallbacks on annotation or status-checker errors are conservative (β€œallow creation”) and well-logged.


543-573: SetupWithManager + StartWatcherStream integrate cleanly with controller-runtime.

Calling Watcher.Start(ctx) and adapting Watcher.Events() via AdaptEvents into a TypedChannel source, then wiring it into a named controller with a typed workqueue, is an idiomatic way to plug this non-K8s event source into controller-runtime. Since you’re using the same ctx that mgr.Start(ctx) uses, the watcher and adapter goroutine will respect the manager’s lifecycle.


574-602: AdaptEvents implementation is correct and avoids common pointer pitfalls.

The adapter goroutine selects on ctx.Done() and the input channel, closes the output channel on exit, and copies each EventWithToken into a new variable before taking its address, so each enqueued pointer is stable. This should behave well under load while respecting context cancellation.


260-287: Unable to verify: File fault-remediation/pkg/reconciler/reconciler.go not found in repository

The review comment references lines 260-287 and 290-363 in this file, but the file does not exist in the current repository state. This prevents verification of the claims regarding error handling in handleCancellationEvent and handleRemediationEvent, specifically whether the MarkProcessed call in handleRemediationEvent returns nil on error or whether adding a clarifying comment is appropriate.


135-174: Review verification inconclusive due to repository access failure.

The review comment makes specific factual claims about TODO comments without issue references and manual retry loops with UpdateMaxRetries and time.Sleep in the performRemediation and updateNodeRemediatedStatus functions (lines 176-257). However, I cannot access the repository to verify these claims directly.

To proceed, either provide the code snippets for lines 176-194 and 196-257, or re-run this verification once repository access is available.

@github-actions
Copy link

github-actions bot commented Dec 4, 2025

Merging this branch will decrease overall coverage

Impacted Packages Coverage Ξ” πŸ€–
github.com/nvidia/nvsentinel/fault-remediation/pkg/reconciler 30.30% (-0.17%) πŸ‘Ž

Coverage by file

Changed unit test files

  • github.com/nvidia/nvsentinel/fault-remediation/pkg/reconciler/reconciler_e2e_test.go

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
fault-remediation/pkg/reconciler/reconciler_e2e_test.go (1)

104-116: Potential data race on updateCalled counter.

Unlike MockChangeStreamWatcher.markProcessedCount, the updateCalled counter is not protected by a mutex. Since UpdateHealthEventStatus may be called from the reconciler goroutine while tests read updateCalled (e.g., lines 749, 808, 856), this could cause a data race.

Apply this diff to add thread safety:

 // MockHealthEventStore provides a mock implementation of datastore.HealthEventStore for testing
 type MockHealthEventStore struct {
+	mu                        sync.Mutex
 	UpdateHealthEventStatusFn func(ctx context.Context, id string, status datastore.HealthEventStatus) error
 	updateCalled              int
 }

 // UpdateHealthEventStatus updates a health event status (mock implementation)
 func (m *MockHealthEventStore) UpdateHealthEventStatus(ctx context.Context, id string, status datastore.HealthEventStatus) error {
 	if m.UpdateHealthEventStatusFn != nil {
+		m.mu.Lock()
+		m.updateCalled++
+		m.mu.Unlock()
 		return m.UpdateHealthEventStatusFn(ctx, id, status)
 	}
 	return nil
 }
+
+// GetUpdateCalledCount returns the update call count in a thread-safe manner
+func (m *MockHealthEventStore) GetUpdateCalledCount() int {
+	m.mu.Lock()
+	defer m.mu.Unlock()
+	return m.updateCalled
+}
♻️ Duplicate comments (1)
fault-remediation/pkg/reconciler/reconciler_e2e_test.go (1)

245-249: log.Fatalf in goroutine and missing manager readiness synchronization.

This issue was previously flagged. Using log.Fatalf in a goroutine terminates the process abruptly without cleanup. Additionally, there's no synchronization to ensure the manager's cache is synced before tests run.

Consider waiting for cache sync after starting the manager:

 	go func() {
 		if err := mgr.Start(testContext); err != nil {
-			log.Fatalf("Failed to start the test environment manager: %v", err)
+			log.Printf("Failed to start the test environment manager: %v", err)
 		}
 	}()
+
+	// Wait for cache to sync before running tests
+	if !mgr.GetCache().WaitForCacheSync(testContext) {
+		log.Fatalf("Failed to sync manager cache")
+	}
🧹 Nitpick comments (2)
fault-remediation/pkg/reconciler/reconciler_e2e_test.go (2)

377-380: Consider using NewFaultRemediationReconciler constructor for consistency.

Multiple test cases manually construct FaultRemediationReconciler with partial fields. Using the constructor ensures tests stay aligned with production initialization patterns and reduces maintenance burden if the struct evolves.

-		r := FaultRemediationReconciler{
-			config:            cfg,
-			annotationManager: cfg.RemediationClient.GetAnnotationManager(),
-		}
+		r := NewFaultRemediationReconciler(nil, nil, nil, cfg, false)

1121-1121: Consider replacing time.Sleep with deterministic synchronization.

Using time.Sleep for synchronization can lead to flaky tests. If possible, use require.Eventually to poll for a verifiable state change, or add a channel/signal mechanism to confirm event processing completion.

πŸ“œ Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

πŸ“₯ Commits

Reviewing files that changed from the base of the PR and between 5004e32 and 560a318.

πŸ“’ Files selected for processing (1)
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go (43 hunks)
🧰 Additional context used
πŸ““ Path-based instructions (2)
**/*.go

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types

Files:

  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
**/*_test.go

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

**/*_test.go: Use envtest for testing Kubernetes controllers instead of fake clients
Use testify/assert and testify/require for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: TestFunctionName_Scenario_ExpectedBehavior

Files:

  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
🧠 Learnings (4)
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Implement proper shutdown handling with context cancellation in Go code

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use meaningful variable names such as `synced` over `ok` for cache sync checks

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
πŸ“š Learning: 2025-11-06T16:18:09.952Z
Learnt from: KaivalyaMDabhadkar
Repo: NVIDIA/NVSentinel PR: 280
File: tests/fault_management_test.go:334-345
Timestamp: 2025-11-06T16:18:09.952Z
Learning: In the fault-quarantine component, the "quarantinedNodeUncordonedManually" annotation is set to the string literal "True" (with uppercase T), defined as the constant QuarantinedNodeUncordonedManuallyAnnotationValue in fault-quarantine/pkg/common/common.go. Tests should compare against "True", not "true".

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
🧬 Code graph analysis (1)
fault-remediation/pkg/reconciler/reconciler_e2e_test.go (3)
data-models/pkg/model/health_event_extentions.go (1)
  • HealthEventStatus (45-51)
fault-remediation/pkg/reconciler/reconciler.go (3)
  • FaultRemediationReconciler (67-75)
  • ReconcilerConfig (42-51)
  • NewFaultRemediationReconciler (78-93)
fault-remediation/pkg/reconciler/remediation.go (1)
  • FaultRemediationClient (56-68)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: E2E Tests (ARM64 + PostgreSQL)
  • GitHub Check: E2E Tests (AMD64 + MongoDB)
  • GitHub Check: E2E Tests (AMD64 + PostgreSQL)
  • GitHub Check: E2E Tests (ARM64 + MongoDB)
πŸ”‡ Additional comments (3)
fault-remediation/pkg/reconciler/reconciler_e2e_test.go (3)

56-61: Good addition of mutex for thread safety.

The mutex protects concurrent access to markProcessedCount between the reconciler goroutine and test assertions. This prevents data races in the test.


296-353: LGTM!

The refactored createTestRemediationClient now properly returns errors instead of calling log.Fatalf internally, allowing callers to handle failures appropriately.


218-230: LGTM on test setup refactoring.

The test setup properly creates the reconciler using NewFaultRemediationReconciler, wires mock dependencies, and integrates with the controller-runtime manager. The mock store callback correctly tracks update calls.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
fault-remediation/pkg/reconciler/reconciler_e2e_test.go (3)

55-116: Fix data race on MockHealthEventStore.updateCalled while preserving new watcher mutex

The new sync.Mutex use in MockChangeStreamWatcher for markProcessedCount is good, but MockHealthEventStore.updateCalled is incremented from the reconciler goroutine and read/reset directly in tests, with no synchronization. That’s a classic data race and can make updateCalled‑based assertions flaky under concurrent event processing.

Consider guarding updateCalled similarly and exposing helpers instead of direct field access, e.g.:

 type MockHealthEventStore struct {
-	UpdateHealthEventStatusFn func(ctx context.Context, id string, status datastore.HealthEventStatus) error
-	updateCalled              int
+	UpdateHealthEventStatusFn func(ctx context.Context, id string, status datastore.HealthEventStatus) error
+	mu                        sync.Mutex
+	updateCalled              int
 }
+
+func (m *MockHealthEventStore) incUpdateCalled() {
+	m.mu.Lock()
+	defer m.mu.Unlock()
+	m.updateCalled++
+}
+
+func (m *MockHealthEventStore) ResetUpdateCalled() {
+	m.mu.Lock()
+	defer m.mu.Unlock()
+	m.updateCalled = 0
+}
+
+func (m *MockHealthEventStore) GetUpdateCalled() int {
+	m.mu.Lock()
+	defer m.mu.Unlock()
+	return m.updateCalled
+}

Then in setup/tests:

 mockStore = &MockHealthEventStore{}
 mockStore.UpdateHealthEventStatusFn = func(ctx context.Context, id string, status datastore.HealthEventStatus) error {
-	mockStore.updateCalled++
+	mockStore.incUpdateCalled()
 	return nil
 }

And replace all direct uses like mockStore.updateCalled = 0, updateCountBefore := mockStore.updateCalled, and equality checks with the new helpers.


678-688: Fix misleading assertion message in EventSequence test

Here you assert that a new CR should be created after the previous CR succeeds:

shouldCreate, _, err = r.checkExistingCRStatus(ctx, event3)
assert.NoError(t, err)
assert.True(t, shouldCreate, "RESTART_VM should be skipped (CR succeeded)")

The assertion message says β€œshould be skipped”, which contradicts both the comment above and the assert.True. Suggest updating the message:

-	assert.True(t, shouldCreate, "RESTART_VM should be skipped (CR succeeded)")
+	assert.True(t, shouldCreate, "RESTART_VM should be allowed after CR succeeded")

This keeps the behavior as‑is but avoids confusion when reading failures.


748-857: Make updateCalled‑based dedup assertion more robust against async processing

The E2E subtest uses mockStore.updateCalled to ensure the duplicate event doesn’t trigger a DB update:

mockStore.updateCalled = 0
...
updateCountBefore := mockStore.updateCalled
...
assert.Equal(t, updateCountBefore, mockStore.updateCalled, "MongoDB update should not be called for skipped event")

Given that event handling happens in the reconciler’s goroutine, this has two issues:

  1. As noted earlier, direct access to updateCalled is a data race.
  2. Even after assert.Eventually confirms deduplication at the annotation level, there’s still a chance more store updates from prior events are in flight, making an exact equality check brittle.

Once MockHealthEventStore has synchronized accessors, you can both fix the race and make the assertion more intent‑focused, e.g.:

mockStore.ResetUpdateCalled()
...
updateCountBefore := mockStore.GetUpdateCalled()
...
assert.Eventually(t, func() bool {
    return mockStore.GetUpdateCalled() == updateCountBefore
}, 2*time.Second, 100*time.Millisecond, "MongoDB update should not be called for skipped event")

This retains the dedup guarantee while handling async behavior more gracefully.

♻️ Duplicate comments (1)
fault-remediation/pkg/reconciler/reconciler_e2e_test.go (1)

167-261: Avoid log.Fatalf inside manager goroutine and consider minimal readiness sync

TestMain correctly uses envtest and a shared reconciler, but the manager goroutine still does:

go func() {
    if err := mgr.Start(testContext); err != nil {
        log.Fatalf("Failed to start the test environment manager: %v", err)
    }
}()

Calling log.Fatalf from this goroutine can terminate the process abruptly without running tearDownTestEnvironment() or flushing test logs. It also hides the failure from individual tests.

A safer pattern is to log the error and let tests fail via assertions, optionally adding a simple readiness signal:

-	go func() {
-		if err := mgr.Start(testContext); err != nil {
-			log.Fatalf("Failed to start the test environment manager: %v", err)
-		}
-	}()
+	mgrReady := make(chan struct{})
+	go func() {
+		close(mgrReady) // manager start goroutine is running
+		if err := mgr.Start(testContext); err != nil {
+			log.Printf("Failed to start the test environment manager: %v", err)
+		}
+	}()
+	<-mgrReady

You can optionally follow this with mgr.GetCache().WaitForCacheSync(testContext) if you want cache readiness before tests proceed.

🧹 Nitpick comments (2)
fault-remediation/pkg/reconciler/reconciler_e2e_test.go (2)

297-353: createTestRemediationClient wiring looks correct for envtest

The createTestRemediationClient helper cleanly wires discovery, RESTMapper, dynamic client, template loading, annotation manager, dry‑run mode, and CRStatusChecker against the envtest testRestConfig. This aligns with the guidance to use envtest and real k8s clients for controller tests.

One small optional improvement would be to wrap returned errors with context (e.g., fmt.Errorf("load rebootnode template: %w", err)) to make failures easier to debug, but the current direct returns are acceptable for test code.


1347-1360: Ensure Reconcile doesn’t depend on uninitialized fields in TestMetrics_ProcessingErrors

In TestMetrics_ProcessingErrors, you construct:

invalidEventToken := &datastore.EventWithToken{ ... }
watcher := NewMockChangeStreamWatcher()

r := &FaultRemediationReconciler{
    Watcher: watcher,
}

r.Reconcile(testContext, invalidEventToken)

This assumes that the Reconcile path for invalid documents only touches Watcher (if at all) and global metrics, and does not read ds, healthEventStore, config, or annotationManager. If any of those are accessed, this test will panic with nil dereferences.

If Reconcile already guards those paths, this is fine. Otherwise, consider constructing the reconciler via NewFaultRemediationReconciler with a minimal ReconcilerConfig to ensure all used dependencies are non‑nil.

πŸ“œ Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

πŸ“₯ Commits

Reviewing files that changed from the base of the PR and between 560a318 and b78b5af.

πŸ“’ Files selected for processing (3)
  • distros/kubernetes/nvsentinel/values-tilt-mongodb.yaml (1 hunks)
  • distros/kubernetes/nvsentinel/values-tilt.yaml (2 hunks)
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go (43 hunks)
🧰 Additional context used
πŸ““ Path-based instructions (2)
**/*.go

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types

Files:

  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
**/*_test.go

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

**/*_test.go: Use envtest for testing Kubernetes controllers instead of fake clients
Use testify/assert and testify/require for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: TestFunctionName_Scenario_ExpectedBehavior

Files:

  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
🧠 Learnings (5)
πŸ“š Learning: 2025-11-26T13:54:26.427Z
Learnt from: rupalis-nv
Repo: NVIDIA/NVSentinel PR: 361
File: distros/kubernetes/nvsentinel/values-tilt.yaml:215-223
Timestamp: 2025-11-26T13:54:26.427Z
Learning: For values-tilt.yaml files: Keep documentation minimal and implementation-focused. Unlike production values.yaml files which need comprehensive inline comments, values-tilt.yaml is a Tilt-specific testing/development override file that doesn't require extensive documentation.

Applied to files:

  • distros/kubernetes/nvsentinel/values-tilt.yaml
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Implement proper shutdown handling with context cancellation in Go code

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use meaningful variable names such as `synced` over `ok` for cache sync checks

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
πŸ“š Learning: 2025-11-06T16:18:09.952Z
Learnt from: KaivalyaMDabhadkar
Repo: NVIDIA/NVSentinel PR: 280
File: tests/fault_management_test.go:334-345
Timestamp: 2025-11-06T16:18:09.952Z
Learning: In the fault-quarantine component, the "quarantinedNodeUncordonedManually" annotation is set to the string literal "True" (with uppercase T), defined as the constant QuarantinedNodeUncordonedManuallyAnnotationValue in fault-quarantine/pkg/common/common.go. Tests should compare against "True", not "true".

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
🧬 Code graph analysis (1)
fault-remediation/pkg/reconciler/reconciler_e2e_test.go (2)
store-client/pkg/datastore/types.go (1)
  • EventWithToken (101-104)
fault-remediation/pkg/reconciler/reconciler.go (3)
  • FaultRemediationReconciler (67-75)
  • ReconcilerConfig (42-51)
  • NewFaultRemediationReconciler (78-93)
πŸ”‡ Additional comments (4)
distros/kubernetes/nvsentinel/values-tilt.yaml (2)

24-24: Consistent flag enablement across Tilt environment configurations.

The ctrlRuntimeEnabled: true flag is properly positioned in the global section and will affect downstream service deployments. Placement and indentation are correct.

A minor observation: values-tilt-mongodb.yaml includes an explanatory comment for this flag clarifying the dual-test-coverage rationale, while this file does not. Given the minimal-documentation guideline for Tilt development overrides, this asymmetry is acceptable, though consistency might aid maintainability.


220-220: Verify replicaCount: 1 is recognized and intentional for fault-remediation Helm chart.

The addition of replicaCount: 1 for fault-remediation in a Tilt development environment is reasonable for single-replica testing. However, confirm that:

  • The fault-remediation Helm chart recognizes and properly handles the replicaCount field.
  • The value 1 is the intended default for controller-runtime-enabled testing (especially relevant given the pod anti-affinity rules below, which become effectively no-op with a single replica).
  • This change does not contribute to the e2e test failures mentioned in the PR objectives.
distros/kubernetes/nvsentinel/values-tilt-mongodb.yaml (2)

27-28: Clear rationale for dual-mode controller-runtime testing.

The comment explaining the test coverage strategy (MongoDB with ctrl-runtime enabled, PostgreSQL with legacy behavior) is excellent documentation and clarifies the design intent for multi-datastore scenarios.

Confirm that values-tilt-postgresql.yaml intentionally omits or disables ctrlRuntimeEnabled to maintain the legacy behavior pathway for PostgreSQL and ensure the dual test coverage design is fully implemented.


1-62: Verify PostgreSQL configuration maintains legacy behavior per stated dual-test-coverage design.

The MongoDB file explicitly documents the test coverage strategy: "Use ctrl runtime in mongo and leave postgres with legacy so both get test coverage." Confirm that distros/kubernetes/nvsentinel/values-tilt-postgresql.yaml either does not include ctrlRuntimeEnabled (defaults to false/unset) or explicitly sets ctrlRuntimeEnabled: false to ensure the dual-mode testing architecture is properly maintained.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (7)
health-events-analyzer/go.mod (1)

36-49: Malformed indirect module pathsβ€”subpackages listed instead of module.

Lines 36-49 contain entries like github.com/go-openapi/swag/cmdutils v0.25.4 // indirect. In standard Go module semantics, subpackages within a module should not appear as separate require entries. These should typically be consolidated to a single github.com/go-openapi/swag v0.25.4 entry.

Run go mod tidy to verify these entries are correctly resolved, or confirm if go-openapi/swag v0.25+ has been restructured into separate submodules.

go-openapi/swag v0.25 submodules structure
health-monitors/syslog-health-monitor/go.mod (1)

32-42: Already flagged: Malformed go-openapi/swag subpackage entries.

This issue was already identified in a previous review. The subpackage entries (cmdutils, conv, fileutils, etc.) should be consolidated to a single module entry and verified with go mod tidy.

event-exporter/go.mod (1)

28-38: Same malformed go-openapi/swag subpackage entries as other modules.

Consistent with the pattern in other go.mod files, these subpackage entries may need consolidation. Ensure go mod tidy is run across all modules to verify the dependency graph is valid.

fault-remediation/pkg/reconciler/reconciler_e2e_test.go (1)

206-249: Consider graceful error handling in manager goroutine.

Using log.Fatalf inside a goroutine (line 247) will abruptly terminate the test process without proper cleanup. This was flagged in a past review but not fully addressed.

Consider logging the error and letting tests fail naturally:

 	go func() {
 		if err := mgr.Start(testContext); err != nil {
-			log.Fatalf("Failed to start the test environment manager: %v", err)
+			log.Printf("Failed to start the test environment manager: %v", err)
 		}
 	}()

Additionally, consider adding a cache sync wait after starting the manager to ensure it's ready before tests run.

fault-remediation/pkg/reconciler/reconciler.go (2)

527-541: Add nil guards to CloseAll for safer test usage.

CloseAll unconditionally calls r.ds.Close(ctx) and r.Watcher.Close(ctx). In production, InitializeAll ensures both are non-nil, but tests construct FaultRemediationReconciler with nil values. Calling CloseAll on such instances would panic.

 func (r *FaultRemediationReconciler) CloseAll(ctx context.Context) error {
 	var errs []error
 
-	if err := r.ds.Close(ctx); err != nil {
-		slog.Error("failed to close datastore", "error", err)
-		errs = append(errs, err)
+	if r.ds != nil {
+		if err := r.ds.Close(ctx); err != nil {
+			slog.Error("failed to close datastore", "error", err)
+			errs = append(errs, err)
+		}
 	}
 
-	if err := r.Watcher.Close(ctx); err != nil {
-		slog.Error("failed to close Watcher", "error", err)
-		errs = append(errs, err)
+	if r.Watcher != nil {
+		if err := r.Watcher.Close(ctx); err != nil {
+			slog.Error("failed to close Watcher", "error", err)
+			errs = append(errs, err)
+		}
 	}
 
 	return errors.Join(errs...)
 }

355-360: Inconsistent error handling for MarkProcessed.

When MarkProcessed fails at line 355, the function returns ctrl.Result{}, nil instead of the error. This is inconsistent with similar error handling at lines 276-283, 302-306, and 327-331, which return the error. Without returning the error, controller-runtime won't apply retry backoff.

 	if err = watcherInstance.MarkProcessed(ctx, eventWithToken.ResumeToken); err != nil {
 		processingErrors.WithLabelValues("mark_processed_error", nodeName).Inc()
 		slog.Error("Error updating resume token", "error", err)
 
-		return ctrl.Result{}, nil
+		return ctrl.Result{}, err
 	}
fault-remediation/main.go (1)

148-159: Event loop doesn't respect context cancellation.

The for event := range loop will block until the channel closes. If gCtx is cancelled (e.g., SIGTERM), the goroutine won't exit until Watcher.Events() channel is closed. This was flagged in past reviews but appears only partially addressed.

Consider using a select with gCtx.Done():

-		for event := range components.FaultRemediationReconciler.Watcher.Events() {
-			slog.Info("Event received", "event", event)
-			_, _ = components.FaultRemediationReconciler.Reconcile(gCtx, &event)
-		}
-
-		return nil
+		eventCh := components.FaultRemediationReconciler.Watcher.Events()
+		for {
+			select {
+			case <-gCtx.Done():
+				slog.Info("Context cancelled, stopping event processing")
+				return nil
+			case event, ok := <-eventCh:
+				if !ok {
+					return nil
+				}
+				slog.Info("Event received", "event", event)
+				_, _ = components.FaultRemediationReconciler.Reconcile(gCtx, &event)
+			}
+		}

Based on learnings, proper shutdown handling with context cancellation is required for Go code.

🧹 Nitpick comments (5)
fault-remediation/pkg/reconciler/metrics.go (1)

30-31: TODO should reference an issue.

Per coding guidelines, TODO comments should reference issues in Go code. Consider creating a tracking issue for evaluating redundant metrics and updating this comment.

-	//TODO: evaluate and remove redundant metrics with ctrl-runtime defaults
+	// TODO(#XXX): evaluate and remove redundant metrics with ctrl-runtime defaults
fault-remediation/pkg/initializer/init.go (2)

45-45: Consider addressing the cyclomatic complexity.

The nolint:cyclop directive suppresses cyclomatic complexity warnings. While acceptable for now, consider refactoring initialization into smaller helper functions in a follow-up to improve maintainability.


70-70: TODO should reference an issue.

Per coding guidelines, TODO comments should reference issues. Consider creating a tracking issue for the manager client replacement.

-	//TODO: replace this with manager client
+	//TODO(#XXX): replace this with manager client
fault-remediation/pkg/reconciler/reconciler.go (2)

215-216: TODO should reference an issue.

Per coding guidelines, TODO comments should reference issues for tracking.

-	//TODO: return error to use built in ctrl runtime retries
+	//TODO(#XXX): return error to use built-in ctrl-runtime retries

339-339: Consider handling runLogCollector error.

The error from runLogCollector is silently discarded. While log collector failures may be non-fatal, consider at minimum logging at warn level or tracking via metrics.

-	_ = r.runLogCollector(ctx, healthEvent)
+	if err := r.runLogCollector(ctx, healthEvent); err != nil {
+		slog.Warn("Log collector failed but continuing with remediation", "node", nodeName, "error", err)
+	}
πŸ“œ Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

πŸ“₯ Commits

Reviewing files that changed from the base of the PR and between b78b5af and 131fb72.

β›” Files ignored due to path filters (5)
  • commons/go.sum is excluded by !**/*.sum
  • event-exporter/go.sum is excluded by !**/*.sum
  • fault-remediation/go.sum is excluded by !**/*.sum
  • health-events-analyzer/go.sum is excluded by !**/*.sum
  • health-monitors/syslog-health-monitor/go.sum is excluded by !**/*.sum
πŸ“’ Files selected for processing (19)
  • commons/go.mod (2 hunks)
  • commons/pkg/server/server.go (2 hunks)
  • distros/kubernetes/nvsentinel/charts/fault-remediation/templates/clusterrole.yaml (2 hunks)
  • distros/kubernetes/nvsentinel/charts/fault-remediation/templates/deployment.yaml (1 hunks)
  • distros/kubernetes/nvsentinel/values-full.yaml (1 hunks)
  • distros/kubernetes/nvsentinel/values-tilt-mongodb.yaml (1 hunks)
  • distros/kubernetes/nvsentinel/values-tilt.yaml (2 hunks)
  • event-exporter/go.mod (1 hunks)
  • fault-remediation/go.mod (2 hunks)
  • fault-remediation/main.go (3 hunks)
  • fault-remediation/pkg/initializer/init.go (5 hunks)
  • fault-remediation/pkg/reconciler/metrics.go (4 hunks)
  • fault-remediation/pkg/reconciler/reconciler.go (17 hunks)
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go (43 hunks)
  • fault-remediation/pkg/reconciler/reconciler_test.go (16 hunks)
  • fault-remediation/pkg/reconciler/remediation.go (1 hunks)
  • health-events-analyzer/go.mod (1 hunks)
  • health-monitors/syslog-health-monitor/go.mod (1 hunks)
  • tests/fault_remediation_test.go (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (7)
  • fault-remediation/pkg/reconciler/remediation.go
  • distros/kubernetes/nvsentinel/values-tilt.yaml
  • fault-remediation/go.mod
  • distros/kubernetes/nvsentinel/charts/fault-remediation/templates/clusterrole.yaml
  • distros/kubernetes/nvsentinel/values-full.yaml
  • distros/kubernetes/nvsentinel/charts/fault-remediation/templates/deployment.yaml
  • tests/fault_remediation_test.go
🧰 Additional context used
πŸ““ Path-based instructions (3)
**/*.go

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types

Files:

  • fault-remediation/pkg/initializer/init.go
  • fault-remediation/main.go
  • commons/pkg/server/server.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/reconciler/metrics.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/reconciler/reconciler.go
**/go.mod

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

Use go.mod for each service as a separate Go module with semantic import versioning

Files:

  • commons/go.mod
  • health-events-analyzer/go.mod
  • health-monitors/syslog-health-monitor/go.mod
  • event-exporter/go.mod
**/*_test.go

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

**/*_test.go: Use envtest for testing Kubernetes controllers instead of fake clients
Use testify/assert and testify/require for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: TestFunctionName_Scenario_ExpectedBehavior

Files:

  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
🧠 Learnings (14)
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Use `commons/` for shared utilities across Go modules

Applied to files:

  • commons/go.mod
  • health-events-analyzer/go.mod
  • health-monitors/syslog-health-monitor/go.mod
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/go.mod : Use `go.mod` for each service as a separate Go module with semantic import versioning

Applied to files:

  • commons/go.mod
  • health-events-analyzer/go.mod
  • health-monitors/syslog-health-monitor/go.mod
  • event-exporter/go.mod
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Keep Go dependencies minimal and up-to-date

Applied to files:

  • commons/go.mod
  • health-events-analyzer/go.mod
  • health-monitors/syslog-health-monitor/go.mod
  • event-exporter/go.mod
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use `client-go` for Kubernetes API interactions in Go code

Applied to files:

  • commons/go.mod
  • health-events-analyzer/go.mod
  • health-monitors/syslog-health-monitor/go.mod
  • event-exporter/go.mod
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients

Applied to files:

  • commons/go.mod
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Prefer informers over direct API calls for watching Kubernetes resources

Applied to files:

  • health-events-analyzer/go.mod
  • fault-remediation/main.go
  • health-monitors/syslog-health-monitor/go.mod
  • event-exporter/go.mod
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Implement proper shutdown handling with context cancellation in Go code

Applied to files:

  • fault-remediation/main.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/reconciler/reconciler.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use meaningful variable names such as `synced` over `ok` for cache sync checks

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
πŸ“š Learning: 2025-11-06T16:18:09.952Z
Learnt from: KaivalyaMDabhadkar
Repo: NVIDIA/NVSentinel PR: 280
File: tests/fault_management_test.go:334-345
Timestamp: 2025-11-06T16:18:09.952Z
Learning: In the fault-quarantine component, the "quarantinedNodeUncordonedManually" annotation is set to the string literal "True" (with uppercase T), defined as the constant QuarantinedNodeUncordonedManuallyAnnotationValue in fault-quarantine/pkg/common/common.go. Tests should compare against "True", not "true".

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
πŸ“š Learning: 2025-10-29T15:37:49.210Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 160
File: health-events-analyzer/pkg/reconciler/metrics.go:52-58
Timestamp: 2025-10-29T15:37:49.210Z
Learning: In health-events-analyzer/pkg/reconciler/metrics.go, the ruleMatchedTotal metric should include both "rule_name" and "node_name" labels to identify which rule triggered on which node and how many times, as per the project's observability requirements.

Applied to files:

  • fault-remediation/pkg/reconciler/metrics.go
  • fault-remediation/pkg/reconciler/reconciler.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : TODO comments should reference issues in Go code

Applied to files:

  • health-monitors/syslog-health-monitor/go.mod
  • fault-remediation/pkg/reconciler/reconciler.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Within `retry.RetryOnConflict` blocks, return errors without wrapping to preserve retry behavior

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler.go
πŸ“š Learning: 2025-12-01T17:53:20.067Z
Learnt from: jtschelling
Repo: NVIDIA/NVSentinel PR: 455
File: docs/designs/013-remediation-plugins.md:162-204
Timestamp: 2025-12-01T17:53:20.067Z
Learning: Code examples in design documents (docs/designs/) in the NVSentinel repository are illustrative stubs meant to show intent and architectural concepts, and do not need to be 100% correct or complete implementations (e.g., undefined variables, missing error handling, incomplete struct initialization are acceptable).

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Function comments required for all exported Go functions

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler.go
🧬 Code graph analysis (4)
fault-remediation/pkg/initializer/init.go (4)
fault-remediation/pkg/reconciler/reconciler.go (2)
  • FaultRemediationReconciler (67-75)
  • NewFaultRemediationReconciler (78-93)
store-client/pkg/datastore/registry.go (1)
  • NewDataStore (120-123)
store-client/pkg/watcher/factory.go (2)
  • WatcherConfig (26-38)
  • CreateChangeStreamWatcher (65-87)
store-client/pkg/datastore/interfaces.go (1)
  • HealthEventStore (63-87)
fault-remediation/pkg/reconciler/reconciler_e2e_test.go (4)
store-client/pkg/datastore/types.go (1)
  • EventWithToken (101-104)
data-models/pkg/model/health_event_extentions.go (1)
  • HealthEventStatus (45-51)
fault-remediation/pkg/reconciler/reconciler.go (3)
  • FaultRemediationReconciler (67-75)
  • ReconcilerConfig (42-51)
  • NewFaultRemediationReconciler (78-93)
fault-remediation/pkg/reconciler/remediation.go (1)
  • FaultRemediationClient (56-68)
fault-remediation/pkg/reconciler/reconciler_test.go (1)
fault-remediation/pkg/reconciler/reconciler.go (3)
  • NewFaultRemediationReconciler (78-93)
  • HealthEventDoc (53-56)
  • ReconcilerConfig (42-51)
fault-remediation/pkg/reconciler/reconciler.go (4)
store-client/pkg/storewatcher/watch_store.go (1)
  • ChangeStreamWatcher (43-57)
store-client/pkg/datastore/interfaces.go (1)
  • HealthEventStore (63-87)
fault-remediation/pkg/reconciler/annotation.go (1)
  • NodeAnnotationManagerInterface (38-43)
data-models/pkg/protos/health_event.pb.go (1)
  • RecommendedAction_NONE (43-43)
πŸ”‡ Additional comments (38)
distros/kubernetes/nvsentinel/values-tilt-mongodb.yaml (1)

27-28: LGTM!

Good approach to enable controller-runtime for MongoDB while keeping legacy mode for PostgreSQL to ensure both paths get test coverage during development.

commons/pkg/server/server.go (1)

227-233: Implementation is correct.

The function properly uses promhttp.HandlerFor with the controller-runtime registry (crmetrics.Registry), enabling metrics from controller-runtime components to be exposed alongside custom metrics.

fault-remediation/pkg/reconciler/metrics.go (1)

33-92: LGTM! Consistent migration to controller-runtime registry.

All metrics are correctly migrated to use promauto.With(crmetrics.Registry), ensuring they're registered with the controller-runtime metrics registry. This aligns with the WithPrometheusMetricsCtrlRuntime() server option that exposes these metrics via /metrics.

commons/go.mod (1)

17-17: controller-runtime v0.22.4 is the current stable version as of December 2025.

Adding sigs.k8s.io/controller-runtime as a direct dependency in commons is appropriate for the shared module. However, verify that the module organization follows the guideline requiring each service to use go.mod as a separate Go module with semantic import versioning.

fault-remediation/pkg/initializer/init.go (4)

31-31: LGTM!

The watcher package import is correctly added to support the new factory-based watcher initialization.


41-43: LGTM!

The Components struct field is appropriately renamed to reflect the new FaultRemediationReconciler type, maintaining consistency with the reconciler refactoring.


99-116: LGTM - initialization flow is well-structured.

The datastore, watcher, and health event store are initialized with proper error handling and wrapping. The factory pattern for watcher creation aligns with the relevant code snippets provided.

One consideration: the collection name "HealthEvents" is hardcoded. Verify this is intentional and doesn't need to be configurable via the TOML config.


131-134: LGTM!

The reconciler is correctly constructed using the new NewFaultRemediationReconciler constructor with all required dependencies wired appropriately.

fault-remediation/pkg/reconciler/reconciler_test.go (10)

219-222: LGTM!

Test correctly updated to use the new constructor with nil dependencies (appropriate for unit tests that don't exercise datastore/watcher) and validates the private dryRun field.


263-274: LGTM!

Constructor updated correctly. Note the discarded second return value from CreateMaintenanceResource is acceptable here since the test focuses on the success boolean.


320-323: LGTM!

Test correctly uses the new constructor and validates shouldSkipEvent behavior for unsupported actions.


371-381: LGTM!

Test properly updated to handle the new (bool, string, error) return signature from performRemediation, with appropriate assert.NoError check.


428-438: LGTM!

Failure case test correctly updated with new constructor and error handling.


474-485: LGTM!

Label update failure test correctly adapted to the new API surface.


500-501: LGTM!

Tests for shouldSkipEvent and log collector behavior correctly updated to use r.config.EnableLogCollector and r.config.RemediationClient.

Also applies to: 582-593


661-663: LGTM!

All log collector tests correctly adapted to the new constructor and field access patterns.

Also applies to: 692-694, 724-731


784-789: LGTM!

TestUpdateNodeRemediatedStatus correctly sets up the mock health store and uses the new constructor.


864-864: LGTM!

CR-based deduplication and equivalence group tests correctly migrated to the new constructor API.

Also applies to: 937-937, 995-995

fault-remediation/main.go (5)

17-40: LGTM!

Imports are well-organized, including controller-runtime packages, error handling, and the initializer package.


42-63: LGTM!

Package-level flag variables are cleanly declared with clear naming. This centralizes configuration management effectively.


75-118: LGTM!

The run() function properly:

  • Validates that leader election requires controller-runtime mode
  • Uses signal context for graceful shutdown
  • Defers CloseAll for resource cleanup
  • Routes to the appropriate setup function based on mode

164-208: LGTM!

The controller-runtime setup is well-structured:

  • Manager configured with metrics, health probes, and leader election
  • Health/ready checks added via healthz.Ping
  • Reconciler properly wired via SetupWithManager
  • Manager started with the parent context for proper lifecycle management

210-285: LGTM!

The parseFlags function is comprehensive and well-documented:

  • Clear flag descriptions with requirements noted
  • Sensible defaults for leader election timing
  • Proper separation of concerns from the main logic
fault-remediation/pkg/reconciler/reconciler_e2e_test.go (8)

56-61: LGTM - Good concurrency safety!

Adding mutex protection to MockChangeStreamWatcher counters prevents data races when tests and the reconciler access counts concurrently. This is a proper testing practice.

Also applies to: 82-101


105-108: LGTM!

The updateCalled counter enables precise assertions on database update behavior, supporting the deduplication verification tests.


167-177: LGTM!

Package-level test variables properly updated to use FaultRemediationReconciler and separate mock instances for watcher and store.


297-353: LGTM!

createTestRemediationClient now properly returns (*FaultRemediationClient, error), following Go error handling conventions. All callers are updated to handle the error appropriately.


364-380: LGTM - Tests properly use public API.

Integration tests now construct FaultRemediationReconciler directly with annotationManager wired through cfg.RemediationClient.GetAnnotationManager(). This aligns with the public API design and avoids relying on internal fields.

Also applies to: 427-439, 480-492, 554-566, 622-634


393-394: LGTM!

All performRemediation calls correctly handle the new (bool, string, error) return signature with assert.NoError(t, err) checks.

Also applies to: 451-452, 504-505, 531-532, 578-579, 652-653, 715-716


749-749: LGTM!

Tests properly reset mockStore.updateCalled before each test run to ensure isolated assertions.

Also applies to: 808-808, 856-856, 996-996, 1070-1070, 1151-1151


1344-1363: LGTM!

Metrics processing error test correctly creates a local reconciler instance with a fresh mock watcher, avoiding interference with global test state.

fault-remediation/pkg/reconciler/reconciler.go (7)

17-40: LGTM!

Imports are well-organized with controller-runtime packages, error handling, and proper aliasing for nvstoreclient.


64-93: LGTM!

The FaultRemediationReconciler struct and NewFaultRemediationReconciler constructor have proper godoc comments. The struct correctly embeds client.Client for controller-runtime compatibility and stores all necessary dependencies.


95-133: LGTM!

The Reconcile method properly follows the controller-runtime pattern:

  • Returns ctrl.Result and error
  • Metrics tracking for event duration
  • Proper delegation to cancellation or remediation handlers

135-174: LGTM!

shouldSkipEvent correctly handles:

  • NONE actions (no remediation needed)
  • Already remediated events
  • Unsupported actions with metrics tracking and label updates

176-194: LGTM!

runLogCollector now returns an error, enabling callers to handle failures appropriately.


543-572: LGTM!

SetupWithManager correctly:

  • Starts the watcher with the provided context (addressing past review about context.Background())
  • Uses AdaptEvents to convert events to controller-runtime format
  • Wires the reconciler with a typed channel source

574-602: LGTM!

AdaptEvents properly:

  • Transforms events to controller-runtime TypedGenericEvent
  • Respects context cancellation
  • Closes output channel when input closes or context is done

@dims
Copy link
Collaborator

dims commented Dec 4, 2025

/ok to test 131fb72

@github-actions
Copy link

github-actions bot commented Dec 4, 2025

Merging this branch will decrease overall coverage

Impacted Packages Coverage Ξ” πŸ€–
github.com/nvidia/nvsentinel/fault-remediation/pkg/reconciler 30.30% (-0.17%) πŸ‘Ž

Coverage by file

Changed unit test files

  • github.com/nvidia/nvsentinel/fault-remediation/pkg/reconciler/reconciler_e2e_test.go

@ivelichkovich ivelichkovich force-pushed the ctrlruntime branch 2 times, most recently from 73a7468 to 260636b Compare December 5, 2025 17:16
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (4)
health-monitors/syslog-health-monitor/go.mod (1)

31-42: Fix malformed subpackage module pathsβ€”this issue persists from prior review.

Lines 32–42 list invalid subpackage paths like github.com/go-openapi/swag/cmdutils, conv, fileutils, etc., as separate module entries. In Go's go.mod, only the module path should appear. Subpackages (cmdutils, conv, etc.) are internal to the github.com/go-openapi/swag module and must not be listed separately.

Line 31 already correctly declares github.com/go-openapi/swag v0.25.4 // indirect, which implicitly covers all subpackages. Remove lines 32–42 and run go mod tidy to reconcile.

 	github.com/go-openapi/swag v0.25.4 // indirect
-	github.com/go-openapi/swag/cmdutils v0.25.4 // indirect
-	github.com/go-openapi/swag/conv v0.25.4 // indirect
-	github.com/go-openapi/swag/fileutils v0.25.4 // indirect
-	github.com/go-openapi/swag/jsonname v0.25.4 // indirect
-	github.com/go-openapi/swag/jsonutils v0.25.4 // indirect
-	github.com/go-openapi/swag/loading v0.25.4 // indirect
-	github.com/go-openapi/swag/mangling v0.25.4 // indirect
-	github.com/go-openapi/swag/netutils v0.25.4 // indirect
-	github.com/go-openapi/swag/stringutils v0.25.4 // indirect
-	github.com/go-openapi/swag/typeutils v0.25.4 // indirect
-	github.com/go-openapi/swag/yamlutils v0.25.4 // indirect
 	github.com/gogo/protobuf v1.3.2 // indirect
commons/pkg/server/server.go (1)

218-234: Fix the example in docstringβ€”shows wrong function name.

The docstring example on line 225 shows WithPrometheusMetrics() but this documentation is for WithPrometheusMetricsCtrlRuntime(). This appears to be a copy-paste oversight from the existing WithPrometheusMetrics() function.

Apply this diff to fix the example:

 // WithPrometheusMetricsCtrlRuntime registers the Prometheus metrics handler at /metrics.
 // This uses the controller runtime Prometheus registry.
 //
 // Example:
 //
 //	srv := NewServer(
 //	    WithPort(2112),
-//	    WithPrometheusMetrics(),
+//	    WithPrometheusMetricsCtrlRuntime(),
 //	)
fault-remediation/main.go (1)

148-159: Event loop doesn't respond to context cancellation.

The for range loop at lines 153-156 blocks until Watcher.Events() closes. If gCtx is cancelled (e.g., SIGTERM), this goroutine won't exit until the watcher closes its channel, potentially delaying graceful shutdown.

Consider using a select statement to monitor both the event channel and context:

 	g.Go(func() error {
 		components.FaultRemediationReconciler.StartWatcherStream(gCtx)
 
 		slog.Info("Listening for events on the channel...")
-		for event := range components.FaultRemediationReconciler.Watcher.Events() {
-			slog.Info("Event received", "event", event)
-			_, _ = components.FaultRemediationReconciler.Reconcile(gCtx, &event)
+		eventCh := components.FaultRemediationReconciler.Watcher.Events()
+		for {
+			select {
+			case <-gCtx.Done():
+				slog.Info("Context cancelled, stopping event processing")
+				return nil
+			case event, ok := <-eventCh:
+				if !ok {
+					return nil
+				}
+				slog.Info("Event received", "event", event)
+				_, _ = components.FaultRemediationReconciler.Reconcile(gCtx, &event)
+			}
 		}
-
-		return nil
 	})

Based on coding guidelines requiring proper shutdown handling with context cancellation.

fault-remediation/pkg/reconciler/reconciler_e2e_test.go (1)

245-249: Add manager readiness synchronization and avoid log.Fatalf.

The goroutine uses log.Fatalf which abruptly terminates the test process without cleanup. Additionally, there's no synchronization to ensure the manager is ready before m.Run() executes tests.

Consider adding a readiness channel and handling errors gracefully:

+	mgrReady := make(chan struct{})
 	go func() {
+		close(mgrReady)
 		if err := mgr.Start(testContext); err != nil {
-			log.Fatalf("Failed to start the test environment manager: %v", err)
+			log.Printf("Failed to start the test environment manager: %v", err)
 		}
 	}()
+	<-mgrReady // Wait for manager goroutine to start

Alternatively, use mgr.GetCache().WaitForCacheSync(testContext) after starting to ensure caches are synchronized before running tests.

🧹 Nitpick comments (2)
fault-remediation/pkg/reconciler/reconciler.go (2)

525-541: Add nil guards in CloseAll for safer test usage.

CloseAll unconditionally calls r.ds.Close(ctx) and r.Watcher.Close(ctx). While production code ensures both are non-nil, tests construct FaultRemediationReconciler instances with nil fields. Calling CloseAll on those would panic.

Add nil checks to make the method safer:

 func (r *FaultRemediationReconciler) CloseAll(ctx context.Context) error {
 	var errs []error
 
-	if err := r.ds.Close(ctx); err != nil {
-		slog.Error("failed to close datastore", "error", err)
-		errs = append(errs, err)
+	if r.ds != nil {
+		if err := r.ds.Close(ctx); err != nil {
+			slog.Error("failed to close datastore", "error", err)
+			errs = append(errs, err)
+		}
 	}
 
-	if err := r.Watcher.Close(ctx); err != nil {
-		slog.Error("failed to close Watcher", "error", err)
-		errs = append(errs, err)
+	if r.Watcher != nil {
+		if err := r.Watcher.Close(ctx); err != nil {
+			slog.Error("failed to close Watcher", "error", err)
+			errs = append(errs, err)
+		}
 	}
 
 	return errors.Join(errs...)
 }

215-215: TODO comment should reference a tracking issue.

The TODO comment lacks an issue reference. Per coding guidelines, TODO comments should reference issues for tracking.

Consider creating an issue to track the controller-runtime retry refactoring and updating the comment:

-	//TODO: return error to use built in ctrl runtime retries
+	// TODO(#ISSUE): return error to use built-in ctrl-runtime retries

As per coding guidelines, TODO comments should reference issues in Go code.

πŸ“œ Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

πŸ“₯ Commits

Reviewing files that changed from the base of the PR and between 131fb72 and 260636b.

β›” Files ignored due to path filters (5)
  • commons/go.sum is excluded by !**/*.sum
  • event-exporter/go.sum is excluded by !**/*.sum
  • fault-remediation/go.sum is excluded by !**/*.sum
  • health-events-analyzer/go.sum is excluded by !**/*.sum
  • health-monitors/syslog-health-monitor/go.sum is excluded by !**/*.sum
πŸ“’ Files selected for processing (19)
  • commons/go.mod (2 hunks)
  • commons/pkg/server/server.go (2 hunks)
  • distros/kubernetes/nvsentinel/charts/fault-remediation/templates/clusterrole.yaml (2 hunks)
  • distros/kubernetes/nvsentinel/charts/fault-remediation/templates/deployment.yaml (1 hunks)
  • distros/kubernetes/nvsentinel/values-full.yaml (1 hunks)
  • distros/kubernetes/nvsentinel/values-tilt-mongodb.yaml (1 hunks)
  • distros/kubernetes/nvsentinel/values-tilt.yaml (2 hunks)
  • event-exporter/go.mod (1 hunks)
  • fault-remediation/go.mod (2 hunks)
  • fault-remediation/main.go (3 hunks)
  • fault-remediation/pkg/initializer/init.go (5 hunks)
  • fault-remediation/pkg/reconciler/metrics.go (4 hunks)
  • fault-remediation/pkg/reconciler/reconciler.go (17 hunks)
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go (43 hunks)
  • fault-remediation/pkg/reconciler/reconciler_test.go (16 hunks)
  • fault-remediation/pkg/reconciler/remediation.go (1 hunks)
  • health-events-analyzer/go.mod (1 hunks)
  • health-monitors/syslog-health-monitor/go.mod (1 hunks)
  • tests/fault_remediation_test.go (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (7)
  • distros/kubernetes/nvsentinel/values-tilt-mongodb.yaml
  • fault-remediation/go.mod
  • commons/go.mod
  • distros/kubernetes/nvsentinel/charts/fault-remediation/templates/deployment.yaml
  • fault-remediation/pkg/reconciler/remediation.go
  • distros/kubernetes/nvsentinel/charts/fault-remediation/templates/clusterrole.yaml
  • distros/kubernetes/nvsentinel/values-full.yaml
🧰 Additional context used
πŸ““ Path-based instructions (3)
**/*.go

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types

Files:

  • commons/pkg/server/server.go
  • fault-remediation/pkg/initializer/init.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/main.go
  • fault-remediation/pkg/reconciler/reconciler.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/reconciler/metrics.go
  • tests/fault_remediation_test.go
**/go.mod

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

Use go.mod for each service as a separate Go module with semantic import versioning

Files:

  • event-exporter/go.mod
  • health-monitors/syslog-health-monitor/go.mod
  • health-events-analyzer/go.mod
**/*_test.go

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

**/*_test.go: Use envtest for testing Kubernetes controllers instead of fake clients
Use testify/assert and testify/require for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: TestFunctionName_Scenario_ExpectedBehavior

Files:

  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • tests/fault_remediation_test.go
🧠 Learnings (17)
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/go.mod : Use `go.mod` for each service as a separate Go module with semantic import versioning

Applied to files:

  • event-exporter/go.mod
  • health-monitors/syslog-health-monitor/go.mod
  • health-events-analyzer/go.mod
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Keep Go dependencies minimal and up-to-date

Applied to files:

  • event-exporter/go.mod
  • health-monitors/syslog-health-monitor/go.mod
  • health-events-analyzer/go.mod
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use `client-go` for Kubernetes API interactions in Go code

Applied to files:

  • event-exporter/go.mod
  • health-monitors/syslog-health-monitor/go.mod
  • health-events-analyzer/go.mod
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Prefer informers over direct API calls for watching Kubernetes resources

Applied to files:

  • event-exporter/go.mod
  • health-monitors/syslog-health-monitor/go.mod
  • health-events-analyzer/go.mod
  • fault-remediation/main.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : TODO comments should reference issues in Go code

Applied to files:

  • health-monitors/syslog-health-monitor/go.mod
  • fault-remediation/pkg/reconciler/reconciler.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Use `commons/` for shared utilities across Go modules

Applied to files:

  • health-monitors/syslog-health-monitor/go.mod
  • health-events-analyzer/go.mod
πŸ“š Learning: 2025-11-26T13:54:26.427Z
Learnt from: rupalis-nv
Repo: NVIDIA/NVSentinel PR: 361
File: distros/kubernetes/nvsentinel/values-tilt.yaml:215-223
Timestamp: 2025-11-26T13:54:26.427Z
Learning: For values-tilt.yaml files: Keep documentation minimal and implementation-focused. Unlike production values.yaml files which need comprehensive inline comments, values-tilt.yaml is a Tilt-specific testing/development override file that doesn't require extensive documentation.

Applied to files:

  • distros/kubernetes/nvsentinel/values-tilt.yaml
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Implement proper shutdown handling with context cancellation in Go code

Applied to files:

  • fault-remediation/main.go
  • fault-remediation/pkg/reconciler/reconciler.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Within `retry.RetryOnConflict` blocks, return errors without wrapping to preserve retry behavior

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler.go
πŸ“š Learning: 2025-12-01T17:53:20.067Z
Learnt from: jtschelling
Repo: NVIDIA/NVSentinel PR: 455
File: docs/designs/013-remediation-plugins.md:162-204
Timestamp: 2025-12-01T17:53:20.067Z
Learning: Code examples in design documents (docs/designs/) in the NVSentinel repository are illustrative stubs meant to show intent and architectural concepts, and do not need to be 100% correct or complete implementations (e.g., undefined variables, missing error handling, incomplete struct initialization are acceptable).

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Function comments required for all exported Go functions

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler.go
πŸ“š Learning: 2025-10-29T15:37:49.210Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 160
File: health-events-analyzer/pkg/reconciler/metrics.go:52-58
Timestamp: 2025-10-29T15:37:49.210Z
Learning: In health-events-analyzer/pkg/reconciler/metrics.go, the ruleMatchedTotal metric should include both "rule_name" and "node_name" labels to identify which rule triggered on which node and how many times, as per the project's observability requirements.

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler.go
  • fault-remediation/pkg/reconciler/metrics.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use meaningful variable names such as `synced` over `ok` for cache sync checks

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Write table-driven tests when testing multiple scenarios in Go

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
πŸ“š Learning: 2025-11-06T16:18:09.952Z
Learnt from: KaivalyaMDabhadkar
Repo: NVIDIA/NVSentinel PR: 280
File: tests/fault_management_test.go:334-345
Timestamp: 2025-11-06T16:18:09.952Z
Learning: In the fault-quarantine component, the "quarantinedNodeUncordonedManually" annotation is set to the string literal "True" (with uppercase T), defined as the constant QuarantinedNodeUncordonedManuallyAnnotationValue in fault-quarantine/pkg/common/common.go. Tests should compare against "True", not "true".

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Name Go tests descriptively using format: `TestFunctionName_Scenario_ExpectedBehavior`

Applied to files:

  • tests/fault_remediation_test.go
🧬 Code graph analysis (3)
fault-remediation/pkg/initializer/init.go (5)
fault-remediation/pkg/reconciler/reconciler.go (2)
  • FaultRemediationReconciler (67-75)
  • NewFaultRemediationReconciler (78-93)
store-client/pkg/datastore/registry.go (1)
  • NewDataStore (120-123)
store-client/pkg/watcher/factory.go (2)
  • WatcherConfig (26-38)
  • CreateChangeStreamWatcher (65-87)
store-client/pkg/datastore/types.go (1)
  • Pipeline (121-121)
store-client/pkg/datastore/interfaces.go (1)
  • HealthEventStore (63-87)
fault-remediation/pkg/reconciler/reconciler_test.go (2)
fault-remediation/pkg/reconciler/reconciler.go (3)
  • NewFaultRemediationReconciler (78-93)
  • HealthEventDoc (53-56)
  • ReconcilerConfig (42-51)
data-models/pkg/protos/health_event.pb.go (1)
  • RecommendedAction_NONE (43-43)
fault-remediation/pkg/reconciler/reconciler_e2e_test.go (4)
store-client/pkg/datastore/types.go (1)
  • EventWithToken (101-104)
fault-remediation/pkg/reconciler/reconciler.go (3)
  • FaultRemediationReconciler (67-75)
  • ReconcilerConfig (42-51)
  • NewFaultRemediationReconciler (78-93)
health-monitors/kubernetes-object-monitor/pkg/annotations/manager.go (1)
  • NewManager (41-43)
fault-remediation/pkg/reconciler/remediation.go (1)
  • FaultRemediationClient (56-68)
πŸ”‡ Additional comments (22)
distros/kubernetes/nvsentinel/values-tilt.yaml (2)

220-220: Verify that replicaCount is properly referenced in the fault-remediation deployment template.

The addition sets the replica count for fault-remediation to 1, appropriate for Tilt testing. Ensure the deployment template properly references this value using {{ .Values.replicaCount }} or similar template syntax.


24-24: Verify that ctrlRuntimeEnabled flag is properly referenced in the fault-remediation deployment template.

This addition enables controller-runtime integration in the Tilt environment. Ensure the corresponding deployment template references and acts on the global.ctrlRuntimeEnabled flag to confirm this configuration change is actually utilized.

event-exporter/go.mod (1)

18-84: Verify that event-exporter's controller-runtime and k8s.io dependencies are necessary and not unnecessarily transitive.

Event-exporter/go.mod includes the same pattern of controller-runtime and k8s.io indirect dependencies as health-events-analyzer. Confirm whether event-exporter directly imports these packages or if they are pulled in purely through the commons module. If these are only transitive dependencies that event-exporter does not directly use, consider whether they should be scoped only to modules that explicitly require controller-runtime integration.

Additionally, verify that version pinning (k8s.io/api v0.34.2, controller-runtime v0.22.4) is consistent across all services and does not create version skew.

To verify:

  • Search event-exporter source code for direct imports of sigs.k8s.io/controller-runtime, k8s.io/client-go, or k8s.io/api
  • Check commons/go.mod to confirm these are direct requirements there
  • Confirm version alignment with other services in the codebase
health-events-analyzer/go.mod (1)

29-94: Verify that all indirect dependencies are transitive from controller-runtime v0.22.4 and Go version compatibility.

The expansion of indirect dependencies (lines 29–94) is expected when integrating controller-runtime, but confirm that all added packages are pulled in exclusively by controller-runtime v0.22.4 and k8s.io modules, rather than manually added. Additionally, verify Go 1.25 with toolchain go1.25.3 is compatible with controller-runtime v0.22.4, k8s.io/api v0.34.2, and k8s.io/client-go v0.34.2.

Per coding guidelines, ensure Go dependencies remain minimal. If any indirect dependencies can be removed without breaking the integration, remove them.

commons/pkg/server/server.go (1)

54-54: LGTM on the import addition.

The controller-runtime metrics import is correctly aliased and appropriately placed with other imports.

fault-remediation/pkg/reconciler/metrics.go (2)

29-38: LGTM on the metrics registry migration.

The migration to promauto.With(crmetrics.Registry) correctly aligns metrics registration with the controller-runtime registry. This ensures metrics are properly exposed when using WithPrometheusMetricsCtrlRuntime() in the server.


39-92: Consistent registry usage across all metrics.

All remaining metrics (eventsProcessed, processingErrors, totalUnsupportedRemediationActions, eventHandlingDuration, logCollectorJobs, logCollectorJobDuration, logCollectorErrors) are correctly migrated to use the controller-runtime registry.

fault-remediation/pkg/initializer/init.go (1)

131-134: LGTM on the FaultRemediationReconciler wiring.

The reconciler is correctly instantiated with all required dependencies: datastore, watcher, healthEventStore, config, and dryRun flag.

tests/fault_remediation_test.go (1)

28-28: LGTM on the typo fix.

Correcting "Reemdiated" β†’ "Remediated" in the test function name improves readability. As per coding guidelines, test names should be descriptive using the format TestFunctionName_Scenario_ExpectedBehavior.

fault-remediation/pkg/reconciler/reconciler_test.go (4)

219-221: LGTM on constructor and field access updates.

The test correctly uses the new NewFaultRemediationReconciler constructor with nil for dependencies not exercised in this test, and accesses the unexported dryRun field appropriately.


377-378: Proper error handling for performRemediation.

The updated performRemediation return signature (bool, string, error) is correctly handled with assert.NoError(t, err) assertions across all test cases.

Also applies to: 434-435, 481-482


588-589: Consistent access to config fields.

The tests correctly access r.config.EnableLogCollector and r.config.RemediationClient for testing the log collector behavior, matching the updated struct field visibility.

Also applies to: 730-731


784-789: Test setup correctly updated for new constructor.

The TestUpdateNodeRemediatedStatus test properly initializes the reconciler with the new constructor pattern.

fault-remediation/main.go (3)

210-285: LGTM! Flag definitions are clear and complete.

The flag definitions comprehensively cover controller-runtime configuration, leader election parameters, and service addresses with helpful descriptions.


164-208: LGTM! Controller-runtime manager setup is correct.

The manager is properly configured with metrics, health checks, leader election, and the reconciler is correctly wired via SetupWithManager. The manager correctly uses the provided context parameter.


76-118: LGTM! Initialization and setup flow is well-structured.

The validation that leader-election requires controller-runtime is appropriate, initialization is clean, and the deferred cleanup properly logs errors.

fault-remediation/pkg/reconciler/reconciler_e2e_test.go (2)

56-102: LGTM! Mock watcher has proper concurrency protection.

The mutex guards ensure thread-safe access to counters when the reconciler and tests interact concurrently.


367-395: LGTM! Test patterns correctly use the new public reconciler API.

Tests properly construct FaultRemediationReconciler via the public constructor and invoke methods through the reconciler's public surface with appropriate error handling.

Also applies to: 427-461, 480-505, 554-578

fault-remediation/pkg/reconciler/reconciler.go (4)

95-133: LGTM! Reconcile method is well-documented and correctly implements controller-runtime interface.

The method properly parses events, handles cancellation vs remediation, and returns appropriate ctrl.Result values for controller-runtime.


543-572: LGTM! SetupWithManager correctly wires the reconciler with controller-runtime.

The method properly starts the watcher with the provided context and registers the reconciler using a typed channel source, enabling controller-runtime management.


574-602: LGTM! AdaptEvents correctly handles context cancellation and channel lifecycle.

The adapter properly monitors both context cancellation and input channel closure, ensuring clean shutdown and no goroutine leaks.

Based on learnings requiring proper shutdown handling with context cancellation.


64-75: LGTM! All exported types and functions have proper godoc comments.

The struct, constructor, and lifecycle methods are well-documented per Go conventions.

As per coding guidelines, function comments are required for all exported Go functions.

Also applies to: 77-93, 518-523

@lalitadithya
Copy link
Collaborator

/ok to test 260636b

@lalitadithya lalitadithya added this to the v0.6.0 milestone Dec 5, 2025
@github-actions
Copy link

github-actions bot commented Dec 5, 2025

Merging this branch will decrease overall coverage

Impacted Packages Coverage Ξ” πŸ€–
github.com/nvidia/nvsentinel/commons/pkg/server 19.17% (-0.39%) πŸ‘Ž
github.com/nvidia/nvsentinel/fault-remediation 0.00% (ΓΈ)
github.com/nvidia/nvsentinel/fault-remediation/pkg/initializer 0.00% (ΓΈ)
github.com/nvidia/nvsentinel/fault-remediation/pkg/reconciler 30.30% (-0.17%) πŸ‘Ž
github.com/nvidia/nvsentinel/tests 0.00% (ΓΈ)

Coverage by file

Changed files (no unit tests)

Changed File Coverage Ξ” Total Covered Missed πŸ€–
github.com/nvidia/nvsentinel/commons/pkg/server/server.go 19.17% (-0.39%) 798 (+16) 153 645 (+16) πŸ‘Ž
github.com/nvidia/nvsentinel/fault-remediation/main.go 0.00% (ΓΈ) 316 (+198) 0 316 (+198)
github.com/nvidia/nvsentinel/fault-remediation/pkg/initializer/init.go 0.00% (ΓΈ) 123 (+31) 0 123 (+31)
github.com/nvidia/nvsentinel/fault-remediation/pkg/reconciler/metrics.go 0.00% (ΓΈ) 0 0 0
github.com/nvidia/nvsentinel/fault-remediation/pkg/reconciler/reconciler.go 38.77% (-1.06%) 632 (+47) 245 (+12) 387 (+35) πŸ‘Ž
github.com/nvidia/nvsentinel/fault-remediation/pkg/reconciler/remediation.go 19.62% (ΓΈ) 520 102 418

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

Changed unit test files

  • github.com/nvidia/nvsentinel/fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • github.com/nvidia/nvsentinel/fault-remediation/pkg/reconciler/reconciler_test.go
  • github.com/nvidia/nvsentinel/tests/fault_remediation_test.go

Copy link

@joejulian joejulian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great! Thanks Igor.

Signed-off-by: Igor Velichkovich <[email protected]>
Signed-off-by: Igor Velichkovich <[email protected]>
@lalitadithya
Copy link
Collaborator

/ok to test 5ab3169

@lalitadithya lalitadithya enabled auto-merge (squash) December 8, 2025 17:17
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (3)
fault-remediation/pkg/initializer/init.go (1)

99-113: Resource leak: datastore not closed if watcher creation fails.

If watcher.CreateChangeStreamWatcher fails on line 110, the datastore created on line 99 is not closed, potentially leaking database connections and goroutines. This issue was flagged in a previous review.

Apply this diff to ensure proper cleanup:

 	ds, err := datastore.NewDataStore(ctx, *datastoreConfig)
 	if err != nil {
 		return nil, fmt.Errorf("error initializing datastore: %w", err)
 	}

 	// Create watcher using the factory pattern
 	watcherConfig := watcher.WatcherConfig{
 		Pipeline:       pipeline,
 		CollectionName: "HealthEvents",
 	}

 	watcherInstance, err := watcher.CreateChangeStreamWatcher(ctx, ds, watcherConfig)
 	if err != nil {
+		if closeErr := ds.Close(ctx); closeErr != nil {
+			slog.Error("failed to close datastore after watcher creation failure", "error", closeErr)
+		}
 		return nil, fmt.Errorf("error initializing change stream watcher: %w", err)
 	}
fault-remediation/main.go (1)

162-173: Event loop still doesn't respect context cancellation.

The for event := range ... loop blocks until the channel closes. If the context is cancelled (e.g., SIGTERM), this goroutine won't exit until Watcher.Events() channel is closed. This was flagged in a past review.

Consider using a select statement:

 	g.Go(func() error {
 		components.FaultRemediationReconciler.StartWatcherStream(gCtx)
 
 		slog.Info("Listening for events on the channel...")
-
-		for event := range components.FaultRemediationReconciler.Watcher.Events() {
-			slog.Info("Event received", "event", event)
-			_, _ = components.FaultRemediationReconciler.Reconcile(gCtx, &event)
+		eventCh := components.FaultRemediationReconciler.Watcher.Events()
+		for {
+			select {
+			case <-gCtx.Done():
+				slog.Info("Context cancelled, stopping event processing")
+				return nil
+			case event, ok := <-eventCh:
+				if !ok {
+					return nil
+				}
+				slog.Info("Event received", "event", event)
+				_, _ = components.FaultRemediationReconciler.Reconcile(gCtx, &event)
+			}
 		}
-
-		return nil
 	})
fault-remediation/pkg/reconciler/reconciler_e2e_test.go (1)

245-249: Avoid log.Fatalf inside goroutine.

Using log.Fatalf in a goroutine will abruptly terminate the test process without proper cleanup or useful error context. This was flagged in a past review. Consider using a channel for error signaling or at minimum using log.Printf to log the error while allowing tests to fail naturally.

+	mgrReady := make(chan struct{})
+	mgrErr := make(chan error, 1)
 	go func() {
+		close(mgrReady)
 		if err := mgr.Start(testContext); err != nil {
-			log.Fatalf("Failed to start the test environment manager: %v", err)
+			mgrErr <- err
 		}
 	}()
+	<-mgrReady
🧹 Nitpick comments (6)
distros/kubernetes/nvsentinel/charts/fault-remediation/templates/clusterrole.yaml (1)

49-55: Consider removing unnecessary quotes for consistency.

The events rule is correctly added for controller-runtime event recording. However, the quotes around "events", "create", and "patch" are inconsistent with the unquoted style used in other rules (e.g., nodes, get, list).

 - apiGroups:
   - ""
   resources:
-  - "events"
+  - events
   verbs:
-  - "create"
-  - "patch"
+  - create
+  - patch
fault-remediation/pkg/reconciler/reconciler_test.go (1)

784-789: Trailing whitespace/formatting issue.

There appears to be inconsistent spacing in the config struct definition. Consider ensuring consistent formatting.

 		cfg := ReconcilerConfig{
 			RemediationClient: mockK8sClient,
 			UpdateMaxRetries:  1,
 			UpdateRetryDelay:  0,
-		}
+		}
fault-remediation/main.go (1)

262-268: Clarify lease duration flag description.

The help text says "Interval at which non-leader candidates will wait to force acquire leadership" but LeaseDuration in controller-runtime is the duration that non-leader candidates will wait before attempting to acquire leadership after observing the leader. The phrasing could be clearer.

 	flag.DurationVar(
 		&leaderElectionLeaseDuration,
 		"leader-elect-lease-duration",
 		15*time.Second,
-		"Interval at which non-leader candidates will wait to force acquire leadership (duration string). "+
+		"Duration that the leader is valid before a new election. "+
 			"Requires controller-runtime enabled.",
 	)
fault-remediation/pkg/reconciler/reconciler.go (3)

215-216: TODO comment should reference an issue.

Per coding guidelines, TODO comments should reference issues. Consider linking to a tracking issue for the retry simplification work.

-	//TODO: return error to use built in ctrl runtime retries
+	//TODO(#NNN): return error to use built-in ctrl-runtime retries

339-339: Discarded error from log collector may mask failures.

The error from runLogCollector is silently discarded. Consider logging a warning or metric to track log collector failures without blocking remediation.

-	_ = r.runLogCollector(ctx, healthEvent)
+	if err := r.runLogCollector(ctx, healthEvent); err != nil {
+		slog.Warn("Log collector failed, continuing with remediation", "node", nodeName, "error", err)
+	}

530-538: Add nil guards for safer test usage.

CloseAll unconditionally calls r.ds.Close(ctx) and r.Watcher.Close(ctx). Tests may construct FaultRemediationReconciler with nil fields, causing panics. This was flagged in a previous review but not addressed.

-	if err := r.ds.Close(ctx); err != nil {
-		slog.Error("failed to close datastore", "error", err)
-		errs = append(errs, err)
-	}
-
-	if err := r.Watcher.Close(ctx); err != nil {
-		slog.Error("failed to close Watcher", "error", err)
-		errs = append(errs, err)
-	}
+	if r.ds != nil {
+		if err := r.ds.Close(ctx); err != nil {
+			slog.Error("failed to close datastore", "error", err)
+			errs = append(errs, err)
+		}
+	}
+
+	if r.Watcher != nil {
+		if err := r.Watcher.Close(ctx); err != nil {
+			slog.Error("failed to close Watcher", "error", err)
+			errs = append(errs, err)
+		}
+	}
πŸ“œ Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

πŸ“₯ Commits

Reviewing files that changed from the base of the PR and between 260636b and 5ab3169.

β›” Files ignored due to path filters (5)
  • commons/go.sum is excluded by !**/*.sum
  • event-exporter/go.sum is excluded by !**/*.sum
  • fault-remediation/go.sum is excluded by !**/*.sum
  • health-events-analyzer/go.sum is excluded by !**/*.sum
  • health-monitors/syslog-health-monitor/go.sum is excluded by !**/*.sum
πŸ“’ Files selected for processing (19)
  • commons/go.mod (2 hunks)
  • commons/pkg/server/server.go (2 hunks)
  • distros/kubernetes/nvsentinel/charts/fault-remediation/templates/clusterrole.yaml (2 hunks)
  • distros/kubernetes/nvsentinel/charts/fault-remediation/templates/deployment.yaml (1 hunks)
  • distros/kubernetes/nvsentinel/values-full.yaml (1 hunks)
  • distros/kubernetes/nvsentinel/values-tilt-mongodb.yaml (1 hunks)
  • distros/kubernetes/nvsentinel/values-tilt.yaml (2 hunks)
  • event-exporter/go.mod (1 hunks)
  • fault-remediation/go.mod (2 hunks)
  • fault-remediation/main.go (3 hunks)
  • fault-remediation/pkg/initializer/init.go (5 hunks)
  • fault-remediation/pkg/reconciler/metrics.go (4 hunks)
  • fault-remediation/pkg/reconciler/reconciler.go (17 hunks)
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go (43 hunks)
  • fault-remediation/pkg/reconciler/reconciler_test.go (16 hunks)
  • fault-remediation/pkg/reconciler/remediation.go (1 hunks)
  • health-events-analyzer/go.mod (1 hunks)
  • health-monitors/syslog-health-monitor/go.mod (1 hunks)
  • tests/fault_remediation_test.go (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (7)
  • health-monitors/syslog-health-monitor/go.mod
  • distros/kubernetes/nvsentinel/values-tilt-mongodb.yaml
  • commons/pkg/server/server.go
  • distros/kubernetes/nvsentinel/values-tilt.yaml
  • distros/kubernetes/nvsentinel/values-full.yaml
  • fault-remediation/pkg/reconciler/metrics.go
  • fault-remediation/go.mod
🧰 Additional context used
πŸ““ Path-based instructions (3)
**/go.mod

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

Use go.mod for each service as a separate Go module with semantic import versioning

Files:

  • event-exporter/go.mod
  • health-events-analyzer/go.mod
  • commons/go.mod
**/*.go

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging via log/slog in Go code
Wrap errors with context using fmt.Errorf("context: %w", err) in Go code
Within retry.RetryOnConflict blocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such as synced over ok for cache sync checks
Use client-go for Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types

Files:

  • fault-remediation/main.go
  • tests/fault_remediation_test.go
  • fault-remediation/pkg/initializer/init.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
  • fault-remediation/pkg/reconciler/remediation.go
  • fault-remediation/pkg/reconciler/reconciler.go
**/*_test.go

πŸ“„ CodeRabbit inference engine (.github/copilot-instructions.md)

**/*_test.go: Use envtest for testing Kubernetes controllers instead of fake clients
Use testify/assert and testify/require for assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format: TestFunctionName_Scenario_ExpectedBehavior

Files:

  • tests/fault_remediation_test.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/reconciler/reconciler_test.go
🧠 Learnings (16)
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/go.mod : Use `go.mod` for each service as a separate Go module with semantic import versioning

Applied to files:

  • event-exporter/go.mod
  • health-events-analyzer/go.mod
  • commons/go.mod
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Keep Go dependencies minimal and up-to-date

Applied to files:

  • event-exporter/go.mod
  • health-events-analyzer/go.mod
  • commons/go.mod
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use `client-go` for Kubernetes API interactions in Go code

Applied to files:

  • event-exporter/go.mod
  • health-events-analyzer/go.mod
  • commons/go.mod
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Prefer informers over direct API calls for watching Kubernetes resources

Applied to files:

  • event-exporter/go.mod
  • health-events-analyzer/go.mod
  • fault-remediation/main.go
  • fault-remediation/pkg/reconciler/remediation.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Use `commons/` for shared utilities across Go modules

Applied to files:

  • health-events-analyzer/go.mod
  • commons/go.mod
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Implement proper shutdown handling with context cancellation in Go code

Applied to files:

  • fault-remediation/main.go
  • fault-remediation/pkg/initializer/init.go
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • fault-remediation/pkg/reconciler/reconciler.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients

Applied to files:

  • commons/go.mod
  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Name Go tests descriptively using format: `TestFunctionName_Scenario_ExpectedBehavior`

Applied to files:

  • tests/fault_remediation_test.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use meaningful variable names such as `synced` over `ok` for cache sync checks

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
πŸ“š Learning: 2025-11-06T16:18:09.952Z
Learnt from: KaivalyaMDabhadkar
Repo: NVIDIA/NVSentinel PR: 280
File: tests/fault_management_test.go:334-345
Timestamp: 2025-11-06T16:18:09.952Z
Learning: In the fault-quarantine component, the "quarantinedNodeUncordonedManually" annotation is set to the string literal "True" (with uppercase T), defined as the constant QuarantinedNodeUncordonedManuallyAnnotationValue in fault-quarantine/pkg/common/common.go. Tests should compare against "True", not "true".

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler_e2e_test.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Within `retry.RetryOnConflict` blocks, return errors without wrapping to preserve retry behavior

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler.go
πŸ“š Learning: 2025-12-01T17:53:20.067Z
Learnt from: jtschelling
Repo: NVIDIA/NVSentinel PR: 455
File: docs/designs/013-remediation-plugins.md:162-204
Timestamp: 2025-12-01T17:53:20.067Z
Learning: Code examples in design documents (docs/designs/) in the NVSentinel repository are illustrative stubs meant to show intent and architectural concepts, and do not need to be 100% correct or complete implementations (e.g., undefined variables, missing error handling, incomplete struct initialization are acceptable).

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Function comments required for all exported Go functions

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : TODO comments should reference issues in Go code

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler.go
πŸ“š Learning: 2025-10-29T15:37:49.210Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 160
File: health-events-analyzer/pkg/reconciler/metrics.go:52-58
Timestamp: 2025-10-29T15:37:49.210Z
Learning: In health-events-analyzer/pkg/reconciler/metrics.go, the ruleMatchedTotal metric should include both "rule_name" and "node_name" labels to identify which rule triggered on which node and how many times, as per the project's observability requirements.

Applied to files:

  • fault-remediation/pkg/reconciler/reconciler.go
πŸ“š Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/values.yaml : Note truthy value requirements in Helm chart documentation where applicable

Applied to files:

  • distros/kubernetes/nvsentinel/charts/fault-remediation/templates/deployment.yaml
🧬 Code graph analysis (5)
fault-remediation/main.go (5)
node-drainer/pkg/config/config.go (1)
  • Duration (36-38)
fault-remediation/pkg/initializer/init.go (3)
  • InitializationParams (34-39)
  • InitializeAll (46-135)
  • Components (41-43)
fault-remediation/pkg/reconciler/reconciler.go (1)
  • FaultRemediationReconciler (67-75)
commons/pkg/server/server.go (4)
  • NewServer (375-396)
  • WithPort (151-153)
  • WithPrometheusMetricsCtrlRuntime (227-234)
  • WithSimpleHealth (246-254)
health-monitors/kubernetes-object-monitor/pkg/annotations/manager.go (1)
  • NewManager (41-43)
fault-remediation/pkg/initializer/init.go (4)
fault-remediation/pkg/reconciler/reconciler.go (2)
  • FaultRemediationReconciler (67-75)
  • NewFaultRemediationReconciler (78-93)
store-client/pkg/datastore/registry.go (1)
  • NewDataStore (120-123)
store-client/pkg/watcher/factory.go (2)
  • WatcherConfig (26-38)
  • CreateChangeStreamWatcher (65-87)
store-client/pkg/datastore/interfaces.go (1)
  • HealthEventStore (63-87)
fault-remediation/pkg/reconciler/reconciler_e2e_test.go (6)
store-client/pkg/datastore/types.go (1)
  • EventWithToken (101-104)
data-models/pkg/model/health_event_extentions.go (1)
  • HealthEventStatus (45-51)
fault-remediation/pkg/reconciler/reconciler.go (3)
  • FaultRemediationReconciler (67-75)
  • ReconcilerConfig (42-51)
  • NewFaultRemediationReconciler (78-93)
commons/pkg/statemanager/statemanager.go (2)
  • StateManager (197-200)
  • NewStateManager (206-210)
health-monitors/kubernetes-object-monitor/pkg/annotations/manager.go (1)
  • NewManager (41-43)
fault-remediation/pkg/reconciler/remediation.go (1)
  • FaultRemediationClient (58-70)
fault-remediation/pkg/reconciler/reconciler_test.go (4)
fault-remediation/pkg/reconciler/reconciler.go (3)
  • NewFaultRemediationReconciler (78-93)
  • HealthEventDoc (53-56)
  • ReconcilerConfig (42-51)
node-drainer/pkg/config/config.go (1)
  • ReconcilerConfig (162-167)
commons/pkg/statemanager/statemanager.go (1)
  • StateManager (197-200)
data-models/pkg/protos/health_event.pb.go (1)
  • RecommendedAction_NONE (43-43)
fault-remediation/pkg/reconciler/reconciler.go (9)
janitor/pkg/csp/aws/aws.go (1)
  • Client (48-50)
store-client/pkg/storewatcher/watch_store.go (1)
  • ChangeStreamWatcher (43-57)
store-client/pkg/datastore/interfaces.go (1)
  • HealthEventStore (63-87)
node-drainer/pkg/config/config.go (1)
  • ReconcilerConfig (162-167)
fault-remediation/pkg/reconciler/annotation.go (1)
  • NodeAnnotationManagerInterface (38-43)
data-models/pkg/model/health_event_extentions.go (4)
  • HealthEventStatus (45-51)
  • UnQuarantined (34-34)
  • Cancelled (37-37)
  • Status (23-23)
commons/pkg/statemanager/statemanager.go (1)
  • StateManager (197-200)
data-models/pkg/protos/health_event.pb.go (1)
  • RecommendedAction_NONE (43-43)
health-monitors/kubernetes-object-monitor/pkg/annotations/manager.go (1)
  • Manager (37-39)
πŸ”‡ Additional comments (29)
fault-remediation/pkg/reconciler/remediation.go (1)

388-390: LGTM!

This is a minor formatting change adding a blank line for readability. No functional impact.

health-events-analyzer/go.mod (1)

89-93: LGTM!

The indirect dependency additions, including sigs.k8s.io/controller-runtime v0.22.4, are transitive dependencies from the commons module's controller-runtime integration. This is expected behavior from the dependency graph expansion.

event-exporter/go.mod (1)

80-84: LGTM!

The indirect dependency additions are consistent with the controller-runtime integration in the commons module. No direct dependency changes.

distros/kubernetes/nvsentinel/charts/fault-remediation/templates/deployment.yaml (2)

90-100: LGTM!

The conditional controller-runtime configuration is well-structured:

  • --controller-runtime=true and --leader-elect=true args are properly gated.
  • Health port binding is correctly conditional with a sensible default (9440).
  • Nil-safe access pattern ((.Values.global).ctrlRuntimeEnabled) is used consistently.

101-116: LGTM!

The probe configuration correctly adapts to controller-runtime mode:

  • livenessProbe uses /healthz on the appropriate port.
  • readinessProbe uses /readyz (controller-runtime standard) vs /healthz (legacy) with matching ports.

The nil-safe ternary pattern is applied consistently, addressing previous review feedback.

fault-remediation/pkg/initializer/init.go (3)

31-31: LGTM!

The watcher import is correctly added to support the new change stream watcher initialization.


130-134: LGTM!

The initialization return correctly uses the new NewFaultRemediationReconciler constructor with all required dependencies (datastore, watcher, healthEventStore, config, dryRun flag).


41-43: Components struct uses value type instead of pointer.

The FaultRemediationReconciler field is now a value type rather than a pointer. Verify that all callers accessing Components.FaultRemediationReconciler handle this appropriatelyβ€”particularly checking for any unintended pointer operations or dereferencing patterns that may have assumed a pointer type.

commons/go.mod (1)

18-18: No action needed β€” controller-runtime v0.22.4 is the current latest stable release.

The dependency sigs.k8s.io/controller-runtime v0.22.4 is appropriate; this is the latest version released November 3, 2025, and suitable for the FaultRemediationReconciler integration.

distros/kubernetes/nvsentinel/charts/fault-remediation/templates/clusterrole.yaml (1)

31-39: LGTM! Necessary RBAC for controller-runtime leader election.

The leases permission in the coordination.k8s.io API group is required for controller-runtime's leader election mechanism. The verbs (get, create, update, patch) are the standard set needed.

tests/fault_remediation_test.go (1)

28-28: Good catch fixing the typo in the test function name.

Correcting "Reemdiated" to "Remediated" improves discoverability and readability. Based on coding guidelines, consider adopting the naming format TestFunctionName_Scenario_ExpectedBehavior for even clearer test intent (e.g., TestNewCRsAreCreated_AfterFaultsRemediated).

fault-remediation/pkg/reconciler/reconciler_test.go (3)

219-221: Constructor migration looks correct.

The test correctly uses the new NewFaultRemediationReconciler constructor with nil dependencies (ds, watcher, healthEventStore) since this unit test only validates construction and the dry-run flag, not actual datastore operations.


377-378: LGTM! Error handling correctly added for performRemediation.

The updated signature returning (bool, string, error) is properly handled with assert.NoError(t, err). This correctly tests the success path where remediation should complete without error.


588-589: Accessing config fields through r.config is appropriate.

The test correctly accesses r.config.EnableLogCollector and r.config.RemediationClient after the refactoring moved these to unexported fields accessed via the config struct.

fault-remediation/main.go (3)

92-94: Good constraint: leader-election requires controller-runtime.

This validation correctly prevents misconfiguration where leader election is enabled without controller-runtime, which would fail at runtime.


113-117: Proper cleanup with deferred close.

Using defer for closing datastore components ensures cleanup happens regardless of exit path. Error logging on failure is appropriate.


178-222: Controller-runtime setup looks well-structured.

The manager configuration properly includes:

  • Metrics server with configurable bind address
  • Health/ready probe endpoints
  • Leader election with configurable parameters
  • Proper error handling and logging

The use of mgr.Start(ctx) with the signal context addresses the previous review concern about dual signal handlers.

fault-remediation/pkg/reconciler/reconciler_e2e_test.go (7)

56-61: Good addition of mutex for thread safety.

Adding sync.Mutex to guard access to markProcessedCount prevents data races between the reconciler goroutine writing and test assertions reading. This is a proper fix for concurrent access.


82-87: Thread-safe MarkProcessed implementation.

The mutex lock/unlock correctly guards the increment of markProcessedCount.


99-101: Thread-safe GetCallCounts implementation.

The mutex ensures consistent reads of counter values during test assertions.


206-230: TestMain setup is comprehensive.

The test initialization correctly:

  • Creates a remediation client with error handling
  • Configures the ReconcilerConfig with StateManager and retry settings
  • Sets up mock health event store with instrumentation
  • Creates the reconciler using the new constructor

297-303: Good refactoring to return errors from createTestRemediationClient.

Returning (*FaultRemediationClient, error) instead of using log.Fatalf allows callers to handle errors gracefully in test assertions.


376-380: Test refactored to use struct literal correctly.

Using FaultRemediationReconciler{config: cfg, annotationManager: ...} aligns with the new private field structure. This pattern is consistently applied across tests.


1347-1359: Test refactor for TestMetrics_ProcessingErrors looks correct.

The test now uses a pointer to FaultRemediationReconciler with only the Watcher field set, which is sufficient for testing the error path in Reconcile when processing invalid events.

fault-remediation/pkg/reconciler/reconciler.go (5)

64-93: Clean struct definition and constructor implementation.

The godoc comments are properly added, and the constructor correctly initializes all fields except the embedded client.Client, which is documented as needing to be set separately for ctrl-runtime mode.


95-133: Reconcile method properly orchestrates event processing.

The flow correctly handles parsing, nil checks, and dispatches to either cancellation or remediation handlers. Returning nil error for unparseable events prevents hot-looping (per past discussion).


279-284: Consider using ctx instead of context.Background() for MarkProcessed.

The function receives ctx but uses context.Background() for MarkProcessed. This prevents proper cancellation during shutdown. If the intent is to ensure the token is always persisted, consider adding a brief comment explaining why.

-	if err := watcherInstance.MarkProcessed(context.Background(), resumeToken); err != nil {
+	if err := watcherInstance.MarkProcessed(ctx, resumeToken); err != nil {

547-572: Well-structured controller-runtime integration.

The implementation correctly uses the provided context for watcher lifecycle, properly adapts events to the controller-runtime channel source pattern, and the typed handler cleanly enqueues events.


574-602: Correct channel adaptation with proper memory safety.

Good practice: eventOut := e on line 595 correctly copies the loop variable before taking its pointer, avoiding the classic Go closure/loop-variable bug. The goroutine properly handles context cancellation and channel closure.

@github-actions
Copy link

github-actions bot commented Dec 8, 2025

Merging this branch will decrease overall coverage

Impacted Packages Coverage Ξ” πŸ€–
github.com/nvidia/nvsentinel/commons/pkg/server 15.59% (-0.32%) πŸ‘Ž
github.com/nvidia/nvsentinel/fault-remediation 0.00% (ΓΈ)
github.com/nvidia/nvsentinel/fault-remediation/pkg/initializer 0.00% (ΓΈ)
github.com/nvidia/nvsentinel/fault-remediation/pkg/reconciler 30.28% (-0.17%) πŸ‘Ž
github.com/nvidia/nvsentinel/tests 0.00% (ΓΈ)

Coverage by file

Changed files (no unit tests)

Changed File Coverage Ξ” Total Covered Missed πŸ€–
github.com/nvidia/nvsentinel/commons/pkg/server/server.go 15.59% (-0.32%) 988 (+20) 154 834 (+20) πŸ‘Ž
github.com/nvidia/nvsentinel/fault-remediation/main.go 0.00% (ΓΈ) 344 (+186) 0 344 (+186)
github.com/nvidia/nvsentinel/fault-remediation/pkg/initializer/init.go 0.00% (ΓΈ) 123 (+31) 0 123 (+31)
github.com/nvidia/nvsentinel/fault-remediation/pkg/reconciler/metrics.go 0.00% (ΓΈ) 0 0 0
github.com/nvidia/nvsentinel/fault-remediation/pkg/reconciler/reconciler.go 38.77% (-1.06%) 632 (+47) 245 (+12) 387 (+35) πŸ‘Ž
github.com/nvidia/nvsentinel/fault-remediation/pkg/reconciler/remediation.go 19.77% (ΓΈ) 531 105 426

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

Changed unit test files

  • github.com/nvidia/nvsentinel/fault-remediation/pkg/reconciler/reconciler_e2e_test.go
  • github.com/nvidia/nvsentinel/fault-remediation/pkg/reconciler/reconciler_test.go
  • github.com/nvidia/nvsentinel/tests/fault_remediation_test.go

@lalitadithya lalitadithya merged commit 4b2ee27 into NVIDIA:main Dec 8, 2025
86 of 113 checks passed
@dims
Copy link
Collaborator

dims commented Dec 8, 2025

congrats @ivelichkovich !!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature]: Use Common Controller/Event Watcher Framework across controllers

5 participants