[Bug]: node-drainer default memory limit (300Mi) causes OOMKill on large clusters #495

@ksaur

Description

Prerequisites

  • I searched existing issues

Feature Summary

The default memory limit for node-drainer (300Mi) is too low for large clusters. On a 1500-node cluster, node-drainer gets OOMKilled during Kubernetes informer sync at startup.

Problem/Use Case

$ kubectl get pods -n nvsentinel -l app.kubernetes.io/name=node-drainer
NAME                            READY   STATUS             RESTARTS         AGE
node-drainer-6847cfff97-rhzkw   0/1     CrashLoopBackOff   10 (2m11s ago)   31m

$ kubectl get pod -n nvsentinel -l app.kubernetes.io/name=node-drainer \
    -o jsonpath='{.items[0].status.containerStatuses[0].lastState}'
{"terminated":{"exitCode":137,"reason":"OOMKilled",...}}

The pod crashes during startup, shortly after logging "Starting Kubernetes informers": caching the ~1500 Node objects during the initial informer sync pushes memory usage past the 300Mi limit.
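For anyone reproducing this, the memory growth during informer sync can be watched directly (assuming metrics-server is installed in the cluster); the namespace and label selector below match the commands above:

$ kubectl top pod -n nvsentinel -l app.kubernetes.io/name=node-drainer
NAME                            CPU(cores)   MEMORY(bytes)
node-drainer-6847cfff97-rhzkw   120m         298Mi   # climbs toward the 300Mi limit, then OOMKill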

Proposed Solution

Update distros/kubernetes/nvsentinel/charts/node-drainer/values.yaml

resources:
  limits:
    cpu: "500m"
    memory: "1Gi"    # was 300Mi
  requests:
    cpu: "200m"
    memory: "1Gi"    # was 300Mi

Discussion: 1Gi worked well on my 1500-node cluster (further scale testing to follow in #385). I don't think 1Gi is egregiously large given the clusters NVSentinel typically runs on, though I can see arguments for making it somewhat bigger or smaller. The main thing is that we pick a solid default, ideally larger than the current one.
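Until the default changes, the limit can also be raised per install. This is only a sketch: the release name, chart reference, and the node-drainer.resources value path are assumptions based on node-drainer being a subchart of the nvsentinel umbrella chart, so adjust them to your deployment:

$ helm upgrade nvsentinel <chart-ref> -n nvsentinel --reuse-values \
    --set node-drainer.resources.limits.memory=1Gi \
    --set node-drainer.resources.requests.memory=1Gi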

Component

Node Drainer
