Labels: bug, enhancement
Description
Prerequisites
- I searched existing issues
Feature Summary
The default memory limit for node-drainer (300Mi) is too low for large clusters. On a 1500-node cluster, node-drainer gets OOMKilled during Kubernetes informer sync at startup.
Problem/Use Case
```console
$ kubectl get pods -n nvsentinel -l app.kubernetes.io/name=node-drainer
NAME                            READY   STATUS             RESTARTS         AGE
node-drainer-6847cfff97-rhzkw   0/1     CrashLoopBackOff   10 (2m11s ago)   31m

$ kubectl get pod -n nvsentinel -l app.kubernetes.io/name=node-drainer \
    -o jsonpath='{.items[0].status.containerStatuses[0].lastState}'
{"terminated":{"exitCode":137,"reason":"OOMKilled",...}}
```
The pod crashes during startup shortly after logging "Starting Kubernetes informers": the informer cache sync for 1500 nodes exceeds the 300Mi limit.
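For anyone reproducing this, a quick way to check is to watch the container's memory climb while the informers sync and compare it against the node count the cache has to hold. This is a hedged sketch, not part of the report above: it assumes metrics-server is installed so `kubectl top` works, and it reuses the same label selector shown earlier.

```console
# Watch memory usage of the node-drainer container during startup
# (requires metrics-server for `kubectl top`)
kubectl top pod -n nvsentinel -l app.kubernetes.io/name=node-drainer --containers

# Count the Node objects the informer must cache on sync
kubectl get nodes --no-headers | wc -l
```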
Proposed Solution
Update `distros/kubernetes/nvsentinel/charts/node-drainer/values.yaml`:

```yaml
resources:
  limits:
    cpu: "500m"
    memory: "1Gi"   # was 300Mi
  requests:
    cpu: "200m"
    memory: "1Gi"   # was 300Mi
```
Discussion: 1Gi is a value that worked well on my 1500-node cluster (further scale testing to follow in #385). I don't think 1Gi is egregiously large for the clusters NVSentinel usually runs on, but I can see arguments for making it a bit bigger or smaller. As long as we pick a solid default (ideally larger than the current one!), I'm happy.
Component
Node Drainer