Kubernetes Cluster Autoscaler
The cluster autoscaler watches for Pending pods that can't fit on existing nodes and asks the Hypervisor.io control panel to add more workers. When workers sit idle long enough, it asks the panel to remove them. This page covers how it's installed, how to tune it, and how to diagnose the common cases where it does (or doesn't) move.
Overview
The autoscaler shipped with Hypervisor.io clusters is the upstream Kubernetes cluster-autoscaler with a Hypervisor.io cloud provider compiled in. It's a single Go binary running as a Deployment on the control plane, talking to two endpoints:
- The cluster's own kube-apiserver, where it reads Pending pods, node conditions, and DaemonSet specs.
- A small management API exposed by the Hypervisor.io control panel, where it requests new workers and drains old ones.
Because it is built from the same upstream codebase that managed Kubernetes services commonly run, the upstream FAQ and flag reference apply here too. The Hypervisor.io-specific part is the cloud provider that turns a "scale node group from N to N+1" request into a real worker VM in your region.
What triggers scale-up?
- One or more pods are Pending because their resource requests, node selectors, affinity rules, taints/tolerations or topology spread constraints don't fit any existing schedulable worker.
- A node group exists whose template worker would fit those pods, and that group's current size is below its configured max.
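The two scale-up conditions can be sketched as a simple filter. This is a simplified model, not the upstream implementation: it compares resource requests only and ignores node selectors, affinity, taints and topology spread. The pool names and sizes are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class NodeGroup:
    name: str
    cpu_m: int      # allocatable CPU of the template worker, in millicores
    mem_mib: int    # allocatable memory of the template worker, in MiB
    size: int       # current number of workers in the pool
    max_size: int   # configured max for the pool


@dataclass
class Pod:
    name: str
    cpu_m: int      # CPU request, in millicores
    mem_mib: int    # memory request, in MiB


def scale_up_candidates(pod, groups):
    """Return the groups whose template fits the pod and that are below max."""
    return [
        g for g in groups
        if pod.cpu_m <= g.cpu_m
        and pod.mem_mib <= g.mem_mib
        and g.size < g.max_size
    ]


# Hypothetical pools: "highmem" would fit the pod but is already at its max.
groups = [
    NodeGroup("standard", cpu_m=1800, mem_mib=3200, size=3, max_size=5),
    NodeGroup("highmem", cpu_m=3600, mem_mib=15000, size=2, max_size=2),
]
pod = Pod("api", cpu_m=500, mem_mib=1024)

print([g.name for g in scale_up_candidates(pod, groups)])  # ['standard']
```

If the list comes back empty, no scale-up happens and the pod stays Pending.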
What triggers scale-down?
- A worker has been below the utilization threshold (default 50% of requests on both CPU and memory) for the unneeded-time window (default 10 minutes).
- All pods on it can be safely rescheduled elsewhere (no PDB violations, no safe-to-evict: false annotation, no orphaned local storage).
- Removing it would not drop the node group below its configured min.
- The cluster has been quiet long enough since the last scale-up (default 10 minutes).
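The scale-down conditions combine as a checklist: a node is removed only when none of them blocks it. The sketch below is a minimal model under the defaults quoted above (0.5 threshold, 10 minute windows). The NodeState fields and the blocker strings are illustrative names, not autoscaler types or real log messages.

```python
from dataclasses import dataclass


@dataclass
class NodeState:
    cpu_util: float             # requested CPU / allocatable CPU
    mem_util: float             # requested memory / allocatable memory
    unneeded_for_s: int         # seconds spent continuously below the threshold
    pods_movable: bool          # no PDB violation, safe-to-evict: false, or local storage
    group_size: int             # current size of the node's pool
    group_min: int              # configured min of the node's pool
    since_last_scale_up_s: int  # seconds since the last scale-up


def scale_down_blockers(node, threshold=0.5, unneeded_time_s=600,
                        delay_after_add_s=600):
    """Return the reasons this node cannot be removed (empty list = removable)."""
    blockers = []
    # Both CPU and memory must be under the threshold, so check the larger one.
    if max(node.cpu_util, node.mem_util) >= threshold:
        blockers.append("utilization at or above threshold")
    elif node.unneeded_for_s < unneeded_time_s:
        blockers.append("below threshold, but not for long enough")
    if not node.pods_movable:
        blockers.append("pods cannot be rescheduled safely")
    if node.group_size <= node.group_min:
        blockers.append("node group already at min")
    if node.since_last_scale_up_s < delay_after_add_s:
        blockers.append("scale-up cooldown still active")
    return blockers


idle = NodeState(0.2, 0.3, 900, True, group_size=4, group_min=2,
                 since_last_scale_up_s=1200)
busy = NodeState(0.2, 0.6, 900, True, group_size=4, group_min=2,
                 since_last_scale_up_s=1200)

print(scale_down_blockers(idle))  # []
print(scale_down_blockers(busy))  # ['utilization at or above threshold']
```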
Install
The autoscaler is auto-installed by the panel the moment a worker pool has autoscaling enabled. There's nothing to helm install and no kubeconfig to wire up. On the cluster's detail page, the Autoscaler tab shows current status, image version, args, recent scale events, and a button to update the args without rolling the rest of the cluster.
When a pool is created or edited with autoscaling enabled:
- The panel writes the Deployment, ServiceAccount, ClusterRole and ClusterRoleBinding into kube-system.
- The image is selected from the compatibility matrix below based on the cluster's Kubernetes version.
- A token tied to the cluster is mounted into the pod so the autoscaler can authenticate to the panel's management API.
- Default args are applied (shown in the table below). They can be edited from the Autoscaler tab; the pod restarts in a few seconds.
To turn it off, disable autoscaling on every worker pool. The Deployment is removed and the cluster reverts to fixed-size pools.
How it works
Each worker pool becomes a "node group" inside the autoscaler. Workers are tagged so the autoscaler knows which pool they belong to and which template (CPU, RAM, disk) they were sized from.
Pod stuck Pending
│
▼
cluster-autoscaler (running in kube-system on the CP)
│ picks node group whose template fits the pod
│ applies expander rule if multiple groups qualify
▼
Hypervisor.io management API
│ provisions a new worker VM in your region
│ installs kubelet, joins the cluster
▼
New worker registers with the apiserver
│ becomes Ready in 1-3 minutes
▼
Pending pod is scheduled onto the new worker
Scale-down runs in reverse: a candidate worker is cordoned, its pods are drained with respect for PodDisruptionBudgets, then the panel deletes the VM and the node disappears from kubectl get nodes.
Common args
Defaults are good for most workloads. Override them from the cluster's Autoscaler tab. Changes take effect within seconds of saving (the pod restarts).
| Flag | Default | What it does |
|---|---|---|
| --scale-down-delay-after-add | 10m | How long to wait after the last scale-up before any scale-down is considered. Stops the autoscaler from oscillating during bursty traffic. |
| --scale-down-unneeded-time | 10m | A node must sit below the utilization threshold for at least this long before it becomes a removal candidate. |
| --scale-down-utilization-threshold | 0.5 | A node is "unneeded" only if both CPU and memory request-utilization are under this fraction. Lower it to keep more headroom; raise it to be more aggressive about reclaiming idle nodes. |
| --max-node-provision-time | 15m | If a newly requested worker isn't Ready within this window, the autoscaler gives up on it and tries a different node group (or surfaces the failure). |
| --scan-interval | 10s | How often the autoscaler re-evaluates the cluster. Lower = faster reaction to Pending pods, higher = less apiserver load on very large clusters. |
| --expander | random | Strategy used when more than one node group could host a Pending pod. See Expander strategies below. |
| --max-empty-bulk-delete | 10 | Maximum number of empty nodes deleted in one scale-down pass. Useful on very large clusters where removing 50 nodes at once is undesirable. |
| --skip-nodes-with-system-pods | true | Don't scale down nodes hosting kube-system pods (other than DaemonSet and mirror pods) that have no PodDisruptionBudget. Protects system pods from eviction, at the cost of pinning their nodes. Most people leave this alone. |
| --skip-nodes-with-local-storage | true | Don't scale down a node if any pod on it uses emptyDir or hostPath. Switch to false only if you've verified those pods can lose their local data. |
Setting --scale-down-delay-after-add to 0s or --scale-down-unneeded-time to 30s looks responsive in testing but causes thrashing in production. Each scale-down forces a VM delete plus a Kubernetes node deregistration, which is not free.
Expander strategies
When a Pending pod could fit in more than one of your node groups, the expander breaks the tie. Pick the one that matches how your pools differ.
| Value | Behaviour | Use when |
|---|---|---|
| random | Picks any qualifying group at random. | All your pools are roughly equivalent. |
| most-pods | Picks the group that would schedule the largest number of Pending pods with a single new node. | You have a backlog of similar small pods and want fewer, bigger nodes. |
| least-waste | Picks the group whose template node leaves the least unallocated CPU + memory after placing the pods. | Pools differ in size and you want to minimize wasted resource on each new node. |
| priority | Uses a cluster-autoscaler-priority-expander ConfigMap in kube-system to pick groups in a defined order, with regex-matched fallbacks. | You have a preferred cheap pool and a fallback pool (for example, "use the standard pool first; only burst into the high-memory pool if the standard pool is at max"). |
For mixed-instance clusters, least-waste is the most common pick. For homogeneous clusters with one autoscaling pool, the choice doesn't matter and random is fine.
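To see how most-pods and least-waste can disagree, here is a toy model of the two rules as the table describes them. It is a sketch, not the upstream expander code: packing is a naive first-fit on one empty node, "waste" is taken as the sum of the unallocated CPU and memory fractions, and the pool names and template sizes are hypothetical.

```python
def pack_one_node(cpu_m, mem_mib, pods):
    """Greedily place (cpu_m, mem_mib) pod requests on one empty node."""
    placed = []
    free_cpu, free_mem = cpu_m, mem_mib
    for pod_cpu, pod_mem in pods:
        if pod_cpu <= free_cpu and pod_mem <= free_mem:
            placed.append((pod_cpu, pod_mem))
            free_cpu -= pod_cpu
            free_mem -= pod_mem
    return placed


def wasted_fraction(cpu_m, mem_mib, placed):
    """Unallocated CPU fraction plus unallocated memory fraction."""
    used_cpu = sum(p[0] for p in placed)
    used_mem = sum(p[1] for p in placed)
    return (1 - used_cpu / cpu_m) + (1 - used_mem / mem_mib)


def choose_group(templates, pods, expander):
    """Pick a pool name using a simplified most-pods or least-waste rule."""
    options = {name: pack_one_node(cpu, mem, pods)
               for name, (cpu, mem) in templates.items()}
    options = {name: placed for name, placed in options.items() if placed}
    if not options:
        return None
    if expander == "most-pods":
        return max(options, key=lambda n: len(options[n]))
    if expander == "least-waste":
        return min(options,
                   key=lambda n: wasted_fraction(*templates[n], options[n]))
    raise ValueError(f"unsupported expander in this sketch: {expander}")


# Hypothetical templates: allocatable (CPU millicores, memory MiB) per worker.
templates = {"standard": (1800, 3200), "large": (7600, 30000)}
pending = [(500, 1024)] * 6  # six identical Pending pods

print(choose_group(templates, pending, "most-pods"))    # large
print(choose_group(templates, pending, "least-waste"))  # standard
```

One large worker absorbs all six pods, so most-pods prefers it. A standard worker takes only three but ends up nearly full, so least-waste prefers that pool.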
Why no scale-down?
The most common autoscaler ticket is "I have an idle node sitting there and the autoscaler won't remove it". The cause is almost always one of these.
1. A pod uses local storage
If any pod on the node mounts an emptyDir or HostPath volume, the autoscaler refuses to drain it by default. emptyDir data is lost when the pod moves to a different node, so the autoscaler errs on the side of caution. Either:
- Annotate the pod with cluster-autoscaler.kubernetes.io/safe-to-evict: "true" if you genuinely don't care about losing the scratch data.
- Move the workload to a PersistentVolumeClaim that survives reschedules.
- Set --skip-nodes-with-local-storage=false globally (only if every pod with local storage is safe to evict; this is rarely the right answer).
2. A kube-system pod has no PodDisruptionBudget
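With --skip-nodes-with-system-pods=true, a kube-system pod that isn't a DaemonSet or mirror pod and has no PodDisruptionBudget blocks scale-down. So does any pod that isn't controlled by a Deployment / DaemonSet / StatefulSet (rare, but it happens with one-off jobs or hand-rolled manifests). Either add a PodDisruptionBudget that permits eviction, give the pod a controller, or set cluster-autoscaler.kubernetes.io/safe-to-evict: "true" on it.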
3. A pod is annotated safe-to-evict: false
An explicit cluster-autoscaler.kubernetes.io/safe-to-evict: "false" annotation on any pod pins the node it's on. This is often intentional (singleton workloads, long-running batch jobs) but easy to forget. To find them all:

    kubectl get pods -A -o jsonpath='{range .items[?(@.metadata.annotations.cluster-autoscaler\.kubernetes\.io/safe-to-evict=="false")]}{.metadata.namespace}/{.metadata.name}{"\n"}{end}'
4. DaemonSet pods
By default the autoscaler ignores DaemonSet pods when computing utilization, which is what you want. If a DaemonSet pod is non-idempotent and you've added a custom annotation to block its eviction, the node won't drain.
5. The pool is already at min
Worker pools have a configured minimum (set on the cluster's Workers tab). The autoscaler never scales a pool below its min, even if every node is empty. If you want it to go to zero, set min: 0 on the pool and make sure no critical workload has affinity for that pool.
6. Recent scale-up cooldown
Right after the autoscaler adds a worker, no scale-down can run for --scale-down-delay-after-add (default 10 minutes). This is a feature, not a bug; it prevents oscillation. If you're testing scale-down behaviour, wait the cooldown out before drawing conclusions.
7. Utilization is just barely above threshold
Utilization is measured against requests, not actual usage. A pod that requests: cpu=500m but actually uses 5m of CPU still counts as 500m. Pools full of generously-sized requests look "busy" even when CPU graphs are flat. Either right-size the requests, or lower --scale-down-utilization-threshold.
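A quick worked example of why generous requests pin a node. This is a minimal sketch; the node size and pod requests are made-up numbers, and utilization is taken as summed requests divided by the node's allocatable capacity.

```python
def request_utilization(pod_requests, allocatable_cpu_m, allocatable_mem_mib):
    """Return (cpu, mem) utilization as summed requests / allocatable."""
    cpu = sum(c for c, _ in pod_requests) / allocatable_cpu_m
    mem = sum(m for _, m in pod_requests) / allocatable_mem_mib
    return cpu, mem


# Four pods, each requesting 500m CPU / 256 MiB, on a node with
# 3600m CPU / 7000 MiB allocatable. Actual CPU usage never enters the formula.
cpu, mem = request_utilization([(500, 256)] * 4, 3600, 7000)
print(f"cpu={cpu:.2f} mem={mem:.2f}")  # cpu=0.56 mem=0.15

# Right-sizing the same pods to 200m CPU brings the node under the 0.5 default.
cpu_small, _ = request_utilization([(200, 256)] * 4, 3600, 7000)
print(f"cpu={cpu_small:.2f}")  # cpu=0.22
```

In the first case the node sits above the default 0.5 threshold on CPU alone, so it is never a removal candidate, however idle the CPU graphs look.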
When in doubt, check the autoscaler's own logs with kubectl -n kube-system logs deploy/cluster-autoscaler --tail=200. It prints "scale-down: node X is not eligible because Y" for every blocked candidate. Read that before guessing.
Sizing your pool
The autoscaler is good at filling out a pool to match real demand. It's bad at picking the right shape of node for you. That's a sizing decision.
Pick a worker plan that fits 2-4 pods comfortably
If your average pod requests 500m CPU and 1 GiB RAM, don't pick a 1-vCPU / 2 GiB worker. Reserved overhead (kubelet, container runtime, OS) typically eats 200-400m CPU and 600-900 MiB RAM per node, so a tiny node fits maybe one pod and the autoscaler ends up adding a whole VM per replica. Pick a worker plan where 2-4 typical pods leave the node still useful.
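The arithmetic behind that advice, as a small sketch. The default overhead values (300m CPU, 750 MiB) are assumed midpoints of the ranges quoted above, not measured figures, and the worker sizes are illustrative.

```python
def pods_per_node(node_cpu_m, node_mem_mib, pod_cpu_m, pod_mem_mib,
                  overhead_cpu_m=300, overhead_mem_mib=750):
    """How many identical pods fit on one worker after reserved overhead."""
    usable_cpu = node_cpu_m - overhead_cpu_m
    usable_mem = node_mem_mib - overhead_mem_mib
    if usable_cpu <= 0 or usable_mem <= 0:
        return 0
    # The scarcer of the two resources decides how many pods fit.
    return min(usable_cpu // pod_cpu_m, usable_mem // pod_mem_mib)


# Pod requesting 500m CPU / 1 GiB RAM, as in the paragraph above.
print(pods_per_node(1000, 2048, 500, 1024))  # 1 vCPU / 2 GiB worker -> 1
print(pods_per_node(2000, 4096, 500, 1024))  # 2 vCPU / 4 GiB worker -> 3
```

Doubling the worker size triples the pod count here, because the fixed overhead is paid once per node.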
Steady workloads: fixed min, modest headroom
For a service that's pretty constant - five replicas, day in, day out - set min: 5 and max: 8. The autoscaler stays out of the way during normal operation and only kicks in for traffic spikes or rolling deploys.
Batch / queue workloads: low min, large max
If you spin up 200 worker pods when a job lands and run zero in between, set min: 0 (the autoscaler will go all the way to empty) and max generously. Pair with a sensible --max-node-provision-time so a job that overshoots quota fails fast rather than hanging.
Latency-sensitive workloads: keep extra capacity warm
New workers take 1-3 minutes to provision and join. If your pods can't tolerate that, oversize the pool's min by one or two workers' worth so there's always a spare node ready for scheduling. Don't try to "make the autoscaler faster" with aggressive flags; instead, keep idle capacity on purpose.
Multiple pools beats one pool
A single autoscaling pool with mixed workloads almost always ends up oversized to satisfy the most-demanding pod. Splitting into two pools (e.g. app standard workers, build high-CPU workers) lets each one autoscale independently against its own demand.
Image compatibility matrix
The autoscaler image must be compatible with the cluster's Kubernetes minor version. Upstream cluster-autoscaler is generally tested against its matching minor and the two adjacent ones; an image outside that range may refuse to start or silently misbehave.
The panel selects the image automatically based on the cluster's Kubernetes version. The table below is what it picks today.
| Kubernetes version | Recommended image | Notes |
|---|---|---|
| 1.30.x | cluster-autoscaler v1.30 | Upstream image. Stock cloud provider list. |
| 1.31.x | cluster-autoscaler v1.31 | Upstream image. Stock cloud provider list. |
| 1.32.x | cluster-autoscaler v1.32 | Upstream image. Stock cloud provider list. |
| 1.33.x | cluster-autoscaler v1.33 | Upstream image. Stock cloud provider list. |
| 1.34.x | cluster-autoscaler-hypervisor v1.34.3 (default) | Hypervisor.io build with the native cloud provider compiled in. Recommended. |
| 1.35.x | cluster-autoscaler-hypervisor v1.35.0 (default) | Hypervisor.io build with the native cloud provider compiled in. Recommended. |
Hypervisor.io-built images live at:
    ghcr.io/hypervisor-io/cluster-autoscaler-hypervisor:v1.34.3
    ghcr.io/hypervisor-io/cluster-autoscaler-hypervisor:v1.35.0
When to override the image
Almost never. The default image is the one validated against the matching Kubernetes minor for every release. The Autoscaler tab lets you pin a specific tag if you're chasing a fix in a newer patch release, but anything outside the matrix is unsupported.
Troubleshooting
Most autoscaler problems show up in its own logs first. Always start with:
    kubectl -n kube-system logs deploy/cluster-autoscaler --tail=300
    kubectl -n kube-system get events --sort-by=.lastTimestamp | tail -50
Then match against the table below.
| Symptom | Likely cause | Fix |
|---|---|---|
| No NodeGroup for node in the logs | The node either belongs to a non-autoscaling pool, or is a control plane node. CP nodes are excluded by design and this log line is harmless for them. | If you see it for a worker, confirm the worker pool has autoscaling enabled and the node has the expected pool label. |
| Token expired / 401 from management API | Older versions of the autoscaler required a manual token rotation. Current versions self-rotate at T-30d, so this is no longer a real problem. | Save the args again from the Autoscaler tab; the panel will re-mount a fresh token. |
| Pods stuck Pending after --max-node-provision-time | A new worker was requested but never became Ready. Common causes: VM provision took too long, the hypervisor is out of capacity for the requested plan, or kubelet couldn't reach the apiserver. | Check the cluster's Tasks tab for the failed worker provision and the corresponding event on the worker pool. |
| Scale-down isn't happening | One of the seven reasons in Why no scale-down?. | Read the autoscaler logs; it logs the exact reason per candidate node. |
| Autoscaler restarts in a loop | An invalid flag was passed via the Autoscaler tab (typo, removed flag in newer version), or the image isn't compatible with the cluster's Kubernetes version. | Roll back the last args change. Confirm the image tag matches the matrix above. kubectl -n kube-system logs deploy/cluster-autoscaler --previous shows the crash reason. |
| Scale-up happens but pods still Pending | The new worker's template wouldn't actually fit the pod (often due to a node-selector or taint the autoscaler didn't account for, or a pod with bigger requests than the template). | Verify kubectl describe pod <name> shows the pod fits the chosen pool's worker plan, and that selectors/tolerations match the pool's labels and taints. |
| Two pools but autoscaler always picks the wrong one | The default random expander broke a tie poorly. | Switch --expander to least-waste, or use priority with a ConfigMap to define an explicit order. See Expander strategies. |
Useful one-liners
    # Find every pod blocking scale-down via safe-to-evict=false
    kubectl get pods -A -o json | jq -r '
      .items[]
      | select(.metadata.annotations["cluster-autoscaler.kubernetes.io/safe-to-evict"] == "false")
      | "\(.metadata.namespace)/\(.metadata.name)"'

    # See the autoscaler's view of node groups + bounds
    kubectl -n kube-system get configmap cluster-autoscaler-status -o yaml

    # Watch scale decisions live
    kubectl -n kube-system logs deploy/cluster-autoscaler -f | grep -E 'scale-(up|down)|Pod.*unschedulable'
Still stuck?
- Check the cluster's Tasks tab on the panel for failed worker provision events.
- Read the upstream cluster-autoscaler FAQ - 90% of questions about scaling behaviour are answered there.
- Reach out via support or the Discord.