Context

If you are using GKE to create your Kubernetes cluster, you can add cluster-autoscaler by checking the Enable cluster autoscaler option while creating the cluster (this applies to Standard GKE clusters, not Autopilot clusters).

Problem

If you want to tweak this cluster-autoscaler (e.g., change some flag) or deploy your own image of cluster-autoscaler (which I wanted to do to test https://github.com/kubernetes/autoscaler/pull/5419), it’s hard to do (there is a StackOverflow question around this). You can tweak some parameters of the GKE-deployed cluster-autoscaler, but it’s quite limited. You can’t access the cluster-autoscaler Deployment as a user using kubectl get deployment cluster-autoscaler -nkube-system; GKE hides it from you.
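For example, on my cluster the Deployment simply isn’t visible (the exact error text may vary with your kubectl version):

$ kubectl get deployment cluster-autoscaler -nkube-system
Error from server (NotFound): deployments.apps "cluster-autoscaler" not found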

How do you tweak cluster-autoscaler on GKE?

Solution

One of the ways to solve this problem is to disable the GKE cluster-autoscaler while creating the cluster and deploy your own cluster-autoscaler. Here’s a step-by-step guide to do that:

1. Create a cluster without GKE cluster-autoscaler

Refer to this guide to create a Standard GKE cluster (please create the cluster only after reading this section completely). When you are configuring nodepools in the NODE POOLS section, don’t enable cluster-autoscaler.

Make sure you check Enable Workload Identity (more info here; we need this later).
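If you prefer the CLI over the console, here’s a rough sketch of the equivalent create command (the cluster name, zone, node count and project ID are illustrative; gcloud leaves node-pool autoscaling disabled unless you explicitly pass --enable-autoscaling):

$ gcloud container clusters create cluster-1 \
    --zone=us-central1-c \
    --num-nodes=3 \
    --workload-pool=my-project-123456.svc.id.goog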

2. Check GKE cluster-autoscaler has not been enabled

Here’s my cluster for example: Connect to your cluster (if you don’t know how, check the official guide).

$ gcloud container clusters describe cluster-1 --zone=us-central1-c | grep "nodePools:" -A 5
nodePools:
- autoscaling: {}
  ...

autoscaling is empty, which means the GKE cluster-autoscaler is disabled.
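You can also check a specific node pool directly; a quick sketch, assuming your node pool kept the console default name default-pool (empty output again means autoscaling is disabled):

$ gcloud container node-pools describe default-pool \
    --cluster=cluster-1 --zone=us-central1-c \
    --format="value(autoscaling)"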

3. Add permissions for cluster-autoscaler to manage Instance Groups

(I like to think of Instance Groups as the GCP equivalent of AWS Auto Scaling Groups.)

To let cluster-autoscaler scale the nodes, we need to give it permission to manage Instance Groups.

1. Create a new GCP Service Account for cluster-autoscaler

Refer to the official docs for detailed instructions.
You can skip steps (2) and (3) and directly click the DONE button.

2. Attach an IAM role to your GCP Service Account

I will be using the Compute Instance Admin (v1) role, which might not be the most restrictive choice.

Ref: https://cloud.google.com/iam/docs/understanding-roles#compute.instanceAdmin.v1

You can define your own role for granular control by adding a new role: click on CREATE ROLE and define your role.

Once the role is created/decided, we need to bind it to the IAM Service Account. To do this, go to IAM and admin -> IAM, click on GRANT ACCESS, select our Service Account from the drop-down list, assign the role, and click on SAVE.

Our GCP Service Account now has permissions to manage Instance Groups.
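For reference, here’s a CLI sketch of the console steps above, assuming you go with the predefined role and the illustrative names used later in this post (cluster-autoscaler, my-project-123456):

$ gcloud iam service-accounts create cluster-autoscaler \
    --project=my-project-123456 \
    --display-name="cluster-autoscaler"

$ gcloud projects add-iam-policy-binding my-project-123456 \
    --member="serviceAccount:cluster-autoscaler@my-project-123456.iam.gserviceaccount.com" \
    --role="roles/compute.instanceAdmin.v1"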

3. Deploy ResourceQuota resource

$ kubectl explain resourcequota
KIND:     ResourceQuota
VERSION:  v1

DESCRIPTION:
     ResourceQuota sets aggregate quota restrictions enforced per namespace

Think of ResourceQuota as per-namespace limits on object creation (pods, in our case). Refer to the official docs for more info: https://v1-24.docs.kubernetes.io/docs/concepts/policy/resource-quotas/

Why are we talking about this?

The cluster-autoscaler Deployment uses the system-cluster-critical PriorityClass (ref).

If you deploy cluster-autoscaler in a non-kube-system namespace, you will see the cluster-autoscaler ReplicaSet unable to create the pods:

  Warning  FailedCreate  9s (x15 over 92s)  replicaset-controller  Error creating: insufficient quota to match these scopes: [{PriorityClass In [system-node-critical system-cluster-critical]}]
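If you hit this, the event lives on the ReplicaSet; one way to surface it (the namespace here is illustrative):

$ kubectl get events -n default --field-selector reason=FailedCreate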

GKE by default creates the gcp-critical-pods ResourceQuota in the kube-system namespace and the gke-resource-quotas ResourceQuota in every namespace (including kube-system, as you can see below).

A set of resource quotas is automatically applied to clusters with 100 nodes or fewer and to namespaces on those clusters. These quotas, which cannot be removed, protect the cluster’s control plane from instability caused by potential bugs in applications deployed to the cluster.

https://cloud.google.com/kubernetes-engine/quotas#resource_quotas

$ kubectl get resourcequota -A
NAMESPACE         NAME                  AGE   REQUEST                                                                                                                               LIMIT
default           gke-resource-quotas   57m   count/ingresses.extensions: 0/100, count/ingresses.networking.k8s.io: 0/100, count/jobs.batch: 0/5k, pods: 0/1500, services: 2/500    
kube-node-lease   gke-resource-quotas   57m   count/ingresses.extensions: 0/100, count/ingresses.networking.k8s.io: 0/100, count/jobs.batch: 0/5k, pods: 0/1500, services: 0/500    
kube-public       gke-resource-quotas   57m   count/ingresses.extensions: 0/100, count/ingresses.networking.k8s.io: 0/100, count/jobs.batch: 0/5k, pods: 0/1500, services: 0/500    
kube-system       gcp-critical-pods     64m   pods: 9/1G                                                                                                                            
kube-system       gke-resource-quotas   57m   count/ingresses.extensions: 0/100, count/ingresses.networking.k8s.io: 0/100, count/jobs.batch: 0/5k, pods: 11/1500, services: 3/500 

Looking at gcp-critical-pods ResourceQuota,

$ kubectl get resourcequota gcp-critical-pods -oyaml -nkube-system 
apiVersion: v1
kind: ResourceQuota
metadata:
  labels:
    addonmanager.kubernetes.io/mode: Reconcile
  name: gcp-critical-pods
  namespace: kube-system
  ...
spec:
  hard:
    pods: 1G
  scopeSelector:
    matchExpressions:
    - operator: In
      scopeName: PriorityClass
      values:
      - system-node-critical
      - system-cluster-critical
status:
  hard:
    pods: 1G
  used:
    pods: "9"
...

As you can see, the gcp-critical-pods ResourceQuota places a hard limit of 1G pods (i.e., effectively no limit) on pods using the system-node-critical and system-cluster-critical PriorityClasses. Since the quota is nowhere near filled up (used is 9 pods), we wouldn’t see any problems if we deployed our cluster-autoscaler in the kube-system namespace. The only problem is, we can’t deploy our cluster-autoscaler in the kube-system namespace (I will get to the why in a moment).

Now, let’s look at the gke-resource-quotas in a non-kube-system namespace, e.g., the default namespace.

ResourceQuota in the default namespace

$ kubectl get resourcequota gke-resource-quotas -oyaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gke-resource-quotas
  namespace: default
  ...
spec:
  hard:
    count/ingresses.extensions: "100"
    count/ingresses.networking.k8s.io: "100"
    count/jobs.batch: 5k
    pods: "1500"
    services: "500"
status:
  hard:
    count/ingresses.extensions: "100"
    count/ingresses.networking.k8s.io: "100"
    count/jobs.batch: 5k
    pods: "1500"
    services: "500"
  used:
    count/ingresses.networking.k8s.io: "0"
    count/jobs.batch: "0"
    pods: "0"
    services: "2"

Notice that the hard limit on pods (spec.hard.pods) is 1500 and the current status.used.pods is 0. So why are we not able to create new cluster-autoscaler pods?
I think it’s because GKE limits Priority Class consumption for certain Priority Classes. What does that mean?

It may be desired that pods at a particular priority, eg. “cluster-services”, should be allowed in a namespace, if and only if, a matching quota object exists.

With this mechanism, operators are able to restrict usage of certain high priority classes to a limited number of namespaces and not every namespace will be able to consume these priority classes by default.

Ref: https://v1-24.docs.kubernetes.io/docs/concepts/policy/resource-quotas/#limit-priority-class-consumption-by-default

Because we can’t rely on the gke-resource-quotas ResourceQuota for cluster-autoscaler (it has no quota scoped to the system-cluster-critical PriorityClass), we have to create our own:

apiVersion: v1
kind: ResourceQuota
metadata:
  labels:
    addonmanager.kubernetes.io/mode: Reconcile
  name: gcp-critical-pods
  namespace: default
spec:
  hard:
    pods: 2 # 2 because we need it only for cluster-autoscaler
  scopeSelector:
    matchExpressions:
    - operator: In
      scopeName: PriorityClass
      values:
      - system-cluster-critical # cluster-autoscaler priority class

Save the above YAML in a file and run kubectl apply -f <file>.yaml.
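You can verify the quota was created; the output should look something like:

$ kubectl get resourcequota gcp-critical-pods -n default
NAME                AGE   REQUEST     LIMIT
gcp-critical-pods   5s    pods: 0/2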

4. Deploy your own cluster-autoscaler

4.1. Before Installing

To install the chart we first need to add the helm repo:

helm repo add autoscaler https://kubernetes.github.io/autoscaler
helm repo update

ref: https://artifacthub.io/packages/helm/cluster-autoscaler/cluster-autoscaler/9.25.0#tl-dr

To check all the available chart versions, do

helm search repo cluster-autoscaler --versions
NAME                               	CHART VERSION	APP VERSION	DESCRIPTION                                       
autoscaler/cluster-autoscaler      	9.25.0       	1.24.0     	Scales Kubernetes worker nodes within autoscali...
autoscaler/cluster-autoscaler      	9.24.0       	1.23.0     	Scales Kubernetes worker nodes within autoscali...
autoscaler/cluster-autoscaler      	9.23.2       	1.23.0     	Scales Kubernetes worker nodes within autoscali...
...

APP VERSION above is the Kubernetes version supported by the chart.
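If you want to double-check a particular chart version before installing, you can inspect its metadata (appVersion, any Kubernetes version constraints, etc.):

$ helm show chart autoscaler/cluster-autoscaler --version 9.25.0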

$ kubectl version --short
...
Client Version: v1.24.4
Kustomize Version: v4.5.4
Server Version: v1.24.9-gke.3200

Since I am on Kubernetes 1.24.x, I will be installing version 9.25.0 of the Helm chart.

We can install the chart in any namespace as long as it’s not kube-system. Why? If you deploy the chart in the kube-system namespace, you will see the following issue when you check your cluster-autoscaler pod logs (GKE’s own cluster-autoscaler, running on the control plane, already holds the leader-election lease there):

I0125 05:50:05.187469       1 leaderelection.go:253] failed to acquire lease kube-system/cluster-autoscaler
I0125 05:50:07.915634       1 leaderelection.go:352] lock is held by gke-20e46577c7cf4842bc69-6353-ac86-vm and has not yet expired
I0125 05:50:07.915663       1 leaderelection.go:253] failed to acquire lease kube-system/cluster-autoscaler
I0125 05:50:11.162260       1 leaderelection.go:352] lock is held by gke-20e46577c7cf4842bc69-6353-ac86-vm and has not yet expired
I0125 05:50:11.162323       1 leaderelection.go:253] failed to acquire lease kube-system/cluster-autoscaler

More info: https://github.com/kubernetes/autoscaler/issues/5277

4.2. Install the chart

$ helm install custom-ca autoscaler/cluster-autoscaler \
--set "autoscalingGroupsnamePrefix[0].name=gke-cluster-1,autoscalingGroupsnamePrefix[0].maxSize=10,autoscalingGroupsnamePrefix[0].minSize=1" \
--set autoDiscovery.clusterName=cluster-1 \
--set "rbac.serviceAccount.annotations.iam\.gke\.io\/gcp-service-account=cluster-autoscaler@my-project-123456.iam.gserviceaccount.com" \
--set cloudProvider=gce \
--version=9.25.0 \
--namespace=default
NAME: custom-ca
LAST DEPLOYED: Fri Mar 10 11:55:49 2023
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
To verify that cluster-autoscaler has started, run:

  kubectl --namespace=default get pods -l "app.kubernetes.io/name=gce-cluster-autoscaler,app.kubernetes.io/instance=custom-ca"

ref1: https://artifacthub.io/packages/helm/cluster-autoscaler/cluster-autoscaler/9.25.0#gce
ref2: https://artifacthub.io/packages/helm/cluster-autoscaler/cluster-autoscaler/9.25.0?modal=values&path=rbac.serviceAccount.annotations

autoscalingGroupsnamePrefix[0].name is the prefix of the Instance Groups you want to let cluster-autoscaler manage. If you want to target all the Instance Groups, you can specify a short prefix like gke-<your-cluster-name>. For example, since I created a cluster called cluster-1, the prefix in the command above is gke-cluster-1.

Note that autoscalingGroupsnamePrefix is an array because you can specify multiple prefixes to match multiple sets of Instance Groups (check line 74 to 78 here). In the above command we specify only one prefix, which matches all the nodepools. If you are unsure what prefix to use, you can list your Instance Group names as shown below.
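GKE-managed Instance Group names typically look like gke-<cluster-name>-<nodepool-name>-<hash>-grp (the name in the output below is illustrative):

$ gcloud compute instance-groups managed list --format="value(name)"
gke-cluster-1-default-pool-c8e2e2e5-grp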

4.3. Create Kubernetes Service Account to GCP IAM Service Account binding

Allow the Kubernetes service account to impersonate the IAM service account by adding an IAM policy binding between the two service accounts. This binding allows the Kubernetes service account to act as the IAM service account.

https://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity#authenticating_to

We have to link our Kubernetes Service Account to our GCP IAM Service Account with an IAM policy binding. The command looks like this:

$ gcloud iam service-accounts add-iam-policy-binding cluster-autoscaler@<GCP-project-ID>.iam.gserviceaccount.com \
    --role roles/iam.workloadIdentityUser \
    --member "serviceAccount:<GCP-project-ID>.svc.id.goog[<Kubernetes-namespace>/<Kubernetes-Service-Account-name>]"

To find out your cluster-autoscaler’s Kubernetes Service Account,

$ kubectl get pod custom-ca-gce-cluster-autoscaler-5989f4d65c-trbmm -o=jsonpath='{.spec.serviceAccountName}'
custom-ca-gce-cluster-autoscaler

Replace custom-ca-gce-cluster-autoscaler-5989f4d65c-trbmm with the name of your cluster-autoscaler pod.

For example, if the GCP project ID is my-project-123456, the GCP Service Account is cluster-autoscaler@my-project-123456.iam.gserviceaccount.com, and the Kubernetes Service Account is called custom-ca-gce-cluster-autoscaler in the default namespace, you can do:

$ gcloud iam service-accounts add-iam-policy-binding cluster-autoscaler@my-project-123456.iam.gserviceaccount.com \
    --role roles/iam.workloadIdentityUser \
    --member "serviceAccount:my-project-123456.svc.id.goog[default/custom-ca-gce-cluster-autoscaler]"
Updated IAM policy for serviceAccount [cluster-autoscaler@my-project-123456.iam.gserviceaccount.com].
bindings:
- members:
  - serviceAccount:my-project-123456.svc.id.goog[default/custom-ca-gce-cluster-autoscaler]
  role: roles/iam.workloadIdentityUser
etag: CoB1hlRFPq0=
version: 1
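You can confirm the binding stuck by reading back the IAM policy on the service account:

$ gcloud iam service-accounts get-iam-policy \
    cluster-autoscaler@my-project-123456.iam.gserviceaccount.com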

4.4. Use the GCP Service Account in Kubernetes

Go to IAM and admin -> Service accounts and click on our Service Account.

Note that GCP doesn’t recommend downloading the Service Account JSON key:

Service account keys could pose a security risk if compromised. We recommend that you avoid downloading service account keys and instead use the Workload Identity Federation. You can learn more about the best way to authenticate service accounts on Google Cloud here.

To use the Service Account, we have to add the following annotation to the Kubernetes Service Account for our cluster-autoscaler:

iam.gke.io/gcp-service-account: <GCP-Service-Account-name>@<GCP-Project-Id>.iam.gserviceaccount.com

Ref: https://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity#authenticating_to

For example,

kubectl annotate sa custom-ca-gce-cluster-autoscaler iam.gke.io/gcp-service-account=cluster-autoscaler@my-project-123456.iam.gserviceaccount.com
serviceaccount/custom-ca-gce-cluster-autoscaler annotated

You should start seeing your cluster-autoscaler pod running.
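For example, using the label selector from the Helm notes above (output is illustrative):

$ kubectl --namespace=default get pods -l "app.kubernetes.io/name=gce-cluster-autoscaler,app.kubernetes.io/instance=custom-ca"
NAME                                                READY   STATUS    RESTARTS   AGE
custom-ca-gce-cluster-autoscaler-5989f4d65c-trbmm   1/1     Running   0          2m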

Conclusion

That’s all! If you have any constructive feedback, feel free to mail me at surajrbanakar@gmail.com.