Kubernetes Cluster Configuration

We use Kubernetes to run our JupyterHubs. It has a healthy open source community, managed offerings from multiple vendors, and a fast pace of development. Because similar configuration works across many different cloud providers when we run on top of Kubernetes, it also serves as our cloud-agnostic abstraction layer.

We prefer using a managed Kubernetes service (such as Google Kubernetes Engine). This document lays out our preferred cluster configuration on various cloud providers.

Google Kubernetes Engine

In our experience, Google Kubernetes Engine (GKE) has been the most stable, performant, and reliable managed Kubernetes service. We prefer running on it when possible.

A gcloud container clusters create command can succinctly express the configuration of our Kubernetes cluster. The following commands create the cluster and a user node pool in our currently favored configuration.

gcloud container clusters create \
     --enable-ip-alias \
     --enable-autoscaling \
     --max-nodes=20 --min-nodes=1 \
     --region=us-central1 --node-locations=us-central1-b \
     --image-type=ubuntu \
     --disk-size=100 --disk-type=pd-standard \
     --machine-type=n1-highmem-8 \
     --cluster-version latest \
     --no-enable-autoupgrade \
     --enable-network-policy \
     --create-subnetwork="" \
     --tags=hub-cluster \
     <cluster-name>

gcloud container node-pools create \
    --machine-type e2-highmem-8 \
    --num-nodes 1 \
    --enable-autoscaling \
    --min-nodes 1 --max-nodes 20 \
    --node-labels hub.jupyter.org/pool-name=<pool-name>-pool \
    --node-taints hub.jupyter.org_dedicated=user:NoSchedule \
    --region=us-central1 \
    --image-type=ubuntu \
    --disk-size=200 --disk-type=pd-ssd \
    --no-enable-autoupgrade \
    --tags=hub-cluster \
    --cluster=<cluster-name> \
    user-pool-<pool-name>-<yyyy>-<mm>-<dd>
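
The node pool name in the second command ends with a creation date. As a minimal sketch, a small helper can build that name so the date does not have to be typed by hand. The function name node_pool_name and its optional date argument are hypothetical and not part of gcloud.

```shell
# Hypothetical helper: build a node pool name in the
# user-pool-<pool-name>-<yyyy>-<mm>-<dd> form used above.
node_pool_name() {
  local pool_name="$1"
  # Default to today's date; pass an explicit date to get a fixed name.
  local created="${2:-$(date +%Y-%m-%d)}"
  printf 'user-pool-%s-%s\n' "$pool_name" "$created"
}

node_pool_name alpha 2019-08-15   # prints user-pool-alpha-2019-08-15
```

The output can be used as the final argument of gcloud container node-pools create.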

IP Aliasing

--enable-ip-alias creates VPC Native Clusters.

This is expected to become the default, at which point the flag can be removed.

Autoscaling

We use the Kubernetes cluster autoscaler to scale our node count up and down based on demand. It waits until the cluster is completely full before triggering creation of a new node, but that’s ok, since new nodes come up fairly quickly on GKE.

--enable-autoscaling turns the cluster autoscaler on.

--min-nodes sets the minimum number of nodes that will be maintained regardless of demand. This should ideally be 2, to give us some headroom for quick starts without requiring scale ups when the cluster is completely empty.

--max-nodes sets the maximum number of nodes that the cluster autoscaler will use - this sets the maximum number of concurrent users we can support. This should be set to a reasonably high number, but not too high - to protect against runaway creation of hundreds of VMs that might drain all our credits due to accident or security breach.

Highly available master

The Kubernetes cluster’s master nodes are managed by Google Cloud automatically. By default, the master is deployed in a non-highly-available configuration with only one node. This means that upgrades and master configuration changes cause a few minutes of downtime for the Kubernetes API, during which new user server starts and stops fail.

We request highly available masters with the --region parameter. This specifies the region whose zones our 3 master nodes are spread across. It costs us nothing extra, so we should always do it.

By default, asking for highly available masters also asks for 3x the node count, spread across multiple zones. We don’t want that, since all our user pods have in-memory state and can’t be relocated. Specifying --node-locations explicitly lets us control how many and which zones the nodes are located in.

Region / Zone selection

We generally use the us-central1 region and a zone in it for our clusters - simply because that is where we have asked for quota.

There are regions closer to us, but latency hasn’t really mattered, so we are currently still in us-central1. There are also unsubstantiated rumors that us-central1 is Google’s biggest data center and hence less likely to run out of quota.

Ubuntu operating system

Since we use NFS for user home directories, we select Ubuntu as our node operating system. The default (Container Optimized OS) does not have NFS support enabled.

Disk Size

--disk-size sets the size of the root disk on all the Kubernetes nodes. This isn’t used for any persistent storage such as user home directories. It is only used ephemerally for the operations of the cluster, primarily storing docker images and other temporary data. We can make this larger if we use a large number of big images, or if we want our image pulls to be faster (since disk performance increases with disk size).

--disk-type=pd-standard gives us standard spinning disks, which are cheaper. We can also request SSDs instead with --disk-type=pd-ssd. They are much faster, but also much more expensive.

Node size

--machine-type lets us select how much RAM and CPU each of our nodes has. For non-trivial hubs, we generally pick n1-highmem-8, with 52G of RAM and 8 cores. This is based on the following heuristics:

  1. Students are generally more memory limited than CPU limited. In fact, while we have a hard limit on memory use per user pod, we do not have a CPU limit - it hasn’t proven necessary.
  2. We try to overprovision clusters by about 2x - so we try to fit about 100G of total RAM use in a node with about 50G of RAM. This is accomplished by setting the memory request to be about half of the memory limit on user pods. This leads to massive cost savings, and works out ok.
  3. There is a Kubernetes limit of roughly 100 pods per node (the default maximum is 110).

Based on these heuristics, n1-highmem-8 seems to give the most bang for the buck currently. We should revisit this for every cluster creation.
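
The heuristics above can be turned into a rough capacity estimate. The sketch below is illustrative only: the function name users_per_node, the 2048 MB per-user memory limit, and the pod cap of 110 (the kubelet default) are assumptions, and it ignores the memory that the node reserves for system components.

```shell
# Hypothetical helper: estimate how many user pods fit on one node.
# Arguments: node RAM in MB, per-user memory limit in MB, pod cap per node.
users_per_node() {
  local node_ram_mb="$1" mem_limit_mb="$2" pod_cap="$3"
  # The memory request is half the limit (the ~2x overprovisioning heuristic).
  local mem_request_mb=$(( mem_limit_mb / 2 ))
  # The scheduler packs pods by request, so requests bound the pod count.
  local by_memory=$(( node_ram_mb / mem_request_mb ))
  # The per-node pod cap applies on top of the memory bound.
  if [ "$by_memory" -lt "$pod_cap" ]; then
    echo "$by_memory"
  else
    echo "$pod_cap"
  fi
}

# n1-highmem-8 has 52G of RAM; assume a 2048 MB limit per user.
per_node=$(users_per_node $(( 52 * 1024 )) 2048 110)
echo "users per node: $per_node"                      # prints users per node: 52
echo "users at --max-nodes=20: $(( per_node * 20 ))"  # prints users at --max-nodes=20: 1040
```

With small per-user limits the pod cap becomes the binding constraint instead of memory, which is why the third heuristic matters when picking a machine type.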

Cluster version

GKE automatically upgrades cluster masters, so there is generally no harm in being on the latest version available.

Node autoupgrades

When node autoupgrades are enabled, GKE will automatically try to upgrade our nodes whenever needed (our GKE version falling off the support window, security issues, etc). However, since we run stateful workloads, we disable this right now so we can do the upgrades manually.

Network Policy

Kubernetes Network Policy lets you firewall internal access inside a Kubernetes cluster, whitelisting only the flows you want. The JupyterHub chart we use supports setting up the NetworkPolicy objects it needs, so we should turn it on for additional defense in depth. Note that any extra in-cluster services we run must have a NetworkPolicy set up for them to work reliably.

Subnetwork

We put each cluster in its own subnetwork, since there seems to be a limit on how many clusters you can create in the same network with IP aliasing on - you just run out of addresses. This also gives us some isolation - subnetworks are isolated by default and can’t reach other resources. You must add firewall rules to provide access, including access to any manually run NFS servers. We add tags for this.

Tags

To help with firewalling, we add network tags to all our cluster nodes. This lets us add firewall rules to control traffic between subnetworks.
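
As a sketch of how a tag-based rule might look, the function below wraps gcloud compute firewall-rules create to let nodes carrying the cluster tag reach a manually run NFS server. The function name, the rule name, the NFS server tag, and the choice of TCP port 2049 (NFSv4) are assumptions; an NFSv3 setup would need additional ports.

```shell
# Hypothetical helper: allow NFS traffic from tagged cluster nodes to a
# tagged NFS server in the same VPC network.
# Arguments: network name, cluster node tag, NFS server tag.
allow_nfs_from_cluster() {
  local network="$1" cluster_tag="$2" nfs_tag="$3"
  gcloud compute firewall-rules create "allow-nfs-from-${cluster_tag}" \
    --network="$network" \
    --direction=INGRESS \
    --allow=tcp:2049 \
    --source-tags="$cluster_tag" \
    --target-tags="$nfs_tag"
}

# Usage (not run here; <network-name> and <nfs-server-tag> are placeholders):
#   allow_nfs_from_cluster <network-name> hub-cluster <nfs-server-tag>
```

Matching on tags rather than IP ranges means the rule keeps working as the autoscaler adds and removes nodes.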

Cluster name

We try to use a descriptive name as much as possible.

Azure Kubernetes Service

Subscription owner access

Use ‘role’, not cluster. Global administrator on the directory service.

SPA / Account access clusterfuck

Options: Use your gmail.com address, or use the ds-instr SPA. Using a berkeley.edu account with Azure without the SPA might make you want to tear your hair out.

Microsoft Azure also provides a managed Kubernetes service, and we have run at least one large course on it each semester. The following commands will create a suitable cluster on AKS:

az group create --name <group-name> --location=westus2

Make a new SSH Key

Put it in the secrets folder

ssh-keygen -f deployments/<deployment-name>/secrets/aks_ssh_key

Create service principal

You should create a new service principal for this cluster. We should probably scope this a bit more, otherwise I think this gives it too many privileges.

az ad sp create-for-rbac \
   --role=Contributor \
   --scopes=/subscriptions/<uuid-of-active-subscription> \
   -o json > deployments/<deployment-name>/secrets/serviceprincipal.json
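
One way to narrow the scope is to limit the service principal to a single resource group instead of the whole subscription. The sketch below only builds the scope string; the function name and the placeholder subscription id are hypothetical, and whether resource-group scope is sufficient for everything AKS needs is an assumption that should be verified before relying on it.

```shell
# Hypothetical helper: build an Azure scope string limited to one resource group.
# Arguments: subscription id, resource group name.
resource_group_scope() {
  local subscription_id="$1" group_name="$2"
  printf '/subscriptions/%s/resourceGroups/%s\n' "$subscription_id" "$group_name"
}

# Example with a placeholder subscription id.
resource_group_scope 00000000-0000-0000-0000-000000000000 my-group
# prints /subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/my-group
```

The resulting string could be passed as the --scopes value of az ad sp create-for-rbac in place of the subscription-wide scope.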

Kubernetes Version

Find out the latest version of Kubernetes supported by AKS and use it. Move forward if possible, not backwards. ‘Stability’ is a myth and does not exist in this world we have now.

az aks get-versions -l westus2
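
When picking the latest version from a list, note that a plain lexical sort would rank 1.9.11 above 1.14.0. The sketch below uses a version-aware sort instead. The function name latest_version and the sample version list are made up for illustration; extracting the version strings from the az output is left out, since its layout may differ between az releases.

```shell
# Hypothetical helper: read version strings on stdin, one per line,
# and print the highest one using a version-aware sort.
latest_version() {
  sort -V | tail -n 1
}

printf '1.9.11\n1.13.5\n1.14.0\n1.12.8\n' | latest_version   # prints 1.14.0
```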

Create cluster

export AZURE_CLUSTER_NAME=data100-fall-2019
export AZURE_DEPLOYMENT=data100

Pick the latest version available at this time. You might need to update your local version of az to get an accurate list.

az aks create \
    --name "$AZURE_CLUSTER_NAME" \
    --resource-group "$AZURE_CLUSTER_NAME" \
    --ssh-key-value "deployments/$AZURE_DEPLOYMENT/secrets/aks_ssh_key.pub" \
    --node-count 1 \
    --node-vm-size Standard_E16s_v3 \
    --node-osdisk-size 100 \
    --kubernetes-version 1.14.0 \
    --nodepool-name default \
    --service-principal "$(jq -r .appId "deployments/$AZURE_DEPLOYMENT/secrets/serviceprincipal.json")" \
    --client-secret "$(jq -r .password "deployments/$AZURE_DEPLOYMENT/secrets/serviceprincipal.json")" \
    --output table

The az group create command creates a resource group in a nearby region, and az aks create creates the cluster. The options are fairly self-explanatory.

Note

Make sure to specify a VM type that supports premium storage disks. For example, the Esv3 series (E2s through E64s v3, such as Standard_E16s_v3) does, but the Ev3 series (E2 through E64 v3) does not.

AKS and SSH

Connecting to Azure nodes by ssh is not as simple as gcloud compute ssh. One must run a vanilla Linux pod in-cluster, add an ssh client, copy an ssh key to it, then exec into the pod.
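
The steps above can be sketched as a shell function. This is a rough outline rather than a tested procedure: the function name, the pod name aks-ssh, the debian image, and the azureuser login (the default AKS admin username) are assumptions, and the node IP is expected to be an internal address as listed by kubectl get nodes -o wide.

```shell
# Hypothetical helper: open an ssh session to an AKS node via a throwaway pod.
# Arguments: path to the private key, internal IP of the node.
aks_node_ssh() {
  local key_file="$1" node_ip="$2" pod="aks-ssh"

  # 1. Run a vanilla Linux pod that just sleeps, and wait until it is ready.
  kubectl run "$pod" --image=debian --restart=Never --command -- sleep infinity
  kubectl wait --for=condition=Ready "pod/$pod" --timeout=120s

  # 2. Add an ssh client inside the pod.
  kubectl exec "$pod" -- sh -c 'apt-get update && apt-get install -y openssh-client'

  # 3. Copy the private key into the pod and restrict its permissions.
  kubectl cp "$key_file" "$pod:/id_rsa"
  kubectl exec "$pod" -- chmod 0600 /id_rsa

  # 4. Exec into the pod and ssh to the node (azureuser is an assumed login).
  kubectl exec -it "$pod" -- ssh -i /id_rsa "azureuser@$node_ip"

  # 5. Remove the pod, and the key copy with it, once the session ends.
  kubectl delete pod "$pod"
}

# Usage (not run here):
#   aks_node_ssh deployments/<deployment-name>/secrets/aks_ssh_key <node-internal-ip>
```

Deleting the pod at the end matters because it holds a copy of the private key.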