Practical Issues Setting up Kubernetes for Data Science on AWS

Data science has unique workflows that don’t always match those of software engineers and require special setup for Kubernetes.

Kubernetes provides a ton of useful primitives for setting up your own infrastructure. However, the standard way of provisioning Kubernetes isn’t set up very well for data science workflows. This article describes those problems, and how we think about them.

UPDATE: The EIP limits are still an issue with the standard aws-cni. We’ve updated our Kubernetes clusters to use the calico-cni, which avoids these issues. This is now what we recommend.

Multiple AZs (Availability Zones) interact poorly with EBS (Elastic Block Store) and Jupyter

Love it or hate it, Jupyter is the most common data science IDE today. If you are provisioning a Kubernetes cluster for data scientists, they will definitely want to run Jupyter notebooks. And if they are running Jupyter notebooks, they are going to want persistent storage – working in an IDE where your filesystem is erased whenever you shut down the machine is a very bad experience! This is very different from conventional Kubernetes workloads, which are often stateless.

This also means that if you are supporting multiple AZs, you need to make sure that Jupyter instances are always started in the same AZ, since your persistent storage (EBS) cannot be automatically migrated to other AZs. One option is to use EFS instead of EBS. We do not do that at Saturn because most data scientists want to store data on their drives as well, and once you start getting into massively parallel cluster compute, EFS is not reliable. At Saturn, we route all Jupyter instances to the same AZ, and set up ASGs (Auto Scaling groups) that are specific to this AZ. You trade off high availability, but you never really had that anyway, unless you are backing up snapshots to multiple AZs and are prepared to restore an EBS snapshot to another AZ if your primary goes down.

Standard VPCs run into EIP (Elastic IP) limits

The standard Kubernetes setup keeps workers in a private subnet. Most people end up creating private subnets with CIDR blocks that allocate roughly 256 IP addresses. This seems reasonable because you probably aren’t going to need more than 256 machines, right? WRONG! With EKS, the instance type determines the default (and maximum) number of IP addresses allocated per instance. By default, some of the larger instances (r5.16xlarge with 512 GB of RAM, for example) consume 50 IP addresses. This means that a team of 5 data scientists, each using 1 r5.16xlarge, can easily exhaust the number of IP addresses in your private subnet. We recommend allocating really big subnets (our default configuration has 8000 addresses available in the CIDR block).
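The arithmetic behind that warning can be sketched with Python's standard `ipaddress` module. This is a rough estimate under stated assumptions: the CIDR ranges are hypothetical examples, the 50 addresses per node is the figure quoted above, and the function name `max_nodes` is made up for illustration. The 5 reserved addresses are the ones AWS holds back in every subnet.

```python
import ipaddress

# AWS reserves the first four addresses and the last address of every subnet.
AWS_RESERVED_PER_SUBNET = 5


def max_nodes(cidr: str, ips_per_node: int) -> int:
    """Estimate how many nodes fit in a subnet if each node consumes ips_per_node addresses."""
    usable = ipaddress.ip_network(cidr).num_addresses - AWS_RESERVED_PER_SUBNET
    return usable // ips_per_node


# Hypothetical CIDR ranges; 50 is the per-node figure quoted in the text.
print(max_nodes("10.0.0.0/24", 50))  # a "roughly 256 address" subnet
print(max_nodes("10.0.0.0/19", 50))  # a subnet with about 8000 addresses
```

Under these assumptions the /24 subnet holds only five such nodes, while the /19 subnet holds 163, and that is before counting any other pods, load balancers, or endpoints that also need addresses.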

Standard Secure Networking Configurations with Calico are unstable as nodes scale up.

Many data science tools don’t provide much built-in security. This is completely reasonable for an OSS project, but a huge problem for enterprise deployments. For example, by default Dask clusters don’t protect the scheduler, which means that anyone who has access to the cluster can submit jobs to any Dask cluster. (Dask has other ways to mitigate this; however, for us, dealing with this at the network level is the most flexible.) Even if you are securing access into the Kubernetes cluster, it’s a good idea to protect data science tools by using Kubernetes network policies to lock down resources.
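As an illustration of locking down a Dask scheduler at the network level, here is a minimal sketch that builds a Kubernetes NetworkPolicy manifest as a Python dictionary. It is not Saturn's actual policy. The label keys `dask-cluster-id` and `dask-role`, the function name, and the namespace are all hypothetical; the only fixed facts are the `networking.k8s.io/v1` NetworkPolicy schema and Dask's default scheduler port, 8786.

```python
import json


def dask_scheduler_policy(namespace: str, cluster_id: str) -> dict:
    """Build a NetworkPolicy that only lets pods of one Dask cluster reach its scheduler."""
    # Hypothetical label: every pod belonging to this cluster (workers, client) carries it.
    same_cluster = {"matchLabels": {"dask-cluster-id": cluster_id}}
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": f"dask-scheduler-{cluster_id}", "namespace": namespace},
        "spec": {
            # The policy applies to the scheduler pod of this cluster only.
            "podSelector": {
                "matchLabels": {"dask-cluster-id": cluster_id, "dask-role": "scheduler"}
            },
            "policyTypes": ["Ingress"],
            "ingress": [
                {
                    # Admit traffic only from pods in the same cluster, on the scheduler port.
                    "from": [{"podSelector": same_cluster}],
                    "ports": [{"protocol": "TCP", "port": 8786}],
                }
            ],
        },
    }


print(json.dumps(dask_scheduler_policy("data-science", "abc123"), indent=2))
```

The printed JSON can be applied with `kubectl apply -f`. The policy only takes effect if the cluster runs a network plugin that enforces NetworkPolicy, and the user's Jupyter pod would need the same cluster label in order to submit work.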

The current standard for secure networking in EKS is Calico. The standard Calico setup horizontally autoscales the metadata store based on the number of nodes, which is overly aggressive for most data science workloads because we’re typically running 1 user pod per machine (more on this later). This horizontal autoscaling behavior causes instability in Calico, which results in nodes failing to communicate. We solve this by controlling the number of replicas in the metadata store manually – this has been very stable for our data science workloads.

ASGs (Auto Scaling Groups) and the cluster-autoscaler

Data science workloads are very different from normal Kubernetes workloads because they are almost always memory bound. EC2 pricing is mostly driven by memory, so you don’t get very much value out of multi-tenancy with data science workloads. Data science workloads are often long-running (many data scientists will provision a Jupyter instance, and leave it up for an entire month). You are much better off allocating one entire node per data science pod. If you do not do this, it’s very likely that you’ll end up scheduling a pod that should only cost $30 a month on a node that costs $3000 a month.
At Saturn, we create many ASGs (4GB/16GB/64GB/128GB/256GB/512GB) to waste as little RAM as possible.
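Choosing the group that wastes the least RAM amounts to finding the smallest size that still fits the request. The sketch below assumes the sizes listed above and a hypothetical helper name, and it ignores the memory that the kubelet and system daemons reserve on each node.

```python
import bisect

# Node memory sizes in GB, one per ASG, as listed in the text (must stay sorted).
ASG_SIZES_GB = [4, 16, 64, 128, 256, 512]


def smallest_fitting_size(requested_gb: float) -> int:
    """Return the smallest ASG size (in GB) that can hold the requested memory."""
    idx = bisect.bisect_left(ASG_SIZES_GB, requested_gb)
    if idx == len(ASG_SIZES_GB):
        raise ValueError(f"no ASG is large enough for {requested_gb} GB")
    return ASG_SIZES_GB[idx]


print(smallest_fitting_size(3))    # fits in the 4 GB group
print(smallest_fitting_size(100))  # needs the 128 GB group
```

In practice the usable memory on a node is somewhat below its nominal size, so a real implementation would compare against allocatable memory instead.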

Data science workloads are also very spiky. One day, a data scientist might need to edit code on a 4 GB instance which costs $30/month. The next day, they might want a 512 GB Jupyter instance ($3000/month) connected to a 10-node Dask cluster (5 TB of RAM in total). Being able to autoscale Kubernetes is very important.

The Kubernetes cluster-autoscaler

The Kubernetes cluster-autoscaler is a sophisticated genius and can figure out the optimal node to spin up in order to accommodate a new pod, as well as which pods should be shuffled around in order to free up nodes that can be spun down. However, if you’ve bought into the one-pod-per-instance configuration, this sophistication causes a ton of added complexity, which almost always causes problems.

Problem – ASG selection

The Kubernetes cluster autoscaler is configured with expanders, which determine how it allocates new EC2 instances. The standard choice for cost-conscious users is least-waste. However, the least-waste expander computes waste in percentage terms. This means that if you create a pod and forget to specify any CPU or RAM requests (you request 0), the cluster autoscaler will determine that every node type results in 100% waste. When an expander decides it has a tie, the decision falls through to the random expander. Congratulations! Your $30/month pod now costs $3000/month.
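The tie is easy to reproduce with a toy model. The code below is not the cluster autoscaler's implementation. It is a simplified sketch that looks only at memory, uses hypothetical function names, and assumes one node group per size from 4 GB to 512 GB.

```python
import random


def wasted_fraction(requested_gb: float, capacity_gb: float) -> float:
    """Fraction of a node's memory left unused by the pod."""
    return (capacity_gb - requested_gb) / capacity_gb


def least_waste_candidates(requested_gb: float, node_sizes_gb: list) -> list:
    """Return every node size that ties for the lowest wasted fraction."""
    fitting = [size for size in node_sizes_gb if size >= requested_gb]
    waste = {size: wasted_fraction(requested_gb, size) for size in fitting}
    best = min(waste.values())
    return [size for size, fraction in waste.items() if fraction == best]


sizes = [4, 16, 64, 128, 256, 512]
print(least_waste_candidates(3, sizes))  # a pod with a request has one clear winner
tied = least_waste_candidates(0, sizes)  # a pod with no request ties on every group
print(tied)
print(random.choice(tied))               # the tie is broken at random
```

With a 3 GB request only the 4 GB group wins. With a request of 0, every group wastes 100% of its memory, all six tie, and the random pick can just as easily land on the 512 GB group.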

Problem – wasted capacity

Kubernetes schedules pods onto nodes that currently have capacity before falling back on the cluster autoscaler. Once a pod vacates a node, it takes 10 minutes before the node is torn down. On active clusters, I’ve often seen small pods (4 GB) accidentally get scheduled on large nodes (512 GB). The data scientists think they’re consuming 4 GB of RAM, so they’re happy leaving this pod up forever. Congratulations! Your $30/month pod now costs $3000/month.

Solution – being specific about ASGs.

Data science workloads are almost always memory bound and should monopolize the entire instance. They are also very spiky, so autoscaling in Kubernetes is very important. At Saturn, we solve this problem by explicitly setting the node selector of every pod, so that it can only be scheduled on the correct ASG.
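A pod pinned to one node group might look like the following sketch, written as a Python dictionary that can be dumped to JSON and applied with `kubectl`. The label key `example.com/node-group`, its value, the pod name, and the image are all placeholders; the nodes in the matching ASG would need to carry the same label for the selector to match.

```python
import json


def pinned_pod(name: str, image: str, memory_gb: int, node_group: str) -> dict:
    """Build a Pod manifest that can only be scheduled on nodes labeled with node_group."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # Hypothetical label key; nodes in the target ASG must carry this label.
            "nodeSelector": {"example.com/node-group": node_group},
            "containers": [
                {
                    "name": name,
                    "image": image,
                    # An explicit request means the pod never asks for 0 memory.
                    "resources": {"requests": {"memory": f"{memory_gb}Gi"}},
                }
            ],
        },
    }


pod = pinned_pod("jupyter-alice", "jupyter/base-notebook", 3, "mem-4gb")
print(json.dumps(pod, indent=2))
```

Because the selector names a single group, the scheduler cannot place the pod on a larger idle node, and the autoscaler has only one group it can scale up for it.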

Instance configuration

This section assumes you are configuring EKS with terraform.

Should users be able to build docker images? This might be necessary if you want to do things like run binderhub. At Saturn, we allow certain pods to build docker images because we want to enable our docker image building service from standard data science configuration files. If so, you’ll want to configure your ASGs with:

    bootstrap_extra_args = "--enable-docker-bridge true"

But not for GPU machines!

The docker bridge does not work on GPU nodes, so you should set

    bootstrap_extra_args = "--enable-docker-bridge false"

terraform-aws defaults ASGs to use the CPU AMI. If you are provisioning GPU nodes, you will have to find the GPU AMI ID for your region, and pass that in to terraform. Here is the Python code we use to do this:

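import boto3


def get_gpu_ami(region):
    """Return the ID of the most recently created EKS GPU AMI in the given region."""
    eks_owner = "602401143452"  # account that owns the EKS AMIs
    client = boto3.client("ec2", region_name=region)
    # Owners must be a list of account IDs, not a bare string
    images = client.describe_images(Owners=[eks_owner])["Images"]
    ami_prefix = "amazon-eks-gpu-node-"
    images = [x for x in images if x["Name"].startswith(ami_prefix)]
    # CreationDate is an ISO 8601 string, so sorting it as text is chronological
    images = sorted(images, key=lambda x: x["CreationDate"])
    image = images[-1]
    return image["ImageId"]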

If you configure the wrong AMI, your nodes will come up like normal, but your expensive GPUs will not be available inside Kubernetes pods.

Don’t forget the cluster-autoscaler.

You need to tag ASGs to indicate how many GPUs are available, or else the autoscaler won’t know that your deep learning pod which needs 4 V100s can be scheduled on this box.

          key = ""
          value = "4"
          propagate_at_launch = true

What Now?

If you’ve read this far you must be thinking about setting up your own Kubernetes cluster for your data science team. I just have one question for you.

Is this really the best use of your time?

Your company can probably jury-rig a Kubernetes deployment, and set up a bunch of ASGs, and deploy jupyterhub, and then figure out how to deploy Dask clusters, and then make sure that the dependencies you need are built into docker images that work well for both Jupyter and Dask, and then make sure you configure EBS volumes properly so that data scientists don’t lose their work, and make sure you configure Calico, and nginx-ingress, and the load balancer so that traffic into your applications is secure. Are you planning to deploy dashboards? Or models? Is there any sensitive information there? Do you need to provide access control and security around them? Don’t forget to set up jupyter-server-proxy, because data scientists working in Jupyter or JupyterLab will need to use the Dask dashboard and will want to experiment with bokeh/plotly/voila as well, right? Did you remember to bind one Dask worker per GPU? Are you configuring RMM, and do you have proper GPU to host memory spillover configured?

Is this really the best use of your time?

Given all the complexity involved, I’d recommend renting our software rather than building your own.
