Docker for Data Scientists

How to use Docker to host a dashboard, API, and more
Want to learn a lot more about using Docker for data science? Check out this in-depth presentation on the topic.

There are many ways to deploy data science applications like dashboards or APIs, which are helpfully enumerated in my earlier blog post. However, one of the downsides of deploying code is you have to set up dependencies, install the right version of the programming language (like R or Python), and ensure the environment is exactly how you want it. You may have to do this over and over as you switch between developing locally or on the cloud, or if you set up different applications for different users.

For example, if you have a new dashboard, you’ll have to create a new virtual machine, configure all the cloud settings again, install all the software again, and start the process over. This can take hours. Worse, you may lose track of which particular packages you installed and might not be able to reproduce your deployed code later. It can become a huge hassle to keep track of what is installed in which virtual machine. Thankfully, there is a much cleaner way to manage computing environments.

Docker is a framework for creating and shipping applications. Rather than dealing with virtual machines that are hard to keep track of and are disconnected from each other, you instead use a system that is more scalable. It relies on three concepts:

  • An image is a snapshot of a computing environment. It includes an operating system, plus whatever software, files, and environment settings you apply to it. For instance, you may have an image that contains Linux, Python, the library FastAPI, the Python code for your particular API, and the run command for how to start it. Importantly, you can make a new image by modifying an existing one. So rather than creating an entire image from scratch for your API, you can start from an existing image that already has Linux, Python, and FastAPI on it. At that point all you have to add is the Python code you wrote for your particular API, not all the requisite software. In fact, that FastAPI image is itself based on a different image that only has Linux and Python. It’s images all the way down.
  • A Dockerfile is a text file that specifies how to create an image. It can contain several different commands, including:
    • FROM - this declares which image to use as the starting point
    • COPY - copy a file into the docker image
    • RUN - run a command within the image, such as downloading and installing packages
    • ENV - set an environment variable in the image
    • ENTRYPOINT - the command to run, such as starting a FastAPI server or running a Shiny dashboard
    All together this gets put into a file named Dockerfile that you store with your code. You can then create the image with the docker build command. This makes it much easier to keep track of what code is in what image: you just keep a Dockerfile in your code base specifying how to build the image, and at any time you can refer back to that file. You can also build all of your application images on a single shared base image, so if you have to update package versions it is much easier to do it all at once.
  • A container is a running image. Containers can be started or stopped like virtual machines and connected to externally. There are many different platforms to run them on, from your local machine to a Kubernetes cluster to a cloud service designed for Docker containers like GCP Cloud Run. If you ever stop a container, you can just take the same image and start a new one, so you don’t have to worry about losing how it was configured. You can also pass your images to a different team at your company (like a DevOps team) and let them run the containers for you.

So, by creating Docker images and then running the containers, you get the same benefits as using virtual machines, only in a much more reproducible and scalable way.

Steps to using Docker

Before following these steps, you’ll first need the code you want to deploy! This can be a dashboard (like Shiny in R or Dash in Python), an API (like Plumber in R or FastAPI in Python), or a script you want to run when the Docker container starts. For strategies on what sort of code to deploy, see this separate blog post.

You’ll also need Docker installed on your development machine, which is free for personal use! Once installed, you’ll need to make sure it’s started before following these steps.

Creating a Dockerfile

First, you’ll need a Dockerfile that describes how your Docker image should be built. This will list the starting image to use, the steps to modify it, and the command to execute when the container starts. It should be a text file named Dockerfile (no extension). There are lots of commands you might want to put into a Dockerfile, but here are a few of the common ones:

  • FROM {xyz} (where {xyz} is another image). The FROM command specifies which image to start from. There are lots of images you could build from, including many public ones available on Docker Hub. For example, FROM python:3.10.5-bullseye will start with a Python image from Docker Hub that has Debian 11 (“bullseye”) and Python 3.10.5 installed. You might want to use an image already set up with FastAPI and Python, R and Shiny, or other configurations. The more you have pre-installed, the less you have to install yourself.
  • RUN {xyz} (where {xyz} is a Linux bash command). The RUN command lets you run arbitrary Linux commands. This is useful for downloading and installing software with commands like apt-get install, wget, and the like. You can use the bash && operator to chain multiple commands into a single RUN statement, and \ to have commands span multiple lines.
  • COPY {xyz} {location} (where {xyz} is a file or folder and {location} is a location in the image). The COPY command lets you copy files into the Docker image. You’ll need to copy the code you want to run into the image, as well as configuration files, data, and things of the like.
  • ENTRYPOINT ["{command}"] (where {command} is the command you want to run). The ENTRYPOINT statement declares what will happen the moment a container starts. You’ll want to use this to start your dashboard or API, or begin running your script. If your command takes arguments, add each one as its own quoted string, separated by commas, within the [].

All together, a simple Dockerfile might look something like this:

# start from an image with Python on Debian (bullseye)
FROM python:3.10.5-bullseye

# install some linux libraries needed
RUN apt-get update -qq && apt-get install -y \
  libssl-dev \
  libcurl4-gnutls-dev

# install some Python libraries
RUN pip install "fastapi[all]"

# copy everything from the current directory into /app in the image
WORKDIR /app
COPY . .

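# when the container starts, serve the FastAPI app with uvicorn;
# --host 0.0.0.0 makes it reachable from outside the container
ENTRYPOINT ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]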
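The ENTRYPOINT above points uvicorn at main:app, meaning a file called main.py that exposes an object named app. The post doesn't show that file, so here is a minimal stand-in for checking that the container plumbing works before you drop in your real API. It is a sketch under assumptions: it uses a bare ASGI callable (which uvicorn can serve) rather than FastAPI so that it has no dependencies, and the JSON payload is made up for illustration. In a real project, app would typically be a FastAPI() instance with your own routes.

```python
# main.py - hypothetical stand-in for your real API, not the author's code
import json


async def app(scope, receive, send):
    """Minimal ASGI application: answers every HTTP request with a JSON body."""
    if scope["type"] != "http":
        return  # this sketch ignores anything that is not an HTTP request

    # Illustrative payload only; a real API would return model output, data, etc.
    body = json.dumps({"status": "ok", "path": scope["path"]}).encode("utf-8")

    await send({
        "type": "http.response.start",
        "status": 200,
        "headers": [(b"content-type", b"application/json")],
    })
    await send({"type": "http.response.body", "body": body})
```

Place the file next to the Dockerfile so the COPY step picks it up.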

Build the Docker image

To build the image, navigate to the directory with your code and Dockerfile and run the command:

docker build -t {image-name} .

Here {image-name} is what you want to call the image, and the trailing . tells Docker to use the current directory as the build context (where it looks for the Dockerfile and the files to COPY). This may take several minutes as it will need to download the base image you’re starting from (if not already downloaded), then run the steps in the Dockerfile. Note that if you change a step in a Dockerfile and rerun the build, that step and every step after it will have to run again, while the earlier steps are reused from the cache.
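To make the rebuild rule concrete, here is a toy model of it in Python. This is only an illustration, not how Docker is implemented: it compares the text of each step, whereas Docker also takes the contents of copied files into account for COPY steps. The function name steps_to_rerun and the example steps are made up for this sketch.

```python
def steps_to_rerun(old_steps, new_steps):
    """Return the steps of new_steps that a rebuild would execute again.

    Simplified model: a step is reused only if it and every step before it
    are textually identical to the previous build.
    """
    for index, step in enumerate(new_steps):
        if index >= len(old_steps) or old_steps[index] != step:
            # First difference found: this step and everything after it reruns
            return new_steps[index:]
    return []


old = [
    "FROM python:3.10.5-bullseye",
    "RUN apt-get update -qq && apt-get install -y libssl-dev",
    'RUN pip install "fastapi[all]"',
    "COPY . .",
]

# Same Dockerfile, but with pandas added to the pip install step
new = list(old)
new[2] = 'RUN pip install "fastapi[all]" pandas'

for step in steps_to_rerun(old, new):
    print("rebuild:", step)
```

In this example the FROM and apt-get steps are reused, while the pip install and COPY steps run again. That is why it usually pays to put slow, rarely changing steps (system and Python packages) near the top of the Dockerfile and the COPY of your frequently edited code near the bottom.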

Once the build is complete your image is ready to use!

Test the image locally

Test that the image works on your local machine by using docker run -it --rm -p {port}:{port} {image-name}, where {port} is the port your application listens on (this isn’t required if you’re only running a script). For the uvicorn example above, that is 8000 unless you pass a different --port.

Then, if you go to http://127.0.0.1:{port} in your browser, or call it from a tool like Postman or a command like curl, you should connect to your application running locally!
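If you would rather script this check than open a browser, a small polling helper can wait for the container to come up. The sketch below uses only the Python standard library; the function name wait_until_up and its default timings are assumptions chosen for illustration.

```python
import time
import urllib.error
import urllib.request

# Skip any configured HTTP proxy, since the target is on the local machine
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def wait_until_up(url, timeout_seconds=30.0, interval_seconds=1.0):
    """Poll url until it answers with any HTTP response or the timeout passes.

    Returns the HTTP status code, or None if nothing answered in time.
    """
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        try:
            with _opener.open(url, timeout=5) as response:
                return response.status
        except urllib.error.HTTPError as error:
            # The server answered, just not with a 2xx code (e.g. 404 on "/")
            return error.code
        except (urllib.error.URLError, OSError):
            # Nothing is listening yet; wait and retry
            time.sleep(interval_seconds)
    return None
```

With the container started on port 8000 (an assumed port; use whatever you passed to -p), calling wait_until_up("http://127.0.0.1:8000/") returns the status code once the application responds, or None if it never does.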

Deploy your Docker image

Once you have your Docker image running locally, which can be very exciting, you’ll still need to figure out how to run it somewhere more long-term and stable, like the cloud. To deploy your code on the cloud, you have a number of options, including:

  1. Run your container with Docker on a virtual machine. There is nothing stopping you from creating a cloud virtual machine, installing Docker, and then using the same docker run command you used to test the image locally. You will get the reproducibility benefits of Docker, but you will still be using a single virtual machine. This is the most straightforward way to run a Docker container on the cloud. Once you have your container running, you should be able to connect to http://{ip-address}:{port}, where {ip-address} is the address of your virtual machine.
  2. Run your container on Kubernetes. Kubernetes is a tool for having many machines run containers for you, scaling up and down as traffic changes. This is how most enterprises use containers, and it provides great benefits for running many containers at scale. AWS, GCP, and Azure all have different versions of managed Kubernetes, where you pay them and they run a Kubernetes cluster for you. The downside is that to use Kubernetes you’ll have to learn an additional tool besides Docker. Managing and using a Kubernetes cluster is outside the scope of this tutorial, but there are many guides online, and if you are in a corporate environment there might already be a Kubernetes cluster you can use.
  3. Run your container on GCP Cloud Run. Cloud Run is an extremely useful tool where you can just upload your image and it will deploy the image for you. This is a great way to host containers without the hassle of having to set up a virtual machine or Kubernetes. Also, Cloud Run is clever enough that if your endpoint isn’t getting traffic it will stop your resources, and if it’s getting a lot of traffic it will increase the number of containers. To use it, you (1) create a GCP account, (2) create an Artifact Registry to store the images, then (3) use the docker push command to push your image to the registry. Finally, you (4) create the Cloud Run instance for your application. All of this can be done in an hour or two! Once you’ve set it up, you should be able to connect to your application by using the address provided in the Cloud Run UI.
  4. Use your images on Saturn Cloud. Saturn Cloud is a data science platform that lets you write and deploy code in the cloud. Saturn Cloud has resources for development that include JupyterLab or RStudio, as well as the ability to have code run continuously or on a schedule. Saturn Cloud resources all start with Docker images. You can upload your own Docker images and use those as development environments, or deploy them directly. Using Saturn Cloud also gives you the advantage that you can install additional components after the container starts. So for instance you can make a generic image that has the required libraries and dependencies, then at container start time download whatever git repo has the code you want to run. This will let you have many different deployments without having to make a separate image for each.

And there you have it! With Docker you can keep your code reproducible, easily share it with other members of your organization like engineering teams, and spin it up as needed.