Docker for Data Scientists

How to use Docker to have your data science code run anywhere

When I was a data scientist early in my career, I was shocked by how challenging one particular problem was. No, it wasn’t cleaning my data or getting my model to converge; it was just getting my code to run somewhere else. If I had a cool interactive dashboard I wanted to show someone, or a model I wanted them to try, I had to go through an enormous number of steps to get it to work for the other person. I wish I could have just:

  • Bundled my code in some sort of “app.exe” file to send to them to run
  • Put my code on some sort of web page that they could view at any time
  • Handed my code to the engineers at my company (who weren’t data scientists) and let them figure out what to do with it

In the year 2022, this is still a very real problem for many data scientists! There’s often no clear way to ship the exact operating system you’re using, bundle the particular version of your programming language (R, Python, whatever), and add the particular packages, Linux libraries, and other dependencies of your code so that other people can use it. Or is there?

You may have guessed it: Docker is an easy tool for data scientists to let other people use their code.

Docker lets you bundle your code together with the operating system, packages, and other dependencies it needs, so that anyone can run it anywhere. You can understand how Docker works with just three concepts:

  • A Docker image is a snapshot of a particular computing environment. An image is like a full representation of a computer, so you can take an image, run it somewhere, and know exactly what will happen. An image can contain an operating system, any programs you want installed, any files you want included, environment variables, and so on. You can also specify the command you want to run when the snapshot is turned on.
  • A Dockerfile is the instructions used to create an image. To make an image, you start with an existing image and add more programs, files, or