Docker for Data Scientists

When I was a data scientist early in my career, I was shocked by how challenging one particular problem was. No, it wasn't cleaning my data or getting my model to converge; it was simply getting my code to run somewhere else. If I had a cool interactive dashboard I wanted to show someone, or a model I wanted them to try using, I had to go through an enormous number of steps to get it to work for the other person. I wish I could have just:
- Bundled my code in some sort of “app.exe” file to send to them to run
- Put my code on some sort of web page that they could view at any time
- Handed my code to the engineers at my company (who weren’t data scientists) to figure out what to do with it
In the year 2022, this is still a very real problem for many data scientists! It’s often not clear how to deploy the exact operating system you’re using, bundle the particular version of the programming language you’re using (R, Python, whatever), and add the particular packages, Linux libraries, and other dependencies of your code so that other people can use it. Or is there a way?
You may have guessed it: Docker is an easy tool for data scientists to let other people use their code.
When data scientists deploy code
In practice, for data scientists there are many situations where you might want to have someone else use the code you’ve made. Some especially important ones are:
- Creating dashboards for users to view and interact with your model and analysis.
- Making APIs for engineers and other data scientists to run your model in real time.
- Scheduling scripts that should run at fixed times like bulk data processes.
Beyond that, there is the case of reproducibility: at any time another data scientist might need to continue the work you’ve started, and they’ll need to be able to run it. Without clear guidance it can be very difficult to recreate the exact programming environment you used for an analysis or model training. This case won’t be addressed in this article, but the principles are the same.
You may want to share your work just with people within your company, as part of the product your company is building, or even with the general public.
Docker lets you bundle your code together with the operating system, packages, and other dependencies it needs, so that anyone can run it anywhere. You can understand how Docker works with just three concepts:
- A Docker image is a snapshot of a particular computing environment. An image is like a full representation of a computer, so you can take an image, run it somewhere, and know exactly what will happen. An image can contain an operating system, any programs and files you want installed, environment variables, and things like that. You can also specify the command you want to run when the snapshot is started.
- A Dockerfile is the instructions used to create an image. To make an image, you start with an existing image and add more programs, files, or