Sharing Your Data Science Work
One of the surprising challenges for data scientists is figuring out how to deploy your code so that other people can use it. You may want your machine learning models to run in a way that business people can interact with. Or you might want to write models that engineers can call. While it can be easy to get data science code running on your local machine, code that runs locally is usually only available on that machine and will stop running the moment you power it off. Ideally, you want your work to be continuously available to a large group of users besides you. Making your code continuously online and running is what people typically mean by “putting code into production.” But before you have your code running for other people to use, you need to think about the best format to deliver it in.
This blog post discusses the different avenues for getting your data science code ready for others. If you’re interested in how to deploy the code into production, see my blog post on deployments.
So far as I can tell, there are three distinct strategies for having data science code continuously run: dashboards, APIs, and automated scripts. Each of these delivers a different value for a different type of target audience. It’s important to have an understanding of all three methods and their strengths and weaknesses, since different data science projects require different styles of deliverables.
Dashboards for business users
If you have a model or analysis that you want non-data scientists to interact with, a dashboard can be a great way to do so. Typically, these are HTML websites where users can edit variables and mess with sliders and see how your model would react. The data could be static, or it could update live with new data on a schedule. For example, you might have a model that shows customer value scores, and you want to show your business stakeholders how the model is currently scoring the overall customer base. A dashboard could be a nice way to show the aggregated scores to the business, plus let the stakeholders try controlling different parameters of the model.
Typical libraries for building dashboards include Shiny in R and Dash in Python, and there are also standalone products like Tableau and Power BI. Dashboards are often some of the first deliverables data scientists make because they can be so straightforward to use.
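To make the idea concrete, here is a minimal sketch of the calculation that might sit behind a dashboard slider. It deliberately uses only the Python standard library rather than Shiny or Dash, and the customer names, the scores, and the `summarize` and `render_html` functions are all made up for illustration. In a real dashboard, the framework would call something like `summarize` each time a stakeholder moves the slider.

```python
from statistics import mean

# Hypothetical customer value scores; a real dashboard would load these from a database.
SCORES = {"alice": 82.0, "bob": 45.5, "carol": 67.0, "dave": 91.5}


def summarize(scores, threshold):
    """Aggregate the scores for a given 'high value' threshold (the slider value)."""
    high = [name for name, score in scores.items() if score >= threshold]
    return {
        "threshold": threshold,
        "mean_score": round(mean(scores.values()), 2),
        "high_value_count": len(high),
        "high_value_share": round(len(high) / len(scores), 2),
    }


def render_html(summary):
    """Render the summary as a small HTML table, the kind of page a browser would show."""
    rows = "".join(
        f"<tr><th>{key}</th><td>{value}</td></tr>" for key, value in summary.items()
    )
    return f"<table>{rows}</table>"
```

Calling `render_html(summarize(SCORES, 60))` produces the table a stakeholder would see with the threshold set to 60, and calling it again with a different threshold is what “moving the slider” amounts to.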
The benefit of dashboards is that the end user needs no understanding of data science to be able to interact with them. That’s a really big advantage! There are, however, a few issues with dashboards:
- They generally aren’t readable by machines. Only humans can view and interact with them. If you want other systems to be powered by the data in the dashboard, you’ll need to find another way of getting that data to the system.
- They can require a lot of work to develop. It’ll take time working with your business stakeholder to decide what should be on the dashboard, and then you as a data scientist will have to manually create the charts and tables in the dashboard. This can take a lot of effort to develop and debug, which can seriously slow down a data science project.
- They require ongoing maintenance. You’ll need to find a location to host the dashboard so that other people can see it, and then you as a data scientist will be required to ensure that it’s always running. If a connection to the data it loads breaks or a number on your dashboard doesn’t line up with other data in the company, you will have to fix it.
To deploy a dashboard written in R or Python, you typically need to find a machine in your company or on your corporate cloud that can host the code. That machine will need to be open to corporate traffic too, all of which is the topic of the next blog post!
An API that can be read by other systems
If your data science work, like a machine learning model, should be called by other systems, your best bet is to write an API for it. An example would be a model that scores the value of a customer each time they make a purchase. In that situation, the engineering team would want to run the model for the customer the moment they make a purchase. They could do so by passing data to an API you create, which runs the model and returns a score.
When using R, the typical library for writing APIs is plumber. In Python, you have multiple options like Django, Flask, and FastAPI. In practice, an API is extremely similar to a dashboard, only instead of a browser making an HTTP request to your system to view an HTML website, a computer is making an HTTP request to your API to get a response in JSON or another data format.
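As a rough sketch of that request-and-response cycle, the example below uses Python’s built-in `http.server` so it runs without installing anything; in practice you would more likely reach for Flask or FastAPI. The `score_customer` formula, the field names `total_spend` and `n_purchases`, and port 8000 are all assumptions made up for illustration.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def score_customer(total_spend, n_purchases):
    """Hypothetical stand-in for a trained customer value model."""
    return round(0.01 * total_spend + 0.5 * n_purchases, 2)


def handle_payload(payload):
    """Validate the parsed JSON body and return (HTTP status, response body)."""
    try:
        spend = float(payload["total_spend"])
        count = int(payload["n_purchases"])
    except (KeyError, TypeError, ValueError):
        return 400, {"error": "expected numeric total_spend and n_purchases"}
    return 200, {"score": score_customer(spend, count)}


class ScoreHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the JSON body sent by the calling system.
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length) or b"{}")
        except ValueError:
            payload = {}  # malformed JSON falls through to the 400 response
        status, body = handle_payload(payload)
        data = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


def serve(port=8000):
    """Start the API; this call blocks until the process is stopped."""
    HTTPServer(("0.0.0.0", port), ScoreHandler).serve_forever()
```

With `serve()` running, a system written in any language could POST `{"total_spend": 200, "n_purchases": 3}` and get back `{"score": 3.5}`. Keeping the validation in `handle_payload`, separate from the HTTP plumbing, is one way to stop bad input from crashing the service.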
APIs are great because by making an API, you allow your data science code to be used by other systems, including ones written by other teams in other languages. Does the engineering team at your company use Java while your code is in Python? No problem! This is amazing, if you think about it.
The downside of having to make an API is that, again, you’re on the hook for maintaining it. If your data science model breaks when you pass it certain input data, then it’ll probably be on you to fix it. You’ll also need your API to be able to handle scale: if a lot of systems are calling your API at once, you will need to make sure the API doesn’t crash under the load!
Just like a dashboard in R or Python, when you make an API, you’ll need a location to host it. Typically, that’s a system on a corporate computer or cloud. It must be open to HTTP traffic within your corporate network so other systems can connect.
Just run a script on a schedule
Sometimes, the easiest way to have your code run is just to have a system hit start on a script every time the clock strikes a certain hour. For instance, suppose again you have a customer value scoring model. Maybe rather than updating each customer the moment they make a purchase like in the API scenario, you just want to rescore every customer once a month. In that case, you can write a script that runs the model on everyone and set that script to run on the first day of the month. The script could load the data from a database, run the model, then write the scores back to the same database for other systems to use. Scripts can be great for batch tasks that take a lot of time or only need to be run infrequently. If customers’ value scores rarely change, there’s no need to be constantly running a model.
With a script, the way other people at the company interact with the model output is through the data the model saves to a database or a shared location. This means you don’t need to worry about HTTP connections, downtime, or things like that. If your script has an error when running and doesn’t output results, you can often just debug it and rerun it with minimal consequences. This makes a script on a schedule much easier to maintain than the other methods here. And the data your script outputs to a shared location could be used by another team’s dashboard, API, or whatever!
The downside of a script that runs on a schedule is that many business situations need more responsiveness than a script provides. For many customer-facing real-time situations like product recommendation models, natural language models, or image recognition models, you can’t just do the work in a batch. Because scheduling scripts can be so much easier to set up than real-time dashboards or APIs, data science teams that get into the habit of only using scheduled scripts might not build up the capabilities for the other methods of delivering data science code.
To get a script running on a schedule, all you need is a corporate machine or cloud virtual machine that has R or Python installed. You can use the machine’s scheduler (cron on Linux, or Task Scheduler on Windows Server) to run your script when needed. So long as the machine has access to the databases it needs to read from and write to, you should be good to go!
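The monthly rescoring job described above could be sketched roughly like this. It assumes a SQLite database with a `customers` table and writes to a `customer_scores` table; those table and column names, along with the scoring formula, are made up for illustration, and a real job would connect to whatever database your company uses.

```python
import sqlite3


def score(total_spend, n_purchases):
    """Hypothetical stand-in for a trained customer value model."""
    return round(0.01 * total_spend + 0.5 * n_purchases, 2)


def rescore_all(conn):
    """Load every customer, score them, and write the scores back to the same database."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customer_scores ("
        "customer_id INTEGER PRIMARY KEY, score REAL, scored_at TEXT)"
    )
    rows = conn.execute(
        "SELECT customer_id, total_spend, n_purchases FROM customers"
    ).fetchall()
    scored = [(cid, score(spend, n)) for cid, spend, n in rows]
    # INSERT OR REPLACE overwrites last month's score, so rerunning after a failure is safe.
    conn.executemany(
        "INSERT OR REPLACE INTO customer_scores (customer_id, score, scored_at) "
        "VALUES (?, ?, datetime('now'))",
        scored,
    )
    conn.commit()
    return len(scored)


if __name__ == "__main__":
    # Demo on an in-memory database so the sketch runs on its own.
    demo = sqlite3.connect(":memory:")
    demo.execute(
        "CREATE TABLE customers ("
        "customer_id INTEGER PRIMARY KEY, total_spend REAL, n_purchases INTEGER)"
    )
    demo.executemany(
        "INSERT INTO customers VALUES (?, ?, ?)", [(1, 200.0, 3), (2, 50.0, 1)]
    )
    print(rescore_all(demo), "customers rescored")
```

On Linux, a crontab entry along the lines of `0 0 1 * * /usr/bin/python3 /path/to/rescore.py` (both paths are placeholders) would run the script at midnight on the first day of every month.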
These three methods for getting your data science code into a format you can deploy are all extremely useful in different situations! Whether your end user is a stakeholder or an engineering system, and how responsive your model needs to be, can dramatically alter how the model should be delivered. But this article only covered the different formats your code could take, not how to actually deploy it! For more information on that, see the next blog post in this series.