An Introduction to Data Science Platforms
A data science platform is a set of centralized tools for data scientists to do their work, and it can be immensely valuable to a data science organization. A platform provides the infrastructure for data scientists to run code, train models, and deploy APIs, and it can spare each data scientist from having to set up a programming environment manually. Some examples of data science platforms are Saturn Cloud, SageMaker, and Databricks.
At their best, data science platforms can help the team work closely together, use more sophisticated hardware and analyses, and keep work more reproducible. At their worst, they can make the life of a data scientist dramatically harder and require huge allocations of money and time. Yet despite their potential opportunity (and potential catastrophe!), there aren’t many resources on what a data science platform actually is and how to get the most from one. This post will provide an introduction to data science platforms and guidance on what to look for when transitioning your team to one.
Setting the stage for data science platforms
The work of a data scientist involves tasks like cleaning data, training models, and creating analyses. A decade ago, those tasks would primarily be done by the data scientist on their local machine, like the laptop or desktop their company assigned to them when they started. The data scientist would install a programming language like R or Python and the particular packages they needed on the machine. They’d do the work, and then they would share the results with the rest of the organization (maybe by email). If the team was lucky, the code itself would be stored in a shared location like GitHub. Otherwise, it would only live on their machine.
Using a local machine, however, has a number of meaningful limitations:
- The hardware is fixed. If it turns out that the data scientist needs more memory, more processing power, or a GPU, then they would have no method to get that hardware. Instead, they would have to adjust their work to meet the limitations of the hardware, like only sampling a small amount of a dataset rather than analyzing the whole thing.
- Dependencies are difficult to manage. While a data scientist may share the code they wrote using a tool like GitHub, that often doesn’t capture exactly what the environment was that the code was running on. The analysis may only work with a particular combination of operating system, operating system dependencies, hardware, and programming language libraries. While language-specific tools like conda and renv can be used to capture some of the environment, it’s difficult to capture everything exactly. This can create situations where past work can’t be reproduced and has to be redone–a huge waste of data scientists' time.
- Local machines aren’t easily backed up and secured. If a data scientist quits their job, it’s nearly impossible to transfer their work off their machine and to someone else. This is because not only are there highly specific dependencies installed, but files and folders are saved within the machine in ways that only make sense to the user. That means when someone leaves, their work will often have to be redone by someone else. There are also security concerns about a laptop with sensitive information on it being stolen, and a laptop can always be dropped in an especially large puddle.
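To make the dependency problem above concrete, here is a minimal Python sketch of recording the environment alongside an analysis. It is an illustration only, not a replacement for conda or renv: the function name `capture_environment` and the shape of the snapshot are assumptions made for this example, and it uses only the Python standard library.

```python
import json
import platform
from importlib import metadata


def capture_environment():
    """Return a snapshot of the interpreter, OS, and installed Python packages."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip broken installs that have no recorded name
            packages[name] = dist.version
    return {
        "python_version": platform.python_version(),
        "implementation": platform.python_implementation(),
        "os": platform.system(),
        "os_release": platform.release(),
        "machine": platform.machine(),
        "packages": dict(sorted(packages.items())),
    }


snapshot = capture_environment()
print(json.dumps({k: v for k, v in snapshot.items() if k != "packages"}, indent=2))
print(f"{len(snapshot['packages'])} Python packages recorded")
```

Note what the snapshot leaves out: operating system libraries, drivers, and hardware details beyond the CPU architecture. That gap is the reason it’s difficult to capture everything exactly with language-level tools alone.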
Given these problems, a natural improvement that data science teams have been making is to shift to cloud resources. Rather than having a data scientist use a fixed local machine on their desk, the data scientist can just spin up a cloud virtual machine on AWS or GCP. These virtual machines act just like a laptop would for a data scientist, but instead of running on a physical box in the data scientist’s office they run in the cloud, and the data scientist connects remotely every time they want to do work. You could grant multiple virtual machines to each data scientist if they have different requirements for different projects.
Cloud virtual machines solve a number of the problems that local machines have. First, it is very easy to adjust the hardware on the virtual machines. If you want to increase the RAM, CPUs, or hard drive space of your machine you can do that in seconds with a few button clicks. They also are potentially more secure. You no longer have to worry about someone losing a laptop on a subway ride, and you can easily take snapshots to back up the machines at points in time.
However, cloud virtual machines do not solve all the problems. They have the exact same issues as laptops when it comes to dependencies: it’s hard to keep track of exactly what’s installed on a virtual machine to ensure work can be shared with colleagues. Some types of hardware, like GPUs, require such different dependencies that to use the hardware you have to create a whole new virtual machine. And as each data scientist creates more virtual machines for their work, it becomes more and more difficult to keep track of what each one is for and what’s installed on it. Cloud virtual machines are also more expensive than local machines, and if they’re left running you can accidentally spend tens of thousands of dollars.
For a sufficiently large data science team, using virtual machines can become total chaos. The number of virtual machines can become vast, and it won’t be clear which ones are supporting vitally important projects and which ones are minor experiments that should be deleted. Often a person in the organization becomes the de facto “cloud parent” who is tasked with wrangling the chaos and helping data scientists get their virtual machines working the way they should. This further adds time and cost.
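To see how forgotten machines add up, here is a small back-of-the-envelope sketch. The hourly rate, machine count, and duration are hypothetical numbers chosen for illustration, not quotes from any cloud provider.

```python
def idle_vm_cost(hourly_rate_usd, machines, days):
    """Cost of machines left running around the clock for a number of days."""
    hours = 24 * days
    return hourly_rate_usd * hours * machines


# Hypothetical inputs: 10 GPU machines at $3.00/hour, forgotten for 90 days.
cost = idle_vm_cost(hourly_rate_usd=3.00, machines=10, days=90)
print(f"${cost:,.2f}")  # prints $64,800.00
```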
The case for data science platforms
And thus, a data science platform is a tool that is meant to provide the power of cloud infrastructure to a data science team without adding chaos or technical complexity. In its ideal form, a data science platform should:
- let a data scientist do a project with precisely the hardware and software they need,
- keep track of exactly the environment that the data scientist is using for reproducibility,
- allow data scientists to share work with each other,
- deploy code as scheduled pipelines, or host it as APIs and dashboards, and
- give the team leader admin privileges to manage the platform and its users.
This is well beyond what you could do with local machines or cloud virtual machines. It’s a full abstraction layer over the hardware and ecosystems that data scientists use, so they no longer have to worry about them. Want to share your work with your coworker? Have them make a clone of the environment you’re using on the platform! Want to publish your work to show executives? Deploy it on the platform as a dashboard. Need to switch from Python to R? No problem, the data science platform is already equipped with environments set up for what you want to do. So much of the menial work of being a data scientist can be removed by a strong platform.
The leader of the data science team gets new tools as well. They can keep track of how many cloud resources data scientists are using, manage who has access to data and secret credentials, and add new users to the platform. They can also see directly what a data scientist is running, whereas with local machines and cloud virtual machines that is opaque.
Because of their administrative tools, data science platforms are less work to manage and can be cheaper than other options. They also provide solutions to other problems of managing a team. If an employee quits, you can transfer their cloud resources to another data scientist. If data scientists are overly eager in their use of ultra-high-powered distributed clusters, you can set strong limits on what can be used.
Altogether, a data science platform solves the problems inherent in using local machines or cloud virtual machines for your data science teams, and also provides useful tools that you may not have realized you wanted.
A sample of some of the data science platforms on the market today.
The issues with using data science platforms
Unfortunately, data science platforms aren’t a silver bullet for data science leaders looking to streamline their infrastructure. Here are some areas you’ll want to think about if you’re considering a data science platform. We, of course, at Saturn Cloud have put a lot of effort into ensuring our platform runs as smoothly as it can–problems are often worse on competing products.
Data science platforms should work with your existing patterns
If a data science platform requires you to change how data scientists do their work, the platform may end up making your team far less efficient. The data scientists will have to change their existing work to run on the platforms, and they may get frustrated if the forced changes require continuous rework or limits to the technologies they use. Here are some real examples of ways some data science platforms force data scientists to change (we won’t mention any names as to which platforms):
- Force Python code to be packaged in highly specific ways before it can run on the platform
- Force data scientists to only use a single branch of a single git repo
- Force data scientists to keep all of their work in a single notebook
- Limit which programming languages can be used on the platform (sorry R and Julia)
- Limit which IDEs the data scientists can use (sorry Visual Studio Code)
If a data scientist who has been happily writing Julia in Visual Studio Code across a bunch of files in multiple git repos finds out they have to use a platform which doesn’t support any of those things, they will revolt. Instead of migrating to your shiny new data science platform, they will continue to use whatever they had before, creating shadow IT you don’t know about.
We, at Saturn Cloud, designed a platform that has essentially no limits to what you can run on it or which IDEs you use. Each resource is running an installation of Linux with whatever software you want installed on it. This means the platform will never be a blocker to a data scientist doing their work.
Data science platforms should not lock you in
Many data science platforms are connected to a particular technology (like Spark in the case of Databricks) or a particular cloud provider (like AWS in the case of SageMaker). At times, that can be beneficial. If your team’s tech stack is designed all around Spark, then having your platform be based on it makes a lot of sense. However, many data science platforms require that you write code that will only run on their platform. If your team ever considers leaving the platform, you would have to rewrite large portions of your code base. That can be catastrophic and could force your team to continue to use a data science platform that no longer works for you.
Saturn Cloud has no lock-in. Our platform is designed around the idea of resources–data science environments that can be started and stopped. On the backend each resource is:
- A Docker image,
- A list of hardware attributes (amount of memory, number of cores, etc.), and
- A startup script.
If you ever wanted to leave Saturn Cloud, you could easily run the Saturn Cloud resources on any Kubernetes cluster.
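As a rough illustration of why those three pieces are portable, the sketch below maps them onto a plain Kubernetes Pod manifest. This is not Saturn Cloud’s actual resource format or export mechanism: the `Resource` class, its field names, the `to_pod_manifest` helper, and the example values are all hypothetical, chosen only to show that an image, a hardware request, and a startup script are things any Kubernetes cluster already understands.

```python
from dataclasses import dataclass
import json


@dataclass
class Resource:
    """Hypothetical stand-in for the three parts of a resource listed above."""
    name: str
    image: str           # Docker image
    memory: str          # hardware attribute, e.g. "16Gi"
    cpu: str             # hardware attribute, e.g. "4"
    startup_script: str  # shell commands to run when the container starts


def to_pod_manifest(resource):
    """Build a plain Kubernetes Pod manifest (as a dict) from a Resource."""
    hardware = {"memory": resource.memory, "cpu": resource.cpu}
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": resource.name},
        "spec": {
            "containers": [
                {
                    "name": resource.name,
                    "image": resource.image,
                    # Run the startup script through a shell inside the image.
                    "command": ["/bin/sh", "-c", resource.startup_script],
                    "resources": {"requests": hardware, "limits": hardware},
                }
            ],
        },
    }


# Example values are made up for illustration.
example = Resource(
    name="analysis-workspace",
    image="python:3.11-slim",
    memory="16Gi",
    cpu="4",
    startup_script="pip install pandas && sleep infinity",
)
manifest = to_pod_manifest(example)
print(json.dumps(manifest, indent=2))
```

Because Kubernetes accepts JSON manifests as well as YAML, the printed output could be saved to a file and submitted with `kubectl apply -f`. A real migration would also need storage, networking, and credentials, which this sketch leaves out.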
Data science platforms should be as useful for development as they are for production
In recent years, many tech companies have been developing new MLOps tools to help deploy machine learning models. These are tools for machine learning tasks like hosting models as APIs, tracking experiments, auto-retraining models, and adding human-in-the-loop feedback to models. Oftentimes, these are bundled into data science platforms. These tools can be invaluable when you are deploying large-scale machine learning models to production environments with high throughput.
Those are all great and useful capabilities! However, a large percentage of data scientists never deploy machine learning models into production environments and instead focus heavily on creating analyses with business insights. Further, the data scientists who do put models into production still need to develop those models, which requires a strong platform to manage that work. Often, the work of creating analyses and developing models is treated as an afterthought compared to deploying machine learning. Choosing a platform that over-indexes on deployments can mean your team ends up with infrastructure that is very expensive yet not very useful.
At Saturn Cloud, we focus heavily on ensuring the development phase of data science is enjoyable on our platform. Our workspace resources let you quickly get going with JupyterLab, RStudio, or any IDE that can connect with SSH. These resources can then be converted into jobs and deployments when you are ready for production. Further, Saturn Cloud is able to connect to tools like Comet for MLOps capabilities as needed.
If your team is hitting the limits of its current infrastructure because of the size of its local machines, is overwhelmed by the complexity of managing virtual machines, or finds itself locked in to a platform it doesn’t like, it might be a good time to consider what else is on the market. Saturn Cloud is a flexible platform that makes managing a data science team’s infrastructure simple, while still being open enough that data scientists can use it for whatever they need.
If you’re interested in learning more, you can try out our Hosted plan in seconds. You can also learn more about our Hosted Organizations plan, which lets a team work on our infrastructure, or our Enterprise plan that you can install within your corporate AWS account.
Leading Data Science Teams
To learn more about best practices for running a data science organization, check out our O'Reilly report: Leading Data Science Teams.