Build or Buy Data Science Tools
We’re kicking off a series of blog posts on setting up data science infrastructure. Infrastructure decisions around data science often fall to the first data scientist hired by the company. The company may have hired a new head of data science or a single data scientist who reports up to the head of analytics. These articles will be written with that audience in mind - people who need functional data science infrastructure and who may not have substantial DevOps knowledge.
Your team is looking at a new data science tool - how do you decide whether to purchase one off the shelf, or build one from scratch? Most articles about this subject are written by a vendor (I’m a vendor too) and are trying to convince you that adopting their product has a much lower TCO (total cost of ownership). I don’t think that’s a useful framework - an off-the-shelf product almost always has a lower TCO, since the cost of developing the product is amortized across all customers.
This article focuses on asking the questions you might not be thinking about, or re-thinking old questions in different ways. We assume that the product you’re considering is a good fit to solve your problems - otherwise, you would not be considering that product in the first place. You should think about:
- Can you live without a solution to this problem?
- What is your real cost to build and maintain the solution?
- How much vendor lock-in is there?
- What type of behavior does the product encourage?
Can you live without a solution?
In reality, the answer to this question is never black and white, but it’s important to understand whether a tool is a must-have or a nice-to-have. You should never build a solution for a nice-to-have. You should be willing to pay more for a must-have.
Let’s use Saturn Cloud as an example. If your data is large, then you can’t do data science on your laptop. Solving this problem becomes a must-have. If deploying jobs and dashboards is part of your business function, solving this problem is also a must-have. If your data is small, and you’re happy working on your laptop, then Saturn Cloud becomes a nice-to-have. It’s important to understand the severity of your problem first, without thinking about specific vendors.
How expensive is this to build?
This is the most deceptive question. Developers always think that it (whatever it is) can be built on a weekend.
- You could probably solve your problem on a weekend. Your built-in-a-weekend solution is really a prototype, so think about how much you want to rely on that as a must-have.
- The last mile - getting all the details right to go from a prototype to a real solution - consumes much more time than the first mile.
- Don’t forget about maintenance. Your weekend prototype is going to need care and feeding for the rest of its existence.
- If you’re not building this tool, what other cooler, more useful stuff could you be doing with your time?
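To make those points concrete, here is a minimal back-of-the-envelope sketch in Python. Every number in it is a made-up assumption for illustration (the last-mile multiplier, the maintenance load, the engineer-day cost, and the vendor price are all hypothetical), so plug in your own estimates rather than trusting these.

```python
from dataclasses import dataclass


@dataclass
class BuildEstimate:
    """Rough inputs for a build estimate. All values are hypothetical."""
    prototype_days: float              # the "weekend" prototype
    last_mile_multiplier: float        # extra hardening effort, as a multiple of the prototype
    maintenance_days_per_month: float  # ongoing care and feeding
    daily_cost: float                  # loaded cost of one engineer-day
    months: int                        # how long you expect to rely on the tool


def build_cost(e: BuildEstimate) -> float:
    """Cost of building the tool and maintaining it over the whole horizon."""
    build_days = e.prototype_days * (1 + e.last_mile_multiplier)
    maintenance_days = e.maintenance_days_per_month * e.months
    return (build_days + maintenance_days) * e.daily_cost


def buy_cost(monthly_price: float, months: int) -> float:
    """Cost of a subscription over the same horizon."""
    return monthly_price * months


# Illustrative numbers only: a 2-day prototype, a last mile 9x as long,
# 2 maintenance days per month, 800 per engineer-day, over 24 months.
estimate = BuildEstimate(
    prototype_days=2,
    last_mile_multiplier=9,
    maintenance_days_per_month=2,
    daily_cost=800,
    months=24,
)

print("build:", build_cost(estimate))
print("buy:  ", buy_cost(monthly_price=1500, months=estimate.months))
```

The exact figures matter less than the shape of the result: with these assumptions, the weekend prototype is a small fraction of the total, and maintenance dominates. The daily cost is also a floor, since it does not capture the more useful work those engineer-days could have gone to.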
Again, let’s use Saturn Cloud as an example. Many Saturn Cloud customers have, in the past, set up JupyterHub on Kubernetes, run Dask clusters on EMR, and spun up EC2 instances manually for data scientists. The work that went into the last mile and maintenance is what convinced these customers to use Saturn Cloud.
- Reliability is a significant issue with home-built solutions. Most solutions work well as a prototype, but once data scientists are blowing up memory limits and doing other terrible things, reliability suffers.
- Maintenance and support for homegrown solutions become exceedingly difficult, especially ensuring that data scientists have the flexibility to install all the packages they need, without accidentally breaking their environments.
What about vendor lock-in?
Data science is heavily focused on R&D, which means there are many unknowns and your needs are likely to change in the future.
The amount of vendor lock-in you face with a product or tool should count significantly against that vendor. Let’s look at a few concrete examples.
Saturn Cloud: Saturn Cloud was designed with this metric in mind, so we do well here (meaning lock-in is low). Everything that runs in Saturn Cloud runs from your Docker images, hosted in your ECR, running code from your git repositories. If you can run it in Saturn Cloud, it should be very easy to run elsewhere.
Databricks: Databricks is probably the worst offender. If you’ve got all of your code defined in Databricks notebooks, some of which are sourcing other Databricks notebooks, it’s going to be challenging to run that code anywhere else.
SageMaker: If you’re only using Jupyter notebooks in SageMaker and you’ve managed to build and load your own Docker images into SageMaker, it will probably be easy to run your workflows outside of SageMaker. However, if you rely on the SageMaker image, or if you use any of the SageMaker libraries, good luck training your model outside of SageMaker.
What type of behavior does a product encourage?
Any tool you incorporate is going to encourage certain behaviors. Ideally, it encourages the behaviors you want to see in your team. Here are two examples where Saturn Cloud has succeeded in some ways and failed in others.
Does Saturn Cloud encourage software development best practices?
Most tools that focus on notebooks encourage code to be developed in notebooks. This can be bad because notebooks don’t encourage software development best practices. How many notebooks have you worked with that are copy-pastes of one another?
Saturn Cloud partially succeeds in encouraging software development best practices. Saturn Cloud makes JupyterLab available, and it has SSH integration that lets data scientists write code in PyCharm and VS Code, which does encourage software development best practices. However, Jupyter notebooks are still the easiest thing to use when getting started with Saturn Cloud.
Does Saturn Cloud encourage containerized workloads?
Once you start running in Docker containers, reproducibility becomes much easier. Saturn Cloud partially succeeds in encouraging containerized workloads because everything that runs in Saturn Cloud runs in a container. Saturn Cloud has an image-building tool to help data scientists build and customize their Docker containers. Many Saturn Cloud customers are building containers in their CI systems that are automatically pushed to Saturn Cloud.
However, Saturn Cloud also allows for customization on container startup, which is typically used for installing additional packages. This is especially convenient because it makes sure you don’t need to build a whole new container to make a small change, but reproducibility definitely takes a hit.
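One way to limit that reproducibility hit is to insist that anything installed at startup is pinned to an exact version. The sketch below is a generic check, not a Saturn Cloud feature: the function name, the list of packages, and the versions shown are all hypothetical, and the pattern only recognizes simple `name==version` lines.

```python
import re

# A requirement counts as pinned only if it has the form name==version.
# Extras such as dask[dataframe] are allowed in the name part.
PINNED = re.compile(r"^[A-Za-z0-9_.\-\[\],]+==[^=\s]+$")


def unpinned_requirements(lines):
    """Return the requirement lines that do not pin an exact version."""
    result = []
    for raw in lines:
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue  # skip blank and comment-only lines
        if not PINNED.match(line):
            result.append(line)
    return result


# Hypothetical list of packages installed when a container starts.
startup_packages = [
    "pandas==2.2.2",
    "scikit-learn",
    "dask[dataframe]>=2024.1",
    "# plotting",
    "matplotlib==3.9.0",
]

for requirement in unpinned_requirements(startup_packages):
    print("not pinned:", requirement)
```

A check like this could run in CI or as part of a review. It does not make startup installs as reproducible as a rebuilt image, but it does remove one common source of drift.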
As you stand up a new data science team, you’re going to need to consider purchasing data science products. The data science product landscape is richer than ever. Purchasing a product is almost always cheaper than building a product, but you need to make sure that the amount of vendor lock-in is acceptable, and that the product encourages the behavior you want to encourage in your team.