Deploying Data Pipelines at Saturn Cloud with Dask and Prefect

Let's talk about how we deploy data pipelines on Saturn Cloud, internally at Saturn Cloud. This article covers how we do that and some lessons we've learned. It also assumes that you're already a fan of Prefect and Dask.
Use Dask! But only when you need it
Scaling up should be progressive. The more you scale, the more inherent complexity you deal with. I believe most jobs should be written with Pandas first, then Dask on a local cluster, and finally Dask on a multi-node cluster (if you really need it). The way we do this is by passing in parameters that determine which type of Dask cluster we're using.
from dask.distributed import Client, LocalCluster
from dask_saturn import SaturnCluster
from prefect import Flow, resource_manager

@resource_manager
class LocalDaskResource:
    def setup(self):
        # Single-node Dask cluster running alongside the flow
        self.cluster = LocalCluster(n_workers=1, threads_per_worker=15)
        return Client(self.cluster)

    def cleanup(self, resource):
        self.cluster.close()

@resource_manager
class SaturnDaskResource:
    def __init__(self, teardown=True):
        self.teardown = teardown

    def setup(self):
        # Multi-node Dask cluster managed by Saturn Cloud
        self.cluster = SaturnCluster(n_workers=1, threads_per_worker=15)
        return Client(self.cluster)

    def cleanup(self, resource):
        if self.teardown:
            self.cluster.close()

def make_flow(mode, storage=None):
    if mode == "SaturnCluster":
        def resource():
            return SaturnDaskResource(teardown=False)
    else:
        def resource():
            return LocalDaskResource()

    with Flow("...", storage=storage) as flow:
        with resource():
            ...
    return flow
We wrote this flow to backfill (and keep current) a job that loads usage data from S3 and writes it into Snowflake. When developing the flow we used a multi-node Saturn Dask cluster, but in our deployment, since we're only processing the most recent 24 hours' worth of data, it's much easier to use a LocalCluster.
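In practice that just means passing a different mode into make_flow. Anything other than "SaturnCluster" falls through to the local resource, so the "local" string below is arbitrary:

# Development / full backfill: multi-node Saturn Dask cluster
dev_flow = make_flow("SaturnCluster")

# Scheduled deployment: only the last ~24 hours of data, so a LocalCluster is enough
deployed_flow = make_flow("local")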
Don't force yourself to use notebooks
The data science world has standardized on Jupyter as the IDE of choice for data scientists. This isn't so bad now that JupyterLab has a decent text editor you can use to work on Python scripts and libraries, in addition to working in notebooks.
I love emacs. When writing these flows I worked locally on my laptop, but I connected to Saturn Dask clusters to offload the expensive computations. Since then, I've switched to SSHing into my Jupyter instance so that I can run emacs there (this is also how our VS Code and PyCharm integrations work). I've found that it helps to make my development environment exactly match my production environment, and having a development machine that's more powerful than my laptop has been really nice.
There are many data science platforms out there that focus on notebooks. Notebooks have their place, but they will never completely replace writing code.
Prefect has multiple deployment patterns, so you don't need to limit yourself
We've been building out our Prefect Cloud integration for some time. Our integration provides a Storage object for your flows, registers them with Prefect Cloud, and also shows you all the Saturn Cloud logs for your flows. Your flows will be deployed in a Kubernetes pod with the Prefect Cloud Kubernetes Agent and can also use a Saturn Dask cluster. We do a lot of work to make sure that your Prefect flow runs in a pod that matches up precisely with the Jupyter environment you used to create it, without you needing to configure any of your own infrastructure.
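Roughly, wiring a flow into the integration looks like the sketch below. This is illustrative only, based on our prefect-saturn client library; the project name is made up and the exact names may differ by version.

from prefect_saturn import PrefectCloudIntegration

# Illustrative: attach Saturn-managed storage and run config to an existing flow
integration = PrefectCloudIntegration(prefect_cloud_project_name="pipelines")
flow = integration.register_flow_with_saturn(flow)
flow.register(project_name="pipelines")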
Sounds great, right? Even so, I'm planning on passing on our integration for some of our newer flows.
For a variety of reasons, we have a few flows that need to run every minute, and for those a Kubernetes agent doesn't make sense: spinning pods up and down isn't worth it when the job runs every minute.
Instead, I'm leveraging Saturn deployments. In Saturn we have the ability to run long-running, always-on tasks (these are often used to serve ML models or data science dashboards). I'm running a Saturn deployment that, instead of a web server, runs a Prefect local agent, and I'm labeling both the agent and my flows so the flows run there.
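In Prefect 1.x terms, that amounts to giving the flow a run config with a matching label and starting a local agent inside the Saturn deployment in place of a web server. A minimal sketch (the "minute-pipelines" label is made up):

from prefect.run_configs import LocalRun

# Label the flow so only the always-on local agent picks it up
flow.run_config = LocalRun(labels=["minute-pipelines"])

# The Saturn deployment's command is then just a labeled local agent, e.g.:
#   prefect agent local start --label minute-pipelines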
For my hourly and daily flows, the Kubernetes agent still makes sense, but for something that runs every minute, a long-running local agent is the better fit. Don't restrict yourself to thinking there is only one way to do things: there are multiple types of agents, and you might need more than one of them.
Working with Research and Production
In the past I've struggled with the logistics of working on Prefect Cloud flows. Do I just write them in a notebook? If I write them in a notebook, is that what I use to make production deployments? If I move my flows to Python code, how do I explore them interactively?
I've settled on using click to solve this problem (really, any command-line interface will do).
import click

@click.group()
def cli():
    pass

@cli.command()
def register():
    # Build storage and register the flow with Prefect Cloud
    flow = make_flow(...)
    flow.storage.build()
    flow.register(...)

@cli.command()
@click.option("--mode", default=None)
def run(mode):
    # Run the flow locally, choosing the Dask cluster via --mode
    flow = make_flow(mode)
    flow.run()
- I have a CLI command I can use to register the flow. I call it from my development machine, but I could just as easily trigger it from any CI system.
- There is a separate CLI command I can use to run the flow. This lets me pass in parameters, so I can run it with a LocalCluster (to simulate production) or a Saturn Dask cluster (if I want to run it on the full dataset).
- Since flow creation is encapsulated in a function, if I want to explore the flow interactively I can import that function into a notebook, or just import and run the individual tasks (see the sketch below).
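For example, exploring the flow from a notebook is just an import away (the pipeline module name is hypothetical; it's wherever make_flow lives):

# In a notebook: import the factory and explore the flow interactively
from pipeline import make_flow  # hypothetical module containing make_flow

flow = make_flow("local")  # simulate production on a LocalCluster
state = flow.run()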
Next steps
With that, you have an overview of how we deploy data pipelines on Saturn Cloud and some of the things we've learned. For more on deployments, check out some helpful pages here:
Create and Use Deployments and Jobs
To try this out right away and use Saturn Cloud for free, you can get started here.