Connect to Dask from SageMaker

Directions specific to connecting from AWS SageMaker

At Saturn Cloud, one of our passions is to help users build their productivity and accelerate their machine learning from whatever working environment they prefer. This poses lots of interesting challenges for us, of course, but we really believe in making the experience of our customers as convenient as possible. Many of us are data scientists ourselves, and have struggled with otherwise great tools that just don’t fit the way we work.

As a result, you can use Jupyter Lab in our cloud product, SSH from your IDE into Jupyter Lab, or create and use machine clusters directly from your local IDE, no Jupyter server required.

This last functionality is brilliant, because it opens up so many possibilities for connecting with powerful Dask resource clusters in so many other tools and workspaces. In this post, I’m going to show you how you can combine Saturn Cloud with AWS SageMaker to get all the power of Dask clusters in the SageMaker environment. If you’re a regular SageMaker user, but want to add Dask parallelism to your workflow, read on!

Introduction

If you’re not familiar with Dask or cluster computing, here’s a brief overview. Dask allows parallelization of Python code, including across many machines in clusters.

Dask system diagram

As this diagram illustrates, the pieces in the gray box constitute a machine cluster, and in this example, that’s what will be hosted on Saturn Cloud. Instead of the pink box (the Client) being a Jupyter server also on Saturn Cloud, it will be your SageMaker instance. Your code will be transmitted from SageMaker to the cluster Scheduler, which will distribute tasks to the workers.

Setup

Log in to your SageMaker environment and open a Jupyter instance. For this example, I’m using SageMaker Studio, as shown in the screenshot below.

SageMaker control panel

Inside SageMaker Studio, open a new Notebook, and you’re ready to begin! You’ll be asked to select a kernel, and for this we recommend the “Python 3 (Data Science)” kernel.

SageMaker new notebook

Environment Management

This kernel won’t be complete for our needs, however. Whenever you use our direct machine cluster access functionality, you’ll want to pay attention to the working environments. If your local workspace has a different image, including different packages or versions, than the Saturn resources, you’ll need to resolve that before running Dask code or using your cluster.

To fix this easily, we recommend first checking that your SageMaker notebook has the same versions of certain key libraries as your Saturn Cloud cluster image. You can confirm this once the connection is set up as shown below. These are the libraries that should be installed or upgraded if you use the SageMaker “Python 3 (Data Science)” kernel.

  • pandas: upgrade to 1.2.3 or later
  • dask: install 2.30.0 or later
  • distributed: install 2.30.1 or later
  • dask-saturn: install 0.2.2 or later

pandas will likely be installed, but the version may be quite old in the kernel. Upgrading this is vital for Dask to work well for you.
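
To see which of the libraries above are missing or too old in the current kernel, something like the following can be run in a notebook cell. This is a minimal sketch using only the standard library; the helper names (parse_version, meets_minimum, check_environment) are made up for this example, and the version parsing is deliberately simple, so it ignores pre-release suffixes.

```python
from importlib.metadata import PackageNotFoundError, version

# Minimum versions taken from the list above.
MINIMUM_VERSIONS = {
    "pandas": "1.2.3",
    "dask": "2.30.0",
    "distributed": "2.30.1",
    "dask-saturn": "0.2.2",
}


def parse_version(text):
    """Turn '2.30.1' into (2, 30, 1), stopping at the first non-numeric piece."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """True if the installed version is at least the minimum version."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    # Pad the shorter tuple with zeros so '1.3' compares like '1.3.0'.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_environment(minimums=MINIMUM_VERSIONS):
    """Return {package: problem} for anything missing or older than required."""
    problems = {}
    for name, minimum in minimums.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            problems[name] = f"not installed (need >= {minimum})"
            continue
        if not meets_minimum(installed, minimum):
            problems[name] = f"{installed} installed (need >= {minimum})"
    return problems


if __name__ == "__main__":
    for name, problem in check_environment().items():
        print(f"{name}: {problem}")
```

Anything this prints is a candidate for a pip install or upgrade before you connect to the cluster.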

All of this can be done with pip. In SageMaker Jupyter Notebooks, you can use the %pip magic in regular code chunks to run these commands, so for me, it looks like the first chunk in this screenshot.

SageMaker pip installing

To find out about some conflicts early, you can run client.get_versions(check=True) after you set up your Saturn client object. (I’ll explain that in a moment!) But that check won’t tell you about pandas conflicts, so don’t forget pandas!
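
Since that check skips pandas, one way to compare pandas yourself is to ask every worker for its version and compare against the local one. The sketch below is an assumption about how you might do it, not part of dask-saturn: worker_pandas_version and pandas_mismatches are hypothetical helper names, and the commented usage relies on the client object that is created later in this post, together with the standard client.run() call from distributed.

```python
def worker_pandas_version():
    """Runs on a worker and reports the pandas version installed there."""
    import pandas
    return pandas.__version__


def pandas_mismatches(local_version, worker_versions):
    """Return {worker_address: version} for workers that differ from local."""
    return {
        address: found
        for address, found in worker_versions.items()
        if found != local_version
    }


# Once `client` exists (see "Create Client Object"), usage might look like:
#   import pandas
#   worker_versions = client.run(worker_pandas_version)
#   print(pandas_mismatches(pandas.__version__, worker_versions))
```

An empty result means every worker reports the same pandas version as your notebook kernel.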


Connect to a Saturn Cloud project

If you have not yet created a Saturn Cloud account, go to saturncloud.io and click “Start For Free” in the upper-right corner. It’ll ask you to create a login.

Saturn homepage

Once you have done so, you’ll be brought to the Saturn Cloud projects page. Click “Create Custom Project”.

Saturn creating a new project

Give the project a name (for example, “sagemaker-demo”); you can leave all other settings at their defaults. Then click “Create”.

After the project is created you’ll be brought to that project’s page. At this point you’ll need to retrieve two ID values:

  • project_id - the id for this particular project. You can get this from the URL of the project page. For example: https://app.community.saturnenterprise.io/dash/projects/a753517c0d4b40b598823cb759a83f50 has the project_id: a753517c0d4b40b598823cb759a83f50.
  • user_id - the ID that identifies you as a valid user in Saturn Cloud. Go to https://app.community.saturnenterprise.io/api/user/token and save the page as token.json, then upload that file to the SageMaker Studio workspace. Do not share this file with others.

Protect your user token, as it allows access to your account!
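
If you would rather not copy the project_id out of the URL by hand, a small helper can pull it out for you. This is a sketch that assumes the URL has the same shape as the example above, with the ID as the path segment right after "projects"; project_id_from_url is a hypothetical name used only here.

```python
from urllib.parse import urlparse


def project_id_from_url(url):
    """Return the path segment that follows 'projects' in a project page URL."""
    segments = [part for part in urlparse(url).path.split("/") if part]
    if "projects" not in segments:
        raise ValueError(f"No 'projects' segment in URL: {url}")
    index = segments.index("projects")
    if index + 1 >= len(segments):
        raise ValueError(f"No project id after 'projects' in URL: {url}")
    return segments[index + 1]


project_id = project_id_from_url(
    "https://app.community.saturnenterprise.io/dash/projects/"
    "a753517c0d4b40b598823cb759a83f50"
)
print(project_id)
```

Swap in the URL of your own project page to get the value you need for the connection step.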

You can now load the token inside SageMaker Studio in a notebook, as shown.

# Load the token from the token.json file you uploaded
# (adjust the path if you saved it somewhere else)
import json

with open('token.json') as f:
  data = json.load(f)
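
If the file is missing or was saved incorrectly, the failure only shows up later as a confusing connection error. A slightly more defensive loader, sketched below, fails early instead. It assumes the saved file is a JSON object with a 'token' entry, which is how the token is read later in this post; load_token is a hypothetical helper name.

```python
import json
from pathlib import Path


def load_token(path):
    """Read the Saturn Cloud user token from a saved token file."""
    file = Path(path)
    if not file.is_file():
        raise FileNotFoundError(f"No token file found at {file}")
    data = json.loads(file.read_text())
    # The rest of this walkthrough reads the token from the 'token' key.
    token = data.get("token") if isinstance(data, dict) else None
    if not token:
        raise KeyError("Token file does not contain a 'token' entry")
    return token


# Example usage, with the path adjusted to wherever you uploaded the file:
#   token = load_token("token.json")
```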

Connect to your Project

Now you are ready to connect your SageMaker Studio workspace to your Saturn Cloud project, allowing you to interact with it from this notebook. Your user_id is required (here shown as data['token']), as well as the project_id discussed earlier.

from dask_saturn.external import ExternalConnection
from dask_saturn import SaturnCluster
import dask_saturn
from dask.distributed import Client, progress

# Replace with the project_id from your own project page URL
project_id = 'a753517c0d4b40b598823cb759a83f50'

conn = ExternalConnection(
    project_id=project_id,
    base_url='https://app.community.saturnenterprise.io',
    saturn_token=data['token']
)
conn

#> <dask_saturn.external.ExternalConnection at 0x7f04d067e0d0>

Set Up Cluster

Finally, you are ready to set up a cluster in this project! You’ll see info messages logging here until the cluster is started and ready to use.

If you already have a cluster on the project, this same code starts it up instead of creating a new one. You can also change its size with cluster.scale(). For more details, see our documentation about managing clusters.

cluster = SaturnCluster(
    external_connection=conn,
    n_workers=4,
    worker_size='8xlarge',
    scheduler_size='2xlarge',
    nthreads=32,
    worker_is_spot=False)

Create Client Object

The client object connects our SageMaker environment to the new cluster, and when we display the object, it gives us a link to the Dask Dashboard for that cluster. We can follow this link to watch how the cluster is behaving.

client = Client(cluster)
client.wait_for_workers(4)
client

Created Dask client

Analysis!

At this point, you are able to load data and complete whatever analysis you want. You can monitor the performance of your cluster at the link described earlier, or you can log in to Saturn Cloud and see the Dask dashboard, logs for the cluster workers, and other useful information.
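
As a quick smoke test that work really is running on the cluster, you could send a trivial computation through the client before loading real data. The sketch below is only an illustration: square and sum_of_squares are hypothetical names, and the function expects the client object created above, using the standard client.map() and client.gather() calls from distributed.

```python
def square(x):
    """A tiny task that each worker can run independently."""
    return x * x


def sum_of_squares(client, numbers):
    """Square each number as a separate task, then add up the results locally."""
    futures = client.map(square, numbers)  # one task per input value
    results = client.gather(futures)       # bring the results back to the notebook
    return sum(results)


# With the client from the previous step, usage might look like:
#   total = sum_of_squares(client, range(1000))
#   print(total)
```

While this runs, the Dask Dashboard should show the tasks being spread across your workers.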

You can also connect to Dask from Google Colab, Azure, or anywhere else outside of Saturn Cloud.




Need help, or have more questions? Contact us. We'll be happy to help you and answer your questions!