SCALE PYTHON EFFORTLESSLY

Dask is a free, flexible library for parallel computing in Python. It allows you to work on datasets larger than memory and can dramatically speed up your computations. It was developed in coordination with community projects such as NumPy, pandas, scikit-learn, and Jupyter.

WHY USE DASK?

Familiar Interface

Python has a rich ecosystem of data science libraries, including NumPy for arrays, pandas for DataFrames, xarray for n-dimensional data, and scikit-learn for machine learning. Dask mirrors the APIs of these libraries to make it easy to switch to a distributed alternative.

Flexible

Sometimes your data doesn’t fit neatly in a DataFrame or an array. Or maybe you have already written a whole piece of your pipeline and just want to make it faster. That’s what dask.delayed is for. Wrap any function with the dask.delayed decorator, and calls to it become lazy tasks that Dask can run in parallel.
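A minimal sketch of dask.delayed on ordinary Python functions follows. The functions load and total are hypothetical stand-ins for steps of an existing pipeline.

```python
from dask import delayed


@delayed
def load(n):
    # Stand-in for an expensive loading step (hypothetical).
    return list(range(n))


@delayed
def total(values):
    # Stand-in for a per-chunk computation (hypothetical).
    return sum(values)


# Calling a delayed function records a task instead of running it.
partials = [total(load(n)) for n in (3, 4, 5)]

# Combine the independent partial results into one final task.
grand_total = delayed(sum)(partials)

# Nothing has run yet; compute() executes the graph,
# and the three load/total chains are free to run in parallel.
result = grand_total.compute()
print(result)
```

Because each load/total chain is independent of the others, the scheduler can run them concurrently before the final sum.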

Python-Native

Dask is written in Python for easy integration with Python code, easy troubleshooting, and access to the full PyData stack.

Fast

Write your code to run on a laptop and easily scale it up to run on clusters with thousands of cores. Dask simplifies the big data workflow, and its excellent single-machine performance speeds up the prototyping stage, leading to faster model deployment.
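One way to picture the laptop-to-cluster path is the sketch below, which assumes the optional dask.distributed package is installed. It starts a small in-process LocalCluster as a stand-in for a real cluster; the worker counts are arbitrary, and pointing Client at a remote scheduler address instead would leave the rest of the code unchanged.

```python
from dask.distributed import Client, LocalCluster


def square(x):
    return x * x


# A tiny in-process cluster standing in for a real one (sizes are arbitrary).
# dashboard_address=None disables the diagnostics dashboard for this sketch.
cluster = LocalCluster(
    n_workers=1,
    threads_per_worker=2,
    processes=False,
    dashboard_address=None,
)
client = Client(cluster)

# Submit work to the cluster; each call returns a future immediately.
futures = client.map(square, range(10))

# Futures can be passed to further tasks; result() blocks until done.
total = client.submit(sum, futures).result()
print(total)

client.close()
cluster.close()
```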

Dask is composed of two parts:

Dynamic Task Scheduling

Dask can easily take a set of tasks that would normally run sequentially and distribute them to a collection of workers. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
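To make the scheduling idea concrete, here is a small sketch that builds a handful of independent tasks and hands the same graph to two of Dask's built-in single-machine schedulers. The square function is a made-up placeholder for real work.

```python
from dask import delayed


def square(x):
    # Placeholder for a real unit of work (hypothetical).
    return x * x


# Five independent tasks plus one task that depends on all of them.
tasks = [delayed(square)(n) for n in range(5)]
total = delayed(sum)(tasks)

# The same graph can be executed by different schedulers.
# "threads" distributes tasks across a local thread pool.
threaded = total.compute(scheduler="threads")

# "synchronous" runs tasks one at a time, which is handy for debugging.
sequential = total.compute(scheduler="synchronous")

print(threaded, sequential)
```

The task graph is defined once; only the scheduler argument changes how it is executed.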

"Big Data" Collections

Dask has parallel arrays, data frames, and lists that extend common interfaces like NumPy, pandas, or Python iterators to larger-than-memory or distributed environments.
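As a small sketch of one such collection, the example below uses dask.array, which follows the NumPy interface but splits the data into chunks. The array size and chunk size are arbitrary choices for illustration.

```python
import dask.array as da

# One logical array of a million integers, stored as ten chunks of 100,000.
x = da.arange(1_000_000, chunks=100_000)

# Operations look like NumPy but only build a lazy task graph.
mean_expr = x.mean()

# compute() evaluates the chunks (potentially in parallel) and combines them.
mean = mean_expr.compute()
print(mean)
```

Because each chunk is processed separately, the full array never has to be materialized in memory at once.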

GETTING STARTED

Get started with Dask in minutes. Below, you can find resources on learning about and using Dask.

View the Dask docs

Learn more about Dask in its full documentation.

Train a Model with XGBoost and Dask

Watch guides and examples from Dask's YouTube channel

FREE DASK CHEATSHEET

This cheatsheet will guide you through the basic utilities of Dask

Learn with Dask tutorials

View Dask tutorials on GitHub, along with code samples for common Dask user tasks.

USE DASK ON GPUS

Dask integrates with RAPIDS and XGBoost for GPU-accelerated data analytics and machine learning. Speed up your models by up to 2000x with RAPIDS on Saturn Cloud.

Use RAPIDS on a GPU cluster

Learn how to scale to larger data sizes with multiple GPUs. This exercise uses RAPIDS to speed up a machine learning workload on data that would be too large for a single machine.

Train a Model with XGBoost and Dask

XGBoost is a popular algorithm for supervised learning with tabular data. Learn how to use the XGBoost ML library to train a model on a distributed Dask cluster.

Try Dask on Saturn Cloud