Dask is a free, flexible library for parallel computing in Python. It lets you work on arbitrarily large datasets and dramatically increases the speed of your computations. Dask is composed of two parts:
Dynamic task scheduling optimized for computation. This is similar to Airflow,
, or Make, but optimized for interactive computational
“Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of dynamic task schedulers.
Python has a rich ecosystem of data science libraries including numpy for arrays, pandas for dataframes, xarray for nd-data, and scikit-learn for machine learning. Dask matches those libraries.
Sometimes your data doesn’t fit neatly in a dataframe or an array. Or maybe you have already
written a whole piece of your pipeline and you just want to make it faster. That’s what
dask.delayed is for. Wrap any function with the
and it will run in parallel.
Dask is written in Python and runs Python for people who want to write in Python and troubleshoot in Python, along with access to the full PyData stack.
Scale up and easily run on clusters with 1000s of cores or scale down to run on a laptop in a single process. Dask simplifies the big data workflow and its excellent single-machine performance speeds up the prototyping stage, and leads to faster model deployment.
What makes up Dask
The Dask framework has a number of valuable components, including:
Dask has data collections that use similar API calls to popular types like pandas DataFrames and NumPy ndarrays, however on the backend they are distributed across workers. This allows users to get the speed and memory benefits of distributed computing while using the same programming styles they are used to.
Dask can easily take a set of tasks which normally would have run sequentially and distribute them to a collection of workers. If you have code where tasks could be computed concurrently then you can use Dask to easily parallelize them.
Machine Learning Support
Dask can be used in many ways for machine learning. The Dask ML library provides distributed support for tasks like cross validation and hyperparameter tuning, and libraries like XGBoost and LightGBM can be run directly on a Dask cluster of workers.