Dask

Dask is an open-source tool that makes it easier for data scientists to carry out parallel computing in Python. Through distributed computing and Dask dataframes, it allows you to work with large datasets for both data manipulation and building ML models.

Dask takes away the problem of pandas, which is a python package that cannot handle bigger data than what you can fit into a RAM. Dask makes it easier to work on your data with other python packages e.g Numpy, Scikit-learn, Tensorflow, etc. It is built to help you improve code performance and scale up without having to rewrite your entire code.

Dask’s core advantage is a principle called “lazy evaluation” or “delayed” functions. Delaying a task with Dask can queue up a set of transformations or calculations so that it’s ready to run later, in parallel. This means that Python won’t evaluate the requested computations until explicitly told to. This differs from other kinds of functions, which compute instantly upon being called. Many very common and handy functions are ported to be native in Dask, which means they will be lazy (delayed computation) without you ever having to even ask.

Dask is composed of two parts:

  • A collection API - For parallel lists, arrays, and Dataframes for natively scaling Numpy, Pandas and scikit-learn to run in distributed environments.

Dask’s three parallel collections, namely; Bags, Arrays and DataFrames can each automatically use data partitioned between RAM and disk as well, distributed across multiple nodes in a cluster, depending on resource availability.

  • A task scheduler - It is used for building task graphs and coordination, scheduling and monitoring of tasks optimized for interactive workloads across machines and CPU cores.

Dask’s task scheduler can scale to thousand-node clusters and one scheduler is able to coordinate many workers and move computation to the correct worker thus maintaining a continuous, non-blocking conversation.

Dask delivers a low overhead, low latency and minimal serialization necessary for speed. It enables faster execution of large, multi-dimensional datasets analysis, and accelerates and scales data science pipelines or workflows.

Dask enables applications in time-series analysis, business intelligence and data preparation.

Additional Resources: