Strategies for managing big data

Suppose you have a lot of data (say 500gb worth) that you want to use for data science using your favorite programming language (say Python or R). Since most computers do not have enough RAM to load that data, there’s a devastating problem standing in the way of getting data science work done. In a perfect world, as data scientists we should be able to wave the “make this out-of-memory problem go away” wand and be able to use all the exact same techniques as we use on small data. Unfortunately, that isn’t quite the case. There are however ways to get close to the feeling of having a small dataset and lots of memory. This blog post outlines a few strategies for how to tackle problems when you have large datasets.

Solution 1: buy your way out 💰

One solution to dealing with large data is just to get a larger machine. At Saturn Cloud we have machines available with up to four terabytes of RAM. That is enough RAM to handle most real-world datasets.

This method has two downsides. One, it’s far more expensive to run code on a machine with 4TB of RAM than it is one with only 4GB of ram. That can put this solution out of reach for many applications, including fun side projects. But more importantly, even if you can fit that much data in memory, if your code is single threaded it won’t run any more quickly. So while the can be loaded on a single enormous machine, depending on what method your using it could still take hours or days for your analysis to run.

Solution 2: sample your data 🎲

With sampling, you limit yourself to a smaller pool of data that will fit on your machine. For example, If you are working with a customer base of 100 million users, you could randomly select 5 million to look at. Assuming you take care in selecting the subset of users in an unbiased way, the smaller data should give you insights.

To be frank, I think this often works great. It is rarely the case that there are trends in the full dataset that you can’t also see clearly in only a small subset of it. For instance if you are building a churn model to predict which customers won’t come back, your model won’t need many customers before determining “customer has an angry call with customer support” is a good predictor of churn. Sampling doesn’t work if your data can’t be randomly sampled in a way that preserves the interesting relationships between data points–for instance say you are looking for very unlikely events in the data and so if you sample you won’t have enough of them.

Solution 3: summarize the data 📊

Similar to sampling data, another way you can make it reasonably sized is to aggregate it. For example, if your data is sensor readings taken every millisecond, you might be able to average each sensor by minute and analyze that. Like sampling, there are some datasets where this won’t work because you can’t aggregate without losing the meaning of the data.

Solution 4: analyze your data piecemeal 🔬

Often your data can be split into smaller groups and each one analyzed separately. Say if you have customers across a country to train a model on you could split the customers by region and train models separately. You get around the limits of the size of your machine by only using a small amount of data at a given time. Once you’ve analyzed each group (or in this case region of customers), you can combine the results to an overall final result.

So if your original method was:

  1. Load all the data
  2. Analyze it together
  3. Produce a result

The new method is

  1. For each subset of data
  2. load the subset
  3. analyze it
  4. produce a result
  5. Combine the results into one

Depending on what sort of analysis you’re doing there may be no penalty in accuracy or performance by splitting the data.

This doesn’t make your code any faster, but it does at least allow you to fit it on a smaller machine. The next solution on the other hand…

Solution 5: analyze your data piecemeal–in parallel 🔬🔬🔬

If it’s possible to use solution 4 to split the problem into small chunks you solve individually, but then you run the chunks at the same time producing results far faster. If you have 10 processor threads or separate worker machines you could solve this up to 10x faster! This is referred to as running your code in parallel, and often be straightforward to set up. There are a number of frameworks for running code in parallel for data science. I particularly like:

  • Dask (Python) - Dask is a framework for distributed computing written in and for Python. It has a number of useful methods, but the one I particularly like are delayed functions. By tagging a Python function as @dask.delayed you can run it concurrently across a cluster or multiple processes. Check out this blogpost to learn more about Dask.
  • futures (R) - The R package futures lets you run code in parallel, including across multiple machines. You can use it to have a whole set of tasks execute concurrently. The package {furrr} is a nice wrapper to give you purrr-like map functions in parallel, like future_map(). Check out the futures GitHub page for more info. A runner-up R package for this is {callr}, for spawning new R jobs.

With these methods, you can really move quickly. Provided your data can be chunked up, you can basically have it run as quickly as you need by using more machines. This works less well if the chunks need to communicate results together as that can make executing it complex.

Solution 6: use a framework for large data ▶️

Some techniques do need all of the data to compute and can’t run on smaller chunks at the same time–such as if you want to train an XGBoost model on the entire dataset. Thankfully, lots of work has been done to make methods that can handle a full dataset that is scattered across multiple machines. So rather than having 500GB of data on a single machine training an XGBoost model, instead there may be 100 machines each with 5GB of data and still a single XGBoost model being trained over all of them. These sorts of techniques require the algorithm be written to handle scattered data, but here are a few selections to use:

  • RAPIDS - This Python framework from NVIDIA uses GPUs to train machine learning models quickly, and can use Dask to run on data scattered across multiple machines. Check out our blog post on it.
  • LightGBM - This machine learning model, available both in R and Python, is designed to get accurate results very quickly. For Python, it can utilize Dask to run on data over multiple machines. Read our example of using it with Dask in our docs.
  • XGBoost - This popular machine learning model has a Python backend that uses Dask to train more quickly on large data. See our documentation for an example.

More generally, Dask has data collections like Dask DataFrames that let you feel like you’re using pandas for general data analysis, while on the backend the data is scattered. Overall these methods can be extremely powerful and make for an easier transition from using a single machine to using a cluster of them. The downside is that there are only a limited number of algorithms coded to work this way, and if your use-case doesn’t fall right into them you may have less luck.

📊💰🔬🎲▶️

So there you have it, there are a lot of different ways to handle big data. The right way for you really depends on what you are doing, but either limiting the data so it can work on a single machine, or finding ways to chunk it into problems that can be attacked separately can both be great approaches. If you want to try these methods out, Saturn Cloud is a great platform for working with big data. The free Hosted version lets you use 10 hours of Jupyter notebooks (with GPUs) and 3 hours of Dask. When you switch to Pro you can unlock computing machines with up to 4TB of RAM or many GPUs on a single machine.