Load Data from Local Files

Load and use data stored in files in Saturn Cloud

Flat files are a very common data storage option, and lots of our customers will use them at some time. This tutorial will show you how to load a dataset from a file (for example, a CSV or JSON file) on disk or on a cloud storage space into Saturn Cloud using pandas or Dask.

Before starting this, you should create a Jupyter server resource and start it. See our startup documentation if you don’t know how to do this yet.

Upload Files

If you want to place a flat file in the Saturn Cloud Jupyter server workspace, there’s a simple UI option.

Jupyter Lab interface showing a red arrow pointing to the Upload button

Alternately, you can store your data in a cloud storage tool, such as S3 or Snowflake, and access it directly from there. We have tutorials to help you work with data in those storage tools.

Loading Files

Once files are uploaded into your workspace, you can treat them like local files. For many files, using pandas tools will be perfectly adequate. If your data is too large to work in pandas, Dask tools are also available. For reference, here are some useful sections of pandas and Dask documentation for loading files.

These Dask options work best with one file, or with folders full of files that are formatted exactly alike (same column names, same data types, etc). They make assumptions about the data characteristics because Dask doesn’t read all the data immediately, but creates a pointer to it that allows “delayed” loading later.

If your data doesn’t fit this description, but you have many files to load, check out the next section.

Dask.delayed Options

Data files aren’t always provided in a clean, tabular form that’s readable with a read_* method from pandas or Dask. With dask.delayed functions, we can write a function that processes a single chunk of raw data and then tell Dask to collect these into a Dask DataFrame.

We’ll illustrate that now with the CSV files, but it’s always better to use a dd.read_* method if your data supports it. When you need it though, dask.delayed is very flexible and powerful - chances are you will use it for some of your workloads.

We’ll define a function, load_csv, which will return a pandas DataFrame for a given file path. In here, you could include any number of custom data processing steps that you want done at the time of data loading! Then, call this for the folder full of CSV files, and create a Dask DataFrame with dd.from_delayed.

import pandas as pd
import dask
import dask.dataframe as dd

def load_csv(file):
    df = pd.read_csv(file, mode='rb')

    # We are selecting only the columns all the files have in common
    df = df[['Column1', 'Column3', 'Column5']]
    return df

dfs = []
for f in s3.glob('s3://bucket_name/data-file-*.csv'):
    df = load_csv(f)

df_delayed = dd.from_delayed(dfs)

Need help, or have more questions? Contact us at:

We'll be happy to help you and answer your questions!