Structured Vs Unstructured Data

A comparison between the two different methods for storing data

Data can be broadly thought of as two different types: structured data and unstructured data. Structured data is data that is stored in a set of tables with rows and columns; think of Excel spreadsheets or CSV files. The data may be spread over multiple sheets, but by using indices you can connect the data together. Unstructured data is data that is not stored in a tabular format, meaning it isn’t coerced into a set of tables. A common example of unstructured data is data stored in JSON files: a JSON object can have properties that are themselves JSON objects, creating more of a hierarchical structure. Data can even be a combination of the two, like a table in which each cell is a JSON object. This is called semi-structured data. Structured and unstructured data can be stored in files (like CSV and JSON files, respectively) or in databases (like SQL and MongoDB databases, respectively).

A data analyst or data scientist will often store data in the same format they receive it in, like saving JSON files and CSV files to a hard drive. They may also coerce the data into a different format; for instance, they might take a large set of unstructured JSON files and turn them into a structured SQL table. In this article we will explore structured and unstructured data in detail and talk about how to handle both types of data within Saturn Cloud, a cloud platform for data science.
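
As a rough illustration of that kind of coercion, the sketch below flattens a few JSON documents into a SQL table using only Python's standard library. The documents, the field names (details, date), and the artists table are all made up for this example, and an in-memory SQLite database stands in for whatever SQL database you actually use.

```python
import json
import sqlite3

# Hypothetical JSON documents; the field names are assumptions for illustration.
raw_documents = [
    '{"id": 1, "name": "Taylor Swift", "details": {"date": "2022-01-01"}}',
    '{"id": 2, "name": "Maisie Peters", "details": {"date": "2022-01-01"}}',
    '{"id": 3, "name": "Carly Rae Jepsen"}',
]


def to_row(document: str) -> tuple:
    """Flatten one JSON document into a fixed (id, name, date) row."""
    record = json.loads(document)
    # A missing nested field becomes None, which SQLite stores as NULL.
    date = record.get("details", {}).get("date")
    return (record["id"], record["name"], date)


connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE artists (id INTEGER PRIMARY KEY, name TEXT NOT NULL, date TEXT)"
)
connection.executemany(
    "INSERT INTO artists VALUES (?, ?, ?)",
    [to_row(document) for document in raw_documents],
)

rows = connection.execute("SELECT id, name, date FROM artists ORDER BY id").fetchall()
print(rows)
```

The third document has no date, so the flattening step has to decide what to do with the gap; here it simply stores NULL.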

Example of structured data

Here the data fits into one or more tables with rows and columns.

1 | Taylor Swift     | 2022-01-01
2 | Maisie Peters    | 2022-01-01
3 | Carly Rae Jepsen | 2022-01-01

Example of unstructured data

Here the data has elements with complex properties that change with each data point.

    [
        {
            "name": "Taylor S",
            "id": 1,
            "records": ["Lover", "Reputation", "Folklore"]
        },
        {
            "name": "Tom Hanks",
            "id": 10,
            "movies": ["Forrest Gump", "Castaway"]
        }
    ]

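One way to work with records like these in Python is sketched below. It loads the two example records with the standard json module, works out which fields every record shares and which appear only sometimes, and reads the varying fields defensively. Wrapping the records in a JSON array and treating records and movies as interchangeable lists of works are assumptions made for this sketch.

```python
import json

# The two example records, written here as a JSON array.
text = """
[
  {"name": "Taylor S", "id": 1, "records": ["Lover", "Reputation", "Folklore"]},
  {"name": "Tom Hanks", "id": 10, "movies": ["Forrest Gump", "Castaway"]}
]
"""
people = json.loads(text)

# Fields present in every record versus fields present in only some.
field_sets = [set(person) for person in people]
shared_fields = set.intersection(*field_sets)
optional_fields = set.union(*field_sets) - shared_fields

work_counts = {}
for person in people:
    # .get() returns None instead of raising KeyError when a field is missing.
    works = person.get("records") or person.get("movies") or []
    work_counts[person["name"]] = len(works)

print(sorted(shared_fields), sorted(optional_fields), work_counts)
```

Only name and id can be relied on; everything else has to be checked for each record.
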
The benefits and drawbacks of structured data

There are two main advantages of using structured data. First, structured data requires an explicit definition of how the data is stored: each table has a fixed set of columns, and each column has a type like bool or string. Having a fixed structure makes it much easier to reason about the data. For instance, if you are storing structured customer data, you may know that each customer always has a name and an ID, and that the ID can be used to join to the structured table of sales transactions. The second benefit of structured data is that it’s often much faster to query. Because you are explicitly defining the structure, you can use things like SQL indices and clever keys to store the data efficiently and have your queries run quickly.
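
The customer and sales example can be sketched with Python's built-in sqlite3 module. The table names, column names, and sample rows below are hypothetical, and SQLite is used only because it needs no setup. The sketch shows a schema that guarantees every customer has a name, an index on the join column, and a join on the customer ID.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
connection.executescript("""
    CREATE TABLE customers (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE sales (
        sale_id     INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        amount      REAL NOT NULL
    );
    -- An index on the join column lets the database find a customer's sales directly.
    CREATE INDEX idx_sales_customer ON sales (customer_id);
""")
connection.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ada"), (2, "Grace")])
connection.executemany(
    "INSERT INTO sales (customer_id, amount) VALUES (?, ?)",
    [(1, 20.0), (1, 5.5), (2, 12.0)],
)

# Join the two tables on the customer ID and total the sales per customer.
totals = connection.execute("""
    SELECT c.name, SUM(s.amount)
    FROM customers AS c
    JOIN sales AS s ON s.customer_id = c.id
    GROUP BY c.id
    ORDER BY c.id
""").fetchall()
print(totals)

# The schema rejects a customer without a name.
try:
    connection.execute("INSERT INTO customers (id, name) VALUES (3, NULL)")
    schema_enforced = False
except sqlite3.IntegrityError:
    schema_enforced = True
print(schema_enforced)
```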

The downside of structured data is that it requires a fixed structure. Coming up with a defined schema can be a lot of work for certain datasets, and if the schema changes often then it can be even more of a hassle to maintain. And the more optional fields your data has, the more likely you are to end up with a confusing web of connected tables.

The benefits and drawbacks of unstructured data

With unstructured data you are trading the guarantees a structure provides for freedom from the hassle of defining one. Maybe, for example, you are dealing with data about movies: your unstructured data could allow some movies to have ratings that are a numeric count of stars, and other movies to have ratings that are text strings. Maybe some of your data has critics’ reviews and some doesn’t. It’s all fine! There is no structure. This can be advantageous because you can add data very quickly. You do not need to worry about whether the data you are adding to your data set matches the structure of the previous data.
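
A small sketch of the movie example, using plain Python dictionaries and the standard json module. The titles, ratings, and the critic_reviews field name are invented for illustration.

```python
import json

# Each movie is just a dictionary, so nothing stops the fields from differing.
movies = []
movies.append({"title": "Movie A", "rating": 4.5})
movies.append({"title": "Movie B", "rating": "two thumbs up"})
movies.append({"title": "Movie C", "rating": 3, "critic_reviews": ["Solid."]})

# Every record serializes to JSON and loads back without any schema being declared.
serialized = [json.dumps(movie) for movie in movies]
restored = [json.loads(line) for line in serialized]

# The rating field ends up holding three different types.
rating_types = sorted({type(movie["rating"]).__name__ for movie in restored})
print(rating_types)
```

Nothing had to be defined up front to store these three records, which is exactly the convenience described above.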

Having no structure often backfires at some point in the future when you want to run complex queries on the data. You have no guarantees that your queries will work. Your data might have a name field for some points and a customer_name field for others, so you might not get back the results you want. Also, because there isn’t a structure to your data, it is hard to optimize how it is stored so that it can be queried quickly. With unstructured data you are trading ease of getting your data out of storage for ease of loading it in.
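
The name versus customer_name problem can be shown in a few lines of Python. The records and the NAME_FIELDS list of known aliases are hypothetical; in real data you would first have to discover which spellings exist.

```python
records = [
    {"id": 1, "name": "Ada"},
    {"id": 2, "customer_name": "Grace"},
    {"id": 3, "name": "Linus"},
]

# A naive query silently skips the record that uses the other field name.
naive_names = [record["name"] for record in records if "name" in record]

# Hypothetical list of field names known to hold the customer's name.
NAME_FIELDS = ("name", "customer_name")


def get_name(record):
    """Return the first known name field found in the record, or None."""
    for field in NAME_FIELDS:
        if field in record:
            return record[field]
    return None


all_names = [get_name(record) for record in records]
print(naive_names, all_names)
```

The naive query raises no error; it just returns fewer rows than expected, which is why this kind of problem tends to go unnoticed.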

Using Saturn Cloud for Structured and Unstructured Data

You can choose Saturn Cloud as your computing environment while working with either structured or unstructured data. Here are some examples of using both:

Structured and unstructured data with Snowflake - Snowflake is typically known for its structured data storage, but recently they’ve added more capabilities for unstructured data. We have examples of using Saturn Cloud for both structured and unstructured data.

Distributed structured and unstructured data with Dask - Dask is a powerful distributed framework for Python. Dask supports structured data via Dask DataFrames and unstructured data via Dask Bags.

Check out those examples to get started, or try Saturn Cloud for your general data science work too!
