How to Convert Nested JSON to Pandas DataFrame with Specific Format

This blog will show you how to efficiently convert nested JSON files into a Pandas DataFrame, a vital skill for data scientists and software engineers. Simplify the process of working with complex data structures and achieve a specific format for your data analysis tasks.

As a data scientist or software engineer, you are likely to encounter nested JSON files frequently. While JSON is a popular format for data exchange, it can be challenging to work with when dealing with nested structures. One common task when working with nested JSON files is to convert them into a Pandas DataFrame. In this blog post, we will explore how to convert a nested JSON file into a Pandas DataFrame with a specific format.

What is JSON?

JSON stands for JavaScript Object Notation. It is a lightweight data interchange format that is easy for humans to read and write and easy for machines to parse and generate. JSON is based on a subset of the JavaScript Programming Language and is often used to transmit data between a server and a web application, as an alternative to XML.

JSON is structured as a collection of name/value pairs, where the name is a string and the value can be a string, number, boolean, null, array, or another JSON object. JSON objects can be nested, meaning that a value can be another JSON object or an array of JSON objects.

What is Pandas?

Pandas is a popular data manipulation library for Python. It provides easy-to-use data structures and data analysis tools for handling tabular data, time series, and more. Pandas also provides powerful data cleaning, transformation, and visualization capabilities.

Pandas DataFrames are two-dimensional, size-mutable, tabular data structures with labeled axes (rows and columns). They can be thought of as a spreadsheet or SQL table, with rows and columns of data. DataFrames can be created from a variety of data sources, including CSV files, SQL databases, and JSON files.

Converting Nested JSON to Pandas DataFrame

To convert a nested JSON file into a Pandas DataFrame, we will use the json_normalize() function from the pandas.io.json module. The json_normalize() function is used to normalize semi-structured JSON data into a flat table.

JSON Example

Let’s start with an example nested JSON file that we want to convert into a Pandas DataFrame. The JSON file contains information about books, including the title, author, and publisher.

{
  "books": [
    {
      "title": "The Catcher in the Rye",
      "author": {
        "first_name": "J.D.",
        "last_name": "Salinger"
      },
      "publisher": {
        "name": "Little, Brown and Company",
        "location": "Boston, MA"
      }
    },
    {
      "title": "To Kill a Mockingbird",
      "author": {
        "first_name": "Harper",
        "last_name": "Lee"
      },
      "publisher": {
        "name": "J. B. Lippincott & Co.",
        "location": "Philadelphia, PA"
      }
    },
    {
      "title": "1984",
      "author": {
        "first_name": "George",
        "last_name": "Orwell"
      },
      "publisher": {
        "name": "Secker & Warburg",
        "location": "London, UK"
      }
    }
  ]
}

Converting JSON to Pandas DataFrame

To convert this JSON file into a Pandas DataFrame, we will use the following code:

import pandas as pd
import json

with open('books.json') as f:
    data = json.load(f)

# Use pd.json_normalize to convert the JSON to a DataFrame
df = pd.json_normalize(data['books'], 
                     meta=['title', ['author', 'first_name'], ['author', 'last_name'], ['publisher', 'name'], ['publisher', 'location']])

# Rename the columns for clarity
df.columns = ['Title', 'Author_First_Name', 'Author_Last_Name', 'Publisher_Name', 'Publisher_Location']

# Display the DataFrame
print(df)

Let’s review the code:

  1. We import the Pandas and JSON libraries.
  2. We open the JSON file using the open() function and load the data using the json.load() function.
  3. We use the json_normalize() function to normalize the data. The data['books'] parameter specifies the path to the nested data, and the meta parameter specifies the columns to include in the resulting DataFrame.
  4. We rename the columns for better visualization.
  5. We print the resulting DataFrame.

Resulting Pandas DataFrame

The resulting Pandas DataFrame will look like this:

                    Title  Author_First_Name  Author_Last_Name             Publisher_Name  Publisher_Location  
0  The Catcher in the Rye               J.D.          Salinger  Little, Brown and Company          Boston, MA  
1  To Kill a Mockingbird              Harper               Lee     J. B. Lippincott & Co.    Philadelphia, PA  
2                  1984               George            Orwell           Secker & Warburg          London, UK  

As you can see, the resulting Pandas DataFrame has the specific format we defined. It contains columns for the book title, author first name, author last name, publisher name, and publisher location.

Conclusion

In this blog post, we explored how to convert a nested JSON file into a Pandas DataFrame with a specific format. We used the json_normalize() function from the pandas.io.json module to normalize the nested data and create a flat table. We also reviewed the resulting Pandas DataFrame and saw how it has the specific format we defined.

Converting nested JSON files to Pandas DataFrames can be challenging, but with the right tools and techniques, it can be a straightforward process. By using the json_normalize() function and specifying the record path and meta parameters, you can easily convert nested JSON files into Pandas DataFrames with a specific format.


About Saturn Cloud

Saturn Cloud is your all-in-one solution for data science & ML development, deployment, and data pipelines in the cloud. Spin up a notebook with 4TB of RAM, add a GPU, connect to a distributed cluster of workers, and more. Request a demo today to learn more.