How to Open a PDF and Read in Tables with Python Pandas

As a data scientist or software engineer, you may encounter situations where you need to extract data from a PDF file. While PDFs can be challenging to work with due to their non-structured nature and lack of native support in Python, it is possible to extract tables from PDFs using Python libraries such as PyPDF2 and pandas.

Table of Contents

  1. Installing Required Libraries
  2. Opening a PDF File with PyPDF2
  3. Reading Tables from PDFs with pandas
  4. Cleaning and Manipulating Extracted Tables
  5. Exporting Tables to CSV or Excel
  6. Common Errors
  7. Conclusion

In this article, we will demonstrate how to open a PDF file and read in tables using Python pandas.

1. Installing Required Libraries

Before we get started, we need to make sure we have the necessary libraries installed. We will be using PyPDF2 and pandas for this tutorial, so let’s install them using pip:

pip install PyPDF2 pandas
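
As a quick sanity check after installation, the following minimal sketch reports which of the two packages cannot be found in the current environment. It assumes the import names PyPDF2 and pandas used in this tutorial; the variable names required and missing are only illustrative.

```python
import importlib.util

# Package names as they are imported, not necessarily as they are installed
required = ["PyPDF2", "pandas"]

# find_spec returns None when a top-level module cannot be located
missing = [name for name in required if importlib.util.find_spec(name) is None]

if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are installed.")
```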

2. Opening a PDF File with PyPDF2

To extract tables from a PDF, we first need to open the file and locate the pages that contain the tables we are interested in. We will use PyPDF2 to accomplish this.

example.pdf

import PyPDF2
import re

# Define a regular expression to match tables
table_regex = r'(?s)\b(?:\w+\s+){2,}\w+\b(?:\s*[,;]\s*\b(?:\w+\s+){2,}\w+\b)*'

with open('example.pdf', 'rb') as f:
    # Create a PDF reader object
    pdf_reader = PyPDF2.PdfReader(f)
    
    # Get the number of pages in the PDF file
    num_pages = len(pdf_reader.pages)
    
    # Loop through each page in the PDF file
    for page_num in range(num_pages):
        # Get the current page object
        page = pdf_reader.pages[page_num]
        
        # Extract the text from the current page
        page_text = page.extract_text()
        
        # Find candidate table text with the regular expression;
        # the next section turns these matches into DataFrames
        tables = re.findall(table_regex, page_text)

In this code snippet, we open the PDF file in read-binary mode using a context manager. We then create a PdfReader object and get the number of pages by taking the length of its pages attribute. We loop through the page numbers with a for loop, get each page object by indexing pdf_reader.pages, and extract its text with the extract_text() method. Finally, we apply the regular expression to the page text to collect candidate table text.
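
To locate the pages of interest, one option is to search the extracted text of each page for a keyword such as a table caption. The sketch below is only an illustration: pages_containing is a hypothetical helper, and sample_texts stands in for the strings that page.extract_text() would return for each page.

```python
def pages_containing(page_texts, keyword):
    """Return the 0-based indexes of pages whose text mentions keyword."""
    keyword = keyword.lower()
    return [
        index
        for index, text in enumerate(page_texts)
        # extract_text() may return an empty string for image-only pages
        if text and keyword in text.lower()
    ]


# Stand-in strings; in practice these would come from page.extract_text()
sample_texts = [
    "Annual report\nIntroduction",
    "Table 1: Sales by region\nNorth 10\nSouth 20",
    "",
]
print(pages_containing(sample_texts, "table 1"))
```

The returned indexes can then be used with pdf_reader.pages to process only the matching pages.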

3. Reading Tables from PDFs with pandas

Now that we have extracted the text from each page in the PDF file, we need to identify the tables in the text and extract them using pandas. We can use regular expressions to identify the tables in the text.

import re
import PyPDF2
import pandas as pd

# Define a regular expression to match tables
table_regex = r'(?s)\b(?:\w+\s+){2,}\w+\b(?:\s*[,;]\s*\b(?:\w+\s+){2,}\w+\b)*'

# Open the PDF file in read-binary mode
with open('example.pdf', 'rb') as f:
    # Create a PDF reader object
    pdf_reader = PyPDF2.PdfReader(f)
    
    # Get the number of pages in the PDF file
    num_pages = len(pdf_reader.pages)
    
    # Loop through each page in the PDF file
    for page_num in range(num_pages):
        # Get the current page object
        page = pdf_reader.pages[page_num]
        
        # Extract the text from the current page
        page_text = page.extract_text()
        
        # Find all tables in page_text
        tables = re.findall(table_regex, page_text)
        
        # Loop through each table and create a pandas DataFrame
        for table in tables:
            # Split the table into rows
            rows = table.strip().split('\n')

            # Split the rows into cells
            cells = [row.split('|') for row in rows]

            # Remove leading and trailing whitespace from cells
            cells = [[cell.strip() for cell in row] for row in cells]

            # Remove empty cells, then drop rows that are left empty
            cells = [[cell for cell in row if cell] for row in cells]
            cells = [row for row in cells if row]

            # Skip matches without a header row and at least one data row
            if len(cells) < 2:
                continue

            # Create a pandas DataFrame from the cells
            df = pd.DataFrame(cells[1:], columns=cells[0])
            print(df)

            # TODO: Clean and manipulate the df as needed

Output:

        Header 1
0       Header 2
1       Header 3
2   Row 1, Col 1
3   Row 1, Col 2
4   Row 1, Col 3
5   Row 2, Col 1
6   Row 2, Col 2
7   Row 2, Col 3
8       Column A
9       Column B
10      Column C
11        Data 1
12        Data 2
13        Data 3
14        Data 4
15        Data 5
16        Data 6

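In this code snippet, we define a regular expression to match tables in the text. The pattern is a heuristic: it matches runs of three or more whitespace-separated words, so it can also match ordinary sentences and usually needs tuning for each PDF. We then loop through each page in the PDF file and extract the text using PyPDF2. We use the regular expression to find all tables in the text and loop through each table. We split each table into rows on newlines and then into cells on the | character using the split() method. We remove leading and trailing whitespace from the cells using a list comprehension, then remove empty cells and empty rows. Finally, we create a pandas DataFrame from the cells using the DataFrame() constructor, treating the first row as the header. In the output above, every value lands in a single column, which suggests that the extracted text contained no | characters and that each cell arrived on its own line, so the first line became the only column header.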
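
When the extracted text places each cell on its own line, as in the output above, one possible workaround is to regroup the flat list of values into rows of a known width. The following is a minimal sketch under the assumption that the number of columns (3 here) is known in advance; reshape_cells is a hypothetical helper, not part of pandas or PyPDF2.

```python
def reshape_cells(values, n_cols):
    """Group a flat list of cell values into rows of n_cols cells each."""
    if n_cols < 1:
        raise ValueError("n_cols must be at least 1")
    rows = [values[i:i + n_cols] for i in range(0, len(values), n_cols)]
    # Drop a trailing partial row rather than guessing the missing cells
    if rows and len(rows[-1]) < n_cols:
        rows.pop()
    return rows


# Values in the order they appear in the first part of the output above
flat = [
    "Header 1", "Header 2", "Header 3",
    "Row 1, Col 1", "Row 1, Col 2", "Row 1, Col 3",
    "Row 2, Col 1", "Row 2, Col 2", "Row 2, Col 3",
]
header, *data = reshape_cells(flat, n_cols=3)
print(header)
print(data)
```

The resulting header and data lists can be passed to pd.DataFrame(data, columns=header) to obtain a three-column table.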

4. Cleaning and Manipulating Extracted Tables

Now that we have extracted the tables from the PDF file and created pandas DataFrames, we may need to clean and manipulate the data before we can use it in our analysis. For example, we may need to convert strings to numerical values or merge multiple tables together.

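# 'col1' and 'col2' are placeholder column names; replace them with
# the names of the columns in your own DataFrame

# Convert string columns to numerical values
df['col1'] = pd.to_numeric(df['col1'])
df['col2'] = pd.to_numeric(df['col2'])

# Merge multiple tables together (df1, df2, and df3 stand for
# DataFrames extracted from different tables)
merged_df = pd.concat([df1, df2, df3], ignore_index=True)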

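In this code snippet, we convert string columns to numerical values using the pd.to_numeric() function, which by default raises an error if a value cannot be parsed. We then stack multiple tables into one DataFrame using the pd.concat() function; ignore_index=True renumbers the rows of the combined table.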
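
Strings extracted from PDFs often contain thousands separators, stray whitespace, or placeholders such as "n/a". The sketch below shows one way to handle these before conversion; the column names and values are made up for illustration.

```python
import pandas as pd

# Made-up values that mimic strings extracted from a PDF table
df = pd.DataFrame({
    "item": ["A", "B", "C"],
    "amount": ["1,200", " 35 ", "n/a"],
})

# Remove thousands separators and surrounding whitespace before converting
cleaned = df["amount"].str.replace(",", "", regex=False).str.strip()

# errors="coerce" turns values that cannot be parsed into NaN
df["amount"] = pd.to_numeric(cleaned, errors="coerce")

print(df)
```

Using errors="coerce" keeps the conversion from failing on a single bad cell, at the cost of silently producing missing values, so it is worth checking how many NaN entries result.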

5. Exporting Tables to CSV or Excel

Finally, we may want to export the tables to a CSV or Excel file for further analysis or sharing with others. We can use the to_csv() and to_excel() methods in pandas to accomplish this.

# Export the DataFrame to a CSV file
df.to_csv('example.csv', index=False)

# Export the DataFrame to an Excel file
df.to_excel('example.xlsx', index=False)

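In this code snippet, we use the to_csv() method to export the DataFrame to a CSV file and the to_excel() method to export the DataFrame to an Excel file. Passing index=False leaves the row index out of the exported file. Note that to_excel() requires an Excel writer library such as openpyxl, which can be installed with pip install openpyxl.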
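
To check that an export preserves the data, the CSV can be read back and compared with the original. This sketch writes to an in-memory buffer rather than a file so that it leaves nothing on disk; the DataFrame contents are hypothetical.

```python
import io

import pandas as pd

# Hypothetical table standing in for an extracted DataFrame
df = pd.DataFrame({"name": ["alpha", "beta"], "value": [1, 2]})

# Write to an in-memory text buffer instead of a file on disk
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Read the CSV text back and compare it with the original
buffer.seek(0)
round_trip = pd.read_csv(buffer)
print(round_trip.equals(df))
```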

Common Errors

1. File Not Found Error

It’s crucial to handle scenarios where the specified PDF file is not found. The following code snippet demonstrates how to incorporate error handling for this situation:

import PyPDF2

pdf_file_path = 'example.pdf'

try:
    # Open the PDF file in read-binary mode
    with open(pdf_file_path, 'rb') as f:
        pdf_reader = PyPDF2.PdfReader(f)
        # ... (rest of the code)
except FileNotFoundError:
    print(f"Error: The file '{pdf_file_path}' does not exist.")

This error may occur if the specified PDF file path is incorrect or if the file has been moved or deleted.

2. PDF Reading Error

Another potential issue is the inability to read the PDF file, which may arise due to a corrupted file or an unexpected format. The following code snippet addresses this concern:

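import PyPDF2
# In current PyPDF2 releases, PdfReadError is defined in PyPDF2.errors
from PyPDF2.errors import PdfReadError

pdf_file_path = 'example.pdf'

try:
    # Open the PDF file in read-binary mode
    with open(pdf_file_path, 'rb') as f:
        pdf_reader = PyPDF2.PdfReader(f)
        # ... (rest of the code)
except PdfReadError:
    print(f"Error: Unable to read PDF file '{pdf_file_path}'.")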

This error may indicate problems with the PDF file’s structure or content.
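
A cheap pre-check can catch some obviously invalid files before they reach the PDF parser. The sketch below tests whether a file begins with the %PDF- marker that PDF files normally start with. It is a heuristic only: looks_like_pdf is a hypothetical helper, some valid PDFs carry extra bytes before the marker, and a file that passes can still be corrupted further on.

```python
import tempfile
from pathlib import Path


def looks_like_pdf(path):
    """Return True if the file at path starts with the PDF header marker."""
    path = Path(path)
    if not path.is_file():
        return False
    with path.open("rb") as f:
        # PDF files normally begin with the bytes %PDF-
        return f.read(5) == b"%PDF-"


# Demonstrate with throwaway files in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "good.pdf"
    bad = Path(tmp) / "bad.pdf"
    good.write_bytes(b"%PDF-1.7\n")
    bad.write_bytes(b"not a pdf")
    results = (
        looks_like_pdf(good),
        looks_like_pdf(bad),
        looks_like_pdf(Path(tmp) / "missing.pdf"),
    )

print(results)
```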

3. Unexpected Errors

In the event of unforeseen errors during the data extraction process, a generic exception block can be added to capture and handle these issues:

try:
    # ... (main code)
except Exception as e:
    print(f"An unexpected error occurred: {e}")

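This catch-all block prints the message of any exception that the more specific handlers did not catch, which gives a starting point for further investigation. Place it after the specific except clauses so that it does not shadow them.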
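
If the goal is to keep a record for later investigation, the standard logging module can capture the full traceback rather than only the message. The following is a minimal sketch; run_safely and the logger name pdf_tables are hypothetical names chosen for illustration.

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("pdf_tables")  # hypothetical logger name


def run_safely(func, *args, **kwargs):
    """Call func and log any exception with its traceback instead of raising."""
    try:
        return func(*args, **kwargs)
    except Exception:
        # logger.exception records the message plus the full traceback
        logger.exception("Unexpected error in %s", getattr(func, "__name__", func))
        return None


print(run_safely(int, "42"))
print(run_safely(int, "not a number"))
```

Returning None on failure is a design choice that suits batch processing of many PDFs; for a single critical file, re-raising after logging may be the better option.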

Conclusion

In this article, we have demonstrated how to open a PDF file and read in tables using Python pandas. We have covered the installation of required libraries, opening a PDF file with PyPDF2, reading tables from PDFs with pandas, cleaning and manipulating extracted tables, and exporting tables to CSV or Excel. By following these steps, you can extract data from PDFs and use it in your data analysis projects.

