How to Open a PDF and Read in Tables with Python Pandas
As a data scientist or software engineer, you may need to extract data from a PDF file. PDFs are hard to work with because they describe page layout rather than structured data, and pandas has no built-in PDF reader. Even so, you can extract tables from PDFs by combining PyPDF2, which pulls the text out of each page, with pandas, which turns that text into DataFrames.
Table of Contents
- Installing Required Libraries
- Opening a PDF File with PyPDF2
- Reading Tables from PDFs with pandas
- Cleaning and Manipulating Extracted Tables
- Exporting Tables to CSV or Excel
- Common Errors
- Conclusion
In this article, we will demonstrate how to open a PDF file with PyPDF2 and read its tables into pandas DataFrames, working through the sections listed above in order.
1. Installing Required Libraries
Before we get started, we need to make sure we have the necessary libraries installed. We will be using PyPDF2 and pandas for this tutorial, so let’s install them using pip:
pip install PyPDF2 pandas
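Once the install finishes, it can help to confirm that both packages are visible to the interpreter you plan to run. The sketch below uses only the standard library; the helper name installed_versions is made up for this example and is not part of PyPDF2 or pandas.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(["PyPDF2", "pandas"]).items():
        print(f"{name}: {version or 'not installed'}")
```

A value of None for either package suggests that pip installed into a different environment from the one running the script.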
2. Opening a PDF File with PyPDF2
To extract tables from a PDF, we first need to open the file and locate the pages that contain the tables we are interested in. We will use PyPDF2 to accomplish this.
import PyPDF2
import re
# Define a regular expression to match tables
table_regex = r'(?s)\b(?:\w+\s+){2,}\w+\b(?:\s*[,;]\s*\b(?:\w+\s+){2,}\w+\b)*'
with open('example.pdf', 'rb') as f:
    # Create a PDF reader object
    pdf_reader = PyPDF2.PdfReader(f)
    # Get the number of pages in the PDF file
    num_pages = len(pdf_reader.pages)
    # Loop through each page in the PDF file
    for page_num in range(num_pages):
        # Get the current page object
        page = pdf_reader.pages[page_num]
        # Extract the text from the current page
        page_text = page.extract_text()
        # Find all candidate tables in page_text
        tables = re.findall(table_regex, page_text)
        # TODO: Turn each candidate table into a DataFrame (see the next section)
In this code snippet, we open the PDF file in read-binary mode using a context manager. We then create a PDF reader object and get the number of pages by taking the length of the pdf_reader.pages property. We loop through the pages with a for loop, retrieving each page object by indexing pdf_reader.pages[page_num], and extract its text with the extract_text() method. Finally, we run the regular expression over the page text to collect candidate tables.
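The same loop can be packaged as a small helper that returns the text of every page, which keeps file handling separate from table parsing. This is a sketch rather than the article's exact code: the names page_texts and load_page_texts are hypothetical, and it assumes a PyPDF2 version that provides PdfReader, as the snippet above does. It also guards against pages whose extracted text is empty.

```python
def page_texts(reader):
    """Return a list of (page_number, text) pairs for a PdfReader-like object.

    Page numbers start at 1. Pages with no extractable text yield "".
    """
    return [
        (number, page.extract_text() or "")
        for number, page in enumerate(reader.pages, start=1)
    ]


def load_page_texts(pdf_path):
    """Open pdf_path with PyPDF2 and return the text of every page."""
    import PyPDF2  # imported here so page_texts() can be exercised without PyPDF2

    with open(pdf_path, "rb") as f:
        # page_texts() builds its list eagerly, so the file is still open while reading
        return page_texts(PyPDF2.PdfReader(f))
```

For example, load_page_texts('example.pdf') would return one (page_number, text) pair per page, ready to be searched for tables.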
3. Reading Tables from PDFs with pandas
Now that we have extracted the text from each page in the PDF file, we need to identify the tables in the text and extract them using pandas. We can use regular expressions to identify the tables in the text.
import re
import PyPDF2
import pandas as pd
# Define a regular expression to match tables
table_regex = r'(?s)\b(?:\w+\s+){2,}\w+\b(?:\s*[,;]\s*\b(?:\w+\s+){2,}\w+\b)*'
# Open the PDF file in read-binary mode
with open('example.pdf', 'rb') as f:
    # Create a PDF reader object
    pdf_reader = PyPDF2.PdfReader(f)
    # Get the number of pages in the PDF file
    num_pages = len(pdf_reader.pages)
    # Loop through each page in the PDF file
    for page_num in range(num_pages):
        # Get the current page object
        page = pdf_reader.pages[page_num]
        # Extract the text from the current page
        page_text = page.extract_text()
        # Find all candidate tables in page_text
        tables = re.findall(table_regex, page_text)
        # Loop through each table and create a pandas DataFrame
        for table in tables:
            # Split the table into rows
            rows = table.strip().split('\n')
            # Split the rows into cells
            cells = [row.split('|') for row in rows]
            # Remove leading and trailing whitespace from cells
            cells = [[cell.strip() for cell in row] for row in cells]
            # Remove empty cells, then drop rows that end up empty
            cells = [[cell for cell in row if cell] for row in cells]
            cells = [row for row in cells if row]
            # Skip matches that lack a header row plus at least one data row
            if len(cells) < 2:
                continue
            # Create a pandas DataFrame from the cells
            df = pd.DataFrame(cells[1:], columns=cells[0])
            # TODO: Clean and manipulate the df as needed
Output:
Header 1
0 Header 2
1 Header 3
2 Row 1, Col 1
3 Row 1, Col 2
4 Row 1, Col 3
5 Row 2, Col 1
6 Row 2, Col 2
7 Row 2, Col 3
8 Column A
9 Column B
10 Column C
11 Data 1
12 Data 2
13 Data 3
14 Data 4
15 Data 5
16 Data 6
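In this code snippet, we define a regular expression to match tables in the text. We then loop through each page in the PDF file and extract the text using PyPDF2. We use the regular expression to find all candidate tables in the text and loop through each one. We split each table into rows on newlines and then into cells on the | character using the split() method. We remove leading and trailing whitespace from the cells and drop empty cells and rows with list comprehensions. We then create a pandas DataFrame from the cells using the DataFrame() constructor, treating the first row as the header. Note that the regular expression is a heuristic and the | delimiter is an assumption about the extracted text: when a PDF's text contains no | characters, every row yields a single cell and the result is a one-column DataFrame like the output shown above.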
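Text extracted from a PDF often separates columns with runs of spaces rather than | characters. As an alternative to the approach above, and not a replacement for it, the sketch below splits each line on two or more spaces. The function name rows_to_dataframe and the sample text are invented for illustration, and the two-space rule is an assumption that will not hold for every PDF.

```python
import re

import pandas as pd


def rows_to_dataframe(table_text):
    """Build a DataFrame from text whose columns are separated by 2+ spaces.

    The first non-empty line is used as the header. Lines whose cell count
    differs from the header are skipped.
    """
    lines = [line.strip() for line in table_text.splitlines() if line.strip()]
    rows = [re.split(r"\s{2,}", line) for line in lines]
    if len(rows) < 2:
        return pd.DataFrame()
    header, body = rows[0], rows[1:]
    body = [row for row in body if len(row) == len(header)]
    return pd.DataFrame(body, columns=header)


# Made-up text standing in for one page of extracted output
sample = """
Item      Quantity   Price
Widget    4          2.50
Gadget    10         7.25
Page 3 of 12
"""
df = rows_to_dataframe(sample)
print(df)
```

Skipping lines with the wrong cell count drops stray page footers such as the last sample line, at the cost of also discarding genuine rows that contain blank cells.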
4. Cleaning and Manipulating Extracted Tables
Now that we have extracted the tables from the PDF file and created pandas DataFrames, we may need to clean and manipulate the data before we can use it in our analysis. For example, we may need to convert strings to numerical values or merge multiple tables together.
# Convert string columns to numerical values
# ('col1' and 'col2' are placeholders for your own column names)
df['col1'] = pd.to_numeric(df['col1'])
df['col2'] = pd.to_numeric(df['col2'])
# Merge multiple tables together
# (df1, df2 and df3 are placeholders for DataFrames extracted from separate tables)
merged_df = pd.concat([df1, df2, df3], ignore_index=True)
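In this code snippet, we convert string columns to numerical values using the pd.to_numeric() function, which by default raises an error if a value cannot be parsed. We then stack multiple tables into one DataFrame using the pd.concat() function; ignore_index=True renumbers the rows of the result.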
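Values extracted from PDFs often carry currency symbols, thousands separators, or stray text, which make a plain pd.to_numeric() call fail. The sketch below strips common debris first and uses errors='coerce' so that unparseable cells become NaN rather than raising. The column names and sample values are invented for this example.

```python
import pandas as pd

# Hypothetical tables, as they might look straight after extraction
df1 = pd.DataFrame({"item": ["Widget", "Gadget"], "price": ["$2.50", "1,200"]})
df2 = pd.DataFrame({"item": ["Gizmo"], "price": ["n/a"]})

# Stack the tables and renumber the rows from 0
merged_df = pd.concat([df1, df2], ignore_index=True)

# Remove currency symbols and thousands separators, then convert;
# anything that still is not a number becomes NaN
cleaned = merged_df["price"].str.replace(r"[$,]", "", regex=True)
merged_df["price"] = pd.to_numeric(cleaned, errors="coerce")

print(merged_df)
```

Coercing hides bad values instead of reporting them, so it is worth counting the NaN entries afterwards with merged_df["price"].isna().sum() to see how much was lost.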
5. Exporting Tables to CSV or Excel
Finally, we may want to export the tables to a CSV or Excel file for further analysis or sharing with others. We can use the to_csv() and to_excel() methods in pandas to accomplish this. Writing .xlsx files requires an Excel engine such as openpyxl to be installed alongside pandas.
# Export the DataFrame to a CSV file
df.to_csv('example.csv', index=False)
# Export the DataFrame to an Excel file
df.to_excel('example.xlsx', index=False)
In this code snippet, we use the to_csv() method to export the DataFrame to a CSV file and the to_excel() method to export the DataFrame to an Excel file.
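A quick way to check an export is to read the file back and compare it with the original. The sketch below does this for CSV in a temporary directory; the DataFrame contents and file name are made up. The Excel round trip is left out because it depends on an optional engine being installed.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up data standing in for an extracted table
df = pd.DataFrame({"item": ["Widget", "Gadget"], "quantity": [4, 10]})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / "example.csv"
    # index=False keeps the row index out of the file
    df.to_csv(csv_path, index=False)
    # Read the file back to confirm that the columns and values survived
    round_trip = pd.read_csv(csv_path)

print(round_trip)
```

Omitting index=False would write the index as an extra unnamed column, which then shows up as a spurious column when the file is read back.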
Common Errors
1. File Not Found Error
It’s crucial to handle scenarios where the specified PDF file is not found. The following code snippet demonstrates how to incorporate error handling for this situation:
try:
    # Open the PDF file in read-binary mode
    with open(pdf_file_path, 'rb') as f:
        pdf_reader = PyPDF2.PdfReader(f)
        # ... (rest of the code)
except FileNotFoundError:
    print(f"Error: The file '{pdf_file_path}' does not exist.")
This error may occur if the specified PDF file path is incorrect or if the file has been moved or deleted.
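An alternative to catching the exception is to check the path before opening it, which also lets you distinguish a missing file from a directory passed by mistake. The sketch below uses pathlib; the function name check_pdf_path and its messages are hypothetical.

```python
from pathlib import Path


def check_pdf_path(pdf_file_path):
    """Return None if the path points to an existing file, else an error message."""
    path = Path(pdf_file_path)
    if not path.exists():
        return f"Error: The file '{path}' does not exist."
    if not path.is_file():
        return f"Error: '{path}' is not a regular file."
    return None
```

The check and the later open() call are separate steps, so the file can still disappear in between; keeping the try/except around open() remains worthwhile.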
2. PDF Reading Error
Another potential issue is the inability to read the PDF file, which may arise due to a corrupted file or an unexpected format. The following code snippet addresses this concern:
try:
    # Open the PDF file in read-binary mode
    with open(pdf_file_path, 'rb') as f:
        pdf_reader = PyPDF2.PdfReader(f)
        # ... (rest of the code)
except PyPDF2.errors.PdfReadError:
    print(f"Error: Unable to read PDF file '{pdf_file_path}'.")
This error may indicate problems with the PDF file’s structure or content.
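Many unreadable files are not PDFs at all, for example an HTML error page saved with a .pdf extension. A cheap pre-check is to look for the %PDF- marker that PDF files begin with. The sketch below is a heuristic, not a validity test: the name looks_like_pdf is made up, the 1024-byte window is an assumption chosen to tolerate a few leading bytes, and a file that passes can still be corrupted further in.

```python
def looks_like_pdf(pdf_file_path):
    """Return True if the file's first 1024 bytes contain the '%PDF-' marker."""
    with open(pdf_file_path, "rb") as f:
        head = f.read(1024)
    return b"%PDF-" in head
```

A False result lets you report "not a PDF" separately from a parsing failure, which usually points to a different cause, such as a failed download.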
3. Unexpected Errors
In the event of unforeseen errors during the data extraction process, a generic exception block can be added to capture and handle these issues:
try:
    # ... (main code)
except Exception as e:
    print(f"An unexpected error occurred: {e}")
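This catch-all block ensures that any unexpected error is reported with its message for further investigation. Place it after the more specific except clauses so that they still get the chance to handle the errors they target.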
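When several PDFs are processed in one run, it is often more useful to record the failure and carry on than to stop at the first problem. The sketch below uses the standard logging module, whose logger.exception() call records the traceback along with the message. The function process_all and the extract callable are hypothetical stand-ins for your own extraction code.

```python
import logging

logger = logging.getLogger(__name__)


def process_all(paths, extract):
    """Call extract(path) for each path; return (results, failed_paths).

    Failures are logged with their traceback and do not stop the loop.
    """
    results, failed = {}, []
    for path in paths:
        try:
            results[path] = extract(path)
        except Exception:
            # logger.exception logs at ERROR level and appends the traceback
            logger.exception("Could not process %s", path)
            failed.append(path)
    return results, failed
```

Returning the failed paths alongside the results makes it easy to retry or inspect just those files afterwards.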
Conclusion
In this article, we have demonstrated how to open a PDF file and read in tables using Python pandas. We have covered the installation of required libraries, opening a PDF file with PyPDF2, reading tables from PDFs with pandas, cleaning and manipulating extracted tables, and exporting tables to CSV or Excel. By following these steps, you can extract data from PDFs and use it in your data analysis projects.