How to Read Multiple CSV Files into a Python Pandas Dataframe

Data scientists and software engineers routinely work with large datasets that are spread across several sources, so it’s important to be able to read those sources efficiently and combine them into a single dataset. CSV (Comma Separated Values) is one of the most common formats for storing data. In this article, we’ll explore how to read multiple CSV files into a single Python Pandas dataframe.

Table of Contents

  1. What Is a CSV File?
  2. Why Read Multiple CSV Files into a Single Dataframe?
  3. How to Read Multiple CSV Files into a Single Dataframe
  4. Common Errors and How to Handle Them
  5. Conclusion

What Is a CSV File?

A CSV file is a plain-text file that stores tabular data. Each line in the file represents a row in the table, and commas separate the columns. The first row typically contains headers that describe the columns.

Here’s an example of a CSV file:

Name, Age, Gender
John, 25, M
Jane, 30, F
Bob, 40, M
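
As a quick illustration, the sample above can be parsed with Pandas. This is only a minimal sketch: the text is held in an in-memory string instead of a file, and skipinitialspace=True is an assumption made here so that the space after each comma does not end up in the column names and values.

```python
import io

import pandas as pd

# The sample CSV from above, kept in a string so the sketch is self-contained
sample = """Name, Age, Gender
John, 25, M
Jane, 30, F
Bob, 40, M
"""

# skipinitialspace=True drops the space that follows each comma in this sample
df = pd.read_csv(io.StringIO(sample), skipinitialspace=True)

print(df.columns.tolist())  # ['Name', 'Age', 'Gender']
print(df.shape)             # (3, 3)
```

Without skipinitialspace=True, the second and third column names would keep their leading space, which makes later column lookups easy to get wrong.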

Why Read Multiple CSV Files into a Single Dataframe?

In many cases, data is stored in multiple CSV files that need to be combined into a single dataset. For example, you might have data for different years or regions that need to be combined for analysis. Combining data into a single dataframe allows you to perform statistical analysis, data visualization, and machine learning tasks more easily.

How to Read Multiple CSV Files into a Single Dataframe

Let’s assume that we have several CSV files with the same columns, ID and Value, stored in a directory named saturn.

Python’s Pandas library provides a convenient way to read CSV files into a dataframe. To read multiple CSV files into a single dataframe, we can use the concat function from Pandas.

Assuming that all CSV files have the same structure, we can use the following code:

import pandas as pd
import glob

# Get a list of all CSV files in a directory
csv_files = glob.glob('saturn/*.csv')

# Create an empty dataframe to store the combined data
combined_df = pd.DataFrame()

# Loop through each CSV file and append its contents to the combined dataframe
for csv_file in csv_files:
    df = pd.read_csv(csv_file)
    combined_df = pd.concat([combined_df, df])

# Print the combined dataframe
print(combined_df)

Output:

   ID     Value
0   1  0.462535
1   2  0.747471
2   3  0.036683
3   4  0.252437
4   5  0.713350
0   1  0.895207
1   2  0.511677
2   3  0.532113
3   4  0.107172
4   5  0.447412
...
0   1  0.245958
1   2  0.160681
2   3  0.186567
3   4  0.285095
4   5  0.173374

Here’s what the code does:

  1. We import the Pandas library and the glob module, which allows us to easily get a list of all CSV files in a directory.
  2. We use the glob function to get a list of all CSV files in the specified directory.
  3. We create an empty dataframe called combined_df to store the combined data.
  4. We loop through each CSV file in the list and read its contents into a dataframe using the read_csv function from Pandas.
  5. We use the concat function from Pandas to append the contents of each CSV file to the combined_df dataframe.
  6. Finally, we print the combined dataframe to verify that the data has been combined correctly.
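
The same steps can also be written so that concat is called only once, which avoids copying the growing dataframe on every pass through the loop. The following is a sketch under a few assumptions: the temporary directory, the file names part_1.csv and part_2.csv, and the source_file column are all made up for illustration, and sorted() is added because glob does not promise any particular file order.

```python
import glob
import os
import tempfile

import pandas as pd

# Build a throwaway directory with two small CSV files (hypothetical data)
tmp_dir = tempfile.mkdtemp()
pd.DataFrame({"ID": [1, 2], "Value": [0.1, 0.2]}).to_csv(
    os.path.join(tmp_dir, "part_1.csv"), index=False
)
pd.DataFrame({"ID": [1, 2], "Value": [0.3, 0.4]}).to_csv(
    os.path.join(tmp_dir, "part_2.csv"), index=False
)

# sorted() makes the file order deterministic
csv_files = sorted(glob.glob(os.path.join(tmp_dir, "*.csv")))

# Read every file first, tagging each row with the file it came from
frames = [
    pd.read_csv(path).assign(source_file=os.path.basename(path))
    for path in csv_files
]

# Concatenate once; ignore_index=True renumbers the rows from 0 to n-1
combined_df = pd.concat(frames, ignore_index=True)
print(combined_df)
```

With ignore_index=True the combined dataframe gets one continuous index instead of the repeated 0 to 4 labels seen in the output above, and the source_file column keeps track of where each row originated.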

Common Errors and How to Handle Them

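  • Error 1: Inconsistent Column Headers

If the files do not share the same column headers, concat aligns columns by name and fills the values a file lacks with NaN. Reading the files into a list first makes it possible to compare their headers before combining them:

import glob
import pandas as pd

files = glob.glob('sample_files/*.csv')
dfs = [pd.read_csv(file) for file in files]
# Print each file's headers to spot mismatches before calling pd.concat
for file, df in zip(files, dfs):
    print(file, list(df.columns))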
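
One possible way to deal with headers that differ only in case or surrounding whitespace is to normalize the column names before concatenating. This is a sketch, not the only approach: the two in-memory files and the read_normalized helper are hypothetical, and real data may need an explicit rename mapping instead.

```python
import io

import pandas as pd

# Two hypothetical files whose headers differ only in case and whitespace
file_a = io.StringIO("ID,Value\n1,0.5\n2,0.7\n")
file_b = io.StringIO(" id , VALUE \n3,0.1\n")


def read_normalized(source):
    """Read a CSV and normalize its column names to stripped lowercase."""
    df = pd.read_csv(source)
    df.columns = df.columns.str.strip().str.lower()
    return df


dfs = [read_normalized(f) for f in (file_a, file_b)]

# After normalization both frames share the columns 'id' and 'value'
combined_df = pd.concat(dfs, ignore_index=True)
print(combined_df)
```

Without the normalization step, concat would treat "ID" and " id " as different columns and the result would contain four columns padded with NaN.
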
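  • Error 2: Memory Issues

Loading a very large file in one go can exhaust memory. Passing chunksize to read_csv returns an iterator of dataframes, so the file can be processed piece by piece:

import pandas as pd

# Read a large file in chunks of 1,000 rows
chunk_size = 1000
chunks = pd.read_csv('large_file.csv', chunksize=chunk_size)

for chunk in chunks:
    # process_data is a placeholder for your own per-chunk logic
    process_data(chunk)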
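
To make the chunked pattern concrete, here is a small runnable sketch in which the per-chunk work is a running row count and sum. The generated file, its 2,500 rows, and the Value column are assumptions chosen only so the example runs on its own; they stand in for whatever your large file actually contains.

```python
import os
import tempfile

import pandas as pd

# Write a hypothetical "large" file so the sketch runs on its own
path = os.path.join(tempfile.mkdtemp(), "large_file.csv")
pd.DataFrame({"ID": range(1, 2501), "Value": 1.0}).to_csv(path, index=False)

total = 0.0
row_count = 0

# Each chunk is an ordinary dataframe holding at most 1,000 rows
for chunk in pd.read_csv(path, chunksize=1000):
    total += chunk["Value"].sum()
    row_count += len(chunk)

print(row_count, total)
```

Because only one chunk is held in memory at a time, this style suits aggregations such as counts and sums. If the full combined dataframe is still needed, the chunks have to fit in memory together anyway.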

Conclusion

In this article, we’ve learned how to read multiple CSV files into a single Python Pandas dataframe. This is a useful technique for combining data from different sources and preparing it for analysis. With Python’s glob module and the concat function from Pandas, it’s easy to read and combine data from multiple CSV files.

