How to Import Multiple CSV Files into Pandas and Concatenate into One DataFrame

As a data scientist or software engineer, you may often encounter situations where you need to work with multiple CSV files and combine them into a single DataFrame. This can be a time-consuming process if done manually, but thankfully, pandas provides a simple and efficient solution to automate this task.

Table of Contents

  1. Introduction
  2. Step 1: Import the Necessary Libraries
  3. Step 2: Define the File Path and File Extension
  4. Step 3: Create a List of CSV File Names
  5. Step 4: Import the CSV Files into Pandas
  6. Step 5: Concatenate the DataFrames into One
  7. Step 6: Optional Data Cleaning and Manipulation
  8. Pros and Cons of Importing and Concatenating CSV Files Using Pandas
  9. Error Handling
  10. Conclusion

In this article, we will walk you through the steps to import multiple CSV files into pandas and concatenate them into one DataFrame. We will also provide examples and tips to optimize your code and handle different scenarios.

CSV examples (three files with identical columns):

Name,Age,Salary
Alice,28,60000
Bob,35,75000
Charlie,40,90000

Name,Age,Salary
Eva,22,55000
Frank,29,70000
Grace,32,85000

Name,Age,Salary
Henry,30,62000
Ivy,28,68000
Jack,25,72000

Step 1: Import the Necessary Libraries

Before we start, we need to import the pandas library and any other libraries that we may need for data analysis or manipulation. We can do this using the import statement:

import pandas as pd

Step 2: Define the File Path and File Extension

Next, we define the directory where our CSV files are stored and the file extension we want to import. We also import the os module, which the following steps use to list the directory and build file paths:

import os
path = '/path/to/csv/files/'
extension = '.csv'

Step 3: Create a List of CSV File Names

Now we need a list of the CSV file names to import. A list comprehension over os.listdir() keeps only the names that end with the chosen extension. Because os.listdir() returns names in arbitrary order, we sort them so the files are combined in a predictable order:

files = sorted(file for file in os.listdir(path) if file.endswith(extension))
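As an alternative to filtering os.listdir(), the standard-library glob module can match the extension directly. The following is a minimal sketch, not part of the original steps: it creates a throwaway directory with hypothetical file names (a.csv, b.csv, notes.txt) so it can run anywhere.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as path:
    # Hypothetical files, created only so the sketch is runnable.
    for name in ("b.csv", "a.csv", "notes.txt"):
        open(os.path.join(path, name), "w").close()

    extension = ".csv"
    # glob returns full paths, so no os.path.join() is needed when reading later.
    full_paths = sorted(glob.glob(os.path.join(path, "*" + extension)))
    names = [os.path.basename(p) for p in full_paths]

print(names)  # ['a.csv', 'b.csv']
```

Note that glob also makes no promise about ordering, which is why the result is sorted here as well.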

Step 4: Import the CSV Files into Pandas

Using the list of file names, we can loop through each file and import it into pandas using the read_csv() function. We can also append each DataFrame to a list to concatenate them later:

dfs = []
for file in files:
    df = pd.read_csv(os.path.join(path, file))
    dfs.append(df)
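When rows from many files end up in one DataFrame, it can be useful to remember which file each row came from. The sketch below is one possible variation of the loop above: the source_file column and the file names file1.csv and file2.csv are assumptions made for illustration, and the sample data is written to a temporary directory so the code runs on its own.

```python
import os
import tempfile

import pandas as pd

# Sample rows from the article, stored under hypothetical file names.
samples = {
    "file1.csv": "Name,Age,Salary\nAlice,28,60000\nBob,35,75000\nCharlie,40,90000\n",
    "file2.csv": "Name,Age,Salary\nEva,22,55000\nFrank,29,70000\nGrace,32,85000\n",
}

with tempfile.TemporaryDirectory() as path:
    for name, text in samples.items():
        with open(os.path.join(path, name), "w", newline="") as handle:
            handle.write(text)

    files = sorted(f for f in os.listdir(path) if f.endswith(".csv"))
    dfs = []
    for file in files:
        df = pd.read_csv(os.path.join(path, file))
        # Extra column (an assumption, not in the original steps) recording the origin.
        df["source_file"] = file
        dfs.append(df)

combined = pd.concat(dfs, ignore_index=True)
print(combined)
```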

Step 5: Concatenate the DataFrames into One

Now that we have a list of DataFrames, we can use the concat() function to concatenate them into one DataFrame:

df = pd.concat(dfs, ignore_index=True)

Output:

      Name  Age  Salary
0    Alice   28   60000
1      Bob   35   75000
2  Charlie   40   90000
3      Eva   22   55000
4    Frank   29   70000
5    Grace   32   85000
6    Henry   30   62000
7      Ivy   28   68000
8     Jack   25   72000

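The ignore_index=True argument discards the index each file was read with and numbers the rows of the combined DataFrame from 0 to n-1. Without it, every file's index would start again at 0, so the row labels would repeat.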
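To see the effect of ignore_index, the short sketch below concatenates two small in-memory DataFrames both ways. The frames stand in for DataFrames read from separate CSV files.

```python
import pandas as pd

# Two small frames standing in for DataFrames read from separate CSV files.
first = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [28, 35]})
second = pd.DataFrame({"Name": ["Eva", "Frank"], "Age": [22, 29]})

# Default behaviour: each frame keeps its own row labels, so labels repeat.
kept = pd.concat([first, second])
print(kept.index.tolist())   # [0, 1, 0, 1]

# ignore_index=True discards the old labels and numbers the rows 0..n-1.
reset = pd.concat([first, second], ignore_index=True)
print(reset.index.tolist())  # [0, 1, 2, 3]
```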

Step 6: Optional Data Cleaning and Manipulation

At this point, we have successfully imported and concatenated our CSV files into a single DataFrame. However, depending on the data and your analysis goals, you may need to perform additional data cleaning and manipulation. Here are some examples:

Remove Duplicates

If your CSV files contain duplicate rows, you can remove them using the drop_duplicates() function:

df.drop_duplicates(inplace=True)
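By default drop_duplicates() only removes rows that match in every column. The sketch below, using made-up rows, contrasts that with the subset parameter, which compares only the listed columns.

```python
import pandas as pd

# Made-up rows: the third row repeats the first exactly, the fourth differs in Salary.
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice", "Alice"],
    "Age": [28, 35, 28, 28],
    "Salary": [60000, 75000, 60000, 61000],
})

# Rows are duplicates only if every column matches.
exact = df.drop_duplicates().reset_index(drop=True)

# Rows are duplicates if the Name matches; the first occurrence is kept.
by_name = df.drop_duplicates(subset=["Name"], keep="first").reset_index(drop=True)

print(len(exact), len(by_name))  # 3 2
```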

Rename Columns

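If your CSV files use different names for the same column, concat() keeps them as separate columns and fills the gaps with NaN. To avoid this, rename the columns of each DataFrame before concatenating, using the rename() function: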

df = df.rename(columns={'old_name': 'new_name'})
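The following sketch illustrates why the rename matters. The column name "Full Name" is a hypothetical mismatch invented for the example.

```python
import pandas as pd

# Hypothetical frames: the second file calls the column "Full Name" instead of "Name".
a = pd.DataFrame({"Name": ["Alice"], "Age": [28]})
b = pd.DataFrame({"Full Name": ["Eva"], "Age": [22]})

# Without renaming, concat keeps both name columns and fills the gaps with NaN.
mismatched = pd.concat([a, b], ignore_index=True)
print(mismatched)

# Renaming first lines the columns up, so the result has no gaps.
b = b.rename(columns={"Full Name": "Name"})
aligned = pd.concat([a, b], ignore_index=True)
print(aligned)
```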

Drop Columns

If you have columns that are not relevant to your analysis, you can drop them using the drop() function:

df = df.drop(columns=['column_name'])

Change Data Types

If your CSV files have columns with incorrect data types, you can change them using the astype() function, where 'new_type' stands for the target type, such as 'int64' or 'float64':

df['column_name'] = df['column_name'].astype('new_type')
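As a concrete illustration, the sketch below converts text columns to numbers. The sample values, including the "n/a" entry, are made up; pd.to_numeric() with errors="coerce" is shown as one way to handle values that astype() would reject.

```python
import pandas as pd

# Made-up data in which the numeric columns were read as text.
df = pd.DataFrame({
    "Name": ["Alice", "Bob"],
    "Age": ["28", "35"],
    "Salary": ["60000", "n/a"],
})

# astype() works when every value can be converted.
df["Age"] = df["Age"].astype("int64")

# astype() would raise on "n/a"; to_numeric with errors="coerce" yields NaN instead.
df["Salary"] = pd.to_numeric(df["Salary"], errors="coerce")

print(df.dtypes)
```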

Pros and Cons of Importing and Concatenating CSV Files Using Pandas

Pros

  • Efficiency and Automation: Pandas simplifies the process of importing and concatenating multiple CSV files, reducing the need for manual intervention. A few lines of code handle any number of files in a directory, provided the combined data fits in memory.

  • Flexibility: The approach is flexible and can be adapted to different scenarios where combining data from multiple sources is required.

  • Customization: Users have the flexibility to perform additional data cleaning, manipulation, and analysis tailored to their specific needs.

  • Data Integrity: Concatenation leaves the row values from each file unchanged when the files share the same columns and types, and you can either preserve the original indices or reset them with ignore_index=True.

Cons

  • Memory Usage: Loading multiple large CSV files into memory simultaneously may lead to high memory usage. This can be a challenge when working with extremely large datasets.

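  • File Compatibility: The approach assumes that all CSV files have the same structure. If column names differ, pd.concat() keeps the union of the columns and fills the missing values with NaN, and mismatched data types may be silently converted to a more general type.

  • Performance Impact: Reading a large number of files one after another takes time, and calling pd.concat() repeatedly inside the loop copies the accumulated data on every iteration. Collecting the DataFrames in a list and concatenating once, as in Steps 4 and 5, avoids the repeated copying.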
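One way to ease the memory concern above is to load only the columns you need and to choose compact data types while reading. The sketch below is an illustration under assumptions: it reads from an in-memory string rather than a real file, and the choice of columns and of int32 is arbitrary.

```python
import io

import pandas as pd

# In-memory stand-in for one of the CSV files.
csv_text = "Name,Age,Salary\nAlice,28,60000\nBob,35,75000\nCharlie,40,90000\n"

# usecols skips unneeded columns; dtype picks a smaller integer type than the default.
df = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["Name", "Salary"],
    dtype={"Salary": "int32"},
)

print(df.dtypes)
print(df.memory_usage(deep=True))
```

The same keyword arguments can be passed to pd.read_csv() inside the loop from Step 4.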

Error Handling

  1. Directory Existence Check: Before proceeding with file operations, it’s important to check whether the specified directory exists. This helps prevent issues related to non-existent paths.

  2. Empty Directory Handling: If the directory is empty and does not contain any CSV files, an appropriate error or exception should be raised to notify the user. This prevents attempts to process non-existent files.

  3. Data Loading Errors: During the loading of CSV files using pd.read_csv(), potential errors such as file not found, empty file, or incorrect file format should be handled. This ensures smooth data loading without unexpected interruptions.

  4. Concatenation Errors: Errors that may occur during the concatenation step need to be addressed. This includes handling scenarios where the column names or data types in the CSV files are not consistent, leading to difficulties in creating a cohesive DataFrame.
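The four checks above can be sketched in a single helper. This is a minimal illustration, not a definitive implementation: the function name load_csv_folder, the decision to skip unreadable files rather than stop, and the strict column comparison are all assumptions. The usage part creates throwaway files so the example runs anywhere.

```python
import os
import tempfile

import pandas as pd


def load_csv_folder(path, extension=".csv"):
    """Hypothetical helper: read every matching file in `path` into one DataFrame."""
    # 1. Directory existence check.
    if not os.path.isdir(path):
        raise FileNotFoundError(f"Directory not found: {path}")

    # 2. Empty directory handling.
    files = sorted(f for f in os.listdir(path) if f.endswith(extension))
    if not files:
        raise ValueError(f"No {extension} files found in {path}")

    dfs, skipped = [], []
    expected_columns = None
    for file in files:
        # 3. Data loading errors: record and skip files pandas cannot parse.
        try:
            df = pd.read_csv(os.path.join(path, file))
        except (pd.errors.EmptyDataError, pd.errors.ParserError) as exc:
            skipped.append((file, str(exc)))
            continue

        # 4. Concatenation errors: insist that every file has the same columns.
        if expected_columns is None:
            expected_columns = list(df.columns)
        elif list(df.columns) != expected_columns:
            raise ValueError(
                f"{file} has columns {list(df.columns)}, expected {expected_columns}"
            )
        dfs.append(df)

    if not dfs:
        raise ValueError("None of the files could be read")
    return pd.concat(dfs, ignore_index=True), skipped


# Usage with throwaway files: one valid CSV and one empty file.
with tempfile.TemporaryDirectory() as folder:
    with open(os.path.join(folder, "good.csv"), "w") as handle:
        handle.write("Name,Age,Salary\nAlice,28,60000\n")
    open(os.path.join(folder, "empty.csv"), "w").close()

    combined, skipped = load_csv_folder(folder)

print(combined)
print(skipped)
```

Returning the list of skipped files alongside the DataFrame lets the caller decide whether a partial load is acceptable.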

Conclusion

In this article, we have shown you how to import multiple CSV files into pandas and concatenate them into a single DataFrame. We have also provided examples of how to perform additional data cleaning and manipulation to optimize your analysis.

By following these steps and using the pandas library, you can efficiently handle large datasets and automate repetitive tasks in your data analysis projects.

