How to Import Multiple CSV Files into Pandas and Concatenate into One DataFrame
As a data scientist or software engineer, you may often encounter situations where you need to work with multiple CSV files and combine them into a single DataFrame. This can be a time-consuming process if done manually, but thankfully, pandas provides a simple and efficient solution to automate this task.
Table of Contents
- Introduction
- Step 1: Import the Necessary Libraries
- Step 2: Define the File Path and File Extension
- Step 3: Create a List of CSV File Names
- Step 4: Import the CSV Files into Pandas
- Step 5: Concatenate the DataFrames into One
- Step 6: Optional Data Cleaning and Manipulation
- Pros and Cons of Importing and Concatenating CSV Files Using Pandas
- Error Handling
- Conclusion
In this article, we will walk you through the steps to import multiple CSV files into pandas and concatenate them into one DataFrame. We will also provide examples and tips to optimize your code and handle different scenarios.
CSV examples (three separate files with the same header):
Name,Age,Salary
Alice,28,60000
Bob,35,75000
Charlie,40,90000

Name,Age,Salary
Eva,22,55000
Frank,29,70000
Grace,32,85000

Name,Age,Salary
Henry,30,62000
Ivy,28,68000
Jack,25,72000
Step 1: Import the Necessary Libraries
Before we start, we need to import the pandas library and any other libraries that we may need for data analysis or manipulation. We can do this using the import statement:
import pandas as pd
Step 2: Define the File Path and File Extension
Next, we need to define the directory where our CSV files are stored and the file extension we want to import. We also import the os library, which the next steps use to list the directory and build file paths:
import os
path = '/path/to/csv/files/'
extension = '.csv'
Step 3: Create a List of CSV File Names
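Now, we need to create a list of the CSV file names that we want to import. We can use the os library to list the files in the directory and keep the names that end with our extension. Because os.listdir() returns names in arbitrary order, we sort them so the rows of the final DataFrame appear in a predictable order:
files = sorted(file for file in os.listdir(path) if file.endswith(extension))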
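As an alternative sketch, the same list can be built with pathlib from the standard library. It also sorts the result, since directory listings come back in arbitrary order. The helper name list_csv_files and the temporary directory used for the demo are illustrative assumptions, not part of the original steps.

```python
import tempfile
from pathlib import Path


def list_csv_files(directory, extension=".csv"):
    """Return CSV paths in `directory`, sorted by name (hypothetical helper)."""
    return sorted(
        p for p in Path(directory).iterdir()
        if p.is_file() and p.suffix.lower() == extension
    )


# Demo in a throwaway directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["b.csv", "a.csv", "notes.txt"]:
        (Path(tmp) / name).write_text("Name,Age,Salary\n")
    found = [p.name for p in list_csv_files(tmp)]

print(found)  # ['a.csv', 'b.csv']
```

Each returned Path can be passed straight to pd.read_csv(), so no separate os.path.join() call is needed with this variant.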
Step 4: Import the CSV Files into Pandas
Using the list of file names, we can loop through each file and import it into pandas using the read_csv() function. We can also append each DataFrame to a list to concatenate them later:
dfs = []
for file in files:
    df = pd.read_csv(os.path.join(path, file))
    dfs.append(df)
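Once rows from several files share one table, it is often useful to know which file each row came from. The sketch below is one possible way to record that while loading. The file names file1.csv and file2.csv, the temporary directory, and the source_file column are all assumptions made so the example runs on its own.

```python
import os
import tempfile

import pandas as pd

# Two of the sample files from above, under hypothetical names.
samples = {
    "file1.csv": "Name,Age,Salary\nAlice,28,60000\nBob,35,75000\n",
    "file2.csv": "Name,Age,Salary\nEva,22,55000\nFrank,29,70000\n",
}

with tempfile.TemporaryDirectory() as path:
    for name, text in samples.items():
        with open(os.path.join(path, name), "w") as fh:
            fh.write(text)

    files = sorted(f for f in os.listdir(path) if f.endswith(".csv"))
    dfs = []
    for file in files:
        df = pd.read_csv(os.path.join(path, file))
        df["source_file"] = file  # hypothetical column recording the origin
        dfs.append(df)

print([len(d) for d in dfs])          # [2, 2]
print(dfs[1]["source_file"].iloc[0])  # file2.csv
```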
Step 5: Concatenate the DataFrames into One
Now that we have a list of DataFrames, we can use the concat() function to concatenate them into one DataFrame:
df = pd.concat(dfs, ignore_index=True)
Output:
      Name  Age  Salary
0    Alice   28   60000
1      Bob   35   75000
2  Charlie   40   90000
3      Eva   22   55000
4    Frank   29   70000
5    Grace   32   85000
6    Henry   30   62000
7      Ivy   28   68000
8     Jack   25   72000
The ignore_index=True argument discards the index each file had on its own and numbers the rows of the concatenated DataFrame from 0 to n-1, as in the output above.
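To see what ignore_index changes, the following small sketch concatenates two in-memory DataFrames both ways. The frames are built inline from a few of the sample rows so the example does not depend on any files.

```python
import pandas as pd

first = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [28, 35]})
second = pd.DataFrame({"Name": ["Eva", "Frank"], "Age": [22, 29]})

kept = pd.concat([first, second])                      # each frame keeps its own index
reset = pd.concat([first, second], ignore_index=True)  # rows are renumbered

print(kept.index.tolist())   # [0, 1, 0, 1]
print(reset.index.tolist())  # [0, 1, 2, 3]
```

With the repeated index, a lookup such as kept.loc[0] returns two rows, which is rarely what is wanted after combining files.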
Step 6: Optional Data Cleaning and Manipulation
At this point, we have successfully imported and concatenated our CSV files into a single DataFrame. However, depending on the data and your analysis goals, you may need to perform additional data cleaning and manipulation. Here are some examples:
Remove Duplicates
If your CSV files contain duplicate rows, you can remove them using the drop_duplicates() method:
df.drop_duplicates(inplace=True)
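As a short illustration with made-up rows, the sketch below removes fully duplicated rows and, separately, duplicates judged on a single column. Treating Name as the key is an assumption for the example only.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice"],
    "Age": [28, 35, 28],
    "Salary": [60000, 75000, 60000],
})

# Drop rows that are identical in every column, then renumber the index.
deduped = df.drop_duplicates().reset_index(drop=True)

# Treat rows with the same Name as duplicates and keep the last one seen.
by_name = df.drop_duplicates(subset=["Name"], keep="last")

print(len(deduped))            # 2
print(by_name.index.tolist())  # [1, 2]
```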
Rename Columns
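If your CSV files use different names for the same column, you can rename them using the rename() method. This works best on each DataFrame before concatenation, because concat() aligns columns by name and would otherwise create separate, partly empty columns: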
df = df.rename(columns={'old_name': 'new_name'})
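The sketch below shows why the order of renaming and concatenating matters. The mismatched header Pay is a hypothetical example; the sample files in this article all share the same header.

```python
import pandas as pd

a = pd.DataFrame({"Name": ["Alice"], "Salary": [60000]})
b = pd.DataFrame({"Name": ["Eva"], "Pay": [55000]})  # hypothetical mismatched header

# Concatenating first leaves two half-empty columns.
mismatched = pd.concat([a, b], ignore_index=True)
print(mismatched.columns.tolist())         # ['Name', 'Salary', 'Pay']
print(int(mismatched.isna().sum().sum()))  # 2

# Renaming before concatenation lines the columns up.
aligned = pd.concat([a, b.rename(columns={"Pay": "Salary"})], ignore_index=True)
print(aligned.columns.tolist())    # ['Name', 'Salary']
print(aligned["Salary"].tolist())  # [60000, 55000]
```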
Drop Columns
If you have columns that are not relevant to your analysis, you can drop them using the drop() method:
df = df.drop(columns=['column_name'])
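For a concrete case, the sketch below drops the Salary column from the sample data, and also shows the usecols parameter of read_csv() as a way to skip a column at load time. Reading from an in-memory string is an assumption made so the example needs no files.

```python
import io

import pandas as pd

csv_text = "Name,Age,Salary\nAlice,28,60000\nBob,35,75000\n"

# Drop a column after loading.
df = pd.read_csv(io.StringIO(csv_text))
trimmed = df.drop(columns=["Salary"])
print(trimmed.columns.tolist())  # ['Name', 'Age']

# Alternatively, never load the column in the first place.
subset = pd.read_csv(io.StringIO(csv_text), usecols=["Name", "Age"])
print(subset.columns.tolist())  # ['Name', 'Age']
```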
Change Data Types
If your CSV files have columns with incorrect data types, you can change them using the astype() method:
df['column_name'] = df['column_name'].astype('new_type')
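Here 'new_type' stands for a real dtype such as 'int64' or 'float64'. The sketch below uses made-up values, including one unparseable entry, to show a plain conversion and a more forgiving one with pd.to_numeric(). The "n/a" value is an assumption for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Eva"],
    "Age": ["28", "35", "n/a"],  # read as text because of the stray value
    "Salary": [60000, 75000, 55000],
})

# A clean column converts directly.
df["Salary"] = df["Salary"].astype("float64")
print(df["Salary"].dtype)  # float64

# astype("int64") would raise on "n/a"; errors="coerce" turns it into NaN instead.
df["Age"] = pd.to_numeric(df["Age"], errors="coerce")
print(int(df["Age"].isna().sum()))  # 1
```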
Pros and Cons of Importing and Concatenating CSV Files Using Pandas
Pros
Efficiency and Automation: Pandas simplifies importing and concatenating multiple CSV files and removes the need for manual copying. A few lines of code handle any number of files, as long as the combined data fits in memory.
Flexibility: The approach is flexible and can be adapted to different scenarios where combining data from multiple sources is required.
Customization: Users have the flexibility to perform additional data cleaning, manipulation, and analysis tailored to their specific needs.
Data Integrity: concat() keeps every row from every input and aligns columns by name. You can preserve the original indices or reset them with ignore_index=True.
Cons
Memory Usage: Loading multiple large CSV files into memory simultaneously may lead to high memory usage. This can be a challenge when working with extremely large datasets.
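File Compatibility: The approach assumes that all CSV files have the same structure. concat() does not raise an error on mismatched columns; it fills the gaps with NaN, and mismatched data types can silently change a column's dtype, so problems may only surface later in the analysis.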
Performance Impact: Calling concat() repeatedly inside a loop copies the accumulated data on every iteration and slows down as the number of files grows. Collecting the DataFrames in a list and calling concat() once, as in Step 5, avoids this.
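The memory concern above can sometimes be eased at load time. The following is a rough sketch, assuming only some columns are needed and their value ranges are known in advance. It uses the usecols, dtype, and chunksize parameters of read_csv(), and the chosen dtypes and chunk size are illustrative.

```python
import io

import pandas as pd

csv_text = "Name,Age,Salary\nAlice,28,60000\nBob,35,75000\nCharlie,40,90000\n"

# Load only the needed columns, with compact dtypes (assumed to fit the data).
small = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["Age", "Salary"],
    dtype={"Age": "int8", "Salary": "int32"},
)
print(small["Age"].dtype)  # int8

# Process a file in chunks instead of holding it all in memory at once.
total_salary = 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    total_salary += int(chunk["Salary"].sum())
print(total_salary)  # 225000
```

Chunked reading only helps when the analysis can be expressed as an aggregate over pieces; if the full combined DataFrame is needed, reducing columns and dtypes is the more relevant of the two options.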
Error Handling
Directory Existence Check: Before proceeding with file operations, it’s important to check whether the specified directory exists. This helps prevent issues related to non-existent paths.
Empty Directory Handling: If the directory is empty and does not contain any CSV files, an appropriate error or exception should be raised to notify the user. This prevents attempts to process non-existent files.
Data Loading Errors: During the loading of CSV files using pd.read_csv(), potential errors such as file not found, empty file, or incorrect file format should be handled. This ensures smooth data loading without unexpected interruptions.
Concatenation Errors: Errors that may occur during the concatenation step need to be addressed. This includes handling scenarios where the column names or data types in the CSV files are not consistent, leading to difficulties in creating a cohesive DataFrame.
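The four checks above can be combined in one function. The sketch below is one possible arrangement, not a definitive implementation: the helper name load_csv_folder, the choice to skip unreadable files rather than stop, and the strict column comparison are all assumptions.

```python
import os
import tempfile

import pandas as pd


def load_csv_folder(path, extension=".csv"):
    """Hypothetical helper: load and combine all CSV files in a directory."""
    # Directory existence check.
    if not os.path.isdir(path):
        raise FileNotFoundError(f"Directory not found: {path}")

    # Empty directory handling.
    files = sorted(f for f in os.listdir(path) if f.endswith(extension))
    if not files:
        raise ValueError(f"No {extension} files found in {path}")

    # Data loading errors: skip empty or malformed files and remember them.
    loaded, skipped = {}, []
    for file in files:
        try:
            loaded[file] = pd.read_csv(os.path.join(path, file))
        except (pd.errors.EmptyDataError, pd.errors.ParserError):
            skipped.append(file)
    if not loaded:
        raise ValueError("None of the CSV files could be read")

    # Concatenation errors: require every file to match the first file's columns.
    expected = list(next(iter(loaded.values())).columns)
    for file, df in loaded.items():
        if list(df.columns) != expected:
            raise ValueError(
                f"{file} has columns {list(df.columns)}, expected {expected}"
            )

    return pd.concat(list(loaded.values()), ignore_index=True), skipped


# Demo: one valid file and one empty file in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "good.csv"), "w") as fh:
        fh.write("Name,Age,Salary\nAlice,28,60000\n")
    open(os.path.join(tmp, "empty.csv"), "w").close()
    combined, skipped = load_csv_folder(tmp)

print(len(combined), skipped)  # 1 ['empty.csv']
```

Whether to skip a bad file or abort the whole load depends on the project; returning the skipped names at least makes the omission visible to the caller.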
Conclusion
In this article, we have shown you how to import multiple CSV files into pandas and concatenate them into a single DataFrame. We have also provided examples of how to perform additional data cleaning and manipulation to optimize your analysis.
By following these steps and using the pandas library, you can efficiently handle large datasets and automate repetitive tasks in your data analysis projects.