Pandas readexcel with Multiple Sheets and Specific Columns
As a data scientist or software engineer, you often encounter datasets that are spread across multiple sheets in an Excel workbook. Pandas, a popular data manipulation library in Python, provides an easy way to read Excel files into dataframes using the read_excel()
function. In this article, we will explore how to use this function to read multiple sheets from an Excel file and select specific columns for analysis.
Table of Contents
- Introduction
- Why Use Pandas for Reading Excel Files?
- Reading Multiple Sheets from an Excel File
- Selecting Specific Columns from Multiple Sheets
- Conclusion
Why Use Pandas for Reading Excel Files?
Pandas provides an efficient and flexible way to work with tabular data, making it a popular choice for data manipulation tasks. The read_excel()
function in Pandas allows you to read data from Excel files and create dataframes, which can be further processed and analyzed. Some of the benefits of using Pandas for reading Excel files include:
- Easy data manipulation: Data can be easily manipulated using Pandas' powerful functions and methods.
- Data cleaning: You can easily clean and preprocess data using Pandas.
- Data visualization: Pandas provides several visualization tools that can be used to create meaningful visualizations for data analysis.
Reading Multiple Sheets from an Excel File
To read multiple sheets from an Excel file using Pandas, you need to pass a dictionary to the sheet_name
parameter of the read_excel()
function. The keys of the dictionary represent the sheet names, and the values represent the sheet numbers or names.
# Import pandas
import pandas as pd
# Read Excel file with multiple sheets
xls = pd.read_excel("D:\\SamNewLocation\\Desktop\\my_file.xlsx", sheet_name=['Sheet1', 'Sheet2'])
# Access individual sheets using sheet names
sheet1_df = xls['Sheet1']
sheet2_df = xls['Sheet2']
print(sheet1_df)
print(sheet2_df)
Output :
Age Gender Genre
0 20 1 HipHop Creole
1 23 1 HipHop Creole
2 25 1 HipHop Creole
3 26 1 Jazz
4 29 1 Jazz
... .. ... ...
47509 27 0 Acoustic
47510 30 0 Acoustic
47511 31 0 Classical
47512 34 0 Classical
47513 35 0 Classical
[47514 rows x 3 columns]
Quantity Color Car_brand
0 20 Red Toyota Camry
1 23 Balck Ford Mustang
2 25 Blue Honda Civic
3 26 Yellow Chevrolet Malibu
4 29 Grey BMW 3 Series
5 30 White Volkswagen Golf
6 31 Green Nissan Altima
7 33 Orange Mercedes-Benz C-Class
8 37 Violet Hyundai Sonata
9 20 Brown Jeep Wrangler
10 21 Silver Tesla Model S
11 25 Purple Subaru Outback
12 26 Gold Audi Q5
13 27 Olive Kia Soul
14 30 Aqua Ford Explorer
15 31 Beige Tesla
16 34 Crimson Suzuki
17 35 Jade Toyota RAV4
In the above code snippet, we are reading two sheets from an Excel file named my_file.xlsx
. The dictionary passed to the sheet_name
parameter specifies that we want to read the first sheet as Sheet1
and the second sheet as Sheet2
. The sheet numbers start from 0.
Selecting Specific Columns from Multiple Sheets
Once you have read multiple sheets from an Excel file, you may want to select specific columns for analysis. Pandas provides several ways to select columns from a dataframe. One way is to use the loc[]
function, which allows you to select columns by name.
# Select specific columns from multiple sheets
selected_columns = sheets_dict['Sheet2'].loc[:, ['Color', 'Car_brand']]
print(selected_columns)
selected_columns = sheets_dict['Sheet2'].loc[:, ['Color', 'Car_brand']]
print(selected_columns)
In the above code snippet, we are selecting columns Color
and Car_brand
from the Sheet2
dataframe using the loc[]
function. The :
in the row index specifies that we want to select all rows.
Output :
Color Car_brand
0 Red Toyota Camry
1 Balck Ford Mustang
2 Blue Honda Civic
3 Yellow Chevrolet Malibu
4 Grey BMW 3 Series
5 White Volkswagen Golf
6 Green Nissan Altima
7 Orange Mercedes-Benz C-Class
8 Violet Hyundai Sonata
9 Brown Jeep Wrangler
10 Silver Tesla Model S
11 Purple Subaru Outback
12 Gold Audi Q5
13 Olive Kia Soul
14 Aqua Ford Explorer
15 Beige Tesla
16 Crimson Suzuki
17 Jade Toyota RAV4
Another way to select columns is to use the iloc[]
function, which allows you to select columns by index.
# Select specific columns by index
selected_columns = sheets_dict['Sheet1'].iloc[:, [0, 1]]
Output :
Age Gender
0 20 1
1 23 1
2 25 1
3 26 1
4 29 1
... .. ...
47509 27 0
47510 30 0
47511 31 0
47512 34 0
47513 35 0
In the above code snippet, we are selecting columns 0 and 1 from the Sheet1
dataframe using the iloc[]
function. Again, the :
in the row index specifies that we want to select all rows.
Conclusion
In this article, we explored how to use Pandas to read multiple sheets from an Excel file and select specific columns for analysis. We learned that Pandas provides an easy way to read Excel files into dataframes using the read_excel()
function and that selecting specific columns can be done using the loc[]
and iloc[]
functions. By using Pandas for reading Excel files, we can easily manipulate, preprocess, and visualize data, making it a powerful tool for data analysis.
About Saturn Cloud
Saturn Cloud is your all-in-one solution for data science & ML development, deployment, and data pipelines in the cloud. Spin up a notebook with 4TB of RAM, add a GPU, connect to a distributed cluster of workers, and more. Request a demo today to learn more.
Saturn Cloud provides customizable, ready-to-use cloud environments for collaborative data teams.
Try Saturn Cloud and join thousands of users moving to the cloud without
having to switch tools.