Pandas readexcel with Multiple Sheets and Specific Columns

As a data scientist or software engineer, you often encounter datasets that are spread across multiple sheets in an Excel workbook. Pandas, a popular data manipulation library in Python, provides an easy way to read Excel files into dataframes using the readexcel function. In this article, we will explore how to use this function to read multiple sheets from an Excel file and select specific columns for analysis.

As a data scientist or software engineer, you often encounter datasets that are spread across multiple sheets in an Excel workbook. Pandas, a popular data manipulation library in Python, provides an easy way to read Excel files into dataframes using the read_excel() function. In this article, we will explore how to use this function to read multiple sheets from an Excel file and select specific columns for analysis.

Table of Contents

  1. Introduction
  2. Why Use Pandas for Reading Excel Files?
  3. Reading Multiple Sheets from an Excel File
  4. Selecting Specific Columns from Multiple Sheets
  5. Conclusion

Why Use Pandas for Reading Excel Files?

Pandas provides an efficient and flexible way to work with tabular data, making it a popular choice for data manipulation tasks. The read_excel() function in Pandas allows you to read data from Excel files and create dataframes, which can be further processed and analyzed. Some of the benefits of using Pandas for reading Excel files include:

  • Easy data manipulation: Data can be easily manipulated using Pandas' powerful functions and methods.
  • Data cleaning: You can easily clean and preprocess data using Pandas.
  • Data visualization: Pandas provides several visualization tools that can be used to create meaningful visualizations for data analysis.

Reading Multiple Sheets from an Excel File

To read multiple sheets from an Excel file using Pandas, you need to pass a dictionary to the sheet_name parameter of the read_excel() function. The keys of the dictionary represent the sheet names, and the values represent the sheet numbers or names.

# Import pandas
import pandas as pd

# Read Excel file with multiple sheets
xls = pd.read_excel("D:\\SamNewLocation\\Desktop\\my_file.xlsx", sheet_name=['Sheet1', 'Sheet2'])

# Access individual sheets using sheet names
sheet1_df = xls['Sheet1']
sheet2_df = xls['Sheet2']

print(sheet1_df)
print(sheet2_df)

Output :

        Age Gender          Genre
0      20      1  HipHop Creole
1      23      1  HipHop Creole
2      25      1  HipHop Creole
3      26      1           Jazz
4      29      1           Jazz
...    ..    ...            ...
47509  27      0       Acoustic
47510  30      0       Acoustic
47511  31      0      Classical
47512  34      0      Classical
47513  35      0      Classical

[47514 rows x 3 columns]
    Quantity    Color               Car_brand
0         20      Red            Toyota Camry
1         23    Balck            Ford Mustang
2         25     Blue             Honda Civic
3         26   Yellow        Chevrolet Malibu
4         29     Grey            BMW 3 Series
5         30    White         Volkswagen Golf
6         31    Green           Nissan Altima
7         33   Orange   Mercedes-Benz C-Class
8         37   Violet          Hyundai Sonata
9         20    Brown           Jeep Wrangler
10        21   Silver           Tesla Model S
11        25   Purple          Subaru Outback
12        26     Gold                 Audi Q5
13        27    Olive                Kia Soul
14        30     Aqua           Ford Explorer
15        31   Beige                    Tesla
16        34  Crimson                  Suzuki
17        35    Jade              Toyota RAV4

In the above code snippet, we are reading two sheets from an Excel file named my_file.xlsx. The dictionary passed to the sheet_name parameter specifies that we want to read the first sheet as Sheet1 and the second sheet as Sheet2. The sheet numbers start from 0.

Selecting Specific Columns from Multiple Sheets

Once you have read multiple sheets from an Excel file, you may want to select specific columns for analysis. Pandas provides several ways to select columns from a dataframe. One way is to use the loc[] function, which allows you to select columns by name.

# Select specific columns from multiple sheets
selected_columns = sheets_dict['Sheet2'].loc[:, ['Color', 'Car_brand']]
print(selected_columns)

selected_columns = sheets_dict['Sheet2'].loc[:, ['Color', 'Car_brand']]
print(selected_columns)

In the above code snippet, we are selecting columns Color and Car_brand from the Sheet2 dataframe using the loc[] function. The : in the row index specifies that we want to select all rows.

Output :

        Color             Car_brand
0       Red            Toyota Camry
1     Balck            Ford Mustang
2      Blue             Honda Civic
3    Yellow        Chevrolet Malibu
4      Grey            BMW 3 Series
5     White         Volkswagen Golf
6     Green           Nissan Altima
7    Orange   Mercedes-Benz C-Class
8    Violet          Hyundai Sonata
9     Brown           Jeep Wrangler
10   Silver           Tesla Model S
11   Purple          Subaru Outback
12     Gold                 Audi Q5
13    Olive                Kia Soul
14     Aqua           Ford Explorer
15   Beige                    Tesla
16  Crimson                  Suzuki
17    Jade              Toyota RAV4

Another way to select columns is to use the iloc[] function, which allows you to select columns by index.

# Select specific columns by index
selected_columns = sheets_dict['Sheet1'].iloc[:, [0, 1]]

Output :

       Age   Gender
0      20      1
1      23      1
2      25      1
3      26      1
4      29      1
...    ..    ...
47509  27      0
47510  30      0
47511  31      0
47512  34      0
47513  35      0

In the above code snippet, we are selecting columns 0 and 1 from the Sheet1 dataframe using the iloc[] function. Again, the : in the row index specifies that we want to select all rows.

Conclusion

In this article, we explored how to use Pandas to read multiple sheets from an Excel file and select specific columns for analysis. We learned that Pandas provides an easy way to read Excel files into dataframes using the read_excel() function and that selecting specific columns can be done using the loc[] and iloc[] functions. By using Pandas for reading Excel files, we can easily manipulate, preprocess, and visualize data, making it a powerful tool for data analysis.


About Saturn Cloud

Saturn Cloud is your all-in-one solution for data science & ML development, deployment, and data pipelines in the cloud. Spin up a notebook with 4TB of RAM, add a GPU, connect to a distributed cluster of workers, and more. Request a demo today to learn more.