How to exclude certain columns of a pandas dataframe

As a data scientist or software engineer, working with data is a daily routine. One of the most common tasks is to manipulate data and extract meaningful insights from it. Pandas is a widely used library in Python for data manipulation, which provides a lot of functionality for data cleaning and analysis. In this article, we will discuss how to exclude certain columns of a pandas dataframe.

What is a Pandas Dataframe?

A pandas dataframe is a two-dimensional, size-mutable, tabular data structure with labeled axes (rows and columns). It is a container of data that can hold different types of data, such as integers, floats, strings, and more. A dataframe can be created in many ways, such as from a CSV file, a SQL database, or by manually creating it using Python.
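As a minimal sketch of two of these creation routes, the snippet below builds one dataframe from a Python dictionary and another from CSV text. The column names and the in-memory CSV string are illustrative assumptions; in practice a file path would usually be passed to pd.read_csv().

```python
import io

import pandas as pd

# Build a dataframe manually from a dictionary of column name -> values
df_manual = pd.DataFrame({'A': [1, 2, 3], 'B': ['x', 'y', 'z']})

# Build a dataframe from CSV text; io.StringIO stands in for a file on disk
csv_text = "A,B\n1,x\n2,y\n3,z\n"
df_csv = pd.read_csv(io.StringIO(csv_text))

print(df_manual)
print(df_csv)
```

Both dataframes hold an integer column and a string column, which illustrates that a single dataframe can mix data types across columns.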

Excluding Columns from a Pandas Dataframe

There are many scenarios in which we need to exclude certain columns from a pandas dataframe. For example, when we have a large dataset with many columns, and we need only a subset of columns to work with. Or, when we have some columns that contain irrelevant data and we want to remove them from the dataframe. There are several ways to exclude columns from a pandas dataframe.

Using the drop() method

The drop() method removes rows or columns from a pandas dataframe. Its two most relevant arguments here are labels and axis: labels specifies the rows or columns to remove, and axis specifies whether those labels refer to rows or columns. By default, axis=0, which means rows are removed. To remove columns, we need to set axis=1.

To exclude one or more columns from a pandas dataframe, we can use the drop() method with the axis=1 argument. For example, let’s say we have a dataframe df with four columns ‘A’, ‘B’, ‘C’, and ‘D’:

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9], 'D': [10, 11, 12]})

Output:

   A  B  C   D
0  1  4  7  10
1  2  5  8  11
2  3  6  9  12

To exclude column ‘D’ from the dataframe, we can use the following code:

df = df.drop('D', axis=1)

Output:

   A  B  C
0  1  4  7
1  2  5  8
2  3  6  9

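This will remove column ‘D’ from the dataframe df. If we want to exclude multiple columns, we can pass a list of column names to the labels argument of the drop() method. For example, starting again from the original four-column dataframe (column ‘D’ was already removed above, and dropping it a second time would raise a KeyError), we can exclude columns ‘C’ and ‘D’ with the following code: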

df = df.drop(['C', 'D'], axis=1)

Output:

   A  B
0  1  4
1  2  5
2  3  6

This will remove columns ‘C’ and ‘D’ from the dataframe df.
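The drop() method also accepts a columns keyword, which is equivalent to passing labels together with axis=1 and can read more clearly. The sketch below assumes the same four-column dataframe as above; the variable names without_d and without_cd are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9], 'D': [10, 11, 12]})

# columns=... is equivalent to passing the labels with axis=1
without_d = df.drop(columns='D')
without_cd = df.drop(columns=['C', 'D'])

print(list(without_d.columns))
print(list(without_cd.columns))

# drop() returns a new dataframe, so the original still has all four columns
print(list(df.columns))
```

Because each result is assigned to a new name, both calls start from the full dataframe and neither depends on the other.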

Using the loc[] method

The loc[] indexer is used to access a group of rows and columns by labels or a boolean array. It has no built-in way to exclude columns, but we can get the same result by selecting only the columns we want to keep. To do this, we pass a colon (:) as the row selector, which selects all rows, followed by a list of the column names to keep. For example, let’s say we have a dataframe df with four columns ‘A’, ‘B’, ‘C’, and ‘D’:

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9], 'D': [10, 11, 12]})

Output:

   A  B  C   D
0  1  4  7  10
1  2  5  8  11
2  3  6  9  12

To exclude column ‘D’ from the dataframe, we can use the following code:

df = df.loc[:, ['A', 'B', 'C']]

Output:

   A  B  C
0  1  4  7
1  2  5  8
2  3  6  9

This will select all rows and columns ‘A’, ‘B’, and ‘C’ from the dataframe df, excluding column ‘D’. If we want to exclude multiple columns, we leave more names out of the list passed to loc[]. For example, to exclude columns ‘C’ and ‘D’, we can use the following code:

df = df.loc[:, ['A', 'B']]

Output:

   A  B
0  1  4
1  2  5
2  3  6

This will select all rows and columns ‘A’ and ‘B’ from the dataframe df, excluding columns ‘C’ and ‘D’.
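Listing every column to keep becomes tedious when the dataframe is wide. One possible alternative, sketched below, is to pass loc[] a boolean mask built from the names to exclude. The names columns_to_exclude, keep_mask, and result are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9], 'D': [10, 11, 12]})

columns_to_exclude = ['C', 'D']

# isin() marks the columns to exclude; ~ inverts the mask to mark the ones to keep
keep_mask = ~df.columns.isin(columns_to_exclude)

# Select all rows and only the columns where the mask is True
result = df.loc[:, keep_mask]

print(result)
```

The mask keeps the remaining columns in their original order, and names in columns_to_exclude that are not present in the dataframe are simply ignored rather than raising an error.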

Using the iloc[] method

The iloc[] indexer is used to access a group of rows and columns by integer positions, counted from zero. As with loc[], we exclude columns indirectly by selecting only the positions of the columns we want to keep: a colon (:) selects all rows, followed by a list of column positions. For example, let’s say we have a dataframe df with four columns ‘A’, ‘B’, ‘C’, and ‘D’:

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9], 'D': [10, 11, 12]})

Output:

   A  B  C   D
0  1  4  7  10
1  2  5  8  11
2  3  6  9  12

To exclude column ‘D’ from the dataframe, we can use the following code:

df = df.iloc[:, [0, 1, 2]]

Output:

   A  B  C
0  1  4  7
1  2  5  8
2  3  6  9

This will select all rows and columns at positions 0, 1, and 2 from the dataframe df, excluding column ‘D’. If we want to exclude multiple columns, we leave more positions out of the list passed to iloc[]. For example, to exclude columns ‘C’ and ‘D’, we can use the following code:

df = df.iloc[:, [0, 1]]

Output:

   A  B
0  1  4
1  2  5
2  3  6

This will select all rows and columns at positions 0 and 1 from the dataframe df, excluding columns ‘C’ and ‘D’.
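When it is easier to state the positions to exclude than the positions to keep, the list for iloc[] can be computed. The sketch below assumes the same four-column dataframe; positions_to_exclude and keep_positions are illustrative names. It also shows a slice, which is convenient when the excluded columns sit at the end.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9], 'D': [10, 11, 12]})

positions_to_exclude = {2, 3}

# Keep every column position that is not in the exclusion set
keep_positions = [i for i in range(df.shape[1]) if i not in positions_to_exclude]
result = df.iloc[:, keep_positions]

# A slice works when the excluded columns are contiguous at the end:
# ":-1" keeps every column except the last one
without_last = df.iloc[:, :-1]

print(result)
print(without_last)
```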

Error Handling

  1. Handling Nonexistent Columns: When using the drop() method, it’s crucial to check whether the columns you are trying to exclude actually exist in the dataframe. Attempting to drop a non-existent column raises a KeyError, so verify that a column exists before dropping it.

# Example of handling a non-existent column
column_to_exclude = 'E'
if column_to_exclude in df.columns:
    df = df.drop(column_to_exclude, axis=1)
else:
    print(f"Column '{column_to_exclude}' does not exist in the dataframe.")

  2. Indexing Errors with iloc[]: When using iloc[], ensure that the column positions you provide are within the valid range of the dataframe’s columns. Attempting to access a position outside that range raises an IndexError.

# Example of checking valid column positions
# (positions 4 and 5 do not exist in a four-column dataframe)
columns_to_exclude = [4, 5]
valid_indices = range(df.shape[1])

if all(idx in valid_indices for idx in columns_to_exclude):
    df = df.iloc[:, [idx for idx in valid_indices if idx not in columns_to_exclude]]
else:
    print("Invalid column indices provided.")

  3. Alternative Syntax for loc[] with Column Names: The loc[] examples above, such as df = df.loc[:, ['A', 'B', 'C']], list the columns to keep. To name the columns to exclude instead, we can use the difference() method of df.columns. Starting from the original four-column dataframe, the following code excludes columns ‘C’ and ‘D’:

# Alternative syntax to exclude columns 'C' and 'D' using difference
df = df.loc[:, df.columns.difference(['C', 'D'])]

Output:

   A  B
0  1  4
1  2  5
2  3  6

This alternative states the excluded columns explicitly. Note that difference() returns the remaining labels in sorted order by default, so the column order of the result may differ from the original dataframe when the columns are not already sorted.

  4. In-Place Modification: By default, drop() returns a new dataframe and leaves the original unchanged; with inplace=True it modifies the original dataframe and returns None. Selecting with loc[] or iloc[] always returns a new object. To keep the original dataframe unchanged, assign the result to a new variable.

# Creating a new dataframe without modifying the original four-column df
new_df = df.drop(['C', 'D'], axis=1)

# In-place modification: df itself loses column 'D'
df.drop('D', axis=1, inplace=True)
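For the nonexistent-column case in the first item, drop() also accepts errors='ignore', which skips labels that are not present instead of raising a KeyError. The sketch below assumes the original four-column dataframe and a hypothetical column ‘E’ that does not exist.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9], 'D': [10, 11, 12]})

# Without errors='ignore', dropping a missing label raises a KeyError
try:
    df.drop(columns=['E'])
    raised = False
except KeyError:
    raised = True

# With errors='ignore', 'D' is dropped and the missing 'E' is silently skipped
result = df.drop(columns=['D', 'E'], errors='ignore')

print(raised)
print(list(result.columns))
```

This is shorter than an explicit membership check, but it also hides typos in column names, so the explicit check may be preferable when a missing column should be reported.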

Conclusion

In this article, we have discussed how to exclude certain columns from a pandas dataframe. We have explored three different methods: using the drop() method, using the loc[] method, and using the iloc[] method. Depending on the scenario and the preference of the user, any of these methods can be used to exclude columns from a pandas dataframe. With these methods, data scientists and software engineers can easily manipulate data and extract meaningful insights from it.

