Python Pandas: How to Skip Columns When Reading a File?

In this blog, discover how to efficiently skip columns when reading files in data processing using Pandas, a versatile Python library for data manipulation and analysis. Ideal for data scientists and software engineers looking to save time and resources.

As a data scientist or a software engineer, you might have faced a scenario where you need to read a file but want to skip some columns in it. This is a common requirement in data processing, where the data may contain unnecessary or irrelevant columns that need to be skipped to save memory and processing time. Pandas is a popular Python library for data manipulation and analysis, and it offers a simple and flexible way to read files while skipping columns.

In this blog post, we will discuss how to skip columns when reading a file using Pandas. We will cover the following topics:

  • Reading a file with Pandas
  • Skipping columns using index or name
  • Handling missing values
  • Conclusion

Reading a File with Pandas

Before we dive into skipping columns, let’s first understand how to read a file using Pandas. Pandas provides several functions to read different file formats, such as CSV, Excel, JSON, and more. For this blog post, we will focus on reading a CSV file using the read_csv() function.

import pandas as pd

# Read a CSV file
df = pd.read_csv('data.csv')
print(df)

Output:

      Name  Age           City
0    Alice   25       New York
1      Bob   30    Los Angeles
2  Charlie   35  San Francisco
3    David   40        Chicago
4    Marie   20     Washington

The read_csv() function reads a CSV file and returns a DataFrame, which is a two-dimensional labeled data structure with columns of potentially different types. By default, Pandas assumes that the first row of the CSV file contains column names, and it uses them as column labels. If your CSV file does not have column names, you can pass header=None to the read_csv() function.
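To see the effect of header=None without needing a file on disk, here is a minimal sketch that reads from an in-memory string. The sample rows and the variable names (raw, df_no_header, df_named) are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical headerless data standing in for a CSV file on disk.
raw = io.StringIO("Alice,25,New York\nBob,30,Los Angeles\n")

# header=None tells pandas that the first row is data, not column names.
# The columns are then labeled with integers.
df_no_header = pd.read_csv(raw, header=None)
print(df_no_header.columns.tolist())  # [0, 1, 2]

# names= supplies labels for a file that has none.
raw.seek(0)
df_named = pd.read_csv(raw, header=None, names=["Name", "Age", "City"])
print(df_named)
```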

Skipping Columns Using Index or Name

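Now, let’s see how to skip columns while reading a file using Pandas. The read_csv() function has no parameter that lists columns to exclude; instead, the usecols parameter lists the columns to keep, and every other column is skipped. You can identify the columns to keep by index or by name.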

Skipping Columns by Index

To skip columns by index, you can use the usecols parameter of the read_csv() function. This parameter accepts a list of zero-based column indices to include in the DataFrame; any column not listed is skipped. For example, if you want to skip the first column (Name, index 0) of the file above, you can pass [1, 2] to the usecols parameter.

# Skip columns by index
df = pd.read_csv('data.csv', usecols=[1, 2])
print(df)

Output:

   Age           City
0   25       New York
1   30    Los Angeles
2   35  San Francisco
3   40        Chicago
4   20     Washington
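If it is more natural to think in terms of the positions to leave out, one possible approach is to read only the header row first and build the usecols list from it. This is a sketch under assumptions: CSV_TEXT is made-up sample data, and skip, n_cols, keep, and df_kept are illustrative names.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for data.csv.
CSV_TEXT = """Name,Age,City
Alice,25,New York
Bob,30,Los Angeles
"""

skip = {0}  # zero-based positions of the columns to leave out

# nrows=0 reads just the header, which is enough to count the columns.
n_cols = len(pd.read_csv(io.StringIO(CSV_TEXT), nrows=0).columns)
keep = [i for i in range(n_cols) if i not in skip]

df_kept = pd.read_csv(io.StringIO(CSV_TEXT), usecols=keep)
print(df_kept.columns.tolist())  # ['Age', 'City']
```

The extra header read costs very little because no data rows are parsed.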

Skipping Columns by Name

To skip columns by name, you can use the usecols parameter with a list of column names to include in the DataFrame. For example, if you want to skip the Age column of the file above, you can pass ['Name', 'City'] to the usecols parameter.

# Skip columns by name
df = pd.read_csv('data.csv', usecols=['Name', 'City'])
print(df)

Output:

      Name           City
0    Alice       New York
1      Bob    Los Angeles
2  Charlie  San Francisco
3    David        Chicago
4    Marie     Washington
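When a file has many columns and only a few are unwanted, listing every column to keep gets tedious. usecols also accepts a callable that is evaluated against each column name, so the exclusion can be written directly. The sketch below uses made-up sample data, and unwanted and df_without_age are illustrative names.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for data.csv.
CSV_TEXT = """Name,Age,City
Alice,25,New York
Bob,30,Los Angeles
"""

unwanted = {"Age"}  # names of the columns to skip

# The callable receives each column name; columns for which it
# returns True are kept.
df_without_age = pd.read_csv(
    io.StringIO(CSV_TEXT),
    usecols=lambda name: name not in unwanted,
)
print(df_without_age.columns.tolist())  # ['Name', 'City']
```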

Note that if your CSV file does not have column names, you can pass header=None to the read_csv() function and use column indices instead of names.
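As a rough illustration of that note, the sketch below combines header=None with index-based usecols on made-up headerless data. The kept columns retain their original integer positions as labels.

```python
import io

import pandas as pd

# Hypothetical headerless data: name, age, city.
raw = io.StringIO("Alice,25,New York\nBob,30,Los Angeles\n")

# Keep the first and third fields, skipping the second (index 1).
df_subset = pd.read_csv(raw, header=None, usecols=[0, 2])
print(df_subset)
```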

Handling Missing Values

Skipping columns does not create missing values, but the columns you keep may still contain them. Pandas provides several methods to handle missing values, such as isna(), fillna(), and dropna().

Let’s consider a CSV file that loads into the following DataFrame, where the Age value for Stuart is missing:

      Name   Age           City
0    Alice  25.0       New York
1      Bob  30.0    Los Angeles
2  Charlie  35.0  San Francisco
3    David  40.0        Chicago
4    Marie  20.0     Washington
5   Stuart   NaN         Nevada

isna()

The isna() function returns a Boolean mask indicating which values are missing (NaN or None).

# Check for missing values
print(df.isna())

Output:

    Name    Age   City
0  False  False  False
1  False  False  False
2  False  False  False
3  False  False  False
4  False  False  False
5  False   True  False
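On larger files the full Boolean mask is hard to scan, so a common follow-up is to sum it and get a count of missing values per column. A minimal sketch on made-up data (people and missing_per_column are illustrative names):

```python
import io

import pandas as pd

# Hypothetical sample data; Stuart's Age field is empty.
CSV_TEXT = """Name,Age,City
Alice,25,New York
Stuart,,Nevada
"""

people = pd.read_csv(io.StringIO(CSV_TEXT))

# True counts as 1 when summed, so this yields one count per column.
missing_per_column = people.isna().sum()
print(missing_per_column)
```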

fillna()

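The fillna() function returns a copy of the DataFrame in which missing values are replaced by a value you specify. For example, you can fill missing values with 0 using the following code. The result is stored under a new name so that df still holds the original data, including the missing value:

# Fill missing values with 0, leaving the original df unchanged
df_filled = df.fillna(0)
print(df_filled)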

Output:

      Name   Age           City
0    Alice  25.0       New York
1      Bob  30.0    Los Angeles
2  Charlie  35.0  San Francisco
3    David  40.0        Chicago
4    Marie  20.0     Washington
5   Stuart   0.0         Nevada
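A single fill value such as 0 is not always sensible (an age of 0 would skew an average, for instance). fillna() also accepts a dictionary that maps column names to fill values, so each column can be treated separately. The sketch below fills Age with the column mean; the data and the names people and filled are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical sample data; Stuart's Age field is empty.
CSV_TEXT = """Name,Age,City
Alice,25,New York
Bob,35,Los Angeles
Stuart,,Nevada
"""

people = pd.read_csv(io.StringIO(CSV_TEXT))

# mean() ignores NaN by default, so this is the mean of 25 and 35.
filled = people.fillna({"Age": people["Age"].mean()})
print(filled)
```

Whether the mean is an appropriate stand-in depends on the data; it is used here only to show the per-column form.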

dropna()

The dropna() function removes rows or columns with missing values. For example, you can remove rows with missing values using the following code:

# Remove rows with missing values
df = df.dropna()
print(df)

Output:

      Name   Age           City
0    Alice  25.0       New York
1      Bob  30.0    Los Angeles
2  Charlie  35.0  San Francisco
3    David  40.0        Chicago
4    Marie  20.0     Washington
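dropna() can be narrowed with its subset and axis parameters. The sketch below, on made-up data, contrasts dropping rows only when Age is missing with dropping every column that contains a missing value. The names people, rows_with_age, and complete_columns are illustrative.

```python
import io

import pandas as pd

# Hypothetical sample data: Stuart has no Age, Marie has no City.
CSV_TEXT = """Name,Age,City
Alice,25,New York
Stuart,,Nevada
Marie,20,
"""

people = pd.read_csv(io.StringIO(CSV_TEXT))

# Drop a row only if its Age is missing; Marie stays despite the empty City.
rows_with_age = people.dropna(subset=["Age"])
print(rows_with_age["Name"].tolist())  # ['Alice', 'Marie']

# axis=1 drops columns instead of rows; only Name has no gaps.
complete_columns = people.dropna(axis=1)
print(complete_columns.columns.tolist())  # ['Name']
```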

Conclusion

In this blog post, we have discussed how to skip columns when reading a file using Pandas. We have seen two ways to skip columns with the usecols parameter: by index and by name. We have also discussed how to handle missing values in the columns that remain. Pandas is a versatile library that provides powerful tools for data manipulation and analysis, and we hope this blog post has helped you in your data processing tasks.

