How to Filter Out Records with Null or Empty Strings in Python Pandas

As a data scientist or software engineer, you know how important it is to clean and preprocess data before using it for analysis or modeling. One common task in data preprocessing is to filter out records with null or empty strings for a given field. In this article, we will explore how to accomplish this task using Python Pandas, a popular library for data manipulation and analysis.

Table of Contents

  1. What is Pandas?
  2. Filtering out Null or Empty Strings in Pandas
  3. Filtering out Null or Empty Strings for Multiple Fields
  4. Handling Null or Empty Strings in a Pandas DataFrame
  5. Conclusion

What is Pandas?

Pandas is an open-source Python library that provides data structures and tools for efficient data manipulation and analysis. It is built on top of NumPy, another popular library for numerical computing in Python. Pandas provides two main data structures: Series and DataFrame. A Series is a one-dimensional array-like object that can hold any data type, while a DataFrame is a two-dimensional table-like data structure with rows and columns.
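
As a quick illustration of the two structures, here is a minimal sketch. The variable names ages and people are made up for this example and are not used elsewhere in the article.

```python
import pandas as pd

# A Series: a one-dimensional, labeled array of values
ages = pd.Series([25, 30, 35], name="age")

# A DataFrame: a two-dimensional table in which each column is a Series
people = pd.DataFrame({"name": ["Alice", "Bob", "David"], "age": ages})

print(people.shape)                   # (rows, columns)
print(type(people["age"]).__name__)   # selecting one column gives a Series
```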

Filtering out Null or Empty Strings in Pandas

Consider the following DataFrame:

      name   age      city
0    Alice  25.0  New York
1      Bob  30.0     Paris
2  Charlie   NaN          
3    David  35.0    London

Using dropna() Method

The dropna() method removes missing values (i.e., null, NaN, or None) from a DataFrame. It does not treat an empty string as missing, so empty strings must either be converted to a missing value first or be filtered with a separate condition. We can specify the axis along which to drop the missing values (i.e., rows or columns) and, through the subset parameter, the columns to consider.

import pandas as pd

# create a sample DataFrame
data = {'name': ['Alice', 'Bob', 'Charlie', 'David'],
        'age': [25, 30, None, 35],
        'city': ['New York', 'Paris', '', 'London']}
df = pd.DataFrame(data)

# filter out records using dropna method
df_filtered = df.dropna()

print(df_filtered)

Output:

    name   age      city
0  Alice  25.0  New York
1    Bob  30.0     Paris
3  David  35.0    London

In this example, we create a sample DataFrame with four records and three fields: name, age, and city. The third record has a missing age and an empty string in the city field. Calling dropna() drops every row that contains at least one missing value, so the third record is removed because its age is NaN. The empty string in city is not what triggers the drop, since dropna() does not consider '' to be missing. The result is stored in a new DataFrame called df_filtered.

Note that the dropna() method returns a new DataFrame by default, so we need to assign the result to a new variable if we want to keep the filtered DataFrame.
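
The sketch below shows how dropna() behaves when restricted to the city column through the subset parameter, and one possible way to make it catch empty strings: converting them to NaN first. Using replace() with np.nan for that conversion is a choice made for this illustration, not the only option.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "David"],
    "age": [25, 30, None, 35],
    "city": ["New York", "Paris", "", "London"],
})

# dropna() does not treat '' as missing: limited to 'city', it drops nothing
only_city = df.dropna(subset=["city"])
print(len(only_city))

# convert empty strings to NaN first, then dropna() can detect them
cleaned = df.replace("", np.nan).dropna(subset=["city"])
print(cleaned["name"].tolist())
```

With the conversion step in place, only the record whose city was empty is removed, regardless of what the other columns contain.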

Using notna() Method

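The notna() method returns a boolean mask indicating the positions of non-null (non-NA) values. We can use this mask directly to keep only the records whose value in a given column is not null. Like dropna(), notna() does not flag empty strings, because an empty string is a valid, non-null value.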

import pandas as pd

# create a sample DataFrame
data = {'name': ['Alice', 'Bob', 'Charlie', 'David'],
        'age': [25, 30, None, 35],
        'city': ['New York', 'Paris', '', 'London']}
df = pd.DataFrame(data)

# filter out records with null values in the 'age' column using the notna() method
df_filtered = df[df['age'].notna()]

print(df_filtered)

Output:

    name   age      city
0  Alice  25.0  New York
1    Bob  30.0     Paris
3  David  35.0    London

In this example, the notna() method is used to create a boolean mask that identifies non-null values in the ‘age’ column. This mask is then used to keep only the records whose age is not null, resulting in the DataFrame df_filtered. The third record is removed because its age is NaN, not because its city is an empty string.

This method offers an alternative to dropna() for filtering out records with null values in a specific column.
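
When a single column may contain both nulls and empty strings, the two checks can be combined. The following is a minimal sketch; the sample data here is a variation on the article's DataFrame, with a None added to the city column purely for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "David"],
    "city": ["New York", None, "", "London"],
})

# keep rows where 'city' is neither null nor an empty string
mask = df["city"].notna() & (df["city"] != "")
df_filtered = df[mask]

print(df_filtered["name"].tolist())
```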

Using boolean indexing

Another method is to use boolean indexing to filter the DataFrame based on conditions. We can create a boolean mask to identify records with null or empty strings and use it to filter the DataFrame. Here’s an example:

import pandas as pd

# create a sample DataFrame
data = {'name': ['Alice', 'Bob', 'Charlie', 'David'],
        'age': [25, 30, None, 35],
        'city': ['New York', 'Paris', '', 'London']}
df = pd.DataFrame(data)

# filter out records with empty strings in the 'city' column
df_filtered = df[df['city'] != '']

print(df_filtered)

Output:

    name   age      city
0  Alice  25.0  New York
1    Bob  30.0     Paris
3  David  35.0    London

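In this example, we use the != operator to create a boolean mask that is True for every record whose 'city' value is not an empty string. Indexing the DataFrame with this mask drops the records with an empty city, and we store the result in a new DataFrame called df_filtered. Note that this condition checks only for empty strings, so a null value in the city field would not be removed by it.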
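
Real data often contains strings that look empty but hold only spaces. Assuming that such whitespace-only values should also count as empty (an assumption of this sketch, not something the sample data requires), the column can be stripped before the comparison:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "David"],
    "city": ["New York", "   ", "", "London"],
})

# strip surrounding whitespace so that '   ' is compared as ''
mask = df["city"].str.strip() != ""
df_filtered = df[mask]

print(df_filtered["name"].tolist())
```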

Filtering out Null or Empty Strings for Multiple Fields

If we want to filter out records with empty strings in multiple fields, we can write one condition per field and combine them with the & operator. Each condition must be wrapped in parentheses. Here is an example:

Consider the following DataFrame:

      name   age      city country
0    Alice  25.0  New York     USA
1      Bob  30.0     Paris        
2  Charlie   NaN            France
3    David  35.0    London      UK

import pandas as pd

# create a sample DataFrame
data = {'name': ['Alice', 'Bob', 'Charlie', 'David'],
        'age': [25, 30, None, 35],
        'city': ['New York', 'Paris', '', 'London'],
        'country': ['USA', '', 'France', 'UK']}
df = pd.DataFrame(data)

# filter out records with empty strings in the 'city' and 'country' columns
df_filtered = df[(df['city'] != '') & (df['country'] != '')]

print(df_filtered)

Output:

    name   age      city country
0  Alice  25.0  New York     USA
3  David  35.0    London      UK

In this example, we create a sample DataFrame with four records and four fields: name, age, city, and country. The third record has an empty string in the city field and the second record has an empty string in the country field, and we want to filter out both. We use the != operator to check each field for empty strings, combine the two conditions with &, and store the matching records in a new DataFrame called df_filtered.
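
When the number of fields grows, writing one condition per field becomes repetitive. One possible generalization, sketched below, builds the mask from a list of column names. The list name cols is hypothetical, and the mask also excludes nulls in those columns, which goes slightly beyond the example above.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "David"],
    "age": [25, 30, None, 35],
    "city": ["New York", "Paris", "", "London"],
    "country": ["USA", "", "France", "UK"],
})

cols = ["city", "country"]  # hypothetical list of columns to check

# True only for rows with a non-null, non-empty value in every listed column
mask = (df[cols].notna() & df[cols].ne("")).all(axis=1)
df_filtered = df[mask]

print(df_filtered["name"].tolist())
```

Because the age column is not in cols, a missing age alone would not cause a record to be dropped.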

Handling Null or Empty Strings in a Pandas DataFrame

Replacing Null or Empty Strings with New String

To handle null or empty strings in a Pandas DataFrame, we can replace them with a default value or with values derived from other columns. Here’s an example of replacing empty strings in the ‘city’ column with ‘Unknown’:

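# df is the three-column sample DataFrame (name, age, city) from the first example
# replace empty strings in the 'city' column with 'Unknown'
df['city'] = df['city'].replace('', 'Unknown')

print(df)

Output:

      name   age      city
0    Alice  25.0  New York
1      Bob  30.0     Paris
2  Charlie   NaN   Unknown
3    David  35.0    London

In this example, we use the replace() method to replace empty strings in the ‘city’ column with ‘Unknown’ and assign the result back to the column. Assigning back is more reliable than calling replace() with inplace=True on a selected column, which in recent pandas versions may operate on a temporary copy and leave the DataFrame unchanged.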
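
If the same column can hold both empty strings and true nulls, replace() alone leaves the nulls untouched. A minimal sketch of handling both follows, using a variation of the sample data with a None added to the city column for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "David"],
    "city": ["New York", None, "", "London"],
})

# replace '' first, then fill the nulls, so both end up as 'Unknown'
df["city"] = df["city"].replace("", "Unknown").fillna("Unknown")

print(df["city"].tolist())
```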

Handling Null or Empty Numeric Values in the DataFrame

To handle null or empty values in the 'age' column, we can fill them with the mean age of the dataset. Here’s how:

# fill null values in the 'age' column with the mean age
mean_age = df['age'].mean()
df['age'] = df['age'].fillna(mean_age)

print(df)

Output:

      name   age      city
0    Alice  25.0  New York
1      Bob  30.0     Paris
2  Charlie  30.0   Unknown
3    David  35.0    London

In this example, we calculate the mean age of the dataset using the mean() method, which skips null values by default, and fill the null values in the 'age' column with that mean using the fillna() method. The result is assigned back to the column rather than relying on inplace=True on a selected column.
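
fillna() also accepts a dictionary that maps column names to fill values, which allows several columns to be handled in one call. The sketch below assumes the median is an acceptable default for age and 'Unknown' for city. Both choices, and the name defaults, are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "David"],
    "age": [25, 30, None, 35],
    "city": ["New York", "Paris", None, "London"],
})

# one fill value per column; columns not listed are left unchanged
defaults = {"age": df["age"].median(), "city": "Unknown"}
df = df.fillna(defaults)

print(df)
```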

Conclusion

Filtering out records with null or empty strings in a given field is a common data preprocessing task in data analysis and modeling. In this article, we have explored how to accomplish this task using Python Pandas, a powerful library for data manipulation and analysis. The dropna() and notna() methods remove records with null values, while boolean indexing with the != operator removes records with empty strings in one or more fields. When dropping records is not desirable, replace() and fillna() can substitute default values instead. We hope this article has been helpful and informative for your data science or software engineering projects.

