How to Filter Out Records with Null or Empty Strings in Python Pandas
As a data scientist or software engineer, you know how important it is to clean and preprocess data before using it for analysis or modeling. One common task in data preprocessing is to filter out records with null or empty strings for a given field. In this article, we will explore how to accomplish this task using Python Pandas, a popular library for data manipulation and analysis.
Table of Contents
- What is Pandas?
- Filtering out Null or Empty Strings in Pandas
- Filtering out Null or Empty Strings for Multiple Fields
- Handling Null or Empty Strings in a Pandas DataFrame
- Conclusion
What is Pandas?
Pandas is an open-source Python library that provides data structures and tools for efficient data manipulation and analysis. It is built on top of NumPy, another popular library for numerical computing in Python. Pandas provides two main data structures: Series and DataFrame. A Series is a one-dimensional array-like object that can hold any data type, while a DataFrame is a two-dimensional table-like data structure with rows and columns.
Filtering out Null or Empty Strings in Pandas
Consider the following DataFrame:
name age city
0 Alice 25.0 New York
1 Bob 30.0 Paris
2 Charlie NaN
3 David 35.0 London
Using the dropna() Method
To filter out records with null values in Pandas, we can use the dropna() method. This method removes missing values (i.e., NaN or None) from a DataFrame. We can specify the axis along which to drop the missing values (rows or columns) and, with the subset parameter, the columns to consider.
import pandas as pd
# create a sample DataFrame
data = {'name': ['Alice', 'Bob', 'Charlie', 'David'],
'age': [25, 30, None, 35],
'city': ['New York', 'Paris', '', 'London']}
df = pd.DataFrame(data)
# filter out records using dropna method
df_filtered = df.dropna()
print(df_filtered)
Output:
name age city
0 Alice 25.0 New York
1 Bob 30.0 Paris
3 David 35.0 London
In this example, we create a sample DataFrame with four records and three fields: name, age, and city. The third record has a missing age (None) and an empty string in the city field. The dropna() method drops that row because of the missing age value; note that it does not treat the empty string in the city field as missing. The result is stored in a new DataFrame called df_filtered.
Note that the dropna() method returns a new DataFrame by default, so we need to assign the result to a new variable if we want to keep the filtered DataFrame.
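Since dropna() only recognizes true missing values, empty strings survive it on their own. A common workaround, shown in this sketch, is to first convert empty strings to NA with replace() and then use the subset parameter to target a specific column:

```python
import pandas as pd

# same sample data as above
data = {'name': ['Alice', 'Bob', 'Charlie', 'David'],
        'age': [25, 30, None, 35],
        'city': ['New York', 'Paris', '', 'London']}
df = pd.DataFrame(data)

# convert empty strings to NA, then drop rows with a missing 'city'
df_clean = df.replace('', pd.NA).dropna(subset=['city'])
print(df_clean)
```

This keeps dropna() as the single filtering step while still catching empty strings.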
Using the notna() Method
The notna() method returns a boolean mask marking the positions of non-null (non-NA) values. We can use this mask directly to keep only the rows where a given column is non-null.
import pandas as pd
# create a sample DataFrame
data = {'name': ['Alice', 'Bob', 'Charlie', 'David'],
'age': [25, 30, None, 35],
'city': ['New York', 'Paris', '', 'London']}
df = pd.DataFrame(data)
# filter out records with null values in the 'age' column using the notna() method
df_filtered = df[df['age'].notna()]
print(df_filtered)
Output:
name age city
0 Alice 25.0 New York
1 Bob 30.0 Paris
3 David 35.0 London
In this example, the notna() method creates a boolean mask that identifies non-null values in the 'age' column. This mask filters out the record with a null age, which in this sample DataFrame happens to also be the record with the empty string in the 'city' column, resulting in the DataFrame df_filtered. Keep in mind that notna() only detects true missing values; it does not flag empty strings.
This method offers an alternative approach to filtering out records with null values in Pandas DataFrame columns.
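Because notna() catches nulls but not empty strings, the two checks can be combined into a single mask when both kinds of values need to go. A small sketch using the same sample data:

```python
import pandas as pd

data = {'name': ['Alice', 'Bob', 'Charlie', 'David'],
        'age': [25, 30, None, 35],
        'city': ['New York', 'Paris', '', 'London']}
df = pd.DataFrame(data)

# keep rows where 'age' is non-null AND 'city' is non-empty
mask = df['age'].notna() & (df['city'] != '')
df_filtered = df[mask]
print(df_filtered)
```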
Using Boolean Indexing
Another method is to use boolean indexing to filter the DataFrame based on conditions. We can build a boolean mask that marks the records we want to keep (those without null or empty strings) and use it to filter the DataFrame. Here’s an example:
import pandas as pd
# create a sample DataFrame
data = {'name': ['Alice', 'Bob', 'Charlie', 'David'],
'age': [25, 30, None, 35],
'city': ['New York', 'Paris', '', 'London']}
df = pd.DataFrame(data)
# filter out records with empty strings in the 'city' column
df_filtered = df[df['city'] != '']
print(df_filtered)
Output:
name age city
0 Alice 25.0 New York
1 Bob 30.0 Paris
3 David 35.0 London
In this example, we use the != operator to create a boolean mask that is True for records whose 'city' value is not an empty string. Indexing the DataFrame with this mask drops the records with empty strings in the city field, and we store the result in a new DataFrame called df_filtered.
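One variation worth noting: real-world data often contains whitespace-only strings ('   '), which a plain != '' comparison does not catch. A sketch that strips whitespace before comparing (the whitespace-only value for Bob is an assumption added for illustration):

```python
import pandas as pd

# 'city' contains both an empty string and a whitespace-only string
data = {'name': ['Alice', 'Bob', 'Charlie', 'David'],
        'city': ['New York', '   ', '', 'London']}
df = pd.DataFrame(data)

# strip whitespace before comparing, so '   ' counts as empty too
df_filtered = df[df['city'].str.strip() != '']
print(df_filtered)
```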
Filtering out Null or Empty Strings for Multiple Fields
If we want to filter out records with empty strings in multiple fields, we can write one condition per field and combine them with the & operator. Here is an example:
Consider the following DataFrame:
name age city country
0 Alice 25.0 New York USA
1 Bob 30.0 Paris
2 Charlie NaN France
3 David 35.0 London UK
import pandas as pd
# create a sample DataFrame
data = {'name': ['Alice', 'Bob', 'Charlie', 'David'],
'age': [25, 30, None, 35],
'city': ['New York', 'Paris', '', 'London'],
'country': ['USA', '', 'France', 'UK']}
df = pd.DataFrame(data)
# filter out records with empty strings in the 'city' and 'country' columns
df_filtered = df[(df['city'] != '') & (df['country'] != '')]
print(df_filtered)
Output:
name age city country
0 Alice 25.0 New York USA
3 David 35.0 London UK
In this example, we create a sample DataFrame with four records and four fields: name, age, city, and country. The third and second records have empty strings in the city and country fields, respectively, which we want to filter out. We use the != operator to check for empty strings in both fields, combine the two boolean masks with &, and store the result in a new DataFrame called df_filtered.
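When the list of fields grows, writing one comparison per column gets verbose. One way to scale this, sketched below, is to compare a column subset against '' and require all(axis=1):

```python
import pandas as pd

data = {'name': ['Alice', 'Bob', 'Charlie', 'David'],
        'age': [25, 30, None, 35],
        'city': ['New York', 'Paris', '', 'London'],
        'country': ['USA', '', 'France', 'UK']}
df = pd.DataFrame(data)

# require every listed column to be non-empty in each row
cols = ['city', 'country']
df_filtered = df[(df[cols] != '').all(axis=1)]
print(df_filtered)
```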
Handling Null or Empty Strings in a Pandas DataFrame
Replacing Null or Empty Strings with New String
To handle null or empty strings in a Pandas DataFrame, we can replace them with a default value or with values derived from other columns. Here’s an example of replacing empty strings in the ‘city’ column with ‘Unknown’:
# replace empty strings in the 'city' column with 'Unknown'
df['city'] = df['city'].replace('', 'Unknown')
print(df)
Output:
name age city
0 Alice 25.0 New York
1 Bob 30.0 Paris
2 Charlie NaN Unknown
3 David 35.0 London
In this example, we use the replace() method to replace empty strings in the ‘city’ column with ‘Unknown’ and assign the result back to the column. Assignment is preferred over calling replace() with inplace=True on a selected column, which may not modify the original DataFrame under copy-on-write in recent versions of Pandas.
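If empty strings can occur in more than one text column, a single DataFrame-level replace() call handles them all at once. This sketch assumes a hypothetical 'country' column alongside the sample data:

```python
import pandas as pd

# sample data with a hypothetical 'country' column
data = {'name': ['Alice', 'Bob', 'Charlie', 'David'],
        'city': ['New York', 'Paris', '', 'London'],
        'country': ['USA', '', 'France', 'UK']}
df = pd.DataFrame(data)

# replace empty strings in every column in one call
df = df.replace('', 'Unknown')
print(df)
```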
Handling Null or Empty Numeric Values in the DataFrame
To handle null or empty values in the 'age' column, we can fill them with the mean age of the dataset. Here’s how:
# fill null or empty values in the 'age' column with the mean age
mean_age = df['age'].mean()
df['age'] = df['age'].fillna(mean_age)
print(df)
Output:
name age city
0 Alice 25.0 New York
1 Bob 30.0 Paris
2 Charlie 30.0 Unknown
3 David 35.0 London
In this example, we calculate the mean age of the dataset using the mean() method, fill null values in the 'age' column with the calculated mean using the fillna() method, and assign the result back to the column.
Conclusion
Filtering out records with null or empty strings in a given field is a common data preprocessing task in data analysis and modeling. In this article, we have explored how to accomplish this task using Python Pandas, a powerful library for data manipulation and analysis. By using the dropna() method, the notna() method, or boolean indexing, we can filter out records with null values or empty strings for one or more fields. We hope this article has been helpful and informative for your data science or software engineering projects.
About Saturn Cloud
Saturn Cloud is your all-in-one solution for data science & ML development, deployment, and data pipelines in the cloud. Spin up a notebook with 4TB of RAM, add a GPU, connect to a distributed cluster of workers, and more. Request a demo today to learn more.
Saturn Cloud provides customizable, ready-to-use cloud environments for collaborative data teams.
Try Saturn Cloud and join thousands of users moving to the cloud without having to switch tools.