Python Pandas: Selecting Rows Whose Column Value is Null / None / NaN

As a data scientist or software engineer working with large datasets, you often need to select rows from a dataframe based on certain criteria. One common scenario is selecting rows whose column value is null, None, or NaN. In this article, we will explore several ways to do this with the Python pandas library.

What is Pandas?

Pandas is a powerful open-source data analysis and manipulation library for Python. It provides easy-to-use data structures and data analysis tools for dealing with structured data such as tables, time-series, and matrix data. Pandas is built on top of NumPy, another popular Python library for scientific computing.

Selecting Rows with Null / None / NaN Values

Creating a Sample Dataset

Let’s start by creating a sample dataset to work with. We can create a pandas dataframe by passing a dictionary of data to the pandas.DataFrame() constructor.

import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
        'Age': [25, 31, 22, None, 27],
        'Gender': ['F', 'M', None, 'M', 'F'],
        'Salary': [50000, None, 30000, 40000, 60000]}

df = pd.DataFrame(data)

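This creates a dataframe with four columns: Name, Age, Gender, and Salary. We intentionally set some of the values to None to demonstrate how to select rows based on null values. In the numeric columns (Age and Salary), pandas stores None as NaN and converts the column to floating point, which is why the outputs below show values such as 31.0 and NaN.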
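Before selecting anything, it can help to check where the missing values are. The following is a minimal sketch that rebuilds the sample dataframe and then inspects the column dtypes and the number of nulls per column; the variable name null_counts is introduced here only for illustration.

```python
import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
        'Age': [25, 31, 22, None, 27],
        'Gender': ['F', 'M', None, 'M', 'F'],
        'Salary': [50000, None, 30000, 40000, 60000]}
df = pd.DataFrame(data)

# Numeric columns that contain None are stored as floats, with None shown as NaN.
print(df.dtypes)

# isnull() marks each missing cell as True; sum() counts them per column.
null_counts = df.isnull().sum()
print(null_counts)
```

Each of Age, Gender, and Salary should report one missing value, and Name should report none.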

Using isnull() Method

The simplest way to select rows with null values is to use the isnull() method (an alias of isna()). It returns a boolean mask of the same shape as the dataframe, with True wherever an element is null. We can then reduce this mask to one value per row and use it to select the rows that contain null values.

null_mask = df.isnull().any(axis=1)
null_rows = df[null_mask]

print(null_rows)

Output:

      Name   Age Gender   Salary
1      Bob  31.0      M      NaN
2  Charlie  22.0   None  30000.0
3    David   NaN      M  40000.0

Here, we first apply the isnull() method to the entire dataframe, which returns a boolean mask indicating whether each element is null or not. Then, we apply the any(axis=1) method to the result to check if any value in each row is null. Finally, we use this boolean mask to select the rows that have null values.
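The same idea applies when only one column matters. As a minimal sketch, the example below rebuilds the sample dataframe and keeps only the rows whose Age is null; the variable name age_null_rows is introduced here for illustration.

```python
import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
        'Age': [25, 31, 22, None, 27],
        'Gender': ['F', 'M', None, 'M', 'F'],
        'Salary': [50000, None, 30000, 40000, 60000]}
df = pd.DataFrame(data)

# Calling isnull() on a single column returns a boolean Series,
# which can be used directly as a row mask.
age_null_rows = df[df['Age'].isnull()]
print(age_null_rows)
```

With the sample data, only David's row is selected, because it is the only row with a missing Age.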

Using notnull() Method

We can also select rows based on non-null values using the notnull() method. This method is the opposite of isnull() and returns a boolean mask indicating whether each element in the dataframe is not null.

not_null_mask = df.notnull().all(axis=1)
not_null_rows = df[not_null_mask]

print(not_null_rows)

Output:

     Name   Age Gender   Salary
0   Alice  25.0      F  50000.0
4     Eva  27.0      F  60000.0

Here, we first apply the notnull() method to the entire dataframe, which returns a boolean mask indicating whether each element is not null. Then, we apply all(axis=1) to the result to check whether every value in each row is non-null. Finally, we use this boolean mask to select the rows that contain no null values.
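For this particular case (keeping only complete rows), pandas also provides dropna(). The sketch below rebuilds the sample dataframe and checks that dropna() with its default arguments gives the same rows as the notnull() mask; the variable names are illustrative.

```python
import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
        'Age': [25, 31, 22, None, 27],
        'Gender': ['F', 'M', None, 'M', 'F'],
        'Salary': [50000, None, 30000, 40000, 60000]}
df = pd.DataFrame(data)

# Rows selected with the notnull() mask.
masked_rows = df[df.notnull().all(axis=1)]

# dropna() with default arguments drops every row that contains a null.
complete_rows = df.dropna()

print(complete_rows)
print(complete_rows.equals(masked_rows))
```

The mask-based form remains useful when the condition needs to be combined with other filters.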

Using Query Method

We can also use the query() method to select rows based on null values. This method takes a boolean expression written as a string, in which column names can be used directly, making it easy to select rows based on certain conditions.

null_rows = df.query('Age != Age or Gender != Gender or Salary != Salary')

print(null_rows)

Output:

      Name   Age Gender   Salary
1      Bob  31.0      M      NaN
2  Charlie  22.0   None  30000.0
3    David   NaN      M  40000.0

Here, we use the query() method to select rows where any of the columns has a null value. We use the != operator to compare each column with itself. A missing value does not compare equal to itself, so an expression such as Age != Age is True only in the rows where Age is null.
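If the self-comparison idiom feels opaque, one alternative is to call isnull() inside the query string. The following is a sketch under assumptions rather than the only way to do it: passing engine='python' is a choice made here so that the method calls are evaluated by the Python engine whether or not numexpr is installed.

```python
import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
        'Age': [25, 31, 22, None, 27],
        'Gender': ['F', 'M', None, 'M', 'F'],
        'Salary': [50000, None, 30000, 40000, 60000]}
df = pd.DataFrame(data)

# Call isnull() on each column inside the expression.
# engine='python' is an assumption made for this sketch (see the note above).
null_rows = df.query(
    'Age.isnull() or Gender.isnull() or Salary.isnull()',
    engine='python',
)
print(null_rows)
```

This selects the same rows as the self-comparison version (Bob, Charlie, and David) while stating the intent more explicitly.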

Using the loc Indexer

Finally, we can also use the loc indexer to select rows based on null values. Strictly speaking, loc is an indexer rather than a method: it is used with square brackets and lets us select rows and columns by label or boolean mask.

null_rows = df.loc[df['Age'].isnull() | df['Gender'].isnull() | df['Salary'].isnull()]

print(null_rows)

Output:

      Name   Age Gender   Salary
1      Bob  31.0      M      NaN
2  Charlie  22.0   None  30000.0
3    David   NaN      M  40000.0

Here, we use the loc indexer to select rows where any of the three columns has a null value. We use the isnull() method to create a boolean mask for each column and combine the masks with the | operator (element-wise OR).
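Because loc accepts a row mask and a column selection together, it can also return just the columns of interest for the rows with nulls. A minimal sketch, assuming we only want the names; cols_to_check and names_with_nulls are illustrative names, not part of pandas.

```python
import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
        'Age': [25, 31, 22, None, 27],
        'Gender': ['F', 'M', None, 'M', 'F'],
        'Salary': [50000, None, 30000, 40000, 60000]}
df = pd.DataFrame(data)

# Hypothetical list of columns to inspect for nulls.
cols_to_check = ['Age', 'Gender', 'Salary']

# Build one row mask from several columns instead of chaining | by hand.
mask = df[cols_to_check].isnull().any(axis=1)

# First argument selects rows by mask, second selects a column by label.
names_with_nulls = df.loc[mask, 'Name']
print(names_with_nulls)
```

Building the mask from a list of columns avoids repeating isnull() for each column as the list grows.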

Conclusion

In this article, we explored several ways to select rows from a pandas dataframe based on null values. We learned how to use the isnull() and notnull() methods, the query() method, and the loc indexer to achieve this task. These tools are flexible: the boolean masks they rely on can be combined to select rows based on more complex conditions. As a data scientist or software engineer, it’s important to have a good understanding of them to analyze and manipulate data efficiently.

