How to drop Pandas DataFrame rows with NAs in a specific column

Learn how to filter a DataFrame by NA values

During the data cleaning process, you may find that you need to discard rows from your pandas DataFrame based on whether or not they have NA values in a certain column. While this task is slightly more complex than dropping rows containing any NA values, there are some quick and easy ways to go about it.

The first is to manually subset your DataFrame, keeping only the rows where your column of interest is non-null. Calling notna() on the column (a Series) returns a boolean mask that you can use to index the DataFrame:

import pandas as pd
import numpy as np

data = pd.DataFrame({'Gene': ["MITF", "MITF", "KIT", "KIT", "KIT"],
                     'Allele': ["A", "G", "A", "TA", np.nan],
                     'Count': [2, 7, np.nan, 8, 2]})

# Keep only the rows where 'Count' is not NA
data = data[data['Count'].notna()]

data
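To make the masking step more concrete, here is a minimal sketch that builds the same example DataFrame and inspects the intermediate pieces. The variable names mask, na_counts, and filtered are only illustrative and are not part of the original example.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'Gene': ["MITF", "MITF", "KIT", "KIT", "KIT"],
                     'Allele': ["A", "G", "A", "TA", np.nan],
                     'Count': [2, 7, np.nan, 8, 2]})

# Boolean Series: True where 'Count' holds a value, False where it is NA
mask = data['Count'].notna()

# Number of NA values in each column, useful for deciding what to filter on
na_counts = data.isna().sum()

# Indexing with the mask keeps only the rows where the mask is True
filtered = data[mask]

print(mask)
print(na_counts)
print(filtered)
```

The filtered rows keep their original index labels, so the label of the dropped row is simply missing from the result.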

While this does exactly what we want, consider using the pandas method DataFrame.dropna() instead. It is a more explicit way to handle missing data, and it provides a variety of useful options. You can use dropna() to discard rows in which any or all values are NA (the how parameter), rows that have fewer than a minimum number of non-NA values (the thresh parameter), or rows with NAs in a specific subset of columns (the subset parameter). You can also use the axis parameter to discard columns by NA value instead. For our purposes, we can use the subset parameter:

import pandas as pd
import numpy as np

data = pd.DataFrame({'Gene': ["MITF", "MITF", "KIT", "KIT", "KIT"],
                     'Allele': ["A", "G", "A", "TA", np.nan],
                     'Count': [2, 7, np.nan, 8, 2]})

# Drop the rows where 'Count' is NA; NAs in other columns are ignored
data = data.dropna(subset=['Count'])

data
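For comparison, the other dropna() options mentioned above can be sketched on the same example data. The result names (any_na, all_na, min_three, no_na_columns, allele_and_count) are illustrative only.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'Gene': ["MITF", "MITF", "KIT", "KIT", "KIT"],
                     'Allele': ["A", "G", "A", "TA", np.nan],
                     'Count': [2, 7, np.nan, 8, 2]})

# Default (how='any'): drop rows that contain at least one NA in any column
any_na = data.dropna()

# how='all': drop only rows in which every value is NA (none in this data)
all_na = data.dropna(how='all')

# thresh=3: keep only rows that have at least 3 non-NA values
min_three = data.dropna(thresh=3)

# axis=1: drop columns (rather than rows) that contain any NA
no_na_columns = data.dropna(axis=1)

# subset with several columns: drop rows with an NA in either column
allele_and_count = data.dropna(subset=['Allele', 'Count'])

print(any_na)
print(no_na_columns)
```

Note that thresh counts the non-NA values a row must have in order to be kept, not the number of NA values it is allowed to contain.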

Note that by default, dropna() returns a new DataFrame and leaves the original unchanged. If you’d instead like to modify your data in-place, you can use:

data.dropna(subset = ['Count'], inplace = True)
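The difference between the two calling styles can be checked with a short sketch. The names cleaned and renumbered are illustrative, and the reset_index() step is an optional extra that the original example does not use.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'Gene': ["MITF", "MITF", "KIT", "KIT", "KIT"],
                     'Allele': ["A", "G", "A", "TA", np.nan],
                     'Count': [2, 7, np.nan, 8, 2]})

# Default behaviour: a new DataFrame is returned, the original is untouched
cleaned = data.dropna(subset=['Count'])
rows_before = len(data)      # still includes the row with the NA count
rows_cleaned = len(cleaned)  # one row fewer

# The surviving rows keep their original index labels; reset_index(drop=True)
# renumbers them from 0 if a contiguous index is wanted
renumbered = cleaned.reset_index(drop=True)

# In-place: 'data' itself is modified, so the result is not assigned
data.dropna(subset=['Count'], inplace=True)
rows_after = len(data)

print(rows_before, rows_cleaned, rows_after)
```

According to the pandas documentation, dropna() returns None when inplace=True, so writing data = data.dropna(..., inplace=True) would replace your DataFrame with None.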

To wrap up, there are several simple strategies for dropping DataFrame rows depending on NA values in a certain column. While you can certainly subset your data manually, dropna() states the intent more explicitly and is flexible enough to cover this use case and a variety of others.
