How to select rows by column value in Pandas

Methods for efficiently selecting rows by column value

If you need to select rows in your pandas DataFrame based on the value of one or more columns, you’re in luck - there are several methods for accessing the data you need. Which method to use depends on performance considerations and how your data is set up.

One straightforward method is boolean indexing. Pandas.DataFrame.loc allows you to simply select rows by value:

import pandas as pd

data = pd.DataFrame({'Color': 'Tabby Black Calico Tabby Tabby Black'.split(),
                   'Name': 'Maxine Angel Delilah Tom Jeff Fluffy'.split(),
                   'Age': [2, 5, 17, 10, 7, 2]})

#select by scalar value
data.loc[data['Color'] == 'Tabby']

#select by iterable value
data.loc[data['Age'].isin([2, 5])]

Boolean indexing also allows for selection by negation, or by multiple conditions (with &, |):

#select rows where column value does NOT equal some scalar value
data.loc[data['Color'] != 'Calico']

#select rows where column value is NOT in some iterable value
data.loc[~data['Name'].isin(['Tom', 'Fluffy'])]

#select rows by multiple conditions
data.loc[(data['Color'] == 'Tabby') & (data['Age'] <= 7)]

Note: Using the loc method is usually not strictly necessary. These two lines of code accomplish the same task, with no major differences in performance:

data.loc[data['Color'] == 'Tabby']
data[data['Color'] == 'Tabby']

However, loc has the advantage of being explicit rather than implicit, which means that is often a safer choice for edge cases. Additionally, loc does have a performance advantage in certain situations, such as chained assignment.

If you are calling your boolean indexing function repeatedly, it’s more efficient to first use set_index():

#set index
data = data.set_index(['Color'])

#select by scalar value
data.loc['Tabby']

#select by iterable value
data.loc[data.index.isin(['Tabby', 'Calico'])]

For larger datasets (over approximately 200,000 rows), using the query() function may speed up your computation time:

#select by scalar value
data.query('Color == "Tabby"')

#select by iterable value
data.query('Age in [2, 5]')

Finally, in some cases, Numpy methods can offer a faster alternative to Pandas methods, although at the cost of some readability:

import numpy as np

#select by scalar value
data[data['Age'].values == 2]

#select by iterable value
data[np.in1d(data['Age'].values, [2, 5])]

To wrap up, there are a variety of ways to select DataFrame rows by column value. Boolean indexing (with or without loc) offers a quick and intuitive way to index DataFrames, especially for smaller datasets. For bigger datasets, query() is an elegant alternative. And for some use cases, it may be more efficient to skip the Pandas API altogether and use an applicable Numpy method.

Additional Resources:


About Saturn Cloud

Saturn Cloud is your all-in-one solution for data science & ML development, deployment, and data pipelines in the cloud. Spin up a notebook with 4TB of RAM, add a GPU, connect to a distributed cluster of workers, and more. Request a demo today to learn more.