Pandas DataFrame Loc vs Query Performance

In this blog, we will learn howto deal with substantial datasets, Pandas, a widely employed data analysis library in Python, serves as a potent tool for handling data in tabular form. Specifically, we will delve into one of its pivotal features—the DataFrame object—enabling a spreadsheet-like approach to data manipulation. Our focus in this article will be on examining the performance distinctions between two frequently utilized methods for data selection in a Pandas DataFrame: loc and query.

As a data scientist or software engineer, you are likely to work with large amounts of data on a regular basis. Pandas is one of the most widely used data analysis libraries in Python, and it provides a powerful tool for working with data in a tabular format. One of the key features of Pandas is the DataFrame object, which allows you to work with data in a way that is similar to working with a spreadsheet. In this article, we will explore the performance differences between two of the most commonly used methods for selecting data from a Pandas DataFrame: loc and query.

Table of Contents

  1. What is a Pandas DataFrame?
  2. What is loc?
  3. What is query?
  4. Performance Differences Between loc and query
  5. Pros an Cons
  6. Conclusion

What is a Pandas DataFrame?

A Pandas DataFrame is a two-dimensional table-like data structure that stores data in rows and columns. It is similar to a spreadsheet in that it has labeled columns and rows, and each column can have a different data type. This makes it a powerful tool for working with data in a variety of formats, including CSV, Excel, SQL databases, and many others.

What is loc?

Consider the following DataFrame:

      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   22
3    David   35
4      Eve   28

The loc method is used to select data from a Pandas DataFrame based on label-based indexing. This means that you can select data using the row and column labels. For example, you can filter those who are older than 25 years old with the following code:

# Filtering using loc
result_loc = df.loc[df['Age'] > 25]
print(result_loc)

Output:

    Name  Age
1    Bob   30
3  David   35
4    Eve   28

What is query?

The query method is used to select data from a Pandas DataFrame based on a Boolean expression. This means that you can select data based on conditions that you specify using logical operators such as AND, OR, and NOT. For example, if you had a DataFrame with columns labeled “Name”, “Age”, and “Gender”, you could select the data for all females over the age of 30 using the following code:

# Filtering using query
result_query = df.query('Age > 25')
print(result_query)

Output will be the same as the output we got using loc.

Performance Differences Between loc and query

While both loc and query are useful methods for selecting data from a Pandas DataFrame, there are some performance differences between the two that are worth considering.

Memory Usage

One of the key differences between loc and query is the amount of memory that they use. When you use loc to select data from a DataFrame, Pandas creates a new view of the data in memory. This means that the DataFrame itself is not modified, and the view takes up additional memory. On the other hand, when you use query to select data from a DataFrame, Pandas creates a new DataFrame that contains only the rows that match the specified conditions. This means that the memory usage is generally lower when using query compared to loc.

Performance

Another key difference between loc and query is their performance. In general, loc is faster than query when selecting data from a DataFrame. This is because loc uses label-based indexing, which is optimized for selecting data from a DataFrame, while query uses Boolean expressions, which can be slower to evaluate.

To illustrate the performance differences between loc and query, let’s consider a simple example. Suppose we have a DataFrame with 1 million rows and 3 columns, and we want to select all the rows where the value of column “A” is greater than 0.5. We can do this using both loc and query, and measure the time it takes to execute each method:

import pandas as pd
import numpy as np
import time

# Create a DataFrame with 1 million rows and 3 columns
df = pd.DataFrame(np.random.rand(1000000, 3), columns=['A', 'B', 'C'])

# Use loc to select all rows where A > 0.5
start_time = time.time()
df.loc[df['A'] > 0.5]
print("--- %s seconds for loc ---" % (time.time() - start_time))

# Use query to select all rows where A > 0.5
start_time = time.time()
df.query('A > 0.5')
print("--- %s seconds for query ---" % (time.time() - start_time))

When we run this code, we get the following output:

--- 0.06963920593261719 seconds for loc ---
--- 0.19866657257080078 seconds for query ---

As you can see, loc is faster than query in this example, taking only 0.07 seconds to execute compared to 0.20 seconds for query.

Pros an Cons of loc and query:

loc Method

Pros:

  • More Versatile for Complex Conditions:

The loc method shines when dealing with intricate filtering conditions. Its label-based indexing allows you to create sophisticated queries involving multiple columns and complex logical conditions.

  • Intuitive Syntax using Square Brackets:

The use of square brackets makes the syntax of loc highly intuitive. This simplicity contributes to the ease of understanding and writing code, especially for users familiar with Python indexing conventions.

Cons:

  • Slower for Simple Queries:

Despite its versatility, the loc method might exhibit slightly slower performance for straightforward queries. This is because it involves label-based indexing and additional operations, which can lead to overhead for simple filtering tasks.

  • Verbose Syntax for Complex Conditions:

As the complexity of filtering conditions increases, the syntax of the loc method may become verbose. Constructing intricate queries with multiple logical conditions and nested expressions might lead to longer and less readable code.

query Method

Pros:

  • Concise Syntax for Simple Conditions:

The query method excels in simplicity, offering a concise syntax for straightforward filtering conditions. This makes it an excellent choice for scenarios where brevity and readability are essential.

  • Improved Readability for Straightforward Queries:

With its expressive syntax, the query method enhances code readability for basic filtering conditions. This can be particularly beneficial when writing code that needs to be clear and understandable at a glance.

Cons:

  • Limited to Basic Queries:

One of the main drawbacks of the query method is its limitation to basic queries. It may struggle when handling more complex conditions that require intricate logical operations or involve multiple columns.

  • Potential Slowness for Complex Conditions:

While query excels in simplicity, it may not be the most performant option for complex filtering conditions. The trade-off for concise syntax might result in slower execution times when dealing with intricate queries.

Common Errors and Solutions

Key Error

Error:

df.loc[df['NonexistentColumn'] > 25]

Solution:

When encountering a KeyError, the first step is to verify that the column name is spelled correctly. Double-check for typos or discrepancies in capitalization to ensure accurate column referencing.

Syntax Error in query

Error:

df.query('Age > 25 and NonexistentColumn > 30')

Solution:

If facing a syntax error in the query method, carefully inspect the query expression. Ensure proper syntax, correct logical operators, and valid column names. In this case, verify that NonexistentColumn is a valid column in the DataFrame.

Conclusion

In conclusion, both loc and query are useful methods for selecting data from a Pandas DataFrame, but there are some performance differences between the two that are worth considering. If you need to select data based on label-based indexing, then loc is generally faster and more memory-efficient than query. On the other hand, if you need to select data based on Boolean expressions, then query can be a useful tool for filtering data in a DataFrame. Ultimately, the choice between loc and query depends on your specific use case and the performance requirements of your application.


About Saturn Cloud

Saturn Cloud is your all-in-one solution for data science & ML development, deployment, and data pipelines in the cloud. Spin up a notebook with 4TB of RAM, add a GPU, connect to a distributed cluster of workers, and more. Request a demo today to learn more.