Pandas DataFrame Loc vs Query Performance
As a data scientist or software engineer, you are likely to work with large amounts of data on a regular basis. Pandas is one of the most widely used data analysis libraries in Python, and it provides a powerful tool for working with data in a tabular format. One of the key features of Pandas is the DataFrame object, which allows you to work with data in a way that is similar to working with a spreadsheet. In this article, we will explore the performance differences between two of the most commonly used methods for selecting data from a Pandas DataFrame: loc
and query
.
Table of Contents
- What is a Pandas DataFrame?
- What is loc?
- What is query?
- Performance Differences Between loc and query
- Pros an Cons
- Conclusion
What is a Pandas DataFrame?
A Pandas DataFrame is a two-dimensional table-like data structure that stores data in rows and columns. It is similar to a spreadsheet in that it has labeled columns and rows, and each column can have a different data type. This makes it a powerful tool for working with data in a variety of formats, including CSV, Excel, SQL databases, and many others.
What is loc
?
Consider the following DataFrame:
Name Age
0 Alice 25
1 Bob 30
2 Charlie 22
3 David 35
4 Eve 28
The loc
method is used to select data from a Pandas DataFrame based on label-based indexing. This means that you can select data using the row and column labels. For example, you can filter those who are older than 25 years old with the following code:
# Filtering using loc
result_loc = df.loc[df['Age'] > 25]
print(result_loc)
Output:
Name Age
1 Bob 30
3 David 35
4 Eve 28
What is query
?
The query
method is used to select data from a Pandas DataFrame based on a Boolean expression. This means that you can select data based on conditions that you specify using logical operators such as AND, OR, and NOT. For example, if you had a DataFrame with columns labeled “Name”, “Age”, and “Gender”, you could select the data for all females over the age of 30 using the following code:
# Filtering using query
result_query = df.query('Age > 25')
print(result_query)
Output will be the same as the output we got using loc
.
Performance Differences Between loc
and query
While both loc
and query
are useful methods for selecting data from a Pandas DataFrame, there are some performance differences between the two that are worth considering.
Memory Usage
One of the key differences between loc
and query
is the amount of memory that they use. When you use loc
to select data from a DataFrame, Pandas creates a new view of the data in memory. This means that the DataFrame itself is not modified, and the view takes up additional memory. On the other hand, when you use query
to select data from a DataFrame, Pandas creates a new DataFrame that contains only the rows that match the specified conditions. This means that the memory usage is generally lower when using query
compared to loc
.
Performance
Another key difference between loc
and query
is their performance. In general, loc
is faster than query
when selecting data from a DataFrame. This is because loc
uses label-based indexing, which is optimized for selecting data from a DataFrame, while query
uses Boolean expressions, which can be slower to evaluate.
To illustrate the performance differences between loc
and query
, let’s consider a simple example. Suppose we have a DataFrame with 1 million rows and 3 columns, and we want to select all the rows where the value of column “A” is greater than 0.5. We can do this using both loc
and query
, and measure the time it takes to execute each method:
import pandas as pd
import numpy as np
import time
# Create a DataFrame with 1 million rows and 3 columns
df = pd.DataFrame(np.random.rand(1000000, 3), columns=['A', 'B', 'C'])
# Use loc to select all rows where A > 0.5
start_time = time.time()
df.loc[df['A'] > 0.5]
print("--- %s seconds for loc ---" % (time.time() - start_time))
# Use query to select all rows where A > 0.5
start_time = time.time()
df.query('A > 0.5')
print("--- %s seconds for query ---" % (time.time() - start_time))
When we run this code, we get the following output:
--- 0.06963920593261719 seconds for loc ---
--- 0.19866657257080078 seconds for query ---
As you can see, loc
is faster than query
in this example, taking only 0.07 seconds to execute compared to 0.20 seconds for query
.
Pros an Cons of loc
and query
:
loc
Method
Pros:
- More Versatile for Complex Conditions:
The loc
method shines when dealing with intricate filtering conditions. Its label-based indexing allows you to create sophisticated queries involving multiple columns and complex logical conditions.
- Intuitive Syntax using Square Brackets:
The use of square brackets makes the syntax of loc
highly intuitive. This simplicity contributes to the ease of understanding and writing code, especially for users familiar with Python indexing conventions.
Cons:
- Slower for Simple Queries:
Despite its versatility, the loc
method might exhibit slightly slower performance for straightforward queries. This is because it involves label-based indexing and additional operations, which can lead to overhead for simple filtering tasks.
- Verbose Syntax for Complex Conditions:
As the complexity of filtering conditions increases, the syntax of the loc
method may become verbose. Constructing intricate queries with multiple logical conditions and nested expressions might lead to longer and less readable code.
query
Method
Pros:
- Concise Syntax for Simple Conditions:
The query
method excels in simplicity, offering a concise syntax for straightforward filtering conditions. This makes it an excellent choice for scenarios where brevity and readability are essential.
- Improved Readability for Straightforward Queries:
With its expressive syntax, the query
method enhances code readability for basic filtering conditions. This can be particularly beneficial when writing code that needs to be clear and understandable at a glance.
Cons:
- Limited to Basic Queries:
One of the main drawbacks of the query
method is its limitation to basic queries. It may struggle when handling more complex conditions that require intricate logical operations or involve multiple columns.
- Potential Slowness for Complex Conditions:
While query
excels in simplicity, it may not be the most performant option for complex filtering conditions. The trade-off for concise syntax might result in slower execution times when dealing with intricate queries.
Common Errors and Solutions
Key Error
Error:
df.loc[df['NonexistentColumn'] > 25]
Solution:
When encountering a KeyError, the first step is to verify that the column name is spelled correctly. Double-check for typos or discrepancies in capitalization to ensure accurate column referencing.
Syntax Error in query
Error:
df.query('Age > 25 and NonexistentColumn > 30')
Solution:
If facing a syntax error in the query
method, carefully inspect the query
expression. Ensure proper syntax, correct logical operators, and valid column names. In this case, verify that NonexistentColumn
is a valid column in the DataFrame.
Conclusion
In conclusion, both loc
and query
are useful methods for selecting data from a Pandas DataFrame, but there are some performance differences between the two that are worth considering. If you need to select data based on label-based indexing, then loc
is generally faster and more memory-efficient than query
. On the other hand, if you need to select data based on Boolean expressions, then query
can be a useful tool for filtering data in a DataFrame. Ultimately, the choice between loc
and query
depends on your specific use case and the performance requirements of your application.
About Saturn Cloud
Saturn Cloud is your all-in-one solution for data science & ML development, deployment, and data pipelines in the cloud. Spin up a notebook with 4TB of RAM, add a GPU, connect to a distributed cluster of workers, and more. Request a demo today to learn more.
Saturn Cloud provides customizable, ready-to-use cloud environments for collaborative data teams.
Try Saturn Cloud and join thousands of users moving to the cloud without
having to switch tools.