PySpark - Multiple Conditions in When Clause: An Overview
PySpark is a powerful tool for data processing and analysis, but it can be challenging to work with when dealing with complex conditional statements. In this blog post, we will explore how to use the PySpark `when` function with multiple conditions to efficiently filter and transform data.
Table of Contents
- Basic When Clause
- Using `&` and `|` Operators
- Chaining `otherwise` Conditions
- Nested When Conditions
- Common Errors and Solutions
- Conclusion
Basic When Clause
The basic syntax of the `when` clause involves specifying a condition and the corresponding value to assign if the condition is true. Here’s a simple example:
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when

spark = SparkSession.builder.appName("example").getOrCreate()

data = [("Alice", 25), ("Bob", 30), ("Charlie", 22)]
columns = ["Name", "Age"]
df = spark.createDataFrame(data, columns)

# Label rows "Young" when Age < 25; all other rows fall through to "Adult"
result_df = df.withColumn("Category", when(col("Age") < 25, "Young").otherwise("Adult"))
result_df.show()
```
This code snippet creates a DataFrame with a `Category` column based on the age condition.
Output:
```
+-------+---+--------+
|   Name|Age|Category|
+-------+---+--------+
|  Alice| 25|   Adult|
|    Bob| 30|   Adult|
|Charlie| 22|   Young|
+-------+---+--------+
```
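Note that `otherwise` is optional; if you omit it, rows matching no condition receive `null`. A quick sketch with the same DataFrame:

```python
# Without otherwise(), Alice (25) and Bob (30) get null in "Category"
df.withColumn("Category", when(col("Age") < 25, "Young")).show()
```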
Using `&` and `|` Operators
To apply multiple conditions in the `when` clause, you can combine them with the logical operators `&` (and) and `|` (or). Wrap each condition in parentheses, since these operators bind more tightly than comparisons. For instance:
```python
# "Adult" only when Age > 22 AND Name is not "Bob"; note the parentheses around each condition
result_df = df.withColumn(
    "Category",
    when((col("Age") > 22) & (col("Name") != "Bob"), "Adult").otherwise("Young")
)
result_df.show()
```
This code sets the `Category` column to `Adult` only if the age is more than 22 and the name is not `Bob`.
Output:
```
+-------+---+--------+
|   Name|Age|Category|
+-------+---+--------+
|  Alice| 25|   Adult|
|    Bob| 30|   Young|
|Charlie| 22|   Young|
+-------+---+--------+
```
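The `|` operator works the same way. Here is a minimal sketch (the `Flag` column and its labels are illustrative, not part of the example above) that marks a row when either condition holds:

```python
# "Flagged" when Age < 23 OR Name is "Bob"; each condition still needs its own parentheses
result_df = df.withColumn(
    "Flag",
    when((col("Age") < 23) | (col("Name") == "Bob"), "Flagged").otherwise("OK")
)
result_df.show()
```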
Chaining `otherwise` Conditions
Chaining `otherwise` conditions allows you to handle multiple scenarios. Each condition is evaluated sequentially, and the first true condition is applied. Here’s an example:
```python
# Nest a second when(...) inside otherwise(...) to handle three cases
result_df = df.withColumn(
    "Category",
    when(col("Age") < 25, "Young").otherwise(when(col("Age") > 29, "Senior").otherwise("Adult"))
)
result_df.show()
```
In this example, the `Category` column is set based on age, with different categories for young, senior, and adult.
Output:
```
+-------+---+--------+
|   Name|Age|Category|
+-------+---+--------+
|  Alice| 25|   Adult|
|    Bob| 30|  Senior|
|Charlie| 22|   Young|
+-------+---+--------+
```
Nested When Conditions
You can also chain `when` calls directly instead of nesting them inside `otherwise`. Consider the following example:
```python
# Each .when() adds another branch; .otherwise() supplies the default
result_df = df.withColumn(
    "Category",
    when(col("Age") < 25, "Young")
    .when(col("Age") > 29, "Senior")
    .otherwise("Adult")
)
result_df.show()
```
This code achieves the same result as the previous example but with a more compact syntax.
Output:
```
+-------+---+--------+
|   Name|Age|Category|
+-------+---+--------+
|  Alice| 25|   Adult|
|    Bob| 30|  Senior|
|Charlie| 22|   Young|
+-------+---+--------+
```
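Chained `when` calls are the DataFrame counterpart of a SQL `CASE WHEN` expression. If you prefer SQL syntax, the same logic can be written with `expr` (a minimal sketch using the same `df`):

```python
from pyspark.sql.functions import expr

# Equivalent CASE WHEN expression evaluated by Spark SQL
result_df = df.withColumn(
    "Category",
    expr("CASE WHEN Age < 25 THEN 'Young' WHEN Age > 29 THEN 'Senior' ELSE 'Adult' END")
)
result_df.show()
```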
Common Errors and Solutions
Error 1: AnalysisException: cannot resolve
- Description: This error occurs when a column or expression in the `when` clause is not found in the DataFrame.
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when

spark = SparkSession.builder.appName("example").getOrCreate()

data = [("Alice", 25), ("Bob", 30), ("Charlie", 22)]
columns = ["Name", "Age"]
df = spark.createDataFrame(data, columns)

# Error: 'Salary' column does not exist in the DataFrame
result_df = df.withColumn("Category", when(col("Salary") > 50000, "High").otherwise("Low"))
result_df.show()
```
- Solution: Double-check column names and ensure that all referenced columns exist in the DataFrame.
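For example, a defensive sketch (using the same `df`) that checks `df.columns` before building the expression:

```python
# Fail fast with a clear message if the column is missing
if "Salary" not in df.columns:
    raise ValueError(f"Column 'Salary' not found; available columns: {df.columns}")
result_df = df.withColumn("Category", when(col("Salary") > 50000, "High").otherwise("Low"))
```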
Error 2: TypeError: 'Column' object is not callable
- Description: This error happens when trying to call a column as if it were a function.
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when

spark = SparkSession.builder.appName("example").getOrCreate()

data = [("Alice", 25), ("Bob", 30), ("Charlie", 22)]
columns = ["Name", "Age"]
df = spark.createDataFrame(data, columns)

# Error: the extra ("High") call treats the Column returned by when(...) as a function
result_df = df.withColumn("Category", when(col("Age") > 25, "High")("High").otherwise("Low"))
result_df.show()
```
- Solution: Remove the stray call; `when(...)` already returns a `Column`, so nothing should be invoked on it with parentheses.
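The corrected version simply drops the extra call:

```python
# Correct: when(...).otherwise(...) chains methods on the Column, with no extra call
result_df = df.withColumn("Category", when(col("Age") > 25, "High").otherwise("Low"))
result_df.show()
```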
Error 3: TypeError: unsupported operand type(s) for &: 'Column' and 'Column'
- Description: This error can occur when combining conditions with the `&` operator without proper parentheses. Because `&` binds more tightly than comparison operators such as `>` and `==`, the conditions are grouped in an unintended way.
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when

spark = SparkSession.builder.appName("example").getOrCreate()

data = [("Alice", 25), ("Bob", 30), ("Charlie", 22)]
columns = ["Name", "Age"]
df = spark.createDataFrame(data, columns)

# Error: missing parentheses around conditions, so & is applied before > and ==
result_df = df.withColumn("Category", when(col("Age") > 25 & col("Name") == "Alice", "High").otherwise("Low"))
result_df.show()
```
- Solution: Always use parentheses to explicitly define the order of operations in complex conditions.
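The corrected version wraps each comparison in its own parentheses:

```python
# Correct: parentheses force each comparison to evaluate before &
result_df = df.withColumn(
    "Category",
    when((col("Age") > 25) & (col("Name") == "Alice"), "High").otherwise("Low")
)
result_df.show()
```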
Conclusion
In this blog post, we have explored how to use the PySpark `when` function with multiple conditions to efficiently filter and transform data. We have seen how to combine conditions with the `&` and `|` operators, and how to chain `when` calls together with a final `otherwise` clause. By understanding these concepts, you can make the most of PySpark’s powerful data processing capabilities.