PySpark - Multiple Conditions in When Clause: An Overview
PySpark is a powerful tool for data processing and analysis, but it can be challenging to work with when dealing with complex conditional statements. In this blog post, we will explore how to use the PySpark `when` function with multiple conditions to efficiently filter and transform data.
Table of Contents
- Basic When Clause
- Using `&` and `|` Operators
- Chaining `otherwise` Conditions
- Nested When Conditions
- Common Errors and Solutions
- Conclusion
Basic When Clause
The basic syntax of the `when` clause involves specifying a condition and the corresponding value to assign if the condition is true. Here’s a simple example:
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when

spark = SparkSession.builder.appName("example").getOrCreate()

data = [("Alice", 25), ("Bob", 30), ("Charlie", 22)]
columns = ["Name", "Age"]
df = spark.createDataFrame(data, columns)

# Label rows "Young" when Age < 25; all other rows fall through to "Adult"
result_df = df.withColumn("Category", when(col("Age") < 25, "Young").otherwise("Adult"))
result_df.show()
```
This code snippet creates a DataFrame with a `Category` column based on the age condition.
Output:
```
+-------+---+--------+
|   Name|Age|Category|
+-------+---+--------+
|  Alice| 25|   Adult|
|    Bob| 30|   Adult|
|Charlie| 22|   Young|
+-------+---+--------+
```
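Note that `otherwise` is optional; if you omit it, rows matching no condition receive `null`. A quick sketch with the same DataFrame:

```python
# Without otherwise(), Alice (25) and Bob (30) get null in "Category"
df.withColumn("Category", when(col("Age") < 25, "Young")).show()
```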
Using `&` and `|` Operators
To apply multiple conditions in the `when` clause, you can combine them with the logical operators `&` (and) and `|` (or). Wrap each condition in parentheses, since these operators bind more tightly than comparisons. For instance:
```python
# "Adult" only when Age > 22 AND Name is not "Bob"; note the parentheses around each condition
result_df = df.withColumn(
    "Category",
    when((col("Age") > 22) & (col("Name") != "Bob"), "Adult").otherwise("Young")
)
result_df.show()
```
This code sets the `Category` column to `Adult` only if the age is more than 22 and the name is not `Bob`.
Output:
```
+-------+---+--------+
|   Name|Age|Category|
+-------+---+--------+
|  Alice| 25|   Adult|
|    Bob| 30|   Young|
|Charlie| 22|   Young|
+-------+---+--------+
```
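The `|` operator works the same way. Here is a minimal sketch (the `Flag` column and its labels are illustrative, not part of the example above) that marks a row when either condition holds:

```python
# "Flagged" when Age < 23 OR Name is "Bob"; each condition still needs its own parentheses
result_df = df.withColumn(
    "Flag",
    when((col("Age") < 23) | (col("Name") == "Bob"), "Flagged").otherwise("OK")
)
result_df.show()
```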
Chaining `otherwise` Conditions
Chaining `otherwise` conditions allows you to handle multiple scenarios. Each condition is evaluated sequentially, and the first true condition is applied. Here’s an example:
```python
# Nest a second when(...) inside otherwise(...) to handle three cases
result_df = df.withColumn(
    "Category",
    when(col("Age") < 25, "Young").otherwise(when(col("Age") > 29, "Senior").otherwise("Adult"))
)
result_df.show()
```
In this example, the `Category` column is set based on age, with different categories for young, senior, and adult.
Output:
```
+-------+---+--------+
|   Name|Age|Category|
+-------+---+--------+
|  Alice| 25|   Adult|
|    Bob| 30|  Senior|
|Charlie| 22|   Young|
+-------+---+--------+
```
Nested When Conditions
You can also chain `when` calls directly instead of nesting them inside `otherwise`. Consider the following example:
```python
# Each .when() adds another branch; .otherwise() supplies the default
result_df = df.withColumn(
    "Category",
    when(col("Age") < 25, "Young")
    .when(col("Age") > 29, "Senior")
    .otherwise("Adult")
)
result_df.show()
```
This code achieves the same result as the previous example but with a more compact syntax.
Output:
```
+-------+---+--------+
|   Name|Age|Category|
+-------+---+--------+
|  Alice| 25|   Adult|
|    Bob| 30|  Senior|
|Charlie| 22|   Young|
+-------+---+--------+
```
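Chained `when` calls are the DataFrame counterpart of a SQL `CASE WHEN` expression. If you prefer SQL syntax, the same logic can be written with `expr` (a minimal sketch using the same `df`):

```python
from pyspark.sql.functions import expr

# Equivalent CASE WHEN expression evaluated by Spark SQL
result_df = df.withColumn(
    "Category",
    expr("CASE WHEN Age < 25 THEN 'Young' WHEN Age > 29 THEN 'Senior' ELSE 'Adult' END")
)
result_df.show()
```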
Common Errors and Solutions
Error 1: AnalysisException: cannot resolve
- Description: This error occurs when a column or expression in the `when` clause is not found in the DataFrame.
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when

spark = SparkSession.builder.appName("example").getOrCreate()

data = [("Alice", 25), ("Bob", 30), ("Charlie", 22)]
columns = ["Name", "Age"]
df = spark.createDataFrame(data, columns)

# Error: 'Salary' column does not exist in the DataFrame
result_df = df.withColumn("Category", when(col("Salary") > 50000, "High").otherwise("Low"))
result_df.show()
```
- Solution: Double-check column names and ensure that all referenced columns exist in the DataFrame.
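For example, a defensive sketch (using the same `df`) that checks `df.columns` before building the expression:

```python
# Fail fast with a clear message if the column is missing
if "Salary" not in df.columns:
    raise ValueError(f"Column 'Salary' not found; available columns: {df.columns}")
result_df = df.withColumn("Category", when(col("Salary") > 50000, "High").otherwise("Low"))
```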
Error 2: TypeError: 'Column' object is not callable
- Description: This error happens when trying to call a column as if it were a function.
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when

spark = SparkSession.builder.appName("example").getOrCreate()

data = [("Alice", 25), ("Bob", 30), ("Charlie", 22)]
columns = ["Name", "Age"]
df = spark.createDataFrame(data, columns)

# Error: the extra ("High") call treats the Column returned by when(...) as a function
result_df = df.withColumn("Category", when(col("Age") > 25, "High")("High").otherwise("Low"))
result_df.show()
```
- Solution: Remove the stray call; `when(...)` already returns a `Column`, so nothing should be invoked on it with parentheses.
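The corrected version simply drops the extra call:

```python
# Correct: when(...).otherwise(...) chains methods on the Column, with no extra call
result_df = df.withColumn("Category", when(col("Age") > 25, "High").otherwise("Low"))
result_df.show()
```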
Error 3: TypeError: unsupported operand type(s) for &: 'Column' and 'Column'
- Description: This error can occur when combining conditions with the `&` operator without proper parentheses. Because `&` binds more tightly than comparison operators such as `>` and `==`, the conditions are grouped in an unintended way.
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when

spark = SparkSession.builder.appName("example").getOrCreate()

data = [("Alice", 25), ("Bob", 30), ("Charlie", 22)]
columns = ["Name", "Age"]
df = spark.createDataFrame(data, columns)

# Error: missing parentheses around conditions, so & is applied before > and ==
result_df = df.withColumn("Category", when(col("Age") > 25 & col("Name") == "Alice", "High").otherwise("Low"))
result_df.show()
```
- Solution: Always use parentheses to explicitly define the order of operations in complex conditions.
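The corrected version wraps each comparison in its own parentheses:

```python
# Correct: parentheses force each comparison to evaluate before &
result_df = df.withColumn(
    "Category",
    when((col("Age") > 25) & (col("Name") == "Alice"), "High").otherwise("Low")
)
result_df.show()
```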
Conclusion
In this blog post, we have explored how to use the PySpark `when` function with multiple conditions to efficiently filter and transform data. We have seen how to combine conditions with the `&` and `|` operators, and how to chain `when` calls together with a final `otherwise` clause. By understanding these concepts, you can make the most of PySpark’s powerful data processing capabilities.