PySpark - Multiple Conditions in When Clause: An Overview

PySpark is a powerful tool for data processing and analysis, but it can be challenging to work with when dealing with complex conditional statements. In this blog post, we will explore how to use the PySpark when function with multiple conditions to efficiently filter and transform data.

Table of Contents

  1. Basic When Clause
  2. Using & and | Operators
  3. Chaining otherwise Conditions
  4. Nested When Conditions
  5. Common Errors and Solutions
  6. Conclusion

Basic When Clause

The basic syntax of the when clause involves specifying a condition and the corresponding value to assign if the condition is true. Here’s a simple example:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when

spark = SparkSession.builder.appName("example").getOrCreate()

data = [("Alice", 25), ("Bob", 30), ("Charlie", 22)]
columns = ["Name", "Age"]

df = spark.createDataFrame(data, columns)

result_df = df.withColumn("Category", when(col("Age") < 25, "Young").otherwise("Adult"))
result_df.show()

This code snippet creates a DataFrame with a Category column based on the age condition.

Output:

+-------+---+--------+
|   Name|Age|Category|
+-------+---+--------+
|  Alice| 25|   Adult|
|    Bob| 30|   Adult|
|Charlie| 22|   Young|
+-------+---+--------+

Using & and | Operators

To apply multiple conditions in the when clause, you can use logical operators & (and) and | (or). For instance:

result_df = df.withColumn(
    "Category",
    when((col("Age") > 22) & (col("Name") != "Bob"), "Adult").otherwise("Young")
)
result_df.show()

This code sets the Category column to Adult only if the age is more than 22 and the name is not Bob.

Output:

+-------+---+--------+
|   Name|Age|Category|
+-------+---+--------+
|  Alice| 25|   Adult|
|    Bob| 30|   Young|
|Charlie| 22|   Young|
+-------+---+--------+

Chaining otherwise Conditions

One way to handle multiple scenarios is to nest a second when inside the otherwise clause. Conditions are evaluated in order, and the first one that is true determines the value. Here’s an example:

result_df = df.withColumn(
    "Category",
    when(col("Age") < 25, "Young").otherwise(when(col("Age") > 29, "Senior").otherwise("Adult"))
)
result_df.show()

In this example, the Category column is set based on age, with different categories for young, senior, and adult.

Output:

+-------+---+--------+
|   Name|Age|Category|
+-------+---+--------+
|  Alice| 25|   Adult|
|    Bob| 30|  Senior|
|Charlie| 22|   Young|
+-------+---+--------+

Nested When Conditions

You can also stack additional when branches on the same expression, an equivalent and more readable alternative to explicit nesting. Consider the following example:

result_df = df.withColumn(
    "Category",
    when(col("Age") < 25, "Young")
    .when(col("Age") > 29, "Senior")
    .otherwise("Adult")
)
result_df.show()

This code achieves the same result as the previous example but with a more compact syntax.

Output:

+-------+---+--------+
|   Name|Age|Category|
+-------+---+--------+
|  Alice| 25|   Adult|
|    Bob| 30|  Senior|
|Charlie| 22|   Young|
+-------+---+--------+

Common Errors and Solutions

Error 1: AnalysisException: cannot resolve

  • Description: This error occurs when a column or expression in the when clause is not found in the DataFrame.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when

spark = SparkSession.builder.appName("example").getOrCreate()

data = [("Alice", 25), ("Bob", 30), ("Charlie", 22)]
columns = ["Name", "Age"]

df = spark.createDataFrame(data, columns)

# Error: 'Salary' column does not exist in the DataFrame
result_df = df.withColumn("Category", when(col("Salary") > 50000, "High").otherwise("Low"))
result_df.show()
  • Solution: Double-check column names and ensure that all referenced columns exist in the DataFrame.

Error 2: TypeError: 'Column' object is not callable

  • Description: This error happens when trying to call a column as if it were a function.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when

spark = SparkSession.builder.appName("example").getOrCreate()

data = [("Alice", 25), ("Bob", 30), ("Charlie", 22)]
columns = ["Name", "Age"]

df = spark.createDataFrame(data, columns)

# Error: the extra ("High") call treats the Column returned by when() as a function
result_df = df.withColumn("Category", when(col("Age") > 25, "High")("High").otherwise("Low"))
result_df.show()
  • Solution: Remember that when(condition, value) already returns a Column; pass the value as the second argument to when and remove any stray call parentheses after it.

Error 3: ValueError: Cannot convert column into bool

  • Description: This error occurs when conditions are combined with & or | without parentheses. Python’s & binds more tightly than comparison operators, so the expression is mis-parsed and PySpark is forced to evaluate a Column as a plain Python boolean, which it refuses to do.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when

spark = SparkSession.builder.appName("example").getOrCreate()

data = [("Alice", 25), ("Bob", 30), ("Charlie", 22)]
columns = ["Name", "Age"]

df = spark.createDataFrame(data, columns)

# Error: Missing parentheses around conditions
result_df = df.withColumn("Category", when(col("Age") > 25 & col("Name") == "Alice", "High").otherwise("Low"))
result_df.show()
  • Solution: Always use parentheses to explicitly define the order of operations in complex conditions.

Conclusion

In this blog post, we have explored how to use the PySpark when function with multiple conditions to efficiently filter and transform data. We have seen how to combine conditions with the & and | operators, and how to chain when branches and otherwise clauses to cover multiple scenarios. By understanding these concepts, you can make the most of PySpark’s powerful data processing capabilities.


About Saturn Cloud

Saturn Cloud is your all-in-one solution for data science & ML development, deployment, and data pipelines in the cloud. Spin up a notebook with 4TB of RAM, add a GPU, connect to a distributed cluster of workers, and more. Request a demo today to learn more.