How Can I Preserve Null Values During Apache Spark Joins?

DDD
Release: 2024-12-31 17:36:11

Preserving Null Values in Apache Spark Joins

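By default, Apache Spark join conditions use ordinary equality, under which any comparison involving null evaluates to null rather than true. Rows whose join key is null therefore never match, and an inner join drops them. To match null keys to each other and keep those rows in the output, Spark provides several NULL-safe alternatives.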
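To make the difference concrete, here is a minimal plain-Python sketch of the two comparison semantics. It does not use Spark at all: the helper names sql_equals, null_safe_equals, and inner_join are hypothetical and exist only for illustration, and None stands in for a SQL NULL. The sample rows mirror the ones used in the examples in this article.

```python
def sql_equals(a, b):
    """Model ordinary SQL `=`: a comparison with NULL yields NULL (None)."""
    if a is None or b is None:
        return None
    return a == b


def null_safe_equals(a, b):
    """Model NULL-safe equality: always True or False, and NULL equals NULL."""
    if a is None or b is None:
        return a is None and b is None
    return a == b


def inner_join(left, right, predicate):
    """Nested-loop inner join: keep a pair only when the predicate is True."""
    return [
        (l[0], r[1])
        for l in left
        for r in right
        if predicate(l[0], r[0]) is True
    ]


# Illustrative data: one row has a null key, one has an empty-string key.
numbers = [("123",), ("456",), (None,), ("",)]
letters = [("123", "abc"), ("456", "def"), (None, "zzz"), ("", "hhh")]

plain = inner_join(numbers, letters, sql_equals)       # null key is dropped
safe = inner_join(numbers, letters, null_safe_equals)  # null key is kept

print(plain)
print(safe)
```

With ordinary equality the (None, "zzz") pair is missing from the result, while the NULL-safe predicate keeps it. The empty string is a regular value and matches in both cases.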

NULL-Safe Equality Operator (<=>)

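Spark's Scala API provides a NULL-safe equality operator, <=>, which treats two nulls as equal, so rows with null keys can match in the join condition. Use it with Spark 1.6 or later; in earlier versions such joins required a Cartesian product.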

numbersDf
  .join(lettersDf, numbersDf("numbers") <=> lettersDf("numbers"))
  .drop(lettersDf("numbers"))

Column.eqNullSafe (PySpark 2.3.0+)

In PySpark 2.3.0 and later, you can use Column.eqNullSafe to perform NULL-safe equality checks.

# Assumes an active SparkContext named sc (for example, in the pyspark shell).
numbers_df = sc.parallelize([
    ("123", ), ("456", ), (None, ), ("", )
]).toDF(["numbers"])

letters_df = sc.parallelize([
    ("123", "abc"), ("456", "def"), (None, "zzz"), ("", "hhh")
]).toDF(["numbers", "letters"])

numbers_df.join(letters_df, numbers_df.numbers.eqNullSafe(letters_df.numbers))

%<=>% (SparkR)

SparkR offers a %<=>% operator for NULL-safe equality checks.

numbers_df <- createDataFrame(data.frame(numbers = c("123", "456", NA, "")))
letters_df <- createDataFrame(data.frame(
  numbers = c("123", "456", NA, ""),
  letters = c("abc", "def", "zzz", "hhh")
))

head(join(numbers_df, letters_df, numbers_df$numbers %<=>% letters_df$numbers))

IS NOT DISTINCT FROM (SQL)

In SQL (Spark 2.2.0+), you can use IS NOT DISTINCT FROM as the join condition so that null keys match each other.

SELECT * FROM numbers JOIN letters 
ON numbers.numbers IS NOT DISTINCT FROM letters.numbers

The same predicate can also be used from the DataFrame API by passing it as a SQL expression string:

numbersDf.alias("numbers")
  .join(lettersDf.alias("letters"))
  .where("numbers.numbers IS NOT DISTINCT FROM letters.numbers")