Including Null Values in Apache Spark Joins
By default, Apache Spark joins drop rows whose join keys are null, because an ordinary equality comparison involving null never evaluates to true. This is a problem when such rows need to be preserved. This article explains how to include null join keys in an Apache Spark join operation.
Default Spark Behavior
By default, Spark performs an inner join, and because a standard equality condition on a null key evaluates to null rather than true, rows with a null value in the join column never match and are dropped from the result. To illustrate this, consider the following example:
val numbersDf = Seq( ("123"), ("456"), (null), ("") ).toDF("numbers") val lettersDf = Seq( ("123", "abc"), ("456", "def"), (null, "zzz"), ("", "hhh") ).toDF("numbers", "letters") val joinedDf = numbersDf.join(lettersDf, Seq("numbers"))
The output of joinedDf.show() would be as follows:
+-------+-------+
|numbers|letters|
+-------+-------+
|    123|    abc|
|    456|    def|
|       |    hhh|
+-------+-------+
As expected, the rows with a null value in the "numbers" column (from both numbersDf and lettersDf) are excluded from the join result.
Including Null Values
Spark solves this with the null-safe equality operator <=>, which treats two null values as equal. Using it in the join condition keeps the rows with null keys:
numbersDf
  .join(lettersDf, numbersDf("numbers") <=> lettersDf("numbers"))
  .drop(lettersDf("numbers"))
This modified join produces the following output, which now includes the previously excluded row with a null key:
+-------+-------+
|numbers|letters|
+-------+-------+
|    123|    abc|
|    456|    def|
|   null|    zzz|
|       |    hhh|
+-------+-------+
It is important to note that in Spark versions prior to 1.6 this null-safe equality join required a costly Cartesian product, so it should be used with care there; from Spark 1.6 onward it is planned as a regular equi-join.
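To check how a given Spark version actually plans the null-safe join, the physical plan can be inspected with explain(). A minimal sketch, assuming the numbersDf and lettersDf DataFrames defined above:

numbersDf
  .join(lettersDf, numbersDf("numbers") <=> lettersDf("numbers"))
  .explain()
// On Spark 1.6+ the plan typically contains a SortMergeJoin or BroadcastHashJoin node;
// a CartesianProduct node would indicate the slow pre-1.6 behavior.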
Spark 2.3.0 and Later
In Spark 2.3.0 and later, the same null-safe comparison is available as the Column.eqNullSafe method in PySpark and as the %<=>% operator in SparkR. In Spark 2.2.0 and later, the SQL syntax IS NOT DISTINCT FROM can also be used, including through the DataFrame API, to include null keys in join operations.
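As an illustration of the SQL form, here is a minimal Scala sketch, assuming the numbersDf and lettersDf DataFrames defined above and an active SparkSession named spark; the view names numbers and letters and the aliases n and l are chosen only for this example:

import org.apache.spark.sql.functions.{col, expr}

// Register temporary views so the null-safe predicate can be written in SQL.
numbersDf.createOrReplaceTempView("numbers")
lettersDf.createOrReplaceTempView("letters")

val sqlJoined = spark.sql("""
  SELECT n.numbers, l.letters
  FROM numbers n
  JOIN letters l
    ON n.numbers IS NOT DISTINCT FROM l.numbers
""")

// The same predicate can be passed to the DataFrame API through expr().
val exprJoined = numbersDf.alias("n")
  .join(lettersDf.alias("l"), expr("n.numbers IS NOT DISTINCT FROM l.numbers"))
  .select(col("n.numbers"), col("l.letters"))

Both variants produce the same result as the <=> join shown earlier.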