How to Preserve Non-Aggregated Columns in Spark DataFrame GroupBy
When aggregating data using DataFrame's groupBy method, the resulting DataFrame only contains the group-by key and the aggregated values. However, in some cases, it may be desirable to also include non-aggregated columns from the original DataFrame in the result.
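The behavior described above can be seen in a minimal sketch. It assumes a local SparkSession and a small made-up DataFrame with the columns id, name, and age; the object name, method name, and sample rows are hypothetical and chosen only for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object GroupByDropsColumns {
  // Groups by age and counts ids; the result keeps only the key and the aggregate
  def countPerAge(df: DataFrame): DataFrame =
    df.groupBy(df("age")).agg(Map("id" -> "count"))

  def main(args: Array[String]): Unit = {
    // Local session for illustration only
    val spark = SparkSession.builder().master("local[*]").appName("groupby-demo").getOrCreate()
    import spark.implicits._

    // Hypothetical sample data; column names mirror those used in this article
    val df = Seq((1, "Alice", 30), (2, "Bob", 30), (3, "Carol", 25)).toDF("id", "name", "age")

    // Prints the result's column names; "name" is not among them
    println(countPerAge(df).columns.mkString(", "))

    spark.stop()
  }
}
```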
Limitation of Spark SQL
Spark SQL follows the pre-SQL:1999 convention, which does not allow columns that are neither grouped nor aggregated to appear in an aggregation query. When a group contains several rows, the value such a column should take is not well defined, and systems that do accept these queries differ in which value they return.
Solution:
To preserve non-aggregated columns in a Spark DataFrame groupBy, there are several options:
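Join the aggregated result back to the original DataFrame on the grouping key:
// Count ids per age
val aggregatedDf = df.groupBy(df("age")).agg(Map("id" -> "count"))
// Attach the counts to the original rows, so id and name are retained
val joinedDf = aggregatedDf.join(df, Seq("age"), "left")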
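As a rough end-to-end sketch of the join approach, the snippet below wraps it in a helper and runs it on made-up data. The object name JoinBackExample, the helper countPerAge, the local SparkSession, and the sample rows are all assumptions for illustration, not part of the original example.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object JoinBackExample {
  // Counts ids per age, then joins the counts back onto the original rows
  def countPerAge(df: DataFrame): DataFrame = {
    val aggregatedDf = df.groupBy(df("age")).agg(Map("id" -> "count"))
    aggregatedDf.join(df, Seq("age"), "left")
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only
    val spark = SparkSession.builder().master("local[*]").appName("join-back").getOrCreate()
    import spark.implicits._

    // Hypothetical sample data
    val df = Seq((1, "Alice", 30), (2, "Bob", 30), (3, "Carol", 25)).toDF("id", "name", "age")

    // Every original row appears once, with the per-age count attached
    countPerAge(df).show()

    spark.stop()
  }
}
```

The join keeps one output row per original row, so the per-group count is repeated on every row of the same group.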
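Use a window function to compute the aggregate per group while keeping every original row and column:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.count
// Partition rows by age; the count is computed per partition and attached to each row
val windowSpec = Window.partitionBy(df("age"))
val windowedDf = df.withColumn("count", count(df("id")).over(windowSpec))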
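When one row per group is enough and any representative value of the extra column will do, an arbitrary aggregate such as first can carry that column through the groupBy. The sketch below is one possible way to write this; the object name, the helper countWithAnyName, the alias id_count, and the sample data are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{count, first}

object FirstAggregateExample {
  // One row per age: the id count plus an arbitrary name taken from the group
  def countWithAnyName(df: DataFrame): DataFrame =
    df.groupBy(df("age")).agg(
      count(df("id")).as("id_count"),
      first(df("name")).as("name") // which name is picked within a group is not guaranteed
    )

  def main(args: Array[String]): Unit = {
    // Local session for illustration only
    val spark = SparkSession.builder().master("local[*]").appName("first-agg").getOrCreate()
    import spark.implicits._

    // Hypothetical sample data
    val df = Seq((1, "Alice", 30), (2, "Bob", 30), (3, "Carol", 25)).toDF("id", "name", "age")

    countWithAnyName(df).show()

    spark.stop()
  }
}
```

Because first depends on row order within each group, this variant suits columns whose value is constant per group, or cases where any value is acceptable.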