Selecting the first row from each group in a Spark DataFrame
When working with grouped data in Spark, you often need to pick one row from each group according to some ordering — most commonly the row with the largest (or smallest) value in a given column. Several methods can be used to select the first row from each group of a DataFrame:
Window functions:
<code>import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

// Create a DataFrame with grouped data
val df = sc.parallelize(Seq(
  (0, "cat26", 30.9), (0, "cat13", 22.1), (0, "cat95", 19.6), (0, "cat105", 1.3),
  (1, "cat67", 28.5), (1, "cat4", 26.8), (1, "cat13", 12.6), (1, "cat23", 5.3),
  (2, "cat56", 39.6), (2, "cat40", 29.7), (2, "cat187", 27.9), (2, "cat68", 9.8),
  (3, "cat8", 35.6))).toDF("Hour", "Category", "TotalValue")

// Define a window partitioned by Hour and ordered by TotalValue (descending)
val w = Window.partitionBy($"Hour").orderBy($"TotalValue".desc)

// Number the rows within each group, keep only row 1, then drop the helper column
val dfTop = df.withColumn("rn", row_number.over(w)).where($"rn" === 1).drop("rn")

// Show the first row of each group
dfTop.show</code>
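For the sample data above, the top row of each group is the one with the largest TotalValue, so the output should look like this (the order of the groups may vary):
<code>+----+--------+----------+
|Hour|Category|TotalValue|
+----+--------+----------+
|   0|   cat26|      30.9|
|   1|   cat67|      28.5|
|   2|   cat56|      39.6|
|   3|    cat8|      35.6|
+----+--------+----------+</code>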
Simple SQL aggregations and joins:
<code>// Find the maximum TotalValue per Hour
val dfMax = df.groupBy($"Hour".as("max_hour")).agg(max($"TotalValue").as("max_value"))

// Join back to the original DataFrame to recover the full rows
val dfTopByJoin = df.join(broadcast(dfMax),
    ($"Hour" === $"max_hour") && ($"TotalValue" === $"max_value"))
  .drop("max_hour")
  .drop("max_value")

dfTopByJoin.show</code>
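Note that this join-based approach keeps every row that ties for the maximum, so a group can produce more than one row. If exactly one row per group is required and any tied row is acceptable, one option (a sketch, not part of the original method) is to deduplicate on the grouping column afterwards:
<code>// Keep an arbitrary single row per Hour when TotalValue ties occur
dfTopByJoin.dropDuplicates("Hour").show</code>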
Struct sorting:
<code>// Pack the sort key and payload into a struct; max on a struct
// compares fields left to right, so TotalValue must come first
val dfTop = df.select($"Hour", struct($"TotalValue", $"Category").alias("vs"))
  .groupBy($"Hour")
  .agg(max("vs").alias("vs"))
  .select($"Hour", $"vs.Category", $"vs.TotalValue")

dfTop.show</code>
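This works because Spark orders structs field by field, left to right, which is why the sort column (TotalValue) must be the first field in the struct. A plain-Scala analogue of the same idea, using tuples (which compare element by element in the same way):
<code>// Illustration only: tuple ordering mirrors Spark's struct ordering
val rows = Seq((22.1, "cat13"), (30.9, "cat26"), (19.6, "cat95"))
rows.max  // (30.9, "cat26") -- the largest first element wins,
          // and the rest of the row comes along with it</code>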
DataSet API:
Spark 1.6:
<code>case class Record(Hour: Integer, Category: String, TotalValue: Double)

df.as[Record]
  .groupBy($"Hour")
  .reduce((x, y) => if (x.TotalValue > y.TotalValue) x else y)
  .show</code>
Spark 2.0 or higher:
<code>df.as[Record]
  .groupByKey(_.Hour)
  .reduceGroups((x, y) => if (x.TotalValue > y.TotalValue) x else y)
  .map(_._2)  // reduceGroups returns (key, record) pairs; keep the record
  .show</code>
Each of these methods selects the first row from each group under the specified ordering. Which one to use depends on your Spark version, the API you prefer (DataFrame vs. Dataset), and performance considerations such as how ties should be handled.