
How to Efficiently Select the First Row of Each Group with the Highest Value in a Spark DataFrame?


Selecting the top row of each group

The goal is to extract, from each group in a DataFrame, the row with the highest value. This need arises frequently when analyzing data trends or identifying the top performer in each category. Several approaches can be used:

Window functions:

Window functions perform calculations within each group (partition). Here, row_number() assigns a sequential number to each row according to the specified ordering; the top row of each group is then obtained by filtering for rows where that number equals 1.

<code class="language-scala">import org.apache.spark.sql.functions.{row_number, max, broadcast}
import org.apache.spark.sql.expressions.Window

val df = sc.parallelize(Seq(
  (0,"cat26",30.9), (0,"cat13",22.1), (0,"cat95",19.6), (0,"cat105",1.3),
  (1,"cat67",28.5), (1,"cat4",26.8), (1,"cat13",12.6), (1,"cat23",5.3),
  (2,"cat56",39.6), (2,"cat40",29.7), (2,"cat187",27.9), (2,"cat68",9.8),
  (3,"cat8",35.6))).toDF("Hour", "Category", "TotalValue")

val w = Window.partitionBy($"hour").orderBy($"TotalValue".desc)

val dfTop = df.withColumn("rn", row_number.over(w)).where($"rn" === 1).drop("rn")

dfTop.show
// +----+--------+----------+
// |Hour|Category|TotalValue|
// +----+--------+----------+
// |   0|   cat26|      30.9|
// |   1|   cat67|      28.5|
// |   2|   cat56|      39.6|
// |   3|    cat8|      35.6|
// +----+--------+----------+</code>
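The same logic can be expressed in plain Spark SQL through a temporary view (a minimal sketch; the view name events is our own choice):

<code class="language-scala">// Register the DataFrame as a temporary view so it can be queried with SQL
df.createOrReplaceTempView("events")

val dfTopSql = spark.sql("""
  SELECT Hour, Category, TotalValue
  FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY Hour ORDER BY TotalValue DESC) AS rn
    FROM events
  ) ranked
  WHERE rn = 1
""")

dfTopSql.show</code>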

Plain aggregation followed by a join:

Alternatively, we can aggregate the maximum value per group and then join the result back to the original DataFrame to recover the full top row of each group.

<code class="language-scala">val dfMax = df.groupBy($"hour".as("max_hour")).agg(max($"TotalValue").as("max_value"))

val dfTopByJoin = df.join(broadcast(dfMax),
    ($"hour" === $"max_hour") && ($"TotalValue" === $"max_value"))
  .drop("max_hour")
  .drop("max_value")

dfTopByJoin.show

// +----+--------+----------+
// |Hour|Category|TotalValue|
// +----+--------+----------+
// |   0|   cat26|      30.9|
// |   1|   cat67|      28.5|
// |   2|   cat56|      39.6|
// |   3|    cat8|      35.6|
// +----+--------+----------+</code>
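Because the join keeps every row matching the maximum, groups with tied values yield more than one row. If exactly one row per group is required, one option is to deduplicate on the key afterwards (a minimal sketch; which tied row survives is not deterministic):

<code class="language-scala">// Collapse ties to a single, arbitrarily chosen row per Hour
val dfTopByJoinNoTies = dfTopByJoin.dropDuplicates("Hour")

dfTopByJoinNoTies.show</code>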

Ordering over structs:

A neat way to get the same result without window functions or joins is to aggregate over a struct that holds the value first and the category second. Since structs are compared field by field, taking the max of this struct yields the top row of each group.

<code class="language-scala">val dfTop = df.select($"Hour", struct($"TotalValue", $"Category").alias("vs"))
  .groupBy($"hour")
  .agg(max("vs").alias("vs"))
  .select($"Hour", $"vs.Category", $"vs.TotalValue")

dfTop.show
// +----+--------+----------+
// |Hour|Category|TotalValue|
// +----+--------+----------+
// |   0|   cat26|      30.9|
// |   1|   cat67|      28.5|
// |   2|   cat56|      39.6|
// |   3|    cat8|      35.6|
// +----+--------+----------+</code>
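On Spark 3.0 and later, the built-in max_by aggregate reaches the same result even more directly (a minimal sketch, assuming Spark >= 3.0; ties are broken arbitrarily):

<code class="language-scala">import org.apache.spark.sql.functions.{expr, max}

// max_by(Category, TotalValue) returns the Category of the row holding
// the largest TotalValue within each group
val dfTopMaxBy = df.groupBy($"Hour")
  .agg(
    expr("max_by(Category, TotalValue)").as("Category"),
    max($"TotalValue").as("TotalValue"))

dfTopMaxBy.show</code>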

The above covers three efficient ways to select the first row of each group with the highest value in a Spark DataFrame.
