How to Perform Grouped TopN Aggregation Using Spark DataFrame
In Spark SQL, you can leverage SQL-like operations to perform complex data manipulations. One common task is to group data by a key and retrieve the top N rows from each group. Here's how you can achieve this with a Spark DataFrame:
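For concreteness, the snippets below assume a small DataFrame of user ratings (the column and variable names here are illustrative, not from the original):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("grouped-topn").getOrCreate()
import spark.implicits._

// Hypothetical sample data: one row per (user, item, rating)
val df = Seq(
  ("alice", "a", 5.0), ("alice", "b", 3.0), ("alice", "c", 4.0),
  ("bob",   "a", 2.0), ("bob",   "b", 5.0), ("bob",   "c", 5.0)
).toDF("user", "item", "rating")

val n = 2 // number of top rows to keep per user
```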
To group data by a specific column, pass the column name to `groupBy`:

```scala
df.groupBy("user")
```
A tempting first attempt is to sort the data and then limit it:

```scala
df.orderBy(desc("rating")).limit(n)
```

However, this does not produce a per-group result. `groupBy` returns a `RelationalGroupedDataset`, which only supports aggregations and has no `orderBy` or `limit` method, and calling `orderBy(...).limit(n)` on the DataFrame itself sorts the entire dataset and returns the global top n rows, not the top n rows for each user.
The correct approach is to use window functions to rank the records within each group and then filter on the rank:
```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{rank, desc}
import spark.implicits._ // for the $"..." column syntax

// Window definition: one partition per user, ordered by rating descending
val w = Window.partitionBy($"user").orderBy(desc("rating"))

// Rank rows within each partition, then keep the top n
df.withColumn("rank", rank().over(w)).where($"rank" <= n)
```
Note that `rank` can return more than n rows per group when there are ties; if you don't care about ties and want exactly n rows per group, replace `rank` with `row_number`.
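As a sketch of that tie-free variant (assuming the same `df`, `n`, and SparkSession as above):

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{row_number, desc}
import spark.implicits._

// row_number assigns a unique 1, 2, ... index within each partition,
// so the filter keeps exactly n rows per user even when ratings tie
val w = Window.partitionBy($"user").orderBy(desc("rating"))

df.withColumn("rn", row_number().over(w))
  .where($"rn" <= n)
  .drop("rn")
```

Which variant is right depends on the semantics you want for ties: `rank` keeps all tied rows at the cutoff, while `row_number` breaks ties arbitrarily.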