How to Replicate SQL's Row Numbering in Spark RDDs
Understanding the Problem
You want to generate a sequential row number for each entry in a Spark RDD, ordered by specific columns and partitioned by a key column, just like SQL's row_number() over (partition by ... order by ...), but using Spark RDDs.
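Spelled out, the SQL you are trying to replicate looks something like this (illustrative only; the table name is an assumption, and the column names match the sample data below):

SELECT key, col1, col2, col3,
       row_number() OVER (PARTITION BY key ORDER BY col1, col2, col3) AS rownum
FROM sample_table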
Your Initial Attempt
Your initial attempt used sortByKey and zipWithIndex, which did not produce the desired per-key row numbers. You also found that you could not apply sortBy directly to your data without first calling collect(), which returns a plain array rather than an RDD.
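A minimal sketch of that kind of attempt, using the temp1 RDD built in the next section (the variable names here are assumptions):

// zipWithIndex assigns a single zero-based index across the whole RDD,
// so the numbering never restarts for each key: not what row_number() does.
val globallyNumbered = temp1
  .map { case (key, c1, c2, c3) => (key, (c1, c2, c3)) }  // pair RDD keyed by the (Int, Int) key
  .sortByKey()
  .zipWithIndex()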
Solution Using Spark 1.4
Data Preparation
Create an RDD of tuples of the form (key, col1, col2, col3), where the key is itself a pair:
val sample_data = Seq(((3,4),5,5,5), ((3,4),5,5,9), ((3,4),7,5,5), ((1,2),1,2,3), ((1,2),1,4,7), ((1,2),2,2,3))
val temp1 = sc.parallelize(sample_data)
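One setup detail worth noting: in Spark 1.4, window functions require a HiveContext, and calling toDF on an RDD requires the context's implicits to be in scope. A minimal sketch (the spark-shell normally provides sqlContext for you):

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)  // window functions need HiveContext before Spark 2.0
import sqlContext.implicits._  // enables .toDF on RDDs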
Generating Partitioned Row Numbers
Convert the RDD to a DataFrame, then use rowNumber over a window partitioned by key and ordered by the remaining columns:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.rowNumber

val w = Window.partitionBy("key").orderBy("col1", "col2", "col3")
temp1.toDF("key", "col1", "col2", "col3").withColumn("rownum", rowNumber().over(w))
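Note that rowNumber was renamed to row_number in later Spark releases (and the old name was eventually removed), so on Spark 1.6+ you would write row_number().over(w) instead.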
Example Output
+---+----+----+----+------+
|key|col1|col2|col3|rownum|
+---+----+----+----+------+
|1,2|   1|   4|   7|     2|
|1,2|   1|   2|   3|     1|
|1,2|   2|   2|   3|     3|
|3,4|   5|   5|   5|     1|
|3,4|   5|   5|   9|     2|
|3,4|   7|   5|   5|     3|
+---+----+----+----+------+
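Since the question asks about RDDs specifically, here is a sketch of the same numbering done without DataFrames. This is an alternative, not the answer's method, and it comes with a caveat: groupByKey pulls each key's rows into memory, so every group must fit on one executor.

val byKey = temp1.map { case (key, c1, c2, c3) => (key, (c1, c2, c3)) }
val numbered = byKey.groupByKey().flatMap { case (key, rows) =>
  // sort each group by (col1, col2, col3) and number from 1, mirroring row_number()
  rows.toSeq.sorted.zipWithIndex.map { case ((c1, c2, c3), i) => (key, c1, c2, c3, i + 1) }
}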