How to Replicate SQL's Row Numbering in Spark RDDs
Understanding the Problem
You want to generate a sequential row number for each entry in a Spark RDD, ordered by specific columns and partitioned by a key column, just like SQL's row_number() over (partition by ... order by ...), but using Spark RDDs.
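Spelled out, the SQL you are trying to replicate looks something like this (illustrative only; the table name is an assumption, and the column names match the sample data below):

SELECT key, col1, col2, col3,
       row_number() OVER (PARTITION BY key ORDER BY col1, col2, col3) AS rownum
FROM sample_table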
Your Initial Attempt
Your initial attempt used sortByKey and zipWithIndex, which did not produce the desired per-key row numbers. You also found that you could not apply sortBy directly to your data without first calling collect(), which returns a plain array rather than an RDD.
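A minimal sketch of that kind of attempt, using the temp1 RDD built in the next section (the variable names here are assumptions):

// zipWithIndex assigns a single zero-based index across the whole RDD,
// so the numbering never restarts for each key: not what row_number() does.
val globallyNumbered = temp1
  .map { case (key, c1, c2, c3) => (key, (c1, c2, c3)) }  // pair RDD keyed by the (Int, Int) key
  .sortByKey()
  .zipWithIndex()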
Solution Using Spark 1.4
Data Preparation
Create an RDD of tuples of the form (key, col1, col2, col3), where the key is itself a pair:
val sample_data = Seq(((3,4),5,5,5), ((3,4),5,5,9), ((3,4),7,5,5), ((1,2),1,2,3), ((1,2),1,4,7), ((1,2),2,2,3))
val temp1 = sc.parallelize(sample_data)
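One setup detail worth noting: in Spark 1.4, window functions require a HiveContext, and calling toDF on an RDD requires the context's implicits to be in scope. A minimal sketch (the spark-shell normally provides sqlContext for you):

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)  // window functions need HiveContext before Spark 2.0
import sqlContext.implicits._  // enables .toDF on RDDs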
Generating Partitioned Row Numbers
Convert the RDD to a DataFrame, then use rowNumber over a window partitioned by key and ordered by the remaining columns:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.rowNumber

val w = Window.partitionBy("key").orderBy("col1", "col2", "col3")
temp1.toDF("key", "col1", "col2", "col3").withColumn("rownum", rowNumber().over(w))
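Note that rowNumber was renamed to row_number in later Spark releases (and the old name was eventually removed), so on Spark 1.6+ you would write row_number().over(w) instead.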
Example Output
+---+----+----+----+------+
|key|col1|col2|col3|rownum|
+---+----+----+----+------+
|1,2|   1|   4|   7|     2|
|1,2|   1|   2|   3|     1|
|1,2|   2|   2|   3|     3|
|3,4|   5|   5|   5|     1|
|3,4|   5|   5|   9|     2|
|3,4|   7|   5|   5|     3|
+---+----+----+----+------+
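Since the question asks about RDDs specifically, here is a sketch of the same numbering done without DataFrames. This is an alternative, not the answer's method, and it comes with a caveat: groupByKey pulls each key's rows into memory, so every group must fit on one executor.

val byKey = temp1.map { case (key, c1, c2, c3) => (key, (c1, c2, c3)) }
val numbered = byKey.groupByKey().flatMap { case (key, rows) =>
  // sort each group by (col1, col2, col3) and number from 1, mirroring row_number()
  rows.toSeq.sorted.zipWithIndex.map { case ((c1, c2, c3), i) => (key, c1, c2, c3, i + 1) }
}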