Splitting Vector Column into Columns using PySpark
You have a PySpark DataFrame with two columns: word and vector, where vector is a VectorUDT column. Your goal is to split the vector column into multiple columns, each representing one dimension of the vector.
Solution:
Spark >= 3.0.0
In Spark versions 3.0.0 and above, you can use the vector_to_array function to achieve this:
<code class="python">from pyspark.ml.functions import vector_to_array
from pyspark.sql.functions import col

(df
    .withColumn("xs", vector_to_array("vector"))
    .select(["word"] + [col("xs")[i] for i in range(3)]))</code>
The result is a DataFrame with the columns word, xs[0], xs[1], and xs[2], one column per vector dimension. The argument to range must match the vector length, which is 3 for the example data.
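If the vector length is not known in advance, or if you want friendlier column names than xs[0], the same idea can be wrapped in a small helper. The following is only a sketch under a few assumptions: the names split_vector, dimension_column_names, the x_ prefix, and the temporary column _xs are all hypothetical, and every vector in the column is assumed to have the same length as the one in the first row.

```python
def dimension_column_names(prefix, size):
    """Build output column names such as x_0, x_1, ... (hypothetical naming scheme)."""
    return [f"{prefix}_{i}" for i in range(size)]


def split_vector(df, vector_col="vector", keep=("word",), prefix="x"):
    """Expand a vector column into one aliased column per dimension (Spark >= 3.0.0)."""
    # Imported here so that the naming helper above can be used without PySpark installed.
    from pyspark.ml.functions import vector_to_array
    from pyspark.sql.functions import col

    # Assumption: all vectors share the length of the first row's vector.
    # This triggers a small Spark job to fetch that row.
    size = len(df.select(vector_col).first()[0])
    names = dimension_column_names(prefix, size)

    return (df
        .withColumn("_xs", vector_to_array(col(vector_col)))
        .select(list(keep) + [col("_xs")[i].alias(name) for i, name in enumerate(names)]))
```

With the example DataFrame, split_vector(df) should return the columns word, x_0, x_1, and x_2. Reading the first row costs an extra job, so if the dimension is already known it may be simpler to pass it in directly.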
Spark < 3.0.0
For older Spark versions, you can follow these approaches:
Convert to RDD and Extract
<code class="python">from pyspark.ml.linalg import Vectors

# sc is an existing SparkContext
df = sc.parallelize([
    ("assert", Vectors.dense([1, 2, 3])),
    ("require", Vectors.sparse(3, {1: 2}))
]).toDF(["word", "vector"])

def extract(row):
    # Convert the vector (dense or sparse) to a plain tuple of Python floats
    return (row.word, ) + tuple(row.vector.toArray().tolist())

df.rdd.map(extract).toDF(["word"])  # Vector values will be named _2, _3, ...</code>
Create a UDF
<code class="python">from pyspark.sql.functions import udf, col
from pyspark.sql.types import ArrayType, DoubleType

def to_array(col):
    def to_array_(v):
        return v.toArray().tolist()
    # Important: asNondeterministic requires Spark 2.3 or later
    # It can be safely removed i.e.
    # return udf(to_array_, ArrayType(DoubleType()))(col)
    # but at the cost of decreased performance
    return udf(to_array_, ArrayType(DoubleType())).asNondeterministic()(col)

(df
    .withColumn("xs", to_array(col("vector")))
    .select(["word"] + [col("xs")[i] for i in range(3)]))</code>
Either approach will result in a DataFrame with separate columns for each dimension of the original vector, making it easier to work with the data.