PySpark: Split Vector Into Columns
In PySpark, you may encounter a DataFrame with a vector column (for example, the output of Word2Vec or VectorAssembler) and need to split it into multiple columns, one per dimension. Here's how to achieve this:
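For illustration, the snippets below assume a DataFrame with a word column and a vector column, such as this hypothetical toy example:
<code class="python">
from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy data: each word paired with a 3-dimensional vector
df = spark.createDataFrame(
    [("assert", Vectors.dense([0.1, 0.2, 0.3])),
     ("require", Vectors.sparse(3, {1: 0.2}))],
    ["word", "vector"],
)
</code>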
For Spark >= 3.0.0
Starting with Spark 3.0.0, a convenient way to extract vector components is the vector_to_array function from pyspark.ml.functions:
<code class="python">
from pyspark.ml.functions import vector_to_array
from pyspark.sql.functions import col

df = df.withColumn("xs", vector_to_array("vector"))

# Pick the first three dimensions for illustration
result = df.select(["word"] + [col("xs")[i] for i in range(3)])
</code>
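If the number of dimensions is not known in advance, one option is to read it from the first row. This is a sketch that assumes every vector in the column has the same length; the column names x0, x1, ... are illustrative:
<code class="python">
# Assumes all vectors share the same length
n = len(df.first()["vector"])
result = df.select(["word"] + [col("xs")[i].alias(f"x{i}") for i in range(n)])
</code>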
For Spark < 3.0.0
Method 1: RDD Conversion
One approach involves converting the DataFrame to an RDD and extracting the vector components manually:
<code class="python">
rdd = df.rdd.map(lambda row: (row.word, ) + tuple(row.vector.toArray().tolist()))

# Only the first column is named; the rest are auto-named _2, _3, ...
result = rdd.toDF(["word"])
</code>
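Because only the first column is named, the vector components come out as _2, _3, and so on; they can be renamed afterwards. A minimal sketch, assuming three dimensions and hypothetical names x0 to x2:
<code class="python">
# Rename the auto-generated columns (assumes a 3-dimensional vector)
result = result.toDF("word", "x0", "x1", "x2")
</code>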
Method 2: UDF Creation
Alternatively, you can create a user-defined function (UDF) and apply it to the vector column:
<code class="python">
from pyspark.sql.functions import col, udf
from pyspark.sql.types import ArrayType, DoubleType

@udf(ArrayType(DoubleType()))
def to_array(vector):
    return vector.toArray().tolist()

# Pick the first three dimensions for illustration
result = df.withColumn("xs", to_array(col("vector"))) \
           .select(["word"] + [col("xs")[i] for i in range(3)])
</code>
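Since toArray() is defined on both dense and sparse vectors, the same UDF handles either kind. To give the extracted columns readable names, alias can be used, as in this small variation (the names x0, x1, x2 are illustrative):
<code class="python">
result = df.withColumn("xs", to_array(col("vector"))).select(
    ["word"] + [col("xs")[i].alias("x{}".format(i)) for i in range(3)]
)
</code>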