Splitting Vector Data into Columns in PySpark
Converting a vector column into multiple columns, one per vector dimension, is a common need in data analysis and machine learning. This question addresses the problem in the context of Apache PySpark.
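For concreteness, the examples below assume a small DataFrame of the following shape (a hypothetical sample, with an existing SparkSession named spark; any DataFrame with a string column and an ML vector column works the same way):
<code class="python">from pyspark.ml.linalg import Vectors

# hypothetical sample: a word column plus a 3-dimensional vector column
df = spark.createDataFrame(
    [("assert", Vectors.dense([1, 2, 3])),
     ("require", Vectors.sparse(3, {1: 2}))],
    ["word", "vector"])</code>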
Extraction Using Spark >= 3.0.0
For Spark versions 3.0.0 and above, a simplified approach is available using the vector_to_array function:
<code class="python">from pyspark.ml.functions import vector_to_array
from pyspark.sql.functions import col

(df
    .withColumn("xs", vector_to_array("vector"))
    .select(["word"] + [col("xs")[i] for i in range(3)]))</code>
This creates a new column xs containing the vector's elements as an array; the select then promotes each array element to its own column.
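The range(3) above hard-codes the vector length. If the length is not known up front, one common pattern (an assumption added here, not part of the original answer) is to read it off the first row and alias the columns for readable names:
<code class="python"># assumes the DataFrame is non-empty and all vectors share one length
n = len(df.first()["vector"])

(df
    .withColumn("xs", vector_to_array("vector"))
    .select(["word"] + [col("xs")[i].alias(f"x{i}") for i in range(n)]))</code>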
Extraction Using Spark < 3.0.0
For Spark versions prior to 3.0.0, the following methods can be employed:
Converting to RDD and Extracting:
Convert the DataFrame to an RDD and perform element-wise extraction of vector values:
<code class="python">def extract(row):
    return (row.word, ) + tuple(row.vector.toArray().tolist())

# remaining vector values will be auto-named _2, _3, ...
df.rdd.map(extract).toDF(["word"])</code>
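Alternatively (an assumption here, written for the 3-dimensional sample above), the full list of column names can be passed to toDF directly:
<code class="python"># name every column explicitly instead of relying on _2, _3, ...
df.rdd.map(extract).toDF(["word", "x0", "x1", "x2"])</code>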
UDF Approach:
Define a user-defined function (UDF) to convert the vector column to an array:
<code class="python">from pyspark.sql.functions import udf, col
from pyspark.sql.types import ArrayType, DoubleType

def to_array(col):
    def to_array_(v):
        return v.toArray().tolist()
    # asNondeterministic requires Spark 2.3 or later
    return udf(to_array_, ArrayType(DoubleType())).asNondeterministic()(col)

(df
    .withColumn("xs", to_array(col("vector")))
    .select(["word"] + [col("xs")[i] for i in range(3)]))</code>
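As a quick sanity check (using the hypothetical sample DataFrame from the top of this article), the result can be inspected with show(); dense and sparse vectors both expand the same way:
<code class="python">result = (df
    .withColumn("xs", to_array(col("vector")))
    .select(["word"] + [col("xs")[i] for i in range(3)]))
result.show()
# +-------+-----+-----+-----+
# |   word|xs[0]|xs[1]|xs[2]|
# +-------+-----+-----+-----+
# | assert|  1.0|  2.0|  3.0|
# |require|  0.0|  2.0|  0.0|
# +-------+-----+-----+-----+</code>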
Either of these pre-3.0.0 approaches extracts the vector elements into separate columns, enabling further analysis and use.