In PySpark, you may need to extract the individual dimensions of a vector column (type VectorUDT) into separate columns. The right approach depends on your Spark version.
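The snippets below all assume a DataFrame with a string column word and a VectorUDT column vector. A minimal setup might look like this (the toy data is assumed purely for illustration):
<code class="python">from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.getOrCreate()

# Toy data matching the "word"/"vector" columns used below (values assumed)
df = spark.createDataFrame(
    [("assert", Vectors.dense([1.0, 2.0, 3.0])),
     ("require", Vectors.sparse(3, {1: 2.0}))],
    ["word", "vector"],
)</code>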
Spark >= 3.0.0
PySpark 3.0.0 brings built-in functionality for this task:
<code class="python">from pyspark.ml.functions import vector_to_array df.withColumn("xs", vector_to_array("vector")).select(["word"] + [col("xs")[i] for i in range(3)])</code>
This concisely converts the vector into an array and projects the desired columns.
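If you prefer friendlier names than xs[0], xs[1], ..., you can alias each element. This is a small variation on the snippet above; the x0/x1/x2 names are chosen only for illustration:
<code class="python">df.withColumn("xs", vector_to_array("vector")).select(
    ["word"] + [col("xs")[i].alias(f"x{i}") for i in range(3)]
).show()
# With the toy data above this yields:
# assert  -> x0=1.0, x1=2.0, x2=3.0
# require -> x0=0.0, x1=2.0, x2=0.0</code>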
Spark < 3.0.0
Pre-3.0.0 Spark versions require more intricate approaches:
RDD Conversion:
<code class="python">df.rdd.map(lambda row: (row.word,) + tuple(row.vector.toArray().tolist())).toDF(["word"])</code>
UDF Method:
<code class="python">from pyspark.sql.functions import udf, col from pyspark.sql.types import ArrayType, DoubleType def to_array(col): return udf(lambda v: v.toArray().tolist(), ArrayType(DoubleType()))(col) df.withColumn("xs", to_array(col("vector"))).select(["word"] + [col("xs")[i] for i in range(3)])</code>
Note: For better performance on Spark 2.3+, mark the UDF as nondeterministic with asNondeterministic(), as sketched below.
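A minimal sketch of that variant, assuming the same three-dimensional vectors as above:
<code class="python">from pyspark.sql.functions import udf, col
from pyspark.sql.types import ArrayType, DoubleType

# Marking the UDF nondeterministic (Spark 2.3+) keeps the optimizer
# from re-evaluating it once per extracted element.
to_array = udf(
    lambda v: v.toArray().tolist(), ArrayType(DoubleType())
).asNondeterministic()

df.withColumn("xs", to_array(col("vector"))) \
  .select(["word"] + [col("xs")[i] for i in range(3)])</code>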
Scala Equivalent
For the Scala equivalent of these approaches, refer to "Spark Scala: How to convert Dataframe[vector] to DataFrame[f1:Double, ..., fn: Double)]."