PySpark: Split Vector Into Columns
In PySpark, you may encounter a DataFrame with a vector column (for example, the output of Word2Vec or VectorAssembler) and need to split it into multiple columns, one per dimension. Here's how to achieve this:
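For illustration, the snippets below assume a DataFrame with a word column and a vector column, such as this hypothetical toy example:
<code class="python">
from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy data: each word paired with a 3-dimensional vector
df = spark.createDataFrame(
    [("assert", Vectors.dense([0.1, 0.2, 0.3])),
     ("require", Vectors.sparse(3, {1: 0.2}))],
    ["word", "vector"],
)
</code>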
For Spark >= 3.0.0
Starting with Spark 3.0.0, a convenient way to extract vector components is the vector_to_array function from pyspark.ml.functions:
<code class="python">
from pyspark.ml.functions import vector_to_array
from pyspark.sql.functions import col

df = df.withColumn("xs", vector_to_array("vector"))

# Pick the first three dimensions for illustration
result = df.select(["word"] + [col("xs")[i] for i in range(3)])
</code>
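If the number of dimensions is not known in advance, one option is to read it from the first row. This is a sketch that assumes every vector in the column has the same length; the column names x0, x1, ... are illustrative:
<code class="python">
# Assumes all vectors share the same length
n = len(df.first()["vector"])
result = df.select(["word"] + [col("xs")[i].alias(f"x{i}") for i in range(n)])
</code>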
For Spark < 3.0.0
Method 1: RDD Conversion
One approach involves converting the DataFrame to an RDD and extracting the vector components manually:
<code class="python">
rdd = df.rdd.map(lambda row: (row.word, ) + tuple(row.vector.toArray().tolist()))

# Only the first column is named; the rest are auto-named _2, _3, ...
result = rdd.toDF(["word"])
</code>
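Because only the first column is named, the vector components come out as _2, _3, and so on; they can be renamed afterwards. A minimal sketch, assuming three dimensions and hypothetical names x0 to x2:
<code class="python">
# Rename the auto-generated columns (assumes a 3-dimensional vector)
result = result.toDF("word", "x0", "x1", "x2")
</code>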
Method 2: UDF Creation
Alternatively, you can create a user-defined function (UDF) and apply it to the vector column:
<code class="python">
from pyspark.sql.functions import col, udf
from pyspark.sql.types import ArrayType, DoubleType

@udf(ArrayType(DoubleType()))
def to_array(vector):
    return vector.toArray().tolist()

# Pick the first three dimensions for illustration
result = df.withColumn("xs", to_array(col("vector"))) \
           .select(["word"] + [col("xs")[i] for i in range(3)])
</code>
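Since toArray() is defined on both dense and sparse vectors, the same UDF handles either kind. To give the extracted columns readable names, alias can be used, as in this small variation (the names x0, x1, x2 are illustrative):
<code class="python">
result = df.withColumn("xs", to_array(col("vector"))).select(
    ["word"] + [col("xs")[i].alias("x{}".format(i)) for i in range(3)]
)
</code>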