Converting PySpark String to Date Format
You have a PySpark DataFrame with a string column holding dates in the MM-dd-yyyy format, and your attempt to convert this column with the to_date function returns nulls because the parser does not know the input layout. This article shows how to supply the format explicitly, for both current and older Spark versions.
Updated Recommendation (Spark 2.2+):
For Spark 2.2 and above, the preferred approach is to use the to_date or to_timestamp function, both of which accept a format argument. This lets you describe the input layout and convert the string column directly to a date or timestamp:
from pyspark.sql.functions import to_timestamp

df = spark.createDataFrame([('1997-02-28 10:30:00',)], ['t'])
df.select(to_timestamp(df.t, 'yyyy-MM-dd HH:mm:ss').alias('dt')).collect()
# Output:
# [Row(dt=datetime.datetime(1997, 2, 28, 10, 30))]
Original Answer (Spark < 2.2):
For earlier Spark versions, you can use the following method, which does not require a user-defined function (UDF):
from pyspark.sql.functions import unix_timestamp, from_unixtime

df = spark.createDataFrame(
    [("11/25/1991",), ("11/24/1991",), ("11/30/1991",)],
    ['date_str']
)

df2 = df.select(
    'date_str',
    from_unixtime(unix_timestamp('date_str', 'MM/dd/yyyy')).alias('date')
)

print(df2)
# Output:
# DataFrame[date_str: string, date: string]

df2.show(truncate=False)
# Output:
# +----------+-------------------+
# |date_str  |date               |
# +----------+-------------------+
# |11/25/1991|1991-11-25 00:00:00|
# |11/24/1991|1991-11-24 00:00:00|
# |11/30/1991|1991-11-30 00:00:00|
# +----------+-------------------+
In this method, unix_timestamp parses the string column with the given pattern into epoch seconds, and from_unixtime renders those seconds back as a string in the default yyyy-MM-dd HH:mm:ss format. Note that the result is a string column; cast it if you need a true date or timestamp type.