I have a lazyframe called "hourly_data" which contains an hourly datetime column called "time". I also have a dataframe called "future_periods" which contains two datetime columns called "start" (start datetime of future period) and "end" (end time of future period). Importantly, these future periods do not overlap.
I want to create a column called "period" for hourly_data lazyframe, it should have an int value based on which period (future_periods dataframe row, from 0 to 9 if there are 10 periods) time column value The value of hourly_data is between the start and end column values of future_periods.
I tried doing the following:
periods = pl.series(range(future_periods.height)) hourly_data = hourly_data.with_columns( ( pl.when(((future_periods.get_column('start') <= pl.col('time')) & (pl.col('time') <= future_periods.get_column('end'))).any()) .then(periods.filter(pl.series((future_periods.get_column('start') <= pl.col('real_time')) & (pl.col('real_time') <= future_periods.get_column('end')))).to_list()[0]) .otherwise(none) ).alias('period') )
But this gives me the error: typeerror: Calling series constructor with unsupported type 'expr' for values
argument
What I want to achieve: enter:
hourly_data: ┌────────────────────┐ │ time │ │ --- │ │ datetime │ ╞════════════════════╡ │ 2024-01-01 00:00:00│ │ 2024-01-01 01:00:00│ │ 2024-01-01 02:00:00│ │ ... │ │ 2024-03-31 23:00:00│ │ 2024-04-01 00:00:00│ │ 2024-04-01 01:00:00│ │ ... │ │ 2024-06-01 00:00:00│ └────────────────────┘ future_periods: ┌─────────────────────────┬───────────────────────┐ │ start ┆ end │ │ --- ┆ --- │ │ datetime ┆ datetime │ ╞═════════════════════════╪═══════════════════════╡ │ 2024-01-01 00:00:00 ┆ 2024-01-31 23:00:00 │ │ 2024-02-01 00:00:00 ┆ 2024-02-28 23:00:00 │ │ 2024-03-01 00:00:00 ┆ 2024-03-31 23:00:00 │ │ 2024-04-01 00:00:00 ┆ 2024-05-31 23:00:00 │ └─────────────────────────┴───────────────────────┘
Output:
hourly_data: ┌─────────────────────────┬────────┐ │ time ┆ period │ │ --- ┆ --- │ │ datetime ┆ int │ ╞═════════════════════════╪════════╡ │ 2024-01-01 00:00:00 ┆ 0 │ │ 2024-01-01 01:00:00 ┆ 0 │ │ 2024-01-01 02:00:00 ┆ 0 │ │ ... ┆ ... │ │ 2024-03-31 23:00:00 ┆ 2 │ │ 2024-04-01 00:00:00 ┆ 3 │ │ 2024-04-01 01:00:00 ┆ 3 │ │ ... ┆ ... │ │ 2024-06-01 00:00:00 ┆ None │ └─────────────────────────┴────────┘
Generally speaking, it is an inequality join, or in your case, a range join. Here's one way to do it. Let's first create some sample data:
hourly_data = pl.dataframe({ "time": ['2023-01-01 14:00','2023-01-02 09:00', '2023-01-04 11:00'] }).lazy() future_periods = pl.dataframe({ "id": [1,2,3,4], "start": ['2023-01-01 11:00','2023-01-02 10:00', '2023-01-03 15:00', '2023-01-04 10:00'], "end": ['2023-01-01 16:00','2023-01-02 11:00', '2023-01-03 18:00', '2023-01-04 15:00'] }).lazy() ┌──────────────────┬──────┐ │ time ┆ data │ │ --- ┆ --- │ │ str ┆ str │ ╞══════════════════╪══════╡ │ 2023-01-01 14:00 ┆ a │ │ 2023-01-02 09:00 ┆ b │ │ 2023-01-04 11:00 ┆ c │ └──────────────────┴──────┘ ┌─────┬──────────────────┬──────────────────┐ │ id ┆ start ┆ end │ │ --- ┆ --- ┆ --- │ │ i64 ┆ str ┆ str │ ╞═════╪══════════════════╪══════════════════╡ │ 1 ┆ 2023-01-01 11:00 ┆ 2023-01-01 16:00 │ │ 2 ┆ 2023-01-02 10:00 ┆ 2023-01-02 11:00 │ │ 3 ┆ 2023-01-03 15:00 ┆ 2023-01-03 18:00 │ │ 4 ┆ 2023-01-04 10:00 ┆ 2023-01-04 15:00 │ └─────┴──────────────────┴──────────────────┘
Now you can do it in two steps - first, calculate the link between time
and the future period id
:
time_periods = ( hourly_data .join(future_periods, how="cross") .filter( pl.col("time") > pl.col("start"), pl.col("time") < pl.col("end") ).select(["time","id"]) ) ┌──────────────────┬─────┐ │ time ┆ id │ │ --- ┆ --- │ │ str ┆ i64 │ ╞══════════════════╪═════╡ │ 2023-01-01 14:00 ┆ 1 │ │ 2023-01-04 11:00 ┆ 4 │ └──────────────────┴─────┘
You can then join it with the original data frame:
hourly_data.join(time_periods, how="left", on="time").collect() ┌──────────────────┬──────┬──────┐ │ time ┆ data ┆ id │ │ --- ┆ --- ┆ --- │ │ str ┆ str ┆ i64 │ ╞══════════════════╪══════╪══════╡ │ 2023-01-01 14:00 ┆ a ┆ 1 │ │ 2023-01-02 09:00 ┆ b ┆ null │ │ 2023-01-04 11:00 ┆ c ┆ 4 │ └──────────────────┴──────┴──────┘
Another way to do this might be to use duckdb
Thanks for integrating with polars
:
import duckdb import polars as pl duckdb.sql(""" select h.time, h.data, p.id from hourly_data as h left join future_periods as p on p.start < h.time and p.end > h.time """).pl() ┌──────────────────┬──────┬──────┐ │ time ┆ data ┆ id │ │ --- ┆ --- ┆ --- │ │ str ┆ str ┆ i64 │ ╞══════════════════╪══════╪══════╡ │ 2023-01-01 14:00 ┆ A ┆ 1 │ │ 2023-01-04 11:00 ┆ C ┆ 4 │ │ 2023-01-02 09:00 ┆ B ┆ null │ └──────────────────┴──────┴──────┘
The above is the detailed content of How to create columns based on other DataFrame row filters?. For more information, please follow other related articles on the PHP Chinese website!