How to create columns based on other DataFrame row filters?

WBOY
Release: 2024-02-09 13:30:04
forward
534 people have browsed it

如何基于其他 DataFrame 行过滤器创建列?

Question content

I have a lazyframe called "hourly_data" which contains an hourly datetime column called "time". I also have a dataframe called "future_periods" which contains two datetime columns called "start" (start datetime of future period) and "end" (end time of future period). Importantly, these future periods do not overlap.

I want to create a column called "period" for hourly_data lazyframe, it should have an int value based on which period (future_periods dataframe row, from 0 to 9 if there are 10 periods) time column value The value of hourly_data is between the start and end column values ​​of future_periods.

I tried doing the following:

periods = pl.series(range(future_periods.height))
hourly_data = hourly_data.with_columns(
    (
        pl.when(((future_periods.get_column('start') <= pl.col('time')) & (pl.col('time') <= future_periods.get_column('end'))).any())
        .then(periods.filter(pl.series((future_periods.get_column('start') <= pl.col('real_time')) & (pl.col('real_time') <= future_periods.get_column('end')))).to_list()[0])
        .otherwise(none)
    ).alias('period')
)
Copy after login

But this gives me the error: typeerror: Calling series constructor with unsupported type 'expr' for values argument

What I want to achieve: enter:

hourly_data:
┌────────────────────┐
│ time               │
│ ---                │
│ datetime           │
╞════════════════════╡
│ 2024-01-01 00:00:00│
│ 2024-01-01 01:00:00│
│ 2024-01-01 02:00:00│
│         ...        │
│ 2024-03-31 23:00:00│
│ 2024-04-01 00:00:00│
│ 2024-04-01 01:00:00│
│         ...        │
│ 2024-06-01 00:00:00│
└────────────────────┘
future_periods:
┌─────────────────────────┬───────────────────────┐
│ start                   ┆ end                   │
│ ---                     ┆ ---                   │
│ datetime                ┆ datetime              │
╞═════════════════════════╪═══════════════════════╡
│ 2024-01-01 00:00:00     ┆ 2024-01-31 23:00:00   │
│ 2024-02-01 00:00:00     ┆ 2024-02-28 23:00:00   │
│ 2024-03-01 00:00:00     ┆ 2024-03-31 23:00:00   │
│ 2024-04-01 00:00:00     ┆ 2024-05-31 23:00:00   │
└─────────────────────────┴───────────────────────┘
Copy after login

Output:

hourly_data:
┌─────────────────────────┬────────┐
│ time                    ┆ period │
│ ---                     ┆ ---    │
│ datetime                ┆ int    │
╞═════════════════════════╪════════╡
│ 2024-01-01 00:00:00     ┆ 0      │
│ 2024-01-01 01:00:00     ┆ 0      │
│ 2024-01-01 02:00:00     ┆ 0      │
│          ...            ┆ ...    │
│ 2024-03-31 23:00:00     ┆ 2      │
│ 2024-04-01 00:00:00     ┆ 3      │
│ 2024-04-01 01:00:00     ┆ 3      │
│          ...            ┆ ...    │
│ 2024-06-01 00:00:00     ┆ None   │
└─────────────────────────┴────────┘
Copy after login


Correct answer


Generally speaking, it is an inequality join, or in your case, a range join. Here's one way to do it. Let's first create some sample data:

hourly_data = pl.dataframe({
    "time": ['2023-01-01 14:00','2023-01-02 09:00', '2023-01-04 11:00']
}).lazy()

future_periods = pl.dataframe({
    "id": [1,2,3,4],
    "start": ['2023-01-01 11:00','2023-01-02 10:00', '2023-01-03 15:00', '2023-01-04 10:00'],
    "end": ['2023-01-01 16:00','2023-01-02 11:00', '2023-01-03 18:00', '2023-01-04 15:00']
}).lazy()

┌──────────────────┬──────┐
│ time             ┆ data │
│ ---              ┆ ---  │
│ str              ┆ str  │
╞══════════════════╪══════╡
│ 2023-01-01 14:00 ┆ a    │
│ 2023-01-02 09:00 ┆ b    │
│ 2023-01-04 11:00 ┆ c    │
└──────────────────┴──────┘ 
┌─────┬──────────────────┬──────────────────┐
│ id  ┆ start            ┆ end              │
│ --- ┆ ---              ┆ ---              │
│ i64 ┆ str              ┆ str              │
╞═════╪══════════════════╪══════════════════╡
│ 1   ┆ 2023-01-01 11:00 ┆ 2023-01-01 16:00 │
│ 2   ┆ 2023-01-02 10:00 ┆ 2023-01-02 11:00 │
│ 3   ┆ 2023-01-03 15:00 ┆ 2023-01-03 18:00 │
│ 4   ┆ 2023-01-04 10:00 ┆ 2023-01-04 15:00 │
└─────┴──────────────────┴──────────────────┘
Copy after login

Now you can do it in two steps - first, calculate the link between time and the future period id:

time_periods = (
   hourly_data
       .join(future_periods, how="cross")
       .filter(
           pl.col("time") > pl.col("start"),
           pl.col("time") < pl.col("end")
        ).select(["time","id"])
)

┌──────────────────┬─────┐
│ time             ┆ id  │
│ ---              ┆ --- │
│ str              ┆ i64 │
╞══════════════════╪═════╡
│ 2023-01-01 14:00 ┆ 1   │
│ 2023-01-04 11:00 ┆ 4   │
└──────────────────┴─────┘
Copy after login

You can then join it with the original data frame:

hourly_data.join(time_periods, how="left", on="time").collect()

┌──────────────────┬──────┬──────┐
│ time             ┆ data ┆ id   │
│ ---              ┆ ---  ┆ ---  │
│ str              ┆ str  ┆ i64  │
╞══════════════════╪══════╪══════╡
│ 2023-01-01 14:00 ┆ a    ┆ 1    │
│ 2023-01-02 09:00 ┆ b    ┆ null │
│ 2023-01-04 11:00 ┆ c    ┆ 4    │
└──────────────────┴──────┴──────┘
Copy after login

Another way to do this might be to use duckdb Thanks for integrating with polars:

import duckdb
import polars as pl

duckdb.sql("""
    select
        h.time, h.data, p.id
    from hourly_data as h
        left join future_periods as p on
            p.start < h.time and
            p.end > h.time
""").pl()

┌──────────────────┬──────┬──────┐
│ time             ┆ data ┆ id   │
│ ---              ┆ ---  ┆ ---  │
│ str              ┆ str  ┆ i64  │
╞══════════════════╪══════╪══════╡
│ 2023-01-01 14:00 ┆ A    ┆ 1    │
│ 2023-01-04 11:00 ┆ C    ┆ 4    │
│ 2023-01-02 09:00 ┆ B    ┆ null │
└──────────────────┴──────┴──────┘
Copy after login

The above is the detailed content of How to create columns based on other DataFrame row filters?. For more information, please follow other related articles on the PHP Chinese website!

source:stackoverflow.com
Statement of this Website
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn
Popular Tutorials
More>
Latest Downloads
More>
Web Effects
Website Source Code
Website Materials
Front End Template
About us Disclaimer Sitemap
php.cn:Public welfare online PHP training,Help PHP learners grow quickly!