計算多索引 pandas 資料幀外部索引每行的總和

WBOY
發布: 2024-02-05 22:00:13
轉載
1143 人瀏覽過

计算多索引 pandas 数据帧外部索引每行的总和

問題內容

我有一個資料框:selleritempriceshipping免費送貨最低count availablecount required。我的目標是根據稍後計算的 total 找到 selleritem 的最便宜的組合(計算程式碼如下所示)。範例資料如下:

import pandas as pd

item1 = ['item 1', 'item 2', 'item 1', 'item 1', 'item 2']
seller1 = ['seller 1', 'seller 2', 'seller 3', 'seller 4', 'seller 1']
price1 = [1.85, 1.94, 2.00, 2.00, 2.02]
shipping1 = [0.99, 0.99, 0.99, 2.99, 0.99]
freeship1 = [5, 5, 5, 50, 5]
countavailable1 = [1, 2, 2, 5, 2]
countneeded1 = [2, 1, 2, 2, 1]

df1 = pd.dataframe({'seller':seller1,
                    'item':item1,
                    'price':price1,
                    'shipping':shipping1,
                    'free shipping minimum':freeship1,
                    'count available':countavailable1,
                    'count needed':countneeded1})

# create columns that states if seller has all counts needed.
# this will be used to sort by to prioritize the smallest number of orders possible
for index, row in df1.iterrows():
    if row['count available'] >= row['count needed']:
        df1.at[index, 'fulfills count needed'] = 'yes'
    else:
        df1.at[index, 'fulfills count needed'] = 'no'

# dont want to calc price based on [count available], so need to check if seller has count i need and calc cost based on [count needed].
# if doesn't have [count needed], then calc cost on [count available].
for index, row in df1.iterrows():
    if row['count available'] >= row['count needed']:
        df1.at[index, 'price x count'] = row['count needed'] * row['price']
    else:
        df1.at[index, 'price x count'] = row['count available'] * row['price']
登入後複製

但是,任何一個seller都可以出售多個item。我想盡量減少支付的運費,所以我想透過 selleritems 分組在一起。因此,我根據我在另一個線程中看到的方式使用 .first() 方法對它們進行分組,以便將每一列保留在新的分組資料框中。

# don't calc [total] until sellers have been grouped
# use first() method to return all columns and perform no other aggregations
grouped1 = df1.sort_values('price').groupby(['seller', 'item']).first()
登入後複製

此時我想透過seller計算total。所以我有以下程式碼,但它為每個item 計算total,而不是seller,這意味著shipping 根據每個群組中的商品數量被多次添加,或當price x count 結束時不應用免費送貨最低免運費

# calc [Total]
for index, row in grouped1.iterrows():
    if (row['Free Shipping Minimum'] == 50) & (row['Price x Count'] > 50):
        grouped1.at[index, 'Total'] = row['Price x Count'] + 0
    elif (row['Free Shipping Minimum'] == 5) & (row['Price x Count'] > 5):
        grouped1.at[index, 'Total'] = row['Price x Count'] + 0
    else:
        grouped1.at[index, 'Total'] = row['Price x Count'] + row['Shipping']
登入後複製

實際上看起來我可能需要在計算total 時對每個seller 求和price x count ,但這本質上是同一個問題,因為我不知道如何計算外部索引的每行列。我可以使用什麼方法來做到這一點?

另外,如果有人對如何實現我的後半部目標有任何建議,請儘管提出。我只想退回我需要的每件商品。例如,我需要 2 個「項目 1」和 2 個「項目 2」。如果“賣家1”有2 個“商品1”和1 個“商品2”,而“賣家2”有1 個“商品1”和1 個“商品2”,那麼我想要“賣家1”的所有商品(假設它最便宜),但只有「賣家2」的1 個「商品1」。這似乎會影響 total 列的計算,但我不確定如何實現它。


正確答案


我最終決定先將seller 分組,並對price x count 進行求和以找到subtotals,將其轉換為資料幀,然後將df1 與新的subtotal 資料幀合併以建立groupedphpcnend cphpcn 資料框。然後我使用 <code>np.where 建議創建了 totals 列(這比我的 for 循環優雅得多,並且可以輕鬆處理 nan 值)。最後按sellertotalitem分組傳回我想要的結果。最終程式碼如下:

import pandas as pd
import numpy as np

item1 = ['item 1', 'item 2', 'item 1', 'item 1', 'item 2']
seller1 = ['Seller 1', 'Seller 2', 'Seller 3', 'Seller 4', 'Seller 1']
price1 = [1.85, 1.94, 2.69, 2.00, 2.02]
shipping1 = [0.99, 0.99, 0.99, 2.99, 0.99]
freeship1 = [5, 5, 5, 50, 5]
countavailable1 = [1, 2, 2, 5, 2]
countneeded1 = [2, 1, 2, 2, 1]

df1 = pd.DataFrame({'Seller':seller1,
                    'Item':item1,
                    'Price':price1,
                    'Shipping':shipping1,
                    'Free Shipping Minimum':freeship1,
                    'Count Available':countavailable1,
                    'Count Needed':countneeded1})

# create columns that states if seller has all counts needed.
# this will be used to sort by to prioritize the smallest number of orders possible
for index, row in df1.iterrows():
    if row['Count Available'] >= row['Count Needed']:
        df1.at[index, 'Fulfills Count Needed'] = 'Yes'
    else:
        df1.at[index, 'Fulfills Count Needed'] = 'No'

# dont want to calc price based on [count available], so need to check if seller has count I need and calc cost based on [count needed].
# if doesn't have [count needed], then calc cost on [count available].
for index, row in df1.iterrows():
    if row['Count Available'] >= row['Count Needed']:
        df1.at[index, 'Price x Count'] = row['Count Needed'] * row['Price']
    else:
        df1.at[index, 'Price x Count'] = row['Count Available'] * row['Price']

# subtotals by seller, then assign calcs to column called [Subtotal] and merge into dataframe
subtotals = df1.groupby(['Seller'])['Price x Count'].sum().reset_index()

subtotals.rename({'Price x Count':'Subtotal'}, axis=1, inplace=True)

grouped = df1.merge(subtotals[['Subtotal', 'Seller']], on='Seller')


# calc [Total]
grouped['Total'] = np.where(grouped['Subtotal'] > grouped['Free Shipping Minimum'],
                             grouped['Subtotal'], grouped['Subtotal'] + grouped['Shipping'])

grouped.groupby(['Seller', 'Total', 'Item']).first()
登入後複製

以上是計算多索引 pandas 資料幀外部索引每行的總和的詳細內容。更多資訊請關注PHP中文網其他相關文章!

來源:stackoverflow.com
本網站聲明
本文內容由網友自願投稿,版權歸原作者所有。本站不承擔相應的法律責任。如發現涉嫌抄襲或侵權的內容,請聯絡admin@php.cn
熱門教學
更多>
最新下載
更多>
網站特效
網站源碼
網站素材
前端模板