計算多索引 pandas 資料幀外部索引每行的總和-Python教學-PHP中文網

正確答案

首頁

後端開發

Python教學

計算多索引 pandas 資料幀外部索引每行的總和

WBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWB

Feb 05, 2024 pm 10:00 PM

计算多索引 pandas 数据帧外部索引每行的总和

問題內容

我有一個資料框：seller、item、price、shipping、免費送貨最低、count available和count required。我的目標是根據稍後計算的 total 找到 seller 和 item 的最便宜的組合（計算程式碼如下所示）。範例資料如下：

import pandas as pd

item1 = ['item 1', 'item 2', 'item 1', 'item 1', 'item 2']
seller1 = ['seller 1', 'seller 2', 'seller 3', 'seller 4', 'seller 1']
price1 = [1.85, 1.94, 2.00, 2.00, 2.02]
shipping1 = [0.99, 0.99, 0.99, 2.99, 0.99]
freeship1 = [5, 5, 5, 50, 5]
countavailable1 = [1, 2, 2, 5, 2]
countneeded1 = [2, 1, 2, 2, 1]

df1 = pd.dataframe({'seller':seller1,
                    'item':item1,
                    'price':price1,
                    'shipping':shipping1,
                    'free shipping minimum':freeship1,
                    'count available':countavailable1,
                    'count needed':countneeded1})

# create columns that states if seller has all counts needed.
# this will be used to sort by to prioritize the smallest number of orders possible
for index, row in df1.iterrows():
    if row['count available'] >= row['count needed']:
        df1.at[index, 'fulfills count needed'] = 'yes'
    else:
        df1.at[index, 'fulfills count needed'] = 'no'

# dont want to calc price based on [count available], so need to check if seller has count i need and calc cost based on [count needed].
# if doesn't have [count needed], then calc cost on [count available].
for index, row in df1.iterrows():
    if row['count available'] >= row['count needed']:
        df1.at[index, 'price x count'] = row['count needed'] * row['price']
    else:
        df1.at[index, 'price x count'] = row['count available'] * row['price']

登入後複製

但是，任何一個seller都可以出售多個item。我想盡量減少支付的運費，所以我想透過 seller 將 items 分組在一起。因此，我根據我在另一個線程中看到的方式使用 .first() 方法對它們進行分組，以便將每一列保留在新的分組資料框中。

# don't calc [total] until sellers have been grouped
# use first() method to return all columns and perform no other aggregations
grouped1 = df1.sort_values('price').groupby(['seller', 'item']).first()

登入後複製

此時我想透過seller計算total。所以我有以下程式碼，但它為每個item 計算total，而不是seller，這意味著shipping 根據每個群組中的商品數量被多次添加，或當price x count 結束時不應用免費送貨最低免運費。

# calc [Total]
for index, row in grouped1.iterrows():
    if (row['Free Shipping Minimum'] == 50) & (row['Price x Count'] > 50):
        grouped1.at[index, 'Total'] = row['Price x Count'] + 0
    elif (row['Free Shipping Minimum'] == 5) & (row['Price x Count'] > 5):
        grouped1.at[index, 'Total'] = row['Price x Count'] + 0
    else:
        grouped1.at[index, 'Total'] = row['Price x Count'] + row['Shipping']

登入後複製

實際上看起來我可能需要在計算total 時對每個seller 求和price x count ，但這本質上是同一個問題，因為我不知道如何計算外部索引的每行列。我可以使用什麼方法來做到這一點？

另外，如果有人對如何實現我的後半部目標有任何建議，請儘管提出。我只想退回我需要的每件商品。例如，我需要 2 個「項目 1」和 2 個「項目 2」。如果“賣家1”有2 個“商品1”和1 個“商品2”，而“賣家2”有1 個“商品1”和1 個“商品2”，那麼我想要“賣家1”的所有商品（假設它最便宜），但只有「賣家2」的1 個「商品1」。這似乎會影響 total 列的計算，但我不確定如何實現它。

正確答案

我最終決定先將seller 分組，並對price x count 進行求和以找到subtotals，將其轉換為資料幀，然後將df1 與新的subtotal 資料幀合併以建立groupedphpcnend cphpcn 資料框。然後我使用 <code>np.where 建議創建了 totals 列（這比我的 for 循環優雅得多，並且可以輕鬆處理 nan 值）。最後按seller、total、item分組傳回我想要的結果。最終程式碼如下：

import pandas as pd
import numpy as np

item1 = ['item 1', 'item 2', 'item 1', 'item 1', 'item 2']
seller1 = ['Seller 1', 'Seller 2', 'Seller 3', 'Seller 4', 'Seller 1']
price1 = [1.85, 1.94, 2.69, 2.00, 2.02]
shipping1 = [0.99, 0.99, 0.99, 2.99, 0.99]
freeship1 = [5, 5, 5, 50, 5]
countavailable1 = [1, 2, 2, 5, 2]
countneeded1 = [2, 1, 2, 2, 1]

df1 = pd.DataFrame({'Seller':seller1,
                    'Item':item1,
                    'Price':price1,
                    'Shipping':shipping1,
                    'Free Shipping Minimum':freeship1,
                    'Count Available':countavailable1,
                    'Count Needed':countneeded1})

# create columns that states if seller has all counts needed.
# this will be used to sort by to prioritize the smallest number of orders possible
for index, row in df1.iterrows():
    if row['Count Available'] >= row['Count Needed']:
        df1.at[index, 'Fulfills Count Needed'] = 'Yes'
    else:
        df1.at[index, 'Fulfills Count Needed'] = 'No'

# dont want to calc price based on [count available], so need to check if seller has count I need and calc cost based on [count needed].
# if doesn't have [count needed], then calc cost on [count available].
for index, row in df1.iterrows():
    if row['Count Available'] >= row['Count Needed']:
        df1.at[index, 'Price x Count'] = row['Count Needed'] * row['Price']
    else:
        df1.at[index, 'Price x Count'] = row['Count Available'] * row['Price']

# subtotals by seller, then assign calcs to column called [Subtotal] and merge into dataframe
subtotals = df1.groupby(['Seller'])['Price x Count'].sum().reset_index()

subtotals.rename({'Price x Count':'Subtotal'}, axis=1, inplace=True)

grouped = df1.merge(subtotals[['Subtotal', 'Seller']], on='Seller')


# calc [Total]
grouped['Total'] = np.where(grouped['Subtotal'] > grouped['Free Shipping Minimum'],
                             grouped['Subtotal'], grouped['Subtotal'] + grouped['Shipping'])

grouped.groupby(['Seller', 'Total', 'Item']).first()

登入後複製

以上是計算多索引 pandas 資料幀外部索引每行的總和的詳細內容。更多資訊請關注PHP中文網其他相關文章！

本網站聲明

本文內容由網友自願投稿，版權歸原作者所有。本站不承擔相應的法律責任。如發現涉嫌抄襲或侵權的內容，請聯絡admin@php.cn