
How to Efficiently Split Comma-Separated Strings in Pandas DataFrames?

Linda Hamilton

Released: 2024-12-19 06:18:15




Splitting Comma-Separated String Entries in a Pandas DataFrame

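Input data often consists of values separated by a character such as a comma. When working with Pandas DataFrames, you frequently need to split these string entries and create a separate row for each value. This article looks at methods for doing that efficiently.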

Using Pandas' .explode() Method

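Introduced in Pandas 0.25.0 and extended in 1.3.0 to accept multiple columns, the .explode() method provides a simple and efficient way to expand a column containing lists or arrays into one row per element. Because it works on single and multiple columns alike, it offers flexibility when handling complex datasets. Note that .explode() only expands list-like values, so comma-separated strings must first be split into lists, for example with .str.split(',').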

Syntax:

dataframe.explode(column_name)


Example:

import pandas as pd

# Dataframe with a column containing comma-separated values
df = pd.DataFrame({'var1': ['a,b,c', 'd,e,f'], 'var2': [1, 2]})

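# Split each string into a list; .explode() only expands list-like values
df['var1'] = df['var1'].str.split(',')

# Exploding the 'var1' column
df = df.explode('var1')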

# Resulting dataframe with separate rows for each value
print(df)
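
Real-world data often has stray spaces around the commas, and the exploded rows keep repeating the original index labels. The following is a minimal sketch of one way to handle both, assuming Pandas 1.1.0 or later for the ignore_index parameter; the sample data and the name result are illustrative and not part of the original example.

```python
import pandas as pd

# Illustrative data with inconsistent spacing around the commas
df = pd.DataFrame({'var1': ['a, b,c', 'd ,e'], 'var2': [1, 2]})

# Split the strings into lists, then explode into one row per value.
# ignore_index=True renumbers the rows instead of repeating the
# original index labels.
result = (
    df.assign(var1=df['var1'].str.split(','))
      .explode('var1', ignore_index=True)
)

# Remove the leftover whitespace from each value
result['var1'] = result['var1'].str.strip()

print(result)
```

Using assign() here leaves the original df untouched, which can be convenient when the raw strings are still needed later.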


Custom Vectorized Function for Exploding Multiple Columns

For more complex scenarios where multiple columns need to be expanded, a custom vectorized function can provide a versatile solution.

Function definition:

import numpy as np
import pandas as pd

def explode(df, lst_cols, fill_value='', preserve_index=False):
    # Columns that are not being exploded
    idx_cols = df.columns.difference(lst_cols)

    # Calculate lengths of lists (taken from the first list column)
    lens = df[lst_cols[0]].str.len()

    # Repeat values for non-empty lists
    res = (pd.DataFrame({
                col: np.repeat(df[col].values, lens)
                for col in idx_cols},
                index=np.repeat(df.index.values, lens))
             .assign(**{col: np.concatenate(df.loc[lens > 0, col].values)
                            for col in lst_cols}))

    # Append rows with empty lists (pd.concat, since DataFrame.append
    # was removed in pandas 2.0)
    if (lens == 0).any():
        res = (pd.concat([res, df.loc[lens == 0, idx_cols]], sort=False)
                  .fillna(fill_value))

    # Revert index order with a stable sort so list items keep their
    # order, then reset the index if requested
    res = res.sort_index(kind='mergesort')
    if not preserve_index:
        res = res.reset_index(drop=True)
    return res


Example:

# Dataframe with multiple columns containing lists
df = pd.DataFrame({
    'var1': [['a', 'b'], ['c', 'd']],
    'var2': [['x', 'y'], ['z', 'w']]
})

# Exploding 'var1' and 'var2' columns
df = explode(df, ['var1', 'var2'])

# Resulting dataframe with separate rows for each list item
print(df)
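
For comparison, on Pandas 1.3.0 or later the built-in .explode() accepts a list of columns, so the same sample data can be expanded without a helper function. This is a sketch under the assumption that the lists in each row have matching lengths, which the built-in method requires; the name result is illustrative.

```python
import pandas as pd

# Same sample data as above: two columns holding lists
df = pd.DataFrame({
    'var1': [['a', 'b'], ['c', 'd']],
    'var2': [['x', 'y'], ['z', 'w']],
})

# Requires pandas >= 1.3.0; lists in the same row must have equal length,
# otherwise a ValueError is raised
result = df.explode(['var1', 'var2'], ignore_index=True)

print(result)
```

The custom function remains useful on older Pandas versions or when rows with empty lists need a specific fill value.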


Splitting with a Custom Function

Another approach is to use .apply() with a custom function that splits each string entry into a list, and then expand the resulting list column into separate rows with .explode():

Custom function:

def split_fun(value):
    # Split one comma-separated string into a list of values
    return value.split(',')

Example:

import pandas as pd

# Dataframe with a column containing comma-separated values
df = pd.DataFrame({'var1': ['a,b,c', 'd,e,f'], 'var2': [1, 2]})

# Creating a new column with split values using apply
df['var1_split'] = df['var1'].apply(split_fun)

# Expand the newly created list column into separate rows
df = df.explode('var1_split')

# Resulting dataframe with separate rows for each value
print(df)


Conclusion

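Depending on the specific requirements and complexity of your dataset, several methods can be used to split comma-separated string entries in a Pandas DataFrame. The .explode() method offers a simple and efficient approach, while a custom vectorized function provides the flexibility to handle more complex scenarios.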
