python - Approaches to merging large text data files - PHP Chinese Network Q&A

python - Approaches to merging large text data files

迷茫 2017-04-18 10:30:26

Background:

I have three CSV files, laid out as follows:

afile: userid, username, ....
bfile: postid, userid, postname, ...
cfile: postid, postnum, ...

afile = 10G
bfile = 150G
cfile = 20G

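Note: the field separator is not a single character (such as a comma) but a string of several special characters. Some fields can contain single-character separators; I tried every single character on the keyboard and each one shows up somewhere in the data, so I used a multi-character string as the separator instead. As a result the files are not strictly CSV, which is the most painful part.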
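One way to deal with the non-standard separator is to convert each file once into properly quoted CSV, so that fields containing commas or quotes survive and ordinary CSV tools can read the result. The following is only a minimal sketch: the separator `##` is assumed from the `sep='##'` used in the post's code, the function name `convert_to_csv` is hypothetical, and it assumes that no field contains a line break or the separator string itself.

```python
import csv
import io

SEP = "##"  # assumed separator, taken from sep='##' in the post's code


def convert_to_csv(src, dst, sep=SEP):
    """Stream lines from src, split each on the multi-character separator,
    and write properly quoted CSV rows to dst. Returns the number of rows."""
    writer = csv.writer(dst)
    count = 0
    for line in src:
        # Strip the line ending, then split on the full separator string
        writer.writerow(line.rstrip("\r\n").split(sep))
        count += 1
    return count


# Small in-memory demonstration; real files would be opened with
# open(path, newline="") and processed one line at a time.
sample = io.StringIO('postid##userid##postname\n1##7##hello, "world"\n')
out = io.StringIO()
rows = convert_to_csv(sample, out)
print(out.getvalue())
```

Because the conversion reads one line at a time, memory use stays flat regardless of file size; the cost is one extra pass over each file.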

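Goal:

I want to merge these three files: join bfile and cfile on the postid column, then join that result with afile on the userid column. The final output should look roughly like postid, userid, postname, postnum, username.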
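To make the target shape concrete, here is a small illustration of the two-step join on toy data. The values are invented for the example and the frames stand in for the three files; it shows the intended result, not a way to process files of this size.

```python
import pandas as pd

# Toy stand-ins for the three files (values are invented for illustration)
a = pd.DataFrame({"userid": [7, 8], "username": ["alice", "bob"]})
b = pd.DataFrame({"postid": [1, 2, 3], "userid": [7, 8, 9],
                  "postname": ["p1", "p2", "p3"]})
c = pd.DataFrame({"postid": [1, 2], "postnum": [10, 20]})

# Step 1: inner join of bfile and cfile on postid (postid 3 has no match)
bc = b.merge(c, on="postid")

# Step 2: inner join of that result with afile on userid
result = bc.merge(a, on="userid")[
    ["postid", "userid", "postname", "postnum", "username"]
]
print(result)
```

Both joins are inner joins, matching the default of `pd.merge` in the post's code, so rows without a partner in the other file are dropped.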

My current pseudocode is as follows:

import pandas as pd

# Placeholder paths; point these at the real files
afile, bfile, cfile = 'afile.csv', 'bfile.csv', 'cfile.csv'
chunksize = 1000000  # 1,000,000 rows per chunk; seems fine so far


def read_chunks(path):
    # A multi-character separator requires the python engine
    return pd.read_csv(path, chunksize=chunksize, engine='python', sep='##')


try:
    resultchunktotal = []
    # Read bfile in chunks
    for bfilechunk in read_chunks(bfile):
        # Re-read cfile in chunks for every bfile chunk
        for cfilechunk in read_chunks(cfile):
            bfilecfilechunk = pd.merge(bfilechunk, cfilechunk, on='postid')
            # Non-empty means bfile and cfile share some postid values
            if bfilecfilechunk.empty:
                continue
            # Re-read afile in chunks for every matching pair of chunks
            for afilechunk in read_chunks(afile):
                chunkresult = pd.merge(bfilecfilechunk, afilechunk, on='userid')
                # Non-empty means there are shared userid values
                if not chunkresult.empty:
                    resultchunktotal.append(chunkresult)
    if len(resultchunktotal) > 0:
        pd.concat(resultchunktotal).to_csv('result.csv', index=False)
except Exception as e:
    print(e)

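But this feels very inefficient, so I would really appreciate any better approaches or tools.

P.S. Calling this "big data" is a stretch; the data is just somewhat large.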
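The nested loops above re-read cfile once for every chunk of bfile, and afile once for every matching pair of chunks, which is where most of the time goes. One alternative, offered here only as a sketch and not as the poster's method, is to load each file once into an on-disk SQLite database, index the join keys, and let the database perform the join. The table names `a`, `b`, `c` and the helpers `load_table` and `join_all` are hypothetical; the demo uses an in-memory database and toy rows, whereas real use would connect to a file and feed rows from a generator that splits each line on the separator.

```python
import csv
import io
import sqlite3


def load_table(conn, table, columns, rows):
    """Create a table and bulk-insert rows (any iterable of tuples)."""
    cols = ", ".join(columns)
    marks = ", ".join("?" for _ in columns)
    conn.execute(f"CREATE TABLE {table} ({cols})")
    conn.executemany(f"INSERT INTO {table} VALUES ({marks})", rows)


def join_all(conn, out):
    """Index the join keys, then stream the three-way join to a CSV file."""
    conn.execute("CREATE INDEX idx_c_postid ON c (postid)")
    conn.execute("CREATE INDEX idx_a_userid ON a (userid)")
    query = (
        "SELECT b.postid, b.userid, b.postname, c.postnum, a.username "
        "FROM b JOIN c ON c.postid = b.postid "
        "JOIN a ON a.userid = b.userid"
    )
    writer = csv.writer(out)
    writer.writerow(["postid", "userid", "postname", "postnum", "username"])
    matched = 0
    # Iterating the cursor yields one row at a time, so the result
    # never has to fit in memory
    for row in conn.execute(query):
        writer.writerow(row)
        matched += 1
    return matched


# Toy demonstration with invented rows; use sqlite3.connect("merge.db")
# for data that does not fit in memory.
conn = sqlite3.connect(":memory:")
load_table(conn, "a", ["userid", "username"], [("7", "alice"), ("8", "bob")])
load_table(conn, "b", ["postid", "userid", "postname"],
           [("1", "7", "p1"), ("2", "8", "p2"), ("3", "9", "p3")])
load_table(conn, "c", ["postid", "postnum"], [("1", "10"), ("2", "20")])
out = io.StringIO()
matched = join_all(conn, out)
print(out.getvalue())
```

With this layout each input file is read exactly once, and the indexes on `c.postid` and `a.userid` let the database look up matches instead of rescanning. The trade-off is the disk space and load time for the database file.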

All replies (1)

巴扎黑 2017-04-18 10:32:26 #1

Stop writing code. This looks like a job for a one-line shell script using xsv's join subcommand.
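As a rough sketch of what that suggestion could look like, the snippet below builds two `xsv join` invocations, one per join step. It rests on several assumptions: xsv is installed, the inputs are already standard comma-separated CSV files with header rows (xsv's delimiter option accepts only a single character, so the multi-character separator would have to be converted first), and the file names, including the intermediate `bc.csv`, are hypothetical. The helper `run_steps` is defined but deliberately not called here.

```python
import shlex
import shutil
import subprocess


def xsv_join(cols1, input1, cols2, input2, output):
    """Build an `xsv join` command that writes the joined rows to output."""
    return ["xsv", "join", "--output", output, cols1, input1, cols2, input2]


# Step 1 joins bfile and cfile on postid; step 2 joins that result
# with afile on userid. All file names are placeholders.
steps = [
    xsv_join("postid", "bfile.csv", "postid", "cfile.csv", "bc.csv"),
    xsv_join("userid", "bc.csv", "userid", "afile.csv", "result.csv"),
]


def run_steps(commands):
    """Run each command in order, stopping at the first failure."""
    if shutil.which("xsv") is None:
        raise RuntimeError("xsv was not found on PATH")
    for cmd in commands:
        subprocess.run(cmd, check=True)


# Print the equivalent shell commands without executing them
for cmd in steps:
    print(shlex.join(cmd))
```

Calling `run_steps(steps)` would execute both joins. The joined output keeps the key column from both inputs, so a final `xsv select` may be needed to drop the duplicates.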
