Labeling data with Python: how can I speed it up for 5 million records?

Question

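I'm a beginner asking for help. I need to use Python to label a column of words; the rule is that a word containing certain keywords gets a certain label. There are a lot of words, about 5 million. I label them with the code below, but it is very slow and takes more than an hour to finish. Is there a faster, optimized way...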

迷茫 · Answer

So are you really using pandas only as a tool for reading the data?

signquery adds an is_tobacco column as the label you described.

filter_query returns a list of the queries that contain these keywords, which is more efficient.

Beyond that, you can split the data into chunks and process them with multiprocessing, which should speed things up considerably.

import pandas as pd

# One query per line; the single column is named 'query'.
word = pd.read_table('test.txt', encoding='utf-8', names=['query'])

TOBACCO = [u'烟', u'白沙', u'黄金叶', u'利群', u'南京九五', u'黄鹤楼软', u'黄鹤楼硬', u'娇子', u'钻石荷花', u'玉溪', u'七匹狼尚品', u'七匹狼软灰']

def contains_tobacco(name):
    # Substring test (Python 3): True if the query contains any keyword.
    # `name in TOBACCO` would only match queries that equal a keyword exactly.
    return isinstance(name, str) and any(t in name for t in TOBACCO)

def signquery(word):
    # Add a boolean label column.
    word['is_tobacco'] = word['query'].apply(contains_tobacco)
    return word

def filter_query(word):
    # Return the matching queries as a plain list.
    return word[word['query'].apply(contains_tobacco)]['query'].tolist()

result = filter_query(word)

print(result)
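
The multiprocessing suggestion above could be sketched roughly as follows, using only the standard library. The function names (label_chunk, split, label_parallel), the process count and the chunk size are illustrative assumptions, not part of the original answer; the queries are assumed to be already loaded into a plain list of strings.

```python
import re
from multiprocessing import Pool

KEYWORDS = ["烟", "白沙", "黄金叶", "利群", "南京九五", "黄鹤楼软",
            "黄鹤楼硬", "娇子", "钻石荷花", "玉溪", "七匹狼尚品", "七匹狼软灰"]

# Compiled at import time, so each worker process has the pattern ready.
PATTERN = re.compile("|".join(re.escape(k) for k in KEYWORDS))


def label_chunk(queries):
    """Return one bool per query: True if it contains any keyword."""
    return [PATTERN.search(q) is not None for q in queries]


def split(items, size):
    """Cut a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def label_parallel(queries, processes=4, chunk_size=100_000):
    """Label chunks in separate processes; Pool.map keeps the input order."""
    with Pool(processes) as pool:
        parts = pool.map(label_chunk, split(queries, chunk_size))
    return [flag for part in parts for flag in part]


if __name__ == "__main__":
    # Tiny made-up sample; real input would be the 5 million queries.
    demo = ["中华香烟", "苹果手机", "黄鹤楼软包", "玉溪"]
    print(label_parallel(demo, processes=2, chunk_size=2))
```

Large chunks keep the cost of sending data between processes small relative to the matching work. Whether this beats a single-process run depends on the machine, so it is worth timing both on a sample first.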

怪我咯 · Answer

You can try using regular expressions:

import re
pattern = re.compile(u'烟|白沙|黄金叶|利群|南京九五|黄鹤楼软|黄鹤楼硬|娇子|钻石荷花|玉溪|七匹狼尚品|七匹狼软灰')
# filter() is lazy in Python 3, so wrap it in list(); word is the DataFrame loaded earlier.
result = list(filter(pattern.search, word['query']))
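
Since the question mentions several labels, the same regular-expression idea can be extended to return a label rather than a yes/no answer. This is a minimal sketch under assumptions: the LABEL_KEYWORDS table and the label_of helper are hypothetical names, and the "liquor" label with its keywords is invented purely for illustration.

```python
import re

# The "tobacco" keywords come from the thread; "liquor" is a made-up second label.
LABEL_KEYWORDS = {
    "tobacco": ["烟", "白沙", "黄金叶", "利群", "玉溪"],
    "liquor": ["茅台", "五粮液"],
}

# Reverse lookup: keyword -> label.
KEYWORD_TO_LABEL = {kw: label
                    for label, kws in LABEL_KEYWORDS.items()
                    for kw in kws}

# Longest keywords first, so a longer keyword is tried before a shorter
# one that happens to be its prefix.
PATTERN = re.compile("|".join(
    re.escape(kw) for kw in sorted(KEYWORD_TO_LABEL, key=len, reverse=True)))


def label_of(query, default=None):
    """Return the label of the leftmost keyword found in `query`, else `default`."""
    match = PATTERN.search(query)
    return KEYWORD_TO_LABEL[match.group(0)] if match else default


print(label_of("飞天茅台"))
print(label_of("利群香烟"))
print(label_of("苹果手机"))
```

Compiling one pattern for all keywords means each query is scanned once, instead of once per keyword.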

ringa_lee · Answer

KMP algorithm
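
For reference, KMP searches for a single pattern in time linear in the text length by precomputing how far the pattern can shift after a mismatch. Below is a minimal sketch; the function names are illustrative. For a dozen short keywords, Python's built-in `in` operator and the re module run in C and may well be faster in practice, so this mainly shows the idea.

```python
def build_failure(pattern):
    """failure[i] is the length of the longest proper prefix of
    pattern[:i + 1] that is also a suffix of it."""
    failure = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k and pattern[i] != pattern[k]:
            k = failure[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        failure[i] = k
    return failure


def kmp_find(text, pattern):
    """Return the index of the first occurrence of pattern in text, or -1."""
    if not pattern:
        return 0
    failure = build_failure(pattern)
    k = 0  # number of pattern characters matched so far
    for i, ch in enumerate(text):
        while k and ch != pattern[k]:
            k = failure[k - 1]  # fall back without re-reading the text
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            return i - k + 1
    return -1


print(kmp_find("中华香烟", "烟") != -1)
```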

天蓬老师 · Answer

KMP
Manacher
Trie tree
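
A trie stores all keywords in one prefix tree, so a query can be checked against every keyword in a single walk from each start position. The following is a rough sketch using nested dicts; build_trie and first_keyword are hypothetical helper names. Adding failure links to such a trie gives the Aho-Corasick automaton, which avoids restarting from each position.

```python
def build_trie(keywords):
    """Build a nested-dict trie; the key None marks the end of a keyword."""
    root = {}
    for kw in keywords:
        node = root
        for ch in kw:
            node = node.setdefault(ch, {})
        node[None] = kw
    return root


def first_keyword(text, trie):
    """Return the first keyword found in text (leftmost start), or None."""
    for start in range(len(text)):
        node = trie
        for i in range(start, len(text)):
            node = node.get(text[i])
            if node is None:
                break  # no keyword continues with this character
            if None in node:
                return node[None]
    return None


trie = build_trie(["烟", "黄鹤楼软", "黄鹤楼硬"])
print(first_keyword("买黄鹤楼硬盒", trie))
```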
