Text:
Each line contains a number after /promotion/. Lines with the same number are treated as duplicates, and only one line is kept for each number.
Approach:
Use a dictionary together with string splitting.
Create an empty dictionary.
Read the text line by line and split off the first field of each line (the part before the first space) to use as the key. For each line, look the key up in the dictionary. If it is not found, store the line in the dictionary under that key. If it is found, the line has already been stored (that is, a duplicate line has appeared), so it is not written again. This keeps exactly one line for each group of duplicates.
The text is as follows:
    /promotion/232 utm_source
    /promotion/237 LandingPage/borrowExtend/? ;
    /promotion/25113 LandingPage/mhd
    /promotion/25113 LandingPage/mhd
    /promotion/25199 com/LandingPage
    /promotion/254 LandingPage/mhd/mhd4/? ;
    /promotion/259 LandingPage/ydy/? ;
    /promotion/25113 LandingPage/mhd
    /promotion/25199 com/LandingPage
    /promotion/25199 com/LandingPage
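As a small illustration of the key used for comparison, the sketch below splits a few of the sample lines on the first space. The helper name `promotion_key` is made up for this example and is not part of the original program.

```python
def promotion_key(line):
    """Return the part of the line before the first space, e.g. '/promotion/232'."""
    return line.split(' ')[0]


# Three of the sample lines, with the trailing newline a file read would give.
samples = [
    '/promotion/232 utm_source\n',
    '/promotion/25113 LandingPage/mhd\n',
    '/promotion/25113 LandingPage/mhd\n',
]

keys = [promotion_key(s) for s in samples]
print(keys)
```

The second and third sample lines produce the same key, so they count as duplicates of each other.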
The program is as follows:
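    line_dict_uniq = dict()
    with open('1.txt', 'r') as fd:
        for line in fd:
            key = line.split(' ')[0]
            # Compare the key with the key of every stored line. Comparing it
            # with line_dict_uniq.values() directly would never match, because
            # each stored value is a whole line, not a key.
            stored_keys = [v.split(' ')[0] for v in line_dict_uniq.values()]
            if key not in stored_keys:
                line_dict_uniq[key] = line

    print(line_dict_uniq)
    print(len(line_dict_uniq))
    # This prints the unique lines (each duplicate is printed only once).
    # In practice you would then write this result to a file;
    # that code is omitted here.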
The program above is relatively slow, because it searches through the stored lines for every line it reads. Checking the key against the dictionary's keys instead improves on this:
    line_dict_uniq = dict()
    with open('1.txt', 'r') as fd:
        for line in fd:
            key = line.split(' ')[0]
            # Testing membership on the dictionary itself checks its keys
            # with a hash lookup, so no scan is needed.
            if key not in line_dict_uniq:
                line_dict_uniq[key] = line

    print(line_dict_uniq)
    print(len(line_dict_uniq))
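The programs above only print the dictionary. A minimal sketch of the remaining step, writing the unique lines to a file, might look like the following. The function names `dedupe_lines` and `dedupe_file` are hypothetical and do not appear in the original. The sketch uses a set of seen keys instead of a dictionary, keeps the first line seen for each key, and preserves the input order.

```python
def dedupe_lines(lines):
    """Yield the first line seen for each key (the text before the first space)."""
    seen = set()
    for line in lines:
        key = line.split(' ')[0]
        if key not in seen:
            seen.add(key)
            yield line


def dedupe_file(src_path, dst_path):
    """Copy src_path to dst_path, skipping lines whose key was already seen."""
    count = 0
    with open(src_path, 'r') as src, open(dst_path, 'w') as dst:
        for line in dedupe_lines(src):
            dst.write(line)
            count += 1
    return count  # number of lines written


# In-memory demonstration using the sample text from the article.
sample = [
    '/promotion/232 utm_source\n',
    '/promotion/237 LandingPage/borrowExtend/? ;\n',
    '/promotion/25113 LandingPage/mhd\n',
    '/promotion/25113 LandingPage/mhd\n',
    '/promotion/25199 com/LandingPage\n',
    '/promotion/254 LandingPage/mhd/mhd4/? ;\n',
    '/promotion/259 LandingPage/ydy/? ;\n',
    '/promotion/25113 LandingPage/mhd\n',
    '/promotion/25199 com/LandingPage\n',
    '/promotion/25199 com/LandingPage\n',
]
unique = list(dedupe_lines(sample))
print(len(unique))  # 6 distinct keys among the 10 sample lines
```

To apply this to the input file used above, a call such as `dedupe_file('1.txt', '1_uniq.txt')` would do, where `1_uniq.txt` is an assumed output name.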