Python beginner question: merging multiple lines of a large txt file by condition

Question

The data format is as follows:
······
1107 1385331000000 1.3142511607126754
1107 1385331000000 0.0021683196661660157
1107 1385331600000 0.0021683196661660157
1107 1385331600000 1.4867805985670923
1107 1385331600000 0.0...

黄舟 · Answer

I solved it myself. The code may be convoluted, but it does what I need.

__author__ = 'Administrator'
file = open('day24.txt', 'a+')
s = "area       time            data\n"
file.write(s)
file.close()


file = open('sms-call-internet-mi-2013-11-24-24.txt','r')
line = file.readline()
list1 = []  # timestamps seen for the current area
num1 = []   # running sum of data for each timestamp
area = []

while 1:
    line = file.readline()
    if line == '':
        break
    a = line.split()
    if int(a[0]) == 1:
        if a[2] == "NA":
            a[2] = '0'
        area.append(a[0])
        if a[1] in list1:
            num1[list1.index(a[1])] = float(num1[list1.index(a[1])])+float(a[2])
        else:
            list1.append(a[1])
            num1.append(a[2])
    elif int(a[0]) < 10001:

        if a[2] == "NA":
            a[2] = '0'
        if a[0] not in area:
            area.append(a[0])

            file1 = open('day24.txt', 'a+')

            for i in list1:
                file1.write("%-8s%-16s%.20f\n" % (area[area.index(a[0])-1], i, float(num1[list1.index(i)])))
            file1.close()
            list1 = []
            num1 = []

        if a[1] in list1:
            num1[list1.index(a[1])] = float(num1[list1.index(a[1])])+float(a[2])

        else:
            list1.append(a[1])
            num1.append(a[2])
    else:
        break
file.close()

file = open('day24.txt', 'a+')
for j in list1:
    # area[-1] is the last area that was accumulated; a[0] would be wrong
    # if the loop above stopped on an out-of-range area
    file.write("%-8s%-16s%.20f\n" % (area[-1], j, float(num1[list1.index(j)])))
file.close()

ringa_lee · Answer

If the file is ordered as a time series, just read the original file with a generator, build the merged lines as you go, and write them out.
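
A minimal sketch of that generator idea, assuming rows that share the same area and timestamp sit next to each other in the file (as they do in the sample), that the columns are whitespace-separated with no header line, and that "NA" counts as 0 (as in the asker's own solution). The name `merged_rows` and the sample values are made up for illustration.

```python
import io


def merged_rows(lines):
    """Yield (area, timestamp, total) for each run of adjacent rows
    that share the same area and timestamp."""
    current_key = None
    total = 0.0
    for line in lines:
        parts = line.split()
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        key = (parts[0], parts[1])
        value = 0.0 if parts[2] == "NA" else float(parts[2])
        if key != current_key:
            # A new (area, timestamp) run starts: emit the finished one.
            if current_key is not None:
                yield current_key[0], current_key[1], total
            current_key, total = key, 0.0
        total += value
    # Emit the last run, if there was any data at all.
    if current_key is not None:
        yield current_key[0], current_key[1], total


# Stand-in for the real file; with real data, pass an open file object instead.
sample = io.StringIO(
    "1107 1385331000000 1.5\n"
    "1107 1385331000000 0.25\n"
    "1107 1385331600000 NA\n"
    "1108 1385331600000 2.0\n"
)
result = list(merged_rows(sample))
for area, timestamp, total in result:
    print(area, timestamp, total)
```

Only one running total is kept in memory at a time, so the file size does not matter. If matching rows are not adjacent, this approach would emit the same key more than once.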

ringa_lee · Answer

pandas can handle this: read the data into a DataFrame and process it there.

怪我咯 · Answer

This depends on how much data you have

Iterate over the file handle directly instead of calling readlines(), which loads the whole file at once and may exhaust memory.
Use a dictionary-like data structure to store your information. If memory is still not enough, you have to find a way to write intermediate results to disk.

The general idea is as follows

from collections import Counter
c = Counter()
f = ['1107 1385332800000 1.2847329440609827',
'1107 1385332800000 0.0021683196661660157',
'1107 1385333400000 1.2891586380834603',
'1108 1385247600000 0.026943168177151356',
'1108 1385247600000 6.184696475262653',
'1108 1385248200000 0.05946288920050806' ]

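'''
with open('xxoo.txt') as f:  # f is a file handle you iterate over, playing the role of the list f above
    for i in f:
        s = i.split()
        c[s[0]] += float(s[2])
'''


for i in f:  # this iterates over the list f; with real data, use the file handle f shown above
    s = i.split()  # split on whitespace; print(s) to inspect the result
    c[s[0]] += float(s[2])  # c accumulates the totals

for i in c:
    print(i, c[i])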

PHPz · Answer

What you are doing is a grouped aggregation over two keys: label and hour. Read the file with pandas, use to_datetime to convert the timestamp into a datetime column, and extract the hour from it. Then use groupby on label and hour together and sum the data.
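
A rough sketch of those steps, assuming whitespace-separated columns with no header line, millisecond timestamps interpreted as UTC, and that all rows fall within a single day (otherwise the same hour number from different days would be merged). The column names `label`, `timestamp`, `data` and the sample values are made up for illustration.

```python
import io

import pandas as pd

# Stand-in for the real file; with real data, pass the file path instead.
sample = io.StringIO(
    "1107 1385331000000 1.5\n"
    "1107 1385331600000 0.25\n"
    "1107 1385334600000 2.0\n"
    "1108 1385331000000 NA\n"
    "1108 1385331600000 4.0\n"
)

df = pd.read_csv(
    sample,
    sep=r"\s+",        # columns are separated by whitespace
    header=None,
    names=["label", "timestamp", "data"],
)

# Convert millisecond timestamps to datetimes, then keep only the hour (0-23).
df["hour"] = pd.to_datetime(df["timestamp"], unit="ms").dt.hour

# Group by label and hour together and sum; "NA" is read as NaN and skipped.
hourly = df.groupby(["label", "hour"])["data"].sum()
print(hourly)
```

For a file too large for memory, `read_csv` also accepts a `chunksize` argument, in which case the per-chunk sums would need to be combined afterwards.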

黄舟 · Answer

Please use this idea
https://www.zhihu.com/questio...

阿神 · Answer

I think it helps to analyze your data format a little before writing any code.
1. The first column represents the date; use it as the first-level key of the result array: result[date].
2. The second column looks like a timestamp (at minute granularity). If you need results by hour, initialize 24 elements for each result[date] item, keyed by hour (the timestamp value at the start of that hour works as the key); each value holds the sum of the data within that hour, i.e. result[date][hour].
3. Once the result array is initialized, the rest is simple: traverse the file and process it line by line. For each line, read the first column, such as 1107, and work on result[1107]. Then read the second column, find the matching hour-timestamp key, and add the value to it.
4. Finally, traverse the result array and output the result.
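
The steps above could be sketched roughly as follows, assuming millisecond timestamps and whitespace-separated columns. Instead of pre-initializing 24 slots per key, this version swaps in a nested `defaultdict` that creates each hour bucket on first use. The sample values and the names `result` and `MS_PER_HOUR` are made up for illustration.

```python
from collections import defaultdict

# Stand-in for the real file; with real data, iterate over the open file instead.
lines = [
    "1107 1385331000000 1.5",
    "1107 1385331600000 0.25",
    "1107 1385334600000 2.0",
    "1108 1385331000000 4.0",
]

MS_PER_HOUR = 3600 * 1000

# result[first_column][hour_start_ms] -> sum of data within that hour
result = defaultdict(lambda: defaultdict(float))

for line in lines:
    key, ts, data = line.split()
    # Round the timestamp down to the start of its hour and use that as the key.
    hour_start = int(ts) // MS_PER_HOUR * MS_PER_HOUR
    result[key][hour_start] += float(data)

# Finally, traverse the result and output it.
for key in sorted(result):
    for hour_start in sorted(result[key]):
        print(key, hour_start, result[key][hour_start])
```

Memory use grows with the number of distinct first-column values times the number of hours, not with the number of lines in the file.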

天蓬老师 · Answer

You need:

from itertools import groupby

It can be done in less than ten lines of code.
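
One possible version, as a sketch only: it assumes rows with the same first two columns are adjacent, because `groupby` only merges consecutive items with equal keys (sort the rows first if they are not). The sample values are made up for illustration.

```python
from itertools import groupby

# Stand-in for the real file; with real data, iterate over the open file instead.
lines = [
    "1107 1385332800000 1.5",
    "1107 1385332800000 0.25",
    "1107 1385333400000 2.0",
    "1108 1385247600000 4.0",
    "1108 1385247600000 0.5",
]

# Lazily split each line into [area, timestamp, data].
rows = (line.split() for line in lines)

# Group consecutive rows by (area, timestamp) and sum the third column.
merged = [
    (area, timestamp, sum(float(row[2]) for row in group))
    for (area, timestamp), group in groupby(rows, key=lambda row: (row[0], row[1]))
]

for area, timestamp, total in merged:
    print(area, timestamp, total)
```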