Efficient Parsing of Fixed Width Files
Fixed width files pose a challenge when it comes to parsing due to their rigid structure. To address this, multiple approaches can be employed to efficiently extract data from such files.
Using the struct Module
The Python standard library's struct module offers a concise and fast solution for parsing fixed width lines. It allows for predefined field widths and data types, making it a suitable option for large datasets. The following code snippet demonstrates how to utilize struct for this purpose:
<code class="python">import struct fieldwidths = (2, -10, 24) fmtstring = ' '.join('{}{}'.format(abs(fw), 'x' if fw < 0 else 's') for fw in fieldwidths) # Convert Unicode input to bytes and the result back to Unicode string. unpack = struct.Struct(fmtstring).unpack_from # Alias. parse = lambda line: tuple(s.decode() for s in unpack(line.encode())) print('fmtstring: {!r}, record size: {} chars'.format(fmtstring, struct.calcsize(fmtstring))) line = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789\n' fields = parse(line) print('fields: {}'.format(fields))</code>
String Slicing with Compile-Time Optimization
String slicing is another viable method for parsing fixed width files. While initially less efficient, a technique known as "compile-time optimization" can significantly improve performance. The following code implements this optimization:
<code class="python">def make_parser(fieldwidths): cuts = tuple(cut for cut in accumulate(abs(fw) for fw in fieldwidths)) pads = tuple(fw < 0 for fw in fieldwidths) # bool flags for padding fields flds = tuple(zip_longest(pads, (0,)+cuts, cuts))[:-1] # ignore final one slcs = ', '.join('line[{}:{}]'.format(i, j) for pad, i, j in flds if not pad) parse = eval('lambda line: ({})\n'.format(slcs)) # Create and compile source code. # Optional informational function attributes. parse.size = sum(abs(fw) for fw in fieldwidths) parse.fmtstring = ' '.join('{}{}'.format(abs(fw), 'x' if fw < 0 else 's') for fw in fieldwidths) return parse</code>
This optimized approach provides both efficiency and readability for parsing fixed width files.
The above is the detailed content of How can I efficiently parse fixed width files in Python?. For more information, please follow other related articles on the PHP Chinese website!