How can I efficiently parse fixed-width file lines in Python?

Barbara Streisand
Release: 2024-10-30 17:09:26
Original
374 people have browsed it

How can I efficiently parse fixed-width file lines in Python?

Fast Parsing of Fixed Width File Lines

Parsing fixed width files, where each column occupies a specific number of characters in a line, can be a task requiring efficiency. Here's a discussion on how to achieve this efficiently:

The Problem

Consider a fixed-width file where the first 20 characters represent one column, followed by 21-30 for the second, and so on. Given a line with 100 characters, how can we effectively parse it into its respective columns?

Solutions

1. Struct Module:

Utilizing the Python standard library's struct module provides both simplicity and speed due to its C implementation. The code below demonstrates its usage:

<code class="python">import struct

fieldwidths = (2, -10, 24)
fmtstring = ' '.join('{}{}'.format(abs(fw), 'x' if fw < 0 else 's') for fw in fieldwidths)

# Convert Unicode input to bytes and decode result.
unpack = struct.Struct(fmtstring).unpack_from  # Alias.
parse = lambda line: tuple(s.decode() for s in unpack(line.encode()))

# Parse a sample line.
line = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789\n'
fields = parse(line)
print('fields:', fields)</code>
Copy after login

Output:

fmtstring: '2s 10x 24s', record size: 36 chars
fields: ('AB', 'MNOPQRSTUVWXYZ0123456789')
Copy after login
Copy after login

2. Optimized String Slicing:

While string slicing is commonly used, it can become cumbersome for large lines. Here's an optimized approach:

<code class="python">from itertools import zip_longest
from itertools import accumulate

def make_parser(fieldwidths):
    # Calculate slice boundaries.
    cuts = tuple(cut for cut in accumulate(abs(fw) for fw in fieldwidths))
    # Create field slice tuples.
    flds = tuple(zip_longest(cuts, (0,)+cuts))[:-1]  # Ignore final value.
    # Construct the parsing function.
    parse = lambda line: tuple(line[i:j] for i, j in flds)
    parse.size = sum(abs(fw) for fw in fieldwidths)
    parse.fmtstring = ' '.join('{}{}'.format(abs(fw), 'x' if fw < 0 else 's')
                                            for fw in fieldwidths)
    return parse

# Parse a sample line.
line = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789\n'
fieldwidths = (2, -10, 24)  # Negative values indicate ignored padding fields.
parse = make_parser(fieldwidths)
fields = parse(line)
print('fmtstring:', parse.fmtstring, ', record size:', parse.size, 'chars')
print('fields:', fields)</code>
Copy after login

Output:

fmtstring: '2s 10x 24s', record size: 36 chars
fields: ('AB', 'MNOPQRSTUVWXYZ0123456789')
Copy after login
Copy after login

The above is the detailed content of How can I efficiently parse fixed-width file lines in Python?. For more information, please follow other related articles on the PHP Chinese website!

source:php.cn
Statement of this Website
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn
Latest Articles by Author
Popular Tutorials
More>
Latest Downloads
More>
Web Effects
Website Source Code
Website Materials
Front End Template
About us Disclaimer Sitemap
php.cn:Public welfare online PHP training,Help PHP learners grow quickly!