Problem: Reading massive .csv files (up to 1 million rows and 200 columns) with Python 2.7 runs into memory errors.
The initial approach iterates through the entire file and accumulates every row in in-memory lists. This becomes impractical for large files because the lists grow to hold the whole dataset at once.
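For context, a minimal sketch of that list-accumulating pattern; the function name and the column index used for filtering are illustrative assumptions, not code from the original question:

import csv

def load_matching_rows(filename, criterion):
    # Memory-hungry pattern: every matching row is appended to a list,
    # so a million-row file can exhaust available memory.
    data = []
    with open(filename, "rb") as csvfile:   # "rb" mode for the Python 2.7 csv module
        datareader = csv.reader(csvfile)
        header = next(datareader)
        for row in datareader:
            if row[3] == criterion:         # column index 3 is an illustrative assumption
                data.append(row)
    return header, data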
Solution:
1. Process Rows as They Are Produced:
Avoid loading the entire file into memory. Instead, wrap the csv.reader in a generator function and process each row as it is read:
import csv

def getstuff(filename, criterion):
    with open(filename, "rb") as csvfile:
        datareader = csv.reader(csvfile)
        yield next(datareader)  # yield the header row
        for row in datareader:
            if row[3] == criterion:
                yield row
2. Use Generator Functions for Filtering:
Filtering can also happen inside the generator. When the matching rows form a single consecutive block, itertools.dropwhile and itertools.takewhile skip ahead to that block, yield it, and then stop reading the rest of the file:
import csv
from itertools import dropwhile, takewhile

def getstuff(filename, criterion):
    with open(filename, "rb") as csvfile:
        datareader = csv.reader(csvfile)
        yield next(datareader)  # yield the header row
        # Skip rows until the criterion matches, then yield the consecutive
        # matching rows; Python 2.7 has no "yield from", so loop explicitly.
        for row in takewhile(lambda r: r[3] == criterion,
                             dropwhile(lambda r: r[3] != criterion, datareader)):
            yield row
        return
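If the itertools pipeline feels opaque, the same behaviour (yield the consecutive block of matches, then stop reading) can be written with a plain loop and a flag. This is a sketch under the same assumptions as above, not code from the original article:

import csv

def getstuff(filename, criterion):
    with open(filename, "rb") as csvfile:
        datareader = csv.reader(csvfile)
        yield next(datareader)  # yield the header row
        seen_match = False
        for row in datareader:
            if row[3] == criterion:
                seen_match = True
                yield row
            elif seen_match:
                # The consecutive block of matches has ended; stop reading.
                return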
3. Optimize Memory Consumption:
Refactor getdata() into a generator function as well, so that only one row is held in memory at any time:
def getdata(filename, criteria):
    for criterion in criteria:
        for row in getstuff(filename, criterion):
            yield row
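A minimal way to consume the generator; the file name and criteria values below are hypothetical placeholders:

# Only the row currently being processed is held in memory.
for row in getdata("large_file.csv", ["criterion1", "criterion2"]):
    print(row)  # replace with whatever per-row processing is needed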