In certain scenarios, you need to compute the MD5 hash of a file that is larger than the available RAM. The naive approach of passing the whole file contents to hashlib.md5() (for example, hashlib.md5(f.read())) is unsuitable here, because it loads the entire file into memory at once.
To overcome this limitation, a practical approach is to read the file in manageable chunks and iteratively update the hash. This allows efficient hash computation without exceeding memory limits.
<code class="python">import hashlib

def md5_for_file(f, block_size=2**20):
    """Compute the MD5 hash of an already-open binary file object."""
    md5 = hashlib.md5()
    while True:
        data = f.read(block_size)  # read up to 1 MiB at a time
        if not data:               # empty bytes means end of file
            break
        md5.update(data)           # feed the chunk into the running hash
    return md5.digest()</code>
To calculate the MD5 hash of a file, open it and pass the file object to the function:
<code class="python">with open(filename, 'rb') as f:
    md5_hash = md5_for_file(f)</code>
The md5_hash variable will then contain the raw 16-byte digest as a bytes object.
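If you need the familiar hexadecimal string instead of raw bytes, you can convert the digest with bytes.hex(). A minimal sketch (the 'example.bin' filename is just a placeholder):
<code class="python">with open('example.bin', 'rb') as f:  # placeholder filename
    md5_hash = md5_for_file(f)

print(md5_hash.hex())  # e.g. 'd41d8cd98f00b204e9800998ecf8427e' for an empty file</code>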
Make sure to open the file in binary mode ('rb'): in text mode, newline translation and character decoding would alter the bytes and produce a wrong hash. If you would rather work with a file path than an open file object, consider the following helper:
<code class="python">import os
import hashlib

def generate_file_md5(rootdir, filename, blocksize=2**20):
    """Return the MD5 of rootdir/filename as a hexadecimal string."""
    m = hashlib.md5()
    with open(os.path.join(rootdir, filename), 'rb') as f:
        while True:
            buf = f.read(blocksize)  # read one chunk
            if not buf:              # end of file
                break
            m.update(buf)
    return m.hexdigest()</code>
This function takes a directory and a filename, joins them with os.path.join(), and returns the MD5 hash as a hexadecimal string.
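For example, hashing a file in the current directory might look like this (the directory and filename are placeholders):
<code class="python">checksum = generate_file_md5('.', 'bigfile.iso')  # placeholder path
print(checksum)  # 32-character hex string</code>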
By utilizing these techniques, you can efficiently compute MD5 hashes for large files without encountering memory limitations.
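As an aside, Python 3.11 added hashlib.file_digest(), which implements this same chunked-reading pattern in the standard library, so on recent versions you can skip the hand-rolled loop:
<code class="python">import hashlib

with open(filename, 'rb') as f:
    digest = hashlib.file_digest(f, 'md5')  # requires Python 3.11+
print(digest.hexdigest())</code>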