MapReduce is a pattern borrowed from functional programming languages. In some scenarios, it can greatly simplify the code. Let’s first take a look at what MapReduce is:
MapReduce is a software architecture proposed by Google for parallel operations on large-scale data sets (larger than 1TB). The concepts "Map" and "Reduce", and their main ideas, are borrowed from functional programming languages, as well as features borrowed from vector programming languages.
Current software implementations specify a Map function that maps a set of key-value pairs into a new set of key-value pairs, and a concurrent Reduce function that merges the mapped values so that all values belonging to the same key are aggregated together.
Simply put, MapReduce decomposes the problem to be processed into two parts: Map and Reduce. The data to be processed is treated as a sequence, and the data in each sequence is calculated through the Map function, and then aggregated into the final result through the Reduce function.
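As a toy illustration of this decomposition (not the log-counting example below), the two steps can be sketched with Python's built-in `map` and `functools.reduce`:

```python
from functools import reduce

# Map step: transform each element of the input sequence independently
mapped = map(lambda x: x * x, [1, 2, 3, 4])

# Reduce step: fold the mapped values into a single aggregate result
total = reduce(lambda acc, x: acc + x, mapped)

print(total)  # 1 + 4 + 9 + 16 = 30
```

Because each Map call is independent, the Map step can be parallelized freely; only the Reduce step needs to see all the intermediate results.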
The following program uses the MapReduce pattern to count the number of occurrences of each word in a log file:
```python
from functools import reduce
from multiprocessing import Pool
from collections import Counter


def read_inputs(file):
    for line in file:
        line = line.strip()
        yield line.split()


def count(file_name):
    # Map step: count the words in a single file
    c = Counter()
    with open(file_name) as file:
        for words in read_inputs(file):
            for word in words:
                c[word] += 1
    return c


def do_task():
    job_list = ['log.txt'] * 10000
    pool = Pool(8)
    # Reduce step: merge the per-file Counters into one
    return reduce(lambda x, y: x + y, pool.map(count, job_list))


if __name__ == "__main__":
    rv = do_task()
```
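The Reduce step above is a one-liner because `collections.Counter` overloads `+` to sum counts per key. A quick sketch of that behavior (with made-up sample data, not the log file from the example):

```python
from collections import Counter

# Two per-"file" word counts, as the Map step would produce
a = Counter("apple apple pie".split())   # {'apple': 2, 'pie': 1}
b = Counter("apple pie pie".split())     # {'apple': 1, 'pie': 2}

# Counter addition sums the counts key by key
merged = a + b

print(merged)  # Counter({'apple': 3, 'pie': 3})
```

An equivalent alternative to `reduce(lambda x, y: x + y, ...)` is `sum(counters, Counter())`, which reads a little more plainly for this particular merge.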