转载自:http://yangguan.org/mapreduce-patterns-algorithms-and-use-cases/翻译自:http://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/在这篇文章里总结了几种网上或者论文中常见的MapReduce模式和算法,并系统化的解释了这些技术的不同
转载自:http://yangguan.org/mapreduce-patterns-algorithms-and-use-cases/ 翻译自:http://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/ 在这篇文章里总结了几种网上或者论文中常见的MapReduce模式和算法,并系统化的解释了这些技术的不同之处。所有描述性的文字和代码都使用了标准hadoop的MapReduce模型,包括Mappers, Reduces, Combiners, Partitioners,和 sorting。如下图所示。class Mapper method Map(docid id, doc d) for all term t in doc d do Emit(term t, count 1) class Reducer method Reduce(term t, counts [c1, c2,...]) sum = 0 for all count c in [c1, c2,...] do sum = sum + c Emit(term t, count sum)
这种方法的缺点显而易见,Mapper提交了太多无意义的计数。它完全可以通过先对每个文档中的词进行计数从而减少传递给Reducer的数据量:
class Mapper method Map(docid id, doc d) H = new AssociativeArray for all term t in doc d do H{t} = H{t} + 1 for all term t in H do Emit(term t, count H{t})
class Mapper method Map(docid id, doc d) for all term t in doc d do Emit(term t, count 1) class Combiner method Combine(term t, [c1, c2,...]) sum = 0 for all count c in [c1, c2,...] do sum = sum + c Emit(term t, count sum) class Reducer method Reduce(term t, counts [c1, c2,...]) sum = 0 for all count c in [c1, c2,...] do sum = sum + c Emit(term t, count sum)
class Mapper method Map(id n, object N) Emit(id n, object N) for all id m in N.OutgoingRelations do Emit(id m, message getMessage(N)) class Reducer method Reduce(id m, [s1, s2,...]) M = null messages = [] for all s in [s1, s2,...] do if IsObject(s) then M = s else // s is a message messages.add(s) M.State = calculateState(messages) Emit(id m, item M)
class N State in {True = 2, False = 1, null = 0}, initialized 1 or 2 for end-of-line categories, 0 otherwise method getMessage(object N) return N.State method calculateState(state s, data [d1, d2,...]) return max( [d1, d2,...] )
class N State is distance, initialized 0 for source node, INFINITY for all other nodes method getMessage(N) return N.State + 1 method calculateState(state s, data [d1, d2,...]) min( [d1, d2,...] )
class N State is PageRank method getMessage(object N) return N.State / N.OutgoingRelations.size() method calculateState(state s, data [d1, d2,...]) return ( sum([d1, d2,...]) )
class Mapper method Initialize H = new AssociativeArray method Map(id n, object N) p = N.PageRank / N.OutgoingRelations.size() Emit(id n, object N) for all id m in N.OutgoingRelations do H{m} = H{m} + p method Close for all id n in H do Emit(id n, value H{n}) class Reducer method Reduce(id m, [s1, s2,...]) M = null p = 0 for all s in [s1, s2,...] do if IsObject(s) then M = s else p = p + s M.PageRank = p Emit(id m, item M)
Record 1: F=1, G={a, b} Record 2: F=2, G={a, d, e} Record 3: F=1, G={b} Record 4: F=3, G={a, b} Result: a -> 3 // F=1, F=2, F=3 b -> 2 // F=1, F=3 d -> 1 // F=2 e -> 1 // F=2
class Mapper method Map(null, record [value f, categories [g1, g2,...]]) for all category g in [g1, g2,...] Emit(record [g, f], count 1) class Reducer method Reduce(record [g, f], counts [n1, n2, ...]) Emit(record [g, f], null )
class Mapper method Map(record [f, g], null) Emit(value g, count 1) class Reducer method Reduce(value g, counts [n1, n2,...]) Emit(value g, sum( [n1, n2,...] ) )
class Mapper method Map(null, record [value f, categories [g1, g2,...] ) for all category g in [g1, g2,...] Emit(value f, category g) class Reducer method Initialize H = new AssociativeArray : category -> count method Reduce(value f, categories [g1, g2,...]) [g1', g2',..] = ExcludeDuplicates( [g1, g2,..] ) for all category g in [g1', g2',...] H{g} = H{g} + 1 method Close for all category g in H do Emit(category g, count H{g})
class Mapper method Map(null, items [i1, i2,...] ) for all item i in [i1, i2,...] for all item j in [i1, i2,...] Emit(pair [i j], count 1) class Reducer method Reduce(pair [i j], counts [c1, c2,...]) s = sum([c1, c2,...]) Emit(pair[i j], count s)
class Mapper method Map(null, items [i1, i2,...] ) for all item i in [i1, i2,...] H = new AssociativeArray : item -> counter for all item j in [i1, i2,...] H{j} = H{j} + 1 Emit(item i, stripe H) class Reducer method Reduce(item i, stripes [H1, H2,...]) H = new AssociativeArray : item -> counter H = merge-sum( [H1, H2,...] ) for all item j in H.keys() Emit(pair [i j], H{j})
class Mapper method Map(rowkey key, tuple t) if t satisfies the predicate Emit(tuple t, null)
class Mapper method Map(rowkey key, tuple t) tuple g = project(t) // extract required fields to tuple g Emit(tuple g, null) class Reducer method Reduce(tuple t, array n) // n is an array of nulls Emit(tuple t, null)
class Mapper method Map(rowkey key, tuple t) Emit(tuple t, null) class Reducer method Reduce(tuple t, array n) // n is an array of one or two nulls Emit(tuple t, null)
class Mapper method Map(rowkey key, tuple t) Emit(tuple t, null) class Reducer method Reduce(tuple t, array n) // n is an array of one or two nulls if n.size() = 2 Emit(tuple t, null)
class Mapper method Map(rowkey key, tuple t) Emit(tuple t, string t.SetName) // t.SetName is either 'R' or 'S' class Reducer method Reduce(tuple t, array n) // array n can be ['R'], ['S'], ['R' 'S'], or ['S', 'R'] if n.size() = 1 and n[1] = 'R' Emit(tuple t, null)
class Mapper method Map(null, tuple [value GroupBy, value AggregateBy, value ...]) Emit(value GroupBy, value AggregateBy) class Reducer method Reduce(value GroupBy, [v1, v2,...]) Emit(value GroupBy, aggregate( [v1, v2,...] ) ) // aggregate() : sum(), max(),...
class Mapper method Map(null, tuple [join_key k, value v1, value v2,...]) Emit(join_key k, tagged_tuple [set_name tag, values [v1, v2, ...] ] ) class Reducer method Reduce(join_key k, tagged_tuples [t1, t2,...]) H = new AssociativeArray : set_name -> values for all tagged_tuple t in [t1, t2,...] // separate values into 2 arrays H{t.tag}.add(t.values) for all values r in H{'R'} // produce a cross-join of the two arrays for all values l in H{'L'} Emit(null, [k r l] )
class Mapper method Initialize H = new AssociativeArray : join_key -> tuple from R R = loadR() for all [ join_key k, tuple [r1, r2,...] ] in R H{k} = H{k}.append( [r1, r2,...] ) method Map(join_key k, tuple l) for all tuple r in H{k} Emit(null, tuple [k r l] )
原文地址:[转载]MapReduce的模式、算法和用例, 感谢原作者分享。