Stop Thinking, Just Do!

Sungsoo Kim's Blog

Mapping and Folding and the Map-Reduce Paradigm


26 March 2014



The Map-Reduce Paradigm

Map-reduce is a programming model that has its roots in functional programming.  In addition to often producing short, elegant code for problems involving lists or collections, this model has proven very useful for large-scale highly parallel data processing.  Here we will think of map and reduce as operating on lists for concreteness, but they are appropriate to any collection (sets, etc.).  Map operates on a list of values in order to produce a new list of values, by applying the same computation to each value.  Reduce operates on a list of values to collapse or combine those values into a single value (or more generally some number of values), again by applying the same computation to each value. 

For large datasets, it has proven particularly valuable to think about processing in terms of the same operation being applied independently to each item in the dataset, either to produce a new dataset or to produce a summary result for the entire dataset. This way of thinking often works well on parallel hardware, where each of many processors can handle one or more data items at the same time.  The Connection Machine, a massively parallel computer developed in the late 1980s, made heavy use of this programming paradigm.  More recently, Google, with its large server farms, has made very effective use of it.  Two Google fellows, Dean and Ghemawat, have reported some of these uses in a 2004 paper at OSDI (the Symposium on Operating Systems Design and Implementation) and a 2008 paper in CACM (Communications of the ACM, the monthly magazine of the main computer science professional society).   Much of the focus in those papers is on separating the fault tolerance and distributed processing issues of operating on large clusters of machines from the programming language abstractions.  At Google they have found map-reduce to be a highly useful way of separating out the conceptual model of large-scale computations on big collections of data, using mapping and reducing operations, from the issue of how that computation is implemented in a reliable manner on a large cluster of machines. 

In the introduction of their 2008 paper, Dean and Ghemawat write “Our abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages. We realized that most of our computations involved applying a map operation to each logical record in our input in order to compute a set of intermediate key/value pairs, and then applying a reduce operation to all the values that shared the same key in order to combine the derived data appropriately.” In the paper they discuss a number of applications that are simplified by this functional programming view of massive parallelism. To give a sense of the scale of the processing done by these programs, they note that over ten thousand programs using map-reduce have been implemented at Google since 2004, and that in September 2007 over 130,000 machine-months of processing time at Google were used by map-reduce, processing over 450PB (450,000 TB) of data.

For our purposes here in a programming course, it is illustrative to see what kinds of problems Google found useful to express in the map-reduce paradigm. Counting the number of occurrences of each word in a large collection of documents is a central computational issue for indexing large document collections.  For a given word, this can be expressed as mapping a function that returns the count of that word in a document across the document collection, and then reducing the result by summing all the counts together. So if we have a list of strings, the map returns a list of integers with the count for each string.  The reduce then sums up those integers.  Many other counting problems can be implemented similarly, for instance counting the number of occurrences of some pattern in log files, such as a given user query or a particular URL.  Since there are many log files on different hosts, these can again be viewed as a large collection of strings, with the same map and reduce operations as for document word counts.
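
The word-count idea described above can be sketched in a few lines of OCaml using the built-in List.map and List.fold_left. This is only an illustration on an in-memory list of strings, not Google's implementation; the names count_word, total_count and docs are hypothetical, and the helper assumes words are separated by single spaces.

```ocaml
(* Hypothetical helper: count exact matches of [word] in [doc],
   assuming words are separated by single spaces. *)
let count_word (word : string) (doc : string) : int =
  String.split_on_char ' ' doc
  |> List.filter (fun w -> w = word)
  |> List.length

(* Map each document to its count, then reduce the counts by summing. *)
let total_count (word : string) (docs : string list) : int =
  let counts = List.map (count_word word) docs in  (* map step *)
  List.fold_left (+) 0 counts                      (* reduce step *)

(* Made-up sample data. *)
let docs = ["the cat sat"; "the dog and the bird"; "no match here"]

let per_doc = List.map (count_word "the") docs  (* [1; 2; 0] *)
let total = total_count "the" docs              (* 3 *)
```

Because each document is counted independently of the others, the map step is the part that could be spread over many machines; only the final sum needs to see all of the intermediate counts.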

Reversing the links of the web graph is another problem that can be viewed this way.  The web is a set of out-links, from a given page to the pages that it links to.  A map function can output target-source pairs for each source page, and a reduce function can collapse these into a list of source pages corresponding to each target page (i.e., links in to pages).
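
As a rough sketch of this link-reversal idea, the following OCaml code models each page as a source name paired with the list of pages it links to. The type page, the functions emit_pairs, add_pair and reverse_links, and the tiny web graph are all made up for illustration; a real system would use a far more efficient grouping structure than an association list.

```ocaml
(* A page is modeled as (source, targets it links to). *)
type page = string * string list

(* Map step: emit one (target, source) pair per out-link. *)
let emit_pairs ((src, targets) : page) : (string * string) list =
  List.map (fun tgt -> (tgt, src)) targets

(* Reduce step: add one (target, source) pair to an association list
   that maps each target to the sources linking to it. *)
let add_pair acc (tgt, src) =
  let existing =
    match List.assoc_opt tgt acc with
    | Some sources -> sources
    | None -> []
  in
  (tgt, src :: existing) :: List.remove_assoc tgt acc

let reverse_links (pages : page list) : (string * string list) list =
  pages
  |> List.map emit_pairs       (* one list of pairs per page *)
  |> List.concat               (* flatten into a single list of pairs *)
  |> List.fold_left add_pair []

(* Made-up web graph: a links to b and c, b links to c, c links to a. *)
let web = [("a", ["b"; "c"]); ("b", ["c"]); ("c", ["a"])]
let in_links = reverse_links web
(* List.assoc "c" in_links is ["b"; "a"]: both a and b link to c. *)
```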

An inverted index is a map from a term to all the documents containing that term.  It is an important part of a search engine, as the engine must be able to quickly map a term to relevant pages.  In this case a map function returns pairs of terms and document identifiers for each document. The reduce collapses the result into the list of document ID’s for a given term.

In the Google papers they report that re-implementing the production indexing system with map-reduce resulted in code that was simpler, smaller, easier to understand and modify, and resulted in a service that was easier to operate (i.e., failure diagnosis, recovery, etc.), yet the approach results in fast enough code to be used for a key part of the service.

Mapping and Folding (Reducing) in ML

First let's look in more detail at the map operation. Map applies a specified function f to each element of a list to produce a resulting list. That is, each element of the result is obtained by applying f to the corresponding element of the input list. We will consider the curried form of map. In OCaml the built-in function List.map produces the following value for a list of three elements:

map f [a; b; c] = [f a; f b; f c]

Recall that in the last lecture we introduced the list_ type:

type 'a list_ = Nil_ | Cons_ of ('a * 'a list_)

The map operation for this type can be written as:

let rec map (f: 'a->'b) (x: 'a list_): 'b list_ = 
  match x with
      Nil_ -> Nil_
    | Cons_(h,t) -> Cons_(f(h), map f t)

Note the type signature of map which is

('a -> 'b) -> 'a list_ -> 'b list_

The parameter f is a function from the element type of the input list 'a to the element type of the output list 'b.

Using map we can define a function to make a copy of a list_ (using an anonymous function),

let copy l = map (fun x -> x) l

Similarly we can create a string list_ from an int list_:

map string_of_int (Cons_(1,Cons_(2,Cons_(3,Nil_))))
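
To try these definitions out, here is a small self-contained sketch that repeats the list_ type and map from above and applies them to a sample list. The names ints, strs and doubled are just illustrative. Note that the list argument is wrapped in parentheses; without them OCaml would treat Cons_ and its tuple as two separate arguments to map.

```ocaml
type 'a list_ = Nil_ | Cons_ of ('a * 'a list_)

let rec map (f : 'a -> 'b) (x : 'a list_) : 'b list_ =
  match x with
  | Nil_ -> Nil_
  | Cons_ (h, t) -> Cons_ (f h, map f t)

(* Copying a list with the identity function. *)
let copy l = map (fun x -> x) l

let ints = Cons_ (1, Cons_ (2, Cons_ (3, Nil_)))

(* int list_ -> string list_ *)
let strs = map string_of_int ints
(* strs = Cons_ ("1", Cons_ ("2", Cons_ ("3", Nil_))) *)

(* int list_ -> int list_ *)
let doubled = map (fun x -> 2 * x) ints
(* doubled = Cons_ (2, Cons_ (4, Cons_ (6, Nil_))) *)
```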

Now let's consider the reduce operation, which like map applies a function to every element of a list, but in doing so accumulates a result rather than just producing another list. Thus in comparison with map, the reduce operator takes an additional argument of an accumulator. As with map, we will consider the curried form of reduce.

There are two versions of reduce, based on the nesting of the applications of the function f in creating the resulting value. In OCaml the built-in reduce functions that operate on lists are called List.fold_right and List.fold_left. These functions produce the following values:

fold_right f [a; b; c] r = f a (f b (f c r))
fold_left f r [a; b; c] = f (f (f r a) b) c

From the forms of the two results it can be seen why the functions are called fold_right, which uses a right-parenthesization of the applications of f, and fold_left, which uses a left-parenthesization of the applications of f. Note that the formal parameters of the two functions are in different orders: in fold_right the accumulator is to the right of the list, and in fold_left the accumulator is to the left of the list.
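
The difference in nesting only matters when f is not associative and commutative. The following sketch uses the built-in List.fold_right and List.fold_left with subtraction, and with a string-building function that makes the parenthesization visible. The names diff_r, diff_l, nest_r and nest_l are illustrative only.

```ocaml
let xs = [1; 2; 3]

(* fold_right: 1 - (2 - (3 - 0)) = 2 *)
let diff_r = List.fold_right (fun x acc -> x - acc) xs 0

(* fold_left: ((0 - 1) - 2) - 3 = -6 *)
let diff_l = List.fold_left (fun acc x -> acc - x) 0 xs

(* Build a string that shows how the applications of f nest. *)
let nest_r =
  List.fold_right (fun x acc -> "(" ^ x ^ " " ^ acc ^ ")") ["a"; "b"; "c"] "r"
(* nest_r = "(a (b (c r)))" *)

let nest_l =
  List.fold_left (fun acc x -> "(" ^ acc ^ " " ^ x ^ ")") "r" ["a"; "b"; "c"]
(* nest_l = "(((r a) b) c)" *)
```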

Again using the list_ type we can define these two functions as follows:

let rec fold_right (f:'a -> 'b -> 'b) (lst: 'a list_) (r:'b): 'b = 
    match lst with
      Nil_ -> r
    | Cons_(hd,tl) -> f hd (fold_right f tl r)

and

let rec fold_left (f: 'a -> 'b -> 'a) (r: 'a) (lst: 'b list_): 'a = 
    match lst with
      Nil_ -> r
    | Cons_(hd,tl) -> fold_left f (f r hd) tl

Note the type signature of fold_right which is

('a -> 'b -> 'b) -> 'a list_ -> 'b -> 'b

The parameter f is a function from the element type of the input list 'a and the type of the accumulator 'b to the type of the accumulator. The type signature is analogous for fold_left, except that the order of the parameters to both f and fold_left itself is reversed compared with fold_right.

Given these definitions, operations such as summing all of the elements of a list of integers can naturally be defined using either fold_right or fold_left.

fold_right (fun x y -> x+y) il 0
fold_left (fun x y -> x+y) 0 il
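
For concreteness, here is a self-contained sketch that binds il to a sample list (an assumption, since il is left unspecified above) and evaluates both sums. It also shows two other reductions, product and digits, which are illustrative names. In digits the accumulator type (string) differs from the element type (int).

```ocaml
type 'a list_ = Nil_ | Cons_ of ('a * 'a list_)

let rec fold_right f lst r =
  match lst with
  | Nil_ -> r
  | Cons_ (hd, tl) -> f hd (fold_right f tl r)

let rec fold_left f r lst =
  match lst with
  | Nil_ -> r
  | Cons_ (hd, tl) -> fold_left f (f r hd) tl

(* Sample input, assumed for illustration. *)
let il = Cons_ (1, Cons_ (2, Cons_ (3, Nil_)))

let sum_r = fold_right (fun x y -> x + y) il 0  (* 6 *)
let sum_l = fold_left (fun x y -> x + y) 0 il   (* 6 *)

(* Other reductions follow the same pattern. *)
let product = fold_left (fun acc x -> acc * x) 1 il             (* 6 *)
let digits = fold_left (fun acc x -> acc ^ string_of_int x) "" il  (* "123" *)
```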

The power of fold

Folding is a very powerful operation.  We can write many other list functions in terms of fold.  In fact map, while it initially sounded quite different from fold, can naturally be defined using fold_right, by accumulating a result that is a list. Continuing with our list_ type,

let mapp f l = (fold_right (fun x y -> Cons_((f x),y)) l Nil_)

The accumulator function simply applies f to each element and builds up the resulting list, starting from the empty list.

The entire map-reduce paradigm can thus actually be implemented using fold_left and fold_right.  However, it is often conceptually useful to think of map as producing a list and of reduce as producing a value. 

What about using fold_left instead to define map? In this case we get a function that not only does a map but also produces an output that is in reverse order of the input list. Note that fold_left takes its arguments in a different order than fold_right (the order of the list and accumulator are swapped), and it also requires a function f that takes its arguments in the opposite order of the f used in fold_right.

let maprev f l = fold_left (fun x y -> Cons_((f y),x)) Nil_ l

This resulting function can also be quite useful, particularly as it is tail recursive.
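
One way to take advantage of this, sketched below, is to recover an order-preserving map by reversing the output of maprev with a second fold_left. The names rev and map_tr are hypothetical; both passes are tail recursive, at the cost of traversing the list twice.

```ocaml
type 'a list_ = Nil_ | Cons_ of ('a * 'a list_)

let rec fold_left f r lst =
  match lst with
  | Nil_ -> r
  | Cons_ (hd, tl) -> fold_left f (f r hd) tl

(* Map that produces its result in reverse order. *)
let maprev f l = fold_left (fun x y -> Cons_ (f y, x)) Nil_ l

(* List reversal is itself a fold_left. *)
let rev l = fold_left (fun acc x -> Cons_ (x, acc)) Nil_ l

(* Order-preserving map built from two tail-recursive passes. *)
let map_tr f l = rev (maprev f l)

let ints = Cons_ (1, Cons_ (2, Cons_ (3, Nil_)))
let reversed = maprev (fun x -> x * 10) ints
(* reversed = Cons_ (30, Cons_ (20, Cons_ (10, Nil_))) *)
let in_order = map_tr (fun x -> x * 10) ints
(* in_order = Cons_ (10, Cons_ (20, Cons_ (30, Nil_))) *)
```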

Another useful variation on mapping is filtering, which selects a subset of a list according to some Boolean criterion,

let filter f l = (fold_right (fun x y -> if (f x) then Cons_(x,y) else y) l Nil_)

The function f is a predicate of one argument that determines whether an element belongs in the resulting list. Now we can easily filter a list of integers for the even ones:

filter (fun x -> (x / 2)*2 = x) (Cons_(1,Cons_(2,Cons_(3,Nil_))))

Note that if we define a function that filters for even elements of a list:

let evens l = filter (fun x -> (x / 2)*2 = x) l;;

then the type of the parameter and result is restricted to int list_ rather than the more general 'a list_ of the underlying filter, because the anonymous predicate takes an integer parameter and returns a Boolean value.
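
The following sketch illustrates that filter itself stays polymorphic: the same definition is specialized once to integers and once to strings. Here evens uses x mod 2 = 0, which should behave the same as the (x / 2)*2 = x test above, and nonempty is a hypothetical second example.

```ocaml
type 'a list_ = Nil_ | Cons_ of ('a * 'a list_)

let rec fold_right f lst r =
  match lst with
  | Nil_ -> r
  | Cons_ (hd, tl) -> f hd (fold_right f tl r)

(* filter : ('a -> bool) -> 'a list_ -> 'a list_ *)
let filter f l =
  fold_right (fun x y -> if f x then Cons_ (x, y) else y) l Nil_

(* evens : int list_ -> int list_ *)
let evens l = filter (fun x -> x mod 2 = 0) l

(* nonempty : string list_ -> string list_ *)
let nonempty l = filter (fun s -> s <> "") l

let even_result = evens (Cons_ (1, Cons_ (2, Cons_ (3, Cons_ (4, Nil_)))))
(* even_result = Cons_ (2, Cons_ (4, Nil_)) *)
let str_result = nonempty (Cons_ ("a", Cons_ ("", Cons_ ("b", Nil_))))
(* str_result = Cons_ ("a", Cons_ ("b", Nil_)) *)
```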

Determining the length of a list is another operation that can easily be defined in terms of folding. 

let length l = fold_left (fun x _ -> 1 + x) 0 l

What about using fold_right for this?
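
One possible answer, given here as a sketch: with fold_right the folding function receives the element first and the accumulator second, so the ignored argument moves to the first position. The name length_r is made up for this example. Unlike the fold_left version, it is not tail recursive, since each call to f waits for the fold over the rest of the list.

```ocaml
type 'a list_ = Nil_ | Cons_ of ('a * 'a list_)

let rec fold_right f lst r =
  match lst with
  | Nil_ -> r
  | Cons_ (hd, tl) -> f hd (fold_right f tl r)

let rec fold_left f r lst =
  match lst with
  | Nil_ -> r
  | Cons_ (hd, tl) -> fold_left f (f r hd) tl

(* Accumulator first, element (ignored) second. *)
let length l = fold_left (fun x _ -> 1 + x) 0 l

(* Element (ignored) first, accumulator second. *)
let length_r l = fold_right (fun _ n -> 1 + n) l 0

let sample = Cons_ ("a", Cons_ ("b", Cons_ ("c", Nil_)))
let n_left = length sample     (* 3 *)
let n_right = length_r sample  (* 3 *)
```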

You should try writing some functions using map, fold_left and fold_right.  These primitives can be incredibly useful.  Even in languages that do not have these operations built-in, they are useful ways of thinking about structuring many computations.

References

[1] Mapping and Folding and the Map-Reduce Paradigm, Data Structures and Functional Programming Course, Computer Science Department, Cornell University, Spring 2009.

