Getting Started with MapReduce

Mihail Mateev

0/5 (0 vote)

Jul 28, 2016

CPOL

7 min read

23873

Getting Started with MapReduce

Introduction

The Cloudant engine supports several ways for fast querying data based on indexes maintained separately from the core data. Some of the activities (like CREATE,UPDATE and DELETE) affect the generation and re-generation of the indexes, which is scheduled in background tasks to have a faster, non-blocking write throughput.

We have the following indexing types to look at:

MapReduce views
Search Indexes

These indexing types are implemented by using different algorithms in the Cloudant database:

MapReduce is based on Binary Tree indexing key-value pairs, filtered from the main dataset or subsets of it, and this way presenting a configurable views.
Search indexes are used based on request parameters and this is why they are architected using Apache Lucene. This allows them to be more flexible with complicated requests such as free-text search.

What is MapReduce?

There are many good definitions describing MapReduce:

MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster. Conceptually similar approaches have been very well known since 1995 with the Message Passing Interface standard having reduced scatter operations. A MapReduce program is composed of a Map() procedure (method) that performs filtering and sorting (such as sorting students by first name into queues, one queue for each name) and a Reduce() method that performs a summary operation (such as counting the number of students in each queue, yielding name frequencies). The "MapReduce System" (also called "infrastructure" or "framework") orchestrates the processing by marshalling the distributed servers, running the various tasks in parallel, managing all communications and data transfers between the various parts of the system, and providing for redundancy and fault tolerance.

The key contributions of the MapReduce framework are not the actual map and reduce functions, but the scalability and fault-tolerance achieved for a variety of applications by optimizing the execution engine once.

MapReduce libraries have been written in many programming languages, with different levels of optimization. (Wikipedia https://en.wikipedia.org/wiki/MapReduce)

I'm not going to try to create a better description of MapReduce than the one given by Wikipedia. Instead of this I will go into details about how Cloudant implemented it.

Scalability is based on incremental view updates; each time a record is created, updated or deleted the view is triggered for re-generation. The Cloudant engine allows us to build chained MapReduce views for better performance and faster development - this is all about the ability to re-use filtered or aggregated view data into new designed views in order to create multiple similar views designed for different levels of access, different regions, etc.

MapReduce Basics

The design of the MapReduce view is based on two JavaScript functions:

mapping function - designed to allow you to filter and convert the dataset into key-value pairs in order to process the values
reduce function - designed to process values in order to aggregate data into smaller dataset

The usage of reduce script is optional. It allows us to build simple filters using the mapping function only.

    function (doc) {   
      if (doc.temp) {   
        emit(doc.time,doc.temp);   
      }   
    }

(This example of mapping function is based on our previous post's examples.)

It is creating a view of key-value part where the time stamp is a key and temperature is the value.

It's important to note keys in this case can be duplicated (as the values too), so this mapping function will not remove duplicates. MapReduce views are stored into Design Documents with both RESTful access and design UI at the Cloudant dashboard. Each design document has a unique id, storing the unique URL of this document.

    {   
      "_id": "_design/sensorExt",   
      "_rev": "7-5f949af68be60bbc8dca57447aeef309",   
      "views": {       
        "RecordCount": {   
          "reduce": "_count",   
          "map": "function (doc) {\n  emit(doc.dspl, doc.temp);\n}"  
        }   
      },   
      "language": "javascript"  
    }

The example here is counting the records with one of the predefined functions (_count).

Reduce functions

To speedup and optimize view recalculation the reduce functions are designed to work simultaneously on partial subsets of the whole dataset and to combine the results between them into the final result.

A simple example is the default code for counting:

    function (keys, values, rereduce) {   
      if (rereduce) {   
        return sum(values);   
      } else {   
        return values.length;   
      }   
    }

In this code the first "reduce" (parameter rereduce is false) is executed on array of key-value pairs (subset of whole dataset after mapping function is applied) and this is why the return is counting the records in array with "|length|."

Each next execution is receiving key-value pairs with results of counting initial bocks (and the parameter rereduce is true), so the return value is calculated as a summary of all values into the given array of counts.

Below is a Reduce function for extracting the Max Value and group:

Let's assume that we have a database of sensors, where each sensor transmits the temperature in a specific moment. We need to implement a function that implements a MapReduce that fetches maximum temperature for each sensor (this is based on previous post examples).

    function (keys, values, rereduce) {   
      var getMaxOfValues = function(vs) {      
        var max = {};      
        for(var i in vs) {      
          var val = vs[i];      
          if(val instanceof Array) {      
            val = getMaxOfValues(val, max);      
          }      
          for(var sensorType in val) {      
            var temperature = val[sensorType];      
            if(max[sensorType] === undefined ||  max[sensorType] < temperature ) {      
              max[sensorType] = temperature;      
            }      
          }     
        }      
        return max;      
      };      
            
      if(rereduce) {      
        return getMaxOfValues(values);      
      } else {      
        var max = {};      
        for(var j = 0 ; j < keys.length ; j++) {      
          var sensorType = keys[j][0];      
          var temperature = values[j];      
          if(max[sensorType] === undefined || max[sensorType] < temperature){      
            max[sensorType] = temperature;      
          }      
        }      
        return max;      
      }      
    }

The picture below shows how Cloudant MapReduce works with this function:

The first column (in blue) is representing the multithreaded "reduce" execution of the function, that is processing the mapping results. All other columns are representing the multithreaded "rereduce" executions of the functions, that are aggregating the result of the first reduce.

This way you can see it's not required to recalculate the whole tree to get to the correct result when a record is changed or added. Also it's easy to multithread it for fast results.

Using the Code

New Design Document

As mentioned above you can create views by posting a formatted JSON to your database RESTful API.

The second option is to use the Cloudant Dashboard and create new Design Document.

Cloudant Tutorial

Extending a Simple Mapping Function

In the example above we create a key/value pair of temperature and time stamp, but what if we need more than just one value? The Mapping function allows us to map as key-JSON pairs. And as this is JavaScript, you can use simple manipulations of the values (considering the load they will add to the server).

Next example exports humidity, temperature in Fahrenheit and recalculate it in Celsius:

    function (doc) {   
      if (doc.dspl=="sensorB") {   
        emit(doc.time,{"F":doc.temp,"C":((doc.temp-32)*0.5556),   
        "hmdt":doc.hmdt});   
      }   
    }

Data normalization of one to many relationships

In some cases, the JSON document stored into the database may contain multiple records - internal arrays - one to many relationships. When you need to process those it may be better to build a view exporting them as key-value pairs (more like key-JSON pairs) indexed to the original record.

The same goes into the opposite direction - when you try to reuse a view that already generates one to many relationships, you can access the original document by emitting a value with field "|_id|".

The example below creates an index, containing a relation between the document _id and each child from the children collection.

    function(doc) {   
      if (doc.children) {   
        for (child in children) {   
          emit(doc._id, { "_id": child });   
        }   
      }   
    }

Complex keys

As we mentioned above values are not limited to simple values, but a mapping function can be designed to emit key-JSON pairs. That same lack of limitation is applicable to keys - you can use JSON values to define keys. So your mapping function may produce [JSON key]-[JSON values] pairs (where the keys are indexed)

Examples for Count and MaxValues

Let us make some views using sensor data we have already in our database. In the examples below we have simple Record Count view and the Max Value view from the example above:

Note that those map functions emit key/value pairs with keys equal to dismantlement values and this is way we will have a results distinct by those value.

Request

    GET https://$USERNAME:$PASSWORD@$USERNAME.cloudant.com/$DATABASE/_design/sensorExt/_view/RecordCount?reduce=true&group=true HTTP/1.1   
    Accept: application/json   
    Content-Type: application/json

Response

    {   
        "rows": [   
            {   
                "key": "sensorA",   
                "value": 2   
            },   
            {   
                "key": "sensorB",   
                "value": 1   
            },   
            {   
                "key": "sensorC",   
                "value": 2   
            },   
            {   
                "key": "sensorD",   
                "value": 6   
            },   
            {   
                "key": "sensorE",   
                "value": 2   
            }   
        ]   
    }

Note that same request without "|reduce|" and "|group|" properties set to true returns simple total count.

Request

GET https://$USERNAME:$PASSWORD@$USERNAME.cloudant.com/$DATABASE/_design/sensorExt/_view/RecordCount HTTP/1.1   
Accept: application/json   
Content-Type: application/json

Response

{ "rows": [ { "key": null, "value": 13 } ] }

Max Value request

    GET https://$USERNAME:$PASSWORD@$USERNAME.cloudant.com/$DATABASE/_design/sensorExt/_view/maxValue HTTP/1.1   
    Accept: application/json   
    Content-Type: application/json

Response

    {   
        "rows": [   
            {   
                "key": null,   
                "value": {   
                    "sensorD": 148,   
                    "sensorA": 148,   
                    "sensorC": 145,   
                    "sensorB": 135,   
                    "sensorE": 148   
                }   
            }   
        ]   
    }

In custom reduce functions the "|Reduce|" and "|Group|" properties of the request will be valuable only in case function implementation is correct.

Response with "?reduce=true&group=true"

    {   
        "rows": [   
            {   
                "key": "sensorA",   
                "value": {   
                    "sensorA": 148   
                }   
            },   
            {   
                "key": "sensorB",   
                "value": {   
                    "sensorB": 135   
                }   
            },   
            {   
                "key": "sensorC",   
                "value": {   
                    "sensorC": 145   
                }   
            },   
            {   
                "key": "sensorD",   
                "value": {   
                    "sensorD": 148   
                }   
            },   
            {   
                "key": "sensorE",   
                "value": {   
                    "sensorE": 148   
                }   
            }   
        ]   
    }

As you can see the keys are now grouped with the max value function and we have one Max Value per key.

Summary

This tutorial is about the basics of how to use Cloudant MapReduce.

You can also learn how to use the REST interface with IBM Cloudant MapReduce model.

The next part will be focused on Cloudant Search. There will be examples how to use this feature in a simple IoT solution.