Distributed, In-Memory Data Grids Accelerate Map/Reduce Analysis

William L. Bain

0/5 (0 vote)

May 4, 2012

CPOL

6 min read

21882

The era of “big data” is upon us, and the need for fast analysis has never been more pressing.

Download a free trial copy of ScaleOut StateServer!

The era of "big data" is upon us, and the need for fast analysis has never been more pressing. Companies that can mine data quickly and effectively can serve their customers better, hone their internal strategies, and gain a competitive advantage. For example, fast analysis enables social media Web sites to deliver timely, fine grained results on customer behavior, and it enables financial analysts to optimize trading strategies to quickly respond to market changes. Countless industry sectors from banking to climate simulation are looking for technologies which can analyze data quickly and easily.

One of the leading analysis techniques, called "map/reduce" and popularized by Google, has been quickly gaining momentum because of the popularity of the open source Hadoop programming platform. Simply stated, this technique harnesses many computers to simultaneously analyze different data within a large dataset and then combine the results. Map/reduce analysis dramatically speeds up analysis in comparison to sequentially stepping through the dataset to look for patterns of interest. Representing an evolution of data-parallel computing technology that emerged in the 1980’s, it has again taken center stage as cloud computing has made parallel computing broadly accessible.

File I/O Limits Performance

Hadoop and various commercial implementations of map/reduce typically host data either in a distributed file system, such as the Hadoop Distributed File System, or in a database and then stream it into memory for analysis. Likewise, Hadoop streams both intermediate and final results back into the file system for retrieval and reporting. The analyst decides how to assign various sections (or "splits") of a dataset to the many concurrent tasks that will analyze them, and then Hadoop stages these splits in memory as it starts up analysis tasks on the computers.

Performance studies have shown that file I/O can significantly impact the performance of map/reduce analysis. The time required to move data into and out of memory for each split lengthens the execution time for its associated map and reduce tasks, and this in turn delays completion of the overall analysis. For very large datasets which do not fit into memory, this overhead – and its performance penalty – cannot be avoided. However, ways to circumvent these delays clearly are needed to boost map/reduce performance.

In-Memory Data Grids Tackle the I/O Bottleneck

A recent Forrester survey of big data initiatives reported that about 63% of industry databases were smaller than 10TB. Since cloud computing has now made it practical to harness 200 or more servers to perform map/reduce analysis, these datasets now can be held entirely in memory. This creates an exciting opportunity to avoid file I/O and dramatically reduce analysis time for many useful datasets.

Of course, techniques have to be devised to host datasets in memory scattered across a large array of servers and then efficiently access the data on demand. A technology called distributed "in-memory data grid" (IMDG) has evolved over the last several years precisely for this purpose. Originally employed primarily as a scalable, distributed cache for database systems, commercial IMDGs have become full-fledged storage systems with integrated load-balancing, high availability, parallel query, event processing, and many other capabilities.

Typical IMDGs store data as key/value pairs. This is a perfect match for map/reduce analysis, which also uses key/value pairs to identify data. IMDGs provide programming interfaces ("APIs") which applications use to store and retrieve data. These APIs access key/value pairs somewhat like files in a file system using straightforward create/read/update/delete semantics. The APIs integrate nicely with modern, object-oriented programming languages, such as Java and C#, hosting key/value pairs in namespaces which correspond to language-defined object collections.

Commercial in-memory data grids automatically handle the problem of uniformly distributing stored data across the memory of the computers which host the grid. Applications never have to know which grid servers hold the key/value pairs they need; the IMDG takes care of mapping access requests to servers. IMDGs also provide very fast access to stored data and automatically scale their throughput to handle simultaneous accesses by many client computers. This precisely meets the demand that map/reduce analysis places on its storage system, namely that a large set of map and reduce tasks be able to simultaneously access key/value pairs for reading or writing.

For datasets that can be held in memory, IMDGs provide an extremely fast storage layer that can significantly boost the performance of map/reduce analysis. For example, consider a well known application in financial services which back-tests a mix of stock trading strategies based on historical market data. This application helps analysts to test out new trading strategies on historical stock data as a predictor of their value in live market transactions.

To see the impact of storing datasets in memory, the performance of a Hadoop map/reduce analysis on data hosted in a commercial IMDG called ScaleOut StateServer (SOSS) was compared to another run in which the data was hosted in the Hadoop Distributed File System (HDFS). The stock history data were stored in both the in-memory data grid and HDFS as key/ value pairs, one for each stock symbol. In this analysis, servers were added to scale processing power as the population of stock histories was proportionally increased to create additional workload. In all scenarios, storing the market data in an IMDG instead of in HDFS provided a significant performance improvement that reduced analysis time, yielding approximately a 6X increase in analysis throughput.

Throughput Comparison

(See the green and blue lines in the graph.) This is a remarkable boost in performance, but can we do even better?

The Next Step: Integrate Map/Reduce into the IMDG

Whether data is stored in an in-memory data grid or in a distributed file system, getting the data to and from the computer performing its associated map/reduce task creates data transfer overhead that impacts overall performance. The Hadoop map/reduce platform also introduces additional overhead, such as networking delays, in its task scheduling and in its handling of intermediate results that move between the map and reduce stages of an overall analysis.

Tightly integrating a map/reduce execution engine into an IMDG can eliminate unnecessary data motion and many other map/reduce overheads to deliver breathtaking performance gains. For example, ScaleOut StateServer’s Grid Computing Edition offers a map/reduce execution model called "parallel method invocation" (PMI) which lets applications invoke a map/reduce analysis as a simple extension to the standard APIs used for grid access. The ability to invoke map/reduce inline with program execution eliminates much of the overhead associated with Hadoop’s batch scheduler. Moreover, SOSS’s map/reduce engine ensures that map and reduce tasks are always performed on the computers which host their associated key/value pairs. This eliminates almost all network overhead involved in staging data and significantly reduces delays.

To evaluate the performance benefits of integrating map/reduce into an in-memory data grid, the financial application for back-testing stock trading strategies was implemented for SOSS’s PMI model of map/reduce. The test runs delivered a remarkable 16X increase in throughput over hosting the application in Hadoop using HDFS (as shown by the red and blue lines in the above graph). Especially in the world of financial analysis, faster analysis offers an important competitive advantage.

As cloud computing becomes more pervasive, large pools of servers will become increasing cost-effective to use for map/reduce analysis. This will enable IMDGs to host increasingly large datasets in memory, which will provide important benefits in reducing analysis time, thereby enabling analysts to more quickly and easily gain insights from their datasets.