Adv Topics: MapReduce and Batching Computations

With an increase in data produced by end-user and interne of things connectivity it creates a need for a computational infrastructure that deals with live-streaming big data processing (Sakr, 2014). Real-time data processing requires continuous data input and data output that needs to be processed in a small amount of time (Narsude, 2015). Therefore, once the data comes in it must be processed to be considered a real-time analysis of the data. Real-time data processing deals with a data stream that is defined by an unbounded set of data that can be transformed into stream spouts, which is a bound subset of the stream data (Sakr, 2014).

MapReduce and Hadoop frameworks are amazing with big data-at-rest because it relies on batch jobs with a beginning and an end (Sakr, 2014). However, Narsude (2015) and Sakr (2014), stated that batch processing is not great for real-time stream analytics because stream data doesn’t stop coming in unless an end-user stops it (Sakr, 2014). Batch processing means that the end-user is periodically processing data that has been previously collected to reach a certain size or time-dependent variable (Narsude, 2015).

Streaming data has the following properties (Sakr, 2014): (a) it is generated from multiple places sensuously; (b) has continuous data being produced and in transit; (c) requires continuous parallel processing jobs; (d) online processing of data that isn’t saved, so data analytics lifecycle is applied in real-time to the streaming data; (e) persistently storing streaming data is not optimal; and (f) queries are continuous. Real-time streaming can usually be conducted through two different paradigms (figure 1): processing data one event at a time also known as tuple at a time and keeping data batches small enough to use micro-batching jobs also known as micro-batching (Narsude, 2015).

IPV4.png

Figure 1: Illustrating the difference in paradigms in real time processing where the orange dots represent an equal sized event and the blue boxes are the data processing (adapted from Narsude, 2015).   Differing event sizes do exist, which can complicate this image further.

Notice that from figure 1, these two different paradigms create different sets of latency. Thus, real-time processing produces a near real-time output. Therefore, data scientist need to take this into account when picking a tool to handle their real-time data. Some of the many large-scale stream tools include (Sakr, 2014): Aurora, Borealis, IBM System S and IBM Spade, Deduce, StreamCloud, Stormy, and Twitter Storm. The focus here will be on DEDUCE and Twitter Storm.

DEDUCE

The DEDUCE architecture offers a middleware solution that can handle streaming data processing and use MapReduce capabilities for real-time data processing (Kumar, Andrade, Gedik, & Wu, 2010; Sakr, 2014). This middleware is using a concept called micro-batching where data is collected in near real-time and is being processed distributive over a batched MapReduce job (Narsude, 2015). Thus, this middleware is using a simplified MapReduce job design for data-at-rest and data-in-motion, where the data stream can come from a variable of an operator or a punctuated input stream and the data output is written in a distributed database system (Kumar et al., 2010; Sakr, 2014). An operator output can come from an end-user, a mapper output or a reducer output (Sakr, 2014).

For its scalability features the middleware creates the illusion to the end-user that the data is stored in one location though it is stored in multiple distributed systems through a common interface (Kumar et al., 2010; Sakr, 2014). DEDUCE’s unified abstraction and runtime has the following customizable features that corresponds to its performance optimization capability (Sakr, 2014): (a) using MapReduce jobs for data flow; (b) re-using MapReduce jobs offline to calibrate data analytics models; (c) optimizing the mapper and reducer jobs under the current computational infrastructure; and (d) being able to control configurations, like update frequency and resource utilization, for the MapReduce jobs.

Twitter Storm

Twitter Storm is an open-source code that is distrusted at GitHub (David, 2011). The Twitter Storm architecture is a distributed and fault-tolerant system that is horizontally scalable to handle parallel processing, guaranteed message processing of all data at least once, fault tolerance to ensure fidelity of results, and programming language agnostic (David, 2011; Sakr, 2014). This system uses processing data one event at a time paradigm to real-time processing (Narsude, 2015).

This software uses data stream spouts and bolts, and a task can be run on either spouts or bolts. Bolts consume a multiple data stream spouts to process and analyze the data to either produce a single output or a new data stream spout (Sakr, 2014). A master node takes care of what stream data and code goes to which spout and bolt through worker node supervisors, and the worker nodes usually work a task(s) per instructions of the worker node supervisor (David, 2011; Sakr, 2014). Considering that the more complex the query, the more data spouts and bolts are needed in the processing topology (figure 2), which helps to address this solution’s scalability features (Sakr, 2014).

Capture1.GIF

Figure 2: A sample Twitter Storm stream to spout to bolt topology (Adapted from Sakr, 2014).

Finally, the performance optimization capability is based on the above sample topology image data, where stream data can be grouped in five different topological spouts and bolts configuration: stream data topology is customizable by the end user; stream data can be randomly distributed, stream data is replicated across all spouts and bolts, one data stream goes to a single bolt directly, or stream data goes into a specified group based on parameters (Sakr, 2014).

Conclusion

Real-time processing is in reality near real-time processes as it depends on the paradigm that is chosen to run the data processing scheme. DEDUCE uses a micro-batching technique, and Twitter Storm uses a tuple at a time. Thus, depending on the problem and project requirements data scientists can use either paradigm and processing scheme that meets their needs.

Resources

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s