Adv Topics: MapReduce and Incremental Computation

Data usually gets updated on a regular basis. Connolly and Begg (2014) noted that data can be updated incrementally, where only small sections of the data change, or it can be updated completely. Examples of data that tend to be updated incrementally include webpages, computer code, stale data, data-at-rest, and bodies of knowledge, whereas weather data, space weather data, social media data, data-in-motion, and dynamic data tend to be updated completely. Both kinds of data present their own challenges for data processing. For web data, the ratio between new and old data between two processing runs can range from roughly 10x to 1000x (Sakr, 2014). Thus, the focus of this discussion is on incremental data updates and how to process the data that changes between two data processing runs.

Incoop is an extension of Hadoop that allows incremental changes to big data to be processed by splitting the main computation into sub-computations, logging data updates in a memoization server, and checking the inputs fed to each sub-computation (Bhatotia et al., 2011; Sakr, 2014). These sub-computations are usually mappers and reducers (Sakr, 2014). Incremental mappers check against the memoization server, and if their input data has already been processed and is unchanged, they do not reprocess it; incremental reducers follow a similar process, checking for changed mapper outputs (Bhatotia et al., 2011).
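A minimal sketch of that memoization idea is below, assuming a plain Python dictionary stands in for Incoop's memoization server and a word-count mapper stands in for a sub-computation; the fingerprint, mapper, and incremental_mapper names are illustrative, not Incoop's API.

```python
import hashlib

# Hypothetical stand-in for Incoop's memoization server: maps a fingerprint
# of an input split to the mapper output previously computed for it.
memo_server = {}

def fingerprint(split):
    """Content-based fingerprint of an input split."""
    return hashlib.sha256(split.encode("utf-8")).hexdigest()

def mapper(split):
    """Plain word-count mapper: emits (word, 1) pairs."""
    return [(word, 1) for word in split.split()]

def incremental_mapper(split):
    """Re-run the mapper only if this split was not seen before."""
    key = fingerprint(split)
    if key in memo_server:
        return memo_server[key]      # unchanged split: reuse memoized result
    result = mapper(split)           # new or changed split: recompute
    memo_server[key] = result
    return result

# The first run processes everything; the second run, where only one split
# changed, recomputes only that split's mapper output.
splits_v1 = ["the cat sat", "on the mat"]
splits_v2 = ["the cat sat", "on the red mat"]   # only the second split changed
run1 = [incremental_mapper(s) for s in splits_v1]
run2 = [incremental_mapper(s) for s in splits_v2]
```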

Subsequently, MapReduce is an analytical engine and pattern that takes advantage of distributed systems while keeping the processing and the data together on the same machine (Sadalage & Fowler, 2012). There are a few key principles for using the MapReduce framework and Hadoop efficiently to improve incremental computation:

  • Data partitioning: The MapReduce framework aids in partitioning the data into similarly sized sets stored in the Hadoop Distributed File System, aka HDFS (Lublinsky, Smith, & Yakubovich, 2013). MapReduce can therefore work on smaller sets of data stored in HDFS, which is part of what makes the cluster scalable.
  • Fault tolerance and durability: Given that data can be partitioned into tiny chunks across thousands of computation nodes and run in parallel, some of these nodes will occasionally fail (Sakr, 2014). The MapReduce framework replicates the data in the background and can launch backup jobs if a node fails (Lublinsky et al., 2013; Sakr, 2014). Thus, a failure does not disrupt the data processing; however, it does increase the number of processors needed (Connolly & Begg, 2014).
  • Parallelization: The partitioned input data are treated as independent sets of data, so the mapper functions can process them in a parallel environment, as in the sketch after this list (Lublinsky et al., 2013; Sakr, 2014). This principle allows the sub-functions within the mapper and reducer functions to handle smaller data. It lets a data analyst focus on the main problem rather than on low-level parallel coding abstractions such as multithreading, file allocation, and memory management (Sakr, 2014). Serial (non-parallel) processing, by contrast, does not allow for small incremental updates on large data (Connolly & Begg, 2014).
  • Data reuse: There is no need to read or write intermediate data, thus preserving the input data and enabling it to be reused when it is unchanged (Lublinsky et al., 2013; Sakr, 2014).
  • Self-adjusting computation: Used for incremental computation, this lets mappers and reducers work only on the smaller sets of data that are impacted by a change (Sakr, 2014).
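As a minimal sketch of the parallelization principle above, the example below fans a word-count mapper out over independent partitions with Python's multiprocessing pool and then reduces the results; the partition contents and function names are illustrative only, not Hadoop's API.

```python
from multiprocessing import Pool
from collections import defaultdict

def map_partition(partition):
    """Mapper: each partition is processed independently, so partitions
    can run on separate cores (or nodes) with no coordination."""
    return [(word, 1) for word in partition.split()]

def reduce_by_key(mapped_lists):
    """Reducer: group the mapper outputs by key and sum the counts."""
    totals = defaultdict(int)
    for pairs in mapped_lists:
        for key, value in pairs:
            totals[key] += value
    return dict(totals)

if __name__ == "__main__":
    partitions = ["big data needs parallel processing",
                  "parallel mappers process independent partitions"]
    with Pool() as pool:                 # mapper calls run in parallel
        mapped = pool.map(map_partition, partitions)
    print(reduce_by_key(mapped))         # e.g. {'parallel': 2, ...}
```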

Both Bhatotia et al. (2011) and Sakr (2014) described Inc-HDFS, an extension of HDFS that partitions data based on content and removes duplicated data. A limitation of this approach is that content-based partitioning may group the files into too many or too few content bins, so the resulting partitions may not be evenly sized (Bhatotia et al., 2011). This matters because too many mapper functions create infrastructure overhead, which increases resources and therefore cost; too few mapper functions create huge workloads for certain computational nodes; too many reducers produce too many outputs; and too few reducers produce too few outputs (Lublinsky et al., 2013; Sakr, 2014). A constraint must therefore be added on both ends of the spectrum to keep the data sets evenly distributed (Bhatotia et al., 2011).
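A minimal sketch of content-based partitioning with size constraints on both ends is below, assuming a toy sliding-window boundary test in place of Inc-HDFS's actual fingerprinting; MIN_CHUNK, MAX_CHUNK, and the function name are illustrative only.

```python
MIN_CHUNK = 16   # lower bound: avoids too many tiny chunks (too many mappers)
MAX_CHUNK = 64   # upper bound: avoids oversized chunks (overloaded mappers)
WINDOW = 8       # bytes examined by the toy content-based boundary test

def content_defined_chunks(data: bytes):
    """Split data at content-dependent boundaries, constrained so every
    chunk ends up between MIN_CHUNK and MAX_CHUNK bytes."""
    chunks, start = [], 0
    for i in range(len(data)):
        size = i - start + 1
        window = data[max(start, i - WINDOW + 1): i + 1]
        at_marker = sum(window) % 17 == 0     # boundary depends on content only
        if (size >= MIN_CHUNK and at_marker) or size >= MAX_CHUNK:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks

text = b"Inc-HDFS partitions data by content rather than by fixed byte offsets. " * 4
for chunk in content_defined_chunks(text):
    print(len(chunk), chunk[:24])
```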

Subsequently, the out-of-the-box Hadoop scheduler does not account for the memoization server's data and is therefore not built for incremental analysis. Incoop instead uses a memoization-aware scheduler that schedules sub-computations based on task affinity, allowing efficient reuse of previously computed results (Bhatotia et al., 2011; Sakr, 2014). The scheduler can run tasks on computational nodes that are either faster or local to where the data is stored (Bhatotia et al., 2011). With this type of scheduler, the priority is deciding whether it is faster to move the data and run the task on a faster computational node, or to reduce data movement and process the data locally, while still making effective use of unchanged, already-processed data.
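A toy sketch of memoization-aware scheduling follows, assuming each node advertises which splits it has memoized results for and which it stores locally; the node records, the speed field, and the schedule function are hypothetical simplifications, not Incoop's scheduler API.

```python
def schedule(task, nodes):
    """Pick the node with the best affinity for this task: memoized results
    first, then data locality, then raw speed."""
    def affinity(node):
        return (
            task["split_id"] in node["memoized_splits"],   # reuse prior sub-computation
            task["split_id"] in node["local_splits"],      # avoid data movement
            node["speed"],                                 # otherwise, fastest node
        )
    return max(nodes, key=affinity)

nodes = [
    {"name": "node-a", "memoized_splits": {"s1"}, "local_splits": {"s1", "s2"}, "speed": 1.0},
    {"name": "node-b", "memoized_splits": set(),  "local_splits": {"s3"},       "speed": 2.5},
]
print(schedule({"split_id": "s1"}, nodes)["name"])   # node-a: has a memoized result
print(schedule({"split_id": "s3"}, nodes)["name"])   # node-b: data-local and faster
```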

In the end, a practical application of this technique for analyzing web data would be to first partition the web data by its content under the Inc-HDFS framework: scientific content in one content partition, corporate financial content in another, and so on. These partitions are capped in size to allow for proper load balancing. Incoop then runs the MapReduce functions to process the data in a distributed way using parallel processes. If the data gets updated, for example a new corporate filing added to the SEC 10-K data, the update is recorded by the memoization server. The memoization-aware scheduler then lets Incoop process that incremental change from the corporate financial content partition alone, rerunning the mapping and reducing functions on just that small, partitioned dataset and thereby making effective use of unchanged, already-processed data.

Resources:

  • Sadalage, P. J., Fowler, M. (2012). NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence, (1st ed.). Vitalbook file.
  • Sakr, S. (2014). Large Scale and Big Data, (1st ed.). Vitalbook file.

Adv Topics: MapReduce and Iterative Computation

Data-in-motion is the real-time streaming of data from a broad spectrum of technologies, and it also encompasses data transmission between systems; data stored in a database or cloud system is considered data-at-rest, and data being processed and analyzed is considered data-in-use (Katal, Wazid, & Goudar, 2013; Kishore & Sharma, 2016; Ovum, 2016; Ramachandran & Chang, 2016). Social media or social network analysis data can be considered data-in-motion, and processing that type of data can be quite problematic. Data-in-motion has to be processed iteratively until a certain termination condition is reached, and that condition can be reached between iterations (Sakr, 2014). Data-at-rest is arguably easier to analyze; however, this type of data can also be problematic: even if the data does not change or evolve, its large size can require iterative processes to analyze it.

An example of an iterative process, as suggested by Lublinsky, Smith, and Yakubovich (2013), is solving a system of linear equations by approximation algorithms over hundreds of equations and variables. The problem can be stated as Ax = b, where A is a matrix of coefficients, b is a vector of output values, and x is a vector of unknowns. If the data is too large, a direct linear algebraic solution becomes impractical, so an approximate, iterative solution minimizes the quadratic form:

f(x) = ½ xᵀAx − xᵀb

where the superscript “T” represents transposing the vector or matrix. Each iteration of this approximation yields a better vector solution. However, Sakr (2014) stated that MapReduce does not support iterative data processing and analysis directly, so a workaround is needed to handle iterative programs in situations like data-in-motion or even streaming data.
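As an illustrative sketch of such iterative refinement, the example below uses steepest descent (one common approximation algorithm, not necessarily the one Lublinsky et al. describe), assumes NumPy is available, and uses a small symmetric positive-definite A; each iteration improves x until the termination condition is met.

```python
import numpy as np

# Minimizing f(x) = 1/2 x^T A x - x^T b; for a symmetric positive-definite A
# the minimizer is exactly the solution of Ax = b.
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])

x = np.zeros_like(b)      # initial guess
tolerance = 1e-8          # termination condition checked between iterations

for iteration in range(1000):
    residual = b - A @ x  # negative gradient of f at x
    if np.linalg.norm(residual) < tolerance:
        break             # converged: stop iterating
    step = (residual @ residual) / (residual @ (A @ residual))  # exact line search
    x = x + step * residual        # each iteration yields a better vector solution

print(iteration, x)       # approximately [0.0909, 0.6364]
```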

Root causes and technical steps to address them

To deal with datasets that require iterative processing, computer coders need to create and arrange multiple MapReduce functions in a loop (Sakr, 2014). This workaround increases the processing time of the serialized program because data has to be reloaded and reprocessed at every iteration, since there is no read or write of intermediate data, a design choice originally made to preserve the input data (Lublinsky et al., 2013; Sakr, 2014). The root cause exists because, in its simplest form, MapReduce can consist of many mappers and one reducer, which is a performance bottleneck. This simplified model of the MapReduce analytical engine means one cannot reduce across keys, just one key at a time (Sadalage & Fowler, 2012). Thus, the algorithm is not built for iterations. HaLoop is an iteration solution built on top of the Hadoop infrastructure with a loop-control module and a task scheduler that caches invariant data from a previous iteration and reuses it in future iterations (Sakr, 2014). This solution can be applied to static, unchanged data.
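A minimal sketch of that driver-loop workaround is shown below in plain Python rather than actual Hadoop jobs, using a toy PageRank-style computation as the iterative job; the graph, damping factor, and convergence threshold are illustrative. The link graph is the loop-invariant data that HaLoop would cache between iterations instead of reloading it each pass.

```python
from collections import defaultdict

# Tiny link graph: page -> outgoing links (loop-invariant data).
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = {page: 1.0 / len(graph) for page in graph}

def map_ranks(graph, ranks):
    """Mapper pass: each page emits a share of its rank to its neighbors."""
    for page, links in graph.items():
        for target in links:
            yield target, ranks[page] / len(links)

def reduce_ranks(pairs, damping=0.85, n_pages=3):
    """Reducer pass: sum the contributions per page and apply damping."""
    totals = defaultdict(float)
    for page, share in pairs:
        totals[page] += share
    return {p: (1 - damping) / n_pages + damping * s for p, s in totals.items()}

# Driver: one full MapReduce job per iteration, looped until convergence.
for iteration in range(100):
    new_ranks = reduce_ranks(map_ranks(graph, ranks))
    if max(abs(new_ranks[p] - ranks[p]) for p in graph) < 1e-6:
        break
    ranks = new_ranks

print(iteration, ranks)
```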

There are also disadvantages to using MapReduce on these types of data: too many mapper functions can create infrastructure overhead, and too many reducers can produce too many outputs (Lublinsky et al., 2013; Sakr, 2014). Thus, there has to be a basic implementation plan for effective data placement over the Hadoop cluster to ensure a proper load balance of data across the Hadoop servers (Sakr, 2014). Sometimes there is a need to run two separate MapReduce functions: one to prepare the data by distributing it evenly across the servers, and one that iterates through the data (Lublinsky et al., 2013). CoHadoop allows data files and related files to be stored on the same server, providing a means of load balancing and fault tolerance through a file-level locator property (Sakr, 2014). Sakr describes the locator property as a means of keeping track of where the data is stored via a unique identification number for each file in the system. This solution can also be applied to static, unchanged data.
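A toy sketch of the locator idea follows, assuming a plain dictionary stands in for CoHadoop's file-level locator table; the node names, locator id, and place_file function are hypothetical, not CoHadoop's API.

```python
# Toy stand-in for a file-level locator table: files that share a locator id
# are placed on the same node so that related files stay co-located.
locator_table = {}        # locator id -> nodes holding files with that locator
file_locators = {}        # file name  -> locator id

nodes = ["node-1", "node-2", "node-3"]

def place_file(filename, locator_id):
    """Place a file; files with the same locator land on the same nodes."""
    if locator_id not in locator_table:
        # First file with this locator: choose a node for the whole group.
        locator_table[locator_id] = [nodes[hash(locator_id) % len(nodes)]]
    file_locators[filename] = locator_id
    return locator_table[locator_id]

# Transaction data and its related reference data share locator 42, so a join
# between them can run without moving data between nodes.
print(place_file("transactions_2024.log", 42))
print(place_file("customer_reference.tbl", 42))
```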

For data-in-motion and streaming data, the data has to be processed iteratively until a certain termination condition is reached. MapReduce Online takes an approach in which data is pipelined between the mapper and reducer functions, allowing data processing and analysis to begin as soon as the mappers produce their outputs (Sakr, 2014). This can run the MapReduce functions iteratively and provide relatively live, reduced data outputs. Sakr further explains that in this approach the reducer contacts every mapper upon initiation by the scheduler, and the data is temporarily stored in the pipeline, which is an in-memory buffer.
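A minimal sketch of the pipelining idea is below, using Python generators as a stand-in for MapReduce Online's in-memory buffer between mappers and reducers; the streaming_mapper and online_reducer names and the click-stream records are illustrative only.

```python
from collections import defaultdict

def streaming_mapper(records):
    """Mapper that yields (key, value) pairs as soon as each record arrives,
    instead of materializing all output before the reduce phase starts."""
    for record in records:
        for word in record.split():
            yield word, 1

def online_reducer(pipeline):
    """Reducer that consumes the pipelined mapper output incrementally and
    can report a running (approximate) result at any point."""
    running = defaultdict(int)
    for key, value in pipeline:
        running[key] += value
        yield dict(running)       # snapshot of the "live" reduced output

stream = ["click from user a", "click from user b", "purchase from user a"]
pipeline = streaming_mapper(stream)      # stand-in for the in-memory buffer
for snapshot in online_reducer(pipeline):
    pass
print(snapshot)                          # final counts once the stream ends
```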

Resources:

  • Katal, A., Wazid, M., & Goudar, R. H. (2013, August). Big data: issues, challenges, tools and good practices. In Contemporary Computing (IC3), 2013 Sixth International Conference on (pp. 404-409). IEEE.
  • Kishore, N. & Sharma, S. (2016). Secure data migration from enterprise to cloud storage – analytical survey. BIJIT-BVICAM’s International Journal of Information Technology. Retrieved from http://bvicam.ac.in/bijit/downloads/pdf/issue15/09.pdf
  • Lublinsky, B., Smith, K. T., & Yakubovich, A. (2013). Professional Hadoop Solutions. Vitalbook file.
  • Ovum (2016). 2017 Trends to watch: Big Data. Retrieved from http://info.ovum.com/uploads/files/2017_Trends_to_Watch_Big_Data.pdf
  • Ramachandran, M. & Chang, V. (2016). Toward validating cloud service providers using business process modeling and simulation. Retrieved from http://eprints.soton.ac.uk/390478/1/cloud_security_bpmn1%20paper%20_accepted.pdf
  • Sadalage, P. J., Fowler, M. (2012). NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence, (1st ed.). Vitalbook file.
  • Sakr, S. (2014). Large Scale and Big Data, (1st ed.). Vitalbook file.

Adv Topics: MapReduce and Hadoop

Hadoop allows for data processing through MapReduce, and it also allows for data storage (Lublinsky et al., 2013). MapReduce is an analytical engine and pattern that takes advantage of distributed systems while keeping the processing and the data together on the same machine (Sadalage & Fowler, 2012). MapReduce thus contains two functions that work in parallel on distributed systems (Hortonworks, 2013; Sadalage & Fowler, 2012; Sakr, 2014; Sathupadi, 2010):

    1. Mapper functions create and process transactions on the system by mapping and aggregating data by key values. Mappers read only one data record at a time.
    2. Reducer functions know what the key values are and take all the values stored in a map to reduce the data to what is relevant. Reducers help summarize the data into a single output, which helps limit the amount of data moving between multiple computational nodes.

Lublinsky, Smith, and Yakubovich (2013) stated that an intermediate component of MapReduce, known as shuffle and sort, moves the outputs of the mapping function and presents them to the reducer function.
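To make the map, shuffle-and-sort, and reduce phases concrete, here is a minimal single-machine word-count sketch in Python; it only illustrates the data flow and is not how Hadoop itself is programmed.

```python
from itertools import groupby
from operator import itemgetter

def mapper(record):
    """Map: read one record at a time and emit (key, value) pairs."""
    return [(word, 1) for word in record.split()]

def shuffle_and_sort(mapped):
    """Shuffle and sort: group all values emitted for the same key."""
    pairs = sorted((kv for out in mapped for kv in out), key=itemgetter(0))
    return {key: [v for _, v in group]
            for key, group in groupby(pairs, key=itemgetter(0))}

def reducer(key, values):
    """Reduce: summarize all the values for one key into a single output."""
    return key, sum(values)

records = ["to be or not to be", "to see or not to see"]
mapped = [mapper(r) for r in records]
grouped = shuffle_and_sort(mapped)
print(dict(reducer(k, vs) for k, vs in grouped.items()))   # {'to': 4, 'be': 2, ...}
```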

Thus, MapReduce is a framework that uses parallel sequential algorithms that capitalize on cloud architecture, and it became popular under the open-source Hadoop project as its main executable analytic engine (Lublinsky et al., 2013; Sadalage & Fowler, 2012; Sakr, 2014). Essentially, a sequential algorithm is a computer program that runs a sequence of commands, while a parallel algorithm runs a set of sequential commands over separate computational cores (Brookshear & Brylow, 2014; Sakr, 2014). Thus, a parallel sequential algorithm runs a full sequential program over multiple, separate cores (Sakr, 2014). Another feature of MapReduce is that one job’s reduced output can become another job’s map input (Sadalage & Fowler, 2012). Subsequently, the advantages (+) and disadvantages (–) of using MapReduce are (Lublinsky et al., 2013; Sakr, 2014):

+ aggregation under the mapper function can exploit multiple different techniques

+ no read or write of intermediate data, thus preserving the input data

+ no need to serialize or de-serialize code in either memory or processing

+ it is scalable based on the size of data and resources needed for processing the data

+ isolation of the sequential program from data distribution, scheduling, and fault tolerance

– too many mapper functions can create an infrastructure overhead, which increases resources and thus cost

– too few mapper functions can create huge workloads for certain types of computational nodes

– too many reducers can provide too many outputs, and too few reducers can provide too few outputs

– it is a different programming paradigm that most programmers are not familiar with

– available parallelism will be underutilized for smaller data sets

Given that Hadoop is predominantly known for popularizing MapReduce tasks, it is also known for its Hadoop Distributed File System (HDFS), in which the data is distributed across multiple systems (Rathbone, 2013). Hadoop’s service is part of the cloud, as Platform as a Service (PaaS). For PaaS, the end users manage the applications and data, whereas the provider (Hadoop) administers the runtime, middleware, OS, virtualization, servers, storage, and networking (Lau, 2001). Data is broken up into small blocks, like Legos, so that the blocks are distributed across a distributed database system and across multiple servers and can be processed on all of those servers, e.g., a Hadoop cluster (IBM, n.d.).

A common example of a parallel sequential program is a dynamical weather forecasting model. In dynamical weather forecasting models, a set of defined geodynamic, thermodynamic, and physical sequential algorithms defines and evolves the seven main weather variables across time. For each time step, the forecasting model runs these sequential algorithms over each grid point, which can represent a finite geospatial region. These geospatial regions are split among multiple computational cores. The example grows in complexity when data has to travel between different finite geospatial regions through their boundaries, which is an example of data parallelism (Sakr, 2014). MapReduce uses the concept of data parallelism to help map and reduce data, so weather models could be considered a loose form of a MapReduce algorithm.
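A toy sketch of that kind of data parallelism is shown below, assuming a one-dimensional “grid” of temperatures, a simple neighbor-averaging update in place of real geodynamic equations, and two regions standing in for separate cores; the boundary (halo) values are the data that travels between regions.

```python
# Each region is updated independently (the part that can run on separate
# cores); only the boundary (halo) values cross region boundaries.
def smooth_region(region, left_halo, right_halo):
    """One time step: each point becomes the average of itself and its neighbors."""
    padded = [left_halo] + region + [right_halo]
    return [(padded[i - 1] + padded[i] + padded[i + 1]) / 3.0
            for i in range(1, len(padded) - 1)]

grid = [10.0, 12.0, 15.0, 20.0, 18.0, 14.0, 11.0, 9.0]   # e.g. temperatures
regions = [grid[:4], grid[4:]]                            # split across "cores"

# Exchange boundary values between neighboring regions, then update each region.
halos = [(grid[0], regions[1][0]), (regions[0][-1], grid[-1])]
updated = [smooth_region(r, lh, rh) for r, (lh, rh) in zip(regions, halos)]
print([round(v, 2) for region in updated for v in region])
```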

Resources: