Adv Topics: Extracting Knowledge from big data

The evolution of data to wisdom is described by the DIKW pyramid, where Data is just facts without any context, but when facts are used to understand relationships, they generate Information (Almeyer-Stubbe & Coleman, 2014). When that information is used to understand patterns, it helps build Knowledge, and when that knowledge is used to understand principles, it builds Wisdom (Almeyer-Stubbe & Coleman, 2014; Bellinger, Castro, & Mills, n.d.). Moving up each level of the DIKW pyramid requires building an understanding of “why” (Bellinger et al., n.d.). Big data, a term first coined in a Gartner blog post, is data with high volume, variety, and velocity, but without any effort to understand that data, data scientists will lack context (Almeyer-Stubbe & Coleman, 2014; Bellinger et al., n.d.; Laney, 2001). Therefore, applying the DIKW pyramid can help turn that big data into extensive knowledge (Almeyer-Stubbe & Coleman, 2014; Bellinger et al., n.d.; Sakr, 2014). Extensive knowledge is derived from placing meaning on big data, usually in the form of predictive analytics algorithms (Sakr, 2014).

Machine learning requires historical data and is part of the data analytics process, falling under data mining, to uncover hidden patterns or structures within the data (Almeyer-Stubbe & Coleman, 2014). Machine learning models are easier to build and maintain than other classical data mining techniques (Wollan, Smith, & Zhou, 2010). Machine learning algorithms include clustering, classification, and association rules techniques, and the right algorithm from these three families must be selected to meet the needs of the data (Services, 2015). Unsupervised machine learning techniques like clustering are used when data scientists do not understand or cannot classify the data before mining it, in order to uncover hidden structures within the data set (Brownlee, 2016; Services, 2015). Supervised machine learning involves model training and model testing to understand which input variables feed into an output variable, involving techniques such as classification and regression (Brownlee, 2016).
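
To make the distinction concrete, here is a minimal sketch in Python using scikit-learn and a synthetic data set; the library, data, and parameters are illustrative assumptions, not something the cited authors prescribe. The clustering step receives no labels, while the classifier is trained and then evaluated on held-out records.

```python
# Minimal sketch: unsupervised clustering vs. supervised train/test classification.
# All data and parameters below are synthetic/illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data set: 500 records, 10 input variables, 1 binary output variable.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# Unsupervised: no labels are given; KMeans searches for hidden structure.
clusters = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)

# Supervised: the model is trained on labeled data and tested on held-out data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

print("Cluster sizes:", {c: list(clusters).count(c) for c in set(clusters)})
print("Held-out classification accuracy:", model.score(X_test, y_test))
```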

An example of an open-source Hadoop machine learning algorithm library is Apache Mahout, which can be found at http://mahout.apache.org (Lublinsky, Smith, & Yakubovich, 2013). One limitation of learning from historical data to predict the future is that it can “stifle innovation and imagination” (Almeyer-Stubbe & Coleman, 2014). Another limitation is that current algorithms may not run on distributed database systems, so some tailoring of the algorithms may be needed (Services, 2015). The future of machine learning involves its algorithms becoming more interactive to the end user, known as active learning (Wollan, Smith, & Zhou, 2010).

Case Study: Machine learning, medical diagnosis, and biomedical engineering research – commentary (Foster, Koprowski, & Skufca, 2014)

The authors created a synthetic training data set to simulate a typical medical classification problem of healthy and ill people, assigning random numbers to 10 health variables. Given this setup, the actual classification accuracy should be 50%, no better than pure chance alone. The authors found that when classification machine learning algorithms are misapplied, they can lead to false results; their model, built on as few as 50 people, produced misleading accuracy values on data generated by pure chance alone. Thus, the authors of this paper were warning the medical field that misapplying classification techniques can lead to overfitting.
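
A rough sketch of the kind of misapplication being warned about is below, assuming scikit-learn, a decision tree, and 50 synthetic "patients" with random health variables; none of this is the authors' actual code. Evaluating on the training data makes pure noise look classifiable, while a properly separated, cross-validated estimate falls back to roughly chance.

```python
# Illustrative sketch (assumed setup, not the authors' implementation):
# random data, small sample, and the apparent-vs-honest accuracy gap.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_people, n_variables = 50, 10            # small study: 50 "patients", 10 random health variables
X = rng.normal(size=(n_people, n_variables))
y = rng.integers(0, 2, size=n_people)     # labels assigned at random: true accuracy should be ~50%

model = DecisionTreeClassifier().fit(X, y)
print("Apparent accuracy on the training data:", model.score(X, y))   # near 1.0: overfitting
print("Cross-validated accuracy:",
      cross_val_score(DecisionTreeClassifier(), X, y, cv=5).mean())   # ~0.5, chance level
```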

The authors then looked at feature selection for classifying Hashimoto’s disease from clinical ultrasound data of 250 people with the disease and 250 healthy people. Ten variables were selected to help classify these images, and a MATLAB machine learning algorithm was trained on 400 people (200 healthy and 200 ill) and then tested on 100 people (50 healthy and 50 ill). They were able to show that using 3-4 variables produced better classification results, meaning those 3-4 variables carried most of the information gain. Results like these can still mislead practitioners, because a small data set can be generalized too broadly and because of a lack of independence between the training and testing data sets. The authors argued that larger data sets are needed to get rid of some of the issues that result in the misapplication of classifiers.
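
That general workflow can be sketched roughly as follows, with synthetic stand-in data rather than the actual ultrasound measurements; the use of scikit-learn, mutual information as the information-gain score, and the split sizes are all assumptions for illustration.

```python
# Hedged sketch: select a few high-information variables, then evaluate on an
# independent test set. Synthetic data stands in for the clinical measurements.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# 500 records with 10 candidate variables, only a few of which are informative.
X, y = make_classification(n_samples=500, n_features=10, n_informative=3, random_state=1)

# Keep training and testing data independent, as the commentary recommends.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Select the 3 variables with the highest estimated information gain, then classify.
model = make_pipeline(SelectKBest(mutual_info_classif, k=3), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
print("Accuracy on the independent test set:", model.score(X_test, y_test))
```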

The authors have the following four recommendations when considering the use of supervised machine learning classification algorithms:

    1. Clearly state the purpose of the study and come from a place of understanding of the problem and its applications.
    2. Minimize the number of variables used in classifiers, for example by using pruning algorithms that select only variables meeting a certain level of information gain. This is more important with smaller data sets than with big data.
    3. Understand that classifiers are sensitive and that results gained from one set of instances might require further adjustments to be implemented elsewhere.
    4. Classification algorithms and data mining are part of the experimental process, not the answer to all problems.


Adv Topics: Graph dataset partitioning in Cloud Computing

When dealing with cloud environments, network management systems are key to processing big data, because some computational components within the network can have bandwidth unevenness (Connolly & Begg, 2014; Sakr, 2014). Chen et al. (2016) stated that the cloud infrastructure is often designed as interconnected computational resources in a tree structure, where resources are first grouped in pods, which are then connected to other pods. This unevenness can be defined as differing network performance between computational components within and across different racks (Lublinsky, Smith, & Yakubovich, 2013). The unevenness grows as the cloud gets upgraded and the computational components become increasingly heterogeneous with time (Chen et al., 2016). Thus, network management systems, which are part of resource pooling in cloud environments, help with data load balancing, routing connections, and diagnosing problems along the chain of “computational node ↔ link ↔ computational node ↔ link ↔ computational node …,” where each node is connected to multiple databases through multiple links (Connolly & Begg, 2014).

Network bandwidth unevenness becomes an issue when trying to process graph and network data because (a) the complexity of the data means it cannot be stored in typical relational database systems, (b) the data is complex to process, and (c) there are scalability issues (Chen et al., 2016). Sakr (2014) stated there are two ways to scale data: vertical scaling, where researchers find the few computational components that can handle huge data loads, or horizontal scaling, where researchers spread the data across multiple computational components. Chen et al. (2016) and Sakr (2014) recommended these two large-scale graph data set partitioning methods for cloud computing systems:

  • Machine Graph: Models the graphical data by treating each computational node as a graphical vertex, such that the interconnectivity between computational nodes represents the connection of one graphical data element to other graphical data elements. A benefit of this design is that the computational nodes representing the graphical data do not need knowledge of the topology of the network, and the model can be built by measuring the network bandwidth between computational nodes.
  • Partition Sketch: The graphical data set is partitioned to form a tree structure, and each computational node addresses a single level of the graphical tree data. From the perspective of the partitioned tree, the root node holds the input graph, non-leaf nodes hold partition iteration data, and leaf nodes hold the partitioned graph data points. The tree is built iteratively, with the goal of reducing the number of cross leaf-node edges between partitions (a toy sketch of this idea follows this list). Design-wise, each partition sketch is locally optimized, addresses monotonicity (cross leaf-node edges between partitions and non-leaf node depth), and addresses the proximity of parent nodes by defining common parent data between leaf nodes and non-leaf nodes. For instance, data that share a low common parent should be stored together on high-bandwidth computational nodes, whereas other leaf-node data can be stored on low-bandwidth computational nodes. A limitation of this design is that the partition size must be carefully selected to satisfy monotonicity, but at the same time not have so many nodes as to run into input and output errors when trying to retrieve the data. Thus, more processing nodes should be used to decrease the number of cross leaf-node edges over low-bandwidth links, but not so many as to cause memory issues.
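
The toy sketch below illustrates only the shape of the partition-sketch idea: the root holds the input graph, each non-leaf records one partitioning iteration, and leaves hold the final partitions, with the cut-edge count as the quantity to minimize. The example graph, the naive split-the-list-in-half heuristic, and the partition size are invented for illustration and are far simpler than the network-performance-aware partitioning Chen et al. (2016) describe.

```python
# Toy partition-sketch illustration (assumed heuristic, not the published method).
graph = {                      # small example graph as a symmetric adjacency list
    "a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"],
    "d": ["c", "e", "f"], "e": ["d", "f"], "f": ["d", "e"],
}

def partition_sketch(nodes, max_size=2):
    """Recursively bisect a list of vertices into a tree of partitions."""
    if len(nodes) <= max_size:
        return {"leaf": nodes}                      # leaf node: a final partition
    mid = len(nodes) // 2
    return {"nodes": nodes,                         # non-leaf node: one partition iteration
            "children": [partition_sketch(nodes[:mid], max_size),
                         partition_sketch(nodes[mid:], max_size)]}

def leaves(tree):
    return [tree["leaf"]] if "leaf" in tree else [p for c in tree["children"] for p in leaves(c)]

tree = partition_sketch(sorted(graph))
parts = leaves(tree)
owner = {v: i for i, part in enumerate(parts) for v in part}
cut = sum(owner[u] != owner[v] for u in graph for v in graph[u]) // 2
print("partitions:", parts)
print("cross-partition (cut) edges:", cut)   # the quantity the method tries to minimize
```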

Resources

  • Chen, R., Weng, X., He, B., Choi, B., & Yang, M. (2016). Network performance aware graph partitioning for large graph processing systems in the cloud. Retrieved from http://www.comp.nus.edu.sg/~hebs/pub/rishannetwork_crc14.pdf
  • Connolly, T., & Begg, C. (2014). Database Systems: A Practical Approach to Design, Implementation, and Management (6th ed.). Pearson Learning Solutions. VitalBook file.
  • Lublinsky, B., Smith, K., & Yakubovich, A. (2013). Professional Hadoop Solutions. Wrox, VitalBook file.
  • Sakr, S. (2014). Large scale and big data: Processing and management. Boca Raton, FL: CRC Press.

Adv Topics: IP size distributions and detecting network traffic anomalies

In 2011, internet advertising generated over $31B in the US (Sakr, 2014). Much of this revenue comes from contextual advertising, in which online advertisers, search engine optimizers, and sponsored search providers try to improve the user experience and revenue by displaying relevant, context-based ads online (Chakrabarti, Agarwal, & Josifovski, 2008). Thus, understanding the click rates of online advertisers, search engine optimizers, and sponsored search providers can drive online revenue for any business’ products or services (Regelson & Fain, 2006). When a consumer clicks on an ad to decide whether to purchase the product or service, a small amount of money is withdrawn from the company’s online advertising budget (Regelson & Fain, 2006; Sakr, 2014).

This business model is subject to cyber-attacks: a competitor can create an automated piece of code to click on the advertising without making a purchase, which in the end depletes the online advertising budget (Sakr, 2014). This automated piece of code usually comes from an IP size distribution, which is a group of IPs set to target one ad while pretending to be actual consumers; this sounds like a DoS (Denial of Service) attack (Park & Lee, 2001; Sakr, 2014). However, a DoS attack uses IP size distributions to block services from a website, and the best way to prevent it is to trace back the source of the IP size distribution and block it (Park & Lee, 2001). Click fraud is slightly different, though; it is not denying the company’s services or products, it is depleting their online advertising budget, which reduces the company’s online market share.

Sakr (2014) says that IP size distributions are defined by two dimensions, (a) application and (b) time, and they change over time due to business cycles, flash crowds, etc. IP size distributions are generated in three ways: (a) by legitimate users, (b) by a publisher’s friends, which could include sponsored providers with some fraudulent clicks, and (c) by a bot-master with botnets (Sakr, 2014; Soldo & Metwally, 2012). The goal is to identify the bot-master with botnets and the fraudulent clicks. Thus, companies need to be able to detect network traffic anomalies based on the IP size distribution:

  • Sakr (2014) and Soldo and Metwally (2012) suggested using anomaly detection algorithms, which rely on the current IP size distribution and analyze the data to search for patterns that are characteristic of these attacks (a simplified sketch of this idea follows this list). These detection methods are robust because they use the characteristics of fraudulent clicks, have low complexity, and can be written as MapReduce jobs for parallel processing. The method can assign a distinct cookie ID for analysis when a click is generated. It uses a regression model and compares IP rates to a Poisson distribution, as well as an explanatory diversity feature that counts the distinct cookies and measures the entropy of that distribution, treating this as the true IP sizes. This information is used to generate explanatory diversity models, which can then be analyzed using quantile regression, linear regression, percentage regression, and principal component analysis. Each of these analyses then has its root mean square error, relative error, and bucket error computed to allow inter-comparison between the results of each model and the true value. This inter-comparison allows for detection of anomalous activities because each method measures different properties within the same data. Once IP addresses have been identified as fraudulent, they are flagged.
  • Regelson and Fain (2006) suggest using historical data, if it is available, to create a reliable prior IP size distribution to compare against current IP size distributions. Though the authors proposed this for studying click-through rates, it could also be used in this scenario. Using historical data works when there is a wealth of historical information, but in cases where there is little to no historical information, a creative aggregation technique could work instead. This technique clusters less frequent and similar items, as well as completely novel items, to develop the historical context needed to build the historical IP size distribution, and it uses a logistic regression analysis. This method could reduce error by 50% when there was no historical data to compare to.
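
The sketch below is a heavily simplified illustration of the diversity idea from the first bullet, not the algorithm as published: it counts clicks and distinct cookies per IP from a made-up click log and flags IPs whose click volume is high but whose cookie entropy is near zero. The log, thresholds, and flagging rule are all assumptions.

```python
# Simplified illustration of cookie-diversity (entropy) based flagging.
# All data and thresholds are invented for demonstration purposes.
import math
from collections import defaultdict

click_log = [                           # (source IP, cookie ID) per observed click
    ("10.0.0.1", "c1"), ("10.0.0.1", "c2"), ("10.0.0.1", "c3"),
    ("10.0.0.2", "c4"), ("10.0.0.2", "c4"), ("10.0.0.2", "c4"),
    ("10.0.0.2", "c4"), ("10.0.0.2", "c4"), ("10.0.0.2", "c4"),
]

def cookie_entropy(cookies):
    """Shannon entropy of the cookie distribution seen behind one IP."""
    counts = defaultdict(int)
    for c in cookies:
        counts[c] += 1
    total = len(cookies)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

per_ip = defaultdict(list)
for ip, cookie in click_log:
    per_ip[ip].append(cookie)

for ip, cookies in per_ip.items():
    clicks, diversity = len(cookies), cookie_entropy(cookies)
    # Assumed rule of thumb: many clicks but near-zero cookie diversity looks automated.
    flagged = clicks >= 5 and diversity < 0.5
    print(ip, "clicks:", clicks, "entropy: %.2f" % diversity, "flagged:", flagged)
```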

Looking further at the first method, its strengths are:

  • there is no need to obtain personally identifiable information
  • no need to authenticate end user clicks
  • fully automated statistical aggregation method that can scale linearly using MapReduce
  • creating a legitimate-looking IP size distribution is difficult

while the limitations of this method are:

  • It requires a large amount of actual click data to create these models
  • Colluding with other companies to provide their click data can help create the large amount of click data needed, but usually that data is proprietary.

That is why the second method, from Regelson and Fain (2006), was mentioned: it addresses the limitations of the Sakr (2014) and Soldo and Metwally (2012) method.

Resources:

  • Chakrabarti, D., Agarwal, D., & Josifovski, V. (2008). Contextual advertising by combining relevance with click feedback. In Proceedings of the 17th international conference on World Wide Web (pp. 417-426). ACM.
  • Park, K., & Lee, H. (2001). On the effectiveness of probabilistic packet marking for IP traceback under denial of service attack. In INFOCOM 2001. Twentieth Annual Joint Conference of the IEEE Computer and Communications Societies. Proceedings. IEEE (Vol. 1, pp. 338-347). IEEE.
  • Regelson, M., & Fain, D. (2006). Predicting click-through rate using keyword clusters. In Proceedings of the Second Workshop on Sponsored Search Auctions (Vol. 9623).
  • Sakr, S. (2014). Large scale and big data: Processing and management. Boca Raton, FL: CRC Press.
  • Soldo, F., & Metwally, A. (2012). Traffic anomaly detection based on the IP size distribution. In INFOCOM, 2012 Proceedings IEEE (pp. 2005-2013). IEEE.

Adv Topics: Stochastic Modeling and three-pool Cloud Architecture

There is a need for an easy and effective performance analysis method suited to large Infrastructure as a Service (IaaS) cloud computing environments. Typically, three approaches can be used to conduct performance analysis on any type of target system:

  • Experiment-based performance analysis
  • Discrete event-simulation-based performance analysis
  • Stochastic model-based performance analysis

Experiment-based performance analysis can be cost and time prohibitive as data and processing scales increase in IaaS computing environments, and discrete event simulations take too much time to compute (Sakr, 2014). A scalable multi-level stochastic model-based performance analysis has been proposed by Ghosh, Longo, Naik, and Trivedi (2012) for IaaS cloud computing. Stochastic analysis and models are a way of predicting how probable an outcome is, using a chaotic form of deterministic modeling that helps in analyzing one or more outcomes that are cloaked in uncertainty (Anadale, 2016; Investopedia, n.d.). Properties of stochastic models consist of (Anadale, 2016):

  1. All aspects of uncertainty should be represented as variables, with a solution space that contains all possible outcomes
  2. Each of these variables has a probability distribution attached to it
  3. The distributions of the variables are run through thousands of simulations to identify the probability of a preferred key outcome

These steps are essential for gaining information on a variety of outcomes with varying variables and for making data-driven decisions, because stochastic modeling runs thousands of simulations to understand the eccentricities of each variable per outcome (Investopedia, n.d.). Ghosh et al. (2012) selected a stochastic model for performance analysis because it is a low-cost analysis tool that covers a large solution space. Sakr (2014) shared this same solution for big data IaaS cloud environments. The solution consists of a three-pool cloud architecture.

Three-pool cloud architecture

In an IaaS cloud scenario, a request for resources initiates a request for one or more virtual machines to be deployed to access a physical machine (Ghosh et al., 2012). The architecture (Figure 1) assumes that physical machines are grouped into hot, warm, and cold pools (Sakr, 2014). This architectural technique allows for cost savings because it minimizes power and cooling costs (Ghosh et al., 2012). A Resource Provisioning Decision Engine (RPDE) handles the incoming queue of requests for computational resources (Ghosh et al., 2012; Sakr, 2014). The RPDE searches for physical machines in the hot pool first to build a virtual machine to access its resources. However, if all the hot pool resources are currently used, it will search the physical machines in the warm pool, then the cold pool, and if none are available the request gets rejected. This three-pool architecture is scalable and practical for performance analysis of an IaaS cloud computing environment.

The hot pool consists of physical machines that are constantly on, and virtual machines are deployed on them upon request (Ghosh et al., 2012; Sakr, 2014). A warm pool has its physical machines in power-saving mode, and a cold pool has physical machines that are turned off. For both warm and cold pools, setting up a virtual machine is delayed compared to a hot pool, since the physical machines need to be powered up or awakened before a virtual machine is deployed (Ghosh et al., 2012). The optimal number of physical machines in each pool is predetermined by the information technology architects.


Figure 1: Request provisioning steps in a three-pool cloud architecture adapted from Ghosh et al. (2012).

Stochastic modeling comes into play when calculating the job rejection probability, which applies when no physical machines are available in any pool, using input variables such as the job arrival rate, the mean searching delay to find a physical machine, the probabilities that a physical machine can accept the job, the maximum number of jobs, and the job sizes (Ghosh et al., 2012; Sakr, 2014). Each of these variables feeds a submodel used to help define the job rejection probability. In other words, each of these variables has its own statistical distribution, so that when thousands of simulations are run, a given job’s rejection probability can be estimated.
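
A hedged Monte Carlo sketch of that idea follows; the pool sizes, acceptance probabilities, arrival rate, and the simple Poisson arrival model are invented for illustration and are not the interacting submodels of Ghosh et al. (2012), but the sketch shows how a rejection probability falls out of repeated simulation.

```python
# Monte Carlo sketch: estimate job rejection probability under the RPDE
# hot -> warm -> cold search order. All parameters are illustrative assumptions.
import math
import random

POOL_CAPACITY = {"hot": 3, "warm": 2, "cold": 2}      # physical machines per pool (assumed)
P_ACCEPT = {"hot": 0.95, "warm": 0.80, "cold": 0.70}  # chance a free machine can take the job (assumed)
ARRIVAL_RATE = 6.0                                    # mean jobs arriving per provisioning cycle (assumed)

def poisson(rng, lam):
    """Small Poisson sampler (Knuth's method) so the sketch has no dependencies."""
    threshold, k, p = math.exp(-lam), 0, 1.0
    while p > threshold:
        k += 1
        p *= rng.random()
    return k - 1

def simulate_cycle(rng):
    """Return (jobs, rejected) for one provisioning cycle."""
    free = dict(POOL_CAPACITY)
    jobs = poisson(rng, ARRIVAL_RATE)
    rejected = 0
    for _ in range(jobs):
        for pool in ("hot", "warm", "cold"):          # RPDE searches hot, then warm, then cold
            if free[pool] > 0 and rng.random() < P_ACCEPT[pool]:
                free[pool] -= 1
                break
        else:
            rejected += 1                             # no pool could provision the job
    return jobs, rejected

rng = random.Random(7)
results = [simulate_cycle(rng) for _ in range(100_000)]
total_jobs = sum(j for j, _ in results)
total_rejected = sum(r for _, r in results)
print("Estimated job rejection probability: %.3f" % (total_rejected / total_jobs))
```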

Limitations

Jobs can be rejected for two reasons: (a) the buffer is full or (b) there are insufficient physical machine resources (Ghosh et al., 2012; Sakr, 2014). The use of stochastic modeling allows the job rejection rate to be predicted, so that if the rejection probability is high, the request could be resubmitted at a more optimal time.

Also, both Ghosh et al. (2012) and Sakr (2014) stated that there is a limitation in the assumption that all the physical machines in IaaS cloud platforms are homogeneous and that all virtual machines are homogeneous. Adding more submodels and variables to the main stochastic model should alleviate this situation; however, it builds a bigger model that would be slower to run than the original.

Potential applications that can be developed

Ghosh et al. (2012) suggested that this model can be used to calculate the job rejection probability and the mean response delay of this architecture by varying parameters other than those previously discussed, such as job arrival rates, mean job service times, the number of physical machines per pool, and the number of virtual machines that can be supported on a particular physical machine.

The three-pool architecture could also be used for data recovery efforts. Data and computational algorithms that are mission critical to the success of an organization can be prioritized to be stored in a hot pool, whereas data that are not time sensitive can be stored in warm and cold pools. Storing key data and algorithms in a hot pool means that if the main architecture goes down, the backups stored in the cloud can be recovered relatively quickly. These solutions could be implemented in a private, hybrid, or public cloud. This solution is a cloud version of the different types of sites used for disaster recovery (Segue Technologies, 2013).

Conclusions

Stochastic modeling is a great way to use a chaotic version of deterministic modeling to predict the future when the variables can be described by probability distributions. When demand for computational resources exceeds what the pools of physical machines in an IaaS cloud architecture can provision, requests are rejected, resulting in a loss of service. Stochastic modeling can be used here to describe the probability that a given queued request will be rejected, which allows people to make a data-driven decision about when it would be an optimal time to request resources from the IaaS cloud architecture.


Adv Topics: Big Data Visualization

Volume visualization is used to understand large amounts of data, in other words big data, where the data can be processed on a server or in the cloud and rendered onto a hand-held device to allow the end user to interact with it (Johnson, 2011). Tanahashi, Chen, Marchesin, and Ma (2010) define a framework for creating an entirely web-based visualization interface (or web application) that leverages the cloud computing environment. The benefit of using this type of interface is that there is no need to download or install software; it can be accessed through any mobile device with a connection to the internet. Scientists can then use visualization and multimedia functionality as a tool to enhance their thinking and understanding of current problems, from understanding the 3-dimensional structure of DNA to the 3-dimensional structure of a hurricane (Minelli, Chambers, & Dhiraj, 2013; Johnson, 2011).


Figure 1: Hurricane Joaquin’s 3-dimensional rendering of its rain structure from the NASA-enhanced infrared satellite image and GPM data. Adapted from NASA (September 29, 2015). The image shows snow particles in the storm’s anvil and also that significant amounts of heat are being released in the storm’s core, which drives the circulation of the storm and provides the energy required for further intensification.

McNurlin, Sprague, and Bui (2008) stated that the ideal web-based visualization tool would have simplified operations, reusable templates, rapid deployment, multilingual support, and control over creation, update, access, customization, and destruction. Tanahashi et al. (2010) proposed a web-based visualization framework consisting of:

(a)   preprocessing phase = data is collected, indexed, and stored

(b)   interface = the end user connects to the data in the databases and makes requests for processing and modifying the data

(c)   processing phase = a set of images, videos, or 3-dimensional renderings is returned per request

(d)   modification phase = the end user can request further modifications

(e)   reprocessing phase = a set of images, videos, or 3-dimensional renderings is returned per request, iterating between parts (d) and (e) until a final product is rendered to the end user

It was designed for all people to use and goes by the philosophy that “knowledge should be openly accessible to the broader community” (Tanahashi et al., 2010; Sakr, 2014). Performance bottlenecks of the Tanahashi et al. (2010) framework include difficulty dealing with different data formats and different rendering algorithms, transferring cloud-based data rendering onto the web interface, and organizing big data for efficient retrieval. Since the goal of any visualization is to provide the right user with the right information in their preferred or a suggested rendering, these bottlenecks must be addressed (McNurlin et al., 2008). Thus, the rendering algorithms can be indexed by their properties of aesthetics or analytical significance, which an end user can search for with a search bar (Tanahashi et al., 2010).

Subsequently, it is proposed that indexing metadata can resolve the issues of data organization (Tanahashi et al., 2010). Data transfer issues could be mitigated by minimizing the amount of data in motion via a MapReduce paradigm (Tanahashi et al., 2010; Sakr, 2014). In the MapReduce paradigm, the mappers process and render the data, and the reducers create the final composition of the data (Tanahashi et al., 2010).
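
A minimal sketch of that division of labor is below, with plain Python lists standing in for image tiles; the "rendering" and "compositing" functions are placeholders of my own, not the Hadoop or Tanahashi et al. (2010) implementations.

```python
# Toy map/reduce rendering sketch: mappers "render" chunks into tiles,
# a reducer composites the tiles. The tiles here are just lists for illustration.
from functools import reduce

data_chunks = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]      # pretend partitions of a big data set

def render_tile(chunk):
    """Mapper: turn a chunk of raw data into a rendered 'tile' (here, scaled values)."""
    return [value * 10 for value in chunk]

def composite(image, tile):
    """Reducer: merge one tile into the running composition (here, concatenation)."""
    return image + tile

tiles = map(render_tile, data_chunks)                 # map phase: rendering, conceptually in parallel
final_image = reduce(composite, tiles, [])            # reduce phase: compose the final product
print(final_image)
```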

Resources

  • Johnson, C. (2011) Visualizing large data sets. TEDx Salt Lake City. Retrieved from https://www.youtube.com/watch?v=5UxC9Le1eOY
  • McNurlin, B., Sprague, R., & Bui, T. (2008) Information Systems Management, (8th ed.). Pearson Learning Solution. VitalBook file.
  • Minelli, M., Chambers, M., & Dhiraj, M. (2013) Big Data, Big Analytics: Emerging Business Intelligence and Analytic Trends for Today’s Businesses. John Wiley & Sons P&T. VitalBook file.
  • NASA (2015). October 02, 2015 – Update #1 – A 3-D Look at Hurricane Joaquin from NASA’s GPM Satellite.
  • Sakr, S. (2014). Large Scale and Big Data, (1st ed.). Vitalbook file.
  • Tanahashi, Y., Chen, C., Marchesin, S., & Ma, K. (2010). An interface design for future cloud-based visualization services. Proceedings of 2010 IEEE Second International Conference on Cloud Computing Technology and Service, 609–613. doi: 10.1109/CloudCom.2010.46

Adv Topics: CAP Theory and NoSQL Databases

Brewer (2000) and Gilbert and Lynch (2012) concluded that a distributed shared-data system can have at most two of the three properties: consistency, availability, and partition-tolerance (CAP theory). Gilbert and Lynch (2012) describe these three as akin to the safety of the data, the liveness of the data, and the reliability of the data. Thus, systems that are giving up

  • consistency creates a system that needs expirations, conflict resolution, and optimistic locking (Brewer, 2000). A lack of consistency means that there is a chance that the data or processes may not return the right response to a request (Gilbert & Lynch, 2012).
  • availability creates a system that needs pessimistic locking and making some partitions unavailable (Brewer, 2000). A lack of availability means that there is a chance that a request may not get a response (Gilbert & Lynch, 2012).
  • partition-tolerance creates a system that needs a 2-phase commit and cache validation profiles (Brewer, 2000). A lack of partition-tolerance means that there is a chance that messages between servers, tasks, or threads can be lost forever and never committed (Gilbert & Lynch, 2012).

Therefore, in NoSQL distributed database systems (DDBS), partition-tolerance is assumed to exist, so administrators must then select between consistency and availability (Gilbert & Lynch, 2012; Sakr, 2014). If administrators focus on availability, they can only try to achieve weak consistency; if they focus on consistency, they are planning on having a strong-consistency system. A focus on availability means having access to the data even during downtimes (Sakr, 2014). However, providing high levels of availability can cost money. Per the web application Uptime.is:

Availability level   Monthly downtime   Yearly downtime
99.9%                43m 49.7s          8h 45m 57.0s
99.99%               4m 23.0s           52m 35.7s
99.999%              26.3s              5m 15.6s
99.9999%             2.6s               31.6s

Achieving high levels of availability means having a set of fail-safe systems built in for fault tolerance.
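
The downtime figures above follow directly from the availability percentage; a small worked example is below, assuming an average month of about 30.44 days and a year of 365.25 days (a convention choice, since month and year lengths vary slightly between calculators).

```python
# Worked example: allowed downtime for a given availability level.
# Month/year lengths are assumed conventions (30.44 days, 365.25 days).
def downtime(availability_pct, period_hours):
    """Return allowed downtime in seconds for a given availability percentage."""
    return (1 - availability_pct / 100.0) * period_hours * 3600

for level in (99.9, 99.99, 99.999, 99.9999):
    monthly = downtime(level, 30.44 * 24)     # seconds of downtime allowed per month
    yearly = downtime(level, 365.25 * 24)     # seconds of downtime allowed per year
    print("%.4f%% -> %.1f s/month, %.1f s/year" % (level, monthly, yearly))
```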

As noted above, there is both strong and weak consistency. Strong consistency ensures that all copies of the data are updated in real time, whereas weak consistency means that all copies of the data will eventually be updated (Connolly & Begg, 2014; Sakr, 2014). Thus, there is a resource cost to stronger consistency over weaker consistency due to how fast the data needs to be updated (Gilbert & Lynch, 2012). Consequently, this is where the savings in overhead come from in a NoSQL DDBS.

Finally, the table below lists some of the NoSQL databases that are either AP or CP systems (Hurst, 2010).

Availability & Partition Tolerance (AP) NoSQL systems: Dynamo, Voldemort, Tokyo Cabinet, KAI, Riak, CouchDB, SimpleDB, Cassandra

Consistency & Partition Tolerance (CP) NoSQL systems: Big Table, MongoDB, Terrastore, Hypertable, HBase, Scalaris, Berkeley DB, MemcacheDB, Redis

Resources

  • Brewer, E. (2000). Towards robust distributed systems. Proceedings of 19th Annual ACM Symposium Principles of Distributed Computing (PODC00). 7–10.
  • Connolly, T., & Begg, C. (2014). Database Systems: A Practical Approach to Design, Implementation, and Management, (6th ed.). Pearson Learning Solutions. VitalBook file.
  • Gilbert, S., and Lynch N. A. (2012). Perspectives on the CAP Theorem. Computer 45(2), 30–36. doi: 10.1109/MC.2011.389

 

Adv Topics: Distributed Programming

Distributed programming can be divided into the following two models:

  • Shared memory distributed programming: This is where serialized programs run on multiple threads, where all the threads have access to the underlying data stored in shared memory (Sakr, 2014). Each thread should be synchronized to ensure that read and write functions are not being performed on the same segment of the shared data at the same time. Sandén (2011) and Sakr (2014) stated that this can be achieved via semaphores (a thread signals that data has been written/posted and other threads wait to use the data until a condition is met), locks (data can be locked or unlocked for reading and writing), and barriers (threads cannot move on to the next step until everything preceding it is completed); a minimal sketch of this model follows this list. A famous example of this style of parallel programming is the use of MapReduce on data stored in the Hadoop Distributed File System (HDFS) (Lublinsky, Smith, & Yakubovich, 2013; Sakr, 2014). The HDFS is where the data is stored, and the mapper and reducer functions access the data stored in the HDFS.
  • Message passing distributed programming: This is where data is stored in one location, and a master thread spreads chunks of the data onto sub-tasks and threads to process the overall data in parallel (Sakr, 2014). There are explicit, direct send and receive messages with synchronized communications (Lublinsky et al., 2013; Sakr, 2014). At the end of the runs, the data is merged together by the master thread (Sakr, 2014). A famous example of this style of parallel programming is the Message Passing Interface (MPI); many weather models, like the Weather Research and Forecasting (WRF) model, use this form of distributed programming (Sakr, 2014; WRF, n.d.). The initial weather conditions are stored in one location, chunked into small pieces, and spread across the threads, which are eventually joined at the end to produce one cohesive forecast.

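Below is a minimal Python sketch of the shared memory model described in the first bullet: several threads update one shared structure, a lock serializes the writes, and a barrier keeps any thread from reading the combined result before every thread has finished. The workload itself is invented for illustration.

```python
# Shared memory sketch: threads, a Lock for safe writes, a Barrier for synchronization.
import threading

N_THREADS = 4
shared_totals = {"sum": 0}                 # data visible to every thread (shared memory)
lock = threading.Lock()                    # protects the read-modify-write on the shared data
barrier = threading.Barrier(N_THREADS)     # no thread proceeds until all reach this point

def worker(chunk):
    partial = sum(chunk)                   # each thread processes its own chunk first
    with lock:                             # serialize the update to the shared total
        shared_totals["sum"] += partial
    barrier.wait()                         # synchronize: wait for every partial result
    # past the barrier, every thread can safely read the completed total
    print(threading.current_thread().name, "sees total:", shared_totals["sum"])

chunks = [list(range(i, i + 5)) for i in range(0, 20, 5)]
threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
for t in threads:
    t.start()
for t in threads:
    t.join()
```
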
However, there are six challenges to the distributed programming model: Heterogeneity, Scalability, Communications, Synchronization, Fault-tolerance, and Scheduling (Sakr, 2014). These six challenges are interrelated, so an increase in complexity in one of them can increase the level of complexity of one or more of the others. Therefore, both shared memory and message passing distributed programming can be insufficient on their own when processing large-scale data in a cloud computing environment. This post will focus on two of these six:

  • Scalability issues exist when the number of users, the amount of data, and the requests for resources increase and the distributed processing system must still remain effective (Sakr, 2014). Using Hadoop and HDFS in the cloud mitigates scalability issues by providing a free, open-source way of managing such an explosion of data and demand on resources. However, storage costs on the cloud will also increase, even though they are usually about 10% of the cost of normal information technology infrastructure (Minelli, Chambers, & Dhiraj, 2013). As the scale of resources increases, it can also increase the number of resources needed to deal with communication and synchronization (Sakr, 2014).
  • Synchronization is a critical challenge that must be addressed because multiple threads should be able to share data without corrupting the data or causing inconsistencies (Sandén, 2011; Sakr, 2014). Lublinsky et al. (2013) stated that MapReduce requires proper synchronization between the mapper and reducer functions to work, and improper synchronization can lead to issues in fault tolerance. Thus, efficient synchronization between read and write operations is vital and is within the control of the programmers (Sakr, 2014). The challenge comes when scalability issues are introduced and synchronization methods must be applied without degrading performance, causing deadlocks (where two tasks want access to the same data), creating load balancing issues, or wasting computational resources (Lublinsky et al., 2013; Sandén, 2011; Sakr, 2014).

Resources

  • Lublinsky, B., Smith, K. T., & Yakubovich, A. (2013). Professional Hadoop Solutions. Vitalbook file.
  • Minelli, M., Chambers, M., & Dhiraj, M. (2013) Big Data, Big Analytics: Emerging Business Intelligence and Analytic Trends for Today’s Businesses. John Wiley & Sons P&T. VitalBook file.
  • Sandén, B. I. (2011) Design of Multithreaded Software: The Entity-Life Modeling Approach. Wiley-Blackwell. VitalBook file.
  • Sakr, S. (2014). Large Scale and Big Data, (1st ed.). Vitalbook file.