Compelling Topics in Advance Topics

  • The evolution of data to wisdom is defined by the DIKW pyramid, where Data is just facts without any context, but when facts are used to understand relationships it generates Information (Almeyer-Stubbe & Coleman, 2014). That information can be used to understand patterns, it can then help build Knowledge, and when that knowledge is used to understand principles, it builds Wisdom (Almeyer-Stubbe & Coleman, 2014; Bellinger, Castro, Mills, n.d.). Building an understanding to jump from one level of the DIKW pyramid, is an appreciation of learning “why” (Bellinger et al., n.d.).
  • The internet has evolved into a socio-technical system. This evolution has come about in five distinct stages:
    • Web 1.0: Created by Tim Berners-Lee in the 1980s, where it was originally defined as a way of connecting static read-only information hosted across multiple computational components primarily for companies (Patel, 2013).
    • Web 2.0: Changed the state of the internet from a read-only state to a read/write state and had grown communities that hold a common interest (Patel, 2013). This version of the web led to more social interaction, giving people and content importance on the web, due to the introduction of social media tools through the introduction of web applications (Li, 2010; Patel, 2013; Sakr, 2014). Web applications can include event-driven and object-oriented programming that are designed to handle concurrent activities for multiple users and had a graphical user interface (Connolly & Begg, 2014; Sandén, 2011).
  • Web 3.0: This is the state the web at 2017. Involves the semantic web that is driven by data integration through the uses of metadata (Patel, 2013). This version of the web supports a worldwide database with static HTML documents, dynamically rendered data, next standard HTML (HTML5), and links between documents with hopes of creating an interconnected and interrelated openly accessible world data such that tagged micro-content can be easily discoverable through search engines (Connolly & Begg, 2014; Patel, 2013). This new version of HTML, HTML5 can handle multimedia and graphical content, which are great for semantic content (Connolly & Begg, 2014). Also, end-users are beginning to build dynamic web applications for others to interact with (Patel, 2013).
    • Web 4.0: It is considered the symbiotic web, where data interactions occur between humans and smart devices, the internet of things (Atzori, 2010; Patel, 2013). These smart devices can be wired to the internet or connected via wireless sensors through enhanced communication protocols (Atzori, 2010). Thus, these smart devices would have read and write concurrently with humans, where the largest potential of web 4.0 has these smart devices analyze data online and begin to migrate the online world into the real world (Patel, 2013).
    • Web 5.0: Previous iterations of the web do not perceive people’s emotion, but one day it could be able to understand a person’s emotional (Patel, 2013). Kelly (2007) predicted that in 5,000 days the internet would become one machine and all other devices would be a window into this machine. In 2007, Kelly stated that this one machine “the internet” has the processing capability of one human brain, but in 5,000 days it will have the processing capability of all the humanity.
  • MapReduce is a framework that uses parallel sequential algorithms that capitalize on cloud architecture, which became popular under the open source Hadoop project, as its main executable analytic engine (Lublinsky et al., 2014; Sadalage & Fowler, 2012; Sakr, 2014). Another feature of MapReduce is that a reduced output can become another’s map function (Sadalage & Fowler, 2012).
  • A sequential algorithm is a computer program that runs on a sequence of commands, and a parallel algorithm runs a set of sequential commands over separate computational cores (Brookshear & Brylow, 2014; Sakr, 2014).
  • A parallel sequential algorithm runs a full sequential program over multiple but separate cores (Sakr, 2014).
    • Shared memory distributed programming: Is where serialized programs run on multiple threads, where all the threads have access to the underlying data that is stored in shared memory (Sakr, 2014). Each thread should be synchronized as to ensure that read and write functions aren’t being done on the same segment of the shared data at the same time. Sandén (2011) and Sakr, (2014) stated that this could be achieved via semaphores (signals other threads that data is being written/posted and other threads should wait to use the data until a condition is met), locks (data can be locked or unlocked from reading and writing), and barriers (threads cannot run on this next step until everything preceding it is completed).
    • Message passing distributed programming: Is where data is stored in one location, and a master thread helps spread chunks of the data onto sub-tasks and threads to process the overall data in parallel (Sakr, 2014). There are explicitly direct send and receive messages that have synchronized communications (Lublinsky et al., 2013; Sakr, 2014).         At the end of the runs, data is the merged together by the master thread (Sakr, 2014).
  • A scalable multi-level stochastic model-based performance analysis has been proposed by Ghosh, Longo, Naik and Trivedi (2012) for Infrastructure as a Service (IaaS) cloud computing. Stochastic analysis and models are a form of predicting how probable an outcome will occur using a form of chaotic deterministic models that help in dealing with analyzing one or more outcomes that are cloaked in uncertainty (Anadale, 2016; Investopedia, n.d.).
  • Three-pool cloud architecture: In IaaS cloud scenario, a request for resources initiates a request for one or more virtual machines to be deployed to access a physical machine (Ghosh et al., 2012). The architecture assumes that physical machines are grouped in pools hot, warm, and cool (Sakr, 2014). The hot pool consists of physical machines that are constantly on, and virtual machines are deployed upon request (Ghosh et al., 2012; Sakr, 2014).   Whereas a warm pool has the physical machines in power saving mode, and cold pools have physical machines that are turned off. For both warm and cold pools setting up a virtual machine is delayed compared to a hot pool, since the physical machines need to be powered up or awaken before a virtual machine is deployed (Ghosh et al., 2012). The optimal number of physical machines in each pool is predetermined by the information technology architects.
  • Data-at-rest is probably considered easier to analyze; however, this type of data can also be problematic. If the data-at-rest is large in size and even if the data does not change or evolve, its large size requires iterative processes to analyze the data.
  • Data-in-motion and streaming data has to be iteratively processed until there is a certain termination condition is reached and it can be reached between iterations (Sakr, 2014). However, Sakr (2014) stated that MapReduce does not support iterative data processing and analysis directly.
    • To deal with datasets that require iterative processes to analyze the data, computer coders need to create and arrange multiple MapReduce functions in a loop (Sakr, 2014). This workaround would increase the processing time of the serialized program because data would have to be reloaded and reprocessed, because there is no read or write of intermediate data, which was there for preserving the input data (Lusblinksy et al., 2014; Sakr, 2014).
  • Data usually gets update on a regular basis. Connolly and Begg (2014) defined that data can be updated incrementally, only small sections of the data, or can be updated completely. This data update can provide its own unique challenges when it comes to data processing.
    • For processing incremental changes on big data, one must split the main computation to its sub-computation, logging in data updates in a memoization server, while checking the inputs of the input data to each sub-computation (Bhatotia et al., 2011; Sakr, 2014). These sub-computations are usually mappers and reducers (Sakr, 2014). Incremental mappers check against the memoization servers, and if the data has already been processed and unchanged it will not reprocess the data, and a similar process for incremental reducers that check for changed mapper outputs (Bhatotia et al., 2011).
  • Brewer (2000) and Gilbert and Lynch (2012) concluded that for a distributed shared-data system you could only have at most two of the three properties: consistency, availability, partition-tolerance (CAP theory). Gilbert and Lynch (2012) describes these three as akin to the safety of the data, live data, and reliability of the data.
    • In a NoSQL distributed database systems (DDBS), it means that partition-tolerance should exist, and therefore administrators should then select between consistency and availability (Gilbert & Lynch, 2012; Sakr, 2014). However, if the administrators focus on availability they can try to achieve weak consistency, or if the administrators focus on consistency, they are planning on having a strong consistency system. Strong consistency ensures that all copies of the data are updated in real-time, whereas weak consistency means that eventually all the copies of the data will be updated (Connolly and Begg, 2014; Sakr, 2014). An availability focus is having access to the data even during downtimes (Sakr, 2014).
  • Volume visualization is used to understand large amounts of data, in other words, big data, where it can be processed on a server or in the cloud and rendered onto a hand-held device to allow for the end user to interact with the data (Johnson, 2011). Tanahashi, Chen, Marchesin and Ma (2010), define a framework for creating an entirely web-based visualization interface (or web application), which leverages the cloud computing environment. The benefit of using this type of interface is that there is no need to download or install the software, but that the software can be accessed through any mobile device with a connection to the internet.

Resources:

  • Brookshear, G., Brylow, D. (2014). Computer Science: An Overview, (12th ed.). Vitalbook file.
  • Connolly, T., & Begg, C. (2014). Database Systems: A Practical Approach to Design, Implementation, and Management, (6th ed.). Pearson Learning Solutions. VitalBook file.
  • Ghosh, R., Longo, F., Naik, V. K., & Trivedi, K. S. (2013). Modeling and performance analysis of large scale IaaS clouds. Future Generation Computer Systems, 29 (5), 1216-1234.
  • Gilbert, S., and Lynch N. A. (2012). Perspectives on the CAP Theorem. Computer 45(2), 30–36. doi: 10.1109/MC.2011.389
  • Investopedia (n.d.). Stochastic modeling. Retrieved from http://www.investopedia.com/terms/s/stochastic-modeling.asp
  • Johnson, C. (2011) Visualizing large data sets. TEDx Salt Lake City. Retrieved from https://www.youtube.com/watch?v=5UxC9Le1eOY
  • Kelly, K. (2007). The next 5,000 days of the web. TED Talk. Retrieved from https://www.ted.com/talks/kevin_kelly_on_the_next_5_000_days_of_the_web
  • Li, C. (2010). Open Leadership: How Social Technology Can Transform the Way You Lead, (1st ed.). VitalBook file.
  • Lublinsky, B., Smith, K. T., & Yakubovich, A. (2013). Professional Hadoop Solutions. Vitalbook file.
  • Patel, K. (2013). Incremental journey for World Wide Web: Introduced with Web 1.0 to recent Web 5.0 – A survey paper. International Journal of Advanced Research in Computer Science and Software Engineering, 3(10), 410–417.
  • Sadalage, P. J., Fowler, M. (2012). NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence, (1st ed.). Vitalbook file.
  • Sakr, S. (2014). Large Scale and Big Data, (1st ed.). Vitalbook file.
  • Sandén, B. I. (2011). Design of Multithreaded Software: The Entity-Life Modeling Approach. Wiley-Blackwell. VitalBook file.
  • Tanahashi, Y., Chen, C., Marchesin, S., & Ma, K. (2010). An interface design for future cloud-based visualization services. Proceedings of 2010 IEEE Second International Conference on Cloud Computing Technology and Service, 609–613. doi: 10.1109/CloudCom.2010.46

 

Advertisements

Adv Topics: Possible future study

Application of Artificial Intelligence for real-time cybersecurity threat identification and resolution for network vulnerabilities in the cloud

Motivation: Artificial Intelligence (AI) is an embedded technology, based off of the current infrastructure (i.e. supercomputers), big data, and machine learning algorithms (Cyranoski, 2015; Power, 2015). AI can make use of data hidden in “dark wells” and silos, where the end-user had no idea that the data even existed, to begin with (Power, 2015). The goal of AI is to use huge amounts of data to draw out a set of rules through machine learning that will effectively supplement cyber security experts in identifying and remediating cyberattacks (Cringely, 2013; Power, 2015).

Problem statement: Must consider an attacker’s choices are unknown, if they will be successful in their targets and goals and the physical paths for an attack in the explicit and abstract form, which are hard to do without the use of big data analysis coupled with AI for remediation.

Hypothesis statement:

  • Null: The use of Bayesian Networks and AI cannot be used for both identification and remediation of cyber-attacks that deal with the network infrastructure on a cloud environment.
  • Alternative: The use of Bayesian Networks and AI can be used for both identification and remediation of cyber-attacks that deal with the network infrastructure on a cloud environment.

Proposed solution:

  • New contribution made to the body of knowledge by your proposed solution: The merging of these two technologies can be a first line of defense that can work 24×7 and learn new remediation and identification techniques as time moves forward.

2 research questions:

  • Can the merger of Bayesian Networks and AI be used for both identification and remediation of cyber-attacks that deal with the network infrastructure on a cloud environment? –> This is just based off of the hypothesis.
  • Can the use of Bayesian Networks and AI can be used for both identification and remediation of cyber-attacks that deal with multiple network attacks from various white hat hackers at the same time? –> This is taken from real life. A fortune 500 company is constantly bombarded with thousand if not millions of attempted cyber attackers at a given day. If there is a vulnerability found, it might result in multiple people entering in through that vulnerability and doing serious damage. Could this proposed system handle multiple attacks coming right at the cloud network infrastructure? Essentially providing practitioners some tangible results.

Resources:

Adv Topics: Big data addressing Security Issues

Cybersecurity attacks are limited by their physical path, the network connectivity and reachability limits, and the attack structure, which is by exploiting a vulnerability that enables an attack (Xie et al., 2010). Previously, automated systems and tools were implemented to deal with moderately skilled cyber-attackers, plus white hat hackers are used to identify security vulnerabilities, but it is not enough to keep up with today’s threats (Peterson, 2012). Preventative measure only deals with newly discoverable items, not the ones that have yet to be discoverable (Fink, Sharifi, Carbonell, 2011). These two methods are preventative measures, with the goal of protecting the big data and cyberinfrastructure used to store and process big data from malicious intent. Setting up these preventative measures are no longer good enough to protect big data and its infrastructure. Thus there has been a migration towards using real-time analysis on monitored data (Glick, 2013). Real-time analysis is concerned with “What is really happening?” (Xie et al., 2010).

If algorithms used to process big data can be pointed towards cyber security, Security Information and Event Management (SIEM), it can add another solution towards identifying cyber security threat (Peterson, 2012). All that big data cyber security analysis will do is make security teams faster to react if they have the right context to the analysis, but it won’t make the security teams act in a more proactive way (Glick, 2013). SIEM has gone above and beyond current cyber security prevention measures, usually by collecting the log data in real time that is generated and processing the log data in real time using algorithms like correlation, pattern recognition, behavioral analysis, and anomaly analysis (Glick, 2013; Peterson, 2012). Glick (2013), reported that data from a variety of sources help build a cyber security risk and threat profile in real-time that can be taken to cyber security teams to react to each threat in real time, but it works on small data sets.

SIEM couldn’t handle the vast amounts of big data and therefore analyzing the next cyber threats came from using tools like Splunk to identify anomalies amongst the data (Glick, 2013). SIEM was proposed for use in the Olympics games, but Splunk was being used for investment banking purposes (Glick, 2013; Peterson, 2012). FireEye is another big data analytics security tool that was used for identifying network threats (Glick, 2013).

  • Xie et al. (2010), proposed the use of Bayesian networks for cyber security analysis. This solution considers that modeling cyber security profiles are difficult to construct and uncertain, plus they built the tool for near real-time systems. That is because Bayesian models try to model cause-and-effect relationships. Using deterministic security models are unrealistic and do not capture the full breadth of a cyber attack and cannot capture all the scenarios for real-time analysis. If the Bayesian models are built to reflect reality, then it could be used for near real-time analysis. In real-time cyber security analysis, analysts must consider an attacker’s choices are unknown or if they will be successful in their targets and goals. Building a modular graphical attack model can help calculate uncertainties, which can be done by decomposing the problem into finite small parts, where realistic data can be used to pre-populate all the parameters. These modular graphical attack models should consider the physical paths in the explicit and abstract form. Thus, the near real-time Bayesian network considers the three important uncertainties introduced in a real-time attack (italicized). Using this method is robust as determined by a holistic sensitivity analysis.
  • Fink et al. (2011), proposed a mashup of crowdsourcing, machine learning, and natural language processing to dealing both vulnerabilities and careless end user actions, for automated threat detection. In their study, they focused on scam websites and cross-site request forgeries. For scam website identification, the concept of using crowdsourced end users to flag certain websites as a scam is key to this process. The goal is that when a new end user approaches the scam website, a popup appears stating “This website is a scam! Do not provide personal information.” The authors’ solution ties data from heterogeneously common web scam blacklist databases. This solution has high precisions (98%), and high recall (98.1%) on their test of 837 manually labeled sites that was cross-validated using a ten-fold cross -validation analysis between the blacklisted database. The current system’s limitation does not address new threats and different sets of threats.

These studies and articles illustrate that the benefit of using big data analytics for cybersecurity analysis provides the following benefits (Fink et al., 2011; Glick, 2013; IBM Software, 2013; Peterson, 2012; Xie et al., 2010):

(a) moving away from preventative cybersecurity and moving towards real-time analysis to become reactive faster to a current threat;

(b) creating security models that more accurately reflect the reality and uncertainty that exists between the physical paths, successful attacks, and unpredictability of humans for near real-time analysis;

(c) provide a robust identification technique; and

(d) reduction of identifying false positives, which eat up the security team’s time.

Thus, helping security teams to solve difficult issues in real-time. However, this is a new and evolving field that is applying big data analytics. Thus it is expected that many tools will be developed, and the most successful tool would be able to provide real-time cybersecurity data analysis with the huge set of algorithms each aimed at studying different types of attacks. It is even possible for one day to see artificial intelligence to become the next new phase of providing real-time cyber security analysis and resolutions.

Resources:

Adv Topics: Security Issues with Cloud Technology

Big data requires huge amounts of resources to analyze it for data driven decisions, thus there has been a gravitation towards cloud computing to work in this era of big data (Sakr, 2014). Cloud technology is different than personal systems that place different demands on cyber security, where personal systems could have single authority systems and cloud computing systems, have no individual owners, have multiple users, groups rights, and shared responsibility (Brookshear & Brylow, 2014; Prakash & Darbari, 2012). Cloud security can be just as good or better than personal systems because cloud providers could have the economies of scales that can support a budget to have an information security team that many organizations may not be able to afford (Connolly & Begg, 2014). Cloud security can be designed to be independently modular, which is great for heterogenous distributed systems (Prakash & Darbari, 2012).

For cloud computing eavesdropping, masquerading, message tampering, replaying the message, and denial of services are security issues that should be addressed (Prakash & Darbari, 2012). Sakr (2014) stated that exploitation of co-tenancy, a secure architecture for the cloud, accountability for outsourced data, confidentiality of data and computation, privacy, verifying outsourced computation, verifying capability, cloud forensics, misuse detection, and resource accounting and economic attacks are big issues for cloud security. This post will discuss the exploitation of co-tendency and confidentiality of data and computation.

Exploitation of Co-Tenancy: An issue with cloud security is within one of its properties, that it is a shared environment (Prakash & Darbari, 2012; Sakr, 2014). Given that it is a shared environment, people with malicious intent could pretend to be someone they are not to gain access, in other words masquerading (Prakash & Darbari, 2012). Once inside, these people with malicious intent tend to gather information about the cloud system and the data contained within it (Sakr, 2014). Another way these services could be used by malicious people is to use the computational resources of the cloud to carry out denial of service attacks on other people.   Prakash and Darbari (2012) stated that two-factor authentications were used on personal devices and for shared distributed systems, there has been proposed a use of a three-factor authentication. The first two factors are the use passwords and smart cards. The last one could be either biometrics or digital certificates. Digital certificates can be used automatically to reduce end-user fatigue on using multiple authentications (Connolly & Begg, 2014). The third level of authentication helps to create a trusted system. Subsequently, a three-factor authentication could primarily mitigate masquerading. Sakr (2014), proposed using a tool that hides the IP addresses the infrastructure components that make up the cloud, to prevent the cloud for being used if the entry is granted to a malicious person.

Confidentiality of data and computation: If data in the cloud is accessed malicious people can gain information, and change the content of that information. Data stored on the distributed systems are sensitive to the owners of the data, like health care data which is heavily regulated for privacy (Sakr, 2014). Prakash and Darbari (2012) suggested the use of public key cryptography, software agents, XML binding technology, public key infrastructure, and role-based access control are used to deal with eavesdropping and message tampering. This essentially hides the data in such a way that it is hard to read without key items that are stored elsewhere in the cloud system. Sakr (2014) suggested homomorphic encryption may be needed, but warns that the use of encryption techniques increases the cost and time of performance. Finally, Lublinsky, Smith, and Yakubovich (2013), stated that encrypting the network to protect data-in-motion is needed.

Overall, a combination of data encryption, hiding IP addresses of computational components, and three-factor authentication may mitigate some of the cloud computing security concerns, like eavesdropping, masquerading, message tampering, and denial of services. However, using these techniques will increase the time it takes to process big data. Thus a cost-benefit analysis must be conducted to compare and contrast these methods while balancing data risk profiles and current risk models.

Resources:

  • Brookshear, G., & Brylow, D. (2014). Computer Science: An Overview, (12th ed.). Pearson Learning Solutions. VitalBook file.
  • Connolly, T., & Begg, C. (2014). Database Systems: A Practical Approach to Design, Implementation, and Management, 6th Edition. Pearson Learning Solutions. VitalBook file.
  • Lublinsky, B., Smith, K., & Yakubovich, A. (2013). Professional Hadoop Solutions. Wrox. VitalBook file.
  • Prakash, V., & Darbari, M. (2012). A review on security issues in distributed systems. International Journal of Scientific & Engineering Research, 3(9), 300–304.
  • Sakr, S. (2014). Large scale and big data: Processing and management. Boca Raton, FL: CRC Press.

Adv Topics: Security Issues associated with Big Data

The scientific method helps give a framework for the data analytics lifecycle (Dietrich, 2013). Per Khan et al. (2014), the entire data lifecycle consists of the following eight stages:

  • Raw big data
  • Collection, cleaning, and integration of big data
  • Filtering and classification of data usually by some filtering criteria
  • Data analysis which includes tool selection, techniques, technology, and visualization
  • Storing data with consideration of CAP theory
  • Sharing and publishing data, while understanding ethical and legal requirements
  • Security and governance
  • Retrieval, reuse, and discovery to help in making data-driven decisions

Prajapati (2013), stated the entire data lifecycle consists of the following five steps:

  • Identifying the problem
  • Designing data requirements
  • Pre-processing data
  • Data analysis
  • Data visualizing

It should be noted that Prajapati includes steps that first ask what, when, who, where, why, and how with regards to trying to solve a problem. It doesn’t just dive into getting data. Combining both Prajapati (2013) and Kahn et al. (2014) data lifecycles, provides a better data lifecycle. However, there are 2 items to point out from the above lifecycle: (a) the security phase is an abstract phase because security considerations are involved in stages (b) storing data, sharing and publishing data, and retrieving, reusing and discovery phase.

Over time the threat landscape has gotten worse and thus big data security is a major issue. Khan et al. (2014) describe four aspects of data security: (a) privacy, (b) integrity, (c) availability, and (d) confidentiality. Minelli, Chambers, and Dhiraj (2013) stated that when it comes to data security a challenge to it is understanding who owns and has authority over the data and the data’s attributes, whether it is the generator of that data, the organization collecting, processing, and analyzing the data. Carter, Farmer, and Siegel (2014) stated that access to data is important, because if competitors and substitutes to the service or product have access to the same data then what advantage does that provide the company. Richard and King (2014), describe that a binary notion of data privacy does not exist.  Data is never completely private/confidential nor completely divulged, but data lies in-between these two extremes.  Privacy laws should focus on the flow of personal information, where an emphasis should be placed on a type of privacy called confidentiality, where data is agreed to flow to a certain individual or group of individuals (Richard & King, 2014).

Carter et al. (2014) focused on data access where access management leads to data availabilities to certain individuals. Whereas, Minelli et al. (2013) focused on data ownership. However, Richard and King (2014) was able to tie those two concepts into data privacy. Thus, each of these data security aspects is interrelated to each other and data ownership, availability, and privacy impacts all stages of the lifecycle. The root causes of the security issues in big data are using dated techniques that are best practices but don’t lead to zero-day vulnerability action plans, with a focus on prevention, focus on perimeter access, and a focus on signatures (RSA, 2013). Specifically, certain attacks like denial of service attacks are a threat and root cause to data availability issues (Khan, 2014). Also, RSA (2013) stated that from a sample of 257 security officials felt that the major challenges to security were the lack of staffing, large false positive amounts which creates too much noise, lack of security analysis skills, etc. Subsequently, data privacy issues arise from balancing compensation risks, maintaining privacy, and maintaining ownership of the data, similar to a cost-benefit analysis problem (Khan et al., 2014).

One way to solve security concerns when dealing with big data access, privacy, and ownership is to place a single entry point gateway between the data warehouse and the end-users (The Carology, 2013). The single entry point gateway is essentially middleware, which help ensures data privacy and confidentiality by acting on behalf of an individual (Minelli et al., 2013). Therefore, this gateway should aid in threat detection, assist in recognizing too many requests to the data which can cause a denial of service attacks, provides an audit trail and doesn’t require to change the data warehouse (The Carology, 2013). Thus, the use of middleware can address data access, privacy, and ownership issues. RSA (2013) proposed a solution to use data analytics to solve security issues by automating detection and responses, which will be covered in detail in another post.

Resources:

  • Carter, K. B., Farmer, D., and Siegel, C. (2014). Actionable Intelligence: A Guide to Delivering Business Results with Big Data Fast! John Wiley & Sons P&T. VitalBook file.
  • Khan, N., Yaqoob, I., Hashem, I. A. T., Inayat, Z. Ali, W. K. M., Alam, M., Shiraz, M., & Gani., A. (2014). Big data: Survey, technologies, opportunities, and challenges. The Scientific World Journal, 2014. Retrieved from http://www.hindawi.com/journals/tswj/2014/712826/
  • Minelli, M., Chambers, M., & Dhiraj, A. (2013). Big Data, Big Analytics: Emerging Business Intelligence and Analytic Trends for Today’s Businesses. John Wiley & Sons P&T. VitalBook file.

Adv Topics: MapReduce and Batching Computations

With an increase in data produced by end-user and interne of things connectivity it creates a need for a computational infrastructure that deals with live-streaming big data processing (Sakr, 2014). Real-time data processing requires continuous data input and data output that needs to be processed in a small amount of time (Narsude, 2015). Therefore, once the data comes in it must be processed to be considered a real-time analysis of the data. Real-time data processing deals with a data stream that is defined by an unbounded set of data that can be transformed into stream spouts, which is a bound subset of the stream data (Sakr, 2014).

MapReduce and Hadoop frameworks are amazing with big data-at-rest because it relies on batch jobs with a beginning and an end (Sakr, 2014). However, Narsude (2015) and Sakr (2014), stated that batch processing is not great for real-time stream analytics because stream data doesn’t stop coming in unless an end-user stops it (Sakr, 2014). Batch processing means that the end-user is periodically processing data that has been previously collected to reach a certain size or time-dependent variable (Narsude, 2015).

Streaming data has the following properties (Sakr, 2014): (a) it is generated from multiple places sensuously; (b) has continuous data being produced and in transit; (c) requires continuous parallel processing jobs; (d) online processing of data that isn’t saved, so data analytics lifecycle is applied in real-time to the streaming data; (e) persistently storing streaming data is not optimal; and (f) queries are continuous. Real-time streaming can usually be conducted through two different paradigms (figure 1): processing data one event at a time also known as tuple at a time and keeping data batches small enough to use micro-batching jobs also known as micro-batching (Narsude, 2015).

IPV4.png

Figure 1: Illustrating the difference in paradigms in real time processing where the orange dots represent an equal sized event and the blue boxes are the data processing (adapted from Narsude, 2015).   Differing event sizes do exist, which can complicate this image further.

Notice that from figure 1, these two different paradigms create different sets of latency. Thus, real-time processing produces a near real-time output. Therefore, data scientist need to take this into account when picking a tool to handle their real-time data. Some of the many large-scale stream tools include (Sakr, 2014): Aurora, Borealis, IBM System S and IBM Spade, Deduce, StreamCloud, Stormy, and Twitter Storm. The focus here will be on DEDUCE and Twitter Storm.

DEDUCE

The DEDUCE architecture offers a middleware solution that can handle streaming data processing and use MapReduce capabilities for real-time data processing (Kumar, Andrade, Gedik, & Wu, 2010; Sakr, 2014). This middleware is using a concept called micro-batching where data is collected in near real-time and is being processed distributive over a batched MapReduce job (Narsude, 2015). Thus, this middleware is using a simplified MapReduce job design for data-at-rest and data-in-motion, where the data stream can come from a variable of an operator or a punctuated input stream and the data output is written in a distributed database system (Kumar et al., 2010; Sakr, 2014). An operator output can come from an end-user, a mapper output or a reducer output (Sakr, 2014).

For its scalability features the middleware creates the illusion to the end-user that the data is stored in one location though it is stored in multiple distributed systems through a common interface (Kumar et al., 2010; Sakr, 2014). DEDUCE’s unified abstraction and runtime has the following customizable features that corresponds to its performance optimization capability (Sakr, 2014): (a) using MapReduce jobs for data flow; (b) re-using MapReduce jobs offline to calibrate data analytics models; (c) optimizing the mapper and reducer jobs under the current computational infrastructure; and (d) being able to control configurations, like update frequency and resource utilization, for the MapReduce jobs.

Twitter Storm

Twitter Storm is an open-source code that is distrusted at GitHub (David, 2011). The Twitter Storm architecture is a distributed and fault-tolerant system that is horizontally scalable to handle parallel processing, guaranteed message processing of all data at least once, fault tolerance to ensure fidelity of results, and programming language agnostic (David, 2011; Sakr, 2014). This system uses processing data one event at a time paradigm to real-time processing (Narsude, 2015).

This software uses data stream spouts and bolts, and a task can be run on either spouts or bolts. Bolts consume a multiple data stream spouts to process and analyze the data to either produce a single output or a new data stream spout (Sakr, 2014). A master node takes care of what stream data and code goes to which spout and bolt through worker node supervisors, and the worker nodes usually work a task(s) per instructions of the worker node supervisor (David, 2011; Sakr, 2014). Considering that the more complex the query, the more data spouts and bolts are needed in the processing topology (figure 2), which helps to address this solution’s scalability features (Sakr, 2014).

Capture1.GIF

Figure 2: A sample Twitter Storm stream to spout to bolt topology (Adapted from Sakr, 2014).

Finally, the performance optimization capability is based on the above sample topology image data, where stream data can be grouped in five different topological spouts and bolts configuration: stream data topology is customizable by the end user; stream data can be randomly distributed, stream data is replicated across all spouts and bolts, one data stream goes to a single bolt directly, or stream data goes into a specified group based on parameters (Sakr, 2014).

Conclusion

Real-time processing is in reality near real-time processes as it depends on the paradigm that is chosen to run the data processing scheme. DEDUCE uses a micro-batching technique, and Twitter Storm uses a tuple at a time. Thus, depending on the problem and project requirements data scientists can use either paradigm and processing scheme that meets their needs.

Resources

Adv Topics: Extracting Knowledge from big data

The evolution of data to wisdom is defined by the DIKW pyramid, where Data is just facts without any context, but when facts are used to understand relationships it generates Information (Almeyer-Stubbe & Coleman, 2014). That information can be used to understand patterns, it can then help build Knowledge, and when that knowledge is used to understand principles, it builds Wisdom (Almeyer-Stubbe & Coleman, 2014; Bellinger, Castro, Mills, n.d.). Building an understanding to jump from one level of the DIKW pyramid, is an appreciation of learning “why” (Bellinger et al., n.d.). Big data was first coined in a Gartner blog post, is data that has high volume, variety, and velocity, but without any interest in understanding that data, data scientist will lack context (Almeyer-Stubbe & Coleman, 2014; Bellinger et al., n.d.; Laney, 2001). Therefore, applying the DIKW pyramid can help turn that big data into extensive knowledge (Almeyer-Stubbe & Coleman, 2014; Bellinger et al., n.d.; Sakr, 2014). Extensive knowledge is a derived from placing meaning to big data usually in the form of predictive analytics algorithms (Sakr, 2014).

Machine learning requires historical data and is part of the data analytics process under data mining to understand hidden patterns or structures within the data (Almeyer-Stubbe & Coleman, 2014). Machine learning is easier to build and maintain than other classical data mining techniques (Wollan, Smith, & Zhou, 2010). Machine learning algorithms include clustering, classification, and association rules techniques and the right algorithm from any of these three techniques must be selected that meet the needs of the data (Services, 2015). Unsupervised machine learning techniques like clustering are used when data scientist do not understand or classify data prior to data mining techniques to understand hidden structures within the data set (Brownlee, 2016; Services, 2015). Supervised machine learning involves model training and model testing to aid in understanding which input variables feed into an output variable, involving such techniques as classification and regression (Brownlee, 2016).

An example of an open source Hadoop machine learning algorithm library would include Apache Mahout, which can be found at http://mahout.apache.org (Lublinsky, Smith, & Yakubovich, 2013). A limitation from learning from historical data to predict the future is it can “stifle innovation and imagination” (Almeyer-Stubbe & Coleman, 2014). Another limitation can exist that current algorithms may not run on distributed database systems. Thus some tailoring of the algorithms may be needed (Services, 2015). The future of machine learning involves its algorithms becoming more interactive to the end user, known as active learning (Wollan, Smith, & Zhou, 2010).

Case Study: Machine learning, medical diagnosis, and biomedical engineering research – commentary (Foster, Koprowski, & Skufca, 2014)

The authors created a synthetic training data set to simulate a typical medical classification problem of healthy and ill people and assigned random numbers to 10 health variables. Given this information, the actual classification accuracy should be 50%, which is also similar to pure chance alone. These authors found that when classification machine learning algorithms are misapplied, it can lead to false results. This was proven when their model took only 50 people to produce similar accuracy values of pure chance alone. Thus, the authors of this paper were trying to warn the medical field that misapplying classification techniques can lead to overfitting.

The authors then looked at feature selection for classifying Hashimoto’s disease from 250 clinical ultrasound data with the disease and 250 healthy people. Ten variables were selected to help classify these images, and a MATLAB machine learning algorithm was trained on 400 people (200 healthy and 200 ill) to then be tested on 100 people (50 healthy and 50 ill). They were able to show that when 3-4 variables were used they produced better classification results, thus 3-4 variables had huge information gain. This can mislead practitioners, because of the small data set that could be generalized too broadly and the lack of independence between training and testing datasets. The authors argued that larger data sets are needed to get rid of some of the issues that could result in the misapplication of classifiers.

The authors have the following four recommendations when considering the use of supervised machine learning classification algorithms:

    1. Clearly, state the purpose of the study and come from a place of understanding of that problem and its applications.
    2. Minimize the number of a variable when used in classifiers, such as using pruning algorithms in classification algorithms to only select certain variables that meet a certain level of information gain. This is more important with smaller data sets than with big data.
    3. Understand that classifiers are sensitive and that results gained from one set of instances might require further adjustments to be implemented elsewhere.
    4. Classification algorithms and data mining are part of the experimental process not the answer to all problems.

Resources: