Adv Topics: MapReduce and Iterative computation

Data-in-motion is the real-time streaming of data from a broad spectrum of technologies, which also encompasses the data transmission between systems, while data that is stored on a database system or cloud system is considered as data-at-rest and data that is being processed and analyzed is considered as data-in-use (Katal, Wazid, & Goudar, 2013; Kishore & Sharma, 2016; Ovum, 2016; Ramachandran & Chang, 2016). Social media data or social network analysis data can be considered as data-in-motion and processing that type of data can be quite problematic. Data-in-motion has to be iteratively processed until there is a certain termination condition is reached and it can be reached between iterations (Sakr, 2014). Data-at-rest is probably considered easier to analyze; however, this type of data can also be problematic. If the data-at-rest is large in size and even if the data does not change or evolve, its large size requires iterative processes to analyze the data.

An example of an iterative process as suggested by Lusblinsky, Smith, and Yakubovich (2014), is solving a linear equation by approximation algorithms on hundreds of equations and variables. The data can be stored in matrix Ax = b, where A is a matrix of coefficients, b is a vector of output values, and x is a vector of variables. If the data is too large, using a simple linear algebraic solution would be impossible, so a quadratic spline solution would consist of the following:

f(x) = ½ xTAx-xT­b

and a superscript “T” represents transposing the vector or matrix. Each iteration of this spline would result in a better vector solution. However, Sakr (2014) stated that MapReduce does not support iterative data processing and analysis directly. Thus workaround is needed to handle iterative programs for situations like data-in-motion or even streaming data.

 Root causes and technical steps to address them

To deal with datasets that require iterative processes to analyze the data, computer coders need to create and arrange multiple MapReduce functions in a loop (Sakr, 2014). This workaround would increase the processing time of the serialized program because data would have to be reloaded and reprocessed, because there is no read or write of intermediate data, which was there for preserving the input data (Lusblinksy et al., 2014; Sakr, 2014). The root cause exists because in its simplest form MapReduce can consist of many mappers and one reducer, which is a performance bottleneck. This simplified model of the MapReduce analytical engine means one cannot reduce across keys, just one key at a time (Sadalage & Fowler, 2012). Thus, the algorithm is not built for iterations. HaLoop is an iteration solution built on top of the Hadoop infrastructure that has a loop control module, task scheduler, which caches invariant data from a previous iteration and uses it in future iterations (Sakr, 2014). This solution can be applied to static and unchanged data.

There are also disadvantages of using MapReduce on these types of data because too many mapper functions can create an infrastructure overhead or too many reducers can provide too many outputs (Lusblinksy et al., 2014; Sakr, 2014). Thus there has to be a basic implementation plan of effective data placement over the Hadoop cluster, to ensure proper load balance of data across the Hadoop servers (Sakr, 2014). Sometimes, there is a need to run two separate MapReduce functions, one to prepare the data by evenly distributing data across the servers and one that iteratively goes through the data (Lusblinksy et al., 2014). CoHadoop allows for data files and related files to be stored on the same server, provide a means of load balancing and fault tolerance by creating a file-level locator property (Sakr, 2014). Sakr describes the locator property as a means to keep track of where the data is stored via a unique identification number for each file in the system. This solution can also be applied to static and unchanged data.

For data-in-motion and streaming data, data has to be iteratively processed until there is a certain termination condition is reached. MapReduce Online has an approach where between the mapper and reducer function data is pipelined to allow for data processing and analysis as soon as the mappers produce their outputs (Sakr, 2014). This can run the MapReduce functions iteratively and provide relatively live reduced data outputs. Sakr, further explains that this approach is where the reducer contacts every mapper upon initiation of the scheduler and the data is temporarily stored on the pipeline, which is an in-memory buffer.

Resources:

  • Katal, A., Wazid, M., & Goudar, R. H. (2013, August). Big data: issues, challenges, tools and good practices. InContemporary Computing (IC3), 2013 Sixth International Conference on (pp. 404-409). IEEE.
  • Kishore, N. & Sharma, S. (2016). Secure data migration from enterprise to cloud storage – analytical survey. BIJIT-BVICAM’s Internal Journal of Information Technology. Retrieved from http://bvicam.ac.in/bijit/downloads/pdf/issue15/09.pdf
  • Lublinsky, B., Smith, K. T., & Yakubovich, A. (2013). Professional Hadoop Solutions. Vitalbook file.
  • Ovum (2016). 2017 Trends to watch: Big Data. Retrieved from http://info.ovum.com/uploads/files/2017_Trends_to_Watch_Big_Data.pdf
  • Ramachandran, M. & Chang, V. (2016). Toward validating cloud service providers using business process modeling and simulation. Retrieved from http://eprints.soton.ac.uk/390478/1/cloud_security_bpmn1%20paper%20_accepted.pdf
  • Sadalage, P. J., Fowler, M. (2012). NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence, (1st ed.). Vitalbook file.
  • Sakr, S. (2014). Large Scale and Big Data, (1st ed.). Vitalbook file.
Advertisements

Compelling topics on analytics of big data

  • Big data is defined as high volume, high variety/complexity, and high velocity, which is known as the 3Vs (Services, 2015).
  • Depending on the goal and objectives of the problem, that should help define which theories and techniques of big data analytics to use. Fayyad, Piatetsky-Shapiro, and Smyth (1996) defined that data analytics can be divided into descriptive and predictive analytics. Vardarlier and Silaharoglu (2016) agreed with Fayyad et al. (1996) division but added prescriptive analytics. Thus, these three divisions of big data analytics are:
    • Descriptive analytics explains “What happened?”
    • Predictive analytics explains “What will happen?”
    • Prescriptive analytics explains “Why will it happen?”
  • The scientific method helps give a framework for the data analytics lifecycle (Dietrich, 2013; Services, 2015). According to Dietrich (2013), it is a cyclical life cycle that has iterative parts in each of its six steps: discovery; pre-processing data; model planning; model building; communicate results, and
  • Data-in-motion is the real-time streaming of data from a broad spectrum of technologies, which also encompasses the data transmission between systems (Katal, Wazid, & Goudar, 2013; Kishore & Sharma, 2016; Ovum, 2016; Ramachandran & Chang, 2016). Data that is stored on a database system or cloud system is considered as data-at-rest and data that is being processed and analyzed is considered as data-in-use (Ramachandran & Chang, 2016).  The analysis of real-time streaming data in a timely fashion is also known as stream reasoning and implementing solutions for stream reasoning revolve around high throughput systems and storage space with low latency (Della Valle et al., 2016).
  • Data brokers are tasked collecting data from people, building a particular type of profile on that person, and selling it to companies (Angwin, 2014; Beckett, 2014; Tsesis, 2014). The data brokers main mission is to collect data and drop down the barriers of geographic location, cognitive or cultural gaps, different professions, or parties that don’t trust each other (Long, Cunningham, & Braithwaite, 2013). The danger of collecting this data from people can raise the incidents of discrimination based on race or income directly or indirectly (Beckett, 2014).
  • Data auditing is assessing the quality and fit for the purpose of data via key metrics and properties of the data (Techopedia, n.d.). Data auditing processes and procedures are the business’ way of assessing and controlling their data quality (Eichhorn, 2014).
  • If following an agile development processes the key stakeholders should be involved in all the lifecycles. That is because the key stakeholders are known as business user, project sponsor, project manager, business intelligence analyst, database administers, data engineer, and data scientist (Services, 2015).
  • Lawyers define privacy as (Richard & King, 2014): invasions into protecting spaces, relationships or decisions, a collection of information, use of information, and disclosure of information.
  • Richard and King (2014), describe that a binary notion of data privacy does not Data is never completely private/confidential nor completely divulged, but data lies in-between these two extremes.  Privacy laws should focus on the flow of personal information, where an emphasis should be placed on a type of privacy called confidentiality, where data is agreed to flow to a certain individual or group of individuals (Richard & King, 2014).
  • Fraud is deception; fraud detection is needed because as fraud detection algorithms are improving, the rate of fraud is increasing (Minelli, Chambers, &, Dhiraj, 2013). Data mining has allowed for fraud detection via multi-attribute monitoring, where it tries to find hidden anomalies by identifying hidden patterns through the use of class description and class discrimination (Brookshear & Brylow, 2014; Minellli et al., 2013).
  • High-performance computing is where there is either a cluster or grid of servers or virtual machines that are connected by a network for a distributed storage and workflow (Bhokare et al., 2016; Connolly & Begg, 2014; Minelli et al., 2013).
  • Parallel computing environments draw on the distributed storage and workflow on the cluster and grid of servers or virtual machines for processing big data (Bhokare et al., 2016; Minelli et al., 2013).
  • NoSQL (Not only Structured Query Language) databases are databases that are used to store data in non-relational databases i.e. graphical, document store, column-oriented, key-value, and object-oriented databases (Sadalage & Fowler, 2012; Services, 2015). NoSQL databases have benefits as they provide a data model for applications that require a little code, less debugging, run on clusters, handle large scale data and evolve with time (Sadalage & Fowler, 2012).
    • Document store NoSQL databases, use a key/value pair that is the file/file itself, and it could be in JSON, BSON, or XML (Sadalage & Fowler, 2012; Services, 2015). These document files are hierarchical trees (Sadalage & Fowler, 2012). Some sample document databases consist of MongoDB and CouchDB.
    • Graph NoSQL databases are used drawing networks by showing the relationship between items in a graphical format that has been optimized for easy searching and editing (Services, 2015). Each item is considered a node and adding more nodes or relationships while traversing through them is made simpler through a graph database rather than a traditional database (Sadalage & Fowler, 2012). Some sample graph databases consist of Neo4j Pregel, etc. (Park et al., 2014).
    • Column-oriented databases are perfect for sparse datasets, ones with many null values and when columns do have data the related columns are grouped together (Services, 2015). Grouping demographic data like age, income, gender, marital status, sexual orientation, etc. are a great example for using this NoSQL database. Cassandra is an example of a column-oriented database.
  • Public cloud environments are where a supplier to a company provides a cluster or grid of servers through the internet like Spark AWS, EC2 (Connolly & Begg, 2014; Minelli et al. 2013).
  • A community cloud environment is a cloud that is shared exclusively by a set of companies that share the similar characteristics, compliance, security, jurisdiction, etc. (Connolly & Begg, 2014).
  • Private cloud environments have a similar infrastructure to a public cloud, but the infrastructure only holds the data one company exclusively, and its services are shared across the different business units of that one company (Connolly & Begg, 2014; Minelli et al., 2013).
  • Hybrid clouds are two or more cloud structures that have either a private, community or public aspect to them (Connolly & Begg, 2014).
  • Cloud computing allows for the company to purchase the services it needs, without having to purchase the infrastructure to support the services it might think it will need. This allows for hyper-scaling computing in a distributed environment, also known as hyper-scale cloud computing, where the volume and demand of data explode exponentially yet still be accommodated in public, community, private, or hybrid cloud in a cost efficiently (Mainstay, 2016; Minelli et al., 2013).
  • Building block system of big data analytics involves a few steps Burkle et al. (2001):
    • What is the purpose that the new data will and should serve
      • How many functions should it support
      • Marking which parts of that new data is needed for each function
    • Identify the tool needed to support the purpose of that new data
    • Create a top level architecture plan view
    • Building based on the plan but leaving room to pivot when needed
      • Modifications occur to allow for the final vision to be achieved given the conditions at the time of building the architecture.
      • Other modifications come under a closer inspection of certain components in the architecture

 

References

  • Angwin, J. (2014). Privacy tools: Opting out from data brokers. Pro Publica. Retrieved from https://www.propublica.org/article/privacy-tools-opting-out-from-data-brokers
  • Beckett, L. (2014). Everything we know about what data brokers know about you. Pro Publica. Retrieved from https://www.propublica.org/article/everything-we-know-about-what-data-brokers-know-about-you
  • Bhokare, P., Bhagwat, P., Bhise, P., Lalwani, V., & Mahajan, M. R. (2016). Private Cloud using GlusterFS and Docker.International Journal of Engineering Science5016.
  • Brookshear, G., & Brylow, D. (2014). Computer Science: An Overview, (12th). Pearson Learning Solutions. VitalBook file.
  • Burkle, T., Hain, T., Hossain, H., Dudeck, J., & Domann, E. (2001). Bioinformatics in medical practice: what is necessary for a hospital?. Studies in health technology and informatics, (2), 951-955.
  • Connolly, T., Begg, C. (2014). Database Systems: A Practical Approach to Design, Implementation, and Management, (6th). Pearson Learning Solutions. [Bookshelf Online].
  • Della Valle, E., Dell’Aglio, D., & Margara, A. (2016). Tutorial: Taming velocity and variety simultaneous big data and stream reasoning. Retrieved from https://pdfs.semanticscholar.org/1fdf/4d05ebb51193088afc7b63cf002f01325a90.pdf
  • Dietrich, D. (2013). The genesis of EMC’s data analytics lifecycle. Retrieved from https://infocus.emc.com/david_dietrich/the-genesis-of-emcs-data-analytics-lifecycle/
  • Eichhorn, G. (2014). Why exactly is data auditing important? Retrieved from http://www.realisedatasystems.com/why-exactly-is-data-auditing-important/
  • Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996). From data mining to knowledge discovery in databases. AI Magazine, 17(3), 37. Retrieved from: http://www.aaai.org/ojs/index.php/aimagazine/article/download/1230/1131/
  • Katal, A., Wazid, M., & Goudar, R. H. (2013, August). Big data: issues, challenges, tools and good practices. InContemporary Computing (IC3), 2013 Sixth International Conference on (pp. 404-409). IEEE.
  • Kishore, N. & Sharma, S. (2016). Secure data migration from enterprise to cloud storage – analytical survey. BIJIT-BVICAM’s Internal Journal of Information Technology. Retrieved from http://bvicam.ac.in/bijit/downloads/pdf/issue15/09.pdf
  • Long, J. C., Cunningham, F. C., & Braithwaite, J. (2013). Bridges, brokers and boundary spanners in collaborative networks: a systematic review.BMC health services research13(1), 158.
  • (2016). An economic study of the hyper-scale data center. Mainstay, LLC, Castle Rock, CO, the USA, Retrieved from http://cloudpages.ericsson.com/ transforming-the-economics-of-data-center
  • Minelli, M., Chambers, M., &, Dhiraj, A. (2013). Big Data, Big Analytics: Emerging Business Intelligence and Analytic Trends for Today’s Businesses. John Wiley & Sons P&T. [Bookshelf Online].
  • Ovum (2016). 2017 Trends to watch: Big Data. Retrieved from http://info.ovum.com/uploads/files/2017_Trends_to_Watch_Big_Data.pdf
  • Park, Y., Shankar, M., Park, B. H., & Ghosh, J. (2014, March). Graph databases for large-scale healthcare systems: A framework for efficient data management and data services. In Data Engineering Workshops (ICDEW), 2014 IEEE 30th International Conference on (pp. 12-19). IEEE.
  • Ramachandran, M. & Chang, V. (2016). Toward validating cloud service providers using business process modeling and simulation. Retrieved from http://eprints.soton.ac.uk/390478/1/cloud_security_bpmn1%20paper%20_accepted.pdf
  • Richards, N. M., & King, J. H. (2014). Big Data Ethics. Wake Forest Law Review, 49, 393–432.
  • Sadalage, P. J., Fowler, M. (2012). NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence, 1st Edition. [Bookshelf Online].
  • Services, E. E. (2015). Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data, (1st). [Bookshelf Online].
  • Technopedia (n.d.). Data audit. Retrieved from https://www.techopedia.com/definition/28032/data-audit
  • Tsesis, A. (2014). The right to erasure: Privacy, data brokers, and the indefinite retention of data.Wake Forest L. Rev.49, 433.
  • Vardarlier, P., & Silahtaroglu, G. (2016). Gossip management at universities using big data warehouse model integrated with a decision support system. International Journal of Research in Business and Social Science, 5(1), 1–14. Doi: http://doi.org/10.1108/ 17506200710779521

Data in-motion, Data at-rest, & Data in-use

Data in-motion is the real-time streaming of data from a broad spectrum of technologies, which also encompasses the data transmission between systems (Katal, Wazid, & Goudar, 2013; Kishore & Sharma, 2016; Ovum, 2016; Ramachandran & Chang, 2016).  Data that is stored on a database system or cloud system is considered as data-at-rest and data that is being processed and analyzed is considered as data-in-use (Ramachandran & Chang, 2016).  The analysis of real-time streaming data in a timely fashion is also known as stream reasoning and implementing solutions for stream reasoning revolve around high throughput systems and storage space with low latency (Della Valle et al., 2016). Cisco (2017), stated that data in motion’s value decreases with time, unlike data-at-rest.

Data-in-motion focuses on the velocity and variety portion of the Gartner’s 3Vs of Big Data definition (Della Valle, Dell’Aglio, Margara, 2016). This is becoming an important issue in data analytics due to the emergence of the Internet of Things (IoT), which could be deployed in the cloud and can constitute as the variety portion of data-in-motion (Ovum, 2016).  Della Valle et al. (2016), stated that knowledge had been represented in various ways and the analysis of this data would allow for understanding implicit information hidden in these different forms of explicit knowledge.

5db3

Figure 1 is adapted from Della Valle et al. (2016), which is a conceptual model for real-time streaming that can provide a scalable solution for large volumes of data or a large variety of data sources. In this diagram (Figure 1), a wrapper hides the individuality of the data source by transforming it to look like one data source, while the mapping ties all the data together (Della Valle et al., 2016).

Kishore and Sharma (2016) and Ramachandran and Chang (2016), describes in their conceptual model the definition of data-in-motion as data-in-transit from two systems of data-at-rest.  Kishore and Sharma (2016) stated that data is most vulnerable while it is in motion. Given the vulnerabilities of data-in-motion, Kishore and Sharma (2016), discussed that protecting data-in-motion could be done either through encryption or Virtual Private Network (VPN) connections through the entire process. Ramachandran and Chang (2016), stated that encryption is the only security technique for data-in-motion.  However, security is not addressed in Della Valle et al. (2016) system, and this is just one reason of many on why Kishore and Sharma (2016) suggested that security for data-in-motion as an area for future research.

Cisco (2017), illustrates the need for further knowledge and development of data-in-motion research because retailers have the most to benefit from it. For retail environments, data collection and processing is key for thriving and increasing their profit margins because retailers are trying to build brand recognition, brand affinity, and a relationship with their customers.  All of this is done to enhance the customer experience, for example using the data coming from the web camera to create a virtual mirror where the customer can try on accessories and see how this accessory fits their personal style, is creating a customer experience from data-in-motion (Cisco, 2017).  This virtual mirror must use facial recognition technology similar to the use of Snapchat filters.  Other ways retailers could use data-in-motion data is by collecting phone data location and demographic data to create a real-time promotion for nearby travelers or in-store customers (Cisco, 2017).  Finally, Cisco (2017), also discussed how data in motion could help in providing proactive and cost-effective health care, enhancing manufacturing supply chain, provide scalable and secure energy production, etc.

References

Data Tools: Data-In-Motion

Definition of terms

Data in-motion: a part of data velocity, which deals with the speed of data coming in from multiple sources as well as the speed of data traveling between systems (Katal, Wazid, & Goudar, 2013). Essentially data-in-motion can encompass data streaming, data transfer, or real-time data. However, there are challenges and issues that have to be addressed to conducting real-time analysis on data streams (Katal et al., 2013; Tsinoremas et al., n.d.).

Data complexity: consists of the joining, cleaning, and transformation of data from multiple systems to find relationships that are highly correlated (Katal et al., 2013).  Complexity increases as the velocity of data coming in or transferred increases (Katal et al., 2013; Tsinoremas et al., n.d.).

Data-in-motion analytics performed in case study (Blount et al., 2010)

Artemis was designed, built and deployed in 2009 through a coalition of the University of Ontario Institute of Technology, SickKids, Department of Pediatrics, and University of Toronto, to help read in data from multiple sensors taken from neonatal intensive care units (NICU).  The goal is to have Artemis to read in data from multiple physiological instruments like an electrocardiogram (ECG), heart rate, blood oxygen saturation, respiratory states, etc. to find key patterns and relationships in the data streams (data-in-motion) to provide the best care for infants in NICU.  To make Artemis a success, the coalition had to analyze huge amounts of data from a large group of patients.  Artemis had to interface with multiple medical devices, should be scalable to add more medical devices, and store raw physiological data while at the same time de-identifying the data per U.S. and Canadian Health Privacy laws.  From these multiple medical devices new rules could be created by unsupervised machine learning techniques, and through supervising machine learning techniques with medical/clinical derived rules.  The Artemis system has to read in the data in real-time to sort, join, clean, and transform, to evaluate against certain rules and send out an alert or not to medical staff about one of the NICU patients, while at the same time de-identifying the data and storing it into a database for future analysis and tests.

In the test phase, 5 infants were enrolled and in the deployed state 19 infants were enrolled in the study. This study has to take into account, that the cables from all the sensors and the equipment use to collect all the streaming data must not get in the way of the medical/clinical staff when they need to help out the infant. In some cases, when the Artemis system was deployed, some of the sensors were not attached, and thus the Information Management Teams had to work with medical/clinical staff to help train the model on fewer data as well, if they do not have all the ideal sensors needed to send out alerts for certain situations.  Therefore, this system provides a way for medical/clinical staff to have constant data on NICU patients in real time from multiple sensors and allow the machine to alert them when certain markers and key performance indicators are met.

Importance of applying data analytics to data-in-motion

It can be easily seen that analyzing infant NICU data is important.  It is especially important to leverage analytics to the data stream of the key medical sensors needed per infant in the NICU.  What is not easily seen sometimes is how important all the data really is.  Since, in the real-life deployment showed that not all the medical sensors are being used to help provide the model with enough information to be of use to medical/clinical staff (Blount et al., 2010).

Also, the use of data streams in a university setting would allow for a different perspective that could be used in the NICU case study above.  At the University of Miami, data is triaged into a four-tiered system (Tsinoremas et al., n.d.):

  • High-speed storage – for data that is currently being processed, data-in-motion is at its highest (has 300TB of space and costs $2000/TB)
  • Mid-range speed storage – for data that is currently being looked at (costs $600-$700/TB)
  • Deep storage – long-term data storage, data that is looked at every so often, but not regularly, usually old data (costs $300/TB)
  • Archived – data to be stored offline, but it is perfect for data at rest

This tiered system above could be applied to Artemis, such that they could process which of the medical devices should be processed first when resources are limited.  Also, this could be applied different, such that there should be a window of which data is currently available, e.g. a 1-hour long record of NICU stats saved locally, with longer records still accessible, but not stored in vital processing spaces.  Data windows were discussed, but depending on the situation, data windows could be adjusted to provide the best care for the infants (Blount et al., 2010).

Also, the quality of the sensor data must be taken into account.  If more data is needed/preferred to make informed decisions on infant patients in the NICU (Blount et al., 2010), then there should be a focus in collecting, analyzing, high-quality data and the right types of data.  This would lead the designers of Artemis, medical, and clinician staff to think deeply about which data is relevant, and how much data is enough to make the decisions needed to tend to the infants (Katal et al., 2013).

Resources

  • Blount, M., Ebling, M. R., Eklund, J. M., James, A. G., McGregor, C., Percival, N., … & Sow, D. (2010). Real-time analysis for intensive care: development and deployment of the Artemis analytic system.IEEE Engineering in Medicine and Biology Magazine29(2), 110-118.
  • Katal, A., Wazid, M., & Goudar, R. H. (2013, August). Big data: issues, challenges, tools and good practices. InContemporary Computing (IC3), 2013 Sixth International Conference on (pp. 404-409). IEEE.
  • Tsinoremas, N. F., Zysman, J., Mader, C., Kirtma, B., & Blaire, J. (n.d.) Data in motion: A new paradigm in research data lifecycle management. Center for Computational Science: University of Miami.