Data Tools: Data-In-Motion

How is data-in-motion performed and why is it important to apply data analytics to it?

Definition of terms

Data-in-motion: a part of data velocity, which deals with the speed of data coming in from multiple sources as well as the speed of data traveling between systems (Katal, Wazid, & Goudar, 2013). Essentially, data-in-motion can encompass data streaming, data transfer, or real-time data. However, there are challenges and issues that have to be addressed to conduct real-time analysis on data streams (Katal et al., 2013; Tsinoremas et al., n.d.).

Data complexity: consists of the joining, cleaning, and transformation of data from multiple systems to find relationships that are highly correlated (Katal et al., 2013).  Complexity increases as the velocity of data coming in or transferred increases (Katal et al., 2013; Tsinoremas et al., n.d.).

Data-in-motion analytics performed in case study (Blount et al., 2010)

Artemis was designed, built, and deployed in 2009 through a coalition of the University of Ontario Institute of Technology, SickKids, Department of Pediatrics, and University of Toronto to read in data from multiple sensors in neonatal intensive care units (NICU).  The goal is for Artemis to read in data from multiple physiological instruments, such as an electrocardiogram (ECG), heart rate, blood oxygen saturation, respiratory states, etc., and to find key patterns and relationships in the data streams (data-in-motion) to provide the best care for infants in the NICU.  To make Artemis a success, the coalition had to analyze huge amounts of data from a large group of patients.  Artemis had to interface with multiple medical devices, be scalable enough to add more medical devices, and store raw physiological data while at the same time de-identifying the data per U.S. and Canadian health privacy laws.  From the data these devices produce, new rules could be created through unsupervised machine learning techniques, and through supervised machine learning techniques with medically/clinically derived rules.  The Artemis system has to read in the data in real time; sort, join, clean, and transform it; evaluate it against certain rules; and decide whether to send an alert to medical staff about one of the NICU patients, while at the same time de-identifying the data and storing it in a database for future analysis and tests.

In the test phase, 5 infants were enrolled, and in the deployed state, 19 infants were enrolled in the study. The study had to take into account that the cables from all the sensors and the equipment used to collect the streaming data must not get in the way of the medical/clinical staff when they need to help the infant. In some cases, when the Artemis system was deployed, some of the sensors were not attached, and thus the Information Management Teams had to work with medical/clinical staff to train the model on fewer data as well, for situations where not all the ideal sensors needed to send out alerts are available.  Therefore, this system provides a way for medical/clinical staff to have constant data on NICU patients in real time from multiple sensors and allows the machine to alert them when certain markers and key performance indicators are met.
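
To make the rule-evaluation step concrete, here is a minimal sketch of checking streamed readings against rules while tolerating sensors that are not attached. The sensor names, the thresholds in `RULES`, and the `monitor` function are all hypothetical illustrations; they are not taken from Blount et al. (2010) and are not clinical guidance.

```python
from typing import Any, Dict, Iterable, Iterator, List, Tuple

# Hypothetical (low, high) bounds per sensor, for illustration only.
RULES: Dict[str, Tuple[float, float]] = {
    "heart_rate": (100.0, 180.0),  # beats per minute (assumed bounds)
    "spo2": (88.0, 100.0),         # percent oxygen saturation (assumed bounds)
}


def evaluate(reading: Dict[str, Any]) -> List[str]:
    """Return the names of sensors whose value falls outside its rule."""
    alerts = []
    for sensor, (low, high) in RULES.items():
        value = reading.get(sensor)
        if value is None:
            continue  # sensor not attached: skip the rule instead of alerting
        if not low <= value <= high:
            alerts.append(sensor)
    return alerts


def monitor(stream: Iterable[Dict[str, Any]]) -> Iterator[Tuple[str, List[str]]]:
    """Yield (patient_id, alerts) for each reading that breaks a rule."""
    for reading in stream:
        alerts = evaluate(reading)
        if alerts:
            yield reading["patient_id"], alerts


stream = [
    {"patient_id": "p1", "heart_rate": 150.0, "spo2": 95.0},
    {"patient_id": "p2", "heart_rate": 190.0, "spo2": None},
    {"patient_id": "p3", "spo2": 80.0},
]
results = list(monitor(stream))
```

Because `monitor` is a generator, it handles one reading at a time, which is the shape a data-in-motion pipeline needs; a real system would also de-identify and persist each reading.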

Importance of applying data analytics to data-in-motion

It can be easily seen that analyzing infant NICU data is important.  It is especially important to apply analytics to the data streams of the key medical sensors needed per infant in the NICU.  What is not always easily seen is how important all the data really is: the real-life deployment showed that not all the medical sensors were in use, which can leave the model without enough information to be of use to medical/clinical staff (Blount et al., 2010).

Also, the use of data streams in a university setting would allow for a different perspective that could be used in the NICU case study above.  At the University of Miami, data is triaged into a four-tiered system (Tsinoremas et al., n.d.):

  • High-speed storage – for data that is currently being processed, data-in-motion is at its highest (has 300TB of space and costs $2000/TB)
  • Mid-range speed storage – for data that is currently being looked at (costs $600-$700/TB)
  • Deep storage – long-term data storage, data that is looked at every so often, but not regularly, usually old data (costs $300/TB)
  • Archived – data to be stored offline, but it is perfect for data at rest

The tiered system above could be applied to Artemis, so that it could decide which medical devices’ data should be processed first when resources are limited.  It could also be applied differently, by keeping a window of currently available data, e.g., a 1-hour record of NICU stats saved locally, with longer records still accessible but not stored in vital processing spaces.  Data windows were discussed, and depending on the situation, they could be adjusted to provide the best care for the infants (Blount et al., 2010).
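
The data-window idea can be sketched as a small buffer that keeps only the most recent hour of readings. The `SlidingWindow` class is a hypothetical illustration, not part of Artemis; it assumes timestamps are in seconds and arrive in increasing order.

```python
from collections import deque


class SlidingWindow:
    """Keep only the readings that are newer than `span_seconds`."""

    def __init__(self, span_seconds):
        self.span = span_seconds
        self.items = deque()  # (timestamp, value) pairs, oldest first

    def add(self, timestamp, value):
        self.items.append((timestamp, value))
        # Evict readings that are span_seconds old or older,
        # measured against the newest timestamp.
        while self.items and self.items[0][0] <= timestamp - self.span:
            self.items.popleft()

    def mean(self):
        return sum(value for _, value in self.items) / len(self.items)


window = SlidingWindow(span_seconds=3600)  # the 1-hour window from the text
window.add(0, 140.0)
window.add(1800, 150.0)
window.add(3600, 160.0)  # the reading taken at t=0 is evicted here
```

Evicted readings would move to a slower storage tier rather than being discarded; adjusting `span_seconds` is one way to tune the window to the situation.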

Also, the quality of the sensor data must be taken into account.  If more data is needed/preferred to make informed decisions on infant patients in the NICU (Blount et al., 2010), then there should be a focus on collecting and analyzing high-quality data and the right types of data.  This would lead the designers of Artemis and the medical and clinical staff to think deeply about which data is relevant and how much data is enough to make the decisions needed to tend to the infants (Katal et al., 2013).

Resources

  • Blount, M., Ebling, M. R., Eklund, J. M., James, A. G., McGregor, C., Percival, N., … & Sow, D. (2010). Real-time analysis for intensive care: development and deployment of the Artemis analytic system. IEEE Engineering in Medicine and Biology Magazine, 29(2), 110-118.
  • Katal, A., Wazid, M., & Goudar, R. H. (2013, August). Big data: issues, challenges, tools and good practices. In Contemporary Computing (IC3), 2013 Sixth International Conference on (pp. 404-409). IEEE.
  • Tsinoremas, N. F., Zysman, J., Mader, C., Kirtma, B., & Blaire, J. (n.d.). Data in motion: A new paradigm in research data lifecycle management. Center for Computational Science: University of Miami.

Data Tools: XML & Hadoop

Hadoop is a cluster-based file system and has a special processing framework called MapReduce. Does XML have any impact on MapReduce application design?

Hadoop is predominantly known for its Hadoop Distributed File System (HDFS), where the data is distributed across multiple systems, and for its code for running MapReduce tasks (Rathbone, 2013). MapReduce has two queries: one maps the input data into a final format and splits it across a group of computer nodes, while the second reduces the data in each node so that combining the results from all the nodes provides the answer sought (Eini, 2010). In other words, data is partitioned, sorted, and grouped to provide a key and value as an output (Rathbone, 2013). As more data gets added in real time (data in motion), MapReduce can do the recalculations more cheaply than before, and the data scientist doesn’t have to touch the data (Eini, 2010; Roy, 2014). Roy (2014) suggested an example of using Intensive Care Unit (ICU) sensor data, which comes into a database multiple times per second, to help avoid false positive alarms that could lead to overworked hospital staffers.  However, Hadoop is best used for non-real-time tasks with a huge demand for processing power (Rathbone, 2013). The issue for Hadoop is identifying the correct instant at which an actionable item is needed and acting on that item (Roy, 2014).
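
The point about cheaper recalculation can be illustrated with a toy sketch in plain Python (this is not Hadoop's API). It assumes the reduce step is associative, so earlier results can be merged with a new batch instead of reprocessing everything; the `run` helper and the sensor names are hypothetical.

```python
def map_reading(reading):
    """Map one (sensor, value) reading to a key and a (count, sum) pair."""
    sensor, value = reading
    return sensor, (1, value)


def reduce_pairs(a, b):
    """Combine two (count, sum) pairs; works on raw or already-reduced pairs."""
    return a[0] + b[0], a[1] + b[1]


def run(readings, previous=None):
    """Reduce new readings, optionally merging into earlier reduced totals."""
    totals = dict(previous or {})  # copy so earlier results stay untouched
    for reading in readings:
        key, value = map_reading(reading)
        totals[key] = reduce_pairs(totals[key], value) if key in totals else value
    return totals


first = run([("hr", 120), ("hr", 130), ("spo2", 96)])
# Later, one more reading arrives: only the new batch is mapped and reduced.
updated = run([("hr", 140)], previous=first)
mean_hr = updated["hr"][1] / updated["hr"][0]
```

Keeping (count, sum) pairs rather than a finished average is what makes the incremental merge possible.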

Does XML have any impact on MapReduce application design?

XML is a machine- and human-readable data format (Smith, 2012). With a goal of using XML for MapReduce, we need to assume that we need to map and reduce huge files (Eini, 2010; Smith, 2012). Unfortunately, XML doesn’t include sync markers in the data format, and therefore MapReduce doesn’t natively support XML (Smith, 2012). There are posts by coders who use workarounds to allow for XML processing in Hadoop (Atom, 2010; Krishna, 2014; Rohit, 2013; Smith, 2012).  Smith (2012) and Rohit (2013) used the XmlInputFormat class from Mahout to bring XML input data into HBase.  So, the path the data scientist chooses determines how much work is needed to be able to use MapReduce: code a new way of reading, mapping, and reducing XML data from scratch, or use libraries from other code that is compatible with Hadoop.  Smith (2012) stated that Mahout’s code needs to know the exact sequence of XML start and end tags that will be searched for, and that elements with attributes are hard for Mahout’s XML library to detect and parse. Depending on the complexity of the XML document, Smith’s (2012) statement may mean that more complex XML input code is needed.  Therefore, a well-designed XML document could make this process a bit easier, but the complexity of the data stored in it will make the task of creating code for using MapReduce on XML data harder.  Finally, Smith (2012) recommended a preprocessing step that converts the XML data so that each record is treated as a single line, a format that libraries native to MapReduce can handle.
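
A preprocessing step of the kind described above might look like the following sketch, which flattens each XML record into one tab-separated line using Python's standard library. The `SAMPLE` document, its tag names, and the `xml_to_lines` helper are hypothetical; this is not Smith's (2012) code or Mahout's XmlInputFormat.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample input; a real job would read this from a file.
SAMPLE = """<readings>
  <reading sensor="hr"><value>120</value><unit>bpm</unit></reading>
  <reading sensor="spo2"><value>96</value><unit>%</unit></reading>
</readings>"""


def xml_to_lines(xml_text, record_tag, sep="\t"):
    """Turn every <record_tag> element into one delimited line of text."""
    root = ET.fromstring(xml_text)
    lines = []
    for record in root.iter(record_tag):
        # Attribute values first (sorted by attribute name), then child text.
        attrs = [record.attrib[name] for name in sorted(record.attrib)]
        children = [(child.text or "").strip() for child in record]
        lines.append(sep.join(attrs + children))
    return lines


lines = xml_to_lines(SAMPLE, "reading")
```

This version parses the whole document in memory; for the huge files assumed in the text, a streaming parser such as `ET.iterparse` would be the more realistic choice.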

References

Data Tools: XML Design

A design document helps communicate to others what you want to design, your design decisions and the rationale for those decisions. There are many ways to present a design document. Here are some ways to design a good XML document.

Good XML Design Documentation for improved performance

Five questions must be asked before designing an XML data document (Font, 2010):

  1. Will this document be part of a solution?
  2. Will this document have design standards that must be followed?
  3. What part may change over time?
  4. To what extent is human readability or machine readability important?
  5. Will there be a massive amount of data? Does file size matter?

All XML data documents should be versioned, and key stakeholders should be involved in the XML data design process (Font, 2010).

A few rules (not a comprehensive list) on making a good XML design:

  1. Be consistent with your design and design for extensions by multiple people (Google, 2008; Font, 2010).
  2. Reuse existing XML formats (Google, 2008)
  3. Tag each unit of information, maintain a minimal amount of text that can be processed as a whole (Harold, 2003)
    1. An element has a start tag and an end tag, e.g., <menu></menu>, whereas an attribute describes an element from inside its tag, e.g., <menu-item portion-size="500" portion-units="g"></menu-item>; therefore an attribute provides some properties to the element (Harold, 2003; Ogbuji, 2004).
    2. The principle of core content: Know when to use an element versus an attribute. Use elements when the information is an essential part of the material, and use an attribute if the information is peripheral or incidental to the main message. Essentially, “Data goes in elements, metadata in attributes” (Ogbuji, 2004; Oracle, n.d.).
      1. Elements must be in a namespace, and attributes shouldn’t be in a namespace (Google, 2008)
    3. Avoid implicit structures, which occurs by the addition of white space (Harold, 2003)
      1. This can be seen easily with names, where white spaces are seen between the first name, middle name, and last name. Ogbuji (2004b) suggested using well-established elements like <firstname/>; <othername/>; <surname/>; <forename/>; <rolename/>; <namelink/>; <genname/>; and <addname/> to address the eccentricities of a person’s name from various cultures.
      2. Post office addresses pose this same issue, so Ogbuji (2004b) suggested these established elements: <street/>; <postcode/>; <pob/>; <city/>; <state/>; <country/>; <otheraddr/>; <phone/>; <fax/>; and <email/>.
    4. Use a standard and accepted element reference guide like the DocBook Element Reference (Walsh & Muellner, 2006), or something similar, and stick with that convention.
      1. Use published standard abbreviations for constructing names (Google, 2008; Walsh & Muellner, 2006)
    5. Avoid using hyphens (“-”) in your naming convention (Font, 2010)
    6. Avoid the use of boolean values (Google, 2008)
    7. Keep the document structure readable (principle of readability); do not make it too troublesome to process or read (Harold, 2003; Ogbuji, 2004). For example, use elements for readability and understandability by humans, and attributes for machine consumption (Ogbuji, 2004).
    8. Comments should not be used to contain data, but rather notes such as to-dos (Google, 2008)

Example of an XML Document (W3 Schools, n.d.)

<?xml version="1.0" encoding="UTF-8"?>
 
 <shiporder orderid="889923"
 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
 xsi:noNamespaceSchemaLocation="shiporder.xsd">
   <orderperson>John Smith</orderperson>
   <shipto>
     <name>Ola Nordmann</name>
     <address>Langgt 23</address>
     <city>4000 Stavanger</city>
     <country>Norway</country>
   </shipto>
   <item>
     <title>Empire Burlesque</title>
     <note>Special Edition</note>
     <quantity>1</quantity>
     <price>10.90</price>
   </item>
   <item>
     <title>Hide your heart</title>
     <quantity>1</quantity>
     <price>9.90</price>
   </item>
 </shiporder>
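
To see how the element/attribute split affects the code that consumes a document, here is a small sketch that reads a trimmed copy of the order above with Python's standard library. The trimming (no XML declaration or schema attributes) and the variable names are choices made for this illustration.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

# Trimmed copy of the shiporder document (schema attributes omitted).
ORDER = """<shiporder orderid="889923">
  <orderperson>John Smith</orderperson>
  <shipto>
    <name>Ola Nordmann</name>
    <address>Langgt 23</address>
    <city>4000 Stavanger</city>
    <country>Norway</country>
  </shipto>
  <item>
    <title>Empire Burlesque</title>
    <note>Special Edition</note>
    <quantity>1</quantity>
    <price>10.90</price>
  </item>
  <item>
    <title>Hide your heart</title>
    <quantity>1</quantity>
    <price>9.90</price>
  </item>
</shiporder>"""

root = ET.fromstring(ORDER)
order_id = root.get("orderid")        # attributes are read with .get()
city = root.findtext("shipto/city")   # element text is read by path
items = root.findall("item")
# Decimal avoids floating-point rounding when adding prices.
total = sum(Decimal(i.findtext("price")) * int(i.findtext("quantity")) for i in items)
notes = [i.findtext("note") for i in items]  # None where <note> is absent
```

Optional elements such as <note> come back as `None`, so consumers have to handle their absence explicitly.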

Analysis of XML design document from the user’s perspective for improved performance

It would be best for <shipto/> to contain only the address information rather than two major datasets (name and address), which would be designing for extensibility as in Rule 1.  The tags <note/> and <price/> should be attributes of <title> per Rules 3a and 3b. Rule 4a was not followed for <name>Ola Nordmann</name>.  Quantity is not an attribute of <item> and thus should be a child element of <item> per Rule 4b. Tags like <name/>, <item/>, <quantity/>, and <price/> do not follow a naming convention as stated in Rule 5, but they could come from a naming convention that is internal to this company, so this one is hard to evaluate without much more information. Rules 6-9 were kept in this example.

References                                          

Data tools: Analysis of big data involving text mining

The demand for big data talent is growing, and there is a shortage of data analytics talent in the United States. Because data analytics is used by many different industries and is an interdisciplinary field, learning and teaching it requires careful planning. This post discusses how big data analytics was implemented in a given case study.

Definitions

Big data – any set of data that has high velocity, volume, and variety, also known as the 3Vs (Davenport & Dyche, 2013; Fox & Do, 2013; Podesta, Pritzker, Moniz, Holdren, & Zients, 2014).

Text mining – a process that involves discovering implicit knowledge from unstructured textual data (Gera & Goel, 2015; Hashimi & Hafez, 2015; Nassirtoussi, Aghabozorgi, Wah, & Ngo, 2015).

Case study: Basole, Seuss, and Rouse (2013). IT innovation adoption by enterprises: Knowledge discovery through text analytics.

The goal of this study was to use text mining techniques on 472 quality peer-reviewed articles that spanned 30 years of knowledge (1977-2008).  The articles were selected because they focused on the adoption of IT innovation; focused on the enterprise, organization, or firm; used rigorous research methods; and were published in leading journals.  The reason to go through all this analysis is to prove the usefulness of text analytics for literature reviews.  In 2016, most literature reviews contain recent literature from the last five years, yet in certain fields it may not be enough to focus on just the last five years.  Extending the literature search beyond this 5-year period requires a great deal of attention and manual labor, which makes the already lengthy literature review an even more time-consuming endeavor than before. So, the authors’ question is whether it is possible to use text mining to conduct a more thorough review of the body of knowledge that extends beyond the typical five years on any subject matter.  They argue that the time it takes to conduct this tedious task could benefit from automation.  However, this should be thought of as a first pass through the literature. Thinking of it as a first pass allows for the generation of new research questions and ideas, which drives further analysis.

In the end, the study was able to conclude that cost and complexity were two of the most frequent determinants of IT innovation adoption from the perspective of an IT department.  Other determinants for IT departments were the complexity, capability, and relative advantage of the innovation.  However, when going up one level of abstraction to the enterprise/organizational level, the perceived benefits and usefulness were the main determinants of IT innovation.  Ease of use of the technology was a big deal for the organization.  When comparing IT innovation with costs, there was a negative correlation between the two, while IT innovation has a positive correlation with organization size and top management support.

How was big data analytics learned, taught, and used in the case study?

The research approach for this study was: (1) document identification and extraction, (2) document classification and coding, (3) document analysis and knowledge discovery (key terms, co-occurrence), and (4) research gap identification.

Analysis of the data consisted of classifying the data into four time periods (bins): 1977-1979; 1980-1989; 1990-1999; and 2000-2008, and using a classification scheme based on existing taxonomies (case study, content analysis, field experiment, field study, frameworks and conceptual model, interview, laboratory experiment, literature analysis, mathematical model, qualitative research, secondary data, speculation/commentary, and survey).  Data was also classified by functional discipline (information systems and computer science, decision science, management and organization sciences, economics, and innovation) and finally by IT innovation (software, hardware, networking infrastructure, and the tool’s IT term catalog). This study used a tool called Northernlight (http://georgiatech.northernlight.com/).

The study aims to use the bag-of-words technique and word proximity to other words (or their equivalents) to help extract meaning from a large set of text-based documents.  The bag-of-words technique is known for counting and identifying key terms and phrases, which helps uncover themes.  The simplest way of thinking of the bag-of-words technique is as word frequencies in a document.
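
As a minimal sketch of the word-frequency view of bag-of-words, the snippet below counts terms in a made-up sentence. The tokenizer, the tiny stop-word list, and the sample text are assumptions for illustration; the case study itself used the Northernlight tool, not this code.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = frozenset({"the", "of", "and", "a", "to", "in", "is"})


def bag_of_words(text):
    """Count how often each non-stop-word appears, ignoring word order."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(token for token in tokens if token not in STOP_WORDS)


doc = "Cost and complexity drive adoption. The cost of adoption is high."
counts = bag_of_words(doc)
top_terms = [term for term, _ in counts.most_common(2)]
```

The most frequent terms are candidate themes, but the counts alone say nothing about how the terms relate to each other.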

However, understanding the meaning behind the themes means studying the context in which the words are located and relating them to other themes, also called co-occurrence of terms.  The best way of doing this meaning extraction is to measure the strength/distance between the themes.  Finally, the researchers in this study can set minimums and maximums that enhance the meaning extraction algorithm to garner insights into IT innovation while reducing the overall noise in the final results. The researchers set the following rules for co-occurrences between themes:

  • There are approximately 40 words per sentence
  • There are approximately 150 words per paragraph

How could this implementation of big data have been improved upon?

Goldbloom (2016) stated that using big data techniques (machine learning) works best on big data that requires classifying, and that it breaks down when the task is too small and specialized and is therefore prime for human analysis only.  This study only looked at 472 articles. Is this considered big enough for analysis, or should the analysis go back through multiple years beyond just the 30 years (Basole et al., 2013)?  What is considered big data in 2013 (the time of this study) may not be big data in 2023 (Fox & Do, 2013).

Mei and Zhai (2005) observed how terms and term frequencies evolved over time and graphed them by year, rather than binning the data into four different groups as in Basole et al. (2013).  This case study could have shown how cost and complexity in IT innovation changed over time.  Graphing the results similarly to Mei and Zhai (2005) and Yoon and Song (2014) would also allow for an analysis of IT innovation themes and whether each of these themes is in an Introduction, Growth, Majority, or Decline mode.

References
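
A per-year tally of the kind suggested here could be sketched as follows. The documents and years are made up for illustration, and the function is not the method of Mei and Zhai (2005); it only shows the difference between counting by year and binning by decade.

```python
import re
from collections import Counter, defaultdict


def term_counts_by_year(docs, terms):
    """Count tracked terms per publication year from (year, text) pairs."""
    by_year = defaultdict(Counter)
    for year, text in docs:
        for token in re.findall(r"[a-z]+", text.lower()):
            if token in terms:
                by_year[year][token] += 1
    return {year: dict(counts) for year, counts in sorted(by_year.items())}


# Hypothetical article snippets tagged with their publication year.
docs = [
    (1985, "Cost dominates; cost matters."),
    (1999, "Complexity and cost."),
    (1999, "Complexity rises."),
]
trend = term_counts_by_year(docs, {"cost", "complexity"})
```

Plotting each term's count against the year would show whether a theme is rising or declining, which a four-bin summary can hide.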

  • Basole, R. C., Seuss, C. D., & Rouse, W. B. (2013). IT innovation adoption by enterprises: Knowledge discovery through text analytics. Decision Support Systems, 54, 1044-1054. Retrieved from http://www.sciencedirect.com.ctu.idm.oclc.org/science/article/pii/S0167923612002849
  • Davenport, T. H., & Dyche, J. (2013). Big Data in Big Companies. International Institute for Analytics, (May), 1–31.
  • Fox, S., & Do, T. (2013). Getting real about Big Data: applying critical realism to analyse Big Data hype. International Journal of Managing Projects in Business, 6(4), 739–760. http://doi.org/10.1108/IJMPB-08-2012-0049
  • Gera, M., & Goel, S. (2015). Data Mining-Techniques, Methods and Algorithms: A Review on Tools and their Validity. International Journal of Computer Applications, 113(18), 22–29.
  • Goldbloom, A. (2016). The jobs we’ll lose to machines –and the ones we won’t. TED. Retrieved from http://www.ted.com/talks/anthony_goldbloom_the_jobs_we_ll_lose_to_machines_and_the_ones_we_won_t
  • Hashimi, H., & Hafez, A. (2015). Selection criteria for text mining approaches. Computers in Human Behavior, 51, 729–733. http://doi.org/10.1016/j.chb.2014.10.062
  • Mei, Q., & Zhai, C. (2005). Discovering evolutionary theme patterns from text: an exploration of temporal text mining. Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, 198–207. http://doi.org/10.1145/1081870.1081895
  • Nassirtoussi, A. K., Aghabozorgi, S., Wah, T. Y., & Ngo, D. C. L. (2015). Text-mining of news-headlines for FOREX market prediction: a multi-layer dimension reduction algorithm with semantics and sentiment. Expert Systems with Applications, 42(1), 306-324.
  • Podesta, J., Pritzker, P., Moniz, E. J., Holdren, J., & Zients, J. (2014). Big Data: Seizing Opportunities. Executive Office of the President of USA, 1–79.
  • Yoon, B., & Song, B. (2014). A systematic approach of partner selection for open innovation. Industrial Management & Data Systems, 114(7), 1068.

Data Tools: WEKA

Many tools are used for the purpose of data analytics. WEKA is one of those free tools in the market.

WEKA

The Waikato Environment for Knowledge Analysis (WEKA) is a Java-based, open-source, and platform-independent tool for data preprocessing, predictive data analytics, and facilitating interpretation and evaluation (Dogan & Tanrikulu, 2013; Gera & Goel, 2015; Miranda, n.d.; Xia & Gong, 2014).  It was originally developed for analyzing agricultural data and has evolved to house a comprehensive collection of data preprocessing and modeling techniques (Patel & Donga, 2015).  It is a Java-based collection of machine learning algorithms for data mining tasks as well as text mining that can be used for predictive modeling, housing pre-processing, classification, regression, clustering, association rules, and visualization (WEKA, n.d.). In particular, WEKA includes the C4.5 decision tree predictive data analytics algorithm (Dogan & Tanrikulu, 2013; Gera & Goel, 2015; Hachey & Grover, 2006; Kumar & Fet, 2011). WEKA is an open-source data and text mining software tool and is thus free to use; there are no costs associated with this software solution.

WEKA can be applied to big data (WEKA, n.d.) and SQL Databases (Patel & Donga, 2015). Subsequently, WEKA has been used in many research studies that are involved in big data analytics (Dogan & Tanrikulu, 2013; Gera & Goel, 2015; Hachey & Grover, 2006; Kumar & Fet, 2011; Parkavi & Sasikumar, 2016; Xia & Gong, 2014). For instance, Barak and Modarres (2015) used WEKA for decision tree analysis on predicting stock risks and returns.

The fact that it has been used in this many research studies suggests that the reliability and validity of the software are high and well established.  In a study comparing WEKA with 12 other data analytics tools, WEKA was one of only two apps studied that had classification, regression, and clustering algorithms (Gera & Goel, 2015).

A disadvantage of using this tool is its lack of support for multi-relational data mining, but if one can link all the multi-relational data into one table, it can do its job (Patel & Donga, 2015). The comprehensiveness of its analysis algorithms for both data and text mining and pre-processing is its advantage. Another disadvantage of WEKA is that it cannot handle raw data directly, meaning the data has to be preprocessed before it is entered into the software package and analyzed (Hoonlor, 2011). WEKA cannot even import Excel files; data in Excel has to be converted into CSV format to be usable within the system (Miranda, n.d.).
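
The workaround of linking multi-relational data into one table and exporting it as CSV can be sketched with Python's standard library. The two tables, their column names, and the helper functions below are hypothetical examples, not WEKA functionality.

```python
import csv
import io

# Two hypothetical related tables that share the key "patient_id".
patients = [
    {"patient_id": "1", "age_days": "12"},
    {"patient_id": "2", "age_days": "30"},
]
readings = [
    {"patient_id": "1", "heart_rate": "150"},
    {"patient_id": "1", "heart_rate": "155"},
    {"patient_id": "2", "heart_rate": "140"},
]


def flatten(parents, children, key):
    """Join each child row with its parent row to get one flat table."""
    lookup = {row[key]: row for row in parents}
    return [{**lookup[child[key]], **child} for child in children if child[key] in lookup]


def to_csv(rows):
    """Serialize the flat rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


flat = flatten(patients, readings, "patient_id")
csv_text = to_csv(flat)
```

Writing `csv_text` to a file gives a single-table CSV of the kind the text says WEKA needs.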

References:

  • Dogan, N., & Tanrikulu, Z. (2013). A comparative analysis of classification algorithms in data mining for accuracy, speed and robustness. Information Technology and Management, 14(2), 105-124. doi:http://dx.doi.org/10.1007/s10799-012-0135-7
  • Gera, M., & Goel, S. (2015). Data Mining -Techniques, Methods and Algorithms: A Review on Tools and their Validity. International Journal of Computer Applications, 113(18), 22–29.
  • Hoonlor, A. (2011). Sequential patterns and temporal patterns for text mining. UMI Dissertation Publishing.
  • Kumar, D., & Fet, D. (2011). Performance Analysis of Various Data Mining Algorithms: A Review. International Journal of Computer Applications, 32(6), 9–16.
  • Miranda, S. (n.d.). An Introduction to Social Analytics : Concepts and Methods.
  • Parkavi, S. & Sasikumar, S. (2016). Prediction of Commodities Market by Using Data Mining Technique. i-Manager’s Journal on Computer Science.
  • Patel, K., & Donga, J. (2015). Practical Approaches: A Survey on Data Mining Practical Tools. Foundations, 2(9).
  • WEKA (n.d.) WEKA 3: Data Mining Software in Java. Retrieved from http://www.cs.waikato.ac.nz/ml/weka/
  • Xia, B. S., & Gong, P. (2014). Review of business intelligence through data analysis. Benchmarking, 21(2), 300–311. http://doi.org/10.1108/BIJ-08-2012-0051

Data Tools: Hadoop Vs Spark

The Hadoop ecosystem is rapidly evolving. Apache Spark is a recent addition to the Hadoop ecosystem. Both help with the traditional challenges of storing and processing large data sets.

Apache Spark

Apache Spark started from a working group inside and outside of UC Berkeley, in search of an open-source alternative to MapReduce’s batch processing model that is suited to multi-pass algorithms (Zaharia et al., 2012). Spark applications can be written in Java, Scala, Python, and R, and Spark interfaces with SQL, which increases ease of use (Spark, n.d.; Zaharia et al., 2012).

Essentially, Spark is a high-performance computing cluster framework, but it doesn’t have its own distributed file system and thus uses the Hadoop Distributed File System (HDFS) and HBase as input and output (Gu & Li, 2013).  Not only can it access data from HDFS and HBase, it can also access data from Cassandra, Hive, Tachyon, and any other Hadoop data source (Spark, n.d.).  However, Spark uses its own data structure, called Resilient Distributed Datasets (RDD), which caches data and is read-only, to improve its processing time as long as there is enough memory for it across all the nodes of a cluster (Gu & Li, 2013; Zaharia et al., 2012). Spark tries to avoid reloading data from disk, which is why it stores initial and intermediate results in the nodes’ cache (Gu & Li, 2013).

RDDs can be rebuilt if lost, which makes them fault-tolerant without requiring replication (Gu & Li, 2013; Zaharia et al., 2012).  Each RDD is tracked in a lineage graph, and Spark reruns the operations if data becomes lost, thereby reconstructing the data, even if all the nodes running Spark were to fail (Zaharia et al., 2012).
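
The lineage idea can be illustrated with a toy class in plain Python. `ToyRDD` is a hypothetical stand-in and not Spark's API: each dataset remembers its parent and the function that produced it, so a lost cached result can be recomputed instead of being restored from a replica.

```python
class ToyRDD:
    """Toy read-only dataset that records how it was derived (its lineage)."""

    def __init__(self, source=None, parent=None, fn=None):
        self._source = source  # original data, only set on the root dataset
        self._parent = parent  # dataset this one was derived from
        self._fn = fn          # function applied to each parent element
        self._cache = None     # in-memory result, which may be lost

    def map(self, fn):
        # Transformations never modify data; they return a new dataset.
        return ToyRDD(parent=self, fn=fn)

    def collect(self):
        if self._cache is None:  # nothing cached: rebuild from lineage
            if self._parent is None:
                self._cache = list(self._source)
            else:
                self._cache = [self._fn(x) for x in self._parent.collect()]
        return list(self._cache)

    def lose_cache(self):
        """Simulate losing the in-memory copy, e.g. after a node failure."""
        self._cache = None


base = ToyRDD(source=[1, 2, 3])
doubled = base.map(lambda x: x * 2)
plus_one = doubled.map(lambda x: x + 1)
before = plus_one.collect()
doubled.lose_cache()
plus_one.lose_cache()
after = plus_one.collect()  # recomputed by replaying the recorded steps
```

The rebuild works only because the root data is still reachable, which in Spark's case means the underlying storage such as HDFS.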

Hadoop

Hadoop is a Java-based system that allows manipulation and calculations to be done by calling MapReduce functions on its HDFS (Hortonworks, 2013; IBM, n.d.).

In HDFS, big data is broken up into smaller blocks across different locations; no matter the type or amount of data, each of these blocks can still be located, and they can be aggregated like a set of Legos throughout a distributed database system (IBM, n.d.; Minelli, Chambers, & Dhiraj, 2013). Data blocks are distributed across multiple servers.  This block system provides an easy way to scale the data needs of the company up or down and allows MapReduce to do its tasks on smaller sets of the data for faster processing (IBM, n.d.). IBM (n.d.) boasts that the data blocks in HDFS are small enough that they can be easily duplicated (for disaster recovery purposes) on two different servers (or more, depending on your data needs), offering fault tolerance as well. Therefore, IBM’s (n.d.) MapReduce functions use HDFS to run their procedures on the server on which the data is stored, where data is stored in memory, not in cache, allowing for continuous service.

MapReduce contains two job types that work in parallel on distributed systems: (1) Mappers, which create and process transactions on the system by mapping/aggregating data by key values, and (2) Reducers, which, knowing what the key value is, take all the values stored in a map and reduce the data to what is relevant (Hortonworks, 2013; Sathupadi, 2010). Reducers can work on different keys, and when huge amounts of data are entered into MapReduce, the Mapper maps the data, which is then shuffled and sorted before it is reduced (Hortonworks, 2013).  Once the data is reduced, the researcher gets the output that they sought.
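
The block-and-replica idea can be shown with a toy sketch. The round-robin placement, the tiny block size, and the server names are assumptions made for illustration; real HDFS uses much larger blocks and its own placement policy.

```python
def split_into_blocks(data, block_size):
    """Cut a byte string into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(num_blocks, servers, replicas=2):
    """Assign each block to `replicas` distinct servers, round-robin."""
    if replicas > len(servers):
        raise ValueError("need at least as many servers as replicas")
    return {
        block_id: [servers[(block_id + r) % len(servers)] for r in range(replicas)]
        for block_id in range(num_blocks)
    }


data = b"abcdefghij"
blocks = split_into_blocks(data, block_size=4)
placement = place_blocks(len(blocks), ["s1", "s2", "s3"])

# If one server is lost, every block should still have a surviving copy.
survives_s1_loss = all(
    any(server != "s1" for server in holders) for holders in placement.values()
)
```

With two replicas per block, the loss of any single server leaves the data readable, which is the fault tolerance described above.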
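
The map, shuffle/sort, and reduce stages can be sketched in a single process with plain Python. This is a teaching toy under the assumption of a word-count job, not Hadoop's Java API, and it runs sequentially rather than in parallel.

```python
from collections import defaultdict


def mapper(line):
    """Map stage: emit a (word, 1) pair for every word in the line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle/sort stage: group all values by key, with keys in sorted order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reducer(key, values):
    """Reduce stage: collapse all the values for one key into a total."""
    return key, sum(values)


def map_reduce(lines):
    mapped = (pair for line in lines for pair in mapper(line))
    return dict(reducer(key, values) for key, values in shuffle(mapped).items())


word_counts = map_reduce(["spark hadoop", "hadoop hdfs hadoop"])
```

Because each key is reduced independently, the reduce calls for different keys could run on different machines.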

Significant Differences between Hadoop and Apache Spark

Spark is faster than Hadoop in iterative operations by 25x-40x for really small datasets and by 3x-5x for relatively large datasets, but Spark is more memory intensive, and its speed advantage disappears when available memory runs out with really large datasets (Gu & Li, 2013).  Apache Spark boasts on its website that it can run programs 100x faster than Hadoop’s MapReduce in memory (Spark, n.d.). Spark outperforms Hadoop by 10x on iterative machine learning jobs (Gu & Li, 2013). Also, Spark runs 10x faster than Hadoop on disk (Spark, n.d.).

Gu and Li (2013) recommend that if speed to the solution is not an issue but memory is, then Spark shouldn’t be prioritized over Hadoop; however, if speed to the solution is critical and the job is iterative, Spark should be prioritized.

References

  • Gu, L., & Li, H. (2013). Memory or time: Performance evaluation for iterative operation on hadoop and spark. In High Performance Computing and Communications & 2013 IEEE International Conference on Embedded and Ubiquitous Computing (HPCC_EUC), 2013 IEEE 10th International Conference on (pp. 721-727). IEEE.

Data Tools: Case Study on Hadoop’s effectiveness

Hadoop and Spark allow the storing of very large files, each with a unique approach to how files are stored and accessed. This post identifies a real-life case study where Hadoop was used in meteorology.

Case Study: Open source Cloud Computing Tools: A case study with a weather application

Focus on: Hadoop V0.20, a Platform as a Service cloud solution with parallel processing capabilities

Cluster size: 6 nodes, with Hadoop, Eucalyptus, and Django-Python cloud interfaces installed

Variables: Managing historical average temperature, rainfall, humidity data, and weather conditions per latitude and longitude across time and mapping it on top of a Google Maps user interface

Data Source: Yahoo! Weather Page

Results/Benefits to the Industry:  The Hadoop platform was evaluated against ten different criteria and compared to Eucalyptus and Django-Python on a scale of 0-3, where 0 “indicates [a] lack of adequate feature support” and 3 “indicates that the particular tool provides [an] adequate feature to fulfill the criterion.”

Table 1: The criterion matrix and numerical scores have been adopted from Greer, Rodriguez-Martinez, and Seguel (2010) results.

Criterion | Description | Score
Management Tools | Tools to deploy, configure, and maintain the system | 0
Development Tools | Tools to build new applications or features | 3
Node Extensibility | Ability to add new nodes without re-initialization | 3
Use of Standards | Use of TCP/IP, SSH, etc. | 3
Security | Built-in security as opposed to use of 3rd party patches | 3
Reliability | Resilience to failures | 3
Learning Curve | Time to learn technology | 2
Scalability | Capacity to grow without degrading performance | Not assessed
Cost of Ownership | Investments needed for usage | 2
Support | Availability of 3rd party support | 3
Total | | 22

Eucalyptus scored 18, and Django-Python scored 20, therefore making Hadoop a better solution for this case study.  The study mentioned that:

  • Management tools: configuration was done by hand with XML and text rather than a graphical user interface
  • Development tools: an Eclipse plug-in aids in debugging Hadoop applications
  • Node Extensibility: Hadoop can accept new nodes with no interruption in service
  • Use of standards: uses TCP/IP, SSH, SQL, JDK 1.6 (Java Standard), Python V2.6, and Apache tools
  • Security: password-protected user accounts and encryption
  • Reliability: fault tolerance is present, and the user is shielded from the effects
  • Learning curve: it is not intuitive and required some experimentation after practicing from online tutorials
  • Scalability: not assessed due to the limits of the study (6 nodes is not enough)
  • Cost of Ownership: to be effective, Hadoop needs a cluster, even if it is made of cheap machines
  • Support: there is third-party support for Hadoop
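
As a quick arithmetic check on the scores reported above, the snippet below sums Hadoop's per-criterion scores and compares the total with the two other tools. The dictionary layout is just one way to hold the numbers; scalability is stored as `None` because it was not assessed.

```python
# Hadoop's scores per criterion as reported in the post.
HADOOP_SCORES = {
    "Management Tools": 0,
    "Development Tools": 3,
    "Node Extensibility": 3,
    "Use of Standards": 3,
    "Security": 3,
    "Reliability": 3,
    "Learning Curve": 2,
    "Scalability": None,  # not assessed in the 6-node study
    "Cost of Ownership": 2,
    "Support": 3,
}


def total(scores):
    """Sum the scores, skipping criteria that were not assessed."""
    return sum(score for score in scores.values() if score is not None)


# Totals for the other two tools are taken directly from the text.
totals = {"Hadoop": total(HADOOP_SCORES), "Eucalyptus": 18, "Django-Python": 20}
best = max(totals, key=totals.get)
```

The nine assessed criteria add up to the reported total of 22, so the missing scalability score does not affect the ranking as published.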

The authors discuss how Hadoop fails to provide a real-time response and suggest that part of the batch code should include email notifications to be sent out at the start, at key points of the iteration, or at the end of the job when the output is ready.  Hadoop is slower than the other two solutions that were evaluated, but the fault tolerance features make up for it.  For set-up and configuration, Hadoop is simple to use.

Was Hadoop used in the most ample manner?

Hadoop was not fully used, in my opinion and the opinion of the authors, who stated that they could not scale their research because the study was limited to a 6-node cluster. Hadoop is built for big data sets from various sources, formats, etc. to be ingested and processed to help deliver data-driven insights, and the scalability features that serve this purpose were not adequately addressed in this study.

Resources

  • Greer, M., Rodriguez-Martinez, M., & Seguel, J. (2010). Open Source Cloud Computing Tools: A Case Study with a Weather Application. Florida: IEEE Open Source Cloud Computing.