Data Tools: Hadoop Vs Spark

The Hadoop ecosystem is rapidly evolving, and Apache Spark is a recent addition to it. Both help with the traditional challenges of storing and processing large data sets.


Apache Spark

Apache Spark started from a working group inside and outside of UC Berkeley that was searching for an open-source alternative to MapReduce's batch-processing model for multi-pass algorithms (Zaharia et al., 2012). Spark applications can be written in Java, Scala, Python, and R, and Spark interfaces with SQL, which increases ease of use (Spark, n.d.; Zaharia et al., 2012).

Essentially, Spark is a high-performance cluster computing framework, but it doesn't have its own distributed file system and thus uses the Hadoop Distributed File System (HDFS) or HBase for input and output (Gu & Li, 2013).  Not only can it access data from HDFS and HBase, it can also access data from Cassandra, Hive, Tachyon, and any other Hadoop data source (Spark, n.d.).  However, Spark uses its own data structure, the Resilient Distributed Dataset (RDD), which caches data as read-only partitions to improve processing time, as long as there is enough memory across all the nodes of the cluster (Gu & Li, 2013; Zaharia et al., 2012). Spark tries to avoid reloading data from disk, which is why it stores initial and intermediate results in each node's memory (Gu & Li, 2013).

Lost partitions can be rebuilt on other machines in the cluster, making RDDs fault-tolerant without requiring replication (Gu & Li, 2013; Zaharia et al., 2012).  Each RDD is tracked in a lineage graph, and the operations that created it are rerun if data is lost, so the data can be reconstructed even if all the nodes running Spark were to fail (Zaharia et al., 2012).
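
Below is a minimal PySpark sketch of the caching and lineage ideas described above; the HDFS path, the application name, and the per-line computation are illustrative assumptions, and it presumes a working Spark installation.

# Minimal PySpark sketch: caching an RDD and reusing it across iterations.
# The HDFS path and application name are hypothetical placeholders.
from pyspark import SparkContext

sc = SparkContext(appName="rdd-cache-demo")

# Build an RDD from HDFS; nothing is read yet (transformations are lazy).
lines = sc.textFile("hdfs:///data/sample.txt")
lengths = lines.map(lambda s: len(s))

# cache() marks the RDD to be kept in memory after the first action,
# so later iterative passes avoid re-reading from disk.
lengths.cache()

for i in range(3):
    # Each pass reuses the in-memory partitions; if a node is lost,
    # Spark recomputes only the missing partitions from the lineage graph.
    print(i, lengths.sum())

sc.stop()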

Hadoop

Hadoop is a Java-based system that allows manipulation and calculations to be done by calling MapReduce functions on data stored in its HDFS file system (Hortonworks, 2013; IBM, n.d.).

In HDFS, big data is broken up into smaller blocks across different locations; no matter the type or amount of data, each of these blocks can still be located and aggregated, like a set of Legos, throughout a distributed database system (IBM, n.d.; Minelli, Chambers, & Dhiraj, 2013). Data blocks are distributed across multiple servers.  This block system provides an easy way to scale the company's data needs up or down and allows MapReduce to do its tasks on smaller subsets of the data for faster processing (IBM, n.d.). IBM (n.d.) boasts that the data blocks in HDFS are small enough to be easily duplicated (for disaster recovery purposes) across two or more servers, depending on the data needs, offering fault tolerance as well. MapReduce functions therefore use HDFS to run their procedures on the server where the data is stored, in memory rather than in cache, allowing for continuous service (IBM, n.d.).

MapReduce contains two job types that work in parallel on distributed systems: (1) mappers, which create and process transactions on the system by mapping/aggregating data by key values, and (2) reducers, which know what those key values are and take all the values stored in the map output and reduce the data to what is relevant (Hortonworks, 2013; Sathupadi, 2010). Different reducers can work on different keys; when huge amounts of data are entered into MapReduce, the mappers map the data, which is then shuffled and sorted before it is reduced (Hortonworks, 2013).  Once the data is reduced, the researcher gets the output they sought.
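
To make the mapper/reducer split concrete, here is a minimal word-count pair written in Python for Hadoop Streaming; the file names (mapper.py, reducer.py) are illustrative assumptions and are not part of the cited sources.

# mapper.py -- a hypothetical Hadoop Streaming mapper: emit a (word, 1) pair per word
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(word + "\t1")

# reducer.py -- a hypothetical Hadoop Streaming reducer: Hadoop shuffles and sorts
# the mapper output by key, so counts for the same word arrive on consecutive lines
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        count += int(value)
    else:
        if current_word is not None:
            print(current_word + "\t" + str(count))
        current_word, count = word, int(value)
if current_word is not None:
    print(current_word + "\t" + str(count))

Such a pair would typically be submitted through the Hadoop Streaming jar with -mapper, -reducer, -input, and -output options, although the exact jar path and option set depend on the Hadoop version installed.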

Significant Differences between Hadoop and Apache Spark              

Spark is faster than Hadoop on iterative operations by 25x-40x for really small datasets and 3x-5x for relatively large datasets, but Spark is more memory intensive, and the speed advantage disappears when available memory runs out on really large datasets (Gu & Li, 2013).  Apache Spark's website boasts that it can run programs 100x faster than Hadoop's MapReduce in memory and 10x faster on disk (Spark, n.d.). Spark also outperforms Hadoop by 10x on iterative machine learning jobs (Gu & Li, 2013).

Gu and Li (2013) recommend that if speed to the solution is not an issue but memory is, then Spark shouldn't be prioritized over Hadoop; however, if speed to the solution is critical and the job is iterative, Spark should be prioritized.

References

  • Gu, L., & Li, H. (2013). Memory or time: Performance evaluation for iterative operation on Hadoop and Spark. In High Performance Computing and Communications & 2013 IEEE International Conference on Embedded and Ubiquitous Computing (HPCC_EUC), 2013 IEEE 10th International Conference on (pp. 721-727). IEEE.

Data Tools: Case Study on Hadoop’s effectiveness

Hadoop and Spark allow very large files to be stored and take a unique approach to how files are stored and accessed. This post identifies a real-life case study where Hadoop was used in meteorology.

Case Study: Open source Cloud Computing Tools: A case study with a weather application

Focus on: Hadoop V0.20, deployed as a Platform as a Service cloud solution with parallel processing capabilities

Cluster size: 6 nodes, with Hadoop, Eucalyptus, and Django-Python cloud interfaces installed

Variables: Managing historical average temperature, rainfall, humidity, and weather-condition data per latitude and longitude across time and mapping it on top of a Google Maps user interface

Data Source: Yahoo! Weather Page

Results/Benefits to the Industry:  The Hadoop platform was evaluated on ten different criteria and compared to Eucalyptus and Django-Python on a scale of 0-3, where 0 “indicates [a] lack of adequate feature support” and 3 “indicates that the particular tool provides [an] adequate feature to fulfill the criterion.”

Table 1: The criterion matrix and numerical scores have been adopted from Greer, Rodriguez-Martinez, and Seguel (2010) results.

Criterion | Description | Score
Management Tools | Tools to deploy, configure, and maintain the system | 0
Development Tools | Tools to build new applications or features | 3
Node Extensibility | Ability to add new nodes without re-initialization | 3
Use of Standards | Use of TCP/IP, SSH, etc. | 3
Security | Built-in security as opposed to use of 3rd-party patches | 3
Reliability | Resilience to failures | 3
Learning Curve | Time to learn the technology | 2
Scalability | Capacity to grow without degrading performance | (not assessed)
Cost of Ownership | Investments needed for usage | 2
Support | Availability of 3rd-party support | 3
Total |  | 22

Eucalyptus scored 18, and Django-Python scored 20, making Hadoop the better solution for this case study.  The study mentioned that:

  • Management tools: configuration was done by hand with XML and text files rather than a graphical user interface
  • Development tools: an Eclipse plug-in aids in debugging Hadoop applications
  • Node extensibility: Hadoop can accept new nodes with no interruption in service
  • Use of standards: uses TCP/IP, SSH, SQL, JDK 1.6 (the Java standard), Python V2.6, and Apache tools
  • Security: password-protected user accounts and encryption
  • Reliability: fault tolerance is present, and the user is shielded from the effects of failures
  • Learning curve: it is not intuitive and required some experimentation after practicing with online tutorials
  • Scalability: not assessed due to the limits of the study (6 nodes is not enough)
  • Cost of ownership: to be effective, Hadoop needs a cluster, even if it is made of cheap machines
  • Support: there is third-party support for Hadoop

The authors note that Hadoop fails to provide a real-time response, and suggest that the batch code should include email notifications sent out at the start, at key points of the iteration, or at the end of the job when the output is ready.  Hadoop is slower than the other two solutions that were evaluated, but its fault-tolerance features make up for it.  For set-up and configuration, Hadoop is simple to use.

Was Hadoop used in the most ample manner?

In my opinion, and in the opinion of the authors, Hadoop was not fully used, because they stated that they could not scale their research beyond the study's 6-node cluster. Hadoop is built to ingest and process big data sets of various sources and formats to help deliver data-driven insights, and the scalability features that address this point were not assessed adequately in this study.

Resources

  • Greer, M., Rodriguez-Martinez, M., & Seguel, J. (2010). Open Source Cloud Computing Tools: A Case Study with a Weather Application. Florida: IEEE Open Source Cloud Computing.

Data Tools: Hadoop and how to install it

Installation Guide to Hadoop for Windows 10.

What is Hadoop

In the Hadoop Distributed File System (HDFS), big data is broken up into smaller blocks (IBM, n.d.), which can be aggregated like a set of Legos throughout a distributed database system. Data blocks are distributed across multiple servers.  This block system provides an easy way to scale the company's data needs up or down and allows MapReduce to do its tasks on smaller subsets of the data for faster processing (IBM, n.d.). Blocks are small enough that they can easily be duplicated (for disaster recovery purposes) on two or more servers, depending on the data needs.

HDFS can support many different data types, even those that are unknown or yet to be classified, and it can store vast amounts of data.  Thus, Hadoop's technology for managing big data allows for parallel processing, which enables parallel searching, metadata management, parallel analysis (with MapReduce), the establishment of workflow system analysis, etc. (Gary et al., 2005; Hortonworks, 2013; IBM, n.d.).

Given the massive amounts of data in big data that need to be processed, manipulated, and calculated upon, parallel processing and programming use the benefits of distributed systems to get the job done (Minelli et al., 2013).  Hadoop, which is Java based, allows manipulation and calculations to be done by calling on MapReduce, which pulls the data distributed on its servers, maps key items/objects, and reduces the data to the query at hand (Hortonworks, 2013; Sathupadi, 2010).

Parallel processing makes quick work of a big data set because, rather than having one processor do all the work, Hadoop splits the task among many processors; this is Hadoop's largest benefit.  Another advantage of parallel processing is that when one processor/node goes out, another node can pick up from where the task was last safely saved (which slows down the calculation, but only by a bit).  Hadoop expects node failures to happen all the time, so nodes create backups of their data as part of their fail-safe (IBM, n.d.).  This is done so that another processor/node can continue working on the copied data, which enhances data availability and, in the end, gets the needed task done.
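
To make this split-apply-combine idea concrete without a cluster, here is a small single-machine Python analogy using the standard multiprocessing module; this is only a conceptual sketch of dividing work into chunks, mapping them in parallel, and reducing the partial results. It is not Hadoop code: Hadoop does this across many servers with replicated data blocks.

# Single-machine analogy of parallel map/reduce with Python's multiprocessing.
from multiprocessing import Pool

def count_words(chunk):            # "map" step: work on one chunk of the data
    return sum(len(line.split()) for line in chunk)

if __name__ == "__main__":
    data = ["the quick brown fox"] * 1000              # stand-in for a large data set
    chunks = [data[i:i + 250] for i in range(0, len(data), 250)]
    with Pool(processes=4) as pool:
        partials = pool.map(count_words, chunks)       # "mappers" run in parallel
    total = sum(partials)                              # "reduce" step combines the results
    print(total)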

Minelli et al. (2013) stated that traditional relational database systems can depend on hardware architecture.  However, Hadoop's service can be delivered from the cloud as a Platform as a Service (PaaS).  With PaaS, we manage the applications and data, whereas the provider (Hadoop) administers the runtime, middleware, O/S, virtualization, servers, storage, and networking (Lau, 2001).  The next section discusses how to install Hadoop and how to set up Eclipse to access map/reduce servers.

Installation steps

  • Go to the Hadoop Main Page < http://hadoop.apache.org/ > and scroll down to the getting started section, and click “Download Hadoop from the release page.” (Birajdar, 2015)
  • In the Apache Hadoop Releases < http://hadoop.apache.org/releases.html > Select the link for the “source” code for Hadoop 2.7.3, and then select the first mirror: “http://apache.mirrors.ionfish.org/hadoop/common/hadoop-2.7.3/hadoop-2.7.3-src.tar.gz” (Birajdar, 2015)
  • Open the Hadoop-2.7.3 tarball file with a compression file reader like WinRAR archiver < http://www.rarlab.com/download.htm >. Then drag the file into the Local Disk (C:). (Birajdar, 2015)
  • Once the file has been completely transferred to the Local Disk drive, close the tarball file, and open up the hadoop-2.7.3-src folder. (Birajdar, 2015)
  • Download the Hadoop 0.18.0 tarball file < https://archive.apache.org/dist/hadoop/core/hadoop-0.18.0/ > and place a copy of the “Hadoop-vm-appliance-0-18-0” folder into the Java “jdk1.8.0_101” folder. (Birajdar, 2015; Gnsaheb, 2013)
  • Download Hadoop VM file < http://ydn.zenfs.com/site/hadoop/hadoop-vm-appliance-0-18-0_v1.zip >, unzip it and place it inside the Hadoop src file. (Birajdar, 2015)
  • Open up VMware Workstation 12, and open a virtual machine “Hadoop-appliance-0.18.0.vmx” and select play virtual machine. (Birajdar, 2015)
  • Login: hadoop-user and password: hadoop. (Birajdar, 2015; Gnsaheb, 2013)
  • Once in the virtual machine, type “./start-hadoop” and hit enter. (Birajdar, 2015; Gnsaheb, 2013)
    1. To test MapReduce on the VM: bin/hadoop jar hadoop-0.18.0-examples.jar pi 10 100000000
      1. You should get a “job finished in X seconds.”
      2. You should get an “estimated value of PI is Y.”
  • To bind the MapReduce plugin to Eclipse (Gnsaheb, 2013)
    1. Go into the JDK folder, under Hadoop-0.18.0 > contrib > eclipse-plugin, and place “Hadoop-0.18.0-eclipse-plugin” into the Eclipse Neon 1 plugin folder “eclipse\plugins”
    2. Open Eclipse, then the open perspective button > other > map/reduce
    3. In Eclipse, click on Windows > Show View > other > MapReduce Tools > Map/Reduce location
    4. Add a server: on the Map/Reduce Location window, click on the elephant icon
      1. Location name: your choice
      2. Map/Reduce master host: the IP address shown after you log in via the VM
      3. Map/Reduce master port: 9001
      4. DFS master port: 9000
      5. Username: hadoop-user
    5. Go to the advanced parameters tab > mapred.system.dir > edit to /hadoop/mapred/system

Issues experienced in the installation process (discussion of the challenges and how they were investigated and solved)

Not one source has the entire solution (Birajdar, 2015; Gnsaheb, 2013; Korolev, 2008).  It took a combination of all three sources to get the same output that each of them described.  Once the solution was determined to be correct, and the correct versions of the files were located, they were expressed in the instruction set above.  Whenever a person runs into a problem with computer science, google.com is their friend.  The links above will become outdated with time, and methods will change.  Each person's computer system is different from my personal computer system, which is reflected in this instruction manual.  This instruction manual should help others google the right terms, in the right order, to get Hadoop installed correctly onto their system.  The process takes about 3-5 hours to complete, given the time it takes to download and install the right files and to set everything up correctly.

Resources

Data Tools: Use of XML

Many industries are using XML. Some see advantages and others see challenges or disadvantages in using XML.

XML advantages

+ You can write your own markup language and are not limited to the tags defined by other people (UK Web Design Company, n.d.)

+ You can create your own tags at your own pace rather than waiting for a standards body to approve the tag structure (UK Web Design Company, n.d.)

+ Allows for a specific industry or person to design and create their set of tags that meet their unique problem, context, and needs (Brewton, Yuan, & Akowuah, 2012; UK Web Design Company, n.d.)

+ It is a format that is both human- and machine-readable (Hiroshi, 2007)

+ Used for data storage and processing both online and offline (Hiroshi, 2007)

+ Platform independent, with forward and backward compatibility (Brewton et al., 2012; Hiroshi, 2007)

XML disadvantages

– Searching for information in the data is tough and time-consuming without a computer processing application (UK Web Design Company, n.d.)

– Data is tied to logic and language similar to HTML, but without a ready-made browser to simply explore the data, and therefore may require HTML or other software to process the data (Brewton et al., 2012; UK Web Design Company, n.d.)

– Syntax and tags are redundant, which can consume huge amounts of bytes, and slow down processing speeds (Hiroshi, 2007)

– Limited to relational models and object-oriented graphs (Hiroshi, 2007)

– Tags are chosen by their creator; thus there is no standard set of tags that should be used (Brewton et al., 2012)

XML use in Healthcare Industry

Thanks to the American National Standards Institute, Health Level 7 (HL7) was created as a standard for health care XML and is now in use by 90% of all large hospitals (Brewton et al., 2012; Institute of Medicine, 2004). The Institute of Medicine (2004) stated that health care data can consist of allergies, immunizations, social histories, histories, vital signs, physical examinations, physician's and nurse's notes, laboratory tests, diagnostic tests, radiology tests, diagnoses, medications, procedures, clinical documentation, clinical measures for specific clinical conditions, patient instructions, dispositions, health maintenance schedules, etc.  More complex data sets, like images, sounds, and other types of multimedia, are yet to be included (Brewton et al., 2012).  Also, the terminologies within the data elements are not a systematized nomenclature, and the standard does not support web protocols for more advanced communication of health data (Institute of Medicine, 2004). HL7 V3 should resolve many of these issues and should also account for a wide variety of health care scenarios (Brewton et al., 2012).

XML use in Astronomy

The Flexible Image Transport System (FITS), currently used by NASA/Goddard Space Flight Center, holds image, spectra, table, and sky atlas data and has been in use for 30 years (NASA, 2016; Pence et al., 2010). The newest version adds a definition of time coordinates, support for long string keywords, multiple keywords, checksum keywords, and image and table compression standards (NASA, 2016); support for mandatory keywords existed previously (Pence et al., 2010).  Besides the differences in data entities, and therefore in the tags needed to describe the data, between the healthcare and astronomy formats, the much longer period of use has allowed for a more robust solution that has evolved with technology.  It is also widely used, as it is endorsed by the International Astronomical Union (NASA, 2016; Pence et al., 2010).  Given the maturity of FITS, due to its creation in the late 1970s, and the fact that it is still in use, heavily endorsed, and still a standard today, the healthcare industry could learn something from this system.  The only problem with FITS is that it removes some of the benefits of XML, including the flexibility to create your own tags, due to the heavy standardization and the standardization body.

Resources

Data Tools: XML

Large data sets are often represented using XML; data and logs can both be represented in XML. Markup and breaking data into marked sections are features of XML.

What is XML and how is it used to represent data

XML, the eXtensible Markup Language, is a standardized way of referring to and identifying objects or data items by type(s) in a flexible, hierarchical approach (Brookshear & Brylow, 2014; Lublinsky, Smith, & Yakubovich, 2013; McNurlin, Sprague, & Bui, 2008; Sadalage & Fowler, 2012).  XML refers to and identifies objects by type when it assigns tags to certain parts of the data, defining the data (McNurlin et al., 2008).  JSON provides similar functionality to XML, but XML's schema and query capabilities are better than JSON's (Sadalage & Fowler, 2012). XML focuses more on semantics than appearance, which allows for searches that understand the contents of the data being considered; it is therefore considered a standard for producing markup languages like HTML (Brookshear & Brylow, 2014).  Finally, XML uses object principles, which tell you what data is needed to perform a function and what output it can give you, just not how it will do it (McNurlin et al., 2008).

XML documents contain descriptions of a service or function, how to request the service or function, the data it needs to perform the work, and the results the service or function will deliver (McNurlin et al., 2008). Also, relational databases have adopted XML as a structuring mechanism by supporting an XML document column type, which allows for XML querying languages (Sadalage & Fowler, 2012).  Therefore, XML essentially helps define the data, giving it meaning that computers can manipulate and work on, transforming data from a human-readable format into a computer-readable one, and it can support applications such as currency conversion, credit card processing, etc. (McNurlin et al., 2008).

XML can handle a high volume of data and can represent all varieties of data (structured, unstructured, and semi-structured) in an active or live (streaming) fashion, such as CAD data used to design buildings, multimedia such as music, product bar code scans, photographs of damaged property held by insurance companies, etc. (Brookshear & Brylow, 2014; Lublinsky et al., 2013; McNurlin et al., 2008; Sadalage & Fowler, 2012). By definition, big data is any set of data that has high velocity, volume, and variety, also known as the 3Vs (Davenport & Dyche, 2013; Fox & Do, 2013).  Therefore XML can represent big data quite nicely.

Use of XML to represent data in various forms

XML documents represent a whole data file, which contains markups, elements, and nodes (Lublinsky et al., 2013; Myer, 2005):

  • XML markups are tags that help describe the data's start and end points as well as the data's properties/attributes, and they are encapsulated by < and >
  • XML elements are data values, encapsulated by an opening <tag> and a closing </tag>
  • XML nodes are part of the hierarchical structure of a document that contains a data element and its tags

Data can be composed of text and numbers, like a telephone number (123)-456-7890, which can be represented in XML as <phone country="U.S.">1234567890</phone>.  The country attribute adds hierarchy to the object, defining it further as a U.S. phone number (Myer, 2005).  Given XML's hierarchical nature, the root data element helps sort all of the data below it, similar to a hierarchical data tree (Lublinsky et al., 2013; Myer, 2005).
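
As a small illustration of how such a tagged element might be produced programmatically, the following sketch uses Python's standard xml.etree.ElementTree module to build the phone example above; the element name, attribute, and value simply mirror the example and are not prescribed by it.

# Build the <phone> element from the example above with the standard library.
import xml.etree.ElementTree as ET

phone = ET.Element("phone", attrib={"country": "U.S."})  # the attribute adds hierarchy/context
phone.text = "1234567890"

print(ET.tostring(phone, encoding="unicode"))
# <phone country="U.S.">1234567890</phone>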

Since elements can be described by tags, which add context to the data and turn it into information, and since adding hierarchical structure to these informational elements describes their natural relationships, XML aids in transforming information into knowledge (Myer, 2005). This can help in analyzing big data sets.

Myer (2005) provided the following example of XML syntax, which also showcases the XML representation of data in a hierarchical structure:

<Actor type="superstar">
                <name>Harrison Ford</name>
                <gender>male</gender>
                <age>50</age>
</Actor>

Given this simple structure, such a document can easily be created by humans or by code. However, to derive the value that Myer (2005) was talking about from XML-formatted data, it needs to be ingested into Hadoop for analysis (Agrawal, 2014).  Finally, XML can deal with numerical data such as integers, reals, floats, longs, doubles, NaN, INF, -INF, probabilities, percentages, string data, plain arrays of values (numerical and string arrays), sparse arrays, matrices, sparse matrices, etc. (Data Mining Group, n.d.), thus addressing the various types of data mentioned above, at high volumes.
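
Before such a document is ingested into Hadoop for analysis, it can be parsed with any XML library; a minimal sketch using Python's standard xml.etree.ElementTree module, applied to the corrected actor example above, might look like the following.

# Parse the actor example and pull out its elements and the type attribute.
import xml.etree.ElementTree as ET

doc = """<Actor type="superstar">
  <name>Harrison Ford</name>
  <gender>male</gender>
  <age>50</age>
</Actor>"""

actor = ET.fromstring(doc)
print(actor.get("type"))            # superstar
print(actor.findtext("name"))       # Harrison Ford
print(int(actor.findtext("age")))   # 50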

Resources

  • Agrawal, V. (2014). Processing XML data in BigInsights 3.0. Retrieved from https://developer.ibm.com/hadoop/2014/10/31/processing-xml-data-biginsights-3-0/
  • Brookshear, G., & Brylow D. (2014). Computer Science: An Overview (12th ed.). Pearson Learning Solutions. VitalBook file.
  • Data Mining Group (n.d.). PMML 4.1 – General structure. Retrieved from http://dmg.org/pmml/v4-1/GeneralStructure.html
  • Davenport, T. H., & Dyche, J. (2013). Big Data in Big Companies. International Institute for Analytics, (May), 1–31.
  • Fox, S., & Do, T. (2013). Getting real about Big Data: applying critical realism to analyse Big Data hype. International Journal of Managing Projects in Business, 6(4), 739–760. http://doi.org/10.1108/IJMPB-08-2012-0049
  • Lublinsky, B., Smith, K., & Yakubovich, A. (2013). Professional Hadoop Solutions. Wrox. VitalBook file.
  • McNurlin, B., Sprague, R., & Bui, T. (2008). Information Systems Management (8th ed.). Pearson Learning Solutions. VitalBook file.
  • Myer, T. (2005). A really, really, really good introduction to xml. Retrieved from https://www.sitepoint.com/really-good-introduction-xml/
  • Sadalage, P. J., & Fowler, M. (2012). NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence. Pearson Learning Solutions. VitalBook file.

Futuring & Innovation: Compelling Topics

The most compelling topics on the subject of Futuring and Innovation.

  • There are forces that may help facilitate or reduce the likelihood of success of innovation, such as technological, cultural, economic, legal, ethical, temporal, social, societal, global, national, and local.
  • TED talks are videos that address innovations related to Technology, Entertainment, and Design, and they can be found on the TED Web site.
  • Sociotechnical systems: the interplay, impact, and mutual influence when technology is introduced into a social system, e.g., a workplace, school, home, etc. (Encyclopedia.com, n.d.; Socio-technical theory, n.d.). The social system comprises people at all levels of knowledge, skills, attitudes, values, and needs (Socio-technical theory, n.d.).
  • Think tanks are groups of people that review the literature, discuss it, think about ideas, do tons of research, write, provide ideas, legitimize ideas, advocate, lobby, and argue, all to address a problem or problems (Mendizabal, 2011; TBS, 2015; Whittenhauer, n.d.). In short, they are idea factories: creating, producing, and sharing (Whittenhauer, n.d.). The balance between research, consultancy, and advocacy, and the source of their arguments/ideas (applied, empirical, synthesis, theoretical, or academic research), helps shape what type of think tank they are (Mendizabal, 2011). Finally, there are two think tank models: the one-roof model, where everyone gathers in one physical place to meet face-to-face, and the without-walls model, where members communicate only through technological means (Whittenhauer, n.d.).
  • The Nominal Group Technique (NGT) is a tool for decision making; it can be used to identify elements of a problem, identify and rank goals by priority, identify experts, and involve people from all levels to promote buy-in of the results (Deip, Thesen, Motiwalla, & Seshardi, 1977; Hashim et al., 2016; Pulat, 2014). Pulat (2014) describes the process as listing and prioritizing a list of options created through a normal brainstorming session, where the list of ideas is generated without criticism or evaluation.  Deip et al. (1977), by contrast, describe the process as one that taps into the experiences of all people by asking them each to state their idea for a list, with no discussion permitted until all ideas are listed, after which a discussion of each item on the list and the ranking of each idea can begin. Finally, Hashim et al. (2016) stated that the method is best used to help a small team reach consensus by gathering ideas from all and exciting buy-in of those ideas.
  • Dalkey and Helmer (1963) described the Delphi project as a way to use expert opinion, with the hope of getting the strongest consensus from a group of experts. Pulat (2014) states that ideas are listed and prioritized by a weighted point system to help reduce the number of possible solutions, with no communication between the experts, or of the results, during the process until the very end.  However, Dalkey and Helmer (1963) described the process as repeated interviewing or questioning of individual experts while avoiding confrontation between experts.  Questions center on some central problem, and each round of questioning is followed by sharing available data requested by one expert with all experts, or new information considered potentially relevant by an expert (Dalkey & Helmer, 1963; Pulat, 2014).  The solution from this technique improves when experts with a range of experiences are solicited (Okoli & Pawlowski, 2004; Pulat, 2014).
  • Serendipitous innovations: discovering what makes one thing special and applying it elsewhere, like Velcro.
  • Exaptation innovations: Never giving up, finding secondary uses for the same product, and not being afraid to pivot when needed, like Play-Doh.
  • Erroneous innovations: Creating something by accident in the pursuit of something else, like Saccharin (C7H5NO3S) the artificial sweetener.
  • Kodak is a great example of a good plan where something went wrong because of circumstances beyond the company's control.
  • A traditional forecast essentially extrapolates where you were and where you are now into the future, and the end of this extrapolated line is called “the most likely scenario” (Wade, 2012; Wade, 2014). Mathematical formulations and extrapolations are the mechanical basis for traditional forecasting (Wade, 2012). At some point, these forecasts add ±5-10% to their projections and call that “the best and worst case scenario” (Wade, 2012; Wade, 2014).  This ± difference is a narrow range of possibilities out of an actual 360° spherical solution space (Wade, 2014). There are both mathematical and mental forms of extrapolation, and both are quite dangerous because they assume that the world doesn't change much (Wade, 2012).
  • Scenario planning can be done with 9-30 participants (Wade, 2012). A key requirement of scenario planning is for everyone to understand that knowing the future is impossible, and yet people want to know where the future could go (Wade, 2014).  It is important to note that scenarios are not predictions; scenarios only illuminate different ways the future may unfold (Wade, 2012)! This creative yet methodical approach for spelling out some of the future scenarios that could happen has ten steps (Wade, 2012; Wade, 2014):
    1. Framing the challenge
    2. Gathering information
    3. Identifying driving forces
    4. Defining the future’s critical “either/or” uncertainties
    5. Generating the scenarios
    6. Fleshing them out and creating story lines
    7. Validating the scenarios and identifying future research needs
    8. Assessing their implications and defining possible responses
    9. Identifying signposts
    10. Monitoring and updating the scenarios as times goes on

Resources:

  • Dalkey, N., & Helmer, O. (1963). An experimental application of the Delphi method to the use of experts. Management Science, 9(3), 458-467.
  • Deip, P., Thesen, A., Motiwalla, J., & Seshardi, N. (1977). Nominal group technique.
  • Encyclopedia.com (n.d.). Socio-technical system. A Dictionary of Sociology. Retrieved from http://www.encyclopedia.com/social-sciences/dictionaries-thesauruses-pictures-and-press-releases/socio-technical-system
  • Hashim, A. T., Ariffin, A., Razalli, A. R., Shukor, A. A., NizamNasrifan, M., Ariffin, A. K., … & Yusof, N. A. A. (2016). Nominal Group Technique: a Brainstorming Tool for Identifying Learning Activities Using Musical Instruments to Enhance Creativity and Imagination of Young Children. International Advisory Board, 23, 80.
  • Mendizabal, E. (2011). Different ways to define and describe think tanks. On Think Tanks. Retrieved from https://onthinktanks.org/articles/different-ways-to-define-and-describe-think-tanks/
  • Okoli, C., & Pawlowski, S. D. (2004). The Delphi method as a research tool: an example, design considerations and applications. Information & Management, 42(1), 15-29.
  • Pulat, B. (2014) Lean/six sigma black belt certification workshop: body of knowledge. Creative Insights, LLC.
  • Socio-Technical Theory (n.d.) Brigham Young University. Retrieved from http://istheory.byu.edu/wiki/Socio-technical_theory
  • Wade, W. (2012) Scenario Planning: A Field Guide to the Future. John Wiley & Sons P&T. VitalSource Bookshelf Online.
  • Wade, W. (2014). Scenario Planning – Thinking differently about future innovation. Globis Retrieved from http://e.globis.jp/article/343

  • Whittenhauer, K. (n.d.). Effective think tank methods. eHow. Retrieved from http://www.ehow.com/way_5728092_effective-think-tank-methods.html

Sociotechnology plan for Getting People Out to Vote

Sociotechnical systems is a term in organizational development that acknowledges the collaboration between people and technology in the workplace. The term is also indicative of the social aspects of people and society and the technical aspects of organizational structure and processes. This is the sociotechnical plan for an innovation (for a real issue).

Abstract:

According to the US Census Bureau (2016), there are approximately 227 million eligible voters in the U.S.; however, the Bipartisan Policy Center stated that in 2012 voter turnout was 57.5%. This helps establish a need for Get Out to Vote (GOTV) efforts. Regardless of any party's political views, ideologies, and platforms, each party should improve its GOTV initiatives, which help convert citizens into active voters on election day (Garecht, 2006). Fortunately, technology and big data could be used to leverage GOTV strategies to allow for mass social contact that is tailored to the voter and yet still cheaper than door-to-door canvassing. The purpose of this sociotechnical plan for GOTV is to build a bipartisan mobile application that serves the needs of citizens and politicians, increasing poll attendance and ensuring a more democratic process.

Introduction:

Democracy in any nation is at its best when everyone participates.  Regardless of any party's political views, ideologies, and platforms, each party should improve its Get Out to Vote (GOTV) initiatives, which help convert citizens into active voters on election day (Garecht, 2006). GOTV initiatives are meant to get voters who don't usually vote to get out and vote on election day, or to get those who intend to vote to follow through (Bash, 2016; Stanford Business, 2012).  According to the Institution for Social and Policy Studies (ISPS) (2016), a large number of studies have found that personalized methods, like door-to-door canvassing, are the best and most robust GOTV methods currently out there, whereas mass email, mailers, and robocalls are not, because they lack dynamic and authentic interaction.  In the last few days before an election, voters or would-be voters have already picked whom they will vote for and which way they will vote on certain initiatives (Bash, 2016).  So it is not a matter of convincing people but of achieving a high voter turnout.

A good goal for any political party's GOTV initiative is to obtain 10% of the voters needed to win the election (Garecht, 2006). Door-to-door canvassing is cost-prohibitive but cost-efficient, whereas mass social contact is not as cost-efficient even though it is cheaper (Gerber, Green, & Larimer, 2008; ISPS, 2016). Gerber et al. (2008) stated that door-to-door canvassing costs approximately 10x more than mass social contact per vote.  Even though the costs of door-to-door canvassing are huge, the Republican National Committee projected knocking on 17 million doors for its 2016 efforts, compared to 11.5 million in the 2012 elections (Bash, 2016).  Fortunately, technology and big data could be used to leverage GOTV strategies to allow for mass social contact that is tailored to the voter and yet still cheaper than door-to-door canvassing.  Currently, social media, email, online ads, and websites are used for GOTV (Fuller, 2012).

Scope: 

Current and next-generation voters will be highly social and technologically advanced, leveraging social media and other digital tools to learn about the issues from each candidate and to become social media influencers (Horizons Report, 2016c). Therefore, as a feature, social media could be used as a way to develop personal learning networks and personal learning on the issues and initiatives (Horizons Report, 2016c; Horizons Report, 2016e). Twitter has been used by students to discover information and to publish and share ideas, while at the same time exploring different perspectives on the same topic to promote cultural and scientific progress (Horizons Report, 2016a).

Walk book, an app being used by the Republican National Committee to aid in its GOTV efforts, shows where a voter lives, their party affiliation, and how reliable they are as a voter (Bash, 2016). The walk book mobile app also allows door-to-door canvassing personnel to have dynamic discussions through dynamic scripting, which handles a large number of different scenarios and responses to their questions.  Data is then collected and returned to the Republican National Committee for further analysis and future updates.

Another feature of social media technologies is that, as they continue to evolve beyond 2016, these tools can be used for crowdsourcing, establishing an identity, networking, etc. (Horizons Report, 2016b; Horizons Report, 2016e).  Establishing identities can work for the political campaign, but leveraging voters' established social media identities could help create a more tailored response to their values and what is at stake in the election.  The limitation comes from joining all these data sources, containing huge amounts of unstructured data, into one system in order to decipher not only a voter's propensity to vote but also their political leaning (Fuller, 2012).

Purpose: 

According to the US Census Bureau (2016), there are approximately 227 million eligible voters in the U.S.; however, the Bipartisan Policy Center stated that in 2012 voter turnout was 57.5%. This helps establish a need for GOTV efforts. When more people go out to vote, their voices get heard.  Even in states that are heavily Democratic or Republican, a higher turnout can increase the chance that political candidates will be more centrist in their policies in order to remain elected in the future (ThinkTank, 2016; vlogbrothers, 2014). The lower the voter turnout, the higher the chance that the actual voice of the people is not heard.  The vlogbrothers said it best: “If you aren't voting, no one hears your voice, so they have no reason to represent you!” Also, elections are not usually just about the top-of-ticket vote but all the down-ballot items as well, like local city, county, and state-level public offices and ballot initiatives (ThinkTank, 2016). The purpose of this sociotechnical plan for GOTV is to build a bipartisan mobile application that serves the needs of citizens and politicians, increasing poll attendance and ensuring a more democratic process.

Supporting forces: 

Social: Using scripts that state that voter turnout will be high helps increase voter turnout, because people start identifying voting as something they must do as part of their identity, since others are doing it as well (Stanford Business, 2012). Also, in today's world, social media has become a way for people to be connected to their social network at all times, and there is a real fear of missing out (FOMO) if they are not responsive to their social media tools (Horizons Report, 2016d). Other social aspects have been derived from behavioral science, like adding a simple set of questions such as:

  • “What time are you voting?”
  • “Where are you voting?”
  • if there is early voting “What day will you be voting?”
  • “How will you get there?”
  • “Where would you be coming from?”

Asking these questions has been shown to double voter commitment and turnout when focused on people who don't organically think of them (Stanford Business, 2012). Helping voters determine answers to these questions helps with their personal logistics and shows how easy it is for them to vote.  This was one of the key deciding factors between Barack Obama's and Hillary Clinton's GOTV efforts in the 2008 Democratic Primary campaign (Rigoglioso, 2012).  Also, if there is a way to have those logistics questions posted on voters' social media platforms, they become social champions for voting and are also now socially held accountable to vote.  Finally, VotingBecause.com is a social platform for voters to share why they are voting in the election, making them more socially accountable to vote (Sutton, 2016).

Technological: Currently, platforms like YouTube are using their resources and their particular platforms to help their users get out to vote (Alcorn, 2016). The USA Today Network has launched a one-stop shop, VotingBecause.com, which helps voters easily read about the issues and candidates in the election and even register to vote (Sutton, 2016).

Economic:  According to a Pew Survey (2015), 68% of US adults own a smartphone, while 86% of all young adults (ages 18-20) own one. A different Pew Survey (2015a) showed that 65% of adults use social media, up from 7% in 2005. Therefore, a huge voting bloc has a social media account and a smartphone (Levitan, 2015b), for which technology can be leveraged at a lower cost than door-to-door canvassing.

Challenging forces: 

Social: Unfortunately, this FOMO leads to people feeling burnt out or exhausted; therefore, users of social media need to balance their time on it (Horizons Report, 2016d). Facts and rumors both pose as information on social media, and deciphering which is which is exhausting (Levitan, 2015). Therefore, for this innovation to become a reality, any information shared via social media should come from an independent and trusted source.

Legal: In most states, it is illegal to take a photograph of a polling place, which makes it hard for people to show the accomplishment of their civic duty on social platforms without getting into legal issues (Fallen, 2016). This may decrease people's desire to share and feel connected, which could eventually decrease personal and social accountability in voting.

Technological: A challenge is building a comprehensive database of the typical low-propensity voter so that a campaign can create personal messages and personal conversations with those voters (Fuller, 2012). A good database would include phone number, address, voting propensity, voting record, street address, email address, issues of importance, etc. (Fuller, 2012). Security is also an issue; the one-stop-shop mobile application must take into account each person's right to access certain types of data and still ensure the anonymity of civilians' voting records.

Ethics: Collecting huge amounts of data from social media and tying it to personal yet public voting records could cause harm, given that political beliefs are of a private nature.  If the data is not used primarily for GOTV initiatives to get everyone's voice heard in the political system, then it shouldn't be collected.  Data could also be used unethically by some to suppress the vote. Therefore, this effort must be conducted by an independent (non-partisan) group.

Methods:

Social media is constantly evolving; thus, the use of the Delphi Technique with a political think tank, political scientists, sociologists, social behaviorists, and actual GOTV managers would be needed on an ongoing basis (Horizons Report, 2016b; Horizons Report, 2016e; Stanford Business, 2012). Dalkey and Helmer (1963) described the Delphi project as a way to use expert opinion, with the hope of getting the strongest consensus from a group of experts.  Pulat (2014) states that ideas are listed and prioritized by a weighted point system to help reduce the number of possible solutions, with no communication between the experts, or of the results, during the process until the very end.  However, Dalkey and Helmer (1963) described the process as repeated interviewing or questioning of individual experts while avoiding confrontation between experts.  Experts must be drawn from different groups on the research spectrum, from theoretical to applied, as well as from different academic fields, to help build the best consensus on the methodology for leveraging social media in GOTV efforts. Finally, one could consider conducting the Delphi Technique either in a one-roof model, where everyone gathers in one physical place to meet face-to-face, or in a without-walls model, where members communicate only through technological means (Whittenhauer, n.d.).

Models:

To build this socio-technical system, which leverages unstructured social media data together with voter registration data for GOTV efforts, one must consider the different levels of designing a socio-technical system, as seen in Figure 1 (Sommerville, 2013).  Each of the levels plays an important role in facilitating the entire socio-technical plan and must be heavily detailed, with backup systems.  Backup systems and a disaster recovery plan are needed to avoid the fate that Mitt Romney's 2012 GOTV program, ORCA, suffered, when thousands of Romney volunteers were left with a data blackout (Haberman & Burns, 2012). It is important to note that a good socio-technical GOTV plan would include all the different levels, because the levels of the socio-technical system feed into each other and overlap in certain domains (Sommerville, 2013).


Figure 1. The socio-technical architecture with seven levels. (Source: Adapted from Sommerville, 2013.)

Elections, however, are bound to fixed points in time and must come to an end.  This allows the socio-technical GOTV plan to have a work breakdown structure.  The resulting work breakdown structure could be scaled up or down based on the lead time or the importance of the election candidacy or initiative (Garecht, 2002; Progressive Technology Project, 2006):

  • As soon as possible:
    • Create a GOTV plan, strategy, methodology, based on the methodologies created through the Delphi method, with a wide range of experts.
    • Assign one person as a chairman for GOTV.
    • Sign up volunteers for GOTV efforts, remembering that they are not there to convince anyone who to vote for, just to get them to vote.
  • 90-60 days before the election:
    • Gather all data from all the data sources and create one system that ties it together.
    • Identify phase: Have predictive data analytics begin running algorithms to decipher which voters have a lower-than-average propensity to vote on election day, and allow the algorithm to triage voters (a minimal illustrative sketch of such scoring appears after this work breakdown structure).
      • Add people who attend political events, staff members, and volunteers; they should already have a higher propensity to vote.
    • When applicable have GOTV staff file absentee ballots.
  • 30 days before the election:
    • Begin implementation of the GOTV Plan.
    • Updating databases, registration information, voter addresses, etc.
    • Identify phase: Keep rerunning the predictive data analytics model.
    • Motivation Phase: Getting people to vote, by making it easy for them to vote, and establishing social accountability via their social media accounts.
    • When applicable have GOTV staff file absentee ballots.
  • Ten days to 1 week before the election:
    • New volunteers come in to help at this time, and the GOTV plan should give them training and roles so that they can be of most use.
    • Identification + Motivation Phase: Contact each person on that list to remind them of their civic duty, motivate them, and remind them where their polling place is and what time they stated would be best for them to vote.
    • Motivation Phase: Use social media advertising tools to target ads at people on these low-propensity voter lists derived from predictive data analytics. Even sending text and email reminders of their polling places and hours of operation would help make the voting process easier for these voters.
    • When applicable have GOTV staff file absentee ballots.
  • Election day:
    • Have voters log into a system to say that they have voted, or scan their social media to see who has voted, and cross their names off the list of those who have yet to vote that day. The aim is a 100% conversion rate of inactive voters to active voters.

Understanding that this work breakdown structure deals with the intersection of technology and people is key to making it work effectively.
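
To make the identify phase sketched in the work breakdown structure above more concrete, the following is one hypothetical way to score voter propensity with a logistic regression model; the feature names, the synthetic data, and the 0.4 "low propensity" cutoff are all illustrative assumptions rather than the actual algorithm a campaign would use.

# Hypothetical voter-propensity scoring sketch (synthetic data, illustrative only).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Assumed features per voter: age, past elections voted in (of the last 4),
# and whether a contact record exists (1/0). Label: voted in the last election (1/0).
X = np.column_stack([
    rng.integers(18, 90, 1000),
    rng.integers(0, 5, 1000),
    rng.integers(0, 2, 1000),
])
y = (X[:, 1] + rng.normal(0, 1, 1000) > 2).astype(int)  # synthetic turnout label

model = LogisticRegression(max_iter=1000).fit(X, y)
propensity = model.predict_proba(X)[:, 1]               # estimated probability of voting

# Triage: flag voters below an assumed 0.4 propensity cutoff for GOTV outreach.
low_propensity_idx = np.where(propensity < 0.4)[0]
print(len(low_propensity_idx), "voters flagged for outreach")

In practice, the features would come from the voter file and contact history described in the challenging forces section, and the choice of model would itself be one of the questions put to the Delphi panel of experts.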

Analytical Plan:

The aim of this socio-technical GOTV plan is a 100% conversion rate of inactive to active voters.  However, this is nearly impossible for larger campaigns and big elections (Progressive Technology Project, 2006).  To analyze the effectiveness of the socio-technical GOTV plan, the low-propensity voter list created by the predictive data analytics should be cross-referenced with the voters reached through various technological or social media means and with the voters who actually voted and match the GOTV list.
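
A minimal way to compute that cross-reference, assuming hypothetical voter ID lists exported from the targeting model, the contact log, and the post-election voter file, is sketched below.

# Hypothetical conversion-rate check: which targeted, contacted voters actually voted?
targeted = {"v001", "v002", "v003", "v004"}    # low-propensity list from the model
contacted = {"v001", "v002", "v004"}           # reached via social/mobile outreach
voted = {"v002", "v004", "v999"}               # from the post-election voter file

converted = targeted & contacted & voted
rate = len(converted) / len(targeted & contacted)
print(f"GOTV conversion rate among contacted targets: {rate:.0%}")

With real data, these sets would come from database queries or file loads rather than hard-coded values.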

Another way to evaluate the effectiveness of the socio-technical GOTV plan is to see how closely the real results matched the milestones identified in the work breakdown structure.  Daily figures should be captured, along with narratives to supplement the numerical data, to create lessons learned that are eventually fed back to the experts who devised the methodology for further development.  The Delphi Technique can then be reiterated with the new data to build a better socio-technical GOTV plan in the future.

Anticipated Results:

As voter turnout increases, no matter which political party wins, the views will be more centrist rather than polarizing, because each voter's voice was heard (ThinkTank, 2016; vlogbrothers, 2014).  Another result of this socio-technical GOTV plan is that voters are empowered to make data-driven decisions from the national level down the ballot, due to the information presented in these GOTV plans (Alcorn, 2016; Sutton, 2016). This, in turn, will help create a positive use of social media as a tool to enhance learning and develop personal learning networks (Horizons Report, 2016c; Horizons Report, 2016e). Finally, an unintended social impact of this GOTV plan is the creation of more civically minded citizens who are active in politics at all levels and willing to be influencers, discovering, creating, publishing, and sharing ideas (Horizons Report, 2016a; Horizons Report, 2016c; ThinkTank, 2016).

Conclusion:

According to the US Census Bureau (2016), there are approximately 227 million eligible voters in the U.S.; however, the Bipartisan Policy Center stated that in 2012 voter turnout was 57.5%. This voter turnout rate is horrible, and as the vlogbrothers said: “If you aren't voting, no one hears your voice, so they have no reason to represent you!” This helps establish a need for GOTV efforts. When more people go out to vote, their voices get heard.  Also, elections are not usually just about the top-of-ticket vote but all the down-ballot items as well, like local city, county, and state-level public offices and ballot initiatives (ThinkTank, 2016).

According to a Pew Survey (2015), 68% of US adults own a smartphone, while 86% of all young adults (ages 18-20) own one. A different Pew Survey (2015a) showed that 65% of adults use social media, up from 7% in 2005. Therefore, a huge voting bloc has a social media account and a smartphone (Levitan, 2015b), for which technology can be leveraged at a lower cost than door-to-door canvassing. Gerber et al. (2008) stated that door-to-door canvassing costs approximately 10x more than mass social contact per vote.  Currently, social media, email, online ads, and websites are used for GOTV (Fuller, 2012). Fortunately, technology and big data could be used to leverage GOTV strategies to allow for mass social contact that is tailored to the voter and yet still cheaper than door-to-door canvassing.

The Diffusion of Innovation (DOI) theory is concerned with the why, what, how, and rate of innovation dissemination and adoption between entities, which are carried out through different communication channels over a period of time (Bass, 1969; Robertson, 1967; Rogers, 1962; Rogers, 2010). Entities need to make a binary decision, which can fluctuate over time, about whether or not to adopt an innovation (Herrera, Armelini, & Salvaj, 2015). Rogers (1962) first proposed that the timing of adoption of an innovation between entities follows a normal frequency distribution: innovators (2.5%), early adopters (13.5%), early majority (34%), late majority (34%), and laggards (16%). The cumulative frequency distribution mimics an S-curve (Bass, 1969; Robertson, 1967; Rogers, 1962). However, Bass (1969) claimed that Rogers' frequency distribution was arbitrarily assigned, and he therefore reclassified both innovators and early adopters as innovators and everyone else as imitators, to create a simplified numerical DOI model.

Imitators are more deliberate, conservative, and traditional learners who learn from innovators, whereas innovators are more venturesome (Bass, 1969; Rogers, 2010). Innovators have a lower threshold of resistance to adopting an innovation than imitators, since adoption rates are driven by the adopters' perception of the innovation's usefulness (Rogers, 2010; Sang-Gun, Trimi, & Kim, 2013).  Therefore, an innovator will adopt this socio-technological GOTV plan and record their findings, and only when a certain amount of data has been collected and the innovation has been implemented many times over will it finally be adopted by the imitators.  Once a significant number of imitators have adopted these measures, we can begin to see U.S. voter turnout reach the 80% or higher mark.

Areas of Future Research:

To further GOTV low-propensity voter prediction, more accurate predictive data analytics algorithms are needed; thus, more research is needed.  Different types of predictive data analytics can produce different results, depending on whether the algorithm uses supervised or unsupervised machine learning techniques. Another effort needed in the future is preprocessing the unstructured social media data and connecting it to voter registration data.

References