Data Tools: Hadoop Vs Spark

The Hadoop ecosystem is rapidly evolving. Apache Spark is a recent addition to the Hadoop ecosystem. Both help with traditional challenges of storing and processing of large data sets.

Advertisements

 

Apache Spark

Apache Spark started from a working group inside and outside of UC Berkley, in search of an open-sourced, multi-pass algorithm batch processing model of MapReduce (Zaharia et al., 2012). Spark can have applications written in Java, Scala, Python, R, and interfaces with SQL, which increases ease of use (Spark, n.d.; Zaharia et al., 2012).

Essentially, Spark is a high-performance computing cluster framework, but it doesn’t have its distributed file system and thus uses Hadoop Distributed File System (HDFS, HBase) as in input and output (Gu & Li, 2013).  Not only can it access data from HDFS, HBase, it can also access data from Cassandra, Hive, Tachyon, and any other Hadoop data source (Spark, n.d.).  However, Spark uses its data structure called Resilient Distribution Datasets (RDD) which cache’s data and is a read-only operation to improve its processing time as long as there is enough memory for it in all the nodes of a cluster (Gu & Li, 2013; Zaharia et al., 2012). Spark tries to avoid data reloading from the disk that is why it stores its data in the node’s cache system, for initial and intermediate results (Gu & Li, 2013).

Machines in the cluster can be rebuilt if lost, thus making the RDDs are fault-tolerant without requiring replication (Gu &LI, 2013; Zaharia et al., 2012).  Each RDD is tracked in a lineage graph, and reruns the operations if data becomes lost, therefore reconstructing data, even if all the nodes running spark were to fail (Zaharia et al., 2012).

Hadoop

Hadoop is Java-based system that allows for manipulation and calculations to be done by calling on MapReduce function on its HDFS system (Hortonworks, 2013; IBM, n.d.).

HFDS big data is broken up into smaller blocks across different locations, no matter the type or amount of data, each of these blocs can be still located, which can be aggregated like a set of Legos throughout a distributed database system (IBM, n.d.; Minelli, Chambers, & Dhiraj, 2013). Data blocks are distributed across multiple servers.  This block system provides an easy way to scale up or down the data needs of the company and allows for MapReduce to do it tasks on the smaller sets of the data for faster processing (IBM, n.d). IBM (n.d.) boasts that the data blocks in the HFDS are small enough that they can be easily duplicated (for disaster recovery purposes) in two different servers (or more, depending on your data needs), offering fault tolerance as well. Therefore, IBM’s (n.d.) MapReduce functions use the HFDS to run its procedures on the server in which the data is stored, where data is stored in a memory, not in cache and allow for continuous service.

MapReduce contains two job types that work in parallel on distributed systems: (1) Mappers which creates & processes transactions on the system by mapping/aggregating data by key values, and (2) Reducers which know what that key value is, will take all those values stored in a map and reduce the data to what is relevant (Hortonworks, 2013; Sathupadi, 2010). Reducers can work on different keys, and when huge amounts of data are entered into MapReduce, then the Mapper maps the data, where the data is then shuffled and sorted before it is reduced (Hortonworks, 2013).  Once the data is reduced, the researcher gets the output that they sought.

Significant Differences between Hadoop and Apache Spark              

Spark is faster than Hadoop in iterative operations by 25x-40x for really small datasets, 3x-5x for relatively large datasets, but Spark is more memory intensive, and speed advantage disappears when available memory goes down to zero with really large datasets (Gu & Li, 2013).  Apache Spark, on their website, boasts that they can run programs 100X faster than Hadoop’s MapReduce in Memory (Spark, n.d.). Spark outperforms Hadoop by 10x on iterative machine learning jobs (Gu & Li, 2013). Also, Spark runs 10x faster than Hadoop on disk memory (Spark, n.d.).

Gu and Li (2013), recommend that if speed to the solution is not an issue, but memory is, then Spark shouldn’t be prioritized over Hadoop; however, if speed to the solution is critical and the job is iterative Spark should be prioritized.

References

  • Gu, L., & Li, H. (2013). Memory or time: Performance evaluation for iterative operation on hadoop and spark. InHigh Performance Computing and Communications & 2013 IEEE International Conference on Embedded and Ubiquitous Computing (HPCC_EUC), 2013 IEEE 10th International Conference on (pp. 721-727). IEEE.

Data Tools: Hadoop and how to install it

Installation Guide to Hadoop for Windows 10.

What is Hadoop

Hadoop’s Distributed File System (HFDS) is where big data is broken up into smaller blocks (IBM, n.d.), which can be aggregated like a set of Legos throughout a distributed database system. Data blocks are distributed across multiple servers.  This block system provides an easy way to scale up or down the data needs of the company and allows for MapReduce to do it tasks on the smaller sets of the data for faster processing (IBM, n.d). Blocks are small enough that they can be easily duplicated (for disaster recovery purposes) in two different servers (or more, depending on the data needs).

HFDS can support many different data types, even those that are unknown or yet to be classified and it can store a bunch of data.  Thus, Hadoop’s technology to manage big data allows for parallel processing, which can allow for parallel searching, metadata management, parallel analysis (with MapReduce), the establishment of workflow system analysis, etc. (Gary et al., 2005, Hortonworks, 2013, & IBM, n.d.).

Given the massive amounts of data in Big Data that needs to get processed, manipulated, and calculated upon, parallel processing and programming are there to use the benefits of distributed systems to get the job done (Minelli et al., 2013).  Hadoop, which is Java based allows for manipulation and calculations to be done by calling on MapReduce, which pulls on the data which is distributed on its servers, to map key items/objects, and reduces the data to the query at hand (Hortonworks, 2013 & Sathupadi, 2010).

Parallel processing allows making quick work on a big data set, because rather than having one processor doing all the work, Hadoop splits up the task amongst many processors. This is the largest benefit of Hadoop, which allows for parallel processing.  Another advantage of parallel processing is when one processor/node goes out; another node can pick up from where that task last saved safe object task (which can slow down the calculation but by just a bit).  Hadoop knows that this happens all the time with their nodes, so the processor/node create backups of their data as part of their fail safe (IBM, n.d).  This is done so that another processor/node can continue its work on the copied data, which enhances data availability, which in the end gets the task you need to be done now.

Minelli et al. (2013) stated that traditional relational database systems could depend on hardware architecture.  However, Hadoop’s service is part of cloud (as Platform as a Service = PaaS).  For PaaS, we manage the applications, and data, whereas the provider (Hadoop), administers the runtime, middleware, O/S, virtualization, servers, storage, and networking (Lau, 2001).  The next section discusses how to install Hadoop and how to set up Eclipse to access map/reduce servers.

Installation steps

  • Go to the Hadoop Main Page < http://hadoop.apache.org/ > and scroll down to the getting started section, and click “Download Hadoop from the release page.” (Birajdar, 2015)
  • In the Apache Hadoop Releases < http://hadoop.apache.org/releases.html > Select the link for the “source” code for Hadoop 2.7.3, and then select the first mirror: “http://apache.mirrors.ionfish.org/hadoop/common/hadoop-2.7.3/hadoop-2.7.3-src.tar.gz” (Birajdar, 2015)
  • Open the Hadoop-2.7.3 tarball file with a compression file reader like WinRAR archiver < http://www.rarlab.com/download.htm >. Then drag the file into the Local Disk (C:). (Birajdar, 2015)
  • Once the file has been completely transferred to the Local Disk drive, close the tarball file, and open up the hadoop-2.7.3-src folder. (Birajdar, 2015)
  • Download Hadoop 0.18.0 tarball file < https://archive.apache.org/dist/hadoop/core/hadoop-0.18.0/ > and place the copy the “Hadoop-vm-appliance-0-18-0” folder into the Java “jdk1.8.0_101” folder. (Birajdar, 2015; Gnsaheb, 2013)
  • Download Hadoop VM file < http://ydn.zenfs.com/site/hadoop/hadoop-vm-appliance-0-18-0_v1.zip >, unzip it and place it inside the Hadoop src file. (Birajdar, 2015)
  • Open up VMware Workstation 12, and open a virtual machine “Hadoop-appliance-0.18.0.vmx” and select play virtual machine. (Birajdar, 2015)
  • Login: Hadoop-user and password: Hadoop. (Birajdar, 2015; Gnsaheb, 2013)
  • Once in the virtual machine, type “./start-hadoop” and hit enter. (Birajdar, 2015; Gnsaheb, 2013)
    1. To test MapReduce on the VM: bin/Hadoop jar Hadoop-0.18.0-examples.jar pi 10 100000000
      1. You should get a “job finished in X seconds.”
      2. You should get an “estimated value of PI is Y.”
  • To bind MapReduce plugin to eclipse (Gnsaheb, 2013)
    1. Go into the JDK folder, under Hadoop-0.18.0 > contrib> eclipse-plugin > “Hadoop-0.18.0-eclipse-plugin” and place it into the eclipse neon 1 plugin folder “eclipse\plugins”
    2. Open eclipse, then open perspective button> other> map/reduce.
    3. In Eclipse, click on Windows> Show View > other > MapReduce Tools > Map/Reduce location
    4. Adding a server. On the Map/Reduce Location window, click on the elephant
      1. Location name: your choice
      2. Map/Reduce master host: IP address achieved after you log in via the VM
  • Map/Reduce Master Port: 9001
  1. DFS Master Port: 9000
  2. Username: Hadoop-user
  1. Go to the advance parameter tab > mapred.system.dir > edit to /Hadoop/mapped/system

Issues experienced in the installation processes (Discussion of any challenges and explain how it was investigated and solved)

Not one source has the entire solution Birajdar, 2015; Gnsaheb, 2013; Korolev, 2008).  It took a combination of all three sources, to get the same output that each of them has described.  Once the solution was determined to be correct, and the correct versions of the files were located, they were expressed in the instruction set above.  Whenever a person runs into a problem with computer science, google.com is their friend.  The links above will become outdated with time, and methods will change.  Each person’s computer system is different than those from my personal computer system, which is reflected in this instruction manual.  This instruction manual should help others google the right terms and in the right order to get Hadoop installed correctly onto their system.  This process takes about 3-5 hours to install correctly, with the long time it takes to download and install the right files, and with the time to set up everything correctly.

Resources