10 Data Visualization Tools

There are many tools used in today’s market to present data analytics information. Many of these tools are great for particular presentation types. This post will list 10 available big data visualization tools in today’s market.


Data Visualization Tools

There is no shortage of data analytics tools that cover the entire process from data generation to visualization; Figure 1 shows this landscape.  According to Truck (2016), the primary data analytics visualization tools are Tableau, Google Cloud Platform, Qlik, Looker, RoamBI, Chartio, Datorama, Zoomdata, Sisense, and Zeppelin.  Machlis (2011) lists 22 different data visualization tools: R, DataWrangler, Google Refine, Google Fusion Tables, Impure, Tableau Public, Many Eyes, VIDI, Zoho Reports, Choosel, Exhibit, Google Chart Tools, JavaScript InfoVis Toolkit, Protovis, Quantum GIS (QGIS), OpenHeatMap, OpenLayers, OpenStreetMap, TimeFlow, IBM Word-Cloud Generator, Gephi, and NodeXL. Jones (2014) listed a top 10: Tableau Public, OpenRefine, KNIME, RapidMiner, Google Fusion Tables, NodeXL, Import.io, Google Search Operators, Solver, and WolframAlpha.  The California HealthCare Foundation (CHCF, 2014) recommended data visualization tools that everyone in the healthcare industry could use: Google Charts & Maps, Tableau Public, Mapbox, Infogram, Many Eyes, iCharts, and Datawrapper. CHCF (2014) also recommended data visualization tools for developers in the healthcare industry, such as High Charts, TileMill, D3.js, FLOT, Fusion Charts, OpenLayers, and JSMap.  These four sources show that no matter which data visualization software is discussed here, there is a plethora of others, and there is currently no authoritative source listing all of them.  This discussion does not try to be a comprehensive or authoritative source either.


Figure 1: Big Data Landscape 2016 which categorizes big data tools and applications by Infrastructure, Analytics, Application, Open Source, Data Sources & APIs, Incubators & Schools, and Cross-Infrastructure/Analytics. (Adapted from Truck, 2016).

Ten Data Visualization Tools and their strengths and weaknesses

Based on the above subject matter experts, the following ten visualization tools will be discussed:  Tableau & Tableau Public, Google Fusion Tables, OpenLayers, Chartio, Datorama, Zoomdata, NodeXL, Qlik, Looker, and RoamBI.

Tableau Desktop & Tableau Public

Tableau Desktop costs $1000-1200, whereas Tableau Public is free. Both are marketed as end-user interactive business intelligence software that helps reveal insights hidden in the data (Jones, 2014; Machlis, 2011; Phillipson, 2016; Tableau, n.d.).  Tableau is touted to be 10-100x faster than most other commercial visualization software thanks to its intuitive, no-coding, drag-and-drop products (Tableau, n.d.).  Tableau can take in data from Excel spreadsheets, Hadoop, the cloud, etc. and bring them together for comprehensive data analysis (Jones, 2014; Phillipson, 2016; Tableau, n.d.).  The strength of Tableau Public is that all the functionality of Tableau Desktop is provided for free. However, any data stored in Tableau Public is made freely available to others within the community (Jones, 2014; Machlis, 2011).  If data privacy is sought, Tableau Desktop allows the data scientist to analyze the data locally, without sharing key information with the world, but at a price (Machlis, 2011; Tableau, n.d.).

Google Fusion Tables

Fusion Tables is a web-based tool that is accessible to anyone with a Google Drive account, and it allows control over many different aspects of the data visualizations: data scientists can limit the amount of data shown, summarize the data, choose from different chart types, and customize legends without needing to know how to code (Google, n.d.; Machlis, 2011). Jones (2014) calls Fusion Tables the “Google Spreadsheets cooler, larger, and much nerdier cousin.” Data can be found through the Google search engine or imported quickly from CSV, TSV, UTF-8 encoded files, etc. (Google, n.d.; Jones, 2014). Data can even be exported into JSON files, and all the data can be analyzed in private or released to the public (Machlis, 2011). Interactive charts provided by Fusion Tables include network graphs, zoomable line charts, map charts, heat maps, timelines, storylines, animations, pie charts, tables, scatter plots, etc. (Google, n.d.; Machlis, 2011).  The downsides of this tool are that editing multiple cell entries can become tedious, the customizations are quite limited, and for large data the API can demand a ton of resources, slowing down execution (Machlis, 2011).


OpenLayers

OpenLayers is a JavaScript library for displaying geolocation data on maps that is easily customizable and extendable using cutting-edge tiled or vector layer mapping formats (Machlis, 2011; OpenLayers, n.d.). OpenLayers is open source (Machlis, 2011). Some of the map features that can be created are animation, blending, attribution, clustered features, integration with Bing Maps, d3 integration, drag-and-drop interaction, dynamically added data, etc. (OpenLayers, n.d.).  One of the drawbacks is that it requires a fair amount of coding skill in JavaScript, and certain integrations with popular maps are still under development. However, it can run on any web browser (Machlis, 2011).


Chartio

Chartio is a software-as-a-service visual query tool that pulls and joins data from multiple sources easily, without knowledge of SQL (Rist & Strom, 2016).   Chartio can process the data into visualizations to aid in building a case with data-driven analytics and dashboards, all for $2000 (Chartio, n.d.; Rist & Strom, 2016). Chartio’s commercial product can pull data from Amazon RDS, Cassandra, CSV files, DB2, Google Cloud SQL, Google BigQuery, Hadoop, MongoDB, Oracle, Rackspace Cloud, Microsoft SQL Server, Windows Azure Cloud, etc. (Chartio, n.d.; Rist & Strom, 2016). Unfortunately, the user interface is poorly designed and has a steeper learning curve than other data visualization tools (Rist & Strom, 2016). Chartio (n.d.) boasts that connecting to any of the databases above requires just two terminal commands and that data is pulled from these databases through read-only access to protect the data. However, Rist and Strom (2016) had initial problems uploading data to the tool, mostly due to the responsiveness of the API.


Datorama

Datorama, an Israel-based company, offers a cloud-based system and tool for marketing analytics (Gilad, 2016).  The data sources that Datorama uses can come from Facebook, Google, ad exchanges, networks, direct publisher sites, affiliate programs, etc., and the tool can visually demonstrate and monetize the marketing data (Datorama, n.d.; Gilad, 2016).  Datorama allows for multi-level authentication for advanced security (Phillipson, 2016). According to Datorama (n.d.), the tool allows for comparisons between online and offline marketing analysis on a single dashboard. Unfortunately, marketing/sales data is the primary use of this tool, and there are other tools in existence that analyze marketing/sales data and much more (Gilad, 2016).  To know the cost of this software, one must obtain a quote (Phillipson, 2016).


Zoomdata

Zoomdata is an intuitive and collaborative way to visualize data that was built with HTML5, JavaScript, WebSockets, and CSS, along with expandable libraries such as D3, Leaflet, NVD3, etc. (Zoomdata, n.d.).  Graphing features include dynamic dashboards with drill-down capabilities on tabular data, geodata, pie charts, line graphs, scatter plots, bar charts, stacked bar charts, etc. (Darrow, 2016; Zoomdata, n.d.).  Zoomdata allows for web-browser and touch-oriented analysis and can handle real-time data streams and billions of rows of data (Zoomdata, n.d.). The downside is that this software as a service is commercial software, which can set you back $1.91/hour (Darrow, 2016). However, Zoomdata can connect to Hadoop, Cloudera, MongoDB, Amazon, NoSQL, MPP, and SQL databases, cloud applications, etc. (Darrow, 2016; Zoomdata, n.d.).


NodeXL

NodeXL Basic is open source software that works as a Microsoft Excel 2007-2016 plug-in and makes it easy to graph and explore networks and relationships by entering network edge lists (Jones, 2014; Machlis, 2011; NodeXL, 2015).  NodeXL Pro ($29/year-$749/year) extends the basic version with features like handling data streams for social networks, text and sentiment analysis, etc. (NodeXL, 2015).  Data pulled from Facebook, Flickr, YouTube, LinkedIn, and Twitter can be represented through this tool (Jones, 2014; Machlis, 2011). Graph metrics like degree, closeness centrality, PageRank, clustering, and graph density are all available in NodeXL (Jones, 2014; NodeXL, 2015).  Editing the appearance of the graphs, such as color, shape, size, label, and opacity, can be done in both versions (NodeXL, 2015). Unfortunately, the tool is limited mostly to network analysis (Jones, 2014; Machlis, 2011; NodeXL, 2015).
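To make the idea of an edge list and two of the graph metrics named above more concrete, here is a minimal Python sketch. It is not NodeXL code and does not use any NodeXL API; the function name graph_metrics and the sample edge list are made up for illustration, and the graph is assumed to be undirected with no repeated edges counted twice.

```python
from collections import defaultdict


def graph_metrics(edges):
    """Return (degree per node, density) for an undirected simple graph.

    `edges` is an edge list: an iterable of (node_a, node_b) pairs.
    """
    neighbors = defaultdict(set)
    for a, b in edges:
        if a == b:
            continue  # ignore self-loops in this simple sketch
        neighbors[a].add(b)
        neighbors[b].add(a)

    # Degree: how many distinct neighbors each node has.
    degree = {node: len(adjacent) for node, adjacent in neighbors.items()}

    # Density: actual edges divided by the maximum possible number of edges.
    n = len(neighbors)
    m = sum(degree.values()) // 2
    density = 0.0 if n < 2 else 2 * m / (n * (n - 1))
    return degree, density


# Hypothetical edge list, the same shape of input a network tool would take.
edges = [("ann", "bob"), ("bob", "cat"), ("cat", "ann"), ("cat", "dan")]
degree, density = graph_metrics(edges)
```

With this sample input, the node "cat" has the highest degree because it is connected to every other node.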


Qlik

Qlik offers a free self-service data visualization tool that allows you to create dynamic and interactive visualizations while keeping the data on your desktop, without having to release it to the public (Machlis, 2015; Qlik, n.d.).  It is free for both personal and internal business use (Qlik, n.d.).  Unfortunately, it isn’t easy to share data or visualizations with peers, although Qlik does allow sharing data privately with up to 5 people through its cloud services (Machlis, 2015). Qlik allows integration without a data warehouse from data sources like Hadoop, Microsoft Excel, LinkedIn, Twitter, Facebook, cloud, databases, etc. (Qlik, n.d.). Though there is a learning curve to this software, it is not insurmountable, and a user can quickly learn how to build basic graphs with multiple filters (Machlis, 2015).


Looker

Looker aims to be a data visualization and exploration tool used by multiple people, removing the data analytics bottleneck caused by data scientists controlling all the data (Looker, n.d.).  The tool allows for data models, custom metrics, real-time analysis, and blending between data sets to produce drill-down dashboards with the basic charts, graphs, and maps (Looker, n.d.; Software Advice, n.d.). These data models help define all the measures and dimensions behind the data (Software Advice, n.d.). Data inputs can come from commercial off-the-shelf products like Salesforce or from internally created software and applications (Looker, n.d.). According to one customer of the Looker software, the documentation lags behind, making certain tasks hard to do, and another customer says that once a data model is in the application, it becomes hard to edit (Software Advice, n.d.).


RoamBI

RoamBI is a data visualization tool that can be taken anywhere. It is built primarily for mobile devices and can include data from Microsoft Excel, CSV data, SQL Server, Cognos, Salesforce, SAP, Box data, etc. (Bigelow, 2016; MeLLmo Inc., n.d.).  It is designed to allow for data sharing, exploration, and presentation on those devices (MeLLmo Inc., n.d.). It is popular enough that all ten major pharmaceutical companies are using RoamBI on their iPads (Bigelow, 2016). Visualizations capitalize on tabular data, sparklines, bar charts, line charts, stacked bar charts, pie charts, bubble charts, KPIs, etc., all on a dashboard, but they are not customizable and reporting dimensions are limited (Authors, 2010; MeLLmo Inc., n.d.).  The free version of the application allows for localized data to be uploaded and used, whereas the Pro version ($99/year or $795 perpetual) allows for data connections from online sources (Authors, 2010).


In the end, each of the ten data visualization tools has its advantages and disadvantages, along with a different price point.  The best way to select the right tool is to know what one’s data visualization needs are and to compare these and other tools based on those needs.  The tool that meets most or all of the needs should then be selected.


Data Visualization Tools in Healthcare

Data analytics results are useful when the revealed information is presented in an understandable fashion. There are many tools currently available in the market, which are used to present information in the final stage of data analytics.

Purpose and Impact of data visualization in the Healthcare industry

There are many applications of data analytics in the healthcare industry: physician and ambulatory care centers, hospitals and health systems, managed care plans and HMOs, genomic studies, and accountable care organizations (Cyranoski, 2015; eInfochips, n.d.). Visualizing health data can help tell stories by analyzing relevant data so that data-driven decisions and actions can be made (California HealthCare Foundation [CHCF], 2014; eInfochips, n.d.).  Cleardata (n.d.) suggested the following best practices for presenting data in visualizations: use relevant data, begin by understanding what should be communicated and then design toward that, make visualizations easy for the consumer, ensure HIPAA compliance when showing data, and create visualizations that can lead to data-driven decisions and actions. Therefore, before selecting the right visualization tool, a presentation approach must be considered that takes into account personal level of expertise, visualization methods, and the interactivity of the visualization (CHCF, 2014).

It is not enough to analyze the relevant data for data-driven decisions; one must also select relevant visualizations of that data to enable those data-driven decisions (eInfochips, n.d.). There are many ways to visualize data so that key facts are highlighted stylishly and succinctly: tables and rankings, bar charts, line graphs, pie charts, stacked bar charts, tree maps, choropleth maps, cartograms, pinpoint maps, or proportional symbol maps (CHCF, 2014).  These plots, charts, maps, and graphs can be animated, static, or interactive, and they can be delivered as standalone images, dashboards, scorecards, or infographics (CHCF, 2014; eInfochips, n.d.).

The CHCF (2014) recommended data visualization tools that everyone could use: Google Charts & Maps, Tableau Public, Mapbox, Infogram, Many Eyes, iCharts, and Datawrapper. CHCF (2014) also recommended data visualization tools for developers, such as High Charts, TileMill, D3.js, FLOT, Fusion Charts, OpenLayers, and JSMap.  eInfochips (n.d.), in turn, suggested visualization tools like Tableau, R, and Spotfire. Many Eyes has since been shut down by IBM and replaced by Watson Analytics (Machlis, 2011).

Summary of three data visualization tools that are used in health care (Machlis, 2011):

R
  • Description: A statistical analysis tool that can not only do simple arithmetic and regression analysis, but can also do complex data preprocessing, data mining, machine learning, and static data visualizations.
  • Advantages: The library of code is supported by the community, which consists of subject matter experts.
  • Disadvantages: Runs as a command line program; therefore, there is a need to install a Graphical User Interface.
  • Skill Level Required: Advanced Beginner
  • Runs on: Linux, Mac OS X, Unix, & Windows

Tableau or Tableau Public (free version)
  • Description: A tool mostly used for interactive visualization that can do all the visualizations mentioned in the post, through dragging and dropping variables.
  • Advantages: A drag-and-drop interface allows for quick work on data analysis that would take time to code manually.
  • Disadvantages: Data stored in Tableau Public is stored on the web for free for others to use, which may make data privacy hard to control.  Otherwise, the full software is over $1K for a single user. There is limited customization in its interface, but customization can be done through code.
  • Skill Level Required: Beginner to Intermediate
  • Runs on: Windows and Mac OS X

Google Chart Tools
  • Description: A self-contained application for storing data on the cloud and visualizing it anywhere, through the use of JavaScript visualization libraries.
  • Advantages: Integration with other Google products like Google Spreadsheets, and a heavily documented JavaScript library.
  • Disadvantages: Requires coding to make the visualizations, and you don’t have access to the JavaScript code and have to rely on continuous Google support.
  • Skill Level Required: Advanced to Expert
  • Runs on: Any device with a web browser




Data Tools: AI and wildlife case study

Data analytics is all about retrieving the right information from a large pool of data. Many techniques, fast algorithms, and infrastructures are used to help extract the information you need, but in many cases, your abilities are limited.

2015 Case study: Unmanned Aerial Vehicles (UAVs) and Artificial Intelligence (AI) revolutionizing Wildlife Monitoring and Conservation


Aiding the monitoring and conservation of endangered or at-risk animals is at the heart of effective wildlife management.  Understanding the current population of animals is key.  However, current techniques like remote photography, camera traps, tagging, GPS collars, scat detection dogs, and DNA sampling are costly for already strapped resources.  The authors of this study propose to use big data, AI, UAVs, and imagery to count wildlife effectively without depleting resources or disturbing the wildlife, while improving safety and statistical integrity.

The authors mounted a Mobius RGB camera with 1080p resolution and a FLIR thermal camera at 640×510 on an S800 EVO Hexacopter, which has three modes of travel: a predefined flight mode via GPS, a stabilized mode like autopilot, and manual.  The main goal is to capture footage of the area, convert the image to high contrast, identify patterns using AI and match them to the respective animal, and add the identified animal to the total count.  With infrared cameras, the higher-temperature animals stick out from the vegetation and soil background. Therefore, a filter is applied to color the animal white and the background black to allow for classification and pattern recognition to occur.
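The white-on-black filter described above is essentially a threshold on the thermal image. The following Python sketch illustrates that idea only; it is not the authors' implementation. The function name binarize, the threshold value, and the tiny sample frame of temperature readings are all assumptions made for illustration.

```python
def binarize(thermal_frame, threshold):
    """Turn a thermal frame into a black-and-white mask.

    `thermal_frame` is a list of rows, each row a list of temperature
    readings. Pixels at or above `threshold` become white (255) and
    everything else becomes black (0).
    """
    return [
        [255 if pixel >= threshold else 0 for pixel in row]
        for row in thermal_frame
    ]


# Hypothetical 2x3 frame: two warm pixels against a cooler background.
frame = [
    [20.1, 21.0, 35.2],
    [19.8, 36.0, 20.5],
]
mask = binarize(frame, threshold=30.0)  # the threshold is an assumed value
```

A classifier would then only need to look at the white regions of the mask rather than the full image.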

Data Collection Procedures:

This idea was tested against the koala population, given that koalas are iconic to Australia and are a vulnerable species.  The area studied was the Sunshine Coast, 57 km north of Brisbane, Queensland, Australia, where the ground truth number of koalas is 6. The authors flew on November 7, 2014, from 7:10 to 8:00 A.M. to allow for the largest temperature contrast between the koalas and the background.  They flew at three different vertical levels: 20 m, 30 m, and 60 m.  A koala was identified if it appeared in 10 consecutive frames, didn’t make big jumps in location within those frames, and didn’t drastically increase in size.
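The identification rule above (10 consecutive frames, no big jumps, no drastic growth in size) can be sketched as a small check over a sequence of detections. This is a hedged illustration rather than the authors' algorithm: the function name is_confirmed and the numeric limits max_jump and max_growth are hypothetical, since the text only says "big jumps" and "drastically increase". Detected areas are assumed to be positive.

```python
from math import hypot


def is_confirmed(detections, min_frames=10, max_jump=15.0, max_growth=1.5):
    """Check whether a candidate animal passes the consecutive-frame rule.

    `detections` has one entry per video frame: None when nothing was
    detected, otherwise a tuple (x, y, area) with area > 0.
    """
    run = 0      # length of the current run of consistent detections
    prev = None  # detection from the previous frame, if any
    for det in detections:
        if det is None:
            run, prev = 0, None  # a missed frame breaks the run
            continue
        if prev is not None:
            jump = hypot(det[0] - prev[0], det[1] - prev[1])
            growth = det[2] / prev[2]
            if jump > max_jump or growth > max_growth:
                run = 0  # inconsistent with the previous frame: restart here
        run += 1
        prev = det
        if run >= min_frames:
            return True
    return False


# Hypothetical tracks: one steady, one that jumps halfway through.
steady = [(100 + i, 50, 40.0) for i in range(10)]
jumpy = [(100, 50, 40.0)] * 5 + [(300, 50, 40.0)] * 5
steady_ok = is_confirmed(steady)
jumpy_ok = is_confirmed(jumpy)
```

The steady track is accepted, while the track that jumps 200 pixels between frames never accumulates 10 consistent frames and is rejected.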

Evaluation of effectiveness:

At each of the three levels, 100% of the koalas were identified.  However, it is important to note that there was a greater chance of a false positive at 60 m above ground, and it took almost twice the time for the AI classification algorithm to detect the koalas at that height.  The authors suggested that improving the AI classification algorithm by adding more template shapes for animals at different angles would help speed up the AI and improve the quality of detection.  Also, the quality of the templates can contribute to the quality of the detection.  This illustrates the need to add more dynamic templates to the system, creating a bigger dataset to draw inferences from, which in turn can raise the quality of detection.  Therefore, the combination of big data and AI is important for this study.

Other applications:

The benefit of this application of UAVs, data analytics, and AI could be further extended to search and rescue missions for humans lost in national parks, etc.  The UAVs can supplement human and dog trackers to gain an advantage in finding the victims quickly, since time is extremely important.  Therefore, besides conservationists, park rangers can adapt these methods to help in recovery missions.  Another application could include the Department of Defense, for search and rescue missions or mitigation of casualties during times of war.


  • Gonzalez, L. F., Montes, G. A., Puig, E., Johnson, S., Mengersen, K., & Gaston, K. J. (2015). Unmanned Aerial Vehicles (UAVs) and Artificial Intelligence revolutionizing Wildlife Monitoring and Conservation. Sensors, 16(1), 97. doi: 10.3390/s16010097

Data Tools: Artificial Intelligence

Analyzing large data sets requires developing and applying complex algorithms. As data sets become larger, it becomes more difficult for skilled individuals to make sense of it all.

Big data Analytics and Artificial Intelligence

Artificial Intelligence (AI) is an embedded technology built on the current infrastructure (i.e., supercomputers), big data, and machine learning algorithms (Cyranoski, 2015; Power, 2015). Previously, AI could not come into existence because it lacked the computational power that is available today (Cringely, 2013).  AI can make use of data hidden in “dark wells” and silos, where the end-user had no idea that the data even existed to begin with (Power, 2015).  The goal of AI is to use huge amounts of data to draw out a set of rules through machine learning that will effectively replace experts in a certain field (Cringely, 2013; Power, 2015). Cringely (2013) stated that in some situations big data can eliminate the need for theory and that AI can aid in analyzing big data where theory is either lacking or impossible to define.

AI can provide tremendous value since it builds thousands of models and correlations automatically in one week, which used to take a few quantitative data scientists years to do (Dewey, 2013; Power, 2015).  What slowed down the progression of AI in the past was the creation of human-readable computer languages like XML or SQL, which are not intuitive for computers to read (Cringely, 2013).  Fortunately, AI can easily use structured data and can now use unstructured data as well, thanks to everyone who tags unstructured data either in comments or on the data point itself, which speeds up the computational time (Cringely, 2013; Power, 2015).  Dewey (2013) hypothesized that not only will AI be able to analyze big data at speeds faster than any human can, but that an AI system can also begin to improve its search algorithms in a phenomenon called an intelligence explosion.  An intelligence explosion occurs when an AI system begins to analyze itself to improve itself in an iterative process, to the point where there is exponential growth in improvement (Dewey, 2013).

Unfortunately, the rules created by AI out of 50K variables lack substantive human meaning, or the “why” behind them, thus making it hard to interpret the results (Power, 2015).  It would take many scientists analyzing the same big data in full to understand how the connections were made in the AI system, which is no longer feasible (Cringely, 2013).  It is as if data scientists are trying to read the mind of the AI system, and they currently cannot even read a human’s mind. However, the results of AI are becoming accurate, with AI learning to identify cats in photographs after 72 hours of machine learning and after a cat is tagged in only a few photographs (Cringely, 2013). AI could be applied to any field of study, like finance, social science, science, engineering, etc., or even play against champions on the Jeopardy game show (Cyranoski, 2015; Cringely, 2013; Dewey, 2013; Power, 2015).

Example of artificial intelligence use in big data analysis: Genomics

The goal of using AI on genomic data is to help analyze physiological traits and lifestyle choices to provide a dedicated and personalized health plan to treat and eventually prevent disease (Cyranoski, 2015; Power, 2015).  This is done by feeding AI systems huge amounts of genomic data, which is considered big data by today’s standards (Cyranoski, 2015). Systems like IBM’s Watson (an AI system) could provide treatment options based on the results gained from analyzing thousands or even millions of genomes (Power, 2015).  This is done by analyzing all this data and allowing machine learning techniques to devise algorithms based on the input data (Cringely, 2013; Cyranoski, 2015; Power, 2015).  As of 2015, there are about 100,000 individual genomes in the system, and even this huge amount of data is still not enough to provide the personalized health plan that is currently being envisioned based on a person’s genomic data (Cyranoski, 2015).  Eventually, millions of individuals will need to be added to the AI system, and not just their genomic data, but also proteomics, metabolomics, lipidomics, etc.


Data Tools: Hadoop Basic Components & Architecture

A report that describes how data can be handled before Hadoop can take action on breaking data into manageable sizes.

Big Data

Big data can be defined as any set of data that has high velocity, volume, and variety, also known as the 3Vs (Davenport & Dyche, 2013; Fox & Do, 2013; Podesta, Pritzker, Moniz, Holdren, & Zients, 2014).  What is considered to be big data can change over time.  What was considered big data in 2002 is not considered big data in 2016 due to advancements made in technology (Fox & Do, 2013).  However, given that big data today is too big to be processed by just one processor, parallel processing allows data analytics to be conducted more efficiently through platforms like Hadoop (Hortonworks, 2013; IBM, n.d.).

Hadoop: Basic Components and Architecture

Hadoop’s service is part of the cloud, offered as a Platform as a Service (PaaS).  For PaaS, the end users manage the applications and data, whereas the provider (Hadoop) administers the runtime, middleware, O/S, virtualization, servers, storage, and networking (Lau, 2001).

Hadoop is predominantly known for its Hadoop Distributed File System (HDFS), where the data is distributed across multiple systems, and for its code for running MapReduce tasks (Rathbone, 2013). Data is broken up into small blocks, like Legos, that are distributed across a distributed database system and across multiple servers (IBM, n.d.).  Just like Legos, in the end the results can be assembled back together.  This feature of HDFS allows Hadoop to manage big data through parallel processing and analysis (Gray et al., 2005; Hortonworks, 2013; IBM, n.d.).  Multiple data types are supported through the HDFS (IBM, n.d.). Hadoop’s MapReduce function can be broken down into two queries.

Parallel processing is key for Hadoop because it makes quick work of a big data set: rather than having one processor do all the work, Hadoop splits the task among many processors. The first of MapReduce’s two main queries, called the mapping procedure, splits the data into the Lego pieces and places them across a group of computer nodes in the HDFS (Eini, 2010; IBM, n.d.; Hortonworks, 2013; Sathupadi, 2010). The second MapReduce query applies the same algorithms on each of the computer nodes to reduce the data and answer the question that was asked of it, such that at the end of the parallel processing procedures, the reduced data gets combined and further reduced to provide the final answer (Eini, 2010; IBM, n.d.; Hortonworks, 2013; Minelli et al., 2013; Sathupadi, 2010). In other words, data is partitioned, sorted, and grouped to provide a key and value as an output (Hortonworks, 2013; Rathbone, 2013; Sathupadi, 2010). Therefore, according to IBM (n.d.), MapReduce functions use the HDFS to house the data, and MapReduce runs its procedures on the server on which the data is stored.  Data is stored in memory, not in cache, which allows for continuous service (Gu & Li, 2013; Zaharia et al., 2012).
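The map, group, and reduce flow described above can be illustrated with a single-machine word count in Python. This is a teaching sketch only: it does not use Hadoop or any Hadoop API, it runs sequentially rather than in parallel, and the function names and sample blocks are invented for illustration.

```python
from collections import defaultdict


def map_phase(block):
    """Map: emit a (key, value) pair for every word in one block of text."""
    return [(word.lower(), 1) for word in block.split()]


def shuffle(pairs):
    """Group: collect all values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single answer."""
    return key, sum(values)


def map_reduce(blocks):
    # On a real cluster each block would be mapped on the node that stores it;
    # here the blocks are simply processed one after another.
    mapped = [pair for block in blocks for pair in map_phase(block)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Two hypothetical blocks standing in for pieces of a larger file.
blocks = ["big data big", "data tools"]
counts = map_reduce(blocks)
```

The result maps each word (the key) to its total count (the value), which mirrors the key and value output mentioned in the text.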

The Lego-block feature of the HDFS, which enables MapReduce functions, means that each block contains a subset of the data that is small enough to be easily duplicated (for disaster recovery purposes) on two or more different servers (IBM, n.d.).  Partitioning the data into Lego blocks allows big iterative tasks to be done quite easily and efficiently for big data sets (Gu & Li, 2013).
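The block splitting and duplication described above can be sketched in a few lines of Python. This is not how HDFS is implemented: the round-robin placement is an assumption chosen for simplicity, and the function names, server names, and block size are hypothetical.

```python
def split_into_blocks(data, block_size):
    """Cut a byte string into fixed-size blocks (the last may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, servers, replicas=2):
    """Assign each block to `replicas` different servers, round-robin.

    Round-robin is an illustrative assumption, not the HDFS placement policy.
    """
    if replicas > len(servers):
        raise ValueError("need at least as many servers as replicas")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            servers[(block_id + offset) % len(servers)]
            for offset in range(replicas)
        ]
    return placement


# Hypothetical data and server names.
data = b"abcdefghij"
blocks = split_into_blocks(data, block_size=4)
placement = place_replicas(len(blocks), ["node-a", "node-b", "node-c"])
```

Because every block lives on two different servers in this sketch, losing any single server still leaves a full copy of the data, and joining the blocks in order reassembles the original.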

When to use Hadoop

Gu and Li (2013) recommend that if speed to the solution is not an issue but memory is, Spark shouldn’t be prioritized over Hadoop; however, if speed to the solution is critical and the job is iterative, Spark should be prioritized. Spark is faster than Hadoop in iterative operations by 25x-40x for really small datasets and 3x-5x for relatively large datasets, but Spark is more memory intensive, and its speed advantage disappears when available memory goes down to zero with really large datasets (Gu & Li, 2013).  Also, Hadoop fails to provide a real-time response (Greer, Rodriguez-Martinez, & Seguel, 2010).  Therefore, Hadoop should be used for big data that isn’t streaming in real time and that has a ton of iterative processing/analytical tasks, particularly when memory is constrained.

Preparation of Big Data for Hadoop

Collecting the raw and unaltered real-world data is usually the first step of any data or text mining study (Corrales et al., 2015; Gera & Goel, 2015; He et al., 2013; Hoonlor, 2011; Nassirtoussi et al., 2014). Next, the data must be preprocessed, because raw text data files are unsuitable for predictive data analytics tools like Hadoop (Hoonlor, 2011). Barak and Modarres (2015) and Nassirtoussi et al. (2014) both stated that in data and text mining alike, data preprocessing has the most significant impact on the research results.  Wayner (2013) and Lublinsky, Smith, and Yakubovich (2013) enumerated the following tools, part of the core components of the ecosystem, that are used to preprocess data prior to data analysis with Hadoop:

  • Ambari: Graphical User Interface for setting up clusters with common components. Essentially a simple management tool.
  • Avro: serialization system that compiles all the data together into an XML or JSON output to be shared with others.
  • BigTop: tool that provides testing of sub-projects within Hadoop.
  • Clouds: Allows the end-user to spin up multiple nodes to process the data without necessarily owning the infrastructure, essentially pay as you go model
  • Flume: Gathers all data and places it into HDFS. Essentially an enterprise data integration tool.
  • GIS tools: allows end-users to work with big data stored as geographic maps under GIS (Geographic Information Systems) formats.
  • HBase: helps search and share a big tabular data set; unfortunately, full ACID is not available. Essentially a NoSQL database.
  • HDFS: Storage of big data in multiple distributed systems into data blocks. Essentially a Distributed reliable data storage.
  • Hive: SQL type language that files and pulls out data that is needed from HBase. Essentially a high-level abstraction tool.
  • Lucene: indexes large blocks of unstructured text based data and allows for dynamic clustering and ability to read XML
  • Mahout: Allows for Hadoop to use classification, filtering, k-means, Dirichlet, parallel pattern, and Bayesian classification similar to Hadoop’s MapReduce. Essentially a data analytics library.
  • NoSQL: Uses NoSQL data stores for data that is not typically stored in HBase or HDFS.
  • Oozie: manages the workflow of a job by allowing the user to break the job into simple steps in a flowchart fashion. Essentially a workflow manager.
  • Pig: stores and maps data in processing nodes for Hadoop to find and process. Essentially a high-level abstraction tool.
  • Spark: uses Hadoop infrastructure to store data in the cache to allow for faster processing time
  • SQL on Hadoop: ad-hoc query the data stored in Hadoop servers using SQL
  • Sqoop: transfers data stored in SQL databases into Hadoop. Essentially an enterprise data integration tool.
  • Whirr: Library that allows users to run Hadoop clusters on Amazon EC2, Rackspace, etc.
  • ZooKeeper: maintains order and synchronization throughout the parallel processing cluster. Essentially a coordinator of processes.

According to Lublinsky et al. (2013), there are always new datasets, data formats, and data preprocessing and processing tools being added to Hadoop.  Thus, the list provided above is not comprehensive, but rather a starting point.


  • Barak, S., & Modarres, M. (2015). Developing an approach to evaluate stocks by forecasting effective features with data mining methods. Expert Systems with Applications, 42(3), 1325–1339. http://doi.org/10.1016/j.eswa.2014.09.026
  • Corrales, D. C., Ledezma, A., & Corrales, J. C. (2015). A Conceptual Framework for Data Quality in Knowledge Discovery Tasks (FDQ-KDT): A Proposal. Journal of Computers, 10(6), 396-405. doi: 10.17706/jcp.10.6.396-405
  • Davenport, T. H., & Dyche, J. (2013). Big Data in Big Companies. International Institute for Analytics, (May), 1–31.
  • Fox, S., & Do, T. (2013). Getting real about Big Data: applying critical realism to analyse Big Data hype. International Journal of Managing Projects in Business, 6(4), 739–760. http://doi.org/10.1108/IJMPB-08-2012-0049
  • Gera, M., & Goel, S. (2015). Data Mining-Techniques, Methods and Algorithms: A Review on Tools and their Validity. International Journal of Computer Applications, 113(18), 22–29.
  • Greer, M., Rodriguez-Martinez, M., & Seguel, J. (2010). Open Source Cloud Computing Tools: A Case Study with a Weather Application. Florida: IEEE Open Source Cloud Computing.
  • Podesta, J., Pritzker, P., Moniz, E. J., Holdren, J., & Zients, J. (2014). Big Data: Seizing Opportunities. Executive Office of the President of USA, 1–79.
  • Gray, J., Liu, D. T., Nieto-Santisteban, M., Szalay, A., DeWitt, D. J., & Heber, G. (2005). Scientific data management in the coming decade. ACM SIGMOD Record, 34(4), 34-41.
  • Gu, L., & Li, H. (2013). Memory or time: Performance evaluation for iterative operation on Hadoop and Spark. In High Performance Computing and Communications & 2013 IEEE International Conference on Embedded and Ubiquitous Computing (HPCC_EUC), 2013 IEEE 10th International Conference on (pp. 721-727). IEEE.
  • Eini, O. (2010). Map/Reduce- a visual explanation. Retrieved from https://ayende.com/blog/4435/map-reduce-a-visual-explanation
  • He, W., Zha, S., & Li, L. (2013). Social media competitive analysis and text mining: A case study in the pizza industry. International Journal of Information Management, 33, 464–472. http://doi.org/10.1016/j.ijinfomgt.2013.01.001
  • Hoonlor, A. (2011). Sequential patterns and temporal patterns for text mining. UMI Dissertation Publishing.
  • Hortonworks (2013). Introduction to MapReduce. Retrieved from https://www.youtube.com/watch?v=ht3dNvdNDzI
  • IBM (n.d.). What is the Hadoop Distributed File System (HDFS)? Retrieved from https://www-01.ibm.com/software/data/infosphere/hadoop/hdfs/
  • Lau, W. (2001). A Comprehensive Introduction to Cloud Computing. Retrieved from https://www.simple-talk.com/cloud/development/a-comprehensive-introduction-to-cloud-computing/
  • Lublinsky, B., Smith, K., & Yakubovich, A. (2013). Professional Hadoop Solutions. Wrox, VitalBook file.
  • Minelli, M., Chambers, M., & Dhiraj, A. (2013). Big Data, Big Analytics: Emerging Business Intelligence and Analytic Trends for Today’s Businesses (1st ed.). VitalSource Bookshelf Online.
  • Nassirtoussi, A. K., Aghabozorgi, S., Wah, T. Y., & Ngo, D. C. L. (2014). Text mining for market prediction: a systematic review. Expert Systems with Applications, 41(16), 7653–7670. http://doi.org/10.1016/j.eswa.2014.06.009
  • Rathbone, M. (2013). A beginners guide to Hadoop. Retrieved from http://blog.matthewrathbone.com/2013/04/17/what-is-hadoop.html
  • Sathupadi, K. (2010). Map Reduce: A really simple introduction. Retrieved from http://ksat.me/map-reduce-a-really-simple-introduction-kloudo/