Compelling topics

Hadoop, XML and Spark

Hadoop is predominantly known for two things: its Hadoop Distributed File System (HDFS), in which the data is distributed across multiple systems, and its code for running MapReduce tasks (Rathbone, 2013). MapReduce has two queries: the first maps the input data into a final format and splits it across a group of computer nodes, while the second reduces the data in each node so that combining the results from all the nodes provides the answer sought (Eini, 2010).
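As a rough illustration of the two steps described above, the sketch below counts words across two hypothetical nodes in base R on a single machine. It is not how Hadoop schedules or executes jobs; the names nodes, map_node, and reduce_counts are assumptions made for this example.

```r
# "nodes" is a hypothetical stand-in for chunks of data held on separate machines.
nodes <- list(
  node1 = c("big data", "data tools"),
  node2 = c("data mining", "big tools")
)

# Map step: split the lines held by one node into words and count them locally.
map_node <- function(lines) {
  words <- unlist(strsplit(lines, " ", fixed = TRUE))
  vapply(split(words, words), length, integer(1))
}

# Reduce step: merge two partial count vectors by summing the counts per word.
reduce_counts <- function(a, b) {
  keys <- union(names(a), names(b))
  vapply(keys, function(k) sum(a[k], b[k], na.rm = TRUE), numeric(1))
}

partial <- lapply(nodes, map_node)       # one partial result per node
total <- Reduce(reduce_counts, partial)  # combine all nodes into one answer
print(total)
```

Each node only ever sees its own lines in the map step, and the reduce step needs nothing but the partial counts, which is what lets the work be spread across machines.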

XML documents represent a whole data file, which contains markups, elements, and nodes (Lublinsky, Smith, & Yakubovich, 2013; Myer, 2005):

  • XML markups are tags that help describe the data’s start and end points as well as its properties/attributes; each tag is enclosed by < and >
  • XML elements are data values, enclosed by an opening <tag> and a closing </tag>
  • XML nodes are the parts of a document’s hierarchical structure; each contains a data element and its tags

Unfortunately, the syntax and tags are redundant, which can consume a large number of bytes and slow down processing (Hiroshi, 2007).
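To make the byte cost of the tags concrete, here is a minimal base R sketch that separates the markup from the element values in a small, made-up document and measures how much of the file is markup. The document content and variable names are assumptions for illustration only, and the regular expression is a shortcut that would not be safe for arbitrary XML.

```r
# Hypothetical XML document used only for this illustration.
xml_doc <- paste0(
  '<library>',
  '<book id="1"><title>Dune</title><year>1965</year></book>',
  '<book id="2"><title>Emma</title><year>1815</year></book>',
  '</library>'
)

# Markups: every tag enclosed by < and >, including attributes such as id="1".
tags <- regmatches(xml_doc, gregexpr("<[^>]+>", xml_doc))[[1]]

# Element values: what is left once every tag is stripped out.
values_only <- gsub("<[^>]*>", "", xml_doc)

total_bytes <- nchar(xml_doc, type = "bytes")
value_bytes <- nchar(values_only, type = "bytes")
markup_share <- 1 - value_bytes / total_bytes

print(tags)
cat("Share of bytes spent on markup:", round(markup_share, 2), "\n")
```

In this toy document most of the bytes are tags rather than values, which is the redundancy the paragraph above refers to.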

Five questions must be asked before designing an XML data document (Font, 2010):

  1. Will this document be part of a solution?
  2. Will this document have design standards that must be followed?
  3. What part may change over time?
  4. To what extent is human readability or machine readability important?
  5. Will there be a massive amount of data? Does file size matter?

All XML data documents should be versioned, and key stakeholders should be involved in the XML data design process (Font, 2010). XML is a machine- and human-readable data format (Smith, 2012). With a goal of using XML for MapReduce, we need to assume that we will map and reduce huge files (Eini, 2010; Smith, 2012). Unfortunately, XML doesn’t include sync markers in the data format, and therefore MapReduce doesn’t natively support XML (Smith, 2012). However, Smith (2012) and Rohit (2013) used the XmlInputFormat class from Mahout to load XML input data into HBase. Smith (2012) stated that Mahout’s code needs to know the exact sequence of XML start and end tags to search for, and that elements with attributes are hard for Mahout’s XML library to detect and parse.
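The limitation with attributes is easier to see in a small sketch. The base R function below is not Mahout’s XmlInputFormat code; it is a hypothetical helper, extract_records, that mimics the general idea of scanning text for one exact start tag and one exact end tag, under the assumption that records are not nested.

```r
# Hypothetical helper: pull out records by scanning for literal start/end tags.
extract_records <- function(text, start_tag, end_tag) {
  records <- character(0)
  pos <- 1
  repeat {
    rest <- substring(text, pos)
    s <- regexpr(start_tag, rest, fixed = TRUE)           # next start tag
    if (s == -1) break
    e <- regexpr(end_tag, substring(rest, s), fixed = TRUE)  # matching end tag
    if (e == -1) break
    rec_end <- s + e - 1 + nchar(end_tag) - 1             # last char of the record
    records <- c(records, substr(rest, s, rec_end))
    pos <- pos + rec_end                                   # continue after the record
  }
  records
}

feed <- '<rows><row>a</row><row id="2">b</row><row>c</row></rows>'
recs <- extract_records(feed, "<row>", "</row>")
print(recs)
```

Because the search string is the literal text <row>, the record that opens with <row id="2"> is never matched and is silently skipped, which is the kind of problem Smith (2012) describes for elements with attributes.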

Apache Spark started from a working group inside and outside of UC Berkeley, in search of an open-source alternative to the MapReduce batch processing model that could handle multi-pass algorithms (Zaharia et al., 2012). Spark is faster than Hadoop in iterative operations by 25x-40x for really small datasets and by 3x-5x for relatively large datasets, but Spark is more memory intensive, and its speed advantage disappears when available memory goes down to zero with really large datasets (Gu & Li, 2013). Apache Spark’s website boasts that it can run programs 100x faster than Hadoop’s MapReduce in memory (Spark, n.d.). Spark outperforms Hadoop by 10x on iterative machine learning jobs (Gu & Li, 2013). Also, Spark runs 10x faster than Hadoop on disk (Spark, n.d.). Gu and Li (2013) recommend that if speed to the solution is not an issue, but memory is, then Spark shouldn’t be prioritized over Hadoop; however, if speed to the solution is critical and the job is iterative, Spark should be prioritized.

Data visualization

Big data can be defined as any set of data that has high velocity, volume, and variety, also known as the 3Vs (Davenport & Dyche, 2013; Fox & Do, 2013; Podesta, Pritzker, Moniz, Holdren, & Zients, 2014). What is considered to be big data can change over time: what was considered big data in 2002 is not considered big data in 2016, due to advancements made in technology (Fox & Do, 2013). Data-in-motion can be defined as a part of data velocity; it deals with the speed of data coming in from multiple sources as well as the speed of data traveling between systems (Katal, Wazid, & Goudar, 2013). Essentially, data-in-motion can encompass data streaming, data transfer, or real-time data. However, there are challenges and issues that have to be addressed to conduct real-time analysis on data streams (Katal et al., 2013; Tsinoremas et al., n.d.).

It is not enough to analyze the relevant data for data-driven decisions; one must also select relevant visualizations of that data to enable those data-driven decisions (eInfochips, n.d.). There are many ways to visualize data so that key facts are highlighted with style and succinctly: tables and rankings, bar charts, line graphs, pie charts, stacked bar charts, tree maps, choropleth maps, cartograms, pinpoint maps, or proportional symbol maps (CHCF, 2014). These plots, charts, maps, and graphs could be animated, static, or interactive visualizations, and could be presented as standalone images, dashboards, scorecards, or infographics (CHCF, 2014; eInfochips, n.d.).

Artificial Intelligence (AI)

Artificial Intelligence (AI) is an embedded technology, based on the current infrastructure (i.e., supercomputers), big data, and machine learning algorithms (Cyranoski, 2015; Power, 2015). AI can provide tremendous value since it builds thousands of models and correlations automatically in one week, which used to take a few quantitative data scientists years to do (Dewey, 2013; Power, 2015). Unfortunately, the rules created by AI out of 50K variables lack substantive human meaning, or the “why” behind them, thus making it hard to interpret the results (Power, 2015).

“Machines can excel at frequent high-volume tasks. Humans can tackle novel situations,” said Anthony Goldbloom. Thus, the fundamental question that decision makers need to ask is how much of the decision can be reduced to frequent high-volume tasks and how much of it to novel situations (Goldbloom, 2016). If the ratio is skewed toward high-volume tasks, then AI could be a candidate to replace decision makers; if the ratio is evenly split, then AI could augment and assist decision makers; and if the ratio is skewed toward novel situations, then AI wouldn’t help decision makers. These novel situations are equivalent to our tough challenges today (McAfee, 2013). Finally, Meetoo (2016) warned that it doesn’t matter how intelligent or strategic a job is: if there is enough data on that job to create accurate rules, it can be automated as well, because machine learning can run millions of simulations against itself to generate huge volumes of data to learn from.




Data Tools: WEKA

Many tools are used for data analytics. WEKA is one of the free tools on the market.


The Waikato Environment for Knowledge Analysis (WEKA) is a Java-based, open-source, and platform-independent tool for data preprocessing, predictive data analytics, and facilitating interpretation and evaluation (Dogan & Tanrikulu, 2013; Gera & Goel, 2015; Miranda, n.d.; Xia & Gong, 2014). It was originally developed for analyzing agricultural data and has evolved to house a comprehensive collection of data preprocessing and modeling techniques (Patel & Donga, 2015). It is a Java-based collection of machine learning algorithms for data mining tasks, as well as text mining, that can be used for predictive modeling; it houses pre-processing, classification, regression, clustering, association rules, and visualization capabilities (WEKA, n.d.). In particular, it includes the C4.5 decision tree algorithm for predictive data analytics (Dogan & Tanrikulu, 2013; Gera & Goel, 2015; Hachey & Grover, 2006; Kumar & Fet, 2011). Because WEKA is an open-source data and text mining software tool, it is free to use, so there are no costs associated with this software solution.

WEKA can be applied to big data (WEKA, n.d.) and SQL databases (Patel & Donga, 2015). Consequently, WEKA has been used in many research studies that involve big data analytics (Dogan & Tanrikulu, 2013; Gera & Goel, 2015; Hachey & Grover, 2006; Kumar & Fet, 2011; Parkavi & Sasikumar, 2016; Xia & Gong, 2014). For instance, Barak and Modarres (2015) used WEKA for decision tree analysis on predicting stock risks and returns.

The fact that it has been used in so many research studies suggests that the reliability and validity of the software are high and well established. In a study comparing WEKA with 12 other data analytics tools, WEKA was one of only two apps studied that had classification, regression, and clustering algorithms (Gera & Goel, 2015).

A disadvantage of this tool is its lack of support for multi-relational data mining, but if one can link all the multi-relational data into one table, it can do its job (Patel & Donga, 2015). Its advantage is the comprehensiveness of its analysis algorithms for data mining, text mining, and pre-processing. Another disadvantage of WEKA is that it cannot handle raw data directly, meaning the data has to be preprocessed before it is entered into the software package and analyzed (Hoonlor, 2011). WEKA cannot even import Excel files; data in Excel has to be converted into CSV format to be usable within the system (Miranda, n.d.).
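One possible way to work around both limitations outside of WEKA is to flatten the related tables and export the result as CSV. The base R sketch below assumes two made-up tables, customers and orders, linked by a customer_id column; the table names, columns, and output file name are all hypothetical.

```r
# Two hypothetical related tables linked by customer_id.
customers <- data.frame(
  customer_id = c(1, 2, 3),
  region = c("north", "south", "north")
)
orders <- data.frame(
  order_id = c(10, 11, 12),
  customer_id = c(1, 1, 3),
  amount = c(25.0, 40.5, 12.0)
)

# Link the multi-relational data into one table (inner join on the shared key).
flat <- merge(orders, customers, by = "customer_id")

# Write the single table as CSV so it can be loaded into a tool that expects one flat file.
out_file <- file.path(tempdir(), "orders_flat.csv")
write.csv(flat, out_file, row.names = FALSE)
print(flat)
```

The inner join drops customers with no orders; whether that is acceptable depends on the analysis, so the join type is a design choice rather than a requirement.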


  • Dogan, N., & Tanrikulu, Z. (2013). A comparative analysis of classification algorithms in data mining for accuracy, speed and robustness. Information Technology and Management, 14(2), 105-124. doi:
  • Gera, M., & Goel, S. (2015). Data Mining -Techniques, Methods and Algorithms: A Review on Tools and their Validity. International Journal of Computer Applications, 113(18), 22–29.
  • Hoonlor, A. (2011). Sequential patterns and temporal patterns for text mining. UMI Dissertation Publishing.
  • Kumar, D., & Fet, D. (2011). Performance Analysis of Various Data Mining Algorithms: A Review. International Journal of Computer Applications, 32(6), 9–16.
  • Miranda, S. (n.d.). An Introduction to Social Analytics : Concepts and Methods.
  • Parkavi, S. & Sasikumar, S. (2016). Prediction of Commodities Market by Using Data Mining Technique. i-Manager’s Journal on Computer Science.
  • Patel, K., & Donga, J. (2015). Practical Approaches: A Survey on Data Mining Practical Tools. Foundations, 2(9).
  • WEKA (n.d.) WEKA 3: Data Mining Software in Java. Retrieved from
  • Xia, B. S., & Gong, P. (2014). Review of business intelligence through data analysis. Benchmarking, 21(2), 300–311.

Big Data Analytics: R

R has proven to be the most effective software tool for analyzing big data. This post is a quick discussion of my evaluation of this tool and its relevance in the big data arena, with respect to text mining.

R is a powerful statistical tool that can aid in data mining; thus, it has huge relevance in the big data arena. Focusing on my project, I have found that R has a text mining package, tm.

Patel and Donga (2015) and Fayyad, Piatetsky-Shapiro, and Smyth (1996) say that the main techniques in data mining are: anomaly detection (outlier/change/deviation detection), association rule learning (relationships between the variables), clustering (grouping data that are similar to one another), classification (applying a known structure to new data), regression (finding a function to describe the data), and summarization (visualizations, reports, dashboards). According to Ghosh, Roy, and Bandyopadhyay (2012), the main types of text mining techniques are: text categorization (assigning text/documents to pre-defined categories), text clustering (grouping similar text/documents together), concept mining (discovering concept/logic-based ideas), information retrieval (finding the relevant documents per the query), and information extraction (identifying key phrases and relationships within the text). Meanwhile, Agrawal and Batra (2013) add the following to the list of text mining techniques: summarization (a compressed representation of the input), assessing document similarity (similarities between different documents), and document retrieval (identifying and grabbing the most relevant documents).

We use the tm package, loaded with library(tm), to aid in transforming text, stemming words, building a term-document matrix, etc., mostly for preprocessing the data (RStudio pubs, n.d.). Based on RStudio pubs (n.d.), some text preprocessing steps and code are as follows:

  • To remove punctuation:

docs <- tm_map(docs, removePunctuation)

  • To remove special characters:

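# Assumes tm >= 0.6, where custom functions are wrapped in content_transformer()
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
docs <- tm_map(docs, toSpace, "/")
docs <- tm_map(docs, toSpace, "@")
docs <- tm_map(docs, toSpace, "\\|")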

  • To remove numbers:

docs <- tm_map(docs, removeNumbers)

  • Convert to lowercase:

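# tolower is a base R function, so tm >= 0.6 needs it wrapped in content_transformer()
docs <- tm_map(docs, content_transformer(tolower))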

  • Removing “stopwords”/common words

docs <- tm_map(docs, removeWords, stopwords("english"))

  • Removing particular words

docs <- tm_map(docs, removeWords, c("department", "email"))

  • Combining words that should stay together

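# Wrap gsub() so the replacement can be applied to every document through tm_map()
replaceText <- content_transformer(function(x, from, to) gsub(from, to, x, fixed = TRUE))
docs <- tm_map(docs, replaceText, "qualitative research", "QDA")
docs <- tm_map(docs, replaceText, "qualitative studies", "QDA")
docs <- tm_map(docs, replaceText, "qualitative analysis", "QDA")
docs <- tm_map(docs, replaceText, "research methods", "research_methods")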

  • Removing common word endings (stemming)

library(SnowballC)
docs <- tm_map(docs, stemDocument)
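Once the text is cleaned, the term-document matrix mentioned earlier is the usual next step. In practice the tm package’s TermDocumentMatrix() function builds it from a corpus; the base R sketch below only illustrates what such a matrix contains, using two made-up documents that are assumed to be already preprocessed.

```r
# Hypothetical, already-cleaned documents (lowercase, no punctuation or stopwords).
docs_clean <- c(
  doc1 = "research methods qda research",
  doc2 = "qda survey methods"
)

# Split each document into its terms.
tokens <- strsplit(docs_clean, " ", fixed = TRUE)
names(tokens) <- names(docs_clean)
terms <- sort(unique(unlist(tokens)))

# Rows are terms, columns are documents, cells are how often a term occurs.
tdm <- vapply(tokens, function(tok) {
  vapply(terms, function(term) sum(tok == term), integer(1))
}, integer(length(terms)))
dimnames(tdm) <- list(terms, names(docs_clean))

print(tdm)
```

Word frequencies, term correlations, and clustering all start from a matrix of this shape, which is why so much of the effort goes into the preprocessing steps above.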

Text mining algorithms could consist of but are not limited to (Zhao, 2013):

  • Summarization:
    • Word clouds: use library(wordcloud)
    • Word frequencies
  • Regressions:
    • Term correlations: use the findAssocs() function from library(tm)
    • Plotting word frequencies: use library(ggplot2)
  • Classification models:
    • Decision trees: use library(party) or library(rpart)
  • Association models:
    • Apriori: use library(arules)
  • Clustering models:
    • K-means clustering: use library(fpc)
    • K-medoids clustering: use library(fpc)
    • Hierarchical clustering: use library(cluster)
    • Density-based clustering: use library(fpc)
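As one example from the clustering entries above, the sketch below groups four made-up documents by their term counts. To stay self-contained it uses kmeans() and hclust() from base R’s stats package rather than the fpc and cluster packages named in the list, and the term counts are hypothetical.

```r
# Hypothetical term counts: rows are documents, columns are terms.
term_counts <- matrix(
  c(5, 4, 0, 0,
    4, 5, 0, 0,
    0, 0, 5, 4,
    0, 1, 4, 5),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("d1", "d2", "d3", "d4"),
                  c("data", "mining", "spark", "hadoop"))
)

# K-means starts from random centers, so fix the seed and try several starts.
set.seed(42)
km <- kmeans(term_counts, centers = 2, nstart = 10)

# Hierarchical clustering on Euclidean distances, cut into two groups.
hc <- hclust(dist(term_counts))
hc_groups <- cutree(hc, k = 2)

print(km$cluster)
print(hc_groups)
```

Both methods should place d1 with d2 and d3 with d4 here, since the two pairs share almost no vocabulary; real corpora are rarely this clean, so the number of clusters usually needs more care.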

As we can see, there are existing libraries and functions to help with data preprocessing, data mining, and data visualization when it comes to text mining with R and RStudio.