Data Tools: WEKA

WEKA

The Java based, open sourced, and platform independent Waikato Environment for Knowledge Analysis (WEKA) tool, for data preprocessing, predictive data analytics, and facilitation interpretations and evaluation (Dogan & Tanrikulu, 2013; Gera & Goel, 2015; Miranda, n.d.; Xia & Gong, 2014).  It was originally developed for analyzing agricultural data and has evolved to house a comprehensive collection of data preprocessing and modeling techniques (Patel & Donga 2015).  It is a java based machine learning algorithm for data mining tasks as well as text mining that could be used for predictive modeling, housing pre-processing, classification, regression, clustering, association rules, and visualization (WEKA, n.d). Also, WEKA contains classification, clustering, association rules, regression, and visualization capabilities, in particular, the C4.5 decision tree predictive data analytics algorithm (Dogan & Tanrikulu, 2013; Gera & Goel, 2015; Hachey & Grover, 2006; Kumar & Fet, 2011). Here WEKA is an open source data and text mining software tool, thus it is free to use. Therefore there are no costs associated with this software solution.

WEKA can be applied to big data (WEKA, n.d.) and SQL Databases (Patel & Donga, 2015). Subsequently, WEKA has been used in many research studies that are involved in big data analytics (Dogan & Tanrikulu, 2013; Gera & Goel, 2015; Hachey & Grover, 2006; Kumar & Fet, 2011; Parkavi & Sasikumar, 2016; Xia & Gong, 2014). For instance, Barak and Modarres (2015) used WEKA for decision tree analysis on predicting stock risks and returns.

The fact that it has been using in this many research studies is that the reliability and validity of the software are high and well established.  Even in a study comparing WEKA with 12 other data analytics tools, is one of two apps studied that have a classification, regression, and clustering algorithms (Gera & Goel, 2015).

A disadvantage of using this tool is its lack of supporting multi-relational data mining, but if one can link all the multi-relational data into one table, it can do its job (Patel & Donga, 2015). The comprehensiveness of analysis algorithms for both data and text mining and pre-processing is its advantage. Another disadvantage of WEKA is that it cannot handle raw data directly, meaning the data had to be preprocessed before it is entered into the software package and analyzed (Hoonlor, 2011). WEKA cannot even import excel files, data in Excel have to be converted into CSV format to be usable within the system (Miranda, n.d.)

References:

  • Dogan, N., & Tanrikulu, Z. (2013). A comparative analysis of classification algorithms in data mining for accuracy, speed and robustness. Information Technology and Management, 14(2), 105-124. doi:http://dx.doi.org/10.1007/s10799-012-0135-7
  • Gera, M., & Goel, S. (2015). Data Mining -Techniques, Methods and Algorithms: A Review on Tools and their Validity. International Journal of Computer Applications, 113(18), 22–29.
  • Hoonlor, A. (2011). Sequential patterns and temporal patterns for text mining. UMI Dissertation Publishing.
  • Kumar, D., & Fet, D. (2011). Performance Analysis of Various Data Mining Algorithms: A Review. International Journal of Computer Applications, 32(6), 9–16.
  • Miranda, S. (n.d.). An Introduction to Social Analytics : Concepts and Methods.
  • Parkavi, S. & Sasikumar, S. (2016). Prediction of Commodities Market by Using Data Mining Technique. i-Manager’s Journal on Computer Science.
  • Patel, K., & Donga, J. (2015). Practical Approaches: A Survey on Data Mining Practical Tools. Foundations, 2(9).
  • WEKA (n.d.) WEKA 3: Data Mining Software in Java. Retrieved from http://www.cs.waikato.ac.nz/ml/weka/
  • Xia, B. S., & Gong, P. (2014). Review of business intelligence through data analysis. Benchmarking, 21(2), 300–311. http://doi.org/http://dx.doi.org/10.1108/BIJ-08-2012-0051
Advertisements

Big Data Analytics: Open-Sourced Tools

Here are three open source text mining software tools for analyzing unstructured big data:

  1. Carrot2
  2. Weka
  3. Apache OpenNLP.

One of the great things about these three software tools is that they are free.  Thus, there is no cost per each software solution.

 Carrot2

A Java based code, which also has a native integration with PHP, and C#/.NET API (Gonzalez-Aguilar & Ramirez Posada, 2012).  Carrot2 can organize a collection of documents into categories based on themes in a visual manner; it can also be used as a web clustering engine. Carpineto, Osinski, Romano, and Weiss (2009) stated that web clustering search engines like Carrot2 help you with fast subtopic retrievals, (i.e. searching for tiger, you can get tiger woods, tigers, Bengals, Bengals football team, etc.), Topic exploration (through a cluster hierarchy), and alleviation information overlook (does more than the first page of results search). The algorithms it uses for categorization is Lingo (Lingo3G), K-mean, and STC, which can support multiple language clustering, synonyms, etc. (Carrot, n.d.).  This software can be used online instead of regular search engines as well (Gonzalez-Aguilar & Ramirez Posada, 2012).  Gonzalez-Aguilar and Ramirez Posada (2012) explain that the interface has three phases for processing information: entry, filtration, and exit.  It represents the cluster data in three visual formats: Heatmap, Network, and pie chart.

The disadvantage of this tool is that it only does clustering analysis, but its advantage is that it can be applied to a search engine to facilitate faster and more accurate searches through its subtopic analysis.  If you would like to use Carrot2 as a search engine, go to http://search.carrot2.org/stable/search and try it out.

Weka

It was originally developed for analyzing agricultural data and has evolved to house a comprehensive collection of data preprocessing and modeling techniques (Patel & Donga 2015).  It is a java based machine learning algorithm for data mining tasks as well as text mining that could be used for predictive modeling, housing pre-processing, classification, regression, clustering, association rules, and visualization (Weka, n.d). Weka can be applied to big data (Weka, n.d.) and SQL Databases (Patel & Donga, 2015).

A disadvantage of using this tool is its lack of supporting multi-relational data mining, but if you can link all the multi-relational data into one table, it can do its job (Patel & Donga, 2015). The comprehensiveness of analysis algorithms for both data and text mining and pre-processing is its advantage.

 Apache OpenNLP

A Java code conventional machine learning toolkit, with tasks such as tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, and conference resolution (OpenNLP, n.d.) OpenNLP works well with the NetBeans and Eclipse IDE, which helps in the development process.  This tool has dependencies on Maven, UIMA Annotators, and SNAPSHOT.

The advantage of OpenNLP is that specification of rules, constraints, and lexicons don’t need to be entered in manually. Thus, it is a machine learning method which aims to maximize entropy (Buyko, Wermter, Poprat, & Hahn, 2006).  Maximizing entropy allows for collect facts consistently and uniformly.  When the sentence splitter, tokenization, part-of-speech tagging, named entity extraction, chunking, parsing, and conference resolution was tested on two medical corpora, accuracy was up in the high 90%s (Buyko et al., 2006).

This software has high accuracy as its advantage, but also produces quite a bit of false negatives which is its disadvantage.   In the sentence splitter function, it picked up literature citations, and in tokenization, it took specialized characters “-” and “/” (Buyko et al., 2006).

 References:

  • Buyko, E., Wermter, J., Poprat, M., & Hahn, U. (2006). Automatically adapting an NLP core engine to the biology domain. In Proceedings of the Joint BioLINK-Bio-Ontologies Meeting. A Joint Meeting of the ISMB Special Interest Group on Bio-Ontologies and the BioLINK Special Interest Group on Text Data M ining in Association with ISMB (pp. 65-68).
  • Carpineto, C., Osinski, S., Romano, G., and Weiss, D. 2009. A survey of web clustering engines. ACM Comput. ´ Surv. 41, 3, Article 17 (July 2009), 38 pages. DOI = 10.1145/1541880.1541884 http://doi.acm.org/10.1145/1541880.1541884
  • Carrot (n.d.) Open source framework for building search clustering engines. Retrieved from http://project.carrot2.org/index.html
  • Gonzalez-Aguilar, A. AND Ramirez-Posada, M. (2012): Carrot2: Búsqueda y visualización de la información (in Spanish). El Profesional de la Informacion. Retrieved from http://project.carrot2.org/publications/gonzales-ramirez-2012.pdf
  • openNLP (n.d.) The Apache Software Foundation: OpenNLP. Retrieved from https://opennlp.apache.org/
  • Weka (n.d.) Weka 3: Data Mining Software in Java. Retrieved from http://www.cs.waikato.ac.nz/ml/weka/
  • Patel, K., & Donga, J. (2015). Practical Approaches: A Survey on Data Mining Practical Tools. Foundations, 2(9).