Compelling topics in big data analytics

  • Big data is defined as data with high volume, high variety/complexity, and high velocity, known as the 3Vs (Services, 2015).
  • The goals and objectives of the problem should help define which theories and techniques of big data analytics to use. Fayyad, Piatetsky-Shapiro, and Smyth (1996) divided data analytics into descriptive and predictive analytics. Vardarlier and Silahtaroglu (2016) agreed with Fayyad et al.'s (1996) division but added prescriptive analytics. Thus, these three divisions of big data analytics are:
    • Descriptive analytics explains “What happened?”
    • Predictive analytics explains “What will happen?”
    • Prescriptive analytics explains “What should be done?”
  • The scientific method helps give a framework for the data analytics lifecycle (Dietrich, 2013; Services, 2015). According to Dietrich (2013), it is a cyclical life cycle that has iterative parts in each of its six steps: discovery, pre-processing data, model planning, model building, communicating results, and operationalizing.
  • Data-in-motion is the real-time streaming of data from a broad spectrum of technologies, which also encompasses data transmission between systems (Katal, Wazid, & Goudar, 2013; Kishore & Sharma, 2016; Ovum, 2016; Ramachandran & Chang, 2016). Data that is stored on a database or cloud system is considered data-at-rest, and data that is being processed and analyzed is considered data-in-use (Ramachandran & Chang, 2016). The analysis of real-time streaming data in a timely fashion is also known as stream reasoning, and implementing solutions for stream reasoning revolves around high-throughput systems and low-latency storage (Della Valle et al., 2016).
  • Data brokers are tasked with collecting data on people, building a particular type of profile on each person, and selling it to companies (Angwin, 2014; Beckett, 2014; Tsesis, 2014). The data brokers' main mission is to collect data and break down the barriers of geographic location, cognitive or cultural gaps, different professions, or parties that don't trust each other (Long, Cunningham, & Braithwaite, 2013). The danger of collecting this data is that it can increase incidents of discrimination based on race or income, whether directly or indirectly (Beckett, 2014).
  • Data auditing is assessing the quality and fitness for purpose of data via key metrics and properties of the data (Techopedia, n.d.). Data auditing processes and procedures are the business's way of assessing and controlling its data quality (Eichhorn, 2014).
  • If following an agile development process, the key stakeholders should be involved throughout the lifecycle. These key stakeholders include the business user, project sponsor, project manager, business intelligence analyst, database administrator, data engineer, and data scientist (Services, 2015).
  • Lawyers define privacy in terms of (Richards & King, 2014): invasions into protected spaces, relationships, or decisions; the collection of information; the use of information; and the disclosure of information.
  • Richards and King (2014) describe that a binary notion of data privacy does not hold: data is never completely private/confidential nor completely divulged, but lies in between these two extremes. Privacy laws should focus on the flow of personal information, with an emphasis on a type of privacy called confidentiality, where data is agreed to flow to a certain individual or group of individuals (Richards & King, 2014).
  • Fraud is deception; fraud detection is needed because even as fraud detection algorithms improve, the rate of fraud is increasing (Minelli, Chambers, & Dhiraj, 2013). Data mining allows for fraud detection via multi-attribute monitoring, which tries to find hidden anomalies by identifying hidden patterns through the use of class description and class discrimination (Brookshear & Brylow, 2014; Minelli et al., 2013).
  • High-performance computing uses a cluster or grid of servers or virtual machines connected by a network to provide distributed storage and workflow (Bhokare et al., 2016; Connolly & Begg, 2014; Minelli et al., 2013).
  • Parallel computing environments draw on that distributed storage and workflow across the cluster or grid of servers or virtual machines to process big data (Bhokare et al., 2016; Minelli et al., 2013).
  • NoSQL (Not only Structured Query Language) databases store data in non-relational form, e.g., graph, document store, column-oriented, key-value, and object-oriented databases (Sadalage & Fowler, 2012; Services, 2015). NoSQL databases provide a data model for applications that require less code and less debugging, run on clusters, handle large-scale data, and evolve with time (Sadalage & Fowler, 2012).
    • Document store NoSQL databases use a key/value pair where the value is the document itself, which could be in JSON, BSON, or XML (Sadalage & Fowler, 2012; Services, 2015). These documents are hierarchical trees (Sadalage & Fowler, 2012). Sample document databases include MongoDB and CouchDB; a small sketch of this model follows the end of this list.
    • Graph NoSQL databases are used for drawing networks, showing the relationships between items in a graphical format that has been optimized for easy searching and editing (Services, 2015). Each item is considered a node, and adding more nodes or relationships while traversing through them is simpler in a graph database than in a traditional database (Sadalage & Fowler, 2012). Sample graph databases include Neo4j and Pregel (Park et al., 2014).
    • Column-oriented databases are well suited to sparse datasets, ones with many null values; when columns do have data, the related columns are grouped together (Services, 2015). Grouping demographic data like age, income, gender, marital status, sexual orientation, etc. is a great example of a use for this type of NoSQL database. Cassandra is an example of a column-oriented database.
  • Public cloud environments are where a supplier provides a company with a cluster or grid of servers through the internet, e.g., Spark running on AWS EC2 (Connolly & Begg, 2014; Minelli et al., 2013).
  • A community cloud environment is a cloud shared exclusively by a set of companies that have similar characteristics, compliance, security, jurisdiction, etc. (Connolly & Begg, 2014).
  • Private cloud environments have a similar infrastructure to a public cloud, but the infrastructure holds the data of only one company, and its services are shared across the different business units of that one company (Connolly & Begg, 2014; Minelli et al., 2013).
  • Hybrid clouds are two or more cloud structures that have either a private, community or public aspect to them (Connolly & Begg, 2014).
  • Cloud computing allows a company to purchase only the services it needs, without having to purchase the infrastructure to support the services it might think it will need. This allows for hyper-scale computing in a distributed environment, also known as hyper-scale cloud computing, where the volume and demand of data can explode exponentially and still be accommodated cost-efficiently in a public, community, private, or hybrid cloud (Mainstay, 2016; Minelli et al., 2013).
  • The building block system of big data analytics involves a few steps (Burkle et al., 2001):
    • Identify the purpose that the new data will and should serve
      • How many functions should it support
      • Mark which parts of that new data are needed for each function
    • Identify the tools needed to support the purpose of that new data
    • Create a top-level architecture plan view
    • Build based on the plan, but leave room to pivot when needed
      • Modifications occur to allow the final vision to be achieved given the conditions at the time of building the architecture.
      • Other modifications come from a closer inspection of certain components in the architecture.
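
To make the document-store idea above concrete, here is a minimal, hypothetical Python sketch of the key/document pattern Sadalage and Fowler (2012) describe; the in-memory dictionary, keys, and field names are illustrative stand-ins, not the actual API of MongoDB or CouchDB.

    import json

    # Minimal in-memory stand-in for a document store: each key maps to a
    # hierarchical (tree-like) document, serialized here as JSON.
    store = {}

    def put(key, document):
        """Store a nested document under a key, serialized as JSON."""
        store[key] = json.dumps(document)

    def get(key):
        """Retrieve and deserialize the document stored under a key."""
        return json.loads(store[key])

    # A hypothetical customer profile document with nested fields.
    put("customer:1001", {
        "name": "Jane Doe",
        "address": {"city": "Denver", "state": "CO"},
        "orders": [{"id": 1, "total": 42.50}, {"id": 2, "total": 13.25}],
    })

    print(get("customer:1001")["address"]["city"])  # -> Denver

Because the whole document travels with its key, an application can fetch or update a complete hierarchical record in one operation, which is part of what makes this model attractive for cluster-friendly, low-code applications.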

 

References

  • Angwin, J. (2014). Privacy tools: Opting out from data brokers. Pro Publica. Retrieved from https://www.propublica.org/article/privacy-tools-opting-out-from-data-brokers
  • Beckett, L. (2014). Everything we know about what data brokers know about you. Pro Publica. Retrieved from https://www.propublica.org/article/everything-we-know-about-what-data-brokers-know-about-you
  • Bhokare, P., Bhagwat, P., Bhise, P., Lalwani, V., & Mahajan, M. R. (2016). Private cloud using GlusterFS and Docker. International Journal of Engineering Science, 5016.
  • Brookshear, G., & Brylow, D. (2014). Computer Science: An Overview, (12th). Pearson Learning Solutions. VitalBook file.
  • Burkle, T., Hain, T., Hossain, H., Dudeck, J., & Domann, E. (2001). Bioinformatics in medical practice: what is necessary for a hospital?. Studies in health technology and informatics, (2), 951-955.
  • Connolly, T., Begg, C. (2014). Database Systems: A Practical Approach to Design, Implementation, and Management, (6th). Pearson Learning Solutions. [Bookshelf Online].
  • Della Valle, E., Dell’Aglio, D., & Margara, A. (2016). Tutorial: Taming velocity and variety simultaneous big data and stream reasoning. Retrieved from https://pdfs.semanticscholar.org/1fdf/4d05ebb51193088afc7b63cf002f01325a90.pdf
  • Dietrich, D. (2013). The genesis of EMC’s data analytics lifecycle. Retrieved from https://infocus.emc.com/david_dietrich/the-genesis-of-emcs-data-analytics-lifecycle/
  • Eichhorn, G. (2014). Why exactly is data auditing important? Retrieved from http://www.realisedatasystems.com/why-exactly-is-data-auditing-important/
  • Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996). From data mining to knowledge discovery in databases. AI Magazine, 17(3), 37. Retrieved from: http://www.aaai.org/ojs/index.php/aimagazine/article/download/1230/1131/
  • Katal, A., Wazid, M., & Goudar, R. H. (2013, August). Big data: Issues, challenges, tools and good practices. In Contemporary Computing (IC3), 2013 Sixth International Conference on (pp. 404-409). IEEE.
  • Kishore, N. & Sharma, S. (2016). Secure data migration from enterprise to cloud storage – analytical survey. BIJIT-BVICAM's International Journal of Information Technology. Retrieved from http://bvicam.ac.in/bijit/downloads/pdf/issue15/09.pdf
  • Long, J. C., Cunningham, F. C., & Braithwaite, J. (2013). Bridges, brokers and boundary spanners in collaborative networks: A systematic review. BMC Health Services Research, 13(1), 158.
  • Mainstay. (2016). An economic study of the hyper-scale data center. Mainstay, LLC, Castle Rock, CO, USA. Retrieved from http://cloudpages.ericsson.com/transforming-the-economics-of-data-center
  • Minelli, M., Chambers, M., &, Dhiraj, A. (2013). Big Data, Big Analytics: Emerging Business Intelligence and Analytic Trends for Today’s Businesses. John Wiley & Sons P&T. [Bookshelf Online].
  • Ovum (2016). 2017 Trends to watch: Big Data. Retrieved from http://info.ovum.com/uploads/files/2017_Trends_to_Watch_Big_Data.pdf
  • Park, Y., Shankar, M., Park, B. H., & Ghosh, J. (2014, March). Graph databases for large-scale healthcare systems: A framework for efficient data management and data services. In Data Engineering Workshops (ICDEW), 2014 IEEE 30th International Conference on (pp. 12-19). IEEE.
  • Ramachandran, M. & Chang, V. (2016). Toward validating cloud service providers using business process modeling and simulation. Retrieved from http://eprints.soton.ac.uk/390478/1/cloud_security_bpmn1%20paper%20_accepted.pdf
  • Richards, N. M., & King, J. H. (2014). Big Data Ethics. Wake Forest Law Review, 49, 393–432.
  • Sadalage, P. J., Fowler, M. (2012). NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence, 1st Edition. [Bookshelf Online].
  • Services, E. E. (2015). Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data, (1st). [Bookshelf Online].
  • Techopedia (n.d.). Data audit. Retrieved from https://www.techopedia.com/definition/28032/data-audit
  • Tsesis, A. (2014). The right to erasure: Privacy, data brokers, and the indefinite retention of data. Wake Forest Law Review, 49, 433.
  • Vardarlier, P., & Silahtaroglu, G. (2016). Gossip management at universities using big data warehouse model integrated with a decision support system. International Journal of Research in Business and Social Science, 5(1), 1–14. doi: http://doi.org/10.1108/17506200710779521

Data Tools: Artificial Intelligence and Data Analytics

The three main ingredients for artificial intelligence are computers, data, and fast algorithms. Big data technologies can turn endless streams of data into business intelligence and decision-making capabilities. So how does artificial intelligence influence data analytics?

Machine learning, often discussed under the umbrella of Artificial Intelligence (AI), adds an intelligence layer to big data to handle bigger sets of data and derive patterns from them that even a team of data scientists would find challenging (Maycotte, 2014; Power, 2015). AI derives its insights not from how machines are programmed, but from how the machines perceive the data and take actions from that perception, essentially conducting self-learning (Maycotte, 2014). Understanding how a machine perceives a big dataset is a hard task, which also makes it hard to interpret the resulting final models (Power, 2015). AI is even revolutionizing how we understand what intelligence is (Spaulding, 2013).

So what is intelligence?

At first, doing arithmetic was thought of as a sign of biological intelligence, until the invention of digital computers shifted the marker of biological intelligence to logical reasoning, deduction, and inference, and eventually to fuzzy logic, grounded learning, and reasoning under uncertainty, which are now matched through Bayes nets, probability, and current data analytics (Spaulding, 2013). So, as humans keep moving the dial of what biological intelligence is toward more complex structures, any task that relies on high-frequency and voluminous data can be matched by AI (Goldbloom, 2016). Therefore, as our definition of intelligence expands, so will the need to capture intelligence artificially, driving change in how big datasets are analyzed.

AI's influence on the future of data analytics modeling, results, and interpretation

This concept should help revolutionize how data scientists and statisticians think about which hypotheses to ask, which variables are relevant, how the resulting outputs fit into an appropriate conceptual model, and why the patterns hidden in the data help generate the decision outcome forecasted by AI (Power, 2015). Making sense of these models requires subject matter experts from multiple fields and multiple levels of the employment hierarchy analyzing the model outputs, because it is through diversity and inclusion of thought that we will understand an AI's analytical insights.

Also, owning data is different from understanding data (Lapowsky, 2014). AI can make use of data hidden in “dark wells” and silos, where the end-user had no idea the data even existed, which allows a data scientist to gain a better understanding of their datasets (Lapowsky, 2014; Power, 2015).

AI on generating datasets and using data analytics for self-improvements

Data scientists currently collect, preprocess, process, and analyze big volumes of data regularly to provide decision makers with insights from the data so they can make data-driven decisions (Fayyad, Piatetsky-Shapiro, & Smyth, 1996). From these data-driven decisions, data scientists then measure the outcomes to prove the effectiveness of their insights (Maycotte, 2014). Analyzing the results of data-driven decisions allows machine learning algorithms to learn from those decisions and actions and create better ways of searching for key patterns in bigger and future datasets. This is the ability of AI to conduct self-learning based on the results of data analytics, through the use of data analytics (Maycotte, 2014). Meetoo (2016) stated that if there is enough data to create accurate rules, there is enough to create insights, because machine learning can run millions of simulations against itself to generate huge volumes of data from which to learn.

AI on Data Analytics Process

AI is a result of the massive amounts of data being collected, the culmination of ideas from the most brilliant computer scientists of our time, and an IT infrastructure that didn't exist a few years ago (Power, 2015). Given that data analytics processes include collecting data, preprocessing data, processing data, and analyzing the results, any improvements made for AI on the infrastructure can influence any part of the data analytics process (Fayyad et al., 1996; Power, 2015). For example, as AI technology learns how to read raw data and turn it into information, the need for most of the current preprocessing techniques for data cleaning could disappear (Minelli, Chambers, & Dhiraj, 2013). Therefore, as AI advances, newer IT infrastructures will be dreamt up and built such that data analytics and its processes can leverage this new infrastructure, which can also change how big datasets are analyzed.


Data Tools: Artificial Intelligence

Analyzing large data sets requires developing and applying complex algorithms. As data sets become larger, it becomes more difficult for a skilled individual to make sense of it all.

Big data Analytics and Artificial Intelligence

Artificial Intelligence (AI) is an embedded technology, based on the current infrastructure (i.e., supercomputers), big data, and machine learning algorithms (Cyranoski, 2015; Power, 2015). Previously, AI couldn't come into existence without the proper computational power that is provided today (Cringely, 2013). AI can make use of data hidden in “dark wells” and silos, where the end-user had no idea that the data even existed (Power, 2015). The goal of AI is to use huge amounts of data to draw out a set of rules through machine learning that will effectively replace experts in a certain field (Cringely, 2013; Power, 2015). Cringely (2013) stated that in some situations big data can eliminate the need for theory and that AI can aid in analyzing big data where theory is either lacking or impossible to define.

AI can provide tremendous value since it builds thousands of models and correlations automatically in one week, which used to take a few quantitative data scientists years to do (Dewey, 2013; Power, 2015). One thing that slowed the progression of AI in the past was the creation of human-readable computer languages like XML or SQL, which are not intuitive for computers to read (Cringely, 2013). Fortunately, AI can easily use structured data and can now use unstructured data, thanks to everyone who tags all this unstructured data either in comments or on the data point itself, speeding up the computational time (Cringely, 2013; Power, 2015). Dewey (2013) hypothesized that not only will AI be able to analyze big data at speeds faster than any human can, but that an AI system can also begin to improve its own search algorithms, in a phenomenon called an intelligence explosion. An intelligence explosion is when an AI system begins to analyze itself in order to improve itself in an iterative process, to the point where there is exponential growth in improvement (Dewey, 2013).

Unfortunately, the rules created by AI out of 50K variables lack substantive human meaning, or the “why” behind them, thus making the results hard to interpret (Power, 2015). It would take many scientists analyzing the same big data to fully understand how the connections were made in the AI system, which is no longer feasible (Cringely, 2013). It is as if data scientists are trying to read the mind of the AI system, when they cannot even read a human's mind. However, the results of AI are becoming accurate, with AI identifying cats in photographs after 72 hours of machine learning and after a cat is tagged in only a few photographs (Cringely, 2013). AI can be applied to any field of study, like finance, social science, science, engineering, etc., or even play against champions on the Jeopardy game show (Cyranoski, 2015; Cringely, 2013; Dewey, 2013; Power, 2015).

Example of artificial intelligence use in big data analysis: Genomics

The goal of using AI on genomic data is to help analyze physiological traits and lifestyle choices to provide a dedicated and personalized health plan to treat and eventually prevent disease (Cyranoski, 2015; Power, 2015). This is done by feeding the AI systems huge amounts of genomic data, which is considered big data by today's standards (Cyranoski, 2015). Systems like IBM's Watson (an AI system) could provide treatment options based on the results gained from analyzing thousands or even millions of genomic records (Power, 2015). This is done by analyzing all this data and allowing machine learning techniques to devise algorithms based on the input data (Cringely, 2013; Cyranoski, 2015; Power, 2015). As of 2015, there were about 100,000 individual genomic records in the system, and even with this huge amount of data, it is still not enough to provide the personalized health plan that is currently being envisioned from a person's genomic data (Cyranoski, 2015). Eventually, millions of individuals will need to be added into the AI system, along with not just genomic data but also proteomics, metabolomics, lipidomics, etc.


Data Tools: Hadoop Basic Components & Architecture

A report that describes how data can be handled before Hadoop takes action on it by breaking the data into manageable sizes.

Big Data

Big data can be defined as any set of data that has high velocity, volume, and variety, also known as the 3Vs (Davenport & Dyche, 2013; Fox & Do, 2013; Podesta, Pritzker, Moniz, Holdren, & Zients, 2014). What is considered to be big data can change with respect to time: what was considered big data in 2002 is not considered big data in 2016, due to advancements made in technology over time (Fox & Do, 2013). However, given that big data today is too big to be processed by just one processor, parallel processing allows data analytics to be conducted more efficiently through platforms like Hadoop (Hortonworks, 2013; IBM, n.d.).

Hadoop: Basic Components and Architecture

Hadoop's service is part of the cloud, as a Platform as a Service (PaaS). For PaaS, the end users manage the applications and data, whereas the provider (here, the Hadoop platform) administers the runtime, middleware, O/S, virtualization, servers, storage, and networking (Lau, 2001).

Hadoop is predominantly known for its Hadoop Distributed File System (HDFS), where data is distributed across multiple systems, and for its code for running MapReduce tasks (Rathbone, 2013). Data is broken up into small blocks, like Legos, such that they are distributed across a distributed database system and across multiple servers (IBM, n.d.). Just like Legos, the results can be assembled back together. This feature of HDFS allows Hadoop to manage big data through parallel processing and analysis (Gray et al., 2005; Hortonworks, 2013; IBM, n.d.). Multiple data types are supported through HDFS (IBM, n.d.). Hadoop's MapReduce function can be broken down into two queries.

Parallel processing is key for Hadoop, because it allows for making quick work of a big data set; rather than having one processor do all the work, Hadoop splits up the task amongst many processors. The first of MapReduce's two main queries splits the data into the Lego pieces and places them across a group of computer nodes in HDFS, called the mapping procedure (Eini, 2010; IBM, n.d.; Hortonworks, 2013; Sathupadi, 2010). The second MapReduce query applies algorithms to reduce the data in each of the computer nodes equally to answer the question that was asked of the data, such that at the end of the parallel processing procedures, the reduced data gets combined and further reduced to provide the final answer (Eini, 2010; IBM, n.d.; Hortonworks, 2013; Minelli et al., 2013; Sathupadi, 2010). In other words, data is partitioned, sorted, and grouped to provide a key and value as an output (Hortonworks, 2013; Rathbone, 2013; Sathupadi, 2010). IBM's (n.d.) MapReduce functions use HDFS to house the data, and MapReduce runs its procedures on the server in which the data is stored. Data is stored in memory, not in cache, which allows for continuous service (Gu & Li, 2013; Zaharia et al., 2012).
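
As a rough, single-machine illustration of the two MapReduce queries described above, the following Python sketch simulates the map and reduce steps on a toy word-count problem; the text blocks, function names, and sequential execution are simplifications for this post and are not Hadoop's actual Java API.

    from collections import defaultdict
    from itertools import chain

    # Hypothetical text "blocks", standing in for HDFS data blocks that
    # would normally sit on different nodes.
    blocks = [
        "big data needs parallel processing",
        "parallel processing splits big data",
    ]

    def map_phase(block):
        """Map: emit (key, value) pairs, here (word, 1), for one block."""
        return [(word, 1) for word in block.split()]

    def reduce_phase(pairs):
        """Reduce: group the pairs by key and combine the values per key."""
        counts = defaultdict(int)
        for word, count in pairs:
            counts[word] += count
        return dict(counts)

    # Each block is mapped independently (in Hadoop, on the node that holds
    # the block); the emitted pairs are then combined into the final answer.
    mapped = chain.from_iterable(map_phase(b) for b in blocks)
    print(reduce_phase(mapped))  # e.g. {'big': 2, 'data': 2, 'parallel': 2, ...}

In a real cluster the mapping happens in parallel where the data lives, and the shuffle/sort between the two phases is what groups each key's values onto the node that reduces them.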

Given the Lego-block feature of HDFS, which enables the MapReduce functions, these blocks can contain a subset of data small enough that they can be easily duplicated (for disaster recovery purposes) on two or more different servers (IBM, n.d.). This partitioning of the data into data Lego blocks allows big iterative tasks to be done quite easily and efficiently on big data sets (Gu & Li, 2013).

When to use Hadoop

Gu and Li (2013) recommend that if speed to the solution is not an issue, but memory is, then Spark shouldn't be prioritized over Hadoop; however, if speed to the solution is critical and the job is iterative, Spark should be prioritized. Spark is faster than Hadoop in iterative operations by 25x-40x for really small datasets and 3x-5x for relatively large datasets, but Spark is more memory intensive, and its speed advantage disappears when available memory goes down to zero with really large datasets (Gu & Li, 2013). Also, Hadoop fails at providing a real-time response (Greer, Rodriguez-Martinez, & Seguel, 2010). Therefore, Hadoop should be used for big data that isn't streaming in real time and that has a ton of iterative processing/analytical tasks.

Preparation of Big Data for Hadoop

Collecting the raw and unaltered real-world data is usually the first step of any data or text mining study (Corrales et al., 2015; Gera & Goel, 2015; He et al., 2013; Hoonlor, 2011; Nassirtoussi et al., 2014). Next, the data must be preprocessed, because raw text data files are unsuitable for predictive data analytics tools like Hadoop (Hoonlor, 2011). Barak and Modarres (2015) and Nassirtoussi et al. (2014) all stated that in both data and text mining, data preprocessing has the most significant impact on the research results. Wayner (2013) and Lublinsky, Smith, and Yakubovich (2013) enumerated the following tools, used to preprocess data prior to data analysis with Hadoop, as part of the core components of the ecosystem (a small preprocessing sketch follows the list below):

  • Ambari: Graphical User Interface for setting up clusters with common components. Essentially a simple management tool.
  • Avro: A serialization system that compiles all the data together into an XML or JSON output to be shared with others.
  • BigTop: tool that provides testing of sub-projects within Hadoop.
  • Clouds: Allows the end-user to spin up multiple nodes to process the data without necessarily owning the infrastructure, essentially pay as you go model
  • Flume: Gathers all data and places it into HDFS. Essentially an enterprise data integration tool.
  • GIS tools: allows end-users to work with big data stored as geographic maps under GIS (Geographic Information Systems) formats.
  • HBase: Helps search and share a big tabular data set; unfortunately, full ACID support is not available. Essentially a NoSQL database.
  • HDFS: Storage of big data in multiple distributed systems into data blocks. Essentially a Distributed reliable data storage.
  • Hive: SQL type language that files and pulls out data that is needed from HBase. Essentially a high-level abstraction tool.
  • Lucene: indexes large blocks of unstructured text based data and allows for dynamic clustering and ability to read XML
  • Mahout: Allows Hadoop to use classification, filtering, k-means, Dirichlet, parallel pattern, and Bayesian classification algorithms, implemented on top of Hadoop's MapReduce. Essentially a data analytics library.
  • NoSQL: Uses NoSQL data stores for data that is not typically stored in HBase or HDFS.
  • Oozie: manages the workflow of a job by allowing the user to break the job into simple steps in a flowchart fashion. Essentially a workflow manager.
  • Pig: stores and maps data in processing nodes for Hadoop to find and process. Essentially a high-level abstraction tool.
  • Spark: uses Hadoop infrastructure to store data in the cache to allow for faster processing time
  • SQL on Hadoop: ad-hoc query the data stored in Hadoop servers using SQL
  • Sqoop: stores data in SQL databases into Hadoop. Essentially an enterprise data integration tool.
  • Whirr: A library that allows running Hadoop clusters on Amazon EC2, Rackspace, etc.
  • ZooKeeper: maintains order and synchronization throughout the parallel processing cluster. Essentially a coordinator of processes.

According to Lublinsky et al. (2013), there are always new datasets, data formats, and data preprocessing and processing tools being added to Hadoop. Thus, the list provided above is not comprehensive, but rather a starting point.
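
As a small illustration of the preprocessing step that precedes loading data into Hadoop, the sketch below cleans a couple of hypothetical raw text records and writes them to a CSV file; the file name, records, and cleaning rules are illustrative assumptions only, not a prescribed pipeline from the cited authors.

    import csv
    import re

    # Hypothetical raw records (e.g., scraped text) that are not yet
    # suitable for analysis.
    raw_records = [
        "  The Quick, Brown FOX!!  ",
        "jumps over\tthe lazy dog...",
    ]

    def clean(text):
        """Lowercase, strip punctuation, and collapse whitespace."""
        text = text.lower()
        text = re.sub(r"[^a-z0-9\s]", " ", text)
        return re.sub(r"\s+", " ", text).strip()

    # Write the cleaned records to a CSV file that an ingestion tool (e.g.,
    # Flume for logs or Sqoop for relational data) could later move into HDFS.
    with open("cleaned_records.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["record_id", "text"])
        for i, record in enumerate(raw_records):
            writer.writerow([i, clean(record)])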

References

  • Barak, S., & Modarres, M. (2015). Developing an approach to evaluate stocks by forecasting effective features with data mining methods. Expert Systems with Applications, 42(3), 1325–1339. http://doi.org/10.1016/j.eswa.2014.09.026
  • Corrales, D. C., Ledezma, A., & Corrales, J. C. (2015). A Conceptual Framework for Data Quality in Knowledge Discovery Tasks (FDQ-KDT): A Proposal. Journal of Computers, 10(6), 396-405. doi: 10.17706/jcp.10.6.396-405
  • Davenport, T. H., & Dyche, J. (2013). Big Data in Big Companies. International Institute for Analytics, (May), 1–31.
  • Fox, S., & Do, T. (2013). Getting real about Big Data: applying critical realism to analyse Big Data hype. International Journal of Managing Projects in Business, 6(4), 739–760. http://doi.org/10.1108/IJMPB-08-2012-0049
  • Gera, M., & Goel, S. (2015). Data Mining-Techniques, Methods and Algorithms: A Review on Tools and their Validity. International Journal of Computer Applications, 113(18), 22–29.
  • Greer, M., Rodriguez-Martinez, M., & Seguel, J. (2010). Open source cloud computing tools: A case study with a weather application. Florida: IEEE Open Source Cloud Computing.
  • Podesta, J., Pritzker, P., Moniz, E. J., Holdren, J., & Zients, J. (2014). Big Data: Seizing Opportunities. Executive Office of the President of USA, 1–79.
  • Gray, J., Liu, D. T., Nieto-Santisteban, M., Szalay, A., DeWitt, D. J., & Heber, G. (2005). Scientific data management in the coming decade. ACM SIGMOD Record, 34(4), 34-41.
  • Gu, L., & Li, H. (2013). Memory or time: Performance evaluation for iterative operation on Hadoop and Spark. In High Performance Computing and Communications & 2013 IEEE International Conference on Embedded and Ubiquitous Computing (HPCC_EUC), 2013 IEEE 10th International Conference on (pp. 721-727). IEEE.
  • Eini, O. (2010). Map/Reduce- a visual explanation. Retrieved from https://ayende.com/blog/4435/map-reduce-a-visual-explanation
  • He, W., Zha, S., & Li, L. (2013). Social media competitive analysis and text mining: A case study in the pizza industry. International Journal of Information Management, 33, 464–472. http://doi.org/10.1016/j.ijinfomgt.2013.01.001
  • Hoonlor, A. (2011). Sequential patterns and temporal patterns for text mining. UMI Dissertation Publishing.
  • Hortonworks (2013). Introduction to MapReduce. Retrieved from https://www.youtube.com/watch?v=ht3dNvdNDzI
  • IBM (n.d.) What is the Hadoop Distributed File System (HDFS)? Retrieved from https://www-01.ibm.com/software/data/infosphere/hadoop/hdfs/
  • Lau, W. (2001). A Comprehensive Introduction to Cloud Computing. Retrieved from https://www.simple-talk.com/cloud/development/a-comprehensive-introduction-to-cloud-computing/
  • Lublinsky, B., Smith, K., Yakubovich, A. (2013). Professional Hadoop Solutions. Wrox, VitalBook file.
  • Minelli, M., Chambers, M., Dhiraj, A. (2013). Big Data, Big Analytics: Emerging Business Intelligence and Analytic Trends for Today’s Businesses (1st). VitalSource Bookshelf Online.
  • Nassirtoussi, A. K., Aghabozorgi, S., Wah, T. Y., & Ngo, D. C. L. (2014). Text mining for market prediction: A systematic review. Expert Systems with Applications, 41(16), 7653–7670. http://doi.org/10.1016/j.eswa.2014.06.009
  • Rathbone, M. (2013). A beginners guide to Hadoop. Retrieved from http://blog.matthewrathbone.com/2013/04/17/what-is-hadoop.html
  • Sathupadi, K. (2010) Map Reduce: A really simple introduction. Retrieved from http://ksat.me/map-reduce-a-really-simple-introduction-kloudo/

 

Data Tools: Data-In-Motion

How is data-in-motion performed and why is it important to apply data analytics to it?

Definition of terms

Data-in-motion: a part of data velocity, which deals with the speed of data coming in from multiple sources as well as the speed of data traveling between systems (Katal, Wazid, & Goudar, 2013). Essentially, data-in-motion can encompass data streaming, data transfer, or real-time data. However, there are challenges and issues that have to be addressed when conducting real-time analysis on data streams (Katal et al., 2013; Tsinoremas et al., n.d.).

Data complexity: consists of the joining, cleaning, and transformation of data from multiple systems to find relationships that are highly correlated (Katal et al., 2013). Complexity increases as the velocity of data coming in or being transferred increases (Katal et al., 2013; Tsinoremas et al., n.d.).

Data-in-motion analytics performed in case study (Blount et al., 2010)

Artemis was designed, built, and deployed in 2009 through a coalition of the University of Ontario Institute of Technology, SickKids, the Department of Pediatrics, and the University of Toronto, to help read in data from multiple sensors in neonatal intensive care units (NICUs). The goal is for Artemis to read in data from multiple physiological instruments, such as electrocardiogram (ECG), heart rate, blood oxygen saturation, respiratory state, etc., to find key patterns and relationships in the data streams (data-in-motion) and provide the best care for infants in the NICU. To make Artemis a success, the coalition had to analyze huge amounts of data from a large group of patients. Artemis had to interface with multiple medical devices, be scalable to add more medical devices, and store raw physiological data while at the same time de-identifying the data per U.S. and Canadian health privacy laws. From these multiple medical devices, new rules could be created via unsupervised machine learning techniques, and via supervised machine learning techniques combined with medically/clinically derived rules. The Artemis system has to read in the data in real time to sort, join, clean, and transform it; evaluate it against certain rules; and send out an alert (or not) to medical staff about one of the NICU patients, while at the same time de-identifying the data and storing it in a database for future analysis and tests.
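
To picture the rule-evaluation step in a stream like Artemis's, here is a minimal, hypothetical Python sketch that checks each incoming reading against simple thresholds and raises an alert; the field names, threshold values, and rules are invented for illustration and are not Artemis's actual clinical rules.

    # Hypothetical streamed readings from NICU-style sensors.
    readings = [
        {"patient": "A", "heart_rate": 142, "spo2": 97},
        {"patient": "A", "heart_rate": 188, "spo2": 89},
        {"patient": "B", "heart_rate": 150, "spo2": 95},
    ]

    def evaluate(reading):
        """Evaluate one incoming reading against simple threshold rules."""
        alerts = []
        if reading["heart_rate"] > 180:
            alerts.append("heart rate above threshold")
        if reading["spo2"] < 90:
            alerts.append("low blood oxygen saturation")
        return alerts

    # Process the stream one reading at a time, as it arrives.
    for reading in readings:
        for alert in evaluate(reading):
            print(f"ALERT for patient {reading['patient']}: {alert}")

A production system would apply such rules continuously as data arrives, while also de-identifying and persisting each reading for later analysis, as described above.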

In the test phase, 5 infants were enrolled, and in the deployed state, 19 infants were enrolled in the study. The study had to take into account that the cables from all the sensors and the equipment used to collect the streaming data must not get in the way of the medical/clinical staff when they need to help the infant. In some cases, when the Artemis system was deployed, some of the sensors were not attached, and the information management teams had to work with the medical/clinical staff to train the model on fewer data streams when not all the ideal sensors needed to send out alerts for certain situations were present. Therefore, this system provides a way for medical/clinical staff to have constant data on NICU patients in real time from multiple sensors and allows the machine to alert them when certain markers and key performance indicators are met.

Importance of applying data analytics to data-in-motion

It can easily be seen that analyzing infant NICU data is important, and it is especially important to apply analytics to the data streams of the key medical sensors needed for each infant in the NICU. What is not always easily seen is how important all of the data really is, since the real-life deployment showed that not all the medical sensors were in use to provide the model with enough information to be of use to medical/clinical staff (Blount et al., 2010).

Also, the use of data streams in a university setting offers a different perspective that could be applied to the NICU case study above. At the University of Miami, data is triaged into a four-tiered system (Tsinoremas et al., n.d.):

  • High-speed storage – for data that is currently being processed, data-in-motion is at its highest (has 300TB of space and costs $2000/TB)
  • Mid-range speed storage – for data that is currently being looked at (costs $600-$700/TB)
  • Deep storage – long-term data storage, data that is looked at every so often, but not regularly, usually old data (costs $300/TB)
  • Archived – data to be stored offline, but it is perfect for data at rest

This tiered system could be applied to Artemis, such that the team could prioritize which of the medical devices' data should be processed first when resources are limited. It could also be applied differently, such that there is a window of data that is currently available, e.g., a 1-hour record of NICU stats saved locally, with longer records still accessible but not stored in vital processing spaces. Data windows were discussed by Blount et al. (2010), but depending on the situation, data windows could be adjusted to provide the best care for the infants.
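
A sliding data window of the kind mentioned above can be sketched in a few lines of Python; the one-hour window length, timestamps, and readings are illustrative assumptions, not values from the Artemis study.

    from collections import deque
    from datetime import datetime, timedelta

    WINDOW = timedelta(hours=1)
    window_buffer = deque()  # (timestamp, reading) pairs, oldest first

    def add_reading(timestamp, reading):
        """Append a new reading and evict anything older than the window."""
        window_buffer.append((timestamp, reading))
        cutoff = timestamp - WINDOW
        while window_buffer and window_buffer[0][0] < cutoff:
            window_buffer.popleft()  # candidate for mid-range or deep storage

    start = datetime(2016, 1, 1, 12, 0)
    for minutes in range(0, 180, 30):
        add_reading(start + timedelta(minutes=minutes), {"heart_rate": 150})

    print(len(window_buffer), "readings remain in the one-hour window")

Readings that fall out of the window would be pushed down to the cheaper storage tiers listed above rather than kept in the high-speed tier.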

Also, the quality of the sensor data must be taken into account. If more data is needed or preferred to make informed decisions on infant patients in the NICU (Blount et al., 2010), then there should be a focus on collecting and analyzing high-quality data and the right types of data. This would lead the designers of Artemis and the medical/clinical staff to think deeply about which data is relevant and how much data is enough to make the decisions needed to tend to the infants (Katal et al., 2013).

Resources

  • Blount, M., Ebling, M. R., Eklund, J. M., James, A. G., McGregor, C., Percival, N., … & Sow, D. (2010). Real-time analysis for intensive care: Development and deployment of the Artemis analytic system. IEEE Engineering in Medicine and Biology Magazine, 29(2), 110-118.
  • Katal, A., Wazid, M., & Goudar, R. H. (2013, August). Big data: Issues, challenges, tools and good practices. In Contemporary Computing (IC3), 2013 Sixth International Conference on (pp. 404-409). IEEE.
  • Tsinoremas, N. F., Zysman, J., Mader, C., Kirtma, B., & Blaire, J. (n.d.) Data in motion: A new paradigm in research data lifecycle management. Center for Computational Science: University of Miami.

Data tools: Analysis of big data involving text mining

The demand for big data talent is growing, and there is a shortage of data analytics talent in the United States. Because data analytics is used by many different industries and is an interdisciplinary field, learning and teaching it require careful planning. This post discusses how big data analytics can be implemented in a given case study.

Definitions

Big data – any set of data that has high velocity, volume, and variety, also known as the 3Vs (Davenport & Dyche, 2013; Fox & Do, 2013; Podesta, Pritzker, Moniz, Holdren, & Zients, 2014).

Text mining – a process that involves discovering implicit knowledge from unstructured textual data (Gera & Goel, 2015; Hashimi & Hafez, 2015; Nassirtoussi, Aghabozorgi, Wah, & Ngo, 2015).

Case study: Basole, Seuss, and Rouse (2013). IT innovation adoption by enterprises: Knowledge discovery through text analytics.

The goal of this study was to use text mining techniques on 472 quality, peer-reviewed articles that spanned 30 years of knowledge (1977-2008). The selection criteria for the articles were: a focus on the adoption of IT innovation; a focus on the enterprise, organization, or firm; rigorous research methods; and publication in leading journals. The reason to go through all this analysis is to prove the usefulness of text analytics for literature reviews. In 2016, most literature reviews contain recent literature from the last five years, and in certain fields it may not be useful to focus only on the last five years. Extending the literature search beyond this 5-year period requires a ton of attention and manual labor, which makes an already lengthy literature review an even more time-consuming endeavor. So, the authors' question is whether it is possible to use text mining to conduct a more thorough review of the body of knowledge that expands beyond just the typical five years on any subject matter. They argue that the time it takes to conduct this tedious task could benefit from automation. However, this should be thought of as a first pass through the literature; thinking of it as a first pass allows for the generation of new research questions and ideas, which drives future analysis.

In the end, the study was able to conclude that cost and complexity were two of the most frequent determinants of IT innovation adoption from the perspective of an IT department. Other determinants for IT departments were the complexity, capability, and relative advantage of the innovation. However, when going up one level of abstraction to the enterprise/organizational level, the perceived benefits and usefulness were the main determinants of IT innovation, and ease of use of the technology was a big deal for the organization. When comparing IT innovation with costs, there was a negative correlation between the two, while IT innovation had a positive correlation with organization size and top management support.

How was big data analytics learned, taught, and used in the case study?

The research approach for this study was: (1) Document Identification and extraction, (2) document classification and coding, (3) document analysis and knowledge discovery (key terms, co-occurrence), and (4) research gap identification.

Analysis of the data consisted of classifying the articles into four time periods (bins): 1977-1979, 1980-1989, 1990-1999, and 2000-2008, and applying a classification scheme based on existing taxonomies (case study, content analysis, field experiment, field study, frameworks and conceptual models, interview, laboratory experiment, literature analysis, mathematical model, qualitative research, secondary data, speculation/commentary, and survey). Data was also classified by functional discipline (information systems and computer science, decision science, management and organization sciences, economics, and innovation) and finally by IT innovation (software, hardware, networking infrastructure, and the tool's IT term catalog). This study used a tool called Northernlight (http://georgiatech.northernlight.com/).

The hope of this study was to use the bag-of-words technique and word proximity to other words (or their equivalents) to help extract meaning from a large set of text-based documents. The bag-of-words technique is known for counting and identifying key terms and phrases, which help uncover themes. The simplest way of thinking of the bag-of-words technique is as word frequencies in a document.

However, understanding the meaning behind the themes means studying the context in which the words are located and relating them to other themes, also called co-occurrence of terms. The best way of doing this meaning extraction is to measure the strength/distance between the themes. Finally, the researchers in this study can set minimums and maximums that enhance the meaning-extraction algorithm to garner insights into IT innovation while reducing the overall noise in the final results. The researchers set the following rules for co-occurrences between themes (a small sketch follows the list below):

  • There are approximately 40 words per sentence
  • There are approximately 150 words per paragraph
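
A minimal Python sketch of the bag-of-words counts and windowed co-occurrence counting described above is shown here; the two toy documents and the 40-word window are illustrative assumptions and do not reproduce the Northernlight tool used in the study.

    from collections import Counter

    # Hypothetical article snippets standing in for the corpus in the study.
    documents = [
        "cost and complexity drive it innovation adoption in the enterprise",
        "perceived benefits and ease of use drive enterprise adoption",
    ]

    WINDOW = 40  # roughly one sentence, per the co-occurrence rule above

    bag_of_words = Counter()
    co_occurrence = Counter()

    for doc in documents:
        words = doc.split()
        bag_of_words.update(words)          # term frequencies (bag-of-words)
        for i, word in enumerate(words):    # pairs of terms within one window
            for other in words[i + 1 : i + WINDOW]:
                co_occurrence[tuple(sorted((word, other)))] += 1

    print(bag_of_words.most_common(3))
    print(co_occurrence.most_common(3))

Raising or lowering the window size, or requiring a minimum pair count, is one way to trade noise for coverage, which mirrors the minimums and maximums the researchers tuned.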

How could this implementation of big data have been improved upon?

Goldbloom (2016) stated that using big data techniques (machine learning) works best on big data that requires classifying, and it breaks down when the task is too small and specialized, and is therefore suited only to human analysis. This study only looked at 472 articles; is this considered big enough for analysis, or should the analysis go back through more years than just the 30 covered (Basole et al., 2013)? What is considered big data in 2013 (the time of this study) may not be big data in 2023 (Fox & Do, 2013).

Mei & Zhai (2005), observed how terms and term frequencies evolved over time and graphed it by year, rather than binning the data into four different groups as in Basole et al. (2013).  This case study could have shown how cost and complexity in IT innovation changed over time.  Graphing the results similar to Mei & Zhai (2005) and Yoon and Song (2014) would also allow for an analysis of IT innovation themes and if each of these themes is in an Introduction, Growth, Majority, or Decline mode.

 Reference

  • Basole, R. C., Seuss, C. D., & Rouse, W. B. (2013). IT innovation adoption by enterprises: Knowledge discovery through text analytics. Decision Support Systems, 54, 1044-1054. Retrieved from http://www.sciencedirect.com.ctu.idm.oclc.org/science/article/pii/S0167923612002849
  • Davenport, T. H., & Dyche, J. (2013). Big Data in Big Companies. International Institute for Analytics, (May), 1–31.
  • Fox, S., & Do, T. (2013). Getting real about Big Data: applying critical realism to analyse Big Data hype. International Journal of Managing Projects in Business, 6(4), 739–760. http://doi.org/10.1108/IJMPB-08-2012-0049
  • Gera, M., & Goel, S. (2015). Data Mining-Techniques, Methods and Algorithms: A Review on Tools and their Validity. International Journal of Computer Applications, 113(18), 22–29.
  • Goldbloom, A. (2016). The jobs we’ll lose to machines –and the ones we won’t. TED. Retrieved from http://www.ted.com/talks/anthony_goldbloom_the_jobs_we_ll_lose_to_machines_and_the_ones_we_won_t
  • Hashimi, H., & Hafez, A. (2015). Selection criteria for text mining approaches. Computers in Human Behavior, 51, 729–733. http://doi.org/10.1016/j.chb.2014.10.062
  • Mei, Q., & Zhai, C. (2005). Discovering evolutionary theme patterns from text: an exploration of temporal text mining. Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, 198–207. http://doi.org/10.1145/1081870.1081895
  • Nassirtoussi, A. K., Aghabozorgi, S., Wah, T. Y., & Ngo, D. C. L. (2015). Text-mining of news-headlines for FOREX market prediction: A multi-layer dimension reduction algorithm with semantics and sentiment. Expert Systems with Applications, 42(1), 306-324.
  • Podesta, J., Pritzker, P., Moniz, E. J., Holdren, J., & Zients, J. (2014). Big Data: Seizing Opportunities. Executive Office of the President of USA, 1–79.
  • Yoon, B., & Song, B. (2014). A systematic approach of partner selection for open innovation. Industrial Management & Data Systems, 114(7), 1068.

Data Tools: WEKA

Many tools are used for data analytics. WEKA is one of the free tools on the market.

WEKA

The Java-based, open-source, and platform-independent Waikato Environment for Knowledge Analysis (WEKA) is a tool for data preprocessing, predictive data analytics, and facilitating interpretation and evaluation (Dogan & Tanrikulu, 2013; Gera & Goel, 2015; Miranda, n.d.; Xia & Gong, 2014). It was originally developed for analyzing agricultural data and has evolved to house a comprehensive collection of data preprocessing and modeling techniques (Patel & Donga, 2015). It is a Java-based collection of machine learning algorithms for data mining and text mining tasks that can be used for predictive modeling, pre-processing, classification, regression, clustering, association rules, and visualization; in particular, it includes the C4.5 decision tree predictive data analytics algorithm (WEKA, n.d.; Dogan & Tanrikulu, 2013; Gera & Goel, 2015; Hachey & Grover, 2006; Kumar & Fet, 2011). WEKA is open-source data and text mining software, thus it is free to use, and there are no costs associated with this software solution.

WEKA can be applied to big data (WEKA, n.d.) and SQL databases (Patel & Donga, 2015). Subsequently, WEKA has been used in many research studies involving big data analytics (Dogan & Tanrikulu, 2013; Gera & Goel, 2015; Hachey & Grover, 2006; Kumar & Fet, 2011; Parkavi & Sasikumar, 2016; Xia & Gong, 2014). For instance, Barak and Modarres (2015) used WEKA for decision tree analysis in predicting stock risks and returns.

The fact that it has been used in so many research studies suggests that the reliability and validity of the software are high and well established. Even in a study comparing WEKA with 12 other data analytics tools, it is one of only two applications studied that provide classification, regression, and clustering algorithms (Gera & Goel, 2015).

A disadvantage of this tool is its lack of support for multi-relational data mining, but if one can link all the multi-relational data into one table, it can do its job (Patel & Donga, 2015). The comprehensiveness of its analysis algorithms for both data and text mining, as well as its pre-processing, is its advantage. Another disadvantage of WEKA is that it cannot handle raw data directly, meaning the data has to be preprocessed before it is entered into the software package and analyzed (Hoonlor, 2011). WEKA cannot even import Excel files; data in Excel has to be converted into CSV format to be usable within the system (Miranda, n.d.).
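
Since WEKA expects CSV (or its native ARFF format) rather than Excel workbooks, a short, hypothetical conversion step with pandas can be done before loading the data; the file names below are placeholders.

    import pandas as pd

    # Convert a hypothetical Excel sheet into a CSV file that WEKA can open.
    # (pandas needs an Excel reader such as openpyxl installed.)
    df = pd.read_excel("survey_data.xlsx")
    df.to_csv("survey_data.csv", index=False)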

References:

  • Dogan, N., & Tanrikulu, Z. (2013). A comparative analysis of classification algorithms in data mining for accuracy, speed and robustness. Information Technology and Management, 14(2), 105-124. doi:http://dx.doi.org/10.1007/s10799-012-0135-7
  • Gera, M., & Goel, S. (2015). Data Mining -Techniques, Methods and Algorithms: A Review on Tools and their Validity. International Journal of Computer Applications, 113(18), 22–29.
  • Hoonlor, A. (2011). Sequential patterns and temporal patterns for text mining. UMI Dissertation Publishing.
  • Kumar, D., & Fet, D. (2011). Performance Analysis of Various Data Mining Algorithms: A Review. International Journal of Computer Applications, 32(6), 9–16.
  • Miranda, S. (n.d.). An Introduction to Social Analytics : Concepts and Methods.
  • Parkavi, S. & Sasikumar, S. (2016). Prediction of Commodities Market by Using Data Mining Technique. i-Manager’s Journal on Computer Science.
  • Patel, K., & Donga, J. (2015). Practical Approaches: A Survey on Data Mining Practical Tools. Foundations, 2(9).
  • WEKA (n.d.) WEKA 3: Data Mining Software in Java. Retrieved from http://www.cs.waikato.ac.nz/ml/weka/
  • Xia, B. S., & Gong, P. (2014). Review of business intelligence through data analysis. Benchmarking, 21(2), 300–311. http://doi.org/http://dx.doi.org/10.1108/BIJ-08-2012-0051