Compelling topics

Hadoop, XML and Spark

Hadoop is predominantly known for its Hadoop Distributed File System (HDFS), in which data is distributed across multiple systems, and for its code for running MapReduce tasks (Rathbone, 2013). MapReduce has two queries: one maps the input data into a final format and splits it across a group of computer nodes, while the second reduces the data in each node so that combining the results from all the nodes provides the answer sought (Eini, 2010).
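The map and reduce steps can be sketched in plain Python. This is only an illustration of the idea under simplifying assumptions, not Hadoop's API: the word-count task, the function names, and the two in-memory "splits" standing in for data held on separate nodes are all hypothetical.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]

def reduce_phase(pairs):
    """Reduce: sum the counts for each key."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

# Each string stands in for the data held on one node (hypothetical input).
splits = ["big data big", "data nodes"]

# Every node reduces its own mapped output first...
partials = [reduce_phase(map_phase(split)) for split in splits]

# ...and the per-node results are then combined into the final answer.
combined = reduce_phase(
    [(word, count) for partial in partials for word, count in partial.items()]
)
print(combined)
```

In a real cluster the framework handles the splitting, shuffling, and combining; the sketch only shows why the per-node results can be merged into one answer.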

XML documents represent a whole data file, which contains markups, elements, and nodes (Lublinsky, Smith, & Yakubovich, 2013; Myer, 2005):

  • XML markups are tags that help describe the data start and end points as well as the data properties/attributes, and are encapsulated by < and >
  • XML elements are data values, encapsulated by an opening <tag> and a closing </tag>
  • XML nodes are part of the hierarchical structure of a document that contains a data element and its tags
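To make these three terms concrete, here is a small sketch using Python's standard xml.etree.ElementTree module. The document content (a library of books) is invented purely for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: <book> is a node, id="..." is an attribute,
# and the text between <title> and </title> is the element's data value.
doc = (
    '<library>'
    '<book id="1"><title>Hadoop</title></book>'
    '<book id="2"><title>Spark</title></book>'
    '</library>'
)

root = ET.fromstring(doc)

titles = []
for book in root.iter("book"):
    # attrib holds the tag's attributes; find() walks down the hierarchy.
    titles.append((book.attrib["id"], book.find("title").text))

print(titles)
```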

Unfortunately, the syntax and tags are redundant, which can consume a large number of bytes and slow down processing (Hiroshi, 2007).

Five questions must be asked before designing an XML data document (Font, 2010):

  1. Will this document be part of a solution?
  2. Will this document have design standards that must be followed?
  3. What part may change over time?
  4. To what extent is human readability or machine readability important?
  5. Will there be a massive amount of data? Does file size matter?

All XML data documents should be versioned, and key stakeholders should be involved in the XML data design process (Font, 2010). XML is a machine- and human-readable data format (Smith, 2012). With a goal of using XML for MapReduce, we need to assume that we will have to map and reduce huge files (Eini, 2010; Smith, 2012). Unfortunately, XML doesn’t include sync markers in the data format, and therefore MapReduce doesn’t support XML out of the box (Smith, 2012). However, Smith (2012) and Rohit (2013) used the XmlInputFormat class from Mahout to load XML input data into HBase. Smith (2012) stated that Mahout’s code needs to know the exact sequence of XML start and end tags to search for, and that elements with attributes are hard for Mahout’s XML library to detect and parse.
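The limitation around exact start and end tags can be illustrated with a simplified stand-in. The function below is not Mahout's XmlInputFormat; it is a hypothetical Python sketch that splits a stream into records by literal tag matching, which is enough to show why an element carrying attributes slips through.

```python
def split_records(text, start_tag, end_tag):
    """Cut `text` into records by searching for literal start/end tags."""
    records = []
    pos = 0
    while True:
        start = text.find(start_tag, pos)
        if start == -1:
            break
        end = text.find(end_tag, start)
        if end == -1:
            break
        end += len(end_tag)
        records.append(text[start:end])
        pos = end
    return records

# Hypothetical input: the middle record's start tag carries an attribute.
data = '<rec><v>1</v></rec><rec type="x"><v>2</v></rec><rec><v>3</v></rec>'

found = split_records(data, "<rec>", "</rec>")
print(len(found))  # the record with the attribute is never matched
```

Because the search string is the literal `<rec>`, the record that opens with `<rec type="x">` is skipped, which mirrors the detection problem Smith (2012) describes.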

Apache Spark started from a working group inside and outside of UC Berkeley, in search of an open-source, multi-pass batch processing alternative to the MapReduce model (Zaharia et al., 2012). Spark is faster than Hadoop in iterative operations by 25x-40x for really small datasets and 3x-5x for relatively large datasets, but Spark is more memory intensive, and its speed advantage disappears when available memory runs out with really large datasets (Gu & Li, 2013). Apache Spark's website boasts that it can run programs 100x faster than Hadoop’s MapReduce in memory and 10x faster on disk (Spark, n.d.). Spark outperforms Hadoop by 10x on iterative machine learning jobs (Gu & Li, 2013). Gu and Li (2013) recommend that if speed to the solution is not an issue but memory is, then Spark shouldn’t be prioritized over Hadoop; however, if speed to the solution is critical and the job is iterative, Spark should be prioritized.

Data visualization

Big data can be defined as any set of data that has high velocity, volume, and variety, also known as the 3Vs (Davenport & Dyche, 2013; Fox & Do, 2013; Podesta, Pritzker, Moniz, Holdren, & Zients, 2014). What is considered to be big data can change with respect to time. What was considered big data in 2002 is not considered big data in 2016, due to advancements made in technology over time (Fox & Do, 2013). Then there is data-in-motion, which can be defined as a part of data velocity and deals with the speed of data coming in from multiple sources as well as the speed of data traveling between systems (Katal, Wazid, & Goudar, 2013). Essentially, data-in-motion can encompass data streaming, data transfer, or real-time data. However, there are challenges and issues that have to be addressed to conduct real-time analysis on data streams (Katal et al., 2013; Tsinoremas et al., n.d.).

It is not enough to analyze the relevant data for data-driven decisions; one must also select relevant visualizations of that data to enable those data-driven decisions (eInfochips, n.d.). There are many ways to visualize data so that key facts are highlighted with style and succinctly: tables and rankings, bar charts, line graphs, pie charts, stacked bar charts, tree maps, choropleth maps, cartograms, pinpoint maps, or proportional symbol maps (CHCF, 2014). These plots, charts, maps, and graphs can be animated, static, or interactive visualizations, and can be delivered as standalone images, dashboards, scorecards, or infographics (CHCF, 2014; eInfochips, n.d.).

Artificial Intelligence (AI)

Artificial Intelligence (AI) is an embedded technology, built on the current infrastructure (i.e., supercomputers), big data, and machine learning algorithms (Cyranoski, 2015; Power, 2015). AI can provide tremendous value since it builds thousands of models and correlations automatically in one week, which used to take a few quantitative data scientists years to do (Dewey, 2013; Power, 2015). Unfortunately, the rules created by AI out of 50K variables lack substantive human meaning, or the “why” behind them, thus making it hard to interpret the results (Power, 2015).

“Machines can excel at frequent high-volume tasks. Humans can tackle novel situations,” said Anthony Goldbloom. Thus, the fundamental question that decision makers need to ask is how much of the decision can be reduced to frequent high-volume tasks and how much of it to novel situations (Goldbloom, 2016). If the ratio is skewed toward high-volume tasks, then AI could be a candidate to replace decision makers; if the ratio is evenly split, then AI could augment and assist decision makers; and if the ratio is skewed toward novel situations, then AI wouldn’t help decision makers. These novel situations are equivalent to our tough challenges today (McAfee, 2013). Finally, Meetoo (2016) warned that it doesn’t matter how intelligent or strategic a job could be: if there is enough data on that job to create accurate rules, it can be automated as well, because machine learning can run millions of simulations against itself to generate huge volumes of data to learn from.

 


Data Tools: Artificial Intelligence and Internet of Things

Radio Frequency Identification (RFID) tags are the fundamental technology of the Internet of Things (IoT); they are everywhere, and they are shipped more frequently than smartphones (Ashton, 2015). The IoT is the explosion of device/sensor data, which is growing the amount of structured data exponentially and creating huge opportunities (Jaffe, 2014; Power, 2015). Ashton (2016) likens IoT to fancy windmills, where data scientists and computer scientists are taking energy and harnessing it to do amazing things. Newman (2016) stated that there is a natural progression from sensor objects to learning objects, with a final desire to connect all of the IoT into one big network. Essentially, IoT is giving machines senses through devices/sensors (Ashton, 2015).

Artificial Intelligence and the Internet of Things

Thus, analyzing this sensor data to derive data-driven insights and actions is key for companies to get value from the data they are gathering from a wide range of sensors. As of 2016, IoT has two main issues if it is left on its own and not tied to anything else (Jaffe, 2014; Newman, 2016):

  • The devices/sensors cannot deal with the massive amounts of data generated and collected
  • The devices/sensors cannot learn from the data they generate and collect

Thus, artificial intelligence (AI) should be able to store and mine all the data that is collected from a wide range of sensors to give it meaning and value (Canton, 2016; Jaffe, 2014). The full potential of IoT cannot be realized without AI or machine learning (Jaffe, 2014). The value derived from IoT depends on how quickly AI, through machine learning, can give actionable insights to key stakeholders (Tang, 2016). AI would bring out the potential of IoT by quickly and naturally collecting, analyzing, organizing, and feeding valuable data to key stakeholders, transforming the field from the standard IoT into the Internet of Learning-Things (IoLT) (Jaffe, 2014; Newman, 2016). Tang (2016) stated that the IoT is limited by how efficiently AI can analyze the data generated by IoT. Given that AI is best suited for frequent and voluminous data (Goldbloom, 2016), AI relies on IoT technology to sustain its learning.

Another high-potential use of IoT with AI is analyzing data-in-motion, which means analyzing data immediately after collection to identify hidden patterns or meaning and so create actionable data-driven decisions (Jaffe, 2014).
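As a rough illustration of analyzing data as it arrives, the sketch below flags a sensor reading the moment it deviates from a rolling baseline. Everything here is an assumption made for the example: the function name, the window size, the tolerance, and the temperature-like readings. It is a simple rule, not a machine learning model.

```python
from collections import deque

def stream_alerts(readings, window=3, tolerance=5.0):
    """Return the indexes of readings that deviate from the rolling mean
    of the previous `window` readings by more than `tolerance`."""
    recent = deque(maxlen=window)  # only the latest readings are kept
    alerts = []
    for index, value in enumerate(readings):
        if len(recent) == window:
            baseline = sum(recent) / window
            if abs(value - baseline) > tolerance:
                alerts.append(index)
        recent.append(value)
    return alerts

# Hypothetical sensor feed with one spike.
alerts = stream_alerts([20.0, 20.5, 19.5, 20.0, 31.0, 20.2])
print(alerts)
```

The point of the sketch is that each reading is judged on arrival using only a small window of history, rather than after the whole dataset has been stored.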

Connection: One without the other or not?

In summary, AI helps give meaning and value to IoT, and IoT cannot reach its potential without AI, since IoT supplies the huge amounts of frequent data that AI thrives upon. It goes without saying that IoT can be a source of data for AI. However, if there were no IoT, social media could provide AI with the amounts of data needed for it to generate insight, albeit different insights will be gained from different sources of voluminous data. Thus, the worth of IoT technologies depends on AI, but AI doesn’t depend solely on IoT.


Data Tools: Artificial Intelligence and Data Analytics

Machine learning, a branch of Artificial Intelligence (AI) and often used interchangeably with it, adds an intelligence layer to big data so that it can handle bigger sets of data and derive patterns from them that even a team of data scientists would find challenging (Maycotte, 2014; Power, 2015). AI derives its insights not from how machines are programmed, but from how the machines perceive the data and take actions from that perception, essentially conducting self-learning (Maycotte, 2014). Understanding how a machine perceives a big dataset is a hard task, which also makes it hard to interpret the resulting final models (Power, 2015). AI is even revolutionizing how we understand what intelligence is (Spaulding, 2013).

So what is intelligence?

At first, doing arithmetic was thought of as a sign of biological intelligence, until the invention of digital computers. The definition of biological intelligence then shifted to logical reasoning, deduction, and inference, and eventually to fuzzy logic, grounded learning, and reasoning under uncertainty, which are now matched by Bayes net probability and current data analytics (Spaulding, 2013). So as humans keep moving the dial on what biological intelligence is toward more complex structures, anything that requires high-frequency and voluminous data can be matched by AI (Goldbloom, 2016). Therefore, as our definition of intelligence expands, so will the need to capture that intelligence artificially, driving change in how big datasets are analyzed.

AI on influencing the future of data analytics modeling, results, and interpretation

This concept should help revolutionize how data scientists and statisticians think about which hypotheses to ask, which variables are relevant, how the resulting outputs fit into an appropriate conceptual model, and why the patterns hidden in the data help generate the decision outcome forecasted by AI (Power, 2015). Making sense of these models would require subject matter experts from multiple fields and multiple levels of the employment hierarchy to analyze the model outputs, because it is through diversity and inclusion of thought that we will understand an AI’s analytical insight.

Also, owning data is different from understanding data (Lapowsky, 2014). AI can make use of data hidden in “dark wells” and silos, where the end-user had no idea that the data even existed to begin with, which allows data scientists to gain a better understanding of their datasets (Lapowsky, 2014; Power, 2015).

AI on generating datasets and using data analytics for self-improvements

Data scientists currently collect, preprocess, process, and analyze big volumes of data regularly to provide decision makers with insights for making data-driven decisions (Fayyad, Piatetsky-Shapiro, & Smyth, 1996). From these data-driven decisions, data scientists then measure the outcomes to prove the effectiveness of their insights (Maycotte, 2014). This analysis of the results of data-driven decisions will allow machine learning algorithms to learn from their decisions and actions and create better ways of searching for key patterns in bigger and future datasets. This is the ability of AI to conduct self-learning: it applies data analytics to the results of earlier data analytics (Maycotte, 2014). Meetoo (2016) stated that if there is enough data to create accurate rules, it is enough to create insights, because machine learning can run millions of simulations against itself to generate huge volumes of data to learn from.

AI on Data Analytics Process

AI is a result of the massive amounts of data being collected, the culmination of ideas from the most brilliant computer scientists of our time, and an IT infrastructure that did not exist a few years ago (Power, 2015). Given that data analytics processes include collecting data, preprocessing data, processing data, and analyzing the results, any improvements made to the infrastructure for AI can influence any part of the data analytics process (Fayyad et al., 1996; Power, 2015). For example, as AI technology begins to learn how to read raw data and turn it into information, the need for most of the current preprocessing techniques for data cleaning could disappear (Minelli, Chambers, & Dhiraj, 2013). Therefore, as AI advances, newer IT infrastructures will be dreamt up and built so that data analytics and its processes can leverage them, which can also change the way big datasets are analyzed.


Data Tools: Artificial Intelligence and Decision Making

“Machines can excel at frequent high-volume tasks. Humans can tackle novel situations.” – Anthony Goldbloom

Jobs will look drastically different 30 years from now (Goldbloom, 2016; McAfee, 2013). Artificial intelligence (AI) systems work on Sundays, don’t take holidays, and work well at high-frequency and voluminous tasks, and thus they have the possibility of replacing many of the current jobs of 2016 (Goldbloom, 2016; Meetoo, 2016). AI has been doing things that haven’t been done before: understanding, speaking, hearing, seeing, answering, writing, and analyzing (McAfee, 2013). Also, AI can make use of data hidden in “dark wells” and silos, where the end-user had no idea that the data even existed to begin with (Power, 2015). Eventually, AI and machine learning will be commonly used as tools to augment or replace decision makers. Goldbloom (2016) gave the example that a teacher may be able to read 10,000 essays, or an ophthalmologist may see 50,000 eyes, over a 40-year period, whereas a machine can read millions of essays and see millions of eyes in minutes.

Machine learning is one of the most powerful branches of AI, where machines learn from data, similar to how humans learn, to create predictions of the future (Cringely, 2013; Cyranoski, 2015; Goldbloom, 2016; Power, 2015). It would take many scientists to analyze a big dataset in its entirety, without a loss of memory, in order to gain insights and fully understand how the connections were made in the AI system (Cringely, 2013; Goldbloom, 2016). This is no easy task, because the eerily accurate rules created by AI out of thousands of variables can lack substantive human meaning, making it hard to interpret the results and make an informed data-driven decision (Power, 2015).

AI has already been used to solve problems in industry and academia, which has given data scientists knowledge of the current limitations of AI and of whether or not it can augment or replace key decision makers (Cyranoski, 2015; Goldbloom, 2016). Machine learning and AI do well at analyzing patterns from frequent and voluminous amounts of data at faster speeds than humans, but they fail to recognize patterns in infrequent and small amounts of data (Goldbloom, 2016). Therefore, for small datasets AI will not be able to replace decision makers, but for big datasets it could.

Thus, the fundamental question that decision makers need to ask is how much of the decision can be reduced to frequent high-volume tasks and how much of it to novel situations (Goldbloom, 2016). If the ratio is skewed toward high-volume tasks, then AI could be a candidate to replace decision makers; if the ratio is evenly split, then AI could augment and assist decision makers; and if the ratio is skewed toward novel situations, then AI wouldn’t help decision makers. These novel situations are equivalent to our tough challenges today (McAfee, 2013).
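The three-way rule can be written down as a tiny function. The cut-off used here is purely an assumption for illustration: neither Goldbloom (2016) nor McAfee (2013) gives a numeric threshold for "skewed", so the margin parameter and the returned labels are hypothetical.

```python
def ai_role(high_volume_share, margin=0.15):
    """Suggest AI's role from the share (0.0 to 1.0) of a decision that is
    made up of frequent high-volume tasks. `margin` is an assumed width
    for what counts as "evenly split" around 0.5."""
    if not 0.0 <= high_volume_share <= 1.0:
        raise ValueError("high_volume_share must be between 0 and 1")
    if high_volume_share > 0.5 + margin:
        return "replace"      # skewed toward high-volume tasks
    if high_volume_share < 0.5 - margin:
        return "little help"  # skewed toward novel situations
    return "augment"          # roughly evenly split

for share in (0.9, 0.5, 0.1):
    print(share, ai_role(share))
```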

Finally, Meetoo (2016) warned that it doesn’t matter how intelligent or strategic a job could be: if there is enough data on that job to create accurate rules, it can be automated as well, because machine learning can run millions of simulations against itself to generate huge volumes of data to learn from. This is no different from humans doing self-study and continuous practice to become subject matter experts in their field. But people in STEAM (Science, Technology, Engineering, Arts, and Math) will be best equipped for the future world with AI, because it is from knowing how to combine these fields that novel, infrequent, and unique challenges will arise that humans can solve and machine learning cannot (Goldbloom, 2016; McAfee, 2013; Meetoo, 2016).


Data Tools: AI and wildlife case study

2015 Case study: Unmanned Aerial Vehicles (UAVs) and Artificial Intelligence (AI) Revolutionizing Wildlife Monitoring and Conservation

Overview:

Aiding in the monitoring and conservation of animals that are endangered, or at risk of becoming endangered, is at the heart of effective wildlife management. Understanding the current population of animals is key. However, current techniques like remote photography, camera traps, tagging, GPS collars, scat detection dogs, and DNA sampling are costly for already strapped resources. The authors of this study propose using big data, AI, UAVs, and imagery to count the wildlife effectively without depleting resources or disturbing the wildlife, while improving safety and statistical integrity.

The authors mounted a Mobius RGB camera with 1080p resolution and a FLIR thermal camera at 640×510 on an S800 EVO Hexacopter, which has three modes of travel: predefined flight via GPS, a stabilized mode like autopilot, and manual. The camera’s main goal is to capture footage of the area so that the image can be converted to high contrast, patterns can be identified using AI and matched to the respective animal, and the identified animal can be added to the total count. With infrared cameras, the higher-temperature animals stick out from the vegetation and soil background. Therefore, a filter is applied to color the animal white and the background black, which allows classification and pattern recognition to occur.
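The white-animal, black-background filter amounts to thresholding the thermal image. The sketch below shows the idea on a tiny grid of temperatures; the cut-off value, the sample frame, and the function name are assumptions for illustration and are not taken from the study.

```python
def threshold_frame(frame, cutoff):
    """Turn a grid of temperature readings into a black/white mask:
    255 (white) where the reading is at or above `cutoff`, else 0 (black)."""
    return [[255 if pixel >= cutoff else 0 for pixel in row] for row in frame]

# Hypothetical 3x3 thermal frame (degrees Celsius): a warm blob on a cool background.
frame = [
    [18.0, 19.5, 18.2],
    [18.4, 34.1, 33.8],
    [18.1, 18.3, 18.0],
]

mask = threshold_frame(frame, cutoff=30.0)
for row in mask:
    print(row)
```

The resulting mask is what a template-matching or classification step would then operate on.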

Data Collection Procedures:

This idea was tested on the koala population, given that koalas are iconic to Australia and are a vulnerable species. The area studied was on the Sunshine Coast, 57 km north of Brisbane, Queensland, Australia, where the ground-truth number of koalas was 6. The authors flew on November 7, 2014, from 7:10 to 8:00 A.M., to allow for the largest temperature contrast between the koalas and the background. They flew at three different heights: 20 m, 30 m, and 60 m. A koala was identified if it appeared in 10 consecutive frames, didn’t make big jumps in location within those frames, and didn’t drastically increase in size.
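The three identification conditions can be sketched as a simple check over a sequence of detections. Only the 10-frame requirement comes from the description above; the pixel and growth limits, the tuple layout, and the function name are hypothetical values chosen for the example, not the authors' implementation.

```python
from math import hypot

def is_koala_track(detections, min_frames=10, max_jump=15.0, max_growth=1.5):
    """detections: list of (frame_index, x, y, area), sorted by frame_index.
    True if some run of `min_frames` consecutive frames has no big jump in
    position and no drastic increase in size. `max_jump` (pixels) and
    `max_growth` (area ratio) are assumed limits."""
    if not detections:
        return False
    run = best = 1
    for prev, cur in zip(detections, detections[1:]):
        consecutive = cur[0] == prev[0] + 1
        small_jump = hypot(cur[1] - prev[1], cur[2] - prev[2]) <= max_jump
        stable_size = cur[3] <= prev[3] * max_growth
        run = run + 1 if (consecutive and small_jump and stable_size) else 1
        best = max(best, run)
    return best >= min_frames

# Hypothetical tracks: one drifts slowly, the other teleports in frame 5.
steady = [(i, 100.0 + i, 50.0, 40.0) for i in range(10)]
jumpy = [(i, 500.0 if i == 5 else 100.0 + i, 50.0, 40.0) for i in range(10)]

print(is_koala_track(steady), is_koala_track(jumpy))
```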

Evaluation of effectiveness:

At each of the three heights, 100% of the koalas were identified. However, it is important to note that there was a greater chance of a false positive at 60 m above ground, and it took almost twice the time for the AI classification algorithm to detect the koalas. The authors suggested that improving the AI classification algorithm by adding more template shapes for animals at different angles would help speed up the AI and improve the quality of detection. Also, the quality of the templates can contribute to the quality of the detection. This illustrates that there is a need to add more dynamic templates to the system, thus creating a bigger dataset to draw inferences from, which can raise the quality of detection. Therefore, the combination of big data and AI is important for this study.

Other applications:

The benefit of this application of UAVs, data analytics, and AI could be further extended to search and rescue missions for humans lost in national parks, etc. The UAVs can supplement human and dog trackers to find the victims more quickly, since time is extremely important. Therefore, besides conservationists, park rangers can adapt these methods to help in recovery missions. Another application could be within the Department of Defense, for search and rescue missions or for mitigating casualties during times of war.

Resource:

  • Gonzalez, L. F., Montes, G. A., Puig, E., Johnson, S., Mengersen, K., & Gaston, K. J. (2015). Unmanned Aerial Vehicles (UAVs) and Artificial Intelligence Revolutionizing Wildlife Monitoring and Conservation. Sensors, 16(1), 97. DOI: 10.3390/s16010097

Data Tools: Artificial Intelligence

Big data Analytics and Artificial Intelligence

Artificial Intelligence (AI) is an embedded technology, built on the current infrastructure (i.e., supercomputers), big data, and machine learning algorithms (Cyranoski, 2015; Power, 2015). Previously, AI wasn’t able to come into existence because it lacked the computational power that is available today (Cringely, 2013). AI can make use of data hidden in “dark wells” and silos, where the end-user had no idea that the data even existed to begin with (Power, 2015). The goal of AI is to use huge amounts of data to draw out a set of rules through machine learning that will effectively replace experts in a certain field (Cringely, 2013; Power, 2015). Cringely (2013) stated that in some situations big data can eliminate the need for theory, and that AI can aid in analyzing big data where theory is either lacking or impossible to define.

AI can provide tremendous value since it builds thousands of models and correlations automatically in one week, which used to take a few quantitative data scientists years to do (Dewey, 2013; Power, 2015). What slowed down the progression of AI in the past was the creation of human-readable computer languages like XML or SQL, which are not intuitive for computers to read (Cringely, 2013). Fortunately, AI can easily use structured data, and it can now use unstructured data as well, thanks to everyone who tags this unstructured data either in comments or on the data point itself, which speeds up the computational time (Cringely, 2013; Power, 2015). Dewey (2013) hypothesized that not only will AI be able to analyze big data at speeds faster than any human can, but that the AI system can also begin to improve its search algorithms, in a phenomenon called an intelligence explosion. An intelligence explosion is when an AI system begins to analyze itself to improve itself in an iterative process, to the point where there is exponential growth in improvement (Dewey, 2013).

Unfortunately, the rules created by AI out of 50K variables lack substantive human meaning, or the “why” behind them, thus making it hard to interpret the results (Power, 2015). It would take many scientists to analyze the same big data in its entirety and fully understand how the connections were made in the AI system, which is no longer feasible (Cringely, 2013). It is as if data scientists are trying to read the mind of the AI system, when they currently cannot even read a human’s mind. However, the results of AI are becoming accurate, with AI identifying cats in photographs after 72 hours of machine learning and after a cat is tagged in a few photographs (Cringely, 2013). AI could be applied to any field of study, like finance, social science, science, or engineering, or could even play against champions on the Jeopardy game show (Cyranoski, 2015; Cringely, 2013; Dewey, 2013; Power, 2015).

Example of artificial intelligence use in big data analysis: Genomics

The goal of using AI on genomic data is to help analyze physiological traits and lifestyle choices to provide a dedicated and personalized health plan to treat and eventually prevent disease (Cyranoski, 2015; Power, 2015). This is done by feeding AI systems huge amounts of genomic data, which is considered big data by today’s standards (Cyranoski, 2015). Systems like IBM’s Watson (an AI system) could provide treatment options based on the results gained from analyzing thousands or even millions of genomic records (Power, 2015). The analysis works by allowing machine learning techniques to devise algorithms based on the input data (Cringely, 2013; Cyranoski, 2015; Power, 2015). As of 2015, there were about 100,000 individual genomic records in the system, and even this huge amount of data is still not enough to provide the personalized health plan that is currently being envisioned based on a person’s genomic data (Cyranoski, 2015). Eventually, millions of individuals will need to be added to the AI system, and not just their genomic data, but also proteomics, metabolomics, lipidomics, etc.
