Compelling topics

Hadoop, XML and Spark

Hadoop is predominantly known for its Hadoop Distributed File System (HDFS), in which data is distributed across multiple systems, and for its code for running MapReduce tasks (Rathbone, 2013). MapReduce has two phases: a map phase, which transforms the input data into intermediate key-value pairs and splits the work across a group of computer nodes, and a reduce phase, which aggregates the data on each node so that, when all the nodes are combined, the answer sought is produced (Eini, 2010).
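To make the two phases concrete, here is a minimal, single-process Python sketch of a word-count job, the canonical MapReduce example; it only mimics the shape of the computation (a real Hadoop job distributes the map and reduce work across nodes):

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    """Reduce: sum the emitted counts for each distinct word."""
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

if __name__ == "__main__":
    lines = ["the quick brown fox", "the lazy dog", "the fox"]
    print(reduce_phase(map_phase(lines)))
    # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```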

XML documents represent a whole data file, which contains markups, elements, and nodes (Lublinsky, Smith, & Yakubovich, 2013; Myer, 2005), as illustrated in the short snippet after this list:

  • XML markups are tags that help describe the data's start and end points as well as the data's properties/attributes, and are enclosed by < and >
  • XML elements are data values, encapsulated by an opening <tag> and a closing </tag>
  • XML nodes are part of the hierarchical structure of a document that contains a data element and its tags
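For instance, the small Python snippet below (a hypothetical two-element document) uses the standard xml.etree.ElementTree module to show all three pieces at once: the <book> markup with its id attribute, the element value inside <title>, and the nodes of the hierarchy:

```python
import xml.etree.ElementTree as ET

# A tiny document: <book> and <title> are markups, id="42" is an
# attribute, and "Moby Dick" is the element value inside a node.
doc = """
<library>
  <book id="42">
    <title>Moby Dick</title>
  </book>
</library>
"""

root = ET.fromstring(doc)           # root node of the hierarchy
for book in root.iter("book"):      # walk the <book> nodes
    print(book.get("id"))           # attribute in the markup -> 42
    print(book.find("title").text)  # element value -> Moby Dick
```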

Unfortunately, the syntax and tags are redundant, which can consume huge numbers of bytes and slow down processing speeds (Hiroshi, 2007).

Five questions must be asked before designing an XML data document (Font, 2010):

  1. Will this document be part of a solution?
  2. Will this document have design standards that must be followed?
  3. What part may change over time?
  4. To what extent is human readability or machine readability important?
  5. Will there be a massive amount of data? Does file size matter?

All XML data documents should be versioned, and key stakeholders should be involved in the XML data design process (Font, 2010). XML is a machine- and human-readable data format (Smith, 2012). With a goal of using XML for MapReduce, we need to assume that we need to map and reduce huge files (Eini, 2010; Smith, 2012). Unfortunately, XML doesn't include sync markers in the data format, and therefore MapReduce doesn't support XML natively (Smith, 2012). However, Smith (2012) and Rohit (2013) used the XmlInputFormat class from Mahout to feed XML input data into HBase. Smith (2012) stated that Mahout's code needs to know the exact sequence of XML start and end tags that will be searched for, and that elements with attributes are hard for Mahout's XML library to detect and parse.
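As a rough sketch of that approach, the PySpark snippet below reads a file through a Hadoop InputFormat such as Mahout's XmlInputFormat, which splits records on the exact start/end tag strings, which is why the tags must be known in advance. The class path, file path, and tag strings here are assumptions for illustration (the class location varies across Mahout versions), not the exact pipeline Smith or Rohit describe:

```python
from pyspark import SparkContext

sc = SparkContext(appName="xml-ingest")

# XmlInputFormat splits the input on these exact tag strings, so the
# start and end tags must be specified up front (Smith, 2012).
conf = {
    "xmlinput.start": "<book>",
    "xmlinput.end": "</book>",
}

records = sc.newAPIHadoopFile(
    "hdfs:///data/library.xml",                           # assumed path
    "org.apache.mahout.classifier.bayes.XmlInputFormat",  # assumed class path
    "org.apache.hadoop.io.LongWritable",
    "org.apache.hadoop.io.Text",
    conf=conf,
)

# Each value is one <book>...</book> fragment, ready to parse and load.
print(records.values().take(1))
```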

Apache Spark started from a working group inside and outside of UC Berkeley searching for an open-sourced batch-processing model that, unlike MapReduce, handles multi-pass algorithms well (Zaharia et al., 2012). Spark is faster than Hadoop on iterative operations by 25x-40x for really small datasets and 3x-5x for relatively large datasets, but Spark is more memory intensive, and the speed advantage disappears when available memory goes down to zero with really large datasets (Gu & Li, 2013). Apache Spark, on its website, boasts that it can run programs 100x faster than Hadoop's MapReduce in memory (Spark, n.d.). Spark outperforms Hadoop by 10x on iterative machine learning jobs (Gu & Li, 2013). Also, Spark runs 10x faster than Hadoop on disk memory (Spark, n.d.). Gu and Li (2013) recommend that if speed to the solution is not an issue but memory is, then Spark shouldn't be prioritized over Hadoop; however, if speed to the solution is critical and the job is iterative, Spark should be prioritized.
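The iterative advantage comes largely from Spark's ability to cache a working set in memory and reuse it across passes, where MapReduce re-reads from disk on every iteration. Below is a minimal PySpark sketch of that pattern; the file path, parsing, and iteration count are placeholders:

```python
from pyspark import SparkContext

sc = SparkContext(appName="iterative-demo")

# Cache the dataset in memory once; each later pass reuses the cached
# copy instead of re-reading HDFS, which is where the iterative
# speedup over MapReduce comes from (Gu & Li, 2013).
data = sc.textFile("hdfs:///data/points.txt").map(float).cache()

estimate = 0.0
for _ in range(10):                      # placeholder iteration count
    # Every pass re-scans the cached RDD, e.g. a damped mean update.
    estimate = 0.5 * (estimate + data.mean())

print(estimate)
```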

Data visualization

Big data can be defined as any set of data that has high velocity, volume, and variety, also known as the 3Vs (Davenport & Dyche, 2013; Fox & Do, 2013; Podesta, Pritzker, Moniz, Holdren, & Zients, 2014). What is considered to be big data can change with respect to time: what was considered big data in 2002 is not considered big data in 2016, due to advancements made in technology over time (Fox & Do, 2013). Then there is data-in-motion, which can be defined as a part of data velocity that deals with the speed of data coming in from multiple sources as well as the speed of data traveling between systems (Katal, Wazid, & Goudar, 2013). Essentially, data-in-motion can encompass data streaming, data transfer, or real-time data. However, there are challenges and issues that have to be addressed when conducting real-time analysis on data streams (Katal et al., 2013; Tsinoremas et al., n.d.).

It is not enough to analyze the relevant data for data-driven decisions; one must also select relevant visualizations of that data to enable those data-driven decisions (eInfochips, n.d.). There are many ways to visualize data so as to highlight key facts succinctly and with style: tables and rankings, bar charts, line graphs, pie charts, stacked bar charts, tree maps, choropleth maps, cartograms, pinpoint maps, or proportional symbol maps (CHCF, 2014). These plots, charts, maps, and graphs can be animated, static, or interactive, and can stand alone as a single image or be part of dashboards, scorecards, or infographics (CHCF, 2014; eInfochips, n.d.).
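As a small illustration of matching a chart type to the message, the sketch below uses made-up sample data to draw the simplest form on that list, a bar chart, with matplotlib; a ranking of a few discrete categories is exactly the case a bar chart handles well:

```python
import matplotlib.pyplot as plt

# Made-up sample data: a bar chart suits a ranking of a handful of
# discrete categories (CHCF, 2014).
regions = ["North", "South", "East", "West"]
sales = [120, 95, 143, 80]

fig, ax = plt.subplots()
ax.bar(regions, sales)
ax.set_xlabel("Region")
ax.set_ylabel("Sales (units)")
ax.set_title("Sales by region")
plt.show()
```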

Artificial Intelligence (AI)

Artificial Intelligence (AI) is an embedded technology, based on the current infrastructure (i.e., supercomputers), big data, and machine learning algorithms (Cyranoski, 2015; Power, 2015). AI can provide tremendous value since it builds thousands of models and correlations automatically in one week, something that used to take a few quantitative data scientists years to do (Dewey, 2013; Power, 2015). Unfortunately, the rules created by AI out of 50K variables lack substantive human meaning, or the "why" behind them, thus making the results hard to interpret (Power, 2015).

“Machines can excel at frequent high-volume tasks. Humans can tackle novel situations,” said Anthony Goldbloom. Thus, the fundamental question that decision makers need to ask is how much of the decision reduces to frequent high-volume tasks and how much of it reduces to novel situations (Goldbloom, 2016). Therefore, if the ratio is skewed toward high-volume tasks, then AI could be a candidate to replace decision makers; if the ratio is evenly split, then AI could augment and assist decision makers; and if the ratio is skewed toward novel situations, then AI wouldn't help decision makers. These novel situations are equivalent to our tough challenges today (McAfee, 2013). Finally, Meetoo (2016) warned that it doesn't matter how intelligent or strategic a job may be: if there is enough data on that job to create accurate rules, it can be automated as well, because machine learning can run millions of simulations against itself to generate huge volumes of data to learn from.


Resources:


Data Tools: Artificial Intelligence and Internet of Things

The future is about the integration and convergence of sensor networks, data analytics, cloud, APIs, and artificial intelligence. The technology trend is about the use of devices to make the right decisions based on large amounts of data, and to help with daily life and business operations. Thus, how are the Internet of Things and Artificial Intelligence connected and related?

Radio Frequency Identification (RFID) tags are the fundamental technology behind the Internet of Things (IoT); they are everywhere and are shipped more frequently than smartphones (Ashton, 2015). The IoT is the explosion of device/sensor data, which is growing the amount of structured data exponentially, with huge opportunities (Jaffe, 2014; Power, 2015). Ashton (2016) analogizes the IoT to fancy windmills, where data scientists and computer scientists are taking energy and harnessing it to do amazing things. Newman (2016) stated that there is a natural progression for sensor objects to become learning objects, with a final desire to connect all of the IoT into one big network. Essentially, the IoT gives machines senses through devices/sensors (Ashton, 2015).

Artificial Intelligence and the Internet of Things

Thus, analyzing this sensor data to derive data-driven insights and actions is key for companies to derive value from the data they are gathering from a wide range of sensors. In 2016, the IoT has two main issues if it is left on its own and not tied to anything else (Jaffe, 2014; Newman, 2016):

  • The devices/sensors cannot deal with the massive amounts of data generated and collected
  • The devices/sensors cannot learn from the data they generate and collect

Thus, artificial intelligence (AI) should be able to store and mine all the data that is collected from a wide range of sensors to give it meaning and value (Canton, 2016; Jaffe, 2014). The full potential of the IoT cannot be realized without AI or machine learning (Jaffe, 2014). The value derived from the IoT depends on how quickly AI, through machine learning, can deliver actionable insights to key stakeholders (Tang, 2016). AI would bring out the potential of the IoT by quickly and naturally collecting, analyzing, organizing, and feeding valuable data to key stakeholders, transforming the field from the standard IoT into the Internet of Learning-Things (IoLT) (Jaffe, 2014; Newman, 2016). Tang (2016) stated that the IoT is limited by how efficiently AI can analyze the data it generates. Given that AI is best suited for frequent, high-volume data (Goldbloom, 2016), AI relies on IoT technology to sustain its learning.

Another high-potential use of the IoT with AI is analyzing data-in-motion, that is, analyzing data immediately after collection to identify hidden patterns or meaning and create actionable data-driven decisions (Jaffe, 2014).
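As a toy sketch of what analysis on data-in-motion can look like, the hypothetical Python snippet below keeps a sliding window over a stream of sensor readings and flags values that deviate sharply from the recent average; a real deployment would run on a streaming engine, but the pattern is the same:

```python
from collections import deque
from statistics import mean, stdev

def monitor(readings, window_size=20, threshold=3.0):
    """Flag readings more than `threshold` std devs away from the
    sliding-window mean -- analysis on data-in-motion, not at rest."""
    window = deque(maxlen=window_size)
    for value in readings:
        if len(window) >= 2:
            mu, sigma = mean(window), stdev(window)
            if sigma > 0 and abs(value - mu) > threshold * sigma:
                yield value              # actionable anomaly
        window.append(value)

# Made-up temperature stream with one spike at 99.0
stream = [21.0, 21.2, 20.9, 21.1, 21.0, 99.0, 21.2]
print(list(monitor(stream)))             # [99.0]
```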

Connection: One without the other or not?

In summary, AI helps give meaning and value to the IoT, and the IoT cannot work without AI, since the IoT supplies huge amounts of frequent data, which AI thrives upon. It goes without saying that a source of data for AI can come from the IoT. However, if there were no IoT, social media could provide AI with the amounts of data needed to generate insight, albeit different insights would be gained from different sources of voluminous data. Thus, the IoT technology's worth depends on AI, but AI doesn't depend solely on the IoT.

Resources:

Big Data Analytics: POTUS Report

This has become a data-centric society, relying on real-time data and technology (i.e., cell phones, shopping online, social networking) more than ever. Although there are many advantages associated with the use of this data, there are concerns that the collection of massive amounts of data can lead to an invasion of privacy. In January 2014, President Obama asked his staff to take the next 90 days to prepare a report for him on how big data is affecting people's privacy. This post revolves around that report.

The aim of big data analytics is for data scientists to fuse data from various data sources, of various data types, and in huge amounts, so that they can find relationships, identify patterns, and find anomalies. Big data analytics can help provide a descriptive, prescriptive, or predictive result to a specific research question. Big data analytics isn't perfect: sometimes the results are not significant, and we must realize that correlation is not causation. Regardless, there are a ton of benefits from big data analytics, and this is a field where policy has yet to catch up, in order to protect the nation from potential downsides while still promoting and maximizing benefits.

Policies for maximizing benefits while minimizing risks in the public and private sectors

In the private sector, companies can create detailed personal profiles that enable personalized services from a company to a consumer. Interpreting personal profile data would allow a company to retain and command more of the market share, but it can also leave room for discrimination in pricing, service quality/type, and opportunities through “filter bubbles” (Podesta, Pritzker, Moniz, Holdren, & Zients, 2014). Policy recommendations should encourage de-identifying personally identifiable information to a point where it could not lead to re-identification of the data. Current policies in the private sector for promoting privacy are (Podesta et al., 2014):

  • The Fair Credit Reporting Act helps promote the fairness and privacy of credit and insurance information
  • The Health Insurance Portability and Accountability Act enables people to understand and control how personal health data is used
  • The Gramm-Leach-Bliley Act helps consumers of financial services retain their privacy
  • The Children's Online Privacy Protection Act minimizes the collection/use of data on children under the age of 13
  • The Consumer Privacy Bill of Rights is a privacy blueprint that helps people understand what personal data is being collected about them and ensures it is used in ways consistent with their expectations

In the public sector, we run into issues when the government collects information about its citizens for one purpose and eventually uses that same citizen data for a different purpose (Podesta et al., 2014). This has the potential to let the government exert power over certain types of citizens and hamper civil rights progress in the future. Current policies in the public sector are (Podesta et al., 2014):

  • The Affordable Care Act allows for building a better health care system, moving from a “fee-for-service” program to “fee-for-better-outcomes.” This has allowed the use of big data analytics to promote preventative care rather than emergency care, while preventing the use of that data to eliminate health care coverage for “pre-existing health conditions.”
  • The Family Educational Rights and Privacy Act, the Protection of Pupil Rights Amendment, and the Children's Online Privacy Protection Act help seal children's educational records to prevent misuse of that data.

Identifying opportunities for big data in the economy, health, education, safety, and energy efficiency

In the economy, the Internet of Things is being used to equip product parts with sensors that monitor and transmit thousands of data points live, for sending alerts. These alerts can tell us when maintenance is needed, for which part, and where it is located, saving time across the entire process and improving overall safety (Podesta et al., 2014).

In medicine, predictive analytics can be used to identify instances of insurance fraud, waste, and abuse in real time, saving more than $115M per year (Podesta et al., 2014). Another instance of using big data is the study of neonatal intensive care, using current data to create prescriptive results that determine which newborns are likely to come into contact with which infections and what the outcomes would be (Podesta et al., 2014). Monitoring newborns' heart rates and temperatures along with other health indicators can alert doctors to the onset of an infection, preventing it from getting out of hand. Huge genetic data sets are helping locate the genetic variants behind certain types of genetic diseases that were once hidden in our genetic code (Podesta et al., 2014).

With regard to national safety and foreign interests, data scientists and data visualizers have been using data gathered by the military to help commanders solve real operational challenges on the battlefield (Podesta et al., 2014). Using big data analytics on satellite data, surveillance data, and road traffic-flow data is making it easier to detect, locate, and properly dispose of improvised explosive devices (IEDs). The Department of Homeland Security is aiming to use big data analytics to identify threats as they enter the country, as well as people with a higher-than-normal probability of committing acts of violence within the country (Podesta et al., 2014). Another safety-related use of big data analytics is the identification of human trafficking networks through analyzing the “deep web” (Podesta et al., 2014).

Finally, for energy efficiency, understanding weather patterns and climate change can help us understand our contribution to climate change based on our use of energy and natural resources. By analyzing traffic data, we can improve energy efficiency and public safety in our current lighting infrastructure by dimming lights at appropriate times (Podesta et al., 2014). Energy efficiency can also be maximized within companies by using big data analytics to control direct and indirect energy uses (through optimizing supply chains and monitoring equipment). Another way we are moving to a more energy-efficient future is the government partnering with electric utility companies to provide businesses and families access to their personal energy usage in an easy-to-digest manner, allowing people and companies to make changes to their current consumption levels (Podesta et al., 2014).

Protecting your own privacy outside of policy recommendations

The report suggests that we can control our own privacy by using the private-browsing function in most current internet browsers, which helps prevent the collection of personal data (Podesta et al., 2014). However, private browsing varies from browser to browser. For important decisions like being denied employment, credit, or insurance, consumers should be empowered to know why they were denied and should ask for that information (Podesta et al., 2014). Finding out the reason allows people to address those issues in order to persevere in the future. We can also encrypt our communications, with the highest-bit protection available, to protect our privacy. We need to educate ourselves on how to protect our personal data, build digital literacy, and understand how big data can be used and abused (Podesta et al., 2014). While we wait for current policies to catch up with the times, we actually have more power over our own data and privacy than we know.


Reference:

Podesta, J., Pritzker, P., Moniz, E. J., Holdren, J., & Zients, J. (2014). Big data: Seizing opportunities, preserving values. Executive Office of the President. Retrieved from https://www.whitehouse.gov/sites/default/files/docs/big_data_privacy_report_may_1_2014.pdf