Leadership and Social Media

The frequency, use, and depth of engagement on social media will increase as the popularity of social media increases, so it is important to be able to define what it is (Wollan, Smith, & Zhou, 2010). However, any definition of social media will change over time, because social media depends on the technology and platforms that enable and facilitate social connection (Cohen, 2011; Solis, 2010). That social connection shifts content creation and delivery from a “one-to-many” model to a “many-to-many” model (Solis, 2010; Wollan et al., 2010), and it exists between content hosted online and the consumers of that content (Cohen, 2011). Wollan et al. (2010) defined social media as highly accessible and scalable; thus, social media democratizes content, information, and influence (Solis, 2010; Wollan et al., 2010). In the end, social media allows for swift content creation and dissemination by content creators, whether an organization or a single person or group (Wollan et al., 2010).

From a business perspective, social media has become a customer relationship management tool between a business and its customers (Wollan et al., 2010). Companies that use social media to deliver products and services need a different type of leadership style. Li (2010) describes a case study in which the Red Cross wanted to control social media after seeing the negative impacts of its Hurricane Katrina response. Over time (not overnight), however, the Red Cross realized it was better to engage in an open dialog with its participants over social media, which paid off: the organization raised $10M for Haiti earthquake relief in three days in 2010. This was only possible once the social media strategy handbook was published online, allowing not only corporate headquarters but also the local Red Cross chapters to begin using social media (Li, 2010; American Red Cross, 2012). The Red Cross had to let go of controlling its image word for word and allow its chapters to do so.

This “let it go” style is the main leadership style that Li (2010) proposes in Open Leadership for businesses to succeed in their use of social media. Social media is driving a leadership style that is more democratic (Stupid stuff for dummies, 2011), owing to social media’s democratizing properties. Because people vote with how they spend their dollars, social media allows a company to engage with its customers by exhibiting greater (though not total) transparency and authenticity (Li, 2010). Open Leadership is not about controlling technology but about establishing the plan or relationship that is wanted with the social media platform, in order to maintain a democratic leadership style that grows a corporation successfully.


Compelling topics

Hadoop, XML and Spark

Hadoop is predominantly known for its Hadoop Distributed File System (HDFS), which distributes data across multiple systems, and for its code for running MapReduce tasks (Rathbone, 2013). MapReduce has two phases: a map phase, which transforms the input data into an intermediate format and splits it across a group of computer nodes, and a reduce phase, which combines the data in each node so that, when all the nodes are combined, the answer sought can be produced (Eini, 2010).
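The two phases can be sketched in plain Python; this is a toy word count illustrating the idea, not Hadoop's actual API:

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    # Map: emit a (key, value) pair for every word in this input split
    return [(word.lower(), 1) for word in document.split()]

def reduce_phase(pairs):
    # Reduce: combine all values that share a key into one result per key
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

# Each "node" maps its own split of the data in parallel...
splits = ["big data big insights", "big data small files"]
mapped = list(chain.from_iterable(map_phase(s) for s in splits))
# ...and the reduce step combines the mapped output into the final answer
result = reduce_phase(mapped)
# result["big"] == 3 and result["data"] == 2
```

In a real cluster, the map outputs are shuffled so that all pairs with the same key land on the same reducer node; the single `reduce_phase` call above stands in for that step.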

XML documents represent a whole data file, which contains markups, elements, and nodes (Lublinsky, Smith, & Yakubovich, 2013; Myer, 2005):

  • XML markups are tags that help describe the data’s start and end points as well as the data’s properties/attributes, and are encapsulated by < and >
  • XML elements are data values, encapsulated by an opening <tag> and a closing </tag>
  • XML nodes are part of the hierarchical structure of a document that contains a data element and its tags

Unfortunately, the syntax and tags are redundant, which can consume huge amounts of bytes and slow down processing speeds (Hiroshi, 2007).
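A quick illustration of that overhead, comparing the same hypothetical record in XML and in a bare comma-separated form (the element names and values are invented for this example):

```python
# The same three values encoded in XML and in CSV; the tag syntax
# dominates the payload, which is the redundancy noted above.
xml_record = (
    "<Patient>"
    "<PatientFirstName>Ada</PatientFirstName>"
    "<PatientLastName>Lovelace</PatientLastName>"
    "<PatientZipCode>86001</PatientZipCode>"
    "</Patient>"
)
csv_record = "Ada,Lovelace,86001"

overhead = len(xml_record) - len(csv_record)
# The XML encoding is several times the size of the raw values alone.
```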

Five questions must be asked before designing an XML data document (Font, 2010):

  1. Will this document be part of a solution?
  2. Will this document have design standards that must be followed?
  3. What part may change over time?
  4. To what extent is human readability or machine readability important?
  5. Will there be a massive amount of data? Does file size matter?

All XML data documents should be versioned, and key stakeholders should be involved in the XML data design process (Font, 2010). XML is a machine- and human-readable data format (Smith, 2012). With the goal of using XML for MapReduce, we need to assume that we must map and reduce huge files (Eini, 2010; Smith, 2012). Unfortunately, XML doesn’t include sync markers in its data format, and therefore MapReduce doesn’t support XML natively (Smith, 2012). However, Smith (2012) and Rohit (2013) used the XmlInputFormat class from Mahout to bring XML input data into HBase. Smith (2012) stated that Mahout’s code needs to know the exact sequence of XML start and end tags that will be searched for, and that elements with attributes are hard for Mahout’s XML library to detect and parse.
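The fixed start/end-tag scanning that Smith (2012) describes can be sketched in a few lines of Python; this illustrates the idea, not Mahout's actual implementation:

```python
def split_xml_records(stream, start_tag, end_tag):
    """Yield every substring between start_tag and end_tag, mimicking the
    literal start/end-tag scanning that XmlInputFormat-style readers use."""
    pos = 0
    while True:
        start = stream.find(start_tag, pos)
        if start == -1:
            return  # no more records in this split
        end = stream.find(end_tag, start)
        if end == -1:
            return  # truncated record at the end of the split
        yield stream[start:end + len(end_tag)]
        pos = end + len(end_tag)

data = "<log><rec>a</rec><rec>b</rec></log>"
records = list(split_xml_records(data, "<rec>", "</rec>"))
# records == ["<rec>a</rec>", "<rec>b</rec>"]
```

Note that a literal start tag such as `<rec>` fails to match `<rec id="1">`, which is exactly the attribute problem Smith (2012) reported.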

Apache Spark started from a working group inside and outside of UC Berkeley, in search of an open-sourced, multi-pass batch processing model of MapReduce (Zaharia et al., 2012). Spark is faster than Hadoop on iterative operations by 25x–40x for very small datasets and 3x–5x for relatively large datasets, but Spark is more memory intensive, and its speed advantage disappears as available memory approaches zero with very large datasets (Gu & Li, 2013). Apache Spark’s website boasts that it can run programs 100x faster than Hadoop’s MapReduce in memory (Spark, n.d.). Spark outperforms Hadoop by 10x on iterative machine learning jobs (Gu & Li, 2013). Also, Spark runs 10x faster than Hadoop on disk (Spark, n.d.). Gu and Li (2013) recommend that if speed to the solution is not an issue but memory is, then Spark shouldn’t be prioritized over Hadoop; however, if speed to the solution is critical and the job is iterative, Spark should be prioritized.

Data visualization

Big data can be defined as any set of data that has high velocity, volume, and variety, also known as the 3Vs (Davenport & Dyche, 2013; Fox & Do, 2013; Podesta, Pritzker, Moniz, Holdren, & Zients, 2014). What counts as big data changes with time: what was considered big data in 2002 is not considered big data in 2016, due to advancements in technology (Fox & Do, 2013). Then there is data-in-motion, which can be defined as part of data velocity and deals with the speed of data coming in from multiple sources as well as the speed of data traveling between systems (Katal, Wazid, & Goudar, 2013). Essentially, data-in-motion can encompass data streaming, data transfer, or real-time data. However, there are challenges and issues that have to be addressed before conducting real-time analysis on data streams (Katal et al., 2013; Tsinoremas et al., n.d.).

It is not enough to analyze the relevant data for data-driven decisions; one must also select relevant visualizations of that data to enable those decisions (eInfochips, n.d.). There are many ways to visualize data that highlight key facts stylishly and succinctly: tables and rankings, bar charts, line graphs, pie charts, stacked bar charts, tree maps, choropleth maps, cartograms, pinpoint maps, or proportional symbol maps (CHCF, 2014). These plots, charts, maps, and graphs can be animated, static, or interactive visualizations, and can appear as standalone images, dashboards, scorecards, or infographics (CHCF, 2014; eInfochips, n.d.).

Artificial Intelligence (AI)

Artificial Intelligence (AI) is an embedded technology, built on the current infrastructure (i.e., supercomputers), big data, and machine learning algorithms (Cyranoski, 2015; Power, 2015). AI can provide tremendous value since it builds thousands of models and correlations automatically in one week, which used to take a few quantitative data scientists years to do (Dewey, 2013; Power, 2015). Unfortunately, the rules created by AI out of 50K variables lack substantive human meaning, or the “why” behind them, thus making the results hard to interpret (Power, 2015).

“Machines can excel at frequent high-volume tasks. Humans can tackle novel situations,” said Anthony Goldbloom. Thus, the fundamental question that decision makers need to ask is how much of a decision reduces to frequent high-volume tasks and how much reduces to novel situations (Goldbloom, 2016). If the ratio is skewed toward high-volume tasks, AI could be a candidate to replace the decision maker; if the ratio is evenly split, AI could augment and assist the decision maker; and if the ratio is skewed toward novel situations, AI wouldn’t help. These novel situations are equivalent to our tough challenges today (McAfee, 2013). Finally, Meetoo (2016) warned that it doesn’t matter how intelligent or strategic a job may be: if there is enough data on that job to create accurate rules, it can be automated as well, because machine learning can run millions of simulations against itself to generate huge volumes of data to learn from.
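That decision rule can be made concrete as a small heuristic; the ratio thresholds below are illustrative assumptions, not figures from Goldbloom (2016):

```python
def ai_role(high_volume_ratio):
    """Suggest AI's role given the fraction (0.0..1.0) of a job that is
    frequent, high-volume work; the cutoffs are illustrative only."""
    if high_volume_ratio >= 0.7:
        return "replace"   # mostly repetitive work: AI could take over
    if high_volume_ratio >= 0.3:
        return "augment"   # evenly mixed: AI assists the decision maker
    return "no help"       # mostly novel situations: AI adds little

# A job that is 90% repetitive is a replacement candidate; one that is
# 10% repetitive (mostly novel) is not.
roles = [ai_role(0.9), ai_role(0.5), ai_role(0.1)]
```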

 


Design Proposal for Healthcare based on the Internet of Things

As a big data analyst, this post considers a major corporation that owns a string of state-of-the-art hospitals in the Four Corners area of the United States (Arizona, Colorado, New Mexico, and Utah) and would like to incorporate data analytics into its operations. The first official task is to provide a Hadoop solution to its business problem of analyzing various data sets, some structured and some unstructured. Thus, this is a design proposal using a Hadoop environment, with examples for a design flow chart through XML.

Introduction

This design proposal is for the major corporation that owns a string of state-of-the-art hospitals in the Four Corners area of the United States (Arizona, Colorado, New Mexico, and Utah). The solution proposed is a centralized Healthcare Information Management System (HIMS) to give key stakeholders access to information derived from data that may be hidden in silos and to bring forth the vital information needed for data-driven decisions. The goal of this proposal is to allow for collecting, processing, and analyzing data to deliver key insights quickly and accurately, in order to provide better service to patients. This proposal should allow these hospitals to be more agile, responsive, and competitive in the healthcare industry. Thus, this proposal calls for the use of Hadoop to analyze data derived from the Internet of Things (IoT) and social media for these hospitals: a set of large, streaming data sets from multiple devices/sensors dealing with sensitive patient data. This design proposal will consist of a design flow chart that uses XML, include the use of Hadoop, and recommend a suite of data visualization tools currently used in the industry to visualize HIMS data.

Requirements

The HIMS must allow for analysis and visualization of new, functional, and experimental data in the context of old existing data, information, and knowledge (Higdon et al., 2016). Dealing with both structured and unstructured data can present a real-world challenge for most hospitals. Structured data comes from the devices and sensors used to monitor patients, while unstructured data comes from clinical, nursing, and doctors’ notes and diagnoses, which are being saved in a centralized data warehouse. Other data sources that help maintain finances, HR, facility management, etc., are out of scope for this proposal. Another source of unstructured data is social media, both posts from those related to the patients and posts sent out by the hospital; both are also out of scope for this design.

The proposed system should be able to integrate internal and external datasets and structured and unstructured data from different sources, such that a traditional relational database is not adequate to handle the amount of data and the different data types (Hendler, 2016). Traditional databases rely on tables and arrays, but data from the healthcare industry comes in N-dimensional arrays as well as from multiple different sources, which can vary across time and contain text data (Gary et al., 2005). Hence, traditional databases are not the best solution for the HIMS.

Since Hadoop is a Platform as a Service (PaaS) cloud solution, it can administer the runtime, middleware, O/S, virtualization, servers, storage, and networking, while the IT department managing the HIMS deals with the applications, data, and visualization of the results (Lau, 2011). This provides another advantage over traditional relational database systems for the HIMS, which are dependent on the hardware infrastructure (Minelli, Chambers, & Dhiraj, 2013). One benefit of using a cloud-based solution is the pay-as-you-go business model, which allows the healthcare industry in these four states to pay only for what it needs at the time, allowing computational and financial resources to scale up and down (Dikaiakos, Katsaros, Mehra, Pallis, & Vakali, 2009; Lau, 2011). Finally, another benefit of using a cloud-based solution like Hadoop is the reduction in IT infrastructure costs: there will be no need to absorb the cost of upgrading the IT infrastructure every 2 to 3 years (Lau, 2011).

Therefore, to deal with the amount and complexity of the types of data, a PaaS solution using Hadoop is recommended over relational databases. PaaS tools like Hadoop use distributed databases, which allow for parallel processing of huge amounts of data, parallel searching, metadata management, analysis, use of tools like MapReduce, and workflow system analysis (Gary et al., 2005; Hortonworks, 2013; IBM, n.d.; Minelli et al., 2013). MapReduce uses the Hadoop Distributed File System (HDFS) to map the data and reduce it using parallel processing to discover hidden insights in the healthcare data (Gary et al., 2005; Hortonworks, 2013; IBM, n.d.). HDFS stores the data in smaller blocks, which can be recombined when needed into one system, like Lego blocks, and provides high throughput to the data (Hortonworks, 2013; IBM, n.d.).

Finally, for data visualization, the California HealthCare Foundation ([CHCF], 2014) recommended data visualization tools that everyone could use: Google Charts & Maps, Tableau Public, Mapbox, Infogram, Many Eyes, iCharts, and Datawrapper. CHCF (2014) also recommended data visualization tools for developers, such as Highcharts, TileMill, D3.js, Flot, FusionCharts, OpenLayers, and JSMap. Any of these solutions is fine, and choosing among them should be left to the IT professionals and key stakeholders. All of this technological backbone to the HIMS can become complicated, and patients, healthcare providers, healthcare clearinghouses, and health plan staff do not need this knowledge to get their jobs done and meet their data needs. It is the job of the IT department to make this happen and to educate people about using the HIMS to meet their needs.

Education among the healthcare providers, healthcare clearinghouses, and health plans on the role of a centralized HIMS for conducting data analysis to improve patient care is necessary (Higdon et al., 2016). These groups of people will be interfacing with the system, the Hadoop environment, the data, etc. Therefore, an outreach and education program must be implemented to ensure buy-in by all the key stakeholders, along with training on how to use the system to meet their current needs. When interacting with the HIMS, a graphical user interface (GUI) should be used by patients, healthcare providers, healthcare clearinghouses, and health plans, to ensure ease of use:

  • patients should have read-only access to their own data
  • healthcare providers should have read/write/edit access to the data of the patients they are caring for
  • healthcare clearinghouses and health plans should have read-only access to patient medication and services performed, no access to anything else, and read/write/edit access to data that fits their scope of work
  • the IT department should have full privileges to read/write/edit de-identified patient data, to ensure data quality and data input/output and to support data mining

GUIs are software built on top of the Hadoop environment and on the computational front-end hardware used to access this data. The GUI could have forms allowing providers to Extract, Load, and Transform data easily, independent of the operating system (Linux, Mac OS, Windows, Solaris, etc.) and device (laptops, tablets, smartphones, etc.).

Data flow diagrams

Using GUI systems and forms, data that is written or edited can easily be serialized into an XML format, which should be expandable, consistent across all four states, and compliant with the standards of all four states and the federal government (Font, 2010):

<?xml version="1.0"?>
<!-- File name: HIMSmanualDataEntry.xml -->
<State>
    <Statename>Arizona</Statename>
    <HospitalName>…</HospitalName>
    <DepartmentName>…</DepartmentName>
    …
</State>
<Patient>
    <PatientFirstName>…</PatientFirstName>
    <PatientMiddleName>…</PatientMiddleName>
    <PatientLastName>…</PatientLastName>
    <PatientID>…</PatientID>
    <PatientDOB>…</PatientDOB>
    <PatientStreetAddress>…</PatientStreetAddress>
    <PatientCity>…</PatientCity>
    <PatientState>…</PatientState>
    <PatientZipCode>…</PatientZipCode>
    <PatientHeight>…</PatientHeight>
    <PatientWeight>…</PatientWeight>
    <PatientOnMedications>…</PatientOnMedications>
    <PatientPrimaryCarePhysician>…</PatientPrimaryCarePhysician>
    …
    <PatientSatisfactionSurveyResultsQ1>…</PatientSatisfactionSurveyResultsQ1>
    …
    <PatientResponseOnSocialMediaPage>…</PatientResponseOnSocialMediaPage>
    …
</Patient>

Data from sensors could also be entered into an XML format:

<?xml version="1.0"?>
<!-- File name: HIMSmanualDataEntry.xml -->
<Sensor>
    <SensorName>…</SensorName>
    <SensorType>…</SensorType>
    <SensorManufacturer>…</SensorManufacturer>
    <SensorMarginOfError>…</SensorMarginOfError>
    <SensorMinValue>…</SensorMinValue>
    <SensorMaxValue>…</SensorMaxValue>
</Sensor>
<State>
    <Statename>Arizona</Statename>
    <HospitalName>…</HospitalName>
    <DepartmentName>…</DepartmentName>
</State>
<Patient>
    <PatientID>…</PatientID>
</Patient>
<SensorReadings>
    <TimeStamp>…</TimeStamp>
    <HealthIndicator1Value>…</HealthIndicator1Value>
    <HealthIndicator2Value>…</HealthIndicator2Value>
</SensorReadings>

 

All of this patient data comes in through these different data sources, via a GUI or directly in XML format, and is entered into each state’s respective data center; each state then feeds that data into the HIMS data center, where it is processed, and the results are displayed in the GUI to the end user (Figure 1). Parallel processing occurs when the data is split, mapped, and reduced across N nodes, which returns data-driven results at a much faster pace than having one node reduce the data in the HIMS data center.
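As a sketch of how a downstream process might ingest such a record, using Python's standard xml.etree module and a trimmed-down set of the element names from the manual-entry listing above (the values are invented for this example):

```python
import xml.etree.ElementTree as ET

# A trimmed-down manual-entry record; element names follow the sketch above
record = """
<Patient>
  <PatientFirstName>Ada</PatientFirstName>
  <PatientLastName>Lovelace</PatientLastName>
  <PatientID>12345</PatientID>
  <PatientState>Arizona</PatientState>
</Patient>
"""

root = ET.fromstring(record)
# Flatten the child elements into a tag -> text dictionary for processing
patient = {child.tag: child.text for child in root}
# patient["PatientID"] == "12345"
```

A parsed dictionary like this is a convenient unit to hand to a map phase, since each record becomes one key/value structure.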


Figure 1: Data flow diagram from the data source to data processing to results.

Overall system diagram

The entire HIMS is built on HDFS and Hadoop to leverage parallel processing and data analytics programs like MapReduce (Figure 2). Hadoop’s Mahout library, essentially a data analytics library, would allow Hadoop to use classification, filtering, k-means, Dirichlet, parallel pattern, and Bayesian classification, similar to Hadoop’s MapReduce (Wayner, 2013; Lublinsky, Smith, & Yakubovich, 2013). Hadoop’s Avro library should help process the XML data in the HIMS (Wayner, 2013; Lublinsky et al., 2013), while Hadoop’s YARN can handle job scheduling and cluster management (Lublinsky et al., 2013). Finally, Hadoop’s Oozie library, essentially a workflow manager, manages the workflow of a job by allowing the user to break the job into simple steps in a flowchart fashion (Wayner, 2013; Lublinsky et al., 2013). Through the use of Hadoop’s MapReduce function, petabytes or exabytes of data will be split into megabyte- or gigabyte-sized files (depending on the number of nodes and the input data) to be processed and analyzed. The analyzed results will surface useful data and insights hidden in the large data to the end user on their hardware devices via a GUI front end.


Figure 2: Overall systems diagram for HIMS, which is based on Hadoop solution.

Communication flow chart

Each of the hospitals will have patients, healthcare providers, healthcare clearinghouses, and health plan professionals, and each of them talks to one another. Data derived from these conversations is placed inside each hospital’s data center, then transmitted to each state’s data center, and finally placed into the HIMS data center (Figure 3). Figure 3 also shows the hidden communications of regulations, policies, and governance set forth between the hospitals at both the federal and state level, which govern how the data is communicated between the systems. The proposed system must follow all regulations, policies, and governance set forth by local, state, and federal governments as well as the internal policies and procedures of the healthcare system.


Figure 3. The communications flow chart for the proposed HIMS solution.

Regulations, policies, and governance for the Healthcare industry

With any data solution that involves patient data, the data itself must be recognized as being about a person’s identity, and how that identity flows from one IT platform to another is where the concept of privacy and the protection of a patient’s identity becomes important (Richards & Kings, 2014). It is in protecting information as it flows from the patient to the healthcare provider, from the healthcare provider to the IT solution, and from the IT solution back to the healthcare provider that legal regulations, policies, and governance come into play to protect the patient (O’Driscoll, Daugelaite, & Sleator, 2013). The goal of using the proposed HIMS for these four states is to allow for administrative simplification, lower costs, improved data security, and lower error rates, therefore providing better care to the patients (HIPAA, n.d.).

This proposed solution must follow the Health Information Technology for Economic and Clinical Health (HITECH) Act, which promotes “meaningful use” reporting standards; the Health Insurance Portability and Accountability Act (HIPAA), which promotes data privacy for patients; and the International Organization for Standardization (ISO) 9001 standard, which focuses on quality management (HHS, n.d.; HIPAA, n.d.; McEwen, Boyer, & Sun, 2016; Microsoft, 2016; Nolan, 2015). However, Richards and Kings (2014) argued that in the act of patients disclosing information to their healthcare provider, or in the collection of patient data from IoT solutions, patients lose control of their personal data, yet there is an expectation that the data will remain confidential and not be shared with others. HIPAA (n.d.) allows for disclosure of patient data without authorization if the disclosure deals with treatment, payment, operations, or a subpoena. However, for patient data to be entered into a centralized HIMS, patients must authorize it. McEwen et al. (2016) suggested that data disclosure options be provided to all patients to ensure the best protection and care: open consent (data can be used in the future for a specific purpose or research project), broad consent (data can be used in all cases), or opt-out consent (broad application of the data, but patients can say no to certain cases).

HIPAA describes how healthcare providers, healthcare clearinghouses, and health plans must de-identify 18 key data points to protect the patient: names, geographic data, dates, telephone numbers, vehicle identification numbers (VINs), fax numbers, device IDs and serial numbers, email addresses, URLs, SSNs, IP addresses, medical record numbers, biometric IDs (fingerprints, iris scans, voice prints, etc.), full-face photos, health plan beneficiary numbers, account numbers, any other unique ID numbers (characteristics, codes, etc.), and certification/license numbers (HHS, n.d.; HIPAA, n.d.). If this data is not de-identified properly following the procedures outlined in HIPAA, cyber-criminals can hack into the centralized HIMS for these Four Corners states, re-identify the data, and leak the information to the world, causing defamation, stolen identity, etc. (HIPAA, n.d.). This could be mitigated if ISO 9001 were implemented, because internal audits would be standard practice and would be conducted to ensure quality management of the data and IT system, constant risk assessments would reduce cost, and continual service improvement would drive the system to be proactive rather than reactive to cyber threats (Nolan, 2015). Given the information above, the HIMS would best be suited as a private cloud solution, where the data within it can be seen or used only by the four states (Lau, 2011).
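A minimal sketch of the de-identification step, assuming records arrive as key/value pairs; only a handful of the 18 identifier categories are shown, and a production system would have to cover all 18 and follow HIPAA's full Safe Harbor rules:

```python
# A few of HIPAA's 18 identifier categories, used here as dictionary keys;
# this subset is illustrative, not the complete Safe Harbor list.
IDENTIFIERS = {"Name", "SSN", "Email", "Telephone",
               "MedicalRecordNumber", "IPAddress", "FullFacePhoto"}

def deidentify(record):
    """Return a copy of the record with identifier fields removed."""
    return {k: v for k, v in record.items() if k not in IDENTIFIERS}

raw = {"Name": "Ada Lovelace", "SSN": "000-00-0000",
       "Diagnosis": "hypertension", "HeightCm": 165}
safe = deidentify(raw)
# safe keeps only the clinical fields: Diagnosis and HeightCm
```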

Assumptions and limitations

There is an assumption that many healthcare organizations and bigger hospitals will have their own IT departments, which implement IT solutions that meet the regulations, policies, and governance of the healthcare industry (Microsoft, 2016). Therefore, internal performance and quality management can vary drastically between hospitals and across state boundaries (Nolan, 2015). Also, smaller hospitals and medical facilities may not have the resources for their own IT department; thus, a solution must be devised that is simple, secure, and feasible enough to implement, to help bring them on board with confidence that the proposed solution will fit their needs and provide substantial benefits (Microsoft, 2016). Nolan (2015) proposed that following ISO 9001 standards would allow for uniformity of objectives and methodologies across all hospitals in the four-state region, a reduction in the cost of different training solutions for different hospitals, and greater efficiency in secure, legal data sharing.

Another assumption of a centralized HIMS is that, with enough data gathered from multiple hospitals, health status changes can become predictable, preventable, or manageable, and doing so would be easier, cheaper, and more humane (Flower, n.d.). If patients give broad consent, or even opt-out consent, to the use of their data, then healthcare providers could monitor a patient’s health and be in low-intensity, high-volume, lifelong contact with the patient (Flower, n.d.; McEwen et al., 2016). This will allow patients and healthcare providers to be partners in managing the patient’s personal and family health. Finally, a system like the HIMS could improve the overall health of the population through prediction, prevention, and management of the population’s health (Flower, n.d.).

A limitation of this proposal is the assumption that the data is being cleaned, such that it is reliable and credible. Poor data quality, when used in data mining, machine learning, and data analytics, will affect the results and therefore any data-driven decisions (Corrales, Ledezma, & Corrales, 2015). Data cleaning and preprocessing must be done before modeling, and this requires that IT professionals know and understand the data being collected and integrated from heterogeneous datasets (Corrales et al., 2015; Hendler, 2016).
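A minimal sketch of such a preprocessing step, assuming records arrive as key/value pairs with hypothetical field names; real cleaning pipelines are far more involved:

```python
def clean(records, required=("PatientID", "TimeStamp")):
    """Drop records missing required fields and coerce numeric strings;
    a minimal stand-in for the preprocessing step described above."""
    cleaned = []
    for rec in records:
        if any(rec.get(field) in (None, "") for field in required):
            continue  # incomplete record: exclude it rather than guess
        rec = dict(rec)  # copy so the raw input is left untouched
        for key, value in rec.items():
            if isinstance(value, str) and value.replace(".", "", 1).isdigit():
                rec[key] = float(value)  # e.g. "72" -> 72.0 for analysis
        cleaned.append(rec)
    return cleaned

rows = [{"PatientID": "1", "TimeStamp": "t1", "HeartRate": "72"},
        {"PatientID": "", "TimeStamp": "t2", "HeartRate": "80"}]
cleaned_rows = clean(rows)
# only the complete record survives, with HeartRate coerced to a number
```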

Justification for overall design

This design proposal is for the major corporation that owns a string of state-of-the-art hospitals in the Four Corners area of the United States (Arizona, Colorado, New Mexico, and Utah). The solution proposed is a centralized Healthcare Information Management System (HIMS) to give key stakeholders access to information derived from data that may be hidden in silos and to bring forth the vital information needed for data-driven decisions. In summary, the HIMS was designed:

  • To allow these hospitals to be more agile, responsive, and competitive in the healthcare industry, by centralizing and standardizing the datasets across all four states.
  • To collect, process, and analyze data to deliver key insights quickly and accurately, providing better service to the patients.
  • To allow identification of redundant and duplicate data that can exist when a patient’s data is found in more than one hospital, and to allow that data to be merged into one record.
  • To allow analysis and visualization of new, functional, and experimental data in the context of old existing data, information, and knowledge, so that everyone can view the data they need at the time they need it (Higdon et al., 2016).
  • To allow integration of big internal and external datasets and structured and unstructured data from different sources, through the distributed database system provided by Hadoop’s HDFS (Gary et al., 2005; Hendler, 2016; Hortonworks, 2013; IBM, n.d.).
  • To use Hadoop, a Java-based Platform as a Service (PaaS) cloud solution, which allows for a pay-as-you-go business model in which the healthcare industry in these four states pays only for what it needs at the time (Dikaiakos et al., 2009; Lau, 2011).
  • To allow for a reduction in IT infrastructure costs, with no need to absorb the cost of upgrading the IT infrastructure every 2 to 3 years (Lau, 2011).
  • To allow for the utilization of MapReduce, which maps the data and reduces it using parallel processing to discover hidden insights in the healthcare data (Gary et al., 2005; Hortonworks, 2013; IBM, n.d.).
  • To use GUI systems and forms, so that data that is written or edited can easily be serialized into an XML format that is expandable, consistent across all four states, and compliant with the standards of all four states and the federal government (Font, 2010).
  • To allow for the analysis of data derived from the IoT: a set of large, streaming data sets from multiple devices/sensors dealing with sensitive patient data.
  • To allow for administrative simplification, lower costs, improved data security, and lower error rates, therefore providing better care to the patients (HIPAA, n.d.).

Thus, this design proposal recommends the use of a centralized, distributed database system and the use of Hadoop, such that insights can be garnered and visualized to drive data-driven healthcare decisions and provide improved care for the patients.

References

  • Lublinsky, B., Smith, K., & Yakubovich, A. (2013). Professional Hadoop Solutions. Wrox, VitalBook file.
  • McEwen, J. E., Boyer, J. T., & Sun, K. Y. (2013). Evolving approaches to the ethical management of genomic data. Trends in Genetics, 29(6), 375–382.

Data Tools: Artificial Intelligence and Internet of Things

The future is about the integration and convergence of sensor networks, data analytics, cloud, APIs, and artificial intelligence. The technology trend is toward using devices to make the right decisions based on large amounts of data and to help with daily life and business operations. Thus, how are the Internet of Things and Artificial Intelligence connected and related?

Radio Frequency Identification (RFID) tags are the fundamental technology of the Internet of Things (IoT); they are everywhere and are shipped more frequently than smartphones (Ashton, 2015). The IoT is the explosion of device/sensor data, which is growing the amount of structured data exponentially, with huge opportunities (Jaffe, 2014; Power, 2015). Ashton (2016) analogizes the IoT to fancy windmills, where data scientists and computer scientists take energy and harness it to do amazing things. Newman (2016) stated that there is a natural progression from sensor objects to learning objects, with a final desire to connect all of the IoT into one big network. Essentially, the IoT gives machines senses through devices/sensors (Ashton, 2015).

Artificial Intelligence and the Internet of Things

Analyzing this sensor data to derive data-driven insights and actions is key for companies to extract value from the data they gather from a wide range of sensors. As of 2016, the IoT has two main issues if it is left on its own and not tied to anything else (Jaffe, 2014; Newman, 2016):

  • The devices/sensors cannot deal with the massive amounts of data generated and collected
  • The devices/sensors cannot learn from the data they generate and collect

Thus, artificial intelligence (AI) should be able to store and mine all the data collected from a wide range of sensors to give it meaning and value (Canton, 2016; Jaffe, 2014). The full potential of the IoT cannot be realized without AI or machine learning (Jaffe, 2014). The value derived from the IoT depends on how fast AI, through machine learning, can deliver actionable insights to key stakeholders (Tang, 2016). AI would bring out the potential of the IoT by quickly and naturally collecting, analyzing, organizing, and feeding valuable data to key stakeholders, transforming the field from the standard IoT into the Internet of Learning-Things (IoLT) (Jaffe, 2014; Newman, 2016). Tang (2016) stated that the IoT is limited by how efficiently AI can analyze the data the IoT generates. Given that AI is best suited for frequent, high-volume data (Goldbloom, 2016), AI relies on IoT technology to sustain its learning.
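As a toy illustration of the kind of meaning an algorithm can extract from raw sensor streams, the sketch below flags anomalous readings against a rolling window of recent values; the sensor data, window size, and threshold are all hypothetical, not drawn from any of the cited sources.

```python
import statistics

def detect_anomalies(readings, window=5, threshold=2.0):
    """Flag readings deviating more than `threshold` standard
    deviations from the mean of the preceding `window` values."""
    anomalies = []
    for i in range(window, len(readings)):
        history = readings[i - window:i]
        mean = statistics.mean(history)
        stdev = statistics.stdev(history)
        if stdev and abs(readings[i] - mean) > threshold * stdev:
            anomalies.append(i)
    return anomalies

# Hypothetical temperature readings from a single sensor
temps = [21.0, 21.2, 20.9, 21.1, 21.0, 21.1, 35.5, 21.2, 21.0]
print(detect_anomalies(temps))  # → [6], the spike at index 6
```

Real systems would use far more sophisticated models, but even this sketch shows how a machine can turn raw readings into something actionable.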

Another high-potential use of the IoT with AI is analyzing data-in-motion, that is, analyzing data immediately after collection to identify hidden patterns or meaning and to create actionable data-driven decisions (Jaffe, 2014).
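A minimal sketch of the data-in-motion idea, assuming nothing beyond the Python standard library, is a generator that updates summary statistics the moment each value arrives, rather than waiting to batch-process the full data set:

```python
def running_stats(stream):
    """Yield (count, running mean) as each value arrives,
    without ever storing the full stream."""
    count, mean = 0, 0.0
    for value in stream:
        count += 1
        mean += (value - mean) / count  # incremental mean update
        yield count, mean

# Hypothetical sensor values arriving one at a time
for count, mean in running_stats([10.0, 12.0, 11.0, 13.0]):
    print(count, mean)  # final line: 4 11.5
```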

Connection: One without the other or not?

In summary, AI gives meaning and value to the IoT, and the IoT cannot reach its full potential without AI, since the IoT supplies the huge amounts of frequent data that AI thrives on. It goes without saying that the IoT can be a source of data for AI. However, if there were no IoT, social media could provide AI with the volumes of data it needs to generate insight, albeit different insights would be gained from different sources of voluminous data. Thus, the IoT technology's worth depends on AI, but AI does not depend solely on the IoT.

Resources:

Data Tools: Artificial Intelligence and Data Analytics

The three main ingredients for artificial intelligence are computers, data, and fast algorithms. Big data technologies can turn infinite streams of data into business intelligence and decision-making capabilities. So how does artificial intelligence influence data analytics?

Machine learning, a branch of artificial intelligence (AI), adds an intelligence layer to big data to handle bigger data sets and derive patterns from them that even a team of data scientists would find challenging (Maycotte, 2014; Power, 2015). AI generates its insights not from how machines are programmed, but from how the machines perceive the data and take actions from that perception, essentially conducting self-learning (Maycotte, 2014). Understanding how a machine perceives a big data set is a hard task, which also makes it hard to interpret the resulting final models (Power, 2015). AI is even revolutionizing how we understand what intelligence is (Spaulding, 2013).

So what is intelligence?

At first, doing arithmetic was thought of as a sign of biological intelligence, until the invention of digital computers shifted the marker of biological intelligence to logical reasoning, deduction, and inference, and eventually to fuzzy logic, grounded learning, and reasoning under uncertainty, which is now matched through Bayes nets, probability, and current data analytics (Spaulding, 2013). As humans keep moving the dial on what biological intelligence is toward more complex structures, any form of it that requires high-frequency and voluminous data can be matched by AI (Goldbloom, 2016). Therefore, as our definition of intelligence expands, so will the need to capture intelligence artificially, driving change in how big data sets are analyzed.
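To make the Bayes nets reference concrete, a single Bayes' rule update, with made-up numbers, shows how a machine can "reason under uncertainty" by revising a belief as evidence arrives:

```python
def bayes_posterior(prior, likelihood, false_alarm):
    """P(H|E) via Bayes' rule, given P(H), P(E|H), and P(E|not H)."""
    evidence = likelihood * prior + false_alarm * (1 - prior)
    return likelihood * prior / evidence

# Hypothetical numbers: a detector fires on 90% of true events,
# fires on 5% of non-events, and events occur 10% of the time.
print(round(bayes_posterior(0.10, 0.90, 0.05), 3))  # → 0.667
```

A full Bayes net chains many such updates over a graph of variables, but each step is this same calculation.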

AI on influencing the future of data analytics modeling, results, and interpretation

This concept should help revolutionize how data scientists and statisticians think about which hypotheses to ask, which variables are relevant, how the resulting outputs fit into an appropriate conceptual model, and why the patterns hidden in the data help generate the decision outcomes forecasted by AI (Power, 2015). Making sense of these models requires subject matter experts from multiple fields and multiple levels of the employment hierarchy to analyze the model outputs, because it is through diversity and inclusion of thought that we will understand an AI's analytical insight.

Also, owning data is different from understanding data (Lapowsky, 2014). Thus, AI can make use of data hidden in “dark wells” and silos, where the end user had no idea the data even existed, allowing data scientists to gain a better understanding of their data sets (Lapowsky, 2014; Power, 2015).

AI on generating datasets and using data analytics for self-improvements

Data scientists currently collect, preprocess, process, and analyze big volumes of data regularly to help provide decision makers with insights from the data so they can make data-driven decisions (Fayyad, Piatetsky-Shapiro, & Smyth, 1996). From these data-driven decisions, data scientists then measure the outcomes to prove the effectiveness of their insights (Maycotte, 2014). Analyzing the results of these data-driven decisions allows machine learning algorithms to learn from their decisions and actions and to create better ways of searching for key patterns in bigger, future data sets. This is AI's ability to conduct self-learning based on the results of data analytics, through the use of data analytics (Maycotte, 2014). Meetoo (2016) stated that if there is enough data to create accurate rules, there is enough to create insights, because machine learning can run millions of simulations against itself to generate huge volumes of data from which to learn.

AI on Data Analytics Process

AI is a result of the massive amounts of data being collected, the culmination of ideas from the most brilliant computer scientists of our time, and an IT infrastructure that did not exist a few years ago (Power, 2015). Given that data analytics processes include collecting data, preprocessing data, processing data, and analyzing the results, any infrastructure improvements made for AI can influence any part of the data analytics process (Fayyad et al., 1996; Power, 2015). For example, as AI technology learns how to read raw data and turn it into information, the need for most current preprocessing techniques for data cleaning could disappear (Minelli, Chambers, & Dhiraj, 2013). Therefore, as AI advances, newer IT infrastructures will be dreamt up and built, and data analytics and its processes can leverage this new infrastructure, which can also change how big data sets are analyzed.
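To make concrete the kind of preprocessing that AI might eventually absorb, here is a minimal, hypothetical data-cleaning step of the sort data scientists currently write by hand; the record layout and field names are invented for illustration:

```python
def preprocess(raw_records):
    """Drop records with missing measurements and coerce the
    measurement field to float (a typical hand-written cleaning step)."""
    cleaned = []
    for record in raw_records:
        value = record.get("value")
        if value in (None, "", "N/A"):
            continue  # discard incomplete records
        cleaned.append({"sensor": record["sensor"], "value": float(value)})
    return cleaned

# Hypothetical raw sensor records
raw = [
    {"sensor": "s1", "value": "21.5"},
    {"sensor": "s2", "value": "N/A"},  # dropped
    {"sensor": "s3", "value": 19.0},
]
print(preprocess(raw))
```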

Resources:

Data Tools: Artificial Intelligence and Decision Making

Decision making is related to reasoning: you choose among different alternatives for what you want to do, and the intuitive notion of human free will in choosing between options shapes your reasoning at times. So should artificial intelligence be used as a supporting tool for decision makers or as a replacement for them?

“Machines can excel at frequent high-volume tasks. Humans can tackle novel situations.” – Anthony Goldbloom

Jobs will look drastically different 30 years from now (Goldbloom, 2016; McAfee, 2013). Artificial intelligence (AI) systems work on Sundays, do not take holidays, and perform well at high-frequency, voluminous tasks, and thus they could replace many of the current jobs of 2016 (Goldbloom, 2016; Meetoo, 2016). AI has been doing things that have not been done before: understanding, speaking, hearing, seeing, answering, writing, and analyzing (McAfee, 2013). Also, AI can make use of data hidden in “dark wells” and silos, where the end user had no idea the data even existed (Power, 2015). Eventually, AI and machine learning will be commonly used as tools to augment or replace decision makers. Goldbloom (2016) gave the example that a teacher may read 10,000 essays, or an ophthalmologist may see 50,000 eyes, over a 40-year career, whereas a machine can read millions of essays and see millions of eyes in minutes.

Machine learning is one of the most powerful branches of AI, in which machines learn from data, similar to how humans learn, to create predictions of the future (Cringely, 2013; Cyranoski, 2015; Goldbloom, 2016; Power, 2015). It would take many scientists to analyze a big data set in its entirety, without memory loss, to gain the same insights and fully understand how the connections were made in the AI system (Cringely, 2013; Goldbloom, 2016). This is no easy task, because the eerily accurate rules AI creates out of thousands of variables can lack substantive human meaning, making it hard to interpret the results and make an informed data-driven decision (Power, 2015).

AI has already been used to solve problems in industry and academia, which has given data scientists knowledge of AI's current limitations and of whether it can augment or replace key decision makers (Cyranoski, 2015; Goldbloom, 2016). Machine learning and AI do well at analyzing patterns in frequent and voluminous data at faster speeds than humans, but they fail to recognize patterns in infrequent and small amounts of data (Goldbloom, 2016). Therefore, for small data sets artificial intelligence will not be able to replace decision makers, but for big data sets it could.

Thus, the fundamental question decision makers need to ask is how much of a decision reduces to frequent, high-volume tasks and how much reduces to novel situations (Goldbloom, 2016). If the ratio is skewed toward high-volume tasks, then AI could be a candidate to replace decision makers; if the ratio is evenly split, then AI could augment and assist decision makers; and if the ratio is skewed toward novel situations, then AI would not help decision makers. These novel situations are equivalent to our tough challenges today (McAfee, 2013).
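Goldbloom's (2016) ratio can be sketched as a simple decision rule; the cutoff values below are illustrative assumptions, not figures from the source:

```python
def ai_role(high_volume_fraction):
    """Crude heuristic: the larger the share of a decision that is a
    frequent, high-volume task, the larger the role AI can play."""
    if high_volume_fraction >= 0.7:
        return "candidate for replacement"
    if high_volume_fraction >= 0.3:
        return "augment the decision maker"
    return "little help; mostly novel situations"

print(ai_role(0.8))  # → candidate for replacement
```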

Finally, Meetoo (2016) warned that it does not matter how intelligent or strategic a job may be: if there is enough data on that job to create accurate rules, it can be automated as well, because machine learning can run millions of simulations against itself to generate huge volumes of data to learn from. This is no different from humans doing self-study and continuous practice to become subject matter experts in their fields. But a background in STEAM (Science, Technology, Engineering, Arts, and Math) will best equip people for a future world with AI, because it is from knowing how to combine these fields that the novel, infrequent, and unique challenges will arise that humans can solve and machine learning cannot (Goldbloom, 2016; McAfee, 2013; Meetoo, 2016).

Resources:

10 Data Visualization Tools

There are many tools used in today’s market to present data analytics information. Many of these tools are great for particular presentation types. This post will list 10 available big data visualization tools in today’s market.

Data Visualization Tools

There is no shortage of data analytics tools that deal with the entire process from generation to visualization; this infrastructure is shown in Figure 1. According to Truck (2016), the primary data analytics visualization tools are Tableau, Google Cloud Platform, Qlik, Looker, RoamBI, Chartio, Datorama, Zoomdata, Sisense, and Zeppelin. Machlis (2011) lists 22 different data visualization tools: R, DataWrangler, Google Refine, Google Fusion Tables, Impure, Tableau Public, Many Eyes, VIDI, Zoho Reports, Choosel, Exhibit, Google Chart Tools, JavaScript InfoVis Toolkit, Protovis, Quantum GIS (QGIS), OpenHeatMap, OpenLayers, OpenStreetMap, TimeFlow, IBM Word-Cloud Generator, Gephi, and NodeXL. Jones (2014) listed a top 10: Tableau Public, OpenRefine, KNIME, RapidMiner, Google Fusion Tables, NodeXL, Import.io, Google Search Operators, Solver, and WolframAlpha. The California HealthCare Foundation (CHCF, 2014) recommended data visualization tools that everyone in the healthcare industry could use: Google Charts & Maps, Tableau Public, Mapbox, Infogram, Many Eyes, iCharts, and Datawrapper. CHCF (2014) also recommended data visualization tools for developers in the healthcare industry, such as High Charts, TileMill, D3.js, FLOT, Fusion Charts, OpenLayers, and JSMap. These four cases all show that, no matter which data visualization software is discussed here, a plethora of others exists, and there is currently no authoritative source listing all of them. This discussion is not trying to compile a comprehensive or authoritative list either.


Figure 1: Big Data Landscape 2016 which categorizes big data tools and applications by Infrastructure, Analytics, Application, Open Source, Data Sources & APIs, Incubators & Schools, and Cross-Infrastructure/Analytics. (Adapted from Truck, 2016).

Ten Data Visualization Tools and their strengths and weaknesses

Based on the subject matter experts above, the following ten visualization tools will be discussed: Tableau Desktop & Tableau Public, Google Fusion Charts, OpenLayers, Chartio, Datorama, Zoomdata, NodeXL, Qlik, Looker, and RoamBI.

Tableau Desktop & Tableau Public

Tableau Desktop is a $1,000-1,200 product, whereas Tableau Public is free; both are marketed as end-user interactive business intelligence software that helps surface insights hidden in the data (Jones, 2014; Machlis, 2011; Phillipson, 2016; Tableau, n.d.). Tableau is touted to be 10-100x faster than most other commercial visualization software through its intuitive, no-coding, drag-and-drop products (Tableau, n.d.). Tableau can take in data from Excel spreadsheets, Hadoop, the cloud, etc. and bring them together for comprehensive data analysis (Jones, 2014; Phillipson, 2016; Tableau, n.d.). The strength of Tableau Public is that all the functionality of Tableau Desktop is provided for free; however, any data stored in Tableau Public is made freely available to others in the community (Jones, 2014; Machlis, 2011). If data privacy is sought, Tableau Desktop allows the data scientist to analyze the data locally, without sharing key information with the world, but at a price (Machlis, 2011; Tableau, n.d.).

Google Fusion Charts

Fusion Charts is a web-based tool accessible to anyone with a Google Drive account, and it allows control over many aspects of the data visualizations: data scientists can limit the amount of data shown, summarize the data, choose from different chart types, and customize legends without needing to know how to code (Google, n.d.; Machlis, 2011). Jones (2014) calls it “Google Spreadsheets’ cooler, larger, and much nerdier cousin.” Data can be found through the Google search engine or imported quickly from CSV, TSV, UTF-8 encoded files, etc. (Google, n.d.; Jones, 2014). Data can even be exported into JSON files, and all the data can be analyzed in private or released to the public (Machlis, 2011). Interactive charts provided by Fusion include network graphs, zoomable line charts, map charts, heat maps, timelines, storylines, animations, pie charts, tables, scatter plots, etc. (Google, n.d.; Machlis, 2011). The downsides of this tool are how tedious it can become to edit multiple cell entries, the quite limited customizations, and, for large data, an API that can demand a ton of resources and slow down execution (Machlis, 2011).

OpenLayers

OpenLayers is an open source JavaScript library for displaying geolocation mapping data that is easily customizable and extendable using cutting-edge tiled or vector layer mapping formats (Machlis, 2011; OpenLayers, n.d.). Some of the maps that can be created include animated maps, blending, attribution, cluster features, integration with Bing Maps, D3 integration, drag-and-drop interaction, dynamically added data, etc. (OpenLayers, n.d.). One drawback is that it requires a fair amount of JavaScript coding skill, and certain integrations with popular maps are still under development; however, it can run in any web browser (Machlis, 2011).

Chartio

Chartio is a software-as-a-service visual query tool that pulls and joins data from multiple sources easily, without knowledge of SQL (Rist & Strom, 2016). Chartio can process the data into visualizations to aid in building a case with data-driven analytics and dashboards, all for $2,000 (Chartio, n.d.; Rist & Strom, 2016). Chartio's commercial product can pull data from Amazon RDS, Cassandra, CSV files, DB2, Google Cloud SQL, Google BigQuery, Hadoop, MongoDB, Oracle, Rackspace Cloud, Microsoft SQL Server, Windows Azure Cloud, etc. (Chartio, n.d.; Rist & Strom, 2016). Unfortunately, the user interface is poorly designed and has a steeper learning curve than other data visualization tools (Rist & Strom, 2016). Chartio (n.d.) boasts that connecting to any of the databases above requires just two terminal commands and that the data pulled from these databases is read-only to protect the data. However, Rist and Strom (2016) had initial problems uploading data onto the tool, mostly due to the responsiveness of the API.

Datorama

Datorama, from an Israel-based company, is a cloud-based system and tool for marketing analytics (Gilad, 2016). Data sources that Datorama uses can come from Facebook, Google, ad exchanges, networks, direct publisher sites, affiliate programs, etc., and the tool can visually demonstrate and monetize the marketing data (Datorama, n.d.; Gilad, 2016). Datorama allows multi-level authentication for advanced security (Phillipson, 2016). According to Datorama (n.d.), the tool allows comparisons between online and offline marketing analyses on a single dashboard. Unfortunately, marketing/sales data is the primary use of this tool, and other tools exist that analyze marketing/sales data and much more (Gilad, 2016). To learn the cost of this software, one must obtain a quote (Phillipson, 2016).

Zoomdata

Zoomdata is an intuitive and collaborative way to visualize data, built with HTML5, JavaScript, WebSockets, and CSS, plus expandable libraries such as D3, Leaflet, NVD3, etc. (Zoomdata, n.d.). Graphing features include dynamic dashboards with drill-down capabilities on tabular data, geodata, pie charts, line graphs, scatter plots, bar charts, stacked bar charts, etc. (Darrow, 2016; Zoomdata, n.d.). Zoomdata allows web-browser and touch-oriented analysis and can handle real-time data streams and billions of rows of data (Zoomdata, n.d.). The downside is that this software as a service is commercial, which can set you back $1.91/hour (Darrow, 2016). However, Zoomdata can connect to Hadoop, Cloudera, MongoDB, Amazon, NoSQL, MPP, and SQL databases, cloud applications, etc. (Darrow, 2016; Zoomdata, n.d.).

NodeXL

NodeXL Basic is an open source Microsoft Excel 2007-2016 plug-in that makes it easy to graph and explore network graphs and relationships by entering network edge lists (Jones, 2014; Machlis, 2011; NodeXL, 2015). NodeXL Pro ($29/year-$749/year) offers features beyond the basic version, like dealing with data streams for social networks, text and sentiment analysis, etc. (NodeXL, 2015). Data pulled from Facebook, Flickr, YouTube, LinkedIn, and Twitter can be represented through this tool (Jones, 2014; Machlis, 2011). Graph metrics like degree, closeness centrality, PageRank, clustering, and graph density are all available in NodeXL (Jones, 2014; NodeXL, 2015). Editing the appearance of the graphs, such as color, shape, size, label, and opacity, can be done in both versions (NodeXL, 2015). Unfortunately, the tool is limited mostly to network analysis (Jones, 2014; Machlis, 2011; NodeXL, 2015).
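As a rough illustration of what an edge list encodes, the snippet below computes node degree, one of the graph metrics NodeXL reports, from a small hypothetical list of edges:

```python
def degree_counts(edge_list):
    """Count each node's degree from an undirected edge list,
    the kind of input NodeXL graphs are built from."""
    degrees = {}
    for a, b in edge_list:
        degrees[a] = degrees.get(a, 0) + 1
        degrees[b] = degrees.get(b, 0) + 1
    return degrees

# Hypothetical social-network edges
edges = [("alice", "bob"), ("alice", "carol"), ("bob", "carol")]
print(degree_counts(edges))  # → {'alice': 2, 'bob': 2, 'carol': 2}
```

Metrics like PageRank or closeness centrality build on this same edge-list representation, just with more involved calculations.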

Qlik

Qlik is a free self-service data visualization tool that lets you create dynamic and interactive visualizations while keeping the data on your desktop, without having to release it to the public (Machlis, 2015; Qlik, n.d.). It is free for both personal and internal business use (Qlik, n.d.). Unfortunately, it is not easy to share data or visualizations with peers, though Qlik does allow sharing data with up to five people privately through its cloud services (Machlis, 2015). Qlik allows integration, without a data warehouse, with data sources like Hadoop, Microsoft Excel, LinkedIn, Twitter, Facebook, the cloud, databases, etc. (Qlik, n.d.). Though there is a learning curve to this software, it is not insurmountable, and a user can quickly learn how to build basic graphs with multiple filters (Machlis, 2015).

Looker

Looker aims to be a data visualization and exploration tool usable by many people, removing the data analytics bottleneck caused by data scientists controlling all the data (Looker, n.d.). Its data models help define all the measures and dimensions behind the data (Software Advice, n.d.). The tool allows for data models, custom metrics, real-time analysis, and blending between data sets to produce drill-down dashboards with the basic charts, graphs, and maps (Looker, n.d.; Software Advice, n.d.). Data inputs can come from commercial off-the-shelf products like Salesforce or from internally created software and applications (Looker, n.d.). According to one customer, the documentation lags behind the software, making certain tasks hard, and another customer says that once a data model is in the application, it becomes hard to edit (Software Advice, n.d.).

RoamBI

RoamBI is a data visualization tool that can be taken anywhere; it is built primarily for mobile devices and can include data from Microsoft Excel, CSV data, SQL Server, Cognos, Salesforce, SAP, Box data, etc. (Bigelow, 2016; MeLLmo Inc., n.d.). It has been designed for mobile devices to allow data sharing, exploration, and presentation (MeLLmo Inc., n.d.). It is so popular that all ten major pharmaceutical companies use RoamBI on their iPads (Bigelow, 2016). Visualizations capitalize on tabular data, spark-lines, bar charts, line charts, stacked bar charts, pie charts, bubble charts, KPIs, etc., all on a dashboard, but they are not customizable, and reporting dimensions are limited (Authors, 2010; MeLLmo Inc., n.d.). The free version of the application allows localized data to be uploaded and used, whereas the Pro version ($99/year or $795 perpetual) allows data connections from online sources (Authors, 2010).

 

In the end, each of the ten data visualization tools has its advantages and disadvantages, along with different price points. The best way to select the right tool is to know one's data visualization needs and to compare these and other tools against those needs. The tool that meets most or all of the needs should then be selected.

Resources: