Building block system of health care data analytics

Building block system of healthcare big data analytics involves a few steps Burkle et al. (2001):

  • What is the purpose that the new data will and should serve
    • How many functions should it support
    • Marking which parts of that new data is needed for each function
  • Identify the tool needed to support the purpose of that new data
  • Create a top level architecture plan view
  • Building based on the plan but leaving room to pivot when needed
    • Modifications occur to allow for the final vision to be achieved given the conditions at the time of building the architecture.
    • Other modifications come under a closer inspection of certain components in the architecture

With big data analytics in healthcare, parallel programming and distributive programming are part of the solution to consider in building a network cluster (Mirtaheri, Khaneghah, Sharifi, & Azgomi, 2008; Services, 2015). Distributed programming allows for connecting multiple computer resources distributed across different locations (Mirtaheri et al., 2008). Programming that is maximizing the connections for processing or accessing data that is distributed across the computational resources is considered as parallel programming (Mirtaheri et al., 2008; Saden, 2011).  Burkle et al. (2001) explained how they used the building block design for a DNA network cluster build a system to help classify/predict what genome they are analyzing to a pathogen and understand which part of the genome found in many pathogens may be immune to certain treatments:

  • Tracing data from sequencer (*.esd file)
    • Base caller
  • “Raw” sequence (*.scf file)
    • Edit and assemble and export genome assembly for research
  • “Clean” sequence
    • External references are called in and outputted from reference databases
  • Genus and species are identified
  • Completed results are calling and outputting into an attributed local database

The process above flow for sequencing genomic data is part of the top-level plan that was modified as time went by, thus step four of the building blocks process.

Now, let’s consider using the building blocks system for healthcare systems, on a healthcare problem that wants to monitor patient vital signs similar to Chen et al. (2010).

  • The purpose that the new data will serve: Most hospitals measure the following vitals for triaging patients: blood pressure and flow, core temperature, ECG, carbon dioxide concentration (Chen et al. 2010).
    1. Functions should it serve: gathering, storing, preprocessing, and processing the data. Chen et al. (2010) suggested that they should also perform a consistency check, aggregating and integrate the data.
    2. Which parts of the data are needed to serve these functions: all
  • Tools needed: distributed database system, wireless network, parallel processing, graphical user interface for healthcare providers to understand the data, servers, subject matter experts to create upper limits and lower limits, classification algorithms that used machine learning
  • Top level plan: The data will be collected from the vital sign sensors, streaming at various time intervals into a central hub that sends the data in packets over a wireless network into a server room. The server can divide the data into various distributed systems accordingly. A parallel processing program will be able to access the data per patient per window of time to conduct the needed functions and classifications to be able to provide triage warnings if the vitals hit any of the predetermined key performance indicators that require intervention by the subject matter experts.  If a key performance indicator is sparked, send data to the healthcare provider’s device via a graphical user interface.
  • Pivoting is bound to happen; the following can happen
    1. Graphical user interface is not healthcare provider friendly
    2. Some of the sensors need to be able to throw a warning if they are going bad
    3. Subject matter experts may need to readjust the classification algorithm for better triaging

Thus, the above problem as discussed by Chen et al. (2010), could be broken apart to its building block components as addressed in Burkle et al. (2011).  These components help to create a system to analyze this set of big health care data through analytics, via distributed systems and parallel processing as addressed by Services (2015) and Mirtaheri et al. (2008).


  • Burkle, T., Hain, T., Hossain, H., Dudeck, J., & Domann, E. (2001). Bioinformatics in medical practice: what is necessary for a hospital?. Studies in health technology and informatics, (2), 951-955.
  • Chen, B., Varkey, J. P., Pompili, D., Li, J. K., & Marsic, I. (2010). Patient vital signs monitoring using wireless body area networks. In Bioengineering Conference, Proceedings of the 2010 IEEE 36th Annual Northeast (pp. 1-2). IEEE.
  • Mirtaheri, S. L., Khaneghah, E. M., Sharifi, M., & Azgomi, M. A. (2008). The influence of efficient message passing mechanisms on high performance distributed scientific computing. In Parallel and Distributed Processing with Applications, 2008. ISPA’08. International Symposium on (pp. 663-668). IEEE.
  • Sandén, B. I. (2011-01-14). The design of Multithreaded Software: The Entity-Life Modeling Approach, 1st Edition. [Bookshelf Online].
  • Services, E. E. (2015). Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data, 1st Edition. [Bookshelf Online].

Cloud computing and big data

High-performance computing is where there is either a cluster and grid of servers or virtual machines that are connected by a network for a distributed storage and workflow (Bhokare et al., 2016; Connolly & Begg, 2014; Minelli, Chamber, & Dhiraj, 2013). Parallel computing environments draw on the distributed storage and workflow on the cluster and grid of servers or virtual machines for processing big data (Bhokare et al., 2016; Minelli et al., 2013). NoSQL databases have benefits as they provide a data model for applications that require a little code, less debugging, run on clusters, handle large scale data, stored across distributed systems, use parallel processing, and evolve with time (Sadalage & Fowler, 2012).  Cloud technology is the integration of data storage across a distributed set of servers or virtual machines through either traditional relational database systems or NoSQL database systems while allowing for data preprocessing and processing through parallel processing (Bhokare et al., 2016; Connolly & Begg, 2014; Minelli et al., 2013; Sadalage & Folwer, 2012).

Clouds can come in different flavors depending on how much the organization and supplier want to manage: Infrastructure as a Service, Platform as a Service, and Software as a Service (Connolly & Begg, 2014).  Thus, this makes the enterprise IT act as a broker across the various cloud options.  Also, analyzing exactly how and where data are stored to ensure it complies with various national and international data rules and regulations while preserving data privacy exist with the type of cloud use: public, community, private and hybrid clouds (Minelli et al. 2013; Conolloy & Begg, 2014).

Public cloud environments are where a supplier to a company provides a cluster or grid of servers through the internet like Spark AWS, EC2 (Connolly & Begg, 2014; Minelli et al. 2013).  Cloud computing can be thought of as a set of building blocks.  The company can grow or shrink a number of servers and services when needed dynamically, which allows the company to request the right amount of services for their data collection, storage, preprocessing, and processing needs (Bhokare et al., 2016; Minelli et al., 2013; Sadalage & Fowler, 2012).  This allows for the company to purchase the services it needs, without having to purchase the infrastructure to support the services it might think it will need. This allows for hyper-scaling computing in a distributed environment, also known as hyper-scale cloud computing, where the volume and demand of data explode exponentially yet still be accommodated in public, community, private, or hybrid cloud in a cost efficiently (Mainstay, 2016; Minelli et al., 2013).

Data storage and sharing are a key component of using enterprise public clouds (Sumana & Biswal, 2016).  However, it should be noted that the data is stored in the public cloud is stored on the same servers as probably the company’s competitors, so data security is an issue. Sumana and Biswal (2016) proposed that a key aggregate cryptosystem to be used, where the enterprise holds the master key for all its enterprise files, whereas going a deep layer users can have other data encrypted to send within the enterprise, without needing to know the enterprise file key. This proposed solution for data security in a public cloud allows for end-user registration, end-user revocation, file generation and deletion, and file access and traceability.

A community cloud environment is a cloud that is shared exclusively by a set of companies that share the similar characteristics, compliance, security, jurisdiction, etc. (Connolly & Begg, 2014). Thus, the infrastructure of all of these servers and grids meet industry standards and best practices, with the shared cost of the infrastructure is maintained by the community.

Private cloud environments have a similar infrastructure to a public cloud, but the infrastructure only holds the data one company exclusively, and its services are shared across the different business units of that one company (Connolly & Begg, 2014; Minelli et al., 2013). An organization may have all the components already to build a cloud through various on-premise computing resources and thus tend to build a cloud system using open source code on their internal infrastructure; this is called an on-premise private cloud (Bhokare et al., 2016). The benefit of the private cloud is full control of your data, and the cost of the servers are spread across all the business units, but the infrastructure costs (initial, upgrades, and maintenance costs) are in the company.

Hybrid clouds are two or more cloud structures that have either a private, community or public aspect to them (Connolly & Begg, 2014).  This allows for some data to be retained in the house if need be, and reducing the size of capital expenditure for the internal cloud infrastructure, while other data is stored externally where the cost of the infrastructure is not directly felt by the organization.


  • Bhokare, P., Bhagwat, P., Bhise, P., Lalwani, V., & Mahajan, M. R. (2016). Private Cloud using GlusterFS and Docker.International Journal of Engineering Science5016.
  • Connolly, T., Begg, C. (2014). Database Systems: A Practical Approach to Design, Implementation, and Management, (6th). Pearson Learning Solutions. [Bookshelf Online].
  • (2016). An economic study of the hyper-scale data center. Mainstay, LLC, Castle Rock, CO, the USA, Retrieved from transforming-the-economics-of-data-center
  • Minelli, M., Chambers, M., &, Dhiraj, A. (2013). Big Data, Big Analytics: Emerging Business Intelligence and Analytic Trends for Today’s Businesses. John Wiley & Sons P&T. [Bookshelf Online].
  • Sadalage, P. J., Fowler, M. (2012). NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence, [Bookshelf Online].
  • Sumana, P., & Biswal, B. K. (2016). Secure Privacy Protected Data Sharing Between Groups in Public Cloud.International Journal of Engineering Science3285.

Column-oriented NoSQL databases

NoSQL (Not only Structured Query Language) databases are databases that are used to store data in non-relational databases i.e. graphical, document store, column-oriented, key-value, and object-oriented databases (Sadalage & Fowler, 2012; Services, 2015). Column-oriented databases are perfect for sparse datasets, ones with many null values and when columns do have data the related columns are grouped together (Services, 2015).  Grouping demographic data like age, income, gender, marital status, sexual orientation, etc. are a great example for using this NoSQL database. Cassandra, which is a column-oriented NoSQL database focuses on availability and partition tolerance, this means that as an AP system it can achieve consistency if data can be replicated and verified (Hurst, 2010).

Cassandra has been assessed for performance evaluation against other NoSQL databases like MongoDB and Raik for health care data analytics (Weider, Kollipara, Penmetsa, & Elliadka, 2013).  In this study, NoSQL database demands for health care data were two-fold:

  • Read/write efficiency of medical test results for a patient X (Availability)
  • All medical professionals should see the same information on patient X (Consistency)

A NoSQL graph database did not have the fit to use for the above demands, thus wasn’t part of this study.

The architecture of this project: nine partition nodes, where three by three nodes were used to mimic three data centers that would be used by 100 global health facilities, where data is generated at a rate of 1TB per month and must be kept for 99 years.

The dataset used in this project: a synthetic dataset that has 1M patients with 10M lab reports, averaging at seven lab reports per person, but randomly distributed of from 0-20 lab reports per person.

In meeting both of these two demands, Cassandra had a significantly higher throughput value than the other two NoSQL databases. Cassandra’s EACH_QUORUM write and LOCAL_QUORUM read options are part of their datacenter aware system, providing the great throughputs results, using the three synthetic datacenters. Testing consistency, by using Cassandra’s ONE for its write and read options at an eventual rate (slower consistency) or strong rate (faster consistency), shows that throughput increases with the eventual system. The choice to use either rate rests with the healthcare stakeholders.

The authors concluded that for their system and their requirements Cassandra had the highest throughput regardless of the level of consistency rates (Weider et al., 2013).  They also suggested that each of these tests should be adjusted based on the requirements from key stakeholders in the healthcare profession and that a small variation in the data model could change the results seen here.

In conclusion of this post, NoSQL databases provide huge advantages to data analytics over traditional relational database management systems. But, NoSQL databases must fit the needs of the stakeholders, and quantitative tests must be thoroughly designed to assess which NoSQL database will meet those needs.


  • Hurst, N. (2010). Visual guide to NoSQL systems. Retrieved from
  • Sadalage, P. J., Fowler, M. (2012). NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence, 1st Edition. [Bookshelf Online].
  • Services, E. E. (2015). Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data, 1st Edition. [Bookshelf Online].
  • Weider, D. Y., Kollipara, M., Penmetsa, R., & Elliadka, S. (2013, October). A distributed storage solution for cloud based e-Healthcare Information System. In e-Health Networking, Applications & Services (Healthcom), 2013 IEEE 15th International Conference on (pp. 476-480). IEEE.

Graphical NoSQL Databases

There is a lot of complicated connections between patients, their provider, their diagnoses, etc., and graphically representing this relationship data is one of the main highlights of using a NoSQL graph database (Park, Shankar, Park, & Ghosh, 2014). NoSQL (Not only Structured Query Language) databases are databases that are used to store data in non-relational databases i.e. graphical, document store, column-oriented, key-value, and object-oriented databases (Sadalage & Fowler, 2012; Services, 2015). Graph NoSQL databases are used drawing networks by showing the relationship between items in a graphical format that has been optimized for easy searching and editing (Services, 2015). Each item is considered a node and adding more nodes or relationships while traversing through them is made simpler through a graph database rather than a traditional database (Sadalage & Fowler, 2012). Some sample graph databases consist of Neo4j Pregel, etc. (Park et al., 2014).

Case Study: Graph Databases for large-scale healthcare systems: A framework for efficient data management and data services (Park et al., 2014)

Driver for data analytics needs: Finding areas for cost savings through anomaly detection algorithms, because currently there are a bunch of individual tables and non-normalized data that are replicated multiple times which is causing bottlenecks.

Problem: Understanding and establishing relationships between self-referrals and shared providers, which allows for the use of a collaborative filter.

System Needs: Data management needs error-tolerant and non-redundant database system, while data services need data retrieval, analytics queries, statistical data extraction and mining algorithms.

NoSQL Database used: Neo4J graph NoSQL Database using Cypher query to keep the data normalized and reduce the number of individual tables of data due to the advanced yet simple query capabilities

Methodology: Using the 3EG: 3NF Equivalent Graph Transformation algorithm to convert traditional relational database data into graph database data on realistic synthetic healthcare data.  The synthetic healthcare data consists of zip-codes, diagnosis of disease, available procedures, beneficiary, claim, and providers. The data when flattened can showcase 1 M beneficiaries to 100 K providers, but in a graphical format, that same data will have 51 M nodes and 257 M relationships.

Queries Ran on the NoSQL Database:

  • Shared providers between two beneficiaries
  • Shared providers between two beneficiaries through either actual visits or by referrals
  • List of shared diseases between two beneficiaries through their claim records
  • Any link between two beneficiaries à helps to direct further investigations/queries
  • Shared beneficiaries between two providers
  • Self-referred beneficiaries for a given provider
  • Similar claims based on diagnoses codes
  • Patient wants to switch to a new provider based on a referral by another provider

Using 50 random queries for each of the 8 cases above: the time it took to run the first three cases was faster in a MySQL query, but by less than 0.0X seconds, whereas the last 5 cases the NoSQL was faster ranging from 0.5-40 seconds.  As data size grew so did the processing time for the last five cases on MySQL grew.

Conclusions: The authors were able to show that with more highly advanced cases, MySQL takes more time than NoSQL. Thus, for big data analytics, NoSQL graph databases can help store dynamic relationship data as well as process more complex queries using fewer lines of code and faster than MySQL queries.  This style of storing data allows the end-user in the healthcare field to ask more complex questions and get those answers promptly.


  • Park, Y., Shankar, M., Park, B. H., & Ghosh, J. (2014). Graph databases for large-scale healthcare systems: A framework for efficient data management and data services. In Data Engineering Workshops (ICDEW), 2014 IEEE 30th International Conference on (pp. 12-19). IEEE.
  • Sadalage, P. J., Fowler, M. (2012). NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence, 1st Edition. [Bookshelf Online].
  • Services, E. E. (2015). Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data, 1st Edition. [Bookshelf Online].

Document store NoSQL databases

NoSQL (Not only Structured Query Language) databases are databases that are used to store data in non-relational databases i.e. graphical, document store, column-oriented, key-value, and object-oriented databases (Sadalage & Fowler, 2012; Services, 2015). NoSQL databases have benefits as they provide a data model for applications that require a little code, less debugging, run on clusters, handle large scale data and evolve with time (Sadalage & Fowler, 2012). Document store NoSQL databases, use a key/value pair that is the file/file itself, and it could be in JSON, BSON, or XML (Sadalage & Fowler, 2012; Services, 2015).  These document files are hierarchical trees (Sadalage & Fowler, 2012).

Parts of the documents could be updated in real-time this type of NoSQL database allows for easy creation and storage of dynamic data like website page views, unique views, or new metrics (Sadalage & Fowler, 2012).  To help speed up the search of a document store NoSQL database like content in multiple web pages, or store log file, indexes can be created (Services, 2012). These indexes could be stored as attributes, such as a “state,” “city,” “zip-code,” etc. attributes, which can have the same, different, or null values in the NoSQL database and it each of these is allowed (Sadalage & Fowler, 2012).

If you want to insert, update, or delete (also known as a transaction) data in a NoSQL database, it will either succeed or fail, it won’t have the ability as traditional databases to either commit or rollback (Sadalage & Fowler, 2012). Only two of the three features can exist according to CAP Theory (Consistency, Availability, and Partition Tolerance), and document store databases primarily focus on availability through replicating data in different nodes (Hurst, 2010; Sadalage & Fowler, 2012).  Some key players in the document store database realm are CouchDB, MongoDB, OrientDB, RavenDB, and Terrastore (Sadalage & Fowler, 2012).  This discussion will focus on both CouchDB and MongoDB; which are open-sourced code that allows having scalability features (CouchDB, n.d.; MongoDB, n.d.; Sadalage & Fowler, 2012).

CouchDB is an Apache code available for Windows, Linux, and Mac OS X and it is also:

  • AP database system (Hurst, 2010)
  • AP systems can achieve consistency if data can be replicated and verified (Hurst, 2010)
  • Globally distributed server cluster to allow for accessing data and implementing projects anywhere through a data replication protocol (CouchDB, n.d.)
  • Data can be stored on a single or clustered server, via locally on the company’s servers, virtual machines, Raspberry Pi servers, or on a cloud provider (CouchDB, n.d.)
  • Allows for offline end user experience (CouchDB, n.d.)
  • Can use MapReduce for deriving insights from the data (CouchDB, n.d.)
  • Uses HTTP protocol and JSON data (CouchDB, n.d.)
  • Only allowing for appending data helps create a crash-resistant data structure (CouchDB, n.d.)

MongoDB is code available for Windows, Linux, Mac OS X, Solaris, etc.:

  • CP database system (Hurst, 2010)
  • CP systems have issues keeping data available across all nodes through their replication system (Hurst, 2010; Sadalage & Fowler, 2012)
  • Used by companies like Expedia, Forbes, Bosch, AstraZeneca, MetLife, Facebook, Urban Outfitters, sprinklr, the guardian, Comcast, etc., such that 33% of the Fortune 100 are using it (MongoDB, n.d.)
  • Has an expressive query language and secondary indexes out of the box to help access and understand data stored within its database, which is easier to use and requires fewer lines of code (MongoDB, n.d.; Sadalage & Fowler, 2012)
  • Allows for a flexible data model that evolves with time as the data stored in it evolves (MongoDB, n.d.)
  • Allows for integration of silo, internet of things, mobile, catalog data to help provide real-time analytics (MongoDB, n.d.)


  • CouchDB (n.d.). CouchDB, relax. Apache. Retrieved from
  • Hurst, N. (2010). Visual guide to NoSQL systems. Retrieved from
  • MongoDB (n.d.). MongoDB, for giant ideas. Retrieved from
  • Sadalage, P. J., Fowler, M. (2012). NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence, 1st Edition. [Bookshelf Online].
  • Services, E. E. (2015). Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data, 1st Edition. [Bookshelf Online].

Case study: Health data analysis

Article: Community Health Map: A geospatial and multivariate data visualization tool for public health datasets

Year: 2012

Authors: Awalin Sopan a, Angela Song-Ie Noh b, Sohit Karol c, Paul Rosenfeld d, Ginnah Lee d, Ben Shneiderman a

Research positions:

a Human–Computer Interaction Lab, Department of Computer Science, University of Maryland

b Department of Computer Science, University of Maryland

c Department of Kinesiology, University of Maryland

d Department of Electrical and Computer Engineering, University of Maryland


This study is trying to evaluate the usability of “Community Health Map” (a design study), which is a multivariate and geospatial data visualization tool, to understand health care quality, public health outcomes and access to healthcare. The Community Health Map web application’s target audience is for policymakers at all levels of the government, where multivariate data is dynamically queried, filtered, mapped, tabled, and charted.  This study also made simple descriptive data analytical results about health care quality, access, and public health.


What is driving this research, is to understand where are the financial wastes of Medicare dollars spent across multiple states and Hospital Referral Regions (HRR).  Understanding financial waste is done by filtering on demographic data provided by the Open Government Directive, and health performance data by the Department of HHS and other healthcare industry data sets. Thus, this prototyping/design study of the Community Health Map application incorporated 11 health care variables to understand health care: quality, accessibility and public health.

  • Quality variables: life expectancy, infant mortality, self-reported fair/poor overall health, average numbers of unhealthy days.
  • Access variables: percentage of uninsured, physicians per 100K
  • Public Health variables: smoking rate, lack of physical activity, obesity, nutrition, flu vaccinations (ages 65+)

Methods and techniques used:

This study used descriptive analytics to pilot the Community Health Map.  Each of the three categories is colored differently, green for quality, orange for access, and blue for public health.  Medium Income, Poverty Rate, Percent Over Age 65, and Percent with a bachelor degree is assessed for dynamic filters.

Finally, this study is analyzing the usability of the Community Health Map, by giving people a four-minute tutorial video and 20 minutes to perform five tasks that involve: dynamic filtering, mapping, tabling and charting.  Each participant was encouraged to think out loud


Data is stored in different geospatial ways; some are by counties, zip codes, HRR, etc. and they first had to be pre-processed to a centralized unit.  There had to be assumptions placed and interpolation of geospatial data to drive the results seen in this study.

Not explicitly stated in this study, is that people feel comfortable to completely think out loud and provide the researchers with untethered accesses to their thoughts.  There may be some thoughts that are kept to themselves.  This also assumes that explicitly stated thoughts are valuable and the implicitly stated thoughts are not valuable to this study.

Also not explicitly stated in this study, was how usability is defined by those applicants that can perform the task.  There was no mention that the usability was fit for design for people with disabilities.  In general people with disabilities may be able to provide universal design suggestions that can help improve the usability of the product.  But, given the lack of mention of this, may suggest that the participants didn’t have a known disability to the researchers.

There also seemed to be a convenience and snowballing sampling given that all the subjects were graduate students at the University of Maryland.  It is also assuming that the people that this software is intended to be used by have had some graduate studies background.  Thus, any usability results from this study are limited to such a group and cannot be generalizable to the greater population.


Descriptive data analytics results

  • Using the data, the number of unhealthy days in a one-month period showed that eastern Kentucky and western West Virginia had the highest rates, which surpassed the entire nation. Through subject matter experts from the University of Maryland, this area has a lot of coal mines, which could result in poorer health in that population.
  • Areas of poor health are highly correlated to have low life expectancy. When using dynamic filtering, areas with higher than national average median income and low poverty rates had higher life expectancy than the reverse situation. Thus, showcasing those with low financial access could have lower access to healthcare which could lead to having lower life expectancies.
  • Those with no health insurance and had a higher rate of smoking were seen to have lower life expectancies, which rang true for four counties in South Dakota: Shannon, Bennett, Jackson, and Todd. These counties except for Shannon County also had higher infant mortality rates. Finally, in these counties with higher uninsurance rates also had fewer physicians per 100K residents.

Design study results

  • Some minor interface usability suggestions from one participant.
  • The small demo session used helped them navigate the application effectively.
  • No major usability challenges.
  • No need for installation of the application given that it is all web-based.

Contributions to the field or topic:

Two contributions were made.

  • The tool has been designed and built to be intuitive for a targeted audience and to facilitate data-driven answers based on demographic and health data.
  • The tool can be expanded to other fields with multivariable on geospatial data.


Some of these conclusions seemed obvious, but being able to visualize the data to see how some of these relationships (though other variables could be attributing to these results), can showcase some of the tropes and talking points in modern day politics over healthcare and healthcare reform.  Some of these tropes are: areas that are poorer usually fall into more illnesses; higher uninsurance rates lead to lower life expectancies, and areas with higher uninsurance rates have fewer physicians to treat the same amount of people. This study is to inspire further development of the tool and to enrich it with more demographic, outcome, cost variables, and health data.

Opinion on the validity of the claims and significance of the research:

The results seem to highlight typical political talking points through visualization of the data.  The usability study has some issues given that they were using graduate students and no political figures, which is who this tool is supposed to be used by.  Thus, are the results generalizable no, but is the tool useful, it depends on.  It depends on who needs it, what they need it for, and how easy it is to enter in data from various sources.


  • Sopan, A., Noh, A. S. I., Karol, S., Rosenfeld, P., Lee, G., & Shneiderman, B. (2012). Community Health Map: A geospatial and multivariate data visualization tool for public health datasets. Government Information Quarterly29(2), 223-234

Diagnosis of illness via big data

Big data is defined as high volume, high variety/complexity, and high velocity, which is known as the 3Vs (Services, 2015). Using Machine Learning and Artificial Intelligence, they do well at analyzing patterns from frequent and voluminous amounts of data at faster speeds than humans, but they fail to recognize patterns in infrequent and small amounts of data (Goldbloom, 2016). Thus, the use of data analytic theories and techniques on big data rather than novel situations in healthcare is vital to understand.

Fayyad, Piatetsky-Shapiro, and Smyth (1996) defined that data analytics can be divided into descriptive and predictive analytics. Vardarlier and Silaharoglu (2016) agreed with Fayyad et al. (1996) division but added prescriptive analytics.  Thus, these three divisions of big data analytics are:

  • Descriptive analytics explains “What happened?”
  • Predictive analytics explains “What will happen?”
  • Prescriptive analytics explains “Why will it happen?”

(Fayyad et al., 1996; Vardarlier & Silahtaroglu, 2016). Depending on the goal of diagnosing illnesses with the use of big data analytics should depend on the theory/division one should choose.  Raghupathi & Raghupathi (2014), stated some common examples of big data in the healthcare field to be: personal medical records, radiology images, clinical trial data, 3D imaging, human genomic data, population genomic data, biometric sensor reading, x-ray films, scripts, and traditional paper files.

The use of big data analytics to understand the 23 pairs of chromosomes that are the building blocks for people. Healthcare professionals are using the big data generated from our genomic code to help predict which illnesses a person could get (Services, 2013). Thus, using predictive analytics tools and algorithms like decision trees would be of some use.  Another use of predictive analytics and machine learning can be applied to diagnosing an eye disease like diabetic retinopathy from an image by using classification algorithms (Goldbloom, 2016).

The study of epigenetics, which are what parts of the genetic code is turned on versus turned off, can help explain why will certain illnesses are more probable to occur in the future (What is epigenetics, n.d.).  Thus, the use of prescriptive analytics could be of some use in the study of epigenetics. Currently, clinical trials use descriptive analytics to help calculate true positives, false positives, true negatives, and false negatives of a drug treatment versus a placebo are commonly used.  Thus, depending on the goal of diagnosing illnesses and the problem, that should help define which theories and techniques of big data analytics to use. The use of different data analytics techniques and theories based on the problem and data can change how healthcare jobs in the next 30 years from today (Goldbloom, 2016; McAfee, 2013).


Data analytics lifecycle

What is the data analytics Lifecycle?

The scientific method helps give a framework for the data analytics lifecycle (Dietrich, 2013). According to Dietrich (2013), it is a cyclical life cycle that has iterative parts in each of its six steps:

  • Discovery
  • Pre-processing data
  • Model planning
  • Model building
  • Communicate results
  • Operationalize

However Erl, Buhler, & Khattak (2016), suggested that it is divided in nine steps:

  • Business case evaluation
  • Data identification
  • Data acquisition & filtering
  • Data extraction
  • Data validation & cleansing
  • Data aggregation & representation
  • Data analysis
  • Data visualization
  • Utilization of analysis results

Prajapati (2013), stated five steps:

  • Identifying the problem
  • Designing data requirements
  • Pre-processing data
  • Data analysis
  • Data visualizing

Between these three different lifecycle versions, there is a general pattern that emerges, but it also suggests that the field of data analytics is still too nascent to pin down an exact data analytics lifecycle.  For the purpose of this discussion the lifecycle that will be used is from Services (2015), which uses the Dietrich (2013) lifecycle. Note that both Services (2015) and Dietrich (2013) model is iterative and not static steps.  This lifecycle model allows all key team members to conduct planning work up front and towards the end of the data analytics project to drive success (Dietrich, 2013).

When is it beneficial for stakeholders to be involved?

If following an agile development processes the key stakeholders should be involved in all the lifecycles. That is because the key stakeholders are known as business user, project sponsor, project manager, business intelligence analyst, database administers, data engineer, and data scientist (Services, 2015).  Some of the benefits of applying the Agile development processes to this lifecycle is because it allows for iterative feedback for speed-to-market, improved first-time quality, visibility, risk management, flexibility to pivot when needed, controlling costs, and improved satisfaction through engagement (Waters, 2007).  Allowing the stakeholders to participate in most of these steps can allow the following work to be done to their specifications.

For the first step, discovery, the business learns its domain and its relevant history with lessons learned from previous projects (Services, 2015). Before proceeding ask: “Do I have enough information to draft an analytic plan and share for peer review?” (Dietrich, 2013; Services, 2015). Pre-processing data, also known as data preparation is where a copy of the data is placed in a sandbox (not the original), where the data scientists and team can extract, load and transform (ELT) the copied data (Services, 2015). In this stage, data could also be cleaned, aggregated, augmented, and formatted (Prajapati, 2013). Before proceeding ask: “Do I have enough good quality data to start building the model?” (Dietrich, 2013; Services, 2015). Model planning is when the data scientist and team determines the appropriate models, algorithms, workflow of the data, which helps identify hidden insights between the variables (Services, 2015).  Before proceeding ask: “Do I have a good idea about the type of model to try? Can I refine the analytic plan?” (Dietrich, 2013; Services, 2015). Model building helps sets roughly about 2/3 of the data for training the model and 1/3 of the data for testing the model for production purposes and discovering hidden insights (Prajapati, 2013; Services, 2015). Before proceeding ask: “Is the model robust enough? Have we failed for sure?” (Dietrich, 2013; Services, 2015).   Communicating results could be done visualization of data to the major stakeholders to see if the results are a success or failure (Services, 2015).  Visualization is done in this step is supposed to be interactive with all parties involved in this project (Prajapati, 2013). Finally, the operationalize step is when the data is ready to provide reports, documents, on a pre-defined time interval such that key decision makers could receive the vital data needed (Services, 2015).


Using R and Spark for health care

Use of R with regard to healthcare field case study by Pereira and Noronha (2016):

R and RStudio have been used to look at patient health and diseases records located in Electronic Medical Records (EMR) for fraud detection.  Anomaly detection revolves around using a mapping code that filters data based on geo-locations.  Secondly, a reducer code which aggregates the data based on extreme values of cost claims per disease along with calculating the difference.  Finally, a code that analyzed the data that meets a 60% cost fraud threshold. It was found that as the geo-location resolution increased, the anomalies detected increased.

R and RStudio have been able to use big data analytics to predict diabetes from the Health Information System (HIS) which houses patient information, based on symptoms. For predicting diabetes, the authors used a classification algorithm (decision tree) with a 70%-30% training-test dataset split, to eventually plot the false positive rate v. True positive rate.  This plot showed skill in predicting diabetes.

Use of Spark about the healthcare field case study by Pita et al. (2015):

Data quality in healthcare data is poor and in particular that from the Brazilian Public Health System.  Spark was used to help in data processing to improve quality through deterministic and probabilistic record linking within multiple databases.  Record linking is a technique that uses common attributes across multiple databases and identifies a 1-to-1 match.  Spark workflows were created to help do record linking by (1) analyzing all data in each database and common attributes with high probabilities of linkage; (2) pre-processing data where data is transformed, anonymization, and cleaned to a single format so that all the attributes can be compared to each other for a 1-to-1 match; (3) record linking based on deterministic and probabilistic algorithms; and (4) statistical analysis to evaluate the accuracy. Over 397M comparisons were made in 12 hours.  They concluded that accuracy depends on the size of the data, where the bigger the data, the more accuracy in record linking.


  • Pereira, J. P., & Noronha, V. (2016). Anomalies Detection and Disease Prediction in Healthcare Systems using Big Data Analytics. Retrieved from
  • Pita, R., Pinto, C., Melo, P., Silva, M., Barreto, M., & Rasella, D. (2015). A Spark-based Workflow for Probabilistic Record Linkage of Healthcare Data. In EDBT/ICDT Workshops (pp. 17-26).