Document store NoSQL databases

NoSQL (Not only Structured Query Language) databases store data in non-relational models, i.e., graph, document store, column-oriented, key-value, and object-oriented databases (Sadalage & Fowler, 2012; Services, 2015). NoSQL databases are beneficial because they provide a data model for applications that requires little code and less debugging, runs on clusters, handles large-scale data, and evolves with time (Sadalage & Fowler, 2012). Document store NoSQL databases use key/value pairs in which the value is the document itself, which could be in JSON, BSON, or XML (Sadalage & Fowler, 2012; Services, 2015). These documents are hierarchical trees (Sadalage & Fowler, 2012).

Parts of a document can be updated in real time, so this type of NoSQL database allows for easy creation and storage of dynamic data like website page views, unique views, or new metrics (Sadalage & Fowler, 2012). Indexes can be created to help speed up searches of a document store NoSQL database, such as searches of content across multiple web pages or of stored log files (Services, 2015). These indexes can be built on attributes such as “state,” “city,” or “zip-code,” which can have the same, different, or null values across documents; each of these is allowed (Sadalage & Fowler, 2012).
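As a minimal sketch of these ideas, the plain-Python example below treats documents as nested dictionaries and builds a simple index on an attribute. The document ids, field names, and the build_index helper are all hypothetical and are not the API of any particular document store.

```python
from collections import defaultdict

# Hypothetical page-view documents; each one is a small hierarchical tree.
# Attributes may differ between documents, be null, or be missing entirely.
documents = {
    "page:1": {"url": "/home", "views": 120, "location": {"state": "MD", "city": "College Park"}},
    "page:2": {"url": "/about", "views": 15, "location": {"state": "MD", "city": None}},
    "page:3": {"url": "/blog", "views": 48},  # no location attribute at all
}


def build_index(docs, path):
    """Map each value found at `path` (a tuple of keys) to the ids of documents holding it."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        node = doc
        for key in path:
            node = node.get(key) if isinstance(node, dict) else None
        index[node].add(doc_id)  # documents without the attribute are indexed under None
    return index


state_index = build_index(documents, ("location", "state"))
city_index = build_index(documents, ("location", "city"))

# Partial update: only one field of one document changes.
documents["page:3"]["views"] += 1
```

Looking up state_index["MD"] then returns the matching document ids without scanning every document, which is the purpose an index serves.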

An insert, update, or delete (also known as a transaction) in a NoSQL database either succeeds or fails; unlike traditional databases, there is no ability to commit or roll back (Sadalage & Fowler, 2012). According to the CAP theorem (Consistency, Availability, and Partition Tolerance), only two of the three properties can be guaranteed at once, and document store databases primarily focus on availability by replicating data across different nodes (Hurst, 2010; Sadalage & Fowler, 2012). Some key players in the document store database realm are CouchDB, MongoDB, OrientDB, RavenDB, and Terrastore (Sadalage & Fowler, 2012). This discussion will focus on CouchDB and MongoDB, both of which are open source and offer scalability features (CouchDB, n.d.; MongoDB, n.d.; Sadalage & Fowler, 2012).

CouchDB is an Apache project available for Windows, Linux, and Mac OS X, with the following characteristics:

  • AP database system (Hurst, 2010)
  • AP systems can achieve consistency if data can be replicated and verified (Hurst, 2010)
  • Globally distributed server cluster to allow for accessing data and implementing projects anywhere through a data replication protocol (CouchDB, n.d.)
  • Data can be stored on a single server or a cluster, whether locally on the company’s servers, on virtual machines, on Raspberry Pi servers, or with a cloud provider (CouchDB, n.d.)
  • Allows for offline end user experience (CouchDB, n.d.)
  • Can use MapReduce for deriving insights from the data (CouchDB, n.d.)
  • Uses HTTP protocol and JSON data (CouchDB, n.d.)
  • Its append-only storage helps create a crash-resistant data structure (CouchDB, n.d.)
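The MapReduce idea mentioned in the list can be sketched in plain Python. This is only an emulation of the map and reduce steps over JSON-like documents, not CouchDB's own view API; the documents and the function names map_visits and run_view are hypothetical.

```python
from collections import defaultdict

# Hypothetical JSON-like documents, as they might be returned over HTTP.
docs = [
    {"_id": "a", "type": "visit", "page": "/home", "ms": 120},
    {"_id": "b", "type": "visit", "page": "/home", "ms": 80},
    {"_id": "c", "type": "visit", "page": "/blog", "ms": 200},
    {"_id": "d", "type": "comment", "page": "/blog"},
]


def map_visits(doc):
    """Map step: emit (key, value) pairs for the documents of interest."""
    if doc.get("type") == "visit":
        yield doc["page"], 1


def run_view(documents, map_fn, reduce_fn):
    """Group the emitted values by key, then reduce each group to one value."""
    grouped = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            grouped[key].append(value)
    return {key: reduce_fn(values) for key, values in grouped.items()}


visits_per_page = run_view(docs, map_visits, sum)
```

Here the reduce step is Python's built-in sum, so visits_per_page holds a visit count per page; a different reduce function would yield a different insight from the same map output.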

MongoDB is available for Windows, Linux, Mac OS X, Solaris, etc., with the following characteristics:

  • CP database system (Hurst, 2010)
  • CP systems have issues keeping data available across all nodes through their replication system (Hurst, 2010; Sadalage & Fowler, 2012)
  • Used by companies like Expedia, Forbes, Bosch, AstraZeneca, MetLife, Facebook, Urban Outfitters, Sprinklr, The Guardian, Comcast, etc., such that 33% of the Fortune 100 are using it (MongoDB, n.d.)
  • Has an expressive query language and secondary indexes out of the box to help access and understand data stored within its database, which is easier to use and requires fewer lines of code (MongoDB, n.d.; Sadalage & Fowler, 2012)
  • Allows for a flexible data model that evolves with time as the data stored in it evolves (MongoDB, n.d.)
  • Allows for integration of siloed, Internet of Things, mobile, and catalog data to help provide real-time analytics (MongoDB, n.d.)
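To illustrate what an expressive query over flexible documents can look like, the sketch below emulates a small subset of document-style filtering in plain Python. The operator names ($gt, $lt, and so on) are modeled on MongoDB's comparison operators, but the matches and find helpers and the sensor data are hypothetical stand-ins, not MongoDB's implementation or driver API.

```python
import operator

# Comparison operators modeled on MongoDB's query operators of the same names.
OPERATORS = {
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
    "$eq": operator.eq,
}


def matches(doc, query):
    """Return True when every field condition in `query` holds for `doc`."""
    for field, condition in query.items():
        value = doc.get(field)
        if isinstance(condition, dict):
            for op, operand in condition.items():
                # A document lacking the field never satisfies a comparison.
                if value is None or not OPERATORS[op](value, operand):
                    return False
        elif value != condition:
            return False
    return True


def find(collection, query):
    """Return the documents in `collection` that satisfy `query`."""
    return [doc for doc in collection if matches(doc, query)]


# Hypothetical Internet of Things readings with a flexible schema.
readings = [
    {"device": "thermostat-1", "room": "lab", "temp_c": 21.5},
    {"device": "thermostat-2", "room": "lab", "temp_c": 27.0},
    {"device": "thermostat-3", "room": "office", "temp_c": 29.5},
    {"device": "badge-reader", "room": "lab"},  # no temperature field
]

hot_lab = find(readings, {"room": "lab", "temp_c": {"$gt": 25}})
```

The badge reader has no temperature field, yet it lives in the same collection and is simply skipped by the temperature condition, which is the practical meaning of a flexible data model.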

References

  • CouchDB (n.d.). CouchDB, relax. Apache. Retrieved from http://couchdb.apache.org/
  • Hurst, N. (2010). Visual guide to NoSQL systems. Retrieved from http://blog.nahurst.com/visual-guide-to-nosql-systems
  • MongoDB (n.d.). MongoDB, for giant ideas. Retrieved from https://www.mongodb.com/
  • Sadalage, P. J., & Fowler, M. (2012). NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence, 1st Edition. [Bookshelf Online].
  • Services, E. E. (2015). Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data, 1st Edition. [Bookshelf Online].

Case study: Health data analysis

Article: Community Health Map: A geospatial and multivariate data visualization tool for public health datasets

Year: 2012

Authors: Awalin Sopan a, Angela Song-Ie Noh b, Sohit Karol c, Paul Rosenfeld d, Ginnah Lee d, Ben Shneiderman a

Research positions:

a Human–Computer Interaction Lab, Department of Computer Science, University of Maryland

b Department of Computer Science, University of Maryland

c Department of Kinesiology, University of Maryland

d Department of Electrical and Computer Engineering, University of Maryland

Summary/Problem:

This design study evaluates the usability of the “Community Health Map,” a multivariate and geospatial data visualization tool for understanding health care quality, public health outcomes, and access to healthcare. The web application’s target audience is policymakers at all levels of government, and it lets multivariate data be dynamically queried, filtered, mapped, tabled, and charted. The study also reported simple descriptive analytics results about health care quality, access, and public health.

Motivation:

What is driving this research is the need to understand where Medicare dollars are wasted across multiple states and Hospital Referral Regions (HRR). Financial waste is examined by filtering on demographic data provided through the Open Government Directive, health performance data from the Department of HHS, and other healthcare industry data sets. Thus, this prototyping/design study of the Community Health Map application incorporated 11 health care variables to understand health care quality, accessibility, and public health.

  • Quality variables: life expectancy, infant mortality, self-reported fair/poor overall health, average number of unhealthy days
  • Access variables: percentage of uninsured, physicians per 100K
  • Public Health variables: smoking rate, lack of physical activity, obesity, nutrition, flu vaccinations (ages 65+)

Methods and techniques used:

This study used descriptive analytics to pilot the Community Health Map. Each of the three categories is colored differently: green for quality, orange for access, and blue for public health. Median income, poverty rate, percent over age 65, and percent with a bachelor’s degree serve as dynamic filters.
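Dynamic filtering of this kind can be sketched as range filters applied to region records. The regions, their figures, and the dynamic_filter helper below are hypothetical and are not taken from the study or from the tool's code.

```python
from statistics import mean

# Hypothetical regions; none of these figures come from the study.
regions = [
    {"name": "Region A", "median_income": 72000, "poverty_rate": 8.0, "life_expectancy": 80.1},
    {"name": "Region B", "median_income": 38000, "poverty_rate": 22.5, "life_expectancy": 74.3},
    {"name": "Region C", "median_income": 65000, "poverty_rate": 9.5, "life_expectancy": 79.4},
    {"name": "Region D", "median_income": 41000, "poverty_rate": 19.0, "life_expectancy": 75.6},
]


def dynamic_filter(rows, **ranges):
    """Keep rows whose value for each named field lies within its (low, high) range."""
    return [
        row for row in rows
        if all(low <= row[field] <= high for field, (low, high) in ranges.items())
    ]


# Two filter settings, as a user might choose with sliders.
wealthier = dynamic_filter(regions, median_income=(60000, float("inf")), poverty_rate=(0, 10))
poorer = dynamic_filter(regions, median_income=(0, 45000), poverty_rate=(15, 100))

wealthier_life = mean(r["life_expectancy"] for r in wealthier)
poorer_life = mean(r["life_expectancy"] for r in poorer)
```

Comparing wealthier_life with poorer_life is the kind of simple descriptive comparison that a filtered map view makes visible at a glance.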

Finally, this study analyzed the usability of the Community Health Map by giving participants a four-minute tutorial video and 20 minutes to perform five tasks that involved dynamic filtering, mapping, tabling, and charting. Each participant was encouraged to think out loud.

Assumptions:

Data is stored at different geospatial units (counties, zip codes, HRRs, etc.), so it first had to be pre-processed to a common unit. Assumptions and interpolation of geospatial data were needed to produce the results seen in this study.

Not explicitly stated in this study is the assumption that people feel comfortable thinking out loud completely and giving the researchers unfettered access to their thoughts. Some thoughts may be kept to themselves. This also assumes that explicitly stated thoughts are valuable to this study and unstated thoughts are not.

Also not explicitly stated in this study is how usability was defined beyond participants being able to perform the tasks. There was no mention of whether the design was usable by people with disabilities. In general, people with disabilities may be able to provide universal design suggestions that can help improve the usability of the product. The lack of any mention may suggest that none of the participants had a disability known to the researchers.

There also seemed to be convenience and snowball sampling, given that all the subjects were graduate students at the University of Maryland. This also assumes that the software’s intended users have some graduate studies background. Thus, any usability results from this study are limited to such a group and cannot be generalized to the greater population.

Findings:

Descriptive data analytics results

  • The number of unhealthy days in a one-month period was highest in eastern Kentucky and western West Virginia, exceeding the rest of the nation. According to subject matter experts from the University of Maryland, this area has a lot of coal mines, which could result in poorer health in that population.
  • Areas of poor health are highly correlated with low life expectancy. With dynamic filtering, areas with higher-than-national-average median income and low poverty rates had higher life expectancy than areas in the reverse situation. This suggests that those with fewer financial resources could have less access to healthcare, which could lead to lower life expectancies.
  • Areas with more people lacking health insurance and higher smoking rates were seen to have lower life expectancies, which held true for four counties in South Dakota: Shannon, Bennett, Jackson, and Todd. These counties, except for Shannon County, also had higher infant mortality rates. Finally, these counties with higher uninsurance rates also had fewer physicians per 100K residents.

Design study results

  • Some minor interface usability suggestions from one participant.
  • The short demo session helped participants navigate the application effectively.
  • No major usability challenges.
  • No need for installation of the application given that it is all web-based.

Contributions to the field or topic:

Two contributions were made.

  • The tool has been designed and built to be intuitive for a targeted audience and to facilitate data-driven answers based on demographic and health data.
  • The tool can be expanded to other fields with multivariate geospatial data.

Conclusion:

Some of these conclusions seemed obvious, but being able to visualize the data and see these relationships (though other variables could be contributing to these results) can showcase some of the tropes and talking points in modern-day politics over healthcare and healthcare reform. Some of these tropes are: poorer areas usually experience more illness; higher uninsurance rates lead to lower life expectancies; and areas with higher uninsurance rates have fewer physicians to treat the same number of people. This study aims to inspire further development of the tool and to enrich it with more demographic, outcome, cost, and health data.

Opinion on the validity of the claims and significance of the research:

The results seem to highlight typical political talking points through visualization of the data. The usability study has some issues given that it used graduate students and no political figures, who are the tool’s intended users. Thus, are the results generalizable? No. Is the tool useful? It depends on who needs it, what they need it for, and how easy it is to enter data from various sources.

Reference

  • Sopan, A., Noh, A. S. I., Karol, S., Rosenfeld, P., Lee, G., & Shneiderman, B. (2012). Community Health Map: A geospatial and multivariate data visualization tool for public health datasets. Government Information Quarterly, 29(2), 223-234.

Diagnosis of illness via big data

Big data is defined as high volume, high variety/complexity, and high velocity, which is known as the 3Vs (Services, 2015). Machine learning and artificial intelligence do well at analyzing patterns in frequent and voluminous data at faster speeds than humans, but they fail to recognize patterns in infrequent and small amounts of data (Goldbloom, 2016). Thus, it is vital to understand the use of data analytics theories and techniques on big data, rather than on novel situations, in healthcare.

Fayyad, Piatetsky-Shapiro, and Smyth (1996) divided data analytics into descriptive and predictive analytics. Vardarlier and Silahtaroglu (2016) agreed with this division but added prescriptive analytics. Thus, the three divisions of big data analytics are:

  • Descriptive analytics explains “What happened?”
  • Predictive analytics explains “What will happen?”
  • Prescriptive analytics explains “Why will it happen?”

(Fayyad et al., 1996; Vardarlier & Silahtaroglu, 2016). The theory/division one chooses should depend on the goal of diagnosing illnesses with big data analytics. Raghupathi and Raghupathi (2014) listed some common examples of big data in the healthcare field: personal medical records, radiology images, clinical trial data, 3D imaging, human genomic data, population genomic data, biometric sensor readings, x-ray films, scripts, and traditional paper files.

One use of big data analytics is to understand the 23 pairs of chromosomes that are the building blocks for people. Healthcare professionals are using the big data generated from our genomic code to help predict which illnesses a person could get (Services, 2013). Thus, predictive analytics tools and algorithms like decision trees would be of some use. Predictive analytics and machine learning can also be applied to diagnosing an eye disease like diabetic retinopathy from an image by using classification algorithms (Goldbloom, 2016).

The study of epigenetics, which examines which parts of the genetic code are turned on versus turned off, can help explain why certain illnesses are more probable to occur in the future (What is epigenetics, n.d.). Thus, prescriptive analytics could be of some use in the study of epigenetics. Currently, clinical trials commonly use descriptive analytics to calculate the true positives, false positives, true negatives, and false negatives of a drug treatment versus a placebo. Thus, the goal of diagnosing illnesses and the problem at hand should help define which theories and techniques of big data analytics to use. The use of different data analytics techniques and theories based on the problem and data can change healthcare jobs over the next 30 years (Goldbloom, 2016; McAfee, 2013).
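The true/false positive and negative counts mentioned above can be computed with a few lines of Python. The outcome lists and the helper names confusion_counts and summarize are hypothetical; this is a sketch of the descriptive calculation, not a clinical trial analysis.

```python
def confusion_counts(actual, predicted):
    """Count true/false positives and negatives for paired boolean labels."""
    pairs = list(zip(actual, predicted))
    return {
        "tp": sum(a and p for a, p in pairs),
        "fp": sum((not a) and p for a, p in pairs),
        "tn": sum((not a) and (not p) for a, p in pairs),
        "fn": sum(a and (not p) for a, p in pairs),
    }


def summarize(counts):
    """Derive common rates; assumes at least one actual positive and one actual negative."""
    tp, fp, tn, fn = counts["tp"], counts["fp"], counts["tn"], counts["fn"]
    return {
        "sensitivity": tp / (tp + fn),  # share of actual positives that were detected
        "specificity": tn / (tn + fp),  # share of actual negatives that were cleared
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


# Hypothetical outcomes for eight participants (True = condition present / flagged).
actual = [True, True, True, False, False, False, False, True]
predicted = [True, False, True, False, True, False, False, True]

counts = confusion_counts(actual, predicted)
rates = summarize(counts)
```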

References

Data analytics lifecycle

What is the data analytics lifecycle?

The scientific method helps give a framework for the data analytics lifecycle (Dietrich, 2013). According to Dietrich (2013), it is a cyclical life cycle that has iterative parts in each of its six steps:

  • Discovery
  • Pre-processing data
  • Model planning
  • Model building
  • Communicate results
  • Operationalize

However, Erl, Buhler, and Khattak (2016) suggested that it is divided into nine steps:

  • Business case evaluation
  • Data identification
  • Data acquisition & filtering
  • Data extraction
  • Data validation & cleansing
  • Data aggregation & representation
  • Data analysis
  • Data visualization
  • Utilization of analysis results

Prajapati (2013) stated five steps:

  • Identifying the problem
  • Designing data requirements
  • Pre-processing data
  • Data analysis
  • Data visualizing

Across these three lifecycle versions a general pattern emerges, but the differences also suggest that the field of data analytics is still too nascent to pin down an exact data analytics lifecycle. For the purpose of this discussion, the lifecycle used is from Services (2015), which follows the Dietrich (2013) lifecycle. Note that the Services (2015) and Dietrich (2013) model is iterative rather than a set of static steps. This lifecycle model allows all key team members to conduct planning work up front and towards the end of the data analytics project to drive success (Dietrich, 2013).

When is it beneficial for stakeholders to be involved?

If following an agile development process, the key stakeholders should be involved in all stages of the lifecycle. That is because the key stakeholders are the business user, project sponsor, project manager, business intelligence analyst, database administrator, data engineer, and data scientist (Services, 2015). Applying agile development processes to this lifecycle allows for iterative feedback, which brings benefits such as speed-to-market, improved first-time quality, visibility, risk management, flexibility to pivot when needed, cost control, and improved satisfaction through engagement (Waters, 2007). Allowing the stakeholders to participate in most of these steps helps ensure that subsequent work is done to their specifications.

In the first step, discovery, the business domain and its relevant history are studied, including lessons learned from previous projects (Services, 2015). Before proceeding ask: “Do I have enough information to draft an analytic plan and share for peer review?” (Dietrich, 2013; Services, 2015).

Pre-processing data, also known as data preparation, is where a copy of the data (not the original) is placed in a sandbox, where the data scientists and team can extract, load, and transform (ELT) the copied data (Services, 2015). In this stage, data could also be cleaned, aggregated, augmented, and formatted (Prajapati, 2013). Before proceeding ask: “Do I have enough good quality data to start building the model?” (Dietrich, 2013; Services, 2015).

Model planning is when the data scientist and team determine the appropriate models, algorithms, and workflow for the data, which helps identify hidden insights between the variables (Services, 2015). Before proceeding ask: “Do I have a good idea about the type of model to try? Can I refine the analytic plan?” (Dietrich, 2013; Services, 2015).

Model building sets aside roughly 2/3 of the data for training the model and 1/3 for testing it, for production purposes and for discovering hidden insights (Prajapati, 2013; Services, 2015). Before proceeding ask: “Is the model robust enough? Have we failed for sure?” (Dietrich, 2013; Services, 2015).

Communicating results can be done through visualization of the data for the major stakeholders, to see whether the results are a success or a failure (Services, 2015). Visualization in this step is supposed to be interactive for all parties involved in the project (Prajapati, 2013).

Finally, the operationalize step is when the data is ready to feed reports and documents at a pre-defined time interval, so that key decision makers receive the vital data they need (Services, 2015).
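The roughly 2/3 to 1/3 split used in model building can be sketched with the standard library alone. The train_test_split helper, its seed parameter, and the placeholder records are assumptions made for illustration.

```python
import random


def train_test_split(records, train_fraction=2 / 3, seed=0):
    """Shuffle a copy of the records and split it into training and test sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


records = list(range(30))  # stand-in for 30 prepared rows from the sandbox copy
train, test = train_test_split(records)
```

Shuffling before cutting keeps any ordering in the source data (for example, by date) from leaking into one side of the split.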

References

Using R and Spark for health care

Use of R in the healthcare field, a case study by Pereira and Noronha (2016):

R and RStudio have been used to examine patient health and disease records located in Electronic Medical Records (EMR) for fraud detection. Anomaly detection involved three pieces of code. First, a mapper filtered data based on geo-location. Second, a reducer aggregated the data based on extreme values of cost claims per disease and calculated the difference. Finally, a third piece of code analyzed the data that met a 60% cost fraud threshold. It was found that as the geo-location resolution increased, the number of anomalies detected increased.
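The three pieces described above can be sketched in Python rather than R. The claims data and function names are hypothetical, and the flagging rule (the spread between the highest and lowest claim being at least 60% of the highest claim) is an assumed reading of the 60% threshold, not the authors' exact definition.

```python
from collections import defaultdict

# Hypothetical claims: (region, disease, claimed cost).
claims = [
    ("north", "flu", 100.0),
    ("north", "flu", 110.0),
    ("north", "asthma", 200.0),
    ("north", "asthma", 900.0),
    ("south", "flu", 95.0),
    ("south", "flu", 400.0),
]


def map_by_region(records, region):
    """Map step: keep the claims for one geo-location, keyed by disease."""
    for rec_region, disease, cost in records:
        if rec_region == region:
            yield disease, cost


def reduce_extremes(pairs):
    """Reduce step: per disease, find the lowest and highest claim and their difference."""
    grouped = defaultdict(list)
    for disease, cost in pairs:
        grouped[disease].append(cost)
    return {
        disease: {"min": min(costs), "max": max(costs), "diff": max(costs) - min(costs)}
        for disease, costs in grouped.items()
    }


def flag_anomalies(summary, threshold=0.60):
    """Flag diseases whose cost spread is at least `threshold` of the highest claim (assumed rule)."""
    return sorted(d for d, s in summary.items() if s["diff"] / s["max"] >= threshold)


north_flags = flag_anomalies(reduce_extremes(map_by_region(claims, "north")))
south_flags = flag_anomalies(reduce_extremes(map_by_region(claims, "south")))
```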

R and RStudio have also been used with big data analytics to predict diabetes, based on symptoms, from the Health Information System (HIS), which houses patient information. To predict diabetes, the authors used a classification algorithm (decision tree) with a 70%-30% training-test dataset split and then plotted the false positive rate versus the true positive rate. This plot showed skill in predicting diabetes.

Use of Spark in the healthcare field, a case study by Pita et al. (2015):
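A plot of false positive rate versus true positive rate is a ROC curve, and its points can be computed as sketched below. The labels and scores are hypothetical classifier outputs, not the paper's data, and the helper names are assumptions.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) pairs, one per score threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under the ROC curve; 0.5 is chance level, 1.0 is perfect."""
    return sum((x2 - x1) * (y1 + y2) / 2 for (x1, y1), (x2, y2) in zip(points, points[1:]))


# Hypothetical test-set labels (1 = diabetic) and predicted probabilities.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(labels, scores)
auc = area_under_curve(points)
```

A curve that bows toward the top-left corner (area well above 0.5) is what "showing skill" means in this context.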

Data quality in healthcare data is poor, particularly in data from the Brazilian Public Health System. Spark was used in data processing to improve quality through deterministic and probabilistic record linking across multiple databases. Record linking is a technique that uses common attributes across multiple databases to identify 1-to-1 matches. Spark workflows were created to do record linking by (1) analyzing all data in each database to find common attributes with high probabilities of linkage; (2) pre-processing the data, where it is transformed, anonymized, and cleaned into a single format so that all the attributes can be compared to each other for a 1-to-1 match; (3) record linking based on deterministic and probabilistic algorithms; and (4) statistical analysis to evaluate the accuracy. Over 397M comparisons were made in 12 hours. They concluded that accuracy depends on the size of the data: the bigger the data, the more accurate the record linking.
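The pre-processing and linking steps can be sketched on a single machine in plain Python. This is not the authors' Spark workflow: the records, attribute weights, and threshold are hypothetical, anonymization is omitted, and difflib's string similarity is swapped in as a simple stand-in for a probabilistic matching score.

```python
from difflib import SequenceMatcher


def normalize(record):
    """Pre-processing: bring the compared attributes into a single format."""
    return {
        "name": " ".join(record["name"].lower().split()),
        "birth": record["birth"].replace("/", "-"),
        "city": record["city"].strip().lower(),
    }


def probabilistic_score(a, b, weights=None):
    """Simplified probabilistic linkage: weighted average of per-attribute similarity."""
    weights = weights or {"name": 0.5, "birth": 0.3, "city": 0.2}  # assumed weights
    return sum(w * SequenceMatcher(None, a[f], b[f]).ratio() for f, w in weights.items())


def link(left, right, threshold=0.85):
    """Pair each left record with at most one right record (1-to-1), best score first."""
    left_n = [normalize(r) for r in left]
    right_n = [normalize(r) for r in right]
    candidates = []
    for i, a in enumerate(left_n):
        for j, b in enumerate(right_n):
            # Deterministic linkage: all attributes agree exactly after normalization.
            score = 1.0 if a == b else probabilistic_score(a, b)
            if score >= threshold:
                candidates.append((score, i, j))
    links, used_left, used_right = [], set(), set()
    for score, i, j in sorted(candidates, reverse=True):
        if i not in used_left and j not in used_right:
            links.append((i, j))
            used_left.add(i)
            used_right.add(j)
    return sorted(links)


# Hypothetical records from two databases.
left = [
    {"name": "Maria  Silva", "birth": "1980/02/01", "city": "Salvador "},
    {"name": "Joao Souza", "birth": "1975-11-23", "city": "Recife"},
]
right = [
    {"name": "joao sousa", "birth": "1975-11-23", "city": "recife"},
    {"name": "maria silva", "birth": "1980-02-01", "city": "salvador"},
    {"name": "Ana Costa", "birth": "1990-05-05", "city": "Manaus"},
]

links = link(left, right)
```

The first left record links deterministically once formats are normalized, while the second links only through the similarity score because of the spelling difference in the surname. Comparing every pair, as done here, is what makes the comparison count grow so quickly and motivates a distributed engine.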

References

  • Pereira, J. P., & Noronha, V. (2016). Anomalies Detection and Disease Prediction in Healthcare Systems using Big Data Analytics. Retrieved from http://www.aijet.in/v3/1608001.pdf
  • Pita, R., Pinto, C., Melo, P., Silva, M., Barreto, M., & Rasella, D. (2015). A Spark-based Workflow for Probabilistic Record Linkage of Healthcare Data. In EDBT/ICDT Workshops (pp. 17-26).