Column-oriented NoSQL databases

NoSQL (Not only Structured Query Language) databases are databases that are used to store data in non-relational databases i.e. graphical, document store, column-oriented, key-value, and object-oriented databases (Sadalage & Fowler, 2012; Services, 2015). Column-oriented databases are perfect for sparse datasets, ones with many null values and when columns do have data the related columns are grouped together (Services, 2015).  Grouping demographic data like age, income, gender, marital status, sexual orientation, etc. are a great example for using this NoSQL database. Cassandra, which is a column-oriented NoSQL database focuses on availability and partition tolerance, this means that as an AP system it can achieve consistency if data can be replicated and verified (Hurst, 2010).

Cassandra has been assessed for performance evaluation against other NoSQL databases like MongoDB and Raik for health care data analytics (Weider, Kollipara, Penmetsa, & Elliadka, 2013).  In this study, NoSQL database demands for health care data were two-fold:

  • Read/write efficiency of medical test results for a patient X (Availability)
  • All medical professionals should see the same information on patient X (Consistency)

A NoSQL graph database did not have the fit to use for the above demands, thus wasn’t part of this study.

The architecture of this project: nine partition nodes, where three by three nodes were used to mimic three data centers that would be used by 100 global health facilities, where data is generated at a rate of 1TB per month and must be kept for 99 years.

The dataset used in this project: a synthetic dataset that has 1M patients with 10M lab reports, averaging at seven lab reports per person, but randomly distributed of from 0-20 lab reports per person.

In meeting both of these two demands, Cassandra had a significantly higher throughput value than the other two NoSQL databases. Cassandra’s EACH_QUORUM write and LOCAL_QUORUM read options are part of their datacenter aware system, providing the great throughputs results, using the three synthetic datacenters. Testing consistency, by using Cassandra’s ONE for its write and read options at an eventual rate (slower consistency) or strong rate (faster consistency), shows that throughput increases with the eventual system. The choice to use either rate rests with the healthcare stakeholders.

The authors concluded that for their system and their requirements Cassandra had the highest throughput regardless of the level of consistency rates (Weider et al., 2013).  They also suggested that each of these tests should be adjusted based on the requirements from key stakeholders in the healthcare profession and that a small variation in the data model could change the results seen here.

In conclusion of this post, NoSQL databases provide huge advantages to data analytics over traditional relational database management systems. But, NoSQL databases must fit the needs of the stakeholders, and quantitative tests must be thoroughly designed to assess which NoSQL database will meet those needs.

References

  • Hurst, N. (2010). Visual guide to NoSQL systems. Retrieved from http://blog.nahurst.com/visual-guide-to-nosql-systems
  • Sadalage, P. J., Fowler, M. (2012). NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence, 1st Edition. [Bookshelf Online].
  • Services, E. E. (2015). Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data, 1st Edition. [Bookshelf Online].
  • Weider, D. Y., Kollipara, M., Penmetsa, R., & Elliadka, S. (2013, October). A distributed storage solution for cloud based e-Healthcare Information System. In e-Health Networking, Applications & Services (Healthcom), 2013 IEEE 15th International Conference on (pp. 476-480). IEEE.
Advertisements

Document store NoSQL databases

NoSQL (Not only Structured Query Language) databases are databases that are used to store data in non-relational databases i.e. graphical, document store, column-oriented, key-value, and object-oriented databases (Sadalage & Fowler, 2012; Services, 2015). NoSQL databases have benefits as they provide a data model for applications that require a little code, less debugging, run on clusters, handle large scale data and evolve with time (Sadalage & Fowler, 2012). Document store NoSQL databases, use a key/value pair that is the file/file itself, and it could be in JSON, BSON, or XML (Sadalage & Fowler, 2012; Services, 2015).  These document files are hierarchical trees (Sadalage & Fowler, 2012).

Parts of the documents could be updated in real-time this type of NoSQL database allows for easy creation and storage of dynamic data like website page views, unique views, or new metrics (Sadalage & Fowler, 2012).  To help speed up the search of a document store NoSQL database like content in multiple web pages, or store log file, indexes can be created (Services, 2012). These indexes could be stored as attributes, such as a “state,” “city,” “zip-code,” etc. attributes, which can have the same, different, or null values in the NoSQL database and it each of these is allowed (Sadalage & Fowler, 2012).

If you want to insert, update, or delete (also known as a transaction) data in a NoSQL database, it will either succeed or fail, it won’t have the ability as traditional databases to either commit or rollback (Sadalage & Fowler, 2012). Only two of the three features can exist according to CAP Theory (Consistency, Availability, and Partition Tolerance), and document store databases primarily focus on availability through replicating data in different nodes (Hurst, 2010; Sadalage & Fowler, 2012).  Some key players in the document store database realm are CouchDB, MongoDB, OrientDB, RavenDB, and Terrastore (Sadalage & Fowler, 2012).  This discussion will focus on both CouchDB and MongoDB; which are open-sourced code that allows having scalability features (CouchDB, n.d.; MongoDB, n.d.; Sadalage & Fowler, 2012).

CouchDB is an Apache code available for Windows, Linux, and Mac OS X and it is also:

  • AP database system (Hurst, 2010)
  • AP systems can achieve consistency if data can be replicated and verified (Hurst, 2010)
  • Globally distributed server cluster to allow for accessing data and implementing projects anywhere through a data replication protocol (CouchDB, n.d.)
  • Data can be stored on a single or clustered server, via locally on the company’s servers, virtual machines, Raspberry Pi servers, or on a cloud provider (CouchDB, n.d.)
  • Allows for offline end user experience (CouchDB, n.d.)
  • Can use MapReduce for deriving insights from the data (CouchDB, n.d.)
  • Uses HTTP protocol and JSON data (CouchDB, n.d.)
  • Only allowing for appending data helps create a crash-resistant data structure (CouchDB, n.d.)

MongoDB is code available for Windows, Linux, Mac OS X, Solaris, etc.:

  • CP database system (Hurst, 2010)
  • CP systems have issues keeping data available across all nodes through their replication system (Hurst, 2010; Sadalage & Fowler, 2012)
  • Used by companies like Expedia, Forbes, Bosch, AstraZeneca, MetLife, Facebook, Urban Outfitters, sprinklr, the guardian, Comcast, etc., such that 33% of the Fortune 100 are using it (MongoDB, n.d.)
  • Has an expressive query language and secondary indexes out of the box to help access and understand data stored within its database, which is easier to use and requires fewer lines of code (MongoDB, n.d.; Sadalage & Fowler, 2012)
  • Allows for a flexible data model that evolves with time as the data stored in it evolves (MongoDB, n.d.)
  • Allows for integration of silo, internet of things, mobile, catalog data to help provide real-time analytics (MongoDB, n.d.)

References

  • CouchDB (n.d.). CouchDB, relax. Apache. Retrieved from http://couchdb.apache.org/
  • Hurst, N. (2010). Visual guide to NoSQL systems. Retrieved from http://blog.nahurst.com/visual-guide-to-nosql-systems
  • MongoDB (n.d.). MongoDB, for giant ideas. Retrieved from https://www.mongodb.com/
  • Sadalage, P. J., Fowler, M. (2012). NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence, 1st Edition. [Bookshelf Online].
  • Services, E. E. (2015). Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data, 1st Edition. [Bookshelf Online].