Internal and External Validity

In quantitative research, a study is valid if one can draw meaningful inferences from the results based on the methodology employed.  Validity is commonly examined in three ways: (1) content validity (do we measure what we intended to measure?), (2) predictive or concurrent validity (do the results match similar results, and can we predict something from them?), and (3) construct validity (do the items measure the hypothetical constructs or concepts?).  Validity should not be confused with reliability, which concerns consistency.  Creswell (2013) warns that if we modify an instrument or combine it with others, its validity and reliability may change, so before using it we must reestablish both.  Several threats to validity exist, either internal (history, maturation, regression, selection, mortality, diffusion of treatment, compensatory/resentful demoralization, compensatory rivalry, testing, and instrumentation) or external (interaction of selection and treatment, interaction of setting and treatment, and interaction of history and treatment).

Sample Validity Considerations: the validity issues in this study and their mitigation plans

Internal Validity Issues:

Hurricane intensities and tracks vary from year to year and even decade to decade.  Because this study covers the 2016 and 2017 Atlantic Ocean Basin seasons, it may run into regression-to-the-mean issues: certain weather components may not be the only factors that increase or decrease hurricane forecasting skill relative to the average.  One way to mitigate regression issues would be to eliminate storms whose forecast skill departs extremely from the average.  Those extreme departures will, over time, slightly shift the mean, but their results are too valuable to dismiss; discovering which weather components drive these extreme departures from average forecast skill is what motivates this project.  Thus, removing them does not fit this study and would defeat the purpose of knowledge discovery.

External Validity Issues: 

The Eastern Pacific, Central Pacific, and Atlantic Ocean Basins share the same underlying dynamics that create, intensify, and steer tropical cyclones.  However, these three basins still behave differently, so an interaction-of-setting-and-treatment threat applies to this study's results. Results garnered in this study will not allow me to generalize beyond the Atlantic Ocean Basin. The only way to mitigate this threat to validity is to recommend that future research be conducted on each basin separately.


Exploring Mixed Methods

Explanatory Sequential (QUAN -> qual)

According to Creswell (2013), this mixed-methods design uses qualitative methods to take a deeper dive into quantitative results that have already been gathered (often to understand the numbers in their cultural context).  The key defining feature is that quantitative data are collected before the qualitative data, and the quantitative results drive what the qualitative phase explores.  Thus, the emphasis is given to the quantitative results, and the qualitative follow-up is used to probe and explain them.  Essentially, you use qualitative results to enhance your quantitative results.

Exploratory Sequential (QUAL -> quan)

According to Creswell (2013), this mixed-methods design uses quantitative methods to confirm or generalize qualitative results that have already been gathered (often to understand the culture behind the data).  The key defining feature is that qualitative data are collected before the quantitative data, and the qualitative results drive what the quantitative phase tests.  Thus, the emphasis is given to the qualitative results, and the quantitative follow-up is used to build on and generalize them.  Essentially, you use quantitative results to enhance your qualitative results.

Which method would you most likely use?  If your methodological fit suggests a mixed-methods research project, does your worldview color your choice?


Quasi-experimental

In the quantitative methodology, there are experimental designs (which test the impact of a treatment on an outcome while using a control group to see whether the tested variable actually has an effect), quasi-experimental designs (which use non-random assignment but still measure the impact on an outcome), and non-experimental designs (which generalize or make inferences about a population) (Creswell, 2013).

For a non-experimental project design, surveys are used as an instrument to gather data and produce quantitative/numeric data that identify trends and sentiment from a sample of a population (Creswell, 2013).  The Pew Research Center (2015), wanting to analyze changing attitudes on gay marriage a few days after the Supreme Court struck down the bans as unconstitutional, asked:

Do you oppose/favor allowing gays and lesbians to marry legally? What is your current age? What is your Religious Affiliation? What is your Political Party?  What is your Political Ideology? What is your Race? What is your gender?

Pew found that, since it began conducting this survey in 2001, every descriptive variable used to classify respondents has shown an increase in acceptance of same-sex marriage, with 55% overall in favor versus 39% opposed.  This example is not trying to explain a relationship but rather to describe a trend.

An experimental project design usually follows these steps: identify participants, gather materials, draft and finalize procedures, and set up measures so that you can conduct the experiment and derive results from it (Creswell, 2013).

When participants in a study are randomly assigned to a control group or to other groups, it is considered a true experiment; if the participants are not randomly assigned, it is considered a quasi-experiment (Creswell, 2013).  In the famous Milgram obedience experiment (1974), an ad was posted to recruit participants for a study on memory, but in fact the study examined whether the presence of authority would override participants' internal morals, leading them to inflict pain and, they believed, sometimes fatal shocks on another participant (an actor).  About two-thirds of people were willing to administer the most severe shock because an authority figure (a man in a lab coat) told them to continue the study.  Though this study would be hard to replicate today (due to IRB considerations), assignment was not fully random, so it is a quasi-experiment; nonetheless, it challenged and shocked the world and remains a pivotal paper/experiment in behavioral science.


Methodological fit

Do you know what methodology you should use for your research project?

If there is extensive literature on a topic, then, according to Edmondson and McManus (2007), one can contribute to a mature theory, and a quantitative methodology would be the best methodological fit. If one strays and uses a qualitative methodology in this case, one risks reinventing the wheel and may fail to fill a gap in the body of knowledge.

If there is only a little literature on a topic, then one can contribute to a nascent theory via qualitative methodologies, which in turn would be the best methodological fit (Edmondson & McManus, 2007).  If you run a quantitative research project here, you may be jumping the gun, risking false conclusions caused by confounding variables, and you may still fail to fill the gap in the body of knowledge.

Finally, one can stray from both pure qualitative and pure quantitative methodologies and conduct a mixed-methods study; this fits when there is enough research that the body of knowledge is no longer nascent, but not enough for it to be considered mature (Edmondson & McManus, 2007). Going only one route here would shortchange the gap in the body of knowledge, because you may miss key insights that each part of the mixed methodology (both qualitative and quantitative) can bring to the field.

So, before deciding which methodology to use, conduct an in-depth literature review; you cannot pick an appropriate methodology without knowing the body of knowledge.

Hint: the more quantitative research articles you find in a body of knowledge, the more likely your project will use either a mixed-methods approach (a modest number of articles) or a quantitative approach (a large number of articles). If you find none, you may be working toward a qualitative methodology.

Reference

  • Edmondson, A., & McManus, S. (2007). Methodological fit in management field research. Academy of Management Review, 32(4), 1155–1179.

Worldviews and Approaches to Inquiry

The four worldviews according to Creswell (2013) are postpositivism (akin to quantitative methods), constructivism (akin to qualitative methods), advocacy (akin to advocating for action), and pragmatism (akin to mixed methods).  There are positives and negatives to each worldview. Pragmatists use whatever truths and methods, from wherever, work at the time to get the results they need, though pragmatist research takes time to conduct.  The advocacy worldview places importance on creating an action item for social change to diminish inequity gaps in asymmetric power relationships, such as those tied to class structure and minority status.  Though this research is noble, the moral arc of history bends toward justice very slowly: it took centuries for racial equality to reach where it is today, over 60 years for gender equality, and about 40 years for LGBT equality, and inequalities between these groups and the majority remain unresolved (for instance, equal pay for equal work for all, employment and housing non-discrimination for LGBT people, and racial profiling).  Constructivist researchers seek to understand the world around them through subjective means; they use their own understanding and interpretation of participants' historical and cultural settings to shape their interpretation of the open-ended data they collect, which can lead to interpretations shaped by the researcher's background rather than representative of the whole situation at hand.  Finally, postpositivists look at the world in numbers; knowing the limitation that not everything can be described numerically, they propose hypotheses that they can test and either reject or fail to reject. Numbers are imperfect and fallible.

My personal worldview is akin to a pragmatist worldview.  My background in math, science, technology, and management helps me synthesize ideas from multiple fields to drive innovation.  It has allowed me to learn rapidly because I can see how one field ties to another, and it makes me more adaptable.  However, I also lean a bit more strongly toward the math and science side of myself, which is a postpositivist view.


The Role of Theory

Theory is intertwined with the research process, so a thorough understanding of theory must include an understanding of the relationship between theory and research (Bryman & Bell, 2007).  When research takes a deductive role (developing and testing a problem and hypothesis), the theory is presented at the beginning: it is being tested, and it helps define the problem, its parameters (boundaries), and a hypothesis to test.  An inductive role, by contrast, uses data and research to build a theory.  Theories can be grand (too broad to pinpoint and test) or mid-range (easier to test, but still too big to test under all assumptions) (Bryman & Bell, 2007).

Where you place your theory depends on your worldview (positivism favors the beginning of the paper; constructivism the beginning or the end) (Creswell, 2013).  My particular focus is the postpositivist view (quantitative methods), so I will dissect the placement of theory primarily in a quantitative research study (which is mostly deductive in nature).  Placing the theory in the introduction, in the literature review, or after the hypotheses makes it harder for the reader to isolate the theory from those sections (Creswell, 2013).  Creswell (2013) notes another disadvantage of the after-the-hypotheses approach: you may forget to discuss the origins of and rationale for the theory.  As a research tip, Creswell (2013) suggests giving the theory its own section so that it is easily identified and its origin and rationale can be elaborated.

However, even with a separate theory section, a paper can still be rejected by a journal if the theory is hard for your peers and the editor to decipher.  Feldman's (2004) editorial states that if the question and theory are succinct, grammatically correct, non-trivial, and make a difference, your results are more likely to be published.  He also states (as many of our professors do) that we need to identify the key articles and references from the past five years, be exhaustive yet exclusive with our dataset, and establish clear boundary conditions so that we can adequately define independent and dependent variables (Feldman, 2004).  The latter set of conditions helps build your theory, whereas the first set speaks to its readability.  If your theory is hard to read because it is convoluted, why should anyone care to read it?

Resources:

  • Bryman, A., & Bell, E. (2007). Business research methods (2nd ed.). Oxford University Press.
  • Creswell, J. W. (2013). Research Design: Qualitative, Quantitative, and Mixed Methods Approaches, 4th Edition. [VitalSource Bookshelf version]. Retrieved from http://online.vitalsource.com/books/9781483321479/epubcfi/6/24
  • Feldman, D. C. (2004). What are we talking about when we talk about theory? Journal of Management, 30(5), 565–567.

Adv DB: CAP and ACID

Transactions

A transaction is a set of operations/transformations that carries a database or relational data set from one state to another.  Once completed and validated as successful, the end result is saved into the database (Panda et al., 2011).  Both ACID and CAP (discussed in further detail below) are known as integrity properties for these transactions (Mapanga & Kadebu, 2013).

Mobile Databases

Mobile devices have become prevalent and vital for many transactions when the end-user is unable to access a wired connection.  In that situation, the device retrieves and saves transaction information either over a wireless connection or in disconnected mode (Panda et al., 2011).  A problem for a mobile user creating transactions against a database is that bandwidth in a wireless network is not constant: when bandwidth is plentiful the end-user's data moves rapidly, and when it is scarce the data moves slowly.  A few transaction models can be used efficiently for mobile database transactions: the Report and Co-transactional model, the Kangaroo transaction model, the Two-Tiered transaction model, the Multi-database transaction model, the Pro-motion transaction model, and the Toggle transaction model.  This is by no means an exhaustive list of transaction models for mobile databases.

According to Panda et al. (2011), in the Report and Co-transactional model, transactions are completed bottom-up in a nested format, split between child and parent transactions.  Once a child transaction completes successfully, it feeds its results up the chain until it reaches the parent, and nothing is committed until the parent transaction completes.  Thus, a transaction can occur on the mobile device but is not fully applied until it reaches the parent database. In the Kangaroo transaction model, a mobile transaction manager collects and accepts transactions from the end-user and forwards (hops) each request to the database server.  Transactions in this model are made by proxy on the mobile device, and when the device moves from one location to the next, a new transaction manager is assigned and a new proxy transaction is produced. The Two-Tiered transaction model is inspired by data replication schemes: there is a master copy of the data plus multiple replicas.  The replicas live on the mobile device, and changes can be made to the master copy if the wireless connection is strong enough.  If the connection is not strong enough, the changes are made to the replicas, show as committed there, and are still made visible to other transactions.

The Multi-database transaction model uses asynchronous schemes, allowing a mobile user to disconnect and still coordinate the transaction.  Five queues are set up (input, allocate, active, suspend, and output), and nothing is committed until all five queues have been worked through. Pro-motion transactions come from nested transaction models, where some transactions are completed on fixed hosts and others on mobile hosts; when a mobile user is not connected to the fixed host, a command is triggered so that the transaction is completed on the mobile host, though carrying out this command is resource-intensive.  Finally, the Toggle transaction model relies on software over a pre-determined network and can operate across several database systems; changes made to the master (global) database can be presented to different mobile systems, and thus concurrency is maintained for all transactions across all databases (Panda et al., 2011).

At a cursory glance these models seem similar, but they vary strongly in how they implement the ACID properties in their transactions (see Table 1 in the next section).

ACID Properties and their flaws

Jim Gray introduced the ideas behind ACID transactions in the 1970s; they provide four guarantees: Atomicity (all-or-nothing transactions), Consistency (correct data transactions), Isolation (each transaction is independent of others), and Durability (committed transactions survive failures) (Mapanga & Kadebu, 2013; Khachana et al., 2011).  ACID is used to assure reliability in the database system when a transaction changes the state of the data.
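
As a concrete illustration of the atomicity guarantee, here is a minimal sketch using Python's built-in sqlite3 module; the accounts table, the balances, and the simulated failure are made up purely for the example.

```python
# Minimal sketch of atomicity with Python's built-in sqlite3 module.
# The accounts table and the simulated failure are purely illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance REAL)")
conn.execute("INSERT INTO accounts VALUES ('checking', 500.0), ('savings', 100.0)")
conn.commit()

try:
    with conn:  # one transaction: both updates commit together, or neither does
        conn.execute("UPDATE accounts SET balance = balance - 200 WHERE name = 'checking'")
        raise RuntimeError("connection lost mid-transfer")  # simulated failure
        conn.execute("UPDATE accounts SET balance = balance + 200 WHERE name = 'savings'")
except RuntimeError:
    pass  # the with-block rolled the partial update back

print(conn.execute("SELECT name, balance FROM accounts ORDER BY name").fetchall())
# [('checking', 500.0), ('savings', 100.0)] -- all or nothing
```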

This approach is perfect for small, centralized or distributed relational databases, but with the demand for mobile transactions, big data, and NoSQL, ACID can be a bit constricting.  The web consists of independent services connected together relationally, which are hard to maintain under strict ACID (Khachana et al., 2011).  An example of this is booking a flight for a CTU Doctoral Symposium.  One purchases a flight, but may also need another service related to the flight, like ground transportation to and from the hotel; the flight database is completely separate from the ground transportation system, yet sites like Kayak.com connect these databases and provide a friendly user interface for their customers.  Kayak.com has its own mobile app as well. Taking this example further, we can see how ACID, perfect for centralized databases, may not be the best fit for web-based services.  Another case to consider is mobile database transactions: because of connectivity issues and recovery plans, the models mentioned above cover only some of the ACID properties (Panda et al., 2011).  Through the lens of ACID, this is the flaw of mobile databases.

Model | Atomicity | Consistency | Isolation | Durability
Report & Co-transaction model | Yes | Yes | Yes | Yes
Kangaroo transaction model | Maybe | No | No | No
Two-tiered transaction model | No | No | No | No
Multi-database transaction model | No | No | No | No
Pro-motion model | Yes | Yes | Yes | Yes
Toggle transaction model | Yes | Yes | Yes | Yes

Table 1: A subset of the information found in Panda et al. (2011) on mobile database transaction models and whether they satisfy each of the ACID properties.

 

CAP Properties and their trade-offs

CAP stands for Consistency (as in ACID: all data transactions are correct and all users see the same data), Availability (users always have access to the data), and Partition Tolerance (the system keeps operating even when the network splits the database across servers, so there is no single point of failure); it was proposed in 2000 by Eric Brewer (Mapanga & Kadebu, 2013; Abadi, 2012).  These three properties are needed for distributed database management systems, and CAP is seen as a less strict alternative to Jim Gray's ACID properties. Unfortunately, a distributed database system can only fully guarantee two of the three properties at once, yielding CA, CP, or AP systems.  CP systems have a reputation for not being available all the time, which is contrary to fact: availability in a CP system is given up (or out-prioritized) only when partition tolerance is needed, and availability in a CA system can be lost if a partition of the data occurs (Mapanga & Kadebu, 2013). Though a system can only be best at two of the three, that does not mean the third property is absent; the restriction applies only to what is prioritized. In a CA system, ACID can be guaranteed alongside Availability (Abadi, 2012).

Partitions vary across distributed database management systems due to WAN links, hardware, network configuration parameters, levels of redundancy, and so on (Abadi, 2012).  Partitions are rare compared with other failure events, but they must be considered.

But the question remains for all database administrators: which of the three CAP properties should be prioritized above the others, particularly in a distributed database management system where partitions must be considered?  Abadi (2012) answers this question: for mission-critical data and applications, availability during partitions should not be sacrificed, so consistency must be relaxed for a while.

Amazon’s Dynamo & Riak, Facebook’s Cassandra, Yahoo’s PNUTS, and LinkedIn’s Voldemort are all examples of distributed database systems that can be accessed from a mobile device (Abadi, 2012).  According to Abadi (2012), latency (closely tied to availability) is critical to all of these systems, so much so that a 100 ms delay can significantly reduce an end-user's retention and future repeat transactions. Thus, availability during partitions is key not only for mission-critical systems but also for e-commerce.

Unfortunately, this tradeoff between consistency and availability arises from data replication and depends on how replication is done.  According to Abadi (2012), there are three ways to replicate data: send data updates to all replicas at the same time (high consistency enforced); send data updates to an agreed-upon location first, through synchronous or asynchronous schemes (availability depends on the scheme); or send data updates to an arbitrary location first, through synchronous or asynchronous schemes (availability again depends on the scheme).
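
The consistency cost of asynchronous replication can be sketched in a few lines of Python.  This is a toy model, not the replication protocol of any of the systems named here, but it shows how an asynchronous write keeps latency low while leaving a replica briefly stale.

```python
# Toy sketch (not any real system's replication protocol) contrasting
# synchronous and asynchronous replication of key-value updates.
class Replica:
    def __init__(self):
        self.data = {}

class Primary:
    def __init__(self, replica, synchronous):
        self.data = {}
        self.replica = replica
        self.synchronous = synchronous
        self.pending = []                      # updates not yet shipped to the replica

    def write(self, key, value):
        self.data[key] = value
        if self.synchronous:
            self.replica.data[key] = value     # replica updated before the write returns
        else:
            self.pending.append((key, value))  # acknowledged immediately: lower latency

    def flush(self):
        for key, value in self.pending:        # replication catches up later
            self.replica.data[key] = value
        self.pending.clear()

replica = Replica()
primary = Primary(replica, synchronous=False)
primary.write("seat_14A", "booked")
print(replica.data.get("seat_14A"))            # None -> stale read on the replica
primary.flush()
print(replica.data.get("seat_14A"))            # 'booked' -> eventually consistent
```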

According to Abadi (2012), PNUTS sends data updates to an agreed-upon location first through asynchronous schemes, which improves availability at the cost of consistency, whereas Dynamo, Cassandra, and Riak send data updates to an agreed-upon location first through a combination of synchronous and asynchronous schemes.  These three systems propagate data synchronously to a small subset of servers and asynchronously to the rest, which can cause inconsistencies.  All of this is done to reduce delays for the end-user.

Going back to the Kayak.com example from the previous section, consistency in the web environment should be relaxed (Khachana et al., 2011).  Expanding on that example, if several users access the services at the same time, each can decide which of these properties should be relaxed.  One may order a flight, hotel, and car and insist that nothing is booked until all services commit; another may be content with whichever car they get for ground transportation as long as the flight times and price are right. This can cause inconsistencies, lost information, or misleading information needed for proper decision analysis, but systems must be adaptable (Khachana et al., 2011).  They must take into account the wireless signal, the mode of transferring and committing data, and the load balancing of incoming requests (who gets a contested plane seat when there is only one left at that price).

At the end of the day, when it comes to CAP, availability is king: it will attract business or drive it away, so C or P must give in order to cater to the customer.  If I were designing this system, I would run an AP system but schedule partitioning work for when the load/demand on the database system is small (off-peak hours), to give the illusion of a CA system (because the consistency degradation will be seen by fewer people).  Off-peak hours don't really exist for global companies, mobile web services, or websites, but there are times of the year when transactions against the database system are lower than on normal days, and scheduling around those days is key.  For a mobile transaction system, I would select a Pro-motion transaction model, which helps comply with the ACID properties: make updates locally on the mobile device when services are down, and keep an ordered queue of other transactions waiting to be committed once wireless service is restored or a stronger signal is found.
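
Below is a minimal sketch of that local-queue idea (my own illustration, not the Pro-motion protocol itself): transactions submitted while disconnected are held in order and committed to the fixed host once connectivity returns.

```python
# Minimal sketch of queuing mobile transactions while offline and replaying
# them on reconnect. The transaction payloads are hypothetical.
from collections import deque

class MobileTransactionQueue:
    def __init__(self, commit_to_server):
        self.commit_to_server = commit_to_server  # callable that reaches the fixed host
        self.pending = deque()
        self.connected = False

    def submit(self, transaction):
        if self.connected:
            self.commit_to_server(transaction)
        else:
            self.pending.append(transaction)      # applied locally, committed later

    def reconnect(self):
        self.connected = True
        while self.pending:                       # replay in submission order
            self.commit_to_server(self.pending.popleft())

queue = MobileTransactionQueue(commit_to_server=lambda t: print("committed:", t))
queue.submit({"op": "debit", "account": "checking", "amount": 120.0})
queue.submit({"op": "credit", "account": "mortgage", "amount": 120.0})
queue.reconnect()   # both queued transactions commit once service returns
```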

References

  • Abadi, D. J. (2012). Consistency tradeoffs in modern distributed database system design: CAP is only part of the story. Computer, 45(2), 37–42.
  • Khachana, R. T., James, A., & Iqbal, R. (2011). Relaxation of ACID properties in AuTrA, The adaptive user-defined transaction relaxing approach. Future Generation Computer Systems, 27(1), 58-66.
  • Mapanga, I., & Kadebu, P. (2013). Database Management Systems: A NoSQL Analysis. International Journal of Modern Communication Technologies & Research (IJMCTR), 1, 12-18.
  • Panda, P. K., Swain, S., & Pattnaik, P. K. (2011). Review of some transaction models used in mobile databases. International Journal of Instrumentation, Control & Automation (IJICA), 1(1), 99-104.

Adv DB: Key-value DBs

NoSQL and Key-value databases

A recap from my last post: “Not Only SQL” databases, best known as NoSQL, include aggregate-oriented databases such as key-value, document, and column-family stores (Sadalage & Fowler, 2012). Aggregates are related sets of data that we would like to treat as a unit (MUSE, 2015c). Relationships between units/aggregates are captured in the relational mapping (Sadalage & Fowler, 2012). A key-value database maps aggregate data to a key; the aggregate is stored as the value embedded under that key.

Consider a bank account: my Social Security number may be used as the key to bring up all my accounts (my checking, my two savings, and my mortgage loan).  The aggregate is my account, but the savings, checking, and mortgage loan accounts behave differently and can exist on different databases distributed across different physical locations.
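
A minimal sketch of that idea in plain Python (a dict standing in for the key-value store; the key and the account figures are made up): one key returns the whole aggregate, and the store itself never looks inside the value.

```python
# Plain dict standing in for a key-value store: the key maps to the whole
# account aggregate, and the store treats the value as an opaque unit.
accounts = {
    "ssn:123-45-6789": {                      # hypothetical key
        "checking": {"balance": 1500.25},
        "savings": [{"balance": 8000.00}, {"balance": 2500.00}],
        "mortgage_loan": {"principal": 210000.00, "rate": 0.0375},
    }
}

aggregate = accounts["ssn:123-45-6789"]       # everything comes back as one unit
print(aggregate["savings"][1]["balance"])     # 2500.0
```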

These NoSQL databases can be schemaless, meaning data can be stored without any predefined schema.  NoSQL is best for application-specific databases, not a substitute for all relational databases (MUSE, 2015b).  NoSQL databases can also have an implicit schema, where the data definition lives in the application that reads and writes the data rather than in the database itself.

MapReduce & Materialized views

According to Hortonworks (2013), MapReduce's process at a high level is: Input -> Map -> Shuffle and Sort -> Reduce -> Output.

Jobs: mappers create and process transactions on a data set filed away in a distributed system and place the wanted data into a map/aggregate under a certain key.  Reducers know what the key values are, take all the values stored under a given key across the different nodes of the cluster from the mappers, and reduce them to the relevant result (MUSE, 2015a; Hortonworks, 2013). Different reducers can work on different keys.

Benefit: MapReduce knows where the data is placed, so it moves the tasks/computations to the data (to whichever node in the distributed system holds it).  Without MapReduce, tasks/computations take place only after moving data from one place to another, which can eat up computational resources (Hortonworks, 2013).  From this, we know that the data is stored across a cluster of multiple processors, and MapReduce maps the data (generates new data sets stored as key-value pairs) and reduces it (data from one or more maps is reduced to a smaller set of key-value pairs) (MUSE, 2015a).

Other advantages: map and reduce functions can work independently, while the grouper (which groups key-values by key) and the master (which divides the work among the nodes in a cluster) coordinate all the actions, and the whole pipeline can run very fast (Sathupadi, 2010).  However, depending on how the task is divided, the work of the mapping and reducing functions can vary greatly among the nodes in a cluster.  Nothing has to happen in sequential order, and a node can be a mapper and/or a grouper at any point in the transaction request.

A great example of a MapReduce request is to look at all CTU graduate students and sum their current outstanding school loans per degree level.  The final output from this example would be the doctoral students' total outstanding school loan amount and the master's students' total outstanding school loan amount.  If I ran this in Hadoop, I could use 50 nodes to process the request.  The "bad" data thrown out in the mapper phase would be the undergraduate students.  Doctoral students get one key and master's students another, the same keys on every node, so that the sums of outstanding school loan amounts are processed under the correct group.
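
Here is a minimal, single-machine sketch of that example in Python; the student records are invented, and a real Hadoop job would run the same map and reduce logic spread across many nodes.

```python
# Single-machine sketch of the map -> shuffle/sort -> reduce flow described above.
# The student records are made up for illustration.
from collections import defaultdict

students = [
    {"level": "Doctoral", "loan": 20000.0},
    {"level": "Masters", "loan": 12000.0},
    {"level": "Undergraduate", "loan": 9000.0},   # filtered out by the mapper
    {"level": "Doctoral", "loan": 15000.0},
]

def mapper(record):
    # Emit (key, value) pairs only for the degree levels we care about.
    if record["level"] in ("Doctoral", "Masters"):
        yield record["level"], record["loan"]

# Shuffle/sort: group all emitted values by key (Hadoop does this between map and reduce).
groups = defaultdict(list)
for record in students:
    for key, value in mapper(record):
        groups[key].append(value)

def reducer(key, values):
    return key, sum(values)   # total outstanding loans per degree level

print(dict(reducer(k, v) for k, v in groups.items()))
# {'Doctoral': 35000.0, 'Masters': 12000.0}
```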


Adv DB: Document DBs

Main concepts

Data models are how we see, interact with, and transform our data in a system like a database (MUSE, 2015). To a developer, a data model usually means an ERD, whereas a metamodel describes how a database organizes data; NoSQL does this in four key ways: key-value, document, column-family, and graph databases (although graph databases are not aggregate-oriented) (Sadalage & Fowler, 2012).

In relational data models, tuples are sets of values (divided and stored information) that cannot be nested or placed within one another, so all operations must be thought of as reading or writing tuples.  With aggregate data models (key-value, column-family, and document stores), we want to work with something more complex than bare tuples (Sadalage & Fowler, 2012). Aggregates are related sets of data that we would like to treat as a unit (MUSE, 2015). Relationships between units/aggregates are captured in relational mapping, and a relational or graph database has no idea that the aggregate exists, which is why such databases are called “aggregate-ignorant” (Sadalage & Fowler, 2012).

Let's consider UPS.  For transactions like those on amazon.com or ebay.com, a distributor only needs to know the shipping address, whereas paypal.com or your bank cares about the billing address in order to credit your account.  UPS must collect both.  Thus, in their relational models, UPS may have an ERD with two entities: Billing Address and Shipping Address.  Naturally, we can group these into one unit (aggregate) called Address, with an indicator/key stating which address is which, so I can query that key for shipping addresses.

Finally, atomic operations are supported on a single aggregate at a time, and ACID is not followed for transactions across multiple aggregates at a time (Sadalage & Fowler, 2012).

Document Databases

A document database can look inside the structure of an aggregate, so a query can return a subset/part of the aggregate rather than the whole unit (Sadalage & Fowler, 2012). You can think of this as retrieving a chapter or a section of a document (MUSE, 2015).  Documents can be limited by size restrictions and by what can be placed in them (structure and type).  People blur the line between document and key-value databases by adding an ID field, but for the most part you query a document database on its contents rather than looking up a key or ID (Sadalage & Fowler, 2012).
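
As a sketch of what such a query looks like, here is the UPS-style address aggregate stored and queried with pymongo.  This assumes a local MongoDB server and the pymongo package, and the database, collection, and field names are hypothetical.

```python
# Sketch of storing a document aggregate and querying a subset of it with pymongo.
# Assumes a MongoDB server on localhost; names and values are illustrative.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
orders = client["shipping_demo"]["orders"]

# Store one order aggregate with nested shipping and billing addresses.
orders.insert_one({
    "order_id": 1001,
    "customer": "J. Doe",
    "address": {
        "shipping": {"street": "1 Main St", "city": "Denver", "zip": "80202"},
        "billing": {"street": "9 Bank Ave", "city": "Denver", "zip": "80203"},
    },
})

# Query inside the aggregate and return only the shipping-address subset.
doc = orders.find_one({"order_id": 1001}, {"address.shipping": 1, "_id": 0})
print(doc)   # {'address': {'shipping': {...}}}
```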

Pros and Cons of Aggregate Data model

Working in aggregates makes data manipulation, replication, and sharding easier, because the aggregate is the unit that gets moved and distributed; without aggregates, the database would have to inspect every piece of data.  So aggregates help in some cases and not in others.  Also, there is no single correct place to draw aggregate boundaries, which can add to the complexity of understanding a data model.  Aggregates are great if we want to run our transactions on as few nodes of a cluster as possible, since dealing with whole units is easier on a cluster (Sadalage & Fowler, 2012).  They are not great for mapping out relationships between units of different formats (MUSE, 2015).


Adv DB: NoSQL DB

Emergence

Relational databases will persist because of ACID, ERDs, concurrency control, transaction management, and SQL capabilities.  It also helps that major software easily integrates with these databases.  The reason so many new approaches keep popping up is the impedance mismatch: the resource cost computational systems pay when data is pulled and pushed between in-memory structures and the database, a cost that compounds quickly with big amounts of data.  Industry wants and needs parallel computing on clusters to store, retrieve, and manipulate large amounts of data.  Data can also be aggregated into units of similar items, and strict data consistency can be relaxed in real-life applications, since transactions can actually be divided into multiple phases (MUSE, 2015a).

Think of a bank transaction: not everything you do at the same time gets processed at the same time.  Transactions may show up on your mobile device (a mobile database) but not be committed until hours or days later.  In my case, the bank withdraws my mortgage payment from my checking account on the first of every month but applies it to the loan on the second, so for 24 hours my payment is pending.

The aforementioned ideas created a movement to support “Not Only SQL” databases, best known as NoSQL, a name derived from the Twitter hashtag #NoSQL.  NoSQL includes aggregate-oriented databases such as key-value, document, and column-family stores, as well as aggregate-ignorant databases such as graph databases (Sadalage & Fowler, 2012). These can be schemaless databases, where data can be stored without any predefined schema.  NoSQL is best for application-specific databases, not a substitute for all relational databases (MUSE, 2015b).

Originally the term referred to open-source, distributed, nonrelational databases such as Voldemort, Dynomite, CouchDB, MongoDB, and Cassandra, but it has expanded in its definition and in the applications/platforms it can take on.  CQL, from Cassandra, was written to act like SQL in most cases but to behave differently when needed (Sadalage & Fowler, 2012), hence the “Not Only” in NoSQL.

Suitable Applications

According to Cassandra Planet (n.d.), NoSQL is best for large data sets (big data, complex data, and data mining):

  • Graph: where data relationships are graphical and interconnected like a web (ex: Neo4j & Titan)
  • Key-Value: data is stored and indexed by a key (ex: Cassandra, DynamoDB, Azure Table Storage, Riak, & BerkeleyDB)
  • Column Store: stores tables as columns rather than rows (ex: HBase, BigTable, & HyperTable)
  • Document: can store more complex data, with each document having a key (ex: MongoDB & CouchDB).

System Platform

Relational databases carry the resource cost described above, so as industry deals with ever-larger amounts of data, it gravitates toward NoSQL.  To process all that data, we may need parallel computing on clusters to store, retrieve, and manipulate it.
