Adv DB: Conducting data migration to NoSQL databases

Relational database schema design (primarily via ERDs) is all about creating models and then translating them into a normalized schema, but one must be an oracle to anticipate a holistic, end-to-end design, or else suffer when making changes to the database (Scherzinger et al, 2013).  Relational databases are also poor at data replication, horizontal scalability, and high availability (Schram & Anderson, 2012).  Thus, waterfall approaches to database design are no longer advantageous; like software, databases can be designed with an agile mentality, especially since data store requirements are always evolving.  Databases that adopt a "schema-less" approach (where data can be stored without any predefined schema) or an "implicit schema" (where the data definition is carried by the application that places the data into the database) in "Not Only SQL" (NoSQL) allow for agile development on a release cycle that can vary from yearly to monthly, weekly, or daily, depending entirely on the developers' iteration cycle (Sadalage & Fowler, 2012).  Looking at an agile development lifecycle for a blogging site (below) shows how useful schema-less or implicit schemas in NoSQL database development can become, as well as the technical debt that is created, which can cause migration issues down the line.

Blogging

We start a blogging site called "blog.me" in an agile environment, which means iterative improvements where each iteration produces a releasable product (even if we decide not to make a release or update at the end of the iteration).  The programming team has decided that the minimum viable product will consist of title and content fields for the blogger, plus comments from other people.  This is similar to an example proposed by Scherzinger et al (2013) to explain how implicit schemas work.  In the second iteration, the programming team for "blog.me" has discovered abuse of the commenting section of the blog.  People have been "trolling" the blog, so to mitigate this, the team implements a sign-in process with a username and password taken from Facebook, which also allows liking a post.  Rather than having bloggers recreate their content, the programmers apply this update to current and future posts.  In a third iteration, the programming team decides to institute a uniform nomenclature for some of their fields.  Rather than changing all the posts from the first two iterations, the programmers decide to enforce these changes moving forward.
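
To make the implicit-schema idea concrete, here is a minimal sketch (not taken from the cited papers) of what the stored blog documents might look like after each iteration; the field names are hypothetical.

```python
# Hypothetical blog.me documents as they might sit in a document store.
# Each iteration adds or renames fields without rewriting older posts.

post_iteration_1 = {
    "title": "My first post",
    "content": "Hello world!",
    "comments": ["Nice post!"],
}

post_iteration_2 = {
    "title": "My second post",
    "content": "More thoughts...",
    "comments": [{"user": "fb:1234", "text": "Great read"}],  # commenters now sign in
    "likes": 3,                                               # Facebook likes added
}

post_iteration_3 = {
    "postTitle": "My third post",       # renamed under the new uniform nomenclature
    "postContent": "Even more ideas",
    "comments": [],
    "likes": 0,
}

# All three shapes coexist in the same collection -- the technical debt that
# the application's implicit schema has to manage.
```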

Now, one can see how useful schema-less development (provided by NoSQL) can become.  There is no downtime to how the site operates, and no additional burden is placed on end-users when an update occurs.  But we now have to worry about migrating these three data classes (or, as Scherzinger et al call it, technical debt), and if a commenter comments on a post made in iteration one or two after iteration three has been implemented, we may end up with four or five different data classes.  Developers love to write code and add new features rather than maintain code, which is why this form of developing a database is great, but as we can see, technical debt can pile up quickly.  Our goal is to manage a schema for this data while keeping the flexibility of a schema-less database system.

Types of Migration

The migration of data in and out of a data store is usually enabled through a replication scheme (Shirazi et al, 2012) conducted through an application.  There are two primary types of data migration per Scherzinger et al (2013): eager and lazy.  Eager migration migrates all the data in a batched fashion: each entity is retrieved from the data store one by one, transformed, and written back.  As data grows larger, eager migration becomes resource-intensive and can be wasted effort, since much of the migrated data may be stale.  Thus, the lazy approach is a viable alternative: a transformation is performed only when a piece of data is touched, so only live, "hot" (relevant) data is updated.  Even though this approach saves resources, if an entity becomes corrupted there may be no way to retrieve it.  In order to do the migration, an application needs to impose an "implicit schema" on the "schema-less" data.
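
A minimal sketch of the two strategies, assuming a simple dict-backed data store and a hypothetical `transform` function standing in for the real schema upgrade; a real store would page through records and handle failures.

```python
# Hypothetical in-memory "data store": key -> entity (a dict).
store = {
    "post/1": {"title": "Old post", "content": "..."},
    "post/2": {"title": "Another", "content": "..."},
}

def transform(entity):
    """Upgrade an entity to the latest implicit schema (illustrative only)."""
    if "title" in entity:                      # old field name
        entity["postTitle"] = entity.pop("title")
    entity.setdefault("schemaVersion", 2)
    return entity

def eager_migration(store):
    """Eager: read every entity in a batch, transform it, write it back."""
    for key, entity in store.items():
        store[key] = transform(entity)

def lazy_read(store, key):
    """Lazy: transform an entity only when it is touched."""
    entity = store[key]
    if entity.get("schemaVersion", 1) < 2:
        entity = transform(entity)
        store[key] = entity                    # write back the migrated entity
    return entity
```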

NoSQL and its multiple flavors

NoSQL databases deal with aggregate data (related units of data, which could otherwise be relationally mapped, treated as a single unit) using key-value, document, and column-family databases (Scherzinger et al, 2013; Sadalage & Fowler, 2012; Schram & Anderson, 2012).  There are also graph databases (Sadalage & Fowler, 2012).  Key-value databases store data as a unique key and an opaque value, while document databases store documents (or parts of them) as the value (Scherzinger et al, 2013).  People can blur the line between these and key-value databases by adding an ID field, but for the most part you query a document database rather than look up a key or ID (Sadalage & Fowler, 2012).  Column-family databases, by contrast, store information in transposed table structures (as columns rather than rows).  Graph databases can show relationships in huge datasets that are highly interconnected; the emphasis is on the complexity of the data rather than its size (Shirazi et al, 2012).  A further example of a graph database is shown in the Health section in the following pages.  Migrating between the multiple flavors of NoSQL databases allows one to exploit the strengths and mitigate the weaknesses of each type when it comes to analyzing large data quickly.
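
To illustrate the flavors, here is a minimal sketch of the same blog post expressed in each style; these are illustrative shapes, not any vendor's actual API.

```python
# Key-value store: an opaque value looked up only by its key.
key_value = {"post/1": b'{"postTitle": "Hello", "likes": 3}'}

# Document store: the value is a queryable document (e.g., find posts with likes > 2).
document = {"_id": "post/1", "postTitle": "Hello", "likes": 3,
            "comments": [{"user": "fb:1234", "text": "Great read"}]}

# Column-family style: data grouped by column families, addressed by row key + column.
column_family = {
    "post/1": {
        "info":  {"postTitle": "Hello"},
        "stats": {"likes": "3"},
    }
}

# Graph style: nodes and labelled edges, emphasizing relationships over size.
graph_edges = [("user:fb:1234", "COMMENTED_ON", "post/1")]
```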

Data Migration Considerations and Steps

Since data migration uses replication schemes driven by an application, one must consider how complex the equivalent SQL query would be if this were a relational database scheme (Shirazi et al, 2012).  This has implications for how complex transforming or migrating the data will be under NoSQL databases, especially when big data is introduced into the equation.  Thus, the pattern of database design must be taken into account when migrating data from relational databases to NoSQL databases, or between different NoSQL database types (or even providers).  Also, each of these database types treats NULL values differently: some NoSQL databases don't waste the storage space and simply ignore NULL values, some treat them as relational databases do, and some allow them but don't query on them (Scherzinger et al, 2013).  Scherzinger et al (2013) suggest that when migrating data, the data model (data stored in the database that belongs to an object or group, which can have several properties), the query model (how data is inserted, transformed, and deleted based on a key-value or some other kind of identification), and freedom from schema (whether the global structure of the data is fixed in advance) must all be taken into account.  Schram & Anderson (2012) likewise stated that data models are key when making design changes (migrations) between database systems.  Since NoSQL data is "schema-less" there may not be any global structure, but applications (such as web user interfaces) built on top of the data stores embody an implicit structure, and from that we can list a few steps to consider when migrating data (Tran et al, 2011), with a minimal sketch of the migration and test steps after the list:

  • Installation and configuration
    1. Set up development tools and environment
    2. Install and set up environments
    3. Install third-party tools
  • Code modification
    1. Set up database connections
    2. Database operation query (if using a NoSQL database)
    3. Any required modifications for compatibility issues
  • Migration
    1. Prepare the database for migration
    2. Migrate the local database to the NoSQL database (the schema-less part)
    3. Prepare system for migration
    4. Migrate the application (the implicit-schema part)
  • Test (how to ensure the data stored in the databases matches the "Implicit Schema" embedded in the applications when the "Implicit Schema" has experienced a change)
    1. Test if the local system works with a database in NoSQL
    2. Test if the system works with databases in NoSQL
    3. Write test cases and test for functionality of the application in NoSQL
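
The sketch below walks through the migration and test steps above under simple assumptions: the source is a small SQLite table and the target "NoSQL collection" is a plain Python list standing in for a real driver; table and field names are hypothetical.

```python
import sqlite3

# Prepare: a toy relational source (step: prepare the database for migration).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT, content TEXT)")
conn.execute("INSERT INTO posts VALUES (1, 'Hello', 'First post')")
conn.commit()

# Migrate: turn each relational row into a schema-less document.
nosql_collection = []                      # stands in for a NoSQL collection
for row in conn.execute("SELECT id, title, content FROM posts"):
    nosql_collection.append({"_id": row[0], "postTitle": row[1], "postContent": row[2]})

# Test: check the documents match the implicit schema the application expects.
REQUIRED_FIELDS = {"_id", "postTitle", "postContent"}
assert all(REQUIRED_FIELDS <= doc.keys() for doc in nosql_collection)
print(f"Migrated and verified {len(nosql_collection)} documents")
```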

When doing code modification (step 2) for a move from a relational database to a NoSQL database, more changes will be required, and JOIN operations may not be fully supported.  Thus, additional code may be required in order to maintain the serviceability of the application pre-migration, during migration, and post-migration (Tran et al, 2011).  Considering ITIL Service Transition standards, the best time to do a migration or update is in windows of minimum end-user usage, while still maintaining agreed-upon minimum SLA standards.  Schram & Anderson (2012) stated that they did not want their service to break while they were migrating their data from a relational database to a NoSQL column-family database.  Other issues, like compatibility between the systems housing the databases or even between database types, can also add complexity to a migration.  When migrating (step 3), SQL scripts need to be transformed as well, to align with the new database structure, environment, etc. (Tran et al, 2011); third-party tools can help with this to a degree.  If the planning phase was conducted correctly, this phase should be relatively smooth.  Tran et al (2011) stated that there are at least eight factors that drive the cost of migration, including the project team's capability, application/database complexity, existing knowledge and experience, selecting the correct database and database management system, compatibility issues, database features, and connection issues during migration.
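
Because JOINs may not be fully supported, the application typically either embeds the related data or joins it in code. A minimal sketch of both options under those assumptions (hypothetical collections and field names):

```python
# Option 1: denormalize -- embed the comments inside each post document, so no JOIN is needed.
post_doc = {
    "_id": 1,
    "postTitle": "Hello",
    "comments": [{"user": "fb:1234", "text": "Great read"}],
}

# Option 2: application-side join -- fetch both collections and stitch them together in code.
posts = [{"_id": 1, "postTitle": "Hello"}]
comments = [{"post_id": 1, "user": "fb:1234", "text": "Great read"}]

def join_posts_and_comments(posts, comments):
    by_post = {}
    for c in comments:
        by_post.setdefault(c["post_id"], []).append(c)
    return [{**p, "comments": by_post.get(p["_id"], [])} for p in posts]

print(join_posts_and_comments(posts, comments))
```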

Health

A database called HealthTable was created from 7.2M medical reports in order to understand human diseases.  Shirazi et al (2012) decided to convert this column store into a graph database, Health Infoscape (Table 1 to Figure 1).  Each cause/symptom stems from a disease (Dx), and the aforementioned power of graph databases is shown here, facilitating data analysis, even though column-family databases provide an easier way to maintain the 7.2M data records.

Table 1. HealthTable in HBase, per Shirazi et al (2012). Each cell is shown as value (t), where t is the cell timestamp; Cause2 and Cause3 are empty for all rows.

Row key | Info: Name | Info: Category | Prevalence: Female | Prevalence: Male | Prevalence: Total | Causes: Cause1
D1 | Heartburn (1) | Digestive system (1) | 9.4% (1) | 9% (1) | 9.2% (1) | D2 (2)
D2 | Chest Pain (3) | Circulatory System (3) | 6.8% (3) | 6.8% (3) | 6.8% (3) | -
D4 | Dizziness (5) | Nervous System (5) | 4% (5) | 2.8% (5) | 3.5% (5) | -


Figure 1. HealthGraph based on HealthTable.
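
A minimal sketch of the column-store-to-graph conversion described above, using the HealthTable rows as plain Python structures rather than the actual HBase or graph database APIs:

```python
# HealthTable rows keyed by disease id; 'causes' lists other row keys (illustrative).
health_table = {
    "D1": {"name": "Heartburn",  "category": "Digestive system",   "causes": ["D2"]},
    "D2": {"name": "Chest Pain", "category": "Circulatory system", "causes": []},
    "D4": {"name": "Dizziness",  "category": "Nervous system",     "causes": []},
}

# Convert each row into graph nodes and CAUSES edges (the HealthGraph of Figure 1).
nodes = {key: row["name"] for key, row in health_table.items()}
edges = [(cause, "CAUSES", key)
         for key, row in health_table.items()
         for cause in row["causes"]]

print(nodes)   # {'D1': 'Heartburn', 'D2': 'Chest Pain', 'D4': 'Dizziness'}
print(edges)   # [('D2', 'CAUSES', 'D1')]
```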

Conclusions

What these two use cases (Health and Blogging) show is that data migration can be quite complicated.  Schema-less databases allow for a more agile approach to development, whereas the alternative is best suited to waterfall.  However, with waterfall development slowly in decline, one must also migrate to other forms of development.  Though migrating applications/databases from relational databases to NoSQL can require a lot of coding because of compatibility issues, applications/databases can also migrate between different types of NoSQL databases.  Each database structure has its strengths and weaknesses, and migrating data between these databases can provide opportunities for knowledge discovery from the data contained within them.  Migrating between database systems and NoSQL types should be conducted if it fulfills the requirements and promises to reduce the cost of maintenance (Schram & Anderson, 2012).

References

  • Sadalage, P. J., & Fowler, M. (2012). NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence, 1st Edition. [VitalSource Bookshelf Online]. Retrieved from https://bookshelf.vitalsource.com/#/books/9781323137376/
  • Scherzinger, S., Klettke, M., & Störl, U. (2013). Managing schema evolution in NoSQL data stores. arXiv preprint arXiv:1308.0514.
  • Schram, A., & Anderson, K. M. (2012). MySQL to NoSQL: data modeling challenges in supporting scalability. In Proceedings of the 3rd annual conference on Systems, programming, and applications: software for humanity (pp. 191-202). ACM.
  • Shirazi, M. N., Kuan, H. C., & Dolatabadi, H. (2012, June). Design Patterns to Enable Data Portability between Clouds’ Databases. In Computational Science and Its Applications (ICCSA), 2012 12th International Conference on (pp. 117-120). IEEE.
  • Tran, V., Keung, J., Liu, A., & Fekete, A. (2011, May). Application migration to cloud: a taxonomy of critical factors. In Proceedings of the 2nd international workshop on software engineering for cloud computing (pp. 22-28). ACM.

Adv DB: Data Services in the Cloud Service Platform

Rimal et al (2009) state that cloud computing systems are a disruptive service that has gained momentum.  What makes them disruptive is that they have similar properties to prior technology while adding new features and capabilities (like big data processing) in a very cost-effective way.  Cloud computing has become part of the XaaS family ("X as a Service", where X can be infrastructure, hardware, software, etc.).  According to Connolly & Begg (2014), Data as a Service (DaaS) and Database as a Service (DBaaS) are both considered cloud-based solutions.  DaaS doesn't use SQL interfaces, but it does enable corporations to access data to analyze value streams that they own or can easily expand into.  DBaaS must be continuously monitored and improved, because such services usually serve multiple organizations, with the added benefit of providing charge-back functions per organization (Connolly & Begg, 2014) or a pay-for-use model (Rimal et al, 2009).  However, one must pick the solution that best serves the business's/organization's needs.

Benefits of the service

Connolly & Begg (2014) state that cloud computing offers benefits such as: cost reduction due to lower CapEx; the ability to scale up or down based on data demands; security requirements that can make data stored there more secure than in-house solutions; 24/7 reliability; faster development time, because time is not taken away building the DBMS from scratch; and, finally, sensitivity/load testing that can be made readily available and cheap because of the lower CapEx.  Rimal et al (2009) state that the benefits come from lower costs, improved flexibility and agility, and scalability.  You can set up systems as quickly as you wish under a pay-as-you-go model, which is great for disaster recovery efforts (as long as the disaster affected your systems, not theirs).  Other benefits can be seen in an article looking at Data-as-a-Service in the health field: a low cost to implement and maintain databases, defragmentation of data, exchange of patient data across the health provider organization, and a mode for standardizing data types, forms, and the frequency at which data is captured.  From a health-care perspective, it could support research, strategic planning of medical decisions, improved data quality, reduced cost, fewer resource scarcity issues from an IT perspective, and, finally, better patient care (AbuKhousa et al, 2012).

Problems that can arise because of the service

Unfortunately, there are two sides to the coin.  A cloud service creates network dependency: if the supplier has a power outage, the consumer will not have access to their data.  Other network dependencies occur during peak service hours, when service tends to be degraded compared to using the supplier during off-peak hours.  Quite a few organizations use Amazon EC2 (Rimal et al, 2009) as their cloud DBMS, so if that system is hacked or its security is breached, the problem is bigger than if the breach had been carried out against only one company.  There are also system dependencies, as in the case of disaster recovery (AbuKhousa et al, 2012): organizations are only as strong as their weakest link in a disaster, and if the point of failure is the service and there are no other mitigation plans, the organization may have a hard time recuperating its losses.  Also, by placing data into these services, you lose control over the data (over availability, reliability, maintainability, integrity, confidentiality, intervenability, and isolation) (Connolly & Begg, 2014).  Rimal et al (2009) give clear examples of outages in services like Microsoft (down 22 hours in 2008) or Google Gmail (2.5 hours in 2009).  This lack of control is perhaps one of the main reasons why certain government agencies have had a hard time adopting commercial cloud service providers; however, they are attempting to create internal clouds.

Architectural view of the system

The overall cloud architecture is a layered system that serves as a single point of contact and delivers software applications over the web, using an infrastructure that draws on the necessary hardware resources to complete a task (Rimal et al, 2009).  Adapted from Rimal et al (2009), the figure below is what these authors describe as the layered cloud architecture.

  • Software-as-a-Service (SaaS)
  • Platform-as-a-Service (PaaS): developers implementing cloud applications
  • Infrastructure-as-a-Service (IaaS): (virtualizations, storage networks) as-a-Service
  • Hardware-as-a-Service (HaaS)

Figure 1. A layered cloud architecture (top layer to bottom layer), adapted from Rimal et al (2009).

A cloud DBMS can be private (an internal provider where the data is held within the organization), public (an external provider manages resources dynamically across its system and across the multiple organizations supplying it data), or hybrid (consisting of multiple internal and external providers), as defined in Rimal et al (2009).

A problem with DBaaS is that databases belonging to multiple organizations are stored by the same service provider.  Since data can be what separates one organization from its competitors, they must consider the following architectures: separate servers; a shared server but separate database processes; a shared database server but separate databases; a shared database but separate schemas; or a shared database with a shared schema (Connolly & Begg, 2014).

Let's take two competitors, Walmart and Target: two giant organizations with billions of dollars and inventory flowing in and out of their systems monthly.  Let's also assume three database tables with automated primary keys and foreign keys that connect the data together: Product, Purchasing, and Inventory.  Another set of assumptions: (1) their data can be stored by the same DBaaS provider, and (2) their DBaaS provider is Amazon.

While Target and Walmart may use the same supplier for their DBaaS, they can select one of those five architectural solutions to keep their valuable data from being seen.  If Walmart and Target purchased separate servers, their data would be safe; they could also go this route if they need to store huge datasets and/or have many users of their data.  Now, if we narrow our assumptions to the children's toy section (breaking the datasets into manageable chunks), both Target and Walmart could store their data on a shared server but in separate database processes; they would not share any resources like memory or disk, just the virtual environment.  Choosing a shared database server but separate databases would allow for better resource management between the two organizations.  If Walmart and Target decided to use the same database (which is unlikely) but hold separate schemas, strong access management systems must be in place.  This option might be elected between CVS and Walgreens pharmacy databases, where patient data is vital but not impossible to switch from one schema to another; however, this interaction is highly unlikely.  The final structure, highly unlikely for most corporations, is sharing both the database and the schema.  This final architecture is best suited for hospitals sharing patient data (AbuKhousa et al, 2012), but HIPAA must still be observed under it.  Though this is the desired state for some hospitals, it may take years to get to a full system.  Security is a big issue here and will take a ton of development hours, but in the long run it is the cheapest solution available (Connolly & Begg, 2014).
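
A minimal sketch, using SQLite as a stand-in, of the two cheapest options above: a shared database with separate "schemas" (approximated here by per-tenant tables) versus a shared database and shared schema with a tenant column plus application-level access checks. Table and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Shared database, separate "schemas": each tenant gets its own tables.
conn.execute("CREATE TABLE walmart_inventory (product TEXT, qty INTEGER)")
conn.execute("CREATE TABLE target_inventory  (product TEXT, qty INTEGER)")

# Shared database, shared schema: one table, every row tagged with its tenant,
# and every query must filter on tenant_id (this is where access control is critical).
conn.execute("CREATE TABLE inventory (tenant_id TEXT, product TEXT, qty INTEGER)")
conn.executemany("INSERT INTO inventory VALUES (?, ?, ?)",
                 [("walmart", "toy-truck", 120), ("target", "toy-truck", 80)])

def tenant_inventory(conn, tenant_id):
    # The application layer enforces isolation by always scoping queries to one tenant.
    return conn.execute("SELECT product, qty FROM inventory WHERE tenant_id = ?",
                        (tenant_id,)).fetchall()

print(tenant_inventory(conn, "walmart"))   # only Walmart's rows
```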

Requirements on services that must be met

Looking at Amazon's cloud database offerings, the service should be easy to set up, easy to operate, and easy to scale.  The system should enhance the availability and reliability of live databases compared to in-house solutions.  There should be software to back up the database for recovery at any time.  Security patches and maintenance of the underlying hardware and software of the cloud DBMS should be reduced significantly for the customer, since that burden should not be placed onto the organization.  The goal of the cloud DBMS should be to shift development costs away from managing the DBMS and toward the applications and the data to be stored in the system.  The provider should also offer infrastructure and hardware as a service to reduce the overhead of managing these systems.  Amazon's Relational Database Service can use MySQL, Oracle, SQL Server, or PostgreSQL databases; Amazon's DynamoDB is a NoSQL database service; and Amazon's Redshift costs less than $1K per year to store a TB of data (Varia & Mathew, 2013).
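
As one illustration of the "easy to set up" claim, a minimal sketch that provisions a NoSQL table on Amazon DynamoDB with boto3; it assumes AWS credentials and a region are already configured, and the table and attribute names are hypothetical.

```python
import boto3

# Assumes AWS credentials and a default region are already configured.
dynamodb = boto3.client("dynamodb")

# Provision a table; patching, replication, and hardware are the provider's burden.
dynamodb.create_table(
    TableName="BlogPosts",                                     # hypothetical table name
    AttributeDefinitions=[{"AttributeName": "post_id", "AttributeType": "S"}],
    KeySchema=[{"AttributeName": "post_id", "KeyType": "HASH"}],
    BillingMode="PAY_PER_REQUEST",                              # pay-for-use, no capacity planning
)
dynamodb.get_waiter("table_exists").wait(TableName="BlogPosts")

dynamodb.put_item(
    TableName="BlogPosts",
    Item={"post_id": {"S": "post-1"}, "postTitle": {"S": "Hello"}},
)
```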

Issues to be resolved during implementation

Rimal et al (2010) raise some interesting things to consider before implementing a cloud DBMS.  What are the supplier's Service-Level Agreements (SLAs)?  The cloud DBMS may be up and running 24×7, but if the supplier experiences a power outage, what do its SLAs promise the organization so that the organization is not impacted at all?  What is the backup/replication scheme?  What is the discovery scheme (which assists in reusability)?  What is the load balancing approach (to avoid bottlenecks), especially since most suppliers cater to more than one organization?  What does the resource management plan look like?  Most cloud DBMSs keep several copies of the data spread across several servers, so how does the vendor ensure no data loss?  What types of security are provided?  What is the encryption and decryption strength for the data held within its servers?  How private will the organization's data be if it is hosted on the same server, or in the same database but a separate schema?  What are the authorization and authentication safeguards?  Varia & Mathew (2013) explain all the cloud DBMS services that Amazon provides, and these questions should be addressed for each of their solutions.  Thus, when analyzing a supplier for a cloud DBMS, having technical requirements that meet the Business Goals & Objectives (BG&Os) helps guide the organization to pick the right supplier and solution, given the issues that need to be resolved.  Other issues identified by Chaudhuri (2012) include: data privacy through access control, auditing, and statistical privacy; allowing data exploration to enable deeper analytics; data enrichment with Web 2.0 and social media; query optimization; scalable data platforms; manageability; and performance isolation for multiple organizations occupying the same server.

Data migration strategy & Security Enforcement

When migrating data from organizations into a cloud DBMS, the taxonomy of the data must be preserved.  Along with taxonomy, one must ensure that no data is lost in the transfer, that data remains available to the end-user before, during, and after the migration, and that the transfer is done in a cost-efficient way (Rimal et al, 2010).  Furthermore, data migration should be seamless and efficient enough that one could move between suppliers of services, so that a supplier does not become so entangled with the organization that it is the only solution the organization can see itself using.  Finally, what type of data do you want to migrate?  Mission-critical data may be too precious for certain types of cloud DBMS, but may be great for a virtualized disaster recovery system.  What type of data to migrate depends on the needs that led to a cloud system in the first place, and on which services the organization wants to pay for as it goes, now and in the near future.  The type of data to migrate may also depend on the security features provided by the supplier.

Organizational information is a vital resource to any organization, and controlling access to it and keeping it proprietary is key.  If this is not enforced, data like employee Social Security numbers or the credit card numbers of past consumers can be compromised.  Rimal et al (2009) compared the security considerations in the cloud systems of the time:

  • Google: uses 128 Bit or higher server authentication
  • GigaSpaces: SSH tunneling
  • Microsoft Azure: Token services
  • OpenNebula: Firewall and virtual private network tunnel

In 2010, Rimal et al further expanded the security considerations, suggesting that organizations should look into authentication and authorization protocols (access management), privacy and federated data, encryption and decryption schemes, etc.

References

  • AbuKhousa, E., Mohamed, N., & Al-Jaroodi, J. (2012). e-Health cloud: opportunities and challenges. Future Internet, 4(3), 621-645.
  • Chaudhuri, S. (2012, May). What next?: a half-dozen data management research goals for big data and the cloud. In Proceedings of the 31st symposium on Principles of Database Systems (pp. 1-4). ACM.
  • Connolly, T., & Begg, C. (2014). Database Systems: A Practical Approach to Design, Implementation, and Management, 6th Edition. [VitalSource Bookshelf version]. Retrieved from http://online.vitalsource.com/books/9781323135761/epubcfi/6/210
  • Rimal, B. P., Choi, E., & Lumb, I. (2009, August). A taxonomy and survey of cloud computing systems. In INC, IMS and IDC, 2009. NCM’09. Fifth International Joint Conference on (pp. 44-51). IEEE.
  • Rimal, B. P., Choi, E., & Lumb, I. (2010). A taxonomy, survey, and issues of cloud computing ecosystems. In Cloud Computing (pp. 21-46). Springer London.
  • Varia, J., & Mathew, S. (2013). Overview of amazon web services. Jan-2014.

Sample Literature Analysis

Article: Le, J. K., Tissington, P. A., & Budhwar, P. (2010). To move or not to move – a question of family? International Journal of Human Resource Management, 21(1), 17–45. CYBRARY – Business Source Premier

Issue or problem

The primary issue that Le et al (2010) wanted to study was the effect of family on work and of work on family.  They chose to study relocation because it is the most prevalent, direct, and most invasive way that work impinges on the lives of employees and their family members.  They break down the effect it has on immediate family members compared with extended family members, with a further breakdown of the immediate family into the spouse and the children.  The authors focused primarily on 62 military personnel (from the UK's Royal Air Force), because relocation in these situations is less of a choice and happens so many times that it is the extreme case scenario.

Stated purpose

Following the economic events of the late 2000s, advancements in technology, higher-than-normal unemployment, and globalization, work-related relocation has been on the rise for the past decade (Le et al, 2010).  Work relocation directly impacts the family; thus, studying how family and work interact with each other through a single point of commonality (the employee) is why this aspect was studied.

Many studies have looked at the negative effects of relocation on the family or on the employee, and most measure this as a one-way relationship.  Le et al (2010) set out to study the bidirectional positive and negative impacts of relocation on work and family.  They use exploratory qualitative research to help find variables or themes (to be used in future quantitative studies).  The researchers also wanted to use this exploratory study to make suggestions on how to mitigate the negative side effects of relocation on the employee's family and vice versa.

Theoretical (concept or construct) focus or topic

Relocation impacts could be defined loosely as:

Relocation effects on the employee ~ F(-marital status, -number of kids, -spousal employment, spousal support, marital status) * G(adjustment time, willingness)

It should be noted that the function above doesn't contain weights, only the positive or negative effect of each variable; weights could add more meaning to this equation.  This equation was distilled from Le et al's (2010) survey of the literature.  These are some of the main factors that were addressed or brought up during the qualitative exploratory study of 62 military personnel.

The concepts or constructs defined

Spillover and facilitation are defined as key to this analysis.  Spillover is defined as aspects of work that affect the family, whereas facilitation is when doing one thing for work positively impacts another aspect of work.  These concepts are needed for the study to take place: if spillover doesn't happen, then how does family impact work and work impact the family through the employee?  If relocation doesn't positively impact the career of the employee, then why would the employee undergo it?  So there has to be a perceived or actual benefit to relocation before the employee moves, or the employee may decide to leave their employer altogether to avoid the relocation.

Research approach

The authors took an exploratory qualitative approach for their study.  Their main reason for this approach was to explore themes in relocation affecting the family, and the family affecting the relocation.  Their hopes were to identify themes for a future study that could measure the relative strengths of these themes through a quantitative approach. They also state that quantitative tools are insufficient at this part of the exploratory phase, whereas qualitative work has a particular advantage to it.  You need to know which themes to study on your sample before you can devise the appropriate measurement instrument and analysis tool.  Though this can be accomplished through an extensive analysis of the literature, the authors did state that the bidirectional relationship of family and work with respect to relocation is the gap in the current knowledge.

Conducting interviews lasting between 30 minutes and 2 hours (an average of 1 hour) with 62 military personnel allowed the researchers to collect these themes.  Another aspect of qualitative research that was used is its three measures of validity: face validity (summarizing responses and getting confirmation back from the interviewees), confirmation (asking clarifying questions), and peer examination (independent peers evaluating and commenting on the questions and findings).  All three are used in this qualitative study, which is what makes it appropriate for its purpose.

Conclusions of the study

Le et al (2010) stated that, in the role of a family member, employees face issues like guilt arising from a failure to fulfill family commitments and needs during relocation, and pride due to advantages that manifest in the family because of relocation.

Spouses of the employee face issues like work-related impacts (reduced earning potential, unemployment, and hireability), psychosocial impacts (anger, depression, etc.), and social impacts (loss of social network, community, and friends).

The children of employees face issues like school-related impacts (they may fall behind or speed ahead, depending on where they are relocated, an effect that is diminished if they are placed in boarding school), psychological impacts (mirroring those of the spouse), and social impacts (a hard time making friends, though relocation strengthens internal family bonds).

Finally for the extended family, though it can be hard to establish a connection, some found it amazing that they got to visit a new place to see the relocated family from time to time.

However, there can also be a devastating impact on the family unit: separation (divorce) can occur if the focus is only on one's career and not on the family, or if relocation fatigue (due to multiple relocations in a span of a few years) sets in.

With all of this work impacting the family, the family can in turn impact the work.  The researchers found that the family can try to influence when and how they move.  This effect is amplified when the employee involves the family in the decision; doing so increases buy-in from all members and makes the family happier in the end.  The family can defer or accelerate the relocation depending on their own plans.  But if the company pushes the relocation, the family can exert pressure on the employee, to the point where the employee will leave (or think of leaving, or become aware of the option to leave) the organization because they would prefer to keep their family intact.

Recommendations for future research

This study involved military families, who relocate every 2 to 3 years, more often than most families around the world; in most of these cases, rejecting relocation is not a realistic option.  For these reasons, this is an extreme case of employee relocation, as Le et al (2010) noted.  Thus, future research could test whether these findings apply to more typical global and national companies.  Finally, now that the themes have been identified, their strength/magnitude and the correlations between each theme, relocation's effects on family, and family's effects on relocation can be measured.

 

Internal and External Validity

In quantitative research, a study is valid if one can draw meaning and inferences from the results based on the methodology employed.  The three ways to look at validity are (1) content (do we measure what we wanted to measure), (2) predictive (do we match similar results; can we predict something), and (3) construct (are these hypothetical or real concepts).  This is not to be confused with reliability and consistency.  Creswell (2013) warns that if we modify an instrument or combine it with others, its validity and reliability could change, and in order to use it we must re-establish them.  Several threats to validity exist, either internal (history, maturation, regression, selection, mortality, diffusion of treatment, compensatory/resentful demoralization, compensatory rivalry, testing, and instrumentation) or external (interaction of selection and treatment, interaction of setting and treatment, and interaction of history and treatment).

Sample Validity Considerations: the validity issues and their mitigation plans

Internal Validity Issues:

Hurricane intensities and tracks can vary annually or even decadally.  As time passes during this study of the 2016 and 2017 Atlantic Ocean basin seasons, the study may run into regression issues.  These regression issues threaten validity because certain weather components may not be the only factors that increase or decrease hurricane forecasting skill relative to the average.  To mitigate regression issues, the study could remove the storms with an extreme departure from the average forecast skill, limiting their effect on the final results.  Naturally, the extreme departures from the average forecast skill will, with time, slightly impact the mean, but their results are still too valuable to dismiss: finding out which weather components drive these extreme departures from the average forecast skill is what drives this project.  Thus, their removal doesn't fit this study and would defeat the purpose of knowledge discovery.

External Validity Issues: 

The Eastern Pacific, Central Pacific, and Atlantic Ocean basins share the same underlying dynamics that can create, intensify, and influence the path of tropical cyclones.  However, these three basins still behave differently, so there is an interaction-of-setting-and-treatment threat to the validity of this study's results.  Results garnered in this study will not allow me to generalize beyond the Atlantic Ocean basin.  The only way to mitigate this threat to validity is to suggest that future research be conducted on each basin separately.


Exploring Mixed Methods

Explanatory Sequential (QUAN -> qual)

According to Creswell (2013), this mixed-methods design uses qualitative methods to do a deep dive into quantitative results that have been previously gathered (often to understand the data with respect to the culture).  The key defining feature is that the quantitative data is collected before the qualitative data and that the quantitative results drive the qualitative follow-up.  Thus, the emphasis is given to the quantitative results, and the qualitative phase is used to explore and make sense of them.  It is used to probe quantitative results by explaining them via qualitative results; essentially, using qualitative results to enhance your quantitative results.

Exploratory Sequential (QUAL -> quan)

According to Creswell (2013), this mixed-methods design uses quantitative methods to confirm qualitative results that have been previously gathered (often to understand the culture behind the data).  The key defining feature is that the qualitative data is collected before the quantitative data and that the qualitative results drive the quantitative follow-up.  Thus, the emphasis is given to the qualitative results, and the quantitative phase is used to explore and generalize them.  It is used to probe qualitative results by testing them via quantitative results; essentially, using quantitative results to enhance your qualitative results.

Which method would you most likely use?  If your methodological fit suggests a mixed-methods research project, does your worldview color your choice?


Quasi-experimental

In quantitative methodology there are experimental project designs (which deal with the impact of an outcome while using a control to see if the tested variable does have an impact), quasi-experimental designs (which deal with a non-random sample but still measure the impact of an outcome), and non-experimental designs (which deal with generalizing/inferring about a population).

For a non-experimental project design, surveys are used as an instrument to gather data and produce quantitative/numeric results that help identify trends and sentiment from a sample of a total population (Creswell, 2013).  The Pew Research Center (2015), wanting to analyze changing attitudes on gay marriage a few days after the Supreme Court struck down the bans as unconstitutional, asked:

Do you oppose/favor allowing gays and lesbians to marry legally? What is your current age? What is your Religious Affiliation? What is your Political Party?  What is your Political Ideology? What is your Race? What is your gender?

Pew found that, since they began conducting this survey in 2001, every descriptive variable classifying people has shown an increase in acceptance of gay marriage, with 55% overall in favor versus 39% opposed.  This example is not trying to explain a relationship but rather a trend.

An experimental project design usually follows these steps: identification of participants, gathering of materials, drafting and finalizing procedures, and setting up measures, so that you can conduct the experiment and derive results from it (Creswell, 2013).

When participants in a study are randomly assigned to a control group or other groups in an experiment, it is considered a true experiment; if the participants are not randomly assigned, it is considered a quasi-experiment (Creswell, 2013).  In the famous Milgram obedience experiment (1974), an ad was posted to collect participants for a study supposedly on memory, but in fact it examined whether the presence of authority would compromise participants' internal morals enough to cause pain and deliver what they believed could be fatal shocks to another participant (an actor).  About two-thirds of people were willing to administer the deadly shock because an authority figure (a man in a white coat) told them to continue the study.  Though this study would be hard to replicate today (due to IRB considerations), it was not fully random, so it is a quasi-experiment; it challenged and shocked the world and became a pivotal paper/experiment in behavioral science.


Methodological fit

Do you know what methodology you should use for your research project?

If there is extensive literature on a topic then, according to Edmondson and McManus (2007), one can contribute to a mature theory, and quantitative methodology is the best methodological fit.  If one strays and uses a qualitative methodology in this case, one risks the reinventing-the-wheel error and may fail to fill a gap in the body of knowledge.

If there is only a little literature on a topic, then one can contribute to a nascent theory via qualitative methodologies, which in turn are the best methodological fit (Edmondson & McManus, 2007).  If you do a quantitative research project here, you may be jumping the gun, running into possible false conclusions caused by confounding variables, and still failing to fill the gap in the body of knowledge.

Finally, one can stray from both pure qualitative and quantitative methodologies and go into a mixed-methods study; this fits when there is enough research that the body of knowledge is no longer nascent, but not enough for it to be considered mature (Edmondson & McManus, 2007).  Going only one route here would do an injustice to filling the gap in the body of knowledge, because you may miss key insights that each part of the mixed methodology (both qualitative and quantitative) can bring to the field.

So, prior to deciding which methodology you should choose, you should do an in-depth literature review.  You cannot pick an appropriate methodology without knowing the body of knowledge.

Hint: The more quantitative research articles you find in a body of knowledge, the more likely your project will be dealing with either a mixed-methods (low number of articles) or a quantitative method (high number of articles) project. If you see none, you may be working on a qualitative methodology.

Reference

  • Edmondson, A., & McManus, S. (2007). Methodological fit in management field research. Academy of Management Review, 32(4), 1155–1179. CYBRARY.

Worldviews and Approaches to Inquiry

The four worldviews according to Creswell (2013) are postpositivism (akin to quantitative methods), constructivism (akin to qualitative methods), advocacy (akin to advocating action), and pragmatism (akin to mixed methods).  There are positives and negatives to each worldview.  Pragmatists use whatever truths and methods, from anywhere, that work at the time they need them to get the results they need, though the pragmatist research style takes time to conduct.  The advocacy view places importance on creating an action item for social change, to diminish inequity gaps in asymmetric power relationships like those that exist with class structure and minorities.  Though this research is noble, the moral arc of history bends toward justice very slowly: it took centuries for racial equality to be where it is today, over 60 years for gender equality, and 40 years for LGBT equality, and there are still unresolved inequalities between these groups and the majority, for instance equal pay for equal work for all, employment/housing non-discrimination for LGBT people, and racial profiling.  Constructivist researchers seek to understand the world around them through subjective means; they use their own understanding and interpretation of the historical and cultural settings of participants to shape their interpretation of the open-ended data they collect.  This can lead to an interpretation that is shaped by the researcher's background and not representative of the whole situation at hand.  Finally, postpositivism looks at the world in numbers; knowing the limitation that not everything can be described in numbers, postpositivists propose hypotheses that they can either accept or reject.  Numbers are imperfect and fallible.

My personal worldview is akin to a pragmatist worldview.  My background in math, science, technology, and management helps me synthesize ideas from multiple fields to drive innovation.  It has allowed me to learn rapidly because I can see how one field ties to another, and it makes me more adaptable.  However, I also lean a bit more strongly to the math and science side of myself, which is a postpositivist view.


The Role of Theory

The theory is intertwined with the research process, thus a thorough understanding of theory must involve the understanding of the relationship between theory and research (Bryman & Bell, 2007).  When looking at research from a deductive role (developing and testing a problem and hypothesis) the theory is presented at the beginning.  The theory here is being tested, as it helps define the problem, its parameters (boundaries) and a hypothesis to test.  Whereas an inductive role uses data and research to build a theory.  Theories can be grand (too hard to pinpoint and test) or they can be mid-range (easier to test, but it is still too big to test it under all assumptions) (Bryman & Bell, 2007).

Where you write your theory depends on the type of worldview you have (positivism at the beginning of the paper; constructivism at the beginning or end of the paper) (Creswell, 2013).  My particular focus will be on the postpositivist view (quantitative methods), so I will dissect the placement of the theory primarily for a quantitative research study (which is mostly deductive in nature).  Placing the theory in the introduction, in the literature review, or after the hypothesis runs into the issue that it becomes harder for the reader to isolate and separate the theory from those respective sections (Creswell, 2013).  There is another disadvantage Creswell (2013) notes for the after-the-hypothesis approach: you may forget to discuss the origins and rationale for the theory.  Creswell (2013) suggests, as a research tip, separating the theory into a brand-new section of its own so that it is easily identified and its origin and rationale can be elaborated on.

However, separating the theory section from the rest of the paper can still get the paper rejected by a journal if the theory remains fuzzy to your peers and the editor.  Feldman's (2004) editorial states that a question and theory that are succinct, grammatically correct, non-trivial, and make a difference will help you get your results published.  He also states (as many of our professors do) that we need to find the key articles and references from the past 5 years, be exhaustive yet exclusive with our dataset, and establish clear boundary conditions so that we can adequately define the independent and dependent variables (Feldman, 2004).  The latter set of conditions helps build your theory, whereas the first set speaks to the readability of the theory.  If your theory is hard to read because it is so convoluted, why should anyone care to read it?

Resources:

  • Bryman, A., & Bell, E. (2007). Business Research Methods (2nd ed.). Oxford University Press.
  • Creswell, J. W. (2013). Research Design: Qualitative, Quantitative, and Mixed Methods Approaches, 4th Edition. [VitalSource Bookshelf version]. Retrieved from http://online.vitalsource.com/books/9781483321479/epubcfi/6/24
  • Feldman, D. C. (2004). What are we talking about when we talk about theory? Journal of Management, 30(5), 565–567.

Adv DB: CAP and ACID

Transactions

A transaction is a set of operations/transformations that carries a database or relational dataset from one state to another.  Once completed and validated as successful, the end result is saved into the database (Panda et al, 2011).  Both ACID and CAP (discussed in further detail below) are known as integrity properties for these transactions (Mapanga & Kadebu, 2013).
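
A minimal sketch of a transaction's all-or-nothing behavior using SQLite; the accounts table and the transfer amount are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)])
conn.commit()

try:
    # One transaction: both updates succeed together or not at all (atomicity).
    conn.execute("UPDATE accounts SET balance = balance - 150 WHERE name = 'alice'")
    conn.execute("UPDATE accounts SET balance = balance + 150 WHERE name = 'bob'")
    # Consistency check: no account may go negative.
    if conn.execute("SELECT COUNT(*) FROM accounts WHERE balance < 0").fetchone()[0]:
        raise ValueError("constraint violated")
    conn.commit()          # durability: the change survives once committed
except ValueError:
    conn.rollback()        # atomicity: the partial transfer is undone

print(conn.execute("SELECT * FROM accounts").fetchall())  # balances unchanged
```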

Mobile Databases

Mobile devices have become prevalent and vital for many transactions when the end-user is unable to access a wired connection.  Since the end-user cannot find a wired connection to conduct their transaction, their device will retrieve and save transaction information either over a wireless connection or in disconnected mode (Panda et al, 2011).  A problem with a mobile user accessing and creating transactions against databases is that bandwidth in a wireless network is not constant: when there is enough bandwidth, the connection to the end-user's data is rapid, and vice versa.  There are a few transaction models that can be used efficiently for mobile database transactions: the Report and Co-transactional model; the Kangaroo transaction model; the Two-Tiered transaction model; the Multi-database transaction model; the Pro-motion transaction model; and the Toggle transaction model.  This is by no means an exhaustive list of transaction models for mobile databases.

According to Panda et al (2011), in the Report and Co-transactional model, transactions are completed from the bottom up in a nested format, such that a transaction is split between child and parent transactions.  A child transaction, once successfully completed, feeds its information up the chain until it reaches the parent; however, nothing is committed until the parent transaction completes.  Thus, a transaction can occur on the mobile device but not be fully implemented until it reaches the parent database.  In the Kangaroo transaction model, a mobile transaction manager collects and accepts transactions from the end-user and forwards (hops) the transaction request to the database server.  Transactions in this model are made by proxy on the mobile device, and when the mobile device moves from one location to the next, a new transaction manager is assigned to produce a new proxy transaction.  The Two-Tiered transaction model is inspired by data replication schemes, where there is a master copy of the data along with multiple replicas.  The replicas are considered to be on the mobile device but can make changes to the master copy if the connection to the wireless network is strong enough.  If the connection is not strong enough, the changes are made to the replicas; they show as committed on these replicas and are still made visible to other transactions.

The Multi-database transaction model uses asynchronous schemes to allow a mobile user to unplug and still have the transaction coordinated.  To use this scheme, five queues are set up: input, allocate, active, suspend, and output, and nothing is committed until all five queues have been completed.  Pro-motion transactions come from nested transaction models, where some transactions are completed on fixed hosts and others on mobile hosts.  When a mobile user is not connected to the fixed host, a command is sparked such that the transaction now needs to be completed on the mobile host, though carrying out this sparked command is resource-intensive.  Finally, the Toggle transaction model relies on software over a pre-determined network and can operate across several database systems; changes made to the master (global) database can be presented to the different mobile systems, and thus concurrency is maintained for all transactions across all databases (Panda et al, 2011).

At a cursory glance these models seem similar, but they vary strongly in how they implement the ACID properties in their transactions (see Table 1 in the next section).

ACID Properties and their flaws

Jim Gray introduced the idea of ACID transactions in the 1970s; they provide four guarantees: Atomicity (all-or-nothing transactions), Consistency (correct data transactions), Isolation (each transaction is independent of others), and Durability (transactions survive failures) (Mapanga & Kadebu, 2013; Khachana, 2011).  ACID is used to assure reliability in the database system when a transaction changes the state of the data in the database.

This approach is perfect for small relational centralized/distributed databases, but with the demand for mobile transactions, big data, and NoSQL, ACID may be a bit constricting.  The web has independent services connected together relationally, but they are really hard to maintain (Khachana, 2011).  An example of this is booking a flight for a CTU Doctoral Symposium.  One purchases a flight, but then may also need another service related to the flight, like ground transportation to and from the hotel; the flight database is completely different and separate from the ground transportation system, yet sites like Kayak.com provide the service of connecting these databases and offering a friendly user interface for their customers (Kayak.com has its own mobile app as well).  Taking this example further, we can see how ACID, perfect for centralized databases, may not be the best for web-based services.  Another case to consider is mobile database transactions: due to their connectivity issues and recovery plans, the models mentioned above cover only some of the ACID properties (Panda et al, 2011).  This is the flaw of mobile databases, viewed through the lens of ACID.

Model | Atomicity | Consistency | Isolation | Durability
Report & Co-transaction model | Yes | Yes | Yes | Yes
Kangaroo transaction model | Maybe | No | No | No
Two-tiered transaction model | No | No | No | No
Multi-database transaction model | No | No | No | No
Pro-motion model | Yes | Yes | Yes | Yes
Toggle transaction model | Yes | Yes | Yes | Yes

Table 1: A subset of the information in Panda et al (2011) on mobile database transaction models and whether they satisfy each of the ACID properties.

 

CAP Properties and their trade-offs

CAP stands for Consistency (as in ACID: all data transactions are correct and all users see the same data), Availability (users always have access to the data), and Partition tolerance (the database can be split over many servers and keep operating without a single point of failure); it was developed in 2000 by Eric Brewer (Mapanga & Kadebu, 2013; Abadi, 2012).  These three properties are needed for distributed database management systems, and CAP is seen as a less strict alternative to Jim Gray's ACID properties.  Unfortunately, a distributed database system can fully provide only two of the three properties at once, giving CA, CP, or AP systems.  CP systems have a reputation for not being available all the time, which is contrary to fact: availability in a CP system is given up (or out-prioritized) only when partition tolerance is needed, and availability in a CA system can be lost if a partition of the data needs to occur (Mapanga & Kadebu, 2013).  Though you can only build a system that is best at two, that does not mean you cannot add the third property; the restriction applies only to prioritization.  In a CA system, ACID can be guaranteed alongside availability (Abadi, 2012).

Partitions can vary per distributed database management system due to the WAN, hardware, network configuration parameters, levels of redundancy, etc. (Abadi, 2012).  Partitions are rare compared to other failure events, but they must be considered.

But the question remains for all database administrators: which of the three CAP properties should be prioritized above the others, particularly in a distributed database management system where partitions are a consideration?  Abadi (2012) answers this question: for mission-critical data/applications, availability during partitions should not be sacrificed, so consistency must fall for a while.

Amazon's Dynamo and Riak, Facebook's Cassandra, Yahoo's PNUTS, and LinkedIn's Voldemort are all examples of distributed database systems that can be accessed on a mobile device (Abadi, 2012).  According to Abadi (2012), latency (closely tied to availability) is critical to all these systems, so much so that a 100 ms delay can significantly reduce an end-user's retention and future repeat transactions.  Thus, availability during partitions is key not only for mission-critical systems but also for e-commerce.

Unfortunately, this tradeoff between consistency and availability arises because of data replication and depends on how replication is done.  According to Abadi (2012), there are three ways to do data replication: data updates sent to all replicas at the same time (high consistency enforced); data updates sent to an agreed-upon location first, through synchronous or asynchronous schemes (high availability enforced, depending on the scheme); and data updates sent to an arbitrary location first, through synchronous or asynchronous schemes (high availability enforced, depending on the scheme).

According to Abadi (2012), PNUTS sends data updates to an agreed-upon location first through asynchronous schemes, which improves availability at the cost of consistency, whereas Dynamo, Cassandra, and Riak send data updates to an agreed-upon location first through a combination of synchronous and asynchronous schemes: these three systems propagate data synchronously to a small subset of servers and to the rest asynchronously, which can cause inconsistencies.  All of this is done in order to reduce delays for the end-user.
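
A minimal sketch (plain Python, no real network) of why the choice matters: a synchronous write touches every replica before acknowledging, while an asynchronous write acknowledges after the primary and lets the other replicas catch up later, trading consistency for latency and availability.

```python
import threading
import time

replicas = [{}, {}, {}]          # three copies of the data

def write_sync(key, value):
    # Acknowledge only after every replica has the update: consistent but slow.
    for r in replicas:
        time.sleep(0.01)         # simulated network round trip
        r[key] = value
    return "ack"

def write_async(key, value):
    # Acknowledge after the primary; propagate to the others in the background.
    replicas[0][key] = value
    def propagate():
        for r in replicas[1:]:
            time.sleep(0.01)
            r[key] = value
    threading.Thread(target=propagate, daemon=True).start()
    return "ack"                 # a read on another replica right now may be stale

write_async("seat:12A", "reserved")
print(replicas[1].get("seat:12A"))   # likely None: the inconsistency window
time.sleep(0.05)
print(replicas[1].get("seat:12A"))   # 'reserved' once propagation completes
```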

Going back to the Kayak.com example from the previous section, consistency in the web environment should be relaxed (Khachana et al, 2011).  Expanding on Kayak.com: if seven users wanted to access the services at the same time, each could differ on which of these properties should be relaxed.  One user might order a flight, hotel, and car and insist that none is booked until all services are committed; another may be content with whatever car they get for ground transportation, as long as they get the flight times and price they want.  This can cause inconsistencies, lost information, or misleading information needed for proper decision analysis, but systems must be adaptable (Khachana et al, 2011).  They must take into account the wireless signal, the mode of transferring and committing data, and load balancing of incoming requests (who has priority for a contested plane seat when there is only one left at that price).  At the end of the day, when it comes to CAP, availability is king: it will drive business away or attract it, so C or P must give in order to cater to the customer.  If I were designing this system, I would run an AP system but conduct the partitioning when the load/demand on the database system is small (off-peak hours), to give the illusion of a CA system (because the consistency degradation will be seen by fewer people).  Off-peak hours don't really exist for global companies, mobile web services, or websites, but there are times throughout the year when transactions against the database system are fewer than on normal days, so scheduling around those days is key.  For a mobile transaction system, I would select a Pro-motion transaction model, which helps comply with the ACID properties: make the updates locally on the mobile device when services are down, and set up an ordered queue of other transactions waiting to be committed once wireless service has been restored or a stronger signal is found.
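
A minimal sketch of that queue-and-commit idea for a mobile client, assuming a hypothetical `server_commit` callable and a simple connectivity flag rather than the actual Pro-motion protocol:

```python
from collections import deque

class MobileTxnQueue:
    """Buffer transactions locally while offline; flush in order when connected."""

    def __init__(self, server_commit):
        self.server_commit = server_commit   # callable that commits one txn upstream
        self.pending = deque()
        self.local_state = {}

    def execute(self, key, value, online):
        self.local_state[key] = value        # apply locally right away
        self.pending.append((key, value))    # remember it for the fixed host
        if online:
            self.flush()

    def flush(self):
        while self.pending:
            key, value = self.pending[0]
            self.server_commit(key, value)   # commit in the original order
            self.pending.popleft()

# Usage: updates made offline are committed once the signal returns.
committed = {}
q = MobileTxnQueue(server_commit=lambda k, v: committed.__setitem__(k, v))
q.execute("draft:1", "saved offline", online=False)
q.execute("draft:2", "also offline", online=False)
q.flush()                                    # connectivity restored
print(committed)
```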

References

  • Abadi, D. J. (2012). Consistency tradeoffs in modern distributed database system design: CAP is only part of the story. IEEE Computer Society, (2), 37-42.
  • Khachana, R. T., James, A., & Iqbal, R. (2011). Relaxation of ACID properties in AuTrA, The adaptive user-defined transaction relaxing approach. Future Generation Computer Systems, 27(1), 58-66.
  • Mapanga, I., & Kadebu, P. (2013). Database Management Systems: A NoSQL Analysis. International Journal of Modern Communication Technologies & Research (IJMCTR), 1, 12-18.
  • Panda, P. K., Swain, S., & Pattnaik, P. K. (2011). Review of some transaction models used in mobile databases. International Journal of Instrumentation, Control & Automation (IJICA), 1(1), 99-104.