Adv DB: Conducting data migration to NoSQL databases

Relational databases schema design (primarily ERDs) are all about creating models, then translating it a schema to which is normalized, but one must be an oracle to anticipate a holistic, end-to-end design, or else suffer when making changes to the database (Scherzinger et al, 2013).  Relational databases are poor at data replication, horizontal scalability, and high availability rates (Schram & Anderson, 2012).  Thus, waterfall approaches to database design are no longer advantageous, and like software development databases can be designed with an agile mentality.  Especially as data store requirements are always evolving. Databases that adopt a “Schema-less” (where data can be stored without any predefined schema) or an “Implicit Schema” (where the data definition van be taken from a database from an application in order to place the data into the database) in “Not Only SQL” (NoSQL) can allow for agile development on a release cycle that can vary from yearly, monthly, weekly, or daily, which is completely dependent on the developers’ iteration cycle (Sadalage & Fowler, 2012).  Taking a look at a blogging agile development lifecycle (below) can show how great schema-less or implicit schemas in NoSQL database development can become, as well as the technical debt that is created, which can cause migration issues down the line.

Blogging

We start a blogging site called “blog.me” and we are in an agile environment, which means iterative improvements and each iteration produces a releasable product (even if we decide not to make a release or update at the end of the iteration).  As a programming team, they have decided that the minimum viable product will consist of the fields, title, and content for the blogger and comments from other people.  This is a similar example proposed by Scherzinger et al in 2013, as they try to explain how implicit schemas work.  In the second iteration, the programming team for “blog.me” has discovered an abuse on the commenting section of the blog.  People have been “trolling” the blog, thus to mitigate this, they implemented a sign-in process with a username and password that is taken from Facebook, which allows for liking a post as well.  Rather than having bloggers to recreate their content, the programmers make the implementation of this update for current and future posts. In a third iteration, the programming teams to institute a uniformed nomenclature to some of their fields.  Rather than changing all the posts from the first two iterations, the programmers decide to enforce these changes moving forward.

Now, one can see how useful a schema-less development (provided by NoSQL) can become.   There is no downtime to how the site interacts and adds no additional burden to the end-users when an update occurs. But, we now have to worry about migrating these three data classes (or as Scherzinger et al calls it technical debt), but what if a commenter goes and comments in a post made in iteration one or two after iteration three has been implemented, we may then have four to five different data classes.  These developers love to develop code and add new features rather than maintain code, which is why this form of developing a database is great, but as we can see technical debt can pile on quickly.  Our goal is to manage a schema of this data, yet have the flexibility of a schema-less database system.

Types of Migration

The migration of data in and out of a data store is usually enabled through a replication scheme (Shirazi et al, 2012) conducted through an application.  There are two primary types of data migration per Scherzinger et al (2013): eager and lazy.  Eager migration means we migrate all the data in a batched fashion, one-by-one retrieval from the data store, transform it and write it back into the data store.  As data becomes larger, eager migration can become resource-intensive and could be a wasted effort. Wasted efforts can come from stale data.  Thus, the lazy approach is considered as a viable option.  Transformations are conducted when a piece of data is touched, so only live and hot data (relevant data) is updated.  Even though this approach saves on resources, if an entity becomes corrupted, there may be no way to retrieve it.  In order to do the migration, an application needs to create an “implicit-schema” on the “schema-less” data.

NoSQL and its multiple flavors

NoSQL databases can deal with aggregate data (relationships between units of data that can be relationally mapped), using key-value, document, and column friendly databases (Scherzinger et al, 2013, Sadalage & Fowler, 2012, Schram & Anderson, 2012).  There also exist graphical databases (Sadalage & Fowler, 2012).  Key-value databases deal with storing data with a unique key and value, while document databases store documents or their parts in a value. (Scherzinger et al, 2013). People can blur the line between this and key-value databases by placing an ID field, but for the most part, you will query a document database rather than look up a key or ID (Sadalage & Fowler, 2012). Whereas column friendly databases store the information in transposed table structures (as columns rather than rows).  Graph databases can show relationships with huge datasets that are highly interconnected, and the complexity of the data is emphasized in this database rather than the size of data (Shirazi et al, 2012).  A further example of a graphical database is shown in the health section in the following pages.  Migrations between the multiple flavors of NoSQL databases allow for one to exploit the strengths and mitigate the weakness between the types when it comes to analyzing the large data quickly.

Data Migration Considerations and Steps

Since data migration uses replication schemes from an application, one must consider how complex writing a SQL query would be if this were a relational database scheme (Shirazi et al, 2012).  This has implications on how complex transforming data or migrating it would be under NoSQL databases, especially when big data is introduced into the equation.  Thus, the pattern of database design must be taken into account when migrating data between relational databases to NoSQL database, or between different NoSQL database types (or even provider). Also, each of these database types treats NULL values differently, some NoSQL databases don’t even waste the storage space and ignore NULL values, some systems have them as in relational databases, and some systems allow for it, but don’t query for it (Scherzinger et al, 2013).  Scherzinger et al (2013) suggest that when migrating data, data models (data stored in the databases that belong to a object or a group, which can have several properties) query models (data that can be inserted, transformed and deleted based on a key-value, or some other kind identification), and freedom from schema (the global structure of the data that can or cannot be fixed in advance) must be taken into account. Whereas, Schram & Anderson in 2012, stated that data models are key when making design changes (migrations) between database systems. Since in NoSQL data is “schema-less” there may not be any global structure, but applications (such as web user-interfaces) built on top of the data-stores can display an implicit structure, and from that, we can list a few steps to consider when migrating data (Tran et al, 2011):

  • Installation and configuration
    1. Set up development tools and environment
    2. Install and set up environments
    3. Install third-party tools
  • Code modification
    1. Set up database connections
    2. Database operation query (if using a NoSQL database)
    3. Any required modifications for compatibility issues
  • Migration
    1. Prepare the database for migration
    2. Migrate the local database to the NoSQL database (the schema-less part)
    3. Prepare system for migration
    4. Migrate the application (the implicit-schema part)
  • Test (how to ensure the data stored in the databases matched with the “Implicit Schema” embedded in the applications when the “Implicit Schema” has experienced a change)
    1. Test if the local system works with a database in NoSQL
    2. Test if the system works with databases in NoSQL
    3. Write test cases and test for functionality of the application in NoSQL

When doing code modification (step 2) from a relational database to a NoSQL database the more changes will be required, and JOIN operations may not be fully supported.  Thus, additional code may be required in order to maintain the serviceability of the application, pre-migration, during migration and post-migration (Tran et al, 2011).  Considering ITIL Service Transition standards, the best time to do a migration or update is in windows of minimum usage by end-users, while still maintaining agreed-upon minimum SLA standards.  As stated in Schram & Anderson (2012) they didn’t want their service to break while they were migrating their data from a relational database to a NoSQL column friendly database.  Other issues, like compatibility between the systems housing the databases or even database types, can also add complexity to migration.  When migrating (step 3) SQL scripts need to be transformed as well, to align with the new database structure, environment, etc. (Tran et al, 2011). Third-party apps can help to a degree with this.  If the planning phase was conducted correctly this phase should be relatively smooth.  Tran et al (2011) stated that there are at least 8 features that drive the cost of migration: (1) Project team’s capability, (2) Application/Database complexity, (3) Existing knowledge and experience, (4) Selecting the correct database and database management system, (5) Compatibility issues, database features, and (8) Connection issues during migration.

Health

A database was created from 7.2M medical reports, in order to understand human diseases, called HealthTable.  The authors in Shirazi et al in 2012, decided to convert a column store into a graph database of Health Infoscape (Table 1 to Figure 1).  Each cause/symptom stems from disease (Dx), yet the power of graph databases as aforementioned are shown, thus facilitating data analysis, even though column friendly databases provide an easier way to maintain the 7.2M data records.

Table 1. HealthTable in Hbase per Shirazi et al (2012).

Row key Info Prevalence Causes
D1 Name Category Female Male Total Cause1 Cause2 Cause3
Heartburn Digestive system 9.4% 9% 9.2% D2    
1 1 1 1 1 2    
D2 Chest Pain Circulatory System 6.8% 6.8% 6.8%      
3 3 3 3 3      
D4 Dizziness Nervous System 4% 2.8% 3.5%      
5 5 5 5 5      

health graph

Figure 1. HeathGraph Bases on HealthTable

Conclusions

From these two use cases (Heath and Blogging) is that data migration can be quite complicated.  Schema-less databases allow for a more agile approach to developing, whereas the alternative is best for the waterfall.  However, with waterfall development slowly on the decay, one must also migrate to other forms of development.  Though applications/databases can migrate from relational databases to NoSQL and thus require a lot of coding because of compatibility issues, applications/databases can also migrate between different types of NoSQL databases.  Each database structure has its strengths and weakness, and migrating data between these databases can provide opportunities for knowledge discovery from the data that is contained within them.  Migrating between database systems and NoSQL types should be conducted if it fulfills many of the requirements and promises to reduce the cost of maintenance (Schram & Anderson, 2012).

References

  • Sadalage, P. J., Fowler, M. (2012). NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence, 1st Edition. [VitalSource Bookshelf Online]. Retrieved from https://bookshelf.vitalsource.com/#/books/9781323137376/
  • Scherzinger, S., Klettke, M., & Störl, U. (2013). Managing schema evolution in NoSQL data stores. arXiv preprint arXiv:1308.0514.
  • Schram, A., & Anderson, K. M. (2012). MySQL to NoSQL: data modeling challenges in supporting scalability. In Proceedings of the 3rd annual conference on Systems, programming, and applications: software for humanity (pp. 191-202). ACM.
  • Shirazi, M. N., Kuan, H. C., & Dolatabadi, H. (2012, June). Design Patterns to Enable Data Portability between Clouds’ Databases. In Computational Science and Its Applications (ICCSA), 2012 12th International Conference on (pp. 117-120). IEEE.
  • Tran, V., Keung, J., Liu, A., & Fekete, A. (2011, May). Application migration to cloud: a taxonomy of critical factors. In Proceedings of the 2nd international workshop on software engineering for cloud computing (pp. 22-28). ACM.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: