Adv DB: CAP and ACID

Transactions

A transaction is a set of operations/transformations that carries a database or relational dataset from one state to another.  Once the transaction completes and is validated as successful, the end result is saved into the database (Panda et al., 2011).  Both ACID and CAP (discussed in further detail below) are known as integrity properties for these transactions (Mapanga & Kadebu, 2013).

Mobile Databases

Mobile devices have become prevalent and vital for many transactions when the end-user is unable to access a wired connection.  Because a wired connection is unavailable, the device must retrieve and save transaction information either over a wireless connection or in disconnected mode (Panda et al., 2011).  One problem with a mobile user accessing and creating transactions against databases is that bandwidth in a wireless network is not constant: when there is ample bandwidth the end-user's data moves rapidly, and when there is not, it slows down.  There are a few transaction models that can be used efficiently for mobile database transactions: the Report and Co-transactional model; the Kangaroo transaction model; the Two-Tiered transaction model; the Multi-database transaction model; the Pro-motion transaction model; and the Toggle transaction model.  This is by no means an exhaustive list of transaction models for mobile databases.

According to Panda et al. (2011), in the Report and Co-transactional model, transactions are completed from the bottom up in a nested format, such that a transaction is split between child and parent transactions.  Once a child transaction completes successfully, it feeds its result up the chain until it reaches the parent; nothing is committed until the parent transaction completes.  Thus, a transaction can occur on the mobile device but not be fully applied until it reaches the parent database.  In the Kangaroo transaction model, a mobile transaction manager collects and accepts transactions from the end-user and forwards (hops) the transaction request to the database server.  Transactions in this model are made by proxy on the mobile device, and when the mobile device moves from one location to the next, a new transaction manager is assigned and a new proxy transaction is produced.  The Two-Tiered transaction model is inspired by data replication schemes, where there is a master copy of the data along with multiple replicas.  The replicas live on the mobile device but can push changes to the master copy if the connection to the wireless network is strong enough.  If the connection is not strong enough, the changes are made to the replicas, show as committed there, and are still made visible to other transactions.

The Multi-database transaction model uses asynchronous schemes to allow a mobile user to disconnect and still coordinate the transaction.  To use this scheme, five queues are set up: input, allocate, active, suspend, and output; nothing is committed until all five queues have been worked through.  Pro-motion transactions come from nested transaction models, where some transactions are completed through fixed hosts and others on mobile hosts.  When a mobile user is not connected to the fixed host, a command is triggered so that the transaction is completed on the mobile host, though carrying out this command is resource-intensive.  Finally, the Toggle transaction model relies on software over a pre-determined network and can operate on several database systems; changes made to the master (global) database can be propagated to the different mobile systems, and thus concurrency is maintained for all transactions across all databases (Panda et al., 2011).
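
To make the offline-then-sync behavior described above more concrete, here is a rough Python sketch in the spirit of the two-tiered and pro-motion descriptions: writes are applied to a local replica while the device is disconnected, queued, and pushed to the master copy once connectivity returns.  The class and method names are my own illustration, not something defined by Panda et al. (2011).

    from collections import deque

    class MobileReplica:
        """Hypothetical sketch: apply writes locally while offline, sync later."""
        def __init__(self, master):
            self.master = master          # the fixed host's master copy (a dict here)
            self.local = dict(master)     # the replica carried on the mobile device
            self.pending = deque()        # writes waiting to reach the fixed host
            self.connected = False

        def write(self, key, value):
            self.local[key] = value       # visible locally (tentatively committed)
            self.pending.append((key, value))
            if self.connected:
                self.flush()

        def flush(self):
            # Push queued writes to the master copy in the order they were made.
            while self.pending:
                key, value = self.pending.popleft()
                self.master[key] = value

        def reconnect(self):
            self.connected = True
            self.flush()

    # Usage: edits made offline only reach the master after reconnecting.
    master_copy = {"order_42": "draft"}
    device = MobileReplica(master_copy)
    device.write("order_42", "submitted")
    print(master_copy["order_42"])   # still 'draft' while disconnected
    device.reconnect()
    print(master_copy["order_42"])   # now 'submitted'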

At a cursory glance these models seem similar, but they vary strongly in how they implement the ACID properties in their transactions (see Table 1 in the next section).

ACID Properties and their flaws

Jim Gray introduced the idea of ACID transactions in the 1970s; they provide four guarantees: Atomicity (all-or-nothing transactions), Consistency (correct data transactions), Isolation (each transaction is independent of others), and Durability (committed transactions survive failures) (Mapanga & Kadebu, 2013; Khachana et al., 2011).  ACID is used to assure reliability in a database system when a transaction changes the state of the data in the database.
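
As a concrete illustration of atomicity and durability, here is a minimal Python sketch using the standard-library sqlite3 module (my own example, not taken from the cited sources): the two legs of a transfer either both commit or, if anything fails, both roll back.

    import sqlite3

    conn = sqlite3.connect(":memory:")   # a throwaway database for the example
    conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance REAL)")
    conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                     [("checking", 200.0), ("savings", 50.0)])
    conn.commit()

    def transfer(db, src, dst, amount):
        """Atomic transfer: both updates commit together or neither does."""
        try:
            db.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                       (amount, src))
            db.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                       (amount, dst))
            db.commit()        # durability: the change survives once committed
        except sqlite3.Error:
            db.rollback()      # atomicity: undo the partial work on any failure
            raise

    transfer(conn, "checking", "savings", 75.0)
    print(dict(conn.execute("SELECT name, balance FROM accounts")))
    # {'checking': 125.0, 'savings': 125.0}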

This approach is well suited to small, centralized or distributed relational databases, but with the demands of mobile transactions, big data, and NoSQL, ACID can be constricting.  The web consists of independent services connected together relationally, which are hard to maintain (Khachana et al., 2011).  An example of this is booking a flight for a CTU Doctoral Symposium.  One purchases a flight, but may also need another service related to the flight, like ground transportation to and from the hotel; the flight database is completely different and separate from the ground transportation system, yet sites like Kayak.com connect these databases and provide a friendly user interface for their customers.  Kayak.com has its own mobile app as well.  Taking this example further, we can see how ACID, perfect for centralized databases, may not be the best fit for web-based services.  Another case to consider is mobile database transactions: because of their connectivity issues and recovery plans, the models mentioned above cover only some of the ACID properties (Panda et al., 2011).  This is the flaw of mobile databases when viewed through the lens of ACID.

Model                              Atomicity   Consistency   Isolation   Durability
Report & Co-transaction model      Yes         Yes           Yes         Yes
Kangaroo transaction model         Maybe       No            No          No
Two-tiered transaction model       No          No            No          No
Multi-database transaction model   No          No            No          No
Pro-motion transaction model       Yes         Yes           Yes         Yes
Toggle transaction model           Yes         Yes           Yes         Yes

Table 1: A subset of the information found in Panda et al. (2011) on mobile database transaction models and whether or not they provide each of the ACID properties.


CAP Properties and their trade-offs

CAP stands for Consistency (as in ACID: all data transactions are correct and all users see the same data), Availability (users always have access to the data), and Partition Tolerance (the database can be split over many servers so that no single point of failure exists); it was developed in 2000 by Eric Brewer (Mapanga & Kadebu, 2013; Abadi, 2012).  These three properties are needed for distributed database management systems, and CAP is seen as a less strict alternative to Jim Gray's ACID properties.  Unfortunately, you can only build a distributed database system that fully delivers two of the three properties, giving CA, CP, or AP systems.  CP systems have a reputation of not being available all the time, which is not quite accurate: Availability in a CP system is given up (or out-prioritized) only when Partition Tolerance is needed.  Availability in a CA system can be lost if a partition in the data needs to occur (Mapanga & Kadebu, 2013).  Though you can only make a system best at two of the properties, that does not mean you cannot add the third; the restriction applies only to what is prioritized.  In a CA system, ACID can be guaranteed alongside Availability (Abadi, 2012).

Partitions can vary across distributed database management systems due to WAN links, hardware, network configuration parameters, levels of redundancy, etc. (Abadi, 2012).  Partitions are rare compared to other failure events, but they must still be considered.

But the question remains for all database administrators: which of the three CAP properties should be prioritized above the others, particularly in a distributed database management system where partitions must be considered?  Abadi (2012) answers this question: for mission-critical data/applications, availability during partitions should not be sacrificed, so consistency must give way for a while.

Amazon’s Dynamo and Riak, Facebook’s Cassandra, Yahoo’s PNUTS, and LinkedIn’s Voldemort are all examples of distributed database systems that can be accessed from a mobile device (Abadi, 2012).  According to Abadi (2012), latency (closely tied to availability) is critical to all of these systems, so much so that a 100 ms delay can significantly reduce an end-user’s retention and future repeat transactions.  Thus, availability during partitions is key not only for mission-critical systems but for e-commerce as well.

This tradeoff between Consistency and Availability arises because of data replication and depends on how replication is done.  According to Abadi (2012), there are three ways to replicate data: data updates sent to all replicas at the same time (high consistency enforced); data updates sent to an agreed-upon location first through synchronous or asynchronous schemes (high availability enforced, depending on the scheme); and data updates sent to an arbitrary location first through synchronous or asynchronous schemes (high availability enforced, depending on the scheme).

According to Abadi (2012), PNUTS sends data updates to an agreed-upon location first through asynchronous schemes, which improves Availability at the cost of Consistency, whereas Dynamo, Cassandra, and Riak send data updates to an agreed-upon location first through a combination of synchronous and asynchronous schemes.  These three systems propagate data synchronously to a small subset of servers and asynchronously to the rest, which can cause inconsistencies.  All of this is done to reduce delays for the end-user.
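
Here is a minimal sketch of the second scheme (my own illustration, not code from any of these systems): writes go to an agreed-upon primary copy immediately, while replicas are updated asynchronously, so a read from a replica that has not yet been refreshed can return stale data.

    from collections import deque

    class PrimaryAsyncReplication:
        """Hypothetical sketch: synchronous write to a primary, async fan-out to replicas."""
        def __init__(self, replica_count=2):
            self.primary = {}
            self.replicas = [{} for _ in range(replica_count)]
            self.backlog = deque()        # updates not yet applied to the replicas

        def write(self, key, value):
            self.primary[key] = value     # the agreed-upon location gets it first
            self.backlog.append((key, value))

        def replicate(self):
            # In a real system this runs in the background; here we drain on demand.
            while self.backlog:
                key, value = self.backlog.popleft()
                for replica in self.replicas:
                    replica[key] = value

    store = PrimaryAsyncReplication()
    store.write("user:7:cart", ["pizza"])
    print(store.replicas[0].get("user:7:cart"))   # None -> a stale read (inconsistency)
    store.replicate()
    print(store.replicas[0].get("user:7:cart"))   # ['pizza'] once replication catches up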

Going back to the Kayak.com example from the previous section, consistency in the web environment should be relaxed (Khachana et al., 2011).  Expanding on Kayak.com, if seven users wanted to access the services at the same time, each could specify which of these properties should be relaxed.  One user may order a flight, hotel, and car and insist that none is booked until all services are committed; another may be content with whichever car is available for ground transportation as long as they get the flight times and price they want.  This can cause inconsistencies, lost information, or misleading information needed for proper decision analysis, but systems must be adaptable (Khachana et al., 2011).  They must take into account the wireless signal, the mode of transferring the data, how the data is committed, and the load balancing of incoming requests (who has priority for a contested plane seat when there is only one left at that price).

At the end of the day, when it comes to CAP, Availability is king.  It will drive business away or attract it, so C or P must give in order to cater to the customer.  If I were designing this system, I would run an AP system but conduct the repartitioning when the load/demand on the database system is small (off-peak hours), to give the illusion of a CA system, because the consistency degradation would be seen by fewer people.  Off-peak hours do not exist for global companies, mobile web services, or websites, but there are times throughout the year when transaction volume against the database system is smaller than on normal days, so scheduling around those days is key.  For a mobile transaction system, I would select a Pro-motion transaction model, which helps comply with the ACID properties: make updates locally on the mobile device when services are down, and keep an ordered queue of the remaining transactions waiting to be committed once wireless service is restored or a stronger signal is found.

References

  • Abadi, D. J. (2012). Consistency tradeoffs in modern distributed database system design: CAP is only part of the story. IEEE Computer Society, (2), 37-42.
  • Khachana, R. T., James, A., & Iqbal, R. (2011). Relaxation of ACID properties in AuTrA, The adaptive user-defined transaction relaxing approach. Future Generation Computer Systems, 27(1), 58-66.
  • Mapanga, I., & Kadebu, P. (2013). Database Management Systems: A NoSQL Analysis. International Journal of Modern Communication Technologies & Research (IJMCTR), 1, 12-18.
  • Panda, P. K., Swain, S., & Pattnaik, P. K. (2011). Review of some transaction models used in mobile databases. International Journal of Instrumentation, Control & Automation (IJICA), 1(1), 99-104.

Adv DB: Key-value DBs

NoSQL and Key-value databases

A recap from my last post: “Not Only SQL” databases, best known as NoSQL, include aggregate-oriented databases like key-value, document, and column-family stores (Sadalage & Fowler, 2012). Aggregates are related sets of data that we would like to treat as a unit (MUSE, 2015c). Relationships between units/aggregates are captured in the relational mapping (Sadalage & Fowler, 2012). A key-value database maps aggregate data to a key; the aggregate is embedded as the value associated with that key.

Consider a bank account: my social security number may be used as the key to bring up all my accounts: my checking, my two savings, and my mortgage loan.  The aggregate is my set of accounts, but the savings, checking, and mortgage loan accounts behave differently and can live in different databases distributed across different physical locations.
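
A minimal Python sketch of that idea (the key format and account fields are illustrative only): the whole account aggregate is stored as the value, and the database only understands the key.

    # A plain dict stands in for the key-value store; the value is the whole aggregate.
    kv_store = {
        "ssn:123-45-6789": {
            "checking": {"balance": 1250.75},
            "savings": [{"balance": 3000.00}, {"balance": 800.00}],
            "mortgage": {"principal": 185000.00, "rate": 0.0375},
        }
    }

    # One lookup by key returns the entire aggregate...
    accounts = kv_store["ssn:123-45-6789"]

    # ...and any structure inside it is the application's business, not the database's.
    total_savings = sum(acct["balance"] for acct in accounts["savings"])
    print(total_savings)   # 3800.0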

These NoSQL databases can be schemaless, where data can be stored without any predefined schema.  NoSQL is best for application-specific databases, not as a substitute for all relational databases (MUSE, 2015b).  NoSQL databases can also have an implicit schema, where the data definitions live in the application that reads and writes the data rather than in the database itself.

MapReduce & Materialized views

According to Hortonworks (2013), MapReduce’s process at a high level is: Input -> Map -> Shuffle and Sort -> Reduce -> Output.

Jobs: Mappers create and process transactions on a data set filed away in a distributed system and place the wanted data into a map/aggregate under a certain key.  Reducers know what the key values are and take all the values stored under the same key, possibly from different nodes in the cluster (per the distributed system), from the mappers in order to reduce the data to what is relevant (MUSE, 2015a; Hortonworks, 2013). Different reducers can work on different keys.

Benefit: MapReduce knows where the data is placed, so it carries out the tasks/computations at the data (on whichever node in the distributed system the data is located).  Without MapReduce, tasks/computations take place only after moving data from one place to another, which can eat up computational resources (Hortonworks, 2013).  From this, we know that the data is stored in a cluster of multiple processors, and what MapReduce tries to do is map the data (generate new data sets and store them in a key-value database) and reduce the data (data from one or more maps is reduced to a smaller set of key-value pairs) (MUSE, 2015a).

Other advantages:  Map and reduce functions can work independently, while the grouper (which groups key-values by key) and the master (which divides the work amongst the nodes in a cluster) coordinate all the actions, and the whole job can run very fast (Sathupadi, 2010).  However, depending on how the task is divided, the work of the mapping and reducing functions can vary greatly amongst the nodes in a cluster.  Nothing has to happen in sequential order, and a node can act as a mapper and/or a grouper at any point during the transaction request.

A great example of a MapReduce request is to look at all CTU graduate students and sum up their current outstanding school loans per degree level.  The final output from our example would be the doctoral students’ current outstanding school loan amount and the master’s students’ current outstanding school loan amount.  If I ran this in Hadoop, I could use 50 nodes to process the request.  The data that gets thrown out in the mapper phase would be the undergraduate students.  Doctoral students get one key and master’s students get another, and those keys are the same across all nodes, so the sums of the outstanding school loan amounts are collected under the correct group.
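
Here is a toy, single-machine Python sketch of that job (the record fields and amounts are made up for illustration): the map step filters out undergraduates and emits (degree level, loan amount) pairs, the shuffle step groups the pairs by key, and the reduce step sums each group, mimicking what Hadoop would spread across the 50 nodes.

    from collections import defaultdict

    students = [
        {"name": "A", "level": "Doctoral", "loan": 42000},
        {"name": "B", "level": "Masters", "loan": 18000},
        {"name": "C", "level": "Undergraduate", "loan": 9000},   # filtered out by the mapper
        {"name": "D", "level": "Doctoral", "loan": 55000},
    ]

    def mapper(record):
        """Emit (key, value) pairs for graduate students only."""
        if record["level"] in ("Doctoral", "Masters"):
            yield record["level"], record["loan"]

    def reducer(key, values):
        """Sum all loan amounts that arrived under the same key."""
        return key, sum(values)

    # Shuffle/sort: group the mapped pairs by key, as the framework would across nodes.
    groups = defaultdict(list)
    for record in students:
        for key, value in mapper(record):
            groups[key].append(value)

    results = dict(reducer(k, v) for k, v in groups.items())
    print(results)   # {'Doctoral': 97000, 'Masters': 18000}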

Resources

Adv DB: Document DBs

Main concepts

Data models are how we see, interact with, and transform our data in a system like a database (MUSE, 2015). To a developer, a data model is an ERD, whereas a metamodel describes how a database organizes data in four key ways: key-value, document, column-family, and graph databases (although graph databases are not aggregate-oriented) (Sadalage & Fowler, 2012).

In relational data models, tuples are sets of values (divided and stored information) that cannot be nested or placed within one another, so all operations must be thought of as reading or writing tuples.  With aggregate data models, we want to work with more complex structures (key-value, column-family, and document) rather than just tuples (Sadalage & Fowler, 2012). Aggregates are related sets of data that we would like to treat as a unit (MUSE, 2015). Relationships between units/aggregates are captured in the relational mapping, and a relational or graph database has no idea that the aggregate exists; such databases are known as “aggregate-ignorant” (Sadalage & Fowler, 2012).

Let’s consider UPS.  For transactions like those on amazon.com or ebay.com, a distributor only needs to know the shipping address, but paypal.com or your bank cares about the billing address in order to credit your account.  UPS must collect both.  Thus, in its relational model, UPS may have an ERD with two entities: Billing Address and Shipping Address.  Naturally, we can group these into one unit (aggregate) called Address, with an indicator/key stating which address is which.  I can then query that key for shipping addresses.
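
A minimal sketch of that aggregate (the field names are my own illustration): both addresses live inside one Address unit, and the indicator key tells the application which is which.

    # The whole Address aggregate for one UPS customer, stored and retrieved as a unit.
    customer = {
        "customer_id": "UPS-0042",
        "addresses": [
            {"kind": "billing",  "street": "1 Bank Plaza", "city": "Tulsa",  "zip": "74101"},
            {"kind": "shipping", "street": "99 Dorm Hall", "city": "Norman", "zip": "73019"},
        ],
    }

    # "Query the key for shipping addresses": filter inside the aggregate by the indicator.
    shipping = [a for a in customer["addresses"] if a["kind"] == "shipping"]
    print(shipping[0]["city"])   # Norman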

Finally, atomic operations are supported on a single aggregate at a time, and ACID is not followed for transactions across multiple aggregates at a time (Sadalage & Fowler, 2012).

Document Databases

A document database is able to look inside the structure of a unit, so a query can return a subset/part of the aggregate (Sadalage & Fowler, 2012). You can think of this as returning a chapter or a section of a document (MUSE, 2015).  Documents can be limited by size restrictions and by what can be placed in them (structure and type).  People can blur the line between document and key-value databases by adding an ID field, but for the most part you will query a document database rather than look up a key or ID (Sadalage & Fowler, 2012).
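
Continuing the UPS example, here is a minimal sketch of what "returning a subset of the aggregate" means; this is pure Python standing in for a document database's query and projection, and the syntax of a real system such as MongoDB differs.

    documents = [
        {"customer_id": "UPS-0042", "kind": "shipping", "street": "99 Dorm Hall", "city": "Norman"},
        {"customer_id": "UPS-0042", "kind": "billing",  "street": "1 Bank Plaza", "city": "Tulsa"},
        {"customer_id": "UPS-0099", "kind": "shipping", "street": "5 Main St",    "city": "Austin"},
    ]

    def find(collection, criteria, projection):
        """Toy query: match documents on criteria, return only the projected fields."""
        for doc in collection:
            if all(doc.get(field) == value for field, value in criteria.items()):
                yield {field: doc[field] for field in projection if field in doc}

    # Query inside the documents and get back only part of each aggregate.
    for row in find(documents, {"kind": "shipping"}, ["customer_id", "city"]):
        print(row)
    # {'customer_id': 'UPS-0042', 'city': 'Norman'}
    # {'customer_id': 'UPS-0099', 'city': 'Austin'}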

Pros and Cons of Aggregate Data model

Aggregate orientation makes manipulation of data, replication, and sharding easier because they can be done unit by unit; without aggregates we would have to search every record to work out what belongs together.  Thus it can help in some cases and not in others.  Also, there is no single correct way to draw aggregate boundaries, which can add to the complexity of understanding a data model.  Aggregates are great if we want to run our transactions on as few nodes as possible in a cluster, since dealing with whole units is easier on a cluster (Sadalage & Fowler, 2012).  They are not great for mapping out the relationships between units of different formats (MUSE, 2015).

References:

Adv DB: NoSQL DB

Emergence

Relational databases will persist due to ACID, ERDs, concurrency control, transaction management, and SQL capabilities.  It also helps that major software packages integrate easily with these databases.  But the reason so many new approaches keep popping up is the impedance mismatch, and its resource cost on computational systems, when data is pulled and pushed between in-memory structures and databases.  This resource cost can compound quickly with large amounts of data.  Industry wants and needs to use parallel computing with clusters to store, retrieve, and manipulate large amounts of data.  Data can also be aggregated into units of similar items, and strict data consistency can be relaxed in real-life applications, since transactions can actually be divided into multiple phases (MUSE, 2015a).

Think of a bank transaction: not all transactions you make at the same time get processed at the same time.  They may show up on your mobile device (a mobile database) but not be committed until a few hours or days later.  In my case, the bank withdraws my mortgage payment from my checking account on the first of every month but applies it to the loan on the second, so for 24 hours my payment is pending.

The aforementioned ideas have created a movement to support “Not Only SQL” databases, best known as NoSQL, a name derived from the Twitter hashtag #NoSQL.  NoSQL includes aggregate databases like key-value, document, and column-family stores, as well as aggregate-ignorant databases like graph databases (Sadalage & Fowler, 2012). These can be schemaless databases, where data can be stored without any predefined schema.  NoSQL is best for application-specific databases, not as a substitute for all relational databases (MUSE, 2015b).

Originally the term referred to open-source, distributed, nonrelational databases like Voldemort, Dynomite, CouchDB, MongoDB, and Cassandra, but it has expanded in its definition and in the applications/platforms it can take on.  CQL comes from Cassandra and was written to act like SQL in most cases but to behave differently when needed (Sadalage & Fowler, 2012), hence the “No” in NoSQL.

Suitable Applications

According to Cassandra Planet (n.d.), NoSQL is best for large data sets (big data, complex data, and data mining):

  • Graph: where data relationships are graphical and interconnected like a web (ex: Neo4j & Titan)
  • Key-Value: data is stored and indexed by a key (ex: Cassandra, DynamoDB, Azure Table Storage, Riak, & BerkeleyDB)
  • Column Store: stores tables as columns rather than rows (ex: HBase, BigTable, & HyperTable)
  • Document: can store more complex data, with each document having a key (ex: MongoDB & CouchDB).

System Platform

Relational databases carry a resource cost, and as industry deals with ever larger amounts of data, we gravitate towards NoSQL.  To process all that data, we may need parallel computing with clusters to store, retrieve, and manipulate it.

 References:

Adv DB: Web DBMS Tools

Developers need tools to design web-DBMS interfaces for dynamic use of their sites, whether for e-commerce (the Amazon storefront), decision making (National Oceanic and Atmospheric Administration weather forecast products), gathering information (SurveyMonkey), etc.  ADO.NET and Fusion Middleware are two of the many tools and middleware products that can be used to develop web-to-database interaction (MUSE, 2015).

ADO.NET (Connolly & Begg, 2014)

Microsoft’s approach to web-centric middleware for the web-database interface, which provides compatibility with the .NET class library, support for XML (used extensively as an industry standard), and connected/disconnected data access.  It has two tiers: the DataSet (a collection of data tables, expressed in XML) and the .NET Framework data provider (Connection, Command, DataReader, and DataAdapter objects for the database).

Pros: Built on standards so that non-Microsoft products can use it.  It automatically creates XML interfaces so the application can be turned into a web service, and even the .NET classes conform to XML and other standards.  Other development tools that further expand the GUI set can be added and bound to the web service.

Cons: According to the Data Developer Center website (2010), with connected data access you must explicitly manage all database resources, and failing to do so can cause resource mismanagement (connections are never freed up).  Some functionality is also missing from certain classes, like mapping to table-valued functions in the Entity Framework.

Fusion Middleware (Connolly & Begg, 2014):

Oracle’s approach to web-centric middleware for the web-database interface, which provides development tools, business intelligence, content management, etc.  It has three tiers: Web (using Oracle Web Cache and HTTP Server), Middle Tier (applications, security services, WebLogic servers, other remote servers, etc.), and Data (the database).

Pros: Scalable. It is based on the Java platform (a full Java EE 6 implementation).  It allows Apache modules such as those that route HTTP requests, run stored procedures on a database server, provide transparent single sign-on, support S-HTTP, etc. Its business intelligence function lets you extract and analyze data to create reports and charts (statically or dynamically) for decision analysis.

Cons: The complexity of their system along with their new approach creates a steep learning curve, and requires skilled developers.

The best approach for me was Microsoft’s: if you want to connect to many other Microsoft applications, this is one route to consider, and it has a gentle learning curve (from personal experience).  Another aspect: when I was building apps for the library at the University of Oklahoma, the DBAs and I did not really like the basic GridView functionality, so we exploited the aforementioned ability to interface with third-party code to create a more interactive table view of our data.  What is also nice is that our data was in an Oracle database, and all we had to do was switch the connection from SQL Server to Oracle without needing to change the GUI code.

Resources

Adv DB: Web-DBMS

With 2.27 billion web users in 2012, we are talking about a significant portion of the world population (Connolly & Begg, 2014), and HTTP allows connections between the web server, the browser, and the data.  The internet is platform-independent, but GUIs are developed for end-users to access the sites and information they seek.  The web can be static, as with sites that only present information about a company (a webpage dedicated to presenting the vision statement of Pizza Hut), or dynamic, where it accepts user input and produces output (Domino’s pizza form to order and pay for a pizza online and have it delivered 30 minutes from that order).  The latter can be considered Web-DBMS integration, which provides access to corporate data, connections to data, the ability to connect to a database independently of the web browser, scalability, room for growth, changes based on permission levels, etc. (Connolly & Begg, 2014).

Advantages and disadvantages of Web-DBMS

According to Connolly & Begg (2014), the advantages could be:

  • Platform independence: Browsers like Chrome, Firefox, Edge, etc. can interpret basic HTML code and data connections without the code having to be modified for each operating system (OS)/platform that exists.
  • Graphical User Interface (GUI): Traditional access to databases is through command-line entries.  End users expect to see forms for inputting data, not SQL commands or command lines, and they expect to use search terms in their natural language rather than saying “Select * From …”.  Meeting the end users’ expectations and demands can mean higher profits.
  • Standardization: It allows a standard to be adopted on the back end and the front end (GUI), since the data set is now visible to the world.  Doing so makes it easier to connect to the data and write code.  Also, using HTML allows any machine with connectivity and a browser to access the GUI and the data.
  • Scalable deployment: Separating the server from the client allows for easy upgrading and management across multiple OSs/platforms.
  • Cross-platform support: Safari is available for Macs, Edge is available for Windows, Firefox is available for all OSs, etc. These browsers provide support for the service they offer, so coders only need to worry about interfacing with them via HTML.  Thus, connecting to a data source via HTML is possible with no need to write code for each OS/platform.

Whereas the disadvantages could be:

  • Security: SQL injection attacks can be made if there are text fields to be filled in for form analysis, feedback, or comment gathering.
  • Cost: The infrastructure is expensive. Facebook collects user data, posts, images, videos, etc., so it is vital to have all that data backed up and readily available to all people with the right permissions. There is also the cost of staff to keep the database from being exploited by those not permitted to see the data.  On average, it can vary from $300K to $3.4M depending on the organization’s needs, market share, and the purpose of the site.
  • Limited functionality of HTML or HTML5: Things/transactions that can be done easily in SQL are much harder given what you can and cannot do in HTML and the other code you can tie to HTML, like JavaScript, PHP, etc., which complicates the code overall. As the internet moves toward more interactive and dynamic use, capabilities could be added to HTML to make the back-end coding of a Web-DBMS easier and provide more functionality.
  • Bandwidth: If you have millions of people accessing your services online all at once, like Facebook, you must ensure that they can access the data with high Availability, Consistency, and Partition Tolerance.
  • Performance: A 100 ms delay can significantly reduce an end-user’s retention and future repeat transactions, according to Abadi (2012). This can be brought on by bandwidth issues or a security breach slowing down your resources, creating a high cost.  Also, this interface is slower than directly connecting to a database through a traditional database client.

All of the disadvantages mentioned are interconnected; one can cause another.

The best approach to integrate Web and DBMS

Let’s take Domino’s Pizza (its online ordering service).  It could use the Microsoft Web Solution Platform (.NET, ASP, and ADO) to connect to its data sources, and that platform is database independent (SQL Server, Oracle, etc.).  For example, ASP is a programming model for dynamic end-user interaction, but .NET has many more tools, services, and technologies for further end-user interaction with data, databases, and websites (Connolly & Begg, 2014).  Using a web solution platform will give Domino’s a means of receiving inputs through a form, writing to its ordering database (which can later be used for better inventory decision analysis), and taking credit card payments online.  Then, once all that data has been taken in, an image/task completion bar can be shown to the customer indicating where their order currently is in the process.  Domino’s can also use cookies to save end-user preferences for future orders and to speed up the user’s interaction with the page.

Resources

  • Abadi, D. J. (2012). Consistency tradeoffs in modern distributed database system design: CAP is only part of the story. IEEE Computer Society, (2), 37-42.
  • Connolly, T., & Begg, C. (2014). Database Systems: A Practical Approach to Design, Implementation, and Management, 6th Edition. [VitalSource Bookshelf version]. Retrieved from http://online.vitalsource.com/books/9781323135761/epubcfi/6/2

Adv DB: Indexes for query optimization

Information sought in a database can be extracted through a query.  However, the bigger the database, the longer it takes for a query to be processed, hence query optimization techniques are used.  Another reason to optimize is complex query operations.

Why you rarely see an index applied to every column in every table

Using indexes for query optimization is like using the index at the back of a book to help you find the information/topic you need quickly. You could always scan all the tables, just as you could read the entire book, but that is not efficient (Nevarez, 2010).  You can get an index seek (ProductID = 77) or force a scan by wrapping the column in a function (ABS(ProductID) = 77), and a scan takes up more resources than a seek.  You can also combine them (ProductID = 77 AND ABS(SalesOrderID) = 12345), where you would seek via ProductID and scan for SalesOrderID.  Indexing is an effective way to optimize your queries, alongside other methods like applying heuristic rules or ordering the query operations for efficient use of resources (Connolly & Begg, 2014).  However, indexes that are never used are of no use to us; they take up space on the system (Nevarez, 2010), which can slow down operations, so they should be removed.  That is why indexing should not be applied to every column in every table.  Indexing every column may also be unnecessary depending on the size of the table: it is not needed for a 3-column by 4-row table, but may be needed for a 30,000-row by 12-column table.
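
A quick way to see the seek-versus-scan difference for yourself is SQLite's EXPLAIN QUERY PLAN; here is a small sketch using Python's built-in sqlite3 module, where the table and column names simply echo the examples above and the exact plan text varies by SQLite version.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE SalesOrderDetail (SalesOrderID INTEGER, ProductID INTEGER)")
    conn.execute("CREATE INDEX idx_product ON SalesOrderDetail (ProductID)")

    # Sargable predicate: the optimizer can seek straight into the index.
    for row in conn.execute(
            "EXPLAIN QUERY PLAN SELECT * FROM SalesOrderDetail WHERE ProductID = 77"):
        print(row)   # e.g. ... SEARCH SalesOrderDetail USING INDEX idx_product (ProductID=?)

    # Wrapping the column in a function hides it from the index: a full scan instead.
    for row in conn.execute(
            "EXPLAIN QUERY PLAN SELECT * FROM SalesOrderDetail WHERE ABS(ProductID) = 77"):
        print(row)   # e.g. ... SCAN SalesOrderDetail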

Thoughts on how to best manage data files in a database management system (DBMS)

Never assume; verify any changes you make with cold, hard data. When considering how best to manage a database, one must first learn whether the data files, or the data within the database, are dynamic (users create, insert, update, and delete regularly) or static (changes are minimal to non-existent) (Connolly & Begg, 2014).  Database administrators need to know when to fine-tune their databases with useful indexes on tables that are widely used and to drop those that are not used at all; dropping the unused ones saves space, speeds up update operations, and improves resource utilization (Nevarez, 2010). Knowing this helps us understand the nature of the database users. We can then rewrite queries so that they are optimized: ordering operations correctly, removing unnecessary loops and using joins instead, using join, right join, or left join properly, avoiding the wildcard (*) and calling only the data you need, and ensuring proper use of internal temporary tables (those created on a server while querying).  Also, when timing queries, make sure to test a first run against itself and avoid accidentally timing data already held in the cache. Caching your results, that is, using your system’s cache when processing queries, is ideal.  A disadvantage of creating too many tables in the same database is slower interaction times, so creating multiple databases with fewer tables (as best logic permits) may be a great way to help with caching your results (MySQL 5.5 Manual, 2004).

Resources

Adv DB: Transaction management and concurrency control

Transaction management, in a nutshell, is keeping track of (serializing or scheduling) the changes made to a database.  An overly simplistic example is debiting $100 and crediting $110.  If the account balance is currently $90, the order of these transactions is vital to avoid overdraft fees.  Concurrency control, in turn, is used to ensure data integrity when transactions occur, making the two interconnected.  Thus, in our example, serializing the transactions (running all actions consecutively) is key: you want to add the $110 first so you have $200 in the account from which to debit $100.  To do this you need timestamp ordering/serialization.  This became a serious issue back in 2010 and was still an issue in 2014 (Kristof, 2014): in a survey of 44 major banks, half still re-ordered transactions, which can drain account balances and cause overdraft fees.  Banks usually get around this by setting processing times for deposits that are typically longer than the processing times for charges; thus, even when transactions are handled correctly and serially, the processing time per transaction can vary so significantly that these issues still happen.  According to Kristof (2014), banks say they do this to process payments in order of priority.
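
A minimal sketch of timestamp ordering for the example above (my own illustration, with made-up timestamps): sorting the day's operations by their timestamps before applying them serially ensures the $110 credit lands before the $100 debit.

    # (timestamp, amount) pairs; positive = credit, negative = debit.
    operations = [
        ("2014-06-01T14:05:00", -100.0),   # the debit arrives later in the day...
        ("2014-06-01T09:30:00", +110.0),   # ...but the credit was made first
    ]

    balance = 90.0
    overdrafts = 0

    # Serial, timestamp-ordered execution: apply each change in the order it occurred.
    for ts, amount in sorted(operations):          # ISO timestamps sort correctly as strings
        balance += amount
        if balance < 0:
            overdrafts += 1

    print(balance, overdrafts)   # 100.0 0 -- re-ordering (debit first) would trigger a fee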

The case above illustrates why an optimistic concurrency control method is not helpful here.  It is not helpful because serialization is not checked when the transactions are initially carried out (which can impose a high cost in resources when conflicts are found later); instead, optimistic transactions are done locally and validated against a serializable order before being finalized.  Here, if we started at the first of the month, paid a bunch of bills, realized we were close to $0, deposited $110, and then continued paying bills to the sum of $100, the validation and re-processing can eat up a lot of time, and it gets complicated quickly.  Conservative concurrency control has the fewest aborts and eliminates wasted processing by doing things serially, but you cannot run things in parallel.
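
For contrast, here is a minimal sketch of the optimistic approach (the class and method names are illustrative only): each transaction works against a local snapshot and records the version it read, and the commit is validated against the current version, aborting and retrying if another transaction got there first.

    class VersionedRecord:
        """Hypothetical sketch of optimistic (validate-at-commit) concurrency control."""
        def __init__(self, value):
            self.value = value
            self.version = 0

        def read(self):
            # Work proceeds locally against this snapshot; no locks are taken.
            return self.value, self.version

        def commit(self, new_value, read_version):
            # Validation phase: only commit if nothing changed since we read.
            if read_version != self.version:
                return False            # conflict -> the caller must redo its work
            self.value = new_value
            self.version += 1
            return True

    account = VersionedRecord(90.0)

    # Two transactions read the same snapshot...
    bal_a, ver_a = account.read()      # wants to credit $110
    bal_b, ver_b = account.read()      # wants to debit $100

    print(account.commit(bal_a + 110.0, ver_a))   # True  -- first commit validates
    print(account.commit(bal_b - 100.0, ver_b))   # False -- stale version, must retry
    bal_b, ver_b = account.read()
    print(account.commit(bal_b - 100.0, ver_b))   # True  -- retried against fresh data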

Huge amounts of incoming data, like those from the Internet of Things (where databases need to be flexible and extensible because a projected trillion different items would be producing data), would benefit greatly from optimistic concurrency control.  Take the example of a Fitbit, Apple Watch, or Microsoft Band: it records data on you throughout the day.  Because the massive data stream is time-stamped and heterogeneous, it does not matter whether the data for sleep and the data for walking are processed in parallel, as long as everything is validated in the end.  This allows for faster upload times over Bluetooth and/or Wi-Fi.  Data can be actively extracted and explained in real time, but when there are many sensors on the device, the data and sensors all have different reasoning rules and semantic links between data, where existing or deductive links between sources exist (Sun & Jara, 2014), and that is where the true meaning of the generated data lies.  Sun and Jara suggest that a solid mathematical basis will help ensure a correct and efficient data storage system and model.

Resources

Words matter: Customize to configure

Let’s look at some definitions used in software development and the nuance each one carries:

Customize: to modify or supplement with internal code development in order to match end-user requests; the changes may not be preserved during an upgrade. This is analogous to hacking into a game like Pokemon Go and enabling end-users to spoof their locations in order to obtain regional exclusive pocket monsters.

Tailoring: modifying or supplementing without code to fit a system into an environment.  This is analogous to downloading Pokemon Go from the Google Play store or Apple App Store, where the right version of the app is downloaded for the right environment.

Personalization: meeting the customer’s needs effectively and efficiently.  This is achieved by analyzing customer data and using predictive analytics.  A great example is using the Adventure Sync tool to encourage players of Pokemon Go to be more active, recognizing that there are three reward tiers for active players and personalizing the rewards based on the level achieved.  Personalization can also be seen in character customization: clothing, poses, and buddy Pokemon.

Configure: the process of setting up options and features, tailored to meet the implementation of business requirements.  In Pokemon Go, some people want to complete the Pokedex, some like the gym system, some like 1:1 battles, 1:1 trades, side quests, beating the villains, etc. You can configure your goals in the game by pursuing one or all of these, to whatever extent you want, meeting your own requirements for satisfaction in playing the game.

Now if we want to think of these concepts on a continuum:

Customize <——- Tailoring ——- Personalization ——-> Configuring

where the cost of complexity decreases from right to left, constriction in growth decreases from right to left, and profit margin decreases from right to left.

The question now becomes: is the additional complexity on this spectrum worth the extra cost incurred?

That is for you to decide.


3 conferences in Computer Science and 3 conferences in Big Data

3 scholarly conferences that focus on algorithms, programming languages, managing telecommunications software engineering, managing corporate information resources, and managing partnership-based IT operations:

  1. Advance International Conference on Telecommunications: http://www.iaria.org/conferences2015/AICT15.html
  2. IEEE International Conference on Software, Telecommunications and Computer Networks (SoftCOM) http://www.globaleventslist.elsevier.com/events/2015/09/international-conference-on-software-telecommunications-and-computer-networks-softcom/
  3. IEEE Global Communication Conference Exhibition & Industry Forum (GLOBECOM) http://globecom2014.ieee-globecom.org/

3 conferences that cover Big Data:

1. IEEE International Conference on BigData: http://cci.drexel.edu/bigdata/bigdata2014/programschedule.htm

A conference that provides student travel awards to help subsidize the cost, thanks to the National Science Foundation. It was held in Washington, DC in 2014, and it also offers a doctoral symposium. Keynote speeches included “Never-Ending Language Learning”; “Smart Data – How you and I will exploit Big Data for personalized digital health and many other activities”; and “Addressing Human Bottlenecks in Big Data.” Reading the keynote abstracts, I found this quote to be true of my job in the past year: “… the key bottlenecks lie with data analysis and data engineers, who are routinely asked to work with data that cannot possibly be loaded into tools for statistical analytics or visualization” (IEEE, 2014). Another keynote talks about NELL (Never-Ending Language Learner), an artificial intelligence learning machine that runs 24 hours per day learning to read the web, extracting knowledge, and creating beliefs. It is starting to reason over its extracted knowledge; it recently learned that “inaccuracy is an event outcome” (NELL, 2015).

2. Data Lead: http://www.datalead2014.com/

Held in Paris, France in 2015 and in Berkeley, California, around October and November. This is the second year of this annual conference, run in partnership with the University of California, Berkeley’s Haas School of Business. Its goal is to spark an international conversation on the application of big data to business processes and issues. There is a particular focus on issues revolving around finance and marketing, though it caters to the sciences, education, government, etc. They see big data as an economic commodity.

3. IARIA International Conference on Data Analytics: http://www.iaria.org/conferences2015/DATAANALYTICS15.html

The fourth conference held by IARIA, in Nice, France, in mid-2015. Topics in this conference deal with fundamentals, mechanisms, and features; sentiment/opinion analytics; target analytics; big data; knowledge discovery; visualization; filtering data; relevant/redundant/obsolete analytics; predictive analytics; trust in data; legal issues; cyber threats; etc. Two biannual peer-reviewed journals, published since 2008, are associated with this group: the International Journal on Advances in Software and the International Journal on Advances in Intelligent Systems. Other conferences from IARIA include SoftNet, InfoWare, NetWare, NexTech, DataSys, BioSciencesWorld, InfoSys, and NexComm, all in Europe.