Adv DB: Web-DBMS

With 2.27 billion users of the web in 2012, we are talking about a significant portion of the world population (Connolly & Begg, 2014) and HTTP allows connection to between the web server, browser, and data.  The internet is platform-independent, but GUIs are developed for the end-user to access sites and information that they seek.  The web can be static, as those that present just information of the company (a webpage dedicated to presenting the Vision Statement of Pizza Hut) or can be considered dynamic, where it accepts user input and produces an output (Domino’s pizza form to order and pay for a pizza online and have it delivered 30 minutes from that order).  The latter can be considered as a Web-DBMS integration, which provides access to corporate data, connection to data, ability to connect to a database independently from a web browser, is scalable, allows for growth, allows for changes based on permission levels, etc. (Connolly & Begg, 2014).

Advantages and disadvantages of Web-DBMS

According to Connolly & Begg (2012), the advantages could be:

  • Platform independence: Browsers like Chrome, Firefox, Edge, etc. can be used to interpret basic HTML code and data connections to run without a need to modify the code to meet the needs for each Operating Systems (OS)/platforms that exists.
  • Graphical User Interface (GUI): Traditional access to databases is through command line based entries.  End users expect to see forms for inputting data, not SQL commands or command lines, they also expect to use search terms in their natural language rather than saying “Select * From …”.  Meeting the end user’s expectations and demands can mean high profits.
  • Standardization: It allows for a standard to be adopted in the back end and front end (GUI), since now this data set is visible to the world.  Doing so makes it easier to connect to the data and write code.  Also, using HTML it can allow any machine with connectivity and a browser to access the GUI and the data.
  • Scalable deployment: Separating the server from the client, allows for easy upgrading and managing across multiple OS/platforms
  • Cross-platform support: Safari is available for Macs, Edge is available for Microsoft, Firefox is available for all OS, etc. These browsers provide support for the service they provide, so the coders only need to worry about interfacing with them via HTML.  Thus, connecting to a data source via HTML is possible, with no need for writing code to deal with each OS/platform.

Whereas the disadvantages could be:

  • Security: SQL attacks can be made if there are text fields to be entered for form analysis, feedback, or comment gathering.
  • Cost: High cost of the infrastructure, Facebook collects user data, posts, images, videos, etc. and thus is very vital to have all that data backed up and available readily by all people with the right permissions. Also, the cost of staff to upkeep the database from being exploited from those not permitted to all the data that they have.  On average it can vary from $300K to $3.4M depending on the organization’s needs, market share, the purpose for their site.
  • Limited functionality of HTML or HTML5: Things/transactions that can be done easily in SQL are much harder based on what you can and cannot do in HTML and other code you can tie to HTML like Javascript, PHP, etc., which complicates the code overall. As the internet is moving more to an interactive and dynamic capability could be added to HTML to make the back end coding of Web-DBMS easier and provide a ton of functionality.
  • Bandwidth: If you have millions of people accessing your services online all at once, like Facebook, you must ensure that they can access the data with high Accessibility, Consistency, and Partition Tolerance.
  • Performance: A 100ms delay can significantly reduce an end-user’s future retention and future repeat transactions according to Abadi in 2012. This can be brought on by bandwidth issues, or a security breach, slowing down your resources, creating a high cost.  Also, this interface makes it slower than if it were just directly connecting to a database through a traditional database client.

All these disadvantages that were mentioned are all interconnected, one thing can cause another.

The best approach to integrate Web and DBMS

Let’s take the Domino’s Pizza (ordering online service).  They could use a Microsoft Web Solution Platform (.NET, ASP, and ADO) to connect to their data sources, that is database platform independent (SQL, Oracle, etc.).  For example, the ASP is a programming model for dynamic end-user interaction, but .NET has many more tools, services and technologies for further end-user interaction with data, databases, and the websites (Connolly & Begg, 2014).  Using a web solution platform will give Domino’s a means of receiving inputs through a form, to write to their ordering database (which then can be used in the future for better inventory decision analysis), and taking credit card payments online.  Then once all that data has been taken in, an image/task completion bar can be shown to the customer on where their order is currently at in their process.  They can also save cookies to save end-user preferences for future orders and speeding up the user’s interaction with their page.

Resources

  • Abadi, D. J. (2012). Consistency tradeoffs in modern distributed database system design: CAP is only part of the story. IEEE Computer Society, (2), 37-42.
  • Connolly, T., & Begg, C. (2014). Database Systems: A Practical Approach to Design, Implementation, and Management, 6th Edition. [VitalSource Bookshelf version]. Retrieved from http://online.vitalsource.com/books/9781323135761/epubcfi/6/2

Adv Topics: The architecture of the Internet

Introduction

Kelly (2007) stated that there are 100 billion clicks per day and 55 trillion links. But, the internet is very pervasive to the human kind. In 2012, 2.27 billion people used the internet. However, globally 1.7 billion people are actively engaged with the internet (Li, 2010; Sakr, 2014). An actively engaged user of the internet would be using social media, where they can watch videos, share status updates, and curated content, with a goal of actively engaging other users (Li, 2010). Cloud-based services rely on the internet to provide services like distributed processing, large scale data storage, support distributed transactions, data manipulation, etc. (Brookshear & Brylow, 2014; Sakr, 2014). Thus, the internet plays a vital role in enabling big data analysis and social engagement, given its humble beginning for storing research projects between multiple research institutions in the 1960s (Brookshear & Brylow, 2014).

Internet Architecture

The internet has evolved into a socio-technical system. This evolution has come about in five distinct stages:

  • Web 1.0: Created by Tim Berners-Lee in the 1980s, where it was originally defined as a way of connecting static read-only information hosted across multiple computational components primarily for companies (Patel, 2013). The internet was a network of computational networks and where communication that is done throughout the internet is governed by the TCP/IP protocol (Brookshear & Brylow, 2014). The internet’s architecture relies on three components: Uniform Resource Identifier (URI), Hyper Text Transfer Protocol (HTTP), and HyperText Markup language (HTML) (Jacobs & Walsh, 2004).
    • A URI is a unique address that is common across all corners of the web and agreed upon convention by the internet community, which is made up of characters and numerical values that allow an end-user of the internet to locate and retrieve information (Brookshear & Brylow, 2014; Jacobs & Walsh, 2004; Patel 2013). This unique addressing convention allows the internet to store massive amounts of information without URI collision, which is when two URI hold the same value, which can confound search engines (Jacobs & Walsh, 2004). An example of a URI could be www.mkhernandez.com.
    • For a computer to locate this URI’s information which is hosted on a web server, a web browser like Goggle chrome, Microsoft edge, or Firefox uses the HTTP protocol to retrieve the information to be displayed by the browser (Brookshear & Brylow, 2014). The web browser would send an HTTP GET request for www.mkhernandez.com via the TCP/IP port 80, and as long as the browser is given access to the information stored in the URI, then the web server sends an HTTP POST or PUT to give the web browser the sought after information (Jacobs & Walsh, 2004). In a sense, HTTP protocols play an important role in information access management.
    • Once the HTTP POST is sent from the web server consisting the information stored in www.mkhernandez.com, the web browser must now convert the information and display it on the computer screen. HTML is a <tag> based notational code that helps a browser read the information sent by the web server and display it in a simple or rich data format (Brookshear & Brylow, 2014; Patel 2013). The <html /> tags are used by the HTTP protocol to identify the content type and encoding style for displaying the information of the web server onto the browser (Jacobs & Walsh, 2004).   The <head /> and <title /> tags are used as metadata about the document. Whereas the <body /> tag contains the information of the URI, and <a /> tags helps you link to other relevant URIs easily.
  • Web 2.0: Changed the state of the internet from a read-only state to a read/write state and had grown communities that hold a common interest (Patel, 2013). This version of the web led to more social interaction, giving people and content importance on the web, due to the introduction of social media tools through the introduction of web applications (Li, 2010; Patel, 2013; Sakr, 2014). Web applications can include event-driven and object-oriented programming that are designed to handle concurrent activities for multiple users and had a graphical user interface (Connolly & Begg, 2014; Sandén, 2011). Key technologies include:
    • Weblogs (Blogs), Video logs (vlogs), and audio logs (podcasts) are all content in various styles that are published chronologically in descending time order, which can be tagged with keywords for categorization and available when people need them (Li, 2010; Patel, 2013). These logs can be used for fact-finding when content is stored chronologically (Connolly& Begg, 2014).
    • Really Simple Syndication (RSS) is a web and data feed format that summarizes data via producing an open standard format, XML file (Patel, 2013; Services, 2015). This is regularly used for data collection (Services, 2015).
    • Wikis are editable and expandable by those who have access to the data, and information can be restored or rolled back (Patel, 2013). Wiki editors ensure data quality and encourage participation from the community in providing meaningful content to the community (Li, 2010).
  • Web 3.0: This is the state the web at 2017. Involves the semantic web that is driven by data integration through the uses of metadata (Patel, 2013). This version of the web supports a worldwide database with static HTML documents, dynamically rendered data, next standard HTML (HTML5), and links between documents with hopes of creating an interconnected and interrelated openly accessible world data such that tagged micro-content can be easily discoverable through search engines (Connolly & Begg, 2014; Patel, 2013). This new version of HTML, HTML5 can handle multimedia and graphical content and introduces new tags like <section />, <article />, <nav />, and <header />, which are great for semantic content (Connolly & Begg, 2014). Also, end-users are beginning to build dynamic web applications for others to interact with (Patel, 2013). Key technologies include:
    • Extensible Markup Language is a tag based metalanguage (Patel, 2013). These tags not limited to the tags defined by other people and can be created at the pace of the author rather than waiting for a standard body to approve a tag structure (UK Web Design Company, n.d.).
    • Resource Description Framework (RDF) is based on URI triples <subject, predicate, object>, which helps describes data properties and classes (Connolley & Begg, 2014; Patel, 2013). RDF is usually represented at the top of the data set with a @prefix (Connolly & Begg, 2014). For instance,

@prefix: <https://mkhernandez.com&gt; s: Author <https://mkhernandez.wordpress.com/about/&gt;. <https://mkhernandez.wordpress.com/about/&gt;. s:Name “Dr. Michael Kevin Hernandez”. <https://mkhernandez.wordpress.com/about/&gt; s:e-mail “dr.michael.k.hernandez@gmail.com.”

  • Web 4.0: It is considered the symbiotic web, where data interactions occur between humans and smart devices, the internet of things (Atzori, 2010; Patel, 2013). These smart devices can be wired to the internet or connected via wireless sensors through enhanced communication protocols (Atzori, 2010). Thus, these smart devices would have read and write concurrency with humans, where the largest potential of web 4.0 has these smart devices analyze data online and begin to migrate the online world into the real world (Patel, 2013). Besides interacting with the internet and the real world, the internet of things smart devices would be able to interact with each other (Atzori, 2010). Sakr (2014) stated that this web ecosystem is built off of four key items:
    • Data devices where data is gathered from multiple sources that generate the data
    • Data collectors are devices or people that collect data
    • Data aggregation from the IoT, people, RFIDs, etc.
    • Data users and data buyers are people that derive value out of the data
  • Web 5.0: Previous iterations of the web do not perceive people’s emotion, but one day it could be able to understand a person’s emotional (Patel, 2013). Kelly (2007) predicted that in 5,000 days the internet would become one machine and all other devices would be a window into this machine. In 2007, Kelly stated that this one machine “the internet” has the processing capability of one human brain, but in 5,000 days it will have the processing capability of all the humanity.

Performance Bottlenecks & Root Causes

There is a trend in terms of “performance bottleneck” to access large-scale Web data as the Web technology evolves. A bottleneck is when the flow of information is slowed down or stopped in its entirety that it can cause a bad end-user experience (TechTarget, 2007, Thomas, 2012). Performance bottlenecks can cause an application to perform poorly towards expectations (Thomas, 2012). As the internet evolved, there are new performance bottlenecks that begin to appear:

  • Web 1.0: When two URI hold the same value, it can confound search engine, hence the move to IPv4 to IPv6 (Jacobs & Walsh, 2004). Transfer protocols rely on network devices: network interface card, firewall, cables, tight security, load balancers, routers, etc., which all affect the flow of information (bandwidth) (Jacobs & Walsh, 2004; Thomas, 2012). Finally, HTTP information is pulled from the web server, thus low capacity computers, broken links, tight security, poor configurations, which can result in HTTP errors (4xx, 5xx), lots of open connections, memory leaks, lengthy queues, extensive data scans, database deadlock, etc. (Thomas, 2012).
  • Web 2.0: Database demands on the web application, because there are more write and read transactions (Sakr, 2014). Plus, all the performance bottlenecks from web 1.0.
  • Web 3.0: Searching for information in the data is tough and time-consuming without a computer processing application (UK Web Design Company, n.d.). Data is tied to the logic and language like HTML without a readily made browser to simply explore the data and therefore may require HTML or other software to process the data (Brewton, Yuan, & Akowuah, 2012; UK Web Design Company, n.d.). Syntax and tags are redundant, which can consume huge amounts of bytes, and slow down processing speeds (Hiroshi, 2007).
  • Web 4.0: IoT creates performance bottlenecks but it also has its issues, if it is left on its own and it is not tied to anything else (Jaffe, 2014; Newman, 2016): (a) the devices cannot deal with the massive amounts of data generated and collected; and (b) the devices cannot learn from the data it generates and collects. Finally, there are also infrastructure potential performance bottlenecks for IoT (Atzori, 2010): (a) the huge number of internet oriented devices that will be taking up the last few IPv4 addresses; (b) things oriented and internet oriented devices could spend a time in sleep mode, which is not typical for current devices using the current IP networks; (c) IoT devices when connecting to the internet produce smaller packets of data at higher frequency than current devices; (d) each of the devices would have to use a common interface and standard protocols as other devices, which can easily flood the network and increase complexity of middleware software layer design; and (e) IoT are vastly heterogeneous objects, where each device with their own function and has its own way of communicating.
  • Web 5.0: Give the assumption of what this version of the web would become, a possible performance bottleneck would be a number of resources consumed to keep the web operating.

 

Overall, Kelly (2007) stated that there are 100 billion clicks per day and 55 trillion links and that there are 2 million emails per second, 1 million IM messages, etc. big data will impact the performance of the web. Big data will be primarily impacting the web server, because of the increasing size of information and potential lack of bandwidth, big data can slow down the throughput performance (Thomas, 2012).

High-level strategies to mitigate

The web should evolve with time to keep up with the demands and needs of society, and it is predicted change significantly. What the web evolves into is yet to be seen, but it will be quite unique. However, with multiple heterogeneous types of devices (IoT) trying to connect to the web, a standardized protocol and interface to the web should be adopted (Atzori, 2010). A move to IPv6 from IPv4 to accommodate the massive number of IoT devices that expected to generate data and connect to the web must happen to accommodate this opportunity to gain more data.

Given, Kelly (2007) centralized view of the web system, the information and data are stored distributively, and better algorithms are needed to connect relevant data and information more efficiently. Data storage is cheap, but at the rate of data creation by IoT and other sources, processing speeds through parallel and distributed processing must increase to take advantage of this explosion of big data into the web. This is the gap created by the fact that data collection is outpacing the speed of data processing and analysis. Given this gap, a data scientist should prioritize which subset of the data is valuable enough to analyze, to solve certain problems. This is a great workaround, however, it ignores the full data set at times, which was generated from a system. It’s not enough to analyze what is deemed to be valuable parts of the data, because another part of the data may reveal more insight (the whole is better than the sum of its parts argument).

NoSQL database types and extensible markup languages have enhanced how information and data are related to each other, but standardization and perhaps an automated 1:1 mapping of the same data being represented in different NoSQL databases may be needed to gain further insights faster. Web application developers should also be more resource conscious, trying to get more computational results for fewer resources.

Resources:

  • Atzori, L., Antonio Iera, A., & Morabito, G. (2010). The Internet of things: A survey. Computer Networks, 54(2). 787–2,805
  • Connolly, T., & Begg, C. (2014). Database Systems: A Practical Approach to Design, Implementation, and Management, (6th ed.). Pearson Learning Solutions. VitalBook file.
  • Sandén, B. I. (2011). Design of Multithreaded Software: The Entity-Life Modeling Approach. Wiley-Blackwell. VitalBook file.
  • Sakr, S. (2014). Large Scale and Big Data, (1st ed.). Vitalbook file.
  • Services, EMC E. (2015) Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data. John Wiley & Sons P&T. VitalBook file.