Kelly (2007) stated that there are 100 billion clicks per day and 55 trillion links. But, the internet is very pervasive to the human kind. In 2012, 2.27 billion people used the internet. However, globally 1.7 billion people are actively engaged with the internet (Li, 2010; Sakr, 2014). An actively engaged user of the internet would be using social media, where they can watch videos, share status updates, and curated content, with a goal of actively engaging other users (Li, 2010). Cloud-based services rely on the internet to provide services like distributed processing, large scale data storage, support distributed transactions, data manipulation, etc. (Brookshear & Brylow, 2014; Sakr, 2014). Thus, the internet plays a vital role in enabling big data analysis and social engagement, given its humble beginning for storing research projects between multiple research institutions in the 1960s (Brookshear & Brylow, 2014).
The internet has evolved into a socio-technical system. This evolution has come about in five distinct stages:
- Web 1.0: Created by Tim Berners-Lee in the 1980s, where it was originally defined as a way of connecting static read-only information hosted across multiple computational components primarily for companies (Patel, 2013). The internet was a network of computational networks and where communication that is done throughout the internet is governed by the TCP/IP protocol (Brookshear & Brylow, 2014). The internet’s architecture relies on three components: Uniform Resource Identifier (URI), Hyper Text Transfer Protocol (HTTP), and HyperText Markup language (HTML) (Jacobs & Walsh, 2004).
- A URI is a unique address that is common across all corners of the web and agreed upon convention by the internet community, which is made up of characters and numerical values that allow an end-user of the internet to locate and retrieve information (Brookshear & Brylow, 2014; Jacobs & Walsh, 2004; Patel 2013). This unique addressing convention allows the internet to store massive amounts of information without URI collision, which is when two URI hold the same value, which can confound search engines (Jacobs & Walsh, 2004). An example of a URI could be www.mkhernandez.com.
- For a computer to locate this URI’s information which is hosted on a web server, a web browser like Goggle chrome, Microsoft edge, or Firefox uses the HTTP protocol to retrieve the information to be displayed by the browser (Brookshear & Brylow, 2014). The web browser would send an HTTP GET request for www.mkhernandez.com via the TCP/IP port 80, and as long as the browser is given access to the information stored in the URI, then the web server sends an HTTP POST or PUT to give the web browser the sought after information (Jacobs & Walsh, 2004). In a sense, HTTP protocols play an important role in information access management.
- Once the HTTP POST is sent from the web server consisting the information stored in www.mkhernandez.com, the web browser must now convert the information and display it on the computer screen. HTML is a <tag> based notational code that helps a browser read the information sent by the web server and display it in a simple or rich data format (Brookshear & Brylow, 2014; Patel 2013). The <html /> tags are used by the HTTP protocol to identify the content type and encoding style for displaying the information of the web server onto the browser (Jacobs & Walsh, 2004). The <head /> and <title /> tags are used as metadata about the document. Whereas the <body /> tag contains the information of the URI, and <a /> tags helps you link to other relevant URIs easily.
- Web 2.0: Changed the state of the internet from a read-only state to a read/write state and had grown communities that hold a common interest (Patel, 2013). This version of the web led to more social interaction, giving people and content importance on the web, due to the introduction of social media tools through the introduction of web applications (Li, 2010; Patel, 2013; Sakr, 2014). Web applications can include event-driven and object-oriented programming that are designed to handle concurrent activities for multiple users and had a graphical user interface (Connolly & Begg, 2014; Sandén, 2011). Key technologies include:
- Weblogs (Blogs), Video logs (vlogs), and audio logs (podcasts) are all content in various styles that are published chronologically in descending time order, which can be tagged with keywords for categorization and available when people need them (Li, 2010; Patel, 2013). These logs can be used for fact-finding when content is stored chronologically (Connolly& Begg, 2014).
- Really Simple Syndication (RSS) is a web and data feed format that summarizes data via producing an open standard format, XML file (Patel, 2013; Services, 2015). This is regularly used for data collection (Services, 2015).
- Wikis are editable and expandable by those who have access to the data, and information can be restored or rolled back (Patel, 2013). Wiki editors ensure data quality and encourage participation from the community in providing meaningful content to the community (Li, 2010).
- Web 3.0: This is the state the web at 2017. Involves the semantic web that is driven by data integration through the uses of metadata (Patel, 2013). This version of the web supports a worldwide database with static HTML documents, dynamically rendered data, next standard HTML (HTML5), and links between documents with hopes of creating an interconnected and interrelated openly accessible world data such that tagged micro-content can be easily discoverable through search engines (Connolly & Begg, 2014; Patel, 2013). This new version of HTML, HTML5 can handle multimedia and graphical content and introduces new tags like <section />, <article />, <nav />, and <header />, which are great for semantic content (Connolly & Begg, 2014). Also, end-users are beginning to build dynamic web applications for others to interact with (Patel, 2013). Key technologies include:
- Extensible Markup Language is a tag based metalanguage (Patel, 2013). These tags not limited to the tags defined by other people and can be created at the pace of the author rather than waiting for a standard body to approve a tag structure (UK Web Design Company, n.d.).
- Resource Description Framework (RDF) is based on URI triples <subject, predicate, object>, which helps describes data properties and classes (Connolley & Begg, 2014; Patel, 2013). RDF is usually represented at the top of the data set with a @prefix (Connolly & Begg, 2014). For instance,
@prefix: <https://mkhernandez.com> s: Author <https://mkhernandez.wordpress.com/about/>. <https://mkhernandez.wordpress.com/about/>. s:Name “Dr. Michael Kevin Hernandez”. <https://mkhernandez.wordpress.com/about/> s:e-mail “firstname.lastname@example.org.”
- Web 4.0: It is considered the symbiotic web, where data interactions occur between humans and smart devices, the internet of things (Atzori, 2010; Patel, 2013). These smart devices can be wired to the internet or connected via wireless sensors through enhanced communication protocols (Atzori, 2010). Thus, these smart devices would have read and write concurrency with humans, where the largest potential of web 4.0 has these smart devices analyze data online and begin to migrate the online world into the real world (Patel, 2013). Besides interacting with the internet and the real world, the internet of things smart devices would be able to interact with each other (Atzori, 2010). Sakr (2014) stated that this web ecosystem is built off of four key items:
- Data devices where data is gathered from multiple sources that generate the data
- Data collectors are devices or people that collect data
- Data aggregation from the IoT, people, RFIDs, etc.
- Data users and data buyers are people that derive value out of the data
- Web 5.0: Previous iterations of the web do not perceive people’s emotion, but one day it could be able to understand a person’s emotional (Patel, 2013). Kelly (2007) predicted that in 5,000 days the internet would become one machine and all other devices would be a window into this machine. In 2007, Kelly stated that this one machine “the internet” has the processing capability of one human brain, but in 5,000 days it will have the processing capability of all the humanity.
Performance Bottlenecks & Root Causes
There is a trend in terms of “performance bottleneck” to access large-scale Web data as the Web technology evolves. A bottleneck is when the flow of information is slowed down or stopped in its entirety that it can cause a bad end-user experience (TechTarget, 2007, Thomas, 2012). Performance bottlenecks can cause an application to perform poorly towards expectations (Thomas, 2012). As the internet evolved, there are new performance bottlenecks that begin to appear:
- Web 1.0: When two URI hold the same value, it can confound search engine, hence the move to IPv4 to IPv6 (Jacobs & Walsh, 2004). Transfer protocols rely on network devices: network interface card, firewall, cables, tight security, load balancers, routers, etc., which all affect the flow of information (bandwidth) (Jacobs & Walsh, 2004; Thomas, 2012). Finally, HTTP information is pulled from the web server, thus low capacity computers, broken links, tight security, poor configurations, which can result in HTTP errors (4xx, 5xx), lots of open connections, memory leaks, lengthy queues, extensive data scans, database deadlock, etc. (Thomas, 2012).
- Web 2.0: Database demands on the web application, because there are more write and read transactions (Sakr, 2014). Plus, all the performance bottlenecks from web 1.0.
- Web 3.0: Searching for information in the data is tough and time-consuming without a computer processing application (UK Web Design Company, n.d.). Data is tied to the logic and language like HTML without a readily made browser to simply explore the data and therefore may require HTML or other software to process the data (Brewton, Yuan, & Akowuah, 2012; UK Web Design Company, n.d.). Syntax and tags are redundant, which can consume huge amounts of bytes, and slow down processing speeds (Hiroshi, 2007).
- Web 4.0: IoT creates performance bottlenecks but it also has its issues, if it is left on its own and it is not tied to anything else (Jaffe, 2014; Newman, 2016): (a) the devices cannot deal with the massive amounts of data generated and collected; and (b) the devices cannot learn from the data it generates and collects. Finally, there are also infrastructure potential performance bottlenecks for IoT (Atzori, 2010): (a) the huge number of internet oriented devices that will be taking up the last few IPv4 addresses; (b) things oriented and internet oriented devices could spend a time in sleep mode, which is not typical for current devices using the current IP networks; (c) IoT devices when connecting to the internet produce smaller packets of data at higher frequency than current devices; (d) each of the devices would have to use a common interface and standard protocols as other devices, which can easily flood the network and increase complexity of middleware software layer design; and (e) IoT are vastly heterogeneous objects, where each device with their own function and has its own way of communicating.
- Web 5.0: Give the assumption of what this version of the web would become, a possible performance bottleneck would be a number of resources consumed to keep the web operating.
Overall, Kelly (2007) stated that there are 100 billion clicks per day and 55 trillion links and that there are 2 million emails per second, 1 million IM messages, etc. big data will impact the performance of the web. Big data will be primarily impacting the web server, because of the increasing size of information and potential lack of bandwidth, big data can slow down the throughput performance (Thomas, 2012).
High-level strategies to mitigate
The web should evolve with time to keep up with the demands and needs of society, and it is predicted change significantly. What the web evolves into is yet to be seen, but it will be quite unique. However, with multiple heterogeneous types of devices (IoT) trying to connect to the web, a standardized protocol and interface to the web should be adopted (Atzori, 2010). A move to IPv6 from IPv4 to accommodate the massive number of IoT devices that expected to generate data and connect to the web must happen to accommodate this opportunity to gain more data.
Given, Kelly (2007) centralized view of the web system, the information and data are stored distributively, and better algorithms are needed to connect relevant data and information more efficiently. Data storage is cheap, but at the rate of data creation by IoT and other sources, processing speeds through parallel and distributed processing must increase to take advantage of this explosion of big data into the web. This is the gap created by the fact that data collection is outpacing the speed of data processing and analysis. Given this gap, a data scientist should prioritize which subset of the data is valuable enough to analyze, to solve certain problems. This is a great workaround, however, it ignores the full data set at times, which was generated from a system. It’s not enough to analyze what is deemed to be valuable parts of the data, because another part of the data may reveal more insight (the whole is better than the sum of its parts argument).
NoSQL database types and extensible markup languages have enhanced how information and data are related to each other, but standardization and perhaps an automated 1:1 mapping of the same data being represented in different NoSQL databases may be needed to gain further insights faster. Web application developers should also be more resource conscious, trying to get more computational results for fewer resources.
- Atzori, L., Antonio Iera, A., & Morabito, G. (2010). The Internet of things: A survey. Computer Networks, 54(2). 787–2,805
- Brewton, J., Yuan, X., & Akowuah, F. (2012). XML in health information systems. Retrieved from http://www.worldcomp-proceedings.com/proc/p2012/BIC4653.pdf
- Brookshear, G., Brylow, D. (2014). Computer Science: An Overview, (12th ed.). Vitalbook file.
- Connolly, T., & Begg, C. (2014). Database Systems: A Practical Approach to Design, Implementation, and Management, (6th ed.). Pearson Learning Solutions. VitalBook file.
- Hiroshi (2007). Advantages & disadvantages of XML. Retrieved from http://www.techmynd.com/advantages-disadvantages-of-xml/
- Jacobs, I., & Walsh, N. (eds). (2004). Architecture of the World Wide Web, Volume one. Retrieved from http://www.w3.org/TR/2004/REC-webarch-20041215/.
- Kelly, K. (2007). The next 5,000 days of the web. TED Talk. Retrieved from https://www.ted.com/talks/kevin_kelly_on_the_next_5_000_days_of_the_web
- Li, C. (2010). Open Leadership: How Social Technology Can Transform the Way You Lead, (1st ed.). VitalBook file.
- Patel, K. (2013). Incremental journey for World Wide Web: Introduced with Web 1.0 to recent Web 5.0 – A survey paper. International Journal of Advanced Research in Computer Science and Software Engineering, 3(10), 410–417.
- Sandén, B. I. (2011). Design of Multithreaded Software: The Entity-Life Modeling Approach. Wiley-Blackwell. VitalBook file.
- Sakr, S. (2014). Large Scale and Big Data, (1st ed.). Vitalbook file.
- Services, EMC E. (2015) Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data. John Wiley & Sons P&T. VitalBook file.
- TechTarget (2007) bottleneck. Retrieved from http://searchenterprisewan.techtarget.com/definition/bottleneck
- Thomas, T. (2012). Web applications performance symptoms and bottleneck identification. Retrieved from http://www.agileload.com/agileload/blog/2012/11/27/web-applications-performance-symptoms-and-bottlenecks-identification
- UK Web Design Company (n.d.). XML Advantages & disadvantages. Retrieved from http://www.theukwebdesigncompany.com/articles/xml-advantages-disadvantages.php