Adv Topics: Big data addressing Security Issues

Cybersecurity attacks are limited by their physical path, the network connectivity and reachability limits, and the attack structure, which is by exploiting a vulnerability that enables an attack (Xie et al., 2010). Previously, automated systems and tools were implemented to deal with moderately skilled cyber-attackers, plus white hat hackers are used to identify security vulnerabilities, but it is not enough to keep up with today’s threats (Peterson, 2012). Preventative measure only deals with newly discoverable items, not the ones that have yet to be discoverable (Fink, Sharifi, Carbonell, 2011). These two methods are preventative measures, with the goal of protecting the big data and cyberinfrastructure used to store and process big data from malicious intent. Setting up these preventative measures are no longer good enough to protect big data and its infrastructure. Thus there has been a migration towards using real-time analysis on monitored data (Glick, 2013). Real-time analysis is concerned with “What is really happening?” (Xie et al., 2010).

If algorithms used to process big data can be pointed towards cyber security, Security Information and Event Management (SIEM), it can add another solution towards identifying cyber security threat (Peterson, 2012). All that big data cyber security analysis will do is make security teams faster to react if they have the right context to the analysis, but it won’t make the security teams act in a more proactive way (Glick, 2013). SIEM has gone above and beyond current cyber security prevention measures, usually by collecting the log data in real time that is generated and processing the log data in real time using algorithms like correlation, pattern recognition, behavioral analysis, and anomaly analysis (Glick, 2013; Peterson, 2012). Glick (2013), reported that data from a variety of sources help build a cyber security risk and threat profile in real-time that can be taken to cyber security teams to react to each threat in real time, but it works on small data sets.

SIEM couldn’t handle the vast amounts of big data and therefore analyzing the next cyber threats came from using tools like Splunk to identify anomalies amongst the data (Glick, 2013). SIEM was proposed for use in the Olympics games, but Splunk was being used for investment banking purposes (Glick, 2013; Peterson, 2012). FireEye is another big data analytics security tool that was used for identifying network threats (Glick, 2013).

  • Xie et al. (2010), proposed the use of Bayesian networks for cyber security analysis. This solution considers that modeling cyber security profiles are difficult to construct and uncertain, plus they built the tool for near real-time systems. That is because Bayesian models try to model cause-and-effect relationships. Using deterministic security models are unrealistic and do not capture the full breadth of a cyber attack and cannot capture all the scenarios for real-time analysis. If the Bayesian models are built to reflect reality, then it could be used for near real-time analysis. In real-time cyber security analysis, analysts must consider an attacker’s choices are unknown or if they will be successful in their targets and goals. Building a modular graphical attack model can help calculate uncertainties, which can be done by decomposing the problem into finite small parts, where realistic data can be used to pre-populate all the parameters. These modular graphical attack models should consider the physical paths in the explicit and abstract form. Thus, the near real-time Bayesian network considers the three important uncertainties introduced in a real-time attack (italicized). Using this method is robust as determined by a holistic sensitivity analysis.
  • Fink et al. (2011), proposed a mashup of crowdsourcing, machine learning, and natural language processing to dealing both vulnerabilities and careless end user actions, for automated threat detection. In their study, they focused on scam websites and cross-site request forgeries. For scam website identification, the concept of using crowdsourced end users to flag certain websites as a scam is key to this process. The goal is that when a new end user approaches the scam website, a popup appears stating “This website is a scam! Do not provide personal information.” The authors’ solution ties data from heterogeneously common web scam blacklist databases. This solution has high precisions (98%), and high recall (98.1%) on their test of 837 manually labeled sites that was cross-validated using a ten-fold cross -validation analysis between the blacklisted database. The current system’s limitation does not address new threats and different sets of threats.

These studies and articles illustrate that the benefit of using big data analytics for cybersecurity analysis provides the following benefits (Fink et al., 2011; Glick, 2013; IBM Software, 2013; Peterson, 2012; Xie et al., 2010):

(a) moving away from preventative cybersecurity and moving towards real-time analysis to become reactive faster to a current threat;

(b) creating security models that more accurately reflect the reality and uncertainty that exists between the physical paths, successful attacks, and unpredictability of humans for near real-time analysis;

(c) provide a robust identification technique; and

(d) reduction of identifying false positives, which eat up the security team’s time.

Thus, helping security teams to solve difficult issues in real-time. However, this is a new and evolving field that is applying big data analytics. Thus it is expected that many tools will be developed, and the most successful tool would be able to provide real-time cybersecurity data analysis with the huge set of algorithms each aimed at studying different types of attacks. It is even possible for one day to see artificial intelligence to become the next new phase of providing real-time cyber security analysis and resolutions.



Adv Topics: Security Issues with Cloud Technology

Big data requires huge amounts of resources to analyze it for data driven decisions, thus there has been a gravitation towards cloud computing to work in this era of big data (Sakr, 2014). Cloud technology is different than personal systems that place different demands on cyber security, where personal systems could have single authority systems and cloud computing systems, have no individual owners, have multiple users, groups rights, and shared responsibility (Brookshear & Brylow, 2014; Prakash & Darbari, 2012). Cloud security can be just as good or better than personal systems because cloud providers could have the economies of scales that can support a budget to have an information security team that many organizations may not be able to afford (Connolly & Begg, 2014). Cloud security can be designed to be independently modular, which is great for heterogenous distributed systems (Prakash & Darbari, 2012).

For cloud computing eavesdropping, masquerading, message tampering, replaying the message, and denial of services are security issues that should be addressed (Prakash & Darbari, 2012). Sakr (2014) stated that exploitation of co-tenancy, a secure architecture for the cloud, accountability for outsourced data, confidentiality of data and computation, privacy, verifying outsourced computation, verifying capability, cloud forensics, misuse detection, and resource accounting and economic attacks are big issues for cloud security. This post will discuss the exploitation of co-tendency and confidentiality of data and computation.

Exploitation of Co-Tenancy: An issue with cloud security is within one of its properties, that it is a shared environment (Prakash & Darbari, 2012; Sakr, 2014). Given that it is a shared environment, people with malicious intent could pretend to be someone they are not to gain access, in other words masquerading (Prakash & Darbari, 2012). Once inside, these people with malicious intent tend to gather information about the cloud system and the data contained within it (Sakr, 2014). Another way these services could be used by malicious people is to use the computational resources of the cloud to carry out denial of service attacks on other people.   Prakash and Darbari (2012) stated that two-factor authentications were used on personal devices and for shared distributed systems, there has been proposed a use of a three-factor authentication. The first two factors are the use passwords and smart cards. The last one could be either biometrics or digital certificates. Digital certificates can be used automatically to reduce end-user fatigue on using multiple authentications (Connolly & Begg, 2014). The third level of authentication helps to create a trusted system. Subsequently, a three-factor authentication could primarily mitigate masquerading. Sakr (2014), proposed using a tool that hides the IP addresses the infrastructure components that make up the cloud, to prevent the cloud for being used if the entry is granted to a malicious person.

Confidentiality of data and computation: If data in the cloud is accessed malicious people can gain information, and change the content of that information. Data stored on the distributed systems are sensitive to the owners of the data, like health care data which is heavily regulated for privacy (Sakr, 2014). Prakash and Darbari (2012) suggested the use of public key cryptography, software agents, XML binding technology, public key infrastructure, and role-based access control are used to deal with eavesdropping and message tampering. This essentially hides the data in such a way that it is hard to read without key items that are stored elsewhere in the cloud system. Sakr (2014) suggested homomorphic encryption may be needed, but warns that the use of encryption techniques increases the cost and time of performance. Finally, Lublinsky, Smith, and Yakubovich (2013), stated that encrypting the network to protect data-in-motion is needed.

Overall, a combination of data encryption, hiding IP addresses of computational components, and three-factor authentication may mitigate some of the cloud computing security concerns, like eavesdropping, masquerading, message tampering, and denial of services. However, using these techniques will increase the time it takes to process big data. Thus a cost-benefit analysis must be conducted to compare and contrast these methods while balancing data risk profiles and current risk models.


  • Brookshear, G., & Brylow, D. (2014). Computer Science: An Overview, (12th ed.). Pearson Learning Solutions. VitalBook file.
  • Connolly, T., & Begg, C. (2014). Database Systems: A Practical Approach to Design, Implementation, and Management, 6th Edition. Pearson Learning Solutions. VitalBook file.
  • Lublinsky, B., Smith, K., & Yakubovich, A. (2013). Professional Hadoop Solutions. Wrox. VitalBook file.
  • Prakash, V., & Darbari, M. (2012). A review on security issues in distributed systems. International Journal of Scientific & Engineering Research, 3(9), 300–304.
  • Sakr, S. (2014). Large scale and big data: Processing and management. Boca Raton, FL: CRC Press.

Adv Topics: Security Issues associated with Big Data

The scientific method helps give a framework for the data analytics lifecycle (Dietrich, 2013). Per Khan et al. (2014), the entire data lifecycle consists of the following eight stages:

  • Raw big data
  • Collection, cleaning, and integration of big data
  • Filtering and classification of data usually by some filtering criteria
  • Data analysis which includes tool selection, techniques, technology, and visualization
  • Storing data with consideration of CAP theory
  • Sharing and publishing data, while understanding ethical and legal requirements
  • Security and governance
  • Retrieval, reuse, and discovery to help in making data-driven decisions

Prajapati (2013), stated the entire data lifecycle consists of the following five steps:

  • Identifying the problem
  • Designing data requirements
  • Pre-processing data
  • Data analysis
  • Data visualizing

It should be noted that Prajapati includes steps that first ask what, when, who, where, why, and how with regards to trying to solve a problem. It doesn’t just dive into getting data. Combining both Prajapati (2013) and Kahn et al. (2014) data lifecycles, provides a better data lifecycle. However, there are 2 items to point out from the above lifecycle: (a) the security phase is an abstract phase because security considerations are involved in stages (b) storing data, sharing and publishing data, and retrieving, reusing and discovery phase.

Over time the threat landscape has gotten worse and thus big data security is a major issue. Khan et al. (2014) describe four aspects of data security: (a) privacy, (b) integrity, (c) availability, and (d) confidentiality. Minelli, Chambers, and Dhiraj (2013) stated that when it comes to data security a challenge to it is understanding who owns and has authority over the data and the data’s attributes, whether it is the generator of that data, the organization collecting, processing, and analyzing the data. Carter, Farmer, and Siegel (2014) stated that access to data is important, because if competitors and substitutes to the service or product have access to the same data then what advantage does that provide the company. Richard and King (2014), describe that a binary notion of data privacy does not exist.  Data is never completely private/confidential nor completely divulged, but data lies in-between these two extremes.  Privacy laws should focus on the flow of personal information, where an emphasis should be placed on a type of privacy called confidentiality, where data is agreed to flow to a certain individual or group of individuals (Richard & King, 2014).

Carter et al. (2014) focused on data access where access management leads to data availabilities to certain individuals. Whereas, Minelli et al. (2013) focused on data ownership. However, Richard and King (2014) was able to tie those two concepts into data privacy. Thus, each of these data security aspects is interrelated to each other and data ownership, availability, and privacy impacts all stages of the lifecycle. The root causes of the security issues in big data are using dated techniques that are best practices but don’t lead to zero-day vulnerability action plans, with a focus on prevention, focus on perimeter access, and a focus on signatures (RSA, 2013). Specifically, certain attacks like denial of service attacks are a threat and root cause to data availability issues (Khan, 2014). Also, RSA (2013) stated that from a sample of 257 security officials felt that the major challenges to security were the lack of staffing, large false positive amounts which creates too much noise, lack of security analysis skills, etc. Subsequently, data privacy issues arise from balancing compensation risks, maintaining privacy, and maintaining ownership of the data, similar to a cost-benefit analysis problem (Khan et al., 2014).

One way to solve security concerns when dealing with big data access, privacy, and ownership is to place a single entry point gateway between the data warehouse and the end-users (The Carology, 2013). The single entry point gateway is essentially middleware, which help ensures data privacy and confidentiality by acting on behalf of an individual (Minelli et al., 2013). Therefore, this gateway should aid in threat detection, assist in recognizing too many requests to the data which can cause a denial of service attacks, provides an audit trail and doesn’t require to change the data warehouse (The Carology, 2013). Thus, the use of middleware can address data access, privacy, and ownership issues. RSA (2013) proposed a solution to use data analytics to solve security issues by automating detection and responses, which will be covered in detail in another post.


  • Carter, K. B., Farmer, D., and Siegel, C. (2014). Actionable Intelligence: A Guide to Delivering Business Results with Big Data Fast! John Wiley & Sons P&T. VitalBook file.
  • Khan, N., Yaqoob, I., Hashem, I. A. T., Inayat, Z. Ali, W. K. M., Alam, M., Shiraz, M., & Gani., A. (2014). Big data: Survey, technologies, opportunities, and challenges. The Scientific World Journal, 2014. Retrieved from
  • Minelli, M., Chambers, M., & Dhiraj, A. (2013). Big Data, Big Analytics: Emerging Business Intelligence and Analytic Trends for Today’s Businesses. John Wiley & Sons P&T. VitalBook file.

Adv Topics: MapReduce and Batching Computations

With an increase in data produced by end-user and interne of things connectivity it creates a need for a computational infrastructure that deals with live-streaming big data processing (Sakr, 2014). Real-time data processing requires continuous data input and data output that needs to be processed in a small amount of time (Narsude, 2015). Therefore, once the data comes in it must be processed to be considered a real-time analysis of the data. Real-time data processing deals with a data stream that is defined by an unbounded set of data that can be transformed into stream spouts, which is a bound subset of the stream data (Sakr, 2014).

MapReduce and Hadoop frameworks are amazing with big data-at-rest because it relies on batch jobs with a beginning and an end (Sakr, 2014). However, Narsude (2015) and Sakr (2014), stated that batch processing is not great for real-time stream analytics because stream data doesn’t stop coming in unless an end-user stops it (Sakr, 2014). Batch processing means that the end-user is periodically processing data that has been previously collected to reach a certain size or time-dependent variable (Narsude, 2015).

Streaming data has the following properties (Sakr, 2014): (a) it is generated from multiple places sensuously; (b) has continuous data being produced and in transit; (c) requires continuous parallel processing jobs; (d) online processing of data that isn’t saved, so data analytics lifecycle is applied in real-time to the streaming data; (e) persistently storing streaming data is not optimal; and (f) queries are continuous. Real-time streaming can usually be conducted through two different paradigms (figure 1): processing data one event at a time also known as tuple at a time and keeping data batches small enough to use micro-batching jobs also known as micro-batching (Narsude, 2015).


Figure 1: Illustrating the difference in paradigms in real time processing where the orange dots represent an equal sized event and the blue boxes are the data processing (adapted from Narsude, 2015).   Differing event sizes do exist, which can complicate this image further.

Notice that from figure 1, these two different paradigms create different sets of latency. Thus, real-time processing produces a near real-time output. Therefore, data scientist need to take this into account when picking a tool to handle their real-time data. Some of the many large-scale stream tools include (Sakr, 2014): Aurora, Borealis, IBM System S and IBM Spade, Deduce, StreamCloud, Stormy, and Twitter Storm. The focus here will be on DEDUCE and Twitter Storm.


The DEDUCE architecture offers a middleware solution that can handle streaming data processing and use MapReduce capabilities for real-time data processing (Kumar, Andrade, Gedik, & Wu, 2010; Sakr, 2014). This middleware is using a concept called micro-batching where data is collected in near real-time and is being processed distributive over a batched MapReduce job (Narsude, 2015). Thus, this middleware is using a simplified MapReduce job design for data-at-rest and data-in-motion, where the data stream can come from a variable of an operator or a punctuated input stream and the data output is written in a distributed database system (Kumar et al., 2010; Sakr, 2014). An operator output can come from an end-user, a mapper output or a reducer output (Sakr, 2014).

For its scalability features the middleware creates the illusion to the end-user that the data is stored in one location though it is stored in multiple distributed systems through a common interface (Kumar et al., 2010; Sakr, 2014). DEDUCE’s unified abstraction and runtime has the following customizable features that corresponds to its performance optimization capability (Sakr, 2014): (a) using MapReduce jobs for data flow; (b) re-using MapReduce jobs offline to calibrate data analytics models; (c) optimizing the mapper and reducer jobs under the current computational infrastructure; and (d) being able to control configurations, like update frequency and resource utilization, for the MapReduce jobs.

Twitter Storm

Twitter Storm is an open-source code that is distrusted at GitHub (David, 2011). The Twitter Storm architecture is a distributed and fault-tolerant system that is horizontally scalable to handle parallel processing, guaranteed message processing of all data at least once, fault tolerance to ensure fidelity of results, and programming language agnostic (David, 2011; Sakr, 2014). This system uses processing data one event at a time paradigm to real-time processing (Narsude, 2015).

This software uses data stream spouts and bolts, and a task can be run on either spouts or bolts. Bolts consume a multiple data stream spouts to process and analyze the data to either produce a single output or a new data stream spout (Sakr, 2014). A master node takes care of what stream data and code goes to which spout and bolt through worker node supervisors, and the worker nodes usually work a task(s) per instructions of the worker node supervisor (David, 2011; Sakr, 2014). Considering that the more complex the query, the more data spouts and bolts are needed in the processing topology (figure 2), which helps to address this solution’s scalability features (Sakr, 2014).


Figure 2: A sample Twitter Storm stream to spout to bolt topology (Adapted from Sakr, 2014).

Finally, the performance optimization capability is based on the above sample topology image data, where stream data can be grouped in five different topological spouts and bolts configuration: stream data topology is customizable by the end user; stream data can be randomly distributed, stream data is replicated across all spouts and bolts, one data stream goes to a single bolt directly, or stream data goes into a specified group based on parameters (Sakr, 2014).


Real-time processing is in reality near real-time processes as it depends on the paradigm that is chosen to run the data processing scheme. DEDUCE uses a micro-batching technique, and Twitter Storm uses a tuple at a time. Thus, depending on the problem and project requirements data scientists can use either paradigm and processing scheme that meets their needs.


Adv Topics: Extracting Knowledge from big data

The evolution of data to wisdom is defined by the DIKW pyramid, where Data is just facts without any context, but when facts are used to understand relationships it generates Information (Almeyer-Stubbe & Coleman, 2014). That information can be used to understand patterns, it can then help build Knowledge, and when that knowledge is used to understand principles, it builds Wisdom (Almeyer-Stubbe & Coleman, 2014; Bellinger, Castro, Mills, n.d.). Building an understanding to jump from one level of the DIKW pyramid, is an appreciation of learning “why” (Bellinger et al., n.d.). Big data was first coined in a Gartner blog post, is data that has high volume, variety, and velocity, but without any interest in understanding that data, data scientist will lack context (Almeyer-Stubbe & Coleman, 2014; Bellinger et al., n.d.; Laney, 2001). Therefore, applying the DIKW pyramid can help turn that big data into extensive knowledge (Almeyer-Stubbe & Coleman, 2014; Bellinger et al., n.d.; Sakr, 2014). Extensive knowledge is a derived from placing meaning to big data usually in the form of predictive analytics algorithms (Sakr, 2014).

Machine learning requires historical data and is part of the data analytics process under data mining to understand hidden patterns or structures within the data (Almeyer-Stubbe & Coleman, 2014). Machine learning is easier to build and maintain than other classical data mining techniques (Wollan, Smith, & Zhou, 2010). Machine learning algorithms include clustering, classification, and association rules techniques and the right algorithm from any of these three techniques must be selected that meet the needs of the data (Services, 2015). Unsupervised machine learning techniques like clustering are used when data scientist do not understand or classify data prior to data mining techniques to understand hidden structures within the data set (Brownlee, 2016; Services, 2015). Supervised machine learning involves model training and model testing to aid in understanding which input variables feed into an output variable, involving such techniques as classification and regression (Brownlee, 2016).

An example of an open source Hadoop machine learning algorithm library would include Apache Mahout, which can be found at (Lublinsky, Smith, & Yakubovich, 2013). A limitation from learning from historical data to predict the future is it can “stifle innovation and imagination” (Almeyer-Stubbe & Coleman, 2014). Another limitation can exist that current algorithms may not run on distributed database systems. Thus some tailoring of the algorithms may be needed (Services, 2015). The future of machine learning involves its algorithms becoming more interactive to the end user, known as active learning (Wollan, Smith, & Zhou, 2010).

Case Study: Machine learning, medical diagnosis, and biomedical engineering research – commentary (Foster, Koprowski, & Skufca, 2014)

The authors created a synthetic training data set to simulate a typical medical classification problem of healthy and ill people and assigned random numbers to 10 health variables. Given this information, the actual classification accuracy should be 50%, which is also similar to pure chance alone. These authors found that when classification machine learning algorithms are misapplied, it can lead to false results. This was proven when their model took only 50 people to produce similar accuracy values of pure chance alone. Thus, the authors of this paper were trying to warn the medical field that misapplying classification techniques can lead to overfitting.

The authors then looked at feature selection for classifying Hashimoto’s disease from 250 clinical ultrasound data with the disease and 250 healthy people. Ten variables were selected to help classify these images, and a MATLAB machine learning algorithm was trained on 400 people (200 healthy and 200 ill) to then be tested on 100 people (50 healthy and 50 ill). They were able to show that when 3-4 variables were used they produced better classification results, thus 3-4 variables had huge information gain. This can mislead practitioners, because of the small data set that could be generalized too broadly and the lack of independence between training and testing datasets. The authors argued that larger data sets are needed to get rid of some of the issues that could result in the misapplication of classifiers.

The authors have the following four recommendations when considering the use of supervised machine learning classification algorithms:

    1. Clearly, state the purpose of the study and come from a place of understanding of that problem and its applications.
    2. Minimize the number of a variable when used in classifiers, such as using pruning algorithms in classification algorithms to only select certain variables that meet a certain level of information gain. This is more important with smaller data sets than with big data.
    3. Understand that classifiers are sensitive and that results gained from one set of instances might require further adjustments to be implemented elsewhere.
    4. Classification algorithms and data mining are part of the experimental process not the answer to all problems.


Adv Topics: Graph dataset partitioning in Cloud Computing

When dealing with cloud environments, the network management systems are key to processing big data, because some computational components within the network can have bandwidth unevenness (Conolly & Begg, 2014; Sakr, 2014). Chen et al. (2016) stated that often the cloud infrastructure is designed as interconnected computational resources in a tree structure, where they are first group in pods, which are then connected to other pods This unevenness can be defined as differing network performance between computational components within and across different racks from each other (Lublinsky, Smith, & Yakubovich, 2013). This unevenness grows as the cloud gets upgrade and the computational components grow to become increasingly heterogeneous with time (Chen et al., 2016). Thus, network management systems, which is part of resource pooling in cloud environments, helps with data load balancing, route connections, and diagnose problems between the “computational node ó link ó computational node ó link ó computational node …;” where each node is connected to multiple databases through multiple links (Conolly & Begg, 2014).

Network bandwidth unevenness becomes an issue when trying to process graph and network data, because (a) the complexity of the data cannot be stored in typical relational database systems; (b) the data is complex to process; and (c) scalability issues (Chen et al., 2016). Sakr (2014) stated there are two ways to scale data: vertical scaling where researchers are finding the few computational components that can handle huge data loads or horizontal scaling where researchers are spreading the data across multiple computational components. Chen et al. (2016) and Sakr (2014) recommended these two large-scale graph dataset partitioning methods for cloud computing systems:

  • Machine Graph: Models the graphical data by treating each computational node as a graphical vertex, such that the interconnectivity between each computational node would represent the connection of that graphical data element to others graphical data elements. A benefit of using this design is that the computational nodes representing this graphical data doesn’t need the knowledge of the topology of the network and can be built by utilizing the network bandwidth between computational nodes.
  • Partition Sketch: A graphical data set is partitioned out to represent multiple tree-structured data, and each computational node address a single level of the graphical tree data. From the perspective of the partitioned tree; the root node holds the input graph, non-leaf nodes hold partition iteration data, and leaf nodes hold the partitioned graph data points. This is built iteratively, with the goal of reducing the number of cross leaf-nodes between partitions. Design wise, each partition sketch is locally optimized, addresses monotonicity (cross leaf-nodes between partitions and non-leaf node depth), and addresses the proximity of parent nodes by defining common parent data between leaf nodes and non-leaf nodes. For instance, data that has a low common parent data should be stored together in high bandwidth computational nodes, whereas leaf-node data should be stored in low bandwidth computational nodes. However, the limitation of such a design is that the partition size must be carefully selected to fit monotonicity, but at the same time not have too many nodes as to run into input and output errors when trying to retrieve the data. Thus, more processing nodes should be used to decrease the number of cross leaf-nodes with low bandwidth, but not too many to cause memory issues.


  • Chen, R. Weng, X., He, B., Choi, B. and Yang, M. (2016). Network performance aware graph partitioning for large graph processing systems in the cloud. Retrieved from
  • Connolly, T., Begg, C. (2014). Database Systems: A Practical Approach to Design, Implementation, and Management, 6th Edition. Pearson Learning Solutions. VitalBook file.
  • Lublinsky, B., Smith, K., & Yakubovich, A. (2013). Professional Hadoop Solutions. Wrox, VitalBook file.
  • Sakr, S. (2014). Large scale and big data: Processing and management. Boca Raton, FL: CRC Press.

Adv Topics: IP size distributions and detecting network traffic anomalies

In 2011, internet advertising has generated over $31B in the US (Sakr, 2014). Much of this revenue is generated by created contextual advertising, which is when online advertisers, search engine optimizers, and sponsored search providers try to engage a user experience and revenue with displaying relevant and context based ads online (Chakrabarti, Agarwal, & Josifovski, 2008). Thus, understanding click rates of online advertisers, search engine optimizers, and sponsored search providers can provide online revenue to any business’ products or services (Regelson & Fain, 2006). When a consumer clicks on the ad to decide whether to purchase the product or service, a small amount of money is withdrawn from the online advertising budget from the company (Regelson & Fain, 2006; Sakr, 2014).

This business model is subjected to cyber-attacks, such that a competitor can create an automated piece of code to click on the advertising without making a purchase, which in the end depletes the online advertising budget (Sakr, 2014). This automated piece of code usually comes from an IP size distribution, which is a group of IPs set to target one ad and pretending to be an actual consumer, which sounds like a DoS attack – Denial of Service attack (Park & Lee, 2001; Sakr, 2014). However, DoS attack is to use IP size distributions to block services from a website, and the best way to prevent this situation is to trace back the source of the IP size distribution and block it (Park & Lee, 2001). This is slightly different though; it is not denying the company’s service or products, its depleting their online advertisement budget, which will reduce one company’s online market share.

Sakr (2014) says that IP size distributions are defined by two dimensions (a) application and (b) time; which change throughout time due to business cycles, flash crowds, etc. IP size distributions are generated three ways: (a) legitimate users, (b) publisher’s friends that could include sponsored providers with some fraudulent clicks, and (c) bot-master with botnets (Sakr, 2014; Soldo & Metwally, 2012). The goal is now to identify the bot-master with botnets and the fraudulent clicks. Thus, companies need to be able to detect network traffic anomalies based on the IP size distribution:

  • Sakr (2014) and Soldo and Metwally (2012) suggested using anomaly detection algorithms, which relies on the current IP size distribution and analyzes the data to search for patterns that are characteristic of these attacks. These methods of detection are robust because it uses these characters of fraudulent clicks, which has low complexity and can be written to run MapReduce in parallel processing. This method can assign a distinct cookie ID for analysis when a click is generated. This technique uses a regression model and compares IP rates to a Poisson distribution, as well as using an explanatory diversity feature which counts the distinct cookies and measures an entropy of that distribution; setting this as the true IP sizes. The use of this information to generate explanatory diversity models, which can then also be analyzed using quantile regression, linear regression, percentage regression, and principal component analysis. Then each of these analyses has their root mean square error computed, relative error, and bucket error to allow inter-comparability between the results of each of these models to the true value. This inter-comparison allows for detection of anomalous activities because each method measures different properties within the same data. Once the IP addresses have been identified as fraudulent, they are then flagged.
  • Regelson and Fain (2006) suggests using historical data if it is available to create reliable prior IP size distribution to compare it to current IP size distributions. Though the authors suggested using this for studying click through rates, which is clicking on the ad to purchase, this could also be used for this scenario. This method of using historical data can sometimes work when there is a wealth of historical information, but in cases that there are little to none historical information a creative aggregation technique could work. This technique uses a cluster of less frequent and similar items as well as completely novel items to develop that historical context needed to build the historical IP size distribution. This technique uses a logistic regression analysis. This method could reduce error by 50% when there was no historical data to compare to.

With further analysis of the first method, the strengths of this method are:

  • that there is no need to obtain personally identifiable information
  • no need to authenticate end user clicks
  • fully automated statistical aggregation method that can scale linearly using MapReduce
  • creating a legitimate looking IP size distributions is really difficult

while the limitations of this method are:

  • It requires many actual click data to create these models
  • Colluding with other companies to provide their click data can help create a large amount of click data needed, but usually, that data is proprietary.

That is why the second method was mentioned from Regelson and Fain (2006) because they address the limitations of the Sakr (2014) and Soldo and Metwally (2012) method.


  • Chakrabarti, D., Agarwal, D., & Josifovski, V. (2008). Contextual advertising by combining relevance with click feedback. In Proceedings of the 17th international conference on World Wide Web (pp. 417-426). ACM.
  • Park, K., & Lee, H. (2001). On the effectiveness of probabilistic packet marking for IP traceback under denial of service attack. In INFOCOM 2001. Twentieth Annual Joint Conference of the IEEE Computer and Communications Societies. Proceedings. IEEE (Vol. 1, pp. 338-347). IEEE.
  • Regelson, M., & Fain, D. (2006). Predicting click-through rate using keyword clusters. In Proceedings of the Second Workshop on Sponsored Search Auctions (Vol. 9623).
  • Sakr, S. (2014). Large scale and big data: Processing and management. Boca Raton, FL: CRC Press.
  • Soldo, F., & Metwally, A. (2012). Traffic anomaly detection based on the IP size distribution. In INFOCOM, 2012 Proceedings IEEE (pp. 2005-2013). IEEE.