Adv Topics: Security Issues Associated with Big Data

The scientific method helps give a framework for the data analytics lifecycle (Dietrich, 2013). Per Khan et al. (2014), the entire data lifecycle consists of the following eight stages:

  • Raw big data
  • Collection, cleaning, and integration of big data
  • Filtering and classification of data usually by some filtering criteria
  • Data analysis which includes tool selection, techniques, technology, and visualization
  • Storing data with consideration of the CAP theorem (consistency, availability, partition tolerance)
  • Sharing and publishing data, while understanding ethical and legal requirements
  • Security and governance
  • Retrieval, reuse, and discovery to help in making data-driven decisions

Prajapati (2013) stated that the entire data lifecycle consists of the following five steps:

  • Identifying the problem
  • Designing data requirements
  • Pre-processing data
  • Data analysis
  • Data visualizing

It should be noted that Prajapati's lifecycle includes steps that first ask what, when, who, where, why, and how with regard to the problem being solved; it does not simply dive into collecting data. Combining the Prajapati (2013) and Khan et al. (2014) lifecycles provides a better data lifecycle. However, two items stand out in the combined lifecycle: (a) the security phase is an abstract phase, because security considerations are woven into other stages, and (b) those stages are storing data; sharing and publishing data; and retrieval, reuse, and discovery.

Over time the threat landscape has worsened, making big data security a major issue. Khan et al. (2014) describe four aspects of data security: (a) privacy, (b) integrity, (c) availability, and (d) confidentiality. Minelli, Chambers, and Dhiraj (2013) stated that a key challenge in data security is understanding who owns and has authority over the data and its attributes: the generator of the data, or the organization collecting, processing, and analyzing it. Carter, Farmer, and Siegel (2014) stated that access to data is important, because if competitors and substitutes for a service or product have access to the same data, that data provides the company no advantage. Richards and King (2014) describe that a binary notion of data privacy does not exist. Data is never completely private/confidential nor completely divulged; it lies between these two extremes. Privacy laws should therefore focus on the flow of personal information, with an emphasis on a type of privacy called confidentiality, in which data is agreed to flow only to a certain individual or group of individuals (Richards & King, 2014).

Carter et al. (2014) focused on data access, where access management determines data availability to certain individuals, whereas Minelli et al. (2013) focused on data ownership. Richards and King (2014) tied those two concepts into data privacy. Thus, these data security aspects are interrelated, and data ownership, availability, and privacy impact all stages of the lifecycle. The root causes of security issues in big data are dated techniques that count as best practices but do not lead to zero-day vulnerability action plans: a focus on prevention, a focus on perimeter access, and a focus on signatures (RSA, 2013). In particular, attacks like denial-of-service attacks are both a threat and a root cause of data availability issues (Khan et al., 2014). RSA (2013) also reported that, in a sample of 257 security officials, respondents felt the major challenges to security were a lack of staffing, large volumes of false positives that create too much noise, and a lack of security analysis skills. Finally, data privacy issues arise from balancing compensation risks, maintaining privacy, and maintaining ownership of the data, similar to a cost-benefit analysis problem (Khan et al., 2014).
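To make the signature-focus critique concrete, here is a minimal sketch, not taken from any of the cited sources, contrasting signature-based detection with a simple statistical anomaly check. The signature list, log format, request rates, and z-score cutoff are all hypothetical choices of mine:

```python
from statistics import mean, stdev

# Hypothetical markers for already-known attack tooling.
KNOWN_SIGNATURES = {"sqlmap", "nikto", "dirbuster"}

def signature_match(log_line: str) -> bool:
    """Signature-based detection: flags a line only if it matches a known pattern."""
    return any(sig in log_line.lower() for sig in KNOWN_SIGNATURES)

def is_anomalous(request_rate: float, history: list[float], z_cutoff: float = 3.0) -> bool:
    """Anomaly-based check: flags behavior that deviates sharply from history,
    even when no known signature is present (a zero-day-like event)."""
    if len(history) < 2:
        return False
    mu, sigma = mean(history), stdev(history)
    return sigma > 0 and (request_rate - mu) / sigma > z_cutoff

history = [40.0, 38.0, 45.0, 42.0, 39.0]  # requests/minute, hypothetical baseline
print(signature_match("GET /data?q=1 agent=custom-bot"))  # False: unknown tooling
print(is_anomalous(400.0, history))                       # True: ~10x the usual rate
```

The point of the contrast is that the signature check can never fire on tooling it has not seen before, while the anomaly check can, which is exactly the gap RSA (2013) identifies in signature-focused best practices.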

One way to address security concerns around big data access, privacy, and ownership is to place a single entry point gateway between the data warehouse and the end users (The Carology, 2013). The single entry point gateway is essentially middleware, which helps ensure data privacy and confidentiality by acting on behalf of an individual (Minelli et al., 2013). This gateway should aid in threat detection, recognize excessive requests for data that could constitute a denial-of-service attack, provide an audit trail, and require no changes to the data warehouse itself (The Carology, 2013). Thus, the use of middleware can address data access, privacy, and ownership issues. RSA (2013) proposed using data analytics to solve security issues by automating detection and response, which will be covered in detail in another post.
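The following is a minimal sketch of that single-entry-point idea, assuming a hypothetical warehouse_query() backend: every request passes through one middleware function that rate-limits clients (flagging possible denial-of-service behavior), records an audit trail, and leaves the warehouse itself untouched. The rate limit and window size are illustrative values, not figures from The Carology (2013):

```python
import time
from collections import defaultdict, deque

RATE_LIMIT = 100        # max requests per client per window (hypothetical policy)
WINDOW_SECONDS = 60.0

audit_log = []                      # audit trail: (time, client, query, outcome)
request_times = defaultdict(deque)  # per-client request timestamps

def warehouse_query(query: str) -> str:
    """Stand-in for the real data warehouse; the gateway never modifies it."""
    return f"results for: {query}"

def gateway(client_id: str, query: str):
    """Single entry point: rate-limits each client and records every request."""
    now = time.time()
    window = request_times[client_id]
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()            # discard requests that fell out of the window
    window.append(now)
    if len(window) > RATE_LIMIT:
        # Excessive requests: possible denial-of-service behavior.
        audit_log.append((now, client_id, query, "DENIED: rate limit exceeded"))
        return None
    audit_log.append((now, client_id, query, "ALLOWED"))
    return warehouse_query(query)

print(gateway("analyst-1", "SELECT avg(sales) FROM q3_orders"))
```

Because every request flows through gateway(), access control, DoS detection, and auditing live in one place, which is why the warehouse itself needs no modification.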

Resources:

  • Carter, K. B., Farmer, D., and Siegel, C. (2014). Actionable Intelligence: A Guide to Delivering Business Results with Big Data Fast! John Wiley & Sons P&T. VitalBook file.
  • Khan, N., Yaqoob, I., Hashem, I. A. T., Inayat, Z., Ali, W. K. M., Alam, M., Shiraz, M., & Gani, A. (2014). Big data: Survey, technologies, opportunities, and challenges. The Scientific World Journal, 2014. Retrieved from http://www.hindawi.com/journals/tswj/2014/712826/
  • Minelli, M., Chambers, M., & Dhiraj, A. (2013). Big Data, Big Analytics: Emerging Business Intelligence and Analytic Trends for Today’s Businesses. John Wiley & Sons P&T. VitalBook file.

Data analytics lifecycle

What is the data analytics lifecycle?

The scientific method helps give a framework for the data analytics lifecycle (Dietrich, 2013). According to Dietrich (2013), it is a cyclical lifecycle with iterative parts within each of its six steps:

  • Discovery
  • Pre-processing data
  • Model planning
  • Model building
  • Communicate results
  • Operationalize

However, Erl, Buhler, and Khattak (2016) suggested that it is divided into nine steps:

  • Business case evaluation
  • Data identification
  • Data acquisition & filtering
  • Data extraction
  • Data validation & cleansing
  • Data aggregation & representation
  • Data analysis
  • Data visualization
  • Utilization of analysis results

Prajapati (2013) stated five steps:

  • Identifying the problem
  • Designing data requirements
  • Pre-processing data
  • Data analysis
  • Data visualizing

Between these three lifecycle versions a general pattern emerges, but the differences also suggest that the field of data analytics is still too nascent to pin down an exact data analytics lifecycle. For the purposes of this discussion, the lifecycle used is from Services (2015), which follows the Dietrich (2013) lifecycle. Note that both the Services (2015) and Dietrich (2013) models are iterative, not static sequences of steps. This lifecycle model allows all key team members to conduct planning work up front and toward the end of the data analytics project to drive success (Dietrich, 2013).

When is it beneficial for stakeholders to be involved?

If following an agile development process, the key stakeholders should be involved in every phase of the lifecycle. The key stakeholders are the business user, project sponsor, project manager, business intelligence analyst, database administrator, data engineer, and data scientist (Services, 2015). Among the benefits of applying agile development processes to this lifecycle are iterative feedback for speed-to-market, improved first-time quality, visibility, risk management, flexibility to pivot when needed, cost control, and improved satisfaction through engagement (Waters, 2007). Allowing the stakeholders to participate in most of these steps helps ensure the resulting work is done to their specifications.

For the first step, discovery, the business learns its domain and its relevant history, along with lessons learned from previous projects (Services, 2015). Before proceeding, ask: “Do I have enough information to draft an analytic plan and share for peer review?” (Dietrich, 2013; Services, 2015).

Pre-processing data, also known as data preparation, is where a copy of the data (not the original) is placed in a sandbox, so that the data scientists and team can extract, load, and transform (ELT) the copied data (Services, 2015). In this stage, data can also be cleaned, aggregated, augmented, and formatted (Prajapati, 2013). Before proceeding, ask: “Do I have enough good quality data to start building the model?” (Dietrich, 2013; Services, 2015).

Model planning is when the data scientist and team determine the appropriate models, algorithms, and workflow of the data, which helps identify hidden insights between the variables (Services, 2015). Before proceeding, ask: “Do I have a good idea about the type of model to try? Can I refine the analytic plan?” (Dietrich, 2013; Services, 2015).

Model building sets aside roughly two-thirds of the data for training the model and one-third for testing it for production purposes and discovering hidden insights (Prajapati, 2013; Services, 2015). Before proceeding, ask: “Is the model robust enough? Have we failed for sure?” (Dietrich, 2013; Services, 2015).

Communicating results can be done through visualization of the data for the major stakeholders, to determine whether the results are a success or failure (Services, 2015). Visualization in this step is supposed to be interactive for all parties involved in the project (Prajapati, 2013). Finally, the operationalize step is when the data is ready to provide reports and documents on a pre-defined time interval, so that key decision-makers receive the vital data they need (Services, 2015). A minimal sketch of the pre-processing and model-building steps appears below.
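The following sketch uses scikit-learn and a small synthetic dataset (both my assumptions, not material from the cited sources) to illustrate the pre-processing and model-building steps just described: clean a copy of the data, hold out roughly two-thirds for training and one-third for testing, then evaluate the model on the unseen third before communicating results:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Pre-processing: work on a copy (the sandbox), never the original data.
raw = pd.DataFrame({
    "feature_a": [1.0, 2.0, None, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0],
    "feature_b": [0.5, 0.7, 0.9, 1.1, None, 1.5, 1.7, 1.9, 2.1],
    "label":     [0,   0,   0,   0,   1,    1,   1,   1,   1],
})
clean = raw.copy().dropna()  # cleaning: drop incomplete rows

# Model building: roughly 2/3 of the data trains the model, 1/3 tests it.
X, y = clean[["feature_a", "feature_b"]], clean["label"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=0, stratify=y
)
model = LogisticRegression().fit(X_train, y_train)

# "Is the model robust enough?" -- evaluate on the held-out third.
print(f"held-out accuracy: {model.score(X_test, y_test):.2f}")
```

The held-out score is what would feed the communicate-results step; if it falls short, the iterative lifecycle loops back to model planning or data preparation rather than proceeding to operationalization.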

References