Quant: Central tendencies and variances

When might it be better to use each central tendency approach over the others?
What are the dangers in including the extreme scores in your central tendency measures?
How will your variability scores change when you consider the group with and without the extreme reaction people?

Central to quantitative research methods is understanding the numerical, ordinal, or categorical dataset and what the data represents. This can be done through descriptive statistics, where the researcher uses statistics to help describe a data set, or through inferential statistics, where conclusions are drawn about the data set (Miller, n.d.).  However, researchers should avoid drawing insights and conclusions from extreme or non-representative data, and understanding the central tendency can help avoid this scenario. For instance, in data mining for business and industry, current practice is to compare multiple random samples based on their central tendency (Ahlemeyer-Stubbe & Coleman, 2014).  Field (2013) and Schumacker (2014) defined central tendency as an all-encompassing term that describes the “center of a frequency distribution” through the commonly used measures of mean, median, and mode.

Central Tendency

In a symmetrical distribution, the central tendency is where most of the data values tend to occur, so the mean can be used to describe it (Schumacker, 2014).  The mean is the arithmetic average of the data distribution: the sum of all the data values divided by the number of data points in the distribution (Field, 2013). Miller (n.d.) stated that the mean is best when the data is interval data (continuous, with equally spaced values) and is well balanced and not skewed.  However, if the data is heavily skewed, then the median is the best measure to use since it ignores the extreme values on both ends of the distribution (Miller, n.d.). The median is the data value at the center of the distribution when the data values are placed in ascending order (Field, 2013). It is easily found when the total number of data points is odd, but when the number is even, the two most centered values have to be averaged to obtain the median. The mode is the data value that occurs most frequently in the distribution, and a distribution can be bimodal, having two modes, or multimodal, having three or more modes (Field, 2013). Modes can help the researcher identify patterns and are best for nominal or ordinal data (Miller, n.d.).

Example:  A random sample of fictitious Twitter users’ follower counts consists of {22, 40, 57, 57, 68, 93, 116, 121, 168, 405, 2380, 8746}. The mean is 1022.75, the mode is 57, and the median is 104.5.
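
The three measures in the example can be reproduced with a short script. This is a minimal sketch using Python's standard statistics module; the variable names are chosen here for illustration and are not part of the cited sources.

```python
from statistics import mean, median, multimode

# Follower counts from the example above.
followers = [22, 40, 57, 57, 68, 93, 116, 121, 168, 405, 2380, 8746]

# Mean: sum of all values divided by the number of values.
sample_mean = mean(followers)

# Median: the list has an even number of values, so the two
# middle values of the sorted data (93 and 116) are averaged.
sample_median = median(followers)

# Mode(s): multimode returns every most-frequent value, so a
# bimodal or multimodal sample would yield more than one entry.
sample_modes = multimode(followers)

print("mean:", sample_mean)
print("median:", sample_median)
print("mode(s):", sample_modes)
```

Using multimode rather than mode keeps the sketch usable for bimodal or multimodal samples, where a single mode would hide the other peaks.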

Outliers, missing values, multiplication by a constant, and addition of a constant are some factors that can affect the central tendency (Schumacker, 2014).  Outliers can draw the mean away from the central tendency and toward themselves, skewing the distribution (Miller, n.d.; Schumacker, 2014). The mean moves toward the skew, that is, toward the extreme values in the data, as in the example above, where the mean of 1022.75 is larger than ten of the twelve values.  This is the danger in including the extreme values in the central tendency measures.  In such cases, there is a need to know more about the central tendency of the data.  One way is to analyze the mean and median together to understand how skewed the data is and in which direction.  Heavily skewed distributions greatly increase the distance between these two values: if the mean is less than the median the distribution is negatively skewed, and if the mean is greater than the median it is positively skewed (Field, 2013).  To understand the distribution better, other measures like variance and standard deviation can be used.
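
The mean-versus-median comparison described above can be sketched as a small helper. This is only a rough indicator under the assumption that comparing the two values is enough to judge direction; the function name skew_direction is hypothetical and introduced here for illustration.

```python
from statistics import mean, median

def skew_direction(values):
    """Rough skew indicator: compare the mean with the median."""
    m, med = mean(values), median(values)
    if m > med:
        return "positive"   # mean pulled toward high extreme values
    if m < med:
        return "negative"   # mean pulled toward low extreme values
    return "symmetric"

# Follower counts from the example above.
followers = [22, 40, 57, 57, 68, 93, 116, 121, 168, 405, 2380, 8746]

direction = skew_direction(followers)
gap = mean(followers) - median(followers)  # distance between the two measures

print(direction, gap)
```

For the follower counts the mean sits well above the median, so the helper reports a positive skew, which matches the pull of the two largest values.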

Variance and Standard Deviation

Variance and standard deviation are measures of dispersion, which describe how the data values are spread across the distribution and how they differ from the mean (Field, 2013; Schumacker, 2014).  The difference between the central tendency value and a data value is the deviance of that value (Field, 2013).  Squaring each deviance to get rid of negative signs and summing the squared deviances over every data value gives the sum of squared errors (Field, 2013).  Dividing the sum of squared errors by the number of data values minus one gives the variance of the distribution (Field, 2013).  Therefore, the variance is the average squared deviation between the central tendency and the data (Field, 2013).  The variance is the standard deviation squared, so the variance is not in the same units as the data set, while the standard deviation is (Field, 2013; Miller, n.d.; Schumacker, 2014).  In the above example, the variance about the mean is 6349719 with a standard deviation about the mean of 2519.865.  After removing the two extreme values, the variance about the mean is 12293.34 and the standard deviation about the mean is 110.8754. This shows that the variability scores can change drastically with and without the extreme values.  Thus, it illustrates that variability shrinks when there is agreement in the data (Miller, n.d.).
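
The steps above (deviance, sum of squared errors, division by n - 1) can be sketched in a few lines. This is a minimal illustration, with the helper name sample_variance chosen here; it is checked against the standard library's statistics.variance, which also divides by n - 1.

```python
from math import sqrt
from statistics import mean, variance

def sample_variance(values):
    """Sum of squared deviances from the mean, divided by n - 1."""
    centre = mean(values)
    sum_squared_errors = sum((v - centre) ** 2 for v in values)
    return sum_squared_errors / (len(values) - 1)

# Follower counts from the example above.
followers = [22, 40, 57, 57, 68, 93, 116, 121, 168, 405, 2380, 8746]
trimmed = followers[:-2]  # drop the two extreme values, 2380 and 8746

var_all = sample_variance(followers)
var_trimmed = sample_variance(trimmed)
sd_all = sqrt(var_all)          # standard deviation is the square root
sd_trimmed = sqrt(var_trimmed)  # of the variance, back in data units

print("all values:       ", round(var_all, 2), round(sd_all, 3))
print("without extremes: ", round(var_trimmed, 2), round(sd_trimmed, 4))

# The hand-rolled helper agrees with the library implementation.
assert abs(var_all - variance(followers)) < 1e-6
```

Dropping just two of the twelve values shrinks the standard deviation from roughly 2520 to roughly 111, which is the change in variability the paragraph describes.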

References:

  • Ahlemeyer-Stubbe, A., & Coleman, S. (2014). A practical guide to data mining for business and industry. UK: Wiley-Blackwell. VitalBook file.
  • Field, A. (2013). Discovering Statistics Using IBM SPSS Statistics (4th ed.). UK: Sage Publications Ltd. VitalBook file.
  • Miller, R. (n.d.). Week 1: Central tendency [Video file]. Retrieved from http://breeze.careeredonline.com/p9fynztexn6/?launcher=false&fcsContent=true&pbMode=normal
  • Schumacker, R. E. (2014). Learning statistics using R. California: SAGE Publications, Inc. VitalBook file.

Quantitative Vs Qualitative Analysis

This post explains the goals and procedures of the quantitative and qualitative methods, discusses the differences and commonalities of these methods, and considers what advantages exist in the use of a quantitative method.

Field (2013) states that quantitative and qualitative methods are complementary, not competing, approaches to solving the world’s problems, although the methods are quite different from each other. Creswell (2014) explains how quantitative and qualitative methods can be combined to study a phenomenon through what is called a “Mixed Method” approach, which is out of scope for this discussion.  Simply put, quantitative methods are utilized when the research contains variables that are numerical, and qualitative methods are utilized when the research contains variables that are based on language (Field, 2013).  Thus, each method’s goals and procedures are quite different.

Goals and procedures

Quantitative methods derive from a positivist, numerically driven epistemology (Joyner, 2012).  Quantitative methods use closed-ended questions, i.e., hypotheses, and collect their data numerically through instruments (Creswell, 2014). In quantitative research, there is an emphasis on experiments, measurement, and a search for relationships by fitting data to a statistical model and by observing a collection of data graphically to identify trends via deduction (Field, 2013; Joyner, 2012). According to Creswell (2014), quantitative researchers build protections against biases and control for alternative explanations through experiments that are generalizable and replicable. Quantitative studies can be experimental, quasi-experimental, causal-comparative, correlational, descriptive, or evaluative (Joyner, 2012).  According to Edmondson and McManus (2007), quantitative methodologies fit best when the underlying research theory is mature.  The maturity of the theory should tend to drive researchers toward one method over the others along the spectrum of quantitative, mixed, and qualitative methodologies (Creswell, 2014; Edmondson & McManus, 2007).

Comparatively, Edmondson and McManus (2007) stated that qualitative methodologies fit best when the underlying research theory is nascent. Qualitative methods derive from a phenomenological view, which centers on the perceptions of people (Joyner, 2012).  Qualitative methods use open-ended questions, i.e., interview questions, and collect their data through observations of a situation (Creswell, 2014).  Qualitative research focuses on the meaning and understanding of a situation, where the researcher searches for meaning through interpretation of the data via induction (Creswell, 2014; Joyner, 2012).  Qualitative research can take the form of case studies or of ethnographic, action, philosophical, historical, legal, or educational research, etc. (Joyner, 2012).

Commonalities and differences

One commonality between these two methods is that each has a question to answer in an identified area of interest (Creswell, 2014; Edmondson & McManus, 2007; Field, 2013; Joyner, 2012).  Each method requires a survey of the current literature to help develop the research question (Creswell, 2014; Edmondson & McManus, 2007). Finally, there is a need to design a study to collect and analyze data to help answer that research question (Creswell, 2014; Edmondson & McManus, 2007; Field, 2013; Joyner, 2012).  Therefore, the similarities between these two methods lie in why research is conducted and, at a high level, in what research is conducted and how.  They differ in the particulars of what research is conducted and how.

The research question(s) can take the form of a central question with or without sub-questions, but quantitative research is driven by a series of statistically testable theoretical hypotheses (Creswell, 2014; Edmondson & McManus, 2007). For quantitative data analysis, statistical tests are done to seek relationships, with hopes of testing a theory-driven hypothesis and providing a precise model via a collection of numerical measures and established constructs (Edmondson & McManus, 2007). Given the need to statistically accept or reject theoretical hypotheses, the sample size for quantitative methods tends to be greater than that for qualitative methods (Creswell, 2014).  Qualitative research is driven by exploration and observations to test its hypotheses (Creswell, 2014; Edmondson & McManus, 2007). For qualitative data analysis, there should be an iterative and explorative content analysis, with hopes of building a new construct (Edmondson & McManus, 2007).  These are some of the many differences that exist between these two methods.

When are the advantages of quantitative methods maximized

Based on Edmondson and McManus (2007), the best time to use quantitative methods is when the underlying theory of the research subject is mature.  Maturity consists of extensive literature that can be reviewed, the existence of theoretical constructs, and extensively tested measures (Edmondson & McManus, 2007).  Thus, the application of quantitative methods will build effectively on prior work and help fill the gap in knowledge on a particular topic, whereas qualitative methods and mixed methods would fail to do so. Applying qualitative methods to a mature theory is reinventing the wheel, and applying mixed methods to it will make the status of the evidence uneven (Edmondson & McManus, 2007).

References:

  • Creswell, J. W. (2014). Research design: Qualitative, quantitative and mixed method approaches (4th ed.). California: SAGE Publications, Inc. VitalBook file.
  • Edmondson, A. C., & McManus, S. E. (2007). Methodological fit in management field research. Academy of Management Review, 32(4), 1155–1179. http://doi.org/10.5465/AMR.2007.26586086
  • Field, A. (2013). Discovering Statistics Using IBM SPSS Statistics (4th ed.). UK: Sage Publications Ltd. VitalBook file.
  • Joyner, R. L. (2012). Writing the Winning Thesis or Dissertation: A Step-by-Step Guide (3rd ed.). Corwin. VitalBook file.

Business Intelligence: Compelling Topics

This post discusses the most compelling topics in Business Intelligence.

Departments are currently organized in silos. Thus, their information sits in siloed systems, which makes it difficult to leverage that information across the company.  A data warehouse, which is a central database that contains a collection of decision-related internal and external sources of data, can aid data analysis for the entire company (Ahlemeyer-Stubbe & Coleman, 2014). When we build a multi-level Business Intelligence (BI) system on top of a centralized data warehouse, we no longer have siloed data systems and can thus make data-driven decisions.  To support data-driven decisions while moving away from department-kept silo data to a centralized data warehouse, Curry, Hasan, and O’Riain (2012) created a system that shows results from a centralized data warehouse at different levels of the company: the organization level (stakeholders are executive members, shareholders, regulators, suppliers, and consumers), the functional level (stakeholders are functional managers and organization managers), and the individual level (stakeholders are the employees).  Data may be centralized, but specialized permissions on data reports can still exist in a multi-level system.

The types of data that can be stored in a centralized data warehouse are real-time data, which reveals events that are happening right now; lag information, which explains events that have just happened; and lead information, which helps predict future events based on lag data, such as regression results and forecasting model output (based on Laursen & Thorlund, 2010).  All of these serve the goal of helping decision makers determine whether certain target measures are met.  Target measures are used to improve marketing efforts through tracking measures like ROI, NPV, revenue, lead generation, lag generations, growth rates, etc. (Liu, Laguna, Wright, & He, 2014).
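
To make the link between lag and lead information concrete, the sketch below fits an ordinary least-squares line to past observations and extends it one period ahead. The monthly figures are made-up illustrative values, not data from any cited source, and the helper name fit_line is hypothetical; a real forecasting model would likely be more involved.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept

# Lag information: hypothetical units sold in each of the last six months.
months = [1, 2, 3, 4, 5, 6]
units_sold = [100, 110, 120, 130, 140, 150]

slope, intercept = fit_line(months, units_sold)

# Lead information: the fitted trend projected onto month 7.
forecast_month_7 = slope * 7 + intercept

print(slope, intercept, forecast_month_7)
```

The projected value could then be compared against a target measure to flag whether the target is likely to be met.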

Decision Support Systems (DSS) were created before BI strategies.  A DSS helps execute the project, expand the strategy, improve processes, and improve quality controls in a quick and timely fashion.  The main role of a data warehouse is to support the DSS (Carter, Farmer, & Siegel, 2014).  Unfortunately, the discussion above about data types and ways to store data to enable data-driven decisions does not explain the “how,” “what,” “when,” “where,” “who,” and “why.”  For that, a strong BI strategy is imperative.  A BI strategy can include, but is not limited to, data extraction, data processing, data mining, data analysis, reporting, dashboards, performance management, actionable decisions, etc. (Fayyad, Piatetsky-Shapiro, & Smyth, 1996; Padhy, Mishra, & Panigrahi, 2012; McNurlin, Sprague, & Bui, 2008).  This definition, along with the fact that DSS is one of the five principles of BI, suggests that DSS was created before BI and that BI is a newer and more holistic view of data-driven decision making.

But what can we do with a strong BI strategy? With a strong BI strategy we can increase a company’s revenue through online profiling.  Online profiling is using a person’s online identity to collect information about them, their behaviors, their interactions, their tastes, etc. to drive targeted advertising (McNurlin et al., 2008).  Unfortunately, fear arises when end-users do not know what their data is being used for or what data these companies or governments hold.  Richards and King (2014) and McEwen, Boyer, and Sun (2013) expressed that the flow of information and the lack of transparency are what feed the fear of the public. McEwen et al. (2013) described many possible solutions. One that could gain traction in this case is letting consumers (end-users) know what variables are being collected and giving them an opt-out feature, where a subset of those variables stays with them and does not get transmitted.

 

References:

  • Ahlemeyer-Stubbe, A., & Coleman, S. (2014). A Practical Guide to Data Mining for Business and Industry, 1st Edition. [VitalSource Bookshelf Online]. Retrieved from https://bookshelf.vitalsource.com/#/books/9781118981863/
  • Carter, K. B., Farmer, D., & Siegel, C. (2014). Actionable Intelligence: A Guide to Delivering Business Results with Big Data Fast!, 1st Edition. [VitalSource Bookshelf Online]. Retrieved from https://bookshelf.vitalsource.com/#/books/9781118920657/
  • Curry, E., Hasan, S., & O’Riain, S. (2012, October). Enterprise energy management using a linked dataspace for energy intelligence. In Sustainable Internet and ICT for Sustainability (SustainIT), 2012 (pp. 1-6). IEEE.
  • Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996). From data mining to knowledge discovery in databases. AI Magazine, 17(3), 37. Retrieved from http://www.aaai.org/ojs/index.php/aimagazine/article/download/1230/1131/
  • Laursen, G. H. N., & Thorlund, J. (2010). Business Analytics for Managers: Taking Business Intelligence Beyond Reporting. Wiley & SAS Business Institute.
  • Liu, Y., Laguna, J., Wright, M., & He, H. (2014). Media mix modeling–A Monte Carlo simulation study. Journal of Marketing Analytics, 2(3), 173-186.
  • McEwen, J. E., Boyer, J. T., & Sun, K. Y. (2013). Evolving approaches to the ethical management of genomic data. Trends in Genetics, 29(6), 375-382.
  • McNurlin, B., Sprague, R., & Bui, T. (2008). Information Systems Management, 8th Edition. [VitalSource Bookshelf Online]. Retrieved from https://bookshelf.vitalsource.com/#/books/9781323134702/
  • Padhy, N., Mishra, D., & Panigrahi, R. (2012). The survey of data mining applications and feature scope. arXiv preprint arXiv:1211.5723. Retrieved from https://arxiv.org/ftp/arxiv/papers/1211/1211.5723.pdf
  • Richards, N. M., & King, J. H. (2014). Big data ethics. Wake Forest L. Rev., 49, 393.

Business Intelligence: Predictions Followup

The last post discussed the future of data mining. For this post, I will expand my opinion on what business intelligence (BI) is moving toward in the future.

  • Potential Opportunities:

o    Health monitoring.  Currently, smart watches track our heart rate, steps, standing time, stairs climbed, sitting time, workouts, biking, sleep, etc.  But what if we had a device that measured the chemicals in our blood daily and was less painful than the finger prick that people with diabetes rely on?  This technology could not only measure your blood chemical makeup but could also send alerts to EMTs and doctors if there is a dangerous imbalance of chemicals in your blood (Carter et al., 2014).  This would require a strong BI program across emergency responders, individuals, and doctors.

o    As Moore’s law of computational speed moves forward in time, companies will be increasingly able to interpret real-time data and produce lead information that can drive actionable, data-driven decisions. Companies can finally get answers to strategic business questions in minutes as well (Carter et al., 2014).

o    Both internal data (corporate data) and external data (competitor analysis, customer analysis, social media, affinity and sentiment analysis) will be reported to senior leaders and executives who have the authority to make decisions on behalf of the company on a frequent basis.  This information may show up in a dashboard with x number of indicators/metrics, as was successfully implemented in a case study of a hospital (Topaloglou & Barone, 2015).

  • Potential Pitfalls:

o    Tools for threat detection, like those being piloted in New York City, could have an increased level of discrimination (Carter, Farmer, & Siegel, 2014). As big data analytics is used to run facial recognition on photographs and live video to identify threats, it can lead to more racial profiling if the a priori knowledge fed into the system has elements of racial profiling.  This could bias reporting, cause a particular demographic to be tracked at higher rates, and ignore the fact that past performance does not indicate the future.

o    Data must be validated before it is published to a data warehouse.  Because data in a warehouse has low volatility, we need to ensure that the data we receive is correct, so expected value thresholds must be set to capture errors before they are entered.  Wrong data in means wrong data analysis and wrong data-driven decisions.  An example of an expected value threshold could be that Earth’s temperature cannot exceed 500 K at the surface.

o    Amplified customer experience.  As BI incorporates social media to gauge what is going on in the minds of customers, something that goes viral and hurts the company can be devastating.  Essentially, we are giving the customer an amplified voice.  This can range from rumors and leaks of software or hardware, as happens with every Apple iPhone generation/release and which can put proprietary information into the hands of competitors, to a nasty comment or post that gets out of control on a social media platform, to celebrity boycotts.  However, the opportunity here lies in receiving key information on how to improve products, identify leakers of information, and settle nasty rumors, issues, or comments.

  • Potential Threats:

o    Loss of data to hackers who aim to steal someone’s identity.  Firewalls must be tighter than ever, and networks must be more secure than ever as a company moves to a centralized data warehouse.  Data warehouses are vital for BI initiatives, but if HR data is located in the warehouse (for example, to help HR identify likelihood measures of disgruntled employees to aid in their retention efforts) and a hacker were to get hold of that data, the information of thousands of people could be compromised.  This is nothing new, but it is a potential threat that must be mitigated as we proceed into BI systems.  It applies not only to people data but also to company proprietary data.

o    Consumer advertisement blitz.  Companies may use BI and item affinity analysis to blast their customers with ads and coupons in hopes of marketing better to people, attracting more sales, and earning higher revenues.  There is a personal example here for me:  XYZ is a clothing store, and when I moved into my first house, the old owner had never updated their address in the store’s database.  Since the old owner was a frequent buyer and the magazines, coupons, flyers, and sales had been working on them, the store kept blasting that address with marketing ads.  When I moved in, I got a magazine every two days.  It was a waste of paper and made me less likely to shop there.  Eventually, I had enough and called customer service.  They resolved the issue, but it took six weeks after that call for my address to be removed from their marketing and customer database.  I haven’t shopped there since.

o    Informational overload.  As companies go forward with implementing BI systems, they must meet with the entire multi-level organization to find out its data needs.  Just because we have the data doesn’t mean we should display it.  The goal is to find the right number of key success factors, key performance indicators, and metrics to help decision makers at all levels.  Overcomplicating this part can compromise the adoption of BI in the organization, and the system will be seen as a waste of money rather than a tool that could help in today’s competitive market.  This is a hard line to walk, and it is one of the biggest threats.  It was recognized in the hospital case study (Topaloglou & Barone, 2015) and therefore mitigated through extensive planning, buy-in, and documentation.
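
The expected value thresholds mentioned under the pitfalls can be sketched as a small pre-load check. This is a minimal illustration, not a warehouse product's API: the field name surface_temp_k, the helper names, and the lower bound of 0 K are assumptions, while the 500 K upper bound comes from the example in the text.

```python
# Plausibility bounds per field, in kelvin. The 500 K ceiling follows the
# example in the text; the field name and 0 K floor are assumptions.
THRESHOLDS = {"surface_temp_k": (0.0, 500.0)}

def find_violations(record, thresholds=THRESHOLDS):
    """Return (field, value) pairs that are missing or out of range."""
    violations = []
    for field, (low, high) in thresholds.items():
        value = record.get(field)
        if value is None or not (low <= value <= high):
            violations.append((field, value))
    return violations

def split_records(records):
    """Separate records that pass every threshold from those that do not."""
    accepted, rejected = [], []
    for record in records:
        if find_violations(record):
            rejected.append(record)
        else:
            accepted.append(record)
    return accepted, rejected

# Hypothetical incoming batch: one plausible reading, one impossible
# reading, and one missing value.
incoming = [
    {"surface_temp_k": 288.0},
    {"surface_temp_k": 5000.0},
    {"surface_temp_k": None},
]

accepted, rejected = split_records(incoming)
print(len(accepted), "accepted;", len(rejected), "held back for review")
```

Holding rejected records back for review, rather than silently dropping them, keeps a trail of what was filtered before it reached the warehouse.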

 

Resources: