Adv Quant: Use of Bayesian Analysis in research

Knowledge held before data collection and knowledge gained from data collection do not tell the full story until they are combined, which establishes the need for Bayesian analysis (Hubbard, 2010).  Bayes’ theorem is a conditional-probability rule that takes prior knowledge into account and updates it when new data become available (Hubbard, 2010; Smith, 2015).  Bayesian analysis helps avoid both overconfidence and underconfidence because it ignores neither the prior information nor the new data (Hubbard, 2010).  There are many examples of how Bayesian analysis can be used in the context of social media data.  Below are just three of many, with a minimal worked example of the updating rule following the list:

  • With high precision, Bayesian analysis was able to distinguish spam Twitter accounts from legitimate users based on their followers/following ratio and their 100 most recent tweets (McCord & Chuah, 2011). McCord and Chuah (2011) were able to use Bayesian analysis to achieve about 75% accuracy in detecting spam using user-based features alone, and roughly 90% accuracy when using both user-based and content-based features.
  • Boullé (2014) applied Bayesian analysis to roughly 60,000 URLs drawn from 100 websites. The goal was to predict the number of visits and messages on Twitter and Facebook after 48 hours, and the Bayesian predictions came close to the actual numbers, showcasing the robustness of the approach.
  • Zaman, Fox, and Bradlow (2014) were able to use Bayesian analysis to predict the popularity of tweets, measured as the final count of retweets a source tweet receives.
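To make the updating rule concrete, below is a minimal Python sketch of Bayes’ theorem applied to a hypothetical spam-screening scenario in the spirit of McCord and Chuah (2011); the prior, the likelihood values, and the two pieces of evidence are illustrative assumptions, not numbers from any of the cited studies.

    # Minimal sketch of Bayes' rule: a prior belief is updated as evidence arrives.
    # All probabilities below are hypothetical and for illustration only.

    def bayes_update(prior, likelihood_if_true, likelihood_if_false):
        """Return P(hypothesis | evidence) from a prior and the two likelihoods."""
        numerator = likelihood_if_true * prior
        evidence = numerator + likelihood_if_false * (1.0 - prior)
        return numerator / evidence

    # Hypothetical example: prior belief that an account is spam, updated after
    # observing a suspicious followers/following ratio, then updated again after
    # observing link-heavy recent tweets (the two pieces of evidence are treated
    # as conditionally independent).
    p_spam = 0.10                              # prior P(spam)
    p_spam = bayes_update(p_spam, 0.70, 0.20)  # evidence 1: odd follower ratio
    p_spam = bayes_update(p_spam, 0.80, 0.30)  # evidence 2: link-heavy tweets
    print(f"Posterior probability of spam: {p_spam:.2f}")

Each call treats the previous posterior as the new prior, which is the prior-to-posterior updating that Hubbard (2010) describes; chaining the two updates assumes the pieces of evidence are conditionally independent, the same simplification a naive Bayes classifier makes.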

An in-depth exploration of Zaman et al. (2014)

Goal:

The researchers aimed to predict how popular a tweet would become by using a Bayesian model to analyze the time path of the retweets a tweet receives and to estimate its eventual number of retweets one week later.

  • They analyzed 52 root tweets spanning different topics such as music, politics, etc.
    • They narrowed the scope to root tweets with a maximum of 1,800 retweets each.

Defining the parameters:

  • Twitter = microblogging site
  • Tweets = microblogging content contained in up to 140 characters
  • Root tweets = original tweets
  • Root user = generator of the root tweet
  • End users = those who read the root tweet and retweeted it
  • Twitter followers = people who follow the content of a root user
  • Follower graph = the social graph formed by the connections among known Twitter followers
  • Retweet = a Twitter follower’s sharing of a user’s content for their own followers to read
  • Depth of 1 = end users who retweeted the root tweet directly
  • Depth of 2 = end users who retweeted a retweet of the root tweet
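To illustrate the depth definitions above, here is a small hypothetical sketch that computes retweet depth from a parent-pointer representation of a retweet cascade; the data structure and example retweets are assumptions for illustration, not the authors’ actual representation.

    # Hypothetical retweet cascade: each retweet records which tweet it retweeted.
    # Depth 1 = a retweet of the root tweet; depth 2 = a retweet of a depth-1 retweet.
    from collections import Counter

    parents = {
        "rt1": "root",  # retweet of the root tweet    -> depth 1
        "rt2": "root",  # another retweet of the root  -> depth 1
        "rt3": "rt1",   # retweet of a retweet         -> depth 2
    }

    def depth(tweet_id):
        """Count the hops from a retweet back up to the root tweet."""
        hops = 0
        while tweet_id != "root":
            tweet_id = parents[tweet_id]
            hops += 1
        return hops

    print(Counter(depth(t) for t in parents))  # Counter({1: 2, 2: 1})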

Exploration of the data:

From the 52 sampled root tweets, the researchers found that the tweets had anywhere between 21 and 1,260 retweets associated with them, and that the last retweet could have occurred anywhere from a few hours to a few days after the root tweet was generated.  The researchers calculated the median retweet times, which ranged from 4 minutes to 3 hours.  The differences between the median times were not statistically significant enough to reject the null hypothesis.  According to the researchers, this gave more weight to the potential value of the Bayesian model over purely descriptive/exploratory methods.

The researchers explored the depth of the retweets and found that, across those 52 root tweets, 11,882 retweets were at a depth of 1 whereas 314 were at a depth of 2 or more, which suggested that root tweets get far more retweets than retweeted tweets do.  The researchers suggested that the deeper retweets seemed to occur when the retweeter had a large number of followers.

The researchers noted that the retweet time path decays approximately like a log-normal distribution, which is the distribution used in their Bayesian model.
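As a rough illustration of the log-normal description, the sketch below fits a log-normal distribution to synthetic retweet times with SciPy; the data are generated, and the generic fitting call stands in for, rather than reproduces, the authors’ estimation procedure.

    # Fit a log-normal distribution to synthetic retweet times (minutes after the
    # root tweet), mimicking the roughly log-normal decay of retweet activity.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    retweet_minutes = rng.lognormal(mean=3.0, sigma=1.0, size=500)  # synthetic data

    # Fix the location parameter at 0 so times are measured from the root tweet.
    shape, loc, scale = stats.lognorm.fit(retweet_minutes, floc=0)
    print(f"Estimated sigma (shape): {shape:.2f}, median retweet time: {scale:.1f} min")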

Bayesian analysis results:

The researchers randomly partitioned their data into a training set of 26 observations and a testing set of 26 observations, and varied the fraction of observed retweets from 10% to 100%.  Their main results are plotted as boxplots, where the whiskers cover 90% of the posterior solution (Figure 10).

[Figure 10 from Zaman et al. (2014): boxplots of the absolute percent error in predicted retweet counts at observation fractions from 10% to 100%, with whiskers covering 90% of the posterior.]

The figure above is taken directly from Zaman et al. (2014). The authors noted that as the observation fraction increased, the absolute percent errors decreased.  For future work, the researchers suggested that their analysis could be parallelized to incorporate more data points, could take into account the time of day the root tweet was posted, and could examine the content within the tweets and how that content affects their retweet-ability.
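To make the error metric in the figure concrete, here is a minimal sketch of how an absolute percent error could be computed for predicted final retweet counts at different observation fractions; the true count and the predictions are hypothetical placeholders, not values from Zaman et al. (2014).

    # Absolute percent error of predicted final retweet counts at several
    # observation fractions. All numbers are hypothetical placeholders.
    true_final = 800  # actual retweet count of a root tweet one week later

    # Hypothetical predictions made after observing 10%, 50%, and 100% of retweets.
    predictions = {0.10: 520, 0.50: 710, 1.00: 795}

    for fraction, predicted in predictions.items():
        ape = abs(predicted - true_final) / true_final * 100
        print(f"Observed {fraction:.0%} of retweets -> absolute percent error {ape:.1f}%")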

References

  • Boullé, M. (2014). Selective naive Bayes regressor with variable construction for predictive web analytics.
  • Hubbard, D. W. (2010). How to measure anything: Finding the values of “intangibles” in business (2nd ed.). New Jersey: John Wiley & Sons, Inc.
  • McCord, M., & Chuah, M. (2011). Spam detection on Twitter using traditional classifiers. In International Conference on Autonomic and Trusted Computing (pp. 175-186). Springer Berlin Heidelberg.
  • Smith, M. (2015). Statistical analysis handbook. Retrieved from http://www.statsref.com/HTML/index.html?introduction.html
  • Zaman, T., Fox, E. B., & Bradlow, E. T. (2014). A Bayesian approach for predicting the popularity of tweets. The Annals of Applied Statistics, 8(3), 1583-1611.

Big Data Analytics: Pizza Industry

Pizza, pizza! A competitive analysis was completed on Domino’s, Pizza Hut, and Papa John’s.  Competitive analysis is the gathering of freely available external data, e.g., social media such as Twitter tweets and Facebook posts.  That is what He, Zha, and Li (2013) studied: approximately 307 total tweets (266 from Domino’s, 24 from Papa John’s, 17 from Pizza Hut) and 135 wall posts (63 from Domino’s, 37 from Papa John’s, 35 from Pizza Hut) for the month of October 2011 (He et al., 2013).  It should be noted that these are the big three pizza chains, controlling about 23% of the total market share (7.6% Domino’s, 4.23% Papa John’s, 11.65% Pizza Hut) (He et al., 2013).  Posts and tweets contain text data, videos, and pictures.  All the data collected were text-based and collected manually, and the SPSS Clementine tool was used to discover themes in the text (He et al., 2013).

He et al. (2013) found that Domino’s Pizza used social media to engage its customers the most, doing the most to reply to tweets and posts.  The types of posts at all three companies ranged from promotion and marketing to polling (e.g., “What is your favorite topping?”), facts about pizza, Halloween-themed posts, baseball-themed posts, etc. (He et al., 2013).  The text mining across all three companies surfaced these themes: ordering and delivery (customers shared their experiences and feelings about them), pizza quality (taste and quality), feedback on customers’ purchase decisions, casual socialization posts (e.g., Happy Halloween, Happy Friday), and marketing tweets (posts on current deals, promotions, and advertisements) (He et al., 2013).  Besides text mining, there was also a content analysis of each company’s site (367 pictures & 67 videos from Domino’s, 196 pictures & 40 videos from Papa John’s, and 106 pictures & 42 videos from Pizza Hut), which showed that the big three were trying to drive customer engagement (He et al., 2013).
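The authors used the SPSS Clementine tool for their text mining; as a simplified, hypothetical stand-in for that kind of theme discovery, the sketch below tags tweets with themes by keyword matching. The theme keywords and example tweets are assumptions for illustration only.

    # Simplified keyword-based theme tagging, loosely mirroring the themes reported
    # by He et al. (2013). Keywords and example tweets are invented for illustration.
    from collections import Counter

    themes = {
        "ordering/delivery": ["order", "delivery", "delivered", "late"],
        "pizza quality": ["taste", "fresh", "quality", "topping"],
        "marketing": ["deal", "promo", "coupon", "offer"],
        "casual socialization": ["happy halloween", "happy friday"],
    }

    tweets = [
        "My order was delivered in 20 minutes!",
        "Great deal on the two-topping promo.",
        "Happy Halloween from your local store!",
    ]

    theme_counts = Counter()
    for tweet in tweets:
        text = tweet.lower()
        for theme, keywords in themes.items():
            if any(keyword in text for keyword in keywords):
                theme_counts[theme] += 1

    print(theme_counts)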

He et al. (2013) cite the theory that, with higher positive customer engagement, customers can become brand advocates, which increases their brand loyalty and pushes referrals to their friends; approximately one in three people followed a friend’s referral when it was made through social media.  Thus, by evaluating the structured and unstructured data available about their own products and those of their competitors, organizations can improve their customer service, drive improvements in their own products, and draw more customers to their products (He et al., 2013).  The key lessons from this study, which would help any organization gain an advantage in the market, are to (1) constantly monitor your social media and that of your competitors, (2) establish a benchmark of how many posts, likes, shares, etc. you and your competitors generate, (3) mine the conversational data for content and context, and (4) analyze the impact of your social media footprint on your own business (e.g., when prices rise or fall, what is the response?) (He et al., 2013).
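As a small illustration of lesson (2), the sketch below tabulates the October 2011 activity counts reported by He et al. (2013) into a simple per-brand benchmark; the tabulation itself is only an illustrative example, not the authors’ method.

    # Simple competitive benchmark built from the October 2011 counts reported by
    # He et al. (2013): tweets and Facebook wall posts per pizza chain.
    benchmark = {
        "Domino's": {"tweets": 266, "wall_posts": 63},
        "Papa John's": {"tweets": 24, "wall_posts": 37},
        "Pizza Hut": {"tweets": 17, "wall_posts": 35},
    }

    for brand, counts in benchmark.items():
        total = counts["tweets"] + counts["wall_posts"]
        print(f"{brand}: {counts['tweets']} tweets + {counts['wall_posts']} wall posts = {total} items")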

Resources:

  • He, W., Zha, S., & Li, L. (2013). Social media competitive analysis and text mining: A case study in the pizza industry. International Journal of Information Management, 33(3), 464-472.