Adv Quant: Use of Bayesian Analysis in research

Relying only on knowledge held before data collection, or only on the knowledge gained from data collection, doesn’t tell the full story until the two are combined, which establishes the need for Bayesian analysis (Hubbard, 2010).  Bayes’ theorem is a statement of conditional probability that takes prior knowledge into account and updates it when new data become available (Hubbard, 2010; Smith, 2015).  Bayesian analysis helps avoid overconfidence and underconfidence because it ignores neither prior knowledge nor new data (Hubbard, 2010).  There are many examples of how Bayesian analysis can be used in the context of social media data.  Below are just three of them:

  • With high precision, Bayesian analysis was able to distinguish spam Twitter accounts from legitimate users, based on their followers/following ratio and their 100 most recent tweets (McCord & Chuah, 2011). McCord and Chuah (2011) used Bayesian analysis to achieve 75% accuracy in detecting spam using only user-based features, and ~90% accuracy when using both user-based and content-based features.
  • Boullé (2014) applied Bayesian analysis to 60,000 URLs from 100 websites. The goal was to predict the number of visits and messages on Twitter and Facebook after 48 hours, and Boullé (2014) was able to come close to the actual numbers, showcasing the robustness of the approach.
  • Zaman, Fox, and Bradlow (2014) used Bayesian analysis to predict the popularity of tweets, measured as the final count of retweets a source tweet gets.
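
To make the idea of updating a prior concrete, here is a minimal sketch of Bayes’ theorem applied to a spam-account question like the one in the first example. Every probability below is a hypothetical number chosen for illustration, the function name is my own, and treating the two pieces of evidence as conditionally independent is a simplifying assumption; this is not the classifier from McCord and Chuah (2011).

```python
def posterior(prior, likelihood_if_true, likelihood_if_false):
    """Bayes' theorem for a yes/no hypothesis: returns P(hypothesis | evidence)."""
    evidence = likelihood_if_true * prior + likelihood_if_false * (1.0 - prior)
    return likelihood_if_true * prior / evidence


# Hypothetical numbers, for illustration only.
p_spam = 0.10                 # prior belief that an account is spam
p_ratio_given_spam = 0.80     # P(low followers/following ratio | spam)
p_ratio_given_legit = 0.20    # P(low followers/following ratio | legitimate)

# Update the prior with a user-based feature.
p_after_ratio = posterior(p_spam, p_ratio_given_spam, p_ratio_given_legit)

# Update again with a content-based feature (assumed conditionally independent),
# using the previous posterior as the new prior.
p_after_content = posterior(p_after_ratio, 0.70, 0.10)

print(f"after user feature:    {p_after_ratio:.3f}")
print(f"after content feature: {p_after_content:.3f}")
```

With these made-up inputs, the belief that the account is spam rises with each new piece of evidence, which is the prior-plus-data combination described above.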

An in-depth exploration of Zaman et al. (2014)

Goal:

The researchers aimed to predict how popular a tweet can become by using a Bayesian model to analyze the time path of the retweets a tweet receives and to predict its eventual number of retweets one week later.

  • They analyzed 52 tweets spanning different topics such as music, politics, etc.
    • They narrowed the scope to tweets with a maximum of 1,800 retweets per root tweet.

Defining the parameters:

  • Twitter = microblogging site
  • Tweets = microblogging content that is contained in up to 140 characters
  • Root tweets = original tweets
  • Root user = generator of the root tweet
  • End user = those who read the root tweet and retweeted it
  • Twitter followers = people who are following the content of a root user
  • Follower graph = the social graph formed by the connections among known Twitter followers
  • Retweet = a Twitter follower’s sharing of content from the user for their own followers to read
  • Depth of 1 = how many end users retweeted a root tweet
  • Depth of 2 = how many end users retweeted a retweet of the root tweet
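
The depth definitions above can be sketched as a small tree walk. The tweet identifiers and the `parent_of` mapping are hypothetical, invented only to illustrate how depth could be computed; they are not data from the study.

```python
from collections import Counter


def retweet_depths(parent_of, root):
    """Return {tweet_id: depth}; depth 1 means a direct retweet of the root tweet."""
    depths = {}

    def depth(tweet_id):
        if tweet_id == root:
            return 0  # the root tweet itself sits at depth 0
        if tweet_id not in depths:
            depths[tweet_id] = depth(parent_of[tweet_id]) + 1
        return depths[tweet_id]

    for tweet_id in parent_of:
        depth(tweet_id)
    return depths


# Hypothetical retweet tree: each retweet maps to the tweet it was retweeted from.
parent_of = {"a": "root", "b": "root", "c": "a", "d": "c"}

depths = retweet_depths(parent_of, "root")
depth_counts = Counter(depths.values())
print(depths)
print(depth_counts)
```

Here "a" and "b" are retweets of the root tweet (depth of 1), "c" is a retweet of a retweet (depth of 2), and "d" goes one level deeper.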

Exploration of the data:

From the 52 sampled root tweets, the researchers found that each tweet had anywhere from 21 to 1,260 retweets associated with it, and that the last retweet could occur anywhere from a few hours to a few days after the root tweet’s generation.  The researchers calculated the median times, yielding values that ranged from 4 minutes to 3 hours.  The differences between the median times were not statistically significant enough to reject the null hypothesis, which concerned a difference in the median times.  As stated by the researchers, this lent more weight to the potential value of the Bayesian model over purely descriptive or exploratory methods.

The researchers explored the depth of the retweets and found that, across those 52 root tweets, 11,882 retweets had a depth of 1, whereas 314 had a depth of 2 or more, which suggested that root tweets get more retweets than retweeted tweets do.  The researchers suggested that the deeper retweets seemed to occur because the retweeter had a large number of followers.

The researchers noted that the retweets along the time path decay in a manner similar to a log-normal distribution, which is what was used in the Bayesian analysis model.
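
To show why a log-normal shape is useful for prediction, here is a rough sketch that scales up an observed retweet count by the fraction of retweets a log-normal curve expects by that time. This is a simple plug-in estimate, not the Bayesian model from Zaman et al. (2014): it produces no posterior distribution, and the parameter values (`mu`, `sigma`) and counts are hypothetical.

```python
import math


def lognormal_cdf(t, mu, sigma):
    """Fraction of retweets expected by time t under a log-normal distribution."""
    if t <= 0:
        return 0.0
    z = (math.log(t) - mu) / (sigma * math.sqrt(2.0))
    return 0.5 * (1.0 + math.erf(z))


def estimate_final_retweets(observed_count, elapsed_minutes, mu, sigma):
    """Plug-in estimate: observed count divided by the expected fraction so far."""
    return observed_count / lognormal_cdf(elapsed_minutes, mu, sigma)


# Hypothetical parameters: median retweet time of 30 minutes, sigma of 1.5.
mu = math.log(30.0)
sigma = 1.5

# Hypothetical observation: 40 retweets seen in the first 30 minutes.
estimate = estimate_final_retweets(40, 30.0, mu, sigma)
print(f"estimated final retweets: {estimate:.0f}")
```

Because 30 minutes is the assumed median, half of the retweets are expected by then, so 40 observed retweets scale up to an estimate of 80. A full Bayesian treatment would instead place priors on the parameters and report a range of plausible final counts.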

Bayesian analysis results:

The researchers randomly partitioned their data into a training set of 26 observations and a testing set of 26 observations, and varied the amount of retweet observations used from 10% to 100% of the last retweet.  Their main results are plotted in boxplots, where the whiskers cover 90% of the posterior solution (Figure 10).

[Figure: IP3F12.png, boxplots of the results from Zaman et al. (2014)]

The figure above is taken directly from Zaman et al. (2014). The authors mentioned that as the observation fraction increased, the absolute percent errors decreased.  For future work, the researchers suggested that their analysis could be parallelized to incorporate more data points, could take into consideration the time of day the root tweet was posted, and could examine the content of the tweets and how it affects their retweet-ability.
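
The random 26/26 split and the absolute percent error mentioned in the results can be sketched as follows. The tweet identifiers, the random seed, and the predicted and actual counts are all hypothetical; the study’s own data and code are not reproduced here.

```python
import random


def absolute_percent_error(predicted, actual):
    """Absolute difference between prediction and truth, as a percent of the truth."""
    return 100.0 * abs(predicted - actual) / actual


# Hypothetical identifiers for the 52 root tweets.
tweet_ids = list(range(52))

# Shuffle with a fixed seed so the split is repeatable, then cut in half.
rng = random.Random(0)
rng.shuffle(tweet_ids)
train_ids, test_ids = tweet_ids[:26], tweet_ids[26:]

# Hypothetical prediction of 90 retweets against an actual final count of 100.
error = absolute_percent_error(90, 100)
print(len(train_ids), len(test_ids), error)
```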

References

  • Boullé, M. (2014). Selective Naive Bayes Regressor with Variable Construction for Predictive Web Analytics.
  • Hubbard, D. W. (2010). How to measure anything: Finding the values of “intangibles” in business (2nd ed.). New Jersey: John Wiley & Sons, Inc.
  • McCord, M., & Chuah, M. (2011). Spam detection on twitter using traditional classifiers. In International Conference on Autonomic and Trusted Computing (pp. 175-186). Springer Berlin Heidelberg.
  • Smith, M. (2015). Statistical analysis handbook. Retrieved from http://www.statsref.com/HTML/index.html?introduction.html
  • Zaman, T., Fox, E. B., & Bradlow, E. T. (2014). A Bayesian approach for predicting the popularity of tweets. The Annals of Applied Statistics, 8(3), 1583-1611.