Adv Topics: Extracting Knowledge from big data

The evolution of data to wisdom is described by the DIKW pyramid. Data is just facts without any context; when those facts are used to understand relationships, they generate Information (Almeyer-Stubbe & Coleman, 2014). When that information is used to understand patterns, it helps build Knowledge, and when that knowledge is used to understand principles, it builds Wisdom (Almeyer-Stubbe & Coleman, 2014; Bellinger, Castro, & Mills, n.d.). Moving from one level of the DIKW pyramid to the next requires an appreciation of learning “why” (Bellinger et al., n.d.). Big data, a term first coined in a Gartner blog post, is data that has high volume, variety, and velocity, but without any interest in understanding that data, data scientists will lack context (Almeyer-Stubbe & Coleman, 2014; Bellinger et al., n.d.; Laney, 2001). Therefore, applying the DIKW pyramid can help turn that big data into extensive knowledge (Almeyer-Stubbe & Coleman, 2014; Bellinger et al., n.d.; Sakr, 2014). Extensive knowledge is derived from giving meaning to big data, usually in the form of predictive analytics algorithms (Sakr, 2014).

Machine learning requires historical data and is the part of the data analytics process, under data mining, that is used to understand hidden patterns or structures within the data (Almeyer-Stubbe & Coleman, 2014). Machine learning is easier to build and maintain than other classical data mining techniques (Wollan, Smith, & Zhou, 2010). Machine learning algorithms include clustering, classification, and association rule techniques, and the algorithm selected from these three families must meet the needs of the data (Services, 2015). Unsupervised machine learning techniques like clustering are used when data scientists do not understand or have not classified the data before mining it and want to uncover hidden structures within the data set (Brownlee, 2016; Services, 2015). Supervised machine learning involves model training and model testing to aid in understanding which input variables feed into an output variable, using techniques such as classification and regression (Brownlee, 2016).

An example of an open source Hadoop machine learning algorithm library is Apache Mahout, which can be found at http://mahout.apache.org (Lublinsky, Smith, & Yakubovich, 2013). A limitation of learning from historical data to predict the future is that it can “stifle innovation and imagination” (Almeyer-Stubbe & Coleman, 2014). Another limitation is that current algorithms may not run on distributed database systems, so some tailoring of the algorithms may be needed (Services, 2015). The future of machine learning involves its algorithms becoming more interactive with the end user, known as active learning (Wollan, Smith, & Zhou, 2010).

Case Study: Machine learning, medical diagnosis, and biomedical engineering research – commentary (Foster, Koprowski, & Skufca, 2014)

The authors created a synthetic training data set to simulate a typical medical classification problem of healthy and ill people, assigning random numbers to 10 health variables. Because the variables are random, the actual classification accuracy should be 50%, which is no better than pure chance. The authors found that when classification machine learning algorithms are misapplied, they can lead to false results. This was demonstrated when their model took only 50 people to produce accuracy values similar to those of pure chance alone. Thus, the authors of this paper were warning the medical field that misapplying classification techniques can lead to overfitting.
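The overfitting risk can be illustrated with a small sketch. This is not the authors' experiment or their MATLAB code; it is a minimal Python illustration that assumes a 1-nearest-neighbour classifier, 50 simulated people, and 10 random variables whose values are unrelated to the healthy/ill label. All function names are hypothetical.

```python
import random

def make_random_people(n, n_features, rng):
    """Return n (features, label) pairs; labels are coin flips unrelated to the features."""
    return [([rng.random() for _ in range(n_features)], rng.randint(0, 1))
            for _ in range(n)]

def predict_1nn(train, x):
    """Return the label of the training point closest to x (squared Euclidean distance)."""
    nearest = min(train, key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], x)))
    return nearest[1]

def accuracy(train, data):
    """Fraction of points in data whose label the 1-NN model predicts correctly."""
    hits = sum(predict_1nn(train, x) == y for x, y in data)
    return hits / len(data)

rng = random.Random(0)
train = make_random_people(50, 10, rng)     # small "study" sample
fresh = make_random_people(2000, 10, rng)   # independent people, same random process

train_accuracy = accuracy(train, train)     # evaluated on the data the model memorised
fresh_accuracy = accuracy(train, fresh)     # evaluated on data the model never saw
print(f"accuracy on training data:    {train_accuracy:.2f}")
print(f"accuracy on independent data: {fresh_accuracy:.2f}")
```

Because each training point is its own nearest neighbour, the model looks perfect on the data it was built from, while on independent data it can only hover around chance. The gap between the two numbers is the kind of false result the authors warn about.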

The authors then looked at feature selection for classifying Hashimoto’s disease using clinical ultrasound data from 250 people with the disease and 250 healthy people. Ten variables were selected to help classify these images, and a MATLAB machine learning algorithm was trained on 400 people (200 healthy and 200 ill) and then tested on 100 people (50 healthy and 50 ill). They were able to show that using 3-4 variables produced better classification results, so those 3-4 variables had a large information gain. This can mislead practitioners, because results from a small data set could be generalized too broadly and because the training and testing data sets may lack independence. The authors argued that larger data sets are needed to remove some of the issues that could result in the misapplication of classifiers.

The authors have the following four recommendations when considering the use of supervised machine learning classification algorithms:

    1. Clearly state the purpose of the study and come from a place of understanding of that problem and its applications.
    2. Minimize the number of variables used in classifiers, for example by using pruning algorithms that select only the variables that meet a certain level of information gain. This is more important with smaller data sets than with big data.
    3. Understand that classifiers are sensitive and that results gained from one set of instances might require further adjustments to be implemented elsewhere.
    4. Classification algorithms and data mining are part of the experimental process, not the answer to all problems.
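The second recommendation, keeping only variables that reach a certain level of information gain, can be sketched as follows. This is an illustrative Python sketch rather than the authors' procedure: the variable names (marker_a, marker_b), the tiny data set, and the 0.5-bit threshold are all hypothetical, and information gain is computed here as the entropy reduction from splitting on a discrete variable.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(values, labels):
    """Entropy reduction obtained by splitting the labels on a discrete variable."""
    total = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def select_variables(columns, labels, min_gain):
    """Keep only the variables whose information gain reaches min_gain."""
    return [name for name, values in columns.items()
            if information_gain(values, labels) >= min_gain]

labels = ["ill", "ill", "healthy", "healthy"]
columns = {
    "marker_a": ["high", "high", "low", "low"],   # separates the two classes perfectly
    "marker_b": ["high", "low", "high", "low"],   # tells us nothing about the class
}
selected = select_variables(columns, labels, min_gain=0.5)
print(selected)
```

In this toy data only marker_a survives the threshold. With a handful of rows, a useless variable can reach a high gain by accident, which is why the recommendation matters most for small data sets.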


Case Study: Sociotechnical system in education

Definition of key terms

  • Sociotechnical Systems: the interplay, impact, and mutual influence that arise when technology is introduced into a social system, e.g., a workplace, school, or home (com, n.d.; Sociotechnical theory, n.d.). The social system comprises people at all levels of knowledge, skills, attitudes, values, and needs (Sociotechnical theory, n.d.).
  • Formal Learning: scholastic learning in schools (Hayashi & Baranauskas, 2013)
  • Non-formal Learning: scholastic learning outside of schools (Hayashi & Baranauskas, 2013)
  • Informal Learning: other learning that occurs outside of schools (Hayashi & Baranauskas, 2013)

Case Study Description (Hayashi & Baranauskas, 2013)

This qualitative study introduced 520 donated laptops among the students (ages 6-14) and teachers of a public school, Padre Emilio Miotti School, in Campinas, Brazil. Its goal was to provide a detailed description of the results in order to inspire (transfer knowledge), rather than to generalize the results to other schools and scholastic sociotechnical systems. The sociotechnical system is defined by cultural conventions, and the participants in the study can be classified under the formal, informal, and technical levels of a Semiotic Onion (Figure 1).

Figure 1: The Semiotic Onion, with its formal, informal, and technical levels (Source: taken directly from Hayashi and Baranauskas, 2013)

Therefore, the goal of this qualitative study was to understand how to insert the technological artifacts (the laptops) into the scholastic curriculum in a way that makes sense to the end users (the scholastic community: teachers, students, etc.) and produces a meaningful integration across all levels of the Semiotic Onion. Data collection was done through interviews and discussions in the Semio-participatory Workshops in 2009, and through the authors acting as participant observers in the scholastic activities over a one-year period.

There were four opportunities that should be considered (supporting forces for adoption):

  • Transforming homework assignments: Teachers could bring some homework into the classwork and allow the students to conduct at school the searches they would normally do at home. Teachers could now observe how the emotional flux of their students evolved while they completed the assignments. This evolution of the emotional flux during homework used to be observed only by parents.
  • Integrating the school in Interdisciplinary Activities: In a collaborative fashion, teachers were able to create assignments that used the laptop cameras to capture everyday objects or events in the students’ lives, to show them how to eat healthier, how to save on the electric bill, and to teach them about calories, watts, electricity, different animals and their behaviors, etc. This creates a path from data to information to knowledge that helps motivate the students to learn more.
  • Laptops inside and outside the school walls: Students took more pride in using their own devices and were willing to showcase their technology and educate the public about it and its effectiveness. This has far-reaching results that were not explored in this study.
  • Student Volunteers: Older students helped troubleshoot younger students’ laptop problems, which taught some students patience and other skills across the Semiotic Onion. The students learned about responsibility, empathy, and other vital social skills.

There were issues across the Semiotic Onion that were also enumerated (challenging forces for adoption):

  • Technological: The Internet connection was slow and intermittent even though broadband internet and wireless routers were available
  • Technological: How to recharge 30 laptops at a time with only two wall sockets
  • Technological: How to transport laptops back and forth from storage rooms to classrooms
  • Technological: Laptop response times were slow at best during certain periods
  • Technological: Demand for technological support increased dramatically
  • Formal: The fear of laptops being stolen from the students on their way to or from school
  • Formal: Teachers worried about whether they could find or create technological assignments that fit their lesson plans
  • Informal: Teachers were not comfortable teaching with technology they were not familiar with themselves
  • Informal: Most parents did not and could not use the students’ laptops to assist them

This study concludes that the introduction of technology into the education system, in the scenarios of this case study, had a positive response, and that the key lessons learned and the assignments could be duplicated and studied in other scenarios. Therefore, the authors emphasized the transferability of the study rather than the generalizability of the results.

Evaluation of this case study

This was a case study of the socio-technological scholastic system when donated laptops were introduced into a Brazilian school. The paper presented the socio-technological plan and its analysis. The authors were thorough, listing all the opportunities (supporting forces for adoption) and issues (challenging forces for adoption) of technological inclusion in the scholastic system and evaluating them from the perspectives of the Semiotic Onion. It was therefore a thorough study of a positive introduction of technology into the scholastic social system. The only drawback is that the researchers failed to investigate how the laptops affected the world outside of the school walls and family homes.

This paper is a well-designed qualitative study that uses surveys, interviews, etc. to gain its primary results. To improve the study’s credibility, the researchers also became participant observers for one year, videotaping and taking field notes to supplement their analysis. They mention that case studies are done to foster transferability of ideas across similar situations rather than to generalize the results. Therefore, the authors stated the limitations of this study and how they mitigated issues that could arise about the study’s credibility.


Adv Quant: Use of Bayesian Analysis in research

Knowledge held before data collection and knowledge gained from data collection do not tell the full story until they are combined, which establishes the need for Bayesian analysis (Hubbard, 2010). Bayes’ theorem expresses a conditional probability that takes prior knowledge into account and updates it when new data becomes available (Hubbard, 2010; Smith, 2015). Bayesian analysis helps avoid overconfidence and underconfidence because it ignores neither prior nor new data (Hubbard, 2010). There are many examples of how Bayesian analysis can be used in the context of social media data. Below are just three of many:

  • With high precision, Bayesian analysis was able to distinguish spam Twitter accounts from legitimate users, based on their followers/following ratio and their 100 most recent tweets (McCord & Chuah, 2011). McCord and Chuah (2011) were able to use Bayesian analysis to achieve 75% accuracy in detecting spam using only user-based features, and ~90% accuracy when using both user-based and content-based features.
  • Boullé (2014) applied Bayesian analysis to 60,000 URLs from 100 websites. The goal was to predict the number of visits and messages on Twitter and Facebook after 48 hours, and Boullé (2014) was able to come close to the actual numbers, showcasing the robustness of the approach.
  • Zaman, Fox, and Bradlow (2014) were able to use Bayesian analysis to predict the popularity of tweets, measured as the final count of retweets a source tweet receives.
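The idea of combining prior knowledge with new data can be sketched with Bayes' theorem in a few lines of Python. The probabilities below are hypothetical and chosen only for illustration; they are not taken from McCord and Chuah (2011) or any other cited study. The second update also assumes the two features are conditionally independent, as a naive Bayes classifier would.

```python
def posterior(prior, p_evidence_given_h, p_evidence_given_not_h):
    """Bayes' theorem: P(H | E) computed from P(H), P(E | H) and P(E | not H)."""
    numerator = p_evidence_given_h * prior
    evidence = numerator + p_evidence_given_not_h * (1.0 - prior)
    return numerator / evidence

# Hypothetical numbers, for illustration only.
prior_spam = 0.2               # belief that an account is spam before looking at it
p_feature_given_spam = 0.9     # P(low followers/following ratio | spam)
p_feature_given_legit = 0.1    # P(low followers/following ratio | legitimate)

updated = posterior(prior_spam, p_feature_given_spam, p_feature_given_legit)
print(f"P(spam | first feature)  = {updated:.3f}")

# The posterior becomes the new prior when a second piece of evidence arrives.
updated_again = posterior(updated, 0.7, 0.3)
print(f"P(spam | both features)  = {updated_again:.3f}")
```

Neither the prior alone nor the evidence alone gives the final belief. The result depends on both, which is the point about avoiding overconfidence and underconfidence.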

An in-depth exploration of Zaman, et al. (2014)

Goal:

The researchers aimed to predict how popular a tweet can become by using a Bayesian model to analyze the time path of retweets a tweet receives and to predict the eventual number of retweets of a tweet one week later.

  • They analyzed 52 tweets across different topics such as music, politics, etc.
    • They narrowed the scope to tweets with a maximum of 1800 retweets per root tweet.

Defining the parameters:

  • Twitter = microblogging site
  • Tweets = microblogging content that is contained in up to 140 characters
  • Root tweets = original tweets
  • Root user = generator of the root tweet
  • End user = those who read the root tweet and retweeted it
  • Twitter followers = people who are following the content of a root user
  • Follower graph = the social graph formed by the connections among known Twitter followers
  • Retweet = a Twitter follower’s sharing of content from the user for their own followers to read
  • Depth of 1 = how many end users retweeted a root tweet
  • Depth of 2 = how many end users retweeted a retweet of the root tweet
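The depth definitions above can be made concrete with a small sketch. The retweet paths and user names below are hypothetical, and the function assumes every retweet path eventually leads back to the root user.

```python
from collections import Counter

def retweet_depths(parent_of, root):
    """Map each retweeting user to the number of hops back to the root user."""
    depths = {}
    for user in parent_of:
        hops, current = 0, user
        while current != root:          # walk back along the retweet path
            current = parent_of[current]
            hops += 1
        depths[user] = hops
    return depths

# Hypothetical retweet paths: each key retweeted the content from its value.
parent_of = {"ana": "root_user", "ben": "root_user", "cara": "ana"}
depths = retweet_depths(parent_of, "root_user")
print(Counter(depths.values()))
```

Here ana and ben retweeted the root tweet directly (depth of 1), while cara retweeted ana's retweet (depth of 2).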

Exploration of the data:

From the 52 sampled root tweets, the researchers found that the tweets had anywhere between 21 and 1260 retweets associated with them, and that the last retweet could have occurred anywhere from a few hours to a few days after the root tweet’s generation. The researchers calculated the median retweet times, which ranged from 4 minutes to 3 hours. The difference between the median times was not statistically significant enough to reject the null hypothesis concerning a difference in the median times. As stated by the researchers, this lent more weight to the potential value of the Bayesian model over purely descriptive/exploratory methods.

The researchers explored the depth of the retweets and found that, across those 52 root tweets, 11,882 retweets were at a depth of 1, whereas 314 were at a depth of 2 or more, which suggested that root tweets get more retweets than retweeted tweets do. The researchers suggested that the deeper retweets seemed to occur when the retweeter had a large number of followers.

The researchers noted that the number of retweets along the time path decays in a way similar to a log-normal distribution, which is the distribution used in the Bayesian analysis model.
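To see why a log-normal shape is useful for prediction, consider the simplified sketch below. It is not the model of Zaman et al. (2014), which is Bayesian and produces a posterior distribution. This sketch assumes the log-normal parameters mu and sigma are already known (the values used are hypothetical) and simply scales the observed retweet count by the fraction of retweets expected to have arrived so far.

```python
import math

def lognormal_cdf(t, mu, sigma):
    """P(T <= t) for a log-normal variable with log-scale parameters mu and sigma."""
    if t <= 0:
        return 0.0
    return 0.5 * (1.0 + math.erf((math.log(t) - mu) / (sigma * math.sqrt(2.0))))

def estimate_final_retweets(observed_count, elapsed_minutes, mu, sigma):
    """Scale the observed count by the fraction of retweets expected so far.

    Assumes elapsed_minutes > 0 so that the expected fraction is non-zero.
    """
    fraction_seen = lognormal_cdf(elapsed_minutes, mu, sigma)
    return observed_count / fraction_seen

# Hypothetical parameters: a median reaction time of 30 minutes.
mu, sigma = math.log(30.0), 1.5
estimate = estimate_final_retweets(100, 30.0, mu, sigma)
print(round(estimate))
```

At the median reaction time half of the retweets are expected to have occurred, so 100 observed retweets extrapolate to roughly twice that number. A full Bayesian treatment would also quantify the uncertainty around such an estimate.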

Bayesian analysis results:

The researchers randomly partitioned their data into a training set of 26 observations and a testing set of 26 observations, and varied the number of retweet observations from 10% to 100% of the last retweet. Their main results are plotted in boxplots, where the whiskers cover 90% of the posterior solution (Figure 10).

[Figure: boxplots of absolute percent error versus observation fraction]

The figure above is taken directly from Zaman et al. (2014). The authors mentioned that as the observation fraction increased, the absolute percent errors decreased. For future work, the researchers suggested that their analysis could be parallelized to incorporate more data points, could take into consideration the time of day the root tweet was posted, and could account for the content of the tweets and how that content affects their retweet-ability.
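The evaluation setup described above, a random half-and-half partition and an absolute percent error, can be sketched as follows. The helper names, the seed, and the example prediction are hypothetical. The sketch only shows the bookkeeping, not the researchers' model.

```python
import random

def absolute_percent_error(predicted, actual):
    """|predicted - actual| expressed as a percentage of the actual value."""
    return abs(predicted - actual) / actual * 100.0

def split_in_half(items, seed):
    """Randomly partition items into two equal-sized halves."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

tweet_ids = list(range(52))                       # stand-ins for the 52 root tweets
training, testing = split_in_half(tweet_ids, seed=1)
print(len(training), len(testing))

# Hypothetical prediction of 90 retweets for a tweet that finally received 100.
print(absolute_percent_error(predicted=90, actual=100))
```

Keeping the testing half out of model fitting is what makes the reported errors an honest estimate of predictive performance.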

References

  • Boullé, M. (2014). Selective Naive Bayes Regressor with Variable Construction for Predictive Web Analytics.
  • Hubbard, D. W. (2010). How to measure anything: Finding the values of “intangibles” in business (2nd ed.). New Jersey: John Wiley & Sons, Inc.
  • McCord, M., & Chuah, M. (2011). Spam detection on twitter using traditional classifiers. In International Conference on Autonomic and Trusted Computing (pp. 175-186). Springer Berlin Heidelberg.
  • Smith, M. (2015). Statistical analysis handbook. Retrieved from http://www.statsref.com/HTML/index.html?introduction.html
  • Zaman, T., Fox, E. B., & Bradlow, E. T. (2014). A Bayesian approach for predicting the popularity of tweets. The Annals of Applied Statistics, 8(3), 1583-1611.