Adv Quant: Bayesian analysis in R

Introduction

Bayes’ theory is a conditional probability that takes into account prior knowledge, but updates itself when new data becomes available (Hubbard, 2010; Smith, 2015).  The formulation of Bayes’ theory is p(θ |y)= p(theta)*P(y| θ)/(∑(P(θ)*P(y| θ))), where p(θ) is the prior probabilities, and P(y| θ) are the likelihoods (Cowles, Kass, & O’Hagan, 2009).

The Delayed Airplanes Dataset consists of airplane flights from Washington D.C. into New York City.  The date range for this data is for the entire month of February 2016, and there are 702 cases to be studied.

Results Figure 1: Histogram showcasing the density of flight delays that are 15 minutes or longer. Figure 2: Shows summary data for the variables in this Bayesian Analysis before training and testing. Figure 3: Bayesian Prediction of the flight delay data from Washington, D.C. to New York City, NY. Figure 4: Bayesian prediction results versus the test data results, where false negatives are encircled in blue, while false positives are encircled in red.

Discussion

The histogram (Figure 1) showcases that there are almost three times as many cases that flights depart on time from Washington, D.C. to New York City, NY.  Summation data proves this (Table 2).

The above summary (Table 2) states that 77.813% of the flights were not delayed equal to or more than 15 minutes, for the cases we do have data on. There is null data in the departure time, delayed 15 minutes or more, and weather delay variables.  To know the percentage of flights per day of the week, or carrier, destination, etc. the prior probabilities need to be calculated below.

About 77.2973% of the training model didn’t have a delay, but 22.7027% did have a delay of 15 or greater minutes (from tdelay variable).  These values are close to those above summation (Figure 2). Thus the training data could be trusted, even though a random sampling wasn’t taken.  The reason for not taking a random sampling is to be able to predict into the future, given 60% of the data is already collected.

Comparing both sets of histograms (Figure 1 and Figure 3), the distribution of the first histogram is binomial.  However, the posterior distribution, the secondary histogram, is similarly shaped as a positively skewed distribution.  This was an expected result described by Smith (2015), which is why the author states that the prior distribution has an effect on the posterior distribution.

The Bayesian prediction results tend to produce a bunch false negatives, compared to the real data sets, thus indicating more type II error than type I error.  When looking at the code below, the probability of finding a result that is 0.5 or larger is 15.302%.

Code

#

## Locate the data, filter out the data, and pull it into R from the computer (R, n.d.b.)

#

setwd(“C:/Users/XXX/Documents/R/dataSets”)

#

##

### ———————————————————————————————————-

##        Dependent:   Departure Delay Indicator, 15 minutes or more (Dep_Del15)

##        Independent: Arrival airports of Newark-EWR, Kennedy-JFK, and LaGuardia-LGA (Origin)

##        Independent: Departure airports of Baltimore-BWI, Dulles-IAD, and Reagan-DCA (Dest)

##        Independent: Carriers (Carrier)

##        Independent: Hours of departure (Dep_Time)

##        Independent: Weather conditions (Weather_Delay)

##        Independent: Monday = 1, Tuesday = 2, …Sunday = 7 (Day_Of_Week)

### ———————————————————————————————————-

##  bayes theory => p(theta|y)= p(theta)*P(y|theta)/(SUM(P(theta)*P(y|theta))) (Cowles, Kass, & O’Hagan, 2009)

### ———————————————————————————————————-

##

#

## Create a data.frame

delay = data.frame(airplaneData)

## Factoring and labeling the variables (Taddy, n.d.)

delay\$DEP_TIME = factor(floor(delay\$DEP_TIME/100))

delay\$DAY_OF_WEEK = factor(delay\$DAY_OF_WEEK, labels = c(“M”, “T”, “W”, “R”, “F”, “S”, “U”))

delay\$DEP_DEL15 = factor(delay\$DEP_DEL15)

delay\$WEATHER_DELAY= factor(ifelse(delay\$WEATHER_DELAY>=1,1,0)) # (R, n.d.a.)

delay\$CARRIER = factor(delay\$CARRIER, levels = c(“AA”,”B6″,”DL”,”EV”,”UA”))

levels(delay\$CARRIER) = c(“American”, “JetBlue”, “Delta”, “ExpressJet”, “UnitedAir”)

## Quick understanding the data

delayed15 = as.numeric(levels(delay\$DEP_DEL15)[delay\$DEP_DEL15])

hist(delayed15, freq=F, main = “Histogram of Delays of 15 mins or longer”, xlab = “time >= 15 mins (1) or time < 15 (0)”)

summary(delay)

### Create the training and testing data (60/40%)

ntotal=length(delay\$DAY_OF_WEEK)    # Total number of datapoints assigned dynamically

ntrain = sample(1:ntotal,floor(ntotal*(0.6))) # Take values 1 – n*0.6

ntest = ntotal-floor(ntotal*(0.6))       # The number of test cases (40% of the data)

trainingData = cbind(delay\$DAY_OF_WEEK[ntrain], delay\$CARRIER[ntrain],delay\$ORIGIN[ntrain],delay\$DEST[ntrain],delay\$DEP_TIME[ntrain],delay\$WEATHER_DELAY[ntrain],delayed15[ntrain])

testingData  = cbind(delay\$DAY_OF_WEEK[-ntrain], delay\$CARRIER[-ntrain],delay\$ORIGIN[-ntrain],delay\$DEST[-ntrain],delay\$DEP_TIME[-ntrain],delay\$WEATHER_DELAY[-ntrain],delayed15[-ntrain])

## Partitioning the train data by half

trainFirst= trainingData[trainingData[,7]<0.5,]

trainSecond= trainingData[trainingData[,7]>0.5,]

### Prior probabilities = p(theta) (Cowles, Kass, & O’Hagan, 2009)

## Dependent variable: time delayed >= 15

tdelay=table(delayed15[ntrain])/sum(table(delayed15[ntrain]))

### Prior probabilities between the partitioned training data

## Independent variable: Day of the week (% flights occured in which day of the week)

tday1=table(trainFirst[,1])/sum(table(trainFirst[,1]))

tday2=table(trainSecond[,1])/sum(table(trainSecond[,1]))

## Independent variable: Carrier (% flights occured in which carrier)

tcarrier1=table(trainFirst[,2])/sum(table(trainFirst[,2]))

tcarrier2=table(trainSecond[,2])/sum(table(trainSecond[,2]))

## Independent variable: Origin (% flights occured in which originating airport)

tOrigin1=table(trainFirst[,3])/sum(table(trainFirst[,3]))

tOrigin2=table(trainSecond[,3])/sum(table(trainSecond[,3]))

## Independent variable: Destination (% flights occured in which destinateion airport)

tdest1=table(trainFirst[,4])/sum(table(trainFirst[,4]))

tdest2=table(trainSecond[,4])/sum(table(trainSecond[,4]))

## Independent variable: Department Time (% flights occured in which time of the day)

tTime1=table(trainFirst[,5])/sum(table(trainFirst[,5]))

tTime2=table(trainSecond[,5])/sum(table(trainSecond[,5]))

## Independent variable: Weather (% flights delayed because of adverse weather conditions)

twx1=table(trainFirst[,6])/sum(table(trainFirst[,6]))

twx2=table(trainSecond[,6])/sum(table(trainSecond[,6]))

### likelihoods = p(y|theta) (Cowles, Kass, & O’Hagan, 2009)

likelihood1=tday1[testingData[,1]]*tcarrier1[testingData[,2]]*tOrigin1[testingData[,3]]*tdest1[testingData[,4]]*tTime1[testingData[,5]]*twx1[testingData[,6]]

likelihood2=tday2[testingData[,1]]*tcarrier2[testingData[,2]]*tOrigin2[testingData[,3]]*tdest2[testingData[,4]]*tTime2[testingData[,5]]*twx2[testingData[,6]]

### Predictions using bayes theory = p(theta|y)= p(theta)*P(y|theta)/(SUM(P(theta)*P(y|theta))) (Cowles, Kass, & O’Hagan, 2009)

Bayes=(likelihood2*tdelay)/(likelihood2*tdelay+likelihood1*tdelay)

hist(Bayes, freq=F, main=”Bayesian Analysis of flight delay data”)

plot(delayed15[-ntrain]~Bayes, main=”Bayes results versus actual results for flights delayed >= 15 mins”, xlab=”Bayes Analysis Prediction of which cases will be delayed”, ylab=”Actual results from test data showing delayed cases”)

## The probability of 0.5 or larger

densityMeasure = table(delayed15[-ntrain],floor(Bayes+0.5))

probabilityOfXlarger=(densityMeasure[1,2]+densityMeasure[2,1])/ntest

probabilityOfXlarger

References