Adv Quant: Polynomial Regression in R


For this local polynomial regression, the “oldfaithful.csv” will be used from the open-source data. The eruption times (in minutes) and the waiting time to the next eruption (in minutes) of 272 eruptions are provided for the Old Faithful geyser.



Figure 1: Density Histograms for eruptions times and eruption waiting times.


Figure 2: Smoothed density histrogram from local polynomial regresion.


Figure 3: Intercomparisson of linear regression (blue), lowess regresion(red), and polynomial regression (green) on the eruption data.


Figure 4: Residual plots for both linear and polynomial regression.


The histogram plots (Figure 1) illustrate that both variables, eruption times and eruptions waiting time are both bimodal distributions.  Thus, a linear regression (Figure 3), would not capture the relationship between these two variables.  A polynomial smoothed version of the bimodal curve (Figure 2) show that for low values of the geysers magnitude, there is a low wait time for the next occurrence and vice versa.  The smoothed density curve shows the estimate values of the geyser’s variable distribution better than the bar histogram

LOCFIT (locally fitted regression) and LOWESS (locally weighted scatterplot smoothing regression) are assessed alongside the typical LM (linear regression).  LOCFIT is based on LOWESS, which allows the end user to specify the smoothing parameter and neighborhood size, but LOCFIT affords the end user more control over other the smoothing parameters (Futschik & Crompton, 2004).  Both LOCFIT and LOWESS are methods for regression that uses the nearest-neighbor-based model (Field, 2013; Futschik & Crompton, 2004; Loader, 2013; Smith, 2015).  This analysis will look at all three.

The goal is to see if there is a relationship between the waiting time to the next eruption to the magnitude of the eruption (per eruption time).  Through the linear regression algorithm, the linear model is eruptions = 0.075628 (waiting) – 1.874016.  The Pearson’s correlation coefficient is 0.9008112. Thus 81.14% of the variation could be explained by a linear regression model.  The lowess regression appears not to capture the distribution of data at smaller eruption times, but it is better than the linear regression model since its correlation is 0.9809684, and can explain 0.9622990 of the variation between the variables.

Finally, to evaluate the effectiveness of the linear model and the polynomial model, residuals must be assessed (Figure 4). Both of the residual plots don’t show any discernable pattern. However, the residuals are closer to zero in the polynomial regression, suggesting that it does a better job at explaining the variance between the eruption magnitude and the next eruption wait time.  In conclusion, the best regression for this data set appears to be the polynomial regression.



## Use R to analyze the faithful dataset.

## This is a version of the eruption data from the “Old Faithful” geyser in Yellowstone National Park, Wyoming.

##  •     X (primary key)

##  •     eruptions (eruption time [mins])

##  •     waiting (wait time for this eruptions [mins])


fateful = read.csv(file=””, header = TRUE, sep = “,”)


# Produce density histograms of eruption times and of waiting times.

hist(fateful$eruptions, freq=F, xlab = “eruptions time [mins]”,  main = “Histogram of the eruptions time”)

hist(fateful$waiting, freq=F, xlab = “eruptions waiting time [mins]”,  main = “Histogram of the eruptions waiting time”)

# Produce a smoothed density histogram from local polynomial regression.



plot(locfit(~lp(fateful$eruptions),data=fateful), xlab = “eruptions time [mins]”,  main = “Histogram of the eruptions time”)

plot(locfit(~lp(fateful$waiting),data=fateful), xlab = “eruptions waiting time [mins]”,  main = “Histogram of the eruptions waiting time”)

# Compare local polynomial regression to regular regression.

lowessRegression = lowess(fateful$waiting, faithful$eruptions, f=2/3)

polynomialRegression = locfit(fateful$eruptions~lp(fateful$waiting))

linearRegression = lm(fateful$eruptions~fateful$waiting)

# Graphing the data

plot(fateful$waiting, fateful$eruptions, main = “Eruption Times”, xlab=”eruption time [min]”, ylab = “Waiting time to next eruption [min]”)

lines(lowessRegression, col=”red”)

abline(linearRegression, col=”blue”)

lines(polynomialRegression, col=”green”)

# summary on the regressions


# correlations on the regressions


cor(lowessRegression$x, lowessRegression$y)

# Plotting residuals

plot(residuals(linearRegression), main = “residuals for the linear regression”, ylab = “residuals”)

plot(residuals(polynomialRegression), main = “residuals for the polynomial regression”, ylab=”residuals”)