Introduction

For this project, I’m going to run through an example of fitting a multiple linear regression model in R. We’re going to see if we can use this to accurately predict the price of Vintage Bordeaux wine each year based on a few measured variables for each vintage. To do this, we’ll use a variety of statistical and visual methods to evaluate the model, as well as perform a transformation to see if that improves the fit. If you’re unfamiliar with any of the tests or methods I use, a quick Google search will yield numerous great resources for each concept.

First, let’s take care of some housekeeping by setting the working directory and importing the csv file:

## Set Working Directory
  setwd("/Users/Chris/Documents/R/Linear Model Evaluation and Transformation in R")
    getwd()
## [1] "C:/Users/Chris/Documents/R/Linear Model Evaluation and Transformation in R"
## Import Data
    data <- read.csv("Bordeaux.csv", header = TRUE, sep = ",")
    data
##    Year  Temp Rain PrevRain Age Price
## 1  1952 17.12  160      600  31 0.368
## 2  1953 16.73   80      690  30 0.635
## 3  1955 17.15  130      502  28 0.446
## 4  1957 16.13  110      420  26 0.221
## 5  1958 16.42  187      582  25 0.180
## 6  1959 17.48  187      485  24 0.658
## 7  1960 16.42  290      763  23 0.139
## 8  1961 17.33   38      830  22 1.000
## 9  1962 16.30   52      697  21 0.331
## 10 1963 15.72  155      608  20 0.168
## 11 1964 17.27   96      402  19 0.306
## 12 1965 15.37  267      602  18 0.106
## 13 1966 16.53   86      819  17 0.473
## 14 1967 16.23  118      714  16 0.191
## 15 1968 16.20  292      610  15 0.105
## 16 1969 16.55  244      575  14 0.117
## 17 1970 16.67   89      622  13 0.404
## 18 1971 16.77  112      551  12 0.272
## 19 1972 14.98  158      536  11 0.101
## 20 1973 17.07  123      376  10 0.156
## 21 1974 16.30  184      574   9 0.111
## 22 1975 16.95  171      572   8 0.301
## 23 1976 17.65  247      418   7 0.253
## 24 1977 15.58   87      821   6 0.107
## 25 1978 15.82   51      763   5 0.270
## 26 1979 16.17  122      717   4 0.214
## 27 1980 16.00   74      578   3 0.136

Part 1: Construct a Regression Model that Describes the Price of the Vintage

Let’s create a multiple linear model of the price of Vintage Bordeaux as a function of its Age (in years, where 1983 = 0), Rainfall (mm during the harvest season), Previous Rainfall (mm in the six months before the harvest season), and Temperature (degrees Celsius during the growing season).

y <- data$Price
x1 <- data$Age
x2 <- data$Rain
x3 <- data$PrevRain
x4 <- data$Temp

# multiple linear regression
fit <- lm(Price ~ Age + Rain + PrevRain + Temp, data = data)

Part 2: Run a Regression Analysis on the Data

Now we need to run a summary of the multiple linear regression to find the quantiles, coefficients, R-squared value, adjusted R-squared value, F-statistic, and p-values. We’ll then run an ANOVA table to get the variance data.

summary(fit)
## 
## Call:
## lm(formula = Price ~ Age + Rain + PrevRain + Temp, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.14072 -0.08770 -0.01074  0.03410  0.26783 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.1716289  0.6928899  -4.577 0.000147 ***
## Age          0.0080519  0.0029410   2.738 0.012013 *  
## Rain        -0.0010351  0.0003314  -3.123 0.004947 ** 
## PrevRain     0.0005638  0.0001979   2.849 0.009338 ** 
## Temp         0.1903096  0.0390606   4.872 7.18e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1176 on 22 degrees of freedom
## Multiple R-squared:  0.7356, Adjusted R-squared:  0.6875 
## F-statistic:  15.3 on 4 and 22 DF,  p-value: 4.017e-06
summary(aov(fit))
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## Age          1 0.2354  0.2354  17.025 0.000443 ***
## Rain         1 0.2614  0.2614  18.905 0.000258 ***
## PrevRain     1 0.0212  0.0212   1.532 0.228900    
## Temp         1 0.3283  0.3283  23.738 7.18e-05 ***
## Residuals   22 0.3042  0.0138                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Note that the R-squared value is only 0.7356, meaning the model explains only 73.56% of the variance in price. This is not particularly high and should make us question the efficacy of the current linear model.
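
These values can also be pulled straight off the summary object rather than read from the printed table; a quick sketch:

# Extract the R-squared values directly from the summary object
s <- summary(fit)
s$r.squared      # multiple R-squared: 0.7356
s$adj.r.squared  # adjusted R-squared: 0.6875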

Let’s also check the four model assumptions to see if the model is appropriate. These four assumptions are linearity, independence, normality of residuals, and equal variance of residuals.

Part 3: Check the Model Assumptions

## 1. check for linearity
layout(matrix(c(1,2,3,4),2,2))

## run simple plots with each variable to check for linearity
plot(y ~ x1, main = "Price vs Age", xlab = "Age (years)", ylab = "Price")
plot(y ~ x2, main = "Price vs Rainfall", xlab = "Rainfall (mm)", ylab = "Price")
plot(y ~ x3, main = "Price vs Previous Rain", xlab = "Previous Rainfall (mm)", ylab = "Price")
plot(y ~ x4, main = "Price vs Temperature", xlab = "Temperature (degrees C)", ylab = "Price")

## Residual Plots e_i vs x_i
par(mfrow = c(2,2))
plot(resid(fit) ~ x1, main = "Residuals vs Age",
     xlab = "Age (years)", ylab = "Residuals")
abline(h = 0)
plot(resid(fit) ~ x2, main = "Residuals vs Rainfall",
     xlab = "Rainfall (mm)", ylab = "Residuals")
abline(h = 0)
plot(resid(fit) ~ x3, main = "Residuals vs Previous Rain",
     xlab = "Previous Rainfall (mm)", ylab = "Residuals")
abline(h = 0)
plot(resid(fit) ~ x4, main = "Residuals vs Temperature",
     xlab = "Temperature (degrees C)", ylab = "Residuals")
abline(h = 0)

We see little evidence of linearity. There is no obvious relationship between any of the possible explanatory variables and the response variable (price) in any of the scatterplots. Furthermore, the patterns in the residual plots suggest that a linear relationship is unlikely. We want to see a random scatter in the residual plots, which none of them show (the residuals vs age plot comes closest, but it’s still clearly not random).

Next, we can compute the lag-1 autocorrelation of the residuals to see if independence of the errors is reasonable:

## 2. Check for Independence

## Lag 1 Autocorrelation
n <- length(y)
acf(resid(fit), lag.max = 1, plot = FALSE)
## 
## Autocorrelations of series 'resid(fit)', by lag
## 
##      0      1 
##  1.000 -0.588
2 / sqrt(n)
## [1] 0.3849002

Since |-0.588| = 0.588 > 0.3849, independence of the errors is not a reasonable assumption. This violates the second model assumption.
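
For a more formal check, we can run a Durbin-Watson test on the residuals. Here’s a minimal sketch, assuming the lmtest package (not part of base R) is installed; the two-sided alternative matters here because our lag-1 autocorrelation is negative:

# Durbin-Watson test for autocorrelated errors (lmtest package)
library(lmtest)
dwtest(fit, alternative = "two.sided")  # a small p-value suggests dependence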

Next, we’ll check the normality of residuals with a q-q plot:

## 3. Check for normality of Residuals

# q-q plot
z <- (fit$residuals - mean(fit$residuals)) / sd(fit$residuals)
qqnorm(z, sub = "Residuals")
abline(a = 0, b = 1)

The q-q plot does not appear to be linear, so we can’t say the errors are normally distributed (in fact, it’s very likely that they aren’t).
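
For a formal test to pair with the plot, we can apply the same Shapiro-Wilk test we’ll use later on the transformed model:

# Shapiro-Wilk test on the original model's residuals
shapiro.test(resid(fit))  # a small p-value is evidence against normality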

Finally, we’ll check the equal variance of residuals with a plot of the residuals against fitted values:

# residual plots
layout(matrix(c(1,2,3,4),2,2))
plot(fit)

We see a pattern in the plot for the residuals vs fitted value, so it’s not likely that the errors have equal variances.
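
As a formal complement to the plot, we could run a Breusch-Pagan test for non-constant variance; a minimal sketch, again assuming the lmtest package:

# Breusch-Pagan test for non-constant error variance (lmtest package)
library(lmtest)
bptest(fit)  # a small p-value suggests the error variance is not constant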

So, we have no evidence that our current model adheres to any of the four basic model assumptions. It appears the current model is not adequate.

Part 4: Transformation

To try to resolve these failures, we’ll apply a transformation to y. We can first run a Box-Cox power transformation to find the most appropriate transformation.

## Load additional library
    library(MASS)

# Plot of Log-likelihood function vs lambda
    boxcox(y ~ x1 + x2 + x3 + x4,
           plotit = TRUE, lambda = seq(-2, 2, length = 10))

    ## The "best" estimate for lambda is where the curve is at its
    ## maximum (lambda approximately 0). The dashed lines on the plot
    ## mark an approximate 95% confidence interval for lambda,
    ## roughly -1.3 to 1.3.


## List of values of lambda and corresponding Log-likelihood function
    boxcox(y ~ x1 + x2 + x3 + x4,
           plotit = FALSE, lambda = seq(0, 0.5, length = 10))
## $x
##  [1] 0.00000000 0.05555556 0.11111111 0.16666667 0.22222222 0.27777778
##  [7] 0.33333333 0.38888889 0.44444444 0.50000000
## 
## $y
##  [1]  -7.938316  -8.287859  -8.701824  -9.180354  -9.723166 -10.329557
##  [7] -10.998429 -11.728321 -12.517450 -13.363766
    lambda <- 0


## Box-Cox power transformation
    trans.y <- log(y)
    trans.y ## transformed y values
##  [1] -0.9996723 -0.4541303 -0.8074363 -1.5095926 -1.7147984 -0.4185503
##  [7] -1.9732813  0.0000000 -1.1056369 -1.7837913 -1.1841702 -2.2443162
## [13] -0.7486599 -1.6554819 -2.2537949 -2.1455813 -0.9063404 -1.3019532
## [19] -2.2926348 -1.8578993 -2.1982251 -1.2006450 -1.3743658 -2.2349264
## [25] -1.3093333 -1.5417793 -1.9951004
    y   ## original dataset y values
##  [1] 0.368 0.635 0.446 0.221 0.180 0.658 0.139 1.000 0.331 0.168 0.306
## [12] 0.106 0.473 0.191 0.105 0.117 0.404 0.272 0.101 0.156 0.111 0.301
## [23] 0.253 0.107 0.270 0.214 0.136
## Linear regression of the transformed data
    trans.fit <- lm(trans.y ~ Age + Rain + PrevRain + Temp, data = data)

We pick the value of lambda with the largest log-likelihood (y-value). The largest is -7.938316, which corresponds to lambda = 0. This implies the best transformation for y is log(y).
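
We can also recover this maximizer programmatically instead of scanning the list, since boxcox() returns the grid of lambda values ($x) and their log-likelihoods ($y); a quick sketch over a finer grid:

# Find the lambda with the highest log-likelihood over a finer grid
bc <- boxcox(y ~ x1 + x2 + x3 + x4, plotit = FALSE,
             lambda = seq(-2, 2, length = 100))
bc$x[which.max(bc$y)]  # should land near 0, i.e. a log transformation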

Now we have an updated model. We’ll apply the previous tests again to see if the results improve.

Part 5: Re-run Regression Analysis

First, let’s look again at the simple plots for the new model to compare:

## 1. check for linearity
layout(matrix(c(1,2,3,4),2,2))

## run simple plots with each variable to check for linearity
plot(trans.y ~ x1, main = "log(Price) vs Age", xlab = "Age (years)", ylab = "log(Price)")
plot(trans.y ~ x2, main = "log(Price) vs Rainfall", xlab = "Rainfall (mm)", ylab = "log(Price)")
plot(trans.y ~ x3, main = "log(Price) vs Previous Rain",
     xlab = "Previous Rainfall (mm)", ylab = "log(Price)")
plot(trans.y ~ x4, main = "log(Price) vs Temperature",
     xlab = "Temperature (degrees C)", ylab = "log(Price)")

## Residual Plots e_i vs x_i
par(mfrow = c(2,2))
plot(resid(trans.fit) ~ x1, main = "Transformed Model Residuals vs Age",
     xlab = "Age (years)", ylab = "Residuals")
abline(h = 0)
plot(resid(trans.fit) ~ x2, main = "Transformed Model Residuals vs Rainfall",
     xlab = "Rainfall (mm)", ylab = "Residuals")
abline(h = 0)
plot(resid(trans.fit) ~ x3, main = "Transformed Model Residuals vs Previous Rain",
     xlab = "Previous Rainfall (mm)", ylab = "Residuals")
abline(h = 0)
plot(resid(trans.fit) ~ x4, main = "Transformed Model Residuals vs Temperature",
     xlab = "Temperature (degrees C)", ylab = "Residuals")
abline(h = 0)

We see that the residual plots for age and rainfall now appear to be without any pattern. The plots for previous rainfall and temperature (especially temperature) still show some structure, so those relationships may not be linear.

We’ll also take another look at the summary and ANOVA tables to get some quantitative information:

summary(trans.fit)
## 
## Call:
## lm(formula = trans.y ~ Age + Rain + PrevRain + Temp, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.45748 -0.23902  0.01067  0.18533  0.53642 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.216e+01  1.686e+00  -7.213 3.15e-07 ***
## Age          2.390e-02  7.155e-03   3.341  0.00296 ** 
## Rain        -3.866e-03  8.062e-04  -4.795 8.66e-05 ***
## PrevRain     1.171e-03  4.814e-04   2.432  0.02359 *  
## Temp         6.170e-01  9.502e-02   6.493 1.57e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2861 on 22 degrees of freedom
## Multiple R-squared:  0.8282, Adjusted R-squared:  0.797 
## F-statistic: 26.51 on 4 and 22 DF,  p-value: 3.89e-08
summary(aov(trans.fit))
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## Age          1  2.224   2.224  27.178 3.15e-05 ***
## Rain         1  3.002   3.002  36.682 4.27e-06 ***
## PrevRain     1  0.003   0.003   0.038    0.847    
## Temp         1  3.450   3.450  42.160 1.57e-06 ***
## Residuals   22  1.800   0.082                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Note the increase in the R-squared value from 0.7356 to 0.8282, a substantial improvement. It’s also worth noting that, based on the p-values in the coefficients table, all four explanatory variables are significant.

We’ll compute the lag-1 autocorrelation again to see if independence of the errors is reasonable for the new model:

## 2. Check for Independence

## Lag 1 Autocorrelation
n <- length(trans.y)
acf(resid(trans.fit), lag.max = 1, plot = FALSE)
## 
## Autocorrelations of series 'resid(trans.fit)', by lag
## 
##      0      1 
##  1.000 -0.417
2 / sqrt(n)
## [1] 0.3849002

Since |-0.417| = 0.417 > 0.3849, independence of the errors is still not necessarily reasonable, though the autocorrelation has decreased from the original model’s 0.588.
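
As with the original model, the Durbin-Watson test (again assuming lmtest) offers a formal version of this check:

# Durbin-Watson test on the transformed model
library(lmtest)
dwtest(trans.fit, alternative = "two.sided")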

It’s time to check the normality of residuals assumption for the new model with a q-q plot:

## 3. Check for normality of Residuals

# q-q plot
z <- (trans.fit$residuals - mean(trans.fit$residuals)) / sd(trans.fit$residuals)
qqnorm(z, sub = "Residuals (transformed)")
abline(a = 0, b = 1)

The q-q plot isn’t quite linear, so the results aren’t definitive. We can run a Shapiro-Wilk test as a more formal check:

## Shapiro-Wilk
shapiro.test(trans.fit$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  trans.fit$residuals
## W = 0.95865, p-value = 0.3441

With a p-value of 0.3441, we cannot reject the null hypothesis that the residuals are normally distributed. Still, we’d be wise to doubt that this is an optimal model, as we see a clear pattern in the q-q plot points.

Finally, let’s check for the equal variance of residuals with a plot of the residuals against the fitted values:

# residual plots
layout(matrix(c(1,2,3,4),2,2))
plot(trans.fit)

The residual plot appears to have no real pattern. It’s likely that the errors of this model have equal variances.
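
The Breusch-Pagan test from earlier (again assuming lmtest) can back up this visual impression:

# Breusch-Pagan test on the transformed model
library(lmtest)
bptest(trans.fit)  # a large p-value is consistent with equal variances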

Overall, we see the new model is not perfect. However, it’s definitely an improvement.

Part 6: Analysis

As previously mentioned, the summary table for the log(y)-transformed model showed that all four variables are significant, with the highest p-value being 0.024:

summary(trans.fit)
## 
## Call:
## lm(formula = trans.y ~ Age + Rain + PrevRain + Temp, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.45748 -0.23902  0.01067  0.18533  0.53642 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.216e+01  1.686e+00  -7.213 3.15e-07 ***
## Age          2.390e-02  7.155e-03   3.341  0.00296 ** 
## Rain        -3.866e-03  8.062e-04  -4.795 8.66e-05 ***
## PrevRain     1.171e-03  4.814e-04   2.432  0.02359 *  
## Temp         6.170e-01  9.502e-02   6.493 1.57e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2861 on 22 degrees of freedom
## Multiple R-squared:  0.8282, Adjusted R-squared:  0.797 
## F-statistic: 26.51 on 4 and 22 DF,  p-value: 3.89e-08

The Estimate column of the summary shows that Age, Previous Rain, and Temperature are positively associated with price, while Rain is negatively associated with price. Most of these coefficients look tiny, but that’s largely a matter of units: a change of one unit (one year of age or one mm of rain) has only a small effect on log(price) when all variables are taken into consideration, while Temperature, measured per degree, has the largest per-unit effect.
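
Because the response is log(Price), the coefficients are easier to interpret after exponentiating: a one-unit increase in a variable multiplies the predicted price by exp(coefficient). For example, exp(0.617) ≈ 1.85 for Temperature, so each additional degree Celsius corresponds to a roughly 85% higher predicted price, holding the other variables fixed:

# Multiplicative effect on Price of a one-unit change in each variable
exp(coef(trans.fit))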

All four of the basic model assumptions are left with somewhat ambiguous results. This, combined with an R-squared value of 0.8282, which still leaves over 17% of the variance in price unexplained, ought to make us question the adequacy of a linear model in the first place. It’s likely that some other model would more effectively capture the effects of these variables on the price of Vintage Bordeaux. However, if a linear model is required, the log(Price)-transformed model is almost certainly the best.
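
As a closing sketch of how we’d use this model for the prediction task from the introduction: predict() returns values on the log scale, so they need to be exponentiated to get back to the price scale (strictly, the back-transformed value estimates the median price rather than the mean, and the vintage below is made up purely for illustration):

# Predict the price of a hypothetical vintage (illustrative values only)
new_vintage <- data.frame(Age = 20, Rain = 100, PrevRain = 600, Temp = 17)
exp(predict(trans.fit, newdata = new_vintage))  # back to the Price scale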