Understanding and interpreting Residuals Plot for linear regression - Amir Masoud Sefidian - Sefidian Academy (2024)


Interpreting Residual Plots to Improve Your Regression

When you run a regression, calculating and plotting residuals helps you understand and improve your regression model. In this post, we describe the fitted vs. residuals plot, which allows us to detect several types of violations of the linear regression assumptions. You may also be interested in Q-Q plots, scale-location plots, or the residuals vs. leverage plot.

Here, one plots the fitted values on the x-axis and the residuals on the y-axis. Intuitively, this asks: for different fitted values, does the quality of our fit change? In this post, we'll describe what we can learn from a residuals vs. fitted plot, and then make the plot for several R datasets and analyze them. The fitted vs. residuals plot is mainly useful for investigating:

  1. Whether linearity holds. This is indicated by the mean residual value for every fitted value region being close to 0. In R, this is indicated by the red line being close to the dashed line.
  2. Whether homoskedasticity holds. The spread of residuals should be approximately the same across the x-axis.
  3. Whether there are outliers. This is indicated by some 'extreme' residuals that are far from the rest.

Synthetic Example: Quadratic

To illustrate how violations of linearity (1) affect this plot, we create an extreme synthetic example in R.

x = 1:20
y = x^2
plot(lm(y ~ x))  # the first diagnostic plot shown is residuals vs. fitted
[Figure: residuals vs. fitted plot for the quadratic example, showing a clear quadratic pattern]

So a quadratic relationship between x and y leads to an approximately quadratic relationship between fitted values and residuals. Why is this? Firstly, the fitted model is

ŷ = b0 + b1·x

which gives us that x = (ŷ − b0) / b1. We then have

(1) e = y − ŷ = x^2 − b0 − b1·x = ((ŷ − b0) / b1)^2 − ŷ

which is itself a 2nd-order polynomial function of ŷ. More generally, if the relationship between x and y is non-linear, the residuals will be a non-linear function of the fitted values. This idea generalizes to higher dimensions (a function of several covariates instead of a single x).
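The same derivation can be checked numerically. Here is a minimal Python (numpy) sketch of the R example above; it confirms that when we fit a line to y = x^2, the residuals are exactly a 2nd-order polynomial in the fitted values:

```python
import numpy as np

# Reproduce the synthetic example: y = x^2 fit with a straight line.
x = np.arange(1, 21, dtype=float)
y = x ** 2

# Least-squares line: np.polyfit returns [b1, b0] (highest degree first).
b1, b0 = np.polyfit(x, y, 1)
fitted = b0 + b1 * x
resid = y - fitted

# Fitting a quadratic of resid against fitted leaves essentially nothing over,
# so the residuals really are a 2nd-order polynomial in the fitted values.
quad = np.polyfit(fitted, resid, 2)
leftover = resid - np.polyval(quad, fitted)
print(np.max(np.abs(leftover)))  # ~0, up to floating-point error
```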

The Cars Dataset

We now look at the same plot for the cars dataset from R. We regress stopping distance on speed.

plot(lm(dist ~ speed, data = cars))
[Figure: residuals vs. fitted plot for the cars regression]

Here we see that linearity seems to hold reasonably well, as the red line is close to the dashed line. We can also note the heteroskedasticity: as we move to the right on the x-axis, the spread of the residuals seems to be increasing. Finally, points 23, 35, and 49 may be outliers, with large residual values. Let's look at another dataset.

Boston Housing

Let's try fitting a linear model to the Boston housing price dataset. We regress the median value on crime, the average number of rooms, tax, and the percent of the population of lower status.

library(mlbench)
data(BostonHousing)
plot(lm(medv ~ crim + rm + tax + lstat, data = BostonHousing))
[Figure: residuals vs. fitted plot for the Boston Housing regression, showing a curved pattern]

Here we see that linearity is violated: there seems to be a quadratic relationship. Whether the errors are homoskedastic or not is less obvious: we will need to investigate more plots. There are several outliers, with residuals close to 30.

You may want to check out Q-Q plots, scale-location plots, or the residuals vs. leverage plot.

Observations, Predictions, and Residuals

To demonstrate how to interpret residuals, we'll use a lemonade stand data set, where each row is one day's "Temperature" and "Revenue."

| Temperature (Celsius) | Revenue |
|---|---|
| 28.2 | $44 |
| 21.4 | $23 |
| 32.9 | $43 |
| 24.0 | $30 |
| etc. | etc. |

The regression equation describing the relationship between “Temperature” and “Revenue” is:

Revenue = 2.7 * Temperature − 35

Let's say one day at the lemonade stand it was 30.7 degrees and "Revenue" was $50. That 50 is your observed or actual output, the value that actually happened.

So if we insert 30.7 as our value for "Temperature"…

Revenue = 2.7 * 30.7 − 35
Revenue ≈ 48

we get $48. That's the predicted value for that day, also known as the value for "Revenue" that the regression equation would have predicted based on the "Temperature." Your model isn't always perfectly right, of course. In this case, the prediction is off by 2; that difference, the 2, is called the residual. The residual is the bit that's left when you subtract the predicted value from the observed value.

Residual = Observed – Predicted

You can imagine that every row of data now has, in addition, a predicted value and a residual.

| Temperature (Celsius) | Revenue (Observed) | Revenue (Predicted) | Residual (Observed − Predicted) |
|---|---|---|---|
| 28.2 | $44 | $41 | $3 |
| 21.4 | $23 | $23 | $0 |
| 32.9 | $43 | $54 | −$11 |
| 24.0 | $30 | $29 | $1 |
| etc. | etc. | etc. | etc. |
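The arithmetic behind those two new columns is easy to sketch. Here's an illustrative Python snippet applying the equation Revenue = 2.7 * Temperature − 35 to the observed rows above (rounding to whole dollars may differ from the table by a dollar here and there):

```python
# The lemonade-stand equation from above, applied to each observed day.
temps    = [28.2, 21.4, 32.9, 24.0]
observed = [44, 23, 43, 30]

# Predicted = 2.7 * Temperature - 35, rounded to whole dollars;
# Residual = Observed - Predicted.
predicted = [round(2.7 * t - 35) for t in temps]
residuals = [obs - pred for obs, pred in zip(observed, predicted)]

print(predicted)   # first row: 2.7 * 28.2 - 35 = 41.14 -> $41
print(residuals)   # first row: 44 - 41 = $3
```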

We're going to use the observed, predicted, and residual values to assess and improve the model.

Understanding Accuracy with Observed vs. Predicted

In a simple model like this, with only two variables, you can get a sense of how accurate the model is just by relating "Temperature" to "Revenue." Here's the same regression run on two different lemonade stands, one where the model is very accurate, and one where it is not:

[Figure: "Temperature" vs. "Revenue" scatter plots for two lemonade stands, one with an accurate model and one without]

It's clear that for both lemonade stands, a higher "Temperature" is associated with higher "Revenue." But at a given "Temperature," you could forecast the "Revenue" of the left lemonade stand much more accurately than the right lemonade stand, which means the model is much more accurate.

But most models have more than one explanatory variable, and it's not practical to represent more variables in a chart like that. So instead, let's plot the predicted values versus the observed values for these same data sets.

[Figure: predicted vs. observed "Revenue" for the same two lemonade stands]

Again, the model for the chart on the left is very accurate; there's a strong correlation between the model's predictions and its actual results. The model for the chart on the right is the opposite; its predictions aren't very good at all.

Note that these charts look just like the "Temperature" vs. "Revenue" charts above them, but the x-axis is predicted "Revenue" instead of "Temperature." That's common when your regression equation has only one explanatory variable. More often, though, you'll have multiple explanatory variables, and these charts will look quite different from a plot of any one explanatory variable vs. "Revenue."

Examining Predicted vs. Residual ("The Residual Plot")

The most useful way to plot the residuals, though, is with your predicted values on the x-axis and your residuals on the y-axis.

[Figure: predicted "Revenue" vs. residuals for the same two lemonade stands]

In the plot on the right, each point is one day, where the prediction made by the model is on the x-axis and the accuracy of the prediction is on the y-axis. The distance from the line at 0 is how bad the prediction was for that value.

Since…

Residual = Observed – Predicted

…positive values for the residual (on the y-axis) mean the prediction was too low, and negative values mean the prediction was too high; 0 means the guess was exactly correct. Ideally, your plot of the residuals looks like one of these:

[Figure: examples of healthy residual plots]

That is,
(1) they’re pretty symmetrically distributed, tending to cluster towards the middle of the plot.
(2) they’re clustered around the lower single digits of the y-axis (e.g., 0.5 or 1.5, not 30 or 150).
(3) in general, there aren’t any clear patterns.
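The first property, residuals centering on 0, isn't an accident: whenever a least-squares fit includes an intercept, the residuals average out to exactly 0. A small Python (numpy) sketch with made-up lemonade data confirms it:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up lemonade data: revenue roughly linear in temperature, plus noise.
temp = rng.uniform(15, 35, size=200)
revenue = 2.7 * temp - 35 + rng.normal(0, 3, size=200)

# Least-squares fit with an intercept.
slope, intercept = np.polyfit(temp, revenue, 1)
residuals = revenue - (intercept + slope * temp)

print(residuals.mean())  # ~0: with an intercept, residuals always average to 0
```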

Here are some residual plots that don’t meet those requirements:

[Figure: examples of unhealthy residual plots]

These plots aren't evenly distributed vertically, or they have an outlier, or they have a clear shape to them. If you can detect a clear pattern or trend in your residuals, then your model has room for improvement. In a second we'll break down why, and what to do about it.

NORMAL Q-Q RESIDUAL PLOT:

This chart displays the standardized residuals on the y-axis and the theoretical quantiles on the x-axis.

[Figure: normal Q-Q plot of standardized residuals]

Data that align closely with the dotted line indicate a normal distribution. If the points skew drastically from the line, you could consider adjusting your model by adding or removing variables in the regression model.
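If you want the numbers behind such a chart rather than R's built-in plot, scipy can compute the theoretical quantiles. This is one way to do it in Python, shown here on simulated residuals:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
residuals = rng.normal(0, 1, size=300)  # simulated, roughly normal residuals

# probplot pairs each ordered residual with the normal quantile it "should" have.
(theoretical_q, ordered_resid), (slope, intercept, r) = stats.probplot(residuals, dist="norm")

# r is the correlation of the points with the straight line;
# close to 1 means the residuals look normally distributed.
print(r)
```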

How much does it matter if my model isn’t perfect?

How concerned should you be if your model isn't perfect and your residuals look a bit unhealthy? It's up to you.

If you’re publishing your thesis in particle physics, you probably want to make sure your model is as accurate as humanly possible. If you’re trying to run a quick and dirty analysis of your nephew’s lemonade stand, a less-than-perfect model might be good enough to answer whatever questions you have (e.g., whether “Temperature” appears to affect “Revenue”).

Most of the time a decent model is better than none at all. So take your model, try to improve it, and then decide whether the accuracy is good enough to be useful for your purposes.

Example Residual Plots and Their Diagnoses

If you're not sure what a residual is, take five minutes to read the above, then come back here. Below is a gallery of unhealthy residual plots. Your residual plot may look like one specific type from below, or some combination. Throughout, we'll use a lemonade stand's "Revenue" vs. that day's "Temperature" as an example data set.

Y-AXIS UNBALANCED

[Figure: residual plot with an unbalanced y-axis]

PROBLEM

Imagine that for whatever reason, your lemonade stand typically has low revenue, but every once in a while you get very high-revenue days, such that "Revenue" looked like this:

[Figure: right-skewed histogram of "Revenue"]

…instead of something more symmetrical and bell-shaped like this:

[Figure: symmetrical, bell-shaped histogram of "Revenue"]

So “Temperature” vs. “Revenue” might look like this, with most of the data bunched at the bottom…

[Figure: "Temperature" vs. "Revenue" scatter plot with most of the data bunched at the bottom]

The black line represents the model equation, the model’s prediction of the relationship between “Temperature” and “Revenue.” Look above at each prediction made by the black line for a given “Temperature” (e.g., at “Temperature” 30, “Revenue” is predicted to be about 20). You can see that the majority of dots are below the line (that is, the prediction was too high), but a few dots are very far above the line (that is, the prediction was far too low).

Translating that same data to the diagnostic plots, most of the equation’s predictions are a bit too high, and then some would be way too low.

[Figure: corresponding diagnostic plots, with residuals unbalanced on the y-axis]

IMPLICATIONS

This almost always means your model can be made significantly more accurate. Most of the time you’ll find that the model was directionally correct but pretty inaccurate relative to an improved version. It’s not uncommon to fix an issue like this and consequently see the model’s r-squared jump from 0.2 to 0.5 (on a 0 to 1 scale).

HOW TO FIX

  • The solution to this is almost always to transform your data, typically your response variable.
  • It's also possible that your model lacks a variable.

HETEROSCEDASTICITY

[Figure: heteroscedastic residual plot, with spread growing to the right]

PROBLEM

These plots exhibit "heteroscedasticity," meaning that the residuals get larger as the prediction moves from small to large (or from large to small). Imagine that on cold days, the amount of revenue is very consistent, but on hotter days, sometimes revenue is very high and sometimes it's very low. You'd see plots like these:

[Figure: "Temperature" vs. "Revenue" and residual plots showing spread that grows with the prediction]
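One way to see, and roughly quantify, this pattern is to simulate it. In this illustrative Python sketch the noise grows with temperature, so the residual spread on the top half of the predictions is several times larger than on the bottom half:

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated lemonade data where the noise grows with temperature.
temp = rng.uniform(10, 40, size=1000)
revenue = 2.7 * temp - 35 + rng.normal(0, 0.3 * (temp - 9), size=1000)

slope, intercept = np.polyfit(temp, revenue, 1)
fitted = intercept + slope * temp
resid = revenue - fitted

# Compare residual spread for the smallest vs. largest predictions.
low, high = fitted < np.median(fitted), fitted >= np.median(fitted)
print(resid[high].std() / resid[low].std())  # noticeably > 1: heteroscedastic
```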

IMPLICATIONS

This doesn't inherently create a problem, but it's often an indicator that your model can be improved. The only exception is that if your sample size is less than 250 and you can't fix the issue using the fixes below, your p-values may be a bit higher or lower than they should be, so a variable that is right on the border of significance may end up erroneously on the wrong side of that border. Your regression coefficients (the number of units "Revenue" changes when "Temperature" goes up one) will still be accurate, though.

HOW TO FIX

  • The most frequently successful solution is to transform a variable.
  • Often heteroscedasticity indicates that a variable is missing.

NONLINEARITY

[Figure: residual plot with a clear nonlinear (curved) pattern]

PROBLEM

Imagine that it’s hard to sell lemonade on cold days, easy to sell it on warm days, and hard to sell it on very hot days (maybe because no one leaves their house on very hot days). That plot would look like this:

[Figure: "Temperature" vs. "Revenue" scatter with an inverted-U shape and a poorly fitting straight line]

The model, represented by the line, is terrible. The predictions would be way off, meaning your model doesn't accurately represent the relationship between "Temperature" and "Revenue." Accordingly, the residuals would look like this:

[Figure: residual plot for the inverted-U data, showing a strong curved pattern]

IMPLICATIONS

If your model is way off, as in the example above, your predictions will be pretty worthless (and you’ll notice a very low r-squared, like the 0.027 r-squared for the above). Other times a slightly suboptimal fit will still give you a good general sense of the relationship, even if it’s not perfect, like the below:

[Figure: a slightly curved relationship with a straight-line fit that is close but not perfect]

That model looks pretty accurate. If you look closely (or if you look at the residuals), you can tell that there’s a bit of a pattern here – that the dots are on a curve that the line doesn’t quite match.

[Figure: residual plot revealing the slight curve the straight line misses]

Does that matter? It's up to you. If you're after a quick understanding of the relationship, your straight line is a pretty decent approximation. If you're going to use this model for prediction rather than explanation, the most accurate possible model would probably account for that curve.

HOW TO FIX

  • Sometimes patterns like this indicate that a variable needs to be transformed.
  • If the pattern is as clear as in these examples, you probably need to create a nonlinear model (it's not as hard as that sounds).
  • Or, as always, it's possible that the issue is a missing variable.

OUTLIERS

[Figure: residual plot with an obvious outlier]

PROBLEM

What if one of your data points had a “Temperature” of 80 instead of the normal 20s and 30s? Your plots would look like this:

[Figure: scatter and residual plots with one point at "Temperature" 80, far from the rest]

This regression has an outlying data point on an input variable, "Temperature" (outliers on an input variable are also known as "leverage points"). What if one of your data points had $160 in revenue instead of the normal $20 – $60? Your plots would look like this:

[Figure: scatter and residual plots with one day at $160 revenue, far above the rest]

This regression has an outlying data point on an output variable, "Revenue."

IMPLICATIONS

There are some types of regressions that generally aren't affected by output outliers (like the day with $160 revenue), but they are affected by input outliers (like a "Temperature" in the 80s). In the worst case, your model can pivot to try to get closer to that point at the expense of being close to all the others, and end up being just entirely wrong, like this:

[Figure: two fitted lines, one pulled toward the outlier at "Temperature" 80 and one ignoring it]

The blue line is probably what you’d want your model to look like, and the red line is the model you might see if you have that outlier out at “Temperature” 80.

HOW TO FIX

  • It's possible that this is a measurement or data entry error, where the outlier is just wrong, in which case you should delete it.
  • It's possible that what appears to be just a couple of outliers is in fact a power distribution. Consider transforming the variable if one of your variables has an asymmetric distribution (that is, it's not remotely bell-shaped).
  • If it is indeed a legitimate outlier, you should assess the impact of the outlier.

LARGE Y-AXIS DATAPOINTS

[Figure: residual plot with many large-magnitude residuals on both sides of 0]

PROBLEM

Imagine that there are two competing lemonade stands nearby. Most of the time only one is operational, in which case your revenue is consistently good. Sometimes neither is active and revenue soars; at other times, both are active and revenue plummets. “Revenue” vs. “Temperature” might look like this:

[Figure: "Temperature" vs. "Revenue" scatter split into three horizontal bands]

with the top row being days when no other stand shows up and the bottom row being days when both other stands are in business.

That’d result in these residual plots:

[Figure: residual plots with many residuals of 10 or more on both sides of 0]

That is, there are quite a few data points on both sides of 0 that have residuals of 10 or higher, which is to say that the model was way off. Now if you'd collected data every day for a variable called "Number of active lemonade stands," you could add that variable to your model and this problem would be fixed. But often you don't have the data you need (or even a guess as to what kind of variable you need).

IMPLICATIONS

Your model isn't worthless, but it's definitely not as good as if you had all the variables you needed. You could still use it, and you might say something like, "This model is pretty accurate most of the time, but then every once in a while it's way off." Is that useful? Probably, but that's your decision and it depends on what decisions you're trying to make based on your model.

HOW TO FIX

  • Even though this approach wouldn't work in the specific example above, it's almost always worth looking around to see if there's an opportunity to usefully transform a variable.
  • If that doesn't work, though, you probably need to deal with your missing variable problem.

X-AXIS UNBALANCED

[Figure: residual plot with points bunched to one side of the x-axis]

PROBLEM

Imagine that "Revenue" is driven by nearby "Foot traffic," in addition to or instead of just "Temperature." Imagine that, for whatever reason, your lemonade stand typically has low revenue, but every once in a while you get extremely high-revenue days, such that your revenue looked like this:

[Figure: right-skewed histogram of "Revenue"]

instead of something more symmetrical and bell-shaped like this:

[Figure: symmetrical, bell-shaped histogram of "Revenue"]

So "Foot traffic" vs. "Revenue" might look like this, with most of the data bunched on the left side:

[Figure: "Foot traffic" vs. "Revenue" scatter with most data bunched on the left]

The black line represents the model equation, the model’s prediction of the relationship between “Foot traffic” and “Revenue.” You can see that the model can’t really tell the difference between “Foot traffic” of 0 and of, say, 100 or 1,000; for each of those values, it would predict revenue near $53. Translating that same data to the diagnostic plots:

[Figure: corresponding diagnostic plots, unbalanced along the x-axis]

IMPLICATIONS

Sometimes there’s actually nothing wrong with your model. In the above example, it’s quite clear that this isn’t a good model, but sometimes the residual plot is unbalanced and the model is quite good. The only ways to tell are to

a) experiment with transforming your data and see if you can improve it

b) look at the predicted vs. actual plot and see if your prediction is wildly off for a lot of data points, as in the above example (but unlike the below example).

[Figure: predicted vs. actual plot for a model that is accurate despite an unbalanced residual plot]

While there's no explicit rule that says your residuals can't be unbalanced and still be accurate (indeed this model is quite accurate), it's more often the case that an x-axis-unbalanced residual plot means your model can be made significantly more accurate. Most of the time you'll find that the model was directionally correct but pretty inaccurate relative to an improved version. It's not uncommon to fix an issue like this and consequently see the model's r-squared jump from 0.2 to 0.5 (on a 0 to 1 scale).

HOW TO FIX

  • The solution to this is almost always to transform your data, typically an explanatory variable. (Note that the example shown below transforms the response variable, but the same process is helpful here.)
  • It's also possible that your model lacks a variable.

Improving Your Model: Assessing the Impact of an Outlier

Let’s assume that you have an outlying datapoint that is legitimate, not a measurement or data error. To decide how to move forward, you should assess the impact of the datapoint on the regression. The easiest way to do this is to note the coefficients of your current model, then filter out that data point from the regression. If the model doesn’t change much, then you don’t have much to worry about.

If that changes the model significantly, examine the model (particularly actual vs. predicted), and decide which one feels better to you. It’s okay to ultimately discard the outlier as long as you can theoretically defend that, saying, “In this case, we’re not interested in outliers, they’re just not of interest,” or “That was the day Uncle Jerry came by and tipped me $100; that’s not predictable, and it’s not worth including in the model.”
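The refit-and-compare step is mechanical. Here's an illustrative Python sketch, with made-up numbers and the suspect point simply dropped by position, of how you might quantify an outlier's pull on the coefficients:

```python
import numpy as np

rng = np.random.default_rng(3)

# Made-up lemonade data plus one extreme day at Temperature 80 with low revenue.
temp = np.append(rng.uniform(20, 35, size=30), 80.0)
revenue = np.append(2.7 * temp[:30] - 35 + rng.normal(0, 3, size=30), 40.0)

# Fit with and without the suspect point, then compare the coefficients.
slope_all, icept_all = np.polyfit(temp, revenue, 1)
slope_wo, icept_wo = np.polyfit(temp[:-1], revenue[:-1], 1)

# A large gap between the two slopes means the single point drives the model.
print(slope_all, slope_wo)
```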

Improving Your Model: Transforming Variables

OVERVIEW

The most common way to improve a model is to transform one or more variables, usually using a “log” transformation.

Transforming a variable changes the shape of its distribution. Typically the best place to start is a variable that has an asymmetrical distribution, as opposed to a more symmetrical or bell-shaped distribution. So, find a variable like this to transform:

[Figure: a strongly right-skewed, asymmetric distribution]

In general, regression models work better with more symmetrical, bell-shaped curves. Try different kinds of transformations until you hit upon the one closest to that shape. It’s often not possible to get close to that, but that’s the goal. So let’s say you take the square root of “Revenue” as an attempt to get to a more symmetrical shape, and your distribution looks like this:

[Figure: distribution of the square root of "Revenue", less skewed but still asymmetric]

That’s good, but it’s still a bit asymmetrical. Let’s try taking the log of “Revenue” instead, which yields this shape:

[Figure: distribution of log("Revenue"), nice and symmetrical]

That’s nice and symmetrical. You’re probably going to get a better regression model with log(“Revenue”) instead of “Revenue.” Indeed, here’s how your equation, your residuals, and your r-squared might change:

[Figure: the equation, residual plot, and r-squared before and after the log transformation]

After transforming a variable, note how its distribution, the r-squared of the regression, and the patterns of the residual plot change. If those improve (particularly the r-squared and the residuals), it’s probably best to keep the transformation.

If a transformation is necessary, you should start by taking a “log” transformation because the results of your model will still be easy to understand. Note that you’ll run into issues if the data you’re trying to transform includes zeros or negative values, though.

To learn why taking a log is so useful, or if you have non-positive numbers you want to transform, or if you just want to get a better understanding of what's happening when you transform data, read on through the details below.
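The sqrt-then-log progression above can be checked numerically. In this Python sketch with simulated right-skewed revenue (an assumption for illustration), each transformation brings the skewness closer to 0, the symmetric, bell-shaped ideal:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

# Simulated right-skewed revenue (log-normal), like the first histogram above.
revenue = np.exp(rng.normal(3.0, 1.0, size=5000))

raw_skew = stats.skew(revenue)
sqrt_skew = stats.skew(np.sqrt(revenue))
log_skew = stats.skew(np.log(revenue))

# Each transform is progressively more symmetric (skewness closer to 0).
print(raw_skew, sqrt_skew, log_skew)
```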

DETAILS

If you take the log10() of a number, you're asking "10 to what power gives me that number?" For example, here's a simple table of four data points, including both "Revenue" and Log("Revenue"):

| Temperature | Revenue | Log(Revenue) |
|---|---|---|
| 20 | 100 | 2 |
| 30 | 1,000 | 3 |
| 40 | 10,000 | 4 |
| 45 | 31,623 | 4.5 |

Note that if we plot “Temperature” vs. “Revenue,” and “Temperature” vs. Log(“Revenue”), the latter model fits much better.

[Figure: "Temperature" vs. "Revenue" and "Temperature" vs. Log("Revenue"); the log version is nearly a straight line]

The interesting thing about this transformation is that your regression is no longer linear in the original units. When "Temperature" went from 20 to 30, "Revenue" went from 100 to 1,000, a 900-unit gap. Then when "Temperature" went from 30 to 40, "Revenue" went from 1,000 to 10,000, a much larger gap.

If you've taken a log of your response variable, it's no longer the case that a one-unit increase in "Temperature" means an X unit increase in "Revenue." Now it's an X percent increase in "Revenue." In this case, a ten-unit increase in "Temperature" is associated with a 900% increase in "Revenue" (it multiplies by 10); that is, a one-unit increase in "Temperature" is associated with a 26% increase in "Revenue."

Rules for interpretation

OK, you ran a regression/fit a linear model and some of your variables are log-transformed.

  1. Only the dependent/response variable is log-transformed. Exponentiate the coefficient, subtract one from this number, and multiply by 100. This gives the percent increase (or decrease) in the response for every one-unit increase in the independent variable. Example: the coefficient is 0.198. (exp(0.198) − 1) * 100 = 21.9. For every one-unit increase in the independent variable, our dependent variable increases by about 22%.
  2. Only the independent/predictor variable(s) is log-transformed. Divide the coefficient by 100. This tells us that a 1% increase in the independent variable increases (or decreases) the dependent variable by (coefficient/100) units. Example: the coefficient is 0.198. 0.198/100 = 0.00198. For every 1% increase in the independent variable, our dependent variable increases by about 0.002. For an x percent increase, multiply the coefficient by log(1.x) (natural log). Example: for every 10% increase in the independent variable, our dependent variable increases by about 0.198 * log(1.10) = 0.02.
  3. Both the dependent/response variable and the independent/predictor variable(s) are log-transformed. Interpret the coefficient as the percent increase in the dependent variable for every 1% increase in the independent variable. Example: the coefficient is 0.198. For every 1% increase in the independent variable, our dependent variable increases by about 0.20%. For an x percent increase, calculate 1.x to the power of the coefficient, subtract 1, and multiply by 100. Example: for every 20% increase in the independent variable, our dependent variable increases by about (1.20^0.198 − 1) * 100 = 3.7 percent.
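Those three rules are easy to sanity-check numerically. Here's a small Python sketch using the coefficient 0.198 from the examples (log here is the natural log):

```python
import math

b = 0.198  # example coefficient from the rules above

# Rule 1: log(y) ~ x. A one-unit increase in x -> percent change in y.
pct_change_y = (math.exp(b) - 1) * 100

# Rule 2: y ~ log(x). A 10% increase in x -> unit change in y.
unit_change_y = b * math.log(1.10)

# Rule 3: log(y) ~ log(x). A 20% increase in x -> percent change in y.
pct_change_elastic = (1.20 ** b - 1) * 100

print(round(pct_change_y, 1))       # ~21.9
print(round(unit_change_y, 3))      # ~0.019
print(round(pct_change_elastic, 1)) # ~3.7
```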

Also, note that you can't take the log of 0 or of a negative number (there is no X where 10^X = 0 or 10^X = −5), so if you do a log transformation, you'll lose those data points from the regression. There are three common ways of handling the situation:

  1. Take a square root or a cube root. Those won’t change the shape of the curve as dramatically as taking a log, but they allow zeros to remain in the regression.
  2. If it’s not too many rows of data that have a zero, and those rows aren’t theoretically important, you can decide to go ahead with the log and lose a few rows from your regression.
  3. Instead of taking log(y), take log(y+1), such that zeros become ones and can then be kept in the regression. This biases your model a bit and is somewhat frowned upon, but in practice, its negative side effects are typically pretty minor.
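Option 3 is a one-liner in most languages; numpy even has a dedicated log1p function for it, as this illustrative Python sketch shows:

```python
import numpy as np

revenue = np.array([0.0, 20.0, 45.0, 160.0])  # note the zero-revenue day

# np.log(revenue) would produce -inf for the zero; log(y + 1) keeps the row.
transformed = np.log1p(revenue)  # log1p(y) == log(y + 1), and log1p(0) == 0

print(transformed[0])  # 0.0: the zero-revenue day stays in the regression
```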

Improving Your Model: Missing Variables

Probably the most common reason that a model fails to fit is that not all the right variables are included. This particular issue has a lot of possible solutions.

ADDING A NEW VARIABLE

Sometimes the fix is as easy as adding another variable to the model. For example, if lemonade stand "Revenue" was much larger on weekends than weekdays, your predicted vs. actual plot might look like the below (r-squared of 0.053), since the model is just taking the average of weekend days and weekdays:

[Figure: predicted vs. actual plot without the "Weekend" variable (r-squared of 0.053)]

If the model includes a variable called “Weekend,” then the predicted vs. actual plot might look like this (r-squared of 0.974):

[Figure: predicted vs. actual plot with the "Weekend" variable included (r-squared of 0.974)]

The model makes far more accurate predictions because it’s able to take into account whether a day of the week is a weekday or not.

Note that sometimes you'll need to create variables to improve your model in this fashion. For example, you might have had a "Date" variable (with values like "10/26/2014"), and you might need to create a new variable called "Day of Week" (e.g., "Sunday") or "Weekend" (e.g., "Weekend" vs. "Weekday").
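Deriving that kind of variable is usually a few lines of code. Here's an illustrative Python sketch using only the standard library, with made-up dates in the same format as the example above:

```python
from datetime import datetime

dates = ["10/24/2014", "10/25/2014", "10/26/2014", "10/27/2014"]

# Derive "Day of Week" and a "Weekend" flag from the raw "Date" strings.
days = [datetime.strptime(d, "%m/%d/%Y") for d in dates]
day_of_week = [d.strftime("%A") for d in days]
weekend = [d.weekday() >= 5 for d in days]  # Monday == 0 ... Sunday == 6

print(day_of_week)  # ['Friday', 'Saturday', 'Sunday', 'Monday']
print(weekend)      # [False, True, True, False]
```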

UNAVAILABLE OMITTED VARIABLE

It's rarely that easy, though. Quite frequently the relevant variable isn't available, because you don't know what it is or it was difficult to collect. Maybe it wasn't a weekend vs. weekday issue, but instead something like "Number of Competitors in the Area" that you failed to collect at the time. If the variable you need is unavailable, or you don't even know what it would be, then your model can't really be improved, and you have to assess it and decide how happy you are with it (whether it's useful or not, even though it's flawed).

INTERACTIONS BETWEEN VARIABLES

Perhaps on weekends the lemonade stand is always selling at 100% of capacity, so regardless of the "Temperature," "Revenue" is high. But on weekdays, the lemonade stand is much less busy, so "Temperature" is an important driver of "Revenue." If you run a regression that includes "Weekend" and "Temperature," you might see a predicted vs. actual plot like this, where the row along the top is the weekend days:

Understanding and interpreting Residuals Plot for linear regression - Amir Masoud Sefidian - Sefidian Academy (49)

We would say that there’s an interaction between “Weekend” and “Temperature”; the effect of one of them on “Revenue” is different depending on the value of the other. If we create an interaction variable, we get a much better model, where predicted vs. actual looks like this:

Understanding and interpreting Residuals Plot for linear regression - Amir Masoud Sefidian - Sefidian Academy (50)
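A minimal sketch of this in Python with NumPy (synthetic data with invented names; the article’s plots are from R): the interaction variable is just the elementwise product of the two existing columns.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
temp = rng.uniform(15, 35, n)                   # hypothetical temperature
weekend = rng.integers(0, 2, n).astype(float)   # hypothetical weekend dummy

# Weekends sell at capacity regardless of temperature;
# on weekdays, revenue rises with temperature.
revenue = np.where(weekend == 1, 200.0, 20 + 5 * temp) + rng.normal(0, 3, n)

def ols_r2(X, y):
    """Fit ordinary least squares and return R-squared."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

ones = np.ones(n)
X_main = np.column_stack([ones, temp, weekend])                 # no interaction
X_int = np.column_stack([ones, temp, weekend, temp * weekend])  # with interaction
r2_main = ols_r2(X_main, revenue)
r2_interaction = ols_r2(X_int, revenue)
print(f"main effects only: {r2_main:.3f}, with interaction: {r2_interaction:.3f}")
```

With the interaction column, the model can fit a different “Temperature” slope on weekends than on weekdays, which is exactly the pattern in this data.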

Improving Your Model: Fixing Nonlinearity

Let’s say you have a relationship that looks like this:

Understanding and interpreting Residuals Plot for linear regression - Amir Masoud Sefidian - Sefidian Academy (51)

You might notice that the shape is that of a parabola, which you might recall is typically associated with formulas that look like this:

y = x² + x + 1

By default, regression uses a linear model that looks like this:

y = x + 1

In fact, the line in the plot above has this formula:

y = 1.7x + 51

But it’s a terrible fit. So if we add an x² term, our model has a better chance of fitting the curve. In fact, it creates this:

Understanding and interpreting Residuals Plot for linear regression - Amir Masoud Sefidian - Sefidian Academy (52)

The formula for that curve is:

y = -2x² + 111x - 1408
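The same comparison can be sketched in Python with NumPy (the article’s plots are from R; the data here are synthetic, generated to roughly match the curve above):

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(10, 45, 120)
# Synthetic parabola-shaped data, built from y = -2x^2 + 111x - 1408
# plus noise (invented for illustration).
y = -2 * x**2 + 111 * x - 1408 + rng.normal(0, 20, x.size)

def poly_r2(degree):
    """Fit a polynomial of the given degree and return R-squared."""
    coefs = np.polyfit(x, y, degree)
    resid = y - np.polyval(coefs, x)
    return 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

r2_linear = poly_r2(1)     # straight line only: y = b1*x + b0
r2_quadratic = poly_r2(2)  # adds the x^2 term
print(f"linear R^2: {r2_linear:.3f}, quadratic R^2: {r2_quadratic:.3f}")
```

The straight line barely explains anything here because the parabola rises and then falls, while the degree-2 fit tracks the curve closely.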

That means our diagnostic plots change from this:

Understanding and interpreting Residuals Plot for linear regression - Amir Masoud Sefidian - Sefidian Academy (53)

to this:

Understanding and interpreting Residuals Plot for linear regression - Amir Masoud Sefidian - Sefidian Academy (54)

Note that these are healthy diagnostic plots, even though the data appear somewhat unbalanced toward the right side of the plot.

The above approach can be extended to other kinds of shapes, particularly an S-shaped curve, by adding an x³ term. That’s relatively uncommon, though.

A few cautions:

  • Generally speaking, if you have an x² term because of a nonlinear pattern in your data, you also want to keep a plain x (not x²) term in the model. You may find that your model is perfectly good without it, but you should definitely try both to start.
  • The regression equation may be difficult to understand. For the linear equation at the beginning of this section, each additional unit of “Temperature” meant “Revenue” went up 1.7 units. When you have both x² and x in the equation, it’s not easy to say “When ‘Temperature’ goes up one degree, here’s what happens.” For that reason, it’s sometimes easier to just use a linear equation, assuming that equation fits well enough.
