Linear regression

Python notebook: https://github.com/daviskregers/data-science-recap/blob/main/09-linear-regression.ipynb

  • Fit a line to a data set of observations
  • Use this line to predict unobserved values

How does it work?

  • Usually using "least squares"
  • Minimizes the squared-error between each point and the line
  • Remember the slope-intercept equation of a line? y=mx+b
  • The slope is the correlation between the two variables times the standard deviation in Y, all divided by the standard deviation in X.
  • The intercept is the mean of Y minus the slope times the mean of X
  • But Python will do all of that for you.
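A minimal sketch of the recipe above, using made-up data (the values are hypothetical). It computes the slope as correlation times std(Y)/std(X) and the intercept from the means, then checks the result against NumPy's built-in least-squares fit:

```python
import numpy as np

# Hypothetical data: a noisy linear relationship
rng = np.random.default_rng(42)
x = rng.normal(3.0, 1.0, 100)
y = 100.0 - 3.0 * x + rng.normal(0.0, 1.0, 100)

# Slope = correlation(X, Y) * std(Y) / std(X)
r = np.corrcoef(x, y)[0, 1]
slope = r * y.std() / x.std()

# Intercept = mean(Y) - slope * mean(X)
intercept = y.mean() - slope * x.mean()

# np.polyfit(deg=1) computes the same least-squares line directly
slope_np, intercept_np = np.polyfit(x, y, 1)
print(slope, slope_np)  # the two slopes agree
```

The two results match because the least-squares slope cov(X, Y)/var(X) is algebraically the same quantity as r · std(Y)/std(X).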

There is more than one way to do it

  • Gradient descent is an alternative to least squares
  • It iteratively adjusts the line's parameters, descending the error surface defined by the data.
  • Can make sense when dealing with 3D data.
  • Easy to try in Python and compare the results to least squares
    • But usually least squares is a perfectly good choice.
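To make the iteration concrete, here is a sketch of gradient descent on mean squared error for a line fit, with made-up data and a hand-picked learning rate (both are assumptions, not from the notebook). Each step nudges the slope and intercept downhill along the gradient of the error:

```python
import numpy as np

# Hypothetical data with a known linear trend: y ≈ 2x + 1
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.5, 200)

# Gradient descent on mean squared error
m, b = 0.0, 0.0   # start from a flat line
lr = 0.01         # learning rate (chosen by hand)
for _ in range(5000):
    err = (m * x + b) - y
    m -= lr * 2.0 * np.mean(err * x)  # d(MSE)/dm
    b -= lr * 2.0 * np.mean(err)      # d(MSE)/db

# Compare against the closed-form least-squares answer
m_ls, b_ls = np.polyfit(x, y, 1)
print(m, b, m_ls, b_ls)
```

Because the squared-error surface for a line is convex, the iteration converges to the same answer least squares gives in one shot, which is why least squares is usually the simpler choice here.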

Measuring error with r-squared

  • How do we measure how well our line fits our data?
  • R-squared (aka coefficient of determination) measures:
    • The fraction of the total variation in Y that is captured by the model.
\[1.0-\frac{\text{sum of squared errors}}{\text{sum of squared variation from mean}}\]

Interpreting r-squared

  • Ranges from 0 to 1
  • 0 is bad (none of the variance is captured), 1 is good (all of the variance is captured).
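The formula above translates directly into code. A short sketch with made-up data: fit a line, then score it by comparing the squared errors of the predictions to the squared variation of Y around its mean:

```python
import numpy as np

# Hypothetical data: a fairly tight linear relationship
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 100)
y = 3.0 * x - 2.0 + rng.normal(0.0, 1.0, 100)

# Fit a least-squares line and predict
m, b = np.polyfit(x, y, 1)
pred = m * x + b

ss_err = np.sum((y - pred) ** 2)       # sum of squared errors
ss_tot = np.sum((y - y.mean()) ** 2)   # sum of squared variation from mean
r2 = 1.0 - ss_err / ss_tot
print(r2)
```

For data this close to a straight line, r-squared comes out near 1; with pure noise and no relationship it would sit near 0.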