I’m currently working my way through the DataCamp career track, Statistician with R. The fourth course in the track is DataCamp Correlation and Regression in R. I enjoyed this course. It was not the most technical course I’ve taken with DataCamp, but it was very good at reinforcing and building intuition for correlation and linear regression. The goal of the course was not to practice coding but to gain a feel for how these concepts work, and on that level I think the course succeeded.
Visualizing Two Variables
The first section of the DataCamp Correlation and Regression course is called “Visualizing Two Variables.” Not the most advanced topic in the world, but it sets the table for the four sections to come. In this section you build several scatter plots and box plots. It also introduces several ways to characterize bivariate relationships.
- Form – Linear, Quadratic, non-linear, fan-shaped
- Direction – Positive or Negative
- Strength – How much noise is there?
- Outliers – Weird points
The section also discusses using jitter or alpha to help correct for overplotting, where many points land on top of each other and hide how dense the data really is. I had never thought of using alpha (transparency) as a way to correct for it, so I found that to be a useful tidbit.
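Here is a minimal sketch of both fixes. The data is simulated, not from the course, and I am using base R graphics here as an assumption to keep the example free of extra packages.

```r
# Simulated data with heavy overplotting: both variables are integers,
# so many points land on exactly the same coordinates.
set.seed(42)
x <- sample(1:5, 500, replace = TRUE)
y <- x + sample(-1:1, 500, replace = TRUE)

# Fix 1: jitter adds a little random noise so stacked points spread out.
x_jit <- jitter(x, amount = 0.2)
y_jit <- jitter(y, amount = 0.2)

# Fix 2: alpha makes points semi-transparent, so dense spots look darker.
see_through <- adjustcolor("steelblue", alpha.f = 0.2)

# Draw the three versions side by side for comparison.
par(mfrow = c(1, 3))
plot(x, y, pch = 19, main = "Raw")
plot(x_jit, y_jit, pch = 19, main = "Jitter")
plot(x, y, pch = 19, col = see_through, main = "Alpha")
```

In the raw plot 500 observations collapse into a handful of visible dots. Jitter trades a little positional accuracy for visibility, while alpha keeps every point where it is and lets color depth show the density.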
The second section gets a little mathy. Here the correlation coefficient is introduced. The correlation coefficient only measures linear relationships. It is a value between -1 and 1: the sign gives the direction of the relationship and the magnitude gives its strength. The whole section goes on to explore each of these aspects. It does that by introducing the Anscombe data, which is apparently a famous dataset. It contains four sets of points that make very differently shaped graphs, all with nearly the same correlation coefficient. The Anscombe dataset shows the pitfalls of relying on the correlation coefficient alone.
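The Anscombe point is easy to check yourself, since the dataset ships with base R as `anscombe`. This is a small sketch of my own, not the course's exercise code.

```r
# anscombe is built into R: columns x1..x4 paired with y1..y4.
data(anscombe)

# Correlation coefficient for each of the four x/y pairs.
cors <- sapply(1:4, function(i) {
  cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]])
})
print(round(cors, 2))

# Plot all four pairs to see how different the shapes really are.
par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
       xlab = paste0("x", i), ylab = paste0("y", i), pch = 19)
}
```

All four correlations come out at roughly 0.82, yet the plots show a noisy line, a curve, a line with one outlier, and a vertical stack with one extreme point. The number alone cannot tell those apart.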
The section ends with the important concept of spurious correlation. Spurious correlation is the idea that although two things move together, that does not necessarily mean one causes the other. Often a confounding variable drives both. Common confounding variables are time when looking at time series, and population when looking at geographic data. For fun examples of spurious correlation check out this site.
Simple Linear Regression & Interpreting Regression Models
Sections 3 and 4 dealt with simple linear regression. These were a slog. These sections made you calculate by hand the slope and intercept of a linear regression. It’s important to know, but it was dry.
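For reference, the by-hand calculation looks something like the sketch below. The slope is the correlation times the ratio of the standard deviations, and the intercept follows from the fact that the line passes through the point of means. I am using the built-in `mtcars` data as a stand-in, which is my choice and not necessarily what the course uses.

```r
# Predict fuel economy (mpg) from weight (wt) using built-in data.
x <- mtcars$wt
y <- mtcars$mpg

# Slope: correlation scaled by the ratio of standard deviations.
slope <- cor(x, y) * sd(y) / sd(x)

# Intercept: the fitted line goes through (mean(x), mean(y)).
intercept <- mean(y) - slope * mean(x)

# Compare against what lm() computes.
fit <- lm(mpg ~ wt, data = mtcars)
print(c(intercept = intercept, slope = slope))
print(coef(fit))
```

The two printed results should agree, which is a nice sanity check that `lm()` is not doing anything mysterious in the single-predictor case.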
The third section ends with a discussion of the phrase “regression to the mean” and how it is different from fitting a regression model.
Section five was all about error. Root-mean-squared error, the sum of squared residuals, and the residual standard error all get discussed and calculated. I would call this section a slog, but compared to the last two it was a welcome relief. I’m kidding a bit. I actually enjoyed this section the most. It felt the most like what a statistician does: critiquing models and trying to understand their failings.
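The three error measures are closely related, which is easier to see in code. This is a rough sketch on the built-in `mtcars` data (again my stand-in, not the course's dataset).

```r
# Fit a simple linear regression on built-in data.
fit <- lm(mpg ~ wt, data = mtcars)
res <- residuals(fit)

# Sum of squared residuals: the quantity least squares minimizes.
ssr <- sum(res^2)

# Root-mean-squared error: typical residual size, dividing by n.
rmse <- sqrt(mean(res^2))

# Residual standard error: same idea, but dividing by the residual
# degrees of freedom (n - 2 for one predictor plus an intercept).
rse <- sqrt(ssr / df.residual(fit))

print(c(ssr = ssr, rmse = rmse, rse = rse))
print(summary(fit)$sigma)  # should match rse
```

The residual standard error is always a bit larger than the RMSE because it divides by n - 2 instead of n. With a large dataset the difference becomes negligible.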
The section ends with a very good explanation of outliers. The section also introduces two concepts I had not really heard of before: leverage and influence. Leverage is how far an observation is from the mean along the x-axis. Influence is how much a point affects the fitted model. It is very possible for an observation to have high leverage and low influence, for example when it sits far out on the x-axis but right on the regression line. I think a simple way to view leverage is as the x-axis counterpart of a residual. So one way to spot influential outliers is to see which observations have both high leverage and a large residual.
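Both quantities can be pulled out of a fitted model with base R. In the sketch below I use `hatvalues()` for leverage and Cook's distance as the measure of influence. Cook's distance is one common choice, and treating it as the course's definition of influence is an assumption on my part. The `mtcars` data is once more just a stand-in.

```r
fit <- lm(mpg ~ wt, data = mtcars)
x <- mtcars$wt

# Leverage by hand for simple regression: it depends only on how far
# each x value is from the mean of x, never on y.
lev_hand <- 1 / length(x) + (x - mean(x))^2 / sum((x - mean(x))^2)

# The same thing from the fitted model.
lev <- hatvalues(fit)

# Influence, measured here with Cook's distance, which combines
# leverage with the size of the residual.
infl <- cooks.distance(fit)

# Line the diagnostics up and list the most influential cars first.
diag <- data.frame(
  leverage = lev,
  residual = residuals(fit),
  cooks_d  = infl
)
print(head(diag[order(-diag$cooks_d), ], 5))
```

Sorting by leverage instead of Cook's distance gives a different ordering, which is the whole point: a heavy car that the line predicts well has high leverage but little influence.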
I enjoyed the topic of outliers quite a bit in this section. How to deal with outliers is very subjective. Building up the concepts of leverage and influence along with residuals helped to create a framework by which to judge outliers, and to justify their removal or inclusion.
Summary of DataCamp Correlation and Regression in R
Overall, this was a more theoretical course. It used the R console more for arithmetic than for making graphs. Having completed the course, I can safely say I understand the fundamentals of linear regression much better than when I started. But I’ll be honest, only the real data dork will likely enjoy this course. It lacks frills. To see my review of the DataCamp R project on visualizing COVID-19, check out this post.