Data Visualization

DataCamp Project Visualizing Covid-19

By Jon Elordi

This is a review of, and my answers to, the DataCamp Project Visualizing COVID-19. You can find the project here if you want to do it for yourself. It’s been a while since I’ve done a DataCamp project. My last one was on visualizing music data.

DataCamp Project Review:

It’s a fairly straightforward project that you can get through in about 45 minutes if you’re familiar with ggplot2. The results are several line graphs made using ggplot2. I have mixed feelings about DataCamp Projects. On the one hand, I think they’re great for practicing things like working in Jupyter Notebooks and several R concepts. On the other hand, the algorithm that checks the projects is very strict and infuriating.

Possibly the most annoying part of the DataCamp Project Visualizing COVID-19 was the lack of flexibility. For example, the project’s algorithm finds it very important where you put the aes() call in your ggplot chain. So you’ll find yourself re-running code that is written differently but produces the same outcome. That’s frustrating. By far the most frustrating moment was when the project tells you to “Place the labels at 100000 on the y-axis. Use the who_events data again.” Straightforward enough. HOWEVER, in your code, you need to input 100000 as 1e5. If you just put in 100000, you will get it wrong. You will get frustrated. And you will take the hint.
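To back up the claim that these variations really are the same thing to R, here is a minimal sketch. The toy data frame is made up purely for illustration; it is not the project data. It checks that 1e5 and 100000 are the same number, and that moving aes() between ggplot() and the geom gives ggplot2 the same data to draw.

```r
library(ggplot2)

# R parses both literals to the same double.
same_number <- identical(1e5, 100000)
print(same_number)

# Hypothetical toy data, only for illustration.
toy <- data.frame(x = 1:5, y = c(2, 4, 8, 16, 32))

# aes() in the ggplot() call versus aes() inside the geom.
p_global <- ggplot(toy, aes(x = x, y = y)) + geom_line()
p_local  <- ggplot(toy) + geom_line(aes(x = x, y = y))

# layer_data() returns the data frame ggplot2 computes for a layer.
same_layer <- isTRUE(all.equal(layer_data(p_global), layer_data(p_local)))
print(same_layer)
```

So the checker is comparing how the code is written, not what it produces.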

So use my answers below to avoid taking hints and losing out on those XP points.

Visualizing COVID-19 Answers:

I’m going to input the code and a screenshot of the output where applicable.

Part 1 – Loading

This one is straightforward: loading libraries and printing a table. The output will be a long table with two columns: date and cum_cases.

#Load the readr, ggplot2, and dplyr packages
  library(readr)
  library(ggplot2)
  library(dplyr)
#Read datasets/confirmed_cases_worldwide.csv into confirmed_cases_worldwide
  confirmed_cases_worldwide <- read_csv("datasets/confirmed_cases_worldwide.csv")
#See the result
  confirmed_cases_worldwide

Part 2 – Initial Time Series

Making a time series chart of cumulative cases over time. A straightforward ggplot2 line chart.

#Draw a line plot of cumulative cases vs. date
#Label the y-axis
  ggplot(confirmed_cases_worldwide, aes(x = date, y = cum_cases)) +
  geom_line() +
  ylab("Cumulative confirmed cases")

Part 3 – China

Creating another time series, this time adding the color aesthetic to show China vs. the rest of the world. The project makes you pass the “group” aesthetic, which I’ve never used before, and I’m not sure what it does because the graph looked the same whether I had it in there or not.

#Read in datasets/confirmed_cases_china_vs_world.csv
  confirmed_cases_china_vs_world <- read_csv("datasets/confirmed_cases_china_vs_world.csv")
#See the result
  confirmed_cases_china_vs_world
#Draw a line plot of cumulative cases vs. date, grouped and colored by is_china
#Define aesthetics within the line geom
  plt_cum_confirmed_cases_china_vs_world <- ggplot(confirmed_cases_china_vs_world) +
  geom_line(aes(x = date, y = cum_cases, group = is_china, color = is_china)) +
  ylab("Cumulative confirmed cases")
#See the plot
  plt_cum_confirmed_cases_china_vs_world
Sorry about chopping off the bottom. I figured the code was more important than an x-axis.
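On the “group” question: as far as I can tell, the graph looks the same because ggplot2 already splits the data into groups whenever a discrete variable is mapped to an aesthetic like color, so group = is_china is redundant here. A minimal sketch on made-up toy data (the column names mirror the project, the values do not):

```r
library(ggplot2)

# Hypothetical toy data standing in for the China vs. world table.
toy <- data.frame(
  date = rep(1:3, times = 2),
  cum_cases = c(1, 2, 3, 2, 4, 8),
  is_china = rep(c("China", "Not China"), each = 3)
)

p_color_only <- ggplot(toy) +
  geom_line(aes(x = date, y = cum_cases, color = is_china))
p_group_and_color <- ggplot(toy) +
  geom_line(aes(x = date, y = cum_cases, group = is_china, color = is_china))
p_neither <- ggplot(toy) +
  geom_line(aes(x = date, y = cum_cases))

# The group column shows how ggplot2 split the rows into separate lines.
groups_color_only <- layer_data(p_color_only)$group
groups_both <- layer_data(p_group_and_color)$group
groups_neither <- layer_data(p_neither)$group

print(all(groups_color_only == groups_both))  # same split with or without group
print(length(unique(groups_neither)))         # number of groups with neither aesthetic
```

Where group earns its keep is when you want one line per category without giving each its own color.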

Part 4 – Annotations

Adding annotations to the chart. This was by far the most annoying part of the project. There are just so many settings to get exactly right. NOTE!!!! Use 1e5 instead of 100000. Let me save your sanity.

  who_events <- tribble(
    ~ date, ~ event,
    "2020-01-30", "Global health\nemergency declared",
    "2020-03-11", "Pandemic\ndeclared",
    "2020-02-13", "China reporting\nchange"
  ) %>%
  mutate(date = as.Date(date))
#Using who_events, add vertical dashed lines with an xintercept at date
#and text at date, labeled by event, and at 100000 on the y-axis
  plt_cum_confirmed_cases_china_vs_world +
  geom_vline(data = who_events, aes(xintercept = date), linetype = "dashed") +
  geom_text(data = who_events, aes(x = date, label = event), y = 1e5)
Maybe it was my laptop, but I do not think overlapping annotations are a good look.
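Outside the project, one way to avoid the overlap is to give each event its own label height instead of a fixed y. This is only a sketch: the label_y column and its values are my own invention, and since the project asks for every label at 1e5, don't expect the checker to accept it.

```r
library(ggplot2)
library(dplyr)

# label_y is a hypothetical extra column: a separate height per label.
who_events_staggered <- tribble(
  ~ date, ~ event, ~ label_y,
  "2020-01-30", "Global health\nemergency declared", 1e5,
  "2020-02-13", "China reporting\nchange", 6e4,
  "2020-03-11", "Pandemic\ndeclared", 1e5
) %>%
  mutate(date = as.Date(date))

# Built on an empty plot so the sketch runs without the project data.
plt_staggered <- ggplot() +
  geom_vline(data = who_events_staggered, aes(xintercept = date), linetype = "dashed") +
  geom_text(data = who_events_staggered, aes(x = date, y = label_y, label = event))
```

Printing plt_staggered draws the three dashed lines with the middle label sitting lower than its neighbors. In the project you would add the same two layers to plt_cum_confirmed_cases_china_vs_world instead of ggplot().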

Part 5 – Trend Line

Some of the errors you get with geom_smooth() can be confusing in the project. It’ll ask for an x or y. Ignore them and do what you’ve been taught.

#Filter for China, from Feb 15
  china_after_feb15 <- confirmed_cases_china_vs_world %>%
  filter(is_china == "China", date >= "2020-02-15")
#Using china_after_feb15, draw a line plot cum_cases vs. date
#Add a smooth trend line using linear regression, no error bars
  ggplot(china_after_feb15, aes(x = date, y = cum_cases)) +
  geom_line() +
  geom_smooth(method = "lm", se = FALSE) +
  ylab("Cumulative confirmed cases")
[Screenshot: China cumulative cases after Feb 15 with a linear regression trend line]
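If you are wondering what geom_smooth(method = "lm") is doing, it fits an ordinary linear regression of y on x and draws the fitted line. A small sketch, again on made-up numbers rather than the project data, comparing the plotted line with a manual lm() fit:

```r
library(ggplot2)

# Hypothetical toy data, only for illustration.
toy <- data.frame(x = 1:10, y = c(3, 5, 8, 9, 12, 13, 15, 18, 19, 22))

p <- ggplot(toy, aes(x = x, y = y)) +
  geom_smooth(method = "lm", se = FALSE)

# Points along the trend line that ggplot2 computed for the smooth layer.
smooth_points <- layer_data(p)

# The same line, fitted by hand with lm() and evaluated at the same x values.
fit <- lm(y ~ x, data = toy)
manual <- unname(predict(fit, newdata = data.frame(x = smooth_points$x)))

matches_lm <- isTRUE(all.equal(smooth_points$y, manual))
print(matches_lm)
```

Setting se = FALSE only hides the confidence band; the fitted line is unchanged.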

Part 6 – Another Trend Line

#Filter confirmed_cases_china_vs_world for not China
  not_china <- confirmed_cases_china_vs_world %>%
  filter(is_china != "China")
#Using not_china, draw a line plot cum_cases vs. date
#Add a smooth trend line using linear regression, no error bars
  plt_not_china_trend_lin <- ggplot(data = not_china, aes(x = date, y = cum_cases)) +
  geom_line() +
  geom_smooth(method = "lm", se = FALSE) +
  ylab("Cumulative confirmed cases")
#See the result
  plt_not_china_trend_lin
[Screenshot: cumulative cases outside China with a linear regression trend line]

Part 7 – Log Scale

I actually didn’t know about the scale_y_log10() function. That was nice to learn about.

#Modify the plot to use a logarithmic scale on the y-axis
  plt_not_china_trend_lin +
  scale_y_log10()
[Screenshot: cumulative cases outside China on a logarithmic y-axis]
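Why the log scale matters: scale_y_log10() transforms the data before geom_smooth() fits its line, so a series that grows by a constant factor becomes a straight line that the linear trend can follow. A minimal sketch with made-up data that doubles every day:

```r
library(ggplot2)

# Hypothetical toy series: starts at 10 and doubles each day.
toy <- data.frame(day = 0:9)
toy$cum_cases <- 10 * 2^toy$day

p_log <- ggplot(toy, aes(x = day, y = cum_cases)) +
  geom_line() +
  geom_smooth(method = "lm", se = FALSE) +
  scale_y_log10()

# Layer 2 is the smooth; its y values are in log10 units.
trend <- layer_data(p_log, 2)

# On the log10 scale the series is exactly 1 + day * log10(2).
is_straight <- isTRUE(all.equal(trend$y, 1 + trend$x * log10(2)))
print(is_straight)
```

On the original linear axis the same straight-line fit would miss the curve badly.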

Part 8 – Other Countries

#Run this to get the data for each country
  confirmed_cases_by_country <- read_csv("datasets/confirmed_cases_by_country.csv")
#Group by country, summarize to calculate total cases, find the top 7
  top_countries_by_total_cases <- confirmed_cases_by_country %>%
  group_by(country) %>%
  summarize(total_cases = max(cum_cases)) %>%
  top_n(7, total_cases)
#See the result
  top_countries_by_total_cases
[Screenshot: table of the top 7 countries by total cases]
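The summarize step takes max(cum_cases) because the counts are cumulative, so each country’s largest value is its running total. Here is a small sketch of the same group, summarize, and pick-the-top pattern on made-up data (countries A, B, and C are placeholders). It shows top_n() next to slice_max(), the newer dplyr function for the same job; which one the project checker accepts is something I have not verified.

```r
library(dplyr)

# Hypothetical toy data: cumulative counts for three placeholder countries.
toy <- tribble(
  ~ country, ~ date, ~ cum_cases,
  "A", "2020-03-01", 10,
  "A", "2020-03-02", 25,
  "B", "2020-03-01", 5,
  "B", "2020-03-02", 7,
  "C", "2020-03-01", 40,
  "C", "2020-03-02", 90
)

# One row per country, holding its largest cumulative count.
totals <- toy %>%
  group_by(country) %>%
  summarize(total_cases = max(cum_cases))

# Two ways to keep the two largest totals.
top2_top_n <- totals %>% top_n(2, total_cases)
top2_slice <- totals %>% slice_max(total_cases, n = 2)

print(top2_top_n)
print(top2_slice)
```

The practical difference is ordering: slice_max() returns the rows sorted from largest to smallest, while top_n() keeps them in their existing order.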

Part 9 – Wrapping It Up

#Run this to get the data for the top 7 countries
  confirmed_cases_top7_outside_china <- read_csv("datasets/confirmed_cases_top7_outside_china.csv")
#Using confirmed_cases_top7_outside_china, draw a line plot of
#cum_cases vs. date, grouped and colored by country
  ggplot(data = confirmed_cases_top7_outside_china) +
  geom_line(aes(x = date, y = cum_cases, group = country, color = country)) +
  ylab("Cumulative confirmed cases")
[Screenshot: cumulative cases for the top 7 countries outside China, colored by country]