Lab 7: Data Tidying (Ch 5), Transformation and Visualization with COVID-19 reporting data

Author

Emily Tran

library(tidyverse)
library(lubridate) # allows you to work with dates

Chapter 5: Data tidying (Exercise 1 for Lab 7)

5.1 Introduction

Tidy data is a consistent system for organizing data in R.

In this chapter, we’ll focus on tidyr, a package that provides a bunch of tools to help tidy up your messy datasets. tidyr is a member of the core tidyverse.

5.2 Tidy data

There are different ways you can represent the same data. For instance, these datasets have the same four variables (country, year, population, cases), but the values are organized in different ways.

table1
# A tibble: 6 × 4
  country      year  cases population
  <chr>       <dbl>  <dbl>      <dbl>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583
table2
# A tibble: 12 × 4
   country      year type            count
   <chr>       <dbl> <chr>           <dbl>
 1 Afghanistan  1999 cases             745
 2 Afghanistan  1999 population   19987071
 3 Afghanistan  2000 cases            2666
 4 Afghanistan  2000 population   20595360
 5 Brazil       1999 cases           37737
 6 Brazil       1999 population  172006362
 7 Brazil       2000 cases           80488
 8 Brazil       2000 population  174504898
 9 China        1999 cases          212258
10 China        1999 population 1272915272
11 China        2000 cases          213766
12 China        2000 population 1280428583
table3
# A tibble: 6 × 3
  country      year rate             
  <chr>       <dbl> <chr>            
1 Afghanistan  1999 745/19987071     
2 Afghanistan  2000 2666/20595360    
3 Brazil       1999 37737/172006362  
4 Brazil       2000 80488/174504898  
5 China        1999 212258/1272915272
6 China        2000 213766/1280428583

table1 is the easiest to work with inside the tidyverse because it is tidy. There are two main advantages to tidy data: a consistent data structure, and the ease of using R functions that work with vectors of values.

There are 3 rules that make a dataset tidy:

  1. Each variable is a column; each column is a variable.
  2. Each observation is a row; each row is an observation.
  3. Each value is a cell; each cell is a single value.

dplyr and ggplot2 are other packages in the tidyverse that are designed to work with tidy data. Some examples are below:

# Compute rate per 10,000
table1 |>
  mutate(rate = cases / population * 10000)
# A tibble: 6 × 5
  country      year  cases population  rate
  <chr>       <dbl>  <dbl>      <dbl> <dbl>
1 Afghanistan  1999    745   19987071 0.373
2 Afghanistan  2000   2666   20595360 1.29 
3 Brazil       1999  37737  172006362 2.19 
4 Brazil       2000  80488  174504898 4.61 
5 China        1999 212258 1272915272 1.67 
6 China        2000 213766 1280428583 1.67 
# Compute total cases per year
table1 |> 
  group_by(year) |> 
  summarize(total_cases = sum(cases))
# A tibble: 2 × 2
   year total_cases
  <dbl>       <dbl>
1  1999      250740
2  2000      296920
# Visualize changes over time
ggplot(table1, aes(x = year, y = cases)) +
  geom_line(aes(group = country), color = "grey50") +
  geom_point(aes(color = country, shape = country)) +
  scale_x_continuous(breaks = c(1999, 2000)) # x-axis breaks at 1999 and 2000

5.2.1 Exercises

Exercise 1

In each table, an observation corresponds to a row (each observation is a row, and each row is an observation). For table1, one observation is, for instance, Afghanistan, 1999, 745, 19987071. Each of those values is held in a single cell under the columns country, year, cases, and population.

For table2, one observation is, for instance, Afghanistan, 1999, cases, 745. Each of those values falls within a single cell under the columns country, year, type, and count.

For table3, one observation is, for instance, Afghanistan, 1999, 745/19987071. Each of those values falls within a single cell under the columns country, year, and rate.

Exercise 2

The rate is cases/population. What I would do for table2: group by year to get values "per year", then pull out the count for each country within each year (separately for cases and population). Then I would create an object holding table1-style information, except that in it I would put those counts by country and year so the rate can be computed.
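
As a sketch of what this could look like in code (my own illustration using pivot_wider(), which appears later in this chapter, rather than the manual steps just described):

# One possible way to compute rate per 10,000 from table2:
# widen the data so cases and population become their own columns, then divide
table2 |>
  pivot_wider(names_from = type, values_from = count) |>
  mutate(rate = cases / population * 10000)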

For table3, I would compute the rate properly: split the values in the rate column, divide the cases by the population to get a number, and then multiply that by 10,000. I would store those values in a new column in a new table.
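
And a similar sketch for table3 (again my own illustration, not necessarily the approach described above), splitting the rate string into its two numeric parts first:

# One possible way to compute rate per 10,000 from table3:
# split the "cases/population" string into numeric columns, then divide
table3 |>
  separate(rate, into = c("cases", "population"), sep = "/", convert = TRUE) |>
  mutate(rate = cases / population * 10000)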

5.3 Lengthening data: pivot

In reality, most data aren't as tidy as you would hope. There are two main reasons: data is often organized to serve some goal other than analysis (such as making data entry easier), and most people aren't familiar with the principles of tidy data.

To help with tidying, you can pivot your data into a tidy form.

tidyr has two functions for pivoting data: pivot_longer() and pivot_wider(). We will look into these functions below.

5.3.1 Data in column names

The billboard dataset records the billboard rank of songs in the year 2000. In this dataset, each observation is a song. The first 3 columns are variables that describe the song.

Additionally, there are 76 columns (wk1-wk76) that describe the rank of the song in each week. Note that here, the column names encode one variable (the week), and the cell values are another variable (the rank).

We will tidy this data using pivot_longer(). These are the arguments:

  cols - specifies which columns need to be pivoted (that is, which columns aren't variables)
  names_to - names the variable stored in the column names
  values_to - names the variable stored in the cell values

We will also tidy up the data by cleaning the cell values under the week column. This entails using mutate() to replace week with the output of readr::parse_number().

parse_number() is a handy function that will extract the first number from a string, ignoring all other text.
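
For example (a quick illustration of my own):

# parse_number() drops the "wk" prefix and keeps the number
parse_number("wk10")
# returns 10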

billboard
# A tibble: 317 × 79
   artist     track date.entered   wk1   wk2   wk3   wk4   wk5   wk6   wk7   wk8
   <chr>      <chr> <date>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 2 Pac      Baby… 2000-02-26      87    82    72    77    87    94    99    NA
 2 2Ge+her    The … 2000-09-02      91    87    92    NA    NA    NA    NA    NA
 3 3 Doors D… Kryp… 2000-04-08      81    70    68    67    66    57    54    53
 4 3 Doors D… Loser 2000-10-21      76    76    72    69    67    65    55    59
 5 504 Boyz   Wobb… 2000-04-15      57    34    25    17    17    31    36    49
 6 98^0       Give… 2000-08-19      51    39    34    26    26    19     2     2
 7 A*Teens    Danc… 2000-07-08      97    97    96    95   100    NA    NA    NA
 8 Aaliyah    I Do… 2000-01-29      84    62    51    41    38    35    35    38
 9 Aaliyah    Try … 2000-03-18      59    53    38    28    21    18    16    14
10 Adams, Yo… Open… 2000-08-26      76    76    74    69    68    67    61    58
# ℹ 307 more rows
# ℹ 68 more variables: wk9 <dbl>, wk10 <dbl>, wk11 <dbl>, wk12 <dbl>,
#   wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>, wk17 <dbl>, wk18 <dbl>,
#   wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>, wk23 <dbl>, wk24 <dbl>,
#   wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>, wk29 <dbl>, wk30 <dbl>,
#   wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>, wk35 <dbl>, wk36 <dbl>,
#   wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>, wk41 <dbl>, wk42 <dbl>, …
# pivot the billboard data so that it's longer
billboard_longer <- billboard |> 
  pivot_longer(
    cols = starts_with("wk"), 
    names_to = "week", # the quotes are needed because we are creating new variable names
    values_to = "rank",
    values_drop_na = TRUE # do this to get rid of NA values
  ) |> 
  # tidy up the week column by converting its cell values from characters to numbers
  mutate(
    week = parse_number(week)
  )
  
billboard_longer
# A tibble: 5,307 × 5
   artist  track                   date.entered  week  rank
   <chr>   <chr>                   <date>       <dbl> <dbl>
 1 2 Pac   Baby Don't Cry (Keep... 2000-02-26       1    87
 2 2 Pac   Baby Don't Cry (Keep... 2000-02-26       2    82
 3 2 Pac   Baby Don't Cry (Keep... 2000-02-26       3    72
 4 2 Pac   Baby Don't Cry (Keep... 2000-02-26       4    77
 5 2 Pac   Baby Don't Cry (Keep... 2000-02-26       5    87
 6 2 Pac   Baby Don't Cry (Keep... 2000-02-26       6    94
 7 2 Pac   Baby Don't Cry (Keep... 2000-02-26       7    99
 8 2Ge+her The Hardest Part Of ... 2000-09-02       1    91
 9 2Ge+her The Hardest Part Of ... 2000-09-02       2    87
10 2Ge+her The Hardest Part Of ... 2000-09-02       3    92
# ℹ 5,297 more rows

Now we have effectively tidied up the billboard data, since we have all week numbers in one variable and all rank values in another. We can move on to visualizing.

billboard_longer |> 
  ggplot(aes(x = week, y = rank, group = track)) + 
  geom_line(alpha = 0.25) + # transparency
  scale_y_reverse() # reverses y-axis values

5.3.2 How does pivoting work?

Let's examine how it works. Suppose we have the tibble below (created with tribble(), a handy function for constructing small tibbles by hand):

df <- tribble(
  ~id,  ~bp1, ~bp2,
   "A",  100,  120,
   "B",  140,  115,
   "C",  120,  125
)

What we want is for our data to have 3 variables: id (which already exists), measurement, and value. We can achieve this by pivoting df longer:

df |> 
  pivot_longer(
    cols = bp1:bp2,
    names_to = "measurement",
    values_to = "value"
  )
# A tibble: 6 × 3
  id    measurement value
  <chr> <chr>       <dbl>
1 A     bp1           100
2 A     bp2           120
3 B     bp1           140
4 B     bp2           115
5 C     bp1           120
6 C     bp2           125

Pivoting works like this:

Values in columns that are already variables (here, id) need to be repeated, once for each column that is pivoted.

The column names become values in a new variable, whose name is given by the names_to argument.

The cell values become values in a new variable, whose name is given by the values_to argument. These values are not repeated; the number of values is preserved, and they are simply unwound row by row.
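
To make this concrete, here is a small sketch (my own illustration, not from the chapter) that builds the same long tibble by hand:

# Build the pivoted result manually to show the repeating and unwinding
tibble(
  id          = rep(df$id, each = 2),            # existing variable, repeated once per pivoted column
  measurement = rep(c("bp1", "bp2"), times = 3), # the column names become values
  value       = c(df$bp1[1], df$bp2[1],          # cell values, unwound row by row
                  df$bp1[2], df$bp2[2],
                  df$bp1[3], df$bp2[3])
)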

5.3.3 Many variables in column names

It is challenging when you have multiple pieces of information crammed into the column names. In that case, you would want to store that information into separate new variables.

Here is an example, the who2 dataset, from World Health Organization, about tuberculosis diagnoses. There are two columns that are already variables: country and year.

But then there are 56 columns like sp_m_014, ep_m_4554, and rel_m_3544. Notice that for those columns there's a pattern: each column name is made up of three pieces separated by _.

The first piece, sp/rel/ep, describes the method used for the diagnosis.

The second piece, m/f, is the gender (coded as a binary variable in this dataset).

And the third piece, 014/1524/2534/3544/4554/5564/65, is the age range (014 represents 0-14, for example).
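
As a quick check of that pattern (my own illustration), splitting one column name on _ recovers the three pieces:

# splitting one column name shows the diagnosis, gender, and age pieces
strsplit("sp_m_014", "_")[[1]]
# returns "sp" "m" "014"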

who2
# A tibble: 7,240 × 58
   country      year sp_m_014 sp_m_1524 sp_m_2534 sp_m_3544 sp_m_4554 sp_m_5564
   <chr>       <dbl>    <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
 1 Afghanistan  1980       NA        NA        NA        NA        NA        NA
 2 Afghanistan  1981       NA        NA        NA        NA        NA        NA
 3 Afghanistan  1982       NA        NA        NA        NA        NA        NA
 4 Afghanistan  1983       NA        NA        NA        NA        NA        NA
 5 Afghanistan  1984       NA        NA        NA        NA        NA        NA
 6 Afghanistan  1985       NA        NA        NA        NA        NA        NA
 7 Afghanistan  1986       NA        NA        NA        NA        NA        NA
 8 Afghanistan  1987       NA        NA        NA        NA        NA        NA
 9 Afghanistan  1988       NA        NA        NA        NA        NA        NA
10 Afghanistan  1989       NA        NA        NA        NA        NA        NA
# ℹ 7,230 more rows
# ℹ 50 more variables: sp_m_65 <dbl>, sp_f_014 <dbl>, sp_f_1524 <dbl>,
#   sp_f_2534 <dbl>, sp_f_3544 <dbl>, sp_f_4554 <dbl>, sp_f_5564 <dbl>,
#   sp_f_65 <dbl>, sn_m_014 <dbl>, sn_m_1524 <dbl>, sn_m_2534 <dbl>,
#   sn_m_3544 <dbl>, sn_m_4554 <dbl>, sn_m_5564 <dbl>, sn_m_65 <dbl>,
#   sn_f_014 <dbl>, sn_f_1524 <dbl>, sn_f_2534 <dbl>, sn_f_3544 <dbl>,
#   sn_f_4554 <dbl>, sn_f_5564 <dbl>, sn_f_65 <dbl>, ep_m_014 <dbl>, …

So, together with country and year, that gives us five categories of information. And then there's also the count of patients in each of those categories, which is given through the cell values.

What we want to do is organize these six pieces of information in who2: country and year are already in columns, and we also want columns for the method of diagnosis, the gender category, the age-range category, and the count of patients in each category (currently stored in the cell values).

This can all be organized using pivot_longer(). We will supply a vector of new variable names to the names_to argument, give instructions for splitting the original column names into pieces through the names_sep argument, and, lastly, use the values_to argument to name the variable that stores the counts.

who2 |> 
  pivot_longer(
    cols = !(country:year),
    names_to = c("diagnosis", "gender", "age"), 
    names_sep = "_",
    values_to = "count"
  )
# A tibble: 405,440 × 6
   country      year diagnosis gender age   count
   <chr>       <dbl> <chr>     <chr>  <chr> <dbl>
 1 Afghanistan  1980 sp        m      014      NA
 2 Afghanistan  1980 sp        m      1524     NA
 3 Afghanistan  1980 sp        m      2534     NA
 4 Afghanistan  1980 sp        m      3544     NA
 5 Afghanistan  1980 sp        m      4554     NA
 6 Afghanistan  1980 sp        m      5564     NA
 7 Afghanistan  1980 sp        m      65       NA
 8 Afghanistan  1980 sp        f      014      NA
 9 Afghanistan  1980 sp        f      1524     NA
10 Afghanistan  1980 sp        f      2534     NA
# ℹ 405,430 more rows

An alternative to names_sep is names_pattern, which you can use to extract variables from more complicated naming scenarios, once you’ve learned about regular expressions in Chapter 15.
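
As a sketch (my own, not from the chapter), the same who2 pivot could be written with names_pattern, using one regular-expression group per new variable:

# Equivalent to the names_sep version above, but using a regular expression
who2 |>
  pivot_longer(
    cols = !(country:year),
    names_to = c("diagnosis", "gender", "age"),
    names_pattern = "([a-z]+)_([a-z]+)_([0-9]+)",
    values_to = "count"
  )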

5.3.4 Data and variable names in the column headers

How do we handle scenarios where the column names include a mix of variable values and variable names? Let's look at the household dataset, which contains data about five families, with the names and dates of birth of up to two children.

household
# A tibble: 5 × 5
  family dob_child1 dob_child2 name_child1 name_child2
   <int> <date>     <date>     <chr>       <chr>      
1      1 1998-11-26 2000-01-29 Susan       Jose       
2      2 1996-06-22 NA         Mark        <NA>       
3      3 2002-07-11 2004-04-05 Sam         Seth       
4      4 2004-10-10 2009-08-27 Craig       Khai       
5      5 2000-12-05 2005-02-28 Parker      Gracie     

What is challenging about this dataset is that the column names combine the names of two different variables (dob and name) with the child number. To solve this problem, we again supply a vector to the names_to argument, but this time we use the special ".value" sentinel. This is not a variable name; it is a unique value that tells pivot_longer() to do something different: it overrides the usual values_to argument (which we omit here) and instead uses the first component of the pivoted column name as a variable name in the output.

We also want to use values_drop_na = TRUE, because the shape of the input forces the creation of explicit missing values (as in the case of families who only have one child).

household |> 
  pivot_longer(
    cols = !family, 
    names_to = c(".value", "child"), 
    names_sep = "_", 
    values_drop_na = TRUE
  )
# A tibble: 9 × 4
  family child  dob        name  
   <int> <chr>  <date>     <chr> 
1      1 child1 1998-11-26 Susan 
2      1 child2 2000-01-29 Jose  
3      2 child1 1996-06-22 Mark  
4      3 child1 2002-07-11 Sam   
5      3 child2 2004-04-05 Seth  
6      4 child1 2004-10-10 Craig 
7      4 child2 2009-08-27 Khai  
8      5 child1 2000-12-05 Parker
9      5 child2 2005-02-28 Gracie

The output above illustrates what ".value" does in names_to: the first pieces of the column names (dob, name) become column names in the output, while the remaining piece becomes the values of the child column.

5.4 Widening data

We've just used pivot_longer() to solve the issue of values being stored in column names. Now we will learn to use pivot_wider(), which makes datasets wider by increasing the number of columns and reducing the number of rows. This is especially useful when one observation is spread across multiple rows, which is often the case with government data.

Let's look at the cms_patient_experience dataset, which comes from the Centers for Medicare & Medicaid Services and details patient experiences.

cms_patient_experience
# A tibble: 500 × 5
   org_pac_id org_nm                           measure_cd measure_title prf_rate
   <chr>      <chr>                            <chr>      <chr>            <dbl>
 1 0446157747 USC CARE MEDICAL GROUP INC       CAHPS_GRP… CAHPS for MI…       63
 2 0446157747 USC CARE MEDICAL GROUP INC       CAHPS_GRP… CAHPS for MI…       87
 3 0446157747 USC CARE MEDICAL GROUP INC       CAHPS_GRP… CAHPS for MI…       86
 4 0446157747 USC CARE MEDICAL GROUP INC       CAHPS_GRP… CAHPS for MI…       57
 5 0446157747 USC CARE MEDICAL GROUP INC       CAHPS_GRP… CAHPS for MI…       85
 6 0446157747 USC CARE MEDICAL GROUP INC       CAHPS_GRP… CAHPS for MI…       24
 7 0446162697 ASSOCIATION OF UNIVERSITY PHYSI… CAHPS_GRP… CAHPS for MI…       59
 8 0446162697 ASSOCIATION OF UNIVERSITY PHYSI… CAHPS_GRP… CAHPS for MI…       85
 9 0446162697 ASSOCIATION OF UNIVERSITY PHYSI… CAHPS_GRP… CAHPS for MI…       83
10 0446162697 ASSOCIATION OF UNIVERSITY PHYSI… CAHPS_GRP… CAHPS for MI…       63
# ℹ 490 more rows

The core unit being studied is an organization; the issue is that each organization is spread across six rows, with one row for each measure taken in the survey.

pivot_wider() has the opposite interface to pivot_longer(): instead of choosing new column names, we need to provide the existing columns that define the values (values_from parameter) and the column name (names_from parameter).

In addition, we also need to tell pivot_wider() which column(s) have values that uniquely identify each row. In this case, those are the variables starting with “org”.

cms_patient_experience |> 
  pivot_wider(
    id_cols = starts_with("org"), # tell the function which columns have values that uniquely identify each row
    names_from = measure_cd,
    values_from = prf_rate
  )
# A tibble: 95 × 8
   org_pac_id org_nm CAHPS_GRP_1 CAHPS_GRP_2 CAHPS_GRP_3 CAHPS_GRP_5 CAHPS_GRP_8
   <chr>      <chr>        <dbl>       <dbl>       <dbl>       <dbl>       <dbl>
 1 0446157747 USC C…          63          87          86          57          85
 2 0446162697 ASSOC…          59          85          83          63          88
 3 0547164295 BEAVE…          49          NA          75          44          73
 4 0749333730 CAPE …          67          84          85          65          82
 5 0840104360 ALLIA…          66          87          87          64          87
 6 0840109864 REX H…          73          87          84          67          91
 7 0840513552 SCL H…          58          83          76          58          78
 8 0941545784 GRITM…          46          86          81          54          NA
 9 1052612785 COMMU…          65          84          80          58          87
10 1254237779 OUR L…          61          NA          NA          65          NA
# ℹ 85 more rows
# ℹ 1 more variable: CAHPS_GRP_12 <dbl>

5.4.1 How does pivot_wider() work?

Let's understand how pivot_wider() works by looking at a simpler dataset. Here's a tibble about two patients (with IDs A and B) and their blood pressure measurements.

df <- tribble(
  ~id, ~measurement, ~value,
  "A",        "bp1",    100,
  "B",        "bp1",    140,
  "B",        "bp2",    115, 
  "A",        "bp2",    120,
  "A",        "bp3",    105
)

Now let's use pivot_wider(). What we are doing here is taking the cell values from the value column (supplied to values_from) and the new column names from the measurement column (supplied to names_from).

df |> 
  pivot_wider(
    names_from = measurement,
    values_from = value
  )
# A tibble: 2 × 4
  id      bp1   bp2   bp3
  <chr> <dbl> <dbl> <dbl>
1 A       100   120   105
2 B       140   115    NA

5.5 Summary

This chapter covered tidy data: data that has variables in columns and observations in rows. Tidy data makes working in the tidyverse easier, because it’s a consistent structure understood by most functions.

The main challenge is to transform untidy data into tidy format, which is possible by using functions like pivot_longer() and pivot_wider().

Lab 7 Report: Transformation and Visualization with COVID-19 reporting data

Visualizing COVID-19 cases, deaths, recoveries

This lab covers COVID-19 data. The virus that causes COVID-19 has been renamed SARS-CoV-2 based on phylogenetic data.

Introduction to Johns Hopkins University case tracking data

Johns Hopkins University researchers developed an interactive dashboard that tracks reported COVID-19 cases; the underlying data are available on Github. We will work with this file from the Github repository: csse_covid_19_data > csse_covid_19_daily_reports > 03-11-2020.csv.

The 03-11-2020.csv file has these columns: Province/State, Country/Region, Last Update, Confirmed, Deaths, Recovered, Latitude, Longitude.

Importantly, note that countries like China and the US are listed multiple times in Country/Region because they can have different Province/State values, whereas other countries are listed only once. When working with China or the US, we need to sum these rows to get the total count for the country. Also make sure "/" does not appear in column names, because it will cause errors in ggplot().
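
As a sketch of that summing step (assuming the 03-11-2020.csv daily report has been downloaded to the data folder; that path is an assumption, and this chunk is not run in the lab):

# Read one daily report, rename the "/" columns, and sum Province/State rows per country
daily_report <- read_csv("data/03-11-2020.csv") |>
  rename(Province_State = "Province/State", Country_Region = "Country/Region")

daily_report |>
  filter(Country_Region %in% c("China", "US")) |>
  group_by(Country_Region) |>
  summarise(Confirmed = sum(Confirmed, na.rm = TRUE),
            Deaths = sum(Deaths, na.rm = TRUE),
            Recovered = sum(Recovered, na.rm = TRUE))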

On the Computer

Loading data from a Github repository

Let's download the data from Github just once into your data folder. Set eval = FALSE on this chunk so the download is not re-run every time you render.

AI note

I consulted AI to help me create a folder and download the file into it properly. The file being used from Github is time_series_covid19_confirmed_global.csv.

# create the data folder
dir.create("data", showWarnings = FALSE)

# download the file
download.file(
  url = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv",
  destfile = "data/time_series_covid19_confirmed_global.csv"
)

This reads the data into R, renaming the columns that contain "/".

time_series_confirmed <- read_csv("data/time_series_covid19_confirmed_global.csv")|>
  rename(Province_State = "Province/State", Country_Region = "Country/Region")

Data Tidying: Pivoting

We will be going through Chapter 5: Data Tidying and Pivot.

Because the downloaded data is in wide format and we want long format, let's use pivot_longer().

time_series_confirmed_long <- time_series_confirmed |> 
               pivot_longer(-c(Province_State, Country_Region, Lat, Long),
                            names_to = "Date", values_to = "Confirmed") 

Dates and time

We will also convert the Date column to a proper date format so it is easier to work with in our graphs. This is covered in Chapter 17 (Dates & Times).

# puts dates in yyyy-mm-dd format
time_series_confirmed_long$Date <- mdy(time_series_confirmed_long$Date)

Making graphs from the time series data

For this, we are using time_series_confirmed_long, the data we just converted to long format in the last section.

Let's make a time series graph of the confirmed cases in the US, summing up the individual state data as shown below.

# note that these changes weren't stored under a variable
time_series_confirmed_long |> 
  group_by(Country_Region, Date) |> 
  summarise(Confirmed = sum(Confirmed)) |> 
  filter (Country_Region == "US") |> 
  ggplot(aes(x = Date,  y = Confirmed)) + 
    geom_point(size = 0.2) +
    geom_line() +
    ggtitle("US COVID-19 Confirmed Cases")

Now let’s count up COVID-19 cases from several countries, including the US.

time_series_confirmed_long |> 
    group_by(Country_Region, Date) |> 
    summarise(Confirmed = sum(Confirmed)) |> 
    filter (Country_Region %in% c("China","France","Italy", 
                                "Korea, South", "US")) |> 
    ggplot(aes(x = Date,  y = Confirmed, color = Country_Region)) + 
      geom_point(size = 0.2) +
      geom_line() +
      ggtitle("COVID-19 Confirmed Cases")

Another example: make a new table with daily counts using the lag() function, which retrieves the previous row's value so it can be subtracted from the current row.

AI note

“Explain the logic of how the mutate line is giving daily cases”

The explanation is that mutate() creates a new column called Daily, which holds the difference in COVID-19 cases between "today" and "yesterday": Confirmed (today) minus lag(Confirmed) (yesterday). lag() returns the value of the previous row; for the very first row, which has no preceding row, the default argument supplies the first Confirmed value to serve as "yesterday", so the first difference is zero.

Essentially, each value in the Daily column is the difference in the number of Confirmed cases between the current row and the previous row.
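
A toy illustration of that lag() logic (made-up numbers, not the lab data):

# a cumulative series and its daily differences; the default makes the first difference 0
x <- c(5, 8, 12, 12, 20)
x - lag(x, default = first(x))
# returns 0 3 4 0 8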

time_series_confirmed_long_daily <- time_series_confirmed_long |> 
    group_by(Country_Region, Date) |> 
    summarise(Confirmed = sum(Confirmed)) |> 
  
  # this line creates a new column called Daily, which gives the number of new cases per day.
    mutate(Daily = Confirmed - lag(Confirmed, default = first(Confirmed)))

view(time_series_confirmed_long_daily)

geom_point() - Making the graph using the daily cases, for US data:

time_series_confirmed_long_daily |> 
    filter (Country_Region == "US") |> 
    ggplot(aes(x = Date,  y = Daily, color = Country_Region)) + 
      geom_point(size = 0.3) +
      ggtitle("COVID-19 Confirmed Cases")

geom_line() - Line version of this graph:

time_series_confirmed_long_daily |> 
    filter (Country_Region == "US") |> 
    ggplot(aes(x = Date,  y = Daily, color = Country_Region)) + 
      geom_line() +
      ggtitle("COVID-19 Confirmed Cases")

geom_smooth() - Same graph as above, but adding a fitted curve: geom_smooth() by default will add a LOESS/LOWESS (locally weighted scatterplot smoothing) smoother to the data.

time_series_confirmed_long_daily |> 
    filter (Country_Region == "US") |> 
    ggplot(aes(x = Date,  y = Daily, color = Country_Region)) + 
      geom_smooth() +
      ggtitle("COVID-19 Confirmed Cases")

geom_smooth(method = "gam", se = FALSE) - This adds a curved fit using a generalized additive model (GAM). se = FALSE hides the shaded confidence band.

time_series_confirmed_long_daily |> 
    filter (Country_Region == "US") |> 
    ggplot(aes(x = Date,  y = Daily, color = Country_Region)) + 
      geom_smooth(method = "gam", se = FALSE) +
      ggtitle("COVID-19 Confirmed Cases")

Animated graphs with gganimate

You can animate graphs in R and embed them in your web page. Essentially, gganimate creates a series of frames that are combined into a GIF file. You can also save and download this GIF.

There are some important gganimate functions:

transition_*() - defines how the data should be spread out and how it relates to itself across time

view_*() - defines how the positional scales should change along the animation

shadow_*() - defines how data from other points in time should be presented at the given point in time

enter_*()/exit_*() - defines how new data should appear and how old data should disappear during the course of the animation

ease_aes() - defines how different aesthetics should be eased during transitions

Installing gganimate and gifski

gifski is a package for rendering gganimate output as a GIF file. Use Tools > Install Packages to install gifski and gganimate on Unity.
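
Equivalently, they can be installed from the console (a standard alternative to the menu route):

# install the animation packages once; afterwards only library() calls are needed
install.packages(c("gganimate", "gifski"))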

library(gganimate)
library(gifski)
theme_set(theme_bw())

An animation of the confirmed cases in select countries

This is a gif of daily counts of COVID cases over time, for US data only.

AI note

“Can you clarify how the last two groups of code related to animation are different?”

Code chunk

The code relating to gganimate lines:

seq_along(Date) basically means “make a running sequence from 1 up to however many dates there are.” It’s used so that gganimate knows in what order the points should appear, one for each day.

transition_reveal(Date): It gradually reveals the data points along the x-axis (Date), showing the points and line building up over time.

The code relating to making the animation:

The animate() line actually creates the animated GIF from the ggplot object p. renderer = gifski_renderer() tells it to render the animation as a GIF image (using the gifski package).

end_pause = 15 keeps the final frame visible a bit longer (15 extra frames) before the animation loops, so it doesn't restart too quickly.

The transition_reveal() line is important for gganimate to know what to animate!

# Let's create a variable to hold US-only data from time_series_confirmed_long_daily
daily_counts <- time_series_confirmed_long_daily |> 
      filter (Country_Region == "US")

# make the graph here
p <- ggplot(daily_counts, aes(x = Date,  y = Daily, color = Country_Region)) + 
        geom_point() +
        ggtitle("Confirmed COVID-19 Cases") +
  
# gganimate lines  
        geom_point(aes(group = seq_along(Date))) +
        transition_reveal(Date) 

# make the animation
 animate(p, renderer = gifski_renderer(), end_pause = 15)

Make sure to set eval = FALSE so that the gif is not recreated each time you render your script.

Saving animations:

Use this line: anim_save("daily_counts_US.gif", p).

An animation of confirmed deaths

# This download may take about 5 minutes. You only need to do this once so set `#| eval: false` in your qmd file
download.file(url="https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv", 
  destfile = "data/time_series_covid19_deaths_global.csv")

Data tidying, pivot, and time

Here we read the deaths data into time_series_deaths_confirmed, put it into long format, and store the result as time_series_deaths_long, converting the Date column with mdy() as before.

time_series_deaths_confirmed <- read_csv("data/time_series_covid19_deaths_global.csv")|>
  rename(Province_State = "Province/State", Country_Region = "Country/Region")

time_series_deaths_long <- time_series_deaths_confirmed |> 
    pivot_longer(-c(Province_State, Country_Region, Lat, Long),
        names_to = "Date", values_to = "Confirmed") 

time_series_deaths_long$Date <- mdy(time_series_deaths_long$Date)

Making the animated graph

We are making an animated graph of the cumulative number of deaths for several select countries.

p <- time_series_deaths_long |>
  filter (Country_Region %in% c("US","Canada", "Mexico","Brazil","Egypt","Ecuador","India", "Netherlands", "Germany", "China" )) |>
  
  ggplot(aes(x = Country_Region, y = Confirmed, color = Country_Region)) + 
    geom_point(aes(size = Confirmed)) + 
    transition_time(Date) + 
    labs(title = "Cumulative Deaths: {frame_time}") + 
    ylab("Deaths") +
    theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))

# make the animation
animate(p, renderer = gifski_renderer(), end_pause = 15)

Exercises

Exercise 1

Go through Chapter 5, putting the examples and exercises into the report. (Please see the very beginning of this document.)

Exercise 2

Instead of plotting 5 countries on the same graph as in the example above, use facet_wrap() with scales = "free_y".

In this example I am using facet_wrap() to make separate graphs of COVID-19 confirmed cases for 5 countries. I used ncol = 3 because it fits nicely, following the facet_wrap() example from Lab 6.

time_series_confirmed_long |> 
    group_by(Country_Region, Date) |> 
    summarise(Confirmed = sum(Confirmed)) |> 
    filter (Country_Region %in% c("China","France","Italy", 
                                "Korea, South", "US")) |> 

    ggplot(aes(x = Date,  y = Confirmed, color = Country_Region)) + 
      geom_point() +
      geom_line() + facet_wrap(vars(Country_Region), scales = "free_y", ncol = 3) +
    labs(
      title = "COVID-19 Confirmed Cases",
      x = "Date",
      y = "Confirmed"
    )

Exercise 3

Using the daily count of confirmed cases, make a single graph with 5 countries of your choosing.

To do this I will use the data frame time_series_confirmed_long_daily. The countries I am choosing are: Estonia, Netherlands, Norway, Russia, and Sri Lanka. The x-axis will be Date and the y-axis will be Daily.

# This block is just to let me see all the distinct countries.
#time_series_confirmed_long |> 
#    distinct(Country_Region) |>
#  view()
p <- time_series_confirmed_long_daily |> 
    group_by(Country_Region, Date) |> 
    summarise(Daily = sum(Daily)) |> 
    filter (Country_Region %in% c("Estonia","Netherlands","Norway", 
                                "Russia", "Sri Lanka")) |> 
    ggplot(aes(x = Date,  y = Daily, color = Country_Region)) + 
      geom_point(size = 0.2) +
      geom_line() +
      labs(
      title = "Daily count of COVID-19 Confirmed Cases",
      x = "Date",
      y = "Number of daily confirmed cases",
      color = "Country_Region"
    ) +
  transition_reveal(Date)

# make the animation
animate(p, renderer = gifski_renderer(), end_pause = 15)

Exercise 4

Exercise 5