library(tidyverse)
library(lubridate) # allows you to work with dates
Lab 7: Data Tidying (Ch 5), Transformation and Visualization with COVID-19 reporting data
Chapter 5: Data tidying (Exercise 1 for Lab 7)
5.1 Introduction
Tidy data is a consistent way of organizing data in R.
In this chapter, we’ll focus on tidyr, a package that provides a bunch of tools to help tidy up your messy datasets. tidyr is a member of the core tidyverse.
5.2 Tidy data
There are different ways you can represent the same data. For instance, these datasets have the same four variables (country, year, population, cases), but the values are organized in different ways.
table1
# A tibble: 6 × 4
country year cases population
<chr> <dbl> <dbl> <dbl>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
table2
# A tibble: 12 × 4
country year type count
<chr> <dbl> <chr> <dbl>
1 Afghanistan 1999 cases 745
2 Afghanistan 1999 population 19987071
3 Afghanistan 2000 cases 2666
4 Afghanistan 2000 population 20595360
5 Brazil 1999 cases 37737
6 Brazil 1999 population 172006362
7 Brazil 2000 cases 80488
8 Brazil 2000 population 174504898
9 China 1999 cases 212258
10 China 1999 population 1272915272
11 China 2000 cases 213766
12 China 2000 population 1280428583
table3
# A tibble: 6 × 3
country year rate
<chr> <dbl> <chr>
1 Afghanistan 1999 745/19987071
2 Afghanistan 2000 2666/20595360
3 Brazil 1999 37737/172006362
4 Brazil 2000 80488/174504898
5 China 1999 212258/1272915272
6 China 2000 213766/1280428583
table1 is the easiest to work with in the tidyverse because it is tidy. Tidy data has two advantages: a consistent structure means the same tools can be used everywhere, and placing variables in columns suits R functions, most of which work with vectors of values.
There are three rules that make a dataset tidy:
- Each variable is a column; each column is a variable.
- Each observation is a row; each row is an observation.
- Each value is a cell; each cell is a single value.
dplyr and ggplot2 are other packages in tidyverse that are meant to work with tidy data. Some examples below:
# Compute rate per 10,000
table1 |>
  mutate(rate = cases / population * 10000)
# A tibble: 6 × 5
country year cases population rate
<chr> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan 1999 745 19987071 0.373
2 Afghanistan 2000 2666 20595360 1.29
3 Brazil 1999 37737 172006362 2.19
4 Brazil 2000 80488 174504898 4.61
5 China 1999 212258 1272915272 1.67
6 China 2000 213766 1280428583 1.67
# Compute total cases per year
table1 |>
  group_by(year) |>
  summarize(total_cases = sum(cases))
# A tibble: 2 × 2
year total_cases
<dbl> <dbl>
1 1999 250740
2 2000 296920
# Visualize changes over time
ggplot(table1, aes(x = year, y = cases)) +
  geom_line(aes(group = country), color = "grey50") +
  geom_point(aes(color = country, shape = country)) +
  scale_x_continuous(breaks = c(1999, 2000)) # x-axis breaks at 1999 and 2000
5.2.1 Exercises
Exercise 1
In table1, each row is one observation: a country in a given year. For instance, the first row is Afghanistan, 1999, 745, 19987071, with each value in its own cell under the columns country, year, cases, and population.
In table2, each row holds a single count for a country, year, and type, for instance Afghanistan, 1999, cases, 745 under the columns country, year, type, and count. One country-year observation is therefore spread across two rows (one for cases, one for population).
In table3, each row is again a country in a given year, for instance Afghanistan, 1999, 745/19987071 under the columns country, year, and rate. Here the rate cell holds two values (cases and population) as one piece of text.
Exercise 2
Rate is cases / population, multiplied by 10,000 to get the rate per 10,000 people. For table2, the cases and the population for each country and year sit in separate rows, so I would first bring them together into one row per country and year (the same layout as table1), store that in a new variable, and then divide cases by population.
For table3, I would split the text in the rate column into cases and population, convert both to numbers, divide cases by population, and multiply by 10,000. I would store those values in a new column in a new table.
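One possible way to carry out these two calculations is sketched below. It is my own assumption rather than the chapter's solution: it uses pivot_wider() (covered later in Section 5.4) to rebuild the table1 layout from table2, and separate_wider_delim() (assumed to be available, which requires tidyr 1.3.0 or later) to split the text in table3. The names table2_rate and table3_rate are just illustrative.

```r
library(dplyr)
library(tidyr)

# table2: put cases and population side by side, then compute the rate
table2_rate <- table2 |>
  pivot_wider(names_from = type, values_from = count) |>
  mutate(rate = cases / population * 10000)

# table3: split "cases/population" into two columns and convert to numbers
table3_rate <- table3 |>
  separate_wider_delim(rate, delim = "/", names = c("cases", "population")) |>
  mutate(
    cases = as.numeric(cases),
    population = as.numeric(population),
    rate = cases / population * 10000
  )
```

Both results should have one row per country and year, with the same rate values as the table1 example above.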
5.3 Lengthening data: pivot
In reality, most data aren’t as tidy as you would hope. There are two main reasons: data is often organized for a purpose other than analysis, such as making data entry easier, and most people aren’t familiar with the principles of tidy data.
To help with tidying, you can pivot your data into a tidy form.
tidyr has two functions for pivoting data: pivot_longer() and pivot_wider(). We will look into these functions below.
5.3.1 Data in column names
The billboard dataset records the Billboard rank of songs in the year 2000. In this dataset, each observation is a song. The first 3 columns (artist, track, date.entered) are variables that describe the song.
Additionally, there are 76 columns (wk1-wk76) that describe the rank of the song in each week. Note that here, the column names are one variable (the week), and the cell values are another (the rank).
We will tidy this data using pivot_longer(). These are the arguments:
- cols specifies which columns need to be pivoted (that is, which columns aren’t variables).
- names_to names the variable stored in the column names.
- values_to names the variable stored in the cell values.
We will also clean up the cell values in the week column, using mutate() together with readr::parse_number() to turn strings such as wk1 into numbers.
parse_number() is a handy function that extracts the first number from a string, ignoring all other text.
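As a quick illustration of what parse_number() does to the week labels, here is a minimal sketch on a hand-made character vector (week_labels and week_numbers are names made up for this example):

```r
library(readr)

# Labels shaped like the billboard column names
week_labels <- c("wk1", "wk10", "wk76")

# parse_number() keeps only the first number found in each string
week_numbers <- parse_number(week_labels)
week_numbers
```

The result is a numeric vector, which is why the week column can later be used as a continuous x-axis.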
billboard
# A tibble: 317 × 79
artist track date.entered wk1 wk2 wk3 wk4 wk5 wk6 wk7 wk8
<chr> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2 Pac Baby… 2000-02-26 87 82 72 77 87 94 99 NA
2 2Ge+her The … 2000-09-02 91 87 92 NA NA NA NA NA
3 3 Doors D… Kryp… 2000-04-08 81 70 68 67 66 57 54 53
4 3 Doors D… Loser 2000-10-21 76 76 72 69 67 65 55 59
5 504 Boyz Wobb… 2000-04-15 57 34 25 17 17 31 36 49
6 98^0 Give… 2000-08-19 51 39 34 26 26 19 2 2
7 A*Teens Danc… 2000-07-08 97 97 96 95 100 NA NA NA
8 Aaliyah I Do… 2000-01-29 84 62 51 41 38 35 35 38
9 Aaliyah Try … 2000-03-18 59 53 38 28 21 18 16 14
10 Adams, Yo… Open… 2000-08-26 76 76 74 69 68 67 61 58
# ℹ 307 more rows
# ℹ 68 more variables: wk9 <dbl>, wk10 <dbl>, wk11 <dbl>, wk12 <dbl>,
# wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>, wk17 <dbl>, wk18 <dbl>,
# wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>, wk23 <dbl>, wk24 <dbl>,
# wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>, wk29 <dbl>, wk30 <dbl>,
# wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>, wk35 <dbl>, wk36 <dbl>,
# wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>, wk41 <dbl>, wk42 <dbl>, …
# pivot the billboard data so that it's longer
billboard_longer <- billboard |>
  pivot_longer(
    cols = starts_with("wk"),
    names_to = "week", # The reason for quotes is that we are creating new variables
    values_to = "rank",
    values_drop_na = TRUE # do this to get rid of NA values
  ) |>
  # tidy up the week column, by converting the cell values under that column from characters to numbers
  mutate(
    week = parse_number(week)
  )
billboard_longer
# A tibble: 5,307 × 5
artist track date.entered week rank
<chr> <chr> <date> <dbl> <dbl>
1 2 Pac Baby Don't Cry (Keep... 2000-02-26 1 87
2 2 Pac Baby Don't Cry (Keep... 2000-02-26 2 82
3 2 Pac Baby Don't Cry (Keep... 2000-02-26 3 72
4 2 Pac Baby Don't Cry (Keep... 2000-02-26 4 77
5 2 Pac Baby Don't Cry (Keep... 2000-02-26 5 87
6 2 Pac Baby Don't Cry (Keep... 2000-02-26 6 94
7 2 Pac Baby Don't Cry (Keep... 2000-02-26 7 99
8 2Ge+her The Hardest Part Of ... 2000-09-02 1 91
9 2Ge+her The Hardest Part Of ... 2000-09-02 2 87
10 2Ge+her The Hardest Part Of ... 2000-09-02 3 92
# ℹ 5,297 more rows
Now we have effectively tidied up the billboard data, since we have all week numbers in one variable and all rank values in another. We can move on to visualizing.
billboard_longer |>
  ggplot(aes(x = week, y = rank, group = track)) +
  geom_line(alpha = 0.25) + # transparency
  scale_y_reverse() # reverses y-axis values
5.3.2 How does pivoting work?
Let’s examine how it works. Suppose we have this tibble here (where we use tribble(), a handy function for constructing small tibbles by hand):
df <- tribble(
  ~id, ~bp1, ~bp2,
  "A", 100, 120,
  "B", 140, 115,
  "C", 120, 125
)
What we want is for our data to have 3 variables: id (which already exists), measurement, and value. We can achieve this by pivoting df longer:
df |>
  pivot_longer(
    cols = bp1:bp2,
    names_to = "measurement",
    values_to = "value"
  )
# A tibble: 6 × 3
id measurement value
<chr> <chr> <dbl>
1 A bp1 100
2 A bp2 120
3 B bp1 140
4 B bp2 115
5 C bp1 120
6 C bp2 125
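Pivoting works like this:
The values in a column that was already a variable in the original dataset (id) are repeated, once for each column that is pivoted.
Further, the names of the pivoted columns (bp1, bp2) become the values of a new variable, whose name is given by names_to. They are repeated once for each row in the original dataset.
And, the values_to parameter names the new variable that holds the cell values. They are unwound row by row.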
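To see the mechanics concretely, the same long table can be built by hand and compared with the pivot_longer() result. This is only a sketch for understanding, not how tidyr is implemented; pivoted_cols, manual_long, and auto_long are names invented for the example.

```r
library(tibble)
library(tidyr)

df <- tribble(
  ~id, ~bp1, ~bp2,
  "A", 100, 120,
  "B", 140, 115,
  "C", 120, 125
)

pivoted_cols <- c("bp1", "bp2")

manual_long <- tibble(
  # existing variable: each id is repeated once per pivoted column
  id = rep(df$id, each = length(pivoted_cols)),
  # names_to: the pivoted column names, repeated once per original row
  measurement = rep(pivoted_cols, times = nrow(df)),
  # values_to: the cell values, unwound row by row
  value = as.vector(t(as.matrix(df[pivoted_cols])))
)

auto_long <- df |>
  pivot_longer(
    cols = bp1:bp2,
    names_to = "measurement",
    values_to = "value"
  )
```

Comparing manual_long with auto_long (for example with all.equal()) should show that they hold the same rows in the same order.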
5.3.3 Many variables in column names
It is challenging when you have multiple pieces of information crammed into the column names. In that case, you would want to store that information into separate new variables.
Here is an example: the who2 dataset, from the World Health Organization, about tuberculosis diagnoses. There are two columns that are already variables: country and year.
But then there are 56 other columns, such as sp_m_014, ep_m_4554, and rel_m_3544. Notice that these column names follow a pattern: each one is made up of three pieces separated by _.
The first piece, sp/sn/ep/rel, describes the method used for the diagnosis.
The second piece, m/f, is the gender (coded as a binary variable in this dataset).
And the third piece, 014/1524/2534/3544/4554/5564/65, is the age range (014 represents 0-14, for example).
who2
# A tibble: 7,240 × 58
country year sp_m_014 sp_m_1524 sp_m_2534 sp_m_3544 sp_m_4554 sp_m_5564
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan 1980 NA NA NA NA NA NA
2 Afghanistan 1981 NA NA NA NA NA NA
3 Afghanistan 1982 NA NA NA NA NA NA
4 Afghanistan 1983 NA NA NA NA NA NA
5 Afghanistan 1984 NA NA NA NA NA NA
6 Afghanistan 1985 NA NA NA NA NA NA
7 Afghanistan 1986 NA NA NA NA NA NA
8 Afghanistan 1987 NA NA NA NA NA NA
9 Afghanistan 1988 NA NA NA NA NA NA
10 Afghanistan 1989 NA NA NA NA NA NA
# ℹ 7,230 more rows
# ℹ 50 more variables: sp_m_65 <dbl>, sp_f_014 <dbl>, sp_f_1524 <dbl>,
# sp_f_2534 <dbl>, sp_f_3544 <dbl>, sp_f_4554 <dbl>, sp_f_5564 <dbl>,
# sp_f_65 <dbl>, sn_m_014 <dbl>, sn_m_1524 <dbl>, sn_m_2534 <dbl>,
# sn_m_3544 <dbl>, sn_m_4554 <dbl>, sn_m_5564 <dbl>, sn_m_65 <dbl>,
# sn_f_014 <dbl>, sn_f_1524 <dbl>, sn_f_2534 <dbl>, sn_f_3544 <dbl>,
# sn_f_4554 <dbl>, sn_f_5564 <dbl>, sn_f_65 <dbl>, ep_m_014 <dbl>, …
So the column names carry three pieces of information (method of diagnosis, gender, and age range) on top of country and year, which gives five variables altogether. The cell values add a sixth: the count of patients in each category.
We want to organize these six pieces of information from who2 into six columns. Country and year are already columns; we also want columns for method of diagnosis, gender, age range, and the count of patients.
This can all be done with pivot_longer(). We supply a vector of new column names to the names_to parameter, tell it how to split the original column names into pieces through the names_sep parameter, and use the values_to parameter to name the variable that stores the counts.
who2 |>
  pivot_longer(
    cols = !(country:year),
    names_to = c("diagnosis", "gender", "age"),
    names_sep = "_",
    values_to = "count"
  )
# A tibble: 405,440 × 6
country year diagnosis gender age count
<chr> <dbl> <chr> <chr> <chr> <dbl>
1 Afghanistan 1980 sp m 014 NA
2 Afghanistan 1980 sp m 1524 NA
3 Afghanistan 1980 sp m 2534 NA
4 Afghanistan 1980 sp m 3544 NA
5 Afghanistan 1980 sp m 4554 NA
6 Afghanistan 1980 sp m 5564 NA
7 Afghanistan 1980 sp m 65 NA
8 Afghanistan 1980 sp f 014 NA
9 Afghanistan 1980 sp f 1524 NA
10 Afghanistan 1980 sp f 2534 NA
# ℹ 405,430 more rows
An alternative to names_sep is names_pattern, which you can use to extract variables from more complicated naming scenarios, once you’ve learned about regular expressions in Chapter 15.
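As a rough sketch of what names_pattern could look like for this dataset, the regular expression below uses one capture group per new variable. The pattern itself is my own guess at a suitable expression, not something given in the chapter; for who2 it should give the same result as names_sep = "_".

```r
library(tidyr)

# names_pattern: one capture group (in parentheses) per name in names_to
who2_pattern <- who2 |>
  pivot_longer(
    cols = !(country:year),
    names_to = c("diagnosis", "gender", "age"),
    names_pattern = "([a-z]+)_([mf])_([0-9]+)",
    values_to = "count"
  )

# names_sep version from above, kept for comparison
who2_sep <- who2 |>
  pivot_longer(
    cols = !(country:year),
    names_to = c("diagnosis", "gender", "age"),
    names_sep = "_",
    values_to = "count"
  )
```

names_pattern becomes more useful when the pieces are not separated by one consistent character, which is where names_sep stops working.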
5.3.4 Data and variable names in the column headers
How do we handle column names that mix variable values with variable names? Let’s look at the household dataset, which contains data about five families, with the names and birthdates of up to 2 children.
household
# A tibble: 5 × 5
family dob_child1 dob_child2 name_child1 name_child2
<int> <date> <date> <chr> <chr>
1 1 1998-11-26 2000-01-29 Susan Jose
2 2 1996-06-22 NA Mark <NA>
3 3 2002-07-11 2004-04-05 Sam Seth
4 4 2004-10-10 2009-08-27 Craig Khai
5 5 2000-12-05 2005-02-28 Parker Gracie
What is challenging about this dataset is that the column names contain the names of two variables (dob and name) and the values of another (child, with the values child1 and child2). To solve this problem, we will supply a vector in the names_to parameter, but this time use a special sentinel: “.value”. This is not a variable name; it’s a unique value that tells pivot_longer() to do something different, specifically to override the usual values_to argument (which we’re omitting here) and instead use the first component of the pivoted column name as a variable name in the output.
We also want to use values_drop_na = TRUE. This is because the shape of the input forces the creation of explicit missing values (as in the case of families who only have one child).
household |>
  pivot_longer(
    cols = !family,
    names_to = c(".value", "child"),
    names_sep = "_",
    values_drop_na = TRUE
  )
# A tibble: 9 × 4
family child dob name
<int> <chr> <date> <chr>
1 1 child1 1998-11-26 Susan
2 1 child2 2000-01-29 Jose
3 2 child1 1996-06-22 Mark
4 3 child1 2002-07-11 Sam
5 3 child2 2004-04-05 Seth
6 4 child1 2004-10-10 Craig
7 4 child2 2009-08-27 Khai
8 5 child1 2000-12-05 Parker
9 5 child2 2005-02-28 Gracie
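A possible follow-up step, which is my own addition and not part of the chapter's example, is to turn the child labels into numbers with parse_number(), the same way the week column was cleaned up for the billboard data. household_long is an illustrative name.

```r
library(dplyr)
library(tidyr)
library(readr)

household_long <- household |>
  pivot_longer(
    cols = !family,
    names_to = c(".value", "child"),
    names_sep = "_",
    values_drop_na = TRUE
  ) |>
  # "child1" / "child2" become the numbers 1 / 2
  mutate(child = parse_number(child))
```

With child stored as a number, it can be sorted or filtered like any other numeric column.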
5.4 Widening data
We’ve just used pivot_longer() to solve the issue of values being in column names. Now, we will learn to use pivot_wider(), which makes datasets wider by increasing columns and reducing rows. This is especially useful when one observation is spread across multiple rows, which is often an issue with government data.
Let’s look at the cms_patient_experience dataset, which comes from the Centers for Medicare & Medicaid Services and details patient experiences.
cms_patient_experience
# A tibble: 500 × 5
org_pac_id org_nm measure_cd measure_title prf_rate
<chr> <chr> <chr> <chr> <dbl>
1 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP… CAHPS for MI… 63
2 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP… CAHPS for MI… 87
3 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP… CAHPS for MI… 86
4 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP… CAHPS for MI… 57
5 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP… CAHPS for MI… 85
6 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP… CAHPS for MI… 24
7 0446162697 ASSOCIATION OF UNIVERSITY PHYSI… CAHPS_GRP… CAHPS for MI… 59
8 0446162697 ASSOCIATION OF UNIVERSITY PHYSI… CAHPS_GRP… CAHPS for MI… 85
9 0446162697 ASSOCIATION OF UNIVERSITY PHYSI… CAHPS_GRP… CAHPS for MI… 83
10 0446162697 ASSOCIATION OF UNIVERSITY PHYSI… CAHPS_GRP… CAHPS for MI… 63
# ℹ 490 more rows
The core unit being studied is an organization; the issue is that each organization is spread across six rows, with one row for each measurement taken in the survey organization.
pivot_wider() has the opposite interface to pivot_longer(): instead of choosing new column names, we need to provide the existing columns that define the values (values_from parameter) and the column name (names_from parameter).
In addition, we also need to tell pivot_wider() which column(s) have values that uniquely identify each row. In this case, those are the variables starting with “org”.
cms_patient_experience |>
  pivot_wider(
    id_cols = starts_with("org"), # tell the function which columns have values that uniquely identify each row
    names_from = measure_cd,
    values_from = prf_rate
  )
# A tibble: 95 × 8
org_pac_id org_nm CAHPS_GRP_1 CAHPS_GRP_2 CAHPS_GRP_3 CAHPS_GRP_5 CAHPS_GRP_8
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0446157747 USC C… 63 87 86 57 85
2 0446162697 ASSOC… 59 85 83 63 88
3 0547164295 BEAVE… 49 NA 75 44 73
4 0749333730 CAPE … 67 84 85 65 82
5 0840104360 ALLIA… 66 87 87 64 87
6 0840109864 REX H… 73 87 84 67 91
7 0840513552 SCL H… 58 83 76 58 78
8 0941545784 GRITM… 46 86 81 54 NA
9 1052612785 COMMU… 65 84 80 58 87
10 1254237779 OUR L… 61 NA NA 65 NA
# ℹ 85 more rows
# ℹ 1 more variable: CAHPS_GRP_12 <dbl>
5.4.1 How does pivot_wider() work?
Let’s understand how pivot_wider() works by looking at a simpler dataset, as another example. Here’s our tibble about patients (A or B ID’s) and their blood pressure measurements.
df <- tribble(
  ~id, ~measurement, ~value,
  "A", "bp1", 100,
  "B", "bp1", 140,
  "B", "bp2", 115,
  "A", "bp2", 120,
  "A", "bp3", 105
)
Now let’s use pivot_wider(). What we are doing here is taking the cell values from the value column (the argument for values_from) and the new column names from the measurement column (the argument for names_from).
df |>
  pivot_wider(
    names_from = measurement,
    values_from = value
  )
# A tibble: 2 × 4
id bp1 bp2 bp3
<chr> <dbl> <dbl> <dbl>
1 A 100 120 105
2 B 140 115 NA
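The steps below are a sketch of roughly how pivot_wider() arrives at this shape: it works out the new column names, then the rows, and then fills in the values. The object names new_cols, rows, and skeleton are made up for this illustration, and the actual implementation inside tidyr may differ.

```r
library(dplyr)
library(tibble)

df <- tribble(
  ~id, ~measurement, ~value,
  "A", "bp1", 100,
  "B", "bp1", 140,
  "B", "bp2", 115,
  "A", "bp2", 120,
  "A", "bp3", 105
)

# Step 1: the new column names are the unique values of measurement
new_cols <- df |>
  distinct(measurement) |>
  pull()

# Step 2: the rows come from the unique values of the remaining columns (here, id)
rows <- df |>
  select(-measurement, -value) |>
  distinct()

# Step 3: an empty frame with those rows and columns, which then gets filled in
skeleton <- rows |>
  mutate(bp1 = NA, bp2 = NA, bp3 = NA)
```

Patient B has no bp3 measurement, so that cell of the skeleton is never filled, which is why the output above shows NA there.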
5.5 Summary
This chapter covered tidy data: data that has variables in columns and observations in rows. Tidy data makes working in the tidyverse easier, because it’s a consistent structure understood by most functions.
The main challenge is to transform untidy data into tidy format, which is possible by using functions like pivot_longer() and pivot_wider().
Lab 7 Report: Transformation and Visualization with COVID-19 reporting data
Visualizing COVID-19 cases, deaths, recoveries
This lab covers COVID-19 data. The virus that causes COVID-19 was named SARS-CoV-2 based on phylogenetic analysis.
Introduction to Johns Hopkins University case tracking data
Johns Hopkins University researchers developed an interactive dashboard that tracks reported COVID-19 cases, and the underlying data is available on GitHub. We will work with this data on GitHub: csse_covid_19_data > csse_covid_19_daily_reports > 03-11-2020.csv.
The 03-11-2020.csv file has these columns: Province/State, Country/Region, Last Update, Confirmed, Deaths, Recovered, Latitude, Longitude.
Importantly, note that countries like China and the US are listed multiple times in Country/Region because they can have several Province/State entries, whereas other countries are listed only once. When working with China or the US, we need to sum these rows to get the total count for the country. Avoid using “/” in column names, because it will give an error in ggplot().
On the Computer
Loading data from a Github repository
Let’s download the data from GitHub to the data folder just once. Setting eval = FALSE on the download chunk keeps it from running again every time the document is rendered.
AI note
I consulted AI to help me create a folder and download the file properly in there. The file being used from Github is time_series_covid19_confirmed_global.csv.
# create the data folder
dir.create("data", showWarnings = FALSE)
# download the file
download.file(
  url = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv",
  destfile = "data/time_series_covid19_confirmed_global.csv"
)
This loads the data.
time_series_confirmed <- read_csv("data/time_series_covid19_confirmed_global.csv") |>
  rename(Province_State = "Province/State", Country_Region = "Country/Region")
Data Tidying: Pivoting
We will be going through Chapter 5: Data Tidying and Pivot.
Because the data downloaded is in wide format and we want long format, let’s use pivot_longer.
time_series_confirmed_long <- time_series_confirmed |>
  pivot_longer(-c(Province_State, Country_Region, Lat, Long),
    names_to = "Date", values_to = "Confirmed")
Dates and time
We will also change date format to be easier to work with for our graphs. This is covered in Chapter 17 (Dates & Times).
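As a small illustration of what mdy() does, the sketch below parses two hand-typed strings. It assumes the date column headers in the downloaded file look like month/day/two-digit year (for example 1/22/20); raw_dates and parsed_dates are names made up for the example.

```r
library(lubridate)

# Assumed header format: month/day/two-digit year
raw_dates <- c("1/22/20", "12/31/21")

# mdy() reads the pieces in month-day-year order and returns Date values
parsed_dates <- mdy(raw_dates)
parsed_dates
```

Once the column holds real Date values, ggplot2 can place them on a proper time axis instead of treating them as text labels.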
# puts dates in yyyy-mm-dd format
time_series_confirmed_long$Date <- mdy(time_series_confirmed_long$Date)
Making graphs from the time series data
For this, we are using time_series_confirmed_long, which we reshaped into long format in the last section.
Let’s make a time series graph of the confirmed cases in the US. We first sum the rows for each country and date, then keep only the US, as done below.
# note that these changes weren't stored under a variable
time_series_confirmed_long |>
  group_by(Country_Region, Date) |>
  summarise(Confirmed = sum(Confirmed)) |>
  filter(Country_Region == "US") |>
  ggplot(aes(x = Date, y = Confirmed)) +
  geom_point(size = 0.2) +
  geom_line() +
  ggtitle("US COVID-19 Confirmed Cases")
Now let’s count up COVID-19 cases from several countries, including the US.
time_series_confirmed_long |>
  group_by(Country_Region, Date) |>
  summarise(Confirmed = sum(Confirmed)) |>
  filter(Country_Region %in% c("China", "France", "Italy",
    "Korea, South", "US")) |>
  ggplot(aes(x = Date, y = Confirmed, color = Country_Region)) +
  geom_point(size = 0.2) +
  geom_line() +
  ggtitle("COVID-19 Confirmed Cases")
Another example: make a new table with daily counts using the lag() function, which returns the value from the previous row so that it can be subtracted from the current row.
AI note
“Explain the logic of how the mutate line is giving daily cases”
The explanation is that mutate() creates a new column called Daily holding the difference in confirmed COVID-19 cases between today and yesterday: Confirmed (“today”) minus lag(Confirmed) (“yesterday”). lag() returns the value from the previous row. The very first row for each country has no preceding row, so the default argument supplies the first Confirmed value as “yesterday”, which makes Daily equal to 0 on that row.
Essentially the value in the Daily column calculates the difference in number of Confirmed cases between the current row and the previous row.
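The same logic can be checked on a tiny made-up table. The data below is invented purely for illustration (two pretend countries, A and B, with cumulative counts); toy and toy_daily are hypothetical names.

```r
library(dplyr)

# Invented cumulative counts for two pretend countries
toy <- tibble(
  Country_Region = c("A", "A", "A", "B", "B", "B"),
  Confirmed = c(1, 3, 6, 10, 10, 15)
)

toy_daily <- toy |>
  # grouping makes lag() restart at the first row of each country
  group_by(Country_Region) |>
  mutate(Daily = Confirmed - lag(Confirmed, default = first(Confirmed))) |>
  ungroup()

toy_daily$Daily
# 0 2 3 0 0 5
```

The grouping matters: without group_by(), the first row of country B would be compared with the last row of country A.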
time_series_confirmed_long_daily <- time_series_confirmed_long |>
  group_by(Country_Region, Date) |>
  summarise(Confirmed = sum(Confirmed)) |>
  # this line creates a new column called Daily, which tells how many new cases per day.
  mutate(Daily = Confirmed - lag(Confirmed, default = first(Confirmed)))
view(time_series_confirmed_long_daily)
geom_point() - Making the graph using the daily cases, for US data:
time_series_confirmed_long_daily |>
  filter(Country_Region == "US") |>
  ggplot(aes(x = Date, y = Daily, color = Country_Region)) +
  geom_point(size = 0.3) +
  ggtitle("COVID-19 Confirmed Cases")
geom_line() - Line version of this graph:
time_series_confirmed_long_daily |>
  filter(Country_Region == "US") |>
  ggplot(aes(x = Date, y = Daily, color = Country_Region)) +
  geom_line() +
  ggtitle("COVID-19 Confirmed Cases")
geom_smooth() - Same data as above, but drawn as a fitted curve: for fewer than 1,000 observations, geom_smooth() by default adds a LOESS/LOWESS (Locally Weighted Scatterplot Smoothing) smoother to the data, and for larger datasets it switches to a GAM.
time_series_confirmed_long_daily |>
  filter(Country_Region == "US") |>
  ggplot(aes(x = Date, y = Daily, color = Country_Region)) +
  geom_smooth() +
  ggtitle("COVID-19 Confirmed Cases")
geom_smooth(method = "gam", se = FALSE) - This adds a curve fit using a generalized additive model (GAM). se = FALSE hides the shaded confidence band.
time_series_confirmed_long_daily |>
  filter(Country_Region == "US") |>
  ggplot(aes(x = Date, y = Daily, color = Country_Region)) +
  geom_smooth(method = "gam", se = FALSE) +
  ggtitle("COVID-19 Confirmed Cases")
Animated graphs with gganimate
You can animate graphs in R and have them embedded in your web page. Essentially, gganimate renders a series of frames that are combined into a gif file. You can also save and download this gif.
There are some important gganimate functions:
- transition_*() defines how the data should be spread out and how it relates to itself across time.
- view_*() defines how the positional scales should change along the animation.
- shadow_*() defines how data from other points in time should be presented in the given point in time.
- enter_*()/exit_*() defines how new data should appear and how old data should disappear during the course of the animation.
- ease_aes() defines how different aesthetics should be eased during transitions.
Installing gganimate and gifski
gifski is a package for creating a gif file from gganimate. Use Tools > Install Packages to install gifski and gganimate on Unity.
library(gganimate)
library(gifski)
theme_set(theme_bw())
An animation of the confirmed cases in select countries
This is a gif of daily counts of COVID cases over time, for US data only.
AI note
“Can you clarify how the last two groups of code related to animation are different?”
Code chunk
The code relating to gganimate lines:
seq_along(Date) basically means “make a running sequence from 1 up to however many dates there are.” It’s used so that gganimate knows in what order the points should appear, one for each day.
transition_reveal(Date): It gradually reveals the data points along the x-axis (Date), showing the points and line building up over time.
The code relating to making the animation:
animate(p, renderer = gifski_renderer(), end_pause = 15): This line creates the animated GIF from the ggplot object p. renderer = gifski_renderer() tells it to render the animation as a GIF image (using the gifski package).
end_pause = 15: This keeps the final frame visible a bit longer (15 extra frames) before looping again, so the animation doesn’t restart too fast.
The transition_reveal() line is important for gganimate to know what to animate!
# Let's create a variable to hold US-only data from time_series_confirmed_long_daily
daily_counts <- time_series_confirmed_long_daily |>
  filter(Country_Region == "US")
# make the graph here
p <- ggplot(daily_counts, aes(x = Date, y = Daily, color = Country_Region)) +
  geom_point() +
  ggtitle("Confirmed COVID-19 Cases") +
  # gganimate lines
  geom_point(aes(group = seq_along(Date))) +
  transition_reveal(Date)
# make the animation
animate(p, renderer = gifski_renderer(), end_pause = 15)
Make sure to set eval = FALSE so that the gif is not recreated each time you render your script.
Saving animations:
Use this line: anim_save("daily_counts_US.gif", p).
An animation of confirmed deaths
# This download may take about 5 minutes. You only need to do this once so set `#| eval: false` in your qmd file
download.file(
  url = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv",
  destfile = "data/time_series_covid19_deaths_global.csv"
)
Data tidying, pivot, and time
We read the deaths data into time_series_deaths_confirmed, reshape it into long format so it is easier to work with, and store the result as time_series_deaths_long. Note that the death counts end up in a column named Confirmed.
time_series_deaths_confirmed <- read_csv("data/time_series_covid19_deaths_global.csv") |>
  rename(Province_State = "Province/State", Country_Region = "Country/Region")
time_series_deaths_long <- time_series_deaths_confirmed |>
  pivot_longer(-c(Province_State, Country_Region, Lat, Long),
    names_to = "Date", values_to = "Confirmed")
time_series_deaths_long$Date <- mdy(time_series_deaths_long$Date)
Making the animated graph
We are making an animated graph that shows the cumulative number of deaths for several selected countries.
p <- time_series_deaths_long |>
  filter(Country_Region %in% c("US", "Canada", "Mexico", "Brazil", "Egypt", "Ecuador", "India", "Netherlands", "Germany", "China")) |>
  ggplot(aes(x = Country_Region, y = Confirmed, color = Country_Region)) +
  geom_point(aes(size = Confirmed)) +
  transition_time(Date) +
  labs(title = "Cumulative Deaths: {frame_time}") +
  ylab("Deaths") +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))
# make the animation
animate(p, renderer = gifski_renderer(), end_pause = 15)
Exercises
Exercise 1
Go through Chapter 5, putting the examples & exercises into report. (Please see the very beginning).
Exercise 2
Instead of making a graph of 5 countries on the same graph as in the above example, use facet_wrap with scales = "free_y".
In this example I am using facet_wrap to give each of the 5 countries its own panel of COVID-19 confirmed cases. I used ncol = 3 because it fit nicely and because I used Lab 6, where I also set ncol = 3, as my reference for facet_wrap.
time_series_confirmed_long |>
  group_by(Country_Region, Date) |>
  summarise(Confirmed = sum(Confirmed)) |>
  filter(Country_Region %in% c("China", "France", "Italy",
    "Korea, South", "US")) |>
  ggplot(aes(x = Date, y = Confirmed, color = Country_Region)) +
  geom_point() +
  geom_line() +
  facet_wrap(vars(Country_Region), scales = "free_y", ncol = 3) +
  labs(
    title = "COVID-19 Confirmed Cases",
    x = "Date",
    y = "Confirmed"
  )
Exercise 3
Using the daily count of confirmed cases, make a single graph with 5 countries of your choosing.
To do this I will use the data frame time_series_confirmed_long_daily. The countries I am choosing are: Estonia, Netherlands, Norway, Russia, and Sri Lanka. The x-axis will be Date and the y-axis will be Daily.
# This block is just to let me see all the distinct countries.
# time_series_confirmed_long |>
#   distinct(Country_Region) |>
#   view()
p <- time_series_confirmed_long_daily |>
  group_by(Country_Region, Date) |>
  summarise(Daily = sum(Daily)) |>
  filter(Country_Region %in% c("Estonia", "Netherlands", "Norway",
    "Russia", "Sri Lanka")) |>
  ggplot(aes(x = Date, y = Daily, color = Country_Region)) +
  geom_point(size = 0.2) +
  geom_line() +
  labs(
    title = "Daily count of COVID-19 Confirmed Cases",
    x = "Date",
    y = "Number of daily confirmed cases",
    color = "Country_Region"
  ) +
  transition_reveal(Date)
# make the animation
animate(p, renderer = gifski_renderer(), end_pause = 15)