<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>R | JLaw&#39;s R Blog</title>
    <link>https://jlaw.netlify.app/category/r/</link>
      <atom:link href="https://jlaw.netlify.app/category/r/index.xml" rel="self" type="application/rss+xml" />
    <description>R</description>
    <generator>Source Themes Academic (https://sourcethemes.com/academic/)</generator><language>en-us</language><copyright>© JLaw 2023</copyright><lastBuildDate>Mon, 04 Dec 2023 00:00:00 +0000</lastBuildDate>
    <image>
      <url>https://jlaw.netlify.app/images/icon_hu760fc316ca52fa0336457759fdc853c8_70171_512x512_fill_lanczos_center_2.png</url>
      <title>R</title>
      <link>https://jlaw.netlify.app/category/r/</link>
    </image>
    
    <item>
      <title>Are Birth Dates Still Destiny for Canadian NHL Players?</title>
      <link>https://jlaw.netlify.app/2023/12/04/are-birth-dates-still-destiny-for-canadian-nhl-players/</link>
      <pubDate>Mon, 04 Dec 2023 00:00:00 +0000</pubDate>
      <guid>https://jlaw.netlify.app/2023/12/04/are-birth-dates-still-destiny-for-canadian-nhl-players/</guid>
      <description>


&lt;p&gt;In the first chapter of &lt;a href=&#34;https://www.amazon.com/Outliers-Story-Success-Malcolm-Gladwell/dp/0316017930&#34;&gt;Malcolm Gladwell’s Outliers&lt;/a&gt;, he discusses how Canadian junior hockey players are disproportionately likely to be born in the first quarter of the year. In his words:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;gladwell.PNG&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Because these kids are older within their year, they make all the important teams at a young age, which gets them better resources for skill development and so on.&lt;/p&gt;
&lt;p&gt;While it seems clear that more players are born in the first few months of the year, what isn’t explored is whether this would be expected anyway. Maybe more people in Canada &lt;strong&gt;in general&lt;/strong&gt; are born earlier in the year.&lt;/p&gt;
&lt;p&gt;I will explore whether Gladwell’s result is expected as well as whether this is still true in today’s NHL for Canadian-born players.&lt;/p&gt;
&lt;p&gt;To answer these questions I will download data on birth rates from Statistics Canada as well as player roster data from the NHL’s API.&lt;/p&gt;
&lt;p&gt;This analysis will leverage the &lt;code&gt;httr&lt;/code&gt; package to download the data, &lt;code&gt;tidyverse&lt;/code&gt; for data manipulation, and &lt;code&gt;ggtext&lt;/code&gt;/&lt;code&gt;ggimage&lt;/code&gt;/&lt;code&gt;scales&lt;/code&gt; for visualization.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidyverse)
library(httr)
library(scales)
library(ggimage)
library(ggtext)&lt;/code&gt;&lt;/pre&gt;
&lt;div id=&#34;section-1-what-is-the-distribution-of-births-by-month-in-canada&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Section 1: What is the distribution of births by month in Canada?&lt;/h2&gt;
&lt;p&gt;Gladwell’s thesis is that more Canadian junior hockey players are born earlier in the year because of the way cut-off dates are set for youth hockey. I think he is correct, but what if most people in Canada are born at the beginning of the year? Then this pattern might be representative of the population rather than an outlier effect.&lt;/p&gt;
&lt;p&gt;Information about births by month in Canada can be found at &lt;a href=&#34;https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=1310041501&#34;&gt;Statistics Canada&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;canadaBirth.PNG&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Initially I tried to web-scrape the table using &lt;code&gt;rvest&lt;/code&gt; but could not figure out a way to deal with the “Number” row. Since the data could be downloaded as a CSV file, my alternative was to use &lt;code&gt;httr&lt;/code&gt; to send a request to the download link and grab the file. The URL was found by using the inspect option in Firefox when clicking the download link.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;canada_raw &amp;lt;- GET(&amp;#39;https://www150.statcan.gc.ca/t1/tbl1/en/dtl!downloadDbLoadingData-nonTraduit.action?pid=1310041501&amp;amp;latestN=0&amp;amp;startDate=19910101&amp;amp;endDate=20220101&amp;amp;csvLocale=en&amp;amp;selectedMembers=%5B%5B1%5D%2C%5B%5D%2C%5B1%5D%5D&amp;amp;checkedLevels=1D1%2C1D2&amp;#39;) %&amp;gt;%
  content()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;GET()&lt;/code&gt; command sends the request to the server and the &lt;code&gt;content()&lt;/code&gt; function extracts the response body. Without &lt;code&gt;content()&lt;/code&gt;, the returned object also contains a lot of additional information about the call, such as headers, the request URL, etc.&lt;/p&gt;
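&lt;p&gt;As a quick illustration of that extra information, the full response object can be inspected before pulling out the body. This is a minimal sketch, using the public httpbin.org test service rather than the Statistics Canada URL, with standard &lt;code&gt;httr&lt;/code&gt; accessors:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;resp &amp;lt;- GET(&amp;#39;https://httpbin.org/get&amp;#39;)

status_code(resp)                  # the HTTP status code, e.g. 200 on success
headers(resp)[[&amp;#39;content-type&amp;#39;]]   # one of the response headers
resp$url                           # the request URL
content(resp)                      # just the parsed body&lt;/code&gt;&lt;/pre&gt;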
&lt;p&gt;The raw data contains many columns that are either duplicative or unnecessary for this analysis:&lt;/p&gt;
&lt;table style=&#34;width:100%;&#34;&gt;
&lt;colgroup&gt;
&lt;col width=&#34;4%&#34; /&gt;
&lt;col width=&#34;18%&#34; /&gt;
&lt;col width=&#34;7%&#34; /&gt;
&lt;col width=&#34;10%&#34; /&gt;
&lt;col width=&#34;10%&#34; /&gt;
&lt;col width=&#34;3%&#34; /&gt;
&lt;col width=&#34;3%&#34; /&gt;
&lt;col width=&#34;6%&#34; /&gt;
&lt;col width=&#34;4%&#34; /&gt;
&lt;col width=&#34;4%&#34; /&gt;
&lt;col width=&#34;5%&#34; /&gt;
&lt;col width=&#34;3%&#34; /&gt;
&lt;col width=&#34;3%&#34; /&gt;
&lt;col width=&#34;3%&#34; /&gt;
&lt;col width=&#34;5%&#34; /&gt;
&lt;col width=&#34;4%&#34; /&gt;
&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;right&#34;&gt;REF_DATE&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;GEO&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;DGUID&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;Month of birth&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;Characteristics&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;UOM&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;UOM_ID&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;SCALAR_FACTOR&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;SCALAR_ID&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;VECTOR&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;COORDINATE&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;VALUE&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;STATUS&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;SYMBOL&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;TERMINATED&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;DECIMALS&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;right&#34;&gt;1991&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Canada, place of residence of mother&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2016A000011124&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Total, month of birth&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Number of live births&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Number&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;223&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;units&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;v21400536&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;1.1.1&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;403816&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NA&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NA&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NA&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;right&#34;&gt;1992&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Canada, place of residence of mother&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2016A000011124&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Total, month of birth&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Number of live births&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Number&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;223&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;units&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;v21400536&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;1.1.1&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;399109&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NA&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NA&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NA&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;I pulled the data for 1991 through 2022; each year has a total row as well as a row for each individual month. To clean up this data I filter out the total rows with &lt;code&gt;str_detect()&lt;/code&gt;, keep only the &lt;em&gt;REF_DATE&lt;/em&gt; for the year, extract the month using &lt;code&gt;str_extract()&lt;/code&gt;, and keep &lt;em&gt;VALUE&lt;/em&gt;, which is the actual number of births.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;canada_births &amp;lt;- canada_raw %&amp;gt;%
  filter(!str_detect(`Month of birth`, &amp;#39;Total&amp;#39;)) %&amp;gt;%
  transmute(
    REF_DATE,
    MONTH = str_extract(`Month of birth`, &amp;#39;Month of birth, (\\w+)&amp;#39;, 1),
    VALUE
  ) %&amp;gt;% 
  group_by(MONTH) %&amp;gt;% 
  summarize(country_births = sum(VALUE)) %&amp;gt;% 
  mutate(country_pct = country_births/sum(country_births))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The distribution can then be calculated with &lt;code&gt;dplyr&lt;/code&gt; functions. The true distribution of birth month in Canada vs. the expected distribution if every day had an equal chance is shown below:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;canada_births %&amp;gt;%
  transmute(
    `Canada %` = country_pct,
    `Expected % from Days in Month` = case_when(
      MONTH %in% c(&amp;#39;April&amp;#39;, &amp;#39;June&amp;#39;, &amp;#39;September&amp;#39;, &amp;#39;November&amp;#39;) ~ 30/365,
      MONTH == &amp;#39;February&amp;#39; ~ 28/365,
      TRUE ~ 31/365,
    ),
    `Difference` = `Canada %` - `Expected % from Days in Month`,
    month_id = factor(MONTH, levels = c(&amp;#39;January&amp;#39;, &amp;#39;February&amp;#39;, &amp;#39;March&amp;#39;, &amp;#39;April&amp;#39;,
                                        &amp;#39;May&amp;#39;, &amp;#39;June&amp;#39;, &amp;#39;July&amp;#39;, &amp;#39;August&amp;#39;,
                                        &amp;#39;September&amp;#39;, &amp;#39;October&amp;#39;, &amp;#39;November&amp;#39;, &amp;#39;December&amp;#39;))
  ) %&amp;gt;% 
  gather(lbl, value, -month_id) %&amp;gt;% 
  spread(month_id, value) %&amp;gt;%
  mutate(
    lbl = factor(lbl, levels = c(&amp;#39;Canada %&amp;#39;, &amp;#39;Expected % from Days in Month&amp;#39;, &amp;#39;Difference&amp;#39;)),
    across(January:December, ~percent(.x, accuracy = .1))) %&amp;gt;%
  arrange(lbl) %&amp;gt;% 
  knitr::kable(col.names = c(&amp;quot;&amp;quot;, names(.)[-1]))&lt;/code&gt;&lt;/pre&gt;
&lt;table style=&#34;width:100%;&#34;&gt;
&lt;colgroup&gt;
&lt;col width=&#34;25%&#34; /&gt;
&lt;col width=&#34;6%&#34; /&gt;
&lt;col width=&#34;7%&#34; /&gt;
&lt;col width=&#34;5%&#34; /&gt;
&lt;col width=&#34;5%&#34; /&gt;
&lt;col width=&#34;4%&#34; /&gt;
&lt;col width=&#34;4%&#34; /&gt;
&lt;col width=&#34;4%&#34; /&gt;
&lt;col width=&#34;5%&#34; /&gt;
&lt;col width=&#34;8%&#34; /&gt;
&lt;col width=&#34;6%&#34; /&gt;
&lt;col width=&#34;7%&#34; /&gt;
&lt;col width=&#34;7%&#34; /&gt;
&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;January&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;February&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;March&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;April&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;May&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;June&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;July&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;August&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;September&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;October&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;November&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;December&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Canada %&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;8.0%&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;7.5%&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;8.5%&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;8.4%&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;8.8%&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;8.5%&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;8.9%&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;8.7%&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;8.7%&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;8.4%&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;7.8%&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;7.8%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Expected % from Days in Month&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;8.5%&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;7.7%&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;8.5%&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;8.2%&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;8.5%&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;8.2%&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;8.5%&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;8.5%&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;8.2%&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;8.5%&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;8.2%&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;8.5%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Difference&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;-0.5%&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;-0.1%&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;0.0%&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;0.2%&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;0.3%&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;0.3%&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;0.4%&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;0.2%&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;0.5%&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;-0.1%&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;-0.4%&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;-0.7%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;At first glance, Canadians seem &lt;strong&gt;less&lt;/strong&gt; likely to be born at the beginning of the year (particularly January and February) than a uniform distribution would suggest. They’re also less likely to be born at the end of the year.&lt;/p&gt;
&lt;p&gt;Let’s see what the Canadian NHL players look like:&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;section-2-what-is-the-difstribution-of-births-by-month-for-canadian-nhl-players&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Section 2: What is the distribution of births by month for Canadian NHL players?&lt;/h2&gt;
&lt;p&gt;To get the information about the NHL players I will use &lt;code&gt;httr&lt;/code&gt; to query the NHL’s API. My original version of this analysis used the &lt;code&gt;nhlapi&lt;/code&gt; package, which is on CRAN, but the NHL changed their API at some point in the last few months, so that package no longer functions.&lt;/p&gt;
&lt;p&gt;Getting the 2023-2024 team rosters can be done through the API endpoint &lt;code&gt;https://api-web.nhle.com/v1/roster/{team}/20232024&lt;/code&gt; where &lt;code&gt;{team}&lt;/code&gt; is a three-character code representing an individual team. To get the rosters for each team I need to first get the codes for each team.&lt;/p&gt;
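&lt;p&gt;For example, the URL for a single team can be built by filling the &lt;code&gt;{team}&lt;/code&gt; placeholder with &lt;code&gt;glue&lt;/code&gt; (using &lt;code&gt;TOR&lt;/code&gt; here purely as an illustrative code):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Fill the {team} placeholder in the endpoint template
glue::glue(&amp;#39;https://api-web.nhle.com/v1/roster/{team}/20232024&amp;#39;, team = &amp;#39;TOR&amp;#39;)
## https://api-web.nhle.com/v1/roster/TOR/20232024&lt;/code&gt;&lt;/pre&gt;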
&lt;p&gt;This is going to involve a bunch of JSON manipulation, which is new to me, so there is probably a more elegant solution.&lt;/p&gt;
&lt;p&gt;All information on NHL teams can be retrieved from the &lt;code&gt;https://api.nhle.com/stats/rest/en/team&lt;/code&gt; endpoint. Using the same &lt;code&gt;GET()&lt;/code&gt; / &lt;code&gt;content()&lt;/code&gt; pattern from the prior section, I can get all the team information:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;teams &amp;lt;- GET(&amp;#39;https://api.nhle.com/stats/rest/en/team&amp;#39;) %&amp;gt;% 
  content()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This comes back as a list with two items: “data”, which contains all the useful information, and “total”, which contains the number of elements returned in “data”. I just need the “data” piece.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;teams &amp;lt;- teams %&amp;gt;% 
  .[[&amp;#39;data&amp;#39;]] &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now “teams” is a list with 59 elements, each containing information (id, franchiseId, fullName, leagueId, rawTricode, triCode) about a team.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;teams[1:3] %&amp;gt;% jsonlite::toJSON(auto_unbox = T) %&amp;gt;% jsonlite::prettify()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [
##     {
##         &amp;quot;id&amp;quot;: 11,
##         &amp;quot;franchiseId&amp;quot;: 35,
##         &amp;quot;fullName&amp;quot;: &amp;quot;Atlanta Thrashers&amp;quot;,
##         &amp;quot;leagueId&amp;quot;: 133,
##         &amp;quot;rawTricode&amp;quot;: &amp;quot;ATL&amp;quot;,
##         &amp;quot;triCode&amp;quot;: &amp;quot;ATL&amp;quot;
##     },
##     {
##         &amp;quot;id&amp;quot;: 34,
##         &amp;quot;franchiseId&amp;quot;: 26,
##         &amp;quot;fullName&amp;quot;: &amp;quot;Hartford Whalers&amp;quot;,
##         &amp;quot;leagueId&amp;quot;: 133,
##         &amp;quot;rawTricode&amp;quot;: &amp;quot;HFD&amp;quot;,
##         &amp;quot;triCode&amp;quot;: &amp;quot;HFD&amp;quot;
##     },
##     {
##         &amp;quot;id&amp;quot;: 31,
##         &amp;quot;franchiseId&amp;quot;: 15,
##         &amp;quot;fullName&amp;quot;: &amp;quot;Minnesota North Stars&amp;quot;,
##         &amp;quot;leagueId&amp;quot;: 133,
##         &amp;quot;rawTricode&amp;quot;: &amp;quot;MNS&amp;quot;,
##         &amp;quot;triCode&amp;quot;: &amp;quot;MNS&amp;quot;
##     }
## ]
## &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Ultimately I want to restructure this set of nested lists into a rectangular format. The way I’ll do this is create a tibble of list columns using &lt;code&gt;tibble()&lt;/code&gt; and then &lt;code&gt;tidyr::unnest_wider&lt;/code&gt; to turn each element of a list-column into its own column.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;teams &amp;lt;- teams %&amp;gt;% 
  tibble(data = .) %&amp;gt;% 
  unnest_wider(data)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now everything is in a much more legible format:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;right&#34;&gt;id&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;franchiseId&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;fullName&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;leagueId&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;rawTricode&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;triCode&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;right&#34;&gt;11&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;35&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Atlanta Thrashers&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;133&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;ATL&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;ATL&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;right&#34;&gt;34&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;26&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Hartford Whalers&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;133&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;HFD&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;HFD&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;right&#34;&gt;31&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;15&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Minnesota North Stars&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;133&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;MNS&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;MNS&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;That was all just to get the 3-character codes needed to actually request the rosters. Since a separate call is made to the roster endpoint for each team, this is a good opportunity to create a function. Then I can use &lt;code&gt;purrr::map_dfr&lt;/code&gt; to iterate through the team codes and combine all the rosters together.&lt;/p&gt;
&lt;p&gt;The function will take a team code as input and extract each player’s first name, last name, birth date, and birth country.&lt;/p&gt;
&lt;p&gt;The data structure returned from the roster endpoint is a list with elements for forwards, defensemen, and goalies. The data for each player looks like:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;GET(glue::glue(&amp;#39;https://api-web.nhle.com/v1/roster/NJD/20232024&amp;#39;)) %&amp;gt;% 
    content() %&amp;gt;%
    .[[&amp;#39;forwards&amp;#39;]] %&amp;gt;%
    .[[1]] %&amp;gt;%
    jsonlite::toJSON(auto_unbox = T, pretty = T) %&amp;gt;% 
    jsonlite::prettify()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## {
##     &amp;quot;id&amp;quot;: 8479414,
##     &amp;quot;headshot&amp;quot;: &amp;quot;https://assets.nhle.com/mugs/nhl/20232024/NJD/8479414.png&amp;quot;,
##     &amp;quot;firstName&amp;quot;: {
##         &amp;quot;default&amp;quot;: &amp;quot;Nathan&amp;quot;
##     },
##     &amp;quot;lastName&amp;quot;: {
##         &amp;quot;default&amp;quot;: &amp;quot;Bastian&amp;quot;
##     },
##     &amp;quot;sweaterNumber&amp;quot;: 14,
##     &amp;quot;positionCode&amp;quot;: &amp;quot;R&amp;quot;,
##     &amp;quot;shootsCatches&amp;quot;: &amp;quot;R&amp;quot;,
##     &amp;quot;heightInInches&amp;quot;: 76,
##     &amp;quot;weightInPounds&amp;quot;: 205,
##     &amp;quot;heightInCentimeters&amp;quot;: 193,
##     &amp;quot;weightInKilograms&amp;quot;: 93,
##     &amp;quot;birthDate&amp;quot;: &amp;quot;1997-12-06&amp;quot;,
##     &amp;quot;birthCity&amp;quot;: {
##         &amp;quot;default&amp;quot;: &amp;quot;Kitchener&amp;quot;
##     },
##     &amp;quot;birthCountry&amp;quot;: &amp;quot;CAN&amp;quot;,
##     &amp;quot;birthStateProvince&amp;quot;: {
##         &amp;quot;default&amp;quot;: &amp;quot;ON&amp;quot;
##     }
## }
## &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To get only the data I want, I’ll (1) pass a team code into the function to call the API with &lt;code&gt;GET()&lt;/code&gt; and &lt;code&gt;content()&lt;/code&gt;, (2) use &lt;code&gt;flatten()&lt;/code&gt; to remove the forwards/defensemen/goalies level so all the players form one nested list, (3) turn the data into a tibble of list-columns with &lt;code&gt;tibble()&lt;/code&gt;, and (4) use the &lt;code&gt;tidyr::hoist()&lt;/code&gt; function to pull only the items I want from the structure. Finally, I use &lt;code&gt;transmute&lt;/code&gt; to add the 3-character input to the results and to exclude the &lt;em&gt;data&lt;/em&gt; list-column.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;get_roster &amp;lt;- function(team){
  GET(glue::glue(&amp;#39;https://api-web.nhle.com/v1/roster/{team}/20232024&amp;#39;)) %&amp;gt;% 
  content() %&amp;gt;%
  flatten() %&amp;gt;%
  tibble(data = .) %&amp;gt;%
  hoist(&amp;#39;data&amp;#39;,
    &amp;#39;firstName&amp;#39; = list(&amp;#39;firstName&amp;#39;, 1L),
    &amp;#39;lastName&amp;#39; = list(&amp;#39;lastName&amp;#39;, 1L),
    &amp;#39;birthDate&amp;#39;,
    &amp;#39;birthCountry&amp;#39;
  ) %&amp;gt;% 
  transmute(team = team, firstName, lastName, birthDate, birthCountry)
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Within &lt;code&gt;hoist()&lt;/code&gt;, the &lt;code&gt;list(&#39;firstName&#39;, 1L)&lt;/code&gt; construction avoids having to pull the “default” sub-item within firstName by name; it simply grabs the first element within the firstName item. Since birthDate and birthCountry have no sub-items, there is no need to do that for those fields.&lt;/p&gt;
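&lt;p&gt;To see this behavior outside of the API call, here is a small standalone sketch of &lt;code&gt;hoist()&lt;/code&gt; on a made-up list that mimics the roster structure (the player values are hypothetical):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidyverse)

toy &amp;lt;- list(
  list(firstName = list(default = &amp;#39;Nathan&amp;#39;), birthDate = &amp;#39;1997-12-06&amp;#39;),
  list(firstName = list(default = &amp;#39;Alex&amp;#39;),   birthDate = &amp;#39;2000-01-15&amp;#39;)
)

tibble(data = toy) %&amp;gt;%
  hoist(&amp;#39;data&amp;#39;,
        firstName = list(&amp;#39;firstName&amp;#39;, 1L),  # first element inside the firstName item
        &amp;#39;birthDate&amp;#39;)                        # no sub-items, so pull it directly&lt;/code&gt;&lt;/pre&gt;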
&lt;p&gt;Finally, to get all the players from all the teams, I use &lt;code&gt;purrr::map_dfr()&lt;/code&gt; to iterate through the team codes and run my function. There is a filter to remove any rows with a missing firstName field because the team endpoint returns information for all historical teams (e.g., Atlanta Thrashers, Hartford Whalers, etc.). Since these teams are not active in 2023-2024, they are returned by the API but the fields I want don’t populate.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;all_roster &amp;lt;- map_dfr(teams$triCode, get_roster) %&amp;gt;%
  filter(!is.na(firstName))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now we have a dataset of the 774 players in the NHL. This number is slightly larger than the expected number of NHL players (736 = 23 players * 32 teams), so there is likely some nuance to how a roster player is determined, but it shouldn’t matter for this analysis.&lt;/p&gt;
&lt;p&gt;I’ll only look at Canadian players, because I have no idea whether the cut-offs that apply in Canada also apply in other countries. I’ll also do some data cleanup on birth months and calculate the player distribution.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;canada_players &amp;lt;- all_roster %&amp;gt;% 
  filter(birthCountry == &amp;#39;CAN&amp;#39;) %&amp;gt;% 
  mutate(
    mob = month(ymd(birthDate), label = T, abbr = F),
    mob_id = month(ymd(birthDate))
  ) %&amp;gt;% 
  count(mob_id, mob, name = &amp;quot;players&amp;quot;) %&amp;gt;%
  mutate(player_pct = players/sum(players))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now we have the distribution of birth months for the 314 Canadian NHL players.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;section-3-putting-it-all-together&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Section 3: Putting it all together&lt;/h2&gt;
&lt;p&gt;The last section combines the Canada birth month data from Section 1 with the Canadian NHL player data from Section 2 and makes a pretty visualization.&lt;/p&gt;
&lt;p&gt;First I combine the data and create a field for the percentage of births you’d expect if every day were equally likely (e.g., if January has 31 days then there is a 31/365 chance of being randomly born in January):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;combined &amp;lt;- canada_players %&amp;gt;%
  left_join(canada_births, by = c(&amp;#39;mob&amp;#39; = &amp;#39;MONTH&amp;#39;)) %&amp;gt;%
  #Put in random value
  mutate(
    random = case_when(
      mob_id %in% c(4, 6, 9, 11) ~ 30/365,
      mob_id %in% c(1, 3, 5, 7, 8, 10, 12) ~ 31/365,
      mob_id == 2 ~ 28/365
    )
  )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For the visualization I’m going to use the &lt;code&gt;ggimage&lt;/code&gt; package to draw icons of the Canadian flag and the NHL logo. Since this package can render images from a URL, I’ll create variables for those URLs.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;NHL_ICON &amp;lt;- &amp;quot;https://pbs.twimg.com/media/F9sTTAYakAAkRv6.png&amp;quot;
CANADA_ICON &amp;lt;- &amp;quot;https://cdn-icons-png.flaticon.com/512/5372/5372678.png&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Finally, a combination of &lt;code&gt;ggplot&lt;/code&gt;, &lt;code&gt;ggtext&lt;/code&gt;, and &lt;code&gt;ggimage&lt;/code&gt; is used to create the visualization.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(combined, aes(x = fct_reorder(mob, -mob_id))) + 
  geom_line(aes(y = random, group = 1), lty = 2, color = &amp;#39;grey60&amp;#39;) + 
  geom_linerange(aes(ymin = country_pct, ymax = player_pct)) + 
  geom_image(aes(image = NHL_ICON, y = player_pct), size = .08) + 
  geom_image(aes(image = CANADA_ICON, y = country_pct), size = .07) + 
  geom_text(aes(label = percent(player_pct, accuracy = .1), 
                y = if_else(player_pct &amp;gt; country_pct, player_pct + .004, player_pct - .004))) + 
  geom_text(aes(label = percent(country_pct, accuracy = .1), 
                y = if_else(country_pct &amp;gt; player_pct, country_pct + .004, country_pct - .004))) +
  annotate(
    &amp;#39;curve&amp;#39;,
    xend = 2.3,
    x = 1.5,
    yend = .084,
    y = .10,
    curvature = .25,
    arrow = arrow(
      length = unit(7, &amp;quot;pt&amp;quot;),
      type = &amp;quot;closed&amp;quot;
    )) + 
  annotate(
    &amp;#39;richtext&amp;#39;,
    x = 1,
    y = .105,
    label = &amp;quot;The grey line is the expected % of births&amp;lt;br /&amp;gt;if birth month was completely random&amp;quot;,
    size = 4
  ) + 
  scale_y_continuous(labels = percent) + 
  coord_flip() + 
  labs(x = &amp;quot;Month of Birth&amp;quot;, y = &amp;quot;Percentage of Births (%)&amp;quot;,
       title = &amp;quot;Are Canadian NHL Players More Likely to be Born Early in the Year?&amp;quot;,
       subtitle = &amp;#39;Comparing the distribution of birth months between Canadian NHL players and Canada in general &amp;#39;,
       caption = glue::glue(&amp;quot;&amp;lt;img src = {NHL_ICON} width = &amp;#39;15&amp;#39; height=&amp;#39; 15&amp;#39; /&amp;gt; - Canadian NHL Players Birth Month Distribution &amp;lt;br /&amp;gt;
                            &amp;lt;img src = {CANADA_ICON} width = &amp;#39;15&amp;#39; height=&amp;#39; 15&amp;#39; /&amp;gt; - Canadian Birth Month (1991-2022) Distribution&amp;quot;)
       ) + 
  theme_light() +
  theme(
    text = element_text(family = &amp;#39;Asap SemiCondensed&amp;#39;, size = 14),
    plot.title.position = &amp;#39;plot&amp;#39;,
    plot.title = element_markdown(),
    plot.caption = element_markdown()
  )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2023/12/04/are-birth-dates-still-destiny-for-canadian-nhl-players/index_files/figure-html/unnamed-chunk-17-1.png&#34; width=&#34;960&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Visually it looks pretty clear that more Canadian NHL players are born in January/February than expected, and fewer are born from August through the end of the year. May and July are interesting, but I don’t have an intuition for why more NHL players might be born in those months.&lt;/p&gt;
&lt;p&gt;For a more statistical perspective, a chi-squared test can be used to see whether the distribution for Canadian NHL players differs from Canada in general. In the following code, &lt;em&gt;x&lt;/em&gt; is the number of Canadian NHL players born in each month and &lt;em&gt;p&lt;/em&gt; is the expected proportion based on the distribution of birth months for Canada as a whole.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;broom::tidy(chisq.test(x = combined$players, p = combined$country_pct))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 1 × 4
##   statistic p.value parameter method                                  
##       &amp;lt;dbl&amp;gt;   &amp;lt;dbl&amp;gt;     &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;                                   
## 1      25.6 0.00752        11 Chi-squared test for given probabilities&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The p-value of &amp;lt;.01 means we can reject the null hypothesis that the two distributions are the same.&lt;/p&gt;
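&lt;p&gt;As a sanity check on that conclusion, the test statistic can be compared against the chi-squared critical value for 12 - 1 = 11 degrees of freedom, and the p-value recomputed directly with base R:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# 95th percentile of the chi-squared distribution with 11 df
qchisq(p = .95, df = 11)                    # roughly 19.7, which 25.6 exceeds

# p-value for the observed test statistic
pchisq(25.6, df = 11, lower.tail = FALSE)   # roughly 0.0075&lt;/code&gt;&lt;/pre&gt;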
&lt;p&gt;So it seems that Malcolm Gladwell’s thesis in Outliers still holds true in today’s NHL among Canadian players.&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>The Most Unexpectedly Good and Bad TV Episodes</title>
      <link>https://jlaw.netlify.app/2023/09/28/the-most-unexpectedly-good-and-bad-tv-episodes/</link>
      <pubDate>Thu, 28 Sep 2023 00:00:00 +0000</pubDate>
      <guid>https://jlaw.netlify.app/2023/09/28/the-most-unexpectedly-good-and-bad-tv-episodes/</guid>
      <description>


&lt;p&gt;The 9th episode of the 2nd season of Ted Lasso, “Beard After Hours”, struck me as a pretty bad episode of a pretty good show. I wondered whether others found this to be an unexpectedly bad episode of TV or if it was just me. The website &lt;a href=&#34;https://www.ratingraph.com/tv-shows/ted-lasso-ratings-81599/&#34;&gt;RatinGraph&lt;/a&gt; confirmed that while it isn’t the worst episode of the series, it’s in the bottom 3.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;TedLasso.PNG&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Further Googling showed that this episode (along with one other) was the result of the series being extended from 10 episodes to 12 for Season 2. Thus, “Beard After Hours” was a filler episode intended not to affect the main plot line.&lt;/p&gt;
&lt;p&gt;This got me thinking about other unexpectedly bad episodes of TV. And since finding unexpectedly good and unexpectedly bad episodes are similar problems, I figured why not do both? So in this post, I find the 10 most unexpectedly good and unexpectedly bad episodes of television.&lt;/p&gt;
&lt;div id=&#34;data&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Data&lt;/h2&gt;
&lt;p&gt;&lt;a href=&#34;https://developer.imdb.com/non-commercial-datasets/&#34;&gt;IMDB&lt;/a&gt; provides datasets for personal and non-commercial use that contain information on TV Series, their episodes, and the ratings of those episodes. More specifically, I will be using the &lt;em&gt;title.basics.tsv.gz&lt;/em&gt; file for basic info on TV Series (and episode names), &lt;em&gt;title.episode.tsv.gz&lt;/em&gt; to get all of the episode IDs for each TV Series, and &lt;em&gt;title.ratings.tsv.gz&lt;/em&gt; to get the ratings and number of votes for each episode.&lt;/p&gt;
&lt;p&gt;For this analysis there are no fancy packages: just &lt;code&gt;tidyverse&lt;/code&gt;, &lt;code&gt;glue&lt;/code&gt;, and &lt;code&gt;broom&lt;/code&gt; for data manipulation, and &lt;code&gt;ggtext&lt;/code&gt; and &lt;code&gt;ggrepel&lt;/code&gt; for enhancements to the visualizations.&lt;/p&gt;
&lt;p&gt;The first steps are loading libraries,&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidyverse)
library(broom)
library(ggrepel)
library(glue)
library(ggtext)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;setting some global settings for visualization,&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;theme_set(theme_light(base_size = 14, base_family = &amp;quot;Asap SemiCondensed&amp;quot;))

theme_update(
  panel.grid.minor = element_blank(),
  plot.title = element_text(face = &amp;quot;bold&amp;quot;),
  plot.title.position = &amp;quot;plot&amp;quot;
)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;and reading in the 3 IMDB data files. The raw files are tab-delimited and use the &lt;code&gt;\N&lt;/code&gt; character for a missing value; the &lt;code&gt;na&lt;/code&gt; parameter in &lt;code&gt;read_delim&lt;/code&gt; tells R to set these to &lt;code&gt;NA&lt;/code&gt; rather than keep them as strings.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;basics &amp;lt;- read_delim(file = &amp;#39;data/title.basics.tsv&amp;#39;, delim = &amp;#39;\t&amp;#39;, na = &amp;#39;\\N&amp;#39;)
ratings &amp;lt;- read_delim(file = &amp;#39;data/title.ratings.tsv&amp;#39;, delim = &amp;#39;\t&amp;#39;)
episodes &amp;lt;- read_delim(file = &amp;#39;data/title.episode.tsv&amp;#39;, delim = &amp;#39;\t&amp;#39;, na =&amp;#39;\\N&amp;#39;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;basics&lt;/code&gt; file contains nearly 250k TV Series, which is way more than I want to deal with, so I’ll keep only shows that meet the following criteria:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;IMDB categorizes it as a TV Series&lt;/li&gt;
&lt;li&gt;The show started in 1990 or later (because I wanted things I’d be familiar with)&lt;/li&gt;
&lt;li&gt;IMDB classifies it as either a Comedy or a Drama&lt;/li&gt;
&lt;li&gt;IMDB does &lt;strong&gt;not&lt;/strong&gt; classify it as a Talk Show, Reality Show, News, Game Show, or Short
&lt;ul&gt;
&lt;li&gt;Genres on IMDB can have multiple categories for example &lt;a href=&#34;https://www.imdb.com/title/tt0320037&#34;&gt;Jimmy Kimmel Live!&lt;/a&gt; is classified as Comedy, Music, and Talk-Show&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;There are at least 20 episodes in the Series (need a track record for how a show is rated)&lt;/li&gt;
&lt;li&gt;Episodes average more than 250 votes (want enough stability in the ratings and for the show to be somewhat popular)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These exclusions are handled with the following code:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;basics_agg &amp;lt;- basics %&amp;gt;%
  # Limit to TV Series
  filter(titleType == &amp;#39;tvSeries&amp;#39;) %&amp;gt;% 
  # Keep Only Shows Starting In or After 1990
  filter(startYear &amp;gt;= 1990) %&amp;gt;%
  # Join all the Episodes to the TV Series data
  inner_join(episodes, by = join_by(tconst==parentTconst)) %&amp;gt;%
  # Join the ratings to the episode data
  inner_join(ratings, by = join_by(tconst.y == tconst)) %&amp;gt;% 
  # Calculate summary statistics for each show
  group_by(tconst, titleType, primaryTitle, originalTitle, 
           isAdult, startYear, endYear, runtimeMinutes, genres) %&amp;gt;%
  summarize(
    total_episodes = n(),
    avg_votes = mean(numVotes),
    overall_average = sum(numVotes * averageRating) / sum(numVotes),
    .groups = &amp;#39;drop&amp;#39;
  ) %&amp;gt;%
  # Keep Comedies and Dramas
  filter(str_detect(genres, &amp;#39;Comedy|Drama&amp;#39;)) %&amp;gt;%
  # Exclude Other Genres
  filter(!str_detect(genres, &amp;#39;Talk-Show|Reality-TV|News|Game-Show|Short&amp;#39;)) %&amp;gt;%
  # Keep Only if 20+ Episodes on Series
  filter(total_episodes &amp;gt;= 20) %&amp;gt;%
  # Keep Only if Episodes Average 250 Votes or More
  filter(avg_votes &amp;gt; 250)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now there are only 700 shows remaining in the data, which is much more manageable!&lt;/p&gt;
&lt;div id=&#34;creating-an-episode-level-data-set&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Creating an episode level data set&lt;/h3&gt;
&lt;p&gt;So far the &lt;code&gt;basics_agg&lt;/code&gt; data set is just a list of 700 TV Series and their information. To build a model to predict episode ratings I’ll have to build a data set where each row is an episode. This replicates some of the logic from above that merges the 3 datasets together:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;all_tv_details &amp;lt;- basics_agg %&amp;gt;% 
  ## Join in Episode Data
  inner_join(episodes, by = join_by(tconst==parentTconst)) %&amp;gt;%
  ## Join in Ratings Data
  inner_join(ratings, by = join_by(tconst.y == tconst)) %&amp;gt;% 
  # Bring in Episode Titles
  left_join(basics %&amp;gt;% filter(titleType == &amp;#39;tvEpisode&amp;#39;) %&amp;gt;% 
              transmute(tconst, episodeTitle = primaryTitle),
            by = join_by(tconst.y == tconst)) %&amp;gt;% 
  arrange(tconst, seasonNumber, episodeNumber) %&amp;gt;% 
  group_by(tconst) %&amp;gt;% 
  # Create a variable for the overall episode number
  mutate(episodeOverall = row_number(tconst),
         seasonNumber = factor(seasonNumber)
  ) %&amp;gt;%
  # Filter Out Missing Data
  filter(!is.na(seasonNumber) &amp;amp; !is.na(episodeNumber)) %&amp;gt;%
  ungroup()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now the dataset is all prepared to find our &lt;strong&gt;unexpected&lt;/strong&gt; episodes.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;methodology&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Methodology&lt;/h2&gt;
&lt;p&gt;The methodology I’m using to identify an &lt;strong&gt;unexpectedly&lt;/strong&gt; good or bad episode of TV is similar to that of the &lt;a href=&#34;https://robjhyndman.com/hyndsight/tsoutliers/&#34;&gt;tsoutliers() function in the forecast package&lt;/a&gt;. Since this isn’t really a time series, I modify it slightly to not account for “seasonal components”. My method is:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;p&gt;For each TV Series create a prediction of what the expected IMDB rating would be.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Using a linear model with the Overall Episode number to capture a global trend (does the series get better or worse over time) as well as Season Number and Episode Number (and their interaction) to capture more local effects (is a certain season as a whole just bad).&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;For example, in the show Scrubs (which I love), the 9th season is rated much lower than Seasons 1-8. Therefore episodes in Season 9 aren’t &lt;strong&gt;unexpectedly&lt;/strong&gt; bad since the whole season is bad.&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Calculate the difference between the Predicted Ratings from the model in Step #1 and the Actual Rating from IMDB.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Look at the distribution of the differences from Step #2. Episodes will be labeled as &lt;strong&gt;unexpectedly&lt;/strong&gt; good or bad if the difference calculated in step #2 is large enough.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;For “large enough” I look at the interquartile range (IQR) of the differences (the 75th percentile minus the 25th percentile) and label an episode as &lt;strong&gt;unexpectedly bad&lt;/strong&gt; if that episode’s difference is less than the 25th Percentile - 3 times the IQR and &lt;strong&gt;unexpectedly good&lt;/strong&gt; if that episode’s difference is greater than the 75th Percentile + 3 times the IQR.&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The amount of &lt;strong&gt;unexpectedness&lt;/strong&gt; is based on the difference between the lower/upper bound and the actual value.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;This is different from just using the difference between the predicted and the actual values.&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;The reason is that a show can have a very wide expected range, for example from 4 to 9. If the predicted value is 6.5 and the actual value is 9.1, then there’s a difference of 2.6 from the predicted value but only 0.1 outside the expected range. &lt;strong&gt;I want to focus on the greatest gap from expected&lt;/strong&gt;, so I want to take the larger variability into account.&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;/ol&gt;
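&lt;p&gt;The bound logic from Steps 3 and 4 can be sketched on a handful of made-up residuals (these numbers are purely for illustration):&lt;/p&gt;

```r
# Minimal sketch of the 3*IQR rule: flag a residual when it falls outside
# [P25 - 3*IQR, P75 + 3*IQR]. The values here are made up for illustration.
resid = c(-0.3, -0.1, 0, 0.1, 0.2, -2.5)   # the -2.5 is a clear outlier
iqr = quantile(resid, .75) - quantile(resid, .25)
lci = quantile(resid, .25) - (3 * iqr)     # lower bound
uci = quantile(resid, .75) + (3 * iqr)     # upper bound
anomaly = resid > uci | lci > resid        # TRUE only for the -2.5 residual
```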
&lt;p&gt;A visual explanation using an episode of Stranger Things as an example is shown below. I want to focus more on the difference between the 7.4 and 6.1 vs. the 8.5 and 6.1:
&lt;img src=&#34;example.png&#34; /&gt;&lt;/p&gt;
&lt;div id=&#34;function-to-find-the-unexpected-episodes&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Function to find the Unexpected Episodes&lt;/h3&gt;
&lt;p&gt;The steps above have been built into a function called &lt;code&gt;get_anomalies()&lt;/code&gt;. The parameter &lt;em&gt;onlyAnomalies&lt;/em&gt; determines whether to return &lt;strong&gt;only&lt;/strong&gt; unexpected episodes or all episodes. The differences described in Step 2 are added using the &lt;code&gt;augment&lt;/code&gt; function from &lt;code&gt;broom&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;get_anomalies &amp;lt;- function(dt, onlyAnomalies = T){
  
  ## STEP 1: Run Linear Model on IMDB Ratings vs. Episode Number + Season Info
  #if multiple seasons for show use both global and local trend
  if(n_distinct(dt$seasonNumber) &amp;gt; 1){
    model &amp;lt;- lm(averageRating ~ episodeOverall + seasonNumber*episodeNumber, 
                data = dt)
  }
  # if only one season then global trend = local trend
  else{
    model &amp;lt;- lm(averageRating ~ episodeOverall, data = dt)
  }
  
  ### Step 2 - Add in Residuals from model to initial data set
  results &amp;lt;- augment(model, dt) %&amp;gt;% 
  ### Step 3 - Calculate the 3*IQR Range for each episode
    mutate(
      ## Determine the IQR of the Residuals (P75 - P25)
      iqr = (quantile(.resid, .75)-quantile(.resid, .25)),
      ## Set Lower Bound for expected range of residuals
      lci = quantile(.resid, .25)-(3*iqr),
      ## Set Upper Bound for expected range of residuals
      uci = quantile(.resid, .75)+(3*iqr),
      ## Tag an episode as an anomaly if its actual rating is outside the bounds
      anomaly = if_else(.resid &amp;gt; uci | .resid &amp;lt; lci, T, F),
      
      ## Set expected range back in the scale of the 0-10 prediction.
      lower = .fitted + lci,
      upper = .fitted + uci,
      
      # Step 4 - Calculate the difference between the bounds and the actual 
      # value to use for measure of unexpectedness
      remainder = if_else(.resid &amp;lt; 0, averageRating-lower, averageRating-upper)
    ) %&amp;gt;% 
    # Subset columns
    select(episodeOverall, seasonNumber, episodeNumber, episodeTitle, 
           averageRating, .fitted, .resid, 
           anomaly, lower, upper, remainder)
  
  # Determine whether to return all episodes or just the unexpected episodes
  if(onlyAnomalies == T){
    return(results %&amp;gt;% filter(anomaly == T))
  }
  else{
    return(results)
  }
}&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;results&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Results&lt;/h2&gt;
&lt;p&gt;The function above needs to be run individually on the 700 TV Series in the data. To run all 700 models in a simple way I use the &lt;a href=&#34;https://r4ds.had.co.nz/many-models.html&#34;&gt;&lt;em&gt;Many Models&lt;/em&gt;&lt;/a&gt; framework by nesting data into list-columns and using &lt;code&gt;map&lt;/code&gt; to run the function on each subset of data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;results &amp;lt;- all_tv_details %&amp;gt;% 
  # Create a dataset with 1 row per TV Series with all data in a list-column
  group_by(primaryTitle) %&amp;gt;%
  nest() %&amp;gt;% 
  # Run the function to get the unexpected episodes as a new list-column
  mutate(results = map(data, get_anomalies)) %&amp;gt;% 
  # Break the new list-column back into individual rows
  unnest(results) %&amp;gt;%
  # Drop the original list columns and ungroup the data set
  select(-data) %&amp;gt;% 
  ungroup() %&amp;gt;%
  # Use Glue package to make a pretty label
  mutate(
    lbl = glue(&amp;quot;**{primaryTitle}** S{s}E{e} - {episodeTitle}&amp;quot;,
               s = if_else(as.numeric(seasonNumber) &amp;lt; 10, 
                           glue(&amp;quot;0{seasonNumber}&amp;quot;), glue(&amp;quot;{seasonNumber}&amp;quot;)),
               e = if_else(as.numeric(episodeNumber) &amp;lt; 10, 
                           glue(&amp;quot;0{episodeNumber}&amp;quot;), glue(&amp;quot;{episodeNumber}&amp;quot;))
    )
  )&lt;/code&gt;&lt;/pre&gt;
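&lt;p&gt;As an aside, the zero-padding handled with &lt;code&gt;if_else()&lt;/code&gt; and &lt;code&gt;glue()&lt;/code&gt; above can also be done with base R’s &lt;code&gt;sprintf()&lt;/code&gt;; a small sketch with hypothetical season/episode numbers:&lt;/p&gt;

```r
# sprintf("%02d", ...) zero-pads integers to two digits, matching the
# S01E09-style labels built above. The inputs here are hypothetical.
season = c(1, 10)
episode = c(9, 12)
sprintf("S%02dE%02d", season, episode)  # "S01E09" "S10E12"
```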
&lt;p&gt;And now without further ado… the RESULTS!&lt;/p&gt;
&lt;div id=&#34;the-tv-shows-with-the-most-unexpected-episodes&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;The TV Shows with the Most &lt;strong&gt;Unexpected Episodes&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Overall, 143 episodes were identified as being &lt;strong&gt;unexpectedly&lt;/strong&gt; good or bad. Of this bunch, 64% are &lt;strong&gt;unexpectedly&lt;/strong&gt; bad, showing that it’s more common for a good show to miss than it is for a show to hit an unexpected home run.&lt;/p&gt;
&lt;p&gt;The first thing I want to look at are the 10 TV Series that have the most &lt;strong&gt;unexpected&lt;/strong&gt; episodes both good and bad.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;results %&amp;gt;% 
  # Group by Show
  group_by(primaryTitle) %&amp;gt;% 
  # Count the number of unexpected episodes in total as well as good and bad
  summarize(
    total = n(),
    `unexpectedly bad` = sum(.resid &amp;lt; 0)*-1,
    `unexpectedly good` = sum(.resid &amp;gt; 0)
  ) %&amp;gt;%
  # Get the Top 10 by Total
  slice_max(order_by = total, n = 10, with_ties = F) %&amp;gt;% 
  pivot_longer(
    cols = c(`unexpectedly bad`, `unexpectedly good`),
    names_to = &amp;quot;type&amp;quot;,
    values_to = &amp;quot;episodes&amp;quot;
  ) %&amp;gt;% 
  ggplot(aes(x = episodes, y=fct_reorder(primaryTitle, total), fill = type)) + 
    geom_col() + 
    geom_text(aes(label = if_else(episodes != 0, abs(episodes), NA)), 
              hjust = &amp;quot;inward&amp;quot;, color = &amp;#39;grey90&amp;#39;) +
    labs(x = &amp;quot;# of Unexpected Episodes&amp;quot;,
         y = &amp;quot;TV Series&amp;quot;,
         title = &amp;quot;TV Series with Most &amp;lt;i style = &amp;#39;color:#ba2a22&amp;#39;&amp;gt;Unexpected&amp;lt;/i&amp;gt; Episodes&amp;quot;,
         fill = &amp;quot;&amp;quot;) + 
    scale_fill_viridis_d(option = &amp;quot;cividis&amp;quot;, begin = .2, end = .8, 
                         labels = str_to_title) +
     
    theme(
      legend.position = &amp;#39;top&amp;#39;,
      plot.title = element_markdown(),
      axis.text.x = element_blank(),
      axis.ticks.x = element_blank(),
      axis.title = element_text(size = 12),
      axis.text.y = element_text(size = 10),
      legend.margin = margin(0, 0, -5, 0)
    )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2023/09/28/the-most-unexpectedly-good-and-bad-tv-episodes/index_files/figure-html/unnamed-chunk-8-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Surprisingly, at least to me, SpongeBob SquarePants has the most &lt;strong&gt;unexpected&lt;/strong&gt; episodes with 5, and they’re all unexpectedly bad. This contrasts with Desperate Housewives and Big Bang Theory, whose unexpected episodes are all unexpectedly good.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;the-10-most-unexpectedly-good-episodes&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;The 10 Most &lt;strong&gt;Unexpectedly Good&lt;/strong&gt; Episodes&lt;/h3&gt;
&lt;p&gt;The Top 10 &lt;strong&gt;Unexpectedly Good&lt;/strong&gt; episodes are:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Define Elements for Manual Label
color = c(&amp;quot;Actual\nRating&amp;quot; = &amp;quot;darkred&amp;quot;,
          &amp;quot;Predicted\nRating&amp;quot; = &amp;#39;black&amp;#39;, 
          &amp;quot;Series\nAverage&amp;quot; = &amp;quot;darkblue&amp;quot;)
shape = c(&amp;quot;Actual\nRating&amp;quot; = 19, 
          &amp;quot;Predicted\nRating&amp;quot; = 19, 
          &amp;quot;Series\nAverage&amp;quot; = 1)

# Subset to only the Unexpectedly Good Results
good_results &amp;lt;- results %&amp;gt;% 
  filter(.resid &amp;gt; 0)  %&amp;gt;% 
  slice_max(order_by = remainder, n = 10, with_ties = F)

# Plot
good_results %&amp;gt;% 
  select(lbl, .resid, Predicted = .fitted, Actual = averageRating, 
         lower, upper, remainder) %&amp;gt;% 
  ggplot(aes(x = fct_reorder(lbl, remainder))) + 
  geom_pointrange(aes(y = Predicted, ymin = lower, ymax = upper, 
                      color = &amp;#39;Predicted\nRating&amp;#39;)) + 
  geom_point(aes(y = Actual, color = &amp;quot;Actual\nRating&amp;quot;), size = 2) +
  geom_text(aes(label = Actual, y = Actual), color = &amp;#39;darkred&amp;#39;, nudge_x = .3) +
  geom_text(aes(label = round(lower, 1), y = lower),  nudge_x = .3, size = 3) +
  geom_text(aes(label = round(upper, 1), y = upper),  nudge_x = .3, size = 3) +
  geom_text(aes(label = round(Predicted, 1), y = Predicted),  nudge_x = .3) +
  scale_color_manual(values = color, name = &amp;#39;&amp;#39;) +
  labs(x = &amp;quot;&amp;quot;, y = &amp;quot;IMDB Rating&amp;quot;, 
       title = &amp;quot;Top 10 Unexpectedly &amp;lt;i style = &amp;#39;color:#2E8B57&amp;#39;&amp;gt;Good Episodes &amp;lt;/i&amp;gt;&amp;quot;,
       subtitle = &amp;quot;*As measured by the difference between Prediction Interval and Actual IMDB Episode Rating*&amp;quot;) +
  coord_flip() + 
  theme(
    plot.title = element_markdown(),
    plot.title.position = &amp;#39;plot&amp;#39;,
    plot.subtitle = element_markdown(size = 10),
    panel.grid.major.x = element_blank(),
    axis.text.x = element_markdown(size = 10),
    axis.title.x = element_text(size = 11),
    axis.text.y = element_markdown(size = 9),
    legend.position = &amp;#39;top&amp;#39;,
    legend.margin = margin(0, 0, -5, 0),
    legend.text = element_text(size = 9),
    legend.key.size = unit(0.2, &amp;quot;cm&amp;quot;)
  )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2023/09/28/the-most-unexpectedly-good-and-bad-tv-episodes/index_files/figure-html/unnamed-chunk-9-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;According to this method, the most &lt;strong&gt;unexpectedly good&lt;/strong&gt; episode of TV since 1990 is from &lt;strong&gt;The Fresh Prince of Bel-Air&lt;/strong&gt;’s 4th Season, entitled “Papa’s Got a Brand New Excuse”. It’s a pretty entertaining show in general, but this episode contains quite possibly the most iconic scene from the show. Inquirer.com called the end of this episode &lt;a href=&#34;https://www.inquirer.com/news/bel-air-fresh-prince-dramatic-moments-will-smith-peacock-20220215.html&#34;&gt;“among the most tear-jerking in sitcom history”&lt;/a&gt;.&lt;/p&gt;
&lt;iframe width=&#34;560&#34; height=&#34;315&#34; src=&#34;https://www.youtube.com/embed/wQYQjJeaIek&#34; frameborder=&#34;0&#34; allowfullscreen&gt;
&lt;/iframe&gt;
&lt;p&gt;I’m not really familiar with many of the other episodes on this list, but at least from number 1 it seems like the method works well. Remember, this isn’t looking for the best TV episodes, but for episodes far better than their show would lead you to expect.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;the-10-most-unexpectedly-bad-episodes&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;The 10 Most &lt;strong&gt;Unexpectedly Bad&lt;/strong&gt; Episodes&lt;/h3&gt;
&lt;p&gt;The Top 10 &lt;strong&gt;Unexpectedly Bad&lt;/strong&gt; episodes are:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Subset Data to only Unexpectedly Bad Results
bad_results &amp;lt;- results %&amp;gt;% 
  filter(.resid &amp;lt; 0) %&amp;gt;% 
  slice_min(order_by = remainder, n = 10, with_ties = F)

# Plot
bad_results %&amp;gt;% 
  select(lbl, .resid, Predicted = .fitted, 
         Actual = averageRating, lower, upper, remainder) %&amp;gt;% 
  ggplot(aes(x = fct_reorder(lbl, -remainder))) + 
    geom_pointrange(aes(y = Predicted, ymin = lower, 
                        ymax = upper, color = &amp;#39;Predicted\nRating&amp;#39;)) + 
    geom_point(aes(y = Actual, color = &amp;quot;Actual\nRating&amp;quot;), size = 2) +
    geom_text(aes(label = Actual, y = Actual), 
              color = &amp;#39;darkred&amp;#39;, nudge_x = .3) +
    geom_text(aes(label = round(lower, 1), y = lower),  
              nudge_x = .3, size = 3) +
    geom_text(aes(label = round(upper, 1), y = upper),  
              nudge_x = .3, size = 3) +
    geom_text(aes(label = round(Predicted, 1), y = Predicted),  
              nudge_x = .3) +
    scale_color_manual(values = color, name = &amp;#39;&amp;#39;) +
    labs(x = &amp;quot;&amp;quot;, y = &amp;quot;IMDB Rating&amp;quot;, 
         title = &amp;quot;Top 10 Unexpectedly &amp;lt;i style = &amp;#39;color:#b22222&amp;#39;&amp;gt;Bad Episodes &amp;lt;/i&amp;gt;&amp;quot;,
         subtitle = &amp;quot;*As measured by the difference between Prediction Interval and Actual IMDB Episode Rating*&amp;quot;) +
    coord_flip() + 
    theme(
      plot.title = element_markdown(),
      plot.title.position = &amp;#39;plot&amp;#39;,
      plot.subtitle = element_markdown(size = 10),
      panel.grid.major.x = element_blank(),
      axis.text.x = element_markdown(size = 10),
      axis.title.x = element_text(size = 11),
      axis.text.y = element_markdown(size = 9),
      legend.position = &amp;#39;top&amp;#39;,
      legend.margin = margin(0, 0, -5, 0),
      legend.text = element_text(size = 9),
      legend.key.size = unit(0.2, &amp;quot;cm&amp;quot;)
    )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2023/09/28/the-most-unexpectedly-good-and-bad-tv-episodes/index_files/figure-html/unnamed-chunk-10-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;I have seen more of the bad list than the good list. While I don’t know much about later seasons of Riverdale, I can speak more about #5, which comes from Stranger Things’ 2nd Season. This episode involved a side-quest in which one of the main characters meets a family member who hasn’t appeared in any episode since.&lt;/p&gt;
&lt;p&gt;Also, the Scrubs episode that appears, “My Night to Remember”, is the only clip show in the series.&lt;/p&gt;
&lt;p&gt;In both of these examples, they’re &lt;strong&gt;unexpectedly bad&lt;/strong&gt; because the shows in general are good (they have the two highest upper bounds of the Top 10) but these two episodes did nothing to advance the plot and were ultimately filler. Much like the Ted Lasso episode that motivated this analysis.&lt;/p&gt;
&lt;p&gt;The moral of the story is that people don’t like filler episodes.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;drilling-into-a-few-shows&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Drilling into a Few Shows&lt;/h3&gt;
&lt;p&gt;Just for fun I wrote a general-purpose function, inspired by the RatinGraph charts, that will show any TV Series’ trend lines and expected range as well as highlight any &lt;strong&gt;unexpected&lt;/strong&gt; episodes.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;plot_shows &amp;lt;- function(title){
  # Run the anomaly function on a single show and return all the episodes
  get_anomalies(
    all_tv_details %&amp;gt;% 
      filter(primaryTitle == title),
    onlyAnomalies = F
  ) %&amp;gt;% 
    mutate(primaryTitle = title) %&amp;gt;%
    ggplot(aes(x = episodeOverall, y = averageRating, color = seasonNumber)) + 
    # Plot the expected value range
    geom_ribbon(aes(ymin = lower, ymax = upper), fill = &amp;#39;lightblue&amp;#39;, 
                color = NA, alpha = .3) + 
    # Plot the overall trend line across the entire Series
    geom_smooth( se = F, method = &amp;#39;lm&amp;#39;, lty = 2, color = &amp;#39;grey60&amp;#39;) +
    # Plot the trendlines for each season
    geom_smooth(aes(group = seasonNumber), se = F, method = &amp;#39;lm&amp;#39;, 
                lty = 2, show.legend = F) +
    # Plot the actuals for each episode
    geom_point(alpha = .5) + 
    # Add annotations for any outliers
    geom_label_repel(data = . %&amp;gt;% filter(anomaly == T), size = 3, 
                     min.segment.length = 0, 
                     aes(label = glue(&amp;quot;Season {seasonNumber} Episode {episodeNumber}
                                    {episodeTitle}
                                    Rating: {averageRating}&amp;quot;)),
                     show.legend = F) + 
    guides(color = guide_legend(nrow = 1)) +
    labs(x = &amp;#39;Episodes&amp;#39;, y = &amp;#39;IMDB Rating&amp;#39;, title = title, color = &amp;#39;Season:&amp;#39;) + 
    theme(
      plot.title = element_text(family = &amp;#39;Roboto&amp;#39;),
      legend.position = &amp;#39;bottom&amp;#39;,
    )
  
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;First let’s look at the show with the most &lt;strong&gt;unexpectedly good&lt;/strong&gt; episode, The Fresh Prince of Bel-Air.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;plot_shows(&amp;#39;The Fresh Prince of Bel-Air&amp;#39;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2023/09/28/the-most-unexpectedly-good-and-bad-tv-episodes/index_files/figure-html/unnamed-chunk-12-1.png&#34; width=&#34;672&#34; /&gt;
For most of its episodes the IMDB ratings are a solid 7.5. There are some ups and some downs but nothing like the 9.7 rating that “Papa’s Got a Brand New Excuse” received.&lt;/p&gt;
&lt;p&gt;On the negative side, let’s look at Scrubs:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;plot_shows(&amp;#39;Scrubs&amp;#39;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2023/09/28/the-most-unexpectedly-good-and-bad-tv-episodes/index_files/figure-html/unnamed-chunk-13-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;You can see that the model accounts for Season 9 being rated worse than Seasons 1-8; its episodes aren’t &lt;strong&gt;unexpected&lt;/strong&gt; since the whole season is poorly rated. You can also see the clip episode “My Night To Remember” rating far worse, at 5.4, than the roughly 8 that Seasons 1-8 usually had.&lt;/p&gt;
&lt;p&gt;Finally, let’s look at Ted Lasso since it was the inspiration for this post:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;plot_shows(&amp;#39;Ted Lasso&amp;#39;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2023/09/28/the-most-unexpectedly-good-and-bad-tv-episodes/index_files/figure-html/unnamed-chunk-14-1.png&#34; width=&#34;672&#34; /&gt;
No major outliers. And while the episode “Beard After Hours” (the lowest green dot) is lower than expected, it doesn’t meet the extreme criteria to be flagged here.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;special-thanks&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Special Thanks&lt;/h2&gt;
&lt;p&gt;A special thanks to Cédric Scherer. His posit::conf(2023) presentation on &lt;a href=&#34;https://posit-conf-2023.github.io/dataviz-ggplot2/&#34;&gt;Engaging and Beautiful Data Visualizations with ggplot2&lt;/a&gt; taught me a TON.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;appendix-code-for-the-example-plot-for-stranger-things&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Appendix: Code for the example Plot for Stranger Things&lt;/h2&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;examples &amp;lt;- results %&amp;gt;% filter(primaryTitle == &amp;#39;Stranger Things&amp;#39;)

examples %&amp;gt;% 
  select(lbl, .resid, Predicted = .fitted, Actual = averageRating, lower, upper, remainder) %&amp;gt;% 
  ggplot(aes(x = fct_reorder(lbl, -remainder))) + 
  geom_pointrange(aes(y = Predicted, ymin = lower, ymax = upper, color = &amp;#39;Predicted\nRating&amp;#39;)) + 
  geom_point(aes(y = Actual, color = &amp;quot;Actual\nRating&amp;quot;), size = 2) +
  geom_text(aes(label = Actual, y = Actual), color = &amp;#39;darkred&amp;#39;, nudge_x = .1) +
  geom_text(aes(label = round(lower, 1), y = lower),  nudge_x = .05, size = 3) +
  geom_text(aes(label = round(upper, 1), y = upper),  nudge_x = .05, size = 3) +
  geom_text(aes(label = round(Predicted, 1), y = Predicted),  nudge_x = .1) +
  annotate(
    &amp;quot;richtext&amp;quot;,
    y = 8.5,
    x = 1.3,
    size = 3,
    label = &amp;quot;*The expected IMDB rating for this episode of &amp;lt;br&amp;gt; Stranger Things is between 7.4 and 9.7*&amp;quot;,
    family = &amp;quot;Asap SemiCondensed&amp;quot;,
    label.color = NA
  ) + 
  annotate(
    &amp;#39;curve&amp;#39;,
    xend = 1,
    x = .7,
    yend = 6.1,
    y = 6.5,
    curvature = .25,
    arrow = arrow(
      length = unit(7, &amp;quot;pt&amp;quot;),
      type = &amp;quot;closed&amp;quot;
    )
  ) + 
  annotate(
    &amp;#39;text&amp;#39;,
    x = .7,
    y = 6.5,
    label = &amp;quot;The episode had a\n6.1 rating on IMDB&amp;quot;,
    family = &amp;quot;Asap SemiCondensed&amp;quot;,
    size = 3,
    vjust = 1
  ) + 
  annotate(
    &amp;#39;curve&amp;#39;,
    x = 1,
    xend = 1,
    y = 6.1,
    yend = 7.4,
    color = &amp;#39;darkred&amp;#39;,
    lty = 2,
    curvature = -.3,
    arrow = arrow(
      length = unit(7, &amp;quot;pt&amp;quot;),
      type = &amp;quot;closed&amp;quot;,
      ends = &amp;#39;both&amp;#39;
    )
  ) + 
  annotate(
    &amp;#39;richtext&amp;#39;,
    x = 1.35,
    y = 6.75,
    size = 3,
    color = &amp;#39;darkred&amp;#39;,
    family = &amp;quot;Asap SemiCondensed&amp;quot;,
    label = &amp;quot;&amp;lt;i&amp;gt;The &amp;lt;b&amp;gt;&amp;#39;unexpectedness&amp;#39;&amp;lt;/b&amp;gt; is the difference between&amp;lt;br&amp;gt;the outer bound (7.4) and the actual (6.1)
    &amp;lt;br&amp;gt;7.4 - 6.1 = 1.3&amp;quot;,
    label.color = NA,
    fill = NA
  ) + 
  scale_color_manual(values = color, name = &amp;#39;&amp;#39;) +
  labs(x = &amp;quot;&amp;quot;, y = &amp;quot;IMDB Rating&amp;quot;, 
       title = &amp;quot;&amp;lt;span style = &amp;#39;color:#ff1515&amp;#39;&amp;gt;Stranger Things&amp;lt;/span&amp;gt; S02E07 - Chapter Seven: The Lost Sister&amp;quot;,
       subtitle = &amp;quot;Outer bounds are defined based on **3xIQR**&amp;quot;) +
  coord_flip() + 
  theme(
    plot.title = element_markdown(),
    plot.title.position = &amp;#39;plot&amp;#39;,
    plot.subtitle = element_markdown(size = 10),
    panel.grid.major.x = element_blank(),
    axis.text.x = element_text(size = 10),
    axis.title.x = element_text(size = 11),
    axis.text.y = element_blank(),
    axis.ticks.y = element_blank(),
    legend.position = &amp;#39;top&amp;#39;,
    legend.margin = margin(0, 0, -5, 0),
    legend.text = element_text(size = 10),
    legend.key.size = unit(0.2, &amp;quot;cm&amp;quot;)
  )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2023/09/28/the-most-unexpectedly-good-and-bad-tv-episodes/index_files/figure-html/unnamed-chunk-15-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>When Will NYC&#39;s Subway Ridership Recover?</title>
      <link>https://jlaw.netlify.app/2022/08/29/when-will-nyc-s-subway-ridership-recover/</link>
      <pubDate>Mon, 29 Aug 2022 00:00:00 +0000</pubDate>
      <guid>https://jlaw.netlify.app/2022/08/29/when-will-nyc-s-subway-ridership-recover/</guid>
      <description>


&lt;p&gt;While writing my posts about COVID’s effect on NYC Subway ridership the New York Times published an article called &lt;a href=&#34;https://www.nytimes.com/2022/08/15/nyregion/mta-nyc-budget.html&#34;&gt;&lt;em&gt;The Pandemic Wasn’t Supposed to Hurt New York Transit This Much&lt;/em&gt;&lt;/a&gt;. The article states:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;nytimes.PNG&#34; /&gt;&lt;/p&gt;
&lt;p&gt;I believe the 80% target by 2026 comes from a McKinsey study. While I don’t know the details of the study, I do have some subway fare data sitting around, so why not compare the article’s projections to my own?&lt;/p&gt;
&lt;p&gt;The methodology will be similar to what I did in my &lt;a href=&#34;https://jlaw.netlify.app/2022/07/13/how-much-has-covid-cost-the-nyc-subway-system-in-lost-fares/&#34;&gt;initial subway forecasting post&lt;/a&gt; using the &lt;code&gt;modeltime&lt;/code&gt; package and the champion model Prophet w/ XGBoosted Errors to do the forecasting.&lt;/p&gt;
&lt;div id=&#34;libraries&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Libraries&lt;/h2&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;### Data Manipulation Packages
library(timetk) # For time series features in recipe
library(tidyverse) # General Data Manipulation
library(scales) # Making prettier scales
library(lubridate) # Dealing with Dates

# Modeling Ecosystem
library(modeltime) # Framework for Time Series models
library(tidymodels) # Framework for general recipe and workflows

### Model Packages
library(prophet) # Algorithm for forecasting&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;data&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Data&lt;/h2&gt;
&lt;p&gt;The data is the same as from my &lt;a href=&#34;https://jlaw.netlify.app/2022/07/13/how-much-has-covid-cost-the-nyc-subway-system-in-lost-fares/&#34;&gt;initial post&lt;/a&gt;. It’s initially at the week/station/fare-type level. For this exercise I just need the data at the weekly level.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;fares &amp;lt;- readRDS(file.path(here(), &amp;#39;content&amp;#39;, &amp;#39;post&amp;#39;, 
                           &amp;#39;2022-07-13-how-much-has-covid-cost-the-nyc-subway-system-in-lost-fares&amp;#39;, 
                           &amp;#39;data&amp;#39;,
                           &amp;#39;mta_data.RDS&amp;#39;)) %&amp;gt;% 
  group_by(week_start) %&amp;gt;% 
  summarize(fares = sum(fares))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;My &lt;a href=&#34;https://jlaw.netlify.app/2022/07/13/how-much-has-covid-cost-the-nyc-subway-system-in-lost-fares/&#34;&gt;first blog post&lt;/a&gt; in this series covered the &lt;code&gt;modeltime&lt;/code&gt; package in more detail for trying out many different forecasting models. That post found that Prophet with XGBoosted Errors was the best model. Here I’ll be replicating that workflow for that model type.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Step 1: Defining the pre-processing recipe&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;This step defines the forecasting formula as predicting fares based on all other features. Then it creates a number of time-series-specific features from the date field via &lt;code&gt;step_timeseries_signature&lt;/code&gt;. &lt;code&gt;step_rm&lt;/code&gt; removes some variables created in the prior step that aren’t useful, and finally &lt;code&gt;step_dummy&lt;/code&gt; turns all the categorical variables into one-hot encoded indicators. Here I also restrict the training data to MTA fares from the COVID period (after April 1, 2020), since training on the prior time period would give very strange results.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;rec &amp;lt;- recipe(fares ~ ., data = fares %&amp;gt;% filter(week_start &amp;gt;= ymd(20200401))) %&amp;gt;%
  step_timeseries_signature(week_start) %&amp;gt;% 
  step_rm(matches(&amp;quot;(.iso$)|(am.pm$)|(.xts$)|(hour)|(minute)|(second)|(wday)&amp;quot;)) %&amp;gt;% 
  step_dummy(all_nominal(), one_hot = TRUE)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Step 2: Define the Model Workflow and Fit the Model&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Sticking with the &lt;code&gt;tidymodels&lt;/code&gt; framework, here I define a workflow which will consist of the recipe created in &lt;strong&gt;Step 1&lt;/strong&gt; through &lt;code&gt;add_recipe&lt;/code&gt; and the model specified through &lt;code&gt;add_model()&lt;/code&gt;. Within &lt;code&gt;add_model()&lt;/code&gt; the model type is set to Boosted Prophet. I believe ‘prophet_xgboost’ is the default engine, so &lt;code&gt;set_engine&lt;/code&gt; isn’t strictly necessary, but it’s good to keep around anyway.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;prophet_boost_wf &amp;lt;- workflow() %&amp;gt;%
  add_model(
    prophet_boost(seasonality_yearly = TRUE) %&amp;gt;%
      set_engine(&amp;#39;prophet_xgboost&amp;#39;)
  ) %&amp;gt;% 
  add_recipe(rec) %&amp;gt;%
  fit(fares %&amp;gt;% filter(week_start &amp;gt;= ymd(20200401)) )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Step 3: Using the Model to Forecast the Future&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;In this instance I don’t have a test set to work with, so I’m jumping directly into forecasting. Also, since I don’t know how long it will take for ridership to recover to pre-COVID levels, I’ll set the forecast horizon to 6 years via the &lt;code&gt;h&lt;/code&gt; parameter. Passing in &lt;code&gt;actual_data&lt;/code&gt; lets the actuals be included in the output data set.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;final_fcst &amp;lt;- modeltime_table(
    prophet_boost_wf
  ) %&amp;gt;% 
  modeltime_forecast(
    h = &amp;quot;6 years&amp;quot;,
    actual_data = fares,
    keep_data = TRUE
  )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;em&gt;Visualizing the Forecast&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;modeltime&lt;/code&gt; package makes it easy to visualize the data through the &lt;code&gt;plot_modeltime_forecast()&lt;/code&gt; function. The default is to create a &lt;code&gt;plot.ly&lt;/code&gt; plot, but that can be converted to a &lt;code&gt;ggplot2&lt;/code&gt; plot by setting &lt;code&gt;.interactive&lt;/code&gt; to &lt;code&gt;FALSE&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;final_fcst %&amp;gt;% 
  plot_modeltime_forecast(.interactive = F) + 
  scale_y_continuous(labels = comma)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2022/08/29/when-will-nyc-s-subway-ridership-recover/index_files/figure-html/unnamed-chunk-6-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;when-will-subway-fares-return-to-80-of-pre-covid-to-100&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;When will Subway fares return to 80% of Pre-COVID? To 100%?&lt;/h2&gt;
&lt;p&gt;Now we can see how close my forecast is to the New York Times report. I don’t actually know what the NY Times considers Pre-COVID levels, but for my purposes I’ll use the average number of weekly fares in December 2019 as my Pre-COVID baseline.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;baseline &amp;lt;- fares %&amp;gt;% 
  filter(month(week_start)==12 &amp;amp; year(week_start) == 2019) %&amp;gt;% 
  summarize(avg_fares = mean(fares)) %&amp;gt;% 
  pull(avg_fares)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;From the projection plot above it’s clear that there is a seasonality that peaks in the fall and dips from December through the New Year. To declare victory at 80% I’m going to require four consecutive weeks of fares above 80% of the Pre-COVID baseline.&lt;/p&gt;
&lt;p&gt;I’m not sure of a great way to find the earliest date of the first run of 4 weeks above a threshold, but I’m working it out in three steps:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Define an indicator for whether that week is above 80% (&lt;code&gt;above_80_ind&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Run a counter for each time that the indicator flips from 0 to 1 (&lt;code&gt;run_id_80&lt;/code&gt;) to get an id for each run&lt;/li&gt;
&lt;li&gt;For each &lt;code&gt;run_id_80&lt;/code&gt; get the sum of &lt;code&gt;above_80_ind&lt;/code&gt; to represent the length of each run (&lt;code&gt;run_length_80&lt;/code&gt;)&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;rec_pct &amp;lt;- final_fcst %&amp;gt;% 
  filter(week_start &amp;gt;= ymd(20200401)) %&amp;gt;% 
  # Build Recovery Percentage
  mutate(recovery_pct = .value / baseline) %&amp;gt;%
  # Define Runs of when recovery_pct is above .8
  mutate(
    above_80_ind = (recovery_pct &amp;gt; .8),
    above_100_ind = (recovery_pct &amp;gt; 1)
  ) %&amp;gt;% 
  # Define ID for each time we start a run
  mutate(
    run_id_80 = cumsum(if_else(above_80_ind == 1 &amp;amp; lag(above_80_ind) == 0, 
                               1, 0)),
    run_id_100 = cumsum(if_else(above_100_ind == 1 &amp;amp; lag(above_100_ind) == 0, 
                               1, 0))
  ) %&amp;gt;% 
  add_count(run_id_80, wt = above_80_ind, name = &amp;quot;run_length_80&amp;quot;) %&amp;gt;%
  add_count(run_id_100, wt = above_100_ind, name = &amp;quot;run_length_100&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now I can plot the recovery percentage by week and show that the first time there are four consecutive weeks above 80% is 2025-07-05 and the first time there are four consecutive weeks above 100% of the Pre-COVID value is 2027-06-26.&lt;/p&gt;
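&lt;p&gt;Those dates can also be pulled directly out of &lt;code&gt;rec_pct&lt;/code&gt; rather than read off the plot. As a sketch using the run columns created above (not code from the original analysis):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Sketch: first week of the first run of 4+ weeks above each threshold
rec_pct %&amp;gt;% 
  filter(run_length_80 &amp;gt;= 4, above_80_ind) %&amp;gt;% 
  summarize(first_week_above_80 = min(week_start))

rec_pct %&amp;gt;% 
  filter(run_length_100 &amp;gt;= 4, above_100_ind) %&amp;gt;% 
  summarize(first_week_above_100 = min(week_start))&lt;/code&gt;&lt;/pre&gt;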
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;rec_pct %&amp;gt;% 
  ggplot(aes(x = week_start, y = recovery_pct)) + 
    geom_line(color = &amp;quot;#0039A6&amp;quot;) + 
    geom_segment(aes(x = min(week_start), 
                     xend = rec_pct[which.max(rec_pct$run_length_80 &amp;gt;= 4), ]$week_start,
                     y = .8,
                     yend = .8), lty = 2) + 
    geom_segment(aes(x = rec_pct[which.max(rec_pct$run_length_80 &amp;gt;= 4), ]$week_start,
                     xend = rec_pct[which.max(rec_pct$run_length_80 &amp;gt;= 4), ]$week_start,
                     y = 0,
                     yend = .8), lty = 2) + 
    geom_segment(aes(x = min(week_start), 
                     xend = rec_pct[which.max(rec_pct$run_length_100 &amp;gt;= 4), ]$week_start,
                     y = 1,
                     yend = 1), lty = 2) + 
    geom_segment(aes(x = rec_pct[which.max(rec_pct$run_length_100 &amp;gt;= 4), ]$week_start,
                     xend = rec_pct[which.max(rec_pct$run_length_100 &amp;gt;= 4), ]$week_start,
                     y = 0,
                     yend = 1), lty = 2) + 
    scale_x_date(breaks = &amp;quot;1 years&amp;quot;,
                 labels = year,
                 expand = c(0, 0)) + 
    scale_y_continuous(labels = percent, expand = c(0, 0),
                       breaks = seq(0, 1.6, .2)) + 
    labs(title = &amp;quot;Projected MTA Recovery vs. Pre-COVID&amp;quot;,
         subtitle = &amp;quot;Pre-COVID Baseline from December 2019&amp;quot;, 
         x = &amp;quot;Date&amp;quot;, y = &amp;quot;% of Dec 2019 Baseline&amp;quot;) + 
    cowplot::theme_cowplot()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2022/08/29/when-will-nyc-s-subway-ridership-recover/index_files/figure-html/unnamed-chunk-9-1.png&#34; width=&#34;672&#34; /&gt;
Based on this projection, the NY Times article is slightly pessimistic. According to the above, NYC should reach 80% of the Pre-COVID baseline by mid-2025, which is earlier than the article’s projection of 2026.&lt;/p&gt;
&lt;p&gt;Who will be right? We’ll have to wait at least 3 years to find out!&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Exploring Types of Subway Fares with Hierarchical Forecasting</title>
      <link>https://jlaw.netlify.app/2022/08/24/exploring-types-of-subway-fares-with-hierarchical-forecasting/</link>
      <pubDate>Wed, 24 Aug 2022 00:00:00 +0000</pubDate>
      <guid>https://jlaw.netlify.app/2022/08/24/exploring-types-of-subway-fares-with-hierarchical-forecasting/</guid>
      <description>


&lt;p&gt;In my &lt;a href=&#34;https://jlaw.netlify.app/2022/07/13/how-much-has-covid-cost-the-nyc-subway-system-in-lost-fares/&#34;&gt;prior post&lt;/a&gt; I used forecasting to look at the effect of COVID on the expected amount of New York City subway swipes. In this post I will drill a level deeper and run forecasts for the various types of subway fares to see whether any particular type has recovered better or worse than the others.&lt;/p&gt;
&lt;p&gt;The goal for this post will be to create a top-level forecast for total NYC subway fares and forecasts for each of the types of subway fares. The sub-levels of individual subway fares form a natural hierarchy with the total number. For my forecast, I’d like the forecasts for the sub-levels and for the total to match each other for the sake of consistency. This is called “hierarchical forecasting”. More details can be found in Rob Hyndman and George Athanasopoulos’ &lt;a href=&#34;https://otexts.com/fpp3/hierarchical.html&#34;&gt;Forecasting: Principles and Practice&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The book makes use of the &lt;code&gt;fable&lt;/code&gt; and &lt;code&gt;tsibble&lt;/code&gt; packages.&lt;/p&gt;
&lt;div id=&#34;libraries&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Libraries&lt;/h2&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tsibble) # Data Structure for Time Series
library(tidyverse) # Data Manipulation Packages
library(fable) # Time Series Forecasting Models
library(lubridate) # Date Manipulation
library(scales) # Convenience Functions for Percents&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;data-preparation&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Data Preparation&lt;/h2&gt;
&lt;p&gt;The data set will be the same as from the &lt;a href=&#34;https://jlaw.netlify.app/2022/07/13/how-much-has-covid-cost-the-nyc-subway-system-in-lost-fares/&#34;&gt;prior blog post&lt;/a&gt; which contains weekly Subway data by station, card type, and week from May 2010 through June 2022. Please see the &lt;a href=&#34;https://jlaw.netlify.app/2022/07/13/how-much-has-covid-cost-the-nyc-subway-system-in-lost-fares/&#34;&gt;previous post&lt;/a&gt; for more details on the data processing. The raw fare files come from the &lt;a href=&#34;http://web.mta.info/developers/fare.html&#34;&gt;MTA’s Website&lt;/a&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;dt &amp;lt;- readRDS(here(&amp;#39;content/post/2022-07-13-how-much-has-covid-cost-the-nyc-subway-system-in-lost-fares/data/mta_data.rds&amp;#39;))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In the data set there are 30 different fare types; however, I really don’t want to create 30 different forecasts, especially since some of them are low volume. The top 5 fare types make up 93% of the fares, so I’ll group the other 25 into an “other” category. Then I aggregate the data set to the week and fare_type level and add up the fares column, which represents the number of swipes for each fare type.&lt;/p&gt;
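&lt;p&gt;As a quick sanity check on the 93% figure (a hypothetical snippet, not part of the original analysis), the shares can be computed with a weighted &lt;code&gt;count()&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Sketch: cumulative share of total fares covered by the top fare types
dt %&amp;gt;% 
  count(fare_type, wt = fares, sort = TRUE) %&amp;gt;% 
  mutate(share = n / sum(n),
         cum_share = cumsum(share)) %&amp;gt;% 
  head(5)&lt;/code&gt;&lt;/pre&gt;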
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;dt_by_fare &amp;lt;- dt %&amp;gt;%
  #Remove Out of Pattern Thursday File
  filter(week_start != &amp;#39;2010-12-30&amp;#39;) %&amp;gt;%
  #Clean Up fare types and create date fields
  mutate(
    week_start = ymd(week_start),
    year_week = yearweek(week_start),
    fare_type = case_when(
      fare_type == &amp;#39;ff&amp;#39; ~ &amp;#39;full_fare&amp;#39;,
      fare_type == &amp;#39;x30_d_unl&amp;#39; ~ &amp;#39;monthly_unlimited&amp;#39;,
      fare_type == &amp;#39;x7_d_unl&amp;#39; ~ &amp;#39;weekly_unlimited&amp;#39;,
      fare_type == &amp;#39;students&amp;#39; ~ &amp;#39;student&amp;#39;,
      fare_type == &amp;#39;sen_dis&amp;#39; ~ &amp;#39;seniors&amp;#39;,
      TRUE ~ &amp;#39;other&amp;#39;
    )
  ) %&amp;gt;% 
  group_by(week_start, year_week, key,  fare_type) %&amp;gt;% 
  # Drop all the groupings during summary
  summarize(fares = sum(fares),  .groups = &amp;#39;drop&amp;#39;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now the data set has gone from 7,244,430 rows to 3,702.&lt;/p&gt;
&lt;p&gt;To be able to use the &lt;code&gt;fable&lt;/code&gt; package to do forecasting, the data needs to be in the &lt;code&gt;tsibble&lt;/code&gt; format. This constructor takes a “key” and an “index” parameter. The “key” is the grouping factor, which in this case is &lt;em&gt;fare_type&lt;/em&gt;, and the “index” is the time parameter, which will be the &lt;em&gt;year_week&lt;/em&gt; field.&lt;/p&gt;
&lt;p&gt;Then, to build the “hierarchical” structure into the data, the &lt;code&gt;aggregate_key&lt;/code&gt; function from &lt;code&gt;fabletools&lt;/code&gt; is used. Specifying that the structure is aggregated over the fare types by adding up the fares allows for forecast reconciliation later, ensuring that the forecast outputs are coherent.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;dt_ts &amp;lt;- tsibble(dt_by_fare, key = fare_type, index = year_week) %&amp;gt;% 
  aggregate_key(fare_type, fares = sum(fares))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;dt_ts&lt;/code&gt; data set is now 628 rows greater than the &lt;code&gt;dt_by_fare&lt;/code&gt; data set. This is because of the aggregated layer that was generated from &lt;code&gt;aggregate_key()&lt;/code&gt;. The 628 is the number of distinct weeks in the data.&lt;/p&gt;
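&lt;p&gt;That bookkeeping can be verified directly (a hypothetical check; the numbers come from the post itself):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Sketch: the rows added by aggregate_key() should equal the number of weeks
nrow(dt_ts) - nrow(dt_by_fare)    # one &amp;lt;aggregated&amp;gt; row per week
n_distinct(dt_by_fare$year_week)  # 628 distinct weeks per the post&lt;/code&gt;&lt;/pre&gt;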
&lt;p&gt;If continuing down the forecasting path there would eventually be an error during the forecast step due to a missing value in the initial time series. The &lt;code&gt;scan_gaps()&lt;/code&gt; function from &lt;code&gt;tsibble&lt;/code&gt; will look for implicit missing observations (gaps in the index). The &lt;code&gt;count_gaps()&lt;/code&gt; function will also provide a similar summary.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;scan_gaps(dt_ts) %&amp;gt;% 
  count(year_week) %&amp;gt;%
  kable()&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;year_week&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;n&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2011 W18&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2013 W16&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;7&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The function shows that I’m missing the data for the 18th week of 2011 and the 16th week of 2013. At first I thought this was a problem with my earlier data processing, but when visiting the &lt;a href=&#34;http://web.mta.info/developers/fare.html&#34;&gt;MTA website&lt;/a&gt; those files are actually missing.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;missing.png&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Notice that the file for May 21st, 2011 is not listed. Same with May 4th, 2013.&lt;/p&gt;
&lt;p&gt;To get around this issue, I need to first turn the implicit missings into explicit NAs. This can be done with &lt;code&gt;tsibble&lt;/code&gt;’s &lt;code&gt;fill_gaps()&lt;/code&gt; function which adds in &lt;em&gt;NA&lt;/em&gt;s for the missing dates.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;dt_ts &amp;lt;- dt_ts  %&amp;gt;% 
  group_by_key() %&amp;gt;% 
  fill_gaps()


dt_ts %&amp;gt;% 
  head() %&amp;gt;% 
  kable()&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;year_week&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;fare_type&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;fares&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2011 W18&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;full_fare&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2013 W16&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;full_fare&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2010 W21&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;full_fare&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;11545507&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2010 W22&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;full_fare&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;12580200&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2010 W23&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;full_fare&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;12820291&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2010 W24&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;full_fare&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;12707781&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Notice that the two missing dates now appear. However, the forecasting is also going to have problems with the &lt;em&gt;NA&lt;/em&gt; values. So I’ll need to fill in a value. For simplicity, I’m going to use &lt;code&gt;tidyr&lt;/code&gt;’s &lt;code&gt;fill&lt;/code&gt; function and just use the previous value.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;dt_ts &amp;lt;- dt_ts %&amp;gt;% 
  arrange(year_week) %&amp;gt;% 
  fill(fares, .direction = &amp;#39;down&amp;#39;)

dt_ts %&amp;gt;% 
  filter(year_week %in% c(yearweek(&amp;#39;2011 W17&amp;#39;), 
                          yearweek(&amp;#39;2011 W18&amp;#39;), 
                          yearweek(&amp;#39;2011 W19&amp;#39;)
                          ),
         fare_type == &amp;#39;full_fare&amp;#39;
           ) %&amp;gt;% 
  arrange(fare_type) %&amp;gt;% 
  kable()&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;year_week&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;fare_type&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;fares&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2011 W17&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;full_fare&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;13795196&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2011 W18&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;full_fare&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;13795196&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2011 W19&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;full_fare&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;13794517&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;
&lt;div id=&#34;forecasting&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Forecasting&lt;/h2&gt;
&lt;p&gt;The objective of this post is to determine which types of Subway fares have been most affected by COVID. In order to do this I’ll consider 2010-2019 to be the pre-COVID period on which the forecasting model will be built, and then I’ll forecast 2020 through June 2022 and compare to the actuals.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;fable&lt;/code&gt; package uses the &lt;code&gt;model()&lt;/code&gt; function to set and fit forecasts. In this case I’m creating a forecast named &lt;em&gt;base&lt;/em&gt; and using an ARIMA model on the univariate time series for fares. If I had wanted to use Exponential Smoothing I would just change &lt;code&gt;ARIMA()&lt;/code&gt; to &lt;code&gt;ETS()&lt;/code&gt;. So in short, &lt;code&gt;fable&lt;/code&gt; provides a simple mechanism to fit forecasts.&lt;/p&gt;
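&lt;p&gt;As a sketch of how easily models can be swapped or compared side-by-side (illustrative only; the post itself fits just the ARIMA model):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Sketch: fitting multiple candidate models at once in fable
candidate_fits &amp;lt;- dt_ts %&amp;gt;% 
  filter(year(year_week) &amp;lt; 2020) %&amp;gt;% 
  model(
    arima = ARIMA(fares),
    ets = ETS(fares)
  )&lt;/code&gt;&lt;/pre&gt;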
&lt;p&gt;As it presently stands, the &lt;strong&gt;base&lt;/strong&gt; forecast for the aggregate time series does not have to match the total of the individual series. The &lt;code&gt;reconcile()&lt;/code&gt; function lets you choose the method by which the forecasts across the key structure of the data will be made to agree.&lt;/p&gt;
&lt;p&gt;In this example, I’m trying out:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Bottom-Up: Make the aggregate level equal the sum of the individuals&lt;/li&gt;
&lt;li&gt;Top-Down: Make the individual forecasts equal the aggregate series&lt;/li&gt;
&lt;li&gt;Min Trace: Reconciliation using the minimum trace combination method, which looks to &lt;a href=&#34;https://otexts.com/fpp3/reconciliation.html&#34;&gt;minimize the forecast variances of the set of coherent forecasts&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;fit &amp;lt;- dt_ts %&amp;gt;% 
  filter(year(year_week)&amp;lt; 2020) %&amp;gt;%
  model(base = ARIMA(fares))%&amp;gt;%
  reconcile(bottom_up = bottom_up(base),
            top_down = top_down(base),
            min_trace = min_trace(base, &amp;quot;mint_shrink&amp;quot;))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;fit&lt;/code&gt; object now contains four types of forecasts (base, bottom_up, top_down, min_trace) for each fare type and for the aggregation of the fare types.&lt;/p&gt;
&lt;p&gt;Forecasting the 2020+ data is handled by the &lt;code&gt;forecast()&lt;/code&gt; function. The &lt;code&gt;fit&lt;/code&gt; object is piped into &lt;code&gt;forecast()&lt;/code&gt; and the 2020+ data gets passed into the &lt;em&gt;new_data&lt;/em&gt; argument.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;fc &amp;lt;- fit %&amp;gt;% 
  forecast(new_data = dt_ts %&amp;gt;% filter(year(year_week) &amp;gt;= 2020))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;fc&lt;/code&gt; object now contains the four forecasts for each fare type and the aggregate forecast for the last 2.5 years of data. This can be displayed with the &lt;code&gt;autoplot()&lt;/code&gt; function.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;autoplot(fc, dt_ts %&amp;gt;% ungroup(), level = NULL) + 
  facet_wrap(~fare_type, scales = &amp;quot;free_y&amp;quot;) + 
  scale_y_continuous(labels = scales::comma_format()) + 
  labs(color = &amp;quot;&amp;quot;, x = &amp;quot;Date&amp;quot; ,y = &amp;quot;Number of Fares&amp;quot;) + 
  theme(
    legend.position = &amp;#39;bottom&amp;#39;,
    axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1)
  )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2022/08/24/exploring-types-of-subway-fares-with-hierarchical-forecasting/index_files/figure-html/visual-1.png&#34; width=&#34;100%&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;so-did-the-forecasts-reconcile-correctly&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;So did the forecasts reconcile correctly?&lt;/h2&gt;
&lt;p&gt;Since this post is about Hierarchical Time Series it will be important to check to see if the reconciliation works. In the following chart, I will add up the fare type forecasts for each of the four forecasting models and compare them to the aggregate forecast. For simplicity I will just choose a single data point.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;fc %&amp;gt;% filter(year_week == yearweek(&amp;#39;2020 W01&amp;#39;)) %&amp;gt;%
  as_tibble() %&amp;gt;% 
  transmute(fare_type = if_else(
    is_aggregated(fare_type), &amp;#39;aggregated&amp;#39;, as.character(fare_type)),
    year_week, model = .model, forecast = .mean) %&amp;gt;% 
  spread(model, forecast) %&amp;gt;% 
  group_by(is_aggregated = ifelse(fare_type == &amp;#39;aggregated&amp;#39;, 
                                  &amp;#39;Top-Level&amp;#39;, 
                                  &amp;#39;Sum of Components&amp;#39;)) %&amp;gt;% 
  summarize(across(where(is.numeric), sum)) %&amp;gt;% 
  gather(model, value, -is_aggregated) %&amp;gt;% 
  ggplot(aes(x = model, y = value, fill = is_aggregated)) + 
    geom_col(position = &amp;#39;dodge&amp;#39;) + 
    geom_text(aes(label = paste0(round(value/1e6, 1), &amp;quot;MM&amp;quot;)), vjust = 0,
              position = position_dodge(width = 1)) +
    coord_cartesian(ylim = c(30e6, 30.6e6)) + 
    scale_y_continuous(labels = function(x){paste0(x/1e6, &amp;quot;MM&amp;quot;)}) + 
    scale_fill_viridis_d(option = &amp;quot;C&amp;quot;, begin = .2, end = .8) + 
    labs(title = &amp;quot;Comparing Different Reconciliation Methods&amp;quot;,
         subtitle = &amp;quot;Week 1 2020&amp;quot;,
         caption = &amp;#39;NOTE: y-axis does NOT start at 0&amp;#39;,
         x = &amp;quot;Reconcilation Method&amp;quot;, y = &amp;quot;Total # of Fares&amp;quot;,
         fill = &amp;quot;&amp;quot;) + 
    cowplot::theme_cowplot() + 
    theme(
      axis.text.y = element_blank(),
      axis.ticks.y = element_blank(),
      axis.line.y = element_blank(),
      legend.position = &amp;#39;bottom&amp;#39;,
      legend.direction = &amp;#39;horizontal&amp;#39;
    )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2022/08/24/exploring-types-of-subway-fares-with-hierarchical-forecasting/index_files/figure-html/checking_reconciliation-1.png&#34; width=&#34;100%&#34; /&gt;
In the &lt;em&gt;base&lt;/em&gt; (unreconciled) model the top-level time series is 500K fares higher than the sum of the various fare types. However, we want the forecasts to be consistent with each other, and that’s exactly what we see in the three reconciled models. In the bottom-up model, the “top-level” is scaled down to match the sum of the fare types. In top-down the sum of components is scaled up to match the “top-level”. And min_trace is somewhere in-between.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;how-much-did-each-fare-type-recovery-to-pre-covid-levels&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;How much did each Fare Type recover to Pre-COVID levels?&lt;/h2&gt;
&lt;p&gt;Now that we have the reconciled forecasts, we can actually do the analysis to determine which Fare Types have recovered the most and least relative to pre-COVID levels. This will be done using the maximum available date in the data set and the min_trace forecast.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;bind_rows(
  dt_ts %&amp;gt;% 
    filter(year_week == max(year_week)) %&amp;gt;% 
    as_tibble() %&amp;gt;%
    transmute(fare_type =  if_else(is_aggregated(fare_type), 
                                   &amp;#39;All Fares&amp;#39;, 
                                   as.character(fare_type)), 
              time = &amp;quot;actuals&amp;quot;, 
              fares),
  fc %&amp;gt;% 
    as_tibble() %&amp;gt;% 
    filter(year_week == max(year_week), .model == &amp;quot;min_trace&amp;quot;) %&amp;gt;% 
    as_tibble() %&amp;gt;%
    transmute(fare_type =  if_else(is_aggregated(fare_type), 
                                   &amp;#39;All Fares&amp;#39;, 
                                   as.character(fare_type)), 
              time = &amp;#39;projected&amp;#39;, 
              fares = .mean)
) %&amp;gt;% 
  spread(time, fares) %&amp;gt;% 
  mutate(recovery = actuals / projected) %&amp;gt;% 
  gather(period, fares, -fare_type, -recovery) %&amp;gt;%
  ggplot(aes(x = fct_reorder(fare_type, -fares), y = fares, fill = fct_rev(period))) + 
    geom_col(position = &amp;#39;dodge&amp;#39;) + 
    geom_text(aes(label = paste0(round(fares/1e6, 1), &amp;quot;MM&amp;quot;)), vjust = 0,
              position = position_dodge(width = .9), size = 3) + 
    stat_summary(
      aes(x = fare_type, y = fares),
      geom = &amp;#39;label&amp;#39;,
      inherit.aes = F,
      fontface = &amp;#39;bold&amp;#39;, fill = &amp;#39;lightgrey&amp;#39;, size = 3,
      fun.data = function(x){
        return(data.frame(y = max(x)+8e6,
                          label = paste0((min(x)/max(x)) %&amp;gt;% percent,
                          &amp;quot;\nRecovered&amp;quot;)))
      }
    )  + 
    labs(title = &amp;quot;Actuals vs. Projected Subway Fares&amp;quot;,
         subtitle = &amp;quot;% Recovered is difference between Actual and Projected&amp;quot;,
         caption = &amp;quot;Comparing W24 2022 Data&amp;quot;,
         x = &amp;quot;&amp;quot;,
         y = &amp;quot;# of Fares&amp;quot;,
         fill = &amp;quot;&amp;quot;) + 
    scale_fill_viridis_d(option = &amp;quot;C&amp;quot;, begin = .2, end = .8) + 
    #This link was dope https://stackoverflow.com/questions/22945651/remove-space-between-plotted-data-and-the-axes
    scale_y_continuous(expand = expansion(mult = c(0, .12))) + 
    cowplot::theme_cowplot() + 
    theme(
      legend.position = &amp;#39;bottom&amp;#39;,
      legend.direction = &amp;#39;horizontal&amp;#39;,
      axis.text.y = element_blank(),
      axis.ticks.y = element_blank(),
      axis.line.y = element_blank(),
      axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1)
    )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2022/08/24/exploring-types-of-subway-fares-with-hierarchical-forecasting/index_files/figure-html/recovery-1.png&#34; width=&#34;100%&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Overall, the forecast shows that subway fares have only recovered to 40% of pre-COVID levels. The fare types that have recovered the most are the Student and Senior cards, which makes sense as schools are generally back to in-person instruction. The fare type that has recovered the least is the monthly unlimited card, which also makes sense, as hybrid work arrangements make paying for a full month of unlimited rides a less valuable proposition.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;appendix-measuring-forecast-accuracy&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Appendix: Measuring Forecast Accuracy&lt;/h2&gt;
&lt;p&gt;To end this post, it’s worthwhile to show how I would measure forecast accuracy. The &lt;code&gt;accuracy()&lt;/code&gt; function from &lt;code&gt;fabletools&lt;/code&gt; makes it very easy to compute forecast accuracy metrics: just pass in the forecast, the actuals, and a list of metrics, and you get a tibble back.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;####Appendix: Forecast Accuracy
fc %&amp;gt;%
  accuracy(
    data = dt_ts,
    measures = list(rmse = RMSE, mase = MASE, mape = MAPE)
  ) %&amp;gt;%
  filter(.model == &amp;#39;min_trace&amp;#39;) %&amp;gt;% 
  arrange(mape) %&amp;gt;% 
  kable()&lt;/code&gt;&lt;/pre&gt;
&lt;table style=&#34;width:100%;&#34;&gt;
&lt;colgroup&gt;
&lt;col width=&#34;15%&#34; /&gt;
&lt;col width=&#34;27%&#34; /&gt;
&lt;col width=&#34;9%&#34; /&gt;
&lt;col width=&#34;16%&#34; /&gt;
&lt;col width=&#34;15%&#34; /&gt;
&lt;col width=&#34;16%&#34; /&gt;
&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;.model&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;fare_type&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;.type&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;rmse&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;mase&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;mape&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;min_trace&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;seniors&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Test&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;497395.6&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;9.090555&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;169.1391&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;min_trace&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;full_fare&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Test&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;7735877.4&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;11.608419&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;199.1030&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;min_trace&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;other&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Test&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1835483.2&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;5.458558&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;241.3594&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;min_trace&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;&lt;aggregated&gt;&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Test&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;20683422.4&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;12.284181&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;260.5088&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;min_trace&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;weekly_unlimited&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Test&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;4266233.0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;8.342853&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;295.2711&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;min_trace&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;monthly_unlimited&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Test&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;5788133.3&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;12.763524&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;489.1300&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;min_trace&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;student&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Test&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;703170.5&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.469493&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;43077.3022&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;That said, since we’re trying to predict “what if COVID didn’t happen”, I don’t expect these forecasts to perform very well.&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>How much has COVID cost the NYC Subway system in &#34;lost fares&#34;?</title>
      <link>https://jlaw.netlify.app/2022/07/13/how-much-has-covid-cost-the-nyc-subway-system-in-lost-fares/</link>
      <pubDate>Wed, 13 Jul 2022 00:00:00 +0000</pubDate>
      <guid>https://jlaw.netlify.app/2022/07/13/how-much-has-covid-cost-the-nyc-subway-system-in-lost-fares/</guid>
      <description>


&lt;p&gt;With things in NYC beginning to return to normal after two years of COVID, I found myself thinking about how much money was lost in Subway fares during the 2+ years when many people were working from home. Seeing an opportunity to mess around with some forecasting packages, I set out to determine: “how much money in lost rides has COVID cost the MTA?”&lt;/p&gt;
&lt;p&gt;For this post, I’ll be using the &lt;code&gt;modeltime&lt;/code&gt; &lt;a href=&#34;https://www.business-science.io/code-tools/2020/06/29/introducing-modeltime.html?utm_content=buffer86bae&amp;amp;utm_medium=social&amp;amp;utm_source=twitter.com&amp;amp;utm_campaign=buffer&#34;&gt;package&lt;/a&gt; from Business-Science.io, which integrates time series modeling into the tidymodels ecosystem, making it easy to fit multiple candidate time series models and choose the best one.&lt;/p&gt;
&lt;div id=&#34;libraries&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Libraries&lt;/h2&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;### Data Manipulation Packages
library(timetk)
library(tidyverse)
library(scales)
library(lubridate)

# Modeling Ecosystem
library(modeltime) 
library(tidymodels) 
library(treesnip) 

### Model Packages
library(catboost)
library(prophet)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;data&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Data&lt;/h2&gt;
&lt;p&gt;For this project, I’ll be using the MTA’s weekly &lt;a href=&#34;http://web.mta.info/developers/fare.html&#34;&gt;fare data&lt;/a&gt;, which contains the number of swipes for each fare type at each station. I’d previously scraped data from this website in a &lt;a href=&#34;https://jlaw.netlify.app/2020/09/07/covid-19s-impact-on-the-nyc-subway-system/&#34;&gt;prior blog post&lt;/a&gt;, so I won’t go through the methodology again.&lt;/p&gt;
&lt;p&gt;Since I don’t need station-level or fare-type granularity for this project, I’m going to aggregate the data set to the weekly level.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;dt &amp;lt;- readRDS(&amp;#39;data/mta_data.RDS&amp;#39;) %&amp;gt;% 
  group_by(week_start) %&amp;gt;% 
  summarize(fares = sum(fares))&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;methodology&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Methodology&lt;/h2&gt;
&lt;p&gt;The dataset contains the weekly number of subway swipes from May 2010 through June 2022. To determine the number of “lost fares”, I’m going to build a forecast of the number of swipes from 2020 onwards and use the residuals between the forecast and the actual data to determine the number of “lost swipes”. I don’t reasonably expect any model to accurately predict 2020 onwards, but I still want to ensure the model is reasonable, so I will train on data from 2010 through 2018 and validate on the 2019 data, which should look similar to 2018.&lt;/p&gt;
&lt;p&gt;Based on the validation set I will choose the best model and then using that model I will forecast 2020, 2021, and 2022.&lt;/p&gt;
&lt;p&gt;Ultimately this test plan looks as follows:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;dt %&amp;gt;% 
  mutate(lbl = case_when(
    week_start &amp;lt; ymd(20190101) ~ &amp;quot;a) Train&amp;quot;,
    year(week_start) == 2019 ~ &amp;#39;b) Validate&amp;#39;,
    year(week_start) &amp;gt;= 2020 ~ &amp;#39;c) Test&amp;#39;
  ), 
  total_fares = fares) %&amp;gt;% 
  ggplot(aes(x = week_start)) + 
  geom_line(data = dt, aes(y = fares), color = &amp;#39;grey60&amp;#39;) + 
  geom_line(aes(y = fares, color = lbl)) + 
  labs(title = &amp;#39;Testing Plan for Forecasting&amp;#39;,
       x = &amp;quot;Date&amp;quot;, y = &amp;quot;# of Metrocard Swipes&amp;quot;,
       color = &amp;quot;&amp;quot;) + 
  scale_y_continuous(labels = comma) + 
  facet_wrap(~lbl, nrow = 3) + 
  cowplot::theme_cowplot()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2022/07/13/how-much-has-covid-cost-the-nyc-subway-system-in-lost-fares/index_files/figure-html/test_plan-1.png&#34; width=&#34;100%&#34; /&gt;&lt;/p&gt;
&lt;p&gt;In order to split the data, I’m going to first chop off the 2020+ data into a test dataframe:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;test &amp;lt;- dt %&amp;gt;% filter(year(week_start) &amp;gt;= 2020)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And then use &lt;em&gt;timetk&lt;/em&gt;’s &lt;code&gt;time_series_split&lt;/code&gt; to create the sets that will be used for model development and validation:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;splits &amp;lt;- time_series_split(
  dt %&amp;gt;% filter(year(week_start) &amp;lt; 2020) %&amp;gt;% arrange(week_start),
  assess = 52, cumulative = T)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;assess&lt;/code&gt; option tells the function to put the last 52 weeks of data into the validation set, and the &lt;code&gt;cumulative&lt;/code&gt; option tells it to use all of the remaining data for the training set.&lt;/p&gt;
&lt;p&gt;The training data runs from 2010-05-29 to 2018-12-29 and the validation data runs from 2019-01-05 to 2019-12-28.&lt;/p&gt;
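&lt;p&gt;These date ranges can be pulled straight from the split object (a quick sanity check against the &lt;code&gt;splits&lt;/code&gt; object created above):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Sanity check: date ranges of the training and validation sets
range(training(splits)$week_start)
range(testing(splits)$week_start)&lt;/code&gt;&lt;/pre&gt;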
&lt;/div&gt;
&lt;div id=&#34;modeling&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Modeling&lt;/h2&gt;
&lt;p&gt;The modeling process will use the recipe / workflow pattern from the &lt;code&gt;tidymodels&lt;/code&gt; ecosystem, with add-on packages like &lt;code&gt;modeltime&lt;/code&gt; and &lt;code&gt;treesnip&lt;/code&gt; providing extensions for time series and additional ML algorithms. For a more detailed look at tidymodels, check out my post on &lt;a href=&#34;https://jlaw.netlify.app/2022/01/24/predicting-when-kickers-get-iced-with-tidymodels/&#34;&gt;icing the kicker&lt;/a&gt;.&lt;/p&gt;
&lt;div id=&#34;pre-preprocessing&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Pre-Processing&lt;/h3&gt;
&lt;p&gt;The first step with tidymodels is to set up a recipe for pre-processing and feature engineering. It tells the ecosystem the model formula and what new features to create or remove. In the below recipe, I’m setting the &lt;em&gt;week_start&lt;/em&gt; field to be an “id” as opposed to a predictor because some of the models we’ll try (CatBoost, XGBoost) can’t handle dates. The “id” role means that the data remains but isn’t used in the model.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;step_timeseries_signature()&lt;/code&gt; creates a large number of features based on the date field, such as fields for year, half, quarter, month, day of the year, day of week, etc. It also includes a number of time-based fields that won’t be necessary since this data is at a weekly grain. These unnecessary fields are removed with &lt;code&gt;step_rm()&lt;/code&gt;. Finally, all categorical variables are one-hot encoded into indicator variables using &lt;code&gt;step_dummy()&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;rec &amp;lt;- recipe(fares ~ ., data = training(splits)) %&amp;gt;%
  update_role(week_start, new_role = &amp;#39;id&amp;#39;) %&amp;gt;% 
  step_timeseries_signature(week_start) %&amp;gt;% 
  step_rm(matches(&amp;quot;(.iso$)|(am.pm$)|(.xts$)|(hour)|(minute)|(second)|(wday)&amp;quot;)) %&amp;gt;% 
  step_dummy(all_nominal(), one_hot = TRUE)&lt;/code&gt;&lt;/pre&gt;
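&lt;p&gt;To see exactly which engineered features survive these steps, the recipe can be prepped and applied to the training data (a quick sketch using the &lt;code&gt;rec&lt;/code&gt; object defined above):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Apply the recipe to the training data and inspect the resulting columns
rec %&amp;gt;% 
  prep() %&amp;gt;% 
  bake(new_data = NULL) %&amp;gt;% 
  glimpse()&lt;/code&gt;&lt;/pre&gt;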
&lt;/div&gt;
&lt;div id=&#34;model-fitting&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Model Fitting&lt;/h3&gt;
&lt;p&gt;To determine the best model for the forecasting portion I’m going to look at 6 different modeling workflows:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;CatBoost&lt;/li&gt;
&lt;li&gt;XGBoost&lt;/li&gt;
&lt;li&gt;Auto Arima with XGBoosted Errors&lt;/li&gt;
&lt;li&gt;Exponential Smoothing&lt;/li&gt;
&lt;li&gt;Prophet&lt;/li&gt;
&lt;li&gt;Prophet with XGBoosted Errors&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;For each of these models, I will set up a workflow, add the proper model using the parsnip interface, add the recipe, and fit the model. For the last 4 models, I re-update the role of the &lt;em&gt;week_start&lt;/em&gt; field back to a predictor from an id since those models can use the date field directly.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;catboost_wf &amp;lt;- workflow() %&amp;gt;% 
  add_model(
    boost_tree(mode = &amp;#39;regression&amp;#39;) %&amp;gt;% 
      set_engine(&amp;#39;catboost&amp;#39;)
  ) %&amp;gt;% 
  add_recipe(rec) %&amp;gt;% 
  fit(training(splits))

xgboost_wf &amp;lt;- workflow() %&amp;gt;% 
  add_model(
    boost_tree(mode = &amp;#39;regression&amp;#39;) %&amp;gt;% 
      set_engine(&amp;#39;xgboost&amp;#39;)
  ) %&amp;gt;% 
  add_recipe(rec) %&amp;gt;% 
  fit(training(splits))

arima_boosted_wf &amp;lt;- workflow() %&amp;gt;% 
  add_model(
    arima_boost() %&amp;gt;%
      set_engine(engine = &amp;quot;auto_arima_xgboost&amp;quot;)
  ) %&amp;gt;%
  add_recipe(rec %&amp;gt;% update_role(week_start, new_role = &amp;quot;predictor&amp;quot;)) %&amp;gt;%
  fit(training(splits))


ets_wf &amp;lt;- workflow() %&amp;gt;% 
  add_model(
    exp_smoothing() %&amp;gt;%
      set_engine(engine = &amp;quot;ets&amp;quot;)
  ) %&amp;gt;%
  add_recipe(rec %&amp;gt;% update_role(week_start, new_role = &amp;quot;predictor&amp;quot;)) %&amp;gt;%
  fit(training(splits))

prophet_wf &amp;lt;- workflow() %&amp;gt;%
  add_model(
    prophet_reg(seasonality_yearly = TRUE) %&amp;gt;% 
      set_engine(engine = &amp;#39;prophet&amp;#39;)
  ) %&amp;gt;%
  add_recipe(rec %&amp;gt;% update_role(week_start, new_role = &amp;quot;predictor&amp;quot;)) %&amp;gt;%
  fit(training(splits))

prophet_boost_wf &amp;lt;- workflow() %&amp;gt;%
  add_model(
    prophet_boost(seasonality_yearly = TRUE) %&amp;gt;%
      set_engine(&amp;#39;prophet_xgboost&amp;#39;)
  ) %&amp;gt;% 
  add_recipe(rec %&amp;gt;% update_role(week_start, new_role = &amp;quot;predictor&amp;quot;)) %&amp;gt;%
  fit(training(splits))&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;validating&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Validating&lt;/h3&gt;
&lt;p&gt;To apply these models to the validation set and calculate accuracy, I use the &lt;code&gt;modeltime&lt;/code&gt; package’s &lt;code&gt;modeltime_table()&lt;/code&gt; and &lt;code&gt;modeltime_calibrate()&lt;/code&gt; functions. The former organizes the various workflows into a single object, and the latter computes accuracy metrics based on the validation set of 2019 data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;calibration_table &amp;lt;- modeltime_table(
  catboost_wf,
  xgboost_wf,
  arima_boosted_wf,
  ets_wf,
  prophet_wf,
  prophet_boost_wf
) %&amp;gt;% 
  modeltime_calibrate(testing(splits))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I can then assess the accuracy measures for each model using &lt;code&gt;modeltime_accuracy()&lt;/code&gt;, sorting by root mean squared error (RMSE), which will be the accuracy metric I use to determine the best model.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;calibration_table %&amp;gt;%
  modeltime_accuracy() %&amp;gt;%
  arrange(rmse) %&amp;gt;% 
  select(.model_desc, where(is.double)) %&amp;gt;%
  mutate(across(where(is.double), 
                ~if_else(.x &amp;lt; 10, round(.x, 2), round(.x, 0)))) %&amp;gt;%
  kable()&lt;/code&gt;&lt;/pre&gt;
&lt;table style=&#34;width:100%;&#34;&gt;
&lt;colgroup&gt;
&lt;col width=&#34;45%&#34; /&gt;
&lt;col width=&#34;11%&#34; /&gt;
&lt;col width=&#34;7%&#34; /&gt;
&lt;col width=&#34;7%&#34; /&gt;
&lt;col width=&#34;8%&#34; /&gt;
&lt;col width=&#34;11%&#34; /&gt;
&lt;col width=&#34;7%&#34; /&gt;
&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;.model_desc&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;mae&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;mape&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;mase&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;smape&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;rmse&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;rsq&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;PROPHET W/ XGBOOST ERRORS&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;947892&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;3.00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.59&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;3.04&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1271929&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.76&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;PROPHET W/ REGRESSORS&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1150569&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;3.75&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.71&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;3.76&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1515907&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.63&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;XGBOOST&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1292654&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;4.08&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.80&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;4.25&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1888753&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.59&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;ARIMA(0,1,2) W/ XGBOOST ERRORS&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1515049&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;4.81&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.94&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;4.96&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1946304&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.55&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;CATBOOST&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1900626&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;6.31&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.17&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;6.20&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2362239&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.62&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;ETS(A,N,A)&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1930427&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;6.25&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.19&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;6.31&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2436219&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.08&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;From the accuracy table, the best model was the Prophet w/ XGBoosted Errors.&lt;/p&gt;
&lt;p&gt;The calibration table contains a column called &lt;code&gt;.calibration_data&lt;/code&gt; that holds the validation-set predictions, which I can use to visualize the forecasted fit vs. the actuals for the 2019 data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;calibration_table %&amp;gt;% 
    select(.model_desc, .calibration_data) %&amp;gt;% 
    unnest(cols = c(.calibration_data)) %&amp;gt;% 
    filter(year(week_start)==2019, .model_desc != &amp;#39;ACTUAL&amp;#39;) %&amp;gt;% 
    ggplot(aes(x = week_start)) + 
      geom_line(aes(y = .actual), color = &amp;#39;black&amp;#39;, lty = 2) + 
      geom_line(aes(y = .prediction, color = .model_desc), lwd = 1.2) + 
      facet_wrap(~.model_desc, ncol = 2) + 
      scale_color_discrete(guide = &amp;quot;none&amp;quot;) +
      scale_y_continuous(label = comma) + 
      labs(title = &amp;quot;Comparing Models to Validation Set of 2019&amp;quot;, 
           subtitle = &amp;quot;Dashed Line is Actuals&amp;quot;,
           y = &amp;quot;# of Fares&amp;quot;,
           x = &amp;quot;Date&amp;quot;) + 
      theme_bw() + 
      theme(
        axis.text.x = element_text(angle = 60, hjust = .5, vjust = .5)
      )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2022/07/13/how-much-has-covid-cost-the-nyc-subway-system-in-lost-fares/index_files/figure-html/validation_viz-1.png&#34; width=&#34;100%&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;forecasting-the-covid-time-period&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Forecasting the COVID Time Period&lt;/h2&gt;
&lt;p&gt;Now that I’ve identified the Prophet w/ XGBoosted Errors model as the best model, it’s time to retrain it one final time on both the training and validation data before using it to forecast the COVID time period. The refitting on all of the data is handled by &lt;code&gt;modeltime_refit()&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;refit_tbl &amp;lt;- calibration_table %&amp;gt;% 
    filter(.model_desc ==&amp;#39;PROPHET W/ XGBOOST ERRORS&amp;#39; ) %&amp;gt;%
    modeltime_refit(data = bind_rows(training(splits), testing(splits)))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Finally, the forecasting onto the test set is handled by &lt;code&gt;modeltime_forecast()&lt;/code&gt;. The test data and actuals are passed into the function so that the actuals and forecast can be directly compared.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;final_fcst &amp;lt;- refit_tbl %&amp;gt;% 
  modeltime_forecast(
    new_data = test,
    actual_data = dt,
    keep_data = TRUE
  )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The forecast vs. the actuals can be visualized with &lt;code&gt;plot_modeltime_forecast()&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;final_fcst %&amp;gt;% 
  plot_modeltime_forecast(.conf_interval_show = T, .interactive = F) + 
  scale_y_continuous(labels = comma)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2022/07/13/how-much-has-covid-cost-the-nyc-subway-system-in-lost-fares/index_files/figure-html/final_viz-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;calculating-the-lost-fare-amount&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Calculating the “Lost Fare” Amount&lt;/h2&gt;
&lt;p&gt;Now, with the forecast computed, I can determine the number of lost fares by comparing the forecasted number of fares to the actual number. Then, to convert that to an amount of money, I’m using a simplistic assumption that each fare would have cost about 2 dollars. This is a heuristic since there are &lt;a href=&#34;https://new.mta.info/fares&#34;&gt;many different kinds of fares&lt;/a&gt; in the NYC Subway system with different costs: a full fare costs $2.75, a monthly unlimited card costs $127, and for seniors and other reduced-fare riders the cost is half price at $1.35.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;loss_amt &amp;lt;- final_fcst %&amp;gt;% 
  filter(.model_desc == &amp;#39;PROPHET W/ XGBOOST ERRORS&amp;#39;,
         .index &amp;gt;= min(test$week_start)) %&amp;gt;% 
  mutate(diff = fares-.value,
         diff_lo = fares - .conf_lo,
         diff_hi = fares - .conf_hi,
         fare = diff * 2.00,
         fare_lo = diff_lo * 2.00,
         fare_high = diff_hi* 2.00) %&amp;gt;% 
  arrange(.index) %&amp;gt;%
  mutate(fares_lost = cumsum(fare),
         fares_lost_lo = cumsum(fare_lo),
         fares_lost_high = cumsum(fare_high)) &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Using the confidence intervals of the predictions, I can form a range for how much the MTA has suffered in “lost fares” since 2020.&lt;/p&gt;
&lt;p&gt;Ultimately, this analysis shows that the MTA has likely lost around $5B in fares since 2020, though it could be as low as $4.4B or as high as $5.7B.&lt;/p&gt;
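&lt;p&gt;For reference, those headline numbers come from the final row of the &lt;code&gt;loss_amt&lt;/code&gt; data frame built above. The cumulative columns are negative (actuals minus forecast), so a quick sketch flips them to positive losses:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# The most recent week holds the cumulative totals
loss_amt %&amp;gt;% 
  slice_max(.index, n = 1) %&amp;gt;% 
  transmute(point_estimate = dollar(-fares_lost),
            from_conf_lo = dollar(-fares_lost_lo),
            from_conf_hi = dollar(-fares_lost_high))&lt;/code&gt;&lt;/pre&gt;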
&lt;p&gt;The cumulative loss can be visualized as follows:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;loss_amt %&amp;gt;% 
  filter(.index &amp;gt;= ymd(20200101)) %&amp;gt;%
  ggplot(aes(x = .index, y = fares_lost*-1)) + 
    geom_line() + 
    geom_ribbon(aes(ymin = fares_lost_lo*-1, ymax = fares_lost_high*-1), alpha = .3,
                fill = &amp;#39;darkgreen&amp;#39;) + 
    scale_y_continuous(labels = dollar, breaks = seq(0, 6e9, 1e9), expand = c(0 ,0)) + 
    labs(title = &amp;quot;Cumulative Amount of Subway Fares Lost Since 2020&amp;quot;,
         x = &amp;quot;Date&amp;quot;, y = &amp;quot;$ Lost&amp;quot;, caption = &amp;quot;$ Lost = Projected Swipes Lost * $2.00&amp;quot;) + 
    cowplot::theme_cowplot() + 
    theme(
      plot.title.position = &amp;#39;plot&amp;#39;,
      panel.grid.major.y = element_line(color = &amp;#39;grey45&amp;#39;)
    )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2022/07/13/how-much-has-covid-cost-the-nyc-subway-system-in-lost-fares/index_files/figure-html/loss_viz-1.png&#34; width=&#34;100%&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;concluding-thoughts&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Concluding Thoughts&lt;/h2&gt;
&lt;p&gt;While things are starting to return to “normalcy” on the NYC subway, it’s still far from what it was in pre-COVID times. Based on this forecasting exercise, it’s estimated that the MTA has already lost around $5B in “lost fares”, and that number continues to grow: while things are recovering, there’s still a long way to go.&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>ML for the Lazy: Can AutoML Beat My Model?</title>
      <link>https://jlaw.netlify.app/2022/05/03/ml-for-the-lazy-can-automl-beat-my-model/</link>
      <pubDate>Tue, 03 May 2022 00:00:00 +0000</pubDate>
      <guid>https://jlaw.netlify.app/2022/05/03/ml-for-the-lazy-can-automl-beat-my-model/</guid>
      <description>
&lt;script src=&#34;https://jlaw.netlify.app/2022/05/03/ml-for-the-lazy-can-automl-beat-my-model/index_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;p&gt;In this fourth (and hopefully final) entry in my “Icing the Kicker” series of posts, I’m going to jump back to the &lt;a href=&#34;https://jlaw.netlify.app/2022/01/24/predicting-when-kickers-get-iced-with-tidymodels/&#34;&gt;first post&lt;/a&gt;, where I used &lt;code&gt;tidymodels&lt;/code&gt; to predict whether or not a kick attempt would be iced. However, this time I’ll see whether using the &lt;code&gt;h2o&lt;/code&gt; AutoML feature and the &lt;code&gt;SuperLearner&lt;/code&gt; package can improve the predictive performance of my initial model.&lt;/p&gt;
&lt;div id=&#34;why-is-this-ml-for-the-lazy&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Why is this ML for the Lazy?&lt;/h3&gt;
&lt;p&gt;I called this ML for the Lazy because, for the &lt;code&gt;h2o&lt;/code&gt; and &lt;code&gt;SuperLearner&lt;/code&gt; models, I’m going to do absolutely nothing but let the algorithms run. No tuning, no nothing.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;the-data&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;The Data&lt;/h3&gt;
&lt;p&gt;The data for this exercise was initially described in the &lt;a href=&#34;https://jlaw.netlify.app/2022/01/24/predicting-when-kickers-get-iced-with-tidymodels/&#34;&gt;first&lt;/a&gt; post in the series. During this post I will construct three models:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Replicating the final model from the original &lt;a href=&#34;https://jlaw.netlify.app/2022/01/24/predicting-when-kickers-get-iced-with-tidymodels/&#34;&gt;post&lt;/a&gt; using &lt;code&gt;tidymodels&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;A version using &lt;code&gt;h2o&lt;/code&gt;’s &lt;a href=&#34;https://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html&#34;&gt;autoML&lt;/a&gt; function&lt;/li&gt;
&lt;li&gt;A version using the &lt;a href=&#34;https://github.com/ecpolley/SuperLearner&#34;&gt;&lt;code&gt;SuperLearner&lt;/code&gt;&lt;/a&gt; package for ensembles&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidyverse) #Data Manipulation
library(tidymodels) # Data Splitting and Replicating Initial Model
library(themis) # For SMOTE Recipe
library(h2o) # For AutoML
library(SuperLearner) # For SuperLearner Ensemble.
library(here) # For path simplification&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I’ll read in the data from the first post. This code block should look familiar from the other three posts in the series.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;fg_data &amp;lt;- readRDS(here(&amp;#39;content/post/2022-01-17-predicting-when-kickers-get-iced-with-tidymodels/data/fg_attempts.RDS&amp;#39;)) %&amp;gt;%
  transmute(
    regulation_time_remaining,
    attempted_distance,
    drive_is_home_offense = if_else(drive_is_home_offense, 1, 0),
    score_diff,
    prior_miss = if_else(prior_miss==1, &amp;#39;yes&amp;#39;, &amp;#39;no&amp;#39;),
    offense_win_prob,
    is_overtime = if_else(period &amp;gt; 4, 1, 0),
    is_iced = factor(is_iced, levels = c(1, 0), labels = c(&amp;#39;iced&amp;#39;, &amp;#39;not_iced&amp;#39;))
  )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The next step is to replicate how the data was divided in the training and testing sets from the initial post. This is done using the &lt;code&gt;initial_split()&lt;/code&gt; function from &lt;code&gt;rsample&lt;/code&gt;. The seed will be set to what it originally was so that the same training and testing splits are used.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(20220102)
ice_split &amp;lt;- initial_split(fg_data, strata = is_iced)
ice_train &amp;lt;- training(ice_split)
ice_test &amp;lt;- testing(ice_split)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;model-1-tidymodels&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Model #1: TidyModels&lt;/h3&gt;
&lt;p&gt;To replicate the results from &lt;code&gt;tidymodels&lt;/code&gt;, I will first reconstruct the pre-processing recipe that used one-hot encoding to turn categorical variables into numeric features and applied the SMOTE algorithm to deal with the severe class imbalance in the data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;rec_smote &amp;lt;- recipe(is_iced ~ ., data = ice_train) %&amp;gt;%
  step_dummy(all_nominal_predictors(), one_hot = T) %&amp;gt;%
  step_smote(is_iced) &lt;/code&gt;&lt;/pre&gt;
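&lt;p&gt;As a quick check that SMOTE is doing its job, prepping the recipe and counting the outcome classes should show a balanced training set (a sketch using the &lt;code&gt;rec_smote&lt;/code&gt; object above):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# After SMOTE, the iced / not_iced classes should be balanced
rec_smote %&amp;gt;% 
  prep() %&amp;gt;% 
  bake(new_data = NULL) %&amp;gt;% 
  count(is_iced)&lt;/code&gt;&lt;/pre&gt;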
&lt;p&gt;In that post the final model was a tuned XGBoost model with the following parameters:
&lt;img src=&#34;params.PNG&#34; /&gt;&lt;/p&gt;
&lt;p&gt;So rather than set up a tuning grid, I’ll just build a spec that includes those exact parameters and combine it with the recipe in a workflow:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;orig_wf &amp;lt;- workflow(rec_smote,
               boost_tree(
                 &amp;quot;classification&amp;quot;,
                 mtry = 5,
                 trees = 1641,
                 min_n = 19,
                 tree_depth = 8,
                 learn_rate = 0.007419,
                 loss_reduction = 9.425834,
                 sample_size = 0.9830687,
                 stop_iter = 21
               )) &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The next step is to fit the model on the full training data and predict on the testing data using the &lt;code&gt;last_fit()&lt;/code&gt; function. I will have the function return testing-set metrics for Precision, Recall, and F1 Score.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;orig_results &amp;lt;- last_fit(orig_wf, 
                         ice_split, 
                         metrics=metric_set(f_meas, precision, recall))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The performance metrics can be extracted using the &lt;code&gt;collect_metrics()&lt;/code&gt; function and then I’ll do some post-processing to put it in a format that will eventually be combined with the other models:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;orig_metrics &amp;lt;- collect_metrics(orig_results) %&amp;gt;% 
  transmute(
    label = &amp;quot;Original Model&amp;quot;,
    metric = .metric,
    estimate = .estimate
  )

kable(orig_metrics)&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;label&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;metric&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;estimate&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Original Model&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;f_meas&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.4324324&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Original Model&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;precision&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.3428571&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Original Model&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;recall&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.5853659&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;And the &lt;code&gt;collect_predictions()&lt;/code&gt; function will extract the predictions for the test set to use in a confusion matrix:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;orig_cf &amp;lt;- collect_predictions(orig_results) %&amp;gt;%
  count(is_iced, .pred_class) %&amp;gt;% 
  mutate(label = &amp;quot;Original Model&amp;quot;, .before = 1) %&amp;gt;% 
  rename(pred = .pred_class)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;model-2---h2o-automl&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Model #2 - h2o AutoML&lt;/h3&gt;
&lt;p&gt;The next candidate will be h2o’s AutoML function. h2o is an open-source machine learning platform that runs in Java and has interfaces to R, among other languages. The AutoML feature will auto-magically try different models and eventually construct a leaderboard of the best ones. For this section, the blog post from &lt;a href=&#34;https://rileyking.netlify.app/post/could-automl-win-in-the-sliced-data-science-competition/&#34;&gt;Riley King&lt;/a&gt; was an inspiration, as it used AutoML to compete on data from the Sliced data science competition.&lt;/p&gt;
&lt;p&gt;In order to start using h2o I must first initialize the engine:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;h2o.init()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;h2o also has its own data format which must be used. Fortunately, it’s easy to convert a tibble to this format with &lt;code&gt;as.h2o()&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;train_data &amp;lt;- as.h2o(ice_train)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Due to how h2o is set up, I’ll need to specify the name of the dependent variable (y) as a string and provide the list of predictors as a vector of strings (x). This is most easily done prior to the function call using &lt;code&gt;setdiff()&lt;/code&gt; to remove the dependent variable from the full set of column names.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;y &amp;lt;- &amp;quot;is_iced&amp;quot;
x &amp;lt;- setdiff(names(train_data), y)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now it’s time to run the AutoML function. In the &lt;code&gt;h2o.automl()&lt;/code&gt; function I provide the name of the dependent variable, the vector of independent variables, a project name (which I believe doesn’t matter for this purpose), a boolean to tell it to try to balance classes, and a seed so that results are replicable. The final parameter I give the function is the “max_runtime_secs”. Since the algorithm will continue to spawn new models, it needs a criterion to know when to stop. For convenience, I will allow it to run for 10 minutes.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;h2oAML &amp;lt;- h2o.automl(
  y = y,
  x = x,
  training_frame = train_data,
  project_name = &amp;quot;ice_the_kicker_bakeoff&amp;quot;,
  balance_classes = T,
  max_runtime_secs = 600,
  seed = 20220425
)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;When the AutoML algorithm completes, each model that was run is placed on a leaderboard, which can be accessed by:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;leaderboard_tbl &amp;lt;- h2oAML@leaderboard %&amp;gt;% as_tibble()

leaderboard_tbl %&amp;gt;% head() %&amp;gt;% kable()&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;colgroup&gt;
&lt;col width=&#34;42%&#34; /&gt;
&lt;col width=&#34;8%&#34; /&gt;
&lt;col width=&#34;8%&#34; /&gt;
&lt;col width=&#34;8%&#34; /&gt;
&lt;col width=&#34;16%&#34; /&gt;
&lt;col width=&#34;8%&#34; /&gt;
&lt;col width=&#34;8%&#34; /&gt;
&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;model_id&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;auc&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;logloss&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;aucpr&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;mean_per_class_error&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;rmse&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;mse&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;GBM_grid_1_AutoML_1_20220430_215247_model_47&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.9193728&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.1092884&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.9953164&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.4591029&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.1759418&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0309555&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;GBM_grid_1_AutoML_1_20220430_215247_model_95&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.9186846&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.1098212&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.9953443&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.4834514&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.1766268&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0311970&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;StackedEnsemble_AllModels_4_AutoML_1_20220430_215247&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.9182852&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.1070530&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.9951190&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.4476356&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.1731510&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0299813&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;StackedEnsemble_AllModels_3_AutoML_1_20220430_215247&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.9182371&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.1072580&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.9952534&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.4525710&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.1735284&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0301121&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;GBM_grid_1_AutoML_1_20220430_215247_model_69&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.9181298&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.1097581&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.9952088&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.4819644&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.1765332&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0311640&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;StackedEnsemble_AllModels_2_AutoML_1_20220430_215247&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.9179346&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.1077711&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.9950966&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.4411767&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.1738748&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0302325&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;I can get the top model from the leaderboard by running &lt;code&gt;h2o.getModel()&lt;/code&gt; on its model id. In this case it was a Gradient Boosted Machine (GBM).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;model_names &amp;lt;- leaderboard_tbl$model_id
top_model &amp;lt;- h2o.getModel(model_names[1])&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With the model id I can also see the parameters that were used in this model.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;top_model@model$model_summary %&amp;gt;% 
  pivot_longer(cols = everything(),
               names_to = &amp;quot;Parameter&amp;quot;, values_to = &amp;quot;Value&amp;quot;) %&amp;gt;% 
  kable(align = &amp;#39;c&amp;#39;)&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;center&#34;&gt;Parameter&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;number_of_trees&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;74.000000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;number_of_internal_trees&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;74.000000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;model_size_in_bytes&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;11612.000000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;min_depth&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;3.000000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;max_depth&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;3.000000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;mean_depth&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;3.000000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;min_leaves&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;6.000000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;max_leaves&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;8.000000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;mean_leaves&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;7.851351&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;While for &lt;code&gt;tidymodels&lt;/code&gt; the &lt;code&gt;last_fit()&lt;/code&gt; function ran the model on the test set for me, for h2o I’ll need to do that myself… but it’s not that difficult. h2o has an &lt;code&gt;h2o.predict()&lt;/code&gt; function similar to R’s &lt;code&gt;predict()&lt;/code&gt; which takes in a model and data to predict on through a &lt;em&gt;newdata&lt;/em&gt; parameter. For that &lt;em&gt;newdata&lt;/em&gt; I need to convert the test data into the h2o format through &lt;code&gt;as.h2o()&lt;/code&gt;. Then I bind the predictions as a new column onto the rest of the test data so that performance statistics and confusion matrices can be generated.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;h2o_predictions &amp;lt;- h2o.predict(top_model, newdata = as.h2o(ice_test)) %&amp;gt;%
  as_tibble() %&amp;gt;%
  bind_cols(ice_test)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Similar to how I needed to do the predictions manually, I’ll also need to collect the performance metrics manually. This is also easy using the &lt;code&gt;yardstick&lt;/code&gt; package:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;h2o_metrics &amp;lt;- bind_rows(
  #Calculate Performance Metrics
  yardstick::f_meas(h2o_predictions, is_iced, predict),
  yardstick::precision(h2o_predictions, is_iced, predict),
  yardstick::recall(h2o_predictions, is_iced, predict)
) %&amp;gt;%
  # Add an id column and make it the first column
  mutate(label = &amp;quot;h2o&amp;quot;, .before = 1) %&amp;gt;% 
  # Remove the periods from column names
  rename_with(~str_remove(.x, &amp;#39;\\.&amp;#39;)) %&amp;gt;%
  # Drop the estimator column
  select(-estimator)

kable(h2o_metrics)&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;label&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;metric&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;estimate&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;h2o&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;f_meas&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.3731020&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;h2o&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;precision&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.2398884&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;h2o&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;recall&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.8390244&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Finally, I’ll compute the confusion matrix.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;h2o_cf &amp;lt;- h2o_predictions %&amp;gt;% 
  count(is_iced, pred= predict) %&amp;gt;% 
  mutate(label = &amp;quot;h2o&amp;quot;, .before = 1)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;model-3-superlearner&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Model #3: SuperLearner&lt;/h3&gt;
&lt;p&gt;The third candidate model that I’ll try is through the &lt;code&gt;SuperLearner&lt;/code&gt; package. &lt;code&gt;SuperLearner&lt;/code&gt; is an ensemble package that fits many different types of models and then, by taking a weighted combination of those models, aims for better predictive accuracy than any of the individual models alone.&lt;/p&gt;
&lt;p&gt;To use the &lt;code&gt;SuperLearner()&lt;/code&gt; function, the dependent variable Y must be provided as a numeric vector, and the predictors X must also contain only numeric data; therefore all factors are converted back to numeric.&lt;/p&gt;
&lt;p&gt;Since I’m predicting a binary outcome (whether or not a kick attempt will be iced) I specify the family as &lt;em&gt;binomial&lt;/em&gt;. Finally, the models to be combined are specified in the &lt;code&gt;SL.library&lt;/code&gt; argument. The full list of available models can be seen with the &lt;code&gt;listWrappers()&lt;/code&gt; function. However, I’m choosing a subset primarily out of convenience, mostly because I couldn’t get some of the other models (for example, bartMachine) to run properly. The models I’m including in the ensemble are a GLM, XGBoost, GLM w/ Interactions, Regularized GLM (glmnet), MARS (earth), GAM, and a Random Forest.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mod &amp;lt;- SuperLearner(
  Y = ice_train %&amp;gt;% mutate(iced = if_else(is_iced == &amp;#39;iced&amp;#39;, 1, 0)) %&amp;gt;% 
    pull(iced),
  X = ice_train %&amp;gt;% mutate(prior_miss = if_else(prior_miss == &amp;#39;yes&amp;#39;, 1, 0)) %&amp;gt;% 
    select(-is_iced) %&amp;gt;% as.data.frame,
  family = binomial(),
  SL.library = c( &amp;#39;SL.glm&amp;#39;, &amp;quot;SL.xgboost&amp;quot;, &amp;quot;SL.glm.interaction&amp;quot;, &amp;#39;SL.glmnet&amp;#39;, 
                  &amp;#39;SL.earth&amp;#39;, &amp;#39;SL.gam&amp;#39;, &amp;#39;SL.randomForest&amp;#39;)
)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Since the ultimate output of the SuperLearner is a weighted combination of those models, I can extract the weights and show which models have the most influence on the final predictions.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mod$coef %&amp;gt;% as_tibble(rownames = &amp;quot;model&amp;quot;) %&amp;gt;% 
  mutate(model = str_remove_all(model, &amp;#39;(SL\\.)|(_All)&amp;#39;)) %&amp;gt;%
  ggplot(aes(x = fct_reorder(model, -value), y = value, fill = model)) + 
    geom_col() + 
    geom_text(aes(label = value %&amp;gt;% percent), vjust = 0) + 
    scale_fill_viridis_d(option = &amp;quot;B&amp;quot;, end = .8, guide = &amp;#39;none&amp;#39;) + 
    labs(x = &amp;quot;Model&amp;quot;, y = &amp;quot;% Contribution of Model&amp;quot;, title = &amp;quot;% Contribution For Each Component to SuperLearner&amp;quot;) + 
    cowplot::theme_cowplot() + 
    theme(
      axis.text.y = element_blank(),
      axis.ticks.y = element_blank()
    )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2022/05/03/ml-for-the-lazy-can-automl-beat-my-model/index_files/figure-html/plot_weights-1.png&#34; width=&#34;100%&#34; /&gt;&lt;/p&gt;
&lt;p&gt;It appears that the Random Forest carries the largest weight, followed by the MARS model and the XGBoost model.&lt;/p&gt;
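&lt;p&gt;To make the weighting concrete, here is a toy illustration of how a SuperLearner-style ensemble combines its components. The component predictions and weights below are made up for illustration, not the fitted values from the model above:&lt;/p&gt;

```r
# Hypothetical predicted probabilities of icing from three component models
component_preds = c(randomForest = 0.60, earth = 0.40, xgboost = 0.30)

# Hypothetical ensemble weights; like mod$coef, they sum to 1
weights = c(randomForest = 0.5, earth = 0.3, xgboost = 0.2)

# The ensemble prediction is the weight-averaged combination of the components
ensemble_pred = sum(weights * component_preds)
ensemble_pred
#> [1] 0.48
```

&lt;p&gt;Models with larger weights therefore pull the final prediction further towards their own estimates.&lt;/p&gt;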
&lt;p&gt;Predicting on the test set is similar to the h2o version, except I can use the generic &lt;code&gt;predict()&lt;/code&gt; function. However, it returns a vector of probabilities of being iced rather than a label like h2o did. Therefore I need to make a judgement call on a probability cut-off for classifying an attempt as iced or not. I’ll use the incidence rate of the training data, 4.2%, as the cut-off: probabilities greater than 4.2% will be considered “iced” and those below will be “not iced”.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pred_sl = predict(mod, newdata = ice_test %&amp;gt;% 
                    mutate(prior_miss = if_else(prior_miss == &amp;#39;yes&amp;#39;, 1, 0)) %&amp;gt;%
                    select(-is_iced) %&amp;gt;% 
                    as.data.frame, type = &amp;#39;response&amp;#39;)$pred 

pred_sl &amp;lt;- ice_test %&amp;gt;%
  mutate(pred = if_else(pred_sl &amp;gt;= mean(ice_train$is_iced == &amp;#39;iced&amp;#39;), 1, 0),
         pred = factor(pred, levels = c(1, 0), labels = c(&amp;#39;iced&amp;#39;, &amp;#39;not_iced&amp;#39;)))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Similar to the above section, I’ll use &lt;code&gt;yardstick&lt;/code&gt; for the performance metrics.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;sl_metrics &amp;lt;- bind_rows(
  yardstick::f_meas(pred_sl, is_iced, pred),
  yardstick::precision(pred_sl, is_iced, pred),
  yardstick::recall(pred_sl, is_iced, pred)
) %&amp;gt;% 
  mutate(label = &amp;quot;SuperLearner&amp;quot;, .before = 1) %&amp;gt;% 
  rename_with(~str_remove(.x, &amp;#39;\\.&amp;#39;)) %&amp;gt;% 
  select(-estimator)

kable(sl_metrics)&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;label&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;metric&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;estimate&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;SuperLearner&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;f_meas&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.3389513&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;SuperLearner&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;precision&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.2097335&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;SuperLearner&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;recall&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.8829268&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;And calculate the confusion matrix.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;sl_cf &amp;lt;- pred_sl %&amp;gt;% 
  count(is_iced, pred) %&amp;gt;% 
  mutate(label = &amp;quot;SuperLearner&amp;quot;, .before = 1)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;comparing-the-three-models&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Comparing the Three Models&lt;/h3&gt;
&lt;p&gt;For each of the three models I’ve calculated Precision, Recall, and F1. I’ll combine this information in a plot so it’s easier to compare performance across the models:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;bind_rows(
  orig_metrics,
  h2o_metrics,
  sl_metrics
) %&amp;gt;% 
  ggplot(aes(x = str_wrap(label, 9), y = estimate, fill = label)) + 
    geom_col() + 
    geom_text(aes(label = estimate %&amp;gt;% percent), vjust = 1, color = &amp;#39;grey90&amp;#39;) + 
    scale_fill_viridis_d(option = &amp;quot;C&amp;quot;, end = .6, guide = &amp;#39;none&amp;#39;) + 
    facet_wrap(~metric, nrow = 1, scales = &amp;quot;free_y&amp;quot;) +
    labs(x = &amp;quot;Model&amp;quot;, y = &amp;quot;Performance Metric&amp;quot;,
         title = &amp;quot;Comparing the Performance Metrics on Test Set&amp;quot;) + 
    theme_light() + 
    theme(
      axis.text.y = element_blank(),
      axis.ticks.y = element_blank(),
      strip.text = element_text(color = &amp;#39;black&amp;#39;)
    )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2022/05/03/ml-for-the-lazy-can-automl-beat-my-model/index_files/figure-html/combined_perf-1.png&#34; width=&#34;100%&#34; /&gt;
From the perspective of the F1 score, which balances precision and recall, the original model performed best. Looking at the components, the original model had higher precision, meaning that when it predicted an iced attempt it was more likely to be right than the other models (although it was still only right about a third of the time). However, it left some true iced attempts on the table, since its recall was substantially lower than both the h2o model and the SuperLearner model.&lt;/p&gt;
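&lt;p&gt;As a quick sanity check, the F1 score is the harmonic mean of precision and recall, which can be reproduced from the original model’s reported metrics:&lt;/p&gt;

```r
# Reported test-set precision and recall for the original tidymodels model
precision = 0.3428571
recall = 0.5853659

# The F1 score is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
f1
```

&lt;p&gt;This reproduces the reported &lt;code&gt;f_meas&lt;/code&gt; of roughly 0.4324.&lt;/p&gt;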
&lt;p&gt;I can get a better view of what is and is not being predicted well by looking at each model’s confusion matrix on the test set.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;bind_rows(
  orig_cf,
  h2o_cf,
  sl_cf
) %&amp;gt;% 
  group_by(label) %&amp;gt;% 
  mutate(pct = n/sum(n)) %&amp;gt;% 
  ggplot(aes(x = is_iced, y = pred, fill = n)) + 
    geom_tile() + 
    geom_text(aes(label = glue::glue(&amp;#39;{n}\n({pct %&amp;gt;% percent})&amp;#39;)),
              color = &amp;#39;grey90&amp;#39;) + 
    facet_wrap(~label, nrow = 1) + 
    scale_fill_viridis_c(guide = &amp;#39;none&amp;#39;, end = .8) + 
    labs(x = &amp;quot;Actual Value&amp;quot;, y = &amp;quot;Predicted Value&amp;quot;,
         title = &amp;quot;Comparing Confusion Matrices&amp;quot;) + 
    theme_light() + 
    theme(
      axis.ticks = element_blank(),
      strip.text = element_text(color = &amp;#39;black&amp;#39;)
    )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2022/05/03/ml-for-the-lazy-can-automl-beat-my-model/index_files/figure-html/combined_cf-1.png&#34; width=&#34;100%&#34; /&gt;&lt;/p&gt;
&lt;p&gt;In the confusion matrices it’s much easier to see that the original model was less likely to predict iced than the other two models. This led to its higher precision but also its lower recall, as the original model missed 85 iced attempts vs. 33 for the h2o model and 24 for the SuperLearner.&lt;/p&gt;
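&lt;p&gt;Those recall numbers follow directly from the confusion counts. Assuming the test set contains 205 iced attempts (a figure implied by the reported recalls and miss counts, not stated directly), recall is simply the share of actual iced attempts that each model caught:&lt;/p&gt;

```r
# Iced attempts in the test set (implied by the reported metrics; an assumption)
total_iced = 205

# Iced attempts each model failed to flag, from the confusion matrices
missed = c(original = 85, h2o = 33, superlearner = 24)

# recall = true positives / (true positives + false negatives)
recall = (total_iced - missed) / total_iced
round(recall, 4)
```

&lt;p&gt;This reproduces the reported recalls of roughly 0.585, 0.839, and 0.883.&lt;/p&gt;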
&lt;p&gt;So which model performed the best? If I’m just going by balanced performance as measured by the F1 score, then the original model outperformed the other two. However, it’s worth thinking about whether precision or recall is more important, since that could influence how to view each model’s performance. If ensuring that all the iced kicks are captured is most important, then I should weight recall more heavily. But if I want confidence that a predicted iced kick really will be iced, I should stick with the original model.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;other-posts-in-the-icing-the-kicker-series&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Other Posts in the Icing the Kicker Series&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Part I: &lt;a href=&#34;https://jlaw.netlify.app/2022/01/24/predicting-when-kickers-get-iced-with-tidymodels/&#34;&gt;Predicting When Kickers Get Iced with {tidymodels}&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Part II: &lt;a href=&#34;https://jlaw.netlify.app/2022/02/14/does-icing-the-kicker-really-work/&#34;&gt;Does Icing the Kicker Really Work? A Causal Inference Exercise&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Part III: &lt;a href=&#34;https://jlaw.netlify.app/2022/03/13/ain-t-nothin-but-a-g-computation-and-tmle-thang-exploring-two-more-causal-inference-methods/&#34;&gt;Ain’t Nothin But A G-Computation (and TMLE) Thang: Exploring Two More Causal Inference Methods&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Ain&#39;t Nothin But A G-Computation (and TMLE) Thang: Exploring Two More Causal Inference Methods</title>
      <link>https://jlaw.netlify.app/2022/03/13/ain-t-nothin-but-a-g-computation-and-tmle-thang-exploring-two-more-causal-inference-methods/</link>
      <pubDate>Sun, 13 Mar 2022 00:00:00 +0000</pubDate>
      <guid>https://jlaw.netlify.app/2022/03/13/ain-t-nothin-but-a-g-computation-and-tmle-thang-exploring-two-more-causal-inference-methods/</guid>
      <description>
&lt;script src=&#34;https://jlaw.netlify.app/2022/03/13/ain-t-nothin-but-a-g-computation-and-tmle-thang-exploring-two-more-causal-inference-methods/index_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;p&gt;In my &lt;a href=&#34;https://jlaw.netlify.app/2022/02/14/does-icing-the-kicker-really-work/&#34;&gt;last post&lt;/a&gt; I looked at the causal effect of icing the kicker using weighting. Those results found that icing the kicker had a non-significant effect on the success of the field goal attempt with a point estimate of -2.82% (CI: -5.88%, 0.50%). In this post I will explore two other methodologies for causal inference with observational data, &lt;strong&gt;G-Computation&lt;/strong&gt; and &lt;strong&gt;Targeted Maximum Likelihood Estimation&lt;/strong&gt;. Beyond the goal of exploring new methodologies, I will see how consistent these estimates are with the prior post.&lt;/p&gt;
&lt;div id=&#34;g-computation&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;G-Computation&lt;/h2&gt;
&lt;p&gt;I first learned about G-Computation from &lt;a href=&#34;https://malco.io/&#34;&gt;Malcolm Barrett’s&lt;/a&gt; &lt;a href=&#34;https://causal-inference-r-workshop.netlify.app/07-g-computation.html&#34;&gt;Causal Inference in R workshop&lt;/a&gt;. For causal inference, the ideal goal is to see what would happen to a field goal attempt in the world where the kicker is iced vs. the world where he isn’t. However, in the real world only one of these outcomes is possible. G-Computation creates these hypothetical worlds by:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Fitting a model on observed data including treatment indicator (whether the kicker is iced) and covariates (other situational information)&lt;/li&gt;
&lt;li&gt;Creating duplicates of the data set where all observations are set to a single level of treatment (in this case make two replications of the data, one where all kicks are iced and one where all kicks are &lt;strong&gt;NOT&lt;/strong&gt; iced)&lt;/li&gt;
&lt;li&gt;Predict the FG success for these replicates&lt;/li&gt;
&lt;li&gt;Calculate Avg(Iced) - Avg(Not Iced) to obtain the causal effect.&lt;/li&gt;
&lt;li&gt;Bootstrap the entire process in order to get valid confidence intervals.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;For this exercise I won’t need any complicated packages. Using &lt;code&gt;rsample&lt;/code&gt; for bootstrapping will be as exotic as it gets.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidyverse)
library(rsample)
library(scales)
library(here)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And the data that will be used is the same from the prior two blog posts which is the 19,072 Field Goal Attempts from College Football between 2013 and 2021. For details on that data and its construction please refer to the &lt;a href=&#34;https://jlaw.netlify.app/2022/01/24/predicting-when-kickers-get-iced-with-tidymodels/&#34;&gt;first post in this series&lt;/a&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;fg_attempts &amp;lt;- readRDS(here(&amp;#39;content/post/2022-01-17-predicting-when-kickers-get-iced-with-tidymodels/data/fg_attempts.RDS&amp;#39;)) %&amp;gt;%
  transmute(
    regulation_time_remaining,
    attempted_distance,
    drive_is_home_offense = if_else(drive_is_home_offense, 1, 0),
    score_diff,
    prior_miss,
    offense_win_prob,
    is_iced = factor(is_iced, levels = c(0, 1), labels = c(&amp;#39;Not Iced&amp;#39;, &amp;#39;Iced&amp;#39;)),
    fg_made,
    id_play
  )&lt;/code&gt;&lt;/pre&gt;
&lt;div id=&#34;step-1-fit-a-model-using-all-the-data&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Step 1: Fit a model using all the data&lt;/h3&gt;
&lt;p&gt;The first step in the G-Computation process is to fit a model using all covariates and the treatment indicator against the outcome of field goal success. This will use the same covariates from the &lt;a href=&#34;https://jlaw.netlify.app/2022/02/14/does-icing-the-kicker-really-work/&#34;&gt;prior post&lt;/a&gt; which include the amount of time remaining in regulation, the distance of the field goal attempt, whether the kicking team is on offense or defense, the squared difference in score, whether the kicking team had previously missed in the game, and the pre-game win probability for the kicking team. The treatment effect is &lt;code&gt;is_iced&lt;/code&gt; which reflects whether the defense called timeout before the kick and the outcome &lt;code&gt;fg_made&lt;/code&gt; is whether the kick was successful.&lt;/p&gt;
&lt;p&gt;Since I’m predicting a binary outcome, I will use logistic regression.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;m &amp;lt;- glm(fg_made ~ is_iced + regulation_time_remaining + attempted_distance + 
           drive_is_home_offense + I(score_diff^2) + prior_miss + offense_win_prob,
         data = fg_attempts,
         family = &amp;#39;binomial&amp;#39;)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;step-2-create-duplicates-of-the-data-set&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Step 2: Create Duplicates of the Data Set&lt;/h3&gt;
&lt;p&gt;In order to create the hypothetical worlds of what would have happened if kicks were iced or not iced, I’ll create duplicates of the data: one where all the data is “iced” and one where all the data is “not iced”. The effect that I am interested in is the “average treatment effect on the treated” (ATT): for the kicks that were actually iced, what would have happened if they weren’t? Therefore for these duplicates I’ll use only the observations where “icing the kicker” actually occurred and create one duplicate version where &lt;code&gt;is_iced&lt;/code&gt; is set to “Not Iced”.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;replicated_data &amp;lt;- bind_rows(
  # Get all of the Iced Kicks
  fg_attempts %&amp;gt;% filter(is_iced == &amp;#39;Iced&amp;#39;),
  # Get all of the Iced Kicks But set the treatment field to &amp;quot;Not Iced&amp;quot;
  fg_attempts %&amp;gt;% filter(is_iced == &amp;#39;Iced&amp;#39;) %&amp;gt;% mutate(is_iced = &amp;#39;Not Iced&amp;#39;)
)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;step-3-predict-the-probability-of-success-for-the-duplicates&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Step 3: Predict the Probability of Success for the Duplicates&lt;/h3&gt;
&lt;p&gt;This is straightforward using the &lt;code&gt;predict()&lt;/code&gt; function. Using &lt;code&gt;type = &#39;response&#39;&lt;/code&gt; returns probabilities rather than predicted log-odds.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;replicated_data &amp;lt;- replicated_data %&amp;gt;%
  mutate(p_success = predict(m, newdata = ., type = &amp;#39;response&amp;#39;))&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;step-4-use-the-predicted-successes-to-calculate-the-causal-effect&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Step 4: Use the Predicted Successes to Calculate the Causal Effect&lt;/h3&gt;
&lt;p&gt;From the predicted data I can calculate the average predicted success for the “Iced” and “Not Iced” versions and take the difference to obtain the causal effect of icing the kicker.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;replicated_data %&amp;gt;% 
  group_by(is_iced) %&amp;gt;% 
  # Get average success by group
  summarize(p_success = mean(p_success)) %&amp;gt;%
  spread(is_iced, p_success) %&amp;gt;%
  # Calculate the causal effect
  mutate(ATT = `Iced` - `Not Iced`) %&amp;gt;%
  # Pretty format using percentages
  mutate(across(everything(), scales::percent_format(accuracy = .01))) %&amp;gt;% 
  kable()&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;Iced&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;Not Iced&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;ATT&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;67.66%&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;70.12%&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;-2.46%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;From this calculation, the average treatment effect on the treated is -2.46%, which is very close to the -2.82% from the &lt;a href=&#34;https://jlaw.netlify.app/2022/02/14/does-icing-the-kicker-really-work/&#34;&gt;previous post&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;But to know if this effect would be statistically significant I’ll need to bootstrap the whole process.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;step-5-bootstrap-the-process-to-obtain-confidence-intervals&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Step 5: Bootstrap the Process to Obtain Confidence Intervals&lt;/h3&gt;
&lt;p&gt;To bootstrap the process using &lt;code&gt;rsample&lt;/code&gt;, I first need to create a function that takes a split from the bootstrap samples and returns the ATT estimate calculated in Step 4 above:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;g_computation &amp;lt;- function(split, ...){
  .df &amp;lt;- analysis(split)
  
  m &amp;lt;- glm(fg_made ~ is_iced + regulation_time_remaining + attempted_distance + 
                   drive_is_home_offense + I(score_diff^2) + prior_miss + offense_win_prob,
                 data = .df,
                 family = &amp;#39;binomial&amp;#39;)
  
  return(
    # Create the Replicated Data
    bind_rows(
        fg_attempts %&amp;gt;% filter(is_iced == &amp;#39;Iced&amp;#39;),
        fg_attempts %&amp;gt;% filter(is_iced == &amp;#39;Iced&amp;#39;) %&amp;gt;% mutate(is_iced = &amp;#39;Not Iced&amp;#39;)
    ) %&amp;gt;% 
      # Calculate predictions on replicated data
      mutate(p_success = predict(m, newdata = ., type = &amp;#39;response&amp;#39;)) %&amp;gt;%
      group_by(is_iced) %&amp;gt;%
      summarize(p_success = mean(p_success)
      ) %&amp;gt;%
      spread(is_iced, p_success) %&amp;gt;%
      # Calculate ATT
      mutate(ATT = `Iced` - `Not Iced`)
  )
  
} &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now that the entire process has been wrapped in a function, I need to create the bootstrap samples that will be passed into it. In the next code block I create 1,000 bootstrap samples and use &lt;code&gt;purrr::map&lt;/code&gt; to pass each sample into the function to obtain the ATTs.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(20220313)

g_results &amp;lt;- bootstraps(fg_attempts, 1000, apparent = T) %&amp;gt;% 
  mutate(results = map(splits, g_computation)) %&amp;gt;%
  select(results, id) %&amp;gt;%
  unnest(results)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Finally, I’ll use the 2.5th and 97.5th percentiles of the ATT distribution returned from the bootstrap process to form the confidence interval and the mean to form the point estimate.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;g_results %&amp;gt;% 
  summarize(.lower = quantile(ATT, .025),
            .estimate = mean(ATT),
            .upper = quantile(ATT, .975)) %&amp;gt;%
  mutate(across(everything(), scales::percent_format(accuracy = .01))) %&amp;gt;%
  kable()&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;.lower&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;.estimate&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;.upper&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;-5.66%&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;-2.51%&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;0.59%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Using G-Computation I reach the same conclusion that icing the kicker &lt;strong&gt;does not&lt;/strong&gt; have a statistically significant effect on FG success. The point estimate of the effect of icing the kicker is -2.51% (95% CI: -5.66% to 0.59%).&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;targeted-maximum-liklihood-estimation-tmle&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Targeted Maximum Likelihood Estimation (TMLE)&lt;/h2&gt;
&lt;p&gt;In the weighting approach from the &lt;a href=&#34;https://jlaw.netlify.app/2022/02/14/does-icing-the-kicker-really-work/&#34;&gt;previous post&lt;/a&gt; and in the G-Computation section above, there is a fundamental assumption that all of the covariates that influence both icing the kicker and field goal success have been controlled for in the model. In practice, this is difficult to know for sure. Here, weather and wind direction/speed probably have an influence that is not captured in this data because it was difficult to obtain. Targeted Maximum Likelihood Estimation (TMLE) is one of the “doubly robust” estimators that provide some safety against model misspecification.&lt;/p&gt;
&lt;p&gt;In TMLE, one model estimates the probability that a kick attempt is iced (the propensity score) and a second model estimates how icing the kicker and the other covariates affect the success of that kick (the outcome model). These models are combined to produce estimates of the average treatment effect on the treated. The “doubly robust” aspect is that the result will be a consistent estimator as long as at least one of the two models is correctly specified.&lt;/p&gt;
&lt;p&gt;For more information on TMLE as a doubly robust estimator, check out the excellent blog post from &lt;a href=&#34;https://multithreaded.stitchfix.com/blog/2021/07/23/double-robust-estimator/&#34;&gt;StitchFix&lt;/a&gt;, which is a large influence on this section.&lt;/p&gt;
&lt;p&gt;To run TMLE in R, I’ll use the &lt;code&gt;tmle&lt;/code&gt; package, which estimates the propensity score and outcome model using the &lt;code&gt;SuperLearner&lt;/code&gt; package, which stacks models to create an ensemble. As the blog states, &#34;using SuperLearner is a way to hedge your bets rather than putting all your money on a single model, drastically reducing the chances we’ll suffer from model misspecification&#34; since SuperLearner can leverage many different types of sub-models.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tmle)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;tmle()&lt;/code&gt; function will run the procedure to estimate the various causal effect statistics. The parameters of the function are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Y&lt;/em&gt; is whether the Field Goal attempt was successful&lt;/li&gt;
&lt;li&gt;&lt;em&gt;A&lt;/em&gt; is the treatment indicator of whether the Field Goal attempt was iced or not&lt;/li&gt;
&lt;li&gt;&lt;em&gt;W&lt;/em&gt; is a data set of covariates&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Q.SL.library&lt;/em&gt; is the set of sub-models that &lt;code&gt;SuperLearner&lt;/code&gt; will use to estimate the outcome model&lt;/li&gt;
&lt;li&gt;&lt;em&gt;g.SL.library&lt;/em&gt; is the set of sub-models that &lt;code&gt;SuperLearner&lt;/code&gt; will use to estimate the propensity scores&lt;/li&gt;
&lt;li&gt;&lt;em&gt;V&lt;/em&gt; is the number of folds to use for the cross-validation to determine the optimal models&lt;/li&gt;
&lt;li&gt;&lt;em&gt;family&lt;/em&gt; is set to ‘binomial’ since the outcome data is binary&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The types of sub-models under consideration are GLMs, GLMs with interactions, GAMs, and polynomial MARS models. The complete list of models available in SuperLearner can be found &lt;a href=&#34;https://cran.r-project.org/web/packages/SuperLearner/vignettes/Guide-to-SuperLearner.html#review-available-models&#34;&gt;here&lt;/a&gt; or via the &lt;code&gt;listWrappers()&lt;/code&gt; function.&lt;/p&gt;
&lt;p&gt;If you actually know the forms of the propensity model or outcome model, they can be specified directly using &lt;code&gt;gform&lt;/code&gt; or &lt;code&gt;Qform&lt;/code&gt;. But I’ll be letting SuperLearner do all the work.&lt;/p&gt;
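&lt;p&gt;As a hedged sketch of that option: &lt;code&gt;Qform&lt;/code&gt; and &lt;code&gt;gform&lt;/code&gt; take regression formulas as character strings, with the outcome model written in terms of &lt;code&gt;A&lt;/code&gt; and the columns of &lt;code&gt;W&lt;/code&gt;. The covariate choices below are hypothetical and for illustration only; this post lets SuperLearner choose instead.&lt;/p&gt;

```r
# Hypothetical, illustration-only model forms for tmle(); this post
# lets SuperLearner estimate both models instead.
Qform = "Y ~ A + attempted_distance + regulation_time_remaining"  # outcome model
gform = "A ~ attempted_distance + regulation_time_remaining"      # propensity model

# tmle_model = tmle(Y = ..., A = ..., W = ..., Qform = Qform,
#                   gform = gform, family = "binomial")
```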
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;tmle_model &amp;lt;- tmle(Y=fg_attempts$fg_made
                   ,A=if_else(fg_attempts$is_iced==&amp;#39;Iced&amp;#39;, 1, 0)
                   ,W=fg_attempts %&amp;gt;% 
                     transmute(regulation_time_remaining, attempted_distance,
                            drive_is_home_offense, score_diff=score_diff^2,
                            prior_miss, offense_win_prob)
                   ,Q.SL.library=c(&amp;quot;SL.glm&amp;quot;, &amp;quot;SL.glm.interaction&amp;quot;, &amp;quot;SL.gam&amp;quot;, &amp;quot;SL.polymars&amp;quot;)
                   ,g.SL.library=c(&amp;quot;SL.glm&amp;quot;, &amp;quot;SL.glm.interaction&amp;quot;, &amp;quot;SL.gam&amp;quot;, &amp;quot;SL.polymars&amp;quot;)
                   ,V=10
                   ,family=&amp;quot;binomial&amp;quot;
)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The TMLE object contains the results for a variety of causal effects (ATE, ATT, etc.). Since all the comparisons I’ve looked at use the ATT, I’ll do that again here.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;tibble(
  .lower = tmle_model$estimates$ATT$CI[1],
  .estimate = tmle_model$estimates$ATT$psi,
  .upper = tmle_model$estimates$ATT$CI[2]
) %&amp;gt;%
  mutate(across(everything(), scales::percent_format(accuracy = .01))) %&amp;gt;%
  kable()&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;.lower&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;.estimate&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;.upper&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;-5.77%&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;-2.63%&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;0.52%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The results of the TMLE are consistent with the conclusion that the effect of icing the kicker is not statistically significant. From a point-estimate perspective, though, the TMLE procedure estimates an effect of -2.63%, slightly larger than G-Computation but smaller than weighting.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;summary&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Summary&lt;/h2&gt;
&lt;p&gt;Throughout this post and the &lt;a href=&#34;https://jlaw.netlify.app/2022/02/14/does-icing-the-kicker-really-work/&#34;&gt;last post&lt;/a&gt; I’ve calculated the Average Treatment Effect on the Treated using three different methodologies, the results of which are:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2022/03/13/ain-t-nothin-but-a-g-computation-and-tmle-thang-exploring-two-more-causal-inference-methods/index_files/figure-html/summary_results-1.png&#34; width=&#34;100%&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Altogether, the three methodologies align on the conclusion that icing the kicker does not have a significant effect on the outcome of the field goal, and even if it did (based on the point estimates) the effect would be quite small.&lt;/p&gt;
&lt;div id=&#34;other-posts-in-the-icing-the-kicker-series&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Other Posts in the Icing the Kicker Series&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Part I: &lt;a href=&#34;https://jlaw.netlify.app/2022/01/24/predicting-when-kickers-get-iced-with-tidymodels/&#34;&gt;Predicting When Kickers Get Iced with {tidymodels}&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Part II: &lt;a href=&#34;https://jlaw.netlify.app/2022/02/14/does-icing-the-kicker-really-work/&#34;&gt;Does Icing the Kicker Really Work? A Causal Inference Exercise&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Does Icing the Kicker Really Work? A Causal Inference Exercise</title>
      <link>https://jlaw.netlify.app/2022/02/14/does-icing-the-kicker-really-work/</link>
      <pubDate>Mon, 14 Feb 2022 00:00:00 +0000</pubDate>
      <guid>https://jlaw.netlify.app/2022/02/14/does-icing-the-kicker-really-work/</guid>
      <description>
&lt;script src=&#34;https://jlaw.netlify.app/2022/02/14/does-icing-the-kicker-really-work/index_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;p&gt;In my prior post I &lt;a href=&#34;https://jlaw.netlify.app/2022/01/24/predicting-when-kickers-get-iced-with-tidymodels/&#34;&gt;looked at when coaches were most likely to ice a kicker&lt;/a&gt;, where ‘icing a kicker’ means the defense calls a timeout right before the offense is about to kick a field goal. In this post, I’ll apply causal inference techniques to see &lt;strong&gt;whether icing the kicker even matters&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;In a perfect world we’d run an A/B test or some type of experiment where some games could be played with the ability to ice the kicker and some without. However, this is infeasible because fairness requires that all games be played under the same rules.&lt;/p&gt;
&lt;p&gt;It would also be easy to just compare the field goal percentage when a kicker was iced vs. when they weren’t. However, this would suffer from a lot of selection bias, as the situations where a kicker is likely to be iced are different from the typical field goal attempt.&lt;/p&gt;
&lt;p&gt;This analysis will follow a similar flow to the &lt;a href=&#34;https://github.com/malcolmbarrett/causal_inference_r_workshop&#34;&gt;Causal Inference in R Workshop&lt;/a&gt; conducted by &lt;a href=&#34;https://www.lucymcgowan.com/&#34;&gt;Lucy D’Agostino McGowan&lt;/a&gt; and &lt;a href=&#34;https://malco.io/&#34;&gt;Malcolm Barrett&lt;/a&gt;. I’ll be reusing the data from my &lt;a href=&#34;https://jlaw.netlify.app/2022/01/24/predicting-when-kickers-get-iced-with-tidymodels/&#34;&gt;prior post&lt;/a&gt;, which contains 19,072 field goal attempts from College Football between 2013 and 2021. For details on that data and its construction, please refer to the &lt;a href=&#34;https://jlaw.netlify.app/2022/01/24/predicting-when-kickers-get-iced-with-tidymodels/&#34;&gt;prior post&lt;/a&gt;.&lt;/p&gt;
&lt;div id=&#34;what-have-other-analyses-shown&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;What Have Other Analyses Shown?&lt;/h2&gt;
&lt;p&gt;This is not the first time this question has been asked:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A &lt;a href=&#34;https://www.footballstudyhall.com/2018/11/24/18110091/is-icing-the-kicker-really-a-thing&#34;&gt;Football Study Hall&lt;/a&gt; article found that “Looking at all field goal attempts in Q4 and OT, there were 1070 attempts. 761, or 71% of them were good. Given the condition of whether a kicker was iced or not does seem to make a difference. For kickers who were iced, the number of made field goals drops to 123/196, or 63%, while the kickers who were not iced was 638/874, or 73% were good.”&lt;/li&gt;
&lt;li&gt;An &lt;a href=&#34;https://www.sbnation.com/2017/11/27/16707624/chris-boswell-icing-the-kicker-nfl-coaches-timeout&#34;&gt;SB Nation&lt;/a&gt; article, which was actually more of a game recap, has the subtitle “Icing the kicker doesn’t work, but coaches keep on doing it anyway.”&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;http://grantland.com/features/icing-kicker-work/&#34;&gt;Grantland&lt;/a&gt; found that icing the kicker doesn’t work.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://www.espn.com/blog/statsinfo/post/_/id/34217/icing-the-kicker-remains-ineffective-practice&#34;&gt;ESPN&lt;/a&gt; found that “attempts to ice a kicker at the end of a game actually increased the kicker’s chances of success”&lt;/li&gt;
&lt;li&gt;Finally, &lt;a href=&#34;https://mixpanel.com/blog/nfl-data-icing-the-kicker/&#34;&gt;Mixpanel&lt;/a&gt; found “it seems kickers that have been iced are a whole 0.1% less likely to make their kick successfully”&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Generally, the consensus seems to be that the effect of icing the kicker ranges from ineffective to potentially harmful to the kicking team.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;what-would-a-naive-analysis-show&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;What Would a Naive Analysis Show?&lt;/h2&gt;
&lt;p&gt;I’ll start with a really naive analysis: just looking at the data as-is, comparing iced to non-iced kicks. First I’ll load the libraries for this analysis and read in the field goal attempt data from my prior post.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidyverse)
library(here)
library(gtsummary)
library(broom)
library(survey)
library(rsample)
library(smd)


fg_attempts &amp;lt;- readRDS(here(&amp;#39;content/post/2022-01-17-predicting-when-kickers-get-iced-with-tidymodels/data/fg_attempts.RDS&amp;#39;)) %&amp;gt;%
  transmute(
    regulation_time_remaining,
    attempted_distance,
    drive_is_home_offense = if_else(drive_is_home_offense, 1, 0),
    score_diff,
    prior_miss,
    offense_win_prob,
    is_iced = factor(is_iced, levels = c(0, 1), labels = c(&amp;#39;Not Iced&amp;#39;, &amp;#39;Iced&amp;#39;)),
    fg_made,
    id_play
  )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As a reminder, the data contains 19,072 field goal attempts from College Football FBS Regular Season games between 2013 and 2021. For the very naive analysis I’ll just look at the data as-is.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;fg_attempts %&amp;gt;% 
  group_by(`Was Iced` = is_iced) %&amp;gt;% 
  summarize(
    `FG Attempts` = n(),
    `FG Made` = sum(fg_made == T),
    `FG %` = mean(fg_made) %&amp;gt;% scales::percent(accuracy = .1)
  ) %&amp;gt;% 
  knitr::kable(align=&amp;#39;c&amp;#39;)&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;center&#34;&gt;Was Iced&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;FG Attempts&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;FG Made&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;FG %&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;Not Iced&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;18268&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;13882&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;76.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;Iced&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;804&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;544&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;67.7%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;From this naive view of the data, 76.0% of non-iced kicks were converted vs. 67.7% of iced kicks, a difference of 8.3%! This seems like a decently large difference (and a test of proportions on it would be statistically significant).&lt;/p&gt;
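&lt;p&gt;As a quick sanity check on that naive gap, a two-sample test of proportions can be run on the raw counts from the table above (13,882 of 18,268 made when not iced vs. 544 of 804 when iced). This is a minimal sketch using base R’s &lt;code&gt;prop.test()&lt;/code&gt;, not part of the causal analysis.&lt;/p&gt;

```r
# Two-sample test of proportions on the naive counts from the table above
made     = c(13882, 544)   # field goals made: Not Iced, Iced
attempts = c(18268, 804)   # field goal attempts: Not Iced, Iced

pt = prop.test(made, attempts)

# Naive difference in FG%: roughly 8.3 percentage points
round(unname(pt$estimate[1] - pt$estimate[2]), 3)
pt$p.value  # far below 0.05, so the naive gap is statistically significant
```

The tiny p-value is exactly why the naive comparison looks convincing; the rest of the post shows why it is misleading.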
&lt;/div&gt;
&lt;div id=&#34;a-more-robust-solution&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;A more robust solution&lt;/h2&gt;
&lt;p&gt;But comparing iced kicks to non-iced kicks as-is doesn’t make much sense. As many of the articles referenced above state, icing the kicker is done to increase the pressure in already high-pressure situations, such as when the kick would determine who wins the game. These situations are vastly different from the lower-pressure situations where the majority of field goals occur.&lt;/p&gt;
&lt;p&gt;An easy way to determine whether there are differences in the factors that might lead to a field goal being iced is by looking at the &lt;em&gt;standardized mean differences&lt;/em&gt; of the other features in the data set to see the extent of the difference between the iced and non-iced attempts.&lt;/p&gt;
&lt;p&gt;I’ll be using the &lt;code&gt;tbl_summary()&lt;/code&gt; function from &lt;code&gt;{{gtsummary}}&lt;/code&gt; to create this table. In the code below, I split the data by &lt;em&gt;is_iced&lt;/em&gt;, tell the function to show the mean and standard deviation for all continuous variables and the percentage for binary variables, and round each value to two digits. The standardized mean difference is added through the &lt;code&gt;add_difference()&lt;/code&gt; function.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;tbl_summary(
  fg_attempts,
  by = &amp;#39;is_iced&amp;#39;,
  include = c(regulation_time_remaining, attempted_distance, 
              drive_is_home_offense, score_diff, prior_miss, offense_win_prob, 
              is_iced),
  statistic = list(all_continuous() ~ &amp;quot;{mean} ({sd})&amp;quot;,
                   all_dichotomous() ~ &amp;quot;{p}%&amp;quot;),
  digits = list(everything() ~ 2)
) %&amp;gt;% 
  add_difference(everything() ~ &amp;quot;smd&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;div id=&#34;gcyubuiepn&#34; style=&#34;overflow-x:auto;overflow-y:auto;width:auto;height:auto;&#34;&gt;
&lt;style&gt;html {
  font-family: -apple-system, BlinkMacSystemFont, &#39;Segoe UI&#39;, Roboto, Oxygen, Ubuntu, Cantarell, &#39;Helvetica Neue&#39;, &#39;Fira Sans&#39;, &#39;Droid Sans&#39;, Arial, sans-serif;
}

#gcyubuiepn .gt_table {
  display: table;
  border-collapse: collapse;
  margin-left: auto;
  margin-right: auto;
  color: #333333;
  font-size: 16px;
  font-weight: normal;
  font-style: normal;
  background-color: #FFFFFF;
  width: auto;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #A8A8A8;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #A8A8A8;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
}

#gcyubuiepn .gt_heading {
  background-color: #FFFFFF;
  text-align: center;
  border-bottom-color: #FFFFFF;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
}

#gcyubuiepn .gt_title {
  color: #333333;
  font-size: 125%;
  font-weight: initial;
  padding-top: 4px;
  padding-bottom: 4px;
  border-bottom-color: #FFFFFF;
  border-bottom-width: 0;
}

#gcyubuiepn .gt_subtitle {
  color: #333333;
  font-size: 85%;
  font-weight: initial;
  padding-top: 0;
  padding-bottom: 6px;
  border-top-color: #FFFFFF;
  border-top-width: 0;
}

#gcyubuiepn .gt_bottom_border {
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
}

#gcyubuiepn .gt_col_headings {
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
}

#gcyubuiepn .gt_col_heading {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: normal;
  text-transform: inherit;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: bottom;
  padding-top: 5px;
  padding-bottom: 6px;
  padding-left: 5px;
  padding-right: 5px;
  overflow-x: hidden;
}

#gcyubuiepn .gt_column_spanner_outer {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: normal;
  text-transform: inherit;
  padding-top: 0;
  padding-bottom: 0;
  padding-left: 4px;
  padding-right: 4px;
}

#gcyubuiepn .gt_column_spanner_outer:first-child {
  padding-left: 0;
}

#gcyubuiepn .gt_column_spanner_outer:last-child {
  padding-right: 0;
}

#gcyubuiepn .gt_column_spanner {
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  vertical-align: bottom;
  padding-top: 5px;
  padding-bottom: 5px;
  overflow-x: hidden;
  display: inline-block;
  width: 100%;
}

#gcyubuiepn .gt_group_heading {
  padding: 8px;
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  text-transform: inherit;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: middle;
}

#gcyubuiepn .gt_empty_group_heading {
  padding: 0.5px;
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  vertical-align: middle;
}

#gcyubuiepn .gt_from_md &gt; :first-child {
  margin-top: 0;
}

#gcyubuiepn .gt_from_md &gt; :last-child {
  margin-bottom: 0;
}

#gcyubuiepn .gt_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  margin: 10px;
  border-top-style: solid;
  border-top-width: 1px;
  border-top-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: middle;
  overflow-x: hidden;
}

#gcyubuiepn .gt_stub {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  text-transform: inherit;
  border-right-style: solid;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
  padding-left: 12px;
}

#gcyubuiepn .gt_summary_row {
  color: #333333;
  background-color: #FFFFFF;
  text-transform: inherit;
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
}

#gcyubuiepn .gt_first_summary_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
}

#gcyubuiepn .gt_grand_summary_row {
  color: #333333;
  background-color: #FFFFFF;
  text-transform: inherit;
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
}

#gcyubuiepn .gt_first_grand_summary_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  border-top-style: double;
  border-top-width: 6px;
  border-top-color: #D3D3D3;
}

#gcyubuiepn .gt_striped {
  background-color: rgba(128, 128, 128, 0.05);
}

#gcyubuiepn .gt_table_body {
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
}

#gcyubuiepn .gt_footnotes {
  color: #333333;
  background-color: #FFFFFF;
  border-bottom-style: none;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
}

#gcyubuiepn .gt_footnote {
  margin: 0px;
  font-size: 90%;
  padding: 4px;
}

#gcyubuiepn .gt_sourcenotes {
  color: #333333;
  background-color: #FFFFFF;
  border-bottom-style: none;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
}

#gcyubuiepn .gt_sourcenote {
  font-size: 90%;
  padding: 4px;
}

#gcyubuiepn .gt_left {
  text-align: left;
}

#gcyubuiepn .gt_center {
  text-align: center;
}

#gcyubuiepn .gt_right {
  text-align: right;
  font-variant-numeric: tabular-nums;
}

#gcyubuiepn .gt_font_normal {
  font-weight: normal;
}

#gcyubuiepn .gt_font_bold {
  font-weight: bold;
}

#gcyubuiepn .gt_font_italic {
  font-style: italic;
}

#gcyubuiepn .gt_super {
  font-size: 65%;
}

#gcyubuiepn .gt_footnote_marks {
  font-style: italic;
  font-weight: normal;
  font-size: 65%;
}
&lt;/style&gt;
&lt;table class=&#34;gt_table&#34;&gt;
  
  &lt;thead class=&#34;gt_col_headings&#34;&gt;
    &lt;tr&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_left&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;&lt;strong&gt;Characteristic&lt;/strong&gt;&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;&lt;strong&gt;Not Iced&lt;/strong&gt;, N = 18,268&lt;sup class=&#34;gt_footnote_marks&#34;&gt;1&lt;/sup&gt;&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;&lt;strong&gt;Iced&lt;/strong&gt;, N = 804&lt;sup class=&#34;gt_footnote_marks&#34;&gt;1&lt;/sup&gt;&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;&lt;strong&gt;Difference&lt;/strong&gt;&lt;sup class=&#34;gt_footnote_marks&#34;&gt;2&lt;/sup&gt;&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;&lt;strong&gt;95% CI&lt;/strong&gt;&lt;sup class=&#34;gt_footnote_marks&#34;&gt;2,3&lt;/sup&gt;&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody class=&#34;gt_table_body&#34;&gt;
    &lt;tr&gt;&lt;td class=&#34;gt_row gt_left&#34;&gt;regulation_time_remaining&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;1,936.02 (964.84)&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;1,170.94 (898.40)&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;0.82&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;0.75, 0.89&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td class=&#34;gt_row gt_left&#34;&gt;attempted_distance&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;35.25 (9.31)&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;38.69 (9.85)&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;-0.36&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;-0.43, -0.29&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td class=&#34;gt_row gt_left&#34;&gt;drive_is_home_offense&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;51.98%&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;50.87%&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;0.02&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;-0.05, 0.09&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td class=&#34;gt_row gt_left&#34;&gt;score_diff&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;1.15 (13.57)&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;-0.25 (10.94)&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;0.11&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;0.04, 0.18&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td class=&#34;gt_row gt_left&#34;&gt;prior_miss&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;13.74%&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;17.66%&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;-0.11&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;-0.18, -0.04&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td class=&#34;gt_row gt_left&#34;&gt;offense_win_prob&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;0.53 (0.28)&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;0.49 (0.24)&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;0.12&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;0.05, 0.19&lt;/td&gt;&lt;/tr&gt;
  &lt;/tbody&gt;
  
  &lt;tfoot&gt;
    &lt;tr class=&#34;gt_footnotes&#34;&gt;
      &lt;td colspan=&#34;5&#34;&gt;
        &lt;p class=&#34;gt_footnote&#34;&gt;
          &lt;sup class=&#34;gt_footnote_marks&#34;&gt;
            &lt;em&gt;1&lt;/em&gt;
          &lt;/sup&gt;
           
          Mean (SD); %
          &lt;br /&gt;
        &lt;/p&gt;
        &lt;p class=&#34;gt_footnote&#34;&gt;
          &lt;sup class=&#34;gt_footnote_marks&#34;&gt;
            &lt;em&gt;2&lt;/em&gt;
          &lt;/sup&gt;
           
          Standardized Mean Difference
          &lt;br /&gt;
        &lt;/p&gt;
        &lt;p class=&#34;gt_footnote&#34;&gt;
          &lt;sup class=&#34;gt_footnote_marks&#34;&gt;
            &lt;em&gt;3&lt;/em&gt;
          &lt;/sup&gt;
           
          CI = Confidence Interval
          &lt;br /&gt;
        &lt;/p&gt;
      &lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tfoot&gt;
&lt;/table&gt;
&lt;/div&gt;
&lt;p&gt;When looking at standardized mean differences, values less than 0.1 generally indicate adequate balance between the two groups. Values between 0.1 and 0.2 are not too alarming, but values greater than 0.2 indicate a heavy imbalance. In this data, the time remaining and attempted distance show large differences between iced and non-iced attempts.&lt;/p&gt;
&lt;p&gt;While there are many mechanisms to correct for the imbalances between the observed groups (Matching, Weighting, Stratification, etc.) I’m going to focus on weighting for this analysis. The process will be:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Develop Propensity Scores based on other features to predict the probability that a field goal attempt will be iced with logistic regression.&lt;/li&gt;
&lt;li&gt;Use the weights to adjust the population of the non-iced group to reflect the iced group. Since I’m looking to determine whether icing the kicker actually matters I want to measure the difference in Field Goal Success Rates for situations when the kicker might be iced. This is called the Average Treatment Effect on the Treated (ATT). This is in contrast to the Average Treatment Effect (ATE), which would measure the causal effect of icing the kicker in general and not just in situations where icing would occur.&lt;/li&gt;
&lt;li&gt;Ensure that the post-weighted data are not imbalanced like the pre-weighted data.&lt;/li&gt;
&lt;li&gt;Calculate the ATT and bootstrap confidence intervals.&lt;/li&gt;
&lt;/ol&gt;
&lt;div id=&#34;step-1-develop-the-propensity-model&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Step 1: Develop the Propensity Model&lt;/h3&gt;
&lt;p&gt;The first step in developing the weights to make the population more “even” is to develop a propensity score for the treatment. Here I’ll run a logistic regression using &lt;code&gt;glm()&lt;/code&gt; to predict whether the Field Goal attempt will be iced based on the covariates that were unbalanced before.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;p_iced &amp;lt;- glm(is_iced ~ regulation_time_remaining + attempted_distance + 
             drive_is_home_offense + I(score_diff^2) + prior_miss + offense_win_prob, 
           data = fg_attempts, 
           family = &amp;#39;binomial&amp;#39;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;These are called propensity models because their output represents the propensity of a given attempt to get iced.&lt;/p&gt;
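Conceptually, the propensity score is just the model’s predicted probability: the logistic transform of the linear predictor. A minimal sketch in Python with a made-up linear predictor value (the post’s actual model is fit in R with `glm()`):

```python
import math

def propensity_score(linear_predictor):
    """Inverse logit: maps the linear predictor from a logistic
    regression to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-linear_predictor))

# A linear predictor of 0 corresponds to a 50/50 chance of being iced;
# large negative values correspond to attempts very unlikely to be iced.
p_even = propensity_score(0.0)   # 0.5
p_low = propensity_score(-3.0)   # ≈ 0.047
```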
&lt;/div&gt;
&lt;div id=&#34;step-2-use-the-propensity-scores-to-weight-the-non-iced-field-goal-attempts&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Step 2: Use the Propensity Scores to weight the non-Iced Field Goal Attempts&lt;/h3&gt;
&lt;p&gt;Then by using the &lt;code&gt;augment()&lt;/code&gt; function from the &lt;code&gt;{{broom}}&lt;/code&gt; package, I can add the predicted values from the model to the &lt;em&gt;fg_attempts&lt;/em&gt; data set. The probabilities from this model can be used to re-weight the data in a number of ways: you can adjust both the test and control groups to make them look like each other, you can adjust the test group to look like the control, or you can weight the control to look like the test.&lt;/p&gt;
&lt;p&gt;In this case, since I want to understand the causal effect of icing the kicker on kicks that are likely to be iced, I’ll be re-weighting the control group to look like the test group. Thus, I will be looking for the average treatment effect among the treated (ATT) vs. the overall average treatment effect (ATE).&lt;/p&gt;
&lt;p&gt;The formula for re-weighting the population for the ATT is:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;eq.PNG&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Where &lt;em&gt;p_i&lt;/em&gt; is the attempt’s propensity to be iced and &lt;em&gt;Z_i&lt;/em&gt; is whether the attempt was &lt;strong&gt;actually&lt;/strong&gt; iced. This winds up assigning each attempt in the test group a weight of 1 and will upweight the field goal attempts from the non-iced group that had higher propensities for being iced.&lt;/p&gt;
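The weighting rule can be sketched directly: treated (iced) attempts get a weight of 1, and control attempts get their odds of treatment, p / (1 − p). A small Python sketch with made-up propensity values (the post’s actual implementation in R follows below):

```python
def att_weight(p, iced):
    """ATT weight: Z + (1 - Z) * p / (1 - p), where Z indicates whether
    the attempt was actually iced and p is its propensity score."""
    if iced:                  # Z_i = 1: treated attempts keep weight 1
        return 1.0
    return p / (1.0 - p)      # Z_i = 0: up-weight likely-to-be-iced controls

w_treated = att_weight(0.60, iced=True)    # 1.0
w_likely = att_weight(0.60, iced=False)    # ≈ 1.5: counts extra
w_unlikely = att_weight(0.05, iced=False)  # ≈ 0.05: mostly ignored
```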
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;weighted_dt &amp;lt;- p_iced %&amp;gt;% 
  augment(type.predict = &amp;quot;response&amp;quot;, data = fg_attempts) %&amp;gt;%
  mutate(
    w_att = ((.fitted * (is_iced==&amp;#39;Iced&amp;#39;))/.fitted) + 
      ((.fitted*(is_iced != &amp;#39;Iced&amp;#39;))/(1-.fitted))
  )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Before showing the effects of the weighting let’s first look at the unweighted propensity scores:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(weighted_dt, aes(x = .fitted, fill = is_iced)) + 
  geom_density(alpha = .5) + 
  scale_x_continuous(labels = scales::percent) + 
  #scale_y_log10(labels = scales::comma) + 
  scale_fill_manual(values = c(&amp;#39;Iced&amp;#39; = &amp;#39;green&amp;#39;, &amp;#39;Not Iced&amp;#39; = &amp;#39;blue&amp;#39;)) + 
  labs(x = &amp;quot;P(Icing The Kicker)&amp;quot;,
       y = &amp;quot;&amp;quot;,
       title = &amp;quot;Probability of a FG Attempt Being Iced (Unweighted)&amp;quot;,
       fill = &amp;quot;Kicker Iced?&amp;quot;) + 
  cowplot::theme_cowplot() + 
  theme(
    legend.position = &amp;#39;bottom&amp;#39;,
    legend.justification = &amp;#39;center&amp;#39;,
    axis.text.y = element_blank(),
    axis.ticks.y = element_blank()
  )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2022/02/14/does-icing-the-kicker-really-work/index_files/figure-html/unweighted_p_scores-1.png&#34; width=&#34;100%&#34; /&gt;&lt;/p&gt;
&lt;p&gt;This makes it very clear that the distribution of the propensity to be iced differs heavily between the group that was actually iced and the group that was not. It’s also reassuring that the attempts that were actually iced generally have higher propensity scores than those that weren’t.&lt;/p&gt;
&lt;p&gt;Now let’s look at the distribution of propensity scores when taking the weights into account. The distribution of the Iced group is shown in green on top and is unchanged between the pre- and post-weighting versions. On the bottom are the non-iced attempts: the unweighted distribution is shown in grey and the re-weighted distribution in blue. Notice how the re-weighted distribution more closely reflects the distribution of the iced group.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;weighted_dt %&amp;gt;%
  tidyr::spread(is_iced, .fitted, sep = &amp;quot;_p&amp;quot;) %&amp;gt;%
  ggplot() +
  geom_histogram(bins = 50, aes(is_iced_pIced), alpha = 0.5) + 
  geom_histogram(bins = 50, aes(is_iced_pIced, weight = w_att), fill = &amp;quot;green&amp;quot;, alpha = 0.5) + 
  geom_histogram(bins = 50, alpha = 0.5, aes(x = `is_iced_pNot Iced`, y = -..count..)) + 
  geom_histogram(bins = 50, aes(x = `is_iced_pNot Iced`, weight = w_att, y = -..count..), fill = &amp;quot;blue&amp;quot;, alpha = 0.5) + 
  geom_hline(yintercept = 0, lwd = 0.5) +
  scale_y_continuous(label = abs) +
  scale_x_continuous(label = scales::percent) + 
  labs(title = &amp;quot;Post-Weighted Probability of FG Attempt Being Iced&amp;quot;,
       subtitle = &amp;quot;grey is unweighted distribution&amp;quot;,
       x = &amp;quot;P(Icing the Kicker)&amp;quot;,
       y = &amp;quot;# of Attempts&amp;quot;) + 
  theme_minimal() + 
  geom_rect(aes(xmin = 0.45, xmax = .47, ymin = 5, ymax = 100), fill = &amp;quot;#5DB854&amp;quot;) + 
  geom_text(aes(x = 0.46, y = 50), label = &amp;quot;Iced&amp;quot;, angle = 270, color = &amp;quot;white&amp;quot;) + 
  geom_rect(aes(xmin = 0.45, xmax = .47, ymin = -100, ymax = -5), fill = &amp;quot;#5154B8&amp;quot;) + 
  geom_text(aes(x = 0.46, y = -50), label = &amp;quot;Non-Iced&amp;quot;, angle = 270, color = &amp;quot;white&amp;quot;) + 
  coord_cartesian(ylim = c(-100, 100))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2022/02/14/does-icing-the-kicker-really-work/index_files/figure-html/weighted_p_scores-1.png&#34; width=&#34;100%&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;step-3-ensure-the-post-weighted-data-is-no-longer-imbalanced&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Step 3: Ensure the Post-Weighted Data is no longer Imbalanced&lt;/h3&gt;
&lt;p&gt;The next step is to run some diagnostics to make sure that the imbalance we saw in the standardized mean differences at the beginning of this post has gone away. I’m going to use the &lt;code&gt;{{survey}}&lt;/code&gt; package and the &lt;code&gt;tbl_svysummary()&lt;/code&gt; function from &lt;code&gt;{{gtsummary}}&lt;/code&gt; to create a survey design object that incorporates the weights derived above. The &lt;code&gt;ids = ~ 1&lt;/code&gt; argument tells the design object that there are no clusters in this design.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;svy_des &amp;lt;- svydesign(
  ids = ~ 1,
  data = weighted_dt,
  weights = ~ w_att
)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The syntax of &lt;code&gt;tbl_svysummary()&lt;/code&gt; is identical to &lt;code&gt;tbl_summary()&lt;/code&gt; except that it takes the survey design object rather than a data frame. As with the earlier table, I’m adding in the standardized mean difference.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;tbl_svysummary(
  svy_des,
  by = &amp;#39;is_iced&amp;#39;,
  include = c(regulation_time_remaining, attempted_distance, 
              drive_is_home_offense, score_diff, prior_miss, 
              offense_win_prob, is_iced),
  statistic = list(all_continuous() ~ &amp;quot;{mean} ({sd})&amp;quot;,
                   all_dichotomous() ~ &amp;quot;{p}%&amp;quot;),
  digits = list(everything() ~ 2)
) %&amp;gt;% 
  add_difference(everything() ~ &amp;quot;smd&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
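Under the hood, the survey-weighted summary statistics in the table are just weight-scaled versions of the usual formulas; for example, a weighted mean. A minimal sketch with toy numbers:

```python
def weighted_mean(values, weights):
    """Survey-weighted mean: each observation contributes in
    proportion to its weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# With equal weights this is the ordinary mean; up-weighting an
# observation pulls the mean toward it.
m_equal = weighted_mean([1.0, 3.0], [1.0, 1.0])  # 2.0
m_up = weighted_mean([1.0, 3.0], [1.0, 3.0])     # 2.5
```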
&lt;div id=&#34;xuavxqcloz&#34; style=&#34;overflow-x:auto;overflow-y:auto;width:auto;height:auto;&#34;&gt;
&lt;style&gt;html {
  font-family: -apple-system, BlinkMacSystemFont, &#39;Segoe UI&#39;, Roboto, Oxygen, Ubuntu, Cantarell, &#39;Helvetica Neue&#39;, &#39;Fira Sans&#39;, &#39;Droid Sans&#39;, Arial, sans-serif;
}

#xuavxqcloz .gt_table {
  display: table;
  border-collapse: collapse;
  margin-left: auto;
  margin-right: auto;
  color: #333333;
  font-size: 16px;
  font-weight: normal;
  font-style: normal;
  background-color: #FFFFFF;
  width: auto;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #A8A8A8;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #A8A8A8;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
}

#xuavxqcloz .gt_heading {
  background-color: #FFFFFF;
  text-align: center;
  border-bottom-color: #FFFFFF;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
}

#xuavxqcloz .gt_title {
  color: #333333;
  font-size: 125%;
  font-weight: initial;
  padding-top: 4px;
  padding-bottom: 4px;
  border-bottom-color: #FFFFFF;
  border-bottom-width: 0;
}

#xuavxqcloz .gt_subtitle {
  color: #333333;
  font-size: 85%;
  font-weight: initial;
  padding-top: 0;
  padding-bottom: 6px;
  border-top-color: #FFFFFF;
  border-top-width: 0;
}

#xuavxqcloz .gt_bottom_border {
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
}

#xuavxqcloz .gt_col_headings {
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
}

#xuavxqcloz .gt_col_heading {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: normal;
  text-transform: inherit;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: bottom;
  padding-top: 5px;
  padding-bottom: 6px;
  padding-left: 5px;
  padding-right: 5px;
  overflow-x: hidden;
}

#xuavxqcloz .gt_column_spanner_outer {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: normal;
  text-transform: inherit;
  padding-top: 0;
  padding-bottom: 0;
  padding-left: 4px;
  padding-right: 4px;
}

#xuavxqcloz .gt_column_spanner_outer:first-child {
  padding-left: 0;
}

#xuavxqcloz .gt_column_spanner_outer:last-child {
  padding-right: 0;
}

#xuavxqcloz .gt_column_spanner {
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  vertical-align: bottom;
  padding-top: 5px;
  padding-bottom: 5px;
  overflow-x: hidden;
  display: inline-block;
  width: 100%;
}

#xuavxqcloz .gt_group_heading {
  padding: 8px;
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  text-transform: inherit;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: middle;
}

#xuavxqcloz .gt_empty_group_heading {
  padding: 0.5px;
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  vertical-align: middle;
}

#xuavxqcloz .gt_from_md &gt; :first-child {
  margin-top: 0;
}

#xuavxqcloz .gt_from_md &gt; :last-child {
  margin-bottom: 0;
}

#xuavxqcloz .gt_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  margin: 10px;
  border-top-style: solid;
  border-top-width: 1px;
  border-top-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: middle;
  overflow-x: hidden;
}

#xuavxqcloz .gt_stub {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  text-transform: inherit;
  border-right-style: solid;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
  padding-left: 12px;
}

#xuavxqcloz .gt_summary_row {
  color: #333333;
  background-color: #FFFFFF;
  text-transform: inherit;
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
}

#xuavxqcloz .gt_first_summary_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
}

#xuavxqcloz .gt_grand_summary_row {
  color: #333333;
  background-color: #FFFFFF;
  text-transform: inherit;
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
}

#xuavxqcloz .gt_first_grand_summary_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  border-top-style: double;
  border-top-width: 6px;
  border-top-color: #D3D3D3;
}

#xuavxqcloz .gt_striped {
  background-color: rgba(128, 128, 128, 0.05);
}

#xuavxqcloz .gt_table_body {
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
}

#xuavxqcloz .gt_footnotes {
  color: #333333;
  background-color: #FFFFFF;
  border-bottom-style: none;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
}

#xuavxqcloz .gt_footnote {
  margin: 0px;
  font-size: 90%;
  padding: 4px;
}

#xuavxqcloz .gt_sourcenotes {
  color: #333333;
  background-color: #FFFFFF;
  border-bottom-style: none;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
}

#xuavxqcloz .gt_sourcenote {
  font-size: 90%;
  padding: 4px;
}

#xuavxqcloz .gt_left {
  text-align: left;
}

#xuavxqcloz .gt_center {
  text-align: center;
}

#xuavxqcloz .gt_right {
  text-align: right;
  font-variant-numeric: tabular-nums;
}

#xuavxqcloz .gt_font_normal {
  font-weight: normal;
}

#xuavxqcloz .gt_font_bold {
  font-weight: bold;
}

#xuavxqcloz .gt_font_italic {
  font-style: italic;
}

#xuavxqcloz .gt_super {
  font-size: 65%;
}

#xuavxqcloz .gt_footnote_marks {
  font-style: italic;
  font-weight: normal;
  font-size: 65%;
}
&lt;/style&gt;
&lt;table class=&#34;gt_table&#34;&gt;
  
  &lt;thead class=&#34;gt_col_headings&#34;&gt;
    &lt;tr&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_left&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;&lt;strong&gt;Characteristic&lt;/strong&gt;&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;&lt;strong&gt;Not Iced&lt;/strong&gt;, N = 800&lt;sup class=&#34;gt_footnote_marks&#34;&gt;1&lt;/sup&gt;&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;&lt;strong&gt;Iced&lt;/strong&gt;, N = 804&lt;sup class=&#34;gt_footnote_marks&#34;&gt;1&lt;/sup&gt;&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;&lt;strong&gt;Difference&lt;/strong&gt;&lt;sup class=&#34;gt_footnote_marks&#34;&gt;2&lt;/sup&gt;&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;&lt;strong&gt;95% CI&lt;/strong&gt;&lt;sup class=&#34;gt_footnote_marks&#34;&gt;2,3&lt;/sup&gt;&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody class=&#34;gt_table_body&#34;&gt;
    &lt;tr&gt;&lt;td class=&#34;gt_row gt_left&#34;&gt;regulation_time_remaining&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;1,187.16 (875.67)&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;1,170.94 (898.40)&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;0.02&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;-0.08, 0.12&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td class=&#34;gt_row gt_left&#34;&gt;attempted_distance&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;38.76 (9.34)&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;38.69 (9.85)&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;0.01&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;-0.09, 0.11&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td class=&#34;gt_row gt_left&#34;&gt;drive_is_home_offense&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;51.18%&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;50.87%&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;0.01&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;-0.09, 0.10&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td class=&#34;gt_row gt_left&#34;&gt;score_diff&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;0.63 (11.03)&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;-0.25 (10.94)&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;0.08&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;-0.02, 0.18&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td class=&#34;gt_row gt_left&#34;&gt;prior_miss&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;17.42%&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;17.66%&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;-0.01&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;-0.10, 0.09&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td class=&#34;gt_row gt_left&#34;&gt;offense_win_prob&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;0.49 (0.26)&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;0.49 (0.24)&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;0.00&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;-0.10, 0.10&lt;/td&gt;&lt;/tr&gt;
  &lt;/tbody&gt;
  
  &lt;tfoot&gt;
    &lt;tr class=&#34;gt_footnotes&#34;&gt;
      &lt;td colspan=&#34;5&#34;&gt;
        &lt;p class=&#34;gt_footnote&#34;&gt;
          &lt;sup class=&#34;gt_footnote_marks&#34;&gt;
            &lt;em&gt;1&lt;/em&gt;
          &lt;/sup&gt;
           
          Mean (SD); %
          &lt;br /&gt;
        &lt;/p&gt;
        &lt;p class=&#34;gt_footnote&#34;&gt;
          &lt;sup class=&#34;gt_footnote_marks&#34;&gt;
            &lt;em&gt;2&lt;/em&gt;
          &lt;/sup&gt;
           
          Standardized Mean Difference
          &lt;br /&gt;
        &lt;/p&gt;
        &lt;p class=&#34;gt_footnote&#34;&gt;
          &lt;sup class=&#34;gt_footnote_marks&#34;&gt;
            &lt;em&gt;3&lt;/em&gt;
          &lt;/sup&gt;
           
          CI = Confidence Interval
          &lt;br /&gt;
        &lt;/p&gt;
      &lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tfoot&gt;
&lt;/table&gt;
&lt;/div&gt;
&lt;p&gt;Notice that all of the SMDs are now below the 0.1 threshold.&lt;/p&gt;
&lt;p&gt;Another way to visualize the change in SMDs is with a Love Plot. Named after Dr. Thomas E. Love, the Love Plot summarizes covariate balance before and after weighting. In the first code block, I calculate both the weighted and unweighted standardized mean differences using the &lt;code&gt;{{smd}}&lt;/code&gt; package. In each call to &lt;code&gt;smd()&lt;/code&gt;, I pass in the variable, the group variable, and, in the case of the weighted version, the weights.&lt;/p&gt;
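For reference, the standardized mean difference underlying both the balance table and the Love Plot is the difference in group means divided by a pooled standard deviation. An unweighted sketch with toy data (the post itself relies on the `{smd}` package):

```python
import math

def smd(treated, control):
    """(mean_t - mean_c) / pooled SD, pooling the two group variances."""
    def mean(xs):
        return sum(xs) / len(xs)
    def var(xs):
        m = mean(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    pooled_sd = math.sqrt((var(treated) + var(control)) / 2)
    return (mean(treated) - mean(control)) / pooled_sd

balanced = smd([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])  # 0.0: perfectly balanced
shifted = smd([5.0, 6.0, 7.0], [1.0, 2.0, 3.0])   # 4.0: heavily imbalanced
```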
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;smds &amp;lt;- weighted_dt %&amp;gt;%
  # Calculate the SMD for Each Variable
  summarise(
    # List Variables to run functions
    across(c(regulation_time_remaining, attempted_distance, 
             drive_is_home_offense, score_diff, prior_miss,
             offense_win_prob),
           # List functions
           list(
             unweighted = ~smd(.x, is_iced, na.rm = T)$estimate, 
             weighted = ~smd(.x, is_iced, w_att, na.rm = T)$estimate 
           ),
           # Assign how the naming will show up in the output
           # Assign placeholder _zzz_ to split on in the next step
           .names = &amp;quot;{.col}_zzz_{.fn}&amp;quot;)
  )

smds %&amp;gt;% 
  pivot_longer( 
    everything(),
    values_to = &amp;quot;SMD&amp;quot;, 
    names_to = c(&amp;quot;variable&amp;quot;, &amp;quot;Method&amp;quot;), 
    names_sep = &amp;quot;_zzz_&amp;quot;
  ) %&amp;gt;%
  ggplot(
    aes(x = abs(SMD), y = variable, group = Method, color = Method)
  ) +  
  geom_line(orientation = &amp;quot;y&amp;quot;) +
  geom_point() + 
  geom_vline(xintercept = 0.1, color = &amp;quot;black&amp;quot;, size = 0.1) + 
  labs(title = &amp;quot;Love Plot Pre/Post Weighting&amp;quot;,
       subtitle = &amp;quot;Post-Weighted Variables are All Balanced&amp;quot;,
       y= &amp;quot;&amp;quot;) + 
  cowplot::theme_cowplot()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2022/02/14/does-icing-the-kicker-really-work/index_files/figure-html/unnamed-chunk-7-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;The Love Plot clearly shows that the weighting has corrected the imbalances we saw in the unweighted version, since all variables are now below 0.1. So it looks like the propensity score weighting has done its job.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;step-4-calculate-the-att&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Step 4: Calculate the ATT&lt;/h3&gt;
&lt;p&gt;The final step is to calculate the average treatment effect on the treated by regressing our outcome variable (Field Goal Made) on the “treatment” (whether the kick was iced or not), weighted by the weights derived above. I’m using a linear probability model for convenience so that the coefficient is more easily interpretable, but there is a case to be made for using a logistic regression since &lt;em&gt;fg_made&lt;/em&gt; is binary.&lt;/p&gt;
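With a single binary regressor, the coefficient from this weighted linear probability model equals the weighted difference in success rates between the iced and non-iced groups. A sketch with toy data (illustrative only, not the post’s data):

```python
def weighted_att(made, iced, weights):
    """Weighted FG success rate of iced attempts minus the weighted
    success rate of non-iced attempts."""
    def group_rate(flag):
        num = sum(m * w for m, z, w in zip(made, iced, weights) if z == flag)
        den = sum(w for z, w in zip(iced, weights) if z == flag)
        return num / den
    return group_rate(True) - group_rate(False)

# Two iced attempts (one made) vs. two non-iced attempts (both made,
# one up-weighted): 0.5 - 1.0 = -0.5
att = weighted_att([1, 0, 1, 1], [True, True, False, False], [1.0, 1.0, 2.0, 1.0])
```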
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;final.model &amp;lt;- lm(fg_made ~ is_iced, data = weighted_dt, weights = w_att)

tidy(final.model, conf.int = T) %&amp;gt;%
  select(term, estimate, conf.low, conf.high) %&amp;gt;% 
  mutate(across(where(is.numeric), ~scales::percent(.x, accuracy = .01))) %&amp;gt;% 
  knitr::kable(align = &amp;#39;c&amp;#39;)&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;center&#34;&gt;term&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;estimate&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;conf.low&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;conf.high&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;(Intercept)&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;70.51%&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;69.58%&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;71.44%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;is_icedIced&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;-2.85%&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;-4.16%&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;-1.54%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;From this model it looks like icing the kicker results in a 2.85 percentage point decrease in the success rate, and the confidence interval from the linear model would suggest that the effect is statistically significant. &lt;em&gt;However&lt;/em&gt;, the confidence intervals generated above are overly optimistic because the weighted observations are treated as separate individuals rather than as weights. In order to get more robust confidence intervals, I’ll use bootstrapping to redo the entire process 1,000 times. The following function carries out the entire process from above (propensity score -&amp;gt; weights -&amp;gt; outcome model).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#### Bootstrapping Estimates
fit_ipw &amp;lt;- function(split, ...) { 
  .df &amp;lt;- analysis(split) 
  
  # fit propensity score model
  propensity_model &amp;lt;- glm(
    is_iced ~ regulation_time_remaining + attempted_distance + 
             drive_is_home_offense + I(score_diff^2)  + prior_miss + offense_win_prob, 
    family = binomial(), 
    data = .df
  )
  
  # calculate inverse probability weights
  .df &amp;lt;- propensity_model %&amp;gt;% 
    augment(type.predict = &amp;quot;response&amp;quot;, data = .df) %&amp;gt;%
    mutate(
      w_att = ((.fitted * (is_iced==&amp;#39;Iced&amp;#39;))/.fitted) + 
      ((.fitted*(is_iced != &amp;#39;Iced&amp;#39;))/(1-.fitted))
    )
  
  # fit correctly bootstrapped ipw model
  lm(fg_made ~ is_iced, data = .df, weights = w_att) %&amp;gt;%
    tidy()
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The bootstrapping will be done using the &lt;code&gt;{{rsample}}&lt;/code&gt; package and its &lt;code&gt;bootstraps()&lt;/code&gt; function. In the call I ask for 1,000 bootstrapped samples (the &lt;code&gt;apparent = TRUE&lt;/code&gt; option adds a 1,001st sample that is the entire data set). Then I apply the above function to every bootstrapped sample through &lt;code&gt;{{purrr}}&lt;/code&gt;’s &lt;code&gt;map()&lt;/code&gt; function.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# fit ipw model to bootstrapped samples
set.seed(20220130)
ipw_results &amp;lt;- bootstraps(fg_attempts, 1000, apparent = TRUE) %&amp;gt;% 
  mutate(results = map(splits, fit_ipw))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Finally, the &lt;code&gt;int_t()&lt;/code&gt; function generates confidence intervals from the t-distribution based on the results of the 1,000 bootstrapped samples.&lt;/p&gt;
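For intuition, a simple t-style interval from bootstrap replicates takes the mean replicate estimate plus or minus a critical value times the standard deviation of the replicates. The sketch below is a deliberate simplification of what `int_t()` does (it computes studentized bootstrap-t intervals), shown with made-up replicate values:

```python
import statistics

def simple_boot_ci(estimates, z=1.96):
    """Approximate 95% CI: mean of the bootstrap replicate estimates
    plus/minus z times their standard deviation."""
    est = statistics.mean(estimates)
    se = statistics.stdev(estimates)
    return est - z * se, est, est + z * se

lo, est, hi = simple_boot_ci([1.0, 2.0, 3.0])  # ≈ (0.04, 2.0, 3.96)
```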
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# get t-statistic-based CIs
int_t(ipw_results, results) %&amp;gt;%
  filter(term == &amp;quot;is_icedIced&amp;quot;) %&amp;gt;% 
  select(term, .lower, .estimate, .upper) %&amp;gt;% 
  mutate(across(where(is.numeric), ~scales::percent(.x, accuracy = .01))) %&amp;gt;%
  knitr::kable(align = &amp;#39;c&amp;#39;)&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;center&#34;&gt;term&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;.lower&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;.estimate&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;.upper&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;is_icedIced&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;-5.88%&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;-2.82%&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.50%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;From the bootstrapped results, we have a similar point estimate of -2.82%, which is much smaller than the 8.3% seen in the naive analysis, but the confidence interval now spans from -5.88% to 0.50%, making the result not significantly different from zero.&lt;/p&gt;
&lt;p&gt;So in conclusion, we can’t definitively say that icing the kicker is actually harmful to the kicker’s success, which seems consistent with other studies that find it is either ineffective or only mildly effective at best.&lt;/p&gt;
&lt;p&gt;In the next post in this series, I’ll be looking at alternative causal inference methodologies like G-computation and targeted maximum likelihood estimation (TMLE) to see if the results are similar or different from the results of this post.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Predicting When Kickers Get Iced with {tidymodels}</title>
      <link>https://jlaw.netlify.app/2022/01/24/predicting-when-kickers-get-iced-with-tidymodels/</link>
      <pubDate>Mon, 24 Jan 2022 00:00:00 +0000</pubDate>
      <guid>https://jlaw.netlify.app/2022/01/24/predicting-when-kickers-get-iced-with-tidymodels/</guid>
      <description>
&lt;script src=&#34;https://jlaw.netlify.app/2022/01/24/predicting-when-kickers-get-iced-with-tidymodels/index_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;p&gt;I’m constantly on the lookout for things I can use for future posts for this blog. My goal is usually two-fold. First, what is a tool or technique I want to try/learn and second is there an interesting data set that I can use with those tools. I’d been wanting to play around with {tidymodels} for a while but hadn’t found the right problem. Watching some of the NCAA bowl games over the winter break finally provided me with a use-case. My original question of &lt;strong&gt;whether icing the kicker really works?&lt;/strong&gt; will be explored in a future post but it led to the question for this post which will explore &lt;strong&gt;predicting when coaches will choose to ice the kicker&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;This post will explore the data gathering process from the &lt;a href=&#34;https://collegefootballdata.com/&#34;&gt;College Football Database&lt;/a&gt;, the modeling process using &lt;code&gt;tidymodels&lt;/code&gt;, and explaining the model using tools such as variable importance plots, partial dependency plots, and SHAP values.&lt;/p&gt;
&lt;p&gt;Huge thanks to &lt;a href=&#34;https://juliasilge.com&#34;&gt;Julia Silge&lt;/a&gt; whose numerous blog posts on tidymodels were instrumental as a resource for learning the ecosystem.&lt;/p&gt;
&lt;div id=&#34;part-i-data-gathering&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Part I: Data Gathering&lt;/h2&gt;
&lt;p&gt;To determine whether a potential field goal attempt will get iced, I’ll need data on each field goal attempt, a definition of what &lt;strong&gt;&lt;em&gt;icing the kicker&lt;/em&gt;&lt;/strong&gt; means, and other features that would be predictive of whether a kicker will be iced.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://en.wikipedia.org/wiki/Icing_the_kicker&#34;&gt;Wikipedia&lt;/a&gt; defines “icing the kicker” as “the act of calling a timeout immediately prior to the snap in order to disrupt the process of kicking a field goal”. Therefore, we’ll define a field goal attempt as being iced if a timeout is called by the defense directly before it.&lt;/p&gt;
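&lt;p&gt;As a quick sketch of that definition on a made-up set of plays (base R, not the real API data), lagging the play type and drive within the game and checking for a same-drive timeout flags the iced attempt:&lt;/p&gt;

```r
# Toy illustration with made-up plays: an attempt counts as "iced" when the
# immediately prior play in the same drive was a Timeout.
plays = data.frame(
  drive_id  = c(1, 1, 2, 2, 2),
  play_type = c("Rush", "Field Goal Good", "Pass", "Timeout", "Field Goal Good")
)
# lag each column by one play (NA for the first play of the game)
prior_type  = c(NA, head(plays$play_type, -1))
prior_drive = c(NA, head(plays$drive_id, -1))
# treat the first play as having no prior play
prior_type[is.na(prior_type)]   = ""
prior_drive[is.na(prior_drive)] = -1
# iced = prior play was a Timeout in the same drive (product of two logicals)
plays$is_iced = (prior_drive == plays$drive_id) * (prior_type == "Timeout")
plays[plays$play_type == "Field Goal Good", c("drive_id", "is_iced")]
# the drive 1 attempt is not iced; the drive 2 attempt (after a Timeout) is
```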
&lt;p&gt;The data for this post comes from the &lt;a href=&#34;https://collegefootballdata.com/&#34;&gt;College Football Database&lt;/a&gt;. More details on this API can be found in my earlier post on &lt;a href=&#34;https://jlaw.netlify.app/2021/12/27/exploring-college-football-non-conference-rivalries-with-ggraph/&#34;&gt;Exploring Non-Conference Rivalries&lt;/a&gt;, so the set-up will not be covered here. Play-by-play data from any game can be accessed through the &lt;code&gt;cfbd_pbp_data()&lt;/code&gt; function.&lt;/p&gt;
&lt;p&gt;Looking at the returned data, the features that I’ll explore as potentially predictive are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Regulation Time Remaining in the Game (or if the game is in overtime)&lt;/li&gt;
&lt;li&gt;Distance of the Field Goal Attempt&lt;/li&gt;
&lt;li&gt;The Score Difference&lt;/li&gt;
&lt;li&gt;Whether the kicking team is the home team&lt;/li&gt;
&lt;li&gt;Whether the kicking team has missed earlier in the game&lt;/li&gt;
&lt;li&gt;The pre-game winning probability of the kicking team (to assess whether the game is expected to be close)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The packages needed for the data gathering process are &lt;code&gt;tidyverse&lt;/code&gt; for data manipulation and &lt;code&gt;cfbfastR&lt;/code&gt; to access the API.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(cfbfastR)
library(tidyverse)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For convenience I’ll be looking at NCAA regular season football games between 2013 and 2021. The API notes that prior to the College Football Playoff in 2014 the regular season was weeks 1-14, and since then it’s been weeks 1 to 15. To create a loop of the weeks and years to pass to the data pull function, I’ll use &lt;code&gt;expand.grid()&lt;/code&gt; to create all combinations of weeks and years and then apply a filter to keep only valid weeks.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;grid &amp;lt;- expand.grid(
  year = 2013:2021,
  week = 1:15
) %&amp;gt;%
  arrange(year, week) %&amp;gt;%
  # Before 2014 there were only 14 regular season weeks
  filter(year &amp;gt; 2014 | week &amp;lt;= 14) &lt;/code&gt;&lt;/pre&gt;
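&lt;p&gt;As a quick sanity check (a base R sketch, using &lt;code&gt;week != 15&lt;/code&gt; in place of the equivalent &lt;code&gt;week &amp;lt;= 14&lt;/code&gt; given weeks run 1 to 15), the filter as written keeps 14 weeks for the 2013 and 2014 seasons and 15 weeks for 2015 through 2021, so the grid should contain 133 year/week pairs:&lt;/p&gt;

```r
# Sanity check on the week/year grid: 2 seasons with 14 weeks each plus
# 7 seasons with 15 weeks each is 28 + 105 = 133 API pulls.
grid = expand.grid(year = 2013:2021, week = 1:15)
grid = grid[grid$year > 2014 | grid$week != 15, ]
nrow(grid)  # 133
```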
&lt;p&gt;The API does provide options to specify which types of plays to return. However, to determine whether a timeout was called immediately before an attempt, I’ll need to pull the data for EVERY play to accurately apply a lag function. Since I don’t ultimately want to keep every play, I’ll create a function to handle the API call and some post-processing, using the grid of weeks and years above as inputs. I use &lt;code&gt;map2_dfr()&lt;/code&gt; from &lt;code&gt;purrr&lt;/code&gt; to iterate over the two parameters.&lt;/p&gt;
&lt;p&gt;The call to &lt;code&gt;cfbd_pbp_data()&lt;/code&gt; with week and year parameters will return the play-by-play data for every game in that week. To process the data I subset to relevant columns and create some lagged columns to determine the time each play started (since the time in the data reflects the end of the play) and the plays that came immediately before. The lagged variables are used to define the dependent variable &lt;em&gt;is_iced&lt;/em&gt;: if the prior play was a timeout called by the defensive team during the same drive, then we’ll consider the attempt to be iced.&lt;/p&gt;
&lt;p&gt;Then I create some additional values that will be used in the modeling, subset my data to only be field goal attempts (and remove any duplicated rows that unfortunately exist), and create the variable for whether the kicking team had a prior miss in the game.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;###Get Play by Play Data
fg_attempts &amp;lt;- map2_dfr(grid$year, grid$week, function(year, week){
  
  
  plays &amp;lt;- cfbd_pbp_data(year=year, week=week, season_type = &amp;#39;regular&amp;#39;) %&amp;gt;%
    group_by(game_id) %&amp;gt;%
    arrange(id_play, .by_group = TRUE) %&amp;gt;% 
    #Subset to only relevant columns
    select(offense_play, defense_play, home, away, 
           drive_start_offense_score, drive_start_defense_score,
           game_id, drive_id, drive_number, play_number,
           period, clock.minutes, clock.seconds, yard_line, yards_gained,
           play_type, play_text, id_play,
           drive_is_home_offense, 
           offense_timeouts,
           defense_timeouts,
           season, wk) %&amp;gt;% 
    mutate(
      # Get prior play end time to use as current play start time
      play_start_mins = lag(clock.minutes),
      play_start_secs = lag(clock.seconds),
      # Get previous plays
      lag_play_type = lag(play_type),
      lag_play_text = lag(play_text),
      
      # Create Other Variables
      is_iced = coalesce(
        if_else(
          # If the same drive, the immediately prior play was a timeout 
          # called by the defensive team
          drive_id == lag(drive_id) &amp;amp; 
            play_number - 1 == lag(play_number) &amp;amp; 
            lag_play_type == &amp;#39;Timeout&amp;#39; &amp;amp;
            str_detect(str_to_lower(lag_play_text), str_to_lower(defense_play)),
          1,
          0
        ), 
        0
      ),
      score_diff = drive_start_offense_score - drive_start_defense_score,
      time_remaining_secs = 60*play_start_mins + play_start_secs,
      fg_made = if_else(play_type == &amp;#39;Field Goal Good&amp;#39;, 1, 0)
    ) %&amp;gt;% 
    ungroup() %&amp;gt;% 
    ## Keep only Field Goal Attempt Plays
    filter(str_detect(play_type, &amp;#39;Field Goal&amp;#39;),
           !str_detect(play_type, &amp;#39;Blocked&amp;#39;)) %&amp;gt;%
    #Distinct Out Bad Rows
    distinct(game_id, drive_id, period, clock.minutes, clock.seconds, play_type, play_text,
             .keep_all = T) %&amp;gt;%
    ## Determine if the offensive team has missed a field goal once already during the game
    group_by(game_id, offense_play) %&amp;gt;% 
    mutate(min_miss = min(if_else(play_type == &amp;#39;Field Goal Missed&amp;#39;, id_play, NA_character_), na.rm = T),
           prior_miss = if_else(id_play &amp;lt;= min_miss | is.na(min_miss), 0, 1)
    ) %&amp;gt;% 
    ungroup()
  }
)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Getting the offensive win probabilities has to come from a separate function, &lt;code&gt;cfbd_metrics_wp_pregame()&lt;/code&gt;. This function can return a season’s worth of data given only the year. Using &lt;code&gt;map_dfr()&lt;/code&gt; with the years 2013 to 2021 will return this data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;betting_lines &amp;lt;- map_dfr(unique(grid$year), ~cfbd_metrics_wp_pregame(year = .x, season_type = &amp;#39;regular&amp;#39;))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The final step is adding the win probability data to the play-by-play data by joining the two data sets and assigning the home or away win probability to the offensive team on each play. I then do some final data cleaning: ensuring timeouts remaining are never negative, extracting the attempted distance of the field goal from the play-by-play string, and defining the regulation time remaining. The last step is removing attempts where icing the kicker would be impossible. Since the defense needs a timeout to be able to ice, any attempt where the defense has no timeouts gets excluded.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;fg_data &amp;lt;- fg_attempts %&amp;gt;%
  inner_join(betting_lines %&amp;gt;%
               select(game_id, home_win_prob, away_win_prob),
             by = &amp;quot;game_id&amp;quot;) %&amp;gt;%
  mutate(offense_win_prob = if_else(offense_play == home, home_win_prob, away_win_prob),
         defense_timeouts = pmax(defense_timeouts, 0),
         regulation_time_remaining = if_else(
           period &amp;gt; 4, 0, (4-period)*900+pmin(time_remaining_secs, 900)),
         attempted_distance = coalesce(str_extract(play_text, &amp;#39;\\d+&amp;#39;) %&amp;gt;% as.numeric(),
                                       yards_gained)
         ) %&amp;gt;%
  #Need to Ensure that Icing Could Occur
  filter(defense_timeouts &amp;gt; 0 | is_iced)&lt;/code&gt;&lt;/pre&gt;
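&lt;p&gt;The &lt;em&gt;regulation_time_remaining&lt;/em&gt; expression can be spot-checked with a small standalone helper (a base R sketch mirroring the &lt;code&gt;mutate()&lt;/code&gt; logic above): overtime periods map to 0, otherwise it is the remaining full quarters (900 seconds each) plus the clock time left in the current quarter, capped at 900.&lt;/p&gt;

```r
# Standalone version of the regulation_time_remaining calculation above
reg_time_remaining = function(period, secs_left_in_quarter) {
  ifelse(period > 4, 0, (4 - period) * 900 + pmin(secs_left_in_quarter, 900))
}
reg_time_remaining(3, 441)  # 7:21 left in Q3 -> 900 + 441 = 1341 seconds
reg_time_remaining(4,  30)  # 30 seconds left in the game
reg_time_remaining(5, 600)  # overtime -> 0
```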
&lt;p&gt;The result of this is a dataset of 19,072 field goal attempts covering 6,435 games over 9 seasons. Of the 19,072 attempts, 804 (4%) would be considered &lt;em&gt;iced&lt;/em&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;part-2-building-the-model&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Part 2: Building the Model&lt;/h2&gt;
&lt;p&gt;Normally, I would do some EDA to better understand the data set, but in the interest of word count I’ll jump right into using &lt;code&gt;tidymodels&lt;/code&gt; to predict whether a given field goal attempt will be iced. In order to make the data work with the XGBoost algorithm, I’ll subset the data and convert some numeric variables, including our dependent variable, to factors. A frustrating thing I learned in writing this post is that with a factor dependent variable, the assumption is that the first level is the positive class; I’m recoding &lt;em&gt;is_iced&lt;/em&gt; to reflect that. The libraries I’ll be working with for the modeling section are &lt;code&gt;tidymodels&lt;/code&gt; for nearly everything, &lt;code&gt;themis&lt;/code&gt; to use SMOTE to attempt to correct the class imbalance, and &lt;code&gt;finetune&lt;/code&gt; to run the &lt;code&gt;tune_race&lt;/code&gt; option.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidymodels)
library(themis)
library(finetune)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;model_data &amp;lt;- fg_data %&amp;gt;%
  transmute(
    regulation_time_remaining,
    attempted_distance,
    drive_is_home_offense = if_else(drive_is_home_offense, 1, 0),
    score_diff,
    prior_miss = if_else(prior_miss==1, &amp;#39;yes&amp;#39;, &amp;#39;no&amp;#39;),
    offense_win_prob,
    is_overtime = if_else(period &amp;gt; 4, 1, 0),
    is_iced = factor(is_iced, levels = c(1, 0), labels = c(&amp;#39;iced&amp;#39;, &amp;#39;not_iced&amp;#39;))
  )&lt;/code&gt;&lt;/pre&gt;
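&lt;p&gt;A tiny base R check of that factor coding shows why the levels are listed as &lt;code&gt;c(1, 0)&lt;/code&gt;: yardstick’s class metrics default to treating the first factor level as the event, so “iced” (the positive class) has to come first.&lt;/p&gt;

```r
# {yardstick} treats the FIRST factor level as the event by default, so the
# levels are ordered to put "iced" (the positive class) first.
is_iced = factor(c(1, 0, 0, 1), levels = c(1, 0), labels = c("iced", "not_iced"))
levels(is_iced)  # "iced" "not_iced" -- "iced" is the event level
```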
&lt;p&gt;One of the powerful pieces of the &lt;code&gt;tidymodels&lt;/code&gt; ecosystem is that it’s possible to try out different pre-processing recipes and model specifications with ease. For example, this dataset is heavily class imbalanced, so I can easily try two versions of the model: one that attempts to correct for this and one that does not. To assess how good a job my model does at predicting future data, I’ll split my data into a training set and a test set, stratifying on &lt;em&gt;is_iced&lt;/em&gt; to ensure the dependent variable is balanced across the slices. The &lt;code&gt;initial_split()&lt;/code&gt; function creates the split with a default proportion of 75%, and &lt;code&gt;training()&lt;/code&gt; and &lt;code&gt;testing()&lt;/code&gt; extract the data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(20220102)
ice_split &amp;lt;- initial_split(model_data, strata = is_iced)
ice_train &amp;lt;- training(ice_split)
ice_test &amp;lt;- testing(ice_split)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;One thing to note is that XGBoost has many tuning parameters so I’ll use cross-validation to figure out the best combination of hyper-parameters. The &lt;code&gt;vfold_cv()&lt;/code&gt; function will take the training data and split it into 5 folds again stratifying by the &lt;em&gt;is_iced&lt;/em&gt; variable.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;train_5fold &amp;lt;- ice_train %&amp;gt;%
  vfold_cv(5, strata = is_iced)&lt;/code&gt;&lt;/pre&gt;
&lt;div id=&#34;tidymodels&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Tidymodels&lt;/h3&gt;
&lt;p&gt;My interpretation of the building blocks is: &lt;strong&gt;recipes&lt;/strong&gt;, which handle how data should be pre-processed; &lt;strong&gt;specifications&lt;/strong&gt;, which tell &lt;code&gt;tidymodels&lt;/code&gt; which algorithms and parameters to use; and &lt;strong&gt;workflows&lt;/strong&gt;, which bring them together. Since I’ve done most of the pre-processing in the data gathering piece, these recipes will be pretty vanilla. However, this data is &lt;em&gt;heavily&lt;/em&gt; imbalanced, with only 4% of attempts being iced, so I will have two recipes. The first sets up the formula and one-hot encodes the categorical variables.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;rec_norm &amp;lt;- recipe(is_iced ~ ., data = ice_train) %&amp;gt;%
  step_dummy(all_nominal_predictors(), one_hot =T) &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;and the second will add a second step that uses &lt;code&gt;step_smote()&lt;/code&gt; to create new examples of the minority class to fix the class imbalance problem.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;rec_smote &amp;lt;- recipe(is_iced ~ ., data = ice_train) %&amp;gt;%
  step_dummy(all_nominal_predictors(), one_hot = T) %&amp;gt;%
  step_smote(is_iced) &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then I’ll define my specification. The hyper-parameters that I want to tune are set to &lt;code&gt;tune()&lt;/code&gt; and then I tell {tidymodels} that I want to use XGBoost for a classification problem.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;xg_spec &amp;lt;- boost_tree(
  trees = tune(), 
  tree_depth = tune(), 
  min_n = tune(), 
  loss_reduction = tune(),                     
  sample_size = tune(), 
  mtry = tune(),         
  learn_rate = tune(), 
  stop_iter = tune()
) %&amp;gt;% 
  set_engine(&amp;quot;xgboost&amp;quot;) %&amp;gt;% 
  set_mode(&amp;quot;classification&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The recipes and specifications are combined in a workflow (if using 1 recipe and 1 specification) or a workflow set if wanting to use different combinations. In the &lt;code&gt;workflow_set()&lt;/code&gt; function you can specify a list of recipes as &lt;em&gt;preproc&lt;/em&gt; and a list of specifications as &lt;em&gt;models&lt;/em&gt;. Setting the &lt;em&gt;cross&lt;/em&gt; parameter to true creates every possible combination. For this analysis I’ll have 2 preproc/model combinations:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;wf_sets &amp;lt;- workflow_set(
  preproc = list(norm = rec_norm, 
                 smote = rec_smote),
  models = list(vanilla = xg_spec),
  cross = T
)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The next step is setting up the grid of parameters that will be tried in the model specifications above. Since there are a lot of parameters to be tuned and I don’t want this to run forever, I’m using &lt;code&gt;grid_latin_hypercube()&lt;/code&gt; to set 100 combinations of parameters that try to cover the entire parameter space without running every combination.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;grid &amp;lt;- grid_latin_hypercube(
  trees(),
  tree_depth(),
  min_n(),
  loss_reduction(),
  sample_size = sample_prop(),
  finalize(mtry(), ice_train),
  learn_rate(),
  stop_iter(range = c(10L, 50L)),
  size = 100
)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Finally it’s time to train the various workflows that have been designed. To do this I’ll pass the workflow set defined above into the &lt;code&gt;workflow_map()&lt;/code&gt; function. The “tune_race_anova” specification tells the training process to abandon certain hyper-parameter values if they’re not showing value. More detail can be found in &lt;a href=&#34;https://juliasilge.com/blog/baseball-racing/&#34;&gt;Julia Silge’s post&lt;/a&gt;. Also passed into this function are the resamples generated from the 5 folds, the grid of parameters, and a control set that will save the predictions and workflows so that I can revisit them later on. Finally, I create a metric set of the performance metrics I want to calculate, here choosing F1, accuracy, ROC AUC, mean log loss, PR AUC, precision, and recall.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#Set up Multiple Cores
doParallel::registerDoParallel(cores = 4)


tuned_results &amp;lt;- wf_sets %&amp;gt;% 
  workflow_map(
    &amp;quot;tune_race_anova&amp;quot;,
    resamples = train_5fold,
    grid = grid,
    control = control_race(save_pred = TRUE,
                           parallel_over = &amp;quot;everything&amp;quot;,
                           save_workflow = TRUE),
    metrics = metric_set(f_meas, accuracy, roc_auc, mn_log_loss, pr_auc, precision, recall),
    seed = 20210109
  )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Tidymodels has an &lt;code&gt;autoplot()&lt;/code&gt; function which will plot the best scoring model runs for each metric. However, I want a little more customization than what that function (or at least what I know of that function) provides. Using &lt;code&gt;map_dfr()&lt;/code&gt; I’m going to stack the top model for each specification for each of the performance metrics on top of each other, using &lt;code&gt;rank_results()&lt;/code&gt; to get the top model for each config for each metric.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;perf_stats &amp;lt;- map_dfr(c(&amp;#39;accuracy&amp;#39;, &amp;#39;roc_auc&amp;#39;, &amp;#39;mn_log_loss&amp;#39;, &amp;#39;pr_auc&amp;#39;, &amp;#39;f_meas&amp;#39;,
                        &amp;#39;precision&amp;#39;, &amp;#39;recall&amp;#39;),
                ~rank_results(tuned_results, rank_metric = .x, select_best = T) %&amp;gt;% 
                filter(.metric == .x) 
        )

perf_stats %&amp;gt;% 
  ggplot(aes(x = wflow_id, color = wflow_id, y = mean)) +
    geom_pointrange(aes(y = mean, ymin = mean - 1.96*std_err, ymax = mean + 1.96*std_err)) + 
    facet_wrap(~.metric, scales = &amp;quot;free_y&amp;quot;) + 
    scale_color_discrete(guide = &amp;#39;none&amp;#39;) + 
    labs(title = &amp;quot;Performance Metric for Tuned Results&amp;quot;,
         x = &amp;quot;Model Spec&amp;quot;,
         y = &amp;quot;Metric Value&amp;quot;,
         color = &amp;quot;Model Config&amp;quot;
    ) + 
    theme_light()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2022/01/24/predicting-when-kickers-get-iced-with-tidymodels/index_files/figure-html/tune_results-1.png&#34; width=&#34;100%&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Since I care about the positive class more than the negative class, but don’t have a strong preference for whether false positives or false negatives are more costly, I’m going to use the F1-Score as the &lt;a href=&#34;https://machinelearningmastery.com/tour-of-evaluation-metrics-for-imbalanced-classification/&#34;&gt;performance metric I care most about&lt;/a&gt;. As expected, the plain vanilla specification had a higher accuracy than the version using SMOTE to correct for imbalance, but it had lower values for F1, PR AUC, and ROC AUC. I can also use &lt;code&gt;rank_results()&lt;/code&gt; to show the top models for the F1 measure across the different specifications:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;rank_results(tuned_results, rank_metric = &amp;#39;f_meas&amp;#39;) %&amp;gt;%
  select(wflow_id, .config, .metric, mean, std_err) %&amp;gt;%
  filter(.metric == &amp;#39;f_meas&amp;#39;) %&amp;gt;% 
  kable()&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;wflow_id&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;.config&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;.metric&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;mean&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;std_err&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;smote_vanilla&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Preprocessor1_Model050&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;f_meas&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.4113826&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0122165&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;smote_vanilla&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Preprocessor1_Model049&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;f_meas&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.4096641&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0135101&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;smote_vanilla&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Preprocessor1_Model045&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;f_meas&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.4092579&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0123975&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;smote_vanilla&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Preprocessor1_Model076&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;f_meas&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.4075923&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0094581&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;smote_vanilla&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Preprocessor1_Model097&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;f_meas&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.4049903&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0089085&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;smote_vanilla&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Preprocessor1_Model063&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;f_meas&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.4047996&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0101844&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;smote_vanilla&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Preprocessor1_Model030&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;f_meas&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.4033350&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0105798&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;norm_vanilla&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Preprocessor1_Model040&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;f_meas&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.2830217&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0225049&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The top 7 models by F1 are all various configurations of the SMOTE recipe. The best model specification had an average F1 of 0.411 across the five folds. To get a better understanding of what this model’s specification actually was I extract the model configuration that has the best F1-score by using &lt;code&gt;extract_workflow_set_result()&lt;/code&gt; with the workflow id and then &lt;code&gt;select_best()&lt;/code&gt; with the metric I care about:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;##Get Best Model
best_set &amp;lt;- tuned_results %&amp;gt;% 
  extract_workflow_set_result(&amp;#39;smote_vanilla&amp;#39;) %&amp;gt;% 
  select_best(metric = &amp;#39;f_meas&amp;#39;)

kable(best_set)&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;colgroup&gt;
&lt;col width=&#34;5%&#34; /&gt;
&lt;col width=&#34;6%&#34; /&gt;
&lt;col width=&#34;6%&#34; /&gt;
&lt;col width=&#34;11%&#34; /&gt;
&lt;col width=&#34;11%&#34; /&gt;
&lt;col width=&#34;15%&#34; /&gt;
&lt;col width=&#34;12%&#34; /&gt;
&lt;col width=&#34;10%&#34; /&gt;
&lt;col width=&#34;23%&#34; /&gt;
&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;right&#34;&gt;mtry&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;trees&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;min_n&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;tree_depth&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;learn_rate&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;loss_reduction&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;sample_size&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;stop_iter&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;.config&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;right&#34;&gt;5&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1641&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;19&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;8&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.007419&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;9.425834&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.9830687&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;21&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Preprocessor1_Model050&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The best model in this case had 5 random predictors, 1641 trees, and so on.&lt;/p&gt;
&lt;p&gt;Now that I know which model configuration is the best one, the last step is to finalize the model using the full training data and predict on the test set. The next block of code extracts the workflow, sets the parameters to those from the &lt;em&gt;best_set&lt;/em&gt; defined above using &lt;code&gt;finalize_workflow()&lt;/code&gt;, and then &lt;code&gt;last_fit()&lt;/code&gt;, given the workflow and the split object, does the final fitting on the full training set and prediction on the testing data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;final_fit &amp;lt;- tuned_results %&amp;gt;%
  extract_workflow(&amp;#39;smote_vanilla&amp;#39;) %&amp;gt;%
  finalize_workflow(best_set) %&amp;gt;%
  last_fit(ice_split, metrics=metric_set(accuracy, roc_auc, mn_log_loss, 
                                         pr_auc, f_meas, precision, recall))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then with &lt;code&gt;collect_metrics()&lt;/code&gt; I can see the final results when the model was applied to the test set that had been unused thus far.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;collect_metrics(final_fit) %&amp;gt;% 
  kable()&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;.metric&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;.estimator&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;.estimate&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;.config&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;accuracy&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;binary&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.9341443&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Preprocessor1_Model1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;f_meas&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;binary&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.4332130&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Preprocessor1_Model1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;precision&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;binary&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.3438395&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Preprocessor1_Model1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;recall&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;binary&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.5853659&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Preprocessor1_Model1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;roc_auc&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;binary&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.9101661&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Preprocessor1_Model1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;mn_log_loss&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;binary&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.1546972&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Preprocessor1_Model1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;pr_auc&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;binary&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.3505282&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Preprocessor1_Model1&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The F1 score is actually higher than in training at 0.43, with a precision of 34%, a recall of 59%, and a ROC AUC of 0.91.&lt;/p&gt;
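&lt;p&gt;As a quick arithmetic check, the test-set F1 follows directly from the precision and recall in the table above, since F1 is their harmonic mean:&lt;/p&gt;

```r
# F1 = 2 * precision * recall / (precision + recall), using the
# test-set estimates from the collect_metrics() table above
precision = 0.3438395
recall    = 0.5853659
f1 = 2 * precision * recall / (precision + recall)
round(f1, 4)  # 0.4332, matching the f_meas row
```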
&lt;p&gt;Tidymodels also makes it very easy to display ROC curves using &lt;code&gt;collect_predictions&lt;/code&gt; to get the predictions from the final model and test set and &lt;code&gt;roc_curve&lt;/code&gt; to calculate the sensitivity and specificity.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;collect_predictions(final_fit) %&amp;gt;%
  roc_curve(is_iced, .pred_iced) %&amp;gt;%
  ggplot(aes(1 - specificity, sensitivity)) +
  geom_abline(lty = 2, color = &amp;quot;gray80&amp;quot;, size = 1.5) +
  geom_path(alpha = 0.8, size = 1) +
  coord_equal() +
  labs(color = NULL)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2022/01/24/predicting-when-kickers-get-iced-with-tidymodels/index_files/figure-html/roc_curve-1.png&#34; width=&#34;100%&#34; /&gt;&lt;/p&gt;
&lt;p&gt;As well as calculate the confusion matrix with &lt;code&gt;collect_predictions&lt;/code&gt; and &lt;code&gt;conf_mat&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;collect_predictions(final_fit) %&amp;gt;%
  conf_mat(is_iced, .pred_class) %&amp;gt;%
  autoplot(type = &amp;#39;heatmap&amp;#39;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2022/01/24/predicting-when-kickers-get-iced-with-tidymodels/index_files/figure-html/confusion_matrix-1.png&#34; width=&#34;100%&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;part-3-interpreting-the-model&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Part 3: Interpreting the model&lt;/h2&gt;
&lt;p&gt;So now the model has been built and can be used to predict whether a field goal attempt will get iced given certain parameters. But XGBoost is in the class of “black-box” models where it might be difficult to know what’s going on under the hood. In this third part, I’ll explore:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Variable Importance&lt;/li&gt;
&lt;li&gt;Partial Dependency Plots&lt;/li&gt;
&lt;li&gt;SHAP Values&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;All of which will help to provide some interpretability to the model fit in part 2.&lt;/p&gt;
&lt;div id=&#34;variable-importance&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Variable Importance&lt;/h3&gt;
&lt;p&gt;Variable Importance plots are one way of understanding which predictor has the largest effect on the model outcomes. There are many ways to measure variable importance but the one I’m using is the default in the {vip} package for XGBoost which is “gain”. Variable importance using gain measures the fractional contribution of each feature to the model based on the total gain of the feature’s splits where gain is the improvement to accuracy brought by a feature to its branches.&lt;/p&gt;
&lt;p&gt;The {vip} package provides variable importance when given a model object as an input. To get that I use &lt;code&gt;extract_fit_parsnip()&lt;/code&gt; to get the parsnip version of the model object. Then the &lt;code&gt;vip()&lt;/code&gt; function does the rest.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(vip)

extract_fit_parsnip(final_fit) %&amp;gt;%
  vip(geom = &amp;quot;point&amp;quot;, include_type = T) + 
  geom_text(aes(label = scales::percent(Importance, accuracy = 1)),
            nudge_y = 0.023) + 
  theme_light()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2022/01/24/predicting-when-kickers-get-iced-with-tidymodels/index_files/figure-html/varImp-1.png&#34; width=&#34;100%&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Unsurprisingly, regulation time remaining is the most important feature, which makes sense because the amount of time remaining dictates whether using a timeout on a kick is worthwhile. It is a bit more surprising that whether the kicking team is the home team is the 2nd most important feature, as I would have thought game situation would matter more than home or away status; I thought score difference would rank higher.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;partial-dependency&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Partial Dependency&lt;/h3&gt;
&lt;p&gt;Variable importance tells us “what variables matter” but it doesn’t tell us “how they matter”. Are the relationships between the predictors and predictions linear or non-linear? Is there some magic number where a step function occurs? Variable importance cannot answer these questions, but partial dependency plots can!&lt;/p&gt;
&lt;p&gt;A partial dependency plot shows the effect of a predictor on the model outcome holding everything else constant. The {pdp} package can generate these plots, though it is a little less friendly with {tidymodels} since you need to provide the native model object rather than the parsnip version (still easily accessible using &lt;code&gt;extract_fit_engine()&lt;/code&gt;). Also, the data passed into the &lt;code&gt;partial()&lt;/code&gt; function needs to be the same as the data that actually goes into the model object. So I create &lt;em&gt;fitted_data&lt;/em&gt; by &lt;code&gt;prep()&lt;/code&gt;ing the recipe and then &lt;code&gt;bake()&lt;/code&gt;’ing it, which applies the recipe to the original data set.&lt;/p&gt;
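Conceptually, what partial() computes can be sketched in a few lines of base R, with a toy logistic function standing in for the fitted XGBoost model (the function, its coefficients, and the column names here are all illustrative):

```r
# A toy "model" stands in for the fitted booster (illustrative only)
set.seed(1)
toy_predict = function(df) plogis(0.08 * df$distance - 0.02 * abs(df$score_diff) - 4)

train = data.frame(distance   = runif(500, 18, 60),
                   score_diff = rnorm(500, 0, 10))

# Partial dependence of `distance`: for each grid value, overwrite the
# column everywhere, predict, and average over all training rows --
# everything else is held at its observed values.
grid = seq(20, 55, by = 5)
yhat = sapply(grid, function(v) {
  tmp = train
  tmp$distance = v
  mean(toy_predict(tmp))
})
data.frame(distance = grid, yhat = yhat)
```

This brute-force averaging is exactly why partial() is slow on larger data, and why parallelizing it with {furrr} pays off.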
&lt;p&gt;The &lt;code&gt;partial&lt;/code&gt; function can also take a while to run, so I’m using {furrr}, which allows {purrr} functions to be run in parallel on the {future} backend. In the &lt;code&gt;future_map_dfr&lt;/code&gt; function, I’m running &lt;code&gt;partial&lt;/code&gt; on every predictor in the data and stacking the results on top of each other so that I can plot them in the final step. The use of &lt;em&gt;prob=T&lt;/em&gt; converts the model output to a probability, but since XGBoost probabilities are uncalibrated it’s best not to read too much into the exact values.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(pdp)

##Get Processed Training Data
model_object &amp;lt;- extract_fit_engine(final_fit)

fitted_data &amp;lt;- rec_smote %&amp;gt;%
  prep() %&amp;gt;%
  bake(new_data = model_data) %&amp;gt;%
  select(-is_iced)

library(furrr)
plan(multisession, workers = 4)

all_partial &amp;lt;- future_map_dfr(
  names(fitted_data), ~as_tibble(partial(
    model_object,
    train = fitted_data,
    pred.var = .x,
    type = &amp;#39;classification&amp;#39;,
    plot = F,
    prob = T, #Converts model output to probability scale
    trim.outliers = T
  )) %&amp;gt;% 
    mutate(var = .x) %&amp;gt;%
    rename(value = all_of(.x)),
  .progress = T,
  .options = furrr_options(seed = 20220109)
)

all_partial %&amp;gt;% 
  #Remove the one-hot encoded prior_miss and overtime indicators
  filter(!str_detect(var, &amp;#39;prior_miss|overtime&amp;#39;)) %&amp;gt;% 
  ggplot(aes(x = value, y = yhat, color = var)) + 
    geom_line() + 
    geom_smooth(se = F, lty = 2, span = .5) + 
    facet_wrap(~var, scales = &amp;quot;free&amp;quot;) + 
    #scale_y_continuous(labels = percent_format(accuracy = .1)) + 
    scale_color_discrete(guide = &amp;#39;none&amp;#39;) +
    labs(title = &amp;quot;Partial Dependency Plots for Whether A Kick Gets Iced?&amp;quot;,
         subtitle = &amp;quot;Looking at 19,072 NCAA Field Goal Attempts (2013-2021)&amp;quot;,
         x = &amp;quot;Variable Value&amp;quot;,
         y = &amp;quot;Prob. of Attempt Getting Iced&amp;quot;) + 
    theme_light()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2022/01/24/predicting-when-kickers-get-iced-with-tidymodels/index_files/figure-html/pdp-1.png&#34; width=&#34;100%&#34; /&gt;&lt;/p&gt;
&lt;p&gt;From these plots we can tell that the likelihood of getting iced increases when:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The Attempted Distance is between 30-50 yards&lt;/li&gt;
&lt;li&gt;When two teams are expected to be somewhat evenly matched (based on pre-game win probabilities)&lt;/li&gt;
&lt;li&gt;When it’s nearly the end of the game or the end of the half (the middle spike in regulation time remaining is halftime, since timeouts reset at the beginning of each half)&lt;/li&gt;
&lt;li&gt;When the kicking team is losing by a very small margin (or when the game is within +/- 10 points)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;While variable importance told us that Regulation Time Remaining was the most important variable, the partial dependency plot shows us how it affects the model in a non-linear way.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;shap-values&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;SHAP Values&lt;/h3&gt;
&lt;p&gt;The next measure of interpretability combines pieces of both variable importance and partial dependency plots. SHAP values are &lt;a href=&#34;https://liuyanguu.github.io/post/2019/07/18/visualization-of-shap-for-xgboost/&#34;&gt;claimed to be the most advanced method to interpret results from tree-based models&lt;/a&gt;. They are based on Shapley values from game theory and measure feature importance based on the marginal contribution of each predictor, for each observation, to the model output.&lt;/p&gt;
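To make “marginal contribution” concrete, here is an exact Shapley calculation for a single observation of a tiny two-feature toy model. The function f and the background data are illustrative stand-ins; for tree ensembles, {SHAPforxgboost} relies on the much faster TreeSHAP algorithm rather than this brute-force averaging:

```r
# Toy model and background (training) data -- purely illustrative
set.seed(1)
f  = function(x1, x2) 2 * x1 + x2
bg = data.frame(x1 = rnorm(1000), x2 = rnorm(1000))
x  = list(x1 = 1.5, x2 = -0.5)  # the observation being explained

# Value of each feature coalition: expected prediction with the
# coalition's features fixed at the observation and the rest drawn
# from the background data
v_none = mean(f(bg$x1, bg$x2))
v_1    = mean(f(x$x1,  bg$x2))
v_2    = mean(f(bg$x1, x$x2))
v_12   = f(x$x1, x$x2)

# Shapley value of a feature: its marginal contribution averaged
# over both possible orderings of adding the two features
phi1 = 0.5 * ((v_1 - v_none) + (v_12 - v_2))
phi2 = 0.5 * ((v_2 - v_none) + (v_12 - v_1))
c(phi1 = phi1, phi2 = phi2)
```

By construction phi1 + phi2 equals f(x) minus the average prediction, which is why the SHAP values for an observation always decompose its prediction relative to the baseline.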
&lt;p&gt;The &lt;a href=&#34;https://liuyanguu.github.io/post/2019/07/18/visualization-of-shap-for-xgboost/&#34;&gt;{SHAPforxgboost}&lt;/a&gt; package provides an interface to getting SHAP values. The plot that will give us overall variable importance is the SHAP summary plot which we’ll get using &lt;code&gt;shap.plot.summary&lt;/code&gt;. However, first the data structure needs to be prepped using the model object and the training data in a matrix.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(SHAPforxgboost)

shap_long &amp;lt;- shap.prep(xgb_model = extract_fit_engine(final_fit), 
                        X_train = fitted_data %&amp;gt;% as.matrix())
                       
shap.plot.summary(shap_long)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2022/01/24/predicting-when-kickers-get-iced-with-tidymodels/index_files/figure-html/shap-1.png&#34; width=&#34;100%&#34; /&gt;&lt;/p&gt;
&lt;p&gt;In the summary plot, the most important variables are ordered from top to bottom. Within any given variable, each point represents an observation. The shading of each point represents whether that observation has a higher or lower value for that feature. For example, for regulation time remaining, lower amounts of remaining time will be orange while higher amounts will be purple. The position to the left or right of zero represents whether the feature decreases or increases the likelihood of getting iced. For regulation time remaining, notice that the very purple points are strongly negative (on the left side) and the very orange points are strongly positive (on the right side).&lt;/p&gt;
&lt;p&gt;Similar to the variable importance plot, regulation time remaining was the most important feature.&lt;/p&gt;
&lt;p&gt;We can also get dependency plots similar to the partial dependency plots with SHAP values using &lt;code&gt;shap.plot.dependence&lt;/code&gt;. We’ll look at the regulation time remaining on the x-axis and the SHAP values for regulation time remaining on the y-axis. Since this returns a ggplot object, I’ll add in vertical lines to represent the end of each quarter.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;SHAPforxgboost::shap.plot.dependence(data_long = shap_long, x = &amp;#39;regulation_time_remaining&amp;#39;, 
                                     y = &amp;#39;regulation_time_remaining&amp;#39;, 
                                     color_feature = &amp;#39;regulation_time_remaining&amp;#39;) + 
  ggtitle(&amp;quot;Shap Values vs. Regulation Time Remaining&amp;quot;) + 
  geom_vline(xintercept = 0, lty = 2) + 
    geom_vline(xintercept = 900, lty = 2) + 
    geom_vline(xintercept = 1800, lty = 2) + 
    geom_vline(xintercept = 2700, lty = 2) &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2022/01/24/predicting-when-kickers-get-iced-with-tidymodels/index_files/figure-html/shap2-1.png&#34; width=&#34;100%&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Similar to the summary plot, the less time remaining in the game the more orange the point and the more time remaining the more purple. Again, like in the partial dependency plot, we see a non-linear relationship with increases towards the end of each quarter and heavy spikes in the last 3 minutes of the 2nd and 4th quarters.&lt;/p&gt;
&lt;p&gt;This is just an example of the things that can be done with SHAP values, but hopefully their usefulness for understanding both what’s important and how it’s important has been illustrated.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;wrapping-up&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Wrapping Up&lt;/h2&gt;
&lt;p&gt;This was quite long, so a huge thanks if you made it to the end. This post took a tour through {tidymodels} and some interpretable ML tools to look at when field goal attempts are more likely to get iced. If you’re a football fan the results shouldn’t be terribly surprising, but it’s good to know that the model outputs generally pass the domain-expertise “sniff test”. In the next post, I’ll use this same data to try to understand whether icing the kicker actually works in making the kicker more likely to miss the attempt.&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Examining College Football Conference Realignment with {ggraph}</title>
      <link>https://jlaw.netlify.app/2021/12/29/examining-college-football-conference-realignment-with-ggraph/</link>
      <pubDate>Wed, 29 Dec 2021 00:00:00 +0000</pubDate>
      <guid>https://jlaw.netlify.app/2021/12/29/examining-college-football-conference-realignment-with-ggraph/</guid>
      <description>
&lt;script src=&#34;https://jlaw.netlify.app/2021/12/29/examining-college-football-conference-realignment-with-ggraph/index_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;p&gt;In my previous &lt;a href=&#34;https://jlaw.netlify.app/2021/12/27/exploring-college-football-non-conference-rivalries-with-ggraph/&#34;&gt;post&lt;/a&gt; I looked at College Football non-conference games to create a network map overlaid on top of the United States using the {ggraph} package. In this post I’ll extend that to examine conference realignment, which is when colleges move from one conference to another. Over the years this has been driven by internal politics between football schools and basketball schools, or by schools seeking more clout by joining a more prestigious conference.&lt;/p&gt;
&lt;p&gt;More specifically, I’ll be making a network map based on historical conference affiliations to visualize the changes that have occurred due to realignment. Then I’ll zoom specifically into the case of the Big 12 conference to show how the graph reflects the history of the conference.&lt;/p&gt;
&lt;p&gt;Since all of the packages used in this post were described in the prior &lt;a href=&#34;https://jlaw.netlify.app/2021/12/27/exploring-college-football-non-conference-rivalries-with-ggraph/&#34;&gt;post&lt;/a&gt;, I’ll skip over that section here.&lt;/p&gt;
&lt;div id=&#34;set-up&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Set up&lt;/h1&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidyverse)
library(cfbfastR)
library(tidygraph)
library(ggraph)
library(ggtext)
library(showtext)

font_add_google(&amp;#39;Roboto&amp;#39;, &amp;quot;roboto&amp;quot;)
showtext_auto()&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;creating-a-network-of-the-fbs-conference-affiliations&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Creating a Network of the FBS Conference Affiliations&lt;/h1&gt;
&lt;p&gt;For both analyses I’ll be creating a network graph where individual schools are the nodes and the edges represent whether those schools were in the same conference in a given year. Since conference affiliations change over time, the number of years that schools were in the same conference forms a strength of association. To get this data, I’ll be using the &lt;code&gt;cfbd_team_info()&lt;/code&gt; function from the {cfbfastR} package to return a list of all the FBS schools and their conference affiliations for each year between 1980 and 2021.&lt;/p&gt;
&lt;p&gt;The choice of 1980 is arbitrary to limit the number of connections and the size of the data. However, the package can return data much further back in time.&lt;/p&gt;
&lt;p&gt;In order to extract the data for each year I pass a vector of years 1980 through 2021 into &lt;code&gt;map_dfr&lt;/code&gt; from {purrr} to run a custom function taking each individual year as an input and stacking the results into a single data frame. My custom function first calls the College Football Database API to retrieve all the schools for a given year and removes all the Independent schools since they do not have an affiliation (for example, Notre Dame). Then since I need to get my list of schools into a list of co-occurrences for each conference, I &lt;code&gt;group_by()&lt;/code&gt; conference so the next parts of the function get run on a conference by conference basis and &lt;code&gt;expand&lt;/code&gt; the school column to create two columns with all within conference combinations. Since “all within conference combinations” includes having the same school twice, I’ll filter out those rows, and since A/B is different than B/A, I’ll create new variables that will always put the school coming first alphabetically into &lt;code&gt;school1&lt;/code&gt; and the other into &lt;code&gt;school2&lt;/code&gt;. Technically, this will double count each entry but I’ll run &lt;code&gt;distinct()&lt;/code&gt; to get the unique set since I’m going to eventually weight by the number of years and this function runs one year at a time.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;conference_graph_data &amp;lt;- map_dfr(1980:2021, function(yr){
  # get the list of schools for a given year
  x &amp;lt;- cfbd_team_info(year = yr) %&amp;gt;%
    # remove independents
    filter(conference != &amp;#39;FBS Independents&amp;#39;) %&amp;gt;%
    group_by(conference) %&amp;gt;% 
    # get all combinations of schools within each conference
    expand(school, school, .name_repair = &amp;#39;universal&amp;#39;)%&amp;gt;% 
    # Remove the combinations that are the same school twice
    filter(school...2 != school...3) %&amp;gt;%
    # Enforce an order so that each school pair appears in the same order
    mutate(school1 = if_else(school...2 &amp;lt; school...3, school...2, school...3),
           school2 = if_else(school...2 &amp;lt; school...3, school...3, school...2),
           season = yr) %&amp;gt;%
    # subset the columns
    select(season, conference, school1, school2) %&amp;gt;%
    # remove duplicates since each combination would be counted twice
    distinct()
  return(x)
  
})&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For the nodes on this graph I only want the schools that are part of the Football Bowl Subdivision in 2021, rather than schools that may have dropped down to the FCS. To get this list I’ll run &lt;code&gt;cfbd_team_info(year = 2021)&lt;/code&gt; to get a data frame of all 2021 schools. But since I only need a vector to filter on, I’ll use &lt;code&gt;pull()&lt;/code&gt; to extract the school names into a vector.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;current_fbs &amp;lt;- cfbd_team_info(year = 2021) %&amp;gt;%
  pull(school)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Next, I’ll use the {tidygraph} package to turn this list of edges into a &lt;code&gt;tbl_graph()&lt;/code&gt; object. First I &lt;code&gt;ungroup&lt;/code&gt; the data frame since it would still be grouped from my custom function. Then using the &lt;code&gt;count()&lt;/code&gt; function I create a &lt;code&gt;weight&lt;/code&gt; column for each year the schools are affiliated with each other. Next, I leverage the vector I created in the step before to keep only edges where both schools are currently in the FBS. Then I create the &lt;code&gt;tbl_graph&lt;/code&gt; object using &lt;code&gt;as_tbl_graph()&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;tbl_graph&lt;/code&gt; objects can be manipulated using {dplyr} verbs to create additional information for either the nodes or the edges. In this instance I add two additional columns to the nodes:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;I add the number of schools that each node is affiliated with using &lt;code&gt;centrality_degree()&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;I create grouping of node communities using &lt;code&gt;group_louvain()&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;conf_graph_all &amp;lt;- conference_graph_data %&amp;gt;% 
  ungroup() %&amp;gt;% 
  count(school1, school2, name = &amp;#39;weight&amp;#39;, sort = T) %&amp;gt;% 
  filter(school1 %in% current_fbs &amp;amp; school2 %in% current_fbs) %&amp;gt;% 
  as_tbl_graph(directed = F) %&amp;gt;%
  mutate(degree = centrality_degree(),
         community = group_louvain())

print(conf_graph_all)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tbl_graph: 128 nodes and 1142 edges
## #
## # An undirected simple graph with 1 component
## #
## # Node Data: 128 x 3 (active)
##   name          degree community
##   &amp;lt;chr&amp;gt;          &amp;lt;dbl&amp;gt;     &amp;lt;int&amp;gt;
## 1 Air Force         18         3
## 2 Alabama           13         1
## 3 Arizona           11         7
## 4 Arizona State     11         7
## 5 Auburn            13         1
## 6 Ball State        15         4
## # ... with 122 more rows
## #
## # Edge Data: 1,142 x 3
##    from    to weight
##   &amp;lt;int&amp;gt; &amp;lt;int&amp;gt;  &amp;lt;int&amp;gt;
## 1     1    12     42
## 2     1    32     42
## 3     1    42     42
## # ... with 1,139 more rows&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Note that in the above output you can see the &lt;code&gt;degree&lt;/code&gt; and &lt;code&gt;community&lt;/code&gt; columns that I created. For the Arizona and Arizona State rows, the &lt;code&gt;degree&lt;/code&gt; of 11 means that each is connected to 11 schools (which I found kind of shocking, but since the Pac-10 formed in 1978 it makes sense that they’ve only ever been in a conference with the other now Pac-12 schools). The &lt;code&gt;community&lt;/code&gt; column means that they both belong to the same grouping of nodes, which in this case is probably the Pac-12.&lt;/p&gt;
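If the degree numbers feel opaque, the calculation can be reproduced from a raw edge list with base R alone; the school pairs here are a made-up toy example, not the real affiliation data:

```r
# Toy undirected edge list (illustrative pairs only)
edges = data.frame(
  school1 = c("Arizona", "Arizona", "Alabama"),
  school2 = c("Arizona State", "UCLA", "Auburn")
)

# In an undirected graph the degree of a node is the number of edges
# touching it, i.e. how many schools it has ever shared a conference with
degree = table(c(edges$school1, edges$school2))
degree
```

Here Arizona has degree 2 (edges to Arizona State and UCLA) while every other school has degree 1, mirroring what centrality_degree() computes on the tbl_graph.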
&lt;p&gt;For creating the network visualization itself, I’m using the {ggraph} package, which has a very similar syntax to {ggplot2}. The important notes here are that I’m displaying the edges as straight lines using &lt;code&gt;geom_edge_link()&lt;/code&gt; and varying the shading, color, and width based on the weight, and that I’m displaying the nodes as labels using &lt;code&gt;geom_node_label&lt;/code&gt; and filling by the &lt;code&gt;community&lt;/code&gt; column. Everything else should be pretty normal if you’re familiar with {ggplot2} syntax.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;conf_graph_all %&amp;gt;% 
  ggraph() + 
  geom_edge_link(aes(edge_alpha = weight, edge_color = weight, edge_width = weight)) + 
  geom_node_label(aes(label = name, fill = factor(community)), show.legend = F, size = 3) + 
  scale_edge_alpha_continuous(guide = &amp;#39;none&amp;#39;) + 
  scale_edge_width() + 
  scale_edge_color_viridis(option = &amp;#39;C&amp;#39;, end = .8, guide = &amp;#39;none&amp;#39;) + 
  scale_size_discrete(range = c(4, 6)) + 
  ggthemes::scale_fill_gdocs(guide = F, palette = ggthemes::tableau_color_pal()) + 
  labs(title = &amp;quot;2021 FBS College Football Teams Conference Affiliations&amp;quot;,
       subtitle = &amp;quot;Network of Affiliated Schools (1980 - 2021)&amp;quot;,
       edge_width = &amp;quot;Years Affiliated&amp;quot;,
       caption = &amp;#39;**Source:** CollegeFootballData API&amp;#39;) + 
  theme_graph() + 
  theme(
    legend.position = &amp;#39;bottom&amp;#39;,
    plot.title = element_markdown(family = &amp;#39;roboto&amp;#39;),
    plot.subtitle = element_markdown(family = &amp;#39;roboto&amp;#39;),
    plot.caption = element_markdown()
  )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;featured.PNG&#34; /&gt;&lt;/p&gt;
&lt;p&gt;While I normally like to have everything be reproducible, it felt necessary to add some annotations about what the various communities are, how they reflect the current conference structure, and how schools that changed conferences appear caught in a tug of war between two communities. These annotations, while possible to do in R, are much easier to add outside of it.&lt;/p&gt;
&lt;p&gt;The piece that I enjoy the most is the depiction of the former Big East football teams. Syracuse, Virginia Tech, Miami, Pittsburgh, and Boston College left for the ACC between 2004 and 2013 (with Louisville following in 2014); West Virginia left for the Big 12 in 2012; and Rutgers left for the Big Ten in 2014 (along with Maryland, who left the ACC for the Big Ten and shows up very clearly between those two clusters).&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;zooming-into-the-big-12&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Zooming into the Big 12&lt;/h1&gt;
&lt;p&gt;Using a similar technique to the one above I can look at a sub-graph of the current Big 12 schools. I chose the Big 12 for this example because I think the history of the conference is both interesting and well structured when compared to the complete chaos or complete stability of other conferences. Just to get this out of the way: College Football conferences sometimes anchor their names more to branding than to accuracy. You might notice that the Big 12 only has 10 schools and the Big Ten has 14. Best not to think too much about this.&lt;/p&gt;
&lt;p&gt;Similar to before, I’ll query the College Football Database API, passing in the conference parameter B12 and the year 2021 to get the list of existing schools. Then I’ll use that list to filter to the current Big 12 schools and any other school that has ever been affiliated with a current Big 12 school. For simplicity later on, I create an indicator for whether the node is a current Big 12 school.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;current_big_12 &amp;lt;- cfbd_team_info(conference = &amp;#39;B12&amp;#39;, year = 2021) %&amp;gt;%
  pull(school)


conf_graph_b12 &amp;lt;- conference_graph_data %&amp;gt;% 
  ungroup() %&amp;gt;% 
  # Filter to only pairs that involve at least 1 Big 12 School
  filter(school1 %in% current_big_12 | school2 %in% current_big_12) %&amp;gt;% 
  # Count the pairs to form the number of years that they were affiliated
  count(school1, school2, name = &amp;#39;weight&amp;#39;, sort = T) %&amp;gt;% 
  # Turn to tbl_graph_object
  as_tbl_graph(directed = F) %&amp;gt;%
  # Create indicator for a current Big 12 Schools
  mutate(is_current_big_12 = name %in% current_big_12) &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Using similar code to the full network above, I can plot the Big 12 sub-graph after filtering to only nodes from current Big 12 schools. In this case, rather than using the default &lt;code&gt;ggraph()&lt;/code&gt; layout, I give it the &lt;code&gt;&#39;fr&#39;&lt;/code&gt; string which applies the Fruchterman-Reingold layout algorithm. Since this can provide non-deterministic layouts, I set the seed before running.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(20211229)
conf_graph_b12 %&amp;gt;%
  # Filter to current Big 12 Schools
  filter(is_current_big_12) %&amp;gt;%
  ggraph(&amp;#39;fr&amp;#39;) + 
  geom_edge_link(aes(edge_alpha = weight, edge_color = weight, 
                     edge_width = weight)) + 
  geom_node_label(aes(label = name)) + 
  scale_edge_alpha_continuous(guide = &amp;#39;none&amp;#39;) + 
  scale_edge_width() + 
  scale_edge_color_viridis(option = &amp;#39;C&amp;#39;, end = .8, guide = &amp;#39;none&amp;#39;) + 
  labs(title = &amp;quot;2021 Big 12 Football Conference&amp;quot;,
       subtitle = &amp;quot;Network Graph Based on Conference Affiliations 1980-2021&amp;quot;,
       edge_width = &amp;quot;Years Affiliated&amp;quot;,
       caption = &amp;#39;**Source:** CollegeFootballData API&amp;#39;) + 
  theme_graph() + 
  theme(
    legend.position = &amp;#39;bottom&amp;#39;,
    plot.title = element_markdown(family = &amp;#39;roboto&amp;#39;),
    plot.subtitle = element_markdown(family = &amp;#39;roboto&amp;#39;),
    plot.caption = element_markdown()
    
  )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2021/12/29/examining-college-football-conference-realignment-with-ggraph/index_files/figure-html/b12_chart-1.png&#34; width=&#34;100%&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Just eyeballing the above graph it looks like there are 4 clusters of nodes:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;The strong network of Oklahoma, Oklahoma State, Iowa State, Kansas, and Kansas State&lt;/li&gt;
&lt;li&gt;The strong network of Texas, Texas Tech, and Baylor&lt;/li&gt;
&lt;li&gt;TCU, with moderate-strength connections to the Texas schools&lt;/li&gt;
&lt;li&gt;West Virginia without any strong connections.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;When looking through the &lt;a href=&#34;https://en.wikipedia.org/wiki/Big_12_Conference#Former_members&#34;&gt;Big 12&lt;/a&gt; conference history this structure makes a ton of sense. The conference was formed in 1996 from the merging of the &lt;a href=&#34;https://en.wikipedia.org/wiki/Big_Eight_Conference#Conference_split&#34;&gt;Big 8&lt;/a&gt;, which included the schools in group 1 (as well as Nebraska, Colorado, and Missouri, who left for other conferences in 2011-2012), and the &lt;a href=&#34;https://en.wikipedia.org/wiki/Southwest_Conference#Football&#34;&gt;Southwest Conference&lt;/a&gt;, from which Texas Tech, Texas, and Baylor joined (Texas A&amp;amp;M joined as well but left for a different conference in 2012). So the strong networks in groups 1 and 2, and the weaker connections between them, reflect these original conferences and their merging.&lt;/p&gt;
&lt;p&gt;TCU was part of the original Southwest Conference with the Texas schools but did not join the Big 12 until 2012 instead journeying through the Western Athletic Conference (WAC), Conference USA, and the Mountain West Conference. This is reflected in their connection with the Texas schools (through their time in the Southwest Conference) but with weaker strength than the other Texas schools have with each other.&lt;/p&gt;
&lt;p&gt;Finally, West Virginia joined the Big 12 in 2012 from the Big East conference and prior to that point had no affiliation with any of the other schools.&lt;/p&gt;
&lt;p&gt;While the graph is good for showing the structure of the relationships, it can be difficult to follow the conference merges and changes from it. These should be more apparent in the visualization below:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;conference_graph_data %&amp;gt;%
  ungroup() %&amp;gt;%
  filter(school1 %in% current_big_12 | school2 %in% current_big_12) %&amp;gt;%
  gather(dummy, school, -season, -conference) %&amp;gt;% 
  select(-dummy) %&amp;gt;%
  distinct() %&amp;gt;% 
  add_count(school, name = &amp;#39;years&amp;#39;) %&amp;gt;%
  group_by(school, years, conference) %&amp;gt;% 
  summarize(start = min(season)-.5, end = max(season)+.5) %&amp;gt;%
  mutate(first_conference = max(if_else(start == min(start), conference, NA_character_), na.rm = T),
         first_start = max(if_else(start == min(start), start, NA_real_), na.rm = T),
         n_conferences = n_distinct(conference)) %&amp;gt;%
  arrange(first_start, first_conference, -years, n_conferences, school) %&amp;gt;% 
  ungroup() %&amp;gt;% 
  mutate(ord = row_number()) %&amp;gt;% 
  filter(school %in% current_big_12) %&amp;gt;%
  ggplot(aes(x = fct_reorder(school, ord, min, .desc = T))) + 
  geom_linerange(aes(ymin = start, ymax = end, color = conference), size = 8) + 
  labs(x = &amp;quot;Schools&amp;quot;, y = &amp;quot;Season&amp;quot;, color = &amp;quot;Conference&amp;quot;,
       title = &amp;quot;Conference Migration of the Current Big 12 Schools&amp;quot;) + 
  coord_flip() + 
  ggthemes::scale_color_tableau() + 
  cowplot::theme_cowplot() + 
  theme(
    axis.text.y = element_markdown(),
    plot.subtitle = element_markdown(),
    panel.grid.major.y = element_line(color = &amp;#39;grey90&amp;#39;)
  )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2021/12/29/examining-college-football-conference-realignment-with-ggraph/index_files/figure-html/school_migration-1.png&#34; width=&#34;100%&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Given the history of the Big 12 Conference and college football conference realignment in general, it does appear that network structures work well for encoding the history of conference affiliations into a visualization.&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Exploring College Football Non-Conference Rivalries with {ggraph}</title>
      <link>https://jlaw.netlify.app/2021/12/27/exploring-college-football-non-conference-rivalries-with-ggraph/</link>
      <pubDate>Mon, 27 Dec 2021 00:00:00 +0000</pubDate>
      <guid>https://jlaw.netlify.app/2021/12/27/exploring-college-football-non-conference-rivalries-with-ggraph/</guid>
      <description>
&lt;script src=&#34;https://jlaw.netlify.app/2021/12/27/exploring-college-football-non-conference-rivalries-with-ggraph/index_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;p&gt;We’re in the middle of College Football’s bowl post-season and I’d been wanting to do a more in-depth post on networks using {tidygraph} and {ggraph} for a while. So now seemed like as good a time as any to explore some College Football data. I had used {ggraph} in prior posts on &lt;a href=&#34;https://jlaw.netlify.app/2021/09/15/finding-the-eras-of-mtv-s-the-challenge-through-clustering/&#34;&gt;exploring season’s of MTV’s The Challenge&lt;/a&gt; and when &lt;a href=&#34;https://jlaw.netlify.app/2020/11/01/sequence-mining-my-browsing-history-with-arulessequences/&#34;&gt;sequence mining my web browsing&lt;/a&gt; but this post will be more focused on the network visualization than those two posts.&lt;/p&gt;
&lt;p&gt;In this post I will explore &lt;strong&gt;what are the most common non-Conference games?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;But really the goal is to create some fun visualizations that hopefully will tell a story.&lt;/p&gt;
&lt;div id=&#34;getting-started-the-data&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Getting Started + The Data&lt;/h2&gt;
&lt;p&gt;For many of the posts on this blog I tend to web scrape my own data. Initially I had planned to use Wikipedia to get a list of all the Football Bowl Subdivision (FBS) teams and their 2019 schedules for this analysis, but it proved difficult to find the right data in an easily accessible form. However, &lt;strong&gt;there truly is an R package for everything&lt;/strong&gt;: enter {cfbfastR}, which provides access to the &lt;a href=&#34;https://collegefootballdata.com/&#34;&gt;College Football Database&lt;/a&gt; API and gave me easy access to all the information I needed. To use this package, all that’s needed is registering for a free API key and adding it to your .Renviron file.&lt;/p&gt;
&lt;p&gt;In addition to {cfbfastR} for getting the data, I’ll be using {showtext} to access Google Fonts, {tidyverse} for general data manipulation, {tidygraph} for handling the network data, and {ggraph} to handle the network graph plotting. Access to the Google Font &lt;em&gt;Roboto&lt;/em&gt; is done using {showtext}’s &lt;code&gt;font_add_google&lt;/code&gt; function and then &lt;code&gt;showtext_auto()&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidyverse)
library(cfbfastR)
library(tidygraph)
library(ggraph)
library(ggtext)
library(showtext)

font_add_google(&amp;#39;Roboto&amp;#39;, &amp;quot;roboto&amp;quot;)
showtext_auto()&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;what-are-the-largest-non-conference-rivalries-in-college-footballs-fbs&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;What are the largest non-Conference Rivalries in College Football’s FBS?&lt;/h2&gt;
&lt;p&gt;The goal will be to create a map showing the links between College Football’s largest non-Conference rivalries. In this case, “largest” will be defined as most frequent. While College Football has many rivalries between Conference opponents, I wanted to focus on non-Conference games because I felt it would make for a better visualization. Additionally, since Conference teams generally have to play each other frequently, it would be difficult to discern a “chosen” rivalry from one dictated by conference membership.&lt;/p&gt;
&lt;p&gt;The data that I’ll need for this analysis are:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;A list of the FBS schools. I’ll use 2019 data since the 2021 season is still in progress and the 2020 season was abnormal.&lt;/li&gt;
&lt;li&gt;A list of all the games played between 2010 and 2019, which is the time-frame I’ll be using for this analysis.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Fortunately, both of these are easily available from the College Football Database. The helper function &lt;code&gt;cfbd_team_info&lt;/code&gt; returns all of the FBS schools for the 2019 season with information on each school as well as its latitude and longitude, saving me the need to geocode.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;cfbd_game_info&lt;/code&gt; provides all the games for a specified year. In order to get all the seasons between 2010 and 2019 I use &lt;code&gt;map_dfr&lt;/code&gt; to iterate over the vector &lt;code&gt;2010:2019&lt;/code&gt; and row-bind each output into a combined data frame.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;schools &amp;lt;- cfbd_team_info(year = 2019)

schedule &amp;lt;- map_dfr(2010:2019, cfbd_game_info)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To create a network graph I will need datasets representing the nodes of the graph (the schools) and the edges (the match-ups between two schools). For the nodes this is straightforward since I just need a subset of the columns in &lt;code&gt;schools&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;nodes &amp;lt;- schools %&amp;gt;%
  select(id = team_id, school, conference, latitude, longitude)

knitr::kable(head(nodes, 5))&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;right&#34;&gt;id&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;school&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;conference&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;latitude&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;longitude&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;right&#34;&gt;2005&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Air Force&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Mountain West&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;38.99697&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-104.84362&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;right&#34;&gt;2006&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Akron&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Mid-American&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;41.07255&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-81.50834&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;right&#34;&gt;333&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Alabama&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;SEC&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;33.20828&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-87.55038&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;right&#34;&gt;2026&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Appalachian State&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Sun Belt&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;36.21143&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-81.68543&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;right&#34;&gt;12&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Arizona&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Pac-12&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;32.22881&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-110.94887&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Edges will be a little trickier since I want this graph to be undirected. If Notre Dame plays USC, I don’t really care who was the home team or the away team, so I’ll need a way to count these as the same match-up. While I’m sure there’s a better way to do this, I decided to make the team that comes first alphabetically &lt;em&gt;school1&lt;/em&gt; and the other team &lt;em&gt;school2&lt;/em&gt;. This applies a consistent ordering to any match-up.&lt;/p&gt;
&lt;p&gt;In order to use the {tidygraph} package the edge list needs to have &lt;em&gt;from&lt;/em&gt; and &lt;em&gt;to&lt;/em&gt; columns even if the graph is undirected. Then once I have the edge list I construct a &lt;em&gt;weight&lt;/em&gt; column by using the &lt;code&gt;count()&lt;/code&gt; function from {dplyr}.&lt;/p&gt;
&lt;p&gt;I also exclude all conference games using a field that comes in the data set, and add a filter to ensure that both nodes are FBS schools, since FBS schools can play non-FBS opponents during the season.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;edge_list &amp;lt;- schedule %&amp;gt;% 
  # Remove any conference games
  filter(conference_game == F,
         #require that both the home and away teams are in our graph 
         home_id %in% nodes$id, 
         away_id %in% nodes$id) %&amp;gt;% 
  # apply alphabetical ordering to the two teams
  mutate(
    first_team = if_else(home_team &amp;lt; away_team, home_team, away_team),
    first_id = if_else(home_team &amp;lt; away_team, home_id, away_id),
    second_team = if_else(home_team &amp;lt; away_team, away_team, home_team),
    second_id = if_else(home_team &amp;lt; away_team, away_id, home_id)
  ) %&amp;gt;%
  select(from = first_id, to = second_id, first_team, second_team) %&amp;gt;%
  count(from, to, first_team, second_team, name = &amp;#39;weight&amp;#39;)

knitr::kable(head(edge_list, 5))&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;right&#34;&gt;from&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;to&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;first_team&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;second_team&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;weight&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;right&#34;&gt;2&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;23&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Auburn&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;San José State&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;right&#34;&gt;2&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;97&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Auburn&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Louisville&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;right&#34;&gt;2&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;166&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Auburn&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;New Mexico State&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;right&#34;&gt;2&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;228&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Auburn&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Clemson&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;right&#34;&gt;2&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;264&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Auburn&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Washington&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The interpretation of the first row is that Auburn played San José State twice between 2010 and 2019, while the second row shows Auburn played Louisville only once.&lt;/p&gt;
&lt;p&gt;The {tidygraph} package has its own structure called a &lt;code&gt;tbl_graph&lt;/code&gt; which combines the nodes and edges into a single data structure and allows the user to manipulate either portion. While there is a constructor specifically for the &lt;code&gt;tbl_graph&lt;/code&gt; object, I was having trouble getting it to work so I used &lt;code&gt;graph_from_data_frame&lt;/code&gt; from {igraph} and then cast the graph to a &lt;code&gt;tbl_graph&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Also, no disrespect to the University of Hawaii, but their presence really messes up the graph since Hawaii is &lt;strong&gt;so&lt;/strong&gt; far from the other schools. So I’m just going to exclude them.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;g &amp;lt;- igraph::graph_from_data_frame(d = edge_list, directed = F, vertices = nodes) %&amp;gt;% 
  as_tbl_graph() %&amp;gt;% 
  filter(!str_detect(school, &amp;#39;Hawai&amp;#39;))

print(g)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tbl_graph: 129 nodes and 1137 edges
## #
## # An undirected simple graph with 1 component
## #
## # Node Data: 129 x 5 (active)
##   name  school            conference    latitude longitude
##   &amp;lt;chr&amp;gt; &amp;lt;chr&amp;gt;             &amp;lt;chr&amp;gt;            &amp;lt;dbl&amp;gt;     &amp;lt;dbl&amp;gt;
## 1 2005  Air Force         Mountain West     39.0    -105. 
## 2 2006  Akron             Mid-American      41.1     -81.5
## 3 333   Alabama           SEC               33.2     -87.6
## 4 2026  Appalachian State Sun Belt          36.2     -81.7
## 5 12    Arizona           Pac-12            32.2    -111. 
## 6 9     Arizona State     Pac-12            33.4    -112. 
## # ... with 123 more rows
## #
## # Edge Data: 1,137 x 5
##    from    to first_team second_team      weight
##   &amp;lt;int&amp;gt; &amp;lt;int&amp;gt; &amp;lt;chr&amp;gt;      &amp;lt;chr&amp;gt;             &amp;lt;int&amp;gt;
## 1    10    90 Auburn     San José State        2
## 2    10    52 Auburn     Louisville            1
## 3    10    70 Auburn     New Mexico State      1
## # ... with 1,134 more rows&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Note that the output contains two sets of data, one for nodes and one for edges, and that the node data is marked as &lt;em&gt;(active)&lt;/em&gt;. The &lt;code&gt;activate&lt;/code&gt; function lets a user switch between the node and edge data within the &lt;code&gt;tbl_graph&lt;/code&gt; object and use functions like &lt;code&gt;mutate&lt;/code&gt; and &lt;code&gt;filter&lt;/code&gt; on whichever set is active.&lt;/p&gt;
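&lt;p&gt;As a small hypothetical example of that workflow (not used in the rest of this post), switching to the edge data to drop the one-time match-ups would look something like:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Hypothetical sketch: make the edge data active, filter it,
# then make the node data active again
g %&amp;gt;%
  activate(edges) %&amp;gt;%
  filter(weight &amp;gt; 1) %&amp;gt;%
  activate(nodes)&lt;/code&gt;&lt;/pre&gt;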
&lt;div id=&#34;visualizing-the-graph&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Visualizing the Graph&lt;/h3&gt;
&lt;p&gt;Normally, a graph can be displayed using any number of layout algorithms designed to show clustering and separation. In this case, however, my nodes are actual schools with real locations given by their latitudes and longitudes. So if I want to show them on a United States map, I need a layout that fixes the nodes at their true geographic positions. This can be done with the &lt;code&gt;create_layout&lt;/code&gt; function, which takes the graph and the &lt;em&gt;x&lt;/em&gt; and &lt;em&gt;y&lt;/em&gt; positions. Since those &lt;em&gt;x&lt;/em&gt; and &lt;em&gt;y&lt;/em&gt; positions need to be in the same order as the nodes in the graph object, I reference the graph object directly when populating &lt;em&gt;x&lt;/em&gt; and &lt;em&gt;y&lt;/em&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;lay = create_layout(g, &amp;#39;manual&amp;#39;, x= g %&amp;gt;% pull(longitude), y=g %&amp;gt;% pull(latitude))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With the layout in place I can construct the graph. The syntax for {ggraph} isn’t much different from {ggplot2}. The main difference is in the starting function, where {ggraph} takes in a graph and/or a layout. In this case, because my custom layout already contains the graph, I can just pass in the layout. Then there are some graph-specific geoms such as &lt;code&gt;geom_node_point&lt;/code&gt;, which places a point at each node, and &lt;code&gt;geom_edge_arc&lt;/code&gt;, which draws an arc for each edge with the &lt;em&gt;strength&lt;/em&gt; parameter controlling how “arc-y” to make the edge (as opposed to a straight line, which could be drawn with &lt;code&gt;geom_edge_link&lt;/code&gt;). There are also some edge-specific aesthetics like edge_alpha vs. alpha. But if you’re familiar with {ggplot2} then this syntax shouldn’t be too different. The only other piece which I had never used before was &lt;code&gt;borders(&#34;state&#34;, color = &#39;grey90&#39;)&lt;/code&gt; to draw the US state borders.&lt;/p&gt;
&lt;p&gt;While the more common games will show up as thicker and brighter lines, not everyone knows the location of every FBS college in the US. So for the match-ups that occurred in at least 8 of the 10 available years, I’ll add labels to the edges.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggraph(lay) + 
  borders(&amp;quot;state&amp;quot;, color = &amp;#39;grey90&amp;#39;) +
  geom_node_point(color = &amp;#39;grey90&amp;#39;) + 
  geom_edge_arc(strength = 0.1, 
                aes(edge_alpha = weight, 
                    edge_color = weight, 
                    edge_width = weight,
                    label = if_else(weight &amp;gt;= 8, 
                                    paste0(first_team,&amp;#39;-&amp;#39;,second_team), &amp;quot;&amp;quot;)
                ),
                vjust = -.5,
                hjust = 0,
                label_colour = &amp;#39;white&amp;#39;,
                label_size = 6) + 
  scale_edge_color_viridis(begin = .2, end = .8, option = &amp;quot;A&amp;quot;, direction = 1,
                           labels = round) + 
  scale_edge_width_continuous(range = c(.5, 1.5), guide = &amp;#39;none&amp;#39;) + 
  scale_edge_alpha_continuous(guide = &amp;#39;none&amp;#39;, range = c(0.1, 1)) + 
  labs(title = &amp;quot;NCAA FBS Non-Conference Games (2010 - 2019)&amp;quot;,
       caption = &amp;#39;**Source:** CollegeFootballData API&amp;#39;,
       edge_color = &amp;quot;# of Games Played&amp;quot;) + 
  theme(
    panel.background = element_rect(fill = &amp;#39;black&amp;#39;),
    plot.background = element_rect(fill = &amp;#39;black&amp;#39;),
    plot.caption = element_markdown(color = &amp;#39;white&amp;#39;, size = 16),
    plot.subtitle = element_textbox_simple(family = &amp;#39;roboto&amp;#39;, size = 20, 
                                           color = &amp;#39;white&amp;#39;),
    plot.title = element_markdown(hjust = .5, family = &amp;#39;roboto&amp;#39;, 
                                  color = &amp;#39;white&amp;#39;, size = 40),
    legend.position = &amp;#39;bottom&amp;#39;,
    legend.title = element_text(family = &amp;#39;roboto&amp;#39;, size = 20, color = &amp;#39;white&amp;#39;, 
                                vjust = 1),
    legend.text = element_text(family = &amp;#39;roboto&amp;#39;, size = 20, color = &amp;#39;white&amp;#39;),
    legend.background = element_rect(fill = &amp;#39;black&amp;#39;)
  )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2021/12/27/exploring-college-football-non-conference-rivalries-with-ggraph/index_files/figure-html/final_graph-1.png&#34; width=&#34;768&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;analysis&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Analysis&lt;/h2&gt;
&lt;p&gt;Besides looking cool (in my opinion), this chart shows an edge for &lt;strong&gt;every&lt;/strong&gt; non-conference game that occurred between 2010 and 2019, which is a lot of games. But to answer the question of the largest non-Conference rivalries, a few patterns arise:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;The independent schools, such as Notre Dame and BYU, are over-represented, which is not surprising since all of their games are non-conference games.&lt;/li&gt;
&lt;li&gt;Games between schools that are in-state but in different conferences (Florida vs. Florida State, Colorado vs. Colorado State, Clemson vs. South Carolina, Georgia vs. Georgia Tech).&lt;/li&gt;
&lt;li&gt;Games between schools that have functional reasons to be rivals such as the three service academies (Army, Navy, and Air Force).&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;While not terribly surprising for anyone who follows college football, this post hopefully shows how you can create a network graph from geographic coordinates and fix the layout so that it can be drawn on top of a real map.&lt;/p&gt;
&lt;p&gt;In the next post I’ll be continuing on the theme of College Football and network graphs to see what we can learn about Conference Realignment!&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>What&#39;s the Most American of American Films?  An Analysis with {gt} and {gtExtras}</title>
      <link>https://jlaw.netlify.app/2021/10/18/what-s-the-most-american-of-american-films-an-analysis-with-gt-and-gtextras/</link>
      <pubDate>Mon, 18 Oct 2021 00:00:00 +0000</pubDate>
      <guid>https://jlaw.netlify.app/2021/10/18/what-s-the-most-american-of-american-films-an-analysis-with-gt-and-gtextras/</guid>
      <description>
&lt;script src=&#34;https://jlaw.netlify.app/2021/10/18/what-s-the-most-american-of-american-films-an-analysis-with-gt-and-gtextras/index_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;p&gt;&lt;img src=&#34;tbl_small.PNG&#34; /&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;I love movies&lt;/strong&gt;. I enjoy watching them, I enjoy reading about the industry (sometimes), and as a bit of a data-nerd (exhibit a: &lt;a href=&#34;https://jlaw.netlify.app&#34;&gt;my blog&lt;/a&gt;), I enjoy learning about the outliers in the industry. One of my favorite trends to follow is Hollywood being driven more and more by the International Box Office and the impact this has on the types of movies being made. One of my favorite examples is the movie &lt;a href=&#34;https://www.boxofficemojo.com/title/tt0803096/?ref_=bo_se_r_1&#34;&gt;Warcraft&lt;/a&gt;. From a critical perspective the movie is not good, sporting a &lt;a href=&#34;https://www.rottentomatoes.com/m/warcraft&#34;&gt;Rotten Tomatoes score&lt;/a&gt; of 28% (although the audience score is 76%). However, there is a massive disparity in its box office gross, with only $47M of its $439M coming from the United States. Ultimately, this movie was a failure in the US but incredibly popular internationally.&lt;/p&gt;
&lt;p&gt;With the announcement of the &lt;a href=&#34;https://blog.rstudio.com/2021/09/30/rstudio-table-contest-2021/&#34;&gt;RStudio 2021 Table Contest&lt;/a&gt;, I wanted to identify the movies that were successful abroad but a failure in the US. But after playing with the data a bit I decided to flip the question and ask &lt;strong&gt;what is the most “American” movie&lt;/strong&gt;: that is, which movies were the most successful in the US while not performing well abroad.&lt;/p&gt;
&lt;div id=&#34;part-1-gathering-the-data&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Part 1: Gathering the Data&lt;/h1&gt;
&lt;p&gt;&lt;a href=&#34;https://www.boxofficemojo.com/chart/ww_top_lifetime_gross/?offset=0&#34;&gt;Box Office Mojo&lt;/a&gt; has a table with the Top 1000 grossing movies and their splits between Domestic and International grosses. This table should form the backbone for finding successful US movies. However, since the most “American” movie could be anywhere in the Top 1000, I’ll need to gather all 1000.&lt;/p&gt;
&lt;div id=&#34;loading-libraries&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Loading Libraries&lt;/h2&gt;
&lt;p&gt;Aside from &lt;code&gt;tidyverse&lt;/code&gt; the main package needed to extract this table will be &lt;code&gt;rvest&lt;/code&gt; which is used for tidy web scraping. The &lt;code&gt;glue&lt;/code&gt; package will be used to make string construction a bit easier and &lt;code&gt;httr&lt;/code&gt; will be used to access the &lt;a href=&#34;https://www.omdbapi.com&#34;&gt;Open Movie Database API&lt;/a&gt; to augment the initial Box Office gross data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(rvest) # Scrape Table From BoxOfficeMofo
library(tidyverse) # Data Manipulations
library(glue) # String Interpolation
library(httr) # Accessing the OMDB API&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Since the Box Office Mojo table is paginated, I’ll need a loop to get through all 1000. The starting point for the table is controlled by the offset parameter in the URL. The &lt;code&gt;map_dfr&lt;/code&gt; function from &lt;code&gt;purrr&lt;/code&gt; will make it very easy to loop through the different offset parameters and combine each run into a single data set.&lt;/p&gt;
&lt;p&gt;I’ll be feeding &lt;code&gt;map_dfr&lt;/code&gt; parameter values of 0, 200, 400, 600, and 800 iteratively and passing it into the Box Office Mojo URL. The &lt;code&gt;glue()&lt;/code&gt; function allows me to insert the offset value directly into the string through the &lt;code&gt;{}&lt;/code&gt;. In this code block each iteration:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Grabs an offset parameter (0 to 800, by 200)&lt;/li&gt;
&lt;li&gt;Passes that into an anonymous function as the parameter &lt;code&gt;x&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Runs &lt;code&gt;read_html()&lt;/code&gt; on the URL with the offset and extracts the &lt;em&gt;table&lt;/em&gt; element with &lt;code&gt;html_element()&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Extracts the information from the table with &lt;code&gt;html_table()&lt;/code&gt; into a tibble&lt;/li&gt;
&lt;li&gt;When I get to the OMDB API piece, rather than searching by title I can search directly by IMDB ID. Since Box Office Mojo is owned by IMDB, I’m going to extract the ID from the links in the table:
&lt;ul&gt;
&lt;li&gt;From the previously extracted &lt;em&gt;table&lt;/em&gt; element, extract the &lt;em&gt;&lt;a&gt;&lt;/em&gt; tags with &lt;code&gt;html_elements()&lt;/code&gt; and extract the &lt;em&gt;href&lt;/em&gt; attributes from those &lt;em&gt;&lt;a&gt;&lt;/em&gt; tags using &lt;code&gt;html_attr()&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Since &lt;code&gt;html_attr()&lt;/code&gt; returns &lt;strong&gt;all&lt;/strong&gt; the &lt;em&gt;href&lt;/em&gt; attributes as a vector, not just the IMDB IDs, I use &lt;code&gt;keep()&lt;/code&gt; from &lt;code&gt;purrr&lt;/code&gt; to retain only the elements that contain the string “tt”, as all IMDB IDs start with “tt”.&lt;/li&gt;
&lt;li&gt;Then finally, I pull the “tt” and the numeric portion out of the vector using &lt;code&gt;str_extract()&lt;/code&gt; from &lt;code&gt;stringr&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Iterate through 0 to 800 by 200 and pass as X into the function
tbl &amp;lt;- map_dfr(seq(0, 800, 200),
               function(x){
                 #Read URL
                 base &amp;lt;- glue(&amp;quot;https://www.boxofficemojo.com/chart/ww_top_lifetime_gross/?offset={x}&amp;quot;) %&amp;gt;%
                   read_html() %&amp;gt;% 
                   # Extract Table Structure
                   html_element(&amp;#39;table&amp;#39;)
                 
                   bind_cols(
                     #Get Actual Table Data
                     base %&amp;gt;% html_table(convert = F),
                     
                     #Get IMDB IDs From Links
                     imdb_id = base %&amp;gt;% 
                       html_elements(&amp;#39;a&amp;#39;) %&amp;gt;% 
                       html_attr(&amp;#39;href&amp;#39;) %&amp;gt;%
                       keep(~str_detect(.x, &amp;#39;tt&amp;#39;)) %&amp;gt;%
                       str_extract(&amp;#39;tt\\d+&amp;#39;)
                     
                   )
               })&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In order to get the &lt;code&gt;html_table()&lt;/code&gt; piece to work correctly, I needed to set &lt;code&gt;convert=F&lt;/code&gt; which tells the function not to try to turn numeric-looking values into numbers. Since everything was read in as a character, I need to do some light data cleaning using the &lt;code&gt;parse_number()&lt;/code&gt; function from &lt;code&gt;readr&lt;/code&gt; to turn characters that look like numbers into numbers.&lt;/p&gt;
&lt;p&gt;I’ll also need to define what I mean when I say a movie is the “Most American”. What I want is to find movies that did well in the US and didn’t do well abroad. But…&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;If I look at the highest percentage of Domestic Gross I’ll get movies that might not have had an International release or did not have a large US gross (and therefore might not have been successful in the US)&lt;/li&gt;
&lt;li&gt;If I look at the highest differences between US and International Gross I might find things that made a lot of money both Domestically and Internationally but just more domestically.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In order to find a balance between the two, I create a “domestic score”, which is the ratio of the percent of Worldwide Gross that was Domestic to the percent that was International (in order to favor US-centric movies), weighted by the &lt;em&gt;log2&lt;/em&gt; of the Domestic Lifetime Gross in order to make sure that we’re finding successful movies and not just small movies that were only released in the US.&lt;/p&gt;
&lt;center&gt;
&lt;img src=&#34;equation.PNG&#34; /&gt;
&lt;/center&gt;
&lt;p&gt;Then, since I want my results to fit in a table, I don’t need all 1,000 movies, so I’ll use &lt;code&gt;arrange()&lt;/code&gt; and &lt;code&gt;head()&lt;/code&gt; to grab the Top 5 by the domestic score.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;tbl_clean &amp;lt;- tbl %&amp;gt;% 
  janitor::clean_names() %&amp;gt;% 
  mutate(
    rank = parse_number(rank),
    worldwide_lifetime_gross = parse_number(worldwide_lifetime_gross),
    domestic_lifetime_gross = parse_number(domestic_lifetime_gross),
    domestic_percent = parse_number(domestic_percent)/100,
    foreign_lifetime_gross = parse_number(foreign_lifetime_gross),
    foreign_percent = parse_number(foreign_percent)/100,
    year = parse_number(year),
    # Developing a way to get the highest domestic percentages that also did well domestically
    domestic_score = (domestic_percent / foreign_percent)*log2(domestic_lifetime_gross)
  ) %&amp;gt;%
  arrange(-domestic_score) %&amp;gt;%
  # Keep the Top 5 as candidates for the API
  head(5)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To make this table a little more fun, there are a couple of elements that I’d like to bring in from the &lt;a href=&#34;https://www.omdbapi.com&#34;&gt;Open Movie Database&lt;/a&gt; such as the Rotten Tomatoes score, release date, awards, and the URL for the movie’s poster. In order to use the API you first need to register for an API key. I’ve stored mine in my &lt;em&gt;.Renviron&lt;/em&gt; file so I can place it into the URL with glue.&lt;/p&gt;
&lt;p&gt;To use the API I can search by the IMDB ID I gathered above, which is passed as the &lt;em&gt;i=&lt;/em&gt; parameter of the URL given to the &lt;code&gt;GET()&lt;/code&gt; function from the &lt;code&gt;httr&lt;/code&gt; package. The IDs for the 5 movies from above are passed in using the &lt;code&gt;map_dfr()&lt;/code&gt; function. The anonymous function takes in an IMDB ID and returns a tibble that contains the extra information that I wanted for the table.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;###Use OMDB Data for the Country Filters and Poster Data
omdb_data &amp;lt;- map_dfr(tbl_clean$imdb_id,
                      function(id){
                        omdb_resp &amp;lt;- GET(URLencode(glue(&amp;quot;https://www.omdbapi.com/?apikey={Sys.getenv(&amp;#39;OMDB_API_KEY&amp;#39;)}&amp;amp;i={id}&amp;amp;type=movie&amp;amp;r=json&amp;quot;)))
                        if(content(omdb_resp)$Response == &amp;quot;True&amp;quot;){
                          return(
                            content(omdb_resp, as = &amp;#39;parsed&amp;#39;) %&amp;gt;% 
                              tibble(
                                imdb_id = id,
                                api_title = .$Title,
                                release_date = .$Released,
                                runtime = .$Runtime,
                                language = .$Language,
                                country = .$Country,
                                awards = .$Awards,
                                poster_url = .$Poster,
                                ratings_source = ifelse(length(.$Ratings) &amp;gt; 0,
                                                        .$Ratings[[2]]$Source,
                                                        &amp;quot;missing&amp;quot;),
                                rating = ifelse(length(.$Ratings) &amp;gt; 0,
                                                .$Ratings[[2]]$Value,
                                                &amp;quot;-99&amp;quot;)
                              ) %&amp;gt;% select(-.) %&amp;gt;% distinct() 
                          )
                        }
                      })&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The raw JSON returned from the API looks like:
&lt;img src=&#34;json_output.PNG&#34; /&gt;&lt;/p&gt;
&lt;p&gt;and output of the OMDB data table looks like:&lt;/p&gt;
&lt;table&gt;
&lt;colgroup&gt;
&lt;col width=&#34;13%&#34; /&gt;
&lt;col width=&#34;86%&#34; /&gt;
&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;center&#34;&gt;field&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;imdb_id&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;tt0878804&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;api_title&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;The Blind Side&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;release_date&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;20 Nov 2009&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;runtime&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;129 min&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;language&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;English&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;country&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;United States&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;awards&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;Won 1 Oscar. 9 wins &amp;amp; 30 nominations total&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;poster_url&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;&lt;a href=&#34;https://m.media-amazon.com/images/M/MV5BMjEzOTE3ODM3OF5BMl5BanBnXkFtZTcwMzYyODI4Mg@@._V1_SX300.jpg&#34; class=&#34;uri&#34;&gt;https://m.media-amazon.com/images/M/MV5BMjEzOTE3ODM3OF5BMl5BanBnXkFtZTcwMzYyODI4Mg@@._V1_SX300.jpg&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;ratings_source&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;Rotten Tomatoes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;rating&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;66%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;With the Box Office data and the OMDB data in separate data sets, I can join them through the common IMDB id. Finally, I’ll keep only movies whose country listing includes the United States (can’t be American if not at least partially made in the good ol’ USA) and I’ll extract the number of Oscars won out of the awards string to be used later.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#Combine All Data
combine_dt &amp;lt;- tbl_clean %&amp;gt;% 
  inner_join(omdb_data, by = &amp;quot;imdb_id&amp;quot;) %&amp;gt;%
  #Keep US Movies
  filter(str_detect(country, &amp;quot;United States&amp;quot;)) %&amp;gt;%
  extract(awards, &amp;quot;num_oscars&amp;quot;, &amp;quot;Won (\\d+) Oscar&amp;quot;, remove = F, convert = T) %&amp;gt;%
  replace_na(list(num_oscars = 0))&lt;/code&gt;&lt;/pre&gt;
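&lt;p&gt;As a quick aside, here’s a toy illustration (made-up strings, not part of the actual pipeline) of how that &lt;code&gt;extract()&lt;/code&gt; and &lt;code&gt;replace_na()&lt;/code&gt; combination behaves:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#Toy data to show the Oscar-count regex
toy &amp;lt;- tibble::tibble(
  awards = c(&amp;quot;Won 1 Oscar. 9 wins &amp;amp; 30 nominations total&amp;quot;, &amp;quot;3 wins total&amp;quot;)
)

toy %&amp;gt;%
  #Capture the digits between &amp;quot;Won &amp;quot; and &amp;quot; Oscar&amp;quot; into num_oscars
  tidyr::extract(awards, &amp;quot;num_oscars&amp;quot;, &amp;quot;Won (\\d+) Oscar&amp;quot;, remove = F, convert = T) %&amp;gt;%
  #Strings without a match produce NA, which then becomes 0
  tidyr::replace_na(list(num_oscars = 0))
#The first row gets num_oscars = 1; the second gets 0&lt;/code&gt;&lt;/pre&gt;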
&lt;p&gt;With the data set constructed, now onto the table.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;part-2-constructing-the-table&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Part 2: Constructing the Table&lt;/h1&gt;
&lt;p&gt;The libraries used to construct the table are &lt;code&gt;gt&lt;/code&gt; and &lt;code&gt;gtExtras&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(gt)
library(gtExtras)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I plan to use images for the number of Oscars won, the Rotten Tomatoes score (fresh or rotten), and flags to show the Domestic Box Office and International Box Office, so rather than having long URLs in the table construction itself, I’ll create constant variables and refer to those in the code:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ROTTEN_URL = &amp;#39;https://www.rottentomatoes.com/assets/pizza-pie/images/icons/tomatometer/tomatometer-rotten.f1ef4f02ce3.svg&amp;#39;
FRESH_URL = &amp;#39;https://www.rottentomatoes.com/assets/pizza-pie/images/icons/tomatometer/tomatometer-fresh.149b5e8adc3.svg&amp;#39;
OSCAR_URL = &amp;#39;https://upload.wikimedia.org/wikipedia/en/7/7f/Academy_Award_trophy.png&amp;#39;
US_FLAG_URL = &amp;#39;https://upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/188px-Flag_of_the_United_States.svg.png&amp;#39;
WORLD_FLAG_URL = &amp;#39;https://upload.wikimedia.org/wikipedia/commons/thumb/3/3b/EarthFlag1.svg/525px-EarthFlag1.svg.png&amp;#39;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Since &lt;code&gt;gt&lt;/code&gt; has a lot of syntax, I’ll combine a bunch of steps together rather than showing each individual change. But the start of the table is just the &lt;code&gt;gt()&lt;/code&gt; function.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;p &amp;lt;- combine_dt %&amp;gt;%
  select(poster_url, title, release_date, domestic_lifetime_gross, foreign_lifetime_gross,
         domestic_percent, worldwide_lifetime_gross, num_oscars, rating) %&amp;gt;%
  gt() &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;iframe src=&#34;p1.html&#34; scrolling=&#34;no&#34; seamless=&#34;seamless&#34; frameBorder=&#34;0&#34; height=&#34;350&#34; width=&#34;800&#34;&gt;&lt;/iframe&gt;&lt;/p&gt;
&lt;p&gt;At first this is pretty ugly as a table but fortunately &lt;code&gt;gt&lt;/code&gt; and &lt;code&gt;gtExtras&lt;/code&gt; have a lot of very convenient features to make the table very pretty very quickly. The first set of steps will be:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Turn the URL to the movie poster into the action poster with &lt;code&gt;gt_img_rows()&lt;/code&gt; from &lt;code&gt;gtExtras&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Turn the domestic percentage field to a percent format with &lt;code&gt;fmt_percent()&lt;/code&gt; from &lt;code&gt;gt&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Turn the Domestic and Foreign Box Office Gross Values to dollar in millions with &lt;code&gt;fmt_currency&lt;/code&gt; from &lt;code&gt;gt&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Turn the Worldwide Lifetime Gross into a bar plot with &lt;code&gt;gt_plt_bar()&lt;/code&gt; from &lt;code&gt;gtExtras&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;p &amp;lt;- combine_dt %&amp;gt;%
  select(poster_url, title, release_date, domestic_lifetime_gross, foreign_lifetime_gross,
         domestic_percent, worldwide_lifetime_gross, num_oscars, rating) %&amp;gt;%
  gt() %&amp;gt;%
  
  #New Code
  gt_img_rows(poster_url, height = 75) %&amp;gt;%
  fmt_percent(domestic_percent, decimals = 1) %&amp;gt;%
  fmt_currency(columns = c(&amp;quot;domestic_lifetime_gross&amp;quot;, &amp;quot;foreign_lifetime_gross&amp;quot;),
               suffixing = T, decimals = 1) %&amp;gt;%
  gt_plt_bar(worldwide_lifetime_gross, color = &amp;#39;darkgreen&amp;#39;, width = 50)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;iframe src=&#34;p2.html&#34; scrolling=&#34;no&#34; seamless=&#34;seamless&#34; frameBorder=&#34;0&#34; height=&#34;540&#34; width=&#34;800&#34;&gt;&lt;/iframe&gt;&lt;/p&gt;
&lt;p&gt;The next steps will use the &lt;code&gt;text_transform()&lt;/code&gt; function from &lt;code&gt;gt&lt;/code&gt; to turn the number of Oscars won into the literal Oscar image for each Oscar won, and for the Rotten Tomatoes score, I’ll use the “Fresh” image if the score is at or above 60% or the “Rotten” image if below 60%.&lt;/p&gt;
&lt;p&gt;In general the &lt;code&gt;text_transform()&lt;/code&gt; function takes two parameters. The first is where the function will be applied. In the first example, &lt;code&gt;locations = cells_body(rating)&lt;/code&gt; means that I will apply the function defined in &lt;code&gt;fn&lt;/code&gt; to the &lt;em&gt;rating&lt;/em&gt; column. Then for the &lt;code&gt;fn&lt;/code&gt; I’m using &lt;code&gt;glue()&lt;/code&gt; to choose the &lt;em&gt;FRESH_URL&lt;/em&gt; or &lt;em&gt;ROTTEN_URL&lt;/em&gt; based on the numeric value of the rating itself and using &lt;code&gt;web_image()&lt;/code&gt; to display the image.&lt;/p&gt;
&lt;p&gt;For the number of Oscars… I’m not 100% sure why I needed the &lt;code&gt;lapply()&lt;/code&gt; and &lt;code&gt;html()&lt;/code&gt; rendering to get the Oscar statues to repeat. I suppose it has to do with the way that data is passed around inside &lt;code&gt;text_transform()&lt;/code&gt;. However, “working” is better than perfect in this case. The function takes the &lt;code&gt;num_oscars&lt;/code&gt; field and replicates the Oscar image as many times as necessary.&lt;/p&gt;
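&lt;p&gt;My working theory: &lt;code&gt;rep()&lt;/code&gt; returns a separate &lt;code&gt;&amp;lt;img&amp;gt;&lt;/code&gt; tag string for each Oscar, so each cell needs its own &lt;code&gt;gt::html()&lt;/code&gt; wrapper, and &lt;code&gt;lapply()&lt;/code&gt; applies that one cell at a time. A minimal sketch of the per-cell step, with a hypothetical tag string standing in for &lt;code&gt;web_image()&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;num_won &amp;lt;- 3
#rep() gives a character vector with one &amp;lt;img&amp;gt; tag per Oscar won...
tags &amp;lt;- rep(&amp;#39;&amp;lt;img src=&amp;quot;oscar.png&amp;quot; height=&amp;quot;60&amp;quot;&amp;gt;&amp;#39;, num_won)
length(tags) #3
#...and gt::html() marks the result so gt renders it as HTML
#rather than escaping it into literal text&lt;/code&gt;&lt;/pre&gt;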
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;p &amp;lt;- combine_dt %&amp;gt;%
  select(poster_url, title, release_date, domestic_lifetime_gross, foreign_lifetime_gross,
         domestic_percent, worldwide_lifetime_gross, num_oscars, rating) %&amp;gt;%
  gt() %&amp;gt;%
    gt_img_rows(poster_url, height = 75) %&amp;gt;%
    fmt_percent(domestic_percent, decimals = 1) %&amp;gt;%
    fmt_currency(columns = c(&amp;quot;domestic_lifetime_gross&amp;quot;, &amp;quot;foreign_lifetime_gross&amp;quot;),
                 suffixing = T, decimals = 1) %&amp;gt;%
    gt_plt_bar(worldwide_lifetime_gross, color = &amp;#39;darkgreen&amp;#39;, width = 50) %&amp;gt;% 
  
  
    #### NEW CODE
    text_transform(
        locations = cells_body(rating),
        fn = function(rating){
          glue(&amp;#39;{web_image(img)}&amp;lt;br /&amp;gt;{rating}&amp;#39;, 
               img = if_else(parse_number(rating) &amp;lt; 60, ROTTEN_URL, FRESH_URL)
          )
        }
      ) %&amp;gt;%
      text_transform(
        locations = cells_body(num_oscars),
        fn = function(x){
          int_x &amp;lt;- as.integer(x)
          lapply(int_x, function(y){
            rep(web_image(OSCAR_URL, height=60), y) %&amp;gt;%
              gt::html()
          })
          }
      )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;iframe src=&#34;p3.html&#34; scrolling=&#34;no&#34; seamless=&#34;seamless&#34; frameBorder=&#34;0&#34; height=&#34;540&#34; width=&#34;800&#34;&gt;&lt;/iframe&gt;&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;gtExtras&lt;/code&gt; package has an awesome function called &lt;a href=&#34;https://jthomasmock.github.io/gtExtras/reference/gt_merge_stack.html&#34;&gt;gt_merge_stack()&lt;/a&gt; that will take one column and stack it on top of a second column. This is a really convenient way to condense information. Using it, I will merge the title and release date columns, placing the release date under the title.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;p &amp;lt;- combine_dt %&amp;gt;%
  select(poster_url, title, release_date, domestic_lifetime_gross, foreign_lifetime_gross,
         domestic_percent, worldwide_lifetime_gross, num_oscars, rating) %&amp;gt;%
  gt() %&amp;gt;%
    gt_img_rows(poster_url, height = 75) %&amp;gt;%
    fmt_percent(domestic_percent, decimals = 1) %&amp;gt;%
    fmt_currency(columns = c(&amp;quot;domestic_lifetime_gross&amp;quot;, &amp;quot;foreign_lifetime_gross&amp;quot;),
                 suffixing = T, decimals = 1) %&amp;gt;%
    gt_plt_bar(worldwide_lifetime_gross, color = &amp;#39;darkgreen&amp;#39;, width = 50) %&amp;gt;% 
    text_transform(
        locations = cells_body(rating),
        fn = function(rating){
          glue(&amp;#39;{web_image(img)}&amp;lt;br /&amp;gt;{rating}&amp;#39;, 
               img = if_else(parse_number(rating) &amp;lt; 60, ROTTEN_URL, FRESH_URL)
          )
        }
      ) %&amp;gt;%
      text_transform(
        locations = cells_body(num_oscars),
        fn = function(x){
          int_x &amp;lt;- as.integer(x)
          lapply(int_x, function(y){
            rep(web_image(OSCAR_URL, height=60), y) %&amp;gt;%
              gt::html()
          })
          }
      ) %&amp;gt;%
  
  
  ###NEW CODE
      gt_merge_stack(title, release_date)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;iframe src=&#34;p4.html&#34; scrolling=&#34;no&#34; seamless=&#34;seamless&#34; frameBorder=&#34;0&#34; height=&#34;540&#34; width=&#34;800&#34;&gt;&lt;/iframe&gt;
To make a valuable info-graphic I’ll need to add titles and a subtitle, and to give myself appropriate attribution, I’ll add source notes as well. To do this, I’ll use &lt;code&gt;tab_header()&lt;/code&gt; to define the title and subtitle, and &lt;code&gt;tab_source_note()&lt;/code&gt; to add the source line. Within these blocks the &lt;code&gt;html()&lt;/code&gt; and &lt;code&gt;md()&lt;/code&gt; functions allow for the use of HTML and Markdown respectively to render text.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;p &amp;lt;- combine_dt %&amp;gt;%
  select(poster_url, title, release_date, domestic_lifetime_gross, foreign_lifetime_gross,
         domestic_percent, worldwide_lifetime_gross, num_oscars, rating) %&amp;gt;%
  gt() %&amp;gt;%
    gt_img_rows(poster_url, height = 75) %&amp;gt;%
    fmt_percent(domestic_percent, decimals = 1) %&amp;gt;%
    fmt_currency(columns = c(&amp;quot;domestic_lifetime_gross&amp;quot;, &amp;quot;foreign_lifetime_gross&amp;quot;),
                 suffixing = T, decimals = 1) %&amp;gt;%
    gt_plt_bar(worldwide_lifetime_gross, color = &amp;#39;darkgreen&amp;#39;, width = 50) %&amp;gt;% 
    text_transform(
        locations = cells_body(rating),
        fn = function(rating){
          glue(&amp;#39;{web_image(img)}&amp;lt;br /&amp;gt;{rating}&amp;#39;, 
               img = if_else(parse_number(rating) &amp;lt; 60, ROTTEN_URL, FRESH_URL)
          )
        }
      ) %&amp;gt;%
    text_transform(
        locations = cells_body(num_oscars),
        fn = function(x){
          int_x &amp;lt;- as.integer(x)
          lapply(int_x, function(y){
            rep(web_image(OSCAR_URL, height=60), y) %&amp;gt;%
              gt::html()
          })
          }
        ) %&amp;gt;%
    gt_merge_stack(title, release_date) %&amp;gt;%
  
  ###NEW CODE
    tab_header(
      title = html(&amp;quot;What are the most &amp;lt;b&amp;gt;&amp;lt;span style=&amp;#39;color:#002868&amp;#39;&amp;gt;American&amp;lt;/span&amp;gt;&amp;lt;/b&amp;gt; of American Films?&amp;quot;),
      subtitle = html(&amp;quot;As measured by the share of Box Office Gross coming from the United States versus the rest of the world, movies with or about &amp;lt;b&amp;gt;Adam Sandler&amp;lt;/b&amp;gt;, &amp;lt;b&amp;gt;Football&amp;lt;/b&amp;gt;, and &amp;lt;b&amp;gt;Christmas&amp;lt;/b&amp;gt; tend to be Box Office successes in the United States but not the rest of the world.  Although, it is unclear whether it is Football or Adam Sandler that makes the movie most appealing to American tastes.&amp;quot;)
      ) %&amp;gt;%
      tab_source_note(
      md(&amp;quot;***Author:*** JLaw | ***Sources:*** [BoxOfficeMojo.com](https://www.boxofficemojo.com/chart/ww_top_lifetime_gross/?offset=0) and [Open Movie Database](https://www.omdbapi.com/)&amp;quot;)
    ) &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;iframe src=&#34;p5.html&#34; scrolling=&#34;no&#34; seamless=&#34;seamless&#34; frameBorder=&#34;0&#34; height=&#34;670&#34; width=&#34;800&#34;&gt;&lt;/iframe&gt;
Since the table can get pretty wide, it would be helpful to alternate the background colors of the rows so that it’s easy to follow the information. This can be done with &lt;code&gt;opt_row_striping()&lt;/code&gt;, which adds the striping with defaults, and the &lt;em&gt;row.striping.background_color&lt;/em&gt; option within &lt;code&gt;tab_options()&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;p &amp;lt;- combine_dt %&amp;gt;%
  select(poster_url, title, release_date, domestic_lifetime_gross, foreign_lifetime_gross,
         domestic_percent, worldwide_lifetime_gross, num_oscars, rating) %&amp;gt;%
  gt() %&amp;gt;%
    gt_img_rows(poster_url, height = 75) %&amp;gt;%
    fmt_percent(domestic_percent, decimals = 1) %&amp;gt;%
    fmt_currency(columns = c(&amp;quot;domestic_lifetime_gross&amp;quot;, &amp;quot;foreign_lifetime_gross&amp;quot;),
                 suffixing = T, decimals = 1) %&amp;gt;%
    gt_plt_bar(worldwide_lifetime_gross, color = &amp;#39;darkgreen&amp;#39;, width = 50) %&amp;gt;% 
    text_transform(
        locations = cells_body(rating),
        fn = function(rating){
          glue(&amp;#39;{web_image(img)}&amp;lt;br /&amp;gt;{rating}&amp;#39;, 
               img = if_else(parse_number(rating) &amp;lt; 60, ROTTEN_URL, FRESH_URL)
          )
        }
      ) %&amp;gt;%
    text_transform(
        locations = cells_body(num_oscars),
        fn = function(x){
          int_x &amp;lt;- as.integer(x)
          lapply(int_x, function(y){
            rep(web_image(OSCAR_URL, height=60), y) %&amp;gt;%
              gt::html()
          })
          }
        ) %&amp;gt;%
    gt_merge_stack(title, release_date) %&amp;gt;%
    tab_header(
      title = html(&amp;quot;What are the most &amp;lt;b&amp;gt;&amp;lt;span style=&amp;#39;color:#002868&amp;#39;&amp;gt;American&amp;lt;/span&amp;gt;&amp;lt;/b&amp;gt; of American Films?&amp;quot;),
      subtitle = html(&amp;quot;As measured by the share of Box Office Gross coming from the United States versus the rest of the world, movies with or about &amp;lt;b&amp;gt;Adam Sandler&amp;lt;/b&amp;gt;, &amp;lt;b&amp;gt;Football&amp;lt;/b&amp;gt;, and &amp;lt;b&amp;gt;Christmas&amp;lt;/b&amp;gt; tend to be Box Office successes in the United States but not the rest of the world.  Although, it is unclear whether it is Football or Adam Sandler that makes the movie most appealing to American tastes.&amp;quot;)
      ) %&amp;gt;%
      tab_source_note(
      md(&amp;quot;***Author:*** JLaw | ***Sources:*** [BoxOfficeMojo.com](https://www.boxofficemojo.com/chart/ww_top_lifetime_gross/?offset=0) and [Open Movie Database](https://www.omdbapi.com/)&amp;quot;)
    ) %&amp;gt;%
  
  ###NEW CODE
  opt_row_striping() %&amp;gt;%
  tab_options(row.striping.background_color = &amp;quot;#ececec&amp;quot;) &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;iframe src=&#34;p6.html&#34; scrolling=&#34;no&#34; seamless=&#34;seamless&#34; frameBorder=&#34;0&#34; height=&#34;670&#34; width=&#34;800&#34;&gt;&lt;/iframe&gt;
Now every other row has a light shade of grey.&lt;/p&gt;
&lt;p&gt;The next thing to do is to fix up the column labels. This is done with the &lt;code&gt;cols_label()&lt;/code&gt; function, which allows me to change how the variable names used for each column will be displayed. Using &lt;code&gt;glue()&lt;/code&gt;, &lt;code&gt;html()&lt;/code&gt;, &lt;code&gt;web_image()&lt;/code&gt;, and &lt;code&gt;emo::ji()&lt;/code&gt;, I can insert images into the column titles. Also, since so many columns are related to Box Office Grosses, I’ll create a column spanner with &lt;code&gt;tab_spanner()&lt;/code&gt; that goes from the &lt;em&gt;domestic_lifetime_gross&lt;/em&gt; column to &lt;em&gt;worldwide_lifetime_gross&lt;/em&gt;. Finally, since removing the label of &lt;em&gt;poster_url&lt;/em&gt; will shrink the column width, I’ll increase the width with &lt;code&gt;cols_width()&lt;/code&gt; and the &lt;code&gt;px()&lt;/code&gt; function.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;p &amp;lt;- combine_dt %&amp;gt;%
  select(poster_url, title, release_date, domestic_lifetime_gross, foreign_lifetime_gross,
         domestic_percent, worldwide_lifetime_gross, num_oscars, rating) %&amp;gt;%
  gt() %&amp;gt;%
    gt_img_rows(poster_url, height = 75) %&amp;gt;%
    fmt_percent(domestic_percent, decimals = 1) %&amp;gt;%
    fmt_currency(columns = c(&amp;quot;domestic_lifetime_gross&amp;quot;, &amp;quot;foreign_lifetime_gross&amp;quot;),
                 suffixing = T, decimals = 1) %&amp;gt;%
    gt_plt_bar(worldwide_lifetime_gross, color = &amp;#39;darkgreen&amp;#39;, width = 50) %&amp;gt;% 
    text_transform(
        locations = cells_body(rating),
        fn = function(rating){
          glue(&amp;#39;{web_image(img)}&amp;lt;br /&amp;gt;{rating}&amp;#39;, 
               img = if_else(parse_number(rating) &amp;lt; 60, ROTTEN_URL, FRESH_URL)
          )
        }
      ) %&amp;gt;%
    text_transform(
        locations = cells_body(num_oscars),
        fn = function(x){
          int_x &amp;lt;- as.integer(x)
          lapply(int_x, function(y){
            rep(web_image(OSCAR_URL, height=60), y) %&amp;gt;%
              gt::html()
          })
          }
        ) %&amp;gt;%
    gt_merge_stack(title, release_date) %&amp;gt;%
    tab_header(
      title = html(&amp;quot;What are the most &amp;lt;b&amp;gt;&amp;lt;span style=&amp;#39;color:#002868&amp;#39;&amp;gt;American&amp;lt;/span&amp;gt;&amp;lt;/b&amp;gt; of American Films?&amp;quot;),
      subtitle = html(&amp;quot;As measured by the share of Box Office Gross coming from the United States versus the rest of the world, movies with or about &amp;lt;b&amp;gt;Adam Sandler&amp;lt;/b&amp;gt;, &amp;lt;b&amp;gt;Football&amp;lt;/b&amp;gt;, and &amp;lt;b&amp;gt;Christmas&amp;lt;/b&amp;gt; tend to be Box Office successes in the United States but not the rest of the world.  Although, it is unclear whether it is Football or Adam Sandler that makes the movie most appealing to American tastes.&amp;quot;)
      ) %&amp;gt;%
      tab_source_note(
      md(&amp;quot;***Author:*** JLaw | ***Sources:*** [BoxOfficeMojo.com](https://www.boxofficemojo.com/chart/ww_top_lifetime_gross/?offset=0) and [Open Movie Database](https://www.omdbapi.com/)&amp;quot;)
    ) %&amp;gt;%
  opt_row_striping() %&amp;gt;%
  tab_options(row.striping.background_color = &amp;quot;#ececec&amp;quot;) %&amp;gt;%
  
  ### New Code
  cols_label(
      poster_url = &amp;quot;&amp;quot;,
      title = &amp;quot;Title&amp;quot;,
      domestic_lifetime_gross = html(glue(&amp;quot;{web_image(US_FLAG_URL)}United States&amp;quot;)),
      foreign_lifetime_gross = html(glue(&amp;quot;{web_image(WORLD_FLAG_URL)}Rest of World&amp;quot;)),
      domestic_percent = &amp;quot;US % of Total&amp;quot;,
      worldwide_lifetime_gross = glue(&amp;quot;{emo::ji(&amp;#39;dollar&amp;#39;)}Total{emo::ji(&amp;#39;dollar&amp;#39;)}&amp;quot;),
      num_oscars = &amp;quot;# Oscars won&amp;quot;,
      rating = &amp;quot;Rotten Tomatoes Score&amp;quot;
    ) %&amp;gt;%
  tab_spanner(label = &amp;quot;Box Office Gross&amp;quot;, columns = domestic_lifetime_gross:worldwide_lifetime_gross) %&amp;gt;%
  cols_width(
      poster_url ~ px(75)
    )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;iframe src=&#34;p7.html&#34; scrolling=&#34;no&#34; seamless=&#34;seamless&#34; frameBorder=&#34;0&#34; height=&#34;770&#34; width=&#34;800&#34;&gt;&lt;/iframe&gt;
Now this has come a long way from the first image, but there’s still a lot of cleaning up to be done with the various &lt;code&gt;tab_style()&lt;/code&gt; calls. The &lt;code&gt;tab_style()&lt;/code&gt; function takes two arguments: the &lt;em&gt;style&lt;/em&gt;, which is how things will look, and the &lt;em&gt;location&lt;/em&gt;, which is where the styling will be applied. For the style I’ll be using the &lt;code&gt;cell_text()&lt;/code&gt; helper to alter the size, weight (bolding), transform (to turn to all uppercase), alignment, and font (using the &lt;code&gt;google_font()&lt;/code&gt; helper).&lt;/p&gt;
&lt;p&gt;For the locations, there are helpers for each part of the table. There is &lt;code&gt;cells_body()&lt;/code&gt; for the cell text, &lt;code&gt;cells_column_labels()&lt;/code&gt; for the column headers, &lt;code&gt;cells_title()&lt;/code&gt;, which can take a “title” or “subtitle” option for those elements, and &lt;code&gt;cells_column_spanners()&lt;/code&gt; for the column spanners I created in the prior step. Within locations, you can further specify which columns the formatting will apply to. While it defaults to &lt;code&gt;everything()&lt;/code&gt;, the columns can be entered as if they’re part of a &lt;em&gt;select&lt;/em&gt; statement in &lt;code&gt;dplyr&lt;/code&gt;. Finally, if wanting to include multiple locations (or multiple styles) in the same code block, the various helpers can be wrapped in a &lt;code&gt;list()&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;For the formatting, I’ll:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Change the font, size, alignment, and make everything upper-case for the &lt;strong&gt;title&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Change the font, size, and alignment for the &lt;strong&gt;subtitle&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Change the font, size, and make everything upper-case and bold for the &lt;strong&gt;column headers&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Make all of the &lt;strong&gt;column headers&lt;/strong&gt; center aligned except for the &lt;em&gt;title&lt;/em&gt; column.&lt;/li&gt;
&lt;li&gt;Change the font and center-align all of the &lt;strong&gt;cells&lt;/strong&gt; except for the &lt;em&gt;title&lt;/em&gt; column.&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;p &amp;lt;- combine_dt %&amp;gt;%
  select(poster_url, title, release_date, domestic_lifetime_gross, foreign_lifetime_gross,
         domestic_percent, worldwide_lifetime_gross, num_oscars, rating) %&amp;gt;%
  gt() %&amp;gt;%
    gt_img_rows(poster_url, height = 75) %&amp;gt;%
    fmt_percent(domestic_percent, decimals = 1) %&amp;gt;%
    fmt_currency(columns = c(&amp;quot;domestic_lifetime_gross&amp;quot;, &amp;quot;foreign_lifetime_gross&amp;quot;),
                 suffixing = T, decimals = 1) %&amp;gt;%
    gt_plt_bar(worldwide_lifetime_gross, color = &amp;#39;darkgreen&amp;#39;, width = 50) %&amp;gt;% 
    text_transform(
        locations = cells_body(rating),
        fn = function(rating){
          glue(&amp;#39;{web_image(img)}&amp;lt;br /&amp;gt;{rating}&amp;#39;, 
               img = if_else(parse_number(rating) &amp;lt; 60, ROTTEN_URL, FRESH_URL)
          )
        }
      ) %&amp;gt;%
    text_transform(
        locations = cells_body(num_oscars),
        fn = function(x){
          int_x &amp;lt;- as.integer(x)
          lapply(int_x, function(y){
            rep(web_image(OSCAR_URL, height=60), y) %&amp;gt;%
              gt::html()
          })
          }
        ) %&amp;gt;%
    gt_merge_stack(title, release_date) %&amp;gt;%
    tab_header(
      title = html(&amp;quot;What are the most &amp;lt;b&amp;gt;&amp;lt;span style=&amp;#39;color:#002868&amp;#39;&amp;gt;American&amp;lt;/span&amp;gt;&amp;lt;/b&amp;gt; of American Films?&amp;quot;),
      subtitle = html(&amp;quot;As measured by the share of Box Office Gross coming from the United States versus the rest of the world, movies with or about &amp;lt;b&amp;gt;Adam Sandler&amp;lt;/b&amp;gt;, &amp;lt;b&amp;gt;Football&amp;lt;/b&amp;gt;, and &amp;lt;b&amp;gt;Christmas&amp;lt;/b&amp;gt; tend to be Box Office successes in the United States but not the rest of the world.  Although, it is unclear whether it is Football or Adam Sandler that makes the movie most appealing to American tastes.&amp;quot;)
      ) %&amp;gt;%
      tab_source_note(
      md(&amp;quot;***Author:*** JLaw | ***Sources:*** [BoxOfficeMojo.com](https://www.boxofficemojo.com/chart/ww_top_lifetime_gross/?offset=0) and [Open Movie Database](https://www.omdbapi.com/)&amp;quot;)
    ) %&amp;gt;%
  opt_row_striping() %&amp;gt;%
  tab_options(row.striping.background_color = &amp;quot;#ececec&amp;quot;) %&amp;gt;%
  cols_label(
      poster_url = &amp;quot;&amp;quot;,
      title = &amp;quot;Title&amp;quot;,
      domestic_lifetime_gross = html(glue(&amp;quot;{web_image(US_FLAG_URL)}United States&amp;quot;)),
      foreign_lifetime_gross = html(glue(&amp;quot;{web_image(WORLD_FLAG_URL)}Rest of World&amp;quot;)),
      domestic_percent = &amp;quot;US % of Total&amp;quot;,
      worldwide_lifetime_gross = glue(&amp;quot;{emo::ji(&amp;#39;dollar&amp;#39;)}Total{emo::ji(&amp;#39;dollar&amp;#39;)}&amp;quot;),
      num_oscars = &amp;quot;# Oscars won&amp;quot;,
      rating = &amp;quot;Rotten Tomatoes Score&amp;quot;
    ) %&amp;gt;%
  tab_spanner(label = &amp;quot;Box Office Gross&amp;quot;, columns = domestic_lifetime_gross:worldwide_lifetime_gross) %&amp;gt;%
  cols_width(
      poster_url ~ px(75)
    ) %&amp;gt;%
  
  ## New Code
  tab_style(
      style = cell_text(
        size = &amp;quot;x-large&amp;quot;,
        font = google_font(&amp;#39;Josefin Sans&amp;#39;),
        align = &amp;#39;left&amp;#39;,
        transform = &amp;#39;uppercase&amp;#39;
      ),
      location = cells_title(&amp;quot;title&amp;quot;)
    ) %&amp;gt;%
  tab_style(
      style = cell_text(
        size = &amp;quot;medium&amp;quot;,
        font = google_font(&amp;#39;Inter&amp;#39;),
        align = &amp;#39;left&amp;#39;
      ),
      location = cells_title(&amp;quot;subtitle&amp;quot;)
    ) %&amp;gt;%
  tab_style(
      style = cell_text(
        size = &amp;#39;large&amp;#39;,
        weight = &amp;#39;bold&amp;#39;,
        transform = &amp;#39;uppercase&amp;#39;,
        font = google_font(&amp;#39;Bebas Neue&amp;#39;)
      ),
      locations = list(cells_column_labels(), cells_column_spanners())
    ) %&amp;gt;%
  tab_style(
      style = cell_text(align = &amp;#39;center&amp;#39;),
      locations = cells_column_labels(-title)
    ) %&amp;gt;%
  tab_style(
      style = cell_text(font = google_font(&amp;#39;Sora&amp;#39;), align = &amp;#39;center&amp;#39;, size = &amp;#39;small&amp;#39;),
      locations = cells_body(-title)
    ) &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;iframe src=&#34;p8.html&#34; scrolling=&#34;no&#34; seamless=&#34;seamless&#34; frameBorder=&#34;0&#34; height=&#34;770&#34; width=&#34;800&#34;&gt;&lt;/iframe&gt;
And now our table looks pretty!!&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;conclusion&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Conclusion&lt;/h1&gt;
&lt;p&gt;In this blog post, I’ve defined a methodology for identifying the most “American” of US films and based on the results in the table it seems like the Most American things are Football, Adam Sandler, and Christmas.&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Finding the Eras of MTV&#39;s The Challenge Through Clustering</title>
      <link>https://jlaw.netlify.app/2021/09/15/finding-the-eras-of-mtv-s-the-challenge-through-clustering/</link>
      <pubDate>Wed, 15 Sep 2021 00:00:00 +0000</pubDate>
      <guid>https://jlaw.netlify.app/2021/09/15/finding-the-eras-of-mtv-s-the-challenge-through-clustering/</guid>
      <description>
&lt;script src=&#34;https://jlaw.netlify.app/2021/09/15/finding-the-eras-of-mtv-s-the-challenge-through-clustering/index_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;p&gt;Since 1998, &lt;a href=&#34;https://en.wikipedia.org/wiki/The_Challenge_(TV_series)&#34;&gt;MTV’s The Challenge&lt;/a&gt; (formerly the Real World/Road Rules Challenge) has graced the airwaves and is currently in Season 37. In a prior &lt;a href=&#34;https://jlaw.netlify.app/2021/03/01/exploring-wednesday-night-cable-ratings-with-ocr/&#34;&gt;post&lt;/a&gt; I mentioned that this is one of my guilty pleasure shows, so this will likely not be the last post based around &lt;a href=&#34;https://www.complex.com/pop-culture/2015/01/the-challenge-mtv-americas-fifth-sport&#34;&gt;America’s 5th professional sport&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;For casting the show, the early years revolved around having alumni from MTV’s The Real World and Road Rules compete against each other (in an odd bit of irony or misnaming, the first season, called Road Rules: All Stars, actually consisted of &lt;strong&gt;only&lt;/strong&gt; Real World alumni). Over the next 37 seasons, the series has evolved, bringing in other MTV properties such as “Are You the One?” and expanding internationally to properties like “Survivor: Turkey” and “Love Island UK”.&lt;/p&gt;
&lt;p&gt;Since the cast of characters has continuously evolved over the 37 seasons, I thought it would be interesting to see if I can algorithmically classify the eras of the show based on the cast of each season through hierarchical clustering, visualizing the results with UMAP.&lt;/p&gt;
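&lt;p&gt;In outline, the idea is to represent each season as a binary vector of its cast members and cluster seasons by how much their casts overlap. Here’s a minimal sketch with made-up data (the real analysis uses cosine similarity via &lt;code&gt;widyr&lt;/code&gt; on the scraped casts, but the shape of the approach is the same):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#Made-up long-format season/cast data
toy &amp;lt;- data.frame(
  season = c(&amp;#39;s1&amp;#39;, &amp;#39;s1&amp;#39;, &amp;#39;s2&amp;#39;, &amp;#39;s2&amp;#39;, &amp;#39;s3&amp;#39;),
  cast   = c(&amp;#39;a&amp;#39;, &amp;#39;b&amp;#39;, &amp;#39;a&amp;#39;, &amp;#39;c&amp;#39;, &amp;#39;d&amp;#39;)
)

#Season x cast-member indicator matrix
m &amp;lt;- as.matrix(table(toy$season, toy$cast))

#Hierarchical clustering on a binary (Jaccard-style) distance
hc &amp;lt;- hclust(dist(m, method = &amp;#39;binary&amp;#39;))
cutree(hc, k = 2) #s1 and s2 share a cast member, so they cluster together&lt;/code&gt;&lt;/pre&gt;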
&lt;div id=&#34;libraries&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Libraries&lt;/h2&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidygraph) # For manipulating network data sets
library(ggraph) # For visualizing network data sets
library(tidyverse) # General Data Manipulation
library(rvest) # For web scraping data from Wikipedia
library(widyr) # For calculating cosine similarity of seasons
library(umap) # For dimensionality reduction&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;getting-the-data&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Getting the Data&lt;/h2&gt;
&lt;p&gt;Since the goal is to cluster the seasons of The Challenge based on the similarity of their casts, I need to get the cast list from each of the 37 seasons. Fortunately, Wikipedia contains the casts within each season’s page. Unfortunately, I’m lazy and really don’t want to specifically hunt down the URLs for each of the 37 seasons and write individual &lt;code&gt;rvest&lt;/code&gt; code.&lt;/p&gt;
&lt;p&gt;So I’ll use the &lt;a href=&#34;https://en.wikipedia.org/wiki/The_Challenge_(TV_series)#Seasons&#34;&gt;Seasons&lt;/a&gt; table on Wikipedia as a driver file for each season’s page: I’ll use &lt;code&gt;rvest&lt;/code&gt; to extract the table by its xpath, pull out all of the anchor elements (&lt;code&gt;&amp;lt;a&amp;gt;&lt;/code&gt;), use &lt;em&gt;html_attrs()&lt;/em&gt; to extract all of the attributes into a list, and use &lt;code&gt;purrr&lt;/code&gt;’s &lt;code&gt;map_dfr&lt;/code&gt; function to combine all of the links into a single data frame. Unfortunately, there are multiple links in each row of the table (one for the title and one for the location of the season), so using &lt;code&gt;stringr&lt;/code&gt;’s &lt;em&gt;str_detect&lt;/em&gt;, I’ll keep only the rows that have the word “Challenge” in the title (or “Stars” in the case of the first season, which was just called “Road Rules: All-Stars”).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;seasons &amp;lt;- read_html(&amp;#39;https://en.wikipedia.org/wiki/The_Challenge_(TV_series)&amp;#39;) %&amp;gt;%
  html_element(xpath = &amp;#39;/html/body/div[3]/div[3]/div[5]/div[1]/table[2]&amp;#39;) %&amp;gt;% 
  html_elements(&amp;#39;a&amp;#39;) %&amp;gt;% 
  html_attrs() %&amp;gt;% 
  map_dfr(bind_rows) %&amp;gt;% 
  filter(str_detect(title, &amp;#39;Challenge|Stars&amp;#39;)) %&amp;gt;%
  select(-class)&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;colgroup&gt;
&lt;col width=&#34;53%&#34; /&gt;
&lt;col width=&#34;46%&#34; /&gt;
&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;href&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;title&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;/wiki/Road_Rules:_All_Stars&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Road Rules: All Stars&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;/wiki/Real_World/Road_Rules_Challenge_(season)&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Real World/Road Rules Challenge (season)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;/wiki/Real_World/Road_Rules_Challenge_2000&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Real World/Road Rules Challenge 2000&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The dataset now has the Wikipedia link for each season in the &lt;em&gt;href&lt;/em&gt; column and a more human-readable title in the &lt;em&gt;title&lt;/em&gt; column.&lt;/p&gt;
&lt;p&gt;The next problem to tackle is iterating through each season to extract the cast table. The issue here is that the cast table is not uniform across the seasons’ pages and does not always appear in the same position. So in the end I did have to look at all 37 pages to determine which tables, and which columns within those tables, to extract.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;## Set up which tables and columns to extract from Wikipedia
seasons &amp;lt;- seasons %&amp;gt;%
  mutate(
    season_num = row_number(), #Define Season Identifier
    # Set Table Numbers On Page To Extract
    table_num = case_when(
      season_num %in% c(1, 12, 16, 19) ~ &amp;#39;2&amp;#39;,
      season_num %in% c(27) ~ &amp;#39;3&amp;#39;,
      season_num %in% c(2, 4, 5, 6, 9, 11) ~ &amp;#39;4,5&amp;#39;,
      TRUE ~ &amp;#39;3, 4&amp;#39;
    ),
    # Set Column Numbers to Extract From Each Table
    keep_cols = case_when(
      season_num %in% c(5) ~ &amp;#39;1, 2&amp;#39;,
      season_num %in% c(12, 19, 27) ~ &amp;#39;1, 3&amp;#39;,
      TRUE ~ &amp;#39;1&amp;#39;
    )
  )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In the default case, the two tables to extract were the 3rd and 4th tables on the page and I only needed the first column.&lt;/p&gt;
&lt;p&gt;With this additional metadata, I could now write a function to read the URL and extract the correct tables and table columns:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;### Write Function to Scrape the Names
get_cast &amp;lt;- function(url, table_num, keep_cols, title, order){
  
  ##Convert the String Inputs into a numeric vector
  table_num = str_split(table_num, &amp;#39;,&amp;#39;) %&amp;gt;% unlist() %&amp;gt;% as.numeric()
  keep_cols = str_split(keep_cols, &amp;#39;,&amp;#39;) %&amp;gt;% unlist() %&amp;gt;% as.numeric()
  
  #Read Page and Filter Tables
  ct &amp;lt;- read_html(paste0(&amp;#39;https://en.wikipedia.org/&amp;#39;,url)) %&amp;gt;%
    # Extract Table Tags
    html_elements(&amp;#39;table&amp;#39;) %&amp;gt;%
    # Keep only the specified tables
    .[table_num] %&amp;gt;% 
    # Extract the information from the tables into a list (if more than 1)
    html_table() %&amp;gt;% 
    # Use MAP to keep only the selected columns from each table
    map(~select(.x, all_of(keep_cols)))
  
  # If keeping one column, rename it to Name;
  # if keeping multiple columns, gather them into a single column called Name
  if(length(keep_cols) == 1){
    ct &amp;lt;- ct %&amp;gt;% 
      map(~rename(.x, &amp;quot;Name&amp;quot; = 1)) 
  }else if(length(keep_cols) &amp;gt; 1){
    ct &amp;lt;- ct %&amp;gt;%
      map(~gather(.x, &amp;quot;Field&amp;quot;, &amp;quot;Name&amp;quot;)) %&amp;gt;% 
      map(~select(.x, 2)) 
  }
  
  # Combine all the tables into one and append the title and order columns
  ct &amp;lt;- ct %&amp;gt;% map_dfr(bind_rows) %&amp;gt;% mutate(title = title, order = order)

  return(ct)
  
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The five parameters passed to this function are all contained in the driver file created above. To iterate through the seasons and create a data set of the cast members I’ll use the &lt;code&gt;pmap_dfr()&lt;/code&gt; function from &lt;code&gt;purrr&lt;/code&gt;, which can provide more than two inputs to a function (unlike map and map2) and combines all the outputs into a single data frame by binding the rows (the dfr part of the function name).&lt;/p&gt;
&lt;p&gt;In pmap, the first argument is a list of the parameter vectors to pass to the function and the second argument is the function to be called. The elements of the list can then be referred to positionally: ..1 is the href parameter (the first element of the list), ..2 the table_num parameter, and so on.&lt;/p&gt;
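&lt;p&gt;As a toy sketch of that positional notation (not part of the original analysis), each ..N picks the Nth vector from the input list:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(purrr)
# ..1 is an element of the first vector, ..2 an element of the second
pmap_chr(list(c(&amp;quot;a&amp;quot;, &amp;quot;b&amp;quot;), c(1, 2)), ~paste0(..1, ..2))
# [1] &amp;quot;a1&amp;quot; &amp;quot;b2&amp;quot;&lt;/code&gt;&lt;/pre&gt;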
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;###Create Dataset with all names
all_cast &amp;lt;- pmap_dfr(list(seasons$href, 
                          seasons$table_num, 
                          seasons$keep_cols,
                          seasons$title,
                          seasons$season_num), 
                     ~get_cast(..1, ..2, ..3, ..4, ..5))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The resulting table now looks like:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;Name&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;title&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;order&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Cynthia Roberts&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Road Rules: All Stars&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Eric Nies&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Road Rules: All Stars&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Jon Brennan&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Road Rules: All Stars&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;div id=&#34;cleaning-the-data-and-final-preparations&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Cleaning the Data and Final Preparations&lt;/h3&gt;
&lt;p&gt;The data on Wikipedia is &lt;em&gt;fairly&lt;/em&gt; clean, but there are places where automation is no substitute for domain knowledge. The cast tables refer to what people were called in that specific season, but as cast members have returned for multiple seasons, what they are called has sometimes changed. For example, the current host of NBC’s First Look, Johnny “Bananas” Devenanzio, began his time on The Challenge as “John Devenanzio”, then “Johnny Devenanzio”, and finally “Johnny ‘Bananas’ Devenanzio” for his most recent 12 seasons. Some female cast members married, so “Tori Hall” became “Tori Fiorenza”. And in the most subtle of changes, “Nany González” appears both with and without the accent over the “a” (huge shoutout to the &lt;a href=&#34;https://www.reddit.com/r/MtvChallenge/comments/pj8by1/oc_visualizing_the_most_frequently_appearing/&#34;&gt;r/MtvChallenge&lt;/a&gt; sub-Reddit for calling me out when that caused Nany to not appear in my data visualization).&lt;/p&gt;
&lt;p&gt;Other changes are less interesting, such as removing footnotes from people’s names, fixing the fact that in the Seasons table both Season 5 and Season 23 are called “Battle of the Seasons”, and appending the seasons’ names onto the cast table.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;###Clean up the Cast Member Columns and Clean up The Title Columns
###Domain Knowledge that these are all the same people (especially the married ones)
all_cast_clean &amp;lt;- all_cast %&amp;gt;%
  mutate(
    #Remove Footnotes
    Name = str_remove_all(Name, &amp;#39;\\[.*\\]&amp;#39;),
    #Fix the Various References to Johnny Bananas
    Name = if_else(str_detect(Name, &amp;#39;John.* Devenanzio&amp;#39;),&amp;#39;Johnny &amp;quot;Bananas&amp;quot; Devenanzio&amp;#39;,Name),
    Name = if_else(str_detect(Name, &amp;#39;Jordan.*Wiseley&amp;#39;), &amp;#39;Jordan Wiseley&amp;#39;, Name),
    Name = if_else(str_detect(Name, &amp;#39;Natalie.*Duran&amp;#39;), &amp;#39;Natalie &amp;quot;Ninja&amp;quot; Duran&amp;#39;, Name),
    Name = if_else(str_detect(Name, &amp;#39;Theresa Gonz&amp;#39;), &amp;#39;Theresa Jones&amp;#39;, Name),
    Name = if_else(str_detect(Name, &amp;#39;Tori Fiorenza&amp;#39;), &amp;#39;Tori Hall&amp;#39;, Name),
    Name = if_else(str_detect(Name, &amp;#39;Nany&amp;#39;), &amp;#39;Nany González&amp;#39;, Name)
  )

##Season Table
seasons_table &amp;lt;- read_html(&amp;#39;https://en.wikipedia.org/wiki/The_Challenge_(TV_series)&amp;#39;) %&amp;gt;%
  html_element(xpath = &amp;#39;/html/body/div[3]/div[3]/div[5]/div[1]/table[2]&amp;#39;) %&amp;gt;%
  html_table() %&amp;gt;%
  janitor::clean_names() %&amp;gt;%
  mutate(year = str_extract(original_release, &amp;#39;\\d{4}&amp;#39;) %&amp;gt;% as.integer()) %&amp;gt;%
  select(order, short_title = title, year) %&amp;gt;%
  distinct() %&amp;gt;%
  mutate(short_title = if_else(order == 23, &amp;#39;Battle of the Seasons 2&amp;#39;, short_title))


all_cast_info &amp;lt;- all_cast_clean %&amp;gt;%
  left_join(seasons_table, by = &amp;quot;order&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;exploring-the-data&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Exploring the Data&lt;/h2&gt;
&lt;p&gt;Before getting into the real meat of the analysis, I’m going to do some quick EDA to answer some potentially interesting questions about The Challenge cast.&lt;/p&gt;
&lt;div id=&#34;who-has-been-on-the-most-challenges&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Who Has Been on the Most Challenges?&lt;/h3&gt;
&lt;p&gt;A quick question might be which challenger has been on the most seasons. This can be answered pretty quickly with the &lt;code&gt;count()&lt;/code&gt; function from &lt;code&gt;dplyr&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;all_cast_info %&amp;gt;%
  count(Name, sort = T) %&amp;gt;%
  head(7) %&amp;gt;%
  ggplot(aes(x = fct_reorder(Name, n), y = n, fill = Name)) + 
    geom_col() + 
    geom_text(aes(label = n, hjust = 0)) +
    ghibli::scale_fill_ghibli_d(name = &amp;#39;LaputaMedium&amp;#39;, guide = &amp;#39;none&amp;#39;) + 
    scale_y_continuous(expand = expansion(mult = c(0, .1))) + 
    coord_flip() + 
    labs(x = &amp;quot;Challenger&amp;quot;, y = &amp;quot;# of Appearances&amp;quot;, 
         title = &amp;quot;Who Has Been on the Most Seasons of the Challenge?&amp;quot;) + 
    cowplot::theme_cowplot() + 
    theme(
      plot.title.position = &amp;#39;plot&amp;#39;
    )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2021/09/15/finding-the-eras-of-mtv-s-the-challenge-through-clustering/index_files/figure-html/most_challenges-1.png&#34; width=&#34;672&#34; /&gt;
As any Challenge fan knows, Johnny Bananas has been on the most seasons with 20 and CT just behind at 19.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;looking-at-consecutive-season-behavior&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Looking at Consecutive Season Behavior&lt;/h2&gt;
&lt;p&gt;An interesting visualization we can do is to explore how frequently Challengers are on consecutive seasons using a series of dumbbell plots. In this plot there will be a point for each endpoint of a stretch of consecutive seasons and they will be connected by a line.&lt;/p&gt;
&lt;p&gt;Check out the post on the &lt;a href=&#34;https://www.reddit.com/r/MtvChallenge/comments/pj8by1/oc_visualizing_the_most_frequently_appearing/&#34;&gt;r/MtvChallenge&lt;/a&gt; sub-Reddit for a nicer (although slightly wrong) version of this plot.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;all_cast_info %&amp;gt;% 
    ## Add the number of seasons for each challenger as a new column
    add_count(Name, name = &amp;#39;num_seasons&amp;#39;) %&amp;gt;%
    # Filter to only those who have been on 10+ seasons
    filter(num_seasons &amp;gt;= 10) %&amp;gt;%
    # For each challenger define consecutive segments based on when the prior
    # season number is more than 1 or missing (for the first observation)
    group_by(Name) %&amp;gt;%
    arrange(order, .by_group = T) %&amp;gt;%
    mutate(
      diff = order - lag(order),
      new_segment = if_else(is.na(diff) | diff &amp;gt; 1, 1, 0),
      run = cumsum(new_segment)
    ) %&amp;gt;% 
    # Define the endpoints of each segment
    group_by(Name, run) %&amp;gt;% 
    summarize(start = min(order),
              end = max(order),
              num_seasons = max(num_seasons)) %&amp;gt;%
  ggplot(aes(x = fct_rev(fct_reorder(Name, start, min)), 
             color = Name, fill = Name)) + 
    geom_linerange(aes(ymin = start, ymax = end), size = 1) + 
    geom_point(aes(y = start), size = 2) + 
    geom_point(aes(y = end), size = 2) + 
    scale_fill_discrete(guide = &amp;#39;none&amp;#39;) + 
    scale_color_discrete(guide = &amp;#39;none&amp;#39;) +
    scale_y_continuous(breaks = seq(1, 37, 2)) + 
    labs(x = &amp;quot;&amp;quot;, y = &amp;quot;Seasons&amp;quot;, title = &amp;quot;How Often Were Challengers On The Show?&amp;quot;,
         subtitle = &amp;quot;*Only Challengers Appearing On At Least 10 Seasons Ordered By First Appearance*&amp;quot;,
         caption = &amp;quot;*Source:* Wikipedia | **Author:** Jlaw&amp;quot;) + 
    coord_flip() + 
    cowplot::theme_cowplot() + 
    theme(
      panel.grid.major.y = element_line(size = .5, color = &amp;#39;#DDDDDD&amp;#39;),
      plot.subtitle = ggtext::element_markdown(),
      plot.title.position = &amp;#39;plot&amp;#39;,
      plot.caption = ggtext::element_markdown(),
      axis.ticks.y = element_blank()
    )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2021/09/15/finding-the-eras-of-mtv-s-the-challenge-through-clustering/index_files/figure-html/dumbbell-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;which-seasons-had-the-highest-percentage-of-one-and-done-challengers&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Which Seasons Had the Highest Percentage of “one and done” Challengers?&lt;/h2&gt;
&lt;p&gt;Sometimes the show will bring a cast member on, it doesn’t work out, and you never see them again. I can look at which seasons had the largest number of these cast members who were never seen again. Since Season 37 is still airing and we don’t know who will or won’t come back, I’ve excluded that season:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;all_cast_info %&amp;gt;% 
  add_count(Name, name = &amp;quot;num_seasons&amp;quot;) %&amp;gt;%
  filter(num_seasons == 1 &amp;amp; order != 37) %&amp;gt;%
  count(short_title, year) %&amp;gt;% 
  ggplot(aes(x = fct_reorder(short_title, n), y = n, fill = year)) + 
    geom_col() + 
    geom_text(aes(label = n), hjust = 0) +
    labs(x = &amp;quot;Season Title&amp;quot;, y = &amp;quot;Number of &amp;#39;one and done&amp;#39; Challengers&amp;quot;,
         title = &amp;quot;What Season Had the Most &amp;#39;One and Done&amp;#39; Challengers&amp;quot;,
         subtitle = &amp;quot;Lighter Colors are Later Seasons&amp;quot;,
         fill = &amp;quot;Year Aired&amp;quot;) +
    scale_y_continuous(expand = expansion(mult = c(0, .1))) + 
    scale_fill_viridis_c() + 
    guides (fill = guide_colourbar(barwidth = 15, barheight = 0.5)) + 
    expand_limits(x = 0, y = 0) + 
    coord_flip() + 
    cowplot::theme_cowplot() + 
    theme(
        plot.title.position = &amp;#39;plot&amp;#39;,
        legend.position = &amp;#39;bottom&amp;#39;
    )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2021/09/15/finding-the-eras-of-mtv-s-the-challenge-through-clustering/index_files/figure-html/one_and_done-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;The seasons with the largest number of one and dones tended to be seasons with large influxes of new challengers due to different formats. Battle of the Seasons had a very large cast and was the first season not to have small teams. Battle of the Bloodlines was a concept where 50% of the challengers were family members who had never been on the show, and thankfully never were again.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;what-are-the-most-similar-episodes-of-the-challenge&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;What Are the Most Similar Seasons of the Challenge?&lt;/h2&gt;
&lt;p&gt;I can visualize season similarity in a network graph; however, I first need to restructure the data. Right now I have only the positive cases, but I need every person/season combination with 1/0 indicators. Then I can use the &lt;code&gt;pairwise_similarity()&lt;/code&gt; function from &lt;code&gt;widyr&lt;/code&gt; to get the cosine similarity of each pair of seasons. The &lt;code&gt;upper=F&lt;/code&gt; setting ensures there’s only one row for each combination (e.g., only A,B rather than both A,B and B,A):&lt;/p&gt;
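&lt;p&gt;For intuition (a toy sketch, not the actual cast data): on 1/0 vectors, cosine similarity is the dot product divided by the product of the vector norms, i.e., the number of shared cast members scaled by the sizes of the two casts:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;a &amp;lt;- c(1, 1, 0)  # season with challengers 1 and 2
b &amp;lt;- c(1, 0, 1)  # season with challengers 1 and 3
# one shared challenger, two challengers per season
sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
# [1] 0.5&lt;/code&gt;&lt;/pre&gt;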
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;similarity &amp;lt;- all_cast_info %&amp;gt;%
  #Create an indicator for all the positive cases
  transmute(order, short_title, Name, ind = 1) %&amp;gt;%
  # Make a wide data set and fill in 0s for all the negative cases
  pivot_wider(
    names_from = &amp;#39;Name&amp;#39;,
    values_from = &amp;#39;ind&amp;#39;,
    values_fill = 0
  ) %&amp;gt;% 
  # Bring the table back to long format with 1/0s
  pivot_longer(
    cols = c(-order, -short_title),
    names_to = &amp;quot;Name&amp;quot;,
    values_to = &amp;quot;ind&amp;quot;
  ) %&amp;gt;% 
  pairwise_similarity(short_title, Name, ind, upper = F, diag = F) %&amp;gt;% 
  arrange(-similarity)  %&amp;gt;%
  filter(similarity &amp;gt; .29)&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;item1&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;item2&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;similarity&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Vendettas&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Final Reckoning&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.6806139&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;War of the Worlds&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;War of the Worlds 2&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.5760221&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Invasion of the Champions&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;XXX: Dirty 30&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.5635760&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The most similar seasons in the data are Vendettas (Season 31) and Final Reckoning (Season 32) which makes sense as these were consecutive seasons that were also the last two pieces of a trilogy.&lt;/p&gt;
&lt;p&gt;The similarity threshold of 0.29 was chosen judgmentally to include as many seasons as possible without over-complicating the graph. The next step is building the network graph itself. I’m setting a seed since the layout in &lt;code&gt;ggraph&lt;/code&gt; is non-deterministic and I’d like it to be reproducible. The similarity data frame is converted to a tbl_graph object with &lt;code&gt;as_tbl_graph&lt;/code&gt;, I join in the short titles to form the labels, and then I set the edges to have alpha values (transparency) tied to similarity and use the names for the node labels.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(20210904)
as_tbl_graph(similarity) %&amp;gt;%
  left_join(seasons_table, by = c(&amp;#39;name&amp;#39; = &amp;quot;short_title&amp;quot;)) %&amp;gt;%
  ggraph(layout = &amp;#39;fr&amp;#39;) + 
    geom_edge_link(aes(alpha = similarity), width = 1.5) + 
    geom_node_label(aes(label = name, fill = order), size = 5) + 
    scale_fill_viridis_c(begin = .3) + 
    scale_shape_discrete(guide = &amp;#39;none&amp;#39;) + 
    scale_x_continuous(expand = expansion(add = c(.6, .8))) +
    labs(title = &amp;quot;Network of Challenge Seasons&amp;quot;,
         subtitle = &amp;quot;Edges measured by Cosine Similarity of Cast&amp;quot;,
         caption = &amp;quot;All Stars and RW vs RR did not have &amp;gt;0.29 Similarity to Any Other Season&amp;quot;,
         alpha = &amp;quot;Cosine Similarity&amp;quot;,
         fill = &amp;quot;Season #&amp;quot;) + 
    theme_graph(plot_margin = margin(30, 0, 0, 30)) + 
    theme(
      legend.position = &amp;#39;bottom&amp;#39;
    )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2021/09/15/finding-the-eras-of-mtv-s-the-challenge-through-clustering/index_files/figure-html/network_graph-1.png&#34; width=&#34;1152&#34; /&gt;
Through the network graph we can see that the first two seasons aren’t connected to anything, so they don’t appear, while Seasons 3 and 5 and Seasons 4 and 6 form their own small clusters. For the rest of the structure, you can trace a path from the early seasons to the later ones.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;clustering-the-seasons-with-hierarchical-clustering&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Clustering the Seasons with Hierarchical Clustering&lt;/h2&gt;
&lt;p&gt;Now that EDA is done, it’s time to determine our eras through clustering. In order to use hierarchical clustering I need to create a distance matrix. To do so I will replicate some of the code from above, where each row is a season, each column is a challenger, and the value is 1 if they were on that season and 0 otherwise.&lt;/p&gt;
&lt;p&gt;Since this data is binary I will be using a binary distance, where a 1/1 pair is a match and any 1/0 pair is a mismatch (e.g., 0 and 0, despite being the same value, does not count as similarity). The distance is defined as the proportion of bits in which only one is on, among those where at least one is on.&lt;/p&gt;
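&lt;p&gt;As a quick illustration with toy vectors (not the actual cast data), base R’s &lt;code&gt;dist&lt;/code&gt; computes this directly:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;x &amp;lt;- c(1, 1, 0, 0)
y &amp;lt;- c(1, 0, 1, 0)
# At least one bit on in positions 1-3; exactly one on in positions 2 and 3
dist(rbind(x, y), method = &amp;#39;binary&amp;#39;)  # 2/3&lt;/code&gt;&lt;/pre&gt;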
&lt;p&gt;Then the hierarchical clustering algorithm is run with &lt;code&gt;hclust&lt;/code&gt;. There are many different agglomeration methods that can be used, including &lt;strong&gt;single&lt;/strong&gt; (where the difference between clusters is defined by their closest elements), &lt;strong&gt;complete&lt;/strong&gt; (which defines differences by the farthest-apart elements), &lt;strong&gt;average&lt;/strong&gt; (which uses the average distance between all pairs of points), and &lt;strong&gt;Ward&lt;/strong&gt; (which minimizes the increase in within-cluster sum of squares). For more information, see this &lt;a href=&#34;https://stats.stackexchange.com/questions/195446/choosing-the-right-linkage-method-for-hierarchical-clustering/217742#217742&#34;&gt;CrossValidated&lt;/a&gt; answer.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Cast Data to Wide Format
dt &amp;lt;- all_cast_info %&amp;gt;%
  transmute(order, short_title, Name, ind = 1) %&amp;gt;%
  pivot_wider(
    names_from = &amp;#39;Name&amp;#39;,
    values_from = &amp;#39;ind&amp;#39;,
    values_fill = 0
  )

# Compute the binary distance matrix and run hierarchical clustering
dst &amp;lt;- dt %&amp;gt;%
  # Remove fields I don&amp;#39;t want part of the distance function
  select(-order, -short_title) %&amp;gt;%
  dist(method = &amp;#39;binary&amp;#39;) %&amp;gt;%
  #the agglomeration method to be used. 
  hclust(method = &amp;#39;ward.D2&amp;#39;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I can then visualize the resulting dendrogram using the &lt;code&gt;plot()&lt;/code&gt; function and supplying the short_title field I previously excluded as a label parameter. By looking at the dendrogram it seems like there are five clusters which I will highlight with the &lt;code&gt;rect.hclust&lt;/code&gt; function and specifying k=5:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;plot(dst, labels = dt$short_title, 
     main = &amp;#39;Hierarchical Clustering of Challenge Seasons&amp;#39;,
     xlab = &amp;#39;&amp;#39;)
rect.hclust(dst, k = 5)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2021/09/15/finding-the-eras-of-mtv-s-the-challenge-through-clustering/index_files/figure-html/dendrogram-1.png&#34; width=&#34;672&#34; /&gt;
Based on the dendrogram, there are five clusters:&lt;/p&gt;
&lt;table&gt;
&lt;tr&gt;
&lt;th&gt;
&lt;/th&gt;
&lt;th&gt;
Cluster #1
&lt;/th&gt;
&lt;th&gt;
Cluster #2
&lt;/th&gt;
&lt;th&gt;
Cluster #3
&lt;/th&gt;
&lt;th&gt;
Cluster #4
&lt;/th&gt;
&lt;th&gt;
Cluster #5
&lt;/th&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
Seasons:
&lt;/td&gt;
&lt;td&gt;
1 (All-Stars) to 11 (Gauntlet 2)
&lt;/td&gt;
&lt;td&gt;
12 (Fresh Meat) to 18 (The Ruins)
&lt;/td&gt;
&lt;td&gt;
19 (Fresh Meat 2) to 26 (Battle of the Exes 2)
&lt;/td&gt;
&lt;td&gt;
27 (Battle of the Bloodlines) to 32 (Final Reckoning)
&lt;/td&gt;
&lt;td&gt;
33 (War of the Worlds) to 37 (Spies, Lies, and Allies)
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
Why?
&lt;/td&gt;
&lt;td&gt;
Original seasons when challengers were only from Real World or Road Rules
&lt;/td&gt;
&lt;td&gt;
First Introduction of challengers not from prior properties
&lt;/td&gt;
&lt;td&gt;
Second injection of challengers not from prior properties
&lt;/td&gt;
&lt;td&gt;
Half of the cast were family members of prior challengers
&lt;/td&gt;
&lt;td&gt;
Introduction of a large influx of new challengers from international reality shows
&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
&lt;p&gt;So it seems like the algorithm latched on to change points where the casts became heavily made up of rookies, which makes sense since a large influx of new cast members forces dissimilarity with prior seasons.&lt;/p&gt;
&lt;p&gt;Returning to the data, I can append the cluster assignment to the original data with the &lt;code&gt;cutree&lt;/code&gt; function, providing it the number of clusters to return.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;h_clust_results &amp;lt;- dt %&amp;gt;%
  mutate(cluster = cutree(dst, k = 5))&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;dimensionality-reduction-with-umap&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Dimensionality Reduction with UMAP&lt;/h2&gt;
&lt;p&gt;The data set used for the clustering contained 37 rows representing each season of The Challenge and 360 columns representing every challenger who has ever been on the show. This type of data is a prime candidate for dimensionality reduction. Uniform Manifold Approximation and Projection (UMAP) is a technique for dimensionality reduction and visualization similar to t-SNE. The UMAP algorithm can be found in the &lt;code&gt;umap&lt;/code&gt; package.&lt;/p&gt;
&lt;p&gt;Running UMAP is pretty straightforward with the &lt;code&gt;umap()&lt;/code&gt; function and here I give it the very wide data set used for clustering. In the returned object there is an element called &lt;em&gt;layout&lt;/em&gt; which contains the compressed two dimensional space returned by UMAP. Again I’m setting a seed as the results of UMAP can be non-deterministic.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(20210904)
ump &amp;lt;- umap(dt %&amp;gt;% select(-order, -short_title))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I can then add those dimensions to the clustering results from above to see how closely the UMAP compression matches the clustering from the &lt;code&gt;hclust&lt;/code&gt; function:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;h_clust_results %&amp;gt;% 
  select(order, short_title, cluster) %&amp;gt;%
  # Add in UMAP dimensions
  mutate(
    dim1 = ump$layout[, 1],
    dim2 = ump$layout[, 2]
  ) %&amp;gt;%
  ggplot(aes(x = dim1, y = dim2, color = factor(cluster))) + 
  geom_text(aes(label = short_title)) + 
  labs(title = &amp;#39;UMAP Projection of Challenge Seasons&amp;#39;,
       subtitle = &amp;quot;Colors Represent Prior Clustering&amp;quot;) + 
  scale_color_discrete(guide = &amp;#39;none&amp;#39;) + 
  scale_x_continuous(expand = expansion(add = c(.3, .4))) + 
  cowplot::theme_cowplot() + 
  theme(
    axis.ticks = element_blank(),
    axis.text = element_blank()
  )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2021/09/15/finding-the-eras-of-mtv-s-the-challenge-through-clustering/index_files/figure-html/umap_viz-1.png&#34; width=&#34;672&#34; /&gt;
Overall, the UMAP projection captures similar information to the clustering: even though both methods were unsupervised and run independently, seasons sharing a color (the prior clusters) land very close together in the UMAP-projected space.&lt;/p&gt;
&lt;div id=&#34;predicting-new-observations-with-umap&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Predicting New Observations with UMAP&lt;/h3&gt;
&lt;p&gt;In the spring of 2021, &lt;a href=&#34;https://en.wikipedia.org/wiki/The_Challenge:_All_Stars&#34;&gt;The Challenge: All Stars&lt;/a&gt; aired on Paramount+. The series was intended to bring back fan favorites from early seasons of The Challenge (although whether the actual cast would be considered fan favorites, all-stars, or even from early seasons is debatable). An interesting final question to ask is: &lt;em&gt;what cluster would The Challenge: All Stars belong to in the UMAP space?&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;This next block of code does a lot of heavy lifting but isn’t dissimilar from the earlier parts of this post. I download the cast from Wikipedia, clean it (more marriages and nicknames), add it to the original data set to get the 0 cases, and then filter it back to just The Challenge: All Stars season.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;all_stars &amp;lt;- 
  # Take Original Data Set
  all_cast_info %&amp;gt;%
  # Add Indicators
  transmute(order, short_title, Name, ind = 1) %&amp;gt;%
  # Get the New Challenge Season
  bind_rows(
    get_cast(&amp;#39;wiki/The_Challenge:_All_Stars&amp;#39;, &amp;#39;3, 4&amp;#39;, &amp;#39;1&amp;#39;, &amp;#39;The Challenge: All Stars&amp;#39;, 99)  %&amp;gt;%
      transmute(order, short_title = title, Name, ind = 1)  %&amp;gt;%
      #Cleaning Names
      mutate(
        Name = case_when(
          Name == &amp;quot;Katie Cooley&amp;quot; ~ &amp;quot;Katie Doyle&amp;quot;,
          Name ==  &amp;#39;Eric &amp;quot;Big Easy&amp;quot; Banks&amp;#39; ~ &amp;#39;Eric Banks&amp;#39;,
          Name == &amp;#39;Teck Holmes&amp;#39; ~ &amp;#39;Tecumshea &amp;quot;Teck&amp;quot; Holmes III&amp;#39;,
          TRUE ~ Name
        )
      )
  ) %&amp;gt;% 
  # Cast to Wider
  pivot_wider(
    names_from = &amp;#39;Name&amp;#39;,
    values_from = &amp;#39;ind&amp;#39;,
    values_fill = 0
  ) %&amp;gt;% 
  # Filter back to the All Stars Season
  filter(short_title == &amp;#39;The Challenge: All Stars&amp;#39;) %&amp;gt;%
  # Removing Things that Won&amp;#39;t Be Predicted
  select(-order, -short_title)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then predicting the All Stars season in the UMAP space can be done similar to other predictions in R with the &lt;code&gt;predict&lt;/code&gt; function:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;all_stars_pred &amp;lt;- predict(ump, all_stars)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This returns a matrix with one row for the season and two columns for the UMAP x and y dimensions. It can then be visualized on top of the original UMAP projection as an annotation.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Take Original Data
h_clust_results %&amp;gt;% 
  select(order, short_title, cluster) %&amp;gt;%
  # Add in the original UMAP data
  mutate(
    dim1 = ump$layout[, 1],
    dim2 = ump$layout[, 2]
  ) %&amp;gt;%
  ggplot(aes(x = dim1, y = dim2, color = factor(cluster))) +
  #ggrepel::geom_text_repel(aes(label = short_title)) + 
  geom_text(aes(label = short_title)) +
  # Add Annotation for the Challenge All Stars Season with the predicted
  # projection.
  annotate(
    &amp;#39;label&amp;#39;,
    label = &amp;#39;The Challenge: All Stars&amp;#39;,
            x = all_stars_pred[, 1],
            y = all_stars_pred[, 2],
            color = &amp;#39;black&amp;#39;) + 
  labs(title = &amp;#39;Predicting Challenge All-Stars Onto Prior UMAP Projection&amp;#39;,
       subtitle = &amp;quot;Colors Represent Prior Clustering&amp;quot;) + 
  scale_color_discrete(guide = &amp;#39;none&amp;#39;) + 
  scale_fill_discrete(guide = &amp;#39;none&amp;#39;) +
  scale_x_continuous(expand = expansion(add = c(.3, .4))) +
  cowplot::theme_cowplot() + 
  theme(
    axis.ticks = element_blank(),
    axis.text = element_blank()
  )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2021/09/15/finding-the-eras-of-mtv-s-the-challenge-through-clustering/index_files/figure-html/viz_new_umap-1.png&#34; width=&#34;672&#34; /&gt;
It seems like Challenge All-Stars would fall within the first cluster of seasons, though it sits somewhat between that group and the “green cluster”. This could make sense, as a couple of the show’s cast members first appeared in the 23rd season (Battle of the Seasons 2).&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>$GME To The Moon: How Much of an Outlier Was Gamestop&#39;s January Rise?</title>
      <link>https://jlaw.netlify.app/2021/08/12/gme-to-the-moon-how-unexpected-was-gamestop-s-january-stock-rally/</link>
      <pubDate>Thu, 12 Aug 2021 00:00:00 +0000</pubDate>
      <guid>https://jlaw.netlify.app/2021/08/12/gme-to-the-moon-how-unexpected-was-gamestop-s-january-stock-rally/</guid>
      <description>
&lt;script src=&#34;https://jlaw.netlify.app/2021/08/12/gme-to-the-moon-how-unexpected-was-gamestop-s-january-stock-rally/index_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;div id=&#34;introduction&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;Between January 13th and January 27th, 2021, the stock price of Gamestop (&lt;a href=&#34;https://www.google.com/finance/quote/GME:NYSE&#34;&gt;GME&lt;/a&gt;) rose 10x, from $31 to $347. The rise was driven in part by the Reddit forum &lt;a href=&#34;https://www.reddit.com/r/wallstreetbets/&#34;&gt;r/wallstreetbets&lt;/a&gt;, whose users were looking to create a short squeeze and who simply “liked the stock”. The rapid rise also drew the attention of popular media such as &lt;a href=&#34;https://www.cnbc.com/2021/01/26/gamestop-shares-are-jumping-again-but-short-sellers-arent-backing-down.html&#34;&gt;CNBC&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;Capture.PNG&#34; /&gt;&lt;/p&gt;
&lt;p&gt;However, this post will not try to explain the mechanics of why GME rose or whether it &lt;em&gt;should&lt;/em&gt; have risen. What I will try to answer is &lt;strong&gt;“how unexpected was its rise?”&lt;/strong&gt; using an array of different forecasting tools. To assess how expected the rise in GME stock was, I’ll be using the following packages:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Anomalize&lt;/li&gt;
&lt;li&gt;Prophet&lt;/li&gt;
&lt;li&gt;Forecast (auto.arima)&lt;/li&gt;
&lt;li&gt;CausalImpact&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;From these methods we should get a good idea of just how unexpected this rise was. The approach will be to use historical price data through January 21st to predict the Gamestop stock price for the period of January 22nd through February 4th, and then to use the mean absolute percentage error (MAPE&lt;a href=&#34;#fn1&#34; class=&#34;footnote-ref&#34; id=&#34;fnref1&#34;&gt;&lt;sup&gt;1&lt;/sup&gt;&lt;/a&gt;) to quantify the amount of unexpectedness.&lt;/p&gt;
As a reminder, the MAPE is calculated as:
&lt;center&gt;
&lt;img src=&#34;mape.png&#34; style=&#34;width:25.0%&#34; /&gt;
&lt;/center&gt;
&lt;p&gt;where A is the actual and F is the forecasted value.&lt;/p&gt;
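&lt;p&gt;As a quick illustrative sketch (not part of the original analysis), the MAPE can be computed in R with a one-line helper; the &lt;code&gt;actual&lt;/code&gt; and &lt;code&gt;forecast&lt;/code&gt; vectors below are made-up values:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Illustrative helper: MAPE = mean(|(A - F) / A|)
mape &amp;lt;- function(actual, forecast) mean(abs((actual - forecast) / actual))

# Toy example with made-up numbers
mape(actual = c(100, 200), forecast = c(90, 220))
# 0.1, i.e. a 10% mean absolute percentage error&lt;/code&gt;&lt;/pre&gt;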
&lt;div id=&#34;peer-sets&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Peer Sets&lt;/h2&gt;
&lt;p&gt;While I can look at the GME time-series and know that it’s an outlier relative to its own past performance, maybe something in early January caused &lt;strong&gt;all&lt;/strong&gt; video game-related stocks to increase. The peer set that I will use as external regressors is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Nintendo (NTDOF) - &lt;em&gt;Maker of the Switch System&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;Sony (SONY) - &lt;em&gt;Maker of the Playstation System&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;Microsoft (MSFT) - &lt;em&gt;Maker of the XBox System&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;data&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Data&lt;/h1&gt;
&lt;p&gt;I’ll be using the prices of these four stocks from 1/1/2016 through 2/22/2021 for this analysis, pulled via the &lt;em&gt;tq_get()&lt;/em&gt; function from the &lt;a href=&#34;https://business-science.github.io/tidyquant/&#34;&gt;&lt;code&gt;tidyquant&lt;/code&gt;&lt;/a&gt; package.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidyquant) #Get Stock Data 
library(tidyverse) #Data Manipulation
library(lubridate) #Date Manipulation&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;### Make Data Weekly
dt &amp;lt;- tq_get(c(&amp;#39;GME&amp;#39;, &amp;#39;SONY&amp;#39;, &amp;#39;NTDOF&amp;#39;, &amp;#39;MSFT&amp;#39;),
             get=&amp;#39;stock.prices&amp;#39;,
             from = &amp;#39;2016-01-01&amp;#39;,
             to = &amp;#39;2021-02-22&amp;#39;) &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With the data pulled we can visualize each of the time-series for the four stocks. While the peer stocks all rose between 2020 and Feb 2021 it does appear that Gamestop truly “goes to the moon” above and beyond the peer stocks.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;dt %&amp;gt;% 
  filter(ymd(date) &amp;gt;= ymd(20200101)) %&amp;gt;% 
  ggplot(aes(x = date, y=close, color = symbol, group = symbol)) + 
     geom_line() + 
    geom_vline(xintercept = ymd(20210122), lty = 2, color = &amp;#39;red&amp;#39;) + 
    geom_vline(xintercept = ymd(20210204), lty = 2, color = &amp;#39;red&amp;#39;) + 
   labs(x = &amp;quot;Date&amp;quot;, y = &amp;quot;Closing Price&amp;quot;, title = &amp;quot;Gamestop&amp;#39;s Ride to the Moon &amp;amp;#128640;&amp;amp;#128640;&amp;amp;#128640;&amp;quot;,
         subtitle = &amp;quot;Fueled by &amp;lt;span style=&amp;#39;color:#ff4500&amp;#39;&amp;gt;&amp;lt;b&amp;gt;r/wallstreetbets&amp;lt;/b&amp;gt;&amp;lt;/span&amp;gt; $GME rose nearly 10x in a week&amp;quot;,
        caption = &amp;quot;&amp;lt;i&amp;gt;Prediction zone bounded by the &amp;lt;span style=&amp;#39;color:red&amp;#39;&amp;gt;red dashed&amp;lt;/span&amp;gt; lines&amp;lt;/i&amp;gt;&amp;quot;
        ) +
     scale_color_discrete(guide = &amp;#39;none&amp;#39;) +
     scale_x_date(date_breaks = &amp;quot;6 months&amp;quot;, date_labels = &amp;quot;%b %Y&amp;quot;) + 
      facet_wrap(~symbol, ncol = 1, scales = &amp;quot;free_y&amp;quot;) + 
      cowplot::theme_cowplot() + 
      theme(
        plot.title = ggtext::element_markdown(),
        plot.subtitle = ggtext::element_markdown(),
        plot.caption = ggtext::element_markdown(),
        strip.background = element_blank(),
        strip.text = ggtext::element_textbox(
          size = 12,
          color = &amp;quot;white&amp;quot;, fill = &amp;quot;#5D729D&amp;quot;, box.color = &amp;quot;#4A618C&amp;quot;,
          halign = 0.5, linetype = 1, r = unit(5, &amp;quot;pt&amp;quot;), width = unit(1, &amp;quot;npc&amp;quot;),
          padding = margin(2, 0, 1, 0), margin = margin(3, 3, 3, 3)
        )
      )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2021/08/12/gme-to-the-moon-how-unexpected-was-gamestop-s-january-stock-rally/index_files/figure-html/plot_stock-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;anomalize&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Anomalize&lt;/h1&gt;
&lt;p&gt;&lt;a href=&#34;https://github.com/business-science/anomalize&#34;&gt;&lt;code&gt;anomalize&lt;/code&gt;&lt;/a&gt; is a package developed by &lt;a href=&#34;https://www.business-science.io/&#34;&gt;Business Science&lt;/a&gt; to enable tidy anomaly detection. This package has three primary functions:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;code&gt;time_decompose()&lt;/code&gt; - which separates the data into its components&lt;/li&gt;
&lt;li&gt;&lt;code&gt;anomalize()&lt;/code&gt; - which runs anomaly detection on the remainder component&lt;/li&gt;
&lt;li&gt;&lt;code&gt;time_recompose()&lt;/code&gt; - recomposes the data to create limits around the “normal” data.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The package also provides two options for calculating the remainders, STL and Twitter. The STL method does seasonal decomposition through loess, while the Twitter method does seasonal decomposition through medians. Additionally, there are two options for calculating the anomalies from the remainders, IQR and GESD.&lt;/p&gt;
&lt;p&gt;As for which methods to choose, a talk from &lt;a href=&#34;https://www.youtube.com/watch?v=n9GOvto69aQ&amp;amp;t=6s&#34;&gt;Catherine Zhou&lt;/a&gt; summarizes the choice as:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Twitter + GESD is better for highly seasonal data&lt;/li&gt;
&lt;li&gt;STL + IQR is better if seasonality isn’t a factor.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;More details on these methods are available in the &lt;a href=&#34;https://cran.r-project.org/web/packages/anomalize/vignettes/anomalize_methods.html&#34;&gt;anomalize methods&lt;/a&gt; vignettes.&lt;/p&gt;
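&lt;p&gt;For reference, swapping in the other combination is just a matter of the &lt;code&gt;method&lt;/code&gt; arguments. A hypothetical sketch (assuming the &lt;code&gt;dt&lt;/code&gt; data frame of prices pulled above and the &lt;code&gt;anomalize&lt;/code&gt; package loaded) would look like:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Hypothetical: Twitter decomposition + GESD anomaly detection
dt %&amp;gt;%
  filter(symbol == &amp;#39;GME&amp;#39;) %&amp;gt;%
  time_decompose(close, method = &amp;#39;twitter&amp;#39;) %&amp;gt;%
  anomalize(remainder, method = &amp;#39;gesd&amp;#39;) %&amp;gt;%
  time_recompose()&lt;/code&gt;&lt;/pre&gt;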
&lt;p&gt;Since all of these stocks benefit from increases in holiday sales, I’ll use &lt;strong&gt;STL + IQR&lt;/strong&gt;. Unfortunately, &lt;code&gt;anomalize&lt;/code&gt; (to my knowledge) cannot handle covariates, so I’ll only check for anomalies in the Gamestop stock here, although I’ll add the other regressors with the other packages.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(anomalize)

anomalize_dt &amp;lt;- dt %&amp;gt;%
  filter(symbol == &amp;#39;GME&amp;#39;) %&amp;gt;% 
  # Merge keeps all of the original data in the decomposition
  time_decompose(close, method = &amp;#39;stl&amp;#39;, merge = T, trend = &amp;quot;1 year&amp;quot;) %&amp;gt;% 
  anomalize(remainder, method = &amp;quot;iqr&amp;quot;) %&amp;gt;% 
  time_recompose() %&amp;gt;% 
  filter(between(date, ymd(20210122), ymd(20210204)))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Looking at our prediction window returns:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;predictions_anomalize &amp;lt;- anomalize_dt %&amp;gt;% 
  transmute(date, actual = close, predicted = trend + season, 
            normal_lower = recomposed_l1, normal_upper = recomposed_l2, 
            residual = remainder, anomaly)


knitr::kable(predictions_anomalize, digits = 2)&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;date&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;actual&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;predicted&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;normal_lower&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;normal_upper&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;residual&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;anomaly&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-01-22&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;65.01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;16.97&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;10.08&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;23.68&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;48.04&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-01-25&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;76.79&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;17.06&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;10.17&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;23.77&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;59.73&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-01-26&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;147.98&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;17.15&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;10.26&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;23.86&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;130.83&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-01-27&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;347.51&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;17.27&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;10.37&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;23.98&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;330.24&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-01-28&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;193.60&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;17.36&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;10.47&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;24.07&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;176.24&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-01-29&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;325.00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;17.44&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;10.54&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;24.15&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;307.56&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-02-01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;225.00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;17.53&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;10.64&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;24.24&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;207.47&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-02-02&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;90.00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;17.62&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;10.73&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;24.33&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;72.38&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-02-03&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;92.41&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;17.74&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;10.84&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;24.45&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;74.67&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-02-04&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;53.50&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;17.83&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;10.94&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;24.54&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;35.67&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;So &lt;code&gt;anomalize&lt;/code&gt; identified every date in the window as an anomaly versus what was expected. From these predictions the MAPE comes out to 84.09%, which means that only about 16% of Gamestop’s stock movement was predicted.&lt;/p&gt;
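&lt;p&gt;As a sketch, that MAPE figure can be computed from the &lt;code&gt;actual&lt;/code&gt; and &lt;code&gt;predicted&lt;/code&gt; columns created above along these lines:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Sketch: MAPE over the prediction window
predictions_anomalize %&amp;gt;%
  summarize(mape = mean(abs((actual - predicted) / actual)))&lt;/code&gt;&lt;/pre&gt;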
&lt;/div&gt;
&lt;div id=&#34;prophet&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Prophet&lt;/h1&gt;
&lt;p&gt;&lt;a href=&#34;https://facebook.github.io/prophet/&#34;&gt;&lt;code&gt;Prophet&lt;/code&gt;&lt;/a&gt; is a forecasting library developed by Facebook. To calculate the MAPE, I will fit the prophet model to the data before the prediction period and then predict on the data in the prediction period. Prophet does allow for additional regressors, so I will run two versions of the model: the first on the Gamestop time series alone, and the second bringing in the Sony, Nintendo, and Microsoft regressors.&lt;/p&gt;
&lt;div id=&#34;data-processing&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Data Processing&lt;/h2&gt;
&lt;p&gt;Currently, the data is in a tidy format where each symbol appears in a separate row. In order to use the symbols in prophet (and in future packages), I need the data in a format where each row is a date and the symbols are separate columns. Additionally, to be used in prophet, the data must have a &lt;code&gt;ds&lt;/code&gt; column for the date and a &lt;code&gt;y&lt;/code&gt; column for the time series being projected. The following code block splits the data into the pre-period and the prediction period, as well as renaming the GME series to &lt;code&gt;y&lt;/code&gt; and date to &lt;code&gt;ds&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;prep_data &amp;lt;- dt %&amp;gt;% 
  select(date, symbol, close) %&amp;gt;% 
  pivot_wider(names_from = &amp;#39;symbol&amp;#39;, values_from = &amp;#39;close&amp;#39;) %&amp;gt;% 
  rename(y = GME, ds = date)

pre &amp;lt;- prep_data %&amp;gt;% filter(ds &amp;lt;= ymd(20210121))
pred &amp;lt;- prep_data %&amp;gt;% filter(between(ds, ymd(20210122), ymd(20210204)))&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;model-1-only-the-gamestop-time-series&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Model 1: Only the Gamestop Time Series&lt;/h2&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(prophet)

#Build the Model
model_no_regressors &amp;lt;- prophet(pre)
#Predict on the Future Data
model_no_regressors_pred &amp;lt;- predict(model_no_regressors, pred)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can look at the predicted results and the residuals by joining the actual data back to the predicted data:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;predictions_prophet_no_reg &amp;lt;- model_no_regressors_pred %&amp;gt;% 
  inner_join(pred %&amp;gt;% select(ds, y), by = &amp;quot;ds&amp;quot;) %&amp;gt;% 
  transmute(ds, actual = y, predicted = yhat, lower = yhat_lower, 
            upper = yhat_upper, residual = y-yhat)

knitr::kable(predictions_prophet_no_reg, digits = 2)&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;ds&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;actual&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;predicted&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;lower&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;upper&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;residual&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-01-22&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;65.01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;17.74&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;14.91&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;20.59&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;47.27&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-01-25&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;76.79&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;17.31&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;14.56&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;20.15&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;59.48&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-01-26&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;147.98&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;17.14&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;14.13&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;19.90&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;130.84&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-01-27&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;347.51&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;17.03&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;14.29&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;19.93&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;330.48&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-01-28&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;193.60&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;16.89&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;13.89&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;19.31&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;176.71&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-01-29&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;325.00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;16.55&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;13.99&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;19.29&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;308.45&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-02-01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;225.00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;16.24&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;13.43&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;19.01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;208.76&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-02-02&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;90.00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;16.16&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;13.66&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;18.90&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;73.84&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-02-03&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;92.41&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;16.14&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;13.46&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;18.83&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;76.27&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-02-04&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;53.50&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;16.11&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;13.47&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;18.79&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;37.39&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;From this I can calculate the MAPE as 84.71%, again indicating that only about 15% of the movement was “expected”.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;model-2-gamestop-regressors&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Model 2: Gamestop + Regressors&lt;/h2&gt;
&lt;p&gt;To run a prophet model with regressors, the syntax is a little different: rather than pass a dataset into the &lt;code&gt;prophet()&lt;/code&gt; function, I’ll need to start with an empty &lt;code&gt;prophet()&lt;/code&gt; model, add the regressors, and then pass the data into the &lt;code&gt;fit.prophet()&lt;/code&gt; function to actually fit the model.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Initialize Model
prophet_reg &amp;lt;- prophet()

#Add Regressors
prophet_reg &amp;lt;- add_regressor(prophet_reg, &amp;#39;MSFT&amp;#39;)
prophet_reg &amp;lt;- add_regressor(prophet_reg, &amp;#39;SONY&amp;#39;)
prophet_reg &amp;lt;- add_regressor(prophet_reg, &amp;#39;NTDOF&amp;#39;)

#Fit Model
prophet_reg &amp;lt;- fit.prophet(prophet_reg, pre)

# Predict on Future Data
prophet_reg_pred &amp;lt;- predict(prophet_reg, pred)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then looking at the predictions:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;predictions_prophet_reg &amp;lt;- prophet_reg_pred %&amp;gt;% 
  inner_join(pred %&amp;gt;% select(ds, y), by = &amp;quot;ds&amp;quot;) %&amp;gt;% 
  transmute(ds, actual = y, predicted = yhat, lower = yhat_lower, 
            upper = yhat_upper, residual = y-yhat)

knitr::kable(predictions_prophet_reg, digits = 2)&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;ds&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;actual&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;predicted&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;lower&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;upper&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;residual&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-01-22&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;65.01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;20.32&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;17.95&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;22.78&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;44.69&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-01-25&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;76.79&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;19.17&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;16.66&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;21.71&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;57.62&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-01-26&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;147.98&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;18.95&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;16.22&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;21.39&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;129.03&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-01-27&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;347.51&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;18.07&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;15.52&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;20.69&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;329.44&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-01-28&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;193.60&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;17.47&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;14.93&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;19.74&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;176.13&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-01-29&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;325.00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;17.61&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;15.09&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;19.98&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;307.39&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-02-01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;225.00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;17.22&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;14.67&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;19.65&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;207.78&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-02-02&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;90.00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;17.61&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;14.97&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;20.02&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;72.39&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-02-03&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;92.41&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;21.18&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;18.40&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;23.63&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;71.23&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-02-04&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;53.50&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;21.25&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;18.75&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;23.61&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;32.25&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;which gives us a MAPE of 82.14%. The addition of the external regressors makes the forecast errors &lt;em&gt;slightly&lt;/em&gt; lower; now about 18% of the movement was expected.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;auto.arima&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Auto.Arima&lt;/h1&gt;
&lt;p&gt;&lt;a href=&#34;https://otexts.com/fpp2/arima-r.html&#34;&gt;&lt;code&gt;auto.arima()&lt;/code&gt;&lt;/a&gt; is a function within the &lt;code&gt;forecast&lt;/code&gt; package that algorithmically determines the proper specification for an ARIMA (auto-regressive integrated moving average) model. The basic version of auto.arima fits a univariate series, which I will do first; then I’ll add external regressors similar to what was done with Prophet.&lt;/p&gt;
&lt;div id=&#34;model-1-only-gamestop-time-series&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Model 1: Only Gamestop Time Series&lt;/h2&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(forecast)

# Fit auto arima model
auto_arima_model &amp;lt;- auto.arima(pre$y)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The function returns an ARIMA(1, 2, 2) model. The &lt;code&gt;forecast()&lt;/code&gt; function is then used to project the model into the future.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Forecast 10 Periods Ahead
auto_arima_pred &amp;lt;- forecast(auto_arima_model, 10)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then, as with the earlier models, I can look at the predictions vs. the actuals. The forecast object is a list from which I can pull the point forecast from the “mean” item and the prediction bounds from &lt;em&gt;lower&lt;/em&gt; and &lt;em&gt;upper&lt;/em&gt;. The list contains intervals for both 80% and 95%, so the &lt;code&gt;[, 2]&lt;/code&gt; pulls the 95% intervals.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;predictions_auto_arima &amp;lt;- pred %&amp;gt;% 
  bind_cols(
    tibble(
      predicted = auto_arima_pred$mean %&amp;gt;% as.numeric(),
      lower = auto_arima_pred$lower[, 2] %&amp;gt;% as.numeric(),
      upper = auto_arima_pred$upper[, 2] %&amp;gt;% as.numeric()
    )
  ) %&amp;gt;% 
  transmute(
    ds, actual = y, predicted, lower, upper, residuals = y - predicted
  )
  
knitr::kable(predictions_auto_arima, digits = 2)&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;ds&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;actual&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;predicted&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;lower&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;upper&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;residuals&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-01-22&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;65.01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;43.71&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;42.35&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;45.07&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;21.30&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-01-25&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;76.79&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;44.37&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;42.41&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;46.32&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;32.42&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-01-26&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;147.98&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;45.04&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;42.62&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;47.47&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;102.94&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-01-27&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;347.51&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;45.70&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;42.87&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;48.54&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;301.81&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-01-28&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;193.60&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;46.38&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;43.17&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;49.59&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;147.22&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-01-29&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;325.00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;47.04&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;43.48&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;50.60&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;277.96&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-02-01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;225.00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;47.72&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;43.82&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;51.61&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;177.28&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-02-02&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;90.00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;48.37&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;44.16&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;52.59&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;41.63&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-02-03&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;92.41&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;49.05&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;44.53&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;53.57&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;43.36&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-02-04&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;53.50&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;49.71&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;44.89&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;54.53&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;3.79&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;This gives a MAPE of 57.20%, which is much better than the prior methods.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;adding-in-external-regressors&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Adding in External Regressors&lt;/h2&gt;
&lt;p&gt;&lt;code&gt;auto.arima&lt;/code&gt; can also take into account external regressors through the &lt;code&gt;xreg&lt;/code&gt; parameter. It’s a little trickier to implement since the regressors need to be supplied as a matrix. But as usual, &lt;a href=&#34;https://stats.stackexchange.com/questions/41070/how-to-setup-xreg-argument-in-auto-arima-in-r&#34;&gt;StackOverflow&lt;/a&gt; comes through with a solution. In this case it’s from the package author himself!&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Create Matrix of External Regressors
xreg &amp;lt;- model.matrix(~ SONY + NTDOF + MSFT - 1, data = pre)
# Fit ARIMA Model
auto_arima_reg &amp;lt;- auto.arima(pre$y, xreg = xreg)

# Create Matrix of External Regressors for Forecasting
xreg_pred &amp;lt;- model.matrix(~ SONY + NTDOF + MSFT - 1, data = pred)
# Forecast with External Regressors
auto_arima_reg_fcst &amp;lt;- forecast(auto_arima_reg, h = 10, xreg = xreg_pred)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;predictions_auto_arima_reg &amp;lt;- pred %&amp;gt;% 
  bind_cols(
    tibble(
      predicted = auto_arima_reg_fcst$mean %&amp;gt;% as.numeric(),
      lower = auto_arima_reg_fcst$lower[, 2] %&amp;gt;% as.numeric(),
      upper = auto_arima_reg_fcst$upper[, 2] %&amp;gt;% as.numeric()
    )
  ) %&amp;gt;% 
  transmute(
    ds, actual = y, predicted, lower, upper, residuals = y - predicted
  )
  
knitr::kable(predictions_auto_arima_reg, digits = 2)&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;ds&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;actual&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;predicted&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;lower&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;upper&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;residuals&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-01-22&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;65.01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;43.70&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;42.34&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;45.06&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;21.31&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-01-25&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;76.79&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;44.37&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;42.42&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;46.32&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;32.42&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-01-26&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;147.98&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;45.10&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;42.69&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;47.52&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;102.88&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-01-27&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;347.51&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;45.69&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;42.86&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;48.52&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;301.82&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-01-28&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;193.60&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;46.51&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;43.31&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;49.71&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;147.09&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-01-29&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;325.00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;46.96&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;43.41&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;50.51&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;278.04&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-02-01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;225.00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;47.88&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;44.00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;51.76&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;177.12&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-02-02&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;90.00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;48.53&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;44.33&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;52.73&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;41.47&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-02-03&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;92.41&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;49.55&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;45.04&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;54.06&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;42.86&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-02-04&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;53.50&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;50.17&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;45.36&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;54.99&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;3.33&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;This gives a MAPE of 57.03%. Again the addition of external regressors only makes things &lt;em&gt;slightly&lt;/em&gt; better.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;causalimpact&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;CausalImpact&lt;/h1&gt;
&lt;p&gt;&lt;a href=&#34;https://github.com/google/CausalImpact&#34;&gt;&lt;code&gt;CausalImpact&lt;/code&gt;&lt;/a&gt; is a package developed by Google to measure the causal impact of an intervention on a time series. The package uses a Bayesian Structural Time-Series model to estimate a counterfactual of how a response would have evolved without the intervention. This package works by comparing a time-series of interest to a set of control time series and uses the relationships pre-intervention to predict the counterfactual.&lt;/p&gt;
&lt;p&gt;CausalImpact also requires some data preparation, as it takes a &lt;code&gt;zoo&lt;/code&gt; object as input. But I can largely leverage the &lt;code&gt;prep_data&lt;/code&gt; data set created in the prophet section, since CausalImpact only requires that the field of interest be in the first column. The construction of the &lt;code&gt;zoo&lt;/code&gt; object takes the data and the date index as its two parameters.&lt;/p&gt;
&lt;p&gt;Then, to run the causal impact analysis, I pass in the &lt;code&gt;zoo&lt;/code&gt; data set and specify the pre-period and the post-period. The &lt;em&gt;model.args&lt;/em&gt; option of &lt;code&gt;model.args = list(nseasons = 5, season.duration = 1)&lt;/code&gt; adds day-of-week seasonality by specifying that there are 5 periods in each seasonal cycle and that each data point represents 1 period of a season. As another example, to add day-of-week seasonality to data with hourly granularity, I would specify &lt;code&gt;nseasons=7&lt;/code&gt; and &lt;code&gt;season.duration=24&lt;/code&gt; to say that there are 7 periods in a season and 24 data points in a period.&lt;/p&gt;
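&lt;p&gt;As a minimal sketch of that hourly example (hypothetical: &lt;code&gt;hourly_zoo&lt;/code&gt;, &lt;code&gt;pre_hours&lt;/code&gt;, and &lt;code&gt;post_hours&lt;/code&gt; are placeholder objects, not data from this post):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;## Hypothetical: day-of-week seasonality on hourly data
## 7 seasons (days of the week), each spanning 24 hourly data points
ci_hourly &amp;lt;- CausalImpact(hourly_zoo,
                          pre.period = pre_hours,
                          post.period = post_hours,
                          model.args = list(nseasons = 7, season.duration = 24))&lt;/code&gt;&lt;/pre&gt;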
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(CausalImpact)

#Create Zoo Object
dt_ci &amp;lt;- zoo(prep_data %&amp;gt;% dplyr::select(-ds), prep_data$ds)

#Run Causal Impact
ci &amp;lt;- CausalImpact(dt_ci, 
                   pre.period = c(as.Date(&amp;#39;2020-05-03&amp;#39;), as.Date(&amp;#39;2021-01-21&amp;#39;)),
                   post.period = c(as.Date(&amp;#39;2021-01-22&amp;#39;), as.Date(&amp;#39;2021-02-04&amp;#39;)),
                   model.args = list(nseasons = 5, season.duration = 1)
                   )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To get the information about the predictions, I can pull them out of the &lt;em&gt;series&lt;/em&gt; attribute of the &lt;code&gt;ci&lt;/code&gt; object. While not used in this analysis, the &lt;code&gt;summary()&lt;/code&gt; and &lt;code&gt;plot()&lt;/code&gt; functions are very useful, and the &lt;code&gt;summary(ci, &#34;report&#34;)&lt;/code&gt; option is interesting in that it gives a full paragraph description of the results.&lt;/p&gt;
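&lt;p&gt;For reference, a quick sketch of those inspection helpers on the fitted &lt;code&gt;ci&lt;/code&gt; object (output omitted):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;summary(ci)             # table of average and cumulative effects
summary(ci, &amp;quot;report&amp;quot;)   # full paragraph description of the results
plot(ci)                # observed vs. counterfactual, pointwise, and cumulative panels&lt;/code&gt;&lt;/pre&gt;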
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;predictions_causal_inference &amp;lt;- ci$series %&amp;gt;% 
  as_tibble(rownames = &amp;#39;ds&amp;#39;) %&amp;gt;% 
  filter(between(ymd(ds), ymd(20210122), ymd(20210204))) %&amp;gt;% 
  transmute(ds, actual = response, predicted = point.pred, 
            lower = point.pred.lower, upper = point.pred.upper, 
            residual = point.effect)
  
knitr::kable(predictions_causal_inference, digits = 2)&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;ds&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;actual&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;predicted&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;lower&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;upper&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;residual&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-01-22&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;65.01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;39.17&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;34.08&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;44.21&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;25.84&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-01-25&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;76.79&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;38.40&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;32.58&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;43.85&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;38.39&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-01-26&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;147.98&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;38.71&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;33.25&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;44.76&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;109.27&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-01-27&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;347.51&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;38.58&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;32.86&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;44.85&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;308.93&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-01-28&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;193.60&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;39.06&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;33.24&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;45.83&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;154.54&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-01-29&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;325.00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;39.21&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;32.71&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;45.93&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;285.79&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-02-01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;225.00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;38.29&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;32.08&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;45.04&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;186.71&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-02-02&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;90.00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;38.60&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;31.59&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;45.92&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;51.40&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-02-03&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;92.41&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;38.40&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;31.52&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;45.83&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;54.01&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-02-04&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;53.50&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;38.96&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;31.11&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;46.97&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;14.54&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;This would give us a MAPE of 64.60%, which is between the &lt;code&gt;auto.arima&lt;/code&gt; models and the other methods.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;conclusion&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Conclusion&lt;/h1&gt;
&lt;p&gt;This post looked at six different mechanisms to forecast what Gamestop’s stock price would be during the period when it spiked. Bringing all of the projections together with the actuals gives us:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;all_combined &amp;lt;- bind_rows(
  #Actuals
  dt %&amp;gt;% filter(symbol == &amp;#39;GME&amp;#39;) %&amp;gt;% 
    transmute(ds = ymd(date), lbl = &amp;#39;actuals&amp;#39;, y = close),
  #Anomalize
  predictions_anomalize %&amp;gt;% 
    transmute(ds = ymd(date), lbl = &amp;quot;Anomalize&amp;quot;, y = predicted),
  #Prophet (No Regressors)
  predictions_prophet_no_reg %&amp;gt;% 
    transmute(ds = ymd(ds), lbl = &amp;quot;Prophet (No Regressors)&amp;quot;, y = predicted),
  #Prophet (w/ Regressors)
  predictions_prophet_reg %&amp;gt;% 
    transmute(ds = ymd(ds), lbl = &amp;quot;Prophet (w/ Regressors)&amp;quot;, y = predicted),
  #Auto.Arima (No Regressors)
  predictions_auto_arima %&amp;gt;% 
    transmute(ds = ymd(ds), lbl = &amp;quot;Auto.Arima (No Regressors)&amp;quot;, y = predicted),
  #Auto.Arima (w/ Regressors)
  predictions_auto_arima_reg %&amp;gt;% 
    transmute(ds = ymd(ds), lbl = &amp;quot;Auto.Arima (w/ Regressors)&amp;quot;, y = predicted),
  #Causal Inference
  predictions_causal_inference %&amp;gt;% 
    transmute(ds = ymd(ds), lbl = &amp;quot;CausalImpact&amp;quot;, y = predicted)
) 

all_combined %&amp;gt;%
  filter(ds &amp;gt;= &amp;#39;2021-01-18&amp;#39; &amp;amp; ds &amp;lt;= &amp;#39;2021-02-04&amp;#39;) %&amp;gt;% 
  ggplot(aes(x = ds, y = y, color = lbl)) + 
    geom_line() + 
    geom_vline(xintercept = ymd(20210122), lty = 2, color = &amp;#39;darkred&amp;#39;) + 
    geom_vline(xintercept = ymd(20210204), lty = 2, color = &amp;#39;darkred&amp;#39;) + 
    labs(title = &amp;quot;Comparing GME Price Projections 1/22/21 - 2/4/21&amp;quot;,
         x = &amp;quot;Date&amp;quot;,
         y = &amp;quot;GME Closing Price ($)&amp;quot;,
         color = &amp;quot;&amp;quot;) + 
    scale_x_date(date_breaks = &amp;quot;2 days&amp;quot;, date_labels = &amp;quot;%b %d&amp;quot;) + 
    scale_y_log10() + 
    scale_color_manual(values = wesanderson::wes_palette(&amp;quot;Zissou1&amp;quot;, 
                                                       n = 7,
                                                       type = &amp;#39;continuous&amp;#39;)) +
    cowplot::theme_cowplot() + 
    theme(
      legend.direction = &amp;#39;horizontal&amp;#39;,
      legend.position = &amp;#39;bottom&amp;#39;
    ) + 
    guides(color=guide_legend(nrow=3,byrow=TRUE))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2021/08/12/gme-to-the-moon-how-unexpected-was-gamestop-s-january-stock-rally/index_files/figure-html/combined_plots-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Looking at all the projections together, it’s clear that no forecasting method really saw the massive spike in price coming. The Auto.Arima method looks like it comes closest, but potentially more because it started from the highest point than because its forecast was particularly responsive.&lt;/p&gt;
&lt;p&gt;Looking just at January 27th, the peak of the spike, gives the clearest perspective on the difference between the actual and all of the projections:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;all_combined %&amp;gt;% 
  filter(ds == &amp;#39;2021-01-27&amp;#39;) %&amp;gt;% 
  ggplot(aes(x = fct_reorder(lbl, y), y = y, fill = lbl)) + 
    geom_col() + 
    geom_text(aes(label = y %&amp;gt;% scales::dollar(),
                  hjust = (y &amp;gt;= 300))) + 
    labs(x = &amp;quot;Projection Method&amp;quot;,
         y = &amp;quot;GME Closing Price on Jan 27&amp;quot;,
         title = &amp;quot;Looking at the Peak of the Spike&amp;quot;,
         subtitle = &amp;quot;Gamestop Closing Price on January 27, 2021&amp;quot;,
         fill = &amp;quot;&amp;quot;) +
   scale_fill_manual(guide = F, 
                     values = wesanderson::wes_palette(&amp;quot;Zissou1&amp;quot;, 
                                                       n = 7,
                                                       type = &amp;#39;continuous&amp;#39;)) + 
   scale_y_continuous(label = scales::dollar) +
   coord_flip() + 
   cowplot::theme_cowplot()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2021/08/12/gme-to-the-moon-how-unexpected-was-gamestop-s-january-stock-rally/index_files/figure-html/jan27only-1.png&#34; width=&#34;672&#34; /&gt;
No methodology really comes within $300 of the actual price. To quantify just &lt;em&gt;how&lt;/em&gt; unexpected Gamestop’s rise was, I’ll look at the MAPEs for all the forecasting methods.&lt;/p&gt;
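&lt;p&gt;As a refresher, MAPE (mean absolute percentage error) can be computed by hand to sanity-check the &lt;code&gt;yardstick&lt;/code&gt; version; this sketch uses made-up numbers rather than data from this post:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;actual    &amp;lt;- c(100, 200, 400)
predicted &amp;lt;- c(110, 150, 100)

# mean of |actual - predicted| / |actual|, scaled to a percentage
mean(abs((actual - predicted) / actual)) * 100
## ~36.67, matching yardstick::mape() on the same columns&lt;/code&gt;&lt;/pre&gt;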
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;format_mape &amp;lt;- function(dt, method){
 return(
   dt %&amp;gt;% 
    yardstick::mape(actual, predicted) %&amp;gt;% 
    transmute(Method = method, MAPE = .estimate %&amp;gt;% scales::percent(scale = 1, accuracy = .01))
 )
}

bind_rows(
  #Anomalize
  format_mape(predictions_anomalize, &amp;quot;Anomalize&amp;quot;),
  #Prophet (No Regressors)
  format_mape(predictions_prophet_no_reg, &amp;quot;Prophet (No Regressors)&amp;quot;),
  #Prophet (w/ Regressors)
  format_mape(predictions_prophet_reg, &amp;quot;Prophet (w/ Regressors)&amp;quot;), 
  #Auto.Arima
  format_mape(predictions_auto_arima, &amp;quot;Auto.Arima (No Regressors)&amp;quot;), 
  #Auto.Arima (w/ Regressors)
  format_mape(predictions_auto_arima_reg, &amp;quot;Auto.Arima (w/ Regressors)&amp;quot;), 
  #Causal Inference
  format_mape(predictions_causal_inference, &amp;quot;CausalImpact&amp;quot;)
) %&amp;gt;% 
  knitr::kable(align = c(&amp;#39;l&amp;#39;, &amp;#39;r&amp;#39;))&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;Method&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;MAPE&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Anomalize&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;84.09%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Prophet (No Regressors)&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;84.71%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Prophet (w/ Regressors)&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;82.14%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Auto.Arima (No Regressors)&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;57.20%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Auto.Arima (w/ Regressors)&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;57.03%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;CausalImpact&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;64.60%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Using the MAPE as the measure of “unexpectedness”, I would conclude that this outcome was 57% to 85% unexpected (although much of the accuracy comes less from the models doing a good job of predicting the spike and more from the models staying flat while the stock price came back down). So despite a small rise before the projection period, it’s clear that Gamestop’s meteoric rise and subsequent fall was a very unexpected event.&lt;/p&gt;
&lt;/div&gt;
&lt;div class=&#34;footnotes&#34;&gt;
&lt;hr /&gt;
&lt;ol&gt;
&lt;li id=&#34;fn1&#34;&gt;&lt;p&gt;Practically, the MAPE function is being calculated using the &lt;code&gt;yardstick&lt;/code&gt; package where the format is &lt;code&gt;yardstick::mape(truth, estimate)&lt;/code&gt; where truth and estimate are the columns for the actual and predicted values.&lt;a href=&#34;#fnref1&#34; class=&#34;footnote-back&#34;&gt;↩︎&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>How to not have Plot.ly Inflate Hugo&#39;s Reading Time</title>
      <link>https://jlaw.netlify.app/2021/07/26/how-to-not-have-plot-ly-inflate-hugo-s-reading-time/</link>
      <pubDate>Mon, 26 Jul 2021 00:00:00 +0000</pubDate>
      <guid>https://jlaw.netlify.app/2021/07/26/how-to-not-have-plot-ly-inflate-hugo-s-reading-time/</guid>
      <description>
&lt;script src=&#34;https://jlaw.netlify.app/2021/07/26/how-to-not-have-plot-ly-inflate-hugo-s-reading-time/index_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;p&gt;I’m a big proponent of enabling the reading-time option on this blog, which uses Hugo’s academic theme. I always appreciate seeing it on other blogs so I know how much time to invest in a post. I also like it because it’s a feedback mechanism for me to try to write more concisely. But too long a reading time at the top of a post can deter people from reading it at all.&lt;/p&gt;
&lt;p&gt;While writing the recap post for &lt;a href=&#34;https://jlaw.netlify.app/2021/07/14/celebrating-the-blog-s-first-birthday-with-googleanalyticsr/&#34;&gt;this blog’s 1 year anniversary&lt;/a&gt;, I noticed that when I first generated the post using plot.ly for an interactive chart, the reading time ballooned up to 98 minutes from the 13 it was supposed to be.&lt;/p&gt;
&lt;p&gt;Turning to “Dr. Google” I didn’t find any immediate solutions for getting the reading time to be more tractable. However, I did figure out a small “hack” within RMarkdown to provide the same end output to the blog, but without the increase in reading time.&lt;/p&gt;
&lt;p&gt;This post will show:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;That this happens&lt;/li&gt;
&lt;li&gt;Why this happens&lt;/li&gt;
&lt;li&gt;And a way to continue to use plot.ly from RMarkdown without having it balloon the post’s reading time.&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidyverse)
library(plotly)&lt;/code&gt;&lt;/pre&gt;
&lt;div id=&#34;what-is-happening&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;What is Happening?&lt;/h2&gt;
&lt;p&gt;When rendering an RMarkdown file to Hugo, using a &lt;code&gt;plot.ly&lt;/code&gt; chart that includes categorical data will cause the article’s reading time to balloon. At least it will when there are many points with categorical data. For this trivial example, I’ll see which character from Friends had the most lines throughout the run of the show. Apparently this is available in a &lt;code&gt;friends&lt;/code&gt; R package… because everything is available in an R package!!&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;p &amp;lt;- friends::friends %&amp;gt;%
  filter(!is.na(speaker)) %&amp;gt;% 
  #Creating Running Season and Episode Indicator
  inner_join(
    friends::friends %&amp;gt;% 
      distinct(season, episode) %&amp;gt;%
      arrange(season, episode) %&amp;gt;%
      mutate(episode_num = row_number()),
    by = c(&amp;#39;season&amp;#39;, &amp;#39;episode&amp;#39;)
  ) %&amp;gt;%
  #Summarize By Character
  count(episode_num, speaker, name = &amp;quot;lines&amp;quot;) %&amp;gt;%
  group_by(speaker) %&amp;gt;% 
  arrange(episode_num) %&amp;gt;%
  mutate(total_lines = cumsum(lines),
         max_lines = max(total_lines)) %&amp;gt;%
  ungroup() %&amp;gt;%
  #Keep Top 20
  mutate(rnk = dense_rank(-max_lines)) %&amp;gt;%
  filter(rnk &amp;lt;= 20) %&amp;gt;% 
  ggplot(aes(x = episode_num, y = total_lines, color = speaker)) + 
    geom_line() + 
    labs(x = &amp;quot;Episode # of Friends&amp;quot;,
         y = &amp;quot;Number of Lines&amp;quot;,
         title = &amp;quot;Cumulative Number of Lines Spoken by Characters on Friends&amp;quot;) + 
    cowplot::theme_cowplot() + 
    theme(legend.position=&amp;#39;none&amp;#39;,
          plot.title = element_text(size = 14)) 

ggplotly(p)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;iframe src=&#34;p1.html&#34; scrolling=&#34;no&#34; seamless=&#34;seamless&#34; frameBorder=&#34;0&#34; height=&#34;400&#34; width=&#34;800&#34;&gt;&lt;/iframe&gt;&lt;/p&gt;
&lt;p&gt;But &lt;strong&gt;WTF&lt;/strong&gt;… when I render this page I see that the Reading Time is 60 minutes!!! For this article to this point!! Insanity.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;wtf.PNG&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;why-is-this-happening&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Why is this happening?&lt;/h2&gt;
&lt;p&gt;The TL;DR of what’s going on is that plot.ly embeds all of the data from the chart directly into the page source. So if we view the page source we’ll see elements for every point of the data:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;plotly_data.PNG&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Then (I believe) Hugo misinterprets aspects of this data as additional word count, and that’s how an article that should take only a few minutes to read becomes closer to an hour.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;how-to-get-around-this&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;How to get around this?&lt;/h2&gt;
&lt;p&gt;In my post, I worked around this by:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Displaying the code I wanted to show with an &lt;code&gt;eval=FALSE&lt;/code&gt; option on the code chunk to not actually render the plot.ly chart but show the code that &lt;strong&gt;WOULD&lt;/strong&gt; render it.&lt;/li&gt;
&lt;li&gt;Having a 2nd code block that’s nearly identical, with an &lt;code&gt;echo=FALSE&lt;/code&gt; option on the code chunk to hide the code that is actually run. This code chunk should &lt;strong&gt;also&lt;/strong&gt; save the plot.ly widget as a self-contained file using something like &lt;code&gt;htmlwidgets::saveWidget(p1, file=&#34;p1.html&#34;, selfcontained = T)&lt;/code&gt;, where &lt;em&gt;p1&lt;/em&gt; is the &lt;code&gt;ggplotly()&lt;/code&gt; object and &lt;em&gt;p1.html&lt;/em&gt; is the output file.&lt;/li&gt;
&lt;li&gt;Have a 3rd code chunk with &lt;code&gt;echo=FALSE&lt;/code&gt; to create an iframe tag that will contain the HTML file created in step 2. This is done with &lt;code&gt;htmltools::tags$iframe(src = &#34;p1.html&#34;)&lt;/code&gt; and some other options.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;To show this in action (although in this example I’ll display all 3 code blocks):&lt;/p&gt;
&lt;div id=&#34;code-block-1-the-code-you-want-to-display&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Code Block 1: The Code You Want To Display&lt;/h3&gt;
&lt;p&gt;This is a repeat of the code from above, which has &lt;code&gt;eval=FALSE&lt;/code&gt; so it’s shown but not run:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;p &amp;lt;- friends::friends %&amp;gt;% 
  filter(!is.na(speaker)) %&amp;gt;% 
  #Creating Running Season and Episode Indicator
  inner_join(
    friends::friends %&amp;gt;% 
      distinct(season, episode) %&amp;gt;%
      arrange(season, episode) %&amp;gt;%
      mutate(episode_num = row_number()),
    by = c(&amp;#39;season&amp;#39;, &amp;#39;episode&amp;#39;)
  ) %&amp;gt;%
  #Summarize By Character
  count(episode_num, speaker, name = &amp;quot;lines&amp;quot;) %&amp;gt;%
  group_by(speaker) %&amp;gt;% 
  arrange(episode_num) %&amp;gt;%
  mutate(total_lines = cumsum(lines),
         max_lines = max(total_lines)) %&amp;gt;%
  ungroup() %&amp;gt;%
  #Keep Top 20
  mutate(rnk = dense_rank(-max_lines)) %&amp;gt;%
  filter(rnk &amp;lt;= 20) %&amp;gt;% 
  ggplot(aes(x = episode_num, y = total_lines, color = speaker)) + 
    geom_line() + 
    labs(x = &amp;quot;Episode # of Friends&amp;quot;,
         y = &amp;quot;Number of Lines&amp;quot;,
         title = &amp;quot;Cumulative Number of Lines Spoken by Characters on Friends&amp;quot;) + 
    cowplot::theme_cowplot() + 
    theme(legend.position=&amp;#39;none&amp;#39;,
          plot.title = element_text(size = 14)) 

ggplotly(p)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;code-block-2-the-code-thats-actually-run-to-save-the-plot.ly-chart-to-an-external-file&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Code Block 2: The Code That’s ACTUALLY run to save the plot.ly chart to an external file&lt;/h3&gt;
&lt;p&gt;This would normally have &lt;code&gt;echo=FALSE&lt;/code&gt; so that it is run but not seen. It is identical to the prior code block except that it also saves the chart to &lt;em&gt;p1.html&lt;/em&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;## Identical Code to CB1
p &amp;lt;- friends::friends %&amp;gt;% 
  filter(!is.na(speaker)) %&amp;gt;% 
  #Creating Running Season and Episode Indicator
  inner_join(
    friends::friends %&amp;gt;% 
      distinct(season, episode) %&amp;gt;%
      arrange(season, episode) %&amp;gt;%
      mutate(episode_num = row_number()),
    by = c(&amp;#39;season&amp;#39;, &amp;#39;episode&amp;#39;)
  ) %&amp;gt;%
  #Summarize By Character
  count(episode_num, speaker, name = &amp;quot;lines&amp;quot;) %&amp;gt;%
  group_by(speaker) %&amp;gt;% 
  arrange(episode_num) %&amp;gt;%
  mutate(total_lines = cumsum(lines),
         max_lines = max(total_lines)) %&amp;gt;%
  ungroup() %&amp;gt;%
  #Keep Top 20
  mutate(rnk = dense_rank(-max_lines)) %&amp;gt;%
  filter(rnk &amp;lt;= 20) %&amp;gt;% 
  ggplot(aes(x = episode_num, y = total_lines, color = speaker)) + 
    geom_line() + 
    labs(x = &amp;quot;Episode # of Friends&amp;quot;,
         y = &amp;quot;Number of Lines&amp;quot;,
         title = &amp;quot;Cumulative Number of Lines Spoken by Characters on Friends&amp;quot;) + 
    cowplot::theme_cowplot() + 
    theme(legend.position=&amp;#39;none&amp;#39;,
          plot.title = element_text(size = 14)) 

################MODIFIED PART STARTS HERE##############################

## Save the plot.ly chart to an object
p1 &amp;lt;- ggplotly(p)

## Save the object as a self-contained HTML file
htmlwidgets::saveWidget(p1, file=&amp;quot;p1.html&amp;quot;, selfcontained = T)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;code-block-3-the-code-to-redner-the-stand-alone-plot.ly-chart&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Code Block 3: The code to render the stand-alone plot.ly chart&lt;/h3&gt;
&lt;p&gt;This also would normally have &lt;code&gt;echo=FALSE&lt;/code&gt; to run the code but not display it.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;htmltools::tags$iframe(
  src = &amp;quot;p1.html&amp;quot;, 
  scrolling = &amp;quot;no&amp;quot;, 
  seamless = &amp;quot;seamless&amp;quot;,
  frameBorder = &amp;quot;0&amp;quot;,
  height=400,
  width=800
)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;iframe src=&#34;p1.html&#34; scrolling=&#34;no&#34; seamless=&#34;seamless&#34; frameBorder=&#34;0&#34; height=&#34;400&#34; width=&#34;800&#34;&gt;&lt;/iframe&gt;&lt;/p&gt;
&lt;p&gt;And now as you can see, we have the plot.ly chart displayed. But the reading time is a much more manageable 5 minutes. This is because the source HTML now looks like:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;after.PNG&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Far fewer words. &lt;em&gt;Hope this helps&lt;/em&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Celebrating the Blog&#39;s First Birthday With googleAnalyticsR</title>
      <link>https://jlaw.netlify.app/2021/07/14/celebrating-the-blog-s-first-birthday-with-googleanalyticsr/</link>
      <pubDate>Wed, 14 Jul 2021 00:00:00 +0000</pubDate>
      <guid>https://jlaw.netlify.app/2021/07/14/celebrating-the-blog-s-first-birthday-with-googleanalyticsr/</guid>
      <description>
&lt;script src=&#34;https://jlaw.netlify.app/2021/07/14/celebrating-the-blog-s-first-birthday-with-googleanalyticsr/index_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;p&gt;On July 4th, 2020, I posted the &lt;a href=&#34;https://jlaw.netlify.app/2020/07/04/a-racing-barplot-of-top-us-baby-names-1880-2018/&#34;&gt;first article&lt;/a&gt; to this humble R blog as a small hobby to do something new while working from home through COVID. Very recently, this blog celebrated its first year and I wanted to leverage Google Analytics to look back at the last year: what’s done well, and when and where people were visiting from. Much of this content is heavily leveraged from &lt;a href=&#34;https://statsandr.com/blog/track-blog-performance-in-r/&#34;&gt;Antoine Soetewey’s Stats and R blog post&lt;/a&gt;, &lt;strong&gt;but&lt;/strong&gt; the numbers contained here will be &lt;em&gt;much&lt;/em&gt; smaller than on his.&lt;/p&gt;
&lt;p&gt;As it says on the home page, this blog was primarily meant for me to be able to have a more accessible set of code snippets and a reason to do random analyses to continue learning. The fact that people have taken the time to read it has been awesome and really the icing on a delicious cake.&lt;/p&gt;
&lt;p&gt;So to everyone currently reading, or who has read the blog in the last year: thank you so much! Now onto the recap!!&lt;/p&gt;
&lt;div id=&#34;libraries-and-set-up&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Libraries and Set-up&lt;/h2&gt;
&lt;p&gt;The libraries I’ll use in this analysis generally serve one of two functions: accessing and manipulating data from Google Analytics, which is the workhorse of this post, or creating and polishing data visualizations (plotly, scales, gghalves, ggflags).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(plotly) #For turning ggplots into INTERACTIVE ggplots
library(tidyverse) #General Data Manipulation
library(googleAnalyticsR) # To access the Google Analytics API
library(scales) # Making text prettier
library(gghalves) # Creating Half Boxplot / Half Point Plots
library(wesanderson) # To have some more fun colors
library(countrycode) # Convert Country Names to 2 Letter Codes
library(ggflags) # Plot Flags&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Getting Google Analytics set up was also somewhat tricky. As someone who’s not terribly familiar with Google Cloud Platform and was a little hazy about using the generic public account that comes with &lt;code&gt;googleAnalyticsR&lt;/code&gt;, I struggled a bit with getting the authentication correct. The &lt;a href=&#34;http://code.markedmondson.me/googleAnalyticsR/articles/rmarkdown.html&#34;&gt;googleAnalyticsR documentation&lt;/a&gt; provides some guidance for getting the authentication to work with markdown, but I trial-and-errored so much that I don’t think I can provide much guidance for what I did. I just kind of kept running &lt;code&gt;ga_auth_setup()&lt;/code&gt; until things seemed like they were working.&lt;/p&gt;
&lt;p&gt;But after getting client ids and auth ids into my R Environment, I can authenticate the markdown file with:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ga_auth()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As a last piece of set-up, most of the functions in &lt;code&gt;googleAnalyticsR&lt;/code&gt; take in a &lt;em&gt;view_id&lt;/em&gt; and a date range. Since those will all be the same (I’m looking at &lt;strong&gt;&lt;em&gt;this&lt;/em&gt;&lt;/strong&gt; blog over its first year), I’ll create those variables first so they can be referenced in each call:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;view_id &amp;lt;- ga_account_list()$viewId

start_date &amp;lt;- as.Date(&amp;quot;2020-07-04&amp;quot;)
end_date &amp;lt;- as.Date(&amp;quot;2021-07-03&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;the-headlines-users-and-sessions&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;The Headlines (Users and Sessions)&lt;/h2&gt;
&lt;p&gt;The first thing to explore will be the total number of &lt;em&gt;users&lt;/em&gt;, &lt;em&gt;sessions&lt;/em&gt;, and &lt;em&gt;pageviews&lt;/em&gt; that occurred over the first year of the R Blog. To access the Google Analytics API, I’ll use the &lt;code&gt;google_analytics()&lt;/code&gt; function. The parameters are pretty self-explanatory in that you give it your &lt;em&gt;ViewId&lt;/em&gt;, a date range, a set of metrics, and a set of dimensions to get data returned. The &lt;code&gt;anti_sample&lt;/code&gt; option will split up the call so that nothing gets sampled.&lt;/p&gt;
&lt;p&gt;A complete list of metrics and dimensions can be found in the &lt;a href=&#34;https://ga-dev-tools.appspot.com/dimensions-metrics-explorer/?&#34;&gt;Google Analytics Metrics and Dimension Explorer&lt;/a&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;totals &amp;lt;- google_analytics(view_id,
                           date_range = c(start_date, end_date),
                           metrics = c(&amp;quot;users&amp;quot;, &amp;quot;sessions&amp;quot;, &amp;quot;pageviews&amp;quot;),
                           anti_sample = TRUE 
)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Over the first year of the blog (July 4th, 2020 through July 3rd, 2021), I had 3,081 users visit with 4,211 sessions and 6,685 total page views. Given the relatively minimal promotion, I’ll call that a win 🏆.&lt;/p&gt;
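&lt;p&gt;Those three totals also imply a couple of simple engagement ratios. As a small sketch (assuming the one-row &lt;code&gt;totals&lt;/code&gt; data frame returned above), pages per session and sessions per user can be derived directly:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;## Sketch: derived ratios from the totals pull (assumes totals from above)
totals %&amp;gt;%
  mutate(pages_per_session = pageviews / sessions,
         sessions_per_user = sessions / users)
## With the figures above: 6685 / 4211 is about 1.59 pages per session
## and 4211 / 3081 is about 1.37 sessions per user&lt;/code&gt;&lt;/pre&gt;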
&lt;p&gt;I have a hypothesis that most of my views came in the days immediately following posts as I’m connected to the &lt;a href=&#34;https://www.r-bloggers.com/&#34;&gt;R-Bloggers&lt;/a&gt; aggregator. To check this hypothesis I’ll compare the time series of sessions to the days when posts were first made. Since the post dates are embedded in the URLs (/year/month/day/title), I’ll get the full URLs from Google Analytics and pull out the post dates from there:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Get all visited pages 
launch_dates &amp;lt;- google_analytics(view_id,
                                 date_range = c(start_date, end_date),
                                 metrics = c(&amp;quot;pageviews&amp;quot;), 
                                 dimensions = c(&amp;quot;pagePath&amp;quot;),
                                 anti_sample = TRUE
)

# Grab All the URLs that have the /year/month/day pattern and at 
#least 10 page views
launch_dates &amp;lt;- launch_dates %&amp;gt;%
  #Keep only rows that match my pattern
  filter(str_detect(pagePath, &amp;#39;/\\d+/\\d+/\\d+&amp;#39;)) %&amp;gt;%
  #Extract and convert the date components
  extract(pagePath, regex=&amp;#39;/(\\d+)/(\\d+)/(\\d+)/&amp;#39;,
         into = c(&amp;#39;year&amp;#39;, &amp;#39;month&amp;#39;, &amp;#39;day&amp;#39;)) %&amp;gt;% 
  #Turn the components into an actual date field
  mutate(dt = lubridate::ymd(paste(year, month, day, sep = &amp;#39;-&amp;#39;)),
         #Fixing an error in this logic
         dt = if_else(dt == lubridate::ymd(20201201), 
                      lubridate::ymd(20201206),
                      dt)) %&amp;gt;% 
  group_by(dt) %&amp;gt;% 
  summarize(pg = sum(pageviews)) %&amp;gt;% 
  filter(pg &amp;gt; 10)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now I can get the &lt;strong&gt;sessions over time&lt;/strong&gt; from Google Analytics and overlay the launch dates on top of them:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;sessions_over_time &amp;lt;- google_analytics(view_id,
                           date_range = c(start_date, end_date),
                           metrics = c(&amp;quot;sessions&amp;quot;),
                           dimensions = c(&amp;quot;date&amp;quot;),
                           anti_sample = TRUE 
)

sessions_over_time %&amp;gt;% 
  left_join(launch_dates, by = c(&amp;quot;date&amp;quot; = &amp;quot;dt&amp;quot;), keep = T) %&amp;gt;% 
  ggplot(aes(x = date, y = sessions)) + 
    geom_line() + 
    geom_point(aes(x = dt), color = &amp;#39;darkblue&amp;#39;, size = 3) + 
    scale_x_date(date_breaks = &amp;#39;month&amp;#39;, date_labels = &amp;#39;%b %Y&amp;#39;) + 
    labs(x = &amp;quot;Date&amp;quot;, y = &amp;quot;# of Sessions&amp;quot;, 
         title = &amp;quot;Sessions Over the Last Year&amp;quot;,
         subtitle = &amp;quot;Blue dots represent post dates&amp;quot;) + 
    cowplot::theme_cowplot()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2021/07/14/celebrating-the-blog-s-first-birthday-with-googleanalyticsr/index_files/figure-html/sessions_over_time-1.png&#34; width=&#34;1152&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Based on the post dates (blue dots), it does not seem like post dates line up with the highest-volume days. While there are some peaks on post days, particularly in March through July, there are a number of large spikes that occur a bit after the posting dates.&lt;/p&gt;
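&lt;p&gt;One rough way to check that impression numerically would be to flag each day by whether it falls within a week of a post date and compare average sessions. This is just a sketch (the 7-day window is my own arbitrary choice) that assumes the &lt;code&gt;sessions_over_time&lt;/code&gt; and &lt;code&gt;launch_dates&lt;/code&gt; data frames from above:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;## Sketch: mean sessions within vs. beyond 7 days of a post date
## (assumes sessions_over_time and launch_dates exist as above)
sessions_over_time %&amp;gt;%
  mutate(near_post = map_lgl(date,
                             ~ any(abs(as.numeric(.x - launch_dates$dt)) &amp;lt;= 7))) %&amp;gt;%
  group_by(near_post) %&amp;gt;%
  summarize(avg_sessions = mean(sessions), days = n())&lt;/code&gt;&lt;/pre&gt;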
&lt;div id=&#34;looking-at-monthly-active-users&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Looking at Monthly Active Users&lt;/h3&gt;
&lt;p&gt;For whatever reason I thought it would be cool to have 1,000 monthly active users on the blog (1,000 unique visitors in a 30 day period). Given that there were only 3,081 users throughout the course of the year, it doesn’t seem likely that I made this goal. But fortunately we don’t have to guess:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mau &amp;lt;- google_analytics(view_id,
                           date_range = c(start_date, end_date),
                           metrics = c(&amp;quot;30dayUsers&amp;quot;), 
                           dimensions = c(&amp;quot;date&amp;quot;),
                           anti_sample = TRUE
)

mau %&amp;gt;% 
  ggplot(aes(x = date, y = `30dayUsers`)) + 
  geom_line(color = wes_palette(&amp;#39;Moonrise2&amp;#39;, n=1, &amp;#39;discrete&amp;#39;)) + 
  geom_smooth(se = F, lty = 2, color = wes_palette(&amp;#39;BottleRocket1&amp;#39;, 1)) + 
  labs(x = &amp;quot;Date&amp;quot;, y = &amp;quot;# of Sessions&amp;quot;, 
       title = &amp;quot;Monthly Active Users (30 Days)&amp;quot;,
       subtitle = &amp;quot;Smoothed Line in Red&amp;quot;) + 
  cowplot::theme_cowplot()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2021/07/14/celebrating-the-blog-s-first-birthday-with-googleanalyticsr/index_files/figure-html/mau-1.png&#34; width=&#34;1152&#34; /&gt;&lt;/p&gt;
&lt;p&gt;The blog definitely became more popular in April (awesome). But sadly, the monthly active user count topped out at 872 😢. Better luck in 2021-2022.&lt;/p&gt;
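&lt;p&gt;The 872 figure comes straight out of the &lt;code&gt;mau&lt;/code&gt; data frame; a one-line sketch (assuming &lt;code&gt;mau&lt;/code&gt; as pulled above) finds the peak and the date it occurred:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;## Sketch: peak 30-day active user count and when it happened
mau %&amp;gt;% slice_max(`30dayUsers`, n = 1)&lt;/code&gt;&lt;/pre&gt;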
&lt;/div&gt;
&lt;div id=&#34;days-of-the-week&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Days of the Week&lt;/h3&gt;
&lt;p&gt;Next up is looking at the number of sessions split by days of the week. Just for fun here, I’ll utilize the &lt;code&gt;gghalves&lt;/code&gt; package which allows you to &lt;a href=&#34;https://erocoar.github.io/gghalves/&#34;&gt;create hybrid geoms&lt;/a&gt;. In this case, I’ll make a half box plot, half point plot to be able to show the distribution in box plot form but also get a better idea of the actual distribution from the points. The &lt;em&gt;side&lt;/em&gt; parameter tells the function to plot on the left or right half.&lt;/p&gt;
&lt;p&gt;Since most days have very few sessions, the plot uses a log10 scale.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;sessions_dow &amp;lt;- google_analytics(view_id,
                                date_range = c(start_date, end_date),
                                metrics = c(&amp;quot;sessions&amp;quot;),
                                dimensions = c(&amp;quot;Date&amp;quot;, &amp;quot;dayOfWeekName&amp;quot;),
                                anti_sample = TRUE
)

sessions_dow %&amp;gt;% 
  # Code the text labels to a Factor
  mutate(dayOfWeekName = factor(dayOfWeekName,
                                levels = c(&amp;#39;Sunday&amp;#39;, &amp;#39;Monday&amp;#39;, &amp;#39;Tuesday&amp;#39;, 
                                           &amp;#39;Wednesday&amp;#39;, &amp;#39;Thursday&amp;#39;, &amp;#39;Friday&amp;#39;,
                                           &amp;#39;Saturday&amp;#39;))) %&amp;gt;%
  ggplot(aes(x = dayOfWeekName, y = sessions, fill = dayOfWeekName)) + 
    geom_half_boxplot(side = &amp;#39;l&amp;#39;, outlier.shape = NA) + 
    geom_half_point(side = &amp;#39;r&amp;#39;, aes(color = dayOfWeekName)) +
    labs(title = &amp;#39;What is the Day of Week Distribution of Sessions?&amp;#39;,
         x = &amp;quot;Day of Week&amp;quot;,
         y = &amp;quot;Sessions&amp;quot;) + 
    scale_y_log10() + 
    scale_fill_manual(guide = &amp;#39;none&amp;#39;, 
                      values = wes_palette(&amp;#39;Zissou1&amp;#39;, n =7, type = &amp;#39;continuous&amp;#39;)) + 
    scale_color_manual(guide = &amp;#39;none&amp;#39;,
                       values = wes_palette(&amp;#39;Zissou1&amp;#39;, n =7, type = &amp;#39;continuous&amp;#39;)) + 
    cowplot::theme_cowplot()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2021/07/14/celebrating-the-blog-s-first-birthday-with-googleanalyticsr/index_files/figure-html/dow-1.png&#34; width=&#34;1152&#34; /&gt;&lt;/p&gt;
&lt;p&gt;It seems like Monday is the most popular day, with a slight decline throughout the rest of the week. The median number of sessions is fairly similar across the weekdays, but there is a higher ceiling for Monday, Tuesday, and Wednesday than there is for Thursday and Friday.&lt;/p&gt;
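&lt;p&gt;The medians and ceilings being eyeballed from the plot can also be pulled out numerically; a quick sketch assuming the &lt;code&gt;sessions_dow&lt;/code&gt; pull from above:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;## Sketch: median and max sessions by day of week
sessions_dow %&amp;gt;%
  group_by(dayOfWeekName) %&amp;gt;%
  summarize(median_sessions = median(sessions),
            max_sessions = max(sessions)) %&amp;gt;%
  arrange(desc(median_sessions))&lt;/code&gt;&lt;/pre&gt;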
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;sources-and-pages&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Sources and Pages&lt;/h2&gt;
&lt;p&gt;I don’t do a ton of promotion of the blog but I &lt;strong&gt;am&lt;/strong&gt; very interested in knowing how people are getting to the site as well as what pages people gravitated to the most.&lt;/p&gt;
&lt;div id=&#34;how-are-people-getting-to-the-site&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;How are people getting to the Site?&lt;/h3&gt;
&lt;p&gt;Google Analytics provides the referral source for site visitors. Let’s take a look at the top 10 referral sources to the site:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;sources &amp;lt;- google_analytics(view_id,
                               date_range = c(start_date, end_date),
                               metrics = c(&amp;quot;sessions&amp;quot;),
                               dimensions = c(&amp;quot;source&amp;quot;),
                               anti_sample = TRUE 
)

sources %&amp;gt;% 
  mutate(pct = sessions / sum(sessions)) %&amp;gt;% 
  #Get top 10 rows by session value
  slice_max(sessions, n = 10) %&amp;gt;%
  ggplot(aes(x = fct_reorder(source, sessions), 
             y = sessions,
             fill = source)) + 
  geom_col() +
  geom_text(aes(label = paste0(sessions %&amp;gt;% comma(accuracy = 1), &amp;#39; (&amp;#39;,
                               pct %&amp;gt;% percent(accuracy = .1), &amp;#39;)&amp;#39;)),
            nudge_y = 80) + 
  scale_y_continuous(expand = c(0, 0)) + 
  scale_fill_manual(guide = &amp;#39;none&amp;#39;,
                    values = wes_palette(&amp;#39;FantasticFox1&amp;#39;, n=10, &amp;#39;continuous&amp;#39;)) + 
  labs(x = &amp;quot;Referral Sources&amp;quot;, y = &amp;quot;# of Sessions&amp;quot;,
       title = &amp;quot;Where Did People Visiting the Blog Come From?&amp;quot;) + 
  coord_flip(ylim = c(0, 1600)) + 
  cowplot::theme_cowplot() + 
  theme(
    plot.title.position = &amp;#39;plot&amp;#39;
  )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2021/07/14/celebrating-the-blog-s-first-birthday-with-googleanalyticsr/index_files/figure-html/sources-1.png&#34; width=&#34;1152&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Somewhat surprising to me is that nearly a third of sessions are direct to the site and another 20% are from Google. Given that I think &lt;a href=&#34;https://www.r-bloggers.com/&#34;&gt;R-Bloggers&lt;/a&gt; is probably my primary mechanism of promotion, I’m not surprised that it’s in the Top 3, but kind of surprised that it is only #3. It is also kind of cool to see referrals from rweekly.org and LinkedIn, where I don’t know exactly how my blog is popping up… but I’m happy that it is!&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;what-are-the-most-visited-posts-on-the-site&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;What are the Most Visited Posts on the Site?&lt;/h3&gt;
&lt;p&gt;One of the most obvious questions for this post is which prior posts generated the most views. Because there are non-post pages on the site (such as the home page), I’ll need to do some cleaning to keep only the actual posts. Then we can look at the top 10 posts by page views.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;top_pages &amp;lt;- google_analytics(view_id,
                               date_range = c(start_date, end_date),
                               metrics = c(&amp;quot;pageviews&amp;quot;),
                               dimensions = c(&amp;quot;pageTitle&amp;quot;),
                               anti_sample = TRUE
)

top_pages %&amp;gt;% 
  # Remove the &amp;quot; | JLaw&amp;#39;s R Blog&amp;quot; suffix that ends every page title
  mutate(pageTitle = str_remove_all(pageTitle, &amp;quot; \\| JLaw&amp;#39;s R Blog&amp;quot;)) %&amp;gt;%
  # Remove the Main Post Page, the Home Page, and Unknown Pages
  filter(!pageTitle %in% c(&amp;#39;(not set)&amp;#39;, &amp;quot;JLaw&amp;#39;s R Blog&amp;quot;, &amp;quot;Posts&amp;quot;)) %&amp;gt;%
  # Keep the Top 10 By Page Views
  slice_max(pageviews, n = 10) %&amp;gt;%
  ggplot(aes(x = fct_reorder(str_wrap(pageTitle, 75), pageviews), 
             y = pageviews,
             fill = pageTitle)) + 
  geom_col() +
  geom_text(aes(label = pageviews %&amp;gt;% comma(accuracy = 1)),
            hjust = 1) + 
  scale_fill_discrete(guide = F) + 
  scale_y_continuous(expand = c(0, 0)) + 
  labs(x = &amp;quot;&amp;quot;, y = &amp;quot;# of Users&amp;quot;,
       title = &amp;quot;Most Popular Posts&amp;quot;) + 
  coord_flip() + 
  cowplot::theme_cowplot() + 
  theme(
    plot.title.position = &amp;#39;plot&amp;#39;
  )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2021/07/14/celebrating-the-blog-s-first-birthday-with-googleanalyticsr/index_files/figure-html/post1-1.png&#34; width=&#34;1152&#34; /&gt;&lt;/p&gt;
&lt;p&gt;I’m not surprised that “Scraping the Google Play Store with RSelenium” is the Top Post on the site as it got picked up by at least one other website that I was aware of. Also, as far as I know it’s not a very common topic. Similarly, my post on &lt;code&gt;arulesSequence&lt;/code&gt; isn’t surprising as that’s an interesting package with not a ton of blog posts about it. However, I did not realize that “7 Things I Learned During Advent of Code 2020” was as popular as it was. And finally, it makes me kind of happy that visualizing the Dancing with the Stars winners with &lt;code&gt;gt&lt;/code&gt; was number 4. I really like that post, and Hugo (how I generate this site) got really confused and claims that the reading time is an hour when it is much shorter. So I’m happy that people weren’t too scared off.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;HOWEVER&lt;/strong&gt;, while it’s good to know which posts are the most popular in general, some of these posts are older than others and have had more of a chance to generate page views. For example, the Instagram Lite post is from late June while the Advent of Code post is from December. To counter this, I can look at the cumulative number of page views from the first page view date. Then we can see which post is accumulating views the fastest. To do this, I’m going to create a static ggplot but then use &lt;code&gt;ggplotly&lt;/code&gt; to make it interactive:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pages_by_time &amp;lt;- google_analytics(view_id,
                                  date_range = c(start_date, end_date),
                                  metrics = c(&amp;quot;pageviews&amp;quot;),
                                  dimensions = c(&amp;quot;date&amp;quot;, &amp;quot;pageTitle&amp;quot;),
                                  anti_sample = TRUE 
)

p &amp;lt;- pages_by_time %&amp;gt;% 
  filter(pageTitle != &amp;#39;(not set)&amp;#39;) %&amp;gt;% 
  #Filter out pages with less than 50 pageviews
  add_count(pageTitle, wt = pageviews, name = &amp;quot;total_views&amp;quot;) %&amp;gt;% 
  filter(total_views &amp;gt;= 50) %&amp;gt;% 
  # Calculate Days Since Post and Cumulative Number of Views
  group_by(pageTitle) %&amp;gt;% 
  arrange(pageTitle, date) %&amp;gt;% 
  mutate(
    min_date = min(date),
    days_since_post = date - min(date),
    cuml_views = cumsum(pageviews)) %&amp;gt;% 
  ungroup() %&amp;gt;% 
  #The text aesthetic allows me to add that field into the tooltip for plotly
  ggplot(aes(x = days_since_post, y = cuml_views, color = pageTitle, text = min_date)) + 
    geom_line() + 
    coord_cartesian(xlim = c(0, 100), ylim = c(0, 550)) + 
    labs(title = &amp;quot;Which Posts Got the Views the Fastest?&amp;quot;,
         subtitle = &amp;quot;First 100 Days Since Post Date&amp;quot;,
         y = &amp;quot;Cumulative Page Views&amp;quot;,
         x = &amp;quot;Days Since Post Date&amp;quot;) +
    cowplot::theme_cowplot() + 
    # This will work with Plotly while scale_color_discrete(guide = F) will not
    theme(legend.position=&amp;#39;none&amp;#39;) 

# Create Interactive Version of GGPLOT
ggplotly(p)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;iframe src=&#34;p1.html&#34; scrolling=&#34;no&#34; seamless=&#34;seamless&#34; frameBorder=&#34;0&#34; height=&#34;400&#34; width=&#34;800&#34;&gt;&lt;/iframe&gt;&lt;/p&gt;
&lt;p&gt;Now it’s a little clearer to see the bump that the RSelenium post got around day 8 that shot it to most popular. Also, the Instagram Lite post actually has the most views of any post at 8 days since publishing. However, its trajectory is beginning to flatten, and while it seems like it will be one of the more popular posts, it doesn’t seem like it will catch RSelenium.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;what-countries-are-people-visiting-the-site-from&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;What Countries are People Visiting The Site From?&lt;/h2&gt;
&lt;p&gt;The blog was visited by users from 134 countries throughout the year, which is pretty crazy to think about. We can look at the distribution of countries by users to see whether the blog is most popular in the US (which is expected) or if it has a stronger than expected international appeal. To add some pizzazz to the graph, I’ll use the &lt;code&gt;countrycode&lt;/code&gt; package to convert the country names into two-letter codes and then use &lt;code&gt;ggflags&lt;/code&gt; to add the flags to the plot (note that &lt;code&gt;geom_flag&lt;/code&gt; works by having a &lt;em&gt;country&lt;/em&gt; aesthetic set).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;users_by_country &amp;lt;- google_analytics(view_id,
                               date_range = c(start_date, end_date),
                               metrics = c(&amp;quot;users&amp;quot;),
                               dimensions = c(&amp;quot;country&amp;quot;),
                               anti_sample = TRUE 
)

users_by_country %&amp;gt;% 
  filter(country != &amp;#39;(not set)&amp;#39;) %&amp;gt;% 
  #Get % Column and Recode Countries to the iso2c standard
  mutate(pct = users/sum(users),
         code = str_to_lower(countrycode(country, 
                                         origin = &amp;#39;country.name.en&amp;#39;, 
                                         destination = &amp;#39;iso2c&amp;#39;)
                             )
         ) %&amp;gt;%
  # Get Top 10 Countries by # of Users
  slice_max(users, n = 10) %&amp;gt;% 
  ggplot(aes(x = fct_reorder(country, users), 
             y = users,
             fill = country,
             country = code)) + 
    geom_col() +
    geom_text(aes(label = paste0(users %&amp;gt;% comma(accuracy = 1), 
                                 &amp;#39; (&amp;#39;, pct %&amp;gt;% percent(accuracy = .1), &amp;#39;)&amp;#39;)),
              nudge_y = 50) + 
    geom_flag(y = 30, size = 15) + 
    scale_fill_discrete(guide = F) + 
    scale_y_continuous(expand = c(0, 0)) + 
    labs(x = &amp;quot;Country&amp;quot;, y = &amp;quot;# of Users&amp;quot;,
         title = &amp;quot;Where Did Users Come From?&amp;quot;) + 
    coord_flip(ylim = c(0, 1100)) + 
    cowplot::theme_cowplot()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2021/07/14/celebrating-the-blog-s-first-birthday-with-googleanalyticsr/index_files/figure-html/users_by_country-1.png&#34; width=&#34;1152&#34; /&gt;&lt;/p&gt;
&lt;p&gt;As expected, the US is where the most users are located, with close to 30% of all users. However, what’s a bit surprising is that 70% of the users are &lt;em&gt;NOT&lt;/em&gt; from the US. And in the Top 10 countries there’s pretty good representation across the continents, with North America, South America, Europe, Asia, and Australia all represented (Africa gets its first representation at #29 with Nigeria).&lt;/p&gt;
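&lt;p&gt;Both the 134-country count and the US share can be checked directly against the &lt;code&gt;users_by_country&lt;/code&gt; data frame. A small sketch, assuming that pull from above and assuming Google Analytics labels the US as “United States”:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;## Sketch: number of countries and US share of users
## (assumes users_by_country from above)
users_by_country %&amp;gt;%
  filter(country != &amp;#39;(not set)&amp;#39;) %&amp;gt;%
  summarize(n_countries = n_distinct(country),
            us_share = sum(users[country == &amp;#39;United States&amp;#39;]) / sum(users))&lt;/code&gt;&lt;/pre&gt;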
&lt;/div&gt;
&lt;div id=&#34;concluding-thoughts&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Concluding Thoughts&lt;/h2&gt;
&lt;p&gt;First and foremost, thank you to everyone who has supported the blog by reading it over the past year. This really did start out as a small hobby for myself during COVID but I hope that others have found some value in the various posts. For this post in particular, I hope it displays all the things you can find within Google Analytics. For me personally, it made me happy to use this post to reflect on the first year of the blog and see the reach that a single person doing this in their spare time can have. So again, thank you all and onto Year 2 (and another shot at that 1,000 Monthly Active User goal!!)&lt;/p&gt;
&lt;center&gt;
&lt;img src=&#34;Thank-you-word-cloud.jpg&#34; /&gt;
&lt;/center&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>What Are People Sayin&#39; About Instagram Lite?</title>
      <link>https://jlaw.netlify.app/2021/06/26/what-people-are-sayin-about-instagram-lite/</link>
      <pubDate>Sat, 26 Jun 2021 00:00:00 +0000</pubDate>
      <guid>https://jlaw.netlify.app/2021/06/26/what-people-are-sayin-about-instagram-lite/</guid>
      <description>
&lt;script src=&#34;https://jlaw.netlify.app/2021/06/26/what-people-are-sayin-about-instagram-lite/index_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;p&gt;In the beginning of May, I used &lt;a href=&#34;https://jlaw.netlify.app/2021/05/03/scraping-google-play-reviews-with-rselenium/&#34;&gt;&lt;code&gt;RSelenium&lt;/code&gt; to scrape the Google Play Store reviews&lt;/a&gt; for Instagram Lite to demonstrate how the package can be used to automate browser behavior. It’s taken longer than I had initially planned to do this follow-up analysis of that data, but better late than never. So in this analysis I will do some exploratory work and some text mining to look at questions such as:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;How have IG Lite reviews been trending?&lt;/li&gt;
&lt;li&gt;What are prevalent topics in the Google Play reviews about IGLite?&lt;/li&gt;
&lt;li&gt;For words with negative sentiment, why are people feeling negatively?&lt;/li&gt;
&lt;li&gt;What are the most prevalent keywords in the set of reviews?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The main libraries that I will use to do this analysis are &lt;code&gt;udpipe&lt;/code&gt; for applying the language model used for part-of-speech tagging, &lt;code&gt;BTM&lt;/code&gt; to construct the Biterm model, and &lt;code&gt;textrank&lt;/code&gt; / &lt;code&gt;wordcloud&lt;/code&gt; to do keyword extraction and make the wordcloud. All of &lt;code&gt;udpipe&lt;/code&gt;, &lt;code&gt;BTM&lt;/code&gt;, and &lt;code&gt;textrank&lt;/code&gt; are part of the &lt;a href=&#34;http://www.bnosac.de&#34;&gt;Bnosac&lt;/a&gt; NLP ecosystem.&lt;/p&gt;
&lt;p&gt;The analyses in this post are heavily inspired by Bnosac’s posts on &lt;a href=&#34;http://www.bnosac.be/index.php/blog/98-biterm-topic-modelling-for-short-texts&#34;&gt;Biterm Modeling&lt;/a&gt; and &lt;a href=&#34;http://www.bnosac.be/index.php/blog/85-you-did-a-sentiment-analysis-with-tidytext-but-you-forgot-to-do-dependency-parsing-to-answer-why-is-something-positive-negative&#34;&gt;Sentiment Analysis&lt;/a&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidyverse)  # General Data Manipulation
library(lubridate) # Date Manipulations
library(extrafont)  # To use more fun fonts in GGPLOT
loadfonts(device = &amp;quot;win&amp;quot;)
library(udpipe) # Tokenizing, Lemmatising, Tagging and Dependency Parsing
library(BTM) # Biterm Topic Modeling
library(scales) # To help format  plots
library(textrank) # Keyword Extraction
library(wordcloud) # Create wordcloud&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For data I’ll be using the result file from my &lt;a href=&#34;https://jlaw.netlify.app/2021/05/03/scraping-google-play-reviews-with-rselenium/&#34;&gt;web scraping post&lt;/a&gt; from April:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;iglite &amp;lt;- read_csv(&amp;#39;https://raw.githubusercontent.com/jtlawren67/jlawblog/master/content/post/2021-05-03-scraping-google-play-reviews-with-rselenium/data/review_data.csv&amp;#39;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As a reminder the data looks like:&lt;/p&gt;
&lt;table&gt;
&lt;colgroup&gt;
&lt;col width=&#34;3%&#34; /&gt;
&lt;col width=&#34;1%&#34; /&gt;
&lt;col width=&#34;2%&#34; /&gt;
&lt;col width=&#34;1%&#34; /&gt;
&lt;col width=&#34;91%&#34; /&gt;
&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;names&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;stars&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;dates&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;clicks&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;reviews&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Harikrishnan&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;3&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2021-04-05&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;4787&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Its surely consumes less data than original app, but many of you may not get comfortable with this interface. One of the major problems I faced was that stories are getting replayed many times without me doing anything. The next major issue is that if you dont like a post it comes to your feed everytime over and over again until you like the post. Hope Instgram Team will find a solution to these problems&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Piyush AryaPrakash&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2021-04-06&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;3655&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;It’s good to see that they are providing a lite version. But it doesn’t even work . It’s better to use in chrome than downloading lite. What’s the problem - The feeds never get refreshed . You just have to scroll down and when you click refresh still you see the same feeds. Doesn’t support links . Lags too much . Too much annoying while using the messenger. Despite having a good internet connection it keeps laging saying something went wrong. It’s too slow&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Badri narayan&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;4&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2021-04-24&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;40&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Very nice app as it is lite so it is good consume less data have limited things but I don’t understand you can watch reels in app but if someone send you reels it shows not supported in lite so it should be fixed and during dark mode the text we type is not visible fix this too and everything is good &amp;lt;U+0001F917&amp;gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;div id=&#34;exploring-the-ig-lite-review-data&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Exploring the IG Lite Review Data&lt;/h2&gt;
&lt;p&gt;At the time the initial analysis was run, I had captured 2,040 reviews covering dates from 2019-03-03 through 2021-04-24. However, reviews from before December 2020 likely refer to the initial version of IG Lite rather than the &lt;a href=&#34;https://techcrunch.com/2020/12/16/facebook-launches-new-instagram-lite-app-in-india-global-rollout-to-follow-later/&#34;&gt;relaunched version&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The first thing to look at is how the review counts have trended over time:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;iglite %&amp;gt;% 
  count(dates, name = &amp;quot;reviews&amp;quot;) %&amp;gt;%
  filter(dates &amp;gt;= lubridate::ymd(20201201)) %&amp;gt;%
  ggplot(aes(x = dates, y = reviews)) + 
    geom_line() + 
    geom_smooth(se = F, lty = 2) + 
    labs(y = &amp;quot;# of Reviews in data set&amp;quot;, x = &amp;quot;Month&amp;quot;,
         title = &amp;quot;Number of IGLite Reviews In Dataset&amp;quot;) + 
    cowplot::theme_cowplot() +
    theme(
      plot.title.position = &amp;#39;plot&amp;#39;,
      text = element_text(family = &amp;#39;Arial Narrow&amp;#39;)
    )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2021/06/26/what-people-are-sayin-about-instagram-lite/index_files/figure-html/review_count-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Review volume started strong in mid-December upon the launch of IG Lite, stabilized at around 10 per day, and then began climbing in February to reach around 20 reviews per day. So if we assume that increasing reviews are correlated with increasing users, IG Lite appears to be gaining momentum.&lt;/p&gt;
&lt;p&gt;But are the reviews good? For an app that is continuously iterating, it would be interesting to see how the distribution of star ratings from 1 to 5 has changed over time as more reviews come in. To do this we can look at the cumulative distribution of each star rating from December 2020 through April 2021.&lt;/p&gt;
&lt;p&gt;Since certain days do not have coverage across all 5 ratings (remember, we were only getting about 10 reviews per day at the beginning), I’ll need to create a skeleton containing every day and all five ratings so that zeros are counted rather than treated as gaps. For this I’ll use tidyr’s &lt;code&gt;crossing()&lt;/code&gt; function, which, much like &lt;code&gt;expand.grid()&lt;/code&gt; in base R, creates a data set with all combinations of the input vectors.&lt;/p&gt;
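As a quick sketch of the skeleton idea in base R (the dates here are made up for illustration), `expand.grid()` produces every date × star combination, analogous to what `crossing()` does below:

```r
# Build a complete date x star skeleton so that days with zero reviews of a
# given rating still appear as explicit rows (base-R analogue of crossing()).
dates <- seq.Date(as.Date("2020-12-01"), as.Date("2020-12-03"), by = "day")
skeleton <- expand.grid(dates = dates, stars = 1:5)
nrow(skeleton)  # 3 days x 5 ratings = 15 rows
```

Joining the real review counts onto this skeleton and replacing the resulting `NA`s with 0 is what makes the cumulative sums below correct.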
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#Create a data frame with every day from 12/1/2020 through the max date and  1-5 
#value for stars on each day
tidyr::crossing(
  dates = seq.Date(ymd(20201201), max(iglite$dates), by = &amp;#39;day&amp;#39;),
  stars = 1:5
  ) %&amp;gt;% 
  # Join  actual data to the skeleton to get the number of reviews for that day
  left_join(
    iglite %&amp;gt;%
      count(dates, stars, name = &amp;quot;reviews&amp;quot;) %&amp;gt;%
      filter(dates &amp;gt;= lubridate::ymd(20201201)),
    by = c(&amp;quot;dates&amp;quot;, &amp;quot;stars&amp;quot;)
  ) %&amp;gt;% 
  # Fill any missing values with 0
  replace_na(list(reviews = 0)) %&amp;gt;%
  # Create the cumulative count of reviews for each star level
  group_by(stars) %&amp;gt;%
  arrange(dates) %&amp;gt;% 
  mutate(cuml_stars = cumsum(reviews)) %&amp;gt;%
  ungroup() %&amp;gt;% 
  # Add a column for the cumulative count of reviews for up to that point
  add_count(dates, wt = cuml_stars, name = &amp;quot;total_review_in_date&amp;quot;) %&amp;gt;%
  # Create the cumulative distribution for that star level to that point
  # For the most recent day create a label to be used in the post
  mutate(pct = cuml_stars / total_review_in_date,
         lbl = if_else(dates == max(dates), 
                       paste(stars, pct %&amp;gt;% percent(accuracy = 1), sep = &amp;#39;: &amp;#39;), 
                       NA_character_)) %&amp;gt;% 
  # Remove the dates prior to having 25 total reviews
  filter(total_review_in_date &amp;gt;= 25) %&amp;gt;% 
  # Plot the distribution
  ggplot(aes(x = dates, y = pct, color = as.factor(stars))) + 
    geom_line() + 
    ggrepel::geom_label_repel(aes(label = lbl)) + 
    scale_color_discrete(guide = F) + 
    scale_y_continuous(labels = percent) + 
    labs(title = &amp;quot;IGLite Rating Distribution&amp;quot;,
         subtitle = &amp;quot;Cumulative Distribution Dec - Apr&amp;quot;,
         caption = &amp;quot;Dates Start at 25 Reviews&amp;quot;) + 
    cowplot::theme_cowplot() + 
    theme(
      plot.title.position = &amp;#39;plot&amp;#39;,
      text = element_text(family = &amp;#39;Arial Narrow&amp;#39;)
    )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2021/06/26/what-people-are-sayin-about-instagram-lite/index_files/figure-html/cumulative%20ratings-1.png&#34; width=&#34;672&#34; /&gt;
Looking at the distributions over time, in January one- and three-star ratings were the most common at around 25% each, while fives and twos were relatively rare. Since January, however, the share of five-star reviews has climbed to 23% of the data set. Unfortunately, the share of one-star reviews has also climbed, making it the most common rating in the data set at 31%.&lt;/p&gt;
&lt;p&gt;An alternative way of using the star ratings is to create a Net Promoter-like score. If you’ve ever received an email asking “On a scale from 1 to 10, how likely are you to recommend this to a friend?”, you’ve taken part in a Net Promoter survey. &lt;a href=&#34;https://en.wikipedia.org/wiki/Net_Promoter&#34;&gt;The Net Promoter Score is an index from -100 to 100 measuring how willing people are to recommend a product&lt;/a&gt;. It divides respondents into Promoters (scores of 9 and 10) and Detractors (scores of 6 and below) and then calculates % Promoters - % Detractors.&lt;/p&gt;
&lt;p&gt;In this case, I’ll consider a promoter as someone who rates IGLite a 4 or a 5 and a detractor someone who rates IGLite a 1 or a 2. Then we can calculate our version of NPS for each month to get a rough look at sentiment trend.&lt;/p&gt;
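On a hypothetical vector of star ratings (invented for illustration), this NPS-style calculation reduces to two proportions and a difference:

```r
# NPS-like score from star ratings: promoters are 4-5 stars,
# detractors are 1-2 stars, 3 stars count as neutral.
stars <- c(5, 5, 4, 3, 2, 1, 1, 1)
pct_favorable   <- mean(stars >= 4)  # 3/8
pct_unfavorable <- mean(stars <= 2)  # 4/8
nps <- pct_favorable - pct_unfavorable
nps  # 0.375 - 0.5 = -0.125
```

The monthly chart below applies the same arithmetic after grouping reviews by month.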
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;iglite %&amp;gt;%
  # Filter to December
  filter(dates &amp;gt;= lubridate::ymd(20201201)) %&amp;gt;% 
  # Turn star scores into Promoter / Detractors and create a dataset where
  # for each day we&amp;#39;ll have Favorable/Unfavorable/Neutral as columns
  mutate(lbl = case_when(
    stars &amp;gt;= 4 ~ &amp;quot;favorable&amp;quot;,
    stars &amp;lt;= 2 ~ &amp;quot;unfavorable&amp;quot;,
    TRUE ~ &amp;quot;neutral&amp;quot;
    ),
    mth = format(dates, &amp;quot;%Y-%m&amp;quot;)
  ) %&amp;gt;% 
  count(mth, lbl, name = &amp;quot;reviews&amp;quot;) %&amp;gt;%
  spread(lbl, reviews) %&amp;gt;% 
  replace_na(list(favorable = 0, unfavorable = 0, neutral = 0)) %&amp;gt;% 
  # Calculate the NPS score
  mutate(
         total = favorable + neutral + unfavorable,
         pct_favorable = favorable/total,
         pct_unfavorable = unfavorable/total,
         nps = pct_favorable - pct_unfavorable
         ) %&amp;gt;%
  # Plot the NPS score by month
  ggplot(aes(x = mth, y = nps), group = 1) + 
    geom_col(aes(fill = if_else(nps &amp;lt; 0, &amp;#39;darkred&amp;#39;, &amp;#39;darkgreen&amp;#39;))) + 
    geom_point() + 
    geom_label(aes(label = nps %&amp;gt;% percent(accuracy = .1))) + 
    scale_fill_discrete(guide = F) + 
    labs(title = &amp;quot;NPS Score for IGLite&amp;quot;,
         subtitle = &amp;quot;NPS = % Promoters (Reviews &amp;gt; 3) - % Detractors (Reviews &amp;lt; 3)&amp;quot;,
         y = &amp;quot;Net Promoter Score&amp;quot;,
         x = &amp;quot;Month&amp;quot;) + 
   cowplot::theme_cowplot() + 
    theme(
      plot.title.position = &amp;#39;plot&amp;#39;,
      text = element_text(family = &amp;#39;Arial Narrow&amp;#39;),
      axis.text.y = element_blank(),
      axis.ticks.y = element_blank()
    )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2021/06/26/what-people-are-sayin-about-instagram-lite/index_files/figure-html/nps-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Yikes! This does not look great, with each of the 5 months in the data having a negative NPS score. However, similar to the cumulative ratings above, the later months (March and April) have fared much better than the first two months post-relaunch (January and February), with NPS scores close to zero. Looking at the raw data, the near-zero score reflects a polarized audience, with 42% Promoters and 43% Detractors, rather than a large share of neutral 3-star ratings:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;center&#34;&gt;Month&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;Total Reviews&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;% Favorable&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;% Neutral&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;% Unfavorable&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;NPS&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;2020-12&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;301&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;33.9%&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;25.9%&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;40.2%&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;-6.3%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;2021-01&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;248&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;33.9%&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;16.1%&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;50.0%&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;-16.1%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;2021-02&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;361&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;34.6%&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;15.0%&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;50.4%&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;-15.8%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;2021-03&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;565&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;40.0%&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;18.4%&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;41.6%&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;-1.6%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;2021-04&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;540&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;41.7%&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;15.4%&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;43.0%&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;-1.3%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;
&lt;div id=&#34;text-mining&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Text-Mining&lt;/h2&gt;
&lt;p&gt;With the EDA portion done, it’s on to text mining the reviews. In a past post I used &lt;a href=&#34;https://jlaw.netlify.app/2020/08/02/what-s-the-difference-between-instagram-and-tiktok-using-word-embeddings-to-find-out/&#34;&gt;the Tidytext ecosystem to look at Tweet differences between Instagram and TikTok&lt;/a&gt;, but this time I will use the &lt;a href=&#34;http://www.bnosac.be&#34;&gt;Bnosac&lt;/a&gt; ecosystem of packages to do biterm topic modeling and sentiment analysis with dependency parsing, and then the &lt;code&gt;textrank&lt;/code&gt; and &lt;code&gt;wordcloud&lt;/code&gt; packages to generate a word cloud of extracted keywords.&lt;/p&gt;
&lt;div id=&#34;pre-processing-with-udpipe&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Pre-processing with udpipe&lt;/h3&gt;
&lt;p&gt;In &lt;a href=&#34;https://jlaw.netlify.app/2020/10/07/looking-for-media-bias-in-coverage-of-trump-s-covid-diagnosis/&#34;&gt;prior text-mining posts&lt;/a&gt; I used tidytext to handle tokenization; in this analysis, however, I will leverage the &lt;code&gt;udpipe&lt;/code&gt; package. &lt;code&gt;udpipe&lt;/code&gt; is an R wrapper around the C++ library of the same name that uses pre-trained language models to &lt;a href=&#34;https://bnosac.github.io/udpipe/docs/doc1.html&#34;&gt;easily tokenize, tag, lemmatize or perform dependency parsing on text in any language&lt;/a&gt;. The “ud” in udpipe stands for Universal Dependencies, a “&lt;a href=&#34;https://universaldependencies.org/#ud-treebanks&#34;&gt;framework for consistent annotation of grammar&lt;/a&gt;”.&lt;/p&gt;
&lt;p&gt;In order to prepare the data for the model there needs to be some light pre-processing as &lt;code&gt;udpipe&lt;/code&gt; expects the data to have a &lt;code&gt;doc_id&lt;/code&gt; and a &lt;code&gt;text&lt;/code&gt; field.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#Columns need to be doc_id and text for the model
cleaned &amp;lt;- iglite %&amp;gt;% 
  mutate(doc_id = row_number(),
         text = str_to_lower(reviews),
         text = str_replace_all(text, &amp;quot;&amp;#39;&amp;quot;, &amp;quot;&amp;quot;))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To annotate our data with &lt;code&gt;udpipe&lt;/code&gt; I’ll call the &lt;code&gt;udpipe()&lt;/code&gt; function with my data and the language of the model to use. This function will download the appropriate language model, in this case English, and then annotate the data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;annotated_reviews    &amp;lt;- udpipe(cleaned, &amp;quot;english&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To show what the &lt;code&gt;udpipe&lt;/code&gt; model did to the data we can look at the first review before the annotations:&lt;/p&gt;
&lt;table&gt;
&lt;colgroup&gt;
&lt;col width=&#34;100%&#34; /&gt;
&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;text&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;its surely consumes less data than original app, but many of you may not get comfortable with this interface. one of the major problems i faced was that stories are getting replayed many times without me doing anything. the next major issue is that if you dont like a post it comes to your feed everytime over and over again until you like the post. hope instgram team will find a solution to these problems&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;and after the annotations:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;annotated_reviews %&amp;gt;% filter(doc_id == 1) %&amp;gt;% head(3) %&amp;gt;% knitr::kable()&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;colgroup&gt;
&lt;col width=&#34;2%&#34; /&gt;
&lt;col width=&#34;4%&#34; /&gt;
&lt;col width=&#34;3%&#34; /&gt;
&lt;col width=&#34;36%&#34; /&gt;
&lt;col width=&#34;1%&#34; /&gt;
&lt;col width=&#34;1%&#34; /&gt;
&lt;col width=&#34;2%&#34; /&gt;
&lt;col width=&#34;2%&#34; /&gt;
&lt;col width=&#34;2%&#34; /&gt;
&lt;col width=&#34;2%&#34; /&gt;
&lt;col width=&#34;1%&#34; /&gt;
&lt;col width=&#34;1%&#34; /&gt;
&lt;col width=&#34;24%&#34; /&gt;
&lt;col width=&#34;4%&#34; /&gt;
&lt;col width=&#34;2%&#34; /&gt;
&lt;col width=&#34;1%&#34; /&gt;
&lt;col width=&#34;1%&#34; /&gt;
&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;doc_id&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;paragraph_id&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;sentence_id&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;sentence&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;start&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;end&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;term_id&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;token_id&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;token&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;lemma&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;upos&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;xpos&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;feats&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;head_token_id&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;dep_rel&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;deps&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;misc&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;its surely consumes less data than original app, but many of you may not get comfortable with this interface.&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;3&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;its&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;its&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;PRON&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;PRP$&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Gender=Neut|Number=Sing|Person=3|Poss=Yes|PronType=Prs&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;3&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;nsubj&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NA&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;its surely consumes less data than original app, but many of you may not get comfortable with this interface.&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;5&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;10&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;surely&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;surely&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;ADV&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;RB&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NA&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;3&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;advmod&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NA&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;its surely consumes less data than original app, but many of you may not get comfortable with this interface.&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;12&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;19&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;3&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;3&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;consumes&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;consume&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;VERB&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;VBZ&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;root&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NA&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;We now get a ton of metadata for each token: sentence indicators, the token (&lt;code&gt;token&lt;/code&gt;) and its lemma (&lt;code&gt;lemma&lt;/code&gt;) (note that consumes becomes consume), its part of speech (&lt;code&gt;upos&lt;/code&gt;), its dependency relationship (&lt;code&gt;dep_rel&lt;/code&gt;), and more.&lt;/p&gt;
&lt;p&gt;Now that we’ve tokenized the data we can start using it to analyze the reviews.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;biterm-modeling&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Biterm Modeling&lt;/h3&gt;
&lt;p&gt;The first analysis task will be biterm modeling using the &lt;code&gt;BTM&lt;/code&gt; package. The Biterm Topic Model was developed by &lt;a href=&#34;https://github.com/xiaohuiyan/xiaohuiyan.github.io/blob/master/paper/BTM-WWW13.pdf&#34;&gt;Yan et al.&lt;/a&gt; as a means of determining the topics that occur in short texts such as Tweets (or, in this case, Google Play reviews). It’s meant to improve on traditional topic modeling in use cases such as this. My understanding of the difference is that in traditional topic modeling the model learns word co-occurrence within documents, while in the biterm topic model it learns word co-occurrences within a window across the entire set of documents. In this context a “biterm” &lt;a href=&#34;https://github.com/bnosac/BTM&#34;&gt;consists of two words co-occurring in the same context, for example, in the same short text window&lt;/a&gt;. This analysis is modeled after the one from &lt;a href=&#34;http://www.bnosac.be/index.php/blog/98-biterm-topic-modelling-for-short-texts&#34;&gt;bnosac&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;In the BTM model we can explicitly tell the model which word co-occurrences to use rather than letting it run on everything. This lets us restrict to certain parts of speech, words of a minimum length, and non-stopwords. For this analysis we will use a co-occurrence window of 3 while removing stopwords, dropping words with fewer than 3 characters, and keeping only nouns, adjectives, and verbs.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#Define a Dictionary of BiTerms
library(data.table)
library(stopwords)
biterms &amp;lt;- as.data.table(annotated_reviews)
biterms &amp;lt;- biterms[, cooccurrence(x = lemma,
                                  relevant = upos %in% c(&amp;quot;NOUN&amp;quot;, &amp;quot;ADJ&amp;quot;, &amp;quot;VERB&amp;quot;) &amp;amp; 
                                             nchar(lemma) &amp;gt; 2 &amp;amp; !lemma %in% stopwords(&amp;quot;en&amp;quot;),
                                  skipgram = 3),
                   by = list(doc_id)]&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The biterm data set we’ve constructed looks like:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;biterms %&amp;gt;% head(5) %&amp;gt;% knitr::kable()&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;doc_id&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;term1&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;term2&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;cooc&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;like&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;post&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;consume&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;less&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;less&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;data&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;original&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;app&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;get&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;comfortable&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;This states that, in the first review, the word pair (like, post) occurs twice within a 3-word window.&lt;/p&gt;
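To make the windowing concrete, here is a hand-rolled sketch of the biterm idea on a made-up token vector (a rough illustration only, not the actual `cooccurrence()` implementation): every unordered pair of tokens within 3 positions of each other is counted.

```r
# Count unordered word pairs that co-occur within a 3-token window.
tokens <- c("like", "post", "feed", "like", "post")
window <- 3
pairs <- character(0)
for (i in seq_along(tokens)) {
  if (i == length(tokens)) break
  # Pair token i with each token up to `window` positions ahead of it
  for (j in (i + 1):min(i + window, length(tokens))) {
    pairs <- c(pairs, paste(sort(c(tokens[i], tokens[j])), collapse = "-"))
  }
}
counts <- sort(table(pairs), decreasing = TRUE)
counts  # like-post appears 3 times; feed-like and feed-post twice each
```

The `biterms` table above is this same computation, restricted to the relevant lemmas and grouped by `doc_id`.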
&lt;p&gt;Now we can actually construct the biterm model. For simplicity, I’m setting it to train 9 topics. The &lt;code&gt;background = TRUE&lt;/code&gt; setting makes the 1st topic a background topic that reflects the empirical word distribution, filtering out common words (which is why &lt;code&gt;k = 10&lt;/code&gt;):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(123456)

train_data &amp;lt;- annotated_reviews %&amp;gt;% 
  filter(
    upos %in% c(&amp;quot;NOUN&amp;quot;, &amp;quot;ADJ&amp;quot;, &amp;quot;VERB&amp;quot;),
    !lemma %in% stopwords::stopwords(&amp;quot;en&amp;quot;),
     nchar(lemma) &amp;gt; 2
  ) %&amp;gt;%
  select(doc_id, lemma)

btm_model     &amp;lt;- BTM(train_data, biterms = biterms, k = 10, iter = 2000, background = TRUE)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now that we’ve constructed topics, there needs to be a good way to visualize those topics. Fortunately the &lt;code&gt;textplot&lt;/code&gt; package handles this nicely:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(textplot)
library(ggraph)
set.seed(123456)

plot(btm_model, top_n = 10,
     title = &amp;quot;BTM model of IGLite Reviews&amp;quot;,
     labels = c(&amp;quot;&amp;quot;,
                &amp;quot;Reels&amp;quot;,
                &amp;quot;Likes the App&amp;quot;,
                &amp;quot;Takes Too Long&amp;quot;,
                &amp;quot;Can&amp;#39;t Upload&amp;quot;,
                &amp;quot;Dark Mode&amp;quot;, 
                &amp;quot;Bugs&amp;quot;, 
                &amp;quot;Feature Requests&amp;quot;,
                &amp;quot;Uses Less Resources&amp;quot;, 
                &amp;quot;Instagram Lite&amp;quot;))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2021/06/26/what-people-are-sayin-about-instagram-lite/index_files/figure-html/BTM_Vis-1.png&#34; width=&#34;864&#34; /&gt;
From this chart we can see that many people mention bugs and other problems, specifically around uploads; others note how IG Lite consumes less space and data, request new features such as a music sticker option in Stories, or (in large numbers) ask for Dark Mode. And there are people who like it and think it’s a good app.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;sentiment-analysis-withe-dependency-parsing&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Sentiment Analysis with Dependency Parsing&lt;/h3&gt;
&lt;p&gt;Many sentiment analyses use a dictionary method to assign positive and negative sentiment and then aggregate to determine whether a document is “happy” or “sad” or some other emotion. But what gets left on the table is &lt;em&gt;why&lt;/em&gt; the sentiment is positive or negative. In this case, we can see that people gave IG Lite bad ratings or complained about issues, but without reading every review it’s tough to know why.&lt;/p&gt;
&lt;p&gt;This next piece is based on a &lt;a href=&#34;http://www.bnosac.be/index.php/blog/85-you-did-a-sentiment-analysis-with-tidytext-but-you-forgot-to-do-dependency-parsing-to-answer-why-is-something-positive-negative&#34;&gt;bnosac blog post&lt;/a&gt; and will leverage the dependency output from &lt;code&gt;udpipe&lt;/code&gt; to see what words are connected to the words with negative sentiment.&lt;/p&gt;
&lt;p&gt;To determine words with negative sentiment I will need external dictionaries that identify:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Positive vs. Negative words - the base positive vs. negative scoring&lt;/li&gt;
&lt;li&gt;Amplifying and Deamplifying words - words like ‘very’ which make an emotion more intense or ‘barely’ which make an emotion less intense.&lt;/li&gt;
&lt;li&gt;Negators - words like ‘not’ which would flip the sentiment&lt;/li&gt;
&lt;/ul&gt;
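A toy sketch of how these three dictionary types interact (the mini-dictionaries and the 1.8 amplification factor here are invented for illustration, not the sentometrics lists or `txt_sentiment()`’s actual weights):

```r
# Hypothetical mini-dictionaries: a negator flips polarity,
# an amplifier scales it up.
polarity   <- c(good = 1, bad = -1, slow = -1)
negators   <- c("not")
amplifiers <- c("very")

score_token <- function(word, prev_word = NULL) {
  s <- polarity[[word]]
  if (!is.null(prev_word)) {
    if (prev_word %in% negators)   s <- -s        # "not good" -> negative
    if (prev_word %in% amplifiers) s <- s * 1.8   # "very slow" -> more negative
  }
  s
}

score_token("good", "not")   # -1
score_token("slow", "very")  # -1.8
```

`txt_sentiment()` applies this same logic across each review, using the dependency structure rather than simple adjacency.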
&lt;p&gt;For these lists I will get the data used in the &lt;code&gt;sentometrics&lt;/code&gt; &lt;a href=&#34;https://github.com/SentometricsResearch/sentometrics&#34;&gt;package&lt;/a&gt;:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;load(url(&amp;quot;https://github.com/SentometricsResearch/sentometrics/blob/master/data-raw/FEEL_eng_tr.rda?raw=true&amp;quot;))
load(url(&amp;quot;https://github.com/SentometricsResearch/sentometrics/blob/master/data-raw/valence-raw/valShifters.rda?raw=true&amp;quot;))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;and break them up into separate vectors of words:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;polarity_terms &amp;lt;- FEEL_eng_tr %&amp;gt;% transmute(term = x, polarity = y)
polarity_negators &amp;lt;- valShifters$valence_en %&amp;gt;% filter(t==1) %&amp;gt;% pull(x) %&amp;gt;% str_replace_all(&amp;quot;&amp;#39;&amp;quot;,&amp;quot;&amp;quot;)
polarity_amplifiers &amp;lt;- valShifters$valence_en %&amp;gt;% filter(t==2) %&amp;gt;% pull(x) %&amp;gt;% str_replace_all(&amp;quot;&amp;#39;&amp;quot;,&amp;quot;&amp;quot;)
polarity_deamplifiers &amp;lt;- valShifters$valence_en %&amp;gt;% filter(t==3) %&amp;gt;% pull(x) %&amp;gt;% str_replace_all(&amp;quot;&amp;#39;&amp;quot;,&amp;quot;&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Finally, I can use &lt;code&gt;udpipe&lt;/code&gt;’s &lt;code&gt;txt_sentiment&lt;/code&gt; function to use these lists to score my annotated data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;sentiments &amp;lt;- txt_sentiment(annotated_reviews, term = &amp;quot;lemma&amp;quot;, 
                            polarity_terms = polarity_terms,
                            polarity_negators = polarity_negators, 
                            polarity_amplifiers = polarity_amplifiers,
                            polarity_deamplifiers = polarity_deamplifiers)
sentiments &amp;lt;- sentiments$data&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In addition to the initial annotations there are now columns for &lt;code&gt;polarity&lt;/code&gt; (just positive/negative based on the term alone) and &lt;code&gt;sentiment_polarity&lt;/code&gt;, which incorporates the negators, amplifiers, and deamplifiers.&lt;/p&gt;
&lt;p&gt;Now that the reviews have sentiment scores, I’m going to want to find the words that those negative terms modify using &lt;code&gt;cbind_dependencies()&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;reasons &amp;lt;- sentiments %&amp;gt;%
  #Attached Parent Words to Data
  cbind_dependencies() %&amp;gt;%
  #Filter Columns
  select(doc_id, lemma, token, upos, polarity, sentiment_polarity, token_parent, lemma_parent, upos_parent, dep_rel) %&amp;gt;%
  #Keep Only Terms with Negative Sentiment
  filter(sentiment_polarity &amp;lt; 0)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The revised data now looks like:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;head(reasons) %&amp;gt;% knitr::kable()&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;colgroup&gt;
&lt;col width=&#34;6%&#34; /&gt;
&lt;col width=&#34;10%&#34; /&gt;
&lt;col width=&#34;10%&#34; /&gt;
&lt;col width=&#34;4%&#34; /&gt;
&lt;col width=&#34;8%&#34; /&gt;
&lt;col width=&#34;17%&#34; /&gt;
&lt;col width=&#34;11%&#34; /&gt;
&lt;col width=&#34;11%&#34; /&gt;
&lt;col width=&#34;10%&#34; /&gt;
&lt;col width=&#34;7%&#34; /&gt;
&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;doc_id&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;lemma&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;token&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;upos&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;polarity&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;sentiment_polarity&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;token_parent&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;lemma_parent&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;upos_parent&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;dep_rel&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;less&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;less&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;ADJ&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-1&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-1.8&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;data&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;data&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NOUN&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;amod&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;comfortable&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;comfortable&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;ADJ&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-1.0&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;get&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;get&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;VERB&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;xcomp&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;problem&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;problems&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NOUN&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-1&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-1.0&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;one&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;one&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NUM&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;nmod&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;do&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;do&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;AUX&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-1.0&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;like&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;like&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;VERB&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;aux&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;problem&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;problems&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NOUN&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-1&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-1.0&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;solution&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;solution&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NOUN&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;nmod&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;do&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;does&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;AUX&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-1.0&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;work&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;work&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;VERB&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;aux&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;A quick look at the data calls out a problem that exists with all dictionary-based approaches: the analyst has context that a dictionary does not. For example, the term “less data” above is scored as negative because having “less data” would ordinarily be bad… except in the context of Instagram Lite, &lt;strong&gt;using&lt;/strong&gt; “less data” would actually be good.&lt;/p&gt;
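&lt;p&gt;One possible mitigation (a sketch of my own rather than part of the original analysis) is to override the handful of domain-specific terms before re-scoring. Here &lt;code&gt;dplyr::rows_update&lt;/code&gt; (dplyr 1.0+) flips the polarity of “less” for this app’s context, assuming “less” is present in the dictionary as the table above suggests:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Sketch: flip &amp;quot;less&amp;quot; to positive before re-running txt_sentiment,
# since &amp;quot;less data&amp;quot; is a selling point for a Lite app
polarity_terms_patched &amp;lt;- polarity_terms %&amp;gt;%
  rows_update(tibble(term = &amp;quot;less&amp;quot;, polarity = 1), by = &amp;quot;term&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The patched dictionary could then be passed to &lt;code&gt;txt_sentiment&lt;/code&gt; in place of &lt;code&gt;polarity_terms&lt;/code&gt;.&lt;/p&gt;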
&lt;p&gt;To get a better understanding of why we’re seeing negative sentiment, I will construct a network graph between the negative terms and the words they modify, looking for common phrases.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Keep only dependency relationships that are adjectival modifiers 
# (terms that modify a noun / pronoun)
reasons &amp;lt;- filter(reasons, dep_rel %in% &amp;quot;amod&amp;quot;)

# Count Number of occurrences
word_cooccurences &amp;lt;- reasons %&amp;gt;% 
  count(lemma, lemma_parent, name = &amp;quot;cooc&amp;quot;, sort = T) 

# Create the nodes as either the term in the dictionary or a word linked
# to the term in the dictionary
vertices &amp;lt;- bind_rows(
  tibble(key = unique(reasons$lemma)) %&amp;gt;% 
    mutate(in_dictionary = if_else(key %in% polarity_terms$term, 
                                   &amp;quot;in_dictionary&amp;quot;, 
                                   &amp;quot;linked-to&amp;quot;)),
  tibble(key = unique(setdiff(reasons$lemma_parent, reasons$lemma))) %&amp;gt;% 
    mutate(in_dictionary = &amp;quot;linked-to&amp;quot;)
  )

library(ggraph)
library(igraph)

# Keep Top 20 Word Co-Occurrences
cooc &amp;lt;- head(word_cooccurences, 20)
set.seed(123456789)

cooc %&amp;gt;%  
  graph_from_data_frame(vertices = filter(vertices, 
                                          key %in% c(cooc$lemma, 
                                                     cooc$lemma_parent))) %&amp;gt;%
  ggraph(layout = &amp;quot;fr&amp;quot;) +
  geom_edge_link0(aes(edge_alpha = cooc, edge_width = cooc)) +
  geom_node_point(aes(color = in_dictionary), size = 5) +
  geom_node_text(aes(label = name), vjust = 1.8, col = &amp;quot;darkgreen&amp;quot;) +
  scale_color_viridis_d(option = &amp;quot;C&amp;quot;, begin = .2, end = .8) + 
  ggtitle(&amp;quot;Which words are linked to the negative terms&amp;quot;) +
  theme_void()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2021/06/26/what-people-are-sayin-about-instagram-lite/index_files/figure-html/unnamed-chunk-4-1.png&#34; width=&#34;672&#34; /&gt;
In the network we see “less data” as the strongest co-occurrence even though it (and many other words in this group) is not a strictly negative phrase. Some of these connections make sense as negatives, like “slow speed” or “useless app”, which seem unquestionably bad. But some of them don’t make sense to me, like “full screen” being bad. Looking at a few of the sample reviews that mention full screen, though, they are usually in reference to full-screen modes not working. So while the sentiment model does appear to capture that “full screen” is discussed as a negative thing, the graph view above does not make that clear.&lt;/p&gt;
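&lt;p&gt;To sanity-check a node like “full screen”, one option (a quick sketch, assuming the &lt;code&gt;sentence&lt;/code&gt; column from the &lt;code&gt;udpipe&lt;/code&gt; annotation is still present in &lt;code&gt;sentiments&lt;/code&gt;) is to pull the underlying review sentences:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Sketch: pull the sentences where &amp;quot;full&amp;quot; modifies &amp;quot;screen&amp;quot;
sentiments %&amp;gt;%
  cbind_dependencies() %&amp;gt;%
  filter(lemma == &amp;quot;full&amp;quot;, lemma_parent == &amp;quot;screen&amp;quot;) %&amp;gt;%
  distinct(doc_id, sentence) %&amp;gt;%
  head()&lt;/code&gt;&lt;/pre&gt;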
&lt;p&gt;So dependency parsing for sentiment analysis seems like a cool idea but is a bit “your mileage may vary”.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;word-clouds-on-keywords&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Word Clouds on Keywords&lt;/h3&gt;
&lt;p&gt;The last text analysis technique for this post is probably the most well known… wordclouds. A wordcloud shows the most common words in the data set and can be used to understand the set of reviews at a quick glance. But rather than relying on raw word counts, I’ll use the &lt;code&gt;textrank&lt;/code&gt; package to extract relevant keywords from the text, where keywords are defined as combinations of words following each other. To get the most relevant set of keywords, I will limit the terms to nouns, adjectives, and verbs and create a wordcloud of the top 50.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;textrank_keywords(annotated_reviews$lemma,
                  relevant = annotated_reviews$upos %in% c(&amp;#39;NOUN&amp;#39;, &amp;#39;ADJ&amp;#39;, &amp;#39;VERB&amp;#39;)) %&amp;gt;% 
  .$keywords %&amp;gt;% filter(ngram &amp;gt; 1 &amp;amp; freq &amp;gt; 1, !str_detect(keyword, &amp;#39;be&amp;#39;)) %&amp;gt;%
  slice_max(freq, n = 50) %&amp;gt;% 
  with(wordcloud(keyword, freq, max.words = 50, colors = brewer.pal(10, &amp;#39;Dark2&amp;#39;)))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2021/06/26/what-people-are-sayin-about-instagram-lite/index_files/figure-html/word_cloud-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;So what are people saying about IG Lite… they want dark mode, they want music stickers, and they think it’s a good app.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;conclusions&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Conclusions&lt;/h2&gt;
&lt;p&gt;In this post I leveraged the Google Play reviews that were scraped back in April to analyze the ratings and the review text using some less well-known (at least in my opinion) NLP packages: a modified version of topic modeling with biterm models, a modified version of sentiment analysis with dependency parsing, and a modified version of a word cloud using keyword extraction.&lt;/p&gt;
&lt;p&gt;As far as answering the question of what people are saying about IG Lite: it seems really mixed. In terms of star ratings, things appeared to start very rough in Jan / Feb but improved through March and April. From the topic models, some people like that it’s less resource-intensive than “Instagram Heavy” while others find it buggy and lacking features. From the sentiment analysis, this polarized view can be summed up in the nodes that formed “Good App”, “Good Enough”, and “Useless App”, such that there’s no dominant sentiment.&lt;/p&gt;
&lt;p&gt;Except Dark Mode… give the people dark mode.&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>How have the AFI Top 30 Movies Changed Between 1998 and 2007?</title>
      <link>https://jlaw.netlify.app/2021/05/16/how-has-the-afi-top-30-movies-changed-between-1998-and-2007/</link>
      <pubDate>Sun, 16 May 2021 00:00:00 +0000</pubDate>
      <guid>https://jlaw.netlify.app/2021/05/16/how-has-the-afi-top-30-movies-changed-between-1998-and-2007/</guid>
      <description>
&lt;script src=&#34;https://jlaw.netlify.app/2021/05/16/how-has-the-afi-top-30-movies-changed-between-1998-and-2007/index_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;p&gt;During COVID I’ve started watching some older “classic” movies that I hadn’t seen before but felt for whatever reason I &lt;em&gt;should&lt;/em&gt; have seen as a movie fan. Last week, I had watched &lt;a href=&#34;https://www.imdb.com/title/tt0041959/&#34;&gt;The Third Man&lt;/a&gt; after listening to a podcast about &lt;a href=&#34;https://www.theringer.com/2021/5/4/22418490/top-five-spy-movies-and-without-remorse&#34;&gt;Spy Movies&lt;/a&gt;. After watching it I was surprised to find out that while it was named the &lt;a href=&#34;https://en.wikipedia.org/wiki/BFI_Top_100_British_films&#34;&gt;Top British Film of All-Time&lt;/a&gt; it is &lt;strong&gt;NOT&lt;/strong&gt; in the AFI Top 100 list that was refreshed in 2007. However, it was in the original list of 1998.&lt;/p&gt;
&lt;p&gt;This got me thinking about what were all the differences between the original 1998 list and the revised 2007 list. And while the results are laid out clearly in the &lt;a href=&#34;https://en.wikipedia.org/wiki/AFI%27s_100_Years...100_Movies&#34;&gt;Wikipedia table&lt;/a&gt;, I thought it would be fun to try out a visualization using bump charts. This post utilizes the &lt;code&gt;ggbump&lt;/code&gt; package to make the bump chart, and much of the code and style in this post is influenced by the package &lt;a href=&#34;https://cran.r-project.org/web/packages/ggbump/readme/README.html&#34;&gt;README&lt;/a&gt;.&lt;/p&gt;
&lt;div id=&#34;libraries&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Libraries&lt;/h2&gt;
&lt;p&gt;The main parts of this post will be:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Scraping the table from Wikipedia using &lt;code&gt;rvest&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Doing some light transformations with &lt;code&gt;dplyr&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Doing the plotting with &lt;code&gt;ggplot2&lt;/code&gt;, &lt;code&gt;ggbump&lt;/code&gt;, and a couple of other packages for fonts.&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(rvest)
library(tidyverse)
library(glue)
library(ggbump)
library(ggtext)
library(showtext)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;When making the plot I wanted to leverage the font that’s actually used on the American Film Institute web-page which turned out to be the Google Font &lt;em&gt;Nunito&lt;/em&gt;. Using the &lt;code&gt;showtext&lt;/code&gt; package, I can install the Google fonts into the R session and load them for use in plotting. The function &lt;code&gt;font_add_google&lt;/code&gt; from the &lt;code&gt;showtext&lt;/code&gt; package takes two arguments, the name of the Google Font and a family alias that can be used to refer to the font later. For example, in the code below, I’ll be referring to “Nunito” as the “afi” family later on. The &lt;code&gt;showtext_auto&lt;/code&gt; call allows for the family aliases to be used in future code.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Load Google Font
font_add_google(&amp;quot;Nunito&amp;quot;, &amp;quot;afi&amp;quot;)
font_add_google(&amp;quot;Roboto&amp;quot;, &amp;quot;rob&amp;quot;)
showtext_auto()&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;scraping-the-data&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Scraping the Data&lt;/h2&gt;
&lt;p&gt;The data for the original and new AFI Top 100 Lists are in the same table on the &lt;a href=&#34;https://en.wikipedia.org/wiki/AFI%27s_100_Years...100_Movies&#34;&gt;AFI 100 Years… 100 Movies Wikipedia table&lt;/a&gt;. I’ll be using &lt;code&gt;rvest&lt;/code&gt; to grab the table and import it into a tibble. I do this by providing &lt;code&gt;rvest&lt;/code&gt; with the URL using &lt;code&gt;read_html&lt;/code&gt;, search for a specific CSS class with &lt;code&gt;html_element&lt;/code&gt; and then extract the information from the table with &lt;code&gt;html_table&lt;/code&gt;. Since &lt;code&gt;rvest&lt;/code&gt; will take the column names exactly from the table, which will include spaces, I’ll use the &lt;code&gt;janitor::clean_names()&lt;/code&gt; function to replace spaces with underscores and add characters before names that start with numbers. &lt;em&gt;1998 Rank&lt;/em&gt; will then become &lt;em&gt;x1998_rank&lt;/em&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;tbl &amp;lt;- read_html(&amp;#39;https://en.wikipedia.org/wiki/AFI%27s_100_Years...100_Movies&amp;#39;) %&amp;gt;%
  html_element(css = &amp;#39;.sortable&amp;#39;) %&amp;gt;%
  html_table() %&amp;gt;%
  janitor::clean_names()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The first three rows of this data set will look like:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;film&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;release_year&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;director&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;x1998_rank&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;x2007_rank&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;change&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Citizen Kane&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1941&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Orson Welles&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Casablanca&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1942&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Michael Curtiz&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;3&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;The Godfather&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1972&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Francis Ford Coppola&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;3&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;
&lt;div id=&#34;data-transformation&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Data Transformation&lt;/h2&gt;
&lt;p&gt;In order to get the data ready for use in ggplot there are a few data transformation steps that need to happen:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;I’d like the labels for the plot to include both the title of the film as well as its year of release. I will use the &lt;code&gt;glue&lt;/code&gt; package to easily combine the &lt;em&gt;film&lt;/em&gt; and &lt;em&gt;release_year&lt;/em&gt; columns.&lt;/li&gt;
&lt;li&gt;I want to clean up the rows for movies that aren’t in both lists by replacing the “-” label with &lt;code&gt;NA&lt;/code&gt;s. This is done using &lt;code&gt;across()&lt;/code&gt; and &lt;code&gt;na_if&lt;/code&gt; to replace the “-” characters in the two rank columns with &lt;code&gt;NA&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;I need to turn the tibble from wide format to long format with &lt;code&gt;pivot_longer&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Finally, I want rank to be an integer and I want to remove the leading “x” character from the year column&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;tbl2 &amp;lt;- tbl %&amp;gt;% 
  mutate(title_lbl = glue(&amp;quot;{film} ({release_year})&amp;quot;),
         across(ends_with(&amp;#39;rank&amp;#39;), ~na_if(., &amp;quot;—&amp;quot;))
  ) %&amp;gt;%
  pivot_longer(
    cols = contains(&amp;#39;rank&amp;#39;),
    names_to = &amp;#39;year&amp;#39;,
    values_to = &amp;#39;rank&amp;#39;
  ) %&amp;gt;%
  mutate(year = str_remove_all(year, &amp;#39;\\D+&amp;#39;) %&amp;gt;% as.integer(),
         rank = as.integer(rank))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now we have one row for each instance of a movie on each list. For example, Citizen Kane appears in both lists, so it appears in two rows in the data.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;film&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;release_year&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;director&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;change&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;title_lbl&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;year&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;rank&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Citizen Kane&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1941&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Orson Welles&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Citizen Kane (1941)&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1998&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Citizen Kane&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1941&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Orson Welles&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Citizen Kane (1941)&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2007&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Casablanca&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1942&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Michael Curtiz&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Casablanca (1942)&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1998&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;
&lt;div id=&#34;creating-the-plot&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Creating the Plot&lt;/h2&gt;
&lt;p&gt;In order to make the plot readable, I’ll only be looking at the Top 30 films rather than the full Top 100. I’ll be using a bump chart to do the comparison. Bump charts are a visualization technique well suited to showing changes in rank over time. The &lt;code&gt;ggbump&lt;/code&gt; package provides a &lt;code&gt;ggplot2&lt;/code&gt; geom (&lt;code&gt;geom_bump&lt;/code&gt;) to handle the lines for a bump chart. Movies that appear in only one list will not have a line.&lt;/p&gt;
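&lt;p&gt;As a minimal illustration of what &lt;code&gt;geom_bump&lt;/code&gt; does on its own (toy data, not the AFI lists):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(ggplot2)
library(ggbump)

# Three fake films ranked in two years; A and B swap places
toy &amp;lt;- data.frame(
  year = rep(c(1998, 2007), each = 3),
  rank = c(1, 2, 3, 2, 1, 3),
  film = rep(c(&amp;quot;A&amp;quot;, &amp;quot;B&amp;quot;, &amp;quot;C&amp;quot;), times = 2)
)

ggplot(toy, aes(x = year, y = rank, color = film)) +
  geom_point(size = 4) +
  geom_bump(size = 1) +
  scale_y_reverse()  # rank 1 at the top&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Because films A and B trade ranks between the two years, their lines cross; that crossing is the whole point of the chart.&lt;/p&gt;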
&lt;p&gt;As for what the code does, the first section above the &lt;code&gt;theme()&lt;/code&gt; call does most of the work, adding the points, lines, and titles as well as scaling the axes to the right sizes. Note that in the &lt;code&gt;geom_text&lt;/code&gt; calls, I’m using &lt;em&gt;family = ‘rob’&lt;/em&gt; to refer to the Roboto font downloaded earlier. The &lt;code&gt;theme()&lt;/code&gt; call handles a lot of the formatting, and the &lt;code&gt;geom_text()&lt;/code&gt; and &lt;code&gt;geom_point()&lt;/code&gt; calls after the &lt;code&gt;theme()&lt;/code&gt; section create the white circles that contain the ranks.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;##Plot
num_films = 30

tbl2 %&amp;gt;%
  filter(rank &amp;lt;= num_films) %&amp;gt;%
  ggplot(aes(x = year, y = rank, color = title_lbl)) +
  #Add Dots
  geom_point(size = 5) +
  #Add Titles
  geom_text(data = . %&amp;gt;% filter(year == min(year)),
            aes(x = year - .5, label = title_lbl), size = 5, hjust = 1, family = &amp;#39;rob&amp;#39;) +
  geom_text(data = . %&amp;gt;% filter(year == max(year)),
            aes(x = year + .5, label = title_lbl), size = 5, hjust = 0, family = &amp;#39;rob&amp;#39;) +
  # Add Bump Lines
  geom_bump(size = 2, smooth = 8) +
  
  # Resize Axes
  scale_x_continuous(limits = c(1990, 2014),
                     breaks = c(1998, 2007),
                     position = &amp;#39;top&amp;#39;) +
  scale_y_reverse() +
  labs(title = glue(&amp;quot;How has the AFI Top {num_films} Movies Changed Between Lists&amp;quot;),
       subtitle = &amp;quot;Comparing 1998 and 2007s lists&amp;quot;,
       caption = &amp;quot;***Source:*** Wikipedia&amp;quot;,
       x = &amp;quot;List Year&amp;quot;,
       y = &amp;quot;Rank&amp;quot;) + 
  # Set Colors and Sizes
  theme(
    text = element_text(family = &amp;#39;afi&amp;#39;),
    legend.position = &amp;quot;none&amp;quot;,
    panel.grid = element_blank(),
    plot.title = element_text(hjust = .5, color = &amp;quot;white&amp;quot;, size = 20),
    plot.caption = element_markdown(hjust = 1, color = &amp;quot;white&amp;quot;, size = 12),
    plot.subtitle = element_text(hjust = .5, color = &amp;quot;white&amp;quot;, size = 18),
    axis.line = element_blank(),
    axis.ticks = element_blank(),
    axis.text.y = element_blank(),
    axis.title.y = element_blank(),
    axis.text.x = element_text(face = 2, color = &amp;quot;white&amp;quot;, size = 18),
    panel.background = element_rect(fill = &amp;quot;black&amp;quot;),
    plot.background = element_rect(fill = &amp;quot;black&amp;quot;)
  ) + 
  ## Add in the Ranks with the Circles
  geom_point(data = tibble(x = 1990.5, y = 1:num_films), aes(x = x, y = y), 
             inherit.aes = F,
             color = &amp;quot;white&amp;quot;,
             size = 7,
             pch = 21) +
  geom_text(data = tibble(x = 1990.5, y = 1:num_films), aes(x = x, y = y, label = y), 
            inherit.aes = F,
            color = &amp;quot;white&amp;quot;,
            fontface = 2,
            family = &amp;#39;rob&amp;#39;) + 
  geom_point(data = tibble(x = 2013.5, y = 1:num_films), aes(x = x, y = y), 
             inherit.aes = F,
             color = &amp;quot;white&amp;quot;,
             size = 7,
             pch = 21) +
  geom_text(data = tibble(x = 2013.5, y = 1:num_films), aes(x = x, y = y, label = y), 
            inherit.aes = F,
            color = &amp;quot;white&amp;quot;,
            fontface = 2,
            family = &amp;#39;rob&amp;#39;) &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2021/05/16/how-has-the-afi-top-30-movies-changed-between-1998-and-2007/index_files/figure-html/plot-1.png&#34; width=&#34;768&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;conclusion&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;While the Wikipedia page tells you exactly what changed between the two lists, it provided an opportunity for me to get some practice making “nicer” looking ggplot charts and to try out a bump chart with the &lt;code&gt;ggbump&lt;/code&gt; package. As for interpreting the chart, there are a couple of things I don’t really understand between the two lists: mainly why Raging Bull suddenly jumps from the 24th best film to the 4th, or why City Lights jumps 65 places from 76th to 11th. I guess I’ll just have to watch and find out.&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Scraping Google Play Reviews with RSelenium</title>
      <link>https://jlaw.netlify.app/2021/05/03/scraping-google-play-reviews-with-rselenium/</link>
      <pubDate>Mon, 03 May 2021 00:00:00 +0000</pubDate>
      <guid>https://jlaw.netlify.app/2021/05/03/scraping-google-play-reviews-with-rselenium/</guid>
      <description>
&lt;script src=&#34;https://jlaw.netlify.app/2021/05/03/scraping-google-play-reviews-with-rselenium/index_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;div id=&#34;when-normal-web-scraping-just-wont-work&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;When Normal Web Scraping Just Won’t Work&lt;/h2&gt;
&lt;p&gt;I’ve &lt;a href=&#34;https://jlaw.netlify.app/2020/12/01/exploring-nhl-stanley-cup-champion-s-points-percentage-in-four-ggplots/&#34;&gt;used&lt;/a&gt; &lt;a href=&#34;https://jlaw.netlify.app/2020/11/24/what-s-the-most-successful-dancing-with-the-stars-profession-visualizing-with-gt/&#34;&gt;rvest&lt;/a&gt; in &lt;a href=&#34;https://jlaw.netlify.app/2020/09/07/covid-19s-impact-on-the-nyc-subway-system/&#34;&gt;numerous&lt;/a&gt; &lt;a href=&#34;https://jlaw.netlify.app/2020/07/04/a-racing-barplot-of-top-us-baby-names-1880-2018/&#34;&gt;posts&lt;/a&gt; to scrape information from static websites or through forms to get data. However, some websites don’t have static data that can be downloaded by just scraping the HTML. Google Play Store reviews are one of these sources.&lt;/p&gt;
&lt;p&gt;Reviews on the Google Play Store have what I call a semi-infinite scroll where, as you reach the bottom of the page, the site loads the next batch of reviews. However, a special wrinkle in the Play Store page is that after a few loads the user is prompted to click a button to load the next batch of reviews.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;show_more.PNG&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;selenium-to-the-rescue&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Selenium to the Rescue&lt;/h2&gt;
&lt;p&gt;Selenium is a tool that automates a browser. It’s often used for writing automated tests for websites, but in this instance it can be used to mimic a user’s browser behavior, loading up a bunch of Play Store reviews that we can then scrape with &lt;code&gt;rvest&lt;/code&gt; in the conventional fashion.&lt;/p&gt;
&lt;p&gt;Selenium and its R package &lt;code&gt;RSelenium&lt;/code&gt; allow a user to interact with a browser through their programming language of choice. Since this is an R blog, I’ll be using R to control the browser.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;scraping-instagram-lite-reviews&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Scraping Instagram Lite Reviews&lt;/h2&gt;
&lt;p&gt;Instagram Lite is a recently launched product whose &lt;a href=&#34;https://techcrunch.com/2020/05/11/instagram-lite-shuts-down-in-advance-of-a-relaunch/&#34;&gt;“goal was to offer a smaller download that takes up less space on a mobile device — a feature that specifically caters to users in emerging markets, where storage space is a concern”&lt;/a&gt;. Since this is a relatively new product, it would be fun to see how it’s doing. This first post will cover how to use &lt;code&gt;RSelenium&lt;/code&gt; to actually get the data; the analysis will be covered in a follow-up post.&lt;/p&gt;
&lt;div id=&#34;part-1-loading-libraries&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Part 1: Loading Libraries&lt;/h3&gt;
&lt;p&gt;The four libraries used for this data acquisition project are &lt;code&gt;RSelenium&lt;/code&gt;, which allows for manipulating a browser through R; &lt;code&gt;tidyverse&lt;/code&gt; for constructing the data structure; &lt;code&gt;lubridate&lt;/code&gt; to handle the dates in the reviews; and &lt;code&gt;rvest&lt;/code&gt; to scrape the HTML after we’re done loading all the reviews with Selenium.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(RSelenium)
library(tidyverse)
library(lubridate)
library(rvest)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;part-2-start-rselenium&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Part 2: Start RSelenium&lt;/h3&gt;
&lt;p&gt;A browser session gets started by calling &lt;code&gt;rsDriver&lt;/code&gt; from the &lt;code&gt;RSelenium&lt;/code&gt; package. While &lt;code&gt;RSelenium&lt;/code&gt; can work with Chrome, Firefox, or PhantomJS, I’ve personally found that working with Firefox is the path of least resistance. With Chrome you need to match the chromedriver version between &lt;code&gt;RSelenium&lt;/code&gt; and the Chrome browser, and I’ve never successfully pulled that off. With Firefox you can just set &lt;code&gt;browser=&#34;firefox&#34;&lt;/code&gt; and it just works.&lt;/p&gt;
&lt;p&gt;The first time you run &lt;code&gt;RSelenium&lt;/code&gt; you can’t have &lt;code&gt;check=F&lt;/code&gt;, since the check step downloads the drivers that it needs to work. After that first run you can set &lt;code&gt;check=F&lt;/code&gt; to skip those checks. The &lt;code&gt;verbose=F&lt;/code&gt; option suppresses excess messaging.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;rsDriver&lt;/code&gt; function will start both a Selenium server and the remote Firefox browser. It returns both a server and a client; the client gets assigned to &lt;code&gt;remDr&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;rD &amp;lt;- rsDriver(browser = &amp;quot;firefox&amp;quot;, 
               port = 6768L, 
               #If Running RSelenium for the First Time, you can&amp;#39;t have check =F
               #since you&amp;#39;ll need to download the appropriate drivers
               check = F, 
               verbose = F
)
remDr &amp;lt;- rD[[&amp;quot;client&amp;quot;]]&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If everything goes to plan a new Firefox window will open and the address bar will be “oranged” out.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;remote_firefox.PNG&#34; /&gt;&lt;/p&gt;
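&lt;p&gt;If the window doesn’t appear, you can also check from the R side that the server and client are talking to each other. As a minimal sketch (the &lt;code&gt;getStatus()&lt;/code&gt; method comes from &lt;code&gt;RSelenium&lt;/code&gt;’s remoteDriver class; the exact fields in the returned list depend on your Selenium version):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#Ask the Selenium server for its status
status &amp;lt;- remDr$getStatus()

#Inspect the returned list (e.g. build/os info or a ready flag)
str(status)&lt;/code&gt;&lt;/pre&gt;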
&lt;/div&gt;
&lt;div id=&#34;part-3-browse-to-the-instagram-lite-google-play-reviews-page&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Part 3: Browse to the Instagram Lite Google Play Reviews Page&lt;/h3&gt;
&lt;p&gt;This part is straightforward: I create a &lt;code&gt;url&lt;/code&gt; variable with the desired URL as a string and then use the remote driver &lt;code&gt;remDr&lt;/code&gt; to tell the browser to navigate to that page.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#Navigate to URL
url &amp;lt;- &amp;quot;https://play.google.com/store/apps/details?id=com.instagram.lite&amp;amp;hl=en_US&amp;amp;showAllReviews=true&amp;quot;
remDr$navigate(url)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If all goes well the Firefox browser that had opened should now have loaded the Google Play page for Instagram Lite. There will also be a little robot icon on the address bar to show that the browser is under remote control.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;iglite_page.PNG&#34; /&gt;&lt;/p&gt;
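&lt;p&gt;Rather than eyeballing the browser window, you can also confirm the navigation from R. A quick sketch using &lt;code&gt;RSelenium&lt;/code&gt;’s &lt;code&gt;getCurrentUrl()&lt;/code&gt; and &lt;code&gt;getTitle()&lt;/code&gt; methods:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#The current URL should match the reviews page we navigated to
remDr$getCurrentUrl()[[1]]

#The page title should mention Instagram Lite
remDr$getTitle()[[1]]&lt;/code&gt;&lt;/pre&gt;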
&lt;/div&gt;
&lt;div id=&#34;part-4-loading-a-bunch-of-reviews&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Part 4: Loading A Bunch of Reviews&lt;/h3&gt;
&lt;p&gt;This section is the meat and potatoes of working with Selenium, where we’ll write a script to tell the browser what to do. In summary, this code block will:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Identify the body of the webpage&lt;/li&gt;
&lt;li&gt;Send the “end” key to the browser to move to the bottom of the body&lt;/li&gt;
&lt;li&gt;Check if the “SHOW MORE” button exists on the screen and wait 2 seconds&lt;/li&gt;
&lt;li&gt;If the button exists, find the element and click it.&lt;/li&gt;
&lt;li&gt;Wait 3 seconds to let new reviews load and then repeat from Step 2&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;I repeat this loop 50 times to try to get enough data for analysis. If the browser isn’t running headlessly then you can switch to the remote browser window and watch everything in action (but be careful, because manually interacting with the webpage can interfere with the intended function of the script).&lt;/p&gt;
&lt;p&gt;Figuring out the right class for the button (RveJvd) took some guess-and-check work from inspecting the page; however, I believe all Google Play review pages use the same classes, so this code &lt;em&gt;should&lt;/em&gt; be adaptable to other apps. But YMMV.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Note:&lt;/em&gt; I originally wanted to run this 100 times to try to get more reviews but I kept winding up with an error of &lt;code&gt;unexpected end of hex escape at line 1 column 15497205&lt;/code&gt; that I was unable to debug. So I stuck with 50. But if anyone knows how to avoid that error please let me know in the comments.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#Find Page Element for Body
webElem &amp;lt;- remDr$findElement(&amp;quot;css&amp;quot;, &amp;quot;body&amp;quot;)

#Page to the End
for(i in 1:50){
  message(paste(&amp;quot;Iteration&amp;quot;,i))
  webElem$sendKeysToElement(list(key = &amp;quot;end&amp;quot;))
  #Check for the Show More Button
  element&amp;lt;- try(unlist(remDr$findElement(&amp;quot;class name&amp;quot;, &amp;quot;RveJvd&amp;quot;)$getElementAttribute(&amp;#39;class&amp;#39;)),
                silent = TRUE)
  
  #If Button Is There Then Click It
  Sys.sleep(2)
  if(str_detect(element, &amp;quot;RveJvd&amp;quot;) == TRUE){
    buttonElem &amp;lt;- remDr$findElement(&amp;quot;class name&amp;quot;, &amp;quot;RveJvd&amp;quot;)
    buttonElem$clickElement()
  }
  
  #Sleep to Let Things Load
  Sys.sleep(3)
}&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;part-5-scraping-the-page&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Part 5: Scraping the Page&lt;/h3&gt;
&lt;p&gt;Now that we’ve scrolled and pushed buttons and scrolled some more to get a bunch of reviews to load on the screen, it’s time to scrape the reviews.&lt;/p&gt;
&lt;p&gt;We can extract the HTML from the remote browser using &lt;code&gt;getPageSource()&lt;/code&gt; and &lt;code&gt;read_html()&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;##Scrape in HTML Objects
html_obj &amp;lt;- remDr$getPageSource(header = TRUE)[[1]] %&amp;gt;% read_html()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now that we have the HTML we no longer need the remote Firefox browser or the Selenium server, so we can shut those down. &lt;a href=&#34;https://github.com/ropensci/RSelenium/issues/228&#34;&gt;There have been issues with the Java process remaining open&lt;/a&gt; even after calling the stop-server functions, so I issue a system command to kill the Java process.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#Shut Down Client and Server
remDr$close()
rD$server$stop()
system(&amp;quot;taskkill /im java.exe /f&amp;quot;, intern=FALSE, ignore.stdout=FALSE)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;part-6-extracting-the-various-parts-of-the-review&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Part 6: Extracting the Various Parts of the Review&lt;/h3&gt;
&lt;p&gt;If we look at a single review, there are a number of different elements we’d like to extract.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;review_img.PNG&#34; /&gt;&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;The Reviewer Name&lt;/li&gt;
&lt;li&gt;Number of Stars&lt;/li&gt;
&lt;li&gt;Date of Review&lt;/li&gt;
&lt;li&gt;Number of Upvotes&lt;/li&gt;
&lt;li&gt;Full Text of the Review&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This piece took a bit of guess-and-check work with &lt;code&gt;rvest&lt;/code&gt;, looking at the CSS selectors on the page to identify the classes for the pieces that I wanted and extracting them with &lt;code&gt;html_elements()&lt;/code&gt;, &lt;code&gt;html_attr()&lt;/code&gt;, and &lt;code&gt;html_text()&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# 1) Reviewer Name
names &amp;lt;- html_obj %&amp;gt;% html_elements(&amp;quot;span.X43Kjb&amp;quot;) %&amp;gt;% html_text()

# 2) Number of Stars
stars &amp;lt;- html_obj %&amp;gt;% html_elements(&amp;quot;.kx8XBd .nt2C1d [role=&amp;#39;img&amp;#39;]&amp;quot;)%&amp;gt;% 
  html_attr(&amp;quot;aria-label&amp;quot;) %&amp;gt;% 
  #Remove everything that&amp;#39;s not numeric
  str_remove_all(&amp;#39;\\D+&amp;#39;) %&amp;gt;% 
  # Convert to Integer
  as.integer()

#3) Date of Review
dates &amp;lt;- html_obj %&amp;gt;% html_elements(&amp;quot;.p2TkOb&amp;quot;) %&amp;gt;% 
  html_text() %&amp;gt;% 
  # Convert to a Date
  mdy()

#4) How many helpful clicks
clicks &amp;lt;- html_obj %&amp;gt;% html_elements(&amp;#39;div.jUL89d.y92BAb&amp;#39;) %&amp;gt;% 
  html_text() %&amp;gt;% 
  #Convert to Integer
  as.integer()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For the text of the review itself there is one wrinkle. In the image above the beginning of the review is shown, but it is truncated, and a “Full Review” button needs to be clicked to show the full review. Fortunately, this shows up in the data as “&lt;Text Preview&gt; …Full Review&lt;The Actual Full Review&gt;”. So in cases where the initial review is truncated, all we need to do is grab all the text that comes after the string “Full Review”:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# 5) Full Text of the Review
reviews &amp;lt;- html_obj %&amp;gt;% html_elements(&amp;quot;.UD7Dzf&amp;quot;) %&amp;gt;% html_text() 

###Deal with the &amp;quot;Full Review&amp;quot; Issue where text is duplicated
reviews &amp;lt;- if_else(
  #If the review is truncated
  str_detect(reviews, &amp;#39;\\.\\.\\.Full Review&amp;#39;),
  #Grab all the Text After the string &amp;#39;...Full Review&amp;#39;
  str_sub(reviews, 
          start = str_locate(reviews, &amp;#39;\\.\\.\\.Full Review&amp;#39;)[, 2]+1
          ),
  #Else remove the leading space from the review as is
  str_trim(reviews)
  )&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;part-7-combine-and-save-the-data-set&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Part 7: Combine and Save the Data Set&lt;/h3&gt;
&lt;p&gt;With each piece of the review individually extracted we’ll combine the vectors in a tibble and save the file for the analysis in the next part.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# create the df with all the info
review_data &amp;lt;- tibble(
  names = names, 
  stars = stars, 
  dates = dates, 
  clicks = clicks,
  reviews = reviews
  ) 

saveRDS(review_data, &amp;#39;data/review_data.RDS&amp;#39;)
write_csv(review_data, &amp;#39;data/review_data.csv&amp;#39;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Just to make sure everything is working we’ll compare an actual review to our data:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;show_more.PNG&#34; /&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;review_data %&amp;gt;%
  filter(names %in% c(&amp;#39;Sushil Uk07&amp;#39;, &amp;#39;Hana Hoey&amp;#39;)) %&amp;gt;%
  knitr::kable()&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;colgroup&gt;
&lt;col width=&#34;9%&#34; /&gt;
&lt;col width=&#34;4%&#34; /&gt;
&lt;col width=&#34;8%&#34; /&gt;
&lt;col width=&#34;5%&#34; /&gt;
&lt;col width=&#34;71%&#34; /&gt;
&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;names&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;stars&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;dates&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;clicks&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;reviews&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Sushil Uk07&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;3&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2021-04-15&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Good ,but it’s doesn’t Have option to put music in stories&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Hana Hoey&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;3&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2021-04-21&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;in features, this app has already including important things. but the movement is very slow&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;And there you have it. We used Selenium to have a browser scroll for a while to load a bunch of reviews, extracted the data with &lt;code&gt;rvest&lt;/code&gt; and then combined and saved the data. In the next post we’ll use this data to understand what downloaders think about Instagram Lite.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;appendix&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Appendix:&lt;/h2&gt;
&lt;p&gt;In this post the Firefox browser was actually displayed, which is a useful way to see what the code is doing. But if you don’t want to see the browser, you can pass extra parameters to the &lt;code&gt;rsDriver&lt;/code&gt; function to keep it hidden:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;rsDriver(browser = &amp;quot;firefox&amp;quot;, 
         port = 6768L, 
         check = F, 
         verbose = F, 
         #Run the Browser Headlessly
         extraCapabilities = 
           list(&amp;quot;moz:firefoxOptions&amp;quot; = 
                  list(
                    args = list(&amp;#39;--headless&amp;#39;)
                    )
                )
         )&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>What % of Manhattan Did I Run Through?</title>
      <link>https://jlaw.netlify.app/2021/04/15/what-of-manhattan-did-i-run-through/</link>
      <pubDate>Thu, 15 Apr 2021 00:00:00 +0000</pubDate>
      <guid>https://jlaw.netlify.app/2021/04/15/what-of-manhattan-did-i-run-through/</guid>
      <description>
&lt;script src=&#34;https://jlaw.netlify.app/2021/04/15/what-of-manhattan-did-i-run-through/index_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;p&gt;&lt;img src=&#34;choropleth.PNG&#34; /&gt;&lt;/p&gt;
&lt;p&gt;In a &lt;a href=&#34;https://jlaw.netlify.app/2021/04/01/heatmapping-my-new-york-city-marathon-training/&#34;&gt;previous post&lt;/a&gt; I created a cool-looking (in my opinion) heatmap of my Marathon training from years back. One of the downsides to that density-based method of making the heatmap was that routes I only ran once didn’t show up very clearly. I also wanted to know roughly what % of Manhattan I covered in my runs. This post will use that same data to create a choropleth map by Census Tract, both to visualize all the tracts I passed through in my training and to determine what % of Manhattan’s land area I covered.&lt;/p&gt;
&lt;div id=&#34;libraries-used&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Libraries Used&lt;/h2&gt;
&lt;p&gt;The packages used in this analysis are the same as in the prior analysis: &lt;code&gt;tidyverse&lt;/code&gt; for data manipulation, &lt;code&gt;sf&lt;/code&gt; for modifying spatial data, &lt;code&gt;tigris&lt;/code&gt; for getting the basemaps to plot my routes, and &lt;code&gt;extrafont&lt;/code&gt; to bring in new fonts for the plots.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidyverse) # Data Manipulation
library(sf) # Manipulation Spatial Data
library(tigris) # Getting Tract and Roads Spatial Data
library(extrafont) # Better Fonts For GGPLOT&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;data-used&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Data Used&lt;/h2&gt;
&lt;p&gt;The data is also the same running route data from the prior post. For more details on its creation please reference the &lt;a href=&#34;https://jlaw.netlify.app/2021/04/01/heatmapping-my-new-york-city-marathon-training/&#34;&gt;prior post&lt;/a&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;runs_and_routes &amp;lt;- readRDS(&amp;#39;data/runs_and_routes.RDS&amp;#39;)
all_routes &amp;lt;- readRDS(&amp;#39;data/all_routes.RDS&amp;#39;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For the basemap I’m again using the &lt;code&gt;tigris&lt;/code&gt; package however this time getting census tracts rather than roads. According to the package, &lt;em&gt;Census tracts generally have a population size between 1,200 and 8,000 people, with an optimum size of 4,000 people&lt;/em&gt;. The map is downloaded using the &lt;code&gt;tracts()&lt;/code&gt; function with inputs for state and county.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;nyc_tracts &amp;lt;- tracts(&amp;quot;NY&amp;quot;, &amp;quot;New York&amp;quot;, cb = T) %&amp;gt;% 
  st_transform(crs = st_crs(runs_and_routes$geometry))&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot() + geom_sf(data = nyc_tracts) + ggthemes::theme_map()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2021/04/15/what-of-manhattan-did-i-run-through/index_files/figure-html/map1-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Unlike the &lt;a href=&#34;https://jlaw.netlify.app/2021/04/01/heatmapping-my-new-york-city-marathon-training/&#34;&gt;prior analysis&lt;/a&gt; where the heatmap was just overlaid atop the map, here I need to identify which census tracts contained a route I ran vs. which didn’t. This can be done using the &lt;code&gt;st_join&lt;/code&gt; function, specifying it to be a left join, and specifying the join type as &lt;code&gt;st_intersects&lt;/code&gt;, which joins the route information if the lat/long is contained in the census tract. The data is then grouped by the tract ID and some other tract metadata. Then I create a field for the number of routes contained in each census tract, which will be used for the choropleth.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#Join Routes to Tracts by Intersecting
nyc_geo_join &amp;lt;- nyc_tracts %&amp;gt;% 
  st_join(all_routes %&amp;gt;% distinct(route_id, geometry),
          join = st_intersects,
          left = T
          ) %&amp;gt;% 
  group_by(
    TRACTCE, #Census Tract ID
    ALAND, #Land Area
    AWATER #Water Area
  ) %&amp;gt;% 
  summarize(num_routes = n_distinct(route_id, na.rm = T), .groups = &amp;#39;drop&amp;#39;) %&amp;gt;% 
  #Set 0 Routes to NA colored
  mutate(num_routes = if_else(num_routes == 0, NA_integer_, num_routes))&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;visualization&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Visualization&lt;/h2&gt;
&lt;p&gt;The choropleth provides an alternative to the heatmap that better shows each census tract that &lt;strong&gt;at least one&lt;/strong&gt; of my routes passed through. Rarely-run routes did not show up on the heatmap, but they will be clearer here.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot() + 
  geom_sf(data = nyc_geo_join, 
          aes(fill = num_routes)) + 
  scale_fill_viridis_c(na.value = &amp;quot;grey90&amp;quot;, guide = F) + 
  coord_sf(xlim = c(-74.15, -73.8)) + 
  labs(title = paste0(&amp;quot;Census Tracts I&amp;#39;ve &amp;quot;,emo::ji(&amp;#39;running&amp;#39;),&amp;quot; Through&amp;quot;),
       fill = &amp;quot;# of Routes Run&amp;quot;,
       caption = &amp;quot;**Author:** JLaw&amp;quot;) + 
  ggthemes::theme_map() + 
  theme(
    plot.title = element_text(size = 18, family = &amp;#39;Arial Narrow&amp;#39;, hjust = .5),
    plot.caption = ggtext::element_markdown(),
    plot.caption.position = &amp;#39;plot&amp;#39;
  )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2021/04/15/what-of-manhattan-did-i-run-through/index_files/figure-html/cloropleth-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Now the East Side routes are clearer.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;what-of-manhattan-did-i-run-through&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;What % of Manhattan Did I Run Through?&lt;/h2&gt;
&lt;p&gt;The island of Manhattan covers 22.7 square miles. I was curious what % of that area I covered, based on census tracts. While this will seriously over-count my coverage, it is easy to calculate: if I ran through a tract, I count 100% of its land area; if I did not, I count nothing.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;ALAND&lt;/code&gt; column from the Census Tract data contains the land area in square meters, &lt;a href=&#34;https://www.census.gov/quickfacts/fact/note/US/LND110210&#34;&gt;which I convert to square miles&lt;/a&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;data_summary &amp;lt;- nyc_geo_join %&amp;gt;%
  as_tibble %&amp;gt;% 
  mutate(covered = !is.na(num_routes)) %&amp;gt;% 
  group_by(covered) %&amp;gt;% 
  summarize(tracts = n(),
            #Convert Square Meters to Square Miles
            area = sum(ALAND)/2589988) %&amp;gt;%
  mutate(pct_tracts = tracts / sum(tracts),
         pct_area = area/sum(area))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;During this marathon training, I ran through 101 of Manhattan’s 288 Census Tracts (35%) and passed through census tracts covering 8.7 &lt;em&gt;mi^2&lt;/em&gt; out of 22.7 &lt;em&gt;mi^2&lt;/em&gt;, for &lt;strong&gt;38.4%&lt;/strong&gt;.&lt;/p&gt;
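&lt;p&gt;As a quick gut-check on those percentages, the arithmetic from the rounded figures above works out as expected (the small gap from the reported 38.4% comes from rounding the areas to 8.7 and 22.7):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#Share of tracts covered
101 / 288&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 0.3506944&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#Share of land area covered
8.7 / 22.7&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 0.3832599&lt;/code&gt;&lt;/pre&gt;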
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Heatmapping My New York City Marathon Training</title>
      <link>https://jlaw.netlify.app/2021/04/01/heatmapping-my-new-york-city-marathon-training/</link>
      <pubDate>Thu, 01 Apr 2021 00:00:00 +0000</pubDate>
      <guid>https://jlaw.netlify.app/2021/04/01/heatmapping-my-new-york-city-marathon-training/</guid>
      <description>
&lt;script src=&#34;https://jlaw.netlify.app/2021/04/01/heatmapping-my-new-york-city-marathon-training/index_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;div id=&#34;motivation&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Motivation&lt;/h2&gt;
&lt;p&gt;This post was inspired by my wife who used the GPS data from her Strava app to plot her running routes during 2020. Since I don’t run nearly as much as I used to, I need to go back to when I was training for the NYC marathon to find enough running to make such a map worthwhile. Also presenting a challenge is that I’m a bit of a &lt;a href=&#34;https://www.merriam-webster.com/dictionary/Luddite&#34;&gt;luddite&lt;/a&gt; when it comes to running technology. I don’t have a GPS watch and I don’t run with a phone. To track my runs I manually enter my routes and workouts into &lt;a href=&#34;http://www.mapmyrun.com&#34;&gt;MapMyRun&lt;/a&gt; and I time my runs with an ol’ fashioned sportswatch.&lt;/p&gt;
&lt;p&gt;While this works for me on the road, it made the data-gathering process for this visualization more difficult. And while MapMyRun does have TCX files for each workout, they’re not that useful if the data didn’t come from a GPS watch.&lt;/p&gt;
&lt;p&gt;At the end of the day, my goal with this analysis is to make a cool-looking heatmap of my training routes for the NYC Marathon… or at least to make a visualization that is cooler looking than my wife’s.&lt;/p&gt;
&lt;p&gt;For those who can’t wait… this was the final output:
&lt;img src=&#34;running_heatmap.PNG&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;libraries-used&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Libraries Used&lt;/h2&gt;
&lt;p&gt;This analysis uses four main packages: &lt;code&gt;tidyverse&lt;/code&gt; for data manipulation, &lt;code&gt;sf&lt;/code&gt; for modifying spatial data, &lt;code&gt;tigris&lt;/code&gt; for getting the basemaps to plot my routes, and &lt;code&gt;extrafont&lt;/code&gt; to bring in new fonts for the plots.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidyverse) # Data Manipulation
library(sf) # Manipulation Spatial Data
library(tigris) # Getting Tract and Roads Spatial Data
library(extrafont) # Better Fonts For GGPLOT&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;gathering-data&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Gathering Data&lt;/h2&gt;
&lt;p&gt;If I had a GPS watch or used Strava, I could just download all my files which would contain Geo information and plot it directly. But because I do everything manually, I needed to jump through some hoops. From my &lt;a href=&#34;http://www.mapmyrun.com&#34;&gt;MapMyRun&lt;/a&gt; account I was able to download:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;code&gt;user_workout_history.csv&lt;/code&gt; - Containing all of my workouts along with a column for &lt;em&gt;route_id&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;GPX files for each route that I had saved.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This led to the semi-painful manual process of using the first file to write down each route ID that I had run, looking up that route, and downloading the individual GPX file. Fortunately, I’m a creature of habit and ran the same routes often, so there were only 24 to individually download.&lt;/p&gt;
&lt;div id=&#34;the-user-workout-history-file&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;The User Workout History File&lt;/h3&gt;
&lt;p&gt;This file is a CSV exported from MapMyRun that contains one row for each workout I did, along with metadata such as date, time, speed, etc. However, the important column is the route ID, which will be used to join the geo-data from the routes’ GPX files.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;runs &amp;lt;- read_csv(&amp;#39;data/user_workout_history.csv&amp;#39;) %&amp;gt;% 
  # Create Route ID column
  mutate(route_id = str_extract(RouteID, &amp;#39;\\d+&amp;#39;) %&amp;gt;% as.integer)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;the-route-gpx-files&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;The Route GPX Files&lt;/h3&gt;
&lt;p&gt;As mentioned above, the geocoded data for each route lives in GPX files, one for each of the 24 routes. Since I would apply the same pre-processing to each file, this is a good candidate for the &lt;code&gt;map_dfr&lt;/code&gt; function to construct the data frame.&lt;/p&gt;
&lt;p&gt;The following code uses &lt;code&gt;dir()&lt;/code&gt; to get a list of all the files in the directory as a vector, the &lt;code&gt;keep()&lt;/code&gt; function trims the vector to only the GPX files, and each GPX file is then passed into &lt;code&gt;read_sf&lt;/code&gt; to read in the geo-data. The data is subset to only two columns, and a &lt;em&gt;route_id&lt;/em&gt; is created based on the numbers in the file name.&lt;/p&gt;
&lt;p&gt;Finally, geo-data in &lt;code&gt;sf&lt;/code&gt; lives in a GEOMETRY column. However, in order to get the latitudes and longitudes as individual columns I use &lt;code&gt;st_coordinates&lt;/code&gt; to create “X” and “Y” columns for longitude and latitude.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;all_routes &amp;lt;- map_dfr(
  #Get all gpx files in the directory
  keep(dir(&amp;#39;data&amp;#39;), ~str_detect(.x, &amp;quot;gpx&amp;quot;)),
  #Read them in
  ~read_sf(paste0(&amp;#39;data/&amp;#39;,.x), layer = &amp;quot;track_points&amp;quot;) %&amp;gt;% 
    #keep the segment id and the geometry field
    select(track_seg_point_id, geometry) %&amp;gt;% 
    # create a route_id based on the file
    mutate(route_id = parse_number(.x))
) %&amp;gt;% 
  #Extract Lat and Long as Columns
  cbind(., st_coordinates(.))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;After processing, the data looks like:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;right&#34;&gt;track_seg_point_id&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;route_id&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;X&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;Y&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;geometry&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;111694131&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-73.97597&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;40.77624&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;POINT (-73.97597 40.77624)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;111694131&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-73.97555&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;40.77605&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;POINT (-73.97555 40.77605)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;right&#34;&gt;2&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;111694131&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-73.97555&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;40.77605&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;POINT (-73.97555 40.77605)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;right&#34;&gt;3&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;111694131&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-73.97546&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;40.77582&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;POINT (-73.97546 40.77582)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;right&#34;&gt;4&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;111694131&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-73.97546&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;40.77582&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;POINT (-73.97546 40.77582)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;right&#34;&gt;5&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;111694131&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-73.97552&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;40.77527&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;POINT (-73.97552 40.77527)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;
&lt;div id=&#34;combining-runs-and-routes&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Combining Runs and Routes&lt;/h3&gt;
&lt;p&gt;With all the workouts in the &lt;code&gt;runs&lt;/code&gt; data and all the routes in the &lt;code&gt;all_routes&lt;/code&gt; data, a simple inner join will combine them. This will duplicate routes that I ran multiple times, which in this case is the desired behavior.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#Join Routes to Runs to Duplicate 
runs_and_routes &amp;lt;- runs %&amp;gt;% 
  inner_join(all_routes, by = &amp;quot;route_id&amp;quot;) &lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;creating-a-map-of-nyc&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Creating a map of NYC&lt;/h3&gt;
&lt;p&gt;Since the goal is to create a heatmap of the various routes I ran as part of marathon training, I need a map that contains all of the possible roads in NYC. The &lt;code&gt;tigris&lt;/code&gt; package provides access to US Census TIGER shapefiles. One of the levels is “roads”, which can be downloaded using the &lt;code&gt;roads()&lt;/code&gt; function, where the first parameter is state and the second parameter is county (New York County is Manhattan):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;###Download Roads Map from Tigris
nyc &amp;lt;- roads(&amp;quot;NY&amp;quot;, &amp;quot;New York&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot() + geom_sf(data = nyc) + ggthemes::theme_map()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2021/04/01/heatmapping-my-new-york-city-marathon-training/index_files/figure-html/map1-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;The function provides road data for all of Manhattan. However, I did not run through every part of Manhattan, so it makes more sense to truncate the map to the areas where I did run.&lt;/p&gt;
&lt;p&gt;In order to do this, I first need to define a boundary box based on my routes. Given a geometry, the &lt;code&gt;st_bbox()&lt;/code&gt; function from &lt;code&gt;sf&lt;/code&gt; will return a “bbox” object containing the four corners of my routes.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;st_bbox(runs_and_routes$geometry)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##      xmin      ymin      xmax      ymax 
## -74.01880  40.70806 -73.93118  40.82113&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;However, this does not provide any padding around my running routes, which would make for a worse visualization. So I will use &lt;code&gt;map2_dbl&lt;/code&gt; to add 0.01 to the maximum values and subtract 0.01 from the minimum values to slightly expand the bounding box.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;### Construct Bounding Boxes and Expand Limits By A Delta
bbox &amp;lt;- map2_dbl(
  st_bbox(runs_and_routes$geometry),
  names(st_bbox(runs_and_routes$geometry)),
  ~if_else(str_detect(.y, &amp;#39;min&amp;#39;), .x - .01, .x + .01)
)

bbox&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##      xmin      ymin      xmax      ymax 
## -74.02880  40.69806 -73.92118  40.83113&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With an updated bounding box, I can now crop the initial map with my bounding box using the &lt;code&gt;st_crop()&lt;/code&gt; function. Also, in order to make the Coordinate Reference Systems the same, I use &lt;code&gt;st_crs()&lt;/code&gt; and &lt;code&gt;st_transform&lt;/code&gt; to make sure the NYC map is using the same coordinates as my routes.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#Set CRS for NYC to CRS for Running Routes And Crop to the Bounding Box
nyc2 &amp;lt;- st_transform(nyc, crs = st_crs(runs_and_routes$geometry)) %&amp;gt;% 
  st_crop(bbox)

ggplot() + geom_sf(data = nyc2) + ggthemes::theme_map()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2021/04/01/heatmapping-my-new-york-city-marathon-training/index_files/figure-html/nyc_map_2-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;We’ve now cut off Governor’s Island from the bottom left corner as well as parts of Northern Manhattan that I never ran to.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;constructing-the-heatmap&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Constructing the Heatmap&lt;/h2&gt;
&lt;p&gt;With the new basemap created and the route data in its own data frame. I can create the heatmap using &lt;code&gt;stat_density2d&lt;/code&gt; with the route data and &lt;code&gt;geom_sf&lt;/code&gt; with the map data. From the &lt;code&gt;stat_density2d&lt;/code&gt; piece I pass in the routes data and set the fill value to be the count at each X and Y using the &lt;code&gt;after_stat()&lt;/code&gt; option. The &lt;code&gt;n&lt;/code&gt; parameter sets the number of grid points in each directions for the density.&lt;/p&gt;
&lt;p&gt;The base map is very rectangular, tall but skinny, which made it difficult to add titles. To make things look better, I use &lt;code&gt;ggdraw&lt;/code&gt; from the &lt;code&gt;cowplot&lt;/code&gt; package to create a new drawing layer and add titles/captions to that layer.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;p &amp;lt;- ggplot() + 
  #Construct the Heatmap Portion
  stat_density2d(data = runs_and_routes,
                 aes(x = X, y = Y, fill = after_stat(count)),
                 geom = &amp;#39;tile&amp;#39;,
                 contour = F,
                 n = 1024
                 ) +
  #Draw the Map of Manhattan
  geom_sf(data = nyc2, color = &amp;#39;#999999&amp;#39;, alpha = .15) + 
  scale_fill_viridis_c(option = &amp;quot;B&amp;quot;, guide = F) + 
  ggthemes::theme_map() + 
  theme(
    panel.background = element_rect(fill = &amp;#39;black&amp;#39;),
    plot.background = element_rect(fill = &amp;#39;black&amp;#39;)
  )

cowplot::ggdraw(p) + 
  labs(title = &amp;quot;JLaw&amp;#39;s Marathon Training Heatmap&amp;quot;,
       caption = &amp;quot;**Author**: JLaw&amp;quot;) + 
  theme(panel.background = element_rect(fill = &amp;quot;black&amp;quot;),
        plot.background = element_rect(fill = &amp;#39;black&amp;#39;),
        plot.title = element_text(color = &amp;quot;#DDDDDD&amp;quot;,
                                  family = &amp;#39;Nirmala UI&amp;#39;,
                                  #face = &amp;#39;bold&amp;#39;,
                                  size = 18),
        plot.caption = ggtext::element_markdown(color = &amp;#39;#DDDDDD&amp;#39;,
                                    family = &amp;#39;Calibri Light&amp;#39;,
                                    hjust = 1,
                                    size = 12)
  )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2021/04/01/heatmapping-my-new-york-city-marathon-training/index_files/figure-html/heatmap-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;concluding-thoughts&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Concluding Thoughts&lt;/h2&gt;
&lt;p&gt;I’m really happy with how this came out. It also reveals some of my running habits, mainly that I ran in Central Park a lot, and you can roughly tell where I worked at the time since that area is slightly &lt;em&gt;hotter&lt;/em&gt;. Some parts of Manhattan that I did run don’t show up well on the map because I may have only run there once. An exploration of how much of Manhattan I ran will be covered in a follow-up post.&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Exploring Wednesday Night Cable Ratings with OCR</title>
      <link>https://jlaw.netlify.app/2021/03/01/exploring-wednesday-night-cable-ratings-with-ocr/</link>
      <pubDate>Mon, 01 Mar 2021 00:00:00 +0000</pubDate>
      <guid>https://jlaw.netlify.app/2021/03/01/exploring-wednesday-night-cable-ratings-with-ocr/</guid>
      <description>
&lt;script src=&#34;index_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;p&gt;One of my guilty pleasure TV shows is &lt;a href=&#34;https://en.wikipedia.org/wiki/The_Challenge_(TV_series)&#34;&gt;MTV’s The Challenge&lt;/a&gt;. Debuting in the late 90s, the show pitted alumni from The Real World and Road Rules against each other in a series of physical events. Now in its 36th season, it’s found new popularity by importing challengers from other reality shows, both US and international, regularly topping Wednesday night ratings in the coveted 18-49 demographic.&lt;/p&gt;
&lt;p&gt;Looking at the Ratings on &lt;a href=&#34;http://www.showbuzzdaily.com/articles/showbuzzdailys-top-150-wednesday-cable-originals-network-finals-2-3-2021.html&#34;&gt;showbuzzdaily.com&lt;/a&gt; shows that the Challenge was in fact #1 in this demographic. However, it also scores incredibly low on the 50+ demo.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;ratings.PNG&#34; /&gt;&lt;/p&gt;
&lt;p&gt;So I figured that exploring the age and gender distributions of Wednesday Night Cable ratings would be interesting. The only caveat is… &lt;strong&gt;the data exists in an image&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;So for this blog post, I will be extracting the ratings data from the image and doing some exploration on popular shows by age and gender.&lt;/p&gt;
&lt;p&gt;Also, huge thanks to Thomas Mock and his &lt;a href=&#34;https://themockup.blog/posts/2021-01-18-reading-tables-from-images-with-magick/&#34;&gt;The Mockup Blog&lt;/a&gt; for serving as a starting point for learning &lt;code&gt;magick&lt;/code&gt;.&lt;/p&gt;
&lt;div id=&#34;using-magick-to-process-image-data&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Using magick to process image data&lt;/h2&gt;
&lt;p&gt;I’ll be using the &lt;code&gt;magick&lt;/code&gt; package to read in the image and do some processing to clean it up. Then I will use the &lt;em&gt;ocr()&lt;/em&gt; function from the &lt;code&gt;tesseract&lt;/code&gt; package to actually handle extracting the data from the image.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidyverse) #Data Manipulation
library(magick) #Image Manipulation
library(tesseract) #Extracting Text from the Image
library(patchwork) #Combining Multiple GGPLOTs Together&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The first step is reading in the raw image from the &lt;a href=&#34;http://www.showbuzzdaily.com/articles/showbuzzdailys-top-150-wednesday-cable-originals-network-finals-2-3-2021.html&#34;&gt;showbuzzdaily.com&lt;/a&gt; website which can be done through &lt;code&gt;magick&lt;/code&gt;’s &lt;em&gt;image_read()&lt;/em&gt; function.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;raw_img &amp;lt;- image_read(&amp;quot;http://www.showbuzzdaily.com/wp-content/uploads/2021/02/Final-Cable-2021-Feb-03-WED.png&amp;quot;)

image_ggplot(raw_img)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;index_files/figure-html/orig-1.png&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;p&gt;The next thing to notice is that while most of the data exists in a tabular format, there are also headers and footers that don’t follow the tabular structure. So I’ll use &lt;em&gt;image_crop()&lt;/em&gt; to keep only the tabular part of the image. The crop function uses a &lt;em&gt;geometry_area()&lt;/em&gt; helper function which takes four parameters. I struggled a bit with the documentation to get this working right, but eventually internalized &lt;em&gt;geometry_area(703, 1009, 0, 91)&lt;/em&gt; as “crop out 703 pixels of width and 1009 pixels of height, starting at x-position 0 (the left boundary) and y-position 91 pixels from the top”.&lt;/p&gt;
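&lt;p&gt;A quick way to demystify the argument order is to print the geometry string the helper builds. This is a minimal sketch assuming the &lt;code&gt;magick&lt;/code&gt; package is loaded:&lt;/p&gt;

```r
library(magick)

# geometry_area(width, height, x_off, y_off) just builds an ImageMagick
# geometry string of the form "WIDTHxHEIGHT+X+Y", so printing it shows
# exactly what the crop will do before applying it to an image.
geometry_area(703, 1009, 0, 91)
```

Printing the result (rather than passing it straight to `image_crop()`) makes the width/height vs. offset ordering easy to remember.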
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;chopped_image &amp;lt;- 
  raw_img %&amp;gt;% 
  #crop out width:703px and height:1009px starting +91px from the top
  image_crop(geometry_area(703, 1009, 0, 91)) 

image_ggplot(chopped_image)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;index_files/figure-html/cropping-1.png&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Now the non-tabular data (header and footer) have been removed.&lt;/p&gt;
&lt;p&gt;The &lt;em&gt;ocr()&lt;/em&gt; algorithm that will handle extracting the data from the image can struggle with parts of the image as-is. For example, it might think the color boundary between white and green is a character. Therefore, I’m going to do the best I can to clean up the image so that the &lt;em&gt;ocr()&lt;/em&gt; function has an easier time. Ultimately this required a lot of guess-and-check, but in the end I only needed two cleaning steps:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Running a morphology method over the image to remove the horizontal lines separating each group of 5 shows. A morphology method modifies each pixel based on the neighborhood of pixels around it, and thinning subtracts pixels from a shape. Since white is considered the foreground by default, I first negate the colors so the lines become foreground, run the thinning to erase them, and then negate again to restore the original colors.&lt;/li&gt;
&lt;li&gt;Turning everything to greyscale to remove remaining colors.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;I had tried to remove the color gradients, but it took much more effort and was ultimately not more effective than just going to greyscale.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;processed_image &amp;lt;- chopped_image %&amp;gt;% 
  image_negate() %&amp;gt;% #Flip the Colors
  # Remove the Horizontal Lines
  image_morphology(method = &amp;quot;Thinning&amp;quot;, kernel = &amp;quot;Rectangle:7x1&amp;quot;) %&amp;gt;% 
  # Flip the Colors back to the original
  image_negate() %&amp;gt;% 
  # Turn colors to greyscale
  image_quantize(colorspace = &amp;quot;gray&amp;quot;)


image_ggplot(processed_image)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;index_files/figure-html/processing-1.png&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;extracting-the-data-with-ocr&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Extracting the Data with OCR&lt;/h2&gt;
&lt;p&gt;Because I can be lazy, my first attempt at extraction was just to run &lt;em&gt;ocr()&lt;/em&gt; on the processed image and hope for the best. However, the best was somewhat frustrating. For example,&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ocr(processed_image) %&amp;gt;% 
  str_sub(end = str_locate(., &amp;#39;\\n&amp;#39;)[1])&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] &amp;quot;1 CHALLENGE: DOUBLE AGENMTV e:00PM 90/0.54 069 0.39 |047 053 0.20 |058 013} 920\n&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Just looking at the top row, there are a number of issues that come from using &lt;em&gt;ocr()&lt;/em&gt; directly on the table. The boundaries between sections show up as “|” or “/”, and sometimes the decimal point doesn’t appear.&lt;/p&gt;
&lt;p&gt;Fortunately the function allows you to “whitelist” characters in order to nudge the algorithm toward what it should expect to see. So rather than guess-and-check the image processing until everything works perfectly, I’ll write a function that lets me crop to individual columns and specify the proper whitelist for each column.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ocr_text &amp;lt;- function(col_width, col_start, format_code){
  
  ##For Stations Which Are Only Characters
  only_chars &amp;lt;- tesseract::tesseract(
    options = list(
      tessedit_char_whitelist = paste0(LETTERS, collapse = &amp;#39;&amp;#39;)
    )
  )
  
  #For Titles Which Are Letters + Numbers + Characters
  all_chars &amp;lt;- tesseract::tesseract(
    options = list(
      tessedit_char_whitelist = paste0(
        c(LETTERS, &amp;quot; &amp;quot;, &amp;quot;.0123456789-()/&amp;quot;), collapse = &amp;quot;&amp;quot;)
    )
  )
  
  #For Ratings which are just numbers and a decimal point
  ratings &amp;lt;- tesseract::tesseract(
    options = list(
      tessedit_char_whitelist = &amp;quot;0123456789 .&amp;quot;
    )
  )
  
  #Grab the Column starting at Col Start and with width Col with
  tmp &amp;lt;- processed_image %&amp;gt;% 
    image_crop(geometry_area(col_width, 1009, col_start, 0)) 
  
  # Run OCR with the correct whitelist and turn into a dataframe
  tmp %&amp;gt;% 
    ocr(engine = get(format_code)) %&amp;gt;% 
    str_split(&amp;quot;\n&amp;quot;) %&amp;gt;%
    unlist() %&amp;gt;%
    enframe() %&amp;gt;%
    select(-name) %&amp;gt;%
    filter(!is.na(value), str_length(value) &amp;gt; 0)
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The function above takes in a column width and a column start to crop the column, and a label to choose the whitelist for that specific column. The parameters are defined in a list and passed into &lt;code&gt;purrr&lt;/code&gt;’s &lt;em&gt;pmap()&lt;/em&gt; function. Finally, all the extracted columns are combined together.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#Run the function all the various columns
all_ocr &amp;lt;- list(col_width = c(168, 37, 33, 34, 35, 34),
                col_start = c(28, 196, 307, 346, 385, 598),
                format_code = c(&amp;quot;all_chars&amp;quot;, &amp;#39;only_chars&amp;#39;, rep(&amp;quot;ratings&amp;quot;, 4))) %&amp;gt;% 
  pmap(ocr_text) 

#Combine all the columns together and set the names
ratings &amp;lt;- all_ocr %&amp;gt;% 
  bind_cols() %&amp;gt;% 
  set_names(nm = &amp;quot;telecast&amp;quot;, &amp;quot;network&amp;quot;, &amp;quot;p_18_49&amp;quot;, &amp;quot;f_18_49&amp;quot;, &amp;quot;m_18_49&amp;quot;,
            &amp;#39;p_50_plus&amp;#39;) &lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;final-cleaning&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Final Cleaning&lt;/h2&gt;
&lt;p&gt;Even with the column-specific specifications, the &lt;em&gt;ocr()&lt;/em&gt; function did not get everything right. Due to the font, it had particular trouble distinguishing between &lt;code&gt;1&lt;/code&gt;s and &lt;code&gt;4&lt;/code&gt;s as well as &lt;code&gt;8&lt;/code&gt;s and &lt;code&gt;6&lt;/code&gt;s. Additionally, the decimal point was sometimes still missed. And since all network names were truncated in the original image, I decided to just recode them manually.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ratings_clean &amp;lt;- ratings %&amp;gt;% 
  #Fix Things where the decimal was missed
  mutate(across(p_18_49:p_50_plus, ~parse_number(.x)),
         across(p_18_49:p_50_plus, ~if_else(.x &amp;gt; 10, .x/100, .x)),
         #1s and 4s get kind of screwed up; same with 8s and 6s
         p_50_plus = case_when(
           telecast == &amp;#39;TUCKER CARLSON TONIGHT&amp;#39; ~ 2.71,
           telecast == &amp;#39;SISTAS SERIES S2&amp;#39; ~ 0.46,
           telecast == &amp;#39;LAST WORD W/L. ODONNEL&amp;#39; ~ 2.17,
           telecast == &amp;#39;SITUATION ROOM&amp;#39; &amp;amp; p_50_plus == 1.34 ~ 1.31,
           telecast == &amp;#39;MY 600-LB LIFE NIA&amp;#39; ~ 0.46,
           TRUE ~ p_50_plus
         ),
         #Clean up &amp;#39;W/&amp;#39; being read as &amp;#39;WI&amp;#39; and &amp;#39;11th&amp;#39; as &amp;#39;44th&amp;#39;
         telecast = case_when(
           telecast == &amp;#39;44TH HOUR WIB. WILLIAMS&amp;#39; ~ &amp;#39;11TH HOUR W/B. WILLIAMS&amp;#39;,
           telecast == &amp;#39;ALLIN WI CHRIS HAYES&amp;#39; ~ &amp;#39;ALL IN W/ CHRIS HAYES&amp;#39;,
           telecast == &amp;#39;BEAT WIARI MELBER&amp;#39; ~&amp;#39;BEAT W/ARI MELBER&amp;#39;,
           telecast == &amp;#39;SPORTSCENTER 124M L&amp;#39; ~ &amp;#39;SPORTSCENTER 12AM&amp;#39;,
           telecast == &amp;#39;MY 600-LB LIFE NIA&amp;#39; ~ &amp;#39;MY 600-LB LIFE&amp;#39;,
           TRUE ~ telecast
         ),
         # Turn to Title Case
         telecast = str_to_title(telecast),
         # Clean up random characters
         telecast = str_remove(telecast, &amp;#39; [L|F|S2|L B]+$&amp;#39;),
         #Clean up Network
         network = factor(case_when(
           network == &amp;#39;TURNI&amp;#39; ~ &amp;quot;TNT&amp;quot;,
           network == &amp;#39;MSNBI&amp;#39; ~ &amp;quot;MSNBC&amp;quot;,
           network == &amp;#39;FOXN&amp;#39; ~ &amp;quot;FoxNews&amp;quot;,
           network == &amp;#39;LIFETI&amp;#39; ~ &amp;quot;Lifetime&amp;quot;,
           network == &amp;#39;BLACK&amp;#39; ~ &amp;#39;BET&amp;#39;,
           network %in% c(&amp;#39;AEN&amp;#39;, &amp;#39;AGEN&amp;#39;) ~ &amp;#39;A&amp;amp;E&amp;#39;,
           network == &amp;#39;BRAVC&amp;#39; ~ &amp;#39;BRAVO&amp;#39;,
           network == &amp;#39;COME&amp;#39; ~ &amp;#39;COMEDY CENTRAL&amp;#39;,
           network == &amp;#39;NECS&amp;#39; ~ &amp;#39;NBC SPORTS&amp;#39;,
           network == &amp;#39;TBSN&amp;#39; ~ &amp;#39;TBS&amp;#39;,
           network == &amp;#39;TL&amp;#39; ~ &amp;#39;TLC&amp;#39;,
           TRUE ~ network
         ))
  )

knitr::kable(head(ratings_clean, 3))&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;telecast&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;network&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;p_18_49&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;f_18_49&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;m_18_49&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;p_50_plus&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Challenge Double Agen&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;MTV&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.54&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.69&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.39&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.13&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Nba Regular Season&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;ESPN&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.33&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.21&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.46&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.40&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Aew All Elite Wrestling&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;TNT&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.32&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.21&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.42&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.32&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Now everything should be ready for analysis.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;analysis-of-cable-ratings&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Analysis of Cable Ratings&lt;/h2&gt;
&lt;p&gt;The decimals in the table for cable ratings refer to the &lt;a href=&#34;https://en.wikipedia.org/wiki/Audience_measurement#:~:text=Ratings%20point%20is%20a%20measure,households%20in%20the%20United%20States&#34;&gt;percent of the population watching the show&lt;/a&gt;. For instance the &lt;code&gt;p_18_49&lt;/code&gt; field’s value of 0.54 means that 0.54% of the US 18-49 population watched The Challenge on February 3rd.&lt;/p&gt;
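&lt;p&gt;To make the scale concrete, a ratings point can be converted into an implied audience size. The sketch below assumes a US 18-49 population of roughly 136 million; that figure is a placeholder for illustration, not a number from the ratings source:&lt;/p&gt;

```r
# Convert a ratings point (a percent of the demo's population) into an
# implied number of viewers. The 136 million 18-49 population is an
# assumed placeholder for illustration only.
us_18_49_pop <- 136e6

rating_to_viewers <- function(rating, population) {
  (rating / 100) * population
}

# The Challenge's 0.54 rating would imply roughly 734,000 viewers aged 18-49
rating_to_viewers(0.54, us_18_49_pop)
#> [1] 734400
```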
&lt;div id=&#34;the-most-popular-shows-on-wednesday-night-overall-18-49-and-by-gender&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;The Most Popular Shows on Wednesday Night Overall 18-49 and By Gender&lt;/h3&gt;
&lt;p&gt;The first question is: what are the most popular shows for the 18-49 demographic, both for combined genders and broken apart by gender? These combined plots use the &lt;code&gt;patchwork&lt;/code&gt; package to merge the three ggplots into a single plot with a common legend.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;##Create Fixed Color Palette For Networks
cols &amp;lt;- scales::hue_pal()(n_distinct(ratings_clean$network))
names(cols) &amp;lt;- levels(ratings_clean$network)

##Top Show By the Key Demo (Combined)
key_all &amp;lt;- ratings_clean %&amp;gt;% 
  slice_max(p_18_49, n = 10) %&amp;gt;% 
  ggplot(aes(x = fct_reorder(telecast, p_18_49), y = p_18_49, fill = network)) + 
    geom_col() + 
    geom_text(aes(label = p_18_49 %&amp;gt;% round(2)), nudge_y = 0.015) + 
    scale_y_continuous(expand = expansion(mult = c(0, .1))) + 
    scale_fill_manual(values = cols) + 
    labs(x = &amp;quot;&amp;quot;, title = &amp;quot;All Genders&amp;quot;, y = &amp;#39;&amp;#39;, fill = &amp;#39;&amp;#39;) + 
    coord_flip() + 
    cowplot::theme_cowplot() + 
    theme(
      axis.text.x = element_blank(),
      axis.ticks = element_blank(),
      axis.line.x = element_blank(),
      plot.title.position = &amp;#39;plot&amp;#39;
    )

#Male Ratings only
key_male &amp;lt;- ratings_clean %&amp;gt;% 
  slice_max(m_18_49, n = 5) %&amp;gt;% 
  ggplot(aes(x = fct_reorder(telecast, m_18_49), y = m_18_49, fill = network)) + 
  geom_col() + 
  geom_text(aes(label = m_18_49 %&amp;gt;% round(2)), nudge_y = .045) + 
  scale_y_continuous(expand = expansion(mult = c(0, .1))) + 
  scale_fill_manual(values = cols, guide = F) + 
  labs(x = &amp;quot;&amp;quot;, title = &amp;quot;Male&amp;quot;, y = &amp;#39;&amp;#39;) + 
  coord_flip() + 
  cowplot::theme_cowplot() + 
  theme(
    axis.text.x = element_blank(),
    axis.ticks = element_blank(),
    axis.line.x = element_blank(),
    plot.title.position = &amp;#39;plot&amp;#39;
  )

# Female rating only
key_female &amp;lt;- ratings_clean %&amp;gt;% 
  slice_max(f_18_49, n = 5) %&amp;gt;% 
  ggplot(aes(x = fct_reorder(telecast, f_18_49), y = f_18_49, fill = network)) + 
  geom_col() + 
  geom_text(aes(label = f_18_49 %&amp;gt;% round(2)), nudge_y = .065) + 
  scale_y_continuous(expand = expansion(mult = c(0, .1))) + 
  scale_fill_manual(values = cols, guide = F) + 
  labs(x = &amp;quot;&amp;quot;, title = &amp;quot;Female&amp;quot;, y = &amp;#39;&amp;#39;) + 
  coord_flip() + 
  cowplot::theme_cowplot() + 
  theme(
    axis.text.x = element_blank(),
    axis.ticks = element_blank(),
    axis.line.x = element_blank(),
    plot.title.position = &amp;#39;plot&amp;#39;
  )
    
# Combining everything with patchwork syntax
key_all / (key_male | key_female) +
  plot_layout(guides = &amp;quot;collect&amp;quot;) + 
  plot_annotation(
    title = &amp;quot;**Wednesday Night Cable Ratings (Feb 3rd, 2021)**&amp;quot;,
    caption = &amp;quot;*Source:* Showbuzzdaily.com&amp;quot;
  ) &amp;amp; theme(legend.position = &amp;#39;bottom&amp;#39;,
            plot.title = ggtext::element_markdown(size = 14),
            plot.caption = ggtext::element_markdown())&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;index_files/figure-html/most_popular_18_49-1.png&#34; width=&#34;768&#34; /&gt;&lt;/p&gt;
&lt;p&gt;From the chart it’s clear that the Challenge is fairly dominant in the 18-49 demographic, scoring 0.21 points (or 1.63x) higher than the 2nd-highest show. And while the Challenge is popular with both genders, it’s the most popular show among 18-49 Females but only 3rd for 18-49 Males, behind an NBA game and AEW professional wrestling.&lt;/p&gt;
&lt;p&gt;Also, because the networks for My 600-lb Life (TLC) and Sistas (BET) weren’t in the overall top 10, I couldn’t figure out how to include them in the legend. If anyone has any ideas, please let me know in the comments.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;the-most-male-dominant-female-dominant-and-gender-balanced-shows&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;The Most Male-Dominant, Female Dominant, and Gender-Balanced Shows&lt;/h3&gt;
&lt;p&gt;From the above chart it’s clear that some shows skew Male (sports) and some skew Female (reality shows like Married at First Sight, My 600-lb Life, and Real Housewives). But I can look at that more directly by comparing the ratio of the Female 18-49 rating to the Male 18-49 rating to determine the gender skew of each show. I break the shows into categories of &lt;em&gt;Male Skewed&lt;/em&gt;, &lt;em&gt;Female Skewed&lt;/em&gt;, and &lt;em&gt;Balanced&lt;/em&gt; (where the Female/Male ratio is closest to 1).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;##Female / Male Ratio for Key Demo
bind_rows(
  ratings_clean %&amp;gt;% 
    mutate(f_m_ratio = f_18_49 / m_18_49) %&amp;gt;%
    slice_max(f_m_ratio, n = 5),
  ratings_clean %&amp;gt;% 
    mutate(f_m_ratio = f_18_49 / m_18_49) %&amp;gt;%
    slice_min(f_m_ratio, n = 5),
  ratings_clean %&amp;gt;% 
    mutate(f_m_ratio = f_18_49 / m_18_49,
           balance = abs(1-f_m_ratio)) %&amp;gt;% 
    slice_min(balance, n = 5)
) %&amp;gt;%
  mutate(balance = f_m_ratio-1) %&amp;gt;% 
  ggplot(aes(x = m_18_49, y = f_18_49, fill = balance)) + 
    ggrepel::geom_label_repel(aes(label = telecast)) + 
    geom_abline(lty = 2) + 
    scale_fill_gradient2(high = &amp;#39;#8800FF&amp;#39;,mid = &amp;#39;#BBBBBB&amp;#39;, low = &amp;#39;#02C2AD&amp;#39;,
                         midpoint = 0, guide = F) + 
    labs(title = &amp;quot;Comparing 18-49 Demographics by Gender&amp;quot;,
         subtitle = &amp;#39;Cable Feb 3rd, 2021&amp;#39;,
         caption = &amp;quot;*Source:* showbuzzdaily.com&amp;quot;,
         x = &amp;quot;Males 18-49 Ratings&amp;quot;,
         y = &amp;quot;Females 18-49 Ratings&amp;quot;) + 
    cowplot::theme_cowplot() + 
    theme(
      plot.title.position = &amp;#39;plot&amp;#39;,
      plot.caption = ggtext::element_markdown()
    )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;index_files/figure-html/gender_break-1.png&#34; width=&#34;768&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Sure enough, the most Male-dominated shows are sports-related: 2 NBA games, an NBA pre-game show, an episode of Sportscenter, and a sports talking-heads show. The Female-skewed shows are also not surprising, with Married at First Sight, Sistas, My 600-lb Life, and Real Housewives of Salt Lake City topping the list. For the balanced category I did not have much of an expectation, but all the programs seem to be news shows or news-adjacent, like the Daily Show… which I guess makes sense.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;most-popular-shows-for-the-50-demographic&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Most Popular Shows for the 50+ Demographic&lt;/h3&gt;
&lt;p&gt;Turning away from the 18-49 demographic I can also look at the most popular shows for the 50+ demographic. Unfortunately, there is not a 50+ gender breakdown so I can only look at the overall.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ratings_clean %&amp;gt;% 
  slice_max(p_50_plus, n = 10) %&amp;gt;% 
  ggplot(aes(x = fct_reorder(telecast, p_50_plus), y = p_50_plus,  fill = network)) + 
  geom_col() + 
  geom_text(aes(label = p_50_plus %&amp;gt;% round(2)), nudge_y = 0.15) + 
  scale_y_continuous(expand = expansion(mult = c(0, .1))) + 
  labs(x = &amp;quot;&amp;quot;, title = &amp;quot;Top 10 Cable Shows for the 50+ Demographic&amp;quot;,
       y = &amp;#39;&amp;#39;,
       subtitle = &amp;quot;Wednesday, Feb 3rd 2021&amp;quot;,
       caption = &amp;quot;*Source:* Showbuzzdaily.com&amp;quot;,
       fill = &amp;#39;&amp;#39;) + 
  coord_flip() + 
  cowplot::theme_cowplot() + 
  theme(
    axis.text.x = element_blank(),
    axis.ticks = element_blank(),
    axis.line.x = element_blank(),
    plot.title.position = &amp;#39;plot&amp;#39;,
    plot.caption = ggtext::element_markdown(),
    legend.position = &amp;#39;bottom&amp;#39;
  )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;index_files/figure-html/plus50_overall-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Interestingly, in the 50+ demo &lt;em&gt;ALL&lt;/em&gt; of the shows are news shows, and they come from only 3 networks: two on CNN, two on Fox News, and six on MSNBC. Again, I didn’t have a ton of expectations, but it was surprising to see how homogeneous the 50+ demographic was.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;the-oldest-and-youngest-shows-in-the-top-50&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;The Oldest and Youngest Shows in the Top 50&lt;/h3&gt;
&lt;p&gt;Similar to the Most Male and Most Female shows in the Top 50 Cable Programs, I’d like to see which shows skew older vs. younger. To do this, I’ll rank-order the 18-49 demo and the 50+ demo and plot the ranks against each other. There are some massive caveats here: my data is the Top 50 shows by the 18-49 demo, so it’s not clear that the 50+ demo is fully represented. Additionally, popularity for each dimension is relative, since I don’t know the actual number of people in each demo. Finally, since both scales are ranked, the plot won’t show the full distance between levels of popularity (e.g., The Challenge is much more popular than the next-highest show for 18-49). This was done to produce a better-looking visualization.&lt;/p&gt;
&lt;p&gt;I ran a K-means clustering algorithm to pick text colors and make the differences more apparent. There isn’t much rigor to this beyond my assumption that 5 clusters would probably make sense (one for each corner and one for the middle).&lt;/p&gt;
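&lt;p&gt;As a quick aside on the ranking step: &lt;em&gt;min_rank()&lt;/em&gt; assigns tied values the same rank and skips the ranks that would have followed, which matters when two shows post identical ratings. A minimal sketch with a made-up ratings vector:&lt;/p&gt;

```r
library(dplyr)

# min_rank() gives tied values the same (minimum) rank and skips the
# next rank(s); smaller values get smaller ranks.
x <- c(0.54, 0.33, 0.33, 0.21)
min_rank(x)
#> [1] 4 2 2 1

# Equivalent in base R:
rank(x, ties.method = "min")
#> [1] 4 2 2 1
```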
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#Rank Order the Shows for the 2 Columns
dt &amp;lt;- ratings_clean %&amp;gt;% 
  transmute(
    telecast,
    young_rnk = min_rank(p_18_49),
    old_rnk = min_rank(p_50_plus),
  ) 

# Run K-Means Clustering Algorithm
km &amp;lt;- kmeans(dt %&amp;gt;% select(-telecast), 
             centers = 5, nstart = 10)

#Add the cluster label back to the data
dt2 &amp;lt;- dt %&amp;gt;%
  mutate(cluster = km$cluster)

#Plot
ggplot(dt2, aes(x = young_rnk, y = old_rnk, color = factor(cluster))) + 
  ggrepel::geom_text_repel(aes(label = telecast), size = 3) +
  scale_color_discrete(guide = F) + 
  scale_x_continuous(breaks = c(1, 50),
                     labels = c(&amp;quot;Less Popular&amp;quot;, &amp;quot;More Popular&amp;quot;)) + 
  scale_y_continuous(breaks = c(13, 54),
                     labels = c(&amp;quot;Less Popular&amp;quot;, &amp;quot;More Popular&amp;quot;)) + 
  coord_cartesian(xlim = c(-2, 54), ylim = c(0, 52)) + 
  labs(x = &amp;quot;Popularity Among 18-49&amp;quot;,
       y = &amp;quot;Popularity Among 50+&amp;quot;,
       title = &amp;quot;Visualizing Popularity of Wednesday Night Cable by Age&amp;quot;,
       subtitle = &amp;quot;Comparing 18-49 vs. 50+&amp;quot;) + 
  cowplot::theme_cowplot() + 
  theme(
    axis.ticks = element_blank(),
    axis.line = element_blank(),
    axis.text.y = element_text(angle = 90), 
    panel.background = element_rect(fill = &amp;#39;#EEEEEE&amp;#39;)

  )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;index_files/figure-html/unnamed-chunk-4-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Somewhat surprisingly (at least to me), Rachel Maddow and Tucker Carlson are the consensus most popular shows across the two demos. My beloved Challenge is very popular among the 18-49 demo and very unpopular among the 50+. Sports shows tended to be the least popular in either demo, and finally, certain MSNBC and Fox News shows were popular among the 50+ demo but not the 18-49.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;concluding-thoughts&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Concluding Thoughts&lt;/h2&gt;
&lt;p&gt;While I still love The Challenge and am happy for its popularity, its best days were probably about 10 years ago (sorry, not sorry). As far as the techniques in this post are concerned, I found extracting the data from an image to be an interesting challenge (no pun intended), but if the table were a tractable size I would probably enter the data manually rather than go through this again. Getting the data right required a lot of guess-and-check work with &lt;code&gt;magick&lt;/code&gt; and &lt;code&gt;tesseract&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;As for the analysis, I guess it’s good when things go as expected (the most popular shows by gender follow stereotypical gender conventions), but the most surprising thing to me was how much cable news dominated the 50+ demographic… and I guess the Daily Show is not as popular as I thought it would be.&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>When Did the US Senate Best Reflect the US Population?</title>
      <link>https://jlaw.netlify.app/2021/02/01/when-did-the-us-senate-best-reflect-the-us-population/</link>
      <pubDate>Mon, 01 Feb 2021 00:00:00 +0000</pubDate>
      <guid>https://jlaw.netlify.app/2021/02/01/when-did-the-us-senate-best-reflect-the-us-population/</guid>
      <description>
&lt;script src=&#34;index_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;div id=&#34;tldr&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;TL;DR&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;While this is the oldest Senate we’ve ever had, it’s not the most non-representative Senate when compared to the age distribution of the US population&lt;/li&gt;
&lt;li&gt;The most representative Senate was in the 1970s as the average Senator age declined while the average age in the US increased.&lt;/li&gt;
&lt;li&gt;The least representative Senate was in the 1990s, as the average age in the US declined while the average age of Senators continued the rise it began around 1980&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;intro&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Intro&lt;/h2&gt;
&lt;p&gt;The inspiration for this post stemmed from &lt;a href=&#34;https://www.wcd.fyi/features/senate-generations&#34;&gt;wcd.fyi’s post&lt;/a&gt; on “Which Generations Control the Senate” where the creator broke down the US Senate distribution by generations.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;senate_generations.png&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Upon seeing this visualization, my initial goal was to see whether certain generations’ trajectories were faster or slower than others and how that would shape our expectation of Senate control in the future. However, as that question expanded, and as I thought about how often we hear that the &lt;a href=&#34;https://gen.medium.com/why-is-congress-so-old-64f014a9d819&#34;&gt;Senate is old&lt;/a&gt; and doesn’t reflect the American population, I wanted to see whether or not that’s true.&lt;/p&gt;
&lt;p&gt;The purpose of this post is to determine &lt;strong&gt;when the US Senate most and least reflected the age distribution of the general US population&lt;/strong&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;getting-the-data&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Getting the Data&lt;/h2&gt;
&lt;p&gt;The data for this analysis will come from two primary sources. Information on the US Senators will come from the same &lt;a href=&#34;https://projects.propublica.org/api-docs/congress-api/&#34;&gt;ProPublica Congress API&lt;/a&gt; as the original visualization. Information on the US Population Age Distribution will come from a variety of sources from the &lt;a href=&#34;https://www.census.gov&#34;&gt;US Census Bureau&lt;/a&gt;.&lt;/p&gt;
&lt;div id=&#34;setting-up-the-libraries&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Setting up the libraries&lt;/h3&gt;
&lt;p&gt;While the workhorse functions for this analysis are the main &lt;code&gt;tidyverse&lt;/code&gt; data manipulation and visualization functions, I will be using &lt;code&gt;httr&lt;/code&gt; to access the Congress API and &lt;code&gt;tidycensus&lt;/code&gt; to access a subset of age distributions. Special shoutout to &lt;code&gt;readr&lt;/code&gt; for its various functions that help read the differently formatted files from the &lt;a href=&#34;https://www.census.gov&#34;&gt;Census Bureau&lt;/a&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidyverse) #Data Manipulation and Visualization
library(httr) #Accessing the ProPublica API
library(glue) #Manipulating Strings to Make API Calls Easier
library(lubridate) # Date Manipulation Functions
library(tidycensus) # Package for Accessing Census Data&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;getting-the-senate-data&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Getting the Senate Data&lt;/h3&gt;
&lt;p&gt;The data on the Senators comes from the &lt;a href=&#34;https://projects.propublica.org/api-docs/congress-api/&#34;&gt;ProPublica Congress API&lt;/a&gt;. According to its documentation you can retrieve a list of Senators for any congress from the 80th (1947) through 117th (2021). To get this data I’ll first write a function that takes in a congressional session and returns the desired data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;get_senate_data &amp;lt;- function(cngrs){
  
  # Issue request to API
  dt &amp;lt;- GET(url = glue(&amp;#39;https://api.propublica.org/congress/v1/{cngrs}/senate/members.json&amp;#39;),
            add_headers(&amp;quot;X-API-Key&amp;quot; = Sys.getenv(&amp;quot;PROPUBLICA_API_KEY&amp;quot;)))
  
  x &amp;lt;- content(dt)$results[[1]]$members %&amp;gt;% tibble(dt = .) %&amp;gt;% unnest_wider(dt) %&amp;gt;% 
    mutate(congress = cngrs,
           #The API only Contains 80th Congress Forward.  80th Congress was 1/1947
           start_year = (cngrs-80)*2 + 1947,
           # Use DOB to Infer Age
           age = as.numeric(ymd(paste(start_year, 01, 15, sep = &amp;#39;-&amp;#39;)) - ymd(date_of_birth))/365,
           # Bucket Age Using Conventional Census Buckets
           label = case_when(
             age &amp;lt;= 4 ~ &amp;#39;Under 5 years&amp;#39;,
             age &amp;lt;= 9 ~ &amp;#39;5 to 9 years&amp;#39;,
             age &amp;lt;= 14 ~ &amp;#39;10 to 14 years&amp;#39;,
             age &amp;lt;= 19 ~ &amp;#39;15 to 19 years&amp;#39;,
             age &amp;lt;= 24 ~ &amp;#39;20 to 24 years&amp;#39;,
             age &amp;lt;= 29 ~ &amp;#39;25 to 29 years&amp;#39;,
             age &amp;lt;= 34 ~ &amp;#39;30 to 34 years&amp;#39;,
             age &amp;lt;= 39 ~ &amp;#39;35 to 39 years&amp;#39;,
             age &amp;lt;= 44 ~ &amp;#39;40 to 44 years&amp;#39;,
             age &amp;lt;= 49 ~ &amp;#39;45 to 49 years&amp;#39;,
             age &amp;lt;= 54 ~ &amp;#39;50 to 54 years&amp;#39;,
             age &amp;lt;= 59 ~ &amp;#39;55 to 59 years&amp;#39;,
             age &amp;lt;= 64 ~ &amp;#39;60 to 64 years&amp;#39;,
             age &amp;lt;= 69 ~ &amp;#39;65 to 69 years&amp;#39;,
             age &amp;lt;= 74 ~ &amp;#39;70 to 74 years&amp;#39;,
             age &amp;lt;= 79 ~ &amp;#39;75 to 79 years&amp;#39;,
             age &amp;lt;= 84 ~ &amp;#39;80 to 84 years&amp;#39;,
             TRUE ~ &amp;#39;85 years&amp;#39;
         )
    )
  
  
  return(x)
  
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Some notes about this function:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The ProPublica API requires an API key that you need to register for. I’ve stored it in my .Renviron file so I can share the code without sharing my key.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;unnest_wider()&lt;/code&gt; function is part of a family of functions to help work with JSON output to turn lists of lists into more rectangular data.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;With the function in place, I can get all the Senate data with a single line that calls the API for each of the 38 Congresses and combines the results into a single tibble using &lt;code&gt;map_dfr&lt;/code&gt;, which applies the &lt;em&gt;get_senate_data&lt;/em&gt; function to each input (the numbers between 80 and 117).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;senate &amp;lt;- map_dfr(80:117, get_senate_data)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The API will return all of the Senators who appeared in that congressional session which, due to changes over the course of two years, can result in more than 2 senators per state. For simplicity, I’ll reduce the data to only the 2 senators who were there at the start of the congressional session. This is done using the heuristic that the Senators who were in place first will have smaller &lt;em&gt;govtrack_id&lt;/em&gt; numbers. Finally, senators without DOB information are removed.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;senate_clean &amp;lt;- senate %&amp;gt;%
  group_by(congress, state) %&amp;gt;%
  arrange(govtrack_id) %&amp;gt;% 
  slice(1:2) %&amp;gt;% 
  ungroup() %&amp;gt;% 
  filter(!is.na(date_of_birth))&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;getting-us-population-age-distributions-from-the-census-bureau&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Getting US Population Age Distributions from the Census Bureau&lt;/h3&gt;
&lt;p&gt;This process was a PITA. Since I wanted to match the coverage of the Senator data which ranged from 1947 through 2021, I needed to find US Population Age Distributions to match. While all this information was available on the Census website it comes from a combination of different files, file formats, and access methods. In summary:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;1947 - 1979:&lt;/strong&gt; Individual files per year that contain the population by each individual age from 0 to 84 and then 85+&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;1980 - 1989:&lt;/strong&gt; The entire decade exists in a &lt;a href=&#34;https://www2.census.gov/programs-surveys/popest/tables/1980-1990/state/asrh/s5yr8090.txt&#34;&gt;single fixed-width-file&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;1990 - 2000:&lt;/strong&gt; The entire decade exists in a &lt;a href=&#34;https://www2.census.gov/programs-surveys/popest/tables/1990-2000/national/totals/nat-agesex.txt&#34;&gt;single file&lt;/a&gt; but the format is too awful to deal with programmatically, so I rebuilt the file in Excel and used the &lt;code&gt;datapasta&lt;/code&gt; add-in to create the tibble.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;2001-2004:&lt;/strong&gt; Nicely existed in a &lt;a href=&#34;https://www2.census.gov/programs-surveys/popest/tables/2000-2005/national/asrh/nc-est2005-01.csv&#34;&gt;single file&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;2005-2019:&lt;/strong&gt; Retrieved from the American Community Survey (ACS) using the &lt;code&gt;tidycensus&lt;/code&gt; API.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;There were probably easier ways to get everything… but oh well. Since there’s a lot going on for these 5 sources, I’m not going to go into as much detail as I normally would in describing what’s happening, but it’s nothing too complicated.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;section&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;1947 - 1979&lt;/h3&gt;
&lt;p&gt;The process for reading these flat files isn’t too dissimilar from the process used on the ProPublica API. I write a function to handle an individual year and run &lt;code&gt;map_dfr&lt;/code&gt; on the list of years to create my data set. The one unique piece of this function is that the format of each year isn’t exactly the same, so it first reads the file to find where the data starts and then does the “official” read-in using the &lt;em&gt;skip&lt;/em&gt; parameter to start in the right place.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;get_1947_to_1979 &amp;lt;- function(yr){
  
  #Read In File
  c &amp;lt;- read_lines(glue(&amp;#39;https://www2.census.gov/programs-surveys/popest/tables/1900-1980/national/asrh/pe-11-{yr}.csv&amp;#39;))
  #Find where data starts
  c2 &amp;lt;- which(str_detect(c, &amp;#39;^0&amp;#39;))
  
  # Read in the actual file
  x &amp;lt;- suppressWarnings(read_csv(glue(&amp;#39;https://www2.census.gov/programs-surveys/popest/tables/1900-1980/national/asrh/pe-11-{yr}.csv&amp;#39;),
                skip = c2-2)) %&amp;gt;% 
    filter(!is.na(X2)) %&amp;gt;% 
    transmute(
      age = X1,
      population = X2,
      year = yr
    )
}

ages_1947_to_1979 &amp;lt;- map_dfr(1947:1979, get_1947_to_1979)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;section-1&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;1980 - 1989&lt;/h3&gt;
&lt;p&gt;The data for 1980 to 1989 comes from a single fixed-width file. To read it in, I use the &lt;code&gt;read_fwf&lt;/code&gt; function from &lt;code&gt;readr&lt;/code&gt;. It’s very similar to other &lt;code&gt;readr&lt;/code&gt; functions like &lt;code&gt;read_csv&lt;/code&gt;; the only difference is that you need to specify the positions of the data, which can be done in a wide variety of ways. Here I used &lt;code&gt;fwf_widths&lt;/code&gt; to tell the function how wide each column is and what to call each column.&lt;/p&gt;
&lt;p&gt;The file also contains information at a State level and contains sets for both genders combined, Males only, and Females only. The &lt;em&gt;rowid&lt;/em&gt; construction lets me pull out only the rows I need: the both-genders set and the rows with age segment data. Finally, the &lt;em&gt;group_by&lt;/em&gt; / &lt;em&gt;summarize&lt;/em&gt; aggregates the population over the State values.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ages_1980_to_1989 &amp;lt;- read_fwf(
  file = &amp;#39;https://www2.census.gov/programs-surveys/popest/tables/1980-1990/state/asrh/s5yr8090.txt&amp;#39;,
  fwf_widths(c(16, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 7),
             c(&amp;#39;Term&amp;#39;, &amp;#39;dropme&amp;#39;, &amp;#39;y1980&amp;#39;, &amp;#39;y1981&amp;#39;,&amp;#39;y1982&amp;#39;, &amp;#39;y1983&amp;#39;,&amp;#39;y1984&amp;#39;,
               &amp;#39;y1985&amp;#39;, &amp;#39;y1986&amp;#39;,&amp;#39;y1987&amp;#39;, &amp;#39;y1988&amp;#39;,&amp;#39;y1989&amp;#39;, &amp;#39;y1990&amp;#39;)),
  skip = 10
) %&amp;gt;% 
  mutate(rowid = row_number() %% 58) %&amp;gt;% 
  filter(rowid &amp;lt;= 20 &amp;amp; !rowid %in% c(0, 2, 1)) %&amp;gt;% 
  select(-dropme, -y1990, -rowid) %&amp;gt;% 
  gather(year, population, -Term) %&amp;gt;% 
  transmute(
    label = Term,
    year = as.numeric(str_remove_all(year, &amp;#39;y&amp;#39;)),
    population = as.numeric(population)
  ) %&amp;gt;% 
  group_by(label, year) %&amp;gt;% 
  summarize(population = sum(population), .groups = &amp;#39;drop&amp;#39;)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;section-2&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;1990 - 2000&lt;/h3&gt;
&lt;p&gt;The data for the 1990s comes from a &lt;a href=&#34;https://www2.census.gov/programs-surveys/popest/tables/1990-2000/national/totals/nat-agesex.txt&#34;&gt;single file&lt;/a&gt; in a very machine unfriendly format. Here I copied and pasted the data I needed into an Excel file and used &lt;code&gt;datapasta&lt;/code&gt; to copy it into R as a tibble. The wide-format data is then cleaned and turned into long-format data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ages_1990_to_2000 &amp;lt;- tibble::tribble(
                                   ~Age_Group,    ~y2000,    ~y1999,    ~y1998,    ~y1997,    ~y1996,    ~y1995,    ~y1994,    ~y1993,    ~y1992,    ~y1991,    ~y1990,
                       &amp;quot;Under 5 years.......&amp;quot;, 18945000L, 18942000L, 18989000L, 19099000L, 19292000L, 19532000L, 19700000L, 19674000L, 19492000L, 19189000L, 18853000L,
                       &amp;quot;5 to 9 years........&amp;quot;, 19681000L, 19947000L, 19929000L, 19754000L, 19439000L, 19096000L, 18752000L, 18442000L, 18293000L, 18205000L, 18062000L,
                       &amp;quot;10 to 14 years......&amp;quot;, 20017000L, 19548000L, 19242000L, 19097000L, 19004000L, 18853000L, 18716000L, 18508000L, 18102000L, 17679000L, 17198000L,
                       &amp;quot;15 to 19 years......&amp;quot;, 19894000L, 19748000L, 19542000L, 19146000L, 18708000L, 18203000L, 17743000L, 17375000L, 17180000L, 17235000L, 17765000L,
                       &amp;quot;20 to 24 years......&amp;quot;, 18693000L, 18026000L, 17678000L, 17488000L, 17508000L, 17982000L, 18389000L, 18785000L, 19047000L, 19156000L, 19135000L,
                       &amp;quot;25 to 29 years......&amp;quot;, 17625000L, 18209000L, 18575000L, 18820000L, 18933000L, 18905000L, 19107000L, 19570000L, 20140000L, 20713000L, 21236000L,
                       &amp;quot;30 to 34 years......&amp;quot;, 19564000L, 19727000L, 20168000L, 20739000L, 21313000L, 21825000L, 22133000L, 22227000L, 22240000L, 22157000L, 21912000L,
                       &amp;quot;35 to 39 years......&amp;quot;, 22044000L, 22545000L, 22615000L, 22636000L, 22553000L, 22296000L, 21978000L, 21605000L, 21098000L, 20530000L, 19982000L,
                       &amp;quot;40 to 44 years......&amp;quot;, 22769000L, 22268000L, 21883000L, 21378000L, 20812000L, 20259000L, 19716000L, 19209000L, 18807000L, 18761000L, 17795000L,
                       &amp;quot;45 to 49 years......&amp;quot;, 20059000L, 19356000L, 18853000L, 18467000L, 18430000L, 17458000L, 16678000L, 15931000L, 15359000L, 14099000L, 13824000L,
                       &amp;quot;50 to 54 years......&amp;quot;, 17626000L, 16446000L, 15722000L, 15158000L, 13928000L, 13642000L, 13195000L, 12728000L, 12055000L, 11648000L, 11370000L,
                       &amp;quot;55 to 59 years......&amp;quot;, 13452000L, 12875000L, 12403000L, 11755000L, 11356000L, 11086000L, 10931000L, 10678000L, 10483000L, 10422000L, 10474000L,
                       &amp;quot;60 to 64 years......&amp;quot;, 10757000L, 10514000L, 10263000L, 10061000L,  9997000L, 10046000L, 10077000L, 10236000L, 10438000L, 10581000L, 10619000L,
                       &amp;quot;65 to 69 years......&amp;quot;,  9414000L,  9447000L,  9592000L,  9777000L,  9901000L,  9926000L,  9967000L, 10013000L,  9974000L, 10027000L, 10077000L,
                       &amp;quot;70 to 74 years......&amp;quot;,  8758000L,  8771000L,  8798000L,  8751000L,  8789000L,  8831000L,  8736000L,  8616000L,  8468000L,  8244000L,  8023000L,
                       &amp;quot;75 to 79 years......&amp;quot;,  7425000L,  7329000L,  7215000L,  7083000L,  6891000L,  6700000L,  6586000L,  6483000L,  6398000L,  6280000L,  6147000L,
                       &amp;quot;80 to 84 years......&amp;quot;,  4968000L,  4817000L,  4732000L,  4661000L,  4575000L,  4478000L,  4360000L,  4255000L,  4140000L,  4039000L,  3935000L,
                       &amp;quot;85 to 89 years......&amp;quot;,  2734000L,  2625000L,  2554000L,  2477000L,  2415000L,  2352000L,  2300000L,  2247000L,  2178000L,  2104000L,  2051000L,
                       &amp;quot;90 to 94 years......&amp;quot;,  1196000L,  1148000L,  1116000L,  1078000L,  1043000L,  1017000L,   967000L,   916000L,   865000L,   827000L,   765000L,
                       &amp;quot;95 to 99 years......&amp;quot;,   369000L,   343000L,   323000L,   304000L,   291000L,   268000L,   250000L,   240000L,   231000L,   218000L,   206000L,
                       &amp;quot;100 years and over..&amp;quot;,    68000L,    59000L,    57000L,    54000L,    51000L,    48000L,    45000L,    43000L,    41000L,    40000L,    37000L
                       )  %&amp;gt;% 
  mutate(label = str_remove_all(Age_Group, &amp;#39;\\.&amp;#39;)) %&amp;gt;% 
  select(-Age_Group) %&amp;gt;% 
  gather(year, population, -label) %&amp;gt;% 
  mutate(year = as.numeric(str_remove_all(year, &amp;#39;y&amp;#39;)))&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;section-3&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;2001 - 2004&lt;/h3&gt;
&lt;p&gt;The &lt;a href=&#34;https://www2.census.gov/programs-surveys/popest/tables/2000-2005/national/asrh/nc-est2005-01.csv&#34;&gt;2001-2004 file&lt;/a&gt; is pretty similar to the 1980s file.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ages_2001_to_2004 &amp;lt;- read_csv(&amp;#39;https://www2.census.gov/programs-surveys/popest/tables/2000-2005/national/asrh/nc-est2005-01.csv&amp;#39;,
                              skip = 3) %&amp;gt;% 
  filter(between(row_number(), 2, 22)) %&amp;gt;% 
  gather(year, population, -X1) %&amp;gt;% 
  transmute(
    label = str_remove_all(X1, &amp;#39;\\.&amp;#39;),
    year = as.numeric(str_extract(year, &amp;#39;\\d{4}&amp;#39;)),
    population
  ) %&amp;gt;%
  filter(!is.na(year), between(year, 2001, 2004))&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;section-4&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;2005 - 2019&lt;/h3&gt;
&lt;p&gt;There’s probably a better way to do this, but my original plan was to find age buckets as granular as possible, and 2005 - 2019 was the first set of years I worked with. So I leveraged the &lt;code&gt;tidycensus&lt;/code&gt; package to access the data from the American Community Survey to get population estimates.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#Register API Key
census_api_key(Sys.getenv(&amp;quot;CENSUS_API_KEY&amp;quot;))

#Download Data Dictionary
vars &amp;lt;- load_variables(2019, &amp;#39;acs1&amp;#39;)

#Subset to Information For the Age Table
mapping &amp;lt;- vars %&amp;gt;% 
  filter(str_detect(name, &amp;#39;B01001_&amp;#39;))

# Define Function that Takes in a Year and Returns the Age Group Data
# Data provided at a State Level because I couldn&amp;#39;t figure out the 
# geography name for National.
get_2005_2019 &amp;lt;- function(yr){
  get_acs(
  geography = &amp;#39;state&amp;#39;,
  # Generate the 49 variable names B01001_001 through B01001_049
  variables = sprintf(&amp;#39;B01001_%03d&amp;#39;, 1:49),
  year = yr,
  survey = &amp;#39;acs1&amp;#39;
  ) %&amp;gt;% 
  mutate(year = yr) %&amp;gt;%
  inner_join(vars, by = c(&amp;quot;variable&amp;quot; = &amp;quot;name&amp;quot;)) %&amp;gt;% 
  filter(str_detect(label, &amp;quot;years&amp;quot;)) %&amp;gt;% 
  mutate(label = str_remove_all(label, &amp;quot;Estimate.*!!&amp;quot;))
}

# Download the Data from the API and Clean Up
ages_2005_to_2019 &amp;lt;- map_dfr(2005:2019, get_2005_2019) %&amp;gt;% 
  group_by(year, label) %&amp;gt;% 
  summarize(population = sum(estimate), .groups = &amp;#39;drop&amp;#39;)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;final-data-preparation&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Final Data Preparation&lt;/h2&gt;
&lt;p&gt;In addition to having different file formats, each of the files had different age groupings. They’re not wildly different from each other, but we need standardized groupings to carry out the analysis:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;all_years &amp;lt;- ages_1947_to_1979 %&amp;gt;% 
  mutate(
    age = parse_number(age),
    label = case_when(
      age &amp;lt;= 4 ~ &amp;#39;Under 5 years&amp;#39;,
      age &amp;lt;= 9 ~ &amp;#39;5 to 9 years&amp;#39;,
      age &amp;lt;= 14 ~ &amp;#39;10 to 14 years&amp;#39;,
      age &amp;lt;= 19 ~ &amp;#39;15 to 19 years&amp;#39;,
      age &amp;lt;= 24 ~ &amp;#39;20 to 24 years&amp;#39;,
      age &amp;lt;= 29 ~ &amp;#39;25 to 29 years&amp;#39;,
      age &amp;lt;= 34 ~ &amp;#39;30 to 34 years&amp;#39;,
      age &amp;lt;= 39 ~ &amp;#39;35 to 39 years&amp;#39;,
      age &amp;lt;= 44 ~ &amp;#39;40 to 44 years&amp;#39;,
      age &amp;lt;= 49 ~ &amp;#39;45 to 49 years&amp;#39;,
      age &amp;lt;= 54 ~ &amp;#39;50 to 54 years&amp;#39;,
      age &amp;lt;= 59 ~ &amp;#39;55 to 59 years&amp;#39;,
      age &amp;lt;= 64 ~ &amp;#39;60 to 64 years&amp;#39;,
      age &amp;lt;= 69 ~ &amp;#39;65 to 69 years&amp;#39;,
      age &amp;lt;= 74 ~ &amp;#39;70 to 74 years&amp;#39;,
      age &amp;lt;= 79 ~ &amp;#39;75 to 79 years&amp;#39;,
      age &amp;lt;= 84 ~ &amp;#39;80 to 84 years&amp;#39;,
      TRUE ~ &amp;#39;85 years&amp;#39;
    )
  ) %&amp;gt;% 
  group_by(year, label) %&amp;gt;% 
  summarize(population = sum(population), .groups = &amp;#39;drop&amp;#39;) %&amp;gt;% 
  rbind(ages_1980_to_1989) %&amp;gt;% 
  rbind(
    ages_1990_to_2000 %&amp;gt;% 
      rbind(ages_2001_to_2004) %&amp;gt;% 
      mutate(
        label = if_else(label %in% c(&amp;#39;85 to 89 years&amp;#39;,
                                     &amp;#39;90 to 94 years&amp;#39;,
                                     &amp;#39;95 to 99 years&amp;#39;,
                                     &amp;#39;100 years and over&amp;#39;),
                        &amp;#39;85 years&amp;#39;,
                        label
        )
      )
  ) %&amp;gt;% 
  rbind(
    ages_2005_to_2019 %&amp;gt;% 
      mutate(label = case_when(
        label %in% c(&amp;quot;15 to 17 years&amp;quot;, &amp;quot;18 and 19 years&amp;quot;) ~ &amp;quot;15 to 19 years&amp;quot;,
        label %in% c(&amp;quot;20 years&amp;quot;, &amp;quot;21 years&amp;quot;, &amp;quot;22 to 24 years&amp;quot;) ~ &amp;quot;20 to 24 years&amp;quot;,
        label %in% c(&amp;quot;60 and 61 years&amp;quot;, &amp;quot;62 to 64 years&amp;quot;) ~ &amp;quot;60 to 64 years&amp;quot;,
        label %in% c(&amp;quot;65 and 66 years&amp;quot;, &amp;quot;67 to 69 years&amp;quot;) ~ &amp;quot;65 to 69 years&amp;quot;,
        label == &amp;#39;85 years and over&amp;#39; ~ &amp;#39;85 years&amp;#39;,
        TRUE ~ label
        )
      )
  ) %&amp;gt;% 
  group_by(year, label) %&amp;gt;%
  summarize(population = sum(population), .groups = &amp;#39;drop&amp;#39;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;By law, a US Senator must be at least 30 years old (technically this wasn’t always true, as 4 US Senators served in their late 20s, but those were all in the early 1800s and so out of scope for this analysis). To create a comparable population, I’ll limit the US population data to those 30 and older and create the share of 30+ population by age:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;eligible_age_bckt &amp;lt;- all_years %&amp;gt;% 
  filter(parse_number(label) &amp;gt;= 30) %&amp;gt;%
  add_count(year, wt = population, name = &amp;#39;total_population&amp;#39;) %&amp;gt;% 
  mutate(pct = population / total_population)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I’ll summarize the Senate data by the same groupings and create the % of Senators by age:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;senate_age_bckt &amp;lt;- senate_clean %&amp;gt;%
  count(start_year, label, name = &amp;#39;num_senators&amp;#39;) %&amp;gt;% 
  add_count(start_year, wt = num_senators, name = &amp;quot;total_senators&amp;quot;) %&amp;gt;% 
  mutate(pct = num_senators / total_senators)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Finally, we’ll complete the data building steps by stacking the US population data and Senate data on top of each other:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pop_senate_merged &amp;lt;- 
  senate_age_bckt %&amp;gt;% 
  transmute(
    year = start_year, label, pct, grp = &amp;quot;Senators&amp;quot;
  ) %&amp;gt;% 
  rbind(eligible_age_bckt %&amp;gt;% 
          transmute(year, label, pct, grp = &amp;quot;US Pop. Over 30&amp;quot;))
  

knitr::kable(head(pop_senate_merged, 3))&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;right&#34;&gt;year&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;label&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;pct&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;grp&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;right&#34;&gt;1947&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;35 to 39 years&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0315789&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Senators&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;right&#34;&gt;1947&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;40 to 44 years&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0947368&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Senators&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;right&#34;&gt;1947&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;45 to 49 years&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0842105&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Senators&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;
&lt;div id=&#34;looking-at-similarity-of-senate-vs.-us-population&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Looking At Similarity of Senate vs. US Population&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Now onto the main course!&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Our goal is to determine when the distribution of ages in the Senate is most similar / dissimilar to the distribution of ages in the US over-30 population. There are many different ways to calculate similarity, but I’m going to use &lt;em&gt;mean absolute difference&lt;/em&gt; because it’s simple and the results are pretty similar to other methods I tried.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;dist_measures &amp;lt;- pop_senate_merged %&amp;gt;% 
  #Convert from Long Format to Wide Format
  spread(grp, pct) %&amp;gt;% 
  # Replace NAs with 0s 
  replace_na(list(Senators = 0, `US Pop. Over 30` = 0)) %&amp;gt;% 
  # Calculate Absolute Difference per Age Bucket
  mutate(distance = abs(Senators - `US Pop. Over 30`)) %&amp;gt;% 
  # Limit to Only Odd Years To Align with Congressional Sessions
  # There isn&amp;#39;t 2021 Data in the Census Data
  filter(year %% 2 == 1, year != 2021) %&amp;gt;%
  # Average the Absolute Deviations
  group_by(year) %&amp;gt;% 
  summarize(distance = mean(distance))&lt;/code&gt;&lt;/pre&gt;
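&lt;p&gt;To make the metric concrete, here’s a tiny worked example with made-up shares (not the real data) for three age buckets in a single year; the distance is just the average gap between the two distributions:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Hypothetical shares for three age buckets in one year
senators &amp;lt;- c(0.05, 0.40, 0.55)
us_pop   &amp;lt;- c(0.30, 0.40, 0.30)

# Mean absolute difference between the two distributions
mean(abs(senators - us_pop))
#&amp;gt; [1] 0.1666667&lt;/code&gt;&lt;/pre&gt;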
&lt;p&gt;Then the dissimilarity over time can be plotted:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;dist_measures %&amp;gt;% 
  ggplot(aes(x = year, y = distance)) + 
    geom_line(lwd = 1.5, color = &amp;#39;blue&amp;#39;) + 
    scale_x_continuous(breaks = seq(1950, 2020, 10)) + 
    labs(x = &amp;quot;Year&amp;quot;, y = &amp;quot;Distance between Senate and US Pop&amp;quot;, 
         title = &amp;quot;When was the US Senate Most/Least Representative of the US Population?&amp;quot;,
         subtitle = &amp;quot;1947 - 2019&amp;quot;) + 
    cowplot::theme_cowplot()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;index_files/figure-html/plot-1.png&#34; width=&#34;960&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Based on the above, the most representative era for the US Senate was in the 70s, when the distance was minimized, while the least representative time was in the late 80s/early 90s. The three most representative years are 1971, 1979, and 1973, while the least representative years are 1989, 1993, and 1991. What was surprising is that the present Senate is actually more representative than in the 90s and about on the level it was in the 60s.&lt;/p&gt;
&lt;p&gt;To get a better idea of what makes these years representative or non-representative we can look at the distributions for the most similar year, 1971, the most dissimilar year, 1989, and the most recent year available, 2019.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pop_senate_merged %&amp;gt;% 
  filter(year %in% c(1971, 1989, 2019)) %&amp;gt;% 
  ggplot(aes(x = grp, y = pct, fill = fct_rev(label))) + 
  geom_col() + 
  geom_text(aes(label =if_else(pct &amp;gt; .01,
                               paste(label, pct %&amp;gt;% scales::percent(accuracy = 1), sep = &amp;#39;: &amp;#39;), &amp;quot;&amp;quot;)),
            position = position_stack(vjust = .5)) + 
  scale_fill_discrete(guide = F) + 
  scale_x_discrete(expand = c(0, 0)) + 
  scale_y_continuous(expand = c(0, 0), labels = scales::percent_format(),
                     breaks = seq(0, 1, .2)) + 
  facet_wrap(~year, nrow = 1) + 
  labs(title = &amp;quot;Difference in US Senate Age Distribution vs. US Population&amp;quot;,
       subtitle = &amp;quot;1971 (Most Similar),  1989 (Most Different), 2019 (Most Recent)&amp;quot;,
       x = &amp;quot;&amp;quot;,
       y = &amp;quot;% of Group&amp;quot;) + 
  cowplot::theme_cowplot()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;index_files/figure-html/drill-down-1.png&#34; width=&#34;960&#34; /&gt;&lt;/p&gt;
&lt;p&gt;While the Senate never represented the 30-45 population well, in 1971 the distributions were closer, with 15% of Senators vs. 35% of the population. This is &lt;strong&gt;much&lt;/strong&gt; closer than in 1989, when this group made up 4% of Senators vs. 43% of the population, and closer than today (2019), when it’s 3% of Senators vs. 32% of the population.&lt;/p&gt;
&lt;p&gt;Finally, between 1989 and 2019 it looks like the glut of Senators who were between 45 and 60 in 1989 (66% of the Senate vs. 26% of the population) has hung around: in 2019 this group would be 65 to 80, which still makes up 44% of the Senate vs. 21% of the US population.&lt;/p&gt;
&lt;p&gt;So while this &lt;strong&gt;is the oldest Senate we’ve ever had&lt;/strong&gt; its not the most non-representative to the US Population as the population has gotten older too.&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Predicting the Winner of Super Bowl LV</title>
      <link>https://jlaw.netlify.app/2021/01/07/predicting-the-winner-of-super-bowl-lv/</link>
      <pubDate>Thu, 07 Jan 2021 00:00:00 +0000</pubDate>
      <guid>https://jlaw.netlify.app/2021/01/07/predicting-the-winner-of-super-bowl-lv/</guid>
      <description>


&lt;div id=&#34;tldr&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;TL;DR&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Using Pythagorean expectation we should expect the Baltimore Ravens to be Super Bowl Champions&lt;/li&gt;
&lt;li&gt;Using a Bradley-Terry model we should expect the Kansas City Chiefs to be Super Bowl champions&lt;/li&gt;
&lt;li&gt;Seems like it will be a good year for the AFC&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;It’s Playoff Time in the NFL!&lt;/strong&gt; While my team has unfortunately missed the playoffs, I wanted to take advantage of the season to try to predict who will win the Super Bowl this year through two different mechanisms:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Pythagorean Expectation&lt;/li&gt;
&lt;li&gt;Simulation using Bradley-Terry Models&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;div id=&#34;getting-the-data&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Getting the Data&lt;/h2&gt;
&lt;p&gt;While more historical data would ideally be available, I’m going to keep this exercise quick and dirty by using only the data from the 2020 NFL Regular Season, which recently concluded. Data for this season can be easily imported using the &lt;code&gt;nflfastR&lt;/code&gt; package. Using the &lt;code&gt;fast_scraper_schedules&lt;/code&gt; function, I can quickly get all the games and their results for the 2020 season.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidyverse)
library(nflfastR)
library(scales)

#Get Season 2020 Schedule and results
nfl_games &amp;lt;- fast_scraper_schedules(2020) %&amp;gt;% 
  #Weeks Beyond Week 17 Are the Playoffs
  filter(week &amp;lt;= 17)

knitr::kable(head(nfl_games, 3))&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;colgroup&gt;
&lt;col width=&#34;8%&#34; /&gt;
&lt;col width=&#34;3%&#34; /&gt;
&lt;col width=&#34;5%&#34; /&gt;
&lt;col width=&#34;2%&#34; /&gt;
&lt;col width=&#34;6%&#34; /&gt;
&lt;col width=&#34;4%&#34; /&gt;
&lt;col width=&#34;4%&#34; /&gt;
&lt;col width=&#34;5%&#34; /&gt;
&lt;col width=&#34;5%&#34; /&gt;
&lt;col width=&#34;6%&#34; /&gt;
&lt;col width=&#34;6%&#34; /&gt;
&lt;col width=&#34;6%&#34; /&gt;
&lt;col width=&#34;12%&#34; /&gt;
&lt;col width=&#34;4%&#34; /&gt;
&lt;col width=&#34;4%&#34; /&gt;
&lt;col width=&#34;4%&#34; /&gt;
&lt;col width=&#34;6%&#34; /&gt;
&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;game_id&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;season&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;game_type&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;week&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;gameday&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;weekday&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;gametime&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;away_team&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;home_team&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;away_score&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;home_score&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;home_result&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;stadium&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;location&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;roof&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;surface&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;old_game_id&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2020_01_HOU_KC&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2020&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;REG&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2020-09-10&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Thursday&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;20:20&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;HOU&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;KC&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;20&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;34&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;14&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Arrowhead Stadium&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Home&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;outdoors&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NA&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2020091000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2020_01_SEA_ATL&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2020&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;REG&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2020-09-13&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Sunday&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;13:00&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;SEA&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;ATL&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;38&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;25&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-13&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Mercedes-Benz Stadium&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Home&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NA&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NA&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2020091300&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2020_01_CLE_BAL&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2020&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;REG&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2020-09-13&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Sunday&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;13:00&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;CLE&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;BAL&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;6&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;38&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;32&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;M&amp;amp;T Bank Stadium&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Home&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;outdoors&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NA&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2020091301&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The package returned both the data I’m looking for and a lot of additional data that could be used if necessary (day of week, dome vs. outdoors, etc.).&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;method-1-pythagorean-expectation&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Method 1: Pythagorean expectation&lt;/h2&gt;
&lt;p&gt;&lt;a href=&#34;https://en.wikipedia.org/wiki/Pythagorean_expectation#Use_in_pro_football&#34;&gt;Pythagorean expectation&lt;/a&gt; was developed by Bill James for Baseball and estimates the % of games that a team “should win” based on runs scored and runs allowed.&lt;/p&gt;
&lt;p&gt;It was adapted for Pro Football by Football Outsiders to use the following formula:&lt;/p&gt;
&lt;center&gt;
&lt;img src=&#34;formula.PNG&#34; /&gt;
&lt;/center&gt;
&lt;p&gt;The Football Outsiders Almanac in 2011 stated that “From 1988 through 2004, 11 of 16 Super Bowls were won by the team that led the NFL in Pythagorean wins, while only seven were won by the team with the most actual victories.”&lt;/p&gt;
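&lt;p&gt;As a quick sanity check of the formula (the 2.37 exponent and the 16-game scaling are the only inputs), here is a minimal sketch in base R; the &lt;code&gt;pyth_exp&lt;/code&gt; helper name is my own, not part of any package:&lt;/p&gt;

```r
# Hypothetical helper: Pythagorean expectation for football
# (Football Outsiders exponent 2.37, scaled to a 16-game season)
pyth_exp = function(pf, pa) {
  pf^2.37 / (pf^2.37 + pa^2.37) * 16
}

# Baltimore's 2020 regular season: 468 points for, 303 against
round(pyth_exp(468, 303), 1)  # 11.8 expected wins
```

&lt;p&gt;This matches Baltimore’s expected-wins figure computed from the full schedule data below.&lt;/p&gt;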
&lt;p&gt;A little data manipulation is needed to get the NFL schedule data into a format suitable for calculating the Pythagorean expectation: most notably, splitting each game into two rows of data to capture information on both the home and away teams.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;p_wins &amp;lt;- nfl_games %&amp;gt;% 
  pivot_longer(
    cols = c(contains(&amp;#39;team&amp;#39;)),
    names_to = &amp;quot;category&amp;quot;,
    values_to = &amp;#39;team&amp;#39;
  ) %&amp;gt;% 
  mutate(points_for = (category==&amp;#39;home_team&amp;#39;)*home_score+
           (category==&amp;#39;away_team&amp;#39;)*away_score,
         points_against = (category==&amp;#39;away_team&amp;#39;)*home_score+
           (category==&amp;#39;home_team&amp;#39;)*away_score
  ) %&amp;gt;% 
  group_by(team) %&amp;gt;%
  summarize(pf = sum(points_for, na.rm = T),
            pa = sum(points_against, na.rm = T),
            actual_wins = sum(points_for &amp;gt; points_against, na.rm = T),
            .groups = &amp;#39;drop&amp;#39;
  ) %&amp;gt;% 
  mutate(p_expectation = pf^2.37/(pf^2.37+pa^2.37)*16)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;By pythagorean expectation the top 3 teams in the NFL are:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;center&#34;&gt;team&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;points_for&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;points_against&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;actual_wins&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;expected_wins&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;BAL&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;468&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;303&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;11&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;11.8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;NO&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;482&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;337&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;12&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;11.2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;TB&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;492&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;355&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;11&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;10.9&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;According to Pythagorean expectation, the &lt;strong&gt;Baltimore Ravens&lt;/strong&gt; are the best team in the NFL, while the formula says that the Kansas City Chiefs, the team with the most actual wins, “&lt;em&gt;should&lt;/em&gt;” have had only 10.5 wins vs. the 14 they actually had.&lt;/p&gt;
&lt;div id=&#34;an-aside-who-outkicked-their-coverage&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;An aside: Who “outkicked their coverage”?&lt;/h3&gt;
&lt;p&gt;The concept of “Expected Wins” lets us see who outperformed vs. underperformed their expectation. The following plot shows actual wins on the x-axis and expected wins on the y-axis.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(ggrepel)
p_wins %&amp;gt;% 
  mutate(diff_from_exp = actual_wins - p_expectation) %&amp;gt;% 
  ggplot(aes(x = actual_wins, y = p_expectation, fill = diff_from_exp)) + 
    geom_label_repel(aes(label = team)) + 
    geom_abline(lty = 2) + 
    annotate(&amp;quot;label&amp;quot;, x = 1, y = 10, hjust = &amp;#39;left&amp;#39;, label = &amp;quot;Underachievers&amp;quot;) +
    annotate(&amp;quot;label&amp;quot;, x = 10, y = 5, hjust = &amp;#39;left&amp;#39;, label = &amp;quot;Overachievers&amp;quot;) +
    labs(x = &amp;quot;Actual Wins&amp;quot;, y = &amp;quot;Expected Wins&amp;quot;, 
         title = &amp;quot;What NFL Teams Over/Under Performed?&amp;quot;, 
         caption = &amp;quot;Expected Wins Based on Pythagorean Expectation&amp;quot;) + 
    scale_fill_gradient2(guide = F) + 
    cowplot::theme_cowplot()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;index_files/figure-html/unnamed-chunk-2-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;The largest over-achievers appear to be Kansas City and Cleveland, while the largest under-achievers were Atlanta and Jacksonville.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;method-2-simulation-with-bradley-terry-models&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Method #2: Simulation with Bradley-Terry Models&lt;/h2&gt;
&lt;p&gt;&lt;a href=&#34;https://en.wikipedia.org/wiki/Bradley%E2%80%93Terry_model&#34;&gt;Bradley-Terry Models&lt;/a&gt; are probability models for predicting the outcomes of paired comparisons (such as sporting events or ranking items in a competition).&lt;/p&gt;
&lt;p&gt;In this case, the goal is to predict the winner of Super Bowl LV. I’ll use regular season data to estimate “ability parameters” for each team and then use those parameters to run simulations to estimate the winners of the NFL playoff match-ups.&lt;/p&gt;
&lt;p&gt;The Bradley-Terry Model can be fit using the &lt;code&gt;BradleyTerry2&lt;/code&gt; package.&lt;/p&gt;
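&lt;p&gt;Under the hood, the Bradley-Terry model says the log-odds of team &lt;em&gt;i&lt;/em&gt; beating team &lt;em&gt;j&lt;/em&gt; is the difference in their ability parameters. A minimal sketch (my own illustration, not a function from the package):&lt;/p&gt;

```r
# The Bradley-Terry win probability is the inverse logit of the
# difference in ability parameters (bt_prob is a hypothetical helper)
bt_prob = function(ability_i, ability_j) {
  plogis(ability_i - ability_j)
}

# Evenly matched teams win half the time
bt_prob(0, 0)  # 0.5

# An ability gap of about 1.06 (Baltimore vs. the reference team in the
# abilities table below) implies roughly a 74% win probability
round(bt_prob(1.06, 0), 2)  # 0.74
```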
&lt;div id=&#34;step-1-reshaping-the-data&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Step 1: Reshaping the Data&lt;/h3&gt;
&lt;p&gt;The &lt;code&gt;BradleyTerry2&lt;/code&gt; package can take data in a number of different ways, but it is opinionated about the structure, so we’ll need to reshape the data into a format the package wants.&lt;/p&gt;
&lt;p&gt;Specifically, it can take in data similar to how &lt;code&gt;glm()&lt;/code&gt; can use counts to fit a logistic regression. In this case it would be similar to:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;BTm(cbind(win1, win2), team1, team2, ~ team, id = &amp;quot;team&amp;quot;, data = sports.data)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The inclusion of only team in the formula means that only the “team” factors are used to estimate abilities. Other predictors, such as a home-field advantage, could be added, but considering the nature of the 2020 season I’m going to assume there was no home-field advantage. The &lt;code&gt;id=&#34;team&#34;&lt;/code&gt; portion of the formula tells the function how to label factors in the output. For example, the team “NYG” will become the “teamNYG” predictor.&lt;/p&gt;
&lt;p&gt;Given the nature of the NFL schedule, there shouldn’t be any repeated Home/Away combinations, but to be sure we can &lt;code&gt;group_by()&lt;/code&gt; and &lt;code&gt;summarize()&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Since the package used for modeling requires that each team variable has the same factor levels, I’ll recode &lt;code&gt;home_team&lt;/code&gt; and &lt;code&gt;away_team&lt;/code&gt; with new levels.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#Get List of All Teams
all_teams &amp;lt;- sort(unique(nfl_games$home_team))

nfl_shaped &amp;lt;- nfl_games %&amp;gt;%
  mutate(
    home_team = factor(home_team, levels = all_teams),
    away_team = factor(away_team, levels = all_teams),
    home_wins = if_else(home_score &amp;gt; away_score, 1, 0),
    away_wins = if_else(home_score &amp;lt; away_score, 1, 0) 
  ) %&amp;gt;% 
  group_by(home_team, away_team) %&amp;gt;% 
  summarize(home_wins = sum(home_wins),
            away_wins = sum(away_wins),
            .groups= &amp;#39;drop&amp;#39;) 

knitr::kable(head(nfl_shaped, 3), align = &amp;#39;c&amp;#39;)&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;center&#34;&gt;home_team&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;away_team&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;home_wins&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;away_wins&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;ARI&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;BUF&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;ARI&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;DET&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;ARI&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;LA&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;
&lt;div id=&#34;step-2-fitting-the-bradley-terry-model&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Step 2: Fitting the Bradley-Terry Model&lt;/h3&gt;
&lt;p&gt;The Bradley-Terry model can be fit similar to how other models like &lt;code&gt;glm()&lt;/code&gt; are fit. By default, the first factor alphabetically becomes the reference factor and takes a coefficient of zero. All other coefficients are relative to that factor.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(BradleyTerry2)
base_model &amp;lt;- BTm(cbind(home_wins, away_wins), home_team, away_team,
                  data = nfl_shaped, id = &amp;quot;team&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;summary()&lt;/code&gt; function will provide information on residuals, coefficients, and statistical significance, but for brevity, I’ll skip that output.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;step-3-extracting-the-team-abilities&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Step 3: Extracting the Team Abilities&lt;/h3&gt;
&lt;p&gt;While the package contains a &lt;code&gt;BTabilities()&lt;/code&gt; function to extract the abilities and their standard errors, the &lt;code&gt;qvcalc()&lt;/code&gt; function will output abilities along with quasi-standard errors. The advantage of quasi-standard errors is that for the reference category the ability estimate and standard error from &lt;code&gt;BTabilities()&lt;/code&gt; are both 0, while the quasi-standard error is non-zero. The use of quasi-standard errors allows for any pairwise comparison.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;base_abilities &amp;lt;- qvcalc(BTabilities(base_model)) %&amp;gt;% 
  .[[&amp;quot;qvframe&amp;quot;]] %&amp;gt;% 
  as_tibble(rownames = &amp;#39;team&amp;#39;) %&amp;gt;% 
  janitor::clean_names()

knitr::kable(base_abilities %&amp;gt;% 
               mutate(across(where(is.numeric), round, 2)) %&amp;gt;% 
               head(3),
             align = &amp;#39;c&amp;#39;)&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;center&#34;&gt;team&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;estimate&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;se&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;quasi_se&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;quasi_var&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;ARI&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.00&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.00&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.57&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.32&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;ATL&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;-0.91&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.88&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.64&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.41&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;BAL&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1.06&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.89&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.65&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.42&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;step-4-simulating-playoff-matchups&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Step 4: Simulating Playoff Matchups&lt;/h2&gt;
&lt;p&gt;To determine each team’s likelihood of winning a match-up, I run 1,000 simulations drawing from a normal distribution of the ability scores, using each team’s ability estimate and quasi-standard error as parameters. The percent of those 1,000 simulations won by each team represents its likelihood of winning that match-up.&lt;/p&gt;
&lt;p&gt;To generate the 1,000 simulations I use the &lt;code&gt;tidyr::crossing()&lt;/code&gt; function to replicate each row 1,000 times and then use dplyr to summarize over all simulations.&lt;/p&gt;
&lt;p&gt;Since running this for any arbitrary combination of teams isn’t too time consuming, I’ll generate every combination of playoff team across the NFC and AFC even though at least half of these comparisons will be impossible in practice.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;playoff_teams = c(&amp;#39;BAL&amp;#39;, &amp;#39;BUF&amp;#39;, &amp;#39;CHI&amp;#39;, &amp;#39;CLE&amp;#39;, &amp;#39;GB&amp;#39;, &amp;#39;IND&amp;#39;, &amp;#39;KC&amp;#39;, &amp;#39;LA&amp;#39;, &amp;#39;NO&amp;#39;,
                  &amp;#39;PIT&amp;#39;, &amp;#39;SEA&amp;#39;, &amp;#39;TB&amp;#39;, &amp;#39;TEN&amp;#39;, &amp;#39;WAS&amp;#39;)

comparisons &amp;lt;- base_abilities %&amp;gt;% 
  filter(team %in% playoff_teams)

#Generate All Potential Combination of Playoff Teams
comparisons &amp;lt;- comparisons %&amp;gt;% 
  rename_with(~paste0(&amp;quot;t1_&amp;quot;, .x)) %&amp;gt;% 
  crossing(comparisons %&amp;gt;% rename_with(~paste0(&amp;quot;t2_&amp;quot;, .x)))  %&amp;gt;% 
  filter(t1_team != t2_team)

#Run 1000 Simulations per comparison
set.seed(20210107)

#Draw from Ability Distribution
simulations &amp;lt;- comparisons %&amp;gt;% 
  crossing(simulation = 1:1000) %&amp;gt;% 
  mutate(
    t1_val = rnorm(n(), t1_estimate, t1_quasi_se),
    t2_val = rnorm(n(), t2_estimate, t2_quasi_se),
    t1_win = t1_val &amp;gt; t2_val,
    t2_win = t2_val &amp;gt; t1_val
  )

#Roll up the 1000 Results
sim_summary &amp;lt;- simulations %&amp;gt;% 
  group_by(t1_team, t2_team, t1_estimate, t2_estimate) %&amp;gt;% 
  summarize(t1_wins_pct = mean(t1_win), #Long-Term Average Winning % for Team 1
            t2_wins_pct = mean(t2_win), #Long-Term Average Winning % for Team 2
            .groups = &amp;#39;drop&amp;#39;) %&amp;gt;% 
  mutate(
    #Create a label for the winner
    winner = if_else(t1_wins_pct &amp;gt; t2_wins_pct, t1_team, t2_team)
  )&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;step-5-and-the-winner-is.&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Step 5: And the winner is….&lt;/h2&gt;
&lt;p&gt;Now that we have all potential combinations, we can step through each of the games on the schedule to determine the likelihood of winning each match-up. For rounds after the initial wild-card round, the teams are re-seeded so the #1 seed plays whatever the lowest winning seed is (which can be anywhere from #4 to #7). While initially I wanted to look at each team’s overall likelihood of winning the Super Bowl, I couldn’t quite figure out how to easily determine the probability of each scenario given the re-seeding process. So I will just step through each round based on the results of the previous round.&lt;/p&gt;
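&lt;p&gt;The re-seeding rule itself is easy to sketch: after each round, sort the surviving seeds and pair the best remaining seed with the worst. This &lt;code&gt;reseed&lt;/code&gt; helper is purely hypothetical, to illustrate the rule:&lt;/p&gt;

```r
# Hypothetical helper: pair the best remaining seed with the worst,
# second-best with second-worst, etc.
reseed = function(seeds) {
  s = sort(seeds)
  n = length(s)
  data.frame(
    high_seed = head(s, n %/% 2),
    low_seed  = rev(tail(s, n %/% 2))
  )
}

# Seeds 1, 2, 3, and 5 survive the wild-card round:
reseed(c(1, 2, 3, 5))
#   high_seed low_seed
# 1         1        5
# 2         2        3
```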
&lt;p&gt;For simplicity I define a function to take in the two teams and return the win probabilities from the simulations above.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;winners &amp;lt;- function(t1, t2){
  dt = sim_summary %&amp;gt;% filter(t1_team == t1 &amp;amp; t2_team == t2) %&amp;gt;% 
    inner_join(
      nflfastR::teams_colors_logos %&amp;gt;% 
        filter(team_abbr == t1) %&amp;gt;% 
        select(t1_team = team_abbr, t1_name = team_name),
      by = &amp;quot;t1_team&amp;quot;
    ) %&amp;gt;% 
    inner_join(
      nflfastR::teams_colors_logos %&amp;gt;% 
        filter(team_abbr == t2) %&amp;gt;% 
        select(t2_team = team_abbr, t2_name = team_name),
      by = &amp;quot;t2_team&amp;quot;
    )
  
  return(
     list(
       team1 = dt$t1_name,
       team1_prob = dt$t1_wins_pct,
       team2 = dt$t2_name,
       team2_prob = dt$t2_wins_pct,
       winner = if_else(dt$winner == dt$t1_team, dt$t1_name, dt$t2_name)
     )
  )
}&lt;/code&gt;&lt;/pre&gt;
&lt;div id=&#34;nfc&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;NFC&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;&lt;em&gt;Wild-Card Round&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;#2. New Orleans Saints (95%) vs. #7. Chicago Bears (5%)&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;em&gt;Winner:&lt;/em&gt;&lt;/strong&gt; &lt;em&gt;New Orleans Saints&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;#3. Seattle Seahawks (71%) vs. #6. Los Angeles Rams (29%)&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;em&gt;Winner:&lt;/em&gt;&lt;/strong&gt; &lt;em&gt;Seattle Seahawks&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;#4. Washington Football Team (4%) vs. #5. Tampa Bay Buccaneers (96%)&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;em&gt;Winner:&lt;/em&gt;&lt;/strong&gt; &lt;em&gt;Tampa Bay Buccaneers&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;em&gt;Divisional Round&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;#1. Green Bay Packers (66%) vs. #5. Tampa Bay Buccaneers (34%)&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;em&gt;Winner:&lt;/em&gt;&lt;/strong&gt; &lt;em&gt;Green Bay Packers&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;#2. New Orleans Saints (60%) vs. #3. Seattle Seahawks (40%)&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;em&gt;Winner:&lt;/em&gt;&lt;/strong&gt; &lt;em&gt;New Orleans Saints&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;em&gt;NFC Championship Game&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;#1. Green Bay Packers (55%) vs. #2. New Orleans Saints (45%)&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The Green Bay Packers are heading to the Super Bowl!&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;afc&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;AFC&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;&lt;em&gt;Wild-Card Round&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;#2. Buffalo Bills (91%) vs. #7. Indianapolis Colts (9%)&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;em&gt;Winner:&lt;/em&gt;&lt;/strong&gt; &lt;em&gt;Buffalo Bills&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;#3. Pittsburgh Steelers (68%) vs. #6. Cleveland Browns (32%)&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;em&gt;Winner:&lt;/em&gt;&lt;/strong&gt; &lt;em&gt;Pittsburgh Steelers&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;#4. Tennessee Titans (47%) vs. #5. Baltimore Ravens (53%)&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;em&gt;Winner:&lt;/em&gt;&lt;/strong&gt; &lt;em&gt;Baltimore Ravens&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;em&gt;Divisional Round&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;#1. Kansas City Chiefs (89%) vs. #5. Baltimore Ravens (11%)&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;em&gt;Winner:&lt;/em&gt;&lt;/strong&gt; &lt;em&gt;Kansas City Chiefs&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;#2. Buffalo Bills (76%) vs. #3. Pittsburgh Steelers (24%)&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;em&gt;Winner:&lt;/em&gt;&lt;/strong&gt; &lt;em&gt;Buffalo Bills&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;em&gt;AFC Championship Game&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;#1. Kansas City Chiefs (64%) vs. #2. Buffalo Bills (36%)&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The Kansas City Chiefs are headed to the Super Bowl!&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;super-bowl-lv&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Super Bowl LV&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;#1. Green Bay Packers (18%) vs. #1. Kansas City Chiefs (82%)&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Apparently the NFC and AFC alternate as the designated home team, and since the Chiefs were the home team in Super Bowl LIV, the NFC representative will be the home team in Super Bowl LV.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;your-super-bowl-lv-champions-the-kansas-city-chiefs&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Your Super Bowl LV Champions… the Kansas City Chiefs&lt;/h3&gt;
&lt;/div&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>7 Tricks I Learned During Advent of Code 2020</title>
      <link>https://jlaw.netlify.app/2020/12/28/7-tricks-i-learned-during-advent-of-code-2020/</link>
      <pubDate>Mon, 28 Dec 2020 00:00:00 +0000</pubDate>
      <guid>https://jlaw.netlify.app/2020/12/28/7-tricks-i-learned-during-advent-of-code-2020/</guid>
      <description>


&lt;p&gt;I got into the &lt;a href=&#34;https://adventofcode.com/&#34;&gt;Advent of Code&lt;/a&gt; for the first time this year through some co-workers. For those not familiar, it’s a series of programming puzzles created by &lt;a href=&#34;http://was.tl/&#34;&gt;Eric Wastl&lt;/a&gt;, released once a day for the first 25 days of December. The puzzles are programming-language agnostic, so some use it to learn a new language and others, like myself, just thought it would be something fun to do. While I use R often in my job and for writing this blog, the Advent of Code puzzles are quite different from my usual use case. As I did the puzzles, I kept track of some tricks that I learned and thought were useful (I learned &lt;strong&gt;a lot&lt;/strong&gt; of things, but to keep things short, I’ll only list a handful).&lt;/p&gt;
&lt;div id=&#34;not-a-trick..-but-credit-where-credit-is-due&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Not a Trick... But Credit Where Credit Is Due&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;I can’t imagine the amount of work that goes into creating these puzzles.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;It’s a bit of a cop-out that the first item has nothing to do with R. But I did want to specifically give props to &lt;a href=&#34;http://was.tl/&#34;&gt;Eric Wastl&lt;/a&gt; for making these puzzles. As hard as it was at times to complete them, I found myself constantly thinking how difficult it must be to &lt;em&gt;create&lt;/em&gt; them and ensure that they are solvable.&lt;/p&gt;
&lt;p&gt;Now onto the R.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;trick-1-break-apart-a-string-of-text-into-a-vector-with-str_split-and-unlist&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Trick #1: Break apart a string of text into a vector with &lt;code&gt;str_split()&lt;/code&gt; and &lt;code&gt;unlist()&lt;/code&gt;&lt;/h2&gt;
&lt;p&gt;The inputs for Advent of Code are usually flat files, and it’s often necessary to break up the input in order to fill out a matrix or columns in a data frame.&lt;/p&gt;
&lt;p&gt;Suppose there is an input like:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;....#..
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;and we want to have each character as a vector element. A function like &lt;code&gt;readLines&lt;/code&gt; will read each row in as a vector element, but in order to split the string into individual characters we’ll call upon &lt;code&gt;str_split()&lt;/code&gt; to break apart the string by a delimiter. Using the empty string (’’) will separate each character, creating a list. Then &lt;code&gt;unlist()&lt;/code&gt; will break each character into its own element in the vector.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;input &amp;lt;- &amp;quot;....#...&amp;quot;

print(str_split(input, &amp;#39;&amp;#39;) %&amp;gt;% unlist())&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] &amp;quot;.&amp;quot; &amp;quot;.&amp;quot; &amp;quot;.&amp;quot; &amp;quot;.&amp;quot; &amp;quot;#&amp;quot; &amp;quot;.&amp;quot; &amp;quot;.&amp;quot; &amp;quot;.&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now as opposed to having 1 string, we have a character vector with each character as its own element.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;trick-2-combining-str_split-with-unnest-can-turn-a-vector-of-strings-into-a-tidy-data-frame.&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Trick #2: Combining &lt;code&gt;str_split()&lt;/code&gt; with &lt;code&gt;unnest()&lt;/code&gt; can turn a vector of strings into a tidy data frame.&lt;/h2&gt;
&lt;p&gt;One thing that I worked with more in Advent of Code than I have in the last few years is &lt;strong&gt;matrices&lt;/strong&gt;. As shown before, most of the input comes as a flat file needing to be processed. Sometimes it was helpful to represent the matrix as a tidy data set with columns for &lt;code&gt;row_id&lt;/code&gt;, &lt;code&gt;col_id&lt;/code&gt;, and &lt;code&gt;value&lt;/code&gt; rather than the traditional matrix format. The &lt;code&gt;unnest()&lt;/code&gt; function will break apart each element of a list into its own row. Here I use a similar input to before, but with more rows.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;input &amp;lt;- c(&amp;quot;....#.......&amp;quot;,
           &amp;quot;.#..#....###&amp;quot;,
           &amp;quot;....###.....&amp;quot;)

tibble(raw = input) %&amp;gt;% 
  mutate(
    row_id = row_number(), #Create Row ID
    value = str_split(raw, &amp;#39;&amp;#39;) #Break Each Row Into A List Of Elements
  ) %&amp;gt;% 
  unnest(value) %&amp;gt;% #Break Each Element Into Its Own Row
  group_by(row_id) %&amp;gt;% 
  mutate(col_id = row_number()) %&amp;gt;% #Create Column ID
  head(10) %&amp;gt;% 
  kable(align = &amp;#39;c&amp;#39;)&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;center&#34;&gt;raw&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;row_id&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;value&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;col_id&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;….#…….&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;.&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;….#…….&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;.&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;….#…….&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;.&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;….#…….&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;.&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;….#…….&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;#&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;….#…….&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;.&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;….#…….&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;.&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;….#…….&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;.&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;….#…….&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;.&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;….#…….&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;.&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;10&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Now each element of the character vector is its own row with its own &lt;code&gt;row_id&lt;/code&gt; and &lt;code&gt;col_id&lt;/code&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;trick-3-extract-is-a-powerhouse-function-for-working-with-strings&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Trick #3: &lt;code&gt;extract()&lt;/code&gt; is a powerhouse function for working with strings&lt;/h2&gt;
&lt;p&gt;I’ve mentioned before that I think regular expressions are amazing and open up a world of possibilities. &lt;code&gt;extract()&lt;/code&gt; allows for the use of regular expressions and capture groups to create any number of new columns. It’s similar to &lt;code&gt;separate()&lt;/code&gt; but to me seems more customizable. Given the inputs:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;6-7 z: dqzzzjbzz 67
13-16 j: jjjvjmjjkjjjjjjj 123
5-6 m: mmbmmlvmbmmgmmf 5&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And you want to create a data.frame that has columns for the number range, the character before the ‘:’, the series of characters after the ‘:’, and the final digit. This could be done with &lt;code&gt;str_match()&lt;/code&gt; or similar, but &lt;code&gt;extract()&lt;/code&gt; just makes it so &lt;strong&gt;&lt;em&gt;easy&lt;/em&gt;&lt;/strong&gt;. Just give &lt;code&gt;extract()&lt;/code&gt; a regular expression and capture in parentheses the things to turn into columns.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;input &amp;lt;- c(&amp;quot;6-7 z: dqzzzjbzz 67&amp;quot;,
           &amp;quot;13-16 j: jjjvjmjjkjjjjjjj 123&amp;quot;,
           &amp;quot;5-6 m: mmbmmlvmbmmgmmf 5&amp;quot;)

tibble(raw = input) %&amp;gt;% 
  extract(raw, 
          into = c(&amp;#39;number_range&amp;#39;, &amp;#39;single_char&amp;#39;, 
                   &amp;#39;many_char&amp;#39;, &amp;#39;single_digit&amp;#39;),
          regex = &amp;#39;(\\d+-\\d+) (\\w+): (\\w+) (\\d+)&amp;#39;,
          convert = T) %&amp;gt;% 
  kable(align = &amp;#39;c&amp;#39;)&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;center&#34;&gt;number_range&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;single_char&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;many_char&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;single_digit&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;6-7&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;z&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;dqzzzjbzz&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;67&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;13-16&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;j&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;jjjvjmjjkjjjjjjj&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;123&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;5-6&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;m&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;mmbmmlvmbmmgmmf&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Done and done (and with &lt;code&gt;convert = T&lt;/code&gt; it even turned &lt;code&gt;single_digit&lt;/code&gt; into an integer)!&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;trick-4-memoization&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Trick #4: Memoization&lt;/h2&gt;
&lt;p&gt;Some of the puzzles in AoC used programming concepts I hadn’t thought about in a long time (linked lists), and some used concepts I didn’t know existed. Memoization is one of those terms that I’d heard before but had no idea what it meant. There were a number of puzzles where my initial brute-force solutions would take hours or days to complete. But in certain cases, memoization sped things up immensely.&lt;/p&gt;
&lt;p&gt;Memoization caches the results of function calls so that if the same call happens a second time, rather than doing the work again, the program can just recall the value from the cache.&lt;/p&gt;
&lt;p&gt;Functions can be memoised in R using the &lt;code&gt;memoise::memoise()&lt;/code&gt; function to wrap the function.&lt;/p&gt;
&lt;p&gt;For this example, I’m borrowing the Fibonacci example from this post on &lt;a href=&#34;https://www.inwt-statistics.com/read-blog/optimize-your-r-code-using-memoization.html&#34;&gt;IWNT Statistics&lt;/a&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(memoise)

# Vanilla Function
fibb &amp;lt;- function(x){
  if(x==0){return(1)}
  else if(x==1){return(1)}
  else{return(fibb(x - 1) + fibb(x-2))}
}

# Same Function But Wrapped In Memoise
memo_fib &amp;lt;- memoise(function(x){
  if(x==0){return(1)}
  else if(x==1){return(1)}
  else{return(memo_fib(x - 1) + memo_fib(x-2))}
})&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Running the original version:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;tictoc::tic()
fibb(35)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 14930352&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;tictoc::toc()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 26.58 sec elapsed&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And the memoised version:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;tictoc::tic()
memo_fib(35)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 14930352&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;tictoc::toc()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 0.08 sec elapsed&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The memoised version produces a &lt;strong&gt;&lt;em&gt;way&lt;/em&gt;&lt;/strong&gt; faster result! While hard to believe, the original function makes close to 30 million calls on its way to finding &lt;code&gt;fibb(35)&lt;/code&gt;. The memoised version, however, only needs to solve the 36 unique inputs (0 through 35) and can recall the answer from the cache for every repeated recursive call.&lt;/p&gt;
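&lt;p&gt;To see that call explosion first-hand, here’s a small sketch (my own addition, not from the puzzles) that tracks the number of invocations with a global counter:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;calls &amp;lt;- 0

fibb_counted &amp;lt;- function(x){
  calls &amp;lt;&amp;lt;- calls + 1 # Increment the global counter on every invocation
  if(x &amp;lt;= 1){return(1)}
  else{return(fibb_counted(x - 1) + fibb_counted(x - 2))}
}

fibb_counted(20)
print(calls) # 21,891 calls just for n = 20; the count grows exponentially with n&lt;/code&gt;&lt;/pre&gt;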
&lt;/div&gt;
&lt;div id=&#34;trick-5---string-replacement-with-back-references&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Trick #5 - String Replacement with Back References&lt;/h2&gt;
&lt;p&gt;Back to string manipulation!&lt;/p&gt;
&lt;p&gt;Within regular expressions there is a concept of “capture groups”: wrap part of a pattern in parentheses and you can extract that piece from the string match (this is how &lt;code&gt;str_match()&lt;/code&gt; can work). However, you can also reference what a capture group matched and reuse it in the replacement for functions like &lt;code&gt;str_replace_all()&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;In our example, imagine we have a string of animals, &lt;code&gt;&#34;the cat, a bird, the dog, ze goat&#34;&lt;/code&gt;, and we want to insert the adjective &lt;strong&gt;red&lt;/strong&gt; between “the” and each animal. There are many ways to do this, but I will use back-references, which reference the contents of a capture group without knowing specifically what’s in it.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;input &amp;lt;- &amp;quot;the cat, a bird, the dog, ze goat&amp;quot;

str_replace_all(input, &amp;#39;(\\w+) (\\w+)&amp;#39;, &amp;#39;\\1 red \\2&amp;#39;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] &amp;quot;the red cat, a red bird, the red dog, ze red goat&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;\\1&lt;/code&gt; is a back-reference to the first capture group in parentheses (the, a, the, and ze) while &lt;code&gt;\\2&lt;/code&gt; is a reference to the animals.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;trick-6---escaping-stringrs-regular-expression-matching-with-coll&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Trick #6 - Escaping stringr’s regular expression matching with &lt;code&gt;coll()&lt;/code&gt;&lt;/h2&gt;
&lt;p&gt;More often than not, stringr’s use of regular expressions as the pattern is a blessing. One place where it was troublesome was when I tried to use one variable as the pattern for replacing within another variable. In these cases, special characters in my pattern (the ‘+’) were treated as regular expression syntax rather than the literal string I wanted to match.&lt;/p&gt;
&lt;p&gt;For this example, suppose I want to replace an equation within parenthesis with the word ‘hi’ (not sure &lt;strong&gt;why&lt;/strong&gt; I’d want to do this, but oh well).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;tibble(
  eq = c(&amp;quot;(1 + 1)&amp;quot;, &amp;quot;(7 - 3)&amp;quot;, &amp;quot;(12 * 1)&amp;quot;)
) %&amp;gt;% 
  mutate(ptrn = str_extract(eq, &amp;#39;\\(.+\\)&amp;#39;),
         new_eq = str_replace_all(eq, ptrn, &amp;#39;hi&amp;#39;),
  ) %&amp;gt;% 
  kable(align = &amp;#39;c&amp;#39;)&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;center&#34;&gt;eq&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;ptrn&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;new_eq&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;(1 + 1)&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;(1 + 1)&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;(1 + 1)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;(7 - 3)&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;(7 - 3)&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;(hi)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;(12 * 1)&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;(12 * 1)&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;(12 * 1)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Notice that for all three cases &lt;code&gt;str_replace_all&lt;/code&gt; either worked incorrectly or didn’t work at all. To a human reader these should obviously match, but the symbols “(”, “)”, “+”, and “*” are all special characters in regular expressions and therefore don’t match the literal symbols they were intended to match.&lt;/p&gt;
&lt;p&gt;Fortunately, there is a function &lt;code&gt;coll()&lt;/code&gt; which compares strings using standard collation rules rather than treating them as regular expressions. Wrapping the pattern variable in &lt;code&gt;coll()&lt;/code&gt; solves the problem.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;tibble(
  eq = c(&amp;quot;(1 + 1)&amp;quot;, &amp;quot;(7 - 3)&amp;quot;, &amp;quot;(12 * 1)&amp;quot;)
) %&amp;gt;% 
  mutate(ptrn = str_extract(eq, &amp;#39;\\(.+\\)&amp;#39;),
         new_eq = str_replace_all(eq, ptrn, &amp;#39;hi&amp;#39;),
         with_coll = str_replace_all(eq, coll(ptrn), &amp;#39;hi&amp;#39;)
  ) %&amp;gt;%
  kable(align = &amp;#39;c&amp;#39;)&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;center&#34;&gt;eq&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;ptrn&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;new_eq&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;with_coll&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;(1 + 1)&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;(1 + 1)&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;(1 + 1)&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;hi&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;(7 - 3)&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;(7 - 3)&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;(hi)&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;hi&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;(12 * 1)&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;(12 * 1)&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;(12 * 1)&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;hi&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Now everything works!&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;trick-7---use-the-assign-function-to-programatically-create-new-objects&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Trick #7 - Use the &lt;code&gt;assign()&lt;/code&gt; function to programatically create new objects&lt;/h2&gt;
&lt;p&gt;I always struggle with programmatic naming of objects. In the course of one of the puzzles I came across the &lt;code&gt;assign()&lt;/code&gt; function, which takes a variable name as a string and an object to bind to that name.&lt;/p&gt;
&lt;p&gt;Suppose we have a data.frame with a column for the Player and a column for the cards held by that player, and we want to create two vectors: one for player 1 and one for player 2. We can use &lt;code&gt;assign()&lt;/code&gt; to create those objects.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;input &amp;lt;- tibble::tribble(
  ~Player, ~Cards,
       1L,     1L,
       1L,     2L,
       1L,     3L,
       2L,     4L,
       2L,     5L,
       2L,     6L
  )

# Generate the string for the variable name with paste and assign an object
for(i in seq_len(n_distinct(input$Player))){
  assign(paste0(&amp;#39;player_&amp;#39;,i), input %&amp;gt;% filter(Player == i) %&amp;gt;% pull(Cards))
}

print(player_1)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 1 2 3&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;print(player_2)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 4 5 6&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now there are two objects in the environment with the names “player_1” and “player_2”.&lt;/p&gt;
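&lt;p&gt;As a companion (my own aside, assuming the chunk above has been run): the base-R &lt;code&gt;get()&lt;/code&gt; and &lt;code&gt;mget()&lt;/code&gt; functions are the inverse of &lt;code&gt;assign()&lt;/code&gt;, retrieving objects programmatically by their names as strings:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Retrieve a single object by its name
get(&amp;#39;player_1&amp;#39;)

# Retrieve several objects at once, returned as a named list
mget(paste0(&amp;#39;player_&amp;#39;, 1:2))&lt;/code&gt;&lt;/pre&gt;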
&lt;/div&gt;
&lt;div id=&#34;thanks-for-reading&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Thanks for Reading!&lt;/h2&gt;
&lt;p&gt;I would highly encourage everyone to try &lt;a href=&#34;https://adventofcode.com/&#34;&gt;Advent of Code&lt;/a&gt; at some point. I found it really enjoyable to do a different type of programming from my day-to-day. Although there were instances where doing it in R made things difficult (mainly R being a 1-indexed language), the experience was well worth it.&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Exploring NHL Stanley Cup Champion&#39;s Points Percentage In Four GGPlots</title>
      <link>https://jlaw.netlify.app/2020/12/01/exploring-nhl-stanley-cup-champion-s-points-percentage-in-four-ggplots/</link>
      <pubDate>Tue, 01 Dec 2020 00:00:00 +0000</pubDate>
      <guid>https://jlaw.netlify.app/2020/12/01/exploring-nhl-stanley-cup-champion-s-points-percentage-in-four-ggplots/</guid>
      <description>


&lt;div id=&#34;motivation&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Motivation&lt;/h2&gt;
&lt;p&gt;While browsing Reddit’s &lt;a href=&#34;https://www.reddit.com/r/dataisbeautiful/&#34;&gt;r/DataIsBeautiful&lt;/a&gt; sub-reddit I came across a post from Fabio Votta showing a &lt;a href=&#34;https://www.reddit.com/r/dataisbeautiful/comments/jwzsm6/oc_countylevel_results_of_us_2020_election/&#34;&gt;beeswarm plot of US County vote share in the 2020 Election&lt;/a&gt;. Having never seen a beeswarm plot before I wanted to come up with an excuse to try it out. &lt;strong&gt;As an NHL fan, I decided to look at the Points Percentage of NHL Stanley Cup champions&lt;/strong&gt;. This analysis will use information from &lt;a href=&#34;https://www.hockey-reference.com/awards/stanley.html&#34;&gt;hockey-reference.com&lt;/a&gt; and &lt;code&gt;ggplot&lt;/code&gt; to visualize the information.&lt;/p&gt;
&lt;div id=&#34;sidebar-what-is-a-points-percentage&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Sidebar: What is a Points Percentage?&lt;/h3&gt;
&lt;p&gt;In the NHL a win is worth 2 points, a loss is worth 0 points, and a tie (or Overtime Loss beginning in the 2005-2006 season) is worth one point. The Points Percentage is the number of points earned by the team (2*Wins + 1*(Ties + OTL)) divided by the number of potential points (2*Games Played).&lt;/p&gt;
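&lt;p&gt;As a quick worked example (with a made-up record, not an actual team’s season):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Hypothetical record: 40 wins, 30 losses, 12 ties/overtime losses
wins &amp;lt;- 40
losses &amp;lt;- 30
otl &amp;lt;- 12
games_played &amp;lt;- wins + losses + otl

pts_pct &amp;lt;- (2*wins + 1*otl) / (2*games_played) # Points earned over points possible
pts_pct # 92 / 164, or roughly 56.1%&lt;/code&gt;&lt;/pre&gt;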
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;getting-the-data&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Getting the Data&lt;/h2&gt;
&lt;p&gt;The data for this analysis will come from &lt;a href=&#34;https://www.hockey-reference.com/awards/stanley.html&#34;&gt;hockey-reference.com&lt;/a&gt; which provides statistics on the Stanley Cup Champion teams from 1918 through 2020 (with some exceptions). The points percentage is provided as a direct column in the table.&lt;/p&gt;
&lt;div id=&#34;setting-up-libraries&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Setting Up Libraries&lt;/h3&gt;
&lt;p&gt;The libraries used in this analysis include stalwarts like tidyverse as well as ggplot extensions such as &lt;code&gt;ggtext&lt;/code&gt;, &lt;code&gt;ggbeeswarm&lt;/code&gt;, &lt;code&gt;ggridges&lt;/code&gt;, &lt;code&gt;ggimage&lt;/code&gt; to do different visualizations. The &lt;a href=&#34;https://github.com/wch/extrafont/&#34;&gt;&lt;code&gt;extrafont&lt;/code&gt;&lt;/a&gt; package enables the use of the fonts installed on my machine in ggplots. The &lt;code&gt;loadfonts(device = &#34;win&#34;)&lt;/code&gt; function loads the additional fonts (if running for the first time the &lt;code&gt;font_import()&lt;/code&gt; function needs to be called to build the references).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidyverse) # Data Manipulation and Visualizations
library(rvest) # Web Scraping the NHL Champion Data &amp;amp; Team Colors
library(ggbeeswarm) # Creating Beeswarm Plots
library(ggtext) # Enabling Use of Markdown in ggplots
library(ggridges) # Creating Ridge Density Plots
library(ggimage) # Creating Plots with Images as the Points 
library(glue) # Package for String Manipulation
library(extrafont) # Package to enable use of additional fonts for plotting
loadfonts(device = &amp;quot;win&amp;quot;) # Actually loads the fonts&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;getting-the-data-on-the-champions&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Getting the Data on the Champions&lt;/h3&gt;
&lt;p&gt;The points data for the Stanley Cup Champions comes from &lt;a href=&#34;https://www.hockey-reference.com/awards/stanley.html&#34;&gt;hockey-reference.com&lt;/a&gt;. I’ll scrape the table from this website by using &lt;code&gt;rvest&lt;/code&gt; and referencing the CSS class &lt;code&gt;.stats_table&lt;/code&gt;. Since there’s only one table on the page I can use &lt;code&gt;html_node&lt;/code&gt; vs. &lt;code&gt;html_nodes&lt;/code&gt;. Eventually I’m planning on joining some additional data to this data frame so I’m doing a minimal amount of data cleaning such as changing the Chicago Blackhawks to 1 word so that it matches the second data set. Additionally I’m renaming the points percentage column to something more R friendly.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;nhl_data &amp;lt;- read_html(&amp;#39;https://www.hockey-reference.com/awards/stanley.html&amp;#39;) %&amp;gt;% 
  html_node(css = &amp;#39;.stats_table&amp;#39;) %&amp;gt;% 
  html_table() %&amp;gt;% 
  mutate(Team = str_replace_all(Team, &amp;quot;Black Hawks&amp;quot;, &amp;quot;Blackhawks&amp;quot;)) %&amp;gt;% 
  rename(pts_pct = `PTS%`)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;getting-data-on-team-colors&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Getting Data on Team Colors&lt;/h3&gt;
&lt;p&gt;For one of the future plots I want to use each team’s color to represent their data. This information comes from &lt;a href=&#34;https://teamcolorcodes.com&#34;&gt;teamcolorcodes.com&lt;/a&gt;. Each team page has a formulaic URL where the team name is ‘-’ delimited. Since this page only has information on current teams, older teams like the Toronto Arenas or Montreal Maroons will not appear. Typically, these names might wind up breaking a loop when they throw an error. However, the use of the &lt;code&gt;possibly()&lt;/code&gt; function from &lt;code&gt;purrr&lt;/code&gt; will accommodate the error handling. The &lt;code&gt;possibly()&lt;/code&gt; function wraps another function and has an &lt;code&gt;otherwise&lt;/code&gt; parameter that allows the user to say what the function should provide in case of an error.&lt;/p&gt;
&lt;p&gt;In this case, the &lt;code&gt;possibly()&lt;/code&gt; function wraps an anonymous function that:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Takes a string for a team name, &lt;code&gt;t&lt;/code&gt;, which is converted to lower-case and has the spaces replaced with dashes&lt;/li&gt;
&lt;li&gt;Scrapes the first instance of the &lt;code&gt;.colorblock&lt;/code&gt; CSS class from the &lt;a href=&#34;https://teamcolorcodes.com&#34;&gt;teamcolorcodes.com&lt;/a&gt; webpage for the specific team as text.&lt;/li&gt;
&lt;li&gt;Performs a regular expression match to find the HEX code for the color&lt;/li&gt;
&lt;li&gt;Since &lt;code&gt;str_match&lt;/code&gt; returns a matrix where the first column is the entire match and each additional column represents a capture group, pulls the 2nd column from the matrix.&lt;/li&gt;
&lt;li&gt;Finally, the function returns a 1-row tibble with the team name, &lt;code&gt;t&lt;/code&gt;, and the HEX code, named &lt;code&gt;color&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;In the case that there’s an error, the function will return a 1-row tibble with the team value set to ‘non-match’ and the color value set to &lt;code&gt;NA&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The &lt;code&gt;map_dfr&lt;/code&gt; function from &lt;code&gt;purrr&lt;/code&gt; is used to run the above function for all unique team names and append the results into a data.frame.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;get_color &amp;lt;-  possibly(
  function(t){
    tibble(
      team = t,
      color = glue(&amp;quot;https://teamcolorcodes.com/{t}-color-codes/&amp;quot;, 
                   t = str_replace_all(
                     str_to_lower(t), &amp;#39; &amp;#39;, &amp;#39;-&amp;#39;)
                   ) %&amp;gt;% 
              read_html() %&amp;gt;% 
              html_node(css = &amp;quot;.colorblock&amp;quot;) %&amp;gt;% 
              html_text() %&amp;gt;% 
              str_match(&amp;quot;Hex Color: (#[0-9A-Za-z]{6})&amp;quot;) %&amp;gt;% 
              .[, 2]
    )
  },
  otherwise = tibble(team = &amp;quot;non-match&amp;quot;, color = NA_character_))

nhl_colors &amp;lt;- map_dfr(unique(nhl_data$Team), get_color)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;combining-the-data&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Combining the Data&lt;/h3&gt;
&lt;p&gt;Finally, the color data is joined to the Champions data. In the cases where there was no match in the color data, I’m using &lt;em&gt;black&lt;/em&gt; as a default color.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;nhl_w_color &amp;lt;- nhl_data %&amp;gt;% 
  left_join(nhl_colors, by = c(&amp;quot;Team&amp;quot; = &amp;quot;team&amp;quot;)) %&amp;gt;% 
  mutate(
    color = if_else(is.na(color), &amp;quot;black&amp;quot;, color)
  ) &lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;visualizations&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Visualizations&lt;/h2&gt;
&lt;div id=&#34;the-overall-distribution-of-points-percentage-for-nhl-stanley-cup-champions&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;The Overall Distribution of Points Percentage for NHL Stanley Cup Champions&lt;/h3&gt;
&lt;p&gt;This code block is a doozy, as I added a lot of annotations for error bars, text labels, and arrows, plus theme formatting, to dress up what at its heart is a standard density plot.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;nhl_w_color %&amp;gt;% 
  ggplot(aes(x = pts_pct)) + 
    geom_density(fill = &amp;#39;#8394A1&amp;#39;) + 
    annotate(&amp;quot;errorbarh&amp;quot;,
            xmin = quantile(nhl_w_color$pts_pct, .10),
            xmax = quantile(nhl_w_color$pts_pct, .90),
            y = 6,
            color = &amp;quot;#e6e7eb&amp;quot;) + 
    annotate(&amp;quot;linerange&amp;quot;,
             x = median(nhl_w_color$pts_pct),
             ymin = 0,
             ymax = 5,
             color = &amp;quot;#e6e7eb&amp;quot;,
             lty = 2
    ) + 
    annotate(&amp;quot;text&amp;quot;,
             label = &amp;quot;Middle 80% and Median&amp;quot;,
             y = 6.45,
             x = median(nhl_w_color$pts_pct),
             color = &amp;quot;#e6e7eb&amp;quot;) + 
    annotate(&amp;quot;text&amp;quot;,
             label = quantile(nhl_w_color$pts_pct, .10) %&amp;gt;% 
               scales::percent(accuracy = 1),
             y = 5.2,
             x = quantile(nhl_w_color$pts_pct, .10),
             color = &amp;quot;#e6e7eb&amp;quot;) + 
    annotate(&amp;quot;text&amp;quot;,
             label = quantile(nhl_w_color$pts_pct, .90) %&amp;gt;% 
               scales::percent(accuracy = 1),
             y = 5.2,
             x = quantile(nhl_w_color$pts_pct, .90),
             color = &amp;quot;#e6e7eb&amp;quot;) + 
    geom_curve(
      x = median(nhl_w_color$pts_pct),
      xend = median(nhl_w_color$pts_pct)-.005,
      y = 6,
      yend = 3,
      color = &amp;quot;#e6e7eb&amp;quot;,
      arrow = arrow(length = unit(0.03, &amp;quot;npc&amp;quot;)),
      size = 1
    ) + 
    annotate(&amp;quot;text&amp;quot;, x = median(nhl_w_color$pts_pct)-.02, y = 3.3,
             label = median(nhl_w_color$pts_pct) %&amp;gt;% 
               scales::percent(accuracy = 1),
             color =  &amp;quot;#e6e7eb&amp;quot;) + 
    labs(title = &amp;quot;Points Percentage of Stanley Cup Champions (1918 - 2020)&amp;quot;,
         caption = &amp;quot;*Source: hockey-reference.com*&amp;quot;,
         x = &amp;quot;Points %&amp;quot;,
         y = &amp;quot;&amp;quot;
    ) + 
    scale_x_continuous(labels = scales::percent_format(accuracy = 1)) + 
    
    cowplot::theme_cowplot() + 
    theme(
      text = element_text(color = &amp;quot;#e6e7eb&amp;quot;, family = &amp;#39;BentonSans Regular&amp;#39;),
      plot.background = element_rect(fill = &amp;quot;#1a1c2e&amp;quot;),
      axis.text = element_text(color = &amp;quot;#e6e7eb&amp;quot;),
      axis.ticks = element_line(color = &amp;quot;#e6e7eb&amp;quot;),
      axis.line = element_line(color = &amp;quot;#878890&amp;quot;),
      plot.caption = element_markdown(),
      axis.title.y = element_blank(),
      axis.text.y = element_blank(),
      axis.ticks.y = element_blank(),
      axis.line.y = element_blank(),
      plot.title = element_text(hjust = .5)
    )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;index_files/figure-html/overall_density-1.png&#34; width=&#34;768&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Of the 100 champions that there is data for, the median points percentage is 63% while the middle 80% spans 54% - 74%. &lt;strong&gt;Ultimately this makes sense since you’d expect a champion to do better than just 50%&lt;/strong&gt;. However, there are some teams that are really great and have &amp;gt;80% points percentages and a few instances of unlikely champions with a points percentage in the 40s.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;has-the-distribution-of-champions-points-percentages-changed-by-decade&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Has the Distribution of Champions’ Points Percentages Changed By Decade?&lt;/h3&gt;
&lt;p&gt;To see the density curves over time, one approach would be to facet by decade and show each decade in its own panel. Another approach is to use the &lt;code&gt;ggridges&lt;/code&gt; package to make a ridge density plot with each density curve on its own line. The package is very easy to use, as it’s primarily a matter of adding a &lt;code&gt;y&lt;/code&gt; value and then using &lt;code&gt;geom_density_ridges&lt;/code&gt; vs. &lt;code&gt;geom_density&lt;/code&gt;.&lt;/p&gt;
&lt;div id=&#34;sidebar-computing-decades-from-years&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;Sidebar: Computing Decades from Years&lt;/h4&gt;
&lt;p&gt;In order to create the decade variable I use a trick I learned from &lt;a href=&#34;https://www.youtube.com/channel/UCeiiqmVK07qhY-wvg3IZiZQ&#34;&gt;David Robinson’s TidyTuesday videos&lt;/a&gt; which is to divide the number by bucket width, take the floor of the result, and then multiply it back by the bucket width.&lt;/p&gt;
&lt;p&gt;For example, 2016 divided by 10 is 201.6, which after taking the floor is 201, then multiplying back by 10 is 2010. So 2016 is in the 2010s decade.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;nhl_w_color %&amp;gt;% 
  mutate(decade = str_sub(Season, 1, 4),
         decade = as.integer(decade),
         decade = floor(decade/10)*10
  ) %&amp;gt;% 
  ggplot(aes(x = pts_pct, y = factor(decade), fill = factor(decade))) + 
    geom_density_ridges() + 
    geom_vline(xintercept = median(nhl_w_color$pts_pct), lty = 2, color = &amp;#39;white&amp;#39;) + 
    scale_x_continuous(labels = scales::percent_format(accuracy = 1)) + 
    scale_fill_viridis_d(option = &amp;quot;C&amp;quot;, guide = F) + 
    labs(
      x = &amp;quot;Points %&amp;quot;,
      y = &amp;quot;Decade&amp;quot;,
      title = &amp;quot;Points Percentage of Stanley Cup Champions (1918 - 2020)&amp;quot;,
      subtitle = &amp;quot;*By Decade*&amp;quot;,
      caption = &amp;quot;*Source: hockey-reference.com*&amp;quot;
    ) + 
    cowplot::theme_cowplot() +
    theme(
      plot.caption = element_markdown(),
      plot.subtitle = element_markdown(),
      text = element_text(color = &amp;quot;#e6e7eb&amp;quot;,  family = &amp;#39;BentonSans Regular&amp;#39;),
      plot.background = element_rect(fill = &amp;quot;#1a1c2e&amp;quot;),
      axis.text = element_text(color = &amp;quot;#e6e7eb&amp;quot;),
      axis.ticks = element_line(color = &amp;quot;#e6e7eb&amp;quot;),
      axis.line = element_line(color = &amp;quot;#878890&amp;quot;)
    )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;index_files/figure-html/ridgelines-1.png&#34; width=&#34;768&#34; /&gt;
I would have expected a trend of some sort, but there isn’t a very clear story in this chart. The main takeaways are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The 1970s seems to have had the most dominant teams from a points percentage standpoint&lt;/li&gt;
&lt;li&gt;There appears to be a large shift from the 1990s to the 2000s, which might be due to the introduction of the shootout and the overtime loss, which meant that three points could be awarded in a game (two for the winner, one for the overtime/shootout loser) rather than always two.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;looking-the-points-percentage-for-each-team&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Looking at the Points Percentage for Each Team&lt;/h3&gt;
&lt;p&gt;At the beginning of the post I mentioned that seeing a beeswarm plot provided the motivation for this post. Now I’ll actually create it. The following plot will have one point for each champion, highlighted in the team’s colors when that team’s tab is selected.&lt;/p&gt;
&lt;p&gt;Two things to note in this code block are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The tabset is dynamically generated by setting the chunk option &lt;code&gt;results=&#39;asis&#39;&lt;/code&gt; and then using &lt;code&gt;cat()&lt;/code&gt; to write the HTML for the tabs in a for-loop.&lt;/li&gt;
&lt;li&gt;In vanilla RMarkdown, the tabset effect is really easy to achieve with &lt;code&gt;{.tabset}&lt;/code&gt;, but in Blogdown/Hugo it’s a bit trickier to nail the formatting. It’s doable, though, by referencing the &lt;a href=&#34;https://getbootstrap.com/docs/4.0/components/navs/&#34;&gt;bootstrap.js documentation&lt;/a&gt;. To keep things looking decent, I’m omitting the code chunk here but include it at the bottom of the post.&lt;/li&gt;
&lt;/ul&gt;
&lt;style type=&#34;text/css&#34;&gt;
.nav-pills li a {
  font-size:14px;
  }
&lt;/style&gt;
&lt;ul class=&#34;nav nav-pills nav-fill&#34;&gt;
&lt;li class=&#34;nav-item&#34;&gt;
&lt;a class = &#34;nav-link active&#34; data-toggle=&#34;tab&#34; href=&#34;#anaheimducks&#34;&gt;   Anaheim Ducks    &lt;/a&gt;
&lt;/li&gt;
&lt;li class=&#34;nav-item&#34;&gt;
&lt;a class = &#34;nav-link &#34; data-toggle=&#34;tab&#34; href=&#34;#bostonbruins&#34;&gt;   Boston Bruins    &lt;/a&gt;
&lt;/li&gt;
&lt;li class=&#34;nav-item&#34;&gt;
&lt;a class = &#34;nav-link &#34; data-toggle=&#34;tab&#34; href=&#34;#calgaryflames&#34;&gt;   Calgary Flames   &lt;/a&gt;
&lt;/li&gt;
&lt;li class=&#34;nav-item&#34;&gt;
&lt;a class = &#34;nav-link &#34; data-toggle=&#34;tab&#34; href=&#34;#carolinahurricanes&#34;&gt;Carolina Hurricanes &lt;/a&gt;
&lt;/li&gt;
&lt;li class=&#34;nav-item&#34;&gt;
&lt;a class = &#34;nav-link &#34; data-toggle=&#34;tab&#34; href=&#34;#chicagoblackhawks&#34;&gt; Chicago Blackhawks &lt;/a&gt;
&lt;/li&gt;
&lt;li class=&#34;nav-item&#34;&gt;
&lt;a class = &#34;nav-link &#34; data-toggle=&#34;tab&#34; href=&#34;#coloradoavalanche&#34;&gt; Colorado Avalanche &lt;/a&gt;
&lt;/li&gt;
&lt;li class=&#34;nav-item&#34;&gt;
&lt;a class = &#34;nav-link &#34; data-toggle=&#34;tab&#34; href=&#34;#dallasstars&#34;&gt;    Dallas Stars    &lt;/a&gt;
&lt;/li&gt;
&lt;li class=&#34;nav-item&#34;&gt;
&lt;a class = &#34;nav-link &#34; data-toggle=&#34;tab&#34; href=&#34;#detroitredwings&#34;&gt; Detroit Red Wings  &lt;/a&gt;
&lt;/li&gt;
&lt;li class=&#34;nav-item&#34;&gt;
&lt;a class = &#34;nav-link &#34; data-toggle=&#34;tab&#34; href=&#34;#edmontonoilers&#34;&gt;  Edmonton Oilers   &lt;/a&gt;
&lt;/li&gt;
&lt;li class=&#34;nav-item&#34;&gt;
&lt;a class = &#34;nav-link &#34; data-toggle=&#34;tab&#34; href=&#34;#losangeleskings&#34;&gt; Los Angeles Kings  &lt;/a&gt;
&lt;/li&gt;
&lt;li class=&#34;nav-item&#34;&gt;
&lt;a class = &#34;nav-link &#34; data-toggle=&#34;tab&#34; href=&#34;#montrealcanadiens&#34;&gt; Montreal Canadiens &lt;/a&gt;
&lt;/li&gt;
&lt;li class=&#34;nav-item&#34;&gt;
&lt;a class = &#34;nav-link &#34; data-toggle=&#34;tab&#34; href=&#34;#montrealmaroons&#34;&gt;  Montreal Maroons  &lt;/a&gt;
&lt;/li&gt;
&lt;li class=&#34;nav-item&#34;&gt;
&lt;a class = &#34;nav-link &#34; data-toggle=&#34;tab&#34; href=&#34;#newjerseydevils&#34;&gt; New Jersey Devils  &lt;/a&gt;
&lt;/li&gt;
&lt;li class=&#34;nav-item&#34;&gt;
&lt;a class = &#34;nav-link &#34; data-toggle=&#34;tab&#34; href=&#34;#newyorkislanders&#34;&gt; New York Islanders &lt;/a&gt;
&lt;/li&gt;
&lt;li class=&#34;nav-item&#34;&gt;
&lt;a class = &#34;nav-link &#34; data-toggle=&#34;tab&#34; href=&#34;#newyorkrangers&#34;&gt;  New York Rangers  &lt;/a&gt;
&lt;/li&gt;
&lt;li class=&#34;nav-item&#34;&gt;
&lt;a class = &#34;nav-link &#34; data-toggle=&#34;tab&#34; href=&#34;#ottawasenators&#34;&gt;  Ottawa Senators   &lt;/a&gt;
&lt;/li&gt;
&lt;li class=&#34;nav-item&#34;&gt;
&lt;a class = &#34;nav-link &#34; data-toggle=&#34;tab&#34; href=&#34;#philadelphiaflyers&#34;&gt;Philadelphia Flyers &lt;/a&gt;
&lt;/li&gt;
&lt;li class=&#34;nav-item&#34;&gt;
&lt;a class = &#34;nav-link &#34; data-toggle=&#34;tab&#34; href=&#34;#pittsburghpenguins&#34;&gt;Pittsburgh Penguins &lt;/a&gt;
&lt;/li&gt;
&lt;li class=&#34;nav-item&#34;&gt;
&lt;a class = &#34;nav-link &#34; data-toggle=&#34;tab&#34; href=&#34;#stlouisblues&#34;&gt;  St. Louis Blues   &lt;/a&gt;
&lt;/li&gt;
&lt;li class=&#34;nav-item&#34;&gt;
&lt;a class = &#34;nav-link &#34; data-toggle=&#34;tab&#34; href=&#34;#tampabaylightning&#34;&gt;Tampa Bay Lightning &lt;/a&gt;
&lt;/li&gt;
&lt;li class=&#34;nav-item&#34;&gt;
&lt;a class = &#34;nav-link &#34; data-toggle=&#34;tab&#34; href=&#34;#torontoarenas&#34;&gt;   Toronto Arenas   &lt;/a&gt;
&lt;/li&gt;
&lt;li class=&#34;nav-item&#34;&gt;
&lt;a class = &#34;nav-link &#34; data-toggle=&#34;tab&#34; href=&#34;#torontomapleleafs&#34;&gt;Toronto Maple Leafs &lt;/a&gt;
&lt;/li&gt;
&lt;li class=&#34;nav-item&#34;&gt;
&lt;a class = &#34;nav-link &#34; data-toggle=&#34;tab&#34; href=&#34;#torontostpatricks&#34;&gt;Toronto St. Patricks&lt;/a&gt;
&lt;/li&gt;
&lt;li class=&#34;nav-item&#34;&gt;
&lt;a class = &#34;nav-link &#34; data-toggle=&#34;tab&#34; href=&#34;#washingtoncapitals&#34;&gt;Washington Capitals &lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class=&#34;tab-content&#34;&gt;
&lt;div id=&#34;anaheimducks&#34; class=&#34;tab-pane show active&#34;&gt;
&lt;img src=&#34;index_files/figure-html/unnamed-chunk-2-1.png&#34; width=&#34;768&#34; /&gt;
&lt;/div&gt;
&lt;div id=&#34;bostonbruins&#34; class=&#34;tab-pane&#34;&gt;
&lt;img src=&#34;index_files/figure-html/unnamed-chunk-2-2.png&#34; width=&#34;768&#34; /&gt;
&lt;/div&gt;
&lt;div id=&#34;calgaryflames&#34; class=&#34;tab-pane&#34;&gt;
&lt;img src=&#34;index_files/figure-html/unnamed-chunk-2-3.png&#34; width=&#34;768&#34; /&gt;
&lt;/div&gt;
&lt;div id=&#34;carolinahurricanes&#34; class=&#34;tab-pane&#34;&gt;
&lt;img src=&#34;index_files/figure-html/unnamed-chunk-2-4.png&#34; width=&#34;768&#34; /&gt;
&lt;/div&gt;
&lt;div id=&#34;chicagoblackhawks&#34; class=&#34;tab-pane&#34;&gt;
&lt;img src=&#34;index_files/figure-html/unnamed-chunk-2-5.png&#34; width=&#34;768&#34; /&gt;
&lt;/div&gt;
&lt;div id=&#34;coloradoavalanche&#34; class=&#34;tab-pane&#34;&gt;
&lt;img src=&#34;index_files/figure-html/unnamed-chunk-2-6.png&#34; width=&#34;768&#34; /&gt;
&lt;/div&gt;
&lt;div id=&#34;dallasstars&#34; class=&#34;tab-pane&#34;&gt;
&lt;img src=&#34;index_files/figure-html/unnamed-chunk-2-7.png&#34; width=&#34;768&#34; /&gt;
&lt;/div&gt;
&lt;div id=&#34;detroitredwings&#34; class=&#34;tab-pane&#34;&gt;
&lt;img src=&#34;index_files/figure-html/unnamed-chunk-2-8.png&#34; width=&#34;768&#34; /&gt;
&lt;/div&gt;
&lt;div id=&#34;edmontonoilers&#34; class=&#34;tab-pane&#34;&gt;
&lt;img src=&#34;index_files/figure-html/unnamed-chunk-2-9.png&#34; width=&#34;768&#34; /&gt;
&lt;/div&gt;
&lt;div id=&#34;losangeleskings&#34; class=&#34;tab-pane&#34;&gt;
&lt;img src=&#34;index_files/figure-html/unnamed-chunk-2-10.png&#34; width=&#34;768&#34; /&gt;
&lt;/div&gt;
&lt;div id=&#34;montrealcanadiens&#34; class=&#34;tab-pane&#34;&gt;
&lt;img src=&#34;index_files/figure-html/unnamed-chunk-2-11.png&#34; width=&#34;768&#34; /&gt;
&lt;/div&gt;
&lt;div id=&#34;montrealmaroons&#34; class=&#34;tab-pane&#34;&gt;
&lt;img src=&#34;index_files/figure-html/unnamed-chunk-2-12.png&#34; width=&#34;768&#34; /&gt;
&lt;/div&gt;
&lt;div id=&#34;newjerseydevils&#34; class=&#34;tab-pane&#34;&gt;
&lt;img src=&#34;index_files/figure-html/unnamed-chunk-2-13.png&#34; width=&#34;768&#34; /&gt;
&lt;/div&gt;
&lt;div id=&#34;newyorkislanders&#34; class=&#34;tab-pane&#34;&gt;
&lt;img src=&#34;index_files/figure-html/unnamed-chunk-2-14.png&#34; width=&#34;768&#34; /&gt;
&lt;/div&gt;
&lt;div id=&#34;newyorkrangers&#34; class=&#34;tab-pane&#34;&gt;
&lt;img src=&#34;index_files/figure-html/unnamed-chunk-2-15.png&#34; width=&#34;768&#34; /&gt;
&lt;/div&gt;
&lt;div id=&#34;ottawasenators&#34; class=&#34;tab-pane&#34;&gt;
&lt;img src=&#34;index_files/figure-html/unnamed-chunk-2-16.png&#34; width=&#34;768&#34; /&gt;
&lt;/div&gt;
&lt;div id=&#34;philadelphiaflyers&#34; class=&#34;tab-pane&#34;&gt;
&lt;img src=&#34;index_files/figure-html/unnamed-chunk-2-17.png&#34; width=&#34;768&#34; /&gt;
&lt;/div&gt;
&lt;div id=&#34;pittsburghpenguins&#34; class=&#34;tab-pane&#34;&gt;
&lt;img src=&#34;index_files/figure-html/unnamed-chunk-2-18.png&#34; width=&#34;768&#34; /&gt;
&lt;/div&gt;
&lt;div id=&#34;stlouisblues&#34; class=&#34;tab-pane&#34;&gt;
&lt;img src=&#34;index_files/figure-html/unnamed-chunk-2-19.png&#34; width=&#34;768&#34; /&gt;
&lt;/div&gt;
&lt;div id=&#34;tampabaylightning&#34; class=&#34;tab-pane&#34;&gt;
&lt;img src=&#34;index_files/figure-html/unnamed-chunk-2-20.png&#34; width=&#34;768&#34; /&gt;
&lt;/div&gt;
&lt;div id=&#34;torontoarenas&#34; class=&#34;tab-pane&#34;&gt;
&lt;img src=&#34;index_files/figure-html/unnamed-chunk-2-21.png&#34; width=&#34;768&#34; /&gt;
&lt;/div&gt;
&lt;div id=&#34;torontomapleleafs&#34; class=&#34;tab-pane&#34;&gt;
&lt;img src=&#34;index_files/figure-html/unnamed-chunk-2-22.png&#34; width=&#34;768&#34; /&gt;
&lt;/div&gt;
&lt;div id=&#34;torontostpatricks&#34; class=&#34;tab-pane&#34;&gt;
&lt;img src=&#34;index_files/figure-html/unnamed-chunk-2-23.png&#34; width=&#34;768&#34; /&gt;
&lt;/div&gt;
&lt;div id=&#34;washingtoncapitals&#34; class=&#34;tab-pane&#34;&gt;
&lt;img src=&#34;index_files/figure-html/unnamed-chunk-2-24.png&#34; width=&#34;768&#34; /&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Looking at the results of this plot, we see that the Montreal Canadiens have been the most frequent winner as well as the team that accounts for most of those 80%+ seasons. On the other hand, the Chicago Blackhawks have the honor of being the overachiever that won despite a sub-40% points percentage.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;making-a-histogram-with-team-logos&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Making a Histogram with Team Logos&lt;/h3&gt;
&lt;p&gt;An alternative view to the one above that doesn’t require highlighting would be a conventional histogram that uses the team logos rather than points or bars. The &lt;code&gt;ggimage&lt;/code&gt; package provides a &lt;code&gt;geom_image&lt;/code&gt; that can reference a URL for an image. Fortunately, the &lt;code&gt;teamcolors&lt;/code&gt; package contains a dataset with links to logos for each current NHL team. However, for some of the champion teams that no longer exist I needed to add their logos manually.&lt;/p&gt;
&lt;p&gt;In this code block I manually create bin widths of 2.5% using the floor trick mentioned above and use a cumulative sum over a dummy variable to create the stacking effect for the logos. Then &lt;code&gt;geom_image&lt;/code&gt; references the URLs contained in the ‘logo’ column.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;nhl_w_color %&amp;gt;% 
  left_join(teamcolors::teamcolors %&amp;gt;% select(name, logo), 
            by = c(&amp;#39;Team&amp;#39; = &amp;#39;name&amp;#39;)) %&amp;gt;% 
  mutate(
    logo = case_when(
      Team == &amp;#39;Montreal Maroons&amp;#39; ~ &amp;#39;https://content.sportslogos.net/logos/1/40/thumbs/4039161926.gif&amp;#39;,
      Team == &amp;#39;Toronto Arenas&amp;#39; ~ &amp;#39;https://content.sportslogos.net/logos/1/996/thumbs/lgtkven0lgs74prrf26p6rmes.gif&amp;#39;,
      Team == &amp;#39;Toronto St. Patricks&amp;#39; ~ &amp;#39;https://content.sportslogos.net/logos/1/997/thumbs/6438.gif&amp;#39;,
      TRUE ~ logo
    ),
    point_pct_bckt = floor(pts_pct*100/2.5)*2.5/100
  ) %&amp;gt;% 
  arrange(point_pct_bckt, desc(Team)) %&amp;gt;% 
  group_by(point_pct_bckt) %&amp;gt;% 
  mutate(
    dummy = 1,
    y_val = (cumsum(dummy)-1)*3
  ) %&amp;gt;% 
  ggplot(aes(x = point_pct_bckt, y = y_val)) + 
    geom_image(aes(image = logo),
               asp = 1.5,
               size = .05
               ) +
    geom_vline(xintercept = quantile(nhl_data$pts_pct, .5), lty = 2) + 
    labs(x = &amp;quot;Points %&amp;quot;, y = &amp;quot;&amp;quot;, 
         title = &amp;quot;Points Percentage of Stanley Cup Champions (1918 - 2020)&amp;quot;,
         caption = &amp;quot;*Source: hockey-reference.com*&amp;quot;) + 
    scale_x_continuous(labels = scales::percent_format(accuracy = 1)) + 
    cowplot::theme_cowplot() + 
    theme(
      text = element_text( family = &amp;#39;BentonSans Regular&amp;#39;),
      axis.title.y = element_blank(),
      axis.text.y = element_blank(),
      axis.ticks.y = element_blank(),
      axis.line.y = element_blank(),
      plot.caption = element_markdown(),
      plot.subtitle = element_markdown(),
      plot.margin = unit(rep(1.2, 4), &amp;quot;cm&amp;quot;),
      plot.title = element_text(hjust = .7)
  )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;index_files/figure-html/hist_with_images-1.png&#34; width=&#34;768&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Now it’s much easier to see that Montreal makes up most of the dominant teams, while Chicago has been both dominant and at the lower end of the distribution.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;concluding-thoughts&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Concluding Thoughts&lt;/h2&gt;
&lt;p&gt;The &lt;code&gt;ggplot2&lt;/code&gt; ecosystem is quite impressive, and this post hardly scratches the surface of the possible options. Still, it shows four ways that a single variable, the points percentage of NHL Stanley Cup champions, can be represented:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;First, &lt;code&gt;geom_density&lt;/code&gt; creates a baseline distribution&lt;/li&gt;
&lt;li&gt;&lt;code&gt;geom_density_ridges&lt;/code&gt; from &lt;code&gt;ggridges&lt;/code&gt; can stratify that initial density plot over another variable&lt;/li&gt;
&lt;li&gt;&lt;code&gt;geom_quasirandom&lt;/code&gt; from &lt;code&gt;ggbeeswarm&lt;/code&gt; will make a ‘violin-type’ plot but with specific points that can then be operated on.&lt;/li&gt;
&lt;li&gt;Finally, &lt;code&gt;ggimage&lt;/code&gt; can change the geom to reference image URLs.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And as a bonus, I dynamically generated the tabsets for all the teams!&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;appendix-code-for-dynamic-tab-generation-in-blogdown&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Appendix: Code for Dynamic Tab Generation in Blogdown&lt;/h2&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;##Construct Tabs
cat(&amp;#39;&amp;lt;ul class=&amp;quot;nav nav-pills nav-fill&amp;quot;&amp;gt; \n&amp;#39;)
for(t in sort(unique(nhl_data$Team))){
  tid = str_to_lower(str_remove_all(t, &amp;#39; |\\.&amp;#39;))
  cat(glue(&amp;#39;&amp;lt;li class=&amp;quot;nav-item&amp;quot;&amp;gt;&amp;lt;a class = &amp;quot;nav-link {active}&amp;quot; data-toggle=&amp;quot;tab&amp;quot; href=&amp;quot;#{tid}&amp;quot;&amp;gt;{t}&amp;lt;/a&amp;gt;&amp;lt;/li&amp;gt; \n&amp;#39;,
      active = if_else(t == sort(unique(nhl_data$Team))[1], &amp;quot;active&amp;quot;, &amp;quot;&amp;quot;)))
}
cat(&amp;#39;&amp;lt;/ul&amp;gt; \n&amp;#39;)

cat(&amp;#39;&amp;lt;div class=&amp;quot;tab-content&amp;quot;&amp;gt; \n&amp;#39;)

for(t in sort(unique(nhl_data$Team))){
  tid = str_to_lower(str_remove_all(t, &amp;#39; |\\.&amp;#39;))
  cat(glue(&amp;#39;&amp;lt;div id=&amp;quot;{tid}&amp;quot; class=&amp;quot;tab-pane {active}&amp;quot;&amp;gt; \n&amp;#39;,
           active = if_else(t == sort(unique(nhl_data$Team))[1], &amp;quot;show active&amp;quot;, &amp;quot;&amp;quot;)))
  set.seed(20201121)
  
  g &amp;lt;- nhl_w_color %&amp;gt;% 
      mutate(color = if_else(Team == glue(&amp;#39;{t}&amp;#39;), 
                             color, 
                             alpha(&amp;quot;grey&amp;quot;, 0.7))) %&amp;gt;% 
    ggplot(aes(y = 1, x = pts_pct, color = color)) + 
    geom_quasirandom(method = &amp;quot;tukeyDense&amp;quot;, groupOnX=F, size = 3, width = 0.2) +
    geom_vline(xintercept = quantile(nhl_data$pts_pct, .5), lty = 2) + 
    labs(x = &amp;quot;Points %&amp;quot;, y = &amp;quot;&amp;quot;, 
         title = &amp;quot;Points Percentage of Stanley Cup Champions (1918 - 2020)&amp;quot;,
         subtitle = glue(&amp;quot;&amp;lt;span style=&amp;#39;color:{col};&amp;#39;&amp;gt;&amp;lt;b&amp;gt;&amp;lt;i&amp;gt;{t}&amp;lt;/i&amp;gt;&amp;lt;/b&amp;gt;&amp;lt;/span&amp;gt; Championships Highlighted&amp;quot;,
                         col = nhl_w_color %&amp;gt;% 
                           filter(Team == glue(&amp;#39;{t}&amp;#39;)) %&amp;gt;% 
                           pull(color) %&amp;gt;% 
                           unique
                         ),
         caption = &amp;quot;*Source: hockey-reference.com*&amp;quot;) + 
    scale_color_identity(guide = F) + 
    scale_x_continuous(labels = scales::percent_format(accuracy = 1)) + 
    cowplot::theme_cowplot() + 
    theme(
      text = element_text( family = &amp;#39;BentonSans Regular&amp;#39;),
      axis.title.y = element_blank(),
      axis.text.y = element_blank(),
      axis.ticks.y = element_blank(),
      axis.line.y = element_blank(),
      plot.caption = element_markdown(),
      plot.subtitle = element_markdown(),
      plot.margin = unit(rep(1.2, 4), &amp;quot;cm&amp;quot;)
    )
  
  print(g)
  
  cat(&amp;quot;&amp;lt;/div&amp;gt; \n&amp;quot;) 
}
cat(&amp;quot;&amp;lt;/div&amp;gt; \n&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>What&#39;s the most successful Dancing With the Stars &#34;Profession&#34;? Visualizing with {gt}</title>
      <link>https://jlaw.netlify.app/2020/11/24/what-s-the-most-successful-dancing-with-the-stars-profession-visualizing-with-gt/</link>
      <pubDate>Tue, 24 Nov 2020 00:00:00 +0000</pubDate>
      <guid>https://jlaw.netlify.app/2020/11/24/what-s-the-most-successful-dancing-with-the-stars-profession-visualizing-with-gt/</guid>
      <description>


&lt;div id=&#34;motivation&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Motivation&lt;/h1&gt;
&lt;p&gt;During this pandemic I’ve found a source of comfort in Dancing with the Stars (DWTS). I’d never watched a season before this one, and I think the main reasons for starting now are:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Lack of anything else to watch&lt;/li&gt;
&lt;li&gt;The rapper Nelly (and the St. Lunatics) have a near and dear place in my heart.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;On the R front, I’ve wanted to mess around with the &lt;code&gt;gt&lt;/code&gt; package for a while now but hadn’t had a great reason to. I had originally wanted to do a post on whether DWTS has “score inflation” throughout the season, but that wound up being more complicated than I would have liked. So instead, why not answer &lt;strong&gt;what is the most successful type of star on Dancing with the Stars?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;And on the &lt;code&gt;gt&lt;/code&gt; front a huge shout-out to &lt;a href=&#34;https://rpubs.com/kaustav/table_contest_2020&#34;&gt;Kaustav Sen&lt;/a&gt; whose post on &lt;code&gt;gt&lt;/code&gt; for the Great American Beer Festival served as a large design inspiration for this post.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;the-final-output&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;The Final Output&lt;/h1&gt;
&lt;p&gt;At the end of this post, the final output for the table will look like:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;final_table.png&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;the-pre-processing&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;The Pre-Processing&lt;/h1&gt;
&lt;div id=&#34;load-the-libraries&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Load the Libraries&lt;/h2&gt;
&lt;p&gt;The main focus of this post is on the &lt;code&gt;gt&lt;/code&gt; package to make the table, however, other packages are used to get and work with the data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(rvest) #Web Scrape Wikipedia
library(tidyverse) #Data Manipulation / Plots
library(lubridate) #Date Manipulation
library(gt) #Making Fancy Tables / The Focus of This Post
library(glue) #Text Manipulation&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;getting-all-the-dwts-contestants&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Getting all the DWTS Contestants&lt;/h2&gt;
&lt;p&gt;In order to find the most successful star type, we need a list of all the contestants. Fortunately, Wikipedia has a page for every season, and each page has a table of information about the contestants, including their name, what they’re known for, and their status for the season.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;contestant_list_example.PNG&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Since there are 29 completed Dancing with the Stars seasons, this seems like a job for a function that iterates through each season’s Wikipedia page to extract that table. One note is that Season 15 was an all-star season, so it will be excluded from this analysis. Unfortunately, the contestant table isn’t always in the same place on the page, so the function will need to be a little flexible.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;dwts_constants &amp;lt;- function(season_number, tbl_number){
  read_html(glue(&amp;#39;https://en.wikipedia.org/wiki/Dancing_with_the_Stars_(American_season_{season_number})&amp;#39;)) %&amp;gt;% 
    html_nodes(&amp;#39;table&amp;#39;) %&amp;gt;% 
    .[[tbl_number]] %&amp;gt;%
    html_table() %&amp;gt;% 
    mutate(season = season_number) %&amp;gt;% 
    janitor::clean_names()
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Given a season number and the number of the table on the page to extract, the above function will extract and lightly clean the data. The following code stacks all of the seasons’ contestant tables on top of each other.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;contestants &amp;lt;- dwts_constants(1, 2) %&amp;gt;% 
  bind_rows(dwts_constants(2, 3) ) %&amp;gt;% 
  bind_rows(dwts_constants(3, 3) ) %&amp;gt;% 
  bind_rows(dwts_constants(4, 3) ) %&amp;gt;% 
  bind_rows(dwts_constants(5, 2) ) %&amp;gt;% 
  bind_rows(dwts_constants(6, 2) ) %&amp;gt;% 
  bind_rows(dwts_constants(7, 2) ) %&amp;gt;% 
  bind_rows(dwts_constants(8, 2) ) %&amp;gt;% 
  bind_rows(dwts_constants(9, 2) ) %&amp;gt;% 
  bind_rows(dwts_constants(10, 2)) %&amp;gt;% 
  bind_rows(dwts_constants(11, 2)) %&amp;gt;% 
  bind_rows(dwts_constants(12, 2)) %&amp;gt;% 
  bind_rows(dwts_constants(13, 2)) %&amp;gt;% 
  bind_rows(dwts_constants(14, 3)) %&amp;gt;% 
  #bind_rows(dwts_constants(15, 2)) %&amp;gt;%  #Season 15 is an All-Star Season
  bind_rows(dwts_constants(16, 2)) %&amp;gt;% 
  bind_rows(dwts_constants(17, 2)) %&amp;gt;% 
  bind_rows(dwts_constants(18, 2)) %&amp;gt;% 
  bind_rows(dwts_constants(19, 2)) %&amp;gt;% 
  bind_rows(dwts_constants(20, 2)) %&amp;gt;% 
  bind_rows(dwts_constants(21, 2)) %&amp;gt;% 
  bind_rows(dwts_constants(22, 2)) %&amp;gt;% 
  bind_rows(dwts_constants(23, 2)) %&amp;gt;% 
  bind_rows(dwts_constants(24, 2)) %&amp;gt;% 
  bind_rows(dwts_constants(25, 2)) %&amp;gt;% 
  bind_rows(dwts_constants(26, 2)) %&amp;gt;% 
  bind_rows(dwts_constants(27, 2)) %&amp;gt;% 
  bind_rows(dwts_constants(28, 2)) %&amp;gt;% 
  bind_rows(dwts_constants(29, 2))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Directly from this function the raw data looks like:&lt;/p&gt;
&lt;table&gt;
&lt;colgroup&gt;
&lt;col width=&#34;10%&#34; /&gt;
&lt;col width=&#34;12%&#34; /&gt;
&lt;col width=&#34;12%&#34; /&gt;
&lt;col width=&#34;17%&#34; /&gt;
&lt;col width=&#34;4%&#34; /&gt;
&lt;col width=&#34;4%&#34; /&gt;
&lt;col width=&#34;13%&#34; /&gt;
&lt;col width=&#34;2%&#34; /&gt;
&lt;col width=&#34;14%&#34; /&gt;
&lt;col width=&#34;9%&#34; /&gt;
&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;celebrity&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;notability_known_for&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;professional_partner&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;status&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;season&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;result&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;professional_partner_a&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;ref&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;professional_partner_a_7&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;celebrity_12_13&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Trista Sutter&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;The Bachelorette star&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Louis Van Amstel&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Eliminated 1ston June 8, 2005&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NA&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NA&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NA&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NA&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Evander Holyfield&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Heavyweight boxer&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Edyta Sliwinska&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Eliminated 2ndon June 15, 2005&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NA&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NA&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NA&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NA&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Rachel Hunter&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Supermodel&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Jonathan Roberts&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Eliminated 3rdon June 22, 2005&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NA&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NA&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NA&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NA&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;div id=&#34;cleaning-the-data&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Cleaning the data&lt;/h3&gt;
&lt;p&gt;Looking at the raw data there is a lot of data cleaning to be done:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;The contestant’s result shows up in two different columns (&lt;code&gt;result&lt;/code&gt;, &lt;code&gt;status&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;result&lt;/code&gt; field has both placing information and the dates when contestants were eliminated or won. For example, “Eliminated 1st” needs to be turned into last place (depending on how many contestants there were that season)&lt;/li&gt;
&lt;li&gt;The data contains contestants who withdrew so their place had nothing to do with their “Profession”&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;result&lt;/code&gt; field can be cleaned up to be standardized&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;notability&lt;/code&gt; field needs to be standardized&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;All of these steps are handled in the following code:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;contestant_clean &amp;lt;- contestants %&amp;gt;%
  mutate(
    #Compress Fields That Have Different Names Per Season
    result = coalesce(result, status),
    #Get the dates when Eliminations / Wins Happen
    status_date = mdy(str_extract(result, &amp;quot;\\w+ \\d+, \\d{4}&amp;quot;)),
    #Get the Order of Elimination
    eliminated_state = str_extract(result, &amp;quot;Eliminated \\d+&amp;quot;) %&amp;gt;% 
      str_remove(&amp;#39;Eliminated &amp;#39;) %&amp;gt;%
      as.numeric()
  ) %&amp;gt;% 
  # Remove Contestants that Withdraw
  filter(!str_detect(result, &amp;#39;Withdrew&amp;#39;)) %&amp;gt;% 
  group_by(season) %&amp;gt;% 
  # Add the number of contestants for each season
  mutate(n_contestants = n()) %&amp;gt;% 
  ungroup() %&amp;gt;% 
  #Overwrite Places for 1st/2nd/3rd
  mutate(
    place = case_when(
      str_detect(result, &amp;quot;Winner&amp;quot;) ~ 1,
      str_detect(result, &amp;quot;Runner|Second&amp;quot;) ~ 2,
      str_detect(result, &amp;quot;Third&amp;quot;) ~ 3,
      str_detect(result, &amp;quot;Fourth&amp;quot;) ~ 4,
      TRUE ~ n_contestants - eliminated_state + 1
    ),
    # Standardize What Contestants Are &amp;quot;Known For&amp;quot;
    known_for = case_when(
      str_detect(str_to_lower(notability_known_for), 
                 &amp;#39;actor|actress|disney&amp;#39;) ~ &amp;#39;Actor/Actress&amp;#39;,
      str_detect(str_to_lower(notability_known_for), 
                 &amp;#39;singer|rapper|band|composer&amp;#39;) ~ &amp;#39;Musician&amp;#39;,
      str_detect(str_to_lower(notability_known_for), 
                 &amp;#39;model|miss usa&amp;#39;) ~ &amp;#39;Model&amp;#39;,
      str_detect(str_to_lower(notability_known_for),
                 &amp;#39;nhl|nfl|nba|boxer|olympi|diva|tennis|soccer|football|lakers|swim|ufc|nascar|snowboard|wwe|mlb|basketball|rodeo|skier|race car|jockey|dolphins|steelers|packers|lakers|indy 500&amp;#39;) ~ &amp;#39;Athlete&amp;#39;,
      str_detect(str_to_lower(notability_known_for), 
                 &amp;#39;journ|anchor|host|caster|personality&amp;#39;) ~ &amp;#39;Media Personality&amp;#39;,
      str_detect(str_to_lower(notability_known_for), 
                 &amp;#39;bachelor|star|chef&amp;#39;) ~ &amp;#39;Reality TV Star&amp;#39;,
      str_detect(str_to_lower(notability_known_for), 
                 &amp;#39;comedian|magician|entertainer&amp;#39;) ~ &amp;#39;Entertainer&amp;#39;,
      str_detect(str_to_lower(notability_known_for), 
                 &amp;#39;owner|co-founder|business|designer&amp;#39;) ~ &amp;#39;Businessperson&amp;#39;,
      TRUE ~ &amp;quot;Other&amp;quot;
    )
  ) %&amp;gt;% 
  # Fix Celebrity Column for Season 29
  mutate(celebrity = if_else(is.na(celebrity), celebrity_12_13, celebrity)) %&amp;gt;% 
  # Remove Unneeded Columns
  select(-contains(&amp;#39;professional&amp;#39;), -ref, -status, 
         -eliminated_state, -celebrity_12_13) %&amp;gt;% 
  #Want Scores to be between 0 and 1 where 1 is Last Place and 0 is first place.
  mutate(scaled_place = (place-1)/(n_contestants-1))&lt;/code&gt;&lt;/pre&gt;
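<p>The scaling in the last step can be sanity-checked in isolation (the <code>scaled_place()</code> helper is my own illustrative wrapper around the same formula the <code>mutate()</code> uses):</p>
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Sketch of the scaling: first place maps to 0, last place maps to 1
scaled_place &amp;lt;- function(place, n_contestants) (place - 1) / (n_contestants - 1)

scaled_place(1, 6)  # 0 (the winner)
scaled_place(6, 6)  # 1 (eliminated first)&lt;/code&gt;&lt;/pre&gt;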
&lt;p&gt;The &lt;code&gt;scaled_place&lt;/code&gt; variable will be used to create a standardized density plot by putting each season on a 1 (Last Place) to 0 (1st Place) scale regardless of the number of contestants in the season. The cleaned data now looks like:&lt;/p&gt;
&lt;table style=&#34;width:100%;&#34;&gt;
&lt;colgroup&gt;
&lt;col width=&#34;12%&#34; /&gt;
&lt;col width=&#34;19%&#34; /&gt;
&lt;col width=&#34;4%&#34; /&gt;
&lt;col width=&#34;21%&#34; /&gt;
&lt;col width=&#34;8%&#34; /&gt;
&lt;col width=&#34;9%&#34; /&gt;
&lt;col width=&#34;4%&#34; /&gt;
&lt;col width=&#34;10%&#34; /&gt;
&lt;col width=&#34;8%&#34; /&gt;
&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;celebrity&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;notability_known_for&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;season&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;result&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;status_date&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;n_contestants&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;place&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;known_for&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;scaled_place&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Trista Sutter&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;The Bachelorette star&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Eliminated 1ston June 8, 2005&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2005-06-08&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;6&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;6&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Reality TV Star&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Evander Holyfield&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Heavyweight boxer&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Eliminated 2ndon June 15, 2005&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2005-06-15&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;6&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;5&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Athlete&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Rachel Hunter&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Supermodel&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Eliminated 3rdon June 22, 2005&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2005-06-22&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;6&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;4&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Model&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Joey McIntyre&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;New Kids on the Block singer&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Third placeon June 29, 2005&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2005-06-29&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;6&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;3&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Musician&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;John O’Hurley&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Actor &amp;amp; game show host&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Runner-upon July 6, 2005&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2005-07-06&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;6&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Actor/Actress&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.2&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
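&lt;p&gt;The &lt;code&gt;scaled_place&lt;/code&gt; values above can be verified by hand. For the six-contestant first season, applying the formula to each place gives:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;place &amp;lt;- 6:1
(place - 1) / (6 - 1)
# [1] 1.0 0.8 0.6 0.4 0.2 0.0&lt;/code&gt;&lt;/pre&gt;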
&lt;p&gt;Using regular expressions, I’ve collapsed the 237 different levels into 9, which are:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;Profession&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;Examples&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Actor/Actress&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Zendaya, Alexa PenaVega, Amber Riley&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Athlete&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Jamie Anderson, Antonio Brown, Martina Navratilova&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Businessperson&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Steve Wozniak, Robert Herjavec, Mark Cuban&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Entertainer&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Penn Jillette, Marie Osmond, Margaret Cho&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Media Personality&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Jerry Springer, Bobby Bones, Giselle Fernandez&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Model&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Bonner Bolton, Shandi Finnessey, Sailor Brinkley-Cook&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Musician&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Joey McIntyre, Gavin DeGraw, Nick Carter&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Other&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Sean Spicer, Buzz Aldrin, Noah Galloway&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Reality TV Star&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;The Situation, Lisa Vanderpump, Terra Jolé&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
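&lt;p&gt;One thing to keep in mind with this approach is that &lt;code&gt;case_when()&lt;/code&gt; returns the &lt;em&gt;first&lt;/em&gt; matching condition, so the order of the patterns matters. A minimal sketch of the matching logic (the &lt;code&gt;bios&lt;/code&gt; vector here is hypothetical, and only two of the patterns are shown):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(dplyr)
library(stringr)

# Hypothetical bios to illustrate first-match-wins ordering
bios &amp;lt;- c(&amp;#39;Soap opera actress&amp;#39;, &amp;#39;Reality TV star&amp;#39;, &amp;#39;Astronaut&amp;#39;)

case_when(
  str_detect(str_to_lower(bios), &amp;#39;actor|actress|disney&amp;#39;) ~ &amp;#39;Actor/Actress&amp;#39;,
  str_detect(str_to_lower(bios), &amp;#39;bachelor|star|chef&amp;#39;) ~ &amp;#39;Reality TV Star&amp;#39;,
  TRUE ~ &amp;#39;Other&amp;#39;
)
# [1] &amp;quot;Actor/Actress&amp;quot;   &amp;quot;Reality TV Star&amp;quot; &amp;quot;Other&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Because &amp;#39;star&amp;#39; appears in many bios, placing the &amp;#39;bachelor|star|chef&amp;#39; pattern later in the chain keeps it from swallowing the more specific categories.&lt;/p&gt;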
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;constructing-the-table&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Constructing The Table&lt;/h1&gt;
&lt;div id=&#34;organizing-the-data&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Organizing the Data&lt;/h2&gt;
&lt;p&gt;For the table, we want the following information for each “Profession”:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;How many contestants were there?&lt;/li&gt;
&lt;li&gt;What percentages came in 1st, 2nd, 3rd, and Last?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Some quick &lt;code&gt;dplyr&lt;/code&gt; magic will collapse the list of contestants into the structure we want. We’ll also order the table by each “profession”’s percentage of first-place wins, descending.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;contestant_summary &amp;lt;- contestant_clean %&amp;gt;% 
  group_by(known_for) %&amp;gt;% 
  summarize(
    num_stars = n(),
    pct_1st_place = sum(place == 1)/n(),
    pct_2nd_place = sum(place == 2)/n(),
    pct_3rd_place = sum(place == 3)/n(),
    pct_last_place = sum(n_contestants == place) / n()
  ) %&amp;gt;% 
  arrange(-pct_1st_place)&lt;/code&gt;&lt;/pre&gt;
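&lt;p&gt;As an aside, dividing &lt;code&gt;sum()&lt;/code&gt; of a logical condition by &lt;code&gt;n()&lt;/code&gt; is the same as taking &lt;code&gt;mean()&lt;/code&gt; of that condition, so the summary could be written more compactly (a sketch of the idiom, not the code used above):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;contestant_clean %&amp;gt;% 
  group_by(known_for) %&amp;gt;% 
  summarize(
    num_stars = n(),
    pct_1st_place = mean(place == 1),
    pct_last_place = mean(place == n_contestants)
  ) %&amp;gt;% 
  arrange(-pct_1st_place)&lt;/code&gt;&lt;/pre&gt;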
&lt;/div&gt;
&lt;div id=&#34;using-gt-to-build-the-table&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Using {gt} to Build the Table&lt;/h2&gt;
&lt;p&gt;Now onto actually constructing the table with &lt;code&gt;gt&lt;/code&gt;. The &lt;a href=&#34;https://gt.rstudio.com/&#34;&gt;&lt;code&gt;gt&lt;/code&gt; package&lt;/a&gt; provides a grammar for tables similar to what &lt;code&gt;ggplot2&lt;/code&gt; does for charts. The package provides this visualization to show the different parts of a table:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://gt.rstudio.com/reference/figures/gt_parts_of_a_table.svg&#34; /&gt;&lt;/p&gt;
&lt;div id=&#34;step-1-the-basic-construction&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Step 1: The basic construction&lt;/h3&gt;
&lt;p&gt;The most basic construction of a table is done by using the &lt;code&gt;gt()&lt;/code&gt; function.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;(g1 &amp;lt;- gt(contestant_summary))&lt;/code&gt;&lt;/pre&gt;
&lt;style&gt;html {
  font-family: -apple-system, BlinkMacSystemFont, &#39;Segoe UI&#39;, Roboto, Oxygen, Ubuntu, Cantarell, &#39;Helvetica Neue&#39;, &#39;Fira Sans&#39;, &#39;Droid Sans&#39;, Arial, sans-serif;
}

#munfbphqza .gt_table {
  display: table;
  border-collapse: collapse;
  margin-left: auto;
  margin-right: auto;
  color: #333333;
  font-size: 16px;
  font-weight: normal;
  font-style: normal;
  background-color: #FFFFFF;
  width: auto;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #A8A8A8;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #A8A8A8;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
}

#munfbphqza .gt_heading {
  background-color: #FFFFFF;
  text-align: center;
  border-bottom-color: #FFFFFF;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
}

#munfbphqza .gt_title {
  color: #333333;
  font-size: 125%;
  font-weight: initial;
  padding-top: 4px;
  padding-bottom: 4px;
  border-bottom-color: #FFFFFF;
  border-bottom-width: 0;
}

#munfbphqza .gt_subtitle {
  color: #333333;
  font-size: 85%;
  font-weight: initial;
  padding-top: 0;
  padding-bottom: 4px;
  border-top-color: #FFFFFF;
  border-top-width: 0;
}

#munfbphqza .gt_bottom_border {
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
}

#munfbphqza .gt_col_headings {
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
}

#munfbphqza .gt_col_heading {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: normal;
  text-transform: inherit;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: bottom;
  padding-top: 5px;
  padding-bottom: 6px;
  padding-left: 5px;
  padding-right: 5px;
  overflow-x: hidden;
}

#munfbphqza .gt_column_spanner_outer {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: normal;
  text-transform: inherit;
  padding-top: 0;
  padding-bottom: 0;
  padding-left: 4px;
  padding-right: 4px;
}

#munfbphqza .gt_column_spanner_outer:first-child {
  padding-left: 0;
}

#munfbphqza .gt_column_spanner_outer:last-child {
  padding-right: 0;
}

#munfbphqza .gt_column_spanner {
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  vertical-align: bottom;
  padding-top: 5px;
  padding-bottom: 6px;
  overflow-x: hidden;
  display: inline-block;
  width: 100%;
}

#munfbphqza .gt_group_heading {
  padding: 8px;
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  text-transform: inherit;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: middle;
}

#munfbphqza .gt_empty_group_heading {
  padding: 0.5px;
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  vertical-align: middle;
}

#munfbphqza .gt_from_md &gt; :first-child {
  margin-top: 0;
}

#munfbphqza .gt_from_md &gt; :last-child {
  margin-bottom: 0;
}

#munfbphqza .gt_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  margin: 10px;
  border-top-style: solid;
  border-top-width: 1px;
  border-top-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: middle;
  overflow-x: hidden;
}

#munfbphqza .gt_stub {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  text-transform: inherit;
  border-right-style: solid;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
  padding-left: 12px;
}

#munfbphqza .gt_summary_row {
  color: #333333;
  background-color: #FFFFFF;
  text-transform: inherit;
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
}

#munfbphqza .gt_first_summary_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
}

#munfbphqza .gt_grand_summary_row {
  color: #333333;
  background-color: #FFFFFF;
  text-transform: inherit;
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
}

#munfbphqza .gt_first_grand_summary_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  border-top-style: double;
  border-top-width: 6px;
  border-top-color: #D3D3D3;
}

#munfbphqza .gt_striped {
  background-color: rgba(128, 128, 128, 0.05);
}

#munfbphqza .gt_table_body {
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
}

#munfbphqza .gt_footnotes {
  color: #333333;
  background-color: #FFFFFF;
  border-bottom-style: none;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
}

#munfbphqza .gt_footnote {
  margin: 0px;
  font-size: 90%;
  padding: 4px;
}

#munfbphqza .gt_sourcenotes {
  color: #333333;
  background-color: #FFFFFF;
  border-bottom-style: none;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
}

#munfbphqza .gt_sourcenote {
  font-size: 90%;
  padding: 4px;
}

#munfbphqza .gt_left {
  text-align: left;
}

#munfbphqza .gt_center {
  text-align: center;
}

#munfbphqza .gt_right {
  text-align: right;
  font-variant-numeric: tabular-nums;
}

#munfbphqza .gt_font_normal {
  font-weight: normal;
}

#munfbphqza .gt_font_bold {
  font-weight: bold;
}

#munfbphqza .gt_font_italic {
  font-style: italic;
}

#munfbphqza .gt_super {
  font-size: 65%;
}

#munfbphqza .gt_footnote_marks {
  font-style: italic;
  font-size: 65%;
}
&lt;/style&gt;
&lt;div id=&#34;munfbphqza&#34; style=&#34;overflow-x:auto;overflow-y:auto;width:auto;height:auto;&#34;&gt;&lt;table class=&#34;gt_table&#34;&gt;
  
  &lt;thead class=&#34;gt_col_headings&#34;&gt;
    &lt;tr&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_left&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;known_for&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;num_stars&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_right&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;pct_1st_place&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_right&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;pct_2nd_place&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_right&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;pct_3rd_place&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_right&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;pct_last_place&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody class=&#34;gt_table_body&#34;&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Athlete&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;79&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.13924051&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.10126582&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.06329114&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.10126582&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Musician&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;38&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07894737&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.10526316&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.15789474&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.13157895&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Actor/Actress&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;130&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07692308&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.09230769&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.06923077&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.04615385&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Reality TV Star&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;26&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07692308&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07692308&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07692308&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.03846154&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Model&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;14&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07142857&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07142857&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07142857&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Media Personality&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;21&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.04761905&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.04761905&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.04761905&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.23809524&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Businessperson&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;5&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.20000000&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Entertainer&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;5&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.20000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.40000000&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Other&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;9&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.44444444&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
  
  
&lt;/table&gt;&lt;/div&gt;
&lt;p&gt;When I said &lt;em&gt;basic&lt;/em&gt;, I meant &lt;strong&gt;basic&lt;/strong&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;step-2-adding-titles-and-subtitles&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Step 2: Adding Titles and Subtitles&lt;/h3&gt;
&lt;p&gt;The &lt;code&gt;tab_header()&lt;/code&gt; function allows alterations to the header of the table. The &lt;code&gt;title&lt;/code&gt; and &lt;code&gt;subtitle&lt;/code&gt; arguments create the title and subtitle respectively. A nice feature of &lt;code&gt;gt&lt;/code&gt; is the &lt;code&gt;html()&lt;/code&gt; function, which allows HTML and CSS to be used to style these titles. There is also an &lt;code&gt;md()&lt;/code&gt; function that renders Markdown.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;(g2 &amp;lt;- g1 %&amp;gt;% 
  tab_header(
    title = html(&amp;#39;Most &amp;lt;span style=&amp;quot;color:#F2CB05&amp;quot;&amp;gt;Successful&amp;lt;/span&amp;gt; Dancing With the Stars &amp;lt;i&amp;gt;&amp;quot;Professions&amp;quot;&amp;lt;/i&amp;gt;&amp;#39;),
    subtitle = html(
      &amp;quot;&amp;lt;span style = &amp;#39;color: grey&amp;#39;&amp;gt;Covering Seasons 1 to 29 (excluding All-Star Season 15)&amp;lt;/span&amp;gt;&amp;quot;
    )
  ))&lt;/code&gt;&lt;/pre&gt;
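&lt;p&gt;For comparison, a similar header could be written with Markdown via &lt;code&gt;md()&lt;/code&gt;, trading the color styling for lighter syntax. A minimal sketch (not run here):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;g1 %&amp;gt;% 
  tab_header(
    title = md(&amp;#39;Most Successful Dancing With the Stars *&amp;quot;Professions&amp;quot;*&amp;#39;),
    subtitle = md(&amp;#39;Covering Seasons 1 to 29 (excluding All-Star Season 15)&amp;#39;)
  )&lt;/code&gt;&lt;/pre&gt;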
&lt;style&gt;html {
  font-family: -apple-system, BlinkMacSystemFont, &#39;Segoe UI&#39;, Roboto, Oxygen, Ubuntu, Cantarell, &#39;Helvetica Neue&#39;, &#39;Fira Sans&#39;, &#39;Droid Sans&#39;, Arial, sans-serif;
}

#axkxzgcnox .gt_table {
  display: table;
  border-collapse: collapse;
  margin-left: auto;
  margin-right: auto;
  color: #333333;
  font-size: 16px;
  font-weight: normal;
  font-style: normal;
  background-color: #FFFFFF;
  width: auto;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #A8A8A8;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #A8A8A8;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
}

#axkxzgcnox .gt_heading {
  background-color: #FFFFFF;
  text-align: center;
  border-bottom-color: #FFFFFF;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
}

#axkxzgcnox .gt_title {
  color: #333333;
  font-size: 125%;
  font-weight: initial;
  padding-top: 4px;
  padding-bottom: 4px;
  border-bottom-color: #FFFFFF;
  border-bottom-width: 0;
}

#axkxzgcnox .gt_subtitle {
  color: #333333;
  font-size: 85%;
  font-weight: initial;
  padding-top: 0;
  padding-bottom: 4px;
  border-top-color: #FFFFFF;
  border-top-width: 0;
}

#axkxzgcnox .gt_bottom_border {
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
}

#axkxzgcnox .gt_col_headings {
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
}

#axkxzgcnox .gt_col_heading {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: normal;
  text-transform: inherit;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: bottom;
  padding-top: 5px;
  padding-bottom: 6px;
  padding-left: 5px;
  padding-right: 5px;
  overflow-x: hidden;
}

#axkxzgcnox .gt_column_spanner_outer {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: normal;
  text-transform: inherit;
  padding-top: 0;
  padding-bottom: 0;
  padding-left: 4px;
  padding-right: 4px;
}

#axkxzgcnox .gt_column_spanner_outer:first-child {
  padding-left: 0;
}

#axkxzgcnox .gt_column_spanner_outer:last-child {
  padding-right: 0;
}

#axkxzgcnox .gt_column_spanner {
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  vertical-align: bottom;
  padding-top: 5px;
  padding-bottom: 6px;
  overflow-x: hidden;
  display: inline-block;
  width: 100%;
}

#axkxzgcnox .gt_group_heading {
  padding: 8px;
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  text-transform: inherit;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: middle;
}

#axkxzgcnox .gt_empty_group_heading {
  padding: 0.5px;
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  vertical-align: middle;
}

#axkxzgcnox .gt_from_md &gt; :first-child {
  margin-top: 0;
}

#axkxzgcnox .gt_from_md &gt; :last-child {
  margin-bottom: 0;
}

#axkxzgcnox .gt_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  margin: 10px;
  border-top-style: solid;
  border-top-width: 1px;
  border-top-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: middle;
  overflow-x: hidden;
}

#axkxzgcnox .gt_stub {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  text-transform: inherit;
  border-right-style: solid;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
  padding-left: 12px;
}

#axkxzgcnox .gt_summary_row {
  color: #333333;
  background-color: #FFFFFF;
  text-transform: inherit;
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
}

#axkxzgcnox .gt_first_summary_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
}

#axkxzgcnox .gt_grand_summary_row {
  color: #333333;
  background-color: #FFFFFF;
  text-transform: inherit;
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
}

#axkxzgcnox .gt_first_grand_summary_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  border-top-style: double;
  border-top-width: 6px;
  border-top-color: #D3D3D3;
}

#axkxzgcnox .gt_striped {
  background-color: rgba(128, 128, 128, 0.05);
}

#axkxzgcnox .gt_table_body {
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
}

#axkxzgcnox .gt_footnotes {
  color: #333333;
  background-color: #FFFFFF;
  border-bottom-style: none;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
}

#axkxzgcnox .gt_footnote {
  margin: 0px;
  font-size: 90%;
  padding: 4px;
}

#axkxzgcnox .gt_sourcenotes {
  color: #333333;
  background-color: #FFFFFF;
  border-bottom-style: none;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
}

#axkxzgcnox .gt_sourcenote {
  font-size: 90%;
  padding: 4px;
}

#axkxzgcnox .gt_left {
  text-align: left;
}

#axkxzgcnox .gt_center {
  text-align: center;
}

#axkxzgcnox .gt_right {
  text-align: right;
  font-variant-numeric: tabular-nums;
}

#axkxzgcnox .gt_font_normal {
  font-weight: normal;
}

#axkxzgcnox .gt_font_bold {
  font-weight: bold;
}

#axkxzgcnox .gt_font_italic {
  font-style: italic;
}

#axkxzgcnox .gt_super {
  font-size: 65%;
}

#axkxzgcnox .gt_footnote_marks {
  font-style: italic;
  font-size: 65%;
}
&lt;/style&gt;
&lt;div id=&#34;axkxzgcnox&#34; style=&#34;overflow-x:auto;overflow-y:auto;width:auto;height:auto;&#34;&gt;&lt;table class=&#34;gt_table&#34;&gt;
  &lt;thead class=&#34;gt_header&#34;&gt;
    &lt;tr&gt;
      &lt;th colspan=&#34;6&#34; class=&#34;gt_heading gt_title gt_font_normal&#34; style&gt;Most &lt;span style=&#34;color:#F2CB05&#34;&gt;Successful&lt;/span&gt; Dancing With the Stars &lt;i&gt;&#34;Professions&#34;&lt;/i&gt;&lt;/th&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th colspan=&#34;6&#34; class=&#34;gt_heading gt_subtitle gt_font_normal gt_bottom_border&#34; style&gt;&lt;span style = &#39;color: grey&#39;&gt;Covering Seasons 1 to 29 (excluding All-Star Season 15)&lt;/span&gt;&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;thead class=&#34;gt_col_headings&#34;&gt;
    &lt;tr&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_left&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;known_for&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;num_stars&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_right&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;pct_1st_place&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_right&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;pct_2nd_place&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_right&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;pct_3rd_place&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_right&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;pct_last_place&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody class=&#34;gt_table_body&#34;&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Athlete&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;79&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.13924051&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.10126582&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.06329114&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.10126582&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Musician&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;38&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07894737&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.10526316&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.15789474&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.13157895&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Actor/Actress&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;130&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07692308&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.09230769&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.06923077&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.04615385&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Reality TV Star&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;26&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07692308&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07692308&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07692308&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.03846154&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Model&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;14&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07142857&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07142857&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07142857&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Media Personality&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;21&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.04761905&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.04761905&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.04761905&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.23809524&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Businessperson&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;5&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.20000000&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Entertainer&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;5&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.20000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.40000000&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Other&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;9&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.44444444&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
  
  
&lt;/table&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;step-3-adding-more-style-to-the-title&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Step 3: Adding More Style to the Title&lt;/h3&gt;
&lt;p&gt;The &lt;code&gt;tab_style()&lt;/code&gt; function applies formatting to table rows and cells. The &lt;code&gt;style&lt;/code&gt; argument tells &lt;code&gt;gt&lt;/code&gt; what the style will be, and the &lt;code&gt;locations&lt;/code&gt; argument says where that style should be applied.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;google_font()&lt;/code&gt; function allows access to all the fonts on the &lt;a href=&#34;https://fonts.google.com/&#34;&gt;Google Fonts&lt;/a&gt; site.&lt;/p&gt;
&lt;p&gt;In this step I’m making the title left-justified, sized xx-large, and set in the Anton font.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;(g3 &amp;lt;- g2 %&amp;gt;% 
  tab_style(
    style = cell_text(
      font = google_font(&amp;quot;Anton&amp;quot;), 
      align = &amp;quot;left&amp;quot;, 
      size = &amp;quot;xx-large&amp;quot;
    ),
    locations = cells_title(&amp;quot;title&amp;quot;)
  )
)&lt;/code&gt;&lt;/pre&gt;
&lt;style&gt;@import url(&#34;https://fonts.googleapis.com/css2?family=Anton:ital,wght@0,100;0,200;0,300;0,400;0,500;0,600;0,700;0,800;0,900;1,100;1,200;1,300;1,400;1,500;1,600;1,700;1,800;1,900&amp;display=swap&#34;);
html {
  font-family: -apple-system, BlinkMacSystemFont, &#39;Segoe UI&#39;, Roboto, Oxygen, Ubuntu, Cantarell, &#39;Helvetica Neue&#39;, &#39;Fira Sans&#39;, &#39;Droid Sans&#39;, Arial, sans-serif;
}

#ufxtnyhcon .gt_table {
  display: table;
  border-collapse: collapse;
  margin-left: auto;
  margin-right: auto;
  color: #333333;
  font-size: 16px;
  font-weight: normal;
  font-style: normal;
  background-color: #FFFFFF;
  width: auto;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #A8A8A8;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #A8A8A8;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
}

#ufxtnyhcon .gt_heading {
  background-color: #FFFFFF;
  text-align: center;
  border-bottom-color: #FFFFFF;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
}

#ufxtnyhcon .gt_title {
  color: #333333;
  font-size: 125%;
  font-weight: initial;
  padding-top: 4px;
  padding-bottom: 4px;
  border-bottom-color: #FFFFFF;
  border-bottom-width: 0;
}

#ufxtnyhcon .gt_subtitle {
  color: #333333;
  font-size: 85%;
  font-weight: initial;
  padding-top: 0;
  padding-bottom: 4px;
  border-top-color: #FFFFFF;
  border-top-width: 0;
}

#ufxtnyhcon .gt_bottom_border {
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
}

#ufxtnyhcon .gt_col_headings {
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
}

#ufxtnyhcon .gt_col_heading {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: normal;
  text-transform: inherit;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: bottom;
  padding-top: 5px;
  padding-bottom: 6px;
  padding-left: 5px;
  padding-right: 5px;
  overflow-x: hidden;
}

#ufxtnyhcon .gt_column_spanner_outer {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: normal;
  text-transform: inherit;
  padding-top: 0;
  padding-bottom: 0;
  padding-left: 4px;
  padding-right: 4px;
}

#ufxtnyhcon .gt_column_spanner_outer:first-child {
  padding-left: 0;
}

#ufxtnyhcon .gt_column_spanner_outer:last-child {
  padding-right: 0;
}

#ufxtnyhcon .gt_column_spanner {
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  vertical-align: bottom;
  padding-top: 5px;
  padding-bottom: 6px;
  overflow-x: hidden;
  display: inline-block;
  width: 100%;
}

#ufxtnyhcon .gt_group_heading {
  padding: 8px;
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  text-transform: inherit;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: middle;
}

#ufxtnyhcon .gt_empty_group_heading {
  padding: 0.5px;
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  vertical-align: middle;
}

#ufxtnyhcon .gt_from_md &gt; :first-child {
  margin-top: 0;
}

#ufxtnyhcon .gt_from_md &gt; :last-child {
  margin-bottom: 0;
}

#ufxtnyhcon .gt_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  margin: 10px;
  border-top-style: solid;
  border-top-width: 1px;
  border-top-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: middle;
  overflow-x: hidden;
}

#ufxtnyhcon .gt_stub {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  text-transform: inherit;
  border-right-style: solid;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
  padding-left: 12px;
}

#ufxtnyhcon .gt_summary_row {
  color: #333333;
  background-color: #FFFFFF;
  text-transform: inherit;
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
}

#ufxtnyhcon .gt_first_summary_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
}

#ufxtnyhcon .gt_grand_summary_row {
  color: #333333;
  background-color: #FFFFFF;
  text-transform: inherit;
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
}

#ufxtnyhcon .gt_first_grand_summary_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  border-top-style: double;
  border-top-width: 6px;
  border-top-color: #D3D3D3;
}

#ufxtnyhcon .gt_striped {
  background-color: rgba(128, 128, 128, 0.05);
}

#ufxtnyhcon .gt_table_body {
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
}

#ufxtnyhcon .gt_footnotes {
  color: #333333;
  background-color: #FFFFFF;
  border-bottom-style: none;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
}

#ufxtnyhcon .gt_footnote {
  margin: 0px;
  font-size: 90%;
  padding: 4px;
}

#ufxtnyhcon .gt_sourcenotes {
  color: #333333;
  background-color: #FFFFFF;
  border-bottom-style: none;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
}

#ufxtnyhcon .gt_sourcenote {
  font-size: 90%;
  padding: 4px;
}

#ufxtnyhcon .gt_left {
  text-align: left;
}

#ufxtnyhcon .gt_center {
  text-align: center;
}

#ufxtnyhcon .gt_right {
  text-align: right;
  font-variant-numeric: tabular-nums;
}

#ufxtnyhcon .gt_font_normal {
  font-weight: normal;
}

#ufxtnyhcon .gt_font_bold {
  font-weight: bold;
}

#ufxtnyhcon .gt_font_italic {
  font-style: italic;
}

#ufxtnyhcon .gt_super {
  font-size: 65%;
}

#ufxtnyhcon .gt_footnote_marks {
  font-style: italic;
  font-size: 65%;
}
&lt;/style&gt;
&lt;div id=&#34;ufxtnyhcon&#34; style=&#34;overflow-x:auto;overflow-y:auto;width:auto;height:auto;&#34;&gt;&lt;table class=&#34;gt_table&#34;&gt;
  &lt;thead class=&#34;gt_header&#34;&gt;
    &lt;tr&gt;
      &lt;th colspan=&#34;6&#34; class=&#34;gt_heading gt_title gt_font_normal&#34; style=&#34;font-family: Anton; font-size: xx-large; text-align: left;&#34;&gt;Most &lt;span style=&#34;color:#F2CB05&#34;&gt;Successful&lt;/span&gt; Dancing With the Stars &lt;i&gt;&#34;Professions&#34;&lt;/i&gt;&lt;/th&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th colspan=&#34;6&#34; class=&#34;gt_heading gt_subtitle gt_font_normal gt_bottom_border&#34; style&gt;&lt;span style = &#39;color: grey&#39;&gt;Covering Seasons 1 to 29 (excluding All-Star Season 15)&lt;/span&gt;&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;thead class=&#34;gt_col_headings&#34;&gt;
    &lt;tr&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_left&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;known_for&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;num_stars&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_right&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;pct_1st_place&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_right&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;pct_2nd_place&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_right&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;pct_3rd_place&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_right&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;pct_last_place&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody class=&#34;gt_table_body&#34;&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Athlete&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;79&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.13924051&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.10126582&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.06329114&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.10126582&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Musician&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;38&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07894737&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.10526316&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.15789474&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.13157895&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Actor/Actress&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;130&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07692308&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.09230769&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.06923077&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.04615385&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Reality TV Star&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;26&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07692308&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07692308&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07692308&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.03846154&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Model&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;14&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07142857&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07142857&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07142857&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Media Personality&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;21&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.04761905&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.04761905&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.04761905&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.23809524&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Businessperson&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;5&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.20000000&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Entertainer&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;5&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.20000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.40000000&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Other&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;9&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.44444444&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
  
  
&lt;/table&gt;&lt;/div&gt;
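&lt;p&gt;The same &lt;code&gt;style&lt;/code&gt;/&lt;code&gt;locations&lt;/code&gt; pairing extends beyond the title to other location helpers such as &lt;code&gt;cells_column_labels()&lt;/code&gt; or &lt;code&gt;cells_body()&lt;/code&gt;. As a sketch (not one of the steps in this post), bolding the column labels would look like:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Sketch only: bold the column labels using the same tab_style() pattern
g3 %&amp;gt;%
  tab_style(
    style = cell_text(weight = &amp;quot;bold&amp;quot;),
    locations = cells_column_labels(everything())
  )&lt;/code&gt;&lt;/pre&gt;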
&lt;/div&gt;
&lt;div id=&#34;step-4-add-styling-to-the-subtitles&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Step 4: Add Styling to the Subtitles&lt;/h3&gt;
&lt;p&gt;Similar to Step 3, this step applies formatting to the subtitle: left-aligned, x-large, and in the Caveat font.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;(g4 &amp;lt;- g3 %&amp;gt;% 
  tab_style(
    style = cell_text(
      font = google_font(&amp;quot;Caveat&amp;quot;),
      align = &amp;quot;left&amp;quot;, 
      size = &amp;quot;x-large&amp;quot;
    ),
    locations = cells_title(&amp;quot;subtitle&amp;quot;)
  ) 
)&lt;/code&gt;&lt;/pre&gt;
&lt;style&gt;@import url(&#34;https://fonts.googleapis.com/css2?family=Caveat:ital,wght@0,100;0,200;0,300;0,400;0,500;0,600;0,700;0,800;0,900;1,100;1,200;1,300;1,400;1,500;1,600;1,700;1,800;1,900&amp;display=swap&#34;);
@import url(&#34;https://fonts.googleapis.com/css2?family=Anton:ital,wght@0,100;0,200;0,300;0,400;0,500;0,600;0,700;0,800;0,900;1,100;1,200;1,300;1,400;1,500;1,600;1,700;1,800;1,900&amp;display=swap&#34;);
html {
  font-family: -apple-system, BlinkMacSystemFont, &#39;Segoe UI&#39;, Roboto, Oxygen, Ubuntu, Cantarell, &#39;Helvetica Neue&#39;, &#39;Fira Sans&#39;, &#39;Droid Sans&#39;, Arial, sans-serif;
}

#uzrspjphjw .gt_table {
  display: table;
  border-collapse: collapse;
  margin-left: auto;
  margin-right: auto;
  color: #333333;
  font-size: 16px;
  font-weight: normal;
  font-style: normal;
  background-color: #FFFFFF;
  width: auto;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #A8A8A8;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #A8A8A8;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
}

#uzrspjphjw .gt_heading {
  background-color: #FFFFFF;
  text-align: center;
  border-bottom-color: #FFFFFF;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
}

#uzrspjphjw .gt_title {
  color: #333333;
  font-size: 125%;
  font-weight: initial;
  padding-top: 4px;
  padding-bottom: 4px;
  border-bottom-color: #FFFFFF;
  border-bottom-width: 0;
}

#uzrspjphjw .gt_subtitle {
  color: #333333;
  font-size: 85%;
  font-weight: initial;
  padding-top: 0;
  padding-bottom: 4px;
  border-top-color: #FFFFFF;
  border-top-width: 0;
}

#uzrspjphjw .gt_bottom_border {
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
}

#uzrspjphjw .gt_col_headings {
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
}

#uzrspjphjw .gt_col_heading {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: normal;
  text-transform: inherit;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: bottom;
  padding-top: 5px;
  padding-bottom: 6px;
  padding-left: 5px;
  padding-right: 5px;
  overflow-x: hidden;
}

#uzrspjphjw .gt_column_spanner_outer {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: normal;
  text-transform: inherit;
  padding-top: 0;
  padding-bottom: 0;
  padding-left: 4px;
  padding-right: 4px;
}

#uzrspjphjw .gt_column_spanner_outer:first-child {
  padding-left: 0;
}

#uzrspjphjw .gt_column_spanner_outer:last-child {
  padding-right: 0;
}

#uzrspjphjw .gt_column_spanner {
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  vertical-align: bottom;
  padding-top: 5px;
  padding-bottom: 6px;
  overflow-x: hidden;
  display: inline-block;
  width: 100%;
}

#uzrspjphjw .gt_group_heading {
  padding: 8px;
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  text-transform: inherit;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: middle;
}

#uzrspjphjw .gt_empty_group_heading {
  padding: 0.5px;
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  vertical-align: middle;
}

#uzrspjphjw .gt_from_md &gt; :first-child {
  margin-top: 0;
}

#uzrspjphjw .gt_from_md &gt; :last-child {
  margin-bottom: 0;
}

#uzrspjphjw .gt_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  margin: 10px;
  border-top-style: solid;
  border-top-width: 1px;
  border-top-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: middle;
  overflow-x: hidden;
}

#uzrspjphjw .gt_stub {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  text-transform: inherit;
  border-right-style: solid;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
  padding-left: 12px;
}

#uzrspjphjw .gt_summary_row {
  color: #333333;
  background-color: #FFFFFF;
  text-transform: inherit;
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
}

#uzrspjphjw .gt_first_summary_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
}

#uzrspjphjw .gt_grand_summary_row {
  color: #333333;
  background-color: #FFFFFF;
  text-transform: inherit;
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
}

#uzrspjphjw .gt_first_grand_summary_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  border-top-style: double;
  border-top-width: 6px;
  border-top-color: #D3D3D3;
}

#uzrspjphjw .gt_striped {
  background-color: rgba(128, 128, 128, 0.05);
}

#uzrspjphjw .gt_table_body {
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
}

#uzrspjphjw .gt_footnotes {
  color: #333333;
  background-color: #FFFFFF;
  border-bottom-style: none;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
}

#uzrspjphjw .gt_footnote {
  margin: 0px;
  font-size: 90%;
  padding: 4px;
}

#uzrspjphjw .gt_sourcenotes {
  color: #333333;
  background-color: #FFFFFF;
  border-bottom-style: none;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
}

#uzrspjphjw .gt_sourcenote {
  font-size: 90%;
  padding: 4px;
}

#uzrspjphjw .gt_left {
  text-align: left;
}

#uzrspjphjw .gt_center {
  text-align: center;
}

#uzrspjphjw .gt_right {
  text-align: right;
  font-variant-numeric: tabular-nums;
}

#uzrspjphjw .gt_font_normal {
  font-weight: normal;
}

#uzrspjphjw .gt_font_bold {
  font-weight: bold;
}

#uzrspjphjw .gt_font_italic {
  font-style: italic;
}

#uzrspjphjw .gt_super {
  font-size: 65%;
}

#uzrspjphjw .gt_footnote_marks {
  font-style: italic;
  font-size: 65%;
}
&lt;/style&gt;
&lt;div id=&#34;uzrspjphjw&#34; style=&#34;overflow-x:auto;overflow-y:auto;width:auto;height:auto;&#34;&gt;&lt;table class=&#34;gt_table&#34;&gt;
  &lt;thead class=&#34;gt_header&#34;&gt;
    &lt;tr&gt;
      &lt;th colspan=&#34;6&#34; class=&#34;gt_heading gt_title gt_font_normal&#34; style=&#34;font-family: Anton; font-size: xx-large; text-align: left;&#34;&gt;Most &lt;span style=&#34;color:#F2CB05&#34;&gt;Successful&lt;/span&gt; Dancing With the Stars &lt;i&gt;&#34;Professions&#34;&lt;/i&gt;&lt;/th&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th colspan=&#34;6&#34; class=&#34;gt_heading gt_subtitle gt_font_normal gt_bottom_border&#34; style=&#34;font-family: Caveat; font-size: x-large; text-align: left;&#34;&gt;&lt;span style = &#39;color: grey&#39;&gt;Covering Seasons 1 to 29 (excluding All-Star Season 15)&lt;/span&gt;&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;thead class=&#34;gt_col_headings&#34;&gt;
    &lt;tr&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_left&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;known_for&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;num_stars&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_right&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;pct_1st_place&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_right&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;pct_2nd_place&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_right&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;pct_3rd_place&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_right&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;pct_last_place&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody class=&#34;gt_table_body&#34;&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Athlete&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;79&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.13924051&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.10126582&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.06329114&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.10126582&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Musician&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;38&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07894737&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.10526316&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.15789474&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.13157895&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Actor/Actress&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;130&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07692308&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.09230769&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.06923077&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.04615385&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Reality TV Star&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;26&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07692308&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07692308&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07692308&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.03846154&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Model&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;14&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07142857&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07142857&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07142857&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Media Personality&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;21&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.04761905&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.04761905&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.04761905&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.23809524&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Businessperson&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;5&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.20000000&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Entertainer&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;5&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.20000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.40000000&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Other&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;9&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.44444444&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
  
  
&lt;/table&gt;&lt;/div&gt;
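&lt;p&gt;If the goal were to change the font of the whole table rather than a single location, &lt;code&gt;gt&lt;/code&gt; also offers &lt;code&gt;opt_table_font()&lt;/code&gt;, which accepts the same &lt;code&gt;google_font()&lt;/code&gt; helper. A sketch (not used in this post):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Sketch only: set a Google Font for the entire table at once
g4 %&amp;gt;%
  opt_table_font(font = google_font(&amp;quot;Caveat&amp;quot;))&lt;/code&gt;&lt;/pre&gt;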
&lt;/div&gt;
&lt;div id=&#34;step-5-adding-a-spanner-column&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Step 5: Adding a Spanner Column&lt;/h3&gt;
&lt;p&gt;A &lt;em&gt;spanner column&lt;/em&gt; is a single column header merged across several columns. It is added with the &lt;code&gt;tab_spanner()&lt;/code&gt; function:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;(g5 &amp;lt;- g4 %&amp;gt;% 
  tab_spanner(
    label = &amp;quot;Distribution of Results&amp;quot;,
    columns = 3:6
  )
)&lt;/code&gt;&lt;/pre&gt;
&lt;style&gt;@import url(&#34;https://fonts.googleapis.com/css2?family=Caveat:ital,wght@0,100;0,200;0,300;0,400;0,500;0,600;0,700;0,800;0,900;1,100;1,200;1,300;1,400;1,500;1,600;1,700;1,800;1,900&amp;display=swap&#34;);
@import url(&#34;https://fonts.googleapis.com/css2?family=Anton:ital,wght@0,100;0,200;0,300;0,400;0,500;0,600;0,700;0,800;0,900;1,100;1,200;1,300;1,400;1,500;1,600;1,700;1,800;1,900&amp;display=swap&#34;);
html {
  font-family: -apple-system, BlinkMacSystemFont, &#39;Segoe UI&#39;, Roboto, Oxygen, Ubuntu, Cantarell, &#39;Helvetica Neue&#39;, &#39;Fira Sans&#39;, &#39;Droid Sans&#39;, Arial, sans-serif;
}

#obxnxsqdal .gt_table {
  display: table;
  border-collapse: collapse;
  margin-left: auto;
  margin-right: auto;
  color: #333333;
  font-size: 16px;
  font-weight: normal;
  font-style: normal;
  background-color: #FFFFFF;
  width: auto;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #A8A8A8;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #A8A8A8;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
}

#obxnxsqdal .gt_heading {
  background-color: #FFFFFF;
  text-align: center;
  border-bottom-color: #FFFFFF;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
}

#obxnxsqdal .gt_title {
  color: #333333;
  font-size: 125%;
  font-weight: initial;
  padding-top: 4px;
  padding-bottom: 4px;
  border-bottom-color: #FFFFFF;
  border-bottom-width: 0;
}

#obxnxsqdal .gt_subtitle {
  color: #333333;
  font-size: 85%;
  font-weight: initial;
  padding-top: 0;
  padding-bottom: 4px;
  border-top-color: #FFFFFF;
  border-top-width: 0;
}

#obxnxsqdal .gt_bottom_border {
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
}

#obxnxsqdal .gt_col_headings {
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
}

#obxnxsqdal .gt_col_heading {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: normal;
  text-transform: inherit;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: bottom;
  padding-top: 5px;
  padding-bottom: 6px;
  padding-left: 5px;
  padding-right: 5px;
  overflow-x: hidden;
}

#obxnxsqdal .gt_column_spanner_outer {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: normal;
  text-transform: inherit;
  padding-top: 0;
  padding-bottom: 0;
  padding-left: 4px;
  padding-right: 4px;
}

#obxnxsqdal .gt_column_spanner_outer:first-child {
  padding-left: 0;
}

#obxnxsqdal .gt_column_spanner_outer:last-child {
  padding-right: 0;
}

#obxnxsqdal .gt_column_spanner {
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  vertical-align: bottom;
  padding-top: 5px;
  padding-bottom: 6px;
  overflow-x: hidden;
  display: inline-block;
  width: 100%;
}

#obxnxsqdal .gt_group_heading {
  padding: 8px;
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  text-transform: inherit;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: middle;
}

#obxnxsqdal .gt_empty_group_heading {
  padding: 0.5px;
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  vertical-align: middle;
}

#obxnxsqdal .gt_from_md &gt; :first-child {
  margin-top: 0;
}

#obxnxsqdal .gt_from_md &gt; :last-child {
  margin-bottom: 0;
}

#obxnxsqdal .gt_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  margin: 10px;
  border-top-style: solid;
  border-top-width: 1px;
  border-top-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: middle;
  overflow-x: hidden;
}

#obxnxsqdal .gt_stub {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  text-transform: inherit;
  border-right-style: solid;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
  padding-left: 12px;
}

#obxnxsqdal .gt_summary_row {
  color: #333333;
  background-color: #FFFFFF;
  text-transform: inherit;
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
}

#obxnxsqdal .gt_first_summary_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
}

#obxnxsqdal .gt_grand_summary_row {
  color: #333333;
  background-color: #FFFFFF;
  text-transform: inherit;
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
}

#obxnxsqdal .gt_first_grand_summary_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  border-top-style: double;
  border-top-width: 6px;
  border-top-color: #D3D3D3;
}

#obxnxsqdal .gt_striped {
  background-color: rgba(128, 128, 128, 0.05);
}

#obxnxsqdal .gt_table_body {
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
}

#obxnxsqdal .gt_footnotes {
  color: #333333;
  background-color: #FFFFFF;
  border-bottom-style: none;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
}

#obxnxsqdal .gt_footnote {
  margin: 0px;
  font-size: 90%;
  padding: 4px;
}

#obxnxsqdal .gt_sourcenotes {
  color: #333333;
  background-color: #FFFFFF;
  border-bottom-style: none;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
}

#obxnxsqdal .gt_sourcenote {
  font-size: 90%;
  padding: 4px;
}

#obxnxsqdal .gt_left {
  text-align: left;
}

#obxnxsqdal .gt_center {
  text-align: center;
}

#obxnxsqdal .gt_right {
  text-align: right;
  font-variant-numeric: tabular-nums;
}

#obxnxsqdal .gt_font_normal {
  font-weight: normal;
}

#obxnxsqdal .gt_font_bold {
  font-weight: bold;
}

#obxnxsqdal .gt_font_italic {
  font-style: italic;
}

#obxnxsqdal .gt_super {
  font-size: 65%;
}

#obxnxsqdal .gt_footnote_marks {
  font-style: italic;
  font-size: 65%;
}
&lt;/style&gt;
&lt;div id=&#34;obxnxsqdal&#34; style=&#34;overflow-x:auto;overflow-y:auto;width:auto;height:auto;&#34;&gt;&lt;table class=&#34;gt_table&#34;&gt;
  &lt;thead class=&#34;gt_header&#34;&gt;
    &lt;tr&gt;
      &lt;th colspan=&#34;6&#34; class=&#34;gt_heading gt_title gt_font_normal&#34; style=&#34;font-family: Anton; font-size: xx-large; text-align: left;&#34;&gt;Most &lt;span style=&#34;color:#F2CB05&#34;&gt;Successful&lt;/span&gt; Dancing With the Stars &lt;i&gt;&#34;Professions&#34;&lt;/i&gt;&lt;/th&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th colspan=&#34;6&#34; class=&#34;gt_heading gt_subtitle gt_font_normal gt_bottom_border&#34; style=&#34;font-family: Caveat; font-size: x-large; text-align: left;&#34;&gt;&lt;span style = &#39;color: grey&#39;&gt;Covering Seasons 1 to 29 (excluding All-Star Season 15)&lt;/span&gt;&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;thead class=&#34;gt_col_headings&#34;&gt;
    &lt;tr&gt;
      &lt;th class=&#34;gt_col_heading gt_center gt_columns_bottom_border&#34; rowspan=&#34;2&#34; colspan=&#34;1&#34;&gt;known_for&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_center gt_columns_bottom_border&#34; rowspan=&#34;2&#34; colspan=&#34;1&#34;&gt;num_stars&lt;/th&gt;
      &lt;th class=&#34;gt_center gt_columns_top_border gt_column_spanner_outer&#34; rowspan=&#34;1&#34; colspan=&#34;4&#34;&gt;
        &lt;span class=&#34;gt_column_spanner&#34;&gt;Distribution of Results&lt;/span&gt;
      &lt;/th&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;pct_1st_place&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;pct_2nd_place&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;pct_3rd_place&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;pct_last_place&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody class=&#34;gt_table_body&#34;&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Athlete&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;79&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.13924051&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.10126582&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.06329114&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.10126582&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Musician&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;38&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07894737&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.10526316&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.15789474&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.13157895&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Actor/Actress&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;130&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07692308&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.09230769&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.06923077&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.04615385&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Reality TV Star&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;26&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07692308&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07692308&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07692308&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.03846154&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Model&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;14&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07142857&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07142857&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07142857&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Media Personality&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;21&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.04761905&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.04761905&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.04761905&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.23809524&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Businessperson&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;5&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.20000000&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Entertainer&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;5&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.20000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.40000000&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Other&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;9&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.44444444&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
  
  
&lt;/table&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;step-6-styling-the-spanner&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Step 6: Styling the Spanner&lt;/h3&gt;
&lt;p&gt;As with the title and subtitle, we can use &lt;code&gt;tab_style()&lt;/code&gt; to style the spanner, targeting it with the &lt;code&gt;cells_column_spanners()&lt;/code&gt; location helper.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;(g6 &amp;lt;- g5 %&amp;gt;% 
  tab_style(
    style = cell_text(
      font = google_font(&amp;quot;Courgette&amp;quot;), 
      size = &amp;quot;medium&amp;quot;, 
      weight = &amp;quot;bold&amp;quot;
    ),
    locations = cells_column_spanners(&amp;quot;Distribution of Results&amp;quot;)
  )
)&lt;/code&gt;&lt;/pre&gt;
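&lt;p&gt;&lt;code&gt;cell_text()&lt;/code&gt; accepts other properties beyond font, size, and weight. As a sketch (a variation, not part of the original table), the spanner could also reuse the accent color from the title via the &lt;code&gt;color&lt;/code&gt; argument:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Variation on g6: also color the spanner text (illustrative only)
g6_alt &amp;lt;- g5 %&amp;gt;% 
  tab_style(
    style = cell_text(
      font = google_font(&amp;quot;Courgette&amp;quot;), 
      size = &amp;quot;medium&amp;quot;, 
      weight = &amp;quot;bold&amp;quot;,
      color = &amp;quot;#F2CB05&amp;quot;  # same yellow used in the title
    ),
    locations = cells_column_spanners(&amp;quot;Distribution of Results&amp;quot;)
  )&lt;/code&gt;&lt;/pre&gt;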
&lt;style&gt;@import url(&#34;https://fonts.googleapis.com/css2?family=Courgette:ital,wght@0,100;0,200;0,300;0,400;0,500;0,600;0,700;0,800;0,900;1,100;1,200;1,300;1,400;1,500;1,600;1,700;1,800;1,900&amp;display=swap&#34;);
@import url(&#34;https://fonts.googleapis.com/css2?family=Caveat:ital,wght@0,100;0,200;0,300;0,400;0,500;0,600;0,700;0,800;0,900;1,100;1,200;1,300;1,400;1,500;1,600;1,700;1,800;1,900&amp;display=swap&#34;);
@import url(&#34;https://fonts.googleapis.com/css2?family=Anton:ital,wght@0,100;0,200;0,300;0,400;0,500;0,600;0,700;0,800;0,900;1,100;1,200;1,300;1,400;1,500;1,600;1,700;1,800;1,900&amp;display=swap&#34;);
html {
  font-family: -apple-system, BlinkMacSystemFont, &#39;Segoe UI&#39;, Roboto, Oxygen, Ubuntu, Cantarell, &#39;Helvetica Neue&#39;, &#39;Fira Sans&#39;, &#39;Droid Sans&#39;, Arial, sans-serif;
}

#iinlrtdxox .gt_table {
  display: table;
  border-collapse: collapse;
  margin-left: auto;
  margin-right: auto;
  color: #333333;
  font-size: 16px;
  font-weight: normal;
  font-style: normal;
  background-color: #FFFFFF;
  width: auto;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #A8A8A8;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #A8A8A8;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
}

#iinlrtdxox .gt_heading {
  background-color: #FFFFFF;
  text-align: center;
  border-bottom-color: #FFFFFF;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
}

#iinlrtdxox .gt_title {
  color: #333333;
  font-size: 125%;
  font-weight: initial;
  padding-top: 4px;
  padding-bottom: 4px;
  border-bottom-color: #FFFFFF;
  border-bottom-width: 0;
}

#iinlrtdxox .gt_subtitle {
  color: #333333;
  font-size: 85%;
  font-weight: initial;
  padding-top: 0;
  padding-bottom: 4px;
  border-top-color: #FFFFFF;
  border-top-width: 0;
}

#iinlrtdxox .gt_bottom_border {
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
}

#iinlrtdxox .gt_col_headings {
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
}

#iinlrtdxox .gt_col_heading {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: normal;
  text-transform: inherit;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: bottom;
  padding-top: 5px;
  padding-bottom: 6px;
  padding-left: 5px;
  padding-right: 5px;
  overflow-x: hidden;
}

#iinlrtdxox .gt_column_spanner_outer {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: normal;
  text-transform: inherit;
  padding-top: 0;
  padding-bottom: 0;
  padding-left: 4px;
  padding-right: 4px;
}

#iinlrtdxox .gt_column_spanner_outer:first-child {
  padding-left: 0;
}

#iinlrtdxox .gt_column_spanner_outer:last-child {
  padding-right: 0;
}

#iinlrtdxox .gt_column_spanner {
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  vertical-align: bottom;
  padding-top: 5px;
  padding-bottom: 6px;
  overflow-x: hidden;
  display: inline-block;
  width: 100%;
}

#iinlrtdxox .gt_group_heading {
  padding: 8px;
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  text-transform: inherit;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: middle;
}

#iinlrtdxox .gt_empty_group_heading {
  padding: 0.5px;
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  vertical-align: middle;
}

#iinlrtdxox .gt_from_md &gt; :first-child {
  margin-top: 0;
}

#iinlrtdxox .gt_from_md &gt; :last-child {
  margin-bottom: 0;
}

#iinlrtdxox .gt_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  margin: 10px;
  border-top-style: solid;
  border-top-width: 1px;
  border-top-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: middle;
  overflow-x: hidden;
}

#iinlrtdxox .gt_stub {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  text-transform: inherit;
  border-right-style: solid;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
  padding-left: 12px;
}

#iinlrtdxox .gt_summary_row {
  color: #333333;
  background-color: #FFFFFF;
  text-transform: inherit;
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
}

#iinlrtdxox .gt_first_summary_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
}

#iinlrtdxox .gt_grand_summary_row {
  color: #333333;
  background-color: #FFFFFF;
  text-transform: inherit;
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
}

#iinlrtdxox .gt_first_grand_summary_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  border-top-style: double;
  border-top-width: 6px;
  border-top-color: #D3D3D3;
}

#iinlrtdxox .gt_striped {
  background-color: rgba(128, 128, 128, 0.05);
}

#iinlrtdxox .gt_table_body {
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
}

#iinlrtdxox .gt_footnotes {
  color: #333333;
  background-color: #FFFFFF;
  border-bottom-style: none;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
}

#iinlrtdxox .gt_footnote {
  margin: 0px;
  font-size: 90%;
  padding: 4px;
}

#iinlrtdxox .gt_sourcenotes {
  color: #333333;
  background-color: #FFFFFF;
  border-bottom-style: none;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
}

#iinlrtdxox .gt_sourcenote {
  font-size: 90%;
  padding: 4px;
}

#iinlrtdxox .gt_left {
  text-align: left;
}

#iinlrtdxox .gt_center {
  text-align: center;
}

#iinlrtdxox .gt_right {
  text-align: right;
  font-variant-numeric: tabular-nums;
}

#iinlrtdxox .gt_font_normal {
  font-weight: normal;
}

#iinlrtdxox .gt_font_bold {
  font-weight: bold;
}

#iinlrtdxox .gt_font_italic {
  font-style: italic;
}

#iinlrtdxox .gt_super {
  font-size: 65%;
}

#iinlrtdxox .gt_footnote_marks {
  font-style: italic;
  font-size: 65%;
}
&lt;/style&gt;
&lt;div id=&#34;iinlrtdxox&#34; style=&#34;overflow-x:auto;overflow-y:auto;width:auto;height:auto;&#34;&gt;&lt;table class=&#34;gt_table&#34;&gt;
  &lt;thead class=&#34;gt_header&#34;&gt;
    &lt;tr&gt;
      &lt;th colspan=&#34;6&#34; class=&#34;gt_heading gt_title gt_font_normal&#34; style=&#34;font-family: Anton; font-size: xx-large; text-align: left;&#34;&gt;Most &lt;span style=&#34;color:#F2CB05&#34;&gt;Successful&lt;/span&gt; Dancing With the Stars &lt;i&gt;&#34;Professions&#34;&lt;/i&gt;&lt;/th&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th colspan=&#34;6&#34; class=&#34;gt_heading gt_subtitle gt_font_normal gt_bottom_border&#34; style=&#34;font-family: Caveat; font-size: x-large; text-align: left;&#34;&gt;&lt;span style = &#39;color: grey&#39;&gt;Covering Seasons 1 to 29 (excluding All-Star Season 15)&lt;/span&gt;&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;thead class=&#34;gt_col_headings&#34;&gt;
    &lt;tr&gt;
      &lt;th class=&#34;gt_col_heading gt_center gt_columns_bottom_border&#34; rowspan=&#34;2&#34; colspan=&#34;1&#34;&gt;known_for&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_center gt_columns_bottom_border&#34; rowspan=&#34;2&#34; colspan=&#34;1&#34;&gt;num_stars&lt;/th&gt;
      &lt;th class=&#34;gt_center gt_columns_top_border gt_column_spanner_outer&#34; rowspan=&#34;1&#34; colspan=&#34;4&#34; style=&#34;font-family: Courgette; font-size: medium; font-weight: bold;&#34;&gt;
        &lt;span class=&#34;gt_column_spanner&#34;&gt;Distribution of Results&lt;/span&gt;
      &lt;/th&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;pct_1st_place&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;pct_2nd_place&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;pct_3rd_place&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;pct_last_place&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody class=&#34;gt_table_body&#34;&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Athlete&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;79&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.13924051&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.10126582&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.06329114&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.10126582&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Musician&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;38&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07894737&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.10526316&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.15789474&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.13157895&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Actor/Actress&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;130&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07692308&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.09230769&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.06923077&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.04615385&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Reality TV Star&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;26&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07692308&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07692308&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07692308&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.03846154&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Model&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;14&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07142857&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07142857&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07142857&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Media Personality&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;21&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.04761905&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.04761905&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.04761905&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.23809524&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Businessperson&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;5&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.20000000&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Entertainer&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;5&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.20000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.40000000&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Other&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;9&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.44444444&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
  
  
&lt;/table&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;step-7-style-the-column-headers-and-the-profession-column&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Step 7: Style the Column Headers and the Profession Column&lt;/h3&gt;
&lt;p&gt;You can apply the same style to different parts of the table by using a &lt;code&gt;list()&lt;/code&gt; for the &lt;code&gt;locations&lt;/code&gt; argument. Here the style is being applied to all column labels (&lt;code&gt;cells_column_labels(everything())&lt;/code&gt;) and to the values in the first column (&lt;code&gt;cells_body(columns = 1)&lt;/code&gt;).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;(g7 &amp;lt;- g6 %&amp;gt;% 
  tab_style(
    style = cell_text(
      font = google_font(&amp;quot;Secular One&amp;quot;), 
      size = &amp;quot;large&amp;quot;
    ),
    locations = list(
      cells_column_labels(everything()), 
      cells_body(columns = 1)
    )
  )  
)&lt;/code&gt;&lt;/pre&gt;
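&lt;p&gt;The &lt;code&gt;style&lt;/code&gt; argument can likewise be a &lt;code&gt;list()&lt;/code&gt;, combining several style objects in a single call. A minimal sketch (an illustrative extension, not part of the original table):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Combine text and fill styles in one tab_style() call
g7_alt &amp;lt;- g7 %&amp;gt;% 
  tab_style(
    style = list(
      cell_text(weight = &amp;quot;bold&amp;quot;),
      cell_fill(color = &amp;quot;#F2CB05&amp;quot;, alpha = 0.2)
    ),
    locations = cells_body(columns = 1)
  )&lt;/code&gt;&lt;/pre&gt;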
&lt;style&gt;@import url(&#34;https://fonts.googleapis.com/css2?family=Secular+One:ital,wght@0,100;0,200;0,300;0,400;0,500;0,600;0,700;0,800;0,900;1,100;1,200;1,300;1,400;1,500;1,600;1,700;1,800;1,900&amp;display=swap&#34;);
@import url(&#34;https://fonts.googleapis.com/css2?family=Courgette:ital,wght@0,100;0,200;0,300;0,400;0,500;0,600;0,700;0,800;0,900;1,100;1,200;1,300;1,400;1,500;1,600;1,700;1,800;1,900&amp;display=swap&#34;);
@import url(&#34;https://fonts.googleapis.com/css2?family=Caveat:ital,wght@0,100;0,200;0,300;0,400;0,500;0,600;0,700;0,800;0,900;1,100;1,200;1,300;1,400;1,500;1,600;1,700;1,800;1,900&amp;display=swap&#34;);
@import url(&#34;https://fonts.googleapis.com/css2?family=Anton:ital,wght@0,100;0,200;0,300;0,400;0,500;0,600;0,700;0,800;0,900;1,100;1,200;1,300;1,400;1,500;1,600;1,700;1,800;1,900&amp;display=swap&#34;);
html {
  font-family: -apple-system, BlinkMacSystemFont, &#39;Segoe UI&#39;, Roboto, Oxygen, Ubuntu, Cantarell, &#39;Helvetica Neue&#39;, &#39;Fira Sans&#39;, &#39;Droid Sans&#39;, Arial, sans-serif;
}

#kapnldxvew .gt_table {
  display: table;
  border-collapse: collapse;
  margin-left: auto;
  margin-right: auto;
  color: #333333;
  font-size: 16px;
  font-weight: normal;
  font-style: normal;
  background-color: #FFFFFF;
  width: auto;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #A8A8A8;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #A8A8A8;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
}

#kapnldxvew .gt_heading {
  background-color: #FFFFFF;
  text-align: center;
  border-bottom-color: #FFFFFF;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
}

#kapnldxvew .gt_title {
  color: #333333;
  font-size: 125%;
  font-weight: initial;
  padding-top: 4px;
  padding-bottom: 4px;
  border-bottom-color: #FFFFFF;
  border-bottom-width: 0;
}

#kapnldxvew .gt_subtitle {
  color: #333333;
  font-size: 85%;
  font-weight: initial;
  padding-top: 0;
  padding-bottom: 4px;
  border-top-color: #FFFFFF;
  border-top-width: 0;
}

#kapnldxvew .gt_bottom_border {
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
}

#kapnldxvew .gt_col_headings {
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
}

#kapnldxvew .gt_col_heading {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: normal;
  text-transform: inherit;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: bottom;
  padding-top: 5px;
  padding-bottom: 6px;
  padding-left: 5px;
  padding-right: 5px;
  overflow-x: hidden;
}

#kapnldxvew .gt_column_spanner_outer {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: normal;
  text-transform: inherit;
  padding-top: 0;
  padding-bottom: 0;
  padding-left: 4px;
  padding-right: 4px;
}

#kapnldxvew .gt_column_spanner_outer:first-child {
  padding-left: 0;
}

#kapnldxvew .gt_column_spanner_outer:last-child {
  padding-right: 0;
}

#kapnldxvew .gt_column_spanner {
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  vertical-align: bottom;
  padding-top: 5px;
  padding-bottom: 6px;
  overflow-x: hidden;
  display: inline-block;
  width: 100%;
}

#kapnldxvew .gt_group_heading {
  padding: 8px;
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  text-transform: inherit;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: middle;
}

#kapnldxvew .gt_empty_group_heading {
  padding: 0.5px;
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  vertical-align: middle;
}

#kapnldxvew .gt_from_md &gt; :first-child {
  margin-top: 0;
}

#kapnldxvew .gt_from_md &gt; :last-child {
  margin-bottom: 0;
}

#kapnldxvew .gt_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  margin: 10px;
  border-top-style: solid;
  border-top-width: 1px;
  border-top-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: middle;
  overflow-x: hidden;
}

#kapnldxvew .gt_stub {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  text-transform: inherit;
  border-right-style: solid;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
  padding-left: 12px;
}

#kapnldxvew .gt_summary_row {
  color: #333333;
  background-color: #FFFFFF;
  text-transform: inherit;
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
}

#kapnldxvew .gt_first_summary_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
}

#kapnldxvew .gt_grand_summary_row {
  color: #333333;
  background-color: #FFFFFF;
  text-transform: inherit;
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
}

#kapnldxvew .gt_first_grand_summary_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  border-top-style: double;
  border-top-width: 6px;
  border-top-color: #D3D3D3;
}

#kapnldxvew .gt_striped {
  background-color: rgba(128, 128, 128, 0.05);
}

#kapnldxvew .gt_table_body {
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
}

#kapnldxvew .gt_footnotes {
  color: #333333;
  background-color: #FFFFFF;
  border-bottom-style: none;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
}

#kapnldxvew .gt_footnote {
  margin: 0px;
  font-size: 90%;
  padding: 4px;
}

#kapnldxvew .gt_sourcenotes {
  color: #333333;
  background-color: #FFFFFF;
  border-bottom-style: none;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
}

#kapnldxvew .gt_sourcenote {
  font-size: 90%;
  padding: 4px;
}

#kapnldxvew .gt_left {
  text-align: left;
}

#kapnldxvew .gt_center {
  text-align: center;
}

#kapnldxvew .gt_right {
  text-align: right;
  font-variant-numeric: tabular-nums;
}

#kapnldxvew .gt_font_normal {
  font-weight: normal;
}

#kapnldxvew .gt_font_bold {
  font-weight: bold;
}

#kapnldxvew .gt_font_italic {
  font-style: italic;
}

#kapnldxvew .gt_super {
  font-size: 65%;
}

#kapnldxvew .gt_footnote_marks {
  font-style: italic;
  font-size: 65%;
}
&lt;/style&gt;
&lt;div id=&#34;kapnldxvew&#34; style=&#34;overflow-x:auto;overflow-y:auto;width:auto;height:auto;&#34;&gt;&lt;table class=&#34;gt_table&#34;&gt;
  &lt;thead class=&#34;gt_header&#34;&gt;
    &lt;tr&gt;
      &lt;th colspan=&#34;6&#34; class=&#34;gt_heading gt_title gt_font_normal&#34; style=&#34;font-family: Anton; font-size: xx-large; text-align: left;&#34;&gt;Most &lt;span style=&#34;color:#F2CB05&#34;&gt;Successful&lt;/span&gt; Dancing With the Stars &lt;i&gt;&#34;Professions&#34;&lt;/i&gt;&lt;/th&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th colspan=&#34;6&#34; class=&#34;gt_heading gt_subtitle gt_font_normal gt_bottom_border&#34; style=&#34;font-family: Caveat; font-size: x-large; text-align: left;&#34;&gt;&lt;span style = &#39;color: grey&#39;&gt;Covering Seasons 1 to 29 (excluding All-Star Season 15)&lt;/span&gt;&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;thead class=&#34;gt_col_headings&#34;&gt;
    &lt;tr&gt;
      &lt;th class=&#34;gt_col_heading gt_center gt_columns_bottom_border&#34; rowspan=&#34;2&#34; colspan=&#34;1&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;known_for&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_center gt_columns_bottom_border&#34; rowspan=&#34;2&#34; colspan=&#34;1&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;num_stars&lt;/th&gt;
      &lt;th class=&#34;gt_center gt_columns_top_border gt_column_spanner_outer&#34; rowspan=&#34;1&#34; colspan=&#34;4&#34; style=&#34;font-family: Courgette; font-size: medium; font-weight: bold;&#34;&gt;
        &lt;span class=&#34;gt_column_spanner&#34;&gt;Distribution of Results&lt;/span&gt;
      &lt;/th&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;pct_1st_place&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;pct_2nd_place&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;pct_3rd_place&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;pct_last_place&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody class=&#34;gt_table_body&#34;&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Athlete&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;79&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.13924051&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.10126582&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.06329114&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.10126582&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Musician&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;38&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07894737&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.10526316&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.15789474&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.13157895&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Actor/Actress&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;130&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07692308&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.09230769&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.06923077&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.04615385&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Reality TV Star&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;26&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07692308&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07692308&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07692308&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.03846154&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Model&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;14&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07142857&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07142857&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07142857&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Media Personality&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;21&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.04761905&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.04761905&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.04761905&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.23809524&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Businessperson&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;5&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.20000000&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Entertainer&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;5&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.20000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.40000000&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Other&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;9&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.44444444&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
  
  
&lt;/table&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;step-8-styling-the-cells&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Step 8: Styling the cells&lt;/h3&gt;
&lt;p&gt;This step uses &lt;code&gt;tab_style()&lt;/code&gt; with a &lt;code&gt;cells_body()&lt;/code&gt; location to apply the Spartan font, a medium size, and center alignment to the body cells of the 2nd through 6th columns.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;(g8 &amp;lt;- g7 %&amp;gt;% 
  tab_style(
    style = cell_text(
      font = google_font(&amp;quot;Spartan&amp;quot;), 
      size = &amp;quot;medium&amp;quot;,
      align = &amp;#39;center&amp;#39;
    ),
    locations = cells_body(columns = 2:6)
  )
)&lt;/code&gt;&lt;/pre&gt;
&lt;style&gt;@import url(&#34;https://fonts.googleapis.com/css2?family=Spartan:ital,wght@0,100;0,200;0,300;0,400;0,500;0,600;0,700;0,800;0,900;1,100;1,200;1,300;1,400;1,500;1,600;1,700;1,800;1,900&amp;display=swap&#34;);
@import url(&#34;https://fonts.googleapis.com/css2?family=Secular+One:ital,wght@0,100;0,200;0,300;0,400;0,500;0,600;0,700;0,800;0,900;1,100;1,200;1,300;1,400;1,500;1,600;1,700;1,800;1,900&amp;display=swap&#34;);
@import url(&#34;https://fonts.googleapis.com/css2?family=Courgette:ital,wght@0,100;0,200;0,300;0,400;0,500;0,600;0,700;0,800;0,900;1,100;1,200;1,300;1,400;1,500;1,600;1,700;1,800;1,900&amp;display=swap&#34;);
@import url(&#34;https://fonts.googleapis.com/css2?family=Caveat:ital,wght@0,100;0,200;0,300;0,400;0,500;0,600;0,700;0,800;0,900;1,100;1,200;1,300;1,400;1,500;1,600;1,700;1,800;1,900&amp;display=swap&#34;);
@import url(&#34;https://fonts.googleapis.com/css2?family=Anton:ital,wght@0,100;0,200;0,300;0,400;0,500;0,600;0,700;0,800;0,900;1,100;1,200;1,300;1,400;1,500;1,600;1,700;1,800;1,900&amp;display=swap&#34;);
html {
  font-family: -apple-system, BlinkMacSystemFont, &#39;Segoe UI&#39;, Roboto, Oxygen, Ubuntu, Cantarell, &#39;Helvetica Neue&#39;, &#39;Fira Sans&#39;, &#39;Droid Sans&#39;, Arial, sans-serif;
}

#aogymjeevw .gt_table {
  display: table;
  border-collapse: collapse;
  margin-left: auto;
  margin-right: auto;
  color: #333333;
  font-size: 16px;
  font-weight: normal;
  font-style: normal;
  background-color: #FFFFFF;
  width: auto;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #A8A8A8;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #A8A8A8;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
}

#aogymjeevw .gt_heading {
  background-color: #FFFFFF;
  text-align: center;
  border-bottom-color: #FFFFFF;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
}

#aogymjeevw .gt_title {
  color: #333333;
  font-size: 125%;
  font-weight: initial;
  padding-top: 4px;
  padding-bottom: 4px;
  border-bottom-color: #FFFFFF;
  border-bottom-width: 0;
}

#aogymjeevw .gt_subtitle {
  color: #333333;
  font-size: 85%;
  font-weight: initial;
  padding-top: 0;
  padding-bottom: 4px;
  border-top-color: #FFFFFF;
  border-top-width: 0;
}

#aogymjeevw .gt_bottom_border {
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
}

#aogymjeevw .gt_col_headings {
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
}

#aogymjeevw .gt_col_heading {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: normal;
  text-transform: inherit;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: bottom;
  padding-top: 5px;
  padding-bottom: 6px;
  padding-left: 5px;
  padding-right: 5px;
  overflow-x: hidden;
}

#aogymjeevw .gt_column_spanner_outer {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: normal;
  text-transform: inherit;
  padding-top: 0;
  padding-bottom: 0;
  padding-left: 4px;
  padding-right: 4px;
}

#aogymjeevw .gt_column_spanner_outer:first-child {
  padding-left: 0;
}

#aogymjeevw .gt_column_spanner_outer:last-child {
  padding-right: 0;
}

#aogymjeevw .gt_column_spanner {
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  vertical-align: bottom;
  padding-top: 5px;
  padding-bottom: 6px;
  overflow-x: hidden;
  display: inline-block;
  width: 100%;
}

#aogymjeevw .gt_group_heading {
  padding: 8px;
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  text-transform: inherit;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: middle;
}

#aogymjeevw .gt_empty_group_heading {
  padding: 0.5px;
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  vertical-align: middle;
}

#aogymjeevw .gt_from_md &gt; :first-child {
  margin-top: 0;
}

#aogymjeevw .gt_from_md &gt; :last-child {
  margin-bottom: 0;
}

#aogymjeevw .gt_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  margin: 10px;
  border-top-style: solid;
  border-top-width: 1px;
  border-top-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: middle;
  overflow-x: hidden;
}

#aogymjeevw .gt_stub {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  text-transform: inherit;
  border-right-style: solid;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
  padding-left: 12px;
}

#aogymjeevw .gt_summary_row {
  color: #333333;
  background-color: #FFFFFF;
  text-transform: inherit;
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
}

#aogymjeevw .gt_first_summary_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
}

#aogymjeevw .gt_grand_summary_row {
  color: #333333;
  background-color: #FFFFFF;
  text-transform: inherit;
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
}

#aogymjeevw .gt_first_grand_summary_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  border-top-style: double;
  border-top-width: 6px;
  border-top-color: #D3D3D3;
}

#aogymjeevw .gt_striped {
  background-color: rgba(128, 128, 128, 0.05);
}

#aogymjeevw .gt_table_body {
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
}

#aogymjeevw .gt_footnotes {
  color: #333333;
  background-color: #FFFFFF;
  border-bottom-style: none;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
}

#aogymjeevw .gt_footnote {
  margin: 0px;
  font-size: 90%;
  padding: 4px;
}

#aogymjeevw .gt_sourcenotes {
  color: #333333;
  background-color: #FFFFFF;
  border-bottom-style: none;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
}

#aogymjeevw .gt_sourcenote {
  font-size: 90%;
  padding: 4px;
}

#aogymjeevw .gt_left {
  text-align: left;
}

#aogymjeevw .gt_center {
  text-align: center;
}

#aogymjeevw .gt_right {
  text-align: right;
  font-variant-numeric: tabular-nums;
}

#aogymjeevw .gt_font_normal {
  font-weight: normal;
}

#aogymjeevw .gt_font_bold {
  font-weight: bold;
}

#aogymjeevw .gt_font_italic {
  font-style: italic;
}

#aogymjeevw .gt_super {
  font-size: 65%;
}

#aogymjeevw .gt_footnote_marks {
  font-style: italic;
  font-size: 65%;
}
&lt;/style&gt;
&lt;div id=&#34;aogymjeevw&#34; style=&#34;overflow-x:auto;overflow-y:auto;width:auto;height:auto;&#34;&gt;&lt;table class=&#34;gt_table&#34;&gt;
  &lt;thead class=&#34;gt_header&#34;&gt;
    &lt;tr&gt;
      &lt;th colspan=&#34;6&#34; class=&#34;gt_heading gt_title gt_font_normal&#34; style=&#34;font-family: Anton; font-size: xx-large; text-align: left;&#34;&gt;Most &lt;span style=&#34;color:#F2CB05&#34;&gt;Successful&lt;/span&gt; Dancing With the Stars &lt;i&gt;&#34;Professions&#34;&lt;/i&gt;&lt;/th&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th colspan=&#34;6&#34; class=&#34;gt_heading gt_subtitle gt_font_normal gt_bottom_border&#34; style=&#34;font-family: Caveat; font-size: x-large; text-align: left;&#34;&gt;&lt;span style = &#39;color: grey&#39;&gt;Covering Seasons 1 to 29 (excluding All-Star Season 15)&lt;/span&gt;&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;thead class=&#34;gt_col_headings&#34;&gt;
    &lt;tr&gt;
      &lt;th class=&#34;gt_col_heading gt_center gt_columns_bottom_border&#34; rowspan=&#34;2&#34; colspan=&#34;1&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;known_for&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_center gt_columns_bottom_border&#34; rowspan=&#34;2&#34; colspan=&#34;1&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;num_stars&lt;/th&gt;
      &lt;th class=&#34;gt_center gt_columns_top_border gt_column_spanner_outer&#34; rowspan=&#34;1&#34; colspan=&#34;4&#34; style=&#34;font-family: Courgette; font-size: medium; font-weight: bold;&#34;&gt;
        &lt;span class=&#34;gt_column_spanner&#34;&gt;Distribution of Results&lt;/span&gt;
      &lt;/th&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;pct_1st_place&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;pct_2nd_place&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;pct_3rd_place&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;pct_last_place&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody class=&#34;gt_table_body&#34;&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Athlete&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;79&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0.13924051&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0.10126582&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0.06329114&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0.10126582&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Musician&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;38&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0.07894737&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0.10526316&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0.15789474&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0.13157895&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Actor/Actress&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;130&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0.07692308&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0.09230769&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0.06923077&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0.04615385&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Reality TV Star&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;26&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0.07692308&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0.07692308&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0.07692308&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0.03846154&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Model&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;14&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0.07142857&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0.07142857&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0.07142857&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Media Personality&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;21&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0.04761905&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0.04761905&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0.04761905&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0.23809524&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Businessperson&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;5&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0.20000000&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Entertainer&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;5&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0.20000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0.40000000&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Other&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;9&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0.44444444&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0.00000000&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
  
  
&lt;/table&gt;&lt;/div&gt;
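&lt;p&gt;A side note on &lt;code&gt;cells_body()&lt;/code&gt;: it also accepts a &lt;code&gt;rows&lt;/code&gt; argument, so the same pattern can style a subset of rows rather than whole columns. The snippet below is a hypothetical sketch, not a step in building this table:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Hypothetical: bold any row whose 1st-place rate exceeds 10%
g8 %&amp;gt;%
  tab_style(
    style = cell_text(weight = &amp;#39;bold&amp;#39;),
    locations = cells_body(rows = pct_1st_place &amp;gt; 0.1)
  )&lt;/code&gt;&lt;/pre&gt;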
&lt;/div&gt;
&lt;div id=&#34;step-9-turn-cell-decimals-to-percentages&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Step 9: Turn Cell Decimals to Percentages&lt;/h3&gt;
&lt;p&gt;There are a number of &lt;code&gt;fmt_*&lt;/code&gt; functions for formatting cell values. Here the &lt;code&gt;fmt_percent&lt;/code&gt; function applies a percent format to all the columns beginning with “pct_”. While this is the first instance of using tidyselect syntax to tell &lt;code&gt;gt&lt;/code&gt; which columns to use, the package also accepts column names or column numbers.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;(g9 &amp;lt;- g8 %&amp;gt;% 
  fmt_percent(
    columns = starts_with(&amp;#39;pct&amp;#39;),
    decimals = 1,
    drop_trailing_zeros = TRUE
  )
 )&lt;/code&gt;&lt;/pre&gt;
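&lt;p&gt;As a quick illustration of those other selection styles, the hypothetical sketch below (not run against this table) shows three equivalent ways of pointing &lt;code&gt;fmt_percent()&lt;/code&gt; at the same four columns:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# 1. tidyselect helper (as used above)
g8 %&amp;gt;% fmt_percent(columns = starts_with(&amp;#39;pct&amp;#39;), decimals = 1)

# 2. explicit column names
g8 %&amp;gt;% fmt_percent(columns = c(pct_1st_place, pct_2nd_place,
                               pct_3rd_place, pct_last_place), decimals = 1)

# 3. column positions (columns 3 through 6 in this table)
g8 %&amp;gt;% fmt_percent(columns = 3:6, decimals = 1)&lt;/code&gt;&lt;/pre&gt;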
&lt;style&gt;@import url(&#34;https://fonts.googleapis.com/css2?family=Spartan:ital,wght@0,100;0,200;0,300;0,400;0,500;0,600;0,700;0,800;0,900;1,100;1,200;1,300;1,400;1,500;1,600;1,700;1,800;1,900&amp;display=swap&#34;);
@import url(&#34;https://fonts.googleapis.com/css2?family=Secular+One:ital,wght@0,100;0,200;0,300;0,400;0,500;0,600;0,700;0,800;0,900;1,100;1,200;1,300;1,400;1,500;1,600;1,700;1,800;1,900&amp;display=swap&#34;);
@import url(&#34;https://fonts.googleapis.com/css2?family=Courgette:ital,wght@0,100;0,200;0,300;0,400;0,500;0,600;0,700;0,800;0,900;1,100;1,200;1,300;1,400;1,500;1,600;1,700;1,800;1,900&amp;display=swap&#34;);
@import url(&#34;https://fonts.googleapis.com/css2?family=Caveat:ital,wght@0,100;0,200;0,300;0,400;0,500;0,600;0,700;0,800;0,900;1,100;1,200;1,300;1,400;1,500;1,600;1,700;1,800;1,900&amp;display=swap&#34;);
@import url(&#34;https://fonts.googleapis.com/css2?family=Anton:ital,wght@0,100;0,200;0,300;0,400;0,500;0,600;0,700;0,800;0,900;1,100;1,200;1,300;1,400;1,500;1,600;1,700;1,800;1,900&amp;display=swap&#34;);
html {
  font-family: -apple-system, BlinkMacSystemFont, &#39;Segoe UI&#39;, Roboto, Oxygen, Ubuntu, Cantarell, &#39;Helvetica Neue&#39;, &#39;Fira Sans&#39;, &#39;Droid Sans&#39;, Arial, sans-serif;
}

#wkglpgwfey .gt_table {
  display: table;
  border-collapse: collapse;
  margin-left: auto;
  margin-right: auto;
  color: #333333;
  font-size: 16px;
  font-weight: normal;
  font-style: normal;
  background-color: #FFFFFF;
  width: auto;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #A8A8A8;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #A8A8A8;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
}

#wkglpgwfey .gt_heading {
  background-color: #FFFFFF;
  text-align: center;
  border-bottom-color: #FFFFFF;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
}

#wkglpgwfey .gt_title {
  color: #333333;
  font-size: 125%;
  font-weight: initial;
  padding-top: 4px;
  padding-bottom: 4px;
  border-bottom-color: #FFFFFF;
  border-bottom-width: 0;
}

#wkglpgwfey .gt_subtitle {
  color: #333333;
  font-size: 85%;
  font-weight: initial;
  padding-top: 0;
  padding-bottom: 4px;
  border-top-color: #FFFFFF;
  border-top-width: 0;
}

#wkglpgwfey .gt_bottom_border {
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
}

#wkglpgwfey .gt_col_headings {
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
}

#wkglpgwfey .gt_col_heading {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: normal;
  text-transform: inherit;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: bottom;
  padding-top: 5px;
  padding-bottom: 6px;
  padding-left: 5px;
  padding-right: 5px;
  overflow-x: hidden;
}

#wkglpgwfey .gt_column_spanner_outer {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: normal;
  text-transform: inherit;
  padding-top: 0;
  padding-bottom: 0;
  padding-left: 4px;
  padding-right: 4px;
}

#wkglpgwfey .gt_column_spanner_outer:first-child {
  padding-left: 0;
}

#wkglpgwfey .gt_column_spanner_outer:last-child {
  padding-right: 0;
}

#wkglpgwfey .gt_column_spanner {
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  vertical-align: bottom;
  padding-top: 5px;
  padding-bottom: 6px;
  overflow-x: hidden;
  display: inline-block;
  width: 100%;
}

#wkglpgwfey .gt_group_heading {
  padding: 8px;
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  text-transform: inherit;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: middle;
}

#wkglpgwfey .gt_empty_group_heading {
  padding: 0.5px;
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  vertical-align: middle;
}

#wkglpgwfey .gt_from_md &gt; :first-child {
  margin-top: 0;
}

#wkglpgwfey .gt_from_md &gt; :last-child {
  margin-bottom: 0;
}

#wkglpgwfey .gt_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  margin: 10px;
  border-top-style: solid;
  border-top-width: 1px;
  border-top-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: middle;
  overflow-x: hidden;
}

#wkglpgwfey .gt_stub {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  text-transform: inherit;
  border-right-style: solid;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
  padding-left: 12px;
}

#wkglpgwfey .gt_summary_row {
  color: #333333;
  background-color: #FFFFFF;
  text-transform: inherit;
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
}

#wkglpgwfey .gt_first_summary_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
}

#wkglpgwfey .gt_grand_summary_row {
  color: #333333;
  background-color: #FFFFFF;
  text-transform: inherit;
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
}

#wkglpgwfey .gt_first_grand_summary_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  border-top-style: double;
  border-top-width: 6px;
  border-top-color: #D3D3D3;
}

#wkglpgwfey .gt_striped {
  background-color: rgba(128, 128, 128, 0.05);
}

#wkglpgwfey .gt_table_body {
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
}

#wkglpgwfey .gt_footnotes {
  color: #333333;
  background-color: #FFFFFF;
  border-bottom-style: none;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
}

#wkglpgwfey .gt_footnote {
  margin: 0px;
  font-size: 90%;
  padding: 4px;
}

#wkglpgwfey .gt_sourcenotes {
  color: #333333;
  background-color: #FFFFFF;
  border-bottom-style: none;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
}

#wkglpgwfey .gt_sourcenote {
  font-size: 90%;
  padding: 4px;
}

#wkglpgwfey .gt_left {
  text-align: left;
}

#wkglpgwfey .gt_center {
  text-align: center;
}

#wkglpgwfey .gt_right {
  text-align: right;
  font-variant-numeric: tabular-nums;
}

#wkglpgwfey .gt_font_normal {
  font-weight: normal;
}

#wkglpgwfey .gt_font_bold {
  font-weight: bold;
}

#wkglpgwfey .gt_font_italic {
  font-style: italic;
}

#wkglpgwfey .gt_super {
  font-size: 65%;
}

#wkglpgwfey .gt_footnote_marks {
  font-style: italic;
  font-size: 65%;
}
&lt;/style&gt;
&lt;div id=&#34;wkglpgwfey&#34; style=&#34;overflow-x:auto;overflow-y:auto;width:auto;height:auto;&#34;&gt;&lt;table class=&#34;gt_table&#34;&gt;
  &lt;thead class=&#34;gt_header&#34;&gt;
    &lt;tr&gt;
      &lt;th colspan=&#34;6&#34; class=&#34;gt_heading gt_title gt_font_normal&#34; style=&#34;font-family: Anton; font-size: xx-large; text-align: left;&#34;&gt;Most &lt;span style=&#34;color:#F2CB05&#34;&gt;Successful&lt;/span&gt; Dancing With the Stars &lt;i&gt;&#34;Professions&#34;&lt;/i&gt;&lt;/th&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th colspan=&#34;6&#34; class=&#34;gt_heading gt_subtitle gt_font_normal gt_bottom_border&#34; style=&#34;font-family: Caveat; font-size: x-large; text-align: left;&#34;&gt;&lt;span style = &#39;color: grey&#39;&gt;Covering Seasons 1 to 29 (excluding All-Star Season 15)&lt;/span&gt;&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;thead class=&#34;gt_col_headings&#34;&gt;
    &lt;tr&gt;
      &lt;th class=&#34;gt_col_heading gt_center gt_columns_bottom_border&#34; rowspan=&#34;2&#34; colspan=&#34;1&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;known_for&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_center gt_columns_bottom_border&#34; rowspan=&#34;2&#34; colspan=&#34;1&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;num_stars&lt;/th&gt;
      &lt;th class=&#34;gt_center gt_columns_top_border gt_column_spanner_outer&#34; rowspan=&#34;1&#34; colspan=&#34;4&#34; style=&#34;font-family: Courgette; font-size: medium; font-weight: bold;&#34;&gt;
        &lt;span class=&#34;gt_column_spanner&#34;&gt;Distribution of Results&lt;/span&gt;
      &lt;/th&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;pct_1st_place&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;pct_2nd_place&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;pct_3rd_place&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;pct_last_place&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody class=&#34;gt_table_body&#34;&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Athlete&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;79&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;13.9&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;10.1&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;6.3&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;10.1&amp;percnt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Musician&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;38&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;7.9&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;10.5&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;15.8&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;13.2&amp;percnt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Actor/Actress&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;130&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;7.7&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;9.2&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;6.9&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;4.6&amp;percnt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Reality TV Star&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;26&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;7.7&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;7.7&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;7.7&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;3.8&amp;percnt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Model&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;14&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;7.1&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;7.1&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;7.1&amp;percnt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Media Personality&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;21&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;4.8&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;4.8&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;4.8&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;23.8&amp;percnt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Businessperson&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;5&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;20&amp;percnt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Entertainer&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;5&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;20&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;40&amp;percnt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Other&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;9&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;44.4&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0&amp;percnt;&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
  
  
&lt;/table&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;step-10-have-some-fun-by-turning-column-headers-into-emojis&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Step 10: Have some fun by turning column headers into emojis&lt;/h3&gt;
&lt;p&gt;Like other markdown text in R, &lt;code&gt;gt&lt;/code&gt; also supports emojis! Here we add medals for 1st, 2nd, and 3rd place… and a personal favorite emoji to represent last place. Emojis can be added into markdown through the &lt;code&gt;emo::ji()&lt;/code&gt; function.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;(g10 &amp;lt;- g9 %&amp;gt;% 
  cols_label(
    known_for = &amp;quot;&amp;quot;,
    num_stars = paste0(&amp;quot;# &amp;quot;,emo::ji(&amp;#39;star&amp;#39;), &amp;quot;s&amp;quot;),
    pct_1st_place = paste0(emo::ji(&amp;quot;1st_place_medal&amp;quot;), &amp;quot;(1st)&amp;quot;),
    pct_2nd_place = paste0(emo::ji(&amp;quot;2nd_place_medal&amp;quot;), &amp;quot;(2nd)&amp;quot;),
    pct_3rd_place = paste0(emo::ji(&amp;quot;3rd_place_medal&amp;quot;), &amp;quot;(3rd)&amp;quot;),
    pct_last_place = paste0(emo::ji(&amp;quot;poo&amp;quot;), &amp;quot; (last)&amp;quot;)
  )
)&lt;/code&gt;&lt;/pre&gt;
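&lt;p&gt;For context, &lt;code&gt;emo::ji()&lt;/code&gt; simply looks an emoji up by name and returns it as a UTF-8 character string, which is why it can be pasted into any label. A small standalone sketch (note that &lt;code&gt;emo&lt;/code&gt; is not on CRAN and must be installed from GitHub):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Install once from GitHub:
# remotes::install_github(&amp;#39;hadley/emo&amp;#39;)

emo::ji(&amp;#39;star&amp;#39;)                            # the star emoji as a character string
paste0(emo::ji(&amp;#39;1st_place_medal&amp;#39;), &amp;#39; (1st)&amp;#39;) # composable with paste0() for labels&lt;/code&gt;&lt;/pre&gt;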
&lt;style&gt;@import url(&#34;https://fonts.googleapis.com/css2?family=Spartan:ital,wght@0,100;0,200;0,300;0,400;0,500;0,600;0,700;0,800;0,900;1,100;1,200;1,300;1,400;1,500;1,600;1,700;1,800;1,900&amp;display=swap&#34;);
@import url(&#34;https://fonts.googleapis.com/css2?family=Secular+One:ital,wght@0,100;0,200;0,300;0,400;0,500;0,600;0,700;0,800;0,900;1,100;1,200;1,300;1,400;1,500;1,600;1,700;1,800;1,900&amp;display=swap&#34;);
@import url(&#34;https://fonts.googleapis.com/css2?family=Courgette:ital,wght@0,100;0,200;0,300;0,400;0,500;0,600;0,700;0,800;0,900;1,100;1,200;1,300;1,400;1,500;1,600;1,700;1,800;1,900&amp;display=swap&#34;);
@import url(&#34;https://fonts.googleapis.com/css2?family=Caveat:ital,wght@0,100;0,200;0,300;0,400;0,500;0,600;0,700;0,800;0,900;1,100;1,200;1,300;1,400;1,500;1,600;1,700;1,800;1,900&amp;display=swap&#34;);
@import url(&#34;https://fonts.googleapis.com/css2?family=Anton:ital,wght@0,100;0,200;0,300;0,400;0,500;0,600;0,700;0,800;0,900;1,100;1,200;1,300;1,400;1,500;1,600;1,700;1,800;1,900&amp;display=swap&#34;);
html {
  font-family: -apple-system, BlinkMacSystemFont, &#39;Segoe UI&#39;, Roboto, Oxygen, Ubuntu, Cantarell, &#39;Helvetica Neue&#39;, &#39;Fira Sans&#39;, &#39;Droid Sans&#39;, Arial, sans-serif;
}

#nzxqjbqoqh .gt_table {
  display: table;
  border-collapse: collapse;
  margin-left: auto;
  margin-right: auto;
  color: #333333;
  font-size: 16px;
  font-weight: normal;
  font-style: normal;
  background-color: #FFFFFF;
  width: auto;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #A8A8A8;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #A8A8A8;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
}

#nzxqjbqoqh .gt_heading {
  background-color: #FFFFFF;
  text-align: center;
  border-bottom-color: #FFFFFF;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
}

#nzxqjbqoqh .gt_title {
  color: #333333;
  font-size: 125%;
  font-weight: initial;
  padding-top: 4px;
  padding-bottom: 4px;
  border-bottom-color: #FFFFFF;
  border-bottom-width: 0;
}

#nzxqjbqoqh .gt_subtitle {
  color: #333333;
  font-size: 85%;
  font-weight: initial;
  padding-top: 0;
  padding-bottom: 4px;
  border-top-color: #FFFFFF;
  border-top-width: 0;
}

#nzxqjbqoqh .gt_bottom_border {
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
}

#nzxqjbqoqh .gt_col_headings {
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
}

#nzxqjbqoqh .gt_col_heading {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: normal;
  text-transform: inherit;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: bottom;
  padding-top: 5px;
  padding-bottom: 6px;
  padding-left: 5px;
  padding-right: 5px;
  overflow-x: hidden;
}

#nzxqjbqoqh .gt_column_spanner_outer {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: normal;
  text-transform: inherit;
  padding-top: 0;
  padding-bottom: 0;
  padding-left: 4px;
  padding-right: 4px;
}

#nzxqjbqoqh .gt_column_spanner_outer:first-child {
  padding-left: 0;
}

#nzxqjbqoqh .gt_column_spanner_outer:last-child {
  padding-right: 0;
}

#nzxqjbqoqh .gt_column_spanner {
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  vertical-align: bottom;
  padding-top: 5px;
  padding-bottom: 6px;
  overflow-x: hidden;
  display: inline-block;
  width: 100%;
}

#nzxqjbqoqh .gt_group_heading {
  padding: 8px;
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  text-transform: inherit;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: middle;
}

#nzxqjbqoqh .gt_empty_group_heading {
  padding: 0.5px;
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  vertical-align: middle;
}

#nzxqjbqoqh .gt_from_md &gt; :first-child {
  margin-top: 0;
}

#nzxqjbqoqh .gt_from_md &gt; :last-child {
  margin-bottom: 0;
}

#nzxqjbqoqh .gt_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  margin: 10px;
  border-top-style: solid;
  border-top-width: 1px;
  border-top-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: middle;
  overflow-x: hidden;
}

#nzxqjbqoqh .gt_stub {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  text-transform: inherit;
  border-right-style: solid;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
  padding-left: 12px;
}

#nzxqjbqoqh .gt_summary_row {
  color: #333333;
  background-color: #FFFFFF;
  text-transform: inherit;
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
}

#nzxqjbqoqh .gt_first_summary_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
}

#nzxqjbqoqh .gt_grand_summary_row {
  color: #333333;
  background-color: #FFFFFF;
  text-transform: inherit;
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
}

#nzxqjbqoqh .gt_first_grand_summary_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  border-top-style: double;
  border-top-width: 6px;
  border-top-color: #D3D3D3;
}

#nzxqjbqoqh .gt_striped {
  background-color: rgba(128, 128, 128, 0.05);
}

#nzxqjbqoqh .gt_table_body {
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
}

#nzxqjbqoqh .gt_footnotes {
  color: #333333;
  background-color: #FFFFFF;
  border-bottom-style: none;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
}

#nzxqjbqoqh .gt_footnote {
  margin: 0px;
  font-size: 90%;
  padding: 4px;
}

#nzxqjbqoqh .gt_sourcenotes {
  color: #333333;
  background-color: #FFFFFF;
  border-bottom-style: none;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
}

#nzxqjbqoqh .gt_sourcenote {
  font-size: 90%;
  padding: 4px;
}

#nzxqjbqoqh .gt_left {
  text-align: left;
}

#nzxqjbqoqh .gt_center {
  text-align: center;
}

#nzxqjbqoqh .gt_right {
  text-align: right;
  font-variant-numeric: tabular-nums;
}

#nzxqjbqoqh .gt_font_normal {
  font-weight: normal;
}

#nzxqjbqoqh .gt_font_bold {
  font-weight: bold;
}

#nzxqjbqoqh .gt_font_italic {
  font-style: italic;
}

#nzxqjbqoqh .gt_super {
  font-size: 65%;
}

#nzxqjbqoqh .gt_footnote_marks {
  font-style: italic;
  font-size: 65%;
}
&lt;/style&gt;
&lt;div id=&#34;nzxqjbqoqh&#34; style=&#34;overflow-x:auto;overflow-y:auto;width:auto;height:auto;&#34;&gt;&lt;table class=&#34;gt_table&#34;&gt;
  &lt;thead class=&#34;gt_header&#34;&gt;
    &lt;tr&gt;
      &lt;th colspan=&#34;6&#34; class=&#34;gt_heading gt_title gt_font_normal&#34; style=&#34;font-family: Anton; font-size: xx-large; text-align: left;&#34;&gt;Most &lt;span style=&#34;color:#F2CB05&#34;&gt;Successful&lt;/span&gt; Dancing With the Stars &lt;i&gt;&#34;Professions&#34;&lt;/i&gt;&lt;/th&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th colspan=&#34;6&#34; class=&#34;gt_heading gt_subtitle gt_font_normal gt_bottom_border&#34; style=&#34;font-family: Caveat; font-size: x-large; text-align: left;&#34;&gt;&lt;span style = &#39;color: grey&#39;&gt;Covering Seasons 1 to 29 (excluding All-Star Season 15)&lt;/span&gt;&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;thead class=&#34;gt_col_headings&#34;&gt;
    &lt;tr&gt;
      &lt;th class=&#34;gt_col_heading gt_center gt_columns_bottom_border&#34; rowspan=&#34;2&#34; colspan=&#34;1&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_center gt_columns_bottom_border&#34; rowspan=&#34;2&#34; colspan=&#34;1&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;# ⭐s&lt;/th&gt;
      &lt;th class=&#34;gt_center gt_columns_top_border gt_column_spanner_outer&#34; rowspan=&#34;1&#34; colspan=&#34;4&#34; style=&#34;font-family: Courgette; font-size: medium; font-weight: bold;&#34;&gt;
        &lt;span class=&#34;gt_column_spanner&#34;&gt;Distribution of Results&lt;/span&gt;
      &lt;/th&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;🥇(1st)&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;🥈(2nd)&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;🥉(3rd)&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;💩 (last)&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody class=&#34;gt_table_body&#34;&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Athlete&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;79&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;13.9&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;10.1&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;6.3&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;10.1&amp;percnt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Musician&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;38&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;7.9&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;10.5&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;15.8&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;13.2&amp;percnt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Actor/Actress&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;130&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;7.7&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;9.2&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;6.9&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;4.6&amp;percnt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Reality TV Star&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;26&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;7.7&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;7.7&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;7.7&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;3.8&amp;percnt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Model&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;14&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;7.1&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;7.1&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;7.1&amp;percnt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Media Personality&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;21&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;4.8&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;4.8&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;4.8&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;23.8&amp;percnt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Businessperson&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;5&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;20&amp;percnt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Entertainer&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;5&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;20&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;40&amp;percnt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Other&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;9&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;44.4&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0&amp;percnt;&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
  
  
&lt;/table&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;step-11-add-a-source-and-do-some-formatting&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Step 11: Add a source and do some formatting&lt;/h3&gt;
&lt;p&gt;There are a couple of things going on in this step:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;I’m adding a source line with &lt;code&gt;tab_source_note()&lt;/code&gt; and using &lt;code&gt;md()&lt;/code&gt; to allow markdown-style formatting.&lt;/li&gt;
&lt;li&gt;I’m using &lt;code&gt;tab_options()&lt;/code&gt; to remove the top border from the table and shrink the gaps between the rows.&lt;/li&gt;
&lt;li&gt;I’m using &lt;code&gt;cols_width()&lt;/code&gt; to tell &lt;code&gt;gt&lt;/code&gt; to make the first column 200px wide.&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;(g11 &amp;lt;- g10 %&amp;gt;%
  tab_source_note(md(&amp;quot;**Data:** DWTS Wikipedia Articles | **Table Author:** JLaw&amp;quot;)) %&amp;gt;%
  tab_options(
    table.border.top.color = &amp;quot;white&amp;quot;,
    data_row.padding = px(0),
  ) %&amp;gt;% 
  cols_width(
    1 ~ px(200),
  )
)&lt;/code&gt;&lt;/pre&gt;
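&lt;p&gt;One note on the &lt;code&gt;cols_width()&lt;/code&gt; call: it selects the first column by position. Columns can also be selected by name, which is a bit more robust if the column order ever changes. A hedged sketch (it assumes the first column is named &lt;code&gt;profession&lt;/code&gt;, which may not match the actual data):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Equivalent width setting, selecting the column by name instead of position
# (column name &amp;quot;profession&amp;quot; is a hypothetical placeholder)
g10 %&amp;gt;%
  cols_width(
    vars(profession) ~ px(200)
  )&lt;/code&gt;&lt;/pre&gt;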
&lt;style&gt;@import url(&#34;https://fonts.googleapis.com/css2?family=Spartan:ital,wght@0,100;0,200;0,300;0,400;0,500;0,600;0,700;0,800;0,900;1,100;1,200;1,300;1,400;1,500;1,600;1,700;1,800;1,900&amp;display=swap&#34;);
@import url(&#34;https://fonts.googleapis.com/css2?family=Secular+One:ital,wght@0,100;0,200;0,300;0,400;0,500;0,600;0,700;0,800;0,900;1,100;1,200;1,300;1,400;1,500;1,600;1,700;1,800;1,900&amp;display=swap&#34;);
@import url(&#34;https://fonts.googleapis.com/css2?family=Courgette:ital,wght@0,100;0,200;0,300;0,400;0,500;0,600;0,700;0,800;0,900;1,100;1,200;1,300;1,400;1,500;1,600;1,700;1,800;1,900&amp;display=swap&#34;);
@import url(&#34;https://fonts.googleapis.com/css2?family=Caveat:ital,wght@0,100;0,200;0,300;0,400;0,500;0,600;0,700;0,800;0,900;1,100;1,200;1,300;1,400;1,500;1,600;1,700;1,800;1,900&amp;display=swap&#34;);
@import url(&#34;https://fonts.googleapis.com/css2?family=Anton:ital,wght@0,100;0,200;0,300;0,400;0,500;0,600;0,700;0,800;0,900;1,100;1,200;1,300;1,400;1,500;1,600;1,700;1,800;1,900&amp;display=swap&#34;);
html {
  font-family: -apple-system, BlinkMacSystemFont, &#39;Segoe UI&#39;, Roboto, Oxygen, Ubuntu, Cantarell, &#39;Helvetica Neue&#39;, &#39;Fira Sans&#39;, &#39;Droid Sans&#39;, Arial, sans-serif;
}

#bpwbmkejog .gt_table {
  display: table;
  border-collapse: collapse;
  margin-left: auto;
  margin-right: auto;
  color: #333333;
  font-size: 16px;
  font-weight: normal;
  font-style: normal;
  background-color: #FFFFFF;
  width: auto;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: white;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #A8A8A8;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
}

#bpwbmkejog .gt_heading {
  background-color: #FFFFFF;
  text-align: center;
  border-bottom-color: #FFFFFF;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
}

#bpwbmkejog .gt_title {
  color: #333333;
  font-size: 125%;
  font-weight: initial;
  padding-top: 4px;
  padding-bottom: 4px;
  border-bottom-color: #FFFFFF;
  border-bottom-width: 0;
}

#bpwbmkejog .gt_subtitle {
  color: #333333;
  font-size: 85%;
  font-weight: initial;
  padding-top: 0;
  padding-bottom: 4px;
  border-top-color: #FFFFFF;
  border-top-width: 0;
}

#bpwbmkejog .gt_bottom_border {
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
}

#bpwbmkejog .gt_col_headings {
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
}

#bpwbmkejog .gt_col_heading {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: normal;
  text-transform: inherit;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: bottom;
  padding-top: 5px;
  padding-bottom: 6px;
  padding-left: 5px;
  padding-right: 5px;
  overflow-x: hidden;
}

#bpwbmkejog .gt_column_spanner_outer {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: normal;
  text-transform: inherit;
  padding-top: 0;
  padding-bottom: 0;
  padding-left: 4px;
  padding-right: 4px;
}

#bpwbmkejog .gt_column_spanner_outer:first-child {
  padding-left: 0;
}

#bpwbmkejog .gt_column_spanner_outer:last-child {
  padding-right: 0;
}

#bpwbmkejog .gt_column_spanner {
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  vertical-align: bottom;
  padding-top: 5px;
  padding-bottom: 6px;
  overflow-x: hidden;
  display: inline-block;
  width: 100%;
}

#bpwbmkejog .gt_group_heading {
  padding: 8px;
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  text-transform: inherit;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: middle;
}

#bpwbmkejog .gt_empty_group_heading {
  padding: 0.5px;
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  vertical-align: middle;
}

#bpwbmkejog .gt_from_md &gt; :first-child {
  margin-top: 0;
}

#bpwbmkejog .gt_from_md &gt; :last-child {
  margin-bottom: 0;
}

#bpwbmkejog .gt_row {
  padding-top: 0px;
  padding-bottom: 0px;
  padding-left: 5px;
  padding-right: 5px;
  margin: 10px;
  border-top-style: solid;
  border-top-width: 1px;
  border-top-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: middle;
  overflow-x: hidden;
}

#bpwbmkejog .gt_stub {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  text-transform: inherit;
  border-right-style: solid;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
  padding-left: 12px;
}

#bpwbmkejog .gt_summary_row {
  color: #333333;
  background-color: #FFFFFF;
  text-transform: inherit;
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
}

#bpwbmkejog .gt_first_summary_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
}

#bpwbmkejog .gt_grand_summary_row {
  color: #333333;
  background-color: #FFFFFF;
  text-transform: inherit;
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
}

#bpwbmkejog .gt_first_grand_summary_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  border-top-style: double;
  border-top-width: 6px;
  border-top-color: #D3D3D3;
}

#bpwbmkejog .gt_striped {
  background-color: rgba(128, 128, 128, 0.05);
}

#bpwbmkejog .gt_table_body {
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
}

#bpwbmkejog .gt_footnotes {
  color: #333333;
  background-color: #FFFFFF;
  border-bottom-style: none;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
}

#bpwbmkejog .gt_footnote {
  margin: 0px;
  font-size: 90%;
  padding: 4px;
}

#bpwbmkejog .gt_sourcenotes {
  color: #333333;
  background-color: #FFFFFF;
  border-bottom-style: none;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
}

#bpwbmkejog .gt_sourcenote {
  font-size: 90%;
  padding: 4px;
}

#bpwbmkejog .gt_left {
  text-align: left;
}

#bpwbmkejog .gt_center {
  text-align: center;
}

#bpwbmkejog .gt_right {
  text-align: right;
  font-variant-numeric: tabular-nums;
}

#bpwbmkejog .gt_font_normal {
  font-weight: normal;
}

#bpwbmkejog .gt_font_bold {
  font-weight: bold;
}

#bpwbmkejog .gt_font_italic {
  font-style: italic;
}

#bpwbmkejog .gt_super {
  font-size: 65%;
}

#bpwbmkejog .gt_footnote_marks {
  font-style: italic;
  font-size: 65%;
}
&lt;/style&gt;
&lt;div id=&#34;bpwbmkejog&#34; style=&#34;overflow-x:auto;overflow-y:auto;width:auto;height:auto;&#34;&gt;&lt;table class=&#34;gt_table&#34; style=&#34;table-layout: fixed;&#34;&gt;
  &lt;colgroup&gt;
    &lt;col style=&#34;width:200px;&#34;/&gt;
    &lt;col/&gt;
    &lt;col/&gt;
    &lt;col/&gt;
    &lt;col/&gt;
    &lt;col/&gt;
  &lt;/colgroup&gt;
  &lt;thead class=&#34;gt_header&#34;&gt;
    &lt;tr&gt;
      &lt;th colspan=&#34;6&#34; class=&#34;gt_heading gt_title gt_font_normal&#34; style=&#34;font-family: Anton; font-size: xx-large; text-align: left;&#34;&gt;Most &lt;span style=&#34;color:#F2CB05&#34;&gt;Successful&lt;/span&gt; Dancing With the Stars &lt;i&gt;&#34;Professions&#34;&lt;/i&gt;&lt;/th&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th colspan=&#34;6&#34; class=&#34;gt_heading gt_subtitle gt_font_normal gt_bottom_border&#34; style=&#34;font-family: Caveat; font-size: x-large; text-align: left;&#34;&gt;&lt;span style = &#39;color: grey&#39;&gt;Covering Seasons 1 to 29 (excluding All-Star Season 15)&lt;/span&gt;&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;thead class=&#34;gt_col_headings&#34;&gt;
    &lt;tr&gt;
      &lt;th class=&#34;gt_col_heading gt_center gt_columns_bottom_border&#34; rowspan=&#34;2&#34; colspan=&#34;1&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_center gt_columns_bottom_border&#34; rowspan=&#34;2&#34; colspan=&#34;1&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;# ⭐s&lt;/th&gt;
      &lt;th class=&#34;gt_center gt_columns_top_border gt_column_spanner_outer&#34; rowspan=&#34;1&#34; colspan=&#34;4&#34; style=&#34;font-family: Courgette; font-size: medium; font-weight: bold;&#34;&gt;
        &lt;span class=&#34;gt_column_spanner&#34;&gt;Distribution of Results&lt;/span&gt;
      &lt;/th&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;🥇(1st)&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;🥈(2nd)&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;🥉(3rd)&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;💩 (last)&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody class=&#34;gt_table_body&#34;&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Athlete&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;79&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;13.9&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;10.1&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;6.3&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;10.1&amp;percnt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Musician&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;38&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;7.9&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;10.5&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;15.8&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;13.2&amp;percnt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Actor/Actress&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;130&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;7.7&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;9.2&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;6.9&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;4.6&amp;percnt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Reality TV Star&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;26&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;7.7&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;7.7&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;7.7&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;3.8&amp;percnt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Model&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;14&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;7.1&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;7.1&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;7.1&amp;percnt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Media Personality&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;21&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;4.8&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;4.8&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;4.8&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;23.8&amp;percnt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Businessperson&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;5&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;20&amp;percnt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Entertainer&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;5&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;20&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;40&amp;percnt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Other&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;9&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;44.4&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0&amp;percnt;&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
  &lt;tfoot class=&#34;gt_sourcenotes&#34;&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_sourcenote&#34; colspan=&#34;6&#34;&gt;&lt;strong&gt;Data:&lt;/strong&gt; DWTS Wikipedia Articles | &lt;strong&gt;Table Author:&lt;/strong&gt; JLaw&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tfoot&gt;
  
&lt;/table&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;step-12-adding-a-color-scale-for-the-columns&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Step 12: Adding a Color Scale for the % Columns&lt;/h3&gt;
&lt;p&gt;The &lt;code&gt;data_color()&lt;/code&gt; function applies conditional formatting based on the values in the columns. The &lt;code&gt;columns&lt;/code&gt; argument specifies which columns should receive the formatting, the &lt;code&gt;colors&lt;/code&gt; argument defines the palette, and the &lt;code&gt;apply_to&lt;/code&gt; argument takes either “fill” to fill the background or “text” to change the color of the text.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;(g12 &amp;lt;- g11 %&amp;gt;% 
  data_color(
    columns = vars(pct_1st_place, pct_2nd_place, pct_3rd_place, pct_last_place),
    colors = scales::col_numeric(
      palette = c(&amp;quot;white&amp;quot;, &amp;quot;#3fc1c9&amp;quot;),
      # &amp;quot;#3fc1c9&amp;quot; is the teal accent color
      domain = NULL
    ),
    apply_to = &amp;quot;fill&amp;quot;,
  )
 )&lt;/code&gt;&lt;/pre&gt;
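&lt;p&gt;As a side note, the palette returned by &lt;code&gt;scales::col_numeric()&lt;/code&gt; is itself just a function that maps numbers to hex color strings, so it can be inspected on its own to see what &lt;code&gt;data_color()&lt;/code&gt; will produce. A minimal sketch (the input values here are hypothetical; fixing &lt;code&gt;domain&lt;/code&gt; makes the mapping explicit, whereas &lt;code&gt;domain = NULL&lt;/code&gt; rescales to each column’s range):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# A standalone look at the palette function used above
pal &amp;lt;- scales::col_numeric(
  palette = c(&amp;quot;white&amp;quot;, &amp;quot;#3fc1c9&amp;quot;),
  domain = c(0, 45)  # example domain; larger values map closer to #3fc1c9
)

# Returns a vector of hex colors, one per input value
pal(c(0, 20, 44.4))&lt;/code&gt;&lt;/pre&gt;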
&lt;style&gt;@import url(&#34;https://fonts.googleapis.com/css2?family=Spartan:ital,wght@0,100;0,200;0,300;0,400;0,500;0,600;0,700;0,800;0,900;1,100;1,200;1,300;1,400;1,500;1,600;1,700;1,800;1,900&amp;display=swap&#34;);
@import url(&#34;https://fonts.googleapis.com/css2?family=Secular+One:ital,wght@0,100;0,200;0,300;0,400;0,500;0,600;0,700;0,800;0,900;1,100;1,200;1,300;1,400;1,500;1,600;1,700;1,800;1,900&amp;display=swap&#34;);
@import url(&#34;https://fonts.googleapis.com/css2?family=Courgette:ital,wght@0,100;0,200;0,300;0,400;0,500;0,600;0,700;0,800;0,900;1,100;1,200;1,300;1,400;1,500;1,600;1,700;1,800;1,900&amp;display=swap&#34;);
@import url(&#34;https://fonts.googleapis.com/css2?family=Caveat:ital,wght@0,100;0,200;0,300;0,400;0,500;0,600;0,700;0,800;0,900;1,100;1,200;1,300;1,400;1,500;1,600;1,700;1,800;1,900&amp;display=swap&#34;);
@import url(&#34;https://fonts.googleapis.com/css2?family=Anton:ital,wght@0,100;0,200;0,300;0,400;0,500;0,600;0,700;0,800;0,900;1,100;1,200;1,300;1,400;1,500;1,600;1,700;1,800;1,900&amp;display=swap&#34;);
html {
  font-family: -apple-system, BlinkMacSystemFont, &#39;Segoe UI&#39;, Roboto, Oxygen, Ubuntu, Cantarell, &#39;Helvetica Neue&#39;, &#39;Fira Sans&#39;, &#39;Droid Sans&#39;, Arial, sans-serif;
}

#wkwzhbancp .gt_table {
  display: table;
  border-collapse: collapse;
  margin-left: auto;
  margin-right: auto;
  color: #333333;
  font-size: 16px;
  font-weight: normal;
  font-style: normal;
  background-color: #FFFFFF;
  width: auto;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: white;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #A8A8A8;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
}

#wkwzhbancp .gt_heading {
  background-color: #FFFFFF;
  text-align: center;
  border-bottom-color: #FFFFFF;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
}

#wkwzhbancp .gt_title {
  color: #333333;
  font-size: 125%;
  font-weight: initial;
  padding-top: 4px;
  padding-bottom: 4px;
  border-bottom-color: #FFFFFF;
  border-bottom-width: 0;
}

#wkwzhbancp .gt_subtitle {
  color: #333333;
  font-size: 85%;
  font-weight: initial;
  padding-top: 0;
  padding-bottom: 4px;
  border-top-color: #FFFFFF;
  border-top-width: 0;
}

#wkwzhbancp .gt_bottom_border {
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
}

#wkwzhbancp .gt_col_headings {
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
}

#wkwzhbancp .gt_col_heading {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: normal;
  text-transform: inherit;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: bottom;
  padding-top: 5px;
  padding-bottom: 6px;
  padding-left: 5px;
  padding-right: 5px;
  overflow-x: hidden;
}

#wkwzhbancp .gt_column_spanner_outer {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: normal;
  text-transform: inherit;
  padding-top: 0;
  padding-bottom: 0;
  padding-left: 4px;
  padding-right: 4px;
}

#wkwzhbancp .gt_column_spanner_outer:first-child {
  padding-left: 0;
}

#wkwzhbancp .gt_column_spanner_outer:last-child {
  padding-right: 0;
}

#wkwzhbancp .gt_column_spanner {
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  vertical-align: bottom;
  padding-top: 5px;
  padding-bottom: 6px;
  overflow-x: hidden;
  display: inline-block;
  width: 100%;
}

#wkwzhbancp .gt_group_heading {
  padding: 8px;
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  text-transform: inherit;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: middle;
}

#wkwzhbancp .gt_empty_group_heading {
  padding: 0.5px;
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  vertical-align: middle;
}

#wkwzhbancp .gt_from_md &gt; :first-child {
  margin-top: 0;
}

#wkwzhbancp .gt_from_md &gt; :last-child {
  margin-bottom: 0;
}

#wkwzhbancp .gt_row {
  padding-top: 0px;
  padding-bottom: 0px;
  padding-left: 5px;
  padding-right: 5px;
  margin: 10px;
  border-top-style: solid;
  border-top-width: 1px;
  border-top-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: middle;
  overflow-x: hidden;
}

#wkwzhbancp .gt_stub {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  text-transform: inherit;
  border-right-style: solid;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
  padding-left: 12px;
}

#wkwzhbancp .gt_summary_row {
  color: #333333;
  background-color: #FFFFFF;
  text-transform: inherit;
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
}

#wkwzhbancp .gt_first_summary_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
}

#wkwzhbancp .gt_grand_summary_row {
  color: #333333;
  background-color: #FFFFFF;
  text-transform: inherit;
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
}

#wkwzhbancp .gt_first_grand_summary_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  border-top-style: double;
  border-top-width: 6px;
  border-top-color: #D3D3D3;
}

#wkwzhbancp .gt_striped {
  background-color: rgba(128, 128, 128, 0.05);
}

#wkwzhbancp .gt_table_body {
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
}

#wkwzhbancp .gt_footnotes {
  color: #333333;
  background-color: #FFFFFF;
  border-bottom-style: none;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
}

#wkwzhbancp .gt_footnote {
  margin: 0px;
  font-size: 90%;
  padding: 4px;
}

#wkwzhbancp .gt_sourcenotes {
  color: #333333;
  background-color: #FFFFFF;
  border-bottom-style: none;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
}

#wkwzhbancp .gt_sourcenote {
  font-size: 90%;
  padding: 4px;
}

#wkwzhbancp .gt_left {
  text-align: left;
}

#wkwzhbancp .gt_center {
  text-align: center;
}

#wkwzhbancp .gt_right {
  text-align: right;
  font-variant-numeric: tabular-nums;
}

#wkwzhbancp .gt_font_normal {
  font-weight: normal;
}

#wkwzhbancp .gt_font_bold {
  font-weight: bold;
}

#wkwzhbancp .gt_font_italic {
  font-style: italic;
}

#wkwzhbancp .gt_super {
  font-size: 65%;
}

#wkwzhbancp .gt_footnote_marks {
  font-style: italic;
  font-size: 65%;
}
&lt;/style&gt;
&lt;div id=&#34;wkwzhbancp&#34; style=&#34;overflow-x:auto;overflow-y:auto;width:auto;height:auto;&#34;&gt;&lt;table class=&#34;gt_table&#34; style=&#34;table-layout: fixed;&#34;&gt;
  &lt;colgroup&gt;
    &lt;col style=&#34;width:200px;&#34;/&gt;
    &lt;col/&gt;
    &lt;col/&gt;
    &lt;col/&gt;
    &lt;col/&gt;
    &lt;col/&gt;
  &lt;/colgroup&gt;
  &lt;thead class=&#34;gt_header&#34;&gt;
    &lt;tr&gt;
      &lt;th colspan=&#34;6&#34; class=&#34;gt_heading gt_title gt_font_normal&#34; style=&#34;font-family: Anton; font-size: xx-large; text-align: left;&#34;&gt;Most &lt;span style=&#34;color:#F2CB05&#34;&gt;Successful&lt;/span&gt; Dancing With the Stars &lt;i&gt;&#34;Professions&#34;&lt;/i&gt;&lt;/th&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th colspan=&#34;6&#34; class=&#34;gt_heading gt_subtitle gt_font_normal gt_bottom_border&#34; style=&#34;font-family: Caveat; font-size: x-large; text-align: left;&#34;&gt;&lt;span style = &#39;color: grey&#39;&gt;Covering Seasons 1 to 29 (excluding All-Star Season 15)&lt;/span&gt;&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;thead class=&#34;gt_col_headings&#34;&gt;
    &lt;tr&gt;
      &lt;th class=&#34;gt_col_heading gt_center gt_columns_bottom_border&#34; rowspan=&#34;2&#34; colspan=&#34;1&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_center gt_columns_bottom_border&#34; rowspan=&#34;2&#34; colspan=&#34;1&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;# ⭐s&lt;/th&gt;
      &lt;th class=&#34;gt_center gt_columns_top_border gt_column_spanner_outer&#34; rowspan=&#34;1&#34; colspan=&#34;4&#34; style=&#34;font-family: Courgette; font-size: medium; font-weight: bold;&#34;&gt;
        &lt;span class=&#34;gt_column_spanner&#34;&gt;Distribution of Results&lt;/span&gt;
      &lt;/th&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;🥇(1st)&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;🥈(2nd)&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;🥉(3rd)&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;💩 (last)&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody class=&#34;gt_table_body&#34;&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Athlete&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;79&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #3FC1C9;&#34;&gt;13.9&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #4BC3CB;&#34;&gt;10.1&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #E8F6F7;&#34;&gt;6.3&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #D6F0F1;&#34;&gt;10.1&amp;percnt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Musician&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;38&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #A0DCE0;&#34;&gt;7.9&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #3FC1C9;&#34;&gt;10.5&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #C5E9EC;&#34;&gt;15.8&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #C9EBED;&#34;&gt;13.2&amp;percnt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Actor/Actress&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;130&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #A2DDE1;&#34;&gt;7.7&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #61C9D0;&#34;&gt;9.2&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #E6F6F6;&#34;&gt;6.9&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #ECF8F9;&#34;&gt;4.6&amp;percnt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Reality TV Star&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;26&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #A2DDE1;&#34;&gt;7.7&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #81D2D7;&#34;&gt;7.7&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #E3F4F5;&#34;&gt;7.7&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #EFF9FA;&#34;&gt;3.8&amp;percnt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Model&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;14&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #A9E0E3;&#34;&gt;7.1&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #8BD5DA;&#34;&gt;7.1&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #FFFFFF;&#34;&gt;0&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #E2F4F5;&#34;&gt;7.1&amp;percnt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Media Personality&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;21&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #C7EAEC;&#34;&gt;4.8&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #B4E3E6;&#34;&gt;4.8&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #EEF8F9;&#34;&gt;4.8&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #9BDADF;&#34;&gt;23.8&amp;percnt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Businessperson&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;5&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #FFFFFF;&#34;&gt;0&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #FFFFFF;&#34;&gt;0&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #FFFFFF;&#34;&gt;0&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #ACE0E4;&#34;&gt;20&amp;percnt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Entertainer&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;5&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #FFFFFF;&#34;&gt;0&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #FFFFFF;&#34;&gt;0&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #B5E3E6;&#34;&gt;20&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #3FC1C9;&#34;&gt;40&amp;percnt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Other&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;9&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #FFFFFF;&#34;&gt;0&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #FFFFFF;&#34;&gt;0&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #3FC1C9;&#34;&gt;44.4&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #FFFFFF;&#34;&gt;0&amp;percnt;&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
  &lt;tfoot class=&#34;gt_sourcenotes&#34;&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_sourcenote&#34; colspan=&#34;6&#34;&gt;&lt;strong&gt;Data:&lt;/strong&gt; DWTS Wikipedia Articles | &lt;strong&gt;Table Author:&lt;/strong&gt; JLaw&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tfoot&gt;
  
&lt;/table&gt;&lt;/div&gt;
&lt;p&gt;This looks pretty good… but we can do better!!!&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;turning-it-up-to-11-by-adding-in-density-plots&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Turning it up to 11 by adding in Density Plots&lt;/h2&gt;
&lt;p&gt;In order to add ggplots into the rows of the table we need to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Build a function to create the plot for each row of the table&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;purrr::map()&lt;/code&gt; to add the plot as a list-column to the table&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;gt::text_transform()&lt;/code&gt; to insert the image into the table&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;NOTE:&lt;/strong&gt; Since this step required building a new data set, much of the &lt;code&gt;gt&lt;/code&gt; code repeats the first section, but it is provided in its entirety for completeness.&lt;/p&gt;
&lt;div id=&#34;writing-the-function-to-build-the-chart&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Writing the function to build the chart&lt;/h3&gt;
&lt;p&gt;The function takes in a “profession” label and a data set and returns a density plot built from the &lt;code&gt;scaled_place&lt;/code&gt; variable defined at the top.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;plot_dens &amp;lt;- function(profession, data) {
  
  plot_data &amp;lt;- 
    data %&amp;gt;% 
    filter(known_for == {{ profession }}) 
  
  plot &amp;lt;- 
    plot_data %&amp;gt;% 
    ggplot(aes(x = scaled_place)) +
    geom_density(aes(y = ..scaled..), fill = &amp;#39;gold&amp;#39;) +
    annotate(&amp;quot;text&amp;quot;, x = 0, y = -.05, 
             label = &amp;quot;1st\nPlace&amp;quot;, size = 10, color = &amp;quot;grey40&amp;quot;, vjust = 1) +
    annotate(&amp;quot;text&amp;quot;, x = 1, y = -.05, 
             label = &amp;quot;Last\nPlace&amp;quot;, size = 10, color = &amp;quot;grey40&amp;quot;, vjust = 1) +
    coord_cartesian(
      xlim = c(-.1, 1.1),
      ylim = c(-.7, NA)
    ) + 
    theme_void()
  
  plot
  
}&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;adding-the-plots-into-the-data-set&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Adding the plots into the data set&lt;/h3&gt;
&lt;p&gt;The main part of this step uses the &lt;code&gt;map()&lt;/code&gt; function to iterate through the professions, passing each one to the function defined above. The resulting &lt;code&gt;plots&lt;/code&gt; column is a list-column containing all of the ggplot information.&lt;/p&gt;
&lt;p&gt;The left join adds a column for the most recent winner in each category.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;contestant_summary_with_graph &amp;lt;- contestant_summary %&amp;gt;% 
  mutate(plots = purrr::map(contestant_summary$known_for %&amp;gt;% unique, 
                            plot_dens, data = contestant_clean)) %&amp;gt;% 
  left_join(
  ###Add in Recent Winner Images
  contestant_clean %&amp;gt;% 
    filter(place == 1) %&amp;gt;% 
    group_by(known_for) %&amp;gt;% 
    slice_max(season, n = 1) %&amp;gt;% 
    select(celebrity, season, known_for) %&amp;gt;% 
    ungroup() %&amp;gt;% 
    transmute(
      known_for,
      lbl = paste0(celebrity,&amp;#39; (Season &amp;#39;,season,&amp;quot;)&amp;quot;)
    )
  )&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;creating-the-final-table&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Creating the Final Table&lt;/h3&gt;
&lt;p&gt;In order to render the plots inside the table, the &lt;code&gt;text_transform()&lt;/code&gt; function takes the &lt;code&gt;plots&lt;/code&gt; column and, for each row, calls &lt;code&gt;ggplot_image()&lt;/code&gt; with height and aspect-ratio parameters.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;text_transform(
    locations = cells_body(vars(plots)),
    fn = function(x) {
      map(contestant_summary_with_graph$plots, ggplot_image, 
          height = px(120), aspect_ratio = 1.5)
    }
  )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now we can put it all together. Besides adding in the plots, there are a few steps to format the Most Recent Winner cell, but nothing that hasn’t been covered earlier.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#Base Table
gt(contestant_summary_with_graph) %&amp;gt;% 
  #Add Titles
  tab_header(
    title = html(&amp;#39;Most &amp;lt;span style=&amp;quot;color:#F2CB05&amp;quot;&amp;gt;Successful&amp;lt;/span&amp;gt; Dancing With the Stars &amp;lt;i&amp;gt;&amp;quot;Professions&amp;quot;&amp;lt;/i&amp;gt;&amp;#39;),
    subtitle = html(
      &amp;quot;&amp;lt;span style = &amp;#39;color: grey&amp;#39;&amp;gt;Covering Seasons 1 to 29 (excluding All-Star Season 15)&amp;lt;/span&amp;gt;&amp;quot;
    )
  ) %&amp;gt;% 
  #Format Title
  tab_style(
    style = cell_text(
      font = google_font(&amp;quot;Anton&amp;quot;), 
      align = &amp;quot;left&amp;quot;, 
      size = &amp;quot;xx-large&amp;quot;
    ),
    locations = cells_title(&amp;quot;title&amp;quot;)
  ) %&amp;gt;% 
  #Format Subtitle
  tab_style(
    style = cell_text(
      font = google_font(&amp;quot;Caveat&amp;quot;),
      align = &amp;quot;left&amp;quot;, 
      size = &amp;quot;x-large&amp;quot;
    ),
    locations = cells_title(&amp;quot;subtitle&amp;quot;)
  )  %&amp;gt;% 
  #Adding Spanning Column
  tab_spanner(
    label = &amp;quot;Distribution of Results&amp;quot;,
    columns = 3:7
  ) %&amp;gt;% 
  #Style The Spanner Column
  tab_style(
    style = cell_text(
      font = google_font(&amp;quot;Courgette&amp;quot;), 
      size = &amp;quot;medium&amp;quot;, 
      weight = &amp;quot;bold&amp;quot;
    ),
    locations = cells_column_spanners(&amp;quot;Distribution of Results&amp;quot;)
  ) %&amp;gt;% 
  #Style the Column Labels and Profession Column
  tab_style(
    style = cell_text(
      font = google_font(&amp;quot;Secular One&amp;quot;), 
      size = &amp;quot;large&amp;quot;
    ),
    locations = list(
      cells_column_labels(everything()), 
      cells_body(columns = 1)
    )
  )  %&amp;gt;% 
  #Style the Cells
  tab_style(
    style = cell_text(
      font = google_font(&amp;quot;Spartan&amp;quot;), 
      size = &amp;quot;medium&amp;quot;,
      align = &amp;#39;center&amp;#39;
    ),
    locations = cells_body(columns = 2:6)
  ) %&amp;gt;% 
  #Format Cells to %s
  fmt_percent(
    columns = starts_with(&amp;#39;pct&amp;#39;),
    decimals = 1,
    drop_trailing_zeros = TRUE
  ) %&amp;gt;% 
  #Turn Headers to Emojis
  cols_label(
    known_for = &amp;quot;&amp;quot;,
    num_stars = paste0(&amp;quot;# &amp;quot;,emo::ji(&amp;#39;star&amp;#39;), &amp;quot;s&amp;quot;),
    pct_1st_place = paste0(emo::ji(&amp;quot;1st_place_medal&amp;quot;), &amp;quot;(1st)&amp;quot;),
    pct_2nd_place = paste0(emo::ji(&amp;quot;2nd_place_medal&amp;quot;), &amp;quot;(2nd)&amp;quot;),
    pct_3rd_place = paste0(emo::ji(&amp;quot;3rd_place_medal&amp;quot;), &amp;quot;(3rd)&amp;quot;),
    pct_last_place = paste0(emo::ji(&amp;quot;poo&amp;quot;), &amp;quot; (last)&amp;quot;),
    plots = &amp;quot;&amp;quot;,
    lbl = &amp;quot;Most Recent Winner&amp;quot;
  ) %&amp;gt;% 
  ###Add in Source and Doing Some Minor Formatting
  tab_source_note(md(&amp;quot;**Data:** DWTS Wikipedia Articles | **Table Author:** JLaw&amp;quot;)) %&amp;gt;%
  tab_options(
    table.border.top.color = &amp;quot;white&amp;quot;,
    data_row.padding = px(0),
  ) %&amp;gt;% 
  cols_width(
    1 ~ px(200)
  ) %&amp;gt;% 
###Add a Color Scale for 1st Place
  data_color(
    columns = vars(pct_1st_place, pct_2nd_place, pct_3rd_place, pct_last_place),
    colors = scales::col_numeric(
      palette = c(&amp;quot;white&amp;quot;, &amp;quot;#3fc1c9&amp;quot;),
      #F2CB05 = Gold Color
      domain = NULL
    ),
    apply_to = &amp;quot;fill&amp;quot;,
  ) %&amp;gt;% 
  ######################NEW THINGS START HERE#########################
  # Add In Density Plots (NEW)
  text_transform(
    locations = cells_body(vars(plots)),
    fn = function(x) {
      map(contestant_summary_with_graph$plots, ggplot_image, 
          height = px(120), aspect_ratio = 1.5)
    }
  ) %&amp;gt;% 
  text_transform(
    locations = cells_body(vars(lbl)),
    fn = function(x){
      if_else(!is.na(x), str_replace_all(x, &amp;quot; \\(&amp;quot;, &amp;quot;&amp;lt;br&amp;gt; \\(&amp;quot;), &amp;quot;&amp;quot;)
    }
  ) %&amp;gt;% 
  tab_style(
    style = cell_text(
      style = &amp;#39;italic&amp;#39;,
      size = px(13),
      v_align = &amp;#39;middle&amp;#39;,
      align = &amp;#39;left&amp;#39;
    ),
    locations = cells_body(columns = vars(lbl))
  ) %&amp;gt;%
  cols_width(
    8 ~ px(100)
  )&lt;/code&gt;&lt;/pre&gt;
&lt;style&gt;@import url(&#34;https://fonts.googleapis.com/css2?family=Spartan:ital,wght@0,100;0,200;0,300;0,400;0,500;0,600;0,700;0,800;0,900;1,100;1,200;1,300;1,400;1,500;1,600;1,700;1,800;1,900&amp;display=swap&#34;);
@import url(&#34;https://fonts.googleapis.com/css2?family=Secular+One:ital,wght@0,100;0,200;0,300;0,400;0,500;0,600;0,700;0,800;0,900;1,100;1,200;1,300;1,400;1,500;1,600;1,700;1,800;1,900&amp;display=swap&#34;);
@import url(&#34;https://fonts.googleapis.com/css2?family=Courgette:ital,wght@0,100;0,200;0,300;0,400;0,500;0,600;0,700;0,800;0,900;1,100;1,200;1,300;1,400;1,500;1,600;1,700;1,800;1,900&amp;display=swap&#34;);
@import url(&#34;https://fonts.googleapis.com/css2?family=Caveat:ital,wght@0,100;0,200;0,300;0,400;0,500;0,600;0,700;0,800;0,900;1,100;1,200;1,300;1,400;1,500;1,600;1,700;1,800;1,900&amp;display=swap&#34;);
@import url(&#34;https://fonts.googleapis.com/css2?family=Anton:ital,wght@0,100;0,200;0,300;0,400;0,500;0,600;0,700;0,800;0,900;1,100;1,200;1,300;1,400;1,500;1,600;1,700;1,800;1,900&amp;display=swap&#34;);
html {
  font-family: -apple-system, BlinkMacSystemFont, &#39;Segoe UI&#39;, Roboto, Oxygen, Ubuntu, Cantarell, &#39;Helvetica Neue&#39;, &#39;Fira Sans&#39;, &#39;Droid Sans&#39;, Arial, sans-serif;
}

#yovgbnmept .gt_table {
  display: table;
  border-collapse: collapse;
  margin-left: auto;
  margin-right: auto;
  color: #333333;
  font-size: 16px;
  font-weight: normal;
  font-style: normal;
  background-color: #FFFFFF;
  width: auto;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: white;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #A8A8A8;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
}

#yovgbnmept .gt_heading {
  background-color: #FFFFFF;
  text-align: center;
  border-bottom-color: #FFFFFF;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
}

#yovgbnmept .gt_title {
  color: #333333;
  font-size: 125%;
  font-weight: initial;
  padding-top: 4px;
  padding-bottom: 4px;
  border-bottom-color: #FFFFFF;
  border-bottom-width: 0;
}

#yovgbnmept .gt_subtitle {
  color: #333333;
  font-size: 85%;
  font-weight: initial;
  padding-top: 0;
  padding-bottom: 4px;
  border-top-color: #FFFFFF;
  border-top-width: 0;
}

#yovgbnmept .gt_bottom_border {
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
}

#yovgbnmept .gt_col_headings {
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
}

#yovgbnmept .gt_col_heading {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: normal;
  text-transform: inherit;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: bottom;
  padding-top: 5px;
  padding-bottom: 6px;
  padding-left: 5px;
  padding-right: 5px;
  overflow-x: hidden;
}

#yovgbnmept .gt_column_spanner_outer {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: normal;
  text-transform: inherit;
  padding-top: 0;
  padding-bottom: 0;
  padding-left: 4px;
  padding-right: 4px;
}

#yovgbnmept .gt_column_spanner_outer:first-child {
  padding-left: 0;
}

#yovgbnmept .gt_column_spanner_outer:last-child {
  padding-right: 0;
}

#yovgbnmept .gt_column_spanner {
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  vertical-align: bottom;
  padding-top: 5px;
  padding-bottom: 6px;
  overflow-x: hidden;
  display: inline-block;
  width: 100%;
}

#yovgbnmept .gt_group_heading {
  padding: 8px;
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  text-transform: inherit;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: middle;
}

#yovgbnmept .gt_empty_group_heading {
  padding: 0.5px;
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  vertical-align: middle;
}

#yovgbnmept .gt_from_md &gt; :first-child {
  margin-top: 0;
}

#yovgbnmept .gt_from_md &gt; :last-child {
  margin-bottom: 0;
}

#yovgbnmept .gt_row {
  padding-top: 0px;
  padding-bottom: 0px;
  padding-left: 5px;
  padding-right: 5px;
  margin: 10px;
  border-top-style: solid;
  border-top-width: 1px;
  border-top-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: middle;
  overflow-x: hidden;
}

#yovgbnmept .gt_stub {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  text-transform: inherit;
  border-right-style: solid;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
  padding-left: 12px;
}

#yovgbnmept .gt_summary_row {
  color: #333333;
  background-color: #FFFFFF;
  text-transform: inherit;
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
}

#yovgbnmept .gt_first_summary_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
}

#yovgbnmept .gt_grand_summary_row {
  color: #333333;
  background-color: #FFFFFF;
  text-transform: inherit;
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
}

#yovgbnmept .gt_first_grand_summary_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  border-top-style: double;
  border-top-width: 6px;
  border-top-color: #D3D3D3;
}

#yovgbnmept .gt_striped {
  background-color: rgba(128, 128, 128, 0.05);
}

#yovgbnmept .gt_table_body {
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
}

#yovgbnmept .gt_footnotes {
  color: #333333;
  background-color: #FFFFFF;
  border-bottom-style: none;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
}

#yovgbnmept .gt_footnote {
  margin: 0px;
  font-size: 90%;
  padding: 4px;
}

#yovgbnmept .gt_sourcenotes {
  color: #333333;
  background-color: #FFFFFF;
  border-bottom-style: none;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
}

#yovgbnmept .gt_sourcenote {
  font-size: 90%;
  padding: 4px;
}

#yovgbnmept .gt_left {
  text-align: left;
}

#yovgbnmept .gt_center {
  text-align: center;
}

#yovgbnmept .gt_right {
  text-align: right;
  font-variant-numeric: tabular-nums;
}

#yovgbnmept .gt_font_normal {
  font-weight: normal;
}

#yovgbnmept .gt_font_bold {
  font-weight: bold;
}

#yovgbnmept .gt_font_italic {
  font-style: italic;
}

#yovgbnmept .gt_super {
  font-size: 65%;
}

#yovgbnmept .gt_footnote_marks {
  font-style: italic;
  font-size: 65%;
}
&lt;/style&gt;
&lt;div id=&#34;yovgbnmept&#34; style=&#34;overflow-x:auto;overflow-y:auto;width:auto;height:auto;&#34;&gt;&lt;table class=&#34;gt_table&#34; style=&#34;table-layout: fixed;&#34;&gt;
  &lt;colgroup&gt;
    &lt;col style=&#34;width:200px;&#34;/&gt;
    &lt;col/&gt;
    &lt;col/&gt;
    &lt;col/&gt;
    &lt;col/&gt;
    &lt;col/&gt;
    &lt;col/&gt;
    &lt;col style=&#34;width:100px;&#34;/&gt;
  &lt;/colgroup&gt;
  &lt;thead class=&#34;gt_header&#34;&gt;
    &lt;tr&gt;
      &lt;th colspan=&#34;8&#34; class=&#34;gt_heading gt_title gt_font_normal&#34; style=&#34;font-family: Anton; font-size: xx-large; text-align: left;&#34;&gt;Most &lt;span style=&#34;color:#F2CB05&#34;&gt;Successful&lt;/span&gt; Dancing With the Stars &lt;i&gt;&#34;Professions&#34;&lt;/i&gt;&lt;/th&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th colspan=&#34;8&#34; class=&#34;gt_heading gt_subtitle gt_font_normal gt_bottom_border&#34; style=&#34;font-family: Caveat; font-size: x-large; text-align: left;&#34;&gt;&lt;span style = &#39;color: grey&#39;&gt;Covering Seasons 1 to 29 (excluding All-Star Season 15)&lt;/span&gt;&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;thead class=&#34;gt_col_headings&#34;&gt;
    &lt;tr&gt;
      &lt;th class=&#34;gt_col_heading gt_center gt_columns_bottom_border&#34; rowspan=&#34;2&#34; colspan=&#34;1&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_center gt_columns_bottom_border&#34; rowspan=&#34;2&#34; colspan=&#34;1&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;# ⭐s&lt;/th&gt;
      &lt;th class=&#34;gt_center gt_columns_top_border gt_column_spanner_outer&#34; rowspan=&#34;1&#34; colspan=&#34;5&#34; style=&#34;font-family: Courgette; font-size: medium; font-weight: bold;&#34;&gt;
        &lt;span class=&#34;gt_column_spanner&#34;&gt;Distribution of Results&lt;/span&gt;
      &lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_center gt_columns_bottom_border&#34; rowspan=&#34;2&#34; colspan=&#34;1&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Most Recent Winner&lt;/th&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;🥇(1st)&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;🥈(2nd)&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;🥉(3rd)&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;💩 (last)&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody class=&#34;gt_table_body&#34;&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Athlete&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;79&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #3FC1C9;&#34;&gt;13.9&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #4BC3CB;&#34;&gt;10.1&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #E8F6F7;&#34;&gt;6.3&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #D6F0F1;&#34;&gt;10.1&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;&lt;img cid=&#34;pkcyvjeqdgnm__temp_ggplot.png&#34; src=&#34;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAu4AAAH0CAMAAABVZliYAAAAulBMVEUAAABmZmZmZoFmZp1mgYFmgZ1mgbZmnc+BZmaBgWaBgYGBgZ2BgbaBnbaBnc+Bts+BtuedZmadgWadgYGdnZ2dnbadts+dtuedz+edz/+2gWa2nYG2nZ22tp22ts+2z8+2z+e25//PnWbPnYHPtoHPtp3Pz7bPz8/P5+fP5//P///ntoHntp3nz53nz7bnz8/n57bn58/n5+fn5//n/+fn////z53/1wD/57b/58//5+f//8///+f///+NxnQNAAAACXBIWXMAAA9hAAAPYQGoP6dpAAAONElEQVR4nO3de3/T9hXAYZxyCWSsrBd6ZzSMdjQr7drBgDnv/20t8lVObAKxjnSs8zx/9JPYov4p+UrWLdatcyjj1tADgP7InULkTiFypxC5U4jcKUTuFCJ3CpE7hcidQuROIXKnELlTiNwpRO4UIncKkTuFyJ1C5E4hcqcQuVNI4txvtQ09GEYhYUfLwv/bpns6kC2fy5lfoXluLlM5V1bp728+09A5DGma+eDUN6MfetgclCS9fHzr7fX80KPnUGRI5QYrdqt5bmL4TPZt3WqeDzZ0IR3F3mp+4Bkis4Hr6DT2dfKaZ6tBw+h21X65+SHnjJyGjCIs9nXymqdtuB7iVu2aZ4ehUugnds2zYaAK+oxd8ywNE8AAtbeaV31Zg/zmB6v9YKq/dTNDDzu9IX5Cg9e+lCSSLd3uN0Pq32WAH0ma2pd676Obtj9mxpQ/1//PIV3tLUHrxo7W3d3MWXezdYD6nvtBftc3c4ON423/ZJi8dyudfc9znes3//F29Jyy62uUzL7fuT2oHoooFX2v86n2tIqs6vucQbWnN/boe5w1tR+K8Tbf31yp/bCMckXf2/yo/SCNLPneTiQO/Xvjxka0mpc7H2QcyfczC4d1AoYdDj/5XoYv9vE47OL7GLvax+WAi+9h4Gofn0PdrJE7N3SIxcePWO3jdXDFhw9X7eN2WMFHD1bto3dI2/FypwOHEnzwMNVexWEEHztItRdyCMHLnc7k34gPHZ/ay0kefOTo1F5R6uADx6b2ohIHHziyoX/qDCbtRrzcCZGzd7kTI+UKXu5ESRi83ImTrne5EyjbCl7uhMoVvNwJlql3uRMt0Qpe7sRL07vc6UGW3uVOH5Js0MidfqToXe70JMMKXu70Zvje5U5/Bl/By50+Ddy73OnVsL3LnX4N2rvc6dmQvcudvg3Yu9zp3XAHaOTOAIbqXe4MYaDe5c4ghuld7gxjkN7lzkCG6F3uDGWA3uXOYPrvXe4Mp/fe5c6A+j7hJHcG1W/vcmdYvfYudwbWZ+9yZ2Byp5Iee5c7g+uvd7kzvN56lzsJ9NW73Mmgp97lTgr99C53cuild7mTRB+9y50seuhd7qQR37vcSUPuVBLeu9xJJLp3uZNJcO9yJ5XY3uVOKnKnktDe5U4ykZ9OIHeyiUtS7qQTl6TcSScuSbmTTlyScieduCTlTjpxScqddOKSlDvpxCUpd9KJS1LupBOXpNxJJy5JuZNOXJJyJ524JOVOOnFJyp104pKUO+nEJSl30olLUu6kE5ek3EknLkm5k05cknInnbgk5U46cUnKnXTikpQ76cQlKXfSiUtS7qQTl6TcSScuSbmTTlyScieduCTlTjpxScqddOKSlDvpxCUpd9KJS1LupBOXpNxJJy5JuZNOXJJyJ524JOVOOnFJyp104pKUO+nEJSl30olLUu6kE5ek3EknLkm5k05cknInnbgk5U46cUnKnXTikpQ76cQlKXfSiUtS7qQTl6TcSScuSbmTTlyScieduCTlTjpxScqddOKS
lDvpxCUpd9KJS1LupBOXpNxJJy5JuZNOXJJyJ524JOVOOnFJyp104pKUO+nEJSl30olLUu6kE5ek3EknLkm5k05cknInnbgk5U46cUnKnXTikpQ76cQlKXfSiUtS7qQTl6TcSScuSbmTTlyScieduCTlTjpxScqddOKSlDvpxCUpd9KJS1LupBOXpNxJJy5JuZNOXJJyJ524JOVOOnFJyp104pKUO+nEJSl30olLUu6kE5ek3EknLkm5k05cknInnbgk5U46cUnKnXTikpQ76cQlKXfSiUtS7qQTl6TcSScuSbmTTlyScieduCTlTjpxScqddOKSlDvpxCUpd9KJS1LupBOXpNxJJy5JuZNOXJJyJ524JOVOOnFJyp104pKUO+nEJSl30olLUu6kE5ek3EknLkm5k05cknInnbgk5U46cUnKnXTikpQ76cQlKXfSiUtS7qQTl6TcSScuSbmTTlyScieduCTlTjpxScqddOKSlDvpxCUpd9KJS1LupBOXpNxJJy5JuZNOXJJyJ524JOVOOnFJyp104pKUO+nEJSl30olLUu6kE5ek3EknLkm5k05cknInnbgk5U46cUnKnXTikpQ76cQlKXfSiUtS7qQTl6TcSScuSbmTTlyScieduCTlTjpxScqddOKSlDvpxCUpd9KJS1LupBOXpNxJJy5JuZNOXJJyJ524JOVOOnFJyp104pKUO+nEJSl30olLUu6kE5ek3EknLkm5k05cknInnbgk5U46cUnKnXTikpQ76cQlKXfSiUtS7qQTl2Rk7pCN3ClE7hQidwqRO4XInULkTiFypxC5U4jcKUTuFCJ3CpE7hcidQuROIV3mPv12cq/D/x1j9OZ4Mnk02Kt3mfvZ5Lrcp8//8qrDF+Tw7J37Xg11mPvryXW5v/xsclfute2b+34NdZd7Mx/vz/1iY0fu1e2Z+54NdZb7r03tcucao8h9+mwykTvXG0PuFxtUk8nRfblznRHkftqs2O/+51u5c51x5H70+NX0Uu7T5w+adf7J0/noziZL33Txmhym3bm//eHB8SyYJ+ucu26oo9wfvpgveK3c/3W8HNrRV50MlTHYlfvbz1Z9LII5D2iok9xfvmj+u5n7emST+fzJnfOdub8+nlwOJqKhDk8zbeTezNVR8xY0fd589eNyAtvuxW3P/X9/u0j44T8vvpr++7uLLz/55Tykoajcz5bjm59tfbScQO7Fbc+9WW1/3fpm1k5AQ1G5n65HtXpc7mzPfbOMZpJmYyWgocDcl0vmxgRyL+4DDkS2cu+6oajcm3efo4cvrkwg9+Kuy/3db98fL3ZFAxqKyr35pnHy91cbE8i9uN25v/vpi8/vt4+8BDQUlXvrOOrtx60NMLkXtyv3l63j7ssDjd03FJb7/OjRxmFUubMr9/UR9ZMnvx8vj6t33lBc7ufNaeHlaB2ZYW577q9nZ06//PmPV+frXdWZbhsKzb158OV3szn58VzunO/IvTnLdPS0PUn7rGmHDUXn3jx+ungrkjvbc1+dRVp9d+kiga4aCsq9WVxX42++uXcud8535946wn46zz2ioaDcN0a1HLfc+YDc3yyOu0c0FHjNzGquzlrHUeVe3Nbcmwfvtb5eX//YcUNRua8vcXvX/BnrfISnsx2S6Z/dvSaH5s3mlb7zNmYnlO42tcyuh1xkHtBQ2K7qxmwt3qnOWqcQqGlr7vMDkUufLkPqvqG4IzNv1qfJ7iyue5gtrkN+ZhqD25774nNbFrGsLoXsvKHIA5EvP2/m4fbs7WgxybPmkb9295ocmh25n0+f31/F0tpz7bghnwBMIXKnELlTiNwpRO4UIncKkTuFyJ1C5E4hcqcQuVOI3ClE7hQidwqRO4XInULkTiFypxC5U4jcKUTuFCJ3CpE7hcidQuROIXKnkI/J/XTjs85unzz+s/WMT7Jmlzzh3Dz3xsNXq2fkzi55wtkv98mdX5bPyJ1d8oSzZ+6Lj/yVO++RJ5yPzL01
tnfzW7x+c/UZ2JAnnJvnvvhk+XvbnoGWPOHsk/vsFiOf/LLtGVjLE85euTf365jdZkHuvEeecPbKfXYbtMujfvvDg2bb7OjkSXvad7N7kTx43H5sOrsTydHJU0vKuOUJp+u1+9v1zaMmR1+tJny2euzr1b9e332q9SAjlCecvXJvdjk2N8Feb95o6tF67laWtwg8a0/o7mRjliecvXdVZ48sn1nf+HU6ux/sJ6uTCRcL4qvz6T/WjzWDPmrOJk9/vT9x88lRyxPOPrnP7hn4qP1MM5TV+8vZ8naBzXSLs2i/Lv9F89jyfzY9Xd9YkBHKE86Nc58vhYtXWzyzecP6ZmDfbAy/NcXpamk9ny/bG3dkZVTyhLPvRQSPrs7P5VFv3F54MQfNOFtvQ2cOZI5ZnnD2zP3e6plLr/rut++PF7sXzegvb2FdPNZ+F3pta2bM8oSzV+6r40DtUb/76YvP77f3pl+39qpbw7zs8iSMRp5wbpz70YMvn7afWYz65WdXhrJtCdw4mCT3kcsTzn5nVa88sx7NyZPfj+VOI0843eY+e6s5+vLnP5pv3rx/1Ov9a0YuTzid5t7sNh+t3qlao76yBJ7ZNy0kTzid5t6Mb70rvRztxg724uDS5pSMW55wus69teidTrYcPl3MQrM4O9JeRZ5w4nJ/szx8urras7F8MzrdeKOybTNqecLpNPdmoMvFcTboK5c5rKZovlgdfW0uiHARwXjlCafT3GfXa95trmubXxex3Mxqjh7dedpcw3a8WhpnR5QerqZ1nGbE8oQTcCBy6dPVpte0fZ5h+U70rD2tTZkxyxNOx6eZ1n9ocudFa/rZ9crz4W37o5RmYsYrTzgd534+nf1p4e3Zm017B2T3nxwuJma88oTjE4ApRO4UIncKkTuFyJ1C5E4hcqcQuVOI3ClE7hQidwqRO4XInULkTiFypxC5U4jcKUTuFCJ3CpE7hcidQuROIXKnELlTiNwpRO4UIncKkTuFyJ1C5E4hcqcQuVOI3ClE7hQidwqRO4XInULkTiFypxC5U4jcKUTuFCJ3CpE7hcidQuROIXKnELlTiNwpRO4UIncKkTuFyJ1C5E4hcqcQuVOI3ClE7hQidwqRO4XInULkTiFypxC5U4jcKUTuFCJ3CpE7hcidQuROIXKnELlTiNwpRO4UIncKkTuFyJ1C5E4hcqcQuVOI3ClE7hQidwqRO4XInULkTiFypxC5U4jcKUTuFCJ3CpE7hcidQuROIXKnELlTiNwpRO4UIncKkTuFyJ1C5E4hcqcQuVOI3ClE7hQidwqRO4XInULkTiFyp5D/AwulKuKNEL4TAAAAAElFTkSuQmCC&#34; style=&#34;height:120px;&#34;&gt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-size: 13px; text-align: left; vertical-align: middle; font-style: italic;&#34;&gt;Adam Rippon&lt;br&gt; (Season 26)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Musician&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;38&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #A0DCE0;&#34;&gt;7.9&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #3FC1C9;&#34;&gt;10.5&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #C5E9EC;&#34;&gt;15.8&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #C9EBED;&#34;&gt;13.2&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;&lt;img cid=&#34;rgoufbwhykzt__temp_ggplot.png&#34; src=&#34;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAu4AAAH0CAMAAABVZliYAAAAulBMVEUAAABmZmZmZoFmZp1mgYFmgZ1mgbZmnc+BZmaBgWaBgYGBgZ2BgbaBnbaBnc+Bts+BtuedZmadgWadgYGdnZ2dnbadts+dtuedz+edz/+2gWa2nYG2nZ22tp22ts+2z8+2z+e25//PnWbPnYHPtoHPtp3Pz7bPz8/P5+fP5//P///ntoHntp3nz53nz7bnz8/n57bn58/n5+fn5//n/+fn////z53/1wD/57b/58//5+f//8///+f///+NxnQNAAAACXBIWXMAAA9hAAAPYQGoP6dpAAAP2ElEQVR4nO3da3vTRh6GcWxOgSxbFgo9UxqWdiFL2m03IWTt7/+1NiNLspTYOVnPzCP979+LXiRx7dHoti3LsnVvCYRxr/QAgHzIHYGQOwIhdwRC7giE3BEIuSMQckcg5I5AyB2BkDsCIXcEQu4IhNwRCLkjEHJHIOSOQMgdgZA7AiF3BELuCGS0ud/rKj0YjMT4SqkL/9xF97iRcQVyqfOLaB5XGU8b16ZO87jOSLK4cerd5ksPGnbG0MTtW18/ypceO6zY93DX1nmQv+TeFUqPLRPz5dytdYpf2bQz6/IkRajeewEHiL1dmaWXpYxrO99QfekxCzkv2xAP7d01WXp5crtd6f3mS49dxHe5ho29Xo2lFyqbu6bemaspzpbrIg0fe70WSy9YBrum3pmtqU2X5/KIYq9XoecyD2So1NfTVXqJhmS5MLrYm3VYeglVFI8TU3qAcFwQde2fpxn8YNswE54vv8UQbsj0V+BUVuGKetqmMV12y5An9mYVll7agQgf16c1XW5LkLP2aazBbM+HnyfwlOg1+nwrrrsGSy/1LpQb7NsmrPQy78Bp7AVir1eg0yzcRpkZG+10WeVeKPbxrsFSDw8jna7EZ9wlax/jGiwX+2q6xjZfFZtBF659ZGsw+xb7xvkqPQu35zLk8msvGUnxBq1XxjFbXSYDNll/n0ewCh0e2Fv2s3WBx3CNVqD5Q7xT6xXnybrMYbB2q9B1HVo9sLesHx4uMBip4yp0XIeWra/4TdYW5cdpuxK9ijeOPXGaqisUH6b1WnQJ3nMrpsdlqq5WepDuq9HgIX4ErVfKz9T1Cg9xDCuybPEjab3iH3zZAY5lVRYrfkyxJ+7Bk/sNFSh+LFsxPd7BFx3cyNZm1uJH2XrFufeim6WlV8ztZSp+vK0nxg/w5UY21jUqL37crVdsey82sDGvUuFHNifQemKw/3ajUqMa/UpVJD+R1lcseyf3HQxZfP7PWKs5PsCX2p9cel0MZZBvyp1e6it+vZcZ0cRW7i7NTzX1il3vRQY0yfV767O9XH8GmfFz26ApMZwpr+IbnNqruUjpoebhFTy5a1w6td1a6aHl5tR7gbGEW9/RGT3A5x8Jtcdj03v+w/xKTz0KcOk99zjibboiMdmgyTwKYg/Love8g6D2wBwe4LMOgdpjK987uSOf4r3nHAC1h1d6gybnpy9LzzUMFP4ugIw3VXqm4aDslwHku6XS8wwPRb8NINsNlZ5luCjYO7kju3IvWHPdMLWjo9hHpDPdTOn5hZdSn5HOcyulZxduymzQ5PkOuNJzC0NFPkiX4zZKTywslfgkXYabKD2tMFXgo3QZbqL0rMJV/s/S6W+h9JzCV+4XrOSOojJ/vkh+A6XnE97yfuJCff2lZxPusn7kQn39pScT9nIehC6++tJTiRHIeBS69tpLTyRGId9xudprLz2PGIdsB+ZKr7z0LGIkppA7teOmch2ILrzq0lOIEcl0fmbdVZeeQIxLloNzdVddevowMjkOV9RddenZw9hkOIBL
d9WlJw+joz+CS3fVpecO4yM/pkV31aWnDiOkPqhFd9WlZw5jJH6bX3fVpScOY0TuiER7WIvuqkvPG8ZJ+f4qucONLklyhx1dkuQOO7okyR12dEmSO+zokiR32NElSe6wo0uS3GFHlyS5w44uSXKHHV2S5A47uiTJHXZ0SZI77OiSJHfY0SVJ7rCjS5LcYUeXJLnDji5JcocdXZLkDju6JMkddnRJkjvs6JIkd9jRJUnusKNLktxhR5ckucOOLklyhx1dkuQOO7okyR12dEmSO+zokiR32NElSe6wo0uS3GFHlyS5w44uSXKHHV2S5A47uiTJHXZ0SZI77OiSJHfY0SVJ7rCjS5LcYUeXJLnDji5JcocdXZLkDju6JMkddnRJkjvs6JIkd9jRJUnusKNLktxhR5ckucOOLklyhx1dkuQOO7okyR12dEmSO+zokiR32NElSe6wo0uS3GFHlyS5w44uSXKHHV2S5A47uiTJHXZ0SZI77OiSJHfY0SVJ7rCjS5LcYUeXJLnDji5JcocdXZLkDju6JMkddnRJkjvs6JIkd9jRJUnusKNLktxhR5ckucOOLklyhx1dkuQOO7okyR12dEmSO+zokiR32NElSe6wo0uS3GFHlyS5w44uSXKHHV2S5A47uiTJHXZ0SZI77OiSJHfY0SVJ7rCjS5LcYUeXJLnDji5JcocdXZLkDju6JMkddnRJkjvs6JIkd9jRJUnusKNLktxhR5ckucOOLklyhx1dkuQOO7okyR12dEmSO+zokiR32NElSe6wo0uS3GFHlyS5w44uSXKHHV2S5A47uiTJHXZ0SZI77OiSJHfY0SVJ7rCjS5LcYUeXJLnDji5JcocdXZLkDju6JMkddnRJkjvs6JIkd9jRJUnusKNLktxhR5ckucOOLklyhx1dkuQOO7okyR12dEmSO+zokiR32NElSe6wo0uS3GFHlyS5w44uSXKHHV2S5A47uiTJHXZ0SZI77OiSJHfY0SVJ7rCjS5LcYUeXJLnDji5JcocdXZLkDju6JMkddnRJkjvs6JIkd9jRJUnusKNLktxhR5ckucOOLklyhx1dkuQOO7okyR12dEmSO+zokiR32NElSe6wo0uS3GFHlyS5w44uSXKHHV2S5A47uiTJHXZ0SZI77OiSJHfY0SVJ7rCjS5LcYUeXJLnDji5JcocdXZLkDju6JMkddnRJkjvs6JIkd9jRJUnusKNLktxhR5ckucOOLklyhx1dkuQOO7okyR12dEmSO+zokiR32NElSe6wo0uS3GFHlyS5w44uSXKHHV2S5A47uiTJHXZ0SZI77OiSVOYOuCF3BELuCITcEQi5IxByRyDkjkDIHYGQOwIhdwRC7giE3BEIuSMQckcg5I5Ahsx98f3s8YBXhyk63ZvNnhe79SFzP5xdl/vi/d+OB7xBjM/Oue/U0IC5n8yuy/3oxewRuce2a+67NTRc7mk5rs79fGOH3KPbMfcdGxos90+pdnLHNSaR++LdbEbuuN4Ucj/foJrN5k/IHdeZQO4H6YH90X+/J3dcZxq5z18fLy7kvnj/ND3m779dje5w1vhuiNvEOG3P/ctPT/eqYN6scx66oYFyf/Zhdcfr5P7vvWZo828GGSqmYFvuX160fdTBLAUNDZL70Yf0337u65HNVstH7lhuzf1kb3YxGEVDA77N1Ms9LdU8PQUt3qd//dJcgG334Dbn/r9/nCf87F/n/1r854fzf97/uJQ0pMr9sBnf6t3W580FyD24zbmnh+1vOz9U7QgaUuV+sB5V+3tyx+bc+2Wki6SNFUFDwtybe2bvAuQe3A12RHZyH7ohVe7p2Wf+7MOlC5B7cNflfvb7j3v1S1FBQ6rc0w/J/s/HvQuQe3Dbcz/79euXT7p7XgQNqXLv7Ed98LqzAUbuwW3L/aiz373Z0Th8Q7LcV3uPertRyR3bcl/vUd9/88des1998IZ0uS/T28LNaNkzg5XNuZ9U75y++u3P4+X6pWpl
2IakuadfHv1QLckvS3LHckvu6V2m+dvuRbrvmg7YkDr39PuD+qmI3LE59/ZdpPanCwcJDNWQKPd0d23Hn354vCR3LLfn3tnDfrDKXdGQKPfeqJpxkztukPtpvd9d0ZDwmJl2qQ47+1HJPbiNuadfPu78e33848ANqXJfH+J2lj7GuhrhQfWCZPHXcLeJsTntH+m7aqN6Q+lRqqU6HrLOXNCQ7KVqb7HqZ6rDzlsIiGlj7qsdkY2vmpCGb0i3Z+Z0/TbZw/q4h+ruWvI701Dc5tzr722pY2kPhRy8IeWOyKOXaRkeVE9H9UXepd/8fbjbxNhsyX25eP+kjaXzynXghvgGYARC7giE3BEIuSMQckcg5I5AyB2BkDsCIXcEQu4IhNwRCLkjEHJHIOSOQMgdgZA7AiF3BELuCITcEQi5IxByRyDkjkDIHYGQOwIhdwRC7gjkNrkf9L7r7MH+6786f+GbrLGNTzh3zz15dtz+hdyxjU84u+U+e/ix+Qu5YxufcHbMvf7KX3LHFXzCuWXunbGdrU7x+t3lvwA9PuHcPff6m+Ufb/oL0OETzi65V6cYuf9x01+ANZ9wdso9na+jOs0CueMKPuHslHt1GrSLo/7y09O0bTbff9O97Fl1LpKnr7u/W1RnIpnvv+WeMm0+4Qz96P5lffKo2fyb9oLv2t992/7f67NPdX6JCfIJZ6fc00uO/ibYSf9EU8/XS9dqThF42L0gZyebMp9wdn6pWv2m+cv6xK+L6nyw99s3E87viMfLxT/Xv0uDnqd3kxefnsw4+eSk+YSzS+7VOQOfd/+ShtI+vxw2pwtMl6vfRfvU/B/pd82VLQ7WJxbEBPmEc+fcV/fC+tbqv/RPWJ8G9l1v+J1LHLT31uXqvt07IysmxSecXQ8ieH55eS6Ound64XoJ0jg7T0OH7MicMp9wdsz9cfuXC7d69vuPe/XLizT6i1tY57/rPgudsDUzZT7h7JR7ux+oO+qzX79++aT7avqk86q6M8yLLl4Ek+ETzp1znz999bb7l3rURy8uDWXTPbC3M4ncJ84nnN3eVb30l/Vo9t/8sUfuSHzCGTb36qlm/uq3P9MPp1ePev36GhPnE86guaeXzfP2maoz6kv3wENemwbiE86guafxrV9KN6PtvcCudy71L4lp8wln6Nw7d72D2Ybdp/UipLsze9qj8AlHl/tps/u0PdozaZ6MDnpPVGzbTJpPOIPmngba3B2rQV86zKG9RPpHu/c1HRDBQQTT5RPOoLlXx2s+Sse1rY6LaDaz0t6jh2/TMWx77b2x2qP0rL0s+2kmzCccwY7Ixlftptei+z5D80z0rntZNmWmzCecgd9mWn/Q5OGHzuWr45VXw9v0oZR0YUyXTzgD575cVB8tfFA92XRfgGz/yGF9YUyXTzh8AzACIXcEQu4IhNwRCLkjEHJHIOSOQMgdgZA7AiF3BELuCITcEQi5IxByRyDkjkDIHYGQOwIhdwRC7giE3BEIuSMQckcg5I5AyB2BkDsCIXcEQu4IhNwRCLkjEHJHIOSOQMgdgZA7AiF3BELuCITcEQi5IxByRyDkjkDIHYGQOwIhdwRC7giE3BEIuSMQckcg5I5AyB2BkDsCIXcEQu4IhNwRCLkjEHJHIOSOQMgdgZA7AiF3BELuCITcEQi5IxByRyDkjkDIHYGQOwIhdwRC7giE3BEIuSMQckcg5I5AyB2BkDsCIXcEQu4IhNwRCLkjEHJHIOSOQMgdgZA7AiF3BELuCITcEQi5IxByRyDkjkDIHYGQOwIhdwRC7giE3BEIuSMQckcg5I5AyB2BkDsCIXcEQu4IhNwRCLkjEHJHIOSOQMgdgZA7AiF3BELuCITcEQi5IxByRyDkjkD+D16KWeyBATqFAAAAAElFTkSuQmCC&#34; style=&#34;height:120px;&#34;&gt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-size: 13px; text-align: left; vertical-align: middle; font-style: italic;&#34;&gt;Kellie Pickler&lt;br&gt; (Season 16)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Actor/Actress&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;130&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #A2DDE1;&#34;&gt;7.7&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #61C9D0;&#34;&gt;9.2&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #E6F6F6;&#34;&gt;6.9&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #ECF8F9;&#34;&gt;4.6&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;&lt;img cid=&#34;ofhjcdsqubvr__temp_ggplot.png&#34; src=&#34;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAu4AAAH0CAMAAABVZliYAAAAulBMVEUAAABmZmZmZoFmZp1mgYFmgZ1mgbZmnc+BZmaBgWaBgYGBgZ2BgbaBnbaBnc+Bts+BtuedZmadgWadgYGdnZ2dnbadts+dtuedz+edz/+2gWa2nYG2nZ22tp22ts+2z8+2z+e25//PnWbPnYHPtoHPtp3Pz7bPz8/P5+fP5//P///ntoHntp3nz53nz7bnz8/n57bn58/n5+fn5//n/+fn////z53/1wD/57b/58//5+f//8///+f///+NxnQNAAAACXBIWXMAAA9hAAAPYQGoP6dpAAAPmklEQVR4nO3da5vTxh2G8bU5LWxpKARypmQpSWELSZNCYWt//6/VHdmS5dOerL/mmXnu34tc4HXk0fi2bMtm52gO2DjKPQBgPOQOI+QOI+QOI+QOI+QOI+QOI+QOI+QOI+QOI+QOI+QOI+QOI+QOI+QOI+QOI+QOI+QOI+QOI+QOI+QOI+Qe5WiH3GOyxz0wpH7a/92B8DNj3gdwaeK7UX0WTPghblz5rupz74QTJvuWDux880Cfe3dMMM83dugRfV/zuffLAZN8AyGhU/yImOFrigy9V3zu3awc83sNo6TeBk/ygZjby4W+fNmbfO69rhYzu1+G0gk+FvO6W77UCT4Qs7otd+pt8Nw3g2NKN0ikvkTwQ2NCezQO630EPyymc0kv9QWCHxKTOddNfYG7aDjMpXTqDd60DsZ8IvVbXzC/mwZjPI/aL2E2cIAfhO0sFpT6gu09NSTPSSzpuN7hAH84wykssvWG4Z01MLMZLOr1+jbO0RzIavqKTn3J6g4bnM/slX1cX+EAfwCPuSv8NcwGj/sshMPU1ZR6w+FOi1H9zFV1XG/xguaWKp+3GltvEPytVDxrdb1g30Lwt1DtnFWd+kK1912cOqes7uN6p847L1KFM2bSesILmhuqbr58Wm9Ud//Fqmu6jA7srbruwGg1zZZf6wlfG7uBaqbK8MDeqeZODFfHTDm3ntRxL46ghokybz3hBc31lD9NxN4o/44cQ+Gz5P4qpocD/DUUPUe0vobgr1TuDHFg31buvTmSUieI1nfiAH+5IqeHA/t+Rd6hoylwdmj9UhzgL1Hc3BD7lYq7T8dT2NQQ+3XwNZp9ypoXYr+usu7X0RQ0LbxBvQkO8LsUMym0flMEv62QKSH22yD4TUVMCLHfVhF374hKmA9ivz1O0qyRnwzeoB6K4FfUp4LWB8AhvqU9DxzZh0LwDeVZIPYhcYifS+dO7EOjeN39p/YI5sWL7jznY+KkXwQuereH09xvWo9mmrziPnNkH8fRkd2BXnBniX1cR0c+3evtI7XnctSXO4MYcrtF7RrqDF9rZzghI6im7KV2gtaV1ZC80g5Qu7zSD/NCY6f2QhRcvM7Aqb0gpR7kZQZN7aUpMXiVIVN7gco7xIuMl9oLVVjwGqOl9nIVdYgXGCqfLZVOIKJryj9SWi9fMQf47OOk9ioUEnzuUVJ7LXKXdC25B0nu1SjhAJ95iNReE/3g8w6Q2iujHnzW4VF7fbR7zzk6aq+RdO/5BseHS5VS7j3b2Ii9WsIv4HONjNprJtt7poFRe91UD/B5hkXt1dPsndwRQ7L3LIOidgeKvecYE7V7EOw9w5Co3YVe7+OPiNp9yPVO7gik1vvY4+GrA17Eeh95OMTuRusDp3EHQ+2GlHofdSzUbkmod3JHOJ3exxwJtbuS6X3EgVC7L5XeyR1jEOl9vGFQuzO33Kndm0bvY42C2t1J9E7uGInC56sjDYHaoXCAH2cE1I4ke++jDIDa0bDIndqxlLv3EW6f2tHJ/QvWR7iJ3FMMIZl/
w3r8LeSeYEjJ+yvWw28g9/RCC7nDSdYlBaK3n3tyISdj78E3Te3Ylu/rBOSO8VWaO7Vjp2yraIRuPPesQlWuZTQit03u2CfTwgKBm849oxBWXe65JxTS8iykEbfp3PMJbVlW0ojbdO7phLgcS2nEbTr3bEIcucNJhrU04jadezIhb/TeyR0Zjb54TNymc08lCjD26jFxm849kyjByMvHxG0690SiBOQOJ+OuHxO36dzziDKMuoBM3KZzTyMKMeaSGnGbzj2LKAS5w8mIS8jEbTr3JKIY460hE7fp3HOIYpA7nIy2iEzcpnNPIQoy1ioycZvOPYMoyUjLyMRtOvcEoijjrCMTt+nc84eyjLKQTNymc08fykLucDLGSjJxm849eyjNCEvJxG069+ShNOQOJ/FrycRtOvfcoTzhi8nEbTr31KE85A4n0avJxG0698yhRMHLycRtOvfEoUTkDiexyyfFbTr3vKFMkctQkjvUxCVJ7pATlyS5Q05ckuQOOXFJkjvkxCVJ7pATlyS5Q05ckuQOOXFJkjvkxCVJ7pATlyS5Q05ckuQOOXFJkjvkxCVJ7pATlyS5Q05ckuQOOXFJkjvkxCVJ7pATlyS5Q05ckuQOOXFJkjvkxCVJ7pATlyS5Q05ckuQOOXFJkjvkxCVJ7pATlyS5Q05ckuQOOXFJkjvkxCVJ7pATlyS5Q05ckuQOOXFJkjvkxCVJ7pATlyS5Q05ckuQOOXFJkjvkxCVJ7pATlyS5Q05ckuQOOXFJkjvkxCVJ7pATlyS5Q05ckuQOOXFJkjvkxCVJ7pATlyS5Q05ckuQOOXFJkjvkxCVJ7pATlyS5Q05ckuQOOXFJkjvkxCVJ7pATlyS5Q05ckuQOOXFJkjvkxCVJ7pATlyS5Q05ckuQOOXFJkjvkxCVJ7pATlyS5Q05ckuQOOXFJkjvkxCVJ7pATlyS5Q05ckuQOOXFJkjvkxCVJ7pATlyS5Q05ckuQOOXFJkjvkxCVJ7pATlyS5Q05ckuQOOXFJkjvkxCVJ7pATlyS5Q05ckuQOOXFJkjvkxCVJ7pATlyS5Q05ckuQOOXFJkjvkxCVJ7pATlyS5Q05ckuQOOXFJkjvkxCVJ7pATlyS5Q05ckuQOOXFJkjvkxCVJ7pATlyS5Q05ckuQOOXFJkjvkxCVJ7pATlyS5Q05ckuQOOXFJkjvkxCVJ7pATlyS5Q05ckuQOOXFJkjvkxCVJ7pATlyS5Q05ckuQOOXFJkjvkxCVJ7pATl2Rk7oAacocRcocRcocRcocRcocRcocRcocRcocRcocRcocRcocRcocRcocRcoeRIXOffT95MODmUKPPx5PJk2y3PmTuZ5Orcp+9+cvHAW8Q5Tk494MaGjD3T5Orcv/wdHKf3L0dmvthDQ2Xe9qPy3O/eLFD7u4OzP3AhgbL/X2qndxxhSpyn72eTMgdV6sh94sXVJPJ9CG54yoV5H6aDuz3//M9ueMqdeQ+ffFxtpH77M2jdMw/ebUY3dmk9d0Qt4ky7c/9y0+PjptgXq5yHrqhgXJ//HbxwOvl/q/jdmjTbwYZKmqwL/cvT7s+lsHMAxoaJPcPb9N/13NfjWyy2D9yx3xv7p+OJ5vBRDQ04MdMa7mnvZqmp6DZm/Snn9sr8Nrd3O7c//e3i4Qf//PiT7N//3Dxxzvv5iENReV+1o5v8Wnrk/YK5G5ud+7psP1t7y9NOwENReV+uhpVdzm5Y3fu62Wkq6QXKwENBebePjLXrkDu5q5xIrKX+9ANReWenn2mj99uXYHczV2V+/lvPx4v34oGNBSVe/pLcvL3j2tXIHdz+3M//+XrZw/7Z14CGorKvXce9e6L3gswcje3L/cPvfPu7YnG4RsKy31x9mjtNCq5Y1/uqzPqJy9/P27Pqw/eUFzu8/SxcDtazsxgYXfun5pPTp//+sfH+eqtamPYhkJzTxd++KHZk5/n5I75ntzTp0zTV/2r9D81HbCh6NzT5afLpyJyx+7cu0+Rur9t
fElgqIaCck8P12786S8P5uSO+f7ce2fYTxe5RzQUlPvaqNpxkzuukfvn5Xn3iIYCvzPT7dVZ7zwquZvbmXu68EHvz6vvPw7cUFTuq6+4nad/xroY4WnzhmT253C3idJ8Xv+m76KN5gOl+6mW5vuQy8wDGgp7q7q2W8tnqrPeRwjwtDP3xYnI1ldtSMM3FHdm5vPqY7J7y+89NA/XnL8zDdntzn35e1uWsXRfhRy8ocgTkR+epX242zwdLa/yOl3y1+FuE6XZk/t89uZhF0vvnevADfEbgGGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GHkJrmfrv2us7snL/7s/YTfZI19dMK5fe7J44/dT8gd++iEc1juk3vv2p+QO/bRCefA3Je/8pfccQmdcG6Ye29s54slXr/b/gmwRiec2+e+/M3yD3b9BOjRCeeQ3JslRu682/UTYEUnnINyT+t1NMsskDsuoRPOQbk3y6BtjvrLT4/Sa7Ppycv+dc+btUgevehfNmtWIpmevOKRUjedcIY+un9ZLR41mX7TXfF1d9m33f+9Wn2qdyEqpBPOQbmntxzrL8E+rS809WS1d512icCz/hVZnaxmOuEc/Fa1uaT9yWrh11mzHuyd7sOEiwfix/nsH6vL0qCn6dPk2fuHExafrJpOOIfk3qwZ+KT/kzSU7vnlrF0uMF1v+Sna+/b/SJe1G5udrhYWRIV0wrl17otH4fLWlj9ZX7A+Dey7teH3rnHaPVrni8f22oqsqIpOOId+ieDJ9v5sjnpteeHlHqRx9p6GzjiRWTOdcA7M/UH3k41bPf/tx+Pl24s0+s1XWBeX9Z+FPvFqpmY64RyUe3ceqD/q81++fvaw/276U+9ddW+YmzavgmrohHPr3KePnr/q/2Q56g9Pt4ay6xG4djKJ3CunE85hn6pu/WQ1mpOXvx+TOxKdcIbNvXmqmT7/9Y/0l8+Xj3r1/hqV0wln0NzT2+Zp90zVG/XWI/CM96ZGdMIZNPc0vtVb6Xa0a2+wlyeX1q+JuumEM3TuvYfe6WTH6dPlLqSHM2faXeiEE5f75/b0afdtz6R9Mjpde6LitU3VdMIZNPc00Pbh2Ax662sO3TXSH7qzr+kLEXyJoF464Qyae/N9zfvpe22L70W0L7PS2aN7r9J32I67R2NzRulxd13O01RMJ5yAE5Gtr7qXXrP+5wztM9Hr/nV5KVMznXAG/php9Q9N7r3tXb/5vvJieLv+UUq6MuqlE87Auc9nzT8tvNs82fTfgOz/J4fLK6NeOuHwG4BhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxh5P/opnm7f+dlrAAAAABJRU5ErkJggg==&#34; style=&#34;height:120px;&#34;&gt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-size: 13px; text-align: left; vertical-align: middle; font-style: italic;&#34;&gt;Jordan Fisher&lt;br&gt; (Season 25)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Reality TV Star&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;26&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #A2DDE1;&#34;&gt;7.7&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #81D2D7;&#34;&gt;7.7&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #E3F4F5;&#34;&gt;7.7&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #EFF9FA;&#34;&gt;3.8&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;&lt;img cid=&#34;kgwaihvujrxc__temp_ggplot.png&#34; src=&#34;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAu4AAAH0CAMAAABVZliYAAAAulBMVEUAAABmZmZmZoFmZp1mgYFmgZ1mgbZmnc+BZmaBgWaBgYGBgZ2BgbaBnbaBnc+Bts+BtuedZmadgWadgYGdnZ2dnbadts+dtuedz+edz/+2gWa2nYG2nZ22tp22ts+2z8+2z+e25//PnWbPnYHPtoHPtp3Pz7bPz8/P5+fP5//P///ntoHntp3nz53nz7bnz8/n57bn58/n5+fn5//n/+fn////z53/1wD/57b/58//5+f//8///+f///+NxnQNAAAACXBIWXMAAA9hAAAPYQGoP6dpAAAQJElEQVR4nO3de3/TRhqG4TjlEMiyZXugZ0rD0i7NlnbbBQNrf/+vtRnZkiXHzsl6Nc/Mc19/9AeJa71jbtuyrMRHS8DGUe4BgOmQO4yQO4yQO4yQO4yQO4yQO4yQO4yQO4yQO4yQO4yQO4yQO4yQO4yQO4yQO4yQO4yQO4yQO4yQO4yQO4yQe9GOenLPUgJupDKtE5/3UP31uHXKs935EMlfgVumLFenPnigzz2qIm6Ugtwo9V7zuefVw01Situ13j7I555aDLdHGW7fOsXvwI1RgrvGvi4+9/g6uCnk3WEvhuD34IYQd3DrBN/DzSBtnNhXwfNPTe7Sxot9VXzu9eTHTaBr3Njn7NKQu6zDX6DuDD73sjJzX7+okNab3r3/wb1Xryoq9ib43IvLyXrxosIe2te9Gz/C+65cVmzsq+JzrzEX24WLinmFerl3039302WrmqT1VfC5l5qF56pVTVe7ae+WixY1zX6Mde+OaxY1bexzyx14vxWrmrx2x+Dd1qtq4h2ZTfC5Fz4ts+WKyhW7Xe9eqxWVL3a33q0WKypr7V478EZLFZVxR6YLPvdtMBmflYrKH/vcqHebhYqSqN2nd5d1ihKp3WYH3mOVqmRqn5s8wFssUtNE5/remEMKDmvUpNV6YtCCwRI16dXu0Hv9K9SkWLtB79UvUJNm7fX3Xvv6NKnWXn3vlS9PkdoRmaG6g6h7dYqUW0+qfsOp5rVJUq99XvUDfMVLk1RA7TX3Xu/KJBVRe8W9V7swSYXUXm/vta5LUjG1V9t7pctSJH388ZI6w6hzVYLKin1eae9VLkpQabHP6zwAX+GSFBVY+7zGB/j6VqSozNor7L26BSkqtfb6eq9tPYrKrb263itbjqKSa6+t97pWI6i4A5DbqiqkqsXoKT72eV2917QWPRXEPq+q94qWoqeO2mvqvZ6V6Kml9op6r2YheuqpvZ7ea1mHnppqr6b3SpYhp4ZDMgN1hFLHKuTUFvu8kt6rWISa6h7aGzWkUsMa1FQZ+7yK3itYgppaa6+h9/JXoKbe2ivovfgFiKlzt71Tei6lzy+m7tjnxfde+Phiqq+99N7Lnl6MQe2F91708Foq323vlJxMybNLcYl9XnTvBY8uxSf2ecm9lzu5FKvaC+692MGlmNVO7tbcai+391LnFmL0InWj0G4KHVuHZezzUn8/cJFDCzGNPSkxnRJn1uH60L5SYDsFjqzDOvZ5ib2XN7EO99rJ3Yl97QX2XtzAKrx321ul5VPavCKIfa2wfgobVwSxd8oKqKxpNfDQ3ldUQUUNq4HYh0pKqKRZJfDQfklBDRU0qgRi36GciMqZVMARD+27FVNRMYMKoPW9SsmolDnz45H9KoV0VMiY+RH71coIqYwps+Oh/TplhFTGlLkR+/WKKKmIIfPieMzNlJBSCTPmRes3VUBLBYyYFY/st6D/49ryA2ZF7LeknpP6fDkR+62p56Q+X0bEfgfiPYmPlw2HY+5IOyjt6XKh9buTLkp6uEyI/SDKSSnPlgex
H0q4KeHRcmCXfQTCTQmPNjlaH4luVLqTTY3WxyNblexgEyP2UameTiA61qSO2IsZn2ZYmlNNidRjSJYlOdR0eFwPI1mW5FDTYB8mlmJaijNNgNQnINiW4EjhSH0ienHpTRTp6IiH9SnJ1SU3UBhCz0AtL7V5xnd0xGN6Nmp5qc0zmiMyVyDWl9g4d3e0Jfe/M1a0TieQGuamttMmb2VKiSnNciXSLpZQY0Kj7Ebi5dOJTGeSy8i8EjqR6UwyQOlVkalMZpANSq+PSmYqc7RIvU4inYmM0eBhvWIaoWlMkZB63SRKkxiCfRgHCqlJzEDrFgRayz8CrdsQiC339ondSO7aMudO7GZy955z+8RuxzZ3Xp9aytx7rs3TuqnMe895tkrttrL2nmPjxG4t68vFDJvMfXsjr4y9T71pXqEi449rT7xhWsc84+P7pBvmkR0r2Q4ITrmt3DcyZGTqfcLNUjs2Mh0Bn2xD1I6+LL1PtVFix5Ysb/lMsxVqx7Zacyd27JLjLc4JNpH7ZoWo6XuP3yK1Y5/Jew/fILVjv6lPJwjeHLvtuNq0vcdujdhxjXpy56Ed15v2rK3Aq859Q6IIk562FXfVuW9GFGLK87birjr3rYhSTNc7uSO/6U5UjLvq3LchyjHZmYpxV537JkQ5yB1OJuqd3CFhmtMJyB0iJjkXPe6qc998KMwUJ6PHXXXuWw+lmeBs9Lirzn3joTjxp6PHXXXu2w7FIXc4ie6d3KEk+seN4q469y2HEgX/vFHcVee+4VCk0N7JHWIi318ld6iJS5LcIScuSXKHnLgkyR1y4pIkd8iJS5LcIScuSXKHnLgkyR1y4pIkd8iJS5LcIScuSXKHnLgkyR1y4pIkd8iJS5LcIScuSXKHnLgkyR1y4pIkd8iJS5LcIScuSXKHnLgkyR1y4pIkd8iJS5LcIScuSXKHnLgkyR1y4pIkd8iJS5LcIScuSXKHnLgkyR1y4pIkd8iJS5LcIScuSXKHnLgkyR1y4pIkd8iJS5LcIScuSXKHnLgkyR1y4pIkd8iJS5LcIScuSXKHnLgkyR1y4pIkd8iJS5LcIScuSXKHnLgkyR1y4pIkd8iJS5LcIScuSXKHnLgkyR1y4pIkd8iJS5LcIScuSXKHnLgkyR1y4pIkd8iJS5LcIScuSXKHnLgkyR1y4pIkd8iJS5LcIScuSXKHnLgkyR1y4pIkd8iJS5LcIScuSXKHnLgkyR1y4pIkd8iJS5LcIScuSXKHnLgkyR1y4pIkd8iJS5LcIScuSXKHnLgkyR1y4pIkd8iJS5LcIScuSXKHnLgkyR1y4pIkd8iJS5LcIScuSXKHnLgkyR1y4pIkd8iJS5LcIScuSXKHnLgkyR1y4pIkd8iJS5LcIScuSXKHnLgkyR1y4pIkd8iJS5LcIScuSXKHnLgkyR1y4pIkd8iJSzIyd0ANucMIucMIucMIucMIucMIucMIucMIucMIucMIucMIucMIucMIucMIucPImLkvvp09HPHqUKP3J7PZ02xbHzP389l1uS9e/e3tiBtEeQ7O/aCGRsz93ey63N98NntA7t4Ozf2whsbLPa3j6twvdnbI3d2BuR/Y0Gi5/5ZqJ3dco4rcFy9nM3LH9WrI/WKHajY7fkTuuE4FuZ+lB/YH//2W3HGdOnI/fvZ2sZX74tXj9Jh/+mI13fms9c0Y20SZ9uf+4YfHJ00wzzc5j93QSLk/eb264/Vy//dJO9rxV6OMihrsy/3DZ10f62CWAQ2Nkvub1+m/w9w3k81W6yN3LPfm/u5kth1MREMjvs00yD2t6jg9BS1epT/91F6AfXdzu3P/3z8uEn7yr4s/Lf7z3cUfP/l1GdJQVO7n7Xyrd1ufthcgd3O7c08P21/3/tK0E9BQVO5nm6m6r5M7duc+LCNdJO2sBDQUmHt7zxxcgNzN3eBAZC/3sRuKyj09+xw/eX3pAuRu7rrcP/7+/cn6pWhAQ1G5p78kpz++HVyA3M3tz/3j
z198/qh/5CWgoajce8dR7z3r7YCRu7l9ub/pHXdvDzSO31BY7qujR4PDqOSOfblvjqifPv/jpD2uPnpDcbkv09vC7bQcmcHK7tzfNe+cfvnLn2+Xm5eqjXEbCs09ffHNd81KflqSO5Z7ck/vMh2/6F+k/67piA1F556+frZ+KiJ37M69exep+9vWSQJjNRSUe7q7dvOnvzxckjuW+3PvHWE/W+Ue0VBQ7oOp2rnJHTfI/f36uHtEQ4HnzHSrOu8dRyV3cztzT1982Pvz5vzHkRuKyn1zitvH9GOsqwnPmhcki7/G2yZK8354pu+qjeYNpQepluZ8yHXmAQ2FvVQdLGv9THXeewsBnnbmvjoQ2fq0DWn8huKOzLzfvE12f33eQ3N3zfk705Dd7tzXv7dlHUt3KuToDUUeiHzzeVrDvebpaH2Rl+krfx9vmyjNntyXi1ePulh6r1xHbojfAAwj5A4j5A4j5A4j5A4j5A4j5A4j5A4j5A4j5A4j5A4j5A4j5A4j5A4j5A4j5A4j5A4j5A4j5A4j5A4j5A4j5A4j5A4j5A4j5A4j5A4jt8n9bPC7zu6dPvur9x1+kzX20Qnn7rknT9523yF37KMTzmG5z+7/2n6H3LGPTjgH5r7+lb/kjivohHPL3HuzfVx9xOs3l78DDOiEc/fc179Z/uGu7wA9OuEcknvzESOf/LrrO8CGTjgH5Z4+r6P5mAVyxxV0wjko9+Zj0Lan/vDD47Rvdnz6vH/Zj81nkTx+1v/aovkkkuPTF9xT6qYTztiP7h82Hx41O/6qu+DL7mtfd//35tOnel9EhXTCOSj39JJjuAv2bvhBU083q+u0HxF43r8gn05WM51wDn6p2nyl/c7mg18XzefBftK9mXBxR3y7XPxz87U09HF6N3nx26MZHz5ZNZ1wDsm9+czAp/3vpFG655fz9uMC0+XW76L91v4f6WvtlS3ONh8siArphHPn3Ff3wvXW1t8ZfmB9Guybwfi9S5x199bl6r49+ERWVEUnnENPInh6eT3bUw8+Xni9gjRn72nonAOZNdMJ58DcH3bf2drqx9+/P1m/vEjTb+9hXXyt/yz0jr2ZmumEc1Du3XGg/tQff/7i80f9V9Pveq+qe2Nu274IqqETzp1zP3785Yv+d9ZTv/ns0ii77oGDg0nkXjmdcA57V/XSdzbTnD7/44TckeiEM27uzVPN8Ze//Jn+8v7qqTevr1E5nXBGzT29bD7unql6U1+6B57z2tSITjij5p7m27yUbqcdvMBeH1waXhJ10wln7Nx7d72z2Y7Dp+slpLszR9pd6IQTl/v79vBpd7Zn0j4ZnQ2eqNi3qZpOOKPmngZt747N0JdOc+gukf7QHX1NJ0RwEkG9dMIZNffmfM0H6by21XkR7W5WOnp0/0U6h+2kuzc2R5SedJflOE3FdMIJOBDZ+rTb9Vr032don4le9i/LrkzNdMIZ+W2mzQ+a3H/du3xzvvJqvF0/lJIujHrphDNy7stF86OF95onm/4LkP0/cri+MOqlEw6/ARhGyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1G/g8Z0zF0wAeixgAAAABJ
RU5ErkJggg==&#34; style=&#34;height:120px;&#34;&gt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-size: 13px; text-align: left; vertical-align: middle; font-style: italic;&#34;&gt;Kaitlyn Bristowe&lt;br&gt; (Season 29)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Model&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;14&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #A9E0E3;&#34;&gt;7.1&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #8BD5DA;&#34;&gt;7.1&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #FFFFFF;&#34;&gt;0&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #E2F4F5;&#34;&gt;7.1&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;&lt;img cid=&#34;vrtukcawgqsn__temp_ggplot.png&#34; src=&#34;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAu4AAAH0CAMAAABVZliYAAAAulBMVEUAAABmZmZmZoFmZp1mgYFmgZ1mgbZmnc+BZmaBgWaBgYGBgZ2BgbaBnbaBnc+Bts+BtuedZmadgWadgYGdnZ2dnbadts+dtuedz+edz/+2gWa2nYG2nZ22tp22ts+2z8+2z+e25//PnWbPnYHPtoHPtp3Pz7bPz8/P5+fP5//P///ntoHntp3nz53nz7bnz8/n57bn58/n5+fn5//n/+fn////z53/1wD/57b/58//5+f//8///+f///+NxnQNAAAACXBIWXMAAA9hAAAPYQGoP6dpAAAQUElEQVR4nO3da5sSRx6GcZl4GHXduDmYszHjmqyZjckmq6gL3/9r7XRDQ8PAwND1dP2rnvv3IpcDpLqwbpqmweHOHLBxJ/cEgPGQO4yQO4yQO4yQO4yQO4yQO4yQO4yQO4yQO4yQO4yQO4yQO4yQO4yQO4yQO4yQO4yQO4yQO4yQO4yQO4yQO3a7syH3bBKp5X4glS7w6YZKqi/+DiCdHZlPd1Sfe5oDlDx3pHSo9M3mc8/2RKXOGwkd3qvvSj73rE9R5KSR0K1LL7n48maMlE5tvSs+9/xvqbT5Ip3bH8IUH3xZs0U6w1Pvgi+ooYKmioRSxb4sPvfdOVYxE0UyKQ5iCg2+kGkimeSpL4PPfb+OUsYskYoo9mkhO/gS5ohUdLG3wee+e4cVMEUkoo19WsJJmujzQyry2BfF576bNws+PSQyTuzT6IfwoSeHREaLvQ0+9729QeS5IY1RY4/de+CpIY2RYw/de9yZIYX0b6Ae1XvUI/ig00ISWVpfFp/7vu8Uc1ZIIl/sUXsPOSmkkHHXHrb3iHNCArljj9l7wCkhgfyxTyP2Hm9GGCzP6ZgdwtUVbkIYLEjrjWh5RZsPhoqyZ18I1lew6WCoULFPo31kLNRkMFSsXftCpMQizQVDBYx9Gqr3QFPBMGHOx1wTJ7I4M8EwUVtvhKkszEQwSNg9+0KUzKLMA4PEjn0apvcg08AQwXftrRihxZgFhigg9mmQ3kNMAqeLez5mW4Q3nAJMAQOU0norf2z5Z4DTFbNnX8peW/YJ4HSFxT7N33vu7eNkpe3aG7lzy719nKrA2KfZeyf3MpW4a2/lDY7cS1Rs7NPMvZN7gQqOfZq3d3IvTsm79lbG5si9NKXHPs3ZO7kXpoLaM/ZO7iUp5wMyB+TKjtwLUknrjUzdkXsxatmzL+QJj9xLUVXs5I4bVVZ7pt7JvQh1Hcgs5EiP3EtQYezTLL2Te3jVnH3cRu64ptLWG+PHR+7BVVx7ht7JPbRaj2M6Y/92AnKPrPLYp6Pv38k9rtp37a1xAyT3qCxin47cO7kHZRL7dNzeyT0kl117g9zdGcU+HbV3co/HadfeGi9Ccg/HLXZy91XtB2RuNFqF5B6KY+uNsTIk90hcax+td3KPw/I4pjNOiOQehnPs05F6J/cgrHftrTFSJPcY7GMndxueZx+vGaFFcs+P1pf0MZJ7buzZV8i9esTeI6+R3LNi175JnSO550Ts28Q9kns2nI/ZgdwrRes7aYMk9zzYs+8jLZLcsyD2/ZRJknsG7NpvJGyS3EdH7Aco/+6FY2MHYj9I+bcvHBvXEfthyr9+4djYxq79GMoFEI6NLcR+FOUKCMfGBnbtR1KugXBsrPGJgeMpl0E4NlZo/RaU6yAcG0vs2W9FuRLCsbFA7LejXArh2Giwa78t5WIIxwavUE+hXA/h2KD1UygXRDi2O/bsp1EuiXBsb8R+KuWiCMd2RuynUy6LcGxjxD6Acl2EY7vidMwwyqURju2J1odSLo5wbEfEPpxyeYRj+yH2
FJQLJBzbDrEnoVwh4dheeIGainKRhGM7ofV0lMskHNsHsaekXCjh2C6IPS3lUgnHtsAhe3LK1RKObYDWBZTrJRy7esQuoVwx4diVI3YR5ZoJx64ascsoV004drXu8PpUSblywrErRepiyrUTjl0lYpdTrp5w7AoR+wiU6yccuzIcsY9EuYbCsatC6qNRrqJw7HqwXx+TciGFY1eC1kemXEvh2DWg9fEpl1M4dvFoPQvligrHLhytZ6JcU+HYBeOkY0bKdRWOXSpSz0u5tMKxi0Tr2SlXVzh2cTiECUG5wsKxi0LqYShXWTh2MUg9FOVKC8cuA6lHo1xs4djxsVuPSLngwrFDu0PqUSlXXTh2WJQemnLlhWNHxE49PuXyC8eOhtLLoExAOHYg7NQLouxAOHYMlF4aZQzCsbOj9CIpixCOnROll0uZhXDsPO5QeuGUcQjHHh+h10AZiHDsMbFPr4cyE+HYIyH0yihbEY6tR+k1UgYjHFuK0qulrEY4tgql102ZjnBsBUqvnzIf4diJsVM3oWxIOHZClG5E2ZFw7DTYqbtRxiQcOwFKN6TsSTj2QOzVTSmbEo49AKkbU3YlHPtUpO5NmZZw7JPQuj1lXcKxb41DGExNcid1LCgjE459C7SOFWVnwrGPRevoU6YmHPu4CdA6NilrE459zOZpHduUvQnHPrhtYscOyuSEY9+8YVrHbsrqhGPfsFVax17K8IRj790mreMGyvSEY+/ZIrHjRsr4hGPv2hyx4xBlf8Kxr2+M1nGYskDh2FtbInYcRRmhcOyN7dA6jqTMUDj2eiPEjuMpSxSO3W2C1nEbyhaFYy82QOy4HWWNwrHnxI4TKHtUjk3sOIEySeHQuf/aUCZdksrcc/+toVC6JMkd4eiSJHeEo0uS3BGOLklyRzi6JMkd4eiSJHeEo0uS3BGOLklyRzi6JMkd4eiSJHeEo0uS3BGOLklyRzi6JMkd4eiSJHeEo0uS3BGOLklyRzi6JMkd4eiSJHeEo0uS3BGOLklyRzi6JMkd4eiSJHeEo0uS3BGOLklyRzi6JMkd4eiSJHeEo0uS3BGOLklyRzi6JMkd4eiSJHeEo0uS3BGOLklyRzi6JMkd4eiSJHeEo0uS3BGOLklyRzi6JMkd4eiSJHeEo0uS3BGOLklyRzi6JMkd4eiSJHeEo0uS3BGOLklyRzi6JMkd4eiSJHeEo0uS3BGOLklyRzi6JMkd4eiSJHeEo0uS3BGOLklyRzi6JMkd4eiSJHeEo0uS3BGOLklyRzi6JMkd4eiSJHeEo0uS3BGOLklyRzi6JMkd4eiSJHeEo0uS3BGOLklyRzi6JMkd4eiSJHeEo0uS3BGOLklyRzi6JMkd4eiSJHeEo0uS3BGOLklyRzi6JMkd4eiSJHeEo0uS3BGOLklyRzi6JMkd4eiSJHeEo0uS3BGOLklyRzi6JMkd4eiSJHeEo0uS3BGOLklyRzi6JMkd4eiSVOYOREPuMELuMELuMELuMELuMELuMELuMELuMELuMELuMELuMELuMELuMELuMJIy99m3kwcJh0ON3p9PJk+zbT1l7peTQ7nPXv3tbcINojyDcx/UUMLc300O5f7ms8l9cvc2NPdhDaXLvbkfN+d+dbBD7u4G5j6woWS5/9bUTu44oIrcZy8nE3LHYTXkfnVANZmcPSR3HFJB7hfNjv3+f78ldxxSR+5nz97OtnKfvXrU7PMfv1jM7nLS+SbFNlGm/bl/+OHReRvM83XOqRtKlPuT14sHXi/3f593Uzv7KslUUYN9uX/4bNXHMpi5oKEkub953fx3M/f1zCaL+0fumO/N/d35ZDsYRUMJ32bayL25V2fNU9DsVfOnn7obcOxubnfu//vHVcJP/nX1p9l/vrv64ye/ziUNqXK/7Oa3eLf1aXcDcje3O/dmt/1174e2HUFDqtwv1rNaXU7u2J37ZhnNTZqDFUFDwty7R+bGDcjd3BEnInu5p25IlXvz7HP25PW1
G5C7uUO5f/z9+/PlS1FBQ6rcmx8aj398u3EDcje3P/ePP3/x+cP+mRdBQ6rce+dR7z7rHYCRu7l9ub/pnXfvTjSmb0iW++Ls0cZpVHLHvtzXZ9QfP//jvDuvnrwhXe7z5m3hbracmcHC7tzfte+cfvnLn2/n65eqrbQNSXNvLnzzXXtPfpqTO+Z7cm/eZTp70b9J/13ThA2pc28uv1g+FZE7due+ehdp9dPWhwRSNSTKvXm4rubf/PBgTu6Y78+9d4b9YpG7oiFR7huz6uZN7jgi9/fL8+6KhoSfmVndq8veeVRyN7cz9+bCB70/rz//mLghVe7rj7h9bP4Z62KGF+0Lktlf6baJ0rzf/KTvoo32DaX7TS3t5yGXmQsakr1U3bhby2eqy95bCPC0M/fFicjOp11I6RvSnZl5v36b7N7ycw/twzXn70xDdrtzX/7elmUsq49CJm9IeSLyzefNfbjbPh0tb/KyueTv6baJ0uzJfT579XAVS++Va+KG+A3AMELuMELuMELuMELuMELuMELuMELuMELuMELuMELuMELuMELuMELuMELuMELuMELuMELuMELuMELuMELuMELuMELuMELuMELuMELuMHKb3C82ftfZ3cfP/updw2+yxj5xwjk998aTt6tryB37xAlnWO6Te79215A79okTzsDcl7/yl9xxgzjh3DL33tw+Lr7i9Zvr1wAb4oRzeu7L3yz/YNc1QE+ccIbk3n7FyCe/7roGWIsTzqDcm+/raL9mgdxxgzjhDMq9/Rq07Vl/+OFRc2x29vh5/7Yf2+8iefSsf9ms/SaSs8cveKTULU44qffuH9ZfHjU5+2p1w5ery75e/d/rb5/qXYgKxQlnUO7NS47NQ7B3m1809XR971a6rwi87N+QbyerWZxwBr9UbS/prll/8eus/T7YT1ZvJlw9EN/OZ/9cX9ZM+qx5N3n228MJXz5ZtTjhDMm9/c7Ap/1rmqmsnl8uu68LbG63fBftt+7/aC7rBptdrL9YEBWKE87JuS8ehcutLa/Z/ML6ZmLfbEy/d4uL1aN1vnhsb3wjK6oSJ5yhHyJ4ev3+bM964+uFl/egmWfvaeiSE5k1ixPOwNwfrK7Z2urH378/X768aGa/fYR1dVn/WegdRzM1ixPOoNxX54H6s/748xefP+y/mn7Xe1Xdm+a27ZugGnHCOTn3s0dfvuhfs5z1m8+uTWXXI3DjZBK5Vy5OOMPeVb12zXo2j5//cU7uaMQJJ23u7VPN2Ze//Nn88P7mWa9fX6NyccJJmnvzsvls9UzVm/W1R+Alr02NxAknae7N/NYvpbvZbrzAXp5c2rwl6hYnnNS59x56F5Mdp0+Xd6F5OHOm3UWccHS5v+9On64+7dnonowuNp6oOLapWpxwkubeTLR7OLaTvvYxh9Utmj+szr42H4jgQwT1ihNO0tzbz2vebz7XtvhcRHeY1Zw9uvei+Qzb+erR2J5RerK6LedpKhYnHMGJyM6nq0OvWf99hu6Z6GX/thzK1CxOOInfZlr/Q5N7r3u3bz+vvJjern+U0twY9YoTTuLc57P2nxbebZ9s+i9A9v+Tw+WNUa844fAbgGGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE
3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GHk/z3YZrkp1fN3AAAAAElFTkSuQmCC&#34; style=&#34;height:120px;&#34;&gt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-size: 13px; text-align: left; vertical-align: middle; font-style: italic;&#34;&gt;Brooke Burke&lt;br&gt; (Season 7)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Media Personality&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;21&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #C7EAEC;&#34;&gt;4.8&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #B4E3E6;&#34;&gt;4.8&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #EEF8F9;&#34;&gt;4.8&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #9BDADF;&#34;&gt;23.8&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;&lt;img cid=&#34;hlpufxbiwaqd__temp_ggplot.png&#34; src=&#34;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAu4AAAH0CAMAAABVZliYAAAAulBMVEUAAABmZmZmZoFmZp1mgYFmgZ1mgbZmnc+BZmaBgWaBgYGBgZ2BgbaBnbaBnc+Bts+BtuedZmadgWadgYGdnZ2dnbadts+dtuedz+edz/+2gWa2nYG2nZ22tp22ts+2z8+2z+e25//PnWbPnYHPtoHPtp3Pz7bPz8/P5+fP5//P///ntoHntp3nz53nz7bnz8/n57bn58/n5+fn5//n/+fn////z53/1wD/57b/58//5+f//8///+f///+NxnQNAAAACXBIWXMAAA9hAAAPYQGoP6dpAAAQF0lEQVR4nO3d63rbxhlFYVPxQbbqxs3BOTuOXCd11DhpUpu2St7/bVUACRCUeMZszDfY6/2RxyKZwRDPIgiAkHhvDti4l3sCwHDIHUbIHUbIHUbIHUbIHUbIHUbIHUbIHUbIHUbIHUbIHUbIHUbIHUbIHUbIHUbIHUbIHUbIHUbIHUbIHUbIHendW8k9lXXBpoOyLRufrsSqPso8ULrbna8L0nyAKaB8u0pfaz73PDMvHyNwUOtN8XlnmnXpKN7OXZhwm3hyRw9Hpp69eHLHyU6LfVl8nhlnWSpGoEfsdfBZ5pxjoSjesXvsQTbw5I7j9W99EfzwEx98iShdotinGY5ZyR1HShb7ovhh5z7o0lC8dJv2pvchEyR3HC7BAeqm4Ad8AsMtCqVTtD5s7+SOA0m27Mveh8qQ3HEYXex18AM9iWEWg8IJN+3L3gcpkdxxAHXsdfBDPI8BloHCyTftg/VO7thjoNgH6Z3csdtgsQ/RO7ljpyFr1/dO7thhuB2ZYXond2w3dOzy3skdW2WoXdw7uWMzzeVgh/QubJLcsVGm1ivKZyUcG+XKWDu5Y2A5ayd3DCrXXju5Y3h5Yyd3DCjzpp3cMaDssZM7BhOgdnLHMPLvyFSUT1A4NgoTInZyxyCC1E7uGECU2skdemFqJ3eoxThIXVA+TeHYKEWk2MkdWqFiJ3dIBaud3CEUrXZyh0642skdMvFqJ3eoBKyd3CGR7Y8N7KZ8xsKxEVvI1qfkDoWotZM70gtbO7kjubi1kztSC1w7uSOxyLWTO1IKev6xpXzqwrERUuzWp+SOhMLXTu5IJn7t5I5UCqid3JFICbWTO9IoonZyRwrBzz+2lKtAODYiKSV2ckd/xcRO7uitoNrJHT2VVDu5o5dydttryjUhHBsxlBU7uaOP0mond5yuuNrJHacqbLe9plwdwrGRW4mxkztOU2Ts5I6TFFo7ueMEpdZO7jhambvtNeVaEY6NfMqNndxxpII37VNyx3GKjp3ccYyyN+1TcscRSo+d3HG48msndxyo+B2ZinL9CMfG0MYQO7njENH/sO/BlOtIODaGNJLWp+SO/cZTO7ljj7Hsx9SU60k4NoYyptjJHTuNatM+JXfsMLbYyR1bjS92ck/n3l25p9THGGMn975WbW9YuQWHP8rYyb2HbZFvUlj149y0T8n9NMeU3lVE86O5YmAD5WoTjp3RiaWvBN/Qj7f1Kbkfp3fqK0GTH/GWvaJcc8Kxc0j/Jh9uMz/y2Mn9ULod2jjJjz52cj+I/OAtQPJjPkBdUa5A4dgDGqqDrMlbtD4l932G7SBT8S6xk/tuGToY/PDVYy9mSbkehWMPIWMHgyVv1fqU3LfK3oE+ebfWp+S+RZAQlMkHeYrDUq3Mecm5hypBkrzhhr2WfEV2VqlwbKWAJaQ8fj318rZRSLQON65X4dgygVNI0XzgpzeIJI1sWbXCsVXCx3D65ZTWW/VG6l6661c4tkYxORz32yJbf9vKjzIe4dgK5RWx+5cDd/1aoStlPsKx
BYqu4vYvhVP5Zsp+hGOnRh0elAkJx06L1l0oIxKOnRSx21BWJBw7ITbtRpQdCcdOh9idKEMSjp0Km3YvypSEY6dB7G6UMQnHToHY/ShzEo6dALEbUvYkHLs/anekDEo4dl/syHhSJiUcuydiN6VsSjh2P9TuShmVcOweuBrMmLIr4dino3VnyrCEY5+M2q0pyxKOfSL2Y8wp2xKOfRpid6eMSzj2SajdnrIu4djH44QMfHKndUxtcqd2VJSJCcc+ErWjpmxMOPZR2GvHkrIy4djHIHY0lJkJxz4CtaOl7Ew49uGoHSvK0IRjHzoFdtvRpWxNOPaBM8i9dhGMMjbh2IdNIPfKRTTK2oRjH7T83OsW4ShzE459yOJzr1rEo+xNOPb+hVM77lIWJxx777Jzr1eEpExOOPaeJVM7NlJGJxx794Jzr1REpaxOOPbO5eZepwhLmZ1w7B1LpXZspQxPOPb2heZeoYhMWZ5w7K3LzL0+EZoyPeHY2xaZe3UiNmV7wrG3LDH32kRwyviEY29cHrVjD2V+wrE3LI3YsZcyQOHYdxeWe0WiBMoChWPfWVbu9YgiKBMUjn17UblXI8qgbFA49q0l5V6LKIQyQuHY6wvKvRJRCmWFwrE7S+GUDA6mDFE49mohuVcgSqIsUTh2u4zc6w9FUaYoHLtZRO7Vh7IoWxSOvVxC7rWHwihjFI69WEDulYfSKGsUjj3nIhmcQNmjcGw27TiFMkjl2NSOEyiTFA6de7WhTLoklbnnXmsolC5Jckc4uiTJHeHokiR3hKNLktwRji5Jckc4uiTJHeHokiR3hKNLktwRji5Jckc4uiTJHeHokiR3hKNLktwRji5Jckc4uiTJHeHokiR3hKNLktwRji5Jckc4uiTJHeHokiR3hKNLktwRji5Jckc4uiTJHeHokiR3hKNLktwRji5Jckc4uiTJHeHokiR3hKNLktwRji5Jckc4uiTJHeHokiR3hKNLktwRji5Jckc4uiTJHeHokiR3hKNLktwRji5Jckc4uiTJHeHokiR3hKNLktwRji5Jckc4uiTJHeHokiR3hKNLktwRji5Jckc4uiTJHeHokiR3hKNLktwRji5Jckc4uiTJHeHokiR3hKNLktwRji5Jckc4uiTJHeHokiR3hKNLktwRji5Jckc4uiTJHeHokiR3hKNLktwRji5Jckc4uiTJHeHokiR3hKNLktwRji5Jckc4uiTJHeHokiR3hKNLktwRji5Jckc4uiTJHeHokiR3hKNLktwRji5Jckc4uiTJHeHokiR3hKNLktwRji5Jckc4uiTJHeHokiR3hKNLktwRji5Jckc4uiTJHeHokiR3hKNLktwRji5Jckc4uiTJHeHokiR3hKNLktwRji5Jckc4uiTJHeHokiR3hKNLktwRji5JZe5ANOQOI+QOI+QOI+QOI+QOI+QOI+QOI+QOI+QOI+QOI+QOI+QOI+QOI+QOIylzn307eZRwOIzRh/PJ5Fm2pafM/WqyL/fZ67+9S7hAlKd37r0aSpj7+8m+3N9+NnlI7t765t6voXS5V89jd+43Ozvk7q5n7j0bSpb7b1Xt5I49RpH77NVkQu7Ybwy53+xQTSZnj8kd+4wg98tqw/7wv9+SO/YZR+5nz9/NbuU+e/2k2uZfvFzM7mrS+CbFMlGm7bl//OHJeR3Mi1XOqRtKlPvTN4sXXif3f583Uzv7KslUMQbbcv/4WdvHMpi5oKEkub99U/13PffVzCaL50fumG/N/f355HYwioYSfsy0lnv1rM6qt6DZ6+pfPzUPYN/d3Obc//ePm4Sf/uvmX7P/fHfzz09+nUsaUuV+1cxv8Wnrs+YB5G5uc+7VZvvrzg91O4KGVLlfrmbV3k7u2Jz7ehnVQ6qdFUFDwtybV+baA8jd3AEnIju5p25IlXv17nP29M2dB5C7uX25X//+/fnyUFTQkCr36ofKxY/v1h5A7ua253798xefP+6eeRE0pMq9cx71/vPO
Dhi5m9uW+9vOeffmRGP6hmS5L84erZ1GJXdsy311Rv3ixR/nzXn15A3pcp9XHws3s+XMDBY25/6+/uT0y1/+fDdfHarW0jYkzb268e139TP5aU7umG/JvfqU6exl9yHdT00TNqTOvbr9cvlWRO7YnHv7KVL7062LBFI1JMq9erm2869+eDQnd8y35945w365yF3RkCj3tVk18yZ3HJD7h+V5d0VDwmtm2md11TmPSu7mNuZe3fio8+/V9Y+JG1LlvrrE7br6NdbFDC/rA5LZX+mWidJ8WL/Sd9FG/YHSw6qW+nrIZeaChmSHqmtPa/lOddX5CAGeNua+OBHZ+LQJKX1DujMzH1Yfkz1YXvdQv1xz/s00ZLc59+XfbVnG0l4Kmbwh5YnIt59Xz+F+/Xa0fMir6pa/p1smSrMl9/ns9eM2ls6Ra+KG+AvAMELuMELuMELuMELuMELuMELuMELuMELuMELuMELuMELuMELuMELuMELuMELuMELuMELuMELuMELuMELuMELuMELuMELuMELuMELuMHJM7pdrf+vs/sXzvzr38JessU2ccE7PvfL0XXsPuWObOOH0y33y4NfmHnLHNnHC6Zn78k/+kjt2iBPOkbl35na9+IrXb+7eA6yJE87puS//svyjTfcAHXHC6ZN7/RUjn/y66R5gJU44vXKvvq+j/poFcscOccLplXv9NWi3Z/3xhyfVvtnZxYvuY6/r7yJ58rx726z+JpKzi5e8UsYtTjipt+4fV18eNTn7qn3gq/a2r9v/e/XtU50bMUJxwumVe3XIsb4L9n79i6aerZ5dq/mKwKvuA/l2sjGLE07vQ9X6luae1Re/zurvg/2k/TDh5oX4bj775+q2atJn1afJs98eT/jyyVGLE06f3OvvDHzWvaeaSvv+ctV8XWD1uOWnaL81/0d1WzPY7HL1xYIYoTjhnJz74lW4XNrynvUvrK8m9s3a9DuPuGxfrfPFa3vtG1kxKnHC6XsRwbO7z+f2rNe+Xnj5DKp5dt6GrjiROWZxwumZ+6P2nltLvf79+/Pl4UU1+9t7WDe3dd+F3rM3M2ZxwumVe3seqDvr65+/+Pxx92j6feeoujPN224/BKMRJ5yTcz978uXL7j3LWb/97M5UNr0C104mkfvIxQmn36eqd+5ZzebixR/n5I5KnHDS5l6/1Zx9+cuf1Q8fds96dXyNkYsTTtLcq8Pms/adqjPrO6/AK45NjcQJJ2nu1fxWh9LNbNcOsJcnl9YfiXGLE07q3DsvvcvJhtOny6dQvZw50+4iTji63D80p0/bqz0rzZvR5dobFfs2oxYnnKS5VxNtXo71pO9c5tA+ovpHe/a1uiCCiwjGK044SXOvr9d8WF3XtrguotnNqs4ePXhZXcN23r4a6zNKT9vHcp5mxOKEIzgR2fi03fWadT9naN6JXnUfy67MmMUJJ/HHTKtfNHnwpvP4+nrlxfQ2/VJK9WCMV5xwEuc+n9W/Wni/frPpHoBs/5XD5YMxXnHC4S8Awwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5w8j/AfhJS9TTw93mAAAAAElFTkSuQmCC&#34; 
style=&#34;height:120px;&#34;&gt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-size: 13px; text-align: left; vertical-align: middle; font-style: italic;&#34;&gt;Bobby Bones&lt;br&gt; (Season 27)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Businessperson&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;5&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #FFFFFF;&#34;&gt;0&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #FFFFFF;&#34;&gt;0&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #FFFFFF;&#34;&gt;0&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #ACE0E4;&#34;&gt;20&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;&lt;img cid=&#34;krqxpfyhtbue__temp_ggplot.png&#34; src=&#34;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAu4AAAH0CAMAAABVZliYAAAAulBMVEUAAABmZmZmZoFmZp1mgYFmgZ1mgbZmnc+BZmaBgWaBgYGBgZ2BgbaBnbaBnc+Bts+BtuedZmadgWadgYGdnZ2dnbadts+dtuedz+edz/+2gWa2nYG2nZ22tp22ts+2z8+2z+e25//PnWbPnYHPtoHPtp3Pz7bPz8/P5+fP5//P///ntoHntp3nz53nz7bnz8/n57bn58/n5+fn5//n/+fn////z53/1wD/57b/58//5+f//8///+f///+NxnQNAAAACXBIWXMAAA9hAAAPYQGoP6dpAAAR3klEQVR4nO3da5vTxh2G8fUGwgKloTmQYylZSlKyDUmTcuz6+3+tItmSJVs+SJrHGv2f+/ciF+w6MxrNvQcfsC6WgI2LqQ8AOB9yhxFyhxFyhxFyhxFyhxFyhxFyhxFyhxFyhxFyhxFyhxFyhxFyhxFyhxFyhxFyhxFyhxFyhxFyhxFyhxFy93ZxYVWA1WLR9rH1t2/fOiXgtFa0Fa0XjL7B+6wUW6ranb6/+6wUbZvajXq3WSjamrX79O6yTrS1a7fp3WSZ2LKdu0nvHqvElp3ayR1h7dZu0rvFIrGF3OGjq3aP3h3WiC3kDh/dtVv0brBEtO2rndwRz97aHXqPv0K0kTt8HKid3BHNodzj9x5+gWgjd/g4WDu5I5bDuYfvPfr60HKkdnJHJOQOI8dyj9578OWh5Wjt5I44yH3qA8D5HK+d3BHGCbkH7z326tBC7rFXh6ZTaid3BHFS7rF7D704NJ1WO7kjhBNzD9175LWh6dTayR0BkPuS3G2cXHvo3gMvDU3kXgi8NDT0qJ3cMXfkXgq8NDT0yT1w73FXhoZetZM75o3cV+KuDBv9aid3zFrP3OP2HnZhaCD3tbALw0bf2skdM0bulbALw0bv3MP2HnVd2OhfO7ljtsi9FnVdqA2ondwxV+S+EXVdqA3JPWrvQZeF2qDayR3zRO4NQZeFyrDayR2zNDD3oL3HXBUqQ2snd8wQubfEXBXWBtdO7pid4bWTO2ZnRO4xew+5KKyR+5aQi8LKmNrJHTND7ttCLgqlUbXH7D3imrBC7jsirgkrI3OP2HvAJWFlbO3kjhkh910Bl4TS6NrJHfMxPveAvcdbEUoJaid3zAW5d4m3IpTIvUu8FaGQonZyx0wkyT1e7+EWhEKa2skds0Du3cItCMtktZM7ZiBV7eSO/CWrndyRv3S5h+s92nqQsnZyR+7Ifb9o60HK2skdmSP3A6Ktx17S2sP1Hmw5IPdDgi0H5H5IsOXYS1w7uSNn5H5QsOW4S117tN5jrcYeuR8WazXu0tdO7sgWuR8RazXmBLWTO3JF7sfEWo03Re3Beg+1GG+a2skdORLVTu7IEbmfINRinKlqJ3dkSJZ7qN4jrcWZrnZyR3bI/SSR1mJMWDu5IzfkfppIa/GlrJ3ckRlp7pF6D7QUX9rayR1ZIfdTBVqKLXHt5I6cqHMP1Huclfgi95PFWYktee2Beg+zEF/kfrowC7F1htrJHZk4R+3kjjycpXZyRx7Ok3uY3qOsw9SZaid35IDc+4myDlPk3k+UdXg6V+3kjgycLfcovQdZhqfz1U7umBy59xVkGZbOWDu5Y2rk3luQZVg6Z+5Beo+xCktnrZ3cMS1y7y/GKhydt3Zyx6TIfYAYq3BE7gPEWIWhM9cepPcQi3BE7kOEWIShs9dO7pjM+Wsnd0yG3IcJsYhMXRQ0I5+/9hi9R1hDni5WTUqCj5u77lvEanjd0N42RabfvklqP0vu5cqU4wvHdnahDCVs7quFKScQjm3sQlpK1NzX61LOIBzb1sVO
j0lP8zS163uv1qWcQji2q64cU57noLnXy1LOIRzbVHeN6U70VLVrc2/8RFTOIhzb074ak53pkLk3F6WcRji2p701pjrVEXNvrUk3Dbmntj/GRKd6stqVvZP7PB2KMc25jph7e02qWZbkntrBGJOc7IC5by1JNEs5k3BsQ4dbTHGyJ6yd3NF2JMYEZztg7ttL0syymko4tp9jLY4/21PWruqd3GfqaIyjT3fA3HeWpJikmks4tp0TWhx7vsl93FzCse3oc5+2dknuu0sSTFJPJhzbzUktjjvhE+eu6J3cZ0qf+9S1C3LvWFLyORqzCcc2c2KLI8745LULeif3eTq5xeGnnNxHzyYc20qPFIee8wxqP8s/u008RWs64dhWyD3ZmhJP0ZpOOLaVPi0OO+k51J669841JZ1haz7h2E56tUjuB5eUcobtCYVjO+nX4qCzHi/3PStKOMPOjMKxjfRMcchZz6N2ckf/FgecdnJPMaNwbB+9U+x/2jOp/RzvIJJsgo4phWP76N9i7/NO7kmmFI7tQ597NrWf4f1yUk3QNadwbBtDWux54sk9zZzCsW1Y5a5/e6hE43fOKRzbhj73jGrXvz1UmvG7JxWO7WJYi73OPLknmlQ4tgt97jnVrn8ztCTj75lVOLaJoS32OPXkngi5jza4xZPPfVa169/7L8Xw+2YVjm1ieIynnvyAuR9aUoLh904rHNvDmBZPO/uZ1U7uzuS551a7/J0uxw+/f17h2BbGxXjC6c+udvlbXY4e/cC8wrEdjIyR3DuMHv3AvMKxHYyN8ej5z7B29Ru7jh390MTCsQ2MjpHcd40d/dDEwrENjI/xyAbkWLv6jV3HDX54ZuHYBshdcNbGDX54ZuHY8aWI8eAO5Fm7+G2MRw1+ZGrh2PElqfHQFkTM/eiaxgx+bG7h2OGlifHAFmRa+7jeyX2mEtW4dw+yrZ3cHaXKcd8mhMz9+KKGj318cuHY0aWr8aJzG/KtndwNpcxxdx8uMq59RO8nrGrUphyZXTh2dNLcs46d3P2kDfJCOXh65O4mcZGt399zr31w76csLMnu7JleOHZs6Yu8EI6dHLl7ESR5UUk/dHq6SzKk3qnm/MKxY5tFk0Lk7sS9duElSJLvVWN+4dihkbvsCiTpN2tzAMKxQyN3cvdB7borkCj2qzoC4diRkTu5+6D2t+Rug9pLoiuQaPZsdQjCseMi9xK5W6D2FdEFd0S7Vh6DcOywyH1NcwUS2b6R+xDUXtFcgES3c+Q+ALlXyN0Audck19tRbp1w7KCofYPcwyP3DXKPjtqbTuyn10lTbp5w7JjIvYncY6P2FsW1A5W7Jxw7JHJvE1w8ULl7wrEjovYtpwTU86Qpt084dkTkvi39xQOV2yccOyJy33a8oL7nTLl9wrEDovZdxxLqfc6U+yccOyBy75D6amrK/ROOHQ+1dzncUP9zptxA4djxkHungxdTI/e5ovZuiS+mptxB4djRUPs+aS+mptxC4djBUPt+SS+mptxD4djBkPt+ezIadsqUeygcOxZqPyTlpQOVmygcOxZyP6gjpKFnTLmJwrFDofYjOi6VOXAk5S4Kxw6F3I/ZTmnwGVPuonDsSKj9uFZLIy4wpdxG4diBUPspLuqLZY66mppyH4VjB0LuJ0px6UDlPgrHjoPaz0m5kcKx4yD3c1JupHDsMKj9rJQ7KRw7Cmo/L+VWCscOYh7XbA9EuZfCsWMg9nNTbqZw7BCo/eyUuykcOwRyPzvlbgrHjoDaz0+5ncKxA6D2CSj3Uzj2/FH7FJQbKhx79qh9EsodFY49e+Q+CeWOCseeO2qfhnJLhWPPHLVPRLmnwrFnjZcOTEa5q8KxZ4zYJ6TcV+HY80XsU1JurHDs2aL2SSl3Vjj2TPGLzMSUeysce5aIfXLK3RWOPT9j/w09UlBusHDsuaH1PCi3WDj2rPCNPRvKXRaOPRfj3wgIKSm3Wjh2/hK85RWSU264cOyMXHSbemPRRdmB
cOiM6FaJOSEEGCF3GCF3GCF3GCF3GCF3GCF3GCF3GCF3GCF3GCF3GCF3GCF3GCF3GEmZ++13i3sJh0NE764Wi8eTzZ4y95vFsdxvX/zldcIJMT+jcx/VUMLc3yyO5f7q88Wn5O5tbO7jGkqXe7GOw7l//GWH3N2NzH1kQ8ly/7WondxxRIjcb58vFuSO4yLk/vEXqsXi8j6545gAuV8X39g//e935I5jYuR++eT17Vbuty8eFN/zHz5bHd3NovJtijkxT/tzf//3B1dlME83OaduKFHuj16uvvAauf/7qjq0y6+THCoi2Jf7+8/rPtbBLAUNJcn91cviv+3cN0e2WK2P3LHcm/ubq8V2MIqGEj7N1Mq9WNVl8SPo9kXxpx+rG/C7u7nu3P/3t48JP/rXxz/d/uf7j3/85JelpCFV7jfV8a2ebX1c3YDczXXnXnzb/qbxl7IdQUOq3K83R1V/nNzRnXu7jOImxS8rgoaEuVdfma0bkLu5Ex6IbOSeuiFV7sVPn8tHL3duQO7mjuX+4bcfrtZ3RQUNqXIv/lJ4+I/XrRuQu7n9uX/46csv7jcfeRE0pMq98TjqnSeNX8DI3dy+3F81HnevHmhM35As99WjR62HUckd+3LfPKL+8OnvV9Xj6skb0uW+LJ4Wro6WR2aw0p37m/KZ069+/uP1cnNXtZS2IWnuxQdffV+u5McluWO5J/fiWabLZ82bNJ81TdiQOvfi49frH0Xkju7c62eR6r9tvUggVUOi3Isv1/r4i7/cW5I7lvtzbzzCfr3KXdGQKPfWUVXHTe44Ifd368fdFQ0JXzNTr+qm8TgquZvrzL344L3Gnzevf0zckCr3zUvcPhT/jHV1hNflHZLbP9PNibl5136l76qN8gmlT4taytdDrjMXNCS7q9pa1von1U3jKQR46sx99UBk5bMqpPQN6R6Zebd5muzu+nUP5ZfrlO+Zhsl1575+35Z1LPVLIZM3pHwg8tUXxRrulD+O1jd5Xnzkr+nmxNzsyX15++J+HUvjnmvihngHYBghdxghdxghdxghdxghdxghdxghdxghdxghdxghdxghdxghdxghdxghdxghdxghdxghdxghdxghdxghdxghdxghdxghdxghdxghdxjpk/t1673O7jx88mfjM7yTNfbJJ5zhuRceva4/Q+7YJ59wxuW+uPtL9Rlyxz75hDMy9/Vb/pI7DsgnnJ65N47tw+oSr9/ufgZoySec4bmv31n+XtdngIZ8whmTe3mJkU9+6foMsJFPOKNyL67XUV5mgdxxQD7hjMq9vAza9lG///uD4nezy4dPm7f9UF6L5MGT5sduyyuRXD58xldKbPmEk/q7+/vNxaMWl1/XN3xef+yb+v/eXH2q8UEElE84o3Iv7nK0fwV7077Q1OPN6mrVJQJvmjfk6mSR5RPO6Luq5Ueqz2wu/HpbXg/2k/rJhI9fiK+Xt//cfKw46Mvi2eTbX+8vuPhkaPmEMyb38pqBj5ufKQ6l/vlyU10usLjd+lm0X6v/o/hYNdjt9ebCgggon3AG5776KlzPtv5M+4L1xYF92zr8xi2u66/W5epru3VFVoSSTzhjX0TweHc920fdurzwegXFcTZ+DN3wQGZk+YQzMvd79We2Zv3w2w9X67sXxdFv/4b18WPNn0Jv+G0msnzCGZV7/ThQ86g//PTlF/eb96bfNO5VNw5z2/ZNEEY+4QzO/fLBV8+an1kf9avPdw6l6yuw9WASuQeXTzjjnlXd+czmaB4+/f2K3FHIJ5y0uZc/ai6/+vmP4i/vDh/15v41gssnnKS5F3ebL+ufVI2j3vkKvOG+qZF8wkmae3F8m7vS1dG27mCvH1xq3xKx5RNO6twbX3rXi46HT9dLKL6ceaTdRT7h6HJ/Vz18Wr/as1D9MLpu/aDid5vQ8gknae7FgVZfjuVB77zMob5F8Yf60dfiBRG8iCCufMJJmnv5es1Pi9e1rV4XUf2aVTx6dPdZ8Rq2
q/qrsXxE6VF9Wx6nCSyfcAQPRFY+q3/1um0+z1D9JHrevC2/ykSWTziJn2ba/EOTuy8bty9fr7w6vK5/lFLcGHHlE07i3Je35T8tvFP+sGneAdn/Tw7XN0Zc+YTDOwDDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDyP8B+RmBIyqachMAAAAASUVORK5CYII=&#34; style=&#34;height:120px;&#34;&gt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-size: 13px; text-align: left; vertical-align: middle; font-style: italic;&#34;&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Entertainer&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;5&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #FFFFFF;&#34;&gt;0&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #FFFFFF;&#34;&gt;0&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #B5E3E6;&#34;&gt;20&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #3FC1C9;&#34;&gt;40&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;&lt;img cid=&#34;cxanbhsqiedo__temp_ggplot.png&#34; src=&#34;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAu4AAAH0CAMAAABVZliYAAAAulBMVEUAAABmZmZmZoFmZp1mgYFmgZ1mgbZmnc+BZmaBgWaBgYGBgZ2BgbaBnbaBnc+Bts+BtuedZmadgWadgYGdnZ2dnbadts+dtuedz+edz/+2gWa2nYG2nZ22tp22ts+2z8+2z+e25//PnWbPnYHPtoHPtp3Pz7bPz8/P5+fP5//P///ntoHntp3nz53nz7bnz8/n57bn58/n5+fn5//n/+fn////z53/1wD/57b/58//5+f//8///+f///+NxnQNAAAACXBIWXMAAA9hAAAPYQGoP6dpAAAQ8UlEQVR4nO3de3/TRhaHcTvlEsiyZUtL75SGpV3IlnbbhUDWfv9vayPZkiVHsiVrzswZ/Z7vH/1A4o6ko8fxlXixBmQsUu8AEA+5Qwi5Qwi5Qwi5Qwi5Qwi5Qwi5Qwi5Qwi5Qwi5Qwi5Qwi5Qwi5Qwi5Qwi5Qwi5Qwi5Qwi5Qwi5Qwi5Qwi5I4BF24CLpAmP3DHFtt3rtv20Oy5ynaR6csfJOiIeLW7x5I4TBWi9Kj7ePkfbEuYkxA/2BMGTO04QsvVN8JH2O85mMCdBf7JXvUf5CU/uGMsg9k3xEXbdfhOYF6vaY/RO7hjHrvYIvZM7RrGs3b53cscIFg9SY/ZO7hjOOnbz3skdg0Wo3bh3csdQUWq3fYmV3DFQpNqvry2PwXBtzAq5Q0e02skdycWrndyRWsTayR2pkTt0xKyd3JEYuUNH1NrJHUnFrZ3ckVLk2skdCcWundyRELlDR/TayR3JxK+d3JEMuUNHgtrJHamQO3SkqJ3ckQi5Q0eS2skdaZA7hJA7dKSpndyRBLlDR6LayR0pkDt0pKqd3BFfstrJHfGRO3Skq53cER25Q0fC2skdsZE7dKSsndwRGblDCLlDR9LayR1xkTuEkDt0pK2d3BEVuUNH4trJHTGRO4SQO3Skrp3cERG5Q0fy2skd8ZA7dKSvndwRi4PayR2xkDuEkDt0eKid3BEJuUMIuUOHi9rJHXGQO4SQO3T4qJ3cEQW5Qwi5Q4eT2skdMZA7hJA7dHipndwRAblDCLlDh5vayR32yB1CyB06/NRO7jBH7tDhqHZyhzVyhw5PtZM7jJE7dLiqndxhi9whhNyhw1ft5A5T5A4h5A4dzmond1gidwghd+jwVju5wxC5Qwi5Q4e72skddsgdQsgdQsgdOvzVTu4wQ+4QQu7Q4bB2cocVcocQcocOj7WTO4yQO3S4rJ3cYYPcocNn7eQOE+QOHU5rJ3dYIHcIIXcIIXfo8Fo7ucMAuUMIuUOH29rJHeGRO4SQO3T4rZ3cERy5Qwi5Q4fj2skdoZE7hJA7dHiundwRGLlDCLlDh+vayR1hkTuEkDt0+K6d3BEUuUMIuUOH89rJHSGRO3R4r53cERC5Q4f72skd4ZA7hJA7dPivndwRDLlDCLlna9GQel8ykUHt5N7lNvHmjEh+EHLP0qLrvBH8UeSeoc7YC3keTjw51E7ubb2xX/MD/ghyz86RU5bhEcVD7pk59KN9I7tDiieL2sl9Z8gJ4zmaPuRuuLaBoecrs8OKhdwN1w5v+OnK67giyaN2ct84fre9IacDi4XcM6pi5MnK6MhiIfd8ohh9rvI5tEgyqZ3c1yedq2yOLRJyzyb3k05VLgcXRy61k/uppyqTo4uD3OeeO73vZFM7uZ9+qvI4vhjIfZ1J7lPOVBYHGEE+tavnPu1M5XCEEZB7OQXDtQOZeKIyOMIIMqpdO/dRbx3o4v8QIyD3zRgM1w4iwHlyf4wRkPtmDIZrhxDkNHk/SHs51S6ce5jT5PwgIyD37RwM1w4g0GlyfpTm
sqpdN/dgp0n8n/ORezUIw7UnC3iWXB+nOXKvBmG49lRBT5LnA7WWV+3kHoLnIzVG7vUkDNeeKPRJcnyoxsi9noTh2tMEP0d+D9VYZrWTexh+j9UWue9GYbj2JBbnyO3BmsqtdsXcTc6R14O1Re6NWRiuPYHRKXJ6tLbIvTELw7VPZ3aGfB6uqexql8vd7gy5PFxb5N4chuHaJzM8Qy6P11J+tavlbnqG1N4sRu6taRiufSLjE+TwiA1lWDu5B+XwkO2Qe3schmufxv4E+TtmMznWLpV7hBPk7pjNZFk7uQfm7qCtkPv+QAzXPkWcE+TtqI3kWbtQ7pFOkLOjtkLudyZiuPZ4k39j2FC+DtsKud+ZiOHao0U8Pa6O20imtavkHvP0eDpuK+R+dySGa48V9fR4OnAbudYuknvk0+PoyG2Qe8dMDNceJ/rZ8XPoJrKtndxt+Dl2C+TeNRTDtUdJcXbcHLyBfGtXyD3J2fFy8BbIvXMqhmuPkebseDn68DKuXSD3VGfHyeGHR+7dYzFce7hkJ8fH4YeXc+2zzz3hyXFx/OGRe89cDNceKtobw7p4GEBwWdc+89zTnhsHAwgu79rnnXvqc5N+AqGlnuhUlqMxXHvgHqSebvoRBJZ8ohNZjsZw7WE7kHq4s+vdwUSnsZyN4dqDtp96toXUQwjLxUinsJyN4dqDtp96tqXUUwjJx0SnsByO4dpDNp96tFsz6t3LSE9nORzDtQdsPfVkK/PJ3c1IT2c5HcO1j2889WB3ZtO7o5meynI6hmsf33jqwTbMpHdPIz2V5XgM1z667dRzbZlH775mehrL8RiufWTL3s7MHHr3NtOTWM7HcO3DG0491DtmkLu/oZ7CckCGax/cbuqZdsi/d49THc9yQIZrH9xu6pl2yb13l0Mdz3JChmsf2mzqkXbLu3enQx3NckSGax/YauqJ9sn5c/ncDnUsyxkZrt2/0dQD7Zdv7o6HOpLlkAzX7t1m6nkekmvvroc6juWUDNfu22TqcR6Wae/OpzqG5ZQM1+7bZOpxHpFl796HOoblmAzX7tli6mkelWHv/oc6guWcDNfu3F4OJya/3nOY6mCWczJcu2tzqUc5TG69ZzLWgSwHZbh2x9ZST3KovHrPZqzDWE7KcO27G0s9yOFy6j2jsQ5iOSrDte9sK/Ucx8jn9dWsxjqE5awM197fVOoxjpRJ71k8+B/FcliGa+9tKfUUR8ui9/zGepTltAzXbm8o9RBPkEHvOY71GMtxGa7d2Eqmt7jue89zrEdYzstw7d1GUg/wZM4fsOY72EMsB2a4dr2N1PObwnPvWQ+2n+XEDNeuNpF6fNP47T3zwfayHJnh2psNZH9SvN6hyX6wfSxnZrh2uX7q2YXgMvhZTLaT5dAM117P55x46z3Xp7oGsZyb5dozOie+fsDPZ65dLAdnuHTqsYXlqPeZTXaf5eTslk49tdC8/ICf0W1mN8vZ2S2demrhLRwUP/vYyd2R1MHPP3ZydyVh8HN67H+A5QTtlk49NTOJgtdo/Zrc3VlEvxcv8oO9ZDlGu6VTT81YzOKFWr8md6+iFL9Q+sFeshym3dKppxbFwvR+jV7qBbNxknsQFs0vNFMvBJ5ka6p2S6eeWmSLRaDqF8Kll6aPsH+2dkunnloai4YRw2r8X6mPIDmzIsnd1GK41LvqiV2SaT8kHoiL3CGE3CGE3CGE3CGE3CGE3CGE3CGE3CGE3CGE3CGE3CGE3CGE3CGE3CEkZO6r75YPAy6HOfp4vlw+Tbb1kLlfLY/lvnr9t/cBN4j8TM59UkMBc/+wPJb7uy+WD8hd29TcpzUULvfiOA7nfntnh9zVTcx9YkPBcv+tqJ3cccQscl+9Wi7JHcfNIffbO1TL5dkjcscxM8j9svjB/uC/35E7jplH7mfP36/2cl+9flz8zL94udm7q2Xl2xDbRJ76c//04+PzMpgXu5xDNxQo9ydvNle8Ru7/Pq927ezrILuKOejL/dMX
dR/bYNYGDQXJ/d2b4r/t3Hd7ttwcH7lj3Zv7h/PlfjAWDQV8mamVe3FUZ8VN0Op18aefqwtw311cd+7/+8dtwk/+dfun1X++v/3jZ2/XJg1Z5X5V7d/m1dan1QXIXVx37sWP7W8afynbMWjIKvfL3V7VXyd3dOfeLqO4SHFnxaAhw9yra2brAuQubsATkY3cQzdklXtx63P25M2dC5C7uGO53/z+w/n2oahBQ1a5F38pXPz0vnUBchfXn/vNL18+e9R85sWgIavcG8+j3nveuANG7uL6cn/XeN69eqIxfENmuW+ePWo9jUru6Mt994z6xYs/zqvn1YM3ZJf7unhZuNpbnpnBRnfuH8pXTr/69c/3691D1VLYhkxzL7747vvySH5ekzvWPbkXrzKdvWxepPmqacCGrHMvvn65vSkid3TnXr+KVP9t700CoRoyyr24utb7X/zl4Zrcse7PvfEM++Umd4uGjHJv7VW13+SOAbl/3D7vbtGQ4Xtm6qO6ajyPSu7iOnMvvviw8efd+x8DN2SV++4tbjfFP2Pd7OFl+YBk9Ve4bSI3H9vv9N20Ub6g9KCopXw/5DZzg4bMHqq2Dmt7S3XVeAkBmjpz3zwRWfm8Cil8Q3bPzHzcvUx2f/u+h/LqmvJ3piG57ty3v7dlG0v9VsjgDVk+EfnuWXEM98qbo+1FXhVf+Xu4bSI3PbmvV68f1bE0HrkGbojfAAwh5A4h5A4h5A4h5A4h5A4h5A4h5A4h5A4h5A4h5A4h5A4h5A4h5A4h5A4h5A4h5A4h5A4h5A4h5A4h5A4h5A4h5A4h5A4h5A4h5A4hY3K/bP2us3sXz/9qfIffZI0+fsI5PffCk/f1d8gdffyEMy335f231XfIHX38hDMx9+2v/CV3HOAnnJG5N/btZvMRr9/e/Q7Q4iec03Pf/mb5h13fARr8hDMl9/IjRj572/UdYMdPOJNyLz6vo/yYBXLHAX7CmZR7+TFo+3v96cfHxX2zs4sXzcvelJ9F8vh582ur8pNIzi5eck2ZNz/hhP7p/mn34VHLs6/rC76qv/ZN/X/vPn2q8UXMkJ9wJuVePORo3wX70P6gqae7o6tVHxF41bwgn042Z37CmfxQtfxK9Z3dB7+uys+D/ax+MeH2ivh+vfrn7mvFTp8Vryavfnu05MMnZ81POFNyLz8z8GnzO8Wu1LcvV9XHBRaX276K9lv1fxRfqxZbXe4+WBAz5Ceck3PfXAu3W9t+p/2B9cWOfdva/cYlLutr63pz3W59IitmxU84U99E8PTu8ezvdevjhbdHUOxn42boiicy58xPOBNzf1h/Z2+rN7//cL59eFHs/f49rNuvNW+FPnBvZs78hDMp9/p5oOZe3/zy5bNHzUfTHxqPqhu7uW//IpgNP+GcnPvZ469eNr+z3et3X9zZla5rYOvJJHKfOT/hTHtV9c53dntz8eKPc3JHwU84YXMvb2rOvvr1z+IvHw/v9e7xNWbOTzhBcy8eNp/Vt1SNvb5zDbzisakQP+EEzb3Yv91D6WpvWw+wt08utS+JefMTTujcG1e9y2XH06fbQyiuzjzTrsJPOHa5f6yePq3f7VmobowuWzdU3LeZNT/hBM292NHq6lju9J23OdSXKP5QP/tavCGCNxHMl59wguZevl/zQfG+ts37Iqq7WcWzR/dfFu9hO6+vjeUzSk/qy/I8zYz5CcfgicjK5/Vdr1XzdYbqluhV87LclZkzP+EEfplp9w9N7r9pXL58v/Jm97r+UUpxYcyXn3AC575elf+08F55Y9N8ANL/Tw63F8Z8+QmH3wAMIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQO
IeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIf8HOBufXPGxETAAAAAASUVORK5CYII=&#34; style=&#34;height:120px;&#34;&gt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-size: 13px; text-align: left; vertical-align: middle; font-style: italic;&#34;&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Other&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;9&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #FFFFFF;&#34;&gt;0&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #FFFFFF;&#34;&gt;0&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #3FC1C9;&#34;&gt;44.4&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #FFFFFF;&#34;&gt;0&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;&lt;img cid=&#34;jyzknpxfwagr__temp_ggplot.png&#34; src=&#34;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAu4AAAH0CAMAAABVZliYAAAAulBMVEUAAABmZmZmZoFmZp1mgYFmgZ1mgbZmnc+BZmaBgWaBgYGBgZ2BgbaBnbaBnc+Bts+BtuedZmadgWadgYGdnZ2dnbadts+dtuedz+edz/+2gWa2nYG2nZ22tp22ts+2z8+2z+e25//PnWbPnYHPtoHPtp3Pz7bPz8/P5+fP5//P///ntoHntp3nz53nz7bnz8/n57bn58/n5+fn5//n/+fn////z53/1wD/57b/58//5+f//8///+f///+NxnQNAAAACXBIWXMAAA9hAAAPYQGoP6dpAAAPcUlEQVR4nO3da3vaRh6GcUNzcOLNNtu06TlNnU27ibdpt93EiRe+/9daj5CEsMEn9Nc8M8/9e9ErBiqPNDcgBEYHS8DGQe4BANMhdxghdxghdxghdxghdxghdxghdxghdxghdxghdxghdxghdxghdxghdxghdxghdxghdxghdxghdxghdxipNveDgdxjgYr6UmgTPx0ge6zUlcDFzi8geXf1TP81qZM86sn9RqkPks89XmRRxbzf7HGd4lHBpN++9b74CtYet1H6hN/hgX0z+dwrgCkVPt17tU7wdoqe7P0e2QneT8FTPU7sq+AL3gy4hXLnebTY2+Jzrw8mUOwsj1s7wXsoc473PB5D8K6KnOGI1tvgi9weuKkSpzes9lXxuVcPccqb3JD9GIL3UNzURsdO8DUrbWKnqJ3gq1XYtE5UO8FXqqxJna52gq9SSVMa/iKV4GtX0IROHTvB16ec6cxRewo+93pjRMXMZqbaeae1KqVMZbbaV8XnXn2Mo5CJzFo7wVejjGnMXTvBV6KISRSo/ZQXrTUoYQ41audFawX0J3DyN5euQvBlk58+pdgTgi+Z+uSp1X5K8CUTnzrB2k950Vou7ZnTrJ0XrcXSnjbV3E/ZpSmT9KQJ135K8CVSnjLt2k8JvjzCEyZf+ynBl0Z3ukqo/ZRXrWWRnatCak8IvhiqMyX10YFr8RBfCNFpKir2BsGXQHOSyqv9lIf4EkjOUJG1JwQvTnJ+is2dh3hxipNTcO0JwesSnJrCaz/lBMW69Kal/NobFK9Ib04qyf2U4gXJTUg9tSfs1mhRm4y6am9QvA6xmaiw9uSAR3kNWpNQ1idlboni85OagZpjb/Agn5nS1q++9sYBezb5CG13j9pbJJ+F0Da3yj054IF+ajob2672DtVPR2Yr29beOSD7eDJb1z73DtUHUtms1H7BwQEP9+MT2ZjUvhvhj0dkE5L7TRxQ/p40thu139YB6d+FxLai9r1Q/o0pbKCqPxg2LcK/msBmIfYIdL9N/q1B7bHIfiD/ViD3aZD9UiB3ap+Ydfa515rac7GsPvfqkntmXs1nXlNql2DzQJ93HaldiUHz5I4NdTefdc2oXVS1yedcK2pXVuXDPLnjCrUVn3FtqL0IVT3K51sRPgdZkFqKz7YWxF6YKorPtQrUXqDyd2vIHbdSdvCZBk/t5Sr5IT7PyKm9bMUGT+64i0J347OMmdqrUGDw5I67Ky74HOOl9noUFnyG0VJ7VYraiSd37K2c4qcfJ7XXqJDeyR2jKOMBfvJBUnutSgh+6iFSe8X09+EnHh+1V048+GlHR+31k+590sFRuwPlB3hyx+h0g59yYNRuQzV4ckcIzeAnHBS1e1EMfrohUbsdvd7JHXHkHuAnGw+1WxLrndwRSuuDBVONhdp9CfU+0VCo3ZlO7+SOeDI7NNOMg9rdifQ+yTCoHRq9TzEKaofIIZoJhkDtWMnfO7ljOtl7jx8AtaOXe4eG3DGpzOetDv8FubcvtOQ9cXX08nNvXajJeirf
6OXn3riQk/GQZPAvpnZsk+0MeMGLz71doSnXKfBil557q0JVppOChS489zaFrjynSYpcNrljtxwvWAN/JbHjahlOLhC36NwbE/KmP7tA3KJzb0vom3qHhtyR1cTfuB636NwbEkWY9juo4xadezuiDJN+K2/conNvRhRiwh14ckd+032XXdyic29DlGOyL7OLW3TuTYiCTPX1XnGLzr0FUZJpduDJHSIm+cqjuEXn3nwozBRfAhO36NxbD6WZ4Gsx4hade+OhOPHfixG36NzbDuUJ/6aAuEXn3nQoUPTfTsctOveWQ4mC/3g6btG5NxyKFPvX03GLzr3dUKbIN5zIHWrikiR3yIlLktwhJy5JcoecuCTJHXLikiR3yIlLktwhJy5JcoecuCTJHXLikiR3yIlLktwhJy5JcoecuCTJHXLikiR3yIlLktwhJy5JcoecuCTJHXLikiR3yIlLktwhJy5JcoecuCTJHXLikiR3yIlLktwhJy5JcoecuCTJHXLikiR3yIlLktwhJy5JcoecuCTJHXLikiR3yIlLktwhJy5JcoecuCTJHXLikiR3yIlLktwhJy5JcoecuCTJHXLikiR3yIlLktwhJy5JcoecuCTJHXLikiR3yIlLktwhJy5JcoecuCTJHXLikiR3yIlLktwhJy5JcoecuCTJHXLikiR3yIlLktwhJy5JcoecuCTJHXLikiR3yIlLktwhJy5JcoecuCTJHXLikiR3yIlLktwhJy5JcoecuCTJHXLikiR3yIlLktwhJy5JcoecuCTJHXLikiR3yIlLktwhJy5JcoecuCTJHXLikiR3yIlLktwhJy5JcoecuCTJHXLikiR3yIlLktwhJy5JcoecuCTJHXLikiR3yIlLktwhJy5JcoecuCTJHXLikiR3yIlLktwhJy5JcoecuCTJHXLikiR3yIlLktwhJy5JcoecuCTJHXLikiR3yIlLktwhJy5JcoecuCTJHXLikiR3yIlLktwhJy5JcoecuCTJHXLikiR3yIlLktwhJy5JcoecuCTJHXLikiR3yIlLktwhJy5JcoecuCTJHXLikiR3yIlLktwhJy5JcoecuCTJHXLikiR3yIlLktwhJy5JcoecuCTJHXLikiR3yIlLktwhJy5JcoecuCTJHXLikiR3yIlLktwhJy5JcoecuCTJHXLikiR3yIlLktwhJy7JyNwBNeQOI+QOI+QOI+QOI+QOI+QOI+QOI+QOI+QOI+QOI+QOI+QOI+QOI+QOI2Pmvvhu9nDExaFGHw9ns6fZfvuYuZ/Mrst98fpv70f8hSjP3rnv1dCIuX+YXZf7uy9mD8jd276579fQeLmn9bg69/OdHXJ3t2fuezY0Wu6/pdrJHdeoIvfFq9mM3HG9GnI/36GazeaPyB3XqSD34/TA/uC/35E7rlNH7vPn7xcXcl+8fpwe849erkZ3Mut8O8bvRJl25/7px8eHTTAv1jmP3dBIuT95s7rjDXL/92E3tPnXowwVNdiV+6cv+j7aYJYBDY2S+7s36b+bua9HNlutH7ljuTP3D4ezi8FENDTi20wbuae1mqenoMXr9K+fuxuw725ue+7/+8d5wk/+df6vxX++P//nZ2+XIQ1F5X7SjW/1buvT7gbkbm577ulh+5vBD007AQ1F5X68HlV/Oblje+6bZaSbpJ2VgIYCc+/umRs3IHdzNzgQOch97Iaick/PPvMnby7dgNzNXZf72e8/HLYvRQMaiso9/ZAc/fR+4wbkbm537me/fPns0fDIS0BDUbkPjqPeez7YASN3c7tyfzc47t4daBy/obDcV0ePNg6jkjt25b4+on704o/D7rj66A3F5b5Mbwt3o+XIDFa25/6heef0q1//fL9cv1RtjNtQaO7pwnffN2vy85LcsdyRe3qXaf5yeJPhu6YjNhSde7r8uH0qIndsz71/F6n/6cKHBMZqKCj3dHftx59+eLgkdyx35z44wn68yj2ioaDcN0bVjZvccYPc
P7bH3SMaCvzMTL9WJ4PjqORubmvu6cKHg3+vP/84ckNRua8/4naW/ox1NcLj5gXJ4q/xfidK83Hzk76rNpo3lB6kWprPQ7aZBzQU9lJ1Y7XaZ6qTwVsI8LQ199WByM7nXUjjNxR3ZObj+m2y++3nHpq7a87vTEN223Nvv7eljaX/KOToDUUeiHz3LK3DvebpqL3Jq3TJ38f7nSjNjtyXi9eP+lgGr1xHbohvAIYRcocRcocRcocRcocRcocRcocRcocRcocRcocRcocRcocRcocRcocRcocRcocRcocRcocRcocRcocRcocRcocRcocRcocRcocRcoeR2+R+vPFdZ/eOnv81uIZvssYuOuHcPffkyfv+GnLHLjrh7Jf77P7b7hpyxy464eyZe/uVv+SOK+iEc8vcB2M7W53i9dvL1wAbdMK5e+7tN8s/3HYNMKATzj65N6cY+ezttmuANZ1w9so9na+jOc0CueMKOuHslXtzGrSLo/704+O0bzY/ejG87VlzLpLHz4eXLZozkcyPXnJPqZtOOGM/un9anzxqNv+6v+Gr/rJv+v97ffapwYWokE44e+WeXnJs7oJ92DzR1NP12vW6UwSeDG/I2clqphPO3i9Vm0u6a9Ynfl0054P9rH8z4fyO+H65+Of6sjToeXo3efHboxknn6yaTjj75N6cM/Dp8Jo0lP755aQ7XWC6Xfsu2m/d/5Eu6xa2OF6fWBAV0gnnzrmv7oXtb2uv2TxhfRrYtxvDH9ziuL+3Llf37Y0zsqIqOuHs+yGCp5fX5+KoN04v3K5BGufgaeiEA5k10wlnz9wf9tdc+K1nv/9w2L68SKO/uId1ftnwWegDezM10wlnr9z740DDUZ/98uWzR8NX0x8Gr6oHw7zo4k1QDZ1w7pz7/PFXL4fXtKN+98WloWy7B24cTCL3yumEs9+7qpeuWY/m6MUfh+SORCeccXNvnmrmX/36Z/rh49WjXr++RuV0whk19/Syed4/Uw1GfekeeMJrUyM64Yyaexrf+qV0N9qNF9jtwaXNW6JuOuGMnfvgrnc823L4tF2FdHfmSLsLnXDicv/YHT7tP+2ZdE9GxxtPVOzbVE0nnFFzTwPt7o7NoC99zKG/RfpHf/Q1fSCCDxHUSyecUXNvPq/5IH2ubfW5iG43Kx09uv8yfYbtsL83NkeUnvS35ThNxXTCCTgQ2fm83/VaDN9n6J6JXg1vy65MzXTCGfltpvUfmtx/M7h983nl1fC2/VFKujHqpRPOyLkvF82fFt5rnmyGL0B2/8lhe2PUSyccvgEYRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRv4PjWH98XV74zAAAAAASUVORK5CYII=&#34; style=&#34;height:120px;&#34;&gt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-size: 13px; text-align: left; vertical-align: middle; font-style: italic;&#34;&gt;&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
  &lt;tfoot class=&#34;gt_sourcenotes&#34;&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_sourcenote&#34; colspan=&#34;8&#34;&gt;&lt;strong&gt;Data:&lt;/strong&gt; DWTS Wikipedia Articles | &lt;strong&gt;Table Author:&lt;/strong&gt; JLaw&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tfoot&gt;
  
&lt;/table&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;so-what-is-the-most-successful-profession-in-dwts&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;So what is the most successful “profession” in DWTS?&lt;/h1&gt;
&lt;p&gt;It seems pretty clearly to be the athletes, as close to 14% of the Athletes have wound up winning. On the other end of the spectrum, the Media Personalities have fared less well, with the lowest winning percentage of any group with 10+ stars and nearly 1 in 4 coming in last place. Reality TV Stars, while in the middle of the pack, have been surging, with ex-Bachelorettes winning the last two seasons (28 and 29).&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>An Attempt at Tweaking the Electoral College</title>
      <link>https://jlaw.netlify.app/2020/11/16/an-attempt-at-tweaking-the-electoral-college/</link>
      <pubDate>Mon, 16 Nov 2020 00:00:00 +0000</pubDate>
      <guid>https://jlaw.netlify.app/2020/11/16/an-attempt-at-tweaking-the-electoral-college/</guid>
      <description>
&lt;script src=&#34;index_files/kePrint/kePrint.js&#34;&gt;&lt;/script&gt;
&lt;link href=&#34;index_files/lightable/lightable.css&#34; rel=&#34;stylesheet&#34; /&gt;


&lt;div id=&#34;motivation&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Motivation&lt;/h1&gt;
&lt;p&gt;With the 2020 Election wrapping up and a renewed discussion about the merits of the Electoral College I’ve been thinking more about the system and why it might be the way it is. While I understand the rationale for why doing a complete popular vote would have unintended consequences, I personally feel that the current system has overly valued small states by virtue of having a minimum of 3 electoral votes. &lt;strong&gt;My personal hypothesis is that we have too many states.&lt;/strong&gt; Therefore, my solution would be to start combining the small states so that they meet a minimum threshold of the US population. I fully recognize that this would be &lt;em&gt;completely infeasible&lt;/em&gt; in practice… but this is just a humble blog. So this analysis will attempt to accomplish three things:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;p&gt;When comparing the population from 1792 vs. 2020, do states generally represent smaller percentages of the US Population? (Do we have too many states from an Electoral College perspective?)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;How could a new system be devised by combining states to reach a minimum population threshold?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Would this new system have impacted the results of the 2016 election? (At the time of writing, votes for the 2020 election are still being counted).&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;div id=&#34;gathering-data&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Gathering Data&lt;/h1&gt;
&lt;p&gt;Throughout this post, a number of different libraries will be used, as outputs will include plots, maps, and tables:&lt;/p&gt;
&lt;div id=&#34;loading-libraries&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Loading Libraries&lt;/h2&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(rvest) #Web-Scraping
library(tidyverse) #Data Cleaning and Plotting
library(janitor) #Data Cleaning 
library(sf) #Manipulate Geographic Objects
library(httr) #Used to Download Excel File from Web
library(readxl) #Read in Excel Files
library(kableExtra) #Create HTML Tables&lt;/code&gt;&lt;/pre&gt;
&lt;div id=&#34;getting-the-us-population-by-state-in-1790&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Getting the US Population by State in 1790&lt;/h3&gt;
&lt;p&gt;Data from the 1790 US Census will be gathered from &lt;a href=&#34;https://en.wikipedia.org/wiki/1790_United_States_Census&#34;&gt;Wikipedia&lt;/a&gt; and scraped using the &lt;code&gt;rvest&lt;/code&gt; package. In the following code block, all &lt;em&gt;table&lt;/em&gt; tags will be extracted from the webpage, and then I guessed and checked until I found the table I was looking for (in this case, the 3rd table). The &lt;code&gt;html_table()&lt;/code&gt; function converts the HTML table into a data frame and &lt;code&gt;clean_names()&lt;/code&gt; from the &lt;code&gt;janitor&lt;/code&gt; package will change the column headers into an R friendly format.&lt;/p&gt;
&lt;p&gt;Finally, &lt;code&gt;stringr::str_remove_all()&lt;/code&gt; will use regular expressions to remove the footnote notation “[X]” from the totals and &lt;code&gt;readr::parse_number()&lt;/code&gt; will convert the character variable with commas into a numeric.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;us_pop_1790 &amp;lt;- read_html(&amp;#39;https://en.wikipedia.org/wiki/1790_United_States_Census&amp;#39;) %&amp;gt;%
  html_nodes(&amp;quot;table&amp;quot;) %&amp;gt;% 
  .[[3]] %&amp;gt;% 
  html_table() %&amp;gt;% 
  clean_names() %&amp;gt;% 
  filter(state_or_territory != &amp;#39;Total&amp;#39;) %&amp;gt;% 
  transmute(
    state = state_or_territory,
    population_1790 = str_remove_all(total, &amp;#39;\\[.+\\]&amp;#39;) %&amp;gt;% 
      parse_number(),
    population_percent_1790 = population_1790/sum(population_1790)
  )&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;getting-us-population-by-state-in-2019&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Getting US Population by State in 2019&lt;/h3&gt;
&lt;p&gt;A similar process will be used to get the population estimates for 2019 from &lt;a href=&#34;https://simple.wikipedia.org/wiki/List_of_U.S._states_by_population&#34;&gt;Wikipedia&lt;/a&gt;. In this case there is only 1 table on the page so &lt;code&gt;html_node(&#39;table&#39;)&lt;/code&gt; can be used rather than &lt;code&gt;html_nodes(&#39;table&#39;)&lt;/code&gt; like in the above code block for 1790.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;us_pop_2019 &amp;lt;- read_html(&amp;#39;https://simple.wikipedia.org/wiki/List_of_U.S._states_by_population&amp;#39;) %&amp;gt;% 
  html_node(&amp;#39;table&amp;#39;) %&amp;gt;% 
  html_table() %&amp;gt;% 
  clean_names() %&amp;gt;% 
  filter(!is.na(estimated_population_per_electoral_vote_2019_note_2),
         !estimated_population_per_electoral_vote_2019_note_2 %in% c(&amp;#39;&amp;#39;, &amp;#39;—&amp;#39;),
         rank_in_states_territories_2010 != &amp;#39;—&amp;#39;) %&amp;gt;%
  transmute(
    state,
    population_2019 = parse_number(population_estimate_july_1_2019_2),
    population_percent_2019 = population_2019 / sum(population_2019)
    )&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;getting-of-electoral-votes-for-each-state-by-year&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Getting # of Electoral Votes for Each State by Year&lt;/h3&gt;
&lt;p&gt;Finally, the table containing the number of electoral votes by state and year will be extracted from Wikipedia. New code pieces for this code block are selecting columns by number in the &lt;code&gt;dplyr::select()&lt;/code&gt; and &lt;code&gt;dplyr::rename()&lt;/code&gt; calls, as well as &lt;code&gt;dplyr::across()&lt;/code&gt;, which in this context is a replacement for &lt;code&gt;mutate_if&lt;/code&gt;, &lt;code&gt;mutate_at&lt;/code&gt;, and &lt;code&gt;mutate_all&lt;/code&gt;. Here I tell &lt;code&gt;mutate()&lt;/code&gt; to take all variables that start with &lt;em&gt;“electoral_votes”&lt;/em&gt; and apply the &lt;code&gt;readr::parse_number()&lt;/code&gt; function to them, keeping the names the same. We’ll use this data set later on.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;electoral_votes &amp;lt;- read_html(&amp;#39;https://en.wikipedia.org/wiki/United_States_Electoral_College&amp;#39;) %&amp;gt;% 
  html_nodes(&amp;quot;table&amp;quot;) %&amp;gt;% 
  .[[5]] %&amp;gt;% 
  html_table(fill = T) %&amp;gt;% 
  select(2, 4, 36) %&amp;gt;% 
  filter(!Electionyear %in% c(&amp;#39;Total&amp;#39;, &amp;#39;Electionyear&amp;#39;, &amp;quot;State&amp;quot;)) %&amp;gt;% 
  rename(state = 1, electoral_votes_1792 = 2, electoral_votes_2020 = 3) %&amp;gt;% 
  mutate(across(starts_with(&amp;#39;electoral_votes&amp;#39;), parse_number))&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;q1-do-states-today-represent-smaller-proportions-of-the-population-than-they-did-when-the-electoral-college-was-formed&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Q1: Do states today represent smaller proportions of the population than they did when the Electoral College was formed?&lt;/h2&gt;
&lt;p&gt;My hypothesis is that the electoral college has become less effective because we’ve added too many small states that reflect minor amounts of the US population and that when the Electoral College was established the population distributions of states were more similar.&lt;/p&gt;
&lt;p&gt;To check this I’ll be comparing the distributions of State populations as a % of the Total US Population for 1790 and 2019. One note before getting into the code is that in the article for the 1790 state population, Maine is given its own row. However, Maine was a part of Massachusetts until 1820, and since we’re more focused on “voting blocks” rather than states, I will merge Maine into Massachusetts.&lt;/p&gt;
&lt;p&gt;For this next code block, I join the two population data sets together, summarizing all numeric variables so that the Maine rows roll up into Massachusetts. Then, I melt the population percentages by year into a long-form data frame. Finally, I extract the numeric year from the variable names and compare the box plots of the % of Total Population for each State from 1790 and 2019.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;us_pop_2019 %&amp;gt;% 
  left_join(
    us_pop_1790 %&amp;gt;% 
      mutate(state = if_else(state == &amp;#39;Maine&amp;#39;, &amp;#39;Massachusetts&amp;#39;, state)) %&amp;gt;% 
      group_by(state) %&amp;gt;% 
      summarize(across(where(is.numeric), sum)),
    by = &amp;quot;state&amp;quot;
  ) %&amp;gt;% 
  pivot_longer(
    cols = c(contains(&amp;quot;percent&amp;quot;)),
    names_to = &amp;quot;year&amp;quot;,
    values_to = &amp;quot;population_dist&amp;quot;
  ) %&amp;gt;% 
  mutate(year = str_extract(year, &amp;#39;\\d+&amp;#39;) %&amp;gt;% as.integer) %&amp;gt;% 
  ggplot(aes(x = fct_rev(factor(year)), y = population_dist, 
             fill = factor(year))) + 
    geom_boxplot() + 
    labs(x = &amp;quot;Year&amp;quot;, y = &amp;quot;Population Distribution&amp;quot;, 
         title = &amp;quot;State Population Distribution by % of US Population&amp;quot;) +
    annotate(&amp;#39;linerange&amp;#39;, y = 1/nrow(us_pop_2019), 
             xmin = .6, xmax = 1.45, lty = 2) + 
    annotate(&amp;#39;linerange&amp;#39;, y = 1/(nrow(us_pop_1790)-1), 
             xmin = 1.6, xmax = 2.45, lty = 2) + 
    scale_y_continuous(label = scales::percent_format(accuracy = 1)) + 
    scale_fill_discrete(guide = F) +
    coord_flip() +
    theme_bw()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;index_files/figure-html/population_changes-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;In the chart above we’re looking at the distribution of states by the % of the total US population they make up. The dashed lines represent the expected values if all states had the same amount. For example, there are 51 “voting bodies” that make up 100% of the US population, so the “expected” amount would be 1/51 or 2.0%. In 1790, the largest state made up 19.2% and the smallest state made up 1.5% of the total population. In 2019, the largest state makes up 12% of the total population and the smallest makes up 0.2% of the total population.&lt;/p&gt;
&lt;p&gt;Some of this is due to having more states, which means the same 100% is being cut into more pieces. Another way to see whether states make up smaller pieces of the population today than in the past is to compare the data to those expected values from before. In the case of 1790, there are 15 voting bodies, so on average we’d expect each state to make up 6.7%. Looking at the distribution of the states in 1790, 60% are below the expected amount of 6.7%, compared with the distribution in 2019, where 67% are below the expected amount of 2.0%.&lt;/p&gt;
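&lt;p&gt;As a quick illustrative check (a sketch, not part of the original analysis), the equal-share “expected” values referenced above can be computed directly from the counts of voting bodies:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Equal-share benchmark: 51 voting bodies in 2019, 15 in 1790
scales::percent(1/51, accuracy = 0.1) # ~2.0%
scales::percent(1/15, accuracy = 0.1) # ~6.7%&lt;/code&gt;&lt;/pre&gt;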
&lt;p&gt;When asking whether or not there are more small states in 2019 vs. 1790, I find that 28 of the 51 states (with DC) [55%] have a % of the US Population smaller than the minimum state from 1790 [1.5%]. These 28 states make up 141 or 26% of the 538 electoral votes.&lt;/p&gt;
&lt;p&gt;So while there’s not a large difference between actual and expected it does seem that we have a greater concentration of smaller population states now than when the electoral college was first established based on the concentration that make up less than 1.5% of the US population.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;q2.-how-could-states-be-combined-to-ensure-each-voting-group-meets-a-minimum-population-threshold&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Q2. How could states be combined to ensure each “voting group” meets a minimum population threshold?&lt;/h2&gt;
&lt;p&gt;The fact that 55% of states have a % of 2019 US Population smaller than the smallest percentage in 1790 gives promise to the idea that combining states could be feasible. So for this exercise, &lt;strong&gt;I’ll combine states together in order to ensure that each group has at least a minimum of 1.5% of the US Population&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Originally I had wanted to come up with a cool algorithm to find the optimal solution to ensure that each state group hit the 1.5% while taking into account the location of the states being combined and the political culture of the states… but alas I couldn’t figure out how to do it. So I combined the states manually taking into account geography but completely ignoring how states usually vote. In my new construction the following states get combined:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Alaska &amp;amp; Oregon&lt;/li&gt;
&lt;li&gt;Arkansas &amp;amp; Mississippi&lt;/li&gt;
&lt;li&gt;Connecticut &amp;amp; Rhode Island&lt;/li&gt;
&lt;li&gt;Washington DC, Delaware, and West Virginia&lt;/li&gt;
&lt;li&gt;Hawaii &amp;amp; Nevada&lt;/li&gt;
&lt;li&gt;Iowa &amp;amp; Nebraska&lt;/li&gt;
&lt;li&gt;Idaho, Montana, North Dakota, South Dakota, and Wyoming&lt;/li&gt;
&lt;li&gt;Kansas &amp;amp; Oklahoma&lt;/li&gt;
&lt;li&gt;New Hampshire, Maine, and Vermont&lt;/li&gt;
&lt;li&gt;New Mexico &amp;amp; Utah&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;new_groupings &amp;lt;- us_pop_2019 %&amp;gt;% 
  mutate(
    state = if_else(state == &amp;#39;D.C.&amp;#39;, &amp;#39;District of Columbia&amp;#39;, state),
    new_grouping = case_when(
      state %in% c(&amp;#39;New Hampshire&amp;#39;, &amp;#39;Maine&amp;#39;, &amp;#39;Vermont&amp;#39;) ~ &amp;#39;NH/ME/VT&amp;#39;,
      state %in% c(&amp;#39;Rhode Island&amp;#39;, &amp;#39;Connecticut&amp;#39;) ~ &amp;#39;CT/RI&amp;#39;,
      state %in% c(&amp;#39;West Virginia&amp;#39;, &amp;#39;Delaware&amp;#39;, &amp;#39;District of Columbia&amp;#39;) ~ 
        &amp;#39;DC/DE/WV&amp;#39;,
      state %in% c(&amp;#39;Alaska&amp;#39;, &amp;#39;Oregon&amp;#39;) ~ &amp;#39;AK/OR&amp;#39;,
      state %in% c(&amp;#39;Utah&amp;#39;, &amp;#39;New Mexico&amp;#39;) ~ &amp;#39;NM/UT&amp;#39;,
      state %in% c(&amp;#39;Hawaii&amp;#39;, &amp;#39;Nevada&amp;#39;) ~ &amp;#39;HI/NV&amp;#39;,
      state %in% c(&amp;#39;Idaho&amp;#39;, &amp;#39;Montana&amp;#39;, &amp;#39;North Dakota&amp;#39;, 
                   &amp;#39;South Dakota&amp;#39;, &amp;#39;Wyoming&amp;#39;) ~ &amp;#39;ID/MT/ND/SD/WY&amp;#39;,
      state %in% c(&amp;#39;Iowa&amp;#39;, &amp;#39;Nebraska&amp;#39;) ~ &amp;#39;IA/NE&amp;#39;,
      state %in% c(&amp;#39;Arkansas&amp;#39;, &amp;#39;Mississippi&amp;#39;) ~ &amp;#39;AR/MS&amp;#39;,
      state %in% c(&amp;#39;Oklahoma&amp;#39;, &amp;#39;Kansas&amp;#39;) ~ &amp;#39;KS/OK&amp;#39;,
      TRUE ~ state
    )
  )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To display this brave new world, I will construct a map that shows my new compressed electoral map and the resulting changes in the number of electoral votes. The first step is adding the electoral votes into the data frame constructed in the last code block:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;new_groupings &amp;lt;- new_groupings %&amp;gt;% 
  left_join(
    electoral_votes %&amp;gt;% 
      transmute(state = if_else(state == &amp;#39;D.C.&amp;#39;, &amp;#39;District of Columbia&amp;#39;, state),
                electoral_votes_2020),
    by = &amp;quot;state&amp;quot;
  ) &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Next, I need a mechanism to assign a number of electoral votes to my compressed map. Normally, there are 538 electoral votes representing the 435 voting members of the House of Representatives, the 100 Senators, and 3 additional electoral votes for Washington DC. Since I’m not trying to rock the boat too much, my new system will maintain the 2 votes per group represented by the Senate allocation plus the population-based allocation from the House side. In order to understand and apply this relationship, I’m building a quick and dirty linear regression model to predict the population component of the new number of electoral votes.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;electorial_vote_model &amp;lt;- lm(electoral_votes_2020-2 ~ population_2019, 
                            data = new_groupings)

electorial_vote_model&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 
## Call:
## lm(formula = electoral_votes_2020 - 2 ~ population_2019, data = new_groupings)
## 
## Coefficients:
##     (Intercept)  population_2019  
##     0.094428506      0.000001313&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This model shows that there are roughly 1.313 electoral votes per 1 million people.&lt;/p&gt;
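&lt;p&gt;As a rough sanity check (illustrative, not part of the original post), applying the fitted coefficients by hand to the combined Arkansas/Mississippi population of 5,993,974 and adding back the 2 Senate-style votes gives the allocation that appears for that group later on:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Hand-applied model: intercept + slope * population, plus 2 Senate-style votes
ceiling(0.094428506 + 0.000001313 * 5993974 + 2)
## [1] 10&lt;/code&gt;&lt;/pre&gt;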
&lt;p&gt;To visualize what this new electoral map will look like, I will use the &lt;code&gt;sf&lt;/code&gt; package. While I’m not very familiar with this package (maybe a subject of a future post), I’ve tinkered around with the format before and have found it very compatible with tidy principles.&lt;/p&gt;
&lt;p&gt;The first step is getting a shape file. For the United States, I will leverage the &lt;code&gt;usa_sf&lt;/code&gt; function from the &lt;code&gt;albersusa&lt;/code&gt; package, which will return a map as a simple feature. The “laea” argument specifies the Lambert azimuthal equal-area projection.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;usa &amp;lt;-  albersusa::usa_sf(&amp;quot;laea&amp;quot;) %&amp;gt;% select(name, geometry)

knitr::kable(head(usa))&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th style=&#34;text-align:left;&#34;&gt;
name
&lt;/th&gt;
&lt;th style=&#34;text-align:left;&#34;&gt;
geometry
&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Arizona
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
MULTIPOLYGON (((-1111066 -8…
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Arkansas
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
MULTIPOLYGON (((557903.1 -1…
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
California
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
MULTIPOLYGON (((-1853480 -9…
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Colorado
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
MULTIPOLYGON (((-613452.9 -…
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Connecticut
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
MULTIPOLYGON (((2226838 519…
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
District of Columbia
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
MULTIPOLYGON (((1960720 -41…
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;What makes the magic of the &lt;code&gt;sf&lt;/code&gt; class is that the shape information is contained in the geometry column, but everything else can be operated on like a normal data frame. So for the next step, I’ll join the “state groupings” information to this shape file data using the “name” column from the shape data and the state column from the groupings data.&lt;/p&gt;
&lt;p&gt;Next, I summarize the data to the “combined state groupings” level, where I get the sums of the population and the number of original electoral votes. The notable parts of this code block are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;st_union&lt;/code&gt; which will combine geographic areas from the shape file into new shapes. If you wanted to combine the groups but maintain all original boundaries then &lt;code&gt;st_combine&lt;/code&gt; would be used instead.&lt;/li&gt;
&lt;li&gt;Creating a better label for the combined state names by using &lt;code&gt;paste&lt;/code&gt; in the summarize with the &lt;code&gt;collapse&lt;/code&gt; option which concatenates the states in the aggregation.&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;The final mutate step uses the &lt;code&gt;predict&lt;/code&gt; function to apply the regression model to compute the new electoral vote values for the combined states. Any state that wasn’t combined retained its original number of votes.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Afterwards, the new data set looks like:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;new_usa &amp;lt;- usa %&amp;gt;% 
  left_join(new_groupings %&amp;gt;% 
              transmute(state, 
                        new_grouping, 
                        population_2019, 
                        electoral_votes_2020
                        ), 
            by = c(&amp;quot;name&amp;quot; = &amp;quot;state&amp;quot;)
  ) %&amp;gt;% 
  group_by(new_grouping) %&amp;gt;% 
  summarize(
    geom = st_union(geometry),
    population_2019 = sum(population_2019),
    electoral_votes = sum(electoral_votes_2020),
    states = paste(name, collapse = &amp;#39;/&amp;#39;)
  ) %&amp;gt;% 
  mutate(
    new_ev = if_else(
      states == new_grouping,
      electoral_votes,
      ceiling(predict(electorial_vote_model, newdata = .) + 2)
    ),
    lbl = if_else(new_grouping == states, NA_character_, 
                  paste0(new_grouping, &amp;quot;: &amp;quot;, new_ev - electoral_votes)))

knitr::kable(head(new_usa))&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th style=&#34;text-align:left;&#34;&gt;
new_grouping
&lt;/th&gt;
&lt;th style=&#34;text-align:left;&#34;&gt;
geom
&lt;/th&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
population_2019
&lt;/th&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
electoral_votes
&lt;/th&gt;
&lt;th style=&#34;text-align:left;&#34;&gt;
states
&lt;/th&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
new_ev
&lt;/th&gt;
&lt;th style=&#34;text-align:left;&#34;&gt;
lbl
&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
AK/OR
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
MULTIPOLYGON (((-1899337 -2…
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
4949282
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
10
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Oregon/Alaska
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
9
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
AK/OR: -1
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Alabama
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
MULTIPOLYGON (((1145349 -15…
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
4903185
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
9
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Alabama
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
9
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
NA
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
AR/MS
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
MULTIPOLYGON (((1052956 -15…
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
5993974
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
12
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Arkansas/Mississippi
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
10
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
AR/MS: -2
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Arizona
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
MULTIPOLYGON (((-1111066 -8…
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
7278717
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
11
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Arizona
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
11
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
NA
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
California
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
MULTIPOLYGON (((-1853480 -9…
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
39512223
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
55
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
California
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
55
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
NA
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Colorado
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
MULTIPOLYGON (((-613452.9 -…
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
5758736
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
9
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Colorado
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
9
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
NA
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Now we’re ready to plot the map. Plotting &lt;code&gt;sf&lt;/code&gt; geometries works within the &lt;code&gt;ggplot&lt;/code&gt; paradigm: &lt;code&gt;geom_sf&lt;/code&gt; draws the geometries and &lt;code&gt;geom_sf_text&lt;/code&gt; handles the text overlays for the given groups. &lt;code&gt;coord_sf&lt;/code&gt; changes the coordinate system of the plot. Everything else should be familiar from vanilla ggplot.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;new_usa %&amp;gt;% 
ggplot() +
  geom_sf(color = &amp;quot;#2b2b2b&amp;quot;, size=0.125, aes(fill = lbl)) +
  geom_sf_text(aes(label = lbl), check_overlap = T, size = 3) + 
  coord_sf(crs = st_crs(&amp;quot;+proj=laea +lat_0=45 +lon_0=-100 +x_0=0 +y_0=0 +a=6370997 +b=6370997 +units=m +no_defs&amp;quot;), datum = NA) +
  scale_fill_discrete(guide = F, na.value = &amp;quot;grey90&amp;quot;) + 
  labs(title = &amp;quot;Proposed Electoral Map&amp;quot;,
       subtitle = &amp;quot;Combining States so each &amp;#39;Group&amp;#39; makes up at least ~1.5% of US Population&amp;quot;,
       caption = &amp;quot;Number represents the change in Electoral Votes due to combining&amp;quot;) + 
  ggthemes::theme_map() + 
  theme(
    plot.title = element_text(size = 14)
  )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;index_files/figure-html/build_map-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;The states in gray remain unchanged, and the filled-in states represent our new groupings. States that directly border each other have been combined into an “electoral grouping” with a newly assigned number of electoral votes. Since the electoral vote model was based on population, the change in the number of electoral votes comes primarily from the loss of the two Senate votes for each combined state.&lt;/p&gt;
&lt;p&gt;For example, NH/ME/VT originally would have had 11 electoral votes; under the new system it has 7, a net change of -4 from losing the automatic 2 Senate votes of two of the three combined states.&lt;/p&gt;
&lt;p&gt;Under the normal Electoral College there are 538 votes; under this new system that number is reduced to 512.&lt;/p&gt;
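&lt;p&gt;As a back-of-the-envelope check (not the regression-based allocation used above, which predicts votes from population), combining k states into one grouping keeps only a single pair of “Senate” votes, so roughly 2*(k-1) electoral votes disappear:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;## Rough sketch only; the actual new_ev values come from the population model
ev_after_combining &amp;lt;- function(original_evs) {
  sum(original_evs) - 2 * (length(original_evs) - 1)
}

ev_after_combining(c(4, 4, 3)) # NH/ME/VT: 11 votes become 7&lt;/code&gt;&lt;/pre&gt;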
&lt;p&gt;Now that we have our new electoral college, would it have made a difference in 2016?&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;q3-would-this-new-system-have-impacted-the-results-of-the-2016-election&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Q3: Would this new system have impacted the results of the 2016 election?&lt;/h2&gt;
&lt;p&gt;The 2016 election results between Donald Trump and Hillary Clinton are provided in great detail by the &lt;a href=&#34;https://www.fec.gov/documents/1890/federalelections2016.xlsx&#34;&gt;Federal Election Commission&lt;/a&gt;. Surprisingly, it was difficult to find the number of votes by state in an easily consumable format that wouldn’t require recoding all the state names. So the FEC data will have to do, even if it took some complicated data manipulation.&lt;/p&gt;
&lt;p&gt;Since the FEC data comes from an Excel file, I first need to download the file from the FEC website. I’ll use the &lt;code&gt;GET&lt;/code&gt; function from &lt;code&gt;httr&lt;/code&gt; to download the Excel file to a temporary file and then will use &lt;code&gt;read_excel&lt;/code&gt; from &lt;code&gt;readxl&lt;/code&gt; to read in the file.&lt;/p&gt;
&lt;p&gt;Before data manipulation, but after filtering to just Trump and Clinton, the data looks like this:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;GET(&amp;quot;https://www.fec.gov/documents/1890/federalelections2016.xlsx&amp;quot;, 
    write_disk(tf &amp;lt;- tempfile(fileext = &amp;quot;.xlsx&amp;quot;)))&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;results2016 &amp;lt;- read_excel(tf, sheet = &amp;#39;2016 Pres General Results&amp;#39;) %&amp;gt;% 
  clean_names() %&amp;gt;% 
  filter(last_name %in% c(&amp;#39;Trump&amp;#39;, &amp;#39;Clinton&amp;#39;)) %&amp;gt;% 
  select(state, state_abbreviation, last_name, general_results)

knitr::kable(head(results2016, 5))&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th style=&#34;text-align:left;&#34;&gt;
state
&lt;/th&gt;
&lt;th style=&#34;text-align:left;&#34;&gt;
state_abbreviation
&lt;/th&gt;
&lt;th style=&#34;text-align:left;&#34;&gt;
last_name
&lt;/th&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
general_results
&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Alabama
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
AL
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Trump
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
1318255
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Alabama
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
AL
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Clinton
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
729547
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Alaska
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
AK
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Trump
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
163387
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Alaska
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
AK
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Clinton
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
116454
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Arizona
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
AZ
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Trump
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
1252401
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;There was a small data quirk with New York State: because the same candidate can appear on multiple party lines, a single candidate shows up in multiple rows (Clinton appears 4 times and Trump 3). Therefore a first group-by is done to reduce the data to 2 rows per state. Then the data is cast to a wider format, and the electoral votes are added back and allocated to the winning candidate (technically this is wrong since Nebraska and Maine do not use all-or-nothing allocation, but it’s close enough for this exercise).&lt;/p&gt;
&lt;p&gt;Then the data is aggregated to the new electoral groupings from the prior section, and our “new” electoral votes are allocated in an all-or-nothing fashion to the winning candidate.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;results2016 &amp;lt;- results2016 %&amp;gt;% 
  group_by(state, state_abbreviation, last_name) %&amp;gt;% 
  summarize(general_results = sum(general_results, na.rm = T), 
            .groups = &amp;#39;drop&amp;#39;) %&amp;gt;% 
  pivot_wider(
    names_from = &amp;quot;last_name&amp;quot;,
    values_from = &amp;quot;general_results&amp;quot;
  ) %&amp;gt;% 
  left_join(
    new_groupings %&amp;gt;% 
      select(state, new_grouping, electoral_votes_2020, population_2019),
    by = &amp;quot;state&amp;quot;
  ) %&amp;gt;% 
  mutate(trump_ev = (Trump &amp;gt; Clinton)*electoral_votes_2020,
         clinton_ev = (Clinton &amp;gt; Trump)*electoral_votes_2020
  ) %&amp;gt;% 
  group_by(new_grouping) %&amp;gt;% 
  summarize(across(where(is.numeric), sum, na.rm = T),
            states = paste(state, collapse = &amp;#39;/&amp;#39;)) %&amp;gt;% 
  mutate(new_ev = if_else(
              states == new_grouping,
              electoral_votes_2020,
              ceiling(predict(electorial_vote_model, newdata = .) + 2)
            )) %&amp;gt;% 
  mutate(
    new_trump_ev = if_else(Trump &amp;gt; Clinton, new_ev, 0),
    new_clinton_ev = if_else(Trump &amp;lt; Clinton, new_ev, 0)
  )

knitr::kable(head(results2016, 5))&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th style=&#34;text-align:left;&#34;&gt;
new_grouping
&lt;/th&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
Clinton
&lt;/th&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
Trump
&lt;/th&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
electoral_votes_2020
&lt;/th&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
population_2019
&lt;/th&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
trump_ev
&lt;/th&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
clinton_ev
&lt;/th&gt;
&lt;th style=&#34;text-align:left;&#34;&gt;
states
&lt;/th&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
new_ev
&lt;/th&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
new_trump_ev
&lt;/th&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
new_clinton_ev
&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
AK/OR
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
1118560
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
945790
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
10
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
4949282
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
3
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
7
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Alaska/Oregon
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
9
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
9
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Alabama
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
729547
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
1318255
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
9
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
4903185
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
9
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Alabama
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
9
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
9
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
AR/MS
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
865625
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
1385586
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
12
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
5993974
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
12
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Arkansas/Mississippi
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
10
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
10
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Arizona
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
1161167
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
1252401
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
11
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
7278717
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
11
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Arizona
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
11
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
11
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
California
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
8753792
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
4483814
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
55
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
39512223
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
55
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
California
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
55
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
55
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Finally, to visualize the difference in electoral votes between the actual 2016 results and our new 2016 results, the prior data set is summarized and reshaped back into a tidy format with proper labeling. The plot itself is a simple stacked barplot.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;results2016 %&amp;gt;% 
  summarize(across(contains(c(&amp;quot;trump_ev&amp;quot;, &amp;quot;clinton_ev&amp;quot;)), sum)) %&amp;gt;% 
  pivot_longer(cols = everything(),
               names_to = &amp;#39;variable&amp;#39;,
               values_to = &amp;#39;electoral_votes&amp;#39;) %&amp;gt;% 
  group_by(str_detect(variable, &amp;#39;new&amp;#39;)) %&amp;gt;% 
  mutate(
    percents = electoral_votes/sum(electoral_votes),
    old_v_new = if_else(str_detect(variable, &amp;#39;new&amp;#39;), &amp;#39;New EC&amp;#39;, &amp;#39;Original EC&amp;#39;),
    candidate = case_when(
       str_detect(variable, &amp;#39;trump&amp;#39;) ~ &amp;quot;trump&amp;quot;,
       str_detect(variable, &amp;#39;clinton&amp;#39;) ~ &amp;#39;clinton&amp;#39;,
       TRUE ~ &amp;#39;total&amp;#39;
     ),
    lbl = paste0(electoral_votes, 
                 &amp;#39;\n(&amp;#39;, 
                 scales::percent(percents, accuracy = .1) ,&amp;#39;)&amp;#39;)
  ) %&amp;gt;% 
   ggplot(aes(y = old_v_new, x = percents, fill = candidate)) +
    geom_col(width = .5) +
    geom_text(aes(label = lbl), position = position_stack(vjust = .5)) + 
    geom_vline(xintercept = .5, lty = 2) + 
    scale_x_continuous(label = scales::percent, expand = c(0,0)) + 
    scale_fill_manual(values = c(&amp;#39;clinton&amp;#39; = &amp;#39;blue&amp;#39;, &amp;#39;trump&amp;#39; = &amp;#39;red&amp;#39;)) + 
    guides(fill = guide_legend(reverse = T)) + 
    labs(x = &amp;quot;% of Electoral Vote&amp;quot;,
         y = &amp;quot;&amp;quot;,
         title = &amp;quot;Comparing 2016 Election Results in the Original vs. New System&amp;quot;,
         fill = &amp;quot;&amp;quot;) + 
    cowplot::theme_cowplot() + 
    theme(
      plot.title.position = &amp;#39;plot&amp;#39;,
      axis.line = element_blank(),
      axis.ticks.x = element_blank(),
      axis.text.x = element_blank()
    )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;index_files/figure-html/election_pt3-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;With the new electoral grouping system, the net change in the percentage of electoral votes was only 0.3%, so the overall result wouldn’t have changed.&lt;/p&gt;
&lt;div id=&#34;what-actually-changed-in-the-new-system&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;What Actually Changed in the New System?&lt;/h3&gt;
&lt;p&gt;The final question is &lt;strong&gt;how did the electoral votes change between the old system and the new system?&lt;/strong&gt; The &lt;code&gt;tbl_dt&lt;/code&gt; data frame restructures the data into the table format, keeping only rows for groupings where the number of electoral votes differs, and creates labels that include the “+” and “-” symbols.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;tbl_dt &amp;lt;- results2016 %&amp;gt;% 
  filter(trump_ev != new_trump_ev | clinton_ev != new_clinton_ev) %&amp;gt;% 
  transmute(
    new_grouping,
    clinton_delta = (new_clinton_ev - clinton_ev),
    trump_delta = (new_trump_ev - trump_ev),
    clinton_lbl = paste0(
      if_else(clinton_delta &amp;gt; 0, &amp;quot;+&amp;quot;, &amp;quot;&amp;quot;),
      clinton_delta
    ),
    trump_lbl = paste0(
      if_else(trump_delta &amp;gt; 0, &amp;quot;+&amp;quot;, &amp;quot;&amp;quot;),
      trump_delta
    )
  ) %&amp;gt;%
  select(new_grouping, clinton_lbl, trump_lbl)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To complete the table visualization I’m using the &lt;code&gt;kableExtra&lt;/code&gt; package. The &lt;code&gt;kable_paper&lt;/code&gt; function applies a style preset, and the two uses of &lt;code&gt;column_spec&lt;/code&gt; set the cell background to red or green when the label constructed above is non-zero, and white otherwise (which appears blank). This was my first experience with &lt;code&gt;kableExtra&lt;/code&gt;, and while I’m happy I was able to get the output the way I wanted, I found certain parts of the syntax a little frustrating.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;tbl_dt %&amp;gt;% 
  kbl(align = c(&amp;#39;l&amp;#39;, &amp;#39;c&amp;#39;, &amp;#39;c&amp;#39;),
      col.names = c(&amp;#39;&amp;#39;, &amp;#39;Clinton&amp;#39;, &amp;#39;Trump&amp;#39;),
      caption = &amp;quot;Election 2016: Candidate&amp;#39;s Change in Electoral Votes&amp;quot;) %&amp;gt;% 
  kable_paper(full_width = F) %&amp;gt;% 
  column_spec(2, color = &amp;#39;white&amp;#39;, background = case_when(
    str_detect(tbl_dt$clinton_lbl, &amp;quot;\\+&amp;quot;) ~ &amp;#39;green&amp;#39;,
    str_detect(tbl_dt$clinton_lbl, &amp;quot;\\-&amp;quot;) ~ &amp;#39;red&amp;#39;,
    TRUE ~ &amp;#39;white&amp;#39;
    )
  ) %&amp;gt;% 
  column_spec(3, color = &amp;#39;white&amp;#39;, background = case_when(
    str_detect(tbl_dt$trump_lbl, &amp;quot;\\+&amp;quot;) ~ &amp;#39;green&amp;#39;,
    str_detect(tbl_dt$trump_lbl, &amp;quot;\\-&amp;quot;) ~ &amp;#39;red&amp;#39;,
    TRUE ~ &amp;#39;white&amp;#39;
    )
  )&lt;/code&gt;&lt;/pre&gt;
&lt;table class=&#34; lightable-paper&#34; style=&#39;font-family: &#34;Arial Narrow&#34;, arial, helvetica, sans-serif; width: auto !important; margin-left: auto; margin-right: auto;&#39;&gt;
&lt;caption&gt;
&lt;span id=&#34;tab:tabletime&#34;&gt;Table 1: &lt;/span&gt;Election 2016: Candidate’s Change in Electoral Votes
&lt;/caption&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th style=&#34;text-align:left;&#34;&gt;
&lt;/th&gt;
&lt;th style=&#34;text-align:center;&#34;&gt;
Clinton
&lt;/th&gt;
&lt;th style=&#34;text-align:center;&#34;&gt;
Trump
&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
AK/OR
&lt;/td&gt;
&lt;td style=&#34;text-align:center;color: white !important;background-color: green !important;&#34;&gt;
+2
&lt;/td&gt;
&lt;td style=&#34;text-align:center;color: white !important;background-color: red !important;&#34;&gt;
-3
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
AR/MS
&lt;/td&gt;
&lt;td style=&#34;text-align:center;color: white !important;background-color: white !important;&#34;&gt;
0
&lt;/td&gt;
&lt;td style=&#34;text-align:center;color: white !important;background-color: red !important;&#34;&gt;
-2
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
CT/RI
&lt;/td&gt;
&lt;td style=&#34;text-align:center;color: white !important;background-color: red !important;&#34;&gt;
-2
&lt;/td&gt;
&lt;td style=&#34;text-align:center;color: white !important;background-color: white !important;&#34;&gt;
0
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
DC/DE/WV
&lt;/td&gt;
&lt;td style=&#34;text-align:center;color: white !important;background-color: green !important;&#34;&gt;
+1
&lt;/td&gt;
&lt;td style=&#34;text-align:center;color: white !important;background-color: red !important;&#34;&gt;
-5
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
HI/NV
&lt;/td&gt;
&lt;td style=&#34;text-align:center;color: white !important;background-color: red !important;&#34;&gt;
-2
&lt;/td&gt;
&lt;td style=&#34;text-align:center;color: white !important;background-color: white !important;&#34;&gt;
0
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
IA/NE
&lt;/td&gt;
&lt;td style=&#34;text-align:center;color: white !important;background-color: white !important;&#34;&gt;
0
&lt;/td&gt;
&lt;td style=&#34;text-align:center;color: white !important;background-color: red !important;&#34;&gt;
-2
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
ID/MT/ND/SD/WY
&lt;/td&gt;
&lt;td style=&#34;text-align:center;color: white !important;background-color: white !important;&#34;&gt;
0
&lt;/td&gt;
&lt;td style=&#34;text-align:center;color: white !important;background-color: red !important;&#34;&gt;
-7
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
KS/OK
&lt;/td&gt;
&lt;td style=&#34;text-align:center;color: white !important;background-color: white !important;&#34;&gt;
0
&lt;/td&gt;
&lt;td style=&#34;text-align:center;color: white !important;background-color: red !important;&#34;&gt;
-1
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
NH/ME/VT
&lt;/td&gt;
&lt;td style=&#34;text-align:center;color: white !important;background-color: red !important;&#34;&gt;
-4
&lt;/td&gt;
&lt;td style=&#34;text-align:center;color: white !important;background-color: white !important;&#34;&gt;
0
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
NM/UT
&lt;/td&gt;
&lt;td style=&#34;text-align:center;color: white !important;background-color: red !important;&#34;&gt;
-5
&lt;/td&gt;
&lt;td style=&#34;text-align:center;color: white !important;background-color: green !important;&#34;&gt;
+4
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;In most cases, votes were lost by combining smaller states into these groupings, but in a few instances the combination changed who won the popular vote. For example, the Alaska/Oregon grouping originally had 10 electoral votes (3 from Alaska, which went to Trump, and 7 from Oregon, which went to Clinton). The grouping lost one vote in the combination, and the combined Oregon/Alaska went to Clinton overall. Therefore, Clinton gets all 9 new electoral votes (+2 over her initial 7) and Trump loses the 3 he had from Alaska.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;wrapping-up&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Wrapping Up&lt;/h2&gt;
&lt;p&gt;Back at the beginning of this analysis I hypothesized that the Electoral College had become more over-weighted toward smaller states than it was in the 1790s, during its early days. By comparing states’ shares of the US population in 1790 vs. 2019, I showed that this was true, although not massively so.&lt;/p&gt;
&lt;p&gt;I proposed an idea to revise the Electoral College by combining states to ensure that each grouping makes up at least 1.5% of the US population, the smallest share of population in 1790. This reduced the overall number of electoral votes due to the loss of the automatic 2 votes per state for the combined states.&lt;/p&gt;
&lt;p&gt;Finally, I applied my new Electoral College to the 2016 election… it made almost no difference.&lt;/p&gt;
&lt;p&gt;So overall, this thought exercise was fun to work through but it winds up being an incredibly small change to the results from the current system.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Sequence Mining My Browsing History with arulesSequences</title>
      <link>https://jlaw.netlify.app/2020/11/01/sequence-mining-my-browsing-history-with-arulessequences/</link>
      <pubDate>Sun, 01 Nov 2020 00:00:00 +0000</pubDate>
      <guid>https://jlaw.netlify.app/2020/11/01/sequence-mining-my-browsing-history-with-arulessequences/</guid>
      <description>


&lt;p&gt;When people think of pattern mining, they tend to think of Market Basket Analysis, with the conventional example that people often buy both beer and diapers in the same trip. When order doesn’t matter this is called Association Rules Mining, implemented by the &lt;code&gt;arules&lt;/code&gt; package in R. In that example, the person is buying &lt;strong&gt;both&lt;/strong&gt; diapers and beer; it doesn’t really matter whether diapers led to the beer purchase or beer led to the diaper purchase. However, there are instances where the order of events is important to what we’d consider a pattern. For example, “cause and effect” relationships imply order: putting your hand on a hot stove leads to burning your hand, while the reverse direction, burning your hand leading you to put your hand on a hot stove, makes less sense. When the notion of order is applied to association rules mining it becomes “Sequence Mining”. To do this, we’ll use the &lt;code&gt;arulesSequences&lt;/code&gt; package to run the &lt;a href=&#34;https://link.springer.com/article/10.1023/A:1007652502315&#34;&gt;cSPADE&lt;/a&gt; algorithm.&lt;/p&gt;
&lt;p&gt;Unfortunately, I don’t have access to grocery store data or much other data that would make an interesting use case for sequence mining. But I do have access to my own browsing history. So for this post, I’ll be looking for common sequential patterns in my own web browsing habits.&lt;/p&gt;
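&lt;p&gt;As a preview of the workflow, here is a minimal toy sketch of running cSPADE with &lt;code&gt;arulesSequences&lt;/code&gt; (the items, IDs, and support threshold are made up for illustration):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(arulesSequences)

## Toy data: sequenceID identifies a session, eventID orders events within it
toy &amp;lt;- data.frame(items = factor(c(&amp;#39;news&amp;#39;, &amp;#39;mail&amp;#39;, &amp;#39;news&amp;#39;, &amp;#39;blog&amp;#39;, &amp;#39;mail&amp;#39;)))
trans &amp;lt;- as(toy, &amp;#39;transactions&amp;#39;)
transactionInfo(trans)$sequenceID &amp;lt;- c(1, 1, 1, 2, 2)
transactionInfo(trans)$eventID &amp;lt;- c(1, 2, 3, 1, 2)

## Mine frequent sequences appearing in at least half of the sequences
seqs &amp;lt;- cspade(trans, parameter = list(support = 0.5))
as(seqs, &amp;#39;data.frame&amp;#39;)&lt;/code&gt;&lt;/pre&gt;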
&lt;div id=&#34;getting-the-data&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Getting the Data&lt;/h2&gt;
&lt;p&gt;I wasn’t able to figure out how to extract my browsing history directly from Chrome in a way that could easily be read into R. However, there are third-party programs that can extract browsing histories. In this case, I used a program called &lt;a href=&#34;https://www.nirsoft.net/utils/browsing_history_view.html&#34;&gt;BrowsingHistoryView&lt;/a&gt; by Nir Sofer. The interface is very straightforward and allowed me to extract my browsing history to a CSV file.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;chromehistorytool.PNG&#34; /&gt;&lt;/p&gt;
&lt;p&gt;From this program I was able to extract 85 days’ worth of browsing history, from 2020-06-13 through 2020-09-09.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;loading-libraries-and-reading-in-data&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Loading Libraries and Reading in Data&lt;/h2&gt;
&lt;p&gt;The libraries used in this analysis are the usual gang of &lt;code&gt;tidyverse&lt;/code&gt;, &lt;code&gt;lubridate&lt;/code&gt;, and &lt;code&gt;ggtext&lt;/code&gt; that are often used on this blog. Some new ones specific to this analysis are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;arulesSequences&lt;/code&gt; - Which will run the sequence mining algorithm&lt;/li&gt;
&lt;li&gt;&lt;code&gt;tidygraph&lt;/code&gt; and &lt;code&gt;ggraph&lt;/code&gt; - Which will allow for plotting my browsing history as a directed graph&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidyverse) #Data Manipulation and Plotting
library(lubridate) #Date Manipulation
library(arulesSequences) #Running the Sequence mining algorithm
library(ggtext) #Adding some flair to plots
library(tidygraph)  ## Creating a Graph Structure
library(ggraph) ## Plotting the Network Graph Structure&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A .csv file was exported from the BrowsingHistoryView software and read into R through &lt;code&gt;readr&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;browsing_history &amp;lt;- read_csv(&amp;#39;browsing_history_v2.csv&amp;#39;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The read-in data looks as follows:&lt;/p&gt;
&lt;table style=&#34;width:100%;&#34;&gt;
&lt;colgroup&gt;
&lt;col width=&#34;41%&#34; /&gt;
&lt;col width=&#34;14%&#34; /&gt;
&lt;col width=&#34;7%&#34; /&gt;
&lt;col width=&#34;4%&#34; /&gt;
&lt;col width=&#34;4%&#34; /&gt;
&lt;col width=&#34;3%&#34; /&gt;
&lt;col width=&#34;3%&#34; /&gt;
&lt;col width=&#34;2%&#34; /&gt;
&lt;col width=&#34;4%&#34; /&gt;
&lt;col width=&#34;5%&#34; /&gt;
&lt;col width=&#34;8%&#34; /&gt;
&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;URL&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;Title&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;Visited On&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;Visit Count&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;Typed Count&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;Referrer&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;Visit ID&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;Profile&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;URL Length&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;Transition Type&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;Transition Qualifiers&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;&lt;a href=&#34;https://watch.wwe.com/original/undertaker-the-last-ride-134576&#34; class=&#34;uri&#34;&gt;https://watch.wwe.com/original/undertaker-the-last-ride-134576&lt;/a&gt;&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;wwe network - undertaker: the last ride&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;6/13/2020 2:59:23 PM&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NA&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;331141&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Default&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;62&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Typed&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Chain Start,Chain End&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;&lt;a href=&#34;https://watch.wwe.com/original/undertaker-the-last-ride-134576&#34; class=&#34;uri&#34;&gt;https://watch.wwe.com/original/undertaker-the-last-ride-134576&lt;/a&gt;&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;wwe network - undertaker: the last ride&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;6/13/2020 2:59:28 PM&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NA&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;331142&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Default&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;62&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Link&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Chain Start,Chain End&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;&lt;a href=&#34;https://www.google.com/search?q=vtt+to+srt&amp;amp;oq=vtt+to+srt&amp;amp;aqs=chrome.0.69i59j0l7.1395j0j4&amp;amp;sourceid=chrome&amp;amp;ie=utf-8&#34; class=&#34;uri&#34;&gt;https://www.google.com/search?q=vtt+to+srt&amp;amp;oq=vtt+to+srt&amp;amp;aqs=chrome.0.69i59j0l7.1395j0j4&amp;amp;sourceid=chrome&amp;amp;ie=utf-8&lt;/a&gt;&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;vtt to srt - google search&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;6/13/2020 4:33:34 PM&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NA&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;331157&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Default&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;113&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Generated&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Chain Start,Chain End&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;&lt;a href=&#34;https://www.google.com/search?q=vtt+to+srt&amp;amp;oq=vtt+to+srt&amp;amp;aqs=chrome.0.69i59j0l7.1395j0j4&amp;amp;sourceid=chrome&amp;amp;ie=utf-8&#34; class=&#34;uri&#34;&gt;https://www.google.com/search?q=vtt+to+srt&amp;amp;oq=vtt+to+srt&amp;amp;aqs=chrome.0.69i59j0l7.1395j0j4&amp;amp;sourceid=chrome&amp;amp;ie=utf-8&lt;/a&gt;&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;vtt to srt - google search&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;6/13/2020 4:33:37 PM&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NA&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;331158&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Default&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;113&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Link&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Chain Start,Chain End&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;&lt;a href=&#34;https://twitter.com/&#34; class=&#34;uri&#34;&gt;https://twitter.com/&lt;/a&gt;&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;home / twitter&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;6/13/2020 5:19:55 PM&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;98&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;90&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NA&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;331167&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Default&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;20&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Typed&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Chain Start,Chain End&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;&lt;a href=&#34;https://twitter.com/home&#34; class=&#34;uri&#34;&gt;https://twitter.com/home&lt;/a&gt;&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;home / twitter&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;6/13/2020 5:20:03 PM&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;414&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NA&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;331168&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Default&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;24&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Link&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Chain Start,Chain End&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Looking at the data, there are a number of cleaning steps needed to make the sequence mining more useful:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;The variable names are not machine-friendly and contain spaces.&lt;/li&gt;
&lt;li&gt;Some of the URLs are redirects or generated and therefore not URLs I specifically went to. I’ll want to exclude those.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Visited On&lt;/em&gt; is a character rather than a date.&lt;/li&gt;
&lt;li&gt;If we’re looking for common patterns, I should limit the URLs to just their domains, as it’s very unlikely that I would read the same news article multiple times.
&lt;ul&gt;
&lt;li&gt;Therefore I’ll shorten “&lt;a href=&#34;https://twitter.com/home&#34; class=&#34;uri&#34;&gt;https://twitter.com/home&lt;/a&gt;” to just “twitter.com/”&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The following code block carries out the cleaning steps outlined above:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;browsing_history_cleaned &amp;lt;- browsing_history %&amp;gt;% 
  #Make the names more R friendly
  janitor::clean_names() %&amp;gt;%
  #Subset to URLs I either typed or 
  #Linked to (excluding redirects/form submissions)
  filter(transition_type %in% c(&amp;#39;Link&amp;#39;, &amp;#39;Typed&amp;#39;),
         str_detect(transition_qualifiers, &amp;#39;Chain Start&amp;#39;)
         )%&amp;gt;% 
  #Keep Only the Base URL and remove the prefix
  mutate(base_url = str_remove(url, &amp;#39;^https?:\\/\\/&amp;#39;) %&amp;gt;% 
           str_remove(&amp;#39;^www\\.&amp;#39;) %&amp;gt;% 
           str_extract(., &amp;#39;^.+?\\/&amp;#39;),
         #Parse the Date Format
         dttm = mdy_hms(visited_on),
         ds = as.Date(dttm)
  ) %&amp;gt;% 
  select(base_url, dttm, title, ds)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The above block:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Uses &lt;code&gt;janitor::clean_names()&lt;/code&gt; to convert the column names into an R-friendly format (Visited On -&amp;gt; visited_on)&lt;/li&gt;
&lt;li&gt;Keeps only the ‘Typed’ and ‘Link’ transition types, i.e., URLs I either typed or clicked&lt;/li&gt;
&lt;li&gt;Keeps only ‘Chain Start’ qualifiers to remove URLs that came from redirects&lt;/li&gt;
&lt;li&gt;Creates a &lt;code&gt;base_url&lt;/code&gt; field by stripping the “http[s]://” and “www.” prefixes (when present) and keeping only the domain portion of the URL&lt;/li&gt;
&lt;li&gt;Converts &lt;code&gt;visited_on&lt;/code&gt; into both a timestamp and a datestamp&lt;/li&gt;
&lt;li&gt;Keeps only the four columns we’re interested in.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;After these changes, the data looks like:&lt;/p&gt;
&lt;table&gt;
&lt;colgroup&gt;
&lt;col width=&#34;17%&#34; /&gt;
&lt;col width=&#34;23%&#34; /&gt;
&lt;col width=&#34;46%&#34; /&gt;
&lt;col width=&#34;12%&#34; /&gt;
&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;base_url&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;dttm&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;title&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;ds&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;watch.wwe.com/&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2020-06-13 14:59:23&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;wwe network - undertaker: the last ride&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2020-06-13&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;watch.wwe.com/&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2020-06-13 14:59:28&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;wwe network - undertaker: the last ride&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2020-06-13&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;google.com/&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2020-06-13 16:33:37&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;vtt to srt - google search&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2020-06-13&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;twitter.com/&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2020-06-13 17:19:55&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;home / twitter&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2020-06-13&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;twitter.com/&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2020-06-13 17:20:03&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;home / twitter&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2020-06-13&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;
&lt;div id=&#34;sessionizing-the-data&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Sessionizing the Data&lt;/h2&gt;
&lt;p&gt;Even though I have a date field for my browsing history, the cSPADE algorithm needs to be able to differentiate between when one session begins and another ends. While a reasonable choice might be to break things apart by day, on weekends I likely have multiple browsing sessions, some of which stretch past midnight. So a more reasonable choice is to say a new session begins whenever there is a gap of at least 1 hour since the last page I browsed to.&lt;/p&gt;
&lt;p&gt;Another aspect of the data I’d like to deal with is consecutive visits to pages within the same domain. An eventual rule like “twitter.com/ -&amp;gt; twitter.com/” isn’t that interesting, so I will also remove any consecutive rows that have the same domain.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;collapsed_history &amp;lt;- browsing_history_cleaned %&amp;gt;% 
  #Order by Time
  arrange(dttm) %&amp;gt;% 
  # Create a new marker every time a Page Browsing is more than 1 hour since
  # the last one
  # Also, create a segment_id to identify each session
  mutate(#Time difference in seconds (explicit units avoid difftime auto-unit surprises)
         time_diff = as.numeric(dttm - lag(dttm), units = &amp;#39;secs&amp;#39;),
         #Count segments as more than an hour between events
         new_segment = if_else(is.na(time_diff) | time_diff &amp;gt;= 60*60, 1, 0),
         segment_id = cumsum(new_segment)
  ) %&amp;gt;% 
  group_by(segment_id) %&amp;gt;% 
  arrange(dttm) %&amp;gt;% 
  #Remove Instances where the same baseurl appears consecutively
  filter(base_url != lag(base_url) | is.na(lag(base_url))) %&amp;gt;% 
  #Create Within Segment ID
  mutate(item_id = row_number()) %&amp;gt;% 
  select(segment_id, ds, dttm, item_id, base_url) %&amp;gt;% 
  ungroup() %&amp;gt;% 
  #Convert Everything to Factor
  mutate(across(.cols = c(&amp;quot;segment_id&amp;quot;, &amp;quot;base_url&amp;quot;), .f = as.factor))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In order to create &lt;code&gt;segment_id&lt;/code&gt;s to represent each session, I use &lt;code&gt;dplyr::lag()&lt;/code&gt; to calculate the time difference between each event and the one before it. When an event occurs more than 1 hour after the prior event, I mark it with a 1 in the &lt;code&gt;new_segment&lt;/code&gt; column. Then using &lt;code&gt;cumsum()&lt;/code&gt;, I can fill the segment_ids down to all the other events in that session.&lt;/p&gt;
&lt;p&gt;Similarly, I use the lag function to remove consecutive rows with identical &lt;code&gt;base_url&lt;/code&gt;s.&lt;/p&gt;
&lt;p&gt;Finally, a quirk of the &lt;code&gt;arulesSequences&lt;/code&gt; package is that the “items” (the URLs, in this case) must be factors.&lt;/p&gt;
&lt;p&gt;The data for the 154 browsing sessions now looks like:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;collapsed_history %&amp;gt;% head(5) %&amp;gt;% knitr::kable()&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;segment_id&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;ds&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;dttm&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;item_id&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;base_url&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2020-06-13&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2020-06-13 14:59:23&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;watch.wwe.com/&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2020-06-13&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2020-06-13 16:33:37&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;google.com/&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2020-06-13&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2020-06-13 17:19:55&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;twitter.com/&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2020-06-13&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2020-06-13 17:20:09&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;3&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;gmail.com/&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2020-06-13&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2020-06-13 17:24:14&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;4&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;twitter.com/&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;
&lt;div id=&#34;constructing-the-transactions-data-set-for-arulessequences&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Constructing the Transactions Data Set for arulesSequences&lt;/h2&gt;
&lt;p&gt;I haven’t found a ton of resources online about using the &lt;code&gt;arulesSequences&lt;/code&gt; package. This &lt;a href=&#34;https://blog.revolutionanalytics.com/2019/02/sequential-pattern-mining-in-r.html&#34;&gt;blog post from Revolution Analytics&lt;/a&gt; has been one of the best that I’ve found. However, their process involves exporting to .csv and then reading the file back in to create the transactions data set. Personally, I’d like to do as much as possible without leaving R.&lt;/p&gt;
&lt;p&gt;However, the blog post does provide a good amount of detail about how to get the data into the proper format. Using the &lt;code&gt;as&lt;/code&gt; function, I can convert the previous data frame into a &#34;transactions&#34; format and set the following fields for use in cSPADE:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;items&lt;/strong&gt;: The elements that make up a sequence&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;sequenceID&lt;/strong&gt;: The identifier for each sequence&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;eventID&lt;/strong&gt;: The identifier for an item within a sequence&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;sessions &amp;lt;-  as(collapsed_history %&amp;gt;% transmute(items = base_url), &amp;quot;transactions&amp;quot;)
transactionInfo(sessions)$sequenceID &amp;lt;- collapsed_history$segment_id
transactionInfo(sessions)$eventID = collapsed_history$item_id&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If I wanted finer control around time gaps, I would also need to provide explicit timing information. But since this analysis is pretty basic, the separation into sessions is enough.&lt;/p&gt;
&lt;p&gt;The Transaction data class can be viewed with the &lt;code&gt;inspect()&lt;/code&gt; function:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;inspect(head(sessions))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##     items                  transactionID sequenceID eventID
## [1] {items=watch.wwe.com/} 1             1          1      
## [2] {items=google.com/}    2             2          1      
## [3] {items=twitter.com/}   3             2          2      
## [4] {items=gmail.com/}     4             2          3      
## [5] {items=twitter.com/}   5             2          4      
## [6] {items=gothamist.com/} 6             2          5&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Having the “items=” prefix on every item is a little annoying, so let’s remove it by altering the &lt;code&gt;itemLabels&lt;/code&gt; for the transactions set:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;itemLabels(sessions) &amp;lt;- str_replace_all(itemLabels(sessions), &amp;quot;items=&amp;quot;, &amp;quot;&amp;quot;)
inspect(head(sessions))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##     items            transactionID sequenceID eventID
## [1] {watch.wwe.com/} 1             1          1      
## [2] {google.com/}    2             2          1      
## [3] {twitter.com/}   3             2          2      
## [4] {gmail.com/}     4             2          3      
## [5] {twitter.com/}   5             2          4      
## [6] {gothamist.com/} 6             2          5&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Much better.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;running-the-cspade-algorithm&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Running the cSPADE algorithm&lt;/h2&gt;
&lt;p&gt;The sequence mining algorithm can be run with the &lt;code&gt;cspade()&lt;/code&gt; function in the &lt;code&gt;arulesSequences&lt;/code&gt; package. Before running it, I’ll need to explain the concept of &lt;em&gt;support&lt;/em&gt;. &lt;em&gt;Support&lt;/em&gt; is best thought of as the proportion of sessions that contain a given URL. This matters because the cSPADE algorithm works recursively to find frequent patterns, starting with 1-item sets, then moving to 2-item sets, and so on. To limit how long the algorithm runs, you can set a minimum support threshold. This helps because, by definition, the support of a 2-item set is less than or equal to the support of either of its 1-item sets. For example, if A occurs 40% of the time, A and B together cannot occur more frequently.&lt;/p&gt;
&lt;p&gt;So if A alone does not meet the support threshold, we don’t need to consider any larger sets that contain A.&lt;/p&gt;
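&lt;p&gt;This anti-monotone property of support is easy to verify directly. Here is a small base R sketch using made-up sessions (illustrative only, not the actual browsing data) showing that a 2-item set can never have higher support than either of its 1-item sets:&lt;/p&gt;

```r
# Hypothetical sessions of visited domains (illustrative, not the real data)
sessions <- list(
  c("google.com", "twitter.com"),
  c("google.com", "reddit.com"),
  c("twitter.com", "reddit.com"),
  c("google.com", "twitter.com", "reddit.com")
)

# Support = proportion of sessions that contain every item in the set
support <- function(items) {
  mean(vapply(sessions, function(s) all(items %in% s), logical(1)))
}

support("google.com")                    # 0.75
support("twitter.com")                   # 0.75
support(c("google.com", "twitter.com"))  # 0.50 -- no larger than either 1-item support
```

&lt;p&gt;Once a 1-item set falls below the threshold, every longer pattern containing it can be pruned without checking.&lt;/p&gt;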
&lt;p&gt;For this purpose I’ll set a minimum support of 25%. The &lt;code&gt;cspade&lt;/code&gt; function will return all of the frequent itemsets that occur in my browsing data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;itemsets &amp;lt;- cspade(sessions, 
                   parameter = list(support = 0.25), 
                   control = list(verbose = FALSE))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;summary()&lt;/code&gt; function will provide a lot of useful information, but we’ll just look at the first few rows with &lt;code&gt;inspect()&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;inspect(head(itemsets))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##    items                   support 
##  1 &amp;lt;{buzzfeed.com/}&amp;gt;     0.4090909 
##  2 &amp;lt;{en.wikipedia.org/}&amp;gt; 0.3311688 
##  3 &amp;lt;{facebook.com/}&amp;gt;     0.3311688 
##  4 &amp;lt;{github.com/}&amp;gt;       0.3051948 
##  5 &amp;lt;{google.com/}&amp;gt;       0.8051948 
##  6 &amp;lt;{gothamist.com/}&amp;gt;    0.4090909 
## &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here we see the results of a series of 1-item sets, where the support is the proportion of sessions containing at least one visit to that URL. &lt;strong&gt;Apparently I use Google A LOT, as it appears in 80% of my sessions&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;We can also convert the itemsets data back to a data frame using the &lt;code&gt;as()&lt;/code&gt; function and return to the usual &lt;code&gt;dplyr&lt;/code&gt; and &lt;code&gt;ggplot&lt;/code&gt; functions. For example, I can visualize the 10 most frequent item sets based on the support metric:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#Convert Back to DS
itemsets_df &amp;lt;- as(itemsets, &amp;quot;data.frame&amp;quot;) %&amp;gt;% as_tibble()

#Top 10 Frequent Item Sets
itemsets_df %&amp;gt;%
  slice_max(support, n = 10) %&amp;gt;% 
  ggplot(aes(x = fct_reorder(sequence, support),
                    y = support,
                    fill = sequence)) + 
    geom_col() + 
    geom_label(aes(label = support %&amp;gt;% scales::percent()), hjust = 0.5) + 
    labs(x = &amp;quot;Site&amp;quot;, y = &amp;quot;Support&amp;quot;, title = &amp;quot;Most Frequently Visited Item Sets&amp;quot;,
         caption = &amp;quot;**Support** is the percent of segments that contain the item set&amp;quot;) + 
    scale_fill_discrete(guide = F) +
    scale_y_continuous(labels = scales::percent,
                       expand = expansion(mult = c(0, .1))) + 
    coord_flip() + 
    cowplot::theme_cowplot() + 
    theme(
      plot.caption = element_markdown(hjust = 0),
      plot.caption.position = &amp;#39;plot&amp;#39;,
      plot.title.position = &amp;#39;plot&amp;#39;
    )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;index_files/figure-html/top_by_support-1.png&#34; width=&#34;768&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Now we see some of the 2-item sets. Not only do I use Google in 80% of sessions; in 66% of sessions I visit Google twice!&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;turning-frequent-sequences-into-rules&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Turning Frequent Sequences into Rules&lt;/h2&gt;
&lt;p&gt;While knowing which URLs occur frequently is interesting, it would be more useful to generate rules about which websites lead to visits to other websites.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;ruleInduction()&lt;/code&gt; function will turn the item sets into “if A then B” style rules. To control the size of the output, I will introduce the concept of &lt;em&gt;confidence&lt;/em&gt;. The &lt;em&gt;confidence&lt;/em&gt; of an “if A then B” rule is the percentage of times the rule holds when A occurs. So if “if A then B” has 50% confidence, then when A occurs we have a 50% chance of seeing B vs. seeing anything other than B.&lt;/p&gt;
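&lt;p&gt;As a quick numeric sketch of the definition (the counts below are made up for illustration, not taken from the browsing data), confidence is just the rule’s support divided by the support of its left-hand side:&lt;/p&gt;

```r
# Hypothetical counts over 100 sessions (illustrative only)
n_sessions <- 100
n_A        <- 40  # sessions where A occurs
n_A_then_B <- 20  # sessions where A occurs and B follows

supp_A    <- n_A / n_sessions         # 0.4
supp_rule <- n_A_then_B / n_sessions  # 0.2
confidence <- supp_rule / supp_A
confidence  # 0.5: when A occurs, B follows half the time
```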
&lt;p&gt;For this post, I’ll use a minimum confidence of 60%.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;rules &amp;lt;- ruleInduction(itemsets, 
                       confidence = 0.6, 
                       control = list(verbose = FALSE))

inspect(head(rules, 3))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##    lhs                     rhs                    support confidence     lift 
##  1 &amp;lt;{gothamist.com/}&amp;gt;   =&amp;gt; &amp;lt;{westsiderag.com/}&amp;gt; 0.2727273  0.6666667 1.901235 
##  2 &amp;lt;{gothamist.com/}&amp;gt;   =&amp;gt; &amp;lt;{twitter.com/}&amp;gt;     0.2662338  0.6507937 1.113580 
##  3 &amp;lt;{t.co/}&amp;gt;            =&amp;gt; &amp;lt;{twitter.com/}&amp;gt;     0.3246753  0.7812500 1.336806 
## &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The returned data structure has 5 fields:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;lhs&lt;/strong&gt;: Left-hand side - The “A” in our “if A then B” rule&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;rhs&lt;/strong&gt;: Right-hand side - The “B” in our “if A then B” rule&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;support&lt;/strong&gt;: The % of sessions where “A then B” occurs&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;confidence&lt;/strong&gt;: How often the rule is true (If A occurs the % of Time that B occurs)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;lift&lt;/strong&gt;: The strength of the association. Defined as the support of “A then B” divided by the product of the support of A and the support of B. In other words, how much more likely are we to see A and B together vs. what we would expect if A and B were completely independent of each other.&lt;/li&gt;
&lt;/ul&gt;
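&lt;p&gt;To see how these three measures fit together, we can sanity-check the arithmetic of the first rule in the output above (the &lt;code&gt;supp_AB&lt;/code&gt;, &lt;code&gt;conf&lt;/code&gt;, and &lt;code&gt;lift&lt;/code&gt; values are copied from that row):&lt;/p&gt;

```r
# Values from the rule <{gothamist.com/}> => <{westsiderag.com/}> above
supp_AB <- 0.2727273  # support of "A then B"
conf    <- 0.6666667  # confidence
lift    <- 1.901235   # lift

# confidence = supp(A then B) / supp(A), so supp(A) falls out:
supp_A <- supp_AB / conf
round(supp_A, 7)  # 0.4090909 -- matches the support shown for gothamist.com/ earlier

# lift = confidence / supp(B), so the implied support of westsiderag.com/ is:
supp_B <- conf / lift
round(supp_B, 3)  # 0.351
```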
&lt;p&gt;The first row shows two NYC-specific blogs, one covering NYC overall and one covering the Upper West Side. The support shows that 27% of my sessions include these two blogs. The confidence shows that if I visit Gothamist there’s a 67% chance I’ll visit WestSideRag afterwards. Finally, the lift shows that this pattern is about 90% more likely than you’d expect if there were no relation between my visits to these sites.&lt;/p&gt;
&lt;div id=&#34;redundant-rules&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Redundant Rules&lt;/h3&gt;
&lt;p&gt;In order to create the most effective and simplest rules, we’ll want to remove redundant rules. In this context a rule is &lt;em&gt;redundant&lt;/em&gt; when a subset of its left-hand side yields at least as high a confidence as the rule with more items on the left-hand side. In simpler terms, we want the simplest rule that doesn’t sacrifice information. For example, {A, B, C} -&amp;gt; {D} is redundant with respect to {A, B} -&amp;gt; {D} if the confidence of the second rule is greater than or equal to that of the first.&lt;/p&gt;
&lt;p&gt;A real example from this data is:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;lhs&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;rhs&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;support&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;confidence&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;lift&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;&amp;lt;{t.co/}&amp;gt;&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;=&amp;gt; &amp;lt;{twitter.com/}&amp;gt;&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.3246753&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.7812500&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.336806&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;&amp;lt;{twitter.com/}, {t.co/}&amp;gt;&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;=&amp;gt; &amp;lt;{twitter.com/}&amp;gt;&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.3181818&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.7777778&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.330864&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The addition of “twitter.com” to the left-hand side does not make for a more confident rule, so it is redundant.&lt;/p&gt;
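&lt;p&gt;This redundancy check can be sketched directly (the confidence values are copied from the table above):&lt;/p&gt;

```r
# Confidences from the table above; both rules share the same right-hand side
conf_subset <- 0.7812500  # <{t.co/}>                => <{twitter.com/}>
conf_longer <- 0.7777778  # <{twitter.com/},{t.co/}> => <{twitter.com/}>

# The longer rule is redundant if the subset of its lhs does at least as well
is_redundant <- conf_subset >= conf_longer
is_redundant  # TRUE
```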
&lt;p&gt;Removing redundant rules can be done easily with the &lt;code&gt;is.redundant()&lt;/code&gt; function:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;rules_cleaned &amp;lt;- rules[!is.redundant(rules)]&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The rules class can also be converted back to a data.frame with the &lt;code&gt;as()&lt;/code&gt; function. Then we can use &lt;code&gt;tidyr::separate()&lt;/code&gt; to break apart the &lt;code&gt;rule&lt;/code&gt; column into the &lt;code&gt;lhs&lt;/code&gt; and &lt;code&gt;rhs&lt;/code&gt; columns.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;rules_df &amp;lt;- as(rules_cleaned, &amp;quot;data.frame&amp;quot;) %&amp;gt;% 
  as_tibble() %&amp;gt;% 
  separate(col = rule, into = c(&amp;#39;lhs&amp;#39;, &amp;#39;rhs&amp;#39;), sep = &amp;quot; =&amp;gt; &amp;quot;, remove = F)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now we can look at the highest confidence rules:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;rules_df %&amp;gt;% 
  arrange(-confidence) %&amp;gt;% 
  select(lhs, rhs, support, confidence, lift) %&amp;gt;% 
  head() %&amp;gt;% 
  knitr::kable()&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;colgroup&gt;
&lt;col width=&#34;55%&#34; /&gt;
&lt;col width=&#34;15%&#34; /&gt;
&lt;col width=&#34;9%&#34; /&gt;
&lt;col width=&#34;10%&#34; /&gt;
&lt;col width=&#34;8%&#34; /&gt;
&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;lhs&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;rhs&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;support&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;confidence&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;lift&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;&amp;lt;{google.com/},{google.com/},{google.com/},{google.com/}&amp;gt;&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;&amp;lt;{google.com/}&amp;gt;&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.3701299&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.9193548&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.141779&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;&amp;lt;{github.com/}&amp;gt;&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;&amp;lt;{google.com/}&amp;gt;&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.2792208&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.9148936&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.136239&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;&amp;lt;{buzzfeed.com/},{google.com/}&amp;gt;&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;&amp;lt;{google.com/}&amp;gt;&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.2597403&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.8510638&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.056966&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;&amp;lt;{t.co/},{google.com/}&amp;gt;&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;&amp;lt;{google.com/}&amp;gt;&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.2727273&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.8400000&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.043226&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;&amp;lt;{lifehacker.com/}&amp;gt;&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;&amp;lt;{reddit.com/}&amp;gt;&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.2532468&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.8297872&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.726854&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;&amp;lt;{google.com/}&amp;gt;&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;&amp;lt;{google.com/}&amp;gt;&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.6623377&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.8225806&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.021592&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;And this is pretty boring. I wind up on Google a lot, so it appears in a lot of the rules. So let’s make this more interesting by removing Google from the results and by also looking at both confidence and lift.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;rules_df %&amp;gt;% 
  #Remove All Rules that Involve Google
  filter(!str_detect(rule, &amp;#39;\\{google.com\\/\\}&amp;#39;)) %&amp;gt;% 
  #Keep only Rule, Confidence, and Lift - 1
  transmute(rule, confidence, lift = lift - 1) %&amp;gt;% 
  #Pivot Lift and confidence into a single column
  pivot_longer(cols = c(&amp;#39;confidence&amp;#39;,&amp;#39;lift&amp;#39;),
               names_to = &amp;quot;metric&amp;quot;, 
               values_to = &amp;quot;value&amp;quot;) %&amp;gt;% 
  group_by(metric) %&amp;gt;% 
  #Keep only the Top 10 Rules for Each Metric
  top_n(10, value) %&amp;gt;% 
  ungroup() %&amp;gt;% 
  # Reorder so that order is independent for each metrics
  ggplot(aes(x = tidytext::reorder_within(rule, value, metric),
             y = value,
             fill = rule)) + 
    geom_col() + 
    geom_label(aes(label = value %&amp;gt;% scales::percent()), 
               hjust = 0) +
    scale_fill_discrete(guide = F) + 
    tidytext::scale_x_reordered() + 
    scale_y_continuous(label = scales::percent, 
                       limits = c(0, 1),
                       expand = expansion(mult = c(0, .1))) + 
    labs(x = &amp;quot;Rule&amp;quot;, 
         y = &amp;quot;&amp;quot;, 
         title = &amp;quot;Top Rules by Confidence and Lift&amp;quot;,
         caption = &amp;quot;**Confidence** is the probability RHS occurs 
         given LHS occurs &amp;lt;br /&amp;gt;
          **Lift** is the increased likelihood of seeing LHS &amp;amp; RHS together vs. independent&amp;quot;) +
    facet_wrap(~metric, ncol = 1, scales = &amp;quot;free_y&amp;quot;) +
    coord_flip() +
    theme_minimal() +
    theme(
      plot.caption = element_markdown(hjust = 0),
      plot.caption.position = &amp;#39;plot&amp;#39;,
      strip.text = element_textbox(
        size = 12,
        color = &amp;quot;white&amp;quot;, fill = &amp;quot;#5D729D&amp;quot;, box.color = &amp;quot;#4A618C&amp;quot;,
        halign = 0.5, linetype = 1, r = unit(5, &amp;quot;pt&amp;quot;), width = unit(1, &amp;quot;npc&amp;quot;),
        padding = margin(2, 0, 1, 0), margin = margin(3, 3, 3, 3)
      )
    )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;index_files/figure-html/top-confidence-lift-1.png&#34; width=&#34;768&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Some of the high lift rules that occur are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;I visit WestSideRag after Gothamist&lt;/li&gt;
&lt;li&gt;I visit Reddit after LifeHacker&lt;/li&gt;
&lt;li&gt;I visit Buzzfeed after Twitter.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;By the way, all of this is true. My usual weekday pattern tends to be Twitter -&amp;gt; Gothamist -&amp;gt; WestSideRag -&amp;gt; ILoveTheUpperWest -&amp;gt; Buzzfeed -&amp;gt; LifeHacker -&amp;gt; Reddit.&lt;/p&gt;
&lt;p&gt;So it does appear that the sequence mining rules represent my browsing habits! Though certain sites, like the second Upper West Side blog, did not make the top rules.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;visualizing-these-relationships-as-a-graph&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Visualizing these relationships as a graph&lt;/h2&gt;
&lt;p&gt;Ultimately, my browsing habits can be restructured as a directed graph where each URL leads to another URL. Then rather than relying on statistical measures like Support, Confidence, and Lift, I can visualize my browsing as a network. However, to turn my data into an edge list I need to re-structure the URLs from a sequential list into a series of “Source/Destination” edges.&lt;/p&gt;
&lt;p&gt;To do this, I’ll group by each browsing session, setting the URL as the “source” and using &lt;code&gt;dplyr::lead()&lt;/code&gt; to grab the URL from the next row to form the “destination”. Since the destination will be missing for the last URL in each session, I’ll remove those endpoints from the data. Finally, to create edge weightings I’ll count the number of instances of each source/destination pair.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;collapsed_history_graph_dt &amp;lt;- collapsed_history %&amp;gt;% 
  group_by(segment_id) %&amp;gt;% 
  transmute(item_id, source = base_url) %&amp;gt;% 
  mutate(destination = lead(source)) %&amp;gt;% 
  ungroup() %&amp;gt;%
  filter(!is.na(destination)) %&amp;gt;% 
  select(source, destination, segment_id) %&amp;gt;% 
  count(source, destination, name = &amp;#39;instances&amp;#39;) &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In order to create the graph, I’ll be using the &lt;code&gt;tidygraph&lt;/code&gt; and &lt;code&gt;ggraph&lt;/code&gt; packages to convert the data frame into the appropriate format and visualize the network in a ggplot style.&lt;/p&gt;
&lt;p&gt;To make the resulting network more readable, I’ll filter my edge list to only those with at least 15 occurrences and then use &lt;code&gt;tidygraph::as_tbl_graph&lt;/code&gt; to convert to a graph-friendly data type.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;g &amp;lt;- collapsed_history_graph_dt %&amp;gt;% 
  filter(instances &amp;gt; 14) %&amp;gt;% 
  as_tbl_graph()&lt;/code&gt;&lt;/pre&gt;
&lt;div id=&#34;creating-graph-clusters&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Creating Graph Clusters&lt;/h3&gt;
&lt;p&gt;To make the visualization a little more interesting I thought it would be fun to cluster the network. The &lt;code&gt;igraph::cluster_optimal&lt;/code&gt; function will calculate the optimal community structure of the graph. This membership label then gets applied as a node attribute to the graph object &lt;code&gt;g&lt;/code&gt; created in the prior code block.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;clp &amp;lt;- igraph::cluster_optimal(g)

g &amp;lt;- g %&amp;gt;% 
  activate(&amp;quot;nodes&amp;quot;) %&amp;gt;% 
  mutate(community = clp$membership)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;plotting-the-network-with-ggraph&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Plotting the Network with ggraph&lt;/h3&gt;
&lt;p&gt;&lt;code&gt;ggraph&lt;/code&gt; follows a similar syntax to ggplot: the data object is passed in, and then geoms reflect the nodes/edges of the plot. The layout option specifies how the nodes and edges will be laid out. Here I’m using the Fruchterman-Reingold algorithm for a force-directed layout. As used in this code block the relevant geoms are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;geom_node_voronoi&lt;/code&gt; - Used to plot the clustering as the background of the graph&lt;/li&gt;
&lt;li&gt;&lt;code&gt;geom_edge_parallel&lt;/code&gt; - Since this is a directional graph, it will draw separate parallel arrows for each direction. The shading will be based on the log number of instances.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;geom_node_point&lt;/code&gt; - Plots a circle for each node&lt;/li&gt;
&lt;li&gt;&lt;code&gt;geom_node_text&lt;/code&gt; - Plots the names of the nodes and reduces overlap&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(20201029)
ggraph(g, layout = &amp;#39;fr&amp;#39;) + 
  geom_node_voronoi(aes(fill = as.factor(community)), alpha = .4) + 
  geom_edge_parallel(aes(edge_alpha = log(instances)),
                  #color = &amp;quot;#5851DB&amp;quot;,
                  edge_width = 1,
                  arrow = arrow(length = unit(4, &amp;#39;mm&amp;#39;)),
                  start_cap = circle(3, &amp;#39;mm&amp;#39;),
                  end_cap = circle(3, &amp;#39;mm&amp;#39;)) +
  geom_node_point(fill = &amp;#39;orange&amp;#39;, size = 5, pch = 21) + 
  geom_node_text(aes(label = name), repel = T) + 
  labs(title = &amp;quot;My Browsing History&amp;quot;,
       caption = &amp;quot;Minimum 15 Instances&amp;quot;) + 
  scale_fill_viridis_d(guide = F) + 
  scale_edge_alpha_continuous(guide = F) + 
  theme_graph()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;index_files/figure-html/unnamed-chunk-13-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;This graph shows 5 clusters:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Twitter -&amp;gt; Gothamist -&amp;gt; WestSideRag -&amp;gt; ILoveTheUpperWestSide
&lt;ul&gt;
&lt;li&gt;The websites I typically visit after work on weekdays&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;Datacamp / Google Docs
&lt;ul&gt;
&lt;li&gt;When I do Datacamp courses, I take notes in Google Docs, so constantly switching back and forth makes sense.&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;Facebook.com / l.facebook.com
&lt;ul&gt;
&lt;li&gt;This is just normal Facebook use, but it’s interesting that Facebook has no frequent connections outside of the Facebook ecosystem.&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;BuzzFeed/LifeHacker
&lt;ul&gt;
&lt;li&gt;This is the last piece of my usual post-work routine, though perhaps it occurs later, after the Twitter/NYC blog cluster.&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;The Google Centered Cluster
&lt;ul&gt;
&lt;li&gt;Google is the center of my browsing universe, but a fun connection here is 127.0.0.1:4321, the local instance of this blog when I’m developing it. It co-occurs with lots of trips to Google, Github, and Stack Overflow while I try to figure out / debug aspects of my blog development pipeline.&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;conclusion&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Conclusion&lt;/h1&gt;
&lt;p&gt;In my searches there weren’t many resources showing how to use the &lt;code&gt;arulesSequences&lt;/code&gt; package, and most of those that existed required dumping and re-reading a .csv file. Hopefully this post showed that it isn’t necessary to do that. It also gives an example of how sequence mining can identify interesting patterns when order is important. There is a lot of functionality in the &lt;code&gt;arulesSequences&lt;/code&gt; package not touched upon in this post, but this should serve as a good starting point.&lt;/p&gt;
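&lt;p&gt;For reference, the no-.csv route can be sketched roughly as below. This is a minimal sketch under assumptions: the &lt;code&gt;sessions&lt;/code&gt; data frame, its column names, and the support threshold are hypothetical placeholders, not the objects used in this post.&lt;/p&gt;

```r
library(arules)          # transactions class
library(arulesSequences) # cspade()

# Hypothetical input: one item per row, sorted by sequence and event
sessions <- data.frame(
  sequenceID = c(1L, 1L, 1L, 2L, 2L),
  eventID    = c(1L, 2L, 3L, 1L, 2L),
  item       = c("twitter", "gothamist", "westsiderag", "twitter", "reddit")
)

# Build the transactions object in memory -- no .csv round-trip
trans <- as(split(sessions$item, seq_len(nrow(sessions))), "transactions")
transactionInfo(trans)$sequenceID <- sessions$sequenceID
transactionInfo(trans)$eventID    <- sessions$eventID

# Mine frequent sequences with cSPADE
seqs <- cspade(trans, parameter = list(support = 0.5))
as(seqs, "data.frame")
```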
&lt;p&gt;As for visualization, I’ve covered how to plot rules in the usual tabular structure with ggplot2 as well as a network using ggraph. I really like the way the network visualization worked out and in a future post I may go more in-depth to learn about how to best use &lt;code&gt;tidygraph&lt;/code&gt; and &lt;code&gt;ggraph&lt;/code&gt;.&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Looking for Media Bias in Coverage of Trump&#39;s COVID Diagnosis</title>
      <link>https://jlaw.netlify.app/2020/10/07/looking-for-media-bias-in-coverage-of-trump-s-covid-diagnosis/</link>
      <pubDate>Wed, 07 Oct 2020 00:00:00 +0000</pubDate>
      <guid>https://jlaw.netlify.app/2020/10/07/looking-for-media-bias-in-coverage-of-trump-s-covid-diagnosis/</guid>
      <description>


&lt;p&gt;Within the United States, especially these last few years, there has been an increased focus on “fake news” or “bias in the media”. Fox News is typically the poster child for right-wing bias, and everything else seems to be the poster child for left-wing bias. While this is just a humble R blog (and &lt;strong&gt;NOT&lt;/strong&gt; a political blog), I thought it would be interesting to look at how a single event is covered by different media sources.&lt;/p&gt;
&lt;p&gt;I don’t know much about assessing media bias, but fortunately websites like &lt;a href=&#34;https://www.allsides.com/media-bias/media-bias-chart&#34;&gt;Allsides.com&lt;/a&gt; have done the research for me and produced the following chart breaking media outlets down by direction of bias:&lt;/p&gt;
&lt;center&gt;
&lt;img src=&#34;https://www.allsides.com/sites/default/files/AllSidesMediaBiasChart-Version3.jpg&#34; style=&#34;width:50.0%&#34; /&gt;
&lt;/center&gt;
&lt;p&gt;I don’t have a perspective on the accuracy of this chart. But it provides enough information to work with.&lt;/p&gt;
&lt;p&gt;Originally, I was planning on using the first Presidential Debate in late September, but with President Trump’s positive COVID-19 diagnosis on Friday 10/2, I’ve decided to use that event instead.&lt;/p&gt;
&lt;p&gt;The rules for this analysis are:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Pick one media outlet from each column of the Media Bias Chart above and find an article about Trump’s COVID diagnosis.&lt;/li&gt;
&lt;li&gt;No opinion pieces or editorials. The articles should be intended to be reporting on the facts of an event.&lt;/li&gt;
&lt;/ol&gt;
&lt;div id=&#34;the-data&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;The Data&lt;/h2&gt;
&lt;p&gt;All data was collected on Friday, October 2 (some articles have since changed). The five articles are listed below from most left-leaning to most right-leaning:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://www.huffpost.com/entry/donald-trump-tests-positive-coronavirus_n_5eb5d776c5b69c4b317a5ee5&#34;&gt;Huffington Post&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://www.cnn.com/2020/10/02/politics/president-donald-trump-coronavirus-positive-test/index.html&#34;&gt;CNN&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://apnews.com/article/virus-outbreak-donald-trump-elections-melania-trump-michael-pence-f6ba3a16ab9b74b161a3a7211248e97e&#34;&gt;Associated Press (AP)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://www.foxnews.com/politics/president-trump-confirms-he-first-lady-melania-trump-tested-positive-for-coronavirus&#34;&gt;Fox News&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://www.theblaze.com/news/trump-positive-coronavirus&#34;&gt;The Blaze&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I manually copy-pasted the titles, subtitles, and article text into &lt;code&gt;.txt&lt;/code&gt; files, ensuring that inserted links to other articles were not accidentally picked up.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;analysis-plan&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Analysis Plan&lt;/h2&gt;
&lt;p&gt;The main objectives for this analysis are to look at:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Sentiment Analysis of the five different articles&lt;/li&gt;
&lt;li&gt;Looking for the most representative words for each article&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;to see if we can learn anything about Media Bias from these five sources.&lt;/p&gt;
&lt;p&gt;The libraries that will be used for this analysis are:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidyverse) #Our Workhorse Data Manipulation / Plotting Functions
library(tidytext) #Tidyverse Friend Package for Text-Mining
library(scales) #For easier value formatting
library(ggtext) # To Be Able to Use Images On Plots
library(ggupset) # For Creating an Upset Chart
library(UpSetR) # For Creating an Upset Chart&lt;/code&gt;&lt;/pre&gt;
&lt;div id=&#34;reading-in-the-data&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Reading in the data&lt;/h3&gt;
&lt;p&gt;As mentioned above, the text of the five articles is contained in five text files. The following code block looks in the working directory and uses &lt;code&gt;purrr&lt;/code&gt;’s &lt;code&gt;map_dfr&lt;/code&gt; function to execute the &lt;code&gt;readr::read_table&lt;/code&gt; function and create a column marking which source the text came from:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;articles &amp;lt;- dir() %&amp;gt;% 
  keep(str_detect(., &amp;#39;\\.txt&amp;#39;)) %&amp;gt;% 
  map_dfr(
    ~read_table(.x, col_names = F) %&amp;gt;% 
      mutate(source = str_remove_all(.x, &amp;#39;\\.txt&amp;#39;))
    ) &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;One of my earlier blog posts &lt;a href=&#34;https://jlaw.netlify.app/2020/08/02/what-s-the-difference-between-instagram-and-tiktok-using-word-embeddings-to-find-out/&#34;&gt;on creating word embeddings to compare TikTok and Instagram&lt;/a&gt; describes the basic pieces of &lt;code&gt;tidytext&lt;/code&gt;. The first step of the text analysis is to ‘tokenize’ (split the data into one word per row) using &lt;code&gt;tidytext::unnest_tokens()&lt;/code&gt;, which will break apart the &lt;code&gt;X1&lt;/code&gt; column containing sentences into a new column called &lt;code&gt;word&lt;/code&gt;. The next step is removing stop words, which are common words like “the”, “and”, and “a” that don’t add much meaning to the analysis. These words are contained in the &lt;code&gt;stop_words&lt;/code&gt; data set, and using &lt;code&gt;anti_join()&lt;/code&gt; will remove them from our word list.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;words &amp;lt;- articles %&amp;gt;% 
  unnest_tokens(word, X1) %&amp;gt;% 
  anti_join(stop_words)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;how-long-are-each-of-the-articles&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;How long are each of the articles?&lt;/h3&gt;
&lt;p&gt;The first analysis that we can do is look at the word count of the five articles:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;count(words, source, name = &amp;quot;article length&amp;quot;, sort = T) %&amp;gt;% 
  knitr::kable()&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;source&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;article length&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;ap&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;603&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;cnn&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;494&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;foxnews&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;394&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;huffpo&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;271&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;theblaze&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;135&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;What’s potentially interesting about the word counts is that the Associated Press article, which is supposed to be the least biased, has the largest word count. The slightly biased sources (CNN and Fox News) are the next longest. Finally, the articles representing the most biased sources (Huffington Post and The Blaze) were the shortest.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;what-are-the-most-common-words-from-each-source&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;What Are The Most Common Words From Each Source?&lt;/h3&gt;
&lt;p&gt;Another quick analysis is to look at the most frequent words from each of the five sources. The following code block takes advantage of the &lt;code&gt;ggtext&lt;/code&gt; package to use logos for the axis labels. &lt;code&gt;ggtext&lt;/code&gt; is able to render HTML tags on ggplots using the &lt;code&gt;element_markdown()&lt;/code&gt; function, and the &lt;code&gt;icons_strip&lt;/code&gt; object contains those tags.&lt;/p&gt;
&lt;p&gt;The following code block gets the word counts for each word in each source and then uses dplyr’s &lt;code&gt;slice_max()&lt;/code&gt; to only keep the Top 10 words for each source. In the &lt;code&gt;ggplot&lt;/code&gt; command, the &lt;code&gt;reorder_within()&lt;/code&gt; and &lt;code&gt;scale_x_reordered()&lt;/code&gt; allows for separate sorting within each facet on the resulting chart.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;icons_strip &amp;lt;- c(
  huffpo = &amp;quot;&amp;lt;img src=&amp;#39;huffpost.jpg&amp;#39; width = 90 /&amp;gt;&amp;quot;,
  cnn = &amp;quot;&amp;lt;img src=&amp;#39;CNN.png&amp;#39; width=&amp;#39;40&amp;#39; /&amp;gt;&amp;quot;,
  ap = &amp;quot;&amp;lt;img src=&amp;#39;ap.png&amp;#39; width=&amp;#39;80&amp;#39; /&amp;gt;&amp;quot;,
  foxnews = &amp;quot;&amp;lt;img src=&amp;#39;foxnews.jpg&amp;#39; width=&amp;#39;70&amp;#39; /&amp;gt;&amp;quot;,
  theblaze = &amp;quot;&amp;lt;img src=&amp;#39;theblaze.jpeg&amp;#39; width=&amp;#39;50&amp;#39; /&amp;gt;&amp;quot;
  )

words %&amp;gt;% 
  group_by(source, word) %&amp;gt;% 
  summarize(cnt = n()) %&amp;gt;% 
  slice_max(order_by = cnt, n = 10, with_ties = F) %&amp;gt;% 
  mutate(icon = unname(icons_strip[source]),
         icon = factor(icon, 
                       levels = c(icons_strip[&amp;#39;huffpo&amp;#39;], icons_strip[&amp;#39;cnn&amp;#39;],
                                  icons_strip[&amp;#39;ap&amp;#39;], icons_strip[&amp;#39;foxnews&amp;#39;],
                                  icons_strip[&amp;#39;theblaze&amp;#39;]))
         ) %&amp;gt;%
  ggplot(aes(x = reorder_within(word, cnt, source), y = cnt, fill = icon)) +
    geom_col() +
    scale_x_reordered() +
    scale_fill_discrete(guide = F) +
    labs(x = &amp;quot;Words&amp;quot;,
         y = &amp;quot;# of Occurrences&amp;quot;,
         title = &amp;quot;Most Common Words in Articles about Trump&amp;#39;s Positive COVID Test&amp;quot;) +
    facet_wrap(~icon, nrow = 2, scales = &amp;quot;free&amp;quot;) +
    coord_flip() +
    theme(
      strip.text.x = element_markdown(),
      strip.background.x = element_blank()
    )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/post/2020-10-07-looking-for-media-bias-in-coverage-of-trump-s-covid-diagnosis/index_files/figure-html/most_common_words-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Unsurprisingly, the words “President” or “Trump” were the most common words in all five articles.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;looking-into-the-sentiment-of-each-article&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Looking into the sentiment of each article&lt;/h3&gt;
&lt;p&gt;A common type of text analysis is &lt;strong&gt;&lt;em&gt;sentiment analysis&lt;/em&gt;&lt;/strong&gt; and a simple version of sentiment analysis is to lookup each word of text in a dictionary that labels it as either positive sentiment or negative sentiment. The &lt;code&gt;tidytext&lt;/code&gt; package contains a number of different sentiment lexicons and in this analysis I’ll be using the &lt;a href=&#34;https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html&#34;&gt;Bing Liu lexicon&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;In the following code block I am appending the sentiment labels to our existing data set. Since most of the words will not appear in the sentiment lexicon I’m setting the &lt;code&gt;NA&lt;/code&gt; values to a label of “neutral”. Additionally, I’m creating numeric codes for positive words (+1), negative words (-1) and neutral words (0).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;words_with_sentiment &amp;lt;- words %&amp;gt;% 
  left_join(get_sentiments(&amp;#39;bing&amp;#39;)) %&amp;gt;% 
  mutate(
    sentiment = if_else(is.na(sentiment), &amp;#39;neutral&amp;#39;, sentiment),
    sentiment_numeric  = case_when(
                sentiment == &amp;#39;positive&amp;#39; ~ 1,
                sentiment == &amp;#39;negative&amp;#39; ~ -1,
                sentiment == &amp;#39;neutral&amp;#39; ~ 0)
  )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The first look at sentiment across the five articles is to look at an average sentiment score using the numeric coding above. Everything in this code block is either vanilla &lt;code&gt;dplyr&lt;/code&gt; or &lt;code&gt;ggplot&lt;/code&gt; or has been covered in other blog posts. New to this block is &lt;code&gt;scale_x_discrete(labels = icons)&lt;/code&gt; which uses the named vector of &lt;code&gt;&amp;lt;img&amp;gt;&lt;/code&gt; tags to apply the logos to the y-axis after the coordinate flip:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;icons &amp;lt;- c(
  huffpo = &amp;quot;&amp;lt;img src=&amp;#39;huffpost.jpg&amp;#39; width = 130 /&amp;gt;&amp;quot;,
  cnn = &amp;quot;&amp;lt;img src=&amp;#39;CNN.png&amp;#39; width=&amp;#39;50&amp;#39; /&amp;gt;&amp;quot;,
  ap = &amp;quot;&amp;lt;img src=&amp;#39;ap.png&amp;#39; width=&amp;#39;130&amp;#39; /&amp;gt;&amp;quot;,
  foxnews = &amp;quot;&amp;lt;img src=&amp;#39;foxnews.jpg&amp;#39; width=&amp;#39;75&amp;#39; /&amp;gt;&amp;quot;,
  theblaze = &amp;quot;&amp;lt;img src=&amp;#39;theblaze.jpeg&amp;#39; width=&amp;#39;75&amp;#39; /&amp;gt;&amp;quot;
  )

words_with_sentiment %&amp;gt;% 
  group_by(source) %&amp;gt;% 
  summarise(
    avg_sentiment = sum(sentiment_numeric, na.rm = T) / n()
  ) %&amp;gt;% 
  ggplot(aes(
    x = factor(source, 
               levels = c(&amp;#39;theblaze&amp;#39;, &amp;#39;foxnews&amp;#39;, &amp;#39;ap&amp;#39;, &amp;#39;cnn&amp;#39;, &amp;#39;huffpo&amp;#39;)),
    y = avg_sentiment
  )) + 
  geom_linerange(ymin = 0, aes(ymax = avg_sentiment)) + 
  geom_point(size = 4, aes(color = ifelse(avg_sentiment &amp;lt; 0, &amp;#39;red&amp;#39;, &amp;#39;green&amp;#39;))) + 
  geom_text(aes(label = avg_sentiment %&amp;gt;% round(2)), vjust = -1) + 
  geom_hline(yintercept = 0) + 
  labs(x = &amp;quot;&amp;quot;, 
       y = expression(paste(&amp;quot;Avg. Sentiment Scores (&amp;quot;,
                            Sigma,&amp;quot; Positive - &amp;quot;, 
                            Sigma,&amp;quot; Negative) / # Words&amp;quot;)),
       title = &amp;quot;Total Sentiment of Articles About Trump&amp;#39;s Positive COVID Test&amp;quot;) + 
  scale_x_discrete(labels = icons) + 
  scale_y_continuous(labels = scales::percent) + 
  scale_color_identity(guide = F) + 
  coord_flip() + 
  cowplot::theme_cowplot() + 
  theme(
    axis.line = element_blank(),
    axis.ticks = element_blank(),
    axis.text.y = element_markdown(color = &amp;quot;black&amp;quot;, size = 11),
    axis.text.x = element_blank(),
    axis.title.x = element_text(size = 10),
    plot.title.position = &amp;#39;plot&amp;#39;
  )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/post/2020-10-07-looking-for-media-bias-in-coverage-of-trump-s-covid-diagnosis/index_files/figure-html/sentiment_lollipop-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Most interesting about these results is that &lt;strong&gt;there is a rank ordering: the most left-leaning article is the most negative and the most right-leaning is the most positive&lt;/strong&gt;. Other items of note are that the right-leaning articles have higher absolute sentiment scores than the left-leaning articles, and that the Associated Press article has an average sentiment score of nearly &lt;strong&gt;zero&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;While I went into this with no prior hypothesis, I’m guessing that the left-leaning articles are taking a more dire view of Trump’s COVID diagnosis while the right-leaning are focusing more on the hopeful recovery.&lt;/p&gt;
&lt;p&gt;An alternative lens is rather than looking at average sentiment scores, I can look at the distribution of Positive/Negative/Neutral words within each source.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;words_with_sentiment %&amp;gt;% 
  count(source, sentiment) %&amp;gt;% 
  group_by(source) %&amp;gt;% 
  mutate(pct = n / sum(n)) %&amp;gt;% 
  ungroup() %&amp;gt;% 
  ggplot(aes(x = factor(source, 
                        levels = c(&amp;#39;theblaze&amp;#39;, &amp;#39;foxnews&amp;#39;, &amp;#39;ap&amp;#39;, &amp;#39;cnn&amp;#39;, &amp;#39;huffpo&amp;#39;)),
             y = pct, 
             fill = factor(sentiment, 
                           levels = c(&amp;#39;negative&amp;#39;, &amp;#39;neutral&amp;#39;, &amp;#39;positive&amp;#39;))
             )
         ) + 
    geom_col() + 
    geom_text(aes(label = pct %&amp;gt;% scales::percent(accuracy = 1)), 
              position = position_stack(vjust = .5)) + 
    labs(x = &amp;quot;&amp;quot;, 
         y = &amp;quot;% of Words&amp;quot;,
         title = &amp;quot;What is the Sentiment of Different Articles on Trump&amp;#39;s Positive COVID Test?&amp;quot;, 
         fill = &amp;quot;Sentiment&amp;quot;) + 
  scale_x_discrete(labels = icons) + 
  guides(fill = guide_legend(reverse = T)) + 
  coord_flip() + 
  cowplot::theme_cowplot() + 
  theme(
    plot.title.position = &amp;#39;plot&amp;#39;,
    legend.position = &amp;#39;bottom&amp;#39;,
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank(),
    axis.text.y = element_markdown(),
    plot.title = element_text(size = 12)
  )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/post/2020-10-07-looking-for-media-bias-in-coverage-of-trump-s-covid-diagnosis/index_files/figure-html/sentiment_dist-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;While most words in all five articles are neutral, this view lets us see the positive vs. negative distribution in more detail. The left-leaning sources skew slightly toward negative sentiment, with close to equal percentages of positive and negative words, while the right-leaning articles have a higher occurrence of positive terms and a notably lower occurrence of negative terms.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;determing-the-most-representitive-words-with-tf-idf&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Determining the Most Representative Words with TF-IDF&lt;/h3&gt;
&lt;p&gt;&lt;a href=&#34;https://en.wikipedia.org/wiki/Tf%E2%80%93idf&#34;&gt;TF-IDF&lt;/a&gt;, or Term Frequency-Inverse Document Frequency, produces a numeric value representing “how important a word is to a document in a collection of documents”. Earlier in this post we looked at the most common words in each source. A problem with using &lt;em&gt;most frequent word&lt;/em&gt; as a measure of importance is that a word that is very common everywhere can’t really be important to any one document. For example, the word “President” is likely not descriptive of any single article since all five are about President Trump’s COVID diagnosis. We still want frequency to be part of the measure of importance, but we can solve the common-to-all-documents problem by weighting the metric by the inverse of the number of documents a word appears in. This is the IDF (inverse document frequency) portion of TF-IDF. The result is the product of the:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Term Frequency (TF) - Within each document what % of words is word X?&lt;/li&gt;
&lt;li&gt;Inverse Document Frequency (IDF) - How many documents does word X appear in?
&lt;ul&gt;
&lt;li&gt;This is defined as Log(Total # of Documents / # of Documents with Word X)&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;/ul&gt;
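&lt;p&gt;As a quick worked example in base R, with counts made up purely for illustration: suppose a word appears 5 times in a 271-word article and shows up in 2 of the 5 articles.&lt;/p&gt;

```r
# Made-up counts, purely for illustration
n_docs      <- 5                   # total number of articles
docs_with_w <- 2                   # articles containing the word
tf  <- 5 / 271                     # term frequency within one article
idf <- log(n_docs / docs_with_w)   # inverse document frequency
tf_idf <- tf * idf
round(tf_idf, 4)                   # ~0.0169
```

&lt;p&gt;Note that a word appearing in all five articles gets an IDF of log(5/5) = 0, and therefore a TF-IDF of zero, which is why ubiquitous words like “President” stop mattering.&lt;/p&gt;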
&lt;p&gt;Calculating TF-IDF is easy with &lt;code&gt;tidytext::bind_tf_idf()&lt;/code&gt;, which takes as parameters the word column (word), the document column (source), and a counts column (n). The function appends the tf, idf, and tf_idf columns to the data set.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;words %&amp;gt;% 
  add_count(word, name = &amp;#39;total_count&amp;#39;) %&amp;gt;%
  filter(total_count &amp;gt;= 5) %&amp;gt;% 
  count(source, word) %&amp;gt;% 
  bind_tf_idf(word, source, n) %&amp;gt;% 
  mutate(icon = unname(icons_strip[source]),
         icon = factor(icon, 
                       levels = c(icons_strip[&amp;#39;huffpo&amp;#39;], icons_strip[&amp;#39;cnn&amp;#39;],
                                  icons_strip[&amp;#39;ap&amp;#39;], icons_strip[&amp;#39;foxnews&amp;#39;],
                                  icons_strip[&amp;#39;theblaze&amp;#39;]))
         ) %&amp;gt;% 
  group_by(icon) %&amp;gt;% 
  slice_max(order_by = tf_idf, n = 10, with_ties = F) %&amp;gt;% 
  ggplot(aes(x = reorder_within(word, tf_idf, icon), 
             y = tf_idf, 
             fill = icon)) +
    geom_col() +
    scale_x_reordered() +
    scale_fill_discrete(guide = F) +
    labs(x = &amp;quot;Words&amp;quot;,
         y = &amp;quot;TF-IDF&amp;quot;,
         title = &amp;quot;Most Characteristic Words For Each Source&amp;quot;,
         subtitle = &amp;quot;Based on TF-IDF&amp;quot;) +
    facet_wrap(~icon, nrow = 2, scales = &amp;quot;free_y&amp;quot;) +
    coord_flip() +
    theme(
      strip.text.x = element_markdown(),
      strip.background.x = element_blank()
    )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/post/2020-10-07-looking-for-media-bias-in-coverage-of-trump-s-covid-diagnosis/index_files/figure-html/tfidf-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;For the Huffington Post article the most important word is “debate”, followed by “wearing”. The word “debate” does not appear at all in the right-leaning articles.&lt;/p&gt;
&lt;p&gt;In the Fox News article, it shouldn’t be a surprise that the word “fox” has a higher importance to Fox News than to any other article.&lt;/p&gt;
&lt;p&gt;What was interesting was that the two right-leaning articles’ most representative word was “tweeted”. In actually reading the articles, both primarily quote tweets from the President, Vice President, First Lady, and various other White House spokespeople.&lt;/p&gt;
&lt;p&gt;Overall, the TF-IDF results weren’t particularly interesting besides the Huffington Post’s referencing the debates and the right-leaning articles relying on Twitter for much of the information.&lt;/p&gt;
&lt;p&gt;Content aside, I was a little confused how “tweeted” could be the most representative word for two different articles, since importance is partially determined by the word NOT appearing in other articles. After thinking about this more, it is possible if one of the articles has a lot of word overlap with the other articles. The Blaze article was already the shortest at only 135 words, and perhaps it doesn’t add much new information.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;what-is-the-overlap-of-words-across-all-sources&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;What is the Overlap of Words Across All Sources?&lt;/h3&gt;
&lt;p&gt;In order to determine whether The Blaze article just doesn’t have many unique words, we’ll need a way to see which sources contain each word. With a smaller number of sources, tools like Venn diagrams would be useful for determining overlaps, but with 5 different sources there are 32 possible overlap combinations.&lt;/p&gt;
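&lt;p&gt;The 32 comes from each of the five sources being either in or out of a given overlap group; a quick base R check:&lt;/p&gt;

```r
# Each source is either present or absent in an overlap group
n_sources <- 5
combos <- expand.grid(rep(list(c(TRUE, FALSE)), n_sources))
nrow(combos)  # 2^5 = 32, counting the empty combination
```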
&lt;p&gt;A useful tool for viewing overlap of many groups in a simpler way is an “Upset” chart. In an upset chart the number of words occurring in each overlap group is displayed as a bar and what the group represents is shown in the box beneath the chart where a circle is filled in if the source is part of the group and not filled in otherwise. Shout out to &lt;a href=&#34;https://soroosj.netlify.app/2020/07/07/cocktails-upset/&#34;&gt;Joel Soroos, whose blog post helped me implement Upset charts in R&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;There are a couple of packages that can make these charts but I’ll use &lt;code&gt;ggupset&lt;/code&gt; since it works well with the tidy data format. In order to get the data in the proper format I’ll need to structure the data so that each row is a word and then there is a list-column containing all of the sources that contain that word. This can be done using &lt;code&gt;group_by&lt;/code&gt; and &lt;code&gt;summarize&lt;/code&gt; with &lt;code&gt;list()&lt;/code&gt; as the aggregate function.&lt;/p&gt;
&lt;p&gt;Then a &lt;code&gt;ggplot&lt;/code&gt; can be turned into an Upset chart through the use of &lt;code&gt;scale_x_upset()&lt;/code&gt;. Another piece of this code that’s pretty new to me is &lt;code&gt;geom_text(stat=&#39;count&#39;, aes(label=after_stat(count)), nudge_y = 10)&lt;/code&gt;. Since the data is structured as a list of words, I don’t have a column that represents the number of words in each group to pass into &lt;code&gt;geom_text()&lt;/code&gt;. Therefore &lt;code&gt;after_stat&lt;/code&gt; tells &lt;code&gt;geom_text()&lt;/code&gt; that we’re going to use the &lt;code&gt;count&lt;/code&gt; statistic, but also to set the label value &lt;strong&gt;after doing the stat calculation&lt;/strong&gt;. Admittedly I’m not great with the &lt;code&gt;stat_*&lt;/code&gt; aspects of ggplot and the &lt;code&gt;after_*&lt;/code&gt; functions, but it’s nice that I don’t have to do all the calculations before passing the data into &lt;code&gt;ggplot&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;words %&amp;gt;% 
  count(source, word) %&amp;gt;%
  group_by(word) %&amp;gt;% 
  summarize(sources = list(source)) %&amp;gt;% 
  ggplot(aes(x = sources)) + 
    geom_bar() + 
    geom_text(stat=&amp;#39;count&amp;#39;, aes(label=after_stat(count)), nudge_y = 10) +
    scale_x_upset() + 
    labs(x = &amp;quot;Set of Sources&amp;quot;, 
         y = &amp;quot;# of Unique Words&amp;quot;,
         title = &amp;quot;How Many Sources Does Each Word Appear In?&amp;quot;,
         caption = &amp;quot;Each column represents a unique combination of sources&amp;quot;) + 
    cowplot::theme_cowplot() &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/post/2020-10-07-looking-for-media-bias-in-coverage-of-trump-s-covid-diagnosis/index_files/figure-html/upset_chart-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;To read the Upset chart, the first bar shows that the largest group is composed of 197 words and represents the words that are &lt;strong&gt;ONLY&lt;/strong&gt; in the CNN article. The second bar is 186 words that &lt;strong&gt;ONLY&lt;/strong&gt; appear in the Associated Press article. For an example of an overlap, the 5th bar represents the 40 words that appear in &lt;strong&gt;BOTH&lt;/strong&gt; the AP and CNN article.&lt;/p&gt;
&lt;p&gt;To answer the original question of whether The Blaze’s high TF-IDF score for ‘tweeted’ is due to a low number of unique words in The Blaze article, we can look for the group made up of &lt;strong&gt;ONLY&lt;/strong&gt; Blaze words. Finding the column with only the Blaze circle filled in, we can see that there are only eight words unique to the Blaze article. Granted, some of this is due to The Blaze being the shortest article and the AP article having the most words.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;conclusion&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;This post looked at five articles about the event of President Trump’s COVID-19 diagnosis from different media sources to see how coverage might differ depending on each outlet’s bias. While I’m not an expert in bias, and I don’t think any results here are strong enough to suggest obvious bias, there were a few areas where the ordering does seem to indicate some amount of ‘slant’ in line with how &lt;a href=&#34;https://www.allsides.com/&#34;&gt;AllSides.com&lt;/a&gt; rated the five outlets.&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;The level of bias (both left and right) of a media outlet correlated with shorter articles.&lt;/li&gt;
&lt;li&gt;For this particular event, based on sentiment analysis, the left-leaning outlets took a slightly negative slant while the right-leaning outlets took a more positive slant.&lt;/li&gt;
&lt;li&gt;The right-leaning articles appeared to rely more on tweets for the text content of the article.&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;div id=&#34;appendix-upset-charts-with-upsetr&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Appendix: Upset Charts with UpSetR&lt;/h2&gt;
&lt;p&gt;There is an alternative implementation of Upset charts in the &lt;code&gt;UpSetR&lt;/code&gt; package that doesn’t run through ggplot. In order to use this package, each source needs to become its own column with a value of 1 if the word appears in the source and 0 otherwise. Additionally, the data can’t be in the tibble format, which is why &lt;code&gt;as.data.frame&lt;/code&gt; is used before calling the &lt;code&gt;upset()&lt;/code&gt; function.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;words %&amp;gt;% 
  distinct(source, word) %&amp;gt;% 
  mutate(val = 1) %&amp;gt;%
  pivot_wider(
    names_from = &amp;#39;source&amp;#39;,
    values_from = &amp;#39;val&amp;#39;,
    values_fill = 0
  ) %&amp;gt;%
  as.data.frame %&amp;gt;% 
  upset(order.by = &amp;#39;freq&amp;#39;,
        empty.intersections = T,
        sets.x.label = &amp;#39;Word Count&amp;#39;,
        text.scale = 1.25)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/post/2020-10-07-looking-for-media-bias-in-coverage-of-trump-s-covid-diagnosis/index_files/figure-html/upset_chart2-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;The two main advantages of UpSetR are the ability to show empty intersecting groups and the Word Count graph on the left. For example, with this version we can see that there are zero words that appear only in the Blaze and the Huffington Post articles. Also, it’s clearer in this package that the AP and CNN articles have more words than the rest.&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>What&#39;s The Best Day to Get Married?</title>
      <link>https://jlaw.netlify.app/2020/10/01/what-s-the-best-day-to-get-married/</link>
      <pubDate>Thu, 01 Oct 2020 00:00:00 +0000</pubDate>
      <guid>https://jlaw.netlify.app/2020/10/01/what-s-the-best-day-to-get-married/</guid>
      <description>


&lt;div id=&#34;tldr&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;TL;DR&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;There really &lt;strong&gt;isn’t&lt;/strong&gt; a best day to get married, as there isn’t much differentiation between the various days, both in this analysis and in life. Do what’s best for you and your relationship.&lt;/li&gt;
&lt;li&gt;However, if &lt;strong&gt;ALL&lt;/strong&gt; you care about is being married on a Saturday, maximizing the number of times your anniversary falls on a Saturday, and maximizing the number of “big” anniversaries that fall on a Saturday, then avoid the 24 months after a leap day!&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;the-original-objective-of-this-analysis&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;The original objective of this analysis&lt;/h1&gt;
&lt;p&gt;Having somewhat recently celebrated a 5th anniversary on a Saturday (🥂), Mrs. JLaw asked “How many anniversaries will we have on a Saturday?” and “When is the next big one that we’ll have?”. Upon finding out that our next “big” Saturday anniversary won’t come until our 50th, she suggested that I look into whether certain days would have been best to have gotten married.&lt;/p&gt;
&lt;p&gt;As it turned out in actually doing this analysis, there’s not that much difference in the number of Saturdays or “big” Saturdays regardless of wedding date.&lt;/p&gt;
&lt;p&gt;So the initial question was, &lt;strong&gt;&lt;em&gt;what are the BEST and WORST dates to get married&lt;/em&gt;&lt;/strong&gt; when optimizing for &lt;em&gt;maximizing the number of “big” (multiples of 5) anniversaries occurring on a Saturday&lt;/em&gt;, with the constraint that the initial wedding date ALSO needed to be on a Saturday.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;exploring-wedding-dates-and-anniversaries&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Exploring Wedding Dates and Anniversaries&lt;/h1&gt;
&lt;p&gt;Since I’ll be working with dates the &lt;code&gt;lubridate&lt;/code&gt; package will be the workhorse for preparing my data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidyverse) #Data Manipulation
library(lubridate) #Working with Dates
library(glue) # A package that works similar to the paste function
library(eulerr) # A package to create Venn-Diagrams (technically Euler Diagrams)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In order to make the universe of wedding dates tractable I’ll be looking at all potential dates occurring on a Saturday in the past 10 years (since 1/1/2010) and through the next 5 years (through 12/31/2025). The &lt;code&gt;seq.Date()&lt;/code&gt; function (from base R, though it pairs nicely with &lt;code&gt;lubridate&lt;/code&gt;’s date helpers) makes generating sequences of dates super easy. It works similarly to &lt;code&gt;seq()&lt;/code&gt; where you give it a starting point and an ending point, but in this case you also provide the interval (‘day’, ‘month’, ‘year’, etc.).&lt;/p&gt;
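&lt;p&gt;As a quick standalone illustration (a toy example, not part of the analysis that follows), here’s &lt;code&gt;seq.Date()&lt;/code&gt; generating a daily sequence that can then be filtered down to Saturdays with &lt;code&gt;wday()&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(lubridate)

#Every day in January 2020
jan &amp;lt;- seq.Date(ymd(20200101), ymd(20200131), by = &amp;#39;day&amp;#39;)

#Keep only the Saturdays (Jan 4, 11, 18, and 25 in 2020)
jan[wday(jan, label = TRUE, abbr = TRUE) == &amp;#39;Sat&amp;#39;]&lt;/code&gt;&lt;/pre&gt;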
&lt;p&gt;In the following code block, I’m constructing a tibble with a column called &lt;em&gt;wedding_date&lt;/em&gt; that is all days between 1/1/2010 and 12/31/2025 using the &lt;code&gt;ymd()&lt;/code&gt; function from &lt;code&gt;lubridate&lt;/code&gt; to turn the integers into a date. Then I’m creating a column called &lt;em&gt;wedding_date_day&lt;/em&gt; that uses the &lt;code&gt;wday()&lt;/code&gt; function from &lt;code&gt;lubridate&lt;/code&gt; to return the day of the week. The “abbr” and “label” options have it return “Mon”, “Tue”, “Wed” rather than integer values which is the default (this is in part because I constantly forget whether 1 refers to Sunday or Monday… so this eliminates that problem). Finally, I keep only dates that are Saturdays and remove leap days since those will get weird as we look at annual anniversaries.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;wedding_dates &amp;lt;- tibble(
  wedding_date = seq.Date(ymd(20100101), ymd(20251231), by = &amp;#39;day&amp;#39;),
  wedding_date_day = wday(wedding_date, abbr = T, label = T)
) %&amp;gt;% 
  #Keep only Saturdays
  filter(wedding_date_day == &amp;#39;Sat&amp;#39;) %&amp;gt;% 
  #Remove Leap Days (2/29) because they&amp;#39;re unique
  filter(!(day(wedding_date)==29 &amp;amp; month(wedding_date) == 2))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This will create a tibble with 834 rows representing all Saturdays between 2010 and 2025.&lt;/p&gt;
&lt;div id=&#34;counting-the-number-of-saturday-anniversaires-and-big-saturday-anniversaries&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Counting the Number of Saturday Anniversaries and “Big” Saturday Anniversaries&lt;/h2&gt;
&lt;p&gt;I will look at the first 50 years of marriage for any of these wedding dates. So for each of the 834 potential wedding dates I need to:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Calculate the day of week for each anniversary for the next 50 years&lt;/li&gt;
&lt;li&gt;For each wedding date, count the number of anniversaries that fall on a Saturday&lt;/li&gt;
&lt;li&gt;For each wedding date, count the number of “big” anniversaries that fall on a Saturday (again, “big” anniversaries being multiples of 5 such as 5th, 10th, …)&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;At first I really wanted to figure out a way to do this in a wide format using &lt;code&gt;map&lt;/code&gt; or &lt;code&gt;rowwise&lt;/code&gt; functions, but in the end I couldn’t figure it out in the time I wanted to spend exploring. Therefore, I’m keeping the data in a long format by using &lt;code&gt;tidyr::crossing()&lt;/code&gt; to expand each wedding date by the 50 anniversaries. So in the end each row in the initial data set will now have 50 rows.&lt;/p&gt;
&lt;p&gt;Then for each of the Wedding Date/Anniversary Year combinations, I re-use the &lt;code&gt;wday()&lt;/code&gt; function to get the day of the week and then &lt;code&gt;group_by&lt;/code&gt; the wedding date and &lt;code&gt;summarize()&lt;/code&gt; to count the number of Saturday anniversaries (&lt;em&gt;num_sat&lt;/em&gt;) and “big” Saturday anniversaries (&lt;em&gt;num_big_sat&lt;/em&gt;).&lt;/p&gt;
&lt;p&gt;The two non-typical parts of this code block are the &lt;code&gt;.groups&lt;/code&gt; argument to &lt;code&gt;summarize()&lt;/code&gt; and the use of &lt;code&gt;paste()&lt;/code&gt; in the &lt;code&gt;summarize()&lt;/code&gt;. Setting &lt;code&gt;.groups = &#39;drop&#39;&lt;/code&gt; returns an ungrouped tibble rather than only removing the last grouping layer, which is the default (the default would have returned a grouped tibble with &lt;em&gt;wedding_date&lt;/em&gt; as the grouping variable… which would probably be fine, but occasionally grouped tibbles cause downstream issues).&lt;/p&gt;
&lt;p&gt;Using &lt;code&gt;paste()&lt;/code&gt; in the &lt;code&gt;summarize()&lt;/code&gt; with the &lt;code&gt;collapse = &#39;, &#39;&lt;/code&gt; argument creates a concatenated, comma-separated string of the “big” anniversary years that fall on a Saturday (and &lt;code&gt;NA&lt;/code&gt; otherwise). &lt;code&gt;stringr::str_remove_all()&lt;/code&gt; is then used to remove the NAs from the string.&lt;/p&gt;
&lt;p&gt;If you’re reading this and are unfamiliar with regular expressions, I highly recommend getting familiar with them, especially when working with text. The regular expression “NA,? ?” means to remove the pattern “NA” followed by 0 or 1 commas followed by 0 or 1 spaces. The TL;DR here is that when a “big” anniversary didn’t fall on a Saturday the string “NA” would be concatenated in, and I wanted to remove those. So “5, NA, NA, NA, 45, NA” would just become “5, 45,”. Not ideal… but it’ll do.&lt;/p&gt;
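&lt;p&gt;Here’s that regular expression on its own, applied to a hand-built version of the example string (note the leftover trailing comma and space):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(stringr)

str_remove_all(&amp;quot;5, NA, NA, NA, 45, NA&amp;quot;, &amp;quot;NA,? ?&amp;quot;)
## [1] &amp;quot;5, 45, &amp;quot;&lt;/code&gt;&lt;/pre&gt;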
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;wedding_dates_w_annv &amp;lt;- wedding_dates %&amp;gt;% 
  #Expand Each Date to Have 50 Anniversaries
  crossing(anniversary = 1:50) %&amp;gt;% 
  #Get the Day of Week for Those Anniversaries
  mutate(anniversary_day = wday(wedding_date + years(anniversary), label = T, abbr = T)) %&amp;gt;% 
  #Summarize By Wedding Date counting the number of saturdays, number of saturdays w/ meaningful anniversary
  group_by(wedding_date, wedding_date_day) %&amp;gt;% 
  summarize(
    num_sat = sum(anniversary_day == &amp;#39;Sat&amp;#39;),
    num_big_sat = sum(anniversary_day == &amp;#39;Sat&amp;#39; &amp;amp; anniversary %% 5 == 0),
    #Building a string of all meaningful anniversary years,
    big_sat_years = str_remove_all(
      paste(
        if_else(anniversary_day == &amp;#39;Sat&amp;#39; &amp;amp; anniversary %% 5 == 0, 
                anniversary, 
                NA_integer_
                ), 
        collapse = &amp;#39;, &amp;#39;), 
      &amp;quot;NA,? ?&amp;quot;),
    .groups = &amp;#39;drop&amp;#39;
  )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Post-processing the data looks like:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;center&#34;&gt;wedding_date&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;wedding_date_day&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;num_sat&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;num_big_sat&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;big_sat_years&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;2010-01-02&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;Sat&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;7&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;45,&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;2010-01-09&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;Sat&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;7&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;45,&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;2010-01-16&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;Sat&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;7&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;45,&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;div id=&#34;how-many-big-saturday-anniversaries-does-anyone-get&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;How many “Big” Saturday Anniversaries Does Anyone Get?&lt;/h3&gt;
&lt;p&gt;The first question to explore is, for the 834 Saturdays in our data as potential wedding dates, how many of the “big” anniversaries will fall on a Saturday. The following code block is pretty vanilla &lt;code&gt;dplyr&lt;/code&gt; with the use of &lt;code&gt;count()&lt;/code&gt; and &lt;code&gt;mutate()&lt;/code&gt;. If you’ve never seen the &lt;code&gt;glue&lt;/code&gt; package and function before, it works a lot like &lt;code&gt;paste()&lt;/code&gt; in its most basic form. The main difference is that R will execute the code within the &lt;code&gt;{ }&lt;/code&gt; so it can be included directly within the quotes rather than separated by commas. It can also be used similarly to &lt;code&gt;.format()&lt;/code&gt; in Python.&lt;/p&gt;
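&lt;p&gt;For anyone new to &lt;code&gt;glue&lt;/code&gt;, here’s a minimal side-by-side with &lt;code&gt;paste0()&lt;/code&gt; (a toy example, not part of the analysis):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(glue)

n &amp;lt;- 834
paste0(&amp;quot;n = &amp;quot;, n, &amp;quot; (&amp;quot;, round(n / 10), &amp;quot;%)&amp;quot;)  #pieces separated by commas
glue(&amp;quot;n = {n} ({round(n / 10)}%)&amp;quot;)               #code runs inside the braces
## n = 834 (83%)&lt;/code&gt;&lt;/pre&gt;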
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;wedding_dates_w_annv %&amp;gt;% 
  # Get frequencies of Big Saturday Anniversaries
  count(num_big_sat) %&amp;gt;% 
  # Create %s 
  mutate(pct = n/sum(n)) %&amp;gt;% 
  ggplot(aes(x = as.factor(num_big_sat), y = pct, fill = as.factor(num_big_sat))) +
    geom_col() + 
    geom_text(aes(label = glue(&amp;quot;{pct %&amp;gt;% scales::percent()} (n={n %&amp;gt;% scales::comma()})&amp;quot;)), nudge_y = 0.02) + 
    labs(title = &amp;quot;How many ***BIG*** anniversaries are celebrated on Saturday?&amp;quot;,
         subtitle = glue(&amp;quot;Saturday Wedding Dates 2010 - 2025 (n = {nrow(wedding_dates_w_annv)})&amp;quot;),
         caption = &amp;quot;Big = Multiple of 5 (5th, 10th, etc.)&amp;quot;,
         x = &amp;quot;# of Big Anniversaries on Saturdays&amp;quot;,
         y = &amp;quot;% of Wedding Dates&amp;quot;) + 
    scale_fill_discrete(guide = F) + 
    cowplot::theme_cowplot() + 
    theme(
      plot.title = ggtext::element_markdown(),
      axis.text.y = element_blank(),
      axis.ticks.y = element_blank()
    )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/post/2020-10-01-what-s-the-best-day-to-get-married/index_files/figure-html/plot_num_big_saturdays-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;The primary reason there &lt;strong&gt;isn’t&lt;/strong&gt; a best or worst wedding date is that all potential wedding dates either have 1 or 2 &lt;strong&gt;BIG&lt;/strong&gt; anniversaries on a Saturday. So there isn’t too much of a difference in choice of dates.&lt;/p&gt;
&lt;p&gt;So let’s look at how many anniversaries in total occur on a Saturday.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;how-many-total-saturday-anniversaries-does-anyone-get&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;How many total Saturday Anniversaries Does Anyone Get?&lt;/h3&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;wedding_dates_w_annv %&amp;gt;% 
  count(num_sat) %&amp;gt;% 
  mutate(pct = n/sum(n)) %&amp;gt;% 
  ggplot(aes(x = as.factor(num_sat), y = pct, fill = as.factor(num_sat))) +
  geom_col() + 
  geom_text(aes(label = glue(&amp;quot;{pct %&amp;gt;% scales::percent()} (n={n %&amp;gt;% scales::comma()})&amp;quot;)), nudge_y = 0.02) + 
  labs(title = &amp;quot;How many anniversaries are celebrated on Saturday?&amp;quot;,
       subtitle = glue(&amp;quot;Saturday Wedding Dates 2010 - 2025 (n = {nrow(wedding_dates_w_annv)})&amp;quot;),
       x = &amp;quot;# of Anniversaries on Saturdays&amp;quot;,
       y = &amp;quot;% of Wedding Dates&amp;quot;) + 
  scale_fill_discrete(guide = F) + 
  cowplot::theme_cowplot() + 
  theme(
    axis.text.y = element_blank(),
    axis.ticks.y = element_blank()
  )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/post/2020-10-01-what-s-the-best-day-to-get-married/index_files/figure-html/plot_num_total_saturday-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Continuing with the theme of there not being major differences: 75% of Saturday wedding dates will have 7 anniversaries on a Saturday and 25% will have 6. So, while 7 would be preferable, the difference between 7 and 6 again is not large.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;looking-at-both-total-saturdays-and-big-saturdays&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Looking at Both Total Saturdays and “Big” Saturdays&lt;/h3&gt;
&lt;p&gt;Since “big” anniversaries had a 50/50 distribution and overall Saturdays had a 25/75 distribution, the next step is to look at the cross-tabulation of the two previous fields:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;wedding_dates_w_annv %&amp;gt;% 
  count(num_sat, num_big_sat) %&amp;gt;% 
  mutate(pct = n / sum(n)) %&amp;gt;% 
  ggplot(aes(x = factor(num_sat), y = factor(num_big_sat), fill = pct)) + 
    geom_tile() + 
    geom_text(aes(label = glue(&amp;quot;{pct %&amp;gt;% scales::percent()} \n (n={n})&amp;quot;))) + 
    labs(title = &amp;quot;A deeper look into Saturday Anniversaries&amp;quot;,
         subtitle = glue(&amp;quot;Saturday Wedding Dates 2010 - 2025 (n = {nrow(wedding_dates_w_annv)})&amp;quot;),
         x = &amp;quot;# of Anniversaries on Saturdays&amp;quot;,
         y = &amp;quot;# of &amp;#39;Big&amp;#39; Anniversaries on Saturday&amp;quot;) + 
    scale_fill_gradient(guide = F, low = &amp;quot;#769293&amp;quot;, high = &amp;quot;#fad7d5&amp;quot;) +
    cowplot::theme_cowplot()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/post/2020-10-01-what-s-the-best-day-to-get-married/index_files/figure-html/unnamed-chunk-1-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Looking across both dimensions, everyone who has two “big” anniversaries on a Saturday ALSO has 7 anniversaries on a Saturday. However, not everyone who has 7 anniversaries on Saturday will have 2 “big” anniversaries on a Saturday. Instead there are three groups:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;6 Total / 1 Big (25%)&lt;/li&gt;
&lt;li&gt;7 Total / 1 Big (25%)&lt;/li&gt;
&lt;li&gt;7 Total / 2 Big (50%)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In this case, having 7 anniversaries and 2 “Big” Anniversaries seems preferable to the other two groups… if you only cared about having your anniversary on a Saturday.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;what-big-anniversaries-will-be-celebrated-on-saturdays&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;What “Big” Anniversaries Will Be Celebrated on Saturdays?&lt;/h3&gt;
&lt;p&gt;So far, I’ve defined “big” anniversaries as multiples of 5 (5th, 10th, … 45th, 50th). However, I haven’t looked at which of those big ones are occurring on a Saturday. To show these “big” anniversaries I’ll use the &lt;code&gt;eulerr&lt;/code&gt; package to create a Venn-Diagram of these years.&lt;/p&gt;
&lt;p&gt;The package expects a specific format where each column is a logical indicating whether or not an observation is a member of that group. From a quick check on the &lt;em&gt;big_sat_years&lt;/em&gt; field I can see that the only “big” anniversaries that fall on Saturdays are the 5th, 45th, and 50th.&lt;/p&gt;
&lt;p&gt;Of note is the regular expression “\b5\b” for identifying the 5th anniversary. &lt;code&gt;\\b&lt;/code&gt; represents a word boundary so it is included to make sure that the 5th anniversary doesn’t accidentally get picked up by &lt;code&gt;str_detect()&lt;/code&gt; as part of 45 or 50, which would occur if I only searched for “5”.&lt;/p&gt;
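&lt;p&gt;A quick demonstration of why the word boundaries matter (toy strings, not the real data):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(stringr)

str_detect(&amp;quot;45, 50&amp;quot;, &amp;quot;5&amp;quot;)        #TRUE  - matches the 5 inside 45 and 50
str_detect(&amp;quot;45, 50&amp;quot;, &amp;quot;\\b5\\b&amp;quot;)  #FALSE - no standalone 5
str_detect(&amp;quot;5, 45&amp;quot;, &amp;quot;\\b5\\b&amp;quot;)   #TRUE  - the leading 5 stands alone&lt;/code&gt;&lt;/pre&gt;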
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;wedding_dates_w_annv %&amp;gt;% 
  #Constructing Logicals for Venn Diagrams
  transmute(
    `5th \n Anniversary` = str_detect(big_sat_years, &amp;#39;\\b5\\b&amp;#39;),
    `45th \n Anniversary` = str_detect(big_sat_years, &amp;#39;45&amp;#39;),
    `50th \n Anniversary` = str_detect(big_sat_years, &amp;#39;50&amp;#39;)
  ) %&amp;gt;%
  #Plot the Venn-Diagram
  euler() %&amp;gt;% 
  plot(quantities = list(type = c(&amp;#39;counts&amp;#39;, &amp;#39;percent&amp;#39;)),
       percentages = TRUE,
       main = &amp;quot;Which &amp;#39;big&amp;#39; anniversaries get celebrated on Saturdays?&amp;quot;,
       )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/post/2020-10-01-what-s-the-best-day-to-get-married/index_files/figure-html/venn_diagram_big_anniversaries-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;So 75% of wedding dates will celebrate their 45th anniversary on a Saturday. 50% will celebrate ONLY their 45th anniversary and 25% will celebrate their 45th and 50th anniversaries on a Saturday. The last 25% will celebrate their 5th and 50th anniversary on a Saturday. No one will ONLY celebrate either their 5th or 50th. Fitting this into our three group paradigm from the prior section:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;6 Total / 1 Big (25%) - Will &lt;strong&gt;ONLY&lt;/strong&gt; celebrate their 45th Anniversary&lt;/li&gt;
&lt;li&gt;7 Total / 1 Big (25%) - Will &lt;strong&gt;ONLY&lt;/strong&gt; celebrate their 45th Anniversary&lt;/li&gt;
&lt;li&gt;7 Total / 2 Big (50%)
&lt;ul&gt;
&lt;li&gt;25% will celebrate their 5th and 50th&lt;/li&gt;
&lt;li&gt;25% will celebrate their 45th and 50th&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;is-there-a-time-component-to-which-group-you-end-up-in&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Is there a time component to which group you end up in?&lt;/h3&gt;
&lt;p&gt;This final section looks at the time component to whether you wind up in the 6/1, 7/1, or 7/2 group. In order to summarize to a Year/Month level, the average number of Saturdays and “Big” Saturdays will be used. Then in the following heat-map, the year of the wedding date appears on the y-axis and the month of the wedding is on the x-axis.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;wedding_dates_w_annv %&amp;gt;% 
  #Extract the month and year components of each wedding date
  mutate(
    m = month(wedding_date),
    y = year(wedding_date),
  ) %&amp;gt;% 
  group_by(m, y) %&amp;gt;% 
  #Get Averages
  summarize(across(starts_with(&amp;#39;num&amp;#39;), mean), .groups = &amp;#39;drop&amp;#39;) %&amp;gt;%
  mutate(grp = glue(&amp;quot;{num_sat} Total / {num_big_sat} Big&amp;quot;)) %&amp;gt;% 
  ggplot(aes(x = factor(y), y = factor(m), fill = grp)) + 
  geom_tile() + 
  scale_fill_viridis_d(option = &amp;quot;D&amp;quot;) + 
  labs(x = &amp;quot;Year of Wedding Date&amp;quot;,
       y = &amp;quot;Month of Wedding Date&amp;quot;,
       title = &amp;quot;Looking at # Saturdays / ***&amp;#39;BIG&amp;#39;*** Saturdays&amp;quot;,
       fill = &amp;quot;&amp;quot;) + 
  cowplot::theme_cowplot() + 
  theme(plot.title = ggtext::element_markdown()) +
  coord_flip()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/post/2020-10-01-what-s-the-best-day-to-get-married/index_files/figure-html/time_dimension-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;There appears to be a reproducible pattern to which of the three groups you’ll wind up in based on the initial wedding date. Probably not surprisingly, this occurs in a four-year cycle.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;6 Total / 1 Big - Starts in March after a leap year and continues for the next 12 months.
&lt;ul&gt;
&lt;li&gt;Examples: Mar 2012-Feb 2013, Mar 2016-Feb 2017, Mar 2020-Feb 2021&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;7 Total / 1 Big - The following &lt;strong&gt;12&lt;/strong&gt; months after the first group
&lt;ul&gt;
&lt;li&gt;Examples: Mar 2013-Feb 2014, Mar 2017-Feb 2018, Mar 2021-Feb 2022&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;7 Total / 2 Big - The following &lt;strong&gt;24&lt;/strong&gt; months after the second group
&lt;ul&gt;
&lt;li&gt;Examples: Mar 2014-Feb 2016, Mar 2018-Feb 2020, Mar 2022-Feb 2024&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;conclusion&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Conclusion&lt;/h1&gt;
&lt;p&gt;Weddings (or the choice not to have one) are personal decisions for which there is no right or wrong. &lt;strong&gt;&lt;em&gt;HOWEVER&lt;/em&gt;&lt;/strong&gt;, if you choose to have your wedding on a Saturday and want to maximize both the number of anniversaries and the number of “big” anniversaries you celebrate on a Saturday, then you’d do well to avoid the 24 months after a leap day.&lt;/p&gt;
&lt;p&gt;But the differences between the three groups identified here are pretty small. So while the original question was what are the best and worst days to get married, the honest answer is that it really doesn’t matter!&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>COVID-19s Impact on the NYC Subway System</title>
      <link>https://jlaw.netlify.app/2020/09/07/covid-19s-impact-on-the-nyc-subway-system/</link>
      <pubDate>Mon, 07 Sep 2020 00:00:00 +0000</pubDate>
      <guid>https://jlaw.netlify.app/2020/09/07/covid-19s-impact-on-the-nyc-subway-system/</guid>
      <description>
&lt;script src=&#34;https://jlaw.netlify.app/rmarkdown-libs/htmlwidgets/htmlwidgets.js&#34;&gt;&lt;/script&gt;
&lt;script src=&#34;https://jlaw.netlify.app/rmarkdown-libs/pymjs/pym.v1.js&#34;&gt;&lt;/script&gt;
&lt;script src=&#34;https://jlaw.netlify.app/rmarkdown-libs/widgetframe-binding/widgetframe.js&#34;&gt;&lt;/script&gt;


&lt;p&gt;At 8pm on March 22nd, 2020, the &lt;a href=&#34;https://www.governor.ny.gov/news/governor-cuomo-signs-new-york-state-pause-executive-order&#34;&gt;“New York State on PAUSE”&lt;/a&gt; executive order became effective and New York City went on lockdown until June 8th, when the Phase 1 reopening began. During this time usage of the public transit systems had a sudden drop as all non-essential services needed to close. In this analysis, I look at &lt;a href=&#34;http://web.mta.info/developers/fare.html&#34;&gt;MTA Subway Fare&lt;/a&gt; data to understand the effect of the PAUSE order on New York City Subway Ridership.&lt;/p&gt;
&lt;p&gt;The goals here are to:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;See the overall effect of the PAUSE order on ridership&lt;/li&gt;
&lt;li&gt;See whether usage declines around the city differ by type of Metrocard (Full Fare, Unlimited, etc.)&lt;/li&gt;
&lt;li&gt;Create an interactive map to understand the regional differences in usage declines&lt;/li&gt;
&lt;/ol&gt;
&lt;div id=&#34;packages-used&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Packages Used&lt;/h2&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidyverse) #For Data Manipulation and Plotting
library(janitor) #For cleaning up the variable names in the CSV Files
library(lubridate) #For date processing 
library(patchwork) # For combining multiple ggplots together
library(ggmap) # For producing a static map
library(ggtext) # For adding some flair to ggplot
library(leaflet) # For Making Interactive Plots
library(rvest) # For Web Scraping Links to Download&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;gathering-the-data&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Gathering the Data&lt;/h2&gt;
&lt;p&gt;The Metropolitan Transit Authority (MTA), which runs the New York City Subway system, publishes the &lt;a href=&#34;http://web.mta.info/developers/fare.html&#34;&gt;number of Metrocard swipes that occur in the system on a weekly basis&lt;/a&gt; by Fare type (Full-Fare, 30-day Unlimited, Student Discount, Senior Discount, etc).&lt;/p&gt;
&lt;p&gt;Fortunately, since each weekly file exists as a &lt;code&gt;.csv&lt;/code&gt; with a roughly similar format it can be easily scraped using the &lt;code&gt;rvest&lt;/code&gt; package. For this initial scrape, I will be getting any file with a filename from 2019 or 2020. According to the MTA website, the data is uploaded on a two-week delay so a file titled &lt;code&gt;fares_200905.csv&lt;/code&gt; (9/5/20) will actually contain the data from two weeks earlier.&lt;/p&gt;
&lt;p&gt;The process for obtaining all of the data will be:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Use &lt;code&gt;rvest&lt;/code&gt; to extract the paths to all files in a vector by identifying all the anchor tags on the page (&lt;code&gt;html_nodes(&#34;a&#34;)&lt;/code&gt;) and then extracting the &lt;code&gt;href&lt;/code&gt; attribute (&lt;code&gt;html_attr(&#34;href&#34;)&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;purrr&lt;/code&gt;’s &lt;code&gt;keep&lt;/code&gt; and &lt;code&gt;stringr&lt;/code&gt;’s &lt;code&gt;str_detect&lt;/code&gt; to keep only the elements of the initial vector that match a certain pattern (have titles for 2019 or 2020)&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;purrr&lt;/code&gt;’s &lt;code&gt;map_dfr&lt;/code&gt; function to apply a function to each &lt;code&gt;.csv&lt;/code&gt; file where the function:
&lt;ul&gt;
&lt;li&gt;Reads the &lt;code&gt;.csv&lt;/code&gt; file from the MTA’s website (using &lt;code&gt;readr::read_csv&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Cleans the column names to a more R-friendly format (using &lt;code&gt;janitor::clean_names&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Removes any columns where all values are &lt;code&gt;NA&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Creates some meta-data around the actual time periods the data reflects&lt;/li&gt;
&lt;li&gt;Turns character formatted numbers into actual numbers (using &lt;code&gt;readr::parse_number&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Cast to a long-format (using &lt;code&gt;tidyr::pivot_longer&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;all_weeks &amp;lt;- read_html(&amp;quot;http://web.mta.info/developers/fare.html&amp;quot;) %&amp;gt;%
  html_nodes(&amp;quot;a&amp;quot;) %&amp;gt;% 
  html_attr(&amp;quot;href&amp;quot;) %&amp;gt;% 
  keep(str_detect(., &amp;#39;fares_(20|19)\\d{4}\\.csv&amp;#39;)) %&amp;gt;% 
  map_dfr(., function(x){
    return(
      read_csv(paste0(&amp;quot;http://web.mta.info/developers/&amp;quot;, x), skip = 2) %&amp;gt;% 
        clean_names %&amp;gt;%
        #Drop Dead Columns
        select_if(~!all(is.na(.x))) %&amp;gt;%
        mutate(
          key = str_extract(x, &amp;#39;\\d+&amp;#39;),
          
          #The data in the files covers seven-day periods beginning on the Saturday 
          #two weeks prior to the posting date and ending on the following Friday. 
          #Thus, as an example, the file labeled Saturday, January 15, 2011, has data 
          #covering the period from Saturday, January 1, 2011, through Friday, January 7. 
          #The file labeled January 22 has data covering the period from 
          #Saturday, January 8, through Friday, January 14. And so on and so forth
          week_start = ymd(paste0(&amp;#39;20&amp;#39;,key)) - days(14),
          week_end = ymd(paste0(&amp;#39;20&amp;#39;,key)) - days(8)
        ) %&amp;gt;%
        mutate(across(c(-remote, -station, -week_start, -week_end, -key), parse_number)) %&amp;gt;% 
        pivot_longer(
          cols = c(-remote, -station, -week_start, -week_end, -key),
          names_to = &amp;quot;fare_type&amp;quot;,
          values_to = &amp;quot;fares&amp;quot;
        )
    )
  }
) &lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;time-series-of-subway-usage-by-week&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Time-Series of Subway Usage by Week&lt;/h2&gt;
&lt;p&gt;A first step toward understanding the effect of COVID on the NYC Subway system is to look at a weekly time-series of total subway usage. In this chart and throughout the rest of this post, when looking at the size of the ridership decline I will be comparing points one month prior to the start of the PAUSE order (week of February 22nd) and one month after (week of April 18th).&lt;/p&gt;
&lt;p&gt;From a coding perspective, this step aggregates all the individual fare data by week and plots it using &lt;code&gt;ggplot2&lt;/code&gt;. The only non-vanilla ggplot portion is the use of &lt;code&gt;ggtext&lt;/code&gt;’s &lt;code&gt;geom_textbox&lt;/code&gt; to add some flair to the annotations.&lt;/p&gt;
&lt;p&gt;The red dots on the chart represent the comparison points used for the rest of this analysis, and the dashed black line marks March 22nd, when the PAUSE order went into effect.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;all_weeks %&amp;gt;% 
  group_by(key, week_start, week_end) %&amp;gt;% 
  summarize(fares = sum(fares, na.rm = T), .groups = &amp;#39;drop&amp;#39;) %&amp;gt;% 
  ggplot(aes(x = week_start, y = fares/1e6)) + 
    geom_line(color = &amp;#39;#0039A6&amp;#39;) + 
    geom_vline(xintercept = ymd(20200322), lty = 2) + 
    geom_point(data = tibble(
      week_start = c(ymd(20200222), ymd(20200418)),
      fares = c(30768135, 2548002)
    ), color = &amp;#39;red&amp;#39;, size =3
    ) +
    geom_textbox(
      x = ymd(20191001),
      y = 15,
      label = &amp;quot;A ***&amp;lt;span style = &amp;#39;color:red&amp;#39;&amp;gt;92% Decrease&amp;lt;/span&amp;gt;*** \n in Subway Ridership \n 1 month before \n vs. 1 month after \n PAUSE order&amp;quot;,
      fill = &amp;#39;cornsilk&amp;#39;,
      halign = 0.5,
    ) + 
    labs(x = &amp;quot;Week Beginning&amp;quot;, y = &amp;quot;# of MTA Subway Fares (millions)&amp;quot;,
         title = &amp;quot;&amp;lt;span style=&amp;#39;color:#0039A6&amp;#39;&amp;gt;MTA&amp;lt;/span&amp;gt; Ridership (Jan 2019 - Aug 2020)&amp;quot;,
         subtitle = &amp;quot;PAUSE Order Begins on 3/22/2020&amp;quot;) + 
    scale_y_continuous(labels = scales::comma) +
    cowplot::theme_cowplot() + 
    theme(
      plot.title = element_markdown()
    )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/post/2020-09-07-covid-19s-impact-on-the-nyc-subway-system/index_files/figure-html/overall_trends-1.png&#34; width=&#34;672&#34; /&gt;
From this chart it’s clear that COVID had a strong effect on subway ridership, with a 92% decline between the month prior and the month after. While ridership is beginning to trend upward again, the overall numbers are still drastically smaller than in the pre-COVID period.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;exploring-the-overall-distribution-of-fares&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Exploring the Overall Distribution of Fares&lt;/h2&gt;
&lt;p&gt;Riders use &lt;em&gt;Metrocards&lt;/em&gt; to gain access to the NYC Subway system, and there are a number of different types of Metrocards. Since ~94% of rides occur on the 7 most common card types, I’ll be focusing on those and bucketing the rest into an “other” group. The 7 most common are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Full Fare&lt;/strong&gt; - A person loads money on their Metrocard and pays per trip&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Annual Unlimited&lt;/strong&gt; - A person pays a fixed amount for a year of unlimited rides (typically offered through a person’s workplace)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;30 Day Unlimited&lt;/strong&gt; - A person pays a fixed amount for 30 days of unlimited rides&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;7 Day Unlimited&lt;/strong&gt; - A person pays a fixed amount for 7 days of unlimited rides&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Student&lt;/strong&gt; - Assigned by schools to students for a certain number of trips per day&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Senior Citizen&lt;/strong&gt; - A reduced-fare Metrocard used by those Age 65 and over or with a disability&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;EasyPayXpress&lt;/strong&gt; - A person sets up an account that automatically reloads the card when the balance gets low&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Some data cleaning is needed to make the data more human readable and to focus only on the weeks being compared rather than every week since 2019. This step keeps only the weeks we care about, casts each time period to a column, gives both the time periods and the fare types nicer names, and finally filters out some stations that are part of the MTA system but aren’t actually subway stations, such as the AirTrain at JFK Airport and the PATH trains between New York and New Jersey.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;combined &amp;lt;- all_weeks %&amp;gt;% 
  filter(week_start %in% c(ymd(20200222), ymd(20200418))) %&amp;gt;% 
  pivot_wider(
    id_cols = c(&amp;#39;remote&amp;#39;, &amp;#39;station&amp;#39;, &amp;#39;fare_type&amp;#39;),
    names_from = week_start,
    values_from = fares,
    values_fill = list(fares = 0)
  ) %&amp;gt;% 
  rename(apr=`2020-04-18`, feb=`2020-02-22`) %&amp;gt;% 
  mutate(
    fare_type = case_when(
      fare_type == &amp;#39;ff&amp;#39; ~ &amp;#39;Full Fare&amp;#39;,
      fare_type == &amp;#39;x30_d_unl&amp;#39; ~ &amp;#39;30-Day Unlimited&amp;#39;,
      fare_type == &amp;#39;x7_d_unl&amp;#39; ~ &amp;#39;7-Day Unlimited&amp;#39;,
      fare_type == &amp;#39;students&amp;#39; ~ &amp;#39;Student&amp;#39;,
      fare_type == &amp;#39;sen_dis&amp;#39; ~ &amp;#39;Senior Citizen/Disabled&amp;#39;,
      fare_type == &amp;#39;tcmc_annual_mc&amp;#39; ~ &amp;#39;Annual Metrocard&amp;#39;,
      fare_type == &amp;#39;mr_ezpay_exp&amp;#39; ~ &amp;#39;EasyPayXpress&amp;#39;,
      TRUE ~ fare_type
    )
  ) %&amp;gt;% 
  #Remove SBS Bus Stations and PATH
  filter(!str_detect(station, &amp;quot;SBS-|PA-|AIRTRAIN&amp;quot;))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;After cleaning, our data covers 443 different subway stations and 26 different fare_types.&lt;/p&gt;
&lt;p&gt;To recode the fare types outside of the Top 7, I first need to identify what those Top 7 fare types are. In the code below, I create a vector of the Top 7 fare types based on the February data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;top_7 &amp;lt;- combined %&amp;gt;% 
  count(fare_type, wt = feb, sort = T) %&amp;gt;% 
  head(7) %&amp;gt;% 
  pull(fare_type)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then the final step is to aggregate the data over the various stations. This step uses &lt;code&gt;fct_other&lt;/code&gt; from &lt;code&gt;forcats&lt;/code&gt; to keep only the top 7 fares and create an “Other Fares” label for everything else. Other &lt;code&gt;forcats&lt;/code&gt; functions are used as well: &lt;code&gt;fct_reorder&lt;/code&gt; orders the fare types from most common to least common, while &lt;code&gt;fct_relevel&lt;/code&gt; ensures that the “Other Fares” group comes last.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;agg_data &amp;lt;- combined %&amp;gt;% 
  pivot_longer(
    cols = c(&amp;#39;feb&amp;#39;, &amp;#39;apr&amp;#39;),
    names_to = &amp;quot;month&amp;quot;,
    values_to = &amp;#39;fares&amp;#39;
  ) %&amp;gt;% 
  # Collapse Non-Top 7 Fares to &amp;quot;Other&amp;quot; Group
  mutate(
    fare_type = fct_other(fare_type, keep = top_7, other_level = &amp;quot;Other Fares&amp;quot;)
  ) %&amp;gt;% 
  #Order with Month First So Summarize Will Return a Grouped DF by Month
  group_by(month, fare_type) %&amp;gt;% 
  summarize(fares = sum(fares)) %&amp;gt;% 
  #Create % Variable
  mutate(pct = fares / sum(fares),
         period = if_else(month == &amp;#39;feb&amp;#39;, &amp;#39;2/22 - 2/28&amp;#39;, &amp;#39;4/18 - 4/24&amp;#39;)
  ) %&amp;gt;% 
  ungroup() %&amp;gt;% 
  #Refactor Fare Type for Charts
  mutate(
    fare_type = fct_reorder(fare_type, fares, .fun = max) %&amp;gt;% fct_relevel(., &amp;quot;Other Fares&amp;quot;, after = 0L)
  ) &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The following plots leverage the &lt;code&gt;patchwork&lt;/code&gt; package to combine multiple ggplots together to show both the share of Fare Types Pre/Post COVID as well as the actual number of fares. This code is somewhat cumbersome and could probably be done more easily with facets, but I wanted to play with &lt;code&gt;plot_annotation&lt;/code&gt; and &lt;code&gt;plot_layout&lt;/code&gt; from &lt;code&gt;patchwork&lt;/code&gt; in order to add titles to the combined image rather than each plot individually. If you haven’t used &lt;code&gt;patchwork&lt;/code&gt; to combine multiple plots, I highly recommend it.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;(agg_data %&amp;gt;% 
  ggplot(aes(x = fare_type, 
             y = pct, 
             fill = fct_rev(period))) + 
    geom_col(position = &amp;#39;dodge&amp;#39;) + 
    geom_text(aes(label = pct %&amp;gt;% scales::percent(accuracy = .1)),
              position = position_dodge(width = .9),
              hjust = 0,
              size = 3) +
    labs(x = &amp;quot;Fare Type&amp;quot;, y = &amp;quot;Share of Fares&amp;quot;,
         title = &amp;quot;Share of Subway Rides&amp;quot;,
         fill = &amp;quot;Period&amp;quot;) + 
    guides(fill = guide_legend(reverse = T)) + 
    coord_flip(ylim = c(0, .6)) + 
    cowplot::theme_cowplot() + 
    theme(
      axis.text.x = element_blank(),
      axis.ticks.x = element_blank(),
      plot.title = element_text(size = 12)
    )
) + 
(agg_data %&amp;gt;% 
  ggplot(aes(x = fare_type, 
             y = fares, 
             fill = fct_rev(period))) + 
  geom_col(position = &amp;#39;dodge&amp;#39;) + 
  geom_text(aes(label = fares %&amp;gt;% scales::comma()),
            position = position_dodge(width = .9),
            hjust = 0,
            size = 3) +
  labs(x = &amp;quot;&amp;quot;, y = &amp;quot;Number of Fares&amp;quot;,
       title = &amp;quot;# of Subway Rides&amp;quot;,
       fill = &amp;quot;Period&amp;quot;) + 
  scale_fill_discrete(guide = F) +
  coord_flip(ylim = c(0, 15e6)) + 
  cowplot::theme_cowplot() + 
  theme(
    axis.text = element_blank(),
    axis.ticks = element_blank(),
    plot.title = element_text(size = 12)
  )
) + plot_annotation(
  title = &amp;#39;Changes in NYC Subway Ridership Pre/Post PAUSE&amp;#39;,
  caption = &amp;#39;NYC PAUSE Began March 22nd&amp;#39;
) + plot_layout(guides = &amp;quot;collect&amp;quot;) &amp;amp; theme(legend.position = &amp;quot;bottom&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/post/2020-09-07-covid-19s-impact-on-the-nyc-subway-system/index_files/figure-html/create_overall_fare_plot-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;The panel on the right (number of fares) makes it very clear that the number of subway rides has plummeted in the month following the PAUSE order, with Full Fare rides dropping from 10M to 1.2M. Even more interesting is that the specialty card types (Unlimited and Student) show very severe declines, with the 30-Day Unlimited dropping 96%, from 8M to 350k.&lt;/p&gt;
&lt;p&gt;In terms of share of swipes, the Full Fare Metrocard actually &lt;strong&gt;increases&lt;/strong&gt; from 36% to 50%. This is likely because students are learning virtually and those who are able to work from home are doing so. Additionally, if subway travel becomes more infrequent, it’s no longer cost-effective to use a 30-Day Unlimited card, so there is also an effect from people who WOULD have used specialty cards switching to Full Fare.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;does-the-decline-by-fare-type-depend-on-the-area-of-nyc&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Does the decline by Fare Type depend on the area of NYC?&lt;/h2&gt;
&lt;p&gt;From the first two charts it’s clear that there is an overall decline in subway ridership and that the decline occurs across all fare types. Another question is “do these declines change by area of the city?” To answer it, I’ll be using &lt;code&gt;ggmap&lt;/code&gt; to create maps of NYC subway stations by the various fare types.&lt;/p&gt;
&lt;p&gt;The first step is to create data at the station and fare type level and geocode the MTA station data (huge thanks to &lt;a href=&#34;https://github.com/chriswhong/nycturnstiles/&#34;&gt;Chris Whong&lt;/a&gt;, who did the work of mapping Lat/Longs to the station names). Since Chris’ work was from 2013, newer stations such as Hudson Yards and the 2nd Avenue Subway do not appear.&lt;/p&gt;
&lt;p&gt;To clean up the map, in cases where there were multiple geocodes for a single station, the maximum Lat and Long were used, and stations with fewer than 1,000 pre-COVID swipes of a given fare type were removed.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;station_level &amp;lt;- combined %&amp;gt;% 
  mutate(
    fare_type = fct_other(fare_type, keep = top_7, other_level = &amp;quot;Other Fares&amp;quot;)
  ) %&amp;gt;% 
  group_by(remote, station, fare_type) %&amp;gt;% 
  summarize(feb = sum(feb),
            apr = sum(apr)
  ) %&amp;gt;% 
  mutate(
    abs_change = apr-feb,
    rel_change = apr/feb - 1
  )

geocodes &amp;lt;- read_csv(&amp;#39;https://raw.githubusercontent.com/chriswhong/nycturnstiles/master/geocoded.csv&amp;#39;, 
                     col_names = c(&amp;#39;remote&amp;#39;, &amp;#39;zuh&amp;#39;, &amp;#39;station&amp;#39;, &amp;#39;line&amp;#39;, &amp;#39;system&amp;#39;, &amp;#39;lat&amp;#39;, &amp;#39;long&amp;#39;),
)

comb_geo &amp;lt;- station_level %&amp;gt;% 
  inner_join(geocodes %&amp;gt;% group_by(remote) %&amp;gt;% summarize(lat = max(lat), long = max(long)), by = &amp;quot;remote&amp;quot;) %&amp;gt;%
  filter(feb &amp;gt; 1000) %&amp;gt;% 
  ungroup()&lt;/code&gt;&lt;/pre&gt;
&lt;div id=&#34;creating-the-maps-with-ggmap&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Creating the Maps with ggmap&lt;/h3&gt;
&lt;p&gt;Since the overall trends suggest a large decline in ridership across the entire city, I wanted to create new breakpoints to distinguish areas with larger declines from areas with smaller ones. To do this I used the &lt;code&gt;classInt::classIntervals()&lt;/code&gt; function with the &lt;code&gt;fisher&lt;/code&gt; style to algorithmically find the breakpoints in the data. The &lt;code&gt;cut_format&lt;/code&gt; function will format the break labels as percentages rather than decimals.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;brks &amp;lt;- classInt::classIntervals(comb_geo$rel_change, n = 5, style = &amp;#39;fisher&amp;#39;)

comb_geo$grp_val = kimisc::cut_format(comb_geo$rel_change, 
                                      brks$brks, 
                                      include.lowest = T,  
                                      format_fun = scales::percent)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To create the static map using &lt;code&gt;ggmap&lt;/code&gt; I first need to create the base layer that the data will be plotted on. There are many ways to do this, but I chose to define a boundary box using Lats and Longs from &lt;a href=&#34;https://www1.nyc.gov/assets/planning/download/pdf/data-maps/open-data/nybb_metadata.pdf?ver=18c&#34;&gt;NYC.gov&lt;/a&gt;. The zoom option controls how many tiles should be used in the boundary box: the larger the number, the more tiles are used and the more zoomed in you are.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;nyc &amp;lt;-get_map(c(
  left = -74.1,
  right = -73.699215,
  top = 40.915568,
  bottom = 40.55
),  zoom = 11, source = &amp;#39;osm&amp;#39;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Since there are 7 different fare types to look at, I’m breaking the maps into two sets: the unlimited cards and everything else. The &lt;code&gt;element_markdown()&lt;/code&gt; in the &lt;code&gt;theme()&lt;/code&gt; block is from &lt;code&gt;ggtext&lt;/code&gt; and allows certain HTML tags to format text in ggplots.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggmap(nyc, 
      base_layer = ggplot(comb_geo %&amp;gt;%
                            filter(str_detect(fare_type, &amp;quot;Unlimited|Annual&amp;quot;)), 
                          aes(x = long, y = lat, color = grp_val))) +
  geom_point() + 
  labs(
    title = &amp;quot;NYC Ridership Decline by &amp;lt;b&amp;gt;&amp;lt;i style=&amp;#39;color:#0039A6&amp;#39;&amp;gt;Unlimited Fare Types&amp;lt;/i&amp;gt;&amp;lt;/b&amp;gt;&amp;quot;,
    color = &amp;quot;% Ridership Decline (Feb vs. Apr)&amp;quot;,
    x = &amp;quot;&amp;quot;, y = &amp;quot;&amp;quot;) +
  facet_wrap(~fare_type, nrow = 1) +
  guides(color=guide_legend(nrow=2,byrow=TRUE)) +
  theme(legend.position = &amp;#39;bottom&amp;#39;,
        axis.text = element_blank(),
        axis.ticks = element_blank(),
        plot.title = element_markdown())&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/post/2020-09-07-covid-19s-impact-on-the-nyc-subway-system/index_files/figure-html/unlimited_map-1.png&#34; width=&#34;672&#34; /&gt;
Based on the unlimited cards’ decline by subway station, it’s clear that there ARE regional differences in how much COVID has affected usage. The 30-Day Unlimited card shows the largest declines in Manhattan and the parts of Brooklyn and Queens nearest to Manhattan. Meanwhile, the outer parts of Brooklyn, the Bronx, and Spanish Harlem show lower levels of decline. This is consistent with areas of lower socioeconomic status still needing to take the subway due to a higher likelihood of jobs that cannot be done from home.&lt;/p&gt;
&lt;p&gt;On the whole, the different types of unlimited cards show similar patterns, although the 7-Day Unlimited has more areas outside the largest decline bucket.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggmap(nyc, 
      base_layer = ggplot(comb_geo %&amp;gt;%
                            filter(!str_detect(fare_type, &amp;quot;Unlimited|Annual|Other&amp;quot;)), 
                          aes(x = long, y = lat, 
                              color = grp_val))) +
  geom_point() + 
  labs(
    title = &amp;quot;NYC Ridership Decline by &amp;lt;b&amp;gt;&amp;lt;i style=&amp;#39;color:#0039A6&amp;#39;&amp;gt;Other Fare Types&amp;lt;/i&amp;gt;&amp;lt;/b&amp;gt;&amp;quot;,
    color = &amp;quot;% Ridership Decline (Feb vs. Apr)&amp;quot;,
    x = &amp;quot;&amp;quot;, y = &amp;quot;&amp;quot;) +
  facet_wrap(~fct_reorder(fare_type, -feb), nrow = 1) +
  guides(color=guide_legend(nrow=2,byrow=TRUE)) +
  theme(legend.position = &amp;#39;bottom&amp;#39;,
        axis.text = element_blank(),
        axis.ticks = element_blank(),
        plot.title = element_markdown()
        )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/post/2020-09-07-covid-19s-impact-on-the-nyc-subway-system/index_files/figure-html/other_map-1.png&#34; width=&#34;672&#34; /&gt;
The two largest contrasts in the non-unlimited groups are that Student cards are almost entirely in the largest decline bucket, which makes sense, as students were engaged in distance learning. Similarly, EasyPayXpress swipes are almost entirely in the largest decline bucket and almost entirely in Manhattan. This also makes sense, as this card type is potentially made up of commuters who don’t want to deal with constantly refilling a card but wouldn’t ride enough to justify an unlimited card. The closing of non-essential businesses and the rise of work-from-home is the likely cause.&lt;/p&gt;
&lt;p&gt;For the Full Fare cards, the only areas with the most severe declines are in “Core Manhattan,” while other areas show smaller declines, potentially because riders shifted from other fare types to Full Fare as their need to use the subway system lessened.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;creating-an-interactive-map-with-leaflet&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Creating an Interactive Map with Leaflet&lt;/h2&gt;
&lt;p&gt;While the above ggmaps are useful, it’s difficult to know exactly where the neighborhoods with the largest and smallest declines are. The above maps give a general idea, but an interactive map that allows the user to pan and zoom would yield greater insights. To create one I will use the &lt;code&gt;leaflet&lt;/code&gt; package, which serves as an interface to the JavaScript library of the same name.&lt;/p&gt;
&lt;p&gt;Since this map will only look at the overall declines, as opposed to the declines by fare type, I need to re-summarize the data and create new breaks based on the overall numbers. The &lt;em&gt;msg&lt;/em&gt; variable is created to provide a pop-up to &lt;code&gt;leaflet&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;map_prep &amp;lt;- comb_geo %&amp;gt;%
  group_by(remote, station, lat, long) %&amp;gt;% 
  summarize(feb = sum(feb),
            apr = sum(apr),
            .groups = &amp;#39;drop&amp;#39;
  ) %&amp;gt;% 
  mutate(rel_change = apr/feb - 1,    
         msg = paste(station, &amp;quot;has decreased&amp;quot;, scales::percent(rel_change, accuracy = .1),
                &amp;quot;pre-PAUSE to post-PAUSE from&amp;quot;, feb %&amp;gt;% scales::comma(), &amp;quot;to&amp;quot;,
                apr %&amp;gt;% scales::comma(), &amp;quot;fares.&amp;quot;)
  )

map_prep_breaks &amp;lt;- classInt::classIntervals(map_prep$rel_change, 
                                            n = 5, 
                                            style = &amp;#39;fisher&amp;#39;)

##Add in the Breaks
map_prep$grp_val = kimisc::cut_format(map_prep$rel_change, 
                                      map_prep_breaks$brks, 
                                      include.lowest = T,  
                                      format_fun = scales::percent
                                      )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;One thing I found difficult about &lt;code&gt;leaflet&lt;/code&gt; is that creating a color palette to go with my breaks required a function that maps the values to the colors. The &lt;code&gt;colorFactor&lt;/code&gt; function in leaflet associates a factor variable with a palette; in this case it takes the factor levels of &lt;em&gt;grp_val&lt;/em&gt; created above and maps them to colors from the “Set1” palette.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;factpal &amp;lt;- colorFactor(&amp;quot;Set1&amp;quot;, map_prep$grp_val)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Creating a basic map with &lt;code&gt;leaflet&lt;/code&gt; is fairly straightforward and the syntax is pretty user friendly. The main thing to know when interpreting the code is that the “~” character means the expression refers to a variable within the passed-in data (similar to how &lt;code&gt;aes()&lt;/code&gt; works in ggplot).&lt;/p&gt;
&lt;p&gt;This function call, while long, does the following:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Passes in my dataset &lt;code&gt;map_prep&lt;/code&gt; to the &lt;code&gt;leaflet()&lt;/code&gt; function&lt;/li&gt;
&lt;li&gt;Adds the background tiles from the CartoDB.Positron theme&lt;/li&gt;
&lt;li&gt;Adds circle markers for each observation in my data set using the lats/longs with a fixed radius of 250, no border (stroke), and using a fill color from our pre-defined palette with 100% opacity. The hover labels will be the station names and when clicked the &lt;em&gt;msg&lt;/em&gt; variable will be the pop-up.&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;Finally add a legend in the top-right corner with the pre-defined colors and breakpoints.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The use of the &lt;code&gt;widgetframe::frameWidget()&lt;/code&gt; was to get the map to load on the blog and was not necessary for use in RStudio.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(widgetframe)

ll_map &amp;lt;- leaflet(map_prep) %&amp;gt;%
  addProviderTiles(providers$CartoDB.Positron) %&amp;gt;% 
  addCircles(
    lng = ~long,
    lat = ~lat,
    radius = 250,
    #radius = 4,
    stroke = F,
    fill = T,
    color = ~factpal(grp_val),
    fillOpacity = 1,
    label = ~station,
    group = &amp;#39;stations&amp;#39;,
    popup = ~msg
    ) %&amp;gt;%
    addLegend(
      title = &amp;quot;% Change in Rides&amp;quot;,
      pal = factpal,
      values = ~grp_val,
      position = &amp;#39;topright&amp;#39;
    )

frameWidget(ll_map)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;div id=&#34;htmlwidget-1&#34; style=&#34;width:100%;height:480px;&#34; class=&#34;widgetframe html-widget&#34;&gt;&lt;/div&gt;
&lt;script type=&#34;application/json&#34; data-for=&#34;htmlwidget-1&#34;&gt;{&#34;x&#34;:{&#34;url&#34;:&#34;/post/2020-09-07-covid-19s-impact-on-the-nyc-subway-system/index_files/figure-html//widgets/widget_ll_map.html&#34;,&#34;options&#34;:{&#34;xdomain&#34;:&#34;*&#34;,&#34;allowfullscreen&#34;:false,&#34;lazyload&#34;:false}},&#34;evals&#34;:[],&#34;jsHooks&#34;:[]}&lt;/script&gt;
From this view, the regional differences in subway usage declines are very apparent. The ‘red’ circles representing the largest declines are clustered in “Core Manhattan,” which runs from Lower Manhattan up to around 59th Street. This is where the majority of the commuter swipes eliminated by PAUSE would have been. Then, as you move further from central Manhattan, the declines become less severe.&lt;/p&gt;
&lt;p&gt;The two callouts are the prevalence of purple dots in the Bronx and the orange “X” pattern in eastern Brooklyn (Brownsville, New Lots, East New York). According to &lt;a href=&#34;https://www1.nyc.gov/site/opportunity/poverty-in-nyc/data-tool.page&#34;&gt;New York City Government Poverty Measures&lt;/a&gt;, Bronx Community Districts 1-6 have the largest percent of population below the poverty line, followed by Brownsville and East New York. This matches the narrative that areas of lower socioeconomic status were less able to avoid the subway during the pandemic and therefore saw less severe declines in ridership than Lower and Midtown Manhattan.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;conclusions&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Conclusions&lt;/h2&gt;
&lt;p&gt;COVID-19 and the New York State PAUSE order have had a dramatic impact on ridership of the NYC Subway system. Overall ridership was down 92% between February and April as New York became the “COVID capital of the world.” The MTA’s detailed data on fare types at each station allows a more granular look into how the pandemic altered rider behavior: usage of Unlimited and Student cards decreased as people were more confined to their homes, and areas of lower socioeconomic status saw less severe changes in ridership compared to more affluent areas of the city.&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>What&#39;s the Difference Between Instagram and TikTok? Using Word Embeddings to Find Out</title>
      <link>https://jlaw.netlify.app/2020/08/02/what-s-the-difference-between-instagram-and-tiktok-using-word-embeddings-to-find-out/</link>
      <pubDate>Sun, 02 Aug 2020 00:00:00 +0000</pubDate>
      <guid>https://jlaw.netlify.app/2020/08/02/what-s-the-difference-between-instagram-and-tiktok-using-word-embeddings-to-find-out/</guid>
      <description>


&lt;div id=&#34;tldr&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;TL;DR&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;Instagram - Tiktok = Photos, Photographers and Selfies&lt;/li&gt;
&lt;li&gt;Tiktok - Instagram = Witchcraft and Teens&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;but read the whole post to find out why!&lt;/p&gt;
&lt;div id=&#34;purpose&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Purpose&lt;/h2&gt;
&lt;p&gt;The original intent of this post was to learn to train my own Word2Vec model; however, as is a running theme, my laptop is not great and training a neural network would never work. In looking for alternatives, I came across a 2017 post from &lt;a href=&#34;https://juliasilge.com/blog/word-vectors-take-two/&#34;&gt;Julia Silge&lt;/a&gt; which outlined how to create word embeddings using a combination of &lt;a href=&#34;https://en.wikipedia.org/wiki/Pointwise_mutual_information&#34;&gt;point-wise mutual information (PMI)&lt;/a&gt; and Singular Value Decomposition (SVD). This was based on a methodology from Chris Moody’s Stitch Fix post called &lt;a href=&#34;https://multithreaded.stitchfix.com/blog/2017/10/18/stop-using-word2vec/&#34;&gt;Stop Using word2vec&lt;/a&gt;. Ms. Silge’s methodology has since been updated as part of her book &lt;a href=&#34;https://smltar.com/embeddings.html#understand-word-embeddings-by-finding-them-yourself&#34;&gt;Supervised Machine Learning for Text Analysis in R&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Word embeddings are vector representations of words in a large number of dimensions that capture the context in which words are used. They have been used to show fancy examples of how you can do math with words. One of the best-known examples is &lt;code&gt;King - Man + Woman = Queen&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Since TikTok and Instagram are both popular social media apps, especially among teenagers, I figured it would be an interesting exercise to see if I could figure out &lt;code&gt;Tiktok - Instagram = ????&lt;/code&gt; and &lt;code&gt;Instagram - TikTok = ????&lt;/code&gt;.&lt;/p&gt;
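&lt;p&gt;As a quick illustration of that word math (a sketch that assumes a numeric matrix &lt;code&gt;word_vectors&lt;/code&gt; with one row per word, which is what the rest of this post works toward), the arithmetic plus a cosine-similarity lookup takes only a few lines of base R:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Hypothetical sketch; assumes word_vectors is a matrix with one row per word
target &amp;lt;- word_vectors[&amp;#39;king&amp;#39;, ] - word_vectors[&amp;#39;man&amp;#39;, ] + word_vectors[&amp;#39;woman&amp;#39;, ]

# Cosine similarity of the target vector against every row
sims &amp;lt;- (word_vectors %*% target) / 
  (sqrt(rowSums(word_vectors^2)) * sqrt(sum(target^2)))

# &amp;#39;queen&amp;#39; should show up near the top of the most-similar words
head(sort(sims[, 1], decreasing = TRUE))&lt;/code&gt;&lt;/pre&gt;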
&lt;/div&gt;
&lt;div id=&#34;getting-and-cleaning-the-data&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Getting and Cleaning the Data&lt;/h2&gt;
&lt;p&gt;In order to create these vector representations I need data. The example posts above use the Hacker News corpus, which is available on Google’s BigQuery. In quickly browsing that data it didn’t seem like there was enough to do something as targeted as Instagram vs. TikTok. So I decided to use Twitter data, both because I thought it would be a decent source of information and because it was a good excuse to try out the &lt;a href=&#34;https://github.com/ropensci/rtweet&#34;&gt;rtweet&lt;/a&gt; package.&lt;/p&gt;
&lt;p&gt;In addition to the &lt;code&gt;rtweet&lt;/code&gt; package, I’ll be using &lt;code&gt;tidyverse&lt;/code&gt; for data manipulations and plotting, &lt;code&gt;tidytext&lt;/code&gt; to create the word tokens, and &lt;code&gt;widyr&lt;/code&gt; in order to do the PMI, SVD, and Cosine Similarity calculations.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(rtweet) # To Extract Data from Twitter
library(tidyverse) # Data Manipulation and Plotting
library(tidytext) # To create the Word Tokens and Bigrams
library(widyr) #For doing PMI, SVD, and Similarity&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It turns out getting data from Twitter really couldn’t be easier with &lt;code&gt;rtweet&lt;/code&gt;. The &lt;code&gt;search_tweets()&lt;/code&gt; function is very straightforward and really all you need. In this case, I wanted to run two separate queries, one for “instagram” and one for “tiktok”, so I used &lt;code&gt;search_tweets2()&lt;/code&gt;, which allows you to pass a vector of queries rather than a single one. In the code below, the two queries are captured in the &lt;code&gt;q&lt;/code&gt; parameter (with additional filters to remove quote tweets, replies, and tweets with links). The &lt;code&gt;n&lt;/code&gt; parameter says that I want 50,000 tweets for each query. Additionally, I tell the query that I don’t want retweets, I want recent tweets (the package can only search the last 6-7 days), and I only want English-language tweets.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;tweets &amp;lt;- search_tweets2(
  q = c(&amp;quot;tiktok -filter:quote -filter:replies -filter:links&amp;quot;, 
        &amp;#39;instagram -filter:quote -filter:replies -filter:links&amp;#39;),
  n = 50000,
  include_rts = FALSE,
  retryonratelimit = TRUE,
  type = &amp;#39;recent&amp;#39;,
  lang = &amp;#39;en&amp;#39;
)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The query for this data was originally run on 7/21/2020 and returned 108,000 rows. Because the Twitter data contains many characters not typically considered words, the data was run through some cleaning steps, and duplicated tweets (ones that contained both “instagram” and “tiktok”) were deduped.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;cleaned_tweets &amp;lt;- tweets %&amp;gt;% 
  #Encode to Native
  transmute(status_id, text = plain_tweets(text)) %&amp;gt;% 
  #Remove Potential Dups
  distinct() %&amp;gt;% 
  #Clean Data
  mutate(
    text = str_remove_all(text, &amp;quot;^[[:space:]]*&amp;quot;), # Remove leading whitespaces
    text = str_remove_all(text, &amp;quot;[[:space:]]*$&amp;quot;), # Remove trailing whitespaces
    text = str_replace_all(text, &amp;quot; +&amp;quot;,&amp;quot; &amp;quot;), # Remove extra whitespaces
    text = str_replace_all(text, &amp;quot;&amp;#39;&amp;quot;, &amp;quot;%%&amp;quot;), # Replacing apostrophes with %%
    #text = iconv(text, &amp;quot;latin1&amp;quot;, &amp;quot;ASCII&amp;quot;, sub=&amp;quot;&amp;quot;), # Remove emojis/dodgy unicode
    text = str_remove_all(text, &amp;quot;&amp;lt;(.*)&amp;gt;&amp;quot;), # Remove pesky Unicodes like &amp;lt;U+A&amp;gt;
    text = str_replace_all(text, &amp;quot;\\ \\. &amp;quot;, &amp;quot; &amp;quot;), # Replace orphaned fullstops with space
    text = str_replace_all(text, &amp;quot;  &amp;quot;, &amp;quot; &amp;quot;), # Replace double space with single space
    text = str_replace_all(text, &amp;quot;%%&amp;quot;, &amp;quot;\&amp;#39;&amp;quot;), # Changing %% back to apostrophes
    text = str_remove_all(text, &amp;quot;https(.*)*$&amp;quot;), # remove tweet URL
    text = str_replace_all(text, &amp;quot;\\n&amp;quot;, &amp;quot; &amp;quot;), # replace line breaks with space
    text = str_replace_all(text, &amp;quot;&amp;amp;amp;&amp;quot;, &amp;quot;&amp;amp;&amp;quot;), # fix ampersand &amp;amp;,
    text = str_remove_all(text, &amp;#39;&amp;amp;lt;|&amp;amp;gt;&amp;#39;), 
    text = str_remove_all(text, &amp;#39;\\b\\d+\\b&amp;#39;) #Remove Numbers
  ) &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Finally, the data was tokenized to break apart the tweets into a tidy format of 1 row per word. For example, “The quick brown fox” will be broken into 4 rows, the first containing “the”, the second containing “quick” and so on. Besides tokenization, &lt;em&gt;stop words&lt;/em&gt; and infrequent words (&amp;lt;20 occurrences) were removed. Stop words are very common words like “the”, “it”, etc. that don’t add much meaning to the Tweets.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;tokens &amp;lt;- cleaned_tweets %&amp;gt;% 
  #Tokenize
  unnest_tokens(word, text) %&amp;gt;% 
  #Remove Stop Words
  anti_join(stop_words, by = &amp;quot;word&amp;quot;) %&amp;gt;% 
  #Remove All Words Occurring Less Than 20 Times
  add_count(word) %&amp;gt;%
  filter(n &amp;gt;= 20) %&amp;gt;%
  select(-n)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;creating-the-embeddings&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Creating the Embeddings&lt;/h2&gt;
&lt;p&gt;The way that word embeddings are able to capture the context of individual words is by looking at what words appear around the word of interest. Getting from Tokens to Embeddings is done in three steps:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Create the Sliding Window to capture words occurring together&lt;/li&gt;
&lt;li&gt;Calculate the point-wise mutual information to provide a measure of how likely two words are to appear together&lt;/li&gt;
&lt;li&gt;Use SVD to decompose the matrix of 4,586 words into some number of dimensions (in this case we’ll use 100).&lt;/li&gt;
&lt;/ol&gt;
&lt;div id=&#34;creating-the-sliding-windows&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Creating the Sliding Windows&lt;/h3&gt;
&lt;p&gt;Sliding windows over text are similar to a rolling average over numbers: at any point we’re looking at a subset of the data, and that subset moves over the entire data set.&lt;/p&gt;
&lt;p&gt;As a very simple example, the string “the quick brown fox jumps over the lazy dog” with a window size of four will generate six windows:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;window_id&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;words&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;1_1&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;the, quick, brown, fox&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;1_2&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;quick, brown, fox, jumps&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;1_3&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;brown, fox, jumps, over&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;1_4&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;fox, jumps, over, the&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;1_5&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;jumps, over, the, lazy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;1_6&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;over, the, lazy, dog&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The examples from the blog posts above use a combination of nesting and the &lt;code&gt;furrr&lt;/code&gt; package to create sliding windows in parallel. Since my laptop only has two cores, there isn’t much benefit to parallelizing this code. Fortunately, someone in the comments linked to this gist from &lt;a href=&#34;https://gist.github.com/JasonPunyon/3bca3bf606e7583c7ea2d8a00f86418e&#34;&gt;Jason Punyon&lt;/a&gt;, which works very quickly for me.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;slide_windows &amp;lt;- function(tbl, doc_var, window_size) {
  tbl %&amp;gt;%
    group_by({{doc_var}}) %&amp;gt;%
    mutate(WordId = row_number() - 1,
           RowCount = n()) %&amp;gt;%
    ungroup() %&amp;gt;%
    crossing(InWindowIndex = 0:(window_size-1)) %&amp;gt;%
    filter((WordId - InWindowIndex) &amp;gt;= 0, # starting position of a window must be after the beginning of the document
           (WordId - InWindowIndex + window_size - 1) &amp;lt; RowCount # ending position of a window must be before the end of the document
    ) %&amp;gt;%
    mutate(window_id = WordId - InWindowIndex + 1)
}&lt;/code&gt;&lt;/pre&gt;
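&lt;p&gt;As a quick sanity check (a sketch, assuming the input is already one row per word), this function reproduces the “quick brown fox” windows from the table above:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#A one-row-per-word tibble for a single toy &amp;quot;document&amp;quot;
toy &amp;lt;- tibble(
  status_id = 1,
  word = c(&amp;quot;the&amp;quot;, &amp;quot;quick&amp;quot;, &amp;quot;brown&amp;quot;, &amp;quot;fox&amp;quot;,
           &amp;quot;jumps&amp;quot;, &amp;quot;over&amp;quot;, &amp;quot;the&amp;quot;, &amp;quot;lazy&amp;quot;, &amp;quot;dog&amp;quot;)
)

#Nine words with a window size of four should yield six windows
toy %&amp;gt;%
  slide_windows(status_id, 4) %&amp;gt;%
  group_by(window_id) %&amp;gt;%
  summarise(words = paste(word, collapse = &amp;quot;, &amp;quot;))&lt;/code&gt;&lt;/pre&gt;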
&lt;p&gt;The one parameter that we need to choose when creating the windows is the window size. There is no right or wrong answer for a window size since it will depend on the question being asked. From Julia Silge’s post, “A smaller window size, like three or four, focuses on how the word is used and learns what other words are functionally similar. A larger window size, like ten, captures more information about the domain or topic of each word, not constrained by how functionally similar the words are (Levy and Goldberg 2014). A smaller window size is also faster to compute”. For this example, I’m choosing a size of &lt;strong&gt;eight&lt;/strong&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;point-wise-mutual-information&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Point-wise Mutual Information&lt;/h3&gt;
&lt;p&gt;Point-wise mutual information is an association measurement of how likely two words are to occur together, normalized by how likely each word is to be found on its own. The higher the PMI, the more likely the words are to be found close together rather than on their own.&lt;/p&gt;
&lt;p&gt;PMI(word1, word2) = log(P(word1, word2)/(P(word1)P(word2)))&lt;/p&gt;
&lt;p&gt;This can be calculated using the &lt;code&gt;pairwise_pmi()&lt;/code&gt; function from the &lt;code&gt;widyr&lt;/code&gt; package.&lt;/p&gt;
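&lt;p&gt;As a small sketch of what &lt;code&gt;pairwise_pmi()&lt;/code&gt; expects (toy data, not the real tweets): given a tidy table of words and the window each appears in, it returns a PMI value for every pair of words that share a window:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#Toy data: three windows with overlapping words
toy_windows &amp;lt;- tribble(
  ~window_id, ~word,
  1, &amp;quot;tiktok&amp;quot;,
  1, &amp;quot;dances&amp;quot;,
  2, &amp;quot;tiktok&amp;quot;,
  2, &amp;quot;dances&amp;quot;,
  3, &amp;quot;instagram&amp;quot;,
  3, &amp;quot;photos&amp;quot;
)

#Higher PMI = the pair co-occurs more than chance would suggest
toy_windows %&amp;gt;%
  pairwise_pmi(word, window_id)&lt;/code&gt;&lt;/pre&gt;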
&lt;/div&gt;
&lt;div id=&#34;singular-value-decomposition&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Singular Value Decomposition&lt;/h3&gt;
&lt;p&gt;This final step will turn our set of Word/Word PMI values into a 100-dimensional embedding for each word using Singular Value Decomposition, which is a technique for dimensionality reduction.&lt;/p&gt;
&lt;p&gt;This is calculated using the &lt;code&gt;widely_svd()&lt;/code&gt; function also from &lt;code&gt;widyr&lt;/code&gt;.&lt;/p&gt;
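&lt;p&gt;Under the hood this is ordinary SVD. As a minimal base-R sketch (with a random matrix standing in for the word-by-word PMI matrix), keeping only the first few singular vectors gives each row a lower-dimensional representation:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(42)
#Stand-in for a words x words PMI matrix (10 &amp;quot;words&amp;quot;)
pmi_matrix &amp;lt;- matrix(rnorm(100), nrow = 10)

#Keep the first 3 singular vectors; each row of u is now a
#3-dimensional &amp;quot;embedding&amp;quot; for that word
decomp &amp;lt;- svd(pmi_matrix, nu = 3, nv = 3)
dim(decomp$u)
## [1] 10  3&lt;/code&gt;&lt;/pre&gt;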
&lt;/div&gt;
&lt;div id=&#34;putting-it-all-together&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Putting it all together&lt;/h3&gt;
&lt;p&gt;Executing these three steps can be done in the following code:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;tidy_word_vectors &amp;lt;- tokens %&amp;gt;%  
  slide_windows(status_id, 8) %&amp;gt;% #Create Sliding Window of 8 Words (Step 1)
  unite(window_id, status_id, window_id) %&amp;gt;% #Create new ID for each window
  pairwise_pmi(word, window_id) %&amp;gt;%  #Calculate the PMI (Step 2)
  widely_svd(item1, item2, pmi, nv = 100, maxit = 1000) #Create 100 Dimension Embedding (Step 3)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The data at this point looks like:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 6 x 3
##   item1     dimension    value
##   &amp;lt;chr&amp;gt;         &amp;lt;int&amp;gt;    &amp;lt;dbl&amp;gt;
## 1 instagram         1  0.0273 
## 2 instagram         2 -0.0101 
## 3 instagram         3 -0.0729 
## 4 instagram         4  0.107  
## 5 instagram         5 -0.00237
## 6 instagram         6 -0.0844&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Where &lt;em&gt;item1&lt;/em&gt; represents the word, &lt;em&gt;dimension&lt;/em&gt; is each of our 100 dimensions for the word vector, and &lt;em&gt;value&lt;/em&gt; is the numeric value for that dimension for that word.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;the-fun-stuff&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;The Fun Stuff&lt;/h2&gt;
&lt;p&gt;Now that we have these embeddings, which again are 100-dimensional vectors representing each word, we can start doing analysis to hopefully find some interesting things.&lt;/p&gt;
&lt;div id=&#34;what-word-is-most-similar-to-instagram&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;What Word is Most Similar to Instagram?&lt;/h3&gt;
&lt;p&gt;To find the most similar words, we can use &lt;a href=&#34;https://en.wikipedia.org/wiki/Cosine_similarity&#34;&gt;cosine similarity&lt;/a&gt; to determine which vectors are most similar to our target words. Cosine similarity can be calculated using the &lt;code&gt;pairwise_similarity()&lt;/code&gt; function from the &lt;code&gt;widyr&lt;/code&gt; package.&lt;/p&gt;
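&lt;p&gt;For reference, the cosine similarity of two vectors is their dot product divided by the product of their magnitudes. A minimal sketch in base R:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;cosine_sim &amp;lt;- function(a, b){
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

#Vectors pointing the same direction score 1; orthogonal vectors score 0
cosine_sim(c(1, 2, 3), c(2, 4, 6))
## [1] 1
cosine_sim(c(1, 0), c(0, 1))
## [1] 0&lt;/code&gt;&lt;/pre&gt;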
&lt;p&gt;Let’s look at what’s most similar to “Instagram”:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#Get 10 Most Similar Words to Instagram
ig &amp;lt;- tidy_word_vectors %&amp;gt;% 
  pairwise_similarity(item1, dimension, value) %&amp;gt;%
  filter(item1 == &amp;#39;instagram&amp;#39;) %&amp;gt;% 
  arrange(desc(similarity)) %&amp;gt;% 
  head(10)

#Plot most similar words
ig %&amp;gt;%
  ggplot(aes(x = fct_reorder(item2, similarity), y = similarity, fill = item2)) + 
    geom_col() + 
    scale_fill_discrete(guide = F) +
    labs(x = &amp;quot;&amp;quot;, y = &amp;quot;Similarity Score&amp;quot;, 
         title = &amp;quot;Words Most Similar to &amp;lt;i style=&amp;#39;color:#833AB4&amp;#39;&amp;gt;Instagram&amp;lt;/i&amp;gt;&amp;quot;) + 
    coord_flip() + 
    hrbrthemes::theme_ipsum_rc(grid=&amp;quot;X&amp;quot;) + 
    theme(
      plot.title.position = &amp;quot;plot&amp;quot;,
      plot.title = ggtext::element_markdown()
    )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/post/20200802_ig_vs_tiktok/index_files/figure-html/insta_similar-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Looking at what words are most similar can provide a good gut check for whether things are working. Among the top words are “post(s)”, “dms”, “celebrities” which all seem to make sense in the context of Instagram. Admittedly, I got a chuckle about “Instagram hoes”, but that does have its own &lt;a href=&#34;https://www.urbandictionary.com/define.php?term=Instagram%20Hoe&#34;&gt;Urban Dictionary&lt;/a&gt; definition so I suppose it’s legit.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;what-word-is-most-similar-to-tiktok&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;What Word is Most Similar to TikTok?&lt;/h3&gt;
&lt;p&gt;We can do the same calculation with ‘tiktok’ as opposed to ‘instagram’:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;tt &amp;lt;- tidy_word_vectors %&amp;gt;% 
  pairwise_similarity(item1, dimension, value) %&amp;gt;%
  filter(item1 == &amp;#39;tiktok&amp;#39;) %&amp;gt;% 
  arrange(desc(similarity)) %&amp;gt;% 
  head(10)

tt %&amp;gt;%
  ggplot(aes(x = fct_reorder(item2, similarity), y = similarity, fill = item2)) + 
  geom_col() + 
  scale_fill_discrete(guide = F) +
  labs(x = &amp;quot;&amp;quot;, y = &amp;quot;Similarity Score&amp;quot;, 
       title = &amp;quot;Words Most Similar to &amp;lt;i style=&amp;#39;color:#69C9D0&amp;#39;&amp;gt;TikTok&amp;lt;/i&amp;gt;&amp;quot;) + 
  coord_flip() + 
  hrbrthemes::theme_ipsum_rc(grid=&amp;quot;X&amp;quot;) + 
  theme(
    plot.title.position = &amp;quot;plot&amp;quot;,
    plot.title = ggtext::element_markdown()
  )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/post/20200802_ig_vs_tiktok/index_files/figure-html/tiktok_similar-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Now admittedly, I’m less familiar with TikTok than I am with Instagram, but from what I do know (and what I can Google), these make a lot of sense. The word most similar to TikTok is “dances”, and I do know that TikTok is known for its viral dances. Some of the other terms I needed to look up, but they seem legit. For example, “Straight TikTok” is used to refer to more mainstream TikTok vs. “Alt TikTok”, and “fyp” is the “For You Page” (I don’t actually know what this is, but I know it’s something TikTok-y). So again, I feel pretty good about these results.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;what-is-the-difference-between-tiktok-and-instagram&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;What is the Difference between TikTok and Instagram?&lt;/h3&gt;
&lt;p&gt;As mentioned at the start, the goal of this post is to create &lt;code&gt;Instagram - Tiktok = ?&lt;/code&gt; and &lt;code&gt;Tiktok - Instagram = ?&lt;/code&gt;, similar to the &lt;code&gt;king - man + woman = queen&lt;/code&gt; example often referenced in Word2Vec (or other embedding) posts.&lt;/p&gt;
&lt;p&gt;Since both TikTok and Instagram are now represented by 100-dimensional numeric vectors doing the subtraction is as simple as doing a pairwise subtraction on each dimension. Since our data is in a tidy format it takes a little bit of data wrangling to pull that off, but ultimately we’re going to grab the Top 10 Closest words to (Instagram-TikTok) and (TikTok-Instagram) by treating these resulting vectors as fake “words” and adding them to the data set before calculating the cosine similarity.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;tt_ig_diff &amp;lt;- tidy_word_vectors %&amp;gt;% 
  #Calculate TikTok - Instagram
  filter(item1 %in% c(&amp;#39;tiktok&amp;#39;, &amp;#39;instagram&amp;#39;)) %&amp;gt;% 
  pivot_wider(names_from = &amp;quot;item1&amp;quot;, values_from = &amp;quot;value&amp;quot;) %&amp;gt;% 
  transmute(
    item1 = &amp;#39;tiktok_minus_ig&amp;#39;,
    dimension,
    value = tiktok - instagram
  ) %&amp;gt;% 
  bind_rows(
    #Calculate Instagram - TikTok
    tidy_word_vectors %&amp;gt;% 
    filter(item1 %in% c(&amp;#39;tiktok&amp;#39;, &amp;#39;instagram&amp;#39;)) %&amp;gt;% 
    pivot_wider(names_from = &amp;quot;item1&amp;quot;, values_from = &amp;quot;value&amp;quot;) %&amp;gt;% 
    transmute(
      item1 = &amp;#39;ig_minus_tiktok&amp;#39;,
      dimension,
      value = instagram - tiktok
    )
  ) %&amp;gt;% 
  #Add in the rest of the individual words
  bind_rows(tidy_word_vectors) %&amp;gt;% 
  #Calculate Cosine Similarity on All Words
  pairwise_similarity(item1, dimension, value) %&amp;gt;% 
  #Keep just the similarities to the two &amp;quot;fake words&amp;quot;
  filter(item1 %in% c(&amp;#39;tiktok_minus_ig&amp;#39;, &amp;#39;ig_minus_tiktok&amp;#39;)) %&amp;gt;% 
  #Grab top 10 most similar values for each &amp;quot;fake word&amp;quot;
  group_by(item1) %&amp;gt;% 
  top_n(10, wt = similarity) 

#Plotting the Top 10 Words by Similarity
tt_ig_diff %&amp;gt;%
  mutate(item1 = if_else(
    item1 == &amp;quot;ig_minus_tiktok&amp;quot;, &amp;quot;Instagram - TikTok = &amp;quot;, &amp;quot;Tiktok - Instagram = &amp;quot;
  )) %&amp;gt;% 
  ggplot(aes(x = reorder_within(item2, by = similarity, within = item1), 
             y = similarity, fill = item2)) + 
  geom_col() + 
  scale_fill_discrete(guide = F) +
  scale_x_reordered() + 
  labs(x = &amp;quot;&amp;quot;, y = &amp;quot;Similarity Score&amp;quot;, 
       title = &amp;quot;What&amp;#39;s the Difference between &amp;lt;i style=&amp;#39;color:#833AB4&amp;#39;&amp;gt;Instagram&amp;lt;/i&amp;gt; 
       and  &amp;lt;i style=&amp;#39;color:#69C9D0&amp;#39;&amp;gt;TikTok&amp;lt;/i&amp;gt;&amp;quot;) + 
  facet_wrap(~item1, scales = &amp;quot;free_y&amp;quot;) + 
  coord_flip() + 
  hrbrthemes::theme_ipsum_rc(grid=&amp;quot;X&amp;quot;) + 
  theme(
    plot.title.position = &amp;quot;plot&amp;quot;,
    plot.title = ggtext::element_markdown(),
    strip.text = ggtext::element_textbox(
      size = 12,
      color = &amp;quot;white&amp;quot;, fill = &amp;quot;#5D729D&amp;quot;, box.color = &amp;quot;#4A618C&amp;quot;,
      halign = 0.5, linetype = 1, r = unit(5, &amp;quot;pt&amp;quot;), width = unit(1, &amp;quot;npc&amp;quot;),
      padding = margin(2, 0, 1, 0), margin = margin(3, 3, 3, 3)
    )
  )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/post/20200802_ig_vs_tiktok/index_files/figure-html/diff_similarity-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;A couple of things jump out from the results. First, the vectors for TikTok and Instagram aren’t similar enough to each other to cancel out, so “TikTok” and “Instagram” themselves still show up as the most similar values. This is likely because of the data collection methodology of using TikTok and Instagram as search terms on Twitter. As a result, there is a bit of overlap between the “Most Similar Words to X” and the “Most Similar Words to X - Y”.&lt;/p&gt;
&lt;p&gt;However, once you get past the overlaps there are some interesting findings. For &lt;code&gt;Instagram - TikTok&lt;/code&gt; you get “Selfies, Photo(s), Photographer”, which makes a ton of sense since Instagram is primarily a photo app while TikTok is entirely a video app.&lt;/p&gt;
&lt;p&gt;For &lt;code&gt;Tiktok - Instagram&lt;/code&gt;, there still is a lot of overlap with just &lt;code&gt;TikTok&lt;/code&gt;, but among the new items there’s a bunch of witchcraft terms (coven, witchtok). According to Wired UK, &lt;a href=&#34;https://www.wired.co.uk/article/witchcraft-tiktok&#34;&gt;TikTok has become the home of modern witchcraft&lt;/a&gt;, so that seems to track. Also, “teens” surfaces as a difference between TikTok and Instagram, reflecting the app’s popularity among US teenagers.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;concluding-thoughts&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Concluding Thoughts&lt;/h2&gt;
&lt;p&gt;I wanted to get involved with Word Embeddings through Word2Vec but I don’t have the hardware for it. Luckily, resources on the internet provided a way to do this with tools that don’t require a neural network. By grabbing data from Twitter it was easy to create word embeddings and to try to understand the differences between TikTok and Instagram. In practice it would have been good to have more than 100,000 Tweets, and I wish there were a way to capture word context more in the wild than through specific search terms. But in the end, I’m happy with the results.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>A Racing Barplot of Top US Baby Names 1880-2018</title>
      <link>https://jlaw.netlify.app/2020/07/04/a-racing-barplot-of-top-us-baby-names-1880-2018/</link>
      <pubDate>Sat, 04 Jul 2020 00:00:00 +0000</pubDate>
      <guid>https://jlaw.netlify.app/2020/07/04/a-racing-barplot-of-top-us-baby-names-1880-2018/</guid>
      <description>


&lt;p&gt;A few months back Mrs. JLaw and I were discussing baby names (purely for academic purposes) and it got me thinking about how popular names have changed over time. It was of particular interest to me as someone who has a name that was somewhat popular for a while and has since fallen out of fashion.&lt;/p&gt;
&lt;p&gt;This also provided me an opportunity to try out one of those ‘racing barplots’ that have been popping up all over the place. Also, while I’ve used the &lt;a href=&#34;https://gganimate.com/articles/gganimate.html&#34;&gt;gganimate&lt;/a&gt; package a number of times, I constantly forget the syntax. And since this site is as much for me (probably more so) as for anyone else, this will be a good reference in case I try to do this again.&lt;/p&gt;
&lt;p&gt;On to the project….&lt;/p&gt;
&lt;p&gt;Fortunately, I know that baby name data is easily available at the &lt;a href=&#34;https://www.ssa.gov/oact/babynames/index.html&#34;&gt;Social Security Administration&lt;/a&gt; website. And while I don’t remember how I found the flat files, all years are available as a &lt;a href=&#34;https://www.ssa.gov/oact/babynames/names.zip&#34;&gt;ZIP file&lt;/a&gt; containing 139 .txt files with popular boys’ and girls’ names for each year. However, I don’t really want to deal with downloading and unzipping files, so I’m going to try to query the SSA site directly.&lt;/p&gt;
&lt;div id=&#34;loading-some-libraries&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Loading Some Libraries&lt;/h2&gt;
&lt;p&gt;To do this project, I’ll use:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;httr&lt;/code&gt; - To construct the POST command to get the SSA to return a webpage with the data I want&lt;/li&gt;
&lt;li&gt;&lt;code&gt;rvest&lt;/code&gt; - To scrape the table of popular name data from the content returned from the &lt;code&gt;httr&lt;/code&gt; request&lt;/li&gt;
&lt;li&gt;&lt;code&gt;tidyverse&lt;/code&gt; meta-package - for combining the data from each request (purrr), data manipulation (dplyr), and visualization (ggplot2)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;gganimate&lt;/code&gt; - to animate the ggplot2 plots and make them look super cool&lt;/li&gt;
&lt;li&gt;&lt;code&gt;scales&lt;/code&gt; - To make the count of baby names in the chart appear prettier (comma-formatted)&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidyverse)
library(gganimate)
library(scales)
library(httr)
library(rvest)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;reading-the-data&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Reading the Data&lt;/h2&gt;
&lt;p&gt;As mentioned before, the data is available as a series of .txt files from the &lt;a href=&#34;https://www.ssa.gov/oact/babynames/names.zip&#34;&gt;SSA&lt;/a&gt;. When I originally did this, I downloaded and extracted the ZIP file, but as I’m redoing this for the post, I’d rather have a solution that is entirely self-contained so I’m going to try to use &lt;code&gt;httr&lt;/code&gt; to actually query the SSA data.&lt;/p&gt;
&lt;p&gt;So how to actually get the data from the website?&lt;/p&gt;
&lt;p&gt;From the &lt;a href=&#34;https://www.ssa.gov/oact/babynames/index.html&#34;&gt;Baby Names By Birth Year&lt;/a&gt; section, I can input the birth year, how many names I want, and whether I want counts or percentages.&lt;/p&gt;
&lt;center&gt;
&lt;img src=&#34;https://jlaw.netlify.app/post/2020-07-04-a-racing-barplot-of-top-us-baby-names-1880-2018.en_files/webscrape1.PNG&#34; /&gt;
&lt;/center&gt;
&lt;p&gt;When I click go, I wind up at &lt;code&gt;https://www.ssa.gov/cgi-bin/popularnames.cgi&lt;/code&gt; with my desired results in a table. Using Google Chrome’s Network Inspector I can see that I sent a &lt;code&gt;POST&lt;/code&gt; request with three parameters (year, top, and number):&lt;/p&gt;
&lt;center&gt;
&lt;img src=&#34;https://jlaw.netlify.app/post/2020-07-04-a-racing-barplot-of-top-us-baby-names-1880-2018.en_files/webscrape2.PNG&#34; /&gt;
&lt;/center&gt;
&lt;p&gt;Now that I know what I need to send, I can write a function to request each year and stack the responses on top of each other using purrr’s &lt;code&gt;map_dfr&lt;/code&gt;. For the inputs, I know that I want all available years (which are 1880 through 2018), I only need the top 10 (so top = 10), and I want counts rather than percentages (number = “n”).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;babynames &amp;lt;- map_dfr(
  1880:2018, #Inputs to My Function
  #Define Function to Apply for Each Year
  function(year){
    #Construct POST Command
    POST(
      #Where to Send the Request
      url = &amp;quot;https://www.ssa.gov/cgi-bin/popularnames.cgi&amp;quot;,
      #What to Send the Requests (my three parameters)
      body = paste0(&amp;quot;year=&amp;quot;,year,&amp;quot;&amp;amp;top=10&amp;amp;number=n&amp;quot;)
    ) %&amp;gt;%
    #Extract the Content from the Request Response
    content(&amp;quot;parsed&amp;quot;) %&amp;gt;% 
    #Extract All The Tables
    html_nodes(&amp;#39;table&amp;#39;) %&amp;gt;%
    #Only Keep the 3rd Table (done through some guess and check)
    .[[3]] %&amp;gt;% 
    #Store the Table Data as a data.frame
    html_table() %&amp;gt;%
    #Add a column to the data frame for year
    mutate(
      year = year
    )
  }
)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;My expectation for this data is that there would be 139 distinct values for year and 1390 rows in the data. And in fact there are 139 distinct years (😍) and 1529 rows (😡).&lt;/p&gt;
&lt;p&gt;So what’s going on… Let’s look at the year 1880.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;Rank&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;Male name&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;Number of males&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;Female name&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;Number of females&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;year&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;John&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;9,655&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Mary&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;7,065&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1880&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;William&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;9,532&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Anna&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2,604&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1880&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;3&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;James&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;5,927&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Emma&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2,003&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1880&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;4&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Charles&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;5,348&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Elizabeth&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;1,939&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1880&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;5&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;George&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;5,126&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Minnie&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;1,746&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1880&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;6&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Frank&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;3,242&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Margaret&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;1,578&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1880&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;7&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Joseph&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2,632&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Ida&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;1,472&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1880&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;8&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Thomas&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2,534&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Alice&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;1,414&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1880&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;9&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Henry&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2,444&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Bertha&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;1,320&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1880&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;10&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Robert&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2,415&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Sarah&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;1,288&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1880&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Note: Rank 1 is the most popular, rank 2 is the next most popular, and so forth. All names are from Social Security card applications for births that occurred in the United States.&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;
&lt;div id=&#34;cleaning-the-data&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Cleaning the Data&lt;/h2&gt;
&lt;p&gt;We expected 10 rows but got 11 because of a footnote at the bottom of the table. I could go back and fix the data-pulling step to explicitly grab only the top 10 rows, but there are a bunch of other cleaning steps to do, so I may as well do everything at once. In this step I’m going to:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Remove that pesky footer row&lt;/li&gt;
&lt;li&gt;Turn the Table from Wide Format to Long Format (so genders are on top of each other)&lt;/li&gt;
&lt;li&gt;Convert the Counts to Numeric&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;babynames_clean &amp;lt;- babynames %&amp;gt;% 
  #Remove the Note row by filtering out rows where the Rank column contains the string &amp;quot;Note&amp;quot;
  filter(!str_detect(Rank, &amp;quot;Note&amp;quot;)) %&amp;gt;%
  #Turn Data from Wide Format to Long Format 
  pivot_longer(
    cols = c(&amp;quot;Male name&amp;quot;, &amp;quot;Female name&amp;quot;, &amp;quot;Number of males&amp;quot;, &amp;quot;Number of females&amp;quot;),
    names_to = &amp;quot;variable&amp;quot;,
    values_to = &amp;quot;value&amp;quot;
  ) %&amp;gt;% 
  #Construct a way to split the Names and Counts
  mutate(
    gender = if_else(str_detect(str_to_lower(variable), &amp;#39;female&amp;#39;), &amp;#39;F&amp;#39;, &amp;#39;M&amp;#39;),
    new_variable = if_else(str_detect(variable, &amp;quot;name&amp;quot;), &amp;quot;name&amp;quot;, &amp;quot;count&amp;quot;)
  ) %&amp;gt;% 
  #Pivot Wider to Have Names and Counts in Separate Columns
  pivot_wider(
    id_cols = c(&amp;#39;Rank&amp;#39;, &amp;#39;year&amp;#39;, &amp;#39;gender&amp;#39;), 
    names_from = &amp;quot;new_variable&amp;quot;,
    values_from = &amp;quot;value&amp;quot;
  ) %&amp;gt;% 
  #Convert Count to Numeric
  mutate(
    count = parse_number(count),
    Rank = parse_number(Rank)
  )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now let’s look at our cleaned data for year 1880:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;right&#34;&gt;Rank&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;year&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;gender&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;name&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;count&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1880&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;M&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;John&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;9655&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1880&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;F&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Mary&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;7065&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;right&#34;&gt;2&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1880&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;M&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;William&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;9532&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;right&#34;&gt;2&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1880&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;F&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Anna&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2604&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;right&#34;&gt;3&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1880&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;M&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;James&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;5927&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;right&#34;&gt;3&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1880&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;F&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Emma&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2003&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;right&#34;&gt;4&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1880&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;M&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Charles&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;5348&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;right&#34;&gt;4&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1880&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;F&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Elizabeth&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1939&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;right&#34;&gt;5&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1880&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;M&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;George&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;5126&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;right&#34;&gt;5&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1880&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;F&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Minnie&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1746&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Beautiful!!!&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;making-the-barplot&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Making The Barplot&lt;/h2&gt;
&lt;p&gt;Now that we’ve gotten and cleaned the data, the real fun can begin.&lt;/p&gt;
&lt;p&gt;My personal strategy for building animated ggplots is to first build the static version of the plot (in this case, filtering to one year). Once that looks good, I add in the gganimate magic like &lt;code&gt;transition_time()&lt;/code&gt; and &lt;code&gt;ease_aes()&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;While you can generate an animated plot by running the code interactively, I find it easiest to save the plot object and then render it with the &lt;code&gt;animate()&lt;/code&gt; function. This gives more control over how the animation plays, such as its duration and frames per second.&lt;/p&gt;
&lt;p&gt;Because my laptop isn’t particularly great, nailing down the aesthetics so the animation looks good (not too fast, not too slow) is the most time-consuming part.&lt;/p&gt;
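&lt;p&gt;As a minimal sketch of that workflow (assuming the &lt;code&gt;babynames_clean&lt;/code&gt; data from above is loaded, and leaving out all the labeling and theming), build the static plot first, then bolt on the animation layers and render with &lt;code&gt;animate()&lt;/code&gt; to control the pacing:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(ggplot2)
library(gganimate)

#Step 1: a static racing-bar skeleton (each year&amp;#39;s data becomes one frame)
p &amp;lt;- ggplot(babynames_clean, aes(x = Rank, y = count, group = name)) +
  geom_col() +
  scale_x_reverse() +
  coord_flip()

#Step 2: layer on the gganimate pieces
anim &amp;lt;- p +
  transition_time(year) +
  ease_aes(&amp;#39;cubic-in-out&amp;#39;)

#Step 3: render with explicit pacing controls
animate(anim, fps = 10, duration = 30, width = 1000, height = 600)&lt;/code&gt;&lt;/pre&gt;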
&lt;div id=&#34;creating-a-generic-function&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Creating a generic function&lt;/h3&gt;
&lt;p&gt;Since I’m creating two charts, for baby boys and baby girls, that will be identical except for some labeling, I’m going to write a function that builds the animated chart and then call it once for each gender in the sections below.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#Input a gender code (&amp;quot;M&amp;quot; or &amp;quot;F&amp;quot;) and render the animated chart
gen_graph &amp;lt;- function(cond){
  
  #Use stereotypical gender colors for the two graphs
  if(cond == &amp;quot;F&amp;quot;){
    lbl = &amp;quot;Girl&amp;quot;
    col = &amp;quot;#FFC0CB&amp;quot;
  }else{
    lbl = &amp;quot;Boy&amp;quot;
    col = &amp;quot;#89cff0&amp;quot;
  }
  
  #Construct Animated Object
  animated &amp;lt;- babynames_clean %&amp;gt;% 
    #Filter to specific gender
    filter(gender == cond) %&amp;gt;%
    # Construct Basic GGPLOT Plot
    ggplot(aes(x = Rank, y = count/2, group = name)) + 
    geom_col(fill = col) + 
    geom_text(aes(label = count %&amp;gt;% comma(accuracy = 1)), hjust = 0, size = 10) + 
    geom_text(aes(label = name), y = 0, vjust = .2, hjust = 1, size = 10) +
    labs(x = paste0(lbl,&amp;quot;&amp;#39;s Name&amp;quot;), y = &amp;quot;# of Babies&amp;quot;,
         title = paste0(&amp;quot;Top 10 &amp;quot;, lbl, &amp;quot;&amp;#39;s Baby Names (1880-2018)&amp;quot;),
         #{frame_time} is a gganimate param that updates based on the time value
         #Its used to dynamically update the subtitle
         subtitle = &amp;#39;{round(frame_time,0)}&amp;#39;,
         caption = &amp;#39;Source: Social Security Administration&amp;#39;) + 
    scale_x_reverse() + 
    coord_flip(clip = &amp;#39;off&amp;#39;) + 
    theme_minimal() +
    theme(axis.line=element_blank(),
          axis.text.x=element_blank(),
          axis.text.y=element_blank(),
          axis.ticks=element_blank(),
          axis.title.x=element_blank(),
          axis.title.y=element_blank(),
          legend.position=&amp;quot;none&amp;quot;,
          panel.background=element_blank(),
          panel.border=element_blank(),
          panel.grid.major=element_blank(),
          panel.grid.minor=element_blank(),
          panel.grid.major.x = element_line(size=.4, 
                                            color=&amp;quot;grey&amp;quot; ),
          panel.grid.minor.x = element_line(size=.1, 
                                            color=&amp;quot;grey&amp;quot; ),
          plot.title.position = &amp;quot;plot&amp;quot;,
          plot.title=element_text(size=20, 
                                  face=&amp;quot;bold&amp;quot;, 
                                  colour=&amp;quot;#313632&amp;quot;),
          plot.subtitle=element_text(size=50, 
                                     color=&amp;quot;#a3a5a8&amp;quot;),
          plot.caption =element_text(size=15, 
                                     color=&amp;quot;#313632&amp;quot;),
          plot.background=element_blank(),
          plot.margin = margin(1, 9, 1, 9, &amp;quot;cm&amp;quot;)) + 
    #Add in GGANIMATE Magic
    transition_time(year) + 
    ease_aes(&amp;#39;cubic-in-out&amp;#39;) +
    view_follow(fixed_x = T)

  animate(animated, fps = 10, duration = 30, width = 1000, height = 600, 
          end_pause = 20, start_pause = 20)
    
}&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;most-popular-boys-names&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Most Popular Boy’s Names&lt;/h3&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;gen_graph(&amp;quot;M&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/post/2020-07-04-a-racing-barplot-of-top-us-baby-names-1880-2018.en_files/figure-html/boys_plot-1.gif&#34; /&gt;&lt;!-- --&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;most-popular-boys-names-1&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Most Popular Girl’s Names&lt;/h3&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;gen_graph(&amp;quot;F&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/post/2020-07-04-a-racing-barplot-of-top-us-baby-names-1880-2018.en_files/figure-html/girls_plot-1.gif&#34; /&gt;&lt;!-- --&gt;&lt;/p&gt;
&lt;p&gt;Thanks for reading my first blog post! In the future, I’ll work to get the sizing of the output charts to work better but for now… good &amp;gt; perfect.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
</description>
    </item>
    
  </channel>
</rss>
