<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>R | JLaw&#39;s R Blog</title>
    <link>https://jlaw.netlify.app/category/r/</link>
      <atom:link href="https://jlaw.netlify.app/category/r/index.xml" rel="self" type="application/rss+xml" />
    <description>R</description>
    <generator>Source Themes Academic (https://sourcethemes.com/academic/)</generator><language>en-us</language><copyright>© JLaw 2023</copyright><lastBuildDate>Mon, 04 Dec 2023 00:00:00 +0000</lastBuildDate>
    <image>
      <url>https://jlaw.netlify.app/images/icon_hu760fc316ca52fa0336457759fdc853c8_70171_512x512_fill_lanczos_center_2.png</url>
      <title>R</title>
      <link>https://jlaw.netlify.app/category/r/</link>
    </image>
    
    <item>
      <title>Are Birth Dates Still Destiny for Canadian NHL Players?</title>
      <link>https://jlaw.netlify.app/2023/12/04/are-birth-dates-still-destiny-for-canadian-nhl-players/</link>
      <pubDate>Mon, 04 Dec 2023 00:00:00 +0000</pubDate>
      <guid>https://jlaw.netlify.app/2023/12/04/are-birth-dates-still-destiny-for-canadian-nhl-players/</guid>
      <description>


&lt;p&gt;In the first chapter of &lt;a href=&#34;https://www.amazon.com/Outliers-Story-Success-Malcolm-Gladwell/dp/0316017930&#34;&gt;Malcolm Gladwell’s Outliers&lt;/a&gt;, he discusses how Canadian junior hockey players are disproportionately likely to be born in the first quarter of the year. In his words:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;gladwell.PNG&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Because these kids are older within their year, they make all the important teams at a young age, which gets them better resources for skill development and so on.&lt;/p&gt;
&lt;p&gt;While it seems clear that more players are born in the first few months of the year, what isn’t explored is whether this would be expected anyway. Maybe more people in Canada &lt;strong&gt;in general&lt;/strong&gt; are born earlier in the year.&lt;/p&gt;
&lt;p&gt;I will explore whether Gladwell’s result is expected as well as whether this is still true in today’s NHL for Canadian-born players.&lt;/p&gt;
&lt;p&gt;To answer these questions I will download data on birth rates from Statistics Canada as well as player roster data from the NHL’s API.&lt;/p&gt;
&lt;p&gt;This analysis will leverage the &lt;code&gt;httr&lt;/code&gt; package to download the data, &lt;code&gt;tidyverse&lt;/code&gt; for data manipulation, and &lt;code&gt;ggtext&lt;/code&gt;/&lt;code&gt;ggimage&lt;/code&gt;/&lt;code&gt;scales&lt;/code&gt; for visualization.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidyverse)
library(httr)
library(scales)
library(ggimage)
library(ggtext)&lt;/code&gt;&lt;/pre&gt;
&lt;div id=&#34;section-1-what-is-the-distribution-of-births-by-month-in-canada&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Section 1: What is the distribution of births by month in Canada?&lt;/h2&gt;
&lt;p&gt;Gladwell’s thesis is that more Canadian junior hockey players are born earlier in the year because of the way cut-off dates are set for youth hockey. I think he is correct, but what if most people in Canada are born at the beginning of the year? Then this pattern might be representative of the population rather than an outlier effect.&lt;/p&gt;
&lt;p&gt;Information about births by month in Canada can be found at &lt;a href=&#34;https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=1310041501&#34;&gt;Statistics Canada&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;canadaBirth.PNG&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Initially I tried to web-scrape the table using &lt;code&gt;rvest&lt;/code&gt; but could not figure out a way to deal with the “Number” row. Since the data could be downloaded as a CSV file, my alternative was to use &lt;code&gt;httr&lt;/code&gt; to send a request to the download link and grab the file. The URL was found by using the inspect option in Firefox when clicking the download link.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;canada_raw &amp;lt;- GET(&amp;#39;https://www150.statcan.gc.ca/t1/tbl1/en/dtl!downloadDbLoadingData-nonTraduit.action?pid=1310041501&amp;amp;latestN=0&amp;amp;startDate=19910101&amp;amp;endDate=20220101&amp;amp;csvLocale=en&amp;amp;selectedMembers=%5B%5B1%5D%2C%5B%5D%2C%5B1%5D%5D&amp;amp;checkedLevels=1D1%2C1D2&amp;#39;) %&amp;gt;%
  content()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;GET()&lt;/code&gt; command sends the request to the server and the &lt;code&gt;content()&lt;/code&gt; function extracts the response body. Without &lt;code&gt;content()&lt;/code&gt;, the returned object also contains a lot of additional information about the call, such as headers, the request URL, etc.&lt;/p&gt;
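&lt;p&gt;As a quick illustration of that extra information, the full response object can be inspected before pulling out the body. This is a minimal sketch, using the public httpbin.org test service rather than the Statistics Canada URL, with standard &lt;code&gt;httr&lt;/code&gt; accessors:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;resp &amp;lt;- GET(&amp;#39;https://httpbin.org/get&amp;#39;)

status_code(resp)                  # the HTTP status code, e.g. 200 on success
headers(resp)[[&amp;#39;content-type&amp;#39;]]   # one of the response headers
resp$url                           # the request URL
content(resp)                      # just the parsed body&lt;/code&gt;&lt;/pre&gt;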
&lt;p&gt;The raw data contains many columns that are either duplicative or unnecessary for this analysis:&lt;/p&gt;
&lt;table style=&#34;width:100%;&#34;&gt;
&lt;colgroup&gt;
&lt;col width=&#34;4%&#34; /&gt;
&lt;col width=&#34;18%&#34; /&gt;
&lt;col width=&#34;7%&#34; /&gt;
&lt;col width=&#34;10%&#34; /&gt;
&lt;col width=&#34;10%&#34; /&gt;
&lt;col width=&#34;3%&#34; /&gt;
&lt;col width=&#34;3%&#34; /&gt;
&lt;col width=&#34;6%&#34; /&gt;
&lt;col width=&#34;4%&#34; /&gt;
&lt;col width=&#34;4%&#34; /&gt;
&lt;col width=&#34;5%&#34; /&gt;
&lt;col width=&#34;3%&#34; /&gt;
&lt;col width=&#34;3%&#34; /&gt;
&lt;col width=&#34;3%&#34; /&gt;
&lt;col width=&#34;5%&#34; /&gt;
&lt;col width=&#34;4%&#34; /&gt;
&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;right&#34;&gt;REF_DATE&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;GEO&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;DGUID&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;Month of birth&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;Characteristics&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;UOM&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;UOM_ID&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;SCALAR_FACTOR&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;SCALAR_ID&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;VECTOR&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;COORDINATE&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;VALUE&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;STATUS&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;SYMBOL&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;TERMINATED&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;DECIMALS&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;right&#34;&gt;1991&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Canada, place of residence of mother&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2016A000011124&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Total, month of birth&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Number of live births&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Number&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;223&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;units&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;v21400536&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;1.1.1&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;403816&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NA&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NA&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NA&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;right&#34;&gt;1992&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Canada, place of residence of mother&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2016A000011124&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Total, month of birth&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Number of live births&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Number&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;223&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;units&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;v21400536&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;1.1.1&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;399109&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NA&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NA&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NA&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;I pulled the data for 1991 through 2022; each year has a total row as well as a row for each individual month. To clean up this data I filter out the total rows with &lt;code&gt;str_detect()&lt;/code&gt;, keep only the &lt;em&gt;REF_DATE&lt;/em&gt; for the year, extract the month using &lt;code&gt;str_extract()&lt;/code&gt;, and keep &lt;em&gt;VALUE&lt;/em&gt;, which is the actual number of births.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;canada_births &amp;lt;- canada_raw %&amp;gt;%
  filter(!str_detect(`Month of birth`, &amp;#39;Total&amp;#39;)) %&amp;gt;%
  transmute(
    REF_DATE,
    MONTH = str_extract(`Month of birth`, &amp;#39;Month of birth, (\\w+)&amp;#39;, 1),
    VALUE
  ) %&amp;gt;% 
  group_by(MONTH) %&amp;gt;% 
  summarize(country_births = sum(VALUE)) %&amp;gt;% 
  mutate(country_pct = country_births/sum(country_births))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The distribution can then be calculated with &lt;code&gt;dplyr&lt;/code&gt; functions. The true distribution of birth month in Canada vs. the expected distribution if every day had an equal chance is shown below:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;canada_births %&amp;gt;%
  transmute(
    `Canada %` = country_pct,
    `Expected % from Days in Month` = case_when(
      MONTH %in% c(&amp;#39;April&amp;#39;, &amp;#39;June&amp;#39;, &amp;#39;September&amp;#39;, &amp;#39;November&amp;#39;) ~ 30/365,
      MONTH == &amp;#39;February&amp;#39; ~ 28/365,
      TRUE ~ 31/365,
    ),
    `Difference` = `Canada %` - `Expected % from Days in Month`,
    month_id = factor(MONTH, levels = c(&amp;#39;January&amp;#39;, &amp;#39;February&amp;#39;, &amp;#39;March&amp;#39;, &amp;#39;April&amp;#39;,
                                        &amp;#39;May&amp;#39;, &amp;#39;June&amp;#39;, &amp;#39;July&amp;#39;, &amp;#39;August&amp;#39;,
                                        &amp;#39;September&amp;#39;, &amp;#39;October&amp;#39;, &amp;#39;November&amp;#39;, &amp;#39;December&amp;#39;))
  ) %&amp;gt;% 
  gather(lbl, value, -month_id) %&amp;gt;% 
  spread(month_id, value) %&amp;gt;%
  mutate(
    lbl = factor(lbl, levels = c(&amp;#39;Canada %&amp;#39;, &amp;#39;Expected % from Days in Month&amp;#39;, &amp;#39;Difference&amp;#39;)),
    across(January:December, ~percent(.x, accuracy = .1))) %&amp;gt;%
  arrange(lbl) %&amp;gt;% 
  knitr::kable(col.names = c(&amp;quot;&amp;quot;, names(.)[-1]))&lt;/code&gt;&lt;/pre&gt;
&lt;table style=&#34;width:100%;&#34;&gt;
&lt;colgroup&gt;
&lt;col width=&#34;25%&#34; /&gt;
&lt;col width=&#34;6%&#34; /&gt;
&lt;col width=&#34;7%&#34; /&gt;
&lt;col width=&#34;5%&#34; /&gt;
&lt;col width=&#34;5%&#34; /&gt;
&lt;col width=&#34;4%&#34; /&gt;
&lt;col width=&#34;4%&#34; /&gt;
&lt;col width=&#34;4%&#34; /&gt;
&lt;col width=&#34;5%&#34; /&gt;
&lt;col width=&#34;8%&#34; /&gt;
&lt;col width=&#34;6%&#34; /&gt;
&lt;col width=&#34;7%&#34; /&gt;
&lt;col width=&#34;7%&#34; /&gt;
&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;January&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;February&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;March&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;April&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;May&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;June&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;July&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;August&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;September&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;October&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;November&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;December&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Canada %&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;8.0%&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;7.5%&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;8.5%&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;8.4%&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;8.8%&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;8.5%&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;8.9%&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;8.7%&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;8.7%&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;8.4%&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;7.8%&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;7.8%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Expected % from Days in Month&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;8.5%&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;7.7%&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;8.5%&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;8.2%&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;8.5%&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;8.2%&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;8.5%&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;8.5%&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;8.2%&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;8.5%&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;8.2%&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;8.5%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Difference&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;-0.5%&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;-0.1%&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;0.0%&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;0.2%&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;0.3%&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;0.3%&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;0.4%&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;0.2%&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;0.5%&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;-0.1%&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;-0.4%&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;-0.7%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;At first glance, Canadians seem &lt;strong&gt;less&lt;/strong&gt; likely to be born at the beginning of the year (particularly January and February) than a uniform distribution would suggest. They’re also less likely to be born at the end of the year.&lt;/p&gt;
&lt;p&gt;Let’s see what the Canadian NHL players look like:&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;section-2-what-is-the-difstribution-of-births-by-month-for-canadian-nhl-players&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Section 2: What is the distribution of births by month for Canadian NHL players?&lt;/h2&gt;
&lt;p&gt;To get the information about the NHL players I will use &lt;code&gt;httr&lt;/code&gt; to query the NHL’s API. My original version of this analysis used the &lt;code&gt;nhlapi&lt;/code&gt; package, which is on CRAN, but the NHL changed their API at some point in the last few months, so that package no longer functions.&lt;/p&gt;
&lt;p&gt;Getting the 2023-2024 team rosters can be done through the API endpoint &lt;code&gt;https://api-web.nhle.com/v1/roster/{team}/20232024&lt;/code&gt; where &lt;code&gt;{team}&lt;/code&gt; is a three-character code representing an individual team. To get the rosters for each team I need to first get the codes for each team.&lt;/p&gt;
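&lt;p&gt;For example, the URL for a single team can be built by filling the &lt;code&gt;{team}&lt;/code&gt; placeholder with &lt;code&gt;glue&lt;/code&gt; (using &lt;code&gt;TOR&lt;/code&gt; here purely as an illustrative code):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Fill the {team} placeholder in the endpoint template
glue::glue(&amp;#39;https://api-web.nhle.com/v1/roster/{team}/20232024&amp;#39;, team = &amp;#39;TOR&amp;#39;)
## https://api-web.nhle.com/v1/roster/TOR/20232024&lt;/code&gt;&lt;/pre&gt;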
&lt;p&gt;This is going to involve a bunch of JSON manipulation, which is new to me, so there is probably a more elegant solution.&lt;/p&gt;
&lt;p&gt;All information on NHL teams can be retrieved from the &lt;code&gt;https://api.nhle.com/stats/rest/en/team&lt;/code&gt; endpoint. Using the same &lt;code&gt;GET()&lt;/code&gt; / &lt;code&gt;content()&lt;/code&gt; pattern from the prior section, I can get all the team information:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;teams &amp;lt;- GET(&amp;#39;https://api.nhle.com/stats/rest/en/team&amp;#39;) %&amp;gt;% 
  content()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This comes back as a list with two items: “data”, which contains all the useful information, and “total”, which contains the number of elements returned in “data”. I just need the “data” piece.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;teams &amp;lt;- teams %&amp;gt;% 
  .[[&amp;#39;data&amp;#39;]] &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now “teams” is a list with 59 elements, each containing information (id, franchiseId, fullName, leagueId, rawTricode, triCode) about a team.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;teams[1:3] %&amp;gt;% jsonlite::toJSON(auto_unbox = T) %&amp;gt;% jsonlite::prettify()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [
##     {
##         &amp;quot;id&amp;quot;: 11,
##         &amp;quot;franchiseId&amp;quot;: 35,
##         &amp;quot;fullName&amp;quot;: &amp;quot;Atlanta Thrashers&amp;quot;,
##         &amp;quot;leagueId&amp;quot;: 133,
##         &amp;quot;rawTricode&amp;quot;: &amp;quot;ATL&amp;quot;,
##         &amp;quot;triCode&amp;quot;: &amp;quot;ATL&amp;quot;
##     },
##     {
##         &amp;quot;id&amp;quot;: 34,
##         &amp;quot;franchiseId&amp;quot;: 26,
##         &amp;quot;fullName&amp;quot;: &amp;quot;Hartford Whalers&amp;quot;,
##         &amp;quot;leagueId&amp;quot;: 133,
##         &amp;quot;rawTricode&amp;quot;: &amp;quot;HFD&amp;quot;,
##         &amp;quot;triCode&amp;quot;: &amp;quot;HFD&amp;quot;
##     },
##     {
##         &amp;quot;id&amp;quot;: 31,
##         &amp;quot;franchiseId&amp;quot;: 15,
##         &amp;quot;fullName&amp;quot;: &amp;quot;Minnesota North Stars&amp;quot;,
##         &amp;quot;leagueId&amp;quot;: 133,
##         &amp;quot;rawTricode&amp;quot;: &amp;quot;MNS&amp;quot;,
##         &amp;quot;triCode&amp;quot;: &amp;quot;MNS&amp;quot;
##     }
## ]
## &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Ultimately I want to restructure this set of nested lists into a rectangular format. The way I’ll do this is create a tibble of list columns using &lt;code&gt;tibble()&lt;/code&gt; and then &lt;code&gt;tidyr::unnest_wider&lt;/code&gt; to turn each element of a list-column into its own column.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;teams &amp;lt;- teams %&amp;gt;% 
  tibble(data = .) %&amp;gt;% 
  unnest_wider(data)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now everything is in a much more legible format:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;right&#34;&gt;id&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;franchiseId&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;fullName&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;leagueId&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;rawTricode&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;triCode&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;right&#34;&gt;11&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;35&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Atlanta Thrashers&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;133&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;ATL&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;ATL&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;right&#34;&gt;34&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;26&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Hartford Whalers&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;133&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;HFD&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;HFD&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;right&#34;&gt;31&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;15&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Minnesota North Stars&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;133&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;MNS&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;MNS&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;That was all just to get the 3-character codes needed to actually request the rosters. Since a separate call is made to the roster endpoint for each team, this is a good opportunity to create a function. Then I can use &lt;code&gt;purrr::map_dfr&lt;/code&gt; to iterate through the team codes and combine all the rosters together.&lt;/p&gt;
&lt;p&gt;The function will take a team code as input and extract each player’s first name, last name, birth date, and birth country.&lt;/p&gt;
&lt;p&gt;The data structure returned from the roster endpoint is a list with elements for forwards, defensemen, and goalies. The data for each player looks like:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;GET(glue::glue(&amp;#39;https://api-web.nhle.com/v1/roster/NJD/20232024&amp;#39;)) %&amp;gt;% 
    content() %&amp;gt;%
    .[[&amp;#39;forwards&amp;#39;]] %&amp;gt;%
    .[[1]] %&amp;gt;%
    jsonlite::toJSON(auto_unbox = T, pretty = T) %&amp;gt;% 
    jsonlite::prettify()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## {
##     &amp;quot;id&amp;quot;: 8479414,
##     &amp;quot;headshot&amp;quot;: &amp;quot;https://assets.nhle.com/mugs/nhl/20232024/NJD/8479414.png&amp;quot;,
##     &amp;quot;firstName&amp;quot;: {
##         &amp;quot;default&amp;quot;: &amp;quot;Nathan&amp;quot;
##     },
##     &amp;quot;lastName&amp;quot;: {
##         &amp;quot;default&amp;quot;: &amp;quot;Bastian&amp;quot;
##     },
##     &amp;quot;sweaterNumber&amp;quot;: 14,
##     &amp;quot;positionCode&amp;quot;: &amp;quot;R&amp;quot;,
##     &amp;quot;shootsCatches&amp;quot;: &amp;quot;R&amp;quot;,
##     &amp;quot;heightInInches&amp;quot;: 76,
##     &amp;quot;weightInPounds&amp;quot;: 205,
##     &amp;quot;heightInCentimeters&amp;quot;: 193,
##     &amp;quot;weightInKilograms&amp;quot;: 93,
##     &amp;quot;birthDate&amp;quot;: &amp;quot;1997-12-06&amp;quot;,
##     &amp;quot;birthCity&amp;quot;: {
##         &amp;quot;default&amp;quot;: &amp;quot;Kitchener&amp;quot;
##     },
##     &amp;quot;birthCountry&amp;quot;: &amp;quot;CAN&amp;quot;,
##     &amp;quot;birthStateProvince&amp;quot;: {
##         &amp;quot;default&amp;quot;: &amp;quot;ON&amp;quot;
##     }
## }
## &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To get only the data I want, I’ll (1) pass a team code into the function to call the API with &lt;code&gt;GET()&lt;/code&gt; and &lt;code&gt;content()&lt;/code&gt;, (2) use &lt;code&gt;flatten()&lt;/code&gt; to remove the forwards/defensemen/goalies level so all the players form one nested list, (3) turn the data into a tibble of list-columns with &lt;code&gt;tibble()&lt;/code&gt;, and (4) use the &lt;code&gt;tidyr::hoist()&lt;/code&gt; function to pull only the items I want from the structure. Finally, I use &lt;code&gt;transmute&lt;/code&gt; to add the 3-character input to the results and to exclude the &lt;em&gt;data&lt;/em&gt; list-column.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;get_roster &amp;lt;- function(team){
  GET(glue::glue(&amp;#39;https://api-web.nhle.com/v1/roster/{team}/20232024&amp;#39;)) %&amp;gt;% 
  content() %&amp;gt;%
  flatten() %&amp;gt;%
  tibble(data = .) %&amp;gt;%
  hoist(&amp;#39;data&amp;#39;,
    &amp;#39;firstName&amp;#39; = list(&amp;#39;firstName&amp;#39;, 1L),
    &amp;#39;lastName&amp;#39; = list(&amp;#39;lastName&amp;#39;, 1L),
    &amp;#39;birthDate&amp;#39;,
    &amp;#39;birthCountry&amp;#39;
  ) %&amp;gt;% 
  transmute(team = team, firstName, lastName, birthDate, birthCountry)
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Within &lt;code&gt;hoist()&lt;/code&gt;, the &lt;code&gt;list(&#39;firstName&#39;, 1L)&lt;/code&gt; construction avoids having to pull the “default” sub-item within firstName by name; it simply grabs the first element within the firstName item. Since birthDate and birthCountry have no sub-items, there is no need to do that for those fields.&lt;/p&gt;
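&lt;p&gt;To see this behavior outside of the API call, here is a small standalone sketch of &lt;code&gt;hoist()&lt;/code&gt; on a made-up list that mimics the roster structure (the player values are hypothetical):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidyverse)

toy &amp;lt;- list(
  list(firstName = list(default = &amp;#39;Nathan&amp;#39;), birthDate = &amp;#39;1997-12-06&amp;#39;),
  list(firstName = list(default = &amp;#39;Alex&amp;#39;),   birthDate = &amp;#39;2000-01-15&amp;#39;)
)

tibble(data = toy) %&amp;gt;%
  hoist(&amp;#39;data&amp;#39;,
        firstName = list(&amp;#39;firstName&amp;#39;, 1L),  # first element inside the firstName item
        &amp;#39;birthDate&amp;#39;)                        # no sub-items, so pull it directly&lt;/code&gt;&lt;/pre&gt;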
&lt;p&gt;Finally, to get all the players from all the teams, I use &lt;code&gt;purrr::map_dfr()&lt;/code&gt; to iterate through the team codes and run my function. There is a filter to remove any rows with a missing firstName field because the team endpoint returns information for all historical teams (e.g., Atlanta Thrashers, Hartford Whalers, etc.). Since these teams are not active in 2023-2024, they are returned by the API but the fields I want don’t populate.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;all_roster &amp;lt;- map_dfr(teams$triCode, get_roster) %&amp;gt;%
  filter(!is.na(firstName))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now we have a dataset of the 774 players in the NHL. This number is slightly larger than the expected number of NHL players (736 = 23 players * 32 teams), so there is likely some nuance to how a roster player is determined, but it shouldn’t matter for this analysis.&lt;/p&gt;
&lt;p&gt;I’ll only look at Canadian players, because I have no idea whether the cut-offs that apply in Canada also apply in other countries. I’ll also do some data cleanup on birth months and calculate the player distribution.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;canada_players &amp;lt;- all_roster %&amp;gt;% 
  filter(birthCountry == &amp;#39;CAN&amp;#39;) %&amp;gt;% 
  mutate(
    mob = month(ymd(birthDate), label = T, abbr = F),
    mob_id = month(ymd(birthDate))
  ) %&amp;gt;% 
  count(mob_id, mob, name = &amp;quot;players&amp;quot;) %&amp;gt;%
  mutate(player_pct = players/sum(players))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now we have the distribution of birth months for the 314 Canadian NHL players.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;section-3-putting-it-all-together&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Section 3: Putting it all together&lt;/h2&gt;
&lt;p&gt;The last section combines the Canada birth month data from Section 1 with the Canadian NHL player data from Section 2 and makes a pretty visualization.&lt;/p&gt;
&lt;p&gt;First I combine the data and create a field for the percentage of births you’d expect if every day were equally likely (e.g., if January has 31 days then there is a 31/365 chance of being randomly born in January):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;combined &amp;lt;- canada_players %&amp;gt;%
  left_join(canada_births, by = c(&amp;#39;mob&amp;#39; = &amp;#39;MONTH&amp;#39;)) %&amp;gt;%
  #Put in random value
  mutate(
    random = case_when(
      mob_id %in% c(4, 6, 9, 11) ~ 30/365,
      mob_id %in% c(1, 3, 5, 7, 8, 10, 12) ~ 31/365,
      mob_id == 2 ~ 28/365
    )
  )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For the visualization I’m going to use the &lt;code&gt;ggimage&lt;/code&gt; package to draw icons of the Canadian flag and the NHL logo. Since this package can render images from a URL, I’ll create variables for those URLs.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;NHL_ICON &amp;lt;- &amp;quot;https://pbs.twimg.com/media/F9sTTAYakAAkRv6.png&amp;quot;
CANADA_ICON &amp;lt;- &amp;quot;https://cdn-icons-png.flaticon.com/512/5372/5372678.png&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Finally, a combination of &lt;code&gt;ggplot&lt;/code&gt;, &lt;code&gt;ggtext&lt;/code&gt;, and &lt;code&gt;ggimage&lt;/code&gt; is used to create the visualization.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(combined, aes(x = fct_reorder(mob, -mob_id))) + 
  geom_line(aes(y = random, group = 1), lty = 2, color = &amp;#39;grey60&amp;#39;) + 
  geom_linerange(aes(ymin = country_pct, ymax = player_pct)) + 
  geom_image(aes(image = NHL_ICON, y = player_pct), size = .08) + 
  geom_image(aes(image = CANADA_ICON, y = country_pct), size = .07) + 
  geom_text(aes(label = percent(player_pct, accuracy = .1), 
                y = if_else(player_pct &amp;gt; country_pct, player_pct + .004, player_pct - .004))) + 
  geom_text(aes(label = percent(country_pct, accuracy = .1), 
                y = if_else(country_pct &amp;gt; player_pct, country_pct + .004, country_pct - .004))) +
  annotate(
    &amp;#39;curve&amp;#39;,
    xend = 2.3,
    x = 1.5,
    yend = .084,
    y = .10,
    curvature = .25,
    arrow = arrow(
      length = unit(7, &amp;quot;pt&amp;quot;),
      type = &amp;quot;closed&amp;quot;
    )) + 
  annotate(
    &amp;#39;richtext&amp;#39;,
    x = 1,
    y = .105,
    label = &amp;quot;The grey line is the expected % of births&amp;lt;br /&amp;gt;if birth month was completely random&amp;quot;,
    size = 4
  ) + 
  scale_y_continuous(labels = percent) + 
  coord_flip() + 
  labs(x = &amp;quot;Month of Birth&amp;quot;, y = &amp;quot;Percentage of Births (%)&amp;quot;,
       title = &amp;quot;Are Canadian NHL Players More Likely to be Born Early in the Year?&amp;quot;,
       subtitle = &amp;#39;Comparing the distribution of birth months between Canadian NHL players and Canada in general &amp;#39;,
       caption = glue::glue(&amp;quot;&amp;lt;img src = {NHL_ICON} width = &amp;#39;15&amp;#39; height=&amp;#39; 15&amp;#39; /&amp;gt; - Canadian NHL Players Birth Month Distribution &amp;lt;br /&amp;gt;
                            &amp;lt;img src = {CANADA_ICON} width = &amp;#39;15&amp;#39; height=&amp;#39; 15&amp;#39; /&amp;gt; - Canadian Birth Month (1991-2022) Distribution&amp;quot;)
       ) + 
  theme_light() +
  theme(
    text = element_text(family = &amp;#39;Asap SemiCondensed&amp;#39;, size = 14),
    plot.title.position = &amp;#39;plot&amp;#39;,
    plot.title = element_markdown(),
    plot.caption = element_markdown()
  )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2023/12/04/are-birth-dates-still-destiny-for-canadian-nhl-players/index_files/figure-html/unnamed-chunk-17-1.png&#34; width=&#34;960&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Visually it looks pretty clear that more Canadian NHL players are born in January/February than expected, and fewer are born from August through the end of the year. May and July are interesting, but I don’t have an intuition for why more NHL players might be born in those months.&lt;/p&gt;
&lt;p&gt;For a more statistical perspective, a chi-squared test can be used to see whether the distribution for Canadian NHL players differs from Canada in general. In the following code, &lt;em&gt;x&lt;/em&gt; is the number of Canadian NHL players born in each month and &lt;em&gt;p&lt;/em&gt; is the expected proportion based on the distribution of birth months for Canada as a whole.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;broom::tidy(chisq.test(x = combined$players, p = combined$country_pct))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 1 × 4
##   statistic p.value parameter method                                  
##       &amp;lt;dbl&amp;gt;   &amp;lt;dbl&amp;gt;     &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;                                   
## 1      25.6 0.00752        11 Chi-squared test for given probabilities&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The p-value of &amp;lt;.01 means we can reject the null hypothesis that the two distributions are the same.&lt;/p&gt;
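&lt;p&gt;As a sanity check on that conclusion, the test statistic can be compared against the chi-squared critical value for 12 - 1 = 11 degrees of freedom, and the p-value recomputed directly with base R:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# 95th percentile of the chi-squared distribution with 11 df
qchisq(p = .95, df = 11)                    # roughly 19.7, which 25.6 exceeds

# p-value for the observed test statistic
pchisq(25.6, df = 11, lower.tail = FALSE)   # roughly 0.0075&lt;/code&gt;&lt;/pre&gt;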
&lt;p&gt;So it seems that Malcolm Gladwell’s thesis in Outliers still holds true in today’s NHL among Canadian players.&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>The Most Unexpectedly Good and Bad TV Episodes</title>
      <link>https://jlaw.netlify.app/2023/09/28/the-most-unexpectedly-good-and-bad-tv-episodes/</link>
      <pubDate>Thu, 28 Sep 2023 00:00:00 +0000</pubDate>
      <guid>https://jlaw.netlify.app/2023/09/28/the-most-unexpectedly-good-and-bad-tv-episodes/</guid>
      <description>


&lt;p&gt;The 9th episode of the 2nd season of Ted Lasso, “Beard After Hours”, struck me as a pretty bad episode of a pretty good show. I wondered whether others found this to be an unexpectedly bad episode of TV or if it was just me. The website &lt;a href=&#34;https://www.ratingraph.com/tv-shows/ted-lasso-ratings-81599/&#34;&gt;RatinGraph&lt;/a&gt; confirmed that while it isn’t the worst episode of the series, it’s in the bottom 3.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;TedLasso.PNG&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Further Googling showed that this episode (along with one other) was the result of the series being extended from 10 episodes to 12 for Season 2. Thus, “Beard After Hours” was a filler episode intended not to affect the main plot line.&lt;/p&gt;
&lt;p&gt;This got me thinking about other unexpectedly bad episodes of TV. And since finding unexpectedly good and unexpectedly bad episodes are similar problems, I figured why not do both? So in this post, I find the 10 most unexpectedly good and unexpectedly bad episodes of television.&lt;/p&gt;
&lt;div id=&#34;data&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Data&lt;/h2&gt;
&lt;p&gt;&lt;a href=&#34;https://developer.imdb.com/non-commercial-datasets/&#34;&gt;IMDB&lt;/a&gt; provides datasets for personal and non-commercial use that contain information on TV Series, their episodes, and the ratings of those episodes. More specifically, I will be using the &lt;em&gt;title.basics.tsv.gz&lt;/em&gt; file for basic info on TV Series (and episode names), &lt;em&gt;title.episode.tsv.gz&lt;/em&gt; to get all of the episode IDs for each TV Series, and &lt;em&gt;title.ratings.tsv.gz&lt;/em&gt; to get the ratings and number of votes for each episode.&lt;/p&gt;
&lt;p&gt;For this analysis there are no fancy packages: just &lt;code&gt;tidyverse&lt;/code&gt;, &lt;code&gt;glue&lt;/code&gt;, and &lt;code&gt;broom&lt;/code&gt; for data manipulation, and &lt;code&gt;ggtext&lt;/code&gt; and &lt;code&gt;ggrepel&lt;/code&gt; for enhancements to the visualizations.&lt;/p&gt;
&lt;p&gt;The first steps are loading libraries,&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidyverse)
library(broom)
library(ggrepel)
library(glue)
library(ggtext)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;setting some global settings for visualization,&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;theme_set(theme_light(base_size = 14, base_family = &amp;quot;Asap SemiCondensed&amp;quot;))

theme_update(
  panel.grid.minor = element_blank(),
  plot.title = element_text(face = &amp;quot;bold&amp;quot;),
  plot.title.position = &amp;quot;plot&amp;quot;
)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;and reading in the 3 IMDB data files. The raw files are tab-delimited and use the &lt;code&gt;\N&lt;/code&gt; character for a missing value; the &lt;code&gt;na&lt;/code&gt; parameter in &lt;code&gt;read_delim&lt;/code&gt; tells R to set these to &lt;code&gt;NA&lt;/code&gt; rather than keep them as strings.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;basics &amp;lt;- read_delim(file = &amp;#39;data/title.basics.tsv&amp;#39;, delim = &amp;#39;\t&amp;#39;, na = &amp;#39;\\N&amp;#39;)
ratings &amp;lt;- read_delim(file = &amp;#39;data/title.ratings.tsv&amp;#39;, delim = &amp;#39;\t&amp;#39;)
episodes &amp;lt;- read_delim(file = &amp;#39;data/title.episode.tsv&amp;#39;, delim = &amp;#39;\t&amp;#39;, na =&amp;#39;\\N&amp;#39;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;basics&lt;/code&gt; file contains nearly 250k TV Series, which is way more than I want to deal with, so I’ll keep only shows that meet the following criteria:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;IMDB categorizes it as a TV Series&lt;/li&gt;
&lt;li&gt;The show started in 1990 or later (because I wanted things I’d be familiar with)&lt;/li&gt;
&lt;li&gt;IMDB classifies it as either a Comedy or a Drama&lt;/li&gt;
&lt;li&gt;IMDB does &lt;strong&gt;not&lt;/strong&gt; classify it as a Talk Show, Reality Show, News, Game Show, or Short
&lt;ul&gt;
&lt;li&gt;Genres on IMDB can have multiple categories for example &lt;a href=&#34;https://www.imdb.com/title/tt0320037&#34;&gt;Jimmy Kimmel Live!&lt;/a&gt; is classified as Comedy, Music, and Talk-Show&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;There are at least 20 episodes in the Series (need a track record for how a show is rated)&lt;/li&gt;
&lt;li&gt;Episodes average more than 250 votes (want enough stability in the ratings and for the show to be somewhat popular)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These exclusions are handled with the following code:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;basics_agg &amp;lt;- basics %&amp;gt;%
  # Limit to TV Series
  filter(titleType == &amp;#39;tvSeries&amp;#39;) %&amp;gt;% 
  # Keep Only Shows Starting In or After 1990
  filter(startYear &amp;gt;= 1990) %&amp;gt;%
  # Join all the Episodes to the TV Series data
  inner_join(episodes, by = join_by(tconst==parentTconst)) %&amp;gt;%
  # Join the ratings to the episode data
  inner_join(ratings, by = join_by(tconst.y == tconst)) %&amp;gt;% 
  # Calculate summary statistics for each show
  group_by(tconst, titleType, primaryTitle, originalTitle, 
           isAdult, startYear, endYear, runtimeMinutes, genres) %&amp;gt;%
  summarize(
    total_episodes = n(),
    avg_votes = mean(numVotes),
    overall_average = sum(numVotes * averageRating) / sum(numVotes),
    .groups = &amp;#39;drop&amp;#39;
  ) %&amp;gt;%
  # Keep Comedies and Dramas
  filter(str_detect(genres, &amp;#39;Comedy|Drama&amp;#39;)) %&amp;gt;%
  # Exclude Other Genres
  filter(!str_detect(genres, &amp;#39;Talk-Show|Reality-TV|News|Game-Show|Short&amp;#39;)) %&amp;gt;%
  # Keep Only if 20+ Episodes on Series
  filter(total_episodes &amp;gt;= 20) %&amp;gt;%
  # Keep Only if Episodes Average 250 Votes or More
  filter(avg_votes &amp;gt; 250)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now there are only 700 shows remaining in the data, which is much more manageable!&lt;/p&gt;
&lt;div id=&#34;creating-an-episode-level-data-set&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Creating an episode level data set&lt;/h3&gt;
&lt;p&gt;So far the &lt;code&gt;basics_agg&lt;/code&gt; data set is just a list of 700 TV Series and their information. To build a model to predict episode ratings I’ll have to build a data set where each row is an episode. This replicates some of the logic from above that merges the 3 datasets together:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;all_tv_details &amp;lt;- basics_agg %&amp;gt;% 
  ## Join in Episode Data
  inner_join(episodes, by = join_by(tconst==parentTconst)) %&amp;gt;%
  ## Join in Ratings Data
  inner_join(ratings, by = join_by(tconst.y == tconst)) %&amp;gt;% 
  # Bring in Episode Titles
  left_join(basics %&amp;gt;% filter(titleType == &amp;#39;tvEpisode&amp;#39;) %&amp;gt;% 
              transmute(tconst, episodeTitle = primaryTitle),
            by = join_by(tconst.y == tconst)) %&amp;gt;% 
  arrange(tconst, seasonNumber, episodeNumber) %&amp;gt;% 
  group_by(tconst) %&amp;gt;% 
  # Create a variable for the overall episode number
  mutate(episodeOverall = row_number(tconst),
         seasonNumber = factor(seasonNumber)
  ) %&amp;gt;%
  # Filter Out Missing Data
  filter(!is.na(seasonNumber) &amp;amp; !is.na(episodeNumber)) %&amp;gt;%
  ungroup()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now the dataset is all prepared to find our &lt;strong&gt;unexpected&lt;/strong&gt; episodes.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;methodology&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Methodology&lt;/h2&gt;
&lt;p&gt;The methodology I’m using to identify an &lt;strong&gt;unexpectedly&lt;/strong&gt; good or bad episode of TV is similar to that of the &lt;a href=&#34;https://robjhyndman.com/hyndsight/tsoutliers/&#34;&gt;tsoutliers() function in the forecast package&lt;/a&gt;. Since this isn’t really a time series, I modify it slightly to not account for “seasonal components”. My method is:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;p&gt;For each TV Series create a prediction of what the expected IMDB rating would be.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Using a linear model with the Overall Episode number to capture a global trend (does the series get better or worse over time) as well as Season Number and Episode Number (and their interaction) to capture more local effects (is a certain season as a whole just bad).&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;For example, in the show Scrubs (which I love), the 9th season is rated much lower than Seasons 1-8. Therefore episodes in Season 9 aren’t &lt;strong&gt;unexpectedly&lt;/strong&gt; bad since the whole season is bad.&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Calculate the difference between the Predicted Ratings from the model in Step #1 and the Actual Rating from IMDB.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Look at the distribution of the differences from Step #2. Episodes will be labeled as &lt;strong&gt;unexpectedly&lt;/strong&gt; good or bad if the difference calculated in step #2 is large enough.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;For “large enough” I look at the interquartile range (IQR) of the differences (the 75th percentile minus the 25th percentile) and label an episode as &lt;strong&gt;unexpectedly bad&lt;/strong&gt; if that episode’s difference is less than the 25th Percentile - 3 times the IQR and &lt;strong&gt;unexpectedly good&lt;/strong&gt; if that episode’s difference is greater than the 75th Percentile + 3 times the IQR.&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The amount of &lt;strong&gt;unexpectedness&lt;/strong&gt; is based on the difference between the lower/upper bound and the actual value.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;This is different from just using the difference between the predicted and the actual values.&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;The reason is that a show can have a very wide expected range, for example from 4 to 9. If the predicted value is 6.5 and the actual value is 9.1, then there’s a difference of 2.6 from the predicted value but only 0.1 outside the expected range. &lt;strong&gt;I want to focus on the greatest gap from expected&lt;/strong&gt;, so I want to take the larger variability into account.&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;/ol&gt;
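&lt;p&gt;The bound logic from Steps 3 and 4 can be sketched on a handful of made-up residuals (these numbers are purely for illustration):&lt;/p&gt;

```r
# Minimal sketch of the 3*IQR rule: flag a residual when it falls outside
# [P25 - 3*IQR, P75 + 3*IQR]. The values here are made up for illustration.
resid = c(-0.3, -0.1, 0, 0.1, 0.2, -2.5)   # the -2.5 is a clear outlier
iqr = quantile(resid, .75) - quantile(resid, .25)
lci = quantile(resid, .25) - (3 * iqr)     # lower bound
uci = quantile(resid, .75) + (3 * iqr)     # upper bound
anomaly = resid > uci | lci > resid        # TRUE only for the -2.5 residual
```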
&lt;p&gt;A visual explanation using an episode of Stranger Things as an example is shown below. I want to focus more on the difference between the 7.4 and 6.1 vs. the 8.5 and 6.1:
&lt;img src=&#34;example.png&#34; /&gt;&lt;/p&gt;
&lt;div id=&#34;function-to-find-the-unexpected-episodes&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Function to find the Unexpected Episodes&lt;/h3&gt;
&lt;p&gt;The steps above have been built into a function called &lt;code&gt;get_anomalies()&lt;/code&gt;. The parameter &lt;em&gt;onlyAnomalies&lt;/em&gt; determines whether to return &lt;strong&gt;only&lt;/strong&gt; unexpected episodes or all episodes. The differences described in Step 2 are added using the &lt;code&gt;augment&lt;/code&gt; function from &lt;code&gt;broom&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;get_anomalies &amp;lt;- function(dt, onlyAnomalies = T){
  
  ## STEP 1: Run Linear Model on IMDB Ratings vs. Episode Number + Season Info
  #if multiple seasons for show use both global and local trend
  if(n_distinct(dt$seasonNumber) &amp;gt; 1){
    model &amp;lt;- lm(averageRating ~ episodeOverall + seasonNumber*episodeNumber, 
                data = dt)
  }
  # if only one season then global trend = local trend
  else{
    model &amp;lt;- lm(averageRating ~ episodeOverall, data = dt)
  }
  
  ### Step 2 - Add in Residuals from model to initial data set
  results &amp;lt;- augment(model, dt) %&amp;gt;% 
  ### Step 3 - Calculate the 3*IQR Range for each episode
    mutate(
      ## Determine the IQR of the Residuals (P75 - P25)
      iqr = (quantile(.resid, .75)-quantile(.resid, .25)),
      ## Set Lower Bound for expected range of residuals
      lci = quantile(.resid, .25)-(3*iqr),
      ## Set Upper Bound for expected range of residuals
      uci = quantile(.resid, .75)+(3*iqr),
      ## Tag an episode as an anomaly if its actual rating is outside the bounds
      anomaly = if_else(.resid &amp;gt; uci | .resid &amp;lt; lci, T, F),
      
      ## Set expected range back in the scale of the 0-10 prediction.
      lower = .fitted + lci,
      upper = .fitted + uci,
      
      # Step 4 - Calculate the difference between the bounds and the actual 
      # value to use for measure of unexpectedness
      remainder = if_else(.resid &amp;lt; 0, averageRating-lower, averageRating-upper)
    ) %&amp;gt;% 
    # Subset columns
    select(episodeOverall, seasonNumber, episodeNumber, episodeTitle, 
           averageRating, .fitted, .resid, 
           anomaly, lower, upper, remainder)
  
  # Determine whether to return all episodes or just the unexpected episodes
  if(onlyAnomalies == T){
    return(results %&amp;gt;% filter(anomaly == T))
  }
  else{
    return(results)
  }
}&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;results&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Results&lt;/h2&gt;
&lt;p&gt;The function above needs to be run individually on the 700 TV Series in the data. To run all 700 models in a simple way I use the &lt;a href=&#34;https://r4ds.had.co.nz/many-models.html&#34;&gt;&lt;em&gt;Many Models&lt;/em&gt;&lt;/a&gt; framework by nesting data into list-columns and using &lt;code&gt;map&lt;/code&gt; to run the function on each subset of data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;results &amp;lt;- all_tv_details %&amp;gt;% 
  # Create a dataset with 1 row per TV Series with all data in a list-column
  group_by(primaryTitle) %&amp;gt;%
  nest() %&amp;gt;% 
  # Run the function to get the unexpected episodes as a new list-column
  mutate(results = map(data, get_anomalies)) %&amp;gt;% 
  # Break the new list-column back into individual rows
  unnest(results) %&amp;gt;%
  # Drop the original list columns and ungroup the data set
  select(-data) %&amp;gt;% 
  ungroup() %&amp;gt;%
  # Use Glue package to make a pretty label
  mutate(
    lbl = glue(&amp;quot;**{primaryTitle}** S{s}E{e} - {episodeTitle}&amp;quot;,
               s = if_else(as.numeric(seasonNumber) &amp;lt; 10, 
                           glue(&amp;quot;0{seasonNumber}&amp;quot;), glue(&amp;quot;{seasonNumber}&amp;quot;)),
               e = if_else(as.numeric(episodeNumber) &amp;lt; 10, 
                           glue(&amp;quot;0{episodeNumber}&amp;quot;), glue(&amp;quot;{episodeNumber}&amp;quot;))
    )
  )&lt;/code&gt;&lt;/pre&gt;
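&lt;p&gt;As an aside, the zero-padding handled with &lt;code&gt;if_else()&lt;/code&gt; and &lt;code&gt;glue()&lt;/code&gt; above can also be done with base R’s &lt;code&gt;sprintf()&lt;/code&gt;; a small sketch with hypothetical season/episode numbers:&lt;/p&gt;

```r
# sprintf("%02d", ...) zero-pads integers to two digits, matching the
# S01E09-style labels built above. The inputs here are hypothetical.
season = c(1, 10)
episode = c(9, 12)
sprintf("S%02dE%02d", season, episode)  # "S01E09" "S10E12"
```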
&lt;p&gt;And now without further ado… the RESULTS!&lt;/p&gt;
&lt;div id=&#34;the-tv-shows-with-the-most-unexpected-episodes&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;The TV Shows with the Most &lt;strong&gt;Unexpected Episodes&lt;/strong&gt;&lt;/h3&gt;
&lt;p&gt;Overall, 143 episodes were identified as being &lt;strong&gt;unexpectedly&lt;/strong&gt; good or bad. Of this bunch, 64% are &lt;strong&gt;unexpectedly&lt;/strong&gt; bad, showing that it’s more common for a good show to miss than it is for a show to hit an unexpected home run.&lt;/p&gt;
&lt;p&gt;The first thing I want to look at are the 10 TV Series that have the most &lt;strong&gt;unexpected&lt;/strong&gt; episodes both good and bad.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;results %&amp;gt;% 
  # Group by Show
  group_by(primaryTitle) %&amp;gt;% 
  # Count the number of unexpected episodes in total as well as good and bad
  summarize(
    total = n(),
    `unexpectedly bad` = sum(.resid &amp;lt; 0)*-1,
    `unexpectedly good` = sum(.resid &amp;gt; 0)
  ) %&amp;gt;%
  # Get the Top 10 by Total
  slice_max(order_by = total, n = 10, with_ties = F) %&amp;gt;% 
  pivot_longer(
    cols = c(`unexpectedly bad`, `unexpectedly good`),
    names_to = &amp;quot;type&amp;quot;,
    values_to = &amp;quot;episodes&amp;quot;
  ) %&amp;gt;% 
  ggplot(aes(x = episodes, y=fct_reorder(primaryTitle, total), fill = type)) + 
    geom_col() + 
    geom_text(aes(label = if_else(episodes != 0, abs(episodes), NA)), 
              hjust = &amp;quot;inward&amp;quot;, color = &amp;#39;grey90&amp;#39;) +
    labs(x = &amp;quot;# of Unexpected Episodes&amp;quot;,
         y = &amp;quot;TV Series&amp;quot;,
         title = &amp;quot;TV Series with Most &amp;lt;i style = &amp;#39;color:#ba2a22&amp;#39;&amp;gt;Unexpected&amp;lt;/i&amp;gt; Episodes&amp;quot;,
         fill = &amp;quot;&amp;quot;) + 
    scale_fill_viridis_d(option = &amp;quot;cividis&amp;quot;, begin = .2, end = .8, 
                         labels = str_to_title) +
     
    theme(
      legend.position = &amp;#39;top&amp;#39;,
      plot.title = element_markdown(),
      axis.text.x = element_blank(),
      axis.ticks.x = element_blank(),
      axis.title = element_text(size = 12),
      axis.text.y = element_text(size = 10),
      legend.margin = margin(0, 0, -5, 0)
    )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2023/09/28/the-most-unexpectedly-good-and-bad-tv-episodes/index_files/figure-html/unnamed-chunk-8-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Surprisingly, at least to me, SpongeBob SquarePants has the most &lt;strong&gt;unexpected&lt;/strong&gt; episodes with 5, and they’re all unexpectedly bad. This contrasts with Desperate Housewives and Big Bang Theory, whose unexpected episodes are all unexpectedly good.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;the-10-most-unexpectedly-good-episodes&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;The 10 Most &lt;strong&gt;Unexpectedly Good&lt;/strong&gt; Episodes&lt;/h3&gt;
&lt;p&gt;The Top 10 &lt;strong&gt;Unexpectedly Good&lt;/strong&gt; episodes are:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Define Elements for Manual Label
color = c(&amp;quot;Actual\nRating&amp;quot; = &amp;quot;darkred&amp;quot;,
          &amp;quot;Predicted\nRating&amp;quot; = &amp;#39;black&amp;#39;, 
          &amp;quot;Series\nAverage&amp;quot; = &amp;quot;darkblue&amp;quot;)
shape = c(&amp;quot;Actual\nRating&amp;quot; = 19, 
          &amp;quot;Predicted\nRating&amp;quot; = 19, 
          &amp;quot;Series\nAverage&amp;quot; = 1)

# Subset to only the Unexpectedly Good Results
good_results &amp;lt;- results %&amp;gt;% 
  filter(.resid &amp;gt; 0)  %&amp;gt;% 
  slice_max(order_by = remainder, n = 10, with_ties = F)

# Plot
good_results %&amp;gt;% 
  select(lbl, .resid, Predicted = .fitted, Actual = averageRating, 
         lower, upper, remainder) %&amp;gt;% 
  ggplot(aes(x = fct_reorder(lbl, remainder))) + 
  geom_pointrange(aes(y = Predicted, ymin = lower, ymax = upper, 
                      color = &amp;#39;Predicted\nRating&amp;#39;)) + 
  geom_point(aes(y = Actual, color = &amp;quot;Actual\nRating&amp;quot;), size = 2) +
  geom_text(aes(label = Actual, y = Actual), color = &amp;#39;darkred&amp;#39;, nudge_x = .3) +
  geom_text(aes(label = round(lower, 1), y = lower),  nudge_x = .3, size = 3) +
  geom_text(aes(label = round(upper, 1), y = upper),  nudge_x = .3, size = 3) +
  geom_text(aes(label = round(Predicted, 1), y = Predicted),  nudge_x = .3) +
  scale_color_manual(values = color, name = &amp;#39;&amp;#39;) +
  labs(x = &amp;quot;&amp;quot;, y = &amp;quot;IMDB Rating&amp;quot;, 
       title = &amp;quot;Top 10 Unexpectedly &amp;lt;i style = &amp;#39;color:#2E8B57&amp;#39;&amp;gt;Good Episodes &amp;lt;/i&amp;gt;&amp;quot;,
       subtitle = &amp;quot;*As measured by the difference between Prediction Interval and Actual IMDB Episode Rating*&amp;quot;) +
  coord_flip() + 
  theme(
    plot.title = element_markdown(),
    plot.title.position = &amp;#39;plot&amp;#39;,
    plot.subtitle = element_markdown(size = 10),
    panel.grid.major.x = element_blank(),
    axis.text.x = element_markdown(size = 10),
    axis.title.x = element_text(size = 11),
    axis.text.y = element_markdown(size = 9),
    legend.position = &amp;#39;top&amp;#39;,
    legend.margin = margin(0, 0, -5, 0),
    legend.text = element_text(size = 9),
    legend.key.size = unit(0.2, &amp;quot;cm&amp;quot;)
  )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2023/09/28/the-most-unexpectedly-good-and-bad-tv-episodes/index_files/figure-html/unnamed-chunk-9-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;According to this method, the most &lt;strong&gt;unexpectedly good&lt;/strong&gt; episode of TV since 1990 is from &lt;strong&gt;The Fresh Prince of Bel-Air&lt;/strong&gt;’s 4th Season, entitled “Papa’s Got a Brand New Excuse”. It’s a pretty entertaining show in general, but this episode contains quite possibly the most iconic scene from the show. Inquirer.com called the end of this episode &lt;a href=&#34;https://www.inquirer.com/news/bel-air-fresh-prince-dramatic-moments-will-smith-peacock-20220215.html&#34;&gt;“among the most tear-jerking in sitcom history”&lt;/a&gt;.&lt;/p&gt;
&lt;iframe width=&#34;560&#34; height=&#34;315&#34; src=&#34;https://www.youtube.com/embed/wQYQjJeaIek&#34; frameborder=&#34;0&#34; allowfullscreen&gt;
&lt;/iframe&gt;
&lt;p&gt;I’m not really familiar with many of the other episodes on this list, but at least from number 1 it seems like the method works well. Remember, this isn’t looking for the best TV episodes, but for episodes far better than their show would lead you to expect.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;the-10-most-unexpectedly-bad-episodes&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;The 10 Most &lt;strong&gt;Unexpectedly Bad&lt;/strong&gt; Episodes&lt;/h3&gt;
&lt;p&gt;The Top 10 &lt;strong&gt;Unexpectedly Bad&lt;/strong&gt; episodes are:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Subset Data to only Unexpectedly Bad Results
bad_results &amp;lt;- results %&amp;gt;% 
  filter(.resid &amp;lt; 0) %&amp;gt;% 
  slice_min(order_by = remainder, n = 10, with_ties = F)

# Plot
bad_results %&amp;gt;% 
  select(lbl, .resid, Predicted = .fitted, 
         Actual = averageRating, lower, upper, remainder) %&amp;gt;% 
  ggplot(aes(x = fct_reorder(lbl, -remainder))) + 
    geom_pointrange(aes(y = Predicted, ymin = lower, 
                        ymax = upper, color = &amp;#39;Predicted\nRating&amp;#39;)) + 
    geom_point(aes(y = Actual, color = &amp;quot;Actual\nRating&amp;quot;), size = 2) +
    geom_text(aes(label = Actual, y = Actual), 
              color = &amp;#39;darkred&amp;#39;, nudge_x = .3) +
    geom_text(aes(label = round(lower, 1), y = lower),  
              nudge_x = .3, size = 3) +
    geom_text(aes(label = round(upper, 1), y = upper),  
              nudge_x = .3, size = 3) +
    geom_text(aes(label = round(Predicted, 1), y = Predicted),  
              nudge_x = .3) +
    scale_color_manual(values = color, name = &amp;#39;&amp;#39;) +
    labs(x = &amp;quot;&amp;quot;, y = &amp;quot;IMDB Rating&amp;quot;, 
         title = &amp;quot;Top 10 Unexpectedly &amp;lt;i style = &amp;#39;color:#b22222&amp;#39;&amp;gt;Bad Episodes &amp;lt;/i&amp;gt;&amp;quot;,
         subtitle = &amp;quot;*As measured by the difference between Prediction Interval and Actual IMDB Episode Rating*&amp;quot;) +
    coord_flip() + 
    theme(
      plot.title = element_markdown(),
      plot.title.position = &amp;#39;plot&amp;#39;,
      plot.subtitle = element_markdown(size = 10),
      panel.grid.major.x = element_blank(),
      axis.text.x = element_markdown(size = 10),
      axis.title.x = element_text(size = 11),
      axis.text.y = element_markdown(size = 9),
      legend.position = &amp;#39;top&amp;#39;,
      legend.margin = margin(0, 0, -5, 0),
      legend.text = element_text(size = 9),
      legend.key.size = unit(0.2, &amp;quot;cm&amp;quot;)
    )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2023/09/28/the-most-unexpectedly-good-and-bad-tv-episodes/index_files/figure-html/unnamed-chunk-10-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;I have seen more of the bad list than the good list. While I don’t know much about later seasons of Riverdale, I can speak more about #5, which comes from Stranger Things’ 2nd Season. This episode involved a side-quest in which one of the main characters meets a family member who hasn’t appeared in any episode since.&lt;/p&gt;
&lt;p&gt;Also, the Scrubs episode that appears, “My Night to Remember”, is the only clip show in the series.&lt;/p&gt;
&lt;p&gt;In both of these examples, they’re &lt;strong&gt;unexpectedly bad&lt;/strong&gt; because the shows in general are good (they have the two highest upper bounds of the Top 10) but these two episodes did nothing to advance the plot and were ultimately filler. Much like the Ted Lasso episode that motivated this analysis.&lt;/p&gt;
&lt;p&gt;The moral of the story is that people don’t like filler episodes.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;drilling-into-a-few-shows&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Drilling into a Few Shows&lt;/h3&gt;
&lt;p&gt;Just for fun I wrote a general-purpose function, inspired by the RatinGraph charts, that will show any TV Series’ trend lines and expected range as well as highlight any &lt;strong&gt;unexpected&lt;/strong&gt; episodes.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;plot_shows &amp;lt;- function(title){
  # Run the anomaly function on a single show and return all the episodes
  get_anomalies(
    all_tv_details %&amp;gt;% 
      filter(primaryTitle == title),
    onlyAnomalies = F
  ) %&amp;gt;% 
    mutate(primaryTitle = title) %&amp;gt;%
    ggplot(aes(x = episodeOverall, y = averageRating, color = seasonNumber)) + 
    # Plot the expected value range
    geom_ribbon(aes(ymin = lower, ymax = upper), fill = &amp;#39;lightblue&amp;#39;, 
                color = NA, alpha = .3) + 
    # Plot the overall trend line across the entire Series
    geom_smooth( se = F, method = &amp;#39;lm&amp;#39;, lty = 2, color = &amp;#39;grey60&amp;#39;) +
    # Plot the trendlines for each season
    geom_smooth(aes(group = seasonNumber), se = F, method = &amp;#39;lm&amp;#39;, 
                lty = 2, show.legend = F) +
    # Plot the actuals for each episode
    geom_point(alpha = .5) + 
    # Add annotations for any outliers
    geom_label_repel(data = . %&amp;gt;% filter(anomaly == T), size = 3, 
                     min.segment.length = 0, 
                     aes(label = glue(&amp;quot;Season {seasonNumber} Episode {episodeNumber}
                                    {episodeTitle}
                                    Rating: {averageRating}&amp;quot;)),
                     show.legend = F) + 
    guides(color = guide_legend(nrow = 1)) +
    labs(x = &amp;#39;Episodes&amp;#39;, y = &amp;#39;IMDB Rating&amp;#39;, title = title, color = &amp;#39;Season:&amp;#39;) + 
    theme(
      plot.title = element_text(family = &amp;#39;Roboto&amp;#39;),
      legend.position = &amp;#39;bottom&amp;#39;,
    )
  
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;First let’s look at the show with the most &lt;strong&gt;unexpectedly good&lt;/strong&gt; episode, The Fresh Prince of Bel-Air.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;plot_shows(&amp;#39;The Fresh Prince of Bel-Air&amp;#39;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2023/09/28/the-most-unexpectedly-good-and-bad-tv-episodes/index_files/figure-html/unnamed-chunk-12-1.png&#34; width=&#34;672&#34; /&gt;
For most of its episodes the IMDB ratings are a solid 7.5. There are some ups and some downs but nothing like the 9.7 rating that “Papa’s Got a Brand New Excuse” received.&lt;/p&gt;
&lt;p&gt;On the negative side, let’s look at Scrubs:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;plot_shows(&amp;#39;Scrubs&amp;#39;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2023/09/28/the-most-unexpectedly-good-and-bad-tv-episodes/index_files/figure-html/unnamed-chunk-13-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;You can see that the model accounts for Season 9 being rated worse than Seasons 1-8; its episodes aren’t &lt;strong&gt;unexpected&lt;/strong&gt; since the whole season is poorly rated. You can also see the clip episode “My Night To Remember” rating far worse, at 5.4, than the roughly 8 that Seasons 1-8 usually had.&lt;/p&gt;
&lt;p&gt;Finally, let’s look at Ted Lasso since it was the inspiration for this post:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;plot_shows(&amp;#39;Ted Lasso&amp;#39;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2023/09/28/the-most-unexpectedly-good-and-bad-tv-episodes/index_files/figure-html/unnamed-chunk-14-1.png&#34; width=&#34;672&#34; /&gt;
No major outliers. And while the episode “Beard After Hours” (the lowest green dot) is lower than expected, it doesn’t meet the extreme criteria to be flagged here.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;special-thanks&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Special Thanks&lt;/h2&gt;
&lt;p&gt;A special thanks to Cédric Scherer. His posit::conf(2023) presentation on &lt;a href=&#34;https://posit-conf-2023.github.io/dataviz-ggplot2/&#34;&gt;Engaging and Beautiful Data Visualizations with ggplot2&lt;/a&gt; taught me a TON.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;appendix-code-for-the-example-plot-for-stranger-things&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Appendix: Code for the example Plot for Stranger Things&lt;/h2&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;examples &amp;lt;- results %&amp;gt;% filter(primaryTitle == &amp;#39;Stranger Things&amp;#39;)

examples %&amp;gt;% 
  select(lbl, .resid, Predicted = .fitted, Actual = averageRating, lower, upper, remainder) %&amp;gt;% 
  ggplot(aes(x = fct_reorder(lbl, -remainder))) + 
  geom_pointrange(aes(y = Predicted, ymin = lower, ymax = upper, color = &amp;#39;Predicted\nRating&amp;#39;)) + 
  geom_point(aes(y = Actual, color = &amp;quot;Actual\nRating&amp;quot;), size = 2) +
  geom_text(aes(label = Actual, y = Actual), color = &amp;#39;darkred&amp;#39;, nudge_x = .1) +
  geom_text(aes(label = round(lower, 1), y = lower),  nudge_x = .05, size = 3) +
  geom_text(aes(label = round(upper, 1), y = upper),  nudge_x = .05, size = 3) +
  geom_text(aes(label = round(Predicted, 1), y = Predicted),  nudge_x = .1) +
  annotate(
    &amp;quot;richtext&amp;quot;,
    y = 8.5,
    x = 1.3,
    size = 3,
    label = &amp;quot;*The expected IMDB rating for this episode of &amp;lt;br&amp;gt; Stranger Things is between 7.4 and 9.7*&amp;quot;,
    family = &amp;quot;Asap SemiCondensed&amp;quot;,
    label.color = NA
  ) + 
  annotate(
    &amp;#39;curve&amp;#39;,
    xend = 1,
    x = .7,
    yend = 6.1,
    y = 6.5,
    curvature = .25,
    arrow = arrow(
      length = unit(7, &amp;quot;pt&amp;quot;),
      type = &amp;quot;closed&amp;quot;
    )
  ) + 
  annotate(
    &amp;#39;text&amp;#39;,
    x = .7,
    y = 6.5,
    label = &amp;quot;The episode had a\n6.1 rating on IMDB&amp;quot;,
    family = &amp;quot;Asap SemiCondensed&amp;quot;,
    size = 3,
    vjust = 1
  ) + 
  annotate(
    &amp;#39;curve&amp;#39;,
    x = 1,
    xend = 1,
    y = 6.1,
    yend = 7.4,
    color = &amp;#39;darkred&amp;#39;,
    lty = 2,
    curvature = -.3,
    arrow = arrow(
      length = unit(7, &amp;quot;pt&amp;quot;),
      type = &amp;quot;closed&amp;quot;,
      ends = &amp;#39;both&amp;#39;
    )
  ) + 
  annotate(
    &amp;#39;richtext&amp;#39;,
    x = 1.35,
    y = 6.75,
    size = 3,
    color = &amp;#39;darkred&amp;#39;,
    family = &amp;quot;Asap SemiCondensed&amp;quot;,
    label = &amp;quot;&amp;lt;i&amp;gt;The &amp;lt;b&amp;gt;&amp;#39;unexpectedness&amp;#39;&amp;lt;/b&amp;gt; is the difference between&amp;lt;br&amp;gt;the outer bound (7.4) and the actual (6.1)
    &amp;lt;br&amp;gt;7.4 - 6.1 = 1.3&amp;quot;,
    label.color = NA,
    fill = NA
  ) + 
  scale_color_manual(values = color, name = &amp;#39;&amp;#39;) +
  labs(x = &amp;quot;&amp;quot;, y = &amp;quot;IMDB Rating&amp;quot;, 
       title = &amp;quot;&amp;lt;span style = &amp;#39;color:#ff1515&amp;#39;&amp;gt;Stranger Things&amp;lt;/span&amp;gt; S02E07 - Chapter Seven: The Lost Sister&amp;quot;,
       subtitle = &amp;quot;Outer bounds are defined based on **3xIQR**&amp;quot;) +
  coord_flip() + 
  theme(
    plot.title = element_markdown(),
    plot.title.position = &amp;#39;plot&amp;#39;,
    plot.subtitle = element_markdown(size = 10),
    panel.grid.major.x = element_blank(),
    axis.text.x = element_text(size = 10),
    axis.title.x = element_text(size = 11),
    axis.text.y = element_blank(),
    axis.ticks.y = element_blank(),
    legend.position = &amp;#39;top&amp;#39;,
    legend.margin = margin(0, 0, -5, 0),
    legend.text = element_text(size = 10),
    legend.key.size = unit(0.2, &amp;quot;cm&amp;quot;)
  )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2023/09/28/the-most-unexpectedly-good-and-bad-tv-episodes/index_files/figure-html/unnamed-chunk-15-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>When Will NYC&#39;s Subway Ridership Recover?</title>
      <link>https://jlaw.netlify.app/2022/08/29/when-will-nyc-s-subway-ridership-recover/</link>
      <pubDate>Mon, 29 Aug 2022 00:00:00 +0000</pubDate>
      <guid>https://jlaw.netlify.app/2022/08/29/when-will-nyc-s-subway-ridership-recover/</guid>
      <description>


&lt;p&gt;While writing my posts about COVID’s effect on NYC Subway ridership the New York Times published an article called &lt;a href=&#34;https://www.nytimes.com/2022/08/15/nyregion/mta-nyc-budget.html&#34;&gt;&lt;em&gt;The Pandemic Wasn’t Supposed to Hurt New York Transit This Much&lt;/em&gt;&lt;/a&gt;. The article states:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;nytimes.PNG&#34; /&gt;&lt;/p&gt;
&lt;p&gt;I believe the 80% target by 2026 comes from a McKinsey study. While I don’t know the details of the study, I do have some subway fare data sitting around, so why not compare the article’s projections to my own?&lt;/p&gt;
&lt;p&gt;The methodology will be similar to what I did in my &lt;a href=&#34;https://jlaw.netlify.app/2022/07/13/how-much-has-covid-cost-the-nyc-subway-system-in-lost-fares/&#34;&gt;initial subway forecasting post&lt;/a&gt; using the &lt;code&gt;modeltime&lt;/code&gt; package and the champion model Prophet w/ XGBoosted Errors to do the forecasting.&lt;/p&gt;
&lt;div id=&#34;libraries&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Libraries&lt;/h2&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;### Data Manipulation Packages
library(timetk) # For time series features in recipe
library(tidyverse) # General Data Manipulation
library(scales) # Making prettier scales
library(lubridate) # Dealing with Dates

# Modeling Ecosystem
library(modeltime) # Framework for Time Series models
library(tidymodels) # Framework for general recipe and workflows

### Model Packages
library(prophet) # Algorithm for forecasting&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;data&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Data&lt;/h2&gt;
&lt;p&gt;The data is the same as from my &lt;a href=&#34;https://jlaw.netlify.app/2022/07/13/how-much-has-covid-cost-the-nyc-subway-system-in-lost-fares/&#34;&gt;initial post&lt;/a&gt;. It’s initially at the week/station/fare-type level. For this exercise I just need the data at the weekly level.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;fares &amp;lt;- readRDS(file.path(here(), &amp;#39;content&amp;#39;, &amp;#39;post&amp;#39;, 
                           &amp;#39;2022-07-13-how-much-has-covid-cost-the-nyc-subway-system-in-lost-fares&amp;#39;, 
                           &amp;#39;data&amp;#39;,
                           &amp;#39;mta_data.RDS&amp;#39;)) %&amp;gt;% 
  group_by(week_start) %&amp;gt;% 
  summarize(fares = sum(fares))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;My &lt;a href=&#34;https://jlaw.netlify.app/2022/07/13/how-much-has-covid-cost-the-nyc-subway-system-in-lost-fares/&#34;&gt;first blog post&lt;/a&gt; in this series covered the &lt;code&gt;modeltime&lt;/code&gt; package in more detail for trying out many different forecasting models. That post found that Prophet with XGBoosted Errors was the best model. Here I’ll be replicating that workflow for that model type.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Step 1: Defining the pre-processing recipe&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;This step defines the forecasting formula as predicting fares based on all other features. Then it creates a number of time-series-specific features from the date field via &lt;code&gt;step_timeseries_signature&lt;/code&gt;. &lt;code&gt;step_rm&lt;/code&gt; removes some variables created in the prior step that aren’t useful, and finally &lt;code&gt;step_dummy&lt;/code&gt; turns all the categorical variables into one-hot encoded indicators. Here I also restrict the training data to MTA fares from the COVID period (after April 1, 2020), since training on the prior time period would give very strange results.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;rec &amp;lt;- recipe(fares ~ ., data = fares %&amp;gt;% filter(week_start &amp;gt;= ymd(20200401))) %&amp;gt;%
  step_timeseries_signature(week_start) %&amp;gt;% 
  step_rm(matches(&amp;quot;(.iso$)|(am.pm$)|(.xts$)|(hour)|(minute)|(second)|(wday)&amp;quot;)) %&amp;gt;% 
  step_dummy(all_nominal(), one_hot = TRUE)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Step 2: Define the Model Workflow and Fit the Model&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Sticking with the &lt;code&gt;tidymodels&lt;/code&gt; framework, here I define a workflow which will consist of the recipe created in &lt;strong&gt;Step 1&lt;/strong&gt; through &lt;code&gt;add_recipe&lt;/code&gt; and the model specified through &lt;code&gt;add_model()&lt;/code&gt;. Within &lt;code&gt;add_model()&lt;/code&gt; the model type is set to Boosted Prophet. I believe ‘prophet_xgboost’ is the default engine, so &lt;code&gt;set_engine&lt;/code&gt; isn’t strictly necessary, but it’s good to keep around anyway.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;prophet_boost_wf &amp;lt;- workflow() %&amp;gt;%
  add_model(
    prophet_boost(seasonality_yearly = TRUE) %&amp;gt;%
      set_engine(&amp;#39;prophet_xgboost&amp;#39;)
  ) %&amp;gt;% 
  add_recipe(rec) %&amp;gt;%
  fit(fares %&amp;gt;% filter(week_start &amp;gt;= ymd(20200401)) )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;Step 3: Using the Model to Forecast the Future&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;In this instance I don’t have a test set to work with, so I’m jumping directly into forecasting. Also, since I don’t know how long it will take for ridership to recover to pre-COVID levels, I’ll set the forecast horizon to 6 years via the &lt;code&gt;h&lt;/code&gt; parameter. Passing in &lt;code&gt;actual_data&lt;/code&gt; lets the actuals be included in the output data set.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;final_fcst &amp;lt;- modeltime_table(
    prophet_boost_wf
  ) %&amp;gt;% 
  modeltime_forecast(
    h = &amp;quot;6 years&amp;quot;,
    actual_data = fares,
    keep_data = TRUE
  )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;em&gt;Visualizing the Forecast&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;modeltime&lt;/code&gt; package makes it easy to visualize the data through the &lt;code&gt;plot_modeltime_forecast()&lt;/code&gt; function. The default is to create a &lt;code&gt;plot.ly&lt;/code&gt; plot, but that can be converted to a &lt;code&gt;ggplot2&lt;/code&gt; plot by setting &lt;code&gt;.interactive&lt;/code&gt; to &lt;code&gt;FALSE&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;final_fcst %&amp;gt;% 
  plot_modeltime_forecast(.interactive = F) + 
  scale_y_continuous(labels = comma)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2022/08/29/when-will-nyc-s-subway-ridership-recover/index_files/figure-html/unnamed-chunk-6-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;when-will-subway-fares-return-to-80-of-pre-covid-to-100&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;When will Subway fares return to 80% of Pre-COVID? To 100%?&lt;/h2&gt;
&lt;p&gt;Now we can see how close my forecast is to the New York Times report. I don’t actually know what the NY Times considers Pre-COVID levels, but for my purposes I’ll use the average number of weekly fares in December 2019 as my Pre-COVID baseline.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;baseline &amp;lt;- fares %&amp;gt;% 
  filter(month(week_start)==12 &amp;amp; year(week_start) == 2019) %&amp;gt;% 
  summarize(avg_fares = mean(fares)) %&amp;gt;% 
  pull(avg_fares)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;From the projection plot above it’s clear that there is a seasonality that peaks in the fall and dips from December through the New Year. To declare victory at 80% I’m going to require four consecutive weeks of fares above 80% of the Pre-COVID baseline.&lt;/p&gt;
&lt;p&gt;I’m not sure of a great way to find the earliest date of the first run of 4 weeks above a threshold, but I’m working it out in three steps:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Define an indicator for whether that week is above 80% (&lt;code&gt;above_80_ind&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Run a counter for each time that the indicator flips from 0 to 1 (&lt;code&gt;run_id_80&lt;/code&gt;) to get an id for each run&lt;/li&gt;
&lt;li&gt;For each &lt;code&gt;run_id_80&lt;/code&gt; get the sum of &lt;code&gt;above_80_ind&lt;/code&gt; to represent the length of each run (&lt;code&gt;run_length_80&lt;/code&gt;)&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;rec_pct &amp;lt;- final_fcst %&amp;gt;% 
  filter(week_start &amp;gt;= ymd(20200401)) %&amp;gt;% 
  # Build Recovery Percentage
  mutate(recovery_pct = .value / baseline) %&amp;gt;%
  # Define Runs of when recovery_pct is above .8
  mutate(
    above_80_ind = (recovery_pct &amp;gt; .8),
    above_100_ind = (recovery_pct &amp;gt; 1)
  ) %&amp;gt;% 
  # Define ID for each time we start a run
  mutate(
    run_id_80 = cumsum(if_else(above_80_ind == 1 &amp;amp; lag(above_80_ind) == 0, 
                               1, 0)),
    run_id_100 = cumsum(if_else(above_100_ind == 1 &amp;amp; lag(above_100_ind) == 0, 
                               1, 0))
  ) %&amp;gt;% 
  add_count(run_id_80, wt = above_80_ind, name = &amp;quot;run_length_80&amp;quot;) %&amp;gt;%
  add_count(run_id_100, wt = above_100_ind, name = &amp;quot;run_length_100&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now I can plot the recovery percentage by week and show that the first time there are four consecutive weeks above 80% is 2025-07-05 and the first time there are four consecutive weeks above 100% of the Pre-COVID value is 2027-06-26.&lt;/p&gt;
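&lt;p&gt;Those dates can also be pulled directly out of &lt;code&gt;rec_pct&lt;/code&gt; rather than read off the plot. As a sketch using the run columns created above (not code from the original analysis):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Sketch: first week of the first run of 4+ weeks above each threshold
rec_pct %&amp;gt;% 
  filter(run_length_80 &amp;gt;= 4, above_80_ind) %&amp;gt;% 
  summarize(first_week_above_80 = min(week_start))

rec_pct %&amp;gt;% 
  filter(run_length_100 &amp;gt;= 4, above_100_ind) %&amp;gt;% 
  summarize(first_week_above_100 = min(week_start))&lt;/code&gt;&lt;/pre&gt;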
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;rec_pct %&amp;gt;% 
  ggplot(aes(x = week_start, y = recovery_pct)) + 
    geom_line(color = &amp;quot;#0039A6&amp;quot;) + 
    geom_segment(aes(x = min(week_start), 
                     xend = rec_pct[which.max(rec_pct$run_length_80 &amp;gt;= 4), ]$week_start,
                     y = .8,
                     yend = .8), lty = 2) + 
    geom_segment(aes(x = rec_pct[which.max(rec_pct$run_length_80 &amp;gt;= 4), ]$week_start,
                     xend = rec_pct[which.max(rec_pct$run_length_80 &amp;gt;= 4), ]$week_start,
                     y = 0,
                     yend = .8), lty = 2) + 
    geom_segment(aes(x = min(week_start), 
                     xend = rec_pct[which.max(rec_pct$run_length_100 &amp;gt;= 4), ]$week_start,
                     y = 1,
                     yend = 1), lty = 2) + 
    geom_segment(aes(x = rec_pct[which.max(rec_pct$run_length_100 &amp;gt;= 4), ]$week_start,
                     xend = rec_pct[which.max(rec_pct$run_length_100 &amp;gt;= 4), ]$week_start,
                     y = 0,
                     yend = 1), lty = 2) + 
    scale_x_date(breaks = &amp;quot;1 years&amp;quot;,
                 labels = year,
                 expand = c(0, 0)) + 
    scale_y_continuous(labels = percent, expand = c(0, 0),
                       breaks = seq(0, 1.6, .2)) + 
    labs(title = &amp;quot;Projected MTA Recovery vs. Pre-COVID&amp;quot;,
         subtitle = &amp;quot;Pre-COVID Baseline from December 2019&amp;quot;, 
         x = &amp;quot;Date&amp;quot;, y = &amp;quot;% of Dec 2019 Baseline&amp;quot;) + 
    cowplot::theme_cowplot()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2022/08/29/when-will-nyc-s-subway-ridership-recover/index_files/figure-html/unnamed-chunk-9-1.png&#34; width=&#34;672&#34; /&gt;
Based on this projection, the NY Times article is slightly pessimistic. According to the above, NYC should reach 80% of the Pre-COVID baseline by mid-2025, which is earlier than the article’s projection of 2026.&lt;/p&gt;
&lt;p&gt;Who will be right? We’ll have to wait at least 3 years to find out!&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Exploring Types of Subway Fares with Hierarchical Forecasting</title>
      <link>https://jlaw.netlify.app/2022/08/24/exploring-types-of-subway-fares-with-hierarchical-forecasting/</link>
      <pubDate>Wed, 24 Aug 2022 00:00:00 +0000</pubDate>
      <guid>https://jlaw.netlify.app/2022/08/24/exploring-types-of-subway-fares-with-hierarchical-forecasting/</guid>
      <description>


&lt;p&gt;In my &lt;a href=&#34;https://jlaw.netlify.app/2022/07/13/how-much-has-covid-cost-the-nyc-subway-system-in-lost-fares/&#34;&gt;prior post&lt;/a&gt; I used forecasting to look at the effect of COVID on the expected amount of New York City subway swipes. In this post I will drill a level deeper and run forecasts for the various types of subway fares to see whether any particular type has recovered better or worse than the others.&lt;/p&gt;
&lt;p&gt;The goal for this post will be to create a top-level forecast for total NYC subway fares and forecasts for each of the types of subway fares. The sub-levels of individual subway fares form a natural hierarchy with the total number. For my forecast, I’d like the forecasts for the sub-levels and for the total to match each other for the sake of consistency. This is called “hierarchical forecasting”. More details can be found in Rob Hyndman and George Athanasopoulos’ &lt;a href=&#34;https://otexts.com/fpp3/hierarchical.html&#34;&gt;Forecasting: Principles and Practice&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The book makes use of the &lt;code&gt;fable&lt;/code&gt; and &lt;code&gt;tsibble&lt;/code&gt; packages.&lt;/p&gt;
&lt;div id=&#34;libraries&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Libraries&lt;/h2&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tsibble) # Data Structure for Time Series
library(tidyverse) # Data Manipulation Packages
library(fable) # Time Series Forecasting Models
library(lubridate) # Date Manipulation
library(scales) # Convenience Functions for Percents&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;data-preparation&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Data Preparation&lt;/h2&gt;
&lt;p&gt;The data set will be the same as from the &lt;a href=&#34;https://jlaw.netlify.app/2022/07/13/how-much-has-covid-cost-the-nyc-subway-system-in-lost-fares/&#34;&gt;prior blog post&lt;/a&gt; which contains weekly Subway data by station, card type, and week from May 2010 through June 2022. Please see the &lt;a href=&#34;https://jlaw.netlify.app/2022/07/13/how-much-has-covid-cost-the-nyc-subway-system-in-lost-fares/&#34;&gt;previous post&lt;/a&gt; for more details on the data processing. The raw fare files come from the &lt;a href=&#34;http://web.mta.info/developers/fare.html&#34;&gt;MTA’s Website&lt;/a&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;dt &amp;lt;- readRDS(here(&amp;#39;content/post/2022-07-13-how-much-has-covid-cost-the-nyc-subway-system-in-lost-fares/data/mta_data.rds&amp;#39;))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In the data set there are 30 different fare types; however, I really don’t want to create 30 different forecasts, especially since some of them are low volume. The top 5 fare types make up 93% of the fares, so I’ll group the other 25 into an “other” category. Then I aggregate the data set to the week and fare_type level and add up the fares column, which represents the number of swipes for each fare type.&lt;/p&gt;
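&lt;p&gt;As a quick sanity check on the 93% figure (a hypothetical snippet, not part of the original analysis), the shares can be computed with a weighted &lt;code&gt;count()&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Sketch: cumulative share of total fares covered by the top fare types
dt %&amp;gt;% 
  count(fare_type, wt = fares, sort = TRUE) %&amp;gt;% 
  mutate(share = n / sum(n),
         cum_share = cumsum(share)) %&amp;gt;% 
  head(5)&lt;/code&gt;&lt;/pre&gt;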
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;dt_by_fare &amp;lt;- dt %&amp;gt;%
  #Remove Out of Pattern Thursday File
  filter(week_start != &amp;#39;2010-12-30&amp;#39;) %&amp;gt;%
  #Clean Up fare types and create date fields
  mutate(
    week_start = ymd(week_start),
    year_week = yearweek(week_start),
    fare_type = case_when(
      fare_type == &amp;#39;ff&amp;#39; ~ &amp;#39;full_fare&amp;#39;,
      fare_type == &amp;#39;x30_d_unl&amp;#39; ~ &amp;#39;monthly_unlimited&amp;#39;,
      fare_type == &amp;#39;x7_d_unl&amp;#39; ~ &amp;#39;weekly_unlimited&amp;#39;,
      fare_type == &amp;#39;students&amp;#39; ~ &amp;#39;student&amp;#39;,
      fare_type == &amp;#39;sen_dis&amp;#39; ~ &amp;#39;seniors&amp;#39;,
      TRUE ~ &amp;#39;other&amp;#39;
    )
  ) %&amp;gt;% 
  group_by(week_start, year_week, key,  fare_type) %&amp;gt;% 
  # Drop all the groupings during summary
  summarize(fares = sum(fares),  .groups = &amp;#39;drop&amp;#39;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now the data set has gone from 7,244,430 rows to 3,702.&lt;/p&gt;
&lt;p&gt;To be able to use the &lt;code&gt;fable&lt;/code&gt; package to do forecasting, the data needs to be in the &lt;code&gt;tsibble&lt;/code&gt; format. This constructor takes a “key” and an “index” parameter. The “key” is the grouping factor, which in this case is &lt;em&gt;fare_type&lt;/em&gt;, and the “index” is the time parameter, which will be the &lt;em&gt;year_week&lt;/em&gt; field.&lt;/p&gt;
&lt;p&gt;Then, to build the “hierarchical” structure into the data, the &lt;code&gt;aggregate_key&lt;/code&gt; function from &lt;code&gt;fabletools&lt;/code&gt; is used. Specifying that the structure is aggregated over the fare types by adding up the fares allows for forecast reconciliation later, ensuring that the forecast outputs are coherent.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;dt_ts &amp;lt;- tsibble(dt_by_fare, key = fare_type, index = year_week) %&amp;gt;% 
  aggregate_key(fare_type, fares = sum(fares))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;dt_ts&lt;/code&gt; data set is now 628 rows greater than the &lt;code&gt;dt_by_fare&lt;/code&gt; data set. This is because of the aggregated layer that was generated from &lt;code&gt;aggregate_key()&lt;/code&gt;. The 628 is the number of distinct weeks in the data.&lt;/p&gt;
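&lt;p&gt;That bookkeeping can be verified directly (a hypothetical check; the numbers come from the post itself):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Sketch: the rows added by aggregate_key() should equal the number of weeks
nrow(dt_ts) - nrow(dt_by_fare)    # one &amp;lt;aggregated&amp;gt; row per week
n_distinct(dt_by_fare$year_week)  # 628 distinct weeks per the post&lt;/code&gt;&lt;/pre&gt;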
&lt;p&gt;If continuing down the forecasting path there would eventually be an error during the forecast step due to a missing value in the initial time series. The &lt;code&gt;scan_gaps()&lt;/code&gt; function from &lt;code&gt;tsibble&lt;/code&gt; will look for implicit missing observations (gaps in the index). The &lt;code&gt;count_gaps()&lt;/code&gt; function will also provide a similar summary.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;scan_gaps(dt_ts) %&amp;gt;% 
  count(year_week) %&amp;gt;%
  kable()&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;year_week&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;n&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2011 W18&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2013 W16&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;7&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The function shows that I’m missing the data for the 18th week of 2011 and the 16th week of 2013. At first I thought this was a problem with my earlier data processing, but when visiting the &lt;a href=&#34;http://web.mta.info/developers/fare.html&#34;&gt;MTA website&lt;/a&gt; those files are actually missing.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;missing.png&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Notice that the file for May 21st, 2011 is not listed. Same with May 4th, 2013.&lt;/p&gt;
&lt;p&gt;To get around this issue, I need to first turn the implicit missings into explicit NAs. This can be done with &lt;code&gt;tsibble&lt;/code&gt;’s &lt;code&gt;fill_gaps()&lt;/code&gt; function which adds in &lt;em&gt;NA&lt;/em&gt;s for the missing dates.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;dt_ts &amp;lt;- dt_ts  %&amp;gt;% 
  group_by_key() %&amp;gt;% 
  fill_gaps()


dt_ts %&amp;gt;% 
  head() %&amp;gt;% 
  kable()&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;year_week&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;fare_type&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;fares&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2011 W18&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;full_fare&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2013 W16&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;full_fare&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2010 W21&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;full_fare&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;11545507&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2010 W22&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;full_fare&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;12580200&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2010 W23&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;full_fare&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;12820291&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2010 W24&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;full_fare&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;12707781&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Notice that the two missing dates now appear. However, the forecasting is also going to have problems with the &lt;em&gt;NA&lt;/em&gt; values. So I’ll need to fill in a value. For simplicity, I’m going to use &lt;code&gt;tidyr&lt;/code&gt;’s &lt;code&gt;fill&lt;/code&gt; function and just use the previous value.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;dt_ts &amp;lt;- dt_ts %&amp;gt;% 
  arrange(year_week) %&amp;gt;% 
  fill(fares, .direction = &amp;#39;down&amp;#39;)

dt_ts %&amp;gt;% 
  filter(year_week %in% c(yearweek(&amp;#39;2011 W17&amp;#39;), 
                          yearweek(&amp;#39;2011 W18&amp;#39;), 
                          yearweek(&amp;#39;2011 W19&amp;#39;)
                          ),
         fare_type == &amp;#39;full_fare&amp;#39;
           ) %&amp;gt;% 
  arrange(fare_type) %&amp;gt;% 
  kable()&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;year_week&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;fare_type&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;fares&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2011 W17&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;full_fare&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;13795196&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2011 W18&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;full_fare&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;13795196&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2011 W19&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;full_fare&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;13794517&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;
&lt;div id=&#34;forecasting&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Forecasting&lt;/h2&gt;
&lt;p&gt;The objective of this post is to determine which types of Subway fares have been most affected by COVID. In order to do this I’ll consider 2010-2019 to be the pre-COVID period on which the forecasting model will be built, and then I’ll forecast 2020 through June 2022 and compare to the actuals.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;fable&lt;/code&gt; package uses the &lt;code&gt;model()&lt;/code&gt; function to set and fit forecasts. In this case I’m creating a forecast named &lt;em&gt;base&lt;/em&gt; and using an ARIMA model on the univariate time series for fares. If I had wanted to use Exponential Smoothing I would just change &lt;code&gt;ARIMA()&lt;/code&gt; to &lt;code&gt;ETS()&lt;/code&gt;. So in short, &lt;code&gt;fable&lt;/code&gt; provides a simple mechanism to fit forecasts.&lt;/p&gt;
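&lt;p&gt;As a sketch of how easily models can be swapped or compared side-by-side (illustrative only; the post itself fits just the ARIMA model):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Sketch: fitting multiple candidate models at once in fable
candidate_fits &amp;lt;- dt_ts %&amp;gt;% 
  filter(year(year_week) &amp;lt; 2020) %&amp;gt;% 
  model(
    arima = ARIMA(fares),
    ets = ETS(fares)
  )&lt;/code&gt;&lt;/pre&gt;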
&lt;p&gt;As it presently stands, the &lt;strong&gt;base&lt;/strong&gt; forecast for the aggregate time series does not have to match the total of the individual series. The &lt;code&gt;reconcile()&lt;/code&gt; function lets you choose the method by which the forecasts across the key structure of the data will be made to agree.&lt;/p&gt;
&lt;p&gt;In this example, I’m trying out:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Bottom-Up: Make the aggregate level equal the sum of the individuals&lt;/li&gt;
&lt;li&gt;Top-Down: Make the individual forecasts equal the aggregate series&lt;/li&gt;
&lt;li&gt;Min Trace: Reconciliation using the minimum trace combination method, which looks to &lt;a href=&#34;https://otexts.com/fpp3/reconciliation.html&#34;&gt;minimize the forecast variances of the set of coherent forecasts&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;fit &amp;lt;- dt_ts %&amp;gt;% 
  filter(year(year_week)&amp;lt; 2020) %&amp;gt;%
  model(base = ARIMA(fares))%&amp;gt;%
  reconcile(bottom_up = bottom_up(base),
            top_down = top_down(base),
            min_trace = min_trace(base, &amp;quot;mint_shrink&amp;quot;))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;fit&lt;/code&gt; object now contains four types of forecasts (base, bottom_up, top_down, min_trace) for each fare type and for the aggregation of the fare types.&lt;/p&gt;
&lt;p&gt;Forecasting the 2020+ data is handled by the &lt;code&gt;forecast()&lt;/code&gt; function. The &lt;code&gt;fit&lt;/code&gt; object is piped into &lt;code&gt;forecast()&lt;/code&gt; and the 2020+ data gets passed into the &lt;em&gt;new_data&lt;/em&gt; argument.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;fc &amp;lt;- fit %&amp;gt;% 
  forecast(new_data = dt_ts %&amp;gt;% filter(year(year_week) &amp;gt;= 2020))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;fc&lt;/code&gt; object now contains the four forecasts for each fare type and the aggregate forecast for the last 2.5 years of data. This can be displayed with the &lt;code&gt;autoplot()&lt;/code&gt; function.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;autoplot(fc, dt_ts %&amp;gt;% ungroup(), level = NULL) + 
  facet_wrap(~fare_type, scales = &amp;quot;free_y&amp;quot;) + 
  scale_y_continuous(labels = scales::comma_format()) + 
  labs(color = &amp;quot;&amp;quot;, x = &amp;quot;Date&amp;quot; ,y = &amp;quot;Number of Fares&amp;quot;) + 
  theme(
    legend.position = &amp;#39;bottom&amp;#39;,
    axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1)
  )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2022/08/24/exploring-types-of-subway-fares-with-hierarchical-forecasting/index_files/figure-html/visual-1.png&#34; width=&#34;100%&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;so-did-the-forecasts-reconcile-correctly&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;So did the forecasts reconcile correctly?&lt;/h2&gt;
&lt;p&gt;Since this post is about Hierarchical Time Series it will be important to check to see if the reconciliation works. In the following chart, I will add up the fare type forecasts for each of the four forecasting models and compare them to the aggregate forecast. For simplicity I will just choose a single data point.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;fc %&amp;gt;% filter(year_week == yearweek(&amp;#39;2020 W01&amp;#39;)) %&amp;gt;%
  as_tibble() %&amp;gt;% 
  transmute(fare_type = if_else(
    is_aggregated(fare_type), &amp;#39;aggregated&amp;#39;, as.character(fare_type)),
    year_week, model = .model, forecast = .mean) %&amp;gt;% 
  spread(model, forecast) %&amp;gt;% 
  group_by(is_aggregated = ifelse(fare_type == &amp;#39;aggregated&amp;#39;, 
                                  &amp;#39;Top-Level&amp;#39;, 
                                  &amp;#39;Sum of Components&amp;#39;)) %&amp;gt;% 
  summarize(across(where(is.numeric), sum)) %&amp;gt;% 
  gather(model, value, -is_aggregated) %&amp;gt;% 
  ggplot(aes(x = model, y = value, fill = is_aggregated)) + 
    geom_col(position = &amp;#39;dodge&amp;#39;) + 
    geom_text(aes(label = paste0(round(value/1e6, 1), &amp;quot;MM&amp;quot;)), vjust = 0,
              position = position_dodge(width = 1)) +
    coord_cartesian(ylim = c(30e6, 30.6e6)) + 
    scale_y_continuous(labels = function(x){paste0(x/1e6, &amp;quot;MM&amp;quot;)}) + 
    scale_fill_viridis_d(option = &amp;quot;C&amp;quot;, begin = .2, end = .8) + 
    labs(title = &amp;quot;Comparing Different Reconciliation Methods&amp;quot;,
         subtitle = &amp;quot;Week 1 2020&amp;quot;,
         caption = &amp;#39;NOTE: y-axis does NOT start at 0&amp;#39;,
         x = &amp;quot;Reconcilation Method&amp;quot;, y = &amp;quot;Total # of Fares&amp;quot;,
         fill = &amp;quot;&amp;quot;) + 
    cowplot::theme_cowplot() + 
    theme(
      axis.text.y = element_blank(),
      axis.ticks.y = element_blank(),
      axis.line.y = element_blank(),
      legend.position = &amp;#39;bottom&amp;#39;,
      legend.direction = &amp;#39;horizontal&amp;#39;
    )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2022/08/24/exploring-types-of-subway-fares-with-hierarchical-forecasting/index_files/figure-html/checking_reconciliation-1.png&#34; width=&#34;100%&#34; /&gt;
In the &lt;em&gt;base&lt;/em&gt; (unreconciled) model the top-level time series is 500K fares higher than the sum of the various fare types. However, we want the forecasts to be consistent with each other, and that’s exactly what we see in the three reconciled models. In the bottom-up model, the “top-level” is scaled down to match the sum of the fare types. In top-down the sum of components is scaled up to match the “top-level”. And min_trace is somewhere in-between.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;how-much-did-each-fare-type-recovery-to-pre-covid-levels&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;How much did each Fare Type recover to Pre-COVID levels?&lt;/h2&gt;
&lt;p&gt;Now that we have the reconciled forecasts, we can actually do the analysis to determine which Fare Types have recovered the most and least relative to pre-COVID levels. This will be done using the maximum available date in the data set and the min_trace forecast.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;bind_rows(
  dt_ts %&amp;gt;% 
    filter(year_week == max(year_week)) %&amp;gt;% 
    as_tibble() %&amp;gt;%
    transmute(fare_type =  if_else(is_aggregated(fare_type), 
                                   &amp;#39;All Fares&amp;#39;, 
                                   as.character(fare_type)), 
              time = &amp;quot;actuals&amp;quot;, 
              fares),
  fc %&amp;gt;% 
    as_tibble() %&amp;gt;% 
    filter(year_week == max(year_week), .model == &amp;quot;min_trace&amp;quot;) %&amp;gt;% 
    as_tibble() %&amp;gt;%
    transmute(fare_type =  if_else(is_aggregated(fare_type), 
                                   &amp;#39;All Fares&amp;#39;, 
                                   as.character(fare_type)), 
              time = &amp;#39;projected&amp;#39;, 
              fares = .mean)
) %&amp;gt;% 
  spread(time, fares) %&amp;gt;% 
  mutate(recovery = actuals / projected) %&amp;gt;% 
  gather(period, fares, -fare_type, -recovery) %&amp;gt;%
  ggplot(aes(x = fct_reorder(fare_type, -fares), y = fares, fill = fct_rev(period))) + 
    geom_col(position = &amp;#39;dodge&amp;#39;) + 
    geom_text(aes(label = paste0(round(fares/1e6, 1), &amp;quot;MM&amp;quot;)), vjust = 0,
              position = position_dodge(width = .9), size = 3) + 
    stat_summary(
      aes(x = fare_type, y = fares),
      geom = &amp;#39;label&amp;#39;,
      inherit.aes = F,
      fontface = &amp;#39;bold&amp;#39;, fill = &amp;#39;lightgrey&amp;#39;, size = 3,
      fun.data = function(x){
        return(data.frame(y = max(x)+8e6,
                          label = paste0((min(x)/max(x)) %&amp;gt;% percent,
                          &amp;quot;\nRecovered&amp;quot;)))
      }
    )  + 
    labs(title = &amp;quot;Actuals vs. Projected Subway Fares&amp;quot;,
         subtitle = &amp;quot;% Recovered is difference between Actual and Projected&amp;quot;,
         caption = &amp;quot;Comparing W24 2022 Data&amp;quot;,
         x = &amp;quot;&amp;quot;,
         y = &amp;quot;# of Fares&amp;quot;,
         fill = &amp;quot;&amp;quot;) + 
    scale_fill_viridis_d(option = &amp;quot;C&amp;quot;, begin = .2, end = .8) + 
    #This link was dope https://stackoverflow.com/questions/22945651/remove-space-between-plotted-data-and-the-axes
    scale_y_continuous(expand = expansion(mult = c(0, .12))) + 
    cowplot::theme_cowplot() + 
    theme(
      legend.position = &amp;#39;bottom&amp;#39;,
      legend.direction = &amp;#39;horizontal&amp;#39;,
      axis.text.y = element_blank(),
      axis.ticks.y = element_blank(),
      axis.line.y = element_blank(),
      axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1)
    )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2022/08/24/exploring-types-of-subway-fares-with-hierarchical-forecasting/index_files/figure-html/recovery-1.png&#34; width=&#34;100%&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Overall, the forecast shows that subway fares have only recovered to 40% of pre-COVID levels. The fare types that have recovered the most are the Student and Senior cards, which makes sense as schools are generally back to in-person instruction. The fare type that has recovered the least is the monthly unlimited card, which also makes sense, as hybrid work arrangements make paying for a full month of unlimited rides a less valuable proposition.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;appendix-measuring-forecast-accuracy&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Appendix: Measuring Forecast Accuracy&lt;/h2&gt;
&lt;p&gt;To end this post, it’s worthwhile to show how I would measure forecast accuracy. The &lt;code&gt;accuracy()&lt;/code&gt; function from &lt;code&gt;fabletools&lt;/code&gt; makes it very easy to compute forecast accuracy metrics: just pass in the forecast, the actuals, and a list of metrics, and you get a tibble back.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;####Appendix: Forecast Accuracy
fc %&amp;gt;%
  accuracy(
    data = dt_ts,
    measures = list(rmse = RMSE, mase = MASE, mape = MAPE)
  ) %&amp;gt;%
  filter(.model == &amp;#39;min_trace&amp;#39;) %&amp;gt;% 
  arrange(mape) %&amp;gt;% 
  kable()&lt;/code&gt;&lt;/pre&gt;
&lt;table style=&#34;width:100%;&#34;&gt;
&lt;colgroup&gt;
&lt;col width=&#34;15%&#34; /&gt;
&lt;col width=&#34;27%&#34; /&gt;
&lt;col width=&#34;9%&#34; /&gt;
&lt;col width=&#34;16%&#34; /&gt;
&lt;col width=&#34;15%&#34; /&gt;
&lt;col width=&#34;16%&#34; /&gt;
&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;.model&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;fare_type&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;.type&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;rmse&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;mase&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;mape&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;min_trace&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;seniors&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Test&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;497395.6&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;9.090555&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;169.1391&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;min_trace&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;full_fare&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Test&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;7735877.4&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;11.608419&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;199.1030&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;min_trace&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;other&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Test&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1835483.2&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;5.458558&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;241.3594&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;min_trace&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;&lt;aggregated&gt;&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Test&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;20683422.4&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;12.284181&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;260.5088&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;min_trace&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;weekly_unlimited&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Test&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;4266233.0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;8.342853&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;295.2711&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;min_trace&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;monthly_unlimited&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Test&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;5788133.3&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;12.763524&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;489.1300&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;min_trace&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;student&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Test&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;703170.5&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.469493&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;43077.3022&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;That said, since we’re trying to predict “what if COVID didn’t happen”, I don’t expect these forecasts to perform very well.&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>How much has COVID cost the NYC Subway system in &#34;lost fares&#34;?</title>
      <link>https://jlaw.netlify.app/2022/07/13/how-much-has-covid-cost-the-nyc-subway-system-in-lost-fares/</link>
      <pubDate>Wed, 13 Jul 2022 00:00:00 +0000</pubDate>
      <guid>https://jlaw.netlify.app/2022/07/13/how-much-has-covid-cost-the-nyc-subway-system-in-lost-fares/</guid>
      <description>


&lt;p&gt;With things in NYC beginning to return to normal after two years of COVID, I found myself thinking about how much money was lost in Subway fares during the 2+ years when many people were working from home. Seeing an opportunity to mess around with some forecasting packages, I set out to determine: “how much money in lost rides has COVID cost the MTA?”&lt;/p&gt;
&lt;p&gt;For this post, I’ll be using the &lt;code&gt;modeltime&lt;/code&gt; &lt;a href=&#34;https://www.business-science.io/code-tools/2020/06/29/introducing-modeltime.html?utm_content=buffer86bae&amp;amp;utm_medium=social&amp;amp;utm_source=twitter.com&amp;amp;utm_campaign=buffer&#34;&gt;package&lt;/a&gt; from Business-Science.io, which integrates time series modeling into the tidymodels ecosystem, making it easy to fit multiple candidate time series models and choose the best one.&lt;/p&gt;
&lt;div id=&#34;libraries&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Libraries&lt;/h2&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;### Data Manipulation Packages
library(timetk)
library(tidyverse)
library(scales)
library(lubridate)

# Modeling Ecosystem
library(modeltime) 
library(tidymodels) 
library(treesnip) 

### Model Packages
library(catboost)
library(prophet)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;data&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Data&lt;/h2&gt;
&lt;p&gt;For this project, I’ll be using the MTA’s weekly &lt;a href=&#34;http://web.mta.info/developers/fare.html&#34;&gt;fare data&lt;/a&gt;, which contains the number of swipes for each fare type at each station. I’d previously scraped data from this website in a &lt;a href=&#34;https://jlaw.netlify.app/2020/09/07/covid-19s-impact-on-the-nyc-subway-system/&#34;&gt;prior blog post&lt;/a&gt;, so I won’t go through the methodology again.&lt;/p&gt;
&lt;p&gt;Since I don’t need station-level or fare-type granularity for this project, I’m going to aggregate the data set to the weekly level.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;dt &amp;lt;- readRDS(&amp;#39;data/mta_data.RDS&amp;#39;) %&amp;gt;% 
  group_by(week_start) %&amp;gt;% 
  summarize(fares = sum(fares))&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;methodology&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Methodology&lt;/h2&gt;
&lt;p&gt;The dataset contains the weekly number of subway swipes from May 2010 through June 2022. To determine the number of “lost fares”, I’m going to build a forecast of the number of swipes from 2020 onwards and use the residuals between the forecast and the actual data to determine the number of “lost swipes”. I don’t reasonably expect any model to accurately predict 2020 onwards, but I still want to ensure the model is reasonable, so I will train on data from 2010 through 2018 and validate on the 2019 data, which should look similar to 2018.&lt;/p&gt;
&lt;p&gt;Based on the validation set I will choose the best model and then using that model I will forecast 2020, 2021, and 2022.&lt;/p&gt;
&lt;p&gt;Ultimately this test plan looks as follows:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;dt %&amp;gt;% 
  mutate(lbl = case_when(
    week_start &amp;lt; ymd(20190101) ~ &amp;quot;a) Train&amp;quot;,
    year(week_start) == 2019 ~ &amp;#39;b) Validate&amp;#39;,
    year(week_start) &amp;gt;= 2020 ~ &amp;#39;c) Test&amp;#39;
  ), 
  total_fares = fares) %&amp;gt;% 
  ggplot(aes(x = week_start)) + 
  geom_line(data = dt, aes(y = fares), color = &amp;#39;grey60&amp;#39;) + 
  geom_line(aes(y = fares, color = lbl)) + 
  labs(title = &amp;#39;Testing Plan for Forecasting&amp;#39;,
       x = &amp;quot;Date&amp;quot;, y = &amp;quot;# of Metrocard Swipes&amp;quot;,
       color = &amp;quot;&amp;quot;) + 
  scale_y_continuous(labels = comma) + 
  facet_wrap(~lbl, nrow = 3) + 
  cowplot::theme_cowplot()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2022/07/13/how-much-has-covid-cost-the-nyc-subway-system-in-lost-fares/index_files/figure-html/test_plan-1.png&#34; width=&#34;100%&#34; /&gt;&lt;/p&gt;
&lt;p&gt;In order to split the data, I’m going to first chop off the 2020+ data into a test dataframe:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;test &amp;lt;- dt %&amp;gt;% filter(year(week_start) &amp;gt;= 2020)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And then use &lt;em&gt;timetk&lt;/em&gt;’s &lt;code&gt;time_series_split&lt;/code&gt; to create the sets that will be used for model development and validation:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;splits &amp;lt;- time_series_split(
  dt %&amp;gt;% filter(year(week_start) &amp;lt; 2020) %&amp;gt;% arrange(week_start),
  assess = 52, cumulative = T)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;assess&lt;/code&gt; option tells the function to put the last 52 weeks of data into the validation set, and the &lt;code&gt;cumulative&lt;/code&gt; option tells it to use all of the remaining data for the training set.&lt;/p&gt;
&lt;p&gt;The training data runs from 2010-05-29 to 2018-12-29 and the validation data runs from 2019-01-05 to 2019-12-28.&lt;/p&gt;
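&lt;p&gt;These date ranges can be pulled straight from the split object (a quick sanity check against the &lt;code&gt;splits&lt;/code&gt; object created above):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Sanity check: date ranges of the training and validation sets
range(training(splits)$week_start)
range(testing(splits)$week_start)&lt;/code&gt;&lt;/pre&gt;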
&lt;/div&gt;
&lt;div id=&#34;modeling&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Modeling&lt;/h2&gt;
&lt;p&gt;The modeling process will use the recipe / workflow pattern from the &lt;code&gt;tidymodels&lt;/code&gt; ecosystem, with add-on packages like &lt;code&gt;modeltime&lt;/code&gt; and &lt;code&gt;treesnip&lt;/code&gt; providing extensions for time series and additional ML algorithms. For a more detailed look at tidymodels, check out my post on &lt;a href=&#34;https://jlaw.netlify.app/2022/01/24/predicting-when-kickers-get-iced-with-tidymodels/&#34;&gt;icing the kicker&lt;/a&gt;.&lt;/p&gt;
&lt;div id=&#34;pre-preprocessing&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Pre-Processing&lt;/h3&gt;
&lt;p&gt;The first step with tidymodels is to set up a recipe for pre-processing and feature engineering. It tells the ecosystem the model formula and what new features to create or remove. In the below recipe, I’m setting the &lt;em&gt;week_start&lt;/em&gt; field to be an “id” as opposed to a predictor because some of the models we’ll try (CatBoost, XGBoost) can’t handle dates. The “id” role means that the data remains but isn’t used in the model.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;step_timeseries_signature()&lt;/code&gt; creates a large number of features based on the date field, such as fields for year, half, quarter, month, day of the year, day of week, etc. It also includes a number of time-based fields that won’t be necessary since this data is at a weekly grain. These unnecessary fields are removed with &lt;code&gt;step_rm()&lt;/code&gt;. Finally, all categorical variables are one-hot encoded into indicator variables using &lt;code&gt;step_dummy()&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;rec &amp;lt;- recipe(fares ~ ., data = training(splits)) %&amp;gt;%
  update_role(week_start, new_role = &amp;#39;id&amp;#39;) %&amp;gt;% 
  step_timeseries_signature(week_start) %&amp;gt;% 
  step_rm(matches(&amp;quot;(.iso$)|(am.pm$)|(.xts$)|(hour)|(minute)|(second)|(wday)&amp;quot;)) %&amp;gt;% 
  step_dummy(all_nominal(), one_hot = TRUE)&lt;/code&gt;&lt;/pre&gt;
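&lt;p&gt;To see exactly which engineered features survive these steps, the recipe can be prepped and applied to the training data (a quick sketch using the &lt;code&gt;rec&lt;/code&gt; object defined above):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Apply the recipe to the training data and inspect the resulting columns
rec %&amp;gt;% 
  prep() %&amp;gt;% 
  bake(new_data = NULL) %&amp;gt;% 
  glimpse()&lt;/code&gt;&lt;/pre&gt;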
&lt;/div&gt;
&lt;div id=&#34;model-fitting&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Model Fitting&lt;/h3&gt;
&lt;p&gt;To determine the best model for the forecasting portion I’m going to look at 6 different modeling workflows:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;CatBoost&lt;/li&gt;
&lt;li&gt;XGBoost&lt;/li&gt;
&lt;li&gt;Auto Arima with XGBoosted Errors&lt;/li&gt;
&lt;li&gt;Exponential Smoothing&lt;/li&gt;
&lt;li&gt;Prophet&lt;/li&gt;
&lt;li&gt;Prophet with XGBoosted Errors&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;For each of these models, I will set up a workflow, add the proper model using the parsnip interface, add the recipe, and fit the model. For the last 4 models, I re-update the role of the &lt;em&gt;week_start&lt;/em&gt; field back to a predictor from an id since those models can use the date field directly.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;catboost_wf &amp;lt;- workflow() %&amp;gt;% 
  add_model(
    boost_tree(mode = &amp;#39;regression&amp;#39;) %&amp;gt;% 
      set_engine(&amp;#39;catboost&amp;#39;)
  ) %&amp;gt;% 
  add_recipe(rec) %&amp;gt;% 
  fit(training(splits))

xgboost_wf &amp;lt;- workflow() %&amp;gt;% 
  add_model(
    boost_tree(mode = &amp;#39;regression&amp;#39;) %&amp;gt;% 
      set_engine(&amp;#39;xgboost&amp;#39;)
  ) %&amp;gt;% 
  add_recipe(rec) %&amp;gt;% 
  fit(training(splits))

arima_boosted_wf &amp;lt;- workflow() %&amp;gt;% 
  add_model(
    arima_boost() %&amp;gt;%
      set_engine(engine = &amp;quot;auto_arima_xgboost&amp;quot;)
  ) %&amp;gt;%
  add_recipe(rec %&amp;gt;% update_role(week_start, new_role = &amp;quot;predictor&amp;quot;)) %&amp;gt;%
  fit(training(splits))


ets_wf &amp;lt;- workflow() %&amp;gt;% 
  add_model(
    exp_smoothing() %&amp;gt;%
      set_engine(engine = &amp;quot;ets&amp;quot;)
  ) %&amp;gt;%
  add_recipe(rec %&amp;gt;% update_role(week_start, new_role = &amp;quot;predictor&amp;quot;)) %&amp;gt;%
  fit(training(splits))

prophet_wf &amp;lt;- workflow() %&amp;gt;%
  add_model(
    prophet_reg(seasonality_yearly = TRUE) %&amp;gt;% 
      set_engine(engine = &amp;#39;prophet&amp;#39;)
  ) %&amp;gt;%
  add_recipe(rec %&amp;gt;% update_role(week_start, new_role = &amp;quot;predictor&amp;quot;)) %&amp;gt;%
  fit(training(splits))

prophet_boost_wf &amp;lt;- workflow() %&amp;gt;%
  add_model(
    prophet_boost(seasonality_yearly = TRUE) %&amp;gt;%
      set_engine(&amp;#39;prophet_xgboost&amp;#39;)
  ) %&amp;gt;% 
  add_recipe(rec %&amp;gt;% update_role(week_start, new_role = &amp;quot;predictor&amp;quot;)) %&amp;gt;%
  fit(training(splits))&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;validating&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Validating&lt;/h3&gt;
&lt;p&gt;To apply these models to the validation set and calculate accuracy, I use the &lt;code&gt;modeltime&lt;/code&gt; package’s &lt;code&gt;modeltime_table()&lt;/code&gt; and &lt;code&gt;modeltime_calibrate()&lt;/code&gt; functions. The former organizes the various workflows into a single object, and the latter computes accuracy metrics based on the validation set of 2019 data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;calibration_table &amp;lt;- modeltime_table(
  catboost_wf,
  xgboost_wf,
  arima_boosted_wf,
  ets_wf,
  prophet_wf,
  prophet_boost_wf
) %&amp;gt;% 
  modeltime_calibrate(testing(splits))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I can then assess the accuracy measures for each model using &lt;code&gt;modeltime_accuracy()&lt;/code&gt;, sorting by root mean squared error (RMSE), which will be the accuracy metric I use to determine the best model.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;calibration_table %&amp;gt;%
  modeltime_accuracy() %&amp;gt;%
  arrange(rmse) %&amp;gt;% 
  select(.model_desc, where(is.double)) %&amp;gt;%
  mutate(across(where(is.double), 
                ~if_else(.x &amp;lt; 10, round(.x, 2), round(.x, 0)))) %&amp;gt;%
  kable()&lt;/code&gt;&lt;/pre&gt;
&lt;table style=&#34;width:100%;&#34;&gt;
&lt;colgroup&gt;
&lt;col width=&#34;45%&#34; /&gt;
&lt;col width=&#34;11%&#34; /&gt;
&lt;col width=&#34;7%&#34; /&gt;
&lt;col width=&#34;7%&#34; /&gt;
&lt;col width=&#34;8%&#34; /&gt;
&lt;col width=&#34;11%&#34; /&gt;
&lt;col width=&#34;7%&#34; /&gt;
&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;.model_desc&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;mae&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;mape&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;mase&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;smape&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;rmse&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;rsq&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;PROPHET W/ XGBOOST ERRORS&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;947892&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;3.00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.59&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;3.04&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1271929&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.76&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;PROPHET W/ REGRESSORS&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1150569&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;3.75&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.71&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;3.76&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1515907&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.63&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;XGBOOST&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1292654&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;4.08&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.80&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;4.25&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1888753&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.59&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;ARIMA(0,1,2) W/ XGBOOST ERRORS&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1515049&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;4.81&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.94&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;4.96&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1946304&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.55&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;CATBOOST&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1900626&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;6.31&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.17&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;6.20&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2362239&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.62&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;ETS(A,N,A)&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1930427&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;6.25&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.19&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;6.31&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2436219&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.08&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;From the accuracy table, the best model was the Prophet w/ XGBoosted Errors.&lt;/p&gt;
&lt;p&gt;The calibration table contains a column called &lt;code&gt;.calibration_data&lt;/code&gt; that holds the validation-set predictions, which I can use to visualize the forecasted fit vs. the actuals for the 2019 data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;calibration_table %&amp;gt;% 
    select(.model_desc, .calibration_data) %&amp;gt;% 
    unnest(cols = c(.calibration_data)) %&amp;gt;% 
    filter(year(week_start)==2019, .model_desc != &amp;#39;ACTUAL&amp;#39;) %&amp;gt;% 
    ggplot(aes(x = week_start)) + 
      geom_line(aes(y = .actual), color = &amp;#39;black&amp;#39;, lty = 2) + 
      geom_line(aes(y = .prediction, color = .model_desc), lwd = 1.2) + 
      facet_wrap(~.model_desc, ncol = 2) + 
      scale_color_discrete(guide = &amp;quot;none&amp;quot;) +
      scale_y_continuous(label = comma) + 
      labs(title = &amp;quot;Comparing Models to Validation Set of 2019&amp;quot;, 
           subtitle = &amp;quot;Dashed Line is Actuals&amp;quot;,
           y = &amp;quot;# of Fares&amp;quot;,
           x = &amp;quot;Date&amp;quot;) + 
      theme_bw() + 
      theme(
        axis.text.x = element_text(angle = 60, hjust = .5, vjust = .5)
      )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2022/07/13/how-much-has-covid-cost-the-nyc-subway-system-in-lost-fares/index_files/figure-html/validation_viz-1.png&#34; width=&#34;100%&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;forecasting-the-covid-time-period&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Forecasting the COVID Time Period&lt;/h2&gt;
&lt;p&gt;Now that I’ve identified the Prophet w/ XGBoosted Errors model as the best model, it’s time to retrain it one final time on both the training and validation data before using it to forecast the COVID time period. The refitting on all of the data is handled by &lt;code&gt;modeltime_refit()&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;refit_tbl &amp;lt;- calibration_table %&amp;gt;% 
    filter(.model_desc ==&amp;#39;PROPHET W/ XGBOOST ERRORS&amp;#39; ) %&amp;gt;%
    modeltime_refit(data = bind_rows(training(splits), testing(splits)))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Finally, the forecasting onto the test set is handled by &lt;code&gt;modeltime_forecast()&lt;/code&gt;. The test data and actuals are passed into the function so that the actuals and forecast can be directly compared.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;final_fcst &amp;lt;- refit_tbl %&amp;gt;% 
  modeltime_forecast(
    new_data = test,
    actual_data = dt,
    keep_data = TRUE
  )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The forecast vs. the actuals can be visualized with &lt;code&gt;plot_modeltime_forecast()&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;final_fcst %&amp;gt;% 
  plot_modeltime_forecast(.conf_interval_show = T, .interactive = F) + 
  scale_y_continuous(labels = comma)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2022/07/13/how-much-has-covid-cost-the-nyc-subway-system-in-lost-fares/index_files/figure-html/final_viz-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;calculating-the-lost-fare-amount&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Calculating the “Lost Fare” Amount&lt;/h2&gt;
&lt;p&gt;Now, with the forecast computed, I can determine the number of lost fares by comparing the forecasted number of fares to the actual number. Then, to convert that to an amount of money, I’m using a simplistic assumption that each fare would have cost about 2 dollars. This is a heuristic since there are &lt;a href=&#34;https://new.mta.info/fares&#34;&gt;many different kinds of fares&lt;/a&gt; in the NYC Subway system with different costs: a full fare costs $2.75, a monthly unlimited card costs $127, and for seniors and other reduced-fare riders the cost is half price at $1.35.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;loss_amt &amp;lt;- final_fcst %&amp;gt;% 
  filter(.model_desc == &amp;#39;PROPHET W/ XGBOOST ERRORS&amp;#39;,
         .index &amp;gt;= min(test$week_start)) %&amp;gt;% 
  mutate(diff = fares-.value,
         diff_lo = fares - .conf_lo,
         diff_hi = fares - .conf_hi,
         fare = diff * 2.00,
         fare_lo = diff_lo * 2.00,
         fare_high = diff_hi* 2.00) %&amp;gt;% 
  arrange(.index) %&amp;gt;%
  mutate(fares_lost = cumsum(fare),
         fares_lost_lo = cumsum(fare_lo),
         fares_lost_high = cumsum(fare_high)) &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Using the confidence intervals of the predictions, I can form a range for how much the MTA has suffered in “lost fares” since 2020.&lt;/p&gt;
&lt;p&gt;Ultimately, this analysis shows that the MTA has likely lost around $5B in fares since 2020, though it could be as low as $4.4B or as high as $5.7B.&lt;/p&gt;
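&lt;p&gt;For reference, those headline numbers come from the final row of the &lt;code&gt;loss_amt&lt;/code&gt; data frame built above. The cumulative columns are negative (actuals minus forecast), so a quick sketch flips them to positive losses:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# The most recent week holds the cumulative totals
loss_amt %&amp;gt;% 
  slice_max(.index, n = 1) %&amp;gt;% 
  transmute(point_estimate = dollar(-fares_lost),
            from_conf_lo = dollar(-fares_lost_lo),
            from_conf_hi = dollar(-fares_lost_high))&lt;/code&gt;&lt;/pre&gt;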
&lt;p&gt;The cumulative loss can be visualized as follows:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;loss_amt %&amp;gt;% 
  filter(.index &amp;gt;= ymd(20200101)) %&amp;gt;%
  ggplot(aes(x = .index, y = fares_lost*-1)) + 
    geom_line() + 
    geom_ribbon(aes(ymin = fares_lost_lo*-1, ymax = fares_lost_high*-1), alpha = .3,
                fill = &amp;#39;darkgreen&amp;#39;) + 
    scale_y_continuous(labels = dollar, breaks = seq(0, 6e9, 1e9), expand = c(0 ,0)) + 
    labs(title = &amp;quot;Cumulative Amount of Subway Fares Lost Since 2020&amp;quot;,
         x = &amp;quot;Date&amp;quot;, y = &amp;quot;$ Lost&amp;quot;, caption = &amp;quot;$ Lost = Projected Swipes Lost * $2.00&amp;quot;) + 
    cowplot::theme_cowplot() + 
    theme(
      plot.title.position = &amp;#39;plot&amp;#39;,
      panel.grid.major.y = element_line(color = &amp;#39;grey45&amp;#39;)
    )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2022/07/13/how-much-has-covid-cost-the-nyc-subway-system-in-lost-fares/index_files/figure-html/loss_viz-1.png&#34; width=&#34;100%&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;concluding-thoughts&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Concluding Thoughts&lt;/h2&gt;
&lt;p&gt;While things are starting to return to “normalcy” on the NYC subway, it’s still far from what it was in pre-COVID times. Based on this forecasting exercise, it’s estimated that the MTA has already lost around $5B in “lost fares”, and that number continues to grow: while things are recovering, there’s still a long way to go.&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>ML for the Lazy: Can AutoML Beat My Model?</title>
      <link>https://jlaw.netlify.app/2022/05/03/ml-for-the-lazy-can-automl-beat-my-model/</link>
      <pubDate>Tue, 03 May 2022 00:00:00 +0000</pubDate>
      <guid>https://jlaw.netlify.app/2022/05/03/ml-for-the-lazy-can-automl-beat-my-model/</guid>
      <description>
&lt;script src=&#34;https://jlaw.netlify.app/2022/05/03/ml-for-the-lazy-can-automl-beat-my-model/index_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;p&gt;In this fourth (and hopefully final) entry in my “Icing the Kicker” series of posts, I’m going to jump back to the &lt;a href=&#34;https://jlaw.netlify.app/2022/01/24/predicting-when-kickers-get-iced-with-tidymodels/&#34;&gt;first post&lt;/a&gt;, where I used &lt;code&gt;tidymodels&lt;/code&gt; to predict whether or not a kick attempt would be iced. However, this time I’ll see whether using the &lt;code&gt;h2o&lt;/code&gt; AutoML feature and the &lt;code&gt;SuperLearner&lt;/code&gt; package can improve the predictive performance of my initial model.&lt;/p&gt;
&lt;div id=&#34;why-is-this-ml-for-the-lazy&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Why is this ML for the Lazy?&lt;/h3&gt;
&lt;p&gt;I called this ML for the Lazy because, for the &lt;code&gt;h2o&lt;/code&gt; and &lt;code&gt;SuperLearner&lt;/code&gt; models, I’m going to do absolutely nothing but let the algorithms run. No tuning, no nothing.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;the-data&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;The Data&lt;/h3&gt;
&lt;p&gt;The data for this exercise was initially described in the &lt;a href=&#34;https://jlaw.netlify.app/2022/01/24/predicting-when-kickers-get-iced-with-tidymodels/&#34;&gt;first&lt;/a&gt; post in the series. During this post I will construct three models:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Replicating the final model from the original &lt;a href=&#34;https://jlaw.netlify.app/2022/01/24/predicting-when-kickers-get-iced-with-tidymodels/&#34;&gt;post&lt;/a&gt; using &lt;code&gt;tidymodels&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;A version using &lt;code&gt;h2o&lt;/code&gt;’s &lt;a href=&#34;https://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html&#34;&gt;autoML&lt;/a&gt; function&lt;/li&gt;
&lt;li&gt;A version using the &lt;a href=&#34;https://github.com/ecpolley/SuperLearner&#34;&gt;&lt;code&gt;SuperLearner&lt;/code&gt;&lt;/a&gt; package for ensembles&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidyverse) #Data Manipulation
library(tidymodels) # Data Splitting and Replicating Initial Model
library(themis) # For SMOTE Recipe
library(h2o) # For AutoML
library(SuperLearner) # For SuperLearner Ensemble.
library(here) # For path simplification&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I’ll read in the data from the first post. This code block should look familiar from the other three posts in the series.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;fg_data &amp;lt;- readRDS(here(&amp;#39;content/post/2022-01-17-predicting-when-kickers-get-iced-with-tidymodels/data/fg_attempts.RDS&amp;#39;)) %&amp;gt;%
  transmute(
    regulation_time_remaining,
    attempted_distance,
    drive_is_home_offense = if_else(drive_is_home_offense, 1, 0),
    score_diff,
    prior_miss = if_else(prior_miss==1, &amp;#39;yes&amp;#39;, &amp;#39;no&amp;#39;),
    offense_win_prob,
    is_overtime = if_else(period &amp;gt; 4, 1, 0),
    is_iced = factor(is_iced, levels = c(1, 0), labels = c(&amp;#39;iced&amp;#39;, &amp;#39;not_iced&amp;#39;))
  )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The next step is to replicate how the data was divided in the training and testing sets from the initial post. This is done using the &lt;code&gt;initial_split()&lt;/code&gt; function from &lt;code&gt;rsample&lt;/code&gt;. The seed will be set to what it originally was so that the same training and testing splits are used.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(20220102)
ice_split &amp;lt;- initial_split(fg_data, strata = is_iced)
ice_train &amp;lt;- training(ice_split)
ice_test &amp;lt;- testing(ice_split)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;model-1-tidymodels&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Model #1: TidyModels&lt;/h3&gt;
&lt;p&gt;To replicate the results from &lt;code&gt;tidymodels&lt;/code&gt;, I will first reconstruct the pre-processing recipe that used one-hot encoding to turn categorical variables into numeric features and applied the SMOTE algorithm to deal with the severe class imbalance in the data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;rec_smote &amp;lt;- recipe(is_iced ~ ., data = ice_train) %&amp;gt;%
  step_dummy(all_nominal_predictors(), one_hot = T) %&amp;gt;%
  step_smote(is_iced) &lt;/code&gt;&lt;/pre&gt;
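&lt;p&gt;As a quick check that SMOTE is doing its job, prepping the recipe and counting the outcome classes should show a balanced training set (a sketch using the &lt;code&gt;rec_smote&lt;/code&gt; object above):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# After SMOTE, the iced / not_iced classes should be balanced
rec_smote %&amp;gt;% 
  prep() %&amp;gt;% 
  bake(new_data = NULL) %&amp;gt;% 
  count(is_iced)&lt;/code&gt;&lt;/pre&gt;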
&lt;p&gt;In that post the final model was a tuned XGBoost model with the following parameters:
&lt;img src=&#34;params.PNG&#34; /&gt;&lt;/p&gt;
&lt;p&gt;So rather than set up a tuning grid, I’ll just build a spec that includes those exact parameters and combine it with the recipe in a workflow:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;orig_wf &amp;lt;- workflow(rec_smote,
               boost_tree(
                 &amp;quot;classification&amp;quot;,
                 mtry = 5,
                 trees = 1641,
                 min_n = 19,
                 tree_depth = 8,
                 learn_rate = 0.007419,
                 loss_reduction = 9.425834,
                 sample_size = 0.9830687,
                 stop_iter = 21
               )) &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The next step is to fit the model on the full training data and predict on the testing data using the &lt;code&gt;last_fit()&lt;/code&gt; function. I will have the function return testing-set metrics for Precision, Recall, and F1 Score.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;orig_results &amp;lt;- last_fit(orig_wf, 
                         ice_split, 
                         metrics=metric_set(f_meas, precision, recall))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The performance metrics can be extracted using the &lt;code&gt;collect_metrics()&lt;/code&gt; function and then I’ll do some post-processing to put it in a format that will eventually be combined with the other models:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;orig_metrics &amp;lt;- collect_metrics(orig_results) %&amp;gt;% 
  transmute(
    label = &amp;quot;Original Model&amp;quot;,
    metric = .metric,
    estimate = .estimate
  )

kable(orig_metrics)&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;label&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;metric&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;estimate&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Original Model&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;f_meas&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.4324324&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Original Model&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;precision&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.3428571&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Original Model&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;recall&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.5853659&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;And the &lt;code&gt;collect_predictions()&lt;/code&gt; function will extract the predictions for the test set to use in a confusion matrix:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;orig_cf &amp;lt;- collect_predictions(orig_results) %&amp;gt;%
  count(is_iced, .pred_class) %&amp;gt;% 
  mutate(label = &amp;quot;Original Model&amp;quot;, .before = 1) %&amp;gt;% 
  rename(pred = .pred_class)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;model-2---h2o-automl&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Model #2 - h2o AutoML&lt;/h3&gt;
&lt;p&gt;The next candidate will be h2o’s AutoML function. h2o is an open-source machine learning platform that runs in Java and has interfaces to R, among other languages. The AutoML feature will auto-magically try different models and eventually construct a leaderboard of the best ones. For this section, the blog post from &lt;a href=&#34;https://rileyking.netlify.app/post/could-automl-win-in-the-sliced-data-science-competition/&#34;&gt;Riley King&lt;/a&gt; was an inspiration, as it used AutoML to compete on data from the Sliced data science competition.&lt;/p&gt;
&lt;p&gt;In order to start using h2o I must first initialize the engine:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;h2o.init()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;h2o also has its own data format which must be used. Fortunately, it’s easy to convert a tibble to this format with &lt;code&gt;as.h2o()&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;train_data &amp;lt;- as.h2o(ice_train)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Due to how h2o is set up, I’ll need to specify the name of the dependent variable (y) as a string and provide the list of predictors as a vector of strings (x). This is most easily done prior to the function call using &lt;code&gt;setdiff()&lt;/code&gt; to remove the dependent variable from the full set of column names.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;y &amp;lt;- &amp;quot;is_iced&amp;quot;
x &amp;lt;- setdiff(names(train_data), y)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now it’s time to run the AutoML function. In the &lt;code&gt;h2o.automl()&lt;/code&gt; function I provide the name of the dependent variable, the vector of independent variables, a project name (which I believe doesn’t matter for this purpose), a boolean to tell it to try to balance classes, and a seed so that results are replicable. The final parameter I give the function is the “max_runtime_secs”. Since the algorithm will continue to spawn new models, it needs a criterion to know when to stop. For convenience, I will allow it to run for 10 minutes.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;h2oAML &amp;lt;- h2o.automl(
  y = y,
  x = x,
  training_frame = train_data,
  project_name = &amp;quot;ice_the_kicker_bakeoff&amp;quot;,
  balance_classes = T,
  max_runtime_secs = 600,
  seed = 20220425
)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;When the AutoML algorithm completes, each model that was run is placed on a leaderboard, which can be accessed by:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;leaderboard_tbl &amp;lt;- h2oAML@leaderboard %&amp;gt;% as_tibble()

leaderboard_tbl %&amp;gt;% head() %&amp;gt;% kable()&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;colgroup&gt;
&lt;col width=&#34;42%&#34; /&gt;
&lt;col width=&#34;8%&#34; /&gt;
&lt;col width=&#34;8%&#34; /&gt;
&lt;col width=&#34;8%&#34; /&gt;
&lt;col width=&#34;16%&#34; /&gt;
&lt;col width=&#34;8%&#34; /&gt;
&lt;col width=&#34;8%&#34; /&gt;
&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;model_id&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;auc&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;logloss&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;aucpr&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;mean_per_class_error&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;rmse&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;mse&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;GBM_grid_1_AutoML_1_20220430_215247_model_47&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.9193728&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.1092884&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.9953164&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.4591029&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.1759418&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0309555&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;GBM_grid_1_AutoML_1_20220430_215247_model_95&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.9186846&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.1098212&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.9953443&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.4834514&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.1766268&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0311970&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;StackedEnsemble_AllModels_4_AutoML_1_20220430_215247&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.9182852&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.1070530&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.9951190&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.4476356&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.1731510&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0299813&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;StackedEnsemble_AllModels_3_AutoML_1_20220430_215247&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.9182371&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.1072580&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.9952534&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.4525710&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.1735284&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0301121&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;GBM_grid_1_AutoML_1_20220430_215247_model_69&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.9181298&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.1097581&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.9952088&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.4819644&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.1765332&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0311640&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;StackedEnsemble_AllModels_2_AutoML_1_20220430_215247&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.9179346&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.1077711&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.9950966&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.4411767&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.1738748&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0302325&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;I can get the top model from the leaderboard by running &lt;code&gt;h2o.getModel()&lt;/code&gt; on its model id. In this case it was a Gradient Boosted Machine (GBM).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;model_names &amp;lt;- leaderboard_tbl$model_id
top_model &amp;lt;- h2o.getModel(model_names[1])&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With the model id I can also see the parameters that were used in this model.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;top_model@model$model_summary %&amp;gt;% 
  pivot_longer(cols = everything(),
               names_to = &amp;quot;Parameter&amp;quot;, values_to = &amp;quot;Value&amp;quot;) %&amp;gt;% 
  kable(align = &amp;#39;c&amp;#39;)&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;center&#34;&gt;Parameter&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;number_of_trees&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;74.000000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;number_of_internal_trees&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;74.000000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;model_size_in_bytes&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;11612.000000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;min_depth&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;3.000000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;max_depth&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;3.000000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;mean_depth&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;3.000000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;min_leaves&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;6.000000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;max_leaves&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;8.000000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;mean_leaves&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;7.851351&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;While for &lt;code&gt;tidymodels&lt;/code&gt; the &lt;code&gt;last_fit()&lt;/code&gt; function ran the model on the test set for me, for h2o I’ll need to do that myself… but it’s not that difficult. h2o has an &lt;code&gt;h2o.predict()&lt;/code&gt; function similar to R’s &lt;code&gt;predict()&lt;/code&gt; which takes in a model and data to predict on through a &lt;em&gt;newdata&lt;/em&gt; parameter. For that &lt;em&gt;newdata&lt;/em&gt; I need to convert the test data into the h2o format through &lt;code&gt;as.h2o()&lt;/code&gt;. Then I bind the predictions as a new column onto the rest of the test data so that performance statistics and confusion matrices can be generated.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;h2o_predictions &amp;lt;- h2o.predict(top_model, newdata = as.h2o(ice_test)) %&amp;gt;%
  as_tibble() %&amp;gt;%
  bind_cols(ice_test)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Similar to how I needed to do the predictions manually, I’ll also need to collect the performance metrics manually. This is also easy using the &lt;code&gt;yardstick&lt;/code&gt; package:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;h2o_metrics &amp;lt;- bind_rows(
  #Calculate Performance Metrics
  yardstick::f_meas(h2o_predictions, is_iced, predict),
  yardstick::precision(h2o_predictions, is_iced, predict),
  yardstick::recall(h2o_predictions, is_iced, predict)
) %&amp;gt;%
  # Add an id column and make it the first column
  mutate(label = &amp;quot;h2o&amp;quot;, .before = 1) %&amp;gt;% 
  # Remove the periods from column names
  rename_with(~str_remove(.x, &amp;#39;\\.&amp;#39;)) %&amp;gt;%
  # Drop the estimator column
  select(-estimator)

kable(h2o_metrics)&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;label&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;metric&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;estimate&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;h2o&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;f_meas&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.3731020&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;h2o&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;precision&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.2398884&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;h2o&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;recall&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.8390244&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Finally, I’ll compute the confusion matrix.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;h2o_cf &amp;lt;- h2o_predictions %&amp;gt;% 
  count(is_iced, pred= predict) %&amp;gt;% 
  mutate(label = &amp;quot;h2o&amp;quot;, .before = 1)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;model-3-superlearner&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Model #3: SuperLearner&lt;/h3&gt;
&lt;p&gt;The third candidate model that I’ll try is through the &lt;code&gt;SuperLearner&lt;/code&gt; package. &lt;code&gt;SuperLearner&lt;/code&gt; is an ensemble package that fits many different types of models and then, by taking a weighted combination of those models, aims for better predictive accuracy than any of the individual models alone.&lt;/p&gt;
&lt;p&gt;To use the &lt;code&gt;SuperLearner()&lt;/code&gt; function, the dependent variable Y must be provided as a numeric vector, and the predictors X must also contain only numeric data; therefore all factors are converted back to numeric.&lt;/p&gt;
&lt;p&gt;Since I’m predicting a binary outcome (whether or not a kick attempt will be iced) I specify the family as &lt;em&gt;binomial&lt;/em&gt;. Finally, the models to be combined are specified in the &lt;code&gt;SL.library&lt;/code&gt; argument. The full list of available models can be seen with the &lt;code&gt;listWrappers()&lt;/code&gt; function. However, I’m choosing a subset primarily out of convenience, mostly because I couldn’t get some of the other models (for example, bartMachine) to run properly. The models I’m including in the ensemble are a GLM, XGBoost, GLM w/ Interactions, Regularized GLM (glmnet), MARS (earth), GAM, and a Random Forest.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mod &amp;lt;- SuperLearner(
  Y = ice_train %&amp;gt;% mutate(iced = if_else(is_iced == &amp;#39;iced&amp;#39;, 1, 0)) %&amp;gt;% 
    pull(iced),
  X = ice_train %&amp;gt;% mutate(prior_miss = if_else(prior_miss == &amp;#39;yes&amp;#39;, 1, 0)) %&amp;gt;% 
    select(-is_iced) %&amp;gt;% as.data.frame,
  family = binomial(),
  SL.library = c( &amp;#39;SL.glm&amp;#39;, &amp;quot;SL.xgboost&amp;quot;, &amp;quot;SL.glm.interaction&amp;quot;, &amp;#39;SL.glmnet&amp;#39;, 
                  &amp;#39;SL.earth&amp;#39;, &amp;#39;SL.gam&amp;#39;, &amp;#39;SL.randomForest&amp;#39;)
)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Since the ultimate output of the SuperLearner is a weighted combination of those models, I can extract the weights and show which models have the most influence on the final predictions.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mod$coef %&amp;gt;% as_tibble(rownames = &amp;quot;model&amp;quot;) %&amp;gt;% 
  mutate(model = str_remove_all(model, &amp;#39;(SL\\.)|(_All)&amp;#39;)) %&amp;gt;%
  ggplot(aes(x = fct_reorder(model, -value), y = value, fill = model)) + 
    geom_col() + 
    geom_text(aes(label = value %&amp;gt;% percent), vjust = 0) + 
    scale_fill_viridis_d(option = &amp;quot;B&amp;quot;, end = .8, guide = &amp;#39;none&amp;#39;) + 
    labs(x = &amp;quot;Model&amp;quot;, y = &amp;quot;% Contribution of Model&amp;quot;, title = &amp;quot;% Contribution For Each Component to SuperLearner&amp;quot;) + 
    cowplot::theme_cowplot() + 
    theme(
      axis.text.y = element_blank(),
      axis.ticks.y = element_blank()
    )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2022/05/03/ml-for-the-lazy-can-automl-beat-my-model/index_files/figure-html/plot_weights-1.png&#34; width=&#34;100%&#34; /&gt;&lt;/p&gt;
&lt;p&gt;It appears that the Random Forest carries the largest weight, followed by the MARS model and the XGBoost model.&lt;/p&gt;
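&lt;p&gt;To make the weighting concrete, here is a toy illustration of how a SuperLearner-style ensemble combines its components. The component predictions and weights below are made up for illustration, not the fitted values from the model above:&lt;/p&gt;

```r
# Hypothetical predicted probabilities of icing from three component models
component_preds = c(randomForest = 0.60, earth = 0.40, xgboost = 0.30)

# Hypothetical ensemble weights; like mod$coef, they sum to 1
weights = c(randomForest = 0.5, earth = 0.3, xgboost = 0.2)

# The ensemble prediction is the weight-averaged combination of the components
ensemble_pred = sum(weights * component_preds)
ensemble_pred
#> [1] 0.48
```

&lt;p&gt;Models with larger weights therefore pull the final prediction further towards their own estimates.&lt;/p&gt;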
&lt;p&gt;Predicting on the test set is similar to the h2o version, except I can use the generic &lt;code&gt;predict()&lt;/code&gt; function. However, it returns a vector of probabilities of being iced rather than a label like h2o did. Therefore I need to make a judgement call on a probability cut-off for classifying an attempt as iced or not. I’ll use the incidence rate of the training data, 4.2%, as the cut-off: probabilities greater than 4.2% will be considered “iced” and those below will be “not iced”.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pred_sl = predict(mod, newdata = ice_test %&amp;gt;% 
                    mutate(prior_miss = if_else(prior_miss == &amp;#39;yes&amp;#39;, 1, 0)) %&amp;gt;%
                    select(-is_iced) %&amp;gt;% 
                    as.data.frame, type = &amp;#39;response&amp;#39;)$pred 

pred_sl &amp;lt;- ice_test %&amp;gt;%
  mutate(pred = if_else(pred_sl &amp;gt;= mean(ice_train$is_iced == &amp;#39;iced&amp;#39;), 1, 0),
         pred = factor(pred, levels = c(1, 0), labels = c(&amp;#39;iced&amp;#39;, &amp;#39;not_iced&amp;#39;)))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Similar to the above section, I’ll use &lt;code&gt;yardstick&lt;/code&gt; for the performance metrics.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;sl_metrics &amp;lt;- bind_rows(
  yardstick::f_meas(pred_sl, is_iced, pred),
  yardstick::precision(pred_sl, is_iced, pred),
  yardstick::recall(pred_sl, is_iced, pred)
) %&amp;gt;% 
  mutate(label = &amp;quot;SuperLearner&amp;quot;, .before = 1) %&amp;gt;% 
  rename_with(~str_remove(.x, &amp;#39;\\.&amp;#39;)) %&amp;gt;% 
  select(-estimator)

kable(sl_metrics)&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;label&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;metric&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;estimate&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;SuperLearner&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;f_meas&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.3389513&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;SuperLearner&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;precision&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.2097335&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;SuperLearner&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;recall&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.8829268&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;And calculate the confusion matrix.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;sl_cf &amp;lt;- pred_sl %&amp;gt;% 
  count(is_iced, pred) %&amp;gt;% 
  mutate(label = &amp;quot;SuperLearner&amp;quot;, .before = 1)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;comparing-the-three-models&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Comparing the Three Models&lt;/h3&gt;
&lt;p&gt;For each of the three models I’ve calculated Precision, Recall, and F1. I’ll combine this information in a plot so it’s easier to compare performance across the models:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;bind_rows(
  orig_metrics,
  h2o_metrics,
  sl_metrics
) %&amp;gt;% 
  ggplot(aes(x = str_wrap(label, 9), y = estimate, fill = label)) + 
    geom_col() + 
    geom_text(aes(label = estimate %&amp;gt;% percent), vjust = 1, color = &amp;#39;grey90&amp;#39;) + 
    scale_fill_viridis_d(option = &amp;quot;C&amp;quot;, end = .6, guide = &amp;#39;none&amp;#39;) + 
    facet_wrap(~metric, nrow = 1, scales = &amp;quot;free_y&amp;quot;) +
    labs(x = &amp;quot;Model&amp;quot;, y = &amp;quot;Performance Metric&amp;quot;,
         title = &amp;quot;Comparing the Performance Metrics on Test Set&amp;quot;) + 
    theme_light() + 
    theme(
      axis.text.y = element_blank(),
      axis.ticks.y = element_blank(),
      strip.text = element_text(color = &amp;#39;black&amp;#39;)
    )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2022/05/03/ml-for-the-lazy-can-automl-beat-my-model/index_files/figure-html/combined_perf-1.png&#34; width=&#34;100%&#34; /&gt;
From the perspective of the F1 score, which balances precision and recall, the original model performed best. Looking at the components, the original model had higher precision, meaning that when it predicted an iced attempt it was more likely to be right than the other models (although it was still only right about a third of the time). However, it left some true iced attempts on the table, since its recall was substantially lower than both the h2o model and the SuperLearner model.&lt;/p&gt;
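&lt;p&gt;As a quick sanity check, the F1 score is the harmonic mean of precision and recall, which can be reproduced from the original model’s reported metrics:&lt;/p&gt;

```r
# Reported test-set precision and recall for the original tidymodels model
precision = 0.3428571
recall = 0.5853659

# The F1 score is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
f1
```

&lt;p&gt;This reproduces the reported &lt;code&gt;f_meas&lt;/code&gt; of roughly 0.4324.&lt;/p&gt;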
&lt;p&gt;I can get a better view of what is and is not being predicted well by looking at each model’s confusion matrix on the test set.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;bind_rows(
  orig_cf,
  h2o_cf,
  sl_cf
) %&amp;gt;% 
  group_by(label) %&amp;gt;% 
  mutate(pct = n/sum(n)) %&amp;gt;% 
  ggplot(aes(x = is_iced, y = pred, fill = n)) + 
    geom_tile() + 
    geom_text(aes(label = glue::glue(&amp;#39;{n}\n({pct %&amp;gt;% percent})&amp;#39;)),
              color = &amp;#39;grey90&amp;#39;) + 
    facet_wrap(~label, nrow = 1) + 
    scale_fill_viridis_c(guide = &amp;#39;none&amp;#39;, end = .8) + 
    labs(x = &amp;quot;Actual Value&amp;quot;, y = &amp;quot;Predicted Value&amp;quot;,
         title = &amp;quot;Comparing Confusion Matrices&amp;quot;) + 
    theme_light() + 
    theme(
      axis.ticks = element_blank(),
      strip.text = element_text(color = &amp;#39;black&amp;#39;)
    )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2022/05/03/ml-for-the-lazy-can-automl-beat-my-model/index_files/figure-html/combined_cf-1.png&#34; width=&#34;100%&#34; /&gt;&lt;/p&gt;
&lt;p&gt;In the confusion matrices it’s much easier to see that the original model was less likely to predict iced than the other two models. This led to its higher precision but also its lower recall, as the original model missed 85 iced attempts vs. 33 for the h2o model and 24 for the SuperLearner.&lt;/p&gt;
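&lt;p&gt;Those recall numbers follow directly from the confusion counts. Assuming the test set contains 205 iced attempts (a figure implied by the reported recalls and miss counts, not stated directly), recall is simply the share of actual iced attempts that each model caught:&lt;/p&gt;

```r
# Iced attempts in the test set (implied by the reported metrics; an assumption)
total_iced = 205

# Iced attempts each model failed to flag, from the confusion matrices
missed = c(original = 85, h2o = 33, superlearner = 24)

# recall = true positives / (true positives + false negatives)
recall = (total_iced - missed) / total_iced
round(recall, 4)
```

&lt;p&gt;This reproduces the reported recalls of roughly 0.585, 0.839, and 0.883.&lt;/p&gt;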
&lt;p&gt;So which model performed the best? If I’m just going by balanced performance as measured by the F1 score, then the original model outperformed the other two. However, it’s worth thinking about whether precision or recall is more important, since that could influence how to view each model’s performance. If ensuring that all the iced kicks are captured is most important, then I should weight recall more heavily. But if I want confidence that a predicted iced kick really will be iced, I should stick with the original model.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;other-posts-in-the-icing-the-kicker-series&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Other Posts in the Icing the Kicker Series&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Part I: &lt;a href=&#34;https://jlaw.netlify.app/2022/01/24/predicting-when-kickers-get-iced-with-tidymodels/&#34;&gt;Predicting When Kickers Get Iced with {tidymodels}&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Part II: &lt;a href=&#34;https://jlaw.netlify.app/2022/02/14/does-icing-the-kicker-really-work/&#34;&gt;Does Icing the Kicker Really Work? A Causal Inference Exercise&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Part III: &lt;a href=&#34;https://jlaw.netlify.app/2022/03/13/ain-t-nothin-but-a-g-computation-and-tmle-thang-exploring-two-more-causal-inference-methods/&#34;&gt;Ain’t Nothin But A G-Computation (and TMLE) Thang: Exploring Two More Causal Inference Methods&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Ain&#39;t Nothin But A G-Computation (and TMLE) Thang: Exploring Two More Causal Inference Methods</title>
      <link>https://jlaw.netlify.app/2022/03/13/ain-t-nothin-but-a-g-computation-and-tmle-thang-exploring-two-more-causal-inference-methods/</link>
      <pubDate>Sun, 13 Mar 2022 00:00:00 +0000</pubDate>
      <guid>https://jlaw.netlify.app/2022/03/13/ain-t-nothin-but-a-g-computation-and-tmle-thang-exploring-two-more-causal-inference-methods/</guid>
      <description>
&lt;script src=&#34;https://jlaw.netlify.app/2022/03/13/ain-t-nothin-but-a-g-computation-and-tmle-thang-exploring-two-more-causal-inference-methods/index_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;p&gt;In my &lt;a href=&#34;https://jlaw.netlify.app/2022/02/14/does-icing-the-kicker-really-work/&#34;&gt;last post&lt;/a&gt; I looked at the causal effect of icing the kicker using weighting. Those results found that icing the kicker had a non-significant effect on the success of the field goal attempt with a point estimate of -2.82% (CI: -5.88%, 0.50%). In this post I will explore two other methodologies for causal inference with observational data, &lt;strong&gt;G-Computation&lt;/strong&gt; and &lt;strong&gt;Targeted Maximum Likelihood Estimation&lt;/strong&gt;. Beyond the goal of exploring new methodologies, I will see how consistent these estimates are with the prior post.&lt;/p&gt;
&lt;div id=&#34;g-computation&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;G-Computation&lt;/h2&gt;
&lt;p&gt;I first learned about G-Computation from &lt;a href=&#34;https://malco.io/&#34;&gt;Malcolm Barrett’s&lt;/a&gt; &lt;a href=&#34;https://causal-inference-r-workshop.netlify.app/07-g-computation.html&#34;&gt;Causal Inference in R workshop&lt;/a&gt;. For causal inference, the ideal goal is to see what would happen to a field goal attempt in the world where the kicker is iced vs. the world where he isn’t. However, in the real world only one of these outcomes is possible. G-Computation creates these hypothetical worlds by:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Fitting a model on observed data including treatment indicator (whether the kicker is iced) and covariates (other situational information)&lt;/li&gt;
&lt;li&gt;Creating duplicates of the data set where all observations are set to a single level of treatment (in this case make two replications of the data, one where all kicks are iced and one where all kicks are &lt;strong&gt;NOT&lt;/strong&gt; iced)&lt;/li&gt;
&lt;li&gt;Predict the FG success for these replicates&lt;/li&gt;
&lt;li&gt;Calculate Avg(Iced) - Avg(Not Iced) to obtain the causal effect.&lt;/li&gt;
&lt;li&gt;Bootstrap the entire process in order to get valid confidence intervals.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;For this exercise I won’t need any complicated packages. Using &lt;code&gt;rsample&lt;/code&gt; for bootstrapping will be as exotic as it gets.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidyverse)
library(rsample)
library(scales)
library(here)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And the data that will be used is the same from the prior two blog posts which is the 19,072 Field Goal Attempts from College Football between 2013 and 2021. For details on that data and its construction please refer to the &lt;a href=&#34;https://jlaw.netlify.app/2022/01/24/predicting-when-kickers-get-iced-with-tidymodels/&#34;&gt;first post in this series&lt;/a&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;fg_attempts &amp;lt;- readRDS(here(&amp;#39;content/post/2022-01-17-predicting-when-kickers-get-iced-with-tidymodels/data/fg_attempts.RDS&amp;#39;)) %&amp;gt;%
  transmute(
    regulation_time_remaining,
    attempted_distance,
    drive_is_home_offense = if_else(drive_is_home_offense, 1, 0),
    score_diff,
    prior_miss,
    offense_win_prob,
    is_iced = factor(is_iced, levels = c(0, 1), labels = c(&amp;#39;Not Iced&amp;#39;, &amp;#39;Iced&amp;#39;)),
    fg_made,
    id_play
  )&lt;/code&gt;&lt;/pre&gt;
&lt;div id=&#34;step-1-fit-a-model-using-all-the-data&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Step 1: Fit a model using all the data&lt;/h3&gt;
&lt;p&gt;The first step in the G-Computation process is to fit a model using all covariates and the treatment indicator against the outcome of field goal success. This will use the same covariates from the &lt;a href=&#34;https://jlaw.netlify.app/2022/02/14/does-icing-the-kicker-really-work/&#34;&gt;prior post&lt;/a&gt; which include the amount of time remaining in regulation, the distance of the field goal attempt, whether the kicking team is on offense or defense, the squared difference in score, whether the kicking team had previously missed in the game, and the pre-game win probability for the kicking team. The treatment effect is &lt;code&gt;is_iced&lt;/code&gt; which reflects whether the defense called timeout before the kick and the outcome &lt;code&gt;fg_made&lt;/code&gt; is whether the kick was successful.&lt;/p&gt;
&lt;p&gt;Since I’m predicting a binary outcome, I will use logistic regression.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;m &amp;lt;- glm(fg_made ~ is_iced + regulation_time_remaining + attempted_distance + 
           drive_is_home_offense + I(score_diff^2) + prior_miss + offense_win_prob,
         data = fg_attempts,
         family = &amp;#39;binomial&amp;#39;)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;step-2-create-duplicates-of-the-data-set&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Step 2: Create Duplicates of the Data Set&lt;/h3&gt;
&lt;p&gt;In order to create the hypothetical worlds of what would have happened if kicks were iced or not iced, I’ll create duplicates of the data: one where all the data is “iced” and one where all the data is “not iced”. The effect that I am interested in is the “average treatment effect on the treated” (ATT): for the kicks that were actually iced, what would have happened if they weren’t? Therefore for these duplicates I’ll use only the observations where “icing the kicker” actually occurred and create one duplicate version where &lt;code&gt;is_iced&lt;/code&gt; is set to “Not Iced”.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;replicated_data &amp;lt;- bind_rows(
  # Get all of the Iced Kicks
  fg_attempts %&amp;gt;% filter(is_iced == &amp;#39;Iced&amp;#39;),
  # Get all of the Iced Kicks But set the treatment field to &amp;quot;Not Iced&amp;quot;
  fg_attempts %&amp;gt;% filter(is_iced == &amp;#39;Iced&amp;#39;) %&amp;gt;% mutate(is_iced = &amp;#39;Not Iced&amp;#39;)
)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;step-3-predict-the-probability-of-success-for-the-duplicates&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Step 3: Predict the Probability of Success for the Duplicates&lt;/h3&gt;
&lt;p&gt;This is straightforward using the &lt;code&gt;predict()&lt;/code&gt; function. Using &lt;code&gt;type = &#39;response&#39;&lt;/code&gt; returns probabilities rather than predicted log-odds.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;replicated_data &amp;lt;- replicated_data %&amp;gt;%
  mutate(p_success = predict(m, newdata = ., type = &amp;#39;response&amp;#39;))&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;step-4-use-the-predicted-successes-to-calculate-the-causal-effect&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Step 4: Use the Predicted Successes to Calculate the Causal Effect&lt;/h3&gt;
&lt;p&gt;From the predicted data I can calculate the average predicted success for the “Iced” and “Not Iced” versions and take the difference to obtain the causal effect of icing the kicker.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;replicated_data %&amp;gt;% 
  group_by(is_iced) %&amp;gt;% 
  # Get average success by group
  summarize(p_success = mean(p_success)) %&amp;gt;%
  spread(is_iced, p_success) %&amp;gt;%
  # Calculate the causal effect
  mutate(ATT = `Iced` - `Not Iced`) %&amp;gt;%
  # Pretty format using percentages
  mutate(across(everything(), scales::percent_format(accuracy = .01))) %&amp;gt;% 
  kable()&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;Iced&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;Not Iced&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;ATT&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;67.66%&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;70.12%&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;-2.46%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;From this calculation, the average treatment effect on the treated is -2.46%, which is very close to the -2.82% from the &lt;a href=&#34;https://jlaw.netlify.app/2022/02/14/does-icing-the-kicker-really-work/&#34;&gt;previous post&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;But to know if this effect would be statistically significant I’ll need to bootstrap the whole process.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;step-5-bootstrap-the-process-to-obtain-confidence-intervals&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Step 5: Bootstrap the Process to Obtain Confidence Intervals&lt;/h3&gt;
&lt;p&gt;To bootstrap the process using &lt;code&gt;rsample&lt;/code&gt;, I first need to create a function that takes a split from the bootstrap samples and returns the ATT estimate calculated in Step 4 above:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;g_computation &amp;lt;- function(split, ...){
  .df &amp;lt;- analysis(split)
  
  m &amp;lt;- glm(fg_made ~ is_iced + regulation_time_remaining + attempted_distance + 
                   drive_is_home_offense + I(score_diff^2) + prior_miss + offense_win_prob,
                 data = .df,
                 family = &amp;#39;binomial&amp;#39;)
  
  return(
    # Create the Replicated Data
    bind_rows(
        fg_attempts %&amp;gt;% filter(is_iced == &amp;#39;Iced&amp;#39;),
        fg_attempts %&amp;gt;% filter(is_iced == &amp;#39;Iced&amp;#39;) %&amp;gt;% mutate(is_iced = &amp;#39;Not Iced&amp;#39;)
    ) %&amp;gt;% 
      # Calculate predictions on replicated data
      mutate(p_success = predict(m, newdata = ., type = &amp;#39;response&amp;#39;)) %&amp;gt;%
      group_by(is_iced) %&amp;gt;%
      summarize(p_success = mean(p_success)
      ) %&amp;gt;%
      spread(is_iced, p_success) %&amp;gt;%
      # Calculate ATT
      mutate(ATT = `Iced` - `Not Iced`)
  )
  
} &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now that the entire process has been wrapped in a function, I need to create the bootstrap samples that will be passed into it. In the next code block I create 1,000 bootstrap samples and use &lt;code&gt;purrr::map&lt;/code&gt; to pass each sample into the function to obtain the ATTs.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(20220313)

g_results &amp;lt;- bootstraps(fg_attempts, 1000, apparent = T) %&amp;gt;% 
  mutate(results = map(splits, g_computation)) %&amp;gt;%
  select(results, id) %&amp;gt;%
  unnest(results)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Finally, I’ll use the 2.5th and 97.5th percentiles of the ATT distribution returned from the bootstrap process to form the confidence interval and the mean to form the point estimate.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;g_results %&amp;gt;% 
  summarize(.lower = quantile(ATT, .025),
            .estimate = mean(ATT),
            .upper = quantile(ATT, .975)) %&amp;gt;%
  mutate(across(everything(), scales::percent_format(accuracy = .01))) %&amp;gt;%
  kable()&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;.lower&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;.estimate&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;.upper&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;-5.66%&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;-2.51%&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;0.59%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Using G-Computation I reach the same conclusion that icing the kicker &lt;strong&gt;does not&lt;/strong&gt; have a statistically significant effect on FG success. The point estimate of the effect of icing the kicker is -2.51% (95% CI: -5.66% to 0.59%).&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;targeted-maximum-liklihood-estimation-tmle&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Targeted Maximum Likelihood Estimation (TMLE)&lt;/h2&gt;
&lt;p&gt;In the weighting approach from the &lt;a href=&#34;https://jlaw.netlify.app/2022/02/14/does-icing-the-kicker-really-work/&#34;&gt;previous post&lt;/a&gt; and in the G-Computation section above, there is a fundamental assumption that all of the covariates that influence both icing the kicker and field goal success have been controlled for in the model. In practice, this is difficult to know for sure. Here, weather and wind direction/speed probably have an influence that is not captured in this data because it was difficult to obtain. Targeted Maximum Likelihood Estimation (TMLE) is one of the “doubly robust” estimators that provide some safety against model misspecification.&lt;/p&gt;
&lt;p&gt;In TMLE, one model estimates the probability that a kick attempt is iced (the propensity score) and a second model estimates how icing the kicker and the other covariates affect the success of that kick (the outcome model). These models are combined to produce estimates of the average treatment effect on the treated. The “doubly robust” aspect is that the result will be a consistent estimator as long as at least one of the two models is correctly specified.&lt;/p&gt;
&lt;p&gt;For more information on TMLE as a doubly robust estimator, check out the excellent blog post from &lt;a href=&#34;https://multithreaded.stitchfix.com/blog/2021/07/23/double-robust-estimator/&#34;&gt;StitchFix&lt;/a&gt;, which is a large influence on this section.&lt;/p&gt;
&lt;p&gt;To run TMLE in R, I’ll use the &lt;code&gt;tmle&lt;/code&gt; package, which estimates the propensity score and outcome model using the &lt;code&gt;SuperLearner&lt;/code&gt; package, which stacks models to create an ensemble. As the blog states, &#34;using SuperLearner is a way to hedge your bets rather than putting all your money on a single model, drastically reducing the chances we’ll suffer from model misspecification&#34; since SuperLearner can leverage many different types of sub-models.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tmle)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;tmle()&lt;/code&gt; function will run the procedure to estimate the various causal effect statistics. The parameters of the function are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Y&lt;/em&gt; is whether the Field Goal attempt was successful&lt;/li&gt;
&lt;li&gt;&lt;em&gt;A&lt;/em&gt; is the treatment indicator of whether the Field Goal attempt was iced or not&lt;/li&gt;
&lt;li&gt;&lt;em&gt;W&lt;/em&gt; is a data set of covariates&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Q.SL.library&lt;/em&gt; is the set of sub-models that &lt;code&gt;SuperLearner&lt;/code&gt; will use to estimate the outcome model&lt;/li&gt;
&lt;li&gt;&lt;em&gt;g.SL.library&lt;/em&gt; is the set of sub-models that &lt;code&gt;SuperLearner&lt;/code&gt; will use to estimate the propensity scores&lt;/li&gt;
&lt;li&gt;&lt;em&gt;V&lt;/em&gt; is the number of folds to use for the cross-validation to determine the optimal models&lt;/li&gt;
&lt;li&gt;&lt;em&gt;family&lt;/em&gt; is set to ‘binomial’ since the outcome data is binary&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The types of sub-models under consideration are GLMs, GLMs with interactions, GAMs, and polynomial MARS models. The complete list of models available in SuperLearner can be found &lt;a href=&#34;https://cran.r-project.org/web/packages/SuperLearner/vignettes/Guide-to-SuperLearner.html#review-available-models&#34;&gt;here&lt;/a&gt; or via the &lt;code&gt;listWrappers()&lt;/code&gt; function.&lt;/p&gt;
&lt;p&gt;If you actually know the forms of the propensity model or outcome model, they can be specified directly using &lt;code&gt;gform&lt;/code&gt; or &lt;code&gt;Qform&lt;/code&gt;. But I’ll be letting SuperLearner do all the work.&lt;/p&gt;
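&lt;p&gt;As a hedged sketch of that option: &lt;code&gt;Qform&lt;/code&gt; and &lt;code&gt;gform&lt;/code&gt; take regression formulas as character strings, with the outcome model written in terms of &lt;code&gt;A&lt;/code&gt; and the columns of &lt;code&gt;W&lt;/code&gt;. The covariate choices below are hypothetical and for illustration only; this post lets SuperLearner choose instead.&lt;/p&gt;

```r
# Hypothetical, illustration-only model forms for tmle(); this post
# lets SuperLearner estimate both models instead.
Qform = "Y ~ A + attempted_distance + regulation_time_remaining"  # outcome model
gform = "A ~ attempted_distance + regulation_time_remaining"      # propensity model

# tmle_model = tmle(Y = ..., A = ..., W = ..., Qform = Qform,
#                   gform = gform, family = "binomial")
```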
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;tmle_model &amp;lt;- tmle(Y=fg_attempts$fg_made
                   ,A=if_else(fg_attempts$is_iced==&amp;#39;Iced&amp;#39;, 1, 0)
                   ,W=fg_attempts %&amp;gt;% 
                     transmute(regulation_time_remaining, attempted_distance,
                            drive_is_home_offense, score_diff=score_diff^2,
                            prior_miss, offense_win_prob)
                   ,Q.SL.library=c(&amp;quot;SL.glm&amp;quot;, &amp;quot;SL.glm.interaction&amp;quot;, &amp;quot;SL.gam&amp;quot;, &amp;quot;SL.polymars&amp;quot;)
                   ,g.SL.library=c(&amp;quot;SL.glm&amp;quot;, &amp;quot;SL.glm.interaction&amp;quot;, &amp;quot;SL.gam&amp;quot;, &amp;quot;SL.polymars&amp;quot;)
                   ,V=10
                   ,family=&amp;quot;binomial&amp;quot;
)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The TMLE object contains the results for a variety of causal effects (ATE, ATT, etc.). Since all the comparisons I’ve looked at use the ATT, I’ll do that again here.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;tibble(
  .lower = tmle_model$estimates$ATT$CI[1],
  .estimate = tmle_model$estimates$ATT$psi,
  .upper = tmle_model$estimates$ATT$CI[2]
) %&amp;gt;%
  mutate(across(everything(), scales::percent_format(accuracy = .01))) %&amp;gt;%
  kable()&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;.lower&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;.estimate&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;.upper&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;-5.77%&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;-2.63%&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;0.52%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The results of the TMLE are consistent with the conclusion that the effect of icing the kicker is not statistically significant. From a point-estimate perspective, though, the TMLE procedure estimates an effect of -2.63%, slightly larger than G-Computation but smaller than weighting.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;summary&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Summary&lt;/h2&gt;
&lt;p&gt;Throughout this post and the &lt;a href=&#34;https://jlaw.netlify.app/2022/02/14/does-icing-the-kicker-really-work/&#34;&gt;last post&lt;/a&gt; I’ve calculated the Average Treatment Effect on the Treated using three different methodologies, the results of which are:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2022/03/13/ain-t-nothin-but-a-g-computation-and-tmle-thang-exploring-two-more-causal-inference-methods/index_files/figure-html/summary_results-1.png&#34; width=&#34;100%&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Altogether, the three methodologies align on the conclusion that icing the kicker does not have a significant effect on the outcome of the field goal, and even if it did (based on the point estimates) the effect would be quite small.&lt;/p&gt;
&lt;div id=&#34;other-posts-in-the-icing-the-kicker-series&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Other Posts in the Icing the Kicker Series&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Part I: &lt;a href=&#34;https://jlaw.netlify.app/2022/01/24/predicting-when-kickers-get-iced-with-tidymodels/&#34;&gt;Predicting When Kickers Get Iced with {tidymodels}&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Part II: &lt;a href=&#34;https://jlaw.netlify.app/2022/02/14/does-icing-the-kicker-really-work/&#34;&gt;Does Icing the Kicker Really Work? A Causal Inference Exercise&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Does Icing the Kicker Really Work? A Causal Inference Exercise</title>
      <link>https://jlaw.netlify.app/2022/02/14/does-icing-the-kicker-really-work/</link>
      <pubDate>Mon, 14 Feb 2022 00:00:00 +0000</pubDate>
      <guid>https://jlaw.netlify.app/2022/02/14/does-icing-the-kicker-really-work/</guid>
      <description>
&lt;script src=&#34;https://jlaw.netlify.app/2022/02/14/does-icing-the-kicker-really-work/index_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;p&gt;In my prior post I &lt;a href=&#34;https://jlaw.netlify.app/2022/01/24/predicting-when-kickers-get-iced-with-tidymodels/&#34;&gt;looked at when coaches were most likely to ice a kicker&lt;/a&gt;, where ‘icing a kicker’ means the defense calls a timeout right before the offense is about to kick a field goal. In this post, I’ll apply causal inference techniques to see &lt;strong&gt;whether icing the kicker even matters&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;In a perfect world we’d run an A/B test or some type of experiment where some games could be played with the ability to ice the kicker and some without. However, this is infeasible because fairness requires that all games be played under the same rules.&lt;/p&gt;
&lt;p&gt;It would also be easy to just compare the field goal percentage when a kicker was iced vs. when they weren’t. However, this would suffer from a lot of selection bias, as the situations where a kicker is likely to be iced are different from the typical field goal attempt.&lt;/p&gt;
&lt;p&gt;This analysis will follow a similar flow to the &lt;a href=&#34;https://github.com/malcolmbarrett/causal_inference_r_workshop&#34;&gt;Causal Inference in R Workshop&lt;/a&gt; conducted by &lt;a href=&#34;https://www.lucymcgowan.com/&#34;&gt;Lucy D’Agostino McGowan&lt;/a&gt; and &lt;a href=&#34;https://malco.io/&#34;&gt;Malcolm Barrett&lt;/a&gt;. I’ll be reusing the data from my &lt;a href=&#34;https://jlaw.netlify.app/2022/01/24/predicting-when-kickers-get-iced-with-tidymodels/&#34;&gt;prior post&lt;/a&gt;, which contains 19,072 field goal attempts from College Football between 2013 and 2021. For details on that data and its construction, please refer to the &lt;a href=&#34;https://jlaw.netlify.app/2022/01/24/predicting-when-kickers-get-iced-with-tidymodels/&#34;&gt;prior post&lt;/a&gt;.&lt;/p&gt;
&lt;div id=&#34;what-have-other-analyses-shown&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;What Have Other Analyses Shown?&lt;/h2&gt;
&lt;p&gt;This is not the first time this question has been asked:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A &lt;a href=&#34;https://www.footballstudyhall.com/2018/11/24/18110091/is-icing-the-kicker-really-a-thing&#34;&gt;Football Study Hall&lt;/a&gt; article found that “Looking at all field goal attempts in Q4 and OT, there were 1070 attempts. 761, or 71% of them were good. Given the condition of whether a kicker was iced or not does seem to make a difference. For kickers who were iced, the number of made field goals drops to 123/196, or 63%, while the kickers who were not iced was 638/874, or 73% were good.”&lt;/li&gt;
&lt;li&gt;An &lt;a href=&#34;https://www.sbnation.com/2017/11/27/16707624/chris-boswell-icing-the-kicker-nfl-coaches-timeout&#34;&gt;SB Nation&lt;/a&gt; article, which was actually more of a game recap, has the subtitle “Icing the kicker doesn’t work, but coaches keep on doing it anyway.”&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;http://grantland.com/features/icing-kicker-work/&#34;&gt;Grantland&lt;/a&gt; found that icing the kicker doesn’t work.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://www.espn.com/blog/statsinfo/post/_/id/34217/icing-the-kicker-remains-ineffective-practice&#34;&gt;ESPN&lt;/a&gt; found that “attempts to ice a kicker at the end of a game actually increased the kicker’s chances of success”&lt;/li&gt;
&lt;li&gt;Finally, &lt;a href=&#34;https://mixpanel.com/blog/nfl-data-icing-the-kicker/&#34;&gt;Mixpanel&lt;/a&gt; found “it seems kickers that have been iced are a whole 0.1% less likely to make their kick successfully”&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Generally, the consensus seems to be that the effect of icing the kicker ranges from ineffective to potentially harmful to the kicking team.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;what-would-a-naive-analysis-show&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;What Would a Naive Analysis Show?&lt;/h2&gt;
&lt;p&gt;I’ll start with a really naive analysis: just looking at the data as-is, comparing iced to non-iced kicks. First I’ll load the libraries for this analysis and read in the field goal attempt data from my prior post.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidyverse)
library(here)
library(gtsummary)
library(broom)
library(survey)
library(rsample)
library(smd)


fg_attempts &amp;lt;- readRDS(here(&amp;#39;content/post/2022-01-17-predicting-when-kickers-get-iced-with-tidymodels/data/fg_attempts.RDS&amp;#39;)) %&amp;gt;%
  transmute(
    regulation_time_remaining,
    attempted_distance,
    drive_is_home_offense = if_else(drive_is_home_offense, 1, 0),
    score_diff,
    prior_miss,
    offense_win_prob,
    is_iced = factor(is_iced, levels = c(0, 1), labels = c(&amp;#39;Not Iced&amp;#39;, &amp;#39;Iced&amp;#39;)),
    fg_made,
    id_play
  )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As a reminder, the data contains 19,072 field goal attempts from College Football FBS Regular Season games between 2013 and 2021. For the very naive analysis I’ll just look at the data as-is.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;fg_attempts %&amp;gt;% 
  group_by(`Was Iced` = is_iced) %&amp;gt;% 
  summarize(
    `FG Attempts` = n(),
    `FG Made` = sum(fg_made == T),
    `FG %` = mean(fg_made) %&amp;gt;% scales::percent(accuracy = .1)
  ) %&amp;gt;% 
  knitr::kable(align=&amp;#39;c&amp;#39;)&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;center&#34;&gt;Was Iced&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;FG Attempts&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;FG Made&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;FG %&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;Not Iced&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;18268&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;13882&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;76.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;Iced&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;804&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;544&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;67.7%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;From this naive view of the data, 76.0% of non-iced kicks were converted vs. 67.7% of iced kicks, a difference of 8.3%! This seems like a decently large difference (and a test of proportions on it would be statistically significant).&lt;/p&gt;
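&lt;p&gt;As a quick sanity check on that naive gap, a two-sample test of proportions can be run on the raw counts from the table above (13,882 of 18,268 made when not iced vs. 544 of 804 when iced). This is a minimal sketch using base R’s &lt;code&gt;prop.test()&lt;/code&gt;, not part of the causal analysis.&lt;/p&gt;

```r
# Two-sample test of proportions on the naive counts from the table above
made     = c(13882, 544)   # field goals made: Not Iced, Iced
attempts = c(18268, 804)   # field goal attempts: Not Iced, Iced

pt = prop.test(made, attempts)

# Naive difference in FG%: roughly 8.3 percentage points
round(unname(pt$estimate[1] - pt$estimate[2]), 3)
pt$p.value  # far below 0.05, so the naive gap is statistically significant
```

The tiny p-value is exactly why the naive comparison looks convincing; the rest of the post shows why it is misleading.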
&lt;/div&gt;
&lt;div id=&#34;a-more-robust-solution&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;A more robust solution&lt;/h2&gt;
&lt;p&gt;But comparing iced kicks to non-iced kicks as-is doesn’t make much sense. As many of the articles referenced above state, icing the kicker is done to increase the pressure in already high-pressure situations, such as when the kick would determine who wins the game. These situations are vastly different from the lower-pressure situations where the majority of field goals occur.&lt;/p&gt;
&lt;p&gt;An easy way to determine whether there are differences in the factors that might lead to a field goal being iced is by looking at the &lt;em&gt;standardized mean differences&lt;/em&gt; of the other features in the data set to see the extent of the difference between the iced and non-iced attempts.&lt;/p&gt;
&lt;p&gt;I’ll be using the &lt;code&gt;tbl_summary()&lt;/code&gt; function from &lt;code&gt;{{gtsummary}}&lt;/code&gt; to create this table. In the code below, I split the data by &lt;em&gt;is_iced&lt;/em&gt;, tell the function to show the mean and standard deviation for all continuous variables and the percentage for binary variables, and round each value to two digits. The standardized mean difference is added through the &lt;code&gt;add_difference()&lt;/code&gt; function.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;tbl_summary(
  fg_attempts,
  by = &amp;#39;is_iced&amp;#39;,
  include = c(regulation_time_remaining, attempted_distance, 
              drive_is_home_offense, score_diff, prior_miss, offense_win_prob, 
              is_iced),
  statistic = list(all_continuous() ~ &amp;quot;{mean} ({sd})&amp;quot;,
                   all_dichotomous() ~ &amp;quot;{p}%&amp;quot;),
  digits = list(everything() ~ 2)
) %&amp;gt;% 
  add_difference(everything() ~ &amp;quot;smd&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;div id=&#34;gcyubuiepn&#34; style=&#34;overflow-x:auto;overflow-y:auto;width:auto;height:auto;&#34;&gt;
&lt;style&gt;html {
  font-family: -apple-system, BlinkMacSystemFont, &#39;Segoe UI&#39;, Roboto, Oxygen, Ubuntu, Cantarell, &#39;Helvetica Neue&#39;, &#39;Fira Sans&#39;, &#39;Droid Sans&#39;, Arial, sans-serif;
}

#gcyubuiepn .gt_table {
  display: table;
  border-collapse: collapse;
  margin-left: auto;
  margin-right: auto;
  color: #333333;
  font-size: 16px;
  font-weight: normal;
  font-style: normal;
  background-color: #FFFFFF;
  width: auto;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #A8A8A8;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #A8A8A8;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
}

#gcyubuiepn .gt_heading {
  background-color: #FFFFFF;
  text-align: center;
  border-bottom-color: #FFFFFF;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
}

#gcyubuiepn .gt_title {
  color: #333333;
  font-size: 125%;
  font-weight: initial;
  padding-top: 4px;
  padding-bottom: 4px;
  border-bottom-color: #FFFFFF;
  border-bottom-width: 0;
}

#gcyubuiepn .gt_subtitle {
  color: #333333;
  font-size: 85%;
  font-weight: initial;
  padding-top: 0;
  padding-bottom: 6px;
  border-top-color: #FFFFFF;
  border-top-width: 0;
}

#gcyubuiepn .gt_bottom_border {
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
}

#gcyubuiepn .gt_col_headings {
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
}

#gcyubuiepn .gt_col_heading {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: normal;
  text-transform: inherit;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: bottom;
  padding-top: 5px;
  padding-bottom: 6px;
  padding-left: 5px;
  padding-right: 5px;
  overflow-x: hidden;
}

#gcyubuiepn .gt_column_spanner_outer {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: normal;
  text-transform: inherit;
  padding-top: 0;
  padding-bottom: 0;
  padding-left: 4px;
  padding-right: 4px;
}

#gcyubuiepn .gt_column_spanner_outer:first-child {
  padding-left: 0;
}

#gcyubuiepn .gt_column_spanner_outer:last-child {
  padding-right: 0;
}

#gcyubuiepn .gt_column_spanner {
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  vertical-align: bottom;
  padding-top: 5px;
  padding-bottom: 5px;
  overflow-x: hidden;
  display: inline-block;
  width: 100%;
}

#gcyubuiepn .gt_group_heading {
  padding: 8px;
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  text-transform: inherit;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: middle;
}

#gcyubuiepn .gt_empty_group_heading {
  padding: 0.5px;
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  vertical-align: middle;
}

#gcyubuiepn .gt_from_md &gt; :first-child {
  margin-top: 0;
}

#gcyubuiepn .gt_from_md &gt; :last-child {
  margin-bottom: 0;
}

#gcyubuiepn .gt_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  margin: 10px;
  border-top-style: solid;
  border-top-width: 1px;
  border-top-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: middle;
  overflow-x: hidden;
}

#gcyubuiepn .gt_stub {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  text-transform: inherit;
  border-right-style: solid;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
  padding-left: 12px;
}

#gcyubuiepn .gt_summary_row {
  color: #333333;
  background-color: #FFFFFF;
  text-transform: inherit;
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
}

#gcyubuiepn .gt_first_summary_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
}

#gcyubuiepn .gt_grand_summary_row {
  color: #333333;
  background-color: #FFFFFF;
  text-transform: inherit;
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
}

#gcyubuiepn .gt_first_grand_summary_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  border-top-style: double;
  border-top-width: 6px;
  border-top-color: #D3D3D3;
}

#gcyubuiepn .gt_striped {
  background-color: rgba(128, 128, 128, 0.05);
}

#gcyubuiepn .gt_table_body {
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
}

#gcyubuiepn .gt_footnotes {
  color: #333333;
  background-color: #FFFFFF;
  border-bottom-style: none;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
}

#gcyubuiepn .gt_footnote {
  margin: 0px;
  font-size: 90%;
  padding: 4px;
}

#gcyubuiepn .gt_sourcenotes {
  color: #333333;
  background-color: #FFFFFF;
  border-bottom-style: none;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
}

#gcyubuiepn .gt_sourcenote {
  font-size: 90%;
  padding: 4px;
}

#gcyubuiepn .gt_left {
  text-align: left;
}

#gcyubuiepn .gt_center {
  text-align: center;
}

#gcyubuiepn .gt_right {
  text-align: right;
  font-variant-numeric: tabular-nums;
}

#gcyubuiepn .gt_font_normal {
  font-weight: normal;
}

#gcyubuiepn .gt_font_bold {
  font-weight: bold;
}

#gcyubuiepn .gt_font_italic {
  font-style: italic;
}

#gcyubuiepn .gt_super {
  font-size: 65%;
}

#gcyubuiepn .gt_footnote_marks {
  font-style: italic;
  font-weight: normal;
  font-size: 65%;
}
&lt;/style&gt;
&lt;table class=&#34;gt_table&#34;&gt;
  
  &lt;thead class=&#34;gt_col_headings&#34;&gt;
    &lt;tr&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_left&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;&lt;strong&gt;Characteristic&lt;/strong&gt;&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;&lt;strong&gt;Not Iced&lt;/strong&gt;, N = 18,268&lt;sup class=&#34;gt_footnote_marks&#34;&gt;1&lt;/sup&gt;&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;&lt;strong&gt;Iced&lt;/strong&gt;, N = 804&lt;sup class=&#34;gt_footnote_marks&#34;&gt;1&lt;/sup&gt;&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;&lt;strong&gt;Difference&lt;/strong&gt;&lt;sup class=&#34;gt_footnote_marks&#34;&gt;2&lt;/sup&gt;&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;&lt;strong&gt;95% CI&lt;/strong&gt;&lt;sup class=&#34;gt_footnote_marks&#34;&gt;2,3&lt;/sup&gt;&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody class=&#34;gt_table_body&#34;&gt;
    &lt;tr&gt;&lt;td class=&#34;gt_row gt_left&#34;&gt;regulation_time_remaining&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;1,936.02 (964.84)&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;1,170.94 (898.40)&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;0.82&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;0.75, 0.89&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td class=&#34;gt_row gt_left&#34;&gt;attempted_distance&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;35.25 (9.31)&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;38.69 (9.85)&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;-0.36&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;-0.43, -0.29&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td class=&#34;gt_row gt_left&#34;&gt;drive_is_home_offense&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;51.98%&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;50.87%&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;0.02&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;-0.05, 0.09&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td class=&#34;gt_row gt_left&#34;&gt;score_diff&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;1.15 (13.57)&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;-0.25 (10.94)&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;0.11&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;0.04, 0.18&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td class=&#34;gt_row gt_left&#34;&gt;prior_miss&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;13.74%&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;17.66%&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;-0.11&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;-0.18, -0.04&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td class=&#34;gt_row gt_left&#34;&gt;offense_win_prob&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;0.53 (0.28)&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;0.49 (0.24)&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;0.12&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;0.05, 0.19&lt;/td&gt;&lt;/tr&gt;
  &lt;/tbody&gt;
  
  &lt;tfoot&gt;
    &lt;tr class=&#34;gt_footnotes&#34;&gt;
      &lt;td colspan=&#34;5&#34;&gt;
        &lt;p class=&#34;gt_footnote&#34;&gt;
          &lt;sup class=&#34;gt_footnote_marks&#34;&gt;
            &lt;em&gt;1&lt;/em&gt;
          &lt;/sup&gt;
           
          Mean (SD); %
          &lt;br /&gt;
        &lt;/p&gt;
        &lt;p class=&#34;gt_footnote&#34;&gt;
          &lt;sup class=&#34;gt_footnote_marks&#34;&gt;
            &lt;em&gt;2&lt;/em&gt;
          &lt;/sup&gt;
           
          Standardized Mean Difference
          &lt;br /&gt;
        &lt;/p&gt;
        &lt;p class=&#34;gt_footnote&#34;&gt;
          &lt;sup class=&#34;gt_footnote_marks&#34;&gt;
            &lt;em&gt;3&lt;/em&gt;
          &lt;/sup&gt;
           
          CI = Confidence Interval
          &lt;br /&gt;
        &lt;/p&gt;
      &lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tfoot&gt;
&lt;/table&gt;
&lt;/div&gt;
&lt;p&gt;When looking at standardized mean differences, values less than 0.1 generally indicate adequate balance between the two groups. Values between 0.1 and 0.2 are not too alarming, but values greater than 0.2 indicate a heavy imbalance. In this data, the time remaining and attempted distance show large differences between iced and non-iced attempts.&lt;/p&gt;
&lt;p&gt;While there are many mechanisms to correct for the imbalances between the observed groups (Matching, Weighting, Stratification, etc.) I’m going to focus on weighting for this analysis. The process will be:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Develop Propensity Scores based on other features to predict the probability that a field goal attempt will be iced with logistic regression.&lt;/li&gt;
&lt;li&gt;Use the weights to adjust the population of the non-iced group to reflect the iced group. Since I’m looking to determine whether icing the kicker actually matters I want to measure the difference in Field Goal Success Rates for situations when the kicker might be iced. This is called the Average Treatment Effect on the Treated (ATT). This is in contrast to the Average Treatment Effect (ATE), which would measure the causal effect of icing the kicker in general and not just in situations where icing would occur.&lt;/li&gt;
&lt;li&gt;Ensure that the post-weighted data are not imbalanced like the pre-weighted data.&lt;/li&gt;
&lt;li&gt;Calculate the ATT and bootstrap confidence intervals.&lt;/li&gt;
&lt;/ol&gt;
&lt;div id=&#34;step-1-develop-the-propensity-model&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Step 1: Develop the Propensity Model&lt;/h3&gt;
&lt;p&gt;The first step in developing the weights to make the population more “even” is to develop a propensity score for the treatment. Here I’ll run a logistic regression using &lt;code&gt;glm()&lt;/code&gt; to predict whether the Field Goal attempt will be iced based on the covariates that were unbalanced before.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;p_iced &amp;lt;- glm(is_iced ~ regulation_time_remaining + attempted_distance + 
             drive_is_home_offense + I(score_diff^2) + prior_miss + offense_win_prob, 
           data = fg_attempts, 
           family = &amp;#39;binomial&amp;#39;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;These are called propensity models because their output represents the propensity of a given attempt to get iced.&lt;/p&gt;
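Conceptually, the propensity score is just the model’s predicted probability: the logistic transform of the linear predictor. A minimal sketch in Python with a made-up linear predictor value (the post’s actual model is fit in R with `glm()`):

```python
import math

def propensity_score(linear_predictor):
    """Inverse logit: maps the linear predictor from a logistic
    regression to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-linear_predictor))

# A linear predictor of 0 corresponds to a 50/50 chance of being iced;
# large negative values correspond to attempts very unlikely to be iced.
p_even = propensity_score(0.0)   # 0.5
p_low = propensity_score(-3.0)   # ≈ 0.047
```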
&lt;/div&gt;
&lt;div id=&#34;step-2-use-the-propensity-scores-to-weight-the-non-iced-field-goal-attempts&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Step 2: Use the Propensity Scores to weight the non-Iced Field Goal Attempts&lt;/h3&gt;
&lt;p&gt;Then by using the &lt;code&gt;augment()&lt;/code&gt; function from the &lt;code&gt;{{broom}}&lt;/code&gt; package, I can add the predicted values from the model to the &lt;em&gt;fg_attempts&lt;/em&gt; data set. The probabilities from this model can be used to re-weight the data in a number of ways: you can adjust both the test and control groups to make them look like each other, you can adjust the test group to look like the control, or you can weight the control to look like the test.&lt;/p&gt;
&lt;p&gt;In this case, since I want to understand the causal effect of icing the kicker on kicks that are likely to be iced, I’ll be re-weighting the control group to look like the test group. Thus, I will be looking for the average treatment effect among the treated (ATT) vs. the overall average treatment effect (ATE).&lt;/p&gt;
&lt;p&gt;The formula for re-weighting the population for the ATT is:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;eq.PNG&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Where &lt;em&gt;p_i&lt;/em&gt; is the attempt’s propensity to be iced and &lt;em&gt;Z_i&lt;/em&gt; is whether the attempt was &lt;strong&gt;actually&lt;/strong&gt; iced. This winds up assigning each attempt in the test group a weight of 1 and will upweight the field goal attempts from the non-iced group that had higher propensities for being iced.&lt;/p&gt;
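The weighting rule can be sketched directly: treated (iced) attempts get a weight of 1, and control attempts get their odds of treatment, p / (1 − p). A small Python sketch with made-up propensity values (the post’s actual implementation in R follows below):

```python
def att_weight(p, iced):
    """ATT weight: Z + (1 - Z) * p / (1 - p), where Z indicates whether
    the attempt was actually iced and p is its propensity score."""
    if iced:                  # Z_i = 1: treated attempts keep weight 1
        return 1.0
    return p / (1.0 - p)      # Z_i = 0: up-weight likely-to-be-iced controls

w_treated = att_weight(0.60, iced=True)    # 1.0
w_likely = att_weight(0.60, iced=False)    # ≈ 1.5: counts extra
w_unlikely = att_weight(0.05, iced=False)  # ≈ 0.05: mostly ignored
```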
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;weighted_dt &amp;lt;- p_iced %&amp;gt;% 
  augment(type.predict = &amp;quot;response&amp;quot;, data = fg_attempts) %&amp;gt;%
  mutate(
    w_att = ((.fitted * (is_iced==&amp;#39;Iced&amp;#39;))/.fitted) + 
      ((.fitted*(is_iced != &amp;#39;Iced&amp;#39;))/(1-.fitted))
  )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Before showing the effects of the weighting let’s first look at the unweighted propensity scores:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(weighted_dt, aes(x = .fitted, fill = is_iced)) + 
  geom_density(alpha = .5) + 
  scale_x_continuous(labels = scales::percent) + 
  #scale_y_log10(labels = scales::comma) + 
  scale_fill_manual(values = c(&amp;#39;Iced&amp;#39; = &amp;#39;green&amp;#39;, &amp;#39;Not Iced&amp;#39; = &amp;#39;blue&amp;#39;)) + 
  labs(x = &amp;quot;P(Icing The Kicker)&amp;quot;,
       y = &amp;quot;&amp;quot;,
       title = &amp;quot;Probability of a FG Attempt Being Iced (Unweighted)&amp;quot;,
       fill = &amp;quot;Kicker Iced?&amp;quot;) + 
  cowplot::theme_cowplot() + 
  theme(
    legend.position = &amp;#39;bottom&amp;#39;,
    legend.justification = &amp;#39;center&amp;#39;,
    axis.text.y = element_blank(),
    axis.ticks.y = element_blank()
  )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2022/02/14/does-icing-the-kicker-really-work/index_files/figure-html/unweighted_p_scores-1.png&#34; width=&#34;100%&#34; /&gt;&lt;/p&gt;
&lt;p&gt;This makes it very clear that the distribution of the propensity to be iced differs heavily between the group that was actually iced and the group that was not. It’s also reassuring that the attempts that were actually iced generally have higher propensity scores than those that weren’t.&lt;/p&gt;
&lt;p&gt;Now let’s look at the distribution of propensity scores when taking the weights into account. The distribution of the Iced group is shown in green on top and is unchanged between the pre- and post-weighting versions. On the bottom are the non-iced attempts: the unweighted distribution is shown in grey and the re-weighted distribution in blue. Notice how the re-weighted distribution more closely reflects the distribution of the iced group.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;weighted_dt %&amp;gt;%
  tidyr::spread(is_iced, .fitted, sep = &amp;quot;_p&amp;quot;) %&amp;gt;%
  ggplot() +
  geom_histogram(bins = 50, aes(is_iced_pIced), alpha = 0.5) + 
  geom_histogram(bins = 50, aes(is_iced_pIced, weight = w_att), fill = &amp;quot;green&amp;quot;, alpha = 0.5) + 
  geom_histogram(bins = 50, alpha = 0.5, aes(x = `is_iced_pNot Iced`, y = -..count..)) + 
  geom_histogram(bins = 50, aes(x = `is_iced_pNot Iced`, weight = w_att, y = -..count..), fill = &amp;quot;blue&amp;quot;, alpha = 0.5) + 
  geom_hline(yintercept = 0, lwd = 0.5) +
  scale_y_continuous(label = abs) +
  scale_x_continuous(label = scales::percent) + 
  labs(title = &amp;quot;Post-Weighted Probability of FG Attempt Being Iced&amp;quot;,
       subtitle = &amp;quot;grey is unweighted distribution&amp;quot;,
       x = &amp;quot;P(Icing the Kicker)&amp;quot;,
       y = &amp;quot;# of Attempts&amp;quot;) + 
  theme_minimal() + 
  geom_rect(aes(xmin = 0.45, xmax = .47, ymin = 5, ymax = 100), fill = &amp;quot;#5DB854&amp;quot;) + 
  geom_text(aes(x = 0.46, y = 50), label = &amp;quot;Iced&amp;quot;, angle = 270, color = &amp;quot;white&amp;quot;) + 
  geom_rect(aes(xmin = 0.45, xmax = .47, ymin = -100, ymax = -5), fill = &amp;quot;#5154B8&amp;quot;) + 
  geom_text(aes(x = 0.46, y = -50), label = &amp;quot;Non-Iced&amp;quot;, angle = 270, color = &amp;quot;white&amp;quot;) + 
  coord_cartesian(ylim = c(-100, 100))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2022/02/14/does-icing-the-kicker-really-work/index_files/figure-html/weighted_p_scores-1.png&#34; width=&#34;100%&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;step-3-ensure-the-post-weighted-data-is-no-longer-imbalanced&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Step 3: Ensure the Post-Weighted Data is no longer Imbalanced&lt;/h3&gt;
&lt;p&gt;The next step is to run some diagnostics to make sure that the imbalance we saw in the standardized mean differences at the beginning of this post has gone away. I’m going to use the &lt;code&gt;{{survey}}&lt;/code&gt; package and the &lt;code&gt;tbl_svysummary()&lt;/code&gt; function from &lt;code&gt;{{gtsummary}}&lt;/code&gt; to create a survey design object that incorporates the weights derived above. The &lt;code&gt;ids = ~ 1&lt;/code&gt; argument tells the design object that there are no clusters in this design.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;svy_des &amp;lt;- svydesign(
  ids = ~ 1,
  data = weighted_dt,
  weights = ~ w_att
)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The syntax of &lt;code&gt;tbl_svysummary()&lt;/code&gt; is identical to &lt;code&gt;tbl_summary()&lt;/code&gt; except that it takes the survey design object rather than a data frame. As with the earlier table, I’m adding in the standardized mean difference.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;tbl_svysummary(
  svy_des,
  by = &amp;#39;is_iced&amp;#39;,
  include = c(regulation_time_remaining, attempted_distance, 
              drive_is_home_offense, score_diff, prior_miss, 
              offense_win_prob, is_iced),
  statistic = list(all_continuous() ~ &amp;quot;{mean} ({sd})&amp;quot;,
                   all_dichotomous() ~ &amp;quot;{p}%&amp;quot;),
  digits = list(everything() ~ 2)
) %&amp;gt;% 
  add_difference(everything() ~ &amp;quot;smd&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
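Under the hood, the survey-weighted summary statistics in the table are just weight-scaled versions of the usual formulas; for example, a weighted mean. A minimal sketch with toy numbers:

```python
def weighted_mean(values, weights):
    """Survey-weighted mean: each observation contributes in
    proportion to its weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# With equal weights this is the ordinary mean; up-weighting an
# observation pulls the mean toward it.
m_equal = weighted_mean([1.0, 3.0], [1.0, 1.0])  # 2.0
m_up = weighted_mean([1.0, 3.0], [1.0, 3.0])     # 2.5
```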
&lt;div id=&#34;xuavxqcloz&#34; style=&#34;overflow-x:auto;overflow-y:auto;width:auto;height:auto;&#34;&gt;
&lt;style&gt;html {
  font-family: -apple-system, BlinkMacSystemFont, &#39;Segoe UI&#39;, Roboto, Oxygen, Ubuntu, Cantarell, &#39;Helvetica Neue&#39;, &#39;Fira Sans&#39;, &#39;Droid Sans&#39;, Arial, sans-serif;
}

#xuavxqcloz .gt_table {
  display: table;
  border-collapse: collapse;
  margin-left: auto;
  margin-right: auto;
  color: #333333;
  font-size: 16px;
  font-weight: normal;
  font-style: normal;
  background-color: #FFFFFF;
  width: auto;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #A8A8A8;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #A8A8A8;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
}

#xuavxqcloz .gt_heading {
  background-color: #FFFFFF;
  text-align: center;
  border-bottom-color: #FFFFFF;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
}

#xuavxqcloz .gt_title {
  color: #333333;
  font-size: 125%;
  font-weight: initial;
  padding-top: 4px;
  padding-bottom: 4px;
  border-bottom-color: #FFFFFF;
  border-bottom-width: 0;
}

#xuavxqcloz .gt_subtitle {
  color: #333333;
  font-size: 85%;
  font-weight: initial;
  padding-top: 0;
  padding-bottom: 6px;
  border-top-color: #FFFFFF;
  border-top-width: 0;
}

#xuavxqcloz .gt_bottom_border {
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
}

#xuavxqcloz .gt_col_headings {
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
}

#xuavxqcloz .gt_col_heading {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: normal;
  text-transform: inherit;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: bottom;
  padding-top: 5px;
  padding-bottom: 6px;
  padding-left: 5px;
  padding-right: 5px;
  overflow-x: hidden;
}

#xuavxqcloz .gt_column_spanner_outer {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: normal;
  text-transform: inherit;
  padding-top: 0;
  padding-bottom: 0;
  padding-left: 4px;
  padding-right: 4px;
}

#xuavxqcloz .gt_column_spanner_outer:first-child {
  padding-left: 0;
}

#xuavxqcloz .gt_column_spanner_outer:last-child {
  padding-right: 0;
}

#xuavxqcloz .gt_column_spanner {
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  vertical-align: bottom;
  padding-top: 5px;
  padding-bottom: 5px;
  overflow-x: hidden;
  display: inline-block;
  width: 100%;
}

#xuavxqcloz .gt_group_heading {
  padding: 8px;
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  text-transform: inherit;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: middle;
}

#xuavxqcloz .gt_empty_group_heading {
  padding: 0.5px;
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  vertical-align: middle;
}

#xuavxqcloz .gt_from_md &gt; :first-child {
  margin-top: 0;
}

#xuavxqcloz .gt_from_md &gt; :last-child {
  margin-bottom: 0;
}

#xuavxqcloz .gt_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  margin: 10px;
  border-top-style: solid;
  border-top-width: 1px;
  border-top-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: middle;
  overflow-x: hidden;
}

#xuavxqcloz .gt_stub {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  text-transform: inherit;
  border-right-style: solid;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
  padding-left: 12px;
}

#xuavxqcloz .gt_summary_row {
  color: #333333;
  background-color: #FFFFFF;
  text-transform: inherit;
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
}

#xuavxqcloz .gt_first_summary_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
}

#xuavxqcloz .gt_grand_summary_row {
  color: #333333;
  background-color: #FFFFFF;
  text-transform: inherit;
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
}

#xuavxqcloz .gt_first_grand_summary_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  border-top-style: double;
  border-top-width: 6px;
  border-top-color: #D3D3D3;
}

#xuavxqcloz .gt_striped {
  background-color: rgba(128, 128, 128, 0.05);
}

#xuavxqcloz .gt_table_body {
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
}

#xuavxqcloz .gt_footnotes {
  color: #333333;
  background-color: #FFFFFF;
  border-bottom-style: none;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
}

#xuavxqcloz .gt_footnote {
  margin: 0px;
  font-size: 90%;
  padding: 4px;
}

#xuavxqcloz .gt_sourcenotes {
  color: #333333;
  background-color: #FFFFFF;
  border-bottom-style: none;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
}

#xuavxqcloz .gt_sourcenote {
  font-size: 90%;
  padding: 4px;
}

#xuavxqcloz .gt_left {
  text-align: left;
}

#xuavxqcloz .gt_center {
  text-align: center;
}

#xuavxqcloz .gt_right {
  text-align: right;
  font-variant-numeric: tabular-nums;
}

#xuavxqcloz .gt_font_normal {
  font-weight: normal;
}

#xuavxqcloz .gt_font_bold {
  font-weight: bold;
}

#xuavxqcloz .gt_font_italic {
  font-style: italic;
}

#xuavxqcloz .gt_super {
  font-size: 65%;
}

#xuavxqcloz .gt_footnote_marks {
  font-style: italic;
  font-weight: normal;
  font-size: 65%;
}
&lt;/style&gt;
&lt;table class=&#34;gt_table&#34;&gt;
  
  &lt;thead class=&#34;gt_col_headings&#34;&gt;
    &lt;tr&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_left&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;&lt;strong&gt;Characteristic&lt;/strong&gt;&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;&lt;strong&gt;Not Iced&lt;/strong&gt;, N = 800&lt;sup class=&#34;gt_footnote_marks&#34;&gt;1&lt;/sup&gt;&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;&lt;strong&gt;Iced&lt;/strong&gt;, N = 804&lt;sup class=&#34;gt_footnote_marks&#34;&gt;1&lt;/sup&gt;&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;&lt;strong&gt;Difference&lt;/strong&gt;&lt;sup class=&#34;gt_footnote_marks&#34;&gt;2&lt;/sup&gt;&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;&lt;strong&gt;95% CI&lt;/strong&gt;&lt;sup class=&#34;gt_footnote_marks&#34;&gt;2,3&lt;/sup&gt;&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody class=&#34;gt_table_body&#34;&gt;
    &lt;tr&gt;&lt;td class=&#34;gt_row gt_left&#34;&gt;regulation_time_remaining&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;1,187.16 (875.67)&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;1,170.94 (898.40)&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;0.02&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;-0.08, 0.12&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td class=&#34;gt_row gt_left&#34;&gt;attempted_distance&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;38.76 (9.34)&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;38.69 (9.85)&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;0.01&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;-0.09, 0.11&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td class=&#34;gt_row gt_left&#34;&gt;drive_is_home_offense&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;51.18%&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;50.87%&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;0.01&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;-0.09, 0.10&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td class=&#34;gt_row gt_left&#34;&gt;score_diff&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;0.63 (11.03)&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;-0.25 (10.94)&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;0.08&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;-0.02, 0.18&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td class=&#34;gt_row gt_left&#34;&gt;prior_miss&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;17.42%&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;17.66%&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;-0.01&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;-0.10, 0.09&lt;/td&gt;&lt;/tr&gt;
    &lt;tr&gt;&lt;td class=&#34;gt_row gt_left&#34;&gt;offense_win_prob&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;0.49 (0.26)&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;0.49 (0.24)&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;0.00&lt;/td&gt;
&lt;td class=&#34;gt_row gt_center&#34;&gt;-0.10, 0.10&lt;/td&gt;&lt;/tr&gt;
  &lt;/tbody&gt;
  
  &lt;tfoot&gt;
    &lt;tr class=&#34;gt_footnotes&#34;&gt;
      &lt;td colspan=&#34;5&#34;&gt;
        &lt;p class=&#34;gt_footnote&#34;&gt;
          &lt;sup class=&#34;gt_footnote_marks&#34;&gt;
            &lt;em&gt;1&lt;/em&gt;
          &lt;/sup&gt;
           
          Mean (SD); %
          &lt;br /&gt;
        &lt;/p&gt;
        &lt;p class=&#34;gt_footnote&#34;&gt;
          &lt;sup class=&#34;gt_footnote_marks&#34;&gt;
            &lt;em&gt;2&lt;/em&gt;
          &lt;/sup&gt;
           
          Standardized Mean Difference
          &lt;br /&gt;
        &lt;/p&gt;
        &lt;p class=&#34;gt_footnote&#34;&gt;
          &lt;sup class=&#34;gt_footnote_marks&#34;&gt;
            &lt;em&gt;3&lt;/em&gt;
          &lt;/sup&gt;
           
          CI = Confidence Interval
          &lt;br /&gt;
        &lt;/p&gt;
      &lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tfoot&gt;
&lt;/table&gt;
&lt;/div&gt;
&lt;p&gt;Notice that all of the SMDs are now below the 0.1 threshold.&lt;/p&gt;
&lt;p&gt;Another way to visualize the change in SMDs is with a Love Plot. Named after Dr. Thomas E. Love, the Love Plot summarizes covariate balance before and after weighting. In the first code block, I calculate both the weighted and unweighted standardized mean differences using the &lt;code&gt;{{smd}}&lt;/code&gt; package. In each call to &lt;code&gt;smd()&lt;/code&gt;, I pass in the variable, the group variable, and, in the case of the weighted version, the weights.&lt;/p&gt;
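For reference, the standardized mean difference underlying both the balance table and the Love Plot is the difference in group means divided by a pooled standard deviation. An unweighted sketch with toy data (the post itself relies on the `{smd}` package):

```python
import math

def smd(treated, control):
    """(mean_t - mean_c) / pooled SD, pooling the two group variances."""
    def mean(xs):
        return sum(xs) / len(xs)
    def var(xs):
        m = mean(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    pooled_sd = math.sqrt((var(treated) + var(control)) / 2)
    return (mean(treated) - mean(control)) / pooled_sd

balanced = smd([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])  # 0.0: perfectly balanced
shifted = smd([5.0, 6.0, 7.0], [1.0, 2.0, 3.0])   # 4.0: heavily imbalanced
```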
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;smds &amp;lt;- weighted_dt %&amp;gt;%
  # Calculate the SMD for Each Variable
  summarise(
    # List Variables to run functions
    across(c(regulation_time_remaining, attempted_distance, 
             drive_is_home_offense, score_diff, prior_miss,
             offense_win_prob),
           # List functions
           list(
             unweighted = ~smd(.x, is_iced, na.rm = T)$estimate, 
             weighted = ~smd(.x, is_iced, w_att, na.rm = T)$estimate 
           ),
           # Assign how the naming will show up in the output
           # Assign placeholder _zzz_ to split on in the next step
           .names = &amp;quot;{.col}_zzz_{.fn}&amp;quot;)
  )

smds %&amp;gt;% 
  pivot_longer( 
    everything(),
    values_to = &amp;quot;SMD&amp;quot;, 
    names_to = c(&amp;quot;variable&amp;quot;, &amp;quot;Method&amp;quot;), 
    names_sep = &amp;quot;_zzz_&amp;quot;
  ) %&amp;gt;%
  ggplot(
    aes(x = abs(SMD), y = variable, group = Method, color = Method)
  ) +  
  geom_line(orientation = &amp;quot;y&amp;quot;) +
  geom_point() + 
  geom_vline(xintercept = 0.1, color = &amp;quot;black&amp;quot;, size = 0.1) + 
  labs(title = &amp;quot;Love Plot Pre/Post Weighting&amp;quot;,
       subtitle = &amp;quot;Post-Weighted Variables are All Balanced&amp;quot;,
       y= &amp;quot;&amp;quot;) + 
  cowplot::theme_cowplot()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2022/02/14/does-icing-the-kicker-really-work/index_files/figure-html/unnamed-chunk-7-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;The Love Plot clearly shows that the weighting has corrected the imbalances we saw in the unweighted version, since all variables are now below 0.1. So it looks like the propensity score weighting has done its job.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;step-4-calculate-the-att&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Step 4: Calculate the ATT&lt;/h3&gt;
&lt;p&gt;The final step is to calculate the average treatment effect on the treated by regressing our outcome variable (Field Goal Made) on the “treatment” (whether the kick was iced or not), weighted by the weights derived above. I’m using a linear probability model for convenience so that the coefficient is more easily interpretable, but there is a case to be made for using a logistic regression since &lt;em&gt;fg_made&lt;/em&gt; is binary.&lt;/p&gt;
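With a single binary regressor, the coefficient from this weighted linear probability model equals the weighted difference in success rates between the iced and non-iced groups. A sketch with toy data (illustrative only, not the post’s data):

```python
def weighted_att(made, iced, weights):
    """Weighted FG success rate of iced attempts minus the weighted
    success rate of non-iced attempts."""
    def group_rate(flag):
        num = sum(m * w for m, z, w in zip(made, iced, weights) if z == flag)
        den = sum(w for z, w in zip(iced, weights) if z == flag)
        return num / den
    return group_rate(True) - group_rate(False)

# Two iced attempts (one made) vs. two non-iced attempts (both made,
# one up-weighted): 0.5 - 1.0 = -0.5
att = weighted_att([1, 0, 1, 1], [True, True, False, False], [1.0, 1.0, 2.0, 1.0])
```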
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;final.model &amp;lt;- lm(fg_made ~ is_iced, data = weighted_dt, weights = w_att)

tidy(final.model, conf.int = T) %&amp;gt;%
  select(term, estimate, conf.low, conf.high) %&amp;gt;% 
  mutate(across(where(is.numeric), ~scales::percent(.x, accuracy = .01))) %&amp;gt;% 
  knitr::kable(align = &amp;#39;c&amp;#39;)&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;center&#34;&gt;term&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;estimate&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;conf.low&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;conf.high&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;(Intercept)&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;70.51%&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;69.58%&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;71.44%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;is_icedIced&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;-2.85%&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;-4.16%&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;-1.54%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;From this model it looks like icing the kicker results in a 2.85 percentage point decrease in the success rate, and the confidence interval from the linear model would suggest that the effect is statistically significant. &lt;em&gt;However&lt;/em&gt;, the confidence intervals generated above are overly optimistic because the weighted observations are treated as separate individuals rather than as weights. In order to get more robust confidence intervals, I’ll use bootstrapping to redo the entire process 1,000 times. The following function carries out the entire process from above (propensity score -&amp;gt; weights -&amp;gt; outcome model).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#### Bootstrapping Estimates
fit_ipw &amp;lt;- function(split, ...) { 
  .df &amp;lt;- analysis(split) 
  
  # fit propensity score model
  propensity_model &amp;lt;- glm(
    is_iced ~ regulation_time_remaining + attempted_distance + 
             drive_is_home_offense + I(score_diff^2)  + prior_miss + offense_win_prob, 
    family = binomial(), 
    data = .df
  )
  
  # calculate inverse probability weights
  .df &amp;lt;- propensity_model %&amp;gt;% 
    augment(type.predict = &amp;quot;response&amp;quot;, data = .df) %&amp;gt;%
    mutate(
      w_att = ((.fitted * (is_iced==&amp;#39;Iced&amp;#39;))/.fitted) + 
      ((.fitted*(is_iced != &amp;#39;Iced&amp;#39;))/(1-.fitted))
    )
  
  # fit correctly bootstrapped ipw model
  lm(fg_made ~ is_iced, data = .df, weights = w_att) %&amp;gt;%
    tidy()
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The bootstrapping will be done using the &lt;code&gt;{{rsample}}&lt;/code&gt; package and its &lt;code&gt;bootstraps()&lt;/code&gt; function. In the call I ask for 1,000 bootstrapped samples (the &lt;code&gt;apparent = TRUE&lt;/code&gt; option adds a 1,001st sample that is the entire data set). Then I apply the above function to every bootstrapped sample through &lt;code&gt;{{purrr}}&lt;/code&gt;’s &lt;code&gt;map()&lt;/code&gt; function.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# fit ipw model to bootstrapped samples
set.seed(20220130)
ipw_results &amp;lt;- bootstraps(fg_attempts, 1000, apparent = TRUE) %&amp;gt;% 
  mutate(results = map(splits, fit_ipw))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Finally, the &lt;code&gt;int_t()&lt;/code&gt; function generates confidence intervals from the t-distribution based on the results of the 1,000 bootstrapped samples.&lt;/p&gt;
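For intuition, a simple t-style interval from bootstrap replicates takes the mean replicate estimate plus or minus a critical value times the standard deviation of the replicates. The sketch below is a deliberate simplification of what `int_t()` does (it computes studentized bootstrap-t intervals), shown with made-up replicate values:

```python
import statistics

def simple_boot_ci(estimates, z=1.96):
    """Approximate 95% CI: mean of the bootstrap replicate estimates
    plus/minus z times their standard deviation."""
    est = statistics.mean(estimates)
    se = statistics.stdev(estimates)
    return est - z * se, est, est + z * se

lo, est, hi = simple_boot_ci([1.0, 2.0, 3.0])  # ≈ (0.04, 2.0, 3.96)
```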
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# get t-statistic-based CIs
int_t(ipw_results, results) %&amp;gt;%
  filter(term == &amp;quot;is_icedIced&amp;quot;) %&amp;gt;% 
  select(term, .lower, .estimate, .upper) %&amp;gt;% 
  mutate(across(where(is.numeric), ~scales::percent(.x, accuracy = .01))) %&amp;gt;%
  knitr::kable(align = &amp;#39;c&amp;#39;)&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;center&#34;&gt;term&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;.lower&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;.estimate&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;.upper&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;is_icedIced&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;-5.88%&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;-2.82%&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.50%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;From the bootstrapped results, we have a similar point estimate of -2.82%, which is much smaller than the 8.3% seen in the naive analysis, but the confidence interval now spans from -5.88% to 0.50%, making the result not significantly different from zero.&lt;/p&gt;
&lt;p&gt;So in conclusion, we can’t definitively say that icing the kicker is actually harmful to the kicker’s success, which seems consistent with other studies that find it is either ineffective or only mildly effective at best.&lt;/p&gt;
&lt;p&gt;In the next post in this series, I’ll be looking at alternative causal inference methodologies like G-computation and targeted maximum likelihood estimation (TMLE) to see if the results are similar or different from the results of this post.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Predicting When Kickers Get Iced with {tidymodels}</title>
      <link>https://jlaw.netlify.app/2022/01/24/predicting-when-kickers-get-iced-with-tidymodels/</link>
      <pubDate>Mon, 24 Jan 2022 00:00:00 +0000</pubDate>
      <guid>https://jlaw.netlify.app/2022/01/24/predicting-when-kickers-get-iced-with-tidymodels/</guid>
      <description>
&lt;script src=&#34;https://jlaw.netlify.app/2022/01/24/predicting-when-kickers-get-iced-with-tidymodels/index_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;p&gt;I’m constantly on the lookout for things I can use for future posts for this blog. My goal is usually two-fold. First, what is a tool or technique I want to try/learn and second is there an interesting data set that I can use with those tools. I’d been wanting to play around with {tidymodels} for a while but hadn’t found the right problem. Watching some of the NCAA bowl games over the winter break finally provided me with a use-case. My original question of &lt;strong&gt;whether icing the kicker really works?&lt;/strong&gt; will be explored in a future post but it led to the question for this post which will explore &lt;strong&gt;predicting when coaches will choose to ice the kicker&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;This post will explore the data gathering process from the &lt;a href=&#34;https://collegefootballdata.com/&#34;&gt;College Football Database&lt;/a&gt;, the modeling process using &lt;code&gt;tidymodels&lt;/code&gt;, and explaining the model using tools such as variable importance plots, partial dependency plots, and SHAP values.&lt;/p&gt;
&lt;p&gt;Huge thanks to &lt;a href=&#34;https://juliasilge.com&#34;&gt;Julia Silge&lt;/a&gt; whose numerous blog posts on tidymodels were instrumental as a resource for learning the ecosystem.&lt;/p&gt;
&lt;div id=&#34;part-i-data-gathering&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Part I: Data Gathering&lt;/h2&gt;
&lt;p&gt;To determine whether a potential field goal attempt will get iced, I’ll need data on each field goal attempt, a definition of what &lt;strong&gt;&lt;em&gt;icing the kicker&lt;/em&gt;&lt;/strong&gt; means, and other features that would be predictive of whether a kicker will be iced.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://en.wikipedia.org/wiki/Icing_the_kicker&#34;&gt;Wikipedia&lt;/a&gt; defines “icing the kicker” as “the act of calling a timeout immediately prior to the snap in order to disrupt the process of kicking a field goal”. Therefore, we’ll define a field goal attempt as being iced if a timeout is called by the defense directly before it.&lt;/p&gt;
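&lt;p&gt;As a quick sketch of that definition on a made-up set of plays (base R, not the real API data), lagging the play type and drive within the game and checking for a same-drive timeout flags the iced attempt:&lt;/p&gt;

```r
# Toy illustration with made-up plays: an attempt counts as "iced" when the
# immediately prior play in the same drive was a Timeout.
plays = data.frame(
  drive_id  = c(1, 1, 2, 2, 2),
  play_type = c("Rush", "Field Goal Good", "Pass", "Timeout", "Field Goal Good")
)
# lag each column by one play (NA for the first play of the game)
prior_type  = c(NA, head(plays$play_type, -1))
prior_drive = c(NA, head(plays$drive_id, -1))
# treat the first play as having no prior play
prior_type[is.na(prior_type)]   = ""
prior_drive[is.na(prior_drive)] = -1
# iced = prior play was a Timeout in the same drive (product of two logicals)
plays$is_iced = (prior_drive == plays$drive_id) * (prior_type == "Timeout")
plays[plays$play_type == "Field Goal Good", c("drive_id", "is_iced")]
# the drive 1 attempt is not iced; the drive 2 attempt (after a Timeout) is
```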
&lt;p&gt;The data for this post comes from the &lt;a href=&#34;https://collegefootballdata.com/&#34;&gt;College Football Database&lt;/a&gt;. More details on this API can be found in my earlier post on &lt;a href=&#34;https://jlaw.netlify.app/2021/12/27/exploring-college-football-non-conference-rivalries-with-ggraph/&#34;&gt;Exploring Non-Conference Rivalries&lt;/a&gt;, so the set-up will not be covered here. Play-by-play data from any game can be accessed through the &lt;code&gt;cfbd_pbp_data()&lt;/code&gt; function.&lt;/p&gt;
&lt;p&gt;Looking at the returned data, the features that I’ll explore as potentially predictive are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Regulation Time Remaining in the Game (or if the game is in overtime)&lt;/li&gt;
&lt;li&gt;Distance of the Field Goal Attempt&lt;/li&gt;
&lt;li&gt;The Score Difference&lt;/li&gt;
&lt;li&gt;Whether the kicking team is the home team&lt;/li&gt;
&lt;li&gt;Whether the kicking team has missed earlier in the game&lt;/li&gt;
&lt;li&gt;The pre-game winning probability of the kicking team (to assess whether the game is expected to be close)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The packages needed for the data gathering process are &lt;code&gt;tidyverse&lt;/code&gt; for data manipulation and &lt;code&gt;cfbfastR&lt;/code&gt; to access the API.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(cfbfastR)
library(tidyverse)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For convenience I’ll be looking at NCAA regular season football games between 2013 and 2021. The API notes that prior to the College Football Playoff in 2014 the regular season was weeks 1-14, and since then it’s been weeks 1 to 15. To create a loop of the weeks and years to pass to the data pull function, I’ll use &lt;code&gt;expand.grid()&lt;/code&gt; to create all combinations of weeks and years and then apply a filter to keep only valid weeks.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;grid &amp;lt;- expand.grid(
  year = 2013:2021,
  week = 1:15
) %&amp;gt;%
  arrange(year, week) %&amp;gt;%
  # Before 2014 there were only 14 regular season weeks
  filter(year &amp;gt; 2014 | week &amp;lt;= 14) &lt;/code&gt;&lt;/pre&gt;
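&lt;p&gt;As a quick sanity check (a base R sketch, using &lt;code&gt;week != 15&lt;/code&gt; in place of the equivalent &lt;code&gt;week &amp;lt;= 14&lt;/code&gt; given weeks run 1 to 15), the filter as written keeps 14 weeks for the 2013 and 2014 seasons and 15 weeks for 2015 through 2021, so the grid should contain 133 year/week pairs:&lt;/p&gt;

```r
# Sanity check on the week/year grid: 2 seasons with 14 weeks each plus
# 7 seasons with 15 weeks each is 28 + 105 = 133 API pulls.
grid = expand.grid(year = 2013:2021, week = 1:15)
grid = grid[grid$year > 2014 | grid$week != 15, ]
nrow(grid)  # 133
```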
&lt;p&gt;The API does provide options to specify which types of plays to return. However, to determine whether a timeout was called immediately before an attempt, I’ll need to pull the data for EVERY play to accurately apply a lag function. Since I don’t ultimately want to keep every play, I’ll create a function to handle the API call and some post-processing, using the grid of weeks and years above as inputs. I use &lt;code&gt;map2_dfr()&lt;/code&gt; from &lt;code&gt;purrr&lt;/code&gt; to iterate over the two parameters.&lt;/p&gt;
&lt;p&gt;The call to &lt;code&gt;cfbd_pbp_data()&lt;/code&gt; with week and year parameters will return the play-by-play data for every game in that week. To process the data I subset to relevant columns and create some lagged columns to determine the time each play started (since the time in the data reflects the end of the play) and the plays that came immediately before. The lagged variables are used to define the dependent variable &lt;em&gt;is_iced&lt;/em&gt;: if the prior play was a timeout called by the defensive team during the same drive, then we’ll consider the attempt to be iced.&lt;/p&gt;
&lt;p&gt;Then I create some additional values that will be used in the modeling, subset my data to only be field goal attempts (and remove any duplicated rows that unfortunately exist), and create the variable for whether the kicking team had a prior miss in the game.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;###Get Play by Play Data
fg_attempts &amp;lt;- map2_dfr(grid$year, grid$week, function(year, week){
  
  
  plays &amp;lt;- cfbd_pbp_data(year=year, week=week, season_type = &amp;#39;regular&amp;#39;) %&amp;gt;%
    group_by(game_id) %&amp;gt;%
    arrange(id_play, .by_group = TRUE) %&amp;gt;% 
    #Subset to only relevant columns
    select(offense_play, defense_play, home, away, 
           drive_start_offense_score, drive_start_defense_score,
           game_id, drive_id, drive_number, play_number,
           period, clock.minutes, clock.seconds, yard_line, yards_gained,
           play_type, play_text, id_play,
           drive_is_home_offense, 
           offense_timeouts,
           defense_timeouts,
           season, wk) %&amp;gt;% 
    mutate(
      # Get prior play end time to use as current play start time
      play_start_mins = lag(clock.minutes),
      play_start_secs = lag(clock.seconds),
      # Get previous plays
      lag_play_type = lag(play_type),
      lag_play_text = lag(play_text),
      
      # Create Other Variables
      is_iced = coalesce(
        if_else(
          # If the same drive, the immediately prior play was a timeout 
          # called by the defensive team
          drive_id == lag(drive_id) &amp;amp; 
            play_number - 1 == lag(play_number) &amp;amp; 
            lag_play_type == &amp;#39;Timeout&amp;#39; &amp;amp;
            str_detect(str_to_lower(lag_play_text), str_to_lower(defense_play)),
          1,
          0
        ), 
        0
      ),
      score_diff = drive_start_offense_score - drive_start_defense_score,
      time_remaining_secs = 60*play_start_mins + play_start_secs,
      fg_made = if_else(play_type == &amp;#39;Field Goal Good&amp;#39;, 1, 0)
    ) %&amp;gt;% 
    ungroup() %&amp;gt;% 
    ## Keep only Field Goal Attempt Plays
    filter(str_detect(play_type, &amp;#39;Field Goal&amp;#39;),
           !str_detect(play_type, &amp;#39;Blocked&amp;#39;)) %&amp;gt;%
    #Distinct Out Bad Rows
    distinct(game_id, drive_id, period, clock.minutes, clock.seconds, play_type, play_text,
             .keep_all = T) %&amp;gt;%
    ## Determine if the offensive team has missed a field goal once already during the game
    group_by(game_id, offense_play) %&amp;gt;% 
    mutate(min_miss = min(if_else(play_type == &amp;#39;Field Goal Missed&amp;#39;, id_play, NA_character_), na.rm = T),
           prior_miss = if_else(id_play &amp;lt;= min_miss | is.na(min_miss), 0, 1)
    ) %&amp;gt;% 
    ungroup()
  }
)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Getting the offensive win probabilities has to come from a separate function, &lt;code&gt;cfbd_metrics_wp_pregame()&lt;/code&gt;. This function can return a season’s worth of data given only the year. Using &lt;code&gt;map_dfr()&lt;/code&gt; with the years 2013 to 2021 will return this data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;betting_lines &amp;lt;- map_dfr(unique(grid$year), ~cfbd_metrics_wp_pregame(year = .x, season_type = &amp;#39;regular&amp;#39;))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The final step is adding the win probability data to the play-by-play data by joining the two data sets and assigning the home or away win probability to the offensive team on each play. I then do some final data cleaning: ensuring timeouts remaining are never negative, extracting the attempted distance of the field goal from the play-by-play string, and defining the regulation time remaining. The last step is removing attempts where icing the kicker would be impossible. Since the defense needs a timeout to be able to ice, any attempt where the defense has no timeouts gets excluded.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;fg_data &amp;lt;- fg_attempts %&amp;gt;%
  inner_join(betting_lines %&amp;gt;%
               select(game_id, home_win_prob, away_win_prob),
             by = &amp;quot;game_id&amp;quot;) %&amp;gt;%
  mutate(offense_win_prob = if_else(offense_play == home, home_win_prob, away_win_prob),
         defense_timeouts = pmax(defense_timeouts, 0),
         regulation_time_remaining = if_else(
           period &amp;gt; 4, 0, (4-period)*900+pmin(time_remaining_secs, 900)),
         attempted_distance = coalesce(str_extract(play_text, &amp;#39;\\d+&amp;#39;) %&amp;gt;% as.numeric(),
                                       yards_gained)
         ) %&amp;gt;%
  #Need to Ensure that Icing Could Occur
  filter(defense_timeouts &amp;gt; 0 | is_iced)&lt;/code&gt;&lt;/pre&gt;
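&lt;p&gt;The &lt;em&gt;regulation_time_remaining&lt;/em&gt; expression can be spot-checked with a small standalone helper (a base R sketch mirroring the &lt;code&gt;mutate()&lt;/code&gt; logic above): overtime periods map to 0, otherwise it is the remaining full quarters (900 seconds each) plus the clock time left in the current quarter, capped at 900.&lt;/p&gt;

```r
# Standalone version of the regulation_time_remaining calculation above
reg_time_remaining = function(period, secs_left_in_quarter) {
  ifelse(period > 4, 0, (4 - period) * 900 + pmin(secs_left_in_quarter, 900))
}
reg_time_remaining(3, 441)  # 7:21 left in Q3 -> 900 + 441 = 1341 seconds
reg_time_remaining(4,  30)  # 30 seconds left in the game
reg_time_remaining(5, 600)  # overtime -> 0
```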
&lt;p&gt;The result of this is a dataset of 19,072 field goal attempts covering 6,435 games over 9 seasons. Of the 19,072 attempts, 804 (4%) would be considered &lt;em&gt;iced&lt;/em&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;part-2-building-the-model&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Part 2: Building the Model&lt;/h2&gt;
&lt;p&gt;Normally, I would do some EDA to better understand the data set, but in the interest of word count I’ll jump right into using &lt;code&gt;tidymodels&lt;/code&gt; to predict whether a given field goal attempt will be iced. In order to make the data work with the XGBoost algorithm, I’ll subset the data and convert some numeric variables, including our dependent variable, to factors. A frustrating thing I learned in writing this post is that with a factor dependent variable, the assumption is that the first level is the positive class; I’m recoding &lt;em&gt;is_iced&lt;/em&gt; to reflect that. The libraries I’ll be working with for the modeling section are &lt;code&gt;tidymodels&lt;/code&gt; for nearly everything, &lt;code&gt;themis&lt;/code&gt; to use SMOTE to attempt to correct the class imbalance, and &lt;code&gt;finetune&lt;/code&gt; to run the &lt;code&gt;tune_race&lt;/code&gt; option.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidymodels)
library(themis)
library(finetune)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;model_data &amp;lt;- fg_data %&amp;gt;%
  transmute(
    regulation_time_remaining,
    attempted_distance,
    drive_is_home_offense = if_else(drive_is_home_offense, 1, 0),
    score_diff,
    prior_miss = if_else(prior_miss==1, &amp;#39;yes&amp;#39;, &amp;#39;no&amp;#39;),
    offense_win_prob,
    is_overtime = if_else(period &amp;gt; 4, 1, 0),
    is_iced = factor(is_iced, levels = c(1, 0), labels = c(&amp;#39;iced&amp;#39;, &amp;#39;not_iced&amp;#39;))
  )&lt;/code&gt;&lt;/pre&gt;
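&lt;p&gt;A tiny base R check of that factor coding shows why the levels are listed as &lt;code&gt;c(1, 0)&lt;/code&gt;: yardstick’s class metrics default to treating the first factor level as the event, so “iced” (the positive class) has to come first.&lt;/p&gt;

```r
# {yardstick} treats the FIRST factor level as the event by default, so the
# levels are ordered to put "iced" (the positive class) first.
is_iced = factor(c(1, 0, 0, 1), levels = c(1, 0), labels = c("iced", "not_iced"))
levels(is_iced)  # "iced" "not_iced" -- "iced" is the event level
```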
&lt;p&gt;One of the powerful pieces of the &lt;code&gt;tidymodels&lt;/code&gt; ecosystem is that it’s possible to try out different pre-processing recipes and model specifications with ease. For example, this dataset is heavily class imbalanced, so I can easily try two versions of the model: one that attempts to correct for this and one that does not. To assess how good a job my model does at predicting future data, I’ll split my data into a training set and a test set, stratifying on &lt;em&gt;is_iced&lt;/em&gt; to ensure the dependent variable is balanced across the slices. The &lt;code&gt;initial_split()&lt;/code&gt; function creates the split with a default proportion of 75%, and &lt;code&gt;training()&lt;/code&gt; and &lt;code&gt;testing()&lt;/code&gt; extract the data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(20220102)
ice_split &amp;lt;- initial_split(model_data, strata = is_iced)
ice_train &amp;lt;- training(ice_split)
ice_test &amp;lt;- testing(ice_split)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;One thing to note is that XGBoost has many tuning parameters so I’ll use cross-validation to figure out the best combination of hyper-parameters. The &lt;code&gt;vfold_cv()&lt;/code&gt; function will take the training data and split it into 5 folds again stratifying by the &lt;em&gt;is_iced&lt;/em&gt; variable.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;train_5fold &amp;lt;- ice_train %&amp;gt;%
  vfold_cv(5, strata = is_iced)&lt;/code&gt;&lt;/pre&gt;
&lt;div id=&#34;tidymodels&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Tidymodels&lt;/h3&gt;
&lt;p&gt;My interpretation of the building blocks is: &lt;strong&gt;recipes&lt;/strong&gt;, which handle how data should be pre-processed; &lt;strong&gt;specifications&lt;/strong&gt;, which tell &lt;code&gt;tidymodels&lt;/code&gt; which algorithms and parameters to use; and &lt;strong&gt;workflows&lt;/strong&gt;, which bring them together. Since I’ve done most of the pre-processing in the data gathering piece, these recipes will be pretty vanilla. However, this data is &lt;em&gt;heavily&lt;/em&gt; imbalanced, with only 4% of attempts being iced, so I will have two recipes. The first sets up the formula and one-hot encodes the categorical variables.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;rec_norm &amp;lt;- recipe(is_iced ~ ., data = ice_train) %&amp;gt;%
  step_dummy(all_nominal_predictors(), one_hot =T) &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;and the second will add a second step that uses &lt;code&gt;step_smote()&lt;/code&gt; to create new examples of the minority class to fix the class imbalance problem.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;rec_smote &amp;lt;- recipe(is_iced ~ ., data = ice_train) %&amp;gt;%
  step_dummy(all_nominal_predictors(), one_hot = T) %&amp;gt;%
  step_smote(is_iced) &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then I’ll define my specification. The hyper-parameters that I want to tune are set to &lt;code&gt;tune()&lt;/code&gt; and then I tell {tidymodels} that I want to use XGBoost for a classification problem.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;xg_spec &amp;lt;- boost_tree(
  trees = tune(), 
  tree_depth = tune(), 
  min_n = tune(), 
  loss_reduction = tune(),                     
  sample_size = tune(), 
  mtry = tune(),         
  learn_rate = tune(), 
  stop_iter = tune()
) %&amp;gt;% 
  set_engine(&amp;quot;xgboost&amp;quot;) %&amp;gt;% 
  set_mode(&amp;quot;classification&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The recipes and specifications are combined in a workflow (if using 1 recipe and 1 specification) or a workflow set if wanting to use different combinations. In the &lt;code&gt;workflow_set()&lt;/code&gt; function you can specify a list of recipes as &lt;em&gt;preproc&lt;/em&gt; and a list of specifications as &lt;em&gt;models&lt;/em&gt;. Setting the &lt;em&gt;cross&lt;/em&gt; parameter to true creates every possible combination. For this analysis I’ll have 2 preproc/model combinations:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;wf_sets &amp;lt;- workflow_set(
  preproc = list(norm = rec_norm, 
                 smote = rec_smote),
  models = list(vanilla = xg_spec),
  cross = T
)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The next step is setting up the grid of parameters that will be tried in the model specifications above. Since there are a lot of parameters to be tuned and I don’t want this to run forever, I’m using &lt;code&gt;grid_latin_hypercube()&lt;/code&gt; to set 100 combinations of parameters that try to cover the entire parameter space without running every combination.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;grid &amp;lt;- grid_latin_hypercube(
  trees(),
  tree_depth(),
  min_n(),
  loss_reduction(),
  sample_size = sample_prop(),
  finalize(mtry(), ice_train),
  learn_rate(),
  stop_iter(range = c(10L, 50L)),
  size = 100
)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Finally it’s time to train the various workflows that have been designed. To do this I’ll pass the workflow set defined above into the &lt;code&gt;workflow_map()&lt;/code&gt; function. The “tune_race_anova” specification tells the training process to abandon certain hyper-parameter values if they’re not showing value. More detail can be found in &lt;a href=&#34;https://juliasilge.com/blog/baseball-racing/&#34;&gt;Julia Silge’s post&lt;/a&gt;. Also passed into this function are the resamples generated from the 5 folds, the grid of parameters, and a control set that will save the predictions and workflows so that I can revisit them later on. Finally, I create a metric set of the performance metrics I want to calculate, here choosing F1, accuracy, ROC AUC, mean log loss, PR AUC, precision, and recall.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#Set up Multiple Cores
doParallel::registerDoParallel(cores = 4)


tuned_results &amp;lt;- wf_sets %&amp;gt;% 
  workflow_map(
    &amp;quot;tune_race_anova&amp;quot;,
    resamples = train_5fold,
    grid = grid,
    control = control_race(save_pred = TRUE,
                           parallel_over = &amp;quot;everything&amp;quot;,
                           save_workflow = TRUE),
    metrics = metric_set(f_meas, accuracy, roc_auc, mn_log_loss, pr_auc, precision, recall),
    seed = 20210109
  )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Tidymodels has an &lt;code&gt;autoplot()&lt;/code&gt; function which will plot the best scoring model runs for each metric. However, I want a little more customization than what that function (or at least what I know of that function) provides. Using &lt;code&gt;map_dfr()&lt;/code&gt; I’m going to stack the top model for each specification for each of the performance metrics on top of each other, using &lt;code&gt;rank_results()&lt;/code&gt; to get the top model for each config for each metric.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;perf_stats &amp;lt;- map_dfr(c(&amp;#39;accuracy&amp;#39;, &amp;#39;roc_auc&amp;#39;, &amp;#39;mn_log_loss&amp;#39;, &amp;#39;pr_auc&amp;#39;, &amp;#39;f_meas&amp;#39;,
                        &amp;#39;precision&amp;#39;, &amp;#39;recall&amp;#39;),
                ~rank_results(tuned_results, rank_metric = .x, select_best = T) %&amp;gt;% 
                filter(.metric == .x) 
        )

perf_stats %&amp;gt;% 
  ggplot(aes(x = wflow_id, color = wflow_id, y = mean)) +
    geom_pointrange(aes(y = mean, ymin = mean - 1.96*std_err, ymax = mean + 1.96*std_err)) + 
    facet_wrap(~.metric, scales = &amp;quot;free_y&amp;quot;) + 
    scale_color_discrete(guide = &amp;#39;none&amp;#39;) + 
    labs(title = &amp;quot;Performance Metric for Tuned Results&amp;quot;,
         x = &amp;quot;Model Spec&amp;quot;,
         y = &amp;quot;Metric Value&amp;quot;,
         color = &amp;quot;Model Config&amp;quot;
    ) + 
    theme_light()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2022/01/24/predicting-when-kickers-get-iced-with-tidymodels/index_files/figure-html/tune_results-1.png&#34; width=&#34;100%&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Since I care about the positive class more than the negative class, but don’t have a strong preference for whether false positives or false negatives are more costly, I’m going to use the F1-Score as the &lt;a href=&#34;https://machinelearningmastery.com/tour-of-evaluation-metrics-for-imbalanced-classification/&#34;&gt;performance metric I care most about&lt;/a&gt;. As expected, the plain vanilla specification had a higher accuracy than the version using SMOTE to correct for imbalance, but it had lower values for F1, PR AUC, and ROC AUC. I can also use &lt;code&gt;rank_results()&lt;/code&gt; to show the top models for the F1 measure across the different specifications:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;rank_results(tuned_results, rank_metric = &amp;#39;f_meas&amp;#39;) %&amp;gt;%
  select(wflow_id, .config, .metric, mean, std_err) %&amp;gt;%
  filter(.metric == &amp;#39;f_meas&amp;#39;) %&amp;gt;% 
  kable()&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;wflow_id&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;.config&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;.metric&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;mean&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;std_err&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;smote_vanilla&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Preprocessor1_Model050&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;f_meas&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.4113826&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0122165&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;smote_vanilla&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Preprocessor1_Model049&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;f_meas&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.4096641&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0135101&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;smote_vanilla&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Preprocessor1_Model045&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;f_meas&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.4092579&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0123975&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;smote_vanilla&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Preprocessor1_Model076&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;f_meas&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.4075923&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0094581&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;smote_vanilla&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Preprocessor1_Model097&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;f_meas&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.4049903&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0089085&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;smote_vanilla&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Preprocessor1_Model063&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;f_meas&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.4047996&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0101844&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;smote_vanilla&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Preprocessor1_Model030&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;f_meas&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.4033350&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0105798&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;norm_vanilla&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Preprocessor1_Model040&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;f_meas&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.2830217&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0225049&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The top 7 models by F1 are all various configurations of the SMOTE recipe. The best model specification had an average F1 of 0.411 across the five folds. To get a better understanding of what this model’s specification actually was I extract the model configuration that has the best F1-score by using &lt;code&gt;extract_workflow_set_result()&lt;/code&gt; with the workflow id and then &lt;code&gt;select_best()&lt;/code&gt; with the metric I care about:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;##Get Best Model
best_set &amp;lt;- tuned_results %&amp;gt;% 
  extract_workflow_set_result(&amp;#39;smote_vanilla&amp;#39;) %&amp;gt;% 
  select_best(metric = &amp;#39;f_meas&amp;#39;)

kable(best_set)&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;colgroup&gt;
&lt;col width=&#34;5%&#34; /&gt;
&lt;col width=&#34;6%&#34; /&gt;
&lt;col width=&#34;6%&#34; /&gt;
&lt;col width=&#34;11%&#34; /&gt;
&lt;col width=&#34;11%&#34; /&gt;
&lt;col width=&#34;15%&#34; /&gt;
&lt;col width=&#34;12%&#34; /&gt;
&lt;col width=&#34;10%&#34; /&gt;
&lt;col width=&#34;23%&#34; /&gt;
&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;right&#34;&gt;mtry&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;trees&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;min_n&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;tree_depth&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;learn_rate&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;loss_reduction&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;sample_size&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;stop_iter&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;.config&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;right&#34;&gt;5&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1641&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;19&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;8&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.007419&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;9.425834&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.9830687&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;21&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Preprocessor1_Model050&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The best model in this case had 5 random predictors, 1641 trees, and so on.&lt;/p&gt;
&lt;p&gt;Now that I know which model configuration is the best one, the last step is to finalize the model using the full training data and predict on the test set. The next block of code extracts the workflow, sets the parameters to those from the &lt;em&gt;best_set&lt;/em&gt; defined above using &lt;code&gt;finalize_workflow()&lt;/code&gt;, and then &lt;code&gt;last_fit()&lt;/code&gt;, given the workflow and the split object, does the final fitting on the full training set and prediction on the testing data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;final_fit &amp;lt;- tuned_results %&amp;gt;%
  extract_workflow(&amp;#39;smote_vanilla&amp;#39;) %&amp;gt;%
  finalize_workflow(best_set) %&amp;gt;%
  last_fit(ice_split, metrics=metric_set(accuracy, roc_auc, mn_log_loss, 
                                         pr_auc, f_meas, precision, recall))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then with &lt;code&gt;collect_metrics()&lt;/code&gt; I can see the final results when the model was applied to the test set that had been unused thus far.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;collect_metrics(final_fit) %&amp;gt;% 
  kable()&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;.metric&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;.estimator&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;.estimate&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;.config&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;accuracy&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;binary&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.9341443&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Preprocessor1_Model1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;f_meas&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;binary&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.4332130&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Preprocessor1_Model1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;precision&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;binary&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.3438395&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Preprocessor1_Model1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;recall&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;binary&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.5853659&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Preprocessor1_Model1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;roc_auc&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;binary&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.9101661&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Preprocessor1_Model1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;mn_log_loss&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;binary&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.1546972&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Preprocessor1_Model1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;pr_auc&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;binary&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.3505282&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Preprocessor1_Model1&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The F1 score is actually higher than in training at 0.43, with a precision of 34%, a recall of 59%, and a ROC AUC of 0.91.&lt;/p&gt;
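&lt;p&gt;As a quick arithmetic check, the test-set F1 follows directly from the precision and recall in the table above, since F1 is their harmonic mean:&lt;/p&gt;

```r
# F1 = 2 * precision * recall / (precision + recall), using the
# test-set estimates from the collect_metrics() table above
precision = 0.3438395
recall    = 0.5853659
f1 = 2 * precision * recall / (precision + recall)
round(f1, 4)  # 0.4332, matching the f_meas row
```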
&lt;p&gt;Tidymodels also makes it very easy to display ROC curves using &lt;code&gt;collect_predictions&lt;/code&gt; to get the predictions from the final model and test set and &lt;code&gt;roc_curve&lt;/code&gt; to calculate the sensitivity and specificity.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;collect_predictions(final_fit) %&amp;gt;%
  roc_curve(is_iced, .pred_iced) %&amp;gt;%
  ggplot(aes(1 - specificity, sensitivity)) +
  geom_abline(lty = 2, color = &amp;quot;gray80&amp;quot;, size = 1.5) +
  geom_path(alpha = 0.8, size = 1) +
  coord_equal() +
  labs(color = NULL)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2022/01/24/predicting-when-kickers-get-iced-with-tidymodels/index_files/figure-html/roc_curve-1.png&#34; width=&#34;100%&#34; /&gt;&lt;/p&gt;
&lt;p&gt;As well as calculate the confusion matrix with &lt;code&gt;collect_predictions&lt;/code&gt; and &lt;code&gt;conf_mat&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;collect_predictions(final_fit) %&amp;gt;%
  conf_mat(is_iced, .pred_class) %&amp;gt;%
  autoplot(type = &amp;#39;heatmap&amp;#39;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2022/01/24/predicting-when-kickers-get-iced-with-tidymodels/index_files/figure-html/confusion_matrix-1.png&#34; width=&#34;100%&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;part-3-interpreting-the-model&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Part 3: Interpreting the model&lt;/h2&gt;
&lt;p&gt;So now the model has been built and can be used to predict whether a field goal attempt will get iced given certain parameters. But XGBoost is in the class of “black-box” models where it might be difficult to know what’s going on under the hood. In this third part, I’ll explore:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Variable Importance&lt;/li&gt;
&lt;li&gt;Partial Dependency Plots&lt;/li&gt;
&lt;li&gt;SHAP Values&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;All of which will help to provide some interpretability to the model fit in part 2.&lt;/p&gt;
&lt;div id=&#34;variable-importance&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Variable Importance&lt;/h3&gt;
&lt;p&gt;Variable Importance plots are one way of understanding which predictor has the largest effect on the model outcomes. There are many ways to measure variable importance but the one I’m using is the default in the {vip} package for XGBoost which is “gain”. Variable importance using gain measures the fractional contribution of each feature to the model based on the total gain of the feature’s splits where gain is the improvement to accuracy brought by a feature to its branches.&lt;/p&gt;
&lt;p&gt;The {vip} package provides variable importance when given a model object as an input. To get that I use &lt;code&gt;extract_fit_parsnip()&lt;/code&gt; to get the parsnip version of the model object. Then the &lt;code&gt;vip()&lt;/code&gt; function does the rest.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(vip)

extract_fit_parsnip(final_fit) %&amp;gt;%
  vip(geom = &amp;quot;point&amp;quot;, include_type = T) + 
  geom_text(aes(label = scales::percent(Importance, accuracy = 1)),
            nudge_y = 0.023) + 
  theme_light()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2022/01/24/predicting-when-kickers-get-iced-with-tidymodels/index_files/figure-html/varImp-1.png&#34; width=&#34;100%&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Unsurprisingly, regulation time remaining is the most important feature, which makes sense because the amount of time remaining dictates whether using a timeout on a kick is worthwhile. It is a bit more surprising that whether the kicking team is the home team is the 2nd most important feature, as I would have thought game situation would matter more than home or away status; I thought score difference would rank higher.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;partial-dependency&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Partial Dependency&lt;/h3&gt;
&lt;p&gt;Variable importance tells us “what variables matter” but it doesn’t tell us “how they matter”. Are the relationships between the predictors and predictions linear or non-linear? Is there some magic number where a step function occurs? Variable importance cannot answer these questions, but partial dependency plots can!&lt;/p&gt;
&lt;p&gt;A partial dependency plot shows the effect of a predictor on the model outcome holding everything else constant. The {pdp} package can generate these plots, though it is a little less friendly with {tidymodels} since you need to provide the native model object rather than the parsnip version (still easily accessible using &lt;code&gt;extract_fit_engine()&lt;/code&gt;). Also, the data passed into the &lt;code&gt;partial()&lt;/code&gt; function needs to be the same as the data that actually goes into the model object. So I create &lt;em&gt;fitted_data&lt;/em&gt; by &lt;code&gt;prep()&lt;/code&gt;ing the recipe and then &lt;code&gt;bake()&lt;/code&gt;’ing it, which applies the recipe to the original data set.&lt;/p&gt;
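Conceptually, what partial() computes can be sketched in a few lines of base R, with a toy logistic function standing in for the fitted XGBoost model (the function, its coefficients, and the column names here are all illustrative):

```r
# A toy "model" stands in for the fitted booster (illustrative only)
set.seed(1)
toy_predict = function(df) plogis(0.08 * df$distance - 0.02 * abs(df$score_diff) - 4)

train = data.frame(distance   = runif(500, 18, 60),
                   score_diff = rnorm(500, 0, 10))

# Partial dependence of `distance`: for each grid value, overwrite the
# column everywhere, predict, and average over all training rows --
# everything else is held at its observed values.
grid = seq(20, 55, by = 5)
yhat = sapply(grid, function(v) {
  tmp = train
  tmp$distance = v
  mean(toy_predict(tmp))
})
data.frame(distance = grid, yhat = yhat)
```

This brute-force averaging is exactly why partial() is slow on larger data, and why parallelizing it with {furrr} pays off.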
&lt;p&gt;The &lt;code&gt;partial&lt;/code&gt; function can also take a while to run, so I’m using {furrr}, which allows {purrr} functions to be run in parallel on the {future} backend. In the &lt;code&gt;future_map_dfr&lt;/code&gt; function, I’m running &lt;code&gt;partial&lt;/code&gt; on every predictor in the data and stacking the results on top of each other so that I can plot them in the final step. The use of &lt;em&gt;prob=T&lt;/em&gt; converts the model output to a probability, but since XGBoost probabilities are uncalibrated it’s best not to read too much into the exact values.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(pdp)

##Get Processed Training Data
model_object &amp;lt;- extract_fit_engine(final_fit)

fitted_data &amp;lt;- rec_smote %&amp;gt;%
  prep() %&amp;gt;%
  bake(new_data = model_data) %&amp;gt;%
  select(-is_iced)

library(furrr)
plan(multisession, workers = 4)

all_partial &amp;lt;- future_map_dfr(
  names(fitted_data), ~as_tibble(partial(
    model_object,
    train = fitted_data,
    pred.var = .x,
    type = &amp;#39;classification&amp;#39;,
    plot = F,
    prob = T, #Converts model output to probability scale
    trim.outliers = T
  )) %&amp;gt;% 
    mutate(var = .x) %&amp;gt;%
    rename(value = all_of(.x)),
  .progress = T,
  .options = furrr_options(seed = 20220109)
)

all_partial %&amp;gt;% 
  #Remove the one-hot encoded prior_miss and overtime indicators
  filter(!str_detect(var, &amp;#39;prior_miss|overtime&amp;#39;)) %&amp;gt;% 
  ggplot(aes(x = value, y = yhat, color = var)) + 
    geom_line() + 
    geom_smooth(se = F, lty = 2, span = .5) + 
    facet_wrap(~var, scales = &amp;quot;free&amp;quot;) + 
    #scale_y_continuous(labels = percent_format(accuracy = .1)) + 
    scale_color_discrete(guide = &amp;#39;none&amp;#39;) +
    labs(title = &amp;quot;Partial Dependency Plots for Whether A Kick Gets Iced?&amp;quot;,
         subtitle = &amp;quot;Looking at 19,072 NCAA Field Goal Attempts (2013-2021)&amp;quot;,
         x = &amp;quot;Variable Value&amp;quot;,
         y = &amp;quot;Prob. of Attempt Getting Iced&amp;quot;) + 
    theme_light()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2022/01/24/predicting-when-kickers-get-iced-with-tidymodels/index_files/figure-html/pdp-1.png&#34; width=&#34;100%&#34; /&gt;&lt;/p&gt;
&lt;p&gt;From these plots we can tell that the likelihood of getting iced increases when:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The Attempted Distance is between 30-50 yards&lt;/li&gt;
&lt;li&gt;When two teams are expected to be somewhat evenly matched (based on pre-game win probabilities)&lt;/li&gt;
&lt;li&gt;When it’s nearly the end of the game or the end of the half (the middle spike in regulation time remaining is halftime, since timeouts reset at the beginning of each half)&lt;/li&gt;
&lt;li&gt;When the kicking team is losing by a very small margin (or when the game is within +/- 10 points)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;While variable importance told us that Regulation Time Remaining was the most important variable, the partial dependency plot shows us how it affects the model in a non-linear way.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;shap-values&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;SHAP Values&lt;/h3&gt;
&lt;p&gt;The next measure of interpretability combines pieces of both variable importance and partial dependency plots. SHAP values are &lt;a href=&#34;https://liuyanguu.github.io/post/2019/07/18/visualization-of-shap-for-xgboost/&#34;&gt;claimed to be the most advanced method to interpret results from tree-based models&lt;/a&gt;. They are based on Shapley values from game theory and measure feature importance based on the marginal contribution of each predictor, for each observation, to the model output.&lt;/p&gt;
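To make “marginal contribution” concrete, here is an exact Shapley calculation for a single observation of a tiny two-feature toy model. The function f and the background data are illustrative stand-ins; for tree ensembles, {SHAPforxgboost} relies on the much faster TreeSHAP algorithm rather than this brute-force averaging:

```r
# Toy model and background (training) data -- purely illustrative
set.seed(1)
f  = function(x1, x2) 2 * x1 + x2
bg = data.frame(x1 = rnorm(1000), x2 = rnorm(1000))
x  = list(x1 = 1.5, x2 = -0.5)  # the observation being explained

# Value of each feature coalition: expected prediction with the
# coalition's features fixed at the observation and the rest drawn
# from the background data
v_none = mean(f(bg$x1, bg$x2))
v_1    = mean(f(x$x1,  bg$x2))
v_2    = mean(f(bg$x1, x$x2))
v_12   = f(x$x1, x$x2)

# Shapley value of a feature: its marginal contribution averaged
# over both possible orderings of adding the two features
phi1 = 0.5 * ((v_1 - v_none) + (v_12 - v_2))
phi2 = 0.5 * ((v_2 - v_none) + (v_12 - v_1))
c(phi1 = phi1, phi2 = phi2)
```

By construction phi1 + phi2 equals f(x) minus the average prediction, which is why the SHAP values for an observation always decompose its prediction relative to the baseline.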
&lt;p&gt;The &lt;a href=&#34;https://liuyanguu.github.io/post/2019/07/18/visualization-of-shap-for-xgboost/&#34;&gt;{SHAPforxgboost}&lt;/a&gt; package provides an interface to getting SHAP values. The plot that will give us overall variable importance is the SHAP summary plot which we’ll get using &lt;code&gt;shap.plot.summary&lt;/code&gt;. However, first the data structure needs to be prepped using the model object and the training data in a matrix.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(SHAPforxgboost)

shap_long &amp;lt;- shap.prep(xgb_model = extract_fit_engine(final_fit), 
                        X_train = fitted_data %&amp;gt;% as.matrix())
                       
shap.plot.summary(shap_long)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2022/01/24/predicting-when-kickers-get-iced-with-tidymodels/index_files/figure-html/shap-1.png&#34; width=&#34;100%&#34; /&gt;&lt;/p&gt;
&lt;p&gt;In the summary plot, the most important variables are ordered from top to bottom. Within any given variable, each point represents an observation. The shading of each point represents whether that observation has a higher or lower value for that feature. For example, for regulation time remaining, lower amounts of remaining time will be orange while higher amounts will be purple. The position to the left or right of zero represents whether the feature decreases or increases the likelihood of getting iced. For regulation time remaining, notice that the very purple points are strongly negative (on the left side) and the very orange points are strongly positive (on the right side).&lt;/p&gt;
&lt;p&gt;Similar to the variable importance plot, regulation time remaining was the most important feature.&lt;/p&gt;
&lt;p&gt;We can also get dependency plots similar to the partial dependency plots with SHAP values using &lt;code&gt;shap.plot.dependence&lt;/code&gt;. We’ll look at the regulation time remaining on the x-axis and the SHAP values for regulation time remaining on the y-axis. Since this returns a ggplot object, I’ll add in vertical lines to represent the end of each quarter.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;SHAPforxgboost::shap.plot.dependence(data_long = shap_long, x = &amp;#39;regulation_time_remaining&amp;#39;, 
                                     y = &amp;#39;regulation_time_remaining&amp;#39;, 
                                     color_feature = &amp;#39;regulation_time_remaining&amp;#39;) + 
  ggtitle(&amp;quot;Shap Values vs. Regulation Time Remaining&amp;quot;) + 
  geom_vline(xintercept = 0, lty = 2) + 
    geom_vline(xintercept = 900, lty = 2) + 
    geom_vline(xintercept = 1800, lty = 2) + 
    geom_vline(xintercept = 2700, lty = 2) &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2022/01/24/predicting-when-kickers-get-iced-with-tidymodels/index_files/figure-html/shap2-1.png&#34; width=&#34;100%&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Similar to the summary plot, the less time remaining in the game the more orange the point and the more time remaining the more purple. Again, like in the partial dependency plot, we see a non-linear relationship with increases towards the end of each quarter and heavy spikes in the last 3 minutes of the 2nd and 4th quarters.&lt;/p&gt;
&lt;p&gt;This is just an example of the things that can be done with SHAP values, but hopefully their usefulness for understanding both what’s important and how it’s important has been illustrated.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;wrapping-up&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Wrapping Up&lt;/h2&gt;
&lt;p&gt;This was quite long, so a huge thanks if you made it to the end. This post took a tour through {tidymodels} and some interpretable ML tools to look at when field goal attempts are more likely to get iced. If you’re a football fan the results shouldn’t be terribly surprising, but it’s good to know that the model outputs generally pass the domain-expertise “sniff test”. In the next post, I’ll use this same data to try to understand whether icing the kicker actually works in making the kicker more likely to miss the attempt.&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Examining College Football Conference Realignment with {ggraph}</title>
      <link>https://jlaw.netlify.app/2021/12/29/examining-college-football-conference-realignment-with-ggraph/</link>
      <pubDate>Wed, 29 Dec 2021 00:00:00 +0000</pubDate>
      <guid>https://jlaw.netlify.app/2021/12/29/examining-college-football-conference-realignment-with-ggraph/</guid>
      <description>
&lt;script src=&#34;https://jlaw.netlify.app/2021/12/29/examining-college-football-conference-realignment-with-ggraph/index_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;p&gt;In my previous &lt;a href=&#34;https://jlaw.netlify.app/2021/12/27/exploring-college-football-non-conference-rivalries-with-ggraph/&#34;&gt;post&lt;/a&gt; I looked at College Football non-conference games to create a network map overlaid on top of the United States using the {ggraph} package. In this post I’ll extend that to examine conference realignment, which is when colleges move from one conference to another. Over the years this has been driven by internal politics between football schools and basketball schools, or by schools seeking more clout by joining a more prestigious conference.&lt;/p&gt;
&lt;p&gt;More specifically, I’ll be making a network map based on historical conference affiliations to visualize the changes that have occurred due to realignment. Then I’ll zoom specifically into the case of the Big 12 conference to show how the graph reflects the history of the conference.&lt;/p&gt;
&lt;p&gt;Since all of the packages used in this post were described in the prior &lt;a href=&#34;https://jlaw.netlify.app/2021/12/27/exploring-college-football-non-conference-rivalries-with-ggraph/&#34;&gt;post&lt;/a&gt;, I’ll skip over that section here.&lt;/p&gt;
&lt;div id=&#34;set-up&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Set up&lt;/h1&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidyverse)
library(cfbfastR)
library(tidygraph)
library(ggraph)
library(ggtext)
library(showtext)

font_add_google(&amp;#39;Roboto&amp;#39;, &amp;quot;roboto&amp;quot;)
showtext_auto()&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;creating-a-network-of-the-fbs-conference-affiliations&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Creating a Network of the FBS Conference Affiliations&lt;/h1&gt;
&lt;p&gt;For both analyses I’ll be creating a network graph where individual schools are the nodes and the edges represent whether those schools were in the same conference in a given year. Since conference affiliations change over time, the number of years that schools were in the same conference forms a strength of association. To get this data, I’ll be using the &lt;code&gt;cfbd_team_info()&lt;/code&gt; function from the {cfbfastR} package to return a list of all the FBS schools and their conference affiliations for each year between 1980 and 2021.&lt;/p&gt;
&lt;p&gt;The choice of 1980 is arbitrary to limit the number of connections and the size of the data. However, the package can return data much further back in time.&lt;/p&gt;
&lt;p&gt;In order to extract the data for each year I pass a vector of years 1980 through 2021 into &lt;code&gt;map_dfr&lt;/code&gt; from {purrr} to run a custom function taking each individual year as an input and stacking the results into a single data frame. My custom function first calls the College Football Database API to retrieve all the schools for a given year and removes all the Independent schools since they do not have an affiliation (for example, Notre Dame). Then since I need to get my list of schools into a list of co-occurrences for each conference, I &lt;code&gt;group_by()&lt;/code&gt; conference so the next parts of the function get run on a conference by conference basis and &lt;code&gt;expand&lt;/code&gt; the school column to create two columns with all within conference combinations. Since “all within conference combinations” includes having the same school twice, I’ll filter out those rows, and since A/B is different than B/A, I’ll create new variables that will always put the school coming first alphabetically into &lt;code&gt;school1&lt;/code&gt; and the other into &lt;code&gt;school2&lt;/code&gt;. Technically, this will double count each entry but I’ll run &lt;code&gt;distinct()&lt;/code&gt; to get the unique set since I’m going to eventually weight by the number of years and this function runs one year at a time.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;conference_graph_data &amp;lt;- map_dfr(1980:2021, function(yr){
  # get the list of schools for a given year
  x &amp;lt;- cfbd_team_info(year = yr) %&amp;gt;%
    # remove independents
    filter(conference != &amp;#39;FBS Independents&amp;#39;) %&amp;gt;%
    group_by(conference) %&amp;gt;% 
    # get all combinations of schools within each conference
    expand(school, school, .name_repair = &amp;#39;universal&amp;#39;)%&amp;gt;% 
    # Remove the combinations that are the same school twice
    filter(school...2 != school...3) %&amp;gt;%
    # Enforce an order so that each school pair appears in the same order
    mutate(school1 = if_else(school...2 &amp;lt; school...3, school...2, school...3),
           school2 = if_else(school...2 &amp;lt; school...3, school...3, school...2),
           season = yr) %&amp;gt;%
    # subset the columns
    select(season, conference, school1, school2) %&amp;gt;%
    # remove duplicates since each combination would be counted twice
    distinct()
  return(x)
  
})&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For the nodes on this graph I only want the schools that are part of the Football Bowl Subdivision in 2021, rather than schools that may have dropped down to the FCS. To get this list I’ll run &lt;code&gt;cfbd_team_info(year = 2021)&lt;/code&gt; to get a data frame of all 2021 schools. But since I only need a vector to filter on, I’ll use &lt;code&gt;pull()&lt;/code&gt; to extract the school names into a vector.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;current_fbs &amp;lt;- cfbd_team_info(year = 2021) %&amp;gt;%
  pull(school)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Next, I’ll use the {tidygraph} package to turn this list of edges into a &lt;code&gt;tbl_graph()&lt;/code&gt; object. First I &lt;code&gt;ungroup&lt;/code&gt; the data frame since it would still be grouped from my custom function. Then using the &lt;code&gt;count()&lt;/code&gt; function I create a &lt;code&gt;weight&lt;/code&gt; column for each year the schools are affiliated with each other. Next, I leverage the vector I created in the step before to keep only edges where both schools are currently in the FBS. Then I create the &lt;code&gt;tbl_graph&lt;/code&gt; object using &lt;code&gt;as_tbl_graph()&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;tbl_graph&lt;/code&gt; objects can be manipulated using {dplyr} verbs to create additional information for either the nodes or the edges. In this instance I add two additional columns to the nodes:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;I add the number of schools that each node is affiliated with using &lt;code&gt;centrality_degree()&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;I create grouping of node communities using &lt;code&gt;group_louvain()&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;conf_graph_all &amp;lt;- conference_graph_data %&amp;gt;% 
  ungroup() %&amp;gt;% 
  count(school1, school2, name = &amp;#39;weight&amp;#39;, sort = T) %&amp;gt;% 
  filter(school1 %in% current_fbs &amp;amp; school2 %in% current_fbs) %&amp;gt;% 
  as_tbl_graph(directed = F) %&amp;gt;%
  mutate(degree = centrality_degree(),
         community = group_louvain())

print(conf_graph_all)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tbl_graph: 128 nodes and 1142 edges
## #
## # An undirected simple graph with 1 component
## #
## # Node Data: 128 x 3 (active)
##   name          degree community
##   &amp;lt;chr&amp;gt;          &amp;lt;dbl&amp;gt;     &amp;lt;int&amp;gt;
## 1 Air Force         18         3
## 2 Alabama           13         1
## 3 Arizona           11         7
## 4 Arizona State     11         7
## 5 Auburn            13         1
## 6 Ball State        15         4
## # ... with 122 more rows
## #
## # Edge Data: 1,142 x 3
##    from    to weight
##   &amp;lt;int&amp;gt; &amp;lt;int&amp;gt;  &amp;lt;int&amp;gt;
## 1     1    12     42
## 2     1    32     42
## 3     1    42     42
## # ... with 1,139 more rows&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Note that in the above output you can see the &lt;code&gt;degree&lt;/code&gt; and &lt;code&gt;community&lt;/code&gt; columns that I created. For the Arizona and Arizona State rows, the &lt;code&gt;degree&lt;/code&gt; of 11 means that each is connected to 11 schools (which I found kind of shocking, but since the Pac-10 formed in 1978 it makes sense that they’ve only ever been in a conference with the other now Pac-12 schools). The &lt;code&gt;community&lt;/code&gt; column means that they both belong to the same grouping of nodes, which in this case is probably the Pac-12.&lt;/p&gt;
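If the degree numbers feel opaque, the calculation can be reproduced from a raw edge list with base R alone; the school pairs here are a made-up toy example, not the real affiliation data:

```r
# Toy undirected edge list (illustrative pairs only)
edges = data.frame(
  school1 = c("Arizona", "Arizona", "Alabama"),
  school2 = c("Arizona State", "UCLA", "Auburn")
)

# In an undirected graph the degree of a node is the number of edges
# touching it, i.e. how many schools it has ever shared a conference with
degree = table(c(edges$school1, edges$school2))
degree
```

Here Arizona has degree 2 (edges to Arizona State and UCLA) while every other school has degree 1, mirroring what centrality_degree() computes on the tbl_graph.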
&lt;p&gt;For creating the network visualization itself, I’m using the {ggraph} package, which has a very similar syntax to {ggplot2}. The important notes here are that I’m displaying the edges as straight lines using &lt;code&gt;geom_edge_link()&lt;/code&gt; and varying the shading, color, and width based on the weight, and that I’m displaying the nodes as labels using &lt;code&gt;geom_node_label&lt;/code&gt; and filling by the &lt;code&gt;community&lt;/code&gt; column. Everything else should be pretty normal if you’re familiar with {ggplot2} syntax.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;conf_graph_all %&amp;gt;% 
  ggraph() + 
  geom_edge_link(aes(edge_alpha = weight, edge_color = weight, edge_width = weight)) + 
  geom_node_label(aes(label = name, fill = factor(community)), show.legend = F, size = 3) + 
  scale_edge_alpha_continuous(guide = &amp;#39;none&amp;#39;) + 
  scale_edge_width() + 
  scale_edge_color_viridis(option = &amp;#39;C&amp;#39;, end = .8, guide = &amp;#39;none&amp;#39;) + 
  scale_size_discrete(range = c(4, 6)) + 
  ggthemes::scale_fill_gdocs(guide = F, palette = ggthemes::tableau_color_pal()) + 
  labs(title = &amp;quot;2021 FBS College Football Teams Conference Affiliations&amp;quot;,
       subtitle = &amp;quot;Network of Affiliated Schools (1980 - 2021)&amp;quot;,
       edge_width = &amp;quot;Years Affiliated&amp;quot;,
       caption = &amp;#39;**Source:** CollegeFootballData API&amp;#39;) + 
  theme_graph() + 
  theme(
    legend.position = &amp;#39;bottom&amp;#39;,
    plot.title = element_markdown(family = &amp;#39;roboto&amp;#39;),
    plot.subtitle = element_markdown(family = &amp;#39;roboto&amp;#39;),
    plot.caption = element_markdown()
  )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;featured.PNG&#34; /&gt;&lt;/p&gt;
&lt;p&gt;While I normally like to have everything be reproducible, it felt necessary to add some annotations about what the various communities are, how they reflect the current conference structure, and how schools that changed conferences appear caught in a tug of war between two communities. These annotations, while possible to do in R, are much easier to add outside of it.&lt;/p&gt;
&lt;p&gt;The piece that I enjoy the most is the depiction of the former Big East football teams. Syracuse, Virginia Tech, Miami, Pittsburgh, and Boston College left for the ACC between 2004 and 2013 (with Louisville following in 2014); West Virginia left for the Big 12 in 2012; and Rutgers left for the Big Ten in 2014 (along with Maryland, who left the ACC for the Big Ten and shows up very clearly between those two clusters).&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;zooming-into-the-big-12&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Zooming into the Big 12&lt;/h1&gt;
&lt;p&gt;Using a similar technique to the one above I can look at a sub-graph of the current Big 12 schools. I chose the Big 12 for this example because I think the history of the conference is both interesting and well structured when compared to the complete chaos or complete stability of other conferences. Just to get this out of the way: College Football conferences sometimes anchor their names more to branding than to accuracy. You might notice that the Big 12 only has 10 schools and the Big Ten has 14. Best not to think too much about this.&lt;/p&gt;
&lt;p&gt;Similar to before, I’ll query the College Football Database API, passing in the conference parameter B12 and the year 2021 to get the list of existing schools. Then I’ll use that list to filter to the current Big 12 schools and any other school that has ever been affiliated with a current Big 12 school. For simplicity later on, I create an indicator for whether the node is a current Big 12 school.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;current_big_12 &amp;lt;- cfbd_team_info(conference = &amp;#39;B12&amp;#39;, year = 2021) %&amp;gt;%
  pull(school)


conf_graph_b12 &amp;lt;- conference_graph_data %&amp;gt;% 
  ungroup() %&amp;gt;% 
  # Filter to only pairs that involve at least 1 Big 12 School
  filter(school1 %in% current_big_12 | school2 %in% current_big_12) %&amp;gt;% 
  # Count the pairs to form the number of years that they were affiliated
  count(school1, school2, name = &amp;#39;weight&amp;#39;, sort = T) %&amp;gt;% 
  # Turn to tbl_graph_object
  as_tbl_graph(directed = F) %&amp;gt;%
  # Create indicator for a current Big 12 Schools
  mutate(is_current_big_12 = name %in% current_big_12) &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Using similar code to the full network above, I can plot the Big 12 sub-graph after filtering to only nodes from current Big 12 schools. In this case, rather than using the default &lt;code&gt;ggraph()&lt;/code&gt; layout, I give it the &lt;code&gt;&#39;fr&#39;&lt;/code&gt; string which applies the Fruchterman-Reingold layout algorithm. Since this can provide non-deterministic layouts, I set the seed before running.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(20211229)
conf_graph_b12 %&amp;gt;%
  # Filter to current Big 12 Schools
  filter(is_current_big_12) %&amp;gt;%
  ggraph(&amp;#39;fr&amp;#39;) + 
  geom_edge_link(aes(edge_alpha = weight, edge_color = weight, 
                     edge_width = weight)) + 
  geom_node_label(aes(label = name)) + 
  scale_edge_alpha_continuous(guide = &amp;#39;none&amp;#39;) + 
  scale_edge_width() + 
  scale_edge_color_viridis(option = &amp;#39;C&amp;#39;, end = .8, guide = &amp;#39;none&amp;#39;) + 
  labs(title = &amp;quot;2021 Big 12 Football Conference&amp;quot;,
       subtitle = &amp;quot;Network Graph Based on Conference Affiliations 1980-2021&amp;quot;,
       edge_width = &amp;quot;Years Affiliated&amp;quot;,
       caption = &amp;#39;**Source:** CollegeFootballData API&amp;#39;) + 
  theme_graph() + 
  theme(
    legend.position = &amp;#39;bottom&amp;#39;,
    plot.title = element_markdown(family = &amp;#39;roboto&amp;#39;),
    plot.subtitle = element_markdown(family = &amp;#39;roboto&amp;#39;),
    plot.caption = element_markdown()
    
  )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2021/12/29/examining-college-football-conference-realignment-with-ggraph/index_files/figure-html/b12_chart-1.png&#34; width=&#34;100%&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Just eyeballing the above graph it looks like there are 4 clusters of nodes:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;The strong network of Oklahoma, Oklahoma State, Iowa State, Kansas, and Kansas State&lt;/li&gt;
&lt;li&gt;The strong network of Texas, Texas Tech, and Baylor&lt;/li&gt;
&lt;li&gt;TCU, with moderate-strength connections to the Texas schools&lt;/li&gt;
&lt;li&gt;West Virginia without any strong connections.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;When looking through the &lt;a href=&#34;https://en.wikipedia.org/wiki/Big_12_Conference#Former_members&#34;&gt;Big 12&lt;/a&gt; conference history this structure makes a ton of sense. The conference was formed in 1996 from the merging of the &lt;a href=&#34;https://en.wikipedia.org/wiki/Big_Eight_Conference#Conference_split&#34;&gt;Big 8&lt;/a&gt;, which included the schools in group 1 (as well as Nebraska, Colorado, and Missouri, who left for other conferences in 2011-2012), and the &lt;a href=&#34;https://en.wikipedia.org/wiki/Southwest_Conference#Football&#34;&gt;Southwest Conference&lt;/a&gt;, from which Texas Tech, Texas, and Baylor joined (Texas A&amp;amp;M joined as well but left for a different conference in 2012). So the strong networks in groups 1 and 2, and the weaker connections between them, reflect these original conferences and their merging.&lt;/p&gt;
&lt;p&gt;TCU was part of the original Southwest Conference with the Texas schools but did not join the Big 12 until 2012 instead journeying through the Western Athletic Conference (WAC), Conference USA, and the Mountain West Conference. This is reflected in their connection with the Texas schools (through their time in the Southwest Conference) but with weaker strength than the other Texas schools have with each other.&lt;/p&gt;
&lt;p&gt;Finally, West Virginia joined the Big 12 in 2012 from the Big East conference and prior to that point had no affiliation with any of the other schools.&lt;/p&gt;
&lt;p&gt;While the graph is good for showing the structure of the relationships, it can be difficult to follow the conference merges and changes from it. These should be more apparent in the visualization below:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;conference_graph_data %&amp;gt;%
  ungroup() %&amp;gt;%
  filter(school1 %in% current_big_12 | school2 %in% current_big_12) %&amp;gt;%
  gather(dummy, school, -season, -conference) %&amp;gt;% 
  select(-dummy) %&amp;gt;%
  distinct() %&amp;gt;% 
  add_count(school, name = &amp;#39;years&amp;#39;) %&amp;gt;%
  group_by(school, years, conference) %&amp;gt;% 
  summarize(start = min(season)-.5, end = max(season)+.5) %&amp;gt;%
  mutate(first_conference = max(if_else(start == min(start), conference, NA_character_), na.rm = T),
         first_start = max(if_else(start == min(start), start, NA_real_), na.rm = T),
         n_conferences = n_distinct(conference)) %&amp;gt;%
  arrange(first_start, first_conference, -years, n_conferences, school) %&amp;gt;% 
  ungroup() %&amp;gt;% 
  mutate(ord = row_number()) %&amp;gt;% 
  filter(school %in% current_big_12) %&amp;gt;%
  ggplot(aes(x = fct_reorder(school, ord, min, .desc = T))) + 
  geom_linerange(aes(ymin = start, ymax = end, color = conference), size = 8) + 
  labs(x = &amp;quot;Schools&amp;quot;, y = &amp;quot;Season&amp;quot;, color = &amp;quot;Conference&amp;quot;,
       title = &amp;quot;Conference Migration of the Current Big 12 Schools&amp;quot;) + 
  coord_flip() + 
  ggthemes::scale_color_tableau() + 
  cowplot::theme_cowplot() + 
  theme(
    axis.text.y = element_markdown(),
    plot.subtitle = element_markdown(),
    panel.grid.major.y = element_line(color = &amp;#39;grey90&amp;#39;)
  )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2021/12/29/examining-college-football-conference-realignment-with-ggraph/index_files/figure-html/school_migration-1.png&#34; width=&#34;100%&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Given the history of the Big 12 Conference and college football conference realignment in general, it does appear that network structures work well for encoding the history of conference affiliations into a visualization.&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Exploring College Football Non-Conference Rivalries with {ggraph}</title>
      <link>https://jlaw.netlify.app/2021/12/27/exploring-college-football-non-conference-rivalries-with-ggraph/</link>
      <pubDate>Mon, 27 Dec 2021 00:00:00 +0000</pubDate>
      <guid>https://jlaw.netlify.app/2021/12/27/exploring-college-football-non-conference-rivalries-with-ggraph/</guid>
      <description>
&lt;script src=&#34;https://jlaw.netlify.app/2021/12/27/exploring-college-football-non-conference-rivalries-with-ggraph/index_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;p&gt;We’re in the middle of College Football’s bowl post-season and I’d been wanting to do a more in-depth post on networks using {tidygraph} and {ggraph} for a while. So now seemed like as good a time as any to explore some College Football data. I had used {ggraph} in prior posts on &lt;a href=&#34;https://jlaw.netlify.app/2021/09/15/finding-the-eras-of-mtv-s-the-challenge-through-clustering/&#34;&gt;exploring season’s of MTV’s The Challenge&lt;/a&gt; and when &lt;a href=&#34;https://jlaw.netlify.app/2020/11/01/sequence-mining-my-browsing-history-with-arulessequences/&#34;&gt;sequence mining my web browsing&lt;/a&gt; but this post will be more focused on the network visualization than those two posts.&lt;/p&gt;
&lt;p&gt;In this post I will explore &lt;strong&gt;what are the most common non-Conference games?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;But really the goal is to create some fun visualizations that hopefully will tell a story.&lt;/p&gt;
&lt;div id=&#34;getting-started-the-data&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Getting Started + The Data&lt;/h2&gt;
&lt;p&gt;For many of the posts on this blog I tend to web scrape my own data. Initially I had planned to use Wikipedia to get a list of all the Football Bowl Subdivision (FBS) teams and their 2019 schedules for this analysis, but it proved difficult to find the right data in an easily accessible form. However, &lt;strong&gt;there truly is an R package for everything&lt;/strong&gt;: enter {cfbfastR}, which provides access to the &lt;a href=&#34;https://collegefootballdata.com/&#34;&gt;College Football Database&lt;/a&gt; API and gave me easy access to all the information I needed. To use this package, all that’s needed is registering for a free API key and adding it to your .Renviron file.&lt;/p&gt;
&lt;p&gt;In addition to {cfbfastR} for getting the data, I’ll be using {showtext} to access Google Fonts, {tidyverse} for general data manipulation, {tidygraph} for handling the network data, and {ggraph} to handle the network graph plotting. Access to the Google Font &lt;em&gt;Roboto&lt;/em&gt; is done using {showtext}’s &lt;code&gt;font_add_google&lt;/code&gt; function and then &lt;code&gt;showtext_auto()&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidyverse)
library(cfbfastR)
library(tidygraph)
library(ggraph)
library(ggtext)
library(showtext)

font_add_google(&amp;#39;Roboto&amp;#39;, &amp;quot;roboto&amp;quot;)
showtext_auto()&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;what-are-the-largest-non-conference-rivalries-in-college-footballs-fbs&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;What are the largest non-Conference Rivalries in College Football’s FBS?&lt;/h2&gt;
&lt;p&gt;The goal will be to create a map showing the links between College Football’s largest non-Conference rivalries. In this case, “largest” will be defined as most frequent. While College Football has many rivalries between Conference opponents, I wanted to focus on non-Conference games because I felt it would make for a better visualization. Additionally, since Conference teams generally have to play each other frequently, it would be difficult to discern a “chosen” rivalry from one dictated by conference membership.&lt;/p&gt;
&lt;p&gt;The data that I’ll need for this analysis are:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;A list of the FBS schools. I’ll use 2019 data since the 2021 season is still in progress and the 2020 season was abnormal.&lt;/li&gt;
&lt;li&gt;A list of all the games played between 2010 and 2019, which is the time-frame I’ll be using for this analysis.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Fortunately, both of these are easily available from the College Football Database. The helper function &lt;code&gt;cfbd_team_info&lt;/code&gt; returns all of the FBS schools for the 2019 season with information on each school as well as its latitude and longitude, saving me the need to geocode.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;cfbd_game_info&lt;/code&gt; provides all the games for a specified year. In order to get all the seasons between 2010 and 2019 I use &lt;code&gt;map_dfr&lt;/code&gt; to iterate over the vector &lt;code&gt;2010:2019&lt;/code&gt; and row-bind each output into a combined data frame.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;schools &amp;lt;- cfbd_team_info(year = 2019)

schedule &amp;lt;- map_dfr(2010:2019, cfbd_game_info)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To create a network graph I will need datasets representing the nodes of the graph (the schools) and the edges (the match-ups between two schools). For the nodes this is straightforward since I just need a subset of the columns in &lt;code&gt;schools&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;nodes &amp;lt;- schools %&amp;gt;%
  select(id = team_id, school, conference, latitude, longitude)

knitr::kable(head(nodes, 5))&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;right&#34;&gt;id&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;school&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;conference&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;latitude&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;longitude&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;right&#34;&gt;2005&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Air Force&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Mountain West&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;38.99697&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-104.84362&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;right&#34;&gt;2006&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Akron&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Mid-American&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;41.07255&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-81.50834&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;right&#34;&gt;333&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Alabama&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;SEC&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;33.20828&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-87.55038&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;right&#34;&gt;2026&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Appalachian State&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Sun Belt&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;36.21143&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-81.68543&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;right&#34;&gt;12&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Arizona&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Pac-12&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;32.22881&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-110.94887&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Edges will be a little trickier since I want this graph to be undirected. If Notre Dame plays USC, I don’t really care who was the home team or the away team, so I’ll need a way to count these as the same match-up. While I’m sure there’s a better way to do this, I decided to make the team that comes first alphabetically &lt;em&gt;school1&lt;/em&gt; and the other team &lt;em&gt;school2&lt;/em&gt;. This applies a consistent ordering to any match-up.&lt;/p&gt;
&lt;p&gt;In order to use the {tidygraph} package the edge list needs to have &lt;em&gt;from&lt;/em&gt; and &lt;em&gt;to&lt;/em&gt; columns even if the graph is undirected. Then once I have the edge list I construct a &lt;em&gt;weight&lt;/em&gt; column by using the &lt;code&gt;count()&lt;/code&gt; function from {dplyr}.&lt;/p&gt;
&lt;p&gt;I also exclude all conference games using a field that comes in the data set, and add a filter to ensure that both nodes are FBS schools, since FBS schools can play non-FBS opponents during the season.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;edge_list &amp;lt;- schedule %&amp;gt;% 
  # Remove any conference games
  filter(conference_game == F,
         #require that both the home and away teams are in our graph 
         home_id %in% nodes$id, 
         away_id %in% nodes$id) %&amp;gt;% 
  # apply alphabetical ordering to the two teams
  mutate(
    first_team = if_else(home_team &amp;lt; away_team, home_team, away_team),
    first_id = if_else(home_team &amp;lt; away_team, home_id, away_id),
    second_team = if_else(home_team &amp;lt; away_team, away_team, home_team),
    second_id = if_else(home_team &amp;lt; away_team, away_id, home_id)
  ) %&amp;gt;%
  select(from = first_id, to = second_id, first_team, second_team) %&amp;gt;%
  count(from, to, first_team, second_team, name = &amp;#39;weight&amp;#39;)

knitr::kable(head(edge_list, 5))&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;right&#34;&gt;from&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;to&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;first_team&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;second_team&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;weight&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;right&#34;&gt;2&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;23&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Auburn&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;San José State&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;right&#34;&gt;2&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;97&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Auburn&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Louisville&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;right&#34;&gt;2&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;166&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Auburn&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;New Mexico State&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;right&#34;&gt;2&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;228&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Auburn&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Clemson&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;right&#34;&gt;2&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;264&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Auburn&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Washington&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The interpretation of the first row is that Auburn played San José State twice between 2010 and 2019, while the second row shows Auburn played Louisville only once.&lt;/p&gt;
&lt;p&gt;The {tidygraph} package has its own structure called a &lt;code&gt;tbl_graph&lt;/code&gt; which combines the nodes and edges into a single data structure and allows the user to manipulate either portion. While there is a constructor specifically for the &lt;code&gt;tbl_graph&lt;/code&gt; object, I was having trouble getting it to work so I used &lt;code&gt;graph_from_data_frame&lt;/code&gt; from {igraph} and then cast the graph to a &lt;code&gt;tbl_graph&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Also, no disrespect to the University of Hawaii, but their presence really messes up the graph since Hawaii is &lt;strong&gt;so&lt;/strong&gt; far from the other schools. So I’m just going to exclude them.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;g &amp;lt;- igraph::graph_from_data_frame(d = edge_list, directed = F, vertices = nodes) %&amp;gt;% 
  as_tbl_graph() %&amp;gt;% 
  filter(!str_detect(school, &amp;#39;Hawai&amp;#39;))

print(g)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tbl_graph: 129 nodes and 1137 edges
## #
## # An undirected simple graph with 1 component
## #
## # Node Data: 129 x 5 (active)
##   name  school            conference    latitude longitude
##   &amp;lt;chr&amp;gt; &amp;lt;chr&amp;gt;             &amp;lt;chr&amp;gt;            &amp;lt;dbl&amp;gt;     &amp;lt;dbl&amp;gt;
## 1 2005  Air Force         Mountain West     39.0    -105. 
## 2 2006  Akron             Mid-American      41.1     -81.5
## 3 333   Alabama           SEC               33.2     -87.6
## 4 2026  Appalachian State Sun Belt          36.2     -81.7
## 5 12    Arizona           Pac-12            32.2    -111. 
## 6 9     Arizona State     Pac-12            33.4    -112. 
## # ... with 123 more rows
## #
## # Edge Data: 1,137 x 5
##    from    to first_team second_team      weight
##   &amp;lt;int&amp;gt; &amp;lt;int&amp;gt; &amp;lt;chr&amp;gt;      &amp;lt;chr&amp;gt;             &amp;lt;int&amp;gt;
## 1    10    90 Auburn     San José State        2
## 2    10    52 Auburn     Louisville            1
## 3    10    70 Auburn     New Mexico State      1
## # ... with 1,134 more rows&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Note that the output contains two sets of data, one for nodes and one for edges, and that the node data is marked as &lt;em&gt;(active)&lt;/em&gt;. The &lt;code&gt;activate&lt;/code&gt; function lets a user switch between the node and edge data within the &lt;code&gt;tbl_graph&lt;/code&gt; object and use functions like &lt;code&gt;mutate&lt;/code&gt; and &lt;code&gt;filter&lt;/code&gt; on whichever set is active.&lt;/p&gt;
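&lt;p&gt;As a small hypothetical example of that workflow (not used in the rest of this post), switching to the edge data to drop the one-time match-ups would look something like:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Hypothetical sketch: make the edge data active, filter it,
# then make the node data active again
g %&amp;gt;%
  activate(edges) %&amp;gt;%
  filter(weight &amp;gt; 1) %&amp;gt;%
  activate(nodes)&lt;/code&gt;&lt;/pre&gt;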
&lt;div id=&#34;visualizing-the-graph&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Visualizing the Graph&lt;/h3&gt;
&lt;p&gt;Normally, a graph can be displayed using any number of layout algorithms designed to show clustering and separation. In this case, however, my nodes are actual schools with real locations given by their latitudes and longitudes. So if I want to show them on a United States map, I need a layout that fixes the nodes at their true geographic positions. This can be done with the &lt;code&gt;create_layout&lt;/code&gt; function, which takes the graph and the &lt;em&gt;x&lt;/em&gt; and &lt;em&gt;y&lt;/em&gt; positions. Since those &lt;em&gt;x&lt;/em&gt; and &lt;em&gt;y&lt;/em&gt; positions need to be in the same order as the nodes in the graph object, I reference the graph object directly when populating &lt;em&gt;x&lt;/em&gt; and &lt;em&gt;y&lt;/em&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;lay = create_layout(g, &amp;#39;manual&amp;#39;, x= g %&amp;gt;% pull(longitude), y=g %&amp;gt;% pull(latitude))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With the layout in place I can construct the graph. The syntax for {ggraph} isn’t much different from {ggplot2}. The main difference is in the starting function, where {ggraph} takes in a graph and/or a layout. In this case, because my custom layout already contains the graph, I can just pass in the layout. Then there are some graph-specific geoms such as &lt;code&gt;geom_node_point&lt;/code&gt;, which places a point at each node, and &lt;code&gt;geom_edge_arc&lt;/code&gt;, which draws an arc for each edge with the &lt;em&gt;strength&lt;/em&gt; parameter controlling how “arc-y” to make the edge (as opposed to a straight line, which could be drawn with &lt;code&gt;geom_edge_link&lt;/code&gt;). There are also some edge-specific aesthetics like edge_alpha vs. alpha. But if you’re familiar with {ggplot2} then this syntax shouldn’t be too different. The only other piece which I had never used before was &lt;code&gt;borders(&#34;state&#34;, color = &#39;grey90&#39;)&lt;/code&gt; to draw the US state borders.&lt;/p&gt;
&lt;p&gt;While the more common games will show up as thicker and brighter lines, not everyone knows the location of every FBS college in the US. So for the match-ups that occurred in at least 8 of the 10 available years, I’ll add labels to the edges.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggraph(lay) + 
  borders(&amp;quot;state&amp;quot;, color = &amp;#39;grey90&amp;#39;) +
  geom_node_point(color = &amp;#39;grey90&amp;#39;) + 
  geom_edge_arc(strength = 0.1, 
                aes(edge_alpha = weight, 
                    edge_color = weight, 
                    edge_width = weight,
                    label = if_else(weight &amp;gt;= 8, 
                                    paste0(first_team,&amp;#39;-&amp;#39;,second_team), &amp;quot;&amp;quot;)
                ),
                vjust = -.5,
                hjust = 0,
                label_colour = &amp;#39;white&amp;#39;,
                label_size = 6) + 
  scale_edge_color_viridis(begin = .2, end = .8, option = &amp;quot;A&amp;quot;, direction = 1,
                           labels = round) + 
  scale_edge_width_continuous(range = c(.5, 1.5), guide = &amp;#39;none&amp;#39;) + 
  scale_edge_alpha_continuous(guide = &amp;#39;none&amp;#39;, range = c(0.1, 1)) + 
  labs(title = &amp;quot;NCAA FBS Non-Conference Games (2010 - 2019)&amp;quot;,
       caption = &amp;#39;**Source:** CollegeFootballData API&amp;#39;,
       edge_color = &amp;quot;# of Games Played&amp;quot;) + 
  theme(
    panel.background = element_rect(fill = &amp;#39;black&amp;#39;),
    plot.background = element_rect(fill = &amp;#39;black&amp;#39;),
    plot.caption = element_markdown(color = &amp;#39;white&amp;#39;, size = 16),
    plot.subtitle = element_textbox_simple(family = &amp;#39;roboto&amp;#39;, size = 20, 
                                           color = &amp;#39;white&amp;#39;),
    plot.title = element_markdown(hjust = .5, family = &amp;#39;roboto&amp;#39;, 
                                  color = &amp;#39;white&amp;#39;, size = 40),
    legend.position = &amp;#39;bottom&amp;#39;,
    legend.title = element_text(family = &amp;#39;roboto&amp;#39;, size = 20, color = &amp;#39;white&amp;#39;, 
                                vjust = 1),
    legend.text = element_text(family = &amp;#39;roboto&amp;#39;, size = 20, color = &amp;#39;white&amp;#39;),
    legend.background = element_rect(fill = &amp;#39;black&amp;#39;)
  )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2021/12/27/exploring-college-football-non-conference-rivalries-with-ggraph/index_files/figure-html/final_graph-1.png&#34; width=&#34;768&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;analysis&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Analysis&lt;/h2&gt;
&lt;p&gt;Besides looking cool (in my opinion), this chart shows an edge for &lt;strong&gt;every&lt;/strong&gt; non-conference game that occurred between 2010 and 2019, which is a lot of games. But to answer the question of the largest non-Conference rivalries, a few patterns arise:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;The independent schools, such as Notre Dame and BYU, are over-represented, which is not surprising since all of their games are non-conference games.&lt;/li&gt;
&lt;li&gt;Games between schools that are in-state but in different conferences (Florida vs. Florida State, Colorado vs. Colorado State, Clemson vs. South Carolina, Georgia vs. Georgia Tech).&lt;/li&gt;
&lt;li&gt;Games between schools that have functional reasons to be rivals such as the three service academies (Army, Navy, and Air Force).&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;While not terribly surprising for anyone who follows college football, this post hopefully shows how you can create a network graph from geographic coordinates and fix the layout so that it can be drawn on top of a real map.&lt;/p&gt;
&lt;p&gt;In the next post I’ll be continuing on the theme of College Football and network graphs to see what we can learn about Conference Realignment!&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>What&#39;s the Most American of American Films?  An Analysis with {gt} and {gtExtras}</title>
      <link>https://jlaw.netlify.app/2021/10/18/what-s-the-most-american-of-american-films-an-analysis-with-gt-and-gtextras/</link>
      <pubDate>Mon, 18 Oct 2021 00:00:00 +0000</pubDate>
      <guid>https://jlaw.netlify.app/2021/10/18/what-s-the-most-american-of-american-films-an-analysis-with-gt-and-gtextras/</guid>
      <description>
&lt;script src=&#34;https://jlaw.netlify.app/2021/10/18/what-s-the-most-american-of-american-films-an-analysis-with-gt-and-gtextras/index_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;p&gt;&lt;img src=&#34;tbl_small.PNG&#34; /&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;I love movies&lt;/strong&gt;. I enjoy watching them, I enjoy reading about the industry (sometimes), and as a bit of a data-nerd (exhibit a: &lt;a href=&#34;https://jlaw.netlify.app&#34;&gt;my blog&lt;/a&gt;), I enjoy learning about the outliers in the industry. One of my favorite trends to follow is Hollywood being driven more and more by the International Box Office and the impact this has on the types of movies being made. One of my favorite examples is the movie &lt;a href=&#34;https://www.boxofficemojo.com/title/tt0803096/?ref_=bo_se_r_1&#34;&gt;Warcraft&lt;/a&gt;. From a critical perspective the movie is not good, sporting a &lt;a href=&#34;https://www.rottentomatoes.com/m/warcraft&#34;&gt;Rotten Tomatoes score&lt;/a&gt; of 28% (although the audience score is 76%). However, there is a massive disparity in its box office gross, with only $47M of its $439M coming from the United States. Ultimately, this movie was a failure in the US but incredibly popular internationally.&lt;/p&gt;
&lt;p&gt;With the announcement of the &lt;a href=&#34;https://blog.rstudio.com/2021/09/30/rstudio-table-contest-2021/&#34;&gt;RStudio 2021 Table Contest&lt;/a&gt;, I wanted to identify the movies that were successful abroad but a failure in the US. But after playing with the data a bit I decided to flip the question and ask &lt;strong&gt;what is the most “American” movie&lt;/strong&gt;: that is, which movies were the most successful in the US while not performing well abroad.&lt;/p&gt;
&lt;div id=&#34;part-1-gathering-the-data&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Part 1: Gathering the Data&lt;/h1&gt;
&lt;p&gt;&lt;a href=&#34;https://www.boxofficemojo.com/chart/ww_top_lifetime_gross/?offset=0&#34;&gt;Box Office Mojo&lt;/a&gt; has a table with the Top 1000 grossing movies and their splits between Domestic and International grosses. This table should form the backbone for finding successful US movies. However, since the most “American” movie could be anywhere in the Top 1000, I’ll need to gather all 1000.&lt;/p&gt;
&lt;div id=&#34;loading-libraries&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Loading Libraries&lt;/h2&gt;
&lt;p&gt;Aside from &lt;code&gt;tidyverse&lt;/code&gt; the main package needed to extract this table will be &lt;code&gt;rvest&lt;/code&gt; which is used for tidy web scraping. The &lt;code&gt;glue&lt;/code&gt; package will be used to make string construction a bit easier and &lt;code&gt;httr&lt;/code&gt; will be used to access the &lt;a href=&#34;https://www.omdbapi.com&#34;&gt;Open Movie Database API&lt;/a&gt; to augment the initial Box Office gross data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(rvest) # Scrape Table From BoxOfficeMofo
library(tidyverse) # Data Manipulations
library(glue) # String Interpolation
library(httr) # Accessing the OMDB API&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Since the Box Office Mojo table is paginated, I’ll need a loop to get through all 1000. The starting point for the table is controlled by the offset parameter in the URL. The &lt;code&gt;map_dfr&lt;/code&gt; function from &lt;code&gt;purrr&lt;/code&gt; will make it very easy to loop through the different offset parameters and combine each run into a single data set.&lt;/p&gt;
&lt;p&gt;I’ll be feeding &lt;code&gt;map_dfr&lt;/code&gt; parameter values of 0, 200, 400, 600, and 800 iteratively and passing it into the Box Office Mojo URL. The &lt;code&gt;glue()&lt;/code&gt; function allows me to insert the offset value directly into the string through the &lt;code&gt;{}&lt;/code&gt;. In this code block each iteration:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Grabs an offset parameter (0 to 800, by 200)&lt;/li&gt;
&lt;li&gt;Passes that into an anonymous function as the parameter &lt;code&gt;x&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Runs &lt;code&gt;read_html()&lt;/code&gt; on the URL with the offset and extracts the &lt;em&gt;table&lt;/em&gt; element with &lt;code&gt;html_element()&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Extracts the information from the table with &lt;code&gt;html_table()&lt;/code&gt; into a tibble&lt;/li&gt;
&lt;li&gt;When I get to the OMDB API piece, rather than searching by title I can search directly by IMDB ID. Since Box Office Mojo is owned by IMDB, I’m going to extract the ID from the links in the table:
&lt;ul&gt;
&lt;li&gt;From the previously extracted &lt;em&gt;table&lt;/em&gt; element, extract the &lt;em&gt;&lt;a&gt;&lt;/em&gt; tags with &lt;code&gt;html_elements()&lt;/code&gt; and extract the &lt;em&gt;href&lt;/em&gt; attributes from those &lt;em&gt;&lt;a&gt;&lt;/em&gt; tags using &lt;code&gt;html_attr()&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Since &lt;code&gt;html_attr()&lt;/code&gt; returns &lt;strong&gt;all&lt;/strong&gt; the &lt;em&gt;href&lt;/em&gt; attributes as a vector, not just the IMDB IDs, I use &lt;code&gt;keep()&lt;/code&gt; from &lt;code&gt;purrr&lt;/code&gt; to retain only the elements that contain the string “tt”, as all IMDB IDs start with “tt”.&lt;/li&gt;
&lt;li&gt;Then finally, I pull the “tt” and the numeric portion out of the vector using &lt;code&gt;str_extract()&lt;/code&gt; from &lt;code&gt;stringr&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Iterate through 0 to 800 by 200 and pass as X into the function
tbl &amp;lt;- map_dfr(seq(0, 800, 200),
               function(x){
                 #Read URL
                 base &amp;lt;- glue(&amp;quot;https://www.boxofficemojo.com/chart/ww_top_lifetime_gross/?offset={x}&amp;quot;) %&amp;gt;%
                   read_html() %&amp;gt;% 
                   # Extract Table Structure
                   html_element(&amp;#39;table&amp;#39;)
                 
                   bind_cols(
                     #Get Actual Table Data
                     base %&amp;gt;% html_table(convert = F),
                     
                     #Get IMDB IDs From Links
                     imdb_id = base %&amp;gt;% 
                       html_elements(&amp;#39;a&amp;#39;) %&amp;gt;% 
                       html_attr(&amp;#39;href&amp;#39;) %&amp;gt;%
                       keep(~str_detect(.x, &amp;#39;tt&amp;#39;)) %&amp;gt;%
                       str_extract(&amp;#39;tt\\d+&amp;#39;)
                     
                   )
               })&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In order to get the &lt;code&gt;html_table()&lt;/code&gt; piece to work correctly, I needed to set &lt;code&gt;convert=F&lt;/code&gt; which tells the function not to try to turn numeric-looking values into numbers. Since everything was read in as a character, I need to do some light data cleaning using the &lt;code&gt;parse_number()&lt;/code&gt; function from &lt;code&gt;readr&lt;/code&gt; to turn characters that look like numbers into numbers.&lt;/p&gt;
&lt;p&gt;I’ll also need to define what I mean when I say a movie is the “Most American”. What I want is to find movies that did well in the US and didn’t do well abroad. But…&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;If I look at the highest percentage of Domestic Gross I’ll get movies that might not have had an International release or did not have a large US gross (and therefore might not have been successful in the US)&lt;/li&gt;
&lt;li&gt;If I look at the highest differences between US and International Gross I might find things that made a lot of money both Domestically and Internationally but just more domestically.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In order to find a balance between the two, I create a “domestic score”, which is the ratio of the percent of Worldwide Gross that was Domestic to the percent that was International (in order to favor US-centric movies), weighted by the &lt;em&gt;log2&lt;/em&gt; of the Domestic Lifetime Gross in order to make sure that we’re finding successful movies and not just small movies that were only released in the US.&lt;/p&gt;
&lt;center&gt;
&lt;img src=&#34;equation.PNG&#34; /&gt;
&lt;/center&gt;
&lt;p&gt;Then, since I want my results to fit in a table, I don’t need all 1,000 movies, so I’ll use &lt;code&gt;arrange()&lt;/code&gt; and &lt;code&gt;head()&lt;/code&gt; to grab the Top 5 by the domestic score.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;tbl_clean &amp;lt;- tbl %&amp;gt;% 
  janitor::clean_names() %&amp;gt;% 
  mutate(
    rank = parse_number(rank),
    worldwide_lifetime_gross = parse_number(worldwide_lifetime_gross),
    domestic_lifetime_gross = parse_number(domestic_lifetime_gross),
    domestic_percent = parse_number(domestic_percent)/100,
    foreign_lifetime_gross = parse_number(foreign_lifetime_gross),
    foreign_percent = parse_number(foreign_percent)/100,
    year = parse_number(year),
    # Developing a way to get the highest domestic percentages that also did well domestically
    domestic_score = (domestic_percent / foreign_percent)*log2(domestic_lifetime_gross)
  ) %&amp;gt;%
  arrange(-domestic_score) %&amp;gt;%
  # Keep the Top 5 as candidates for the API
  head(5)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To make this table a little more fun, there are a couple of elements that I’d like to bring in from the &lt;a href=&#34;https://www.omdbapi.com&#34;&gt;Open Movie Database&lt;/a&gt; such as the Rotten Tomatoes score, release date, awards, and the URL for the movie’s poster. In order to use the API you first need to register for an API key. I’ve stored mine in my &lt;em&gt;.Renviron&lt;/em&gt; file so I can place it into the URL with glue.&lt;/p&gt;
&lt;p&gt;To use the API I can search by the IMDB ID I gathered above, which is passed as the &lt;em&gt;i=&lt;/em&gt; parameter of the URL given to the &lt;code&gt;GET()&lt;/code&gt; function from the &lt;code&gt;httr&lt;/code&gt; package. The IDs for the 5 movies from above are passed in using the &lt;code&gt;map_dfr()&lt;/code&gt; function. The anonymous function takes in an IMDB ID and returns a tibble that contains the extra information that I wanted for the table.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;###Use OMDB Data for the Country Filters and Poster Data
omdb_data &amp;lt;- map_dfr(tbl_clean$imdb_id,
                      function(id){
                        omdb_resp &amp;lt;- GET(URLencode(glue(&amp;quot;https://www.omdbapi.com/?apikey={Sys.getenv(&amp;#39;OMDB_API_KEY&amp;#39;)}&amp;amp;i={id}&amp;amp;type=movie&amp;amp;r=json&amp;quot;)))
                        if(content(omdb_resp)$Response == &amp;quot;True&amp;quot;){
                          return(
                            content(omdb_resp, as = &amp;#39;parsed&amp;#39;) %&amp;gt;% 
                              tibble(
                                imdb_id = id,
                                api_title = .$Title,
                                release_date = .$Released,
                                runtime = .$Runtime,
                                language = .$Language,
                                country = .$Country,
                                awards = .$Awards,
                                poster_url = .$Poster,
                                ratings_source = ifelse(length(.$Ratings) &amp;gt; 0,
                                                        .$Ratings[[2]]$Source,
                                                        &amp;quot;missing&amp;quot;),
                                rating = ifelse(length(.$Ratings) &amp;gt; 0,
                                                .$Ratings[[2]]$Value,
                                                &amp;quot;-99&amp;quot;)
                              ) %&amp;gt;% select(-.) %&amp;gt;% distinct() 
                          )
                        }
                      })&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The raw JSON returned from the API looks like:
&lt;img src=&#34;json_output.PNG&#34; /&gt;&lt;/p&gt;
&lt;p&gt;and output of the OMDB data table looks like:&lt;/p&gt;
&lt;table&gt;
&lt;colgroup&gt;
&lt;col width=&#34;13%&#34; /&gt;
&lt;col width=&#34;86%&#34; /&gt;
&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;center&#34;&gt;field&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;imdb_id&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;tt0878804&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;api_title&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;The Blind Side&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;release_date&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;20 Nov 2009&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;runtime&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;129 min&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;language&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;English&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;country&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;United States&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;awards&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;Won 1 Oscar. 9 wins &amp;amp; 30 nominations total&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;poster_url&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;&lt;a href=&#34;https://m.media-amazon.com/images/M/MV5BMjEzOTE3ODM3OF5BMl5BanBnXkFtZTcwMzYyODI4Mg@@._V1_SX300.jpg&#34; class=&#34;uri&#34;&gt;https://m.media-amazon.com/images/M/MV5BMjEzOTE3ODM3OF5BMl5BanBnXkFtZTcwMzYyODI4Mg@@._V1_SX300.jpg&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;ratings_source&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;Rotten Tomatoes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;rating&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;66%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;With the Box Office data and the OMDB data in separate data sets, I can join them through the common IMDB id. Finally, I’ll keep only movies whose country listing includes the United States (can’t be American if not at least partially made in the good ol’ USA) and I’ll extract the number of Oscars won out of the awards string to be used later.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#Combine All Data
combine_dt &amp;lt;- tbl_clean %&amp;gt;% 
  inner_join(omdb_data, by = &amp;quot;imdb_id&amp;quot;) %&amp;gt;%
  #Keep US Movies
  filter(str_detect(country, &amp;quot;United States&amp;quot;)) %&amp;gt;%
  extract(awards, &amp;quot;num_oscars&amp;quot;, &amp;quot;Won (\\d+) Oscar&amp;quot;, remove = F, convert = T) %&amp;gt;%
  replace_na(list(num_oscars = 0))&lt;/code&gt;&lt;/pre&gt;
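&lt;p&gt;As a quick aside, here’s a toy illustration (made-up strings, not part of the actual pipeline) of how that &lt;code&gt;extract()&lt;/code&gt; and &lt;code&gt;replace_na()&lt;/code&gt; combination behaves:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#Toy data to show the Oscar-count regex
toy &amp;lt;- tibble::tibble(
  awards = c(&amp;quot;Won 1 Oscar. 9 wins &amp;amp; 30 nominations total&amp;quot;, &amp;quot;3 wins total&amp;quot;)
)

toy %&amp;gt;%
  #Capture the digits between &amp;quot;Won &amp;quot; and &amp;quot; Oscar&amp;quot; into num_oscars
  tidyr::extract(awards, &amp;quot;num_oscars&amp;quot;, &amp;quot;Won (\\d+) Oscar&amp;quot;, remove = F, convert = T) %&amp;gt;%
  #Strings without a match produce NA, which then becomes 0
  tidyr::replace_na(list(num_oscars = 0))
#The first row gets num_oscars = 1; the second gets 0&lt;/code&gt;&lt;/pre&gt;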
&lt;p&gt;With the data set constructed, now onto the table.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;part-2-constructing-the-table&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Part 2: Constructing the Table&lt;/h1&gt;
&lt;p&gt;The libraries used to construct the table are &lt;code&gt;gt&lt;/code&gt; and &lt;code&gt;gtExtras&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(gt)
library(gtExtras)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I plan to use images for the number of Oscars won, the Rotten Tomatoes score (fresh or rotten), and flags to show the Domestic Box Office and International Box Office, so rather than having long URLs in the table construction itself, I’ll create constant variables and refer to those in the code:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ROTTEN_URL = &amp;#39;https://www.rottentomatoes.com/assets/pizza-pie/images/icons/tomatometer/tomatometer-rotten.f1ef4f02ce3.svg&amp;#39;
FRESH_URL = &amp;#39;https://www.rottentomatoes.com/assets/pizza-pie/images/icons/tomatometer/tomatometer-fresh.149b5e8adc3.svg&amp;#39;
OSCAR_URL = &amp;#39;https://upload.wikimedia.org/wikipedia/en/7/7f/Academy_Award_trophy.png&amp;#39;
US_FLAG_URL = &amp;#39;https://upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/188px-Flag_of_the_United_States.svg.png&amp;#39;
WORLD_FLAG_URL = &amp;#39;https://upload.wikimedia.org/wikipedia/commons/thumb/3/3b/EarthFlag1.svg/525px-EarthFlag1.svg.png&amp;#39;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Since &lt;code&gt;gt&lt;/code&gt; has a lot of syntax, I’ll combine a bunch of steps together rather than showing each individual change. But the start of the table is just the &lt;code&gt;gt()&lt;/code&gt; function.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;p &amp;lt;- combine_dt %&amp;gt;%
  select(poster_url, title, release_date, domestic_lifetime_gross, foreign_lifetime_gross,
         domestic_percent, worldwide_lifetime_gross, num_oscars, rating) %&amp;gt;%
  gt() &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;iframe src=&#34;p1.html&#34; scrolling=&#34;no&#34; seamless=&#34;seamless&#34; frameBorder=&#34;0&#34; height=&#34;350&#34; width=&#34;800&#34;&gt;&lt;/iframe&gt;&lt;/p&gt;
&lt;p&gt;At first this is pretty ugly as a table but fortunately &lt;code&gt;gt&lt;/code&gt; and &lt;code&gt;gtExtras&lt;/code&gt; have a lot of very convenient features to make the table very pretty very quickly. The first set of steps will be:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Turn the URL to the movie poster into the action poster with &lt;code&gt;gt_img_rows()&lt;/code&gt; from &lt;code&gt;gtExtras&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Turn the domestic percentage field to a percent format with &lt;code&gt;fmt_percent()&lt;/code&gt; from &lt;code&gt;gt&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Turn the Domestic and Foreign Box Office Gross Values to dollar in millions with &lt;code&gt;fmt_currency&lt;/code&gt; from &lt;code&gt;gt&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Turn the Worldwide Lifetime Gross into a bar plot with &lt;code&gt;gt_plt_bar()&lt;/code&gt; from &lt;code&gt;gtExtras&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;p &amp;lt;- combine_dt %&amp;gt;%
  select(poster_url, title, release_date, domestic_lifetime_gross, foreign_lifetime_gross,
         domestic_percent, worldwide_lifetime_gross, num_oscars, rating) %&amp;gt;%
  gt() %&amp;gt;%
  
  #New Code
  gt_img_rows(poster_url, height = 75) %&amp;gt;%
  fmt_percent(domestic_percent, decimals = 1) %&amp;gt;%
  fmt_currency(columns = c(&amp;quot;domestic_lifetime_gross&amp;quot;, &amp;quot;foreign_lifetime_gross&amp;quot;),
               suffixing = T, decimals = 1) %&amp;gt;%
  gt_plt_bar(worldwide_lifetime_gross, color = &amp;#39;darkgreen&amp;#39;, width = 50)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;iframe src=&#34;p2.html&#34; scrolling=&#34;no&#34; seamless=&#34;seamless&#34; frameBorder=&#34;0&#34; height=&#34;540&#34; width=&#34;800&#34;&gt;&lt;/iframe&gt;&lt;/p&gt;
&lt;p&gt;The next steps will use the &lt;code&gt;text_transform()&lt;/code&gt; function from &lt;code&gt;gt&lt;/code&gt; to turn the number of Oscars won into the literal Oscar image for each Oscar won, and for the Rotten Tomatoes score, I’ll use the “Fresh” image if the score is at or above 60% or the “Rotten” image if below 60%.&lt;/p&gt;
&lt;p&gt;In general the &lt;code&gt;text_transform()&lt;/code&gt; function takes two parameters. The first is where the function will be applied. In the first example, &lt;code&gt;locations = cells_body(rating)&lt;/code&gt; means that I will apply the function defined in &lt;code&gt;fn&lt;/code&gt; to the &lt;em&gt;rating&lt;/em&gt; column. Then for the &lt;code&gt;fn&lt;/code&gt; I’m using &lt;code&gt;glue()&lt;/code&gt; to choose the &lt;em&gt;FRESH_URL&lt;/em&gt; or &lt;em&gt;ROTTEN_URL&lt;/em&gt; based on the numeric value of the rating itself and using &lt;code&gt;web_image()&lt;/code&gt; to display the image.&lt;/p&gt;
&lt;p&gt;For the number of Oscars… I’m not 100% sure why I needed the &lt;code&gt;lapply()&lt;/code&gt; and &lt;code&gt;html()&lt;/code&gt; rendering to get the Oscar statues to repeat. I suppose it has to do with the way that data is passed around inside &lt;code&gt;text_transform()&lt;/code&gt;. However, “working” is better than perfect in this case. The function takes the &lt;code&gt;num_oscars&lt;/code&gt; field and replicates the Oscar image as many times as necessary.&lt;/p&gt;
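&lt;p&gt;My working theory: &lt;code&gt;rep()&lt;/code&gt; returns a separate &lt;code&gt;&amp;lt;img&amp;gt;&lt;/code&gt; tag string for each Oscar, so each cell needs its own &lt;code&gt;gt::html()&lt;/code&gt; wrapper, and &lt;code&gt;lapply()&lt;/code&gt; applies that one cell at a time. A minimal sketch of the per-cell step, with a hypothetical tag string standing in for &lt;code&gt;web_image()&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;num_won &amp;lt;- 3
#rep() gives a character vector with one &amp;lt;img&amp;gt; tag per Oscar won...
tags &amp;lt;- rep(&amp;#39;&amp;lt;img src=&amp;quot;oscar.png&amp;quot; height=&amp;quot;60&amp;quot;&amp;gt;&amp;#39;, num_won)
length(tags) #3
#...and gt::html() marks the result so gt renders it as HTML
#rather than escaping it into literal text&lt;/code&gt;&lt;/pre&gt;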
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;p &amp;lt;- combine_dt %&amp;gt;%
  select(poster_url, title, release_date, domestic_lifetime_gross, foreign_lifetime_gross,
         domestic_percent, worldwide_lifetime_gross, num_oscars, rating) %&amp;gt;%
  gt() %&amp;gt;%
    gt_img_rows(poster_url, height = 75) %&amp;gt;%
    fmt_percent(domestic_percent, decimals = 1) %&amp;gt;%
    fmt_currency(columns = c(&amp;quot;domestic_lifetime_gross&amp;quot;, &amp;quot;foreign_lifetime_gross&amp;quot;),
                 suffixing = T, decimals = 1) %&amp;gt;%
    gt_plt_bar(worldwide_lifetime_gross, color = &amp;#39;darkgreen&amp;#39;, width = 50) %&amp;gt;% 
  
  
    #### NEW CODE
    text_transform(
        locations = cells_body(rating),
        fn = function(rating){
          glue(&amp;#39;{web_image(img)}&amp;lt;br /&amp;gt;{rating}&amp;#39;, 
               img = if_else(parse_number(rating) &amp;lt; 60, ROTTEN_URL, FRESH_URL)
          )
        }
      ) %&amp;gt;%
      text_transform(
        locations = cells_body(num_oscars),
        fn = function(x){
          int_x &amp;lt;- as.integer(x)
          lapply(int_x, function(y){
            rep(web_image(OSCAR_URL, height=60), y) %&amp;gt;%
              gt::html()
          })
          }
      )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;iframe src=&#34;p3.html&#34; scrolling=&#34;no&#34; seamless=&#34;seamless&#34; frameBorder=&#34;0&#34; height=&#34;540&#34; width=&#34;800&#34;&gt;&lt;/iframe&gt;&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;gtExtras&lt;/code&gt; package has an awesome function called &lt;a href=&#34;https://jthomasmock.github.io/gtExtras/reference/gt_merge_stack.html&#34;&gt;gt_merge_stack()&lt;/a&gt; that will take one column and stack it on top of a second column. This is a really convenient way to condense information. Using it, I will merge the title and release date columns, placing the release date under the title.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;p &amp;lt;- combine_dt %&amp;gt;%
  select(poster_url, title, release_date, domestic_lifetime_gross, foreign_lifetime_gross,
         domestic_percent, worldwide_lifetime_gross, num_oscars, rating) %&amp;gt;%
  gt() %&amp;gt;%
    gt_img_rows(poster_url, height = 75) %&amp;gt;%
    fmt_percent(domestic_percent, decimals = 1) %&amp;gt;%
    fmt_currency(columns = c(&amp;quot;domestic_lifetime_gross&amp;quot;, &amp;quot;foreign_lifetime_gross&amp;quot;),
                 suffixing = T, decimals = 1) %&amp;gt;%
    gt_plt_bar(worldwide_lifetime_gross, color = &amp;#39;darkgreen&amp;#39;, width = 50) %&amp;gt;% 
    text_transform(
        locations = cells_body(rating),
        fn = function(rating){
          glue(&amp;#39;{web_image(img)}&amp;lt;br /&amp;gt;{rating}&amp;#39;, 
               img = if_else(parse_number(rating) &amp;lt; 60, ROTTEN_URL, FRESH_URL)
          )
        }
      ) %&amp;gt;%
      text_transform(
        locations = cells_body(num_oscars),
        fn = function(x){
          int_x &amp;lt;- as.integer(x)
          lapply(int_x, function(y){
            rep(web_image(OSCAR_URL, height=60), y) %&amp;gt;%
              gt::html()
          })
          }
      ) %&amp;gt;%
  
  
  ###NEW CODE
      gt_merge_stack(title, release_date)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;iframe src=&#34;p4.html&#34; scrolling=&#34;no&#34; seamless=&#34;seamless&#34; frameBorder=&#34;0&#34; height=&#34;540&#34; width=&#34;800&#34;&gt;&lt;/iframe&gt;
To make a valuable info-graphic I’ll need to add titles and a subtitle, and to give myself appropriate attribution, I’ll add source notes as well. To do this, I’ll use &lt;code&gt;tab_header()&lt;/code&gt; to define the title and subtitle, and &lt;code&gt;tab_source_note()&lt;/code&gt; to add the source line. Within these blocks the &lt;code&gt;html()&lt;/code&gt; and &lt;code&gt;md()&lt;/code&gt; functions allow for the use of HTML and Markdown respectively to render text.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;p &amp;lt;- combine_dt %&amp;gt;%
  select(poster_url, title, release_date, domestic_lifetime_gross, foreign_lifetime_gross,
         domestic_percent, worldwide_lifetime_gross, num_oscars, rating) %&amp;gt;%
  gt() %&amp;gt;%
    gt_img_rows(poster_url, height = 75) %&amp;gt;%
    fmt_percent(domestic_percent, decimals = 1) %&amp;gt;%
    fmt_currency(columns = c(&amp;quot;domestic_lifetime_gross&amp;quot;, &amp;quot;foreign_lifetime_gross&amp;quot;),
                 suffixing = T, decimals = 1) %&amp;gt;%
    gt_plt_bar(worldwide_lifetime_gross, color = &amp;#39;darkgreen&amp;#39;, width = 50) %&amp;gt;% 
    text_transform(
        locations = cells_body(rating),
        fn = function(rating){
          glue(&amp;#39;{web_image(img)}&amp;lt;br /&amp;gt;{rating}&amp;#39;, 
               img = if_else(parse_number(rating) &amp;lt; 60, ROTTEN_URL, FRESH_URL)
          )
        }
      ) %&amp;gt;%
    text_transform(
        locations = cells_body(num_oscars),
        fn = function(x){
          int_x &amp;lt;- as.integer(x)
          lapply(int_x, function(y){
            rep(web_image(OSCAR_URL, height=60), y) %&amp;gt;%
              gt::html()
          })
          }
        ) %&amp;gt;%
    gt_merge_stack(title, release_date) %&amp;gt;%
  
  ###NEW CODE
    tab_header(
      title = html(&amp;quot;What are the most &amp;lt;b&amp;gt;&amp;lt;span style=&amp;#39;color:#002868&amp;#39;&amp;gt;American&amp;lt;/span&amp;gt;&amp;lt;/b&amp;gt; of American Films?&amp;quot;),
      subtitle = html(&amp;quot;As measured by the share of Box Office Gross coming from the United States versus the rest of the world, movies with or about &amp;lt;b&amp;gt;Adam Sandler&amp;lt;/b&amp;gt;, &amp;lt;b&amp;gt;Football&amp;lt;/b&amp;gt;, and &amp;lt;b&amp;gt;Christmas&amp;lt;/b&amp;gt; tend to be Box Office successes in the United States but not the rest of the world.  Although, it is unclear whether it is Football or Adam Sandler that makes the movie most appealing to American tastes.&amp;quot;)
      ) %&amp;gt;%
      tab_source_note(
      md(&amp;quot;***Author:*** JLaw | ***Sources:*** [BoxOfficeMojo.com](https://www.boxofficemojo.com/chart/ww_top_lifetime_gross/?offset=0) and [Open Movie Database](https://www.omdbapi.com/)&amp;quot;)
    ) &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;iframe src=&#34;p5.html&#34; scrolling=&#34;no&#34; seamless=&#34;seamless&#34; frameBorder=&#34;0&#34; height=&#34;670&#34; width=&#34;800&#34;&gt;&lt;/iframe&gt;
Since the table can get pretty wide, it would be helpful to alternate the background colors of the rows so that it’s easy to follow the information. This can be done with &lt;code&gt;opt_row_striping()&lt;/code&gt;, which adds the striping with defaults, and the &lt;em&gt;row.striping.background_color&lt;/em&gt; option within &lt;code&gt;tab_options()&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;p &amp;lt;- combine_dt %&amp;gt;%
  select(poster_url, title, release_date, domestic_lifetime_gross, foreign_lifetime_gross,
         domestic_percent, worldwide_lifetime_gross, num_oscars, rating) %&amp;gt;%
  gt() %&amp;gt;%
    gt_img_rows(poster_url, height = 75) %&amp;gt;%
    fmt_percent(domestic_percent, decimals = 1) %&amp;gt;%
    fmt_currency(columns = c(&amp;quot;domestic_lifetime_gross&amp;quot;, &amp;quot;foreign_lifetime_gross&amp;quot;),
                 suffixing = T, decimals = 1) %&amp;gt;%
    gt_plt_bar(worldwide_lifetime_gross, color = &amp;#39;darkgreen&amp;#39;, width = 50) %&amp;gt;% 
    text_transform(
        locations = cells_body(rating),
        fn = function(rating){
          glue(&amp;#39;{web_image(img)}&amp;lt;br /&amp;gt;{rating}&amp;#39;, 
               img = if_else(parse_number(rating) &amp;lt; 60, ROTTEN_URL, FRESH_URL)
          )
        }
      ) %&amp;gt;%
    text_transform(
        locations = cells_body(num_oscars),
        fn = function(x){
          int_x &amp;lt;- as.integer(x)
          lapply(int_x, function(y){
            rep(web_image(OSCAR_URL, height=60), y) %&amp;gt;%
              gt::html()
          })
          }
        ) %&amp;gt;%
    gt_merge_stack(title, release_date) %&amp;gt;%
    tab_header(
      title = html(&amp;quot;What are the most &amp;lt;b&amp;gt;&amp;lt;span style=&amp;#39;color:#002868&amp;#39;&amp;gt;American&amp;lt;/span&amp;gt;&amp;lt;/b&amp;gt; of American Films?&amp;quot;),
      subtitle = html(&amp;quot;As measured by the share of Box Office Gross coming from the United States versus the rest of the world, movies with or about &amp;lt;b&amp;gt;Adam Sandler&amp;lt;/b&amp;gt;, &amp;lt;b&amp;gt;Football&amp;lt;/b&amp;gt;, and &amp;lt;b&amp;gt;Christmas&amp;lt;/b&amp;gt; tend to be Box Office successes in the United States but not the rest of the world.  Although, it is unclear whether it is Football or Adam Sandler that makes the movie most appealing to American tastes.&amp;quot;)
      ) %&amp;gt;%
      tab_source_note(
      md(&amp;quot;***Author:*** JLaw | ***Sources:*** [BoxOfficeMojo.com](https://www.boxofficemojo.com/chart/ww_top_lifetime_gross/?offset=0) and [Open Movie Database](https://www.omdbapi.com/)&amp;quot;)
    ) %&amp;gt;%
  
  ###NEW CODE
  opt_row_striping() %&amp;gt;%
  tab_options(row.striping.background_color = &amp;quot;#ececec&amp;quot;) &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;iframe src=&#34;p6.html&#34; scrolling=&#34;no&#34; seamless=&#34;seamless&#34; frameBorder=&#34;0&#34; height=&#34;670&#34; width=&#34;800&#34;&gt;&lt;/iframe&gt;
Now every other row has a light shade of grey.&lt;/p&gt;
&lt;p&gt;The next thing to do is to fix up the column labels. This is done with the &lt;code&gt;cols_label()&lt;/code&gt; function, which allows me to change how the variable names used for each column will be displayed. Using &lt;code&gt;glue()&lt;/code&gt;, &lt;code&gt;html()&lt;/code&gt;, &lt;code&gt;web_image()&lt;/code&gt;, and &lt;code&gt;emo::ji()&lt;/code&gt;, I can insert images into the column titles. Also, since so many columns are related to Box Office Grosses, I’ll create a column spanner with &lt;code&gt;tab_spanner()&lt;/code&gt; that goes from the &lt;em&gt;domestic_lifetime_gross&lt;/em&gt; column to &lt;em&gt;worldwide_lifetime_gross&lt;/em&gt;. Finally, since removing the label of &lt;em&gt;poster_url&lt;/em&gt; will shrink the column width, I’ll increase the width with &lt;code&gt;cols_width()&lt;/code&gt; and the &lt;code&gt;px()&lt;/code&gt; function.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;p &amp;lt;- combine_dt %&amp;gt;%
  select(poster_url, title, release_date, domestic_lifetime_gross, foreign_lifetime_gross,
         domestic_percent, worldwide_lifetime_gross, num_oscars, rating) %&amp;gt;%
  gt() %&amp;gt;%
    gt_img_rows(poster_url, height = 75) %&amp;gt;%
    fmt_percent(domestic_percent, decimals = 1) %&amp;gt;%
    fmt_currency(columns = c(&amp;quot;domestic_lifetime_gross&amp;quot;, &amp;quot;foreign_lifetime_gross&amp;quot;),
                 suffixing = T, decimals = 1) %&amp;gt;%
    gt_plt_bar(worldwide_lifetime_gross, color = &amp;#39;darkgreen&amp;#39;, width = 50) %&amp;gt;% 
    text_transform(
        locations = cells_body(rating),
        fn = function(rating){
          glue(&amp;#39;{web_image(img)}&amp;lt;br /&amp;gt;{rating}&amp;#39;, 
               img = if_else(parse_number(rating) &amp;lt; 60, ROTTEN_URL, FRESH_URL)
          )
        }
      ) %&amp;gt;%
    text_transform(
        locations = cells_body(num_oscars),
        fn = function(x){
          int_x &amp;lt;- as.integer(x)
          lapply(int_x, function(y){
            rep(web_image(OSCAR_URL, height=60), y) %&amp;gt;%
              gt::html()
          })
          }
        ) %&amp;gt;%
    gt_merge_stack(title, release_date) %&amp;gt;%
    tab_header(
      title = html(&amp;quot;What are the most &amp;lt;b&amp;gt;&amp;lt;span style=&amp;#39;color:#002868&amp;#39;&amp;gt;American&amp;lt;/span&amp;gt;&amp;lt;/b&amp;gt; of American Films?&amp;quot;),
      subtitle = html(&amp;quot;As measured by the share of Box Office Gross coming from the United States versus the rest of the world, movies with or about &amp;lt;b&amp;gt;Adam Sandler&amp;lt;/b&amp;gt;, &amp;lt;b&amp;gt;Football&amp;lt;/b&amp;gt;, and &amp;lt;b&amp;gt;Christmas&amp;lt;/b&amp;gt; tend to be Box Office successes in the United States but not the rest of the world.  Although, it is unclear whether it is Football or Adam Sandler that makes the movie most appealing to American tastes.&amp;quot;)
      ) %&amp;gt;%
      tab_source_note(
      md(&amp;quot;***Author:*** JLaw | ***Sources:*** [BoxOfficeMojo.com](https://www.boxofficemojo.com/chart/ww_top_lifetime_gross/?offset=0) and [Open Movie Database](https://www.omdbapi.com/)&amp;quot;)
    ) %&amp;gt;%
  opt_row_striping() %&amp;gt;%
  tab_options(row.striping.background_color = &amp;quot;#ececec&amp;quot;) %&amp;gt;%
  
  ### New Code
  cols_label(
      poster_url = &amp;quot;&amp;quot;,
      title = &amp;quot;Title&amp;quot;,
      domestic_lifetime_gross = html(glue(&amp;quot;{web_image(US_FLAG_URL)}United States&amp;quot;)),
      foreign_lifetime_gross = html(glue(&amp;quot;{web_image(WORLD_FLAG_URL)}Rest of World&amp;quot;)),
      domestic_percent = &amp;quot;US % of Total&amp;quot;,
      worldwide_lifetime_gross = glue(&amp;quot;{emo::ji(&amp;#39;dollar&amp;#39;)}Total{emo::ji(&amp;#39;dollar&amp;#39;)}&amp;quot;),
      num_oscars = &amp;quot;# Oscars won&amp;quot;,
      rating = &amp;quot;Rotten Tomatoes Score&amp;quot;
    ) %&amp;gt;%
  tab_spanner(label = &amp;quot;Box Office Gross&amp;quot;, columns = domestic_lifetime_gross:worldwide_lifetime_gross) %&amp;gt;%
  cols_width(
      poster_url ~ px(75)
    )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;iframe src=&#34;p7.html&#34; scrolling=&#34;no&#34; seamless=&#34;seamless&#34; frameBorder=&#34;0&#34; height=&#34;770&#34; width=&#34;800&#34;&gt;&lt;/iframe&gt;
Now this has come a long way from the first image, but there’s still a lot of cleaning up to be done with the various &lt;code&gt;tab_style()&lt;/code&gt; calls. The &lt;code&gt;tab_style()&lt;/code&gt; function takes two arguments: the &lt;em&gt;style&lt;/em&gt;, which is how things will look, and the &lt;em&gt;location&lt;/em&gt;, which is where the styling will be applied. For the style I’ll be using the &lt;code&gt;cell_text()&lt;/code&gt; helper to alter the size, weight (bolding), transform (to turn to all uppercase), alignment, and font (using the &lt;code&gt;google_font()&lt;/code&gt; helper).&lt;/p&gt;
&lt;p&gt;For the locations, there are helpers for each part of the table. There is &lt;code&gt;cells_body()&lt;/code&gt; for the cell text, &lt;code&gt;cells_column_labels()&lt;/code&gt; for the column headers, &lt;code&gt;cells_title()&lt;/code&gt;, which can take a “title” or “subtitle” option for those elements, and &lt;code&gt;cells_column_spanners()&lt;/code&gt; for the column spanners I created in the prior step. Within locations, you can further specify which columns the formatting will apply to. While it defaults to &lt;code&gt;everything()&lt;/code&gt;, the columns can be entered as if they’re part of a &lt;em&gt;select&lt;/em&gt; statement in &lt;code&gt;dplyr&lt;/code&gt;. Finally, if wanting to include multiple locations (or multiple styles) in the same code block, the various helpers can be wrapped in a &lt;code&gt;list()&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;For the formatting, I’ll:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Change the font, size, alignment, and make everything upper-case for the &lt;strong&gt;title&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Change the font, size, and alignment for the &lt;strong&gt;subtitle&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Change the font, size, and make everything upper-case and bold for the &lt;strong&gt;column headers&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Make all of the &lt;strong&gt;column headers&lt;/strong&gt; center aligned except for the &lt;em&gt;title&lt;/em&gt; column.&lt;/li&gt;
&lt;li&gt;Change the font and center-align all of the &lt;strong&gt;cells&lt;/strong&gt; except for the &lt;em&gt;title&lt;/em&gt; column.&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;p &amp;lt;- combine_dt %&amp;gt;%
  select(poster_url, title, release_date, domestic_lifetime_gross, foreign_lifetime_gross,
         domestic_percent, worldwide_lifetime_gross, num_oscars, rating) %&amp;gt;%
  gt() %&amp;gt;%
    gt_img_rows(poster_url, height = 75) %&amp;gt;%
    fmt_percent(domestic_percent, decimals = 1) %&amp;gt;%
    fmt_currency(columns = c(&amp;quot;domestic_lifetime_gross&amp;quot;, &amp;quot;foreign_lifetime_gross&amp;quot;),
                 suffixing = T, decimals = 1) %&amp;gt;%
    gt_plt_bar(worldwide_lifetime_gross, color = &amp;#39;darkgreen&amp;#39;, width = 50) %&amp;gt;% 
    text_transform(
        locations = cells_body(rating),
        fn = function(rating){
          glue(&amp;#39;{web_image(img)}&amp;lt;br /&amp;gt;{rating}&amp;#39;, 
               img = if_else(parse_number(rating) &amp;lt; 60, ROTTEN_URL, FRESH_URL)
          )
        }
      ) %&amp;gt;%
    text_transform(
        locations = cells_body(num_oscars),
        fn = function(x){
          int_x &amp;lt;- as.integer(x)
          lapply(int_x, function(y){
            rep(web_image(OSCAR_URL, height=60), y) %&amp;gt;%
              gt::html()
          })
          }
        ) %&amp;gt;%
    gt_merge_stack(title, release_date) %&amp;gt;%
    tab_header(
      title = html(&amp;quot;What are the most &amp;lt;b&amp;gt;&amp;lt;span style=&amp;#39;color:#002868&amp;#39;&amp;gt;American&amp;lt;/span&amp;gt;&amp;lt;/b&amp;gt; of American Films?&amp;quot;),
      subtitle = html(&amp;quot;As measured by the share of Box Office Gross coming from the United States versus the rest of the world, movies with or about &amp;lt;b&amp;gt;Adam Sandler&amp;lt;/b&amp;gt;, &amp;lt;b&amp;gt;Football&amp;lt;/b&amp;gt;, and &amp;lt;b&amp;gt;Christmas&amp;lt;/b&amp;gt; tend to be Box Office successes in the United States but not the rest of the world.  Although, it is unclear whether it is Football or Adam Sandler that makes the movie most appealing to American tastes.&amp;quot;)
      ) %&amp;gt;%
      tab_source_note(
      md(&amp;quot;***Author:*** JLaw | ***Sources:*** [BoxOfficeMojo.com](https://www.boxofficemojo.com/chart/ww_top_lifetime_gross/?offset=0) and [Open Movie Database](https://www.omdbapi.com/)&amp;quot;)
    ) %&amp;gt;%
  opt_row_striping() %&amp;gt;%
  tab_options(row.striping.background_color = &amp;quot;#ececec&amp;quot;) %&amp;gt;%
  cols_label(
      poster_url = &amp;quot;&amp;quot;,
      title = &amp;quot;Title&amp;quot;,
      domestic_lifetime_gross = html(glue(&amp;quot;{web_image(US_FLAG_URL)}United States&amp;quot;)),
      foreign_lifetime_gross = html(glue(&amp;quot;{web_image(WORLD_FLAG_URL)}Rest of World&amp;quot;)),
      domestic_percent = &amp;quot;US % of Total&amp;quot;,
      worldwide_lifetime_gross = glue(&amp;quot;{emo::ji(&amp;#39;dollar&amp;#39;)}Total{emo::ji(&amp;#39;dollar&amp;#39;)}&amp;quot;),
      num_oscars = &amp;quot;# Oscars won&amp;quot;,
      rating = &amp;quot;Rotten Tomatoes Score&amp;quot;
    ) %&amp;gt;%
  tab_spanner(label = &amp;quot;Box Office Gross&amp;quot;, columns = domestic_lifetime_gross:worldwide_lifetime_gross) %&amp;gt;%
  cols_width(
      poster_url ~ px(75)
    ) %&amp;gt;%
  
  ## New Code
  tab_style(
      style = cell_text(
        size = &amp;quot;x-large&amp;quot;,
        font = google_font(&amp;#39;Josefin Sans&amp;#39;),
        align = &amp;#39;left&amp;#39;,
        transform = &amp;#39;uppercase&amp;#39;
      ),
      location = cells_title(&amp;quot;title&amp;quot;)
    ) %&amp;gt;%
  tab_style(
      style = cell_text(
        size = &amp;quot;medium&amp;quot;,
        font = google_font(&amp;#39;Inter&amp;#39;),
        align = &amp;#39;left&amp;#39;
      ),
      location = cells_title(&amp;quot;subtitle&amp;quot;)
    ) %&amp;gt;%
  tab_style(
      style = cell_text(
        size = &amp;#39;large&amp;#39;,
        weight = &amp;#39;bold&amp;#39;,
        transform = &amp;#39;uppercase&amp;#39;,
        font = google_font(&amp;#39;Bebas Neue&amp;#39;)
      ),
      locations = list(cells_column_labels(), cells_column_spanners())
    ) %&amp;gt;%
  tab_style(
      style = cell_text(align = &amp;#39;center&amp;#39;),
      locations = cells_column_labels(-title)
    ) %&amp;gt;%
  tab_style(
      style = cell_text(font = google_font(&amp;#39;Sora&amp;#39;), align = &amp;#39;center&amp;#39;, size = &amp;#39;small&amp;#39;),
      locations = cells_body(-title)
    ) &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;iframe src=&#34;p8.html&#34; scrolling=&#34;no&#34; seamless=&#34;seamless&#34; frameBorder=&#34;0&#34; height=&#34;770&#34; width=&#34;800&#34;&gt;&lt;/iframe&gt;
And now our table looks pretty!!&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;conclusion&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Conclusion&lt;/h1&gt;
&lt;p&gt;In this blog post, I’ve defined a methodology for identifying the most “American” of US films and based on the results in the table it seems like the Most American things are Football, Adam Sandler, and Christmas.&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Finding the Eras of MTV&#39;s The Challenge Through Clustering</title>
      <link>https://jlaw.netlify.app/2021/09/15/finding-the-eras-of-mtv-s-the-challenge-through-clustering/</link>
      <pubDate>Wed, 15 Sep 2021 00:00:00 +0000</pubDate>
      <guid>https://jlaw.netlify.app/2021/09/15/finding-the-eras-of-mtv-s-the-challenge-through-clustering/</guid>
      <description>
&lt;script src=&#34;https://jlaw.netlify.app/2021/09/15/finding-the-eras-of-mtv-s-the-challenge-through-clustering/index_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;p&gt;Since 1998, &lt;a href=&#34;https://en.wikipedia.org/wiki/The_Challenge_(TV_series)&#34;&gt;MTV’s The Challenge&lt;/a&gt; (formerly the Real World/Road Rules Challenge) has graced the airwaves and is currently in Season 37. In a prior &lt;a href=&#34;https://jlaw.netlify.app/2021/03/01/exploring-wednesday-night-cable-ratings-with-ocr/&#34;&gt;post&lt;/a&gt; I mentioned that this is one of my guilty pleasure shows, so this will likely not be the last post based around &lt;a href=&#34;https://www.complex.com/pop-culture/2015/01/the-challenge-mtv-americas-fifth-sport&#34;&gt;America’s 5th professional sport&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;For casting the show, the early years revolved around having alumni from MTV’s The Real World and Road Rules compete against each other (in an odd bit of irony or misnaming, the first season, called Road Rules: All Stars, actually consisted of &lt;strong&gt;only&lt;/strong&gt; Real World alumni). Over the next 37 seasons, the series has evolved, bringing in other MTV properties such as “Are You the One?” and expanding internationally to properties like “Survivor: Turkey” and “Love Island UK”.&lt;/p&gt;
&lt;p&gt;Since the cast of characters has continuously evolved over the 37 seasons, I thought it would be interesting to see if I can algorithmically classify the eras of the show based on the cast of each season through hierarchical clustering, visualizing the results with UMAP.&lt;/p&gt;
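&lt;p&gt;In outline, the idea is to represent each season as a binary vector of its cast members and cluster seasons by how much their casts overlap. Here’s a minimal sketch with made-up data (the real analysis uses cosine similarity via &lt;code&gt;widyr&lt;/code&gt; on the scraped casts, but the shape of the approach is the same):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#Made-up long-format season/cast data
toy &amp;lt;- data.frame(
  season = c(&amp;#39;s1&amp;#39;, &amp;#39;s1&amp;#39;, &amp;#39;s2&amp;#39;, &amp;#39;s2&amp;#39;, &amp;#39;s3&amp;#39;),
  cast   = c(&amp;#39;a&amp;#39;, &amp;#39;b&amp;#39;, &amp;#39;a&amp;#39;, &amp;#39;c&amp;#39;, &amp;#39;d&amp;#39;)
)

#Season x cast-member indicator matrix
m &amp;lt;- as.matrix(table(toy$season, toy$cast))

#Hierarchical clustering on a binary (Jaccard-style) distance
hc &amp;lt;- hclust(dist(m, method = &amp;#39;binary&amp;#39;))
cutree(hc, k = 2) #s1 and s2 share a cast member, so they cluster together&lt;/code&gt;&lt;/pre&gt;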
&lt;div id=&#34;libraries&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Libraries&lt;/h2&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidygraph) # For manipulating network data sets
library(ggraph) # For visualizing network data sets
library(tidyverse) # General Data Manipulation
library(rvest) # For web scraping data from Wikipedia
library(widyr) # For calculating cosine similarity of seasons
library(umap) # For dimensionality reduction&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;getting-the-data&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Getting the Data&lt;/h2&gt;
&lt;p&gt;Since the goal is to cluster the seasons of The Challenge based on the similarity of their casts, I need to get the cast list from each of the 37 seasons. Fortunately, Wikipedia contains the casts within each season’s page. Unfortunately, I’m lazy and really don’t want to specifically hunt down the URLs for each of the 37 seasons and write individual &lt;code&gt;rvest&lt;/code&gt; code.&lt;/p&gt;
&lt;p&gt;So I’ll use the &lt;a href=&#34;https://en.wikipedia.org/wiki/The_Challenge_(TV_series)#Seasons&#34;&gt;Seasons&lt;/a&gt; table on Wikipedia as a driver file for each season’s page: I’ll use &lt;code&gt;rvest&lt;/code&gt; to extract the table by its xpath, pull out all of the anchor elements (&lt;code&gt;&amp;lt;a&amp;gt;&lt;/code&gt;), use &lt;em&gt;html_attrs()&lt;/em&gt; to extract all of the attributes into a list, and use &lt;code&gt;purrr&lt;/code&gt;’s &lt;code&gt;map_dfr&lt;/code&gt; function to combine all of the links into a single data frame. Unfortunately, there are multiple links in each row of the table (one for the title and one for the location of the season), so using &lt;code&gt;stringr&lt;/code&gt;’s &lt;em&gt;str_detect&lt;/em&gt;, I’ll keep only the rows that have the word “Challenge” in the title (or “Stars” in the case of the first season, which was just called “Road Rules: All-Stars”).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;seasons &amp;lt;- read_html(&amp;#39;https://en.wikipedia.org/wiki/The_Challenge_(TV_series)&amp;#39;) %&amp;gt;%
  html_element(xpath = &amp;#39;/html/body/div[3]/div[3]/div[5]/div[1]/table[2]&amp;#39;) %&amp;gt;% 
  html_elements(&amp;#39;a&amp;#39;) %&amp;gt;% 
  html_attrs() %&amp;gt;% 
  map_dfr(bind_rows) %&amp;gt;% 
  filter(str_detect(title, &amp;#39;Challenge|Stars&amp;#39;)) %&amp;gt;%
  select(-class)&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;colgroup&gt;
&lt;col width=&#34;53%&#34; /&gt;
&lt;col width=&#34;46%&#34; /&gt;
&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;href&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;title&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;/wiki/Road_Rules:_All_Stars&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Road Rules: All Stars&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;/wiki/Real_World/Road_Rules_Challenge_(season)&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Real World/Road Rules Challenge (season)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;/wiki/Real_World/Road_Rules_Challenge_2000&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Real World/Road Rules Challenge 2000&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The dataset now has the Wikipedia link for each season in the &lt;em&gt;href&lt;/em&gt; column and a more human-readable title in the &lt;em&gt;title&lt;/em&gt; column.&lt;/p&gt;
&lt;p&gt;The next problem to tackle is iterating through each season to extract the cast table. The issue here is that the cast table is not uniform across the seasons’ pages and does not always appear in the same position. So in the end I did have to look at all 37 pages to determine which tables, and which columns within those tables, to extract.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;## Set up which tables and columns to extract from Wikipedia
seasons &amp;lt;- seasons %&amp;gt;%
  mutate(
    season_num = row_number(), #Define Season Identifier
    # Set Table Numbers On Page To Extract
    table_num = case_when(
      season_num %in% c(1, 12, 16, 19) ~ &amp;#39;2&amp;#39;,
      season_num %in% c(27) ~ &amp;#39;3&amp;#39;,
      season_num %in% c(2, 4, 5, 6, 9, 11) ~ &amp;#39;4,5&amp;#39;,
      TRUE ~ &amp;#39;3, 4&amp;#39;
    ),
    # Set Column Numbers to Extract From Each Table
    keep_cols = case_when(
      season_num %in% c(5) ~ &amp;#39;1, 2&amp;#39;,
      season_num %in% c(12, 19, 27) ~ &amp;#39;1, 3&amp;#39;,
      TRUE ~ &amp;#39;1&amp;#39;
    )
  )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In the default case, the two tables to extract were the 3rd and 4th tables on the page and I only needed the first column.&lt;/p&gt;
&lt;p&gt;With this additional metadata, I could now write a function to read the URL and extract the correct tables and table columns:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;### Write Function to Scrape the Names
get_cast &amp;lt;- function(url, table_num, keep_cols, title, order){
  
  ##Convert the String Inputs into a numeric vector
  table_num = str_split(table_num, &amp;#39;,&amp;#39;) %&amp;gt;% unlist() %&amp;gt;% as.numeric()
  keep_cols = str_split(keep_cols, &amp;#39;,&amp;#39;) %&amp;gt;% unlist() %&amp;gt;% as.numeric()
  
  #Read Page and Filter Tables
  ct &amp;lt;- read_html(paste0(&amp;#39;https://en.wikipedia.org/&amp;#39;,url)) %&amp;gt;%
    # Extract Table Tags
    html_elements(&amp;#39;table&amp;#39;) %&amp;gt;%
    # Keep only the specified tables
    .[table_num] %&amp;gt;% 
    # Extract the information from the tables into a list (if more than 1)
    html_table() %&amp;gt;% 
    # Use MAP to keep only the selected columns from each table
    map(~select(.x, all_of(keep_cols)))
  
  # If keeping one column, rename it to Name;
  # if keeping multiple columns, gather them into a single column called Name
  if(length(keep_cols) == 1){
    ct &amp;lt;- ct %&amp;gt;% 
      map(~rename(.x, &amp;quot;Name&amp;quot; = 1)) 
  }else if(length(keep_cols) &amp;gt; 1){
    ct &amp;lt;- ct %&amp;gt;%
      map(~gather(.x, &amp;quot;Field&amp;quot;, &amp;quot;Name&amp;quot;)) %&amp;gt;% 
      map(~select(.x, 2)) 
  }
  
  # Combine all the tables into one and append the title and order columns
  ct &amp;lt;- ct %&amp;gt;% map_dfr(bind_rows) %&amp;gt;% mutate(title = title, order = order)

  return(ct)
  
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The five parameters passed to this function are all contained in the driver file created above. To iterate through the seasons and create a data set of the cast members I’ll use the &lt;code&gt;pmap_dfr()&lt;/code&gt; function from &lt;code&gt;purrr&lt;/code&gt;, which can provide more than two inputs to a function (unlike map and map2) and combines all the outputs into a single data frame by binding the rows (the dfr part of the function name).&lt;/p&gt;
&lt;p&gt;In pmap, the first argument is a list of the parameter vectors to pass to the function and the second argument is the function to be called. The elements of the list can then be referred to positionally: ..1 is the href parameter (the first element of the list), ..2 the table_num parameter, and so on.&lt;/p&gt;
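&lt;p&gt;As a toy sketch of that positional notation (not part of the original analysis), each ..N picks the Nth vector from the input list:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(purrr)
# ..1 is an element of the first vector, ..2 an element of the second
pmap_chr(list(c(&amp;quot;a&amp;quot;, &amp;quot;b&amp;quot;), c(1, 2)), ~paste0(..1, ..2))
# [1] &amp;quot;a1&amp;quot; &amp;quot;b2&amp;quot;&lt;/code&gt;&lt;/pre&gt;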
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;###Create Dataset with all names
all_cast &amp;lt;- pmap_dfr(list(seasons$href, 
                          seasons$table_num, 
                          seasons$keep_cols,
                          seasons$title,
                          seasons$season_num), 
                     ~get_cast(..1, ..2, ..3, ..4, ..5))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The resulting table now looks like:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;Name&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;title&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;order&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Cynthia Roberts&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Road Rules: All Stars&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Eric Nies&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Road Rules: All Stars&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Jon Brennan&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Road Rules: All Stars&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;div id=&#34;cleaning-the-data-and-final-preparations&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Cleaning the Data and Final Preparations&lt;/h3&gt;
&lt;p&gt;The data on Wikipedia is &lt;em&gt;fairly&lt;/em&gt; clean, but there are places where automation is no substitute for domain knowledge. The cast tables refer to what people were called in that specific season, but as cast members have returned for multiple seasons, what they are called has sometimes changed. For example, the current host of NBC’s First Look, Johnny “Bananas” Devenanzio, began his time on The Challenge as “John Devenanzio”, then “Johnny Devenanzio”, and finally “Johnny ‘Bananas’ Devenanzio” for his most recent 12 seasons. Some female cast members married, so “Tori Hall” became “Tori Fiorenza”. And in the most subtle of changes, “Nany González” appears both with and without the accent over the “a” (huge shoutout to the &lt;a href=&#34;https://www.reddit.com/r/MtvChallenge/comments/pj8by1/oc_visualizing_the_most_frequently_appearing/&#34;&gt;r/MtvChallenge&lt;/a&gt; sub-Reddit for calling me out when that caused Nany to not appear in my data visualization).&lt;/p&gt;
&lt;p&gt;Other changes are less interesting, such as removing footnotes from people’s names, fixing the fact that in the Seasons table both Season 5 and Season 23 are called “Battle of the Seasons”, and appending the seasons’ names onto the cast table.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;###Clean up the Cast Member Columns and Clean up The Title Columns
###Domain Knowledge that these are all the same people (especially the married ones)
all_cast_clean &amp;lt;- all_cast %&amp;gt;%
  mutate(
    #Remove Footnotes
    Name = str_remove_all(Name, &amp;#39;\\[.*\\]&amp;#39;),
    #Fix the Various References to Johnny Bananas
    Name = if_else(str_detect(Name, &amp;#39;John.* Devenanzio&amp;#39;),&amp;#39;Johnny &amp;quot;Bananas&amp;quot; Devenanzio&amp;#39;,Name),
    Name = if_else(str_detect(Name, &amp;#39;Jordan.*Wiseley&amp;#39;), &amp;#39;Jordan Wiseley&amp;#39;, Name),
    Name = if_else(str_detect(Name, &amp;#39;Natalie.*Duran&amp;#39;), &amp;#39;Natalie &amp;quot;Ninja&amp;quot; Duran&amp;#39;, Name),
    Name = if_else(str_detect(Name, &amp;#39;Theresa Gonz&amp;#39;), &amp;#39;Theresa Jones&amp;#39;, Name),
    Name = if_else(str_detect(Name, &amp;#39;Tori Fiorenza&amp;#39;), &amp;#39;Tori Hall&amp;#39;, Name),
    Name = if_else(str_detect(Name, &amp;#39;Nany&amp;#39;), &amp;#39;Nany González&amp;#39;, Name)
  )

##Season Table
seasons_table &amp;lt;- read_html(&amp;#39;https://en.wikipedia.org/wiki/The_Challenge_(TV_series)&amp;#39;) %&amp;gt;%
  html_element(xpath = &amp;#39;/html/body/div[3]/div[3]/div[5]/div[1]/table[2]&amp;#39;) %&amp;gt;%
  html_table() %&amp;gt;%
  janitor::clean_names() %&amp;gt;%
  mutate(year = str_extract(original_release, &amp;#39;\\d{4}&amp;#39;) %&amp;gt;% as.integer()) %&amp;gt;%
  select(order, short_title = title, year) %&amp;gt;%
  distinct() %&amp;gt;%
  mutate(short_title = if_else(order == 23, &amp;#39;Battle of the Seasons 2&amp;#39;, short_title))


all_cast_info &amp;lt;- all_cast_clean %&amp;gt;%
  left_join(seasons_table, by = &amp;quot;order&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;exploring-the-data&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Exploring the Data&lt;/h2&gt;
&lt;p&gt;Before getting into the real meat of the analysis, I’m going to do some quick EDA to answer some potentially interesting questions about The Challenge cast.&lt;/p&gt;
&lt;div id=&#34;who-has-been-on-the-most-challenges&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Who Has Been on the Most Challenges?&lt;/h3&gt;
&lt;p&gt;A quick question might be which challenger has been on the most seasons. This can be answered pretty quickly with the &lt;code&gt;count()&lt;/code&gt; function from &lt;code&gt;dplyr&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;all_cast_info %&amp;gt;%
  count(Name, sort = T) %&amp;gt;%
  head(7) %&amp;gt;%
  ggplot(aes(x = fct_reorder(Name, n), y = n, fill = Name)) + 
    geom_col() + 
    geom_text(aes(label = n, hjust = 0)) +
    ghibli::scale_fill_ghibli_d(name = &amp;#39;LaputaMedium&amp;#39;, guide = &amp;#39;none&amp;#39;) + 
    scale_y_continuous(expand = expansion(mult = c(0, .1))) + 
    coord_flip() + 
    labs(x = &amp;quot;Challenger&amp;quot;, y = &amp;quot;# of Appearances&amp;quot;, 
         title = &amp;quot;Who Has Been on the Most Seasons of the Challenge?&amp;quot;) + 
    cowplot::theme_cowplot() + 
    theme(
      plot.title.position = &amp;#39;plot&amp;#39;
    )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2021/09/15/finding-the-eras-of-mtv-s-the-challenge-through-clustering/index_files/figure-html/most_challenges-1.png&#34; width=&#34;672&#34; /&gt;
As any Challenge fan knows, Johnny Bananas has been on the most seasons with 20 and CT just behind at 19.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;looking-at-consecutive-season-behavior&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Looking at Consecutive Season Behavior&lt;/h2&gt;
&lt;p&gt;An interesting visualization we can do is to explore how frequently Challengers are on consecutive seasons using a series of dumbbell plots. In this plot there will be a point for each endpoint of a stretch of consecutive seasons and they will be connected by a line.&lt;/p&gt;
&lt;p&gt;Check out the post on the &lt;a href=&#34;https://www.reddit.com/r/MtvChallenge/comments/pj8by1/oc_visualizing_the_most_frequently_appearing/&#34;&gt;r/MtvChallenge&lt;/a&gt; sub-Reddit for a nicer (although slightly wrong) version of this plot.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;all_cast_info %&amp;gt;% 
    ## Add the number of seasons for each challenger as a new column
    add_count(Name, name = &amp;#39;num_seasons&amp;#39;) %&amp;gt;%
    # Filter to only those who have been on 10+ seasons
    filter(num_seasons &amp;gt;= 10) %&amp;gt;%
    # For each challenger define consecutive segments based on when the prior
    # season number is more than 1 or missing (for the first observation)
    group_by(Name) %&amp;gt;%
    arrange(order, .by_group = T) %&amp;gt;%
    mutate(
      diff = order - lag(order),
      new_segment = if_else(is.na(diff) | diff &amp;gt; 1, 1, 0),
      run = cumsum(new_segment)
    ) %&amp;gt;% 
    # Define the endpoints of each segment
    group_by(Name, run) %&amp;gt;% 
    summarize(start = min(order),
              end = max(order),
              num_seasons = max(num_seasons)) %&amp;gt;%
  ggplot(aes(x = fct_rev(fct_reorder(Name, start, min)), 
             color = Name, fill = Name)) + 
    geom_linerange(aes(ymin = start, ymax = end), size = 1) + 
    geom_point(aes(y = start), size = 2) + 
    geom_point(aes(y = end), size = 2) + 
    scale_fill_discrete(guide = &amp;#39;none&amp;#39;) + 
    scale_color_discrete(guide = &amp;#39;none&amp;#39;) +
    scale_y_continuous(breaks = seq(1, 37, 2)) + 
    labs(x = &amp;quot;&amp;quot;, y = &amp;quot;Seasons&amp;quot;, title = &amp;quot;How Often Were Challengers On The Show?&amp;quot;,
         subtitle = &amp;quot;*Only Challengers Appearing On At Least 10 Seasons Ordered By First Appearance*&amp;quot;,
         caption = &amp;quot;*Source:* Wikipedia | **Author:** Jlaw&amp;quot;) + 
    coord_flip() + 
    cowplot::theme_cowplot() + 
    theme(
      panel.grid.major.y = element_line(size = .5, color = &amp;#39;#DDDDDD&amp;#39;),
      plot.subtitle = ggtext::element_markdown(),
      plot.title.position = &amp;#39;plot&amp;#39;,
      plot.caption = ggtext::element_markdown(),
      axis.ticks.y = element_blank()
    )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2021/09/15/finding-the-eras-of-mtv-s-the-challenge-through-clustering/index_files/figure-html/dumbbell-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;which-seasons-had-the-highest-percentage-of-one-and-done-challengers&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Which Seasons Had the Highest Percentage of “one and done” Challengers?&lt;/h2&gt;
&lt;p&gt;Sometimes the show will bring a cast member on, it doesn’t work out, and you never see them again. I can look at which seasons had the largest number of these cast members who were never seen again. Since Season 37 is still airing and we don’t know who will or won’t come back, I’ve excluded that season:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;all_cast_info %&amp;gt;% 
  add_count(Name, name = &amp;quot;num_seasons&amp;quot;) %&amp;gt;%
  filter(num_seasons == 1 &amp;amp; order != 37) %&amp;gt;%
  count(short_title, year) %&amp;gt;% 
  ggplot(aes(x = fct_reorder(short_title, n), y = n, fill = year)) + 
    geom_col() + 
    geom_text(aes(label = n), hjust = 0) +
    labs(x = &amp;quot;Season Title&amp;quot;, y = &amp;quot;Number of &amp;#39;one and done&amp;#39; Challengers&amp;quot;,
         title = &amp;quot;What Season Had the Most &amp;#39;One and Done&amp;#39; Challengers&amp;quot;,
         subtitle = &amp;quot;Lighter Colors are Later Seasons&amp;quot;,
         fill = &amp;quot;Year Aired&amp;quot;) +
    scale_y_continuous(expand = expansion(mult = c(0, .1))) + 
    scale_fill_viridis_c() + 
    guides (fill = guide_colourbar(barwidth = 15, barheight = 0.5)) + 
    expand_limits(x = 0, y = 0) + 
    coord_flip() + 
    cowplot::theme_cowplot() + 
    theme(
        plot.title.position = &amp;#39;plot&amp;#39;,
        legend.position = &amp;#39;bottom&amp;#39;
    )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2021/09/15/finding-the-eras-of-mtv-s-the-challenge-through-clustering/index_files/figure-html/one_and_done-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;The seasons with the largest number of one and dones tended to be seasons with large influxes of new challengers due to different formats. Battle of the Seasons had a very large cast and was the first season not to have small teams. Battle of the Bloodlines was a concept where 50% of the challengers were family members who had never been on the show, and thankfully never were again.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;what-are-the-most-similar-episodes-of-the-challenge&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;What Are the Most Similar Seasons of the Challenge?&lt;/h2&gt;
&lt;p&gt;I can visualize season similarity in a network graph; however, I first need to restructure the data. Right now I have only the positive cases, but I need every person/season combination with 1/0 indicators. Then I can use the &lt;code&gt;pairwise_similarity()&lt;/code&gt; function from &lt;code&gt;widyr&lt;/code&gt; to get the cosine similarity of each pair of seasons. The &lt;code&gt;upper=F&lt;/code&gt; setting ensures there’s only one row for each combination (e.g., only A,B rather than both A,B and B,A):&lt;/p&gt;
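&lt;p&gt;For intuition (a toy sketch, not the actual cast data): on 1/0 vectors, cosine similarity is the dot product divided by the product of the vector norms, i.e., the number of shared cast members scaled by the sizes of the two casts:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;a &amp;lt;- c(1, 1, 0)  # season with challengers 1 and 2
b &amp;lt;- c(1, 0, 1)  # season with challengers 1 and 3
# one shared challenger, two challengers per season
sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
# [1] 0.5&lt;/code&gt;&lt;/pre&gt;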
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;similarity &amp;lt;- all_cast_info %&amp;gt;%
  #Create an indicator for all the positive cases
  transmute(order, short_title, Name, ind = 1) %&amp;gt;%
  # Make a wide data set and fill in 0s for all the negative cases
  pivot_wider(
    names_from = &amp;#39;Name&amp;#39;,
    values_from = &amp;#39;ind&amp;#39;,
    values_fill = 0
  ) %&amp;gt;% 
  # Bring the table back to long format with 1/0s
  pivot_longer(
    cols = c(-order, -short_title),
    names_to = &amp;quot;Name&amp;quot;,
    values_to = &amp;quot;ind&amp;quot;
  ) %&amp;gt;% 
  pairwise_similarity(short_title, Name, ind, upper = F, diag = F) %&amp;gt;% 
  arrange(-similarity)  %&amp;gt;%
  filter(similarity &amp;gt; .29)&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;item1&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;item2&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;similarity&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Vendettas&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Final Reckoning&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.6806139&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;War of the Worlds&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;War of the Worlds 2&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.5760221&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Invasion of the Champions&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;XXX: Dirty 30&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.5635760&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The most similar seasons in the data are Vendettas (Season 31) and Final Reckoning (Season 32) which makes sense as these were consecutive seasons that were also the last two pieces of a trilogy.&lt;/p&gt;
&lt;p&gt;The similarity threshold of 0.29 was chosen judgmentally to include as many seasons as possible without over-complicating the graph. The next step is building the network graph itself. I’m setting a seed since the layout in &lt;code&gt;ggraph&lt;/code&gt; is non-deterministic and I’d like it to be reproducible. The similarity data frame is converted to a tbl_graph object with &lt;code&gt;as_tbl_graph&lt;/code&gt;, I join in the short titles to form the labels, and then I set the edges to have alpha values (transparency) tied to similarity and use the names for the node labels.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(20210904)
as_tbl_graph(similarity) %&amp;gt;%
  left_join(seasons_table, by = c(&amp;#39;name&amp;#39; = &amp;quot;short_title&amp;quot;)) %&amp;gt;%
  ggraph(layout = &amp;#39;fr&amp;#39;) + 
    geom_edge_link(aes(alpha = similarity), width = 1.5) + 
    geom_node_label(aes(label = name, fill = order), size = 5) + 
    scale_fill_viridis_c(begin = .3) + 
    scale_shape_discrete(guide = &amp;#39;none&amp;#39;) + 
    scale_x_continuous(expand = expansion(add = c(.6, .8))) +
    labs(title = &amp;quot;Network of Challenge Seasons&amp;quot;,
         subtitle = &amp;quot;Edges measured by Cosine Similarity of Cast&amp;quot;,
         caption = &amp;quot;All Stars and RW vs RR did not have &amp;gt;0.29 Similarity to Any Other Season&amp;quot;,
         alpha = &amp;quot;Cosine Similarity&amp;quot;,
         fill = &amp;quot;Season #&amp;quot;) + 
    theme_graph(plot_margin = margin(30, 0, 0, 30)) + 
    theme(
      legend.position = &amp;#39;bottom&amp;#39;
    )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2021/09/15/finding-the-eras-of-mtv-s-the-challenge-through-clustering/index_files/figure-html/network_graph-1.png&#34; width=&#34;1152&#34; /&gt;
Through the network graph we can see that the first two seasons aren’t connected to anything, so they don’t appear, while Seasons 3 and 5 and Seasons 4 and 6 form their own small clusters. For the rest of the structure, you can trace a path from the early seasons to the later ones.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;clustering-the-seasons-with-hierarchical-clustering&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Clustering the Seasons with Hierarchical Clustering&lt;/h2&gt;
&lt;p&gt;Now that EDA is done, it’s time to determine our eras through clustering. In order to use hierarchical clustering I need to create a distance matrix. To do so I will replicate some of the code from above, where each row is a season, each column is a challenger, and the value is 1 if they were on that season and 0 otherwise.&lt;/p&gt;
&lt;p&gt;Since this data is binary I will be using a binary distance, where a 1/1 pair is a match and any 1/0 pair is a mismatch (e.g., 0 and 0, despite being the same value, does not count as similarity). The distance is defined as the proportion of bits in which only one is on, among those where at least one is on.&lt;/p&gt;
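&lt;p&gt;As a quick illustration with toy vectors (not the actual cast data), base R’s &lt;code&gt;dist&lt;/code&gt; computes this directly:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;x &amp;lt;- c(1, 1, 0, 0)
y &amp;lt;- c(1, 0, 1, 0)
# At least one bit on in positions 1-3; exactly one on in positions 2 and 3
dist(rbind(x, y), method = &amp;#39;binary&amp;#39;)  # 2/3&lt;/code&gt;&lt;/pre&gt;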
&lt;p&gt;Then the hierarchical clustering algorithm is run with &lt;code&gt;hclust&lt;/code&gt;. There are many different agglomeration methods that can be used, including &lt;strong&gt;single&lt;/strong&gt; (where the difference between clusters is defined by their closest elements), &lt;strong&gt;complete&lt;/strong&gt; (which defines differences by the farthest-apart elements), &lt;strong&gt;average&lt;/strong&gt; (which uses the average distance between all pairs of points), and &lt;strong&gt;Ward&lt;/strong&gt; (which minimizes the increase in within-cluster sum of squares). For more information, see this &lt;a href=&#34;https://stats.stackexchange.com/questions/195446/choosing-the-right-linkage-method-for-hierarchical-clustering/217742#217742&#34;&gt;CrossValidated&lt;/a&gt; answer.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Cast Data to Wide Format
dt &amp;lt;- all_cast_info %&amp;gt;%
  transmute(order, short_title, Name, ind = 1) %&amp;gt;%
  pivot_wider(
    names_from = &amp;#39;Name&amp;#39;,
    values_from = &amp;#39;ind&amp;#39;,
    values_fill = 0
  )

# Compute the binary distance matrix and run hierarchical clustering
dst &amp;lt;- dt %&amp;gt;%
  # Remove fields I don&amp;#39;t want part of the distance function
  select(-order, -short_title) %&amp;gt;%
  dist(method = &amp;#39;binary&amp;#39;) %&amp;gt;%
  #the agglomeration method to be used. 
  hclust(method = &amp;#39;ward.D2&amp;#39;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I can then visualize the resulting dendrogram using the &lt;code&gt;plot()&lt;/code&gt; function and supplying the short_title field I previously excluded as a label parameter. By looking at the dendrogram it seems like there are five clusters which I will highlight with the &lt;code&gt;rect.hclust&lt;/code&gt; function and specifying k=5:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;plot(dst, labels = dt$short_title, 
     main = &amp;#39;Hierarchical Clustering of Challenge Seasons&amp;#39;,
     xlab = &amp;#39;&amp;#39;)
rect.hclust(dst, k = 5)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2021/09/15/finding-the-eras-of-mtv-s-the-challenge-through-clustering/index_files/figure-html/dendrogram-1.png&#34; width=&#34;672&#34; /&gt;
Based on the dendrogram, there are five clusters:&lt;/p&gt;
&lt;table&gt;
&lt;tr&gt;
&lt;th&gt;
&lt;/th&gt;
&lt;th&gt;
Cluster #1
&lt;/th&gt;
&lt;th&gt;
Cluster #2
&lt;/th&gt;
&lt;th&gt;
Cluster #3
&lt;/th&gt;
&lt;th&gt;
Cluster #4
&lt;/th&gt;
&lt;th&gt;
Cluster #5
&lt;/th&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
Seasons:
&lt;/td&gt;
&lt;td&gt;
1 (All-Stars) to 11 (Gauntlet 2)
&lt;/td&gt;
&lt;td&gt;
12 (Fresh Meat) to 18 (The Ruins)
&lt;/td&gt;
&lt;td&gt;
19 (Fresh Meat 2) to 26 (Battle of the Exes 2)
&lt;/td&gt;
&lt;td&gt;
27 (Battle of the Bloodlines) to 32 (Final Reckoning)
&lt;/td&gt;
&lt;td&gt;
33 (War of the Worlds) to 37 (Spies, Lies, and Allies)
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
Why?
&lt;/td&gt;
&lt;td&gt;
Original seasons when challengers were only from Real World or Road Rules
&lt;/td&gt;
&lt;td&gt;
First Introduction of challengers not from prior properties
&lt;/td&gt;
&lt;td&gt;
Second injection of challengers not from prior properties
&lt;/td&gt;
&lt;td&gt;
Half of the cast were family members of prior challengers
&lt;/td&gt;
&lt;td&gt;
Introduction of a large influx of new challengers from international reality shows
&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
&lt;p&gt;So it seems like the algorithm latched on to change points where the casts became heavily made up of rookies, which makes sense since a large influx of new cast members forces dissimilarity with prior seasons.&lt;/p&gt;
&lt;p&gt;Returning to the data, I can append the cluster assignment to the original data with the &lt;code&gt;cutree&lt;/code&gt; function, providing it the number of clusters to return.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;h_clust_results &amp;lt;- dt %&amp;gt;%
  mutate(cluster = cutree(dst, k = 5))&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;dimensionality-reduction-with-umap&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Dimensionality Reduction with UMAP&lt;/h2&gt;
&lt;p&gt;The data set used for the clustering contained 37 rows representing each season of The Challenge and 360 columns representing every challenger who has ever been on the show. This type of data is a prime candidate for dimensionality reduction. Uniform Manifold Approximation and Projection (UMAP) is a technique for dimensionality reduction and visualization similar to t-SNE. The UMAP algorithm can be found in the &lt;code&gt;umap&lt;/code&gt; package.&lt;/p&gt;
&lt;p&gt;Running UMAP is pretty straightforward with the &lt;code&gt;umap()&lt;/code&gt; function and here I give it the very wide data set used for clustering. In the returned object there is an element called &lt;em&gt;layout&lt;/em&gt; which contains the compressed two dimensional space returned by UMAP. Again I’m setting a seed as the results of UMAP can be non-deterministic.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(20210904)
ump &amp;lt;- umap(dt %&amp;gt;% select(-order, -short_title))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I can then add those dimensions to the clustering results from above to see how closely the UMAP compression matches the clustering from the &lt;code&gt;hclust&lt;/code&gt; function:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;h_clust_results %&amp;gt;% 
  select(order, short_title, cluster) %&amp;gt;%
  # Add in UMAP dimensions
  mutate(
    dim1 = ump$layout[, 1],
    dim2 = ump$layout[, 2]
  ) %&amp;gt;%
  ggplot(aes(x = dim1, y = dim2, color = factor(cluster))) + 
  geom_text(aes(label = short_title)) + 
  labs(title = &amp;#39;UMAP Projection of Challenge Seasons&amp;#39;,
       subtitle = &amp;quot;Colors Represent Prior Clustering&amp;quot;) + 
  scale_color_discrete(guide = &amp;#39;none&amp;#39;) + 
  scale_x_continuous(expand = expansion(add = c(.3, .4))) + 
  cowplot::theme_cowplot() + 
  theme(
    axis.ticks = element_blank(),
    axis.text = element_blank()
  )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2021/09/15/finding-the-eras-of-mtv-s-the-challenge-through-clustering/index_files/figure-html/umap_viz-1.png&#34; width=&#34;672&#34; /&gt;
Overall, the UMAP projection captures similar information to the clustering: even though both methods were unsupervised and run independently, seasons sharing a color (the prior clusters) land very close together in the UMAP-projected space.&lt;/p&gt;
&lt;div id=&#34;predicting-new-observations-with-umap&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Predicting New Observations with UMAP&lt;/h3&gt;
&lt;p&gt;In the spring of 2021, &lt;a href=&#34;https://en.wikipedia.org/wiki/The_Challenge:_All_Stars&#34;&gt;The Challenge: All Stars&lt;/a&gt; aired on Paramount+. The series was intended to bring back fan favorites from early seasons of The Challenge (although whether the actual cast would be considered fan favorites, all-stars, or even from early seasons is debatable). An interesting final question to ask is: &lt;em&gt;what cluster would The Challenge: All Stars belong to in the UMAP space?&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;This next block of code does a lot of heavy lifting but isn’t dissimilar from the earlier parts of this post. I download the cast from Wikipedia, clean it (more marriages and nicknames), add it to the original data set to get the 0 cases, and then filter it back to just The Challenge: All Stars season.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;all_stars &amp;lt;- 
  # Take Original Data Set
  all_cast_info %&amp;gt;%
  # Add Indicators
  transmute(order, short_title, Name, ind = 1) %&amp;gt;%
  # Get the New Challenge Season
  bind_rows(
    get_cast(&amp;#39;wiki/The_Challenge:_All_Stars&amp;#39;, &amp;#39;3, 4&amp;#39;, &amp;#39;1&amp;#39;, &amp;#39;The Challenge: All Stars&amp;#39;, 99)  %&amp;gt;%
      transmute(order, short_title = title, Name, ind = 1)  %&amp;gt;%
      #Cleaning Names
      mutate(
        Name = case_when(
          Name == &amp;quot;Katie Cooley&amp;quot; ~ &amp;quot;Katie Doyle&amp;quot;,
          Name ==  &amp;#39;Eric &amp;quot;Big Easy&amp;quot; Banks&amp;#39; ~ &amp;#39;Eric Banks&amp;#39;,
          Name == &amp;#39;Teck Holmes&amp;#39; ~ &amp;#39;Tecumshea &amp;quot;Teck&amp;quot; Holmes III&amp;#39;,
          TRUE ~ Name
        )
      )
  ) %&amp;gt;% 
  # Cast to Wider
  pivot_wider(
    names_from = &amp;#39;Name&amp;#39;,
    values_from = &amp;#39;ind&amp;#39;,
    values_fill = 0
  ) %&amp;gt;% 
  # Filter back to the All Stars Season
  filter(short_title == &amp;#39;The Challenge: All Stars&amp;#39;) %&amp;gt;%
  # Removing Things that Won&amp;#39;t Be Predicted
  select(-order, -short_title)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then predicting the All Stars season in the UMAP space can be done similar to other predictions in R with the &lt;code&gt;predict&lt;/code&gt; function:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;all_stars_pred &amp;lt;- predict(ump, all_stars)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This returns a matrix with one row for the season and two columns for the UMAP x and y dimensions. It can then be visualized on top of the original UMAP projection as an annotation.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Take Original Data
h_clust_results %&amp;gt;% 
  select(order, short_title, cluster) %&amp;gt;%
  # Add in the original UMAP data
  mutate(
    dim1 = ump$layout[, 1],
    dim2 = ump$layout[, 2]
  ) %&amp;gt;%
  ggplot(aes(x = dim1, y = dim2, color = factor(cluster))) +
  #ggrepel::geom_text_repel(aes(label = short_title)) + 
  geom_text(aes(label = short_title)) +
  # Add Annotation for the Challenge All Stars Season with the predicted
  # projection.
  annotate(
    &amp;#39;label&amp;#39;,
    label = &amp;#39;The Challenge: All Stars&amp;#39;,
            x = all_stars_pred[, 1],
            y = all_stars_pred[, 2],
            color = &amp;#39;black&amp;#39;) + 
  labs(title = &amp;#39;Predicting Challenge All-Stars Onto Prior UMAP Projection&amp;#39;,
       subtitle = &amp;quot;Colors Represent Prior Clustering&amp;quot;) + 
  scale_color_discrete(guide = &amp;#39;none&amp;#39;) + 
  scale_fill_discrete(guide = &amp;#39;none&amp;#39;) +
  scale_x_continuous(expand = expansion(add = c(.3, .4))) +
  cowplot::theme_cowplot() + 
  theme(
    axis.ticks = element_blank(),
    axis.text = element_blank()
  )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2021/09/15/finding-the-eras-of-mtv-s-the-challenge-through-clustering/index_files/figure-html/viz_new_umap-1.png&#34; width=&#34;672&#34; /&gt;
It seems like Challenge All-Stars would fall within the first cluster of seasons, though it sits somewhat between that group and the “green cluster”. This could make sense, as a couple of the show’s cast members first appeared in the 23rd season (Battle of the Seasons 2).&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>$GME To The Moon: How Much of an Outlier Was Gamestop&#39;s January Rise?</title>
      <link>https://jlaw.netlify.app/2021/08/12/gme-to-the-moon-how-unexpected-was-gamestop-s-january-stock-rally/</link>
      <pubDate>Thu, 12 Aug 2021 00:00:00 +0000</pubDate>
      <guid>https://jlaw.netlify.app/2021/08/12/gme-to-the-moon-how-unexpected-was-gamestop-s-january-stock-rally/</guid>
      <description>
&lt;script src=&#34;https://jlaw.netlify.app/2021/08/12/gme-to-the-moon-how-unexpected-was-gamestop-s-january-stock-rally/index_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;div id=&#34;introduction&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;Between January 13th and January 27th, 2021, the stock price of Gamestop (&lt;a href=&#34;https://www.google.com/finance/quote/GME:NYSE&#34;&gt;GME&lt;/a&gt;) rose 10x, from $31 to $347. The rise was driven in part by the Reddit forum &lt;a href=&#34;https://www.reddit.com/r/wallstreetbets/&#34;&gt;r/wallstreetbets&lt;/a&gt;, whose users were looking to create a short squeeze and who simply “liked the stock”. The rapid rise also drew the attention of popular media such as &lt;a href=&#34;https://www.cnbc.com/2021/01/26/gamestop-shares-are-jumping-again-but-short-sellers-arent-backing-down.html&#34;&gt;CNBC&lt;/a&gt;:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;Capture.PNG&#34; /&gt;&lt;/p&gt;
&lt;p&gt;However, this post will not try to explain the mechanics of why GME rose or whether it &lt;em&gt;should&lt;/em&gt; have risen. What I will try to answer is &lt;strong&gt;“how unexpected was its rise?”&lt;/strong&gt; using an array of different forecasting tools. To assess how expected the rise in GME stock was, I’ll be using the following packages:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Anomalize&lt;/li&gt;
&lt;li&gt;Prophet&lt;/li&gt;
&lt;li&gt;Forecast (auto.arima)&lt;/li&gt;
&lt;li&gt;CausalImpact&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;From these methods we should get a good idea of just how unexpected this rise was. The approach will be to use historical price data through January 21st to predict the Gamestop stock price for the period of January 22nd through February 4th, and then to use the mean absolute percentage error (MAPE&lt;a href=&#34;#fn1&#34; class=&#34;footnote-ref&#34; id=&#34;fnref1&#34;&gt;&lt;sup&gt;1&lt;/sup&gt;&lt;/a&gt;) to quantify the amount of unexpectedness.&lt;/p&gt;
As a reminder, the MAPE is calculated as:
&lt;center&gt;
&lt;img src=&#34;mape.png&#34; style=&#34;width:25.0%&#34; /&gt;
&lt;/center&gt;
&lt;p&gt;where A is the actual and F is the forecasted value.&lt;/p&gt;
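&lt;p&gt;As a quick illustrative sketch (not part of the original analysis), the MAPE can be computed in R with a one-line helper; the &lt;code&gt;actual&lt;/code&gt; and &lt;code&gt;forecast&lt;/code&gt; vectors below are made-up values:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Illustrative helper: MAPE = mean(|(A - F) / A|)
mape &amp;lt;- function(actual, forecast) mean(abs((actual - forecast) / actual))

# Toy example with made-up numbers
mape(actual = c(100, 200), forecast = c(90, 220))
# 0.1, i.e. a 10% mean absolute percentage error&lt;/code&gt;&lt;/pre&gt;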
&lt;div id=&#34;peer-sets&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Peer Sets&lt;/h2&gt;
&lt;p&gt;While I can look at the GME time-series and know that it’s an outlier relative to its own past performance, maybe something in early January caused &lt;strong&gt;all&lt;/strong&gt; video game-related stocks to increase. The peer set that I will use as external regressors is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Nintendo (NTDOF) - &lt;em&gt;Maker of the Switch System&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;Sony (SONY) - &lt;em&gt;Maker of the Playstation System&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;Microsoft (MSFT) - &lt;em&gt;Maker of the XBox System&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;data&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Data&lt;/h1&gt;
&lt;p&gt;I’ll be using the prices of these four stocks from 1/1/2016 through 2/22/2021 for this analysis, pulled via the &lt;em&gt;tq_get()&lt;/em&gt; function from the &lt;a href=&#34;https://business-science.github.io/tidyquant/&#34;&gt;&lt;code&gt;tidyquant&lt;/code&gt;&lt;/a&gt; package.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidyquant) #Get Stock Data 
library(tidyverse) #Data Manipulation
library(lubridate) #Date Manipulation&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;### Make Data Weekly
dt &amp;lt;- tq_get(c(&amp;#39;GME&amp;#39;, &amp;#39;SONY&amp;#39;, &amp;#39;NTDOF&amp;#39;, &amp;#39;MSFT&amp;#39;),
             get=&amp;#39;stock.prices&amp;#39;,
             from = &amp;#39;2016-01-01&amp;#39;,
             to = &amp;#39;2021-02-22&amp;#39;) &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With the data pulled we can visualize each of the time-series for the four stocks. While the peer stocks all rose between 2020 and Feb 2021 it does appear that Gamestop truly “goes to the moon” above and beyond the peer stocks.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;dt %&amp;gt;% 
  filter(ymd(date) &amp;gt;= ymd(20200101)) %&amp;gt;% 
  ggplot(aes(x = date, y=close, color = symbol, group = symbol)) + 
     geom_line() + 
    geom_vline(xintercept = ymd(20210122), lty = 2, color = &amp;#39;red&amp;#39;) + 
    geom_vline(xintercept = ymd(20210204), lty = 2, color = &amp;#39;red&amp;#39;) + 
   labs(x = &amp;quot;Date&amp;quot;, y = &amp;quot;Closing Price&amp;quot;, title = &amp;quot;Gamestop&amp;#39;s Ride to the Moon &amp;amp;#128640;&amp;amp;#128640;&amp;amp;#128640;&amp;quot;,
         subtitle = &amp;quot;Fueled by &amp;lt;span style=&amp;#39;color:#ff4500&amp;#39;&amp;gt;&amp;lt;b&amp;gt;r/wallstreetbets&amp;lt;/b&amp;gt;&amp;lt;/span&amp;gt; $GME rose nearly 10x in a week&amp;quot;,
        caption = &amp;quot;&amp;lt;i&amp;gt;Prediction zone bounded by the &amp;lt;span style=&amp;#39;color:red&amp;#39;&amp;gt;red dashed&amp;lt;/span&amp;gt; lines&amp;lt;/i&amp;gt;&amp;quot;
        ) +
     scale_color_discrete(guide = &amp;#39;none&amp;#39;) +
     scale_x_date(date_breaks = &amp;quot;6 months&amp;quot;, date_labels = &amp;quot;%b %Y&amp;quot;) + 
      facet_wrap(~symbol, ncol = 1, scales = &amp;quot;free_y&amp;quot;) + 
      cowplot::theme_cowplot() + 
      theme(
        plot.title = ggtext::element_markdown(),
        plot.subtitle = ggtext::element_markdown(),
        plot.caption = ggtext::element_markdown(),
        strip.background = element_blank(),
        strip.text = ggtext::element_textbox(
          size = 12,
          color = &amp;quot;white&amp;quot;, fill = &amp;quot;#5D729D&amp;quot;, box.color = &amp;quot;#4A618C&amp;quot;,
          halign = 0.5, linetype = 1, r = unit(5, &amp;quot;pt&amp;quot;), width = unit(1, &amp;quot;npc&amp;quot;),
          padding = margin(2, 0, 1, 0), margin = margin(3, 3, 3, 3)
        )
      )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2021/08/12/gme-to-the-moon-how-unexpected-was-gamestop-s-january-stock-rally/index_files/figure-html/plot_stock-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;anomalize&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Anomalize&lt;/h1&gt;
&lt;p&gt;&lt;a href=&#34;https://github.com/business-science/anomalize&#34;&gt;&lt;code&gt;anomalize&lt;/code&gt;&lt;/a&gt; is a package developed by &lt;a href=&#34;https://www.business-science.io/&#34;&gt;Business Science&lt;/a&gt; to enable tidy anomaly detection. This package has three primary functions:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;code&gt;time_decompose()&lt;/code&gt; - which separates the data into its components&lt;/li&gt;
&lt;li&gt;&lt;code&gt;anomalize()&lt;/code&gt; - which runs anomaly detection on the remainder component&lt;/li&gt;
&lt;li&gt;&lt;code&gt;time_recompose()&lt;/code&gt; - recomposes the data to create limits around the “normal” data.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The package also provides two options for calculating the remainders, STL and Twitter. The STL method does seasonal decomposition through loess, while the Twitter method does seasonal decomposition through medians. Additionally, there are two options for calculating the anomalies from the remainders, IQR and GESD.&lt;/p&gt;
&lt;p&gt;As for which methods to choose, a talk from &lt;a href=&#34;https://www.youtube.com/watch?v=n9GOvto69aQ&amp;amp;t=6s&#34;&gt;Catherine Zhou&lt;/a&gt; summarizes the choice as:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Twitter + GESD is better for highly seasonal data&lt;/li&gt;
&lt;li&gt;STL + IQR is better if seasonality isn’t a factor.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;More details on these methods are available in the &lt;a href=&#34;https://cran.r-project.org/web/packages/anomalize/vignettes/anomalize_methods.html&#34;&gt;anomalize methods&lt;/a&gt; vignettes.&lt;/p&gt;
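&lt;p&gt;For reference, swapping in the other combination is just a matter of the &lt;code&gt;method&lt;/code&gt; arguments. A hypothetical sketch (assuming the &lt;code&gt;dt&lt;/code&gt; data frame of prices pulled above and the &lt;code&gt;anomalize&lt;/code&gt; package loaded) would look like:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Hypothetical: Twitter decomposition + GESD anomaly detection
dt %&amp;gt;%
  filter(symbol == &amp;#39;GME&amp;#39;) %&amp;gt;%
  time_decompose(close, method = &amp;#39;twitter&amp;#39;) %&amp;gt;%
  anomalize(remainder, method = &amp;#39;gesd&amp;#39;) %&amp;gt;%
  time_recompose()&lt;/code&gt;&lt;/pre&gt;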
&lt;p&gt;Since all of these stocks benefit from increases in holiday sales, I’ll use &lt;strong&gt;STL + IQR&lt;/strong&gt;. Unfortunately, &lt;code&gt;anomalize&lt;/code&gt; (to my knowledge) cannot handle covariates, so I’ll only check for anomalies in the Gamestop stock here, although I’ll add the other regressors with the other packages.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(anomalize)

anomalize_dt &amp;lt;- dt %&amp;gt;%
  filter(symbol == &amp;#39;GME&amp;#39;) %&amp;gt;% 
  # Merge keeps all of the original data in the decomposition
  time_decompose(close, method = &amp;#39;stl&amp;#39;, merge = T, trend = &amp;quot;1 year&amp;quot;) %&amp;gt;% 
  anomalize(remainder, method = &amp;quot;iqr&amp;quot;) %&amp;gt;% 
  time_recompose() %&amp;gt;% 
  filter(between(date, ymd(20210122), ymd(20210204)))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Looking at our prediction window returns:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;predictions_anomalize &amp;lt;- anomalize_dt %&amp;gt;% 
  transmute(date, actual = close, predicted = trend + season, 
            normal_lower = recomposed_l1, normal_upper = recomposed_l2, 
            residual = remainder, anomaly)


knitr::kable(predictions_anomalize, digits = 2)&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;date&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;actual&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;predicted&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;normal_lower&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;normal_upper&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;residual&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;anomaly&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-01-22&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;65.01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;16.97&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;10.08&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;23.68&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;48.04&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-01-25&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;76.79&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;17.06&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;10.17&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;23.77&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;59.73&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-01-26&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;147.98&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;17.15&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;10.26&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;23.86&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;130.83&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-01-27&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;347.51&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;17.27&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;10.37&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;23.98&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;330.24&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-01-28&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;193.60&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;17.36&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;10.47&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;24.07&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;176.24&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-01-29&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;325.00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;17.44&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;10.54&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;24.15&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;307.56&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-02-01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;225.00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;17.53&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;10.64&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;24.24&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;207.47&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-02-02&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;90.00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;17.62&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;10.73&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;24.33&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;72.38&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-02-03&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;92.41&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;17.74&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;10.84&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;24.45&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;74.67&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-02-04&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;53.50&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;17.83&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;10.94&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;24.54&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;35.67&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;So &lt;code&gt;anomalize&lt;/code&gt; identified every date in the window as an anomaly versus what was expected. From these predictions the MAPE comes out to 84.09%, which means that only about 16% of Gamestop’s stock movement was predicted.&lt;/p&gt;
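&lt;p&gt;As a sketch, that MAPE figure can be computed from the &lt;code&gt;actual&lt;/code&gt; and &lt;code&gt;predicted&lt;/code&gt; columns created above along these lines:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Sketch: MAPE over the prediction window
predictions_anomalize %&amp;gt;%
  summarize(mape = mean(abs((actual - predicted) / actual)))&lt;/code&gt;&lt;/pre&gt;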
&lt;/div&gt;
&lt;div id=&#34;prophet&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Prophet&lt;/h1&gt;
&lt;p&gt;&lt;a href=&#34;https://facebook.github.io/prophet/&#34;&gt;&lt;code&gt;Prophet&lt;/code&gt;&lt;/a&gt; is a forecasting library developed by Facebook. To calculate the MAPE, I will fit the prophet model to the data before the prediction period and then predict on the data in the prediction period. Prophet does allow for additional regressors, so I will run two versions of the model: the first on the Gamestop time series alone, and the second bringing in the Sony, Nintendo, and Microsoft regressors.&lt;/p&gt;
&lt;div id=&#34;data-processing&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Data Processing&lt;/h2&gt;
&lt;p&gt;Currently, the data is in a tidy format where each symbol appears in a separate row. In order to use the symbols in prophet (and in future packages), I need the data in a format where each row is a date and the symbols are separate columns. Additionally, to be used in prophet, the data must have a &lt;code&gt;ds&lt;/code&gt; column for the date and a &lt;code&gt;y&lt;/code&gt; column for the time series being projected. The following code block splits the data into the pre-period and the prediction period, as well as renaming the GME series to &lt;code&gt;y&lt;/code&gt; and date to &lt;code&gt;ds&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;prep_data &amp;lt;- dt %&amp;gt;% 
  select(date, symbol, close) %&amp;gt;% 
  pivot_wider(names_from = &amp;#39;symbol&amp;#39;, values_from = &amp;#39;close&amp;#39;) %&amp;gt;% 
  rename(y = GME, ds = date)

pre &amp;lt;- prep_data %&amp;gt;% filter(ds &amp;lt;= ymd(20210121))
pred &amp;lt;- prep_data %&amp;gt;% filter(between(ds, ymd(20210122), ymd(20210204)))&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;model-1-only-the-gamestop-time-series&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Model 1: Only the Gamestop Time Series&lt;/h2&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(prophet)

#Build the Model
model_no_regressors &amp;lt;- prophet(pre)
#Predict on the Future Data
model_no_regressors_pred &amp;lt;- predict(model_no_regressors, pred)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can look at the predicted results and the residuals by joining the actual data back to the predicted data:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;predictions_prophet_no_reg &amp;lt;- model_no_regressors_pred %&amp;gt;% 
  inner_join(pred %&amp;gt;% select(ds, y), by = &amp;quot;ds&amp;quot;) %&amp;gt;% 
  transmute(ds, actual = y, predicted = yhat, lower = yhat_lower, 
            upper = yhat_upper, residual = y-yhat)

knitr::kable(predictions_prophet_no_reg, digits = 2)&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;ds&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;actual&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;predicted&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;lower&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;upper&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;residual&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-01-22&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;65.01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;17.74&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;14.91&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;20.59&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;47.27&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-01-25&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;76.79&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;17.31&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;14.56&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;20.15&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;59.48&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-01-26&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;147.98&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;17.14&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;14.13&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;19.90&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;130.84&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-01-27&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;347.51&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;17.03&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;14.29&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;19.93&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;330.48&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-01-28&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;193.60&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;16.89&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;13.89&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;19.31&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;176.71&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-01-29&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;325.00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;16.55&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;13.99&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;19.29&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;308.45&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-02-01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;225.00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;16.24&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;13.43&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;19.01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;208.76&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-02-02&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;90.00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;16.16&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;13.66&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;18.90&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;73.84&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-02-03&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;92.41&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;16.14&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;13.46&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;18.83&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;76.27&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-02-04&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;53.50&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;16.11&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;13.47&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;18.79&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;37.39&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;From this I can calculate the MAPE as 84.71%, again indicating that only about 15% of the movement was “expected”.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;model-2-gamestop-regressors&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Model 2: Gamestop + Regressors&lt;/h2&gt;
&lt;p&gt;To run a prophet model with regressors, the syntax is a little different: rather than pass a dataset into the &lt;code&gt;prophet()&lt;/code&gt; function, I’ll need to start with an empty &lt;code&gt;prophet()&lt;/code&gt; model, add the regressors, and then pass the data into the &lt;code&gt;fit.prophet()&lt;/code&gt; function to actually fit the model.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Initialize Model
prophet_reg &amp;lt;- prophet()

#Add Regressors
prophet_reg &amp;lt;- add_regressor(prophet_reg, &amp;#39;MSFT&amp;#39;)
prophet_reg &amp;lt;- add_regressor(prophet_reg, &amp;#39;SONY&amp;#39;)
prophet_reg &amp;lt;- add_regressor(prophet_reg, &amp;#39;NTDOF&amp;#39;)

#Fit Model
prophet_reg &amp;lt;- fit.prophet(prophet_reg, pre)

# Predict on Future Data
prophet_reg_pred &amp;lt;- predict(prophet_reg, pred)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then looking at the predictions:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;predictions_prophet_reg &amp;lt;- prophet_reg_pred %&amp;gt;% 
  inner_join(pred %&amp;gt;% select(ds, y), by = &amp;quot;ds&amp;quot;) %&amp;gt;% 
  transmute(ds, actual = y, predicted = yhat, lower = yhat_lower, 
            upper = yhat_upper, residual = y-yhat)

knitr::kable(predictions_prophet_reg, digits = 2)&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;ds&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;actual&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;predicted&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;lower&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;upper&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;residual&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-01-22&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;65.01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;20.32&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;17.95&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;22.78&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;44.69&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-01-25&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;76.79&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;19.17&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;16.66&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;21.71&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;57.62&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-01-26&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;147.98&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;18.95&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;16.22&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;21.39&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;129.03&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-01-27&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;347.51&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;18.07&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;15.52&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;20.69&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;329.44&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-01-28&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;193.60&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;17.47&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;14.93&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;19.74&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;176.13&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-01-29&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;325.00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;17.61&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;15.09&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;19.98&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;307.39&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-02-01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;225.00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;17.22&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;14.67&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;19.65&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;207.78&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-02-02&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;90.00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;17.61&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;14.97&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;20.02&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;72.39&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-02-03&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;92.41&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;21.18&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;18.40&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;23.63&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;71.23&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-02-04&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;53.50&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;21.25&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;18.75&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;23.61&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;32.25&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;which gives us a MAPE of 82.14%. The addition of the external regressors makes the forecast errors &lt;em&gt;slightly&lt;/em&gt; lower; now about 18% of the movement was expected.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;auto.arima&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Auto.Arima&lt;/h1&gt;
&lt;p&gt;&lt;a href=&#34;https://otexts.com/fpp2/arima-r.html&#34;&gt;&lt;code&gt;auto.arima()&lt;/code&gt;&lt;/a&gt; is a function within the &lt;code&gt;forecast&lt;/code&gt; package that algorithmically determines the proper specification for an ARIMA (auto-regressive integrated moving average) model. The basic version of auto.arima fits a univariate series, which I will do first; then I’ll add external regressors similar to what was done with Prophet.&lt;/p&gt;
&lt;div id=&#34;model-1-only-gamestop-time-series&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Model 1: Only Gamestop Time Series&lt;/h2&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(forecast)

# Fit auto arima model
auto_arima_model &amp;lt;- auto.arima(pre$y)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The function returns an ARIMA(1, 2, 2) model. The &lt;code&gt;forecast()&lt;/code&gt; function is then used to project the model into the future.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Forecast 10 Periods Ahead
auto_arima_pred &amp;lt;- forecast(auto_arima_model, 10)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then, as with the earlier models, I can look at the predictions vs. the actuals. The forecast object is a list from which I can pull the point forecast from the “mean” item and the prediction bounds from &lt;em&gt;lower&lt;/em&gt; and &lt;em&gt;upper&lt;/em&gt;. The list contains intervals for both 80% and 95%, so the &lt;code&gt;[, 2]&lt;/code&gt; pulls the 95% intervals.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;predictions_auto_arima &amp;lt;- pred %&amp;gt;% 
  bind_cols(
    tibble(
      predicted = auto_arima_pred$mean %&amp;gt;% as.numeric(),
      lower = auto_arima_pred$lower[, 2] %&amp;gt;% as.numeric(),
      upper = auto_arima_pred$upper[, 2] %&amp;gt;% as.numeric()
    )
  ) %&amp;gt;% 
  transmute(
    ds, actual = y, predicted, lower, upper, residuals = y - predicted
  )
  
knitr::kable(predictions_auto_arima, digits = 2)&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;ds&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;actual&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;predicted&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;lower&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;upper&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;residuals&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-01-22&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;65.01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;43.71&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;42.35&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;45.07&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;21.30&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-01-25&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;76.79&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;44.37&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;42.41&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;46.32&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;32.42&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-01-26&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;147.98&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;45.04&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;42.62&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;47.47&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;102.94&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-01-27&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;347.51&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;45.70&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;42.87&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;48.54&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;301.81&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-01-28&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;193.60&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;46.38&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;43.17&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;49.59&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;147.22&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-01-29&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;325.00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;47.04&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;43.48&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;50.60&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;277.96&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-02-01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;225.00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;47.72&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;43.82&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;51.61&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;177.28&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-02-02&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;90.00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;48.37&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;44.16&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;52.59&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;41.63&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-02-03&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;92.41&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;49.05&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;44.53&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;53.57&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;43.36&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-02-04&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;53.50&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;49.71&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;44.89&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;54.53&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;3.79&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;This gives a MAPE of 57.20%, which is much better than the prior methods.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;adding-in-external-regressors&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Adding in External Regressors&lt;/h2&gt;
&lt;p&gt;&lt;code&gt;auto.arima&lt;/code&gt; can also take into account external regressors through the &lt;code&gt;xreg&lt;/code&gt; parameter. It’s a little trickier to implement since the regressors need to be supplied as a matrix. But as usual, &lt;a href=&#34;https://stats.stackexchange.com/questions/41070/how-to-setup-xreg-argument-in-auto-arima-in-r&#34;&gt;StackOverflow&lt;/a&gt; comes through with a solution. In this case it’s from the package author himself!&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Create Matrix of External Regressors
xreg &amp;lt;- model.matrix(~ SONY + NTDOF + MSFT - 1, data = pre)
# Fit ARIMA Model
auto_arima_reg &amp;lt;- auto.arima(pre$y, xreg = xreg)

# Create Matrix of External Regressors for Forecasting
xreg_pred &amp;lt;- model.matrix(~ SONY + NTDOF + MSFT - 1, data = pred)
# Forecast with External Regressors
auto_arima_reg_fcst &amp;lt;- forecast(auto_arima_reg, h = 10, xreg = xreg_pred)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;predictions_auto_arima_reg &amp;lt;- pred %&amp;gt;% 
  bind_cols(
    tibble(
      predicted = auto_arima_reg_fcst$mean %&amp;gt;% as.numeric(),
      lower = auto_arima_reg_fcst$lower[, 2] %&amp;gt;% as.numeric(),
      upper = auto_arima_reg_fcst$upper[, 2] %&amp;gt;% as.numeric()
    )
  ) %&amp;gt;% 
  transmute(
    ds, actual = y, predicted, lower, upper, residuals = y - predicted
  )
  
knitr::kable(predictions_auto_arima_reg, digits = 2)&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;ds&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;actual&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;predicted&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;lower&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;upper&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;residuals&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-01-22&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;65.01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;43.70&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;42.34&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;45.06&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;21.31&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-01-25&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;76.79&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;44.37&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;42.42&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;46.32&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;32.42&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-01-26&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;147.98&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;45.10&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;42.69&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;47.52&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;102.88&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-01-27&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;347.51&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;45.69&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;42.86&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;48.52&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;301.82&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-01-28&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;193.60&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;46.51&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;43.31&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;49.71&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;147.09&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-01-29&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;325.00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;46.96&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;43.41&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;50.51&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;278.04&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-02-01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;225.00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;47.88&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;44.00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;51.76&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;177.12&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-02-02&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;90.00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;48.53&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;44.33&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;52.73&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;41.47&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-02-03&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;92.41&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;49.55&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;45.04&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;54.06&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;42.86&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-02-04&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;53.50&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;50.17&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;45.36&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;54.99&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;3.33&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;This gives a MAPE of 57.03%. Again the addition of external regressors only makes things &lt;em&gt;slightly&lt;/em&gt; better.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;causalimpact&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;CausalImpact&lt;/h1&gt;
&lt;p&gt;&lt;a href=&#34;https://github.com/google/CausalImpact&#34;&gt;&lt;code&gt;CausalImpact&lt;/code&gt;&lt;/a&gt; is a package developed by Google to measure the causal impact of an intervention on a time series. The package uses a Bayesian Structural Time-Series model to estimate a counterfactual of how a response would have evolved without the intervention. This package works by comparing a time-series of interest to a set of control time series and uses the relationships pre-intervention to predict the counterfactual.&lt;/p&gt;
&lt;p&gt;CausalImpact also requires some data preparation, as it takes a &lt;code&gt;zoo&lt;/code&gt; object as input. But I can largely leverage the &lt;code&gt;prep_data&lt;/code&gt; data set created in the prophet section, since CausalImpact only requires that the field of interest be in the first column. The construction of the &lt;code&gt;zoo&lt;/code&gt; object takes the data and the date index as its two parameters.&lt;/p&gt;
&lt;p&gt;Then, to run the causal impact analysis, I pass in the &lt;code&gt;zoo&lt;/code&gt; data set and specify the pre-period and the post-period. The &lt;em&gt;model.args&lt;/em&gt; option of &lt;code&gt;model.args = list(nseasons = 5, season.duration = 1)&lt;/code&gt; adds day-of-week seasonality by specifying that there are 5 periods in each seasonal cycle and that each data point represents 1 period of a season. As another example, to add day-of-week seasonality to data with hourly granularity, I would specify &lt;code&gt;nseasons=7&lt;/code&gt; and &lt;code&gt;season.duration=24&lt;/code&gt; to say that there are 7 periods in a season and 24 data points in a period.&lt;/p&gt;
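&lt;p&gt;As a minimal sketch of that hourly example (hypothetical: &lt;code&gt;hourly_zoo&lt;/code&gt;, &lt;code&gt;pre_hours&lt;/code&gt;, and &lt;code&gt;post_hours&lt;/code&gt; are placeholder objects, not data from this post):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;## Hypothetical: day-of-week seasonality on hourly data
## 7 seasons (days of the week), each spanning 24 hourly data points
ci_hourly &amp;lt;- CausalImpact(hourly_zoo,
                          pre.period = pre_hours,
                          post.period = post_hours,
                          model.args = list(nseasons = 7, season.duration = 24))&lt;/code&gt;&lt;/pre&gt;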
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(CausalImpact)

#Create Zoo Object
dt_ci &amp;lt;- zoo(prep_data %&amp;gt;% dplyr::select(-ds), prep_data$ds)

#Run Causal Impact
ci &amp;lt;- CausalImpact(dt_ci, 
                   pre.period = c(as.Date(&amp;#39;2020-05-03&amp;#39;), as.Date(&amp;#39;2021-01-21&amp;#39;)),
                   post.period = c(as.Date(&amp;#39;2021-01-22&amp;#39;), as.Date(&amp;#39;2021-02-04&amp;#39;)),
                   model.args = list(nseasons = 5, season.duration = 1)
                   )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To get the information about the predictions, I can pull them out of the &lt;em&gt;series&lt;/em&gt; attribute of the &lt;code&gt;ci&lt;/code&gt; object. While not used in this analysis, the &lt;code&gt;summary()&lt;/code&gt; and &lt;code&gt;plot()&lt;/code&gt; functions are very useful, and the &lt;code&gt;summary(ci, &#34;report&#34;)&lt;/code&gt; option is interesting in that it gives a full paragraph description of the results.&lt;/p&gt;
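&lt;p&gt;For reference, a quick sketch of those inspection helpers on the fitted &lt;code&gt;ci&lt;/code&gt; object (output omitted):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;summary(ci)             # table of average and cumulative effects
summary(ci, &amp;quot;report&amp;quot;)   # full paragraph description of the results
plot(ci)                # observed vs. counterfactual, pointwise, and cumulative panels&lt;/code&gt;&lt;/pre&gt;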
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;predictions_causal_inference &amp;lt;- ci$series %&amp;gt;% 
  as_tibble(rownames = &amp;#39;ds&amp;#39;) %&amp;gt;% 
  filter(between(ymd(ds), ymd(20210122), ymd(20210204))) %&amp;gt;% 
  transmute(ds, actual = response, predicted = point.pred, 
            lower = point.pred.lower, upper = point.pred.upper, 
            residual = point.effect)
  
knitr::kable(predictions_causal_inference, digits = 2)&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;ds&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;actual&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;predicted&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;lower&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;upper&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;residual&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-01-22&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;65.01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;39.17&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;34.08&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;44.21&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;25.84&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-01-25&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;76.79&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;38.40&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;32.58&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;43.85&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;38.39&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-01-26&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;147.98&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;38.71&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;33.25&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;44.76&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;109.27&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-01-27&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;347.51&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;38.58&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;32.86&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;44.85&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;308.93&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-01-28&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;193.60&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;39.06&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;33.24&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;45.83&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;154.54&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-01-29&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;325.00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;39.21&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;32.71&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;45.93&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;285.79&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-02-01&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;225.00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;38.29&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;32.08&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;45.04&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;186.71&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-02-02&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;90.00&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;38.60&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;31.59&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;45.92&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;51.40&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-02-03&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;92.41&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;38.40&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;31.52&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;45.83&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;54.01&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2021-02-04&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;53.50&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;38.96&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;31.11&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;46.97&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;14.54&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;This would give us a MAPE of 64.60%, which is between the &lt;code&gt;auto.arima&lt;/code&gt; models and the other methods.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;conclusion&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Conclusion&lt;/h1&gt;
&lt;p&gt;This post looked at six different mechanisms to forecast what Gamestop’s stock price would be during the period when it spiked. Bringing all of the projections together with the actuals gives us:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;all_combined &amp;lt;- bind_rows(
  #Actuals
  dt %&amp;gt;% filter(symbol == &amp;#39;GME&amp;#39;) %&amp;gt;% 
    transmute(ds = ymd(date), lbl = &amp;#39;actuals&amp;#39;, y = close),
  #Anomalize
  predictions_anomalize %&amp;gt;% 
    transmute(ds = ymd(date), lbl = &amp;quot;Anomalize&amp;quot;, y = predicted),
  #Prophet (No Regressors)
  predictions_prophet_no_reg %&amp;gt;% 
    transmute(ds = ymd(ds), lbl = &amp;quot;Prophet (No Regressors)&amp;quot;, y = predicted),
  #Prophet (w/ Regressors)
  predictions_prophet_reg %&amp;gt;% 
    transmute(ds = ymd(ds), lbl = &amp;quot;Prophet (w/ Regressors)&amp;quot;, y = predicted),
  #Auto.Arima (No Regressors)
  predictions_auto_arima %&amp;gt;% 
    transmute(ds = ymd(ds), lbl = &amp;quot;Auto.Arima (No Regressors)&amp;quot;, y = predicted),
  #Auto.Arima (w/ Regressors)
  predictions_auto_arima_reg %&amp;gt;% 
    transmute(ds = ymd(ds), lbl = &amp;quot;Auto.Arima (w/ Regressors)&amp;quot;, y = predicted),
  #Causal Inference
  predictions_causal_inference %&amp;gt;% 
    transmute(ds = ymd(ds), lbl = &amp;quot;CausalImpact&amp;quot;, y = predicted)
) 

all_combined %&amp;gt;%
  filter(ds &amp;gt;= &amp;#39;2021-01-18&amp;#39; &amp;amp; ds &amp;lt;= &amp;#39;2021-02-04&amp;#39;) %&amp;gt;% 
  ggplot(aes(x = ds, y = y, color = lbl)) + 
    geom_line() + 
    geom_vline(xintercept = ymd(20210122), lty = 2, color = &amp;#39;darkred&amp;#39;) + 
    geom_vline(xintercept = ymd(20210204), lty = 2, color = &amp;#39;darkred&amp;#39;) + 
    labs(title = &amp;quot;Comparing GME Price Projections 1/22/21 - 2/4/21&amp;quot;,
         x = &amp;quot;Date&amp;quot;,
         y = &amp;quot;GME Closing Price ($)&amp;quot;,
         color = &amp;quot;&amp;quot;) + 
    scale_x_date(date_breaks = &amp;quot;2 days&amp;quot;, date_labels = &amp;quot;%b %d&amp;quot;) + 
    scale_y_log10() + 
    scale_color_manual(values = wesanderson::wes_palette(&amp;quot;Zissou1&amp;quot;, 
                                                       n = 7,
                                                       type = &amp;#39;continuous&amp;#39;)) +
    cowplot::theme_cowplot() + 
    theme(
      legend.direction = &amp;#39;horizontal&amp;#39;,
      legend.position = &amp;#39;bottom&amp;#39;
    ) + 
    guides(color=guide_legend(nrow=3,byrow=TRUE))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2021/08/12/gme-to-the-moon-how-unexpected-was-gamestop-s-january-stock-rally/index_files/figure-html/combined_plots-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Looking at all the projections together, it’s clear that no forecasting method really saw the massive spike in price coming. The Auto.Arima method looks like it comes closest, but potentially more because it started from the highest point than because its forecast was particularly responsive.&lt;/p&gt;
&lt;p&gt;Looking just at January 27th, the peak of the spike, gives the clearest perspective on the difference between the actual and all of the projections:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;all_combined %&amp;gt;% 
  filter(ds == &amp;#39;2021-01-27&amp;#39;) %&amp;gt;% 
  ggplot(aes(x = fct_reorder(lbl, y), y = y, fill = lbl)) + 
    geom_col() + 
    geom_text(aes(label = y %&amp;gt;% scales::dollar(),
                  hjust = (y &amp;gt;= 300))) + 
    labs(x = &amp;quot;Projection Method&amp;quot;,
         y = &amp;quot;GME Closing Price on Jan 27&amp;quot;,
         title = &amp;quot;Looking at the Peak of the Spike&amp;quot;,
         subtitle = &amp;quot;Gamestop Closing Price on January 27, 2021&amp;quot;,
         fill = &amp;quot;&amp;quot;) +
   scale_fill_manual(guide = F, 
                     values = wesanderson::wes_palette(&amp;quot;Zissou1&amp;quot;, 
                                                       n = 7,
                                                       type = &amp;#39;continuous&amp;#39;)) + 
   scale_y_continuous(label = scales::dollar) +
   coord_flip() + 
   cowplot::theme_cowplot()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2021/08/12/gme-to-the-moon-how-unexpected-was-gamestop-s-january-stock-rally/index_files/figure-html/jan27only-1.png&#34; width=&#34;672&#34; /&gt;
No methodology really comes within $300 of the actual price. To quantify just &lt;em&gt;how&lt;/em&gt; unexpected Gamestop’s rise was, I’ll look at the MAPEs for all the forecasting methods.&lt;/p&gt;
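&lt;p&gt;As a refresher, MAPE (mean absolute percentage error) can be computed by hand to sanity-check the &lt;code&gt;yardstick&lt;/code&gt; version; this sketch uses made-up numbers rather than data from this post:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;actual    &amp;lt;- c(100, 200, 400)
predicted &amp;lt;- c(110, 150, 100)

# mean of |actual - predicted| / |actual|, scaled to a percentage
mean(abs((actual - predicted) / actual)) * 100
## ~36.67, matching yardstick::mape() on the same columns&lt;/code&gt;&lt;/pre&gt;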
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;format_mape &amp;lt;- function(dt, method){
 return(
   dt %&amp;gt;% 
    yardstick::mape(actual, predicted) %&amp;gt;% 
    transmute(Method = method, MAPE = .estimate %&amp;gt;% scales::percent(scale = 1, accuracy = .01))
 )
}

bind_rows(
  #Anomalize
  format_mape(predictions_anomalize, &amp;quot;Anomalize&amp;quot;),
  #Prophet (No Regressors)
  format_mape(predictions_prophet_no_reg, &amp;quot;Prophet (No Regressors)&amp;quot;),
  #Prophet (w/ Regressors)
  format_mape(predictions_prophet_reg, &amp;quot;Prophet (w/ Regressors)&amp;quot;), 
  #Auto.Arima
  format_mape(predictions_auto_arima, &amp;quot;Auto.Arima (No Regressors)&amp;quot;), 
  #Auto.Arima (w/ Regressors)
  format_mape(predictions_auto_arima_reg, &amp;quot;Auto.Arima (w/ Regressors)&amp;quot;), 
  #Causal Inference
  format_mape(predictions_causal_inference, &amp;quot;CausalImpact&amp;quot;)
) %&amp;gt;% 
  knitr::kable(align = c(&amp;#39;l&amp;#39;, &amp;#39;r&amp;#39;))&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;Method&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;MAPE&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Anomalize&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;84.09%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Prophet (No Regressors)&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;84.71%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Prophet (w/ Regressors)&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;82.14%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Auto.Arima (No Regressors)&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;57.20%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Auto.Arima (w/ Regressors)&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;57.03%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;CausalImpact&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;64.60%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Using the MAPE as the measure of “unexpectedness”, I would conclude that this outcome was 57% to 85% unexpected (although much of the accuracy comes less from the models doing a good job of predicting the spike and more from the models staying flat while the stock price came back down). So despite a small rise before the projection period, it’s clear that Gamestop’s meteoric rise and subsequent fall was a very unexpected event.&lt;/p&gt;
&lt;/div&gt;
&lt;div class=&#34;footnotes&#34;&gt;
&lt;hr /&gt;
&lt;ol&gt;
&lt;li id=&#34;fn1&#34;&gt;&lt;p&gt;Practically, the MAPE function is being calculated using the &lt;code&gt;yardstick&lt;/code&gt; package where the format is &lt;code&gt;yardstick::mape(truth, estimate)&lt;/code&gt; where truth and estimate are the columns for the actual and predicted values.&lt;a href=&#34;#fnref1&#34; class=&#34;footnote-back&#34;&gt;↩︎&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>How to not have Plot.ly Inflate Hugo&#39;s Reading Time</title>
      <link>https://jlaw.netlify.app/2021/07/26/how-to-not-have-plot-ly-inflate-hugo-s-reading-time/</link>
      <pubDate>Mon, 26 Jul 2021 00:00:00 +0000</pubDate>
      <guid>https://jlaw.netlify.app/2021/07/26/how-to-not-have-plot-ly-inflate-hugo-s-reading-time/</guid>
      <description>
&lt;script src=&#34;https://jlaw.netlify.app/2021/07/26/how-to-not-have-plot-ly-inflate-hugo-s-reading-time/index_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;p&gt;I’m a big proponent of enabling the reading-time option on this blog, which uses Hugo’s academic theme. I always appreciate seeing it on other blogs so I know how much time to invest in a post. I also like it because it’s a feedback mechanism for me to try to write more concisely. But too long a reading time at the top of a post can deter people from reading it at all.&lt;/p&gt;
&lt;p&gt;While writing the recap post for &lt;a href=&#34;https://jlaw.netlify.app/2021/07/14/celebrating-the-blog-s-first-birthday-with-googleanalyticsr/&#34;&gt;this blog’s 1 year anniversary&lt;/a&gt;, I noticed that when I first generated the post using plot.ly for an interactive chart, the reading time ballooned up to 98 minutes from the 13 it was supposed to be.&lt;/p&gt;
&lt;p&gt;Turning to “Dr. Google” I didn’t find any immediate solutions for getting the reading time to be more tractable. However, I did figure out a small “hack” within RMarkdown to provide the same end output to the blog, but without the increase in reading time.&lt;/p&gt;
&lt;p&gt;This post will show:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;That this happens&lt;/li&gt;
&lt;li&gt;Why this happens&lt;/li&gt;
&lt;li&gt;And a way to continue to use plot.ly from RMarkdown without having it balloon the post’s reading time.&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidyverse)
library(plotly)&lt;/code&gt;&lt;/pre&gt;
&lt;div id=&#34;what-is-happening&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;What is Happening?&lt;/h2&gt;
&lt;p&gt;When rendering an RMarkdown file to Hugo, using a &lt;code&gt;plot.ly&lt;/code&gt; chart that includes categorical data will cause the article’s reading time to balloon. At least it will when there are many points with categorical data. For this trivial example, I’ll see which character from Friends had the most lines throughout the run of the show. Apparently this is available in a &lt;code&gt;friends&lt;/code&gt; R package… because everything is available in an R package!!&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;p &amp;lt;- friends::friends %&amp;gt;%
  filter(!is.na(speaker)) %&amp;gt;% 
  #Creating Running Season and Episode Indicator
  inner_join(
    friends::friends %&amp;gt;% 
      distinct(season, episode) %&amp;gt;%
      arrange(season, episode) %&amp;gt;%
      mutate(episode_num = row_number()),
    by = c(&amp;#39;season&amp;#39;, &amp;#39;episode&amp;#39;)
  ) %&amp;gt;%
  #Summarize By Character
  count(episode_num, speaker, name = &amp;quot;lines&amp;quot;) %&amp;gt;%
  group_by(speaker) %&amp;gt;% 
  arrange(episode_num) %&amp;gt;%
  mutate(total_lines = cumsum(lines),
         max_lines = max(total_lines)) %&amp;gt;%
  ungroup() %&amp;gt;%
  #Keep Top 20
  mutate(rnk = dense_rank(-max_lines)) %&amp;gt;%
  filter(rnk &amp;lt;= 20) %&amp;gt;% 
  ggplot(aes(x = episode_num, y = total_lines, color = speaker)) + 
    geom_line() + 
    labs(x = &amp;quot;Episode # of Friends&amp;quot;,
         y = &amp;quot;Number of Lines&amp;quot;,
         title = &amp;quot;Cumulative Number of Lines Spoken by Characters on Friends&amp;quot;) + 
    cowplot::theme_cowplot() + 
    theme(legend.position=&amp;#39;none&amp;#39;,
          plot.title = element_text(size = 14)) 

ggplotly(p)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;iframe src=&#34;p1.html&#34; scrolling=&#34;no&#34; seamless=&#34;seamless&#34; frameBorder=&#34;0&#34; height=&#34;400&#34; width=&#34;800&#34;&gt;&lt;/iframe&gt;&lt;/p&gt;
&lt;p&gt;But &lt;strong&gt;WTF&lt;/strong&gt;… when I render this page I see that the Reading Time is 60 minutes!!! For this article to this point!! Insanity.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;wtf.PNG&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;why-is-this-happening&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Why is this happening?&lt;/h2&gt;
&lt;p&gt;The TL;DR of what’s going on is that plot.ly embeds all of the data from the chart directly into the page source. So if we view the page source we’ll see elements for every point of the data:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;plotly_data.PNG&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Then (I believe) Hugo misinterprets aspects of this data as additional word count, and that’s how an article that should take only a few minutes to read becomes closer to an hour.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;how-to-get-around-this&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;How to get around this?&lt;/h2&gt;
&lt;p&gt;In my post, I worked around this by:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Displaying the code I wanted to show with an &lt;code&gt;eval=FALSE&lt;/code&gt; option on the code chunk to not actually render the plot.ly chart but show the code that &lt;strong&gt;WOULD&lt;/strong&gt; render it.&lt;/li&gt;
&lt;li&gt;Having a 2nd code block that’s nearly identical, with an &lt;code&gt;echo=FALSE&lt;/code&gt; option on the code chunk to hide the code that is actually run. This code chunk should &lt;strong&gt;also&lt;/strong&gt; save the plot.ly widget as a self-contained file using something like &lt;code&gt;htmlwidgets::saveWidget(p1, file=&#34;p1.html&#34;, selfcontained = T)&lt;/code&gt;, where &lt;em&gt;p1&lt;/em&gt; is the &lt;code&gt;ggplotly()&lt;/code&gt; object and &lt;em&gt;p1.html&lt;/em&gt; is the output file.&lt;/li&gt;
&lt;li&gt;Have a 3rd code chunk with &lt;code&gt;echo=FALSE&lt;/code&gt; to create an iframe tag that will contain the HTML file created in step 2. This is done with &lt;code&gt;htmltools::tags$iframe(src = &#34;p1.html&#34;)&lt;/code&gt; and some other options.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;To show this in action (although in this example I’ll display all 3 code blocks):&lt;/p&gt;
&lt;div id=&#34;code-block-1-the-code-you-want-to-display&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Code Block 1: The Code You Want To Display&lt;/h3&gt;
&lt;p&gt;This is a repeat of the code from above, which has &lt;code&gt;eval=FALSE&lt;/code&gt; so it’s shown but not run:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;p &amp;lt;- friends::friends %&amp;gt;% 
  filter(!is.na(speaker)) %&amp;gt;% 
  #Creating Running Season and Episode Indicator
  inner_join(
    friends::friends %&amp;gt;% 
      distinct(season, episode) %&amp;gt;%
      arrange(season, episode) %&amp;gt;%
      mutate(episode_num = row_number()),
    by = c(&amp;#39;season&amp;#39;, &amp;#39;episode&amp;#39;)
  ) %&amp;gt;%
  #Summarize By Character
  count(episode_num, speaker, name = &amp;quot;lines&amp;quot;) %&amp;gt;%
  group_by(speaker) %&amp;gt;% 
  arrange(episode_num) %&amp;gt;%
  mutate(total_lines = cumsum(lines),
         max_lines = max(total_lines)) %&amp;gt;%
  ungroup() %&amp;gt;%
  #Keep Top 20
  mutate(rnk = dense_rank(-max_lines)) %&amp;gt;%
  filter(rnk &amp;lt;= 20) %&amp;gt;% 
  ggplot(aes(x = episode_num, y = total_lines, color = speaker)) + 
    geom_line() + 
    labs(x = &amp;quot;Episode # of Friends&amp;quot;,
         y = &amp;quot;Number of Lines&amp;quot;,
         title = &amp;quot;Cumulative Number of Lines Spoken by Characters on Friends&amp;quot;) + 
    cowplot::theme_cowplot() + 
    theme(legend.position=&amp;#39;none&amp;#39;,
          plot.title = element_text(size = 14)) 

ggplotly(p)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;code-block-2-the-code-thats-actually-run-to-save-the-plot.ly-chart-to-an-external-file&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Code Block 2: The Code That’s ACTUALLY run to save the plot.ly chart to an external file&lt;/h3&gt;
&lt;p&gt;This would normally have &lt;code&gt;echo=FALSE&lt;/code&gt; so that it is run but not seen. It is identical to the prior code block except that it also saves the chart to &lt;em&gt;p1.html&lt;/em&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;## Identical Code to CB1
p &amp;lt;- friends::friends %&amp;gt;% 
  filter(!is.na(speaker)) %&amp;gt;% 
  #Creating Running Season and Episode Indicator
  inner_join(
    friends::friends %&amp;gt;% 
      distinct(season, episode) %&amp;gt;%
      arrange(season, episode) %&amp;gt;%
      mutate(episode_num = row_number()),
    by = c(&amp;#39;season&amp;#39;, &amp;#39;episode&amp;#39;)
  ) %&amp;gt;%
  #Summarize By Character
  count(episode_num, speaker, name = &amp;quot;lines&amp;quot;) %&amp;gt;%
  group_by(speaker) %&amp;gt;% 
  arrange(episode_num) %&amp;gt;%
  mutate(total_lines = cumsum(lines),
         max_lines = max(total_lines)) %&amp;gt;%
  ungroup() %&amp;gt;%
  #Keep Top 20
  mutate(rnk = dense_rank(-max_lines)) %&amp;gt;%
  filter(rnk &amp;lt;= 20) %&amp;gt;% 
  ggplot(aes(x = episode_num, y = total_lines, color = speaker)) + 
    geom_line() + 
    labs(x = &amp;quot;Episode # of Friends&amp;quot;,
         y = &amp;quot;Number of Lines&amp;quot;,
         title = &amp;quot;Cumulative Number of Lines Spoken by Characters on Friends&amp;quot;) + 
    cowplot::theme_cowplot() + 
    theme(legend.position=&amp;#39;none&amp;#39;,
          plot.title = element_text(size = 14)) 

################MODIFIED PART STARTS HERE##############################

## Save the plot.ly chart to an object
p1 &amp;lt;- ggplotly(p)

## Save the object as a self-contained HTML file
htmlwidgets::saveWidget(p1, file=&amp;quot;p1.html&amp;quot;, selfcontained = T)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;code-block-3-the-code-to-redner-the-stand-alone-plot.ly-chart&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Code Block 3: The code to render the stand-alone plot.ly chart&lt;/h3&gt;
&lt;p&gt;This also would normally have &lt;code&gt;echo=FALSE&lt;/code&gt; to run the code but not display it.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;htmltools::tags$iframe(
  src = &amp;quot;p1.html&amp;quot;, 
  scrolling = &amp;quot;no&amp;quot;, 
  seamless = &amp;quot;seamless&amp;quot;,
  frameBorder = &amp;quot;0&amp;quot;,
  height=400,
  width=800
)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;iframe src=&#34;p1.html&#34; scrolling=&#34;no&#34; seamless=&#34;seamless&#34; frameBorder=&#34;0&#34; height=&#34;400&#34; width=&#34;800&#34;&gt;&lt;/iframe&gt;&lt;/p&gt;
&lt;p&gt;And now as you can see, we have the plot.ly chart displayed. But the reading time is a much more manageable 5 minutes. This is because the source HTML now looks like:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;after.PNG&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Far fewer words. &lt;em&gt;Hope this helps&lt;/em&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Celebrating the Blog&#39;s First Birthday With googleAnalyticsR</title>
      <link>https://jlaw.netlify.app/2021/07/14/celebrating-the-blog-s-first-birthday-with-googleanalyticsr/</link>
      <pubDate>Wed, 14 Jul 2021 00:00:00 +0000</pubDate>
      <guid>https://jlaw.netlify.app/2021/07/14/celebrating-the-blog-s-first-birthday-with-googleanalyticsr/</guid>
      <description>
&lt;script src=&#34;https://jlaw.netlify.app/2021/07/14/celebrating-the-blog-s-first-birthday-with-googleanalyticsr/index_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;p&gt;On July 4th, 2020, I posted the &lt;a href=&#34;https://jlaw.netlify.app/2020/07/04/a-racing-barplot-of-top-us-baby-names-1880-2018/&#34;&gt;first article&lt;/a&gt; to this humble R blog as a small hobby to do something new while working from home through COVID. Very recently, this blog celebrated its first year and I wanted to leverage Google Analytics to look back at the last year: what’s done well, and when and where people were visiting from. Much of this content is heavily leveraged from &lt;a href=&#34;https://statsandr.com/blog/track-blog-performance-in-r/&#34;&gt;Antoine Soetewey’s Stats and R blog post&lt;/a&gt;, &lt;strong&gt;but&lt;/strong&gt; the numbers contained here will be &lt;em&gt;much&lt;/em&gt; smaller than on his.&lt;/p&gt;
&lt;p&gt;As it says on the home page, this blog was primarily meant for me to be able to have a more accessible set of code snippets and a reason to do random analyses to continue learning. The fact that people have taken the time to read it has been awesome and really the icing on a delicious cake.&lt;/p&gt;
&lt;p&gt;So to everyone currently reading, or who has read the blog in the last year: thank you so much! Now onto the recap!!&lt;/p&gt;
&lt;div id=&#34;libraries-and-set-up&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Libraries and Set-up&lt;/h2&gt;
&lt;p&gt;The libraries I’ll use in this analysis generally serve one of two functions: accessing and manipulating data from Google Analytics, which is the workhorse of this post, or creating and polishing data visualizations (plotly, scales, gghalves, ggflags).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(plotly) #For turning ggplots into INTERACTIVE ggplots
library(tidyverse) #General Data Manipulation
library(googleAnalyticsR) # To access the Google Analytics API
library(scales) # Making text prettier
library(gghalves) # Creating Half Boxplot / Half Point Plots
library(wesanderson) # To have some more fun colors
library(countrycode) # Convert Country Names to 2 Letter Codes
library(ggflags) # Plot Flags&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Getting Google Analytics set up was also somewhat tricky. As someone who’s not terribly familiar with Google Cloud Platform and was a little hazy about using the generic public account that comes with &lt;code&gt;googleAnalyticsR&lt;/code&gt;, I struggled a bit with getting the authentication correct. The &lt;a href=&#34;http://code.markedmondson.me/googleAnalyticsR/articles/rmarkdown.html&#34;&gt;googleAnalyticsR documentation&lt;/a&gt; provides some guidance for getting the authentication to work with markdown, but I trial-and-errored so much that I don’t think I can provide much guidance for what I did. I just kind of kept running &lt;code&gt;ga_auth_setup()&lt;/code&gt; until things seemed like they were working.&lt;/p&gt;
&lt;p&gt;But after getting client ids and auth ids into my R Environment, I can authenticate the markdown file with:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ga_auth()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As a last piece of set-up, most of the functions in &lt;code&gt;googleAnalyticsR&lt;/code&gt; take in a &lt;em&gt;view_id&lt;/em&gt; and a date range. Since those will all be the same (I’m looking at &lt;strong&gt;&lt;em&gt;this&lt;/em&gt;&lt;/strong&gt; blog over its first year), I’ll create those variables first so they can be referenced in each call:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;view_id &amp;lt;- ga_account_list()$viewId

start_date &amp;lt;- as.Date(&amp;quot;2020-07-04&amp;quot;)
end_date &amp;lt;- as.Date(&amp;quot;2021-07-03&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;the-headlines-users-and-sessions&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;The Headlines (Users and Sessions)&lt;/h2&gt;
&lt;p&gt;The first thing to explore will be the total number of &lt;em&gt;users&lt;/em&gt;, &lt;em&gt;sessions&lt;/em&gt;, and &lt;em&gt;pageviews&lt;/em&gt; that occurred over the first year of the R Blog. To access the Google Analytics API, I’ll use the &lt;code&gt;google_analytics()&lt;/code&gt; function. The parameters are pretty self-explanatory in that you give it your &lt;em&gt;ViewId&lt;/em&gt;, a date range, a set of metrics, and a set of dimensions to get data returned. The &lt;code&gt;anti_sample&lt;/code&gt; option will split up the call so that nothing gets sampled.&lt;/p&gt;
&lt;p&gt;A complete list of metrics and dimensions can be found in the &lt;a href=&#34;https://ga-dev-tools.appspot.com/dimensions-metrics-explorer/?&#34;&gt;Google Analytics Metrics and Dimension Explorer&lt;/a&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;totals &amp;lt;- google_analytics(view_id,
                           date_range = c(start_date, end_date),
                           metrics = c(&amp;quot;users&amp;quot;, &amp;quot;sessions&amp;quot;, &amp;quot;pageviews&amp;quot;),
                           anti_sample = TRUE 
)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Over the first year of the blog (July 4th, 2020 through July 3rd, 2021), I had 3,081 users visit with 4,211 sessions and 6,685 total page views. Given the relatively minimal promotion, I’ll call that a win 🏆.&lt;/p&gt;
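&lt;p&gt;Those three totals also imply a couple of simple engagement ratios. As a small sketch (assuming the one-row &lt;code&gt;totals&lt;/code&gt; data frame returned above), pages per session and sessions per user can be derived directly:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;## Sketch: derived ratios from the totals pull (assumes totals from above)
totals %&amp;gt;%
  mutate(pages_per_session = pageviews / sessions,
         sessions_per_user = sessions / users)
## With the figures above: 6685 / 4211 is about 1.59 pages per session
## and 4211 / 3081 is about 1.37 sessions per user&lt;/code&gt;&lt;/pre&gt;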
&lt;p&gt;I have a hypothesis that most of my views came in the days immediately following posts as I’m connected to the &lt;a href=&#34;https://www.r-bloggers.com/&#34;&gt;R-Bloggers&lt;/a&gt; aggregator. To check this hypothesis I’ll compare the time series of sessions to the days when posts were first made. Since the post dates are embedded in the URLs (/year/month/day/title), I’ll get the full URLs from Google Analytics and pull out the post dates from there:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Get all visited pages 
launch_dates &amp;lt;- google_analytics(view_id,
                                 date_range = c(start_date, end_date),
                                 metrics = c(&amp;quot;pageviews&amp;quot;), 
                                 dimensions = c(&amp;quot;pagePath&amp;quot;),
                                 anti_sample = TRUE
)

# Grab All the URLs that have the /year/month/day pattern and at 
#least 10 page views
launch_dates &amp;lt;- launch_dates %&amp;gt;%
  #Keep only rows that match my pattern
  filter(str_detect(pagePath, &amp;#39;/\\d+/\\d+/\\d+&amp;#39;)) %&amp;gt;%
  #Extract and convert the date components
  extract(pagePath, regex=&amp;#39;/(\\d+)/(\\d+)/(\\d+)/&amp;#39;,
         into = c(&amp;#39;year&amp;#39;, &amp;#39;month&amp;#39;, &amp;#39;day&amp;#39;)) %&amp;gt;% 
  #Turn the components into an actual date field
  mutate(dt = lubridate::ymd(paste(year, month, day, sep = &amp;#39;-&amp;#39;)),
         #Fixing an error in this logic
         dt = if_else(dt == lubridate::ymd(20201201), 
                      lubridate::ymd(20201206),
                      dt)) %&amp;gt;% 
  group_by(dt) %&amp;gt;% 
  summarize(pg = sum(pageviews)) %&amp;gt;% 
  filter(pg &amp;gt; 10)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now I can get the &lt;strong&gt;sessions over time&lt;/strong&gt; from Google Analytics and overlay the launch dates on top of them:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;sessions_over_time &amp;lt;- google_analytics(view_id,
                           date_range = c(start_date, end_date),
                           metrics = c(&amp;quot;sessions&amp;quot;),
                           dimensions = c(&amp;quot;date&amp;quot;),
                           anti_sample = TRUE 
)

sessions_over_time %&amp;gt;% 
  left_join(launch_dates, by = c(&amp;quot;date&amp;quot; = &amp;quot;dt&amp;quot;), keep = T) %&amp;gt;% 
  ggplot(aes(x = date, y = sessions)) + 
    geom_line() + 
    geom_point(aes(x = dt), color = &amp;#39;darkblue&amp;#39;, size = 3) + 
    scale_x_date(date_breaks = &amp;#39;month&amp;#39;, date_labels = &amp;#39;%b %Y&amp;#39;) + 
    labs(x = &amp;quot;Date&amp;quot;, y = &amp;quot;# of Sessions&amp;quot;, 
         title = &amp;quot;Sessions Over the Last Year&amp;quot;,
         subtitle = &amp;quot;Blue dots represent post dates&amp;quot;) + 
    cowplot::theme_cowplot()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2021/07/14/celebrating-the-blog-s-first-birthday-with-googleanalyticsr/index_files/figure-html/sessions_over_time-1.png&#34; width=&#34;1152&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Based on the post dates (blue dots), it does not seem like post dates line up with the highest-volume days. While there are some peaks on post days, particularly in March through July, there are a number of large spikes that occur a bit after the posting dates.&lt;/p&gt;
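&lt;p&gt;One rough way to check that impression numerically would be to flag each day by whether it falls within a week of a post date and compare average sessions. This is just a sketch (the 7-day window is my own arbitrary choice) that assumes the &lt;code&gt;sessions_over_time&lt;/code&gt; and &lt;code&gt;launch_dates&lt;/code&gt; data frames from above:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;## Sketch: mean sessions within vs. beyond 7 days of a post date
## (assumes sessions_over_time and launch_dates exist as above)
sessions_over_time %&amp;gt;%
  mutate(near_post = map_lgl(date,
                             ~ any(abs(as.numeric(.x - launch_dates$dt)) &amp;lt;= 7))) %&amp;gt;%
  group_by(near_post) %&amp;gt;%
  summarize(avg_sessions = mean(sessions), days = n())&lt;/code&gt;&lt;/pre&gt;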
&lt;div id=&#34;looking-at-monthly-active-users&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Looking at Monthly Active Users&lt;/h3&gt;
&lt;p&gt;For whatever reason I thought it would be cool to have 1,000 monthly active users on the blog (1,000 unique visitors in a 30 day period). Given that there were only 3,081 users throughout the course of the year, it doesn’t seem likely that I made this goal. But fortunately we don’t have to guess:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mau &amp;lt;- google_analytics(view_id,
                           date_range = c(start_date, end_date),
                           metrics = c(&amp;quot;30dayUsers&amp;quot;), 
                           dimensions = c(&amp;quot;date&amp;quot;),
                           anti_sample = TRUE
)

mau %&amp;gt;% 
  ggplot(aes(x = date, y = `30dayUsers`)) + 
  geom_line(color = wes_palette(&amp;#39;Moonrise2&amp;#39;, n=1, &amp;#39;discrete&amp;#39;)) + 
  geom_smooth(se = F, lty = 2, color = wes_palette(&amp;#39;BottleRocket1&amp;#39;, 1)) + 
  labs(x = &amp;quot;Date&amp;quot;, y = &amp;quot;# of Sessions&amp;quot;, 
       title = &amp;quot;Monthly Active Users (30 Days)&amp;quot;,
       subtitle = &amp;quot;Smoothed Line in Red&amp;quot;) + 
  cowplot::theme_cowplot()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2021/07/14/celebrating-the-blog-s-first-birthday-with-googleanalyticsr/index_files/figure-html/mau-1.png&#34; width=&#34;1152&#34; /&gt;&lt;/p&gt;
&lt;p&gt;The blog definitely became more popular in April (awesome). But sadly, the monthly active user count topped out at 872 😢. Better luck in 2021-2022.&lt;/p&gt;
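&lt;p&gt;The 872 figure comes straight out of the &lt;code&gt;mau&lt;/code&gt; data frame; a one-line sketch (assuming &lt;code&gt;mau&lt;/code&gt; as pulled above) finds the peak and the date it occurred:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;## Sketch: peak 30-day active user count and when it happened
mau %&amp;gt;% slice_max(`30dayUsers`, n = 1)&lt;/code&gt;&lt;/pre&gt;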
&lt;/div&gt;
&lt;div id=&#34;days-of-the-week&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Days of the Week&lt;/h3&gt;
&lt;p&gt;Next up is looking at the number of sessions split by days of the week. Just for fun here, I’ll utilize the &lt;code&gt;gghalves&lt;/code&gt; package which allows you to &lt;a href=&#34;https://erocoar.github.io/gghalves/&#34;&gt;create hybrid geoms&lt;/a&gt;. In this case, I’ll make a half box plot, half point plot to be able to show the distribution in box plot form but also get a better idea of the actual distribution from the points. The &lt;em&gt;side&lt;/em&gt; parameter tells the function to plot on the left or right half.&lt;/p&gt;
&lt;p&gt;Since most days have very few sessions, the plot uses a log10 scale.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;sessions_dow &amp;lt;- google_analytics(view_id,
                                date_range = c(start_date, end_date),
                                metrics = c(&amp;quot;sessions&amp;quot;),
                                dimensions = c(&amp;quot;Date&amp;quot;, &amp;quot;dayOfWeekName&amp;quot;),
                                anti_sample = TRUE
)

sessions_dow %&amp;gt;% 
  # Code the text labels to a Factor
  mutate(dayOfWeekName = factor(dayOfWeekName,
                                levels = c(&amp;#39;Sunday&amp;#39;, &amp;#39;Monday&amp;#39;, &amp;#39;Tuesday&amp;#39;, 
                                           &amp;#39;Wednesday&amp;#39;, &amp;#39;Thursday&amp;#39;, &amp;#39;Friday&amp;#39;,
                                           &amp;#39;Saturday&amp;#39;))) %&amp;gt;%
  ggplot(aes(x = dayOfWeekName, y = sessions, fill = dayOfWeekName)) + 
    geom_half_boxplot(side = &amp;#39;l&amp;#39;, outlier.shape = NA) + 
    geom_half_point(side = &amp;#39;r&amp;#39;, aes(color = dayOfWeekName)) +
    labs(title = &amp;#39;What is the Day of Week Distribution of Sessions?&amp;#39;,
         x = &amp;quot;Day of Week&amp;quot;,
         y = &amp;quot;Sessions&amp;quot;) + 
    scale_y_log10() + 
    scale_fill_manual(guide = &amp;#39;none&amp;#39;, 
                      values = wes_palette(&amp;#39;Zissou1&amp;#39;, n =7, type = &amp;#39;continuous&amp;#39;)) + 
    scale_color_manual(guide = &amp;#39;none&amp;#39;,
                       values = wes_palette(&amp;#39;Zissou1&amp;#39;, n =7, type = &amp;#39;continuous&amp;#39;)) + 
    cowplot::theme_cowplot()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2021/07/14/celebrating-the-blog-s-first-birthday-with-googleanalyticsr/index_files/figure-html/dow-1.png&#34; width=&#34;1152&#34; /&gt;&lt;/p&gt;
&lt;p&gt;It seems like Monday is the most popular day, with a slight decline throughout the rest of the week. The median number of sessions is fairly similar across the weekdays, but there is a higher ceiling for Monday, Tuesday, and Wednesday than there is for Thursday and Friday.&lt;/p&gt;
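&lt;p&gt;The medians and ceilings being eyeballed from the plot can also be pulled out numerically; a quick sketch assuming the &lt;code&gt;sessions_dow&lt;/code&gt; pull from above:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;## Sketch: median and max sessions by day of week
sessions_dow %&amp;gt;%
  group_by(dayOfWeekName) %&amp;gt;%
  summarize(median_sessions = median(sessions),
            max_sessions = max(sessions)) %&amp;gt;%
  arrange(desc(median_sessions))&lt;/code&gt;&lt;/pre&gt;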
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;sources-and-pages&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Sources and Pages&lt;/h2&gt;
&lt;p&gt;I don’t do a ton of promotion of the blog but I &lt;strong&gt;am&lt;/strong&gt; very interested in knowing how people are getting to the site as well as what pages people gravitated to the most.&lt;/p&gt;
&lt;div id=&#34;how-are-people-getting-to-the-site&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;How are people getting to the Site?&lt;/h3&gt;
&lt;p&gt;Google Analytics provides the referral source for site visitors. Let’s take a look at the top 10 referral sources to the site:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;sources &amp;lt;- google_analytics(view_id,
                               date_range = c(start_date, end_date),
                               metrics = c(&amp;quot;sessions&amp;quot;),
                               dimensions = c(&amp;quot;source&amp;quot;),
                               anti_sample = TRUE 
)

sources %&amp;gt;% 
  mutate(pct = sessions / sum(sessions)) %&amp;gt;% 
  #Get top 10 rows by session value
  slice_max(sessions, n = 10) %&amp;gt;%
  ggplot(aes(x = fct_reorder(source, sessions), 
             y = sessions,
             fill = source)) + 
  geom_col() +
  geom_text(aes(label = paste0(sessions %&amp;gt;% comma(accuracy = 1), &amp;#39; (&amp;#39;,
                               pct %&amp;gt;% percent(accuracy = .1), &amp;#39;)&amp;#39;)),
            nudge_y = 80) + 
  scale_y_continuous(expand = c(0, 0)) + 
  scale_fill_manual(guide = &amp;#39;none&amp;#39;,
                    values = wes_palette(&amp;#39;FantasticFox1&amp;#39;, n=10, &amp;#39;continuous&amp;#39;)) + 
  labs(x = &amp;quot;Referral Sources&amp;quot;, y = &amp;quot;# of Sessions&amp;quot;,
       title = &amp;quot;Where Did People Visiting the Blog Come From?&amp;quot;) + 
  coord_flip(ylim = c(0, 1600)) + 
  cowplot::theme_cowplot() + 
  theme(
    plot.title.position = &amp;#39;plot&amp;#39;
  )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2021/07/14/celebrating-the-blog-s-first-birthday-with-googleanalyticsr/index_files/figure-html/sources-1.png&#34; width=&#34;1152&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Somewhat surprising to me is that nearly a third of sessions are direct to the site and another 20% are from Google. Given that I think &lt;a href=&#34;https://www.r-bloggers.com/&#34;&gt;R-Bloggers&lt;/a&gt; is probably my primary mechanism of promotion, I’m not surprised that it’s in the Top 3, but kind of surprised that it is only #3. It is also kind of cool to see referrals from rweekly.org and LinkedIn, where I don’t know exactly how my blog is popping up… but I’m happy that it is!&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;what-are-the-most-visited-posts-on-the-site&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;What are the Most Visited Posts on the Site?&lt;/h3&gt;
&lt;p&gt;One of the most obvious questions for this post is which prior posts generated the most views. Because there are non-post pages on the site (such as the home page), I’ll need to do some cleaning to keep only the actual posts. Then we can look at the top 10 posts by page views.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;top_pages &amp;lt;- google_analytics(view_id,
                               date_range = c(start_date, end_date),
                               metrics = c(&amp;quot;pageviews&amp;quot;),
                               dimensions = c(&amp;quot;pageTitle&amp;quot;),
                               anti_sample = TRUE
)

top_pages %&amp;gt;% 
  # Remove the &amp;quot; | JLaw&amp;#39;s R Blog&amp;quot; suffix that ends every page title
  mutate(pageTitle = str_remove_all(pageTitle, &amp;quot; \\| JLaw&amp;#39;s R Blog&amp;quot;)) %&amp;gt;%
  # Remove the Main Post Page, the Home Page, and Unknown Pages
  filter(!pageTitle %in% c(&amp;#39;(not set)&amp;#39;, &amp;quot;JLaw&amp;#39;s R Blog&amp;quot;, &amp;quot;Posts&amp;quot;)) %&amp;gt;%
  # Keep the Top 10 By Page Views
  slice_max(pageviews, n = 10) %&amp;gt;%
  ggplot(aes(x = fct_reorder(str_wrap(pageTitle, 75), pageviews), 
             y = pageviews,
             fill = pageTitle)) + 
  geom_col() +
  geom_text(aes(label = pageviews %&amp;gt;% comma(accuracy = 1)),
            hjust = 1) + 
  scale_fill_discrete(guide = F) + 
  scale_y_continuous(expand = c(0, 0)) + 
  labs(x = &amp;quot;&amp;quot;, y = &amp;quot;# of Users&amp;quot;,
       title = &amp;quot;Most Popular Posts&amp;quot;) + 
  coord_flip() + 
  cowplot::theme_cowplot() + 
  theme(
    plot.title.position = &amp;#39;plot&amp;#39;
  )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2021/07/14/celebrating-the-blog-s-first-birthday-with-googleanalyticsr/index_files/figure-html/post1-1.png&#34; width=&#34;1152&#34; /&gt;&lt;/p&gt;
&lt;p&gt;I’m not surprised that “Scraping the Google Play Store with RSelenium” is the Top Post on the site as it got picked up by at least one other website that I was aware of. Also, as far as I know it’s not a very common topic. Similarly, my post on &lt;code&gt;arulesSequence&lt;/code&gt; isn’t surprising as that’s an interesting package with not a ton of blog posts about it. However, I did not realize that “7 Things I Learned During Advent of Code 2020” was as popular as it was. And finally, it makes me kind of happy that visualizing the Dancing with the Stars winners with &lt;code&gt;gt&lt;/code&gt; was number 4. I really like that post, and Hugo (how I generate this site) got really confused and claims that the reading time is an hour when it is much shorter. So I’m happy that people weren’t too scared off.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;HOWEVER&lt;/strong&gt;, while it’s good to know which posts are the most popular in general, some of these posts are older than others and have had more of a chance to generate page views. For example, the Instagram Lite post is from late June while the Advent of Code post is from December. To counter this, I can look at the cumulative number of page views from the first page view date. Then we can see which post is accumulating views the fastest. To do this, I’m going to create a static ggplot but then use &lt;code&gt;ggplotly&lt;/code&gt; to make it interactive:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pages_by_time &amp;lt;- google_analytics(view_id,
                                  date_range = c(start_date, end_date),
                                  metrics = c(&amp;quot;pageviews&amp;quot;),
                                  dimensions = c(&amp;quot;date&amp;quot;, &amp;quot;pageTitle&amp;quot;),
                                  anti_sample = TRUE 
)

p &amp;lt;- pages_by_time %&amp;gt;% 
  filter(pageTitle != &amp;#39;(not set)&amp;#39;) %&amp;gt;% 
  #Filter out pages with less than 50 pageviews
  add_count(pageTitle, wt = pageviews, name = &amp;quot;total_views&amp;quot;) %&amp;gt;% 
  filter(total_views &amp;gt;= 50) %&amp;gt;% 
  # Calculate Days Since Post and Cumulative Number of Views
  group_by(pageTitle) %&amp;gt;% 
  arrange(pageTitle, date) %&amp;gt;% 
  mutate(
    min_date = min(date),
    days_since_post = date - min(date),
    cuml_views = cumsum(pageviews)) %&amp;gt;% 
  ungroup() %&amp;gt;% 
  #The text aesthetic allows me to add that field into the tooltip for plotly
  ggplot(aes(x = days_since_post, y = cuml_views, color = pageTitle, text = min_date)) + 
    geom_line() + 
    coord_cartesian(xlim = c(0, 100), ylim = c(0, 550)) + 
    labs(title = &amp;quot;Which Posts Got the Views the Fastest?&amp;quot;,
         subtitle = &amp;quot;First 100 Days Since Post Date&amp;quot;,
         y = &amp;quot;Cumulative Page Views&amp;quot;,
         x = &amp;quot;Days Since Post Date&amp;quot;) +
    cowplot::theme_cowplot() + 
    # This will work with Plotly while scale_color_discrete(guide = F) will not
    theme(legend.position=&amp;#39;none&amp;#39;) 

# Create Interactive Version of GGPLOT
ggplotly(p)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;iframe src=&#34;p1.html&#34; scrolling=&#34;no&#34; seamless=&#34;seamless&#34; frameBorder=&#34;0&#34; height=&#34;400&#34; width=&#34;800&#34;&gt;&lt;/iframe&gt;&lt;/p&gt;
&lt;p&gt;Now it’s a little clearer to see the bump that the RSelenium post got around day 8 that shot it to most popular. Also, the Instagram Lite post actually has the most views of any post at 8 days since publishing. However, its trajectory is beginning to flatten, and while it seems like it will be one of the more popular posts, it doesn’t seem like it will catch RSelenium.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;what-countries-are-people-visiting-the-site-from&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;What Countries are People Visiting The Site From?&lt;/h2&gt;
&lt;p&gt;The blog was visited by users from 134 countries throughout the year, which is pretty crazy to think about. We can look at the distribution of countries by users to see whether the blog is most popular in the US (which is expected) or if it has a stronger than expected international appeal. To add some pizzazz to the graph, I’ll use the &lt;code&gt;countrycode&lt;/code&gt; package to convert the country names into two-letter codes and then use &lt;code&gt;ggflags&lt;/code&gt; to add the flags to the plot (note that &lt;code&gt;geom_flag&lt;/code&gt; works by having a &lt;em&gt;country&lt;/em&gt; aesthetic set).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;users_by_country &amp;lt;- google_analytics(view_id,
                               date_range = c(start_date, end_date),
                               metrics = c(&amp;quot;users&amp;quot;),
                               dimensions = c(&amp;quot;country&amp;quot;),
                               anti_sample = TRUE 
)

users_by_country %&amp;gt;% 
  filter(country != &amp;#39;(not set)&amp;#39;) %&amp;gt;% 
  #Get % Column and Recode Countries to the iso2c standard
  mutate(pct = users/sum(users),
         code = str_to_lower(countrycode(country, 
                                         origin = &amp;#39;country.name.en&amp;#39;, 
                                         destination = &amp;#39;iso2c&amp;#39;)
                             )
         ) %&amp;gt;%
  # Get Top 10 Countries by # of Users
  slice_max(users, n = 10) %&amp;gt;% 
  ggplot(aes(x = fct_reorder(country, users), 
             y = users,
             fill = country,
             country = code)) + 
    geom_col() +
    geom_text(aes(label = paste0(users %&amp;gt;% comma(accuracy = 1), 
                                 &amp;#39; (&amp;#39;, pct %&amp;gt;% percent(accuracy = .1), &amp;#39;)&amp;#39;)),
              nudge_y = 50) + 
    geom_flag(y = 30, size = 15) + 
    scale_fill_discrete(guide = F) + 
    scale_y_continuous(expand = c(0, 0)) + 
    labs(x = &amp;quot;Country&amp;quot;, y = &amp;quot;# of Users&amp;quot;,
         title = &amp;quot;Where Did Users Come From?&amp;quot;) + 
    coord_flip(ylim = c(0, 1100)) + 
    cowplot::theme_cowplot()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2021/07/14/celebrating-the-blog-s-first-birthday-with-googleanalyticsr/index_files/figure-html/users_by_country-1.png&#34; width=&#34;1152&#34; /&gt;&lt;/p&gt;
&lt;p&gt;As expected, the US is where the most users are located, with close to 30% of all users. However, what’s a bit surprising is that 70% of the users are &lt;em&gt;NOT&lt;/em&gt; from the US. And in the Top 10 countries there’s pretty good representation across the continents, with North America, South America, Europe, Asia, and Australia all represented (Africa gets its first representation at #29 with Nigeria).&lt;/p&gt;
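&lt;p&gt;Both the 134-country count and the US share can be checked directly against the &lt;code&gt;users_by_country&lt;/code&gt; data frame. A small sketch, assuming that pull from above and assuming Google Analytics labels the US as “United States”:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;## Sketch: number of countries and US share of users
## (assumes users_by_country from above)
users_by_country %&amp;gt;%
  filter(country != &amp;#39;(not set)&amp;#39;) %&amp;gt;%
  summarize(n_countries = n_distinct(country),
            us_share = sum(users[country == &amp;#39;United States&amp;#39;]) / sum(users))&lt;/code&gt;&lt;/pre&gt;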
&lt;/div&gt;
&lt;div id=&#34;concluding-thoughts&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Concluding Thoughts&lt;/h2&gt;
&lt;p&gt;First and foremost, thank you to everyone who has supported the blog by reading it over the past year. This really did start out as a small hobby for myself during COVID but I hope that others have found some value in the various posts. For this post in particular, I hope it displays all the things you can find within Google Analytics. For me personally, it made me happy to use this post to reflect on the first year of the blog and see the reach that a single person doing this in their spare time can have. So again, thank you all and onto Year 2 (and another shot at that 1,000 Monthly Active User goal!!)&lt;/p&gt;
&lt;center&gt;
&lt;img src=&#34;Thank-you-word-cloud.jpg&#34; /&gt;
&lt;/center&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>What Are People Sayin&#39; About Instagram Lite?</title>
      <link>https://jlaw.netlify.app/2021/06/26/what-people-are-sayin-about-instagram-lite/</link>
      <pubDate>Sat, 26 Jun 2021 00:00:00 +0000</pubDate>
      <guid>https://jlaw.netlify.app/2021/06/26/what-people-are-sayin-about-instagram-lite/</guid>
      <description>
&lt;script src=&#34;https://jlaw.netlify.app/2021/06/26/what-people-are-sayin-about-instagram-lite/index_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;p&gt;In the beginning of May, I used &lt;a href=&#34;https://jlaw.netlify.app/2021/05/03/scraping-google-play-reviews-with-rselenium/&#34;&gt;&lt;code&gt;RSelenium&lt;/code&gt; to scrape the Google Play Store reviews&lt;/a&gt; for Instagram Lite to demonstrate how the package can be used to automate browser behavior. It’s taken longer than I had initially planned to do this follow-up analysis of that data, but better late than never. So in this analysis I will do some exploratory work and some text mining to look at questions such as:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;How have IG Lite reviews been trending?&lt;/li&gt;
&lt;li&gt;What are prevalent topics in the Google Play reviews about IGLite?&lt;/li&gt;
&lt;li&gt;For words with negative sentiment, why are people feeling negatively?&lt;/li&gt;
&lt;li&gt;What are the most prevalent keywords in the set of reviews?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The main libraries that I will use to do this analysis are &lt;code&gt;udpipe&lt;/code&gt; for applying the language model used for part-of-speech tagging, &lt;code&gt;BTM&lt;/code&gt; to construct the Biterm model, and &lt;code&gt;textrank&lt;/code&gt; / &lt;code&gt;wordcloud&lt;/code&gt; to do keyword extraction and make the wordcloud. All of &lt;code&gt;udpipe&lt;/code&gt;, &lt;code&gt;BTM&lt;/code&gt;, and &lt;code&gt;textrank&lt;/code&gt; are part of the &lt;a href=&#34;http://www.bnosac.de&#34;&gt;Bnosac&lt;/a&gt; NLP ecosystem.&lt;/p&gt;
&lt;p&gt;The analyses in this post are heavily inspired by Bnosac’s posts on &lt;a href=&#34;http://www.bnosac.be/index.php/blog/98-biterm-topic-modelling-for-short-texts&#34;&gt;Biterm Modeling&lt;/a&gt; and &lt;a href=&#34;http://www.bnosac.be/index.php/blog/85-you-did-a-sentiment-analysis-with-tidytext-but-you-forgot-to-do-dependency-parsing-to-answer-why-is-something-positive-negative&#34;&gt;Sentiment Analysis&lt;/a&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidyverse)  # General Data Manipulation
library(lubridate) # Date Manipulations
library(extrafont)  # To use more fun fonts in GGPLOT
loadfonts(device = &amp;quot;win&amp;quot;)
library(udpipe) # Tokenizing, Lemmatising, Tagging and Dependency Parsing
library(BTM) # Biterm Topic Modeling
library(scales) # To help format  plots
library(textrank) # Keyword Extraction
library(wordcloud) # Create wordcloud&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For data I’ll be using the result file from my &lt;a href=&#34;https://jlaw.netlify.app/2021/05/03/scraping-google-play-reviews-with-rselenium/&#34;&gt;web scraping post&lt;/a&gt; from April:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;iglite &amp;lt;- read_csv(&amp;#39;https://raw.githubusercontent.com/jtlawren67/jlawblog/master/content/post/2021-05-03-scraping-google-play-reviews-with-rselenium/data/review_data.csv&amp;#39;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;As a reminder the data looks like:&lt;/p&gt;
&lt;table&gt;
&lt;colgroup&gt;
&lt;col width=&#34;3%&#34; /&gt;
&lt;col width=&#34;1%&#34; /&gt;
&lt;col width=&#34;2%&#34; /&gt;
&lt;col width=&#34;1%&#34; /&gt;
&lt;col width=&#34;91%&#34; /&gt;
&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;names&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;stars&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;dates&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;clicks&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;reviews&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Harikrishnan&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;3&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2021-04-05&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;4787&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Its surely consumes less data than original app, but many of you may not get comfortable with this interface. One of the major problems I faced was that stories are getting replayed many times without me doing anything. The next major issue is that if you dont like a post it comes to your feed everytime over and over again until you like the post. Hope Instgram Team will find a solution to these problems&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Piyush AryaPrakash&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2021-04-06&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;3655&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;It’s good to see that they are providing a lite version. But it doesn’t even work . It’s better to use in chrome than downloading lite. What’s the problem - The feeds never get refreshed . You just have to scroll down and when you click refresh still you see the same feeds. Doesn’t support links . Lags too much . Too much annoying while using the messenger. Despite having a good internet connection it keeps laging saying something went wrong. It’s too slow&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Badri narayan&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;4&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2021-04-24&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;40&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Very nice app as it is lite so it is good consume less data have limited things but I don’t understand you can watch reels in app but if someone send you reels it shows not supported in lite so it should be fixed and during dark mode the text we type is not visible fix this too and everything is good &amp;lt;U+0001F917&amp;gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;div id=&#34;exploring-the-ig-lite-review-data&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Exploring the IG Lite Review Data&lt;/h2&gt;
&lt;p&gt;At the time the initial analysis was run, I had captured 2,040 reviews covering dates from 2019-03-03 through 2021-04-24. However, reviews from before December 2020 likely refer to the initial version of IG Lite rather than the &lt;a href=&#34;https://techcrunch.com/2020/12/16/facebook-launches-new-instagram-lite-app-in-india-global-rollout-to-follow-later/&#34;&gt;relaunched version&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The first thing to look at is how the review counts have trended over time:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;iglite %&amp;gt;% 
  count(dates, name = &amp;quot;reviews&amp;quot;) %&amp;gt;%
  filter(dates &amp;gt;= lubridate::ymd(20201201)) %&amp;gt;%
  ggplot(aes(x = dates, y = reviews)) + 
    geom_line() + 
    geom_smooth(se = F, lty = 2) + 
    labs(y = &amp;quot;# of Reviews in data set&amp;quot;, x = &amp;quot;Month&amp;quot;,
         title = &amp;quot;Number of IGLite Reviews In Dataset&amp;quot;) + 
    cowplot::theme_cowplot() +
    theme(
      plot.title.position = &amp;#39;plot&amp;#39;,
      text = element_text(family = &amp;#39;Arial Narrow&amp;#39;)
    )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2021/06/26/what-people-are-sayin-about-instagram-lite/index_files/figure-html/review_count-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Review volume started strong in mid-December upon the launch of IG Lite, stabilized at around 10 per day, and then began climbing in February to reach around 20 reviews per day. So if we assume that increasing reviews are correlated with increasing users, IG Lite appears to be gaining momentum.&lt;/p&gt;
&lt;p&gt;But are the reviews good? For an app that is continuously iterating, it would be interesting to see how the distribution of star ratings from 1 to 5 has changed over time as more reviews come in. To do this we can look at the cumulative distribution of each star rating from December 2020 through April 2021.&lt;/p&gt;
&lt;p&gt;Since certain days do not have coverage across all 5 ratings (remember, we were only getting about 10 reviews per day at the beginning), I’ll need to create a skeleton containing every day and all five ratings so that zeros are counted rather than treated as gaps. For this I’ll use tidyr’s &lt;code&gt;crossing()&lt;/code&gt; function, which, much like &lt;code&gt;expand.grid()&lt;/code&gt; in base R, creates a data set with all combinations of the input vectors.&lt;/p&gt;
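As a quick sketch of the skeleton idea in base R (the dates here are made up for illustration), `expand.grid()` produces every date × star combination, analogous to what `crossing()` does below:

```r
# Build a complete date x star skeleton so that days with zero reviews of a
# given rating still appear as explicit rows (base-R analogue of crossing()).
dates <- seq.Date(as.Date("2020-12-01"), as.Date("2020-12-03"), by = "day")
skeleton <- expand.grid(dates = dates, stars = 1:5)
nrow(skeleton)  # 3 days x 5 ratings = 15 rows
```

Joining the real review counts onto this skeleton and replacing the resulting `NA`s with 0 is what makes the cumulative sums below correct.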
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#Create a data frame with every day from 12/1/2020 through the max date and  1-5 
#value for stars on each day
tidyr::crossing(
  dates = seq.Date(ymd(20201201), max(iglite$dates), by = &amp;#39;day&amp;#39;),
  stars = 1:5
  ) %&amp;gt;% 
  # Join  actual data to the skeleton to get the number of reviews for that day
  left_join(
    iglite %&amp;gt;%
      count(dates, stars, name = &amp;quot;reviews&amp;quot;) %&amp;gt;%
      filter(dates &amp;gt;= lubridate::ymd(20201201)),
    by = c(&amp;quot;dates&amp;quot;, &amp;quot;stars&amp;quot;)
  ) %&amp;gt;% 
  # Fill any missing values with 0
  replace_na(list(reviews = 0)) %&amp;gt;%
  # Create the cumulative count of reviews for each star level
  group_by(stars) %&amp;gt;%
  arrange(dates) %&amp;gt;% 
  mutate(cuml_stars = cumsum(reviews)) %&amp;gt;%
  ungroup() %&amp;gt;% 
  # Add a column for the cumulative count of reviews for up to that point
  add_count(dates, wt = cuml_stars, name = &amp;quot;total_review_in_date&amp;quot;) %&amp;gt;%
  # Create the cumulative distribution for that star level to that point
  # For the most recent day create a label to be used in the post
  mutate(pct = cuml_stars / total_review_in_date,
         lbl = if_else(dates == max(dates), 
                       paste(stars, pct %&amp;gt;% percent(accuracy = 1), sep = &amp;#39;: &amp;#39;), 
                       NA_character_)) %&amp;gt;% 
  # Remove the dates prior to having 25 total reviews
  filter(total_review_in_date &amp;gt;= 25) %&amp;gt;% 
  # Plot the distribution
  ggplot(aes(x = dates, y = pct, color = as.factor(stars))) + 
    geom_line() + 
    ggrepel::geom_label_repel(aes(label = lbl)) + 
    scale_color_discrete(guide = F) + 
    scale_y_continuous(labels = percent) + 
    labs(title = &amp;quot;IGLite Rating Distribution&amp;quot;,
         subtitle = &amp;quot;Cumulative Distribution Dec - Apr&amp;quot;,
         caption = &amp;quot;Dates Start at 25 Reviews&amp;quot;) + 
    cowplot::theme_cowplot() + 
    theme(
      plot.title.position = &amp;#39;plot&amp;#39;,
      text = element_text(family = &amp;#39;Arial Narrow&amp;#39;)
    )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2021/06/26/what-people-are-sayin-about-instagram-lite/index_files/figure-html/cumulative%20ratings-1.png&#34; width=&#34;672&#34; /&gt;
Looking at the distributions over time, in January one- and three-star ratings were the most common at around 25% each, while fives and twos were relatively rare. Since January, however, the share of five-star reviews has climbed to 23% of the data set. Unfortunately, the share of one-star reviews has also climbed, making it the most common rating in the data set at 31%.&lt;/p&gt;
&lt;p&gt;An alternative way of using the star ratings is to create a Net Promoter-like score. If you’ve ever received an email asking “On a scale from 1 to 10, how likely are you to recommend this to a friend?”, you’ve taken part in a Net Promoter survey. &lt;a href=&#34;https://en.wikipedia.org/wiki/Net_Promoter&#34;&gt;The Net Promoter Score is an index from -100 to 100 measuring how willing people are to recommend a product&lt;/a&gt;. It divides respondents into Promoters (scores of 9 and 10) and Detractors (scores of 6 and below) and then calculates % Promoters - % Detractors.&lt;/p&gt;
&lt;p&gt;In this case, I’ll consider a promoter as someone who rates IGLite a 4 or a 5 and a detractor someone who rates IGLite a 1 or a 2. Then we can calculate our version of NPS for each month to get a rough look at sentiment trend.&lt;/p&gt;
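On a hypothetical vector of star ratings (invented for illustration), this NPS-style calculation reduces to two proportions and a difference:

```r
# NPS-like score from star ratings: promoters are 4-5 stars,
# detractors are 1-2 stars, 3 stars count as neutral.
stars <- c(5, 5, 4, 3, 2, 1, 1, 1)
pct_favorable   <- mean(stars >= 4)  # 3/8
pct_unfavorable <- mean(stars <= 2)  # 4/8
nps <- pct_favorable - pct_unfavorable
nps  # 0.375 - 0.5 = -0.125
```

The monthly chart below applies the same arithmetic after grouping reviews by month.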
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;iglite %&amp;gt;%
  # Filter to December
  filter(dates &amp;gt;= lubridate::ymd(20201201)) %&amp;gt;% 
  # Turn star scores into Promoter / Detractors and create a dataset where
  # for each day we&amp;#39;ll have Favorable/Unfavorable/Neutral as columns
  mutate(lbl = case_when(
    stars &amp;gt;= 4 ~ &amp;quot;favorable&amp;quot;,
    stars &amp;lt;= 2 ~ &amp;quot;unfavorable&amp;quot;,
    TRUE ~ &amp;quot;neutral&amp;quot;
    ),
    mth = format(dates, &amp;quot;%Y-%m&amp;quot;)
  ) %&amp;gt;% 
  count(mth, lbl, name = &amp;quot;reviews&amp;quot;) %&amp;gt;%
  spread(lbl, reviews) %&amp;gt;% 
  replace_na(list(favorable = 0, unfavorable = 0, neutral = 0)) %&amp;gt;% 
  # Calculate the NPS score
  mutate(
         total = favorable + neutral + unfavorable,
         pct_favorable = favorable/total,
         pct_unfavorable = unfavorable/total,
         nps = pct_favorable - pct_unfavorable
         ) %&amp;gt;%
  # Plot the NPS score by month
  ggplot(aes(x = mth, y = nps), group = 1) + 
    geom_col(aes(fill = if_else(nps &amp;lt; 0, &amp;#39;darkred&amp;#39;, &amp;#39;darkgreen&amp;#39;))) + 
    geom_point() + 
    geom_label(aes(label = nps %&amp;gt;% percent(accuracy = .1))) + 
    scale_fill_discrete(guide = F) + 
    labs(title = &amp;quot;NPS Score for IGLite&amp;quot;,
         subtitle = &amp;quot;NPS = % Promoters (Reviews &amp;gt; 3) - % Detractors (Reviews &amp;lt; 3)&amp;quot;,
         y = &amp;quot;Net Promoter Score&amp;quot;,
         x = &amp;quot;Month&amp;quot;) + 
   cowplot::theme_cowplot() + 
    theme(
      plot.title.position = &amp;#39;plot&amp;#39;,
      text = element_text(family = &amp;#39;Arial Narrow&amp;#39;),
      axis.text.y = element_blank(),
      axis.ticks.y = element_blank()
    )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2021/06/26/what-people-are-sayin-about-instagram-lite/index_files/figure-html/nps-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Yikes! This does not look great, with each of the 5 months in the data having a negative NPS score. However, similar to the cumulative ratings above, the later months (March and April) have fared much better than the first two months post-relaunch (January and February), with NPS scores close to zero. Looking at the raw data, the near-zero score reflects a polarized audience, with 42% Promoters and 43% Detractors, rather than a large share of neutral 3-star ratings:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;center&#34;&gt;Month&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;Total Reviews&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;% Favorable&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;% Neutral&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;% Unfavorable&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;NPS&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;2020-12&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;301&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;33.9%&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;25.9%&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;40.2%&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;-6.3%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;2021-01&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;248&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;33.9%&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;16.1%&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;50.0%&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;-16.1%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;2021-02&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;361&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;34.6%&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;15.0%&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;50.4%&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;-15.8%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;2021-03&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;565&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;40.0%&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;18.4%&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;41.6%&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;-1.6%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;2021-04&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;540&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;41.7%&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;15.4%&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;43.0%&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;-1.3%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;
&lt;div id=&#34;text-mining&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Text-Mining&lt;/h2&gt;
&lt;p&gt;With the EDA portion done, it’s on to text mining the reviews. In a past post I used &lt;a href=&#34;https://jlaw.netlify.app/2020/08/02/what-s-the-difference-between-instagram-and-tiktok-using-word-embeddings-to-find-out/&#34;&gt;the Tidytext ecosystem to look at Tweet differences between Instagram and TikTok&lt;/a&gt;, but this time I will use the &lt;a href=&#34;http://www.bnosac.be&#34;&gt;Bnosac&lt;/a&gt; ecosystem of packages to do biterm topic modeling and sentiment analysis with dependency parsing, and then the &lt;code&gt;textrank&lt;/code&gt; and &lt;code&gt;wordcloud&lt;/code&gt; packages to generate a word cloud of extracted keywords.&lt;/p&gt;
&lt;div id=&#34;pre-processing-with-udpipe&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Pre-processing with udpipe&lt;/h3&gt;
&lt;p&gt;In &lt;a href=&#34;https://jlaw.netlify.app/2020/10/07/looking-for-media-bias-in-coverage-of-trump-s-covid-diagnosis/&#34;&gt;prior text-mining posts&lt;/a&gt; I used tidytext to handle tokenization; in this analysis, however, I will leverage the &lt;code&gt;udpipe&lt;/code&gt; package. &lt;code&gt;udpipe&lt;/code&gt; is an R wrapper around the C++ library of the same name that uses pre-trained language models to &lt;a href=&#34;https://bnosac.github.io/udpipe/docs/doc1.html&#34;&gt;easily tokenize, tag, lemmatize or perform dependency parsing on text in any language&lt;/a&gt;. The “ud” in udpipe stands for Universal Dependencies, a “&lt;a href=&#34;https://universaldependencies.org/#ud-treebanks&#34;&gt;framework for consistent annotation of grammar&lt;/a&gt;”.&lt;/p&gt;
&lt;p&gt;In order to prepare the data for the model there needs to be some light pre-processing as &lt;code&gt;udpipe&lt;/code&gt; expects the data to have a &lt;code&gt;doc_id&lt;/code&gt; and a &lt;code&gt;text&lt;/code&gt; field.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#Columns need to be doc_id and text for the model
cleaned &amp;lt;- iglite %&amp;gt;% 
  mutate(doc_id = row_number(),
         text = str_to_lower(reviews),
         text = str_replace_all(text, &amp;quot;&amp;#39;&amp;quot;, &amp;quot;&amp;quot;))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To annotate our data with &lt;code&gt;udpipe&lt;/code&gt; I’ll call the &lt;code&gt;udpipe()&lt;/code&gt; function with my data and the language of the model to use. This function will download the appropriate language model, in this case English, and then annotate the data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;annotated_reviews    &amp;lt;- udpipe(cleaned, &amp;quot;english&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To show what the &lt;code&gt;udpipe&lt;/code&gt; model did to the data we can look at the first review before the annotations:&lt;/p&gt;
&lt;table&gt;
&lt;colgroup&gt;
&lt;col width=&#34;100%&#34; /&gt;
&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;text&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;its surely consumes less data than original app, but many of you may not get comfortable with this interface. one of the major problems i faced was that stories are getting replayed many times without me doing anything. the next major issue is that if you dont like a post it comes to your feed everytime over and over again until you like the post. hope instgram team will find a solution to these problems&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;and after the annotations:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;annotated_reviews %&amp;gt;% filter(doc_id == 1) %&amp;gt;% head(3) %&amp;gt;% knitr::kable()&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;colgroup&gt;
&lt;col width=&#34;2%&#34; /&gt;
&lt;col width=&#34;4%&#34; /&gt;
&lt;col width=&#34;3%&#34; /&gt;
&lt;col width=&#34;36%&#34; /&gt;
&lt;col width=&#34;1%&#34; /&gt;
&lt;col width=&#34;1%&#34; /&gt;
&lt;col width=&#34;2%&#34; /&gt;
&lt;col width=&#34;2%&#34; /&gt;
&lt;col width=&#34;2%&#34; /&gt;
&lt;col width=&#34;2%&#34; /&gt;
&lt;col width=&#34;1%&#34; /&gt;
&lt;col width=&#34;1%&#34; /&gt;
&lt;col width=&#34;24%&#34; /&gt;
&lt;col width=&#34;4%&#34; /&gt;
&lt;col width=&#34;2%&#34; /&gt;
&lt;col width=&#34;1%&#34; /&gt;
&lt;col width=&#34;1%&#34; /&gt;
&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;doc_id&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;paragraph_id&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;sentence_id&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;sentence&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;start&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;end&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;term_id&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;token_id&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;token&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;lemma&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;upos&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;xpos&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;feats&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;head_token_id&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;dep_rel&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;deps&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;misc&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;its surely consumes less data than original app, but many of you may not get comfortable with this interface.&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;3&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;its&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;its&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;PRON&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;PRP$&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Gender=Neut|Number=Sing|Person=3|Poss=Yes|PronType=Prs&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;3&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;nsubj&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NA&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;its surely consumes less data than original app, but many of you may not get comfortable with this interface.&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;5&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;10&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;surely&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;surely&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;ADV&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;RB&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NA&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;3&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;advmod&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NA&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;its surely consumes less data than original app, but many of you may not get comfortable with this interface.&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;12&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;19&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;3&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;3&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;consumes&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;consume&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;VERB&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;VBZ&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;root&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NA&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;We now get a ton of metadata for each token: sentence indicators, the token (&lt;code&gt;token&lt;/code&gt;) and its lemma (&lt;code&gt;lemma&lt;/code&gt;) (note that consumes becomes consume), its part of speech (&lt;code&gt;upos&lt;/code&gt;), its dependency relationship (&lt;code&gt;dep_rel&lt;/code&gt;), and more.&lt;/p&gt;
&lt;p&gt;Now that we’ve tokenized the data we can start using it to analyze the reviews.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;biterm-modeling&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Biterm Modeling&lt;/h3&gt;
&lt;p&gt;The first analysis task will be biterm modeling using the &lt;code&gt;BTM&lt;/code&gt; package. The Biterm Topic Model was developed by &lt;a href=&#34;https://github.com/xiaohuiyan/xiaohuiyan.github.io/blob/master/paper/BTM-WWW13.pdf&#34;&gt;Yan et al.&lt;/a&gt; as a means of determining the topics that occur in short texts such as Tweets (or, in this case, Google Play reviews). It’s meant to improve on traditional topic modeling in use cases such as this. My understanding of the difference is that in traditional topic modeling the model learns word co-occurrence within documents, while in the biterm topic model it learns word co-occurrences within a window across the entire set of documents. In this context a “biterm” &lt;a href=&#34;https://github.com/bnosac/BTM&#34;&gt;consists of two words co-occurring in the same context, for example, in the same short text window&lt;/a&gt;. This analysis is modeled after the one from &lt;a href=&#34;http://www.bnosac.be/index.php/blog/98-biterm-topic-modelling-for-short-texts&#34;&gt;bnosac&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;In the BTM model we can explicitly tell the model which word co-occurrences to use rather than letting it run on everything. This lets us restrict to certain parts of speech, words of a minimum length, and non-stopwords. For this analysis we will use a co-occurrence window of 3 while removing stopwords, dropping words with fewer than 3 characters, and keeping only nouns, adjectives, and verbs.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#Define a Dictionary of BiTerms
library(data.table)
library(stopwords)
biterms &amp;lt;- as.data.table(annotated_reviews)
biterms &amp;lt;- biterms[, cooccurrence(x = lemma,
                                  relevant = upos %in% c(&amp;quot;NOUN&amp;quot;, &amp;quot;ADJ&amp;quot;, &amp;quot;VERB&amp;quot;) &amp;amp; 
                                             nchar(lemma) &amp;gt; 2 &amp;amp; !lemma %in% stopwords(&amp;quot;en&amp;quot;),
                                  skipgram = 3),
                   by = list(doc_id)]&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The biterm data set we’ve constructed looks like:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;biterms %&amp;gt;% head(5) %&amp;gt;% knitr::kable()&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;doc_id&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;term1&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;term2&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;cooc&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;like&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;post&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;consume&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;less&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;less&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;data&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;original&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;app&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;get&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;comfortable&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;This states that, in the first review, the word pair (like, post) occurs twice within a 3-word window.&lt;/p&gt;
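To make the windowing concrete, here is a hand-rolled sketch of the biterm idea on a made-up token vector (a rough illustration only, not the actual `cooccurrence()` implementation): every unordered pair of tokens within 3 positions of each other is counted.

```r
# Count unordered word pairs that co-occur within a 3-token window.
tokens <- c("like", "post", "feed", "like", "post")
window <- 3
pairs <- character(0)
for (i in seq_along(tokens)) {
  if (i == length(tokens)) break
  # Pair token i with each token up to `window` positions ahead of it
  for (j in (i + 1):min(i + window, length(tokens))) {
    pairs <- c(pairs, paste(sort(c(tokens[i], tokens[j])), collapse = "-"))
  }
}
counts <- sort(table(pairs), decreasing = TRUE)
counts  # like-post appears 3 times; feed-like and feed-post twice each
```

The `biterms` table above is this same computation, restricted to the relevant lemmas and grouped by `doc_id`.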
&lt;p&gt;Now we can actually construct the biterm model. For simplicity, I’m setting it to train 9 topics. The &lt;code&gt;background = TRUE&lt;/code&gt; setting makes the 1st topic a background topic that reflects the empirical word distribution, filtering out common words (which is why &lt;code&gt;k = 10&lt;/code&gt;):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(123456)

train_data &amp;lt;- annotated_reviews %&amp;gt;% 
  filter(
    upos %in% c(&amp;quot;NOUN&amp;quot;, &amp;quot;ADJ&amp;quot;, &amp;quot;VERB&amp;quot;),
    !lemma %in% stopwords::stopwords(&amp;quot;en&amp;quot;),
     nchar(lemma) &amp;gt; 2
  ) %&amp;gt;%
  select(doc_id, lemma)

btm_model     &amp;lt;- BTM(train_data, biterms = biterms, k = 10, iter = 2000, background = TRUE)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now that we’ve constructed topics, there needs to be a good way to visualize those topics. Fortunately the &lt;code&gt;textplot&lt;/code&gt; package handles this nicely:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(textplot)
library(ggraph)
set.seed(123456)

plot(btm_model, top_n = 10,
     title = &amp;quot;BTM model of IGLite Reviews&amp;quot;,
     labels = c(&amp;quot;&amp;quot;,
                &amp;quot;Reels&amp;quot;,
                &amp;quot;Likes the App&amp;quot;,
                &amp;quot;Takes Too Long&amp;quot;,
                &amp;quot;Can&amp;#39;t Upload&amp;quot;,
                &amp;quot;Dark Mode&amp;quot;, 
                &amp;quot;Bugs&amp;quot;, 
                &amp;quot;Feature Requests&amp;quot;,
                &amp;quot;Uses Less Resources&amp;quot;, 
                &amp;quot;Instagram Lite&amp;quot;))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2021/06/26/what-people-are-sayin-about-instagram-lite/index_files/figure-html/BTM_Vis-1.png&#34; width=&#34;864&#34; /&gt;
From this chart we can see that many people mention bugs and other problems, specifically around uploads; others note how IG Lite consumes less space and data, request new features such as a music sticker option in Stories, or (in large numbers) ask for Dark Mode. And there are people who like it and think it’s a good app.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;sentiment-analysis-withe-dependency-parsing&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Sentiment Analysis with Dependency Parsing&lt;/h3&gt;
&lt;p&gt;Many sentiment analyses use a dictionary method to assign positive and negative sentiment and then aggregate to determine whether a document is “happy” or “sad” or some other emotion. But what gets left on the table is &lt;em&gt;why&lt;/em&gt; the sentiment is positive or negative. In this case, we can see that people gave IG Lite bad ratings or complained about issues, but without reading every review it’s tough to know why.&lt;/p&gt;
&lt;p&gt;This next piece is based on a &lt;a href=&#34;http://www.bnosac.be/index.php/blog/85-you-did-a-sentiment-analysis-with-tidytext-but-you-forgot-to-do-dependency-parsing-to-answer-why-is-something-positive-negative&#34;&gt;bnosac blog post&lt;/a&gt; and will leverage the dependency output from &lt;code&gt;udpipe&lt;/code&gt; to see what words are connected to the words with negative sentiment.&lt;/p&gt;
&lt;p&gt;To determine words with negative sentiment I will need external dictionaries that identify:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Positive vs. Negative words - the base positive vs. negative scoring&lt;/li&gt;
&lt;li&gt;Amplifying and Deamplifying words - words like ‘very’ which make an emotion more intense or ‘barely’ which make an emotion less intense.&lt;/li&gt;
&lt;li&gt;Negators - words like ‘not’ which would flip the sentiment&lt;/li&gt;
&lt;/ul&gt;
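A toy sketch of how these three dictionary types interact (the mini-dictionaries and the 1.8 amplification factor here are invented for illustration, not the sentometrics lists or `txt_sentiment()`’s actual weights):

```r
# Hypothetical mini-dictionaries: a negator flips polarity,
# an amplifier scales it up.
polarity   <- c(good = 1, bad = -1, slow = -1)
negators   <- c("not")
amplifiers <- c("very")

score_token <- function(word, prev_word = NULL) {
  s <- polarity[[word]]
  if (!is.null(prev_word)) {
    if (prev_word %in% negators)   s <- -s        # "not good" -> negative
    if (prev_word %in% amplifiers) s <- s * 1.8   # "very slow" -> more negative
  }
  s
}

score_token("good", "not")   # -1
score_token("slow", "very")  # -1.8
```

`txt_sentiment()` applies this same logic across each review, using the dependency structure rather than simple adjacency.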
&lt;p&gt;For these lists I will get the data used in the &lt;code&gt;sentometrics&lt;/code&gt; &lt;a href=&#34;https://github.com/SentometricsResearch/sentometrics&#34;&gt;package&lt;/a&gt;:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;load(url(&amp;quot;https://github.com/SentometricsResearch/sentometrics/blob/master/data-raw/FEEL_eng_tr.rda?raw=true&amp;quot;))
load(url(&amp;quot;https://github.com/SentometricsResearch/sentometrics/blob/master/data-raw/valence-raw/valShifters.rda?raw=true&amp;quot;))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;and break them up into separate vectors of words:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;polarity_terms &amp;lt;- FEEL_eng_tr %&amp;gt;% transmute(term = x, polarity = y)
polarity_negators &amp;lt;- valShifters$valence_en %&amp;gt;% filter(t==1) %&amp;gt;% pull(x) %&amp;gt;% str_replace_all(&amp;quot;&amp;#39;&amp;quot;,&amp;quot;&amp;quot;)
polarity_amplifiers &amp;lt;- valShifters$valence_en %&amp;gt;% filter(t==2) %&amp;gt;% pull(x) %&amp;gt;% str_replace_all(&amp;quot;&amp;#39;&amp;quot;,&amp;quot;&amp;quot;)
polarity_deamplifiers &amp;lt;- valShifters$valence_en %&amp;gt;% filter(t==3) %&amp;gt;% pull(x) %&amp;gt;% str_replace_all(&amp;quot;&amp;#39;&amp;quot;,&amp;quot;&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Finally, I can use &lt;code&gt;udpipe&lt;/code&gt;’s &lt;code&gt;txt_sentiment&lt;/code&gt; function to use these lists to score my annotated data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;sentiments &amp;lt;- txt_sentiment(annotated_reviews, term = &amp;quot;lemma&amp;quot;, 
                            polarity_terms = polarity_terms,
                            polarity_negators = polarity_negators, 
                            polarity_amplifiers = polarity_amplifiers,
                            polarity_deamplifiers = polarity_deamplifiers)
sentiments &amp;lt;- sentiments$data&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In addition to the initial annotations there are now columns for &lt;code&gt;polarity&lt;/code&gt; (just positive/negative based on the term alone) and &lt;code&gt;sentiment_polarity&lt;/code&gt;, which incorporates the negators, amplifiers, and deamplifiers.&lt;/p&gt;
&lt;p&gt;Now that the reviews have sentiment scores, I’m going to want to find the words that those negative terms modify using &lt;code&gt;cbind_dependencies()&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;reasons &amp;lt;- sentiments %&amp;gt;%
  #Attached Parent Words to Data
  cbind_dependencies() %&amp;gt;%
  #Filter Columns
  select(doc_id, lemma, token, upos, polarity, sentiment_polarity, token_parent, lemma_parent, upos_parent, dep_rel) %&amp;gt;%
  #Keep Only Terms with Negative Sentiment
  filter(sentiment_polarity &amp;lt; 0)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The revised data now looks like:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;head(reasons) %&amp;gt;% knitr::kable()&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;colgroup&gt;
&lt;col width=&#34;6%&#34; /&gt;
&lt;col width=&#34;10%&#34; /&gt;
&lt;col width=&#34;10%&#34; /&gt;
&lt;col width=&#34;4%&#34; /&gt;
&lt;col width=&#34;8%&#34; /&gt;
&lt;col width=&#34;17%&#34; /&gt;
&lt;col width=&#34;11%&#34; /&gt;
&lt;col width=&#34;11%&#34; /&gt;
&lt;col width=&#34;10%&#34; /&gt;
&lt;col width=&#34;7%&#34; /&gt;
&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;doc_id&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;lemma&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;token&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;upos&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;polarity&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;sentiment_polarity&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;token_parent&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;lemma_parent&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;upos_parent&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;dep_rel&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;less&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;less&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;ADJ&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-1&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-1.8&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;data&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;data&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NOUN&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;amod&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;comfortable&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;comfortable&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;ADJ&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-1.0&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;get&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;get&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;VERB&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;xcomp&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;problem&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;problems&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NOUN&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-1&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-1.0&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;one&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;one&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NUM&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;nmod&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;do&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;do&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;AUX&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-1.0&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;like&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;like&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;VERB&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;aux&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;problem&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;problems&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NOUN&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-1&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-1.0&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;solution&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;solution&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NOUN&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;nmod&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;do&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;does&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;AUX&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-1.0&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;work&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;work&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;VERB&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;aux&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;A quick look at the data calls out a problem that exists with all dictionary-based approaches: the analyst has context that a dictionary does not. For example, the term “less data” above is scored as negative because having “less data” would ordinarily be bad… except in the context of Instagram Lite, &lt;strong&gt;using&lt;/strong&gt; “less data” would actually be good.&lt;/p&gt;
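&lt;p&gt;One possible mitigation (a sketch of my own rather than part of the original analysis) is to override the handful of domain-specific terms before re-scoring. Here &lt;code&gt;dplyr::rows_update&lt;/code&gt; (dplyr 1.0+) flips the polarity of “less” for this app’s context, assuming “less” is present in the dictionary as the table above suggests:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Sketch: flip &amp;quot;less&amp;quot; to positive before re-running txt_sentiment,
# since &amp;quot;less data&amp;quot; is a selling point for a Lite app
polarity_terms_patched &amp;lt;- polarity_terms %&amp;gt;%
  rows_update(tibble(term = &amp;quot;less&amp;quot;, polarity = 1), by = &amp;quot;term&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The patched dictionary could then be passed to &lt;code&gt;txt_sentiment&lt;/code&gt; in place of &lt;code&gt;polarity_terms&lt;/code&gt;.&lt;/p&gt;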
&lt;p&gt;To get a better understanding of why we’re seeing negative sentiment, I will construct a network graph between the negative terms and the words they modify, looking for common phrases.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Keep only dependency relationships that are adjectival modifiers 
# (terms that modify a noun / pronoun)
reasons &amp;lt;- filter(reasons, dep_rel %in% &amp;quot;amod&amp;quot;)

# Count Number of occurrences
word_cooccurences &amp;lt;- reasons %&amp;gt;% 
  count(lemma, lemma_parent, name = &amp;quot;cooc&amp;quot;, sort = T) 

# Create the nodes as either the term in the dictionary or a word linked
# to the term in the dictionary
vertices &amp;lt;- bind_rows(
  tibble(key = unique(reasons$lemma)) %&amp;gt;% 
    mutate(in_dictionary = if_else(key %in% polarity_terms$term, 
                                   &amp;quot;in_dictionary&amp;quot;, 
                                   &amp;quot;linked-to&amp;quot;)),
  tibble(key = unique(setdiff(reasons$lemma_parent, reasons$lemma))) %&amp;gt;% 
    mutate(in_dictionary = &amp;quot;linked-to&amp;quot;)
  )

library(ggraph)
library(igraph)

# Keep Top 20 Word Co-Occurrences
cooc &amp;lt;- head(word_cooccurences, 20)
set.seed(123456789)

cooc %&amp;gt;%  
  graph_from_data_frame(vertices = filter(vertices, 
                                          key %in% c(cooc$lemma, 
                                                     cooc$lemma_parent))) %&amp;gt;%
  ggraph(layout = &amp;quot;fr&amp;quot;) +
  geom_edge_link0(aes(edge_alpha = cooc, edge_width = cooc)) +
  geom_node_point(aes(color = in_dictionary), size = 5) +
  geom_node_text(aes(label = name), vjust = 1.8, col = &amp;quot;darkgreen&amp;quot;) +
  scale_color_viridis_d(option = &amp;quot;C&amp;quot;, begin = .2, end = .8) + 
  ggtitle(&amp;quot;Which words are linked to the negative terms&amp;quot;) +
  theme_void()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2021/06/26/what-people-are-sayin-about-instagram-lite/index_files/figure-html/unnamed-chunk-4-1.png&#34; width=&#34;672&#34; /&gt;
In the network we see “less data” as the strongest co-occurrence even though it (and many other words in this group) is not a strictly negative phrase. Some of these connections make sense as negatives, like “slow speed” or “useless app”, which seem unquestionably bad. But some of them don’t make sense to me, like “full screen” being bad. Looking at a few of the sample reviews that mention full screen, though, they are usually in reference to full-screen modes not working. So while the sentiment model does appear to capture that “full screen” is discussed as a negative thing, the graph view above does not make that clear.&lt;/p&gt;
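&lt;p&gt;To sanity-check a node like “full screen”, one option (a quick sketch, assuming the &lt;code&gt;sentence&lt;/code&gt; column from the &lt;code&gt;udpipe&lt;/code&gt; annotation is still present in &lt;code&gt;sentiments&lt;/code&gt;) is to pull the underlying review sentences:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Sketch: pull the sentences where &amp;quot;full&amp;quot; modifies &amp;quot;screen&amp;quot;
sentiments %&amp;gt;%
  cbind_dependencies() %&amp;gt;%
  filter(lemma == &amp;quot;full&amp;quot;, lemma_parent == &amp;quot;screen&amp;quot;) %&amp;gt;%
  distinct(doc_id, sentence) %&amp;gt;%
  head()&lt;/code&gt;&lt;/pre&gt;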
&lt;p&gt;So dependency parsing for sentiment analysis seems like a cool idea but is a bit “your mileage may vary”.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;word-clouds-on-keywords&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Word Clouds on Keywords&lt;/h3&gt;
&lt;p&gt;The last text analysis technique for this post is probably the most well known… wordclouds. A wordcloud shows the most common words in the data set and can be used to understand the set of reviews at a quick glance. But rather than relying on raw word counts, I’ll use the &lt;code&gt;textrank&lt;/code&gt; package to extract relevant keywords from the text, where keywords are defined as combinations of words following each other. To get the most relevant set of keywords, I will limit the terms to nouns, adjectives, and verbs and create a wordcloud of the top 50.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;textrank_keywords(annotated_reviews$lemma,
                  relevant = annotated_reviews$upos %in% c(&amp;#39;NOUN&amp;#39;, &amp;#39;ADJ&amp;#39;, &amp;#39;VERB&amp;#39;)) %&amp;gt;% 
  .$keywords %&amp;gt;% filter(ngram &amp;gt; 1 &amp;amp; freq &amp;gt; 1, !str_detect(keyword, &amp;#39;be&amp;#39;)) %&amp;gt;%
  slice_max(freq, n = 50) %&amp;gt;% 
  with(wordcloud(keyword, freq, max.words = 50, colors = brewer.pal(10, &amp;#39;Dark2&amp;#39;)))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2021/06/26/what-people-are-sayin-about-instagram-lite/index_files/figure-html/word_cloud-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;So what are people saying about IG Lite… they want dark mode, they want music stickers, and they think it’s a good app.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;conclusions&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Conclusions&lt;/h2&gt;
&lt;p&gt;In this post I leveraged the Google Play reviews that were scraped back in April to analyze the ratings and the review text using some less well-known (at least in my opinion) NLP packages: a modified version of topic modeling with biterm models, a modified version of sentiment analysis with dependency parsing, and a modified version of a word cloud using keyword extraction.&lt;/p&gt;
&lt;p&gt;As far as answering the question of what people are saying about IG Lite: it seems really mixed. In terms of star ratings, things appeared to start very rough in Jan / Feb but improved through March and April. From the topic models, some people like that it’s less resource-intensive than “Instagram Heavy” while others find it buggy and lacking features. From the sentiment analysis, this polarized view can be summed up in the nodes that formed “Good App”, “Good Enough”, and “Useless App”, such that there’s no dominant sentiment.&lt;/p&gt;
&lt;p&gt;Except Dark Mode… give the people dark mode.&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>How have the AFI Top 30 Movies Changed Between 1998 and 2007?</title>
      <link>https://jlaw.netlify.app/2021/05/16/how-has-the-afi-top-30-movies-changed-between-1998-and-2007/</link>
      <pubDate>Sun, 16 May 2021 00:00:00 +0000</pubDate>
      <guid>https://jlaw.netlify.app/2021/05/16/how-has-the-afi-top-30-movies-changed-between-1998-and-2007/</guid>
      <description>
&lt;script src=&#34;https://jlaw.netlify.app/2021/05/16/how-has-the-afi-top-30-movies-changed-between-1998-and-2007/index_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;p&gt;During COVID I’ve started watching some older “classic” movies that I hadn’t seen before but felt for whatever reason I &lt;em&gt;should&lt;/em&gt; have seen as a movie fan. Last week, I had watched &lt;a href=&#34;https://www.imdb.com/title/tt0041959/&#34;&gt;The Third Man&lt;/a&gt; after listening to a podcast about &lt;a href=&#34;https://www.theringer.com/2021/5/4/22418490/top-five-spy-movies-and-without-remorse&#34;&gt;Spy Movies&lt;/a&gt;. After watching it I was surprised to find out that while it was named the &lt;a href=&#34;https://en.wikipedia.org/wiki/BFI_Top_100_British_films&#34;&gt;Top British Film of All-Time&lt;/a&gt; it is &lt;strong&gt;NOT&lt;/strong&gt; in the AFI Top 100 list that was refreshed in 2007. However, it was in the original list of 1998.&lt;/p&gt;
&lt;p&gt;This got me thinking about what were all the differences between the original 1998 list and the revised 2007 list. And while the results are laid out clearly in the &lt;a href=&#34;https://en.wikipedia.org/wiki/AFI%27s_100_Years...100_Movies&#34;&gt;Wikipedia table&lt;/a&gt;, I thought it would be fun to try out a visualization using bump charts. This post utilizes the &lt;code&gt;ggbump&lt;/code&gt; package to make the bump chart, and much of the code and style in this post is influenced by the package &lt;a href=&#34;https://cran.r-project.org/web/packages/ggbump/readme/README.html&#34;&gt;README&lt;/a&gt;.&lt;/p&gt;
&lt;div id=&#34;libraries&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Libraries&lt;/h2&gt;
&lt;p&gt;The main parts of this post will be:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Scraping the table from Wikipedia using &lt;code&gt;rvest&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Doing some light transformations with &lt;code&gt;dplyr&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Doing the plotting with &lt;code&gt;ggplot2&lt;/code&gt;, &lt;code&gt;ggbump&lt;/code&gt;, and a couple of other packages for fonts.&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(rvest)
library(tidyverse)
library(glue)
library(ggbump)
library(ggtext)
library(showtext)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;When making the plot I wanted to leverage the font that’s actually used on the American Film Institute web-page which turned out to be the Google Font &lt;em&gt;Nunito&lt;/em&gt;. Using the &lt;code&gt;showtext&lt;/code&gt; package, I can install the Google fonts into the R session and load them for use in plotting. The function &lt;code&gt;font_add_google&lt;/code&gt; from the &lt;code&gt;showtext&lt;/code&gt; package takes two arguments, the name of the Google Font and a family alias that can be used to refer to the font later. For example, in the code below, I’ll be referring to “Nunito” as the “afi” family later on. The &lt;code&gt;showtext_auto&lt;/code&gt; call allows for the family aliases to be used in future code.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Load Google Font
font_add_google(&amp;quot;Nunito&amp;quot;, &amp;quot;afi&amp;quot;)
font_add_google(&amp;quot;Roboto&amp;quot;, &amp;quot;rob&amp;quot;)
showtext_auto()&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;scraping-the-data&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Scraping the Data&lt;/h2&gt;
&lt;p&gt;The data for the original and new AFI Top 100 Lists are in the same table on the &lt;a href=&#34;https://en.wikipedia.org/wiki/AFI%27s_100_Years...100_Movies&#34;&gt;AFI 100 Years… 100 Movies Wikipedia table&lt;/a&gt;. I’ll be using &lt;code&gt;rvest&lt;/code&gt; to grab the table and import it into a tibble. I do this by providing &lt;code&gt;rvest&lt;/code&gt; with the URL using &lt;code&gt;read_html&lt;/code&gt;, search for a specific CSS class with &lt;code&gt;html_element&lt;/code&gt; and then extract the information from the table with &lt;code&gt;html_table&lt;/code&gt;. Since &lt;code&gt;rvest&lt;/code&gt; will take the column names exactly from the table, which will include spaces, I’ll use the &lt;code&gt;janitor::clean_names()&lt;/code&gt; function to replace spaces with underscores and add characters before names that start with numbers. &lt;em&gt;1998 Rank&lt;/em&gt; will then become &lt;em&gt;x1998_rank&lt;/em&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;tbl &amp;lt;- read_html(&amp;#39;https://en.wikipedia.org/wiki/AFI%27s_100_Years...100_Movies&amp;#39;) %&amp;gt;%
  html_element(css = &amp;#39;.sortable&amp;#39;) %&amp;gt;%
  html_table() %&amp;gt;%
  janitor::clean_names()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The first three rows of this data set will look like:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;film&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;release_year&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;director&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;x1998_rank&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;x2007_rank&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;change&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Citizen Kane&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1941&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Orson Welles&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Casablanca&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1942&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Michael Curtiz&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;3&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;The Godfather&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1972&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Francis Ford Coppola&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;3&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;
&lt;div id=&#34;data-transformation&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Data Transformation&lt;/h2&gt;
&lt;p&gt;In order to get the data ready for use in ggplot there are a few data transformation steps that need to happen:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;I’d like the labels for the plot to include both the title of the film as well as its year of release. I will use the &lt;code&gt;glue&lt;/code&gt; package to easily combine the &lt;em&gt;film&lt;/em&gt; and &lt;em&gt;release_year&lt;/em&gt; columns.&lt;/li&gt;
&lt;li&gt;I want to clean up the rows for movies that aren’t in both lists by replacing the “-” label with &lt;code&gt;NA&lt;/code&gt;s. This is done using &lt;code&gt;across()&lt;/code&gt; and &lt;code&gt;na_if&lt;/code&gt; to replace the “-” characters in the two rank columns with &lt;code&gt;NA&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;I need to turn the tibble from wide format to long format with &lt;code&gt;pivot_longer&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Finally, I want rank to be an integer and I want to remove the leading “x” character from the year column&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;tbl2 &amp;lt;- tbl %&amp;gt;% 
  mutate(title_lbl = glue(&amp;quot;{film} ({release_year})&amp;quot;),
         across(ends_with(&amp;#39;rank&amp;#39;), ~na_if(., &amp;quot;—&amp;quot;))
  ) %&amp;gt;%
  pivot_longer(
    cols = contains(&amp;#39;rank&amp;#39;),
    names_to = &amp;#39;year&amp;#39;,
    values_to = &amp;#39;rank&amp;#39;
  ) %&amp;gt;%
  mutate(year = str_remove_all(year, &amp;#39;\\D+&amp;#39;) %&amp;gt;% as.integer(),
         rank = as.integer(rank))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now we have one row for each instance of a movie on each list. For example, Citizen Kane appears in both lists, so it appears in two rows in the data.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;film&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;release_year&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;director&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;change&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;title_lbl&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;year&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;rank&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Citizen Kane&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1941&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Orson Welles&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Citizen Kane (1941)&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1998&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Citizen Kane&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1941&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Orson Welles&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Citizen Kane (1941)&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2007&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Casablanca&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1942&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Michael Curtiz&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Casablanca (1942)&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1998&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;
&lt;div id=&#34;creating-the-plot&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Creating the Plot&lt;/h2&gt;
&lt;p&gt;In order to make the plot readable, I’ll only be looking at the Top 30 films rather than the full Top 100. I’ll be using a bump chart to do the comparison. Bump charts are a visualization technique well suited to showing changes in rank over time. The &lt;code&gt;ggbump&lt;/code&gt; package provides a &lt;code&gt;ggplot2&lt;/code&gt; geom (&lt;code&gt;geom_bump&lt;/code&gt;) to handle the lines for a bump chart. Movies that appear in only one list will not have a line.&lt;/p&gt;
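&lt;p&gt;As a minimal illustration of what &lt;code&gt;geom_bump&lt;/code&gt; does on its own (toy data, not the AFI lists):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(ggplot2)
library(ggbump)

# Three fake films ranked in two years; A and B swap places
toy &amp;lt;- data.frame(
  year = rep(c(1998, 2007), each = 3),
  rank = c(1, 2, 3, 2, 1, 3),
  film = rep(c(&amp;quot;A&amp;quot;, &amp;quot;B&amp;quot;, &amp;quot;C&amp;quot;), times = 2)
)

ggplot(toy, aes(x = year, y = rank, color = film)) +
  geom_point(size = 4) +
  geom_bump(size = 1) +
  scale_y_reverse()  # rank 1 at the top&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Because films A and B trade ranks between the two years, their lines cross; that crossing is the whole point of the chart.&lt;/p&gt;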
&lt;p&gt;As for what the code does, the first section above the &lt;code&gt;theme()&lt;/code&gt; call does most of the work, adding the points, lines, and titles as well as scaling the axes to the right sizes. Note that in the &lt;code&gt;geom_text&lt;/code&gt; calls, I’m using &lt;em&gt;family = ‘rob’&lt;/em&gt; to refer to the Roboto font downloaded earlier. The &lt;code&gt;theme()&lt;/code&gt; call handles a lot of the formatting, and the &lt;code&gt;geom_text()&lt;/code&gt; and &lt;code&gt;geom_point()&lt;/code&gt; calls after the &lt;code&gt;theme()&lt;/code&gt; section create the white circles that contain the ranks.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;##Plot
num_films = 30

tbl2 %&amp;gt;%
  filter(rank &amp;lt;= num_films) %&amp;gt;%
  ggplot(aes(x = year, y = rank, color = title_lbl)) +
  #Add Dots
  geom_point(size = 5) +
  #Add Titles
  geom_text(data = . %&amp;gt;% filter(year == min(year)),
            aes(x = year - .5, label = title_lbl), size = 5, hjust = 1, family = &amp;#39;rob&amp;#39;) +
  geom_text(data = . %&amp;gt;% filter(year == max(year)),
            aes(x = year + .5, label = title_lbl), size = 5, hjust = 0, family = &amp;#39;rob&amp;#39;) +
  # Add Bump Lines
  geom_bump(size = 2, smooth = 8) +
  
  # Resize Axes
  scale_x_continuous(limits = c(1990, 2014),
                     breaks = c(1998, 2007),
                     position = &amp;#39;top&amp;#39;) +
  scale_y_reverse() +
  labs(title = glue(&amp;quot;How has the AFI Top {num_films} Movies Changed Between Lists&amp;quot;),
       subtitle = &amp;quot;Comparing 1998 and 2007s lists&amp;quot;,
       caption = &amp;quot;***Source:*** Wikipedia&amp;quot;,
       x = &amp;quot;List Year&amp;quot;,
       y = &amp;quot;Rank&amp;quot;) + 
  # Set Colors and Sizes
  theme(
    text = element_text(family = &amp;#39;afi&amp;#39;),
    legend.position = &amp;quot;none&amp;quot;,
    panel.grid = element_blank(),
    plot.title = element_text(hjust = .5, color = &amp;quot;white&amp;quot;, size = 20),
    plot.caption = element_markdown(hjust = 1, color = &amp;quot;white&amp;quot;, size = 12),
    plot.subtitle = element_text(hjust = .5, color = &amp;quot;white&amp;quot;, size = 18),
    axis.line = element_blank(),
    axis.ticks = element_blank(),
    axis.text.y = element_blank(),
    axis.title.y = element_blank(),
    axis.text.x = element_text(face = 2, color = &amp;quot;white&amp;quot;, size = 18),
    panel.background = element_rect(fill = &amp;quot;black&amp;quot;),
    plot.background = element_rect(fill = &amp;quot;black&amp;quot;)
  ) + 
  ## Add in the Ranks with the Circles
  geom_point(data = tibble(x = 1990.5, y = 1:num_films), aes(x = x, y = y), 
             inherit.aes = F,
             color = &amp;quot;white&amp;quot;,
             size = 7,
             pch = 21) +
  geom_text(data = tibble(x = 1990.5, y = 1:num_films), aes(x = x, y = y, label = y), 
            inherit.aes = F,
            color = &amp;quot;white&amp;quot;,
            fontface = 2,
            family = &amp;#39;rob&amp;#39;) + 
  geom_point(data = tibble(x = 2013.5, y = 1:num_films), aes(x = x, y = y), 
             inherit.aes = F,
             color = &amp;quot;white&amp;quot;,
             size = 7,
             pch = 21) +
  geom_text(data = tibble(x = 2013.5, y = 1:num_films), aes(x = x, y = y, label = y), 
            inherit.aes = F,
            color = &amp;quot;white&amp;quot;,
            fontface = 2,
            family = &amp;#39;rob&amp;#39;) &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2021/05/16/how-has-the-afi-top-30-movies-changed-between-1998-and-2007/index_files/figure-html/plot-1.png&#34; width=&#34;768&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;conclusion&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;While the Wikipedia page tells you exactly what changed between the two lists, it provided an opportunity for me to get some practice making “nicer” looking ggplot charts and to try out a bump chart with the &lt;code&gt;ggbump&lt;/code&gt; package. As for interpreting the chart, there are a couple of things I don’t really understand between the two lists: mainly why Raging Bull suddenly jumps from the 24th best film to the 4th, or why City Lights jumps 65 places from 76th to 11th. I guess I’ll just have to watch and find out.&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Scraping Google Play Reviews with RSelenium</title>
      <link>https://jlaw.netlify.app/2021/05/03/scraping-google-play-reviews-with-rselenium/</link>
      <pubDate>Mon, 03 May 2021 00:00:00 +0000</pubDate>
      <guid>https://jlaw.netlify.app/2021/05/03/scraping-google-play-reviews-with-rselenium/</guid>
      <description>
&lt;script src=&#34;https://jlaw.netlify.app/2021/05/03/scraping-google-play-reviews-with-rselenium/index_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;div id=&#34;when-normal-web-scraping-just-wont-work&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;When Normal Web Scraping Just Won’t Work&lt;/h2&gt;
&lt;p&gt;I’ve &lt;a href=&#34;https://jlaw.netlify.app/2020/12/01/exploring-nhl-stanley-cup-champion-s-points-percentage-in-four-ggplots/&#34;&gt;used&lt;/a&gt; &lt;a href=&#34;https://jlaw.netlify.app/2020/11/24/what-s-the-most-successful-dancing-with-the-stars-profession-visualizing-with-gt/&#34;&gt;rvest&lt;/a&gt; in &lt;a href=&#34;https://jlaw.netlify.app/2020/09/07/covid-19s-impact-on-the-nyc-subway-system/&#34;&gt;numerous&lt;/a&gt; &lt;a href=&#34;https://jlaw.netlify.app/2020/07/04/a-racing-barplot-of-top-us-baby-names-1880-2018/&#34;&gt;posts&lt;/a&gt; to scrape information from static websites or through forms to get data. However, some websites don’t have static data that can be downloaded by just scraping the HTML. Google Play Store reviews are one of these sources.&lt;/p&gt;
&lt;p&gt;Reviews on the Google Play Store have what I call a semi-infinite scroll where, as you reach the bottom of the page, the site loads the next batch of reviews. However, a special wrinkle in the Play Store page is that after a few loads the user is prompted to click a button to load the next batch of reviews.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;show_more.PNG&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;selenium-to-the-rescue&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Selenium to the Rescue&lt;/h2&gt;
&lt;p&gt;Selenium is a tool that automates a browser. It’s often used for writing automated tests for websites, but in this instance it can be used to mimic a user’s browser behavior, loading up a bunch of Play Store reviews that we can then scrape with &lt;code&gt;rvest&lt;/code&gt; in the conventional fashion.&lt;/p&gt;
&lt;p&gt;Selenium and its R package &lt;code&gt;RSelenium&lt;/code&gt; allow a user to interact with a browser through their programming language of choice. Since this is an R blog, I’ll be using R to control the browser.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;scraping-instagram-lite-reviews&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Scraping Instagram Lite Reviews&lt;/h2&gt;
&lt;p&gt;Instagram Lite is a recently launched product whose &lt;a href=&#34;https://techcrunch.com/2020/05/11/instagram-lite-shuts-down-in-advance-of-a-relaunch/&#34;&gt;“goal was to offer a smaller download that takes up less space on a mobile device — a feature that specifically caters to users in emerging markets, where storage space is a concern”&lt;/a&gt;. Since this is a relatively new product, it would be fun to see how it’s doing. This first post will cover how to use &lt;code&gt;RSelenium&lt;/code&gt; to actually get the data; the analysis will be covered in a follow-up post.&lt;/p&gt;
&lt;div id=&#34;part-1-loading-libraries&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Part 1: Loading Libraries&lt;/h3&gt;
&lt;p&gt;The four libraries used for this data acquisition project are &lt;code&gt;RSelenium&lt;/code&gt;, which allows for manipulating a browser through R; &lt;code&gt;tidyverse&lt;/code&gt; for constructing the data structure; &lt;code&gt;lubridate&lt;/code&gt; to handle the dates in the reviews; and &lt;code&gt;rvest&lt;/code&gt; to scrape the HTML after we’re done loading all the reviews with Selenium.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(RSelenium)
library(tidyverse)
library(lubridate)
library(rvest)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;part-2-start-rselenium&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Part 2: Start RSelenium&lt;/h3&gt;
&lt;p&gt;A browser session gets started by calling &lt;code&gt;rsDriver&lt;/code&gt; from the &lt;code&gt;RSelenium&lt;/code&gt; package. While &lt;code&gt;RSelenium&lt;/code&gt; can work with Chrome, Firefox, or PhantomJS, I’ve personally found that working with Firefox is the path of least resistance. With Chrome you need to match the chromedriver version between &lt;code&gt;RSelenium&lt;/code&gt; and the Chrome browser, and I’ve never successfully pulled that off. With Firefox you can just set &lt;code&gt;browser=&#34;firefox&#34;&lt;/code&gt; and it just works.&lt;/p&gt;
&lt;p&gt;The first time you run &lt;code&gt;RSelenium&lt;/code&gt; you can’t have &lt;code&gt;check=F&lt;/code&gt;, since the check step downloads the drivers that it needs to work. After that first run you can set &lt;code&gt;check=F&lt;/code&gt; to skip those checks. The &lt;code&gt;verbose=F&lt;/code&gt; option suppresses excess messaging.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;rsDriver&lt;/code&gt; function will start both a Selenium server and the remote Firefox browser. It returns both a server and a client; the client gets assigned to &lt;code&gt;remDr&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;rD &amp;lt;- rsDriver(browser = &amp;quot;firefox&amp;quot;, 
               port = 6768L, 
               #If Running RSelenium for the First Time, you can&amp;#39;t have check =F
               #since you&amp;#39;ll need to download the appropriate drivers
               check = F, 
               verbose = F
)
remDr &amp;lt;- rD[[&amp;quot;client&amp;quot;]]&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If everything goes to plan a new Firefox window will open and the address bar will be “oranged” out.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;remote_firefox.PNG&#34; /&gt;&lt;/p&gt;
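&lt;p&gt;If the window doesn’t appear, you can also check from the R side that the server and client are talking to each other. As a minimal sketch (the &lt;code&gt;getStatus()&lt;/code&gt; method comes from &lt;code&gt;RSelenium&lt;/code&gt;’s remoteDriver class; the exact fields in the returned list depend on your Selenium version):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#Ask the Selenium server for its status
status &amp;lt;- remDr$getStatus()

#Inspect the returned list (e.g. build/os info or a ready flag)
str(status)&lt;/code&gt;&lt;/pre&gt;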
&lt;/div&gt;
&lt;div id=&#34;part-3-browse-to-the-instagram-lite-google-play-reviews-page&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Part 3: Browse to the Instagram Lite Google Play Reviews Page&lt;/h3&gt;
&lt;p&gt;This part is straightforward: I create a &lt;code&gt;url&lt;/code&gt; variable with the desired URL as a string and then use the remote driver &lt;code&gt;remDr&lt;/code&gt; to tell the browser to navigate to that page.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#Navigate to URL
url &amp;lt;- &amp;quot;https://play.google.com/store/apps/details?id=com.instagram.lite&amp;amp;hl=en_US&amp;amp;showAllReviews=true&amp;quot;
remDr$navigate(url)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If all goes well the Firefox browser that had opened should now have loaded the Google Play page for Instagram Lite. There will also be a little robot icon on the address bar to show that the browser is under remote control.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;iglite_page.PNG&#34; /&gt;&lt;/p&gt;
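&lt;p&gt;Rather than eyeballing the browser window, you can also confirm the navigation from R. A quick sketch using &lt;code&gt;RSelenium&lt;/code&gt;’s &lt;code&gt;getCurrentUrl()&lt;/code&gt; and &lt;code&gt;getTitle()&lt;/code&gt; methods:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#The current URL should match the reviews page we navigated to
remDr$getCurrentUrl()[[1]]

#The page title should mention Instagram Lite
remDr$getTitle()[[1]]&lt;/code&gt;&lt;/pre&gt;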
&lt;/div&gt;
&lt;div id=&#34;part-4-loading-a-bunch-of-reviews&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Part 4: Loading A Bunch of Reviews&lt;/h3&gt;
&lt;p&gt;This section is the meat and potatoes of working with Selenium, where we’ll write a script to tell the browser what to do. In summary, this code block will:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Identify the body of the webpage&lt;/li&gt;
&lt;li&gt;Send the “end” key to the browser to move to the bottom of the body&lt;/li&gt;
&lt;li&gt;Check if the “SHOW MORE” button exists on the screen and wait 2 seconds&lt;/li&gt;
&lt;li&gt;If the button exists, find the element and click it.&lt;/li&gt;
&lt;li&gt;Wait 3 seconds to let new reviews load and then repeat from Step 2&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;I repeat this loop 50 times to try to get enough data for analysis. If the browser isn’t running headlessly then you can switch to the remote browser window and watch everything in action (but be careful, because manually interacting with the webpage can interfere with the intended function of the script).&lt;/p&gt;
&lt;p&gt;Figuring out the right class for the button (RveJvd) took some guess-and-check work from inspecting the page; however, I believe all Google Play review pages use the same classes, so this code &lt;em&gt;should&lt;/em&gt; be adaptable to other apps. But YMMV.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Note:&lt;/em&gt; I originally wanted to run this 100 times to try to get more reviews but I kept winding up with an error of &lt;code&gt;unexpected end of hex escape at line 1 column 15497205&lt;/code&gt; that I was unable to debug. So I stuck with 50. But if anyone knows how to avoid that error please let me know in the comments.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#Find Page Element for Body
webElem &amp;lt;- remDr$findElement(&amp;quot;css&amp;quot;, &amp;quot;body&amp;quot;)

#Page to the End
for(i in 1:50){
  message(paste(&amp;quot;Iteration&amp;quot;,i))
  webElem$sendKeysToElement(list(key = &amp;quot;end&amp;quot;))
  #Check for the Show More Button
  element&amp;lt;- try(unlist(remDr$findElement(&amp;quot;class name&amp;quot;, &amp;quot;RveJvd&amp;quot;)$getElementAttribute(&amp;#39;class&amp;#39;)),
                silent = TRUE)
  
  #If Button Is There Then Click It
  Sys.sleep(2)
  if(str_detect(element, &amp;quot;RveJvd&amp;quot;) == TRUE){
    buttonElem &amp;lt;- remDr$findElement(&amp;quot;class name&amp;quot;, &amp;quot;RveJvd&amp;quot;)
    buttonElem$clickElement()
  }
  
  #Sleep to Let Things Load
  Sys.sleep(3)
}&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;part-5-scraping-the-page&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Part 5: Scraping the Page&lt;/h3&gt;
&lt;p&gt;Now that we’ve scrolled and pushed buttons and scrolled some more to get a bunch of reviews to load on the screen, it’s time to scrape the reviews.&lt;/p&gt;
&lt;p&gt;We can extract the HTML from the remote browser using &lt;code&gt;getPageSource()&lt;/code&gt; and &lt;code&gt;read_html()&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;##Scrape in HTML Objects
html_obj &amp;lt;- remDr$getPageSource(header = TRUE)[[1]] %&amp;gt;% read_html()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now that we have the HTML we no longer need the remote Firefox browser or the Selenium server, so we can shut those down. &lt;a href=&#34;https://github.com/ropensci/RSelenium/issues/228&#34;&gt;There have been issues with the Java process remaining open&lt;/a&gt; even after calling the stop-server functions, so I issue a system command to kill the Java process.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#Shut Down Client and Server
remDr$close()
rD$server$stop()
system(&amp;quot;taskkill /im java.exe /f&amp;quot;, intern=FALSE, ignore.stdout=FALSE)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;part-6-extracting-the-various-parts-of-the-review&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Part 6: Extracting the Various Parts of the Review&lt;/h3&gt;
&lt;p&gt;If we look at a single review, there are a number of different elements we’d like to extract.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;review_img.PNG&#34; /&gt;&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;The Reviewer Name&lt;/li&gt;
&lt;li&gt;Number of Stars&lt;/li&gt;
&lt;li&gt;Date of Review&lt;/li&gt;
&lt;li&gt;Number of Upvotes&lt;/li&gt;
&lt;li&gt;Full Text of the Review&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This piece took a bit of guess-and-check work with &lt;code&gt;rvest&lt;/code&gt;, looking at the CSS selectors on the page to identify the classes for the pieces that I wanted and extracting them with &lt;code&gt;html_elements()&lt;/code&gt;, &lt;code&gt;html_attr()&lt;/code&gt;, and &lt;code&gt;html_text()&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# 1) Reviewer Name
names &amp;lt;- html_obj %&amp;gt;% html_elements(&amp;quot;span.X43Kjb&amp;quot;) %&amp;gt;% html_text()

# 2) Number of Stars
stars &amp;lt;- html_obj %&amp;gt;% html_elements(&amp;quot;.kx8XBd .nt2C1d [role=&amp;#39;img&amp;#39;]&amp;quot;)%&amp;gt;% 
  html_attr(&amp;quot;aria-label&amp;quot;) %&amp;gt;% 
  #Remove everything that&amp;#39;s not numeric
  str_remove_all(&amp;#39;\\D+&amp;#39;) %&amp;gt;% 
  # Convert to Integer
  as.integer()

#3) Date of Review
dates &amp;lt;- html_obj %&amp;gt;% html_elements(&amp;quot;.p2TkOb&amp;quot;) %&amp;gt;% 
  html_text() %&amp;gt;% 
  # Convert to a Date
  mdy()

#4) How many helpful clicks
clicks &amp;lt;- html_obj %&amp;gt;% html_elements(&amp;#39;div.jUL89d.y92BAb&amp;#39;) %&amp;gt;% 
  html_text() %&amp;gt;% 
  #Convert to Integer
  as.integer()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For the text of the review itself there is one wrinkle. In the image above the beginning of the review is shown, but it is truncated, and a “Full Review” button needs to be clicked to show the full review. Fortunately, this shows up in the data as “&lt;Text Preview&gt; …Full Review&lt;The Actual Full Review&gt;”. So in cases where the initial review is truncated, all we need to do is grab all the text that comes after the string “Full Review”:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# 5) Full Text of the Review
reviews &amp;lt;- html_obj %&amp;gt;% html_elements(&amp;quot;.UD7Dzf&amp;quot;) %&amp;gt;% html_text() 

###Deal with the &amp;quot;Full Review&amp;quot; Issue where text is duplicated
reviews &amp;lt;- if_else(
  #If the review is truncated
  str_detect(reviews, &amp;#39;\\.\\.\\.Full Review&amp;#39;),
  #Grab all the Text After the string &amp;#39;...Full Review&amp;#39;
  str_sub(reviews, 
          start = str_locate(reviews, &amp;#39;\\.\\.\\.Full Review&amp;#39;)[, 2]+1
          ),
  #Else remove the leading space from the review as is
  str_trim(reviews)
  )&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;part-7-combine-and-save-the-data-set&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Part 7: Combine and Save the Data Set&lt;/h3&gt;
&lt;p&gt;With each piece of the review individually extracted we’ll combine the vectors in a tibble and save the file for the analysis in the next part.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# create the df with all the info
review_data &amp;lt;- tibble(
  names = names, 
  stars = stars, 
  dates = dates, 
  clicks = clicks,
  reviews = reviews
  ) 

saveRDS(review_data, &amp;#39;data/review_data.RDS&amp;#39;)
write_csv(review_data, &amp;#39;data/review_data.csv&amp;#39;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Just to make sure everything is working we’ll compare an actual review to our data:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;show_more.PNG&#34; /&gt;&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;review_data %&amp;gt;%
  filter(names %in% c(&amp;#39;Sushil Uk07&amp;#39;, &amp;#39;Hana Hoey&amp;#39;)) %&amp;gt;%
  knitr::kable()&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;colgroup&gt;
&lt;col width=&#34;9%&#34; /&gt;
&lt;col width=&#34;4%&#34; /&gt;
&lt;col width=&#34;8%&#34; /&gt;
&lt;col width=&#34;5%&#34; /&gt;
&lt;col width=&#34;71%&#34; /&gt;
&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;names&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;stars&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;dates&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;clicks&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;reviews&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Sushil Uk07&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;3&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2021-04-15&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Good ,but it’s doesn’t Have option to put music in stories&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Hana Hoey&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;3&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2021-04-21&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;in features, this app has already including important things. but the movement is very slow&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;And there you have it. We used Selenium to have a browser scroll for a while to load a bunch of reviews, extracted the data with &lt;code&gt;rvest&lt;/code&gt; and then combined and saved the data. In the next post we’ll use this data to understand what downloaders think about Instagram Lite.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;appendix&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Appendix:&lt;/h2&gt;
&lt;p&gt;In this post the Firefox browser was actually displayed, which is a useful way to see what the code is doing. But if you don’t want to see the browser, you can pass extra parameters to the &lt;code&gt;rsDriver&lt;/code&gt; function to keep it hidden:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;rsDriver(browser = &amp;quot;firefox&amp;quot;, 
         port = 6768L, 
         check = F, 
         verbose = F, 
         #Run the Browser Headlessly
         extraCapabilities = 
           list(&amp;quot;moz:firefoxOptions&amp;quot; = 
                  list(
                    args = list(&amp;#39;--headless&amp;#39;)
                    )
                )
         )&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>What % of Manhattan Did I Run Through?</title>
      <link>https://jlaw.netlify.app/2021/04/15/what-of-manhattan-did-i-run-through/</link>
      <pubDate>Thu, 15 Apr 2021 00:00:00 +0000</pubDate>
      <guid>https://jlaw.netlify.app/2021/04/15/what-of-manhattan-did-i-run-through/</guid>
      <description>
&lt;script src=&#34;https://jlaw.netlify.app/2021/04/15/what-of-manhattan-did-i-run-through/index_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;p&gt;&lt;img src=&#34;choropleth.PNG&#34; /&gt;&lt;/p&gt;
&lt;p&gt;In a &lt;a href=&#34;https://jlaw.netlify.app/2021/04/01/heatmapping-my-new-york-city-marathon-training/&#34;&gt;previous post&lt;/a&gt; I created a cool-looking (in my opinion) heatmap of my Marathon training from years back. One of the downsides to that density-based method of making the heatmap was that routes I only ran once didn’t show up very clearly. I also wanted to know roughly what % of Manhattan I covered in my runs. This post will use that same data to create a choropleth map by Census Tract, both to visualize all the tracts I passed through in my training and to determine what % of Manhattan’s land area I covered.&lt;/p&gt;
&lt;div id=&#34;libraries-used&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Libraries Used&lt;/h2&gt;
&lt;p&gt;The packages used in this analysis are the same as in the prior analysis: &lt;code&gt;tidyverse&lt;/code&gt; for data manipulation, &lt;code&gt;sf&lt;/code&gt; for modifying spatial data, &lt;code&gt;tigris&lt;/code&gt; for getting the basemaps to plot my routes, and &lt;code&gt;extrafont&lt;/code&gt; to bring in new fonts for the plots.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidyverse) # Data Manipulation
library(sf) # Manipulation Spatial Data
library(tigris) # Getting Tract and Roads Spatial Data
library(extrafont) # Better Fonts For GGPLOT&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;data-used&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Data Used&lt;/h2&gt;
&lt;p&gt;The data is also the same running route data from the prior post. For more details on its creation please reference the &lt;a href=&#34;https://jlaw.netlify.app/2021/04/01/heatmapping-my-new-york-city-marathon-training/&#34;&gt;prior post&lt;/a&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;runs_and_routes &amp;lt;- readRDS(&amp;#39;data/runs_and_routes.RDS&amp;#39;)
all_routes &amp;lt;- readRDS(&amp;#39;data/all_routes.RDS&amp;#39;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For the basemap I’m again using the &lt;code&gt;tigris&lt;/code&gt; package however this time getting census tracts rather than roads. According to the package, &lt;em&gt;Census tracts generally have a population size between 1,200 and 8,000 people, with an optimum size of 4,000 people&lt;/em&gt;. The map is downloaded using the &lt;code&gt;tracts()&lt;/code&gt; function with inputs for state and county.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;nyc_tracts &amp;lt;- tracts(&amp;quot;NY&amp;quot;, &amp;quot;New York&amp;quot;, cb = T) %&amp;gt;% 
  st_transform(crs = st_crs(runs_and_routes$geometry))&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot() + geom_sf(data = nyc_tracts) + ggthemes::theme_map()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2021/04/15/what-of-manhattan-did-i-run-through/index_files/figure-html/map1-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Unlike the &lt;a href=&#34;https://jlaw.netlify.app/2021/04/01/heatmapping-my-new-york-city-marathon-training/&#34;&gt;prior analysis&lt;/a&gt; where the heatmap was just overlaid atop the map, here I need to identify which census tracts contained a route I ran vs. which didn’t. This can be done using the &lt;code&gt;st_join&lt;/code&gt; function, specifying it to be a left join, and specifying the join type as &lt;code&gt;st_intersects&lt;/code&gt;, which joins the route information if the lat/long is contained in the census tract. The data is then grouped by the tract ID and some other tract metadata. Then I create a field for the number of routes contained in each census tract, which will be used for the choropleth.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#Join Routes to Tracts by Intersecting
nyc_geo_join &amp;lt;- nyc_tracts %&amp;gt;% 
  st_join(all_routes %&amp;gt;% distinct(route_id, geometry),
          join = st_intersects,
          left = T
          ) %&amp;gt;% 
  group_by(
    TRACTCE, #Census Tract ID
    ALAND, #Land Area
    AWATER #Water Area
  ) %&amp;gt;% 
  summarize(num_routes = n_distinct(route_id, na.rm = T), .groups = &amp;#39;drop&amp;#39;) %&amp;gt;% 
  #Set 0 Routes to NA colored
  mutate(num_routes = if_else(num_routes == 0, NA_integer_, num_routes))&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;visualization&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Visualization&lt;/h2&gt;
&lt;p&gt;The choropleth provides an alternative to the heatmap that better shows each census tract that &lt;strong&gt;at least one&lt;/strong&gt; of my routes passed through. Rarely-run routes did not show up on the heatmap, but they will be clearer here.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot() + 
  geom_sf(data = nyc_geo_join, 
          aes(fill = num_routes)) + 
  scale_fill_viridis_c(na.value = &amp;quot;grey90&amp;quot;, guide = F) + 
  coord_sf(xlim = c(-74.15, -73.8)) + 
  labs(title = paste0(&amp;quot;Census Tracts I&amp;#39;ve &amp;quot;,emo::ji(&amp;#39;running&amp;#39;),&amp;quot; Through&amp;quot;),
       fill = &amp;quot;# of Routes Run&amp;quot;,
       caption = &amp;quot;**Author:** JLaw&amp;quot;) + 
  ggthemes::theme_map() + 
  theme(
    plot.title = element_text(size = 18, family = &amp;#39;Arial Narrow&amp;#39;, hjust = .5),
    plot.caption = ggtext::element_markdown(),
    plot.caption.position = &amp;#39;plot&amp;#39;
  )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2021/04/15/what-of-manhattan-did-i-run-through/index_files/figure-html/cloropleth-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Now the East Side routes are clearer.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;what-of-manhattan-did-i-run-through&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;What % of Manhattan Did I Run Through?&lt;/h2&gt;
&lt;p&gt;The island of Manhattan covers 22.7 square miles. I was curious what % of that area I covered, based on census tracts. While this will seriously over-count my coverage, it is easy to calculate: if I ran through a tract, I count 100% of its land area; if I did not, I count nothing.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;ALAND&lt;/code&gt; column from the Census Tract data contains the land area in square meters, &lt;a href=&#34;https://www.census.gov/quickfacts/fact/note/US/LND110210&#34;&gt;which I convert to square miles&lt;/a&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;data_summary &amp;lt;- nyc_geo_join %&amp;gt;%
  as_tibble %&amp;gt;% 
  mutate(covered = !is.na(num_routes)) %&amp;gt;% 
  group_by(covered) %&amp;gt;% 
  summarize(tracts = n(),
            #Convert Square Meters to Square Miles
            area = sum(ALAND)/2589988) %&amp;gt;%
  mutate(pct_tracts = tracts / sum(tracts),
         pct_area = area/sum(area))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;During this marathon training, I ran through 101 of Manhattan’s 288 Census Tracts (35%) and passed through census tracts covering 8.7 &lt;em&gt;mi^2&lt;/em&gt; out of 22.7 &lt;em&gt;mi^2&lt;/em&gt;, for &lt;strong&gt;38.4%&lt;/strong&gt;.&lt;/p&gt;
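&lt;p&gt;As a quick gut-check on those percentages, the arithmetic from the rounded figures above works out as expected (the small gap from the reported 38.4% comes from rounding the areas to 8.7 and 22.7):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#Share of tracts covered
101 / 288&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 0.3506944&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#Share of land area covered
8.7 / 22.7&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 0.3832599&lt;/code&gt;&lt;/pre&gt;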
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Heatmapping My New York City Marathon Training</title>
      <link>https://jlaw.netlify.app/2021/04/01/heatmapping-my-new-york-city-marathon-training/</link>
      <pubDate>Thu, 01 Apr 2021 00:00:00 +0000</pubDate>
      <guid>https://jlaw.netlify.app/2021/04/01/heatmapping-my-new-york-city-marathon-training/</guid>
      <description>
&lt;script src=&#34;https://jlaw.netlify.app/2021/04/01/heatmapping-my-new-york-city-marathon-training/index_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;div id=&#34;motivation&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Motivation&lt;/h2&gt;
&lt;p&gt;This post was inspired by my wife who used the GPS data from her Strava app to plot her running routes during 2020. Since I don’t run nearly as much as I used to, I need to go back to when I was training for the NYC marathon to find enough running to make such a map worthwhile. Also presenting a challenge is that I’m a bit of a &lt;a href=&#34;https://www.merriam-webster.com/dictionary/Luddite&#34;&gt;luddite&lt;/a&gt; when it comes to running technology. I don’t have a GPS watch and I don’t run with a phone. To track my runs I manually enter my routes and workouts into &lt;a href=&#34;http://www.mapmyrun.com&#34;&gt;MapMyRun&lt;/a&gt; and I time my runs with an ol’ fashioned sportswatch.&lt;/p&gt;
&lt;p&gt;While this works for me on the road, it made the data-gathering process for this visualization more difficult. And while MapMyRun does have TCX files for each workout, they’re not that useful if the data didn’t come from a GPS watch.&lt;/p&gt;
&lt;p&gt;At the end of the day, my goal with this analysis is to make a cool-looking heatmap of my training routes for the NYC Marathon… or at least to make a visualization that is cooler looking than my wife’s.&lt;/p&gt;
&lt;p&gt;For those who can’t wait… this was the final output:
&lt;img src=&#34;running_heatmap.PNG&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;libraries-used&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Libraries Used&lt;/h2&gt;
&lt;p&gt;This analysis uses four main packages: &lt;code&gt;tidyverse&lt;/code&gt; for data manipulation, &lt;code&gt;sf&lt;/code&gt; for modifying spatial data, &lt;code&gt;tigris&lt;/code&gt; for getting the basemaps to plot my routes, and &lt;code&gt;extrafont&lt;/code&gt; to bring in new fonts for the plots.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidyverse) # Data Manipulation
library(sf) # Manipulation Spatial Data
library(tigris) # Getting Tract and Roads Spatial Data
library(extrafont) # Better Fonts For GGPLOT&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;gathering-data&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Gathering Data&lt;/h2&gt;
&lt;p&gt;If I had a GPS watch or used Strava, I could just download all my files which would contain Geo information and plot it directly. But because I do everything manually, I needed to jump through some hoops. From my &lt;a href=&#34;http://www.mapmyrun.com&#34;&gt;MapMyRun&lt;/a&gt; account I was able to download:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;code&gt;user_workout_history.csv&lt;/code&gt; - Containing all of my workouts along with a column for &lt;em&gt;route_id&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;GPX files for each route that I had saved.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This led to the semi-painful manual process of using the first file to write down each route ID that I had run, looking up that route, and downloading the individual GPX file. Fortunately, I’m a creature of habit and ran the same routes often, so there were only 24 to individually download.&lt;/p&gt;
&lt;div id=&#34;the-user-workout-history-file&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;The User Workout History File&lt;/h3&gt;
&lt;p&gt;This file is a CSV exported from MapMyRun that contains one row for each workout I did, along with metadata such as date, time, speed, etc. However, the important column is the route ID, which will be used to join the geo-data from the routes’ GPX files.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;runs &amp;lt;- read_csv(&amp;#39;data/user_workout_history.csv&amp;#39;) %&amp;gt;% 
  # Create Route ID column
  mutate(route_id = str_extract(RouteID, &amp;#39;\\d+&amp;#39;) %&amp;gt;% as.integer)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;the-route-gpx-files&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;The Route GPX Files&lt;/h3&gt;
&lt;p&gt;As mentioned above, the geocoded data for each route lives in GPX files, one for each of the 24 routes. Since I would apply the same pre-processing to each file, this is a good candidate for the &lt;code&gt;map_dfr&lt;/code&gt; function to construct the data frame.&lt;/p&gt;
&lt;p&gt;The following code uses &lt;code&gt;dir()&lt;/code&gt; to get a list of all the files in the directory as a vector, the &lt;code&gt;keep()&lt;/code&gt; function trims the vector to only the GPX files, and each GPX file is then passed into &lt;code&gt;read_sf&lt;/code&gt; to read in the geo-data. The data is subset to only two columns, and a &lt;em&gt;route_id&lt;/em&gt; is created based on the numbers in the file name.&lt;/p&gt;
&lt;p&gt;Finally, geo-data in &lt;code&gt;sf&lt;/code&gt; lives in a GEOMETRY column. However, in order to get the latitudes and longitudes as individual columns I use &lt;code&gt;st_coordinates&lt;/code&gt; to create “X” and “Y” columns for longitude and latitude.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;all_routes &amp;lt;- map_dfr(
  #Get all gpx files in the directory
  keep(dir(&amp;#39;data&amp;#39;), ~str_detect(.x, &amp;quot;gpx&amp;quot;)),
  #Read them in
  ~read_sf(paste0(&amp;#39;data/&amp;#39;,.x), layer = &amp;quot;track_points&amp;quot;) %&amp;gt;% 
    #keep the segment id and the geometry field
    select(track_seg_point_id, geometry) %&amp;gt;% 
    # create a route_id based on the file
    mutate(route_id = parse_number(.x))
) %&amp;gt;% 
  #Extract Lat and Long as Columns
  cbind(., st_coordinates(.))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;After processing, the data looks like:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;right&#34;&gt;track_seg_point_id&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;route_id&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;X&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;Y&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;geometry&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;111694131&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-73.97597&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;40.77624&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;POINT (-73.97597 40.77624)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;111694131&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-73.97555&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;40.77605&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;POINT (-73.97555 40.77605)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;right&#34;&gt;2&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;111694131&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-73.97555&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;40.77605&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;POINT (-73.97555 40.77605)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;right&#34;&gt;3&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;111694131&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-73.97546&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;40.77582&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;POINT (-73.97546 40.77582)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;right&#34;&gt;4&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;111694131&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-73.97546&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;40.77582&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;POINT (-73.97546 40.77582)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;right&#34;&gt;5&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;111694131&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-73.97552&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;40.77527&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;POINT (-73.97552 40.77527)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;
&lt;div id=&#34;combining-runs-and-routes&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Combining Runs and Routes&lt;/h3&gt;
&lt;p&gt;With all the workouts in the &lt;code&gt;runs&lt;/code&gt; data and all the routes in the &lt;code&gt;all_routes&lt;/code&gt; data, a simple inner join will combine them. This will duplicate routes that I ran multiple times, which in this case is the desired behavior.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#Join Routes to Runs to Duplicate 
runs_and_routes &amp;lt;- runs %&amp;gt;% 
  inner_join(all_routes, by = &amp;quot;route_id&amp;quot;) &lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;creating-a-map-of-nyc&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Creating a map of NYC&lt;/h3&gt;
&lt;p&gt;Since the goal is to create a heatmap of the various routes I ran as part of marathon training, I need a map that contains all of the possible roads in NYC. The &lt;code&gt;tigris&lt;/code&gt; package provides access to US Census TIGER shapefiles. One of the levels is “roads”, which can be downloaded using the &lt;code&gt;roads()&lt;/code&gt; function, where the first parameter is state and the second parameter is county (New York County is Manhattan):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;###Download Roads Map from Tigris
nyc &amp;lt;- roads(&amp;quot;NY&amp;quot;, &amp;quot;New York&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot() + geom_sf(data = nyc) + ggthemes::theme_map()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2021/04/01/heatmapping-my-new-york-city-marathon-training/index_files/figure-html/map1-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;The function provides road data for all of Manhattan. However, I did not run through every part of Manhattan, so it makes more sense to truncate the map to the areas where I did run.&lt;/p&gt;
&lt;p&gt;In order to do this, I first need to define a boundary box based on my routes. Given a geometry, the &lt;code&gt;st_bbox()&lt;/code&gt; function from &lt;code&gt;sf&lt;/code&gt; will return a “bbox” object containing the four corners of my routes.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;st_bbox(runs_and_routes$geometry)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##      xmin      ymin      xmax      ymax 
## -74.01880  40.70806 -73.93118  40.82113&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;However, this does not provide any padding around my running routes, which would make for a worse visualization. So I will use &lt;code&gt;map2_dbl&lt;/code&gt; to add 0.01 to the maximum values and subtract 0.01 from the minimum values to slightly expand the bounding box.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;### Construct Bounding Boxes and Expand Limits By A Delta
bbox &amp;lt;- map2_dbl(
  st_bbox(runs_and_routes$geometry),
  names(st_bbox(runs_and_routes$geometry)),
  ~if_else(str_detect(.y, &amp;#39;min&amp;#39;), .x - .01, .x + .01)
)

bbox&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##      xmin      ymin      xmax      ymax 
## -74.02880  40.69806 -73.92118  40.83113&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With an updated bounding box, I can now crop the initial map with my bounding box using the &lt;code&gt;st_crop()&lt;/code&gt; function. Also, in order to make the Coordinate Reference Systems the same, I use &lt;code&gt;st_crs()&lt;/code&gt; and &lt;code&gt;st_transform&lt;/code&gt; to make sure the NYC map is using the same coordinates as my routes.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#Set CRS for NYC to CRS for Running Routes And Crop to the Bounding Box
nyc2 &amp;lt;- st_transform(nyc, crs = st_crs(runs_and_routes$geometry)) %&amp;gt;% 
  st_crop(bbox)

ggplot() + geom_sf(data = nyc2) + ggthemes::theme_map()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2021/04/01/heatmapping-my-new-york-city-marathon-training/index_files/figure-html/nyc_map_2-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;We’ve now cut off Governor’s Island from the bottom left corner as well as parts of Northern Manhattan that I never ran to.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;constructing-the-heatmap&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Constructing the Heatmap&lt;/h2&gt;
&lt;p&gt;With the new basemap created and the route data in its own data frame. I can create the heatmap using &lt;code&gt;stat_density2d&lt;/code&gt; with the route data and &lt;code&gt;geom_sf&lt;/code&gt; with the map data. From the &lt;code&gt;stat_density2d&lt;/code&gt; piece I pass in the routes data and set the fill value to be the count at each X and Y using the &lt;code&gt;after_stat()&lt;/code&gt; option. The &lt;code&gt;n&lt;/code&gt; parameter sets the number of grid points in each directions for the density.&lt;/p&gt;
&lt;p&gt;The base map is very rectangular, tall but skinny, which made it difficult to add titles. To make things look better, I use &lt;code&gt;ggdraw&lt;/code&gt; from the &lt;code&gt;cowplot&lt;/code&gt; package to create a new drawing layer and add titles/captions to that layer.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;p &amp;lt;- ggplot() + 
  #Construct the Heatmap Portion
  stat_density2d(data = runs_and_routes,
                 aes(x = X, y = Y, fill = after_stat(count)),
                 geom = &amp;#39;tile&amp;#39;,
                 contour = F,
                 n = 1024
                 ) +
  #Draw the Map of Manhattan
  geom_sf(data = nyc2, color = &amp;#39;#999999&amp;#39;, alpha = .15) + 
  scale_fill_viridis_c(option = &amp;quot;B&amp;quot;, guide = F) + 
  ggthemes::theme_map() + 
  theme(
    panel.background = element_rect(fill = &amp;#39;black&amp;#39;),
    plot.background = element_rect(fill = &amp;#39;black&amp;#39;)
  )

cowplot::ggdraw(p) + 
  labs(title = &amp;quot;JLaw&amp;#39;s Marathon Training Heatmap&amp;quot;,
       caption = &amp;quot;**Author**: JLaw&amp;quot;) + 
  theme(panel.background = element_rect(fill = &amp;quot;black&amp;quot;),
        plot.background = element_rect(fill = &amp;#39;black&amp;#39;),
        plot.title = element_text(color = &amp;quot;#DDDDDD&amp;quot;,
                                  family = &amp;#39;Nirmala UI&amp;#39;,
                                  #face = &amp;#39;bold&amp;#39;,
                                  size = 18),
        plot.caption = ggtext::element_markdown(color = &amp;#39;#DDDDDD&amp;#39;,
                                    family = &amp;#39;Calibri Light&amp;#39;,
                                    hjust = 1,
                                    size = 12)
  )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/2021/04/01/heatmapping-my-new-york-city-marathon-training/index_files/figure-html/heatmap-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;concluding-thoughts&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Concluding Thoughts&lt;/h2&gt;
&lt;p&gt;I’m really happy with how this came out. It also reveals some of my running habits, mainly that I ran in Central Park a lot, and you can roughly tell where I worked at the time since that area is slightly &lt;em&gt;hotter&lt;/em&gt;. Some parts of Manhattan that I did run don’t show up well on the map because I may have only run there once. An exploration of how much of Manhattan I ran will be covered in a follow-up post.&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Exploring Wednesday Night Cable Ratings with OCR</title>
      <link>https://jlaw.netlify.app/2021/03/01/exploring-wednesday-night-cable-ratings-with-ocr/</link>
      <pubDate>Mon, 01 Mar 2021 00:00:00 +0000</pubDate>
      <guid>https://jlaw.netlify.app/2021/03/01/exploring-wednesday-night-cable-ratings-with-ocr/</guid>
      <description>
&lt;script src=&#34;index_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;p&gt;One of my guilty pleasure TV shows is &lt;a href=&#34;https://en.wikipedia.org/wiki/The_Challenge_(TV_series)&#34;&gt;MTV’s The Challenge&lt;/a&gt;. Debuting in the late 90s, the show pitted alumni from The Real World and Road Rules against each other in a series of physical events. Now in its 36th season, it’s found new popularity by importing challengers from other reality shows, both US and international, regularly topping Wednesday night ratings in the coveted 18-49 demographic.&lt;/p&gt;
&lt;p&gt;Looking at the Ratings on &lt;a href=&#34;http://www.showbuzzdaily.com/articles/showbuzzdailys-top-150-wednesday-cable-originals-network-finals-2-3-2021.html&#34;&gt;showbuzzdaily.com&lt;/a&gt; shows that the Challenge was in fact #1 in this demographic. However, it also scores incredibly low on the 50+ demo.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;ratings.PNG&#34; /&gt;&lt;/p&gt;
&lt;p&gt;So I figured that exploring the age and gender distributions of Wednesday Night Cable ratings would be interesting. The only caveat is… &lt;strong&gt;the data exists in an image&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;So for this blog post, I will be extracting the ratings data from the image and doing some exploration on popular shows by age and gender.&lt;/p&gt;
&lt;p&gt;Also, huge thanks to Thomas Mock and his &lt;a href=&#34;https://themockup.blog/posts/2021-01-18-reading-tables-from-images-with-magick/&#34;&gt;The Mockup Blog&lt;/a&gt; for serving as a starting point for learning &lt;code&gt;magick&lt;/code&gt;.&lt;/p&gt;
&lt;div id=&#34;using-magick-to-process-image-data&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Using magick to process image data&lt;/h2&gt;
&lt;p&gt;I’ll be using the &lt;code&gt;magick&lt;/code&gt; package to read in the image and do some processing to clean it up. Then I will use the &lt;em&gt;ocr()&lt;/em&gt; function from the &lt;code&gt;tesseract&lt;/code&gt; package to actually handle extracting the data from the image.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidyverse) #Data Manipulation
library(magick) #Image Manipulation
library(tesseract) #Extracting Text from the Image
library(patchwork) #Combining Multiple GGPLOTs Together&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The first step is reading in the raw image from the &lt;a href=&#34;http://www.showbuzzdaily.com/articles/showbuzzdailys-top-150-wednesday-cable-originals-network-finals-2-3-2021.html&#34;&gt;showbuzzdaily.com&lt;/a&gt; website which can be done through &lt;code&gt;magick&lt;/code&gt;’s &lt;em&gt;image_read()&lt;/em&gt; function.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;raw_img &amp;lt;- image_read(&amp;quot;http://www.showbuzzdaily.com/wp-content/uploads/2021/02/Final-Cable-2021-Feb-03-WED.png&amp;quot;)

image_ggplot(raw_img)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;index_files/figure-html/orig-1.png&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;p&gt;The next thing to notice is that while most of the data exists in a tabular format, there are also headers and footers that don’t follow the tabular structure. So I’ll use &lt;em&gt;image_crop()&lt;/em&gt; to keep only the tabular part of the image. The crop function uses a &lt;em&gt;geometry_area()&lt;/em&gt; helper function which takes four parameters. I struggled a bit with the documentation to get this working right, but eventually internalized &lt;em&gt;geometry_area(703, 1009, 0, 91)&lt;/em&gt; as “crop out 703 pixels of width and 1009 pixels of height, starting at x-position 0 (the left boundary) and y-position 91 pixels from the top”.&lt;/p&gt;
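&lt;p&gt;A quick way to demystify the argument order is to print the geometry string the helper builds. This is a minimal sketch assuming the &lt;code&gt;magick&lt;/code&gt; package is loaded:&lt;/p&gt;

```r
library(magick)

# geometry_area(width, height, x_off, y_off) just builds an ImageMagick
# geometry string of the form "WIDTHxHEIGHT+X+Y", so printing it shows
# exactly what the crop will do before applying it to an image.
geometry_area(703, 1009, 0, 91)
```

Printing the result (rather than passing it straight to `image_crop()`) makes the width/height vs. offset ordering easy to remember.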
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;chopped_image &amp;lt;- 
  raw_img %&amp;gt;% 
  #crop out width:703px and height:1009px starting +91px from the top
  image_crop(geometry_area(703, 1009, 0, 91)) 

image_ggplot(chopped_image)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;index_files/figure-html/cropping-1.png&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Now the non-tabular data (header and footer) have been removed.&lt;/p&gt;
&lt;p&gt;The &lt;em&gt;ocr()&lt;/em&gt; algorithm that will handle extracting the data from the image can struggle with parts of the image as-is. For example, it might think the color boundary between white and green is a character. Therefore, I’m going to do the best I can to clean up the image so that the &lt;em&gt;ocr()&lt;/em&gt; function has an easier time. Ultimately this required a lot of guess-and-check, but in the end I only needed two cleaning steps:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Running a morphology method over the image to remove the horizontal lines separating each group of 5 shows. A morphology method modifies each pixel based on the neighborhood of pixels around it, and thinning subtracts pixels from a shape. Since white is considered the foreground by default, I first negate the colors so the lines become foreground, run the thinning to erase them, and then negate again to restore the original colors.&lt;/li&gt;
&lt;li&gt;Turning everything to greyscale to remove remaining colors.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;I had tried to remove the color gradients, but it took much more effort and was ultimately not more effective than just going to greyscale.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;processed_image &amp;lt;- chopped_image %&amp;gt;% 
  image_negate() %&amp;gt;% #Flip the Colors
  # Remove the Horizontal Lines
  image_morphology(method = &amp;quot;Thinning&amp;quot;, kernel = &amp;quot;Rectangle:7x1&amp;quot;) %&amp;gt;% 
  # Flip the Colors back to the original
  image_negate() %&amp;gt;% 
  # Turn colors to greyscale
  image_quantize(colorspace = &amp;quot;gray&amp;quot;)


image_ggplot(processed_image)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;index_files/figure-html/processing-1.png&#34; width=&#34;576&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;extracting-the-data-with-ocr&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Extracting the Data with OCR&lt;/h2&gt;
&lt;p&gt;Because I can be lazy, my first attempt at extraction was just to run &lt;em&gt;ocr()&lt;/em&gt; on the processed image and hope for the best. However, the best was somewhat frustrating. For example,&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ocr(processed_image) %&amp;gt;% 
  str_sub(end = str_locate(., &amp;#39;\\n&amp;#39;)[1])&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] &amp;quot;1 CHALLENGE: DOUBLE AGENMTV e:00PM 90/0.54 069 0.39 |047 053 0.20 |058 013} 920\n&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Just looking at the top row, there are a number of issues that come from using &lt;em&gt;ocr()&lt;/em&gt; directly on the table. The boundaries between sections show up as “|” or “/”, and sometimes the decimal point doesn’t appear.&lt;/p&gt;
&lt;p&gt;Fortunately the function allows you to “whitelist” characters in order to nudge the algorithm toward what it should expect to see. So rather than guess-and-check the image processing until everything works perfectly, I’ll write a function that lets me crop to individual columns and specify the proper whitelist for each column.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ocr_text &amp;lt;- function(col_width, col_start, format_code){
  
  ##For Stations Which Are Only Characters
  only_chars &amp;lt;- tesseract::tesseract(
    options = list(
      tessedit_char_whitelist = paste0(LETTERS, collapse = &amp;#39;&amp;#39;)
    )
  )
  
  #For Titles Which Are Letters + Numbers + Characters
  all_chars &amp;lt;- tesseract::tesseract(
    options = list(
      tessedit_char_whitelist = paste0(
        c(LETTERS, &amp;quot; &amp;quot;, &amp;quot;.0123456789-()/&amp;quot;), collapse = &amp;quot;&amp;quot;)
    )
  )
  
  #For Ratings which are just numbers and a decimal point
  ratings &amp;lt;- tesseract::tesseract(
    options = list(
      tessedit_char_whitelist = &amp;quot;0123456789 .&amp;quot;
    )
  )
  
  #Grab the Column starting at Col Start and with width Col with
  tmp &amp;lt;- processed_image %&amp;gt;% 
    image_crop(geometry_area(col_width, 1009, col_start, 0)) 
  
  # Run OCR with the correct whitelist and turn into a dataframe
  tmp %&amp;gt;% 
    ocr(engine = get(format_code)) %&amp;gt;% 
    str_split(&amp;quot;\n&amp;quot;) %&amp;gt;%
    unlist() %&amp;gt;%
    enframe() %&amp;gt;%
    select(-name) %&amp;gt;%
    filter(!is.na(value), str_length(value) &amp;gt; 0)
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The function above takes in a column width and a column start to crop the column, and a label to choose the whitelist for that specific column. The parameters are defined in a list and passed into &lt;code&gt;purrr&lt;/code&gt;’s &lt;em&gt;pmap()&lt;/em&gt; function. Finally, all the extracted columns are combined together.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#Run the function all the various columns
all_ocr &amp;lt;- list(col_width = c(168, 37, 33, 34, 35, 34),
                col_start = c(28, 196, 307, 346, 385, 598),
                format_code = c(&amp;quot;all_chars&amp;quot;, &amp;#39;only_chars&amp;#39;, rep(&amp;quot;ratings&amp;quot;, 4))) %&amp;gt;% 
  pmap(ocr_text) 

#Combine all the columns together and set the names
ratings &amp;lt;- all_ocr %&amp;gt;% 
  bind_cols() %&amp;gt;% 
  set_names(nm = &amp;quot;telecast&amp;quot;, &amp;quot;network&amp;quot;, &amp;quot;p_18_49&amp;quot;, &amp;quot;f_18_49&amp;quot;, &amp;quot;m_18_49&amp;quot;,
            &amp;#39;p_50_plus&amp;#39;) &lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;final-cleaning&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Final Cleaning&lt;/h2&gt;
&lt;p&gt;Even with the column-specific specifications, the &lt;em&gt;ocr()&lt;/em&gt; function did not get everything right. Due to the font, it had particular trouble distinguishing between &lt;code&gt;1&lt;/code&gt;s and &lt;code&gt;4&lt;/code&gt;s as well as &lt;code&gt;8&lt;/code&gt;s and &lt;code&gt;6&lt;/code&gt;s. Additionally, the decimal point was sometimes still missed. And since all network names were truncated in the original image, I decided to just recode them manually.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ratings_clean &amp;lt;- ratings %&amp;gt;% 
  #Fix Things where the decimal was missed
  mutate(across(p_18_49:p_50_plus, ~parse_number(.x)),
         across(p_18_49:p_50_plus, ~if_else(.x &amp;gt; 10, .x/100, .x)),
         #1s and 4s get kind of screwed up; same with 8s and 6s
         p_50_plus = case_when(
           telecast == &amp;#39;TUCKER CARLSON TONIGHT&amp;#39; ~ 2.71,
           telecast == &amp;#39;SISTAS SERIES S2&amp;#39; ~ 0.46,
           telecast == &amp;#39;LAST WORD W/L. ODONNEL&amp;#39; ~ 2.17,
           telecast == &amp;#39;SITUATION ROOM&amp;#39; &amp;amp; p_50_plus == 1.34 ~ 1.31,
           telecast == &amp;#39;MY 600-LB LIFE NIA&amp;#39; ~ 0.46,
           TRUE ~ p_50_plus
         ),
         #Clean up &amp;#39;W/&amp;#39; being read as &amp;#39;WI&amp;#39; and &amp;#39;11th&amp;#39; as &amp;#39;44th&amp;#39;
         telecast = case_when(
           telecast == &amp;#39;44TH HOUR WIB. WILLIAMS&amp;#39; ~ &amp;#39;11TH HOUR W/B. WILLIAMS&amp;#39;,
           telecast == &amp;#39;ALLIN WI CHRIS HAYES&amp;#39; ~ &amp;#39;ALL IN W/ CHRIS HAYES&amp;#39;,
           telecast == &amp;#39;BEAT WIARI MELBER&amp;#39; ~&amp;#39;BEAT W/ARI MELBER&amp;#39;,
           telecast == &amp;#39;SPORTSCENTER 124M L&amp;#39; ~ &amp;#39;SPORTSCENTER 12AM&amp;#39;,
           telecast == &amp;#39;MY 600-LB LIFE NIA&amp;#39; ~ &amp;#39;MY 600-LB LIFE&amp;#39;,
           TRUE ~ telecast
         ),
         # Turn to Title Case
         telecast = str_to_title(telecast),
         # Clean up random characters
         telecast = str_remove(telecast, &amp;#39; [L|F|S2|L B]+$&amp;#39;),
         #Clean up Network
         network = factor(case_when(
           network == &amp;#39;TURNI&amp;#39; ~ &amp;quot;TNT&amp;quot;,
           network == &amp;#39;MSNBI&amp;#39; ~ &amp;quot;MSNBC&amp;quot;,
           network == &amp;#39;FOXN&amp;#39; ~ &amp;quot;FoxNews&amp;quot;,
           network == &amp;#39;LIFETI&amp;#39; ~ &amp;quot;Lifetime&amp;quot;,
           network == &amp;#39;BLACK&amp;#39; ~ &amp;#39;BET&amp;#39;,
           network %in% c(&amp;#39;AEN&amp;#39;, &amp;#39;AGEN&amp;#39;) ~ &amp;#39;A&amp;amp;E&amp;#39;,
           network == &amp;#39;BRAVC&amp;#39; ~ &amp;#39;BRAVO&amp;#39;,
           network == &amp;#39;COME&amp;#39; ~ &amp;#39;COMEDY CENTRAL&amp;#39;,
           network == &amp;#39;NECS&amp;#39; ~ &amp;#39;NBC SPORTS&amp;#39;,
           network == &amp;#39;TBSN&amp;#39; ~ &amp;#39;TBS&amp;#39;,
           network == &amp;#39;TL&amp;#39; ~ &amp;#39;TLC&amp;#39;,
           TRUE ~ network
         ))
  )

knitr::kable(head(ratings_clean, 3))&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;telecast&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;network&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;p_18_49&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;f_18_49&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;m_18_49&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;p_50_plus&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Challenge Double Agen&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;MTV&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.54&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.69&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.39&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.13&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Nba Regular Season&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;ESPN&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.33&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.21&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.46&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.40&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Aew All Elite Wrestling&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;TNT&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.32&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.21&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.42&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.32&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Now everything should be ready for analysis.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;analysis-of-cable-ratings&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Analysis of Cable Ratings&lt;/h2&gt;
&lt;p&gt;The decimals in the table for cable ratings refer to the &lt;a href=&#34;https://en.wikipedia.org/wiki/Audience_measurement#:~:text=Ratings%20point%20is%20a%20measure,households%20in%20the%20United%20States&#34;&gt;percent of the population watching the show&lt;/a&gt;. For instance the &lt;code&gt;p_18_49&lt;/code&gt; field’s value of 0.54 means that 0.54% of the US 18-49 population watched The Challenge on February 3rd.&lt;/p&gt;
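&lt;p&gt;To make the scale concrete, a ratings point can be converted into an implied audience size. The sketch below assumes a US 18-49 population of roughly 136 million; that figure is a placeholder for illustration, not a number from the ratings source:&lt;/p&gt;

```r
# Convert a ratings point (a percent of the demo's population) into an
# implied number of viewers. The 136 million 18-49 population is an
# assumed placeholder for illustration only.
us_18_49_pop <- 136e6

rating_to_viewers <- function(rating, population) {
  (rating / 100) * population
}

# The Challenge's 0.54 rating would imply roughly 734,000 viewers aged 18-49
rating_to_viewers(0.54, us_18_49_pop)
#> [1] 734400
```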
&lt;div id=&#34;the-most-popular-shows-on-wednesday-night-overall-18-49-and-by-gender&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;The Most Popular Shows on Wednesday Night Overall 18-49 and By Gender&lt;/h3&gt;
&lt;p&gt;The first question is: what are the most popular shows for the 18-49 demographic, both for combined genders and broken apart by gender? These combined plots use the &lt;code&gt;patchwork&lt;/code&gt; package to merge the three ggplots into a single plot with a common legend.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;##Create Fixed Color Palette For Networks
cols &amp;lt;- scales::hue_pal()(n_distinct(ratings_clean$network))
names(cols) &amp;lt;- levels(ratings_clean$network)

##Top Show By the Key Demo (Combined)
key_all &amp;lt;- ratings_clean %&amp;gt;% 
  slice_max(p_18_49, n = 10) %&amp;gt;% 
  ggplot(aes(x = fct_reorder(telecast, p_18_49), y = p_18_49, fill = network)) + 
    geom_col() + 
    geom_text(aes(label = p_18_49 %&amp;gt;% round(2)), nudge_y = 0.015) + 
    scale_y_continuous(expand = expansion(mult = c(0, .1))) + 
    scale_fill_manual(values = cols) + 
    labs(x = &amp;quot;&amp;quot;, title = &amp;quot;All Genders&amp;quot;, y = &amp;#39;&amp;#39;, fill = &amp;#39;&amp;#39;) + 
    coord_flip() + 
    cowplot::theme_cowplot() + 
    theme(
      axis.text.x = element_blank(),
      axis.ticks = element_blank(),
      axis.line.x = element_blank(),
      plot.title.position = &amp;#39;plot&amp;#39;
    )

#Male Ratings only
key_male &amp;lt;- ratings_clean %&amp;gt;% 
  slice_max(m_18_49, n = 5) %&amp;gt;% 
  ggplot(aes(x = fct_reorder(telecast, m_18_49), y = m_18_49, fill = network)) + 
  geom_col() + 
  geom_text(aes(label = m_18_49 %&amp;gt;% round(2)), nudge_y = .045) + 
  scale_y_continuous(expand = expansion(mult = c(0, .1))) + 
  scale_fill_manual(values = cols, guide = F) + 
  labs(x = &amp;quot;&amp;quot;, title = &amp;quot;Male&amp;quot;, y = &amp;#39;&amp;#39;) + 
  coord_flip() + 
  cowplot::theme_cowplot() + 
  theme(
    axis.text.x = element_blank(),
    axis.ticks = element_blank(),
    axis.line.x = element_blank(),
    plot.title.position = &amp;#39;plot&amp;#39;
  )

# Female rating only
key_female &amp;lt;- ratings_clean %&amp;gt;% 
  slice_max(f_18_49, n = 5) %&amp;gt;% 
  ggplot(aes(x = fct_reorder(telecast, f_18_49), y = f_18_49, fill = network)) + 
  geom_col() + 
  geom_text(aes(label = f_18_49 %&amp;gt;% round(2)), nudge_y = .065) + 
  scale_y_continuous(expand = expansion(mult = c(0, .1))) + 
  scale_fill_manual(values = cols, guide = F) + 
  labs(x = &amp;quot;&amp;quot;, title = &amp;quot;Female&amp;quot;, y = &amp;#39;&amp;#39;) + 
  coord_flip() + 
  cowplot::theme_cowplot() + 
  theme(
    axis.text.x = element_blank(),
    axis.ticks = element_blank(),
    axis.line.x = element_blank(),
    plot.title.position = &amp;#39;plot&amp;#39;
  )
    
# Combining everything with patchwork syntax
key_all / (key_male | key_female) +
  plot_layout(guides = &amp;quot;collect&amp;quot;) + 
  plot_annotation(
    title = &amp;quot;**Wednesday Night Cable Ratings (Feb 3rd, 2021)**&amp;quot;,
    caption = &amp;quot;*Source:* Showbuzzdaily.com&amp;quot;
  ) &amp;amp; theme(legend.position = &amp;#39;bottom&amp;#39;,
            plot.title = ggtext::element_markdown(size = 14),
            plot.caption = ggtext::element_markdown())&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;index_files/figure-html/most_popular_18_49-1.png&#34; width=&#34;768&#34; /&gt;&lt;/p&gt;
&lt;p&gt;From the chart it’s clear that the Challenge is fairly dominant in the 18-49 demographic, scoring 0.21 points (or 1.63x) higher than the 2nd-highest show. And while the Challenge is popular with both genders, it’s the most popular show among 18-49 Females but only 3rd for 18-49 Males, behind an NBA game and AEW professional wrestling.&lt;/p&gt;
&lt;p&gt;Also, because the networks for My 600-lb Life (TLC) and Sistas (BET) weren’t in the overall top 10, I couldn’t figure out how to include them in the legend. If anyone has any ideas, please let me know in the comments.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;the-most-male-dominant-female-dominant-and-gender-balanced-shows&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;The Most Male-Dominant, Female Dominant, and Gender-Balanced Shows&lt;/h3&gt;
&lt;p&gt;From the above chart it’s clear that some shows skew Male (sports) and some skew Female (reality shows like Married at First Sight, My 600-lb Life, and Real Housewives). But I can look at that more directly by comparing the ratio of the Female 18-49 rating to the Male 18-49 rating to determine the gender skew of each show. I break the shows into categories of &lt;em&gt;Male Skewed&lt;/em&gt;, &lt;em&gt;Female Skewed&lt;/em&gt;, and &lt;em&gt;Balanced&lt;/em&gt; (where the Female/Male ratio is closest to 1).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;##Female / Male Ratio for Key Demo
bind_rows(
  ratings_clean %&amp;gt;% 
    mutate(f_m_ratio = f_18_49 / m_18_49) %&amp;gt;%
    slice_max(f_m_ratio, n = 5),
  ratings_clean %&amp;gt;% 
    mutate(f_m_ratio = f_18_49 / m_18_49) %&amp;gt;%
    slice_min(f_m_ratio, n = 5),
  ratings_clean %&amp;gt;% 
    mutate(f_m_ratio = f_18_49 / m_18_49,
           balance = abs(1-f_m_ratio)) %&amp;gt;% 
    slice_min(balance, n = 5)
) %&amp;gt;%
  mutate(balance = f_m_ratio-1) %&amp;gt;% 
  ggplot(aes(x = m_18_49, y = f_18_49, fill = balance)) + 
    ggrepel::geom_label_repel(aes(label = telecast)) + 
    geom_abline(lty = 2) + 
    scale_fill_gradient2(high = &amp;#39;#8800FF&amp;#39;,mid = &amp;#39;#BBBBBB&amp;#39;, low = &amp;#39;#02C2AD&amp;#39;,
                         midpoint = 0, guide = F) + 
    labs(title = &amp;quot;Comparing 18-49 Demographics by Gender&amp;quot;,
         subtitle = &amp;#39;Cable Feb 3rd, 2021&amp;#39;,
         caption = &amp;quot;*Source:* showbuzzdaily.com&amp;quot;,
         x = &amp;quot;Males 18-49 Ratings&amp;quot;,
         y = &amp;quot;Females 18-49 Ratings&amp;quot;) + 
    cowplot::theme_cowplot() + 
    theme(
      plot.title.position = &amp;#39;plot&amp;#39;,
      plot.caption = ggtext::element_markdown()
    )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;index_files/figure-html/gender_break-1.png&#34; width=&#34;768&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Sure enough, the most Male-dominated shows are sports-related: 2 NBA games, an NBA pre-game show, an episode of Sportscenter, and a sports talking-heads show. The Female-skewed shows are also not surprising, with Married at First Sight, Sistas, My 600-lb Life, and Real Housewives of Salt Lake City topping the list. For the balanced category I did not have much of an expectation, but all the programs seem to be news shows or news-adjacent, like the Daily Show… which I guess makes sense.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;most-popular-shows-for-the-50-demographic&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Most Popular Shows for the 50+ Demographic&lt;/h3&gt;
&lt;p&gt;Turning away from the 18-49 demographic I can also look at the most popular shows for the 50+ demographic. Unfortunately, there is not a 50+ gender breakdown so I can only look at the overall.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ratings_clean %&amp;gt;% 
  slice_max(p_50_plus, n = 10) %&amp;gt;% 
  ggplot(aes(x = fct_reorder(telecast, p_50_plus), y = p_50_plus,  fill = network)) + 
  geom_col() + 
  geom_text(aes(label = p_50_plus %&amp;gt;% round(2)), nudge_y = 0.15) + 
  scale_y_continuous(expand = expansion(mult = c(0, .1))) + 
  labs(x = &amp;quot;&amp;quot;, title = &amp;quot;Top 10 Cable Shows for the 50+ Demographic&amp;quot;,
       y = &amp;#39;&amp;#39;,
       subtitle = &amp;quot;Wednesday, Feb 3rd 2021&amp;quot;,
       caption = &amp;quot;*Source:* Showbuzzdaily.com&amp;quot;,
       fill = &amp;#39;&amp;#39;) + 
  coord_flip() + 
  cowplot::theme_cowplot() + 
  theme(
    axis.text.x = element_blank(),
    axis.ticks = element_blank(),
    axis.line.x = element_blank(),
    plot.title.position = &amp;#39;plot&amp;#39;,
    plot.caption = ggtext::element_markdown(),
    legend.position = &amp;#39;bottom&amp;#39;
  )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;index_files/figure-html/plus50_overall-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Interestingly, in the 50+ demo &lt;em&gt;ALL&lt;/em&gt; of the shows are news shows, and they come from only 3 networks: two on CNN, two on Fox News, and six on MSNBC. Again, I didn’t have a ton of expectations, but it was surprising to see how homogeneous the 50+ demographic was.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;the-oldest-and-youngest-shows-in-the-top-50&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;The Oldest and Youngest Shows in the Top 50&lt;/h3&gt;
&lt;p&gt;Similar to the Most Male and Most Female shows in the Top 50 Cable Programs, I’d like to see which shows skew older vs. younger. To do this, I’ll rank-order the 18-49 demo and the 50+ demo and plot the ranks against each other. There are some massive caveats here: my data is the Top 50 shows by the 18-49 demo, so it’s not clear that the 50+ demo is fully represented. Additionally, popularity for each dimension is relative, since I don’t know the actual number of people in each demo. Finally, since both scales are ranked, the plot won’t show the full distance between levels of popularity (e.g., The Challenge is much more popular than the next-highest show for 18-49). This was done to produce a better-looking visualization.&lt;/p&gt;
&lt;p&gt;I ran a K-means clustering algorithm to pick text colors and make the differences more apparent. There isn’t much rigor to this beyond my assumption that 5 clusters would probably make sense (one for each corner and one for the middle).&lt;/p&gt;
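&lt;p&gt;As a quick aside on the ranking step: &lt;em&gt;min_rank()&lt;/em&gt; assigns tied values the same rank and skips the ranks that would have followed, which matters when two shows post identical ratings. A minimal sketch with a made-up ratings vector:&lt;/p&gt;

```r
library(dplyr)

# min_rank() gives tied values the same (minimum) rank and skips the
# next rank(s); smaller values get smaller ranks.
x <- c(0.54, 0.33, 0.33, 0.21)
min_rank(x)
#> [1] 4 2 2 1

# Equivalent in base R:
rank(x, ties.method = "min")
#> [1] 4 2 2 1
```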
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#Rank Order the Shows for the 2 Columns
dt &amp;lt;- ratings_clean %&amp;gt;% 
  transmute(
    telecast,
    young_rnk = min_rank(p_18_49),
    old_rnk = min_rank(p_50_plus),
  ) 

# Run K-Means Clustering Algorithm
km &amp;lt;- kmeans(dt %&amp;gt;% select(-telecast), 
             centers = 5, nstart = 10)

#Add the cluster label back to the data
dt2 &amp;lt;- dt %&amp;gt;%
  mutate(cluster = km$cluster)

#Plot
ggplot(dt2, aes(x = young_rnk, y = old_rnk, color = factor(cluster))) + 
  ggrepel::geom_text_repel(aes(label = telecast), size = 3) +
  scale_color_discrete(guide = F) + 
  scale_x_continuous(breaks = c(1, 50),
                     labels = c(&amp;quot;Less Popular&amp;quot;, &amp;quot;More Popular&amp;quot;)) + 
  scale_y_continuous(breaks = c(13, 54),
                     labels = c(&amp;quot;Less Popular&amp;quot;, &amp;quot;More Popular&amp;quot;)) + 
  coord_cartesian(xlim = c(-2, 54), ylim = c(0, 52)) + 
  labs(x = &amp;quot;Popularity Among 18-49&amp;quot;,
       y = &amp;quot;Popularity Among 50+&amp;quot;,
       title = &amp;quot;Visualizing Popularity of Wednesday Night Cable by Age&amp;quot;,
       subtitle = &amp;quot;Comparing 18-49 vs. 50+&amp;quot;) + 
  cowplot::theme_cowplot() + 
  theme(
    axis.ticks = element_blank(),
    axis.line = element_blank(),
    axis.text.y = element_text(angle = 90), 
    panel.background = element_rect(fill = &amp;#39;#EEEEEE&amp;#39;)

  )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;index_files/figure-html/unnamed-chunk-4-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Somewhat surprisingly (at least to me), Rachel Maddow and Tucker Carlson are the consensus most popular shows across the two demos. My beloved Challenge is very popular among the 18-49 demo and very unpopular among the 50+. Sports shows tended to be the least popular in either demo, and finally, certain MSNBC and Fox News shows were popular among the 50+ demo but not the 18-49.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;concluding-thoughts&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Concluding Thoughts&lt;/h2&gt;
&lt;p&gt;While I still love The Challenge and am happy for its popularity, its best days were probably about 10 years ago (sorry, not sorry). As far as the techniques in this post are concerned, I found extracting the data from an image to be an interesting challenge (no pun intended), but if the table were a tractable size I would probably enter the data manually rather than go through this again. Getting the data right required a lot of guess-and-check work with &lt;code&gt;magick&lt;/code&gt; and &lt;code&gt;tesseract&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;As for the analysis, I guess it’s good when things go as expected (the most popular shows by gender follow stereotypical gender conventions), but the most surprising thing to me was how much cable news dominated the 50+ demographic… and I guess the Daily Show is not as popular as I thought it would be.&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>When Did the US Senate Best Reflect the US Population?</title>
      <link>https://jlaw.netlify.app/2021/02/01/when-did-the-us-senate-best-reflect-the-us-population/</link>
      <pubDate>Mon, 01 Feb 2021 00:00:00 +0000</pubDate>
      <guid>https://jlaw.netlify.app/2021/02/01/when-did-the-us-senate-best-reflect-the-us-population/</guid>
      <description>
&lt;script src=&#34;index_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;div id=&#34;tldr&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;TL;DR&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;While this is the oldest Senate we’ve ever had, it’s not the most non-representative Senate when compared to the age distribution of the US population&lt;/li&gt;
&lt;li&gt;The most representative Senate was in the 1970s as the average Senator age declined while the average age in the US increased.&lt;/li&gt;
&lt;li&gt;The least representative Senate was in the 1990s, as the average age in the US declined while the average age of Senators continued the rise it began around 1980&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;intro&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Intro&lt;/h2&gt;
&lt;p&gt;The inspiration for this post stemmed from &lt;a href=&#34;https://www.wcd.fyi/features/senate-generations&#34;&gt;wcd.fyi’s post&lt;/a&gt; on “Which Generations Control the Senate” where the creator broke down the US Senate distribution by generations.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;senate_generations.png&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Upon seeing this visualization, my initial goal was to see whether certain generations’ trajectories were faster or slower than others and how that would shape our expectation of Senate control in the future. However, as that question expanded, and as I thought about how often we hear that the &lt;a href=&#34;https://gen.medium.com/why-is-congress-so-old-64f014a9d819&#34;&gt;Senate is old&lt;/a&gt; and doesn’t reflect the American population, I wanted to see whether or not that’s true.&lt;/p&gt;
&lt;p&gt;The purpose of this post is to determine &lt;strong&gt;when the US Senate most and least reflected the age distribution of the general US population&lt;/strong&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;getting-the-data&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Getting the Data&lt;/h2&gt;
&lt;p&gt;The data for this analysis will come from two primary sources. Information on the US Senators will come from the same &lt;a href=&#34;https://projects.propublica.org/api-docs/congress-api/&#34;&gt;ProPublica Congress API&lt;/a&gt; as the original visualization. Information on the US Population Age Distribution will come from a variety of sources from the &lt;a href=&#34;https://www.census.gov&#34;&gt;US Census Bureau&lt;/a&gt;.&lt;/p&gt;
&lt;div id=&#34;setting-up-the-libraries&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Setting up the libraries&lt;/h3&gt;
&lt;p&gt;While the workhorse functions for this analysis are the main &lt;code&gt;tidyverse&lt;/code&gt; data manipulation and visualization functions, I will be using &lt;code&gt;httr&lt;/code&gt; to access the Congress API and &lt;code&gt;tidycensus&lt;/code&gt; to access a subset of age distributions. Special shoutout to &lt;code&gt;readr&lt;/code&gt; for its various functions that help read the differently formatted files from the &lt;a href=&#34;https://www.census.gov&#34;&gt;Census Bureau&lt;/a&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidyverse) #Data Manipulation and Visualization
library(httr) #Accessing the ProPublica API
library(glue) #Manipulating Strings to Make API Calls Easier
library(lubridate) # Date Manipulation Functions
library(tidycensus) # Package for Accessing Census Data&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;getting-the-senate-data&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Getting the Senate Data&lt;/h3&gt;
&lt;p&gt;The data on the Senators comes from the &lt;a href=&#34;https://projects.propublica.org/api-docs/congress-api/&#34;&gt;ProPublica Congress API&lt;/a&gt;. According to its documentation you can retrieve a list of Senators for any congress from the 80th (1947) through 117th (2021). To get this data I’ll first write a function that takes in a congressional session and returns the desired data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;get_senate_data &amp;lt;- function(cngrs){
  
  # Issue request to API
  dt &amp;lt;- GET(url = glue(&amp;#39;https://api.propublica.org/congress/v1/{cngrs}/senate/members.json&amp;#39;),
            add_headers(&amp;quot;X-API-Key&amp;quot; = Sys.getenv(&amp;quot;PROPUBLICA_API_KEY&amp;quot;)))
  
  x &amp;lt;- content(dt)$results[[1]]$members %&amp;gt;% tibble(dt = .) %&amp;gt;% unnest_wider(dt) %&amp;gt;% 
    mutate(congress = cngrs,
           #The API only Contains 80th Congress Forward.  80th Congress was 1/1947
           start_year = (cngrs-80)*2 + 1947,
           # Use DOB to Infer Age
           age = as.numeric(ymd(paste(start_year, 01, 15, sep = &amp;#39;-&amp;#39;)) - ymd(date_of_birth))/365,
           # Bucket Age Using Conventional Census Buckets
           label = case_when(
             age &amp;lt;= 4 ~ &amp;#39;Under 5 years&amp;#39;,
             age &amp;lt;= 9 ~ &amp;#39;5 to 9 years&amp;#39;,
             age &amp;lt;= 14 ~ &amp;#39;10 to 14 years&amp;#39;,
             age &amp;lt;= 19 ~ &amp;#39;15 to 19 years&amp;#39;,
             age &amp;lt;= 24 ~ &amp;#39;20 to 24 years&amp;#39;,
             age &amp;lt;= 29 ~ &amp;#39;25 to 29 years&amp;#39;,
             age &amp;lt;= 34 ~ &amp;#39;30 to 34 years&amp;#39;,
             age &amp;lt;= 39 ~ &amp;#39;35 to 39 years&amp;#39;,
             age &amp;lt;= 44 ~ &amp;#39;40 to 44 years&amp;#39;,
             age &amp;lt;= 49 ~ &amp;#39;45 to 49 years&amp;#39;,
             age &amp;lt;= 54 ~ &amp;#39;50 to 54 years&amp;#39;,
             age &amp;lt;= 59 ~ &amp;#39;55 to 59 years&amp;#39;,
             age &amp;lt;= 64 ~ &amp;#39;60 to 64 years&amp;#39;,
             age &amp;lt;= 69 ~ &amp;#39;65 to 69 years&amp;#39;,
             age &amp;lt;= 74 ~ &amp;#39;70 to 74 years&amp;#39;,
             age &amp;lt;= 79 ~ &amp;#39;75 to 79 years&amp;#39;,
             age &amp;lt;= 84 ~ &amp;#39;80 to 84 years&amp;#39;,
             TRUE ~ &amp;#39;85 years&amp;#39;
         )
    )
  
  
  return(x)
  
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Some notes about this function:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The ProPublica API requires an API key that you need to register for. I’ve stored it in my .Renviron file so I can share the code without sharing my key.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;unnest_wider()&lt;/code&gt; function is part of a family of functions to help work with JSON output to turn lists of lists into more rectangular data.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;With the function in place, I can get all the Senate data with a single line that calls the API for each of the 38 Congresses and combines the results into a single tibble using &lt;code&gt;map_dfr&lt;/code&gt;, which applies the &lt;em&gt;get_senate_data&lt;/em&gt; function to each input (the numbers between 80 and 117).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;senate &amp;lt;- map_dfr(80:117, get_senate_data)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The API will return all of the Senators who appeared in that congressional session which, due to changes over the course of two years, can result in more than 2 senators per state. For simplicity, I’ll reduce the data to only the 2 senators who were there at the start of the congressional session. This is done using the heuristic that the Senators who were in place first will have smaller &lt;em&gt;govtrack_id&lt;/em&gt; numbers. Finally, senators without DOB information are removed.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;senate_clean &amp;lt;- senate %&amp;gt;%
  group_by(congress, state) %&amp;gt;%
  arrange(govtrack_id) %&amp;gt;% 
  slice(1:2) %&amp;gt;% 
  ungroup() %&amp;gt;% 
  filter(!is.na(date_of_birth))&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;getting-us-population-age-distributions-from-the-census-bureau&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Getting US Population Age Distributions from the Census Bureau&lt;/h3&gt;
&lt;p&gt;This process was a PITA. Since I wanted to match the coverage of the Senator data which ranged from 1947 through 2021, I needed to find US Population Age Distributions to match. While all this information was available on the Census website it comes from a combination of different files, file formats, and access methods. In summary:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;1947 - 1979:&lt;/strong&gt; Individual files per year that contain the population by each individual age from 0 to 84 and then 85+&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;1980 - 1989:&lt;/strong&gt; The entire decade exists in a &lt;a href=&#34;https://www2.census.gov/programs-surveys/popest/tables/1980-1990/state/asrh/s5yr8090.txt&#34;&gt;single fixed-width-file&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;1990 - 2000:&lt;/strong&gt; The entire decade exists in a &lt;a href=&#34;https://www2.census.gov/programs-surveys/popest/tables/1990-2000/national/totals/nat-agesex.txt&#34;&gt;single file&lt;/a&gt; but the format is too awful to deal with programmatically, so I rebuilt the file in Excel and used the &lt;code&gt;datapasta&lt;/code&gt; add-in to create the tibble.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;2001-2004:&lt;/strong&gt; Nicely existed in a &lt;a href=&#34;https://www2.census.gov/programs-surveys/popest/tables/2000-2005/national/asrh/nc-est2005-01.csv&#34;&gt;single file&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;2005-2019:&lt;/strong&gt; Retrieved from the American Community Survey (ACS) using the &lt;code&gt;tidycensus&lt;/code&gt; API.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;There were probably easier ways to get everything… but oh well. Since there’s a lot going on for these 5 sources, I’m not going to go into as much detail as I normally would in describing what’s happening, but it’s nothing too complicated.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;section&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;1947 - 1979&lt;/h3&gt;
&lt;p&gt;The process for reading these flat files isn’t too dissimilar from the process used on the ProPublica API. I write a function to handle an individual year and run &lt;code&gt;map_dfr&lt;/code&gt; on the list of years to create my data set. The one unique piece of this function is that the format of each year isn’t exactly the same, so it first reads the file to find where the data starts and then does the “official” read-in using the &lt;em&gt;skip&lt;/em&gt; parameter to start in the right place.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;get_1947_to_1979 &amp;lt;- function(yr){
  
  #Read In File
  c &amp;lt;- read_lines(glue(&amp;#39;https://www2.census.gov/programs-surveys/popest/tables/1900-1980/national/asrh/pe-11-{yr}.csv&amp;#39;))
  #Find where data starts
  c2 &amp;lt;- which(str_detect(c, &amp;#39;^0&amp;#39;))
  
  # Read in the actual file
  x &amp;lt;- suppressWarnings(read_csv(glue(&amp;#39;https://www2.census.gov/programs-surveys/popest/tables/1900-1980/national/asrh/pe-11-{yr}.csv&amp;#39;),
                skip = c2-2)) %&amp;gt;% 
    filter(!is.na(X2)) %&amp;gt;% 
    transmute(
      age = X1,
      population = X2,
      year = yr
    )
}

ages_1947_to_1979 &amp;lt;- map_dfr(1947:1979, get_1947_to_1979)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;section-1&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;1980 - 1989&lt;/h3&gt;
&lt;p&gt;The data for 1980 to 1989 comes from a single fixed-width file. To read it in, I use the &lt;code&gt;read_fwf&lt;/code&gt; function from &lt;code&gt;readr&lt;/code&gt;. It’s very similar to other &lt;code&gt;readr&lt;/code&gt; functions like &lt;code&gt;read_csv&lt;/code&gt;; the only difference is that you need to specify the positions of the data, which can be done in a wide variety of ways. Here I used &lt;code&gt;fwf_widths&lt;/code&gt; to tell the function how wide each column is and what to call each column.&lt;/p&gt;
&lt;p&gt;The file also contains information at a State level and contains sets for both genders combined, Males only, and Females only. The &lt;em&gt;rowid&lt;/em&gt; construction lets me pull out only the rows I need: the both-genders set and the rows with age segment data. Finally, the &lt;em&gt;group_by&lt;/em&gt; / &lt;em&gt;summarize&lt;/em&gt; aggregates the population over the State values.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ages_1980_to_1989 &amp;lt;- read_fwf(
  file = &amp;#39;https://www2.census.gov/programs-surveys/popest/tables/1980-1990/state/asrh/s5yr8090.txt&amp;#39;,
  fwf_widths(c(16, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 7),
             c(&amp;#39;Term&amp;#39;, &amp;#39;dropme&amp;#39;, &amp;#39;y1980&amp;#39;, &amp;#39;y1981&amp;#39;,&amp;#39;y1982&amp;#39;, &amp;#39;y1983&amp;#39;,&amp;#39;y1984&amp;#39;,
               &amp;#39;y1985&amp;#39;, &amp;#39;y1986&amp;#39;,&amp;#39;y1987&amp;#39;, &amp;#39;y1988&amp;#39;,&amp;#39;y1989&amp;#39;, &amp;#39;y1990&amp;#39;)),
  skip = 10
) %&amp;gt;% 
  mutate(rowid = row_number() %% 58) %&amp;gt;% 
  filter(rowid &amp;lt;= 20 &amp;amp; !rowid %in% c(0, 2, 1)) %&amp;gt;% 
  select(-dropme, -y1990, -rowid) %&amp;gt;% 
  gather(year, population, -Term) %&amp;gt;% 
  transmute(
    label = Term,
    year = as.numeric(str_remove_all(year, &amp;#39;y&amp;#39;)),
    population = as.numeric(population)
  ) %&amp;gt;% 
  group_by(label, year) %&amp;gt;% 
  summarize(population = sum(population), .groups = &amp;#39;drop&amp;#39;)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;section-2&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;1990 - 2000&lt;/h3&gt;
&lt;p&gt;The data for the 1990s comes from a &lt;a href=&#34;https://www2.census.gov/programs-surveys/popest/tables/1990-2000/national/totals/nat-agesex.txt&#34;&gt;single file&lt;/a&gt; in a very machine unfriendly format. Here I copied and pasted the data I needed into an Excel file and used &lt;code&gt;datapasta&lt;/code&gt; to copy it into R as a tibble. The wide-format data is then cleaned and turned into long-format data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ages_1990_to_2000 &amp;lt;- tibble::tribble(
                                   ~Age_Group,    ~y2000,    ~y1999,    ~y1998,    ~y1997,    ~y1996,    ~y1995,    ~y1994,    ~y1993,    ~y1992,    ~y1991,    ~y1990,
                       &amp;quot;Under 5 years.......&amp;quot;, 18945000L, 18942000L, 18989000L, 19099000L, 19292000L, 19532000L, 19700000L, 19674000L, 19492000L, 19189000L, 18853000L,
                       &amp;quot;5 to 9 years........&amp;quot;, 19681000L, 19947000L, 19929000L, 19754000L, 19439000L, 19096000L, 18752000L, 18442000L, 18293000L, 18205000L, 18062000L,
                       &amp;quot;10 to 14 years......&amp;quot;, 20017000L, 19548000L, 19242000L, 19097000L, 19004000L, 18853000L, 18716000L, 18508000L, 18102000L, 17679000L, 17198000L,
                       &amp;quot;15 to 19 years......&amp;quot;, 19894000L, 19748000L, 19542000L, 19146000L, 18708000L, 18203000L, 17743000L, 17375000L, 17180000L, 17235000L, 17765000L,
                       &amp;quot;20 to 24 years......&amp;quot;, 18693000L, 18026000L, 17678000L, 17488000L, 17508000L, 17982000L, 18389000L, 18785000L, 19047000L, 19156000L, 19135000L,
                       &amp;quot;25 to 29 years......&amp;quot;, 17625000L, 18209000L, 18575000L, 18820000L, 18933000L, 18905000L, 19107000L, 19570000L, 20140000L, 20713000L, 21236000L,
                       &amp;quot;30 to 34 years......&amp;quot;, 19564000L, 19727000L, 20168000L, 20739000L, 21313000L, 21825000L, 22133000L, 22227000L, 22240000L, 22157000L, 21912000L,
                       &amp;quot;35 to 39 years......&amp;quot;, 22044000L, 22545000L, 22615000L, 22636000L, 22553000L, 22296000L, 21978000L, 21605000L, 21098000L, 20530000L, 19982000L,
                       &amp;quot;40 to 44 years......&amp;quot;, 22769000L, 22268000L, 21883000L, 21378000L, 20812000L, 20259000L, 19716000L, 19209000L, 18807000L, 18761000L, 17795000L,
                       &amp;quot;45 to 49 years......&amp;quot;, 20059000L, 19356000L, 18853000L, 18467000L, 18430000L, 17458000L, 16678000L, 15931000L, 15359000L, 14099000L, 13824000L,
                       &amp;quot;50 to 54 years......&amp;quot;, 17626000L, 16446000L, 15722000L, 15158000L, 13928000L, 13642000L, 13195000L, 12728000L, 12055000L, 11648000L, 11370000L,
                       &amp;quot;55 to 59 years......&amp;quot;, 13452000L, 12875000L, 12403000L, 11755000L, 11356000L, 11086000L, 10931000L, 10678000L, 10483000L, 10422000L, 10474000L,
                       &amp;quot;60 to 64 years......&amp;quot;, 10757000L, 10514000L, 10263000L, 10061000L,  9997000L, 10046000L, 10077000L, 10236000L, 10438000L, 10581000L, 10619000L,
                       &amp;quot;65 to 69 years......&amp;quot;,  9414000L,  9447000L,  9592000L,  9777000L,  9901000L,  9926000L,  9967000L, 10013000L,  9974000L, 10027000L, 10077000L,
                       &amp;quot;70 to 74 years......&amp;quot;,  8758000L,  8771000L,  8798000L,  8751000L,  8789000L,  8831000L,  8736000L,  8616000L,  8468000L,  8244000L,  8023000L,
                       &amp;quot;75 to 79 years......&amp;quot;,  7425000L,  7329000L,  7215000L,  7083000L,  6891000L,  6700000L,  6586000L,  6483000L,  6398000L,  6280000L,  6147000L,
                       &amp;quot;80 to 84 years......&amp;quot;,  4968000L,  4817000L,  4732000L,  4661000L,  4575000L,  4478000L,  4360000L,  4255000L,  4140000L,  4039000L,  3935000L,
                       &amp;quot;85 to 89 years......&amp;quot;,  2734000L,  2625000L,  2554000L,  2477000L,  2415000L,  2352000L,  2300000L,  2247000L,  2178000L,  2104000L,  2051000L,
                       &amp;quot;90 to 94 years......&amp;quot;,  1196000L,  1148000L,  1116000L,  1078000L,  1043000L,  1017000L,   967000L,   916000L,   865000L,   827000L,   765000L,
                       &amp;quot;95 to 99 years......&amp;quot;,   369000L,   343000L,   323000L,   304000L,   291000L,   268000L,   250000L,   240000L,   231000L,   218000L,   206000L,
                       &amp;quot;100 years and over..&amp;quot;,    68000L,    59000L,    57000L,    54000L,    51000L,    48000L,    45000L,    43000L,    41000L,    40000L,    37000L
                       )  %&amp;gt;% 
  mutate(label = str_remove_all(Age_Group, &amp;#39;\\.&amp;#39;)) %&amp;gt;% 
  select(-Age_Group) %&amp;gt;% 
  gather(year, population, -label) %&amp;gt;% 
  mutate(year = as.numeric(str_remove_all(year, &amp;#39;y&amp;#39;)))&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;section-3&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;2001 - 2004&lt;/h3&gt;
&lt;p&gt;The &lt;a href=&#34;https://www2.census.gov/programs-surveys/popest/tables/2000-2005/national/asrh/nc-est2005-01.csv&#34;&gt;2001-2004 file&lt;/a&gt; is pretty similar to the 1980s file.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ages_2001_to_2004 &amp;lt;- read_csv(&amp;#39;https://www2.census.gov/programs-surveys/popest/tables/2000-2005/national/asrh/nc-est2005-01.csv&amp;#39;,
                              skip = 3) %&amp;gt;% 
  filter(between(row_number(), 2, 22)) %&amp;gt;% 
  gather(year, population, -X1) %&amp;gt;% 
  transmute(
    label = str_remove_all(X1, &amp;#39;\\.&amp;#39;),
    year = as.numeric(str_extract(year, &amp;#39;\\d{4}&amp;#39;)),
    population
  ) %&amp;gt;%
  filter(!is.na(year), between(year, 2001, 2004))&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;section-4&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;2005 - 2019&lt;/h3&gt;
&lt;p&gt;There’s probably a better way to do this, but my original plan was to find age buckets as granular as possible, and 2005 - 2019 was the first set of years I worked with. So I leveraged the &lt;code&gt;tidycensus&lt;/code&gt; package to access the data from the American Community Survey to get population estimates.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#Register API Key
census_api_key(Sys.getenv(&amp;quot;CENSUS_API_KEY&amp;quot;))

#Download Data Dictionary
vars &amp;lt;- load_variables(2019, &amp;#39;acs1&amp;#39;)

#Subset to Information For the Age Table
mapping &amp;lt;- vars %&amp;gt;% 
  filter(str_detect(name, &amp;#39;B01001_&amp;#39;))

# Define Function that Takes in a Year and Returns the Age Group Data
# Data provided at a State Level because I couldn&amp;#39;t figure out the 
# geography name for National.
get_2005_2019 &amp;lt;- function(yr){
  get_acs(
  geography = &amp;#39;state&amp;#39;,
  # Generate the 49 variable names B01001_001 through B01001_049
  variables = sprintf(&amp;#39;B01001_%03d&amp;#39;, 1:49),
  year = yr,
  survey = &amp;#39;acs1&amp;#39;
  ) %&amp;gt;% 
  mutate(year = yr) %&amp;gt;%
  inner_join(vars, by = c(&amp;quot;variable&amp;quot; = &amp;quot;name&amp;quot;)) %&amp;gt;% 
  filter(str_detect(label, &amp;quot;years&amp;quot;)) %&amp;gt;% 
  mutate(label = str_remove_all(label, &amp;quot;Estimate.*!!&amp;quot;))
}

# Download the Data from the API and Clean Up
ages_2005_to_2019 &amp;lt;- map_dfr(2005:2019, get_2005_2019) %&amp;gt;% 
  group_by(year, label) %&amp;gt;% 
  summarize(population = sum(estimate), .groups = &amp;#39;drop&amp;#39;)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;final-data-preparation&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Final Data Preparation&lt;/h2&gt;
&lt;p&gt;In addition to having different file formats, each of the files had different age groupings. They’re not wildly different from each other, but we need standardized groupings to carry out the analysis:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;all_years &amp;lt;- ages_1947_to_1979 %&amp;gt;% 
  mutate(
    age = parse_number(age),
    label = case_when(
      age &amp;lt;= 4 ~ &amp;#39;Under 5 years&amp;#39;,
      age &amp;lt;= 9 ~ &amp;#39;5 to 9 years&amp;#39;,
      age &amp;lt;= 14 ~ &amp;#39;10 to 14 years&amp;#39;,
      age &amp;lt;= 19 ~ &amp;#39;15 to 19 years&amp;#39;,
      age &amp;lt;= 24 ~ &amp;#39;20 to 24 years&amp;#39;,
      age &amp;lt;= 29 ~ &amp;#39;25 to 29 years&amp;#39;,
      age &amp;lt;= 34 ~ &amp;#39;30 to 34 years&amp;#39;,
      age &amp;lt;= 39 ~ &amp;#39;35 to 39 years&amp;#39;,
      age &amp;lt;= 44 ~ &amp;#39;40 to 44 years&amp;#39;,
      age &amp;lt;= 49 ~ &amp;#39;45 to 49 years&amp;#39;,
      age &amp;lt;= 54 ~ &amp;#39;50 to 54 years&amp;#39;,
      age &amp;lt;= 59 ~ &amp;#39;55 to 59 years&amp;#39;,
      age &amp;lt;= 64 ~ &amp;#39;60 to 64 years&amp;#39;,
      age &amp;lt;= 69 ~ &amp;#39;65 to 69 years&amp;#39;,
      age &amp;lt;= 74 ~ &amp;#39;70 to 74 years&amp;#39;,
      age &amp;lt;= 79 ~ &amp;#39;75 to 79 years&amp;#39;,
      age &amp;lt;= 84 ~ &amp;#39;80 to 84 years&amp;#39;,
      TRUE ~ &amp;#39;85 years&amp;#39;
    )
  ) %&amp;gt;% 
  group_by(year, label) %&amp;gt;% 
  summarize(population = sum(population), .groups = &amp;#39;drop&amp;#39;) %&amp;gt;% 
  rbind(ages_1980_to_1989) %&amp;gt;% 
  rbind(
    ages_1990_to_2000 %&amp;gt;% 
      rbind(ages_2001_to_2004) %&amp;gt;% 
      mutate(
        label = if_else(label %in% c(&amp;#39;85 to 89 years&amp;#39;,
                                     &amp;#39;90 to 94 years&amp;#39;,
                                     &amp;#39;95 to 99 years&amp;#39;,
                                     &amp;#39;100 years and over&amp;#39;),
                        &amp;#39;85 years&amp;#39;,
                        label
        )
      )
  ) %&amp;gt;% 
  rbind(
    ages_2005_to_2019 %&amp;gt;% 
      mutate(label = case_when(
        label %in% c(&amp;quot;15 to 17 years&amp;quot;, &amp;quot;18 and 19 years&amp;quot;) ~ &amp;quot;15 to 19 years&amp;quot;,
        label %in% c(&amp;quot;20 years&amp;quot;, &amp;quot;21 years&amp;quot;, &amp;quot;22 to 24 years&amp;quot;) ~ &amp;quot;20 to 24 years&amp;quot;,
        label %in% c(&amp;quot;60 and 61 years&amp;quot;, &amp;quot;62 to 64 years&amp;quot;) ~ &amp;quot;60 to 64 years&amp;quot;,
        label %in% c(&amp;quot;65 and 66 years&amp;quot;, &amp;quot;67 to 69 years&amp;quot;) ~ &amp;quot;65 to 69 years&amp;quot;,
        label == &amp;#39;85 years and over&amp;#39; ~ &amp;#39;85 years&amp;#39;,
        TRUE ~ label
        )
      )
  ) %&amp;gt;% 
  group_by(year, label) %&amp;gt;%
  summarize(population = sum(population), .groups = &amp;#39;drop&amp;#39;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;By law, a US Senator must be at least 30 years old (technically this wasn’t always true, as 4 US Senators served in their late 20s, but those were all in the early 1800s and so out of scope for this analysis). To create a comparable population, I’ll limit the US population data to those 30 and older and create the share of 30+ population by age:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;eligible_age_bckt &amp;lt;- all_years %&amp;gt;% 
  filter(parse_number(label) &amp;gt;= 30) %&amp;gt;%
  add_count(year, wt = population, name = &amp;#39;total_population&amp;#39;) %&amp;gt;% 
  mutate(pct = population / total_population)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I’ll summarize the Senate data by the same groupings and create the % of Senators by age:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;senate_age_bckt &amp;lt;- senate_clean %&amp;gt;%
  count(start_year, label, name = &amp;#39;num_senators&amp;#39;) %&amp;gt;% 
  add_count(start_year, wt = num_senators, name = &amp;quot;total_senators&amp;quot;) %&amp;gt;% 
  mutate(pct = num_senators / total_senators)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Finally, we’ll complete the data building steps by stacking the US population data and Senate data on top of each other:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pop_senate_merged &amp;lt;- 
  senate_age_bckt %&amp;gt;% 
  transmute(
    year = start_year, label, pct, grp = &amp;quot;Senators&amp;quot;
  ) %&amp;gt;% 
  rbind(eligible_age_bckt %&amp;gt;% 
          transmute(year, label, pct, grp = &amp;quot;US Pop. Over 30&amp;quot;))
  

knitr::kable(head(pop_senate_merged, 3))&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;right&#34;&gt;year&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;label&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;pct&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;grp&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;right&#34;&gt;1947&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;35 to 39 years&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0315789&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Senators&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;right&#34;&gt;1947&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;40 to 44 years&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0947368&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Senators&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;right&#34;&gt;1947&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;45 to 49 years&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.0842105&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Senators&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;
&lt;div id=&#34;looking-at-similarity-of-senate-vs.-us-population&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Looking At Similarity of Senate vs. US Population&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Now onto the main course!&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Our goal is to determine when the distribution of ages in the Senate is most similar / dissimilar to the distribution of ages in the US over-30 population. There are many different ways to calculate similarity, but I’m going to use &lt;em&gt;mean absolute difference&lt;/em&gt; because it’s simple and the results are pretty similar to other methods I tried.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;dist_measures &amp;lt;- pop_senate_merged %&amp;gt;% 
  #Convert from Long Format to Wide Format
  spread(grp, pct) %&amp;gt;% 
  # Replace NAs with 0s 
  replace_na(list(Senators = 0, `US Pop. Over 30` = 0)) %&amp;gt;% 
  # Calculate Absolute Difference per Age Bucket
  mutate(distance = abs(Senators - `US Pop. Over 30`)) %&amp;gt;% 
  # Limit to Only Odd Years To Align with Congressional Sessions
  # There isn&amp;#39;t 2021 Data in the Census Data
  filter(year %% 2 == 1, year != 2021) %&amp;gt;%
  # Average the Absolute Deviations
  group_by(year) %&amp;gt;% 
  summarize(distance = mean(distance))&lt;/code&gt;&lt;/pre&gt;
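&lt;p&gt;To make the metric concrete, here’s a tiny worked example with made-up shares (not the real data) for three age buckets in a single year; the distance is just the average gap between the two distributions:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Hypothetical shares for three age buckets in one year
senators &amp;lt;- c(0.05, 0.40, 0.55)
us_pop   &amp;lt;- c(0.30, 0.40, 0.30)

# Mean absolute difference between the two distributions
mean(abs(senators - us_pop))
#&amp;gt; [1] 0.1666667&lt;/code&gt;&lt;/pre&gt;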
&lt;p&gt;Then the dissimilarity over time can be plotted:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;dist_measures %&amp;gt;% 
  ggplot(aes(x = year, y = distance)) + 
    geom_line(lwd = 1.5, color = &amp;#39;blue&amp;#39;) + 
    scale_x_continuous(breaks = seq(1950, 2020, 10)) + 
    labs(x = &amp;quot;Year&amp;quot;, y = &amp;quot;Distance between Senate and US Pop&amp;quot;, 
         title = &amp;quot;When was the US Senate Most/Least Representative of the US Population?&amp;quot;,
         subtitle = &amp;quot;1947 - 2019&amp;quot;) + 
    cowplot::theme_cowplot()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;index_files/figure-html/plot-1.png&#34; width=&#34;960&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Based on the above, the most representative era for the US Senate was in the 70s, when the distance was minimized, while the least representative time was in the late 80s/early 90s. The three most representative years are 1971, 1979, and 1973, while the least representative years are 1989, 1993, and 1991. What was surprising is that the present Senate is actually more representative than in the 90s and about on the level it was in the 60s.&lt;/p&gt;
&lt;p&gt;To get a better idea of what makes these years representative or non-representative we can look at the distributions for the most similar year, 1971, the most dissimilar year, 1989, and the most recent year available, 2019.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pop_senate_merged %&amp;gt;% 
  filter(year %in% c(1971, 1989, 2019)) %&amp;gt;% 
  ggplot(aes(x = grp, y = pct, fill = fct_rev(label))) + 
  geom_col() + 
  geom_text(aes(label =if_else(pct &amp;gt; .01,
                               paste(label, pct %&amp;gt;% scales::percent(accuracy = 1), sep = &amp;#39;: &amp;#39;), &amp;quot;&amp;quot;)),
            position = position_stack(vjust = .5)) + 
  scale_fill_discrete(guide = F) + 
  scale_x_discrete(expand = c(0, 0)) + 
  scale_y_continuous(expand = c(0, 0), labels = scales::percent_format(),
                     breaks = seq(0, 1, .2)) + 
  facet_wrap(~year, nrow = 1) + 
  labs(title = &amp;quot;Difference in US Senate Age Distribution vs. US Population&amp;quot;,
       subtitle = &amp;quot;1971 (Most Similar),  1989 (Most Different), 2019 (Most Recent)&amp;quot;,
       x = &amp;quot;&amp;quot;,
       y = &amp;quot;% of Group&amp;quot;) + 
  cowplot::theme_cowplot()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;index_files/figure-html/drill-down-1.png&#34; width=&#34;960&#34; /&gt;&lt;/p&gt;
&lt;p&gt;While the Senate never represented the 30-45 population well, in 1971 the distributions were closer, with 15% of Senators vs. 35% of the population. This is &lt;strong&gt;much&lt;/strong&gt; closer than in 1989, when this group made up 4% of Senators vs. 43% of the population, and closer than today (2019), when it’s 3% of Senators vs. 32% of the population.&lt;/p&gt;
&lt;p&gt;Finally, between 1989 and 2019 it looks like the glut of Senators who were between 45 and 60 in 1989 (66% of the Senate vs. 26% of the population) has hung around: in 2019 this group would be 65 to 80, which still makes up 44% of the Senate vs. 21% of the US population.&lt;/p&gt;
&lt;p&gt;So while this &lt;strong&gt;is the oldest Senate we’ve ever had&lt;/strong&gt; its not the most non-representative to the US Population as the population has gotten older too.&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Predicting the Winner of Super Bowl LV</title>
      <link>https://jlaw.netlify.app/2021/01/07/predicting-the-winner-of-super-bowl-lv/</link>
      <pubDate>Thu, 07 Jan 2021 00:00:00 +0000</pubDate>
      <guid>https://jlaw.netlify.app/2021/01/07/predicting-the-winner-of-super-bowl-lv/</guid>
      <description>


&lt;div id=&#34;tldr&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;TL;DR&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Using Pythagorean expectation we should expect the Baltimore Ravens to be Super Bowl Champions&lt;/li&gt;
&lt;li&gt;Using a Bradley-Terry model we should expect the Kansas City Chiefs to be Super Bowl champions&lt;/li&gt;
&lt;li&gt;Seems like it will be a good year for the AFC&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;It’s Playoff Time in the NFL!&lt;/strong&gt; While my team has unfortunately missed the playoffs, I wanted to take advantage of the season to try to predict who will win the Super Bowl this year through two different mechanisms:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Pythagorean Expectation&lt;/li&gt;
&lt;li&gt;Simulation using Bradley-Terry Models&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;div id=&#34;getting-the-data&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Getting the Data&lt;/h2&gt;
&lt;p&gt;While more historical data would ideally be available, I’m going to keep this exercise quick and dirty by using only the data from the 2020 NFL Regular Season, which recently concluded. Data for this season can be easily imported using the &lt;code&gt;nflfastR&lt;/code&gt; package. Using the &lt;code&gt;fast_scraper_schedules&lt;/code&gt; function, I can quickly get all the games and their results for the 2020 season.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidyverse)
library(nflfastR)
library(scales)

#Get Season 2020 Schedule and results
nfl_games &amp;lt;- fast_scraper_schedules(2020) %&amp;gt;% 
  #Weeks Beyond Week 17 Are the Playoffs
  filter(week &amp;lt;= 17)

knitr::kable(head(nfl_games, 3))&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;colgroup&gt;
&lt;col width=&#34;8%&#34; /&gt;
&lt;col width=&#34;3%&#34; /&gt;
&lt;col width=&#34;5%&#34; /&gt;
&lt;col width=&#34;2%&#34; /&gt;
&lt;col width=&#34;6%&#34; /&gt;
&lt;col width=&#34;4%&#34; /&gt;
&lt;col width=&#34;4%&#34; /&gt;
&lt;col width=&#34;5%&#34; /&gt;
&lt;col width=&#34;5%&#34; /&gt;
&lt;col width=&#34;6%&#34; /&gt;
&lt;col width=&#34;6%&#34; /&gt;
&lt;col width=&#34;6%&#34; /&gt;
&lt;col width=&#34;12%&#34; /&gt;
&lt;col width=&#34;4%&#34; /&gt;
&lt;col width=&#34;4%&#34; /&gt;
&lt;col width=&#34;4%&#34; /&gt;
&lt;col width=&#34;6%&#34; /&gt;
&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;game_id&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;season&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;game_type&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;week&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;gameday&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;weekday&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;gametime&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;away_team&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;home_team&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;away_score&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;home_score&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;home_result&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;stadium&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;location&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;roof&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;surface&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;old_game_id&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2020_01_HOU_KC&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2020&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;REG&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2020-09-10&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Thursday&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;20:20&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;HOU&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;KC&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;20&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;34&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;14&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Arrowhead Stadium&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Home&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;outdoors&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NA&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2020091000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2020_01_SEA_ATL&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2020&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;REG&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2020-09-13&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Sunday&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;13:00&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;SEA&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;ATL&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;38&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;25&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;-13&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Mercedes-Benz Stadium&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Home&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NA&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NA&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2020091300&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2020_01_CLE_BAL&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2020&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;REG&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2020-09-13&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Sunday&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;13:00&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;CLE&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;BAL&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;6&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;38&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;32&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;M&amp;amp;T Bank Stadium&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Home&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;outdoors&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NA&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2020091301&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The package returned both the data I’m looking for and a lot of additional data that could be used if necessary (day of week, dome vs. outdoors, etc.).&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;method-1-pythagorean-expectation&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Method 1: Pythagorean expectation&lt;/h2&gt;
&lt;p&gt;&lt;a href=&#34;https://en.wikipedia.org/wiki/Pythagorean_expectation#Use_in_pro_football&#34;&gt;Pythagorean expectation&lt;/a&gt; was developed by Bill James for Baseball and estimates the % of games that a team “should win” based on runs scored and runs allowed.&lt;/p&gt;
&lt;p&gt;It was adapted for Pro Football by Football Outsiders to use the following formula:&lt;/p&gt;
&lt;center&gt;
&lt;img src=&#34;formula.PNG&#34; /&gt;
&lt;/center&gt;
&lt;p&gt;The Football Outsiders Almanac in 2011 stated that “From 1988 through 2004, 11 of 16 Super Bowls were won by the team that led the NFL in Pythagorean wins, while only seven were won by the team with the most actual victories.”&lt;/p&gt;
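&lt;p&gt;As a quick sanity check of the formula (the 2.37 exponent and the 16-game scaling are the only inputs), here is a minimal sketch in base R; the &lt;code&gt;pyth_exp&lt;/code&gt; helper name is my own, not part of any package:&lt;/p&gt;

```r
# Hypothetical helper: Pythagorean expectation for football
# (Football Outsiders exponent 2.37, scaled to a 16-game season)
pyth_exp = function(pf, pa) {
  pf^2.37 / (pf^2.37 + pa^2.37) * 16
}

# Baltimore's 2020 regular season: 468 points for, 303 against
round(pyth_exp(468, 303), 1)  # 11.8 expected wins
```

&lt;p&gt;This matches Baltimore’s expected-wins figure computed from the full schedule data below.&lt;/p&gt;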
&lt;p&gt;A little data manipulation is needed to get the NFL schedule data into a format suitable for calculating the Pythagorean expectation: most notably, splitting each game into two rows of data to capture information on both the home and away teams.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;p_wins &amp;lt;- nfl_games %&amp;gt;% 
  pivot_longer(
    cols = c(contains(&amp;#39;team&amp;#39;)),
    names_to = &amp;quot;category&amp;quot;,
    values_to = &amp;#39;team&amp;#39;
  ) %&amp;gt;% 
  mutate(points_for = (category==&amp;#39;home_team&amp;#39;)*home_score+
           (category==&amp;#39;away_team&amp;#39;)*away_score,
         points_against = (category==&amp;#39;away_team&amp;#39;)*home_score+
           (category==&amp;#39;home_team&amp;#39;)*away_score
  ) %&amp;gt;% 
  group_by(team) %&amp;gt;%
  summarize(pf = sum(points_for, na.rm = T),
            pa = sum(points_against, na.rm = T),
            actual_wins = sum(points_for &amp;gt; points_against, na.rm = T),
            .groups = &amp;#39;drop&amp;#39;
  ) %&amp;gt;% 
  mutate(p_expectation = pf^2.37/(pf^2.37+pa^2.37)*16)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;By pythagorean expectation the top 3 teams in the NFL are:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;center&#34;&gt;team&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;points_for&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;points_against&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;actual_wins&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;expected_wins&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;BAL&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;468&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;303&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;11&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;11.8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;NO&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;482&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;337&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;12&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;11.2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;TB&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;492&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;355&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;11&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;10.9&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;According to Pythagorean expectation, the &lt;strong&gt;Baltimore Ravens&lt;/strong&gt; are the best team in the NFL, while the formula says that the Kansas City Chiefs, the team with the most actual wins, “&lt;em&gt;should&lt;/em&gt;” have had only 10.5 wins vs. the 14 they actually had.&lt;/p&gt;
&lt;div id=&#34;an-aside-who-outkicked-their-coverage&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;An aside: Who “outkicked their coverage”?&lt;/h3&gt;
&lt;p&gt;The concept of “Expected Wins” lets us see who outperformed vs. underperformed their expectation. The following plot shows actual wins on the x-axis and expected wins on the y-axis.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(ggrepel)
p_wins %&amp;gt;% 
  mutate(diff_from_exp = actual_wins - p_expectation) %&amp;gt;% 
  ggplot(aes(x = actual_wins, y = p_expectation, fill = diff_from_exp)) + 
    geom_label_repel(aes(label = team)) + 
    geom_abline(lty = 2) + 
    annotate(&amp;quot;label&amp;quot;, x = 1, y = 10, hjust = &amp;#39;left&amp;#39;, label = &amp;quot;Underachievers&amp;quot;) +
    annotate(&amp;quot;label&amp;quot;, x = 10, y = 5, hjust = &amp;#39;left&amp;#39;, label = &amp;quot;Overachievers&amp;quot;) +
    labs(x = &amp;quot;Actual Wins&amp;quot;, y = &amp;quot;Expected Wins&amp;quot;, 
         title = &amp;quot;What NFL Teams Over/Under Performed?&amp;quot;, 
         caption = &amp;quot;Expected Wins Based on Pythagorean Expectation&amp;quot;) + 
    scale_fill_gradient2(guide = F) + 
    cowplot::theme_cowplot()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;index_files/figure-html/unnamed-chunk-2-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;The largest over-achievers appear to be Kansas City and Cleveland, while the largest under-achievers were Atlanta and Jacksonville.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;method-2-simulation-with-bradley-terry-models&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Method #2: Simulation with Bradley-Terry Models&lt;/h2&gt;
&lt;p&gt;&lt;a href=&#34;https://en.wikipedia.org/wiki/Bradley%E2%80%93Terry_model&#34;&gt;Bradley-Terry Models&lt;/a&gt; are probability models for predicting the outcomes of paired comparisons (such as sporting events or ranking items in a competition).&lt;/p&gt;
&lt;p&gt;In this case, the goal is to predict the winner of Super Bowl LV. I’ll use regular season data to estimate “ability parameters” for each team and then use those parameters to run simulations to estimate the winners of the NFL playoff match-ups.&lt;/p&gt;
&lt;p&gt;The Bradley-Terry Model can be fit using the &lt;code&gt;BradleyTerry2&lt;/code&gt; package.&lt;/p&gt;
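&lt;p&gt;Under the hood, the Bradley-Terry model says the log-odds of team &lt;em&gt;i&lt;/em&gt; beating team &lt;em&gt;j&lt;/em&gt; is the difference in their ability parameters. A minimal sketch (my own illustration, not a function from the package):&lt;/p&gt;

```r
# The Bradley-Terry win probability is the inverse logit of the
# difference in ability parameters (bt_prob is a hypothetical helper)
bt_prob = function(ability_i, ability_j) {
  plogis(ability_i - ability_j)
}

# Evenly matched teams win half the time
bt_prob(0, 0)  # 0.5

# An ability gap of about 1.06 (Baltimore vs. the reference team in the
# abilities table below) implies roughly a 74% win probability
round(bt_prob(1.06, 0), 2)  # 0.74
```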
&lt;div id=&#34;step-1-reshaping-the-data&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Step 1: Reshaping the Data&lt;/h3&gt;
&lt;p&gt;The &lt;code&gt;BradleyTerry2&lt;/code&gt; package can take data in a number of different ways, but it is opinionated about the structure, so we’ll need to reshape the data into a format the package wants.&lt;/p&gt;
&lt;p&gt;Specifically, it can take in data similar to how &lt;code&gt;glm()&lt;/code&gt; can use counts to fit a logistic regression. In this case it would be similar to:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;BTm(cbind(win1, win2), team1, team2, ~ team, id = &amp;quot;team&amp;quot;, data = sports.data)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The inclusion of only team in the formula means that only the “team” factors are used to estimate abilities. Other predictors, such as a home-field advantage, could be added, but considering the nature of the 2020 season I’m going to assume there was no home-field advantage. The &lt;code&gt;id=&#34;team&#34;&lt;/code&gt; portion of the formula tells the function how to label factors in the output. For example, the team “NYG” will become the “teamNYG” predictor.&lt;/p&gt;
&lt;p&gt;Given the nature of the NFL schedule, there shouldn’t be any repeated Home/Away combinations, but to be sure we can &lt;code&gt;group_by()&lt;/code&gt; and &lt;code&gt;summarize()&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Since the package used for modeling requires that each team variable has the same factor levels, I’ll recode &lt;code&gt;home_team&lt;/code&gt; and &lt;code&gt;away_team&lt;/code&gt; with new levels.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#Get List of All Teams
all_teams &amp;lt;- sort(unique(nfl_games$home_team))

nfl_shaped &amp;lt;- nfl_games %&amp;gt;%
  mutate(
    home_team = factor(home_team, levels = all_teams),
    away_team = factor(away_team, levels = all_teams),
    home_wins = if_else(home_score &amp;gt; away_score, 1, 0),
    away_wins = if_else(home_score &amp;lt; away_score, 1, 0) 
  ) %&amp;gt;% 
  group_by(home_team, away_team) %&amp;gt;% 
  summarize(home_wins = sum(home_wins),
            away_wins = sum(away_wins),
            .groups= &amp;#39;drop&amp;#39;) 

knitr::kable(head(nfl_shaped, 3), align = &amp;#39;c&amp;#39;)&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;center&#34;&gt;home_team&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;away_team&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;home_wins&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;away_wins&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;ARI&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;BUF&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;ARI&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;DET&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;ARI&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;LA&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;
&lt;div id=&#34;step-2-fitting-the-bradley-terry-model&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Step 2: Fitting the Bradley-Terry Model&lt;/h3&gt;
&lt;p&gt;The Bradley-Terry model can be fit similar to how other models like &lt;code&gt;glm()&lt;/code&gt; are fit. By default, the first factor alphabetically becomes the reference factor and takes a coefficient of zero. All other coefficients are relative to that factor.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(BradleyTerry2)
base_model &amp;lt;- BTm(cbind(home_wins, away_wins), home_team, away_team,
                  data = nfl_shaped, id = &amp;quot;team&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;summary()&lt;/code&gt; function will provide information on residuals, coefficients, and statistical significance, but for brevity, I’ll skip that output.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;step-3-extracting-the-team-abilities&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Step 3: Extracting the Team Abilities&lt;/h3&gt;
&lt;p&gt;While the package contains a &lt;code&gt;BTabilities()&lt;/code&gt; function to extract the abilities and their standard errors, the &lt;code&gt;qvcalc()&lt;/code&gt; function will output abilities along with quasi-standard errors. The advantage of quasi-standard errors is that for the reference category the ability estimate and standard error from &lt;code&gt;BTabilities()&lt;/code&gt; are both 0, while the quasi-standard error is non-zero. The use of quasi-standard errors allows for any pairwise comparison.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;base_abilities &amp;lt;- qvcalc(BTabilities(base_model)) %&amp;gt;% 
  .[[&amp;quot;qvframe&amp;quot;]] %&amp;gt;% 
  as_tibble(rownames = &amp;#39;team&amp;#39;) %&amp;gt;% 
  janitor::clean_names()

knitr::kable(base_abilities %&amp;gt;% 
               mutate(across(where(is.numeric), round, 2)) %&amp;gt;% 
               head(3),
             align = &amp;#39;c&amp;#39;)&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;center&#34;&gt;team&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;estimate&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;se&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;quasi_se&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;quasi_var&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;ARI&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.00&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.00&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.57&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.32&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;ATL&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;-0.91&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.88&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.64&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.41&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;BAL&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1.06&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.89&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.65&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;0.42&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;step-4-simulating-playoff-matchups&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Step 4: Simulating Playoff Matchups&lt;/h2&gt;
&lt;p&gt;To determine each team’s likelihood of winning a match-up, I run 1,000 simulations drawing from a normal distribution of the ability scores, using each team’s ability estimate and quasi-standard error as parameters. The percent of those 1,000 simulations won by each team represents its likelihood of winning that match-up.&lt;/p&gt;
&lt;p&gt;To generate the 1,000 simulations I use the &lt;code&gt;tidyr::crossing()&lt;/code&gt; function to replicate each row 1,000 times and then use dplyr to summarize over all simulations.&lt;/p&gt;
&lt;p&gt;Since running this for any arbitrary combination of teams isn’t too time consuming, I’ll generate every combination of playoff team across the NFC and AFC even though at least half of these comparisons will be impossible in practice.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;playoff_teams = c(&amp;#39;BAL&amp;#39;, &amp;#39;BUF&amp;#39;, &amp;#39;CHI&amp;#39;, &amp;#39;CLE&amp;#39;, &amp;#39;GB&amp;#39;, &amp;#39;IND&amp;#39;, &amp;#39;KC&amp;#39;, &amp;#39;LA&amp;#39;, &amp;#39;NO&amp;#39;,
                  &amp;#39;PIT&amp;#39;, &amp;#39;SEA&amp;#39;, &amp;#39;TB&amp;#39;, &amp;#39;TEN&amp;#39;, &amp;#39;WAS&amp;#39;)

comparisons &amp;lt;- base_abilities %&amp;gt;% 
  filter(team %in% playoff_teams)

#Generate All Potential Combination of Playoff Teams
comparisons &amp;lt;- comparisons %&amp;gt;% 
  rename_with(~paste0(&amp;quot;t1_&amp;quot;, .x)) %&amp;gt;% 
  crossing(comparisons %&amp;gt;% rename_with(~paste0(&amp;quot;t2_&amp;quot;, .x)))  %&amp;gt;% 
  filter(t1_team != t2_team)

#Run 1000 Simulations per comparison
set.seed(20210107)

#Draw from Ability Distribution
simulations &amp;lt;- comparisons %&amp;gt;% 
  crossing(simulation = 1:1000) %&amp;gt;% 
  mutate(
    t1_val = rnorm(n(), t1_estimate, t1_quasi_se),
    t2_val = rnorm(n(), t2_estimate, t2_quasi_se),
    t1_win = t1_val &amp;gt; t2_val,
    t2_win = t2_val &amp;gt; t1_val
  )

#Roll up the 1000 Results
sim_summary &amp;lt;- simulations %&amp;gt;% 
  group_by(t1_team, t2_team, t1_estimate, t2_estimate) %&amp;gt;% 
  summarize(t1_wins_pct = mean(t1_win), #Long-Term Average Winning % for Team 1
            t2_wins_pct = mean(t2_win), #Long-Term Average Winning % for Team 2
            .groups = &amp;#39;drop&amp;#39;) %&amp;gt;% 
  mutate(
    #Create a label for the winner
    winner = if_else(t1_wins_pct &amp;gt; t2_wins_pct, t1_team, t2_team)
  )&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;step-5-and-the-winner-is.&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Step 5: And the winner is….&lt;/h2&gt;
&lt;p&gt;Now that we have all potential combinations, we can step through each of the games on the schedule to determine the likelihood of winning each match-up. For rounds after the initial wild-card round, the teams are re-seeded so the #1 seed plays whatever the lowest winning seed is (which can be anywhere from #4 to #7). While initially I wanted to look at each team’s overall likelihood of winning the Super Bowl, I couldn’t quite figure out how to easily determine the probability of each scenario given the re-seeding process. So I will just step through each round based on the results of the previous round.&lt;/p&gt;
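&lt;p&gt;The re-seeding rule itself is easy to sketch: after each round, sort the surviving seeds and pair the best remaining seed with the worst. This &lt;code&gt;reseed&lt;/code&gt; helper is purely hypothetical, to illustrate the rule:&lt;/p&gt;

```r
# Hypothetical helper: pair the best remaining seed with the worst,
# second-best with second-worst, etc.
reseed = function(seeds) {
  s = sort(seeds)
  n = length(s)
  data.frame(
    high_seed = head(s, n %/% 2),
    low_seed  = rev(tail(s, n %/% 2))
  )
}

# Seeds 1, 2, 3, and 5 survive the wild-card round:
reseed(c(1, 2, 3, 5))
#   high_seed low_seed
# 1         1        5
# 2         2        3
```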
&lt;p&gt;For simplicity I define a function to take in the two teams and return the win probabilities from the simulations above.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;winners &amp;lt;- function(t1, t2){
  dt = sim_summary %&amp;gt;% filter(t1_team == t1 &amp;amp; t2_team == t2) %&amp;gt;% 
    inner_join(
      nflfastR::teams_colors_logos %&amp;gt;% 
        filter(team_abbr == t1) %&amp;gt;% 
        select(t1_team = team_abbr, t1_name = team_name),
      by = &amp;quot;t1_team&amp;quot;
    ) %&amp;gt;% 
    inner_join(
      nflfastR::teams_colors_logos %&amp;gt;% 
        filter(team_abbr == t2) %&amp;gt;% 
        select(t2_team = team_abbr, t2_name = team_name),
      by = &amp;quot;t2_team&amp;quot;
    )
  
  return(
     list(
       team1 = dt$t1_name,
       team1_prob = dt$t1_wins_pct,
       team2 = dt$t2_name,
       team2_prob = dt$t2_wins_pct,
       winner = if_else(dt$winner == dt$t1_team, dt$t1_name, dt$t2_name)
     )
  )
}&lt;/code&gt;&lt;/pre&gt;
&lt;div id=&#34;nfc&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;NFC&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;&lt;em&gt;Wild-Card Round&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;#2. New Orleans Saints (95%) vs. #7. Chicago Bears (5%)&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;em&gt;Winner:&lt;/em&gt;&lt;/strong&gt; &lt;em&gt;New Orleans Saints&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;#3. Seattle Seahawks (71%) vs. #6. Los Angeles Rams (29%)&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;em&gt;Winner:&lt;/em&gt;&lt;/strong&gt; &lt;em&gt;Seattle Seahawks&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;#4. Washington Football Team (4%) vs. #5. Tampa Bay Buccaneers (96%)&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;em&gt;Winner:&lt;/em&gt;&lt;/strong&gt; &lt;em&gt;Tampa Bay Buccaneers&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;em&gt;Divisional Round&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;#1. Green Bay Packers (66%) vs. #5. Tampa Bay Buccaneers (34%)&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;em&gt;Winner:&lt;/em&gt;&lt;/strong&gt; &lt;em&gt;Green Bay Packers&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;#2. New Orleans Saints (60%) vs. #3. Seattle Seahawks (40%)&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;em&gt;Winner:&lt;/em&gt;&lt;/strong&gt; &lt;em&gt;New Orleans Saints&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;em&gt;NFC Championship Game&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;#1. Green Bay Packers (55%) vs. #2. New Orleans Saints (45%)&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The Green Bay Packers are heading to the Super Bowl!&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;afc&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;AFC&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;&lt;em&gt;Wild-Card Round&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;#2. Buffalo Bills (91%) vs. #7. Indianapolis Colts (9%)&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;em&gt;Winner:&lt;/em&gt;&lt;/strong&gt; &lt;em&gt;Buffalo Bills&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;#3. Pittsburgh Steelers (68%) vs. #6. Cleveland Browns (32%)&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;em&gt;Winner:&lt;/em&gt;&lt;/strong&gt; &lt;em&gt;Pittsburgh Steelers&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;#4. Tennessee Titans (47%) vs. #5. Baltimore Ravens (53%)&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;em&gt;Winner:&lt;/em&gt;&lt;/strong&gt; &lt;em&gt;Baltimore Ravens&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;em&gt;Divisional Round&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;#1. Kansas City Chiefs (89%) vs. #5. Baltimore Ravens (11%)&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;em&gt;Winner:&lt;/em&gt;&lt;/strong&gt; &lt;em&gt;Kansas City Chiefs&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;#2. Buffalo Bills (76%) vs. #3. Pittsburgh Steelers (24%)&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;em&gt;Winner:&lt;/em&gt;&lt;/strong&gt; &lt;em&gt;Buffalo Bills&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;em&gt;AFC Championship Game&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;#1. Kansas City Chiefs (64%) vs. #2. Buffalo Bills (36%)&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The Kansas City Chiefs are headed to the Super Bowl!&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;super-bowl-lv&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Super Bowl LV&lt;/h3&gt;
&lt;p&gt;&lt;strong&gt;#1. Green Bay Packers (18%) vs. #1. Kansas City Chiefs (82%)&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Apparently the NFC and AFC alternate as the designated home team, and since the Chiefs were the home team in Super Bowl LIV, the NFC representative will be the home team in Super Bowl LV.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;your-super-bowl-lv-champions-the-kansas-city-chiefs&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Your Super Bowl LV Champions… the Kansas City Chiefs&lt;/h3&gt;
&lt;/div&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>7 Tricks I Learned During Advent of Code 2020</title>
      <link>https://jlaw.netlify.app/2020/12/28/7-tricks-i-learned-during-advent-of-code-2020/</link>
      <pubDate>Mon, 28 Dec 2020 00:00:00 +0000</pubDate>
      <guid>https://jlaw.netlify.app/2020/12/28/7-tricks-i-learned-during-advent-of-code-2020/</guid>
      <description>


&lt;p&gt;I got into the &lt;a href=&#34;https://adventofcode.com/&#34;&gt;Advent of Code&lt;/a&gt; for the first time this year through some co-workers. For those not familiar, it’s a series of programming puzzles created by &lt;a href=&#34;http://was.tl/&#34;&gt;Eric Wastl&lt;/a&gt;, released once a day for the first 25 days of December. The puzzles are programming-language agnostic, so some use it to learn a new language and others, like myself, just thought it would be something fun to do. While I use R often in my job and for writing this blog, the Advent of Code puzzles are quite different from my usual use case. As I did the puzzles, I kept track of some tricks that I learned and thought were useful (I learned &lt;strong&gt;a lot&lt;/strong&gt; of things, but to keep things short, I’ll only list a handful).&lt;/p&gt;
&lt;div id=&#34;not-a-trick..-but-credit-where-credit-is-due&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Not a Trick... But Credit Where Credit Is Due&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;I can’t imagine the amount of work that goes into creating these puzzles.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;It’s a bit of a cop-out that the first item has nothing to do with R. But I did want to specifically give props to &lt;a href=&#34;http://was.tl/&#34;&gt;Eric Wastl&lt;/a&gt; for making these puzzles. As hard as it was at times to complete them, I found myself constantly thinking how difficult it must be to &lt;em&gt;create&lt;/em&gt; them and ensure that they are solvable.&lt;/p&gt;
&lt;p&gt;Now onto the R.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;trick-1-break-apart-a-string-of-text-into-a-vector-with-str_split-and-unlist&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Trick #1: Break apart a string of text into a vector with &lt;code&gt;str_split()&lt;/code&gt; and &lt;code&gt;unlist()&lt;/code&gt;&lt;/h2&gt;
&lt;p&gt;The inputs for Advent of Code are usually flat files, and it’s often necessary to break up the input in order to fill out a matrix or columns in a data frame.&lt;/p&gt;
&lt;p&gt;Suppose there is an input like:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;....#..
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;and we want to have each character as a vector element. A function like &lt;code&gt;readLines&lt;/code&gt; will read each row in as a vector element, but in order to split the string into individual characters we’ll call upon &lt;code&gt;str_split()&lt;/code&gt; to break apart the string by a delimiter. Using the empty string (’’) will separate each character, creating a list. Then &lt;code&gt;unlist()&lt;/code&gt; will break each character into its own element in the vector.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;input &amp;lt;- &amp;quot;....#...&amp;quot;

print(str_split(input, &amp;#39;&amp;#39;) %&amp;gt;% unlist())&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] &amp;quot;.&amp;quot; &amp;quot;.&amp;quot; &amp;quot;.&amp;quot; &amp;quot;.&amp;quot; &amp;quot;#&amp;quot; &amp;quot;.&amp;quot; &amp;quot;.&amp;quot; &amp;quot;.&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now as opposed to having 1 string, we have a character vector with each character as its own element.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;trick-2-combining-str_split-with-unnest-can-turn-a-vector-of-strings-into-a-tidy-data-frame.&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Trick #2: Combining &lt;code&gt;str_split()&lt;/code&gt; with &lt;code&gt;unnest()&lt;/code&gt; can turn a vector of strings into a tidy data frame.&lt;/h2&gt;
&lt;p&gt;One thing that I worked with more in Advent of Code than I have in the last few years is &lt;strong&gt;matrices&lt;/strong&gt;. As shown before, most of the input comes as a flat file needing to be processed. Sometimes it was helpful to represent the matrix as a tidy data set with columns for &lt;code&gt;row_id&lt;/code&gt;, &lt;code&gt;col_id&lt;/code&gt;, and &lt;code&gt;value&lt;/code&gt; rather than the traditional matrix format. The &lt;code&gt;unnest()&lt;/code&gt; function will break apart each element of a list into its own row. Here I use a similar input to before, but with more rows.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;input &amp;lt;- c(&amp;quot;....#.......&amp;quot;,
           &amp;quot;.#..#....###&amp;quot;,
           &amp;quot;....###.....&amp;quot;)

tibble(raw = input) %&amp;gt;% 
  mutate(
    row_id = row_number(), #Create Row ID
    value = str_split(raw, &amp;#39;&amp;#39;) #Break Each Row Into A List Of Elements
  ) %&amp;gt;% 
  unnest(value) %&amp;gt;% #Break Each Element Into Its Own Row
  group_by(row_id) %&amp;gt;% 
  mutate(col_id = row_number()) %&amp;gt;% #Create Column ID
  head(10) %&amp;gt;% 
  kable(align = &amp;#39;c&amp;#39;)&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;center&#34;&gt;raw&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;row_id&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;value&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;col_id&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;….#…….&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;.&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;….#…….&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;.&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;….#…….&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;.&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;….#…….&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;.&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;….#…….&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;#&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;….#…….&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;.&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;….#…….&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;.&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;….#…….&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;.&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;….#…….&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;.&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;….#…….&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;.&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;10&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Now each element of the character vector is its own row with its own &lt;code&gt;row_id&lt;/code&gt; and &lt;code&gt;col_id&lt;/code&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;trick-3-extract-is-a-powerhouse-function-for-working-with-strings&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Trick #3: &lt;code&gt;extract()&lt;/code&gt; is a powerhouse function for working with strings&lt;/h2&gt;
&lt;p&gt;I’ve mentioned before that I think regular expressions are amazing and open up a world of possibilities. &lt;code&gt;extract()&lt;/code&gt; allows for the use of regular expressions and capture groups to create any number of new columns. It’s similar to &lt;code&gt;separate()&lt;/code&gt; but to me seems more customizable. Given the inputs:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;6-7 z: dqzzzjbzz 67
13-16 j: jjjvjmjjkjjjjjjj 123
5-6 m: mmbmmlvmbmmgmmf 5&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And you want to create a data.frame that has columns for the number range, the character before the ‘:’, the series of characters after the ‘:’, and the final digit. This could be done with &lt;code&gt;str_match()&lt;/code&gt; or similar, but &lt;code&gt;extract()&lt;/code&gt; just makes it so &lt;strong&gt;&lt;em&gt;easy&lt;/em&gt;&lt;/strong&gt;. Just give &lt;code&gt;extract()&lt;/code&gt; a regular expression and capture in parentheses the things to turn into columns.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;input &amp;lt;- c(&amp;quot;6-7 z: dqzzzjbzz 67&amp;quot;,
           &amp;quot;13-16 j: jjjvjmjjkjjjjjjj 123&amp;quot;,
           &amp;quot;5-6 m: mmbmmlvmbmmgmmf 5&amp;quot;)

tibble(raw = input) %&amp;gt;% 
  extract(raw, 
          into = c(&amp;#39;number_range&amp;#39;, &amp;#39;single_char&amp;#39;, 
                   &amp;#39;many_char&amp;#39;, &amp;#39;single_digit&amp;#39;),
          regex = &amp;#39;(\\d+-\\d+) (\\w+): (\\w+) (\\d+)&amp;#39;,
          convert = T) %&amp;gt;% 
  kable(align = &amp;#39;c&amp;#39;)&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;center&#34;&gt;number_range&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;single_char&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;many_char&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;single_digit&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;6-7&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;z&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;dqzzzjbzz&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;67&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;13-16&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;j&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;jjjvjmjjkjjjjjjj&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;123&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;5-6&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;m&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;mmbmmlvmbmmgmmf&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Done and done (and with &lt;code&gt;convert = T&lt;/code&gt; it even turned &lt;code&gt;single_digit&lt;/code&gt; into an integer)!&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;trick-4-memoization&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Trick #4: Memoization&lt;/h2&gt;
&lt;p&gt;Some of the puzzles in AoC used programming concepts I hadn’t thought about in a long time (linked lists), and some used concepts I didn’t know existed. Memoization is one of those terms that I’d heard before but had no idea what it meant. There were a number of puzzles where my initial brute-force solutions would take hours or days to complete. But in certain cases, memoization sped things up immensely.&lt;/p&gt;
&lt;p&gt;Memoization caches the results of function calls so that if the same call happens a second time, rather than doing the work again, the program can just recall the value from the cache.&lt;/p&gt;
&lt;p&gt;Functions can be memoised in R using the &lt;code&gt;memoise::memoise()&lt;/code&gt; function to wrap the function.&lt;/p&gt;
&lt;p&gt;For this example, I’m borrowing the Fibonacci example from this post on &lt;a href=&#34;https://www.inwt-statistics.com/read-blog/optimize-your-r-code-using-memoization.html&#34;&gt;IWNT Statistics&lt;/a&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(memoise)

# Vanilla Function
fibb &amp;lt;- function(x){
  if(x==0){return(1)}
  else if(x==1){return(1)}
  else{return(fibb(x - 1) + fibb(x-2))}
}

# Same Function But Wrapped In Memoise
memo_fib &amp;lt;- memoise(function(x){
  if(x==0){return(1)}
  else if(x==1){return(1)}
  else{return(memo_fib(x - 1) + memo_fib(x-2))}
})&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Running the original version:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;tictoc::tic()
fibb(35)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 14930352&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;tictoc::toc()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 26.58 sec elapsed&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And the memoised version:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;tictoc::tic()
memo_fib(35)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 14930352&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;tictoc::toc()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 0.08 sec elapsed&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The memoised version produces a &lt;strong&gt;&lt;em&gt;way&lt;/em&gt;&lt;/strong&gt; faster result! While hard to believe, the original function makes close to 30 million calls on its way to finding &lt;code&gt;fibb(35)&lt;/code&gt;. The memoised version, however, only needs to solve the 36 unique inputs (0 through 35) and can recall the answer from the cache for every repeated recursive call.&lt;/p&gt;
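&lt;p&gt;To see that call explosion first-hand, here’s a small sketch (my own addition, not from the puzzles) that tracks the number of invocations with a global counter:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;calls &amp;lt;- 0

fibb_counted &amp;lt;- function(x){
  calls &amp;lt;&amp;lt;- calls + 1 # Increment the global counter on every invocation
  if(x &amp;lt;= 1){return(1)}
  else{return(fibb_counted(x - 1) + fibb_counted(x - 2))}
}

fibb_counted(20)
print(calls) # 21,891 calls just for n = 20; the count grows exponentially with n&lt;/code&gt;&lt;/pre&gt;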
&lt;/div&gt;
&lt;div id=&#34;trick-5---string-replacement-with-back-references&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Trick #5 - String Replacement with Back References&lt;/h2&gt;
&lt;p&gt;Back to string manipulation!&lt;/p&gt;
&lt;p&gt;Within regular expressions there is a concept of “capture groups”: wrap part of a pattern in parentheses and you can extract that piece from the string match (this is how &lt;code&gt;str_match()&lt;/code&gt; can work). However, you can also reference what a capture group matched and reuse it in the replacement for functions like &lt;code&gt;str_replace_all()&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;In our example, imagine we have a string of animals, &lt;code&gt;&#34;the cat, a bird, the dog, ze goat&#34;&lt;/code&gt;, and we want to insert the adjective &lt;strong&gt;red&lt;/strong&gt; between “the” and each animal. There are many ways to do this, but I will use back-references, which reference the contents of a capture group without knowing specifically what’s in it.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;input &amp;lt;- &amp;quot;the cat, a bird, the dog, ze goat&amp;quot;

str_replace_all(input, &amp;#39;(\\w+) (\\w+)&amp;#39;, &amp;#39;\\1 red \\2&amp;#39;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] &amp;quot;the red cat, a red bird, the red dog, ze red goat&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;\\1&lt;/code&gt; is a back-reference to the first capture group in parentheses (the, a, the, and ze) while &lt;code&gt;\\2&lt;/code&gt; is a reference to the animals.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;trick-6---escaping-stringrs-regular-expression-matching-with-coll&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Trick #6 - Escaping stringr’s regular expression matching with &lt;code&gt;coll()&lt;/code&gt;&lt;/h2&gt;
&lt;p&gt;More often than not, stringr’s use of regular expressions as the pattern is a blessing. One place where it was troublesome was when I tried to use one variable as the pattern for replacing within another variable. In these cases, special characters in my pattern (the ‘+’) were treated as regular expression syntax rather than the literal string I wanted to match.&lt;/p&gt;
&lt;p&gt;For this example, suppose I want to replace an equation within parenthesis with the word ‘hi’ (not sure &lt;strong&gt;why&lt;/strong&gt; I’d want to do this, but oh well).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;tibble(
  eq = c(&amp;quot;(1 + 1)&amp;quot;, &amp;quot;(7 - 3)&amp;quot;, &amp;quot;(12 * 1)&amp;quot;)
) %&amp;gt;% 
  mutate(ptrn = str_extract(eq, &amp;#39;\\(.+\\)&amp;#39;),
         new_eq = str_replace_all(eq, ptrn, &amp;#39;hi&amp;#39;),
  ) %&amp;gt;% 
  kable(align = &amp;#39;c&amp;#39;)&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;center&#34;&gt;eq&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;ptrn&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;new_eq&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;(1 + 1)&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;(1 + 1)&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;(1 + 1)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;(7 - 3)&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;(7 - 3)&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;(hi)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;(12 * 1)&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;(12 * 1)&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;(12 * 1)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Notice that for all three cases &lt;code&gt;str_replace_all&lt;/code&gt; either worked incorrectly or didn’t work at all. To a human reader these should obviously match, but the symbols “(”, “)”, “+”, and “*” are all special characters in regular expressions and therefore don’t match the literal symbols they were intended to match.&lt;/p&gt;
&lt;p&gt;Fortunately, there is a function &lt;code&gt;coll()&lt;/code&gt; which compares strings using standard collation rules rather than treating them as regular expressions. Wrapping the pattern variable in &lt;code&gt;coll()&lt;/code&gt; solves the problem.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;tibble(
  eq = c(&amp;quot;(1 + 1)&amp;quot;, &amp;quot;(7 - 3)&amp;quot;, &amp;quot;(12 * 1)&amp;quot;)
) %&amp;gt;% 
  mutate(ptrn = str_extract(eq, &amp;#39;\\(.+\\)&amp;#39;),
         new_eq = str_replace_all(eq, ptrn, &amp;#39;hi&amp;#39;),
         with_coll = str_replace_all(eq, coll(ptrn), &amp;#39;hi&amp;#39;)
  ) %&amp;gt;%
  kable(align = &amp;#39;c&amp;#39;)&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;center&#34;&gt;eq&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;ptrn&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;new_eq&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;with_coll&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;(1 + 1)&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;(1 + 1)&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;(1 + 1)&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;hi&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;(7 - 3)&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;(7 - 3)&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;(hi)&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;hi&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;(12 * 1)&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;(12 * 1)&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;(12 * 1)&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;hi&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Now everything works!&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;trick-7---use-the-assign-function-to-programatically-create-new-objects&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Trick #7 - Use the &lt;code&gt;assign()&lt;/code&gt; function to programatically create new objects&lt;/h2&gt;
&lt;p&gt;I always struggle with programmatic naming of objects. In the course of one of the puzzles I came across the &lt;code&gt;assign()&lt;/code&gt; function, which takes a variable name as a string and an object to bind to that name.&lt;/p&gt;
&lt;p&gt;Suppose we have a data.frame with a column for the Player and a column for the cards held by that player, and we want to create two vectors: one for player 1 and one for player 2. We can use &lt;code&gt;assign()&lt;/code&gt; to create those objects.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;input &amp;lt;- tibble::tribble(
  ~Player, ~Cards,
       1L,     1L,
       1L,     2L,
       1L,     3L,
       2L,     4L,
       2L,     5L,
       2L,     6L
  )

# Generate the string for the variable name with paste and assign an object
for(i in seq_len(n_distinct(input$Player))){
  assign(paste0(&amp;#39;player_&amp;#39;,i), input %&amp;gt;% filter(Player == i) %&amp;gt;% pull(Cards))
}

print(player_1)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 1 2 3&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;print(player_2)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 4 5 6&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now there are two objects in the environment with the names “player_1” and “player_2”.&lt;/p&gt;
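&lt;p&gt;As a companion (my own aside, assuming the chunk above has been run): the base-R &lt;code&gt;get()&lt;/code&gt; and &lt;code&gt;mget()&lt;/code&gt; functions are the inverse of &lt;code&gt;assign()&lt;/code&gt;, retrieving objects programmatically by their names as strings:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Retrieve a single object by its name
get(&amp;#39;player_1&amp;#39;)

# Retrieve several objects at once, returned as a named list
mget(paste0(&amp;#39;player_&amp;#39;, 1:2))&lt;/code&gt;&lt;/pre&gt;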
&lt;/div&gt;
&lt;div id=&#34;thanks-for-reading&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Thanks for Reading!&lt;/h2&gt;
&lt;p&gt;I would highly encourage everyone to try &lt;a href=&#34;https://adventofcode.com/&#34;&gt;Advent of Code&lt;/a&gt; at some point. I found it really enjoyable to do a different type of programming from my day-to-day. Although there were instances where doing it in R made things difficult (mainly R being a 1-indexed language), the experience was well worth it.&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Exploring NHL Stanley Cup Champion&#39;s Points Percentage In Four GGPlots</title>
      <link>https://jlaw.netlify.app/2020/12/01/exploring-nhl-stanley-cup-champion-s-points-percentage-in-four-ggplots/</link>
      <pubDate>Tue, 01 Dec 2020 00:00:00 +0000</pubDate>
      <guid>https://jlaw.netlify.app/2020/12/01/exploring-nhl-stanley-cup-champion-s-points-percentage-in-four-ggplots/</guid>
      <description>


&lt;div id=&#34;motivation&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Motivation&lt;/h2&gt;
&lt;p&gt;While browsing Reddit’s &lt;a href=&#34;https://www.reddit.com/r/dataisbeautiful/&#34;&gt;r/DataIsBeautiful&lt;/a&gt; sub-reddit I came across a post from Fabio Votta showing a &lt;a href=&#34;https://www.reddit.com/r/dataisbeautiful/comments/jwzsm6/oc_countylevel_results_of_us_2020_election/&#34;&gt;beeswarm plot of US County vote share in the 2020 Election&lt;/a&gt;. Having never seen a beeswarm plot before I wanted to come up with an excuse to try it out. &lt;strong&gt;As an NHL fan, I decided to look at the Points Percentage of NHL Stanley Cup champions&lt;/strong&gt;. This analysis will use information from &lt;a href=&#34;https://www.hockey-reference.com/awards/stanley.html&#34;&gt;hockey-reference.com&lt;/a&gt; and &lt;code&gt;ggplot&lt;/code&gt; to visualize the information.&lt;/p&gt;
&lt;div id=&#34;sidebar-what-is-a-points-percentage&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Sidebar: What is a Points Percentage?&lt;/h3&gt;
&lt;p&gt;In the NHL a win is worth 2 points, a loss is worth 0 points, and a tie (or Overtime Loss beginning in the 2005-2006 season) is worth one point. The Points Percentage is the number of points earned by the team (2*Wins + 1*(Ties + OTL)) divided by the number of potential points (2*Games Played).&lt;/p&gt;
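&lt;p&gt;As a quick worked example (with a made-up record, not an actual team’s season):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Hypothetical record: 40 wins, 30 losses, 12 ties/overtime losses
wins &amp;lt;- 40
losses &amp;lt;- 30
otl &amp;lt;- 12
games_played &amp;lt;- wins + losses + otl

pts_pct &amp;lt;- (2*wins + 1*otl) / (2*games_played) # Points earned over points possible
pts_pct # 92 / 164, or roughly 56.1%&lt;/code&gt;&lt;/pre&gt;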
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;getting-the-data&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Getting the Data&lt;/h2&gt;
&lt;p&gt;The data for this analysis will come from &lt;a href=&#34;https://www.hockey-reference.com/awards/stanley.html&#34;&gt;hockey-reference.com&lt;/a&gt; which provides statistics on the Stanley Cup Champion teams from 1918 through 2020 (with some exceptions). The points percentage is provided as a direct column in the table.&lt;/p&gt;
&lt;div id=&#34;setting-up-libraries&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Setting Up Libraries&lt;/h3&gt;
&lt;p&gt;The libraries used in this analysis include stalwarts like tidyverse as well as ggplot extensions such as &lt;code&gt;ggtext&lt;/code&gt;, &lt;code&gt;ggbeeswarm&lt;/code&gt;, &lt;code&gt;ggridges&lt;/code&gt;, &lt;code&gt;ggimage&lt;/code&gt; to do different visualizations. The &lt;a href=&#34;https://github.com/wch/extrafont/&#34;&gt;&lt;code&gt;extrafont&lt;/code&gt;&lt;/a&gt; package enables the use of the fonts installed on my machine in ggplots. The &lt;code&gt;loadfonts(device = &#34;win&#34;)&lt;/code&gt; function loads the additional fonts (if running for the first time the &lt;code&gt;font_import()&lt;/code&gt; function needs to be called to build the references).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidyverse) # Data Manipulation and Visualizations
library(rvest) # Web Scraping the NHL Champion Data &amp;amp; Team Colors
library(ggbeeswarm) # Creating Beeswarm Plots
library(ggtext) # Enabling Use of Markdown in ggplots
library(ggridges) # Creating Ridge Density Plots
library(ggimage) # Creating Plots with Images as the Points 
library(glue) # Package for String Manipulation
library(extrafont) # Package to enable use of additional fonts for plotting
loadfonts(device = &amp;quot;win&amp;quot;) # Actually loads the fonts&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;getting-the-data-on-the-champions&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Getting the Data on the Champions&lt;/h3&gt;
&lt;p&gt;The points data for the Stanley Cup Champions comes from &lt;a href=&#34;https://www.hockey-reference.com/awards/stanley.html&#34;&gt;hockey-reference.com&lt;/a&gt;. I’ll scrape the table from this website by using &lt;code&gt;rvest&lt;/code&gt; and referencing the CSS class &lt;code&gt;.stats_table&lt;/code&gt;. Since there’s only one table on the page I can use &lt;code&gt;html_node&lt;/code&gt; vs. &lt;code&gt;html_nodes&lt;/code&gt;. Eventually I’m planning on joining some additional data to this data frame so I’m doing a minimal amount of data cleaning such as changing the Chicago Blackhawks to 1 word so that it matches the second data set. Additionally I’m renaming the points percentage column to something more R friendly.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;nhl_data &amp;lt;- read_html(&amp;#39;https://www.hockey-reference.com/awards/stanley.html&amp;#39;) %&amp;gt;% 
  html_node(css = &amp;#39;.stats_table&amp;#39;) %&amp;gt;% 
  html_table() %&amp;gt;% 
  mutate(Team = str_replace_all(Team, &amp;quot;Black Hawks&amp;quot;, &amp;quot;Blackhawks&amp;quot;)) %&amp;gt;% 
  rename(pts_pct = `PTS%`)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;getting-data-on-team-colors&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Getting Data on Team Colors&lt;/h3&gt;
&lt;p&gt;For one of the future plots I want to use each team’s color to represent their data. This information comes from &lt;a href=&#34;https://teamcolorcodes.com&#34;&gt;teamcolorcodes.com&lt;/a&gt;. Each team page has a formulaic URL where the team name is ‘-’ delimited. Since this page only has information on current teams, older teams like the Toronto Arenas or Montreal Maroons will not appear. Typically, these names might wind up breaking a loop when they throw an error. However, the use of the &lt;code&gt;possibly()&lt;/code&gt; function from &lt;code&gt;purrr&lt;/code&gt; will accommodate the error handling. The &lt;code&gt;possibly()&lt;/code&gt; function wraps another function and has an &lt;code&gt;otherwise&lt;/code&gt; parameter that allows the user to say what the function should provide in case of an error.&lt;/p&gt;
&lt;p&gt;In this case, the &lt;code&gt;possibly()&lt;/code&gt; function wraps an anonymous function that:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Takes a string for a team name, &lt;code&gt;t&lt;/code&gt;, which is converted to lower-case and has the spaces replaced with dashes&lt;/li&gt;
&lt;li&gt;Scrapes the first instance of the &lt;code&gt;.colorblock&lt;/code&gt; CSS class from the &lt;a href=&#34;https://teamcolorcodes.com&#34;&gt;teamcolorcodes.com&lt;/a&gt; webpage for the specific team as text.&lt;/li&gt;
&lt;li&gt;Performs a regular expression match to find the HEX code for the color&lt;/li&gt;
&lt;li&gt;Since &lt;code&gt;str_match&lt;/code&gt; returns a matrix where the first column is the entire match and each additional column represents a capture group, pulls the 2nd column from the matrix.&lt;/li&gt;
&lt;li&gt;Finally, the function returns a 1-row tibble with the team name, &lt;code&gt;t&lt;/code&gt;, and the HEX code, named &lt;code&gt;color&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;In the case that there’s an error, the function will return a 1-row tibble with the team value set to ‘non-match’ and the color value set to &lt;code&gt;NA&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The &lt;code&gt;map_dfr&lt;/code&gt; function from &lt;code&gt;purrr&lt;/code&gt; is used to run the above function for all unique team names and append the results into a data.frame.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;get_color &amp;lt;-  possibly(
  function(t){
    tibble(
      team = t,
      color = glue(&amp;quot;https://teamcolorcodes.com/{t}-color-codes/&amp;quot;, 
                   t = str_replace_all(
                     str_to_lower(t), &amp;#39; &amp;#39;, &amp;#39;-&amp;#39;)
                   ) %&amp;gt;% 
              read_html() %&amp;gt;% 
              html_node(css = &amp;quot;.colorblock&amp;quot;) %&amp;gt;% 
              html_text() %&amp;gt;% 
              str_match(&amp;quot;Hex Color: (#[0-9A-Za-z]{6})&amp;quot;) %&amp;gt;% 
              .[, 2]
    )
  },
  otherwise = tibble(team = &amp;quot;non-match&amp;quot;, color = NA_character_))

nhl_colors &amp;lt;- map_dfr(unique(nhl_data$Team), get_color)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;combining-the-data&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Combining the Data&lt;/h3&gt;
&lt;p&gt;Finally, the color data is joined to the Champions data. In the cases where there was no match in the color data, I’m using &lt;em&gt;black&lt;/em&gt; as a default color.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;nhl_w_color &amp;lt;- nhl_data %&amp;gt;% 
  left_join(nhl_colors, by = c(&amp;quot;Team&amp;quot; = &amp;quot;team&amp;quot;)) %&amp;gt;% 
  mutate(
    color = if_else(is.na(color), &amp;quot;black&amp;quot;, color)
  ) &lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;visualizations&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Visualizations&lt;/h2&gt;
&lt;div id=&#34;the-overall-distribution-of-points-percentage-for-nhl-stanley-cup-champions&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;The Overall Distribution of Points Percentage for NHL Stanley Cup Champions&lt;/h3&gt;
&lt;p&gt;This code block is a doozy, as I added a lot of annotations for error bars, text labels, and arrows, plus theme formatting, to dress up what at its heart is a standard density plot.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;nhl_w_color %&amp;gt;% 
  ggplot(aes(x = pts_pct)) + 
    geom_density(fill = &amp;#39;#8394A1&amp;#39;) + 
    annotate(&amp;quot;errorbarh&amp;quot;,
            xmin = quantile(nhl_w_color$pts_pct, .10),
            xmax = quantile(nhl_w_color$pts_pct, .90),
            y = 6,
            color = &amp;quot;#e6e7eb&amp;quot;) + 
    annotate(&amp;quot;linerange&amp;quot;,
             x = median(nhl_w_color$pts_pct),
             ymin = 0,
             ymax = 5,
             color = &amp;quot;#e6e7eb&amp;quot;,
             lty = 2
    ) + 
    annotate(&amp;quot;text&amp;quot;,
             label = &amp;quot;Middle 80% and Median&amp;quot;,
             y = 6.45,
             x = median(nhl_w_color$pts_pct),
             color = &amp;quot;#e6e7eb&amp;quot;) + 
    annotate(&amp;quot;text&amp;quot;,
             label = quantile(nhl_w_color$pts_pct, .10) %&amp;gt;% 
               scales::percent(accuracy = 1),
             y = 5.2,
             x = quantile(nhl_w_color$pts_pct, .10),
             color = &amp;quot;#e6e7eb&amp;quot;) + 
    annotate(&amp;quot;text&amp;quot;,
             label = quantile(nhl_w_color$pts_pct, .90) %&amp;gt;% 
               scales::percent(accuracy = 1),
             y = 5.2,
             x = quantile(nhl_w_color$pts_pct, .90),
             color = &amp;quot;#e6e7eb&amp;quot;) + 
    geom_curve(
      x = median(nhl_w_color$pts_pct),
      xend = median(nhl_w_color$pts_pct)-.005,
      y = 6,
      yend = 3,
      color = &amp;quot;#e6e7eb&amp;quot;,
      arrow = arrow(length = unit(0.03, &amp;quot;npc&amp;quot;)),
      size = 1
    ) + 
    annotate(&amp;quot;text&amp;quot;, x = median(nhl_w_color$pts_pct)-.02, y = 3.3,
             label = median(nhl_w_color$pts_pct) %&amp;gt;% 
               scales::percent(accuracy = 1),
             color =  &amp;quot;#e6e7eb&amp;quot;) + 
    labs(title = &amp;quot;Points Percentage of Stanley Cup Champions (1918 - 2020)&amp;quot;,
         caption = &amp;quot;*Source: hockey-reference.com*&amp;quot;,
         x = &amp;quot;Points %&amp;quot;,
         y = &amp;quot;&amp;quot;
    ) + 
    scale_x_continuous(labels = scales::percent_format(accuracy = 1)) + 
    
    cowplot::theme_cowplot() + 
    theme(
      text = element_text(color = &amp;quot;#e6e7eb&amp;quot;, family = &amp;#39;BentonSans Regular&amp;#39;),
      plot.background = element_rect(fill = &amp;quot;#1a1c2e&amp;quot;),
      axis.text = element_text(color = &amp;quot;#e6e7eb&amp;quot;),
      axis.ticks = element_line(color = &amp;quot;#e6e7eb&amp;quot;),
      axis.line = element_line(color = &amp;quot;#878890&amp;quot;),
      plot.caption = element_markdown(),
      axis.title.y = element_blank(),
      axis.text.y = element_blank(),
      axis.ticks.y = element_blank(),
      axis.line.y = element_blank(),
      plot.title = element_text(hjust = .5)
    )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;index_files/figure-html/overall_density-1.png&#34; width=&#34;768&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Of the 100 champions that there is data for, the median points percentage is 63% while the middle 80% spans 54% - 74%. &lt;strong&gt;Ultimately this makes sense since you’d expect a champion to do better than just 50%&lt;/strong&gt;. However, there are some teams that are really great and have &amp;gt;80% points percentages and a few instances of unlikely champions with a points percentage in the 40s.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;has-the-distribution-of-champions-points-percentages-changed-by-decade&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Has the Distribution of Champions’ Points Percentages Changed By Decade?&lt;/h3&gt;
&lt;p&gt;To see the density curves over time, one approach would be to facet by decade and show each decade in its own panel. Another approach is to use the &lt;code&gt;ggridges&lt;/code&gt; package to make a ridge density plot with each density curve on its own line. The package is very easy to use, as it’s primarily a matter of adding a &lt;code&gt;y&lt;/code&gt; value and then using &lt;code&gt;geom_density_ridges&lt;/code&gt; vs. &lt;code&gt;geom_density&lt;/code&gt;.&lt;/p&gt;
&lt;div id=&#34;sidebar-computing-decades-from-years&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;Sidebar: Computing Decades from Years&lt;/h4&gt;
&lt;p&gt;In order to create the decade variable I use a trick I learned from &lt;a href=&#34;https://www.youtube.com/channel/UCeiiqmVK07qhY-wvg3IZiZQ&#34;&gt;David Robinson’s TidyTuesday videos&lt;/a&gt; which is to divide the number by bucket width, take the floor of the result, and then multiply it back by the bucket width.&lt;/p&gt;
&lt;p&gt;For example, 2016 divided by 10 is 201.6, which after taking the floor is 201, then multiplying back by 10 is 2010. So 2016 is in the 2010s decade.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;nhl_w_color %&amp;gt;% 
  mutate(decade = str_sub(Season, 1, 4),
         decade = as.integer(decade),
         decade = floor(decade/10)*10
  ) %&amp;gt;% 
  ggplot(aes(x = pts_pct, y = factor(decade), fill = factor(decade))) + 
    geom_density_ridges() + 
    geom_vline(xintercept = median(nhl_w_color$pts_pct), lty = 2, color = &amp;#39;white&amp;#39;) + 
    scale_x_continuous(labels = scales::percent_format(accuracy = 1)) + 
    scale_fill_viridis_d(option = &amp;quot;C&amp;quot;, guide = F) + 
    labs(
      x = &amp;quot;Points %&amp;quot;,
      y = &amp;quot;Decade&amp;quot;,
      title = &amp;quot;Points Percentage of Stanley Cup Champions (1918 - 2020)&amp;quot;,
      subtitle = &amp;quot;*By Decade*&amp;quot;,
      caption = &amp;quot;*Source: hockey-reference.com*&amp;quot;
    ) + 
    cowplot::theme_cowplot() +
    theme(
      plot.caption = element_markdown(),
      plot.subtitle = element_markdown(),
      text = element_text(color = &amp;quot;#e6e7eb&amp;quot;,  family = &amp;#39;BentonSans Regular&amp;#39;),
      plot.background = element_rect(fill = &amp;quot;#1a1c2e&amp;quot;),
      axis.text = element_text(color = &amp;quot;#e6e7eb&amp;quot;),
      axis.ticks = element_line(color = &amp;quot;#e6e7eb&amp;quot;),
      axis.line = element_line(color = &amp;quot;#878890&amp;quot;)
    )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;index_files/figure-html/ridgelines-1.png&#34; width=&#34;768&#34; /&gt;
I would have expected a trend of some sort, but there isn’t a very clear story in this chart. The main takeaways are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The 1970s seems to have had the most dominant teams from a points percentage standpoint&lt;/li&gt;
&lt;li&gt;There appears to be a large shift from the 1990s to the 2000s, which might be due to the introduction of the shootout and the overtime loss, which meant that three points could be awarded in a game (two for the winner, one for the overtime/shootout loser) rather than always two.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;looking-the-points-percentage-for-each-team&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Looking at the Points Percentage for Each Team&lt;/h3&gt;
&lt;p&gt;At the beginning of the post I mentioned that seeing a beeswarm plot provided the motivation for this post. Now I’ll actually create it. The following plot will have one point for each champion, highlighted in the team’s colors when that team’s tab is selected.&lt;/p&gt;
&lt;p&gt;Two things to note in this code block are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The tabset is dynamically generated by setting the chunk option &lt;code&gt;results=&#39;asis&#39;&lt;/code&gt; and then using &lt;code&gt;cat()&lt;/code&gt; to write the HTML for the tabs in a for-loop.&lt;/li&gt;
&lt;li&gt;In vanilla RMarkdown, the tabset effect is really easy to achieve with &lt;code&gt;{.tabset}&lt;/code&gt;, but in Blogdown/Hugo it’s a bit trickier to nail the formatting. It’s doable, though, by referencing the &lt;a href=&#34;https://getbootstrap.com/docs/4.0/components/navs/&#34;&gt;bootstrap.js documentation&lt;/a&gt;. To keep things looking decent, I’m omitting the code chunk here but include it at the bottom of the post.&lt;/li&gt;
&lt;/ul&gt;
&lt;style type=&#34;text/css&#34;&gt;
.nav-pills li a {
  font-size:14px;
  }
&lt;/style&gt;
&lt;ul class=&#34;nav nav-pills nav-fill&#34;&gt;
&lt;li class=&#34;nav-item&#34;&gt;
&lt;a class = &#34;nav-link active&#34; data-toggle=&#34;tab&#34; href=&#34;#anaheimducks&#34;&gt;   Anaheim Ducks    &lt;/a&gt;
&lt;/li&gt;
&lt;li class=&#34;nav-item&#34;&gt;
&lt;a class = &#34;nav-link &#34; data-toggle=&#34;tab&#34; href=&#34;#bostonbruins&#34;&gt;   Boston Bruins    &lt;/a&gt;
&lt;/li&gt;
&lt;li class=&#34;nav-item&#34;&gt;
&lt;a class = &#34;nav-link &#34; data-toggle=&#34;tab&#34; href=&#34;#calgaryflames&#34;&gt;   Calgary Flames   &lt;/a&gt;
&lt;/li&gt;
&lt;li class=&#34;nav-item&#34;&gt;
&lt;a class = &#34;nav-link &#34; data-toggle=&#34;tab&#34; href=&#34;#carolinahurricanes&#34;&gt;Carolina Hurricanes &lt;/a&gt;
&lt;/li&gt;
&lt;li class=&#34;nav-item&#34;&gt;
&lt;a class = &#34;nav-link &#34; data-toggle=&#34;tab&#34; href=&#34;#chicagoblackhawks&#34;&gt; Chicago Blackhawks &lt;/a&gt;
&lt;/li&gt;
&lt;li class=&#34;nav-item&#34;&gt;
&lt;a class = &#34;nav-link &#34; data-toggle=&#34;tab&#34; href=&#34;#coloradoavalanche&#34;&gt; Colorado Avalanche &lt;/a&gt;
&lt;/li&gt;
&lt;li class=&#34;nav-item&#34;&gt;
&lt;a class = &#34;nav-link &#34; data-toggle=&#34;tab&#34; href=&#34;#dallasstars&#34;&gt;    Dallas Stars    &lt;/a&gt;
&lt;/li&gt;
&lt;li class=&#34;nav-item&#34;&gt;
&lt;a class = &#34;nav-link &#34; data-toggle=&#34;tab&#34; href=&#34;#detroitredwings&#34;&gt; Detroit Red Wings  &lt;/a&gt;
&lt;/li&gt;
&lt;li class=&#34;nav-item&#34;&gt;
&lt;a class = &#34;nav-link &#34; data-toggle=&#34;tab&#34; href=&#34;#edmontonoilers&#34;&gt;  Edmonton Oilers   &lt;/a&gt;
&lt;/li&gt;
&lt;li class=&#34;nav-item&#34;&gt;
&lt;a class = &#34;nav-link &#34; data-toggle=&#34;tab&#34; href=&#34;#losangeleskings&#34;&gt; Los Angeles Kings  &lt;/a&gt;
&lt;/li&gt;
&lt;li class=&#34;nav-item&#34;&gt;
&lt;a class = &#34;nav-link &#34; data-toggle=&#34;tab&#34; href=&#34;#montrealcanadiens&#34;&gt; Montreal Canadiens &lt;/a&gt;
&lt;/li&gt;
&lt;li class=&#34;nav-item&#34;&gt;
&lt;a class = &#34;nav-link &#34; data-toggle=&#34;tab&#34; href=&#34;#montrealmaroons&#34;&gt;  Montreal Maroons  &lt;/a&gt;
&lt;/li&gt;
&lt;li class=&#34;nav-item&#34;&gt;
&lt;a class = &#34;nav-link &#34; data-toggle=&#34;tab&#34; href=&#34;#newjerseydevils&#34;&gt; New Jersey Devils  &lt;/a&gt;
&lt;/li&gt;
&lt;li class=&#34;nav-item&#34;&gt;
&lt;a class = &#34;nav-link &#34; data-toggle=&#34;tab&#34; href=&#34;#newyorkislanders&#34;&gt; New York Islanders &lt;/a&gt;
&lt;/li&gt;
&lt;li class=&#34;nav-item&#34;&gt;
&lt;a class = &#34;nav-link &#34; data-toggle=&#34;tab&#34; href=&#34;#newyorkrangers&#34;&gt;  New York Rangers  &lt;/a&gt;
&lt;/li&gt;
&lt;li class=&#34;nav-item&#34;&gt;
&lt;a class = &#34;nav-link &#34; data-toggle=&#34;tab&#34; href=&#34;#ottawasenators&#34;&gt;  Ottawa Senators   &lt;/a&gt;
&lt;/li&gt;
&lt;li class=&#34;nav-item&#34;&gt;
&lt;a class = &#34;nav-link &#34; data-toggle=&#34;tab&#34; href=&#34;#philadelphiaflyers&#34;&gt;Philadelphia Flyers &lt;/a&gt;
&lt;/li&gt;
&lt;li class=&#34;nav-item&#34;&gt;
&lt;a class = &#34;nav-link &#34; data-toggle=&#34;tab&#34; href=&#34;#pittsburghpenguins&#34;&gt;Pittsburgh Penguins &lt;/a&gt;
&lt;/li&gt;
&lt;li class=&#34;nav-item&#34;&gt;
&lt;a class = &#34;nav-link &#34; data-toggle=&#34;tab&#34; href=&#34;#stlouisblues&#34;&gt;  St. Louis Blues   &lt;/a&gt;
&lt;/li&gt;
&lt;li class=&#34;nav-item&#34;&gt;
&lt;a class = &#34;nav-link &#34; data-toggle=&#34;tab&#34; href=&#34;#tampabaylightning&#34;&gt;Tampa Bay Lightning &lt;/a&gt;
&lt;/li&gt;
&lt;li class=&#34;nav-item&#34;&gt;
&lt;a class = &#34;nav-link &#34; data-toggle=&#34;tab&#34; href=&#34;#torontoarenas&#34;&gt;   Toronto Arenas   &lt;/a&gt;
&lt;/li&gt;
&lt;li class=&#34;nav-item&#34;&gt;
&lt;a class = &#34;nav-link &#34; data-toggle=&#34;tab&#34; href=&#34;#torontomapleleafs&#34;&gt;Toronto Maple Leafs &lt;/a&gt;
&lt;/li&gt;
&lt;li class=&#34;nav-item&#34;&gt;
&lt;a class = &#34;nav-link &#34; data-toggle=&#34;tab&#34; href=&#34;#torontostpatricks&#34;&gt;Toronto St. Patricks&lt;/a&gt;
&lt;/li&gt;
&lt;li class=&#34;nav-item&#34;&gt;
&lt;a class = &#34;nav-link &#34; data-toggle=&#34;tab&#34; href=&#34;#washingtoncapitals&#34;&gt;Washington Capitals &lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class=&#34;tab-content&#34;&gt;
&lt;div id=&#34;anaheimducks&#34; class=&#34;tab-pane show active&#34;&gt;
&lt;img src=&#34;index_files/figure-html/unnamed-chunk-2-1.png&#34; width=&#34;768&#34; /&gt;
&lt;/div&gt;
&lt;div id=&#34;bostonbruins&#34; class=&#34;tab-pane&#34;&gt;
&lt;img src=&#34;index_files/figure-html/unnamed-chunk-2-2.png&#34; width=&#34;768&#34; /&gt;
&lt;/div&gt;
&lt;div id=&#34;calgaryflames&#34; class=&#34;tab-pane&#34;&gt;
&lt;img src=&#34;index_files/figure-html/unnamed-chunk-2-3.png&#34; width=&#34;768&#34; /&gt;
&lt;/div&gt;
&lt;div id=&#34;carolinahurricanes&#34; class=&#34;tab-pane&#34;&gt;
&lt;img src=&#34;index_files/figure-html/unnamed-chunk-2-4.png&#34; width=&#34;768&#34; /&gt;
&lt;/div&gt;
&lt;div id=&#34;chicagoblackhawks&#34; class=&#34;tab-pane&#34;&gt;
&lt;img src=&#34;index_files/figure-html/unnamed-chunk-2-5.png&#34; width=&#34;768&#34; /&gt;
&lt;/div&gt;
&lt;div id=&#34;coloradoavalanche&#34; class=&#34;tab-pane&#34;&gt;
&lt;img src=&#34;index_files/figure-html/unnamed-chunk-2-6.png&#34; width=&#34;768&#34; /&gt;
&lt;/div&gt;
&lt;div id=&#34;dallasstars&#34; class=&#34;tab-pane&#34;&gt;
&lt;img src=&#34;index_files/figure-html/unnamed-chunk-2-7.png&#34; width=&#34;768&#34; /&gt;
&lt;/div&gt;
&lt;div id=&#34;detroitredwings&#34; class=&#34;tab-pane&#34;&gt;
&lt;img src=&#34;index_files/figure-html/unnamed-chunk-2-8.png&#34; width=&#34;768&#34; /&gt;
&lt;/div&gt;
&lt;div id=&#34;edmontonoilers&#34; class=&#34;tab-pane&#34;&gt;
&lt;img src=&#34;index_files/figure-html/unnamed-chunk-2-9.png&#34; width=&#34;768&#34; /&gt;
&lt;/div&gt;
&lt;div id=&#34;losangeleskings&#34; class=&#34;tab-pane&#34;&gt;
&lt;img src=&#34;index_files/figure-html/unnamed-chunk-2-10.png&#34; width=&#34;768&#34; /&gt;
&lt;/div&gt;
&lt;div id=&#34;montrealcanadiens&#34; class=&#34;tab-pane&#34;&gt;
&lt;img src=&#34;index_files/figure-html/unnamed-chunk-2-11.png&#34; width=&#34;768&#34; /&gt;
&lt;/div&gt;
&lt;div id=&#34;montrealmaroons&#34; class=&#34;tab-pane&#34;&gt;
&lt;img src=&#34;index_files/figure-html/unnamed-chunk-2-12.png&#34; width=&#34;768&#34; /&gt;
&lt;/div&gt;
&lt;div id=&#34;newjerseydevils&#34; class=&#34;tab-pane&#34;&gt;
&lt;img src=&#34;index_files/figure-html/unnamed-chunk-2-13.png&#34; width=&#34;768&#34; /&gt;
&lt;/div&gt;
&lt;div id=&#34;newyorkislanders&#34; class=&#34;tab-pane&#34;&gt;
&lt;img src=&#34;index_files/figure-html/unnamed-chunk-2-14.png&#34; width=&#34;768&#34; /&gt;
&lt;/div&gt;
&lt;div id=&#34;newyorkrangers&#34; class=&#34;tab-pane&#34;&gt;
&lt;img src=&#34;index_files/figure-html/unnamed-chunk-2-15.png&#34; width=&#34;768&#34; /&gt;
&lt;/div&gt;
&lt;div id=&#34;ottawasenators&#34; class=&#34;tab-pane&#34;&gt;
&lt;img src=&#34;index_files/figure-html/unnamed-chunk-2-16.png&#34; width=&#34;768&#34; /&gt;
&lt;/div&gt;
&lt;div id=&#34;philadelphiaflyers&#34; class=&#34;tab-pane&#34;&gt;
&lt;img src=&#34;index_files/figure-html/unnamed-chunk-2-17.png&#34; width=&#34;768&#34; /&gt;
&lt;/div&gt;
&lt;div id=&#34;pittsburghpenguins&#34; class=&#34;tab-pane&#34;&gt;
&lt;img src=&#34;index_files/figure-html/unnamed-chunk-2-18.png&#34; width=&#34;768&#34; /&gt;
&lt;/div&gt;
&lt;div id=&#34;stlouisblues&#34; class=&#34;tab-pane&#34;&gt;
&lt;img src=&#34;index_files/figure-html/unnamed-chunk-2-19.png&#34; width=&#34;768&#34; /&gt;
&lt;/div&gt;
&lt;div id=&#34;tampabaylightning&#34; class=&#34;tab-pane&#34;&gt;
&lt;img src=&#34;index_files/figure-html/unnamed-chunk-2-20.png&#34; width=&#34;768&#34; /&gt;
&lt;/div&gt;
&lt;div id=&#34;torontoarenas&#34; class=&#34;tab-pane&#34;&gt;
&lt;img src=&#34;index_files/figure-html/unnamed-chunk-2-21.png&#34; width=&#34;768&#34; /&gt;
&lt;/div&gt;
&lt;div id=&#34;torontomapleleafs&#34; class=&#34;tab-pane&#34;&gt;
&lt;img src=&#34;index_files/figure-html/unnamed-chunk-2-22.png&#34; width=&#34;768&#34; /&gt;
&lt;/div&gt;
&lt;div id=&#34;torontostpatricks&#34; class=&#34;tab-pane&#34;&gt;
&lt;img src=&#34;index_files/figure-html/unnamed-chunk-2-23.png&#34; width=&#34;768&#34; /&gt;
&lt;/div&gt;
&lt;div id=&#34;washingtoncapitals&#34; class=&#34;tab-pane&#34;&gt;
&lt;img src=&#34;index_files/figure-html/unnamed-chunk-2-24.png&#34; width=&#34;768&#34; /&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Looking at the results of this plot, we see that the Montreal Canadiens have been the most frequent winner as well as the team that accounts for most of those 80%+ seasons. On the other hand, the Chicago Blackhawks have the honor of being the overachiever that won despite a sub-40% points percentage.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;making-a-histogram-with-team-logos&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Making a Histogram with Team Logos&lt;/h3&gt;
&lt;p&gt;An alternative view to the one above that doesn’t require highlighting would be a conventional histogram that uses the team logos rather than points or bars. The &lt;code&gt;ggimage&lt;/code&gt; package provides a &lt;code&gt;geom_image&lt;/code&gt; that can reference a URL for an image. Fortunately, the &lt;code&gt;teamcolors&lt;/code&gt; package contains a dataset with links to logos for each current NHL team. However, for some of the champion teams that no longer exist I needed to add their logos manually.&lt;/p&gt;
&lt;p&gt;In this code block I manually create bin widths of 2.5% using the floor trick mentioned above and use a cumulative sum over a dummy variable to create the stacking effect for the logos. Then &lt;code&gt;geom_image&lt;/code&gt; references the URLs contained in the ‘logo’ column.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;nhl_w_color %&amp;gt;% 
  left_join(teamcolors::teamcolors %&amp;gt;% select(name, logo), 
            by = c(&amp;#39;Team&amp;#39; = &amp;#39;name&amp;#39;)) %&amp;gt;% 
  mutate(
    logo = case_when(
      Team == &amp;#39;Montreal Maroons&amp;#39; ~ &amp;#39;https://content.sportslogos.net/logos/1/40/thumbs/4039161926.gif&amp;#39;,
      Team == &amp;#39;Toronto Arenas&amp;#39; ~ &amp;#39;https://content.sportslogos.net/logos/1/996/thumbs/lgtkven0lgs74prrf26p6rmes.gif&amp;#39;,
      Team == &amp;#39;Toronto St. Patricks&amp;#39; ~ &amp;#39;https://content.sportslogos.net/logos/1/997/thumbs/6438.gif&amp;#39;,
      TRUE ~ logo
    ),
    point_pct_bckt = floor(pts_pct*100/2.5)*2.5/100
  ) %&amp;gt;% 
  arrange(point_pct_bckt, desc(Team)) %&amp;gt;% 
  group_by(point_pct_bckt) %&amp;gt;% 
  mutate(
    dummy = 1,
    y_val = (cumsum(dummy)-1)*3
  ) %&amp;gt;% 
  ggplot(aes(x = point_pct_bckt, y = y_val)) + 
    geom_image(aes(image = logo),
               asp = 1.5,
               size = .05
               ) +
    geom_vline(xintercept = quantile(nhl_data$pts_pct, .5), lty = 2) + 
    labs(x = &amp;quot;Points %&amp;quot;, y = &amp;quot;&amp;quot;, 
         title = &amp;quot;Points Percentage of Stanley Cup Champions (1918 - 2020)&amp;quot;,
         caption = &amp;quot;*Source: hockey-reference.com*&amp;quot;) + 
    scale_x_continuous(labels = scales::percent_format(accuracy = 1)) + 
    cowplot::theme_cowplot() + 
    theme(
      text = element_text( family = &amp;#39;BentonSans Regular&amp;#39;),
      axis.title.y = element_blank(),
      axis.text.y = element_blank(),
      axis.ticks.y = element_blank(),
      axis.line.y = element_blank(),
      plot.caption = element_markdown(),
      plot.subtitle = element_markdown(),
      plot.margin = unit(rep(1.2, 4), &amp;quot;cm&amp;quot;),
      plot.title = element_text(hjust = .7)
  )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;index_files/figure-html/hist_with_images-1.png&#34; width=&#34;768&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Now it’s much easier to see that Montreal makes up most of the dominant teams, while Chicago has been both dominant and at the lower end of the distribution.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;concluding-thoughts&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Concluding Thoughts&lt;/h2&gt;
&lt;p&gt;The &lt;code&gt;ggplot2&lt;/code&gt; ecosystem is quite impressive, and this post hardly scratches the surface of the possible options. Still, it shows four ways that a single variable, the points percentage of NHL Stanley Cup champions, can be represented:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;First, &lt;code&gt;geom_density&lt;/code&gt; creates a baseline distribution&lt;/li&gt;
&lt;li&gt;&lt;code&gt;geom_density_ridges&lt;/code&gt; from &lt;code&gt;ggridges&lt;/code&gt; can stratify that initial density plot over another variable&lt;/li&gt;
&lt;li&gt;&lt;code&gt;geom_quasirandom&lt;/code&gt; from &lt;code&gt;ggbeeswarm&lt;/code&gt; will make a ‘violin-type’ plot but with specific points that can then be operated on.&lt;/li&gt;
&lt;li&gt;Finally, &lt;code&gt;ggimage&lt;/code&gt; can change the geom to reference image URLs.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And as a bonus, I dynamically generated the tabsets for all the teams!&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;appendix-code-for-dynamic-tab-generation-in-blogdown&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Appendix: Code for Dynamic Tab Generation in Blogdown&lt;/h2&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;##Construct Tabs
cat(&amp;#39;&amp;lt;ul class=&amp;quot;nav nav-pills nav-fill&amp;quot;&amp;gt; \n&amp;#39;)
for(t in sort(unique(nhl_data$Team))){
  tid = str_to_lower(str_remove_all(t, &amp;#39; |\\.&amp;#39;))
  cat(glue(&amp;#39;&amp;lt;li class=&amp;quot;nav-item&amp;quot;&amp;gt;&amp;lt;a class = &amp;quot;nav-link {active}&amp;quot; data-toggle=&amp;quot;tab&amp;quot; href=&amp;quot;#{tid}&amp;quot;&amp;gt;{t}&amp;lt;/a&amp;gt;&amp;lt;/li&amp;gt; \n&amp;#39;,
      active = if_else(t == sort(unique(nhl_data$Team))[1], &amp;quot;active&amp;quot;, &amp;quot;&amp;quot;)))
}
cat(&amp;#39;&amp;lt;/ul&amp;gt; \n&amp;#39;)

cat(&amp;#39;&amp;lt;div class=&amp;quot;tab-content&amp;quot;&amp;gt; \n&amp;#39;)

for(t in sort(unique(nhl_data$Team))){
  tid = str_to_lower(str_remove_all(t, &amp;#39; |\\.&amp;#39;))
  cat(glue(&amp;#39;&amp;lt;div id=&amp;quot;{tid}&amp;quot; class=&amp;quot;tab-pane {active}&amp;quot;&amp;gt; \n&amp;#39;,
           active = if_else(t == sort(unique(nhl_data$Team))[1], &amp;quot;show active&amp;quot;, &amp;quot;&amp;quot;)))
  set.seed(20201121)
  
  g &amp;lt;- nhl_w_color %&amp;gt;% 
      mutate(color = if_else(Team == glue(&amp;#39;{t}&amp;#39;), 
                             color, 
                             alpha(&amp;quot;grey&amp;quot;, 0.7))) %&amp;gt;% 
    ggplot(aes(y = 1, x = pts_pct, color = color)) + 
    geom_quasirandom(method = &amp;quot;tukeyDense&amp;quot;, groupOnX=F, size = 3, width = 0.2) +
    geom_vline(xintercept = quantile(nhl_data$pts_pct, .5), lty = 2) + 
    labs(x = &amp;quot;Points %&amp;quot;, y = &amp;quot;&amp;quot;, 
         title = &amp;quot;Points Percentage of Stanley Cup Champions (1918 - 2020)&amp;quot;,
         subtitle = glue(&amp;quot;&amp;lt;span style=&amp;#39;color:{col};&amp;#39;&amp;gt;&amp;lt;b&amp;gt;&amp;lt;i&amp;gt;{t}&amp;lt;/i&amp;gt;&amp;lt;/b&amp;gt;&amp;lt;/span&amp;gt; Championships Highlighted&amp;quot;,
                         col = nhl_w_color %&amp;gt;% 
                           filter(Team == glue(&amp;#39;{t}&amp;#39;)) %&amp;gt;% 
                           pull(color) %&amp;gt;% 
                           unique
                         ),
         caption = &amp;quot;*Source: hockey-reference.com*&amp;quot;) + 
    scale_color_identity(guide = F) + 
    scale_x_continuous(labels = scales::percent_format(accuracy = 1)) + 
    cowplot::theme_cowplot() + 
    theme(
      text = element_text( family = &amp;#39;BentonSans Regular&amp;#39;),
      axis.title.y = element_blank(),
      axis.text.y = element_blank(),
      axis.ticks.y = element_blank(),
      axis.line.y = element_blank(),
      plot.caption = element_markdown(),
      plot.subtitle = element_markdown(),
      plot.margin = unit(rep(1.2, 4), &amp;quot;cm&amp;quot;)
    )
  
  print(g)
  
  cat(&amp;quot;&amp;lt;/div&amp;gt; \n&amp;quot;) 
}
cat(&amp;quot;&amp;lt;/div&amp;gt; \n&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>What&#39;s the most successful Dancing With the Stars &#34;Profession&#34;? Visualizing with {gt}</title>
      <link>https://jlaw.netlify.app/2020/11/24/what-s-the-most-successful-dancing-with-the-stars-profession-visualizing-with-gt/</link>
      <pubDate>Tue, 24 Nov 2020 00:00:00 +0000</pubDate>
      <guid>https://jlaw.netlify.app/2020/11/24/what-s-the-most-successful-dancing-with-the-stars-profession-visualizing-with-gt/</guid>
      <description>


&lt;div id=&#34;motivation&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Motivation&lt;/h1&gt;
&lt;p&gt;During this pandemic I’ve found a source of comfort in Dancing with the Stars (DWTS). I’d never watched a season before this one, and I think the main reasons for starting now are:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Lack of anything else to watch&lt;/li&gt;
&lt;li&gt;The rapper Nelly (and the St. Lunatics) have a near and dear place in my heart.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;On the R front, I’ve wanted to mess around with the &lt;code&gt;gt&lt;/code&gt; package for a while now but hadn’t had a great reason to. I had originally wanted to do a post on whether DWTS has “score inflation” throughout the season, but that wound up being more complicated than I would have liked. So instead, why not answer &lt;strong&gt;what is the most successful type of star on Dancing with the Stars?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;And on the &lt;code&gt;gt&lt;/code&gt; front a huge shout-out to &lt;a href=&#34;https://rpubs.com/kaustav/table_contest_2020&#34;&gt;Kaustav Sen&lt;/a&gt; whose post on &lt;code&gt;gt&lt;/code&gt; for the Great American Beer Festival served as a large design inspiration for this post.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;the-final-output&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;The Final Output&lt;/h1&gt;
&lt;p&gt;At the end of this post, the final output for the table will look like:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;final_table.png&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;the-pre-processing&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;The Pre-Processing&lt;/h1&gt;
&lt;div id=&#34;load-the-libraries&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Load the Libraries&lt;/h2&gt;
&lt;p&gt;The main focus of this post is on the &lt;code&gt;gt&lt;/code&gt; package to make the table, however, other packages are used to get and work with the data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(rvest) #Web Scrape Wikipedia
library(tidyverse) #Data Manipulation / Plots
library(lubridate) #Date Manipulation
library(gt) #Making Fancy Tables / The Focus of This Post
library(glue) #Text Manipulation&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;getting-all-the-dwts-contestants&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Getting all the DWTS Contestants&lt;/h2&gt;
&lt;p&gt;In order to find the most successful star type, we need a list of all the contestants. Fortunately, Wikipedia has a page for every season, and each page has a table of information about the contestants, including their name, what they’re known for, and their status for the season.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;contestant_list_example.PNG&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Since there are 29 completed Dancing with the Stars seasons, this seems like a job for a function that iterates through each season’s Wikipedia page to extract that table. One note is that Season 15 was an all-star season, so it will be excluded from this analysis. Unfortunately, the contestant table isn’t always in the same place on the page, so the function will need to be a little flexible.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;dwts_constants &amp;lt;- function(season_number, tbl_number){
  read_html(glue(&amp;#39;https://en.wikipedia.org/wiki/Dancing_with_the_Stars_(American_season_{season_number})&amp;#39;)) %&amp;gt;% 
    html_nodes(&amp;#39;table&amp;#39;) %&amp;gt;% 
    .[[tbl_number]] %&amp;gt;%
    html_table() %&amp;gt;% 
    mutate(season = season_number) %&amp;gt;% 
    janitor::clean_names()
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Given a season number and the number of the table on the page to extract, the above function will extract and lightly clean the data. The following code stacks all of the seasons’ contestant tables on top of each other.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;contestants &amp;lt;- dwts_constants(1, 2) %&amp;gt;% 
  bind_rows(dwts_constants(2, 3) ) %&amp;gt;% 
  bind_rows(dwts_constants(3, 3) ) %&amp;gt;% 
  bind_rows(dwts_constants(4, 3) ) %&amp;gt;% 
  bind_rows(dwts_constants(5, 2) ) %&amp;gt;% 
  bind_rows(dwts_constants(6, 2) ) %&amp;gt;% 
  bind_rows(dwts_constants(7, 2) ) %&amp;gt;% 
  bind_rows(dwts_constants(8, 2) ) %&amp;gt;% 
  bind_rows(dwts_constants(9, 2) ) %&amp;gt;% 
  bind_rows(dwts_constants(10, 2)) %&amp;gt;% 
  bind_rows(dwts_constants(11, 2)) %&amp;gt;% 
  bind_rows(dwts_constants(12, 2)) %&amp;gt;% 
  bind_rows(dwts_constants(13, 2)) %&amp;gt;% 
  bind_rows(dwts_constants(14, 3)) %&amp;gt;% 
  #bind_rows(dwts_constants(15, 2)) %&amp;gt;%  #Season 15 is an All-Star Season
  bind_rows(dwts_constants(16, 2)) %&amp;gt;% 
  bind_rows(dwts_constants(17, 2)) %&amp;gt;% 
  bind_rows(dwts_constants(18, 2)) %&amp;gt;% 
  bind_rows(dwts_constants(19, 2)) %&amp;gt;% 
  bind_rows(dwts_constants(20, 2)) %&amp;gt;% 
  bind_rows(dwts_constants(21, 2)) %&amp;gt;% 
  bind_rows(dwts_constants(22, 2)) %&amp;gt;% 
  bind_rows(dwts_constants(23, 2)) %&amp;gt;% 
  bind_rows(dwts_constants(24, 2)) %&amp;gt;% 
  bind_rows(dwts_constants(25, 2)) %&amp;gt;% 
  bind_rows(dwts_constants(26, 2)) %&amp;gt;% 
  bind_rows(dwts_constants(27, 2)) %&amp;gt;% 
  bind_rows(dwts_constants(28, 2)) %&amp;gt;% 
  bind_rows(dwts_constants(29, 2))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Directly from this function the raw data looks like:&lt;/p&gt;
&lt;table&gt;
&lt;colgroup&gt;
&lt;col width=&#34;10%&#34; /&gt;
&lt;col width=&#34;12%&#34; /&gt;
&lt;col width=&#34;12%&#34; /&gt;
&lt;col width=&#34;17%&#34; /&gt;
&lt;col width=&#34;4%&#34; /&gt;
&lt;col width=&#34;4%&#34; /&gt;
&lt;col width=&#34;13%&#34; /&gt;
&lt;col width=&#34;2%&#34; /&gt;
&lt;col width=&#34;14%&#34; /&gt;
&lt;col width=&#34;9%&#34; /&gt;
&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;celebrity&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;notability_known_for&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;professional_partner&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;status&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;season&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;result&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;professional_partner_a&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;ref&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;professional_partner_a_7&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;celebrity_12_13&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Trista Sutter&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;The Bachelorette star&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Louis Van Amstel&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Eliminated 1ston June 8, 2005&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NA&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NA&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NA&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NA&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Evander Holyfield&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Heavyweight boxer&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Edyta Sliwinska&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Eliminated 2ndon June 15, 2005&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NA&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NA&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NA&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NA&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Rachel Hunter&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Supermodel&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Jonathan Roberts&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Eliminated 3rdon June 22, 2005&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NA&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NA&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NA&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NA&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NA&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;div id=&#34;cleaning-the-data&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Cleaning the data&lt;/h3&gt;
&lt;p&gt;Looking at the raw data there is a lot of data cleaning to be done:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;The contestant’s result shows up in two different columns (&lt;code&gt;result&lt;/code&gt;, &lt;code&gt;status&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;result&lt;/code&gt; field has both placing information and the dates when contestants were eliminated or won. For example, “Eliminated 1st” needs to be turned into last place (depending on how many contestants there were that season)&lt;/li&gt;
&lt;li&gt;The data contains contestants who withdrew so their place had nothing to do with their “Profession”&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;result&lt;/code&gt; field can be cleaned up to be standardized&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;notability&lt;/code&gt; field needs to be standardized&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;All of these steps are handled in the following code:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;contestant_clean &amp;lt;- contestants %&amp;gt;%
  mutate(
    #Compress Fields That Have Different Names Per Season
    result = coalesce(result, status),
    #Get the dates when Eliminations / Wins Happen
    status_date = mdy(str_extract(result, &amp;quot;\\w+ \\d+, \\d{4}&amp;quot;)),
    #Get the Order of Elimination
    eliminated_state = str_extract(result, &amp;quot;Eliminated \\d+&amp;quot;) %&amp;gt;% 
      str_remove(&amp;#39;Eliminated &amp;#39;) %&amp;gt;%
      as.numeric()
  ) %&amp;gt;% 
  # Remove Contestants that Withdraw
  filter(!str_detect(result, &amp;#39;Withdrew&amp;#39;)) %&amp;gt;% 
  group_by(season) %&amp;gt;% 
  # Add the number of contestants for each season
  mutate(n_contestants = n()) %&amp;gt;% 
  ungroup() %&amp;gt;% 
  #Overwrite Places for 1st/2nd/3rd
  mutate(
    place = case_when(
      str_detect(result, &amp;quot;Winner&amp;quot;) ~ 1,
      str_detect(result, &amp;quot;Runner|Second&amp;quot;) ~ 2,
      str_detect(result, &amp;quot;Third&amp;quot;) ~ 3,
      str_detect(result, &amp;quot;Fourth&amp;quot;) ~ 4,
      TRUE ~ n_contestants - eliminated_state + 1
    ),
    # Standardize What Contestants Are &amp;quot;Known For&amp;quot;
    known_for = case_when(
      str_detect(str_to_lower(notability_known_for), 
                 &amp;#39;actor|actress|disney&amp;#39;) ~ &amp;#39;Actor/Actress&amp;#39;,
      str_detect(str_to_lower(notability_known_for), 
                 &amp;#39;singer|rapper|band|composer&amp;#39;) ~ &amp;#39;Musician&amp;#39;,
      str_detect(str_to_lower(notability_known_for), 
                 &amp;#39;model|miss usa&amp;#39;) ~ &amp;#39;Model&amp;#39;,
      str_detect(str_to_lower(notability_known_for),
                 &amp;#39;nhl|nfl|nba|boxer|olympi|diva|tennis|soccer|football|lakers|swim|ufc|nascar|snowboard|wwe|mlb|basketball|rodeo|skier|race car|jockey|dolphins|steelers|packers|lakers|indy 500&amp;#39;) ~ &amp;#39;Athlete&amp;#39;,
      str_detect(str_to_lower(notability_known_for), 
                 &amp;#39;journ|anchor|host|caster|personality&amp;#39;) ~ &amp;#39;Media Personality&amp;#39;,
      str_detect(str_to_lower(notability_known_for), 
                 &amp;#39;bachelor|star|chef&amp;#39;) ~ &amp;#39;Reality TV Star&amp;#39;,
      str_detect(str_to_lower(notability_known_for), 
                 &amp;#39;comedian|magician|entertainer&amp;#39;) ~ &amp;#39;Entertainer&amp;#39;,
      str_detect(str_to_lower(notability_known_for), 
                 &amp;#39;owner|co-founder|business|designer&amp;#39;) ~ &amp;#39;Businessperson&amp;#39;,
      TRUE ~ &amp;quot;Other&amp;quot;
    )
  ) %&amp;gt;% 
  # Fix Celebrity Column for Season 29
  mutate(celebrity = if_else(is.na(celebrity), celebrity_12_13, celebrity)) %&amp;gt;% 
  # Remove Unneeded Columns
  select(-contains(&amp;#39;professional&amp;#39;), -ref, -status, 
         -eliminated_state, -celebrity_12_13) %&amp;gt;% 
  #Want Scores to be between 0 and 1 where 1 is Last Place and 0 is first place.
  mutate(scaled_place = (place-1)/(n_contestants-1))&lt;/code&gt;&lt;/pre&gt;
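<p>The scaling in the last step can be sanity-checked in isolation (the <code>scaled_place()</code> helper is my own illustrative wrapper around the same formula the <code>mutate()</code> uses):</p>
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Sketch of the scaling: first place maps to 0, last place maps to 1
scaled_place &amp;lt;- function(place, n_contestants) (place - 1) / (n_contestants - 1)

scaled_place(1, 6)  # 0 (the winner)
scaled_place(6, 6)  # 1 (eliminated first)&lt;/code&gt;&lt;/pre&gt;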
&lt;p&gt;The &lt;code&gt;scaled_place&lt;/code&gt; variable will be used to create a standardized density plot by putting each season on a 1 (Last Place) to 0 (1st Place) scale regardless of the number of contestants in the season. The cleaned data now looks like:&lt;/p&gt;
&lt;table style=&#34;width:100%;&#34;&gt;
&lt;colgroup&gt;
&lt;col width=&#34;12%&#34; /&gt;
&lt;col width=&#34;19%&#34; /&gt;
&lt;col width=&#34;4%&#34; /&gt;
&lt;col width=&#34;21%&#34; /&gt;
&lt;col width=&#34;8%&#34; /&gt;
&lt;col width=&#34;9%&#34; /&gt;
&lt;col width=&#34;4%&#34; /&gt;
&lt;col width=&#34;10%&#34; /&gt;
&lt;col width=&#34;8%&#34; /&gt;
&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;celebrity&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;notability_known_for&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;season&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;result&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;status_date&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;n_contestants&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;place&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;known_for&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;scaled_place&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Trista Sutter&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;The Bachelorette star&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Eliminated 1ston June 8, 2005&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2005-06-08&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;6&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;6&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Reality TV Star&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Evander Holyfield&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Heavyweight boxer&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Eliminated 2ndon June 15, 2005&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2005-06-15&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;6&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;5&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Athlete&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Rachel Hunter&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Supermodel&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Eliminated 3rdon June 22, 2005&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2005-06-22&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;6&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;4&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Model&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Joey McIntyre&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;New Kids on the Block singer&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Third placeon June 29, 2005&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2005-06-29&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;6&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;3&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Musician&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;John O’Hurley&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Actor &amp;amp; game show host&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Runner-upon July 6, 2005&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2005-07-06&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;6&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Actor/Actress&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.2&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
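&lt;p&gt;The &lt;code&gt;scaled_place&lt;/code&gt; values above can be verified by hand. For the six-contestant first season, applying the formula to each place gives:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;place &amp;lt;- 6:1
(place - 1) / (6 - 1)
# [1] 1.0 0.8 0.6 0.4 0.2 0.0&lt;/code&gt;&lt;/pre&gt;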
&lt;p&gt;Using regular expressions, I’ve collapsed the 237 different levels into 9, which are:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;Profession&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;Examples&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Actor/Actress&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Zendaya, Alexa PenaVega, Amber Riley&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Athlete&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Jamie Anderson, Antonio Brown, Martina Navratilova&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Businessperson&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Steve Wozniak, Robert Herjavec, Mark Cuban&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Entertainer&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Penn Jillette, Marie Osmond, Margaret Cho&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Media Personality&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Jerry Springer, Bobby Bones, Giselle Fernandez&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Model&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Bonner Bolton, Shandi Finnessey, Sailor Brinkley-Cook&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Musician&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Joey McIntyre, Gavin DeGraw, Nick Carter&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Other&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Sean Spicer, Buzz Aldrin, Noah Galloway&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Reality TV Star&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;The Situation, Lisa Vanderpump, Terra Jolé&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
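&lt;p&gt;One thing to keep in mind with this approach is that &lt;code&gt;case_when()&lt;/code&gt; returns the &lt;em&gt;first&lt;/em&gt; matching condition, so the order of the patterns matters. A minimal sketch of the matching logic (the &lt;code&gt;bios&lt;/code&gt; vector here is hypothetical, and only two of the patterns are shown):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(dplyr)
library(stringr)

# Hypothetical bios to illustrate first-match-wins ordering
bios &amp;lt;- c(&amp;#39;Soap opera actress&amp;#39;, &amp;#39;Reality TV star&amp;#39;, &amp;#39;Astronaut&amp;#39;)

case_when(
  str_detect(str_to_lower(bios), &amp;#39;actor|actress|disney&amp;#39;) ~ &amp;#39;Actor/Actress&amp;#39;,
  str_detect(str_to_lower(bios), &amp;#39;bachelor|star|chef&amp;#39;) ~ &amp;#39;Reality TV Star&amp;#39;,
  TRUE ~ &amp;#39;Other&amp;#39;
)
# [1] &amp;quot;Actor/Actress&amp;quot;   &amp;quot;Reality TV Star&amp;quot; &amp;quot;Other&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Because &amp;#39;star&amp;#39; appears in many bios, placing the &amp;#39;bachelor|star|chef&amp;#39; pattern later in the chain keeps it from swallowing the more specific categories.&lt;/p&gt;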
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;constructing-the-table&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Constructing The Table&lt;/h1&gt;
&lt;div id=&#34;organizing-the-data&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Organizing the Data&lt;/h2&gt;
&lt;p&gt;For the table, we want the following information for each “Profession”:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;How many contestants were there?&lt;/li&gt;
&lt;li&gt;What percentages came in 1st, 2nd, 3rd, and Last?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Some quick &lt;code&gt;dplyr&lt;/code&gt; magic will collapse the list of contestants into the structure we want. We’ll also order the table by each “profession”’s percentage of first-place wins, descending.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;contestant_summary &amp;lt;- contestant_clean %&amp;gt;% 
  group_by(known_for) %&amp;gt;% 
  summarize(
    num_stars = n(),
    pct_1st_place = sum(place == 1)/n(),
    pct_2nd_place = sum(place == 2)/n(),
    pct_3rd_place = sum(place == 3)/n(),
    pct_last_place = sum(n_contestants == place) / n()
  ) %&amp;gt;% 
  arrange(-pct_1st_place)&lt;/code&gt;&lt;/pre&gt;
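&lt;p&gt;As an aside, dividing &lt;code&gt;sum()&lt;/code&gt; of a logical condition by &lt;code&gt;n()&lt;/code&gt; is the same as taking &lt;code&gt;mean()&lt;/code&gt; of that condition, so the summary could be written more compactly (a sketch of the idiom, not the code used above):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;contestant_clean %&amp;gt;% 
  group_by(known_for) %&amp;gt;% 
  summarize(
    num_stars = n(),
    pct_1st_place = mean(place == 1),
    pct_last_place = mean(place == n_contestants)
  ) %&amp;gt;% 
  arrange(-pct_1st_place)&lt;/code&gt;&lt;/pre&gt;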
&lt;/div&gt;
&lt;div id=&#34;using-gt-to-build-the-table&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Using {gt} to Build the Table&lt;/h2&gt;
&lt;p&gt;Now onto actually constructing the table with &lt;code&gt;gt&lt;/code&gt;. The &lt;a href=&#34;https://gt.rstudio.com/&#34;&gt;&lt;code&gt;gt&lt;/code&gt; package&lt;/a&gt; provides a grammar for tables similar to what &lt;code&gt;ggplot2&lt;/code&gt; does for charts. The package provides this visualization to show the different parts of a table:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://gt.rstudio.com/reference/figures/gt_parts_of_a_table.svg&#34; /&gt;&lt;/p&gt;
&lt;div id=&#34;step-1-the-basic-construction&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Step 1: The basic construction&lt;/h3&gt;
&lt;p&gt;The most basic construction of a table is done by using the &lt;code&gt;gt()&lt;/code&gt; function.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;(g1 &amp;lt;- gt(contestant_summary))&lt;/code&gt;&lt;/pre&gt;
&lt;style&gt;html {
  font-family: -apple-system, BlinkMacSystemFont, &#39;Segoe UI&#39;, Roboto, Oxygen, Ubuntu, Cantarell, &#39;Helvetica Neue&#39;, &#39;Fira Sans&#39;, &#39;Droid Sans&#39;, Arial, sans-serif;
}

#munfbphqza .gt_table {
  display: table;
  border-collapse: collapse;
  margin-left: auto;
  margin-right: auto;
  color: #333333;
  font-size: 16px;
  font-weight: normal;
  font-style: normal;
  background-color: #FFFFFF;
  width: auto;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #A8A8A8;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #A8A8A8;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
}

#munfbphqza .gt_heading {
  background-color: #FFFFFF;
  text-align: center;
  border-bottom-color: #FFFFFF;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
}

#munfbphqza .gt_title {
  color: #333333;
  font-size: 125%;
  font-weight: initial;
  padding-top: 4px;
  padding-bottom: 4px;
  border-bottom-color: #FFFFFF;
  border-bottom-width: 0;
}

#munfbphqza .gt_subtitle {
  color: #333333;
  font-size: 85%;
  font-weight: initial;
  padding-top: 0;
  padding-bottom: 4px;
  border-top-color: #FFFFFF;
  border-top-width: 0;
}

#munfbphqza .gt_bottom_border {
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
}

#munfbphqza .gt_col_headings {
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
}

#munfbphqza .gt_col_heading {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: normal;
  text-transform: inherit;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: bottom;
  padding-top: 5px;
  padding-bottom: 6px;
  padding-left: 5px;
  padding-right: 5px;
  overflow-x: hidden;
}

#munfbphqza .gt_column_spanner_outer {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: normal;
  text-transform: inherit;
  padding-top: 0;
  padding-bottom: 0;
  padding-left: 4px;
  padding-right: 4px;
}

#munfbphqza .gt_column_spanner_outer:first-child {
  padding-left: 0;
}

#munfbphqza .gt_column_spanner_outer:last-child {
  padding-right: 0;
}

#munfbphqza .gt_column_spanner {
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  vertical-align: bottom;
  padding-top: 5px;
  padding-bottom: 6px;
  overflow-x: hidden;
  display: inline-block;
  width: 100%;
}

#munfbphqza .gt_group_heading {
  padding: 8px;
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  text-transform: inherit;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: middle;
}

#munfbphqza .gt_empty_group_heading {
  padding: 0.5px;
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  vertical-align: middle;
}

#munfbphqza .gt_from_md &gt; :first-child {
  margin-top: 0;
}

#munfbphqza .gt_from_md &gt; :last-child {
  margin-bottom: 0;
}

#munfbphqza .gt_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  margin: 10px;
  border-top-style: solid;
  border-top-width: 1px;
  border-top-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: middle;
  overflow-x: hidden;
}

#munfbphqza .gt_stub {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  text-transform: inherit;
  border-right-style: solid;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
  padding-left: 12px;
}

#munfbphqza .gt_summary_row {
  color: #333333;
  background-color: #FFFFFF;
  text-transform: inherit;
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
}

#munfbphqza .gt_first_summary_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
}

#munfbphqza .gt_grand_summary_row {
  color: #333333;
  background-color: #FFFFFF;
  text-transform: inherit;
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
}

#munfbphqza .gt_first_grand_summary_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  border-top-style: double;
  border-top-width: 6px;
  border-top-color: #D3D3D3;
}

#munfbphqza .gt_striped {
  background-color: rgba(128, 128, 128, 0.05);
}

#munfbphqza .gt_table_body {
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
}

#munfbphqza .gt_footnotes {
  color: #333333;
  background-color: #FFFFFF;
  border-bottom-style: none;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
}

#munfbphqza .gt_footnote {
  margin: 0px;
  font-size: 90%;
  padding: 4px;
}

#munfbphqza .gt_sourcenotes {
  color: #333333;
  background-color: #FFFFFF;
  border-bottom-style: none;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
}

#munfbphqza .gt_sourcenote {
  font-size: 90%;
  padding: 4px;
}

#munfbphqza .gt_left {
  text-align: left;
}

#munfbphqza .gt_center {
  text-align: center;
}

#munfbphqza .gt_right {
  text-align: right;
  font-variant-numeric: tabular-nums;
}

#munfbphqza .gt_font_normal {
  font-weight: normal;
}

#munfbphqza .gt_font_bold {
  font-weight: bold;
}

#munfbphqza .gt_font_italic {
  font-style: italic;
}

#munfbphqza .gt_super {
  font-size: 65%;
}

#munfbphqza .gt_footnote_marks {
  font-style: italic;
  font-size: 65%;
}
&lt;/style&gt;
&lt;div id=&#34;munfbphqza&#34; style=&#34;overflow-x:auto;overflow-y:auto;width:auto;height:auto;&#34;&gt;&lt;table class=&#34;gt_table&#34;&gt;
  
  &lt;thead class=&#34;gt_col_headings&#34;&gt;
    &lt;tr&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_left&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;known_for&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;num_stars&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_right&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;pct_1st_place&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_right&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;pct_2nd_place&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_right&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;pct_3rd_place&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_right&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;pct_last_place&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody class=&#34;gt_table_body&#34;&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Athlete&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;79&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.13924051&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.10126582&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.06329114&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.10126582&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Musician&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;38&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07894737&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.10526316&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.15789474&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.13157895&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Actor/Actress&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;130&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07692308&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.09230769&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.06923077&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.04615385&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Reality TV Star&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;26&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07692308&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07692308&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07692308&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.03846154&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Model&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;14&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07142857&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07142857&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07142857&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Media Personality&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;21&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.04761905&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.04761905&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.04761905&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.23809524&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Businessperson&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;5&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.20000000&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Entertainer&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;5&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.20000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.40000000&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Other&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;9&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.44444444&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
  
  
&lt;/table&gt;&lt;/div&gt;
&lt;p&gt;When I said &lt;em&gt;basic&lt;/em&gt;, I meant &lt;strong&gt;basic&lt;/strong&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;step-2-adding-titles-and-subtitles&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Step 2: Adding Titles and Subtitles&lt;/h3&gt;
&lt;p&gt;The &lt;code&gt;tab_header()&lt;/code&gt; function allows alterations to the header of the table. The &lt;code&gt;title&lt;/code&gt; and &lt;code&gt;subtitle&lt;/code&gt; arguments create the title and subtitle respectively. A nice feature of &lt;code&gt;gt&lt;/code&gt; is the &lt;code&gt;html()&lt;/code&gt; function, which allows HTML and CSS to be used to style these titles. There is also an &lt;code&gt;md()&lt;/code&gt; function that renders Markdown.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;(g2 &amp;lt;- g1 %&amp;gt;% 
  tab_header(
    title = html(&amp;#39;Most &amp;lt;span style=&amp;quot;color:#F2CB05&amp;quot;&amp;gt;Successful&amp;lt;/span&amp;gt; Dancing With the Stars &amp;lt;i&amp;gt;&amp;quot;Professions&amp;quot;&amp;lt;/i&amp;gt;&amp;#39;),
    subtitle = html(
      &amp;quot;&amp;lt;span style = &amp;#39;color: grey&amp;#39;&amp;gt;Covering Seasons 1 to 29 (excluding All-Star Season 15)&amp;lt;/span&amp;gt;&amp;quot;
    )
  ))&lt;/code&gt;&lt;/pre&gt;
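&lt;p&gt;For comparison, a similar header could be written with Markdown via &lt;code&gt;md()&lt;/code&gt;, trading the color styling for lighter syntax. A minimal sketch (not run here):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;g1 %&amp;gt;% 
  tab_header(
    title = md(&amp;#39;Most Successful Dancing With the Stars *&amp;quot;Professions&amp;quot;*&amp;#39;),
    subtitle = md(&amp;#39;Covering Seasons 1 to 29 (excluding All-Star Season 15)&amp;#39;)
  )&lt;/code&gt;&lt;/pre&gt;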
&lt;style&gt;html {
  font-family: -apple-system, BlinkMacSystemFont, &#39;Segoe UI&#39;, Roboto, Oxygen, Ubuntu, Cantarell, &#39;Helvetica Neue&#39;, &#39;Fira Sans&#39;, &#39;Droid Sans&#39;, Arial, sans-serif;
}

#axkxzgcnox .gt_table {
  display: table;
  border-collapse: collapse;
  margin-left: auto;
  margin-right: auto;
  color: #333333;
  font-size: 16px;
  font-weight: normal;
  font-style: normal;
  background-color: #FFFFFF;
  width: auto;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #A8A8A8;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #A8A8A8;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
}

#axkxzgcnox .gt_heading {
  background-color: #FFFFFF;
  text-align: center;
  border-bottom-color: #FFFFFF;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
}

#axkxzgcnox .gt_title {
  color: #333333;
  font-size: 125%;
  font-weight: initial;
  padding-top: 4px;
  padding-bottom: 4px;
  border-bottom-color: #FFFFFF;
  border-bottom-width: 0;
}

#axkxzgcnox .gt_subtitle {
  color: #333333;
  font-size: 85%;
  font-weight: initial;
  padding-top: 0;
  padding-bottom: 4px;
  border-top-color: #FFFFFF;
  border-top-width: 0;
}

#axkxzgcnox .gt_bottom_border {
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
}

#axkxzgcnox .gt_col_headings {
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
}

#axkxzgcnox .gt_col_heading {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: normal;
  text-transform: inherit;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: bottom;
  padding-top: 5px;
  padding-bottom: 6px;
  padding-left: 5px;
  padding-right: 5px;
  overflow-x: hidden;
}

#axkxzgcnox .gt_column_spanner_outer {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: normal;
  text-transform: inherit;
  padding-top: 0;
  padding-bottom: 0;
  padding-left: 4px;
  padding-right: 4px;
}

#axkxzgcnox .gt_column_spanner_outer:first-child {
  padding-left: 0;
}

#axkxzgcnox .gt_column_spanner_outer:last-child {
  padding-right: 0;
}

#axkxzgcnox .gt_column_spanner {
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  vertical-align: bottom;
  padding-top: 5px;
  padding-bottom: 6px;
  overflow-x: hidden;
  display: inline-block;
  width: 100%;
}

#axkxzgcnox .gt_group_heading {
  padding: 8px;
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  text-transform: inherit;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: middle;
}

#axkxzgcnox .gt_empty_group_heading {
  padding: 0.5px;
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  vertical-align: middle;
}

#axkxzgcnox .gt_from_md &gt; :first-child {
  margin-top: 0;
}

#axkxzgcnox .gt_from_md &gt; :last-child {
  margin-bottom: 0;
}

#axkxzgcnox .gt_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  margin: 10px;
  border-top-style: solid;
  border-top-width: 1px;
  border-top-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: middle;
  overflow-x: hidden;
}

#axkxzgcnox .gt_stub {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  text-transform: inherit;
  border-right-style: solid;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
  padding-left: 12px;
}

#axkxzgcnox .gt_summary_row {
  color: #333333;
  background-color: #FFFFFF;
  text-transform: inherit;
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
}

#axkxzgcnox .gt_first_summary_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
}

#axkxzgcnox .gt_grand_summary_row {
  color: #333333;
  background-color: #FFFFFF;
  text-transform: inherit;
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
}

#axkxzgcnox .gt_first_grand_summary_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  border-top-style: double;
  border-top-width: 6px;
  border-top-color: #D3D3D3;
}

#axkxzgcnox .gt_striped {
  background-color: rgba(128, 128, 128, 0.05);
}

#axkxzgcnox .gt_table_body {
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
}

#axkxzgcnox .gt_footnotes {
  color: #333333;
  background-color: #FFFFFF;
  border-bottom-style: none;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
}

#axkxzgcnox .gt_footnote {
  margin: 0px;
  font-size: 90%;
  padding: 4px;
}

#axkxzgcnox .gt_sourcenotes {
  color: #333333;
  background-color: #FFFFFF;
  border-bottom-style: none;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
}

#axkxzgcnox .gt_sourcenote {
  font-size: 90%;
  padding: 4px;
}

#axkxzgcnox .gt_left {
  text-align: left;
}

#axkxzgcnox .gt_center {
  text-align: center;
}

#axkxzgcnox .gt_right {
  text-align: right;
  font-variant-numeric: tabular-nums;
}

#axkxzgcnox .gt_font_normal {
  font-weight: normal;
}

#axkxzgcnox .gt_font_bold {
  font-weight: bold;
}

#axkxzgcnox .gt_font_italic {
  font-style: italic;
}

#axkxzgcnox .gt_super {
  font-size: 65%;
}

#axkxzgcnox .gt_footnote_marks {
  font-style: italic;
  font-size: 65%;
}
&lt;/style&gt;
&lt;div id=&#34;axkxzgcnox&#34; style=&#34;overflow-x:auto;overflow-y:auto;width:auto;height:auto;&#34;&gt;&lt;table class=&#34;gt_table&#34;&gt;
  &lt;thead class=&#34;gt_header&#34;&gt;
    &lt;tr&gt;
      &lt;th colspan=&#34;6&#34; class=&#34;gt_heading gt_title gt_font_normal&#34; style&gt;Most &lt;span style=&#34;color:#F2CB05&#34;&gt;Successful&lt;/span&gt; Dancing With the Stars &lt;i&gt;&#34;Professions&#34;&lt;/i&gt;&lt;/th&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th colspan=&#34;6&#34; class=&#34;gt_heading gt_subtitle gt_font_normal gt_bottom_border&#34; style&gt;&lt;span style = &#39;color: grey&#39;&gt;Covering Seasons 1 to 29 (excluding All-Star Season 15)&lt;/span&gt;&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;thead class=&#34;gt_col_headings&#34;&gt;
    &lt;tr&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_left&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;known_for&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;num_stars&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_right&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;pct_1st_place&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_right&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;pct_2nd_place&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_right&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;pct_3rd_place&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_right&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;pct_last_place&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody class=&#34;gt_table_body&#34;&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Athlete&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;79&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.13924051&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.10126582&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.06329114&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.10126582&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Musician&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;38&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07894737&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.10526316&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.15789474&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.13157895&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Actor/Actress&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;130&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07692308&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.09230769&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.06923077&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.04615385&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Reality TV Star&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;26&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07692308&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07692308&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07692308&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.03846154&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Model&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;14&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07142857&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07142857&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07142857&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Media Personality&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;21&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.04761905&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.04761905&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.04761905&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.23809524&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Businessperson&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;5&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.20000000&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Entertainer&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;5&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.20000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.40000000&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Other&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;9&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.44444444&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
  
  
&lt;/table&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;step-3-adding-more-style-to-the-title&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Step 3: Adding More Style to the Title&lt;/h3&gt;
&lt;p&gt;The &lt;code&gt;tab_style()&lt;/code&gt; function applies formatting to table rows and cells. The &lt;code&gt;style&lt;/code&gt; argument tells &lt;code&gt;gt&lt;/code&gt; what the style will be, and the &lt;code&gt;locations&lt;/code&gt; argument says where that style should be applied.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;google_font()&lt;/code&gt; function allows access to all the fonts on the &lt;a href=&#34;https://fonts.google.com/&#34;&gt;Google Fonts&lt;/a&gt; site.&lt;/p&gt;
&lt;p&gt;In this step I’m making the title left-justified, sized xx-large, and set in the Anton font.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;(g3 &amp;lt;- g2 %&amp;gt;% 
  tab_style(
    style = cell_text(
      font = google_font(&amp;quot;Anton&amp;quot;), 
      align = &amp;quot;left&amp;quot;, 
      size = &amp;quot;xx-large&amp;quot;
    ),
    locations = cells_title(&amp;quot;title&amp;quot;)
  )
)&lt;/code&gt;&lt;/pre&gt;
&lt;style&gt;@import url(&#34;https://fonts.googleapis.com/css2?family=Anton:ital,wght@0,100;0,200;0,300;0,400;0,500;0,600;0,700;0,800;0,900;1,100;1,200;1,300;1,400;1,500;1,600;1,700;1,800;1,900&amp;display=swap&#34;);
html {
  font-family: -apple-system, BlinkMacSystemFont, &#39;Segoe UI&#39;, Roboto, Oxygen, Ubuntu, Cantarell, &#39;Helvetica Neue&#39;, &#39;Fira Sans&#39;, &#39;Droid Sans&#39;, Arial, sans-serif;
}

#ufxtnyhcon .gt_table {
  display: table;
  border-collapse: collapse;
  margin-left: auto;
  margin-right: auto;
  color: #333333;
  font-size: 16px;
  font-weight: normal;
  font-style: normal;
  background-color: #FFFFFF;
  width: auto;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #A8A8A8;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #A8A8A8;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
}

#ufxtnyhcon .gt_heading {
  background-color: #FFFFFF;
  text-align: center;
  border-bottom-color: #FFFFFF;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
}

#ufxtnyhcon .gt_title {
  color: #333333;
  font-size: 125%;
  font-weight: initial;
  padding-top: 4px;
  padding-bottom: 4px;
  border-bottom-color: #FFFFFF;
  border-bottom-width: 0;
}

#ufxtnyhcon .gt_subtitle {
  color: #333333;
  font-size: 85%;
  font-weight: initial;
  padding-top: 0;
  padding-bottom: 4px;
  border-top-color: #FFFFFF;
  border-top-width: 0;
}

#ufxtnyhcon .gt_bottom_border {
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
}

#ufxtnyhcon .gt_col_headings {
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
}

#ufxtnyhcon .gt_col_heading {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: normal;
  text-transform: inherit;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: bottom;
  padding-top: 5px;
  padding-bottom: 6px;
  padding-left: 5px;
  padding-right: 5px;
  overflow-x: hidden;
}

#ufxtnyhcon .gt_column_spanner_outer {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: normal;
  text-transform: inherit;
  padding-top: 0;
  padding-bottom: 0;
  padding-left: 4px;
  padding-right: 4px;
}

#ufxtnyhcon .gt_column_spanner_outer:first-child {
  padding-left: 0;
}

#ufxtnyhcon .gt_column_spanner_outer:last-child {
  padding-right: 0;
}

#ufxtnyhcon .gt_column_spanner {
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  vertical-align: bottom;
  padding-top: 5px;
  padding-bottom: 6px;
  overflow-x: hidden;
  display: inline-block;
  width: 100%;
}

#ufxtnyhcon .gt_group_heading {
  padding: 8px;
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  text-transform: inherit;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: middle;
}

#ufxtnyhcon .gt_empty_group_heading {
  padding: 0.5px;
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  vertical-align: middle;
}

#ufxtnyhcon .gt_from_md &gt; :first-child {
  margin-top: 0;
}

#ufxtnyhcon .gt_from_md &gt; :last-child {
  margin-bottom: 0;
}

#ufxtnyhcon .gt_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  margin: 10px;
  border-top-style: solid;
  border-top-width: 1px;
  border-top-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: middle;
  overflow-x: hidden;
}

#ufxtnyhcon .gt_stub {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  text-transform: inherit;
  border-right-style: solid;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
  padding-left: 12px;
}

#ufxtnyhcon .gt_summary_row {
  color: #333333;
  background-color: #FFFFFF;
  text-transform: inherit;
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
}

#ufxtnyhcon .gt_first_summary_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
}

#ufxtnyhcon .gt_grand_summary_row {
  color: #333333;
  background-color: #FFFFFF;
  text-transform: inherit;
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
}

#ufxtnyhcon .gt_first_grand_summary_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  border-top-style: double;
  border-top-width: 6px;
  border-top-color: #D3D3D3;
}

#ufxtnyhcon .gt_striped {
  background-color: rgba(128, 128, 128, 0.05);
}

#ufxtnyhcon .gt_table_body {
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
}

#ufxtnyhcon .gt_footnotes {
  color: #333333;
  background-color: #FFFFFF;
  border-bottom-style: none;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
}

#ufxtnyhcon .gt_footnote {
  margin: 0px;
  font-size: 90%;
  padding: 4px;
}

#ufxtnyhcon .gt_sourcenotes {
  color: #333333;
  background-color: #FFFFFF;
  border-bottom-style: none;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
}

#ufxtnyhcon .gt_sourcenote {
  font-size: 90%;
  padding: 4px;
}

#ufxtnyhcon .gt_left {
  text-align: left;
}

#ufxtnyhcon .gt_center {
  text-align: center;
}

#ufxtnyhcon .gt_right {
  text-align: right;
  font-variant-numeric: tabular-nums;
}

#ufxtnyhcon .gt_font_normal {
  font-weight: normal;
}

#ufxtnyhcon .gt_font_bold {
  font-weight: bold;
}

#ufxtnyhcon .gt_font_italic {
  font-style: italic;
}

#ufxtnyhcon .gt_super {
  font-size: 65%;
}

#ufxtnyhcon .gt_footnote_marks {
  font-style: italic;
  font-size: 65%;
}
&lt;/style&gt;
&lt;div id=&#34;ufxtnyhcon&#34; style=&#34;overflow-x:auto;overflow-y:auto;width:auto;height:auto;&#34;&gt;&lt;table class=&#34;gt_table&#34;&gt;
  &lt;thead class=&#34;gt_header&#34;&gt;
    &lt;tr&gt;
      &lt;th colspan=&#34;6&#34; class=&#34;gt_heading gt_title gt_font_normal&#34; style=&#34;font-family: Anton; font-size: xx-large; text-align: left;&#34;&gt;Most &lt;span style=&#34;color:#F2CB05&#34;&gt;Successful&lt;/span&gt; Dancing With the Stars &lt;i&gt;&#34;Professions&#34;&lt;/i&gt;&lt;/th&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th colspan=&#34;6&#34; class=&#34;gt_heading gt_subtitle gt_font_normal gt_bottom_border&#34; style&gt;&lt;span style = &#39;color: grey&#39;&gt;Covering Seasons 1 to 29 (excluding All-Star Season 15)&lt;/span&gt;&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;thead class=&#34;gt_col_headings&#34;&gt;
    &lt;tr&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_left&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;known_for&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;num_stars&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_right&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;pct_1st_place&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_right&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;pct_2nd_place&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_right&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;pct_3rd_place&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_right&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;pct_last_place&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody class=&#34;gt_table_body&#34;&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Athlete&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;79&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.13924051&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.10126582&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.06329114&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.10126582&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Musician&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;38&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07894737&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.10526316&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.15789474&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.13157895&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Actor/Actress&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;130&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07692308&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.09230769&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.06923077&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.04615385&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Reality TV Star&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;26&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07692308&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07692308&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07692308&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.03846154&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Model&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;14&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07142857&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07142857&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07142857&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Media Personality&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;21&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.04761905&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.04761905&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.04761905&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.23809524&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Businessperson&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;5&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.20000000&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Entertainer&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;5&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.20000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.40000000&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Other&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;9&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.44444444&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
  
  
&lt;/table&gt;&lt;/div&gt;
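&lt;p&gt;The same &lt;code&gt;style&lt;/code&gt;/&lt;code&gt;locations&lt;/code&gt; pairing extends beyond the title to other location helpers such as &lt;code&gt;cells_column_labels()&lt;/code&gt; or &lt;code&gt;cells_body()&lt;/code&gt;. As a sketch (not one of the steps in this post), bolding the column labels would look like:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Sketch only: bold the column labels using the same tab_style() pattern
g3 %&amp;gt;%
  tab_style(
    style = cell_text(weight = &amp;quot;bold&amp;quot;),
    locations = cells_column_labels(everything())
  )&lt;/code&gt;&lt;/pre&gt;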
&lt;/div&gt;
&lt;div id=&#34;step-4-add-styling-to-the-subtitles&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Step 4: Add Styling to the Subtitles&lt;/h3&gt;
&lt;p&gt;Similar to Step 3, this step applies formatting to the subtitle: left-aligned, x-large, and in the Caveat font.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;(g4 &amp;lt;- g3 %&amp;gt;% 
  tab_style(
    style = cell_text(
      font = google_font(&amp;quot;Caveat&amp;quot;),
      align = &amp;quot;left&amp;quot;, 
      size = &amp;quot;x-large&amp;quot;
    ),
    locations = cells_title(&amp;quot;subtitle&amp;quot;)
  ) 
)&lt;/code&gt;&lt;/pre&gt;
&lt;style&gt;@import url(&#34;https://fonts.googleapis.com/css2?family=Caveat:ital,wght@0,100;0,200;0,300;0,400;0,500;0,600;0,700;0,800;0,900;1,100;1,200;1,300;1,400;1,500;1,600;1,700;1,800;1,900&amp;display=swap&#34;);
@import url(&#34;https://fonts.googleapis.com/css2?family=Anton:ital,wght@0,100;0,200;0,300;0,400;0,500;0,600;0,700;0,800;0,900;1,100;1,200;1,300;1,400;1,500;1,600;1,700;1,800;1,900&amp;display=swap&#34;);
html {
  font-family: -apple-system, BlinkMacSystemFont, &#39;Segoe UI&#39;, Roboto, Oxygen, Ubuntu, Cantarell, &#39;Helvetica Neue&#39;, &#39;Fira Sans&#39;, &#39;Droid Sans&#39;, Arial, sans-serif;
}

#uzrspjphjw .gt_table {
  display: table;
  border-collapse: collapse;
  margin-left: auto;
  margin-right: auto;
  color: #333333;
  font-size: 16px;
  font-weight: normal;
  font-style: normal;
  background-color: #FFFFFF;
  width: auto;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #A8A8A8;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #A8A8A8;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
}

#uzrspjphjw .gt_heading {
  background-color: #FFFFFF;
  text-align: center;
  border-bottom-color: #FFFFFF;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
}

#uzrspjphjw .gt_title {
  color: #333333;
  font-size: 125%;
  font-weight: initial;
  padding-top: 4px;
  padding-bottom: 4px;
  border-bottom-color: #FFFFFF;
  border-bottom-width: 0;
}

#uzrspjphjw .gt_subtitle {
  color: #333333;
  font-size: 85%;
  font-weight: initial;
  padding-top: 0;
  padding-bottom: 4px;
  border-top-color: #FFFFFF;
  border-top-width: 0;
}

#uzrspjphjw .gt_bottom_border {
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
}

#uzrspjphjw .gt_col_headings {
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
}

#uzrspjphjw .gt_col_heading {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: normal;
  text-transform: inherit;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: bottom;
  padding-top: 5px;
  padding-bottom: 6px;
  padding-left: 5px;
  padding-right: 5px;
  overflow-x: hidden;
}

#uzrspjphjw .gt_column_spanner_outer {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: normal;
  text-transform: inherit;
  padding-top: 0;
  padding-bottom: 0;
  padding-left: 4px;
  padding-right: 4px;
}

#uzrspjphjw .gt_column_spanner_outer:first-child {
  padding-left: 0;
}

#uzrspjphjw .gt_column_spanner_outer:last-child {
  padding-right: 0;
}

#uzrspjphjw .gt_column_spanner {
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  vertical-align: bottom;
  padding-top: 5px;
  padding-bottom: 6px;
  overflow-x: hidden;
  display: inline-block;
  width: 100%;
}

#uzrspjphjw .gt_group_heading {
  padding: 8px;
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  text-transform: inherit;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: middle;
}

#uzrspjphjw .gt_empty_group_heading {
  padding: 0.5px;
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  vertical-align: middle;
}

#uzrspjphjw .gt_from_md &gt; :first-child {
  margin-top: 0;
}

#uzrspjphjw .gt_from_md &gt; :last-child {
  margin-bottom: 0;
}

#uzrspjphjw .gt_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  margin: 10px;
  border-top-style: solid;
  border-top-width: 1px;
  border-top-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: middle;
  overflow-x: hidden;
}

#uzrspjphjw .gt_stub {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  text-transform: inherit;
  border-right-style: solid;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
  padding-left: 12px;
}

#uzrspjphjw .gt_summary_row {
  color: #333333;
  background-color: #FFFFFF;
  text-transform: inherit;
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
}

#uzrspjphjw .gt_first_summary_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
}

#uzrspjphjw .gt_grand_summary_row {
  color: #333333;
  background-color: #FFFFFF;
  text-transform: inherit;
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
}

#uzrspjphjw .gt_first_grand_summary_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  border-top-style: double;
  border-top-width: 6px;
  border-top-color: #D3D3D3;
}

#uzrspjphjw .gt_striped {
  background-color: rgba(128, 128, 128, 0.05);
}

#uzrspjphjw .gt_table_body {
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
}

#uzrspjphjw .gt_footnotes {
  color: #333333;
  background-color: #FFFFFF;
  border-bottom-style: none;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
}

#uzrspjphjw .gt_footnote {
  margin: 0px;
  font-size: 90%;
  padding: 4px;
}

#uzrspjphjw .gt_sourcenotes {
  color: #333333;
  background-color: #FFFFFF;
  border-bottom-style: none;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
}

#uzrspjphjw .gt_sourcenote {
  font-size: 90%;
  padding: 4px;
}

#uzrspjphjw .gt_left {
  text-align: left;
}

#uzrspjphjw .gt_center {
  text-align: center;
}

#uzrspjphjw .gt_right {
  text-align: right;
  font-variant-numeric: tabular-nums;
}

#uzrspjphjw .gt_font_normal {
  font-weight: normal;
}

#uzrspjphjw .gt_font_bold {
  font-weight: bold;
}

#uzrspjphjw .gt_font_italic {
  font-style: italic;
}

#uzrspjphjw .gt_super {
  font-size: 65%;
}

#uzrspjphjw .gt_footnote_marks {
  font-style: italic;
  font-size: 65%;
}
&lt;/style&gt;
&lt;div id=&#34;uzrspjphjw&#34; style=&#34;overflow-x:auto;overflow-y:auto;width:auto;height:auto;&#34;&gt;&lt;table class=&#34;gt_table&#34;&gt;
  &lt;thead class=&#34;gt_header&#34;&gt;
    &lt;tr&gt;
      &lt;th colspan=&#34;6&#34; class=&#34;gt_heading gt_title gt_font_normal&#34; style=&#34;font-family: Anton; font-size: xx-large; text-align: left;&#34;&gt;Most &lt;span style=&#34;color:#F2CB05&#34;&gt;Successful&lt;/span&gt; Dancing With the Stars &lt;i&gt;&#34;Professions&#34;&lt;/i&gt;&lt;/th&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th colspan=&#34;6&#34; class=&#34;gt_heading gt_subtitle gt_font_normal gt_bottom_border&#34; style=&#34;font-family: Caveat; font-size: x-large; text-align: left;&#34;&gt;&lt;span style = &#39;color: grey&#39;&gt;Covering Seasons 1 to 29 (excluding All-Star Season 15)&lt;/span&gt;&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;thead class=&#34;gt_col_headings&#34;&gt;
    &lt;tr&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_left&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;known_for&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;num_stars&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_right&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;pct_1st_place&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_right&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;pct_2nd_place&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_right&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;pct_3rd_place&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_right&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;pct_last_place&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody class=&#34;gt_table_body&#34;&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Athlete&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;79&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.13924051&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.10126582&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.06329114&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.10126582&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Musician&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;38&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07894737&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.10526316&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.15789474&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.13157895&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Actor/Actress&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;130&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07692308&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.09230769&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.06923077&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.04615385&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Reality TV Star&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;26&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07692308&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07692308&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07692308&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.03846154&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Model&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;14&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07142857&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07142857&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07142857&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Media Personality&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;21&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.04761905&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.04761905&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.04761905&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.23809524&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Businessperson&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;5&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.20000000&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Entertainer&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;5&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.20000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.40000000&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Other&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;9&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.44444444&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
  
  
&lt;/table&gt;&lt;/div&gt;
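&lt;p&gt;If the goal were to change the font of the whole table rather than a single location, &lt;code&gt;gt&lt;/code&gt; also offers &lt;code&gt;opt_table_font()&lt;/code&gt;, which accepts the same &lt;code&gt;google_font()&lt;/code&gt; helper. A sketch (not used in this post):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Sketch only: set a Google Font for the entire table at once
g4 %&amp;gt;%
  opt_table_font(font = google_font(&amp;quot;Caveat&amp;quot;))&lt;/code&gt;&lt;/pre&gt;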
&lt;/div&gt;
&lt;div id=&#34;step-5-adding-a-spanner-column&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Step 5: Adding a Spanner Column&lt;/h3&gt;
&lt;p&gt;A &lt;em&gt;spanner column&lt;/em&gt; is a single column header merged across several columns. It is added with the &lt;code&gt;tab_spanner()&lt;/code&gt; function:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;(g5 &amp;lt;- g4 %&amp;gt;% 
  tab_spanner(
    label = &amp;quot;Distribution of Results&amp;quot;,
    columns = 3:6
  )
)&lt;/code&gt;&lt;/pre&gt;
&lt;style&gt;@import url(&#34;https://fonts.googleapis.com/css2?family=Caveat:ital,wght@0,100;0,200;0,300;0,400;0,500;0,600;0,700;0,800;0,900;1,100;1,200;1,300;1,400;1,500;1,600;1,700;1,800;1,900&amp;display=swap&#34;);
@import url(&#34;https://fonts.googleapis.com/css2?family=Anton:ital,wght@0,100;0,200;0,300;0,400;0,500;0,600;0,700;0,800;0,900;1,100;1,200;1,300;1,400;1,500;1,600;1,700;1,800;1,900&amp;display=swap&#34;);
html {
  font-family: -apple-system, BlinkMacSystemFont, &#39;Segoe UI&#39;, Roboto, Oxygen, Ubuntu, Cantarell, &#39;Helvetica Neue&#39;, &#39;Fira Sans&#39;, &#39;Droid Sans&#39;, Arial, sans-serif;
}

#obxnxsqdal .gt_table {
  display: table;
  border-collapse: collapse;
  margin-left: auto;
  margin-right: auto;
  color: #333333;
  font-size: 16px;
  font-weight: normal;
  font-style: normal;
  background-color: #FFFFFF;
  width: auto;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #A8A8A8;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #A8A8A8;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
}

#obxnxsqdal .gt_heading {
  background-color: #FFFFFF;
  text-align: center;
  border-bottom-color: #FFFFFF;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
}

#obxnxsqdal .gt_title {
  color: #333333;
  font-size: 125%;
  font-weight: initial;
  padding-top: 4px;
  padding-bottom: 4px;
  border-bottom-color: #FFFFFF;
  border-bottom-width: 0;
}

#obxnxsqdal .gt_subtitle {
  color: #333333;
  font-size: 85%;
  font-weight: initial;
  padding-top: 0;
  padding-bottom: 4px;
  border-top-color: #FFFFFF;
  border-top-width: 0;
}

#obxnxsqdal .gt_bottom_border {
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
}

#obxnxsqdal .gt_col_headings {
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
}

#obxnxsqdal .gt_col_heading {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: normal;
  text-transform: inherit;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: bottom;
  padding-top: 5px;
  padding-bottom: 6px;
  padding-left: 5px;
  padding-right: 5px;
  overflow-x: hidden;
}

#obxnxsqdal .gt_column_spanner_outer {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: normal;
  text-transform: inherit;
  padding-top: 0;
  padding-bottom: 0;
  padding-left: 4px;
  padding-right: 4px;
}

#obxnxsqdal .gt_column_spanner_outer:first-child {
  padding-left: 0;
}

#obxnxsqdal .gt_column_spanner_outer:last-child {
  padding-right: 0;
}

#obxnxsqdal .gt_column_spanner {
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  vertical-align: bottom;
  padding-top: 5px;
  padding-bottom: 6px;
  overflow-x: hidden;
  display: inline-block;
  width: 100%;
}

#obxnxsqdal .gt_group_heading {
  padding: 8px;
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  text-transform: inherit;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: middle;
}

#obxnxsqdal .gt_empty_group_heading {
  padding: 0.5px;
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  vertical-align: middle;
}

#obxnxsqdal .gt_from_md &gt; :first-child {
  margin-top: 0;
}

#obxnxsqdal .gt_from_md &gt; :last-child {
  margin-bottom: 0;
}

#obxnxsqdal .gt_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  margin: 10px;
  border-top-style: solid;
  border-top-width: 1px;
  border-top-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: middle;
  overflow-x: hidden;
}

#obxnxsqdal .gt_stub {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  text-transform: inherit;
  border-right-style: solid;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
  padding-left: 12px;
}

#obxnxsqdal .gt_summary_row {
  color: #333333;
  background-color: #FFFFFF;
  text-transform: inherit;
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
}

#obxnxsqdal .gt_first_summary_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
}

#obxnxsqdal .gt_grand_summary_row {
  color: #333333;
  background-color: #FFFFFF;
  text-transform: inherit;
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
}

#obxnxsqdal .gt_first_grand_summary_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  border-top-style: double;
  border-top-width: 6px;
  border-top-color: #D3D3D3;
}

#obxnxsqdal .gt_striped {
  background-color: rgba(128, 128, 128, 0.05);
}

#obxnxsqdal .gt_table_body {
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
}

#obxnxsqdal .gt_footnotes {
  color: #333333;
  background-color: #FFFFFF;
  border-bottom-style: none;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
}

#obxnxsqdal .gt_footnote {
  margin: 0px;
  font-size: 90%;
  padding: 4px;
}

#obxnxsqdal .gt_sourcenotes {
  color: #333333;
  background-color: #FFFFFF;
  border-bottom-style: none;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
}

#obxnxsqdal .gt_sourcenote {
  font-size: 90%;
  padding: 4px;
}

#obxnxsqdal .gt_left {
  text-align: left;
}

#obxnxsqdal .gt_center {
  text-align: center;
}

#obxnxsqdal .gt_right {
  text-align: right;
  font-variant-numeric: tabular-nums;
}

#obxnxsqdal .gt_font_normal {
  font-weight: normal;
}

#obxnxsqdal .gt_font_bold {
  font-weight: bold;
}

#obxnxsqdal .gt_font_italic {
  font-style: italic;
}

#obxnxsqdal .gt_super {
  font-size: 65%;
}

#obxnxsqdal .gt_footnote_marks {
  font-style: italic;
  font-size: 65%;
}
&lt;/style&gt;
&lt;div id=&#34;obxnxsqdal&#34; style=&#34;overflow-x:auto;overflow-y:auto;width:auto;height:auto;&#34;&gt;&lt;table class=&#34;gt_table&#34;&gt;
  &lt;thead class=&#34;gt_header&#34;&gt;
    &lt;tr&gt;
      &lt;th colspan=&#34;6&#34; class=&#34;gt_heading gt_title gt_font_normal&#34; style=&#34;font-family: Anton; font-size: xx-large; text-align: left;&#34;&gt;Most &lt;span style=&#34;color:#F2CB05&#34;&gt;Successful&lt;/span&gt; Dancing With the Stars &lt;i&gt;&#34;Professions&#34;&lt;/i&gt;&lt;/th&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th colspan=&#34;6&#34; class=&#34;gt_heading gt_subtitle gt_font_normal gt_bottom_border&#34; style=&#34;font-family: Caveat; font-size: x-large; text-align: left;&#34;&gt;&lt;span style = &#39;color: grey&#39;&gt;Covering Seasons 1 to 29 (excluding All-Star Season 15)&lt;/span&gt;&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;thead class=&#34;gt_col_headings&#34;&gt;
    &lt;tr&gt;
      &lt;th class=&#34;gt_col_heading gt_center gt_columns_bottom_border&#34; rowspan=&#34;2&#34; colspan=&#34;1&#34;&gt;known_for&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_center gt_columns_bottom_border&#34; rowspan=&#34;2&#34; colspan=&#34;1&#34;&gt;num_stars&lt;/th&gt;
      &lt;th class=&#34;gt_center gt_columns_top_border gt_column_spanner_outer&#34; rowspan=&#34;1&#34; colspan=&#34;4&#34;&gt;
        &lt;span class=&#34;gt_column_spanner&#34;&gt;Distribution of Results&lt;/span&gt;
      &lt;/th&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;pct_1st_place&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;pct_2nd_place&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;pct_3rd_place&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;pct_last_place&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody class=&#34;gt_table_body&#34;&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Athlete&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;79&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.13924051&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.10126582&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.06329114&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.10126582&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Musician&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;38&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07894737&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.10526316&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.15789474&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.13157895&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Actor/Actress&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;130&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07692308&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.09230769&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.06923077&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.04615385&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Reality TV Star&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;26&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07692308&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07692308&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07692308&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.03846154&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Model&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;14&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07142857&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07142857&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07142857&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Media Personality&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;21&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.04761905&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.04761905&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.04761905&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.23809524&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Businessperson&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;5&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.20000000&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Entertainer&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;5&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.20000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.40000000&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Other&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;9&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.44444444&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
  
  
&lt;/table&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;step-6-styling-the-spanner&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Step 6: Styling the Spanner&lt;/h3&gt;
&lt;p&gt;As with the title and subtitle, we can use &lt;code&gt;tab_style()&lt;/code&gt; to style the spanner, targeting it with the &lt;code&gt;cells_column_spanners()&lt;/code&gt; location helper.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;(g6 &amp;lt;- g5 %&amp;gt;% 
  tab_style(
    style = cell_text(
      font = google_font(&amp;quot;Courgette&amp;quot;), 
      size = &amp;quot;medium&amp;quot;, 
      weight = &amp;quot;bold&amp;quot;
    ),
    locations = cells_column_spanners(&amp;quot;Distribution of Results&amp;quot;)
  )
)&lt;/code&gt;&lt;/pre&gt;
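&lt;p&gt;&lt;code&gt;cell_text()&lt;/code&gt; accepts other properties beyond font, size, and weight. As a sketch (a variation, not part of the original table), the spanner could also reuse the accent color from the title via the &lt;code&gt;color&lt;/code&gt; argument:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Variation on g6: also color the spanner text (illustrative only)
g6_alt &amp;lt;- g5 %&amp;gt;% 
  tab_style(
    style = cell_text(
      font = google_font(&amp;quot;Courgette&amp;quot;), 
      size = &amp;quot;medium&amp;quot;, 
      weight = &amp;quot;bold&amp;quot;,
      color = &amp;quot;#F2CB05&amp;quot;  # same yellow used in the title
    ),
    locations = cells_column_spanners(&amp;quot;Distribution of Results&amp;quot;)
  )&lt;/code&gt;&lt;/pre&gt;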
&lt;style&gt;@import url(&#34;https://fonts.googleapis.com/css2?family=Courgette:ital,wght@0,100;0,200;0,300;0,400;0,500;0,600;0,700;0,800;0,900;1,100;1,200;1,300;1,400;1,500;1,600;1,700;1,800;1,900&amp;display=swap&#34;);
@import url(&#34;https://fonts.googleapis.com/css2?family=Caveat:ital,wght@0,100;0,200;0,300;0,400;0,500;0,600;0,700;0,800;0,900;1,100;1,200;1,300;1,400;1,500;1,600;1,700;1,800;1,900&amp;display=swap&#34;);
@import url(&#34;https://fonts.googleapis.com/css2?family=Anton:ital,wght@0,100;0,200;0,300;0,400;0,500;0,600;0,700;0,800;0,900;1,100;1,200;1,300;1,400;1,500;1,600;1,700;1,800;1,900&amp;display=swap&#34;);
html {
  font-family: -apple-system, BlinkMacSystemFont, &#39;Segoe UI&#39;, Roboto, Oxygen, Ubuntu, Cantarell, &#39;Helvetica Neue&#39;, &#39;Fira Sans&#39;, &#39;Droid Sans&#39;, Arial, sans-serif;
}

#iinlrtdxox .gt_table {
  display: table;
  border-collapse: collapse;
  margin-left: auto;
  margin-right: auto;
  color: #333333;
  font-size: 16px;
  font-weight: normal;
  font-style: normal;
  background-color: #FFFFFF;
  width: auto;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #A8A8A8;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #A8A8A8;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
}

#iinlrtdxox .gt_heading {
  background-color: #FFFFFF;
  text-align: center;
  border-bottom-color: #FFFFFF;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
}

#iinlrtdxox .gt_title {
  color: #333333;
  font-size: 125%;
  font-weight: initial;
  padding-top: 4px;
  padding-bottom: 4px;
  border-bottom-color: #FFFFFF;
  border-bottom-width: 0;
}

#iinlrtdxox .gt_subtitle {
  color: #333333;
  font-size: 85%;
  font-weight: initial;
  padding-top: 0;
  padding-bottom: 4px;
  border-top-color: #FFFFFF;
  border-top-width: 0;
}

#iinlrtdxox .gt_bottom_border {
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
}

#iinlrtdxox .gt_col_headings {
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
}

#iinlrtdxox .gt_col_heading {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: normal;
  text-transform: inherit;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: bottom;
  padding-top: 5px;
  padding-bottom: 6px;
  padding-left: 5px;
  padding-right: 5px;
  overflow-x: hidden;
}

#iinlrtdxox .gt_column_spanner_outer {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: normal;
  text-transform: inherit;
  padding-top: 0;
  padding-bottom: 0;
  padding-left: 4px;
  padding-right: 4px;
}

#iinlrtdxox .gt_column_spanner_outer:first-child {
  padding-left: 0;
}

#iinlrtdxox .gt_column_spanner_outer:last-child {
  padding-right: 0;
}

#iinlrtdxox .gt_column_spanner {
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  vertical-align: bottom;
  padding-top: 5px;
  padding-bottom: 6px;
  overflow-x: hidden;
  display: inline-block;
  width: 100%;
}

#iinlrtdxox .gt_group_heading {
  padding: 8px;
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  text-transform: inherit;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: middle;
}

#iinlrtdxox .gt_empty_group_heading {
  padding: 0.5px;
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  vertical-align: middle;
}

#iinlrtdxox .gt_from_md &gt; :first-child {
  margin-top: 0;
}

#iinlrtdxox .gt_from_md &gt; :last-child {
  margin-bottom: 0;
}

#iinlrtdxox .gt_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  margin: 10px;
  border-top-style: solid;
  border-top-width: 1px;
  border-top-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: middle;
  overflow-x: hidden;
}

#iinlrtdxox .gt_stub {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  text-transform: inherit;
  border-right-style: solid;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
  padding-left: 12px;
}

#iinlrtdxox .gt_summary_row {
  color: #333333;
  background-color: #FFFFFF;
  text-transform: inherit;
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
}

#iinlrtdxox .gt_first_summary_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
}

#iinlrtdxox .gt_grand_summary_row {
  color: #333333;
  background-color: #FFFFFF;
  text-transform: inherit;
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
}

#iinlrtdxox .gt_first_grand_summary_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  border-top-style: double;
  border-top-width: 6px;
  border-top-color: #D3D3D3;
}

#iinlrtdxox .gt_striped {
  background-color: rgba(128, 128, 128, 0.05);
}

#iinlrtdxox .gt_table_body {
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
}

#iinlrtdxox .gt_footnotes {
  color: #333333;
  background-color: #FFFFFF;
  border-bottom-style: none;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
}

#iinlrtdxox .gt_footnote {
  margin: 0px;
  font-size: 90%;
  padding: 4px;
}

#iinlrtdxox .gt_sourcenotes {
  color: #333333;
  background-color: #FFFFFF;
  border-bottom-style: none;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
}

#iinlrtdxox .gt_sourcenote {
  font-size: 90%;
  padding: 4px;
}

#iinlrtdxox .gt_left {
  text-align: left;
}

#iinlrtdxox .gt_center {
  text-align: center;
}

#iinlrtdxox .gt_right {
  text-align: right;
  font-variant-numeric: tabular-nums;
}

#iinlrtdxox .gt_font_normal {
  font-weight: normal;
}

#iinlrtdxox .gt_font_bold {
  font-weight: bold;
}

#iinlrtdxox .gt_font_italic {
  font-style: italic;
}

#iinlrtdxox .gt_super {
  font-size: 65%;
}

#iinlrtdxox .gt_footnote_marks {
  font-style: italic;
  font-size: 65%;
}
&lt;/style&gt;
&lt;div id=&#34;iinlrtdxox&#34; style=&#34;overflow-x:auto;overflow-y:auto;width:auto;height:auto;&#34;&gt;&lt;table class=&#34;gt_table&#34;&gt;
  &lt;thead class=&#34;gt_header&#34;&gt;
    &lt;tr&gt;
      &lt;th colspan=&#34;6&#34; class=&#34;gt_heading gt_title gt_font_normal&#34; style=&#34;font-family: Anton; font-size: xx-large; text-align: left;&#34;&gt;Most &lt;span style=&#34;color:#F2CB05&#34;&gt;Successful&lt;/span&gt; Dancing With the Stars &lt;i&gt;&#34;Professions&#34;&lt;/i&gt;&lt;/th&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th colspan=&#34;6&#34; class=&#34;gt_heading gt_subtitle gt_font_normal gt_bottom_border&#34; style=&#34;font-family: Caveat; font-size: x-large; text-align: left;&#34;&gt;&lt;span style = &#39;color: grey&#39;&gt;Covering Seasons 1 to 29 (excluding All-Star Season 15)&lt;/span&gt;&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;thead class=&#34;gt_col_headings&#34;&gt;
    &lt;tr&gt;
      &lt;th class=&#34;gt_col_heading gt_center gt_columns_bottom_border&#34; rowspan=&#34;2&#34; colspan=&#34;1&#34;&gt;known_for&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_center gt_columns_bottom_border&#34; rowspan=&#34;2&#34; colspan=&#34;1&#34;&gt;num_stars&lt;/th&gt;
      &lt;th class=&#34;gt_center gt_columns_top_border gt_column_spanner_outer&#34; rowspan=&#34;1&#34; colspan=&#34;4&#34; style=&#34;font-family: Courgette; font-size: medium; font-weight: bold;&#34;&gt;
        &lt;span class=&#34;gt_column_spanner&#34;&gt;Distribution of Results&lt;/span&gt;
      &lt;/th&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;pct_1st_place&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;pct_2nd_place&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;pct_3rd_place&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34;&gt;pct_last_place&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody class=&#34;gt_table_body&#34;&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Athlete&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;79&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.13924051&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.10126582&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.06329114&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.10126582&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Musician&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;38&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07894737&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.10526316&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.15789474&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.13157895&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Actor/Actress&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;130&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07692308&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.09230769&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.06923077&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.04615385&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Reality TV Star&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;26&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07692308&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07692308&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07692308&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.03846154&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Model&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;14&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07142857&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07142857&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07142857&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Media Personality&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;21&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.04761905&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.04761905&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.04761905&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.23809524&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Businessperson&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;5&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.20000000&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Entertainer&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;5&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.20000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.40000000&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34;&gt;Other&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;9&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.44444444&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
  
  
&lt;/table&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;step-7-style-the-column-headers-and-the-profession-column&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Step 7: Style the Column Headers and the Profession Column&lt;/h3&gt;
&lt;p&gt;You can apply the same style to different parts of the table by using a &lt;code&gt;list()&lt;/code&gt; for the &lt;code&gt;locations&lt;/code&gt; argument. Here the style is being applied to all column labels (&lt;code&gt;cells_column_labels(everything())&lt;/code&gt;) and to the values in the first column (&lt;code&gt;cells_body(columns = 1)&lt;/code&gt;).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;(g7 &amp;lt;- g6 %&amp;gt;% 
  tab_style(
    style = cell_text(
      font = google_font(&amp;quot;Secular One&amp;quot;), 
      size = &amp;quot;large&amp;quot;
    ),
    locations = list(
      cells_column_labels(everything()), 
      cells_body(columns = 1)
    )
  )  
)&lt;/code&gt;&lt;/pre&gt;
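&lt;p&gt;The &lt;code&gt;style&lt;/code&gt; argument can likewise be a &lt;code&gt;list()&lt;/code&gt;, combining several style objects in a single call. A minimal sketch (an illustrative extension, not part of the original table):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Combine text and fill styles in one tab_style() call
g7_alt &amp;lt;- g7 %&amp;gt;% 
  tab_style(
    style = list(
      cell_text(weight = &amp;quot;bold&amp;quot;),
      cell_fill(color = &amp;quot;#F2CB05&amp;quot;, alpha = 0.2)
    ),
    locations = cells_body(columns = 1)
  )&lt;/code&gt;&lt;/pre&gt;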
&lt;style&gt;@import url(&#34;https://fonts.googleapis.com/css2?family=Secular+One:ital,wght@0,100;0,200;0,300;0,400;0,500;0,600;0,700;0,800;0,900;1,100;1,200;1,300;1,400;1,500;1,600;1,700;1,800;1,900&amp;display=swap&#34;);
@import url(&#34;https://fonts.googleapis.com/css2?family=Courgette:ital,wght@0,100;0,200;0,300;0,400;0,500;0,600;0,700;0,800;0,900;1,100;1,200;1,300;1,400;1,500;1,600;1,700;1,800;1,900&amp;display=swap&#34;);
@import url(&#34;https://fonts.googleapis.com/css2?family=Caveat:ital,wght@0,100;0,200;0,300;0,400;0,500;0,600;0,700;0,800;0,900;1,100;1,200;1,300;1,400;1,500;1,600;1,700;1,800;1,900&amp;display=swap&#34;);
@import url(&#34;https://fonts.googleapis.com/css2?family=Anton:ital,wght@0,100;0,200;0,300;0,400;0,500;0,600;0,700;0,800;0,900;1,100;1,200;1,300;1,400;1,500;1,600;1,700;1,800;1,900&amp;display=swap&#34;);
html {
  font-family: -apple-system, BlinkMacSystemFont, &#39;Segoe UI&#39;, Roboto, Oxygen, Ubuntu, Cantarell, &#39;Helvetica Neue&#39;, &#39;Fira Sans&#39;, &#39;Droid Sans&#39;, Arial, sans-serif;
}

#kapnldxvew .gt_table {
  display: table;
  border-collapse: collapse;
  margin-left: auto;
  margin-right: auto;
  color: #333333;
  font-size: 16px;
  font-weight: normal;
  font-style: normal;
  background-color: #FFFFFF;
  width: auto;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #A8A8A8;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #A8A8A8;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
}

#kapnldxvew .gt_heading {
  background-color: #FFFFFF;
  text-align: center;
  border-bottom-color: #FFFFFF;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
}

#kapnldxvew .gt_title {
  color: #333333;
  font-size: 125%;
  font-weight: initial;
  padding-top: 4px;
  padding-bottom: 4px;
  border-bottom-color: #FFFFFF;
  border-bottom-width: 0;
}

#kapnldxvew .gt_subtitle {
  color: #333333;
  font-size: 85%;
  font-weight: initial;
  padding-top: 0;
  padding-bottom: 4px;
  border-top-color: #FFFFFF;
  border-top-width: 0;
}

#kapnldxvew .gt_bottom_border {
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
}

#kapnldxvew .gt_col_headings {
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
}

#kapnldxvew .gt_col_heading {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: normal;
  text-transform: inherit;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: bottom;
  padding-top: 5px;
  padding-bottom: 6px;
  padding-left: 5px;
  padding-right: 5px;
  overflow-x: hidden;
}

#kapnldxvew .gt_column_spanner_outer {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: normal;
  text-transform: inherit;
  padding-top: 0;
  padding-bottom: 0;
  padding-left: 4px;
  padding-right: 4px;
}

#kapnldxvew .gt_column_spanner_outer:first-child {
  padding-left: 0;
}

#kapnldxvew .gt_column_spanner_outer:last-child {
  padding-right: 0;
}

#kapnldxvew .gt_column_spanner {
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  vertical-align: bottom;
  padding-top: 5px;
  padding-bottom: 6px;
  overflow-x: hidden;
  display: inline-block;
  width: 100%;
}

#kapnldxvew .gt_group_heading {
  padding: 8px;
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  text-transform: inherit;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: middle;
}

#kapnldxvew .gt_empty_group_heading {
  padding: 0.5px;
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  vertical-align: middle;
}

#kapnldxvew .gt_from_md &gt; :first-child {
  margin-top: 0;
}

#kapnldxvew .gt_from_md &gt; :last-child {
  margin-bottom: 0;
}

#kapnldxvew .gt_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  margin: 10px;
  border-top-style: solid;
  border-top-width: 1px;
  border-top-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: middle;
  overflow-x: hidden;
}

#kapnldxvew .gt_stub {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  text-transform: inherit;
  border-right-style: solid;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
  padding-left: 12px;
}

#kapnldxvew .gt_summary_row {
  color: #333333;
  background-color: #FFFFFF;
  text-transform: inherit;
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
}

#kapnldxvew .gt_first_summary_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
}

#kapnldxvew .gt_grand_summary_row {
  color: #333333;
  background-color: #FFFFFF;
  text-transform: inherit;
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
}

#kapnldxvew .gt_first_grand_summary_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  border-top-style: double;
  border-top-width: 6px;
  border-top-color: #D3D3D3;
}

#kapnldxvew .gt_striped {
  background-color: rgba(128, 128, 128, 0.05);
}

#kapnldxvew .gt_table_body {
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
}

#kapnldxvew .gt_footnotes {
  color: #333333;
  background-color: #FFFFFF;
  border-bottom-style: none;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
}

#kapnldxvew .gt_footnote {
  margin: 0px;
  font-size: 90%;
  padding: 4px;
}

#kapnldxvew .gt_sourcenotes {
  color: #333333;
  background-color: #FFFFFF;
  border-bottom-style: none;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
}

#kapnldxvew .gt_sourcenote {
  font-size: 90%;
  padding: 4px;
}

#kapnldxvew .gt_left {
  text-align: left;
}

#kapnldxvew .gt_center {
  text-align: center;
}

#kapnldxvew .gt_right {
  text-align: right;
  font-variant-numeric: tabular-nums;
}

#kapnldxvew .gt_font_normal {
  font-weight: normal;
}

#kapnldxvew .gt_font_bold {
  font-weight: bold;
}

#kapnldxvew .gt_font_italic {
  font-style: italic;
}

#kapnldxvew .gt_super {
  font-size: 65%;
}

#kapnldxvew .gt_footnote_marks {
  font-style: italic;
  font-size: 65%;
}
&lt;/style&gt;
&lt;div id=&#34;kapnldxvew&#34; style=&#34;overflow-x:auto;overflow-y:auto;width:auto;height:auto;&#34;&gt;&lt;table class=&#34;gt_table&#34;&gt;
  &lt;thead class=&#34;gt_header&#34;&gt;
    &lt;tr&gt;
      &lt;th colspan=&#34;6&#34; class=&#34;gt_heading gt_title gt_font_normal&#34; style=&#34;font-family: Anton; font-size: xx-large; text-align: left;&#34;&gt;Most &lt;span style=&#34;color:#F2CB05&#34;&gt;Successful&lt;/span&gt; Dancing With the Stars &lt;i&gt;&#34;Professions&#34;&lt;/i&gt;&lt;/th&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th colspan=&#34;6&#34; class=&#34;gt_heading gt_subtitle gt_font_normal gt_bottom_border&#34; style=&#34;font-family: Caveat; font-size: x-large; text-align: left;&#34;&gt;&lt;span style = &#39;color: grey&#39;&gt;Covering Seasons 1 to 29 (excluding All-Star Season 15)&lt;/span&gt;&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;thead class=&#34;gt_col_headings&#34;&gt;
    &lt;tr&gt;
      &lt;th class=&#34;gt_col_heading gt_center gt_columns_bottom_border&#34; rowspan=&#34;2&#34; colspan=&#34;1&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;known_for&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_center gt_columns_bottom_border&#34; rowspan=&#34;2&#34; colspan=&#34;1&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;num_stars&lt;/th&gt;
      &lt;th class=&#34;gt_center gt_columns_top_border gt_column_spanner_outer&#34; rowspan=&#34;1&#34; colspan=&#34;4&#34; style=&#34;font-family: Courgette; font-size: medium; font-weight: bold;&#34;&gt;
        &lt;span class=&#34;gt_column_spanner&#34;&gt;Distribution of Results&lt;/span&gt;
      &lt;/th&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;pct_1st_place&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;pct_2nd_place&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;pct_3rd_place&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;pct_last_place&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody class=&#34;gt_table_body&#34;&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Athlete&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;79&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.13924051&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.10126582&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.06329114&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.10126582&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Musician&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;38&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07894737&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.10526316&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.15789474&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.13157895&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Actor/Actress&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;130&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07692308&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.09230769&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.06923077&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.04615385&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Reality TV Star&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;26&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07692308&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07692308&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07692308&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.03846154&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Model&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;14&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07142857&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07142857&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.07142857&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Media Personality&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;21&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.04761905&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.04761905&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.04761905&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.23809524&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Businessperson&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;5&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.20000000&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Entertainer&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;5&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.20000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.40000000&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Other&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;9&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.44444444&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34;&gt;0.00000000&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
  
  
&lt;/table&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;step-8-styling-the-cells&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Step 8: Styling the cells&lt;/h3&gt;
&lt;p&gt;This step uses &lt;code&gt;tab_style()&lt;/code&gt; with a &lt;code&gt;cells_body()&lt;/code&gt; location to apply the Spartan font, a medium size, and center alignment to the body cells of the 2nd through 6th columns.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;(g8 &amp;lt;- g7 %&amp;gt;% 
  tab_style(
    style = cell_text(
      font = google_font(&amp;quot;Spartan&amp;quot;), 
      size = &amp;quot;medium&amp;quot;,
      align = &amp;#39;center&amp;#39;
    ),
    locations = cells_body(columns = 2:6)
  )
)&lt;/code&gt;&lt;/pre&gt;
&lt;style&gt;@import url(&#34;https://fonts.googleapis.com/css2?family=Spartan:ital,wght@0,100;0,200;0,300;0,400;0,500;0,600;0,700;0,800;0,900;1,100;1,200;1,300;1,400;1,500;1,600;1,700;1,800;1,900&amp;display=swap&#34;);
@import url(&#34;https://fonts.googleapis.com/css2?family=Secular+One:ital,wght@0,100;0,200;0,300;0,400;0,500;0,600;0,700;0,800;0,900;1,100;1,200;1,300;1,400;1,500;1,600;1,700;1,800;1,900&amp;display=swap&#34;);
@import url(&#34;https://fonts.googleapis.com/css2?family=Courgette:ital,wght@0,100;0,200;0,300;0,400;0,500;0,600;0,700;0,800;0,900;1,100;1,200;1,300;1,400;1,500;1,600;1,700;1,800;1,900&amp;display=swap&#34;);
@import url(&#34;https://fonts.googleapis.com/css2?family=Caveat:ital,wght@0,100;0,200;0,300;0,400;0,500;0,600;0,700;0,800;0,900;1,100;1,200;1,300;1,400;1,500;1,600;1,700;1,800;1,900&amp;display=swap&#34;);
@import url(&#34;https://fonts.googleapis.com/css2?family=Anton:ital,wght@0,100;0,200;0,300;0,400;0,500;0,600;0,700;0,800;0,900;1,100;1,200;1,300;1,400;1,500;1,600;1,700;1,800;1,900&amp;display=swap&#34;);
html {
  font-family: -apple-system, BlinkMacSystemFont, &#39;Segoe UI&#39;, Roboto, Oxygen, Ubuntu, Cantarell, &#39;Helvetica Neue&#39;, &#39;Fira Sans&#39;, &#39;Droid Sans&#39;, Arial, sans-serif;
}

#aogymjeevw .gt_table {
  display: table;
  border-collapse: collapse;
  margin-left: auto;
  margin-right: auto;
  color: #333333;
  font-size: 16px;
  font-weight: normal;
  font-style: normal;
  background-color: #FFFFFF;
  width: auto;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #A8A8A8;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #A8A8A8;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
}

#aogymjeevw .gt_heading {
  background-color: #FFFFFF;
  text-align: center;
  border-bottom-color: #FFFFFF;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
}

#aogymjeevw .gt_title {
  color: #333333;
  font-size: 125%;
  font-weight: initial;
  padding-top: 4px;
  padding-bottom: 4px;
  border-bottom-color: #FFFFFF;
  border-bottom-width: 0;
}

#aogymjeevw .gt_subtitle {
  color: #333333;
  font-size: 85%;
  font-weight: initial;
  padding-top: 0;
  padding-bottom: 4px;
  border-top-color: #FFFFFF;
  border-top-width: 0;
}

#aogymjeevw .gt_bottom_border {
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
}

#aogymjeevw .gt_col_headings {
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
}

#aogymjeevw .gt_col_heading {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: normal;
  text-transform: inherit;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: bottom;
  padding-top: 5px;
  padding-bottom: 6px;
  padding-left: 5px;
  padding-right: 5px;
  overflow-x: hidden;
}

#aogymjeevw .gt_column_spanner_outer {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: normal;
  text-transform: inherit;
  padding-top: 0;
  padding-bottom: 0;
  padding-left: 4px;
  padding-right: 4px;
}

#aogymjeevw .gt_column_spanner_outer:first-child {
  padding-left: 0;
}

#aogymjeevw .gt_column_spanner_outer:last-child {
  padding-right: 0;
}

#aogymjeevw .gt_column_spanner {
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  vertical-align: bottom;
  padding-top: 5px;
  padding-bottom: 6px;
  overflow-x: hidden;
  display: inline-block;
  width: 100%;
}

#aogymjeevw .gt_group_heading {
  padding: 8px;
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  text-transform: inherit;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: middle;
}

#aogymjeevw .gt_empty_group_heading {
  padding: 0.5px;
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  vertical-align: middle;
}

#aogymjeevw .gt_from_md &gt; :first-child {
  margin-top: 0;
}

#aogymjeevw .gt_from_md &gt; :last-child {
  margin-bottom: 0;
}

#aogymjeevw .gt_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  margin: 10px;
  border-top-style: solid;
  border-top-width: 1px;
  border-top-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: middle;
  overflow-x: hidden;
}

#aogymjeevw .gt_stub {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  text-transform: inherit;
  border-right-style: solid;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
  padding-left: 12px;
}

#aogymjeevw .gt_summary_row {
  color: #333333;
  background-color: #FFFFFF;
  text-transform: inherit;
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
}

#aogymjeevw .gt_first_summary_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
}

#aogymjeevw .gt_grand_summary_row {
  color: #333333;
  background-color: #FFFFFF;
  text-transform: inherit;
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
}

#aogymjeevw .gt_first_grand_summary_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  border-top-style: double;
  border-top-width: 6px;
  border-top-color: #D3D3D3;
}

#aogymjeevw .gt_striped {
  background-color: rgba(128, 128, 128, 0.05);
}

#aogymjeevw .gt_table_body {
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
}

#aogymjeevw .gt_footnotes {
  color: #333333;
  background-color: #FFFFFF;
  border-bottom-style: none;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
}

#aogymjeevw .gt_footnote {
  margin: 0px;
  font-size: 90%;
  padding: 4px;
}

#aogymjeevw .gt_sourcenotes {
  color: #333333;
  background-color: #FFFFFF;
  border-bottom-style: none;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
}

#aogymjeevw .gt_sourcenote {
  font-size: 90%;
  padding: 4px;
}

#aogymjeevw .gt_left {
  text-align: left;
}

#aogymjeevw .gt_center {
  text-align: center;
}

#aogymjeevw .gt_right {
  text-align: right;
  font-variant-numeric: tabular-nums;
}

#aogymjeevw .gt_font_normal {
  font-weight: normal;
}

#aogymjeevw .gt_font_bold {
  font-weight: bold;
}

#aogymjeevw .gt_font_italic {
  font-style: italic;
}

#aogymjeevw .gt_super {
  font-size: 65%;
}

#aogymjeevw .gt_footnote_marks {
  font-style: italic;
  font-size: 65%;
}
&lt;/style&gt;
&lt;div id=&#34;aogymjeevw&#34; style=&#34;overflow-x:auto;overflow-y:auto;width:auto;height:auto;&#34;&gt;&lt;table class=&#34;gt_table&#34;&gt;
  &lt;thead class=&#34;gt_header&#34;&gt;
    &lt;tr&gt;
      &lt;th colspan=&#34;6&#34; class=&#34;gt_heading gt_title gt_font_normal&#34; style=&#34;font-family: Anton; font-size: xx-large; text-align: left;&#34;&gt;Most &lt;span style=&#34;color:#F2CB05&#34;&gt;Successful&lt;/span&gt; Dancing With the Stars &lt;i&gt;&#34;Professions&#34;&lt;/i&gt;&lt;/th&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th colspan=&#34;6&#34; class=&#34;gt_heading gt_subtitle gt_font_normal gt_bottom_border&#34; style=&#34;font-family: Caveat; font-size: x-large; text-align: left;&#34;&gt;&lt;span style = &#39;color: grey&#39;&gt;Covering Seasons 1 to 29 (excluding All-Star Season 15)&lt;/span&gt;&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;thead class=&#34;gt_col_headings&#34;&gt;
    &lt;tr&gt;
      &lt;th class=&#34;gt_col_heading gt_center gt_columns_bottom_border&#34; rowspan=&#34;2&#34; colspan=&#34;1&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;known_for&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_center gt_columns_bottom_border&#34; rowspan=&#34;2&#34; colspan=&#34;1&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;num_stars&lt;/th&gt;
      &lt;th class=&#34;gt_center gt_columns_top_border gt_column_spanner_outer&#34; rowspan=&#34;1&#34; colspan=&#34;4&#34; style=&#34;font-family: Courgette; font-size: medium; font-weight: bold;&#34;&gt;
        &lt;span class=&#34;gt_column_spanner&#34;&gt;Distribution of Results&lt;/span&gt;
      &lt;/th&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;pct_1st_place&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;pct_2nd_place&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;pct_3rd_place&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;pct_last_place&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody class=&#34;gt_table_body&#34;&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Athlete&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;79&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0.13924051&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0.10126582&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0.06329114&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0.10126582&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Musician&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;38&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0.07894737&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0.10526316&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0.15789474&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0.13157895&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Actor/Actress&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;130&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0.07692308&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0.09230769&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0.06923077&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0.04615385&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Reality TV Star&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;26&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0.07692308&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0.07692308&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0.07692308&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0.03846154&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Model&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;14&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0.07142857&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0.07142857&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0.07142857&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Media Personality&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;21&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0.04761905&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0.04761905&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0.04761905&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0.23809524&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Businessperson&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;5&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0.20000000&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Entertainer&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;5&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0.20000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0.40000000&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Other&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;9&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0.00000000&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0.44444444&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0.00000000&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
  
  
&lt;/table&gt;&lt;/div&gt;
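&lt;p&gt;A side note on &lt;code&gt;cells_body()&lt;/code&gt;: it also accepts a &lt;code&gt;rows&lt;/code&gt; argument, so the same pattern can style a subset of rows rather than whole columns. The snippet below is a hypothetical sketch, not a step in building this table:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Hypothetical: bold any row whose 1st-place rate exceeds 10%
g8 %&amp;gt;%
  tab_style(
    style = cell_text(weight = &amp;#39;bold&amp;#39;),
    locations = cells_body(rows = pct_1st_place &amp;gt; 0.1)
  )&lt;/code&gt;&lt;/pre&gt;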
&lt;/div&gt;
&lt;div id=&#34;step-9-turn-cell-decimals-to-percentages&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Step 9: Turn Cell Decimals to Percentages&lt;/h3&gt;
&lt;p&gt;There are a number of &lt;code&gt;fmt_*&lt;/code&gt; functions for formatting cell values. Here the &lt;code&gt;fmt_percent&lt;/code&gt; function applies a percent format to all the columns beginning with “pct_”. While this is the first instance of using tidyselect syntax to tell &lt;code&gt;gt&lt;/code&gt; which columns to use, the package also accepts column names or column numbers.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;(g9 &amp;lt;- g8 %&amp;gt;% 
  fmt_percent(
    columns = starts_with(&amp;#39;pct&amp;#39;),
    decimals = 1,
    drop_trailing_zeros = TRUE
  )
 )&lt;/code&gt;&lt;/pre&gt;
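&lt;p&gt;As a quick illustration of those other selection styles, the hypothetical sketch below (not run against this table) shows three equivalent ways of pointing &lt;code&gt;fmt_percent()&lt;/code&gt; at the same four columns:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# 1. tidyselect helper (as used above)
g8 %&amp;gt;% fmt_percent(columns = starts_with(&amp;#39;pct&amp;#39;), decimals = 1)

# 2. explicit column names
g8 %&amp;gt;% fmt_percent(columns = c(pct_1st_place, pct_2nd_place,
                               pct_3rd_place, pct_last_place), decimals = 1)

# 3. column positions (columns 3 through 6 in this table)
g8 %&amp;gt;% fmt_percent(columns = 3:6, decimals = 1)&lt;/code&gt;&lt;/pre&gt;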
&lt;style&gt;@import url(&#34;https://fonts.googleapis.com/css2?family=Spartan:ital,wght@0,100;0,200;0,300;0,400;0,500;0,600;0,700;0,800;0,900;1,100;1,200;1,300;1,400;1,500;1,600;1,700;1,800;1,900&amp;display=swap&#34;);
@import url(&#34;https://fonts.googleapis.com/css2?family=Secular+One:ital,wght@0,100;0,200;0,300;0,400;0,500;0,600;0,700;0,800;0,900;1,100;1,200;1,300;1,400;1,500;1,600;1,700;1,800;1,900&amp;display=swap&#34;);
@import url(&#34;https://fonts.googleapis.com/css2?family=Courgette:ital,wght@0,100;0,200;0,300;0,400;0,500;0,600;0,700;0,800;0,900;1,100;1,200;1,300;1,400;1,500;1,600;1,700;1,800;1,900&amp;display=swap&#34;);
@import url(&#34;https://fonts.googleapis.com/css2?family=Caveat:ital,wght@0,100;0,200;0,300;0,400;0,500;0,600;0,700;0,800;0,900;1,100;1,200;1,300;1,400;1,500;1,600;1,700;1,800;1,900&amp;display=swap&#34;);
@import url(&#34;https://fonts.googleapis.com/css2?family=Anton:ital,wght@0,100;0,200;0,300;0,400;0,500;0,600;0,700;0,800;0,900;1,100;1,200;1,300;1,400;1,500;1,600;1,700;1,800;1,900&amp;display=swap&#34;);
html {
  font-family: -apple-system, BlinkMacSystemFont, &#39;Segoe UI&#39;, Roboto, Oxygen, Ubuntu, Cantarell, &#39;Helvetica Neue&#39;, &#39;Fira Sans&#39;, &#39;Droid Sans&#39;, Arial, sans-serif;
}

#wkglpgwfey .gt_table {
  display: table;
  border-collapse: collapse;
  margin-left: auto;
  margin-right: auto;
  color: #333333;
  font-size: 16px;
  font-weight: normal;
  font-style: normal;
  background-color: #FFFFFF;
  width: auto;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #A8A8A8;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #A8A8A8;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
}

#wkglpgwfey .gt_heading {
  background-color: #FFFFFF;
  text-align: center;
  border-bottom-color: #FFFFFF;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
}

#wkglpgwfey .gt_title {
  color: #333333;
  font-size: 125%;
  font-weight: initial;
  padding-top: 4px;
  padding-bottom: 4px;
  border-bottom-color: #FFFFFF;
  border-bottom-width: 0;
}

#wkglpgwfey .gt_subtitle {
  color: #333333;
  font-size: 85%;
  font-weight: initial;
  padding-top: 0;
  padding-bottom: 4px;
  border-top-color: #FFFFFF;
  border-top-width: 0;
}

#wkglpgwfey .gt_bottom_border {
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
}

#wkglpgwfey .gt_col_headings {
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
}

#wkglpgwfey .gt_col_heading {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: normal;
  text-transform: inherit;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: bottom;
  padding-top: 5px;
  padding-bottom: 6px;
  padding-left: 5px;
  padding-right: 5px;
  overflow-x: hidden;
}

#wkglpgwfey .gt_column_spanner_outer {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: normal;
  text-transform: inherit;
  padding-top: 0;
  padding-bottom: 0;
  padding-left: 4px;
  padding-right: 4px;
}

#wkglpgwfey .gt_column_spanner_outer:first-child {
  padding-left: 0;
}

#wkglpgwfey .gt_column_spanner_outer:last-child {
  padding-right: 0;
}

#wkglpgwfey .gt_column_spanner {
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  vertical-align: bottom;
  padding-top: 5px;
  padding-bottom: 6px;
  overflow-x: hidden;
  display: inline-block;
  width: 100%;
}

#wkglpgwfey .gt_group_heading {
  padding: 8px;
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  text-transform: inherit;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: middle;
}

#wkglpgwfey .gt_empty_group_heading {
  padding: 0.5px;
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  vertical-align: middle;
}

#wkglpgwfey .gt_from_md &gt; :first-child {
  margin-top: 0;
}

#wkglpgwfey .gt_from_md &gt; :last-child {
  margin-bottom: 0;
}

#wkglpgwfey .gt_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  margin: 10px;
  border-top-style: solid;
  border-top-width: 1px;
  border-top-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: middle;
  overflow-x: hidden;
}

#wkglpgwfey .gt_stub {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  text-transform: inherit;
  border-right-style: solid;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
  padding-left: 12px;
}

#wkglpgwfey .gt_summary_row {
  color: #333333;
  background-color: #FFFFFF;
  text-transform: inherit;
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
}

#wkglpgwfey .gt_first_summary_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
}

#wkglpgwfey .gt_grand_summary_row {
  color: #333333;
  background-color: #FFFFFF;
  text-transform: inherit;
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
}

#wkglpgwfey .gt_first_grand_summary_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  border-top-style: double;
  border-top-width: 6px;
  border-top-color: #D3D3D3;
}

#wkglpgwfey .gt_striped {
  background-color: rgba(128, 128, 128, 0.05);
}

#wkglpgwfey .gt_table_body {
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
}

#wkglpgwfey .gt_footnotes {
  color: #333333;
  background-color: #FFFFFF;
  border-bottom-style: none;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
}

#wkglpgwfey .gt_footnote {
  margin: 0px;
  font-size: 90%;
  padding: 4px;
}

#wkglpgwfey .gt_sourcenotes {
  color: #333333;
  background-color: #FFFFFF;
  border-bottom-style: none;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
}

#wkglpgwfey .gt_sourcenote {
  font-size: 90%;
  padding: 4px;
}

#wkglpgwfey .gt_left {
  text-align: left;
}

#wkglpgwfey .gt_center {
  text-align: center;
}

#wkglpgwfey .gt_right {
  text-align: right;
  font-variant-numeric: tabular-nums;
}

#wkglpgwfey .gt_font_normal {
  font-weight: normal;
}

#wkglpgwfey .gt_font_bold {
  font-weight: bold;
}

#wkglpgwfey .gt_font_italic {
  font-style: italic;
}

#wkglpgwfey .gt_super {
  font-size: 65%;
}

#wkglpgwfey .gt_footnote_marks {
  font-style: italic;
  font-size: 65%;
}
&lt;/style&gt;
&lt;div id=&#34;wkglpgwfey&#34; style=&#34;overflow-x:auto;overflow-y:auto;width:auto;height:auto;&#34;&gt;&lt;table class=&#34;gt_table&#34;&gt;
  &lt;thead class=&#34;gt_header&#34;&gt;
    &lt;tr&gt;
      &lt;th colspan=&#34;6&#34; class=&#34;gt_heading gt_title gt_font_normal&#34; style=&#34;font-family: Anton; font-size: xx-large; text-align: left;&#34;&gt;Most &lt;span style=&#34;color:#F2CB05&#34;&gt;Successful&lt;/span&gt; Dancing With the Stars &lt;i&gt;&#34;Professions&#34;&lt;/i&gt;&lt;/th&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th colspan=&#34;6&#34; class=&#34;gt_heading gt_subtitle gt_font_normal gt_bottom_border&#34; style=&#34;font-family: Caveat; font-size: x-large; text-align: left;&#34;&gt;&lt;span style = &#39;color: grey&#39;&gt;Covering Seasons 1 to 29 (excluding All-Star Season 15)&lt;/span&gt;&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;thead class=&#34;gt_col_headings&#34;&gt;
    &lt;tr&gt;
      &lt;th class=&#34;gt_col_heading gt_center gt_columns_bottom_border&#34; rowspan=&#34;2&#34; colspan=&#34;1&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;known_for&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_center gt_columns_bottom_border&#34; rowspan=&#34;2&#34; colspan=&#34;1&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;num_stars&lt;/th&gt;
      &lt;th class=&#34;gt_center gt_columns_top_border gt_column_spanner_outer&#34; rowspan=&#34;1&#34; colspan=&#34;4&#34; style=&#34;font-family: Courgette; font-size: medium; font-weight: bold;&#34;&gt;
        &lt;span class=&#34;gt_column_spanner&#34;&gt;Distribution of Results&lt;/span&gt;
      &lt;/th&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;pct_1st_place&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;pct_2nd_place&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;pct_3rd_place&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;pct_last_place&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody class=&#34;gt_table_body&#34;&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Athlete&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;79&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;13.9&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;10.1&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;6.3&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;10.1&amp;percnt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Musician&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;38&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;7.9&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;10.5&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;15.8&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;13.2&amp;percnt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Actor/Actress&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;130&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;7.7&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;9.2&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;6.9&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;4.6&amp;percnt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Reality TV Star&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;26&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;7.7&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;7.7&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;7.7&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;3.8&amp;percnt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Model&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;14&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;7.1&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;7.1&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;7.1&amp;percnt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Media Personality&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;21&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;4.8&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;4.8&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;4.8&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;23.8&amp;percnt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Businessperson&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;5&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;20&amp;percnt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Entertainer&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;5&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;20&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;40&amp;percnt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Other&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;9&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;44.4&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0&amp;percnt;&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
  
  
&lt;/table&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;step-10-have-some-fun-by-turning-column-headers-into-emojis&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Step 10: Have some fun by turning column headers into emojis&lt;/h3&gt;
&lt;p&gt;Like other markdown text in R, &lt;code&gt;gt&lt;/code&gt; also supports emojis! Here we add medals for 1st, 2nd, and 3rd place… and a personal favorite emoji to represent last place. Emojis can be added into markdown through the &lt;code&gt;emo::ji()&lt;/code&gt; function.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;(g10 &amp;lt;- g9 %&amp;gt;% 
  cols_label(
    known_for = &amp;quot;&amp;quot;,
    num_stars = paste0(&amp;quot;# &amp;quot;,emo::ji(&amp;#39;star&amp;#39;), &amp;quot;s&amp;quot;),
    pct_1st_place = paste0(emo::ji(&amp;quot;1st_place_medal&amp;quot;), &amp;quot;(1st)&amp;quot;),
    pct_2nd_place = paste0(emo::ji(&amp;quot;2nd_place_medal&amp;quot;), &amp;quot;(2nd)&amp;quot;),
    pct_3rd_place = paste0(emo::ji(&amp;quot;3rd_place_medal&amp;quot;), &amp;quot;(3rd)&amp;quot;),
    pct_last_place = paste0(emo::ji(&amp;quot;poo&amp;quot;), &amp;quot; (last)&amp;quot;)
  )
)&lt;/code&gt;&lt;/pre&gt;
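&lt;p&gt;For context, &lt;code&gt;emo::ji()&lt;/code&gt; simply looks an emoji up by name and returns it as a UTF-8 character string, which is why it can be pasted into any label. A small standalone sketch (note that &lt;code&gt;emo&lt;/code&gt; is not on CRAN and must be installed from GitHub):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Install once from GitHub:
# remotes::install_github(&amp;#39;hadley/emo&amp;#39;)

emo::ji(&amp;#39;star&amp;#39;)                            # the star emoji as a character string
paste0(emo::ji(&amp;#39;1st_place_medal&amp;#39;), &amp;#39; (1st)&amp;#39;) # composable with paste0() for labels&lt;/code&gt;&lt;/pre&gt;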
&lt;style&gt;@import url(&#34;https://fonts.googleapis.com/css2?family=Spartan:ital,wght@0,100;0,200;0,300;0,400;0,500;0,600;0,700;0,800;0,900;1,100;1,200;1,300;1,400;1,500;1,600;1,700;1,800;1,900&amp;display=swap&#34;);
@import url(&#34;https://fonts.googleapis.com/css2?family=Secular+One:ital,wght@0,100;0,200;0,300;0,400;0,500;0,600;0,700;0,800;0,900;1,100;1,200;1,300;1,400;1,500;1,600;1,700;1,800;1,900&amp;display=swap&#34;);
@import url(&#34;https://fonts.googleapis.com/css2?family=Courgette:ital,wght@0,100;0,200;0,300;0,400;0,500;0,600;0,700;0,800;0,900;1,100;1,200;1,300;1,400;1,500;1,600;1,700;1,800;1,900&amp;display=swap&#34;);
@import url(&#34;https://fonts.googleapis.com/css2?family=Caveat:ital,wght@0,100;0,200;0,300;0,400;0,500;0,600;0,700;0,800;0,900;1,100;1,200;1,300;1,400;1,500;1,600;1,700;1,800;1,900&amp;display=swap&#34;);
@import url(&#34;https://fonts.googleapis.com/css2?family=Anton:ital,wght@0,100;0,200;0,300;0,400;0,500;0,600;0,700;0,800;0,900;1,100;1,200;1,300;1,400;1,500;1,600;1,700;1,800;1,900&amp;display=swap&#34;);
html {
  font-family: -apple-system, BlinkMacSystemFont, &#39;Segoe UI&#39;, Roboto, Oxygen, Ubuntu, Cantarell, &#39;Helvetica Neue&#39;, &#39;Fira Sans&#39;, &#39;Droid Sans&#39;, Arial, sans-serif;
}

#nzxqjbqoqh .gt_table {
  display: table;
  border-collapse: collapse;
  margin-left: auto;
  margin-right: auto;
  color: #333333;
  font-size: 16px;
  font-weight: normal;
  font-style: normal;
  background-color: #FFFFFF;
  width: auto;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #A8A8A8;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #A8A8A8;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
}

#nzxqjbqoqh .gt_heading {
  background-color: #FFFFFF;
  text-align: center;
  border-bottom-color: #FFFFFF;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
}

#nzxqjbqoqh .gt_title {
  color: #333333;
  font-size: 125%;
  font-weight: initial;
  padding-top: 4px;
  padding-bottom: 4px;
  border-bottom-color: #FFFFFF;
  border-bottom-width: 0;
}

#nzxqjbqoqh .gt_subtitle {
  color: #333333;
  font-size: 85%;
  font-weight: initial;
  padding-top: 0;
  padding-bottom: 4px;
  border-top-color: #FFFFFF;
  border-top-width: 0;
}

#nzxqjbqoqh .gt_bottom_border {
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
}

#nzxqjbqoqh .gt_col_headings {
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
}

#nzxqjbqoqh .gt_col_heading {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: normal;
  text-transform: inherit;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: bottom;
  padding-top: 5px;
  padding-bottom: 6px;
  padding-left: 5px;
  padding-right: 5px;
  overflow-x: hidden;
}

#nzxqjbqoqh .gt_column_spanner_outer {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: normal;
  text-transform: inherit;
  padding-top: 0;
  padding-bottom: 0;
  padding-left: 4px;
  padding-right: 4px;
}

#nzxqjbqoqh .gt_column_spanner_outer:first-child {
  padding-left: 0;
}

#nzxqjbqoqh .gt_column_spanner_outer:last-child {
  padding-right: 0;
}

#nzxqjbqoqh .gt_column_spanner {
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  vertical-align: bottom;
  padding-top: 5px;
  padding-bottom: 6px;
  overflow-x: hidden;
  display: inline-block;
  width: 100%;
}

#nzxqjbqoqh .gt_group_heading {
  padding: 8px;
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  text-transform: inherit;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: middle;
}

#nzxqjbqoqh .gt_empty_group_heading {
  padding: 0.5px;
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  vertical-align: middle;
}

#nzxqjbqoqh .gt_from_md &gt; :first-child {
  margin-top: 0;
}

#nzxqjbqoqh .gt_from_md &gt; :last-child {
  margin-bottom: 0;
}

#nzxqjbqoqh .gt_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  margin: 10px;
  border-top-style: solid;
  border-top-width: 1px;
  border-top-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: middle;
  overflow-x: hidden;
}

#nzxqjbqoqh .gt_stub {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  text-transform: inherit;
  border-right-style: solid;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
  padding-left: 12px;
}

#nzxqjbqoqh .gt_summary_row {
  color: #333333;
  background-color: #FFFFFF;
  text-transform: inherit;
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
}

#nzxqjbqoqh .gt_first_summary_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
}

#nzxqjbqoqh .gt_grand_summary_row {
  color: #333333;
  background-color: #FFFFFF;
  text-transform: inherit;
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
}

#nzxqjbqoqh .gt_first_grand_summary_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  border-top-style: double;
  border-top-width: 6px;
  border-top-color: #D3D3D3;
}

#nzxqjbqoqh .gt_striped {
  background-color: rgba(128, 128, 128, 0.05);
}

#nzxqjbqoqh .gt_table_body {
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
}

#nzxqjbqoqh .gt_footnotes {
  color: #333333;
  background-color: #FFFFFF;
  border-bottom-style: none;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
}

#nzxqjbqoqh .gt_footnote {
  margin: 0px;
  font-size: 90%;
  padding: 4px;
}

#nzxqjbqoqh .gt_sourcenotes {
  color: #333333;
  background-color: #FFFFFF;
  border-bottom-style: none;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
}

#nzxqjbqoqh .gt_sourcenote {
  font-size: 90%;
  padding: 4px;
}

#nzxqjbqoqh .gt_left {
  text-align: left;
}

#nzxqjbqoqh .gt_center {
  text-align: center;
}

#nzxqjbqoqh .gt_right {
  text-align: right;
  font-variant-numeric: tabular-nums;
}

#nzxqjbqoqh .gt_font_normal {
  font-weight: normal;
}

#nzxqjbqoqh .gt_font_bold {
  font-weight: bold;
}

#nzxqjbqoqh .gt_font_italic {
  font-style: italic;
}

#nzxqjbqoqh .gt_super {
  font-size: 65%;
}

#nzxqjbqoqh .gt_footnote_marks {
  font-style: italic;
  font-size: 65%;
}
&lt;/style&gt;
&lt;div id=&#34;nzxqjbqoqh&#34; style=&#34;overflow-x:auto;overflow-y:auto;width:auto;height:auto;&#34;&gt;&lt;table class=&#34;gt_table&#34;&gt;
  &lt;thead class=&#34;gt_header&#34;&gt;
    &lt;tr&gt;
      &lt;th colspan=&#34;6&#34; class=&#34;gt_heading gt_title gt_font_normal&#34; style=&#34;font-family: Anton; font-size: xx-large; text-align: left;&#34;&gt;Most &lt;span style=&#34;color:#F2CB05&#34;&gt;Successful&lt;/span&gt; Dancing With the Stars &lt;i&gt;&#34;Professions&#34;&lt;/i&gt;&lt;/th&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th colspan=&#34;6&#34; class=&#34;gt_heading gt_subtitle gt_font_normal gt_bottom_border&#34; style=&#34;font-family: Caveat; font-size: x-large; text-align: left;&#34;&gt;&lt;span style = &#39;color: grey&#39;&gt;Covering Seasons 1 to 29 (excluding All-Star Season 15)&lt;/span&gt;&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;thead class=&#34;gt_col_headings&#34;&gt;
    &lt;tr&gt;
      &lt;th class=&#34;gt_col_heading gt_center gt_columns_bottom_border&#34; rowspan=&#34;2&#34; colspan=&#34;1&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_center gt_columns_bottom_border&#34; rowspan=&#34;2&#34; colspan=&#34;1&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;# ⭐s&lt;/th&gt;
      &lt;th class=&#34;gt_center gt_columns_top_border gt_column_spanner_outer&#34; rowspan=&#34;1&#34; colspan=&#34;4&#34; style=&#34;font-family: Courgette; font-size: medium; font-weight: bold;&#34;&gt;
        &lt;span class=&#34;gt_column_spanner&#34;&gt;Distribution of Results&lt;/span&gt;
      &lt;/th&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;🥇(1st)&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;🥈(2nd)&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;🥉(3rd)&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;💩 (last)&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody class=&#34;gt_table_body&#34;&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Athlete&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;79&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;13.9&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;10.1&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;6.3&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;10.1&amp;percnt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Musician&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;38&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;7.9&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;10.5&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;15.8&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;13.2&amp;percnt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Actor/Actress&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;130&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;7.7&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;9.2&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;6.9&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;4.6&amp;percnt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Reality TV Star&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;26&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;7.7&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;7.7&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;7.7&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;3.8&amp;percnt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Model&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;14&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;7.1&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;7.1&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;7.1&amp;percnt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Media Personality&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;21&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;4.8&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;4.8&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;4.8&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;23.8&amp;percnt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Businessperson&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;5&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;20&amp;percnt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Entertainer&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;5&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;20&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;40&amp;percnt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Other&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;9&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;44.4&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0&amp;percnt;&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
  
  
&lt;/table&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;step-11-add-a-source-and-do-some-formatting&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Step 11: Add a source and do some formatting&lt;/h3&gt;
&lt;p&gt;There are a couple of things going on in this step:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;I’m adding a source line with &lt;code&gt;tab_source_note()&lt;/code&gt; and using &lt;code&gt;md()&lt;/code&gt; to allow markdown-style formatting.&lt;/li&gt;
&lt;li&gt;I’m using &lt;code&gt;tab_options()&lt;/code&gt; to remove the top border from the table and shrink the gaps between the rows.&lt;/li&gt;
&lt;li&gt;I’m using &lt;code&gt;cols_width()&lt;/code&gt; to tell &lt;code&gt;gt&lt;/code&gt; to make the first column 200px wide.&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;(g11 &amp;lt;- g10 %&amp;gt;%
  tab_source_note(md(&amp;quot;**Data:** DWTS Wikipedia Articles | **Table Author:** JLaw&amp;quot;)) %&amp;gt;%
  tab_options(
    table.border.top.color = &amp;quot;white&amp;quot;,
    data_row.padding = px(0),
  ) %&amp;gt;% 
  cols_width(
    1 ~ px(200),
  )
)&lt;/code&gt;&lt;/pre&gt;
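&lt;p&gt;One note on the &lt;code&gt;cols_width()&lt;/code&gt; call: it selects the first column by position. Columns can also be selected by name, which is a bit more robust if the column order ever changes. A hedged sketch (it assumes the first column is named &lt;code&gt;profession&lt;/code&gt;, which may not match the actual data):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Equivalent width setting, selecting the column by name instead of position
# (column name &amp;quot;profession&amp;quot; is a hypothetical placeholder)
g10 %&amp;gt;%
  cols_width(
    vars(profession) ~ px(200)
  )&lt;/code&gt;&lt;/pre&gt;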
&lt;style&gt;@import url(&#34;https://fonts.googleapis.com/css2?family=Spartan:ital,wght@0,100;0,200;0,300;0,400;0,500;0,600;0,700;0,800;0,900;1,100;1,200;1,300;1,400;1,500;1,600;1,700;1,800;1,900&amp;display=swap&#34;);
@import url(&#34;https://fonts.googleapis.com/css2?family=Secular+One:ital,wght@0,100;0,200;0,300;0,400;0,500;0,600;0,700;0,800;0,900;1,100;1,200;1,300;1,400;1,500;1,600;1,700;1,800;1,900&amp;display=swap&#34;);
@import url(&#34;https://fonts.googleapis.com/css2?family=Courgette:ital,wght@0,100;0,200;0,300;0,400;0,500;0,600;0,700;0,800;0,900;1,100;1,200;1,300;1,400;1,500;1,600;1,700;1,800;1,900&amp;display=swap&#34;);
@import url(&#34;https://fonts.googleapis.com/css2?family=Caveat:ital,wght@0,100;0,200;0,300;0,400;0,500;0,600;0,700;0,800;0,900;1,100;1,200;1,300;1,400;1,500;1,600;1,700;1,800;1,900&amp;display=swap&#34;);
@import url(&#34;https://fonts.googleapis.com/css2?family=Anton:ital,wght@0,100;0,200;0,300;0,400;0,500;0,600;0,700;0,800;0,900;1,100;1,200;1,300;1,400;1,500;1,600;1,700;1,800;1,900&amp;display=swap&#34;);
html {
  font-family: -apple-system, BlinkMacSystemFont, &#39;Segoe UI&#39;, Roboto, Oxygen, Ubuntu, Cantarell, &#39;Helvetica Neue&#39;, &#39;Fira Sans&#39;, &#39;Droid Sans&#39;, Arial, sans-serif;
}

#bpwbmkejog .gt_table {
  display: table;
  border-collapse: collapse;
  margin-left: auto;
  margin-right: auto;
  color: #333333;
  font-size: 16px;
  font-weight: normal;
  font-style: normal;
  background-color: #FFFFFF;
  width: auto;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: white;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #A8A8A8;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
}

#bpwbmkejog .gt_heading {
  background-color: #FFFFFF;
  text-align: center;
  border-bottom-color: #FFFFFF;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
}

#bpwbmkejog .gt_title {
  color: #333333;
  font-size: 125%;
  font-weight: initial;
  padding-top: 4px;
  padding-bottom: 4px;
  border-bottom-color: #FFFFFF;
  border-bottom-width: 0;
}

#bpwbmkejog .gt_subtitle {
  color: #333333;
  font-size: 85%;
  font-weight: initial;
  padding-top: 0;
  padding-bottom: 4px;
  border-top-color: #FFFFFF;
  border-top-width: 0;
}

#bpwbmkejog .gt_bottom_border {
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
}

#bpwbmkejog .gt_col_headings {
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
}

#bpwbmkejog .gt_col_heading {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: normal;
  text-transform: inherit;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: bottom;
  padding-top: 5px;
  padding-bottom: 6px;
  padding-left: 5px;
  padding-right: 5px;
  overflow-x: hidden;
}

#bpwbmkejog .gt_column_spanner_outer {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: normal;
  text-transform: inherit;
  padding-top: 0;
  padding-bottom: 0;
  padding-left: 4px;
  padding-right: 4px;
}

#bpwbmkejog .gt_column_spanner_outer:first-child {
  padding-left: 0;
}

#bpwbmkejog .gt_column_spanner_outer:last-child {
  padding-right: 0;
}

#bpwbmkejog .gt_column_spanner {
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  vertical-align: bottom;
  padding-top: 5px;
  padding-bottom: 6px;
  overflow-x: hidden;
  display: inline-block;
  width: 100%;
}

#bpwbmkejog .gt_group_heading {
  padding: 8px;
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  text-transform: inherit;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: middle;
}

#bpwbmkejog .gt_empty_group_heading {
  padding: 0.5px;
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  vertical-align: middle;
}

#bpwbmkejog .gt_from_md &gt; :first-child {
  margin-top: 0;
}

#bpwbmkejog .gt_from_md &gt; :last-child {
  margin-bottom: 0;
}

#bpwbmkejog .gt_row {
  padding-top: 0px;
  padding-bottom: 0px;
  padding-left: 5px;
  padding-right: 5px;
  margin: 10px;
  border-top-style: solid;
  border-top-width: 1px;
  border-top-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: middle;
  overflow-x: hidden;
}

#bpwbmkejog .gt_stub {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  text-transform: inherit;
  border-right-style: solid;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
  padding-left: 12px;
}

#bpwbmkejog .gt_summary_row {
  color: #333333;
  background-color: #FFFFFF;
  text-transform: inherit;
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
}

#bpwbmkejog .gt_first_summary_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
}

#bpwbmkejog .gt_grand_summary_row {
  color: #333333;
  background-color: #FFFFFF;
  text-transform: inherit;
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
}

#bpwbmkejog .gt_first_grand_summary_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  border-top-style: double;
  border-top-width: 6px;
  border-top-color: #D3D3D3;
}

#bpwbmkejog .gt_striped {
  background-color: rgba(128, 128, 128, 0.05);
}

#bpwbmkejog .gt_table_body {
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
}

#bpwbmkejog .gt_footnotes {
  color: #333333;
  background-color: #FFFFFF;
  border-bottom-style: none;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
}

#bpwbmkejog .gt_footnote {
  margin: 0px;
  font-size: 90%;
  padding: 4px;
}

#bpwbmkejog .gt_sourcenotes {
  color: #333333;
  background-color: #FFFFFF;
  border-bottom-style: none;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
}

#bpwbmkejog .gt_sourcenote {
  font-size: 90%;
  padding: 4px;
}

#bpwbmkejog .gt_left {
  text-align: left;
}

#bpwbmkejog .gt_center {
  text-align: center;
}

#bpwbmkejog .gt_right {
  text-align: right;
  font-variant-numeric: tabular-nums;
}

#bpwbmkejog .gt_font_normal {
  font-weight: normal;
}

#bpwbmkejog .gt_font_bold {
  font-weight: bold;
}

#bpwbmkejog .gt_font_italic {
  font-style: italic;
}

#bpwbmkejog .gt_super {
  font-size: 65%;
}

#bpwbmkejog .gt_footnote_marks {
  font-style: italic;
  font-size: 65%;
}
&lt;/style&gt;
&lt;div id=&#34;bpwbmkejog&#34; style=&#34;overflow-x:auto;overflow-y:auto;width:auto;height:auto;&#34;&gt;&lt;table class=&#34;gt_table&#34; style=&#34;table-layout: fixed;&#34;&gt;
  &lt;colgroup&gt;
    &lt;col style=&#34;width:200px;&#34;/&gt;
    &lt;col/&gt;
    &lt;col/&gt;
    &lt;col/&gt;
    &lt;col/&gt;
    &lt;col/&gt;
  &lt;/colgroup&gt;
  &lt;thead class=&#34;gt_header&#34;&gt;
    &lt;tr&gt;
      &lt;th colspan=&#34;6&#34; class=&#34;gt_heading gt_title gt_font_normal&#34; style=&#34;font-family: Anton; font-size: xx-large; text-align: left;&#34;&gt;Most &lt;span style=&#34;color:#F2CB05&#34;&gt;Successful&lt;/span&gt; Dancing With the Stars &lt;i&gt;&#34;Professions&#34;&lt;/i&gt;&lt;/th&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th colspan=&#34;6&#34; class=&#34;gt_heading gt_subtitle gt_font_normal gt_bottom_border&#34; style=&#34;font-family: Caveat; font-size: x-large; text-align: left;&#34;&gt;&lt;span style = &#39;color: grey&#39;&gt;Covering Seasons 1 to 29 (excluding All-Star Season 15)&lt;/span&gt;&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;thead class=&#34;gt_col_headings&#34;&gt;
    &lt;tr&gt;
      &lt;th class=&#34;gt_col_heading gt_center gt_columns_bottom_border&#34; rowspan=&#34;2&#34; colspan=&#34;1&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_center gt_columns_bottom_border&#34; rowspan=&#34;2&#34; colspan=&#34;1&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;# ⭐s&lt;/th&gt;
      &lt;th class=&#34;gt_center gt_columns_top_border gt_column_spanner_outer&#34; rowspan=&#34;1&#34; colspan=&#34;4&#34; style=&#34;font-family: Courgette; font-size: medium; font-weight: bold;&#34;&gt;
        &lt;span class=&#34;gt_column_spanner&#34;&gt;Distribution of Results&lt;/span&gt;
      &lt;/th&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;🥇(1st)&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;🥈(2nd)&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;🥉(3rd)&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;💩 (last)&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody class=&#34;gt_table_body&#34;&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Athlete&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;79&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;13.9&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;10.1&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;6.3&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;10.1&amp;percnt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Musician&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;38&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;7.9&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;10.5&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;15.8&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;13.2&amp;percnt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Actor/Actress&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;130&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;7.7&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;9.2&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;6.9&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;4.6&amp;percnt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Reality TV Star&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;26&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;7.7&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;7.7&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;7.7&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;3.8&amp;percnt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Model&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;14&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;7.1&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;7.1&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;7.1&amp;percnt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Media Personality&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;21&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;4.8&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;4.8&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;4.8&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;23.8&amp;percnt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Businessperson&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;5&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;20&amp;percnt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Entertainer&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;5&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;20&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;40&amp;percnt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Other&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;9&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;44.4&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;0&amp;percnt;&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
  &lt;tfoot class=&#34;gt_sourcenotes&#34;&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_sourcenote&#34; colspan=&#34;6&#34;&gt;&lt;strong&gt;Data:&lt;/strong&gt; DWTS Wikipedia Articles | &lt;strong&gt;Table Author:&lt;/strong&gt; JLaw&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tfoot&gt;
  
&lt;/table&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;step-12-adding-a-color-scale-for-the-columns&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Step 12: Adding a Color Scale for the % Columns&lt;/h3&gt;
&lt;p&gt;The &lt;code&gt;data_color()&lt;/code&gt; function applies conditional formatting based on the values in the columns. The &lt;code&gt;columns&lt;/code&gt; argument specifies which columns should receive the formatting, the &lt;code&gt;colors&lt;/code&gt; argument defines the palette, and the &lt;code&gt;apply_to&lt;/code&gt; argument takes either “fill” to fill the background or “text” to change the color of the text.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;(g12 &amp;lt;- g11 %&amp;gt;% 
  data_color(
    columns = vars(pct_1st_place, pct_2nd_place, pct_3rd_place, pct_last_place),
    colors = scales::col_numeric(
      palette = c(&amp;quot;white&amp;quot;, &amp;quot;#3fc1c9&amp;quot;),
      # &amp;quot;#3fc1c9&amp;quot; is the teal accent color
      domain = NULL
    ),
    apply_to = &amp;quot;fill&amp;quot;,
  )
 )&lt;/code&gt;&lt;/pre&gt;
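&lt;p&gt;As a side note, the palette returned by &lt;code&gt;scales::col_numeric()&lt;/code&gt; is itself just a function that maps numbers to hex color strings, so it can be inspected on its own to see what &lt;code&gt;data_color()&lt;/code&gt; will produce. A minimal sketch (the input values here are hypothetical; fixing &lt;code&gt;domain&lt;/code&gt; makes the mapping explicit, whereas &lt;code&gt;domain = NULL&lt;/code&gt; rescales to each column’s range):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# A standalone look at the palette function used above
pal &amp;lt;- scales::col_numeric(
  palette = c(&amp;quot;white&amp;quot;, &amp;quot;#3fc1c9&amp;quot;),
  domain = c(0, 45)  # example domain; larger values map closer to #3fc1c9
)

# Returns a vector of hex colors, one per input value
pal(c(0, 20, 44.4))&lt;/code&gt;&lt;/pre&gt;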
&lt;style&gt;@import url(&#34;https://fonts.googleapis.com/css2?family=Spartan:ital,wght@0,100;0,200;0,300;0,400;0,500;0,600;0,700;0,800;0,900;1,100;1,200;1,300;1,400;1,500;1,600;1,700;1,800;1,900&amp;display=swap&#34;);
@import url(&#34;https://fonts.googleapis.com/css2?family=Secular+One:ital,wght@0,100;0,200;0,300;0,400;0,500;0,600;0,700;0,800;0,900;1,100;1,200;1,300;1,400;1,500;1,600;1,700;1,800;1,900&amp;display=swap&#34;);
@import url(&#34;https://fonts.googleapis.com/css2?family=Courgette:ital,wght@0,100;0,200;0,300;0,400;0,500;0,600;0,700;0,800;0,900;1,100;1,200;1,300;1,400;1,500;1,600;1,700;1,800;1,900&amp;display=swap&#34;);
@import url(&#34;https://fonts.googleapis.com/css2?family=Caveat:ital,wght@0,100;0,200;0,300;0,400;0,500;0,600;0,700;0,800;0,900;1,100;1,200;1,300;1,400;1,500;1,600;1,700;1,800;1,900&amp;display=swap&#34;);
@import url(&#34;https://fonts.googleapis.com/css2?family=Anton:ital,wght@0,100;0,200;0,300;0,400;0,500;0,600;0,700;0,800;0,900;1,100;1,200;1,300;1,400;1,500;1,600;1,700;1,800;1,900&amp;display=swap&#34;);
html {
  font-family: -apple-system, BlinkMacSystemFont, &#39;Segoe UI&#39;, Roboto, Oxygen, Ubuntu, Cantarell, &#39;Helvetica Neue&#39;, &#39;Fira Sans&#39;, &#39;Droid Sans&#39;, Arial, sans-serif;
}

#wkwzhbancp .gt_table {
  display: table;
  border-collapse: collapse;
  margin-left: auto;
  margin-right: auto;
  color: #333333;
  font-size: 16px;
  font-weight: normal;
  font-style: normal;
  background-color: #FFFFFF;
  width: auto;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: white;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #A8A8A8;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
}

#wkwzhbancp .gt_heading {
  background-color: #FFFFFF;
  text-align: center;
  border-bottom-color: #FFFFFF;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
}

#wkwzhbancp .gt_title {
  color: #333333;
  font-size: 125%;
  font-weight: initial;
  padding-top: 4px;
  padding-bottom: 4px;
  border-bottom-color: #FFFFFF;
  border-bottom-width: 0;
}

#wkwzhbancp .gt_subtitle {
  color: #333333;
  font-size: 85%;
  font-weight: initial;
  padding-top: 0;
  padding-bottom: 4px;
  border-top-color: #FFFFFF;
  border-top-width: 0;
}

#wkwzhbancp .gt_bottom_border {
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
}

#wkwzhbancp .gt_col_headings {
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
}

#wkwzhbancp .gt_col_heading {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: normal;
  text-transform: inherit;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: bottom;
  padding-top: 5px;
  padding-bottom: 6px;
  padding-left: 5px;
  padding-right: 5px;
  overflow-x: hidden;
}

#wkwzhbancp .gt_column_spanner_outer {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: normal;
  text-transform: inherit;
  padding-top: 0;
  padding-bottom: 0;
  padding-left: 4px;
  padding-right: 4px;
}

#wkwzhbancp .gt_column_spanner_outer:first-child {
  padding-left: 0;
}

#wkwzhbancp .gt_column_spanner_outer:last-child {
  padding-right: 0;
}

#wkwzhbancp .gt_column_spanner {
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  vertical-align: bottom;
  padding-top: 5px;
  padding-bottom: 6px;
  overflow-x: hidden;
  display: inline-block;
  width: 100%;
}

#wkwzhbancp .gt_group_heading {
  padding: 8px;
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  text-transform: inherit;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: middle;
}

#wkwzhbancp .gt_empty_group_heading {
  padding: 0.5px;
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  vertical-align: middle;
}

#wkwzhbancp .gt_from_md &gt; :first-child {
  margin-top: 0;
}

#wkwzhbancp .gt_from_md &gt; :last-child {
  margin-bottom: 0;
}

#wkwzhbancp .gt_row {
  padding-top: 0px;
  padding-bottom: 0px;
  padding-left: 5px;
  padding-right: 5px;
  margin: 10px;
  border-top-style: solid;
  border-top-width: 1px;
  border-top-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: middle;
  overflow-x: hidden;
}

#wkwzhbancp .gt_stub {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  text-transform: inherit;
  border-right-style: solid;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
  padding-left: 12px;
}

#wkwzhbancp .gt_summary_row {
  color: #333333;
  background-color: #FFFFFF;
  text-transform: inherit;
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
}

#wkwzhbancp .gt_first_summary_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
}

#wkwzhbancp .gt_grand_summary_row {
  color: #333333;
  background-color: #FFFFFF;
  text-transform: inherit;
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
}

#wkwzhbancp .gt_first_grand_summary_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  border-top-style: double;
  border-top-width: 6px;
  border-top-color: #D3D3D3;
}

#wkwzhbancp .gt_striped {
  background-color: rgba(128, 128, 128, 0.05);
}

#wkwzhbancp .gt_table_body {
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
}

#wkwzhbancp .gt_footnotes {
  color: #333333;
  background-color: #FFFFFF;
  border-bottom-style: none;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
}

#wkwzhbancp .gt_footnote {
  margin: 0px;
  font-size: 90%;
  padding: 4px;
}

#wkwzhbancp .gt_sourcenotes {
  color: #333333;
  background-color: #FFFFFF;
  border-bottom-style: none;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
}

#wkwzhbancp .gt_sourcenote {
  font-size: 90%;
  padding: 4px;
}

#wkwzhbancp .gt_left {
  text-align: left;
}

#wkwzhbancp .gt_center {
  text-align: center;
}

#wkwzhbancp .gt_right {
  text-align: right;
  font-variant-numeric: tabular-nums;
}

#wkwzhbancp .gt_font_normal {
  font-weight: normal;
}

#wkwzhbancp .gt_font_bold {
  font-weight: bold;
}

#wkwzhbancp .gt_font_italic {
  font-style: italic;
}

#wkwzhbancp .gt_super {
  font-size: 65%;
}

#wkwzhbancp .gt_footnote_marks {
  font-style: italic;
  font-size: 65%;
}
&lt;/style&gt;
&lt;div id=&#34;wkwzhbancp&#34; style=&#34;overflow-x:auto;overflow-y:auto;width:auto;height:auto;&#34;&gt;&lt;table class=&#34;gt_table&#34; style=&#34;table-layout: fixed;&#34;&gt;
  &lt;colgroup&gt;
    &lt;col style=&#34;width:200px;&#34;/&gt;
    &lt;col/&gt;
    &lt;col/&gt;
    &lt;col/&gt;
    &lt;col/&gt;
    &lt;col/&gt;
  &lt;/colgroup&gt;
  &lt;thead class=&#34;gt_header&#34;&gt;
    &lt;tr&gt;
      &lt;th colspan=&#34;6&#34; class=&#34;gt_heading gt_title gt_font_normal&#34; style=&#34;font-family: Anton; font-size: xx-large; text-align: left;&#34;&gt;Most &lt;span style=&#34;color:#F2CB05&#34;&gt;Successful&lt;/span&gt; Dancing With the Stars &lt;i&gt;&#34;Professions&#34;&lt;/i&gt;&lt;/th&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th colspan=&#34;6&#34; class=&#34;gt_heading gt_subtitle gt_font_normal gt_bottom_border&#34; style=&#34;font-family: Caveat; font-size: x-large; text-align: left;&#34;&gt;&lt;span style = &#39;color: grey&#39;&gt;Covering Seasons 1 to 29 (excluding All-Star Season 15)&lt;/span&gt;&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;thead class=&#34;gt_col_headings&#34;&gt;
    &lt;tr&gt;
      &lt;th class=&#34;gt_col_heading gt_center gt_columns_bottom_border&#34; rowspan=&#34;2&#34; colspan=&#34;1&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_center gt_columns_bottom_border&#34; rowspan=&#34;2&#34; colspan=&#34;1&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;# ⭐s&lt;/th&gt;
      &lt;th class=&#34;gt_center gt_columns_top_border gt_column_spanner_outer&#34; rowspan=&#34;1&#34; colspan=&#34;4&#34; style=&#34;font-family: Courgette; font-size: medium; font-weight: bold;&#34;&gt;
        &lt;span class=&#34;gt_column_spanner&#34;&gt;Distribution of Results&lt;/span&gt;
      &lt;/th&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;🥇(1st)&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;🥈(2nd)&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;🥉(3rd)&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;💩 (last)&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody class=&#34;gt_table_body&#34;&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Athlete&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;79&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #3FC1C9;&#34;&gt;13.9&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #4BC3CB;&#34;&gt;10.1&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #E8F6F7;&#34;&gt;6.3&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #D6F0F1;&#34;&gt;10.1&amp;percnt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Musician&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;38&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #A0DCE0;&#34;&gt;7.9&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #3FC1C9;&#34;&gt;10.5&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #C5E9EC;&#34;&gt;15.8&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #C9EBED;&#34;&gt;13.2&amp;percnt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Actor/Actress&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;130&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #A2DDE1;&#34;&gt;7.7&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #61C9D0;&#34;&gt;9.2&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #E6F6F6;&#34;&gt;6.9&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #ECF8F9;&#34;&gt;4.6&amp;percnt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Reality TV Star&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;26&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #A2DDE1;&#34;&gt;7.7&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #81D2D7;&#34;&gt;7.7&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #E3F4F5;&#34;&gt;7.7&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #EFF9FA;&#34;&gt;3.8&amp;percnt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Model&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;14&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #A9E0E3;&#34;&gt;7.1&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #8BD5DA;&#34;&gt;7.1&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #FFFFFF;&#34;&gt;0&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #E2F4F5;&#34;&gt;7.1&amp;percnt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Media Personality&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;21&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #C7EAEC;&#34;&gt;4.8&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #B4E3E6;&#34;&gt;4.8&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #EEF8F9;&#34;&gt;4.8&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #9BDADF;&#34;&gt;23.8&amp;percnt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Businessperson&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;5&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #FFFFFF;&#34;&gt;0&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #FFFFFF;&#34;&gt;0&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #FFFFFF;&#34;&gt;0&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #ACE0E4;&#34;&gt;20&amp;percnt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Entertainer&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;5&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #FFFFFF;&#34;&gt;0&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #FFFFFF;&#34;&gt;0&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #B5E3E6;&#34;&gt;20&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #3FC1C9;&#34;&gt;40&amp;percnt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Other&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;9&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #FFFFFF;&#34;&gt;0&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #FFFFFF;&#34;&gt;0&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #3FC1C9;&#34;&gt;44.4&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #FFFFFF;&#34;&gt;0&amp;percnt;&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
  &lt;tfoot class=&#34;gt_sourcenotes&#34;&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_sourcenote&#34; colspan=&#34;6&#34;&gt;&lt;strong&gt;Data:&lt;/strong&gt; DWTS Wikipedia Articles | &lt;strong&gt;Table Author:&lt;/strong&gt; JLaw&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tfoot&gt;
  
&lt;/table&gt;&lt;/div&gt;
&lt;p&gt;This looks pretty good… but we can do better!!!&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;turning-it-up-to-11-by-adding-in-density-plots&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Turning it up to 11 by adding in Density Plots&lt;/h2&gt;
&lt;p&gt;In order to add ggplots into the rows of the table we need to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Build a function to create the plot for each row of the table&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;purrr::map()&lt;/code&gt; to add the plot as a list-column to the table&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;gt::text_transform()&lt;/code&gt; to insert the image into the table&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;NOTE:&lt;/strong&gt; Since this step required building a new data set, much of the &lt;code&gt;gt&lt;/code&gt; code repeats the first section, but it is provided in its entirety for completeness.&lt;/p&gt;
&lt;div id=&#34;writing-the-function-to-build-the-chart&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Writing the function to build the chart&lt;/h3&gt;
&lt;p&gt;The function takes in a “profession” label and a data set and returns a density plot built from the &lt;code&gt;scaled_place&lt;/code&gt; variable defined at the top.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;plot_dens &amp;lt;- function(profession, data) {
  
  plot_data &amp;lt;- 
    data %&amp;gt;% 
    filter(known_for == {{ profession }}) 
  
  plot &amp;lt;- 
    plot_data %&amp;gt;% 
    ggplot(aes(x = scaled_place)) +
    geom_density(aes(y = ..scaled..), fill = &amp;#39;gold&amp;#39;) +
    annotate(&amp;quot;text&amp;quot;, x = 0, y = -.05, 
             label = &amp;quot;1st\nPlace&amp;quot;, size = 10, color = &amp;quot;grey40&amp;quot;, vjust = 1) +
    annotate(&amp;quot;text&amp;quot;, x = 1, y = -.05, 
             label = &amp;quot;Last\nPlace&amp;quot;, size = 10, color = &amp;quot;grey40&amp;quot;, vjust = 1) +
    coord_cartesian(
      xlim = c(-.1, 1.1),
      ylim = c(-.7, NA)
    ) + 
    theme_void()
  
  plot
  
}&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;adding-the-plots-into-the-data-set&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Adding the plots into the data set&lt;/h3&gt;
&lt;p&gt;The main part of this step uses the &lt;code&gt;map()&lt;/code&gt; function to iterate through the professions, passing each one to the function defined above. The resulting &lt;code&gt;plots&lt;/code&gt; column is a list-column containing all of the ggplot information.&lt;/p&gt;
&lt;p&gt;The left join adds a column for the most recent winner in each category.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;contestant_summary_with_graph &amp;lt;- contestant_summary %&amp;gt;% 
  mutate(plots = purrr::map(contestant_summary$known_for %&amp;gt;% unique, 
                            plot_dens, data = contestant_clean)) %&amp;gt;% 
  left_join(
  ###Add in Recent Winner Images
  contestant_clean %&amp;gt;% 
    filter(place == 1) %&amp;gt;% 
    group_by(known_for) %&amp;gt;% 
    slice_max(season, n = 1) %&amp;gt;% 
    select(celebrity, season, known_for) %&amp;gt;% 
    ungroup() %&amp;gt;% 
    transmute(
      known_for,
      lbl = paste0(celebrity,&amp;#39; (Season &amp;#39;,season,&amp;quot;)&amp;quot;)
    )
  )&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;creating-the-final-table&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Creating the Final Table&lt;/h3&gt;
&lt;p&gt;In order to render the plots inside the table, the &lt;code&gt;text_transform()&lt;/code&gt; function takes the &lt;code&gt;plots&lt;/code&gt; column and, for each row, calls &lt;code&gt;ggplot_image()&lt;/code&gt; with height and aspect-ratio parameters.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;text_transform(
    locations = cells_body(vars(plots)),
    fn = function(x) {
      map(contestant_summary_with_graph$plots, ggplot_image, 
          height = px(120), aspect_ratio = 1.5)
    }
  )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now we can put it all together. Besides adding in the plots, there are a few steps to format the Most Recent Winner cell, but nothing that hasn’t been covered earlier.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#Base Table
gt(contestant_summary_with_graph) %&amp;gt;% 
  #Add Titles
  tab_header(
    title = html(&amp;#39;Most &amp;lt;span style=&amp;quot;color:#F2CB05&amp;quot;&amp;gt;Successful&amp;lt;/span&amp;gt; Dancing With the Stars &amp;lt;i&amp;gt;&amp;quot;Professions&amp;quot;&amp;lt;/i&amp;gt;&amp;#39;),
    subtitle = html(
      &amp;quot;&amp;lt;span style = &amp;#39;color: grey&amp;#39;&amp;gt;Covering Seasons 1 to 29 (excluding All-Star Season 15)&amp;lt;/span&amp;gt;&amp;quot;
    )
  ) %&amp;gt;% 
  #Format Title
  tab_style(
    style = cell_text(
      font = google_font(&amp;quot;Anton&amp;quot;), 
      align = &amp;quot;left&amp;quot;, 
      size = &amp;quot;xx-large&amp;quot;
    ),
    locations = cells_title(&amp;quot;title&amp;quot;)
  ) %&amp;gt;% 
  #Format Subtitle
  tab_style(
    style = cell_text(
      font = google_font(&amp;quot;Caveat&amp;quot;),
      align = &amp;quot;left&amp;quot;, 
      size = &amp;quot;x-large&amp;quot;
    ),
    locations = cells_title(&amp;quot;subtitle&amp;quot;)
  )  %&amp;gt;% 
  #Adding Spanning Column
  tab_spanner(
    label = &amp;quot;Distribution of Results&amp;quot;,
    columns = 3:7
  ) %&amp;gt;% 
  #Style The Spanner Column
  tab_style(
    style = cell_text(
      font = google_font(&amp;quot;Courgette&amp;quot;), 
      size = &amp;quot;medium&amp;quot;, 
      weight = &amp;quot;bold&amp;quot;
    ),
    locations = cells_column_spanners(&amp;quot;Distribution of Results&amp;quot;)
  ) %&amp;gt;% 
  #Style the Column Labels and Profession Column
  tab_style(
    style = cell_text(
      font = google_font(&amp;quot;Secular One&amp;quot;), 
      size = &amp;quot;large&amp;quot;
    ),
    locations = list(
      cells_column_labels(everything()), 
      cells_body(columns = 1)
    )
  )  %&amp;gt;% 
  #Style the Cells
  tab_style(
    style = cell_text(
      font = google_font(&amp;quot;Spartan&amp;quot;), 
      size = &amp;quot;medium&amp;quot;,
      align = &amp;#39;center&amp;#39;
    ),
    locations = cells_body(columns = 2:6)
  ) %&amp;gt;% 
  #Format Cells to %s
  fmt_percent(
    columns = starts_with(&amp;#39;pct&amp;#39;),
    decimals = 1,
    drop_trailing_zeros = TRUE
  ) %&amp;gt;% 
  #Turn Headers to Emojis
  cols_label(
    known_for = &amp;quot;&amp;quot;,
    num_stars = paste0(&amp;quot;# &amp;quot;,emo::ji(&amp;#39;star&amp;#39;), &amp;quot;s&amp;quot;),
    pct_1st_place = paste0(emo::ji(&amp;quot;1st_place_medal&amp;quot;), &amp;quot;(1st)&amp;quot;),
    pct_2nd_place = paste0(emo::ji(&amp;quot;2nd_place_medal&amp;quot;), &amp;quot;(2nd)&amp;quot;),
    pct_3rd_place = paste0(emo::ji(&amp;quot;3rd_place_medal&amp;quot;), &amp;quot;(3rd)&amp;quot;),
    pct_last_place = paste0(emo::ji(&amp;quot;poo&amp;quot;), &amp;quot; (last)&amp;quot;),
    plots = &amp;quot;&amp;quot;,
    lbl = &amp;quot;Most Recent Winner&amp;quot;
  ) %&amp;gt;% 
  ###Add in Source and Doing Some Minor Formatting
  tab_source_note(md(&amp;quot;**Data:** DWTS Wikipedia Articles | **Table Author:** JLaw&amp;quot;)) %&amp;gt;%
  tab_options(
    table.border.top.color = &amp;quot;white&amp;quot;,
    data_row.padding = px(0),
  ) %&amp;gt;% 
  cols_width(
    1 ~ px(200)
  ) %&amp;gt;% 
###Add a Color Scale for 1st Place
  data_color(
    columns = vars(pct_1st_place, pct_2nd_place, pct_3rd_place, pct_last_place),
    colors = scales::col_numeric(
      palette = c(&amp;quot;white&amp;quot;, &amp;quot;#3fc1c9&amp;quot;),
      #F2CB05 = Gold Color
      domain = NULL
    ),
    apply_to = &amp;quot;fill&amp;quot;,
  ) %&amp;gt;% 
  ######################NEW THINGS START HERE#########################
  # Add In Density Plots (NEW)
  text_transform(
    locations = cells_body(vars(plots)),
    fn = function(x) {
      map(contestant_summary_with_graph$plots, ggplot_image, 
          height = px(120), aspect_ratio = 1.5)
    }
  ) %&amp;gt;% 
  text_transform(
    locations = cells_body(vars(lbl)),
    fn = function(x){
      if_else(!is.na(x), str_replace_all(x, &amp;quot; \\(&amp;quot;, &amp;quot;&amp;lt;br&amp;gt; \\(&amp;quot;), &amp;quot;&amp;quot;)
    }
  ) %&amp;gt;% 
  tab_style(
    style = cell_text(
      style = &amp;#39;italic&amp;#39;,
      size = px(13),
      v_align = &amp;#39;middle&amp;#39;,
      align = &amp;#39;left&amp;#39;
    ),
    locations = cells_body(columns = vars(lbl))
  ) %&amp;gt;%
  cols_width(
    8 ~ px(100)
  )&lt;/code&gt;&lt;/pre&gt;
&lt;style&gt;@import url(&#34;https://fonts.googleapis.com/css2?family=Spartan:ital,wght@0,100;0,200;0,300;0,400;0,500;0,600;0,700;0,800;0,900;1,100;1,200;1,300;1,400;1,500;1,600;1,700;1,800;1,900&amp;display=swap&#34;);
@import url(&#34;https://fonts.googleapis.com/css2?family=Secular+One:ital,wght@0,100;0,200;0,300;0,400;0,500;0,600;0,700;0,800;0,900;1,100;1,200;1,300;1,400;1,500;1,600;1,700;1,800;1,900&amp;display=swap&#34;);
@import url(&#34;https://fonts.googleapis.com/css2?family=Courgette:ital,wght@0,100;0,200;0,300;0,400;0,500;0,600;0,700;0,800;0,900;1,100;1,200;1,300;1,400;1,500;1,600;1,700;1,800;1,900&amp;display=swap&#34;);
@import url(&#34;https://fonts.googleapis.com/css2?family=Caveat:ital,wght@0,100;0,200;0,300;0,400;0,500;0,600;0,700;0,800;0,900;1,100;1,200;1,300;1,400;1,500;1,600;1,700;1,800;1,900&amp;display=swap&#34;);
@import url(&#34;https://fonts.googleapis.com/css2?family=Anton:ital,wght@0,100;0,200;0,300;0,400;0,500;0,600;0,700;0,800;0,900;1,100;1,200;1,300;1,400;1,500;1,600;1,700;1,800;1,900&amp;display=swap&#34;);
html {
  font-family: -apple-system, BlinkMacSystemFont, &#39;Segoe UI&#39;, Roboto, Oxygen, Ubuntu, Cantarell, &#39;Helvetica Neue&#39;, &#39;Fira Sans&#39;, &#39;Droid Sans&#39;, Arial, sans-serif;
}

#yovgbnmept .gt_table {
  display: table;
  border-collapse: collapse;
  margin-left: auto;
  margin-right: auto;
  color: #333333;
  font-size: 16px;
  font-weight: normal;
  font-style: normal;
  background-color: #FFFFFF;
  width: auto;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: white;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #A8A8A8;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
}

#yovgbnmept .gt_heading {
  background-color: #FFFFFF;
  text-align: center;
  border-bottom-color: #FFFFFF;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
}

#yovgbnmept .gt_title {
  color: #333333;
  font-size: 125%;
  font-weight: initial;
  padding-top: 4px;
  padding-bottom: 4px;
  border-bottom-color: #FFFFFF;
  border-bottom-width: 0;
}

#yovgbnmept .gt_subtitle {
  color: #333333;
  font-size: 85%;
  font-weight: initial;
  padding-top: 0;
  padding-bottom: 4px;
  border-top-color: #FFFFFF;
  border-top-width: 0;
}

#yovgbnmept .gt_bottom_border {
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
}

#yovgbnmept .gt_col_headings {
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
}

#yovgbnmept .gt_col_heading {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: normal;
  text-transform: inherit;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: bottom;
  padding-top: 5px;
  padding-bottom: 6px;
  padding-left: 5px;
  padding-right: 5px;
  overflow-x: hidden;
}

#yovgbnmept .gt_column_spanner_outer {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: normal;
  text-transform: inherit;
  padding-top: 0;
  padding-bottom: 0;
  padding-left: 4px;
  padding-right: 4px;
}

#yovgbnmept .gt_column_spanner_outer:first-child {
  padding-left: 0;
}

#yovgbnmept .gt_column_spanner_outer:last-child {
  padding-right: 0;
}

#yovgbnmept .gt_column_spanner {
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  vertical-align: bottom;
  padding-top: 5px;
  padding-bottom: 6px;
  overflow-x: hidden;
  display: inline-block;
  width: 100%;
}

#yovgbnmept .gt_group_heading {
  padding: 8px;
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  text-transform: inherit;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: middle;
}

#yovgbnmept .gt_empty_group_heading {
  padding: 0.5px;
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  vertical-align: middle;
}

#yovgbnmept .gt_from_md &gt; :first-child {
  margin-top: 0;
}

#yovgbnmept .gt_from_md &gt; :last-child {
  margin-bottom: 0;
}

#yovgbnmept .gt_row {
  padding-top: 0px;
  padding-bottom: 0px;
  padding-left: 5px;
  padding-right: 5px;
  margin: 10px;
  border-top-style: solid;
  border-top-width: 1px;
  border-top-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: middle;
  overflow-x: hidden;
}

#yovgbnmept .gt_stub {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  text-transform: inherit;
  border-right-style: solid;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
  padding-left: 12px;
}

#yovgbnmept .gt_summary_row {
  color: #333333;
  background-color: #FFFFFF;
  text-transform: inherit;
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
}

#yovgbnmept .gt_first_summary_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
}

#yovgbnmept .gt_grand_summary_row {
  color: #333333;
  background-color: #FFFFFF;
  text-transform: inherit;
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
}

#yovgbnmept .gt_first_grand_summary_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  border-top-style: double;
  border-top-width: 6px;
  border-top-color: #D3D3D3;
}

#yovgbnmept .gt_striped {
  background-color: rgba(128, 128, 128, 0.05);
}

#yovgbnmept .gt_table_body {
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
}

#yovgbnmept .gt_footnotes {
  color: #333333;
  background-color: #FFFFFF;
  border-bottom-style: none;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
}

#yovgbnmept .gt_footnote {
  margin: 0px;
  font-size: 90%;
  padding: 4px;
}

#yovgbnmept .gt_sourcenotes {
  color: #333333;
  background-color: #FFFFFF;
  border-bottom-style: none;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
}

#yovgbnmept .gt_sourcenote {
  font-size: 90%;
  padding: 4px;
}

#yovgbnmept .gt_left {
  text-align: left;
}

#yovgbnmept .gt_center {
  text-align: center;
}

#yovgbnmept .gt_right {
  text-align: right;
  font-variant-numeric: tabular-nums;
}

#yovgbnmept .gt_font_normal {
  font-weight: normal;
}

#yovgbnmept .gt_font_bold {
  font-weight: bold;
}

#yovgbnmept .gt_font_italic {
  font-style: italic;
}

#yovgbnmept .gt_super {
  font-size: 65%;
}

#yovgbnmept .gt_footnote_marks {
  font-style: italic;
  font-size: 65%;
}
&lt;/style&gt;
&lt;div id=&#34;yovgbnmept&#34; style=&#34;overflow-x:auto;overflow-y:auto;width:auto;height:auto;&#34;&gt;&lt;table class=&#34;gt_table&#34; style=&#34;table-layout: fixed;&#34;&gt;
  &lt;colgroup&gt;
    &lt;col style=&#34;width:200px;&#34;/&gt;
    &lt;col/&gt;
    &lt;col/&gt;
    &lt;col/&gt;
    &lt;col/&gt;
    &lt;col/&gt;
    &lt;col/&gt;
    &lt;col style=&#34;width:100px;&#34;/&gt;
  &lt;/colgroup&gt;
  &lt;thead class=&#34;gt_header&#34;&gt;
    &lt;tr&gt;
      &lt;th colspan=&#34;8&#34; class=&#34;gt_heading gt_title gt_font_normal&#34; style=&#34;font-family: Anton; font-size: xx-large; text-align: left;&#34;&gt;Most &lt;span style=&#34;color:#F2CB05&#34;&gt;Successful&lt;/span&gt; Dancing With the Stars &lt;i&gt;&#34;Professions&#34;&lt;/i&gt;&lt;/th&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th colspan=&#34;8&#34; class=&#34;gt_heading gt_subtitle gt_font_normal gt_bottom_border&#34; style=&#34;font-family: Caveat; font-size: x-large; text-align: left;&#34;&gt;&lt;span style = &#39;color: grey&#39;&gt;Covering Seasons 1 to 29 (excluding All-Star Season 15)&lt;/span&gt;&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;thead class=&#34;gt_col_headings&#34;&gt;
    &lt;tr&gt;
      &lt;th class=&#34;gt_col_heading gt_center gt_columns_bottom_border&#34; rowspan=&#34;2&#34; colspan=&#34;1&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_center gt_columns_bottom_border&#34; rowspan=&#34;2&#34; colspan=&#34;1&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;# ⭐s&lt;/th&gt;
      &lt;th class=&#34;gt_center gt_columns_top_border gt_column_spanner_outer&#34; rowspan=&#34;1&#34; colspan=&#34;5&#34; style=&#34;font-family: Courgette; font-size: medium; font-weight: bold;&#34;&gt;
        &lt;span class=&#34;gt_column_spanner&#34;&gt;Distribution of Results&lt;/span&gt;
      &lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_center gt_columns_bottom_border&#34; rowspan=&#34;2&#34; colspan=&#34;1&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Most Recent Winner&lt;/th&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;🥇(1st)&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;🥈(2nd)&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;🥉(3rd)&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;💩 (last)&lt;/th&gt;
      &lt;th class=&#34;gt_col_heading gt_columns_bottom_border gt_center&#34; rowspan=&#34;1&#34; colspan=&#34;1&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody class=&#34;gt_table_body&#34;&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Athlete&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;79&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #3FC1C9;&#34;&gt;13.9&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #4BC3CB;&#34;&gt;10.1&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #E8F6F7;&#34;&gt;6.3&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #D6F0F1;&#34;&gt;10.1&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;&lt;img cid=&#34;pkcyvjeqdgnm__temp_ggplot.png&#34; src=&#34;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAu4AAAH0CAMAAABVZliYAAAAulBMVEUAAABmZmZmZoFmZp1mgYFmgZ1mgbZmnc+BZmaBgWaBgYGBgZ2BgbaBnbaBnc+Bts+BtuedZmadgWadgYGdnZ2dnbadts+dtuedz+edz/+2gWa2nYG2nZ22tp22ts+2z8+2z+e25//PnWbPnYHPtoHPtp3Pz7bPz8/P5+fP5//P///ntoHntp3nz53nz7bnz8/n57bn58/n5+fn5//n/+fn////z53/1wD/57b/58//5+f//8///+f///+NxnQNAAAACXBIWXMAAA9hAAAPYQGoP6dpAAAONElEQVR4nO3de3/T9hXAYZxyCWSsrBd6ZzSMdjQr7drBgDnv/20t8lVObAKxjnSs8zx/9JPYov4p+UrWLdatcyjj1tADgP7InULkTiFypxC5U4jcKUTuFCJ3CpE7hcidQuROIXKnELlTiNwpRO4UIncKkTuFyJ1C5E4hcqcQuVNI4txvtQ09GEYhYUfLwv/bpns6kC2fy5lfoXluLlM5V1bp728+09A5DGma+eDUN6MfetgclCS9fHzr7fX80KPnUGRI5QYrdqt5bmL4TPZt3WqeDzZ0IR3F3mp+4Bkis4Hr6DT2dfKaZ6tBw+h21X65+SHnjJyGjCIs9nXymqdtuB7iVu2aZ4ehUugnds2zYaAK+oxd8ywNE8AAtbeaV31Zg/zmB6v9YKq/dTNDDzu9IX5Cg9e+lCSSLd3uN0Pq32WAH0ma2pd676Obtj9mxpQ/1//PIV3tLUHrxo7W3d3MWXezdYD6nvtBftc3c4ON423/ZJi8dyudfc9znes3//F29Jyy62uUzL7fuT2oHoooFX2v86n2tIqs6vucQbWnN/boe5w1tR+K8Tbf31yp/bCMckXf2/yo/SCNLPneTiQO/Xvjxka0mpc7H2QcyfczC4d1AoYdDj/5XoYv9vE47OL7GLvax+WAi+9h4Gofn0PdrJE7N3SIxcePWO3jdXDFhw9X7eN2WMFHD1bto3dI2/FypwOHEnzwMNVexWEEHztItRdyCMHLnc7k34gPHZ/ay0kefOTo1F5R6uADx6b2ohIHHziyoX/qDCbtRrzcCZGzd7kTI+UKXu5ESRi83ImTrne5EyjbCl7uhMoVvNwJlql3uRMt0Qpe7sRL07vc6UGW3uVOH5Js0MidfqToXe70JMMKXu70Zvje5U5/Bl/By50+Ddy73OnVsL3LnX4N2rvc6dmQvcudvg3Yu9zp3XAHaOTOAIbqXe4MYaDe5c4ghuld7gxjkN7lzkCG6F3uDGWA3uXOYPrvXe4Mp/fe5c6A+j7hJHcG1W/vcmdYvfYudwbWZ+9yZ2Byp5Iee5c7g+uvd7kzvN56lzsJ9NW73Mmgp97lTgr99C53cuild7mTRB+9y50seuhd7qQR37vcSUPuVBLeu9xJJLp3uZNJcO9yJ5XY3uVOKnKnktDe5U4ykZ9OIHeyiUtS7qQTl6TcSScuSbmTTlyScieduCTlTjpxScqddOKSlDvpxCUpd9KJS1LupBOXpNxJJy5JuZNOXJJyJ524JOVOOnFJyp104pKUO+nEJSl30olLUu6kE5ek3EknLkm5k05cknInnbgk5U46cUnKnXTikpQ76cQlKXfSiUtS7qQTl6TcSScuSbmTTlyScieduCTlTjpxScqddOKSlDvpxCUpd9KJS1LupBOXpNxJJy5JuZNOXJJyJ524JOVOOnFJyp104pKUO+nEJSl30olLUu6kE5ek3EknLkm5k05cknInnbgk5U46cUnKnXTikpQ76cQlKXfSiUtS7qQTl6TcSScuSbmTTlyScieduCTlTjpxScqddOKS
lDvpxCUpd9KJS1LupBOXpNxJJy5JuZNOXJJyJ524JOVOOnFJyp104pKUO+nEJSl30olLUu6kE5ek3EknLkm5k05cknInnbgk5U46cUnKnXTikpQ76cQlKXfSiUtS7qQTl6TcSScuSbmTTlyScieduCTlTjpxScqddOKSlDvpxCUpd9KJS1LupBOXpNxJJy5JuZNOXJJyJ524JOVOOnFJyp104pKUO+nEJSl30olLUu6kE5ek3EknLkm5k05cknInnbgk5U46cUnKnXTikpQ76cQlKXfSiUtS7qQTl6TcSScuSbmTTlyScieduCTlTjpxScqddOKSlDvpxCUpd9KJS1LupBOXpNxJJy5JuZNOXJJyJ524JOVOOnFJyp104pKUO+nEJSl30olLUu6kE5ek3EknLkm5k05cknInnbgk5U46cUnKnXTikpQ76cQlKXfSiUtS7qQTl6TcSScuSbmTTlyScieduCTlTjpxScqddOKSlDvpxCUpd9KJS1LupBOXpNxJJy5JuZNOXJJyJ524JOVOOnFJyp104pKUO+nEJSl30olLUu6kE5ek3EknLkm5k05cknInnbgk5U46cUnKnXTikpQ76cQlKXfSiUtS7qQTl6TcSScuSbmTTlyScieduCTlTjpxScqddOKSlDvpxCUpd9KJS1LupBOXpNxJJy5JuZNOXJJyJ524JOVOOnFJyp104pKUO+nEJSl30olLUu6kE5ek3EknLkm5k05cknInnbgk5U46cUnKnXTikpQ76cQlKXfSiUtS7qQTl2Rk7pCN3ClE7hQidwqRO4XInULkTiFypxC5U4jcKUTuFCJ3CpE7hcidQuROIV3mPv12cq/D/x1j9OZ4Mnk02Kt3mfvZ5Lrcp8//8qrDF+Tw7J37Xg11mPvryXW5v/xsclfute2b+34NdZd7Mx/vz/1iY0fu1e2Z+54NdZb7r03tcucao8h9+mwykTvXG0PuFxtUk8nRfblznRHkftqs2O/+51u5c51x5H70+NX0Uu7T5w+adf7J0/noziZL33Txmhym3bm//eHB8SyYJ+ucu26oo9wfvpgveK3c/3W8HNrRV50MlTHYlfvbz1Z9LII5D2iok9xfvmj+u5n7emST+fzJnfOdub8+nlwOJqKhDk8zbeTezNVR8xY0fd589eNyAtvuxW3P/X9/u0j44T8vvpr++7uLLz/55Tykoajcz5bjm59tfbScQO7Fbc+9WW1/3fpm1k5AQ1G5n65HtXpc7mzPfbOMZpJmYyWgocDcl0vmxgRyL+4DDkS2cu+6oajcm3efo4cvrkwg9+Kuy/3db98fL3ZFAxqKyr35pnHy91cbE8i9uN25v/vpi8/vt4+8BDQUlXvrOOrtx60NMLkXtyv3l63j7ssDjd03FJb7/OjRxmFUubMr9/UR9ZMnvx8vj6t33lBc7ufNaeHlaB2ZYW577q9nZ06//PmPV+frXdWZbhsKzb158OV3szn58VzunO/IvTnLdPS0PUn7rGmHDUXn3jx+ungrkjvbc1+dRVp9d+kiga4aCsq9WVxX42++uXcud8535946wn46zz2ioaDcN0a1HLfc+YDc3yyOu0c0FHjNzGquzlrHUeVe3Nbcmwfvtb5eX//YcUNRua8vcXvX/BnrfISnsx2S6Z/dvSaH5s3mlb7zNmYnlO42tcyuh1xkHtBQ2K7qxmwt3qnOWqcQqGlr7vMDkUufLkPqvqG4IzNv1qfJ7iyue5gtrkN+ZhqD25774nNbFrGsLoXsvKHIA5EvP2/m4fbs7WgxybPmkb9295ocmh25n0+f31/F0tpz7bghnwBMIXKnELlTiNwpRO4UIncKkTuFyJ1C5E4hcqcQuVOI3ClE7hQidwqRO4XInULkTiFypxC5U4jcKUTuFCJ3CpE7hcidQuROIXKnkI/J/XTjs85unzz+s/WMT7Jmlzzh3Dz3xsNXq2fkzi55wtkv98mdX5bPyJ1d8oSzZ+6Lj/yVO++RJ5yPzL01
tnfzW7x+c/UZ2JAnnJvnvvhk+XvbnoGWPOHsk/vsFiOf/LLtGVjLE85euTf365jdZkHuvEeecPbKfXYbtMujfvvDg2bb7OjkSXvad7N7kTx43H5sOrsTydHJU0vKuOUJp+u1+9v1zaMmR1+tJny2euzr1b9e332q9SAjlCecvXJvdjk2N8Feb95o6tF67laWtwg8a0/o7mRjliecvXdVZ48sn1nf+HU6ux/sJ6uTCRcL4qvz6T/WjzWDPmrOJk9/vT9x88lRyxPOPrnP7hn4qP1MM5TV+8vZ8naBzXSLs2i/Lv9F89jyfzY9Xd9YkBHKE86Nc58vhYtXWzyzecP6ZmDfbAy/NcXpamk9ny/bG3dkZVTyhLPvRQSPrs7P5VFv3F54MQfNOFtvQ2cOZI5ZnnD2zP3e6plLr/rut++PF7sXzegvb2FdPNZ+F3pta2bM8oSzV+6r40DtUb/76YvP77f3pl+39qpbw7zs8iSMRp5wbpz70YMvn7afWYz65WdXhrJtCdw4mCT3kcsTzn5nVa88sx7NyZPfj+VOI0843eY+e6s5+vLnP5pv3rx/1Ov9a0YuTzid5t7sNh+t3qlao76yBJ7ZNy0kTzid5t6Mb70rvRztxg724uDS5pSMW55wus69teidTrYcPl3MQrM4O9JeRZ5w4nJ/szx8urras7F8MzrdeKOybTNqecLpNPdmoMvFcTboK5c5rKZovlgdfW0uiHARwXjlCafT3GfXa95trmubXxex3Mxqjh7dedpcw3a8WhpnR5QerqZ1nGbE8oQTcCBy6dPVpte0fZ5h+U70rD2tTZkxyxNOx6eZ1n9ocudFa/rZ9crz4W37o5RmYsYrTzgd534+nf1p4e3Zm017B2T3nxwuJma88oTjE4ApRO4UIncKkTuFyJ1C5E4hcqcQuVOI3ClE7hQidwqRO4XInULkTiFypxC5U4jcKUTuFCJ3CpE7hcidQuROIXKnELlTiNwpRO4UIncKkTuFyJ1C5E4hcqcQuVOI3ClE7hQidwqRO4XInULkTiFypxC5U4jcKUTuFCJ3CpE7hcidQuROIXKnELlTiNwpRO4UIncKkTuFyJ1C5E4hcqcQuVOI3ClE7hQidwqRO4XInULkTiFypxC5U4jcKUTuFCJ3CpE7hcidQuROIXKnELlTiNwpRO4UIncKkTuFyJ1C5E4hcqcQuVOI3ClE7hQidwqRO4XInULkTiFypxC5U4jcKUTuFCJ3CpE7hcidQuROIXKnELlTiNwpRO4UIncKkTuFyJ1C5E4hcqcQuVOI3ClE7hQidwqRO4XInULkTiFyp5D/AwulKuKNEL4TAAAAAElFTkSuQmCC&#34; style=&#34;height:120px;&#34;&gt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-size: 13px; text-align: left; vertical-align: middle; font-style: italic;&#34;&gt;Adam Rippon&lt;br&gt; (Season 26)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Musician&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;38&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #A0DCE0;&#34;&gt;7.9&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #3FC1C9;&#34;&gt;10.5&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #C5E9EC;&#34;&gt;15.8&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #C9EBED;&#34;&gt;13.2&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;&lt;img cid=&#34;rgoufbwhykzt__temp_ggplot.png&#34; src=&#34;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAu4AAAH0CAMAAABVZliYAAAAulBMVEUAAABmZmZmZoFmZp1mgYFmgZ1mgbZmnc+BZmaBgWaBgYGBgZ2BgbaBnbaBnc+Bts+BtuedZmadgWadgYGdnZ2dnbadts+dtuedz+edz/+2gWa2nYG2nZ22tp22ts+2z8+2z+e25//PnWbPnYHPtoHPtp3Pz7bPz8/P5+fP5//P///ntoHntp3nz53nz7bnz8/n57bn58/n5+fn5//n/+fn////z53/1wD/57b/58//5+f//8///+f///+NxnQNAAAACXBIWXMAAA9hAAAPYQGoP6dpAAAP2ElEQVR4nO3da3vTRh6GcWxOgSxbFgo9UxqWdiFL2m03IWTt7/+1NiNLspTYOVnPzCP979+LXiRx7dHoti3LsnVvCYRxr/QAgHzIHYGQOwIhdwRC7giE3BEIuSMQckcg5I5AyB2BkDsCIXcEQu4IhNwRCLkjEHJHIOSOQMgdgZA7AiF3BELuCGS0ud/rKj0YjMT4SqkL/9xF97iRcQVyqfOLaB5XGU8b16ZO87jOSLK4cerd5ksPGnbG0MTtW18/ypceO6zY93DX1nmQv+TeFUqPLRPz5dytdYpf2bQz6/IkRajeewEHiL1dmaWXpYxrO99QfekxCzkv2xAP7d01WXp5crtd6f3mS49dxHe5ho29Xo2lFyqbu6bemaspzpbrIg0fe70WSy9YBrum3pmtqU2X5/KIYq9XoecyD2So1NfTVXqJhmS5MLrYm3VYeglVFI8TU3qAcFwQde2fpxn8YNswE54vv8UQbsj0V+BUVuGKetqmMV12y5An9mYVll7agQgf16c1XW5LkLP2aazBbM+HnyfwlOg1+nwrrrsGSy/1LpQb7NsmrPQy78Bp7AVir1eg0yzcRpkZG+10WeVeKPbxrsFSDw8jna7EZ9wlax/jGiwX+2q6xjZfFZtBF659ZGsw+xb7xvkqPQu35zLk8msvGUnxBq1XxjFbXSYDNll/n0ewCh0e2Fv2s3WBx3CNVqD5Q7xT6xXnybrMYbB2q9B1HVo9sLesHx4uMBip4yp0XIeWra/4TdYW5cdpuxK9ijeOPXGaqisUH6b1WnQJ3nMrpsdlqq5WepDuq9HgIX4ErVfKz9T1Cg9xDCuybPEjab3iH3zZAY5lVRYrfkyxJ+7Bk/sNFSh+LFsxPd7BFx3cyNZm1uJH2XrFufeim6WlV8ztZSp+vK0nxg/w5UY21jUqL37crVdsey82sDGvUuFHNifQemKw/3ajUqMa/UpVJD+R1lcseyf3HQxZfP7PWKs5PsCX2p9cel0MZZBvyp1e6it+vZcZ0cRW7i7NTzX1il3vRQY0yfV767O9XH8GmfFz26ApMZwpr+IbnNqruUjpoebhFTy5a1w6td1a6aHl5tR7gbGEW9/RGT3A5x8Jtcdj03v+w/xKTz0KcOk99zjibboiMdmgyTwKYg/Love8g6D2wBwe4LMOgdpjK987uSOf4r3nHAC1h1d6gybnpy9LzzUMFP4ugIw3VXqm4aDslwHku6XS8wwPRb8NINsNlZ5luCjYO7kju3IvWHPdMLWjo9hHpDPdTOn5hZdSn5HOcyulZxduymzQ5PkOuNJzC0NFPkiX4zZKTywslfgkXYabKD2tMFXgo3QZbqL0rMJV/s/S6W+h9JzCV+4XrOSOojJ/vkh+A6XnE97yfuJCff2lZxPusn7kQn39pScT9nIehC6++tJTiRHIeBS69tpLTyRGId9xudprLz2PGIdsB+ZKr7z0LGIkppA7teOmch2ILrzq0lOIEcl0fmbdVZeeQIxLloNzdVddevowMjkOV9RddenZw9hkOIBL
d9WlJw+joz+CS3fVpecO4yM/pkV31aWnDiOkPqhFd9WlZw5jJH6bX3fVpScOY0TuiER7WIvuqkvPG8ZJ+f4qucONLklyhx1dkuQOO7okyR12dEmSO+zokiR32NElSe6wo0uS3GFHlyS5w44uSXKHHV2S5A47uiTJHXZ0SZI77OiSJHfY0SVJ7rCjS5LcYUeXJLnDji5JcocdXZLkDju6JMkddnRJkjvs6JIkd9jRJUnusKNLktxhR5ckucOOLklyhx1dkuQOO7okyR12dEmSO+zokiR32NElSe6wo0uS3GFHlyS5w44uSXKHHV2S5A47uiTJHXZ0SZI77OiSJHfY0SVJ7rCjS5LcYUeXJLnDji5JcocdXZLkDju6JMkddnRJkjvs6JIkd9jRJUnusKNLktxhR5ckucOOLklyhx1dkuQOO7okyR12dEmSO+zokiR32NElSe6wo0uS3GFHlyS5w44uSXKHHV2S5A47uiTJHXZ0SZI77OiSJHfY0SVJ7rCjS5LcYUeXJLnDji5JcocdXZLkDju6JMkddnRJkjvs6JIkd9jRJUnusKNLktxhR5ckucOOLklyhx1dkuQOO7okyR12dEmSO+zokiR32NElSe6wo0uS3GFHlyS5w44uSXKHHV2S5A47uiTJHXZ0SZI77OiSJHfY0SVJ7rCjS5LcYUeXJLnDji5JcocdXZLkDju6JMkddnRJkjvs6JIkd9jRJUnusKNLktxhR5ckucOOLklyhx1dkuQOO7okyR12dEmSO+zokiR32NElSe6wo0uS3GFHlyS5w44uSXKHHV2S5A47uiTJHXZ0SZI77OiSJHfY0SVJ7rCjS5LcYUeXJLnDji5JcocdXZLkDju6JMkddnRJkjvs6JIkd9jRJUnusKNLktxhR5ckucOOLklyhx1dkuQOO7okyR12dEmSO+zokiR32NElSe6wo0uS3GFHlyS5w44uSXKHHV2S5A47uiTJHXZ0SZI77OiSJHfY0SVJ7rCjS5LcYUeXJLnDji5JcocdXZLkDju6JMkddnRJkjvs6JIkd9jRJUnusKNLktxhR5ckucOOLklyhx1dkuQOO7okyR12dEmSO+zokiR32NElSe6wo0uS3GFHlyS5w44uSXKHHV2S5A47uiTJHXZ0SZI77OiSJHfY0SVJ7rCjS5LcYUeXJLnDji5JcocdXZLkDju6JMkddnRJkjvs6JIkd9jRJUnusKNLktxhR5ckucOOLklyhx1dkuQOO7okyR12dEmSO+zokiR32NElSe6wo0uS3GFHlyS5w44uSXKHHV2S5A47uiTJHXZ0SZI77OiSVOYOuCF3BELuCITcEQi5IxByRyDkjkDIHYGQOwIhdwRC7giE3BEIuSMQckcg5I5Ahsx98f3s8YBXhyk63ZvNnhe79SFzP5xdl/vi/d+OB7xBjM/Oue/U0IC5n8yuy/3oxewRuce2a+67NTRc7mk5rs79fGOH3KPbMfcdGxos90+pdnLHNSaR++LdbEbuuN4Ucj/foJrN5k/IHdeZQO4H6YH90X+/J3dcZxq5z18fLy7kvnj/ND3m779dje5w1vhuiNvEOG3P/ctPT/eqYN6scx66oYFyf/Zhdcfr5P7vvWZo828GGSqmYFvuX160fdTBLAUNDZL70Yf0337u65HNVstH7lhuzf1kb3YxGEVDA77N1Ms9LdU8PQUt3qd//dJcgG334Dbn/r9/nCf87F/n/1r854fzf97/uJQ0pMr9sBnf6t3W580FyD24zbmnh+1vOz9U7QgaUuV+sB5V+3tyx+bc+2Wki6SNFUFDwtybe2bvAuQe3A12RHZyH7ohVe7p2Wf+7MOlC5B7cNflfvb7j3v1S1FBQ6rc0w/J/s/HvQuQe3Dbcz/79euXT7p7XgQNqXLv7Ed98LqzAUbuwW3L/aiz373Z0Th8Q7LcV3uPertRyR3bcl/vUd9/88des1998IZ0uS/T28LNaNkzg5XNuZ9U75y++u3P4+X6pWpl
2IakuadfHv1QLckvS3LHckvu6V2m+dvuRbrvmg7YkDr39PuD+qmI3LE59/ZdpPanCwcJDNWQKPd0d23Hn354vCR3LLfn3tnDfrDKXdGQKPfeqJpxkztukPtpvd9d0ZDwmJl2qQ47+1HJPbiNuadfPu78e33848ANqXJfH+J2lj7GuhrhQfWCZPHXcLeJsTntH+m7aqN6Q+lRqqU6HrLOXNCQ7KVqb7HqZ6rDzlsIiGlj7qsdkY2vmpCGb0i3Z+Z0/TbZw/q4h+ruWvI701Dc5tzr722pY2kPhRy8IeWOyKOXaRkeVE9H9UXepd/8fbjbxNhsyX25eP+kjaXzynXghvgGYARC7giE3BEIuSMQckcg5I5AyB2BkDsCIXcEQu4IhNwRCLkjEHJHIOSOQMgdgZA7AiF3BELuCITcEQi5IxByRyDkjkDIHYGQOwIhdwRC7gjkNrkf9L7r7MH+6786f+GbrLGNTzh3zz15dtz+hdyxjU84u+U+e/ix+Qu5YxufcHbMvf7KX3LHFXzCuWXunbGdrU7x+t3lvwA9PuHcPff6m+Ufb/oL0OETzi65V6cYuf9x01+ANZ9wdso9na+jOs0CueMKPuHslHt1GrSLo/7y09O0bTbff9O97Fl1LpKnr7u/W1RnIpnvv+WeMm0+4Qz96P5lffKo2fyb9oLv2t992/7f67NPdX6JCfIJZ6fc00uO/ibYSf9EU8/XS9dqThF42L0gZyebMp9wdn6pWv2m+cv6xK+L6nyw99s3E87viMfLxT/Xv0uDnqd3kxefnsw4+eSk+YSzS+7VOQOfd/+ShtI+vxw2pwtMl6vfRfvU/B/pd82VLQ7WJxbEBPmEc+fcV/fC+tbqv/RPWJ8G9l1v+J1LHLT31uXqvt07IysmxSecXQ8ieH55eS6Ound64XoJ0jg7T0OH7MicMp9wdsz9cfuXC7d69vuPe/XLizT6i1tY57/rPgudsDUzZT7h7JR7ux+oO+qzX79++aT7avqk86q6M8yLLl4Ek+ETzp1znz999bb7l3rURy8uDWXTPbC3M4ncJ84nnN3eVb30l/Vo9t/8sUfuSHzCGTb36qlm/uq3P9MPp1ePev36GhPnE86guaeXzfP2maoz6kv3wENemwbiE86guafxrV9KN6PtvcCudy71L4lp8wln6Nw7d72D2Ybdp/UipLsze9qj8AlHl/tps/u0PdozaZ6MDnpPVGzbTJpPOIPmngba3B2rQV86zKG9RPpHu/c1HRDBQQTT5RPOoLlXx2s+Sse1rY6LaDaz0t6jh2/TMWx77b2x2qP0rL0s+2kmzCccwY7Ixlftptei+z5D80z0rntZNmWmzCecgd9mWn/Q5OGHzuWr45VXw9v0oZR0YUyXTzgD575cVB8tfFA92XRfgGz/yGF9YUyXTzh8AzACIXcEQu4IhNwRCLkjEHJHIOSOQMgdgZA7AiF3BELuCITcEQi5IxByRyDkjkDIHYGQOwIhdwRC7giE3BEIuSMQckcg5I5AyB2BkDsCIXcEQu4IhNwRCLkjEHJHIOSOQMgdgZA7AiF3BELuCITcEQi5IxByRyDkjkDIHYGQOwIhdwRC7giE3BEIuSMQckcg5I5AyB2BkDsCIXcEQu4IhNwRCLkjEHJHIOSOQMgdgZA7AiF3BELuCITcEQi5IxByRyDkjkDIHYGQOwIhdwRC7giE3BEIuSMQckcg5I5AyB2BkDsCIXcEQu4IhNwRCLkjEHJHIOSOQMgdgZA7AiF3BELuCITcEQi5IxByRyDkjkDIHYGQOwIhdwRC7giE3BEIuSMQckcg5I5AyB2BkDsCIXcEQu4IhNwRCLkjEHJHIOSOQMgdgZA7AiF3BELuCITcEQi5IxByRyDkjkD+D16KWeyBATqFAAAAAElFTkSuQmCC&#34; style=&#34;height:120px;&#34;&gt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-size: 13px; text-align: left; vertical-align: middle; font-style: italic;&#34;&gt;Kellie Pickler&lt;br&gt; (Season 16)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Actor/Actress&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;130&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #A2DDE1;&#34;&gt;7.7&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #61C9D0;&#34;&gt;9.2&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #E6F6F6;&#34;&gt;6.9&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #ECF8F9;&#34;&gt;4.6&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;&lt;img cid=&#34;ofhjcdsqubvr__temp_ggplot.png&#34; src=&#34;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAu4AAAH0CAMAAABVZliYAAAAulBMVEUAAABmZmZmZoFmZp1mgYFmgZ1mgbZmnc+BZmaBgWaBgYGBgZ2BgbaBnbaBnc+Bts+BtuedZmadgWadgYGdnZ2dnbadts+dtuedz+edz/+2gWa2nYG2nZ22tp22ts+2z8+2z+e25//PnWbPnYHPtoHPtp3Pz7bPz8/P5+fP5//P///ntoHntp3nz53nz7bnz8/n57bn58/n5+fn5//n/+fn////z53/1wD/57b/58//5+f//8///+f///+NxnQNAAAACXBIWXMAAA9hAAAPYQGoP6dpAAAPmklEQVR4nO3da5vTxh2G8bU5LWxpKARypmQpSWELSZNCYWt//6/VHdmS5dOerL/mmXnu34tc4HXk0fi2bMtm52gO2DjKPQBgPOQOI+QOI+QOI+QOI+QOI+QOI+QOI+QOI+QOI+QOI+QOI+QOI+QOI+QOI+QOI+QOI+QOI+QOI+QOI+QOI+Qe5WiH3GOyxz0wpH7a/92B8DNj3gdwaeK7UX0WTPghblz5rupz74QTJvuWDux880Cfe3dMMM83dugRfV/zuffLAZN8AyGhU/yImOFrigy9V3zu3awc83sNo6TeBk/ygZjby4W+fNmbfO69rhYzu1+G0gk+FvO6W77UCT4Qs7otd+pt8Nw3g2NKN0ikvkTwQ2NCezQO630EPyymc0kv9QWCHxKTOddNfYG7aDjMpXTqDd60DsZ8IvVbXzC/mwZjPI/aL2E2cIAfhO0sFpT6gu09NSTPSSzpuN7hAH84wykssvWG4Z01MLMZLOr1+jbO0RzIavqKTn3J6g4bnM/slX1cX+EAfwCPuSv8NcwGj/sshMPU1ZR6w+FOi1H9zFV1XG/xguaWKp+3GltvEPytVDxrdb1g30Lwt1DtnFWd+kK1912cOqes7uN6p847L1KFM2bSesILmhuqbr58Wm9Ud//Fqmu6jA7srbruwGg1zZZf6wlfG7uBaqbK8MDeqeZODFfHTDm3ntRxL46ghokybz3hBc31lD9NxN4o/44cQ+Gz5P4qpocD/DUUPUe0vobgr1TuDHFg31buvTmSUieI1nfiAH+5IqeHA/t+Rd6hoylwdmj9UhzgL1Hc3BD7lYq7T8dT2NQQ+3XwNZp9ypoXYr+usu7X0RQ0LbxBvQkO8LsUMym0flMEv62QKSH22yD4TUVMCLHfVhF374hKmA9ivz1O0qyRnwzeoB6K4FfUp4LWB8AhvqU9DxzZh0LwDeVZIPYhcYifS+dO7EOjeN39p/YI5sWL7jznY+KkXwQuereH09xvWo9mmrziPnNkH8fRkd2BXnBniX1cR0c+3evtI7XnctSXO4MYcrtF7RrqDF9rZzghI6im7KV2gtaV1ZC80g5Qu7zSD/NCY6f2QhRcvM7Aqb0gpR7kZQZN7aUpMXiVIVN7gco7xIuMl9oLVVjwGqOl9nIVdYgXGCqfLZVOIKJryj9SWi9fMQf47OOk9ioUEnzuUVJ7LXKXdC25B0nu1SjhAJ95iNReE/3g8w6Q2iujHnzW4VF7fbR7zzk6aq+RdO/5BseHS5VS7j3b2Ii9WsIv4HONjNprJtt7poFRe91UD/B5hkXt1dPsndwRQ7L3LIOidgeKvecYE7V7EOw9w5Co3YVe7+OPiNp9yPVO7gik1vvY4+GrA17Eeh95OMTuRusDp3EHQ+2GlHofdSzUbkmod3JHOJ3exxwJtbuS6X3EgVC7L5XeyR1jEOl9vGFQuzO33Kndm0bvY42C2t1J9E7uGInC56sjDYHaoXCAH2cE1I4ke++jDIDa0bDIndqxlLv3EW6f2tHJ/QvWR7iJ3FMMIZl/
w3r8LeSeYEjJ+yvWw28g9/RCC7nDSdYlBaK3n3tyISdj78E3Te3Ylu/rBOSO8VWaO7Vjp2yraIRuPPesQlWuZTQit03u2CfTwgKBm849oxBWXe65JxTS8iykEbfp3PMJbVlW0ojbdO7phLgcS2nEbTr3bEIcucNJhrU04jadezIhb/TeyR0Zjb54TNymc08lCjD26jFxm849kyjByMvHxG0690SiBOQOJ+OuHxO36dzziDKMuoBM3KZzTyMKMeaSGnGbzj2LKAS5w8mIS8jEbTr3JKIY460hE7fp3HOIYpA7nIy2iEzcpnNPIQoy1ioycZvOPYMoyUjLyMRtOvcEoijjrCMTt+nc84eyjLKQTNymc08fykLucDLGSjJxm849eyjNCEvJxG069+ShNOQOJ/FrycRtOvfcoTzhi8nEbTr31KE85A4n0avJxG0698yhRMHLycRtOvfEoUTkDiexyyfFbTr3vKFMkctQkjvUxCVJ7pATlyS5Q05ckuQOOXFJkjvkxCVJ7pATlyS5Q05ckuQOOXFJkjvkxCVJ7pATlyS5Q05ckuQOOXFJkjvkxCVJ7pATlyS5Q05ckuQOOXFJkjvkxCVJ7pATlyS5Q05ckuQOOXFJkjvkxCVJ7pATlyS5Q05ckuQOOXFJkjvkxCVJ7pATlyS5Q05ckuQOOXFJkjvkxCVJ7pATlyS5Q05ckuQOOXFJkjvkxCVJ7pATlyS5Q05ckuQOOXFJkjvkxCVJ7pATlyS5Q05ckuQOOXFJkjvkxCVJ7pATlyS5Q05ckuQOOXFJkjvkxCVJ7pATlyS5Q05ckuQOOXFJkjvkxCVJ7pATlyS5Q05ckuQOOXFJkjvkxCVJ7pATlyS5Q05ckuQOOXFJkjvkxCVJ7pATlyS5Q05ckuQOOXFJkjvkxCVJ7pATlyS5Q05ckuQOOXFJkjvkxCVJ7pATlyS5Q05ckuQOOXFJkjvkxCVJ7pATlyS5Q05ckuQOOXFJkjvkxCVJ7pATlyS5Q05ckuQOOXFJkjvkxCVJ7pATlyS5Q05ckuQOOXFJkjvkxCVJ7pATlyS5Q05ckuQOOXFJkjvkxCVJ7pATlyS5Q05ckuQOOXFJkjvkxCVJ7pATlyS5Q05ckuQOOXFJkjvkxCVJ7pATlyS5Q05ckuQOOXFJkjvkxCVJ7pATlyS5Q05ckuQOOXFJkjvkxCVJ7pATlyS5Q05ckuQOOXFJkjvkxCVJ7pATlyS5Q05ckuQOOXFJkjvkxCVJ7pATl2Rk7oAacocRcocRcocRcocRcocRcocRcocRcocRcocRcocRcocRcocRcocRcoeRIXOffT95MODmUKPPx5PJk2y3PmTuZ5Orcp+9+cvHAW8Q5Tk494MaGjD3T5Orcv/wdHKf3L0dmvthDQ2Xe9qPy3O/eLFD7u4OzP3AhgbL/X2qndxxhSpyn72eTMgdV6sh94sXVJPJ9CG54yoV5H6aDuz3//M9ueMqdeQ+ffFxtpH77M2jdMw/ebUY3dmk9d0Qt4ky7c/9y0+PjptgXq5yHrqhgXJ//HbxwOvl/q/jdmjTbwYZKmqwL/cvT7s+lsHMAxoaJPcPb9N/13NfjWyy2D9yx3xv7p+OJ5vBRDQ04MdMa7mnvZqmp6DZm/Snn9sr8Nrd3O7c//e3i4Qf//PiT7N//3Dxxzvv5iENReV+1o5v8Wnrk/YK5G5ud+7psP1t7y9NOwENReV+uhpVdzm5Y3fu62Wkq6QXKwENBebePjLXrkDu5q5xIrKX+9ANReWenn2mj99uXYHczV2V+/lvPx4v34oGNBSVe/pLcvL3j2tXIHdz+3M//+XrZw/7Z14CGorKvXce9e6L3gswcje3L/cPvfPu7YnG4RsKy31x9mjtNCq5Y1/uqzPqJy9/P27Pqw/eUFzu8/SxcDtazsxgYXfun5pPTp//+sfH+eqtamPYhkJzTxd++KHZk5/n5I75ntzTp0zTV/2r9D81HbCh6NzT5afLpyJyx+7cu0+Rur9t
fElgqIaCck8P12786S8P5uSO+f7ce2fYTxe5RzQUlPvaqNpxkzuukfvn5Xn3iIYCvzPT7dVZ7zwquZvbmXu68EHvz6vvPw7cUFTuq6+4nad/xroY4WnzhmT253C3idJ8Xv+m76KN5gOl+6mW5vuQy8wDGgp7q7q2W8tnqrPeRwjwtDP3xYnI1ldtSMM3FHdm5vPqY7J7y+89NA/XnL8zDdntzn35e1uWsXRfhRy8ocgTkR+epX242zwdLa/yOl3y1+FuE6XZk/t89uZhF0vvnevADfEbgGGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GHkJrmfrv2us7snL/7s/YTfZI19dMK5fe7J44/dT8gd++iEc1juk3vv2p+QO/bRCefA3Je/8pfccQmdcG6Ye29s54slXr/b/gmwRiec2+e+/M3yD3b9BOjRCeeQ3JslRu682/UTYEUnnINyT+t1NMsskDsuoRPOQbk3y6BtjvrLT4/Sa7Ppycv+dc+btUgevehfNmtWIpmevOKRUjedcIY+un9ZLR41mX7TXfF1d9m33f+9Wn2qdyEqpBPOQbmntxzrL8E+rS809WS1d512icCz/hVZnaxmOuEc/Fa1uaT9yWrh11mzHuyd7sOEiwfix/nsH6vL0qCn6dPk2fuHExafrJpOOIfk3qwZ+KT/kzSU7vnlrF0uMF1v+Sna+/b/SJe1G5udrhYWRIV0wrl17otH4fLWlj9ZX7A+Dey7teH3rnHaPVrni8f22oqsqIpOOId+ieDJ9v5sjnpteeHlHqRx9p6GzjiRWTOdcA7M/UH3k41bPf/tx+Pl24s0+s1XWBeX9Z+FPvFqpmY64RyUe3ceqD/q81++fvaw/276U+9ddW+YmzavgmrohHPr3KePnr/q/2Q56g9Pt4ay6xG4djKJ3CunE85hn6pu/WQ1mpOXvx+TOxKdcIbNvXmqmT7/9Y/0l8+Xj3r1/hqV0wln0NzT2+Zp90zVG/XWI/CM96ZGdMIZNPc0vtVb6Xa0a2+wlyeX1q+JuumEM3TuvYfe6WTH6dPlLqSHM2faXeiEE5f75/b0afdtz6R9Mjpde6LitU3VdMIZNPc00Pbh2Ax662sO3TXSH7qzr+kLEXyJoF464Qyae/N9zfvpe22L70W0L7PS2aN7r9J32I67R2NzRulxd13O01RMJ5yAE5Gtr7qXXrP+5wztM9Hr/nV5KVMznXAG/php9Q9N7r3tXb/5vvJieLv+UUq6MuqlE87Auc9nzT8tvNs82fTfgOz/J4fLK6NeOuHwG4BhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxhhNxh5P/opnm7f+dlrAAAAABJRU5ErkJggg==&#34; style=&#34;height:120px;&#34;&gt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-size: 13px; text-align: left; vertical-align: middle; font-style: italic;&#34;&gt;Jordan Fisher&lt;br&gt; (Season 25)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Reality TV Star&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;26&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #A2DDE1;&#34;&gt;7.7&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #81D2D7;&#34;&gt;7.7&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #E3F4F5;&#34;&gt;7.7&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #EFF9FA;&#34;&gt;3.8&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;&lt;img cid=&#34;kgwaihvujrxc__temp_ggplot.png&#34; src=&#34;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAu4AAAH0CAMAAABVZliYAAAAulBMVEUAAABmZmZmZoFmZp1mgYFmgZ1mgbZmnc+BZmaBgWaBgYGBgZ2BgbaBnbaBnc+Bts+BtuedZmadgWadgYGdnZ2dnbadts+dtuedz+edz/+2gWa2nYG2nZ22tp22ts+2z8+2z+e25//PnWbPnYHPtoHPtp3Pz7bPz8/P5+fP5//P///ntoHntp3nz53nz7bnz8/n57bn58/n5+fn5//n/+fn////z53/1wD/57b/58//5+f//8///+f///+NxnQNAAAACXBIWXMAAA9hAAAPYQGoP6dpAAAQJElEQVR4nO3de3/TRhqG4TjlEMiyZXugZ0rD0i7NlnbbBQNrf/+vtRnZkiXHzsl6Nc/Mc19/9AeJa71jbtuyrMRHS8DGUe4BgOmQO4yQO4yQO4yQO4yQO4yQO4yQO4yQO4yQO4yQO4yQO4yQO4yQO4yQO4yQO4yQO4yQO4yQO4yQO4yQO4yQe9GOenLPUgJupDKtE5/3UP31uHXKs935EMlfgVumLFenPnigzz2qIm6Ugtwo9V7zuefVw01Situ13j7I555aDLdHGW7fOsXvwI1RgrvGvi4+9/g6uCnk3WEvhuD34IYQd3DrBN/DzSBtnNhXwfNPTe7Sxot9VXzu9eTHTaBr3Njn7NKQu6zDX6DuDD73sjJzX7+okNab3r3/wb1Xryoq9ib43IvLyXrxosIe2te9Gz/C+65cVmzsq+JzrzEX24WLinmFerl3039302WrmqT1VfC5l5qF56pVTVe7ae+WixY1zX6Mde+OaxY1bexzyx14vxWrmrx2x+Dd1qtq4h2ZTfC5Fz4ts+WKyhW7Xe9eqxWVL3a33q0WKypr7V478EZLFZVxR6YLPvdtMBmflYrKH/vcqHebhYqSqN2nd5d1ihKp3WYH3mOVqmRqn5s8wFssUtNE5/remEMKDmvUpNV6YtCCwRI16dXu0Hv9K9SkWLtB79UvUJNm7fX3Xvv6NKnWXn3vlS9PkdoRmaG6g6h7dYqUW0+qfsOp5rVJUq99XvUDfMVLk1RA7TX3Xu/KJBVRe8W9V7swSYXUXm/vta5LUjG1V9t7pctSJH388ZI6w6hzVYLKin1eae9VLkpQabHP6zwAX+GSFBVY+7zGB/j6VqSozNor7L26BSkqtfb6eq9tPYrKrb263itbjqKSa6+t97pWI6i4A5DbqiqkqsXoKT72eV2917QWPRXEPq+q94qWoqeO2mvqvZ6V6Kml9op6r2YheuqpvZ7ea1mHnppqr6b3SpYhp4ZDMgN1hFLHKuTUFvu8kt6rWISa6h7aGzWkUsMa1FQZ+7yK3itYgppaa6+h9/JXoKbe2ivovfgFiKlzt71Tei6lzy+m7tjnxfde+Phiqq+99N7Lnl6MQe2F91708Foq323vlJxMybNLcYl9XnTvBY8uxSf2ecm9lzu5FKvaC+692MGlmNVO7tbcai+391LnFmL0InWj0G4KHVuHZezzUn8/cJFDCzGNPSkxnRJn1uH60L5SYDsFjqzDOvZ5ib2XN7EO99rJ3Yl97QX2XtzAKrx321ul5VPavCKIfa2wfgobVwSxd8oKqKxpNfDQ3ldUQUUNq4HYh0pKqKRZJfDQfklBDRU0qgRi36GciMqZVMARD+27FVNRMYMKoPW9SsmolDnz45H9KoV0VMiY+RH71coIqYwps+Oh/TplhFTGlLkR+/WKKKmIIfPieMzNlJBSCTPmRes3VUBLBYyYFY/st6D/49ryA2ZF7LeknpP6fDkR+62p56Q+X0bEfgfiPYmPlw2HY+5IOyjt6XKh9buTLkp6uEyI/SDKSSnPlgex
H0q4KeHRcmCXfQTCTQmPNjlaH4luVLqTTY3WxyNblexgEyP2UameTiA61qSO2IsZn2ZYmlNNidRjSJYlOdR0eFwPI1mW5FDTYB8mlmJaijNNgNQnINiW4EjhSH0ienHpTRTp6IiH9SnJ1SU3UBhCz0AtL7V5xnd0xGN6Nmp5qc0zmiMyVyDWl9g4d3e0Jfe/M1a0TieQGuamttMmb2VKiSnNciXSLpZQY0Kj7Ebi5dOJTGeSy8i8EjqR6UwyQOlVkalMZpANSq+PSmYqc7RIvU4inYmM0eBhvWIaoWlMkZB63SRKkxiCfRgHCqlJzEDrFgRayz8CrdsQiC339ondSO7aMudO7GZy955z+8RuxzZ3Xp9aytx7rs3TuqnMe895tkrttrL2nmPjxG4t68vFDJvMfXsjr4y9T71pXqEi449rT7xhWsc84+P7pBvmkR0r2Q4ITrmt3DcyZGTqfcLNUjs2Mh0Bn2xD1I6+LL1PtVFix5Ysb/lMsxVqx7Zacyd27JLjLc4JNpH7ZoWo6XuP3yK1Y5/Jew/fILVjv6lPJwjeHLvtuNq0vcdujdhxjXpy56Ed15v2rK3Aq859Q6IIk562FXfVuW9GFGLK87birjr3rYhSTNc7uSO/6U5UjLvq3LchyjHZmYpxV537JkQ5yB1OJuqd3CFhmtMJyB0iJjkXPe6qc998KMwUJ6PHXXXuWw+lmeBs9Lirzn3joTjxp6PHXXXu2w7FIXc4ie6d3KEk+seN4q469y2HEgX/vFHcVee+4VCk0N7JHWIi318ld6iJS5LcIScuSXKHnLgkyR1y4pIkd8iJS5LcIScuSXKHnLgkyR1y4pIkd8iJS5LcIScuSXKHnLgkyR1y4pIkd8iJS5LcIScuSXKHnLgkyR1y4pIkd8iJS5LcIScuSXKHnLgkyR1y4pIkd8iJS5LcIScuSXKHnLgkyR1y4pIkd8iJS5LcIScuSXKHnLgkyR1y4pIkd8iJS5LcIScuSXKHnLgkyR1y4pIkd8iJS5LcIScuSXKHnLgkyR1y4pIkd8iJS5LcIScuSXKHnLgkyR1y4pIkd8iJS5LcIScuSXKHnLgkyR1y4pIkd8iJS5LcIScuSXKHnLgkyR1y4pIkd8iJS5LcIScuSXKHnLgkyR1y4pIkd8iJS5LcIScuSXKHnLgkyR1y4pIkd8iJS5LcIScuSXKHnLgkyR1y4pIkd8iJS5LcIScuSXKHnLgkyR1y4pIkd8iJS5LcIScuSXKHnLgkyR1y4pIkd8iJS5LcIScuSXKHnLgkyR1y4pIkd8iJS5LcIScuSXKHnLgkyR1y4pIkd8iJS5LcIScuSXKHnLgkyR1y4pIkd8iJS5LcIScuSXKHnLgkyR1y4pIkd8iJS5LcIScuSXKHnLgkyR1y4pIkd8iJS5LcIScuSXKHnLgkyR1y4pIkd8iJSzIyd0ANucMIucMIucMIucMIucMIucMIucMIucMIucMIucMIucMIucMIucMIucPImLkvvp09HPHqUKP3J7PZ02xbHzP389l1uS9e/e3tiBtEeQ7O/aCGRsz93ey63N98NntA7t4Ozf2whsbLPa3j6twvdnbI3d2BuR/Y0Gi5/5ZqJ3dco4rcFy9nM3LH9WrI/WKHajY7fkTuuE4FuZ+lB/YH//2W3HGdOnI/fvZ2sZX74tXj9Jh/+mI13fms9c0Y20SZ9uf+4YfHJ00wzzc5j93QSLk/eb264/Vy//dJO9rxV6OMihrsy/3DZ10f62CWAQ2Nkvub1+m/w9w3k81W6yN3LPfm/u5kth1MREMjvs00yD2t6jg9BS1epT/91F6AfXdzu3P/3z8uEn7yr4s/Lf7z3cUfP/l1GdJQVO7n7Xyrd1ufthcgd3O7c08P21/3/tK0E9BQVO5nm6m6r5M7duc+LCNdJO2sBDQUmHt7zxxcgNzN3eBAZC/3sRuKyj09+xw/eX3pAuRu7rrcP/7+/cn6pWhAQ1G5p78kpz++HVyA3M3tz/3j
z198/qh/5CWgoajce8dR7z3r7YCRu7l9ub/pHXdvDzSO31BY7qujR4PDqOSOfblvjqifPv/jpD2uPnpDcbkv09vC7bQcmcHK7tzfNe+cfvnLn2+Xm5eqjXEbCs09ffHNd81KflqSO5Z7ck/vMh2/6F+k/67piA1F556+frZ+KiJ37M69exep+9vWSQJjNRSUe7q7dvOnvzxckjuW+3PvHWE/W+Ue0VBQ7oOp2rnJHTfI/f36uHtEQ4HnzHSrOu8dRyV3cztzT1982Pvz5vzHkRuKyn1zitvH9GOsqwnPmhcki7/G2yZK8354pu+qjeYNpQepluZ8yHXmAQ2FvVQdLGv9THXeewsBnnbmvjoQ2fq0DWn8huKOzLzfvE12f33eQ3N3zfk705Dd7tzXv7dlHUt3KuToDUUeiHzzeVrDvebpaH2Rl+krfx9vmyjNntyXi1ePulh6r1xHbojfAAwj5A4j5A4j5A4j5A4j5A4j5A4j5A4j5A4j5A4j5A4j5A4j5A4j5A4j5A4j5A4j5A4j5A4j5A4j5A4j5A4j5A4j5A4j5A4j5A4j5A4jt8n9bPC7zu6dPvur9x1+kzX20Qnn7rknT9523yF37KMTzmG5z+7/2n6H3LGPTjgH5r7+lb/kjivohHPL3HuzfVx9xOs3l78DDOiEc/fc179Z/uGu7wA9OuEcknvzESOf/LrrO8CGTjgH5Z4+r6P5mAVyxxV0wjko9+Zj0Lan/vDD47Rvdnz6vH/Zj81nkTx+1v/aovkkkuPTF9xT6qYTztiP7h82Hx41O/6qu+DL7mtfd//35tOnel9EhXTCOSj39JJjuAv2bvhBU083q+u0HxF43r8gn05WM51wDn6p2nyl/c7mg18XzefBftK9mXBxR3y7XPxz87U09HF6N3nx26MZHz5ZNZ1wDsm9+czAp/3vpFG655fz9uMC0+XW76L91v4f6WvtlS3ONh8siArphHPn3Ff3wvXW1t8ZfmB9Guybwfi9S5x199bl6r49+ERWVEUnnENPInh6eT3bUw8+Xni9gjRn72nonAOZNdMJ58DcH3bf2drqx9+/P1m/vEjTb+9hXXyt/yz0jr2ZmumEc1Du3XGg/tQff/7i80f9V9Pveq+qe2Nu274IqqETzp1zP3785Yv+d9ZTv/ns0ii77oGDg0nkXjmdcA57V/XSdzbTnD7/44TckeiEM27uzVPN8Ze//Jn+8v7qqTevr1E5nXBGzT29bD7unql6U1+6B57z2tSITjij5p7m27yUbqcdvMBeH1waXhJ10wln7Nx7d72z2Y7Dp+slpLszR9pd6IQTl/v79vBpd7Zn0j4ZnQ2eqNi3qZpOOKPmngZt747N0JdOc+gukf7QHX1NJ0RwEkG9dMIZNffmfM0H6by21XkR7W5WOnp0/0U6h+2kuzc2R5SedJflOE3FdMIJOBDZ+rTb9Vr032don4le9i/LrkzNdMIZ+W2mzQ+a3H/du3xzvvJqvF0/lJIujHrphDNy7stF86OF95onm/4LkP0/cri+MOqlEw6/ARhGyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1GyB1G/g8Z0zF0wAeixgAAAABJ
RU5ErkJggg==&#34; style=&#34;height:120px;&#34;&gt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-size: 13px; text-align: left; vertical-align: middle; font-style: italic;&#34;&gt;Kaitlyn Bristowe&lt;br&gt; (Season 29)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Model&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;14&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #A9E0E3;&#34;&gt;7.1&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #8BD5DA;&#34;&gt;7.1&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #FFFFFF;&#34;&gt;0&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #E2F4F5;&#34;&gt;7.1&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;&lt;img cid=&#34;vrtukcawgqsn__temp_ggplot.png&#34; src=&#34;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAu4AAAH0CAMAAABVZliYAAAAulBMVEUAAABmZmZmZoFmZp1mgYFmgZ1mgbZmnc+BZmaBgWaBgYGBgZ2BgbaBnbaBnc+Bts+BtuedZmadgWadgYGdnZ2dnbadts+dtuedz+edz/+2gWa2nYG2nZ22tp22ts+2z8+2z+e25//PnWbPnYHPtoHPtp3Pz7bPz8/P5+fP5//P///ntoHntp3nz53nz7bnz8/n57bn58/n5+fn5//n/+fn////z53/1wD/57b/58//5+f//8///+f///+NxnQNAAAACXBIWXMAAA9hAAAPYQGoP6dpAAAQUElEQVR4nO3da5sSRx6GcZl4GHXduDmYszHjmqyZjckmq6gL3/9r7XRDQ8PAwND1dP2rnvv3IpcDpLqwbpqmweHOHLBxJ/cEgPGQO4yQO4yQO4yQO4yQO4yQO4yQO4yQO4yQO4yQO4yQO4yQO4yQO4yQO4yQO4yQO4yQO4yQO4yQO4yQO4yQO3a7syH3bBKp5X4glS7w6YZKqi/+DiCdHZlPd1Sfe5oDlDx3pHSo9M3mc8/2RKXOGwkd3qvvSj73rE9R5KSR0K1LL7n48maMlE5tvSs+9/xvqbT5Ip3bH8IUH3xZs0U6w1Pvgi+ooYKmioRSxb4sPvfdOVYxE0UyKQ5iCg2+kGkimeSpL4PPfb+OUsYskYoo9mkhO/gS5ohUdLG3wee+e4cVMEUkoo19WsJJmujzQyry2BfF576bNws+PSQyTuzT6IfwoSeHREaLvQ0+9729QeS5IY1RY4/de+CpIY2RYw/de9yZIYX0b6Ae1XvUI/ig00ISWVpfFp/7vu8Uc1ZIIl/sUXsPOSmkkHHXHrb3iHNCArljj9l7wCkhgfyxTyP2Hm9GGCzP6ZgdwtUVbkIYLEjrjWh5RZsPhoqyZ18I1lew6WCoULFPo31kLNRkMFSsXftCpMQizQVDBYx9Gqr3QFPBMGHOx1wTJ7I4M8EwUVtvhKkszEQwSNg9+0KUzKLMA4PEjn0apvcg08AQwXftrRihxZgFhigg9mmQ3kNMAqeLez5mW4Q3nAJMAQOU0norf2z5Z4DTFbNnX8peW/YJ4HSFxT7N33vu7eNkpe3aG7lzy719nKrA2KfZeyf3MpW4a2/lDY7cS1Rs7NPMvZN7gQqOfZq3d3IvTsm79lbG5si9NKXHPs3ZO7kXpoLaM/ZO7iUp5wMyB+TKjtwLUknrjUzdkXsxatmzL+QJj9xLUVXs5I4bVVZ7pt7JvQh1Hcgs5EiP3EtQYezTLL2Te3jVnH3cRu64ptLWG+PHR+7BVVx7ht7JPbRaj2M6Y/92AnKPrPLYp6Pv38k9rtp37a1xAyT3qCxin47cO7kHZRL7dNzeyT0kl117g9zdGcU+HbV3co/HadfeGi9Ccg/HLXZy91XtB2RuNFqF5B6KY+uNsTIk90hcax+td3KPw/I4pjNOiOQehnPs05F6J/cgrHftrTFSJPcY7GMndxueZx+vGaFFcs+P1pf0MZJ7buzZV8i9esTeI6+R3LNi175JnSO550Ts28Q9kns2nI/ZgdwrRes7aYMk9zzYs+8jLZLcsyD2/ZRJknsG7NpvJGyS3EdH7Aco/+6FY2MHYj9I+bcvHBvXEfthyr9+4djYxq79GMoFEI6NLcR+FOUKCMfGBnbtR1KugXBsrPGJgeMpl0E4NlZo/RaU6yAcG0vs2W9FuRLCsbFA7LejXArh2Giwa78t5WIIxwavUE+hXA/h2KD1UygXRDi2O/bsp1EuiXBsb8R+KuWiCMd2RuynUy6LcGxjxD6Acl2EY7vidMwwyqURju2J1odSLo5wbEfEPpxyeYRj+yH2
FJQLJBzbDrEnoVwh4dheeIGainKRhGM7ofV0lMskHNsHsaekXCjh2C6IPS3lUgnHtsAhe3LK1RKObYDWBZTrJRy7esQuoVwx4diVI3YR5ZoJx64ascsoV004drXu8PpUSblywrErRepiyrUTjl0lYpdTrp5w7AoR+wiU6yccuzIcsY9EuYbCsatC6qNRrqJw7HqwXx+TciGFY1eC1kemXEvh2DWg9fEpl1M4dvFoPQvligrHLhytZ6JcU+HYBeOkY0bKdRWOXSpSz0u5tMKxi0Tr2SlXVzh2cTiECUG5wsKxi0LqYShXWTh2MUg9FOVKC8cuA6lHo1xs4djxsVuPSLngwrFDu0PqUSlXXTh2WJQemnLlhWNHxE49PuXyC8eOhtLLoExAOHYg7NQLouxAOHYMlF4aZQzCsbOj9CIpixCOnROll0uZhXDsPO5QeuGUcQjHHh+h10AZiHDsMbFPr4cyE+HYIyH0yihbEY6tR+k1UgYjHFuK0qulrEY4tgql102ZjnBsBUqvnzIf4diJsVM3oWxIOHZClG5E2ZFw7DTYqbtRxiQcOwFKN6TsSTj2QOzVTSmbEo49AKkbU3YlHPtUpO5NmZZw7JPQuj1lXcKxb41DGExNcid1LCgjE459C7SOFWVnwrGPRevoU6YmHPu4CdA6NilrE459zOZpHduUvQnHPrhtYscOyuSEY9+8YVrHbsrqhGPfsFVax17K8IRj790mreMGyvSEY+/ZIrHjRsr4hGPv2hyx4xBlf8Kxr2+M1nGYskDh2FtbInYcRRmhcOyN7dA6jqTMUDj2eiPEjuMpSxSO3W2C1nEbyhaFYy82QOy4HWWNwrHnxI4TKHtUjk3sOIEySeHQuf/aUCZdksrcc/+toVC6JMkd4eiSJHeEo0uS3BGOLklyRzi6JMkd4eiSJHeEo0uS3BGOLklyRzi6JMkd4eiSJHeEo0uS3BGOLklyRzi6JMkd4eiSJHeEo0uS3BGOLklyRzi6JMkd4eiSJHeEo0uS3BGOLklyRzi6JMkd4eiSJHeEo0uS3BGOLklyRzi6JMkd4eiSJHeEo0uS3BGOLklyRzi6JMkd4eiSJHeEo0uS3BGOLklyRzi6JMkd4eiSJHeEo0uS3BGOLklyRzi6JMkd4eiSJHeEo0uS3BGOLklyRzi6JMkd4eiSJHeEo0uS3BGOLklyRzi6JMkd4eiSJHeEo0uS3BGOLklyRzi6JMkd4eiSJHeEo0uS3BGOLklyRzi6JMkd4eiSJHeEo0uS3BGOLklyRzi6JMkd4eiSJHeEo0uS3BGOLklyRzi6JMkd4eiSJHeEo0uS3BGOLklyRzi6JMkd4eiSJHeEo0uS3BGOLklyRzi6JMkd4eiSJHeEo0uS3BGOLklyRzi6JMkd4eiSJHeEo0uS3BGOLklyRzi6JMkd4eiSJHeEo0uS3BGOLklyRzi6JMkd4eiSJHeEo0uS3BGOLklyRzi6JMkd4eiSVOYOREPuMELuMELuMELuMELuMELuMELuMELuMELuMELuMELuMELuMELuMELuMJIy99m3kwcJh0ON3p9PJk+zbT1l7peTQ7nPXv3tbcINojyDcx/UUMLc300O5f7ms8l9cvc2NPdhDaXLvbkfN+d+dbBD7u4G5j6woWS5/9bUTu44oIrcZy8nE3LHYTXkfnVANZmcPSR3HFJB7hfNjv3+f78ldxxSR+5nz97OtnKfvXrU7PMfv1jM7nLS+SbFNlGm/bl/+OHReRvM83XOqRtKlPuT14sHXi/3f593Uzv7KslUUYN9uX/4bNXHMpi5oKEkub953fx3M/f1zCaL+0fumO/N/d35ZDsYRUMJ32bayL25V2fNU9DsVfOnn7obcOxubnfu//vHVcJP/nX1p9l/vrv64ye/ziUNqXK/7Oa3eLf1aXcDcje3O/dmt/1174e2HUFDqtwv1rNaXU7u2J37ZhnNTZqDFUFDwty7R+bGDcjd3BEnInu5p25IlXvz7HP25PW1
G5C7uUO5f/z9+/PlS1FBQ6rcmx8aj398u3EDcje3P/ePP3/x+cP+mRdBQ6rce+dR7z7rHYCRu7l9ub/pnXfvTjSmb0iW++Ls0cZpVHLHvtzXZ9QfP//jvDuvnrwhXe7z5m3hbracmcHC7tzfte+cfvnLn2/n65eqrbQNSXNvLnzzXXtPfpqTO+Z7cm/eZTp70b9J/13ThA2pc28uv1g+FZE7due+ehdp9dPWhwRSNSTKvXm4rubf/PBgTu6Y78+9d4b9YpG7oiFR7huz6uZN7jgi9/fL8+6KhoSfmVndq8veeVRyN7cz9+bCB70/rz//mLghVe7rj7h9bP4Z62KGF+0Lktlf6baJ0rzf/KTvoo32DaX7TS3t5yGXmQsakr1U3bhby2eqy95bCPC0M/fFicjOp11I6RvSnZl5v36b7N7ycw/twzXn70xDdrtzX/7elmUsq49CJm9IeSLyzefNfbjbPh0tb/KyueTv6baJ0uzJfT579XAVS++Va+KG+A3AMELuMELuMELuMELuMELuMELuMELuMELuMELuMELuMELuMELuMELuMELuMELuMELuMELuMELuMELuMELuMELuMELuMELuMELuMELuMHKb3C82ftfZ3cfP/updw2+yxj5xwjk998aTt6tryB37xAlnWO6Te79215A79okTzsDcl7/yl9xxgzjh3DL33tw+Lr7i9Zvr1wAb4oRzeu7L3yz/YNc1QE+ccIbk3n7FyCe/7roGWIsTzqDcm+/raL9mgdxxgzjhDMq9/Rq07Vl/+OFRc2x29vh5/7Yf2+8iefSsf9ms/SaSs8cveKTULU44qffuH9ZfHjU5+2p1w5ery75e/d/rb5/qXYgKxQlnUO7NS47NQ7B3m1809XR971a6rwi87N+QbyerWZxwBr9UbS/prll/8eus/T7YT1ZvJlw9EN/OZ/9cX9ZM+qx5N3n228MJXz5ZtTjhDMm9/c7Ap/1rmqmsnl8uu68LbG63fBftt+7/aC7rBptdrL9YEBWKE87JuS8ehcutLa/Z/ML6ZmLfbEy/d4uL1aN1vnhsb3wjK6oSJ5yhHyJ4ev3+bM964+uFl/egmWfvaeiSE5k1ixPOwNwfrK7Z2urH378/X768aGa/fYR1dVn/WegdRzM1ixPOoNxX54H6s/748xefP+y/mn7Xe1Xdm+a27ZugGnHCOTn3s0dfvuhfs5z1m8+uTWXXI3DjZBK5Vy5OOMPeVb12zXo2j5//cU7uaMQJJ23u7VPN2Ze//Nn88P7mWa9fX6NyccJJmnvzsvls9UzVm/W1R+Alr02NxAknae7N/NYvpbvZbrzAXp5c2rwl6hYnnNS59x56F5Mdp0+Xd6F5OHOm3UWccHS5v+9On64+7dnonowuNp6oOLapWpxwkubeTLR7OLaTvvYxh9Utmj+szr42H4jgQwT1ihNO0tzbz2vebz7XtvhcRHeY1Zw9uvei+Qzb+erR2J5RerK6LedpKhYnHMGJyM6nq0OvWf99hu6Z6GX/thzK1CxOOInfZlr/Q5N7r3u3bz+vvJjern+U0twY9YoTTuLc57P2nxbebZ9s+i9A9v+Tw+WNUa844fAbgGGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE
3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GGE3GHk/z3YZrkp1fN3AAAAAElFTkSuQmCC&#34; style=&#34;height:120px;&#34;&gt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-size: 13px; text-align: left; vertical-align: middle; font-style: italic;&#34;&gt;Brooke Burke&lt;br&gt; (Season 7)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Media Personality&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;21&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #C7EAEC;&#34;&gt;4.8&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #B4E3E6;&#34;&gt;4.8&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #EEF8F9;&#34;&gt;4.8&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #9BDADF;&#34;&gt;23.8&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;&lt;img cid=&#34;hlpufxbiwaqd__temp_ggplot.png&#34; src=&#34;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAu4AAAH0CAMAAABVZliYAAAAulBMVEUAAABmZmZmZoFmZp1mgYFmgZ1mgbZmnc+BZmaBgWaBgYGBgZ2BgbaBnbaBnc+Bts+BtuedZmadgWadgYGdnZ2dnbadts+dtuedz+edz/+2gWa2nYG2nZ22tp22ts+2z8+2z+e25//PnWbPnYHPtoHPtp3Pz7bPz8/P5+fP5//P///ntoHntp3nz53nz7bnz8/n57bn58/n5+fn5//n/+fn////z53/1wD/57b/58//5+f//8///+f///+NxnQNAAAACXBIWXMAAA9hAAAPYQGoP6dpAAAQF0lEQVR4nO3d63rbxhlFYVPxQbbqxs3BOTuOXCd11DhpUpu2St7/bVUACRCUeMZszDfY6/2RxyKZwRDPIgiAkHhvDti4l3sCwHDIHUbIHUbIHUbIHUbIHUbIHUbIHUbIHUbIHUbIHUbIHUbIHUbIHUbIHUbIHUbIHUbIHUbIHUbIHUbIHUbIHendW8k9lXXBpoOyLRufrsSqPso8ULrbna8L0nyAKaB8u0pfaz73PDMvHyNwUOtN8XlnmnXpKN7OXZhwm3hyRw9Hpp69eHLHyU6LfVl8nhlnWSpGoEfsdfBZ5pxjoSjesXvsQTbw5I7j9W99EfzwEx98iShdotinGY5ZyR1HShb7ovhh5z7o0lC8dJv2pvchEyR3HC7BAeqm4Ad8AsMtCqVTtD5s7+SOA0m27Mveh8qQ3HEYXex18AM9iWEWg8IJN+3L3gcpkdxxAHXsdfBDPI8BloHCyTftg/VO7thjoNgH6Z3csdtgsQ/RO7ljpyFr1/dO7thhuB2ZYXond2w3dOzy3skdW2WoXdw7uWMzzeVgh/QubJLcsVGm1ivKZyUcG+XKWDu5Y2A5ayd3DCrXXju5Y3h5Yyd3DCjzpp3cMaDssZM7BhOgdnLHMPLvyFSUT1A4NgoTInZyxyCC1E7uGECU2skdemFqJ3eoxThIXVA+TeHYKEWk2MkdWqFiJ3dIBaud3CEUrXZyh0642skdMvFqJ3eoBKyd3CGR7Y8N7KZ8xsKxEVvI1qfkDoWotZM70gtbO7kjubi1kztSC1w7uSOxyLWTO1IKev6xpXzqwrERUuzWp+SOhMLXTu5IJn7t5I5UCqid3JFICbWTO9IoonZyRwrBzz+2lKtAODYiKSV2ckd/xcRO7uitoNrJHT2VVDu5o5dydttryjUhHBsxlBU7uaOP0mond5yuuNrJHacqbLe9plwdwrGRW4mxkztOU2Ts5I6TFFo7ueMEpdZO7jhambvtNeVaEY6NfMqNndxxpII37VNyx3GKjp3ccYyyN+1TcscRSo+d3HG48msndxyo+B2ZinL9CMfG0MYQO7njENH/sO/BlOtIODaGNJLWp+SO/cZTO7ljj7Hsx9SU60k4NoYyptjJHTuNatM+JXfsMLbYyR1bjS92ck/n3l25p9THGGMn975WbW9YuQWHP8rYyb2HbZFvUlj149y0T8n9NMeU3lVE86O5YmAD5WoTjp3RiaWvBN/Qj7f1Kbkfp3fqK0GTH/GWvaJcc8Kxc0j/Jh9uMz/y2Mn9ULod2jjJjz52cj+I/OAtQPJjPkBdUa5A4dgDGqqDrMlbtD4l932G7SBT8S6xk/tuGToY/PDVYy9mSbkehWMPIWMHgyVv1fqU3LfK3oE+ebfWp+S+RZAQlMkHeYrDUq3Mecm5hypBkrzhhr2WfEV2VqlwbKWAJaQ8fj318rZRSLQON65X4dgygVNI0XzgpzeIJI1sWbXCsVXCx3D65ZTWW/VG6l6661c4tkYxORz32yJbf9vKjzIe4dgK5RWx+5cDd/1aoStlPsKx
BYqu4vYvhVP5Zsp+hGOnRh0elAkJx06L1l0oIxKOnRSx21BWJBw7ITbtRpQdCcdOh9idKEMSjp0Km3YvypSEY6dB7G6UMQnHToHY/ShzEo6dALEbUvYkHLs/anekDEo4dl/syHhSJiUcuydiN6VsSjh2P9TuShmVcOweuBrMmLIr4dino3VnyrCEY5+M2q0pyxKOfSL2Y8wp2xKOfRpid6eMSzj2SajdnrIu4djH44QMfHKndUxtcqd2VJSJCcc+ErWjpmxMOPZR2GvHkrIy4djHIHY0lJkJxz4CtaOl7Ew49uGoHSvK0IRjHzoFdtvRpWxNOPaBM8i9dhGMMjbh2IdNIPfKRTTK2oRjH7T83OsW4ShzE459yOJzr1rEo+xNOPb+hVM77lIWJxx777Jzr1eEpExOOPaeJVM7NlJGJxx794Jzr1REpaxOOPbO5eZepwhLmZ1w7B1LpXZspQxPOPb2heZeoYhMWZ5w7K3LzL0+EZoyPeHY2xaZe3UiNmV7wrG3LDH32kRwyviEY29cHrVjD2V+wrE3LI3YsZcyQOHYdxeWe0WiBMoChWPfWVbu9YgiKBMUjn17UblXI8qgbFA49q0l5V6LKIQyQuHY6wvKvRJRCmWFwrE7S+GUDA6mDFE49mohuVcgSqIsUTh2u4zc6w9FUaYoHLtZRO7Vh7IoWxSOvVxC7rWHwihjFI69WEDulYfSKGsUjj3nIhmcQNmjcGw27TiFMkjl2NSOEyiTFA6de7WhTLoklbnnXmsolC5Jckc4uiTJHeHokiR3hKNLktwRji5Jckc4uiTJHeHokiR3hKNLktwRji5Jckc4uiTJHeHokiR3hKNLktwRji5Jckc4uiTJHeHokiR3hKNLktwRji5Jckc4uiTJHeHokiR3hKNLktwRji5Jckc4uiTJHeHokiR3hKNLktwRji5Jckc4uiTJHeHokiR3hKNLktwRji5Jckc4uiTJHeHokiR3hKNLktwRji5Jckc4uiTJHeHokiR3hKNLktwRji5Jckc4uiTJHeHokiR3hKNLktwRji5Jckc4uiTJHeHokiR3hKNLktwRji5Jckc4uiTJHeHokiR3hKNLktwRji5Jckc4uiTJHeHokiR3hKNLktwRji5Jckc4uiTJHeHokiR3hKNLktwRji5Jckc4uiTJHeHokiR3hKNLktwRji5Jckc4uiTJHeHokiR3hKNLktwRji5Jckc4uiTJHeHokiR3hKNLktwRji5Jckc4uiTJHeHokiR3hKNLktwRji5Jckc4uiTJHeHokiR3hKNLktwRji5Jckc4uiTJHeHokiR3hKNLktwRji5Jckc4uiTJHeHokiR3hKNLktwRji5Jckc4uiTJHeHokiR3hKNLktwRji5Jckc4uiTJHeHokiR3hKNLktwRji5JZe5ANOQOI+QOI+QOI+QOI+QOI+QOI+QOI+QOI+QOI+QOI+QOI+QOI+QOI+QOIylzn307eZRwOIzRh/PJ5Fm2pafM/WqyL/fZ67+9S7hAlKd37r0aSpj7+8m+3N9+NnlI7t765t6voXS5V89jd+43Ozvk7q5n7j0bSpb7b1Xt5I49RpH77NVkQu7Ybwy53+xQTSZnj8kd+4wg98tqw/7wv9+SO/YZR+5nz9/NbuU+e/2k2uZfvFzM7mrS+CbFMlGm7bl//OHJeR3Mi1XOqRtKlPvTN4sXXif3f583Uzv7KslUMQbbcv/4WdvHMpi5oKEkub99U/13PffVzCaL50fumG/N/f355HYwioYSfsy0lnv1rM6qt6DZ6+pfPzUPYN/d3Obc//ePm4Sf/uvmX7P/fHfzz09+nUsaUuV+1cxv8Wnrs+YB5G5uc+7VZvvrzg91O4KGVLlfrmbV3k7u2Jz7ehnVQ6qdFUFDwtybV+baA8jd3AEnIju5p25IlXv17nP29M2dB5C7uX25X//+/fnyUFTQkCr36ofKxY/v1h5A7ua253798xefP+6eeRE0pMq9cx71/vPO
Dhi5m9uW+9vOeffmRGP6hmS5L84erZ1GJXdsy311Rv3ixR/nzXn15A3pcp9XHws3s+XMDBY25/6+/uT0y1/+fDdfHarW0jYkzb268e139TP5aU7umG/JvfqU6exl9yHdT00TNqTOvbr9cvlWRO7YnHv7KVL7062LBFI1JMq9erm2869+eDQnd8y35945w365yF3RkCj3tVk18yZ3HJD7h+V5d0VDwmtm2md11TmPSu7mNuZe3fio8+/V9Y+JG1LlvrrE7br6NdbFDC/rA5LZX+mWidJ8WL/Sd9FG/YHSw6qW+nrIZeaChmSHqmtPa/lOddX5CAGeNua+OBHZ+LQJKX1DujMzH1Yfkz1YXvdQv1xz/s00ZLc59+XfbVnG0l4Kmbwh5YnIt59Xz+F+/Xa0fMir6pa/p1smSrMl9/ns9eM2ls6Ra+KG+AvAMELuMELuMELuMELuMELuMELuMELuMELuMELuMELuMELuMELuMELuMELuMELuMELuMELuMELuMELuMELuMELuMELuMELuMELuMELuMHJM7pdrf+vs/sXzvzr38JessU2ccE7PvfL0XXsPuWObOOH0y33y4NfmHnLHNnHC6Zn78k/+kjt2iBPOkbl35na9+IrXb+7eA6yJE87puS//svyjTfcAHXHC6ZN7/RUjn/y66R5gJU44vXKvvq+j/poFcscOccLplXv9NWi3Z/3xhyfVvtnZxYvuY6/r7yJ58rx726z+JpKzi5e8UsYtTjipt+4fV18eNTn7qn3gq/a2r9v/e/XtU50bMUJxwumVe3XIsb4L9n79i6aerZ5dq/mKwKvuA/l2sjGLE07vQ9X6luae1Re/zurvg/2k/TDh5oX4bj775+q2atJn1afJs98eT/jyyVGLE06f3OvvDHzWvaeaSvv+ctV8XWD1uOWnaL81/0d1WzPY7HL1xYIYoTjhnJz74lW4XNrynvUvrK8m9s3a9DuPuGxfrfPFa3vtG1kxKnHC6XsRwbO7z+f2rNe+Xnj5DKp5dt6GrjiROWZxwumZ+6P2nltLvf79+/Pl4UU1+9t7WDe3dd+F3rM3M2ZxwumVe3seqDvr65+/+Pxx92j6feeoujPN224/BKMRJ5yTcz978uXL7j3LWb/97M5UNr0C104mkfvIxQmn36eqd+5ZzebixR/n5I5KnHDS5l6/1Zx9+cuf1Q8fds96dXyNkYsTTtLcq8Pms/adqjPrO6/AK45NjcQJJ2nu1fxWh9LNbNcOsJcnl9YfiXGLE07q3DsvvcvJhtOny6dQvZw50+4iTji63D80p0/bqz0rzZvR5dobFfs2oxYnnKS5VxNtXo71pO9c5tA+ovpHe/a1uiCCiwjGK044SXOvr9d8WF3XtrguotnNqs4ePXhZXcN23r4a6zNKT9vHcp5mxOKEIzgR2fi03fWadT9naN6JXnUfy67MmMUJJ/HHTKtfNHnwpvP4+nrlxfQ2/VJK9WCMV5xwEuc+n9W/Wni/frPpHoBs/5XD5YMxXnHC4S8Awwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5wwi5w8j/AfhJS9TTw93mAAAAAElFTkSuQmCC&#34; 
style=&#34;height:120px;&#34;&gt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-size: 13px; text-align: left; vertical-align: middle; font-style: italic;&#34;&gt;Bobby Bones&lt;br&gt; (Season 27)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Businessperson&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;5&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #FFFFFF;&#34;&gt;0&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #FFFFFF;&#34;&gt;0&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #FFFFFF;&#34;&gt;0&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #ACE0E4;&#34;&gt;20&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;&lt;img cid=&#34;krqxpfyhtbue__temp_ggplot.png&#34; src=&#34;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAu4AAAH0CAMAAABVZliYAAAAulBMVEUAAABmZmZmZoFmZp1mgYFmgZ1mgbZmnc+BZmaBgWaBgYGBgZ2BgbaBnbaBnc+Bts+BtuedZmadgWadgYGdnZ2dnbadts+dtuedz+edz/+2gWa2nYG2nZ22tp22ts+2z8+2z+e25//PnWbPnYHPtoHPtp3Pz7bPz8/P5+fP5//P///ntoHntp3nz53nz7bnz8/n57bn58/n5+fn5//n/+fn////z53/1wD/57b/58//5+f//8///+f///+NxnQNAAAACXBIWXMAAA9hAAAPYQGoP6dpAAAR3klEQVR4nO3da5vTxh2G8fUGwgKloTmQYylZSlKyDUmTcuz6+3+tItmSJVs+SJrHGv2f+/ciF+w6MxrNvQcfsC6WgI2LqQ8AOB9yhxFyhxFyhxFyhxFyhxFyhxFyhxFyhxFyhxFyhxFyhxFyhxFyhxFyhxFyhxFyhxFyhxFyhxFyhxFyhxFy93ZxYVWA1WLR9rH1t2/fOiXgtFa0Fa0XjL7B+6wUW6ranb6/+6wUbZvajXq3WSjamrX79O6yTrS1a7fp3WSZ2LKdu0nvHqvElp3ayR1h7dZu0rvFIrGF3OGjq3aP3h3WiC3kDh/dtVv0brBEtO2rndwRz97aHXqPv0K0kTt8HKid3BHNodzj9x5+gWgjd/g4WDu5I5bDuYfvPfr60HKkdnJHJOQOI8dyj9578OWh5Wjt5I44yH3qA8D5HK+d3BHGCbkH7z326tBC7rFXh6ZTaid3BHFS7rF7D704NJ1WO7kjhBNzD9175LWh6dTayR0BkPuS3G2cXHvo3gMvDU3kXgi8NDT0qJ3cMXfkXgq8NDT0yT1w73FXhoZetZM75o3cV+KuDBv9aid3zFrP3OP2HnZhaCD3tbALw0bf2skdM0bulbALw0bv3MP2HnVd2OhfO7ljtsi9FnVdqA2ondwxV+S+EXVdqA3JPWrvQZeF2qDayR3zRO4NQZeFyrDayR2zNDD3oL3HXBUqQ2snd8wQubfEXBXWBtdO7pid4bWTO2ZnRO4xew+5KKyR+5aQi8LKmNrJHTND7ttCLgqlUbXH7D3imrBC7jsirgkrI3OP2HvAJWFlbO3kjhkh910Bl4TS6NrJHfMxPveAvcdbEUoJaid3zAW5d4m3IpTIvUu8FaGQonZyx0wkyT1e7+EWhEKa2skds0Du3cItCMtktZM7ZiBV7eSO/CWrndyRv3S5h+s92nqQsnZyR+7Ifb9o60HK2skdmSP3A6Ktx17S2sP1Hmw5IPdDgi0H5H5IsOXYS1w7uSNn5H5QsOW4S117tN5jrcYeuR8WazXu0tdO7sgWuR8RazXmBLWTO3JF7sfEWo03Re3Beg+1GG+a2skdORLVTu7IEbmfINRinKlqJ3dkSJZ7qN4jrcWZrnZyR3bI/SSR1mJMWDu5IzfkfppIa/GlrJ3ckRlp7pF6D7QUX9rayR1ZIfdTBVqKLXHt5I6cqHMP1Huclfgi95PFWYktee2Beg+zEF/kfrowC7F1htrJHZk4R+3kjjycpXZyRx7Ok3uY3qOsw9SZaid35IDc+4myDlPk3k+UdXg6V+3kjgycLfcovQdZhqfz1U7umBy59xVkGZbOWDu5Y2rk3luQZVg6Z+5Beo+xCktnrZ3cMS1y7y/GKhydt3Zyx6TIfYAYq3BE7gPEWIWhM9cepPcQi3BE7kOEWIShs9dO7pjM+Wsnd0yG3IcJsYhMXRQ0I5+/9hi9R1hDni5WTUqCj5u77lvEanjd0N42RabfvklqP0vu5cqU4wvHdnahDCVs7quFKScQjm3sQlpK1NzX61LOIBzb1sVO
j0lP8zS163uv1qWcQji2q64cU57noLnXy1LOIRzbVHeN6U70VLVrc2/8RFTOIhzb074ak53pkLk3F6WcRji2p701pjrVEXNvrUk3Dbmntj/GRKd6stqVvZP7PB2KMc25jph7e02qWZbkntrBGJOc7IC5by1JNEs5k3BsQ4dbTHGyJ6yd3NF2JMYEZztg7ttL0syymko4tp9jLY4/21PWruqd3GfqaIyjT3fA3HeWpJikmks4tp0TWhx7vsl93FzCse3oc5+2dknuu0sSTFJPJhzbzUktjjvhE+eu6J3cZ0qf+9S1C3LvWFLyORqzCcc2c2KLI8745LULeif3eTq5xeGnnNxHzyYc20qPFIee8wxqP8s/u008RWs64dhWyD3ZmhJP0ZpOOLaVPi0OO+k51J669841JZ1haz7h2E56tUjuB5eUcobtCYVjO+nX4qCzHi/3PStKOMPOjMKxjfRMcchZz6N2ckf/FgecdnJPMaNwbB+9U+x/2jOp/RzvIJJsgo4phWP76N9i7/NO7kmmFI7tQ597NrWf4f1yUk3QNadwbBtDWux54sk9zZzCsW1Y5a5/e6hE43fOKRzbhj73jGrXvz1UmvG7JxWO7WJYi73OPLknmlQ4tgt97jnVrn8ztCTj75lVOLaJoS32OPXkngi5jza4xZPPfVa169/7L8Xw+2YVjm1ieIynnvyAuR9aUoLh904rHNvDmBZPO/uZ1U7uzuS551a7/J0uxw+/f17h2BbGxXjC6c+udvlbXY4e/cC8wrEdjIyR3DuMHv3AvMKxHYyN8ej5z7B29Ru7jh390MTCsQ2MjpHcd40d/dDEwrENjI/xyAbkWLv6jV3HDX54ZuHYBshdcNbGDX54ZuHY8aWI8eAO5Fm7+G2MRw1+ZGrh2PElqfHQFkTM/eiaxgx+bG7h2OGlifHAFmRa+7jeyX2mEtW4dw+yrZ3cHaXKcd8mhMz9+KKGj318cuHY0aWr8aJzG/KtndwNpcxxdx8uMq59RO8nrGrUphyZXTh2dNLcs46d3P2kDfJCOXh65O4mcZGt399zr31w76csLMnu7JleOHZs6Yu8EI6dHLl7ESR5UUk/dHq6SzKk3qnm/MKxY5tFk0Lk7sS9duElSJLvVWN+4dihkbvsCiTpN2tzAMKxQyN3cvdB7borkCj2qzoC4diRkTu5+6D2t+Rug9pLoiuQaPZsdQjCseMi9xK5W6D2FdEFd0S7Vh6DcOywyH1NcwUS2b6R+xDUXtFcgES3c+Q+ALlXyN0Audck19tRbp1w7KCofYPcwyP3DXKPjtqbTuyn10lTbp5w7JjIvYncY6P2FsW1A5W7Jxw7JHJvE1w8ULl7wrEjovYtpwTU86Qpt084dkTkvi39xQOV2yccOyJy33a8oL7nTLl9wrEDovZdxxLqfc6U+yccOyBy75D6amrK/ROOHQ+1dzncUP9zptxA4djxkHungxdTI/e5ovZuiS+mptxB4djRUPs+aS+mptxC4djBUPt+SS+mptxD4djBkPt+ezIadsqUeygcOxZqPyTlpQOVmygcOxZyP6gjpKFnTLmJwrFDofYjOi6VOXAk5S4Kxw6F3I/ZTmnwGVPuonDsSKj9uFZLIy4wpdxG4diBUPspLuqLZY66mppyH4VjB0LuJ0px6UDlPgrHjoPaz0m5kcKx4yD3c1JupHDsMKj9rJQ7KRw7Cmo/L+VWCscOYh7XbA9EuZfCsWMg9nNTbqZw7BCo/eyUuykcOwRyPzvlbgrHjoDaz0+5ncKxA6D2CSj3Uzj2/FH7FJQbKhx79qh9EsodFY49e+Q+CeWOCseeO2qfhnJLhWPPHLVPRLmnwrFnjZcOTEa5q8KxZ4zYJ6TcV+HY80XsU1JurHDs2aL2SSl3Vjj2TPGLzMSUeysce5aIfXLK3RWOPT9j/w09UlBusHDsuaH1PCi3WDj2rPCNPRvKXRaOPRfj3wgIKSm3Wjh2/hK85RWSU264cOyMXHSbemPRRdmB
cOiM6FaJOSEEGCF3GCF3GCF3GCF3GCF3GCF3GCF3GCF3GCF3GCF3GCF3GCF3GCF3GEmZ++13i3sJh0NE764Wi8eTzZ4y95vFsdxvX/zldcIJMT+jcx/VUMLc3yyO5f7q88Wn5O5tbO7jGkqXe7GOw7l//GWH3N2NzH1kQ8ly/7WondxxRIjcb58vFuSO4yLk/vEXqsXi8j6545gAuV8X39g//e935I5jYuR++eT17Vbuty8eFN/zHz5bHd3NovJtijkxT/tzf//3B1dlME83OaduKFHuj16uvvAauf/7qjq0y6+THCoi2Jf7+8/rPtbBLAUNJcn91cviv+3cN0e2WK2P3LHcm/ubq8V2MIqGEj7N1Mq9WNVl8SPo9kXxpx+rG/C7u7nu3P/3t48JP/rXxz/d/uf7j3/85JelpCFV7jfV8a2ebX1c3YDczXXnXnzb/qbxl7IdQUOq3K83R1V/nNzRnXu7jOImxS8rgoaEuVdfma0bkLu5Ex6IbOSeuiFV7sVPn8tHL3duQO7mjuX+4bcfrtZ3RQUNqXIv/lJ4+I/XrRuQu7n9uX/46csv7jcfeRE0pMq98TjqnSeNX8DI3dy+3F81HnevHmhM35As99WjR62HUckd+3LfPKL+8OnvV9Xj6skb0uW+LJ4Wro6WR2aw0p37m/KZ069+/uP1cnNXtZS2IWnuxQdffV+u5McluWO5J/fiWabLZ82bNJ81TdiQOvfi49frH0Xkju7c62eR6r9tvUggVUOi3Isv1/r4i7/cW5I7lvtzbzzCfr3KXdGQKPfWUVXHTe44Ifd368fdFQ0JXzNTr+qm8TgquZvrzL344L3Gnzevf0zckCr3zUvcPhT/jHV1hNflHZLbP9PNibl5136l76qN8gmlT4taytdDrjMXNCS7q9pa1von1U3jKQR46sx99UBk5bMqpPQN6R6Zebd5muzu+nUP5ZfrlO+Zhsl1575+35Z1LPVLIZM3pHwg8tUXxRrulD+O1jd5Xnzkr+nmxNzsyX15++J+HUvjnmvihngHYBghdxghdxghdxghdxghdxghdxghdxghdxghdxghdxghdxghdxghdxghdxghdxghdxghdxghdxghdxghdxghdxghdxghdxghdxghdxjpk/t1673O7jx88mfjM7yTNfbJJ5zhuRceva4/Q+7YJ59wxuW+uPtL9Rlyxz75hDMy9/Vb/pI7DsgnnJ65N47tw+oSr9/ufgZoySec4bmv31n+XtdngIZ8whmTe3mJkU9+6foMsJFPOKNyL67XUV5mgdxxQD7hjMq9vAza9lG///uD4nezy4dPm7f9UF6L5MGT5sduyyuRXD58xldKbPmEk/q7+/vNxaMWl1/XN3xef+yb+v/eXH2q8UEElE84o3Iv7nK0fwV7077Q1OPN6mrVJQJvmjfk6mSR5RPO6Luq5Ueqz2wu/HpbXg/2k/rJhI9fiK+Xt//cfKw46Mvi2eTbX+8vuPhkaPmEMyb38pqBj5ufKQ6l/vlyU10usLjd+lm0X6v/o/hYNdjt9ebCgggon3AG5776KlzPtv5M+4L1xYF92zr8xi2u66/W5epru3VFVoSSTzhjX0TweHc920fdurzwegXFcTZ+DN3wQGZk+YQzMvd79We2Zv3w2w9X67sXxdFv/4b18WPNn0Jv+G0msnzCGZV7/ThQ86g//PTlF/eb96bfNO5VNw5z2/ZNEEY+4QzO/fLBV8+an1kf9avPdw6l6yuw9WASuQeXTzjjnlXd+czmaB4+/f2K3FHIJ5y0uZc/ai6/+vmP4i/vDh/15v41gssnnKS5F3ebL+ufVI2j3vkKvOG+qZF8wkmae3F8m7vS1dG27mCvH1xq3xKx5RNO6twbX3rXi46HT9dLKL6ceaTdRT7h6HJ/Vz18Wr/as1D9MLpu/aDid5vQ8gknae7FgVZfjuVB77zMob5F8Yf60dfiBRG8iCCufMJJmnv5es1Pi9e1rV4XUf2aVTx6dPdZ8Rq2
q/qrsXxE6VF9Wx6nCSyfcAQPRFY+q3/1um0+z1D9JHrevC2/ykSWTziJn2ba/EOTuy8bty9fr7w6vK5/lFLcGHHlE07i3Je35T8tvFP+sGneAdn/Tw7XN0Zc+YTDOwDDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDCLnDyP8B+RmBIyqachMAAAAASUVORK5CYII=&#34; style=&#34;height:120px;&#34;&gt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-size: 13px; text-align: left; vertical-align: middle; font-style: italic;&#34;&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Entertainer&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;5&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #FFFFFF;&#34;&gt;0&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #FFFFFF;&#34;&gt;0&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #B5E3E6;&#34;&gt;20&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #3FC1C9;&#34;&gt;40&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;&lt;img cid=&#34;cxanbhsqiedo__temp_ggplot.png&#34; src=&#34;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAu4AAAH0CAMAAABVZliYAAAAulBMVEUAAABmZmZmZoFmZp1mgYFmgZ1mgbZmnc+BZmaBgWaBgYGBgZ2BgbaBnbaBnc+Bts+BtuedZmadgWadgYGdnZ2dnbadts+dtuedz+edz/+2gWa2nYG2nZ22tp22ts+2z8+2z+e25//PnWbPnYHPtoHPtp3Pz7bPz8/P5+fP5//P///ntoHntp3nz53nz7bnz8/n57bn58/n5+fn5//n/+fn////z53/1wD/57b/58//5+f//8///+f///+NxnQNAAAACXBIWXMAAA9hAAAPYQGoP6dpAAAQ8UlEQVR4nO3de3/TRhaHcTvlEsiyZUtL75SGpV3IlnbbhUDWfv9vayPZkiVHsiVrzswZ/Z7vH/1A4o6ko8fxlXixBmQsUu8AEA+5Qwi5Qwi5Qwi5Qwi5Qwi5Qwi5Qwi5Qwi5Qwi5Qwi5Qwi5Qwi5Qwi5Qwi5Qwi5Qwi5Qwi5Qwi5Qwi5Qwi5I4BF24CLpAmP3DHFtt3rtv20Oy5ynaR6csfJOiIeLW7x5I4TBWi9Kj7ePkfbEuYkxA/2BMGTO04QsvVN8JH2O85mMCdBf7JXvUf5CU/uGMsg9k3xEXbdfhOYF6vaY/RO7hjHrvYIvZM7RrGs3b53cscIFg9SY/ZO7hjOOnbz3skdg0Wo3bh3csdQUWq3fYmV3DFQpNqvry2PwXBtzAq5Q0e02skdycWrndyRWsTayR2pkTt0xKyd3JEYuUNH1NrJHUnFrZ3ckVLk2skdCcWundyRELlDR/TayR3JxK+d3JEMuUNHgtrJHamQO3SkqJ3ckQi5Q0eS2skdaZA7hJA7dKSpndyRBLlDR6LayR0pkDt0pKqd3BFfstrJHfGRO3Skq53cER25Q0fC2skdsZE7dKSsndwRGblDCLlDR9LayR1xkTuEkDt0pK2d3BEVuUNH4trJHTGRO4SQO3Skrp3cERG5Q0fy2skd8ZA7dKSvndwRi4PayR2xkDuEkDt0eKid3BEJuUMIuUOHi9rJHXGQO4SQO3T4qJ3cEQW5Qwi5Q4eT2skdMZA7hJA7dHipndwRAblDCLlDh5vayR32yB1CyB06/NRO7jBH7tDhqHZyhzVyhw5PtZM7jJE7dLiqndxhi9whhNyhw1ft5A5T5A4h5A4dzmond1gidwghd+jwVju5wxC5Qwi5Q4e72skddsgdQsgdQsgdOvzVTu4wQ+4QQu7Q4bB2cocVcocQcocOj7WTO4yQO3S4rJ3cYYPcocNn7eQOE+QOHU5rJ3dYIHcIIXcIIXfo8Fo7ucMAuUMIuUOH29rJHeGRO4SQO3T4rZ3cERy5Qwi5Q4fj2skdoZE7hJA7dHiundwRGLlDCLlDh+vayR1hkTuEkDt0+K6d3BEUuUMIuUOH89rJHSGRO3R4r53cERC5Q4f72skd4ZA7hJA7dPivndwRDLlDCLlna9GQel8ykUHt5N7lNvHmjEh+EHLP0qLrvBH8UeSeoc7YC3keTjw51E7ubb2xX/MD/ghyz86RU5bhEcVD7pk59KN9I7tDiieL2sl9Z8gJ4zmaPuRuuLaBoecrs8OKhdwN1w5v+OnK67giyaN2ct84fre9IacDi4XcM6pi5MnK6MhiIfd8ohh9rvI5tEgyqZ3c1yedq2yOLRJyzyb3k05VLgcXRy61k/uppyqTo4uD3OeeO73vZFM7uZ9+qvI4vhjIfZ1J7lPOVBYHGEE+tavnPu1M5XCEEZB7OQXDtQOZeKIyOMIIMqpdO/dRbx3o4v8QIyD3zRgM1w4iwHlyf4wRkPtmDIZrhxDkNHk/SHs51S6ce5jT5PwgIyD37RwM1w4g0GlyfpTm
sqpdN/dgp0n8n/ORezUIw7UnC3iWXB+nOXKvBmG49lRBT5LnA7WWV+3kHoLnIzVG7vUkDNeeKPRJcnyoxsi9noTh2tMEP0d+D9VYZrWTexh+j9UWue9GYbj2JBbnyO3BmsqtdsXcTc6R14O1Re6NWRiuPYHRKXJ6tLbIvTELw7VPZ3aGfB6uqexql8vd7gy5PFxb5N4chuHaJzM8Qy6P11J+tavlbnqG1N4sRu6taRiufSLjE+TwiA1lWDu5B+XwkO2Qe3schmufxv4E+TtmMznWLpV7hBPk7pjNZFk7uQfm7qCtkPv+QAzXPkWcE+TtqI3kWbtQ7pFOkLOjtkLudyZiuPZ4k39j2FC+DtsKud+ZiOHao0U8Pa6O20imtavkHvP0eDpuK+R+dySGa48V9fR4OnAbudYuknvk0+PoyG2Qe8dMDNceJ/rZ8XPoJrKtndxt+Dl2C+TeNRTDtUdJcXbcHLyBfGtXyD3J2fFy8BbIvXMqhmuPkebseDn68DKuXSD3VGfHyeGHR+7dYzFce7hkJ8fH4YeXc+2zzz3hyXFx/OGRe89cDNceKtobw7p4GEBwWdc+89zTnhsHAwgu79rnnXvqc5N+AqGlnuhUlqMxXHvgHqSebvoRBJZ8ohNZjsZw7WE7kHq4s+vdwUSnsZyN4dqDtp96toXUQwjLxUinsJyN4dqDtp96tqXUUwjJx0SnsByO4dpDNp96tFsz6t3LSE9nORzDtQdsPfVkK/PJ3c1IT2c5HcO1j2889WB3ZtO7o5meynI6hmsf33jqwTbMpHdPIz2V5XgM1z667dRzbZlH775mehrL8RiufWTL3s7MHHr3NtOTWM7HcO3DG0491DtmkLu/oZ7CckCGax/cbuqZdsi/d49THc9yQIZrH9xu6pl2yb13l0Mdz3JChmsf2mzqkXbLu3enQx3NckSGax/YauqJ9sn5c/ncDnUsyxkZrt2/0dQD7Zdv7o6HOpLlkAzX7t1m6nkekmvvroc6juWUDNfu22TqcR6Wae/OpzqG5ZQM1+7bZOpxHpFl796HOoblmAzX7tli6mkelWHv/oc6guWcDNfu3F4OJya/3nOY6mCWczJcu2tzqUc5TG69ZzLWgSwHZbh2x9ZST3KovHrPZqzDWE7KcO27G0s9yOFy6j2jsQ5iOSrDte9sK/Ucx8jn9dWsxjqE5awM197fVOoxjpRJ71k8+B/FcliGa+9tKfUUR8ui9/zGepTltAzXbm8o9RBPkEHvOY71GMtxGa7d2Eqmt7jue89zrEdYzstw7d1GUg/wZM4fsOY72EMsB2a4dr2N1PObwnPvWQ+2n+XEDNeuNpF6fNP47T3zwfayHJnh2psNZH9SvN6hyX6wfSxnZrh2uX7q2YXgMvhZTLaT5dAM117P55x46z3Xp7oGsZyb5dozOie+fsDPZ65dLAdnuHTqsYXlqPeZTXaf5eTslk49tdC8/ICf0W1mN8vZ2S2demrhLRwUP/vYyd2R1MHPP3ZydyVh8HN67H+A5QTtlk49NTOJgtdo/Zrc3VlEvxcv8oO9ZDlGu6VTT81YzOKFWr8md6+iFL9Q+sFeshym3dKppxbFwvR+jV7qBbNxknsQFs0vNFMvBJ5ka6p2S6eeWmSLRaDqF8Kll6aPsH+2dkunnloai4YRw2r8X6mPIDmzIsnd1GK41LvqiV2SaT8kHoiL3CGE3CGE3CGE3CGE3CGE3CGE3CGE3CGE3CGE3CGE3CGE3CGE3CGE3CEkZO6r75YPAy6HOfp4vlw+Tbb1kLlfLY/lvnr9t/cBN4j8TM59UkMBc/+wPJb7uy+WD8hd29TcpzUULvfiOA7nfntnh9zVTcx9YkPBcv+tqJ3cccQscl+9Wi7JHcfNIffbO1TL5dkjcscxM8j9svjB/uC/35E7jplH7mfP36/2cl+9flz8zL94udm7q2Xl2xDbRJ76c//04+PzMpgXu5xDNxQo9ydvNle8Ru7/Pq927ezrILuKOejL/dMX
dR/bYNYGDQXJ/d2b4r/t3Hd7ttwcH7lj3Zv7h/PlfjAWDQV8mamVe3FUZ8VN0Op18aefqwtw311cd+7/+8dtwk/+dfun1X++v/3jZ2/XJg1Z5X5V7d/m1dan1QXIXVx37sWP7W8afynbMWjIKvfL3V7VXyd3dOfeLqO4SHFnxaAhw9yra2brAuQubsATkY3cQzdklXtx63P25M2dC5C7uGO53/z+w/n2oahBQ1a5F38pXPz0vnUBchfXn/vNL18+e9R85sWgIavcG8+j3nveuANG7uL6cn/XeN69eqIxfENmuW+ePWo9jUru6Mt994z6xYs/zqvn1YM3ZJf7unhZuNpbnpnBRnfuH8pXTr/69c/3691D1VLYhkxzL7747vvySH5ekzvWPbkXrzKdvWxepPmqacCGrHMvvn65vSkid3TnXr+KVP9t700CoRoyyr24utb7X/zl4Zrcse7PvfEM++Umd4uGjHJv7VW13+SOAbl/3D7vbtGQ4Xtm6qO6ajyPSu7iOnMvvviw8efd+x8DN2SV++4tbjfFP2Pd7OFl+YBk9Ve4bSI3H9vv9N20Ub6g9KCopXw/5DZzg4bMHqq2Dmt7S3XVeAkBmjpz3zwRWfm8Cil8Q3bPzHzcvUx2f/u+h/LqmvJ3piG57ty3v7dlG0v9VsjgDVk+EfnuWXEM98qbo+1FXhVf+Xu4bSI3PbmvV68f1bE0HrkGbojfAAwh5A4h5A4h5A4h5A4h5A4h5A4h5A4h5A4h5A4h5A4h5A4h5A4h5A4h5A4h5A4h5A4h5A4h5A4h5A4h5A4h5A4h5A4h5A4h5A4h5A4hY3K/bP2us3sXz/9qfIffZI0+fsI5PffCk/f1d8gdffyEMy335f231XfIHX38hDMx9+2v/CV3HOAnnJG5N/btZvMRr9/e/Q7Q4iec03Pf/mb5h13fARr8hDMl9/IjRj572/UdYMdPOJNyLz6vo/yYBXLHAX7CmZR7+TFo+3v96cfHxX2zs4sXzcvelJ9F8vh582ur8pNIzi5eck2ZNz/hhP7p/mn34VHLs6/rC76qv/ZN/X/vPn2q8UXMkJ9wJuVePORo3wX70P6gqae7o6tVHxF41bwgn042Z37CmfxQtfxK9Z3dB7+uys+D/ax+MeH2ivh+vfrn7mvFTp8Vryavfnu05MMnZ81POFNyLz8z8GnzO8Wu1LcvV9XHBRaX276K9lv1fxRfqxZbXe4+WBAz5Ceck3PfXAu3W9t+p/2B9cWOfdva/cYlLutr63pz3W59IitmxU84U99E8PTu8ezvdevjhbdHUOxn42boiicy58xPOBNzf1h/Z2+rN7//cL59eFHs/f49rNuvNW+FPnBvZs78hDMp9/p5oOZe3/zy5bNHzUfTHxqPqhu7uW//IpgNP+GcnPvZ469eNr+z3et3X9zZla5rYOvJJHKfOT/hTHtV9c53dntz8eKPc3JHwU84YXMvb2rOvvr1z+IvHw/v9e7xNWbOTzhBcy8eNp/Vt1SNvb5zDbzisakQP+EEzb3Yv91D6WpvWw+wt08utS+JefMTTujcG1e9y2XH06fbQyiuzjzTrsJPOHa5f6yePq3f7VmobowuWzdU3LeZNT/hBM292NHq6lju9J23OdSXKP5QP/tavCGCNxHMl59wguZevl/zQfG+ts37Iqq7WcWzR/dfFu9hO6+vjeUzSk/qy/I8zYz5CcfgicjK5/Vdr1XzdYbqluhV87LclZkzP+EEfplp9w9N7r9pXL58v/Jm97r+UUpxYcyXn3AC575elf+08F55Y9N8ANL/Tw63F8Z8+QmH3wAMIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQO
IeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIeQOIf8HOBufXPGxETAAAAAASUVORK5CYII=&#34; style=&#34;height:120px;&#34;&gt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-size: 13px; text-align: left; vertical-align: middle; font-style: italic;&#34;&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-family: &amp;#39;Secular One&amp;#39;; font-size: large;&#34;&gt;Other&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center;&#34;&gt;9&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #FFFFFF;&#34;&gt;0&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #FFFFFF;&#34;&gt;0&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #3FC1C9;&#34;&gt;44.4&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_right&#34; style=&#34;font-family: Spartan; font-size: medium; text-align: center; color: #000000; background-color: #FFFFFF;&#34;&gt;0&amp;percnt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_center&#34;&gt;&lt;img cid=&#34;jyzknpxfwagr__temp_ggplot.png&#34; src=&#34;data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAu4AAAH0CAMAAABVZliYAAAAulBMVEUAAABmZmZmZoFmZp1mgYFmgZ1mgbZmnc+BZmaBgWaBgYGBgZ2BgbaBnbaBnc+Bts+BtuedZmadgWadgYGdnZ2dnbadts+dtuedz+edz/+2gWa2nYG2nZ22tp22ts+2z8+2z+e25//PnWbPnYHPtoHPtp3Pz7bPz8/P5+fP5//P///ntoHntp3nz53nz7bnz8/n57bn58/n5+fn5//n/+fn////z53/1wD/57b/58//5+f//8///+f///+NxnQNAAAACXBIWXMAAA9hAAAPYQGoP6dpAAAPcUlEQVR4nO3da3vaRh6GcUNzcOLNNtu06TlNnU27ibdpt93EiRe+/9daj5CEsMEn9Nc8M8/9e9ErBiqPNDcgBEYHS8DGQe4BANMhdxghdxghdxghdxghdxghdxghdxghdxghdxghdxghdxghdxghdxghdxghdxghdxghdxghdxghdxghdxipNveDgdxjgYr6UmgTPx0ge6zUlcDFzi8geXf1TP81qZM86sn9RqkPks89XmRRxbzf7HGd4lHBpN++9b74CtYet1H6hN/hgX0z+dwrgCkVPt17tU7wdoqe7P0e2QneT8FTPU7sq+AL3gy4hXLnebTY2+Jzrw8mUOwsj1s7wXsoc473PB5D8K6KnOGI1tvgi9weuKkSpzes9lXxuVcPccqb3JD9GIL3UNzURsdO8DUrbWKnqJ3gq1XYtE5UO8FXqqxJna52gq9SSVMa/iKV4GtX0IROHTvB16ec6cxRewo+93pjRMXMZqbaeae1KqVMZbbaV8XnXn2Mo5CJzFo7wVejjGnMXTvBV6KISRSo/ZQXrTUoYQ41audFawX0J3DyN5euQvBlk58+pdgTgi+Z+uSp1X5K8CUTnzrB2k950Vou7ZnTrJ0XrcXSnjbV3E/ZpSmT9KQJ135K8CVSnjLt2k8JvjzCEyZf+ynBl0Z3ukqo/ZRXrWWRnatCak8IvhiqMyX10YFr8RBfCNFpKir2BsGXQHOSyqv9lIf4EkjOUJG1JwQvTnJ+is2dh3hxipNTcO0JwesSnJrCaz/lBMW69Kal/NobFK9Ib04qyf2U4gXJTUg9tSfs1mhRm4y6am9QvA6xmaiw9uSAR3kNWpNQ1idlboni85OagZpjb/Agn5nS1q++9sYBezb5CG13j9pbJJ+F0Da3yj054IF+ajob2672DtVPR2Yr29beOSD7eDJb1z73DtUHUtms1H7BwQEP9+MT2ZjUvhvhj0dkE5L7TRxQ/p40thu139YB6d+FxLai9r1Q/o0pbKCqPxg2LcK/msBmIfYIdL9N/q1B7bHIfiD/ViD3aZD9UiB3ap+Ydfa515rac7GsPvfqkntmXs1nXlNql2DzQJ93HaldiUHz5I4NdTefdc2oXVS1yedcK2pXVuXDPLnjCrUVn3FtqL0IVT3K51sRPgdZkFqKz7YWxF6YKorPtQrUXqDyd2vIHbdSdvCZBk/t5Sr5IT7PyKm9bMUGT+64i0J347OMmdqrUGDw5I67Ky74HOOl9noUFnyG0VJ7VYraiSd37K2c4qcfJ7XXqJDeyR2jKOMBfvJBUnutSgh+6iFSe8X09+EnHh+1V048+GlHR+31k+590sFRuwPlB3hyx+h0g59yYNRuQzV4ckcIzeAnHBS1e1EMfrohUbsdvd7JHXHkHuAnGw+1WxLrndwRSuuDBVONhdp9CfU+0VCo3ZlO7+SOeDI7NNOMg9rdifQ+yTCoHRq9TzEKaofIIZoJhkDtWMnfO7ljOtl7jx8AtaOXe4eG3DGpzOetDv8FubcvtOQ9cXX08nNvXajJeirf
6OXn3riQk/GQZPAvpnZsk+0MeMGLz71doSnXKfBil557q0JVppOChS489zaFrjynSYpcNrljtxwvWAN/JbHjahlOLhC36NwbE/KmP7tA3KJzb0vom3qHhtyR1cTfuB636NwbEkWY9juo4xadezuiDJN+K2/conNvRhRiwh14ckd+032XXdyic29DlGOyL7OLW3TuTYiCTPX1XnGLzr0FUZJpduDJHSIm+cqjuEXn3nwozBRfAhO36NxbD6WZ4Gsx4hade+OhOPHfixG36NzbDuUJ/6aAuEXn3nQoUPTfTsctOveWQ4mC/3g6btG5NxyKFPvX03GLzr3dUKbIN5zIHWrikiR3yIlLktwhJy5JcoecuCTJHXLikiR3yIlLktwhJy5JcoecuCTJHXLikiR3yIlLktwhJy5JcoecuCTJHXLikiR3yIlLktwhJy5JcoecuCTJHXLikiR3yIlLktwhJy5JcoecuCTJHXLikiR3yIlLktwhJy5JcoecuCTJHXLikiR3yIlLktwhJy5JcoecuCTJHXLikiR3yIlLktwhJy5JcoecuCTJHXLikiR3yIlLktwhJy5JcoecuCTJHXLikiR3yIlLktwhJy5JcoecuCTJHXLikiR3yIlLktwhJy5JcoecuCTJHXLikiR3yIlLktwhJy5JcoecuCTJHXLikiR3yIlLktwhJy5JcoecuCTJHXLikiR3yIlLktwhJy5JcoecuCTJHXLikiR3yIlLktwhJy5JcoecuCTJHXLikiR3yIlLktwhJy5JcoecuCTJHXLikiR3yIlLktwhJy5JcoecuCTJHXLikiR3yIlLktwhJy5JcoecuCTJHXLikiR3yIlLktwhJy5JcoecuCTJHXLikiR3yIlLktwhJy5JcoecuCTJHXLikiR3yIlLktwhJy5JcoecuCTJHXLikiR3yIlLktwhJy5JcoecuCTJHXLikiR3yIlLktwhJy5JcoecuCTJHXLikiR3yIlLktwhJy5JcoecuCTJHXLikiR3yIlLktwhJy5JcoecuCTJHXLikiR3yIlLktwhJy5JcoecuCTJHXLikiR3yIlLktwhJy5JcoecuCTJHXLikiR3yIlLktwhJy5JcoecuCTJHXLikiR3yIlLktwhJy5JcoecuCTJHXLikiR3yIlLktwhJy5JcoecuCTJHXLikiR3yIlLktwhJy7JyNwBNeQOI+QOI+QOI+QOI+QOI+QOI+QOI+QOI+QOI+QOI+QOI+QOI+QOI+QOI2Pmvvhu9nDExaFGHw9ns6fZfvuYuZ/Mrst98fpv70f8hSjP3rnv1dCIuX+YXZf7uy9mD8jd276579fQeLmn9bg69/OdHXJ3t2fuezY0Wu6/pdrJHdeoIvfFq9mM3HG9GnI/36GazeaPyB3XqSD34/TA/uC/35E7rlNH7vPn7xcXcl+8fpwe849erkZ3Mut8O8bvRJl25/7px8eHTTAv1jmP3dBIuT95s7rjDXL/92E3tPnXowwVNdiV+6cv+j7aYJYBDY2S+7s36b+bua9HNlutH7ljuTP3D4ezi8FENDTi20wbuae1mqenoMXr9K+fuxuw725ue+7/+8d5wk/+df6vxX++P//nZ2+XIQ1F5X7SjW/1buvT7gbkbm577ulh+5vBD007AQ1F5X68HlV/Oblje+6bZaSbpJ2VgIYCc+/umRs3IHdzNzgQOch97Iaick/PPvMnby7dgNzNXZf72e8/HLYvRQMaiso9/ZAc/fR+4wbkbm537me/fPns0fDIS0BDUbkPjqPeez7YASN3c7tyfzc47t4daBy/obDcV0ePNg6jkjt25b4+on704o/D7rj66A3F5b5Mbwt3o+XIDFa25/6heef0q1//fL9cv1RtjNtQaO7pwnffN2vy85LcsdyRe3qXaf5yeJPhu6YjNhSde7r8uH0qIndsz71/F6n/6cKHBMZqKCj3dHftx59+eLgkdyx35z44wn68yj2ioaDcN0bVjZvccYPc
P7bH3SMaCvzMTL9WJ4PjqORubmvu6cKHg3+vP/84ckNRua8/4naW/ox1NcLj5gXJ4q/xfidK83Hzk76rNpo3lB6kWprPQ7aZBzQU9lJ1Y7XaZ6qTwVsI8LQ199WByM7nXUjjNxR3ZObj+m2y++3nHpq7a87vTEN223Nvv7eljaX/KOToDUUeiHz3LK3DvebpqL3Jq3TJ38f7nSjNjtyXi9eP+lgGr1xHbohvAIYRcocRcocRcocRcocRcocRcocRcocRcocRcocRcocRcocRcocRcocRcocRcocRcocRcocRcocRcocRcocRcocRcocRcocRcocRcoeR2+R+vPFdZ/eOnv81uIZvssYuOuHcPffkyfv+GnLHLjrh7Jf77P7b7hpyxy464eyZe/uVv+SOK+iEc8vcB2M7W53i9dvL1wAbdMK5e+7tN8s/3HYNMKATzj65N6cY+ezttmuANZ1w9so9na+jOc0CueMKOuHslXtzGrSLo/704+O0bzY/ejG87VlzLpLHz4eXLZozkcyPXnJPqZtOOGM/un9anzxqNv+6v+Gr/rJv+v97ffapwYWokE44e+WeXnJs7oJ92DzR1NP12vW6UwSeDG/I2clqphPO3i9Vm0u6a9Ynfl0054P9rH8z4fyO+H65+Of6sjToeXo3efHboxknn6yaTjj75N6cM/Dp8Jo0lP755aQ7XWC6Xfsu2m/d/5Eu6xa2OF6fWBAV0gnnzrmv7oXtb2uv2TxhfRrYtxvDH9ziuL+3Llf37Y0zsqIqOuHs+yGCp5fX5+KoN04v3K5BGufgaeiEA5k10wlnz9wf9tdc+K1nv/9w2L68SKO/uId1ftnwWegDezM10wlnr9z740DDUZ/98uWzR8NX0x8Gr6oHw7zo4k1QDZ1w7pz7/PFXL4fXtKN+98WloWy7B24cTCL3yumEs9+7qpeuWY/m6MUfh+SORCeccXNvnmrmX/36Z/rh49WjXr++RuV0whk19/Syed4/Uw1GfekeeMJrUyM64Yyaexrf+qV0N9qNF9jtwaXNW6JuOuGMnfvgrnc823L4tF2FdHfmSLsLnXDicv/YHT7tP+2ZdE9GxxtPVOzbVE0nnFFzTwPt7o7NoC99zKG/RfpHf/Q1fSCCDxHUSyecUXNvPq/5IH2ubfW5iG43Kx09uv8yfYbtsL83NkeUnvS35ThNxXTCCTgQ2fm83/VaDN9n6J6JXg1vy65MzXTCGfltpvUfmtx/M7h983nl1fC2/VFKujHqpRPOyLkvF82fFt5rnmyGL0B2/8lhe2PUSyccvgEYRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRsgdRv4PjWH98XV74zAAAAAASUVORK5CYII=&#34; style=&#34;height:120px;&#34;&gt;&lt;/td&gt;
      &lt;td class=&#34;gt_row gt_left&#34; style=&#34;font-size: 13px; text-align: left; vertical-align: middle; font-style: italic;&#34;&gt;&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
  &lt;tfoot class=&#34;gt_sourcenotes&#34;&gt;
    &lt;tr&gt;
      &lt;td class=&#34;gt_sourcenote&#34; colspan=&#34;8&#34;&gt;&lt;strong&gt;Data:&lt;/strong&gt; DWTS Wikipedia Articles | &lt;strong&gt;Table Author:&lt;/strong&gt; JLaw&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tfoot&gt;
  
&lt;/table&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;so-what-is-the-most-successful-profession-in-dwts&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;So what is the most successful “profession” in DWTS?&lt;/h1&gt;
&lt;p&gt;It seems pretty clearly to be the athletes, as close to 14% of the Athletes have wound up winning. On the other end of the spectrum, the Media Personalities have fared less well, with the lowest winning percentage of any group with 10+ stars and nearly 1 in 4 coming in last place. Reality TV Stars, while in the middle of the pack, have been surging, with ex-Bachelorettes winning the last two seasons (28 and 29).&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>An Attempt at Tweaking the Electoral College</title>
      <link>https://jlaw.netlify.app/2020/11/16/an-attempt-at-tweaking-the-electoral-college/</link>
      <pubDate>Mon, 16 Nov 2020 00:00:00 +0000</pubDate>
      <guid>https://jlaw.netlify.app/2020/11/16/an-attempt-at-tweaking-the-electoral-college/</guid>
      <description>
&lt;script src=&#34;index_files/kePrint/kePrint.js&#34;&gt;&lt;/script&gt;
&lt;link href=&#34;index_files/lightable/lightable.css&#34; rel=&#34;stylesheet&#34; /&gt;


&lt;div id=&#34;motivation&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Motivation&lt;/h1&gt;
&lt;p&gt;With the 2020 Election wrapping up and a renewed discussion about the merits of the Electoral College I’ve been thinking more about the system and why it might be the way it is. While I understand the rationale for why doing a complete popular vote would have unintended consequences, I personally feel that the current system has overly valued small states by virtue of having a minimum of 3 electoral votes. &lt;strong&gt;My personal hypothesis is that we have too many states.&lt;/strong&gt; Therefore, my solution would be to start combining the small states so that they meet a minimum threshold of the US population. I fully recognize that this would be &lt;em&gt;completely infeasible&lt;/em&gt; in practice… but this is just a humble blog. So this analysis will attempt to accomplish three things:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;&lt;p&gt;When comparing the population from 1792 vs. 2020, do states generally represent smaller percentages of the US Population? (Do we have too many states from an Electoral College perspective?)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;How could a new system be devised by combining states to reach a minimum population threshold?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Would this new system have impacted the results of the 2016 election? (At the time of writing, votes for the 2020 election are still being counted).&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;div id=&#34;gathering-data&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Gathering Data&lt;/h1&gt;
&lt;p&gt;Throughout this post, a number of different libraries will be used, as outputs will include plots, maps, and tables:&lt;/p&gt;
&lt;div id=&#34;loading-libraries&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Loading Libraries&lt;/h2&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(rvest) #Web-Scraping
library(tidyverse) #Data Cleaning and Plotting
library(janitor) #Data Cleaning 
library(sf) #Manipulate Geographic Objects
library(httr) #Used to Download Excel File from Web
library(readxl) #Read in Excel Files
library(kableExtra) #Create HTML Tables&lt;/code&gt;&lt;/pre&gt;
&lt;div id=&#34;getting-the-us-population-by-state-in-1790&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Getting the US Population by State in 1790&lt;/h3&gt;
&lt;p&gt;Data from the 1790 US Census will be gathered from &lt;a href=&#34;https://en.wikipedia.org/wiki/1790_United_States_Census&#34;&gt;Wikipedia&lt;/a&gt; and scraped using the &lt;code&gt;rvest&lt;/code&gt; package. In the following code block, all &lt;em&gt;table&lt;/em&gt; tags will be extracted from the webpage, and then I guessed and checked until I found the table I was looking for (in this case, the 3rd table). The &lt;code&gt;html_table()&lt;/code&gt; function converts the HTML table into a data frame and &lt;code&gt;clean_names()&lt;/code&gt; from the &lt;code&gt;janitor&lt;/code&gt; package will change the column headers into an R friendly format.&lt;/p&gt;
&lt;p&gt;Finally, &lt;code&gt;stringr::str_remove_all()&lt;/code&gt; will use regular expressions to remove the footnote notation “[X]” from the totals and &lt;code&gt;readr::parse_number()&lt;/code&gt; will convert the character variable with commas into a numeric.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;us_pop_1790 &amp;lt;- read_html(&amp;#39;https://en.wikipedia.org/wiki/1790_United_States_Census&amp;#39;) %&amp;gt;%
  html_nodes(&amp;quot;table&amp;quot;) %&amp;gt;% 
  .[[3]] %&amp;gt;% 
  html_table() %&amp;gt;% 
  clean_names() %&amp;gt;% 
  filter(state_or_territory != &amp;#39;Total&amp;#39;) %&amp;gt;% 
  transmute(
    state = state_or_territory,
    population_1790 = str_remove_all(total, &amp;#39;\\[.+\\]&amp;#39;) %&amp;gt;% 
      parse_number(),
    population_percent_1790 = population_1790/sum(population_1790)
  )&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;getting-us-population-by-state-in-2019&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Getting US Population by State in 2019&lt;/h3&gt;
&lt;p&gt;A similar process will be used to get the population estimates for 2019 from &lt;a href=&#34;https://simple.wikipedia.org/wiki/List_of_U.S._states_by_population&#34;&gt;Wikipedia&lt;/a&gt;. In this case there is only 1 table on the page so &lt;code&gt;html_node(&#39;table&#39;)&lt;/code&gt; can be used rather than &lt;code&gt;html_nodes(&#39;table&#39;)&lt;/code&gt; like in the above code block for 1790.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;us_pop_2019 &amp;lt;- read_html(&amp;#39;https://simple.wikipedia.org/wiki/List_of_U.S._states_by_population&amp;#39;) %&amp;gt;% 
  html_node(&amp;#39;table&amp;#39;) %&amp;gt;% 
  html_table() %&amp;gt;% 
  clean_names() %&amp;gt;% 
  filter(!is.na(estimated_population_per_electoral_vote_2019_note_2),
         !estimated_population_per_electoral_vote_2019_note_2 %in% c(&amp;#39;&amp;#39;, &amp;#39;—&amp;#39;),
         rank_in_states_territories_2010 != &amp;#39;—&amp;#39;) %&amp;gt;%
  transmute(
    state,
    population_2019 = parse_number(population_estimate_july_1_2019_2),
    population_percent_2019 = population_2019 / sum(population_2019)
    )&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;getting-of-electoral-votes-for-each-state-by-year&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Getting # of Electoral Votes for Each State by Year&lt;/h3&gt;
&lt;p&gt;Finally, the table containing the number of electoral votes by state and year will be extracted from Wikipedia. New code pieces for this code block are selecting columns by number in the &lt;code&gt;dplyr::select()&lt;/code&gt; and &lt;code&gt;dplyr::rename()&lt;/code&gt; calls, as well as &lt;code&gt;dplyr::across()&lt;/code&gt;, which in this context is a replacement for &lt;code&gt;mutate_if&lt;/code&gt;, &lt;code&gt;mutate_at&lt;/code&gt;, and &lt;code&gt;mutate_all&lt;/code&gt;. Here I tell &lt;code&gt;mutate()&lt;/code&gt; to take all variables that start with &lt;em&gt;“electoral_votes”&lt;/em&gt; and apply the &lt;code&gt;readr::parse_number()&lt;/code&gt; function to them, keeping the names the same. We’ll use this data set later on.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;electoral_votes &amp;lt;- read_html(&amp;#39;https://en.wikipedia.org/wiki/United_States_Electoral_College&amp;#39;) %&amp;gt;% 
  html_nodes(&amp;quot;table&amp;quot;) %&amp;gt;% 
  .[[5]] %&amp;gt;% 
  html_table(fill = T) %&amp;gt;% 
  select(2, 4, 36) %&amp;gt;% 
  filter(!Electionyear %in% c(&amp;#39;Total&amp;#39;, &amp;#39;Electionyear&amp;#39;, &amp;quot;State&amp;quot;)) %&amp;gt;% 
  rename(state = 1, electoral_votes_1792 = 2, electoral_votes_2020 = 3) %&amp;gt;% 
  mutate(across(starts_with(&amp;#39;electoral_votes&amp;#39;), parse_number))&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;q1-do-states-today-represent-smaller-proportions-of-the-population-than-they-did-when-the-electoral-college-was-formed&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Q1: Do states today represent smaller proportions of the population than they did when the Electoral College was formed?&lt;/h2&gt;
&lt;p&gt;My hypothesis is that the electoral college has become less effective because we’ve added too many small states that reflect minor amounts of the US population and that when the Electoral College was established the population distributions of states were more similar.&lt;/p&gt;
&lt;p&gt;To check this I’ll be comparing the distributions of State populations as a % of the Total US Population for 1790 and 2019. One note before getting into the code is that in the article for the 1790 state population, Maine is given its own row. However, Maine was a part of Massachusetts until 1820, and since we’re more focused on “voting blocks” rather than states, I will merge Maine into Massachusetts.&lt;/p&gt;
&lt;p&gt;For this next code block, I join the two population data sets together, summarizing all numeric variables so that the Maine rows roll up into Massachusetts. Then, I melt the population percentages by year into a long-form data frame. Finally, I extract the numeric year from the variable names and compare the box plots of the % of Total Population for each State from 1790 and 2019.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;us_pop_2019 %&amp;gt;% 
  left_join(
    us_pop_1790 %&amp;gt;% 
      mutate(state = if_else(state == &amp;#39;Maine&amp;#39;, &amp;#39;Massachusetts&amp;#39;, state)) %&amp;gt;% 
      group_by(state) %&amp;gt;% 
      summarize(across(where(is.numeric), sum)),
    by = &amp;quot;state&amp;quot;
  ) %&amp;gt;% 
  pivot_longer(
    cols = c(contains(&amp;quot;percent&amp;quot;)),
    names_to = &amp;quot;year&amp;quot;,
    values_to = &amp;quot;population_dist&amp;quot;
  ) %&amp;gt;% 
  mutate(year = str_extract(year, &amp;#39;\\d+&amp;#39;) %&amp;gt;% as.integer) %&amp;gt;% 
  ggplot(aes(x = fct_rev(factor(year)), y = population_dist, 
             fill = factor(year))) + 
    geom_boxplot() + 
    labs(x = &amp;quot;Year&amp;quot;, y = &amp;quot;Population Distribution&amp;quot;, 
         title = &amp;quot;State Population Distribution by % of US Population&amp;quot;) +
    annotate(&amp;#39;linerange&amp;#39;, y = 1/nrow(us_pop_2019), 
             xmin = .6, xmax = 1.45, lty = 2) + 
    annotate(&amp;#39;linerange&amp;#39;, y = 1/(nrow(us_pop_1790)-1), 
             xmin = 1.6, xmax = 2.45, lty = 2) + 
    scale_y_continuous(label = scales::percent_format(accuracy = 1)) + 
    scale_fill_discrete(guide = F) +
    coord_flip() +
    theme_bw()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;index_files/figure-html/population_changes-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;In the chart above we’re looking at the distribution of states by the % of the total US population they make up. The dashed lines represent the expected values if all states had the same amount. For example, there are 51 “voting bodies” that make up 100% of the US population, so the “expected” amount would be 1/51 or 2.0%. In 1790, the largest state made up 19.2% and the smallest state made up 1.5% of the total population. In 2019, the largest state makes up 12% of the total population and the smallest makes up 0.2% of the total population.&lt;/p&gt;
&lt;p&gt;Some of this is due to having more states, which means the same 100% is being cut into more pieces. Another way to see whether states make up smaller pieces of the population today than in the past is to compare the data to those expected values from before. In the case of 1790, there are 15 voting bodies, so on average we’d expect each state to make up 6.7%. Looking at the distribution of the states in 1790, 60% are below the expected amount of 6.7%, compared with the distribution in 2019, where 67% are below the expected amount of 2.0%.&lt;/p&gt;
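&lt;p&gt;As a quick illustrative check (a sketch, not part of the original analysis), the equal-share “expected” values referenced above can be computed directly from the counts of voting bodies:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Equal-share benchmark: 51 voting bodies in 2019, 15 in 1790
scales::percent(1/51, accuracy = 0.1) # ~2.0%
scales::percent(1/15, accuracy = 0.1) # ~6.7%&lt;/code&gt;&lt;/pre&gt;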
&lt;p&gt;When asking whether or not there are more small states in 2019 vs. 1790, I find that 28 of the 51 states (with DC) [55%] have a % of the US Population smaller than the minimum state from 1790 [1.5%]. These 28 states make up 141 or 26% of the 538 electoral votes.&lt;/p&gt;
&lt;p&gt;So while there’s not a large difference between actual and expected it does seem that we have a greater concentration of smaller population states now than when the electoral college was first established based on the concentration that make up less than 1.5% of the US population.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;q2.-how-could-states-be-combined-to-ensure-each-voting-group-meets-a-minimum-population-threshold&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Q2. How could states be combined to ensure each “voting group” meets a minimum population threshold?&lt;/h2&gt;
&lt;p&gt;The fact that 55% of states have a % of 2019 US Population smaller than the smallest percentage in 1790 gives promise to the idea that combining states could be feasible. So for this exercise, &lt;strong&gt;I’ll combine states together in order to ensure that each group has at least a minimum of 1.5% of the US Population&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Originally I had wanted to come up with a cool algorithm to find the optimal solution to ensure that each state group hit the 1.5% while taking into account the location of the states being combined and the political culture of the states… but alas I couldn’t figure out how to do it. So I combined the states manually taking into account geography but completely ignoring how states usually vote. In my new construction the following states get combined:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Alaska &amp;amp; Oregon&lt;/li&gt;
&lt;li&gt;Arkansas &amp;amp; Mississippi&lt;/li&gt;
&lt;li&gt;Connecticut &amp;amp; Rhode Island&lt;/li&gt;
&lt;li&gt;Washington DC, Delaware, and West Virginia&lt;/li&gt;
&lt;li&gt;Hawaii &amp;amp; Nevada&lt;/li&gt;
&lt;li&gt;Iowa &amp;amp; Nebraska&lt;/li&gt;
&lt;li&gt;Idaho, Montana, North Dakota, South Dakota, and Wyoming&lt;/li&gt;
&lt;li&gt;Kansas &amp;amp; Oklahoma&lt;/li&gt;
&lt;li&gt;New Hampshire, Maine, and Vermont&lt;/li&gt;
&lt;li&gt;New Mexico &amp;amp; Utah&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;new_groupings &amp;lt;- us_pop_2019 %&amp;gt;% 
  mutate(
    state = if_else(state == &amp;#39;D.C.&amp;#39;, &amp;#39;District of Columbia&amp;#39;, state),
    new_grouping = case_when(
      state %in% c(&amp;#39;New Hampshire&amp;#39;, &amp;#39;Maine&amp;#39;, &amp;#39;Vermont&amp;#39;) ~ &amp;#39;NH/ME/VT&amp;#39;,
      state %in% c(&amp;#39;Rhode Island&amp;#39;, &amp;#39;Connecticut&amp;#39;) ~ &amp;#39;CT/RI&amp;#39;,
      state %in% c(&amp;#39;West Virginia&amp;#39;, &amp;#39;Delaware&amp;#39;, &amp;#39;District of Columbia&amp;#39;) ~ 
        &amp;#39;DC/DE/WV&amp;#39;,
      state %in% c(&amp;#39;Alaska&amp;#39;, &amp;#39;Oregon&amp;#39;) ~ &amp;#39;AK/OR&amp;#39;,
      state %in% c(&amp;#39;Utah&amp;#39;, &amp;#39;New Mexico&amp;#39;) ~ &amp;#39;NM/UT&amp;#39;,
      state %in% c(&amp;#39;Hawaii&amp;#39;, &amp;#39;Nevada&amp;#39;) ~ &amp;#39;HI/NV&amp;#39;,
      state %in% c(&amp;#39;Idaho&amp;#39;, &amp;#39;Montana&amp;#39;, &amp;#39;North Dakota&amp;#39;, 
                   &amp;#39;South Dakota&amp;#39;, &amp;#39;Wyoming&amp;#39;) ~ &amp;#39;ID/MT/ND/SD/WY&amp;#39;,
      state %in% c(&amp;#39;Iowa&amp;#39;, &amp;#39;Nebraska&amp;#39;) ~ &amp;#39;IA/NE&amp;#39;,
      state %in% c(&amp;#39;Arkansas&amp;#39;, &amp;#39;Mississippi&amp;#39;) ~ &amp;#39;AR/MS&amp;#39;,
      state %in% c(&amp;#39;Oklahoma&amp;#39;, &amp;#39;Kansas&amp;#39;) ~ &amp;#39;KS/OK&amp;#39;,
      TRUE ~ state
    )
  )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To display this brave new world, I will construct a map that shows my new compressed electoral map and the resulting changes in the number of electoral votes. The first step is adding the electoral votes into the data frame constructed in the last code block:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;new_groupings &amp;lt;- new_groupings %&amp;gt;% 
  left_join(
    electoral_votes %&amp;gt;% 
      transmute(state = if_else(state == &amp;#39;D.C.&amp;#39;, &amp;#39;District of Columbia&amp;#39;, state),
                electoral_votes_2020),
    by = &amp;quot;state&amp;quot;
  ) &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Next, I need a mechanism to assign a number of electoral votes to my compressed map. Normally, there are 538 electoral votes representing the 435 voting members of the House of Representatives, the 100 Senators, and 3 additional electoral votes for Washington DC. Since I’m not trying to rock the boat too much, my new system will maintain the 2 votes per group represented by the Senate allocation plus the population-based allocation from the House side. In order to understand and apply this relationship, I’m building a quick and dirty linear regression model to predict the population component of the new number of electoral votes.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;electorial_vote_model &amp;lt;- lm(electoral_votes_2020-2 ~ population_2019, 
                            data = new_groupings)

electorial_vote_model&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 
## Call:
## lm(formula = electoral_votes_2020 - 2 ~ population_2019, data = new_groupings)
## 
## Coefficients:
##     (Intercept)  population_2019  
##     0.094428506      0.000001313&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This model shows that there are roughly 1.313 electoral votes per 1 million people.&lt;/p&gt;
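&lt;p&gt;As a rough sanity check (illustrative, not part of the original post), applying the fitted coefficients by hand to the combined Arkansas/Mississippi population of 5,993,974 and adding back the 2 Senate-style votes gives the allocation that appears for that group later on:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Hand-applied model: intercept + slope * population, plus 2 Senate-style votes
ceiling(0.094428506 + 0.000001313 * 5993974 + 2)
## [1] 10&lt;/code&gt;&lt;/pre&gt;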
&lt;p&gt;To visualize what this new electoral map will look like, I will use the &lt;code&gt;sf&lt;/code&gt; package. While I’m not very familiar with this package (maybe a subject of a future post), I’ve tinkered around with the format before and have found it very compatible with tidy principles.&lt;/p&gt;
&lt;p&gt;The first step is getting a shape file. For the United States, I will leverage the &lt;code&gt;usa_sf&lt;/code&gt; function from the &lt;code&gt;albersusa&lt;/code&gt; package, which will return a map as a simple feature. The “laea” argument specifies the Lambert azimuthal equal-area projection.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;usa &amp;lt;-  albersusa::usa_sf(&amp;quot;laea&amp;quot;) %&amp;gt;% select(name, geometry)

knitr::kable(head(usa))&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th style=&#34;text-align:left;&#34;&gt;
name
&lt;/th&gt;
&lt;th style=&#34;text-align:left;&#34;&gt;
geometry
&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Arizona
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
MULTIPOLYGON (((-1111066 -8…
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Arkansas
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
MULTIPOLYGON (((557903.1 -1…
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
California
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
MULTIPOLYGON (((-1853480 -9…
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Colorado
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
MULTIPOLYGON (((-613452.9 -…
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Connecticut
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
MULTIPOLYGON (((2226838 519…
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
District of Columbia
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
MULTIPOLYGON (((1960720 -41…
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;What makes the magic of the &lt;code&gt;sf&lt;/code&gt; class is that the shape information is contained in the geometry column, but everything else can be operated on like a normal data frame. So for the next step, I’ll join the “state groupings” information to this shape file data using the “name” column from the shape data and the state column from the groupings data.&lt;/p&gt;
&lt;p&gt;Next, I summarize the data to the “combined state groupings” level, where I get the sums of the population and the number of original electoral votes. The notable parts of this code block are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;st_union&lt;/code&gt; which will combine geographic areas from the shape file into new shapes. If you wanted to combine the groups but maintain all original boundaries then &lt;code&gt;st_combine&lt;/code&gt; would be used instead.&lt;/li&gt;
&lt;li&gt;Creating a better label for the combined state names by using &lt;code&gt;paste&lt;/code&gt; in the summarize with the &lt;code&gt;collapse&lt;/code&gt; option which concatenates the states in the aggregation.&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;The final mutate step uses the &lt;code&gt;predict&lt;/code&gt; function to apply the regression model to compute the new electoral vote values for the combined states. Any state that wasn’t combined retained its original number of votes.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Afterwards, the new data set looks like:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;new_usa &amp;lt;- usa %&amp;gt;% 
  left_join(new_groupings %&amp;gt;% 
              transmute(state, 
                        new_grouping, 
                        population_2019, 
                        electoral_votes_2020
                        ), 
            by = c(&amp;quot;name&amp;quot; = &amp;quot;state&amp;quot;)
  ) %&amp;gt;% 
  group_by(new_grouping) %&amp;gt;% 
  summarize(
    geom = st_union(geometry),
    population_2019 = sum(population_2019),
    electoral_votes = sum(electoral_votes_2020),
    states = paste(name, collapse = &amp;#39;/&amp;#39;)
  ) %&amp;gt;% 
  mutate(
    new_ev = if_else(
      states == new_grouping,
      electoral_votes,
      ceiling(predict(electorial_vote_model, newdata = .) + 2)
    ),
    lbl = if_else(new_grouping == states, NA_character_, 
                  paste0(new_grouping, &amp;quot;: &amp;quot;, new_ev - electoral_votes)))

knitr::kable(head(new_usa))&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th style=&#34;text-align:left;&#34;&gt;
new_grouping
&lt;/th&gt;
&lt;th style=&#34;text-align:left;&#34;&gt;
geom
&lt;/th&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
population_2019
&lt;/th&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
electoral_votes
&lt;/th&gt;
&lt;th style=&#34;text-align:left;&#34;&gt;
states
&lt;/th&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
new_ev
&lt;/th&gt;
&lt;th style=&#34;text-align:left;&#34;&gt;
lbl
&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
AK/OR
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
MULTIPOLYGON (((-1899337 -2…
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
4949282
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
10
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Oregon/Alaska
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
9
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
AK/OR: -1
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Alabama
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
MULTIPOLYGON (((1145349 -15…
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
4903185
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
9
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Alabama
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
9
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
NA
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
AR/MS
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
MULTIPOLYGON (((1052956 -15…
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
5993974
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
12
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Arkansas/Mississippi
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
10
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
AR/MS: -2
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Arizona
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
MULTIPOLYGON (((-1111066 -8…
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
7278717
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
11
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Arizona
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
11
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
NA
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
California
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
MULTIPOLYGON (((-1853480 -9…
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
39512223
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
55
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
California
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
55
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
NA
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Colorado
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
MULTIPOLYGON (((-613452.9 -…
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
5758736
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
9
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Colorado
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
9
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
NA
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Now we’re ready to plot the map. Plotting &lt;code&gt;sf&lt;/code&gt; geometries works within the &lt;code&gt;ggplot&lt;/code&gt; paradigm: &lt;code&gt;geom_sf&lt;/code&gt; draws the geometries and &lt;code&gt;geom_sf_text&lt;/code&gt; handles the text overlays for the given groups. &lt;code&gt;coord_sf&lt;/code&gt; changes the coordinate system of the plot. Everything else should be familiar from vanilla ggplot.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;new_usa %&amp;gt;% 
ggplot() +
  geom_sf(color = &amp;quot;#2b2b2b&amp;quot;, size=0.125, aes(fill = lbl)) +
  geom_sf_text(aes(label = lbl), check_overlap = T, size = 3) + 
  coord_sf(crs = st_crs(&amp;quot;+proj=laea +lat_0=45 +lon_0=-100 +x_0=0 +y_0=0 +a=6370997 +b=6370997 +units=m +no_defs&amp;quot;), datum = NA) +
  scale_fill_discrete(guide = F, na.value = &amp;quot;grey90&amp;quot;) + 
  labs(title = &amp;quot;Proposed Electoral Map&amp;quot;,
       subtitle = &amp;quot;Combining States so each &amp;#39;Group&amp;#39; makes up at least ~1.5% of US Population&amp;quot;,
       caption = &amp;quot;Number represents the change in Electoral Votes due to combining&amp;quot;) + 
  ggthemes::theme_map() + 
  theme(
    plot.title = element_text(size = 14)
  )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;index_files/figure-html/build_map-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;The states in gray remain unchanged, and the filled-in states represent our new groupings. States that directly border each other have been combined into an “electoral grouping” with a newly assigned number of electoral votes. Since the electoral vote model was based on population, the change in the number of electoral votes comes primarily from the loss of the two Senate votes for each combined state.&lt;/p&gt;
&lt;p&gt;For example, NH/ME/VT originally would have had 11 electoral votes; under the new system it has 7, a net change of -4 from losing the automatic 2 Senate votes of two of the three combined states.&lt;/p&gt;
&lt;p&gt;Under the normal Electoral College there are 538 votes; under this new system that number is reduced to 512.&lt;/p&gt;
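&lt;p&gt;As a back-of-the-envelope check (not the regression-based allocation used above, which predicts votes from population), combining k states into one grouping keeps only a single pair of “Senate” votes, so roughly 2*(k-1) electoral votes disappear:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;## Rough sketch only; the actual new_ev values come from the population model
ev_after_combining &amp;lt;- function(original_evs) {
  sum(original_evs) - 2 * (length(original_evs) - 1)
}

ev_after_combining(c(4, 4, 3)) # NH/ME/VT: 11 votes become 7&lt;/code&gt;&lt;/pre&gt;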
&lt;p&gt;Now that we have our new electoral college, would it have made a difference in 2016?&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;q3-would-this-new-system-have-impacted-the-results-of-the-2016-election&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Q3: Would this new system have impacted the results of the 2016 election?&lt;/h2&gt;
&lt;p&gt;The 2016 election results between Donald Trump and Hillary Clinton are provided in great detail by the &lt;a href=&#34;https://www.fec.gov/documents/1890/federalelections2016.xlsx&#34;&gt;Federal Election Commission&lt;/a&gt;. Surprisingly, it was difficult to find the number of votes by state in an easily consumable format that wouldn’t require recoding all the state names. So the FEC data will have to do, even if it took some complicated data manipulation.&lt;/p&gt;
&lt;p&gt;Since the FEC data comes from an Excel file, I first need to download the file from the FEC website. I’ll use the &lt;code&gt;GET&lt;/code&gt; function from &lt;code&gt;httr&lt;/code&gt; to download the Excel file to a temporary file and then will use &lt;code&gt;read_excel&lt;/code&gt; from &lt;code&gt;readxl&lt;/code&gt; to read in the file.&lt;/p&gt;
&lt;p&gt;Before data manipulation, but after filtering to just Trump and Clinton, the data looks like this:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;GET(&amp;quot;https://www.fec.gov/documents/1890/federalelections2016.xlsx&amp;quot;, 
    write_disk(tf &amp;lt;- tempfile(fileext = &amp;quot;.xlsx&amp;quot;)))&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;results2016 &amp;lt;- read_excel(tf, sheet = &amp;#39;2016 Pres General Results&amp;#39;) %&amp;gt;% 
  clean_names() %&amp;gt;% 
  filter(last_name %in% c(&amp;#39;Trump&amp;#39;, &amp;#39;Clinton&amp;#39;)) %&amp;gt;% 
  select(state, state_abbreviation, last_name, general_results)

knitr::kable(head(results2016, 5))&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th style=&#34;text-align:left;&#34;&gt;
state
&lt;/th&gt;
&lt;th style=&#34;text-align:left;&#34;&gt;
state_abbreviation
&lt;/th&gt;
&lt;th style=&#34;text-align:left;&#34;&gt;
last_name
&lt;/th&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
general_results
&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Alabama
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
AL
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Trump
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
1318255
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Alabama
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
AL
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Clinton
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
729547
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Alaska
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
AK
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Trump
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
163387
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Alaska
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
AK
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Clinton
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
116454
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Arizona
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
AZ
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Trump
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
1252401
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;There was a small data quirk with New York State: because the same candidate can appear on multiple party lines, a single candidate shows up in multiple rows (Clinton appears 4 times and Trump 3). Therefore a first group-by is done to reduce the data to 2 rows per state. Then the data is cast to a wider format, and the electoral votes are added back and allocated to the winning candidate (technically this is wrong since Nebraska and Maine do not use all-or-nothing allocation, but it’s close enough for this exercise).&lt;/p&gt;
&lt;p&gt;Then the data is aggregated to the new electoral groupings from the prior section, and our “new” electoral votes are allocated in an all-or-nothing fashion to the winning candidate.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;results2016 &amp;lt;- results2016 %&amp;gt;% 
  group_by(state, state_abbreviation, last_name) %&amp;gt;% 
  summarize(general_results = sum(general_results, na.rm = T), 
            .groups = &amp;#39;drop&amp;#39;) %&amp;gt;% 
  pivot_wider(
    names_from = &amp;quot;last_name&amp;quot;,
    values_from = &amp;quot;general_results&amp;quot;
  ) %&amp;gt;% 
  left_join(
    new_groupings %&amp;gt;% 
      select(state, new_grouping, electoral_votes_2020, population_2019),
    by = &amp;quot;state&amp;quot;
  ) %&amp;gt;% 
  mutate(trump_ev = (Trump &amp;gt; Clinton)*electoral_votes_2020,
         clinton_ev = (Clinton &amp;gt; Trump)*electoral_votes_2020
  ) %&amp;gt;% 
  group_by(new_grouping) %&amp;gt;% 
  summarize(across(where(is.numeric), sum, na.rm = T),
            states = paste(state, collapse = &amp;#39;/&amp;#39;)) %&amp;gt;% 
  mutate(new_ev = if_else(
              states == new_grouping,
              electoral_votes_2020,
              ceiling(predict(electorial_vote_model, newdata = .) + 2)
            )) %&amp;gt;% 
  mutate(
    new_trump_ev = if_else(Trump &amp;gt; Clinton, new_ev, 0),
    new_clinton_ev = if_else(Trump &amp;lt; Clinton, new_ev, 0)
  )

knitr::kable(head(results2016, 5))&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th style=&#34;text-align:left;&#34;&gt;
new_grouping
&lt;/th&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
Clinton
&lt;/th&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
Trump
&lt;/th&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
electoral_votes_2020
&lt;/th&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
population_2019
&lt;/th&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
trump_ev
&lt;/th&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
clinton_ev
&lt;/th&gt;
&lt;th style=&#34;text-align:left;&#34;&gt;
states
&lt;/th&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
new_ev
&lt;/th&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
new_trump_ev
&lt;/th&gt;
&lt;th style=&#34;text-align:right;&#34;&gt;
new_clinton_ev
&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
AK/OR
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
1118560
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
945790
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
10
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
4949282
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
3
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
7
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Alaska/Oregon
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
9
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
9
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Alabama
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
729547
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
1318255
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
9
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
4903185
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
9
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Alabama
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
9
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
9
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
AR/MS
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
865625
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
1385586
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
12
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
5993974
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
12
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Arkansas/Mississippi
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
10
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
10
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Arizona
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
1161167
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
1252401
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
11
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
7278717
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
11
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
Arizona
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
11
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
11
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
California
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
8753792
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
4483814
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
55
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
39512223
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
55
&lt;/td&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
California
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
55
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
0
&lt;/td&gt;
&lt;td style=&#34;text-align:right;&#34;&gt;
55
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Finally, to visualize the difference in electoral votes between the actual 2016 results and our new 2016 results, the prior data set is summarized and reshaped back into a tidy format with proper labeling. The plot itself is a simple stacked barplot.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;results2016 %&amp;gt;% 
  summarize(across(contains(c(&amp;quot;trump_ev&amp;quot;, &amp;quot;clinton_ev&amp;quot;)), sum)) %&amp;gt;% 
  pivot_longer(cols = everything(),
               names_to = &amp;#39;variable&amp;#39;,
               values_to = &amp;#39;electoral_votes&amp;#39;) %&amp;gt;% 
  group_by(str_detect(variable, &amp;#39;new&amp;#39;)) %&amp;gt;% 
  mutate(
    percents = electoral_votes/sum(electoral_votes),
    old_v_new = if_else(str_detect(variable, &amp;#39;new&amp;#39;), &amp;#39;New EC&amp;#39;, &amp;#39;Original EC&amp;#39;),
    candidate = case_when(
       str_detect(variable, &amp;#39;trump&amp;#39;) ~ &amp;quot;trump&amp;quot;,
       str_detect(variable, &amp;#39;clinton&amp;#39;) ~ &amp;#39;clinton&amp;#39;,
       TRUE ~ &amp;#39;total&amp;#39;
     ),
    lbl = paste0(electoral_votes, 
                 &amp;#39;\n(&amp;#39;, 
                 scales::percent(percents, accuracy = .1) ,&amp;#39;)&amp;#39;)
  ) %&amp;gt;% 
   ggplot(aes(y = old_v_new, x = percents, fill = candidate)) +
    geom_col(width = .5) +
    geom_text(aes(label = lbl), position = position_stack(vjust = .5)) + 
    geom_vline(xintercept = .5, lty = 2) + 
    scale_x_continuous(label = scales::percent, expand = c(0,0)) + 
    scale_fill_manual(values = c(&amp;#39;clinton&amp;#39; = &amp;#39;blue&amp;#39;, &amp;#39;trump&amp;#39; = &amp;#39;red&amp;#39;)) + 
    guides(fill = guide_legend(reverse = T)) + 
    labs(x = &amp;quot;% of Electoral Vote&amp;quot;,
         y = &amp;quot;&amp;quot;,
         title = &amp;quot;Comparing 2016 Election Results in the Original vs. New System&amp;quot;,
         fill = &amp;quot;&amp;quot;) + 
    cowplot::theme_cowplot() + 
    theme(
      plot.title.position = &amp;#39;plot&amp;#39;,
      axis.line = element_blank(),
      axis.ticks.x = element_blank(),
      axis.text.x = element_blank()
    )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;index_files/figure-html/election_pt3-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;With the new electoral grouping system, the net change in the percentage of electoral votes was only 0.3%, so the overall result wouldn’t have changed.&lt;/p&gt;
&lt;div id=&#34;what-actually-changed-in-the-new-system&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;What Actually Changed in the New System?&lt;/h3&gt;
&lt;p&gt;The final question is &lt;strong&gt;how did the electoral votes change between the old system and the new system?&lt;/strong&gt; The &lt;code&gt;tbl_dt&lt;/code&gt; data frame restructures the data into the table format, keeping only rows for groupings where the number of electoral votes differs, and creates labels that include the “+” and “-” symbols.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;tbl_dt &amp;lt;- results2016 %&amp;gt;% 
  filter(trump_ev != new_trump_ev | clinton_ev != new_clinton_ev) %&amp;gt;% 
  transmute(
    new_grouping,
    clinton_delta = (new_clinton_ev - clinton_ev),
    trump_delta = (new_trump_ev - trump_ev),
    clinton_lbl = paste0(
      if_else(clinton_delta &amp;gt; 0, &amp;quot;+&amp;quot;, &amp;quot;&amp;quot;),
      clinton_delta
    ),
    trump_lbl = paste0(
      if_else(trump_delta &amp;gt; 0, &amp;quot;+&amp;quot;, &amp;quot;&amp;quot;),
      trump_delta
    )
  ) %&amp;gt;%
  select(new_grouping, clinton_lbl, trump_lbl)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To complete the table visualization I’m using the &lt;code&gt;kableExtra&lt;/code&gt; package. The &lt;code&gt;kable_paper&lt;/code&gt; function applies a style preset, and the two uses of &lt;code&gt;column_spec&lt;/code&gt; set the cell background to red or green when the label constructed above is non-zero, and white otherwise (which appears blank). This was my first experience with &lt;code&gt;kableExtra&lt;/code&gt;, and while I’m happy I was able to get the output the way I wanted, I found certain parts of the syntax a little frustrating.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;tbl_dt %&amp;gt;% 
  kbl(align = c(&amp;#39;l&amp;#39;, &amp;#39;c&amp;#39;, &amp;#39;c&amp;#39;),
      col.names = c(&amp;#39;&amp;#39;, &amp;#39;Clinton&amp;#39;, &amp;#39;Trump&amp;#39;),
      caption = &amp;quot;Election 2016: Candidate&amp;#39;s Change in Electoral Votes&amp;quot;) %&amp;gt;% 
  kable_paper(full_width = F) %&amp;gt;% 
  column_spec(2, color = &amp;#39;white&amp;#39;, background = case_when(
    str_detect(tbl_dt$clinton_lbl, &amp;quot;\\+&amp;quot;) ~ &amp;#39;green&amp;#39;,
    str_detect(tbl_dt$clinton_lbl, &amp;quot;\\-&amp;quot;) ~ &amp;#39;red&amp;#39;,
    TRUE ~ &amp;#39;white&amp;#39;
    )
  ) %&amp;gt;% 
  column_spec(3, color = &amp;#39;white&amp;#39;, background = case_when(
    str_detect(tbl_dt$trump_lbl, &amp;quot;\\+&amp;quot;) ~ &amp;#39;green&amp;#39;,
    str_detect(tbl_dt$trump_lbl, &amp;quot;\\-&amp;quot;) ~ &amp;#39;red&amp;#39;,
    TRUE ~ &amp;#39;white&amp;#39;
    )
  )&lt;/code&gt;&lt;/pre&gt;
&lt;table class=&#34; lightable-paper&#34; style=&#39;font-family: &#34;Arial Narrow&#34;, arial, helvetica, sans-serif; width: auto !important; margin-left: auto; margin-right: auto;&#39;&gt;
&lt;caption&gt;
&lt;span id=&#34;tab:tabletime&#34;&gt;Table 1: &lt;/span&gt;Election 2016: Candidate’s Change in Electoral Votes
&lt;/caption&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th style=&#34;text-align:left;&#34;&gt;
&lt;/th&gt;
&lt;th style=&#34;text-align:center;&#34;&gt;
Clinton
&lt;/th&gt;
&lt;th style=&#34;text-align:center;&#34;&gt;
Trump
&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
AK/OR
&lt;/td&gt;
&lt;td style=&#34;text-align:center;color: white !important;background-color: green !important;&#34;&gt;
+2
&lt;/td&gt;
&lt;td style=&#34;text-align:center;color: white !important;background-color: red !important;&#34;&gt;
-3
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
AR/MS
&lt;/td&gt;
&lt;td style=&#34;text-align:center;color: white !important;background-color: white !important;&#34;&gt;
0
&lt;/td&gt;
&lt;td style=&#34;text-align:center;color: white !important;background-color: red !important;&#34;&gt;
-2
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
CT/RI
&lt;/td&gt;
&lt;td style=&#34;text-align:center;color: white !important;background-color: red !important;&#34;&gt;
-2
&lt;/td&gt;
&lt;td style=&#34;text-align:center;color: white !important;background-color: white !important;&#34;&gt;
0
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
DC/DE/WV
&lt;/td&gt;
&lt;td style=&#34;text-align:center;color: white !important;background-color: green !important;&#34;&gt;
+1
&lt;/td&gt;
&lt;td style=&#34;text-align:center;color: white !important;background-color: red !important;&#34;&gt;
-5
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
HI/NV
&lt;/td&gt;
&lt;td style=&#34;text-align:center;color: white !important;background-color: red !important;&#34;&gt;
-2
&lt;/td&gt;
&lt;td style=&#34;text-align:center;color: white !important;background-color: white !important;&#34;&gt;
0
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
IA/NE
&lt;/td&gt;
&lt;td style=&#34;text-align:center;color: white !important;background-color: white !important;&#34;&gt;
0
&lt;/td&gt;
&lt;td style=&#34;text-align:center;color: white !important;background-color: red !important;&#34;&gt;
-2
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
ID/MT/ND/SD/WY
&lt;/td&gt;
&lt;td style=&#34;text-align:center;color: white !important;background-color: white !important;&#34;&gt;
0
&lt;/td&gt;
&lt;td style=&#34;text-align:center;color: white !important;background-color: red !important;&#34;&gt;
-7
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
KS/OK
&lt;/td&gt;
&lt;td style=&#34;text-align:center;color: white !important;background-color: white !important;&#34;&gt;
0
&lt;/td&gt;
&lt;td style=&#34;text-align:center;color: white !important;background-color: red !important;&#34;&gt;
-1
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
NH/ME/VT
&lt;/td&gt;
&lt;td style=&#34;text-align:center;color: white !important;background-color: red !important;&#34;&gt;
-4
&lt;/td&gt;
&lt;td style=&#34;text-align:center;color: white !important;background-color: white !important;&#34;&gt;
0
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style=&#34;text-align:left;&#34;&gt;
NM/UT
&lt;/td&gt;
&lt;td style=&#34;text-align:center;color: white !important;background-color: red !important;&#34;&gt;
-5
&lt;/td&gt;
&lt;td style=&#34;text-align:center;color: white !important;background-color: green !important;&#34;&gt;
+4
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;In most cases, votes were lost by combining smaller states into these groupings, but in a few instances the combination changed who won the popular vote. For example, the Alaska/Oregon grouping originally had 10 electoral votes (3 from Alaska, which went to Trump, and 7 from Oregon, which went to Clinton). The grouping lost one vote in the combination, and the combined Oregon/Alaska went to Clinton overall. Therefore, Clinton gets all 9 new electoral votes (+2 over her initial 7) and Trump loses the 3 he had from Alaska.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;wrapping-up&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Wrapping Up&lt;/h2&gt;
&lt;p&gt;Back at the beginning of this analysis I hypothesized that the Electoral College had become more over-weighted toward smaller states than it was in the 1790s, during its early days. By comparing states’ shares of the US population in 1790 vs. 2019, I showed that this was true, although not massively so.&lt;/p&gt;
&lt;p&gt;I proposed an idea to revise the Electoral College by combining states to ensure that each grouping makes up at least 1.5% of the US population, the smallest share of population in 1790. This reduced the overall number of electoral votes due to the loss of the automatic 2 votes per state for the combined states.&lt;/p&gt;
&lt;p&gt;Finally, I applied my new Electoral College to the 2016 election… it made almost no difference.&lt;/p&gt;
&lt;p&gt;So overall, this thought exercise was fun to work through but it winds up being an incredibly small change to the results from the current system.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Sequence Mining My Browsing History with arulesSequences</title>
      <link>https://jlaw.netlify.app/2020/11/01/sequence-mining-my-browsing-history-with-arulessequences/</link>
      <pubDate>Sun, 01 Nov 2020 00:00:00 +0000</pubDate>
      <guid>https://jlaw.netlify.app/2020/11/01/sequence-mining-my-browsing-history-with-arulessequences/</guid>
      <description>


&lt;p&gt;When people think of pattern mining, they tend to think of Market Basket Analysis, with the conventional example that people often buy both beer and diapers in the same trip. When order doesn’t matter this is called Association Rules Mining, implemented by the &lt;code&gt;arules&lt;/code&gt; package in R. In that example, the person is buying &lt;strong&gt;both&lt;/strong&gt; diapers and beer; it doesn’t really matter whether diapers led to the beer purchase or beer led to the diaper purchase. However, there are instances where the order of events is important to what we’d consider a pattern. For example, “cause and effect” relationships imply order: putting your hand on a hot stove leads to burning your hand, while the reverse direction, burning your hand leading you to put your hand on a hot stove, makes less sense. When the notion of order is applied to association rules mining it becomes “Sequence Mining”. To do this, we’ll use the &lt;code&gt;arulesSequences&lt;/code&gt; package to run the &lt;a href=&#34;https://link.springer.com/article/10.1023/A:1007652502315&#34;&gt;cSPADE&lt;/a&gt; algorithm.&lt;/p&gt;
&lt;p&gt;Unfortunately, I don’t have access to grocery store data or much other data that would make an interesting use case for sequence mining. But I do have access to my own browsing history. So for this post, I’ll be looking for common sequential patterns in my own web browsing habits.&lt;/p&gt;
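&lt;p&gt;As a preview of the workflow, here is a minimal toy sketch of running cSPADE with &lt;code&gt;arulesSequences&lt;/code&gt; (the items, IDs, and support threshold are made up for illustration):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(arulesSequences)

## Toy data: sequenceID identifies a session, eventID orders events within it
toy &amp;lt;- data.frame(items = factor(c(&amp;#39;news&amp;#39;, &amp;#39;mail&amp;#39;, &amp;#39;news&amp;#39;, &amp;#39;blog&amp;#39;, &amp;#39;mail&amp;#39;)))
trans &amp;lt;- as(toy, &amp;#39;transactions&amp;#39;)
transactionInfo(trans)$sequenceID &amp;lt;- c(1, 1, 1, 2, 2)
transactionInfo(trans)$eventID &amp;lt;- c(1, 2, 3, 1, 2)

## Mine frequent sequences appearing in at least half of the sequences
seqs &amp;lt;- cspade(trans, parameter = list(support = 0.5))
as(seqs, &amp;#39;data.frame&amp;#39;)&lt;/code&gt;&lt;/pre&gt;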
&lt;div id=&#34;getting-the-data&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Getting the Data&lt;/h2&gt;
&lt;p&gt;I wasn’t able to figure out how to extract my browsing history directly from Chrome in a way that could easily be read into R. However, there are third-party programs that can extract browsing histories. In this case, I used a program called &lt;a href=&#34;https://www.nirsoft.net/utils/browsing_history_view.html&#34;&gt;BrowsingHistoryView&lt;/a&gt; by Nir Sofer. The interface is very straightforward and allowed me to extract my browsing history to a CSV file.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;chromehistorytool.PNG&#34; /&gt;&lt;/p&gt;
&lt;p&gt;From this program I was able to extract 85 days’ worth of browsing history, from 2020-06-13 through 2020-09-09.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;loading-libraries-and-reading-in-data&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Loading Libraries and Reading in Data&lt;/h2&gt;
&lt;p&gt;The libraries used in this analysis are the usual gang of &lt;code&gt;tidyverse&lt;/code&gt;, &lt;code&gt;lubridate&lt;/code&gt;, and &lt;code&gt;ggtext&lt;/code&gt; that are often used on this blog. Some new ones specific to this analysis are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;arulesSequences&lt;/code&gt; - Which will run the sequence mining algorithm&lt;/li&gt;
&lt;li&gt;&lt;code&gt;tidygraph&lt;/code&gt; and &lt;code&gt;ggraph&lt;/code&gt; - Which will allow for plotting my browsing history as a directed graph&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidyverse) #Data Manipulation and Plotting
library(lubridate) #Date Manipulation
library(arulesSequences) #Running the Sequence mining algorithm
library(ggtext) #Adding some flair to plots
library(tidygraph)  ## Creating a Graph Structure
library(ggraph) ## Plotting the Network Graph Structure&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A .csv file was exported from the BrowsingHistoryView software and read into R through &lt;code&gt;readr&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;browsing_history &amp;lt;- read_csv(&amp;#39;browsing_history_v2.csv&amp;#39;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The read-in data looks as follows:&lt;/p&gt;
&lt;table style=&#34;width:100%;&#34;&gt;
&lt;colgroup&gt;
&lt;col width=&#34;41%&#34; /&gt;
&lt;col width=&#34;14%&#34; /&gt;
&lt;col width=&#34;7%&#34; /&gt;
&lt;col width=&#34;4%&#34; /&gt;
&lt;col width=&#34;4%&#34; /&gt;
&lt;col width=&#34;3%&#34; /&gt;
&lt;col width=&#34;3%&#34; /&gt;
&lt;col width=&#34;2%&#34; /&gt;
&lt;col width=&#34;4%&#34; /&gt;
&lt;col width=&#34;5%&#34; /&gt;
&lt;col width=&#34;8%&#34; /&gt;
&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;URL&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;Title&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;Visited On&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;Visit Count&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;Typed Count&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;Referrer&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;Visit ID&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;Profile&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;URL Length&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;Transition Type&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;Transition Qualifiers&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;&lt;a href=&#34;https://watch.wwe.com/original/undertaker-the-last-ride-134576&#34; class=&#34;uri&#34;&gt;https://watch.wwe.com/original/undertaker-the-last-ride-134576&lt;/a&gt;&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;wwe network - undertaker: the last ride&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;6/13/2020 2:59:23 PM&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NA&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;331141&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Default&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;62&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Typed&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Chain Start,Chain End&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;&lt;a href=&#34;https://watch.wwe.com/original/undertaker-the-last-ride-134576&#34; class=&#34;uri&#34;&gt;https://watch.wwe.com/original/undertaker-the-last-ride-134576&lt;/a&gt;&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;wwe network - undertaker: the last ride&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;6/13/2020 2:59:28 PM&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NA&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;331142&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Default&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;62&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Link&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Chain Start,Chain End&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;&lt;a href=&#34;https://www.google.com/search?q=vtt+to+srt&amp;amp;oq=vtt+to+srt&amp;amp;aqs=chrome.0.69i59j0l7.1395j0j4&amp;amp;sourceid=chrome&amp;amp;ie=utf-8&#34; class=&#34;uri&#34;&gt;https://www.google.com/search?q=vtt+to+srt&amp;amp;oq=vtt+to+srt&amp;amp;aqs=chrome.0.69i59j0l7.1395j0j4&amp;amp;sourceid=chrome&amp;amp;ie=utf-8&lt;/a&gt;&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;vtt to srt - google search&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;6/13/2020 4:33:34 PM&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NA&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;331157&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Default&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;113&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Generated&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Chain Start,Chain End&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;&lt;a href=&#34;https://www.google.com/search?q=vtt+to+srt&amp;amp;oq=vtt+to+srt&amp;amp;aqs=chrome.0.69i59j0l7.1395j0j4&amp;amp;sourceid=chrome&amp;amp;ie=utf-8&#34; class=&#34;uri&#34;&gt;https://www.google.com/search?q=vtt+to+srt&amp;amp;oq=vtt+to+srt&amp;amp;aqs=chrome.0.69i59j0l7.1395j0j4&amp;amp;sourceid=chrome&amp;amp;ie=utf-8&lt;/a&gt;&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;vtt to srt - google search&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;6/13/2020 4:33:37 PM&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NA&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;331158&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Default&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;113&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Link&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Chain Start,Chain End&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;&lt;a href=&#34;https://twitter.com/&#34; class=&#34;uri&#34;&gt;https://twitter.com/&lt;/a&gt;&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;home / twitter&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;6/13/2020 5:19:55 PM&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;98&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;90&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NA&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;331167&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Default&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;20&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Typed&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Chain Start,Chain End&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;&lt;a href=&#34;https://twitter.com/home&#34; class=&#34;uri&#34;&gt;https://twitter.com/home&lt;/a&gt;&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;home / twitter&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;6/13/2020 5:20:03 PM&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;414&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;NA&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;331168&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Default&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;24&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Link&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Chain Start,Chain End&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Looking at the data, there are a number of cleaning steps needed to make the sequence mining more useful:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;The variable names are not machine-friendly and contain spaces.&lt;/li&gt;
&lt;li&gt;Some of the URLs are redirects or generated and therefore not URLs I specifically went to. I’ll want to exclude those.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Visited On&lt;/em&gt; is a character rather than a date.&lt;/li&gt;
&lt;li&gt;If we’re looking for common patterns, I should limit the URLs to just their domains, as it’s very unlikely that I would read the same news article multiple times.
&lt;ul&gt;
&lt;li&gt;Therefore I’ll shorten “&lt;a href=&#34;https://twitter.com/home&#34; class=&#34;uri&#34;&gt;https://twitter.com/home&lt;/a&gt;” to just “twitter.com/”&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The following code block carries out the cleaning steps outlined above:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;browsing_history_cleaned &amp;lt;- browsing_history %&amp;gt;% 
  #Make the names more R friendly
  janitor::clean_names() %&amp;gt;%
  #Subset to URLs I either typed or 
  #Linked to (excluding redirects/form submissions)
  filter(transition_type %in% c(&amp;#39;Link&amp;#39;, &amp;#39;Typed&amp;#39;),
         str_detect(transition_qualifiers, &amp;#39;Chain Start&amp;#39;)
         )%&amp;gt;% 
  #Keep Only the Base URL and remove the prefix
  mutate(base_url = str_remove(url, &amp;#39;^https?:\\/\\/&amp;#39;) %&amp;gt;% 
           str_remove(&amp;#39;^www\\.&amp;#39;) %&amp;gt;% 
           str_extract(., &amp;#39;^.+?\\/&amp;#39;),
         #Parse the Date Format
         dttm = mdy_hms(visited_on),
         ds = as.Date(dttm)
  ) %&amp;gt;% 
  select(base_url, dttm, title, ds)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The above block:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Uses &lt;code&gt;janitor::clean_names()&lt;/code&gt; to convert the column names into an R-friendly format (Visited On -&amp;gt; visited_on)&lt;/li&gt;
&lt;li&gt;Keeps only the ‘Typed’ and ‘Link’ transition types, i.e., URLs I either typed or clicked&lt;/li&gt;
&lt;li&gt;Keeps only ‘Chain Start’ qualifiers to remove URLs that came from redirects&lt;/li&gt;
&lt;li&gt;Creates a &lt;code&gt;base_url&lt;/code&gt; field by stripping the “http[s]://” and “www.” prefixes (when present) and keeping only the domain portion of the URL&lt;/li&gt;
&lt;li&gt;Converts &lt;code&gt;visited_on&lt;/code&gt; into both a timestamp and a datestamp&lt;/li&gt;
&lt;li&gt;Keeps only the four columns we’re interested in.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;After these changes, the data looks like:&lt;/p&gt;
&lt;table&gt;
&lt;colgroup&gt;
&lt;col width=&#34;17%&#34; /&gt;
&lt;col width=&#34;23%&#34; /&gt;
&lt;col width=&#34;46%&#34; /&gt;
&lt;col width=&#34;12%&#34; /&gt;
&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;base_url&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;dttm&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;title&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;ds&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;watch.wwe.com/&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2020-06-13 14:59:23&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;wwe network - undertaker: the last ride&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2020-06-13&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;watch.wwe.com/&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2020-06-13 14:59:28&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;wwe network - undertaker: the last ride&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2020-06-13&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;google.com/&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2020-06-13 16:33:37&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;vtt to srt - google search&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2020-06-13&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;twitter.com/&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2020-06-13 17:19:55&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;home / twitter&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2020-06-13&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;twitter.com/&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2020-06-13 17:20:03&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;home / twitter&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2020-06-13&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;
&lt;div id=&#34;sessionizing-the-data&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Sessionizing the Data&lt;/h2&gt;
&lt;p&gt;Even though I have a date field for my browsing history, the cSPADE algorithm needs to be able to differentiate between when one session begins and another ends. While a reasonable choice might be to break things apart by day, on weekends I likely have multiple browsing sessions, some of which stretch past midnight. So a more reasonable choice is to say a new session begins whenever there is a gap of at least 1 hour since the last page I browsed to.&lt;/p&gt;
&lt;p&gt;Another aspect of the data I’d like to deal with is consecutive visits to pages within the same domain. An eventual rule like “twitter.com/ -&amp;gt; twitter.com/” isn’t that interesting, so I will also remove any consecutive rows that have the same domain.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;collapsed_history &amp;lt;- browsing_history_cleaned %&amp;gt;% 
  #Order by Time
  arrange(dttm) %&amp;gt;% 
  # Create a new marker every time a Page Browsing is more than 1 hour since
  # the last one
  # Also, create a segment_id to identify each session
  mutate(#Time difference in seconds (explicit units avoid difftime auto-unit surprises)
         time_diff = as.numeric(dttm - lag(dttm), units = &amp;#39;secs&amp;#39;),
         #Count segments as more than an hour between events
         new_segment = if_else(is.na(time_diff) | time_diff &amp;gt;= 60*60, 1, 0),
         segment_id = cumsum(new_segment)
  ) %&amp;gt;% 
  group_by(segment_id) %&amp;gt;% 
  arrange(dttm) %&amp;gt;% 
  #Remove Instances where the same baseurl appears consecutively
  filter(base_url != lag(base_url) | is.na(lag(base_url))) %&amp;gt;% 
  #Create Within Segment ID
  mutate(item_id = row_number()) %&amp;gt;% 
  select(segment_id, ds, dttm, item_id, base_url) %&amp;gt;% 
  ungroup() %&amp;gt;% 
  #Convert Everything to Factor
  mutate(across(.cols = c(&amp;quot;segment_id&amp;quot;, &amp;quot;base_url&amp;quot;), .f = as.factor))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In order to create &lt;code&gt;segment_id&lt;/code&gt;s to represent each session, I use &lt;code&gt;dplyr::lag()&lt;/code&gt; to calculate the time difference between each event and the one before it. When an event occurs more than 1 hour after the prior event, I mark it with a 1 in the &lt;code&gt;new_segment&lt;/code&gt; column. Then using &lt;code&gt;cumsum()&lt;/code&gt;, I can fill the segment_ids down to all the other events in that session.&lt;/p&gt;
&lt;p&gt;Similarly, I use the lag function to remove consecutive rows with identical &lt;code&gt;base_url&lt;/code&gt;s.&lt;/p&gt;
&lt;p&gt;Finally, a quirk of the &lt;code&gt;arulesSequences&lt;/code&gt; package is that the “items” (the URLs, in this case) must be factors.&lt;/p&gt;
&lt;p&gt;The data for the 154 browsing sessions now looks like:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;collapsed_history %&amp;gt;% head(5) %&amp;gt;% knitr::kable()&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;segment_id&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;ds&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;dttm&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;item_id&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;base_url&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2020-06-13&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2020-06-13 14:59:23&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;watch.wwe.com/&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2020-06-13&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2020-06-13 16:33:37&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;google.com/&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2020-06-13&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2020-06-13 17:19:55&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;twitter.com/&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2020-06-13&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2020-06-13 17:20:09&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;3&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;gmail.com/&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2020-06-13&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2020-06-13 17:24:14&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;4&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;twitter.com/&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;
&lt;div id=&#34;constructing-the-transactions-data-set-for-arulessequences&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Constructing the Transactions Data Set for arulesSequences&lt;/h2&gt;
&lt;p&gt;I haven’t found a ton of resources online about using the &lt;code&gt;arulesSequences&lt;/code&gt; package. This &lt;a href=&#34;https://blog.revolutionanalytics.com/2019/02/sequential-pattern-mining-in-r.html&#34;&gt;blog post from Revolution Analytics&lt;/a&gt; has been one of the best that I’ve found. However, their process involves exporting to .csv and then reading the file back in to create the transactions data set. Personally, I’d like to do as much as possible without leaving R.&lt;/p&gt;
&lt;p&gt;However, the blog post does provide a good amount of detail about how to get the data into the proper format. Using the &lt;code&gt;as&lt;/code&gt; function, I can convert the previous data frame into a &#34;transactions&#34; format and set the following fields for use in cSPADE:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;items&lt;/strong&gt;: The elements that make up a sequence&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;sequenceID&lt;/strong&gt;: The identifier for each sequence&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;eventID&lt;/strong&gt;: The identifier for an item within a sequence&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;sessions &amp;lt;-  as(collapsed_history %&amp;gt;% transmute(items = base_url), &amp;quot;transactions&amp;quot;)
transactionInfo(sessions)$sequenceID &amp;lt;- collapsed_history$segment_id
transactionInfo(sessions)$eventID = collapsed_history$item_id&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If I wanted finer control around time gaps, I would also need to provide explicit timing information. But since this analysis is pretty basic, the separation into sessions is enough.&lt;/p&gt;
&lt;p&gt;The Transaction data class can be viewed with the &lt;code&gt;inspect()&lt;/code&gt; function:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;inspect(head(sessions))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##     items                  transactionID sequenceID eventID
## [1] {items=watch.wwe.com/} 1             1          1      
## [2] {items=google.com/}    2             2          1      
## [3] {items=twitter.com/}   3             2          2      
## [4] {items=gmail.com/}     4             2          3      
## [5] {items=twitter.com/}   5             2          4      
## [6] {items=gothamist.com/} 6             2          5&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Having the “items=” prefix on every item is a little annoying, so let’s remove it by altering the &lt;code&gt;itemLabels&lt;/code&gt; for the transactions set:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;itemLabels(sessions) &amp;lt;- str_replace_all(itemLabels(sessions), &amp;quot;items=&amp;quot;, &amp;quot;&amp;quot;)
inspect(head(sessions))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##     items            transactionID sequenceID eventID
## [1] {watch.wwe.com/} 1             1          1      
## [2] {google.com/}    2             2          1      
## [3] {twitter.com/}   3             2          2      
## [4] {gmail.com/}     4             2          3      
## [5] {twitter.com/}   5             2          4      
## [6] {gothamist.com/} 6             2          5&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Much better.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;running-the-cspade-algorithm&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Running the cSPADE algorithm&lt;/h2&gt;
&lt;p&gt;The sequence mining algorithm can be run with the &lt;code&gt;cspade()&lt;/code&gt; function in the &lt;code&gt;arulesSequences&lt;/code&gt; package. Before running it, I’ll need to explain the concept of &lt;em&gt;support&lt;/em&gt;. &lt;em&gt;Support&lt;/em&gt; is best thought of as the proportion of sessions that contain a given URL. This matters because the cSPADE algorithm works recursively to find frequent patterns, starting with 1-item sets, then moving to 2-item sets, and so on. To limit how long the algorithm runs, you can set a minimum support threshold. This helps because, by definition, the support of a 2-item set is less than or equal to the support of either of its 1-item sets. For example, if A occurs 40% of the time, A and B together cannot occur more frequently.&lt;/p&gt;
&lt;p&gt;So if A alone does not meet the support threshold, we don’t need to consider any larger sets that contain A.&lt;/p&gt;
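&lt;p&gt;This anti-monotone property of support is easy to verify directly. Here is a small base R sketch using made-up sessions (illustrative only, not the actual browsing data) showing that a 2-item set can never have higher support than either of its 1-item sets:&lt;/p&gt;

```r
# Hypothetical sessions of visited domains (illustrative, not the real data)
sessions <- list(
  c("google.com", "twitter.com"),
  c("google.com", "reddit.com"),
  c("twitter.com", "reddit.com"),
  c("google.com", "twitter.com", "reddit.com")
)

# Support = proportion of sessions that contain every item in the set
support <- function(items) {
  mean(vapply(sessions, function(s) all(items %in% s), logical(1)))
}

support("google.com")                    # 0.75
support("twitter.com")                   # 0.75
support(c("google.com", "twitter.com"))  # 0.50 -- no larger than either 1-item support
```

&lt;p&gt;Once a 1-item set falls below the threshold, every longer pattern containing it can be pruned without checking.&lt;/p&gt;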
&lt;p&gt;For this purpose I’ll set a minimum support of 25%. The &lt;code&gt;cspade&lt;/code&gt; function will return all of the frequent itemsets that occur in my browsing data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;itemsets &amp;lt;- cspade(sessions, 
                   parameter = list(support = 0.25), 
                   control = list(verbose = FALSE))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;summary()&lt;/code&gt; function will provide a lot of useful information, but we’ll just look at the first few rows with &lt;code&gt;inspect()&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;inspect(head(itemsets))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##    items                   support 
##  1 &amp;lt;{buzzfeed.com/}&amp;gt;     0.4090909 
##  2 &amp;lt;{en.wikipedia.org/}&amp;gt; 0.3311688 
##  3 &amp;lt;{facebook.com/}&amp;gt;     0.3311688 
##  4 &amp;lt;{github.com/}&amp;gt;       0.3051948 
##  5 &amp;lt;{google.com/}&amp;gt;       0.8051948 
##  6 &amp;lt;{gothamist.com/}&amp;gt;    0.4090909 
## &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here we see the results of a series of 1-item sets, where the support is the proportion of sessions containing at least one visit to that URL. &lt;strong&gt;Apparently I use Google A LOT, as it appears in 80% of my sessions&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;We can also convert the itemsets data back to a data frame using the &lt;code&gt;as()&lt;/code&gt; function and return to the usual &lt;code&gt;dplyr&lt;/code&gt; and &lt;code&gt;ggplot&lt;/code&gt; functions. For example, I can visualize the 10 most frequent item sets based on the support metric:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#Convert Back to DS
itemsets_df &amp;lt;- as(itemsets, &amp;quot;data.frame&amp;quot;) %&amp;gt;% as_tibble()

#Top 10 Frequent Item Sets
itemsets_df %&amp;gt;%
  slice_max(support, n = 10) %&amp;gt;% 
  ggplot(aes(x = fct_reorder(sequence, support),
                    y = support,
                    fill = sequence)) + 
    geom_col() + 
    geom_label(aes(label = support %&amp;gt;% scales::percent()), hjust = 0.5) + 
    labs(x = &amp;quot;Site&amp;quot;, y = &amp;quot;Support&amp;quot;, title = &amp;quot;Most Frequently Visited Item Sets&amp;quot;,
         caption = &amp;quot;**Support** is the percent of segments that contain the item set&amp;quot;) + 
    scale_fill_discrete(guide = F) +
    scale_y_continuous(labels = scales::percent,
                       expand = expansion(mult = c(0, .1))) + 
    coord_flip() + 
    cowplot::theme_cowplot() + 
    theme(
      plot.caption = element_markdown(hjust = 0),
      plot.caption.position = &amp;#39;plot&amp;#39;,
      plot.title.position = &amp;#39;plot&amp;#39;
    )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;index_files/figure-html/top_by_support-1.png&#34; width=&#34;768&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Now we see some of the 2-item sets. Not only do I use Google in 80% of sessions; in 66% of sessions I visit Google twice!&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;turning-frequent-sequences-into-rules&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Turning Frequent Sequences into Rules&lt;/h2&gt;
&lt;p&gt;While knowing which URLs occur frequently is interesting, it would be more useful to generate rules about which websites lead to visits to other websites.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;ruleInduction()&lt;/code&gt; function will turn the item sets into “if A then B” style rules. To control the size of the output, I will introduce the concept of &lt;em&gt;confidence&lt;/em&gt;. The &lt;em&gt;confidence&lt;/em&gt; of an “if A then B” rule is the percentage of times the rule holds when A occurs. So if “if A then B” has 50% confidence, then when A occurs we have a 50% chance of seeing B vs. seeing anything other than B.&lt;/p&gt;
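&lt;p&gt;As a quick numeric sketch of the definition (the counts below are made up for illustration, not taken from the browsing data), confidence is just the rule’s support divided by the support of its left-hand side:&lt;/p&gt;

```r
# Hypothetical counts over 100 sessions (illustrative only)
n_sessions <- 100
n_A        <- 40  # sessions where A occurs
n_A_then_B <- 20  # sessions where A occurs and B follows

supp_A    <- n_A / n_sessions         # 0.4
supp_rule <- n_A_then_B / n_sessions  # 0.2
confidence <- supp_rule / supp_A
confidence  # 0.5: when A occurs, B follows half the time
```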
&lt;p&gt;For this post, I’ll use a minimum confidence of 60%.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;rules &amp;lt;- ruleInduction(itemsets, 
                       confidence = 0.6, 
                       control = list(verbose = FALSE))

inspect(head(rules, 3))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##    lhs                     rhs                    support confidence     lift 
##  1 &amp;lt;{gothamist.com/}&amp;gt;   =&amp;gt; &amp;lt;{westsiderag.com/}&amp;gt; 0.2727273  0.6666667 1.901235 
##  2 &amp;lt;{gothamist.com/}&amp;gt;   =&amp;gt; &amp;lt;{twitter.com/}&amp;gt;     0.2662338  0.6507937 1.113580 
##  3 &amp;lt;{t.co/}&amp;gt;            =&amp;gt; &amp;lt;{twitter.com/}&amp;gt;     0.3246753  0.7812500 1.336806 
## &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The returned data structure has 5 fields:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;lhs&lt;/strong&gt;: Left-hand side - The “A” in our “if A then B” rule&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;rhs&lt;/strong&gt;: Right-hand side - The “B” in our “if A then B” rule&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;support&lt;/strong&gt;: The % of sessions where “A then B” occurs&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;confidence&lt;/strong&gt;: How often the rule is true (If A occurs the % of Time that B occurs)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;lift&lt;/strong&gt;: The strength of the association. Defined as the support of “A then B” divided by the product of the support of A and the support of B. In other words, how much more likely are we to see A and B together vs. what we would expect if A and B were completely independent of each other.&lt;/li&gt;
&lt;/ul&gt;
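&lt;p&gt;To see how these three measures fit together, we can sanity-check the arithmetic of the first rule in the output above (the &lt;code&gt;supp_AB&lt;/code&gt;, &lt;code&gt;conf&lt;/code&gt;, and &lt;code&gt;lift&lt;/code&gt; values are copied from that row):&lt;/p&gt;

```r
# Values from the rule <{gothamist.com/}> => <{westsiderag.com/}> above
supp_AB <- 0.2727273  # support of "A then B"
conf    <- 0.6666667  # confidence
lift    <- 1.901235   # lift

# confidence = supp(A then B) / supp(A), so supp(A) falls out:
supp_A <- supp_AB / conf
round(supp_A, 7)  # 0.4090909 -- matches the support shown for gothamist.com/ earlier

# lift = confidence / supp(B), so the implied support of westsiderag.com/ is:
supp_B <- conf / lift
round(supp_B, 3)  # 0.351
```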
&lt;p&gt;The first row shows two NYC-specific blogs, one covering NYC overall and one covering the Upper West Side. The support shows that 27% of my sessions include these two blogs. The confidence shows that if I visit Gothamist there’s a 67% chance I’ll visit WestSideRag afterwards. Finally, the lift shows that this pattern is about 90% more likely than you’d expect if there were no relation between my visits to these sites.&lt;/p&gt;
&lt;div id=&#34;redundant-rules&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Redundant Rules&lt;/h3&gt;
&lt;p&gt;In order to create the most effective and simplest rules, we’ll want to remove redundant rules. In this context a rule is &lt;em&gt;redundant&lt;/em&gt; when a subset of its left-hand side yields at least as high a confidence as the rule with more items on the left-hand side. In simpler terms, we want the simplest rule that doesn’t sacrifice information. For example, {A, B, C} -&amp;gt; {D} is redundant with respect to {A, B} -&amp;gt; {D} if the confidence of the second rule is greater than or equal to that of the first.&lt;/p&gt;
&lt;p&gt;A real example from this data is:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;lhs&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;rhs&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;support&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;confidence&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;lift&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;&amp;lt;{t.co/}&amp;gt;&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;=&amp;gt; &amp;lt;{twitter.com/}&amp;gt;&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.3246753&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.7812500&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.336806&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;&amp;lt;{twitter.com/}, {t.co/}&amp;gt;&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;=&amp;gt; &amp;lt;{twitter.com/}&amp;gt;&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.3181818&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.7777778&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.330864&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The addition of “twitter.com” to the left-hand side does not make for a more confident rule, so it is redundant.&lt;/p&gt;
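&lt;p&gt;This redundancy check can be sketched directly (the confidence values are copied from the table above):&lt;/p&gt;

```r
# Confidences from the table above; both rules share the same right-hand side
conf_subset <- 0.7812500  # <{t.co/}>                => <{twitter.com/}>
conf_longer <- 0.7777778  # <{twitter.com/},{t.co/}> => <{twitter.com/}>

# The longer rule is redundant if the subset of its lhs does at least as well
is_redundant <- conf_subset >= conf_longer
is_redundant  # TRUE
```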
&lt;p&gt;Removing redundant rules can be done easily with the &lt;code&gt;is.redundant()&lt;/code&gt; function:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;rules_cleaned &amp;lt;- rules[!is.redundant(rules)]&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The rules class can also be converted back to a data.frame with the &lt;code&gt;as()&lt;/code&gt; function. Then we can use &lt;code&gt;tidyr::separate()&lt;/code&gt; to break apart the &lt;code&gt;rule&lt;/code&gt; column into the &lt;code&gt;lhs&lt;/code&gt; and &lt;code&gt;rhs&lt;/code&gt; columns.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;rules_df &amp;lt;- as(rules_cleaned, &amp;quot;data.frame&amp;quot;) %&amp;gt;% 
  as_tibble() %&amp;gt;% 
  separate(col = rule, into = c(&amp;#39;lhs&amp;#39;, &amp;#39;rhs&amp;#39;), sep = &amp;quot; =&amp;gt; &amp;quot;, remove = F)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now we can look at the highest confidence rules:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;rules_df %&amp;gt;% 
  arrange(-confidence) %&amp;gt;% 
  select(lhs, rhs, support, confidence, lift) %&amp;gt;% 
  head() %&amp;gt;% 
  knitr::kable()&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;colgroup&gt;
&lt;col width=&#34;55%&#34; /&gt;
&lt;col width=&#34;15%&#34; /&gt;
&lt;col width=&#34;9%&#34; /&gt;
&lt;col width=&#34;10%&#34; /&gt;
&lt;col width=&#34;8%&#34; /&gt;
&lt;/colgroup&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;lhs&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;rhs&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;support&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;confidence&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;lift&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;&amp;lt;{google.com/},{google.com/},{google.com/},{google.com/}&amp;gt;&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;&amp;lt;{google.com/}&amp;gt;&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.3701299&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.9193548&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.141779&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;&amp;lt;{github.com/}&amp;gt;&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;&amp;lt;{google.com/}&amp;gt;&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.2792208&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.9148936&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.136239&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;&amp;lt;{buzzfeed.com/},{google.com/}&amp;gt;&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;&amp;lt;{google.com/}&amp;gt;&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.2597403&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.8510638&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.056966&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;&amp;lt;{t.co/},{google.com/}&amp;gt;&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;&amp;lt;{google.com/}&amp;gt;&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.2727273&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.8400000&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.043226&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;&amp;lt;{lifehacker.com/}&amp;gt;&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;&amp;lt;{reddit.com/}&amp;gt;&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.2532468&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.8297872&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.726854&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;&amp;lt;{google.com/}&amp;gt;&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;&amp;lt;{google.com/}&amp;gt;&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.6623377&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;0.8225806&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1.021592&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;And this is pretty boring. I wind up on Google a lot, so it appears in a lot of the rules. So let’s make this more interesting by removing Google from the results and by also looking at both confidence and lift.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;rules_df %&amp;gt;% 
  #Remove All Rules that Involve Google
  filter(!str_detect(rule, &amp;#39;\\{google.com\\/\\}&amp;#39;)) %&amp;gt;% 
  #Keep only Rule, Confidence, and Lift - 1
  transmute(rule, confidence, lift = lift - 1) %&amp;gt;% 
  #Pivot Lift and confidence into a single column
  pivot_longer(cols = c(&amp;#39;confidence&amp;#39;,&amp;#39;lift&amp;#39;),
               names_to = &amp;quot;metric&amp;quot;, 
               values_to = &amp;quot;value&amp;quot;) %&amp;gt;% 
  group_by(metric) %&amp;gt;% 
  #Keep only the Top 10 Rules for Each Metric
  top_n(10, value) %&amp;gt;% 
  ungroup() %&amp;gt;% 
  # Reorder so that order is independent for each metrics
  ggplot(aes(x = tidytext::reorder_within(rule, value, metric),
             y = value,
             fill = rule)) + 
    geom_col() + 
    geom_label(aes(label = value %&amp;gt;% scales::percent()), 
               hjust = 0) +
    scale_fill_discrete(guide = F) + 
    tidytext::scale_x_reordered() + 
    scale_y_continuous(label = scales::percent, 
                       limits = c(0, 1),
                       expand = expansion(mult = c(0, .1))) + 
    labs(x = &amp;quot;Rule&amp;quot;, 
         y = &amp;quot;&amp;quot;, 
         title = &amp;quot;Top Rules by Confidence and Lift&amp;quot;,
         caption = &amp;quot;**Confidence** is the probability RHS occurs 
         given LHS occurs &amp;lt;br /&amp;gt;
          **Lift** is the increased likelihood of seeing LHS &amp;amp; RHS together vs. independent&amp;quot;) +
    facet_wrap(~metric, ncol = 1, scales = &amp;quot;free_y&amp;quot;) +
    coord_flip() +
    theme_minimal() +
    theme(
      plot.caption = element_markdown(hjust = 0),
      plot.caption.position = &amp;#39;plot&amp;#39;,
      strip.text = element_textbox(
        size = 12,
        color = &amp;quot;white&amp;quot;, fill = &amp;quot;#5D729D&amp;quot;, box.color = &amp;quot;#4A618C&amp;quot;,
        halign = 0.5, linetype = 1, r = unit(5, &amp;quot;pt&amp;quot;), width = unit(1, &amp;quot;npc&amp;quot;),
        padding = margin(2, 0, 1, 0), margin = margin(3, 3, 3, 3)
      )
    )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;index_files/figure-html/top-confidence-lift-1.png&#34; width=&#34;768&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Some of the high lift rules that occur are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;I visit WestSideRag after Gothamist&lt;/li&gt;
&lt;li&gt;I visit Reddit after LifeHacker&lt;/li&gt;
&lt;li&gt;I visit Buzzfeed after Twitter.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;By the way, all of this is true. My usual weekday pattern tends to be Twitter -&amp;gt; Gothamist -&amp;gt; WestSideRag -&amp;gt; ILoveTheUpperWest -&amp;gt; Buzzfeed -&amp;gt; LifeHacker -&amp;gt; Reddit.&lt;/p&gt;
&lt;p&gt;So it does appear that the sequence mining rules represent my browsing habits! Though certain sites, like the second Upper West Side blog, did not make the top rules.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;visualizing-these-relationships-as-a-graph&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Visualizing these relationships as a graph&lt;/h2&gt;
&lt;p&gt;Ultimately, my browsing habits can be restructured as a directed graph where each URL leads to another URL. Then rather than relying on statistical measures like Support, Confidence, and Lift, I can visualize my browsing as a network. However, to turn my data into an edge list I need to re-structure the URLs from a sequential list into a series of “Source/Destination” edges.&lt;/p&gt;
&lt;p&gt;To do this, I’ll group by each browsing session, setting the URL as the “source” and using &lt;code&gt;dplyr::lead()&lt;/code&gt; to grab the URL from the next row to form the “destination”. Since the destination will be missing for the last URL in each session, I’ll remove those endpoints from the data. Finally, to create edge weightings I’ll count the number of instances of each source/destination pair.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;collapsed_history_graph_dt &amp;lt;- collapsed_history %&amp;gt;% 
  group_by(segment_id) %&amp;gt;% 
  transmute(item_id, source = base_url) %&amp;gt;% 
  mutate(destination = lead(source)) %&amp;gt;% 
  ungroup() %&amp;gt;%
  filter(!is.na(destination)) %&amp;gt;% 
  select(source, destination, segment_id) %&amp;gt;% 
  count(source, destination, name = &amp;#39;instances&amp;#39;) &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In order to create the graph, I’ll be using the &lt;code&gt;tidygraph&lt;/code&gt; and &lt;code&gt;ggraph&lt;/code&gt; packages to convert the data frame into the appropriate format and visualize the network in a ggplot style.&lt;/p&gt;
&lt;p&gt;To make the resulting network more readable, I’ll filter my edge list to only those with at least 15 occurrences and then use &lt;code&gt;tidygraph::as_tbl_graph&lt;/code&gt; to convert to a graph-friendly data type.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;g &amp;lt;- collapsed_history_graph_dt %&amp;gt;% 
  filter(instances &amp;gt; 14) %&amp;gt;% 
  as_tbl_graph()&lt;/code&gt;&lt;/pre&gt;
&lt;div id=&#34;creating-graph-clusters&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Creating Graph Clusters&lt;/h3&gt;
&lt;p&gt;To make the visualization a little more interesting I thought it would be fun to cluster the network. The &lt;code&gt;igraph::cluster_optimal&lt;/code&gt; function will calculate the optimal community structure of the graph. This membership label then gets applied as a node attribute to the graph object &lt;code&gt;g&lt;/code&gt; created in the prior code block.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;clp &amp;lt;- igraph::cluster_optimal(g)

g &amp;lt;- g %&amp;gt;% 
  activate(&amp;quot;nodes&amp;quot;) %&amp;gt;% 
  mutate(community = clp$membership)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;plotting-the-network-with-ggraph&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Plotting the Network with ggraph&lt;/h3&gt;
&lt;p&gt;&lt;code&gt;ggraph&lt;/code&gt; follows a similar syntax to ggplot: the data object is passed in, and then geoms reflect the nodes/edges of the plot. The layout option specifies how the nodes and edges will be laid out. Here I’m using the Fruchterman-Reingold algorithm for a force-directed layout. As used in this code block the relevant geoms are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;geom_node_voronoi&lt;/code&gt; - Used to plot the clustering as the background of the graph&lt;/li&gt;
&lt;li&gt;&lt;code&gt;geom_edge_parallel&lt;/code&gt; - Since this is a directional graph, it will draw separate parallel arrows for each direction. The shading will be based on the log number of instances.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;geom_node_point&lt;/code&gt; - Plots a circle for each node&lt;/li&gt;
&lt;li&gt;&lt;code&gt;geom_node_text&lt;/code&gt; - Plots the names of the nodes and reduces overlap&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(20201029)
ggraph(g, layout = &amp;#39;fr&amp;#39;) + 
  geom_node_voronoi(aes(fill = as.factor(community)), alpha = .4) + 
  geom_edge_parallel(aes(edge_alpha = log(instances)),
                  #color = &amp;quot;#5851DB&amp;quot;,
                  edge_width = 1,
                  arrow = arrow(length = unit(4, &amp;#39;mm&amp;#39;)),
                  start_cap = circle(3, &amp;#39;mm&amp;#39;),
                  end_cap = circle(3, &amp;#39;mm&amp;#39;)) +
  geom_node_point(fill = &amp;#39;orange&amp;#39;, size = 5, pch = 21) + 
  geom_node_text(aes(label = name), repel = T) + 
  labs(title = &amp;quot;My Browsing History&amp;quot;,
       caption = &amp;quot;Minimum 15 Instances&amp;quot;) + 
  scale_fill_viridis_d(guide = F) + 
  scale_edge_alpha_continuous(guide = F) + 
  theme_graph()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;index_files/figure-html/unnamed-chunk-13-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;This graph shows 5 clusters:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Twitter -&amp;gt; Gothamist -&amp;gt; WestSideRag -&amp;gt; ILoveTheUpperWestSide
&lt;ul&gt;
&lt;li&gt;The websites I typically visit after work on weekdays&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;Datacamp / Google Docs
&lt;ul&gt;
&lt;li&gt;When I do Datacamp courses, I take notes in Google Docs, so constantly switching back and forth makes sense.&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;Facebook.com / l.facebook.com
&lt;ul&gt;
&lt;li&gt;This is just normal Facebook use, but it’s interesting that Facebook has no frequent connections outside of the Facebook ecosystem.&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;BuzzFeed/LifeHacker
&lt;ul&gt;
&lt;li&gt;This is the last piece of my usual post-work routine, though perhaps it occurs later, after the Twitter/NYC blog cluster.&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;The Google Centered Cluster
&lt;ul&gt;
&lt;li&gt;Google is the center of my browsing universe, but a fun connection here is 127.0.0.1:4321, the local instance of this blog when I’m developing it. It co-occurs with lots of trips to Google, Github, and Stack Overflow while I try to figure out / debug aspects of my blog development pipeline.&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;conclusion&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Conclusion&lt;/h1&gt;
&lt;p&gt;In my searches there weren’t many resources showing how to use the &lt;code&gt;arulesSequences&lt;/code&gt; package, and most of those that existed required dumping and re-reading a .csv file. Hopefully this post showed that it isn’t necessary to do that. It also gives an example of how sequence mining can identify interesting patterns when order is important. There is a lot of functionality in the &lt;code&gt;arulesSequences&lt;/code&gt; package not touched upon in this post, but this should serve as a good starting point.&lt;/p&gt;
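&lt;p&gt;For reference, the no-.csv route can be sketched roughly as below. This is a minimal sketch under assumptions: the &lt;code&gt;sessions&lt;/code&gt; data frame, its column names, and the support threshold are hypothetical placeholders, not the objects used in this post.&lt;/p&gt;

```r
library(arules)          # transactions class
library(arulesSequences) # cspade()

# Hypothetical input: one item per row, sorted by sequence and event
sessions <- data.frame(
  sequenceID = c(1L, 1L, 1L, 2L, 2L),
  eventID    = c(1L, 2L, 3L, 1L, 2L),
  item       = c("twitter", "gothamist", "westsiderag", "twitter", "reddit")
)

# Build the transactions object in memory -- no .csv round-trip
trans <- as(split(sessions$item, seq_len(nrow(sessions))), "transactions")
transactionInfo(trans)$sequenceID <- sessions$sequenceID
transactionInfo(trans)$eventID    <- sessions$eventID

# Mine frequent sequences with cSPADE
seqs <- cspade(trans, parameter = list(support = 0.5))
as(seqs, "data.frame")
```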
&lt;p&gt;As for visualization, I’ve covered how to plot rules in the usual tabular structure with ggplot2 as well as a network using ggraph. I really like the way the network visualization worked out and in a future post I may go more in-depth to learn about how to best use &lt;code&gt;tidygraph&lt;/code&gt; and &lt;code&gt;ggraph&lt;/code&gt;.&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Looking for Media Bias in Coverage of Trump&#39;s COVID Diagnosis</title>
      <link>https://jlaw.netlify.app/2020/10/07/looking-for-media-bias-in-coverage-of-trump-s-covid-diagnosis/</link>
      <pubDate>Wed, 07 Oct 2020 00:00:00 +0000</pubDate>
      <guid>https://jlaw.netlify.app/2020/10/07/looking-for-media-bias-in-coverage-of-trump-s-covid-diagnosis/</guid>
      <description>


&lt;p&gt;Within the United States, especially these last few years, there has been an increased focus on “fake news” or “bias in the media”. Fox News is typically the poster child for right-wing bias, and everything else seems to be the poster child for left-wing bias. While this is just a humble R blog (and &lt;strong&gt;NOT&lt;/strong&gt; a political blog), I thought it would be interesting to look at how a single event is covered by different media sources.&lt;/p&gt;
&lt;p&gt;I don’t know much about assessing media bias, but fortunately websites like &lt;a href=&#34;https://www.allsides.com/media-bias/media-bias-chart&#34;&gt;Allsides.com&lt;/a&gt; have done the research for me and produced the following chart breaking media outlets down by direction of bias:&lt;/p&gt;
&lt;center&gt;
&lt;img src=&#34;https://www.allsides.com/sites/default/files/AllSidesMediaBiasChart-Version3.jpg&#34; style=&#34;width:50.0%&#34; /&gt;
&lt;/center&gt;
&lt;p&gt;I don’t have a perspective on the accuracy of this chart. But it provides enough information to work with.&lt;/p&gt;
&lt;p&gt;Originally, I was planning on using the first Presidential Debate in late September, but with President Trump’s positive COVID-19 diagnosis on Friday 10/2, I’ve decided to use that event instead.&lt;/p&gt;
&lt;p&gt;The rules for this analysis are:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Pick one media outlet from each column of the Media Bias Chart above and find an article about Trump’s COVID diagnosis.&lt;/li&gt;
&lt;li&gt;No opinion pieces or editorials. The articles should be intended to be reporting on the facts of an event.&lt;/li&gt;
&lt;/ol&gt;
&lt;div id=&#34;the-data&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;The Data&lt;/h2&gt;
&lt;p&gt;All data was collected on Friday, October 2 (some articles have since changed). The five articles are listed below from most left-leaning to most right-leaning:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://www.huffpost.com/entry/donald-trump-tests-positive-coronavirus_n_5eb5d776c5b69c4b317a5ee5&#34;&gt;Huffington Post&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://www.cnn.com/2020/10/02/politics/president-donald-trump-coronavirus-positive-test/index.html&#34;&gt;CNN&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://apnews.com/article/virus-outbreak-donald-trump-elections-melania-trump-michael-pence-f6ba3a16ab9b74b161a3a7211248e97e&#34;&gt;Associated Press (AP)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://www.foxnews.com/politics/president-trump-confirms-he-first-lady-melania-trump-tested-positive-for-coronavirus&#34;&gt;Fox News&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://www.theblaze.com/news/trump-positive-coronavirus&#34;&gt;The Blaze&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I manually copy-pasted the titles, subtitles, and article text into &lt;code&gt;.txt&lt;/code&gt; files, ensuring that inserted links to other articles were not accidentally picked up.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;analysis-plan&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Analysis Plan&lt;/h2&gt;
&lt;p&gt;The main objectives for this analysis are to look at:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Sentiment Analysis of the five different articles&lt;/li&gt;
&lt;li&gt;Looking for the most representative words for each article&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;to see if we can learn anything about Media Bias from these five sources.&lt;/p&gt;
&lt;p&gt;The libraries that will be used for this analysis are:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidyverse) #Our Workhorse Data Manipulation / Plotting Functions
library(tidytext) #Tidyverse Friend Package for Text-Mining
library(scales) #For easier value formatting
library(ggtext) # To Be Able to Use Images On Plots
library(ggupset) # For Creating an Upset Chart
library(UpSetR) # For Creating an Upset Chart&lt;/code&gt;&lt;/pre&gt;
&lt;div id=&#34;reading-in-the-data&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Reading in the data&lt;/h3&gt;
&lt;p&gt;As mentioned above, the text of the five articles is contained in five text files. The following code block looks in the working directory and uses &lt;code&gt;purrr&lt;/code&gt;’s &lt;code&gt;map_dfr&lt;/code&gt; function to execute the &lt;code&gt;readr::read_table&lt;/code&gt; function and create a column marking which source the text came from:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;articles &amp;lt;- dir() %&amp;gt;% 
  keep(str_detect(., &amp;#39;\\.txt&amp;#39;)) %&amp;gt;% 
  map_dfr(
    ~read_table(.x, col_names = F) %&amp;gt;% 
      mutate(source = str_remove_all(.x, &amp;#39;\\.txt&amp;#39;))
    ) &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;One of my earlier blog posts &lt;a href=&#34;https://jlaw.netlify.app/2020/08/02/what-s-the-difference-between-instagram-and-tiktok-using-word-embeddings-to-find-out/&#34;&gt;on creating word embeddings to compare TikTok and Instagram&lt;/a&gt; describes the basic pieces of &lt;code&gt;tidytext&lt;/code&gt;. The first step of the text analysis is to ‘tokenize’ (split the data into one word per row) using &lt;code&gt;tidytext::unnest_tokens()&lt;/code&gt;, which will break apart the &lt;code&gt;X1&lt;/code&gt; column containing sentences into a new column called &lt;code&gt;word&lt;/code&gt;. The next step is removing stop words, which are common words like “the”, “and”, and “a” that don’t add much meaning to the analysis. These words are contained in the &lt;code&gt;stop_words&lt;/code&gt; data set, and using &lt;code&gt;anti_join()&lt;/code&gt; will remove them from our word list.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;words &amp;lt;- articles %&amp;gt;% 
  unnest_tokens(word, X1) %&amp;gt;% 
  anti_join(stop_words)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;how-long-are-each-of-the-articles&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;How long are each of the articles?&lt;/h3&gt;
&lt;p&gt;The first analysis that we can do is look at the word count of the five articles:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;count(words, source, name = &amp;quot;article length&amp;quot;, sort = T) %&amp;gt;% 
  knitr::kable()&lt;/code&gt;&lt;/pre&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;source&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;article length&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;ap&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;603&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;cnn&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;494&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;foxnews&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;394&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;huffpo&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;271&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;theblaze&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;135&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;What’s potentially interesting about the word counts is that the Associated Press article, which is supposed to be the least biased, has the largest word count. The slightly biased sources (CNN and Fox News) are the next longest. Finally, the articles representing the most biased sources (Huffington Post and The Blaze) were the shortest.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;what-are-the-most-common-words-from-each-source&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;What Are The Most Common Words From Each Source?&lt;/h3&gt;
&lt;p&gt;Another quick analysis is to look at the most frequent words from each of the five sources. The following code block takes advantage of the &lt;code&gt;ggtext&lt;/code&gt; package to use logos for the axis labels. &lt;code&gt;ggtext&lt;/code&gt; is able to render HTML tags on ggplots using the &lt;code&gt;element_markdown()&lt;/code&gt; function, and the &lt;code&gt;icons_strip&lt;/code&gt; object contains those tags.&lt;/p&gt;
&lt;p&gt;The following code block gets the word counts for each word in each source and then uses dplyr’s &lt;code&gt;slice_max()&lt;/code&gt; to only keep the Top 10 words for each source. In the &lt;code&gt;ggplot&lt;/code&gt; command, the &lt;code&gt;reorder_within()&lt;/code&gt; and &lt;code&gt;scale_x_reordered()&lt;/code&gt; allows for separate sorting within each facet on the resulting chart.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;icons_strip &amp;lt;- c(
  huffpo = &amp;quot;&amp;lt;img src=&amp;#39;huffpost.jpg&amp;#39; width = 90 /&amp;gt;&amp;quot;,
  cnn = &amp;quot;&amp;lt;img src=&amp;#39;CNN.png&amp;#39; width=&amp;#39;40&amp;#39; /&amp;gt;&amp;quot;,
  ap = &amp;quot;&amp;lt;img src=&amp;#39;ap.png&amp;#39; width=&amp;#39;80&amp;#39; /&amp;gt;&amp;quot;,
  foxnews = &amp;quot;&amp;lt;img src=&amp;#39;foxnews.jpg&amp;#39; width=&amp;#39;70&amp;#39; /&amp;gt;&amp;quot;,
  theblaze = &amp;quot;&amp;lt;img src=&amp;#39;theblaze.jpeg&amp;#39; width=&amp;#39;50&amp;#39; /&amp;gt;&amp;quot;
  )

words %&amp;gt;% 
  group_by(source, word) %&amp;gt;% 
  summarize(cnt = n()) %&amp;gt;% 
  slice_max(order_by = cnt, n = 10, with_ties = F) %&amp;gt;% 
  mutate(icon = unname(icons_strip[source]),
         icon = factor(icon, 
                       levels = c(icons_strip[&amp;#39;huffpo&amp;#39;], icons_strip[&amp;#39;cnn&amp;#39;],
                                  icons_strip[&amp;#39;ap&amp;#39;], icons_strip[&amp;#39;foxnews&amp;#39;],
                                  icons_strip[&amp;#39;theblaze&amp;#39;]))
         ) %&amp;gt;%
  ggplot(aes(x = reorder_within(word, cnt, source), y = cnt, fill = icon)) +
    geom_col() +
    scale_x_reordered() +
    scale_fill_discrete(guide = F) +
    labs(x = &amp;quot;Words&amp;quot;,
         y = &amp;quot;# of Occurrences&amp;quot;,
         title = &amp;quot;Most Common Words in Articles about Trump&amp;#39;s Positive COVID Test&amp;quot;) +
    facet_wrap(~icon, nrow = 2, scales = &amp;quot;free&amp;quot;) +
    coord_flip() +
    theme(
      strip.text.x = element_markdown(),
      strip.background.x = element_blank()
    )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/post/2020-10-07-looking-for-media-bias-in-coverage-of-trump-s-covid-diagnosis/index_files/figure-html/most_common_words-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Unsurprisingly, the words “President” or “Trump” were the most common words in all five articles.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;looking-into-the-sentiment-of-each-article&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Looking into the sentiment of each article&lt;/h3&gt;
&lt;p&gt;A common type of text analysis is &lt;strong&gt;&lt;em&gt;sentiment analysis&lt;/em&gt;&lt;/strong&gt; and a simple version of sentiment analysis is to lookup each word of text in a dictionary that labels it as either positive sentiment or negative sentiment. The &lt;code&gt;tidytext&lt;/code&gt; package contains a number of different sentiment lexicons and in this analysis I’ll be using the &lt;a href=&#34;https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html&#34;&gt;Bing Liu lexicon&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;In the following code block I am appending the sentiment labels to our existing data set. Since most of the words will not appear in the sentiment lexicon I’m setting the &lt;code&gt;NA&lt;/code&gt; values to a label of “neutral”. Additionally, I’m creating numeric codes for positive words (+1), negative words (-1) and neutral words (0).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;words_with_sentiment &amp;lt;- words %&amp;gt;% 
  left_join(get_sentiments(&amp;#39;bing&amp;#39;)) %&amp;gt;% 
  mutate(
    sentiment = if_else(is.na(sentiment), &amp;#39;neutral&amp;#39;, sentiment),
    sentiment_numeric  = case_when(
                sentiment == &amp;#39;positive&amp;#39; ~ 1,
                sentiment == &amp;#39;negative&amp;#39; ~ -1,
                sentiment == &amp;#39;neutral&amp;#39; ~ 0)
  )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The first look at sentiment across the five articles is to look at an average sentiment score using the numeric coding above. Everything in this code block is either vanilla &lt;code&gt;dplyr&lt;/code&gt; or &lt;code&gt;ggplot&lt;/code&gt; or has been covered in other blog posts. New to this block is &lt;code&gt;scale_x_discrete(labels = icons)&lt;/code&gt; which uses the named vector of &lt;code&gt;&amp;lt;img&amp;gt;&lt;/code&gt; tags to apply the logos to the y-axis after the coordinate flip:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;icons &amp;lt;- c(
  huffpo = &amp;quot;&amp;lt;img src=&amp;#39;huffpost.jpg&amp;#39; width = 130 /&amp;gt;&amp;quot;,
  cnn = &amp;quot;&amp;lt;img src=&amp;#39;CNN.png&amp;#39; width=&amp;#39;50&amp;#39; /&amp;gt;&amp;quot;,
  ap = &amp;quot;&amp;lt;img src=&amp;#39;ap.png&amp;#39; width=&amp;#39;130&amp;#39; /&amp;gt;&amp;quot;,
  foxnews = &amp;quot;&amp;lt;img src=&amp;#39;foxnews.jpg&amp;#39; width=&amp;#39;75&amp;#39; /&amp;gt;&amp;quot;,
  theblaze = &amp;quot;&amp;lt;img src=&amp;#39;theblaze.jpeg&amp;#39; width=&amp;#39;75&amp;#39; /&amp;gt;&amp;quot;
  )

words_with_sentiment %&amp;gt;% 
  group_by(source) %&amp;gt;% 
  summarise(
    avg_sentiment = sum(sentiment_numeric, na.rm = T) / n()
  ) %&amp;gt;% 
  ggplot(aes(
    x = factor(source, 
               levels = c(&amp;#39;theblaze&amp;#39;, &amp;#39;foxnews&amp;#39;, &amp;#39;ap&amp;#39;, &amp;#39;cnn&amp;#39;, &amp;#39;huffpo&amp;#39;)),
    y = avg_sentiment
  )) + 
  geom_linerange(ymin = 0, aes(ymax = avg_sentiment)) + 
  geom_point(size = 4, aes(color = ifelse(avg_sentiment &amp;lt; 0, &amp;#39;red&amp;#39;, &amp;#39;green&amp;#39;))) + 
  geom_text(aes(label = avg_sentiment %&amp;gt;% round(2)), vjust = -1) + 
  geom_hline(yintercept = 0) + 
  labs(x = &amp;quot;&amp;quot;, 
       y = expression(paste(&amp;quot;Avg. Sentiment Scores (&amp;quot;,
                            Sigma,&amp;quot; Positive - &amp;quot;, 
                            Sigma,&amp;quot; Negative) / # Words&amp;quot;)),
       title = &amp;quot;Total Sentiment of Articles About Trump&amp;#39;s Positive COVID Test&amp;quot;) + 
  scale_x_discrete(labels = icons) + 
  scale_y_continuous(labels = scales::percent) + 
  scale_color_identity(guide = F) + 
  coord_flip() + 
  cowplot::theme_cowplot() + 
  theme(
    axis.line = element_blank(),
    axis.ticks = element_blank(),
    axis.text.y = element_markdown(color = &amp;quot;black&amp;quot;, size = 11),
    axis.text.x = element_blank(),
    axis.title.x = element_text(size = 10),
    plot.title.position = &amp;#39;plot&amp;#39;
  )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/post/2020-10-07-looking-for-media-bias-in-coverage-of-trump-s-covid-diagnosis/index_files/figure-html/sentiment_lollipop-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Most interesting about these results is that &lt;strong&gt;there is a rank ordering: the most left-leaning article is the most negative and the most right-leaning is the most positive&lt;/strong&gt;. Other items of note are that the right-leaning articles have higher absolute sentiment scores than the left-leaning articles, and that the Associated Press article has an average sentiment score of nearly &lt;strong&gt;zero&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;While I went into this with no prior hypothesis, I’m guessing that the left-leaning articles are taking a more dire view of Trump’s COVID diagnosis while the right-leaning are focusing more on the hopeful recovery.&lt;/p&gt;
&lt;p&gt;An alternative lens is rather than looking at average sentiment scores, I can look at the distribution of Positive/Negative/Neutral words within each source.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;words_with_sentiment %&amp;gt;% 
  count(source, sentiment) %&amp;gt;% 
  group_by(source) %&amp;gt;% 
  mutate(pct = n / sum(n)) %&amp;gt;% 
  ungroup() %&amp;gt;% 
  ggplot(aes(x = factor(source, 
                        levels = c(&amp;#39;theblaze&amp;#39;, &amp;#39;foxnews&amp;#39;, &amp;#39;ap&amp;#39;, &amp;#39;cnn&amp;#39;, &amp;#39;huffpo&amp;#39;)),
             y = pct, 
             fill = factor(sentiment, 
                           levels = c(&amp;#39;negative&amp;#39;, &amp;#39;neutral&amp;#39;, &amp;#39;positive&amp;#39;))
             )
         ) + 
    geom_col() + 
    geom_text(aes(label = pct %&amp;gt;% scales::percent(accuracy = 1)), 
              position = position_stack(vjust = .5)) + 
    labs(x = &amp;quot;&amp;quot;, 
         y = &amp;quot;% of Words&amp;quot;,
         title = &amp;quot;What is the Sentiment of Different Articles on Trump&amp;#39;s Positive COVID Test?&amp;quot;, 
         fill = &amp;quot;Sentiment&amp;quot;) + 
  scale_x_discrete(labels = icons) + 
  guides(fill = guide_legend(reverse = T)) + 
  coord_flip() + 
  cowplot::theme_cowplot() + 
  theme(
    plot.title.position = &amp;#39;plot&amp;#39;,
    legend.position = &amp;#39;bottom&amp;#39;,
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank(),
    axis.text.y = element_markdown(),
    plot.title = element_text(size = 12)
  )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/post/2020-10-07-looking-for-media-bias-in-coverage-of-trump-s-covid-diagnosis/index_files/figure-html/sentiment_dist-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;While most words in all five articles are neutral, this view lets us see the positive vs. negative distribution in more detail. The left-leaning sources skew slightly toward negative sentiment, with close to equal percentages of positive and negative words, while the right-leaning articles have a higher occurrence of positive terms and a notably lower occurrence of negative terms.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;determing-the-most-representitive-words-with-tf-idf&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Determining the Most Representative Words with TF-IDF&lt;/h3&gt;
&lt;p&gt;&lt;a href=&#34;https://en.wikipedia.org/wiki/Tf%E2%80%93idf&#34;&gt;TF-IDF&lt;/a&gt;, or Term Frequency-Inverse Document Frequency, produces a numeric value representing “how important a word is to a document in a collection of documents”. Earlier in this post we looked at the most common words in each source. A problem with using &lt;em&gt;most frequent word&lt;/em&gt; as a measure of importance is that a word that is very common everywhere can’t really be important to any one document. For example, the word “President” is likely not descriptive of any single article since all five are about President Trump’s COVID diagnosis. We still want frequency to be part of the measure of importance, but we can solve the common-to-all-documents problem by weighting the metric by the inverse of the number of documents a word appears in. This is the IDF (inverse document frequency) portion of TF-IDF. The result is the product of the:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Term Frequency (TF) - Within each document what % of words is word X?&lt;/li&gt;
&lt;li&gt;Inverse Document Frequency (IDF) - How many documents does word X appear in?
&lt;ul&gt;
&lt;li&gt;This is defined as Log(Total # of Documents / # of Documents with Word X)&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;/ul&gt;
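&lt;p&gt;As a quick worked example in base R, with counts made up purely for illustration: suppose a word appears 5 times in a 271-word article and shows up in 2 of the 5 articles.&lt;/p&gt;

```r
# Made-up counts, purely for illustration
n_docs      <- 5                   # total number of articles
docs_with_w <- 2                   # articles containing the word
tf  <- 5 / 271                     # term frequency within one article
idf <- log(n_docs / docs_with_w)   # inverse document frequency
tf_idf <- tf * idf
round(tf_idf, 4)                   # ~0.0169
```

&lt;p&gt;Note that a word appearing in all five articles gets an IDF of log(5/5) = 0, and therefore a TF-IDF of zero, which is why ubiquitous words like “President” stop mattering.&lt;/p&gt;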
&lt;p&gt;Calculating TF-IDF is easy with &lt;code&gt;tidytext::bind_tf_idf()&lt;/code&gt;, which takes as parameters the word column (word), the document column (source), and a counts column (n). The function appends the tf, idf, and tf_idf columns to the data set.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;words %&amp;gt;% 
  add_count(word, name = &amp;#39;total_count&amp;#39;) %&amp;gt;%
  filter(total_count &amp;gt;= 5) %&amp;gt;% 
  count(source, word) %&amp;gt;% 
  bind_tf_idf(word, source, n) %&amp;gt;% 
  mutate(icon = unname(icons_strip[source]),
         icon = factor(icon, 
                       levels = c(icons_strip[&amp;#39;huffpo&amp;#39;], icons_strip[&amp;#39;cnn&amp;#39;],
                                  icons_strip[&amp;#39;ap&amp;#39;], icons_strip[&amp;#39;foxnews&amp;#39;],
                                  icons_strip[&amp;#39;theblaze&amp;#39;]))
         ) %&amp;gt;% 
  group_by(icon) %&amp;gt;% 
  slice_max(order_by = tf_idf, n = 10, with_ties = F) %&amp;gt;% 
  ggplot(aes(x = reorder_within(word, tf_idf, icon), 
             y = tf_idf, 
             fill = icon)) +
    geom_col() +
    scale_x_reordered() +
    scale_fill_discrete(guide = F) +
    labs(x = &amp;quot;Words&amp;quot;,
         y = &amp;quot;TF-IDF&amp;quot;,
         title = &amp;quot;Most Characteristic Words For Each Source&amp;quot;,
         subtitle = &amp;quot;Based on TF-IDF&amp;quot;) +
    facet_wrap(~icon, nrow = 2, scales = &amp;quot;free_y&amp;quot;) +
    coord_flip() +
    theme(
      strip.text.x = element_markdown(),
      strip.background.x = element_blank()
    )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/post/2020-10-07-looking-for-media-bias-in-coverage-of-trump-s-covid-diagnosis/index_files/figure-html/tfidf-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;For the Huffington Post article the most important word is “debate”, followed by “wearing”. The word “debate” does not appear at all in the right-leaning articles.&lt;/p&gt;
&lt;p&gt;In the Fox News article, it shouldn’t be a surprise that the word “fox” has a higher importance to Fox News than to any other article.&lt;/p&gt;
&lt;p&gt;What was interesting was that the two right-leaning articles’ most representative word was “tweeted”. In actually reading the articles, both primarily quote tweets from the President, Vice President, First Lady, and various other White House spokespeople.&lt;/p&gt;
&lt;p&gt;Overall, the TF-IDF results weren’t particularly interesting besides the Huffington Post’s referencing the debates and the right-leaning articles relying on Twitter for much of the information.&lt;/p&gt;
&lt;p&gt;Content aside, I was a little confused how “tweeted” could be the most representative word for two different articles, since importance is partially determined by the word NOT appearing in other articles. After thinking about this more, it is possible if one of the articles has a lot of word overlap with the other articles. The Blaze article was already the shortest at only 135 words, and perhaps it doesn’t add much new information.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;what-is-the-overlap-of-words-across-all-sources&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;What is the Overlap of Words Across All Sources?&lt;/h3&gt;
&lt;p&gt;In order to determine whether The Blaze article just doesn’t have many unique words, we’ll need a way to see which sources contain each word. With a smaller number of sources, tools like Venn diagrams would be useful for determining overlaps, but with 5 different sources there are 32 possible overlap combinations.&lt;/p&gt;
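&lt;p&gt;The 32 comes from each of the five sources being either in or out of a given overlap group; a quick base R check:&lt;/p&gt;

```r
# Each source is either present or absent in an overlap group
n_sources <- 5
combos <- expand.grid(rep(list(c(TRUE, FALSE)), n_sources))
nrow(combos)  # 2^5 = 32, counting the empty combination
```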
&lt;p&gt;A useful tool for viewing overlap of many groups in a simpler way is an “Upset” chart. In an upset chart the number of words occurring in each overlap group is displayed as a bar and what the group represents is shown in the box beneath the chart where a circle is filled in if the source is part of the group and not filled in otherwise. Shout out to &lt;a href=&#34;https://soroosj.netlify.app/2020/07/07/cocktails-upset/&#34;&gt;Joel Soroos, whose blog post helped me implement Upset charts in R&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;There are a couple of packages that can make these charts but I’ll use &lt;code&gt;ggupset&lt;/code&gt; since it works well with the tidy data format. In order to get the data in the proper format I’ll need to structure the data so that each row is a word and then there is a list-column containing all of the sources that contain that word. This can be done using &lt;code&gt;group_by&lt;/code&gt; and &lt;code&gt;summarize&lt;/code&gt; with &lt;code&gt;list()&lt;/code&gt; as the aggregate function.&lt;/p&gt;
&lt;p&gt;Then a &lt;code&gt;ggplot&lt;/code&gt; can be turned into an Upset chart through the use of &lt;code&gt;scale_x_upset()&lt;/code&gt;. Another piece of this code that’s pretty new to me is &lt;code&gt;geom_text(stat=&#39;count&#39;, aes(label=after_stat(count)), nudge_y = 10)&lt;/code&gt;. Since the data is structured as a list of words, I don’t have a column that represents the number of words in each group to pass into &lt;code&gt;geom_text()&lt;/code&gt;. Therefore &lt;code&gt;after_stat&lt;/code&gt; tells &lt;code&gt;geom_text()&lt;/code&gt; that we’re going to use the &lt;code&gt;count&lt;/code&gt; statistic, but also to set the label value &lt;strong&gt;after doing the stat calculation&lt;/strong&gt;. Admittedly I’m not great with the &lt;code&gt;stat_*&lt;/code&gt; aspects of ggplot and the &lt;code&gt;after_*&lt;/code&gt; functions, but it’s nice that I don’t have to do all the calculations before passing the data into &lt;code&gt;ggplot&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;words %&amp;gt;% 
  count(source, word) %&amp;gt;%
  group_by(word) %&amp;gt;% 
  summarize(sources = list(source)) %&amp;gt;% 
  ggplot(aes(x = sources)) + 
    geom_bar() + 
    geom_text(stat=&amp;#39;count&amp;#39;, aes(label=after_stat(count)), nudge_y = 10) +
    scale_x_upset() + 
    labs(x = &amp;quot;Set of Sources&amp;quot;, 
         y = &amp;quot;# of Unique Words&amp;quot;,
         title = &amp;quot;How Many Sources Does Each Word Appear In?&amp;quot;,
         caption = &amp;quot;Each column represents a unique combination of sources&amp;quot;) + 
    cowplot::theme_cowplot() &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/post/2020-10-07-looking-for-media-bias-in-coverage-of-trump-s-covid-diagnosis/index_files/figure-html/upset_chart-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;To read the Upset chart, the first bar shows that the largest group is composed of 197 words and represents the words that are &lt;strong&gt;ONLY&lt;/strong&gt; in the CNN article. The second bar is 186 words that &lt;strong&gt;ONLY&lt;/strong&gt; appear in the Associated Press article. For an example of an overlap, the 5th bar represents the 40 words that appear in &lt;strong&gt;BOTH&lt;/strong&gt; the AP and CNN article.&lt;/p&gt;
&lt;p&gt;To answer the original question of whether The Blaze’s high TF-IDF score for ‘tweeted’ is due to a low number of unique words in The Blaze article, we can look for the group made up of &lt;strong&gt;ONLY&lt;/strong&gt; Blaze words. Finding the column with only the Blaze circle filled in, we can see that there are only eight words unique to the Blaze article. Granted, some of this is due to The Blaze being the shortest article and the AP article having the most words.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;conclusion&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;This post looked at five articles about the event of President Trump’s COVID-19 diagnosis from different media sources to see how coverage might differ depending on each outlet’s bias. While I’m not an expert in bias, and I don’t think any results here are strong enough to suggest obvious bias, there were a few areas where the ordering does seem to indicate some amount of ‘slant’ in line with how &lt;a href=&#34;https://www.allsides.com/&#34;&gt;AllSides.com&lt;/a&gt; rated the five outlets.&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;The level of bias (both left and right) of a media outlet correlated with shorter articles.&lt;/li&gt;
&lt;li&gt;For this particular event, based on sentiment analysis, the left-leaning outlets took a slightly negative slant while the right-leaning outlets took a more positive slant.&lt;/li&gt;
&lt;li&gt;The right-leaning articles appeared to rely more on tweets for the text content of the article.&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;div id=&#34;appendix-upset-charts-with-upsetr&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Appendix: Upset Charts with UpSetR&lt;/h2&gt;
&lt;p&gt;There is an alternative implementation of Upset charts in the &lt;code&gt;UpSetR&lt;/code&gt; package that doesn’t run through ggplot. In order to use this package, each source needs to become its own column with a value of 1 if the word appears in the source and 0 otherwise. Additionally, the data can’t be in the tibble format, which is why &lt;code&gt;as.data.frame&lt;/code&gt; is used before calling the &lt;code&gt;upset()&lt;/code&gt; function.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;words %&amp;gt;% 
  distinct(source, word) %&amp;gt;% 
  mutate(val = 1) %&amp;gt;%
  pivot_wider(
    names_from = &amp;#39;source&amp;#39;,
    values_from = &amp;#39;val&amp;#39;,
    values_fill = 0
  ) %&amp;gt;%
  as.data.frame %&amp;gt;% 
  upset(order.by = &amp;#39;freq&amp;#39;,
        empty.intersections = T,
        sets.x.label = &amp;#39;Word Count&amp;#39;,
        text.scale = 1.25)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/post/2020-10-07-looking-for-media-bias-in-coverage-of-trump-s-covid-diagnosis/index_files/figure-html/upset_chart2-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;The two main advantages of UpSetR are the ability to show empty intersecting groups and the Word Count graph on the left. For example, with this version we can see that there are zero words that appear only in the Blaze and the Huffington Post articles. Also, it’s clearer in this package that the AP and CNN articles have more words than the rest.&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>What&#39;s The Best Day to Get Married?</title>
      <link>https://jlaw.netlify.app/2020/10/01/what-s-the-best-day-to-get-married/</link>
      <pubDate>Thu, 01 Oct 2020 00:00:00 +0000</pubDate>
      <guid>https://jlaw.netlify.app/2020/10/01/what-s-the-best-day-to-get-married/</guid>
      <description>


&lt;div id=&#34;tldr&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;TL;DR&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;There really &lt;strong&gt;isn’t&lt;/strong&gt; a best day to get married, as there isn’t much differentiation between the various days, both in this analysis and in life. Do what’s best for you and your relationship.&lt;/li&gt;
&lt;li&gt;However, if &lt;strong&gt;ALL&lt;/strong&gt; you care about is being married on a Saturday, maximizing the number of times your anniversary falls on a Saturday, and maximizing the number of “big” anniversaries that fall on a Saturday, then avoid the 24 months after a leap day!&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;the-original-objective-of-this-analysis&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;The original objective of this analysis&lt;/h1&gt;
&lt;p&gt;Having somewhat recently celebrated a 5th anniversary on a Saturday (🥂), Mrs. JLaw asked “How many anniversaries will we have on a Saturday?” and “When is the next big one that we’ll have?”. Upon finding out that our next “big” Saturday anniversary won’t come until our 50th, she suggested that I look into whether certain days would have been best to have gotten married.&lt;/p&gt;
&lt;p&gt;As it turned out in actually doing this analysis, there’s not that much difference in the number of Saturdays or “big” Saturdays regardless of wedding date.&lt;/p&gt;
&lt;p&gt;So the initial question was, &lt;strong&gt;&lt;em&gt;what are the BEST and WORST dates to get married&lt;/em&gt;&lt;/strong&gt; when optimizing for &lt;em&gt;maximizing the number of “big” (multiples of 5) anniversaries occurring on a Saturday&lt;/em&gt;, with the constraint that the initial wedding date ALSO needed to be on a Saturday.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;exploring-wedding-dates-and-anniversaries&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Exploring Wedding Dates and Anniversaries&lt;/h1&gt;
&lt;p&gt;Since I’ll be working with dates the &lt;code&gt;lubridate&lt;/code&gt; package will be the workhorse for preparing my data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidyverse) #Data Manipulation
library(lubridate) #Working with Dates
library(glue) # A package that works similar to the paste function
library(eulerr) # A package to create Venn-Diagrams (technically Euler Diagrams)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In order to make the universe of wedding dates tractable I’ll be looking at all potential dates occurring on a Saturday in the past 10 years (since 1/1/2010) and through the next 5 years (through 12/31/2025). The &lt;code&gt;seq.Date()&lt;/code&gt; function (from base R, though it pairs nicely with &lt;code&gt;lubridate&lt;/code&gt;’s date helpers) makes generating sequences of dates super easy. It works similarly to &lt;code&gt;seq()&lt;/code&gt; where you give it a starting point and an ending point, but in this case you also provide the interval (‘day’, ‘month’, ‘year’, etc.).&lt;/p&gt;
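&lt;p&gt;As a quick standalone illustration (a toy example, not part of the analysis that follows), here’s &lt;code&gt;seq.Date()&lt;/code&gt; generating a daily sequence that can then be filtered down to Saturdays with &lt;code&gt;wday()&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(lubridate)

#Every day in January 2020
jan &amp;lt;- seq.Date(ymd(20200101), ymd(20200131), by = &amp;#39;day&amp;#39;)

#Keep only the Saturdays (Jan 4, 11, 18, and 25 in 2020)
jan[wday(jan, label = TRUE, abbr = TRUE) == &amp;#39;Sat&amp;#39;]&lt;/code&gt;&lt;/pre&gt;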
&lt;p&gt;In the following code block, I’m constructing a tibble with a column called &lt;em&gt;wedding_date&lt;/em&gt; that is all days between 1/1/2010 and 12/31/2025 using the &lt;code&gt;ymd()&lt;/code&gt; function from &lt;code&gt;lubridate&lt;/code&gt; to turn the integers into a date. Then I’m creating a column called &lt;em&gt;wedding_date_day&lt;/em&gt; that uses the &lt;code&gt;wday()&lt;/code&gt; function from &lt;code&gt;lubridate&lt;/code&gt; to return the day of the week. The “abbr” and “label” options have it return “Mon”, “Tue”, “Wed” rather than integer values which is the default (this is in part because I constantly forget whether 1 refers to Sunday or Monday… so this eliminates that problem). Finally, I keep only dates that are Saturdays and remove leap days since those will get weird as we look at annual anniversaries.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;wedding_dates &amp;lt;- tibble(
  wedding_date = seq.Date(ymd(20100101), ymd(20251231), by = &amp;#39;day&amp;#39;),
  wedding_date_day = wday(wedding_date, abbr = T, label = T)
) %&amp;gt;% 
  #Keep only Saturdays
  filter(wedding_date_day == &amp;#39;Sat&amp;#39;) %&amp;gt;% 
  #Remove Leap Days (2/29) because they&amp;#39;re unique
  filter(!(day(wedding_date)==29 &amp;amp; month(wedding_date) == 2))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This will create a tibble with 834 rows representing all Saturdays between 2010 and 2025.&lt;/p&gt;
&lt;div id=&#34;counting-the-number-of-saturday-anniversaires-and-big-saturday-anniversaries&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Counting the Number of Saturday Anniversaries and “Big” Saturday Anniversaries&lt;/h2&gt;
&lt;p&gt;I will look at the first 50 years of marriage for any of these wedding dates. So for each of the 834 potential wedding dates I need to:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Calculate the day of week for each anniversary for the next 50 years&lt;/li&gt;
&lt;li&gt;For each wedding date, count the number of anniversaries that fall on a Saturday&lt;/li&gt;
&lt;li&gt;For each wedding date, count the number of “big” anniversaries that fall on a Saturday (again, “big” anniversaries being multiples of 5 such as 5th, 10th, …)&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;At first I really wanted to figure out a way to do this in a wide format using &lt;code&gt;map&lt;/code&gt; or &lt;code&gt;rowwise&lt;/code&gt; functions, but in the end I couldn’t figure it out in the time I wanted to spend exploring. Therefore, I’m keeping the data in a long format by using &lt;code&gt;tidyr::crossing()&lt;/code&gt; to expand each wedding date by the 50 anniversaries. So in the end each row in the initial data set will now have 50 rows.&lt;/p&gt;
&lt;p&gt;Then for each of the Wedding Date/Anniversary Year combinations, I re-use the &lt;code&gt;wday()&lt;/code&gt; function to get the day of the week and then &lt;code&gt;group_by&lt;/code&gt; the wedding date and &lt;code&gt;summarize()&lt;/code&gt; to count the number of Saturday anniversaries (&lt;em&gt;num_sat&lt;/em&gt;) and “big” Saturday anniversaries (&lt;em&gt;num_big_sat&lt;/em&gt;).&lt;/p&gt;
&lt;p&gt;The two non-typical parts of this code block are the &lt;code&gt;.groups&lt;/code&gt; argument to &lt;code&gt;summarize()&lt;/code&gt; and the use of &lt;code&gt;paste()&lt;/code&gt; in the &lt;code&gt;summarize()&lt;/code&gt;. Setting &lt;code&gt;.groups = &#39;drop&#39;&lt;/code&gt; returns an ungrouped tibble rather than only removing the last grouping layer, which is the default (the default would have returned a grouped tibble with &lt;em&gt;wedding_date&lt;/em&gt; as the grouping variable… which would probably be fine, but occasionally grouped tibbles cause downstream issues).&lt;/p&gt;
&lt;p&gt;Using &lt;code&gt;paste()&lt;/code&gt; in the &lt;code&gt;summarize()&lt;/code&gt; with the &lt;code&gt;collapse = &#39;, &#39;&lt;/code&gt; argument creates a concatenated, comma-separated string of the “big” anniversary years that fall on a Saturday (and &lt;code&gt;NA&lt;/code&gt; otherwise). &lt;code&gt;stringr::str_remove_all()&lt;/code&gt; is then used to remove the NAs from the string.&lt;/p&gt;
&lt;p&gt;If you’re reading this and are unfamiliar with regular expressions, I highly recommend getting familiar with them, especially when working with text. The regular expression “NA,? ?” means to remove the pattern “NA” followed by 0 or 1 commas followed by 0 or 1 spaces. The TL;DR here is that when a “big” anniversary didn’t fall on a Saturday the string “NA” would be concatenated in, and I wanted to remove those. So “5, NA, NA, NA, 45, NA” would just become “5, 45,”. Not ideal… but it’ll do.&lt;/p&gt;
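&lt;p&gt;Here’s that regular expression on its own, applied to a hand-built version of the example string (note the leftover trailing comma and space):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(stringr)

str_remove_all(&amp;quot;5, NA, NA, NA, 45, NA&amp;quot;, &amp;quot;NA,? ?&amp;quot;)
## [1] &amp;quot;5, 45, &amp;quot;&lt;/code&gt;&lt;/pre&gt;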
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;wedding_dates_w_annv &amp;lt;- wedding_dates %&amp;gt;% 
  #Expand Each Date to Have 50 Anniversaries
  crossing(anniversary = 1:50) %&amp;gt;% 
  #Get the Day of Week for Those Anniversaries
  mutate(anniversary_day = wday(wedding_date + years(anniversary), label = T, abbr = T)) %&amp;gt;% 
  #Summarize By Wedding Date counting the number of saturdays, number of saturdays w/ meaningful anniversary
  group_by(wedding_date, wedding_date_day) %&amp;gt;% 
  summarize(
    num_sat = sum(anniversary_day == &amp;#39;Sat&amp;#39;),
    num_big_sat = sum(anniversary_day == &amp;#39;Sat&amp;#39; &amp;amp; anniversary %% 5 == 0),
    #Building a string of all meaningful anniversary years,
    big_sat_years = str_remove_all(
      paste(
        if_else(anniversary_day == &amp;#39;Sat&amp;#39; &amp;amp; anniversary %% 5 == 0, 
                anniversary, 
                NA_integer_
                ), 
        collapse = &amp;#39;, &amp;#39;), 
      &amp;quot;NA,? ?&amp;quot;),
    .groups = &amp;#39;drop&amp;#39;
  )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Post-processing the data looks like:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;center&#34;&gt;wedding_date&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;wedding_date_day&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;num_sat&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;num_big_sat&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;big_sat_years&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;2010-01-02&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;Sat&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;7&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;45,&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;center&#34;&gt;2010-01-09&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;Sat&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;7&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;45,&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;center&#34;&gt;2010-01-16&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;Sat&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;7&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;45,&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;div id=&#34;how-many-big-saturday-anniversaries-does-anyone-get&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;How many “Big” Saturday Anniversaries Does Anyone Get?&lt;/h3&gt;
&lt;p&gt;The first question to explore is, for the 834 Saturdays in our data as potential wedding dates, how many of the “big” anniversaries will fall on a Saturday. The following code block is pretty vanilla &lt;code&gt;dplyr&lt;/code&gt; with the use of &lt;code&gt;count()&lt;/code&gt; and &lt;code&gt;mutate()&lt;/code&gt;. If you’ve never seen the &lt;code&gt;glue&lt;/code&gt; package and function before, it works a lot like &lt;code&gt;paste()&lt;/code&gt; in its most basic form. The main difference is that R will execute the code within the &lt;code&gt;{ }&lt;/code&gt; so it can be included directly within the quotes rather than separated by commas. It can also be used similarly to &lt;code&gt;.format()&lt;/code&gt; in Python.&lt;/p&gt;
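&lt;p&gt;For anyone new to &lt;code&gt;glue&lt;/code&gt;, here’s a minimal side-by-side with &lt;code&gt;paste0()&lt;/code&gt; (a toy example, not part of the analysis):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(glue)

n &amp;lt;- 834
paste0(&amp;quot;n = &amp;quot;, n, &amp;quot; (&amp;quot;, round(n / 10), &amp;quot;%)&amp;quot;)  #pieces separated by commas
glue(&amp;quot;n = {n} ({round(n / 10)}%)&amp;quot;)               #code runs inside the braces
## n = 834 (83%)&lt;/code&gt;&lt;/pre&gt;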
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;wedding_dates_w_annv %&amp;gt;% 
  # Get frequencies of Big Saturday Anniversaries
  count(num_big_sat) %&amp;gt;% 
  # Create %s 
  mutate(pct = n/sum(n)) %&amp;gt;% 
  ggplot(aes(x = as.factor(num_big_sat), y = pct, fill = as.factor(num_big_sat))) +
    geom_col() + 
    geom_text(aes(label = glue(&amp;quot;{pct %&amp;gt;% scales::percent()} (n={n %&amp;gt;% scales::comma()})&amp;quot;)), nudge_y = 0.02) + 
    labs(title = &amp;quot;How many ***BIG*** anniversaries are celebrated on Saturday?&amp;quot;,
         subtitle = glue(&amp;quot;Saturday Wedding Dates 2010 - 2025 (n = {nrow(wedding_dates_w_annv)})&amp;quot;),
         caption = &amp;quot;Big = Multiple of 5 (5th, 10th, etc.)&amp;quot;,
         x = &amp;quot;# of Big Anniversaries on Saturdays&amp;quot;,
         y = &amp;quot;% of Wedding Dates&amp;quot;) + 
    scale_fill_discrete(guide = F) + 
    cowplot::theme_cowplot() + 
    theme(
      plot.title = ggtext::element_markdown(),
      axis.text.y = element_blank(),
      axis.ticks.y = element_blank()
    )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/post/2020-10-01-what-s-the-best-day-to-get-married/index_files/figure-html/plot_num_big_saturdays-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;The primary reason there &lt;strong&gt;isn’t&lt;/strong&gt; a best or worst wedding date is that all potential wedding dates either have 1 or 2 &lt;strong&gt;BIG&lt;/strong&gt; anniversaries on a Saturday. So there isn’t too much of a difference in choice of dates.&lt;/p&gt;
&lt;p&gt;So let’s look at how many anniversaries in total occur on a Saturday.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;how-many-total-saturday-anniversaries-does-anyone-get&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;How many total Saturday Anniversaries Does Anyone Get?&lt;/h3&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;wedding_dates_w_annv %&amp;gt;% 
  count(num_sat) %&amp;gt;% 
  mutate(pct = n/sum(n)) %&amp;gt;% 
  ggplot(aes(x = as.factor(num_sat), y = pct, fill = as.factor(num_sat))) +
  geom_col() + 
  geom_text(aes(label = glue(&amp;quot;{pct %&amp;gt;% scales::percent()} (n={n %&amp;gt;% scales::comma()})&amp;quot;)), nudge_y = 0.02) + 
  labs(title = &amp;quot;How many anniversaries are celebrated on Saturday?&amp;quot;,
       subtitle = glue(&amp;quot;Saturday Wedding Dates 2010 - 2025 (n = {nrow(wedding_dates_w_annv)})&amp;quot;),
       x = &amp;quot;# of Anniversaries on Saturdays&amp;quot;,
       y = &amp;quot;% of Wedding Dates&amp;quot;) + 
  scale_fill_discrete(guide = F) + 
  cowplot::theme_cowplot() + 
  theme(
    axis.text.y = element_blank(),
    axis.ticks.y = element_blank()
  )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/post/2020-10-01-what-s-the-best-day-to-get-married/index_files/figure-html/plot_num_total_saturday-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Continuing with the theme of there not being major differences: 75% of Saturday wedding dates will have 7 anniversaries on a Saturday and 25% will have 6. So, while 7 would be preferable, the difference between 7 and 6 again is not large.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;looking-at-both-total-saturdays-and-big-saturdays&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Looking at Both Total Saturdays and “Big” Saturdays&lt;/h3&gt;
&lt;p&gt;Since “big” anniversaries had a 50/50 distribution and overall Saturdays had a 25/75 distribution, the next step is to look at the cross-tabulation of the two previous fields:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;wedding_dates_w_annv %&amp;gt;% 
  count(num_sat, num_big_sat) %&amp;gt;% 
  mutate(pct = n / sum(n)) %&amp;gt;% 
  ggplot(aes(x = factor(num_sat), y = factor(num_big_sat), fill = pct)) + 
    geom_tile() + 
    geom_text(aes(label = glue(&amp;quot;{pct %&amp;gt;% scales::percent()} \n (n={n})&amp;quot;))) + 
    labs(title = &amp;quot;A deeper look into Saturday Anniversaries&amp;quot;,
         subtitle = glue(&amp;quot;Saturday Wedding Dates 2010 - 2025 (n = {nrow(wedding_dates_w_annv)})&amp;quot;),
         x = &amp;quot;# of Anniversaries on Saturdays&amp;quot;,
         y = &amp;quot;# of &amp;#39;Big&amp;#39; Anniversaries on Saturday&amp;quot;) + 
    scale_fill_gradient(guide = F, low = &amp;quot;#769293&amp;quot;, high = &amp;quot;#fad7d5&amp;quot;) +
    cowplot::theme_cowplot()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/post/2020-10-01-what-s-the-best-day-to-get-married/index_files/figure-html/unnamed-chunk-1-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Looking across both dimensions, everyone who has two “big” anniversaries on a Saturday ALSO has 7 anniversaries on a Saturday. However, not everyone who has 7 anniversaries on Saturday will have 2 “big” anniversaries on a Saturday. Instead there are three groups:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;6 Total / 1 Big (25%)&lt;/li&gt;
&lt;li&gt;7 Total / 1 Big (25%)&lt;/li&gt;
&lt;li&gt;7 Total / 2 Big (50%)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In this case, having 7 anniversaries and 2 “Big” Anniversaries seems preferable to the other two groups… if you only cared about having your anniversary on a Saturday.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;what-big-anniversaries-will-be-celebrated-on-saturdays&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;What “Big” Anniversaries Will Be Celebrated on Saturdays?&lt;/h3&gt;
&lt;p&gt;So far, I’ve defined “big” anniversaries as multiples of 5 (5th, 10th, … 45th, 50th). However, I haven’t looked at which of those big ones are occurring on a Saturday. To show these “big” anniversaries I’ll use the &lt;code&gt;eulerr&lt;/code&gt; package to create a Venn-Diagram of these years.&lt;/p&gt;
&lt;p&gt;The package expects a specific format where each column is a logical indicating whether or not an observation is a member of that group. From a quick check on the &lt;em&gt;big_sat_years&lt;/em&gt; field I can see that the only “big” anniversaries that fall on Saturdays are the 5th, 45th, and 50th.&lt;/p&gt;
&lt;p&gt;Of note is the regular expression “\b5\b” for identifying the 5th anniversary. &lt;code&gt;\\b&lt;/code&gt; represents a word boundary so it is included to make sure that the 5th anniversary doesn’t accidentally get picked up by &lt;code&gt;str_detect()&lt;/code&gt; as part of 45 or 50, which would occur if I only searched for “5”.&lt;/p&gt;
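&lt;p&gt;A quick demonstration of why the word boundaries matter (toy strings, not the real data):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(stringr)

str_detect(&amp;quot;45, 50&amp;quot;, &amp;quot;5&amp;quot;)        #TRUE  - matches the 5 inside 45 and 50
str_detect(&amp;quot;45, 50&amp;quot;, &amp;quot;\\b5\\b&amp;quot;)  #FALSE - no standalone 5
str_detect(&amp;quot;5, 45&amp;quot;, &amp;quot;\\b5\\b&amp;quot;)   #TRUE  - the leading 5 stands alone&lt;/code&gt;&lt;/pre&gt;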
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;wedding_dates_w_annv %&amp;gt;% 
  #Constructing Logicals for Venn Diagrams
  transmute(
    `5th \n Anniversary` = str_detect(big_sat_years, &amp;#39;\\b5\\b&amp;#39;),
    `45th \n Anniversary` = str_detect(big_sat_years, &amp;#39;45&amp;#39;),
    `50th \n Anniversary` = str_detect(big_sat_years, &amp;#39;50&amp;#39;)
  ) %&amp;gt;%
  #Plot the Venn-Diagram
  euler() %&amp;gt;% 
  plot(quantities = list(type = c(&amp;#39;counts&amp;#39;, &amp;#39;percent&amp;#39;)),
       percentages = TRUE,
       main = &amp;quot;Which &amp;#39;big&amp;#39; anniversaries get celebrated on Saturdays?&amp;quot;,
       )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/post/2020-10-01-what-s-the-best-day-to-get-married/index_files/figure-html/venn_diagram_big_anniversaries-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;So 75% of wedding dates will celebrate their 45th anniversary on a Saturday. 50% will celebrate ONLY their 45th anniversary and 25% will celebrate their 45th and 50th anniversaries on a Saturday. The last 25% will celebrate their 5th and 50th anniversary on a Saturday. No one will ONLY celebrate either their 5th or 50th. Fitting this into our three group paradigm from the prior section:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;6 Total / 1 Big (25%) - Will &lt;strong&gt;ONLY&lt;/strong&gt; celebrate their 45th Anniversary&lt;/li&gt;
&lt;li&gt;7 Total / 1 Big (25%) - Will &lt;strong&gt;ONLY&lt;/strong&gt; celebrate their 45th Anniversary&lt;/li&gt;
&lt;li&gt;7 Total / 2 Big (50%)
&lt;ul&gt;
&lt;li&gt;25% will celebrate their 5th and 50th&lt;/li&gt;
&lt;li&gt;25% will celebrate their 45th and 50th&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;is-there-a-time-component-to-which-group-you-end-up-in&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Is there a time component to which group you end up in?&lt;/h3&gt;
&lt;p&gt;This final section looks at the time component to whether you wind up in the 6/1, 7/1, or 7/2 group. In order to summarize to a Year/Month level, the average number of Saturdays and “Big” Saturdays will be used. Then in the following heat-map, the year of the wedding date appears on the y-axis and the month of the wedding is on the x-axis.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;wedding_dates_w_annv %&amp;gt;% 
  #Extract the month and year components of each wedding date
  mutate(
    m = month(wedding_date),
    y = year(wedding_date),
  ) %&amp;gt;% 
  group_by(m, y) %&amp;gt;% 
  #Get Averages
  summarize(across(starts_with(&amp;#39;num&amp;#39;), mean), .groups = &amp;#39;drop&amp;#39;) %&amp;gt;%
  mutate(grp = glue(&amp;quot;{num_sat} Total / {num_big_sat} Big&amp;quot;)) %&amp;gt;% 
  ggplot(aes(x = factor(y), y = factor(m), fill = grp)) + 
  geom_tile() + 
  scale_fill_viridis_d(option = &amp;quot;D&amp;quot;) + 
  labs(x = &amp;quot;Year of Wedding Date&amp;quot;,
       y = &amp;quot;Month of Wedding Date&amp;quot;,
       title = &amp;quot;Looking at # Saturdays / ***&amp;#39;BIG&amp;#39;*** Saturdays&amp;quot;,
       fill = &amp;quot;&amp;quot;) + 
  cowplot::theme_cowplot() + 
  theme(plot.title = ggtext::element_markdown()) +
  coord_flip()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/post/2020-10-01-what-s-the-best-day-to-get-married/index_files/figure-html/time_dimension-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;There appears to be a reproducible pattern to which of the three groups you’ll wind up in based on the initial wedding date. Probably not surprisingly, this occurs in a four-year cycle.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;6 Total / 1 Big - Starts in March after a leap year and continues for the next 12 months.
&lt;ul&gt;
&lt;li&gt;Examples: Mar 2012-Feb 2013, Mar 2016-Feb 2017, Mar 2020-Feb 2021&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;7 Total / 1 Big - The following &lt;strong&gt;12&lt;/strong&gt; months after the first group
&lt;ul&gt;
&lt;li&gt;Examples: Mar 2013-Feb 2014, Mar 2017-Feb 2018, Mar 2021-Feb 2022&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;7 Total / 2 Big - The following &lt;strong&gt;24&lt;/strong&gt; months after the second group
&lt;ul&gt;
&lt;li&gt;Examples: Mar 2014-Feb 2016, Mar 2018-Feb 2020, Mar 2022-Feb 2024&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;conclusion&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Conclusion&lt;/h1&gt;
&lt;p&gt;Weddings (or the choice not to have one) are personal decisions for which there is no right or wrong. &lt;strong&gt;&lt;em&gt;HOWEVER&lt;/em&gt;&lt;/strong&gt;, if you choose to have your wedding on a Saturday and want to maximize both the number of anniversaries and the number of “big” anniversaries you celebrate on a Saturday, then you’d do well to avoid the 24 months after a leap day.&lt;/p&gt;
&lt;p&gt;But the differences between the three groups identified here are pretty small. So while the original question was what are the best and worst days to get married, the honest answer is that it really doesn’t matter!&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>COVID-19s Impact on the NYC Subway System</title>
      <link>https://jlaw.netlify.app/2020/09/07/covid-19s-impact-on-the-nyc-subway-system/</link>
      <pubDate>Mon, 07 Sep 2020 00:00:00 +0000</pubDate>
      <guid>https://jlaw.netlify.app/2020/09/07/covid-19s-impact-on-the-nyc-subway-system/</guid>
      <description>
&lt;script src=&#34;https://jlaw.netlify.app/rmarkdown-libs/htmlwidgets/htmlwidgets.js&#34;&gt;&lt;/script&gt;
&lt;script src=&#34;https://jlaw.netlify.app/rmarkdown-libs/pymjs/pym.v1.js&#34;&gt;&lt;/script&gt;
&lt;script src=&#34;https://jlaw.netlify.app/rmarkdown-libs/widgetframe-binding/widgetframe.js&#34;&gt;&lt;/script&gt;


&lt;p&gt;At 8pm on March 22nd, 2020, the &lt;a href=&#34;https://www.governor.ny.gov/news/governor-cuomo-signs-new-york-state-pause-executive-order&#34;&gt;“New York State on PAUSE”&lt;/a&gt; executive order became effective and New York City went on lockdown until June 8th, when the Phase 1 reopening began. During this time usage of the public transit systems had a sudden drop as all non-essential services needed to close. In this analysis, I look at &lt;a href=&#34;http://web.mta.info/developers/fare.html&#34;&gt;MTA Subway Fare&lt;/a&gt; data to understand the effect of the PAUSE order on New York City Subway Ridership.&lt;/p&gt;
&lt;p&gt;The goals here are to:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;See the overall effect of the PAUSE order on ridership&lt;/li&gt;
&lt;li&gt;See whether usage declines around the city differ by type of Metrocard (Full Fare, Unlimited, etc.)&lt;/li&gt;
&lt;li&gt;Create an interactive map to understand the regional differences in usage declines&lt;/li&gt;
&lt;/ol&gt;
&lt;div id=&#34;packages-used&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Packages Used&lt;/h2&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidyverse) #For Data Manipulation and Plotting
library(janitor) #For cleaning up the variable names in the CSV Files
library(lubridate) #For date processing 
library(patchwork) # For combining multiple ggplots together
library(ggmap) # For producing a static map
library(ggtext) # For adding some flair to ggplot
library(leaflet) # For Making Interactive Plots
library(rvest) # For Web Scraping Links to Download&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;gathering-the-data&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Gathering the Data&lt;/h2&gt;
&lt;p&gt;The Metropolitan Transit Authority (MTA), which runs the New York City Subway system, publishes the &lt;a href=&#34;http://web.mta.info/developers/fare.html&#34;&gt;number of Metrocard swipes that occur in the system on a weekly basis&lt;/a&gt; by Fare type (Full-Fare, 30-day Unlimited, Student Discount, Senior Discount, etc).&lt;/p&gt;
&lt;p&gt;Fortunately, since each weekly file exists as a &lt;code&gt;.csv&lt;/code&gt; with a roughly similar format it can be easily scraped using the &lt;code&gt;rvest&lt;/code&gt; package. For this initial scrape, I will be getting any file with a filename from 2019 or 2020. According to the MTA website, the data is uploaded on a two-week delay so a file titled &lt;code&gt;fares_200905.csv&lt;/code&gt; (9/5/20) will actually contain the data from two weeks earlier.&lt;/p&gt;
&lt;p&gt;The process for obtaining all of the data will be:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Use &lt;code&gt;rvest&lt;/code&gt; to extract the paths to all files in a vector by identifying all the anchor tags on the page (&lt;code&gt;html_nodes(&#34;a&#34;)&lt;/code&gt;) and then extracting the &lt;code&gt;href&lt;/code&gt; attribute (&lt;code&gt;html_attr(&#34;href&#34;)&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;purrr&lt;/code&gt;’s &lt;code&gt;keep&lt;/code&gt; and &lt;code&gt;stringr&lt;/code&gt;’s &lt;code&gt;str_detect&lt;/code&gt; to keep only the elements of the initial vector that match a certain pattern (have titles for 2019 or 2020)&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;purrr&lt;/code&gt;’s &lt;code&gt;map_dfr&lt;/code&gt; function to apply a function to each &lt;code&gt;.csv&lt;/code&gt; file where the function:
&lt;ul&gt;
&lt;li&gt;Reads the &lt;code&gt;.csv&lt;/code&gt; file from the MTA’s website (using &lt;code&gt;readr::read_csv&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Cleans the column names to a more R-friendly format (using &lt;code&gt;janitor::clean_names&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Removes any columns where all values are &lt;code&gt;NA&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Creates some meta-data around the actual time periods the data reflects&lt;/li&gt;
&lt;li&gt;Turns character formatted numbers into actual numbers (using &lt;code&gt;readr::parse_number&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Cast to a long-format (using &lt;code&gt;tidyr::pivot_longer&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;all_weeks &amp;lt;- read_html(&amp;quot;http://web.mta.info/developers/fare.html&amp;quot;) %&amp;gt;%
  html_nodes(&amp;quot;a&amp;quot;) %&amp;gt;% 
  html_attr(&amp;quot;href&amp;quot;) %&amp;gt;% 
  keep(str_detect(., &amp;#39;fares_(20|19)\\d{4}\\.csv&amp;#39;)) %&amp;gt;% 
  map_dfr(., function(x){
    return(
      read_csv(paste0(&amp;quot;http://web.mta.info/developers/&amp;quot;, x), skip = 2) %&amp;gt;% 
        clean_names %&amp;gt;%
        #Drop Dead Columns
        select_if(~!all(is.na(.x))) %&amp;gt;%
        mutate(
          key = str_extract(x, &amp;#39;\\d+&amp;#39;),
          
          #The data in the files covers seven-day periods beginning on the Saturday 
          #two weeks prior to the posting date and ending on the following Friday. 
          #Thus, as an example, the file labeled Saturday, January 15, 2011, has data 
          #covering the period from Saturday, January 1, 2011, through Friday, January 7. 
          #The file labeled January 22 has data covering the period from 
          #Saturday, January 8, through Friday, January 14. And so on and so forth
          week_start = ymd(paste0(&amp;#39;20&amp;#39;,key)) - days(14),
          week_end = ymd(paste0(&amp;#39;20&amp;#39;,key)) - days(8)
        ) %&amp;gt;%
        mutate(across(c(-remote, -station, -week_start, -week_end, -key), parse_number)) %&amp;gt;% 
        pivot_longer(
          cols = c(-remote, -station, -week_start, -week_end, -key),
          names_to = &amp;quot;fare_type&amp;quot;,
          values_to = &amp;quot;fares&amp;quot;
        )
    )
  }
) &lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;time-series-of-subway-usage-by-week&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Time-Series of Subway Usage by Week&lt;/h2&gt;
&lt;p&gt;A first step toward understanding the effect of COVID on the NYC Subway system is to look at a weekly time-series of total subway usage. In this chart and throughout the rest of this post, when looking at the size of the ridership decline I will be comparing points one month prior to the start of the PAUSE order (week of February 22nd) and one month after (week of April 18th).&lt;/p&gt;
&lt;p&gt;From a coding perspective, this step aggregates all the individual fare data by week and plots it using &lt;code&gt;ggplot2&lt;/code&gt;. The only non-vanilla ggplot portion is the use of &lt;code&gt;ggtext&lt;/code&gt;’s &lt;code&gt;geom_textbox&lt;/code&gt; to add some flair to the annotations.&lt;/p&gt;
&lt;p&gt;The red dots on the chart represent the comparison points used for the rest of this analysis, and the dashed black line marks March 22nd, when the PAUSE order went into effect.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;all_weeks %&amp;gt;% 
  group_by(key, week_start, week_end) %&amp;gt;% 
  summarize(fares = sum(fares, na.rm = T), .groups = &amp;#39;drop&amp;#39;) %&amp;gt;% 
  ggplot(aes(x = week_start, y = fares/1e6)) + 
    geom_line(color = &amp;#39;#0039A6&amp;#39;) + 
    geom_vline(xintercept = ymd(20200322), lty = 2) + 
    geom_point(data = tibble(
      week_start = c(ymd(20200222), ymd(20200418)),
      fares = c(30768135, 2548002)
    ), color = &amp;#39;red&amp;#39;, size =3
    ) +
    geom_textbox(
      x = ymd(20191001),
      y = 15,
      label = &amp;quot;A ***&amp;lt;span style = &amp;#39;color:red&amp;#39;&amp;gt;92% Decrease&amp;lt;/span&amp;gt;*** \n in Subway Ridership \n 1 month before \n vs. 1 month after \n PAUSE order&amp;quot;,
      fill = &amp;#39;cornsilk&amp;#39;,
      halign = 0.5,
    ) + 
    labs(x = &amp;quot;Week Beginning&amp;quot;, y = &amp;quot;# of MTA Subway Fares (millions)&amp;quot;,
         title = &amp;quot;&amp;lt;span style=&amp;#39;color:#0039A6&amp;#39;&amp;gt;MTA&amp;lt;/span&amp;gt; Ridership (Jan 2019 - Aug 2020)&amp;quot;,
         subtitle = &amp;quot;PAUSE Order Begins on 3/22/2020&amp;quot;) + 
    scale_y_continuous(labels = scales::comma) +
    cowplot::theme_cowplot() + 
    theme(
      plot.title = element_markdown()
    )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/post/2020-09-07-covid-19s-impact-on-the-nyc-subway-system/index_files/figure-html/overall_trends-1.png&#34; width=&#34;672&#34; /&gt;
From this chart it’s clear that COVID had a strong effect on subway ridership, with a 92% decline between the month prior and the month after. While ridership is beginning to trend upward again, the overall numbers are still drastically smaller than in the pre-COVID period.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;exploring-the-overall-distribution-of-fares&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Exploring the Overall Distribution of Fares&lt;/h2&gt;
&lt;p&gt;Riders use &lt;em&gt;Metrocards&lt;/em&gt; to gain access to the NYC Subway system, and there are a number of different types of Metrocards. Since ~94% of rides occur on the 7 most common card types, I’ll be focusing on those and bucketing the rest into an “other” group. The 7 most common are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Full Fare&lt;/strong&gt; - A person loads money on their Metrocard and pays per trip&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Annual Unlimited&lt;/strong&gt; - A person pays a fixed amount for a year of unlimited rides (typically offered through a person’s workplace)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;30 Day Unlimited&lt;/strong&gt; - A person pays a fixed amount for 30 days of unlimited rides&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;7 Day Unlimited&lt;/strong&gt; - A person pays a fixed amount for 7 days of unlimited rides&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Student&lt;/strong&gt; - Assigned by schools to students for a certain number of trips per day&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Senior Citizen&lt;/strong&gt; - A reduced-fare Metrocard used by those Age 65 and over or with a disability&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;EasyPayXpress&lt;/strong&gt; - A person sets up an account that automatically reloads the card when the balance gets low&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Some data cleaning is needed to make the data more human readable and to focus only on the weeks being compared rather than every week since 2019. This step keeps only the weeks we care about, casts each time period to a column, gives both the time periods and the fare types nicer names, and finally filters out some stations that are part of the MTA system but aren’t actually subway stations, such as the AirTrain at JFK Airport and the PATH trains between New York and New Jersey.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;combined &amp;lt;- all_weeks %&amp;gt;% 
  filter(week_start %in% c(ymd(20200222), ymd(20200418))) %&amp;gt;% 
  pivot_wider(
    id_cols = c(&amp;#39;remote&amp;#39;, &amp;#39;station&amp;#39;, &amp;#39;fare_type&amp;#39;),
    names_from = week_start,
    values_from = fares,
    values_fill = list(fares = 0)
  ) %&amp;gt;% 
  rename(apr=`2020-04-18`, feb=`2020-02-22`) %&amp;gt;% 
  mutate(
    fare_type = case_when(
      fare_type == &amp;#39;ff&amp;#39; ~ &amp;#39;Full Fare&amp;#39;,
      fare_type == &amp;#39;x30_d_unl&amp;#39; ~ &amp;#39;30-Day Unlimited&amp;#39;,
      fare_type == &amp;#39;x7_d_unl&amp;#39; ~ &amp;#39;7-Day Unlimited&amp;#39;,
      fare_type == &amp;#39;students&amp;#39; ~ &amp;#39;Student&amp;#39;,
      fare_type == &amp;#39;sen_dis&amp;#39; ~ &amp;#39;Senior Citizen/Disabled&amp;#39;,
      fare_type == &amp;#39;tcmc_annual_mc&amp;#39; ~ &amp;#39;Annual Metrocard&amp;#39;,
      fare_type == &amp;#39;mr_ezpay_exp&amp;#39; ~ &amp;#39;EasyPayXpress&amp;#39;,
      TRUE ~ fare_type
    )
  ) %&amp;gt;% 
  #Remove SBS Bus Stations and PATH
  filter(!str_detect(station, &amp;quot;SBS-|PA-|AIRTRAIN&amp;quot;))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;After cleaning, our data covers 443 different subway stations and 26 different fare_types.&lt;/p&gt;
&lt;p&gt;To recode the fare types outside of the Top 7, I first need to identify what those Top 7 fare types are. In the code below, I create a vector of the Top 7 fare types based on the February data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;top_7 &amp;lt;- combined %&amp;gt;% 
  count(fare_type, wt = feb, sort = T) %&amp;gt;% 
  head(7) %&amp;gt;% 
  pull(fare_type)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then the final step is to aggregate the data over the various stations. This step uses &lt;code&gt;fct_other&lt;/code&gt; from &lt;code&gt;forcats&lt;/code&gt; to keep only the top 7 fares and create an “Other Fares” label for everything else. Other &lt;code&gt;forcats&lt;/code&gt; functions are used as well: &lt;code&gt;fct_reorder&lt;/code&gt; orders the fare types from most common to least common, while &lt;code&gt;fct_relevel&lt;/code&gt; ensures that the “Other Fares” group comes last.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;agg_data &amp;lt;- combined %&amp;gt;% 
  pivot_longer(
    cols = c(&amp;#39;feb&amp;#39;, &amp;#39;apr&amp;#39;),
    names_to = &amp;quot;month&amp;quot;,
    values_to = &amp;#39;fares&amp;#39;
  ) %&amp;gt;% 
  # Collapse Non-Top 7 Fares to &amp;quot;Other&amp;quot; Group
  mutate(
    fare_type = fct_other(fare_type, keep = top_7, other_level = &amp;quot;Other Fares&amp;quot;)
  ) %&amp;gt;% 
  #Order with Month First So Summarize Will Return a Grouped DF by Month
  group_by(month, fare_type) %&amp;gt;% 
  summarize(fares = sum(fares)) %&amp;gt;% 
  #Create % Variable
  mutate(pct = fares / sum(fares),
         period = if_else(month == &amp;#39;feb&amp;#39;, &amp;#39;2/22 - 2/28&amp;#39;, &amp;#39;4/18 - 4/24&amp;#39;)
  ) %&amp;gt;% 
  ungroup() %&amp;gt;% 
  #Refactor Fare Type for Charts
  mutate(
    fare_type = fct_reorder(fare_type, fares, .fun = max) %&amp;gt;% fct_relevel(., &amp;quot;Other Fares&amp;quot;, after = 0L)
  ) &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The following plots leverage the &lt;code&gt;patchwork&lt;/code&gt; package to combine multiple ggplots together to show both the share of Fare Types Pre/Post COVID as well as the actual number of fares. This code is somewhat cumbersome and could probably be done more easily with facets, but I wanted to play with &lt;code&gt;plot_annotation&lt;/code&gt; and &lt;code&gt;plot_layout&lt;/code&gt; from &lt;code&gt;patchwork&lt;/code&gt; in order to add titles to the combined image rather than each plot individually. If you haven’t used &lt;code&gt;patchwork&lt;/code&gt; to combine multiple plots, I highly recommend it.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;(agg_data %&amp;gt;% 
  ggplot(aes(x = fare_type, 
             y = pct, 
             fill = fct_rev(period))) + 
    geom_col(position = &amp;#39;dodge&amp;#39;) + 
    geom_text(aes(label = pct %&amp;gt;% scales::percent(accuracy = .1)),
              position = position_dodge(width = .9),
              hjust = 0,
              size = 3) +
    labs(x = &amp;quot;Fare Type&amp;quot;, y = &amp;quot;Share of Fares&amp;quot;,
         title = &amp;quot;Share of Subway Rides&amp;quot;,
         fill = &amp;quot;Period&amp;quot;) + 
    guides(fill = guide_legend(reverse = T)) + 
    coord_flip(ylim = c(0, .6)) + 
    cowplot::theme_cowplot() + 
    theme(
      axis.text.x = element_blank(),
      axis.ticks.x = element_blank(),
      plot.title = element_text(size = 12)
    )
) + 
(agg_data %&amp;gt;% 
  ggplot(aes(x = fare_type, 
             y = fares, 
             fill = fct_rev(period))) + 
  geom_col(position = &amp;#39;dodge&amp;#39;) + 
  geom_text(aes(label = fares %&amp;gt;% scales::comma()),
            position = position_dodge(width = .9),
            hjust = 0,
            size = 3) +
  labs(x = &amp;quot;&amp;quot;, y = &amp;quot;Number of Fares&amp;quot;,
       title = &amp;quot;# of Subway Rides&amp;quot;,
       fill = &amp;quot;Period&amp;quot;) + 
  scale_fill_discrete(guide = F) +
  coord_flip(ylim = c(0, 15e6)) + 
  cowplot::theme_cowplot() + 
  theme(
    axis.text = element_blank(),
    axis.ticks = element_blank(),
    plot.title = element_text(size = 12)
  )
) + plot_annotation(
  title = &amp;#39;Changes in NYC Subway Ridership Pre/Post PAUSE&amp;#39;,
  caption = &amp;#39;NYC PAUSE Began March 22nd&amp;#39;
) + plot_layout(guides = &amp;quot;collect&amp;quot;) &amp;amp; theme(legend.position = &amp;quot;bottom&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/post/2020-09-07-covid-19s-impact-on-the-nyc-subway-system/index_files/figure-html/create_overall_fare_plot-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;The panel on the right (number of fares) makes it very clear that the number of subway rides has plummeted in the month following the PAUSE order, with Full Fare rides dropping from 10M to 1.2M. Even more interesting is that the specialty card types (Unlimited and Student) show very severe declines, with the 30-Day Unlimited dropping 96%, from 8M to 350k.&lt;/p&gt;
&lt;p&gt;In terms of share of swipes, the Full Fare Metrocard actually &lt;strong&gt;increases&lt;/strong&gt; from 36% to 50%. This is likely because students are learning virtually and those who are able to work from home are doing so. Additionally, if subway travel becomes more infrequent, it’s no longer cost-effective to use a 30-Day Unlimited card, so there is also an effect from people who WOULD have used specialty cards switching to Full Fare.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;does-the-decline-by-fare-type-depend-on-the-area-of-nyc&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Does the decline by Fare Type depend on the area of NYC?&lt;/h2&gt;
&lt;p&gt;From the first two charts it’s clear that there is an overall decline in subway ridership and that the decline occurs across all fare types. Another question is “do these declines change by area of the city?” To answer it, I’ll be using &lt;code&gt;ggmap&lt;/code&gt; to create maps of NYC subway stations by the various fare types.&lt;/p&gt;
&lt;p&gt;The first step is to create data at the station and fare type level and geocode the MTA station data (huge thanks to &lt;a href=&#34;https://github.com/chriswhong/nycturnstiles/&#34;&gt;Chris Whong&lt;/a&gt;, who did the work of mapping Lat/Longs to the station names). Since Chris’ work was from 2013, newer stations such as Hudson Yards and the 2nd Avenue Subway do not appear.&lt;/p&gt;
&lt;p&gt;To clean up the map, in cases where there were multiple geocodes for a single station, the maximum Lat and Long were used, and stations with fewer than 1,000 pre-COVID swipes of a given fare type were removed.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;station_level &amp;lt;- combined %&amp;gt;% 
  mutate(
    fare_type = fct_other(fare_type, keep = top_7, other_level = &amp;quot;Other Fares&amp;quot;)
  ) %&amp;gt;% 
  group_by(remote, station, fare_type) %&amp;gt;% 
  summarize(feb = sum(feb),
            apr = sum(apr)
  ) %&amp;gt;% 
  mutate(
    abs_change = apr-feb,
    rel_change = apr/feb - 1
  )

geocodes &amp;lt;- read_csv(&amp;#39;https://raw.githubusercontent.com/chriswhong/nycturnstiles/master/geocoded.csv&amp;#39;, 
                     col_names = c(&amp;#39;remote&amp;#39;, &amp;#39;zuh&amp;#39;, &amp;#39;station&amp;#39;, &amp;#39;line&amp;#39;, &amp;#39;system&amp;#39;, &amp;#39;lat&amp;#39;, &amp;#39;long&amp;#39;),
)

comb_geo &amp;lt;- station_level %&amp;gt;% 
  inner_join(geocodes %&amp;gt;% group_by(remote) %&amp;gt;% summarize(lat = max(lat), long = max(long)), by = &amp;quot;remote&amp;quot;) %&amp;gt;%
  filter(feb &amp;gt; 1000) %&amp;gt;% 
  ungroup()&lt;/code&gt;&lt;/pre&gt;
&lt;div id=&#34;creating-the-maps-with-ggmap&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Creating the Maps with ggmap&lt;/h3&gt;
&lt;p&gt;Since the overall trends suggest a large decline in ridership across the entire city, I wanted to create new breakpoints to distinguish areas with larger declines from areas with smaller ones. To do this I used the &lt;code&gt;classInt::classIntervals()&lt;/code&gt; function with the &lt;code&gt;fisher&lt;/code&gt; style to algorithmically find the breakpoints in the data. The &lt;code&gt;cut_format&lt;/code&gt; function will format the break labels as percentages rather than decimals.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;brks &amp;lt;- classInt::classIntervals(comb_geo$rel_change, n = 5, style = &amp;#39;fisher&amp;#39;)

comb_geo$grp_val = kimisc::cut_format(comb_geo$rel_change, 
                                      brks$brks, 
                                      include.lowest = T,  
                                      format_fun = scales::percent)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To create the static map using &lt;code&gt;ggmap&lt;/code&gt; I first need to create the base layer that the data will be plotted on. There are many ways to do this, but I chose to define a boundary box using Lats and Longs from &lt;a href=&#34;https://www1.nyc.gov/assets/planning/download/pdf/data-maps/open-data/nybb_metadata.pdf?ver=18c&#34;&gt;NYC.gov&lt;/a&gt;. The zoom option controls how many tiles should be used in the boundary box: the larger the number, the more tiles are used and the more zoomed in you are.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;nyc &amp;lt;-get_map(c(
  left = -74.1,
  right = -73.699215,
  top = 40.915568,
  bottom = 40.55
),  zoom = 11, source = &amp;#39;osm&amp;#39;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Since there are 7 different fare types to look at, I’m breaking the maps into two sets: the unlimited cards and everything else. The &lt;code&gt;element_markdown()&lt;/code&gt; in the &lt;code&gt;theme()&lt;/code&gt; block is from &lt;code&gt;ggtext&lt;/code&gt; and allows certain HTML tags to format text in ggplots.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggmap(nyc, 
      base_layer = ggplot(comb_geo %&amp;gt;%
                            filter(str_detect(fare_type, &amp;quot;Unlimited|Annual&amp;quot;)), 
                          aes(x = long, y = lat, color = grp_val))) +
  geom_point() + 
  labs(
    title = &amp;quot;NYC Ridership Decline by &amp;lt;b&amp;gt;&amp;lt;i style=&amp;#39;color:#0039A6&amp;#39;&amp;gt;Unlimited Fare Types&amp;lt;/i&amp;gt;&amp;lt;/b&amp;gt;&amp;quot;,
    color = &amp;quot;% Ridership Decline (Feb vs. Apr)&amp;quot;,
    x = &amp;quot;&amp;quot;, y = &amp;quot;&amp;quot;) +
  facet_wrap(~fare_type, nrow = 1) +
  guides(color=guide_legend(nrow=2,byrow=TRUE)) +
  theme(legend.position = &amp;#39;bottom&amp;#39;,
        axis.text = element_blank(),
        axis.ticks = element_blank(),
        plot.title = element_markdown())&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/post/2020-09-07-covid-19s-impact-on-the-nyc-subway-system/index_files/figure-html/unlimited_map-1.png&#34; width=&#34;672&#34; /&gt;
Based on the unlimited cards’ decline by subway station, it’s clear that there ARE regional differences in how much COVID has affected usage. The 30-Day Unlimited card shows the largest declines in Manhattan and the parts of Brooklyn and Queens nearest to Manhattan. Meanwhile, the outer parts of Brooklyn, the Bronx, and Spanish Harlem show lower levels of decline. This is consistent with areas of lower socioeconomic status still needing to take the subway due to a higher likelihood of jobs that cannot be done from home.&lt;/p&gt;
&lt;p&gt;On the whole, the different types of unlimited cards show similar patterns, although the 7-Day Unlimited has more areas outside the largest decline bucket.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggmap(nyc, 
      base_layer = ggplot(comb_geo %&amp;gt;%
                            filter(!str_detect(fare_type, &amp;quot;Unlimited|Annual|Other&amp;quot;)), 
                          aes(x = long, y = lat, 
                              color = grp_val))) +
  geom_point() + 
  labs(
    title = &amp;quot;NYC Ridership Decline by &amp;lt;b&amp;gt;&amp;lt;i style=&amp;#39;color:#0039A6&amp;#39;&amp;gt;Other Fare Types&amp;lt;/i&amp;gt;&amp;lt;/b&amp;gt;&amp;quot;,
    color = &amp;quot;% Ridership Decline (Feb vs. Apr)&amp;quot;,
    x = &amp;quot;&amp;quot;, y = &amp;quot;&amp;quot;) +
  facet_wrap(~fct_reorder(fare_type, -feb), nrow = 1) +
  guides(color=guide_legend(nrow=2,byrow=TRUE)) +
  theme(legend.position = &amp;#39;bottom&amp;#39;,
        axis.text = element_blank(),
        axis.ticks = element_blank(),
        plot.title = element_markdown()
        )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/post/2020-09-07-covid-19s-impact-on-the-nyc-subway-system/index_files/figure-html/other_map-1.png&#34; width=&#34;672&#34; /&gt;
The two largest contrasts in the non-unlimited groups are that Student cards are almost entirely in the largest decline bucket, which makes sense, as students were engaged in distance learning. Similarly, EasyPayXpress swipes are almost entirely in the largest decline bucket and almost entirely in Manhattan. This also makes sense, as this card type is potentially made up of commuters who don’t want to deal with constantly refilling a card but wouldn’t ride enough to justify an unlimited card. The closing of non-essential businesses and the rise of work-from-home is the likely cause.&lt;/p&gt;
&lt;p&gt;For the Full Fare cards, the only areas with the most severe declines are in “Core Manhattan,” while other areas show smaller declines, potentially because riders shifted from other fare types to Full Fare as their need to use the subway system lessened.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;creating-an-interactive-map-with-leaflet&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Creating an Interactive Map with Leaflet&lt;/h2&gt;
&lt;p&gt;While the above ggmaps are useful, it’s difficult to know exactly where the neighborhoods with the largest and smallest declines are. The above maps give a general idea, but an interactive map that allows the user to pan and zoom would yield greater insights. To create one I will use the &lt;code&gt;leaflet&lt;/code&gt; package, which serves as an interface to the JavaScript library of the same name.&lt;/p&gt;
&lt;p&gt;Since this map will only look at the overall declines, as opposed to the declines by fare type, I need to re-summarize the data and create new breaks based on the overall numbers. The &lt;em&gt;msg&lt;/em&gt; variable is created to provide a pop-up to &lt;code&gt;leaflet&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;map_prep &amp;lt;- comb_geo %&amp;gt;%
  group_by(remote, station, lat, long) %&amp;gt;% 
  summarize(feb = sum(feb),
            apr = sum(apr),
            .groups = &amp;#39;drop&amp;#39;
  ) %&amp;gt;% 
  mutate(rel_change = apr/feb - 1,    
         msg = paste(station, &amp;quot;has decreased&amp;quot;, scales::percent(rel_change, accuracy = .1),
                &amp;quot;pre-PAUSE to post-PAUSE from&amp;quot;, feb %&amp;gt;% scales::comma(), &amp;quot;to&amp;quot;,
                apr %&amp;gt;% scales::comma(), &amp;quot;fares.&amp;quot;)
  )

map_prep_breaks &amp;lt;- classInt::classIntervals(map_prep$rel_change, 
                                            n = 5, 
                                            style = &amp;#39;fisher&amp;#39;)

##Add in the Breaks
map_prep$grp_val = kimisc::cut_format(map_prep$rel_change, 
                                      map_prep_breaks$brks, 
                                      include.lowest = T,  
                                      format_fun = scales::percent
                                      )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;One thing I found difficult about &lt;code&gt;leaflet&lt;/code&gt; is that creating a color palette to go with my breaks required a function that maps the values to the colors. The &lt;code&gt;colorFactor&lt;/code&gt; function in leaflet associates a factor variable with a palette; in this case it takes the factor levels of &lt;em&gt;grp_val&lt;/em&gt; created above and maps them to colors from the “Set1” palette.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;factpal &amp;lt;- colorFactor(&amp;quot;Set1&amp;quot;, map_prep$grp_val)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Creating a basic map with &lt;code&gt;leaflet&lt;/code&gt; is fairly straightforward and the syntax is pretty user friendly. The main thing to know when interpreting the code is that the “~” character means the expression refers to a variable within the passed-in data (similar to how &lt;code&gt;aes()&lt;/code&gt; works in ggplot).&lt;/p&gt;
&lt;p&gt;This function call, while long, does the following:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Passes in my dataset &lt;code&gt;map_prep&lt;/code&gt; to the &lt;code&gt;leaflet()&lt;/code&gt; function&lt;/li&gt;
&lt;li&gt;Adds the background tiles from the CartoDB.Positron theme&lt;/li&gt;
&lt;li&gt;Adds circle markers for each observation in my data set using the lats/longs with a fixed radius of 250, no border (stroke), and using a fill color from our pre-defined palette with 100% opacity. The hover labels will be the station names and when clicked the &lt;em&gt;msg&lt;/em&gt; variable will be the pop-up.&lt;br /&gt;
&lt;/li&gt;
&lt;li&gt;Finally add a legend in the top-right corner with the pre-defined colors and breakpoints.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The use of the &lt;code&gt;widgetframe::frameWidget()&lt;/code&gt; was to get the map to load on the blog and was not necessary for use in RStudio.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(widgetframe)

ll_map &amp;lt;- leaflet(map_prep) %&amp;gt;%
  addProviderTiles(providers$CartoDB.Positron) %&amp;gt;% 
  addCircles(
    lng = ~long,
    lat = ~lat,
    radius = 250,
    #radius = 4,
    stroke = F,
    fill = T,
    color = ~factpal(grp_val),
    fillOpacity = 1,
    label = ~station,
    group = &amp;#39;stations&amp;#39;,
    popup = ~msg
    ) %&amp;gt;%
    addLegend(
      title = &amp;quot;% Change in Rides&amp;quot;,
      pal = factpal,
      values = ~grp_val,
      position = &amp;#39;topright&amp;#39;
    )

frameWidget(ll_map)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;div id=&#34;htmlwidget-1&#34; style=&#34;width:100%;height:480px;&#34; class=&#34;widgetframe html-widget&#34;&gt;&lt;/div&gt;
&lt;script type=&#34;application/json&#34; data-for=&#34;htmlwidget-1&#34;&gt;{&#34;x&#34;:{&#34;url&#34;:&#34;/post/2020-09-07-covid-19s-impact-on-the-nyc-subway-system/index_files/figure-html//widgets/widget_ll_map.html&#34;,&#34;options&#34;:{&#34;xdomain&#34;:&#34;*&#34;,&#34;allowfullscreen&#34;:false,&#34;lazyload&#34;:false}},&#34;evals&#34;:[],&#34;jsHooks&#34;:[]}&lt;/script&gt;
From this view, the regional differences in subway usage declines are very apparent. The ‘red’ circles representing the largest declines are clustered in “Core Manhattan,” which runs from Lower Manhattan up to around 59th Street. This is where the majority of the commuter swipes eliminated by PAUSE would have been. Then, as you move further from central Manhattan, the declines become less severe.&lt;/p&gt;
&lt;p&gt;The two callouts are the prevalence of purple dots in the Bronx and the orange “X” pattern in eastern Brooklyn (Brownsville, New Lots, East New York). According to &lt;a href=&#34;https://www1.nyc.gov/site/opportunity/poverty-in-nyc/data-tool.page&#34;&gt;New York City Government Poverty Measures&lt;/a&gt;, Bronx Community Districts 1-6 have the largest percent of population below the poverty line, followed by Brownsville and East New York. This matches the narrative that areas of lower socioeconomic status were less able to avoid the subway during the pandemic and therefore saw less severe declines in ridership than Lower and Midtown Manhattan.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;conclusions&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Conclusions&lt;/h2&gt;
&lt;p&gt;COVID-19 and the New York State PAUSE order have had a dramatic impact on ridership of the NYC Subway system. Overall ridership was down 92% between February and April as New York became the “COVID capital of the world.” The MTA’s detailed data on fare types at each station allows a more granular look into how the pandemic altered rider behavior: usage of Unlimited and Student cards decreased as people were more confined to their homes, and areas of lower socioeconomic status saw less severe changes in ridership compared to more affluent areas of the city.&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>What&#39;s the Difference Between Instagram and TikTok? Using Word Embeddings to Find Out</title>
      <link>https://jlaw.netlify.app/2020/08/02/what-s-the-difference-between-instagram-and-tiktok-using-word-embeddings-to-find-out/</link>
      <pubDate>Sun, 02 Aug 2020 00:00:00 +0000</pubDate>
      <guid>https://jlaw.netlify.app/2020/08/02/what-s-the-difference-between-instagram-and-tiktok-using-word-embeddings-to-find-out/</guid>
      <description>


&lt;div id=&#34;tldr&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;TL;DR&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;Instagram - Tiktok = Photos, Photographers and Selfies&lt;/li&gt;
&lt;li&gt;Tiktok - Instagram = Witchcraft and Teens&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;but read the whole post to find out why!&lt;/p&gt;
&lt;div id=&#34;purpose&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Purpose&lt;/h2&gt;
&lt;p&gt;The original intent of this post was to learn to train my own Word2Vec model; however, as is a running theme, my laptop is not great and training a neural network would never work. In looking for alternatives, I came across a 2017 post from &lt;a href=&#34;https://juliasilge.com/blog/word-vectors-take-two/&#34;&gt;Julia Silge&lt;/a&gt; which outlined how to create word embeddings using a combination of &lt;a href=&#34;https://en.wikipedia.org/wiki/Pointwise_mutual_information&#34;&gt;point-wise mutual information (PMI)&lt;/a&gt; and Singular Value Decomposition (SVD). This was based on a methodology from Chris Moody’s Stitch Fix post called &lt;a href=&#34;https://multithreaded.stitchfix.com/blog/2017/10/18/stop-using-word2vec/&#34;&gt;Stop Using word2vec&lt;/a&gt;. Ms. Silge’s methodology has since been updated as part of her book &lt;a href=&#34;https://smltar.com/embeddings.html#understand-word-embeddings-by-finding-them-yourself&#34;&gt;Supervised Machine Learning for Text Analysis in R&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Word embeddings are vector representations of words in a large number of dimensions that capture the context in which words are used. They have been used to show fancy examples of how you can do math with words. One of the best-known examples is &lt;code&gt;King - Man + Woman = Queen&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Since TikTok and Instagram are both popular social media apps, especially among teenagers, I figured it would be an interesting exercise to see if I could figure out &lt;code&gt;Tiktok - Instagram = ????&lt;/code&gt; and &lt;code&gt;Instagram - TikTok = ????&lt;/code&gt;.&lt;/p&gt;
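&lt;p&gt;As a quick illustration of that word math (a sketch that assumes a numeric matrix &lt;code&gt;word_vectors&lt;/code&gt; with one row per word, which is what the rest of this post works toward), the arithmetic plus a cosine-similarity lookup takes only a few lines of base R:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Hypothetical sketch; assumes word_vectors is a matrix with one row per word
target &amp;lt;- word_vectors[&amp;#39;king&amp;#39;, ] - word_vectors[&amp;#39;man&amp;#39;, ] + word_vectors[&amp;#39;woman&amp;#39;, ]

# Cosine similarity of the target vector against every row
sims &amp;lt;- (word_vectors %*% target) / 
  (sqrt(rowSums(word_vectors^2)) * sqrt(sum(target^2)))

# &amp;#39;queen&amp;#39; should show up near the top of the most-similar words
head(sort(sims[, 1], decreasing = TRUE))&lt;/code&gt;&lt;/pre&gt;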
&lt;/div&gt;
&lt;div id=&#34;getting-and-cleaning-the-data&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Getting and Cleaning the Data&lt;/h2&gt;
&lt;p&gt;In order to create these vector representations I need data. The example posts above use the Hacker News corpus, which is available on Google’s BigQuery. In quickly browsing that data it didn’t seem like there was enough to do something as targeted as Instagram vs. TikTok. So I decided to use Twitter data, both because I thought it would be a decent source of information and because it was a good excuse to try out the &lt;a href=&#34;https://github.com/ropensci/rtweet&#34;&gt;rtweet&lt;/a&gt; package.&lt;/p&gt;
&lt;p&gt;In addition to the &lt;code&gt;rtweet&lt;/code&gt; package, I’ll be using &lt;code&gt;tidyverse&lt;/code&gt; for data manipulations and plotting, &lt;code&gt;tidytext&lt;/code&gt; to create the word tokens, and &lt;code&gt;widyr&lt;/code&gt; in order to do the PMI, SVD, and Cosine Similarity calculations.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(rtweet) # To Extract Data from Twitter
library(tidyverse) # Data Manipulation and Plotting
library(tidytext) # To create the Word Tokens and Bigrams
library(widyr) #For doing PMI, SVD, and Similarity&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It turns out getting data from Twitter really couldn’t be easier with &lt;code&gt;rtweet&lt;/code&gt;. The &lt;code&gt;search_tweets()&lt;/code&gt; function is very straightforward and really all you need. In this case, I wanted to run two separate queries, one for “instagram” and one for “tiktok”, so I used &lt;code&gt;search_tweets2()&lt;/code&gt;, which allows you to pass a vector of queries rather than a single one. In the code below, the two queries are captured in the &lt;code&gt;q&lt;/code&gt; parameter (with additional filters to remove quote tweets, replies, and tweets with links). The &lt;code&gt;n&lt;/code&gt; parameter says that I want 50,000 tweets for each query. Additionally, I tell the query that I don’t want retweets, I want recent tweets (the package can only search the last 6-7 days), and I only want English-language tweets.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;tweets &amp;lt;- search_tweets2(
  q = c(&amp;quot;tiktok -filter:quote -filter:replies -filter:links&amp;quot;, 
        &amp;#39;instagram -filter:quote -filter:replies -filter:links&amp;#39;),
  n = 50000,
  include_rts = FALSE,
  retryonratelimit = TRUE,
  type = &amp;#39;recent&amp;#39;,
  lang = &amp;#39;en&amp;#39;
)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The query for this data was originally run on 7/21/2020 and returned 108,000 rows. Because the Twitter data contains many characters not typically considered words, the data was run through some cleaning steps, and duplicated tweets (ones that contained both “instagram” and “tiktok”) were deduped.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;cleaned_tweets &amp;lt;- tweets %&amp;gt;% 
  #Encode to Native
  transmute(status_id, text = plain_tweets(text)) %&amp;gt;% 
  #Remove Potential Dups
  distinct() %&amp;gt;% 
  #Clean Data
  mutate(
    text = str_remove_all(text, &amp;quot;^[[:space:]]*&amp;quot;), # Remove leading whitespaces
    text = str_remove_all(text, &amp;quot;[[:space:]]*$&amp;quot;), # Remove trailing whitespaces
    text = str_replace_all(text, &amp;quot; +&amp;quot;,&amp;quot; &amp;quot;), # Remove extra whitespaces
    text = str_replace_all(text, &amp;quot;&amp;#39;&amp;quot;, &amp;quot;%%&amp;quot;), # Replacing apostrophes with %%
    #text = iconv(text, &amp;quot;latin1&amp;quot;, &amp;quot;ASCII&amp;quot;, sub=&amp;quot;&amp;quot;), # Remove emojis/dodgy unicode
    text = str_remove_all(text, &amp;quot;&amp;lt;(.*)&amp;gt;&amp;quot;), # Remove pesky Unicodes like &amp;lt;U+A&amp;gt;
    text = str_replace_all(text, &amp;quot;\\ \\. &amp;quot;, &amp;quot; &amp;quot;), # Replace orphaned fullstops with space
    text = str_replace_all(text, &amp;quot;  &amp;quot;, &amp;quot; &amp;quot;), # Replace double space with single space
    text = str_replace_all(text, &amp;quot;%%&amp;quot;, &amp;quot;\&amp;#39;&amp;quot;), # Changing %% back to apostrophes
    text = str_remove_all(text, &amp;quot;https(.*)*$&amp;quot;), # remove tweet URL
    text = str_replace_all(text, &amp;quot;\\n&amp;quot;, &amp;quot; &amp;quot;), # replace line breaks with space
    text = str_replace_all(text, &amp;quot;&amp;amp;amp;&amp;quot;, &amp;quot;&amp;amp;&amp;quot;), # fix ampersand &amp;amp;,
    text = str_remove_all(text, &amp;#39;&amp;amp;lt;|&amp;amp;gt;&amp;#39;), 
    text = str_remove_all(text, &amp;#39;\\b\\d+\\b&amp;#39;) #Remove Numbers
  ) &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Finally, the data was tokenized to break apart the tweets into a tidy format of 1 row per word. For example, “The quick brown fox” will be broken into 4 rows, the first containing “the”, the second containing “quick” and so on. Besides tokenization, &lt;em&gt;stop words&lt;/em&gt; and infrequent words (&amp;lt;20 occurrences) were removed. Stop words are very common words like “the”, “it”, etc. that don’t add much meaning to the Tweets.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;tokens &amp;lt;- cleaned_tweets %&amp;gt;% 
  #Tokenize
  unnest_tokens(word, text) %&amp;gt;% 
  #Remove Stop Words
  anti_join(stop_words, by = &amp;quot;word&amp;quot;) %&amp;gt;% 
  #Remove All Words Occurring Less Than 20 Times
  add_count(word) %&amp;gt;%
  filter(n &amp;gt;= 20) %&amp;gt;%
  select(-n)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;creating-the-embeddings&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Creating the Embeddings&lt;/h2&gt;
&lt;p&gt;The way that word embeddings are able to capture the context of individual words is by looking at what words appear around the word of interest. Getting from Tokens to Embeddings is done in three steps:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Create the Sliding Window to capture words occurring together&lt;/li&gt;
&lt;li&gt;Calculate the point-wise mutual information to provide a measure of how likely two words are to appear together&lt;/li&gt;
&lt;li&gt;Use SVD to decompose the matrix of 4,586 words into some number of dimensions (in this case we’ll use 100).&lt;/li&gt;
&lt;/ol&gt;
&lt;div id=&#34;creating-the-sliding-windows&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Creating the Sliding Windows&lt;/h3&gt;
&lt;p&gt;Sliding windows over text are similar to a rolling average over numbers: at any point we’re looking at a subset of the data, and that subset moves over the entire data set.&lt;/p&gt;
&lt;p&gt;As a very simple example, the string “the quick brown fox jumps over the lazy dog” with a window size of four will generate six windows:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;window_id&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;words&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;1_1&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;the, quick, brown, fox&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;1_2&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;quick, brown, fox, jumps&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;1_3&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;brown, fox, jumps, over&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;1_4&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;fox, jumps, over, the&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;1_5&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;jumps, over, the, lazy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;1_6&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;over, the, lazy, dog&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The examples from the blog posts above use a combination of nesting and the &lt;code&gt;furrr&lt;/code&gt; package to create sliding windows in parallel. Since my laptop only has two cores, there isn’t much benefit to parallelizing this code. Fortunately, someone in the comments linked to this gist from &lt;a href=&#34;https://gist.github.com/JasonPunyon/3bca3bf606e7583c7ea2d8a00f86418e&#34;&gt;Jason Punyon&lt;/a&gt;, which works very quickly for me.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;slide_windows &amp;lt;- function(tbl, doc_var, window_size) {
  tbl %&amp;gt;%
    group_by({{doc_var}}) %&amp;gt;%
    mutate(WordId = row_number() - 1,
           RowCount = n()) %&amp;gt;%
    ungroup() %&amp;gt;%
    crossing(InWindowIndex = 0:(window_size-1)) %&amp;gt;%
    filter((WordId - InWindowIndex) &amp;gt;= 0, # starting position of a window must be after the beginning of the document
           (WordId - InWindowIndex + window_size - 1) &amp;lt; RowCount # ending position of a window must be before the end of the document
    ) %&amp;gt;%
    mutate(window_id = WordId - InWindowIndex + 1)
}&lt;/code&gt;&lt;/pre&gt;
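&lt;p&gt;As a quick sanity check (a sketch, assuming the input is already one row per word), this function reproduces the “quick brown fox” windows from the table above:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#A one-row-per-word tibble for a single toy &amp;quot;document&amp;quot;
toy &amp;lt;- tibble(
  status_id = 1,
  word = c(&amp;quot;the&amp;quot;, &amp;quot;quick&amp;quot;, &amp;quot;brown&amp;quot;, &amp;quot;fox&amp;quot;,
           &amp;quot;jumps&amp;quot;, &amp;quot;over&amp;quot;, &amp;quot;the&amp;quot;, &amp;quot;lazy&amp;quot;, &amp;quot;dog&amp;quot;)
)

#Nine words with a window size of four should yield six windows
toy %&amp;gt;%
  slide_windows(status_id, 4) %&amp;gt;%
  group_by(window_id) %&amp;gt;%
  summarise(words = paste(word, collapse = &amp;quot;, &amp;quot;))&lt;/code&gt;&lt;/pre&gt;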
&lt;p&gt;The one parameter that we need to choose when creating the windows is the window size. There is no right or wrong answer for a window size since it will depend on the question being asked. From Julia Silge’s post, “A smaller window size, like three or four, focuses on how the word is used and learns what other words are functionally similar. A larger window size, like ten, captures more information about the domain or topic of each word, not constrained by how functionally similar the words are (Levy and Goldberg 2014). A smaller window size is also faster to compute”. For this example, I’m choosing a size of &lt;strong&gt;eight&lt;/strong&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;point-wise-mutual-information&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Point-wise Mutual Information&lt;/h3&gt;
&lt;p&gt;Point-wise mutual information is an association measurement of how likely two words are to occur together, normalized by how likely each word is to be found on its own. The higher the PMI, the more likely the words are to be found close together rather than on their own.&lt;/p&gt;
&lt;p&gt;PMI(word1, word2) = log(P(word1, word2)/(P(word1)P(word2)))&lt;/p&gt;
&lt;p&gt;This can be calculated using the &lt;code&gt;pairwise_pmi()&lt;/code&gt; function from the &lt;code&gt;widyr&lt;/code&gt; package.&lt;/p&gt;
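&lt;p&gt;As a small sketch of what &lt;code&gt;pairwise_pmi()&lt;/code&gt; expects (toy data, not the real tweets): given a tidy table of words and the window each appears in, it returns a PMI value for every pair of words that share a window:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#Toy data: three windows with overlapping words
toy_windows &amp;lt;- tribble(
  ~window_id, ~word,
  1, &amp;quot;tiktok&amp;quot;,
  1, &amp;quot;dances&amp;quot;,
  2, &amp;quot;tiktok&amp;quot;,
  2, &amp;quot;dances&amp;quot;,
  3, &amp;quot;instagram&amp;quot;,
  3, &amp;quot;photos&amp;quot;
)

#Higher PMI = the pair co-occurs more than chance would suggest
toy_windows %&amp;gt;%
  pairwise_pmi(word, window_id)&lt;/code&gt;&lt;/pre&gt;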
&lt;/div&gt;
&lt;div id=&#34;singular-value-decomposition&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Singular Value Decomposition&lt;/h3&gt;
&lt;p&gt;This final step will turn our set of Word/Word PMI values into a 100-dimensional embedding for each word using Singular Value Decomposition, which is a technique for dimensionality reduction.&lt;/p&gt;
&lt;p&gt;This is calculated using the &lt;code&gt;widely_svd()&lt;/code&gt; function also from &lt;code&gt;widyr&lt;/code&gt;.&lt;/p&gt;
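&lt;p&gt;Under the hood this is ordinary SVD. As a minimal base-R sketch (with a random matrix standing in for the word-by-word PMI matrix), keeping only the first few singular vectors gives each row a lower-dimensional representation:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(42)
#Stand-in for a words x words PMI matrix (10 &amp;quot;words&amp;quot;)
pmi_matrix &amp;lt;- matrix(rnorm(100), nrow = 10)

#Keep the first 3 singular vectors; each row of u is now a
#3-dimensional &amp;quot;embedding&amp;quot; for that word
decomp &amp;lt;- svd(pmi_matrix, nu = 3, nv = 3)
dim(decomp$u)
## [1] 10  3&lt;/code&gt;&lt;/pre&gt;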
&lt;/div&gt;
&lt;div id=&#34;putting-it-all-together&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Putting it all together&lt;/h3&gt;
&lt;p&gt;Executing these three steps can be done in the following code:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;tidy_word_vectors &amp;lt;- tokens %&amp;gt;%  
  slide_windows(status_id, 8) %&amp;gt;% #Create Sliding Window of 8 Words (Step 1)
  unite(window_id, status_id, window_id) %&amp;gt;% #Create new ID for each window
  pairwise_pmi(word, window_id) %&amp;gt;%  #Calculate the PMI (Step 2)
  widely_svd(item1, item2, pmi, nv = 100, maxit = 1000) #Create 100 Dimension Embedding (Step 3)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The data at this point looks like:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 6 x 3
##   item1     dimension    value
##   &amp;lt;chr&amp;gt;         &amp;lt;int&amp;gt;    &amp;lt;dbl&amp;gt;
## 1 instagram         1  0.0273 
## 2 instagram         2 -0.0101 
## 3 instagram         3 -0.0729 
## 4 instagram         4  0.107  
## 5 instagram         5 -0.00237
## 6 instagram         6 -0.0844&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Where &lt;em&gt;item1&lt;/em&gt; represents the word, &lt;em&gt;dimension&lt;/em&gt; is each of our 100 dimensions for the word vector, and &lt;em&gt;value&lt;/em&gt; is the numeric value for that dimension for that word.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;the-fun-stuff&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;The Fun Stuff&lt;/h2&gt;
&lt;p&gt;Now that we have these embeddings, which again are 100-dimensional vectors representing each word, we can start doing analysis to hopefully find some interesting things.&lt;/p&gt;
&lt;div id=&#34;what-word-is-most-similar-to-instagram&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;What Word is Most Similar to Instagram?&lt;/h3&gt;
&lt;p&gt;To find the most similar words, we can use &lt;a href=&#34;https://en.wikipedia.org/wiki/Cosine_similarity&#34;&gt;cosine similarity&lt;/a&gt; to determine which vectors are most similar to our target words. Cosine similarity can be calculated using the &lt;code&gt;pairwise_similarity()&lt;/code&gt; function from the &lt;code&gt;widyr&lt;/code&gt; package.&lt;/p&gt;
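&lt;p&gt;For reference, the cosine similarity of two vectors is their dot product divided by the product of their magnitudes. A minimal sketch in base R:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;cosine_sim &amp;lt;- function(a, b){
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

#Vectors pointing the same direction score 1; orthogonal vectors score 0
cosine_sim(c(1, 2, 3), c(2, 4, 6))
## [1] 1
cosine_sim(c(1, 0), c(0, 1))
## [1] 0&lt;/code&gt;&lt;/pre&gt;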
&lt;p&gt;Let’s look at what’s most similar to “Instagram”:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#Get 10 Most Similar Words to Instagram
ig &amp;lt;- tidy_word_vectors %&amp;gt;% 
  pairwise_similarity(item1, dimension, value) %&amp;gt;%
  filter(item1 == &amp;#39;instagram&amp;#39;) %&amp;gt;% 
  arrange(desc(similarity)) %&amp;gt;% 
  head(10)

#Plot most similar words
ig %&amp;gt;%
  ggplot(aes(x = fct_reorder(item2, similarity), y = similarity, fill = item2)) + 
    geom_col() + 
    scale_fill_discrete(guide = F) +
    labs(x = &amp;quot;&amp;quot;, y = &amp;quot;Similarity Score&amp;quot;, 
         title = &amp;quot;Words Most Similar to &amp;lt;i style=&amp;#39;color:#833AB4&amp;#39;&amp;gt;Instagram&amp;lt;/i&amp;gt;&amp;quot;) + 
    coord_flip() + 
    hrbrthemes::theme_ipsum_rc(grid=&amp;quot;X&amp;quot;) + 
    theme(
      plot.title.position = &amp;quot;plot&amp;quot;,
      plot.title = ggtext::element_markdown()
    )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/post/20200802_ig_vs_tiktok/index_files/figure-html/insta_similar-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Looking at what words are most similar can provide a good gut check for whether things are working. Among the top words are “post(s)”, “dms”, “celebrities” which all seem to make sense in the context of Instagram. Admittedly, I got a chuckle about “Instagram hoes”, but that does have its own &lt;a href=&#34;https://www.urbandictionary.com/define.php?term=Instagram%20Hoe&#34;&gt;Urban Dictionary&lt;/a&gt; definition so I suppose it’s legit.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;what-word-is-most-similar-to-tiktok&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;What Word is Most Similar to TikTok?&lt;/h3&gt;
&lt;p&gt;We can do the same calculation with ‘tiktok’ as opposed to ‘instagram’:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;tt &amp;lt;- tidy_word_vectors %&amp;gt;% 
  pairwise_similarity(item1, dimension, value) %&amp;gt;%
  filter(item1 == &amp;#39;tiktok&amp;#39;) %&amp;gt;% 
  arrange(desc(similarity)) %&amp;gt;% 
  head(10)

tt %&amp;gt;%
  ggplot(aes(x = fct_reorder(item2, similarity), y = similarity, fill = item2)) + 
  geom_col() + 
  scale_fill_discrete(guide = F) +
  labs(x = &amp;quot;&amp;quot;, y = &amp;quot;Similarity Score&amp;quot;, 
       title = &amp;quot;Words Most Similar to &amp;lt;i style=&amp;#39;color:#69C9D0&amp;#39;&amp;gt;TikTok&amp;lt;/i&amp;gt;&amp;quot;) + 
  coord_flip() + 
  hrbrthemes::theme_ipsum_rc(grid=&amp;quot;X&amp;quot;) + 
  theme(
    plot.title.position = &amp;quot;plot&amp;quot;,
    plot.title = ggtext::element_markdown()
  )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/post/20200802_ig_vs_tiktok/index_files/figure-html/tiktok_similar-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Now admittedly, I’m less familiar with TikTok than I am with Instagram, but from what I do know (and what I can Google), these make a lot of sense. The word most similar to TikTok is “dances”, and I do know that TikTok is known for its viral dances. Some of the other terms I needed to look up, but they seem legit. For example, “Straight TikTok” is used to refer to more mainstream TikTok vs. “Alt TikTok”, and “fyp” is the “For You Page” (I don’t actually know what this is, but I know it’s something TikTok-y). So again, I feel pretty good about these results.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;what-is-the-difference-between-tiktok-and-instagram&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;What is the Difference between TikTok and Instagram?&lt;/h3&gt;
&lt;p&gt;As mentioned at the start, the goal of this post is to create &lt;code&gt;Instagram - Tiktok = ?&lt;/code&gt; and &lt;code&gt;Tiktok - Instagram = ?&lt;/code&gt;, similar to the &lt;code&gt;king - man + woman = queen&lt;/code&gt; example often referenced in Word2Vec (or other embedding) posts.&lt;/p&gt;
&lt;p&gt;Since both TikTok and Instagram are now represented by 100-dimensional numeric vectors doing the subtraction is as simple as doing a pairwise subtraction on each dimension. Since our data is in a tidy format it takes a little bit of data wrangling to pull that off, but ultimately we’re going to grab the Top 10 Closest words to (Instagram-TikTok) and (TikTok-Instagram) by treating these resulting vectors as fake “words” and adding them to the data set before calculating the cosine similarity.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;tt_ig_diff &amp;lt;- tidy_word_vectors %&amp;gt;% 
  #Calculate TikTok - Instagram
  filter(item1 %in% c(&amp;#39;tiktok&amp;#39;, &amp;#39;instagram&amp;#39;)) %&amp;gt;% 
  pivot_wider(names_from = &amp;quot;item1&amp;quot;, values_from = &amp;quot;value&amp;quot;) %&amp;gt;% 
  transmute(
    item1 = &amp;#39;tiktok_minus_ig&amp;#39;,
    dimension,
    value = tiktok - instagram
  ) %&amp;gt;% 
  bind_rows(
    #Calculate Instagram - TikTok
    tidy_word_vectors %&amp;gt;% 
    filter(item1 %in% c(&amp;#39;tiktok&amp;#39;, &amp;#39;instagram&amp;#39;)) %&amp;gt;% 
    pivot_wider(names_from = &amp;quot;item1&amp;quot;, values_from = &amp;quot;value&amp;quot;) %&amp;gt;% 
    transmute(
      item1 = &amp;#39;ig_minus_tiktok&amp;#39;,
      dimension,
      value = instagram - tiktok
    )
  ) %&amp;gt;% 
  #Add in the rest of the individual words
  bind_rows(tidy_word_vectors) %&amp;gt;% 
  #Calculate Cosine Similarity on All Words
  pairwise_similarity(item1, dimension, value) %&amp;gt;% 
  #Keep just the similarities to the two &amp;quot;fake words&amp;quot;
  filter(item1 %in% c(&amp;#39;tiktok_minus_ig&amp;#39;, &amp;#39;ig_minus_tiktok&amp;#39;)) %&amp;gt;% 
  #Grab top 10 most similar values for each &amp;quot;fake word&amp;quot;
  group_by(item1) %&amp;gt;% 
  top_n(10, wt = similarity) 

#Plotting the Top 10 Words by Similarity
tt_ig_diff %&amp;gt;%
  mutate(item1 = if_else(
    item1 == &amp;quot;ig_minus_tiktok&amp;quot;, &amp;quot;Instagram - TikTok = &amp;quot;, &amp;quot;Tiktok - Instagram = &amp;quot;
  )) %&amp;gt;% 
  ggplot(aes(x = reorder_within(item2, by = similarity, within = item1), 
             y = similarity, fill = item2)) + 
  geom_col() + 
  scale_fill_discrete(guide = F) +
  scale_x_reordered() + 
  labs(x = &amp;quot;&amp;quot;, y = &amp;quot;Similarity Score&amp;quot;, 
       title = &amp;quot;What&amp;#39;s the Difference between &amp;lt;i style=&amp;#39;color:#833AB4&amp;#39;&amp;gt;Instagram&amp;lt;/i&amp;gt; 
       and  &amp;lt;i style=&amp;#39;color:#69C9D0&amp;#39;&amp;gt;TikTok&amp;lt;/i&amp;gt;&amp;quot;) + 
  facet_wrap(~item1, scales = &amp;quot;free_y&amp;quot;) + 
  coord_flip() + 
  hrbrthemes::theme_ipsum_rc(grid=&amp;quot;X&amp;quot;) + 
  theme(
    plot.title.position = &amp;quot;plot&amp;quot;,
    plot.title = ggtext::element_markdown(),
    strip.text = ggtext::element_textbox(
      size = 12,
      color = &amp;quot;white&amp;quot;, fill = &amp;quot;#5D729D&amp;quot;, box.color = &amp;quot;#4A618C&amp;quot;,
      halign = 0.5, linetype = 1, r = unit(5, &amp;quot;pt&amp;quot;), width = unit(1, &amp;quot;npc&amp;quot;),
      padding = margin(2, 0, 1, 0), margin = margin(3, 3, 3, 3)
    )
  )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/post/20200802_ig_vs_tiktok/index_files/figure-html/diff_similarity-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;A couple of things jump out from the results. First, the vectors for TikTok and Instagram aren’t similar enough to each other to cancel out, so “TikTok” and “Instagram” themselves still show up as the most similar values. This is likely because of the data collection methodology of using TikTok and Instagram as search terms on Twitter. As a result, there is a bit of overlap between the “Most Similar Words to X” and the “Most Similar Words to X - Y”.&lt;/p&gt;
&lt;p&gt;However, once you get past the overlaps there are some interesting findings. For &lt;code&gt;Instagram - TikTok&lt;/code&gt; you get “Selfies, Photo(s), Photographer”, which makes a ton of sense since Instagram is primarily a photo app while TikTok is entirely a video app.&lt;/p&gt;
&lt;p&gt;For &lt;code&gt;Tiktok - Instagram&lt;/code&gt;, there still is a lot of overlap with just &lt;code&gt;TikTok&lt;/code&gt;, but among the new items there’s a bunch of witchcraft terms (coven, witchtok). According to Wired UK, &lt;a href=&#34;https://www.wired.co.uk/article/witchcraft-tiktok&#34;&gt;TikTok has become the home of modern witchcraft&lt;/a&gt;, so that seems to track. Also, “teens” surfaces as a difference between TikTok and Instagram, reflecting the app’s popularity among US teenagers.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;concluding-thoughts&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Concluding Thoughts&lt;/h2&gt;
&lt;p&gt;I wanted to get involved with Word Embeddings through Word2Vec but I don’t have the hardware for it. Luckily, resources on the internet provided a way to do this with tools that don’t require a neural network. By grabbing data from Twitter it was easy to create word embeddings and to try to understand the differences between TikTok and Instagram. In practice it would have been good to have more than 100,000 Tweets, and I wish there were a way to capture word context more in the wild than through specific search terms. But in the end, I’m happy with the results.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>A Racing Barplot of Top US Baby Names 1880-2018</title>
      <link>https://jlaw.netlify.app/2020/07/04/a-racing-barplot-of-top-us-baby-names-1880-2018/</link>
      <pubDate>Sat, 04 Jul 2020 00:00:00 +0000</pubDate>
      <guid>https://jlaw.netlify.app/2020/07/04/a-racing-barplot-of-top-us-baby-names-1880-2018/</guid>
      <description>


&lt;p&gt;A few months back Mrs. JLaw and I were discussing baby names (purely for academic purposes) and it got me thinking about how popular names have changed over time. It was of particular interest to me as someone who has a name that was somewhat popular for a while and has since fallen out of fashion.&lt;/p&gt;
&lt;p&gt;This also provided me an opportunity to try out one of those ‘racing barplots’ that have been popping up all over the place. Also, while I’ve used the &lt;a href=&#34;https://gganimate.com/articles/gganimate.html&#34;&gt;gganimate&lt;/a&gt; package a number of times, I constantly forget the syntax. And since this site is as much for me (probably more so) as for anyone else, this will be a good reference in case I try to do this again.&lt;/p&gt;
&lt;p&gt;On to the project….&lt;/p&gt;
&lt;p&gt;Fortunately, I know that baby name data is easily available at the &lt;a href=&#34;https://www.ssa.gov/oact/babynames/index.html&#34;&gt;Social Security Administration&lt;/a&gt; website. And while I don’t remember how I found the flat files, all years are available as a &lt;a href=&#34;https://www.ssa.gov/oact/babynames/names.zip&#34;&gt;ZIP file&lt;/a&gt; containing 139 .txt files with popular boys’ and girls’ names for each year. However, I don’t really want to deal with downloading and unzipping files, so I’m going to try to query the SSA site directly.&lt;/p&gt;
&lt;div id=&#34;loading-some-libraries&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Loading Some Libraries&lt;/h2&gt;
&lt;p&gt;To do this project, I’ll use:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;httr&lt;/code&gt; - To construct the POST command to get the SSA to return a webpage with the data I want&lt;/li&gt;
&lt;li&gt;&lt;code&gt;rvest&lt;/code&gt; - To scrape the table of popular name data from the content returned from the &lt;code&gt;httr&lt;/code&gt; request&lt;/li&gt;
&lt;li&gt;&lt;code&gt;tidyverse&lt;/code&gt; meta-package - for combining the data from each request (purrr), data manipulation (dplyr), and visualization (ggplot2)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;gganimate&lt;/code&gt; - to animate the ggplot2 plots and make them look super cool&lt;/li&gt;
&lt;li&gt;&lt;code&gt;scales&lt;/code&gt; - To make the count of baby names in the chart appear prettier (comma-formatted)&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidyverse)
library(gganimate)
library(scales)
library(httr)
library(rvest)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;reading-the-data&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Reading the Data&lt;/h2&gt;
&lt;p&gt;As mentioned before, the data is available as a series of .txt files from the &lt;a href=&#34;https://www.ssa.gov/oact/babynames/names.zip&#34;&gt;SSA&lt;/a&gt;. When I originally did this, I downloaded and extracted the ZIP file, but as I’m redoing this for the post, I’d rather have a solution that is entirely self-contained so I’m going to try to use &lt;code&gt;httr&lt;/code&gt; to actually query the SSA data.&lt;/p&gt;
&lt;p&gt;So how to actually get the data from the website?&lt;/p&gt;
&lt;p&gt;From the &lt;a href=&#34;https://www.ssa.gov/oact/babynames/index.html&#34;&gt;Baby Names By Birth Year&lt;/a&gt; section, I can input the birth year, how many names I want, and whether I want counts or percentages.&lt;/p&gt;
&lt;center&gt;
&lt;img src=&#34;https://jlaw.netlify.app/post/2020-07-04-a-racing-barplot-of-top-us-baby-names-1880-2018.en_files/webscrape1.PNG&#34; /&gt;
&lt;/center&gt;
&lt;p&gt;When I click go, I wind up at &lt;code&gt;https://www.ssa.gov/cgi-bin/popularnames.cgi&lt;/code&gt; with my desired results in a table. Using Google Chrome’s Network Inspector I can see that I sent a &lt;code&gt;POST&lt;/code&gt; request with three parameters (year, top, and number):&lt;/p&gt;
&lt;center&gt;
&lt;img src=&#34;https://jlaw.netlify.app/post/2020-07-04-a-racing-barplot-of-top-us-baby-names-1880-2018.en_files/webscrape2.PNG&#34; /&gt;
&lt;/center&gt;
&lt;p&gt;Now that I know what I need to send, I can write a function to request each year and stack the responses on top of each other using purrr’s &lt;code&gt;map_dfr&lt;/code&gt;. For the inputs, I know that I want all available years (which are 1880 through 2018), I only need the top 10 (so top = 10), and I want counts rather than percentages (number = “n”).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;babynames &amp;lt;- map_dfr(
  1880:2018, #Inputs to My Function
  #Define Function to Apply for Each Year
  function(year){
    #Construct POST Command
    POST(
      #Where to Send the Request
      url = &amp;quot;https://www.ssa.gov/cgi-bin/popularnames.cgi&amp;quot;,
      #What to Send the Requests (my three parameters)
      body = paste0(&amp;quot;year=&amp;quot;,year,&amp;quot;&amp;amp;top=10&amp;amp;number=n&amp;quot;)
    ) %&amp;gt;%
    #Extract the Content from the Request Response
    content(&amp;quot;parsed&amp;quot;) %&amp;gt;% 
    #Extract All The Tables
    html_nodes(&amp;#39;table&amp;#39;) %&amp;gt;%
    #Only Keep the 3rd Table (done through some guess and check)
    .[[3]] %&amp;gt;% 
    #Store the Table Data as a data.frame
    html_table() %&amp;gt;%
    #Add a column to the data frame for year
    mutate(
      year = year
    )
  }
)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;My expectation for this data is that there would be 139 distinct values for year and 1390 rows in the data. And in fact there are 139 distinct years (😍) and 1529 rows (😡).&lt;/p&gt;
&lt;p&gt;So what’s going on… Let’s look at the year 1880.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;Rank&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;Male name&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;Number of males&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;Female name&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;Number of females&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;year&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;John&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;9,655&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Mary&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;7,065&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1880&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;2&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;William&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;9,532&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Anna&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2,604&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1880&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;3&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;James&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;5,927&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Emma&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2,003&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1880&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;4&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Charles&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;5,348&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Elizabeth&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;1,939&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1880&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;5&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;George&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;5,126&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Minnie&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;1,746&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1880&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;6&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Frank&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;3,242&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Margaret&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;1,578&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1880&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;7&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Joseph&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2,632&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Ida&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;1,472&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1880&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;8&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Thomas&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2,534&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Alice&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;1,414&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1880&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;9&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Henry&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2,444&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Bertha&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;1,320&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1880&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;10&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Robert&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;2,415&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Sarah&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;1,288&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1880&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;Note: Rank 1 is the most popular, rank 2 is the next most popular, and so forth. All names are from Social Security card applications for births that occurred in the United States.&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;
&lt;div id=&#34;cleaning-the-data&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Cleaning the Data&lt;/h2&gt;
&lt;p&gt;We expected 10 rows but got 11 because of a footnote at the bottom of the table. I could go back and fix the data-pulling step to explicitly grab only the top 10 rows, but there are a bunch of other cleaning steps to do, so I may as well do everything at once. In this step I’m going to:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Remove that pesky footer row&lt;/li&gt;
&lt;li&gt;Turn the Table from Wide Format to Long Format (so genders are on top of each other)&lt;/li&gt;
&lt;li&gt;Convert the Counts to Numeric&lt;/li&gt;
&lt;/ol&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;babynames_clean &amp;lt;- babynames %&amp;gt;% 
  #Remove the Note row by filtering out rows where the Rank column contains the string &amp;quot;Note&amp;quot;
  filter(!str_detect(Rank, &amp;quot;Note&amp;quot;)) %&amp;gt;%
  #Turn Data from Wide Format to Long Format 
  pivot_longer(
    cols = c(&amp;quot;Male name&amp;quot;, &amp;quot;Female name&amp;quot;, &amp;quot;Number of males&amp;quot;, &amp;quot;Number of females&amp;quot;),
    names_to = &amp;quot;variable&amp;quot;,
    values_to = &amp;quot;value&amp;quot;
  ) %&amp;gt;% 
  #Construct a way to split the Names and Counts
  mutate(
    gender = if_else(str_detect(str_to_lower(variable), &amp;#39;female&amp;#39;), &amp;#39;F&amp;#39;, &amp;#39;M&amp;#39;),
    new_variable = if_else(str_detect(variable, &amp;quot;name&amp;quot;), &amp;quot;name&amp;quot;, &amp;quot;count&amp;quot;)
  ) %&amp;gt;% 
  #Pivot Wider to Have Names and Counts in Separate Columns
  pivot_wider(
    id_cols = c(&amp;#39;Rank&amp;#39;, &amp;#39;year&amp;#39;, &amp;#39;gender&amp;#39;), 
    names_from = &amp;quot;new_variable&amp;quot;,
    values_from = &amp;quot;value&amp;quot;
  ) %&amp;gt;% 
  #Convert Count to Numeric
  mutate(
    count = parse_number(count),
    Rank = parse_number(Rank)
  )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now let’s look at our cleaned data for year 1880:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;right&#34;&gt;Rank&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;year&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;gender&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;name&lt;/th&gt;
&lt;th align=&#34;right&#34;&gt;count&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1880&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;M&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;John&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;9655&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;right&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1880&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;F&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Mary&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;7065&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;right&#34;&gt;2&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1880&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;M&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;William&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;9532&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;right&#34;&gt;2&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1880&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;F&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Anna&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2604&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;right&#34;&gt;3&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1880&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;M&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;James&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;5927&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;right&#34;&gt;3&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1880&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;F&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Emma&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;2003&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;right&#34;&gt;4&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1880&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;M&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Charles&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;5348&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;right&#34;&gt;4&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1880&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;F&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Elizabeth&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1939&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;right&#34;&gt;5&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1880&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;M&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;George&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;5126&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;right&#34;&gt;5&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1880&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;F&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Minnie&lt;/td&gt;
&lt;td align=&#34;right&#34;&gt;1746&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Beautiful!!!&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;making-the-barplot&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Making The Barplot&lt;/h2&gt;
&lt;p&gt;Now that we’ve gotten and cleaned the data, the real fun can begin.&lt;/p&gt;
&lt;p&gt;My personal strategy for building animated ggplots is to first build the static version of the plot (in this case, filtering to one year). Once that looks good, I add in the gganimate magic like &lt;code&gt;transition_time()&lt;/code&gt; and &lt;code&gt;ease_aes()&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;While you can generate an animated plot by running the code interactively, I find it easiest to save the plot object and then render it with the &lt;code&gt;animate()&lt;/code&gt; function. This gives more control over how the animation plays, such as its duration and frames per second.&lt;/p&gt;
&lt;p&gt;Because my laptop isn’t particularly great, nailing down the aesthetics so the animation looks good (not too fast, not too slow) is the most time-consuming part.&lt;/p&gt;
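&lt;p&gt;As a minimal sketch of that workflow (assuming the &lt;code&gt;babynames_clean&lt;/code&gt; data from above is loaded, and leaving out all the labeling and theming), build the static plot first, then bolt on the animation layers and render with &lt;code&gt;animate()&lt;/code&gt; to control the pacing:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(ggplot2)
library(gganimate)

#Step 1: a static racing-bar skeleton (each year&amp;#39;s data becomes one frame)
p &amp;lt;- ggplot(babynames_clean, aes(x = Rank, y = count, group = name)) +
  geom_col() +
  scale_x_reverse() +
  coord_flip()

#Step 2: layer on the gganimate pieces
anim &amp;lt;- p +
  transition_time(year) +
  ease_aes(&amp;#39;cubic-in-out&amp;#39;)

#Step 3: render with explicit pacing controls
animate(anim, fps = 10, duration = 30, width = 1000, height = 600)&lt;/code&gt;&lt;/pre&gt;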
&lt;div id=&#34;creating-a-generic-function&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Creating a generic function&lt;/h3&gt;
&lt;p&gt;Since I’m creating two charts, for baby boys and baby girls, that will be identical except for some labeling, I’m going to write a function that builds the animated chart and then call it once for each gender in the sections below.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;#Input a gender code (&amp;quot;M&amp;quot; or &amp;quot;F&amp;quot;) and render the animated chart
gen_graph &amp;lt;- function(cond){
  
  #Use stereotypical gender colors for the two graphs
  if(cond == &amp;quot;F&amp;quot;){
    lbl = &amp;quot;Girl&amp;quot;
    col = &amp;quot;#FFC0CB&amp;quot;
  }else{
    lbl = &amp;quot;Boy&amp;quot;
    col = &amp;quot;#89cff0&amp;quot;
  }
  
  #Construct Animated Object
  animated &amp;lt;- babynames_clean %&amp;gt;% 
    #Filter to specific gender
    filter(gender == cond) %&amp;gt;%
    # Construct Basic GGPLOT Plot
    ggplot(aes(x = Rank, y = count/2, group = name)) + 
    geom_col(fill = col) + 
    geom_text(aes(label = count %&amp;gt;% comma(accuracy = 1)), hjust = 0, size = 10) + 
    geom_text(aes(label = name), y = 0, vjust = .2, hjust = 1, size = 10) +
    labs(x = paste0(lbl,&amp;quot;&amp;#39;s Name&amp;quot;), y = &amp;quot;# of Babies&amp;quot;,
         title = paste0(&amp;quot;Top 10 &amp;quot;, lbl, &amp;quot;&amp;#39;s Baby Names (1880-2018)&amp;quot;),
         #{frame_time} is a gganimate param that updates based on the time value
         #Its used to dynamically update the subtitle
         subtitle = &amp;#39;{round(frame_time,0)}&amp;#39;,
         caption = &amp;#39;Source: Social Security Administration&amp;#39;) + 
    scale_x_reverse() + 
    coord_flip(clip = &amp;#39;off&amp;#39;) + 
    theme_minimal() +
    theme(axis.line=element_blank(),
          axis.text.x=element_blank(),
          axis.text.y=element_blank(),
          axis.ticks=element_blank(),
          axis.title.x=element_blank(),
          axis.title.y=element_blank(),
          legend.position=&amp;quot;none&amp;quot;,
          panel.background=element_blank(),
          panel.border=element_blank(),
          panel.grid.major=element_blank(),
          panel.grid.minor=element_blank(),
          panel.grid.major.x = element_line(size=.4, 
                                            color=&amp;quot;grey&amp;quot; ),
          panel.grid.minor.x = element_line(size=.1, 
                                            color=&amp;quot;grey&amp;quot; ),
          plot.title.position = &amp;quot;plot&amp;quot;,
          plot.title=element_text(size=20, 
                                  face=&amp;quot;bold&amp;quot;, 
                                  colour=&amp;quot;#313632&amp;quot;),
          plot.subtitle=element_text(size=50, 
                                     color=&amp;quot;#a3a5a8&amp;quot;),
          plot.caption =element_text(size=15, 
                                     color=&amp;quot;#313632&amp;quot;),
          plot.background=element_blank(),
          plot.margin = margin(1, 9, 1, 9, &amp;quot;cm&amp;quot;)) + 
    #Add in GGANIMATE Magic
    transition_time(year) + 
    ease_aes(&amp;#39;cubic-in-out&amp;#39;) +
    view_follow(fixed_x = T)

  animate(animated, fps = 10, duration = 30, width = 1000, height = 600, 
          end_pause = 20, start_pause = 20)
    
}&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;most-popular-boys-names&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Most Popular Boy’s Names&lt;/h3&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;gen_graph(&amp;quot;M&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/post/2020-07-04-a-racing-barplot-of-top-us-baby-names-1880-2018.en_files/figure-html/boys_plot-1.gif&#34; /&gt;&lt;!-- --&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;most-popular-boys-names-1&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Most Popular Girl’s Names&lt;/h3&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;gen_graph(&amp;quot;F&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://jlaw.netlify.app/post/2020-07-04-a-racing-barplot-of-top-us-baby-names-1880-2018.en_files/figure-html/girls_plot-1.gif&#34; /&gt;&lt;!-- --&gt;&lt;/p&gt;
&lt;p&gt;Thanks for reading my first blog post! In the future, I’ll work to get the sizing of the output charts to work better but for now… good &amp;gt; perfect.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
</description>
    </item>
    
  </channel>
</rss>
