When Will NYC's Subway Ridership Recover?

While writing my posts about COVID’s effect on NYC Subway ridership the New York Times published an article called The Pandemic Wasn’t Supposed to Hurt New York Transit This Much. The article states:

I believe the 80% target by 2026 comes from a McKinsey study. While I don’t know the details of the study I do have some subway fare data sitting around. So why not compare the article’s data to my own.

The methodology will be similar to what I did in my initial subway forecasting post using the modeltime package and the champion model Prophet w/ XGBoosted Errors to do the forecasting.


### Data Manipulation Packages
library(timetk) # For time series features in recipe
library(tidyverse) # General Data Manipulation
library(scales) # Making prettier scales
library(lubridate) # Dealing with Dates

# Modeling Ecosystem
library(modeltime) # Framework for Time Series models
library(tidymodels) # Framework for general recipe and workflows

### Model Packages
library(prophet) # Algorithm for forecasting


The data is the same as from my initial post. Its initially at the week/station/fare level. For this exercise I just need the data at the weekly level.

fares <- readRDS(file.path(here(), 'content', 'post', 
                           'mta_data.RDS')) %>% 
  group_by(week_start) %>% 
  summarize(fares = sum(fares))

My first blog post in this series covered the modeltime package in more detail for trying out many different forecasting models. That post found that Prophet with XGBoosted Errors was the best model. Here I’ll be replicating that workflow for that model type.

Step 1: Defining the pre-processing recipe

This step defines the forecasting formula as predicting fares based on all other features. Then it creates a bunch of time series specific features from the date field in step_timeseries_signature. step_rm removes some variables created in the prior step that aren’t useful, and finally step_dummy turns all the categorical variables into one-hot encoded indicators. Here I also set the training data set as the MTA Fares beginning during the COVID period (after April 1, 2020) since training on the prior time period will give very strange results.

rec <- recipe(fares ~ ., data = fares %>% filter(week_start >= ymd(20200401))) %>%
  step_timeseries_signature(week_start) %>% 
  step_rm(matches("(.iso$)|(am.pm$)|(.xts$)|(hour)|(minute)|(second)|(wday)")) %>% 
  step_dummy(all_nominal(), one_hot = TRUE)

Step 2: Define the Model Workflow and Fit the Model

Sticking with the tidymodels framework, here I define a workflow which will consist of the recipe created in Step 1 through add_recipe and the model set through add_model(). Within add_model() the model type is set to Boosted Prophet. I believe the ‘prophet_xgboost’ is the default engine so set_engine isn’t necessary, but good to keep around anyway.

prophet_boost_wf <- workflow() %>%
    prophet_boost(seasonality_yearly = TRUE) %>%
  ) %>% 
  add_recipe(rec) %>%
  fit(fares %>% filter(week_start >= ymd(20200401)) )

Step 3: Using the Model to Forecast the Future

In this instance I don’t have a test set to work with so I’m jumping directly into forecasting. Also since I don’t know how long it will take for the forecast to recover to pre-COVID levels, I’ll set the forecast horizon for 6 years in the h parameter. Passing in the actual_data let it be included in the output data set.

final_fcst <- modeltime_table(
  ) %>% 
    h = "6 years",
    actual_data = fares,
    keep_data = TRUE

Visualizing the Forecast

The modeltime package makes it easy to visualize the data through the plot_modeltime_forecast package. The default is to create a plot.ly plot but that can be converted to a ggplot2 plot by setting .interactive to FALSE.

final_fcst %>% 
  plot_modeltime_forecast(.interactive = F) + 
  scale_y_continuous(labels = comma)

When will Subway fares return to 80% of Pre-COVID? To 100%?

Now we can see how close my forecast is to the New York Times Report. I don’t actually know what the NY Times is considering Pre-COVID levels but for my purposes I’ll use the average number of fares in December 2019 to be my Pre-COVID.

baseline <- fares %>% 
  filter(month(week_start)==12 & year(week_start) == 2019) %>% 
  summarize(avg_fares = mean(fares)) %>% 

From the projection plot above its clear that there is a seasonality that peaks in the fall and drops in December through the New Year. To declare victory at 80% I’m going to require that there are 4 consecutive weeks of fares being above the Pre-COVID baseline.

I’m not sure of a great way to define when is the earliest date of the first run of 4 weeks above a threshold but I’m working it out in three steps:

  1. Define an indicator for whether that week is above 80% (above_80_ind)
  2. Run a counter for each time that the indicator flips from 0 to 1 (run_id_80) to get an id for each run
  3. For each run_id_80 get the sum of above_80_inds to represent the length of each run (run_length_80)
rec_pct <- final_fcst %>% 
  filter(week_start >= ymd(20200401)) %>% 
  # Build Recovery Percentage
  mutate(recovery_pct = .value / baseline) %>%
  # Define Runs of when recovery_pct is above .8
    above_80_ind = (recovery_pct > .8),
    above_100_ind = (recovery_pct > 1)
  ) %>% 
  # Define ID for each time we start a run
    run_id_80 = cumsum(if_else(above_80_ind == 1 & lag(above_80_ind) == 0, 
                               1, 0)),
    run_id_100 = cumsum(if_else(above_100_ind == 1 & lag(above_100_ind) == 0, 
                               1, 0))
  ) %>% 
  add_count(run_id_80, wt = above_80_ind, name = "run_length_80") %>%
  add_count(run_id_100, wt = above_100_ind, name = "run_length_100")

Now I can plot the recovery percentage by week and show that the first time there are four consecutive weeks above 80% is 2025-07-05 and the first time there are four consecutive weeks above 100% of the Pre-COVID value is 2027-06-26.

rec_pct %>% 
  ggplot(aes(x = week_start, y = recovery_pct)) + 
    geom_line(color = "#0039A6") + 
    geom_segment(aes(x = min(week_start), 
                     xend = rec_pct[which.max(rec_pct$run_length_80 >= 4), ]$week_start,
                     y = .8,
                     yend = .8), lty = 2) + 
    geom_segment(aes(x = rec_pct[which.max(rec_pct$run_length_80 >= 4), ]$week_start,
                     xend = rec_pct[which.max(rec_pct$run_length_80 >= 4), ]$week_start,
                     y = 0,
                     yend = .8), lty = 2) + 
    geom_segment(aes(x = min(week_start), 
                     xend = rec_pct[which.max(rec_pct$run_length_100 >= 4), ]$week_start,
                     y = 1,
                     yend = 1), lty = 2) + 
    geom_segment(aes(x = rec_pct[which.max(rec_pct$run_length_100 >= 4), ]$week_start,
                     xend = rec_pct[which.max(rec_pct$run_length_100 >= 4), ]$week_start,
                     y = 0,
                     yend = 1), lty = 2) + 
    scale_x_date(breaks = "1 years",
                 labels = year,
                 expand = c(0, 0)) + 
    scale_y_continuous(labels = percent, expand = c(0, 0),
                       breaks = seq(0, 1.6, .2)) + 
    labs(title = "Projected MTA Recovery vs. Pre-COVID",
         subtitle = "Pre-COVID Baseline from December 2019", 
         x = "Date", y = "% of Dec 2019 Baseline") + 

Based on this projection the NY Times article is being slightly pessimistic. According to the above NYC should reach 80% of Pre-COVID baseline by Mid-2025 which is earlier than the article’s projection of 2026.

Who will be right? We’ll have to wait at least 3 years to find out!

comments powered by Disqus