The aim of this project is to build a machine learning model that predicts the probability of parking availability in Isla Vista, the home of UCSB, given a combination of temporal predictors including time of day, day of the week, month, and location (street).
To create this model, I used data provided by the Isla Vista Community Services District in partnership with consulting firm Dixon Resources Unlimited.
Nobody likes parking. Actually, to reword that: nobody likes looking for parking. And in Isla Vista, you may find yourself doing this often, wasting time with no end (an open spot) in sight. Isla Vista’s high density and limited parking supply make availability both scarce and unpredictable, especially the closer you get to UCSB and during peak hours.
From my own experience living in Isla Vista, this has often felt like an almost impossible situation. While the most straightforward way to improve parking availability would be to reduce the number of cars, this is not very realistic in practice. Although Isla Vista is small and very walkable at first, over time it can begin to feel limiting, both practically and socially. Traveling outside of IV becomes necessary for more affordable groceries, a wider range of job opportunities, pharmacies, and other essential services, and staying within IV can start to feel claustrophobic. For many, this makes a car a necessity. This is compounded by unreliable public transportation and by Santa Barbara’s car-dependent layout, which makes walking to many destinations unrealistic. At the same time, expanding parking infrastructure is difficult due to space constraints and the area’s low-rise development.
Given these constraints, this project aims to shed at least a little light on the situation by helping predict when and where parking is more likely to be available, reducing the time spent searching and making the process a little less frustrating.
To illustrate this before we move on with the report, here is a heatmap of average parking availability taken from the Isla Vista Community Services District website (source below), which uses the same data I use in this project:
As you can see… there is not much parking
The data used in this project was collected as part of a parking study conducted in Isla Vista by Dixon Resources Unlimited in partnership with the Isla Vista Community Services District. I was privately sent the dataset and received approval to include it in the project folder, but not to link to it directly in this report. Data collection was carried out by driving a pre-determined route through Isla Vista using license plate recognition (LPR) cameras. Each observation corresponds to a specific block face at a particular point in time; the timestamp reflects the exact moment when the first license plate was detected on that block face, serving as a standardized reference for when conditions were observed.
Data was collected from April 2023 to March 2024, with one weekday and one weekend day per month (specifically Tuesdays, Thursdays, and Saturdays), allowing the dataset to capture variation in parking conditions across both weekdays and weekends. In total, the dataset includes 18,436 observations and five core variables: location (street/block face), timestamp, number of observed vehicles, curbside capacity, and spatial geometry (WKT). From the timestamp, additional temporal variables such as time of day, day of the week, and month are derived for analysis.
Together, these variables allow for the analysis of patterns in parking conditions over time and space and provide the foundation for constructing a measure of parking availability for modeling.
Now that we have a better understanding of the dataset and its structure, I am going to outline how this project will approach modeling parking availability in Isla Vista. The goal of this project is to develop a binary classification model that estimates the probability of finding available parking based on time and location.
We begin by cleaning and preparing the data, including constructing a measure of parking availability using the observed number of vehicles and curbside capacity (ensuring values remain between 0 and 1). We then create temporal predictor variables such as time of day, day of the week, and month from the timestamp.
Next, we perform exploratory data analysis to better understand patterns in parking conditions across locations and time periods. This includes visualizing trends and identifying relationships between predictors and parking availability.
The data is then split into training and testing sets, and a modeling workflow is established using cross-validation to evaluate performance. Several classification models will be considered, including Logistic Regression, Decision Trees, and Random Forests.
After fitting the models, we compare their performance using appropriate evaluation metrics and select the best-performing model. This final model will then be evaluated on the testing data to assess its ability to generalize.
Finally, we interpret the results to understand which factors most influence parking availability and discuss the implications of these findings. Let’s see where we will get that parking spot!
The original dataset was provided as an Excel file. For ease of use in R, it was exported to a CSV file (`parking_data.csv`) and stored in the project directory before being loaded into R.
# loading the data
parking_data <- read.csv("parking_data.csv", stringsAsFactors = FALSE)
dim(parking_data)
## [1] 18436 5
nrow(parking_data)
## [1] 18436
ncol(parking_data)
## [1] 5
head(parking_data)
## street
## 1 EMBARCADERO DEL NORTE-EMBARCADERO DEL NORTE-CERVANTES RD_WEST
## 2 EMBARCADERO DEL NORTE-CERVANTES RD-CERVANTES RD_SOUTH
## 3 EMBARCADERO DEL NORTE-CERVANTES RD-CERVANTES RD_NORTH
## 4 EMBARCADERO DEL NORTE-EL GRECO RD-EL GRECO RD_SOUTH
## 5 EMBARCADERO DEL NORTE-EL GRECO RD-EL GRECO RD_NORTH
## 6 EMBARCADERO DEL NORTE-PICASSO RD-PICASSO RD_SOUTH
## timestamp observed_plates curbside_inventory
## 1 2023-04-27 05:14:48 10 8
## 2 2023-04-27 05:15:20 17 17
## 3 2023-04-27 05:16:14 13 16
## 4 2023-04-27 05:17:49 19 20
## 5 2023-04-27 05:18:54 17 17
## 6 2023-04-27 05:19:45 23 21
## WKT
## 1 LineString (-119.85543145 34.41720725, -119.85543145 34.41651725)
## 2 LineString (-119.85375145 34.41636725, -119.85531145 34.41636725)
## 3 LineString (-119.85531145 34.41645725, -119.85375145 34.41645725)
## 4 LineString (-119.85369145 34.41570725, -119.85531145 34.41570725)
## 5 LineString (-119.85369145 34.41579725, -119.85531145 34.41579725)
## 6 LineString (-119.85369145 34.41504725, -119.85531145 34.41504725)
The dataset contains 18,436 observations of parking activity by street segment and time. There are 5 variables: the street segment where the observation was recorded (street), the time of the observation (timestamp), the number of observed vehicles (observed_plates), the estimated number of legal parking spaces (curbside_inventory), and the spatial geometry of the segment (WKT). Data were collected on selected days and times rather than continuously, and observations were recorded sequentially across locations as a vehicle traveled a fixed route. As a result, timestamps reflect when each street segment was observed rather than a simultaneous snapshot.
We examine the dataset for missing values and invalid observations to ensure the dataset is clean and suitable for analysis.
# Count missing values in each variable
colSums(is.na(parking_data))
## street timestamp observed_plates curbside_inventory
## 0 0 0 0
## WKT
## 0
# Total number of missing values in the dataset
sum(is.na(parking_data))
## [1] 0
# Count observations with zero curbside inventory
sum(parking_data$curbside_inventory == 0)
## [1] 216
The dataset does not contain any missing values across variables. However, 216 observations have a curbside inventory value of zero. Since the occupancy rate cannot be computed when the number of legal parking spaces is zero, these observations are treated as invalid and will be removed in a following step.
To measure parking demand relative to capacity, an occupancy rate variable was defined as the ratio of observed vehicles to legal parking spaces.
\[ occupancy\ rate = \frac{observed\ vehicles}{legal\ parking\ spaces} \]
parking_data$occupancy_rate <- parking_data$observed_plates / parking_data$curbside_inventory
We then remove observations where the occupancy rate cannot be computed and adjust values to ensure they fall within a valid range.
Some observations had a curbside inventory value of zero, meaning no legal parking spaces were recorded. Because occupancy rate cannot be computed in these cases, those observations were removed.
Additionally, some occupancy values exceeded 1 when the number of observed vehicles surpassed the estimated legal capacity. These values likely reflect overflow or informal parking beyond designated capacity. Because the goal of this analysis is to model parking availability (rather than the degree of overflow), occupancy rates were capped at 1.
# Remove invalid observations (zero inventory)
parking_data <- parking_data[parking_data$curbside_inventory > 0, ]
# Create occupancy rate and cap at 1
parking_data$occupancy_rate <- parking_data$observed_plates / parking_data$curbside_inventory
parking_data$occupancy_rate <- pmin(parking_data$occupancy_rate, 1)
summary(parking_data$occupancy_rate)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01231 0.65000 0.91667 0.79671 1.00000 1.00000
A total of 216 observations with zero curbside inventory were removed.
Additional predictors were created from the timestamp variable to capture temporal patterns in parking demand. These features allow the model to capture systematic variations in parking demand across different times of day and days of the week. A binary weekend indicator was also included to distinguish between weekday and weekend patterns. These variables serve as key predictors in modeling expected occupancy patterns.
library(lubridate)
parking_data$timestamp <- as.POSIXct(parking_data$timestamp)
parking_data$hour <- factor(hour(parking_data$timestamp))
parking_data$day_of_week <- wday(parking_data$timestamp, label = TRUE)
parking_data$month <- month(parking_data$timestamp, label = TRUE)
parking_data$weekend <- parking_data$day_of_week %in% c("Sat", "Sun")
head(parking_data[, c("hour", "day_of_week", "month", "weekend")])
## hour day_of_week month weekend
## 1 5 Thu Apr FALSE
## 2 5 Thu Apr FALSE
## 3 5 Thu Apr FALSE
## 4 5 Thu Apr FALSE
## 5 5 Thu Apr FALSE
## 6 5 Thu Apr FALSE
Time was aggregated to the hour level to capture broad temporal trends while reducing noise from minute-level variation, providing a more stable representation of overall parking patterns.
The original street variable in the dataset represents
highly specific block face segments, formatted as combinations of three
street names (e.g., “A–B–C_DIRECTION”), where the middle street
corresponds to the primary street and the others define the boundaries
of the segment. While this level of detail is useful for data
collection, it results in a very large number of unique categories,
which can make modeling less efficient and harder to interpret.
To address this, the street variable was transformed into a more meaningful representation by separating each segment into a block identifier and a primary street name. I represented the street segments this way because it is not only clearer but also reflects how locations are commonly referred to in Isla Vista, by their block + street combination. To do this, I manually constructed a lookup table to map each original street segment to a simplified format of the form:
block number + primary street name (e.g., “PICASSO RD-PICASSO RD-EMBARCADERO DEL MAR_NORTH” becomes “65 PICASSO RD”)
One limitation I came across is that the long cross streets separating the blocks span many blocks rather than a small interval, so this transformation alone would not convey their location meaningfully. For these (specifically where the middle street was Camino Corto, Camino Pescadero, Camino del Sur, Camino Lindo, Embarcadero Del Mar, or Embarcadero Del Norte), I added a note after to clarify their bounds.
e.g., “EL GRECO RD-CAMINO PESCADERO-PICASSO RD_EAST” becomes “65 CAMINO PESCADERO (between EL GRECO RD and PICASSO RD, EAST)”
library(readxl)
library(dplyr)
library(tidyr)
library(stringr)
# read the lookup spreadsheet
lookup_raw <- read_excel("Street Lookup Table.xlsx", col_names = FALSE)
# turn the paired columns into one long 2-column lookup table
lookup_table <- bind_rows(
lapply(seq(1, ncol(lookup_raw), by = 2), function(i) {
tibble(
street = lookup_raw[[i]],
street_new = lookup_raw[[i + 1]]
)
})
) %>%
filter(!is.na(street), !is.na(street_new)) %>%
mutate(
street = str_trim(street),
street_new = str_trim(street_new)
) %>%
distinct()
# join onto the parking data
parking_data <- parking_data %>%
mutate(street = str_trim(street)) %>%
left_join(lookup_table, by = "street")
So now our street observations look much cleaner:
parking_data %>%
select(street, street_new) %>%
tail(10)
## street
## 18211 CAMINO MAJORCA-DEL PLAYA DR-CAMINO LINDO_NORTH
## 18212 SABADO TARDE RD-CAMINO MAJORCA-DEL PLAYA DR_EAST
## 18213 TRIGO RD-CAMINO MAJORCA-SABADO TARDE RD_EAST
## 18214 PASADO RD-CAMINO MAJORCA-TRIGO RD_EAST
## 18215 PASADO RD-CAMINO MAJORCA-TRIGO RD_WEST
## 18216 CAMINO MAJORCA-PASADO RD-CAMINO LINDO_SOUTH
## 18217 CAMINO LINDO-PASADO RD-CAMINO CORTO_SOUTH
## 18218 CAMINO CORTO-PASADO RD-CAMINO DEL SUR_SOUTH
## 18219 PASADO RD-CAMINO DEL SUR-TRIGO RD_WEST
## 18220 CAMINO DEL SUR-TRIGO RD-CAMINO PESCADERO_SOUTH
## street_new
## 18211 68 DEL PLAYA DR
## 18212 68 CAMINO MAJORCA
## 18213 68 CAMINO MAJORCA
## 18214 68 CAMINO MAJORCA
## 18215 68 CAMINO MAJORCA
## 18216 68 PASADO RD
## 18217 68 PASADO RD
## 18218 67 PASADO RD
## 18219 67 CAMINO DEL SUR (between PASADO RD and TRIGO RD, WEST)
## 18220 66 TRIGO RD
I then checked whether all the street values in parking_data joined properly with my lookup table values and found three locations that did not:
parking_data %>%
filter(is.na(street_new)) %>%
distinct(street) %>%
arrange(street)
## street
## 1 EL COLEGIO RD-OCEAN RD-BUS LOOP_SOUTH
## 2 GOLETA BEACH LOT - E
## 3 GOLETA BEACH LOT - W
These three locations were not included in the recoding lookup table because they are located on or east of the UCSB campus, outside the geographic focus of this project. Since the goal of this analysis is to model parking availability in Isla Vista, observations from these locations were removed from the dataset.
parking_data <- parking_data %>%
filter(!is.na(street_new))
parking_data <- parking_data %>%
mutate(street = street_new) %>%
select(-street_new)
Finally, I verified that all remaining observations had been successfully matched to the lookup table.
sum(is.na(parking_data$street))
## [1] 0
This returned 0 unjoined values. Yay!
Finally, we remove the timestamp and WKT variables. The timestamp variable is excluded because its information is fully captured by the time-based predictors we created, while WKT is removed because it contains geometric information that is not directly usable in the model.
parking_data$timestamp <- NULL
parking_data$WKT <- NULL
dim(parking_data)
## [1] 18163 8
nrow(parking_data)
## [1] 18163
ncol(parking_data)
## [1] 8
After all of our alterations, we now have 18,163 observations and 8 variables!
Now our original predictors have been successfully altered and are ready to be used for modeling! Here is a description of each of them for reference:
Time-Based Predictors
hour: The hour of the day when the observation was
recorded (0–23)
day_of_week: The day of the week of the observation
(Tuesday, Thursday, or Saturday)
month: The month in which the observation was
recorded
weekend: Indicator for whether the observation
occurred on a weekend (TRUE = weekend, FALSE = weekday)
Parking Variables
street: The block + street where the parking
observation was recorded
observed_plates: The number of vehicles observed
parked at that time
curbside_inventory: The total number of legal
parking spaces available
occupancy_rate: The proportion of occupied parking
spaces
To better understand patterns in parking demand, several exploratory plots were created. These visualizations help identify how parking occupancy varies by time of day, day of week, and street segment.
The dataset contains 18,163 observations of parking occupancy across different times and locations. However, the number of observations varies across hours, reflecting the sequential nature of data collection.
library(ggplot2)
ggplot(parking_data, aes(x = hour, y = occupancy_rate)) +
geom_boxplot() +
labs(
title = "Parking Occupancy by Hour of Day",
x = "Hour of Day",
y = "Occupancy Rate"
) +
theme_minimal()
Parking occupancy remains relatively high throughout the day, with several periods of especially high demand. Early morning hours (5–6 AM) and evening hours (8–10 PM) show the highest occupancy levels, with median values at or near full capacity, indicating limited parking availability during these times.
Occupancy decreases during the morning period (approximately 7–10 AM), where both mean and median values drop, suggesting increased availability. A moderate dip is also observed in the early afternoon (around 1–2 PM), before occupancy rises again into the evening.
It is important to note that the number of observations varies substantially by hour due to the sequential data collection process. As a result, hours with very low sample sizes may be less reliable and were not heavily emphasized in interpretation.
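To make this uneven coverage explicit, the observations can be tallied by hour. This is a quick sketch using the parking_data object defined above; it lists the thinnest-sampled hours first so their estimates can be treated with appropriate caution.

```r
library(dplyr)

# count observations per hour, thinnest-sampled hours first
parking_data %>%
  count(hour, name = "n_obs") %>%
  arrange(n_obs)
```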
library(dplyr)
library(ggplot2)
hour_summary <- parking_data %>%
group_by(hour) %>%
summarize(mean_occupancy = mean(occupancy_rate), .groups = "drop")
ggplot(hour_summary, aes(x = as.numeric(as.character(hour)), y = mean_occupancy)) +
geom_line() +
geom_point() +
labs(
title = "Average Parking Occupancy by Hour of Day",
x = "Hour of Day",
y = "Average Occupancy Rate"
) +
theme_minimal()
Average parking occupancy remains relatively high throughout the day, generally ranging between 0.7 and 0.9. Higher occupancy levels are observed in the early morning and evening hours, indicating periods of increased demand.
A drop in occupancy appears during the morning (around 7–10 AM), with the lowest value at 9 AM. However, this is based on very few observations and is not considered reliable. Overall, occupancy shows moderate variation by hour, with consistently high demand across most of the day.
ggplot(parking_data, aes(x = day_of_week, y = occupancy_rate)) +
geom_boxplot() +
labs(
title = "Parking Occupancy by Day of Week",
x = "Day of Week",
y = "Occupancy Rate"
) +
theme_minimal()
The dataset includes observations from Tuesday, Thursday, and Saturday, allowing comparison across selected weekdays and a weekend day.
Average occupancy is highest on Thursday (mean = 0.825), followed by Saturday (0.793), and lowest on Tuesday (0.736). Median values show a similar pattern, with Thursday having the highest median occupancy (0.971), indicating consistently high usage.
Additionally, Thursday and Saturday have higher first quartile values than Tuesday, suggesting that even lower occupancy levels on those days tend to be relatively high. However, these results should be interpreted as differences across sampled days rather than a complete weekly pattern.
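The day-of-week figures quoted above can be reproduced with a short summary, following the same dplyr pattern used for the hourly summary (a sketch using the existing parking_data object):

```r
library(dplyr)

# mean, median, and first-quartile occupancy for each sampled day
day_summary <- parking_data %>%
  group_by(day_of_week) %>%
  summarize(
    mean_occupancy   = mean(occupancy_rate),
    median_occupancy = median(occupancy_rate),
    q1_occupancy     = quantile(occupancy_rate, 0.25),
    count            = n(),
    .groups = "drop"
  )
day_summary
```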
library(dplyr)
library(knitr)
parking_data$month <- factor(
parking_data$month,
levels = c("Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec")
)
month_summary <- parking_data %>%
group_by(month) %>%
summarize(
mean_occupancy = mean(occupancy_rate),
median_occupancy = median(occupancy_rate),
count = n(),
.groups = "drop"
)
library(ggplot2)
ggplot(month_summary, aes(x = month, y = mean_occupancy, group = 1)) +
geom_line() +
geom_point() +
labs(
title = "Average Parking Occupancy by Month",
x = "Month",
y = "Average Occupancy Rate"
) +
theme_minimal()
Parking occupancy varies across months, suggesting a seasonal pattern in parking demand. Average occupancy is highest in the spring and early summer months, particularly from April through June, where mean occupancy is close to 0.88 and median occupancy is at full capacity. Occupancy is also relatively high in February, March, September, October, and November.
In contrast, July and December show noticeably lower occupancy, with substantially lower mean and median values than most other months. This pattern may be related to UCSB’s academic schedule, as these months align with summer and winter breaks when fewer students are present, potentially reducing parking demand.
Because the number of observations is relatively balanced across months, these differences are less likely to be driven by uneven sampling than the hourly comparisons. Overall, the results suggest that seasonality may play a meaningful role in parking demand, although time of day and street segment still appear to be especially important predictors.
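For reference, the month_summary table computed for the plot above can also be displayed directly. Since knitr was already loaded, a minimal sketch is:

```r
library(knitr)

# render the monthly summary as a formatted table
kable(month_summary, digits = 3)
```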
library(dplyr)
library(ggplot2)
street_summary <- parking_data %>%
group_by(street) %>%
summarize(
mean_occupancy = mean(occupancy_rate),
count = n(),
.groups = "drop"
) %>%
filter(count >= 50)
top_streets <- street_summary %>%
arrange(desc(mean_occupancy)) %>%
slice_head(n = 10)
ggplot(top_streets, aes(x = reorder(street, mean_occupancy), y = mean_occupancy)) +
geom_col() +
coord_flip() +
labs(
title = "Top 10 Highest Occupancy Streets",
x = "Street",
y = "Average Occupancy Rate"
) +
theme_minimal()
Parking occupancy varies substantially across street segments, indicating that location plays an important role in parking demand. The streets with the highest average occupancy are shown above, with 66 Del Playa almost always at full capacity. This is not surprising: 66 DP is one of the most sought-after locations to live, next to the beach and in the middle of IV. Some houses here have 20+ residents, so more cars are likely concentrated in this area.
bottom_streets <- street_summary %>%
arrange(mean_occupancy) %>%
slice_head(n = 10)
ggplot(bottom_streets, aes(x = reorder(street, -mean_occupancy), y = mean_occupancy)) +
geom_col() +
coord_flip() +
labs(
title = "Top 10 Lowest Occupancy Streets",
x = "Street",
y = "Average Occupancy Rate"
) +
theme_minimal()
In contrast, other streets exhibit noticeably lower average occupancy, indicating greater parking availability in those locations. To ensure reliable comparisons, the analysis focuses on street segments with a sufficient number of observations, since some locations have relatively few data points. Overall, these patterns suggest that street-level differences are a key factor in predicting parking availability.
library(corrplot)
## corrplot 0.95 loaded
num_data <- parking_data %>%
select(where(is.numeric))
cor_matrix <- cor(num_data, use = "complete.obs")
corrplot(cor_matrix, method = "color", type = "upper", tl.cex = 0.8)
There is a strong relationship between the number of observed vehicles and curbside inventory, as larger street segments naturally accommodate more cars. Observed vehicles also show a moderate positive relationship with occupancy rate. However, curbside inventory has little to no relationship with occupancy rate, suggesting that parking demand, not just supply, drives how full a street becomes.
Now that we have a better understanding of how key variables relate to parking demand, we can move on to building predictive models. This project is framed as a classification problem because the response variable indicates whether parking is available or not. However, the primary goal is not simply to assign a yes/no label, but to estimate the probability that a parking space is available given a set of predictors such as time, location, and parking supply.
To do this, the data will first be split into training and testing sets. A preprocessing recipe will then be created using the training data, and cross-validation will be established to evaluate model performance in a more reliable way before testing the final model on unseen data.
Before fitting any models, the data must first be divided into training and testing sets. The training set will be used to fit and compare the models, while the testing set will be held aside until the very end to assess how well the final model generalizes to new data. I use a random split so that 75% of the observations are assigned to the training set and 25% to the testing set. Because the response variable is binary, I stratify on parking availability so that the proportion of available versus unavailable observations remains similar in both datasets. A seed is also set to ensure that the split is reproducible.
library(tidymodels)
library(dplyr)
tidymodels::tidymodels_prefer()
# create binary response variable
parking_data <- parking_data %>%
mutate(
parking_available = factor(
ifelse(occupancy_rate < 0.9, "yes", "no"),
levels = c("no", "yes")
)
)
set.seed(131)
parking_split <- initial_split(
parking_data,
prop = 0.75,
strata = parking_available
)
parking_train <- training(parking_split)
parking_test <- testing(parking_split)
Dimensions of our parking training set:
dim(parking_train)
## [1] 13621 9
Dimensions of our parking testing set:
dim(parking_test)
## [1] 4542 9
Parking availability is defined as an occupancy rate below 0.90.
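Because the split is stratified on the response, the proportion of available versus unavailable observations should be nearly identical in the training and testing sets. This can be verified with a quick check (a sketch, assuming the split code above has been run):

```r
# class balance in the training set
prop.table(table(parking_train$parking_available))

# class balance in the testing set, which should be very close
prop.table(table(parking_test$parking_available))
```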
Next, a preprocessing recipe is created using only the training data.
The response variable is parking_available, and the
predictors are chosen to reflect the information that would
realistically be known in advance when trying to predict parking
availability. These include time-based variables, location, and curbside
inventory.
Variables such as occupancy_rate and
observed_plates are not included as predictors because they
directly determine or strongly overlap with the response, which would
introduce data leakage and artificially inflate model performance. Since
some predictors are categorical, dummy variables are created so that the
models can use them appropriately.
parking_recipe <- recipe(
parking_available ~ hour + day_of_week + month + street + curbside_inventory,
data = parking_train
) %>%
step_other(street, threshold = 0.01) %>%
step_dummy(all_nominal_predictors())
# to inspect the recipe
parking_recipe
##
## ── Recipe ──────────────────────────────────────────────────────────────────────
##
## ── Inputs
## Number of variables by role
## outcome: 1
## predictor: 5
##
## ── Operations
## • Collapsing factor levels for: street
## • Dummy variables from: all_nominal_predictors()
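To see what the processed predictors actually look like after these steps, the recipe can be prepped on the training data and baked; glimpse() from dplyr gives a compact view of the resulting dummy-encoded columns (a quick inspection sketch, not part of the modeling workflow itself):

```r
library(dplyr)

# estimate the recipe's steps on the training data and return the processed data
parking_recipe %>%
  prep() %>%
  bake(new_data = NULL) %>%
  glimpse()
```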
To compare models fairly and reduce dependence on a single training split, I use 5-fold cross-validation on the training set. This means the training data is divided into five parts, and each part takes a turn serving as a validation fold while the remaining folds are used for fitting the model. Stratification is again applied to preserve the class balance of parking availability across folds.
set.seed(131)
parking_folds <- vfold_cv(
parking_train,
v = 5,
strata = parking_available
)
parking_folds
## # 5-fold cross-validation using stratification
## # A tibble: 5 × 2
## splits id
## <list> <chr>
## 1 <split [10896/2725]> Fold1
## 2 <split [10896/2725]> Fold2
## 3 <split [10896/2725]> Fold3
## 4 <split [10898/2723]> Fold4
## 5 <split [10898/2723]> Fold5
With the data properly split and preprocessed, we can now begin building predictive models. The goal of this section is to fit several classification models that estimate the probability that a parking space is available. Multiple models are considered to compare performance and determine which approach best captures the relationship between time, location, and parking supply.
To evaluate model performance, I use both ROC AUC and accuracy. Because the goal of this project is to estimate the probability that parking is available, ROC AUC is especially important. It measures how well the model distinguishes between available and unavailable parking across all possible probability thresholds, making it more informative than accuracy alone.
Accuracy is also included as a secondary metric for interpretability, as it reflects the proportion of correctly classified observations. However, it is more sensitive to the choice of classification threshold and may not fully capture model performance in probability-based settings.
Three classification models are fit and compared: logistic regression, a decision tree, and a random forest. Logistic regression serves as a baseline model due to its interpretability, while tree-based methods allow for more flexible relationships and interactions among predictors.
Each model is trained using the same preprocessing recipe and evaluated using 5-fold cross-validation on the training data. This ensures that model comparisons are fair and not dependent on a single split of the data.
The general steps of the model building process are as follows:
We first define the model type, select the engine, and set the mode
to "classification".
# Logistic Regression
log_model <- logistic_reg() %>%
set_engine("glm") %>%
set_mode("classification")
# Decision Tree
tree_model <- decision_tree() %>%
set_engine("rpart") %>%
set_mode("classification")
# Random Forest
rf_model <- rand_forest(trees = 500) %>%
set_engine("ranger") %>%
set_mode("classification")
We then create a workflow for each model by combining the model with the preprocessing recipe.
# Logistic Regression
log_workflow <- workflow() %>%
add_model(log_model) %>%
add_recipe(parking_recipe)
# Decision Tree
tree_workflow <- workflow() %>%
add_model(tree_model) %>%
add_recipe(parking_recipe)
# Random Forest
rf_workflow <- workflow() %>%
add_model(rf_model) %>%
add_recipe(parking_recipe)
Each model is trained using cross-validation to evaluate performance on unseen data.
set.seed(131)
# Logistic Regression
log_res <- fit_resamples(
log_workflow,
resamples = parking_folds,
metrics = metric_set(roc_auc, accuracy),
control = control_resamples(save_pred = TRUE)
)
# Decision Tree
tree_res <- fit_resamples(
tree_workflow,
resamples = parking_folds,
metrics = metric_set(roc_auc, accuracy),
control = control_resamples(save_pred = TRUE)
)
# Random Forest
rf_res <- fit_resamples(
rf_workflow,
resamples = parking_folds,
metrics = metric_set(roc_auc, accuracy),
control = control_resamples(save_pred = TRUE)
)
We extract ROC AUC and accuracy for each model.
collect_metrics(log_res)
## # A tibble: 2 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 0.694 5 0.00421 pre0_mod0_post0
## 2 roc_auc binary 0.756 5 0.00405 pre0_mod0_post0
collect_metrics(tree_res)
## # A tibble: 2 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 0.686 5 0.00344 pre0_mod0_post0
## 2 roc_auc binary 0.691 5 0.00411 pre0_mod0_post0
collect_metrics(rf_res)
## # A tibble: 2 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 0.721 5 0.00470 pre0_mod0_post0
## 2 roc_auc binary 0.796 5 0.00371 pre0_mod0_post0
The models are compared based on their cross-validation performance. The model with the highest ROC AUC is selected as the final model.
Since our dataset contains a moderate number of observations, model training and evaluation were computationally manageable. All models have now been fit using cross-validation, and their performance metrics have been collected for comparison. We now analyze how each model performs in predicting parking availability.
To compare model performance, we focus on two key metrics: ROC AUC and accuracy. ROC AUC measures how well a model distinguishes between available and unavailable parking, while accuracy measures the proportion of correct predictions. Here is how each of our three models performs on these metrics:
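One convenient way to compare all three models at once is to bind their cross-validation metrics into a single table, sorted within each metric (a sketch using the log_res, tree_res, and rf_res objects fit above):

```r
library(dplyr)
library(tidymodels)

# combine cross-validation metrics from all three models into one table
bind_rows(
  collect_metrics(log_res)  %>% mutate(model = "Logistic Regression"),
  collect_metrics(tree_res) %>% mutate(model = "Decision Tree"),
  collect_metrics(rf_res)   %>% mutate(model = "Random Forest")
) %>%
  select(model, .metric, mean, std_err) %>%
  arrange(.metric, desc(mean))
```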
Logistic regression serves as a baseline model for this classification task. It assumes a linear relationship between the predictors and the log-odds of parking availability.
collect_metrics(log_res)
## # A tibble: 2 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 0.694 5 0.00421 pre0_mod0_post0
## 2 roc_auc binary 0.756 5 0.00405 pre0_mod0_post0
The logistic regression model achieves moderate performance, with a ROC AUC indicating a reasonable ability to distinguish between available and unavailable parking. However, its linear structure limits its ability to capture more complex relationships in the data.
ROC Curve
log_preds <- collect_predictions(log_res)
log_preds %>%
roc_curve(truth = parking_available, .pred_yes, event_level = "second") %>%
ggplot(aes(x = 1 - specificity, y = sensitivity)) +
geom_path() +
geom_abline(lty = 2) +
labs(
title = "ROC Curve: Logistic Regression",
x = "1 - Specificity",
y = "Sensitivity"
) +
theme_minimal()
Its ROC curve shows that the logistic regression model performs better than random guessing, as the curve lies above the diagonal reference line. The model demonstrates moderate predictive ability, with a reasonable trade-off between sensitivity and specificity across different thresholds.
Decision trees model non-linear relationships by splitting the data based on feature values. This allows them to capture more complex patterns compared to logistic regression.
collect_metrics(tree_res)
## # A tibble: 2 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 0.686 5 0.00344 pre0_mod0_post0
## 2 roc_auc binary 0.691 5 0.00411 pre0_mod0_post0
The decision tree model achieves an accuracy of approximately 0.686 and a ROC AUC of about 0.691, indicating relatively weak to moderate predictive performance. Compared to logistic regression, the decision tree performs worse, suggesting that a single tree may not capture the underlying patterns in parking availability as effectively. This may be due to the model’s tendency to overfit or its limited ability to generalize from the data.
ROC Curve
tree_preds <- collect_predictions(tree_res)
tree_preds %>%
roc_curve(truth = parking_available, .pred_yes, event_level = "second") %>%
ggplot(aes(x = 1 - specificity, y = sensitivity)) +
geom_path() +
geom_abline(lty = 2) +
labs(
title = "ROC Curve: Decision Tree",
x = "1 - Specificity",
y = "Sensitivity"
) +
theme_minimal()
The ROC curve shows that the decision tree model performs better than random guessing, but worse than logistic regression. The curve lies closer to the diagonal, indicating weaker ability to distinguish between available and unavailable parking.
Random forests improve upon decision trees by combining multiple trees, reducing overfitting and improving predictive performance.
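The ensemble idea behind this can be sketched with a toy simulation (not project data): if each "tree" classifies a case correctly with probability 0.6, independently, a majority vote over many trees is right far more often than any single tree.

```r
# Toy simulation of ensemble voting: rows are cases, columns are trees,
# and each entry is 1 if that tree classified that case correctly.
set.seed(131)
n_trees <- 500
n_cases <- 2000
votes <- matrix(rbinom(n_trees * n_cases, size = 1, prob = 0.6),
nrow = n_cases)
single_tree_acc <- mean(votes[, 1])         # about 0.6
ensemble_acc <- mean(rowMeans(votes) > 0.5) # close to 1
```

Real random forests grow correlated trees (bootstrap samples sharing predictors), so the gains are smaller than in this idealized independent case, which matches the modest improvement we see below.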
collect_metrics(rf_res)
## # A tibble: 2 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 0.721 5 0.00470 pre0_mod0_post0
## 2 roc_auc binary 0.796 5 0.00371 pre0_mod0_post0
The random forest model achieves an accuracy of approximately 0.72 and a ROC AUC of about 0.80, indicating strong predictive performance. The higher ROC AUC suggests that the model has a strong ability to distinguish between available and unavailable parking. Compared to both logistic regression and the decision tree, the random forest shows clear improvement in classification performance.
ROC Curve
rf_preds <- collect_predictions(rf_res)
rf_preds %>%
roc_curve(truth = parking_available, .pred_yes, event_level = "second") %>%
ggplot(aes(x = 1 - specificity, y = sensitivity)) +
geom_path() +
geom_abline(lty = 2) +
labs(
title = "ROC Curve: Random Forest",
x = "1 - Specificity",
y = "Sensitivity"
) +
theme_minimal()
The ROC curve for the random forest lies furthest above the diagonal reference line, indicating the strongest classification performance among all models. Compared to both logistic regression and the decision tree, the random forest demonstrates superior ability to distinguish between parking availability classes.
Tuning plot
rf_tune_model <- rand_forest(
mtry = tune(),
min_n = tune(),
trees = tune()
) %>%
set_engine("ranger") %>%
set_mode("classification")
rf_tune_workflow <- workflow() %>%
add_model(rf_tune_model) %>%
add_recipe(parking_recipe)
if (file.exists("rf_tune_res.rds")) {
rf_tune_res <- readRDS("rf_tune_res.rds")
} else {
rf_grid <- grid_regular(
mtry(range = c(1L, 5L)),
min_n(range = c(2L, 8L)),
trees(range = c(200L, 600L)),
levels = 4
)
set.seed(131)
rf_tune_res <- tune_grid(
rf_tune_workflow,
resamples = parking_folds,
grid = rf_grid,
metrics = metric_set(roc_auc, accuracy),
control = control_grid(save_pred = TRUE)
)
saveRDS(rf_tune_res, "rf_tune_res.rds")
}
rf_tune_metrics <- collect_metrics(rf_tune_res)
rf_tune_metrics %>%
filter(.metric == "roc_auc") %>%
ggplot(aes(x = mtry, y = mean, color = factor(trees))) +
geom_line() +
geom_point() +
facet_wrap(~ min_n) +
labs(
title = "Random Forest Tuning Results",
x = "Number of Predictors Sampled at Each Split (mtry)",
y = "ROC AUC",
color = "# Trees"
) +
theme_minimal()
The tuning results show that the random forest model’s performance varies slightly across different hyperparameter settings. As the number of predictors sampled at each split (mtry) increases, the ROC AUC generally improves, indicating better model performance.
Additionally, increasing the number of trees leads to small but consistent improvements in ROC AUC, with performance stabilizing at higher values. The minimum node size has relatively little impact, as performance remains similar across different values.
Overall, the best performance is achieved at higher values of mtry and a larger number of trees, suggesting that a more complex random forest model provides better predictive accuracy while maintaining stability.
# combine results
log_metrics <- collect_metrics(log_res) %>% mutate(model = "Logistic Regression")
tree_metrics <- collect_metrics(tree_res) %>% mutate(model = "Decision Tree")
rf_metrics <- collect_metrics(rf_res) %>% mutate(model = "Random Forest")
all_metrics <- bind_rows(log_metrics, tree_metrics, rf_metrics)
# ROC AUC plot
all_metrics %>%
filter(.metric == "roc_auc") %>%
ggplot(aes(x = model, y = mean, fill = model)) +
geom_col() +
labs(title = "ROC AUC by Model", x = "Model", y = "ROC AUC") +
theme_minimal()
The random forest model achieved the highest ROC AUC (~0.80), indicating the best ability to distinguish between classes. Logistic regression performed moderately well (~0.75), while the decision tree had the lowest performance (~0.69). These results suggest that the random forest is the most effective model for this task, likely due to its ability to capture more complex patterns in the data.
# collect accuracy results for each model
log_acc <- collect_metrics(log_res) %>%
filter(.metric == "accuracy") %>%
select(mean) %>%
pull()
tree_acc <- collect_metrics(tree_res) %>%
filter(.metric == "accuracy") %>%
select(mean) %>%
pull()
rf_acc <- collect_metrics(rf_res) %>%
filter(.metric == "accuracy") %>%
select(mean) %>%
pull()
# make accuracy summary table
model_accuracy_table <- tibble(
Model = c("Random Forest", "Logistic Regression", "Decision Tree"),
Accuracy = c(rf_acc, log_acc, tree_acc)
) %>%
arrange(desc(Accuracy))
knitr::kable(
model_accuracy_table,
digits = 3,
caption = "Cross-Validated Accuracy by Model"
)
| Model | Accuracy |
|---|---|
| Random Forest | 0.721 |
| Logistic Regression | 0.694 |
| Decision Tree | 0.686 |
As we can see in our table, the random forest model achieves the highest accuracy (0.721), followed by logistic regression (0.694) and the decision tree (0.686). These results are consistent with the ROC AUC comparison, further supporting the random forest as the best-performing model!
Our best overall model was the random forest, so the next step was to identify which specific tuning combination performed best. To do this, I examined the tuning results and selected the parameter combination with the highest cross-validated ROC AUC.
# best tuned random forest settings based on ROC AUC
show_best(rf_tune_res, metric = "roc_auc", n = 1)
## # A tibble: 1 × 9
## mtry trees min_n .metric .estimator mean n std_err .config
## <int> <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 5 466 2 roc_auc binary 0.793 4 0.00484 pre0_mod57_post0
The best-performing random forest used mtry = 5, trees = 466, and min_n = 2. This model achieved the highest average ROC AUC across the validation folds, with a mean ROC AUC of about 0.793. This indicates that this tuning combination provided the strongest ability to distinguish between available and unavailable parking among all random forest settings considered.
# get best parameters
best_rf_params <- select_best(rf_tune_res, metric = "roc_auc")
# finalize workflow
final_rf_workflow <- rf_tune_workflow %>%
finalize_workflow(best_rf_params)
# fit on full training data
final_rf_fit <- fit(final_rf_workflow, data = parking_train)
rf_test_preds <- predict(final_rf_fit, parking_test, type = "prob") %>%
bind_cols(predict(final_rf_fit, parking_test, type = "class")) %>%
bind_cols(parking_test %>% select(parking_available))
rf_test_preds %>%
accuracy(truth = parking_available, estimate = .pred_class)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.716
And for ROC AUC:
rf_test_preds %>%
roc_auc(truth = parking_available, .pred_yes, event_level = "second")
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 roc_auc binary 0.795
After finalizing the best random forest model, I evaluated it on the test set to measure how well it generalized to unseen data. The final model achieved an accuracy of approximately 0.716 and a ROC AUC of about 0.795.
These results suggest that the model performs reasonably well in distinguishing between available and unavailable parking. In particular, the ROC AUC indicates fairly strong classification ability, since it is well above 0.5 and close to 0.8.
rf_test_preds %>%
roc_curve(truth = parking_available, .pred_yes, event_level = "second") %>%
ggplot(aes(x = 1 - specificity, y = sensitivity)) +
geom_path() +
geom_abline(lty = 2) +
labs(
title = "ROC Curve: Final Random Forest Model",
x = "1 - Specificity",
y = "Sensitivity"
) +
theme_minimal()
The ROC curve also supports this conclusion, as it lies substantially above the diagonal reference line. Overall, the finalized random forest model demonstrates good predictive performance on unseen data and appears to be the strongest model considered in this project.
conf_mat(rf_test_preds, truth = parking_available, estimate = .pred_class) %>%
autoplot(type = "heatmap")
The confusion matrix shows that the random forest correctly classified many observations in both categories. In particular, it correctly identified 1,915 unavailable cases and 1,337 available cases.
There are still some errors, with 860 false negatives and 430 false positives, indicating that the model performs better at identifying unavailable parking than available parking. This likely reflects the class imbalance in the dataset, where unavailable parking occurs more frequently. Overall, this supports the earlier results: the random forest performs reasonably well, but it is still not perfect.
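The headline rates behind the matrix can be computed directly from these counts, treating "available" as the positive class (a small base-R check using the numbers reported above):

```r
# Rates derived from the confusion-matrix counts reported above
tp <- 1337   # predicted available, actually available
tn <- 1915   # predicted unavailable, actually unavailable
fn <- 860    # predicted unavailable, actually available
fp <- 430    # predicted available, actually unavailable

sensitivity <- tp / (tp + fn)            # recall for "available" (~0.61)
specificity <- tn / (tn + fp)            # recall for "unavailable" (~0.82)
accuracy <- (tp + tn) / (tp + tn + fp + fn)
round(c(sensitivity = sensitivity,
specificity = specificity,
accuracy = accuracy), 3)
```

The gap between sensitivity and specificity quantifies the asymmetry described above: the model is noticeably better at flagging unavailable parking than available parking, and the derived accuracy matches the test-set accuracy of 0.716.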
Now that the random forest has been identified as the best-performing model, it is helpful to examine which predictors were most important in determining parking availability. Unlike logistic regression, random forests do not use coefficients. Instead, variable importance measures how much each predictor contributes to improving prediction accuracy across the trees in the forest.
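Since the fit below uses impurity-based importance (`importance = "impurity"` in ranger), the underlying computation can be sketched with toy numbers: each split's contribution is the decrease in Gini impurity it achieves, weighted by node size, and a variable's importance is the sum of those decreases over all splits that use it.

```r
# Toy Gini-impurity decrease for one hypothetical split of 100 cases
gini <- function(p) 2 * p * (1 - p)   # binary Gini impurity

parent <- gini(0.5)    # parent node: 50/50 available vs. unavailable
left <- gini(0.8)      # left child: 60 cases, mostly available
right <- gini(0.05)    # right child: 40 cases, mostly unavailable
decrease <- parent - (60 / 100) * left - (40 / 100) * right
round(decrease, 3)     # this split's contribution to importance
```

A variable that repeatedly produces clean splits like this one accumulates a large importance score, which is how curbside_inventory comes to dominate the ranking below.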
Below, I display the variable importance values for the finalized random forest model and then visualize them with a bar plot.
# best tuning parameters
best_rf_params <- select_best(rf_tune_res, metric = "roc_auc")
# random forest model WITH importance
rf_vi_model <- rand_forest(
mtry = best_rf_params$mtry,
min_n = best_rf_params$min_n,
trees = best_rf_params$trees
) %>%
set_engine("ranger", importance = "impurity") %>%
set_mode("classification")
rf_vi_workflow <- workflow() %>%
add_model(rf_vi_model) %>%
add_recipe(parking_recipe)
rf_vi_fit <- fit(rf_vi_workflow, data = parking_train)
rf_engine <- extract_fit_engine(rf_vi_fit)
rf_importance <- vip::vi(rf_engine)
rf_importance
## # A tibble: 60 × 2
## Variable Importance
## <chr> <dbl>
## 1 curbside_inventory 229.
## 2 month_04 140.
## 3 month_01 109.
## 4 month_11 108.
## 5 street_X68.CAMINO.MAJORCA 98.0
## 6 month_07 95.2
## 7 month_03 94.0
## 8 month_05 82.8
## 9 hour_X13 80.4
## 10 month_09 77.9
## # ℹ 50 more rows
rf_importance %>%
slice_max(order_by = Importance, n = 15) %>%
mutate(
# case_when() evaluates every right-hand side on the full vector, so
# non-month labels would trigger "NAs introduced by coercion" warnings;
# suppressWarnings() silences them (the resulting NAs are never used)
Variable = case_when(
grepl("^month_", Variable) ~ month.abb[suppressWarnings(as.numeric(sub("month_", "", Variable)))],
grepl("^hour_X", Variable) ~ paste0(suppressWarnings(as.numeric(sub("hour_X", "", Variable))), ":00"),
TRUE ~ Variable
),
Variable = reorder(Variable, Importance)
) %>%
ggplot(aes(x = Variable, y = Importance)) +
geom_col() +
coord_flip() +
labs(
title = "Top 15 Variable Importance Scores",
x = "Predictor",
y = "Importance"
) +
theme_minimal()
The most important predictor by a large margin is curbside_inventory, indicating that a street's curbside supply is the strongest determinant of parking availability. This makes sense because more spaces mean more chances that at least one is open. Several month variables (April, January, November) also appear among the most important predictors, suggesting strong seasonal patterns in parking demand, and hour-based variables (13:00 and 21:00) are influential as well, indicating that availability varies significantly by time of day, likely reflecting daily activity cycles. Location seems to matter less, probably because parking in Isla Vista is over-occupied nearly everywhere. The only street variable in the top 15 is Camino Majorca: it is the street farthest from UCSB, with residential spaces on only one side and a park connecting to the beach on the other, so it is expected to be less occupied.
Overall, the model relies on a combination of supply (curbside inventory), temporal factors (month and hour), and spatial variation (street location) to predict parking availability, highlighting that parking demand is influenced by both when and where observations occur.
Now let's test our model on some new data! I will apply it to new observations representing different parking conditions and examine the predicted outcomes.
Say it's March 11th, finals week, and you've been studying all day; it is now 7:00 PM and you really want dinner. You want something fast and easy, so you decide to drive to downtown Isla Vista, in the 65 block, to grab a bite to eat. Where should you park? Let's check the probability that a space is available on a few streets in that area.
example_streets <- c(
"65 PARDALL RD",
"65 CORDOBA RD",
"65 MADRID RD",
"65 SEGOVIA RD"
)
inventory_lookup <- parking_data %>%
group_by(street) %>%
summarise(
curbside_inventory = as.integer(round(mean(curbside_inventory, na.rm = TRUE))),
.groups = "drop"
)
example_scenario <- tibble(
street = example_streets,
hour = 19,
day_of_week = "Mon",
month = "Mar"
) %>%
left_join(inventory_lookup, by = "street") %>%
mutate(
street = as.character(street),
hour = factor(hour, levels = sort(unique(parking_train$hour))),
day_of_week = factor(
day_of_week,
levels = levels(parking_train$day_of_week),
ordered = is.ordered(parking_train$day_of_week)
),
month = factor(
month,
levels = levels(parking_train$month),
ordered = is.ordered(parking_train$month)
),
curbside_inventory = as.integer(curbside_inventory)
)
scenario_results <- example_scenario %>%
bind_cols(predict(final_rf_fit, new_data = example_scenario, type = "prob")) %>%
arrange(desc(.pred_yes))
scenario_results
## # A tibble: 4 × 7
## street hour day_of_week month curbside_inventory .pred_no .pred_yes
## <chr> <fct> <ord> <ord> <int> <dbl> <dbl>
## 1 65 MADRID RD 19 Mon Mar 16 0.444 0.556
## 2 65 PARDALL RD 19 Mon Mar 8 0.600 0.400
## 3 65 CORDOBA RD 19 Mon Mar 17 0.615 0.385
## 4 65 SEGOVIA RD 19 Mon Mar 16 0.716 0.284
To generate these probabilities, I supplied the model with each street's average curbside inventory from the data and had it estimate the probability that parking would be available. The results suggest that 65 Madrid Rd has the highest predicted probability of parking availability (0.56), followed by 65 Pardall Rd (0.40), 65 Cordoba Rd (0.38), and 65 Segovia Rd (0.28). This shows that even within the same general area and time period, the model captures meaningful street-level differences in expected parking availability.
These results make a lot of sense: as the streets get more residential (Madrid → Pardall → Cordoba → Segovia), the probability of finding a spot falls, which is expected because around 7:00 PM many residents are home, so more residential streets tend to be more fully occupied.
As I said at the start of this project, finding parking is an unenjoyable yet common task in Isla Vista. I am happy to say that, thanks to the model's success, we now know the key locations, times, and days when it should be just a little easier.
After fitting several models to the data set, it was the random forest that performed best at predicting whether a parking space would be available. This is likely because parking availability depends on complex, nonlinear interactions between variables such as street location, time of day, day of the week, and curbside inventory. Unlike simpler models, a random forest is able to capture these interactions and patterns without relying on strict linear assumptions, leading to improved predictive performance. The other models I fitted, logistic regression and the decision tree, are both simpler and were less effective. This is likely because they assume more linear relationships between predictors and the outcome. That assumption makes them easier to interpret, but parking patterns involve so much variability in how predictors interact that a more complex model is needed.
After choosing the random forest as the best model, we were able to meaningfully determine that parking availability varies significantly across streets and times, even within the same general area. The probability of actually finding a space, however, is still often quite low, but that is expected because Isla Vista is in a parking crisis.
Some limitations of this study are that the dataset is observational and may not capture all factors influencing parking availability, such as special events, weather conditions, or time-of-day restrictions. The data was also collected only on Tuesdays, Thursdays, and Saturdays, and some times of day were barely sampled; 9:00 a.m., for example, has only four observations. These restrictions limit the model's ability to generalize to other days of the week or underrepresented time periods. Future data collection could capture more observations on those missing days and times, though that would take substantial resources, and the model still performs reasonably well even with these restrictions.
A way I would love to extend this project would be to develop a public-facing app or website based on this model. Similar to the scenario analysis presented, users could input a specific day, time, month, and location to receive an estimated probability of finding parking on selected streets. While this would not increase the actual number of available parking spaces, it could help users make more informed decisions and reduce the time spent searching for parking. In this way, the tool could improve the perceived availability of parking by guiding users toward streets with higher likelihoods of open spaces.
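The core lookup such a tool would perform is simple to sketch. The function and table below are hypothetical, with probabilities copied from the scenario analysis above; a real app would call the fitted model rather than a hard-coded table.

```r
# Hypothetical app core: rank candidate streets by the model's predicted
# probability of an open space, highest first.
rank_streets <- function(pred_table) {
pred_table[order(pred_table$pred_yes, decreasing = TRUE), ]
}

# Illustrative probabilities taken from the scenario analysis above
example <- data.frame(
street = c("65 PARDALL RD", "65 MADRID RD", "65 CORDOBA RD", "65 SEGOVIA RD"),
pred_yes = c(0.400, 0.556, 0.385, 0.284)
)
rank_streets(example)$street[1]   # the street to try first
```

Wrapping this ranking behind a day/time/month/street input form would be most of the product; the model itself stays unchanged.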
Overall, this project demonstrates how statistical learning methods can be applied to real-world urban challenges. By modeling parking availability, this analysis provides insights into how spatial and temporal factors influence congestion in Isla Vista. More broadly, it highlights the potential for data-driven tools to improve everyday decision-making and contribute to more efficient and user-friendly urban systems. I would say good luck getting that parking spot, but maybe instead of luck you just need this machine learning model!
Isla Vista Community Services District. (n.d.). Isla Vista parking study. https://islavistacsd.ca.gov/isla-vista-parking-study