Car Visualization & Natual Spline Fitting
mtcars
is a classical dataset that is used to illustrate how to use ggplot2
in some good visualization reference. Here in this blog post, I will analyze a much larger version of mtcars
dataset from TidyTuesday.
library(tidyverse)
library(tidytext)
library(splines)
library(broom)
library(scales)
car_economy <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-10-15/big_epa_cars.csv") %>%
rename(city_mpg = "city08",
highway_mpg = "highway08",
fuel_type = "fuelType",
combined_mpg = "comb08",
model_year = "year") %>%
mutate(fuel_type = fct_recode(fuel_type, Hybrid = "Regular Gas and Electricity"))
car_economy
## # A tibble: 41,804 x 83
## barrels08 barrelsA08 charge120 charge240 city_mpg city08U cityA08 cityA08U
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 15.7 0 0 0 19 0 0 0
## 2 30.0 0 0 0 9 0 0 0
## 3 12.2 0 0 0 23 0 0 0
## 4 30.0 0 0 0 10 0 0 0
## 5 17.3 0 0 0 17 0 0 0
## 6 15.0 0 0 0 21 0 0 0
## 7 13.2 0 0 0 22 0 0 0
## 8 13.7 0 0 0 23 0 0 0
## 9 12.7 0 0 0 23 0 0 0
## 10 13.2 0 0 0 23 0 0 0
## # ... with 41,794 more rows, and 75 more variables: cityCD <dbl>, cityE <dbl>,
## # cityUF <dbl>, co2 <dbl>, co2A <dbl>, co2TailpipeAGpm <dbl>,
## # co2TailpipeGpm <dbl>, combined_mpg <dbl>, comb08U <dbl>, combA08 <dbl>,
## # combA08U <dbl>, combE <dbl>, combinedCD <dbl>, combinedUF <dbl>,
## # cylinders <dbl>, displ <dbl>, drive <chr>, engId <dbl>, eng_dscr <chr>,
## # feScore <dbl>, fuelCost08 <dbl>, fuelCostA08 <dbl>, fuel_type <fct>,
## # fuelType1 <chr>, ghgScore <dbl>, ghgScoreA <dbl>, highway_mpg <dbl>, ...
The dataset has too many extraneous columns. In this analysis, I will select a number of interesting ones so that the column number can be lowered, thus easier to work with.
car_clean <- car_economy %>%
select(make, model, model_year, city_mpg, displ, highway_mpg, co2TailpipeGpm, combined_mpg, cylinders, drive, fuel_type, trany, VClass, youSaveSpend) %>%
separate(trany, into = c("transmission", "transmission_number"), sep = " ") %>%
janitor::clean_names()
car_clean
## # A tibble: 41,804 x 15
## make model model_year city_mpg displ highway_mpg co2tailpipe_gpm
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Alfa Romeo Spider Veloce 2000 1985 19 2 25 423.
## 2 Ferrari Testarossa 1985 9 4.9 14 808.
## 3 Dodge Charger 1985 23 2.2 33 329.
## 4 Dodge B150/B250 Wagon 2WD 1985 10 5.2 12 808.
## 5 Subaru Legacy AWD Turbo 1993 17 2.2 23 468.
## 6 Subaru Loyale 1993 21 1.8 24 404.
## 7 Subaru Loyale 1993 22 1.8 29 355.
## 8 Toyota Corolla 1993 23 1.6 26 370.
## 9 Toyota Corolla 1993 23 1.6 31 342.
## 10 Toyota Corolla 1993 23 1.8 30 355.
## # ... with 41,794 more rows, and 8 more variables: combined_mpg <dbl>,
## # cylinders <dbl>, drive <chr>, fuel_type <fct>, transmission <chr>,
## # transmission_number <chr>, v_class <chr>, you_save_spend <dbl>
Fuel economy from the most popular makes
car_clean %>%
filter(fuel_type != "Electricity") %>%
mutate(make = fct_lump(make, n = 10)) %>%
filter(make != "Other") %>%
mutate(make = fct_reorder(make, city_mpg, median)) %>%
ggplot(aes(city_mpg, make)) +
geom_boxplot() +
labs(x = "city MPG",
y = NULL,
title = "Top 10 Popular Makes on City MPG",
subtitle = "Electric vehicles are excluded")
car_clean %>%
filter(fuel_type != "Electricity") %>%
mutate(make = fct_lump(make, n = 10)) %>%
filter(make != "Other") %>%
mutate(make = fct_reorder(make, highway_mpg, median)) %>%
ggplot(aes(highway_mpg, make)) +
geom_boxplot() +
labs(x = "highway MPG",
y = NULL,
title = "Top 10 Popular Makes on Highway MPG",
subtitle = "Electric vehicles are excluded")
I am a bit surprised that Volkswagen is more efficient on its median MPG than Toyota.
The most energy efficient models
car_clean %>%
filter(fuel_type != "Electricity") %>%
group_by(model) %>%
slice_max(combined_mpg, n = 1, with_ties = F) %>%
ungroup() %>%
slice_max(combined_mpg, n = 30) %>%
pivot_longer(cols = contains("mpg")) %>%
mutate(name = str_replace(name, "_", " ")) %>%
mutate(model = reorder_within(model, value, name)) %>%
ggplot(aes(value, model, fill = fuel_type)) +
geom_col() +
facet_wrap(~name, scales = "free_y") +
scale_y_reordered() +
labs(x = "MPG",
y = NULL,
fill = "fuel type",
title = "30 Most Energy Efficient Models")
The most energy costly models
car_clean %>%
filter(fuel_type != "Electricity") %>%
group_by(model) %>%
slice_min(combined_mpg, n = 1, with_ties = F) %>%
ungroup() %>%
slice_min(combined_mpg, n = 30) %>%
pivot_longer(cols = contains("mpg")) %>%
mutate(name = str_replace(name, "_", " ")) %>%
mutate(model = reorder_within(model, value, name)) %>%
ggplot(aes(value, model, fill = fuel_type)) +
geom_col() +
facet_wrap(~name, scales = "free_y") +
scale_y_reordered() +
labs(x = "MPG",
y = NULL,
fill = "fuel type",
title = "30 Most Gas-Guzzeling Models")
Which models can save or cost money most?
car_clean %>%
filter(fuel_type != "Electricity") %>%
group_by(model) %>%
summarize(money_saved = mean(you_save_spend)) %>%
ungroup() %>%
mutate(save_or_spend = if_else(money_saved > 0, TRUE, FALSE)) %>%
group_by(save_or_spend) %>%
slice_max(abs(money_saved), n = 20) %>%
mutate(model = fct_reorder(model, money_saved)) %>%
ggplot(aes(money_saved, model, fill = save_or_spend)) +
geom_col() +
labs(x = "money saved",
y = NULL,
title = "Top 20 models for $ saved/spent on average") +
theme(legend.position = "none") +
scale_x_continuous(labels = dollar)
Compared to the models that help save money, those luxury vehicles can cost much more.
Analyzing Toyota
I am a loyal customer of Toyota, which is famous for its reliability, durability, and energy efficiency. Here in this section I will analyze Toyota vehicles presented in the dataset in particular.
toyota <- car_clean %>%
filter(make == "Toyota")
toyota
## # A tibble: 2,109 x 15
## make model model_year city_mpg displ highway_mpg co2tailpipe_gpm
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Toyota Corolla 1993 23 1.6 26 370.
## 2 Toyota Corolla 1993 23 1.6 31 342.
## 3 Toyota Corolla 1993 23 1.8 30 355.
## 4 Toyota Corolla 1993 23 1.8 30 342.
## 5 Toyota Camry 1993 18 2.2 26 423.
## 6 Toyota Camry 1993 19 2.2 27 404.
## 7 Toyota Camry 1993 16 3 22 468.
## 8 Toyota Camry 1993 16 3 22 494.
## 9 Toyota Corolla Wagon 1993 23 1.8 30 355.
## 10 Toyota Corolla Wagon 1993 23 1.8 30 342.
## # ... with 2,099 more rows, and 8 more variables: combined_mpg <dbl>,
## # cylinders <dbl>, drive <chr>, fuel_type <fct>, transmission <chr>,
## # transmission_number <chr>, v_class <chr>, you_save_spend <dbl>
As two of the most popular sedans in America for more than 30 years, Corolla and Camry are worth of being singled out for visualization purposes.
toyota %>%
filter(model %in% c("Corolla", "Camry")) %>%
group_by(model, model_year, transmission) %>%
mutate(avg_mpg = mean(combined_mpg)) %>%
ungroup() %>%
mutate(model = fct_reorder(model, -avg_mpg, sum)) %>%
ggplot(aes(model_year, avg_mpg, color = model)) +
geom_line() +
geom_smooth(method = "lm", se = F) +
geom_point() +
labs(x = "model year",
y = "mean MPG",
color = NULL,
title = "Corolla V.S. Camry on MPG with linear model fitting line") +
scale_x_continuous(breaks = seq(1985, 2020, 5)) +
facet_wrap(~transmission)
Although declining in some years, there is a upward trend for MPG for both models. Also, Corolla has always been more energy efficient than Camry, and what is surprising is that manual transmission is not more energy efficient than its automatic counterparts for the same model.
Natural Spline Fitting
This little section is inspired by David Robinson’s code on spline fitting.
We would like to use displ
to predict combined_mpg
.
car_clean %>%
filter(fuel_type != "Electricity") %>%
ggplot(aes(displ, combined_mpg)) +
geom_point() +
labs(x = "engine displacement",
y = "combined MPG")
As we can see from the scatter plot above, there is no linear relationship. Therefore, it is not wise to use linear regression. Instead, natural spline model is used here.
Making training_set
.
training_set <- car_clean %>%
filter(fuel_type != "Electricity",
!is.na(displ)) %>%
filter(row_number() %% 5 != 0)
lm(combined_mpg ~ ns(displ, 2), data = training_set) %>%
augment(data = training_set) %>%
ggplot(aes(displ, .resid)) +
geom_point()
The residual plot above has a specific pattern, indicating the fitting is not good.
Now trying degree of freedom from 1 to 10 for the natural spline model:
augment_unnest <- tibble(df = seq(1,10)) %>%
mutate(ns_model = map(df, ~lm(combined_mpg ~ ns(displ, df = .), data = training_set)),
tidied = map(ns_model, tidy)) %>%
mutate(augmented = map(ns_model, augment, data = training_set)) %>%
unnest(augmented)
augment_unnest %>%
ggplot(aes(displ, combined_mpg)) +
geom_point() +
geom_line(aes(y = .fitted, color = factor(df)), size = 2) +
labs(x = "engine displacement",
y = "combined MPG",
color = "ns degree of freedom")
Residual plots
augment_unnest %>%
ggplot(aes(displ, .resid)) +
geom_point() +
facet_wrap(~ df) +
labs(title = "residual plot with df from 1 to 10")
It is difficult to tell which degree of freedom fits the data the best.