The Simpsons Data Visualization
Mon, Nov 22, 2021
3-minute read
In this blog post, I will analyze the Simpsons dataset from TidyTuesday about its guest stars.
library(tidyverse)
theme_set(theme_bw())
This is my first time using read_delim()
to separate |
. It is also the first time I recognize read_csv()
does not have a delim
argument.
simpsons <- readr::read_delim("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-08-27/simpsons-guests.csv", delim = "|", quote = "") %>%
rename(episode = "number") %>%
mutate(season = as.numeric(season))
simpsons
## # A tibble: 1,386 x 6
## season episode production_code episode_title guest_star role
## <dbl> <chr> <chr> <chr> <chr> <chr>
## 1 1 002–102 7G02 Bart the Genius Marcia Walla~ Edna Krabapp~
## 2 1 003–103 7G03 Homer's Odyssey Sam McMurray Worker
## 3 1 003–103 7G03 Homer's Odyssey Marcia Walla~ Edna Krabapp~
## 4 1 006–106 7G06 Moaning Lisa Miriam Flynn Ms. Barr
## 5 1 006–106 7G06 Moaning Lisa Ron Taylor Bleeding Gum~
## 6 1 007–107 7G09 The Call of the S~ Albert Brooks Cowboy Bob
## 7 1 008–108 7G07 The Telltale Head Marcia Walla~ Edna Krabapp~
## 8 1 009–109 7G11 Life on the Fast ~ Albert Brooks Jacques
## 9 1 010–110 7G10 Homer's Night Out Sam McMurray Gulliver Dark
## 10 1 011–111 7G13 The Crepes of Wra~ Christian Co~ Gendarme Off~
## # ... with 1,376 more rows
Total # of roles
simpsons %>%
mutate(guest_star = fct_lump(guest_star, n = 10)) %>%
count(guest_star, role, sort = T) %>%
filter(guest_star != "Other") %>%
group_by(guest_star) %>%
mutate(total_guest_roles = n()) %>%
ungroup() %>%
distinct(guest_star, total_guest_roles, .keep_all = T) %>%
mutate(guest_star = fct_reorder(guest_star, total_guest_roles)) %>%
ggplot(aes(total_guest_roles, guest_star, fill = guest_star)) +
geom_col(show.legend = F) +
labs(x = "# of roles in total",
y = "",
title = "Top 10 Popular Guest Stars with Most Roles")
The most crowded episodes
simpsons %>%
filter(season != "Movie") %>%
group_by(episode_title) %>%
mutate(num_of_episode_guests = n(),
season = as.character(season)) %>%
arrange(desc(num_of_episode_guests)) %>%
distinct(episode_title, num_of_episode_guests, .keep_all = T) %>%
ungroup() %>%
head(50) %>%
mutate(episode_title = fct_reorder(episode_title, num_of_episode_guests)) %>%
ggplot(aes(num_of_episode_guests, episode_title, fill = season)) +
geom_col() +
labs(x = "# of guest stars",
y = NULL,
title = "The Episodes with Most Guest Stars")
Working on self
The dataset has Himself
, Herself
etc. on the role
column. Merging guest_star
to role
is helpful for further analysis.
separate_rows()
is used with \\s
regular expression.
role_processed <- simpsons %>%
separate_rows(role, sep = ";\\s+") %>%
mutate(guest_star = if_else(guest_star == "\"Weird Al\" Yankovic", "Weird Al Yankovic", guest_star)) %>%
mutate(self = if_else(str_detect(role, "self"), "self", "not self"),
role = if_else(str_detect(role, "self"), guest_star, role))
role_processed %>%
add_count(role) %>%
filter(n > 6) %>%
distinct(role, .keep_all = T) %>%
mutate(guest_star = fct_reorder(guest_star, n, sum)) %>%
ggplot(aes(n, guest_star, fill = role)) +
geom_col() +
labs(x = "role count",
y = "guest star",
title = "Guest Stars & Their Role Count")
How many stars play themselves in Simpsons?
role_processed %>%
filter(self == "self") %>%
mutate(guest_star = str_remove(guest_star, '\"')) %>%
count(role, sort = T) %>%
filter(n > 1) %>%
mutate(role = fct_reorder(role, n)) %>%
ggplot(aes(n, role)) +
geom_col() +
labs(x = "role count",
title = "# of Times Guest Stars Play by Themselves")
Episode Count
simpsons %>%
distinct(episode_title, .keep_all = T) %>%
count(season) %>%
ggplot(aes(season, n)) +
geom_line() +
geom_point() +
scale_x_continuous(breaks = seq(1,33)) +
labs(y = "# of episodes in dataset",
title = "# of Episodes in Each Season",
subtitle = "The true # of episodes might not be reflected on the dataset")