Horror Movies Data Visualization with Text Mining and LASSO Model

Thu, Sep 2, 2021 7-minute read

In this blog post, we will analyze a horror movie dataset from TidyTuesday by visualizing some interesting metrics and by using text mining and applying LASSO model on predicting movie ratings.

Data Introduction and Processing

First, load the necessary libraries we need for the analysis.

library(tidyverse)
library(lubridate)
library(tidytext)
library(broom)
library(glmnet)
library(Matrix)
library(scales)

Then, the dataset is loaded and we can see the first few rows.

movies <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-10-22/horror_movies.csv")

head(movies)
## # A tibble: 6 x 12
##   title        genres    release_date release_country movie_rating review_rating
##   <chr>        <chr>     <chr>        <chr>           <chr>                <dbl>
## 1 Gut (2012)   Drama| H~ 26-Oct-12    USA             NA                     3.9
## 2 The Hauntin~ Horror    13-Jan-17    USA             NA                    NA  
## 3 Sleepwalkin~ Horror    21-Oct-17    Canada          NA                    NA  
## 4 Treasure Ch~ Comedy| ~ 23-Apr-13    USA             NOT RATED              3.7
## 5 Infidus (20~ Crime| D~ 10-Apr-15    USA             NA                     5.8
## 6 In Extremis~ Horror| ~ 2017         UK              NA                    NA  
## # ... with 6 more variables: movie_run_time <chr>, plot <chr>, cast <chr>,
## #   language <chr>, filming_locations <chr>, budget <chr>

Then use skim() from skimr package to give an overview of the dataset.

skimr::skim(movies)
Table 1: Data summary
Name movies
Number of rows 3328
Number of columns 12
_______________________
Column type frequency:
character 11
numeric 1
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
title 0 1.00 8 62 0 3303 0
genres 1 1.00 6 79 0 261 0
release_date 0 1.00 4 9 0 1332 0
release_country 0 1.00 2 20 0 72 0
movie_rating 1877 0.44 1 9 0 11 0
movie_run_time 544 0.84 6 7 0 113 0
plot 1 1.00 20 547 0 3310 0
cast 14 1.00 11 308 0 3296 0
language 71 0.98 4 64 0 187 0
filming_locations 1232 0.63 2 125 0 1171 0
budget 2094 0.37 2 18 0 417 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
review_rating 252 0.92 5.08 1.47 1 4 5 6.1 9.8 ▁▆▇▃▁

Keep in mind that there are some missing values in the dataset, and this means na.rm = TRUE is needed when summary results related to the missing-value column go awry.

Another thing we have noticed is that release_date column is a char rather than date column. Also, its formats are not consistent in a way that some entries are date with year, month and day, but others only include year. To deal with this, using parse_date_time is an ideal option.

movies$release_date <- parse_date_time(movies$release_date, 
         orders=guess_formats(movies$release_date, c('dmy', 'y')))

budget is not a numeric column rather than a char string.

movies$budget <- parse_number(movies$budget)

Data Visualization

movies %>%
  mutate(year = year(release_date)) %>%
  count(release_country, year, sort = TRUE) %>%
  mutate(release_country = fct_lump(release_country, n = 15, w = n),
         release_country = fct_reorder(release_country, n)) %>%
  ggplot(aes(n, release_country, fill = release_country)) +
  geom_col() +
  facet_wrap(~year) +
  theme_bw() +
  theme(
    strip.text = element_text(face = "bold"),
    legend.position = "none"
  ) +
  labs(x = "# of horror movies", y = NULL, title = "Yearly # of Horror Movies Produced by Each Country")

movies %>%
  mutate(year = year(release_date)) %>%
  count(language, year, sort = TRUE) %>%
  mutate(language = fct_lump(language, n = 15, w = n),
         language = fct_reorder(language, n)) %>%
  ggplot(aes(n, language, fill = language)) +
  geom_col() +
  facet_wrap(~year) +
  theme_bw() +
  theme(
    strip.text = element_text(face = "bold"),
    legend.position = "none"
  ) +
  labs(x = "# of horror movies", y = NULL, title = "Yearly # of Horror Movies Produced in Each Language")

It looks like from 2012 to 2017, USA produced the most number of horror movies in each year compared to other countries, and English horror movies were the most in all these years. This is consistent, as movies produced by USA are primarily English.

movies %>%
  mutate(release_country = fct_lump(release_country, 15)) %>%
  ggplot(aes(release_country, budget)) +
  geom_boxplot() +
  coord_flip() +
  theme_bw() +
  theme(
    axis.title.y = element_blank()
  ) +
  scale_y_continuous(labels = dollar)

The above box plot is not informative in a way that a salient outlier overshadows all other data points. In such a case, using a log transformation can give us more information about movie budgets related to each country.

movies %>%
  mutate(release_country = fct_lump(release_country, 20),
         release_country = fct_reorder(release_country, log10(budget), median, na.rm = TRUE)) %>%
  ggplot(aes(release_country, log10(budget), fill = release_country)) +
  geom_boxplot() +
  coord_flip() +
  theme_bw() +
  theme(
    legend.position = "none"
  ) +
  labs(y = "budget (log10-scale)", x = NULL, title = "20 Country-wise Movie Budget")+
  scale_y_continuous(labels = dollar)

movies %>%
  ggplot(aes(budget, review_rating)) +
  geom_point() +
  scale_x_log10(labels = dollar) +
  geom_smooth(method = "lm") +
  theme_bw() +
  labs(x = "budget (log10 scale)")

Interestingly enough, there is no relationship between budget and review_rating. This is counter-intuitive, as investing more budget on a movie should generally have better ratings. However, this is not the case based on the plot shown above.

Text Analysis and Lasso Model

With a few minor changes, all the code below is largely referenced from David Robinson’s Github page for learning purposes.

movies_processed <- movies %>%
  separate(plot, c("director", "cast_sentence", "plot"), extra = "merge", sep = "\\. ", fill = "right") %>%
  filter(!is.na(review_rating)) %>%
  mutate(director = str_remove(director, "Directed by "))
movies_unnested <- movies_processed %>%
  unnest_tokens(word, plot) %>%
  anti_join(stop_words, by = "word") 

movies_unnested %>%
  group_by(word) %>%
  summarize(movies = n(),
            avg_rating = mean(review_rating, na.rm = TRUE)) %>%
  arrange(desc(movies)) %>%
  filter(movies > 100) %>%
  mutate(word = fct_reorder(word, avg_rating)) %>%
  ggplot(aes(avg_rating, word)) +
  geom_point() +
  theme_bw() +
  labs(x = "average rating", y = "description word", title = "Description Words and Movie Ratings",
       subtitle = "Word apeared in movie description and average rating of all movies whose description includes the word")
## `summarise()` ungrouping output (override with `.groups` argument)

This is my first time using cast_sparse for making a document-term matrix (or DTM). For more information about DTM, please visit Text Mining with R or Wikipedia

movie_word_matrix <- movies_unnested %>%
  filter(!is.na(review_rating)) %>%
  add_count(word) %>%
  filter(n >= 20) %>%
  count(title, word) %>%
  cast_sparse(title, word, n)

rating <- movies$review_rating[match(rownames(movie_word_matrix), movies$title)]

lasso_model <- cv.glmnet(movie_word_matrix, rating)

plot(lasso_model)

tidy(lasso_model$glmnet.fit) %>%
  filter(lambda == lasso_model$lambda.min,
         term != "(Intercept)") %>%
  mutate(term = fct_reorder(term, estimate)) %>%
  ggplot(aes(term, estimate)) +
  geom_col() +
  coord_flip() +
  ggtitle("Words Contributing to Movie Ratings")

Now, we can select a few words from the above plot to visualize their relations between estimates and lambda values.

tidy(lasso_model$glmnet.fit) %>% 
  filter(term %in% c("quickly", "seek", "army", "teacher", "unexpected", "friends", "evil")) %>%
  ggplot(aes(lambda, estimate, color = term)) +
  geom_line() +
  geom_vline(xintercept = lasso_model$lambda.min) +
  geom_hline(yintercept = 0, lty = 2)

Now we throw all features into a LASSO model and see how each feature contributes to movie rating.

# The following code David Robinson used gather() function, but pivot_longer() should be more contemporary, #though achieving the same goal.
features <- movies_processed %>%
  filter(!is.na(review_rating)) %>%
  select(title, genres, director, cast, movie_rating, language, release_country) %>%
  mutate(director = str_remove(director, "Directed by ")) %>%
  pivot_longer(!title, names_to = "type", values_to = "value") %>%
  #gather(type, value, -title) %>% View()
  filter(!is.na(value)) %>%
  separate_rows(value, sep = "\\| ?") %>%
  unite(feature, type, value, sep = ": ") %>%
  mutate(n = 1)

movie_feature_matrix <- movies_unnested %>%
  filter(!is.na(review_rating)) %>%
  count(title, feature = paste0("word: ", word)) %>%
  bind_rows(features) %>%
  add_count(feature) %>%
  filter(nn >= 10) %>%
  cast_sparse(title, feature)
## Storing counts in `nn`, as `n` already present in input
## i Use `name = "new_name"` to pick a new name.
rating <- movies$review_rating[match(rownames(movie_feature_matrix), movies$title)]
feature_lasso_model <- cv.glmnet(movie_feature_matrix, rating)

plot(feature_lasso_model)

tidy(feature_lasso_model$glmnet.fit) %>%
  filter(lambda == feature_lasso_model$lambda.1se,
         term != "(Intercept)") %>%
  mutate(term = fct_reorder(term, estimate)) %>%
  ggplot(aes(term, estimate)) +
  geom_col() +
  coord_flip() +
  labs(x = "",
       y = "Coefficient for predicting movie rating",
       title = "What affects a horror movie rating?",
       subtitle = "Based on a lasso regression to predict IMDb ratings of ~3000 movies")

language:Kannada contributes the most to movie ratings in a positive way and word:shark in a negative way.

Conclusion

In this blog post, we analyzed the horror movie dataset from TidyTuesday through visualization and LASSO model predicting movie rating based on a number of features. The notion of document-term matrix is a tad difficult to use, especially combining it with cv.glmnet(). The LASSO model is a highly useful and practical prediction model that gives us great flexibility, and through this post, it can shed some light on how to use the model and we can follow the similar steps in the future when encountering analogous prediction problems.