Data Visualization and Sentiment Analysis On Trending Youtube Statistics

Sat, Aug 7, 2021 4-minute read

Data Introduction & Processing

This dataset is from Kaggle, and it is about the trending Youtube statistics in the U.S. with more than 40000 rows and 16 columns.

After loading the tidyverse library, we can take a peek on what the dataset looks like.

library(tidyverse)
yb <- read_csv("data\\USvideos.csv")
head(yb)
## # A tibble: 6 x 16
##   video_id  trending_date title    channel_title category_id publish_time       
##   <chr>     <chr>         <chr>    <chr>               <dbl> <dttm>             
## 1 2kyS6SvS~ 17.14.11      WE WANT~ CaseyNeistat           22 2017-11-13 17:13:01
## 2 1ZAPwfrt~ 17.14.11      The Tru~ LastWeekToni~          24 2017-11-13 07:30:00
## 3 5qpjK5Dg~ 17.14.11      Racist ~ Rudy Mancuso           23 2017-11-12 19:05:24
## 4 puqaWrEC~ 17.14.11      Nickelb~ Good Mythica~          24 2017-11-13 11:00:04
## 5 d380meD0~ 17.14.11      I Dare ~ nigahiga               24 2017-11-12 18:01:41
## 6 gHZ1Qz0K~ 17.14.11      2 Weeks~ iJustine               28 2017-11-13 19:07:23
## # ... with 10 more variables: tags <chr>, views <dbl>, likes <dbl>,
## #   dislikes <dbl>, comment_count <dbl>, thumbnail_link <chr>,
## #   comments_disabled <lgl>, ratings_disabled <lgl>,
## #   video_error_or_removed <lgl>, description <chr>

While exploring the dataset, one thing we have noticed is that category_id column should be a factor column rather than a double one, as we need to group by this column to see the average statistics category-wise.

yb$category_id <- factor(yb$category_id)

Data Visualization

Now it would be interesting to visualize the average likes and dislikes etc. across different categories.

yb %>% group_by(category_id) %>%
  summarize(`Average Views` = mean(views), `Average Likes` = mean(likes), `Average Dislikes` = mean(dislikes), `Average Number Of Comments` = mean(comment_count)) %>%
  pivot_longer(!category_id, names_to = "popularity", values_to = "Counts") %>%
  ggplot(aes(category_id, Counts, fill = category_id)) +
  geom_bar(stat = "identity") +
  facet_wrap(~popularity, scales = "free") +
  theme_bw() +
  theme(
    strip.text = element_text(size = 16),
    legend.position = "none",
    axis.ticks = element_blank(),
    axis.title = element_text(size = 15),
    axis.text = element_text(size = 13),
    plot.title = element_text(size = 18)
  ) +
  ggtitle("Category-wise Trending Youtube Statistics")

Although the dataset does not say specifically what each category_id corresponds to which video topic, but based on the bar plots above, category_id = 29 attracted much more attention (likes, dislikes, comments) on average yet the average views do not stand out at all. It seems like youtube viewers do not watch these videos but give their opinions excessively!

Let’s dive into this specific category.

yb %>% filter(category_id == 29) %>%
  ggplot(aes(row.names(yb %>% filter(category_id == 29)), comment_count)) +
    geom_bar(stat = "identity") +
  labs(x = "video index", y = "comment counts")

As we can see, there are few outliers in category_id = 29 that have attracted a slew of attentions and comments, yet the rest of them are barely watched by viewers.

Sentiment Analysis

There are two columns (tags and description) in the dataset where we can apply sentiment analysis. Since videos are categorized by their category ID, it is worth to see what sentiment each video category possesses by using three sentiment dictionaries (AFINN, Bing, NRC) built in the sentiment library tidytext.

library(tidytext)

Here we write a function that can generate bar charts for sentiment score grouped by category_id faceted by each dictionary.

sentiment_barplots <- function(text_column, plot_title){
  text_column = enquo(text_column)
  bind_rows(
  yb %>% select(category_id, !!text_column)%>%
    unnest_tokens(word, !!text_column) %>%
    anti_join(stop_words) %>% 
    inner_join(get_sentiments("afinn"))%>% 
    group_by(category_id) %>% 
    summarize(score = sum(value)) %>%
    mutate(method = "AFINN"), 
   bind_rows(
    yb%>% select(category_id, !!text_column) %>%
      unnest_tokens(word, !!text_column) %>%
      anti_join(stop_words) %>%
      inner_join(get_sentiments("bing")) %>% 
      group_by(category_id) %>% 
      count(sentiment) %>%
      pivot_wider(names_from = sentiment,
                  values_from = n,
                  values_fill = 0) %>%
      mutate(score = positive - negative) %>%
      mutate(method = "Bing et al."), 
    yb%>% select(category_id, !!text_column) %>%
      unnest_tokens(word, !!text_column) %>%
      anti_join(stop_words) %>%
      inner_join(get_sentiments("nrc")) %>% 
      group_by(category_id) %>% 
      count(sentiment) %>%
      pivot_wider(names_from = sentiment,
                  values_from = n,
                  values_fill = 0) %>%
      mutate(score = positive - negative) %>%
      select(category_id, negative, positive, score) %>%
      mutate(method = "NRC")
  ) %>%
    select(category_id, score, method)
) %>%
  ggplot(aes(category_id, score)) +
  geom_bar(aes(fill= method), stat = "identity") +
  facet_wrap(~method, ncol = 1, scales = "free") +
  theme_bw() +
  labs(x = "Video Category", y = "Sentiment Score", title = plot_title) +
  theme(legend.position = "none",
        strip.text = element_text(size = 16),
        axis.title = element_text(size = 16),
        axis.text = element_text(size = 15),
        axis.ticks = element_blank(),
        plot.title = element_text(size = 18))
}
sentiment_barplots(tags, "Tags Sentiment Score")

sentiment_barplots(description, "Description Sentiment Score")

Sentiment Analysis Conclusion

When looking at Tag Sentiment Score barplot, an interesting thing to note is that bing dictionary outputs some negative sentiment scores for some categories, yet the other two only have non-negative scores. There is some inconsistency in this regard. In other words, using tags as a way to analyze video sentiment might not be a good idea due to the fact that tags are usually short and do not cover the whole picture of videos try to convey to viewers. Video description, however, offers a good consistency among all three dictionaries in a way that the sentiment scores are highly analogous.