African-American Achievements Data Processing & Visualization

Wed, Jan 12, 2022 5-minute read

This blog post analyzes the African-American achievements data sets from TidyTuesday. The data sets are webscraped from Wikipeida and some preprocessing steps are needed in order to visualize the data.

library(tidyverse)
library(tidytext)
library(widyr)
library(ggraph)
library(igraph)
theme_set(theme_bw())
firsts <- read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-06-09/firsts.csv') %>%
  mutate(first_accomplishment = str_trim(str_remove_all(accomplishment, "First|African.?American"), "left"),
         decade = 10 * floor(year/10))
  
science <- read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-06-09/science.csv')

firsts
## # A tibble: 479 x 7
##     year accomplishment     person     gender  category first_accomplish~ decade
##    <dbl> <chr>              <chr>      <chr>   <chr>    <chr>              <dbl>
##  1  1738 First free Africa~ Gracia Re~ Africa~ Social ~ free  community     1730
##  2  1760 First known Afric~ Jupiter H~ Female~ Arts & ~ known  published~   1760
##  3  1768 First known Afric~ Wentworth~ Africa~ Social ~ known  to be ele~   1760
##  4  1773 First known Afric~ Phillis W~ Female~ Arts & ~ known  woman to ~   1770
##  5  1773 First separate Af~ Silver Bl~ Africa~ Religion separate  church    1770
##  6  1775 First African-Ame~ Prince Ha~ Africa~ Social ~ to join the Free~   1770
##  7  1778 First African-Ame~ the 1st R~ Africa~ Military U.S. military re~   1770
##  8  1783 First African-Ame~ James Der~ Africa~ Educati~ to formally prac~   1780
##  9  1785 First African-Ame~ Rev. Lemu~ Africa~ Religion ordained as a Ch~   1780
## 10  1792 First major Afric~ 3,000 Bla~ Africa~ Social ~ major  Back-to-A~   1790
## # ... with 469 more rows
science
## # A tibble: 120 x 7
##    name    birth death occupation_s    inventions_accompli~ references links    
##    <chr>   <dbl> <dbl> <chr>           <chr>                <chr>      <chr>    
##  1 Amos, ~  1918  2003 Microbiologist  First African-Ameri~ 6,         https://~
##  2 Alcorn~  1940    NA Physicist; inv~ Invented a method o~ 7,8,       https://~
##  3 Andrew~  1930  1998 Mathematician   Put forth the Andre~ 9,         https://~
##  4 Alexan~  1888  1958 Civil engineer  Responsible for the~ <NA>       https://~
##  5 Bailey~  1825  1918 Inventor        Folding bed          10,        https://~
##  6 Ball, ~  1892  1916 Chemist         Extracted chaulmoog~ 11,        https://~
##  7 Bannek~  1731  1806 Almanac author~ Constructed wooden ~ 12,        https://~
##  8 Banyag~  1947    NA Mathematician   Work on diffeomorph~ 13,        https://~
##  9 Bashen~  1957    NA Inventor; entr~ First African-Ameri~ 14,        https://~
## 10 Bath, ~  1942  2019 Ophthalmologist First African-Ameri~ 15,16,17,  https://~
## # ... with 110 more rows

Working on the firsts tibble

firsts %>%
  group_by(decade, category) %>%
  summarize(n = n()) %>%
  ungroup() %>%
  ggplot(aes(decade, n, fill = category)) +
  geom_col() +
  scale_x_continuous(breaks = seq(1700, 2000, 50)) +
  scale_y_continuous(breaks = seq(0,60,10)) +
  labs(y = "# of first achievement",
       title = "Category-wise first African American achievements from 1730s to 2010s")

African-American accomplishment description TF-IDF among all categories

firsts %>%
  unnest_tokens(word, first_accomplishment) %>%
  anti_join(stop_words) %>%
  filter(!str_detect(word, "[:digit:]")) %>%
  count(word, category) %>%
  bind_tf_idf(word, category, n) %>%
  group_by(category) %>%
  slice_max(tf_idf, n = 10) %>%
  ungroup() %>%
  mutate(word = reorder_within(word, tf_idf, category)) %>%
  ggplot(aes(tf_idf, word, fill = category)) +
  geom_col() +
  scale_y_reordered() +
  facet_wrap(~category, scales = "free_y", ncol = 4) +
  theme(legend.position = "none",
        strip.text = element_text(size = 12),
        axis.text.y = element_text(size = 12)) +
  labs(x = "TF-IDF",
       y = NULL,
       title = "The 10 largest categorical TF-IDF values")

Now let's work on science.

science_processed <- science %>%
  separate(name, into = c("last_name", "first_name"), sep = ", ") %>%
  separate_rows(occupation_s, sep = ";\\s+") %>%
  rename(occuption = "occupation_s") %>%
  mutate(occuption = str_to_title(occuption),
         full_name = paste(first_name, last_name),
         alive_or_dead = if_else(is.na(death), "Alive", "Dead"),
         age = if_else(is.na(death), 2022 - birth, death - birth)) %>%
  select(-c(references, links))

science_processed
## # A tibble: 164 x 9
##    last_name first_name    birth death occuption      inventions_acco~ full_name
##    <chr>     <chr>         <dbl> <dbl> <chr>          <chr>            <chr>    
##  1 Amos      Harold         1918  2003 Microbiologist First African-A~ Harold A~
##  2 Alcorn    George Edward  1940    NA Physicist      Invented a meth~ George E~
##  3 Alcorn    George Edward  1940    NA Inventor       Invented a meth~ George E~
##  4 Andrews   James J.       1930  1998 Mathematician  Put forth the A~ James J.~
##  5 Alexander Archie         1888  1958 Civil Engineer Responsible for~ Archie A~
##  6 Bailey    Leonard C.     1825  1918 Inventor       Folding bed      Leonard ~
##  7 Ball      Alice Augusta  1892  1916 Chemist        Extracted chaul~ Alice Au~
##  8 Banneker  Benjamin       1731  1806 Almanac Author Constructed woo~ Benjamin~
##  9 Banneker  Benjamin       1731  1806 Surveyor       Constructed woo~ Benjamin~
## 10 Banneker  Benjamin       1731  1806 Farmer         Constructed woo~ Benjamin~
## # ... with 154 more rows, and 2 more variables: alive_or_dead <chr>, age <dbl>
science_processed %>%
  mutate(occuption = fct_lump(occuption, n = 10)) %>%
  filter(!is.na(occuption),
         age < 100) %>%
  ggplot(aes(age, occuption, fill = alive_or_dead)) +
  geom_boxplot() +
  scale_x_continuous(breaks = seq(20, 100, 20)) +
  labs(y = NULL,
       fill = NULL,
       title = "Age among all people presented in science_processed",
       subtitle = "The age of people who fall into the Alive category is computed by using 2022 - their birth year") +
  theme(axis.text = element_text(size = 10),
        axis.title = element_text(size = 13),
        plot.title = element_text(size = 15))

set.seed(2022)

science_processed %>%
  add_count(occuption) %>%
  filter(n > 2) %>%
  pairwise_cor(occuption, full_name, sort = T) %>% 
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(color = correlation > 0, width  = abs(correlation))) +
  geom_node_point() +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1, size = 5) +
  theme_void() +   
  guides(edge_width = "none") +
  ggtitle("How are different occuptions linked together?") +
  theme(plot.title = element_text(size = 15))

It is very interesting to see how the occuption(s) each person has are linked together. Statistican and mathematician are strongly positively correlated, but it seems like mathematician is slightly and negatively correlated with social scientist. You can come up with many interesting findings from the graph above!