Web Scraping & Visualization on American Statistical Association Fellows
Mon, Nov 8, 2021
3-minute read
I came across the Wikipedia page on ASA Fellows. Since my Ph.D. advisor is an ASA Fellow, I thought it would be interesting to scrape the page with the R package rvest and then analyze the data after obtaining it.
Load the packages!
library(tidyverse)
library(rvest)
The webpage only contains the full name of each fellow and the year. To get first, middle, and last names, I need to do some data wrangling. This turned out to be a bit harder than I thought, because some fellows (for example, many with Chinese names) do not have a middle name.
Two useful functions here are na_if() and fill().
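As a small illustration of what na_if() and fill() do, here is a sketch on a toy tibble. The rows are made up and only mimic the shape of the scraped text, where year headings are mixed in with names.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Made-up stand-in for the scraped text: year headings mixed with names
toy <- tibble(value = c("1914", "Ada Alpha", "Bob Beta", "1916", "Cy Gamma"))

toy_filled <- toy %>%
  # Keep the text only where it contains a digit (a year heading)
  mutate(year = if_else(str_detect(value, "[:digit:]"), value, "Missing")) %>%
  # Turn the "Missing" placeholder into a real NA
  mutate(year = na_if(year, "Missing")) %>%
  # Carry the last seen year down onto the name rows below it
  fill(year) %>%
  mutate(year = as.numeric(year)) %>%
  # Drop the heading rows themselves
  filter(!str_detect(value, "[:digit:]"))

toy_filled
```

After these steps each name row carries the year of the heading above it. The full pipeline applies the same idea to the scraped page: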
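# Scrape the year headings and fellow names, then split each name into parts
asa <- read_html("https://en.wikipedia.org/wiki/List_of_fellows_of_the_American_Statistical_Association") %>%
  html_nodes(".div-col li , h3 .mw-headline") %>%
  html_text() %>%
  as_tibble() %>%
  # Rows containing a digit are year headings; all other rows are names
  mutate(year = if_else(str_detect(value, "[:digit:]"), value, "Missing")) %>%
  mutate(year = na_if(year, "Missing")) %>%
  # Carry each year heading down to the names listed under it
  fill(year) %>%
  mutate(year = as.numeric(year)) %>%
  # Drop the heading rows, keeping only the names
  filter(!str_detect(value, "[:digit:]")) %>%
  rename(name = "value") %>%
  # First word becomes first_name; the remainder is split again below
  separate(name, into = c("first_name", "middle_last_name"), sep = " ", extra = "merge", remove = FALSE) %>%
  # fill = "left" leaves middle_name as NA when only a last name remains
  separate(middle_last_name, into = c("middle_name", "last_name"), sep = " ", fill = "left")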
asa
## # A tibble: 3,024 x 5
##    name                    first_name middle_name last_name  year
##    <chr>                   <chr>      <chr>       <chr>     <dbl>
##  1 John Lee Coulter        John       Lee         Coulter    1914
##  2 Miles Menander Dawson   Miles      Menander    Dawson     1914
##  3 Frank H. Dixon          Frank      H.          Dixon      1914
##  4 David Parks Fackler     David      Parks       Fackler    1914
##  5 Henry Walcott Farnam    Henry      Walcott     Farnam     1914
##  6 Charles Ferris Gettemy  Charles    Ferris      Gettemy    1914
##  7 Franklin Henry Giddings Franklin   Henry       Giddings   1914
##  8 Henry J. Harris         Henry      J.          Harris     1914
##  9 Edward M. Hartwell      Edward     M.          Hartwell   1914
## 10 Joseph A. Hill          Joseph     A.          Hill       1914
## # ... with 3,014 more rows
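The two-step separate() trick is easier to see on a few names. The sketch below reuses two names from the output above and adds one made-up two-part name ("Jane Doe") to show what happens when there is no middle name.

```r
library(dplyr)
library(tidyr)

# "Jane Doe" is a hypothetical name added for illustration
toy_names <- tibble(name = c("John Lee Coulter", "Frank H. Dixon", "Jane Doe"))

toy_split <- toy_names %>%
  # Step 1: peel off the first word; extra = "merge" keeps the rest together
  separate(name, into = c("first_name", "middle_last_name"),
           sep = " ", extra = "merge", remove = FALSE) %>%
  # Step 2: split the rest; fill = "left" pads the missing piece on the left,
  # so a lone last name lands in last_name and middle_name becomes NA
  separate(middle_last_name, into = c("middle_name", "last_name"),
           sep = " ", fill = "left")

toy_split
```

For "Jane Doe", middle_name is NA and last_name is "Doe". One caveat of this approach: a name with four or more parts has more pieces than the second separate() expects, and by default separate() warns and drops the extra pieces.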
asa %>%
  count(year) %>%
  ggplot(aes(year, n)) +
  geom_line() +
  geom_point() +
  scale_x_continuous(breaks = seq(1914, 2021, by = 15)) +
  labs(y = "# of ASA Fellows awarded",
       title = "How many American Statistical Association Fellows are awarded each year?")
The number of fellows awarded each year shows a generally upward trend.
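One way to put a number on "generally upward" is a rank correlation between year and count. This is only a sketch: the counts below are made up, standing in for the result of asa %>% count(year), and Spearman's rho is my choice of summary, not something computed in the original analysis.

```r
# Made-up yearly counts standing in for `asa %>% count(year)`
toy_counts <- data.frame(
  year = c(1914, 1930, 1950, 1970, 1990, 2010),
  n    = c(10, 8, 20, 25, 40, 60)
)

# Spearman's rho near 1 means the counts mostly rise with year,
# even if the increase is not steady or linear
rho <- cor(toy_counts$year, toy_counts$n, method = "spearman")
rho
```

Swapping toy_counts for the real yearly counts gives the same check on the scraped data.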
asa %>%
  mutate(decade = 10 * floor(year / 10)) %>%
  count(decade) %>%
  filter(decade != 2020) %>%
  ggplot(aes(decade, n, fill = as.factor(decade))) +
  geom_col(show.legend = FALSE) +
  scale_x_continuous(breaks = seq(1910, 2020, 10)) +
  labs(y = "# of ASA Fellows awarded",
       title = "How many American Statistical Association Fellows are awarded each decade?")
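Indeed, 2010-2019 is the largest decade in terms of the number of ASA Fellows. The 2020s are filtered out of the plot because that decade was still incomplete when the page was scraped.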
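The decade bins come from rounding each year down to a multiple of 10. A quick check of that arithmetic, wrapped in a helper I am calling decade_of() for illustration (the name is not from the original code):

```r
# Round a year down to the start of its decade, as in the mutate() step
decade_of <- function(year) 10 * floor(year / 10)

decade_of(c(1914, 1919, 1920, 2019, 2021))
# [1] 1910 1910 1920 2010 2020
```

So 1914 falls in the 1910s bin and 2021 in the 2020s bin.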
Last names
asa %>%
  mutate(last_name = fct_lump(last_name, 10)) %>%
  count(last_name) %>%
  filter(last_name != "Other") %>%
  mutate(last_name = fct_reorder(last_name, n)) %>%
  ggplot(aes(n, last_name, fill = last_name)) +
  geom_col(show.legend = FALSE) +
  labs(x = "# of ASA Fellows awarded",
       y = "last name",
       title = "Top 10 family names awarded ASA Fellows")
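To see what the fct_lump() step is doing, here is a sketch on a made-up vector of last names, keeping the top 2 instead of the top 10.

```r
library(dplyr)
library(forcats)

# Made-up last names: Smith x3, Lee x2, Cox x1, Box x1
toy <- tibble(last_name = c("Smith", "Smith", "Smith", "Lee", "Lee", "Cox", "Box"))

toy_top <- toy %>%
  # Keep the 2 most frequent names; everything else is lumped into "Other"
  mutate(last_name = fct_lump(last_name, 2)) %>%
  count(last_name) %>%
  # Drop the lumped bucket so only the top names remain
  filter(last_name != "Other")

toy_top
```

Only Smith and Lee survive, with Cox and Box folded into "Other" and then filtered out. As far as I understand the default tie handling, names tied at the cutoff are all kept, so a "top 10" chart can show more than 10 bars.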
First names
asa %>%
  mutate(first_name = fct_lump(first_name, 10)) %>%
  count(first_name) %>%
  filter(first_name != "Other") %>%
  mutate(first_name = fct_reorder(first_name, n)) %>%
  ggplot(aes(n, first_name, fill = first_name)) +
  geom_col(show.legend = FALSE) +
  labs(x = "# of ASA Fellows awarded",
       y = "first name",
       title = "Top 10 first names awarded ASA Fellows")