Mask Wearing Percentage in Each State

Wed, Aug 25, 2021 7-minute read

The New York Times has released data related to mask-wearing (link is here), and in this dataset, it provides the estimated prevalence of the frequency of people wearing mask in each county of every state. County residents are divided into 5 categories in terms of mask-wearing frequency: never, rarely, sometimes, frequently, always. For each county, the prevalances of these aforementioned categories add up to 1, which is problematic to our analysis in a way that each county has a different number of population and it is unlikely to level them up to the state level without having the actual number of residents in each of the categories and without taking the county population into consideration. The goal of this project is to triangulate different datasets and then to obtain a mask-wearing frequency percent in each category in every state based on data.

Data Introduction

First off, let’s load the libraries we need in this project.

library(tidyverse)
library(vroom)
library(readxl)
library(scales)
library(geofacet)

Then, three datasets are loaded. Two of which are from the New York Times Github repo, and the county population dataset is downloaded from USDA (here is the download link).

mask <- read_csv("https://raw.githubusercontent.com/nytimes/covid-19-data/master/mask-use/mask-use-by-county.csv")
mask <- janitor::clean_names(mask)

county <- vroom("https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv")

pop <- read_xls("PopulationEstimates.xls", skip = 2)
pop <- janitor::clean_names(pop)
head(mask)
## # A tibble: 6 x 6
##   countyfp never rarely sometimes frequently always
##   <chr>    <dbl>  <dbl>     <dbl>      <dbl>  <dbl>
## 1 01001    0.053  0.074     0.134      0.295  0.444
## 2 01003    0.083  0.059     0.098      0.323  0.436
## 3 01005    0.067  0.121     0.12       0.201  0.491
## 4 01007    0.02   0.034     0.096      0.278  0.572
## 5 01009    0.053  0.114     0.18       0.194  0.459
## 6 01011    0.031  0.04      0.144      0.286  0.5
head(county)
## # A tibble: 6 x 6
##   date       county    state      fips  cases deaths
##   <date>     <chr>     <chr>      <chr> <dbl>  <dbl>
## 1 2020-01-21 Snohomish Washington 53061     1      0
## 2 2020-01-22 Snohomish Washington 53061     1      0
## 3 2020-01-23 Snohomish Washington 53061     1      0
## 4 2020-01-24 Cook      Illinois   17031     1      0
## 5 2020-01-24 Snohomish Washington 53061     1      0
## 6 2020-01-25 Orange    California 06059     1      0
head(pop)
## # A tibble: 6 x 165
##   fip_stxt state area_name  rural_urban_cont~ rural_urban_cont~ urban_influence~
##   <chr>    <chr> <chr>                  <dbl>             <dbl>            <dbl>
## 1 00000    US    United St~                NA                NA               NA
## 2 01000    AL    Alabama                   NA                NA               NA
## 3 01001    AL    Autauga C~                 2                 2                2
## 4 01003    AL    Baldwin C~                 4                 3                5
## 5 01005    AL    Barbour C~                 6                 6                6
## 6 01007    AL    Bibb Coun~                 1                 1                1
## # ... with 159 more variables: urban_influence_code_2013 <dbl>,
## #   economic_typology_2015 <dbl>, census_2010_pop <dbl>,
## #   estimates_base_2010 <dbl>, pop_estimate_2010 <dbl>,
## #   pop_estimate_2011 <dbl>, pop_estimate_2012 <dbl>, pop_estimate_2013 <dbl>,
## #   pop_estimate_2014 <dbl>, pop_estimate_2015 <dbl>, pop_estimate_2016 <dbl>,
## #   pop_estimate_2017 <dbl>, pop_estimate_2018 <dbl>, pop_estimate_2019 <dbl>,
## #   n_pop_chg_2010 <dbl>, n_pop_chg_2011 <dbl>, n_pop_chg_2012 <dbl>,
## #   n_pop_chg_2013 <dbl>, n_pop_chg_2014 <dbl>, n_pop_chg_2015 <dbl>,
## #   n_pop_chg_2016 <dbl>, n_pop_chg_2017 <dbl>, n_pop_chg_2018 <dbl>,
## #   n_pop_chg_2019 <dbl>, births_2010 <dbl>, births_2011 <dbl>,
## #   births_2012 <dbl>, births_2013 <dbl>, births_2014 <dbl>, births_2015 <dbl>,
## #   births_2016 <dbl>, births_2017 <dbl>, births_2018 <dbl>, births_2019 <dbl>,
## #   deaths_2010 <dbl>, deaths_2011 <dbl>, deaths_2012 <dbl>, deaths_2013 <dbl>,
## #   deaths_2014 <dbl>, deaths_2015 <dbl>, deaths_2016 <dbl>, deaths_2017 <dbl>,
## #   deaths_2018 <dbl>, deaths_2019 <dbl>, natural_inc_2010 <dbl>,
## #   natural_inc_2011 <dbl>, natural_inc_2012 <dbl>, natural_inc_2013 <dbl>,
## #   natural_inc_2014 <dbl>, natural_inc_2015 <dbl>, natural_inc_2016 <dbl>,
## #   natural_inc_2017 <dbl>, natural_inc_2018 <dbl>, natural_inc_2019 <dbl>,
## #   international_mig_2010 <dbl>, international_mig_2011 <dbl>,
## #   international_mig_2012 <dbl>, international_mig_2013 <dbl>,
## #   international_mig_2014 <dbl>, international_mig_2015 <dbl>,
## #   international_mig_2016 <dbl>, international_mig_2017 <dbl>,
## #   international_mig_2018 <dbl>, international_mig_2019 <dbl>,
## #   domestic_mig_2010 <dbl>, domestic_mig_2011 <dbl>, domestic_mig_2012 <dbl>,
## #   domestic_mig_2013 <dbl>, domestic_mig_2014 <dbl>, domestic_mig_2015 <dbl>,
## #   domestic_mig_2016 <dbl>, domestic_mig_2017 <dbl>, domestic_mig_2018 <dbl>,
## #   domestic_mig_2019 <dbl>, net_mig_2010 <dbl>, net_mig_2011 <dbl>,
## #   net_mig_2012 <dbl>, net_mig_2013 <dbl>, net_mig_2014 <dbl>,
## #   net_mig_2015 <dbl>, net_mig_2016 <dbl>, net_mig_2017 <dbl>,
## #   net_mig_2018 <dbl>, net_mig_2019 <dbl>, residual_2010 <dbl>,
## #   residual_2011 <dbl>, residual_2012 <dbl>, residual_2013 <dbl>,
## #   residual_2014 <dbl>, residual_2015 <dbl>, residual_2016 <dbl>,
## #   residual_2017 <dbl>, residual_2018 <dbl>, residual_2019 <dbl>,
## #   gq_estimates_base_2010 <dbl>, gq_estimates_2010 <dbl>,
## #   gq_estimates_2011 <dbl>, gq_estimates_2012 <dbl>, gq_estimates_2013 <dbl>,
## #   gq_estimates_2014 <dbl>, ...

mask dataframe only provides county FIPS countyfp and relative frequency of each mask-wearing category, but there is no information about county and state name, which is why we need county dataframe to act as a crosswalk to add extra county and state information to mask dataframe. pop has a slew of columns, yet what we need is FIPS code, state name and 2010 census county population. Of course, the 2010 population might be outdated and the dataset provides the yearly estimated population number, yet for this project, we still use 2010 census data, since it is the official count of each county rather than the estimated one, and the latest 2020 census data has not been released as of when the analysis is done.

Data Processing

Some data cleaning on each dataset is needed and then a number of joins are carried out.

# selecting three relevant columns from pop
pop <- pop %>% 
  rename(countyfp = fip_stxt) %>%
  select(countyfp, state, census_2010_pop)

# Selecting three relevant columns and get the distinct rows without repetition
county <- county %>%
  select(county, state, fips) %>%
  distinct()

county and pop are inner joined to mask, and then pivot_longer() is needed before obtaining the estimated number of residents num_of_people in each category in each county.

mask <- mask %>%
  inner_join(county, by = c("countyfp" = "fips"))

mask <- mask %>%
  inner_join(pop, by = "countyfp")

mask <- mask %>%
  pivot_longer(cols = c("never", "rarely", "sometimes", "frequently", "always"), 
               names_to = "mask_freq",
               values_to = "prevalence") %>%
  mutate(
    num_of_people = census_2010_pop * prevalence
  )

head(mask)
## # A tibble: 6 x 8
##   countyfp county  state.x state.y census_2010_pop mask_freq  prevalence
##   <chr>    <chr>   <chr>   <chr>             <dbl> <chr>           <dbl>
## 1 01001    Autauga Alabama AL                54571 never           0.053
## 2 01001    Autauga Alabama AL                54571 rarely          0.074
## 3 01001    Autauga Alabama AL                54571 sometimes       0.134
## 4 01001    Autauga Alabama AL                54571 frequently      0.295
## 5 01001    Autauga Alabama AL                54571 always          0.444
## 6 01003    Baldwin Alabama AL               182265 never           0.083
## # ... with 1 more variable: num_of_people <dbl>

Since each state comprises a number of counties and when adding num_of_people up after grouping by each state, we get total population in each state state_pop.

state_pop <- mask %>%
  group_by(state.x) %>%
  summarize(pop = sum(num_of_people))
head(state_pop)
## # A tibble: 6 x 2
##   state.x          pop
##   <chr>          <dbl>
## 1 Alabama     4779131.
## 2 Alaska       704696.
## 3 Arizona     6387841.
## 4 Arkansas    2916172.
## 5 California 37254038.
## 6 Colorado    5030329.

Adding state_pop to mask through left_join().

mask <- mask %>%
  left_join(state_pop, by = "state.x")

One thing we have noticed is that mask_freq is not a factor, but converting the column into a factor column is neat for visualization purposes.

mask$mask_freq <- factor(mask$mask_freq, levels = c("never", "rarely", "sometimes", "frequently", "always"))

Data Visualization

mask %>%
  group_by(state.x, mask_freq) %>%
  summarize(count = sum(num_of_people)) %>%
  left_join(state_pop, by = "state.x") %>%
  mutate(prevalence = count/pop) %>%
  ggplot(aes(mask_freq, prevalence, fill = prevalence)) +
  geom_col() +
  facet_geo(~state.x) +
  coord_flip() +
  scale_y_continuous(labels = percent_format()) +
  theme_bw() +
  theme(
    strip.text = element_text(size = 10, face = "bold"),
    legend.position = "none",
    axis.title = element_text(size = 18),
    axis.text.y = element_text(size = 15),
    axis.text.x = element_text(size = 10),
    plot.title = element_text(size = 20)
  ) +
  labs(x = "Mask Wearing Frequency", y = "Prevalence", title = "Percentage Of State Mask Wearing")

Based on the plot above, it is easy to visualize that states located on both coasts have a relative higher always and frequently mask-wearing percents. Their counterparts in the middle are lower in this regard.

Conclusion

This is an interesting real-world project that combines few datasets to make the final visual product, which conveys information about the mask-wearing situation in each state. We started from the county level and upgraded the relevant statistics to the state level, and finally the public opinion on mask-wearing is shown through a collection of geo-facet bar charts. A number of joins and some reshaping method (pivot_longer()), which are the basic data science skills, have been used in the project and hopefully, some techniques presented here can shed some light on data wrangling and visualization.