2  Data

2.1 Description

Our project draws on several datasets hosted by the MIT Election Lab. Primary Source: https://electionlab.mit.edu/data

Data Sources:

  1. Presidential Election Results (1976-2020) Source: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/42MVDX Description: This dataset contains state-level presidential election results, including vote counts for major parties, minor parties, and independent candidates. It provides data on total votes cast, voter turnout, and percentage share of each candidate in every presidential election from 1976 to 2020. Data Format: CSV file with columns for state, year, candidate, party, vote count, percentage share, and total votes cast. Frequency of Updates: Static dataset last updated to include results through 2020. Dimensions: Rows represent individual state-election combinations, with thousands of records. Planned Import: Data will be imported into R Studio using the read.csv function.

  2. Senate Election Results (1976-2020) Source: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/PEJ5QU Description: This dataset provides results for Senate elections, including candidate vote counts, party affiliations, and special elections. Data covers statewide elections and includes party-specific turnout information. Data Format: CSV file with columns for state, year, candidate, party, votes, percentage share, and whether the election was a regular or special election. Frequency of Updates: Static dataset last updated to include results through 2020. Dimensions: State-year combinations with additional rows for runoff elections. Planned Import: Data will be imported into R Studio using the read.csv function.

  3. House Election Results (1976-2022) Source: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/IG0UN2 Description: This dataset includes congressional district-level results for House elections, with details on candidates, vote counts, and party affiliations. Data Format: CSV file with columns for district, state, year, candidate, party, vote count, and total votes. Frequency of Updates: Static dataset last updated to include results through 2022. Dimensions: Rows represent congressional district-year combinations, totaling tens of thousands of records. Planned Import: Data will be imported into R Studio using the read.csv function.

Issues/Problems:

Temporal Mismatch: The House election data extends through 2022, while presidential and Senate data only go up to 2020. To ensure consistency, we will analyze all datasets only up to 2020. Runoff Election Data: Significant missing data exists for runoff elections, particularly in Senate and House results. This component will be excluded from the analysis to maintain data reliability. Granularity: While presidential data is state-level, House data is district-level. This mismatch in granularity may limit the scope of comparative analyses. Aggregation or normalization may be necessary in some cases. Data Quality: There are missing or incomplete fields in some years for minor-party candidates and inconsistencies in party naming conventions across datasets.

2.2 Missing value analysis

Code
options(repos = c(CRAN = "https://cloud.r-project.org/")) 
library(ggplot2)
.libPaths()
library(dplyr)
#install.packages(c("glue", "withr", "gtable", "cpp11", "Rcpp"))
remotes::install_github("jtr13/redav")
#install.packages("mi")
library(mi)

president <- read.csv('1976-2020-president.csv')
house <- read.csv('1976-2022-house.csv')
senate <- read.csv('1976-2020-senate.csv')
Code
president_missing <- mi::missing_data.frame(president)
image(president_missing)

Code
print(redav::plot_missing(president, num_char = 5, max_cols = 10))

Code
president_cleaned <- president[, !(colnames(president) %in% c("notes"))]
president_cleaned <- na.omit(president_cleaned)

The visualization of missing values in the presidential election dataset reveals that most variables are complete, with no missing values, such as year, state, and totalvotes. However, the notes variable is entirely missing (100%), which likely indicates that there were no special remarks or annotations needed for the observations. Given its lack of content and relevance, we have decided to remove this column from the dataset to streamline the analysis. Additionally, the writein variable shows a very small proportion of missing values (less than 0.1%), which are unlikely to significantly impact the dataset’s overall integrity. We have removed the rows containing these missing values to maintain the integrity of the analysis. (The first plot does not display ‘note’ as all values are missing. To keep the image clean, the second plot displays only 10 variables.)

Code
house_missing <- mi::missing_data.frame(house)
image(house_missing)

Code
print(redav::plot_missing(house, num_char = 5, max_cols = 10))

Code
runoff_missing <- house[is.na(house$runoff), ]
runoff_missing_by_year <- runoff_missing |>count(year)
ggplot(runoff_missing_by_year, aes(x = factor(year), y = n)) +
  geom_bar(stat = "identity", fill = "lightblue") +
  labs(title = "Distribution of Runoff Missing Values by Year",
    x = "Year",
    y = "Number of Missing Values") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Code
house_cleaned <- house[, !(colnames(house) %in% c("runoff"))]

The plot shows that the missing values in the runoff column are distributed relatively evenly across the years, suggesting that the missing data may be due to inconsistencies or issues in the data collection process. Given the high percentage of missing values in this column (28%) and its relatively low importance for our analysis, we have decided to remove the runoff column entirely to streamline the dataset and focus on more relevant variables. (To keep the image clean, the third plot displays only 10 variables.)

Code
print(redav::plot_missing(senate, num_char = 5, max_cols = 10))

The plot shows that there are no missing values in senate data, so we do not make any modification to this data. (To keep the image clean, the plot displays only 10 variables.)

Code
library(dplyr)
library(tidyr)
house_data <- house_cleaned |> 
  filter(year<=2020)

house_data_grouped <- house_data |>
  group_by(year, state, state_po, party, district) |>
  summarise(
    candidatevotes = sum(candidatevotes),
    totalvotes = first(totalvotes)
  ) |>
  ungroup()

house_data_grouped <- house_data_grouped |>
  mutate(
    simplified_party = case_when(
      grepl("DEMOCRAT", party, ignore.case = TRUE) ~ "DEMOCRAT_House",
      grepl("REPUBLICAN", party, ignore.case = TRUE) ~ "REPUBLICAN_House",
      grepl("LIBERTARIAN", party, ignore.case = TRUE) ~ "LIBERTARIAN_House",
      TRUE ~ "OTHER_House"
    )
  )

house_data_simplified <- house_data_grouped |>
  group_by(year, state, state_po, simplified_party) |>
  summarise(candidate_house = sum(candidatevotes), total_house = sum(totalvotes)) |>
  ungroup() |>
  mutate(vote_share = round(100*candidate_house / total_house,2)) |>
  select(-candidate_house)

house_pivoted_data <- house_data_simplified |>
  pivot_wider(
    names_from = simplified_party,  
    values_from = vote_share,       
    values_fill = 0                 
  ) |>
  group_by(year, state, state_po) |>
  summarise(across(everything(), sum))

senate_data <- senate |> 
  filter(year<=2020)

senate_data_grouped <- senate_data |>
  group_by(year, state, state_po, party_simplified) |>
  summarise(
    candidatevotes = sum(candidatevotes),
    totalvotes = first(totalvotes)
  ) |>
  mutate(vote_share = round(100*candidatevotes / totalvotes,2)) |>
  ungroup()

senate_data_simplified <- senate_data_grouped |>
  group_by(year, state, state_po, party_simplified) |>
  summarise(total_senate = first(totalvotes), vote_share = sum(vote_share)) |>
  ungroup()

senate_pivoted_data <- senate_data_simplified |>
  pivot_wider(
    names_from = party_simplified,  
    values_from = vote_share,       
    values_fill = 0                 
  ) |>
  rename(DEMOCRAT_Senate = "DEMOCRAT", 
         REPUBLICAN_Senate = "REPUBLICAN",
         LIBERTARIAN_Senate = "LIBERTARIAN",
         OTHER_Senate = "OTHER"
         )

president_data_grouped <- president_cleaned |>
  group_by(year, state, state_po, party_simplified) |>
  summarise(
    candidatevotes = sum(candidatevotes),
    totalvotes = first(totalvotes)
  ) |>
  mutate(vote_share = round(100*candidatevotes / totalvotes,2)) |>
  ungroup()

president_data_simplified <- president_data_grouped |>
  group_by(year, state, state_po, party_simplified) |>
  summarise(total_president = first(totalvotes), vote_share = sum(vote_share)) |>
  ungroup()

president_pivoted_data <- president_data_simplified |>
  pivot_wider(
    names_from = party_simplified,  
    values_from = vote_share,       
    values_fill = 0                 
  ) |>
  rename(DEMOCRAT_President = "DEMOCRAT", 
         REPUBLICAN_President = "REPUBLICAN",
         LIBERTARIAN_President = "LIBERTARIAN",
         OTHER_President = "OTHER"
         )

states <- data.frame(
  state = c(
    "ALABAMA", "ALASKA", "ARIZONA", "ARKANSAS", "CALIFORNIA", 
    "COLORADO", "CONNECTICUT", "DELAWARE", "FLORIDA", "GEORGIA", 
    "HAWAII", "IDAHO", "ILLINOIS", "INDIANA", "IOWA", 
    "KANSAS", "KENTUCKY", "LOUISIANA", "MAINE", "MARYLAND", 
    "MASSACHUSETTS", "MICHIGAN", "MINNESOTA", "MISSISSIPPI", "MISSOURI", 
    "MONTANA", "NEBRASKA", "NEVADA", "NEW HAMPSHIRE", "NEW JERSEY", 
    "NEW MEXICO", "NEW YORK", "NORTH CAROLINA", "NORTH DAKOTA", "OHIO", 
    "OKLAHOMA", "OREGON", "PENNSYLVANIA", "RHODE ISLAND", "SOUTH CAROLINA", 
    "SOUTH DAKOTA", "TENNESSEE", "TEXAS", "UTAH", "VERMONT", 
    "VIRGINIA", "WASHINGTON", "WEST VIRGINIA", "WISCONSIN", "WYOMING", 
    "DISTRICT OF COLUMBIA", "PUERTO RICO"
  ),
  state_po = c(
    "AL", "AK", "AZ", "AR", "CA", 
    "CO", "CT", "DE", "FL", "GA", 
    "HI", "ID", "IL", "IN", "IA", 
    "KS", "KY", "LA", "ME", "MD", 
    "MA", "MI", "MN", "MS", "MO", 
    "MT", "NE", "NV", "NH", "NJ", 
    "NM", "NY", "NC", "ND", "OH", 
    "OK", "OR", "PA", "RI", "SC", 
    "SD", "TN", "TX", "UT", "VT", 
    "VA", "WA", "WV", "WI", "WY", 
    "DC", "PR"
  )
)

years <- 1976:2020

state_year_tibble <- expand_grid(
  year = years,
  states
)

joined_data <- state_year_tibble |>
  left_join(house_pivoted_data, by = c("year", "state", "state_po")) |>
  left_join(senate_pivoted_data, by = c("year", "state", "state_po")) |>
  left_join(president_pivoted_data, by = c("year", "state", "state_po")) |>
  mutate(across(c(total_house, total_senate, total_president), ~ replace_na(., 0))) |>
  mutate(across(everything(), ~ replace_na(., 0.00)))

write.csv(joined_data, 'joined_data.csv', row.names=FALSE)


library(jsonlite)
democrat_data <- joined_data |>
  group_by(state_po) |>
  summarise(
    DemocratSupport = mean(DEMOCRAT_President[DEMOCRAT_President > 0], na.rm = TRUE)
  ) |>
  arrange(desc(DemocratSupport)) 

write_json(democrat_data, "democrat_support.json")


republican_data <- joined_data |>
  group_by(state_po) |>
  summarise(
    RepublicanSupport = mean(REPUBLICAN_President[REPUBLICAN_President > 0], na.rm = TRUE)
  ) |>
  arrange(desc(RepublicanSupport)) 

write_json(republican_data, "republican_support.json")