Classifying Bigfoot Encounters

By Brendan Graham in tidy tuesday text model data science

September 17, 2022

In this post I analyze a TidyTuesday data set about Bigfoot Encounters. After exploring the data I test several different text model to categorize the quality of Bigfoot encounters based on their free text description.

Explore the data

let’s read in the data from the TidyTuesday repo and display it in an interactive table. Looks like there are some free text fields describing the encounter and location, as well as some variables for location, date, weather and the classification of the encounter. The encounter observation are used to drive the classification system.

tt_url <- 
  "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-09-13/bigfoot.csv"

bigfoot <- 
  readr::read_csv(tt_url) %>%
  mutate(
    year = year(date),
    month = floor_date(date, 'month'),
    month_name = month(date, label = T, abbr = T),
    day_name = wday(date, abbr = T, label = T)
  )

bigfoot %>% 
  get_table()

The data goes pretty far, with the most encounters occurring in July 2006, with 35 total sightings.

bigfoot %>%
  group_by(month, month_name) %>%
  tally()  %>%
  filter(!is.na(month),
         month > "1950-01-01") %>%
  ungroup() %>%
  padr::pad() %>%
  padr::fill_by_value(n, 0) %>%
  ggplot(aes(x = month, y = n)) + 
  geom_point(color = bg_green, alpha = .75) + 
  geom_smooth(method = 'loess', color = "cornflowerblue", se = F) + 
  # geom_line(color = bg_green, alpha = .75) + 
  scale_x_date(breaks = scales::date_breaks("5 years"), date_labels = "%b '%y") + 
  labs(x = "Month", y = "# Sightings", title = "Monthly Bigfoot Sightings") + 
  geom_label_repel(data = . %>% filter(n == max(n)),
                  aes(label = paste0(format(month, "%B '%y"), ": ", n, " Total Sightings. \nOver 1 sighting per day on avg!")), nudge_x = -20)

Most encounters are Class A or Class B, and those 2 classes seem to be pretty evenly distributed across seasons and day of the week.

bigfoot %>%
  group_by(season, classification) %>%
  tally()  %>%
  ggplot(aes(x = season, y = n, fill = classification), alpha = .85) +
  geom_bar(stat = 'identity', position = 'fill', alpha = .85) + 
  bg_theme() + 
  scale_fill_npg() + 
  scale_y_continuous(labels = scales::percent) + 
  labs(x = "Season", y = "Frequency of Observations", title = "Bigfoot Observations by Season")

bigfoot %>%
  group_by(day_name, classification) %>%
  tally()  %>%
  filter(!is.na(day_name)) %>%
  ggplot(aes(x = day_name, y = n, fill = classification), alpha = .85) +
  geom_bar(stat = 'identity', position = 'fill', alpha = .85) + 
  bg_theme() + 
  scale_fill_npg() + 
  scale_y_continuous(labels = scales::percent)+  
  labs(x = "Day of Week", y = "Frequency of Observations", title = "Bigfoot Observations by Day of Week")

Text model to classify encounters

Here we try several models to predict the class of encounters - remove some irrelevant predictors, remove Class C and omit rows with any missing data.

The BFRO’s report classification system rates the circumstantial potential for misinterpretation, not the credibility of the witness or how interesting the report is.

So our model is not classifying the credibility of the report, but instead classifying the surrounding environment and circumstances of the encounter.

model_data <- 
  bigfoot %>%
  select(number, classification, observed, date, latitude, longitude, state, season, temperature_high:visibility) %>%
  filter(classification != "Class C") %>%
  na.omit()

prep for modeling: split the data and create folds for model cross validation

set.seed(194)
bigfoot_split <- 
  initial_split(model_data, strata = classification)

train <- 
  training(bigfoot_split)
test <- 
  testing(bigfoot_split)

folds <- 
  bootstraps(train, 25, strata = classification)

We try 2 model recipes * include all columns - free text and weather data * include just the observation free-text data

bigfoot_rec_full <-
  recipe(classification ~ ., data = train) %>%
  step_rm(summary) %>%
  step_naomit(all_predictors()) %>%
  update_role(number, new_role = "ID Variable") %>%
  step_tokenize(observed) %>%
  step_stopwords(observed) %>%
  step_tokenfilter(observed, max_tokens = 100) %>%
  step_tfidf(observed) %>%
  step_date(date, features = c("dow", "month", "year"), keep_original_cols = FALSE) %>%
  step_novel(all_nominal_predictors()) %>%
  step_unknown(all_nominal_predictors()) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_dummy(all_nominal_predictors())

bigfoot_rec_text_only <-
  recipe(classification ~ number + observed, data = train) %>%
  update_role(number, new_role = "ID Variable") %>%
  step_tokenize(observed) %>%
  step_stopwords(observed) %>%
  step_tokenfilter(observed, max_tokens = 100) %>%
  step_tfidf(observed)

# check that things work ok
bigfoot_rec_full %>% 
  prep() %>% 
  bake(new_data = NULL)
## # A tibble: 633 × 193
##    number latitude longitude temperatu…¹ tempe…² tempe…³ dew_p…⁴ humid…⁵ cloud…⁶
##     <dbl>    <dbl>     <dbl>       <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##  1  55269    0.618     1.28        0.661   0.510   0.320   0.352 -0.0886  -0.135
##  2   6643    0.271     1.32        0.318   0.603   0.885   0.819  0.711    0.523
##  3  65664    1.32     -1.27       -1.27   -1.51   -1.70   -1.41  -0.711    0.489
##  4   9873    1.26     -1.17        0.586   0.437   0.254   0.243 -0.800   -0.827
##  5  64024   -0.797     0.624      -0.966  -0.932  -0.848  -0.663  0.800    0.558
##  6   4676   -0.763     0.926       1.24    1.39    1.49    1.36   0.178    0.731
##  7  65677   -0.813     0.833       0.932   0.954   0.930   0.911  0.356   -0.689
##  8   9750   -0.937     0.979       1.29    1.31    1.27    1.32   0.0892  -1.31 
##  9  24334   -0.960     0.994      -0.244  -0.677  -1.12   -0.657 -1.60    -0.273
## 10   3332   -0.797     0.688      -0.286  -0.560  -0.831  -0.191  0.178   -0.204
## # … with 623 more rows, 184 more variables: moon_phase <dbl>,
## #   precip_intensity <dbl>, precip_probability <dbl>, pressure <dbl>,
## #   uv_index <dbl>, visibility <dbl>, classification <fct>,
## #   tfidf_observed_2 <dbl>, tfidf_observed_3 <dbl>, tfidf_observed_30 <dbl>,
## #   tfidf_observed_4 <dbl>, tfidf_observed_5 <dbl>,
## #   tfidf_observed_across <dbl>, tfidf_observed_also <dbl>,
## #   tfidf_observed_animal <dbl>, tfidf_observed_another <dbl>, …
## # ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names

Specify 3 models and bundle them into a workflow:

null_spec <- 
  null_model() %>%
  set_engine("parsnip") %>%
  set_mode("classification")

svm_spec <-
  svm_rbf() %>%
  set_engine("kernlab") %>%
  set_mode("classification")

random_forest_spec <-
  parsnip::rand_forest(
    mtry = tune(),
    trees = 1000,
    min_n = tune()
  ) %>%
  set_mode("classification") %>%
  set_engine("ranger")

workflows <- 
  workflow_set(
    preproc = list(
      full_recipe = bigfoot_rec_full,
      text_only = bigfoot_rec_text_only), 
    models = list(
      null_model = null_spec,
      svm = svm_spec,
      random_forest = random_forest_spec
      ),
    cross = TRUE)

workflows
## # A workflow set/tibble: 6 × 4
##   wflow_id                  info             option    result    
##   <chr>                     <list>           <list>    <list>    
## 1 full_recipe_null_model    <tibble [1 × 4]> <opts[0]> <list [0]>
## 2 full_recipe_svm           <tibble [1 × 4]> <opts[0]> <list [0]>
## 3 full_recipe_random_forest <tibble [1 × 4]> <opts[0]> <list [0]>
## 4 text_only_null_model      <tibble [1 × 4]> <opts[0]> <list [0]>
## 5 text_only_svm             <tibble [1 × 4]> <opts[0]> <list [0]>
## 6 text_only_random_forest   <tibble [1 × 4]> <opts[0]> <list [0]>

tune all model/recipe combinations using the cross validation folds

cl <- parallel::makeCluster(3)
doParallel::registerDoParallel(cl)

grid_ctrl <-
  control_grid(
    save_pred = TRUE,
    allow_par = TRUE,
    parallel_over = "everything",
    verbose = TRUE
  )

results <- 
  workflow_map(fn = "tune_grid",
               object = workflows,
               seed = 155,
               verbose = TRUE,
               control = grid_ctrl,
               grid = 10, 
               resamples = folds,
               metrics = metric_set(accuracy, roc_auc, mn_log_loss)
  )

Evaluate the model performance:

results %>%
  rank_results(rank_metric = "accuracy", select_best = TRUE) %>%
  filter(.metric == "accuracy") %>%
  arrange(model) %>%
  mutate(type = case_when(
    str_detect(wflow_id, "full") ~ "full recipe",
    str_detect(wflow_id, "no_text") ~ "no text",
    TRUE ~ "text only")) %>%
  ggplot(., aes(x = rank, y = mean, color = model)) + 
  geom_point() + 
  geom_errorbar(aes(ymin = mean - std_err, ymax = mean + std_err)) + 
  facet_wrap(vars(type)) + 
  scale_y_continuous(limits =c(.45, .85), breaks = seq(.45, .85, .01))

finalize model

best_results <- 
  results %>% 
   extract_workflow_set_result("text_only_random_forest") %>% 
   select_best(metric = "roc_auc")

rf_test_results <- 
  results %>% 
  extract_workflow("text_only_random_forest") %>% 
  finalize_workflow(best_results) %>% 
  last_fit(split = bigfoot_split)

rf_test_results %>% 
  collect_metrics()
## # A tibble: 2 × 4
##   .metric  .estimator .estimate .config             
##   <chr>    <chr>          <dbl> <chr>               
## 1 accuracy binary         0.759 Preprocessor1_Model1
## 2 roc_auc  binary         0.845 Preprocessor1_Model1
rf_test_results %>% 
   collect_predictions() %>% 
  conf_mat(truth = classification, estimate = .pred_class)
##           Truth
## Prediction Class A Class B
##    Class A      73      26
##    Class B      25      88
rf_test_results %>% 
  pull(.predictions)
## [[1]]
## # A tibble: 212 × 6
##    `.pred_Class A` `.pred_Class B`  .row .pred_class classification .config     
##              <dbl>           <dbl> <int> <fct>       <fct>          <chr>       
##  1           0.263           0.737     3 Class B     Class B        Preprocesso…
##  2           0.308           0.692     5 Class B     Class B        Preprocesso…
##  3           0.327           0.673     6 Class B     Class B        Preprocesso…
##  4           0.435           0.565     9 Class B     Class B        Preprocesso…
##  5           0.320           0.680    12 Class B     Class A        Preprocesso…
##  6           0.200           0.800    15 Class B     Class A        Preprocesso…
##  7           0.502           0.498    23 Class A     Class B        Preprocesso…
##  8           0.574           0.426    24 Class A     Class B        Preprocesso…
##  9           0.460           0.540    25 Class B     Class B        Preprocesso…
## 10           0.683           0.317    27 Class A     Class A        Preprocesso…
## # … with 202 more rows
## # ℹ Use `print(n = ...)` to see more rows
final_workflow <-
  results %>% 
  extract_workflow("text_only_random_forest") %>% 
  finalize_workflow(best_results)
  
final_model <- 
  fit(final_workflow, model_data)

stopCluster(cl)

predict on new data

bigfoot_sigthings <-
  tibble(
    number = c(1, 2, 3, 4),
    observed = c(
      "I walked up hills and came up to ridge that was very difficult to climb. I spotted a foot print. I climbed up and there it was a foot print. I
      did not really believe in Bigfoot but nowhaving second thoughts. We even went back the next day which was today 11/1/09 and made a cast of the
      print. Of course it was not as deep as yesterday but we still took the print and have pictures from yesterday on the cell phone.",
      "I had a conversation with an animal that was definitely bigfoot. He told me his name and we shook hands and I even have a framed picture of a
      selfie we took together in the woods in front of his favorite tree",
      "My dad said he saw bigfoot but didn't get a really good look.",
      "It was in Montana on the Flathead Indian Reservation,near an area known as Evaro. This is about 20 miles northwest of Missoula, MT. It occured near our residence, which was located about 4 miles deep into the woods from the nearest main roadway (Highway 93). The only way to get to our residence was by following an old dirt road that winds its way through the forest."
    )
  )
    
predict(
  final_model,
  new_data = bigfoot_sigthings
)
## # A tibble: 4 × 1
##   .pred_class
##   <fct>      
## 1 Class B    
## 2 Class B    
## 3 Class A    
## 4 Class B
Posted on:
September 17, 2022
Length:
9403 minute read, 2002831 words
Categories:
tidy tuesday text model data science
Tags:
tidy tuesday text model data science
See Also:
Creating a Hex Bin Map to Show Changes Pell Grants
Getting started with topic modeling with dog breed traits
Combining 'Traditional' and Text-Based Models to Board Game Ratings