Combining 'Traditional' and Text-Based Models to Board Game Ratings

By Brendan Graham in tidy tuesday tidymodels data science

January 26, 2022

Last week I tried out a text based model for the first time. This week I want to continue working with a text based model, but supplement the text data with other non-text predictors. The goal will be to use the board game category (text data) and other non-text data to predict the average board game rating.

Explore the Data

there are 2 tables, one that has the ratings variable we’ll use as the outcome, and the other has some columns to be used as predictors.

ratings

## # A tibble: 21,831 × 10
##      num     id name     year  rank average bayes_average users_rated url       
##    <dbl>  <dbl> <chr>   <dbl> <dbl>   <dbl>         <dbl>       <dbl> <chr>     
##  1   105  30549 Pandem…  2008   106    7.59          7.49      108975 /boardgam…
##  2   189    822 Carcas…  2000   190    7.42          7.31      108738 /boardgam…
##  3   428     13 Catan    1995   429    7.14          6.97      108024 /boardgam…
##  4    72  68448 7 Wond…  2010    73    7.74          7.63       89982 /boardgam…
##  5   103  36218 Domini…  2008   104    7.61          7.50       81561 /boardgam…
##  6   191   9209 Ticket…  2004   192    7.41          7.30       76171 /boardgam…
##  7   100 178900 Codena…  2015   101    7.6           7.51       74419 /boardgam…
##  8     3 167791 Terraf…  2016     4    8.42          8.27       74216 /boardgam…
##  9    15 173346 7 Wond…  2015    16    8.11          7.98       69472 /boardgam…
## 10    35  31260 Agrico…  2007    36    7.93          7.81       66093 /boardgam…
## # … with 21,821 more rows, and 1 more variable: thumbnail <chr>

details

## # A tibble: 21,631 × 23
##      num     id primary   description        yearpublished minplayers maxplayers
##    <dbl>  <dbl> <chr>     <chr>                      <dbl>      <dbl>      <dbl>
##  1     0  30549 Pandemic  In Pandemic, seve…          2008          2          4
##  2     1    822 Carcasso… Carcassonne is a …          2000          2          5
##  3     2     13 Catan     In CATAN (formerl…          1995          3          4
##  4     3  68448 7 Wonders You are the leade…          2010          2          7
##  5     4  36218 Dominion  &quot;You are a m…          2008          2          4
##  6     5   9209 Ticket t… With elegantly si…          2004          2          5
##  7     6 178900 Codenames Codenames is an e…          2015          2          8
##  8     7 167791 Terrafor… In the 2400s, man…          2016          1          5
##  9     8 173346 7 Wonder… In many ways 7 Wo…          2015          2          2
## 10     9  31260 Agricola  Description from …          2007          1          5
## # … with 21,621 more rows, and 16 more variables: playingtime <dbl>,
## #   minplaytime <dbl>, maxplaytime <dbl>, minage <dbl>,
## #   boardgamecategory <chr>, boardgamemechanic <chr>, boardgamefamily <chr>,
## #   boardgameexpansion <chr>, boardgameimplementation <chr>,
## #   boardgamedesigner <chr>, boardgameartist <chr>, boardgamepublisher <chr>,
## #   owned <dbl>, trading <dbl>, wanting <dbl>, wishing <dbl>

Using the skimr package we can get a quick understanding of the characteristics of each table.

skimr::skim_to_list(ratings)

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
name	0	1	1	107	21432
url	0	1	16	68	21831
thumbnail	6	1	135	139	21816

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
num	1	10915.00	6302.21	0.00	5457.50	10915.00	16372.50	21830.00	▇▇▇▇▇
id	1	118144.78	105369.55	1.00	12308.50	104994.00	207219.00	350992.00	▇▂▃▃▂
year	1	1987.44	193.51	0.00	2001.00	2011.00	2017.00	3500.00	▁▁▇▁▁
rank	1	10916.00	6302.21	1.00	5458.50	10916.00	16373.50	21831.00	▇▇▇▇▇
average	1	6.42	0.93	1.04	5.83	6.45	7.04	9.57	▁▁▅▇▁
bayes_average	1	5.68	0.36	0.00	5.51	5.54	5.67	8.51	▁▁▁▇▁
users_rated	1	866.96	3679.82	30.00	56.00	122.00	392.00	108975.00	▇▁▁▁▁

skimr::skim_to_list(details)

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
primary	0	1.00	1	107	21236
description	1	1.00	49	16144	21615
boardgamecategory	283	0.99	8	216	6730
boardgamemechanic	1590	0.93	8	478	8291
boardgamefamily	3761	0.83	13	2768	11285
boardgameexpansion	16125	0.25	7	18150	5264
boardgameimplementation	16769	0.22	6	890	4247
boardgamedesigner	596	0.97	7	332	9136
boardgameartist	5907	0.73	6	8408	9080
boardgamepublisher	1	1.00	6	3744	11265

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
num	1	10815.00	6244.48	0	5407.5	10815	16222.5	21630	▇▇▇▇▇
id	1	118133.09	105310.42	1	12280.5	105187	207013.0	350992	▇▂▃▃▂
yearpublished	1	1986.09	210.04	-3500	2001.0	2011	2017.0	2023	▁▁▁▁▇
minplayers	1	2.01	0.69	0	2.0	2	2.0	10	▇▁▁▁▁
maxplayers	1	5.71	15.10	0	4.0	4	6.0	999	▇▁▁▁▁
playingtime	1	90.51	534.83	0	25.0	45	90.0	60000	▇▁▁▁▁
minplaytime	1	63.65	447.21	0	20.0	30	60.0	60000	▇▁▁▁▁
maxplaytime	1	90.51	534.83	0	25.0	45	90.0	60000	▇▁▁▁▁
minage	1	9.61	3.64	0	8.0	10	12.0	25	▂▇▆▁▁
owned	1	1487.92	5395.08	0	150.0	322	903.5	168364	▇▁▁▁▁
trading	1	43.59	102.41	0	5.0	13	38.0	2508	▇▁▁▁▁
wanting	1	42.03	117.94	0	3.0	9	29.0	2011	▇▁▁▁▁
wishing	1	233.66	800.66	0	14.0	39	131.0	19325	▇▁▁▁▁

Looks like there are 200 games missing descriptions:

ratings %>%
  anti_join(., details, by = c("id")) %>%
  nrow()

## [1] 200

The distribution of the outcome variable, ratings, is somewhat normally distributed with a mean rating of 6.42 out of 10.

n_games <- 
  ratings %>% 
  distinct(name) %>% 
  nrow()

overall_avg <- 
  ratings %>% 
  summarise(mean_rating = mean(average, na.rm = T)) %>%
  pull(mean_rating)

ratings %>%
  ggplot(aes(x = average)) + 
  geom_histogram(alpha = .75, fill = bg_green) + 
  bg_theme(base_size = 13) + 
  geom_vline(aes(xintercept = mean(average)), linetype = 'dashed') + 
  scale_y_continuous(expand = c(0, 25), breaks = seq(0, 3000, 100)) + 
  labs(x = "rating", y = '', title =  paste("rating distribution of", format(n_games, big.mark = ','),
                                            "games"),
       subtitle = paste("overall avg rating:", round(overall_avg, 2)))

Here we can dive into the category column. First we combine the ratings data with the details and then tokenize the category column into individual rows per category descriptor. This wirll help form the basis of the text recipe developed later on.

combined <- 
  ratings %>%
  inner_join(., details %>% select(-num), by = c("id")) %>%
  select(-c(url, rank, bayes_average, users_rated, yearpublished, thumbnail, primary,
            boardgameexpansion, boardgameimplementation, boardgamedesigner, description,
            boardgamepublisher,boardgameartist, boardgamefamily, boardgamemechanic,
            owned, trading, wanting, wishing, minplaytime, maxplaytime, year)) %>%
  select(num, id, name, average, boardgamecategory, everything()) %>%
  rename(category = boardgamecategory) %>%
  mutate(category = str_remove_all(category, "'"),
         category = str_remove_all(category, '"'),
         category = str_replace_all(category, "\\[|\\]", ""),
         category = trimws(category)) %>%
  filter(!(is.na(category)))

tidy_category <-
  combined %>%
  unnest_tokens(word, category)

tidy_category %>%
  count(word, sort = TRUE)

## # A tibble: 113 × 2
##    word         n
##    <chr>    <int>
##  1 game     10882
##  2 card      6402
##  3 wargame   3820
##  4 fantasy   2681
##  5 war       1996
##  6 party     1968
##  7 dice      1847
##  8 fiction   1666
##  9 science   1666
## 10 fighting  1658
## # … with 103 more rows

Since we are also including non-text data in the model, we should explore those predictors as well. The function below iterates over the numeric columns and checks the relationship between the numeric columns with the average rating. There are a number of outliers that need to be handled to make the plots more meaningful.

get_scatter <- 
  function(data, var){
    
    variable_filter <- 
      if (var == "maxplayers"){
        25
      } else if (var == "playingtime"){
        500
      } else {
        10000000
      }
    
    data %>%
      select(average, one_of({{var}})) %>%
      rename(variable = 2) %>%
      filter(variable <= variable_filter) %>%
      ggplot(., aes(x = average, y = variable)) +
      geom_point(alpha = .3, color = bg_green) + 
      bg_theme(base_size = 13) +
      labs(x = "avg rating", y = var, title = paste(var, "vs average rating"))

  }

numeric_cols <- 
  combined %>%
  select(where(is.numeric)) %>%
  select(-c(id, num, average)) %>%
  names

purrr::map(numeric_cols, ~get_scatter(combined, .x))

## [[1]]

## 
## [[2]]

## 
## [[3]]

## 
## [[4]]

Create Some Models

To prep the text data for modeling we can create a matrix with each words TF-IDf value. From the step_tfidf() documentation:

Term frequency-inverse document frequency is the product of two statistics: the term frequency (TF) and the inverse document frequency (IDF).

Term frequency measures how many times each token appears in each observation.

Inverse document frequency is a measure of how informative a word is, e.g., how common or rare the word is across all the observations. If a word >appears in all the observations it might not give that much insight, but if it only appears in some it might help differentiate between observations.

The IDF is defined as follows: idf = log(1 + (# documents in the corpus) / (# documents where the term appears))

Normally we’d used step_tokenfilter() as well, but there are not that many categories so we can use all 113. The important step here is to prep() the recipe and then bake() the results. This sequence of functions applies and executes the recipe, and returns a dataframe ready to be joined with the other predictors.

text_prep <- 
  recipe(average ~ num + id + name + category, data = combined) %>%
  update_role(num, id, name, new_role = "ID variables") %>%
  step_tokenize(category) %>%
  step_tfidf(category) %>%
  prep() %>%
  bake(new_data = NULL)

text_prep

## # A tibble: 21,348 × 117
##      num     id name  average tfidf_category_… tfidf_category_… tfidf_category_…
##    <dbl>  <dbl> <fct>   <dbl>            <dbl>            <dbl>            <dbl>
##  1   105  30549 Pand…    7.59                0                0                0
##  2   189    822 Carc…    7.42                0                0                0
##  3   428     13 Catan    7.14                0                0                0
##  4    72  68448 7 Wo…    7.74                0                0                0
##  5   103  36218 Domi…    7.61                0                0                0
##  6   191   9209 Tick…    7.41                0                0                0
##  7   100 178900 Code…    7.6                 0                0                0
##  8     3 167791 Terr…    8.42                0                0                0
##  9    15 173346 7 Wo…    8.11                0                0                0
## 10    35  31260 Agri…    7.93                0                0                0
## # … with 21,338 more rows, and 110 more variables:
## #   tfidf_category_adventure <dbl>, tfidf_category_age <dbl>,
## #   tfidf_category_agents <dbl>, tfidf_category_american <dbl>,
## #   tfidf_category_ancient <dbl>, tfidf_category_and <dbl>,
## #   tfidf_category_animals <dbl>, tfidf_category_arabian <dbl>,
## #   tfidf_category_aviation <dbl>, tfidf_category_base <dbl>,
## #   tfidf_category_based <dbl>, tfidf_category_bluffing <dbl>, …

model_data <- 
  combined %>%
  select(-category) %>%
  left_join(., text_prep %>% select(-c(num, name, average)), "id") 

model_data

## # A tibble: 21,348 × 121
##      num     id name            average minplayers maxplayers playingtime minage
##    <dbl>  <dbl> <chr>             <dbl>      <dbl>      <dbl>       <dbl>  <dbl>
##  1   105  30549 Pandemic           7.59          2          4          45      8
##  2   189    822 Carcassonne        7.42          2          5          45      7
##  3   428     13 Catan              7.14          3          4         120     10
##  4    72  68448 7 Wonders          7.74          2          7          30     10
##  5   103  36218 Dominion           7.61          2          4          30     13
##  6   191   9209 Ticket to Ride     7.41          2          5          60      8
##  7   100 178900 Codenames          7.6           2          8          15     14
##  8     3 167791 Terraforming M…    8.42          1          5         120     12
##  9    15 173346 7 Wonders Duel     8.11          2          2          30     10
## 10    35  31260 Agricola           7.93          1          5         150     12
## # … with 21,338 more rows, and 113 more variables:
## #   tfidf_category_abstract <dbl>, tfidf_category_action <dbl>,
## #   tfidf_category_adult <dbl>, tfidf_category_adventure <dbl>,
## #   tfidf_category_age <dbl>, tfidf_category_agents <dbl>,
## #   tfidf_category_american <dbl>, tfidf_category_ancient <dbl>,
## #   tfidf_category_and <dbl>, tfidf_category_animals <dbl>,
## #   tfidf_category_arabian <dbl>, tfidf_category_aviation <dbl>, …

Then we prep for modelling by creating splits, resamples, model specifications, recipes (several to compare), and the workflowset:

set.seed(113)
splits <- 
  model_data %>%
  initial_split(strata = average)

train <-
  training(splits)
test <-
  testing(splits)

folds <- 
  vfold_cv(train, strata = average)

3 recipes are created here with the goal of seeing which combination works the best: one with the just text features, recipe_text and one without any text features, recipe_no_text and then one with all features, recipe_all.

recipe_all <- 
  recipe(average ~ ., data = train) %>%
  update_role(num, id, name, new_role = "ID variables") %>%
  step_normalize(minplayers, maxplayers, playingtime, minage)

recipe_text <- 
  recipe_all %>%
  step_rm(minplayers, maxplayers, playingtime, minage)

recipe_no_text <- 
  recipe_all %>%
  step_rm(starts_with('tfidf'))

svm_spec <-
  svm_linear(cost = tune(),
             margin = tune()
  ) %>%
  set_mode("regression") %>%
  set_engine("LiblineaR")

lasso_spec <- 
  parsnip::linear_reg(penalty = tune(), 
                      mixture = 1) %>%
  set_engine("glmnet")

mars_spec <- 
  parsnip::mars(num_terms = tune(),
                prod_degree = tune()) %>%
  set_mode('regression') %>%
  set_engine("earth")

workflows <- 
  workflow_set(
    preproc = list(recipe_all = recipe_all,
                   recipe_text = recipe_text,
                   recipe_numeric = recipe_no_text), 
    models = list(svm = svm_spec,
                  lasso = lasso_spec,
                  mars = mars_spec
    ),
    cross = TRUE)

workflows

## # A workflow set/tibble: 9 × 4
##   wflow_id             info             option    result    
##   <chr>                <list>           <list>    <list>    
## 1 recipe_all_svm       <tibble [1 × 4]> <opts[0]> <list [0]>
## 2 recipe_all_lasso     <tibble [1 × 4]> <opts[0]> <list [0]>
## 3 recipe_all_mars      <tibble [1 × 4]> <opts[0]> <list [0]>
## 4 recipe_text_svm      <tibble [1 × 4]> <opts[0]> <list [0]>
## 5 recipe_text_lasso    <tibble [1 × 4]> <opts[0]> <list [0]>
## 6 recipe_text_mars     <tibble [1 × 4]> <opts[0]> <list [0]>
## 7 recipe_numeric_svm   <tibble [1 × 4]> <opts[0]> <list [0]>
## 8 recipe_numeric_lasso <tibble [1 × 4]> <opts[0]> <list [0]>
## 9 recipe_numeric_mars  <tibble [1 × 4]> <opts[0]> <list [0]>

Here we compare the model performance. It looks like the lasso models with the full set of predictors performs the best (circles) and the models without the category text performed the worst (triangles)

grid_ctrl <-
  control_grid(
    save_pred = TRUE,
    allow_par = TRUE,
    parallel_over = "everything",
    verbose = TRUE
  )

cl <- 
  makeCluster(10)

doParallel::registerDoParallel(cl)

results <- 
  workflow_map(fn = "tune_grid",
               object = workflows,
               seed = 155,
               verbose = TRUE,
               control = grid_ctrl,
               grid = 10, 
               resamples = folds,
               metrics = metric_set(rmse, mae)
  )

stopCluster(cl)

rank_results(results, select_best = T) %>% 
  mutate(model = ifelse(str_detect(wflow_id, "lasso"), "lasso", model),
         recipe_type = case_when(
           str_detect(wflow_id, "text") ~ "text_recipe",
           str_detect(wflow_id, "all") ~ "full recipe",
           TRUE ~ "no_text_recipe")) %>%
  filter(.metric == 'rmse') %>%
  ggplot(.,aes(x = rank,  y = mean, color = model, shape = recipe_type)) +
  geom_errorbar(aes(ymin = mean - std_err, ymax = mean + std_err)) + 
  geom_point(size = 3, alpha = .75) + 
  labs(title = "Model Performance Across Recipes", subtitle = "metric: RMSE") + 
  bg_theme(base_size = 13) + 
  ggsci::scale_color_npg()

The predicted vs actual performance of the “best” model is not very good. But oh well, and anyway the goal of this post was to combine text data and non-text data into a model.

best_results <- 
   results %>% 
   extract_workflow_set_result("recipe_all_lasso") %>% 
   select_best(metric = "rmse")

best_results

## # A tibble: 1 × 2
##   penalty .config              
##     <dbl> <fct>                
## 1 0.00350 Preprocessor1_Model08

cl <- 
  makeCluster(10)

test_results <- 
   results %>% 
   extract_workflow("recipe_all_lasso") %>% 
   finalize_workflow(best_results) %>% 
   last_fit(split = splits)

collect_metrics(test_results)

## # A tibble: 2 × 4
##   .metric .estimator .estimate .config             
##   <chr>   <chr>          <dbl> <fct>               
## 1 rmse    standard       0.822 Preprocessor1_Model1
## 2 rsq     standard       0.215 Preprocessor1_Model1

stopCluster(cl)

test_results %>% 
  collect_predictions() %>% 
  ggplot(aes(x = average, y = .pred)) + 
  geom_abline(col = "#e64b35", lty = 2) + 
  geom_point(alpha = 0.35, color = "#00a087") + 
  coord_obs_pred() + 
  labs(x = "observed", y = "predicted") + 
  bg_theme(base_size = 13) + 
  ggsci::scale_color_npg()

Posted on:: January 26, 2022

Length:: 12 minute read, 2428 words

Categories:: tidy tuesday tidymodels data science

Tags:: tidy tuesday tidymodels data science

See Also:: Classifying Bigfoot Encounters; Creating a Hex Bin Map to Show Changes Pell Grants; Getting started with topic modeling with dog breed traits