14 SDM algorithms

Introduction to different SDM algorithms.

Presentation (PDF)

14.1 Exercises - Practice with models

For this exercise, we’re going to play with a superhero dataset! Although this is an SDM focused course, we don’t want to spoil the surprises coming when you start playing with the environmental data.

We want to run models for two questions:

  1. CLASSIFICATION: Is there a bias towards alignment? (i.e. can we predict if a character will be good or bad)
  2. REGRESSION: Are men stronger than women in the comics? (i.e., can we predict total power, and is Gender one of the most important predictors? Or is there something else that predicts total power)

As per the slides, we’re going to do a few things:

  1. Explore and clean our data
  2. Use one-hot-encoding to help deal with many categories
  3. Build a single decision tree and visualise it. Re-build the tree with different settings
  4. Run a model with gbm.step and visualise the train/test plot, as well as the variable importance plot
  5. Run the same model with random forests and look at the differences in the outputs

Pick either the classification or regression problem. Solution for the classification problem are included in this document if you need a poke in the right direction. We run the GBM model ONLY for the regression problem here, but you could try out with RF if you wanted

Let’s start by loading libraries

library(tidyverse)  ## Isn't tidyverse awesome?
## -- Attaching packages ---------------------------------- tidyverse 1.2.1 --
## v tibble  2.1.3     v readr   1.3.1
## v tidyr   0.8.3     v stringr 1.4.0
## v tibble  2.1.3     v forcats 0.4.0
## -- Conflicts ------------------------------------- tidyverse_conflicts() --
## x nlme::collapse()    masks dplyr::collapse()
## x tidyr::extract()    masks raadtools::extract(), raster::extract()
## x dplyr::filter()     masks stats::filter()
## x dplyr::group_rows() masks kableExtra::group_rows()
## x dplyr::lag()        masks stats::lag()
## x raster::select()    masks dplyr::select()
library(dismo)  ## This package contains the gbm.step command
library(randomForest)  ## This package contains the randomForest
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
## The following object is masked from 'package:dplyr':
## 
##     combine
library(rpart)   ## This package contains a series of useful decision tree functions
library(rpart.plot)
library(gbm)
## Loaded gbm 2.1.5

14.1.1 Load and clean the data

Now that we’ve got some libraries ready, let’s load and clean our SUPERHERO dataset in preparation for our modelling.

## Parsed with column specification:
## cols(
##   .default = col_character(),
##   `Unnamed: 0` = col_double(),
##   Intelligence = col_double(),
##   Strength = col_double(),
##   Speed = col_double(),
##   Durability = col_double(),
##   Power = col_double(),
##   Combat = col_double(),
##   TotalPower = col_double()
## )
## See spec(...) for full column specifications.
Supes <- readr::read_csv('SuperheroDataset.csv')

For the purposes of this exercise, we’re going to use the following columns: Name, Intelligence, Strength, Speed, Durability, Power, Combat, Gender, Race, Creator, Alignment, Total Power. Use the tidyverse to extract those columns.

SuperData <- Supes %>%
  dplyr::select(Name, Intelligence:Combat, Alignment:Race, Creator, TotalPower)

14.1.1.1 Exploring the data

If you explore the data a little bit, you’ll be able to see some of the interesting quirks of the dataset. Use ggplot2 (or base plot if you’re feeling lazy ;) ).

SuperData %>%
  filter(!is.na(Intelligence))%>% 
  ggplot(aes(x=Intelligence,y=Strength)) + geom_point() + geom_smooth(method=lm) + theme_bw()

## Looking at the line, it seems like the problem is that there are some extreme low values
##   of intelligence. Outliers perhaps?? Try to find out who those values belong to, and
##   that may help you determine if they should be included in the analysis

SuperData %>%
  filter(!is.na(Combat))%>% 
  ggplot(aes(x=Combat,y=Strength)) + geom_point() + geom_smooth(method=lm) + theme_bw()

## We can also filter the data conditionally to explore the dataset with specific classes, e.g.:
SuperData %>%
  filter(!is.na(Combat),Gender=='Female')%>% 
  ggplot(aes(x=Combat,y=Strength)) + geom_point() + geom_smooth(method=lm) + theme_bw()

library(GGally)
## This is a great library for exploring data quickly! Rather than coding everything bit by bit

quant_df <- SuperData %>% dplyr::select(Intelligence:Gender,TotalPower)
ggpairs(quant_df, progress = FALSE)

Note that ‘total power’ is highly correlated with strength, speed, durability and power. This could impact how we interpret the regression problem later as these variables may wash out the signal we’re trying to explore. Not a bad thing if we’re just trying to make predictions! But if we’re trying to answer a specific question, it could ‘muddy the waters’.

If you have checked out some of the variables, you might notice two things:

  1. Race and Creator have too many categories!
  2. There are a number of Creators that are only used once, or may not be appropriate to include.
length(unique(SuperData$Race)) ## Number of unique levels in Race
## [1] 63
length(unique(SuperData$Creator))  ## Same only for creator
## [1] 23
## Let's just check out Creator:
unique(SuperData$Creator)
##  [1] "Dark Horse Comics" "DC Comics"         "George Lucas"     
##  [4] "HarperCollins"     "Ian Fleming"       "Icon Comics"      
##  [7] "IDW Publishing"    "Image Comics"      "J. K. Rowling"    
## [10] "J. R. R. Tolkien"  "Marvel Comics"     "Mattel"           
## [13] "Microsoft"         "NBC - Heroes"      "Rebellion"        
## [16] "Shueisha"          "Sony Pictures"     "South Park"       
## [19] "Star Trek"         "SyFy"              "Team Epic TV"     
## [22] "Universal Studios" "Wildstorm"
SuperData %>% count(Creator)
## # A tibble: 23 x 2
##    Creator               n
##    <chr>             <int>
##  1 Dark Horse Comics    19
##  2 DC Comics           219
##  3 George Lucas         15
##  4 HarperCollins         6
##  5 Ian Fleming           1
##  6 Icon Comics           4
##  7 IDW Publishing        4
##  8 Image Comics         14
##  9 J. K. Rowling         1
## 10 J. R. R. Tolkien      1
## 11 Marvel Comics       395
## 12 Mattel                2
## 13 Microsoft             1
## 14 NBC - Heroes         19
## 15 Rebellion             1
## 16 Shueisha              4
## 17 Sony Pictures         2
## 18 South Park            1
## 19 Star Trek             6
## 20 SyFy                  5
## 21 Team Epic TV          5
## 22 Universal Studios     1
## 23 Wildstorm             4

There are many classes here that may not be appropriate to include. For example, creator Ian Fleming (creator of James Bond) only has one character. This dataset also has Star Trek characters. So, we decide here to filter out the appropriate classes (keeping Marvel Comics, DC Comics, Dark Horse Comics, and Image Comics).

Note the usage of %in%

FilteredSupes <- SuperData %>%
  dplyr::filter(Creator %in% c('Marvel Comics','DC Comics','Dark Horse Comics','Image Comics')) %>%
  dplyr::filter(!is.na(Intelligence))
## we do this last part to remove NA data for cleanliness, though trees don't require you to do this!

FilteredSupes
## # A tibble: 586 x 12
##    Name              Intelligence Strength Speed Durability Power Combat
##    <chr>                    <dbl>    <dbl> <dbl>      <dbl> <dbl>  <dbl>
##  1 Abe Sapien                  95       30    35         65   100     85
##  2 Alien                       80       30    45         90    60     60
##  3 Angel                       90       30    60         90   100     75
##  4 Buffy                       85       30    45         70    50     60
##  5 Dash                        65       15    95         60    20     30
##  6 Elastigirl                  85       35    35         95    55     70
##  7 Hellboy                     85       55    25         95    75     75
##  8 Jack-Jack                   35       35    70         80   100     10
##  9 Jason Voorhees              85       35    25        100   100     50
## 10 Liz Sherman                 55       35    30         45    75     50
## 11 Mr Incredible               80       85    35         95    30     40
## 12 Predator                    85       30    25         85   100     90
## 13 T-1000                      90       35    35        100   100     75
## 14 T-800                       90       35    20         60    75     65
## 15 T-850                       90       80    25         90    85     75
## 16 T-X                         90       65    30         85   100     80
## 17 Violet Parr                 80       10    15         50    80     15
## 18 Abin Sur                    80       90    55         65   100     65
## 19 Adam Strange                85       10    35         40    40     50
## 20 Alfred Pennyworth           85       10    20         10    10     55
## 21 Amazo                       85      100    85        100   100    100
## 22 Animal Man                  80       50    50         85    75     80
## 23 Anti-Monitor                95      100    50        100   100     90
## 24 Aquababy                    60       20    15         15    40     15
## 25 Aqualad                     85       45    45         75    90     60
## 26 Aquaman                     95       85    80         80   100     80
## 27 Ares                        90      100    75        100   100    100
## 28 Arsenal                     80       55    60         60    65     85
## 29 Atlas                       85      100    45        100    30     80
## 30 Atom                        90       95    75         95    90     85
##    Alignment Gender Race            Creator           TotalPower
##    <chr>     <chr>  <chr>           <chr>                  <dbl>
##  1 good      Male   Icthyo Sapien   Dark Horse Comics        410
##  2 bad       Male   Xenomorph XX121 Dark Horse Comics        365
##  3 good      Male   Vampire         Dark Horse Comics        445
##  4 good      Female Human           Dark Horse Comics        340
##  5 good      Male   Human           Dark Horse Comics        285
##  6 good      Female Human           Dark Horse Comics        375
##  7 good      Male   Demon           Dark Horse Comics        410
##  8 good      Male   Human           Dark Horse Comics        330
##  9 bad       Male   Human           Dark Horse Comics        395
## 10 good      Female Human           Dark Horse Comics        290
## 11 good      Male   Human           Dark Horse Comics        365
## 12 bad       Male   Yautja          Dark Horse Comics        415
## 13 bad       Male   Android         Dark Horse Comics        435
## 14 bad       Male   Cyborg          Dark Horse Comics        345
## 15 bad       Male   Cyborg          Dark Horse Comics        445
## 16 bad       Female Cyborg          Dark Horse Comics        450
## 17 good      Female Human           Dark Horse Comics        250
## 18 good      Male   Ungaran         DC Comics                455
## 19 good      Male   Human           DC Comics                260
## 20 good      Male   Human           DC Comics                190
## 21 bad       Male   Android         DC Comics                570
## 22 good      Male   Human           DC Comics                420
## 23 bad       Male   God / Eternal   DC Comics                535
## 24 good      Male   Human           DC Comics                165
## 25 good      Male   Atlantean       DC Comics                400
## 26 good      Male   Atlantean       DC Comics                520
## 27 neutral   Male   God / Eternal   DC Comics                565
## 28 good      Male   Human           DC Comics                405
## 29 bad       Male   God / Eternal   DC Comics                440
## 30 good      Male   Human           DC Comics                530
## # ... with 556 more rows
## How many races do we now have??
length(unique(FilteredSupes$Race))
## [1] 53

Now let’s explore this again.

quant_df <- FilteredSupes %>% dplyr::select(Intelligence:Gender,TotalPower)
ggpairs(quant_df, progress = FALSE)

Not much has really changed, as we have not removed too many rows, which is good thing because it means that those data we removed won’t impact the final conclusions in any major way. This just simplifies our data cleaning for this particular exercise. However, we still have lots of ‘Race’ categories.

FilteredSupes %>% count(Race)
## # A tibble: 53 x 2
##    Race                    n
##    <chr>               <int>
##  1 Alien                   8
##  2 Amazon                  2
##  3 Android                 8
##  4 Animal                  2
##  5 Asgardian               5
##  6 Atlantean               5
##  7 Bizarro                 1
##  8 Bolovaxian              1
##  9 Clone                   1
## 10 Cosmic Entity           4
## 11 Cyborg                  9
## 12 Czarnian                1
## 13 DemiHumanGod            2
## 14 Demon                   6
## 15 Eternal                 3
## 16 Flora Colossus          1
## 17 Frost Giant             2
## 18 God / Eternal          16
## 19 Gorilla                 1
## 20 Human                 384
## 21 Human / Altered         2
## 22 Human / Cosmic          2
## 23 Human / Radiation      12
## 24 HumanHumanInhuman       1
## 25 HumanHumanKree          2
## 26 HumanHumanSpartoi       1
## 27 HumanHumanVuldarian     1
## 28 Icthyo Sapien           1
## 29 Inhuman                 4
## 30 Kakarantharaian         1
## # ... with 23 more rows

Looking at the breakdown of Race, there are many with a count of 1. These can cause some issues when trying to do any cross validation… so we should remove them.

FilteredSupes <- FilteredSupes %>% add_count(Race) %>% filter(n > 1)

FilteredSupes %>% count(Race)
## # A tibble: 25 x 2
##    Race                  n
##    <chr>             <int>
##  1 Alien                 8
##  2 Amazon                2
##  3 Android               8
##  4 Animal                2
##  5 Asgardian             5
##  6 Atlantean             5
##  7 Cosmic Entity         4
##  8 Cyborg                9
##  9 DemiHumanGod          2
## 10 Demon                 6
## 11 Eternal               3
## 12 Frost Giant           2
## 13 God / Eternal        16
## 14 Human               384
## 15 Human / Altered       2
## 16 Human / Cosmic        2
## 17 Human / Radiation    12
## 18 HumanHumanKree        2
## 19 Inhuman               4
## 20 Kryptonian            7
## 21 Metahuman             2
## 22 Mutant               57
## 23 New God               3
## 24 Symbiote              9
## 25 Vampire               2

14.1.2 One hot encoding!

Race has MANY categories! This COULD cause the trees to get swamped by this predictor. Also, by one-hot encoding, we can keep data that would otherwise get thrown out if just filtering out categories. For example, if we removed the ‘cyborg’ category by way of filtering, those character data would be removed as well (they might be useful). One hot encoding turns each category into a binary variable. Can be done easily in tidyverse!

OOEdata <- FilteredSupes %>% 
        separate_rows(Race)%>% 
        mutate(count = 1) %>% 
        spread(Race, count, fill = 0, sep = "_")

OOEdata
## # A tibble: 558 x 38
##    Name              Intelligence Strength Speed Durability Power Combat
##    <chr>                    <dbl>    <dbl> <dbl>      <dbl> <dbl>  <dbl>
##  1 Angel                       90       30    60         90   100     75
##  2 Buffy                       85       30    45         70    50     60
##  3 Dash                        65       15    95         60    20     30
##  4 Elastigirl                  85       35    35         95    55     70
##  5 Hellboy                     85       55    25         95    75     75
##  6 Jack-Jack                   35       35    70         80   100     10
##  7 Jason Voorhees              85       35    25        100   100     50
##  8 Liz Sherman                 55       35    30         45    75     50
##  9 Mr Incredible               80       85    35         95    30     40
## 10 T-1000                      90       35    35        100   100     75
## 11 T-800                       90       35    20         60    75     65
## 12 T-850                       90       80    25         90    85     75
## 13 T-X                         90       65    30         85   100     80
## 14 Violet Parr                 80       10    15         50    80     15
## 15 Adam Strange                85       10    35         40    40     50
## 16 Alfred Pennyworth           85       10    20         10    10     55
## 17 Amazo                       85      100    85        100   100    100
## 18 Animal Man                  80       50    50         85    75     80
## 19 Anti-Monitor                95      100    50        100   100     90
## 20 Aquababy                    60       20    15         15    40     15
## 21 Aqualad                     85       45    45         75    90     60
## 22 Aquaman                     95       85    80         80   100     80
## 23 Ares                        90      100    75        100   100    100
## 24 Arsenal                     80       55    60         60    65     85
## 25 Atlas                       85      100    45        100    30     80
## 26 Atom                        90       95    75         95    90     85
## 27 Atom Girl                   75       10    25         30    40     45
## 28 Atom II                     95       10    35         45    40     60
## 29 Atom IV                     85       55    55         55    60     65
## 30 Azrael                      85       20    20         20    35     80
##    Alignment Gender Creator           TotalPower     n Race_Alien
##    <chr>     <chr>  <chr>                  <dbl> <int>      <dbl>
##  1 good      Male   Dark Horse Comics        445     2          0
##  2 good      Female Dark Horse Comics        340   384          0
##  3 good      Male   Dark Horse Comics        285   384          0
##  4 good      Female Dark Horse Comics        375   384          0
##  5 good      Male   Dark Horse Comics        410     6          0
##  6 good      Male   Dark Horse Comics        330   384          0
##  7 bad       Male   Dark Horse Comics        395   384          0
##  8 good      Female Dark Horse Comics        290   384          0
##  9 good      Male   Dark Horse Comics        365   384          0
## 10 bad       Male   Dark Horse Comics        435     8          0
## 11 bad       Male   Dark Horse Comics        345     9          0
## 12 bad       Male   Dark Horse Comics        445     9          0
## 13 bad       Female Dark Horse Comics        450     9          0
## 14 good      Female Dark Horse Comics        250   384          0
## 15 good      Male   DC Comics                260   384          0
## 16 good      Male   DC Comics                190   384          0
## 17 bad       Male   DC Comics                570     8          0
## 18 good      Male   DC Comics                420   384          0
## 19 bad       Male   DC Comics                535    16          0
## 20 good      Male   DC Comics                165   384          0
## 21 good      Male   DC Comics                400     5          0
## 22 good      Male   DC Comics                520     5          0
## 23 neutral   Male   DC Comics                565    16          0
## 24 good      Male   DC Comics                405   384          0
## 25 bad       Male   DC Comics                440    16          0
## 26 good      Male   DC Comics                530   384          0
## 27 good      Female DC Comics                225   384          0
## 28 good      Male   DC Comics                285   384          0
## 29 good      Male   DC Comics                375   384          0
## 30 good      Male   DC Comics                260   384          0
##    Race_Altered Race_Amazon Race_Android Race_Animal Race_Asgardian
##           <dbl>       <dbl>        <dbl>       <dbl>          <dbl>
##  1            0           0            0           0              0
##  2            0           0            0           0              0
##  3            0           0            0           0              0
##  4            0           0            0           0              0
##  5            0           0            0           0              0
##  6            0           0            0           0              0
##  7            0           0            0           0              0
##  8            0           0            0           0              0
##  9            0           0            0           0              0
## 10            0           0            1           0              0
## 11            0           0            0           0              0
## 12            0           0            0           0              0
## 13            0           0            0           0              0
## 14            0           0            0           0              0
## 15            0           0            0           0              0
## 16            0           0            0           0              0
## 17            0           0            1           0              0
## 18            0           0            0           0              0
## 19            0           0            0           0              0
## 20            0           0            0           0              0
## 21            0           0            0           0              0
## 22            0           0            0           0              0
## 23            0           0            0           0              0
## 24            0           0            0           0              0
## 25            0           0            0           0              0
## 26            0           0            0           0              0
## 27            0           0            0           0              0
## 28            0           0            0           0              0
## 29            0           0            0           0              0
## 30            0           0            0           0              0
##    Race_Atlantean Race_Cosmic Race_Cyborg Race_DemiHumanGod Race_Demon
##             <dbl>       <dbl>       <dbl>             <dbl>      <dbl>
##  1              0           0           0                 0          0
##  2              0           0           0                 0          0
##  3              0           0           0                 0          0
##  4              0           0           0                 0          0
##  5              0           0           0                 0          1
##  6              0           0           0                 0          0
##  7              0           0           0                 0          0
##  8              0           0           0                 0          0
##  9              0           0           0                 0          0
## 10              0           0           0                 0          0
## 11              0           0           1                 0          0
## 12              0           0           1                 0          0
## 13              0           0           1                 0          0
## 14              0           0           0                 0          0
## 15              0           0           0                 0          0
## 16              0           0           0                 0          0
## 17              0           0           0                 0          0
## 18              0           0           0                 0          0
## 19              0           0           0                 0          0
## 20              0           0           0                 0          0
## 21              1           0           0                 0          0
## 22              1           0           0                 0          0
## 23              0           0           0                 0          0
## 24              0           0           0                 0          0
## 25              0           0           0                 0          0
## 26              0           0           0                 0          0
## 27              0           0           0                 0          0
## 28              0           0           0                 0          0
## 29              0           0           0                 0          0
## 30              0           0           0                 0          0
##    Race_Entity Race_Eternal Race_Frost Race_Giant Race_God Race_Human
##          <dbl>        <dbl>      <dbl>      <dbl>    <dbl>      <dbl>
##  1           0            0          0          0        0          0
##  2           0            0          0          0        0          1
##  3           0            0          0          0        0          1
##  4           0            0          0          0        0          1
##  5           0            0          0          0        0          0
##  6           0            0          0          0        0          1
##  7           0            0          0          0        0          1
##  8           0            0          0          0        0          1
##  9           0            0          0          0        0          1
## 10           0            0          0          0        0          0
## 11           0            0          0          0        0          0
## 12           0            0          0          0        0          0
## 13           0            0          0          0        0          0
## 14           0            0          0          0        0          1
## 15           0            0          0          0        0          1
## 16           0            0          0          0        0          1
## 17           0            0          0          0        0          0
## 18           0            0          0          0        0          1
## 19           0            1          0          0        1          0
## 20           0            0          0          0        0          1
## 21           0            0          0          0        0          0
## 22           0            0          0          0        0          0
## 23           0            1          0          0        1          0
## 24           0            0          0          0        0          1
## 25           0            1          0          0        1          0
## 26           0            0          0          0        0          1
## 27           0            0          0          0        0          1
## 28           0            0          0          0        0          1
## 29           0            0          0          0        0          1
## 30           0            0          0          0        0          1
##    Race_HumanHumanKree Race_Inhuman Race_Kryptonian Race_Metahuman
##                  <dbl>        <dbl>           <dbl>          <dbl>
##  1                   0            0               0              0
##  2                   0            0               0              0
##  3                   0            0               0              0
##  4                   0            0               0              0
##  5                   0            0               0              0
##  6                   0            0               0              0
##  7                   0            0               0              0
##  8                   0            0               0              0
##  9                   0            0               0              0
## 10                   0            0               0              0
## 11                   0            0               0              0
## 12                   0            0               0              0
## 13                   0            0               0              0
## 14                   0            0               0              0
## 15                   0            0               0              0
## 16                   0            0               0              0
## 17                   0            0               0              0
## 18                   0            0               0              0
## 19                   0            0               0              0
## 20                   0            0               0              0
## 21                   0            0               0              0
## 22                   0            0               0              0
## 23                   0            0               0              0
## 24                   0            0               0              0
## 25                   0            0               0              0
## 26                   0            0               0              0
## 27                   0            0               0              0
## 28                   0            0               0              0
## 29                   0            0               0              0
## 30                   0            0               0              0
##    Race_Mutant Race_New Race_Radiation Race_Symbiote Race_Vampire
##          <dbl>    <dbl>          <dbl>         <dbl>        <dbl>
##  1           0        0              0             0            1
##  2           0        0              0             0            0
##  3           0        0              0             0            0
##  4           0        0              0             0            0
##  5           0        0              0             0            0
##  6           0        0              0             0            0
##  7           0        0              0             0            0
##  8           0        0              0             0            0
##  9           0        0              0             0            0
## 10           0        0              0             0            0
## 11           0        0              0             0            0
## 12           0        0              0             0            0
## 13           0        0              0             0            0
## 14           0        0              0             0            0
## 15           0        0              0             0            0
## 16           0        0              0             0            0
## 17           0        0              0             0            0
## 18           0        0              0             0            0
## 19           0        0              0             0            0
## 20           0        0              0             0            0
## 21           0        0              0             0            0
## 22           0        0              0             0            0
## 23           0        0              0             0            0
## 24           0        0              0             0            0
## 25           0        0              0             0            0
## 26           0        0              0             0            0
## 27           0        0              0             0            0
## 28           0        0              0             0            0
## 29           0        0              0             0            0
## 30           0        0              0             0            0
## # ... with 528 more rows

Now, you can very easily explore each category!!

Remember though, these need to be converted to factors or else they’ll be confused as integers!!

magrittr package can help do this easily

library(magrittr)

columns <- names(OOEdata)[12:length(names(OOEdata))]
OOEdata <- OOEdata %<>% mutate_at(columns, funs(factor(.)))


quant_df <- OOEdata %>%
  dplyr::select(Intelligence:TotalPower, Race_Human, Race_Cyborg, Race_Mutant)

ggpairs(quant_df, progress = FALSE)

This figure is very messy to look at here in R Markdown - but if you eliminate a few of the variables, or use R Studio’s zoom button in the plot window, you should be able to explore this!

14.1.3 The models

14.1.3.1 CART

Let’s start by building ourselves a single tree. Feel free to try this with different variables.

14.1.3.1.1 Classification example
## Let's remove 'neutral' characters from our dataset so we can look at if we can predict
##   if they are good or evil. We'll also select just a few columns to work with for
##   simplicity's sake

modeldat <- OOEdata %>% 
  dplyr::filter(Alignment!='neutral') %>%
  dplyr::select(Intelligence:TotalPower, Race_Animal, Race_Human, Race_Demon, Race_God,
                Race_Asgardian, Race_Mutant, Race_Android)

form <- as.formula(paste("Alignment ~ ",paste(names(modeldat),collapse='+')))
form <- update(form, . ~ . -Alignment) 

## Check out help(rpart.control) for a list of the parameters that you can change!
## Try playing with these to get an idea of how they impact a single tree

## To run a regression model, change method to 'anova'
treemod <- rpart(form, method='class',data=modeldat,cp=0.001,maxdepth=6)

rpart.plot(treemod,cex=0.7)

14.1.3.2 GBM.step

We’re going to ‘skip’ a step here to a degree and use the gbm.step algorithm, an extension to gbm (generalized boosted regression modelling/ boosted regression trees). gbm.step will calculate the optimal tree to use by way of cross validation, rather than just building a pile of trees and letting the user decide.

We’ll just use the model we had above (classification). Unfortunately, gbm.step uses column index values, rather than names!

gbm.step is not particularly well oriented towards multiple class target / dependent variables gbm.step needs to use a data.frame (can’t use a tibble), and in the target MUST be a factor gbm.step ALSO needs the TARGET variable to be in the form 0/1

modeldat <- modeldat %>% 
  mutate(Alignment_targ = case_when(Alignment == 'good' ~ 0, Alignment == 'bad' ~ 1 )) %>%
  mutate_at(c('Alignment_targ','Gender','Creator'), funs(factor(.)))

modeldf <- data.frame(modeldat)
modeldf$Alignment_targ <- as.numeric(as.character(modeldf$Alignment_targ))

gbmModel <- gbm.step(data=modeldf,gbm.=c(1:6,8:17),gbm.y=18,family="bernoulli",
                     tree.complexity = 10,learning.rate=0.001,bag.fraction=0.8,max.trees = 8000)
## 
##  
##  GBM STEP - version 2.9 
##  
## Performing cross-validation optimisation of a boosted regression tree model 
## for Alignment_targ and using a family of bernoulli 
## Using 531 observations and 16 predictors 
## creating 10 initial models of 50 trees 
## 
##  folds are stratified by prevalence 
## total mean deviance =  1.2623 
## tolerance is fixed at  0.0013 
## ntrees resid. dev. 
## 50    1.2562 
## now adding trees... 
## 100   1.2507 
## 150   1.2457 
## 200   1.2416 
## 250   1.2382 
## 300   1.235 
## 350   1.2322 
## 400   1.2298 
## 450   1.2277 
## 500   1.2256 
## 550   1.224 
## 600   1.2226 
## 650   1.2214 
## 700   1.2207 
## 750   1.2201 
## 800   1.2194 
## 850   1.219 
## 900   1.2183 
## 950   1.2184 
## 1000   1.2182 
## 1050   1.2178 
## 1100   1.2179 
## 1150   1.2179 
## 1200   1.218 
## 1250   1.2183 
## 1300   1.2187 
## 1350   1.2192 
## 1400   1.2192 
## 1450   1.2198 
## 1500   1.2204 
## 1550   1.221
## fitting final gbm model with a fixed number of 1050 trees for Alignment_targ

## 
## mean total deviance = 1.262 
## mean residual deviance = 1.049 
##  
## estimated cv deviance = 1.218 ; se = 0.025 
##  
## training data correlation = 0.61 
## cv correlation =  0.223 ; se = 0.059 
##  
## training data AUC score = 0.856 
## cv AUC score = 0.637 ; se = 0.036 
##  
## elapsed time -  0.3 minutes

To get a summary of the output that shows the variable importance rankings:

summary(gbmModel)

##                           var    rel.inf
## Intelligence     Intelligence 20.5992067
## Speed                   Speed 11.9336893
## TotalPower         TotalPower 11.0557840
## Durability         Durability 10.9474037
## Gender                 Gender  9.8085259
## Power                   Power  9.1455746
## Combat                 Combat  8.9856860
## Strength             Strength  7.0933922
## Creator               Creator  6.5330009
## Race_Human         Race_Human  2.6714001
## Race_Mutant       Race_Mutant  1.0352206
## Race_God             Race_God  0.1911162
## Race_Animal       Race_Animal  0.0000000
## Race_Demon         Race_Demon  0.0000000
## Race_Asgardian Race_Asgardian  0.0000000
## Race_Android     Race_Android  0.0000000

You can also plot the partial dependence plots, and then take it further to explore interactions by plotting the perspective plots

gbm.plot(gbmModel,n.plots=5,write.title=F)

## the gbm perspective plot for intelligence and speed versus the fitted values(z)
gbm.perspec(gbmModel,1,3,z.range=c(0.15,0.6),theta=205)
## maximum value = 0.54

14.1.3.3 random forest

Random forest is much more flexible than gbm.step in some respects - for example, you aren’t being forced to use column indices!

This model can be set up much like the CART model from before.

modeldat <- OOEdata %>% 
  dplyr::filter(Alignment!='neutral') %>%
  dplyr::select(Intelligence:TotalPower, Race_Animal, Race_Human, Race_Demon, Race_God,
                Race_Asgardian, Race_Mutant, Race_Android)

## Have to make sure these things are formatted as factors
modeldat <- modeldat %>% 
  mutate(Alignment_targ = case_when(Alignment == 'good' ~ 0, Alignment == 'bad' ~ 1 )) %>%
  mutate_at(c('Alignment_targ','Gender','Creator'), funs(factor(.)))


form <- as.formula(paste("Alignment_targ ~ ",paste(names(modeldat),collapse='+')))
form <- update(form, . ~ . -Alignment) 
form <- update(form, . ~ . -Alignment_targ) 

rf.model <- randomForest(data=data.frame(modeldat),form,mtry=3,maxnodes=10)
importance(rf.model)
##                MeanDecreaseGini
## Intelligence         6.05098679
## Strength             3.03554856
## Speed                2.84587013
## Durability           3.73971684
## Power                2.32266672
## Combat               2.47631381
## Gender               3.72602957
## Creator              1.95849028
## TotalPower           3.41079472
## Race_Animal          0.08300021
## Race_Human           0.64630039
## Race_Demon           0.25325414
## Race_God             0.68554982
## Race_Asgardian       0.06994368
## Race_Mutant          0.43814728
## Race_Android         0.43067807
partialPlot(rf.model,data.frame(modeldat),Intelligence)

partialPlot(rf.model,data.frame(modeldat),Gender)

The partial dependence plot will focus on the FIRST class. So if you’re using 0/1, then 0 will be focused on in the interpretation of the plots. This can be changed with the ‘which.class’ argument in partialPlot

14.1.4 Regression problem

Here, we use gbm.step to run the regression problem. If you want to run it in random forest, give it a try! Ask your instructor for some help, or use Google! There are a lot of resources available.

modeldat <- OOEdata %>% 
  dplyr::filter(Alignment!='neutral') %>%
  dplyr::select(Intelligence:TotalPower, Race_Animal, Race_Human, Race_Demon, Race_God,
                Race_Asgardian, Race_Mutant, Race_Android)

modeldat <- modeldat %>% 
  mutate_at(c('Alignment','Gender','Creator'), funs(factor(.)))

modeldf <- data.frame(modeldat)

## use names(modeldf) to find indices

gbmModel <- gbm.step(data=modeldf,gbm.x=c(1,6:9,11:17),gbm.y=10,family="laplace",
                     tree.complexity = 20,learning.rate=0.001,bag.fraction=0.5,max.trees = 8000)
## 
##  
##  GBM STEP - version 2.9 
##  
## Performing cross-validation optimisation of a boosted regression tree model 
## for TotalPower and using a family of laplace 
## Using 531 observations and 12 predictors 
## creating 10 initial models of 50 trees 
## 
##  folds are unstratified 
## total mean deviance =  82.8383 
## tolerance is fixed at  0.0828 
## ntrees resid. dev. 
## 50    81.9783 
## now adding trees... 
## 100   80.8959 
## 150   79.8729 
## 200   78.8618 
## 250   77.9282 
## 300   77.0672 
## 350   76.2683 
## 400   75.5307 
## 450   74.8541 
## 500   74.2275 
## 550   73.6477 
## 600   73.131 
## 650   72.6274 
## 700   72.1666 
## 750   71.7407 
## 800   71.3443 
## 850   70.9434 
## 900   70.5647 
## 950   70.2005 
## 1000   69.8613 
## 1050   69.5311 
## 1100   69.2117 
## 1150   68.917 
## 1200   68.6389 
## 1250   68.3814 
## 1300   68.1493 
## 1350   67.9313 
## 1400   67.7368 
## 1450   67.551 
## 1500   67.3831 
## 1550   67.2342 
## 1600   67.0918 
## 1650   66.92 
## 1700   66.7858 
## 1750   66.651 
## 1800   66.5458 
## 1850   66.4298 
## 1900   66.3027 
## 1950   66.1899 
## 2000   66.1056 
## 2050   66.0139 
## 2100   65.9382 
## 2150   65.8776 
## 2200   65.813 
## 2250   65.7649 
## 2300   65.7038 
## 2350   65.6419 
## 2400   65.6039 
## 2450   65.5705 
## 2500   65.527 
## 2550   65.4907 
## 2600   65.4539 
## 2650   65.4461 
## 2700   65.4212 
## 2750   65.4105 
## 2800   65.3828 
## 2850   65.3582 
## 2900   65.3596 
## 2950   65.3554 
## 3000   65.349 
## 3050   65.3398 
## 3100   65.3305 
## 3150   65.3092 
## 3200   65.2969 
## 3250   65.2941 
## 3300   65.2755 
## 3350   65.2698 
## 3400   65.2643 
## 3450   65.2594 
## 3500   65.2578 
## 3550   65.2476 
## 3600   65.249 
## 3650   65.2401 
## 3700   65.2352 
## 3750   65.239
## fitting final gbm model with a fixed number of 3700 trees for TotalPower

## 
## mean total deviance = 82.838 
## mean residual deviance = 52.955 
##  
## estimated cv deviance = 65.235 ; se = 2.25 
##  
## training data correlation = 0.734 
## cv correlation =  0.598 ; se = 0.032 
##  
## elapsed time -  0.89 minutes
summary(gbmModel)

##                           var    rel.inf
## Combat                 Combat 42.0959831
## Intelligence     Intelligence 23.6436979
## Race_Human         Race_Human  9.2883294
## Creator               Creator  9.0825277
## Gender                 Gender  7.5298538
## Alignment           Alignment  4.2547307
## Race_Mutant       Race_Mutant  3.3072987
## Race_God             Race_God  0.7975787
## Race_Animal       Race_Animal  0.0000000
## Race_Demon         Race_Demon  0.0000000
## Race_Asgardian Race_Asgardian  0.0000000
## Race_Android     Race_Android  0.0000000
gbm.plot(gbmModel,n.plots=5,write.title=F)

## But we are interested in gender, so let's plot that.

gbm.plot(gbmModel,variable.no=4,plot.layout=c(1,1))

Interestingly, we see that there does seem to be some sort of bias towards more powerful male characters. However, bearing in mind that the predictive performance (as determined by the summary statistics) is not very high.